* [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions
@ 2021-02-21 18:56 Chang S. Bae
  2021-02-21 18:56 ` [PATCH v4 01/22] x86/fpu/xstate: Modify the initialization helper to handle both static and dynamic buffers Chang S. Bae
                   ` (21 more replies)
  0 siblings, 22 replies; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

Intel Advanced Matrix Extensions (AMX)[1][2] will be shipping on servers
soon.  AMX consists of configurable TMM "TILE" registers plus new
accelerator instructions that operate on them.  TMUL (Tile matrix MULtiply)
is the first accelerator instruction set to use the new registers, and we
anticipate additional instructions in the future.

Neither AMX state nor TMUL instructions depend on AVX.  However, AMX and
AVX do share common challenges.  The TMM registers are 8KB today, and
architecturally as large as 64KB, which merits updates to hardware and
software state management.

Further, both technologies run faster when they are not simultaneously
running on SMT siblings, and both technologies' use of power and bandwidth
impacts the power and performance available to neighboring cores.  (This
impact has measurably improved in recent hardware.)

If the existing kernel approach for managing XSAVE state were employed to
handle AMX, 8KB would be added to every task's buffer, yet possibly rarely
be used.  So Linux support is optimized by using a new XSAVE feature:
eXtended Feature Disabling (XFD).  The kernel arms XFD to deliver a #NM
exception upon a task's first access to TILE state.  The kernel exception
handler installs the appropriate XSAVE context switch buffer, and the task
behaves as if the kernel had done that for all tasks.  Using XFD, AMX
space is allocated only when needed, eliminating the memory waste for
unused state components.
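
Conceptually, the first-use path works like the sketch below.  This is
illustrative only, not the code in this series; the helpers
xfd_event_pending(), alloc_xstate_buffer(), and disarm_xfd() are
hypothetical names used for exposition.

	/* Sketch of lazy TILE-state allocation via XFD and #NM: */
	DEFINE_IDTENTRY(exc_device_not_available)
	{
		if (xfd_event_pending()) {
			/* First touch of AMX TILE state by this task. */
			if (alloc_xstate_buffer(&current->thread.fpu,
						xfeatures_mask_user_dynamic))
				force_sig(SIGSEGV);	/* no memory for the state */
			else
				disarm_xfd();		/* permit AMX from now on */
			return;
		}
		/* ... existing #NM handling (math emulation, etc.) ... */
	}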

This series requires the new minimum sigaltstack support [3] and is based
on the mainline kernel. The series is composed of three parts:
* Patch 01-15: Foundation to support dynamic user state management
* Patch 16-20: AMX enablement, including unit tests
* Patch 21-22: Signal handling optimization and new boot-parameters

Thanks to Len Brown and Dave Hansen for help with the cover letter.

Changes from v3 [6]:
* Updated some commit messages and code comments. (Borislav Petkov)
* Added and removed some helpers. (Borislav Petkov)
* Revised the buffer allocation function. (Borislav Petkov)
* Simplified buffer accesses. (Borislav Petkov)
* Re-organized some code changes to be more reviewable. (PATCH9/10)
* Reverted unnecessary changes. (PATCH4)
* Fixed a typo in the documentation. (Randy Dunlap)

Changes from v2 [5]:
* Removed the patch for the tile data inheritance. Also, updated the
  selftest patch. (Andy Lutomirski)
* Changed the kernel to be tainted when any unknown state is enabled. (Andy
  Lutomirski)
* Changed to use the XFD feature only when the compacted format is in use.
* Improved the test code.
* Simplified the cmdline handling.
* Removed 'task->fpu' in changelogs. (Borislav Petkov)
* Updated variable names / comments / changelogs for clarification.

Changes from v1 [4]:
* Added vmalloc() error tracing (Dave Hansen, PeterZ, and Andy Lutomirski)
* Inlined the #NM handling code (Andy Lutomirski)
* Made signal handling optimization revertible
* Revised the new parameter handling code (Andy Lutomirski and Dave Hansen)
* Rebased on the upstream kernel

[1]: Intel Architecture Instruction Set Extension Programming Reference
    February 2021, https://software.intel.com/content/dam/develop/external/us/en/documents-tps/architecture-instruction-set-extensions-programming-reference.pdf
[2]: https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-matrix-extensions-intel-amx-instructions.html
[3]: https://lore.kernel.org/lkml/20210203172242.29644-1-chang.seok.bae@intel.com/
[4]: https://lore.kernel.org/lkml/20201001203913.9125-1-chang.seok.bae@intel.com/
[5]: https://lore.kernel.org/lkml/20201119233257.2939-1-chang.seok.bae@intel.com/
[6]: https://lore.kernel.org/lkml/20201223155717.19556-1-chang.seok.bae@intel.com/

Chang S. Bae (22):
  x86/fpu/xstate: Modify the initialization helper to handle both static
    and dynamic buffers
  x86/fpu/xstate: Modify state copy helpers to handle both static and
    dynamic buffers
  x86/fpu/xstate: Modify address finders to handle both static and
    dynamic buffers
  x86/fpu/xstate: Modify the context restore helper to handle both
    static and dynamic buffers
  x86/fpu/xstate: Add a new variable to indicate dynamic user states
  x86/fpu/xstate: Add new variables to indicate dynamic xstate buffer
    size
  x86/fpu/xstate: Calculate and remember dynamic xstate buffer sizes
  x86/fpu/xstate: Convert the struct fpu 'state' field to a pointer
  x86/fpu/xstate: Introduce helpers to manage the xstate buffer
    dynamically
  x86/fpu/xstate: Define the scope of the initial xstate data
  x86/fpu/xstate: Update the xstate save function to support dynamic
    states
  x86/fpu/xstate: Update the xstate buffer address finder to support
    dynamic states
  x86/fpu/xstate: Update the xstate context copy function to support
    dynamic states
  x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic
    user state
  x86/fpu/xstate: Support ptracer-induced xstate buffer expansion
  x86/fpu/xstate: Extend the table to map state components with features
  x86/cpufeatures/amx: Enumerate Advanced Matrix Extension (AMX) feature
    bits
  x86/fpu/amx: Define AMX state components and have it used for
    boot-time checks
  x86/fpu/amx: Enable the AMX feature in 64-bit mode
  selftest/x86/amx: Include test cases for the AMX state management
  x86/fpu/xstate: Support dynamic user state in the signal handling path
  x86/fpu/xstate: Introduce boot-parameters to control state component
    support

 .../admin-guide/kernel-parameters.txt         |  15 +
 arch/x86/include/asm/cpufeatures.h            |   4 +
 arch/x86/include/asm/fpu/internal.h           |  77 +-
 arch/x86/include/asm/fpu/types.h              |  67 +-
 arch/x86/include/asm/fpu/xstate.h             |  60 +-
 arch/x86/include/asm/msr-index.h              |   2 +
 arch/x86/include/asm/pgtable.h                |   2 +-
 arch/x86/include/asm/processor.h              |  10 +-
 arch/x86/include/asm/trace/fpu.h              |   9 +-
 arch/x86/kernel/cpu/common.c                  |   2 +-
 arch/x86/kernel/cpu/cpuid-deps.c              |   4 +
 arch/x86/kernel/fpu/core.c                    |  67 +-
 arch/x86/kernel/fpu/init.c                    |  99 ++-
 arch/x86/kernel/fpu/regset.c                  |  63 +-
 arch/x86/kernel/fpu/signal.c                  |  61 +-
 arch/x86/kernel/fpu/xstate.c                  | 580 +++++++++++---
 arch/x86/kernel/process.c                     |  12 +
 arch/x86/kernel/process_32.c                  |   2 +-
 arch/x86/kernel/process_64.c                  |   2 +-
 arch/x86/kernel/traps.c                       |  40 +
 arch/x86/kvm/x86.c                            |  46 +-
 arch/x86/math-emu/fpu_aux.c                   |   2 +-
 arch/x86/math-emu/fpu_entry.c                 |   4 +-
 arch/x86/math-emu/fpu_system.h                |   2 +-
 arch/x86/mm/pkeys.c                           |   2 +-
 tools/testing/selftests/x86/Makefile          |   2 +-
 tools/testing/selftests/x86/amx.c             | 743 ++++++++++++++++++
 27 files changed, 1717 insertions(+), 262 deletions(-)
 create mode 100644 tools/testing/selftests/x86/amx.c


base-commit: f40ddce88593482919761f74910f42f4b84c004b
-- 
2.17.1



* [PATCH v4 01/22] x86/fpu/xstate: Modify the initialization helper to handle both static and dynamic buffers
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-03-10 13:40   ` Borislav Petkov
  2021-02-21 18:56 ` [PATCH v4 02/22] x86/fpu/xstate: Modify state copy helpers " Chang S. Bae
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae, kvm

Have the function initializing the xstate buffer take a struct fpu *
pointer in preparation for dynamic state buffer support.

init_fpstate is a special case, which is indicated by a null pointer
parameter to fpstate_init().

Also, fpstate_init_xstate() now accepts the state component bitmap to
configure XCOMP_BV for the compacted format.
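
For illustration, the new calling convention, as used in the hunks below:

	fpstate_init(fpu);	/* initialize a task's xstate buffer */
	fpstate_init(NULL);	/* initialize init_fpstate */

	/* XCOMP_BV is now set from the caller-provided component mask: */
	fpstate_init_xstate(&state->xsave, xfeatures_mask_all);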

No functional change.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
---
Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Updated the function comment to use kernel-doc style. (Borislav Petkov)

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
---
 arch/x86/include/asm/fpu/internal.h |  6 +++---
 arch/x86/kernel/fpu/core.c          | 16 +++++++++++++---
 arch/x86/kernel/fpu/init.c          |  2 +-
 arch/x86/kernel/fpu/regset.c        |  2 +-
 arch/x86/kernel/fpu/xstate.c        |  3 +--
 arch/x86/kvm/x86.c                  |  2 +-
 6 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 8d33ad80704f..d81d8c407dc0 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -80,20 +80,20 @@ static __always_inline __pure bool use_fxsr(void)
 
 extern union fpregs_state init_fpstate;
 
-extern void fpstate_init(union fpregs_state *state);
+extern void fpstate_init(struct fpu *fpu);
 #ifdef CONFIG_MATH_EMULATION
 extern void fpstate_init_soft(struct swregs_state *soft);
 #else
 static inline void fpstate_init_soft(struct swregs_state *soft) {}
 #endif
 
-static inline void fpstate_init_xstate(struct xregs_state *xsave)
+static inline void fpstate_init_xstate(struct xregs_state *xsave, u64 xcomp_mask)
 {
 	/*
 	 * XRSTORS requires these bits set in xcomp_bv, or it will
 	 * trigger #GP:
 	 */
-	xsave->header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT | xfeatures_mask_all;
+	xsave->header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT | xcomp_mask;
 }
 
 static inline void fpstate_init_fxstate(struct fxregs_state *fx)
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 571220ac8bea..d43661d309ab 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -192,8 +192,18 @@ static inline void fpstate_init_fstate(struct fregs_state *fp)
 	fp->fos = 0xffff0000u;
 }
 
-void fpstate_init(union fpregs_state *state)
+/*
+ * @fpu: If NULL, use init_fpstate
+ */
+void fpstate_init(struct fpu *fpu)
 {
+	union fpregs_state *state;
+
+	if (fpu)
+		state = &fpu->state;
+	else
+		state = &init_fpstate;
+
 	if (!static_cpu_has(X86_FEATURE_FPU)) {
 		fpstate_init_soft(&state->soft);
 		return;
@@ -202,7 +212,7 @@ void fpstate_init(union fpregs_state *state)
 	memset(state, 0, fpu_kernel_xstate_size);
 
 	if (static_cpu_has(X86_FEATURE_XSAVES))
-		fpstate_init_xstate(&state->xsave);
+		fpstate_init_xstate(&state->xsave, xfeatures_mask_all);
 	if (static_cpu_has(X86_FEATURE_FXSR))
 		fpstate_init_fxstate(&state->fxsave);
 	else
@@ -262,7 +272,7 @@ static void fpu__initialize(struct fpu *fpu)
 	WARN_ON_FPU(fpu != &current->thread.fpu);
 
 	set_thread_flag(TIF_NEED_FPU_LOAD);
-	fpstate_init(&fpu->state);
+	fpstate_init(fpu);
 	trace_x86_fpu_init_state(fpu);
 }
 
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 701f196d7c68..74e03e3bc20f 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -124,7 +124,7 @@ static void __init fpu__init_system_generic(void)
 	 * Set up the legacy init FPU context. (xstate init might overwrite this
 	 * with a more modern format, if the CPU supports it.)
 	 */
-	fpstate_init(&init_fpstate);
+	fpstate_init(NULL);
 
 	fpu__init_system_mxcsr();
 }
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index c413756ba89f..4c4d9059ff36 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -144,7 +144,7 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
 	 * In case of failure, mark all states as init:
 	 */
 	if (ret)
-		fpstate_init(&fpu->state);
+		fpstate_init(fpu);
 
 	return ret;
 }
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 5d8047441a0a..1a3e5effe0fa 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -457,8 +457,7 @@ static void __init setup_init_fpu_buf(void)
 	print_xstate_features();
 
 	if (boot_cpu_has(X86_FEATURE_XSAVES))
-		init_fpstate.xsave.header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT |
-						     xfeatures_mask_all;
+		fpstate_init_xstate(&init_fpstate.xsave, xfeatures_mask_all);
 
 	/*
 	 * Init all the features state with header.xfeatures being 0x0
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1b404e4d7dd8..b933d005d45e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9911,7 +9911,7 @@ static void fx_init(struct kvm_vcpu *vcpu)
 	if (!vcpu->arch.guest_fpu)
 		return;
 
-	fpstate_init(&vcpu->arch.guest_fpu->state);
+	fpstate_init(vcpu->arch.guest_fpu);
 	if (boot_cpu_has(X86_FEATURE_XSAVES))
 		vcpu->arch.guest_fpu->state.xsave.header.xcomp_bv =
 			host_xcr0 | XSTATE_COMPACTION_ENABLED;
-- 
2.17.1



* [PATCH v4 02/22] x86/fpu/xstate: Modify state copy helpers to handle both static and dynamic buffers
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
  2021-02-21 18:56 ` [PATCH v4 01/22] x86/fpu/xstate: Modify the initialization helper to handle both static and dynamic buffers Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-02-21 18:56 ` [PATCH v4 03/22] x86/fpu/xstate: Modify address finders " Chang S. Bae
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

Have all the functions copying xstate take a struct fpu * pointer in
preparation for dynamic state buffer support.
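
For reference, the converted prototypes, as they appear in the hunk below:

	void copy_xstate_to_kernel(struct membuf to, struct fpu *fpu);
	int copy_kernel_to_xstate(struct fpu *fpu, const void *kbuf);
	int copy_user_to_xstate(struct fpu *fpu, const void __user *ubuf);
	void copy_supervisor_to_kernel(struct fpu *fpu);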

No functional change.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v3:
* Updated the changelog. (Borislav Petkov)

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
---
 arch/x86/include/asm/fpu/xstate.h |  8 ++++----
 arch/x86/kernel/fpu/regset.c      |  6 +++---
 arch/x86/kernel/fpu/signal.c      | 16 +++++++---------
 arch/x86/kernel/fpu/xstate.c      | 19 +++++++++++++++----
 4 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 47a92232d595..e0f1b22f53ce 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -105,10 +105,10 @@ const void *get_xsave_field_ptr(int xfeature_nr);
 int using_compacted_format(void);
 int xfeature_size(int xfeature_nr);
 struct membuf;
-void copy_xstate_to_kernel(struct membuf to, struct xregs_state *xsave);
-int copy_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf);
-int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf);
-void copy_supervisor_to_kernel(struct xregs_state *xsave);
+void copy_xstate_to_kernel(struct membuf to, struct fpu *fpu);
+int copy_kernel_to_xstate(struct fpu *fpu, const void *kbuf);
+int copy_user_to_xstate(struct fpu *fpu, const void __user *ubuf);
+void copy_supervisor_to_kernel(struct fpu *fpu);
 void copy_dynamic_supervisor_to_kernel(struct xregs_state *xstate, u64 mask);
 void copy_kernel_to_dynamic_supervisor(struct xregs_state *xstate, u64 mask);
 
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 4c4d9059ff36..5e13e58d11d4 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -85,7 +85,7 @@ int xstateregs_get(struct task_struct *target, const struct user_regset *regset,
 	fpu__prepare_read(fpu);
 
 	if (using_compacted_format()) {
-		copy_xstate_to_kernel(to, xsave);
+		copy_xstate_to_kernel(to, fpu);
 		return 0;
 	} else {
 		fpstate_sanitize_xstate(fpu);
@@ -126,9 +126,9 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
 
 	if (using_compacted_format()) {
 		if (kbuf)
-			ret = copy_kernel_to_xstate(xsave, kbuf);
+			ret = copy_kernel_to_xstate(fpu, kbuf);
 		else
-			ret = copy_user_to_xstate(xsave, ubuf);
+			ret = copy_user_to_xstate(fpu, ubuf);
 	} else {
 		ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, xsave, 0, -1);
 		if (!ret)
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index a4ec65317a7f..0d6deb75c507 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -212,11 +212,11 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
 }
 
 static inline void
-sanitize_restored_user_xstate(union fpregs_state *state,
+sanitize_restored_user_xstate(struct fpu *fpu,
 			      struct user_i387_ia32_struct *ia32_env,
 			      u64 user_xfeatures, int fx_only)
 {
-	struct xregs_state *xsave = &state->xsave;
+	struct xregs_state *xsave = &fpu->state.xsave;
 	struct xstate_header *header = &xsave->header;
 
 	if (use_xsave()) {
@@ -253,7 +253,7 @@ sanitize_restored_user_xstate(union fpregs_state *state,
 		xsave->i387.mxcsr &= mxcsr_feature_mask;
 
 		if (ia32_env)
-			convert_to_fxsr(&state->fxsave, ia32_env);
+			convert_to_fxsr(&fpu->state.fxsave, ia32_env);
 	}
 }
 
@@ -396,7 +396,7 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 		 * current supervisor states first and invalidate the FPU regs.
 		 */
 		if (xfeatures_mask_supervisor())
-			copy_supervisor_to_kernel(&fpu->state.xsave);
+			copy_supervisor_to_kernel(fpu);
 		set_thread_flag(TIF_NEED_FPU_LOAD);
 	}
 	__fpu_invalidate_fpregs_state(fpu);
@@ -406,7 +406,7 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 		u64 init_bv = xfeatures_mask_user() & ~user_xfeatures;
 
 		if (using_compacted_format()) {
-			ret = copy_user_to_xstate(&fpu->state.xsave, buf_fx);
+			ret = copy_user_to_xstate(fpu, buf_fx);
 		} else {
 			ret = __copy_from_user(&fpu->state.xsave, buf_fx, state_size);
 
@@ -416,8 +416,7 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 		if (ret)
 			goto err_out;
 
-		sanitize_restored_user_xstate(&fpu->state, envp, user_xfeatures,
-					      fx_only);
+		sanitize_restored_user_xstate(fpu, envp, user_xfeatures, fx_only);
 
 		fpregs_lock();
 		if (unlikely(init_bv))
@@ -437,8 +436,7 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 			goto err_out;
 		}
 
-		sanitize_restored_user_xstate(&fpu->state, envp, user_xfeatures,
-					      fx_only);
+		sanitize_restored_user_xstate(fpu, envp, user_xfeatures, fx_only);
 
 		fpregs_lock();
 		if (use_xsave()) {
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 1a3e5effe0fa..6156dad0feb6 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1071,14 +1071,17 @@ static void copy_part(struct membuf *to, unsigned *last, unsigned offset,
  * It supports partial copy but pos always starts from zero. This is called
  * from xstateregs_get() and there we check the CPU has XSAVES.
  */
-void copy_xstate_to_kernel(struct membuf to, struct xregs_state *xsave)
+void copy_xstate_to_kernel(struct membuf to, struct fpu *fpu)
 {
 	struct xstate_header header;
 	const unsigned off_mxcsr = offsetof(struct fxregs_state, mxcsr);
+	struct xregs_state *xsave;
 	unsigned size = to.left;
 	unsigned last = 0;
 	int i;
 
+	xsave = &fpu->state.xsave;
+
 	/*
 	 * The destination is a ptrace buffer; we put in only user xstates:
 	 */
@@ -1127,8 +1130,9 @@ void copy_xstate_to_kernel(struct membuf to, struct xregs_state *xsave)
  * Convert from a ptrace standard-format kernel buffer to kernel XSAVES format
  * and copy to the target thread. This is called from xstateregs_set().
  */
-int copy_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf)
+int copy_kernel_to_xstate(struct fpu *fpu, const void *kbuf)
 {
+	struct xregs_state *xsave;
 	unsigned int offset, size;
 	int i;
 	struct xstate_header hdr;
@@ -1141,6 +1145,8 @@ int copy_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf)
 	if (validate_user_xstate_header(&hdr))
 		return -EINVAL;
 
+	xsave = &fpu->state.xsave;
+
 	for (i = 0; i < XFEATURE_MAX; i++) {
 		u64 mask = ((u64)1 << i);
 
@@ -1180,8 +1186,9 @@ int copy_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf)
  * xstateregs_set(), as well as potentially from the sigreturn() and
  * rt_sigreturn() system calls.
  */
-int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)
+int copy_user_to_xstate(struct fpu *fpu, const void __user *ubuf)
 {
+	struct xregs_state *xsave;
 	unsigned int offset, size;
 	int i;
 	struct xstate_header hdr;
@@ -1195,6 +1202,8 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)
 	if (validate_user_xstate_header(&hdr))
 		return -EINVAL;
 
+	xsave = &fpu->state.xsave;
+
 	for (i = 0; i < XFEATURE_MAX; i++) {
 		u64 mask = ((u64)1 << i);
 
@@ -1235,9 +1244,10 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)
  * old states, and is intended to be used only in __fpu__restore_sig(), where
  * user states are restored from the user buffer.
  */
-void copy_supervisor_to_kernel(struct xregs_state *xstate)
+void copy_supervisor_to_kernel(struct fpu *fpu)
 {
 	struct xstate_header *header;
+	struct xregs_state *xstate;
 	u64 max_bit, min_bit;
 	u32 lmask, hmask;
 	int err, i;
@@ -1251,6 +1261,7 @@ void copy_supervisor_to_kernel(struct xregs_state *xstate)
 	max_bit = __fls(xfeatures_mask_supervisor());
 	min_bit = __ffs(xfeatures_mask_supervisor());
 
+	xstate = &fpu->state.xsave;
 	lmask = xfeatures_mask_supervisor();
 	hmask = xfeatures_mask_supervisor() >> 32;
 	XSTATE_OP(XSAVES, xstate, lmask, hmask, err);
-- 
2.17.1



* [PATCH v4 03/22] x86/fpu/xstate: Modify address finders to handle both static and dynamic buffers
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
  2021-02-21 18:56 ` [PATCH v4 01/22] x86/fpu/xstate: Modify the initialization helper to handle both static and dynamic buffers Chang S. Bae
  2021-02-21 18:56 ` [PATCH v4 02/22] x86/fpu/xstate: Modify state copy helpers " Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-02-21 18:56 ` [PATCH v4 04/22] x86/fpu/xstate: Modify the context restore helper " Chang S. Bae
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae, kvm

Have all the functions that find an xstate address take a struct fpu *
pointer in preparation for dynamic state buffer support.

init_fpstate is a special case, which is indicated by a null pointer
parameter to get_xsave_addr() and __raw_xsave_addr().
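
For example, call sites now look like this (taken from the hunks below):

	/* A task's xstate buffer: */
	pk = get_xsave_addr(&current->thread.fpu, XFEATURE_PKRU);

	/* init_fpstate: */
	pk = get_xsave_addr(NULL, XFEATURE_PKRU);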

No functional change.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
---
Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Updated the function comment to use kernel-doc style. (Borislav Petkov)

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)

Changes from v1:
* Rebased on the upstream kernel (5.10)
---
 arch/x86/include/asm/fpu/internal.h |  2 +-
 arch/x86/include/asm/fpu/xstate.h   |  2 +-
 arch/x86/include/asm/pgtable.h      |  2 +-
 arch/x86/kernel/cpu/common.c        |  2 +-
 arch/x86/kernel/fpu/xstate.c        | 38 +++++++++++++++++++++--------
 arch/x86/kvm/x86.c                  | 10 +++-----
 arch/x86/mm/pkeys.c                 |  2 +-
 7 files changed, 37 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index d81d8c407dc0..0153c4d4ca77 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -579,7 +579,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
 	 * return to userland e.g. for a copy_to_user() operation.
 	 */
 	if (current->mm) {
-		pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
+		pk = get_xsave_addr(new_fpu, XFEATURE_PKRU);
 		if (pk)
 			pkru_val = pk->pkru;
 	}
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index e0f1b22f53ce..24bf8d3f559a 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -100,7 +100,7 @@ extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
 extern void __init update_regset_xstate_info(unsigned int size,
 					     u64 xstate_mask);
 
-void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);
+void *get_xsave_addr(struct fpu *fpu, int xfeature_nr);
 const void *get_xsave_field_ptr(int xfeature_nr);
 int using_compacted_format(void);
 int xfeature_size(int xfeature_nr);
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..83268b41444f 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -141,7 +141,7 @@ static inline void write_pkru(u32 pkru)
 	if (!boot_cpu_has(X86_FEATURE_OSPKE))
 		return;
 
-	pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
+	pk = get_xsave_addr(&current->thread.fpu, XFEATURE_PKRU);
 
 	/*
 	 * The PKRU value in xstate needs to be in sync with the value that is
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 35ad8480c464..860b19db208b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -478,7 +478,7 @@ static __always_inline void setup_pku(struct cpuinfo_x86 *c)
 		return;
 
 	cr4_set_bits(X86_CR4_PKE);
-	pk = get_xsave_addr(&init_fpstate.xsave, XFEATURE_PKRU);
+	pk = get_xsave_addr(NULL, XFEATURE_PKRU);
 	if (pk)
 		pk->pkru = init_pkru_value;
 	/*
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 6156dad0feb6..5401a71dd15e 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -894,15 +894,24 @@ void fpu__resume_cpu(void)
  * Given an xstate feature nr, calculate where in the xsave
  * buffer the state is.  Callers should ensure that the buffer
  * is valid.
+ *
+ * @fpu: If NULL, use init_fpstate
  */
-static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
+static void *__raw_xsave_addr(struct fpu *fpu, int xfeature_nr)
 {
+	void *xsave;
+
 	if (!xfeature_enabled(xfeature_nr)) {
 		WARN_ON_FPU(1);
 		return NULL;
 	}
 
-	return (void *)xsave + xstate_comp_offsets[xfeature_nr];
+	if (fpu)
+		xsave = &fpu->state.xsave;
+	else
+		xsave = &init_fpstate.xsave;
+
+	return xsave + xstate_comp_offsets[xfeature_nr];
 }
 /*
  * Given the xsave area and a state inside, this function returns the
@@ -915,15 +924,18 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
  * this will return NULL.
  *
  * Inputs:
- *	xstate: the thread's storage area for all FPU data
+ *	fpu: the thread's FPU data to reference xstate buffer(s).
+ *	     (A null pointer parameter indicates init_fpstate.)
  *	xfeature_nr: state which is defined in xsave.h (e.g. XFEATURE_FP,
  *	XFEATURE_SSE, etc...)
  * Output:
  *	address of the state in the xsave area, or NULL if the
  *	field is not present in the xsave buffer.
  */
-void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
+void *get_xsave_addr(struct fpu *fpu, int xfeature_nr)
 {
+	struct xregs_state *xsave;
+
 	/*
 	 * Do we even *have* xsave state?
 	 */
@@ -936,6 +948,12 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
 	 */
 	WARN_ONCE(!(xfeatures_mask_all & BIT_ULL(xfeature_nr)),
 		  "get of unsupported state");
+
+	if (fpu)
+		xsave = &fpu->state.xsave;
+	else
+		xsave = &init_fpstate.xsave;
+
 	/*
 	 * This assumes the last 'xsave*' instruction to
 	 * have requested that 'xfeature_nr' be saved.
@@ -950,7 +968,7 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
 	if (!(xsave->header.xfeatures & BIT_ULL(xfeature_nr)))
 		return NULL;
 
-	return __raw_xsave_addr(xsave, xfeature_nr);
+	return __raw_xsave_addr(fpu, xfeature_nr);
 }
 EXPORT_SYMBOL_GPL(get_xsave_addr);
 
@@ -981,7 +999,7 @@ const void *get_xsave_field_ptr(int xfeature_nr)
 	 */
 	fpu__save(fpu);
 
-	return get_xsave_addr(&fpu->state.xsave, xfeature_nr);
+	return get_xsave_addr(fpu, xfeature_nr);
 }
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -1116,7 +1134,7 @@ void copy_xstate_to_kernel(struct membuf to, struct fpu *fpu)
 		 * Copy only in-use xstates:
 		 */
 		if ((header.xfeatures >> i) & 1) {
-			void *src = __raw_xsave_addr(xsave, i);
+			void *src = __raw_xsave_addr(fpu, i);
 
 			copy_part(&to, &last, xstate_offsets[i],
 				  xstate_sizes[i], src);
@@ -1151,7 +1169,7 @@ int copy_kernel_to_xstate(struct fpu *fpu, const void *kbuf)
 		u64 mask = ((u64)1 << i);
 
 		if (hdr.xfeatures & mask) {
-			void *dst = __raw_xsave_addr(xsave, i);
+			void *dst = __raw_xsave_addr(fpu, i);
 
 			offset = xstate_offsets[i];
 			size = xstate_sizes[i];
@@ -1208,7 +1226,7 @@ int copy_user_to_xstate(struct fpu *fpu, const void __user *ubuf)
 		u64 mask = ((u64)1 << i);
 
 		if (hdr.xfeatures & mask) {
-			void *dst = __raw_xsave_addr(xsave, i);
+			void *dst = __raw_xsave_addr(fpu, i);
 
 			offset = xstate_offsets[i];
 			size = xstate_sizes[i];
@@ -1450,7 +1468,7 @@ void update_pasid(void)
 		 */
 		xsave = &fpu->state.xsave;
 		xsave->header.xfeatures |= XFEATURE_MASK_PASID;
-		ppasid_state = get_xsave_addr(xsave, XFEATURE_PASID);
+		ppasid_state = get_xsave_addr(fpu, XFEATURE_PASID);
 		/*
 		 * Since XFEATURE_MASK_PASID is set in xfeatures, ppasid_state
 		 * won't be NULL and no need to check its value.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b933d005d45e..cc3b604ddcd2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4461,7 +4461,7 @@ static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
 	while (valid) {
 		u64 xfeature_mask = valid & -valid;
 		int xfeature_nr = fls64(xfeature_mask) - 1;
-		void *src = get_xsave_addr(xsave, xfeature_nr);
+		void *src = get_xsave_addr(vcpu->arch.guest_fpu, xfeature_nr);
 
 		if (src) {
 			u32 size, offset, ecx, edx;
@@ -4504,7 +4504,7 @@ static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
 	while (valid) {
 		u64 xfeature_mask = valid & -valid;
 		int xfeature_nr = fls64(xfeature_mask) - 1;
-		void *dest = get_xsave_addr(xsave, xfeature_nr);
+		void *dest = get_xsave_addr(vcpu->arch.guest_fpu, xfeature_nr);
 
 		if (dest) {
 			u32 size, offset, ecx, edx;
@@ -10140,12 +10140,10 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 		 */
 		if (init_event)
 			kvm_put_guest_fpu(vcpu);
-		mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu->state.xsave,
-					XFEATURE_BNDREGS);
+		mpx_state_buffer = get_xsave_addr(vcpu->arch.guest_fpu, XFEATURE_BNDREGS);
 		if (mpx_state_buffer)
 			memset(mpx_state_buffer, 0, sizeof(struct mpx_bndreg_state));
-		mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu->state.xsave,
-					XFEATURE_BNDCSR);
+		mpx_state_buffer = get_xsave_addr(vcpu->arch.guest_fpu, XFEATURE_BNDCSR);
 		if (mpx_state_buffer)
 			memset(mpx_state_buffer, 0, sizeof(struct mpx_bndcsr));
 		if (init_event)
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 8873ed1438a9..772e8bc3d49d 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -177,7 +177,7 @@ static ssize_t init_pkru_write_file(struct file *file,
 		return -EINVAL;
 
 	WRITE_ONCE(init_pkru_value, new_init_pkru);
-	pk = get_xsave_addr(&init_fpstate.xsave, XFEATURE_PKRU);
+	pk = get_xsave_addr(NULL, XFEATURE_PKRU);
 	if (!pk)
 		return -EINVAL;
 	pk->pkru = new_init_pkru;
-- 
2.17.1



* [PATCH v4 04/22] x86/fpu/xstate: Modify the context restore helper to handle both static and dynamic buffers
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (2 preceding siblings ...)
  2021-02-21 18:56 ` [PATCH v4 03/22] x86/fpu/xstate: Modify address finders " Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-02-21 18:56 ` [PATCH v4 05/22] x86/fpu/xstate: Add a new variable to indicate dynamic user states Chang S. Bae
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae, kvm

Have the function restoring xstate take a struct fpu * pointer in
preparation for dynamic state buffer support.
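
The call sites change mechanically, e.g. (from the hunks below):

	-	copy_kernel_to_fpregs(&fpu->state);
	+	copy_kernel_to_fpregs(fpu);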

No functional change.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
---
Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Reverted the change on the copy_kernel_to_xregs_err() function as not
  needed.

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
---
 arch/x86/include/asm/fpu/internal.h | 6 ++++--
 arch/x86/kernel/fpu/core.c          | 4 ++--
 arch/x86/kvm/x86.c                  | 2 +-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 0153c4d4ca77..b34d0d29e4b8 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -425,8 +425,10 @@ static inline void __copy_kernel_to_fpregs(union fpregs_state *fpstate, u64 mask
 	}
 }
 
-static inline void copy_kernel_to_fpregs(union fpregs_state *fpstate)
+static inline void copy_kernel_to_fpregs(struct fpu *fpu)
 {
+	union fpregs_state *fpstate = &fpu->state;
+
 	/*
 	 * AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception is
 	 * pending. Clear the x87 state here by setting it to fixed values.
@@ -511,7 +513,7 @@ static inline void __fpregs_load_activate(void)
 		return;
 
 	if (!fpregs_state_valid(fpu, cpu)) {
-		copy_kernel_to_fpregs(&fpu->state);
+		copy_kernel_to_fpregs(fpu);
 		fpregs_activate(fpu);
 		fpu->last_cpu = cpu;
 	}
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index d43661d309ab..5775e64b0172 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -173,7 +173,7 @@ void fpu__save(struct fpu *fpu)
 
 	if (!test_thread_flag(TIF_NEED_FPU_LOAD)) {
 		if (!copy_fpregs_to_fpstate(fpu)) {
-			copy_kernel_to_fpregs(&fpu->state);
+			copy_kernel_to_fpregs(fpu);
 		}
 	}
 
@@ -251,7 +251,7 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
 		memcpy(&dst_fpu->state, &src_fpu->state, fpu_kernel_xstate_size);
 
 	else if (!copy_fpregs_to_fpstate(dst_fpu))
-		copy_kernel_to_fpregs(&dst_fpu->state);
+		copy_kernel_to_fpregs(dst_fpu);
 
 	fpregs_unlock();
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cc3b604ddcd2..dd9565d12d81 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9313,7 +9313,7 @@ static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 	if (vcpu->arch.guest_fpu)
 		kvm_save_current_fpu(vcpu->arch.guest_fpu);
 
-	copy_kernel_to_fpregs(&vcpu->arch.user_fpu->state);
+	copy_kernel_to_fpregs(vcpu->arch.user_fpu);
 
 	fpregs_mark_activate();
 	fpregs_unlock();
-- 
2.17.1



* [PATCH v4 05/22] x86/fpu/xstate: Add a new variable to indicate dynamic user states
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (3 preceding siblings ...)
  2021-02-21 18:56 ` [PATCH v4 04/22] x86/fpu/xstate: Modify the context restore helper " Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-02-21 18:56 ` [PATCH v4 06/22] x86/fpu/xstate: Add new variables to indicate dynamic xstate buffer size Chang S. Bae
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

The xstate per-task buffer is being prepared to become dynamic for user
states. Introduce a new mask variable to indicate the 'dynamic' user
states. The value is determined at boot time.

The perf subsystem uses a separate buffer to save some states only when
needed, not on every context switch. Those states are called 'dynamic'
supervisor states. The names of some defines and helpers do not indicate
that they refer to dynamic *supervisor* states, so rename them.
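
After this patch, the mask variables and helpers line up as follows (a
summary of the hunks below):

	xfeatures_mask_all                    /* all enabled components */
	xfeatures_mask_user()                 /* user states */
	xfeatures_mask_supervisor_dynamic()   /* e.g. LBR, saved by perf on demand */
	xfeatures_mask_user_dynamic           /* dynamic user states; 0 for now */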

No functional change.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Updated the code comment. (Borislav Petkov)

Changes from v2:
* Updated the changelog for clarification.
---
 arch/x86/include/asm/fpu/xstate.h | 12 ++++++-----
 arch/x86/kernel/fpu/xstate.c      | 33 ++++++++++++++++++++-----------
 2 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 24bf8d3f559a..6ce8350672c2 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -56,7 +56,7 @@
  * - Don't set the bit corresponding to the dynamic supervisor feature in
  *   IA32_XSS at run time, since it has been set at boot time.
  */
-#define XFEATURE_MASK_DYNAMIC (XFEATURE_MASK_LBR)
+#define XFEATURE_MASK_SUPERVISOR_DYNAMIC (XFEATURE_MASK_LBR)
 
 /*
  * Unsupported supervisor features. When a supervisor feature in this mask is
@@ -66,7 +66,7 @@
 
 /* All supervisor states including supported and unsupported states. */
 #define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
-				      XFEATURE_MASK_DYNAMIC | \
+				      XFEATURE_MASK_SUPERVISOR_DYNAMIC | \
 				      XFEATURE_MASK_SUPERVISOR_UNSUPPORTED)
 
 #ifdef CONFIG_X86_64
@@ -87,14 +87,16 @@ static inline u64 xfeatures_mask_user(void)
 	return xfeatures_mask_all & XFEATURE_MASK_USER_SUPPORTED;
 }
 
-static inline u64 xfeatures_mask_dynamic(void)
+static inline u64 xfeatures_mask_supervisor_dynamic(void)
 {
 	if (!boot_cpu_has(X86_FEATURE_ARCH_LBR))
-		return XFEATURE_MASK_DYNAMIC & ~XFEATURE_MASK_LBR;
+		return XFEATURE_MASK_SUPERVISOR_DYNAMIC & ~XFEATURE_MASK_LBR;
 
-	return XFEATURE_MASK_DYNAMIC;
+	return XFEATURE_MASK_SUPERVISOR_DYNAMIC;
 }
 
+extern u64 xfeatures_mask_user_dynamic;
+
 extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
 
 extern void __init update_regset_xstate_info(unsigned int size,
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 5401a71dd15e..43940828d1a3 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -61,6 +61,12 @@ static short xsave_cpuid_features[] __initdata = {
  */
 u64 xfeatures_mask_all __read_mostly;
 
+/*
+ * This represents user xstates, a subset of xfeatures_mask_all, saved in a
+ * dynamic kernel XSAVE buffer.
+ */
+u64 xfeatures_mask_user_dynamic __read_mostly;
+
 static unsigned int xstate_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
 static unsigned int xstate_sizes[XFEATURE_MAX]   = { [ 0 ... XFEATURE_MAX - 1] = -1};
 static unsigned int xstate_comp_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
@@ -237,7 +243,7 @@ void fpu__init_cpu_xstate(void)
 	 */
 	if (boot_cpu_has(X86_FEATURE_XSAVES)) {
 		wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() |
-				     xfeatures_mask_dynamic());
+				     xfeatures_mask_supervisor_dynamic());
 	}
 }
 
@@ -615,8 +621,8 @@ static void check_xstate_against_struct(int nr)
  * how large the XSAVE buffer needs to be.  We are recalculating
  * it to be safe.
  *
- * Dynamic XSAVE features allocate their own buffers and are not
- * covered by these checks. Only the size of the buffer for task->fpu
+ * Dynamic supervisor XSAVE features allocate their own buffers and are
+ * not covered by these checks. Only the size of the buffer for task->fpu
  * is checked here.
  */
 static void do_extra_xstate_size_checks(void)
@@ -686,7 +692,7 @@ static unsigned int __init get_xsaves_size(void)
  */
 static unsigned int __init get_xsaves_size_no_dynamic(void)
 {
-	u64 mask = xfeatures_mask_dynamic();
+	u64 mask = xfeatures_mask_supervisor_dynamic();
 	unsigned int size;
 
 	if (!mask)
@@ -773,6 +779,7 @@ static int __init init_xstate_size(void)
 static void fpu__init_disable_system_xstate(void)
 {
 	xfeatures_mask_all = 0;
+	xfeatures_mask_user_dynamic = 0;
 	cr4_clear_bits(X86_CR4_OSXSAVE);
 	setup_clear_cpu_cap(X86_FEATURE_XSAVE);
 }
@@ -839,6 +846,8 @@ void __init fpu__init_system_xstate(void)
 	}
 
 	xfeatures_mask_all &= fpu__get_supported_xfeatures_mask();
+	/* Do not support the dynamically allocated buffer yet. */
+	xfeatures_mask_user_dynamic = 0;
 
 	/* Enable xstate instructions to be able to continue with initialization: */
 	fpu__init_cpu_xstate();
@@ -886,7 +895,7 @@ void fpu__resume_cpu(void)
 	 */
 	if (boot_cpu_has(X86_FEATURE_XSAVES)) {
 		wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor()  |
-				     xfeatures_mask_dynamic());
+				     xfeatures_mask_supervisor_dynamic());
 	}
 }
 
@@ -1321,8 +1330,8 @@ void copy_supervisor_to_kernel(struct fpu *fpu)
  * @mask: Represent the dynamic supervisor features saved into the xsave area
  *
  * Only the dynamic supervisor states sets in the mask are saved into the xsave
- * area (See the comment in XFEATURE_MASK_DYNAMIC for the details of dynamic
- * supervisor feature). Besides the dynamic supervisor states, the legacy
+ * area (See the comment in XFEATURE_MASK_SUPERVISOR_DYNAMIC for the details of
+ * dynamic supervisor feature). Besides the dynamic supervisor states, the legacy
  * region and XSAVE header are also saved into the xsave area. The supervisor
  * features in the XFEATURE_MASK_SUPERVISOR_SUPPORTED and
  * XFEATURE_MASK_SUPERVISOR_UNSUPPORTED are not saved.
@@ -1331,7 +1340,7 @@ void copy_supervisor_to_kernel(struct fpu *fpu)
  */
 void copy_dynamic_supervisor_to_kernel(struct xregs_state *xstate, u64 mask)
 {
-	u64 dynamic_mask = xfeatures_mask_dynamic() & mask;
+	u64 dynamic_mask = xfeatures_mask_supervisor_dynamic() & mask;
 	u32 lmask, hmask;
 	int err;
 
@@ -1357,9 +1366,9 @@ void copy_dynamic_supervisor_to_kernel(struct xregs_state *xstate, u64 mask)
  * @mask: Represent the dynamic supervisor features restored from the xsave area
  *
  * Only the dynamic supervisor states sets in the mask are restored from the
- * xsave area (See the comment in XFEATURE_MASK_DYNAMIC for the details of
- * dynamic supervisor feature). Besides the dynamic supervisor states, the
- * legacy region and XSAVE header are also restored from the xsave area. The
+ * xsave area (See the comment in XFEATURE_MASK_SUPERVISOR_DYNAMIC for the
+ * details of dynamic supervisor feature). Besides the dynamic supervisor states,
+ * the legacy region and XSAVE header are also restored from the xsave area. The
  * supervisor features in the XFEATURE_MASK_SUPERVISOR_SUPPORTED and
  * XFEATURE_MASK_SUPERVISOR_UNSUPPORTED are not restored.
  *
@@ -1367,7 +1376,7 @@ void copy_dynamic_supervisor_to_kernel(struct xregs_state *xstate, u64 mask)
  */
 void copy_kernel_to_dynamic_supervisor(struct xregs_state *xstate, u64 mask)
 {
-	u64 dynamic_mask = xfeatures_mask_dynamic() & mask;
+	u64 dynamic_mask = xfeatures_mask_supervisor_dynamic() & mask;
 	u32 lmask, hmask;
 	int err;
 
-- 
2.17.1



* [PATCH v4 06/22] x86/fpu/xstate: Add new variables to indicate dynamic xstate buffer size
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (4 preceding siblings ...)
  2021-02-21 18:56 ` [PATCH v4 05/22] x86/fpu/xstate: Add a new variable to indicate dynamic user states Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-02-21 18:56 ` [PATCH v4 07/22] x86/fpu/xstate: Calculate and remember dynamic xstate buffer sizes Chang S. Bae
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae, kvm

The xstate per-task buffer is being prepared to become dynamic for user
states. Introduce new size variables to indicate the minimum and maximum
size of the buffer. The values are determined at boot time.

Instead of exporting the new variables directly, introduce helper functions
to access them, as well as the user buffer size.
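
For example, with the new helpers:

	/* Size of the buffer embedded in task_struct: */
	unsigned int min_size = get_xstate_config(XSTATE_MIN_SIZE);

	/* Standard-format size for signal and ptrace frames: */
	unsigned int user_size = get_xstate_config(XSTATE_USER_SIZE);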

No functional change. The sizes are all identical for now, as the buffer is
not yet dynamic.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
---
Changes from v3:
* Added as a new patch to add the variables along with new helpers.
  (Borislav Petkov)
---
 arch/x86/include/asm/fpu/xstate.h |  9 ++++
 arch/x86/include/asm/processor.h  | 10 +---
 arch/x86/kernel/fpu/core.c        | 24 +++++++---
 arch/x86/kernel/fpu/init.c        | 26 ++++-------
 arch/x86/kernel/fpu/regset.c      |  4 +-
 arch/x86/kernel/fpu/signal.c      | 27 ++++++-----
 arch/x86/kernel/fpu/xstate.c      | 78 ++++++++++++++++++++++++-------
 arch/x86/kernel/process.c         |  7 +++
 arch/x86/kvm/x86.c                |  5 +-
 9 files changed, 129 insertions(+), 61 deletions(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 6ce8350672c2..1fba2ca15874 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -102,6 +102,15 @@ extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
 extern void __init update_regset_xstate_info(unsigned int size,
 					     u64 xstate_mask);
 
+enum xstate_config {
+	XSTATE_MIN_SIZE,
+	XSTATE_MAX_SIZE,
+	XSTATE_USER_SIZE
+};
+
+extern unsigned int get_xstate_config(enum xstate_config cfg);
+void set_xstate_config(enum xstate_config cfg, unsigned int value);
+
 void *get_xsave_addr(struct fpu *fpu, int xfeature_nr);
 const void *get_xsave_field_ptr(int xfeature_nr);
 int using_compacted_format(void);
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index c20a52b5534b..f70228312790 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -477,9 +477,6 @@ DECLARE_PER_CPU_ALIGNED(struct stack_canary, stack_canary);
 DECLARE_PER_CPU(struct irq_stack *, softirq_stack_ptr);
 #endif	/* X86_64 */
 
-extern unsigned int fpu_kernel_xstate_size;
-extern unsigned int fpu_user_xstate_size;
-
 struct perf_event;
 
 struct thread_struct {
@@ -545,12 +542,7 @@ struct thread_struct {
 };
 
 /* Whitelist the FPU state from the task_struct for hardened usercopy. */
-static inline void arch_thread_struct_whitelist(unsigned long *offset,
-						unsigned long *size)
-{
-	*offset = offsetof(struct thread_struct, fpu.state);
-	*size = fpu_kernel_xstate_size;
-}
+extern void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size);
 
 /*
  * Thread-synchronous status.
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 5775e64b0172..043fdba8431c 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -198,21 +198,30 @@ static inline void fpstate_init_fstate(struct fregs_state *fp)
 void fpstate_init(struct fpu *fpu)
 {
 	union fpregs_state *state;
+	unsigned int size;
+	u64 mask;
 
-	if (fpu)
+	if (fpu) {
 		state = &fpu->state;
-	else
+		/* The dynamic user states are not prepared yet. */
+		mask = xfeatures_mask_all & ~xfeatures_mask_user_dynamic;
+		size = get_xstate_config(XSTATE_MIN_SIZE);
+	} else {
 		state = &init_fpstate;
+		mask = xfeatures_mask_all;
+		size = get_xstate_config(XSTATE_MAX_SIZE);
+	}
 
 	if (!static_cpu_has(X86_FEATURE_FPU)) {
 		fpstate_init_soft(&state->soft);
 		return;
 	}
 
-	memset(state, 0, fpu_kernel_xstate_size);
+	memset(state, 0, size);
 
 	if (static_cpu_has(X86_FEATURE_XSAVES))
-		fpstate_init_xstate(&state->xsave, xfeatures_mask_all);
+		fpstate_init_xstate(&state->xsave, mask);
+
 	if (static_cpu_has(X86_FEATURE_FXSR))
 		fpstate_init_fxstate(&state->fxsave);
 	else
@@ -235,8 +244,11 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
 	/*
 	 * Don't let 'init optimized' areas of the XSAVE area
 	 * leak into the child task:
+	 *
+	 * The child does not inherit the dynamic states. So,
+	 * the xstate buffer has the minimum size.
 	 */
-	memset(&dst_fpu->state.xsave, 0, fpu_kernel_xstate_size);
+	memset(&dst_fpu->state.xsave, 0, get_xstate_config(XSTATE_MIN_SIZE));
 
 	/*
 	 * If the FPU registers are not current just memcpy() the state.
@@ -248,7 +260,7 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
 	 */
 	fpregs_lock();
 	if (test_thread_flag(TIF_NEED_FPU_LOAD))
-		memcpy(&dst_fpu->state, &src_fpu->state, fpu_kernel_xstate_size);
+		memcpy(&dst_fpu->state, &src_fpu->state, get_xstate_config(XSTATE_MIN_SIZE));
 
 	else if (!copy_fpregs_to_fpstate(dst_fpu))
 		copy_kernel_to_fpregs(dst_fpu);
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 74e03e3bc20f..f63765b7a83c 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -129,15 +129,6 @@ static void __init fpu__init_system_generic(void)
 	fpu__init_system_mxcsr();
 }
 
-/*
- * Size of the FPU context state. All tasks in the system use the
- * same context size, regardless of what portion they use.
- * This is inherent to the XSAVE architecture which puts all state
- * components into a single, continuous memory block:
- */
-unsigned int fpu_kernel_xstate_size;
-EXPORT_SYMBOL_GPL(fpu_kernel_xstate_size);
-
 /* Get alignment of the TYPE. */
 #define TYPE_ALIGN(TYPE) offsetof(struct { char x; TYPE test; }, test)
 
@@ -167,8 +158,10 @@ static void __init fpu__init_task_struct_size(void)
 	/*
 	 * Add back the dynamically-calculated register state
 	 * size.
+	 *
+	 * Use the minimum size as embedded to task_struct.
 	 */
-	task_size += fpu_kernel_xstate_size;
+	task_size += get_xstate_config(XSTATE_MIN_SIZE);
 
 	/*
 	 * We dynamically size 'struct fpu', so we require that
@@ -193,6 +186,7 @@ static void __init fpu__init_task_struct_size(void)
 static void __init fpu__init_system_xstate_size_legacy(void)
 {
 	static int on_boot_cpu __initdata = 1;
+	unsigned int size;
 
 	WARN_ON_FPU(!on_boot_cpu);
 	on_boot_cpu = 0;
@@ -203,17 +197,17 @@ static void __init fpu__init_system_xstate_size_legacy(void)
 	 */
 
 	if (!boot_cpu_has(X86_FEATURE_FPU)) {
-		fpu_kernel_xstate_size = sizeof(struct swregs_state);
+		size = sizeof(struct swregs_state);
 	} else {
 		if (boot_cpu_has(X86_FEATURE_FXSR))
-			fpu_kernel_xstate_size =
-				sizeof(struct fxregs_state);
+			size = sizeof(struct fxregs_state);
 		else
-			fpu_kernel_xstate_size =
-				sizeof(struct fregs_state);
+			size = sizeof(struct fregs_state);
 	}
 
-	fpu_user_xstate_size = fpu_kernel_xstate_size;
+	set_xstate_config(XSTATE_MIN_SIZE, size);
+	set_xstate_config(XSTATE_MAX_SIZE, size);
+	set_xstate_config(XSTATE_USER_SIZE, size);
 }
 
 /*
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 5e13e58d11d4..6a025fa26a7e 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -99,7 +99,7 @@ int xstateregs_get(struct task_struct *target, const struct user_regset *regset,
 		/*
 		 * Copy the xstate memory layout.
 		 */
-		return membuf_write(&to, xsave, fpu_user_xstate_size);
+		return membuf_write(&to, xsave, get_xstate_config(XSTATE_USER_SIZE));
 	}
 }
 
@@ -117,7 +117,7 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
 	/*
 	 * A whole standard-format XSAVE buffer is needed:
 	 */
-	if ((pos != 0) || (count < fpu_user_xstate_size))
+	if ((pos != 0) || (count < get_xstate_config(XSTATE_USER_SIZE)))
 		return -EFAULT;
 
 	xsave = &fpu->state.xsave;
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 0d6deb75c507..3a2d8665b9a3 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -35,7 +35,7 @@ static inline int check_for_xstate(struct fxregs_state __user *buf,
 	/* Check for the first magic field and other error scenarios. */
 	if (fx_sw->magic1 != FP_XSTATE_MAGIC1 ||
 	    fx_sw->xstate_size < min_xstate_size ||
-	    fx_sw->xstate_size > fpu_user_xstate_size ||
+	    fx_sw->xstate_size > get_xstate_config(XSTATE_USER_SIZE) ||
 	    fx_sw->xstate_size > fx_sw->extended_size)
 		return -1;
 
@@ -98,7 +98,7 @@ static inline int save_xstate_epilog(void __user *buf, int ia32_frame)
 		return err;
 
 	err |= __put_user(FP_XSTATE_MAGIC2,
-			  (__u32 __user *)(buf + fpu_user_xstate_size));
+			  (__u32 __user *)(buf + get_xstate_config(XSTATE_USER_SIZE)));
 
 	/*
 	 * Read the xfeatures which we copied (directly from the cpu or
@@ -135,7 +135,7 @@ static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf)
 	else
 		err = copy_fregs_to_user((struct fregs_state __user *) buf);
 
-	if (unlikely(err) && __clear_user(buf, fpu_user_xstate_size))
+	if (unlikely(err) && __clear_user(buf, get_xstate_config(XSTATE_USER_SIZE)))
 		err = -EFAULT;
 	return err;
 }
@@ -196,7 +196,7 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
 	fpregs_unlock();
 
 	if (ret) {
-		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
+		if (!fault_in_pages_writeable(buf_fx, get_xstate_config(XSTATE_USER_SIZE)))
 			goto retry;
 		return -EFAULT;
 	}
@@ -290,13 +290,13 @@ static int copy_user_to_fpregs_zeroing(void __user *buf, u64 xbv, int fx_only)
 static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 {
 	struct user_i387_ia32_struct *envp = NULL;
-	int state_size = fpu_kernel_xstate_size;
 	int ia32_fxstate = (buf != buf_fx);
 	struct task_struct *tsk = current;
 	struct fpu *fpu = &tsk->thread.fpu;
 	struct user_i387_ia32_struct env;
 	u64 user_xfeatures = 0;
 	int fx_only = 0;
+	int state_size;
 	int ret = 0;
 
 	ia32_fxstate &= (IS_ENABLED(CONFIG_X86_32) ||
@@ -330,6 +330,9 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 			state_size = fx_sw_user.xstate_size;
 			user_xfeatures = fx_sw_user.xfeatures;
 		}
+	} else {
+		/* The buffer cannot be dynamic without using XSAVE. */
+		state_size = get_xstate_config(XSTATE_MIN_SIZE);
 	}
 
 	if ((unsigned long)buf_fx % 64)
@@ -469,8 +472,9 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 
 static inline int xstate_sigframe_size(void)
 {
-	return use_xsave() ? fpu_user_xstate_size + FP_XSTATE_MAGIC2_SIZE :
-			fpu_user_xstate_size;
+	int size = get_xstate_config(XSTATE_USER_SIZE);
+
+	return use_xsave() ? size + FP_XSTATE_MAGIC2_SIZE : size;
 }
 
 /*
@@ -514,19 +518,20 @@ fpu__alloc_mathframe(unsigned long sp, int ia32_frame,
  */
 void fpu__init_prepare_fx_sw_frame(void)
 {
-	int size = fpu_user_xstate_size + FP_XSTATE_MAGIC2_SIZE;
+	int xstate_size = get_xstate_config(XSTATE_USER_SIZE);
+	int ext_size = xstate_size + FP_XSTATE_MAGIC2_SIZE;
 
 	fx_sw_reserved.magic1 = FP_XSTATE_MAGIC1;
-	fx_sw_reserved.extended_size = size;
+	fx_sw_reserved.extended_size = ext_size;
 	fx_sw_reserved.xfeatures = xfeatures_mask_user();
-	fx_sw_reserved.xstate_size = fpu_user_xstate_size;
+	fx_sw_reserved.xstate_size = xstate_size;
 
 	if (IS_ENABLED(CONFIG_IA32_EMULATION) ||
 	    IS_ENABLED(CONFIG_X86_32)) {
 		int fsave_header_size = sizeof(struct fregs_state);
 
 		fx_sw_reserved_ia32 = fx_sw_reserved;
-		fx_sw_reserved_ia32.extended_size = size + fsave_header_size;
+		fx_sw_reserved_ia32.extended_size = ext_size + fsave_header_size;
 	}
 }
 
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 43940828d1a3..16379c368714 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -72,12 +72,50 @@ static unsigned int xstate_sizes[XFEATURE_MAX]   = { [ 0 ... XFEATURE_MAX - 1] =
 static unsigned int xstate_comp_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
 static unsigned int xstate_supervisor_only_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
 
-/*
- * The XSAVE area of kernel can be in standard or compacted format;
- * it is always in standard format for user mode. This is the user
- * mode standard format size used for signal and ptrace frames.
+/**
+ * struct fpu_xstate_buffer_config - xstate per-task buffer configuration
+ * @min_size, @max_size:	The size of the kernel buffer. It varies with the dynamic user
+ *				states. Every task has the minimum buffer by default, and it
+ *				can be expanded to the maximum size.  The two sizes are the
+ *				same when using the standard format.
+ * @user_size:			The size of the userspace buffer. The buffer is always in the
+ *				standard format. It is used for signal and ptrace frames.
  */
-unsigned int fpu_user_xstate_size;
+struct fpu_xstate_buffer_config {
+	unsigned int min_size, max_size;
+	unsigned int user_size;
+};
+
+static struct fpu_xstate_buffer_config buffer_config __read_mostly;
+
+unsigned int get_xstate_config(enum xstate_config cfg)
+{
+	switch (cfg) {
+	case XSTATE_MIN_SIZE:
+		return buffer_config.min_size;
+	case XSTATE_MAX_SIZE:
+		return buffer_config.max_size;
+	case XSTATE_USER_SIZE:
+		return buffer_config.user_size;
+	default:
+		return 0;
+	}
+}
+EXPORT_SYMBOL_GPL(get_xstate_config);
+
+void set_xstate_config(enum xstate_config cfg, unsigned int value)
+{
+	switch (cfg) {
+	case XSTATE_MIN_SIZE:
+		buffer_config.min_size = value;
+		break;
+	case XSTATE_MAX_SIZE:
+		buffer_config.max_size = value;
+		break;
+	case XSTATE_USER_SIZE:
+		buffer_config.user_size = value;
+	}
+}
 
 /*
  * Return whether the system supports a given xfeature.
@@ -659,7 +697,7 @@ static void do_extra_xstate_size_checks(void)
 		 */
 		paranoid_xstate_size += xfeature_size(i);
 	}
-	XSTATE_WARN_ON(paranoid_xstate_size != fpu_kernel_xstate_size);
+	XSTATE_WARN_ON(paranoid_xstate_size != get_xstate_config(XSTATE_MAX_SIZE));
 }
 
 
@@ -754,21 +792,29 @@ static int __init init_xstate_size(void)
 	else
 		possible_xstate_size = xsave_size;
 
-	/* Ensure we have the space to store all enabled: */
-	if (!is_supported_xstate_size(possible_xstate_size))
-		return -EINVAL;
-
 	/*
-	 * The size is OK, we are definitely going to use xsave,
-	 * make it known to the world that we need more space.
+	 * The size accounts for all the possible states reserved in the
+	 * per-task buffer.  Set the maximum to this value.
 	 */
-	fpu_kernel_xstate_size = possible_xstate_size;
+	set_xstate_config(XSTATE_MAX_SIZE, possible_xstate_size);
+
+	/* Perform an extra check for the maximum size. */
 	do_extra_xstate_size_checks();
 
+	/*
+	 * Set the minimum to be the same as the maximum. The dynamic
+	 * user states are not supported yet.
+	 */
+	set_xstate_config(XSTATE_MIN_SIZE, possible_xstate_size);
+
+	/* Ensure the minimum size fits in the statically-allocated buffer: */
+	if (!is_supported_xstate_size(get_xstate_config(XSTATE_MIN_SIZE)))
+		return -EINVAL;
+
 	/*
 	 * User space is always in standard format.
 	 */
-	fpu_user_xstate_size = xsave_size;
+	set_xstate_config(XSTATE_USER_SIZE, xsave_size);
 	return 0;
 }
 
@@ -859,7 +905,7 @@ void __init fpu__init_system_xstate(void)
 	 * Update info used for ptrace frames; use standard-format size and no
 	 * supervisor xstates:
 	 */
-	update_regset_xstate_info(fpu_user_xstate_size, xfeatures_mask_user());
+	update_regset_xstate_info(get_xstate_config(XSTATE_USER_SIZE), xfeatures_mask_user());
 
 	fpu__init_prepare_fx_sw_frame();
 	setup_init_fpu_buf();
@@ -869,7 +915,7 @@ void __init fpu__init_system_xstate(void)
 
 	pr_info("x86/fpu: Enabled xstate features 0x%llx, context size is %d bytes, using '%s' format.\n",
 		xfeatures_mask_all,
-		fpu_kernel_xstate_size,
+		get_xstate_config(XSTATE_MAX_SIZE),
 		boot_cpu_has(X86_FEATURE_XSAVES) ? "compacted" : "standard");
 	return;
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 145a7ac0c19a..2070ae35ccbc 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -96,6 +96,13 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 	return fpu__copy(dst, src);
 }
 
+void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
+{
+	*offset = offsetof(struct thread_struct, fpu.state);
+	/* The buffer embedded in thread_struct has the minimum size. */
+	*size = get_xstate_config(XSTATE_MIN_SIZE);
+}
+
 /*
  * Free thread data structures etc..
  */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dd9565d12d81..5c70a4270157 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9271,10 +9271,13 @@ static void kvm_save_current_fpu(struct fpu *fpu)
 	/*
 	 * If the target FPU state is not resident in the CPU registers, just
 	 * memcpy() from current, else save CPU state directly to the target.
+	 *
+	 * KVM does not support dynamic user states yet. Assume the buffer
+	 * always has the minimum size.
 	 */
 	if (test_thread_flag(TIF_NEED_FPU_LOAD))
 		memcpy(&fpu->state, &current->thread.fpu.state,
-		       fpu_kernel_xstate_size);
+		       get_xstate_config(XSTATE_MIN_SIZE));
 	else
 		copy_fpregs_to_fpstate(fpu);
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 07/22] x86/fpu/xstate: Calculate and remember dynamic xstate buffer sizes
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (5 preceding siblings ...)
  2021-02-21 18:56 ` [PATCH v4 06/22] x86/fpu/xstate: Add new variables to indicate dynamic xstate buffer size Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-02-21 18:56 ` [PATCH v4 08/22] x86/fpu/xstate: Convert the struct fpu 'state' field to a pointer Chang S. Bae
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

The xstate buffer is currently embedded into struct fpu with a static size.
To accommodate dynamic user xstates, record the maximum and minimum buffer
sizes.

Rename the size calculation function. It calculates the maximum xstate size
and sanity-checks it against CPUID. It also calculates the static embedded
buffer size by excluding the dynamic user states from the maximum size.
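
Condensed, the split works as in this sketch (alignment and the
uncompacted-format path are omitted; the names are those used in the
patch):

	unsigned int min_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
	unsigned int max_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
	int i;

	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
		if (!xfeature_enabled(i))
			continue;
		/* every enabled feature counts toward the maximum */
		max_size += xfeature_size(i);
		/* dynamic user features are left out of the minimum */
		if (!(xfeatures_mask_user_dynamic & BIT_ULL(i)))
			min_size += xfeature_size(i);
	}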

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Updated the code comment. (Borislav Petkov)
* Adjusted the calculation function naming.
* Moved out the new variable addition into a new patch.

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
* Renamed the in-line size variable.
* Updated some code comments.
---
 arch/x86/kernel/fpu/xstate.c | 52 +++++++++++++++++++++++-------------
 1 file changed, 33 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 16379c368714..b7686f107f3a 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -655,23 +655,31 @@ static void check_xstate_against_struct(int nr)
 }
 
 /*
- * This essentially double-checks what the cpu told us about
- * how large the XSAVE buffer needs to be.  We are recalculating
- * it to be safe.
+ * Calculate the xstate per-task buffer sizes -- maximum and minimum.
+ *
+ * Record the minimum, and double-check the maximum against what the
+ * CPU reported.
+ *
+ * Dynamic user states are stored in this buffer. They account for the
+ * delta between the maximum and the minimum.
  *
  * Dynamic supervisor XSAVE features allocate their own buffers and are
- * not covered by these checks. Only the size of the buffer for task->fpu
- * is checked here.
+ * not covered by these checks.
  */
-static void do_extra_xstate_size_checks(void)
+static void calculate_xstate_sizes(void)
 {
-	int paranoid_xstate_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
+	int paranoid_min_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
+	int paranoid_max_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
 	int i;
 
 	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
+		bool user_dynamic;
+
 		if (!xfeature_enabled(i))
 			continue;
 
+		user_dynamic = (xfeatures_mask_user_dynamic & BIT_ULL(i)) ? true : false;
+
 		check_xstate_against_struct(i);
 		/*
 		 * Supervisor state components can be managed only by
@@ -681,23 +689,32 @@ static void do_extra_xstate_size_checks(void)
 			XSTATE_WARN_ON(xfeature_is_supervisor(i));
 
 		/* Align from the end of the previous feature */
-		if (xfeature_is_aligned(i))
-			paranoid_xstate_size = ALIGN(paranoid_xstate_size, 64);
+		if (xfeature_is_aligned(i)) {
+			paranoid_max_size = ALIGN(paranoid_max_size, 64);
+			if (!user_dynamic)
+				paranoid_min_size = ALIGN(paranoid_min_size, 64);
+		}
 		/*
 		 * The offset of a given state in the non-compacted
 		 * format is given to us in a CPUID leaf.  We check
 		 * them for being ordered (increasing offsets) in
 		 * setup_xstate_features().
 		 */
-		if (!using_compacted_format())
-			paranoid_xstate_size = xfeature_uncompacted_offset(i);
+		if (!using_compacted_format()) {
+			paranoid_max_size = xfeature_uncompacted_offset(i);
+			if (!user_dynamic)
+				paranoid_min_size = xfeature_uncompacted_offset(i);
+		}
 		/*
 		 * The compacted-format offset always depends on where
 		 * the previous state ended.
 		 */
-		paranoid_xstate_size += xfeature_size(i);
+		paranoid_max_size += xfeature_size(i);
+		if (!user_dynamic)
+			paranoid_min_size += xfeature_size(i);
 	}
-	XSTATE_WARN_ON(paranoid_xstate_size != get_xstate_config(XSTATE_MAX_SIZE));
+	XSTATE_WARN_ON(paranoid_max_size != get_xstate_config(XSTATE_MAX_SIZE));
+	set_xstate_config(XSTATE_MIN_SIZE, paranoid_min_size);
 }
 
 
@@ -798,14 +815,11 @@ static int __init init_xstate_size(void)
 	 */
 	set_xstate_config(XSTATE_MAX_SIZE, possible_xstate_size);
 
-	/* Perform an extra check for the maximum size. */
-	do_extra_xstate_size_checks();
-
 	/*
-	 * Set the minimum to be the same as the maximum. The dynamic
-	 * user states are not supported yet.
+	 * Calculate and double-check the maximum size. Calculate and record
+	 * the minimum size.
 	 */
-	set_xstate_config(XSTATE_MIN_SIZE, possible_xstate_size);
+	calculate_xstate_sizes();
 
 	/* Ensure the minimum size fits in the statically-allocated buffer: */
 	if (!is_supported_xstate_size(get_xstate_config(XSTATE_MIN_SIZE)))
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 08/22] x86/fpu/xstate: Convert the struct fpu 'state' field to a pointer
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (6 preceding siblings ...)
  2021-02-21 18:56 ` [PATCH v4 07/22] x86/fpu/xstate: Calculate and remember dynamic xstate buffer sizes Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-02-21 18:56 ` [PATCH v4 09/22] x86/fpu/xstate: Introduce helpers to manage the xstate buffer dynamically Chang S. Bae
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae, kvm

The xstate per-task buffer is embedded into struct fpu, and the field
'state' represents the buffer. When the dynamic user states are in use,
the buffer may be dynamically allocated.

Convert the 'state' field to point to either the embedded buffer or the
dynamically-allocated buffer. Add a new field to represent the embedded
buffer.

Every child process sets the pointer at creation time, and the initial
task sets it before dealing with the soft FPU.

No functional change.
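
As a minimal sketch of the resulting invariant (names from this patch,
error handling omitted):

	struct fpu *fpu = &tsk->thread.fpu;

	/* every task starts with the embedded, minimum-size buffer */
	fpu->state = &fpu->__default_state;

	/* all accesses now go through the pointer */
	fpu->state->xsave.header.xfeatures |= XFEATURE_MASK_FP;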

Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
---
Changes from v3:
* Added as a new patch to simplify the buffer access. (Borislav Petkov)
---
 arch/x86/include/asm/fpu/internal.h |  6 +++---
 arch/x86/include/asm/fpu/types.h    | 27 +++++++++++++++++++++------
 arch/x86/include/asm/trace/fpu.h    |  4 ++--
 arch/x86/kernel/fpu/core.c          | 26 ++++++++++++++------------
 arch/x86/kernel/fpu/init.c          |  6 ++++--
 arch/x86/kernel/fpu/regset.c        | 22 +++++++++++-----------
 arch/x86/kernel/fpu/signal.c        | 22 +++++++++++-----------
 arch/x86/kernel/fpu/xstate.c        | 18 +++++++++---------
 arch/x86/kernel/process.c           |  2 +-
 arch/x86/kvm/x86.c                  | 18 +++++++++---------
 arch/x86/math-emu/fpu_aux.c         |  2 +-
 arch/x86/math-emu/fpu_entry.c       |  4 ++--
 arch/x86/math-emu/fpu_system.h      |  2 +-
 13 files changed, 89 insertions(+), 70 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index b34d0d29e4b8..46cb51ef4d17 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -199,9 +199,9 @@ static inline int copy_user_to_fregs(struct fregs_state __user *fx)
 static inline void copy_fxregs_to_kernel(struct fpu *fpu)
 {
 	if (IS_ENABLED(CONFIG_X86_32))
-		asm volatile( "fxsave %[fx]" : [fx] "=m" (fpu->state.fxsave));
+		asm volatile( "fxsave %[fx]" : [fx] "=m" (fpu->state->fxsave));
 	else
-		asm volatile("fxsaveq %[fx]" : [fx] "=m" (fpu->state.fxsave));
+		asm volatile("fxsaveq %[fx]" : [fx] "=m" (fpu->state->fxsave));
 }
 
 /* These macros all use (%edi)/(%rdi) as the single memory argument. */
@@ -427,7 +427,7 @@ static inline void __copy_kernel_to_fpregs(union fpregs_state *fpstate, u64 mask
 
 static inline void copy_kernel_to_fpregs(struct fpu *fpu)
 {
-	union fpregs_state *fpstate = &fpu->state;
+	union fpregs_state *fpstate = fpu->state;
 
 	/*
 	 * AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception is
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index f5a38a5f3ae1..dcd28a545377 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -339,13 +339,28 @@ struct fpu {
 	/*
 	 * @state:
 	 *
-	 * In-memory copy of all FPU registers that we save/restore
-	 * over context switches. If the task is using the FPU then
-	 * the registers in the FPU are more recent than this state
-	 * copy. If the task context-switches away then they get
-	 * saved here and represent the FPU state.
+	 * A pointer to the in-memory copy of all FPU registers that are
+	 * saved/restored over context switches.
+	 *
+	 * Initially @state points to @__default_state. When dynamic states get
+	 * used, memory is allocated for the larger state copy and @state is
+	 * updated to point to it. Then, the state in ->state supersedes and
+	 * invalidates the state in @__default_state.
+	 *
+	 * In general, if the task is using the FPU then the registers in the FPU
+	 * are more recent than the state copy. If the task context-switches away
+	 * then they get saved in ->state and represent the FPU state.
+	 */
+	union fpregs_state		*state;
+
+	/*
+	 * @__default_state:
+	 *
+	 * Initial in-memory copy of all FPU registers that are saved/restored
+	 * over context switches. When the task is switched to dynamic states,
+	 * this copy is replaced with the new in-memory copy in ->state.
 	 */
-	union fpregs_state		state;
+	union fpregs_state		__default_state;
 	/*
 	 * WARNING: 'state' is dynamically-sized.  Do not put
 	 * anything after it here.
diff --git a/arch/x86/include/asm/trace/fpu.h b/arch/x86/include/asm/trace/fpu.h
index 879b77792f94..ef82f4824ce7 100644
--- a/arch/x86/include/asm/trace/fpu.h
+++ b/arch/x86/include/asm/trace/fpu.h
@@ -22,8 +22,8 @@ DECLARE_EVENT_CLASS(x86_fpu,
 		__entry->fpu		= fpu;
 		__entry->load_fpu	= test_thread_flag(TIF_NEED_FPU_LOAD);
 		if (boot_cpu_has(X86_FEATURE_OSXSAVE)) {
-			__entry->xfeatures = fpu->state.xsave.header.xfeatures;
-			__entry->xcomp_bv  = fpu->state.xsave.header.xcomp_bv;
+			__entry->xfeatures = fpu->state->xsave.header.xfeatures;
+			__entry->xcomp_bv  = fpu->state->xsave.header.xcomp_bv;
 		}
 	),
 	TP_printk("x86/fpu: %p load: %d xfeatures: %llx xcomp_bv: %llx",
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 043fdba8431c..60a581aa0be8 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -95,13 +95,13 @@ EXPORT_SYMBOL(irq_fpu_usable);
 int copy_fpregs_to_fpstate(struct fpu *fpu)
 {
 	if (likely(use_xsave())) {
-		copy_xregs_to_kernel(&fpu->state.xsave);
+		copy_xregs_to_kernel(&fpu->state->xsave);
 
 		/*
 		 * AVX512 state is tracked here because its use is
 		 * known to slow the max clock speed of the core.
 		 */
-		if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512)
+		if (fpu->state->xsave.header.xfeatures & XFEATURE_MASK_AVX512)
 			fpu->avx512_timestamp = jiffies;
 		return 1;
 	}
@@ -115,7 +115,7 @@ int copy_fpregs_to_fpstate(struct fpu *fpu)
 	 * Legacy FPU register saving, FNSAVE always clears FPU registers,
 	 * so we have to mark them inactive:
 	 */
-	asm volatile("fnsave %[fp]; fwait" : [fp] "=m" (fpu->state.fsave));
+	asm volatile("fnsave %[fp]; fwait" : [fp] "=m" (fpu->state->fsave));
 
 	return 0;
 }
@@ -202,7 +202,7 @@ void fpstate_init(struct fpu *fpu)
 	u64 mask;
 
 	if (fpu) {
-		state = &fpu->state;
+		state = fpu->state;
 		/* The dynamic user states are not prepared yet. */
 		mask = xfeatures_mask_all & ~xfeatures_mask_user_dynamic;
 		size = get_xstate_config(XSTATE_MIN_SIZE);
@@ -241,6 +241,8 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
 
 	WARN_ON_FPU(src_fpu != &current->thread.fpu);
 
+	dst_fpu->state = &dst_fpu->__default_state;
+
 	/*
 	 * Don't let 'init optimized' areas of the XSAVE area
 	 * leak into the child task:
@@ -248,7 +250,7 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
 	 * The child does not inherit the dynamic states. So,
 	 * the xstate buffer has the minimum size.
 	 */
-	memset(&dst_fpu->state.xsave, 0, get_xstate_config(XSTATE_MIN_SIZE));
+	memset(&dst_fpu->state->xsave, 0, get_xstate_config(XSTATE_MIN_SIZE));
 
 	/*
 	 * If the FPU registers are not current just memcpy() the state.
@@ -260,7 +262,7 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
 	 */
 	fpregs_lock();
 	if (test_thread_flag(TIF_NEED_FPU_LOAD))
-		memcpy(&dst_fpu->state, &src_fpu->state, get_xstate_config(XSTATE_MIN_SIZE));
+		memcpy(dst_fpu->state, src_fpu->state, get_xstate_config(XSTATE_MIN_SIZE));
 
 	else if (!copy_fpregs_to_fpstate(dst_fpu))
 		copy_kernel_to_fpregs(dst_fpu);
@@ -396,7 +398,7 @@ static void fpu__clear(struct fpu *fpu, bool user_only)
 	if (user_only) {
 		if (!fpregs_state_valid(fpu, smp_processor_id()) &&
 		    xfeatures_mask_supervisor())
-			copy_kernel_to_xregs(&fpu->state.xsave,
+			copy_kernel_to_xregs(&fpu->state->xsave,
 					     xfeatures_mask_supervisor());
 		copy_init_fpstate_to_fpregs(xfeatures_mask_user());
 	} else {
@@ -478,11 +480,11 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
 		 * fully reproduce the context of the exception.
 		 */
 		if (boot_cpu_has(X86_FEATURE_FXSR)) {
-			cwd = fpu->state.fxsave.cwd;
-			swd = fpu->state.fxsave.swd;
+			cwd = fpu->state->fxsave.cwd;
+			swd = fpu->state->fxsave.swd;
 		} else {
-			cwd = (unsigned short)fpu->state.fsave.cwd;
-			swd = (unsigned short)fpu->state.fsave.swd;
+			cwd = (unsigned short)fpu->state->fsave.cwd;
+			swd = (unsigned short)fpu->state->fsave.swd;
 		}
 
 		err = swd & ~cwd;
@@ -496,7 +498,7 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
 		unsigned short mxcsr = MXCSR_DEFAULT;
 
 		if (boot_cpu_has(X86_FEATURE_XMM))
-			mxcsr = fpu->state.fxsave.mxcsr;
+			mxcsr = fpu->state->fxsave.mxcsr;
 
 		err = ~(mxcsr >> 7) & mxcsr;
 	}
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index f63765b7a83c..f2fcdcc979e7 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -31,10 +31,12 @@ static void fpu__init_cpu_generic(void)
 		cr0 |= X86_CR0_EM;
 	write_cr0(cr0);
 
+	current->thread.fpu.state = &current->thread.fpu.__default_state;
+
 	/* Flush out any pending x87 state: */
 #ifdef CONFIG_MATH_EMULATION
 	if (!boot_cpu_has(X86_FEATURE_FPU))
-		fpstate_init_soft(&current->thread.fpu.state.soft);
+		fpstate_init_soft(&current->thread.fpu.state->soft);
 	else
 #endif
 		asm volatile ("fninit");
@@ -170,7 +172,7 @@ static void __init fpu__init_task_struct_size(void)
 	 * you hit a compile error here, check the structure to
 	 * see if something got added to the end.
 	 */
-	CHECK_MEMBER_AT_END_OF(struct fpu, state);
+	CHECK_MEMBER_AT_END_OF(struct fpu, __default_state);
 	CHECK_MEMBER_AT_END_OF(struct thread_struct, fpu);
 	CHECK_MEMBER_AT_END_OF(struct task_struct, thread);
 
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 6a025fa26a7e..ee27df4caed6 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -37,7 +37,7 @@ int xfpregs_get(struct task_struct *target, const struct user_regset *regset,
 	fpu__prepare_read(fpu);
 	fpstate_sanitize_xstate(fpu);
 
-	return membuf_write(&to, &fpu->state.fxsave, sizeof(struct fxregs_state));
+	return membuf_write(&to, &fpu->state->fxsave, sizeof(struct fxregs_state));
 }
 
 int xfpregs_set(struct task_struct *target, const struct user_regset *regset,
@@ -54,19 +54,19 @@ int xfpregs_set(struct task_struct *target, const struct user_regset *regset,
 	fpstate_sanitize_xstate(fpu);
 
 	ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf,
-				 &fpu->state.fxsave, 0, -1);
+				 &fpu->state->fxsave, 0, -1);
 
 	/*
 	 * mxcsr reserved bits must be masked to zero for security reasons.
 	 */
-	fpu->state.fxsave.mxcsr &= mxcsr_feature_mask;
+	fpu->state->fxsave.mxcsr &= mxcsr_feature_mask;
 
 	/*
 	 * update the header bits in the xsave header, indicating the
 	 * presence of FP and SSE state.
 	 */
 	if (boot_cpu_has(X86_FEATURE_XSAVE))
-		fpu->state.xsave.header.xfeatures |= XFEATURE_MASK_FPSSE;
+		fpu->state->xsave.header.xfeatures |= XFEATURE_MASK_FPSSE;
 
 	return ret;
 }
@@ -80,7 +80,7 @@ int xstateregs_get(struct task_struct *target, const struct user_regset *regset,
 	if (!boot_cpu_has(X86_FEATURE_XSAVE))
 		return -ENODEV;
 
-	xsave = &fpu->state.xsave;
+	xsave = &fpu->state->xsave;
 
 	fpu__prepare_read(fpu);
 
@@ -120,7 +120,7 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
 	if ((pos != 0) || (count < get_xstate_config(XSTATE_USER_SIZE)))
 		return -EFAULT;
 
-	xsave = &fpu->state.xsave;
+	xsave = &fpu->state->xsave;
 
 	fpu__prepare_write(fpu);
 
@@ -224,7 +224,7 @@ static inline u32 twd_fxsr_to_i387(struct fxregs_state *fxsave)
 void
 convert_from_fxsr(struct user_i387_ia32_struct *env, struct task_struct *tsk)
 {
-	struct fxregs_state *fxsave = &tsk->thread.fpu.state.fxsave;
+	struct fxregs_state *fxsave = &tsk->thread.fpu.state->fxsave;
 	struct _fpreg *to = (struct _fpreg *) &env->st_space[0];
 	struct _fpxreg *from = (struct _fpxreg *) &fxsave->st_space[0];
 	int i;
@@ -297,7 +297,7 @@ int fpregs_get(struct task_struct *target, const struct user_regset *regset,
 		return fpregs_soft_get(target, regset, to);
 
 	if (!boot_cpu_has(X86_FEATURE_FXSR)) {
-		return membuf_write(&to, &fpu->state.fsave,
+		return membuf_write(&to, &fpu->state->fsave,
 				    sizeof(struct fregs_state));
 	}
 
@@ -328,7 +328,7 @@ int fpregs_set(struct task_struct *target, const struct user_regset *regset,
 
 	if (!boot_cpu_has(X86_FEATURE_FXSR))
 		return user_regset_copyin(&pos, &count, &kbuf, &ubuf,
-					  &fpu->state.fsave, 0,
+					  &fpu->state->fsave, 0,
 					  -1);
 
 	if (pos > 0 || count < sizeof(env))
@@ -336,14 +336,14 @@ int fpregs_set(struct task_struct *target, const struct user_regset *regset,
 
 	ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &env, 0, -1);
 	if (!ret)
-		convert_to_fxsr(&target->thread.fpu.state.fxsave, &env);
+		convert_to_fxsr(&target->thread.fpu.state->fxsave, &env);
 
 	/*
 	 * update the header bit in the xsave header, indicating the
 	 * presence of FP.
 	 */
 	if (boot_cpu_has(X86_FEATURE_XSAVE))
-		fpu->state.xsave.header.xfeatures |= XFEATURE_MASK_FP;
+		fpu->state->xsave.header.xfeatures |= XFEATURE_MASK_FP;
 	return ret;
 }
 
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 3a2d8665b9a3..9719241da034 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -58,7 +58,7 @@ static inline int check_for_xstate(struct fxregs_state __user *buf,
 static inline int save_fsave_header(struct task_struct *tsk, void __user *buf)
 {
 	if (use_fxsr()) {
-		struct xregs_state *xsave = &tsk->thread.fpu.state.xsave;
+		struct xregs_state *xsave = &tsk->thread.fpu.state->xsave;
 		struct user_i387_ia32_struct env;
 		struct _fpstate_32 __user *fp = buf;
 
@@ -216,7 +216,7 @@ sanitize_restored_user_xstate(struct fpu *fpu,
 			      struct user_i387_ia32_struct *ia32_env,
 			      u64 user_xfeatures, int fx_only)
 {
-	struct xregs_state *xsave = &fpu->state.xsave;
+	struct xregs_state *xsave = &fpu->state->xsave;
 	struct xstate_header *header = &xsave->header;
 
 	if (use_xsave()) {
@@ -253,7 +253,7 @@ sanitize_restored_user_xstate(struct fpu *fpu,
 		xsave->i387.mxcsr &= mxcsr_feature_mask;
 
 		if (ia32_env)
-			convert_to_fxsr(&fpu->state.fxsave, ia32_env);
+			convert_to_fxsr(&fpu->state->fxsave, ia32_env);
 	}
 }
 
@@ -366,7 +366,7 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 			 */
 			if (test_thread_flag(TIF_NEED_FPU_LOAD) &&
 			    xfeatures_mask_supervisor())
-				copy_kernel_to_xregs(&fpu->state.xsave,
+				copy_kernel_to_xregs(&fpu->state->xsave,
 						     xfeatures_mask_supervisor());
 			fpregs_mark_activate();
 			fpregs_unlock();
@@ -411,10 +411,10 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 		if (using_compacted_format()) {
 			ret = copy_user_to_xstate(fpu, buf_fx);
 		} else {
-			ret = __copy_from_user(&fpu->state.xsave, buf_fx, state_size);
+			ret = __copy_from_user(&fpu->state->xsave, buf_fx, state_size);
 
 			if (!ret && state_size > offsetof(struct xregs_state, header))
-				ret = validate_user_xstate_header(&fpu->state.xsave.header);
+				ret = validate_user_xstate_header(&fpu->state->xsave.header);
 		}
 		if (ret)
 			goto err_out;
@@ -429,11 +429,11 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 		 * Restore previously saved supervisor xstates along with
 		 * copied-in user xstates.
 		 */
-		ret = copy_kernel_to_xregs_err(&fpu->state.xsave,
+		ret = copy_kernel_to_xregs_err(&fpu->state->xsave,
 					       user_xfeatures | xfeatures_mask_supervisor());
 
 	} else if (use_fxsr()) {
-		ret = __copy_from_user(&fpu->state.fxsave, buf_fx, state_size);
+		ret = __copy_from_user(&fpu->state->fxsave, buf_fx, state_size);
 		if (ret) {
 			ret = -EFAULT;
 			goto err_out;
@@ -449,14 +449,14 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 			copy_kernel_to_xregs(&init_fpstate.xsave, init_bv);
 		}
 
-		ret = copy_kernel_to_fxregs_err(&fpu->state.fxsave);
+		ret = copy_kernel_to_fxregs_err(&fpu->state->fxsave);
 	} else {
-		ret = __copy_from_user(&fpu->state.fsave, buf_fx, state_size);
+		ret = __copy_from_user(&fpu->state->fsave, buf_fx, state_size);
 		if (ret)
 			goto err_out;
 
 		fpregs_lock();
-		ret = copy_kernel_to_fregs_err(&fpu->state.fsave);
+		ret = copy_kernel_to_fregs_err(&fpu->state->fsave);
 	}
 	if (!ret)
 		fpregs_mark_activate();
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index b7686f107f3a..8c067a7a0eec 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -185,14 +185,14 @@ static bool xfeature_is_supervisor(int xfeature_nr)
  */
 void fpstate_sanitize_xstate(struct fpu *fpu)
 {
-	struct fxregs_state *fx = &fpu->state.fxsave;
+	struct fxregs_state *fx = &fpu->state->fxsave;
 	int feature_bit;
 	u64 xfeatures;
 
 	if (!use_xsaveopt())
 		return;
 
-	xfeatures = fpu->state.xsave.header.xfeatures;
+	xfeatures = fpu->state->xsave.header.xfeatures;
 
 	/*
 	 * None of the feature bits are in init state. So nothing else
@@ -976,7 +976,7 @@ static void *__raw_xsave_addr(struct fpu *fpu, int xfeature_nr)
 	}
 
 	if (fpu)
-		xsave = &fpu->state.xsave;
+		xsave = &fpu->state->xsave;
 	else
 		xsave = &init_fpstate.xsave;
 
@@ -1019,7 +1019,7 @@ void *get_xsave_addr(struct fpu *fpu, int xfeature_nr)
 		  "get of unsupported state");
 
 	if (fpu)
-		xsave = &fpu->state.xsave;
+		xsave = &fpu->state->xsave;
 	else
 		xsave = &init_fpstate.xsave;
 
@@ -1167,7 +1167,7 @@ void copy_xstate_to_kernel(struct membuf to, struct fpu *fpu)
 	unsigned last = 0;
 	int i;
 
-	xsave = &fpu->state.xsave;
+	xsave = &fpu->state->xsave;
 
 	/*
 	 * The destination is a ptrace buffer; we put in only user xstates:
@@ -1232,7 +1232,7 @@ int copy_kernel_to_xstate(struct fpu *fpu, const void *kbuf)
 	if (validate_user_xstate_header(&hdr))
 		return -EINVAL;
 
-	xsave = &fpu->state.xsave;
+	xsave = &fpu->state->xsave;
 
 	for (i = 0; i < XFEATURE_MAX; i++) {
 		u64 mask = ((u64)1 << i);
@@ -1289,7 +1289,7 @@ int copy_user_to_xstate(struct fpu *fpu, const void __user *ubuf)
 	if (validate_user_xstate_header(&hdr))
 		return -EINVAL;
 
-	xsave = &fpu->state.xsave;
+	xsave = &fpu->state->xsave;
 
 	for (i = 0; i < XFEATURE_MAX; i++) {
 		u64 mask = ((u64)1 << i);
@@ -1348,7 +1348,7 @@ void copy_supervisor_to_kernel(struct fpu *fpu)
 	max_bit = __fls(xfeatures_mask_supervisor());
 	min_bit = __ffs(xfeatures_mask_supervisor());
 
-	xstate = &fpu->state.xsave;
+	xstate = &fpu->state->xsave;
 	lmask = xfeatures_mask_supervisor();
 	hmask = xfeatures_mask_supervisor() >> 32;
 	XSTATE_OP(XSAVES, xstate, lmask, hmask, err);
@@ -1535,7 +1535,7 @@ void update_pasid(void)
 		 * update the PASID state in the memory buffer here. The
 		 * PASID MSR will be loaded when returning to user mode.
 		 */
-		xsave = &fpu->state.xsave;
+		xsave = &fpu->state->xsave;
 		xsave->header.xfeatures |= XFEATURE_MASK_PASID;
 		ppasid_state = get_xsave_addr(fpu, XFEATURE_PASID);
 		/*
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 2070ae35ccbc..c41116543a82 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -98,7 +98,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 
 void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
 {
-	*offset = offsetof(struct thread_struct, fpu.state);
+	*offset = offsetof(struct thread_struct, fpu.__default_state);
 	/* The buffer embedded in thread_struct has the minimum size. */
 	*size = get_xstate_config(XSTATE_MIN_SIZE);
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5c70a4270157..c10122547ecd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4439,7 +4439,7 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
 
 static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
 {
-	struct xregs_state *xsave = &vcpu->arch.guest_fpu->state.xsave;
+	struct xregs_state *xsave = &vcpu->arch.guest_fpu->state->xsave;
 	u64 xstate_bv = xsave->header.xfeatures;
 	u64 valid;
 
@@ -4481,7 +4481,7 @@ static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
 
 static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
 {
-	struct xregs_state *xsave = &vcpu->arch.guest_fpu->state.xsave;
+	struct xregs_state *xsave = &vcpu->arch.guest_fpu->state->xsave;
 	u64 xstate_bv = *(u64 *)(src + XSAVE_HDR_OFFSET);
 	u64 valid;
 
@@ -4532,7 +4532,7 @@ static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
 		fill_xsave((u8 *) guest_xsave->region, vcpu);
 	} else {
 		memcpy(guest_xsave->region,
-			&vcpu->arch.guest_fpu->state.fxsave,
+			&vcpu->arch.guest_fpu->state->fxsave,
 			sizeof(struct fxregs_state));
 		*(u64 *)&guest_xsave->region[XSAVE_HDR_OFFSET / sizeof(u32)] =
 			XFEATURE_MASK_FPSSE;
@@ -4566,7 +4566,7 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
 		if (xstate_bv & ~XFEATURE_MASK_FPSSE ||
 			mxcsr & ~mxcsr_feature_mask)
 			return -EINVAL;
-		memcpy(&vcpu->arch.guest_fpu->state.fxsave,
+		memcpy(&vcpu->arch.guest_fpu->state->fxsave,
 			guest_xsave->region, sizeof(struct fxregs_state));
 	}
 	return 0;
@@ -9276,7 +9276,7 @@ static void kvm_save_current_fpu(struct fpu *fpu)
 	 * always has the minimum size.
 	 */
 	if (test_thread_flag(TIF_NEED_FPU_LOAD))
-		memcpy(&fpu->state, &current->thread.fpu.state,
+		memcpy(fpu->state, current->thread.fpu.state,
 		       get_xstate_config(XSTATE_MIN_SIZE));
 	else
 		copy_fpregs_to_fpstate(fpu);
@@ -9295,7 +9295,7 @@ static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
 	 */
 	if (vcpu->arch.guest_fpu)
 		/* PKRU is separately restored in kvm_x86_ops.run. */
-		__copy_kernel_to_fpregs(&vcpu->arch.guest_fpu->state,
+		__copy_kernel_to_fpregs(vcpu->arch.guest_fpu->state,
 					~XFEATURE_MASK_PKRU);
 
 	fpregs_mark_activate();
@@ -9832,7 +9832,7 @@ int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 
 	vcpu_load(vcpu);
 
-	fxsave = &vcpu->arch.guest_fpu->state.fxsave;
+	fxsave = &vcpu->arch.guest_fpu->state->fxsave;
 	memcpy(fpu->fpr, fxsave->st_space, 128);
 	fpu->fcw = fxsave->cwd;
 	fpu->fsw = fxsave->swd;
@@ -9855,7 +9855,7 @@ int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 
 	vcpu_load(vcpu);
 
-	fxsave = &vcpu->arch.guest_fpu->state.fxsave;
+	fxsave = &vcpu->arch.guest_fpu->state->fxsave;
 
 	memcpy(fxsave->st_space, fpu->fpr, 128);
 	fxsave->cwd = fpu->fcw;
@@ -9916,7 +9916,7 @@ static void fx_init(struct kvm_vcpu *vcpu)
 
 	fpstate_init(vcpu->arch.guest_fpu);
 	if (boot_cpu_has(X86_FEATURE_XSAVES))
-		vcpu->arch.guest_fpu->state.xsave.header.xcomp_bv =
+		vcpu->arch.guest_fpu->state->xsave.header.xcomp_bv =
 			host_xcr0 | XSTATE_COMPACTION_ENABLED;
 
 	/*
diff --git a/arch/x86/math-emu/fpu_aux.c b/arch/x86/math-emu/fpu_aux.c
index 034748459482..51432a73024c 100644
--- a/arch/x86/math-emu/fpu_aux.c
+++ b/arch/x86/math-emu/fpu_aux.c
@@ -53,7 +53,7 @@ void fpstate_init_soft(struct swregs_state *soft)
 
 void finit(void)
 {
-	fpstate_init_soft(&current->thread.fpu.state.soft);
+	fpstate_init_soft(&current->thread.fpu.state->soft);
 }
 
 /*
diff --git a/arch/x86/math-emu/fpu_entry.c b/arch/x86/math-emu/fpu_entry.c
index 8679a9d6c47f..6ba56632170e 100644
--- a/arch/x86/math-emu/fpu_entry.c
+++ b/arch/x86/math-emu/fpu_entry.c
@@ -640,7 +640,7 @@ int fpregs_soft_set(struct task_struct *target,
 		    unsigned int pos, unsigned int count,
 		    const void *kbuf, const void __user *ubuf)
 {
-	struct swregs_state *s387 = &target->thread.fpu.state.soft;
+	struct swregs_state *s387 = &target->thread.fpu.state->soft;
 	void *space = s387->st_space;
 	int ret;
 	int offset, other, i, tags, regnr, tag, newtop;
@@ -691,7 +691,7 @@ int fpregs_soft_get(struct task_struct *target,
 		    const struct user_regset *regset,
 		    struct membuf to)
 {
-	struct swregs_state *s387 = &target->thread.fpu.state.soft;
+	struct swregs_state *s387 = &target->thread.fpu.state->soft;
 	const void *space = s387->st_space;
 	int offset = (S387->ftop & 7) * 10, other = 80 - offset;
 
diff --git a/arch/x86/math-emu/fpu_system.h b/arch/x86/math-emu/fpu_system.h
index 9b41391867dc..a6291ddfdda6 100644
--- a/arch/x86/math-emu/fpu_system.h
+++ b/arch/x86/math-emu/fpu_system.h
@@ -73,7 +73,7 @@ static inline bool seg_writable(struct desc_struct *d)
 	return (d->type & SEG_TYPE_EXECUTE_MASK) == SEG_TYPE_WRITABLE;
 }
 
-#define I387			(&current->thread.fpu.state)
+#define I387			(current->thread.fpu.state)
 #define FPU_info		(I387->soft.info)
 
 #define FPU_CS			(*(unsigned short *) &(FPU_info->regs->cs))
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 09/22] x86/fpu/xstate: Introduce helpers to manage the xstate buffer dynamically
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (7 preceding siblings ...)
  2021-02-21 18:56 ` [PATCH v4 08/22] x86/fpu/xstate: Convert the struct fpu 'state' field to a pointer Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-02-21 18:56 ` [PATCH v4 10/22] x86/fpu/xstate: Define the scope of the initial xstate data Chang S. Bae
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

The static per-task xstate buffer contains the extended register states --
but it is not expandable at runtime. Introduce runtime methods and a new
fpu struct field to support the expansion.

fpu->state_mask indicates which state components are reserved to be
saved in the xstate buffer.

alloc_xstate_buffer() uses vmalloc(). If use of this mechanism grows to
allocate buffers larger than 64KB, a more sophisticated allocation scheme
that includes purpose-built reclaim capability might be justified.

Introduce a new helper -- get_xstate_size() -- to calculate the buffer size.

Also, use the new field and helper to initialize the buffer.
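
A usage sketch follows; the caller below is hypothetical, while
alloc_xstate_buffer() and its semantics come from this patch:

	/*
	 * Hypothetical caller: grow the buffer so it can also hold the
	 * state of @xfeature_nr. Sketch only -- not part of this patch.
	 */
	static int grow_xstate_for(struct fpu *fpu, int xfeature_nr)
	{
		int ret;

		ret = alloc_xstate_buffer(fpu, BIT_ULL(xfeature_nr));
		if (ret)
			return ret;	/* -ENOMEM; the old buffer is kept */

		/* fpu->state and fpu->state_mask now cover the new component */
		return 0;
	}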

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v3:
* Updated code comments. (Borislav Petkov)
* Used vzalloc() instead of vmalloc() with memset(). (Borislav Petkov)
* Removed the max size check for >64KB. (Borislav Petkov)
* Removed the allocation size check in the helper. (Borislav Petkov)
* Switched the function description in the kernel-doc style.
* Used them for buffer initialization -- moved from the next patch.

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
* Replaced 'area' with 'buffer' in the comments and the changelog.
* Updated the code comments.

Changes from v1:
* Removed unneeded interrupt masking (Andy Lutomirski)
* Added vmalloc() error tracing (Dave Hansen, PeterZ, and Andy Lutomirski)
---
 arch/x86/include/asm/fpu/types.h  |   7 ++
 arch/x86/include/asm/fpu/xstate.h |   4 +
 arch/x86/include/asm/trace/fpu.h  |   5 ++
 arch/x86/kernel/fpu/core.c        |  14 ++--
 arch/x86/kernel/fpu/xstate.c      | 125 ++++++++++++++++++++++++++++++
 5 files changed, 148 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index dcd28a545377..6fc707c14350 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -336,6 +336,13 @@ struct fpu {
 	 */
 	unsigned long			avx512_timestamp;
 
+	/*
+	 * @state_mask:
+	 *
+	 * The bitmap represents state components reserved to be saved in ->state.
+	 */
+	u64				state_mask;
+
 	/*
 	 * @state:
 	 *
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 1fba2ca15874..cbb4795d2b45 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -112,6 +112,10 @@ extern unsigned int get_xstate_config(enum xstate_config cfg);
 void set_xstate_config(enum xstate_config cfg, unsigned int value);
 
 void *get_xsave_addr(struct fpu *fpu, int xfeature_nr);
+unsigned int get_xstate_size(u64 mask);
+int alloc_xstate_buffer(struct fpu *fpu, u64 mask);
+void free_xstate_buffer(struct fpu *fpu);
+
 const void *get_xsave_field_ptr(int xfeature_nr);
 int using_compacted_format(void);
 int xfeature_size(int xfeature_nr);
diff --git a/arch/x86/include/asm/trace/fpu.h b/arch/x86/include/asm/trace/fpu.h
index ef82f4824ce7..b691c2db47c7 100644
--- a/arch/x86/include/asm/trace/fpu.h
+++ b/arch/x86/include/asm/trace/fpu.h
@@ -89,6 +89,11 @@ DEFINE_EVENT(x86_fpu, x86_fpu_xstate_check_failed,
 	TP_ARGS(fpu)
 );
 
+DEFINE_EVENT(x86_fpu, x86_fpu_xstate_alloc_failed,
+	TP_PROTO(struct fpu *fpu),
+	TP_ARGS(fpu)
+);
+
 #undef TRACE_INCLUDE_PATH
 #define TRACE_INCLUDE_PATH asm/trace/
 #undef TRACE_INCLUDE_FILE
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 60a581aa0be8..5debb1cd3c74 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -203,9 +203,8 @@ void fpstate_init(struct fpu *fpu)
 
 	if (fpu) {
 		state = fpu->state;
-		/* The dynamic user states are not prepared yet. */
-		mask = xfeatures_mask_all & ~xfeatures_mask_user_dynamic;
-		size = get_xstate_config(XSTATE_MIN_SIZE);
+		mask = fpu->state_mask;
+		size = get_xstate_size(fpu->state_mask);
 	} else {
 		state = &init_fpstate;
 		mask = xfeatures_mask_all;
@@ -241,14 +240,15 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
 
 	WARN_ON_FPU(src_fpu != &current->thread.fpu);
 
+	/*
+	 * The child does not inherit the dynamic states. Thus, use the buffer
+	 * embedded in struct fpu, which has the minimum size.
+	 */
+	dst_fpu->state_mask = (xfeatures_mask_all & ~xfeatures_mask_user_dynamic);
 	dst_fpu->state = &dst_fpu->__default_state;
-
 	/*
 	 * Don't let 'init optimized' areas of the XSAVE area
 	 * leak into the child task:
-	 *
-	 * The child does not inherit the dynamic states. So,
-	 * the xstate buffer has the minimum size.
 	 */
 	memset(&dst_fpu->state->xsave, 0, get_xstate_config(XSTATE_MIN_SIZE));
 
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 8c067a7a0eec..86251b947403 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -10,6 +10,7 @@
 #include <linux/pkeys.h>
 #include <linux/seq_file.h>
 #include <linux/proc_fs.h>
+#include <linux/vmalloc.h>
 
 #include <asm/fpu/api.h>
 #include <asm/fpu/internal.h>
@@ -19,6 +20,7 @@
 
 #include <asm/tlbflush.h>
 #include <asm/cpufeature.h>
+#include <asm/trace/fpu.h>
 
 /*
  * Although we spell it out in here, the Processor Trace
@@ -71,6 +73,11 @@ static unsigned int xstate_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] =
 static unsigned int xstate_sizes[XFEATURE_MAX]   = { [ 0 ... XFEATURE_MAX - 1] = -1};
 static unsigned int xstate_comp_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
 static unsigned int xstate_supervisor_only_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
+/*
+ * True if the buffer of the corresponding XFEATURE is located on the next
+ * 64-byte boundary. Otherwise, it follows the preceding component immediately.
+ */
+static bool xstate_aligns[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = false};
 
 /**
  * struct fpu_xstate_buffer_config - xstate per-task buffer configuration
@@ -168,6 +175,58 @@ static bool xfeature_is_supervisor(int xfeature_nr)
 	return ecx & 1;
 }
 
+/**
+ * get_xstate_size() - calculate an xstate buffer size
+ * @mask:	This bitmap tells which components are reserved in the buffer.
+ *
+ * Available once the offset, size, and alignment arrays have been set up
+ * by setup_xstate_features().
+ *
+ * Returns:	The buffer size
+ */
+unsigned int get_xstate_size(u64 mask)
+{
+	unsigned int size;
+	u64 xmask;
+	int i, nr;
+
+	if (!mask)
+		return 0;
+
+	/*
+	 * The minimum buffer size excludes the dynamic user states. When a
+	 * task uses such a state, the buffer can grow up to the max size.
+	 */
+	if (mask == (xfeatures_mask_all & ~xfeatures_mask_user_dynamic))
+		return get_xstate_config(XSTATE_MIN_SIZE);
+	else if (mask == xfeatures_mask_all)
+		return get_xstate_config(XSTATE_MAX_SIZE);
+
+	nr = fls64(mask) - 1;
+
+	if (!using_compacted_format())
+		return xstate_offsets[nr] + xstate_sizes[nr];
+
+	xmask = BIT_ULL(nr + 1) - 1;
+
+	if (mask == (xmask & xfeatures_mask_all))
+		return xstate_comp_offsets[nr] + xstate_sizes[nr];
+
+	/*
+	 * With the given mask, no relevant size is found so far. So, calculate
+	 * it by summing up each state size.
+	 */
+	for (size = FXSAVE_SIZE + XSAVE_HDR_SIZE, i = FIRST_EXTENDED_XFEATURE; i <= nr; i++) {
+		if (!(mask & BIT_ULL(i)))
+			continue;
+
+		if (xstate_aligns[i])
+			size = ALIGN(size, 64);
+		size += xstate_sizes[i];
+	}
+	return size;
+}
+
 /*
  * When executing XSAVEOPT (or other optimized XSAVE instructions), if
  * a processor implementation detects that an FPU state component is still
@@ -308,10 +367,12 @@ static void __init setup_xstate_features(void)
 	xstate_offsets[XFEATURE_FP]	= 0;
 	xstate_sizes[XFEATURE_FP]	= offsetof(struct fxregs_state,
 						   xmm_space);
+	xstate_aligns[XFEATURE_FP]	= true;
 
 	xstate_offsets[XFEATURE_SSE]	= xstate_sizes[XFEATURE_FP];
 	xstate_sizes[XFEATURE_SSE]	= sizeof_field(struct fxregs_state,
 						       xmm_space);
+	xstate_aligns[XFEATURE_SSE]	= true;
 
 	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
 		if (!xfeature_enabled(i))
@@ -329,6 +390,7 @@ static void __init setup_xstate_features(void)
 			continue;
 
 		xstate_offsets[i] = ebx;
+		xstate_aligns[i] = (ecx & 2) ? true : false;
 
 		/*
 		 * In our xstate size checks, we assume that the highest-numbered
@@ -915,6 +977,9 @@ void __init fpu__init_system_xstate(void)
 	if (err)
 		goto out_disable;
 
+	/* Make sure init_task does not include the dynamic user states. */
+	current->thread.fpu.state_mask = (xfeatures_mask_all & ~xfeatures_mask_user_dynamic);
+
 	/*
 	 * Update info used for ptrace frames; use standard-format size and no
 	 * supervisor xstates:
@@ -1135,6 +1200,66 @@ static inline bool xfeatures_mxcsr_quirk(u64 xfeatures)
 	return true;
 }
 
+void free_xstate_buffer(struct fpu *fpu)
+{
+	/* Free up only the dynamically-allocated memory. */
+	if (fpu->state != &fpu->__default_state)
+		vfree(fpu->state);
+}
+
+/**
+ * alloc_xstate_buffer() - allocate an xstate buffer with the size calculated from @mask.
+ *
+ * @fpu:	A struct fpu * pointer
+ * @mask:	The bitmap tells which components are to be reserved in the new buffer.
+ *
+ * Simply use vmalloc() here. If a task with a vmalloc()-allocated buffer tends
+ * to terminate quickly, vfree()-induced IPIs may be a concern; caching could
+ * help. But a task with large state is likely to live longer.
+ *
+ * Also, this method does not shrink or reclaim the buffer.
+ *
+ * Returns 0 on success, -ENOMEM on allocation error.
+ */
+int alloc_xstate_buffer(struct fpu *fpu, u64 mask)
+{
+	union fpregs_state *state;
+	unsigned int oldsz, newsz;
+	u64 state_mask;
+
+	state_mask = fpu->state_mask | mask;
+
+	oldsz = get_xstate_size(fpu->state_mask);
+	newsz = get_xstate_size(state_mask);
+
+	if (oldsz >= newsz)
+		return 0;
+
+	state = vzalloc(newsz);
+	if (!state) {
+		/*
+		 * When allocation requested from #NM, the error code may not be
+		 * populated well. Then, this tracepoint is useful for providing
+		 * the failure context.
+		 */
+		trace_x86_fpu_xstate_alloc_failed(fpu);
+		return -ENOMEM;
+	}
+
+	if (using_compacted_format())
+		fpstate_init_xstate(&state->xsave, state_mask);
+
+	/*
+	 * As long as the register state is intact, save the xstate in the new
+	 * buffer at the next context switch/copy or a ptrace-driven xstate write.
+	 */
+
+	free_xstate_buffer(fpu);
+	fpu->state = state;
+	fpu->state_mask = state_mask;
+	return 0;
+}
+
 static void fill_gap(struct membuf *to, unsigned *last, unsigned offset)
 {
 	if (*last >= offset)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 10/22] x86/fpu/xstate: Define the scope of the initial xstate data
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (8 preceding siblings ...)
  2021-02-21 18:56 ` [PATCH v4 09/22] x86/fpu/xstate: Introduce helpers to manage the xstate buffer dynamically Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-02-21 18:56 ` [PATCH v4 11/22] x86/fpu/xstate: Update the xstate save function to support dynamic states Chang S. Bae
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

init_fpstate is used to record the initial xstate value and covers all the
states. But it is wasteful to cover large states that carry only trivial
initial data.

Limit init_fpstate by clarifying its size and coverage, which are all but
the dynamic user states. The dynamic states are assumed to be large but to
have all-zero initial data.

Expand copy_xregs_to_kernel_booting() to receive a mask argument of which
states to save.
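
The net boot-time effect, condensed from the diff below (the mask value is
the one set in this patch):

	u64 mask = xfeatures_mask_all & ~xfeatures_mask_user_dynamic;

	/* init_fpstate is sized and filled for the non-dynamic states only */
	if (boot_cpu_has(X86_FEATURE_XSAVES))
		fpstate_init_xstate(&init_fpstate.xsave, mask);
	copy_xregs_to_kernel_booting(&init_fpstate.xsave, mask);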

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v3:
* Removed the helper functions. (Borislav Petkov)
* Removed 'no functional change' in the changelog. (Borislav Petkov)
* Updated the code comment.
* Moved out the other initialization changes into the previous patch.

Changes from v2:
* Updated the changelog for clarification.
* Updated the code comments.
---
 arch/x86/include/asm/fpu/internal.h |  3 +--
 arch/x86/kernel/fpu/core.c          | 13 ++++++++++---
 arch/x86/kernel/fpu/xstate.c        | 11 +++++++++--
 3 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 46cb51ef4d17..e4afc1831e29 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -272,9 +272,8 @@ static inline void copy_fxregs_to_kernel(struct fpu *fpu)
  * This function is called only during boot time when x86 caps are not set
  * up and alternative can not be used yet.
  */
-static inline void copy_xregs_to_kernel_booting(struct xregs_state *xstate)
+static inline void copy_xregs_to_kernel_booting(struct xregs_state *xstate, u64 mask)
 {
-	u64 mask = xfeatures_mask_all;
 	u32 lmask = mask;
 	u32 hmask = mask >> 32;
 	int err;
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 5debb1cd3c74..dc20eabb072d 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -21,7 +21,10 @@
 
 /*
  * Represents the initial FPU state. It's mostly (but not completely) zeroes,
- * depending on the FPU hardware format:
+ * depending on the FPU hardware format.
+ *
+ * The dynamic user states are excluded as they are large but their initial
+ * values are all zeros.
  */
 union fpregs_state init_fpstate __read_mostly;
 
@@ -206,9 +209,13 @@ void fpstate_init(struct fpu *fpu)
 		mask = fpu->state_mask;
 		size = get_xstate_size(fpu->state_mask);
 	} else {
+		/*
+		 * init_fpstate excludes the dynamic user states as they are
+		 * large but their initial values are all zeros.
+		 */
 		state = &init_fpstate;
-		mask = xfeatures_mask_all;
-		size = get_xstate_config(XSTATE_MAX_SIZE);
+		mask = (xfeatures_mask_all & ~xfeatures_mask_user_dynamic);
+		size = get_xstate_config(XSTATE_MIN_SIZE);
 	}
 
 	if (!static_cpu_has(X86_FEATURE_FPU)) {
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 86251b947403..daf76108aa5f 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -552,6 +552,7 @@ static void __init print_xstate_offset_size(void)
 static void __init setup_init_fpu_buf(void)
 {
 	static int on_boot_cpu __initdata = 1;
+	u64 mask;
 
 	WARN_ON_FPU(!on_boot_cpu);
 	on_boot_cpu = 0;
@@ -562,8 +563,14 @@ static void __init setup_init_fpu_buf(void)
 	setup_xstate_features();
 	print_xstate_features();
 
+	/*
+	 * Exclude the dynamic user states as they are large but their
+	 * initial values are all zeros.
+	 */
+	mask = xfeatures_mask_all & ~xfeatures_mask_user_dynamic;
+
 	if (boot_cpu_has(X86_FEATURE_XSAVES))
-		fpstate_init_xstate(&init_fpstate.xsave, xfeatures_mask_all);
+		fpstate_init_xstate(&init_fpstate.xsave, mask);
 
 	/*
 	 * Init all the features state with header.xfeatures being 0x0
@@ -574,7 +581,7 @@ static void __init setup_init_fpu_buf(void)
 	 * Dump the init state again. This is to identify the init state
 	 * of any feature which is not represented by all zero's.
 	 */
-	copy_xregs_to_kernel_booting(&init_fpstate.xsave);
+	copy_xregs_to_kernel_booting(&init_fpstate.xsave, mask);
 }
 
 static int xfeature_uncompacted_offset(int xfeature_nr)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 11/22] x86/fpu/xstate: Update the xstate save function to support dynamic states
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (9 preceding siblings ...)
  2021-02-21 18:56 ` [PATCH v4 10/22] x86/fpu/xstate: Define the scope of the initial xstate data Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-02-21 18:56 ` [PATCH v4 12/22] x86/fpu/xstate: Update the xstate buffer address finder " Chang S. Bae
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae, kvm

Extend copy_xregs_to_kernel() to receive a mask argument of which states to
save, in preparation for dynamic user state handling.

Update KVM to set a valid fpu->state_mask, so it can continue to share the
state-saving code with the core kernel.
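
For reference, a sketch of a call site as it looks after this patch:

	/*
	 * Save only the components reserved in this task's buffer; the
	 * helper splits the mask into the EDX:EAX pair that XSAVE consumes.
	 */
	copy_xregs_to_kernel(&fpu->state->xsave, fpu->state_mask);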

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
---
Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Made the code change more reviewable.

Changes from v2:
* Updated the changelog to clarify the KVM code changes.
---
 arch/x86/include/asm/fpu/internal.h | 3 +--
 arch/x86/kernel/fpu/core.c          | 2 +-
 arch/x86/kvm/x86.c                  | 9 +++++++--
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index e4afc1831e29..f964f3efc92e 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -317,9 +317,8 @@ static inline void copy_kernel_to_xregs_booting(struct xregs_state *xstate)
 /*
  * Save processor xstate to xsave area.
  */
-static inline void copy_xregs_to_kernel(struct xregs_state *xstate)
+static inline void copy_xregs_to_kernel(struct xregs_state *xstate, u64 mask)
 {
-	u64 mask = xfeatures_mask_all;
 	u32 lmask = mask;
 	u32 hmask = mask >> 32;
 	int err;
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index dc20eabb072d..ad1ac80f98ef 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -98,7 +98,7 @@ EXPORT_SYMBOL(irq_fpu_usable);
 int copy_fpregs_to_fpstate(struct fpu *fpu)
 {
 	if (likely(use_xsave())) {
-		copy_xregs_to_kernel(&fpu->state->xsave);
+		copy_xregs_to_kernel(&fpu->state->xsave, fpu->state_mask);
 
 		/*
 		 * AVX512 state is tracked here because its use is
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c10122547ecd..ca2c0574acf2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9275,11 +9275,16 @@ static void kvm_save_current_fpu(struct fpu *fpu)
 	 * KVM does not support dynamic user states yet. Assume the buffer
 	 * always has the minimum size.
 	 */
-	if (test_thread_flag(TIF_NEED_FPU_LOAD))
+	if (test_thread_flag(TIF_NEED_FPU_LOAD)) {
 		memcpy(fpu->state, current->thread.fpu.state,
 		       get_xstate_config(XSTATE_MIN_SIZE));
-	else
+	} else {
+		struct fpu *src_fpu = &current->thread.fpu;
+
+		if (fpu->state_mask != src_fpu->state_mask)
+			fpu->state_mask = src_fpu->state_mask;
 		copy_fpregs_to_fpstate(fpu);
+	}
 }
 
 /* Swap (qemu) user FPU context for the guest FPU context. */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 12/22] x86/fpu/xstate: Update the xstate buffer address finder to support dynamic states
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (10 preceding siblings ...)
  2021-02-21 18:56 ` [PATCH v4 11/22] x86/fpu/xstate: Update the xstate save function to support dynamic states Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-02-21 18:56 ` [PATCH v4 13/22] x86/fpu/xstate: Update the xstate context copy function " Chang S. Bae
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

__raw_xsave_addr() returns a pointer to the requested component in an
xstate buffer, by simply looking up the offset table. The offset used to be
fixed, but with dynamic user states it becomes variable.

get_xstate_size() already has a routine to find an offset at runtime.
Refactor it into a helper and use it for the address finder as well.
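
For illustration, a hypothetical compacted-format layout (a sketch assuming
the 8KB first-generation XTILEDATA and the standard component sizes; the
numbers are not taken from this patch):

	/*
	 * mask = FP | SSE | YMM | XTILECFG | XTILEDATA
	 *
	 * offset(YMM)       = FXSAVE_SIZE + XSAVE_HDR_SIZE = 512 + 64 = 576
	 * offset(XTILECFG)  = 576 + 256 (YMM state size)   = 832
	 * offset(XTILEDATA) = ALIGN(832 + 64, 64)          = 896
	 *
	 * Components absent from the mask consume no space, so the offsets
	 * must be recomputed against each task's own mask.
	 */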

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v3:
* Added the function description in the kernel-doc style. (Borislav Petkov)
* Removed 'no functional change' in the changelog. (Borislav Petkov)
---
 arch/x86/kernel/fpu/xstate.c | 80 ++++++++++++++++++++++--------------
 1 file changed, 50 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index daf76108aa5f..84b55f51bdb7 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -175,6 +175,40 @@ static bool xfeature_is_supervisor(int xfeature_nr)
 	return ecx & 1;
 }
 
+/**
+ * get_xstate_comp_offset() - Find the feature's offset in the compacted format
+ * @mask:	This bitmap tells which components are reserved in the format.
+ * @feature_nr:	The feature number
+ *
+ * Returns:	The offset value
+ */
+static unsigned int get_xstate_comp_offset(u64 mask, int feature_nr)
+{
+	u64 xmask = BIT_ULL(feature_nr + 1) - 1;
+	unsigned int next_offset, offset = 0;
+	int i;
+
+	if ((mask & xmask) == (xfeatures_mask_all & xmask))
+		return xstate_comp_offsets[feature_nr];
+
+	/*
+	 * The given mask does not match any precomputed offset. Calculate the
+	 * offset by summing up the preceding state sizes.
+	 */
+
+	next_offset = FXSAVE_SIZE + XSAVE_HDR_SIZE;
+
+	for (i = FIRST_EXTENDED_XFEATURE; i <= feature_nr; i++) {
+		if (!(mask & BIT_ULL(i)))
+			continue;
+
+		offset = xstate_aligns[i] ? ALIGN(next_offset, 64) : next_offset;
+		next_offset = offset + xstate_sizes[i];
+	}
+
+	return offset;
+}
+
 /**
  * get_xstate_size() - calculate an xstate buffer size
  * @mask:	This bitmap tells which components are reserved in the buffer.
@@ -186,9 +220,8 @@ static bool xfeature_is_supervisor(int xfeature_nr)
  */
 unsigned int get_xstate_size(u64 mask)
 {
-	unsigned int size;
-	u64 xmask;
-	int i, nr;
+	unsigned int offset;
+	int nr;
 
 	if (!mask)
 		return 0;
@@ -207,24 +240,8 @@ unsigned int get_xstate_size(u64 mask)
 	if (!using_compacted_format())
 		return xstate_offsets[nr] + xstate_sizes[nr];
 
-	xmask = BIT_ULL(nr + 1) - 1;
-
-	if (mask == (xmask & xfeatures_mask_all))
-		return xstate_comp_offsets[nr] + xstate_sizes[nr];
-
-	/*
-	 * With the given mask, no relevant size is found so far. So, calculate
-	 * it by summing up each state size.
-	 */
-	for (size = FXSAVE_SIZE + XSAVE_HDR_SIZE, i = FIRST_EXTENDED_XFEATURE; i <= nr; i++) {
-		if (!(mask & BIT_ULL(i)))
-			continue;
-
-		if (xstate_aligns[i])
-			size = ALIGN(size, 64);
-		size += xstate_sizes[i];
-	}
-	return size;
+	offset = get_xstate_comp_offset(mask, nr);
+	return offset + xstate_sizes[nr];
 }
 
 /*
@@ -1042,17 +1059,20 @@ static void *__raw_xsave_addr(struct fpu *fpu, int xfeature_nr)
 {
 	void *xsave;
 
-	if (!xfeature_enabled(xfeature_nr)) {
-		WARN_ON_FPU(1);
-		return NULL;
-	}
-
-	if (fpu)
-		xsave = &fpu->state->xsave;
-	else
+	if (!xfeature_enabled(xfeature_nr))
+		goto not_found;
+	else if (!fpu)
 		xsave = &init_fpstate.xsave;
+	else if (!(fpu->state_mask & BIT_ULL(xfeature_nr)))
+		goto not_found;
+	else
+		xsave = &fpu->state->xsave;
+
+	return xsave + get_xstate_comp_offset(fpu ? fpu->state_mask : xfeatures_mask_all, xfeature_nr);
 
-	return xsave + xstate_comp_offsets[xfeature_nr];
+not_found:
+	WARN_ON_FPU(1);
+	return NULL;
 }
 /*
  * Given the xsave area and a state inside, this function returns the
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 13/22] x86/fpu/xstate: Update the xstate context copy function to support dynamic states
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (11 preceding siblings ...)
  2021-02-21 18:56 ` [PATCH v4 12/22] x86/fpu/xstate: Update the xstate buffer address finder " Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-02-21 18:56 ` [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state Chang S. Bae
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

ptrace() and signal return paths use xstate context copy functions. They
allow callers to read (or write) xstate values in the target's buffer. With
dynamic user states, a component's position in the buffer may vary and the
initial value is not always stored in init_fpstate.

Change the helpers to find a component's offset accordingly.

When copying an initial value, explicitly check init_fpstate's coverage. If
the component is not covered there, zero the destination memory. Otherwise,
copy the value from init_fpstate.
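
A condensed sketch of the resulting copy policy (illustrative; the actual
logic is in the hunks below):

	if (xfeatures_mask_user_dynamic & BIT_ULL(feature_nr))
		memset(dst, 0, size);	/* dynamic state: init value is all zeros */
	else
		memcpy(dst, (void *)&init_fpstate.xsave + offset, size);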

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v3:
* Cleaned up the code change with more comments.
* Removed 'no functional change' in the changelog. (Borislav Petkov)

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
---
 arch/x86/kernel/fpu/xstate.c | 69 ++++++++++++++++++++++++++++--------
 1 file changed, 55 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 84b55f51bdb7..c57877df797d 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -301,7 +301,7 @@ void fpstate_sanitize_xstate(struct fpu *fpu)
 	 * in a special way already:
 	 */
 	feature_bit = 0x2;
-	xfeatures = (xfeatures_mask_user() & ~xfeatures) >> 2;
+	xfeatures = (xfeatures_mask_user() & fpu->state_mask & ~xfeatures) >> feature_bit;
 
 	/*
 	 * Update all the remaining memory layouts according to their
@@ -310,12 +310,19 @@ void fpstate_sanitize_xstate(struct fpu *fpu)
 	 */
 	while (xfeatures) {
 		if (xfeatures & 0x1) {
-			int offset = xstate_comp_offsets[feature_bit];
+			int offset = get_xstate_comp_offset(fpu->state_mask, feature_bit);
 			int size = xstate_sizes[feature_bit];
 
-			memcpy((void *)fx + offset,
-			       (void *)&init_fpstate.xsave + offset,
-			       size);
+			/*
+			 * init_fpstate does not include the dynamic user states,
+			 * as their initial values are all zeros.
+			 */
+			if (xfeatures_mask_user_dynamic & BIT_ULL(feature_bit))
+				memset((void *)fx + offset, 0, size);
+			else
+				memcpy((void *)fx + offset,
+				       (void *)&init_fpstate.xsave + offset,
+				       size);
 		}
 
 		xfeatures >>= 1;
@@ -1291,15 +1298,31 @@ static void fill_gap(struct membuf *to, unsigned *last, unsigned offset)
 {
 	if (*last >= offset)
 		return;
-	membuf_write(to, (void *)&init_fpstate.xsave + *last, offset - *last);
+
+	/*
+	 * Copy initial data.
+	 *
+	 * The init_fpstate buffer has only the minimum size and excludes the
+	 * dynamic user states, but their initial values are all zeros.
+	 */
+	if (offset <= get_xstate_config(XSTATE_MIN_SIZE))
+		membuf_write(to, (void *)&init_fpstate.xsave + *last, offset - *last);
+	else
+		membuf_zero(to, offset - *last);
 	*last = offset;
 }
 
+/*
+ * @from: If NULL, copy zeros.
+ */
 static void copy_part(struct membuf *to, unsigned *last, unsigned offset,
 		      unsigned size, void *from)
 {
 	fill_gap(to, last, offset);
-	membuf_write(to, from, size);
+	if (from)
+		membuf_write(to, from, size);
+	else
+		membuf_zero(to, size);
 	*last = offset + size;
 }
 
@@ -1351,15 +1374,27 @@ void copy_xstate_to_kernel(struct membuf to, struct fpu *fpu)
 		  sizeof(header), &header);
 
 	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
+		u64 mask = BIT_ULL(i);
+		void *src;
+
+		if (!(xfeatures_mask_user() & mask))
+			continue;
+
 		/*
-		 * Copy only in-use xstates:
+		 * Copy states if used. Otherwise, copy the initial data.
 		 */
-		if ((header.xfeatures >> i) & 1) {
-			void *src = __raw_xsave_addr(fpu, i);
 
-			copy_part(&to, &last, xstate_offsets[i],
-				  xstate_sizes[i], src);
-		}
+		if (header.xfeatures & mask)
+			src = __raw_xsave_addr(fpu, i);
+		else
+			/*
+			 * The init_fpstate buffer does not include the dynamic
+			 * user state data, as its initial values are all zeros.
+			 */
+			src = (xfeatures_mask_user_dynamic & mask) ?
+			      NULL : (void *)&init_fpstate.xsave + last;
+
+		copy_part(&to, &last, xstate_offsets[i], xstate_sizes[i], src);
 
 	}
 	fill_gap(&to, &last, size);
@@ -1392,6 +1427,9 @@ int copy_kernel_to_xstate(struct fpu *fpu, const void *kbuf)
 		if (hdr.xfeatures & mask) {
 			void *dst = __raw_xsave_addr(fpu, i);
 
+			if (!dst)
+				continue;
+
 			offset = xstate_offsets[i];
 			size = xstate_sizes[i];
 
@@ -1449,6 +1487,9 @@ int copy_user_to_xstate(struct fpu *fpu, const void __user *ubuf)
 		if (hdr.xfeatures & mask) {
 			void *dst = __raw_xsave_addr(fpu, i);
 
+			if (!dst)
+				continue;
+
 			offset = xstate_offsets[i];
 			size = xstate_sizes[i];
 
@@ -1529,7 +1570,7 @@ void copy_supervisor_to_kernel(struct fpu *fpu)
 			continue;
 
 		/* Move xfeature 'i' into its normal location */
-		memmove(xbuf + xstate_comp_offsets[i],
+		memmove(xbuf + get_xstate_comp_offset(fpu->state_mask, i),
 			xbuf + xstate_supervisor_only_offsets[i],
 			xstate_sizes[i]);
 	}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (12 preceding siblings ...)
  2021-02-21 18:56 ` [PATCH v4 13/22] x86/fpu/xstate: Update the xstate context copy function " Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-03-20 22:13   ` Thomas Gleixner
  2021-03-26 16:34   ` Jann Horn
  2021-02-21 18:56 ` [PATCH v4 15/22] x86/fpu/xstate: Support ptracer-induced xstate buffer expansion Chang S. Bae
                   ` (7 subsequent siblings)
  21 siblings, 2 replies; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

Intel's Extended Feature Disable (XFD) feature is an extension of the XSAVE
architecture. XFD allows the kernel to enable a feature state in XCR0 and
to receive a #NM trap when a task uses instructions accessing that state.
In this way, Linux can defer allocating the large XSAVE buffer until tasks
need it.

XFD introduces two MSRs: IA32_XFD to enable/disable the feature and
IA32_XFD_ERR to assist the #NM trap handler. Both use the same
state-component bitmap format as XCR0.

Use this hardware capability to find the right time to expand the xstate
buffer. Introduce two sets of helper functions for that:

1. The first set is primarily for interacting with the XFD hardware:
	xdisable_setbits()
	xdisable_getbits()
	xdisable_switch()

2. The second set is for managing the first-use status and handling the #NM
   trap:
	xfirstuse_enabled()
	xfirstuse_not_detected()

The #NM handler induces the xstate buffer expansion to save the first-used
states.

The XFD feature is enabled only for the compacted format. If the kernel
uses the standard format, the buffer always has to be large enough for all
the states.
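
A condensed sketch of the intended first-use flow, using the helpers named
above (illustrative; error handling and checks omitted):

	/* Boot: arm XFD so every dynamic user state traps on first use. */
	xdisable_setbits(xfirstuse_enabled());

	/* #NM handler: identify the touched state and expand the buffer. */
	rdmsrl_safe(MSR_IA32_XFD_ERR, &event_mask);
	if (!alloc_xstate_buffer(fpu, event_mask)) {
		/* Disarm XFD for the states this task now owns. */
		xdisable_setbits(xfirstuse_not_detected(fpu));
		wrmsrl_safe(MSR_IA32_XFD_ERR, 0);
	}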

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v3:
* Removed 'no functional change' in the changelog. (Borislav Petkov)

Changes from v2:
* Changed to enable XFD only when the compacted format is used.
* Updated the changelog with task->fpu removed. (Borislav Petkov)

Changes from v1:
* Inlined the XFD-induced #NM handling code (Andy Lutomirski)
---
 arch/x86/include/asm/cpufeatures.h  |  1 +
 arch/x86/include/asm/fpu/internal.h | 51 ++++++++++++++++++++++++++++-
 arch/x86/include/asm/msr-index.h    |  2 ++
 arch/x86/kernel/cpu/cpuid-deps.c    |  1 +
 arch/x86/kernel/fpu/xstate.c        | 37 +++++++++++++++++++--
 arch/x86/kernel/process.c           |  5 +++
 arch/x86/kernel/process_32.c        |  2 +-
 arch/x86/kernel/process_64.c        |  2 +-
 arch/x86/kernel/traps.c             | 40 ++++++++++++++++++++++
 9 files changed, 135 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 84b887825f12..3170ab367cf2 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -277,6 +277,7 @@
 #define X86_FEATURE_XSAVEC		(10*32+ 1) /* XSAVEC instruction */
 #define X86_FEATURE_XGETBV1		(10*32+ 2) /* XGETBV with ECX = 1 instruction */
 #define X86_FEATURE_XSAVES		(10*32+ 3) /* XSAVES/XRSTORS instructions */
+#define X86_FEATURE_XFD			(10*32+ 4) /* eXtended Feature Disabling */
 
 /*
  * Extended auxiliary flags: Linux defined - for features scattered in various
diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index f964f3efc92e..c467312d38d8 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -557,11 +557,58 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
  * Misc helper functions:
  */
 
+/* The first-use detection helpers: */
+
+static inline void xdisable_setbits(u64 value)
+{
+	wrmsrl_safe(MSR_IA32_XFD, value);
+}
+
+static inline u64 xdisable_getbits(void)
+{
+	u64 value;
+
+	rdmsrl_safe(MSR_IA32_XFD, &value);
+	return value;
+}
+
+static inline u64 xfirstuse_enabled(void)
+{
+	/* All the dynamic user components are first-use enabled. */
+	return xfeatures_mask_user_dynamic;
+}
+
+/*
+ * Convert fpu->state_mask to the xdisable configuration to be written to
+ * MSR IA32_XFD. The result is intended to be written via xdisable_setbits().
+ */
+static inline u64 xfirstuse_not_detected(struct fpu *fpu)
+{
+	u64 firstuse_bv = (fpu->state_mask & xfirstuse_enabled());
+
+	/*
+	 * If first-use is not yet detected, set the bit. If the detection is
+	 * not enabled, the bit is always zero in firstuse_bv. So, make the
+	 * following conversion:
+	 */
+	return (xfirstuse_enabled() ^ firstuse_bv);
+}
+
+/* Update MSR IA32_XFD with xfirstuse_not_detected() if needed. */
+static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
+{
+	if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
+		return;
+
+	if (unlikely(prev->state_mask != next->state_mask))
+		xdisable_setbits(xfirstuse_not_detected(next));
+}
+
 /*
  * Load PKRU from the FPU context if available. Delay loading of the
  * complete FPU state until the return to userland.
  */
-static inline void switch_fpu_finish(struct fpu *new_fpu)
+static inline void switch_fpu_finish(struct fpu *old_fpu, struct fpu *new_fpu)
 {
 	u32 pkru_val = init_pkru_value;
 	struct pkru_state *pk;
@@ -571,6 +618,8 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
 
 	set_thread_flag(TIF_NEED_FPU_LOAD);
 
+	xdisable_switch(old_fpu, new_fpu);
+
 	if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
 		return;
 
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 546d6ecf0a35..eb65e836c2d1 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -622,6 +622,8 @@
 #define MSR_IA32_BNDCFGS_RSVD		0x00000ffc
 
 #define MSR_IA32_XSS			0x00000da0
+#define MSR_IA32_XFD			0x000001c4
+#define MSR_IA32_XFD_ERR		0x000001c5
 
 #define MSR_IA32_APICBASE		0x0000001b
 #define MSR_IA32_APICBASE_BSP		(1<<8)
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index 42af31b64c2c..4423046c2d74 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -72,6 +72,7 @@ static const struct cpuid_dep cpuid_deps[] = {
 	{ X86_FEATURE_AVX512_FP16,		X86_FEATURE_AVX512BW  },
 	{ X86_FEATURE_ENQCMD,			X86_FEATURE_XSAVES    },
 	{ X86_FEATURE_PER_THREAD_MBA,		X86_FEATURE_MBA       },
+	{ X86_FEATURE_XFD,			X86_FEATURE_XSAVES    },
 	{}
 };
 
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c57877df797d..b69913ae30ed 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -175,6 +175,21 @@ static bool xfeature_is_supervisor(int xfeature_nr)
 	return ecx & 1;
 }
 
+static bool xfeature_disable_supported(int xfeature_nr)
+{
+	u32 eax, ebx, ecx, edx;
+
+	if (!boot_cpu_has(X86_FEATURE_XFD))
+		return false;
+
+	/*
+	 * If state component 'i' supports xfeature disable (first-use
+	 * detection), ECX[2] returns 1; otherwise, 0.
+	 */
+	cpuid_count(XSTATE_CPUID, xfeature_nr, &eax, &ebx, &ecx, &edx);
+	return ecx & 4;
+}
+
 /**
  * get_xstate_comp_offset() - Find the feature's offset in the compacted format
  * @mask:	This bitmap tells which components are reserved in the format.
@@ -366,6 +381,9 @@ void fpu__init_cpu_xstate(void)
 		wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() |
 				     xfeatures_mask_supervisor_dynamic());
 	}
+
+	if (boot_cpu_has(X86_FEATURE_XFD))
+		xdisable_setbits(xfirstuse_enabled());
 }
 
 static bool xfeature_enabled(enum xfeature xfeature)
@@ -565,8 +583,9 @@ static void __init print_xstate_offset_size(void)
 	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
 		if (!xfeature_enabled(i))
 			continue;
-		pr_info("x86/fpu: xstate_offset[%d]: %4d, xstate_sizes[%d]: %4d\n",
-			 i, xstate_comp_offsets[i], i, xstate_sizes[i]);
+		pr_info("x86/fpu: xstate_offset[%d]: %4d, xstate_sizes[%d]: %4d (%s)\n",
+			i, xstate_comp_offsets[i], i, xstate_sizes[i],
+			(xfeatures_mask_user_dynamic & BIT_ULL(i)) ? "on-demand" : "default");
 	}
 }
 
@@ -999,9 +1018,18 @@ void __init fpu__init_system_xstate(void)
 	}
 
 	xfeatures_mask_all &= fpu__get_supported_xfeatures_mask();
-	/* Do not support the dynamically allocated buffer yet. */
 	xfeatures_mask_user_dynamic = 0;
 
+	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
+		u64 feature_mask = BIT_ULL(i);
+
+		if (!(xfeatures_mask_user() & feature_mask))
+			continue;
+
+		if (xfeature_disable_supported(i))
+			xfeatures_mask_user_dynamic |= feature_mask;
+	}
+
 	/* Enable xstate instructions to be able to continue with initialization: */
 	fpu__init_cpu_xstate();
 	err = init_xstate_size();
@@ -1053,6 +1081,9 @@ void fpu__resume_cpu(void)
 		wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor()  |
 				     xfeatures_mask_supervisor_dynamic());
 	}
+
+	if (boot_cpu_has(X86_FEATURE_XFD))
+		xdisable_setbits(xfirstuse_not_detected(&current->thread.fpu));
 }
 
 /*
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index c41116543a82..697d554a242d 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -103,6 +103,11 @@ void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
 	*size = get_xstate_config(XSTATE_MIN_SIZE);
 }
 
+void arch_release_task_struct(struct task_struct *tsk)
+{
+	free_xstate_buffer(&tsk->thread.fpu);
+}
+
 /*
  * Free thread data structures etc..
  */
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 4f2f54e1281c..7bd5d08eeb41 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -213,7 +213,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 
 	this_cpu_write(current_task, next_p);
 
-	switch_fpu_finish(next_fpu);
+	switch_fpu_finish(prev_fpu, next_fpu);
 
 	/* Load the Intel cache allocation PQR MSR. */
 	resctrl_sched_in();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index ad582f9ac5a6..6fb44c4ceeee 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -594,7 +594,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	this_cpu_write(current_task, next_p);
 	this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
 
-	switch_fpu_finish(next_fpu);
+	switch_fpu_finish(prev_fpu, next_fpu);
 
 	/* Reload sp0. */
 	update_task_stack(next_p);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 7f5aec758f0e..821a7f408ad4 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1106,10 +1106,50 @@ DEFINE_IDTENTRY(exc_spurious_interrupt_bug)
 	 */
 }
 
+static __always_inline bool handle_xfirstuse_event(struct fpu *fpu)
+{
+	bool handled = false;
+	u64 event_mask;
+
+	/* Check whether the first-use detection is running. */
+	if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
+		return handled;
+
+	rdmsrl_safe(MSR_IA32_XFD_ERR, &event_mask);
+
+	/* If IA32_XFD_ERR is empty, the current trap has nothing to do with XFD. */
+	if (!event_mask)
+		return handled;
+
+	/* The trap event should be for one of the first-use enabled features. */
+	WARN_ON(!(event_mask & xfirstuse_enabled()));
+
+	/*
+	 * The first-use event is presumed to be from userspace, so it should have
+	 * nothing to do with interrupt context.
+	 */
+	if (WARN_ON(in_interrupt()))
+		return handled;
+
+	if (alloc_xstate_buffer(fpu, event_mask))
+		return handled;
+
+	xdisable_setbits(xfirstuse_not_detected(fpu));
+
+	/* Clear the trap record. */
+	wrmsrl_safe(MSR_IA32_XFD_ERR, 0);
+	handled = true;
+
+	return handled;
+}
+
 DEFINE_IDTENTRY(exc_device_not_available)
 {
 	unsigned long cr0 = read_cr0();
 
+	if (handle_xfirstuse_event(&current->thread.fpu))
+		return;
+
 #ifdef CONFIG_MATH_EMULATION
 	if (!boot_cpu_has(X86_FEATURE_FPU) && (cr0 & X86_CR0_EM)) {
 		struct math_emu_info info = { };
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 15/22] x86/fpu/xstate: Support ptracer-induced xstate buffer expansion
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (13 preceding siblings ...)
  2021-02-21 18:56 ` [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-02-21 18:56 ` [PATCH v4 16/22] x86/fpu/xstate: Extend the table to map state components with features Chang S. Bae
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

ptrace() may update xstate data before the target task has taken an XFD
fault and expanded the xstate buffer. Detect this case and allocate a
sufficient buffer to support the request. Also, disable the (now
unnecessary) associated first-use fault.
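
For example, a write like the following (a hypothetical ptracer snippet,
not part of this patch) can arrive before the target has ever touched AMX,
so the kernel must expand the target's buffer on its behalf:

	struct iovec iov = {
		.iov_base = xsave_buf,	/* caller-built XSAVE image with XTILEDATA */
		.iov_len  = xsave_size,	/* large enough to cover XTILEDATA */
	};

	if (ptrace(PTRACE_SETREGSET, pid, (unsigned int)NT_X86_XSTATE, &iov))
		err(1, "PTRACE_SETREGSET");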

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v3:
* Removed 'no functional changes' in the changelog. (Borislav Petkov)

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
* Updated the code comments.
---
 arch/x86/kernel/fpu/regset.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index ee27df4caed6..ec6cbb75010e 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -122,6 +122,35 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
 
 	xsave = &fpu->state->xsave;
 
+	/*
+	 * If a ptracer attempts to write states that the target's buffer is not
+	 * yet allocated to hold, dynamically expand the buffer.
+	 */
+	if (count > get_xstate_size(fpu->state_mask)) {
+		unsigned int offset, size;
+		struct xstate_header hdr;
+		u64 mask;
+
+		offset = offsetof(struct xregs_state, header);
+		size = sizeof(hdr);
+
+		/* Retrieve XSTATE_BV */
+		if (kbuf) {
+			memcpy(&hdr, kbuf + offset, size);
+		} else {
+			ret = __copy_from_user(&hdr, ubuf + offset, size);
+			if (ret)
+				return ret;
+		}
+
+		mask = hdr.xfeatures & xfeatures_mask_user_dynamic;
+		if (mask) {
+			ret = alloc_xstate_buffer(fpu, mask);
+			if (ret)
+				return ret;
+		}
+	}
+
 	fpu__prepare_write(fpu);
 
 	if (using_compacted_format()) {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 16/22] x86/fpu/xstate: Extend the table to map state components with features
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (14 preceding siblings ...)
  2021-02-21 18:56 ` [PATCH v4 15/22] x86/fpu/xstate: Support ptracer-induced xstate buffer expansion Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-03-20 21:25   ` Thomas Gleixner
  2021-02-21 18:56 ` [PATCH v4 17/22] x86/cpufeatures/amx: Enumerate Advanced Matrix Extension (AMX) feature bits Chang S. Bae
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

At compile-time xfeatures_mask_all includes all possible XCR0 features. At
run-time fpu__init_system_xstate() clears features in xfeatures_mask_all
that are not enabled in CPUID. It does this by looping through all possible
XCR0 features.

Update the code to handle the possibility that there will be gaps in the
XCR0 feature bit numbers.

No functional change.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v1:
* Rebased on the upstream kernel (5.10)
---
 arch/x86/kernel/fpu/xstate.c | 41 ++++++++++++++++++++++--------------
 1 file changed, 25 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index b69913ae30ed..4421ef424670 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -43,18 +43,23 @@ static const char *xfeature_names[] =
 	"unknown xstate feature"	,
 };
 
-static short xsave_cpuid_features[] __initdata = {
-	X86_FEATURE_FPU,
-	X86_FEATURE_XMM,
-	X86_FEATURE_AVX,
-	X86_FEATURE_MPX,
-	X86_FEATURE_MPX,
-	X86_FEATURE_AVX512F,
-	X86_FEATURE_AVX512F,
-	X86_FEATURE_AVX512F,
-	X86_FEATURE_INTEL_PT,
-	X86_FEATURE_PKU,
-	X86_FEATURE_ENQCMD,
+struct xfeature_capflag_info {
+	int xfeature_idx;
+	short cpu_cap;
+};
+
+static struct xfeature_capflag_info xfeature_capflags[] __initdata = {
+	{ XFEATURE_FP,				X86_FEATURE_FPU },
+	{ XFEATURE_SSE,				X86_FEATURE_XMM },
+	{ XFEATURE_YMM,				X86_FEATURE_AVX },
+	{ XFEATURE_BNDREGS,			X86_FEATURE_MPX },
+	{ XFEATURE_BNDCSR,			X86_FEATURE_MPX },
+	{ XFEATURE_OPMASK,			X86_FEATURE_AVX512F },
+	{ XFEATURE_ZMM_Hi256,			X86_FEATURE_AVX512F },
+	{ XFEATURE_Hi16_ZMM,			X86_FEATURE_AVX512F },
+	{ XFEATURE_PT_UNIMPLEMENTED_SO_FAR,	X86_FEATURE_INTEL_PT },
+	{ XFEATURE_PKRU,			X86_FEATURE_PKU },
+	{ XFEATURE_PASID,			X86_FEATURE_ENQCMD },
 };
 
 /*
@@ -1010,11 +1015,15 @@ void __init fpu__init_system_xstate(void)
 	}
 
 	/*
-	 * Clear XSAVE features that are disabled in the normal CPUID.
+	 * Cross-check XSAVE feature with CPU capability flag. Clear the
+	 * mask bit for disabled features.
 	 */
-	for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
-		if (!boot_cpu_has(xsave_cpuid_features[i]))
-			xfeatures_mask_all &= ~BIT_ULL(i);
+	for (i = 0; i < ARRAY_SIZE(xfeature_capflags); i++) {
+		short cpu_cap = xfeature_capflags[i].cpu_cap;
+		int idx = xfeature_capflags[i].xfeature_idx;
+
+		if (!boot_cpu_has(cpu_cap))
+			xfeatures_mask_all &= ~BIT_ULL(idx);
 	}
 
 	xfeatures_mask_all &= fpu__get_supported_xfeatures_mask();
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 17/22] x86/cpufeatures/amx: Enumerate Advanced Matrix Extension (AMX) feature bits
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (15 preceding siblings ...)
  2021-02-21 18:56 ` [PATCH v4 16/22] x86/fpu/xstate: Extend the table to map state components with features Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-02-21 18:56 ` [PATCH v4 18/22] x86/fpu/amx: Define AMX state components and have it used for boot-time checks Chang S. Bae
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

Intel's Advanced Matrix Extension (AMX) is a new 64-bit extended feature
consisting of two-dimensional registers and an accelerator unit. The first
implementation of the latter is the tile matrix multiply unit (TMUL). TMUL
performs SIMD dot-products on four bytes (INT8) or two bfloat16
floating-point (BF16) elements.

Here we add AMX to the kernel/user ABI, by enumerating the capability.
E.g., /proc/cpuinfo: amx_tile, amx_bf16, amx_int8

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/cpufeatures.h | 3 +++
 arch/x86/kernel/cpu/cpuid-deps.c   | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 3170ab367cf2..f9990841238a 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -379,6 +379,9 @@
 #define X86_FEATURE_PCONFIG		(18*32+18) /* Intel PCONFIG */
 #define X86_FEATURE_ARCH_LBR		(18*32+19) /* Intel ARCH LBR */
+#define X86_FEATURE_AMX_BF16		(18*32+22) /* AMX BF16 Support */
 #define X86_FEATURE_AVX512_FP16		(18*32+23) /* AVX512 FP16 */
+#define X86_FEATURE_AMX_TILE		(18*32+24) /* AMX tile Support */
+#define X86_FEATURE_AMX_INT8		(18*32+25) /* AMX INT8 Support */
 #define X86_FEATURE_SPEC_CTRL		(18*32+26) /* "" Speculation Control (IBRS + IBPB) */
 #define X86_FEATURE_INTEL_STIBP		(18*32+27) /* "" Single Thread Indirect Branch Predictors */
 #define X86_FEATURE_FLUSH_L1D		(18*32+28) /* Flush L1D cache */
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index 4423046c2d74..154c18e493c5 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -73,6 +73,9 @@ static const struct cpuid_dep cpuid_deps[] = {
 	{ X86_FEATURE_ENQCMD,			X86_FEATURE_XSAVES    },
 	{ X86_FEATURE_PER_THREAD_MBA,		X86_FEATURE_MBA       },
 	{ X86_FEATURE_XFD,			X86_FEATURE_XSAVES    },
+	{ X86_FEATURE_AMX_TILE,			X86_FEATURE_XSAVE     },
+	{ X86_FEATURE_AMX_INT8,			X86_FEATURE_AMX_TILE  },
+	{ X86_FEATURE_AMX_BF16,			X86_FEATURE_AMX_TILE  },
 	{}
 };
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 18/22] x86/fpu/amx: Define AMX state components and have it used for boot-time checks
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (16 preceding siblings ...)
  2021-02-21 18:56 ` [PATCH v4 17/22] x86/cpufeatures/amx: Enumerate Advanced Matrix Extension (AMX) feature bits Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-03-20 21:31   ` Thomas Gleixner
  2021-02-21 18:56 ` [PATCH v4 19/22] x86/fpu/amx: Enable the AMX feature in 64-bit mode Chang S. Bae
                   ` (3 subsequent siblings)
  21 siblings, 1 reply; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

Linux uses check_xstate_against_struct() to sanity-check the size of
XSTATE-enabled features. AMX is an XSAVE-enabled feature, and its size is
not hard-coded but is discoverable at run-time via CPUID.

The AMX state is composed of state components 17 and 18, both of which are
user state components. Component 17 is XTILECFG, the state of a 64-byte
tile-related control register. Component 18, called XTILEDATA, contains the
actual tile data, and its size varies across implementations. The
architectural maximum, as defined in CPUID(0x1d, 1): EAX[15:0], is one byte
less than 64KB. The first implementation supports 8KB.

Check the XTILEDATA state size dynamically. The feature introduces the new
tile registers, TMM. Define only one register struct and read the number of
registers from CPUID. Cross-check the overall size with CPUID again.
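
For the first implementation, the cross-check works out as follows (a
worked example using the CPUID values described above):

	/*
	 * CPUID(0x1d, 1): EAX[31:16] = 1024  bytes per tile register
	 *                 EBX[31:16] = 8     tile registers (TMM0-TMM7)
	 *
	 * Expected XTILEDATA size = 1024 * 8 = 8192 bytes (8KB), which must
	 * match the size reported by CPUID(0xd, 18): EAX.
	 */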

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v2:
* Updated the code comments.

Changes from v1:
* Rebased on the upstream kernel (5.10)
---
 arch/x86/include/asm/fpu/types.h  | 27 ++++++++++++++
 arch/x86/include/asm/fpu/xstate.h |  2 +
 arch/x86/kernel/fpu/xstate.c      | 62 +++++++++++++++++++++++++++++++
 3 files changed, 91 insertions(+)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 6fc707c14350..2f297aa85d8f 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -120,6 +120,9 @@ enum xfeature {
 	XFEATURE_RSRVD_COMP_13,
 	XFEATURE_RSRVD_COMP_14,
 	XFEATURE_LBR,
+	XFEATURE_RSRVD_COMP_16,
+	XFEATURE_XTILE_CFG,
+	XFEATURE_XTILE_DATA,
 
 	XFEATURE_MAX,
 };
@@ -136,11 +139,15 @@ enum xfeature {
 #define XFEATURE_MASK_PKRU		(1 << XFEATURE_PKRU)
 #define XFEATURE_MASK_PASID		(1 << XFEATURE_PASID)
 #define XFEATURE_MASK_LBR		(1 << XFEATURE_LBR)
+#define XFEATURE_MASK_XTILE_CFG	(1 << XFEATURE_XTILE_CFG)
+#define XFEATURE_MASK_XTILE_DATA	(1 << XFEATURE_XTILE_DATA)
 
 #define XFEATURE_MASK_FPSSE		(XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
 #define XFEATURE_MASK_AVX512		(XFEATURE_MASK_OPMASK \
 					 | XFEATURE_MASK_ZMM_Hi256 \
 					 | XFEATURE_MASK_Hi16_ZMM)
+#define XFEATURE_MASK_XTILE		(XFEATURE_MASK_XTILE_DATA \
+					 | XFEATURE_MASK_XTILE_CFG)
 
 #define FIRST_EXTENDED_XFEATURE	XFEATURE_YMM
 
@@ -153,6 +160,9 @@ struct reg_256_bit {
 struct reg_512_bit {
 	u8	regbytes[512/8];
 };
+struct reg_1024_byte {
+	u8	regbytes[1024];
+};
 
 /*
  * State component 2:
@@ -255,6 +265,23 @@ struct arch_lbr_state {
 	u64 ler_to;
 	u64 ler_info;
 	struct lbr_entry		entries[];
+};
+
+/*
+ * State component 17: 64-byte tile configuration register.
+ */
+struct xtile_cfg {
+	u64				tcfg[8];
+} __packed;
+
+/*
+ * State component 18: 1KB tile data register.
+ * Each register represents 16 64-byte rows of the matrix
+ * data. But the number of registers depends on the actual
+ * implementation.
+ */
+struct xtile_data {
+	struct reg_1024_byte		tmm;
 } __packed;
 
 /*
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index cbb4795d2b45..4112dbf05f19 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -13,6 +13,8 @@
 
 #define XSTATE_CPUID		0x0000000d
 
+#define TILE_CPUID		0x0000001d
+
 #define FXSAVE_SIZE	512
 
 #define XSAVE_HDR_SIZE	    64
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 4421ef424670..7e708d6f43b5 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -41,6 +41,14 @@ static const char *xfeature_names[] =
 	"Protection Keys User registers",
 	"PASID state",
 	"unknown xstate feature"	,
+	"unknown xstate feature"	,
+	"unknown xstate feature"	,
+	"unknown xstate feature"	,
+	"unknown xstate feature"	,
+	"unknown xstate feature"	,
+	"AMX Tile config"		,
+	"AMX Tile data"			,
+	"unknown xstate feature"	,
 };
 
 struct xfeature_capflag_info {
@@ -60,6 +68,8 @@ static struct xfeature_capflag_info xfeature_capflags[] __initdata = {
 	{ XFEATURE_PT_UNIMPLEMENTED_SO_FAR,	X86_FEATURE_INTEL_PT },
 	{ XFEATURE_PKRU,			X86_FEATURE_PKU },
 	{ XFEATURE_PASID,			X86_FEATURE_ENQCMD },
+	{ XFEATURE_XTILE_CFG,			X86_FEATURE_AMX_TILE },
+	{ XFEATURE_XTILE_DATA,			X86_FEATURE_AMX_TILE }
 };
 
 /*
@@ -474,6 +484,8 @@ static void __init print_xstate_features(void)
 	print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
 	print_xstate_feature(XFEATURE_MASK_PKRU);
 	print_xstate_feature(XFEATURE_MASK_PASID);
+	print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
+	print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
 }
 
 /*
@@ -733,6 +745,51 @@ static void __xstate_dump_leaves(void)
 	}								\
 } while (0)
 
+static void check_xtile_data_against_struct(int size)
+{
+	u32 max_palid, palid, state_size;
+	u32 eax, ebx, ecx, edx;
+	u16 max_tile;
+
+	/*
+	 * Check the maximum palette id:
+	 *   eax: the highest numbered palette subleaf.
+	 */
+	cpuid_count(TILE_CPUID, 0, &max_palid, &ebx, &ecx, &edx);
+
+	/*
+	 * Cross-check each tile size and find the maximum
+	 * number of supported tiles.
+	 */
+	for (palid = 1, max_tile = 0; palid <= max_palid; palid++) {
+		u16 tile_size, max;
+
+		/*
+		 * Check the tile size info:
+		 *   eax[31:16]:  bytes per tile
+		 *   ebx[31:16]:  the max names (or max number of tiles)
+		 */
+		cpuid_count(TILE_CPUID, palid, &eax, &ebx, &ecx, &edx);
+		tile_size = eax >> 16;
+		max = ebx >> 16;
+
+		if (WARN_ONCE(tile_size != sizeof(struct xtile_data),
+			      "%s: struct is %zu bytes, cpu xtile %d bytes\n",
+			      __stringify(XFEATURE_XTILE_DATA),
+			      sizeof(struct xtile_data), tile_size))
+			__xstate_dump_leaves();
+
+		if (max > max_tile)
+			max_tile = max;
+	}
+
+	state_size = sizeof(struct xtile_data) * max_tile;
+	if (WARN_ONCE(size != state_size,
+		      "%s: calculated size is %u bytes, cpu state %d bytes\n",
+		      __stringify(XFEATURE_XTILE_DATA), state_size, size))
+		__xstate_dump_leaves();
+}
+
 /*
  * We have a C struct for each 'xstate'.  We need to ensure
  * that our software representation matches what the CPU
@@ -756,6 +813,11 @@ static void check_xstate_against_struct(int nr)
 	XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM,  struct avx_512_hi16_state);
 	XCHECK_SZ(sz, nr, XFEATURE_PKRU,      struct pkru_state);
 	XCHECK_SZ(sz, nr, XFEATURE_PASID,     struct ia32_pasid_state);
+	XCHECK_SZ(sz, nr, XFEATURE_XTILE_CFG, struct xtile_cfg);
+
+	/* The tile data size varies between implementations */
+	if (nr == XFEATURE_XTILE_DATA)
+		check_xtile_data_against_struct(sz);
 
 	/*
 	 * Make *SURE* to add any feature numbers in below if
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 19/22] x86/fpu/amx: Enable the AMX feature in 64-bit mode
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (17 preceding siblings ...)
  2021-02-21 18:56 ` [PATCH v4 18/22] x86/fpu/amx: Define AMX state components and have it used for boot-time checks Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-03-20 21:26   ` Thomas Gleixner
  2021-02-21 18:56 ` [PATCH v4 20/22] selftest/x86/amx: Include test cases for the AMX state management Chang S. Bae
                   ` (2 subsequent siblings)
  21 siblings, 1 reply; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

In 64-bit mode, include the AMX state components in
XFEATURE_MASK_USER_SUPPORTED.

The XFD feature will be used to dynamically expand the xstate per-task
buffer on the first use.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/fpu/xstate.h | 3 ++-
 arch/x86/kernel/fpu/init.c        | 8 ++++++--
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 4112dbf05f19..9e5c28f3beaa 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -34,7 +34,8 @@
 				      XFEATURE_MASK_Hi16_ZMM	 | \
 				      XFEATURE_MASK_PKRU | \
 				      XFEATURE_MASK_BNDREGS | \
-				      XFEATURE_MASK_BNDCSR)
+				      XFEATURE_MASK_BNDCSR | \
+				      XFEATURE_MASK_XTILE)
 
 /* All currently supported supervisor features */
 #define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID)
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index f2fcdcc979e7..046889f31037 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -219,8 +219,12 @@ static void __init fpu__init_system_xstate_size_legacy(void)
  */
 u64 __init fpu__get_supported_xfeatures_mask(void)
 {
-	return XFEATURE_MASK_USER_SUPPORTED |
-	       XFEATURE_MASK_SUPERVISOR_SUPPORTED;
+	u64 mask = XFEATURE_MASK_USER_SUPPORTED | XFEATURE_MASK_SUPERVISOR_SUPPORTED;
+
+	if (!IS_ENABLED(CONFIG_X86_64))
+		mask &= ~(XFEATURE_MASK_XTILE);
+
+	return mask;
 }
 
 /* Legacy code to initialize eager fpu mode. */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 20/22] selftest/x86/amx: Include test cases for the AMX state management
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (18 preceding siblings ...)
  2021-02-21 18:56 ` [PATCH v4 19/22] x86/fpu/amx: Enable the AMX feature in 64-bit mode Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-02-21 18:56 ` [PATCH v4 21/22] x86/fpu/xstate: Support dynamic user state in the signal handling path Chang S. Bae
  2021-02-21 18:56 ` [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support Chang S. Bae
  21 siblings, 0 replies; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae, linux-kselftest

This selftest verifies that the kernel does not let a child inherit AMX
state at fork(), and that context switching works correctly, by checking
that multiple threads retain their unique tile data.

Also, ptrace() is used to insert AMX state into existing threads -- both
before and after the existing thread has initialized its AMX state.

Collect the test cases of validating those operations together, as they
share some common setup for the AMX state.

These test cases do not depend on AMX compiler support, as they employ
XSAVE/XRSTOR directly from userspace to access the AMX state.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
---
Changes from v2:
* Updated the test messages and the changelog as tile data is not inherited
  to a child anymore.
* Removed bytecode for the instructions already supported by binutils.
* Changed to check the XSAVE availability in a reliable way.

Changes from v1:
* Removed signal testing code
---
 tools/testing/selftests/x86/Makefile |   2 +-
 tools/testing/selftests/x86/amx.c    | 677 +++++++++++++++++++++++++++
 2 files changed, 678 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/x86/amx.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 333980375bc7..2f7feb03867b 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -17,7 +17,7 @@ TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap
 TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
 			test_FCMOV test_FCOMI test_FISTTP \
 			vdso_restorer
-TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering
+TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering amx
 # Some selftests require 32bit support enabled also on 64bit systems
 TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall
 
diff --git a/tools/testing/selftests/x86/amx.c b/tools/testing/selftests/x86/amx.c
new file mode 100644
index 000000000000..f4ecdfd27ae9
--- /dev/null
+++ b/tools/testing/selftests/x86/amx.c
@@ -0,0 +1,677 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <err.h>
+#include <elf.h>
+#include <pthread.h>
+#include <sched.h>
+#include <setjmp.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <time.h>
+#include <malloc.h>
+#include <unistd.h>
+#include <ucontext.h>
+
+#include <linux/futex.h>
+
+#include <sys/ipc.h>
+#include <sys/mman.h>
+#include <sys/ptrace.h>
+#include <sys/shm.h>
+#include <sys/signal.h>
+#include <sys/syscall.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/uio.h>
+#include <sys/ucontext.h>
+
+#include <x86intrin.h>
+
+#ifndef __x86_64__
+# error This test is 64-bit only
+#endif
+
+typedef uint8_t u8;
+typedef uint16_t u16;
+typedef uint32_t u32;
+typedef uint64_t u64;
+
+#define PAGE_SIZE			(1 << 12)
+
+#define NUM_TILES			8
+#define TILE_SIZE			1024
+#define XSAVE_SIZE			((NUM_TILES * TILE_SIZE) + PAGE_SIZE)
+
+struct xsave_data {
+	u8 area[XSAVE_SIZE];
+} __attribute__((aligned(64)));
+
+/* Tile configuration associated: */
+#define MAX_TILES			16
+#define RESERVED_BYTES			14
+
+struct tile_config {
+	u8  palette_id;
+	u8  start_row;
+	u8  reserved[RESERVED_BYTES];
+	u16 colsb[MAX_TILES];
+	u8  rows[MAX_TILES];
+};
+
+struct tile_data {
+	u8 data[NUM_TILES * TILE_SIZE];
+};
+
+static inline u64 __xgetbv(u32 index)
+{
+	u32 eax, edx;
+
+	asm volatile("xgetbv;"
+		     : "=a" (eax), "=d" (edx)
+		     : "c" (index));
+	return eax + ((u64)edx << 32);
+}
+
+static inline void __cpuid(u32 *eax, u32 *ebx, u32 *ecx, u32 *edx)
+{
+	asm volatile("cpuid;"
+		     : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
+		     : "0" (*eax), "2" (*ecx));
+}
+
+/* Load tile configuration */
+static inline void __ldtilecfg(void *cfg)
+{
+	asm volatile(".byte 0xc4,0xe2,0x78,0x49,0x00"
+		     : : "a"(cfg));
+}
+
+/* Load tile data to %tmm0 register only */
+static inline void __tileloadd(void *tile)
+{
+	asm volatile(".byte 0xc4,0xe2,0x7b,0x4b,0x04,0x10"
+		     : : "a"(tile), "d"(0));
+}
+
+/* Save extended states */
+static inline void __xsave(void *buffer, u32 lo, u32 hi)
+{
+	asm volatile("xsave (%%rdi)"
+		     : : "D" (buffer), "a" (lo), "d" (hi)
+		     : "memory");
+}
+
+/* Restore extended states */
+static inline void __xrstor(void *buffer, u32 lo, u32 hi)
+{
+	asm volatile("xrstor (%%rdi)"
+		     : : "D" (buffer), "a" (lo), "d" (hi));
+}
+
+/* Release tile states to init values */
+static inline void __tilerelease(void)
+{
+	asm volatile(".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0" ::);
+}
+
+static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *),
+		       int flags)
+{
+	struct sigaction sa;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.sa_sigaction = handler;
+	sa.sa_flags = SA_SIGINFO | flags;
+	sigemptyset(&sa.sa_mask);
+	if (sigaction(sig, &sa, 0))
+		err(1, "sigaction");
+}
+
+static void clearhandler(int sig)
+{
+	struct sigaction sa;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.sa_handler = SIG_DFL;
+	sigemptyset(&sa.sa_mask);
+	if (sigaction(sig, &sa, 0))
+		err(1, "sigaction");
+}
+
+/* Hardware info check: */
+
+static jmp_buf jmpbuf;
+static bool xsave_disabled;
+
+static void handle_sigill(int sig, siginfo_t *si, void *ctx_void)
+{
+	xsave_disabled = true;
+	siglongjmp(jmpbuf, 1);
+}
+
+#define XFEATURE_XTILE_CFG      17
+#define XFEATURE_XTILE_DATA     18
+#define XFEATURE_MASK_XTILE     ((1 << XFEATURE_XTILE_DATA) | \
+				 (1 << XFEATURE_XTILE_CFG))
+
+static inline bool check_xsave_supports_xtile(void)
+{
+	bool supported = false;
+
+	sethandler(SIGILL, handle_sigill, 0);
+
+	if (!sigsetjmp(jmpbuf, 1))
+		supported = __xgetbv(0) & XFEATURE_MASK_XTILE;
+
+	clearhandler(SIGILL);
+	return supported;
+}
+
+struct xtile_hwinfo {
+	struct {
+		u16 bytes_per_tile;
+		u16 bytes_per_row;
+		u16 max_names;
+		u16 max_rows;
+	} spec;
+
+	struct {
+		u32 offset;
+		u32 size;
+	} xsave;
+};
+
+static struct xtile_hwinfo xtile;
+
+static bool __enum_xtile_config(void)
+{
+	u32 eax, ebx, ecx, edx;
+	u16 bytes_per_tile;
+	bool valid = false;
+
+#define TILE_CPUID			0x1d
+#define TILE_PALETTE_CPUID_SUBLEAVE	0x1
+
+	eax = TILE_CPUID;
+	ecx = TILE_PALETTE_CPUID_SUBLEAVE;
+
+	__cpuid(&eax, &ebx, &ecx, &edx);
+	if (!eax || !ebx || !ecx)
+		return valid;
+
+	xtile.spec.max_names = ebx >> 16;
+	if (xtile.spec.max_names < NUM_TILES)
+		return valid;
+
+	bytes_per_tile = eax >> 16;
+	if (bytes_per_tile < TILE_SIZE)
+		return valid;
+
+	xtile.spec.bytes_per_row = ebx;
+	xtile.spec.max_rows = ecx;
+	valid = true;
+
+	return valid;
+}
+
+static bool __enum_xsave_tile(void)
+{
+	u32 eax, ebx, ecx, edx;
+	bool valid = false;
+
+#define XSTATE_CPUID			0xd
+#define XSTATE_USER_STATE_SUBLEAVE	0x0
+
+	eax = XSTATE_CPUID;
+	ecx = XFEATURE_XTILE_DATA;
+
+	__cpuid(&eax, &ebx, &ecx, &edx);
+	if (!eax || !ebx)
+		return valid;
+
+	xtile.xsave.offset = ebx;
+	xtile.xsave.size = eax;
+	valid = true;
+
+	return valid;
+}
+
+static bool __check_xsave_size(void)
+{
+	u32 eax, ebx, ecx, edx;
+	bool valid = false;
+
+	eax = XSTATE_CPUID;
+	ecx = XSTATE_USER_STATE_SUBLEAVE;
+
+	__cpuid(&eax, &ebx, &ecx, &edx);
+	if (ebx && ebx <= XSAVE_SIZE)
+		valid = true;
+
+	return valid;
+}
+
+/*
+ * Check the hardware-provided tile state info and cross-check it with the
+ * hard-coded values: XSAVE_SIZE, NUM_TILES, and TILE_SIZE.
+ */
+static int check_xtile_hwinfo(void)
+{
+	bool success = false;
+
+	if (!__check_xsave_size())
+		return success;
+
+	if (!__enum_xsave_tile())
+		return success;
+
+	if (!__enum_xtile_config())
+		return success;
+
+	if (sizeof(struct tile_data) >= xtile.xsave.size)
+		success = true;
+
+	return success;
+}
+
+/* The helpers for managing XSAVE buffer and tile states: */
+
+/* Use the uncompacted format without 'init optimization' */
+static void save_xdata(void *data)
+{
+	__xsave(data, -1, -1);
+}
+
+static void restore_xdata(void *data)
+{
+	__xrstor(data, -1, -1);
+}
+
+static inline u64 __get_xsave_xstate_bv(void *data)
+{
+#define XSAVE_HDR_OFFSET	512
+	return *(u64 *)(data + XSAVE_HDR_OFFSET);
+}
+
+static void set_tilecfg(struct tile_config *cfg)
+{
+	int i;
+
+	memset(cfg, 0, sizeof(*cfg));
+	/* The first implementation has one significant palette with id 1 */
+	cfg->palette_id = 1;
+	for (i = 0; i < xtile.spec.max_names; i++) {
+		cfg->colsb[i] = xtile.spec.bytes_per_row;
+		cfg->rows[i] = xtile.spec.max_rows;
+	}
+}
+
+static void load_tilecfg(struct tile_config *cfg)
+{
+	__ldtilecfg(cfg);
+}
+
+static void make_tiles(void *tiles)
+{
+	u32 iterations = xtile.xsave.size / sizeof(u32);
+	static u32 value = 1;
+	u32 *ptr = tiles;
+	int i;
+
+	for (i = 0, ptr = tiles; i < iterations; i++, ptr++)
+		*ptr  = value;
+	value++;
+}
+
+/*
+ * Initialize the XSAVE buffer:
+ *
+ * The tile configuration must be loaded already. Load tile data (%tmm0 only)
+ * and save all the states; the buffer is then ready to take complete tile data.
+ */
+static void init_xdata(void *data)
+{
+	struct tile_data tiles;
+
+	make_tiles(&tiles);
+	__tileloadd(&tiles);
+	__xsave(data, -1, -1);
+}
+
+static inline void *__get_xsave_tile_data_addr(void *data)
+{
+	return data + xtile.xsave.offset;
+}
+
+static void copy_tiles_to_xdata(void *xdata, void *tiles)
+{
+	void *dst = __get_xsave_tile_data_addr(xdata);
+
+	memcpy(dst, tiles, xtile.xsave.size);
+}
+
+static int compare_xdata_tiles(void *xdata, void *tiles)
+{
+	void *tile_data = __get_xsave_tile_data_addr(xdata);
+
+	if (memcmp(tile_data, tiles, xtile.xsave.size))
+		return 1;
+
+	return 0;
+}
+
+static int nerrs, errs;
+
+/* Testing tile data inheritance */
+
+static void test_tile_data_inheritance(void)
+{
+	struct xsave_data xdata;
+	struct tile_data tiles;
+	struct tile_config cfg;
+	pid_t child;
+	int status;
+
+	set_tilecfg(&cfg);
+	load_tilecfg(&cfg);
+	init_xdata(&xdata);
+
+	make_tiles(&tiles);
+	copy_tiles_to_xdata(&xdata, &tiles);
+	restore_xdata(&xdata);
+
+	errs = 0;
+
+	child = fork();
+	if (child < 0)
+		err(1, "fork");
+
+	if (child == 0) {
+		memset(&xdata, 0, sizeof(xdata));
+		save_xdata(&xdata);
+		if (compare_xdata_tiles(&xdata, &tiles)) {
+			printf("[OK]\tchild didn't inherit tile data at fork()\n");
+		} else {
+			printf("[FAIL]\tchild inherited tile data at fork()\n");
+			nerrs++;
+		}
+		_exit(nerrs ? 1 : 0);
+	}
+	wait(&status);
+	if (WIFEXITED(status) && WEXITSTATUS(status))
+		nerrs++;
+}
+
+static void test_fork(void)
+{
+	pid_t child;
+	int status;
+
+	child = fork();
+	if (child < 0)
+		err(1, "fork");
+
+	if (child == 0) {
+		test_tile_data_inheritance();
+		_exit(nerrs ? 1 : 0);
+	}
+
+	wait(&status);
+	if (WIFEXITED(status) && WEXITSTATUS(status))
+		nerrs++;
+}
+
+/* Context switching test */
+
+#define ITERATIONS			10
+#define NUM_THREADS			5
+
+struct futex_info {
+	int current;
+	int next;
+	int *futex;
+};
+
+static inline void command_wait(struct futex_info *info, int value)
+{
+	do {
+		sched_yield();
+	} while (syscall(SYS_futex, info->futex, FUTEX_WAIT, value, 0, 0, 0));
+}
+
+static inline void command_wake(struct futex_info *info, int value)
+{
+	do {
+		*info->futex = value;
+		while (!syscall(SYS_futex, info->futex, FUTEX_WAKE, 1, 0, 0, 0))
+			sched_yield();
+	} while (0);
+}
+
+static inline int get_iterative_value(int id)
+{
+	return ((id << 1) & ~0x1);
+}
+
+static inline int get_endpoint_value(int id)
+{
+	return ((id << 1) | 0x1);
+}
+
+static void *check_tiles(void *info)
+{
+	struct futex_info *finfo = (struct futex_info *)info;
+	struct xsave_data xdata;
+	struct tile_data tiles;
+	struct tile_config cfg;
+	int i;
+
+	set_tilecfg(&cfg);
+	load_tilecfg(&cfg);
+	init_xdata(&xdata);
+
+	make_tiles(&tiles);
+	copy_tiles_to_xdata(&xdata, &tiles);
+	restore_xdata(&xdata);
+
+	for (i = 0; i < ITERATIONS; i++) {
+		command_wait(finfo, get_iterative_value(finfo->current));
+
+		memset(&xdata, 0, sizeof(xdata));
+		save_xdata(&xdata);
+		errs += compare_xdata_tiles(&xdata, &tiles);
+
+		make_tiles(&tiles);
+		copy_tiles_to_xdata(&xdata, &tiles);
+		restore_xdata(&xdata);
+
+		command_wake(finfo, get_iterative_value(finfo->next));
+	}
+
+	command_wait(finfo, get_endpoint_value(finfo->current));
+	__tilerelease();
+	return NULL;
+}
+
+static int create_children(int num, struct futex_info *finfo)
+{
+	const int shm_id = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0666);
+	int *futex = shmat(shm_id, NULL, 0);
+	pthread_t thread;
+	int i;
+
+	for (i = 0; i < num; i++) {
+		finfo[i].futex = futex;
+		finfo[i].current = i + 1;
+		finfo[i].next = (i + 2) % (num + 1);
+
+		if (pthread_create(&thread, NULL, check_tiles, &finfo[i])) {
+			err(1, "pthread_create");
+			return 1;
+		}
+	}
+	return 0;
+}
+
+static void test_context_switch(void)
+{
+	struct futex_info *finfo;
+	cpu_set_t cpuset;
+	int i;
+
+	printf("[RUN]\t%u context switches of tile states in %d threads\n",
+	       ITERATIONS * NUM_THREADS, NUM_THREADS);
+
+	errs = 0;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(0, &cpuset);
+	if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0)
+		err(1, "sched_setaffinity to CPU 0");
+
+	finfo = malloc(sizeof(*finfo) * NUM_THREADS);
+
+	if (create_children(NUM_THREADS, finfo))
+		return;
+
+	for (i = 0; i < ITERATIONS; i++) {
+		command_wake(finfo, get_iterative_value(1));
+		command_wait(finfo, get_iterative_value(0));
+	}
+
+	for (i = 1; i <= NUM_THREADS; i++)
+		command_wake(finfo, get_endpoint_value(i));
+
+	if (errs) {
+		printf("[FAIL]\t%u incorrect tile states\n", errs);
+		nerrs += errs;
+		return;
+	}
+
+	printf("[OK]\tall tile states are correct\n");
+}
+
+/* Ptrace test */
+
+static inline long get_tile_state(pid_t child, struct iovec *iov)
+{
+	return ptrace(PTRACE_GETREGSET, child, (u32)NT_X86_XSTATE, iov);
+}
+
+static inline long set_tile_state(pid_t child, struct iovec *iov)
+{
+	return ptrace(PTRACE_SETREGSET, child, (u32)NT_X86_XSTATE, iov);
+}
+
+static int write_tile_state(bool load_tile, pid_t child)
+{
+	struct xsave_data xdata;
+	struct tile_data tiles;
+	struct iovec iov;
+
+	iov.iov_base = &xdata;
+	iov.iov_len = sizeof(xdata);
+
+	if (get_tile_state(child, &iov))
+		err(1, "PTRACE_GETREGSET");
+
+	make_tiles(&tiles);
+	copy_tiles_to_xdata(&xdata, &tiles);
+	if (set_tile_state(child, &iov))
+		err(1, "PTRACE_SETREGSET");
+
+	memset(&xdata, 0, sizeof(xdata));
+	if (get_tile_state(child, &iov))
+		err(1, "PTRACE_GETREGSET");
+
+	if (!load_tile)
+		memset(&tiles, 0, sizeof(tiles));
+
+	return compare_xdata_tiles(&xdata, &tiles);
+}
+
+static void test_tile_state_write(bool load_tile)
+{
+	pid_t child;
+	int status;
+
+	child = fork();
+	if (child < 0)
+		err(1, "fork");
+
+	if (child == 0) {
+		printf("[RUN]\tPtrace-induced tile state write, ");
+		printf("%s tile data loaded\n", load_tile ? "with" : "without");
+
+		if (ptrace(PTRACE_TRACEME, 0, NULL, NULL))
+			err(1, "PTRACE_TRACEME");
+
+		if (load_tile) {
+			struct tile_config cfg;
+			struct tile_data tiles;
+
+			set_tilecfg(&cfg);
+			load_tilecfg(&cfg);
+			make_tiles(&tiles);
+			/* Load only %tmm0; this induces the first-use #NM */
+			__tileloadd(&tiles);
+		}
+
+		raise(SIGTRAP);
+		_exit(0);
+	}
+
+	do {
+		wait(&status);
+	} while (WSTOPSIG(status) != SIGTRAP);
+
+	errs = write_tile_state(load_tile, child);
+	if (errs) {
+		nerrs++;
+		printf("[FAIL]\t%s write\n", load_tile ? "incorrect" : "unexpected");
+	} else {
+		printf("[OK]\t%s write\n", load_tile ? "correct" : "no");
+	}
+
+	ptrace(PTRACE_DETACH, child, NULL, NULL);
+	wait(&status);
+}
+
+static void test_ptrace(void)
+{
+	bool ptracee_loads_tiles;
+
+	ptracee_loads_tiles = true;
+	test_tile_state_write(ptracee_loads_tiles);
+
+	ptracee_loads_tiles = false;
+	test_tile_state_write(ptracee_loads_tiles);
+}
+
+int main(void)
+{
+	/* Check hardware availability first */
+
+	if (!check_xsave_supports_xtile()) {
+		if (xsave_disabled)
+			printf("XSAVE disabled.\n");
+		else
+			printf("Tile data not available.\n");
+		return 0;
+	}
+
+	if (!check_xtile_hwinfo()) {
+		printf("Available tile state size is insufficient to test.\n");
+		return 0;
+	}
+
+	nerrs = 0;
+
+	test_fork();
+	test_context_switch();
+	test_ptrace();
+
+	return nerrs ? 1 : 0;
+}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 21/22] x86/fpu/xstate: Support dynamic user state in the signal handling path
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (19 preceding siblings ...)
  2021-02-21 18:56 ` [PATCH v4 20/22] selftest/x86/amx: Include test cases for the AMX state management Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-02-21 18:56 ` [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support Chang S. Bae
  21 siblings, 0 replies; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae, linux-kselftest

When entering a signal handler, the kernel saves the xstate in the signal
frame. The dynamic user state is best saved only when it is in use;
fpu->state_mask can help to exclude the unused states.

When returning from a signal handler, XRSTOR re-initializes the excluded
state components.

Add a test case to verify in the signal handler that the signal frame
excludes AMX data when the signaled thread has initialized AMX state.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
---
Changes from v3:
* Removed 'no functional changes' in the changelog. (Borislav Petkov)

Changes from v1:
* Made it revertible (moved close to the end of the series).
* Included the test case.
---
 arch/x86/include/asm/fpu/internal.h |  2 +-
 tools/testing/selftests/x86/amx.c   | 66 +++++++++++++++++++++++++++++
 2 files changed, 67 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index c467312d38d8..090eb5bb277b 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -354,7 +354,7 @@ static inline void copy_kernel_to_xregs(struct xregs_state *xstate, u64 mask)
  */
 static inline int copy_xregs_to_user(struct xregs_state __user *buf)
 {
-	u64 mask = xfeatures_mask_user();
+	u64 mask = current->thread.fpu.state_mask;
 	u32 lmask = mask;
 	u32 hmask = mask >> 32;
 	int err;
diff --git a/tools/testing/selftests/x86/amx.c b/tools/testing/selftests/x86/amx.c
index f4ecdfd27ae9..a7386b886532 100644
--- a/tools/testing/selftests/x86/amx.c
+++ b/tools/testing/selftests/x86/amx.c
@@ -650,6 +650,71 @@ static void test_ptrace(void)
 	test_tile_state_write(ptracee_loads_tiles);
 }
 
+/* Signal handling test */
+
+static int sigtrapped;
+struct tile_data sig_tiles, sighdl_tiles;
+
+static void handle_sigtrap(int sig, siginfo_t *info, void *ctx_void)
+{
+	ucontext_t *uctxt = (ucontext_t *)ctx_void;
+	struct xsave_data xdata;
+	struct tile_config cfg;
+	struct tile_data tiles;
+	u64 header;
+
+	header = __get_xsave_xstate_bv((void *)uctxt->uc_mcontext.fpregs);
+
+	if (header & (1 << XFEATURE_XTILE_DATA))
+		printf("[FAIL]\ttile data was written in sigframe\n");
+	else
+		printf("[OK]\ttile data was skipped in sigframe\n");
+
+	set_tilecfg(&cfg);
+	load_tilecfg(&cfg);
+	init_xdata(&xdata);
+
+	make_tiles(&tiles);
+	copy_tiles_to_xdata(&xdata, &tiles);
+	restore_xdata(&xdata);
+
+	save_xdata(&xdata);
+	if (compare_xdata_tiles(&xdata, &tiles))
+		err(1, "tile load failed");
+
+	printf("\tsignal handler: load tile data\n");
+
+	sigtrapped = sig;
+}
+
+static void test_signal_handling(void)
+{
+	struct xsave_data xdata = { 0 };
+	struct tile_data tiles = { 0 };
+
+	sethandler(SIGTRAP, handle_sigtrap, 0);
+	sigtrapped = 0;
+
+	printf("[RUN]\tCheck tile state management in signal handling\n");
+
+	printf("\tbefore signal: initial tile data state\n");
+
+	raise(SIGTRAP);
+
+	if (sigtrapped == 0)
+		err(1, "sigtrap");
+
+	save_xdata(&xdata);
+	if (compare_xdata_tiles(&xdata, &tiles)) {
+		printf("[FAIL]\ttile data was not loaded at sigreturn\n");
+		nerrs++;
+	} else {
+		printf("[OK]\ttile data was re-initialized at sigreturn\n");
+	}
+
+	clearhandler(SIGTRAP);
+}
+
 int main(void)
 {
 	/* Check hardware availability first */
@@ -672,6 +737,7 @@ int main(void)
 	test_fork();
 	test_context_switch();
 	test_ptrace();
+	test_signal_handling();
 
 	return nerrs ? 1 : 0;
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (20 preceding siblings ...)
  2021-02-21 18:56 ` [PATCH v4 21/22] x86/fpu/xstate: Support dynamic user state in the signal handling path Chang S. Bae
@ 2021-02-21 18:56 ` Chang S. Bae
  2021-02-21 19:30   ` Randy Dunlap
  2021-03-20 20:56   ` Thomas Gleixner
  21 siblings, 2 replies; 78+ messages in thread
From: Chang S. Bae @ 2021-02-21 18:56 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae, linux-doc

"xstate.disable=0x60000" will disable AMX on a system that has AMX compiled
into XFEATURE_MASK_USER_ENABLED.

"xstate.enable=0x60000" will enable AMX on a system that does NOT have AMX
compiled into XFEATURE_MASK_USER_ENABLED (assuming the kernel is new enough
to support this feature).

Rename XFEATURE_MASK_USER_SUPPORTED to XFEATURE_MASK_USER_ENABLED to be
aligned with the new parameters.

While this cmdline currently covers only AMX, it is intended to be easily
extended to future XSAVE-enabled features.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v3:
* Fixed a few typos. (Randy Dunlap)

Changes from v2:
* Changed to taint the kernel when any unknown state is enabled. (Andy
  Lutomirski)
* Simplified the cmdline handling.
* Edited the changelog.

Changes from v1:
* Renamed the user state mask define (Andy Lutomirski and Dave Hansen)
* Changed the error message (Dave Hansen)
* Fixed xfeatures_mask_user()
* Rebased on the upstream kernel (5.10) -- revived the param parse function
---
 .../admin-guide/kernel-parameters.txt         | 15 +++++
 arch/x86/include/asm/fpu/types.h              |  6 ++
 arch/x86/include/asm/fpu/xstate.h             | 24 +++----
 arch/x86/kernel/fpu/init.c                    | 65 +++++++++++++++++--
 4 files changed, 93 insertions(+), 17 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a10b545c2070..ec79f63979a4 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6014,6 +6014,21 @@
 			which allow the hypervisor to 'idle' the guest on lock
 			contention.
 
+	xstate.enable=	[X86-64]
+	xstate.disable=	[X86-64]
+			The kernel is compiled with a default xstate bitmask --
+			enabling it to use the XSAVE hardware to efficiently
+			save and restore thread states on context switch.
+			xstate.enable allows adding to that default mask at
+			boot-time without recompiling the kernel just to support
+			the new thread state. (Note that the kernel will ignore
+			any bits in the mask that do not correspond to features
+			that are actually available in CPUID.)  xstate.disable
+			allows clearing bits in the default mask, forcing the
+			kernel to forget that it supports the specified thread
+			state. When a bit is set in both, the kernel gives
+			xstate.disable priority.
+
 	xirc2ps_cs=	[NET,PCMCIA]
 			Format:
 			<irq>,<irq_mask>,<io>,<full_duplex>,<do_sound>,<lockup_hack>[,<irq2>[,<irq3>[,<irq4>]]]
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 2f297aa85d8f..967d38cc7eb1 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -149,6 +149,12 @@ enum xfeature {
 #define XFEATURE_MASK_XTILE		(XFEATURE_MASK_XTILE_DATA \
 					 | XFEATURE_MASK_XTILE_CFG)
 
+#define XFEATURE_REGION_MASK(max_bit, min_bit) \
+	((BIT_ULL((max_bit) - (min_bit) + 1) - 1) << (min_bit))
+
+#define XFEATURE_MASK_CONFIGURABLE \
+	XFEATURE_REGION_MASK(XFEATURE_XTILE_DATA, XFEATURE_XTILE_CFG)
+
 #define FIRST_EXTENDED_XFEATURE	XFEATURE_YMM
 
 struct reg_128_bit {
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 9e5c28f3beaa..1e64afea9f68 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -25,17 +25,17 @@
 
 #define XSAVE_ALIGNMENT     64
 
-/* All currently supported user features */
-#define XFEATURE_MASK_USER_SUPPORTED (XFEATURE_MASK_FP | \
-				      XFEATURE_MASK_SSE | \
-				      XFEATURE_MASK_YMM | \
-				      XFEATURE_MASK_OPMASK | \
-				      XFEATURE_MASK_ZMM_Hi256 | \
-				      XFEATURE_MASK_Hi16_ZMM	 | \
-				      XFEATURE_MASK_PKRU | \
-				      XFEATURE_MASK_BNDREGS | \
-				      XFEATURE_MASK_BNDCSR | \
-				      XFEATURE_MASK_XTILE)
+/* All currently enabled user features */
+#define XFEATURE_MASK_USER_ENABLED (XFEATURE_MASK_FP | \
+				    XFEATURE_MASK_SSE | \
+				    XFEATURE_MASK_YMM | \
+				    XFEATURE_MASK_OPMASK | \
+				    XFEATURE_MASK_ZMM_Hi256 | \
+				    XFEATURE_MASK_Hi16_ZMM	 | \
+				    XFEATURE_MASK_PKRU | \
+				    XFEATURE_MASK_BNDREGS | \
+				    XFEATURE_MASK_BNDCSR | \
+				    XFEATURE_MASK_XTILE)
 
 /* All currently supported supervisor features */
 #define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID)
@@ -87,7 +87,7 @@ static inline u64 xfeatures_mask_supervisor(void)
 
 static inline u64 xfeatures_mask_user(void)
 {
-	return xfeatures_mask_all & XFEATURE_MASK_USER_SUPPORTED;
+	return xfeatures_mask_all & ~(XFEATURE_MASK_SUPERVISOR_ALL);
 }
 
 static inline u64 xfeatures_mask_supervisor_dynamic(void)
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 046889f31037..0166d3eb9916 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -5,6 +5,7 @@
 #include <asm/fpu/internal.h>
 #include <asm/tlbflush.h>
 #include <asm/setup.h>
+#include <asm/cmdline.h>
 
 #include <linux/sched.h>
 #include <linux/sched/task.h>
@@ -215,14 +216,45 @@ static void __init fpu__init_system_xstate_size_legacy(void)
 /*
  * Find supported xfeatures based on cpu features and command-line input.
  * This must be called after fpu__init_parse_early_param() is called and
- * xfeatures_mask is enumerated.
+ * xfeatures_mask_all is enumerated.
  */
+
+static u64 xstate_enable;
+static u64 xstate_disable;
+
 u64 __init fpu__get_supported_xfeatures_mask(void)
 {
-	u64 mask = XFEATURE_MASK_USER_SUPPORTED | XFEATURE_MASK_SUPERVISOR_SUPPORTED;
-
-	if (!IS_ENABLED(CONFIG_X86_64))
-		mask &= ~(XFEATURE_MASK_XTILE);
+	u64 mask = XFEATURE_MASK_USER_ENABLED | XFEATURE_MASK_SUPERVISOR_SUPPORTED;
+
+	if (!IS_ENABLED(CONFIG_X86_64)) {
+		mask  &= ~(XFEATURE_MASK_XTILE);
+	} else if (xstate_enable || xstate_disable) {
+		u64 custom = mask;
+		u64 unknown;
+
+		custom |= xstate_enable;
+		custom &= ~xstate_disable;
+
+		unknown = custom & ~mask;
+		if (unknown) {
+			/*
+			 * The user should fully understand the result of using an
+			 * undocumented xstate component.
+			 */
+			add_taint(TAINT_CPU_OUT_OF_SPEC, LOCKDEP_STILL_OK);
+			pr_warn("x86/fpu: Attempt to enable unknown xstate features 0x%llx\n",
+				unknown);
+			WARN_ON_FPU(1);
+		}
+
+		if ((custom & XFEATURE_MASK_XTILE) != XFEATURE_MASK_XTILE) {
+			pr_warn("x86/fpu: Error in xstate.disable. Additionally disabling 0x%x components.\n",
+				XFEATURE_MASK_XTILE);
+			custom &= ~(XFEATURE_MASK_XTILE);
+		}
+
+		mask = custom;
+	}
 
 	return mask;
 }
@@ -236,12 +268,35 @@ static void __init fpu__init_system_ctx_switch(void)
 	on_boot_cpu = 0;
 }
 
+/*
+ * Longest parameter of 'xstate.enable=' is 22 octal number characters with '0' prefix and
+ * an extra '\0' for termination.
+ */
+#define MAX_XSTATE_MASK_CHARS	24
+/*
+ * We parse xstate parameters early because fpu__init_system() is executed before
+ * parse_early_param().
+ */
+static void __init fpu__init_parse_early_param(void)
+{
+	char arg[MAX_XSTATE_MASK_CHARS];
+
+	if (cmdline_find_option(boot_command_line, "xstate.enable", arg, sizeof(arg)) &&
+	    !kstrtoull(arg, 0, &xstate_enable))
+		xstate_enable &= XFEATURE_MASK_CONFIGURABLE;
+
+	if (cmdline_find_option(boot_command_line, "xstate.disable", arg, sizeof(arg)) &&
+	    !kstrtoull(arg, 0, &xstate_disable))
+		xstate_disable &= XFEATURE_MASK_CONFIGURABLE;
+}
+
 /*
  * Called on the boot CPU once per system bootup, to set up the initial
  * FPU state that is later cloned into all processes:
  */
 void __init fpu__init_system(struct cpuinfo_x86 *c)
 {
+	fpu__init_parse_early_param();
 	fpu__init_system_early_generic(c);
 
 	/*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-02-21 18:56 ` [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support Chang S. Bae
@ 2021-02-21 19:30   ` Randy Dunlap
  2021-02-21 20:10     ` Bae, Chang Seok
  2021-03-20 20:56   ` Thomas Gleixner
  1 sibling, 1 reply; 78+ messages in thread
From: Randy Dunlap @ 2021-02-21 19:30 UTC (permalink / raw)
  To: Chang S. Bae, bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	linux-doc

On 2/21/21 10:56 AM, Chang S. Bae wrote:
> "xstate.disable=0x60000" will disable AMX on a system that has AMX compiled
> into XFEATURE_MASK_USER_ENABLED.
> 
> "xstate.enable=0x60000" will enable AMX on a system that does NOT have AMX
> compiled into XFEATURE_MASK_USER_ENABLED (assuming the kernel is new enough
> to support this feature).
> 
> Rename XFEATURE_MASK_USER_SUPPORTED to XFEATURE_MASK_USER_ENABLED to be
> aligned with the new parameters.
> 
> While this cmdline currently covers only AMX, it is intended to be easily
> extended to future XSAVE-enabled features.
> 

Hi,
Can we tell people (in this Doc file) where to look up the values that can be
used in xstate.enable and xstate.disable?

thanks.

> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
> Reviewed-by: Len Brown <len.brown@intel.com>
> Cc: x86@kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> ---
> Changes from v3:
> * Fixed a few typos. (Randy Dunlap)
> 
> Changes from v2:
> * Changed to taint the kernel when any unknown state is enabled. (Andy
>   Lutomirski)
> * Simplified the cmdline handling.
> * Edited the changelog.
> 
> Changes from v1:
> * Renamed the user state mask define (Andy Lutomirski and Dave Hansen)
> * Changed the error message (Dave Hansen)
> * Fixed xfeatures_mask_user()
> * Rebased on the upstream kernel (5.10) -- revived the param parse function
> ---
>  .../admin-guide/kernel-parameters.txt         | 15 +++++
>  arch/x86/include/asm/fpu/types.h              |  6 ++
>  arch/x86/include/asm/fpu/xstate.h             | 24 +++----
>  arch/x86/kernel/fpu/init.c                    | 65 +++++++++++++++++--
>  4 files changed, 93 insertions(+), 17 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index a10b545c2070..ec79f63979a4 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -6014,6 +6014,21 @@
>  			which allow the hypervisor to 'idle' the guest on lock
>  			contention.
>  
> +	xstate.enable=	[X86-64]
> +	xstate.disable=	[X86-64]
> +			The kernel is compiled with a default xstate bitmask --
> +			enabling it to use the XSAVE hardware to efficiently
> +			save and restore thread states on context switch.
> +			xstate.enable allows adding to that default mask at
> +			boot-time without recompiling the kernel just to support
> +			the new thread state. (Note that the kernel will ignore
> +			any bits in the mask that do not correspond to features
> +			that are actually available in CPUID.)  xstate.disable
> +			allows clearing bits in the default mask, forcing the
> +			kernel to forget that it supports the specified thread
> +			state. When a bit set for both, the kernel takes
> +			xstate.disable as a priority.
> +
>  	xirc2ps_cs=	[NET,PCMCIA]
>  			Format:
>  			<irq>,<irq_mask>,<io>,<full_duplex>,<do_sound>,<lockup_hack>[,<irq2>[,<irq3>[,<irq4>]]]



-- 
~Randy


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-02-21 19:30   ` Randy Dunlap
@ 2021-02-21 20:10     ` Bae, Chang Seok
  2021-02-21 20:37       ` Randy Dunlap
  0 siblings, 1 reply; 78+ messages in thread
From: Bae, Chang Seok @ 2021-02-21 20:10 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Borislav Petkov, luto, tglx, mingo, x86, Brown, Len, Hansen,
	Dave, Liu, Jing2, Shankar, Ravi V, linux-kernel, linux-doc

On Feb 21, 2021, at 11:30, Randy Dunlap <rdunlap@infradead.org> wrote:
> Can we tell people (in this Doc file) where to look up the values that can be
> used in xstate.enable and xstate.disable?

Perhaps add something like this with the change below:
    “See comment before function fpu__init_parse_early_param() in
     arch/x86/kernel/fpu/init.c."

/*
 * The kernel parameters "xstate.enable='mask'" and "xstate.disable='mask'" take a
 * mask value that is a subset of XFEATURE_MASK_CONFIGURABLE.
 *
 * The longest parameter is 22 octal number characters with '0' prefix and an extra
 * '\0' for termination.
 */
#define MAX_XSTATE_MASK_CHARS   24

/**
 * fpu__init_parse_early_param() - parse the xstate kernel parameters
 *
 * Parse them early because fpu__init_system() is executed before
 * parse_early_param().
 */
static void __init fpu__init_parse_early_param(void)
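
And perhaps spell the AMX value out in that comment as well, e.g.:

 * With the enum in arch/x86/include/asm/fpu/types.h, XFEATURE_XTILE_CFG
 * is bit 17 and XFEATURE_XTILE_DATA is bit 18, so AMX as a whole is
 * BIT(18) | BIT(17) = 0x60000.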

Thanks,
Chang


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-02-21 20:10     ` Bae, Chang Seok
@ 2021-02-21 20:37       ` Randy Dunlap
  0 siblings, 0 replies; 78+ messages in thread
From: Randy Dunlap @ 2021-02-21 20:37 UTC (permalink / raw)
  To: Bae, Chang Seok
  Cc: Borislav Petkov, luto, tglx, mingo, x86, Brown, Len, Hansen,
	Dave, Liu, Jing2, Shankar, Ravi V, linux-kernel, linux-doc

On 2/21/21 12:10 PM, Bae, Chang Seok wrote:
> On Feb 21, 2021, at 11:30, Randy Dunlap <rdunlap@infradead.org> wrote:
>> Can we tell people (in this Doc file) where to look up the values that can be
>> used in xstate.enable and xstate.disable?
> 
> Perhaps add something like this with the change below:
>     “See comment before function fpu__init_parse_early_param() in
>      arch/x86/kernel/fpu/init.c."

Hi,

I was thinking more along the lines of: where can I find the value
0x60000 or BIT(22) or BIT(19), for example, and see what they mean,
even though it will likely be some abbreviation.


> /*
>  * The kernel parameter "xstate.enable='mask'" and "xstate.disable='mask'" have a
>  * mask value in a subset of XFEATURE_MASK_CONFIGURABLE.
>  *
>  * The longest parameter is 22 octal number characters with '0' prefix and an extra
>  * '\0' for termination.
>  */
> #define MAX_XSTATE_MASK_CHARS   24
> 
> /**
>  * fpu__init_parse_early_param() - parse the xstate kernel parameters
>  *
>  * Parse them early because fpu__init_system() is executed before
>  * parse_early_param().
>  */
> static void __init fpu__init_parse_early_param(void)

thanks.
-- 
~Randy


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 01/22] x86/fpu/xstate: Modify the initialization helper to handle both static and dynamic buffers
  2021-02-21 18:56 ` [PATCH v4 01/22] x86/fpu/xstate: Modify the initialization helper to handle both static and dynamic buffers Chang S. Bae
@ 2021-03-10 13:40   ` Borislav Petkov
  0 siblings, 0 replies; 78+ messages in thread
From: Borislav Petkov @ 2021-03-10 13:40 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: luto, tglx, mingo, x86, len.brown, dave.hansen, jing2.liu,
	ravi.v.shankar, linux-kernel, kvm

On Sun, Feb 21, 2021 at 10:56:16AM -0800, Chang S. Bae wrote:
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index 571220ac8bea..d43661d309ab 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -192,8 +192,18 @@ static inline void fpstate_init_fstate(struct fregs_state *fp)
>  	fp->fos = 0xffff0000u;
>  }
>  
> -void fpstate_init(union fpregs_state *state)
> +/*
> + * @fpu: If NULL, use init_fpstate
> + */

A note either for you - if you get to resend a new revision - or for the
committer to fix up this into a proper kernel-doc style:

Documentation/doc-guide/kernel-doc.rst

-- 
Regards/Gruss,
    Boris.

SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG Nürnberg

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-02-21 18:56 ` [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support Chang S. Bae
  2021-02-21 19:30   ` Randy Dunlap
@ 2021-03-20 20:56   ` Thomas Gleixner
  2021-03-25 22:59     ` Len Brown
  1 sibling, 1 reply; 78+ messages in thread
From: Thomas Gleixner @ 2021-03-20 20:56 UTC (permalink / raw)
  To: Chang S. Bae, bp, luto, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae, linux-doc

On Sun, Feb 21 2021 at 10:56, Chang S. Bae wrote:
> "xstate.disable=0x60000" will disable AMX on a system that has AMX compiled
> into XFEATURE_MASK_USER_ENABLED.
>
> "xstate.enable=0x60000" will enable AMX on a system that does NOT have AMX
> compiled into XFEATURE_MASK_USER_ENABLED (assuming the kernel is new enough
> to support this feature).

This makes no sense at all.

> Rename XFEATURE_MASK_USER_SUPPORTED to XFEATURE_MASK_USER_ENABLED to be
> aligned with the new parameters.
>
> While this cmdline currently covers only AMX, it is intended to be easily
> extended to future XSAVE-enabled features.

I have a hard time to map this changelog to the actual code.

> +/* All currently enabled user features */
> +#define XFEATURE_MASK_USER_ENABLED (XFEATURE_MASK_FP | \
> +				    XFEATURE_MASK_SSE | \
> +				    XFEATURE_MASK_YMM | \
> +				    XFEATURE_MASK_OPMASK | \
> +				    XFEATURE_MASK_ZMM_Hi256 | \
> +				    XFEATURE_MASK_Hi16_ZMM	 | \
> +				    XFEATURE_MASK_PKRU | \
> +				    XFEATURE_MASK_BNDREGS | \
> +				    XFEATURE_MASK_BNDCSR | \
> +				    XFEATURE_MASK_XTILE)
  
> +
> +static u64 xstate_enable;
> +static u64 xstate_disable;

This needs to be kept around forever because it's used where outside of
__init code?

>  u64 __init fpu__get_supported_xfeatures_mask(void)
>  {
> -	u64 mask = XFEATURE_MASK_USER_SUPPORTED | XFEATURE_MASK_SUPERVISOR_SUPPORTED;
> -
> -	if (!IS_ENABLED(CONFIG_X86_64))
> -		mask &= ~(XFEATURE_MASK_XTILE);
> +	u64 mask = XFEATURE_MASK_USER_ENABLED | XFEATURE_MASK_SUPERVISOR_SUPPORTED;
> +
> +	if (!IS_ENABLED(CONFIG_X86_64)) {
> +		mask  &= ~(XFEATURE_MASK_XTILE);
> +	} else if (xstate_enable || xstate_disable) {
> +		u64 custom = mask;
> +		u64 unknown;
> +
> +		custom |= xstate_enable;
> +		custom &= ~xstate_disable;
> +
> +		unknown = custom & ~mask;
> +		if (unknown) {
> +			/*
> +			 * The user should fully understand the result of using an
> +			 * undocumented xstate component.
> +			 */

What is to understand here? Absolutely nothing.  This has been tried to
be smuggled into the kernel ever so often and it's again in something
which claims to do something else and the changelog is silent about it.

The argument 'it allows easier testing of new features' is absolutely
not true simply because the rest of the kernel knows absolutely nothing
about the feature and stuff would go south anyway.

We won't enable features which are unknown ever. Keep that presilicon
test gunk where it belongs: In the Intel poison cabinet along with the
rest of the code which nobody ever wants to see.

> +		}
> +
> +		if ((custom & XFEATURE_MASK_XTILE) != XFEATURE_MASK_XTILE) {
> +			pr_warn("x86/fpu: Error in xstate.disable. Additionally disabling 0x%x components.\n",
> +				XFEATURE_MASK_XTILE);

What?

If the user added: xstate.disable=0x60000 to the command line, then the
code above:

> +		custom &= ~xstate_disable;

has cleared XFEATURE_MASK_XTILE in custom which makes that check true,
the warning emitted and then 

> +			custom &= ~(XFEATURE_MASK_XTILE);

this part clears out XFEATURE_MASK_XTILE once more.

> +		}

What the heck.

> +/*
> + * Longest parameter of 'xstate.enable=' is 22 octal number characters with '0' prefix and
> + * an extra '\0' for termination.
> + */
> +#define MAX_XSTATE_MASK_CHARS	24
> +/*
> + * We parse xstate parameters early because fpu__init_system() is executed before
> + * parse_early_param().
> + */
> +static void __init fpu__init_parse_early_param(void)
> +{
> +	char arg[MAX_XSTATE_MASK_CHARS];
> +
> +	if (cmdline_find_option(boot_command_line, "xstate.enable", arg, sizeof(arg)) &&
> +	    !kstrtoull(arg, 0, &xstate_enable))
> +		xstate_enable &= XFEATURE_MASK_CONFIGURABLE;

This enable thing is not going to happen.

> +	if (cmdline_find_option(boot_command_line, "xstate.disable", arg, sizeof(arg)) &&
> +	    !kstrtoull(arg, 0, &xstate_disable))
> +		xstate_disable &= XFEATURE_MASK_CONFIGURABLE;
> +}
> +

This parser needs to be called for X86_32 because?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 16/22] x86/fpu/xstate: Extend the table to map state components with features
  2021-02-21 18:56 ` [PATCH v4 16/22] x86/fpu/xstate: Extend the table to map state components with features Chang S. Bae
@ 2021-03-20 21:25   ` Thomas Gleixner
  2021-03-23 21:52     ` Bae, Chang Seok
  0 siblings, 1 reply; 78+ messages in thread
From: Thomas Gleixner @ 2021-03-20 21:25 UTC (permalink / raw)
  To: Chang S. Bae, bp, luto, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

On Sun, Feb 21 2021 at 10:56, Chang S. Bae wrote:
> +struct xfeature_capflag_info {
> +	int xfeature_idx;
> +	short cpu_cap;

First of all please format the struct members tabular as x86 does all
over the place.

What's the purpose of 'int' 'short' here? This data is unsigned. We have
no negative feature bits and no negative xfeatures.

Also if your intention is to save space, then you might want to look at
the compiler output. It's not saving anything, it's still 8 bytes because
the compiler pads the struct.

> +};
> +
> +static struct xfeature_capflag_info xfeature_capflags[] __initdata = {
> +	{ XFEATURE_FP,				X86_FEATURE_FPU },
> +	{ XFEATURE_SSE,				X86_FEATURE_XMM },
> +	{ XFEATURE_YMM,				X86_FEATURE_AVX },
> +	{ XFEATURE_BNDREGS,			X86_FEATURE_MPX },
> +	{ XFEATURE_BNDCSR,			X86_FEATURE_MPX },
> +	{ XFEATURE_OPMASK,			X86_FEATURE_AVX512F },
> +	{ XFEATURE_ZMM_Hi256,			X86_FEATURE_AVX512F },
> +	{ XFEATURE_Hi16_ZMM,			X86_FEATURE_AVX512F },
> +	{ XFEATURE_PT_UNIMPLEMENTED_SO_FAR,	X86_FEATURE_INTEL_PT },
> +	{ XFEATURE_PKRU,			X86_FEATURE_PKU },
> +	{ XFEATURE_PASID,			X86_FEATURE_ENQCMD },
>  };

And you could have changed the existing table just so:

static unsigned short xsave_cpuid_features[] __initdata = {
	[XFEATURE_FP]                          = X86_FEATURE_FPU,
	[XFEATURE_SSE]                         = X86_FEATURE_XMM,
	[XFEATURE_YMM]                         = X86_FEATURE_AVX,
	[XFEATURE_BNDREGS]                     = X86_FEATURE_MPX,
	[XFEATURE_BNDCSR]                      = X86_FEATURE_MPX,
	[XFEATURE_OPMASK]                      = X86_FEATURE_AVX512F,
	[XFEATURE_ZMM_Hi256]                   = X86_FEATURE_AVX512F,
	[XFEATURE_Hi16_ZMM]                    = X86_FEATURE_AVX512F,
	[XFEATURE_PT_UNIMPLEMENTED_SO_FAR]     = X86_FEATURE_INTEL_PT,
	[XFEATURE_PKRU]                        = X86_FEATURE_PKU,
	[XFEATURE_PASID]                       = X86_FEATURE_ENQCMD,
};

and the implementation to:            

        for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
-		if (!boot_cpu_has(xsave_cpuid_features[i]))
+		if (!xsave_cpuid_features[i] || !boot_cpu_has(xsave_cpuid_features[i]))
                	xfeatures_mask_all &= ~BIT_ULL(i);

Even with the gaps for XTILE the table is smaller, the code is simpler...
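
The XTILE entries this series adds would then presumably be just two more
initializers of the same kind:

	[XFEATURE_XTILE_CFG]                   = X86_FEATURE_AMX_TILE,
	[XFEATURE_XTILE_DATA]                  = X86_FEATURE_AMX_TILE,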

> +	for (i = 0; i < ARRAY_SIZE(xfeature_capflags); i++) {
> +		short cpu_cap = xfeature_capflags[i].cpu_cap;
> +		int idx = xfeature_capflags[i].xfeature_idx;
> +
> +		if (!boot_cpu_has(cpu_cap))
> +			xfeatures_mask_all &= ~BIT_ULL(idx);
>  	}
>  
>  	xfeatures_mask_all &= fpu__get_supported_xfeatures_mask();

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 19/22] x86/fpu/amx: Enable the AMX feature in 64-bit mode
  2021-02-21 18:56 ` [PATCH v4 19/22] x86/fpu/amx: Enable the AMX feature in 64-bit mode Chang S. Bae
@ 2021-03-20 21:26   ` Thomas Gleixner
  2021-03-23 21:51     ` Bae, Chang Seok
  0 siblings, 1 reply; 78+ messages in thread
From: Thomas Gleixner @ 2021-03-20 21:26 UTC (permalink / raw)
  To: Chang S. Bae, bp, luto, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

On Sun, Feb 21 2021 at 10:56, Chang S. Bae wrote:

> In 64-bit mode, include the AMX state components in
> XFEATURE_MASK_USER_SUPPORTED.
>
> The XFD feature will be used to dynamically expand the xstate per-task
> buffer on the first use.

This patch touches absolutely nothing XFD related. What's the message
here?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 18/22] x86/fpu/amx: Define AMX state components and have it used for boot-time checks
  2021-02-21 18:56 ` [PATCH v4 18/22] x86/fpu/amx: Define AMX state components and have it used for boot-time checks Chang S. Bae
@ 2021-03-20 21:31   ` Thomas Gleixner
  2021-03-23 21:52     ` Bae, Chang Seok
  0 siblings, 1 reply; 78+ messages in thread
From: Thomas Gleixner @ 2021-03-20 21:31 UTC (permalink / raw)
  To: Chang S. Bae, bp, luto, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

On Sun, Feb 21 2021 at 10:56, Chang S. Bae wrote:
>  
> +static void check_xtile_data_against_struct(int size)
> +{
> +	u32 max_palid, palid, state_size;
> +	u32 eax, ebx, ecx, edx;
> +	u16 max_tile;
> +
> +	/*
> +	 * Check the maximum palette id:
> +	 *   eax: the highest numbered palette subleaf.
> +	 */
> +	cpuid_count(TILE_CPUID, 0, &max_palid, &ebx, &ecx, &edx);
> +
> +	/*
> +	 * Cross-check each tile size and find the maximum
> +	 * number of supported tiles.
> +	 */
> +	for (palid = 1, max_tile = 0; palid <= max_palid; palid++) {
> +		u16 tile_size, max;
> +
> +		/*
> +		 * Check the tile size info:
> +		 *   eax[31:16]:  bytes per tile
> +		 *   ebx[31:16]:  the max names (or max number of tiles)
> +		 */
> +		cpuid_count(TILE_CPUID, palid, &eax, &ebx, &ecx, &edx);
> +		tile_size = eax >> 16;
> +		max = ebx >> 16;
> +
> +		if (WARN_ONCE(tile_size != sizeof(struct xtile_data),
> +			      "%s: struct is %zu bytes, cpu xtile %d bytes\n",
> +			      __stringify(XFEATURE_XTILE_DATA),
> +			      sizeof(struct xtile_data), tile_size))
> +			__xstate_dump_leaves();
> +
> +		if (max > max_tile)
> +			max_tile = max;
> +	}
> +
> +	state_size = sizeof(struct xtile_data) * max_tile;
> +	if (WARN_ONCE(size != state_size,
> +		      "%s: calculated size is %u bytes, cpu state %d bytes\n",
> +		      __stringify(XFEATURE_XTILE_DATA), state_size, size))
> +		__xstate_dump_leaves();

So we have 2 warnings which complain about inconsistent state and that's
it? Why has this absolutely no consequences? We just keep stuff enabled
and chug along, right?

Which one of the two states is correct? Why don't we just disable that
muck and be done with it to play it safe?

Failing to execute some workload by saying NO due to inconsistency is
far more useful than taking the chance of potential silent data
corruption.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-02-21 18:56 ` [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state Chang S. Bae
@ 2021-03-20 22:13   ` Thomas Gleixner
  2021-03-20 22:21     ` Andy Lutomirski
                       ` (2 more replies)
  2021-03-26 16:34   ` Jann Horn
  1 sibling, 3 replies; 78+ messages in thread
From: Thomas Gleixner @ 2021-03-20 22:13 UTC (permalink / raw)
  To: Chang S. Bae, bp, luto, mingo, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

On Sun, Feb 21 2021 at 10:56, Chang S. Bae wrote:
> +
> +/* Update MSR IA32_XFD with xfirstuse_not_detected() if needed. */
> +static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
> +{
> +	if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
> +		return;
> +
> +	if (unlikely(prev->state_mask != next->state_mask))
> +		xdisable_setbits(xfirstuse_not_detected(next));
> +}

So this is invoked on context switch. Toggling bit 18 of MSR_IA32_XFD
when it does not match. The spec document says:

  "System software may disable use of Intel AMX by clearing XCR0[18:17], by
   clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended that
   system software initialize AMX state (e.g., by executing TILERELEASE)
   before doing so. This is because maintaining AMX state in a
   non-initialized state may have negative power and performance
   implications."

I'm not seeing anything related to this. Is this a recommendation
which can be ignored or is that going to be duct taped into the code
base once the first user complains about slowdowns of their non AMX
workloads on that machine?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-20 22:13   ` Thomas Gleixner
@ 2021-03-20 22:21     ` Andy Lutomirski
  2021-03-23 21:01       ` Len Brown
  2021-03-23 21:52     ` Bae, Chang Seok
  2021-03-29 13:14     ` Len Brown
  2 siblings, 1 reply; 78+ messages in thread
From: Andy Lutomirski @ 2021-03-20 22:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Borislav Petkov, Andrew Lutomirski, Ingo Molnar, X86 ML,
	Len Brown, Dave Hansen, Liu, Jing2, Ravi V. Shankar, LKML, Bae,
	Chang Seok

On Sat, Mar 20, 2021 at 3:13 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Sun, Feb 21 2021 at 10:56, Chang S. Bae wrote:
> > +
> > +/* Update MSR IA32_XFD with xfirstuse_not_detected() if needed. */
> > +static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
> > +{
> > +     if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
> > +             return;
> > +
> > +     if (unlikely(prev->state_mask != next->state_mask))
> > +             xdisable_setbits(xfirstuse_not_detected(next));
> > +}
>
> So this is invoked on context switch. Toggling bit 18 of MSR_IA32_XFD
> when it does not match. The spec document says:
>
>   "System software may disable use of Intel AMX by clearing XCR0[18:17], by
>    clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended that
>    system software initialize AMX state (e.g., by executing TILERELEASE)
>    before doing so. This is because maintaining AMX state in a
>    non-initialized state may have negative power and performance
>    implications."
>
> I'm not seeing anything related to this. Is this a recommendation
> which can be ignored or is that going to be duct taped into the code
> base once the first user complains about slowdowns of their non AMX
> workloads on that machine?

I have an obnoxious question: do we really want to use the XFD mechanism?

Right now, glibc, and hence most user space code, blindly uses
whatever random CPU features are present for no particularly good
reason, which means that all these features get stuck in the XINUSE=1
state, even if there is no code whatsoever in the process that
benefits.  AVX512 is bad enough as we're seeing right now.  AMX will
be much worse if this happens.

We *could* instead use XCR0 and require an actual syscall to enable
it.  We could even then play games like requiring whomever enables the
feature to allocate memory for the state save area for signals, and
signal delivery could save the state and disable the feature, thus
preventing the signal frame from blowing up to 8 or 12 or who knows
how many kB.
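
Just to sketch the shape of the idea (every name below is invented for
illustration -- no such interface exists today):

#include <err.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_SET_XSTATE_ENABLE	0x1021		/* hypothetical prctl code */
#define XFEATURE_MASK_XTILE	0x60000		/* XCR0 bits 17 and 18 */

	/* A task would have to opt in before its first tile instruction: */
	if (syscall(SYS_arch_prctl, ARCH_SET_XSTATE_ENABLE, XFEATURE_MASK_XTILE))
		err(1, "AMX opt-in refused");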

--Andy

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-20 22:21     ` Andy Lutomirski
@ 2021-03-23 21:01       ` Len Brown
  2021-03-24  3:14         ` Liu, Jing2
  0 siblings, 1 reply; 78+ messages in thread
From: Len Brown @ 2021-03-23 21:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Borislav Petkov, Ingo Molnar, X86 ML, Len Brown,
	Dave Hansen, Liu, Jing2, Ravi V. Shankar, LKML, Bae, Chang Seok

> I have an obnoxious question: do we really want to use the XFD mechanism?

Obnoxious questions are often the most valuable! :-)

> Right now, glibc, and hence most user space code, blindly uses
> whatever random CPU features are present for no particularly good
> reason, which means that all these features get stuck in the XINUSE=1
> state, even if there is no code whatsoever in the process that
> benefits.  AVX512 is bad enough as we're seeing right now.  AMX will
> be much worse if this happens.
>
> We *could* instead use XCR0 and require an actual syscall to enable
> it.  We could even then play games like requiring whomever enables the
> feature to allocate memory for the state save area for signals, and
> signal delivery could save the state and disable the feature, thus
> preventing the signal frame from blowing up to 8 or 12 or who knows
> how many kB.

This approach would have some challenges.

Enumeration today is two parts.
1. CPUID tells you if the feature exists in the HW
2. xgetbv/XCR0 tells you if the OS supports that feature
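
In code, the usual user-space form of that two-step check is roughly as
follows (an illustrative sketch, not code from this series; it assumes
the architectural AMX-TILE bit CPUID.7.0:EDX[24] and XCR0 bits 17/18 for
the tile components, and that OSXSAVE has already been checked):

#include <stdint.h>
#include <cpuid.h>

static int amx_usable(void)
{
	uint32_t eax, ebx, ecx, edx, xcr0_lo, xcr0_hi;

	/* 1. CPUID: does the hardware implement AMX tile support? */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) ||
	    !(edx & (1u << 24)))
		return 0;

	/* 2. XGETBV(0): has the OS enabled XTILECFG/XTILEDATA in XCR0? */
	asm volatile("xgetbv" : "=a" (xcr0_lo), "=d" (xcr0_hi) : "c" (0));
	return (xcr0_lo & 0x60000) == 0x60000;
}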

Since #2 would be missing, you are right, there would need to be
a new API enabling the user to request the OS to enable support
for that task.

If that new API is not invoked before the user touches the feature,
they die with a #UD.

And so there would need to be some assurance that the API is successfully
called before any library might use the feature.  Is there a practical way
to guarantee that, given that the feature may be used (or not) only by
a dynamically
linked library?

If a library spawns threads and queries the size of XSAVE before the API
is called, it may be confused when that size changes after the API is called.

So a simple question, "who calls the API, and when?" isn't so simple.

Finally, note that XCR0 updates cause a VMEXIT,
while XFD updates do not.

So context switching XCR0 is possible, but is problematic.

The other combination is XFD + API rather than XCR0 + API.
With XFD, the context switching is faster, and the faulting (#NM, plus
the new MSR reporting the #NM cause) is viable.
We have the bit set in XCR0, so no state size advantage.
Still have issues with API logistics.
So we didn't see that the API adds any value, only pain,
over transparent 1st use enabling with XFD and no API.

cheers,
Len Brown, Intel Open Source Technology Center

ps. I agree that un-necessary XINUSE=1 is possible.
Notwithstanding the issues initially deploying AVX512, I am skeptical
that it is common today.  IMO, the problem with AVX512 state
is that we guaranteed it will be zero for XINUSE=0.
That means we have to write 0's on saves.  It would be better
to be able to skip the write -- even if we can't save the space
we can save the data transfer.  (This is what we did for AMX).

pps. your idea of requiring the user to allocate their own signal stack
is interesting.   It isn't really about allocating the stack, though --
the stack of the task that uses the feature is generally fine already.
The opportunity is to allow tasks that do *not* use the new feature to
get away with minimal data transfer and stack size.  As we don't
have the 0's guarantee for AMX, we bought the important part
of that back.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 19/22] x86/fpu/amx: Enable the AMX feature in 64-bit mode
  2021-03-20 21:26   ` Thomas Gleixner
@ 2021-03-23 21:51     ` Bae, Chang Seok
  0 siblings, 0 replies; 78+ messages in thread
From: Bae, Chang Seok @ 2021-03-23 21:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: bp, luto, mingo, x86, Brown, Len, Hansen, Dave, Liu, Jing2,
	Shankar, Ravi V, linux-kernel

On Mar 20, 2021, at 14:26, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Sun, Feb 21 2021 at 10:56, Chang S. Bae wrote:
>> In 64-bit mode, include the AMX state components in
>> XFEATURE_MASK_USER_SUPPORTED.
>> 
>> The XFD feature will be used to dynamically expand the xstate per-task
>> buffer on the first use.
> 
> This patch touches absolutely nothing XFD related. What's the message
> here?

You’re right. This is not relevant here. I will remove it.

Thank you,
Chang

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 18/22] x86/fpu/amx: Define AMX state components and have it used for boot-time checks
  2021-03-20 21:31   ` Thomas Gleixner
@ 2021-03-23 21:52     ` Bae, Chang Seok
  0 siblings, 0 replies; 78+ messages in thread
From: Bae, Chang Seok @ 2021-03-23 21:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Borislav Petkov, luto, mingo, x86, Brown, Len, Hansen, Dave, Liu,
	Jing2, Shankar, Ravi V, linux-kernel

On Mar 20, 2021, at 14:31, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Sun, Feb 21 2021 at 10:56, Chang S. Bae wrote:
>> 
>> +static void check_xtile_data_against_struct(int size)
>> +{
>> +	u32 max_palid, palid, state_size;
>> +	u32 eax, ebx, ecx, edx;
>> +	u16 max_tile;
>> +
>> +	/*
>> +	 * Check the maximum palette id:
>> +	 *   eax: the highest numbered palette subleaf.
>> +	 */
>> +	cpuid_count(TILE_CPUID, 0, &max_palid, &ebx, &ecx, &edx);
>> +
>> +	/*
>> +	 * Cross-check each tile size and find the maximum
>> +	 * number of supported tiles.
>> +	 */
>> +	for (palid = 1, max_tile = 0; palid <= max_palid; palid++) {
>> +		u16 tile_size, max;
>> +
>> +		/*
>> +		 * Check the tile size info:
>> +		 *   eax[31:16]:  bytes per tile
>> +		 *   ebx[31:16]:  the max names (or max number of tiles)
>> +		 */
>> +		cpuid_count(TILE_CPUID, palid, &eax, &ebx, &ecx, &edx);
>> +		tile_size = eax >> 16;
>> +		max = ebx >> 16;
>> +
>> +		if (WARN_ONCE(tile_size != sizeof(struct xtile_data),
>> +			      "%s: struct is %zu bytes, cpu xtile %d bytes\n",
>> +			      __stringify(XFEATURE_XTILE_DATA),
>> +			      sizeof(struct xtile_data), tile_size))
>> +			__xstate_dump_leaves();
>> +
>> +		if (max > max_tile)
>> +			max_tile = max;
>> +	}
>> +
>> +	state_size = sizeof(struct xtile_data) * max_tile;
>> +	if (WARN_ONCE(size != state_size,
>> +		      "%s: calculated size is %u bytes, cpu state %d bytes\n",
>> +		      __stringify(XFEATURE_XTILE_DATA), state_size, size))
>> +		__xstate_dump_leaves();
> 
> So we have 2 warnings which complain about inconsistent state and that's
> it? Why has this absolutely no consequences? We just keep stuff enabled
> and jug along, right?
> 
> Which one of the two states is correct? Why don't we just disable that
> muck and be done with it to play it safe?
> 
> Failing to execute some workload by saying NO due to inconsistency is
> far more useful than taking the chance of potential silent data
> corruption.

This change in fact follows the mainline code [1], where this type of warning
is emitted on such a mismatch.

Yes, disabling the feature looks to be the right way. Or, perhaps, taking
the larger of the two sizes is an option on a mismatch?

At least, given the feedback, the mainline code needs to be revised before
applying this. Correct me if you think otherwise.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n567

Thanks,
Chang

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 16/22] x86/fpu/xstate: Extend the table to map state components with features
  2021-03-20 21:25   ` Thomas Gleixner
@ 2021-03-23 21:52     ` Bae, Chang Seok
  0 siblings, 0 replies; 78+ messages in thread
From: Bae, Chang Seok @ 2021-03-23 21:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: bp, luto, mingo, x86, Brown, Len, Hansen, Dave, Liu, Jing2,
	Shankar, Ravi V, linux-kernel

On Mar 20, 2021, at 14:25, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> And you could have changed the existing table just so:
> 
> static unsigned short xsave_cpuid_features[] __initdata = {
> 	[XFEATURE_FP]                          = X86_FEATURE_FPU,
> 	[XFEATURE_SSE]                         = X86_FEATURE_XMM,
> 	[XFEATURE_YMM]                         = X86_FEATURE_AVX,
> 	[XFEATURE_BNDREGS]                     = X86_FEATURE_MPX,
> 	[XFEATURE_BNDCSR]                      = X86_FEATURE_MPX,
> 	[XFEATURE_OPMASK]                      = X86_FEATURE_AVX512F,
> 	[XFEATURE_ZMM_Hi256]                   = X86_FEATURE_AVX512F,
> 	[XFEATURE_Hi16_ZMM]                    = X86_FEATURE_AVX512F,
> 	[XFEATURE_PT_UNIMPLEMENTED_SO_FAR]     = X86_FEATURE_INTEL_PT,
> 	[XFEATURE_PKRU]                        = X86_FEATURE_PKU,
> 	[XFEATURE_PASID]                       = X86_FEATURE_ENQCMD,
> };
> 
> and the implementation to:            
> 
>        for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
> -		if (!boot_cpu_has(xsave_cpuid_features[i]))
> +		if (!xsave_cpuid_features[i] || !boot_cpu_has(xsave_cpuid_features[i]))
>                	xfeatures_mask_all &= ~BIT_ULL(i);
> 
> Even with the gaps for XTILE the table is smaller, the code is simpler…

True, I will follow your suggestion. Maybe I will follow up with a new patch
before posting v5.

Thank you for the suggestion.

Chang


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-20 22:13   ` Thomas Gleixner
  2021-03-20 22:21     ` Andy Lutomirski
@ 2021-03-23 21:52     ` Bae, Chang Seok
  2021-03-24 14:24       ` Dave Hansen
  2021-03-29 13:14     ` Len Brown
  2 siblings, 1 reply; 78+ messages in thread
From: Bae, Chang Seok @ 2021-03-23 21:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Borislav Petkov, luto, mingo, x86, Brown, Len, Hansen, Dave, Liu,
	Jing2, Shankar, Ravi V, linux-kernel

On Mar 20, 2021, at 15:13, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Sun, Feb 21 2021 at 10:56, Chang S. Bae wrote:
>> +
>> +/* Update MSR IA32_XFD with xfirstuse_not_detected() if needed. */
>> +static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
>> +{
>> +	if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
>> +		return;
>> +
>> +	if (unlikely(prev->state_mask != next->state_mask))
>> +		xdisable_setbits(xfirstuse_not_detected(next));
>> +}
> 
> So this is invoked on context switch. Toggling bit 18 of MSR_IA32_XFD
> when it does not match. The spec document says:
> 
>  "System software may disable use of Intel AMX by clearing XCR0[18:17], by
>   clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended that
>   system software initialize AMX state (e.g., by executing TILERELEASE)
>   before doing so. This is because maintaining AMX state in a
>   non-initialized state may have negative power and performance
>   implications."
> 
> I'm not seeing anything related to this. Is this a recommendation
> which can be ignored or is that going to be duct taped into the code
> base once the first user complains about slowdowns of their non AMX
> workloads on that machine?

I think this part of the doc is worth mentioning first:

    “The XTILEDATA state component is very large, and an operating system may
    prefer not to allocate memory for the XTILEDATA state of every user
    thread. Such an operating system that enables Intel AMX might prefer to
    prevent specific user threads from using the feature. An extension called
    extended feature disable (XFD) is added to the XSAVE feature set to
    support such a usage. XFD is described in Section 3.2.6.”

So, in this series, instead of always saving this state, the state is saved
only when used. XFD helps to detect each thread’s first use of those
registers. Thus, the XFD MSR bit is maintained per-task here.
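
In rough pseudo-code, the first-use flow is (a simplified sketch;
alloc_xstate_buffer() stands in for the series' allocation helper, and
xdisable_getbits() for the XFD MSR read):

	/* #NM handler: tile state touched while IA32_XFD[18] was armed */
	if (xdisable_getbits() & XFEATURE_MASK_XTILE_DATA) {
		struct fpu *fpu = &current->thread.fpu;

		alloc_xstate_buffer(fpu, XFEATURE_MASK_XTILE_DATA);
		fpu->state_mask |= XFEATURE_MASK_XTILE_DATA;
		/* disarm XFD for this task and retry the instruction */
		xdisable_setbits(xfirstuse_not_detected(fpu));
	}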

Thanks,
Chang

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-23 21:01       ` Len Brown
@ 2021-03-24  3:14         ` Liu, Jing2
  2021-03-24 21:09           ` Len Brown
  0 siblings, 1 reply; 78+ messages in thread
From: Liu, Jing2 @ 2021-03-24  3:14 UTC (permalink / raw)
  To: Len Brown, Andy Lutomirski
  Cc: Thomas Gleixner, Borislav Petkov, Ingo Molnar, X86 ML, Len Brown,
	Dave Hansen, Liu, Jing2, Ravi V. Shankar, LKML, Bae, Chang Seok



On 3/24/2021 5:01 AM, Len Brown wrote:
>> I have an obnoxious question: do we really want to use the XFD mechanism?
> Obnoxious questions are often the most valuable! :-)
>
> [...]
> cheers,
> Len Brown, Intel Open Source Technology Center
>
> ps. I agree that un-necessary XINUSE=1 is possible.
> Notwithstanding the issues initially deploying AVX512, I am skeptical
> that it is common today.
Sorry, I'm trying to understand from...
> IMO, the problem with AVX512 state
> is that we guaranteed it will be zero for XINUSE=0.
> That means we have to write 0's on saves.
why do "we have to write 0's on saves" when XINUSE=0?

Per the SDM, if XINUSE=0, XSAVES will *not* save the data and the
xstate_bv bit is 0; if XSAVE is used, it still has to save the state, but
the xstate_bv bit is also 0.
>   It would be better
> to be able to skip the write -- even if we can't save the space
> we can save the data transfer.  (This is what we did for AMX).
With the XFD feature, when XFD=1, the XSAVE instruction still has to save
the INIT state to the area. So it seems that with XINUSE=0 and XFD=1, the
XSAVE(S) instructions do the same thing, in that both can help save the
data transfer.

The reason I'm interested in the XINUSE denotation is that it might help
with the XFD MSR context switch cost during vmexit and vmenter.

Thanks,
Jing
>
> pps. your idea of requiring the user to allocate their own signal stack
> is interesting.   It isn't really about allocating the stack, though --
> the stack of the task that uses the feature is generally fine already.
> The opportunity is to allow tasks that do *not* use the new feature to
> get away with minimal data transfer and stack size.  As we don't
> have the 0's guarantee for AMX, we bought the important part
> of that back.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-23 21:52     ` Bae, Chang Seok
@ 2021-03-24 14:24       ` Dave Hansen
  0 siblings, 0 replies; 78+ messages in thread
From: Dave Hansen @ 2021-03-24 14:24 UTC (permalink / raw)
  To: Bae, Chang Seok, Thomas Gleixner
  Cc: Borislav Petkov, luto, mingo, x86, Brown, Len, Liu, Jing2,
	Shankar, Ravi V, linux-kernel

On 3/23/21 2:52 PM, Bae, Chang Seok wrote:
>>  "System software may disable use of Intel AMX by clearing XCR0[18:17], by
>>   clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended that
>>   system software initialize AMX state (e.g., by executing TILERELEASE)
>>   before doing so. This is because maintaining AMX state in a
>>   non-initialized state may have negative power and performance
>>   implications."
>>
>> I'm not seeing anything related to this. Is this a recommendation
>> which can be ignored or is that going to be duct taped into the code
>> base once the first user complains about slowdowns of their non AMX
>> workloads on that machine?
> I think this part of the doc is worth mentioning first:
> 
>     “The XTILEDATA state component is very large, and an operating system may
>     prefer not to allocate memory for the XTILEDATA state of every user
>     thread. Such an operating system that enables Intel AMX might prefer to
>     prevent specific user threads from using the feature. An extension called
>     extended feature disable (XFD) is added to the XSAVE feature set to
>     support such a usage. XFD is described in Section 3.2.6.”
> 
> So, in this series, instead of always saving this state, the state is saved
> only when used. XFD helps to detect each thread’s first use of those
> registers. Thus, the XFD MSR bit is maintained per-task here.

This doesn't really have anything to do with XFD.

The spec says, basically, "as long as you have AMX state in the
registers, you may pay a penalty".

When we switch between userspace tasks, AMX gets automatically
reinitialized by XRSTOR if the task to which we switch is not using AMX.
 All is good there.

But, what if we remain in the kernel?  Let's say kswapd is going to run
for a while.  Does kswapd pay the AMX-not-in-init-state penalty?  Or,
what if we want to go to idle?  Does AMX state affect *how* idle the CPU
can go?

We probably want to actively go out and zap AMX state at some
well-defined boundary.  It's radioactive.  Task switching seems as sane
a place as any to do that.
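
Something tiny at that boundary would do, along these lines (a sketch
only; the feature flag name is the one this series adds, and the .byte
sequence is the documented TILERELEASE encoding):

static inline void zap_amx_state(void)
{
	/* TILERELEASE: put the tile registers back into their init state */
	if (boot_cpu_has(X86_FEATURE_AMX_TILE))
		asm volatile(".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0");
}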

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-24  3:14         ` Liu, Jing2
@ 2021-03-24 21:09           ` Len Brown
  2021-03-24 21:26             ` Andy Lutomirski
  2021-03-25  5:12             ` Liu, Jing2
  0 siblings, 2 replies; 78+ messages in thread
From: Len Brown @ 2021-03-24 21:09 UTC (permalink / raw)
  To: Liu, Jing2
  Cc: Andy Lutomirski, Thomas Gleixner, Borislav Petkov, Ingo Molnar,
	X86 ML, Len Brown, Dave Hansen, Liu, Jing2, Ravi V. Shankar,
	LKML, Bae, Chang Seok

On Tue, Mar 23, 2021 at 11:15 PM Liu, Jing2 <jing2.liu@linux.intel.com> wrote:

> > IMO, the problem with AVX512 state
> > is that we guaranteed it will be zero for XINUSE=0.
> > That means we have to write 0's on saves.

> why do "we have to write 0's on saves" when XINUSE=0?
>
> Per the SDM, if XINUSE=0, XSAVES will *not* save the data and the
> xstate_bv bit is 0; if XSAVE is used, it still has to save the state, but
> the xstate_bv bit is also 0.
> >   It would be better
> > to be able to skip the write -- even if we can't save the space
> > we can save the data transfer.  (This is what we did for AMX).
> With the XFD feature, when XFD=1, the XSAVE instruction still has to save
> the INIT state to the area. So it seems that with XINUSE=0 and XFD=1, the
> XSAVE(S) instructions do the same thing, in that both can help save the
> data transfer.

Hi Jing, Good observation!

There are 3 cases.

1. Task context switch save into the context switch buffer.
Here we use XSAVES, and as you point out, XSAVES includes
the compaction optimization feature tracked by XINUSE.
So when AMX is enabled, but clean, XSAVES doesn't write zeros.
Further, it omits the buffer space for AMX in the destination altogether!
However, since XINUSE=1 is possible, we have to *allocate* a buffer
large enough to handle the dirty data for when XSAVES can not
employ that optimization.

2. Entry into user signal handler saves into the user space sigframe.
Here we use XSAVE, and so the hardware will write zeros for XINUSE=0,
and for AVX512, we save neither time nor space.

My understanding is that for application compatibility, we can *not* compact
the destination buffer that user-space sees.  This is because existing code
may have adopted fixed size offsets.  (which is unfortunate).

And so, for AVX512, we both reserve the space, and we write zeros
for clean AVX512 state.

For AMX, we must still reserve the space, but we are not going to write zeros
for clean state.  We do this in software by checking XINUSE=0, and clearing
the xstate_bv for the XSAVE.  As a result, for XINUSE=0, we can skip
writing the zeros, even though we can't compress the space.  (See the
sketch after case 3 below.)

3. user space always uses fully uncompacted XSAVE buffers.
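
A minimal sketch of that case-2 software check (hedged: xgetbv() is a
helper that executes XGETBV, xfeatures_mask_user is the user-visible
feature mask, XFEATURE_MASK_XTILE_DATA is assumed to be BIT(18), and
xsave_to_sigframe() is a stand-in; this is not the literal PATCH21 code):

	/* XGETBV with ECX=1 returns XCR0 AND XINUSE. */
	u64 mask = xfeatures_mask_user;

	if (!(xgetbv(1) & XFEATURE_MASK_XTILE_DATA))
		mask &= ~XFEATURE_MASK_XTILE_DATA;	/* tile data is clean */

	/* A clear RFBM bit: XSAVE writes neither the data nor the zeros. */
	xsave_to_sigframe(sigframe, mask);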

> The reason I'm interested in XINUSE denotation is that it might be helpful
> for the XFD MSRs context switch cost during vmexit and vmenter.

As the guest OS may be using XFD, the VMM can not use it for itself.
Rather, the VMM must context switch it when it switches between guests.
(or not expose it to guests at all)

cheers,
-Len


cheers,
Len Brown, Intel Open Source Technology Center


* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-24 21:09           ` Len Brown
@ 2021-03-24 21:26             ` Andy Lutomirski
  2021-03-24 21:30               ` Dave Hansen
  2021-03-25  5:12             ` Liu, Jing2
  1 sibling, 1 reply; 78+ messages in thread
From: Andy Lutomirski @ 2021-03-24 21:26 UTC (permalink / raw)
  To: Len Brown
  Cc: Liu, Jing2, Andy Lutomirski, Thomas Gleixner, Borislav Petkov,
	Ingo Molnar, X86 ML, Len Brown, Dave Hansen, Liu, Jing2,
	Ravi V. Shankar, LKML, Bae, Chang Seok




> On Mar 24, 2021, at 2:09 PM, Len Brown <lenb@kernel.org> wrote:
> 
> On Tue, Mar 23, 2021 at 11:15 PM Liu, Jing2 <jing2.liu@linux.intel.com> wrote:
> 
>>> IMO, the problem with AVX512 state
>>> is that we guaranteed it will be zero for XINUSE=0.
>>> That means we have to write 0's on saves.
> 
>> why "we have to write 0's on saves" when XINUSE=0.
>> 
>> Since due to SDM, if XINUSE=0, the XSAVES will *not* save the data and
>> xstate_bv bit is 0; if use XSAVE, it need save the state but
>> xstate_bv bit is also 0.
>>>  It would be better
>>> to be able to skip the write -- even if we can't save the space
>>> we can save the data transfer.  (This is what we did for AMX).
>> With XFD feature that XFD=1, XSAVE command still has to save INIT state
>> to the area. So it seems with XINUSE=0 and XFD=1, the XSAVE(S) commands
>> do the same that both can help save the data transfer.
> 
> Hi Jing, Good observation!
> 
> There are 3 cases.
> 
> 1. Task context switch save into the context switch buffer.
> Here we use XSAVES, and as you point out, XSAVES includes
> the compaction optimization feature tracked by XINUSE.
> So when AMX is enabled, but clean, XSAVES doesn't write zeros.
> Further, it omits the buffer space for AMX in the destination altogether!
> However, since XINUSE=1 is possible, we have to *allocate* a buffer
> large enough to handle the dirty data for when XSAVES can not
> employ that optimization.
> 
> 2. Entry into user signal handler saves into the user space sigframe.
> Here we use XSAVE, and so the hardware will write zeros for XINUSE=0,
> and for AVX512, we save neither time nor space.
> 
> My understanding is that for application compatibility, we can *not* compact
> the destination buffer that user-space sees.  This is because existing code
> may have adopted fixed size offsets.  (which is unfortunate).
> 
> And so, for AVX512, we both reserve the space, and we write zeros
> for clean AVX512 state.
> 
> For AMX, we must still reserve the space, but we are not going to write zeros
> for clean state.  We do this in software by checking XINUSE=0, and clearing
> the xstate_bv for the XSAVE.  As a result, for XINUSE=0, we can skip
> writing the zeros, even though we can't compress the space.

Why?

> 
> 3. user space always uses fully uncompacted XSAVE buffers.
> 

There is no reason we have to do this for new states. Arguably we shouldn’t for AMX to avoid yet another altstack explosion.


* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-24 21:26             ` Andy Lutomirski
@ 2021-03-24 21:30               ` Dave Hansen
  2021-03-24 21:42                 ` Andy Lutomirski
  0 siblings, 1 reply; 78+ messages in thread
From: Dave Hansen @ 2021-03-24 21:30 UTC (permalink / raw)
  To: Andy Lutomirski, Len Brown
  Cc: Liu, Jing2, Andy Lutomirski, Thomas Gleixner, Borislav Petkov,
	Ingo Molnar, X86 ML, Len Brown, Liu, Jing2, Ravi V. Shankar,
	LKML, Bae, Chang Seok

On 3/24/21 2:26 PM, Andy Lutomirski wrote:
>> 3. user space always uses fully uncompacted XSAVE buffers.
>> 
> There is no reason we have to do this for new states. Arguably we
> shouldn’t for AMX to avoid yet another altstack explosion.

The thing that's worried me is that the list of OS-enabled states is
visible to apps via XGETBV.  It doesn't seem too much of a stretch to
think that apps will see AMX enabled with XGETBV and then assume that
it's on the signal stack.

Please tell me I'm being too paranoid.  If we can break this assumption,
it would get rid of a lot of future pain.


* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-24 21:30               ` Dave Hansen
@ 2021-03-24 21:42                 ` Andy Lutomirski
  2021-03-24 21:58                   ` Dave Hansen
  0 siblings, 1 reply; 78+ messages in thread
From: Andy Lutomirski @ 2021-03-24 21:42 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Len Brown, Liu, Jing2, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, Ingo Molnar, X86 ML, Len Brown, Liu, Jing2,
	Ravi V. Shankar, LKML, Bae, Chang Seok


> On Mar 24, 2021, at 2:30 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 3/24/21 2:26 PM, Andy Lutomirski wrote:
>>> 3. user space always uses fully uncompacted XSAVE buffers.
>>> 
>> There is no reason we have to do this for new states. Arguably we
>> shouldn’t for AMX to avoid yet another altstack explosion.
> 
> The thing that's worried me is that the list of OS-enabled states is
> visible to apps via XGETBV.  It doesn't seem too much of a stretch to
> think that apps will see AMX enabled with XGETBV and then assume that
> it's on the signal stack.
> 
> Please tell me I'm being too paranoid.  If we can break this assumption,
> it would get rid of a lot of future pain.

There are no AMX apps. I sure hope that there are no apps that enumerate xfeatures with CPUID and try to decode the mess in the signal stack.

I do think we need to save AMX state *somewhere* if a signal happens unless userspace opts out, but I don’t think it needs to be in the nominally expected spot.


* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-24 21:42                 ` Andy Lutomirski
@ 2021-03-24 21:58                   ` Dave Hansen
  2021-03-24 22:12                     ` Andy Lutomirski
  0 siblings, 1 reply; 78+ messages in thread
From: Dave Hansen @ 2021-03-24 21:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Len Brown, Liu, Jing2, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, Ingo Molnar, X86 ML, Len Brown, Liu, Jing2,
	Ravi V. Shankar, LKML, Bae, Chang Seok

On 3/24/21 2:42 PM, Andy Lutomirski wrote:
>>>> 3. user space always uses fully uncompacted XSAVE buffers.
>>>>
>>> There is no reason we have to do this for new states. Arguably we
>>> shouldn’t for AMX to avoid yet another altstack explosion.
>> The thing that's worried me is that the list of OS-enabled states is
>> visible to apps via XGETBV.  It doesn't seem too much of a stretch to
>> think that apps will see AMX enabled with XGETBV and then assume that
>> it's on the signal stack.
>>
>> Please tell me I'm being too paranoid.  If we can break this
>> assumption, it would get rid of a lot of future pain.
> There are no AMX apps. I sure hope that there are no apps that
> enumerate xfeatures with CPUID and try to decode the mess in the
> signal stack.

I don't think they quite need to decode it in order to be screwed over a
bit.  For instance, I don't think it's too crazy if someone did:

	xcr0 = xgetbv(0);
	xrstor(xcr0, &sig_stack[something]);
	// change some registers
	xsave(xcr0, &sig_stack[something]);

The XRSTOR would work fine, but the XSAVE would overflow the stack
because it would save the AMX state.  It also *looks* awfully benign.
This is true even if the silly signal handler didn't know about AMX at
*ALL*.

A good app would have checked that the xfeatures field in the header
matched xcr0.
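
Or, more robustly, it would size things via CPUID in the first place.  A
hedged sketch (SIG_SLOT_SIZE is a hypothetical constant for the app's
fixed save slot; __cpuid_count() is the GCC <cpuid.h> macro):

	#include <cpuid.h>

	unsigned int eax, ebx, ecx, edx;

	/*
	 * CPUID(EAX=0xD, ECX=0): EBX reports the XSAVE area size in
	 * bytes for the states currently enabled in XCR0.
	 */
	__cpuid_count(0xd, 0, eax, ebx, ecx, edx);
	if (ebx > SIG_SLOT_SIZE)
		abort();	/* XSAVE with the full XCR0 would overflow */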


* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-24 21:58                   ` Dave Hansen
@ 2021-03-24 22:12                     ` Andy Lutomirski
  0 siblings, 0 replies; 78+ messages in thread
From: Andy Lutomirski @ 2021-03-24 22:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Len Brown, Liu, Jing2, Andy Lutomirski, Thomas Gleixner,
	Borislav Petkov, Ingo Molnar, X86 ML, Len Brown, Liu, Jing2,
	Ravi V. Shankar, LKML, Bae, Chang Seok


> On Mar 24, 2021, at 2:58 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 3/24/21 2:42 PM, Andy Lutomirski wrote:
>>>>> 3. user space always uses fully uncompacted XSAVE buffers.
>>>>> 
>>>> There is no reason we have to do this for new states. Arguably we
>>>> shouldn’t for AMX to avoid yet another altstack explosion.
>>> The thing that's worried me is that the list of OS-enabled states is
>>> visible to apps via XGETBV.  It doesn't seem too much of a stretch to
>>> think that apps will see AMX enabled with XGETBV and then assume that
>>> it's on the signal stack.
>>> 
>>> Please tell me I'm being too paranoid.  If we can break this
>>> assumption, it would get rid of a lot of future pain.
>> There are no AMX apps. I sure hope that there are no apps that
>> enumerate xfeatures with CPUID and try to decode the mess in the
>> signal stack.
> 
> I don't think they quite need to decode it in order to be screwed over a
> bit.  For instance, I don't think it's too crazy if someone did:
> 
>    xcr0 = xgetbv(0);
>    xrstor(xcr0, &sig_stack[something]);
>    // change some registers
>    xsave(xcr0, &sig_stack[something]);
> 
> The XRSTOR would work fine, but the XSAVE would overflow the stack
> because it would save the AMX state.  It also *looks* awfully benign.
> This is true even if the silly signal handler didn't know about AMX at
> *ALL*.
> 
> A good app would have checked that the xfeatures field in the header
> matched xcr0.

Ugh.

On the other hand, if we require a syscall to flip the AMX bit in XCR0, we could maybe get away with saying that programs that flip the bit and don’t understand the new ABI get to keep both pieces.

I don’t love futzing with the ABI like this, but AMX is really only barely compatible with everything before it.


* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-24 21:09           ` Len Brown
  2021-03-24 21:26             ` Andy Lutomirski
@ 2021-03-25  5:12             ` Liu, Jing2
  2021-03-25  6:59               ` Bae, Chang Seok
  1 sibling, 1 reply; 78+ messages in thread
From: Liu, Jing2 @ 2021-03-25  5:12 UTC (permalink / raw)
  To: Len Brown
  Cc: Andy Lutomirski, Thomas Gleixner, Borislav Petkov, Ingo Molnar,
	X86 ML, Len Brown, Dave Hansen, Liu, Jing2, Ravi V. Shankar,
	LKML, Bae, Chang Seok



On 3/25/2021 5:09 AM, Len Brown wrote:
> On Tue, Mar 23, 2021 at 11:15 PM Liu, Jing2 <jing2.liu@linux.intel.com> wrote:
>
>>> IMO, the problem with AVX512 state
>>> is that we guaranteed it will be zero for XINUSE=0.
>>> That means we have to write 0's on saves.
>> why "we have to write 0's on saves" when XINUSE=0.
>>
>> Since due to SDM, if XINUSE=0, the XSAVES will *not* save the data and
>> xstate_bv bit is 0; if use XSAVE, it need save the state but
>> xstate_bv bit is also 0.
>>>    It would be better
>>> to be able to skip the write -- even if we can't save the space
>>> we can save the data transfer.  (This is what we did for AMX).
>> With XFD feature that XFD=1, XSAVE command still has to save INIT state
>> to the area. So it seems with XINUSE=0 and XFD=1, the XSAVE(S) commands
>> do the same that both can help save the data transfer.
> Hi Jing, Good observation!
>
> There are 3 cases.
Hi Len, thanks for your reply.
>
> 1. Task context switch save into the context switch buffer.
> Here we use XSAVES, and as you point out, XSAVES includes
> the compaction optimization feature tracked by XINUSE.
> So when AMX is enabled, but clean, XSAVES doesn't write zeros.
> Further, it omits the buffer space for AMX in the destination altogether!
> However, since XINUSE=1 is possible, we have to *allocate* a buffer
> large enough to handle the dirty data for when XSAVES can not
> employ that optimization.
Yes, I agree with you about the first case.
>
> 2. Entry into user signal handler saves into the user space sigframe.
> Here we use XSAVE, and so the hardware will write zeros for XINUSE=0,
> and for AVX512, we save neither time nor space.
>
> My understanding is that for application compatibility, we can *not* compact
> the destination buffer that user-space sees.  This is because existing code
> may have adopted fixed size offsets.  (which is unfortunate).

> And so, for AVX512, we both reserve the space, and we write zeros
> for clean AVX512 state.
For XSAVE, I think this is true if we assume the AVX512 bits in EDX:EAX
are set to 1, which means XSAVE will write zeros when XINUSE=0. Is this
the same assumption as yours?...
> For AMX, we must still reserve the space, but we are not going to write zeros
> for clean state.  We do this in software by checking XINUSE=0, and clearing
> the xstate_bv for the XSAVE.  As a result, for XINUSE=0, we can skip
> writing the zeros, even though we can't compress the space.
So my understanding is that clearing xstate_bv will not help prevent saving
zeros; only masking the bit out of EDX:EAX will, per the following logic.
Not sure if this is just what you mean. :)

RFBM ← XCR0 AND EDX:EAX; /* bitwise logical AND */
OLD_BV ← XSTATE_BV field from XSAVE header;
...
FOR i ← 2 TO 62
    IF RFBM[i] = 1
        THEN save XSAVE state component i at offset n from base of XSAVE area;
    FI;
ENDFOR;

XSTATE_BV field in XSAVE header ← (OLD_BV AND NOT RFBM) OR (XINUSE AND RFBM);

> 3. user space always uses fully uncompacted XSAVE buffers.
>
>> The reason I'm interested in XINUSE denotation is that it might be helpful
>> for the XFD MSRs context switch cost during vmexit and vmenter.
> As the guest OS may be using XFD, the VMM can not use it for itself.
> Rather, the VMM must context switch it when it switches between guests.
> (or not expose it to guests at all)

My understanding is that KVM cannot assume whether userspace QEMU uses XFD
or not, so KVM needs to context switch XFD between vcpu threads on
vmexit/vmenter.

That's why I am interested in detecting XINUSE on vmexit; otherwise, a
wrongly armed IA32_XFD will impact XSAVES/XRSTORS, causing guest AMX state
to be lost.
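
For illustration, a hedged sketch of that switching (vcpu->guest_xfd and
host_xfd are hypothetical names, not actual KVM fields; MSR_IA32_XFD is
the XFD MSR):

	/* vmenter: install the guest's value so its XSAVES/XRSTORS and
	 * #NM behavior match what the guest OS armed. */
	if (vcpu->guest_xfd != host_xfd)
		wrmsrl(MSR_IA32_XFD, vcpu->guest_xfd);

	/* vmexit: restore the host value before any host xstate work,
	 * so a guest-armed XFD cannot make XSAVES treat live guest AMX
	 * state as init and drop it. */
	if (vcpu->guest_xfd != host_xfd)
		wrmsrl(MSR_IA32_XFD, host_xfd);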

Thanks,
Jing
>
> cheers,
> Len Brown, Intel Open Source Technology Center



* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-25  5:12             ` Liu, Jing2
@ 2021-03-25  6:59               ` Bae, Chang Seok
  2021-03-25  7:26                 ` Liu, Jing2
  0 siblings, 1 reply; 78+ messages in thread
From: Bae, Chang Seok @ 2021-03-25  6:59 UTC (permalink / raw)
  To: Liu, Jing2
  Cc: Len Brown, Andy Lutomirski, Thomas Gleixner, Borislav Petkov,
	Ingo Molnar, X86 ML, Brown, Len, Hansen, Dave, Liu, Jing2,
	Shankar, Ravi V, LKML

On Mar 24, 2021, at 22:12, Liu, Jing2 <jing2.liu@linux.intel.com> wrote:
> On 3/25/2021 5:09 AM, Len Brown wrote:
>> 
>> For AMX, we must still reserve the space, but we are not going to write zeros
>> for clean state.  We do this in software by checking XINUSE=0, and clearing
>> the xstate_bv for the XSAVE.  As a result, for XINUSE=0, we can skip
>> writing the zeros, even though we can't compress the space.
> So my understanding is that clearing xstate_bv will not help prevent saving
> zeros; only masking the bit out of EDX:EAX will, per the following logic.
> Not sure if this is just what you mean. :)

FWIW, PATCH21 [1] uses the instruction mask to skip writing zeros on the
sigframe. Then, XSAVE will clear the xstate_bv bit for the XTILEDATA state.

[1] https://lore.kernel.org/lkml/20210221185637.19281-22-chang.seok.bae@intel.com/

Thanks,
Chang


* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-25  6:59               ` Bae, Chang Seok
@ 2021-03-25  7:26                 ` Liu, Jing2
  0 siblings, 0 replies; 78+ messages in thread
From: Liu, Jing2 @ 2021-03-25  7:26 UTC (permalink / raw)
  To: Bae, Chang Seok
  Cc: Len Brown, Andy Lutomirski, Thomas Gleixner, Borislav Petkov,
	Ingo Molnar, X86 ML, Brown, Len, Hansen, Dave, Liu, Jing2,
	Shankar, Ravi V, LKML



>>> For AMX, we must still reserve the space, but we are not going to write zeros
>>> for clean state.  We do this in software by checking XINUSE=0, and clearing
>>> the xstate_bv for the XSAVE.  As a result, for XINUSE=0, we can skip
>>> writing the zeros, even though we can't compress the space.
>> So my understanding is that clearing xstate_bv will not help prevent saving
>> zeros; only masking the bit out of EDX:EAX will, per the following logic.
>> Not sure if this is just what you mean. :)
> FWIW, PATCH21 [1] uses the instruction mask to skip writing zeros on the
> sigframe. Then, XSAVE will clear the xstate_bv bit for the XTILEDATA state.
>
> [1] https://lore.kernel.org/lkml/20210221185637.19281-22-chang.seok.bae@intel.com/
Yes, masking the bit out of EDX:EAX works in such a case. Thanks for
pointing out the patch.

BRs,
Jing
>
> Thanks,
> Chang



* Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-03-20 20:56   ` Thomas Gleixner
@ 2021-03-25 22:59     ` Len Brown
  2021-03-25 23:10       ` Dave Hansen
                         ` (2 more replies)
  0 siblings, 3 replies; 78+ messages in thread
From: Len Brown @ 2021-03-25 22:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Chang S. Bae, Borislav Petkov, Andy Lutomirski, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List, Linux Documentation List

On Sat, Mar 20, 2021 at 4:57 PM Thomas Gleixner <tglx@linutronix.de> wrote:

> We won't enable features which are unknown ever. Keep that presilicon
> test gunk where it belongs: In the Intel poison cabinet along with the
> rest of the code which nobody ever wants to see.

I agree, it would be irresponsible to enable unvalidated features by default,
and pre-silicon "test gunk" should be kept out of the upstream kernel.

This patch series is intended solely to enable fully validated hardware
features, with product quality kernel support.

The reason that the actual AMX feature isn't mentioned until the 16th
patch in this series is because all of the patches before it are generic
state save/restore patches that are not actually specific to AMX.

We call AMX a "simple state feature" -- it actually requires NO KERNEL ENABLING
above the generic state save/restore to fully support userspace AMX
applications.

While not all ISA extensions can be simple state features, we do expect
future features to share this trait, and so we want to be sure that it is simple
to update the kernel to turn those features on (and when necessary, off).

There will be a future CPUID attribute that will help us identify
future simple-state features.
For AMX, of course, we simply know.

So after the generic state management support, the kernel enabling of AMX
is not actually required to run applications.  Just like when a new instruction
is added that re-uses existing state -- the application or library can check
CPUID and just use it.  It is a formality (perhaps an obsolete one) that
we add every feature flag to /proc/cpuinfo for the "benefit" of userspace.

The reasons we propose this cmdline switch are:
1. Ability of customers to disable a feature right away if an issue is found.
Unlike the CPUID cmdline that works on flags, this is the ability to turn
off a feature based on its state number.  I.e., there could be 20 features
that use the same state, and you can turn them all off at once this way
(see the example below).

2. Ability of customers to enable a feature that is disabled by default
in their kernel.  Yes, this will taint their kernel (thanks Andy),
but we have customers that want to run the new feature on day 0
before they have got a distro update to change the default, and this
gives them a way to do that.

Yeah, the cmdline syntax is not a user-friendly mnemonic, and I don't know
that making it so would be an improvement.
Like the CPUID cmdline, it is precise, it is future-proof, and it is
used only in special situations.
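
For example (hedged: this assumes the mask covers XCR0 bits 17 and 18,
XTILECFG and XTILEDATA, so 0x60000 names both AMX state components):

    xstate.disable=0x60000     (turn the AMX state components off)
    xstate.enable=0x60000      (opt in where the kernel default is off)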

thanks,
Len Brown, Intel Open Source Technology Center


* Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-03-25 22:59     ` Len Brown
@ 2021-03-25 23:10       ` Dave Hansen
  2021-03-26 15:27         ` Len Brown
  2021-03-26  1:41       ` Andy Lutomirski
  2021-03-26  1:50       ` Thomas Gleixner
  2 siblings, 1 reply; 78+ messages in thread
From: Dave Hansen @ 2021-03-25 23:10 UTC (permalink / raw)
  To: Len Brown, Thomas Gleixner
  Cc: Chang S. Bae, Borislav Petkov, Andy Lutomirski, Ingo Molnar,
	X86 ML, Brown, Len, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List, Linux Documentation List

On 3/25/21 3:59 PM, Len Brown wrote:
> We call AMX a "simple state feature" -- it actually requires NO KERNEL ENABLING
> above the generic state save/restore to fully support userspace AMX
> applications.
> 
> While not all ISA extensions can be simple state features, we do expect
> future features to share this trait, and so we want to be sure that it is simple
> to update the kernel to turn those features on (and when necessary, off).

From some IRC chats with Thomas and Andy, I think it's safe to say that
they're not comfortable blindly enabling even our "simple features".  I
think we're going to need at least some additional architecture to get
us to a point where everyone will be comfortable.

For instance, AMX might be "simple", but there are really only kludgy
ways to get it back to the init state.  Plus, it's *not* simple in that
state left in the registers can have permanent (as long as the state
remains) power and performance impact.

Also, we probably need to expand the "simple" architecture documentation
a bit.  For instance, we need to promise that things like pkeys which
can cause kernel exceptions will never be enumerated as "simple".


* Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-03-25 22:59     ` Len Brown
  2021-03-25 23:10       ` Dave Hansen
@ 2021-03-26  1:41       ` Andy Lutomirski
  2021-03-26 15:33         ` Len Brown
  2021-03-26  1:50       ` Thomas Gleixner
  2 siblings, 1 reply; 78+ messages in thread
From: Andy Lutomirski @ 2021-03-26  1:41 UTC (permalink / raw)
  To: Len Brown
  Cc: Thomas Gleixner, Chang S. Bae, Borislav Petkov, Andy Lutomirski,
	Ingo Molnar, X86 ML, Brown, Len, Dave Hansen, Liu, Jing2,
	Ravi V. Shankar, Linux Kernel Mailing List,
	Linux Documentation List

On Thu, Mar 25, 2021 at 3:59 PM Len Brown <lenb@kernel.org> wrote:
>
> On Sat, Mar 20, 2021 at 4:57 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> > We won't enable features which are unknown ever. Keep that presilicon
> > test gunk where it belongs: In the Intel poison cabinet along with the
> > rest of the code which nobody ever wants to see.
>
> I agree, it would be irresponsible to enable unvalidated features by default,
> and pre-silicon "test gunk" should be kept out of the upstream kernel.
>
> This patch series is intended solely to enable fully validated
> hardware features,
> with product quality kernel support.
>
> The reason that the actual AMX feature isn't mentioned until the 16th
> patch in this series
> is because all of the patches before it are generic state save/restore patches,
> that are not actually specific to AMX.
>
> We call AMX a "simple state feature" -- it actually requires NO KERNEL ENABLING
> above the generic state save/restore to fully support userspace AMX
> applications.

Regardless of what you call AMX, AMX requires kernel enabling.
Specifically, it appears that leaving AMX in use in the XINUSE sense
degrades system performance and/or power.  And the way to handle that
in kernel (TILERELEASE) cannot possibly be construed as generic.
Here's a little summary of XSTATE features that have failed to be
simple:

 - XMM: seemed simple, but the performance issues switching between
legacy and VEX are still unresolved.  And they affect the kernel, and
people have noticed and complained.

 - ZMM and the high parts of X/YMM: Intel *still* hasn't documented
the actual performance rules.  Reports from people trying to reverse
engineer it suggest that it's horrible on all but the very newest
chips.  For some reason, glibc uses it.  And it broke sigaltstack.  I
have NAKked in-kernel AVX-512 usage until Intel answers a long list of
questions.  No progress yet.

 - PKRU: makes no sense as an XSAVE feature.

 - AMX: XFD, as I understand it, has virtualization problems.  And the
TILERELEASE issue is unresolved.

Intel's track record here is poor.  If you want the kernel to trust
Intel going forward, Intel needs to build trust first.

> So after the generic state management support, the kernel enabling of AMX
> is not actually required to run applications.  Just like when a new instruction
> is added that re-uses existing state -- the application or library can check
> CPUID and just use it.  It is a formality (perhaps an obsolete one), that
> we add every feature flag to /proc/cpuinfo for the "benefit" of userspace.

Even this isn't true.  AVX-512 already Broke ABI (tm).  Sorry for the
big evil words, but existing programs that worked on Linux stopped
working due to kernel enablement of AVX-512.  AMX has the same
problem, except more than an order of magnitude worse.  No credible
resolution has shown up, and the only remotely credible idea anyone
has mentioned is to actually mask AMX in XCR0 until an application
opts in to an as-yet-undetermined new ABI.

--Andy


* Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-03-25 22:59     ` Len Brown
  2021-03-25 23:10       ` Dave Hansen
  2021-03-26  1:41       ` Andy Lutomirski
@ 2021-03-26  1:50       ` Thomas Gleixner
  2021-03-26 15:36         ` Len Brown
  2 siblings, 1 reply; 78+ messages in thread
From: Thomas Gleixner @ 2021-03-26  1:50 UTC (permalink / raw)
  To: Len Brown
  Cc: Chang S. Bae, Borislav Petkov, Andy Lutomirski, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List, Linux Documentation List

Len,

On Thu, Mar 25 2021 at 18:59, Len Brown wrote:
> On Sat, Mar 20, 2021 at 4:57 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> We won't enable features which are unknown ever. Keep that presilicon
>> test gunk where it belongs: In the Intel poison cabinet along with the
>> rest of the code which nobody ever wants to see.
>
> I agree, it would be irresponsible to enable unvalidated features by default,
> and pre-silicon "test gunk" should be kept out of the upstream kernel.

Well, that's not my experience from the past and sorry for being
paranoid about that.

> This patch series is intended solely to enable fully validated
> hardware features, with product quality kernel support.

The fact that the function is broken as provided definitely supports
that product quality argument.

> The reason that the actual AMX feature isn't mentioned until the 16th
> patch in this series is because all of the patches before it are
> generic state save/restore patches, that are not actually specific to
> AMX.

That's related to 22/22 in which way?

> We call AMX a "simple state feature" -- it actually requires NO KERNEL
> ENABLING above the generic state save/restore to fully support
> userspace AMX applications.

Aside from the unanswered questions about the impact of leaving it in
initialized state, along with the unsolved problem of sigaltstacks...

> While not all ISA extensions can be simple state features, we do
> expect future features to share this trait, and so we want to be sure
> that it is simple to update the kernel to turn those features on (and
> when necessary, off).

History tells me a different story.

> There will be a future CPUID attribute that will help us identify
> future simple-state features.
> For AMX, of course, we simply know.

You believe so, but do you know for sure?

I neither know for sure nor do I believe any of this at all.

Please provide the architectural document which guarantees that and does
so in a way that it can be evaluated by the kernel. Have not seen that,
so it does not exist at all.

  Future CPUID attributes are as useful as the tweet of today.

> So after the generic state management support, the kernel enabling of AMX
> is not actually required to run applications.  Just like when a new instruction
> is added that re-uses existing state -- the application or library can check
> CPUID and just use it.  It is a formality (perhaps an obsolete one), that
> we add every feature flag to /proc/cpuid for the "benefit" of
> userspace.

It's not a formality when the instruction requires kernel support and
from the history of the various incarnations of this command line option
it's just a given that this is going belly up.

Even the current incarnation is broken just from looking at it, so what
the heck are you talking about?

> The reason we propose this cmdline switch is
> 1. Ability of customers to disable a feature right away if an issue is found.
> Unlike the CPUid cmdline that works on flags, this is the ability to turn
> off a feature based on its state number.  Ie.  There could be 20 features
> that use the same state, and you can turn them all off at once this
> way.

I'm fine with that, but then the disabling has to handle all the things
related to it and not just be on a 'pray that it works' basis.

> 2. Ability of customers to enable a feature that is disabled by default
> in their kernel.  Yes, this will taint their kernel (thanks Andy),
> but we have customers that want to run the new feature on day 0
> before they have got a distro update to change the default, and this
> gives them a way to do that.

You might know my opinion from previous discussions about this topic,
but let me repeat it for completeness' sake:

   This is a generic kernel exposed to a gazillion of users and a
   minority of them want to have the ability to enable insane
   stuff on the command line because:

     1) Intel is not able to provide them a test kernel package

     2) Their favourite $DISTROVENDOR is not able to provide them a
        test kernel package

     3) Intel did not manage to get the support for this upstream
        on time so the $DISTROVENDOR was able to backport it into
        their Frankenkernel

   So you seriously want us to have a command line option to enable
   whatever the feature of today is because of #1-#3?

   Sure, from a Intel managerial POV that's all cool. Not so much when
   you put your community hat on and think about the consequences.

   Aside of that none of the above #1 - #3 is a technical argument.  See
   Documentation/process/* for further enlightenment.

Of course none of your arguments above have shown up in the changelog of
this command line patch. And none of the potential side effects or
downsides have been mentioned.

Don't blame Chang Bae for that. That patch carries a:

      Reviewed-by: Len Brown <len.brown@intel.com>

I really have to ask whether you actually looked at the code and the
changelog or just tagged it because some internal procedure requires it.

Either way ....

> Yeah, the cmdline syntax is not a user-friendly mnemonic, and I don't know
> that making it so would be an improvement.
> Like the CPUID cmdline, it is precise, it is future-proof, and it is
> used only in special situations.

The CPUID commandline option is yet another trainwreck which is neither
precise nor future proof if you dare to take a deep technical look. It
should have never been merged and it should be ripped out rather than
proliferated. If you think otherwise then please provide a proper proof
that this commandline option is correct under all circumstances before
abusing it as an argument.

Please try again when you have

  - a reviewable and functionally correct implementation

  - including the ability to evaluate that via architectural CPUID

  - a changelog which provides an argument based solely on technical
    criteria instead of wishful managerial thinking or being just void
    of content like the current one.

Sorry for looking at this solely from the technical side and thereby
ignoring all the managerial powerpoint slide illusions.

Now putting my managerial hat on:

    Given the history of that command line option, I have no idea why
    this has even been tried to piggyback on AMX at all. It's an
    orthogonal problem and absolutely not required to make AMX supported
    in the first place.

    Hrm, unless you expect that a lot of users will need to disable AMX
    because ... But that would be a technical reason not to enable it
    in the first place, which is not desired from a managerial/marketing
    POV, right?

Thanks,

        tglx


* Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-03-25 23:10       ` Dave Hansen
@ 2021-03-26 15:27         ` Len Brown
  2021-03-26 19:22           ` Thomas Gleixner
  0 siblings, 1 reply; 78+ messages in thread
From: Len Brown @ 2021-03-26 15:27 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Chang S. Bae, Borislav Petkov, Andy Lutomirski,
	Ingo Molnar, X86 ML, Brown, Len, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List, Linux Documentation List

On Thu, Mar 25, 2021 at 7:10 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 3/25/21 3:59 PM, Len Brown wrote:
> > We call AMX a "simple state feature" -- it actually requires NO KERNEL ENABLING
> > above the generic state save/restore to fully support userspace AMX
> > applications.
> >
> > While not all ISA extensions can be simple state features, we do expect
> > future features to share this trait, and so we want to be sure that it is simple
> > to update the kernel to turn those features on (and when necessary, off).
>
> From some IRC chats with Thomas and Andy, I think it's safe to say that
> they're not comfortable blindly enabling even our "simple features".  I
> think we're going to need at least some additional architecture to get
> us to a point where everyone will be comfortable.

Hi Dave,

There is no code in this patch series, including patch 22, that enables
an unvalidated feature by default.

Yes, I fully accept that patch 22 allows a user to enable something
that a distro didn't validate.

If there is a new requirement that the kernel cmdline not allow anything
that a distro didn't explicitly validate, then about 99.9% of the kernel cmdline
options that exist today would need to be removed.

Does such a requirement exist, or does it not?

thanks,
-Len


* Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-03-26  1:41       ` Andy Lutomirski
@ 2021-03-26 15:33         ` Len Brown
  2021-03-26 15:48           ` Andy Lutomirski
  0 siblings, 1 reply; 78+ messages in thread
From: Len Brown @ 2021-03-26 15:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Chang S. Bae, Borislav Petkov, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List, Linux Documentation List

On Thu, Mar 25, 2021 at 9:42 PM Andy Lutomirski <luto@kernel.org> wrote:

> Regardless of what you call AMX, AMX requires kernel enabling.

I submit that, after the generic XFD support is in place,
there is exactly 1 bit that needs to be flipped to enable
user applications to benefit from AMX.

I submit that the patch that knows about AMX and double-checks the
state size is superfluous.

I submit that updating /proc/cpuinfo is superfluous.

What AMX-specific kernel enabling did I miss?

> Specifically, it appears that leaving AMX in use in the XINUSE sense
> degrades system performance and/or power.

Please share the specifics about what performance or power issue you anticipate.

thanks,
-Len


* Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-03-26  1:50       ` Thomas Gleixner
@ 2021-03-26 15:36         ` Len Brown
  0 siblings, 0 replies; 78+ messages in thread
From: Len Brown @ 2021-03-26 15:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Chang S. Bae, Borislav Petkov, Andy Lutomirski, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List, Linux Documentation List

On Thu, Mar 25, 2021 at 9:50 PM Thomas Gleixner <tglx@linutronix.de> wrote:

> Please provide the architectural document which guarantees that and does
> so in a way that it can be evaluated by the kernel. Have not seen that,
> so it does not exist at all.
>
>   Future CPUID attributes are as useful as the tweet of today.

I will do so the moment I am permitted.
I'm fine with dropping patch 22 until it can rely on the assurance of
that architectural feature.

thanks,
Len Brown, Intel Open Source Technology Center


* Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-03-26 15:33         ` Len Brown
@ 2021-03-26 15:48           ` Andy Lutomirski
  2021-03-26 17:53             ` Len Brown
  0 siblings, 1 reply; 78+ messages in thread
From: Andy Lutomirski @ 2021-03-26 15:48 UTC (permalink / raw)
  To: Len Brown
  Cc: Andy Lutomirski, Thomas Gleixner, Chang S. Bae, Borislav Petkov,
	Ingo Molnar, X86 ML, Brown, Len, Dave Hansen, Liu, Jing2,
	Ravi V. Shankar, Linux Kernel Mailing List,
	Linux Documentation List

On Fri, Mar 26, 2021 at 8:34 AM Len Brown <lenb@kernel.org> wrote:
>
> On Thu, Mar 25, 2021 at 9:42 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> > Regardless of what you call AMX, AMX requires kernel enabling.
>
> I submit, that after the generic XFD support is in place,
> there is exactly 1 bit that needs to be flipped to enable
> user applications to benefit from AMX.

The TILERELEASE opcode itself is rather longer than one bit, and the
supporting code to invoke it at the right time, to avoid corrupting
user state, and avoid causing performance regressions merely by
existing will be orders of magnitude more than 1 bit.  Of course, all
of this is zero bits in the current series because the code is
missing entirely.

To avoid email thread blowup:

> If there is a new requirement that the kernel cmdline not allow anything
> that a distro didn't explicitly validate, then about 99.9% of the kernel cmdline
> options that exist today would need to be removed.
>
> Does such a requirement exist, or does it not?

This isn't just about validation.  There's also ABI, performance, and
correctness:

ABI: The AVX-512 enablement *already* broke user ABI.  Sadly no one
told anyone in the kernel community until about 5 years after the
fact, and it's a bit late to revert AVX-512.  But we don't want to
enable AMX until the ABI has a reasonable chance of being settled.
Ditto for future features.  As it stands, if you xstate.enable some
16MB feature, the system may well simply fail to boot as too many user
processes explode.

Performance:

We *still* don't know the performance implications of leaving the AMX
features in use inappropriately.  Does it completely destroy idle?
Will it literally operate CPUs out of spec such that Intel's
reliability estimates will be invalidated?  (We had that with NVMe
APST.  Let's not repeat this with XSTATE.)  The performance impacts
and transitions for AVX-512 are, to put it charitably, forthcoming.

Correctness: PKRU via the kernel's normal XSAVE path would simply be
incorrect.  Do we really trust that this won't get repeated?  Also,
frankly, a command line option that may well break lots of userspace
but that we fully expect Intel to recommend setting is not a good
thing.

--Andy


* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-02-21 18:56 ` [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state Chang S. Bae
  2021-03-20 22:13   ` Thomas Gleixner
@ 2021-03-26 16:34   ` Jann Horn
  2021-03-29 18:14     ` Bae, Chang Seok
  1 sibling, 1 reply; 78+ messages in thread
From: Jann Horn @ 2021-03-26 16:34 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: Borislav Petkov, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Len Brown, Dave Hansen, jing2.liu,
	Ravi V. Shankar, kernel list

On Sun, Feb 21, 2021 at 7:56 PM Chang S. Bae <chang.seok.bae@intel.com> wrote:
> Intel's Extended Feature Disable (XFD) feature is an extension of the XSAVE
> architecture. XFD allows the kernel to enable a feature state in XCR0 and
> to receive a #NM trap when a task uses instructions accessing that state.
> In this way, Linux can defer allocating the large XSAVE buffer until tasks
> need it.
>
> XFD introduces two MSRs: IA32_XFD to enable/disable the feature and
> IA32_XFD_ERR to assist the #NM trap handler. Both use the same
> state-component bitmap format, used by XCR0.
>
> Use this hardware capability to find the right time to expand the xstate
> buffer. Introduce two sets of helper functions for that:
>
> 1. The first set is primarily for interacting with the XFD hardware:
>         xdisable_setbits()
>         xdisable_getbits()
>         xdisable_switch()
>
> 2. The second set is for managing the first-use status and handling #NM
>    trap:
>         xfirstuse_enabled()
>         xfirstuse_not_detected()
>
> The #NM handler induces the xstate buffer expansion to save the first-used
> states.
[...]
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 7f5aec758f0e..821a7f408ad4 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
[...]
> +static __always_inline bool handle_xfirstuse_event(struct fpu *fpu)
> +{
> +       bool handled = false;
> +       u64 event_mask;
[...]
> +       if (alloc_xstate_buffer(fpu, event_mask))
> +               return handled;
[...]
> +}
> +
>  DEFINE_IDTENTRY(exc_device_not_available)
>  {
>         unsigned long cr0 = read_cr0();
>
> +       if (handle_xfirstuse_event(&current->thread.fpu))
> +               return;

What happens if handle_xfirstuse_event() fails because vmalloc()
failed in alloc_xstate_buffer()? I think that should probably kill the
task with something like force_sig() - but as far as I can tell, at
the moment, it will instead end up at die(), which should only be used
for kernel bugs.
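
Roughly, a hedged sketch of the suggested fix, reusing the names from the
quoted patch:

	if (alloc_xstate_buffer(fpu, event_mask)) {
		/* vmalloc() failed: kill the task rather than die() */
		force_sig(SIGSEGV);
		return true;
	}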

> +
>  #ifdef CONFIG_MATH_EMULATION
>         if (!boot_cpu_has(X86_FEATURE_FPU) && (cr0 & X86_CR0_EM)) {
>                 struct math_emu_info info = { };
> --
> 2.17.1
>
>


* Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-03-26 15:48           ` Andy Lutomirski
@ 2021-03-26 17:53             ` Len Brown
  2021-03-26 18:12               ` Andy Lutomirski
  2021-03-26 18:17               ` Borislav Petkov
  0 siblings, 2 replies; 78+ messages in thread
From: Len Brown @ 2021-03-26 17:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Chang S. Bae, Borislav Petkov, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List, Linux Documentation List

On Fri, Mar 26, 2021 at 11:48 AM Andy Lutomirski <luto@kernel.org> wrote:

> > I submit, that after the generic XFD support is in place,
> > there is exactly 1 bit that needs to be flipped to enable
> > user applications to benefit from AMX.
>
> The TILERELEASE opcode itself is rather longer than one bit, and the
> supporting code to invoke it at the right time, to avoid corrupting
> user state, and avoid causing performance regressions merely by
> existing will be orders of magnitude more than 1 bit.  Of course, all
> of this is zero bits in the current series because the code is
> > missing entirely.

Please explain why the kernel must know about the TILERELEASE
instruction in order for an AMX application to run properly.

> This isn't just about validation.  There's also ABI, performance, and
> correctness.

Thank you for agreeing that this is not about unvalidated features.

> ABI: The AVX-512 enablement *already* broke user ABI.  Sadly no one
> told anyone in the kernel community until about 5 years after the
> fact, and it's a bit late to revert AVX-512.  But we don't want to
> enable AMX until the ABI has a reasonable chance of being settled.
> Ditto for future features.  As it stands, if you xstate.enable some
> 16MB feature, the system may well simply fail to boot as too many user
> processes explode.

At Dave's suggestion, we had a 64 *KB* sanity check on this path.
Boris forced us to remove it, because we could not tell him
how we chose the number 64.

I would be delighted to see a check for 64 KB restored, and that
it be a rejection, rather than a warning.  At this point, as there is no
way to go down that path without manually modifying the kernel, it would
devolve into a sanity check for a hardware (CPUID) bug.

> Performance:
>
> We *still* don't know the performance implications of leaving the AMX
> features in use inappropriately.  Does it completely destroy idle?

No.

> Will it literally operate CPUs out of spec such that Intel's
> reliability estimates will be invalidated?

No.

>  (We had that with NVMe APST.  Let's not repeat this with XSTATE.)

I acknowledge that the possibility of broken hardware always exists.
However, I don't see how the experience with broken NVMe actually applies here,
other than general paranoia about new features (which is, arguably, healthy).

>  The performance impacts
> and transitions for AVX-512 are, to put it charitably, forthcoming.

I acknowledge the parallels with AVX-512, in that AMX adds new instructions,
and it has even bigger registers.  I also acknowledge that the AVX-512 rollout
(and arguably, its brief existence on client CPUs) was problematic.

My understanding is that Intel continues to learn (a lot) from its mistakes.
I believe that the AVX-512 credits problem has been largely eliminated
on newer Xeons.

My understanding is that AMX is implemented only in CPUs that actually
have the hardware to properly support AMX.  If it were not, then that would
be a problem for Intel to deal with in hardware, not a problem for Linux
to deal with in software.

> Correctness: PKRU via the kernel's normal XSAVE path would simply be
> incorrect.  Do we really trust that this won't get repeated?  Also,
> frankly, a command line option that may well break lots of userspace
> but that we fully expect Intel to recommend setting is not a good
> thing.

There is no analogy between AMX and PKRU, except the fact that they
are both features, and at one time, both were new.

I am unaware of anybody at Intel recommending that any cmdline
be set that would break userspace.

thanks,
Len Brown, Intel Open Source Technology Center


* Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-03-26 17:53             ` Len Brown
@ 2021-03-26 18:12               ` Andy Lutomirski
  2021-03-27  4:53                 ` Len Brown
  2021-03-26 18:17               ` Borislav Petkov
  1 sibling, 1 reply; 78+ messages in thread
From: Andy Lutomirski @ 2021-03-26 18:12 UTC (permalink / raw)
  To: Len Brown
  Cc: Andy Lutomirski, Thomas Gleixner, Chang S. Bae, Borislav Petkov,
	Ingo Molnar, X86 ML, Brown, Len, Dave Hansen, Liu, Jing2,
	Ravi V. Shankar, Linux Kernel Mailing List,
	Linux Documentation List

On Fri, Mar 26, 2021 at 10:54 AM Len Brown <lenb@kernel.org> wrote:
>
> On Fri, Mar 26, 2021 at 11:48 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> > > I submit, that after the generic XFD support is in place,
> > > there is exactly 1 bit that needs to be flipped to enable
> > > user applications to benefit from AMX.
> >
> > The TILERELEASE opcode itself is rather longer than one bit, and the
> > supporting code to invoke it at the right time, to avoid corrupting
> > user state, and avoid causing performance regressions merely by
> > existing will be orders of magnitude more than 1 bit.  Of course, all
> > of this is zero bits in the current series because the code is
> > missing entirely.
>
> Please explain why the kernel must know about the TILERELEASE
> instruction in order for an AMX application to run properly.

I'm just repeating things already said, and this is getting
ridiculous.  TILERELEASE isn't needed for an AMX application to run
properly -- it's needed for the rest of the system to run properly, at
least according to Intel's published docs.  Quoting the current ISE
document:

3.3 RECOMMENDATIONS FOR SYSTEM SOFTWARE

System software may disable use of Intel AMX by clearing XCR0[18:17], by
clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended that
system software initialize AMX state (e.g., by executing TILERELEASE)
before doing so. This is because maintaining AMX state in a
non-initialized state may have negative power and performance
implications.

Since you reviewed the patch set, I assume you are familiar with how
Linux manages XSTATE.  Linux does *not* eagerly load XSTATE on context
switch.  Instead, Linux loads XSTATE when the kernel needs it loaded
or before executing user code.  This means that the kernel can (and
does, and it's a performance win) execute kernel thread code and/or go
idle, *including long-term deep idle*, with user XSTATE loaded.
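
A hedged sketch of that deferral (mainline tracks it with
TIF_NEED_FPU_LOAD; the helper names here are illustrative):

	/* Context switch: save the outgoing task's registers, but do
	 * NOT load the incoming task's xstate yet... */
	save_fpregs_to_fpstate(prev_fpu);
	set_tsk_thread_flag(next, TIF_NEED_FPU_LOAD);
	/* ...so prev's user XSTATE -- possibly non-init AMX tiles --
	 * stays in the registers while kthreads run or the CPU idles. */

	/* Return to user: only now is next's xstate actually loaded. */
	if (test_thread_flag(TIF_NEED_FPU_LOAD))
		switch_fpu_return();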


>
> > This isn't just about validation.  There's also ABI, performance, and
> > correctness.
>
> Thank you for agreeing that this is not about unvalidated features.
>
> > ABI: The AVX-512 enablement *already* broke user ABI.  Sadly no one
> > told anyone in the kernel community until about 5 years after the
> > fact, and it's a bit late to revert AVX-512.  But we don't want to
> > enable AMX until the ABI has a reasonable chance of being settled.
> > Ditto for future features.  As it stands, if you xstate.enable some
> > 16MB feature, the system may well simply fail to boot as too many user
> > processes explode.
>
> At Dave's suggestion, we had a 64 *KB* sanity check on this path.
> Boris forced us to remove it, because we could not tell him
> how we chose the number 64.
>
> I would be delighted to see a check for 64 KB restored, and that
> it be a rejection, rather than a warning.  At this point, as there is no
> way to go down that path without manually modifying the kernel, it would
> devolve into a sanity check for a hardware (CPUID) bug.

This is nuts.  The ABI is ALREADY BROKEN.  How does picking a random
number quantifying additional breakage help?  We do not have a good
design for AVX-512 in Linux, we don't have a good design for AMX in
Linux, and we absolutely don't have a good design for the secret
feature we don't know about yet in Linux.


* Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-03-26 17:53             ` Len Brown
  2021-03-26 18:12               ` Andy Lutomirski
@ 2021-03-26 18:17               ` Borislav Petkov
  2021-03-27  4:41                 ` Len Brown
  1 sibling, 1 reply; 78+ messages in thread
From: Borislav Petkov @ 2021-03-26 18:17 UTC (permalink / raw)
  To: Len Brown
  Cc: Andy Lutomirski, Thomas Gleixner, Chang S. Bae, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List, Linux Documentation List

On Fri, Mar 26, 2021 at 01:53:47PM -0400, Len Brown wrote:
> At Dave's suggestion, we had a 64 *KB* sanity check on this path.
> Boris forced us to remove it, because we could not tell him
> how we chose the number 64.

The only 64 I can remember is

#define XSTATE_BUFFER_MAX_BYTES              (64 * 1024)

What does an arbitrary number have to do with signal handling and
pushing a fat frame on the sigaltstack?

-- 
Regards/Gruss,
    Boris.

SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG Nürnberg


* Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-03-26 15:27         ` Len Brown
@ 2021-03-26 19:22           ` Thomas Gleixner
  0 siblings, 0 replies; 78+ messages in thread
From: Thomas Gleixner @ 2021-03-26 19:22 UTC (permalink / raw)
  To: Len Brown, Dave Hansen
  Cc: Chang S. Bae, Borislav Petkov, Andy Lutomirski, Ingo Molnar,
	X86 ML, Brown, Len, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List, Linux Documentation List

Len,

On Fri, Mar 26 2021 at 11:27, Len Brown wrote:
> On Thu, Mar 25, 2021 at 7:10 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> From some IRC chats with Thomas and Andy, I think it's safe to say that
>> they're not comfortable blindly enabling even our "simple features".  I
>> think we're going to need at least some additional architecture to get
>> us to a point where everyone will be comfortable.
>
> There is no code in this patch series, including patch 22, that enables
> an unvalidated feature by default.
>
> Yes, I fully accept that patch 22 allows a user to enable something
> that a distro didn't validate.

That's not the point. And neither Andy nor I asked for distros to
validate and approve anything.

> If there is a new requirement that the kernel cmdline not allow anything
> that a distro didn't explicitly validate, then about 99.9% of the kernel cmdline
> options that exist today would need to be removed.
>
> Does such a requirement exist, or does it not?

Nobody said that, but patches to remove command line options are always
welcome. Can we start with the most horrible of all we have today, i.e.
"clearcpuid=", please?

Thanks,

        tglx


* Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-03-26 18:17               ` Borislav Petkov
@ 2021-03-27  4:41                 ` Len Brown
  0 siblings, 0 replies; 78+ messages in thread
From: Len Brown @ 2021-03-27  4:41 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Thomas Gleixner, Chang S. Bae, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List, Linux Documentation List

On Fri, Mar 26, 2021 at 2:17 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Fri, Mar 26, 2021 at 01:53:47PM -0400, Len Brown wrote:
> > At Dave's suggestion, we had a 64 *KB* sanity check on this path.
> > Boris forced us to remove it, because we could not tell him
> > how we chose the number 64.
>
> The only 64 I can remember is
>
> #define XSTATE_BUFFER_MAX_BYTES              (64 * 1024)
>
> What does an arbitrary number have to do with signal handling and
> pushing a fat frame on the sigaltstack?

You are right.  If that is where the check was, it was the wrong place.
It should be part of that sanity check code where CPUID is parsed.

thanks,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-03-26 18:12               ` Andy Lutomirski
@ 2021-03-27  4:53                 ` Len Brown
  2021-03-27 22:20                   ` Thomas Gleixner
  0 siblings, 1 reply; 78+ messages in thread
From: Len Brown @ 2021-03-27  4:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Chang S. Bae, Borislav Petkov, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List, Linux Documentation List

> 3.3 RECOMMENDATIONS FOR SYSTEM SOFTWARE
>
> System software may disable use of Intel AMX by clearing XCR0[18:17], by
> clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended that
> system software initialize AMX state (e.g., by executing TILERELEASE)
> before doing so. This is because maintaining AMX state in a
> non-initialized state may have negative power and performance
> implications.

I agree that the wording here about disabling AMX is ominous.

The hardware initializes with AMX disabled.
The kernel probes AMX, enables it in XCR0, and keeps it enabled.

Initially, XFD is "armed" for all tasks.
When a task accesses AMX state, #NM fires, we allocate a context
switch buffer, and we "disarm" XFD for that task.
As we have that buffer in-hand for the lifetime of the task, we never
"arm" XFD for that task again.

XFD is context switched, and so the next time it is set is when we
are restoring some other task's state.

n.b. I'm describing the Linux flow.  The VMM scenario is a little different.
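
Schematically, using the helper names from this series (the control
flow here is a sketch, not the actual patch):

	/* #NM handler: first touch of TILE state by this task */
	if (handle_xfirstuse_event(&current->thread.fpu))
		return;	/* buffer allocated; XFD disarmed for the
			 * lifetime of the task */

	/* On context switch, IA32_XFD is then only rewritten when
	 * prev and next differ -- see xdisable_switch() in patch 14. */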

> Since you reviewed the patch set, I assume you are familiar with how
> Linux manages XSTATE.  Linux does *not* eagerly load XSTATE on context
> switch.  Instead, Linux loads XSTATE when the kernel needs it loaded
> or before executing user code.  This means that the kernel can (and
> does, and it's a performance win) execute kernel thread code and/or go
> idle, *including long-term deep idle*, with user XSTATE loaded.

Yes, this scenario is clear.

There are several cases.

1. Since TMM registers are volatile, a routine using TMM that wants
   them to persist across a call must save them, and will TILERELEASE
   before invoking that call.  That is the calling convention, and I
   expect that if it is not followed, debugging (of tools) will occur
   until it is.  (See the sketch after case 3 below.)

   The only way for a user program's XSTATE to be present during the
   kernel's call to idle is if it sleeps via a system call when no
   other task wants to run on that CPU.

   Since system calls are calls, in this case, AMX INIT=1 during idle.
   All deep C-states are enabled, and the idle CPU is able to
   contribute its maximum turbo budget to its peers.

2. A correct program with live TMM registers takes an interrupt, and
   we enter the kernel with AMX INIT=0.  Yes, we will enter the
   syscall at the frequency of the app (like we always do).  Yes,
   turbo frequency may be limited by the activity of this processor
   and its peers (like it always is).

   2a. If we return to the same program, then depending on how long
       the syscall runs, we may execute the program and the system
       call code at a frequency lower than we might if AMX INIT=1 at
       time of interrupt.

   2b. If we context switch to a task that has AMX INIT=1, then any
       AMX-imposed limits on turbo are immediately gone.

   Note for 2b.  4 generations have passed since SKX had significant
   delay releasing AVX-512 credits.  The delay in the first hardware
   that supports AMX should be negligible for both AVX-512 and AMX.

3. A buggy or purposely bogus program is fully empowered to violate
   the programming conventions.  Say such a program called a long
   sleep, and nothing else wanted to run on that CPU, so the kernel
   went idle with AMX INIT=0.  Indeed, this could retard the core
   from getting into the deepest available C-state, which could
   impact the turbo budget of neighboring cores.  However, if that
   were some kind of DOS, it would be simpler and more effective to
   simply hog a CPU by running code.  Also, as soon as another thread
   switches in with INIT=1, there is no concept of AMX frequency
   caps. (see note for 2b)
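
A sketch of the case-1 convention, using the AMX intrinsics from
immintrin.h (the function and its callee are illustrative):

	#include <immintrin.h>	/* build with -mamx-tile */

	extern void helper_that_may_sleep(void);

	static void use_tiles(const void *buf, long stride)
	{
		_tile_loadd(0, buf, stride);	/* tmm0 live: AMX INIT=0 */
		/* ... TMUL work on the tile registers ... */
		_tile_release();		/* tiles back to INIT=1   */
		helper_that_may_sleep();	/* e.g., makes a syscall  */
	}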

I do not see a situation where the kernel needs to issue TILERELEASE
(though a VMM likely would).
What did I miss?

thanks,
Len Brown, Intel Open Source Technology Center

ps. I will respond to your ABI thoughts on your new ABI thread.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-03-27  4:53                 ` Len Brown
@ 2021-03-27 22:20                   ` Thomas Gleixner
  2021-03-29 13:31                     ` Len Brown
  0 siblings, 1 reply; 78+ messages in thread
From: Thomas Gleixner @ 2021-03-27 22:20 UTC (permalink / raw)
  To: Len Brown, Andy Lutomirski
  Cc: Chang S. Bae, Borislav Petkov, Ingo Molnar, X86 ML, Brown, Len,
	Dave Hansen, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List, Linux Documentation List

Len,

On Sat, Mar 27 2021 at 00:53, Len Brown wrote:
>> 3.3 RECOMMENDATIONS FOR SYSTEM SOFTWARE
>>
>> System software may disable use of Intel AMX by clearing XCR0[18:17], by
>> clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended that
>> system software initialize AMX state (e.g., by executing TILERELEASE)
>> before doing so. This is because maintaining AMX state in a
>> non-initialized state may have negative power and performance
>> implications.
>
> I agree that the wording here about disabling AMX is ominous.

Which is what I pointed out 7 days ago already, but that got lost in the
ABI and command line noise... Thanks Andy for bringing it back!

> The hardware initializes with AMX disabled.
> The kernel probes AMX, enables it in XCR0, and keeps it enabled.
>
> Initially, XFD is "armed" for all tasks.
> When a task accesses AMX state, #NM fires, we allocate a context
> switch buffer, and we "disarm" XFD for that task.
> As we have that buffer in-hand for the lifetime of the task, we never
> "arm" XFD for that task again.
>
> XFD is context switched, and so the next time it is set is when we
> are restoring some other task's state.
>
> n.b. I'm describing the Linux flow.  The VMM scenario is a little different.
>
>> Since you reviewed the patch set, I assume you are familiar with how
>> Linux manages XSTATE.  Linux does *not* eagerly load XSTATE on context
>> switch.  Instead, Linux loads XSTATE when the kernel needs it loaded
>> or before executing user code.  This means that the kernel can (and
>> does, and it's a performance win) execute kernel thread code and/or go
>> idle, *including long-term deep idle*, with user XSTATE loaded.
>
> Yes, this scenario is clear.
>
> There are several cases.
>
> 1. Since TMM registers are volatile, a routine using TMM that wants
>    them to persist across a call must save them, and will TILERELEASE
>    before invoking that call.  That is the calling convention, and I
>    expect that if it is not followed, debugging (of tools) will occur
>    until it is.
>
>    The only way for a user program's XSTATE to be present during the
>    kernel's call to idle is if it sleeps via a system call when no
>    other task wants to run on that CPU.
>
>    Since system calls are calls, in this case, AMX INIT=1 during idle.

What is the guarantee for that? A calling convention?

That's uninteresting because that's only the recommended and desired
state and not the guaranteed state.

>    All deep C-states are enabled, and the idle CPU is able to
>    contribute its maximum turbo budget to its peers.
>
> 2. A correct program with live TMM registers takes an interrupt, and
>    we enter the kernel with AMX INIT=0.  Yes, we will enter the
>    syscall at the frequency of the app (like we always do).

That's about interrupts not syscalls and I assume this should be all
s/syscall/interrupt/ for the whole #2 including 2a

>    Yes, turbo frequency may be limited by the activity of this
>    processor and its peers (like it always is).
>
>    2a. If we return to the same program, then depending on how long
>        the syscall runs, we may execute the program and the system
>        call code at a frequency lower than we might if AMX INIT=1 at
>        time of interrupt.

So the frequency effect is relevant for the duration of the interrupt
and the eventually appended soft interrupt, right?

The program state is uninteresting because even if the kernel would
do XSAVES, TILERELEASE on interrupt entry then it would restore the
state before returning and then the program would have the same
conditions as before the interrupt.

>    2b. If we context switch to a task that has AMX INIT=1, then any
>        AMX-imposed limits on turbo are immediately gone.

Immediately on context switch? Definitely not.

      switch_to(prev, next)
        XSAVES(prev)
        eventually set XFD[18]

The point where AMX INIT=1 of 'next' becomes relevant is on return to
user space where XRSTORS happens. Up to that point AMX INIT=0 stays in
effect.
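
Spelled out (function names approximate the mainline FPU code):

      schedule()
        switch_to(prev, next)
          XSAVES(prev)                  /* prev's AMX state saved      */
          set TIF_NEED_FPU_LOAD         /* XRSTORS(next) is deferred   */
      ...
      /* arbitrarily later, and only if 'next' returns to user space: */
      exit_to_user_mode()
        XRSTORS(next)                   /* AMX INIT flips back to 1    */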

Now what guarantees that 'next' is returning to user space immediately?

Nothing.

If it's a user task this can be a wakeup for whatever which might cause
another wait depending on the callchain that task is in. It can be
preempted before reaching XRSTORS which is the point that matters to
flip the AMX INIT state back to 1.

It can be a kernel task or a chain of kernel tasks with arbitrary
runtime.

As a consequence the scheduler might migrate 'prev' from CPU_A to CPU_L
and what happens to that state on CPU_A? Does it magically move along
with 'prev' to CPU_L? I can't see how, but what do I know about magic.

So now the chain of kernel tasks finishes and there is nothing to do,
CPU_A goes idle with AMX INIT=0, which prevents the CPU from going deep,
drains power, can't contribute to the turbo state or whatever undesired
side effects that has.

You can get the same effect not only by device interrupts but also by
regular task migration, ptrace, breakpoints, any form of traps,
exceptions the task triggers in user space, user space freezing, kill -9
and .....

> 3. A buggy or purposely bogus program is fully empowered to violate
>    the programming conventions.  Say such a program called a long
>    sleep, and nothing else wanted to run on that CPU, so the kernel
>    went idle with AMX INIT=0.  Indeed, this could retard the core
>    from getting into the deepest available C-state, which could
>    impact the turbo budget of neighboring cores.  However, if that
>    were some kind of DOS, it would be simpler and more effective to
>    simply hog a CPU by running code.  Also, as soon as another thread
>    switches in with INIT=1, there is no concept of AMX frequency
>    caps. (see note for 2b)

It's irrelevant whether this is intentionally buggy or not. It's equally
irrelevant whether this is a stupid attempt of DOS or not.

What's relevant is that this has undesired side effects of various
sorts.

> I do not see a situation where the kernel needs to issue TILERELEASE
> (though a VMM likely would).

So #3 does not qualify for you? Interesting POV.

> What did I miss?

See #2.b

What's the actual downside of issuing TILERELEASE conditionally
depending on prev->AMX INIT=0? Is it slooooow or what's the real
problem here?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-20 22:13   ` Thomas Gleixner
  2021-03-20 22:21     ` Andy Lutomirski
  2021-03-23 21:52     ` Bae, Chang Seok
@ 2021-03-29 13:14     ` Len Brown
  2021-03-29 13:33       ` Thomas Gleixner
  2 siblings, 1 reply; 78+ messages in thread
From: Len Brown @ 2021-03-29 13:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Chang S. Bae, Borislav Petkov, Andy Lutomirski, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List

On Sat, Mar 20, 2021 at 6:14 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Sun, Feb 21 2021 at 10:56, Chang S. Bae wrote:
> > +
> > +/* Update MSR IA32_XFD with xfirstuse_not_detected() if needed. */
> > +static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
> > +{
> > +     if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
> > +             return;
> > +
> > +     if (unlikely(prev->state_mask != next->state_mask))
> > +             xdisable_setbits(xfirstuse_not_detected(next));
> > +}
>
> So this is invoked on context switch. Toggling bit 18 of MSR_IA32_XFD
> when it does not match. The spec document says:
>
>   "System software may disable use of Intel AMX by clearing XCR0[18:17], by
>    clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended that
>    system software initialize AMX state (e.g., by executing TILERELEASE)
>    before doing so. This is because maintaining AMX state in a
>    non-initialized state may have negative power and performance
>    implications."
>
> I'm not seeing anything related to this. Is this a recommendation
> which can be ignored or is that going to be duct taped into the code
> base once the first user complains about slowdowns of their non AMX
> workloads on that machine?

I found the author of this passage, and he agreed to revise it to say this
was targeted primarily at VMMs.

"negative power and performance implications" refers to the fact that
the processor will not enter C6 when AMX INIT=0, instead it will demote
to the next shallower C-state, eg C1E.

(this is because the C6 flow doesn't save the AMX registers)

For customers that have C6 enabled, the inability of a core to enter C6
may impact the maximum turbo frequency of other cores.

thanks,
-Len Brown
Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-03-27 22:20                   ` Thomas Gleixner
@ 2021-03-29 13:31                     ` Len Brown
  2021-03-29 14:10                       ` Thomas Gleixner
  0 siblings, 1 reply; 78+ messages in thread
From: Len Brown @ 2021-03-29 13:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Chang S. Bae, Borislav Petkov, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List, Linux Documentation List

On Sat, Mar 27, 2021 at 6:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:

> What's the actual downside of issuing TILERELEASE conditionally
> depending on prev->AMX INIT=0? Is it slooooow or what's the real
> problem here?

TILERELEASE is fast, so there should be no downside to executing it.
Indeed, checking whether you need to execute it or not will probably take
longer than executing TILERELEASE.  My point (perhaps academic)
is that Linux should not have to know about TILERELEASE, or execute it.

Re: running in the kernel with AMX INIT=0

AMX INIT=0 will prevent C6 on that core.  I don't expect to see this
in the syscall path, though nothing forces a user to issue TILERELEASE.

It can certainly happen on the interrupt path, but on the interrupt path
I don't know if we can end up requesting C6 -- perhaps on a forced
task migration?

Re:  frequency credits in the kernel with AMX INIT=0.

It works exactly the same way as AMX INIT=1.
That is to say, the frequency credits don't key off of AMX INIT,
they key off of the actual use of the AMX execution unit, and
the credits free up several orders of magnitude faster
(both for AVX-512 and AMX) on this hardware than in previous generations.

As a result, if we interrupt an AMX program, and run for an extended
period of time in the kernel without XRSTOR to clear out its AMX INIT=0 state,
that will not have any impact on the frequency we run inside the kernel any more
than if it had AMX INIT=1 state.

thanks,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-29 13:14     ` Len Brown
@ 2021-03-29 13:33       ` Thomas Gleixner
  2021-03-29 15:43         ` Len Brown
  0 siblings, 1 reply; 78+ messages in thread
From: Thomas Gleixner @ 2021-03-29 13:33 UTC (permalink / raw)
  To: Len Brown
  Cc: Chang S. Bae, Borislav Petkov, Andy Lutomirski, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List

On Mon, Mar 29 2021 at 09:14, Len Brown wrote:
> On Sat, Mar 20, 2021 at 6:14 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> On Sun, Feb 21 2021 at 10:56, Chang S. Bae wrote:
>> > +
>> > +/* Update MSR IA32_XFD with xfirstuse_not_detected() if needed. */
>> > +static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
>> > +{
>> > +     if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
>> > +             return;
>> > +
>> > +     if (unlikely(prev->state_mask != next->state_mask))
>> > +             xdisable_setbits(xfirstuse_not_detected(next));
>> > +}
>>
>> So this is invoked on context switch. Toggling bit 18 of MSR_IA32_XFD
>> when it does not match. The spec document says:
>>
>>   "System software may disable use of Intel AMX by clearing XCR0[18:17], by
>>    clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended that
>>    system software initialize AMX state (e.g., by executing TILERELEASE)
>>    before doing so. This is because maintaining AMX state in a
>>    non-initialized state may have negative power and performance
>>    implications."
>>
>> I'm not seeing anything related to this. Is this a recommendation
>> which can be ignored or is that going to be duct taped into the code
>> base once the first user complains about slowdowns of their non AMX
>> workloads on that machine?
>
> I found the author of this passage, and he agreed to revise it to say this
> was targeted primarily at VMMs.

Why would this only be a problem for VMMs?

> "negative power and performance implications" refers to the fact that
> the processor will not enter C6 when AMX INIT=0, instead it will demote
> to the next shallower C-state, eg C1E.
>
> (this is because the C6 flow doesn't save the AMX registers)
>
> For customers that have C6 enabled, the inability of a core to enter C6
> may impact the maximum turbo frequency of other cores.

That's the same on bare metal, right?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support
  2021-03-29 13:31                     ` Len Brown
@ 2021-03-29 14:10                       ` Thomas Gleixner
  0 siblings, 0 replies; 78+ messages in thread
From: Thomas Gleixner @ 2021-03-29 14:10 UTC (permalink / raw)
  To: Len Brown
  Cc: Andy Lutomirski, Chang S. Bae, Borislav Petkov, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List, Linux Documentation List

On Mon, Mar 29 2021 at 09:31, Len Brown wrote:
> On Sat, Mar 27, 2021 at 6:20 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> What's the actual downside of issuing TILERELEASE conditionally
>> depending on prev->AMX INIT=0? Is it slooooow or what's the real
>> problem here?
>
> TILERELEASE is fast, so there should be no downside to executing it.
> Indeed, checking whether you need to execute it or not will probably take
> longer than executing TILERELEASE.  My point (perhaps academic)
> is that Linux should not have to know about TILERELEASE, or execute it.
>
> Re: running in the kernel with AMX INIT=0
>
> AMX INIT=0 will prevent C6 on that core.  I don't expect to see this
> in the syscall path, though nothing forces a user to issue TILERELEASE.
>
> It can certainly happen on the interrupt path, but on the interrupt path
> I don't know if we can end up requesting C6 -- perhaps on a forced
> task migration?

I think I clearly described how it can end up in that situation and that
there are a gazillion ways to get there.

If I decide at 5PM to call it a day after hitting the breakpoint, then I
really would appreciate that the machine goes deep idle instead of
staying at C1(E) until 9AM when I come back.

> Re:  frequency credits in the kernel with AMX INIT=0.
>
> It works exactly the same way as AMX INIT=1.
> That is to say, the frequency credits don't key off of AMX INIT,
> they key off of the actual use of the AMX execution unit, and
> the credits free up several orders of magnitude faster
> (both for AVX-512 and AMX) on this hardware than in previous generations.
>
> As a result, if we interrupt an AMX program, and run for an extended
> period of time in the kernel without XRSTOR to clear out its AMX INIT=0 state,
> that will not have any impact on the frequency we run inside the kernel any more
> than if it had AMX INIT=1 state.

Ok. That's clearly missing in documentation, but it does not solve the C
state issue at all.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-29 13:33       ` Thomas Gleixner
@ 2021-03-29 15:43         ` Len Brown
  2021-03-29 16:06           ` Len Brown
  2021-03-29 18:49           ` Thomas Gleixner
  0 siblings, 2 replies; 78+ messages in thread
From: Len Brown @ 2021-03-29 15:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Chang S. Bae, Borislav Petkov, Andy Lutomirski, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List

On Mon, Mar 29, 2021 at 9:33 AM Thomas Gleixner <tglx@linutronix.de> wrote:

> > I found the author of this passage, and he agreed to revise it to say this
> > was targeted primarily at VMMs.
>
> > Why would this only be a problem for VMMs?

VMMs may have to emulate different hardware for different guest OS's,
and they would likely "context switch" XCR0 to achieve that.

As switching XCR0 at run-time would confuse the heck out of user-space,
it was not imagined that a bare-metal OS would do that.

But yes, if a bare metal OS doesn't support any threading libraries
that query XCR0 with xgetbv, and they don't care about the performance
impact of switching XCR0, they could choose to switch XCR0 and
would want to TILERELEASE to assure C6 access, if it is enabled.

> > "negative power and performance implications" refers to the fact that
> > the processor will not enter C6 when AMX INIT=0, instead it will demote
> > to the next shallower C-state, eg C1E.
> >
> > (this is because the C6 flow doesn't save the AMX registers)
> >
> > For customers that have C6 enabled, the inability of a core to enter C6
> > may impact the maximum turbo frequency of other cores.
>
> That's the same on bare metal, right?

Yes, the hardware works exactly the same way.

thanks,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-29 15:43         ` Len Brown
@ 2021-03-29 16:06           ` Len Brown
  2021-03-29 17:43             ` Andy Lutomirski
  2021-03-29 18:49           ` Thomas Gleixner
  1 sibling, 1 reply; 78+ messages in thread
From: Len Brown @ 2021-03-29 16:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Chang S. Bae, Borislav Petkov, Andy Lutomirski, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List

On Mon, Mar 29, 2021 at 11:43 AM Len Brown <lenb@kernel.org> wrote:
>
> On Mon, Mar 29, 2021 at 9:33 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> > > I found the author of this passage, and he agreed to revise it to say this
> > > was targeted primarily at VMMs.
> >
> > Why would this only be a problem for VMMs?
>
> VMMs may have to emulate different hardware for different guest OS's,
> and they would likely "context switch" XCR0 to achieve that.
>
> As switching XCR0 at run-time would confuse the heck out of user-space,
> it was not imagined that a bare-metal OS would do that.

to clarify...
*switching* XCR0 on context switch is slow, but perfectly legal.

*changing* XCR0 during the lifetime of a process, in any of its tasks,
on any of its CPUs, will confuse any software that uses xgetbv/XCR0
to calculate the size of XSAVE buffers for userspace threading.
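
To make the concern concrete, a rough userspace sketch of the pattern
(a threading library sizing its XSAVE buffers once from CPUID):

	#include <stdint.h>
	#include <cpuid.h>	/* __get_cpuid_count() */

	static uint32_t xsave_size_for_current_xcr0(void)
	{
		uint32_t eax, ebx, ecx, edx;

		/* CPUID.(EAX=0xD, ECX=0):EBX = bytes XSAVE needs for
		 * the features currently enabled in XCR0. */
		__get_cpuid_count(0xD, 0, &eax, &ebx, &ecx, &edx);
		return ebx;
	}

A size cached at startup becomes wrong the moment the kernel changes
XCR0 underneath the process.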


-- 
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-29 16:06           ` Len Brown
@ 2021-03-29 17:43             ` Andy Lutomirski
  2021-03-29 18:57               ` Len Brown
  0 siblings, 1 reply; 78+ messages in thread
From: Andy Lutomirski @ 2021-03-29 17:43 UTC (permalink / raw)
  To: Len Brown
  Cc: Thomas Gleixner, Chang S. Bae, Borislav Petkov, Andy Lutomirski,
	Ingo Molnar, X86 ML, Brown, Len, Dave Hansen, Liu, Jing2,
	Ravi V. Shankar, Linux Kernel Mailing List


> On Mar 29, 2021, at 9:06 AM, Len Brown <lenb@kernel.org> wrote:
> 
> On Mon, Mar 29, 2021 at 11:43 AM Len Brown <lenb@kernel.org> wrote:
>> 
>> On Mon, Mar 29, 2021 at 9:33 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> 
>>>> I found the author of this passage, and he agreed to revise it to say this
>>>> was targeted primarily at VMMs.
>>> 
>>> Why would this only be a problem for VMMs?
>> 
>> VMMs may have to emulate different hardware for different guest OS's,
>> and they would likely "context switch" XCR0 to achieve that.
>> 
>> As switching XCR0 at run-time would confuse the heck out of user-space,
>> it was not imagined that a bare-metal OS would do that.
> 
> to clarify...
> *switching* XCR0 on context switch is slow, but perfectly legal.

How slow is it?  And how slow is switching XFD?  XFD is definitely serializing?
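
One way to get numbers on bare metal (a kernel-mode sketch; IA32_XFD
is MSR 0x1c5 per the AMX architecture specification):

	u64 t0, t1;

	t0 = rdtsc_ordered();
	wrmsrl(0x1c5, 1ULL << 18);	/* IA32_XFD: arm AMX first-use */
	t1 = rdtsc_ordered();
	pr_info("IA32_XFD write: %llu cycles\n", t1 - t0);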

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-26 16:34   ` Jann Horn
@ 2021-03-29 18:14     ` Bae, Chang Seok
  0 siblings, 0 replies; 78+ messages in thread
From: Bae, Chang Seok @ 2021-03-29 18:14 UTC (permalink / raw)
  To: Jann Horn
  Cc: Borislav Petkov, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Brown, Len, Hansen, Dave, Liu, Jing2,
	Shankar, Ravi V, kernel list

On Mar 26, 2021, at 09:34, Jann Horn <jannh@google.com> wrote:
> On Sun, Feb 21, 2021 at 7:56 PM Chang S. Bae <chang.seok.bae@intel.com> wrote:
>> 
>> +       if (handle_xfirstuse_event(&current->thread.fpu))
>> +               return;
> 
> What happens if handle_xfirstuse_event() fails because vmalloc()
> failed in alloc_xstate_buffer()? I think that should probably kill the
> task with something like force_sig() - but as far as I can tell, at
> the moment, it will instead end up at die(), which should only be used
> for kernel bugs.

This question was raised on v1 before [1].

In the end, people suggested handling the failure, e.g., with tracepoints or
stats. So, I proposed this at the allocation site:

+	state_ptr = vmalloc(newsz);
+	if (!state_ptr) {
+		trace_x86_fpu_xstate_alloc_failed(fpu);
+		return -ENOMEM;
+	}

Also, I tried to justify this to Boris [2]:

  >> Maybe it is possible to backtrack this allocation failure out of #NM
  >> handling. But the tracepoint can provide a clear context, although limited
  >> to those using it.

  > Yes, add it when it is really needed. Not slapping it proactively and
  > hoping for any potential usage.

Let me know if you have a better way.

Thanks,
Chang

[1] https://lore.kernel.org/lkml/c4669d5f-11b8-3879-562c-78a791b86229@intel.com/
[2] https://lore.kernel.org/lkml/20210204131002.GA17068@zn.tnic/
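
For comparison, the force_sig() route would look roughly like this in
the #NM path (a sketch; xfirstuse_alloc_failed() is a hypothetical
helper, not from the series):

	if (handle_xfirstuse_event(&current->thread.fpu))
		return;			/* handled: buffer expanded */

	if (xfirstuse_alloc_failed(&current->thread.fpu)) {
		force_sig(SIGSEGV);	/* kill the task instead of die() */
		return;
	}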


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-29 15:43         ` Len Brown
  2021-03-29 16:06           ` Len Brown
@ 2021-03-29 18:49           ` Thomas Gleixner
  2021-03-29 22:16             ` Len Brown
  1 sibling, 1 reply; 78+ messages in thread
From: Thomas Gleixner @ 2021-03-29 18:49 UTC (permalink / raw)
  To: Len Brown
  Cc: Chang S. Bae, Borislav Petkov, Andy Lutomirski, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List

On Mon, Mar 29 2021 at 11:43, Len Brown wrote:
> On Mon, Mar 29, 2021 at 9:33 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> But yes, if a bare metal OS doesn't support any threading libraries
> that query XCR0 with xgetbv, and they don't care about the performance
> impact of switching XCR0, they could choose to switch XCR0 and
> would want to TILERELEASE to assure C6 access, if it is enabled.

That's not the point. The C6 issue has nothing to do with the ABI
considerations vs. XCR0.

According to documentation it is irrelevant whether AMX usage is
disabled via XCR0, CR4.OSXSAVE or XFD[18]. In any case the effect of
AMX INIT=0 will prevent C6.

As I explained in great length there are enough ways to get into a
situation where this can happen and a CPU goes idle with AMX INIT=0.

So what are we supposed to do?

   - Use TILERELEASE on context switch after XSAVES?

   - Any other mechanism on context switch

   - Clear XFD[18] when going idle and issue TILERELEASE depending
     on the last state

   - Use any other means to set the thing back into INIT=1 state when
     going idle

There is no option 'shrug and ignore' unfortunately.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-29 17:43             ` Andy Lutomirski
@ 2021-03-29 18:57               ` Len Brown
  0 siblings, 0 replies; 78+ messages in thread
From: Len Brown @ 2021-03-29 18:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Chang S. Bae, Borislav Petkov, Andy Lutomirski,
	Ingo Molnar, X86 ML, Brown, Len, Dave Hansen, Liu, Jing2,
	Ravi V. Shankar, Linux Kernel Mailing List

On Mon, Mar 29, 2021 at 1:43 PM Andy Lutomirski <luto@amacapital.net> wrote:

> > *switching* XCR0 on context switch is slow, but perfectly legal.
>
> How slow is it?  And how slow is switching XFD?  XFD is definitely serializing?

XCR0 writes in a VM guest cause a VMEXIT.
XFD writes in a VM guest do not.

I will find out what the relative cost is on bare metal, where VMEXIT
is not an issue.
-- 
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-29 18:49           ` Thomas Gleixner
@ 2021-03-29 22:16             ` Len Brown
  2021-03-30  8:28               ` Thomas Gleixner
  0 siblings, 1 reply; 78+ messages in thread
From: Len Brown @ 2021-03-29 22:16 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Chang S. Bae, Borislav Petkov, Andy Lutomirski, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List

On Mon, Mar 29, 2021 at 2:49 PM Thomas Gleixner <tglx@linutronix.de> wrote:

> According to documentation it is irrelevant whether AMX usage is
> disabled via XCR0, CR4.OSXSAVE or XFD[18]. In any case the effect of
> AMX INIT=0 will prevent C6.
>
> As I explained in great length there are enough ways to get into a
> situation where this can happen and a CPU goes idle with AMX INIT=0.
>
> So what are we supposed to do?

Let me know if this problem description is fair:

Many-core Xeon servers will support AMX, and when I run an AMX application
on one and take an interrupt with AMX INIT=0, Linux may go idle on my CPU.
If Linux cpuidle requests C6, the hardware will demote to C1E.

The concern is that a core in C1E will negatively impact its own power,
or the performance of a neighboring core.

This is what we are talking about, right?

First, I should mention that if I threw a dart at a map of Xeons
deployed across the universe, the chances are "significant" that I'd
hit one that is configured with C6 disabled, and this discussion would be moot.

Second, I should mention that Linux cpuidle demotes from deep C-states
to shallow ones all day long.  This is typically due to expected timer
expiration, and other heuristics.

Third, I should mention that the processor itself demotes from C6 to C1E
for a number of reasons -- basically like what Linux is doing, but in HW.

Albeit, the hardware does have the capability to "un-demote" when it demotes
and recognizes it made a mistake, and that "un-demote" capability would
not be present if the reason for demotion was AMX INIT=0.

Okay, that said, let's assume we have found a system where this problem
could happen, and we use it in a way that makes it happen.  Would we notice?

If your system were profoundly idle, and one or more cores were in C1E,
then it would prevent the SOC from entering Package C6 (if enabled).
Yes, there is a measurable idle power difference between Package C1E
and Package C6.  (indeed, this is why Package C6 exists).

I'm delighted that there are Xeon customers, who care about this power savings.
Unfortunately, they are the exception, not the rule.

If you were to provoke this scenario on many cores simultaneously, then
I expect you could detect a power difference between C1E and CC6.
However, that difference would be smaller than the difference
in power due to the frequency choice of the running cores,
because it is basically just the L2-leakage vs L2-off difference.

Regarding frequency credits for a core being in C1E vs C6.
Yes, this is factored into the frequency credits for turbo mode.
How much impact, I can't say, because that information is not yet available.
However, this is mitigated by the fact that Xeon single core turbo
is deployed differently than client.  Xeons are deployed
more with multi-core turbo in mind, and so how much you'll
notice C1E vs C6 may not be significant, unless perhaps it happened
on all the cores across the system.

>    - Use TILERELEASE on context switch after XSAVES?

Yes, that would be perfectly reasonable.

>    - Any other mechanism on context switch

XRSTOR of a context with INIT=1 would also do it.

>    - Clear XFD[18] when going idle and issue TILERELEASE depending
>      on the last state

I think you mean to *set* XFD.
When the task touched AMX, he took a #NM, and we cleared XFD for that task.
So when we get here, XFD is already clear (unarmed).
Nevertheless, the setting of XFD is moot here.

>    - Use any other means to set the thing back into INIT=1 state when
>      going idle

TILERELEASE and XRSTOR are the tools in the toolbox, if necessary.

> There is no option 'shrug and ignore' unfortunately.

I'm not going to say it is impossible that this path will matter.
If some terrible things go wrong with the hardware, and the hardware
is configured and used in a very specific way, yes, this could matter.

In the grand scheme of things, this is a pretty small issue,
say, compared to the API discussion.

thanks,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-29 22:16             ` Len Brown
@ 2021-03-30  8:28               ` Thomas Gleixner
  2021-03-30 16:38                 ` Len Brown
  0 siblings, 1 reply; 78+ messages in thread
From: Thomas Gleixner @ 2021-03-30  8:28 UTC (permalink / raw)
  To: Len Brown
  Cc: Chang S. Bae, Borislav Petkov, Andy Lutomirski, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List

Len,

On Mon, Mar 29 2021 at 18:16, Len Brown wrote:
> On Mon, Mar 29, 2021 at 2:49 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> Let me know if this problem description is fair:
>
> Many-core Xeon servers will support AMX, and when I run an AMX application
> on one and take an interrupt with AMX INIT=0, Linux may go idle on my CPU.
> If Linux cpuidle requests C6, the hardware will demote to C1E.
>
> The concern is that a core in C1E will negatively impact its own power,
> or the performance of a neighboring core.
>
> This is what we are talking about, right?

Correct.

> I'm delighted that there are Xeon customers, who care about this power savings.
> Unfortunately, they are the exception, not the rule.

That may be true or not. The point is that there is some side effect and
from a correctness point of view it needs to be addressed.

>>    - Use TILERELEASE on context switch after XSAVES?
>
> Yes, that would be perfectly reasonable.
>
>>    - Any other mechanism on context switch
>
> XRSTOR of a context with INIT=1 would also do it.
>
>>    - Clear XFD[18] when going idle and issue TILERELEASE depending
>>      on the last state
>
> I think you mean to *set* XFD.
> When the task touched AMX, he took a #NM, and we cleared XFD for that task.
> So when we get here, XFD is already clear (unarmed).
> Nevertheless, the setting of XFD is moot here.

No. We set XFD when the task which used AMX schedules out. If the CPU
reaches idle without going back to user space then XFD is still set and
AMX INIT still 0. So my assumption was that in order to issue
TILERELEASE before going idle, you need to clear XFD[18] first, but I
just saw in the spec that it is not necessary.

>>    - Use any other means to set the thing back into INIT=1 state when
>>      going idle
>
> TILERELEASE and XRSTOR are the tools in the toolbox, if necessary.
>
>> There is no option 'shrug and ignore' unfortunately.
>
> I'm not going to say it is impossible that this path will matter.
> If some terrible things go wrong with the hardware, and the hardware
> is configured and used in a very specific way, yes, this could matter.

So then let me summarize for the bare metal case:

   1) The paragraph in the specification is unclear and needs to be
      rephrased.

      What I understood from your explanations so far:

      When AMX is disabled by clearing XCR0[18:17], by clearing
      CR4.OSXSAVE, or by setting IA32_XFD[18], then there are no
      negative side effects due to AMX INIT=0 as long as the CPU is
      executing.

      The only possible side effect is when the CPU goes idle because
      AMX INIT=0 limits C states.

   2) As a consequence of #1 there is no further action required on
      context switch when XFD[18] is set.

   3) When the CPU goes idle with AMX INIT=0 then the idle code should
      invoke TILERELEASE. Maybe the condition is not even necessary and
      TILERELEASE can be invoked unconditionally before trying to enter
      idle.

If that's correct, then this should be part of the next series.
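
A sketch of what #3 could look like (TILERELEASE is emitted as raw
bytes since assemblers may lack the mnemonic; the feature-bit name
follows this series and is illustrative here):

	static inline void tile_release(void)
	{
		/* TILERELEASE: returns all tile registers to INIT=1 */
		asm volatile(".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0");
	}

	/* in the idle entry path, before requesting a deep C-state: */
	if (cpu_feature_enabled(X86_FEATURE_AMX_TILE))
		tile_release();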

> In the grand scheme of things, this is a pretty small issue, say,
> compared to the API discussion.

No argument about that.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state
  2021-03-30  8:28               ` Thomas Gleixner
@ 2021-03-30 16:38                 ` Len Brown
  0 siblings, 0 replies; 78+ messages in thread
From: Len Brown @ 2021-03-30 16:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Chang S. Bae, Borislav Petkov, Andy Lutomirski, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List

On Tue, Mar 30, 2021 at 4:28 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Len,
>
> On Mon, Mar 29 2021 at 18:16, Len Brown wrote:
> > On Mon, Mar 29, 2021 at 2:49 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > Let me know if this problem description is fair:
> >
> > Many-core Xeon servers will support AMX, and when I run an AMX application
> > on one and take an interrupt with AMX INIT=0, Linux may go idle on my CPU.
> > If Linux cpuidle requests C6, the hardware will demote to C1E.
> >
> > The concern is that a core in C1E will negatively impact its own power,
> > or the performance of a neighboring core.
> >
> > This is what we are talking about, right?
>
> Correct.
>
> > I'm delighted that there are Xeon customers, who care about this power savings.
> > Unfortunately, they are the exception, not the rule.
>
> That may be true or not. The point is that there is some side effect and
> from a correctness point of view it needs to be addressed.

I don't see how demoting C6 to C1E is a "correctness" issue.
There is nothing "incorrect" about demoting to C1E when software permits C6,
and as I mentioned, this happens all the time.
All architectural state, including the AMX state, is preserved by hardware.

I do agree that there is the possibility that this scenario may result
in less than optimal power savings.
It isn't clear how often that situation might occur.

> >>    - Use TILERELEASE on context switch after XSAVES?
> >
> > Yes, that would be perfectly reasonable.
> >
> >>    - Any other mechanism on context switch
> >
> > XRSTOR of a context with INIT=1 would also do it.
> >
> >>    - Clear XFD[18] when going idle and issue TILERELEASE depending
> >>      on the last state
> >
> > I think you mean to *set* XFD.
> > When the task touched AMX, he took a #NM, and we cleared XFD for that task.
> > So when we get here, XFD is already clear (unarmed).
> > Nevertheless, the setting of XFD is moot here.
>
> No. We set XFD when the task which used AMX schedules out. If the CPU
> reaches idle without going back to user space then XFD is still set and
> AMX INIT still 0. So my assumption was that in order to issue
> TILERELEASE before going idle, you need to clear XFD[18] first, but I
> just saw in the spec that it is not necessary.

Right, XFD setting is moot here.

> >>    - Use any other means to set the thing back into INIT=1 state when
> >>      going idle
> >
> > TILERELEASE and XRSTOR are the tools in the toolbox, if necessary.
> >
> >> There is no option 'shrug and ignore' unfortunately.
> >
> > I'm not going to say it is impossible that this path will matter.
> > If some terrible things go wrong with the hardware, and the hardware
> > is configured and used in a very specific way, yes, this could matter.
>
> So then let me summarize for the bare metal case:
>
>    1) The paragraph in the specification is unclear and needs to be
>       rephrased.
>
>       What I understood from your explanations so far:
>
>       When AMX is disabled by clearing XCR0[18:17], by clearing
>       CR4.OSXSAVE, or by setting IA32_XFD[18], then there are no
>       negative side effects due to AMX INIT=0 as long as the CPU is
>       executing.

Right.

>       The only possible side effect is when the CPU goes idle because
>       AMX INIT=0 limits C states.

Right.

>    2) As a consequence of #1 there is no further action required on
>       context switch when XFD[18] is set.

I agree.

>    3) When the CPU goes idle with AMX INIT=0 then the idle code should
>       invoke TILERELEASE. Maybe the condition is not even necessary and
>       TILERELEASE can be invoked unconditionally before trying to enter
>       idle.
>
> If that's correct, then this should be part of the next series.

If you insist, then that is fine with me.

Personally, I would prefer to be able to measure an actual benefit
before adding code, but this one is small, and so I'm not religious about it.

> > In the grand scheme of things, this is a pretty small issue, say,
> > compared to the API discussion.
>
> No argument about that.
>
> Thanks,
>
>         tglx



-- 
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2021-03-30 16:39 UTC | newest]

Thread overview: 78+ messages
2021-02-21 18:56 [PATCH v4 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
2021-02-21 18:56 ` [PATCH v4 01/22] x86/fpu/xstate: Modify the initialization helper to handle both static and dynamic buffers Chang S. Bae
2021-03-10 13:40   ` Borislav Petkov
2021-02-21 18:56 ` [PATCH v4 02/22] x86/fpu/xstate: Modify state copy helpers " Chang S. Bae
2021-02-21 18:56 ` [PATCH v4 03/22] x86/fpu/xstate: Modify address finders " Chang S. Bae
2021-02-21 18:56 ` [PATCH v4 04/22] x86/fpu/xstate: Modify the context restore helper " Chang S. Bae
2021-02-21 18:56 ` [PATCH v4 05/22] x86/fpu/xstate: Add a new variable to indicate dynamic user states Chang S. Bae
2021-02-21 18:56 ` [PATCH v4 06/22] x86/fpu/xstate: Add new variables to indicate dynamic xstate buffer size Chang S. Bae
2021-02-21 18:56 ` [PATCH v4 07/22] x86/fpu/xstate: Calculate and remember dynamic xstate buffer sizes Chang S. Bae
2021-02-21 18:56 ` [PATCH v4 08/22] x86/fpu/xstate: Convert the struct fpu 'state' field to a pointer Chang S. Bae
2021-02-21 18:56 ` [PATCH v4 09/22] x86/fpu/xstate: Introduce helpers to manage the xstate buffer dynamically Chang S. Bae
2021-02-21 18:56 ` [PATCH v4 10/22] x86/fpu/xstate: Define the scope of the initial xstate data Chang S. Bae
2021-02-21 18:56 ` [PATCH v4 11/22] x86/fpu/xstate: Update the xstate save function to support dynamic states Chang S. Bae
2021-02-21 18:56 ` [PATCH v4 12/22] x86/fpu/xstate: Update the xstate buffer address finder " Chang S. Bae
2021-02-21 18:56 ` [PATCH v4 13/22] x86/fpu/xstate: Update the xstate context copy function " Chang S. Bae
2021-02-21 18:56 ` [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state Chang S. Bae
2021-03-20 22:13   ` Thomas Gleixner
2021-03-20 22:21     ` Andy Lutomirski
2021-03-23 21:01       ` Len Brown
2021-03-24  3:14         ` Liu, Jing2
2021-03-24 21:09           ` Len Brown
2021-03-24 21:26             ` Andy Lutomirski
2021-03-24 21:30               ` Dave Hansen
2021-03-24 21:42                 ` Andy Lutomirski
2021-03-24 21:58                   ` Dave Hansen
2021-03-24 22:12                     ` Andy Lutomirski
2021-03-25  5:12             ` Liu, Jing2
2021-03-25  6:59               ` Bae, Chang Seok
2021-03-25  7:26                 ` Liu, Jing2
2021-03-23 21:52     ` Bae, Chang Seok
2021-03-24 14:24       ` Dave Hansen
2021-03-29 13:14     ` Len Brown
2021-03-29 13:33       ` Thomas Gleixner
2021-03-29 15:43         ` Len Brown
2021-03-29 16:06           ` Len Brown
2021-03-29 17:43             ` Andy Lutomirski
2021-03-29 18:57               ` Len Brown
2021-03-29 18:49           ` Thomas Gleixner
2021-03-29 22:16             ` Len Brown
2021-03-30  8:28               ` Thomas Gleixner
2021-03-30 16:38                 ` Len Brown
2021-03-26 16:34   ` Jann Horn
2021-03-29 18:14     ` Bae, Chang Seok
2021-02-21 18:56 ` [PATCH v4 15/22] x86/fpu/xstate: Support ptracer-induced xstate buffer expansion Chang S. Bae
2021-02-21 18:56 ` [PATCH v4 16/22] x86/fpu/xstate: Extend the table to map state components with features Chang S. Bae
2021-03-20 21:25   ` Thomas Gleixner
2021-03-23 21:52     ` Bae, Chang Seok
2021-02-21 18:56 ` [PATCH v4 17/22] x86/cpufeatures/amx: Enumerate Advanced Matrix Extension (AMX) feature bits Chang S. Bae
2021-02-21 18:56 ` [PATCH v4 18/22] x86/fpu/amx: Define AMX state components and have it used for boot-time checks Chang S. Bae
2021-03-20 21:31   ` Thomas Gleixner
2021-03-23 21:52     ` Bae, Chang Seok
2021-02-21 18:56 ` [PATCH v4 19/22] x86/fpu/amx: Enable the AMX feature in 64-bit mode Chang S. Bae
2021-03-20 21:26   ` Thomas Gleixner
2021-03-23 21:51     ` Bae, Chang Seok
2021-02-21 18:56 ` [PATCH v4 20/22] selftest/x86/amx: Include test cases for the AMX state management Chang S. Bae
2021-02-21 18:56 ` [PATCH v4 21/22] x86/fpu/xstate: Support dynamic user state in the signal handling path Chang S. Bae
2021-02-21 18:56 ` [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support Chang S. Bae
2021-02-21 19:30   ` Randy Dunlap
2021-02-21 20:10     ` Bae, Chang Seok
2021-02-21 20:37       ` Randy Dunlap
2021-03-20 20:56   ` Thomas Gleixner
2021-03-25 22:59     ` Len Brown
2021-03-25 23:10       ` Dave Hansen
2021-03-26 15:27         ` Len Brown
2021-03-26 19:22           ` Thomas Gleixner
2021-03-26  1:41       ` Andy Lutomirski
2021-03-26 15:33         ` Len Brown
2021-03-26 15:48           ` Andy Lutomirski
2021-03-26 17:53             ` Len Brown
2021-03-26 18:12               ` Andy Lutomirski
2021-03-27  4:53                 ` Len Brown
2021-03-27 22:20                   ` Thomas Gleixner
2021-03-29 13:31                     ` Len Brown
2021-03-29 14:10                       ` Thomas Gleixner
2021-03-26 18:17               ` Borislav Petkov
2021-03-27  4:41                 ` Len Brown
2021-03-26  1:50       ` Thomas Gleixner
2021-03-26 15:36         ` Len Brown
