* [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions
From: Chang S. Bae @ 2021-07-30 14:59 UTC
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae

Intel Advanced Matrix Extensions (AMX)[1][2] will be shipping on servers
soon.  AMX consists of configurable TMM "TILE" registers plus new CPU
instructions that operate on them.  TMUL (Tile matrix MULtiply) is the
first operator to take advantage of the new registers, and we anticipate
additional instructions in the future.

Neither AMX state nor TMUL instructions depend on AVX.  However, AMX and
AVX do share common challenges.  The TMM registers are 8KB today, and
architecturally as large as 64KB, which merits updates to hardware and
software state management.

Further, both technologies run faster when they are not simultaneously
running on SMT siblings, and both technologies' use of power and bandwidth
impacts the power and performance available to neighboring cores.  (This
impact has measurably improved in recent hardware.)

If the existing kernel approach for managing XSAVE state were employed to
handle AMX, 8KB of space would be added to every task, but possibly rarely
used.  Thus, Linux implements on-demand expansion of per-task context
switch buffers using an XSAVE feature: eXtended Feature Disabling (XFD).
The kernel arms XFD to provide an #NM exception upon a task's first access
to TILE state.  The kernel exception handler allocates and installs the
appropriate XSAVE context switch buffer.  User space is unaware of the
kernel's context switch buffer optimization.

AMX is accessible only to applications that invoke a new system call to
request access.  When a user invokes this system call, they agree that if
they use an alternate signal stack, they are providing an alternative
signal stack of sufficient size.  The simplest way to do that is to use the
updated ABI in glibc 2.34 or later [8][9], though they could also use their
own calculation or ask the kernel directly [3].

The patches are built on top of the recent upstream x86 FPU changes [13].

This series has three parts:
* Patch 01-15: Foundation to support dynamic user state management
* Patch 16-21: AMX enablement, including some preparation
* Patch 22-26: Optimizations, DEBUG sanity check, and self test

Note that the per-process system call in PATCH14 reflects the latest
discussion on LKML [10][12].

The following points summarize the latest discussion and this
implementation:

1. Kernel sets XCR0.AMX=1 at boot, and leaves it set, always.

    XCR0 is NOT context switched by Linux.
    (If it were, every change would provoke a VMEXIT when running in a VM.)

    (KVM context switches XCR0.  If KVM exports XFD for use by a guest OS,
    it must also context switch XFD.  KVM cannot use XFD for its own
    purposes.)

2. Kernel arms XFD for all tasks.

    XFD is context switched per Linux task.

3. Apps invoke new system call to request feature access (AMX).

    Implemented as a flag to arch_prctl(2); permission granted to any task
    is granted to all tasks in the process.

    It is sufficient to invoke this syscall at process or library
    init-time.

    There is no concept of removing or revoking permission, once granted to
    a process.  (Permission is cleared upon exec of a new process.)

    There is a companion system call to return the current permission.

    Analogous to AVX-512 and other stateful features, applications probe
    for AMX support by checking CPUID for the instructions and checking
    XGETBV(XCR0) for the OS support.

    However, stateful features from AMX onward also require the system call
    above to be invoked before tasks in that process may use the feature.
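
    For illustration, below is a minimal user-space sketch of this
    three-step opt-in probe.  The arch_prctl command name follows
    Patch14, but its numeric value and exact calling convention are
    placeholders, not taken from this series; the real definitions come
    from the patched kernel's <asm/prctl.h>:

    /* Sketch only: the ARCH_SET_STATE_ENABLE value is a placeholder. */
    #include <cpuid.h>
    #include <immintrin.h>          /* _xgetbv(); build with -mxsave */
    #include <sys/syscall.h>
    #include <unistd.h>

    #define ARCH_SET_STATE_ENABLE   0x1021          /* placeholder */
    #define XFEATURE_MASK_XTILEDATA (1ULL << 18)

    static int amx_usable(void)
    {
            unsigned int eax, ebx, ecx, edx;

            /* CPU support: CPUID.(EAX=7,ECX=0):EDX[24] is AMX-TILE */
            if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) ||
                !(edx & (1U << 24)))
                    return 0;

            /* OS support: XCR0[18] (XTILEDATA), set by the kernel at boot */
            if (!(_xgetbv(0) & XFEATURE_MASK_XTILEDATA))
                    return 0;

            /* Per-process permission: the new arch_prctl(2) flag */
            return syscall(SYS_arch_prctl, ARCH_SET_STATE_ENABLE,
                           XFEATURE_MASK_XTILEDATA) == 0;
    }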

4. Applications touching AMX without permission result in process exit.

    Armed XFD results in #NM, which results in SIGILL with si_code
    ILL_ILLOPC, typically resulting in process exit.

5. Applications touching AMX with permission allocate a context switch
   buffer on-demand.

    Armed XFD results in #NM.
    The kernel allocates the large context switch buffer.
    The kernel disarms XFD for that task.

6. #NM handler allocation failure results in process exit.

    If the #NM handler cannot allocate the 8KB buffer, the task will
    receive a SIGILL with si_code ILL_ILLOPC at the instruction that took
    the #NM fault, typically resulting in process exit.
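
    A rough kernel-side sketch of the #NM flow in items 4-6 follows.
    The helpers xstate_permitted(), alloc_xstate_buffer(), and
    xfd_disarm() are illustrative stand-ins, not this series' actual
    function names:

    static void handle_xfd_event(struct pt_regs *regs)
    {
            struct fpu *fpu = &current->thread.fpu;

            /* Item 4: no permission -> SIGILL, typically process exit */
            if (!xstate_permitted(current, xfeatures_mask_user_dynamic)) {
                    force_sig_fault(SIGILL, ILL_ILLOPC,
                                    (void __user *)regs->ip);
                    return;
            }

            /* Item 5: expand the per-task buffer on first TILE access */
            if (alloc_xstate_buffer(fpu, xfeatures_mask_user_dynamic)) {
                    /* Item 6: the 8KB allocation failed -> SIGILL too */
                    force_sig_fault(SIGILL, ILL_ILLOPC,
                                    (void __user *)regs->ip);
                    return;
            }

            /* Item 5: dis-arm XFD so TILE access no longer traps */
            xfd_disarm(fpu);
    }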

7. Legacy app signal stack XSTATE support includes AVX-512, and stops
   before AMX.

    Legacy apps are those which do not request AMX (or subsequent feature)
    access.  The signal stack ABI continues to be uncompacted XSTATE for
    both legacy and new apps.

    Existing code to find offsets in XSTATE still works.
    Existing code doing XRSTOR/XSAVE on the signal stack buffer will still
    work.*

    * XSTATE size calculation using CPUID will include AMX and other
    supported features, even if the process did not invoke the new system
    call.  However, the kernel will not XSAVE AMX or later features onto
    the signal stack of a legacy process.**

   ** User-space XSAVE/XRSTOR should size buffers according to CPUID
   if they include the bits of xgetbv(XCR0) in RFBM, because XSAVE will
   write data (including zeros for INIT state) for all features included in
   RFBM.
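
   For example, the uncompacted size covering every feature enabled in
   XCR0 can be read straight from CPUID; CPUID.(EAX=0xD,ECX=0):EBX is
   architecturally defined to return it [1]:

   #include <cpuid.h>

   /*
    * Size in bytes of an uncompacted XSAVE area for every feature
    * currently enabled in XCR0.  On an AMX kernel, this includes the
    * 8KB TILE area, even for processes that never requested AMX.
    */
   static unsigned int xsave_user_buffer_size(void)
   {
           unsigned int eax, ebx, ecx, edx;

           __get_cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx);
           return ebx;
   }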

8. New opt-in apps must agree to provide a large enough sigaltstack

    1. must invoke the permission system call before touching AMX TMM
    2. must guarantee, if using sigaltstack(2), that they have allocated
       a signal stack of sufficient size, e.g., by using signal.h from
       glibc 2.34 or later (see the sketch after this item)

    (glibc 2.34 changed MINSIGSTKSZ and SIGSTKSZ from 2KB/8KB constants
    into run-time values. [8])

    Linux will continue to XSAVE/XRSTOR directly to/from the signal stack,
    and the stack will always include the 8KB *space* for AMX TMM and
    subsequent features.

    Linux has an optimization for all XFD-supported features in the INIT
    state, so that XSAVE will skip writing zeros.
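
    The sketch referenced in 8.2 above: allocating an alternate signal
    stack with the run-time constants from glibc 2.34 [8].  (With an
    older glibc, an application would compute the size itself or ask
    the kernel [3].)

    #include <signal.h>
    #include <stdlib.h>

    static int setup_altstack(void)
    {
            /*
             * With glibc >= 2.34 (and its dynamic-stack-size feature
             * test macro in effect), SIGSTKSZ expands to a run-time
             * value derived from the kernel's AT_MINSIGSTKSZ, which
             * covers the AMX TILE space on the signal frame.
             */
            stack_t ss = { .ss_size = SIGSTKSZ };

            ss.ss_sp = malloc(ss.ss_size);
            if (!ss.ss_sp)
                    return -1;
            return sigaltstack(&ss, NULL);
    }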

9. intel_idle for SPR will clear AMX TMM state

    This guarantees that AMX use will not prevent the CPU from entering the
    idle C6 state, which can be beneficial for power savings, and thus
    turbo frequency.
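
    Since older assemblers do not know the TILERELEASE mnemonic, code
    like this typically emits the instruction as raw bytes.  A minimal
    sketch; the encoding is taken from the ISA extensions reference [1]:

    /* TILERELEASE: returns all tile registers to the INIT state. */
    static inline void tile_release(void)
    {
            asm volatile(".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0");
    }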

Reviewed-by: Len Brown <len.brown@intel.com>

Changes from v8 [16]:
* Update the arch_prctl prototype for consistency with other arch_prctl
  calls.  It now takes the address of a return bitmask as a parameter
  (Patch14).  Update self-tests to reflect this (Patch23).
* bugfix: Fix an off-by-one error in the check_xstate_against_struct()
  feature number argument (Patch19).

Changes from v7 [15]:
* Update #NM handler to raise SIGILL rather than SIGSEGV (Patch 12).
  (Thiago Macieira)
* Rename the syscalls (Patch 14). (Thiago Macieira and Dave Hansen)
* If XSAVE is disabled, ensure that the syscall correctly indicates legacy
  states (Patch14). (Thiago Macieira and Dave Hansen)
* Update existing self-test to expect SIGILL (Patch23).

Changes from v6 [14]:
* Add state bitmap param to proposed syscall. (Thiago Macieira)
* Add companion syscall to return the current permission bitmap.
* Update the ptrace path to return EFAULT when no permission to write
  XTILEDATA.
* Simplify xstate size calculation code. (Dave Hansen)
* Update comments for TILERELEASE code. (Rafael J. Wysocki)

Changes from v5 [11]:
* Updated to require per-process permission for dynamic states (v5 was
  per-task).
* Support both legacy and expanded sigframe xstate buffer sizes.
* Moved the TILERELEASE code to intel_idle driver. (Peter Zijlstra)
* Fixed to deactivate fpregs with TILERELEASE. (Andy Lutomirski and Dave
  Hansen)
* Rebased on Thomas Gleixner's recent x86 FPU code changes.
* Added XFD sanity check. (Dave Hansen)
* Future proofed __raw_xsave_addr().
* Tighten up task size calculation (previously, it could over-calculate).
* Cleaned invocation memset() for init_fpstate (no functional change).
* Updated selftest to handle latest syscall semantics, plus minor updates.
* Dropped the change for XSTATE restore helper.

Changes from v4 [7]:
* Changed the buffer expansion policy from the transparent #NM-based
  approach to the access-request based approach. (Andy Lutomirski, Thomas
  Gleixner, et al.)
* Removed the boot parameter patch. (Thomas Gleixner)
* Included code to explicitly initialize AMX state during a context switch.
  (Thomas Gleixner)
* Added a new arch_prctl to pre-allocate a buffer for dynamic state. (Andy
  Lutomirski)
* Updated the fork() path to initialize all the AMX state.
* Improved ptracer's dynamic user state injection path.
* Add optimization to skip tile data in sigframe when an AMX thread
  initialized the state.
* Updated to treat the mismatched state size as an error. (Thomas Gleixner)
* Simplified the xstate feature check routine. (Thomas Gleixner)
* Simplified and updated the selftest.
* Updated some changelog. (Thomas Gleixner)
* Updated a function description. (Borislav Petkov)

Changes from v3 [6]:
* Updated some commit messages and code comments. (Borislav Petkov)
* Added and removed some helpers. (Borislav Petkov)
* Revised the buffer allocation function. (Borislav Petkov)
* Simplified buffer access. (Borislav Petkov)
* Re-organized some code changes to be more reviewable. (PATCH9/10)
* Reverted unnecessary changes. (PATCH4)
* Fixed typo in the documentation. (Randy Dunlap)

Changes from v2 [5]:
* Removed the patch for tile data inheritance. Also, updated the selftest
  patch. (Andy Lutomirski)
* Changed the kernel to be tainted when any unknown state is enabled. (Andy
  Lutomirski)
* Changed to use the XFD feature only when the compacted format is in use.
* Improved the test code.
* Simplified the cmdline handling.
* Removed 'task->fpu' in changelogs. (Borislav Petkov)
* Updated the variable name / comments / changelogs for clarification.

Changes from v1 [4]:
* Added vmalloc() error tracing (Dave Hansen, PeterZ, and Andy Lutomirski)
* Inlined the #NM handling code (Andy Lutomirski)
* Made signal handling optimization revertible
* Revised the new parameter handling code (Andy Lutomirski and Dave Hansen)
* Rebased on the upstream kernel

[1]: Intel Architecture Instruction Set Extension Programming Reference
     May 2021, https://software.intel.com/content/dam/develop/external/us/en/documents-tps/architecture-instruction-set-extensions-programming-reference.pdf
[2]: https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-matrix-extensions-intel-amx-instructions.html
[3]: https://lore.kernel.org/lkml/20210518200320.17239-1-chang.seok.bae@intel.com/
[4]: https://lore.kernel.org/lkml/20201001203913.9125-1-chang.seok.bae@intel.com/
[5]: https://lore.kernel.org/lkml/20201119233257.2939-1-chang.seok.bae@intel.com/
[6]: https://lore.kernel.org/lkml/20201223155717.19556-1-chang.seok.bae@intel.com/
[7]: https://lore.kernel.org/lkml/20210221185637.19281-1-chang.seok.bae@intel.com/
[8]: https://sourceware.org/git/?p=glibc.git;a=commit;h=6c57d320484988e87e446e2e60ce42816bf51d53
[9]: https://sourceware.org/git/?p=glibc.git;a=blob;f=NEWS;h=aa0f10a891f8f9b4e6f0f6d25b6a307898c07d82;hb=HEAD#l12
[10]: https://lore.kernel.org/lkml/CALCETrW2QHa2TLvnUuVxAAheqcbSZ-5_WRXtDSAGcbG8N+gtdQ@mail.gmail.com/
[11]: https://lore.kernel.org/lkml/20210523193259.26200-1-chang.seok.bae@intel.com/
[12]: https://lore.kernel.org/lkml/CAJvTdKmzN0VMyH8VU_fdzn2UZqmR=_aNrJW01a65BhyLm6YRPg@mail.gmail.com/
[13]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1423e2660cf134a8f21f2451865a04792013e49e
[14]: https://lore.kernel.org/lkml/20210630060226.24652-1-chang.seok.bae@intel.com/
[15]: https://lore.kernel.org/lkml/20210710130313.5072-1-chang.seok.bae@intel.com/
[16]: https://lore.kernel.org/lkml/20210717152903.7651-1-chang.seok.bae@intel.com/

Chang S. Bae (26):
  x86/fpu/xstate: Modify the initialization helper to handle both static
    and dynamic buffers
  x86/fpu/xstate: Modify state copy helpers to handle both static and
    dynamic buffers
  x86/fpu/xstate: Modify address finders to handle both static and
    dynamic buffers
  x86/fpu/xstate: Add a new variable to indicate dynamic user states
  x86/fpu/xstate: Add new variables to indicate dynamic XSTATE buffer
    size
  x86/fpu/xstate: Calculate and remember dynamic XSTATE buffer sizes
  x86/fpu/xstate: Convert the struct fpu 'state' field to a pointer
  x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer
    dynamically
  x86/fpu/xstate: Update the XSTATE save function to support dynamic
    states
  x86/fpu/xstate: Update the XSTATE buffer address finder to support
    dynamic states
  x86/fpu/xstate: Update the XSTATE context copy function to support
    dynamic states
  x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user
    state
  x86/fpu/xstate: Support ptracer-induced XSTATE buffer expansion
  x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE
  x86/fpu/xstate: Support both legacy and expanded signal XSTATE size
  x86/fpu/xstate: Adjust the XSAVE feature table to address gaps in
    state component numbers
  x86/fpu/xstate: Disable XSTATE support if an inconsistent state is
    detected
  x86/cpufeatures/amx: Enumerate Advanced Matrix Extension (AMX) feature
    bits
  x86/fpu/amx: Define AMX state components and have it used for
    boot-time checks
  x86/fpu/amx: Initialize child's AMX state
  x86/fpu/amx: Enable the AMX feature in 64-bit mode
  x86/fpu/xstate: Skip writing zeros to signal frame for dynamic user
    states if in INIT-state
  selftest/x86/amx: Test cases for the AMX state management
  x86/insn/amx: Add TILERELEASE instruction to the opcode map
  intel_idle/amx: Add SPR support with XTILEDATA capability
  x86/fpu/xstate: Add a sanity check for XFD state when saving XSTATE

 arch/x86/include/asm/cpufeatures.h    |   4 +
 arch/x86/include/asm/fpu/internal.h   | 117 +++-
 arch/x86/include/asm/fpu/types.h      |  72 +-
 arch/x86/include/asm/fpu/xstate.h     |  34 +-
 arch/x86/include/asm/msr-index.h      |   2 +
 arch/x86/include/asm/processor.h      |  10 +-
 arch/x86/include/asm/proto.h          |   2 +-
 arch/x86/include/asm/special_insns.h  |   6 +
 arch/x86/include/asm/trace/fpu.h      |   9 +-
 arch/x86/include/uapi/asm/prctl.h     |   3 +
 arch/x86/kernel/cpu/cpuid-deps.c      |   4 +
 arch/x86/kernel/fpu/core.c            |  94 ++-
 arch/x86/kernel/fpu/init.c            |  37 +-
 arch/x86/kernel/fpu/regset.c          |  57 +-
 arch/x86/kernel/fpu/signal.c          |  99 ++-
 arch/x86/kernel/fpu/xstate.c          | 668 ++++++++++++++++--
 arch/x86/kernel/process.c             |  21 +-
 arch/x86/kernel/process_32.c          |   2 +-
 arch/x86/kernel/process_64.c          |   8 +-
 arch/x86/kernel/traps.c               |  41 ++
 arch/x86/kvm/x86.c                    |  48 +-
 arch/x86/lib/x86-opcode-map.txt       |   8 +-
 arch/x86/math-emu/fpu_aux.c           |   2 +-
 arch/x86/math-emu/fpu_entry.c         |   4 +-
 arch/x86/math-emu/fpu_system.h        |   2 +-
 drivers/idle/intel_idle.c             |  79 +++
 tools/arch/x86/lib/x86-opcode-map.txt |   8 +-
 tools/testing/selftests/x86/Makefile  |   2 +-
 tools/testing/selftests/x86/amx.c     | 968 ++++++++++++++++++++++++++
 29 files changed, 2174 insertions(+), 237 deletions(-)
 create mode 100644 tools/testing/selftests/x86/amx.c


base-commit: ff1176468d368232b684f75e82563369208bc371
-- 
2.17.1



* [PATCH v9 01/26] x86/fpu/xstate: Modify the initialization helper to handle both static and dynamic buffers
From: Chang S. Bae @ 2021-07-30 14:59 UTC
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae, kvm

Have the function initializing the XSTATE buffer take a struct fpu *
pointer in preparation for dynamic state buffer support.

init_fpstate is a special case, which is indicated by a null pointer
parameter to fpstate_init().

Also, fpstate_init_xstate() now accepts the state component bitmap to
customize the compacted format.

No functional change.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
---
Changes from v5:
* Moved fpstate_init_xstate() back to the header (again).
* Massaged the changelog.

Changes from v4:
* Added a proper function description. (Borislav Petkov)
* Added the likely() statement as a null pointer is a special case.

Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Updated the function comment to use kernel-doc style. (Borislav Petkov)

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
---
 arch/x86/include/asm/fpu/internal.h | 11 ++++++++++-
 arch/x86/kernel/fpu/core.c          | 28 +++++++++++++++++-----------
 arch/x86/kernel/fpu/init.c          |  2 +-
 arch/x86/kernel/fpu/xstate.c        |  3 +--
 arch/x86/kvm/x86.c                  |  2 +-
 5 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 5a18694a89b2..c7a64e2806a9 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -80,7 +80,7 @@ static __always_inline __pure bool use_fxsr(void)
 
 extern union fpregs_state init_fpstate;
 
-extern void fpstate_init(union fpregs_state *state);
+extern void fpstate_init(struct fpu *fpu);
 #ifdef CONFIG_MATH_EMULATION
 extern void fpstate_init_soft(struct swregs_state *soft);
 #else
@@ -88,6 +88,15 @@ static inline void fpstate_init_soft(struct swregs_state *soft) {}
 #endif
 extern void save_fpregs_to_fpstate(struct fpu *fpu);
 
+static inline void fpstate_init_xstate(struct xregs_state *xsave, u64 mask)
+{
+	/*
+	 * XRSTORS requires these bits set in xcomp_bv, or it will
+	 * trigger #GP:
+	 */
+	xsave->header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT | mask;
+}
+
 /* Returns 0 or the negated trap number, which results in -EFAULT for #PF */
 #define user_insn(insn, output, input...)				\
 ({									\
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 7ada7bd03a32..c0098f8422de 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -203,15 +203,6 @@ void fpu_sync_fpstate(struct fpu *fpu)
 	fpregs_unlock();
 }
 
-static inline void fpstate_init_xstate(struct xregs_state *xsave)
-{
-	/*
-	 * XRSTORS requires these bits set in xcomp_bv, or it will
-	 * trigger #GP:
-	 */
-	xsave->header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT | xfeatures_mask_all;
-}
-
 static inline void fpstate_init_fxstate(struct fxregs_state *fx)
 {
 	fx->cwd = 0x37f;
@@ -229,8 +220,23 @@ static inline void fpstate_init_fstate(struct fregs_state *fp)
 	fp->fos = 0xffff0000u;
 }
 
-void fpstate_init(union fpregs_state *state)
+/**
+ * fpstate_init - initialize the xstate buffer
+ *
+ * @fpu:	A struct fpu * pointer
+ *
+ * If @fpu is NULL, initialize init_fpstate.
+ *
+ */
+void fpstate_init(struct fpu *fpu)
 {
+	union fpregs_state *state;
+
+	if (likely(fpu))
+		state = &fpu->state;
+	else
+		state = &init_fpstate;
+
 	if (!static_cpu_has(X86_FEATURE_FPU)) {
 		fpstate_init_soft(&state->soft);
 		return;
@@ -239,7 +245,7 @@ void fpstate_init(union fpregs_state *state)
 	memset(state, 0, fpu_kernel_xstate_size);
 
 	if (static_cpu_has(X86_FEATURE_XSAVES))
-		fpstate_init_xstate(&state->xsave);
+		fpstate_init_xstate(&state->xsave, xfeatures_mask_all);
 	if (static_cpu_has(X86_FEATURE_FXSR))
 		fpstate_init_fxstate(&state->fxsave);
 	else
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 64e29927cc32..e14c72bc8706 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -124,7 +124,7 @@ static void __init fpu__init_system_generic(void)
 	 * Set up the legacy init FPU context. (xstate init might overwrite this
 	 * with a more modern format, if the CPU supports it.)
 	 */
-	fpstate_init(&init_fpstate);
+	fpstate_init(NULL);
 
 	fpu__init_system_mxcsr();
 }
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c8def1b7f8fb..d4fdceb9a309 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -395,8 +395,7 @@ static void __init setup_init_fpu_buf(void)
 	print_xstate_features();
 
 	if (boot_cpu_has(X86_FEATURE_XSAVES))
-		init_fpstate.xsave.header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT |
-						     xfeatures_mask_all;
+		fpstate_init_xstate(&init_fpstate.xsave, xfeatures_mask_all);
 
 	/*
 	 * Init all the features state with header.xfeatures being 0x0
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a4fd10604f72..76a4e5e274d8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10599,7 +10599,7 @@ static void fx_init(struct kvm_vcpu *vcpu)
 	if (!vcpu->arch.guest_fpu)
 		return;
 
-	fpstate_init(&vcpu->arch.guest_fpu->state);
+	fpstate_init(vcpu->arch.guest_fpu);
 	if (boot_cpu_has(X86_FEATURE_XSAVES))
 		vcpu->arch.guest_fpu->state.xsave.header.xcomp_bv =
 			host_xcr0 | XSTATE_COMPACTION_ENABLED;
-- 
2.17.1



* [PATCH v9 02/26] x86/fpu/xstate: Modify state copy helpers to handle both static and dynamic buffers
From: Chang S. Bae @ 2021-07-30 14:59 UTC
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae

Have all the functions copying XSTATE take a struct fpu * pointer in
preparation for dynamic state buffer support.

No functional change.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v5:
* Adjusted the function prototype changes to the recent renames on the new
  base.

Changes from v3:
* Updated the changelog. (Borislav Petkov)

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
---
 arch/x86/include/asm/fpu/xstate.h |  4 ++--
 arch/x86/kernel/fpu/regset.c      |  2 +-
 arch/x86/kernel/fpu/signal.c      |  2 +-
 arch/x86/kernel/fpu/xstate.c      | 12 ++++++------
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 109dfcc75299..ede166e9d3f2 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -136,8 +136,8 @@ extern void __init update_regset_xstate_info(unsigned int size,
 
 void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);
 int xfeature_size(int xfeature_nr);
-int copy_uabi_from_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf);
-int copy_sigframe_from_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf);
+int copy_uabi_from_kernel_to_xstate(struct fpu *fpu, const void *kbuf);
+int copy_sigframe_from_user_to_xstate(struct fpu *fpu, const void __user *ubuf);
 
 void xsaves(struct xregs_state *xsave, u64 mask);
 void xrstors(struct xregs_state *xsave, u64 mask);
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 66ed317ebc0d..49dd307003ec 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -164,7 +164,7 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
 	}
 
 	fpu_force_restore(fpu);
-	ret = copy_uabi_from_kernel_to_xstate(&fpu->state.xsave, kbuf ?: tmpbuf);
+	ret = copy_uabi_from_kernel_to_xstate(fpu, kbuf ?: tmpbuf);
 
 out:
 	vfree(tmpbuf);
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 445c57c9c539..bec8c8046888 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -371,7 +371,7 @@ static int __fpu_restore_sig(void __user *buf, void __user *buf_fx,
 	fpregs_unlock();
 
 	if (use_xsave() && !fx_only) {
-		ret = copy_sigframe_from_user_to_xstate(&fpu->state.xsave, buf_fx);
+		ret = copy_sigframe_from_user_to_xstate(fpu, buf_fx);
 		if (ret)
 			return ret;
 	} else {
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d4fdceb9a309..59f08953201c 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1089,10 +1089,10 @@ static int copy_from_buffer(void *dst, unsigned int offset, unsigned int size,
 	return 0;
 }
 
-
-static int copy_uabi_to_xstate(struct xregs_state *xsave, const void *kbuf,
+static int copy_uabi_to_xstate(struct fpu *fpu, const void *kbuf,
 			       const void __user *ubuf)
 {
+	struct xregs_state *xsave = &fpu->state.xsave;
 	unsigned int offset, size;
 	struct xstate_header hdr;
 	u64 mask;
@@ -1158,9 +1158,9 @@ static int copy_uabi_to_xstate(struct xregs_state *xsave, const void *kbuf,
  * format and copy to the target thread. This is called from
  * xstateregs_set().
  */
-int copy_uabi_from_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf)
+int copy_uabi_from_kernel_to_xstate(struct fpu *fpu, const void *kbuf)
 {
-	return copy_uabi_to_xstate(xsave, kbuf, NULL);
+	return copy_uabi_to_xstate(fpu, kbuf, NULL);
 }
 
 /*
@@ -1168,10 +1168,10 @@ int copy_uabi_from_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf)
  * XSAVE[S] format and copy to the target thread. This is called from the
  * sigreturn() and rt_sigreturn() system calls.
  */
-int copy_sigframe_from_user_to_xstate(struct xregs_state *xsave,
+int copy_sigframe_from_user_to_xstate(struct fpu *fpu,
 				      const void __user *ubuf)
 {
-	return copy_uabi_to_xstate(xsave, NULL, ubuf);
+	return copy_uabi_to_xstate(fpu, NULL, ubuf);
 }
 
 static bool validate_xsaves_xrstors(u64 mask)
-- 
2.17.1



* [PATCH v9 03/26] x86/fpu/xstate: Modify address finders to handle both static and dynamic buffers
From: Chang S. Bae @ 2021-07-30 14:59 UTC
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae, kvm

Have all the functions finding XSTATE address take a struct fpu * pointer
in preparation for dynamic state buffer support.

init_fpstate is a special case, which is indicated by a null pointer
parameter to get_xsave_addr() and __raw_xsave_addr().

No functional change.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
---
Changes from v5:
* Adjusted some call sites for the new base.

Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Updated the function comment to use kernel-doc style. (Borislav Petkov)

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)

Changes from v1:
* Rebased on the upstream kernel (5.10)
---
 arch/x86/include/asm/fpu/xstate.h |  2 +-
 arch/x86/kernel/fpu/xstate.c      | 42 ++++++++++++++++++++++++-------
 arch/x86/kvm/x86.c                | 10 +++-----
 3 files changed, 38 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index ede166e9d3f2..2451bccc6cac 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -134,7 +134,7 @@ extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
 extern void __init update_regset_xstate_info(unsigned int size,
 					     u64 xstate_mask);
 
-void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);
+void *get_xsave_addr(struct fpu *fpu, int xfeature_nr);
 int xfeature_size(int xfeature_nr);
 int copy_uabi_from_kernel_to_xstate(struct fpu *fpu, const void *kbuf);
 int copy_sigframe_from_user_to_xstate(struct fpu *fpu, const void __user *ubuf);
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 59f08953201c..d9c029ab9497 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -841,19 +841,34 @@ void fpu__resume_cpu(void)
 	}
 }
 
-/*
+/**
+ * __raw_xsave_addr - Find the address where the feature state is saved.
+ *
  * Given an xstate feature nr, calculate where in the xsave
  * buffer the state is.  Callers should ensure that the buffer
  * is valid.
+ *
+ * If @fpu is NULL, use init_fpstate.
+ *
+ * @fpu:	A struct fpu * pointer
+ *
+ * Return:	An address of the feature state in the buffer
  */
-static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
+static void *__raw_xsave_addr(struct fpu *fpu, int xfeature_nr)
 {
+	void *xsave;
+
 	if (!xfeature_enabled(xfeature_nr)) {
 		WARN_ON_FPU(1);
 		return NULL;
 	}
 
-	return (void *)xsave + xstate_comp_offsets[xfeature_nr];
+	if (fpu)
+		xsave = &fpu->state.xsave;
+	else
+		xsave = &init_fpstate.xsave;
+
+	return xsave + xstate_comp_offsets[xfeature_nr];
 }
 /*
  * Given the xsave area and a state inside, this function returns the
@@ -866,15 +881,18 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
  * this will return NULL.
  *
  * Inputs:
- *	xstate: the thread's storage area for all FPU data
+ *	fpu: the thread's FPU data to reference xstate buffer(s).
+ *	     (A null pointer parameter indicates init_fpstate.)
  *	xfeature_nr: state which is defined in xsave.h (e.g. XFEATURE_FP,
  *	XFEATURE_SSE, etc...)
  * Output:
  *	address of the state in the xsave area, or NULL if the
  *	field is not present in the xsave buffer.
  */
-void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
+void *get_xsave_addr(struct fpu *fpu, int xfeature_nr)
 {
+	struct xregs_state *xsave;
+
 	/*
 	 * Do we even *have* xsave state?
 	 */
@@ -887,6 +905,12 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
 	 */
 	WARN_ONCE(!(xfeatures_mask_all & BIT_ULL(xfeature_nr)),
 		  "get of unsupported state");
+
+	if (fpu)
+		xsave = &fpu->state.xsave;
+	else
+		xsave = &init_fpstate.xsave;
+
 	/*
 	 * This assumes the last 'xsave*' instruction to
 	 * have requested that 'xfeature_nr' be saved.
@@ -901,7 +925,7 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
 	if (!(xsave->header.xfeatures & BIT_ULL(xfeature_nr)))
 		return NULL;
 
-	return __raw_xsave_addr(xsave, xfeature_nr);
+	return __raw_xsave_addr(fpu, xfeature_nr);
 }
 EXPORT_SYMBOL_GPL(get_xsave_addr);
 
@@ -1061,8 +1085,8 @@ void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
 			membuf_write(&to, &pkru, sizeof(pkru));
 		} else {
 			copy_feature(header.xfeatures & BIT_ULL(i), &to,
-				     __raw_xsave_addr(xsave, i),
-				     __raw_xsave_addr(xinit, i),
+				     __raw_xsave_addr(&tsk->thread.fpu, i),
+				     __raw_xsave_addr(NULL, i),
 				     xstate_sizes[i]);
 		}
 		/*
@@ -1129,7 +1153,7 @@ static int copy_uabi_to_xstate(struct fpu *fpu, const void *kbuf,
 		u64 mask = ((u64)1 << i);
 
 		if (hdr.xfeatures & mask) {
-			void *dst = __raw_xsave_addr(xsave, i);
+			void *dst = __raw_xsave_addr(fpu, i);
 
 			offset = xstate_offsets[i];
 			size = xstate_sizes[i];
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 76a4e5e274d8..c72e3ad0f9b8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4717,7 +4717,7 @@ static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
 			memcpy(dest + offset, &vcpu->arch.pkru,
 			       sizeof(vcpu->arch.pkru));
 		} else {
-			src = get_xsave_addr(xsave, xfeature_nr);
+			src = get_xsave_addr(vcpu->arch.guest_fpu, xfeature_nr);
 			if (src)
 				memcpy(dest + offset, src, size);
 		}
@@ -4760,7 +4760,7 @@ static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
 			memcpy(&vcpu->arch.pkru, src + offset,
 			       sizeof(vcpu->arch.pkru));
 		} else {
-			void *dest = get_xsave_addr(xsave, xfeature_nr);
+			void *dest = get_xsave_addr(vcpu->arch.guest_fpu, xfeature_nr);
 
 			if (dest)
 				memcpy(dest, src + offset, size);
@@ -10831,12 +10831,10 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 		 */
 		if (init_event)
 			kvm_put_guest_fpu(vcpu);
-		mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu->state.xsave,
-					XFEATURE_BNDREGS);
+		mpx_state_buffer = get_xsave_addr(vcpu->arch.guest_fpu, XFEATURE_BNDREGS);
 		if (mpx_state_buffer)
 			memset(mpx_state_buffer, 0, sizeof(struct mpx_bndreg_state));
-		mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu->state.xsave,
-					XFEATURE_BNDCSR);
+		mpx_state_buffer = get_xsave_addr(vcpu->arch.guest_fpu, XFEATURE_BNDCSR);
 		if (mpx_state_buffer)
 			memset(mpx_state_buffer, 0, sizeof(struct mpx_bndcsr));
 		if (init_event)
-- 
2.17.1



* [PATCH v9 04/26] x86/fpu/xstate: Add a new variable to indicate dynamic user states
From: Chang S. Bae @ 2021-07-30 14:59 UTC
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae

The XSTATE per-task buffer is being prepared to become dynamic for user
states. Introduce a new mask variable to indicate the 'dynamic' user
states. The value is determined at boot-time.

No functional change.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v5:
* Made the variable __ro_after_init.
* Dropped perf's xstate buffer renaming, as it was already renamed.

Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Updated the code comment. (Borislav Petkov)

Changes from v2:
* Updated the changelog for clarification.
---
 arch/x86/include/asm/fpu/xstate.h | 2 ++
 arch/x86/kernel/fpu/xstate.c      | 9 +++++++++
 2 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 2451bccc6cac..bc4cba62906b 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -129,6 +129,8 @@ static inline u64 xfeatures_mask_independent(void)
 	return XFEATURE_MASK_INDEPENDENT;
 }
 
+extern u64 xfeatures_mask_user_dynamic;
+
 extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
 
 extern void __init update_regset_xstate_info(unsigned int size,
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d9c029ab9497..74e608c6ad6c 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -62,6 +62,12 @@ static short xsave_cpuid_features[] __initdata = {
 u64 xfeatures_mask_all __ro_after_init;
 EXPORT_SYMBOL_GPL(xfeatures_mask_all);
 
+/*
+ * This represents user xstates, a subset of xfeatures_mask_all, saved in a
+ * dynamic kernel XSAVE buffer.
+ */
+u64 xfeatures_mask_user_dynamic __ro_after_init;
+
 static unsigned int xstate_offsets[XFEATURE_MAX] __ro_after_init =
 	{ [ 0 ... XFEATURE_MAX - 1] = -1};
 static unsigned int xstate_sizes[XFEATURE_MAX] __ro_after_init =
@@ -709,6 +715,7 @@ static int __init init_xstate_size(void)
 static void fpu__init_disable_system_xstate(void)
 {
 	xfeatures_mask_all = 0;
+	xfeatures_mask_user_dynamic = 0;
 	cr4_clear_bits(X86_CR4_OSXSAVE);
 	setup_clear_cpu_cap(X86_FEATURE_XSAVE);
 }
@@ -780,6 +787,8 @@ void __init fpu__init_system_xstate(void)
 
 	/* Store it for paranoia check at the end */
 	xfeatures = xfeatures_mask_all;
+	/* Do not support the dynamically allocated buffer yet. */
+	xfeatures_mask_user_dynamic = 0;
 
 	/* Enable xstate instructions to be able to continue with initialization: */
 	fpu__init_cpu_xstate();
-- 
2.17.1



* [PATCH v9 05/26] x86/fpu/xstate: Add new variables to indicate dynamic XSTATE buffer size
From: Chang S. Bae @ 2021-07-30 14:59 UTC
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae, kvm

The XSTATE per-task buffer is being prepared to become dynamic for user
states. Introduce new size variables to indicate the minimum and maximum
size of the buffer. The values are determined at boot-time.

Instead of exporting the new variables, introduce helper functions to
access them as well as the user buffer size.

No functional change. The sizes do not differ yet, as the buffer is not
dynamic.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
---
Changes from v6:
* Massage the code comment.

Changes from v5:
* Made the new variables __ro_after_init for the new base code.
* Fixed the init_fpstate size for memset().

Changes from v3:
* Added as a new patch to add the variables along with new helpers.
  (Borislav Petkov)
---
 arch/x86/include/asm/fpu/xstate.h |  9 ++++
 arch/x86/include/asm/processor.h  | 10 +---
 arch/x86/kernel/fpu/core.c        | 26 +++++++---
 arch/x86/kernel/fpu/init.c        | 26 ++++------
 arch/x86/kernel/fpu/regset.c      |  2 +-
 arch/x86/kernel/fpu/signal.c      | 26 ++++++----
 arch/x86/kernel/fpu/xstate.c      | 83 +++++++++++++++++++++++++------
 arch/x86/kernel/process.c         |  7 +++
 arch/x86/kvm/x86.c                |  5 +-
 9 files changed, 133 insertions(+), 61 deletions(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index bc4cba62906b..d722e774a9f9 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -136,6 +136,15 @@ extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
 extern void __init update_regset_xstate_info(unsigned int size,
 					     u64 xstate_mask);
 
+enum xstate_config {
+	XSTATE_MIN_SIZE,
+	XSTATE_MAX_SIZE,
+	XSTATE_USER_SIZE
+};
+
+extern unsigned int get_xstate_config(enum xstate_config cfg);
+void set_xstate_config(enum xstate_config cfg, unsigned int value);
+
 void *get_xsave_addr(struct fpu *fpu, int xfeature_nr);
 int xfeature_size(int xfeature_nr);
 int copy_uabi_from_kernel_to_xstate(struct fpu *fpu, const void *kbuf);
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index f3020c54e2cb..505f596d1046 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -459,9 +459,6 @@ DECLARE_PER_CPU(struct irq_stack *, hardirq_stack_ptr);
 DECLARE_PER_CPU(struct irq_stack *, softirq_stack_ptr);
 #endif	/* !X86_64 */
 
-extern unsigned int fpu_kernel_xstate_size;
-extern unsigned int fpu_user_xstate_size;
-
 struct perf_event;
 
 struct thread_struct {
@@ -536,12 +533,7 @@ struct thread_struct {
 };
 
 /* Whitelist the FPU state from the task_struct for hardened usercopy. */
-static inline void arch_thread_struct_whitelist(unsigned long *offset,
-						unsigned long *size)
-{
-	*offset = offsetof(struct thread_struct, fpu.state);
-	*size = fpu_kernel_xstate_size;
-}
+extern void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size);
 
 static inline void
 native_load_sp0(unsigned long sp0)
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index c0098f8422de..808f7627975d 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -231,21 +231,30 @@ static inline void fpstate_init_fstate(struct fregs_state *fp)
 void fpstate_init(struct fpu *fpu)
 {
 	union fpregs_state *state;
+	unsigned int size;
+	u64 mask;
 
-	if (likely(fpu))
+	if (likely(fpu)) {
 		state = &fpu->state;
-	else
+		/* The dynamic user states are not prepared yet. */
+		mask = xfeatures_mask_all & ~xfeatures_mask_user_dynamic;
+		size = get_xstate_config(XSTATE_MIN_SIZE);
+	} else {
 		state = &init_fpstate;
+		mask = xfeatures_mask_all;
+		size = sizeof(init_fpstate);
+	}
 
 	if (!static_cpu_has(X86_FEATURE_FPU)) {
 		fpstate_init_soft(&state->soft);
 		return;
 	}
 
-	memset(state, 0, fpu_kernel_xstate_size);
+	memset(state, 0, size);
 
 	if (static_cpu_has(X86_FEATURE_XSAVES))
-		fpstate_init_xstate(&state->xsave, xfeatures_mask_all);
+		fpstate_init_xstate(&state->xsave, mask);
+
 	if (static_cpu_has(X86_FEATURE_FXSR))
 		fpstate_init_fxstate(&state->fxsave);
 	else
@@ -268,8 +277,11 @@ int fpu_clone(struct task_struct *dst)
 	/*
 	 * Don't let 'init optimized' areas of the XSAVE area
 	 * leak into the child task:
+	 *
+	 * The child does not inherit the dynamic states. So,
+	 * the xstate buffer has the minimum size.
 	 */
-	memset(&dst_fpu->state.xsave, 0, fpu_kernel_xstate_size);
+	memset(&dst_fpu->state.xsave, 0, get_xstate_config(XSTATE_MIN_SIZE));
 
 	/*
 	 * If the FPU registers are not owned by current just memcpy() the
@@ -278,7 +290,7 @@ int fpu_clone(struct task_struct *dst)
 	 */
 	fpregs_lock();
 	if (test_thread_flag(TIF_NEED_FPU_LOAD))
-		memcpy(&dst_fpu->state, &src_fpu->state, fpu_kernel_xstate_size);
+		memcpy(&dst_fpu->state, &src_fpu->state, get_xstate_config(XSTATE_MIN_SIZE));
 
 	else
 		save_fpregs_to_fpstate(dst_fpu);
@@ -337,7 +349,7 @@ static inline void restore_fpregs_from_init_fpstate(u64 features_mask)
 static inline unsigned int init_fpstate_copy_size(void)
 {
 	if (!use_xsave())
-		return fpu_kernel_xstate_size;
+		return get_xstate_config(XSTATE_MIN_SIZE);
 
 	/* XSAVE(S) just needs the legacy and the xstate header part */
 	return sizeof(init_fpstate.xsave);
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index e14c72bc8706..10e2a95916aa 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -129,15 +129,6 @@ static void __init fpu__init_system_generic(void)
 	fpu__init_system_mxcsr();
 }
 
-/*
- * Size of the FPU context state. All tasks in the system use the
- * same context size, regardless of what portion they use.
- * This is inherent to the XSAVE architecture which puts all state
- * components into a single, continuous memory block:
- */
-unsigned int fpu_kernel_xstate_size __ro_after_init;
-EXPORT_SYMBOL_GPL(fpu_kernel_xstate_size);
-
 /* Get alignment of the TYPE. */
 #define TYPE_ALIGN(TYPE) offsetof(struct { char x; TYPE test; }, test)
 
@@ -167,8 +158,10 @@ static void __init fpu__init_task_struct_size(void)
 	/*
 	 * Add back the dynamically-calculated register state
 	 * size.
+	 *
+	 * Use the minimum size for the buffer embedded in task_struct.
 	 */
-	task_size += fpu_kernel_xstate_size;
+	task_size += get_xstate_config(XSTATE_MIN_SIZE);
 
 	/*
 	 * We dynamically size 'struct fpu', so we require that
@@ -193,6 +186,7 @@ static void __init fpu__init_task_struct_size(void)
 static void __init fpu__init_system_xstate_size_legacy(void)
 {
 	static int on_boot_cpu __initdata = 1;
+	unsigned int xstate_size;
 
 	WARN_ON_FPU(!on_boot_cpu);
 	on_boot_cpu = 0;
@@ -203,17 +197,17 @@ static void __init fpu__init_system_xstate_size_legacy(void)
 	 */
 
 	if (!boot_cpu_has(X86_FEATURE_FPU)) {
-		fpu_kernel_xstate_size = sizeof(struct swregs_state);
+		xstate_size = sizeof(struct swregs_state);
 	} else {
 		if (boot_cpu_has(X86_FEATURE_FXSR))
-			fpu_kernel_xstate_size =
-				sizeof(struct fxregs_state);
+			xstate_size = sizeof(struct fxregs_state);
 		else
-			fpu_kernel_xstate_size =
-				sizeof(struct fregs_state);
+			xstate_size = sizeof(struct fregs_state);
 	}
 
-	fpu_user_xstate_size = fpu_kernel_xstate_size;
+	set_xstate_config(XSTATE_MIN_SIZE, xstate_size);
+	set_xstate_config(XSTATE_MAX_SIZE, xstate_size);
+	set_xstate_config(XSTATE_USER_SIZE, xstate_size);
 }
 
 /* Legacy code to initialize eager fpu mode. */
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 49dd307003ec..8dea3730620e 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -149,7 +149,7 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
 	/*
 	 * A whole standard-format XSAVE buffer is needed:
 	 */
-	if (pos != 0 || count != fpu_user_xstate_size)
+	if (pos != 0 || count != get_xstate_config(XSTATE_USER_SIZE))
 		return -EFAULT;
 
 	if (!kbuf) {
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index bec8c8046888..63f000988fa6 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -36,7 +36,7 @@ static inline int check_xstate_in_sigframe(struct fxregs_state __user *fxbuf,
 	/* Check for the first magic field and other error scenarios. */
 	if (fx_sw->magic1 != FP_XSTATE_MAGIC1 ||
 	    fx_sw->xstate_size < min_xstate_size ||
-	    fx_sw->xstate_size > fpu_user_xstate_size ||
+	    fx_sw->xstate_size > get_xstate_config(XSTATE_USER_SIZE) ||
 	    fx_sw->xstate_size > fx_sw->extended_size)
 		goto setfx;
 
@@ -107,7 +107,7 @@ static inline int save_xstate_epilog(void __user *buf, int ia32_frame)
 		return err;
 
 	err |= __put_user(FP_XSTATE_MAGIC2,
-			  (__u32 __user *)(buf + fpu_user_xstate_size));
+			  (__u32 __user *)(buf + get_xstate_config(XSTATE_USER_SIZE)));
 
 	/*
 	 * Read the xfeatures which we copied (directly from the cpu or
@@ -144,7 +144,7 @@ static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf)
 	else
 		err = fnsave_to_user_sigframe((struct fregs_state __user *) buf);
 
-	if (unlikely(err) && __clear_user(buf, fpu_user_xstate_size))
+	if (unlikely(err) && __clear_user(buf, get_xstate_config(XSTATE_USER_SIZE)))
 		err = -EFAULT;
 	return err;
 }
@@ -205,7 +205,7 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
 	fpregs_unlock();
 
 	if (ret) {
-		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
+		if (!fault_in_pages_writeable(buf_fx, get_xstate_config(XSTATE_USER_SIZE)))
 			goto retry;
 		return -EFAULT;
 	}
@@ -304,12 +304,12 @@ static int restore_fpregs_from_user(void __user *buf, u64 xrestore,
 static int __fpu_restore_sig(void __user *buf, void __user *buf_fx,
 			     bool ia32_fxstate)
 {
-	int state_size = fpu_kernel_xstate_size;
 	struct task_struct *tsk = current;
 	struct fpu *fpu = &tsk->thread.fpu;
 	struct user_i387_ia32_struct env;
 	u64 user_xfeatures = 0;
 	bool fx_only = false;
+	int state_size;
 	int ret;
 
 	if (use_xsave()) {
@@ -323,6 +323,8 @@ static int __fpu_restore_sig(void __user *buf, void __user *buf_fx,
 		state_size = fx_sw_user.xstate_size;
 		user_xfeatures = fx_sw_user.xfeatures;
 	} else {
+		/* The buffer cannot be dynamic without using XSAVE. */
+		state_size = get_xstate_config(XSTATE_MIN_SIZE);
 		user_xfeatures = XFEATURE_MASK_FPSSE;
 	}
 
@@ -418,8 +420,9 @@ static int __fpu_restore_sig(void __user *buf, void __user *buf_fx,
 }
 static inline int xstate_sigframe_size(void)
 {
-	return use_xsave() ? fpu_user_xstate_size + FP_XSTATE_MAGIC2_SIZE :
-			fpu_user_xstate_size;
+	int xstate_size = get_xstate_config(XSTATE_USER_SIZE);
+
+	return use_xsave() ? xstate_size + FP_XSTATE_MAGIC2_SIZE : xstate_size;
 }
 
 /*
@@ -514,19 +517,20 @@ unsigned long fpu__get_fpstate_size(void)
  */
 void fpu__init_prepare_fx_sw_frame(void)
 {
-	int size = fpu_user_xstate_size + FP_XSTATE_MAGIC2_SIZE;
+	int xstate_size = get_xstate_config(XSTATE_USER_SIZE);
+	int ext_size = xstate_size + FP_XSTATE_MAGIC2_SIZE;
 
 	fx_sw_reserved.magic1 = FP_XSTATE_MAGIC1;
-	fx_sw_reserved.extended_size = size;
+	fx_sw_reserved.extended_size = ext_size;
 	fx_sw_reserved.xfeatures = xfeatures_mask_uabi();
-	fx_sw_reserved.xstate_size = fpu_user_xstate_size;
+	fx_sw_reserved.xstate_size = xstate_size;
 
 	if (IS_ENABLED(CONFIG_IA32_EMULATION) ||
 	    IS_ENABLED(CONFIG_X86_32)) {
 		int fsave_header_size = sizeof(struct fregs_state);
 
 		fx_sw_reserved_ia32 = fx_sw_reserved;
-		fx_sw_reserved_ia32.extended_size = size + fsave_header_size;
+		fx_sw_reserved_ia32.extended_size = ext_size + fsave_header_size;
 	}
 }
 
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 74e608c6ad6c..12caf1a56ce0 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -77,12 +77,51 @@ static unsigned int xstate_comp_offsets[XFEATURE_MAX] __ro_after_init =
 static unsigned int xstate_supervisor_only_offsets[XFEATURE_MAX] __ro_after_init =
 	{ [ 0 ... XFEATURE_MAX - 1] = -1};
 
-/*
- * The XSAVE area of kernel can be in standard or compacted format;
- * it is always in standard format for user mode. This is the user
- * mode standard format size used for signal and ptrace frames.
+/**
+ * struct fpu_xstate_buffer_config - xstate buffer configuration
+ * @max_size:			The CPUID-enumerated all-feature "maximum" size
+ *				for xstate per-task buffer.
+ * @min_size:			The size to fit into the statically-allocated
+ *				buffer. With dynamic states, this buffer no longer
+ *				contains all the enabled state components.
+ * @user_size:			The size of user-space buffer for signal and
+ *				ptrace frames, in the non-compacted format.
  */
-unsigned int fpu_user_xstate_size __ro_after_init;
+struct fpu_xstate_buffer_config {
+	unsigned int min_size, max_size;
+	unsigned int user_size;
+};
+
+static struct fpu_xstate_buffer_config buffer_config __ro_after_init;
+
+unsigned int get_xstate_config(enum xstate_config cfg)
+{
+	switch (cfg) {
+	case XSTATE_MIN_SIZE:
+		return buffer_config.min_size;
+	case XSTATE_MAX_SIZE:
+		return buffer_config.max_size;
+	case XSTATE_USER_SIZE:
+		return buffer_config.user_size;
+	default:
+		return 0;
+	}
+}
+EXPORT_SYMBOL_GPL(get_xstate_config);
+
+void set_xstate_config(enum xstate_config cfg, unsigned int value)
+{
+	switch (cfg) {
+	case XSTATE_MIN_SIZE:
+		buffer_config.min_size = value;
+		break;
+	case XSTATE_MAX_SIZE:
+		buffer_config.max_size = value;
+		break;
+	case XSTATE_USER_SIZE:
+		buffer_config.user_size = value;
+	}
+}
 
 /*
  * Return whether the system supports a given xfeature.
@@ -595,7 +634,11 @@ static void do_extra_xstate_size_checks(void)
 		 */
 		paranoid_xstate_size += xfeature_size(i);
 	}
-	XSTATE_WARN_ON(paranoid_xstate_size != fpu_kernel_xstate_size);
+	/*
+	 * The size accounts for all the possible states reserved in the
+	 * per-task buffer.  Check against the maximum size.
+	 */
+	XSTATE_WARN_ON(paranoid_xstate_size != get_xstate_config(XSTATE_MAX_SIZE));
 }
 
 
@@ -690,21 +733,29 @@ static int __init init_xstate_size(void)
 	else
 		possible_xstate_size = xsave_size;
 
-	/* Ensure we have the space to store all enabled: */
-	if (!is_supported_xstate_size(possible_xstate_size))
-		return -EINVAL;
-
 	/*
-	 * The size is OK, we are definitely going to use xsave,
-	 * make it known to the world that we need more space.
+	 * The size accounts for all the possible states reserved in the
+	 * per-task buffer.  Set the maximum with this value.
 	 */
-	fpu_kernel_xstate_size = possible_xstate_size;
+	set_xstate_config(XSTATE_MAX_SIZE, possible_xstate_size);
+
+	/* Perform an extra check for the maximum size. */
 	do_extra_xstate_size_checks();
 
+	/*
+	 * Set the minimum to be the same as the maximum. The dynamic
+	 * user states are not supported yet.
+	 */
+	set_xstate_config(XSTATE_MIN_SIZE, possible_xstate_size);
+
+	/* Ensure the minimum size fits in the statically-allocated buffer: */
+	if (!is_supported_xstate_size(get_xstate_config(XSTATE_MIN_SIZE)))
+		return -EINVAL;
+
 	/*
 	 * User space is always in standard format.
 	 */
-	fpu_user_xstate_size = xsave_size;
+	set_xstate_config(XSTATE_USER_SIZE, xsave_size);
 	return 0;
 }
 
@@ -800,7 +851,7 @@ void __init fpu__init_system_xstate(void)
 	 * Update info used for ptrace frames; use standard-format size and no
 	 * supervisor xstates:
 	 */
-	update_regset_xstate_info(fpu_user_xstate_size, xfeatures_mask_uabi());
+	update_regset_xstate_info(get_xstate_config(XSTATE_USER_SIZE), xfeatures_mask_uabi());
 
 	fpu__init_prepare_fx_sw_frame();
 	setup_init_fpu_buf();
@@ -820,7 +871,7 @@ void __init fpu__init_system_xstate(void)
 	print_xstate_offset_size();
 	pr_info("x86/fpu: Enabled xstate features 0x%llx, context size is %d bytes, using '%s' format.\n",
 		xfeatures_mask_all,
-		fpu_kernel_xstate_size,
+		get_xstate_config(XSTATE_MAX_SIZE),
 		boot_cpu_has(X86_FEATURE_XSAVES) ? "compacted" : "standard");
 	return;
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 1d9463e3096b..9ad39e807fcf 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -90,6 +90,13 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 	return fpu_clone(dst);
 }
 
+void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
+{
+	*offset = offsetof(struct thread_struct, fpu.state);
+	/* The buffer embedded in thread_struct has the minimum size. */
+	*size = get_xstate_config(XSTATE_MIN_SIZE);
+}
+
 /*
  * Free thread data structures etc..
  */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c72e3ad0f9b8..e1d69ba8e743 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9886,10 +9886,13 @@ static void kvm_save_current_fpu(struct fpu *fpu)
 	/*
 	 * If the target FPU state is not resident in the CPU registers, just
 	 * memcpy() from current, else save CPU state directly to the target.
+	 *
+	 * KVM does not support dynamic user states yet. Assume the buffer
+	 * always has the minimum size.
 	 */
 	if (test_thread_flag(TIF_NEED_FPU_LOAD))
 		memcpy(&fpu->state, &current->thread.fpu.state,
-		       fpu_kernel_xstate_size);
+		       get_xstate_config(XSTATE_MIN_SIZE));
 	else
 		save_fpregs_to_fpstate(fpu);
 }
-- 
2.17.1



* [PATCH v9 06/26] x86/fpu/xstate: Calculate and remember dynamic XSTATE buffer sizes
From: Chang S. Bae @ 2021-07-30 14:59 UTC
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae

The CPUID instruction separately enumerates sizes and alignments of
individual xfeatures. It independently enumerates the required size of an
entire XSAVE buffer to store all enabled features.
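
For reference, a sketch of that per-feature enumeration (illustrative only;
cpuid_count() is the kernel's existing CPUID helper, and leaf 0xD is the
XSTATE leaf):

	u32 eax, ebx, ecx, edx;
	unsigned int size, offset;
	bool aligned64;

	/* Sub-leaf i of CPUID leaf 0xD describes xfeature i: */
	cpuid_count(XSTATE_CPUID, i, &eax, &ebx, &ecx, &edx);
	size      = eax;	/* size of this feature's state in bytes */
	offset    = ebx;	/* offset in the non-compacted format */
	aligned64 = ecx & 2;	/* starts 64-byte aligned when compacted */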

do_extra_xstate_size_checks() currently uses the individual feature
size/alignment enumeration to independently recalculate the required XSAVE
buffer size, and compares the result against the CPUID-provided value.

Extend the function to accept an option to exclude dynamic states. With
that, calculate the maximum size that contains all the enabled states, and
the minimum size that fits in the statically-allocated buffer by excluding
dynamic states.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v6:
* Simplify xstate size calculation code. (Dave Hansen)
* Updated the changelog. (Dave Hansen)
* Fixed the v6 changes.

Changes from v5:
* Re-adjusted some local variable names.

Changes from v4:
* Massaged the function description, in preparation for the change
  with a return value.

Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Updated the code comment. (Borislav Petkov)
* Adjusted the calculation function naming.
* Moved out the new variable addition into a new patch.

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
* Renamed the in-line size variable.
* Updated some code comments.
---
 arch/x86/kernel/fpu/xstate.c | 59 ++++++++++++++++++------------------
 1 file changed, 30 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 12caf1a56ce0..cd709408efb5 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -591,24 +591,28 @@ static void check_xstate_against_struct(int nr)
 	}
 }
 
-/*
- * This essentially double-checks what the cpu told us about
- * how large the XSAVE buffer needs to be.  We are recalculating
- * it to be safe.
+/**
+ * calculate_xstate_size - Calculate the xstate per-task buffer size.
+ *
+ * Independent XSAVE features allocate their own buffers and are always
+ * excluded. Only the size of the buffer for task->fpu is calculated here.
  *
- * Independent XSAVE features allocate their own buffers and are not
- * covered by these checks. Only the size of the buffer for task->fpu
- * is checked here.
+ * @include_dynamic_states:	Whether to include the dynamic user states.
+ *
+ * Return:			The calculated xstate size.
  */
-static void do_extra_xstate_size_checks(void)
+static unsigned int calculate_xstate_size(bool include_dynamic_states)
 {
-	int paranoid_xstate_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
+	unsigned int xstate_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
 	int i;
 
 	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
 		if (!xfeature_enabled(i))
 			continue;
 
+		if (!include_dynamic_states && (xfeatures_mask_user_dynamic & BIT_ULL(i)))
+			continue;
+
 		check_xstate_against_struct(i);
 		/*
 		 * Supervisor state components can be managed only by
@@ -619,7 +623,7 @@ static void do_extra_xstate_size_checks(void)
 
 		/* Align from the end of the previous feature */
 		if (xfeature_is_aligned(i))
-			paranoid_xstate_size = ALIGN(paranoid_xstate_size, 64);
+			xstate_size = ALIGN(xstate_size, 64);
 		/*
 		 * The offset of a given state in the non-compacted
 		 * format is given to us in a CPUID leaf.  We check
@@ -627,18 +631,15 @@ static void do_extra_xstate_size_checks(void)
 		 * setup_xstate_features(). XSAVES uses compacted format.
 		 */
 		if (!cpu_feature_enabled(X86_FEATURE_XSAVES))
-			paranoid_xstate_size = xfeature_uncompacted_offset(i);
+			xstate_size = xfeature_uncompacted_offset(i);
 		/*
 		 * The compacted-format offset always depends on where
 		 * the previous state ended.
 		 */
-		paranoid_xstate_size += xfeature_size(i);
+		xstate_size += xfeature_size(i);
 	}
-	/*
-	 * The size accounts for all the possible states reserved in the
-	 * per-task buffer.  Check against the maximum size.
-	 */
-	XSTATE_WARN_ON(paranoid_xstate_size != get_xstate_config(XSTATE_MAX_SIZE));
+
+	return xstate_size;
 }
 
 
@@ -723,7 +724,7 @@ static bool is_supported_xstate_size(unsigned int test_xstate_size)
 static int __init init_xstate_size(void)
 {
 	/* Recompute the context size for enabled features: */
-	unsigned int possible_xstate_size;
+	unsigned int possible_xstate_size, xstate_size;
 	unsigned int xsave_size;
 
 	xsave_size = get_xsave_size();
@@ -734,23 +735,23 @@ static int __init init_xstate_size(void)
 		possible_xstate_size = xsave_size;
 
 	/*
-	 * The size accounts for all the possible states reserved in the
-	 * per-task buffer.  Set the maximum with this value.
+	 * Calculate xstate size for all the possible states by setting
+	 * 'true' to include dynamic states. Cross-check with the CPUID-
+	 * provided size and record it.
 	 */
+	xstate_size = calculate_xstate_size(true);
+	XSTATE_WARN_ON(possible_xstate_size != xstate_size);
 	set_xstate_config(XSTATE_MAX_SIZE, possible_xstate_size);
 
-	/* Perform an extra check for the maximum size. */
-	do_extra_xstate_size_checks();
-
 	/*
-	 * Set the minimum to be the same as the maximum. The dynamic
-	 * user states are not supported yet.
+	 * Calculate the xstate size without dynamic states by setting
+	 * 'false' to exclude dynamic states. Ensure the size fits in
+	 * the statically-allocated buffer and record it.
 	 */
-	set_xstate_config(XSTATE_MIN_SIZE, possible_xstate_size);
-
-	/* Ensure the minimum size fits in the statically-allocated buffer: */
-	if (!is_supported_xstate_size(get_xstate_config(XSTATE_MIN_SIZE)))
+	xstate_size = calculate_xstate_size(false);
+	if (!is_supported_xstate_size(xstate_size))
 		return -EINVAL;
+	set_xstate_config(XSTATE_MIN_SIZE, xstate_size);
 
 	/*
 	 * User space is always in standard format.
-- 
2.17.1


* [PATCH v9 07/26] x86/fpu/xstate: Convert the struct fpu 'state' field to a pointer
  2021-07-30 14:59 [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (5 preceding siblings ...)
  2021-07-30 14:59 ` [PATCH v9 06/26] x86/fpu/xstate: Calculate and remember dynamic XSTATE buffer sizes Chang S. Bae
@ 2021-07-30 14:59 ` Chang S. Bae
  2021-08-12 17:09   ` Borislav Petkov
  2021-07-30 14:59 ` [PATCH v9 08/26] x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer dynamically Chang S. Bae
                   ` (18 subsequent siblings)
  25 siblings, 1 reply; 91+ messages in thread
From: Chang S. Bae @ 2021-07-30 14:59 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae, kvm

The XSTATE per-task buffer is embedded into struct fpu. The field 'state'
represents the buffer. When the dynamic user state is in use, the buffer
may be dynamically allocated.

Convert the 'state' field to point either to the embedded buffer or to the
dynamically-allocated buffer. Also, add a new field to represent the
embedded buffer.

The initial task sets the new pointer before dealing with the soft FPU
state. Make sure that every FPU state has a valid pointer value on its
creation.
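
With the pointer in place, "is the dynamic buffer in use?" reduces to a
pointer comparison. An illustrative helper (not added by this patch; a
later patch open-codes the same check when freeing the buffer):

	static inline bool fpu_state_is_dynamic(struct fpu *fpu)
	{
		/* @state points to either @__default_state or vmalloc()'d memory */
		return fpu->state != &fpu->__default_state;
	}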

No functional change.

Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
---
Changes from v5:
* Tightened up task size calculation (previously, it could over-calculate)
* Adjusted the changelog.

Changes from v4:
* Fixed KVM's user_fpu and guest_fpu to initialize the 'state' field correctly.
* Massaged the changelog.

Changes from v3:
* Added as a new patch to simplify the buffer access. (Borislav Petkov)
---
 arch/x86/include/asm/fpu/internal.h |  2 +-
 arch/x86/include/asm/fpu/types.h    | 29 ++++++++++++++++++++------
 arch/x86/include/asm/trace/fpu.h    |  4 ++--
 arch/x86/kernel/fpu/core.c          | 32 +++++++++++++++--------------
 arch/x86/kernel/fpu/init.c          |  8 +++++---
 arch/x86/kernel/fpu/regset.c        | 24 +++++++++++-----------
 arch/x86/kernel/fpu/signal.c        | 24 +++++++++++-----------
 arch/x86/kernel/fpu/xstate.c        |  8 ++++----
 arch/x86/kernel/process.c           |  2 +-
 arch/x86/kvm/x86.c                  | 22 +++++++++++---------
 arch/x86/math-emu/fpu_aux.c         |  2 +-
 arch/x86/math-emu/fpu_entry.c       |  4 ++--
 arch/x86/math-emu/fpu_system.h      |  2 +-
 13 files changed, 93 insertions(+), 70 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index c7a64e2806a9..d2fc19c0e457 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -484,7 +484,7 @@ static inline void fpregs_restore_userregs(void)
 		 */
 		mask = xfeatures_mask_restore_user() |
 			xfeatures_mask_supervisor();
-		__restore_fpregs_from_fpstate(&fpu->state, mask);
+		__restore_fpregs_from_fpstate(fpu->state, mask);
 
 		fpregs_activate(fpu);
 		fpu->last_cpu = cpu;
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index f5a38a5f3ae1..c7826708f27f 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -339,13 +339,30 @@ struct fpu {
 	/*
 	 * @state:
 	 *
-	 * In-memory copy of all FPU registers that we save/restore
-	 * over context switches. If the task is using the FPU then
-	 * the registers in the FPU are more recent than this state
-	 * copy. If the task context-switches away then they get
-	 * saved here and represent the FPU state.
+	 * A pointer to indicate the in-memory copy of all FPU registers
+	 * that are saved/restored over context switches.
+	 *
+	 * Initially @state points to @__default_state. When dynamic states
+	 * get used, a memory is allocated for the larger state copy and
+	 * @state is updated to point to it. Then, the state in ->state
+	 * supersedes and invalidates the state in @__default_state.
+	 *
+	 * In general, if the task is using the FPU then the registers in
+	 * the FPU are more recent than the state copy. If the task
+	 * context-switches away then they get saved in ->state and
+	 * represent the FPU state.
+	 */
+	union fpregs_state		*state;
+
+	/*
+	 * @__default_state:
+	 *
+	 * Initial in-memory copy of all FPU registers that are saved and
+	 * restored over context switches. When the task starts using dynamic
+	 * states, this copy is replaced with the new in-memory copy in
+	 * ->state.
 	 */
-	union fpregs_state		state;
+	union fpregs_state		__default_state;
 	/*
 	 * WARNING: 'state' is dynamically-sized.  Do not put
 	 * anything after it here.
diff --git a/arch/x86/include/asm/trace/fpu.h b/arch/x86/include/asm/trace/fpu.h
index 879b77792f94..ef82f4824ce7 100644
--- a/arch/x86/include/asm/trace/fpu.h
+++ b/arch/x86/include/asm/trace/fpu.h
@@ -22,8 +22,8 @@ DECLARE_EVENT_CLASS(x86_fpu,
 		__entry->fpu		= fpu;
 		__entry->load_fpu	= test_thread_flag(TIF_NEED_FPU_LOAD);
 		if (boot_cpu_has(X86_FEATURE_OSXSAVE)) {
-			__entry->xfeatures = fpu->state.xsave.header.xfeatures;
-			__entry->xcomp_bv  = fpu->state.xsave.header.xcomp_bv;
+			__entry->xfeatures = fpu->state->xsave.header.xfeatures;
+			__entry->xcomp_bv  = fpu->state->xsave.header.xcomp_bv;
 		}
 	),
 	TP_printk("x86/fpu: %p load: %d xfeatures: %llx xcomp_bv: %llx",
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 808f7627975d..6390562516c9 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -99,19 +99,19 @@ EXPORT_SYMBOL(irq_fpu_usable);
 void save_fpregs_to_fpstate(struct fpu *fpu)
 {
 	if (likely(use_xsave())) {
-		os_xsave(&fpu->state.xsave);
+		os_xsave(&fpu->state->xsave);
 
 		/*
 		 * AVX512 state is tracked here because its use is
 		 * known to slow the max clock speed of the core.
 		 */
-		if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512)
+		if (fpu->state->xsave.header.xfeatures & XFEATURE_MASK_AVX512)
 			fpu->avx512_timestamp = jiffies;
 		return;
 	}
 
 	if (likely(use_fxsr())) {
-		fxsave(&fpu->state.fxsave);
+		fxsave(&fpu->state->fxsave);
 		return;
 	}
 
@@ -119,8 +119,8 @@ void save_fpregs_to_fpstate(struct fpu *fpu)
 	 * Legacy FPU register saving, FNSAVE always clears FPU registers,
 	 * so we have to reload them from the memory state.
 	 */
-	asm volatile("fnsave %[fp]; fwait" : [fp] "=m" (fpu->state.fsave));
-	frstor(&fpu->state.fsave);
+	asm volatile("fnsave %[fp]; fwait" : [fp] "=m" (fpu->state->fsave));
+	frstor(&fpu->state->fsave);
 }
 EXPORT_SYMBOL(save_fpregs_to_fpstate);
 
@@ -235,7 +235,7 @@ void fpstate_init(struct fpu *fpu)
 	u64 mask;
 
 	if (likely(fpu)) {
-		state = &fpu->state;
+		state = fpu->state;
 		/* The dynamic user states are not prepared yet. */
 		mask = xfeatures_mask_all & ~xfeatures_mask_user_dynamic;
 		size = get_xstate_config(XSTATE_MIN_SIZE);
@@ -274,6 +274,8 @@ int fpu_clone(struct task_struct *dst)
 	if (!cpu_feature_enabled(X86_FEATURE_FPU))
 		return 0;
 
+	dst_fpu->state = &dst_fpu->__default_state;
+
 	/*
 	 * Don't let 'init optimized' areas of the XSAVE area
 	 * leak into the child task:
@@ -281,7 +283,7 @@ int fpu_clone(struct task_struct *dst)
 	 * The child does not inherit the dynamic states. So,
 	 * the xstate buffer has the minimum size.
 	 */
-	memset(&dst_fpu->state.xsave, 0, get_xstate_config(XSTATE_MIN_SIZE));
+	memset(&dst_fpu->state->xsave, 0, get_xstate_config(XSTATE_MIN_SIZE));
 
 	/*
 	 * If the FPU registers are not owned by current just memcpy() the
@@ -290,7 +292,7 @@ int fpu_clone(struct task_struct *dst)
 	 */
 	fpregs_lock();
 	if (test_thread_flag(TIF_NEED_FPU_LOAD))
-		memcpy(&dst_fpu->state, &src_fpu->state, get_xstate_config(XSTATE_MIN_SIZE));
+		memcpy(dst_fpu->state, src_fpu->state, get_xstate_config(XSTATE_MIN_SIZE));
 
 	else
 		save_fpregs_to_fpstate(dst_fpu);
@@ -377,7 +379,7 @@ static void fpu_reset_fpstate(void)
 	 * user space as PKRU is eagerly written in switch_to() and
 	 * flush_thread().
 	 */
-	memcpy(&fpu->state, &init_fpstate, init_fpstate_copy_size());
+	memcpy(fpu->state, &init_fpstate, init_fpstate_copy_size());
 	set_thread_flag(TIF_NEED_FPU_LOAD);
 	fpregs_unlock();
 }
@@ -404,7 +406,7 @@ void fpu__clear_user_states(struct fpu *fpu)
 	 */
 	if (xfeatures_mask_supervisor() &&
 	    !fpregs_state_valid(fpu, smp_processor_id())) {
-		os_xrstor(&fpu->state.xsave, xfeatures_mask_supervisor());
+		os_xrstor(&fpu->state->xsave, xfeatures_mask_supervisor());
 	}
 
 	/* Reset user states in registers. */
@@ -486,11 +488,11 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
 		 * fully reproduce the context of the exception.
 		 */
 		if (boot_cpu_has(X86_FEATURE_FXSR)) {
-			cwd = fpu->state.fxsave.cwd;
-			swd = fpu->state.fxsave.swd;
+			cwd = fpu->state->fxsave.cwd;
+			swd = fpu->state->fxsave.swd;
 		} else {
-			cwd = (unsigned short)fpu->state.fsave.cwd;
-			swd = (unsigned short)fpu->state.fsave.swd;
+			cwd = (unsigned short)fpu->state->fsave.cwd;
+			swd = (unsigned short)fpu->state->fsave.swd;
 		}
 
 		err = swd & ~cwd;
@@ -504,7 +506,7 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
 		unsigned short mxcsr = MXCSR_DEFAULT;
 
 		if (boot_cpu_has(X86_FEATURE_XMM))
-			mxcsr = fpu->state.fxsave.mxcsr;
+			mxcsr = fpu->state->fxsave.mxcsr;
 
 		err = ~(mxcsr >> 7) & mxcsr;
 	}
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 10e2a95916aa..3e4e14ca723b 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -31,10 +31,12 @@ static void fpu__init_cpu_generic(void)
 		cr0 |= X86_CR0_EM;
 	write_cr0(cr0);
 
+	current->thread.fpu.state = &current->thread.fpu.__default_state;
+
 	/* Flush out any pending x87 state: */
 #ifdef CONFIG_MATH_EMULATION
 	if (!boot_cpu_has(X86_FEATURE_FPU))
-		fpstate_init_soft(&current->thread.fpu.state.soft);
+		fpstate_init_soft(&current->thread.fpu.state->soft);
 	else
 #endif
 		asm volatile ("fninit");
@@ -153,7 +155,7 @@ static void __init fpu__init_task_struct_size(void)
 	 * Subtract off the static size of the register state.
 	 * It potentially has a bunch of padding.
 	 */
-	task_size -= sizeof(((struct task_struct *)0)->thread.fpu.state);
+	task_size -= sizeof(((struct task_struct *)0)->thread.fpu.__default_state);
 
 	/*
 	 * Add back the dynamically-calculated register state
@@ -170,7 +172,7 @@ static void __init fpu__init_task_struct_size(void)
 	 * you hit a compile error here, check the structure to
 	 * see if something got added to the end.
 	 */
-	CHECK_MEMBER_AT_END_OF(struct fpu, state);
+	CHECK_MEMBER_AT_END_OF(struct fpu, __default_state);
 	CHECK_MEMBER_AT_END_OF(struct thread_struct, fpu);
 	CHECK_MEMBER_AT_END_OF(struct task_struct, thread);
 
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 8dea3730620e..73d7d7b489fe 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -74,8 +74,8 @@ int xfpregs_get(struct task_struct *target, const struct user_regset *regset,
 	sync_fpstate(fpu);
 
 	if (!use_xsave()) {
-		return membuf_write(&to, &fpu->state.fxsave,
-				    sizeof(fpu->state.fxsave));
+		return membuf_write(&to, &fpu->state->fxsave,
+				    sizeof(fpu->state->fxsave));
 	}
 
 	copy_xstate_to_uabi_buf(to, target, XSTATE_COPY_FX);
@@ -110,15 +110,15 @@ int xfpregs_set(struct task_struct *target, const struct user_regset *regset,
 	fpu_force_restore(fpu);
 
 	/* Copy the state  */
-	memcpy(&fpu->state.fxsave, &newstate, sizeof(newstate));
+	memcpy(&fpu->state->fxsave, &newstate, sizeof(newstate));
 
 	/* Clear xmm8..15 */
-	BUILD_BUG_ON(sizeof(fpu->state.fxsave.xmm_space) != 16 * 16);
-	memset(&fpu->state.fxsave.xmm_space[8], 0, 8 * 16);
+	BUILD_BUG_ON(sizeof(fpu->state->fxsave.xmm_space) != 16 * 16);
+	memset(&fpu->state->fxsave.xmm_space[8], 0, 8 * 16);
 
 	/* Mark FP and SSE as in use when XSAVE is enabled */
 	if (use_xsave())
-		fpu->state.xsave.header.xfeatures |= XFEATURE_MASK_FPSSE;
+		fpu->state->xsave.header.xfeatures |= XFEATURE_MASK_FPSSE;
 
 	return 0;
 }
@@ -283,7 +283,7 @@ static void __convert_from_fxsr(struct user_i387_ia32_struct *env,
 void
 convert_from_fxsr(struct user_i387_ia32_struct *env, struct task_struct *tsk)
 {
-	__convert_from_fxsr(env, tsk, &tsk->thread.fpu.state.fxsave);
+	__convert_from_fxsr(env, tsk, &tsk->thread.fpu.state->fxsave);
 }
 
 void convert_to_fxsr(struct fxregs_state *fxsave,
@@ -326,7 +326,7 @@ int fpregs_get(struct task_struct *target, const struct user_regset *regset,
 		return fpregs_soft_get(target, regset, to);
 
 	if (!cpu_feature_enabled(X86_FEATURE_FXSR)) {
-		return membuf_write(&to, &fpu->state.fsave,
+		return membuf_write(&to, &fpu->state->fsave,
 				    sizeof(struct fregs_state));
 	}
 
@@ -337,7 +337,7 @@ int fpregs_get(struct task_struct *target, const struct user_regset *regset,
 		copy_xstate_to_uabi_buf(mb, target, XSTATE_COPY_FP);
 		fx = &fxsave;
 	} else {
-		fx = &fpu->state.fxsave;
+		fx = &fpu->state->fxsave;
 	}
 
 	__convert_from_fxsr(&env, target, fx);
@@ -366,16 +366,16 @@ int fpregs_set(struct task_struct *target, const struct user_regset *regset,
 	fpu_force_restore(fpu);
 
 	if (cpu_feature_enabled(X86_FEATURE_FXSR))
-		convert_to_fxsr(&fpu->state.fxsave, &env);
+		convert_to_fxsr(&fpu->state->fxsave, &env);
 	else
-		memcpy(&fpu->state.fsave, &env, sizeof(env));
+		memcpy(&fpu->state->fsave, &env, sizeof(env));
 
 	/*
 	 * Update the header bit in the xsave header, indicating the
 	 * presence of FP.
 	 */
 	if (cpu_feature_enabled(X86_FEATURE_XSAVE))
-		fpu->state.xsave.header.xfeatures |= XFEATURE_MASK_FP;
+		fpu->state->xsave.header.xfeatures |= XFEATURE_MASK_FP;
 
 	return 0;
 }
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 63f000988fa6..2f35aada2007 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -67,13 +67,13 @@ static inline int check_xstate_in_sigframe(struct fxregs_state __user *fxbuf,
 static inline int save_fsave_header(struct task_struct *tsk, void __user *buf)
 {
 	if (use_fxsr()) {
-		struct xregs_state *xsave = &tsk->thread.fpu.state.xsave;
+		struct xregs_state *xsave = &tsk->thread.fpu.state->xsave;
 		struct user_i387_ia32_struct env;
 		struct _fpstate_32 __user *fp = buf;
 
 		fpregs_lock();
 		if (!test_thread_flag(TIF_NEED_FPU_LOAD))
-			fxsave(&tsk->thread.fpu.state.fxsave);
+			fxsave(&tsk->thread.fpu.state->fxsave);
 		fpregs_unlock();
 
 		convert_from_fxsr(&env, tsk);
@@ -294,7 +294,7 @@ static int restore_fpregs_from_user(void __user *buf, u64 xrestore,
 	 * been restored from a user buffer directly.
 	 */
 	if (test_thread_flag(TIF_NEED_FPU_LOAD) && xfeatures_mask_supervisor())
-		os_xrstor(&fpu->state.xsave, xfeatures_mask_supervisor());
+		os_xrstor(&fpu->state->xsave, xfeatures_mask_supervisor());
 
 	fpregs_mark_activate();
 	fpregs_unlock();
@@ -365,7 +365,7 @@ static int __fpu_restore_sig(void __user *buf, void __user *buf_fx,
 		 * the right place in memory. It's ia32 mode. Shrug.
 		 */
 		if (xfeatures_mask_supervisor())
-			os_xsave(&fpu->state.xsave);
+			os_xsave(&fpu->state->xsave);
 		set_thread_flag(TIF_NEED_FPU_LOAD);
 	}
 	__fpu_invalidate_fpregs_state(fpu);
@@ -377,21 +377,21 @@ static int __fpu_restore_sig(void __user *buf, void __user *buf_fx,
 		if (ret)
 			return ret;
 	} else {
-		if (__copy_from_user(&fpu->state.fxsave, buf_fx,
-				     sizeof(fpu->state.fxsave)))
+		if (__copy_from_user(&fpu->state->fxsave, buf_fx,
+				     sizeof(fpu->state->fxsave)))
 			return -EFAULT;
 
 		/* Reject invalid MXCSR values. */
-		if (fpu->state.fxsave.mxcsr & ~mxcsr_feature_mask)
+		if (fpu->state->fxsave.mxcsr & ~mxcsr_feature_mask)
 			return -EINVAL;
 
 		/* Enforce XFEATURE_MASK_FPSSE when XSAVE is enabled */
 		if (use_xsave())
-			fpu->state.xsave.header.xfeatures |= XFEATURE_MASK_FPSSE;
+			fpu->state->xsave.header.xfeatures |= XFEATURE_MASK_FPSSE;
 	}
 
 	/* Fold the legacy FP storage */
-	convert_to_fxsr(&fpu->state.fxsave, &env);
+	convert_to_fxsr(&fpu->state->fxsave, &env);
 
 	fpregs_lock();
 	if (use_xsave()) {
@@ -406,10 +406,10 @@ static int __fpu_restore_sig(void __user *buf, void __user *buf_fx,
 		 */
 		u64 mask = user_xfeatures | xfeatures_mask_supervisor();
 
-		fpu->state.xsave.header.xfeatures &= mask;
-		ret = os_xrstor_safe(&fpu->state.xsave, xfeatures_mask_all);
+		fpu->state->xsave.header.xfeatures &= mask;
+		ret = os_xrstor_safe(&fpu->state->xsave, xfeatures_mask_all);
 	} else {
-		ret = fxrstor_safe(&fpu->state.fxsave);
+		ret = fxrstor_safe(&fpu->state->fxsave);
 	}
 
 	if (likely(!ret))
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index cd709408efb5..5f58dca4c6b7 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -925,7 +925,7 @@ static void *__raw_xsave_addr(struct fpu *fpu, int xfeature_nr)
 	}
 
 	if (fpu)
-		xsave = &fpu->state.xsave;
+		xsave = &fpu->state->xsave;
 	else
 		xsave = &init_fpstate.xsave;
 
@@ -968,7 +968,7 @@ void *get_xsave_addr(struct fpu *fpu, int xfeature_nr)
 		  "get of unsupported state");
 
 	if (fpu)
-		xsave = &fpu->state.xsave;
+		xsave = &fpu->state->xsave;
 	else
 		xsave = &init_fpstate.xsave;
 
@@ -1060,7 +1060,7 @@ void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
 			     enum xstate_copy_mode copy_mode)
 {
 	const unsigned int off_mxcsr = offsetof(struct fxregs_state, mxcsr);
-	struct xregs_state *xsave = &tsk->thread.fpu.state.xsave;
+	struct xregs_state *xsave = &tsk->thread.fpu.state->xsave;
 	struct xregs_state *xinit = &init_fpstate.xsave;
 	struct xstate_header header;
 	unsigned int zerofrom;
@@ -1177,7 +1177,7 @@ static int copy_from_buffer(void *dst, unsigned int offset, unsigned int size,
 static int copy_uabi_to_xstate(struct fpu *fpu, const void *kbuf,
 			       const void __user *ubuf)
 {
-	struct xregs_state *xsave = &fpu->state.xsave;
+	struct xregs_state *xsave = &fpu->state->xsave;
 	unsigned int offset, size;
 	struct xstate_header hdr;
 	u64 mask;
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 9ad39e807fcf..534b9fb7e7ee 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -92,7 +92,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 
 void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
 {
-	*offset = offsetof(struct thread_struct, fpu.state);
+	*offset = offsetof(struct thread_struct, fpu.__default_state);
 	/* The buffer embedded in thread_struct has the minimum size. */
 	*size = get_xstate_config(XSTATE_MIN_SIZE);
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e1d69ba8e743..18a337f99459 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4685,7 +4685,7 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
 
 static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
 {
-	struct xregs_state *xsave = &vcpu->arch.guest_fpu->state.xsave;
+	struct xregs_state *xsave = &vcpu->arch.guest_fpu->state->xsave;
 	u64 xstate_bv = xsave->header.xfeatures;
 	u64 valid;
 
@@ -4728,7 +4728,7 @@ static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
 
 static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
 {
-	struct xregs_state *xsave = &vcpu->arch.guest_fpu->state.xsave;
+	struct xregs_state *xsave = &vcpu->arch.guest_fpu->state->xsave;
 	u64 xstate_bv = *(u64 *)(src + XSAVE_HDR_OFFSET);
 	u64 valid;
 
@@ -4781,7 +4781,7 @@ static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
 		fill_xsave((u8 *) guest_xsave->region, vcpu);
 	} else {
 		memcpy(guest_xsave->region,
-			&vcpu->arch.guest_fpu->state.fxsave,
+			&vcpu->arch.guest_fpu->state->fxsave,
 			sizeof(struct fxregs_state));
 		*(u64 *)&guest_xsave->region[XSAVE_HDR_OFFSET / sizeof(u32)] =
 			XFEATURE_MASK_FPSSE;
@@ -4815,7 +4815,7 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
 		if (xstate_bv & ~XFEATURE_MASK_FPSSE ||
 			mxcsr & ~mxcsr_feature_mask)
 			return -EINVAL;
-		memcpy(&vcpu->arch.guest_fpu->state.fxsave,
+		memcpy(&vcpu->arch.guest_fpu->state->fxsave,
 			guest_xsave->region, sizeof(struct fxregs_state));
 	}
 	return 0;
@@ -9891,7 +9891,7 @@ static void kvm_save_current_fpu(struct fpu *fpu)
 	 * always has the minimum size.
 	 */
 	if (test_thread_flag(TIF_NEED_FPU_LOAD))
-		memcpy(&fpu->state, &current->thread.fpu.state,
+		memcpy(fpu->state, current->thread.fpu.state,
 		       get_xstate_config(XSTATE_MIN_SIZE));
 	else
 		save_fpregs_to_fpstate(fpu);
@@ -9910,7 +9910,7 @@ static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
 	 */
 	if (vcpu->arch.guest_fpu)
 		/* PKRU is separately restored in kvm_x86_ops.run. */
-		__restore_fpregs_from_fpstate(&vcpu->arch.guest_fpu->state,
+		__restore_fpregs_from_fpstate(vcpu->arch.guest_fpu->state,
 					~XFEATURE_MASK_PKRU);
 
 	fpregs_mark_activate();
@@ -9931,7 +9931,7 @@ static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 	if (vcpu->arch.guest_fpu)
 		kvm_save_current_fpu(vcpu->arch.guest_fpu);
 
-	restore_fpregs_from_fpstate(&vcpu->arch.user_fpu->state);
+	restore_fpregs_from_fpstate(vcpu->arch.user_fpu->state);
 
 	fpregs_mark_activate();
 	fpregs_unlock();
@@ -10520,7 +10520,7 @@ int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 
 	vcpu_load(vcpu);
 
-	fxsave = &vcpu->arch.guest_fpu->state.fxsave;
+	fxsave = &vcpu->arch.guest_fpu->state->fxsave;
 	memcpy(fpu->fpr, fxsave->st_space, 128);
 	fpu->fcw = fxsave->cwd;
 	fpu->fsw = fxsave->swd;
@@ -10543,7 +10543,7 @@ int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 
 	vcpu_load(vcpu);
 
-	fxsave = &vcpu->arch.guest_fpu->state.fxsave;
+	fxsave = &vcpu->arch.guest_fpu->state->fxsave;
 
 	memcpy(fxsave->st_space, fpu->fpr, 128);
 	fxsave->cwd = fpu->fcw;
@@ -10604,7 +10604,7 @@ static void fx_init(struct kvm_vcpu *vcpu)
 
 	fpstate_init(vcpu->arch.guest_fpu);
 	if (boot_cpu_has(X86_FEATURE_XSAVES))
-		vcpu->arch.guest_fpu->state.xsave.header.xcomp_bv =
+		vcpu->arch.guest_fpu->state->xsave.header.xcomp_bv =
 			host_xcr0 | XSTATE_COMPACTION_ENABLED;
 
 	/*
@@ -10684,6 +10684,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 		pr_err("kvm: failed to allocate userspace's fpu\n");
 		goto free_emulate_ctxt;
 	}
+	vcpu->arch.user_fpu->state = &vcpu->arch.user_fpu->__default_state;
 
 	vcpu->arch.guest_fpu = kmem_cache_zalloc(x86_fpu_cache,
 						 GFP_KERNEL_ACCOUNT);
@@ -10691,6 +10692,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 		pr_err("kvm: failed to allocate vcpu's fpu\n");
 		goto free_user_fpu;
 	}
+	vcpu->arch.guest_fpu->state = &vcpu->arch.guest_fpu->__default_state;
 	fx_init(vcpu);
 
 	vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
diff --git a/arch/x86/math-emu/fpu_aux.c b/arch/x86/math-emu/fpu_aux.c
index 034748459482..51432a73024c 100644
--- a/arch/x86/math-emu/fpu_aux.c
+++ b/arch/x86/math-emu/fpu_aux.c
@@ -53,7 +53,7 @@ void fpstate_init_soft(struct swregs_state *soft)
 
 void finit(void)
 {
-	fpstate_init_soft(&current->thread.fpu.state.soft);
+	fpstate_init_soft(&current->thread.fpu.state->soft);
 }
 
 /*
diff --git a/arch/x86/math-emu/fpu_entry.c b/arch/x86/math-emu/fpu_entry.c
index 8679a9d6c47f..6ba56632170e 100644
--- a/arch/x86/math-emu/fpu_entry.c
+++ b/arch/x86/math-emu/fpu_entry.c
@@ -640,7 +640,7 @@ int fpregs_soft_set(struct task_struct *target,
 		    unsigned int pos, unsigned int count,
 		    const void *kbuf, const void __user *ubuf)
 {
-	struct swregs_state *s387 = &target->thread.fpu.state.soft;
+	struct swregs_state *s387 = &target->thread.fpu.state->soft;
 	void *space = s387->st_space;
 	int ret;
 	int offset, other, i, tags, regnr, tag, newtop;
@@ -691,7 +691,7 @@ int fpregs_soft_get(struct task_struct *target,
 		    const struct user_regset *regset,
 		    struct membuf to)
 {
-	struct swregs_state *s387 = &target->thread.fpu.state.soft;
+	struct swregs_state *s387 = &target->thread.fpu.state->soft;
 	const void *space = s387->st_space;
 	int offset = (S387->ftop & 7) * 10, other = 80 - offset;
 
diff --git a/arch/x86/math-emu/fpu_system.h b/arch/x86/math-emu/fpu_system.h
index 9b41391867dc..a6291ddfdda6 100644
--- a/arch/x86/math-emu/fpu_system.h
+++ b/arch/x86/math-emu/fpu_system.h
@@ -73,7 +73,7 @@ static inline bool seg_writable(struct desc_struct *d)
 	return (d->type & SEG_TYPE_EXECUTE_MASK) == SEG_TYPE_WRITABLE;
 }
 
-#define I387			(&current->thread.fpu.state)
+#define I387			(current->thread.fpu.state)
 #define FPU_info		(I387->soft.info)
 
 #define FPU_CS			(*(unsigned short *) &(FPU_info->regs->cs))
-- 
2.17.1


* [PATCH v9 08/26] x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer dynamically
  2021-07-30 14:59 [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (6 preceding siblings ...)
  2021-07-30 14:59 ` [PATCH v9 07/26] x86/fpu/xstate: Convert the struct fpu 'state' field to a pointer Chang S. Bae
@ 2021-07-30 14:59 ` Chang S. Bae
  2021-08-12 19:44   ` Borislav Petkov
  2021-08-30 17:45   ` Dave Hansen
  2021-07-30 14:59 ` [PATCH v9 09/26] x86/fpu/xstate: Update the XSTATE save function to support dynamic states Chang S. Bae
                   ` (17 subsequent siblings)
  25 siblings, 2 replies; 91+ messages in thread
From: Chang S. Bae @ 2021-07-30 14:59 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae

The static XSTATE per-task buffer contains the extended register states --
but it is not expandable at runtime. Introduce runtime methods and a new
fpu struct field to support the expansion.

fpu->state_mask indicates which state components are reserved to be
saved in the XSTATE buffer.

alloc_xstate_buffer() uses vzalloc(). If use of this mechanism grows to
allocate buffers larger than 64KB, a more sophisticated allocation scheme
that includes purpose-built reclaim capability might be justified.

Introduce a new helper, get_xstate_size(), to calculate the buffer size.

Also, use the new field and helper to initialize the buffer.
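
The expected usage looks roughly like this (a sketch only; the call site is
wired up by the #NM-handler patch later in this series, and the tile-data
mask name comes from the AMX-enablement patches):

	/* Grow the task's buffer to also cover the faulting component: */
	if (alloc_xstate_buffer(fpu, XFEATURE_MASK_XTILE_DATA))
		return -ENOMEM;	/* fpu->state is left unchanged on failure */

	/* On success, fpu->state and fpu->state_mask cover the new state. */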

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v5:
* Ensured the XSAVES format for current in fpu_reset_fpstate(), for the
  new base code.

Changes from v3:
* Updated code comments. (Borislav Petkov)
* Used vzalloc() instead of vmalloc() with memset(). (Borislav Petkov)
* Removed the max size check for >64KB. (Borislav Petkov)
* Removed the allocation size check in the helper. (Borislav Petkov)
* Switched the function description in the kernel-doc style.
* Used them for buffer initialization -- moved from the next patch.

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
* Replaced 'area' with 'buffer' in the comments and the changelog.
* Updated the code comments.

Changes from v1:
* Removed unneeded interrupt masking (Andy Lutomirski)
* Added vmalloc() error tracing (Dave Hansen, PeterZ, and Andy Lutomirski)
---
 arch/x86/include/asm/fpu/types.h  |   8 ++
 arch/x86/include/asm/fpu/xstate.h |   3 +
 arch/x86/include/asm/trace/fpu.h  |   5 ++
 arch/x86/kernel/fpu/core.c        |  18 +++--
 arch/x86/kernel/fpu/xstate.c      | 127 ++++++++++++++++++++++++++++++
 5 files changed, 154 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index c7826708f27f..c0192e16cadb 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -336,6 +336,14 @@ struct fpu {
 	 */
 	unsigned long			avx512_timestamp;
 
+	/*
+	 * @state_mask:
+	 *
+	 * The bitmap represents state components reserved to be saved in
+	 * ->state.
+	 */
+	u64				state_mask;
+
 	/*
 	 * @state:
 	 *
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index d722e774a9f9..45735441fbe8 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -146,6 +146,9 @@ extern unsigned int get_xstate_config(enum xstate_config cfg);
 void set_xstate_config(enum xstate_config cfg, unsigned int value);
 
 void *get_xsave_addr(struct fpu *fpu, int xfeature_nr);
+unsigned int get_xstate_size(u64 mask);
+int alloc_xstate_buffer(struct fpu *fpu, u64 mask);
+void free_xstate_buffer(struct fpu *fpu);
 int xfeature_size(int xfeature_nr);
 int copy_uabi_from_kernel_to_xstate(struct fpu *fpu, const void *kbuf);
 int copy_sigframe_from_user_to_xstate(struct fpu *fpu, const void __user *ubuf);
diff --git a/arch/x86/include/asm/trace/fpu.h b/arch/x86/include/asm/trace/fpu.h
index ef82f4824ce7..b691c2db47c7 100644
--- a/arch/x86/include/asm/trace/fpu.h
+++ b/arch/x86/include/asm/trace/fpu.h
@@ -89,6 +89,11 @@ DEFINE_EVENT(x86_fpu, x86_fpu_xstate_check_failed,
 	TP_ARGS(fpu)
 );
 
+DEFINE_EVENT(x86_fpu, x86_fpu_xstate_alloc_failed,
+	TP_PROTO(struct fpu *fpu),
+	TP_ARGS(fpu)
+);
+
 #undef TRACE_INCLUDE_PATH
 #define TRACE_INCLUDE_PATH asm/trace/
 #undef TRACE_INCLUDE_FILE
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 6390562516c9..16abc0357e2e 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -236,9 +236,8 @@ void fpstate_init(struct fpu *fpu)
 
 	if (likely(fpu)) {
 		state = fpu->state;
-		/* The dynamic user states are not prepared yet. */
-		mask = xfeatures_mask_all & ~xfeatures_mask_user_dynamic;
-		size = get_xstate_config(XSTATE_MIN_SIZE);
+		mask = fpu->state_mask;
+		size = get_xstate_size(fpu->state_mask);
 	} else {
 		state = &init_fpstate;
 		mask = xfeatures_mask_all;
@@ -274,14 +273,16 @@ int fpu_clone(struct task_struct *dst)
 	if (!cpu_feature_enabled(X86_FEATURE_FPU))
 		return 0;
 
+	/*
+	 * The child does not inherit the dynamic states. Thus, use the
+	 * buffer embedded in struct task_struct, which has the minimum
+	 * size.
+	 */
+	dst_fpu->state_mask = (xfeatures_mask_all & ~xfeatures_mask_user_dynamic);
 	dst_fpu->state = &dst_fpu->__default_state;
-
 	/*
 	 * Don't let 'init optimized' areas of the XSAVE area
 	 * leak into the child task:
-	 *
-	 * The child does not inherit the dynamic states. So,
-	 * the xstate buffer has the minimum size.
 	 */
 	memset(&dst_fpu->state->xsave, 0, get_xstate_config(XSTATE_MIN_SIZE));
 
@@ -380,6 +381,9 @@ static void fpu_reset_fpstate(void)
 	 * flush_thread().
 	 */
 	memcpy(fpu->state, &init_fpstate, init_fpstate_copy_size());
+	/* Adjust the xstate buffer format for current. */
+	if (boot_cpu_has(X86_FEATURE_XSAVES))
+		fpstate_init_xstate(&fpu->state->xsave, fpu->state_mask);
 	set_thread_flag(TIF_NEED_FPU_LOAD);
 	fpregs_unlock();
 }
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 5f58dca4c6b7..26f6d5e0f1ed 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -10,6 +10,7 @@
 #include <linux/pkeys.h>
 #include <linux/seq_file.h>
 #include <linux/proc_fs.h>
+#include <linux/vmalloc.h>
 
 #include <asm/fpu/api.h>
 #include <asm/fpu/internal.h>
@@ -19,6 +20,7 @@
 
 #include <asm/tlbflush.h>
 #include <asm/cpufeature.h>
+#include <asm/trace/fpu.h>
 
 /*
  * Although we spell it out in here, the Processor Trace
@@ -76,6 +78,12 @@ static unsigned int xstate_comp_offsets[XFEATURE_MAX] __ro_after_init =
 	{ [ 0 ... XFEATURE_MAX - 1] = -1};
 static unsigned int xstate_supervisor_only_offsets[XFEATURE_MAX] __ro_after_init =
 	{ [ 0 ... XFEATURE_MAX - 1] = -1};
+/*
+ * True if the corresponding XFEATURE's state starts on the next 64-byte
+ * boundary. Otherwise, it follows the preceding component immediately.
+ */
+static bool xstate_aligns[XFEATURE_MAX] __ro_after_init =
+	{ [ 0 ... XFEATURE_MAX - 1] = false};
 
 /**
  * struct fpu_xstate_buffer_config - xstate buffer configuration
@@ -174,6 +182,55 @@ static bool xfeature_is_supervisor(int xfeature_nr)
 	return ecx & 1;
 }
 
+/**
+ * get_xstate_size - Calculate an xstate buffer size
+ * @mask:	This bitmap tells which components are reserved in the buffer.
+ *
+ * Available once the offset, size, and alignment info arrays have been
+ * set up by setup_xstate_features().
+ *
+ * Returns:	The buffer size
+ */
+unsigned int get_xstate_size(u64 mask)
+{
+	unsigned int size;
+	int i, nr;
+
+	if (!mask)
+		return 0;
+
+	/*
+	 * The minimum buffer size excludes the dynamic user state. When a
+	 * task uses the state, the buffer can grow up to the max size.
+	 */
+	if (mask == (xfeatures_mask_all & ~xfeatures_mask_user_dynamic))
+		return get_xstate_config(XSTATE_MIN_SIZE);
+	else if (mask == xfeatures_mask_all)
+		return get_xstate_config(XSTATE_MAX_SIZE);
+
+	nr = fls64(mask) - 1;
+
+	if (!boot_cpu_has(X86_FEATURE_XSAVES))
+		return xstate_offsets[nr] + xstate_sizes[nr];
+
+	if ((xfeatures_mask_all & (BIT_ULL(nr + 1) - 1)) == mask)
+		return xstate_comp_offsets[nr] + xstate_sizes[nr];
+
+	/*
+	 * With the given mask, no relevant size is found so far. So,
+	 * calculate it by summing up each state size.
+	 */
+	for (size = FXSAVE_SIZE + XSAVE_HDR_SIZE, i = FIRST_EXTENDED_XFEATURE; i <= nr; i++) {
+		if (!(mask & BIT_ULL(i)))
+			continue;
+
+		if (xstate_aligns[i])
+			size = ALIGN(size, 64);
+		size += xstate_sizes[i];
+	}
+	return size;
+}
+
 /*
  * Enable the extended processor state save/restore feature.
  * Called once per CPU onlining.
@@ -224,10 +281,12 @@ static void __init setup_xstate_features(void)
 	xstate_offsets[XFEATURE_FP]	= 0;
 	xstate_sizes[XFEATURE_FP]	= offsetof(struct fxregs_state,
 						   xmm_space);
+	xstate_aligns[XFEATURE_FP]	= true;
 
 	xstate_offsets[XFEATURE_SSE]	= xstate_sizes[XFEATURE_FP];
 	xstate_sizes[XFEATURE_SSE]	= sizeof_field(struct fxregs_state,
 						       xmm_space);
+	xstate_aligns[XFEATURE_SSE]	= true;
 
 	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
 		if (!xfeature_enabled(i))
@@ -245,6 +304,7 @@ static void __init setup_xstate_features(void)
 			continue;
 
 		xstate_offsets[i] = ebx;
+		xstate_aligns[i] = (ecx & 2) ? true : false;
 
 		/*
 		 * In our xstate size checks, we assume that the highest-numbered
@@ -848,6 +908,9 @@ void __init fpu__init_system_xstate(void)
 	if (err)
 		goto out_disable;
 
+	/* Make sure init_task does not include the dynamic user states. */
+	current->thread.fpu.state_mask = (xfeatures_mask_all & ~xfeatures_mask_user_dynamic);
+
 	/*
 	 * Update info used for ptrace frames; use standard-format size and no
 	 * supervisor xstates:
@@ -1038,6 +1101,70 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 }
 #endif /* ! CONFIG_ARCH_HAS_PKEYS */
 
+void free_xstate_buffer(struct fpu *fpu)
+{
+	/* Free up only the dynamically-allocated memory. */
+	if (fpu->state != &fpu->__default_state)
+		vfree(fpu->state);
+}
+
+/**
+ * alloc_xstate_buffer - Allocate a buffer with the size calculated from
+ *			 @mask.
+ *
+ * @fpu:	A struct fpu * pointer
+ * @mask:	The bitmap tells which components are to be reserved in the
+ *		new buffer.
+ *
+ * Simply use vzalloc() here. If tasks with a vmalloc()-allocated buffer
+ * tend to terminate quickly, vfree()-induced IPIs may become a concern;
+ * caching could help with that. But a task with a large state is likely
+ * to live longer.
+ *
+ * Also, this method does not shrink or reclaim the buffer.
+ *
+ * Returns 0 on success, -ENOMEM on allocation error.
+ */
+int alloc_xstate_buffer(struct fpu *fpu, u64 mask)
+{
+	union fpregs_state *state;
+	unsigned int oldsz, newsz;
+	u64 state_mask;
+
+	state_mask = fpu->state_mask | mask;
+
+	oldsz = get_xstate_size(fpu->state_mask);
+	newsz = get_xstate_size(state_mask);
+
+	if (oldsz >= newsz)
+		return 0;
+
+	state = vzalloc(newsz);
+	if (!state) {
+		/*
+		 * When allocation requested from #NM, the error code may
+		 * not be populated well. Then, this tracepoint is useful
+		 * for providing the failure context.
+		 */
+		trace_x86_fpu_xstate_alloc_failed(fpu);
+		return -ENOMEM;
+	}
+
+	if (boot_cpu_has(X86_FEATURE_XSAVES))
+		fpstate_init_xstate(&state->xsave, state_mask);
+
+	/*
+	 * As long as the register state is intact, save the xstate in the
+	 * new buffer at the next context copy/switch, or on a potential
+	 * ptrace-driven xstate write.
+	 */
+
+	free_xstate_buffer(fpu);
+	fpu->state = state;
+	fpu->state_mask = state_mask;
+	return 0;
+}
+
 static void copy_feature(bool from_xstate, struct membuf *to, void *xstate,
 			 void *init_xstate, unsigned int size)
 {
-- 
2.17.1


* [PATCH v9 09/26] x86/fpu/xstate: Update the XSTATE save function to support dynamic states
  2021-07-30 14:59 [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (7 preceding siblings ...)
  2021-07-30 14:59 ` [PATCH v9 08/26] x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer dynamically Chang S. Bae
@ 2021-07-30 14:59 ` Chang S. Bae
  2021-07-30 14:59 ` [PATCH v9 10/26] x86/fpu/xstate: Update the XSTATE buffer address finder " Chang S. Bae
                   ` (16 subsequent siblings)
  25 siblings, 0 replies; 91+ messages in thread
From: Chang S. Bae @ 2021-07-30 14:59 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae, kvm

Extend os_xsave() to receive a mask argument of which states to save, in
preparation for dynamic user state handling.
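
For context, XSAVES stores only the components set in both the enabled
feature set and the instruction's requested-feature bitmap, so passing the
per-task mask keeps the write within a minimum-sized buffer (an
illustrative call):

	/* Only components in (XCR0 | IA32_XSS) & mask get written: */
	os_xsave(&fpu->state->xsave, fpu->state_mask);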

Update KVM to set a valid fpu->state_mask, so it can continue to share the
save/restore code with the core kernel.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
---
Changes from v5:
* Adjusted the changelog and code for the new base code.

Changes from v3:
* Updated the changelog. (Borislav Petkov)
* Made the code change more reviewable.

Changes from v2:
* Updated the changelog to clarify the KVM code changes.
---
 arch/x86/include/asm/fpu/internal.h | 3 +--
 arch/x86/kernel/fpu/core.c          | 2 +-
 arch/x86/kernel/fpu/signal.c        | 2 +-
 arch/x86/kvm/x86.c                  | 9 +++++++--
 4 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index d2fc19c0e457..263e349ff85a 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -298,9 +298,8 @@ static inline void os_xrstor_booting(struct xregs_state *xstate)
  * Uses either XSAVE or XSAVEOPT or XSAVES depending on the CPU features
  * and command line options. The choice is permanent until the next reboot.
  */
-static inline void os_xsave(struct xregs_state *xstate)
+static inline void os_xsave(struct xregs_state *xstate, u64 mask)
 {
-	u64 mask = xfeatures_mask_all;
 	u32 lmask = mask;
 	u32 hmask = mask >> 32;
 	int err;
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 16abc0357e2e..541628bfc8c0 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -99,7 +99,7 @@ EXPORT_SYMBOL(irq_fpu_usable);
 void save_fpregs_to_fpstate(struct fpu *fpu)
 {
 	if (likely(use_xsave())) {
-		os_xsave(&fpu->state->xsave);
+		os_xsave(&fpu->state->xsave, fpu->state_mask);
 
 		/*
 		 * AVX512 state is tracked here because its use is
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 2f35aada2007..f70f84d53442 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -365,7 +365,7 @@ static int __fpu_restore_sig(void __user *buf, void __user *buf_fx,
 		 * the right place in memory. It's ia32 mode. Shrug.
 		 */
 		if (xfeatures_mask_supervisor())
-			os_xsave(&fpu->state->xsave);
+			os_xsave(&fpu->state->xsave, fpu->state_mask);
 		set_thread_flag(TIF_NEED_FPU_LOAD);
 	}
 	__fpu_invalidate_fpregs_state(fpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 18a337f99459..97b68c6cacd2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9890,11 +9890,16 @@ static void kvm_save_current_fpu(struct fpu *fpu)
 	 * KVM does not support dynamic user states yet. Assume the buffer
 	 * always has the minimum size.
 	 */
-	if (test_thread_flag(TIF_NEED_FPU_LOAD))
+	if (test_thread_flag(TIF_NEED_FPU_LOAD)) {
 		memcpy(fpu->state, current->thread.fpu.state,
 		       get_xstate_config(XSTATE_MIN_SIZE));
-	else
+	} else {
+		struct fpu *src_fpu = &current->thread.fpu;
+
+		if (fpu->state_mask != src_fpu->state_mask)
+			fpu->state_mask = src_fpu->state_mask;
 		save_fpregs_to_fpstate(fpu);
+	}
 }
 
 /* Swap (qemu) user FPU context for the guest FPU context. */
-- 
2.17.1


* [PATCH v9 10/26] x86/fpu/xstate: Update the XSTATE buffer address finder to support dynamic states
  2021-07-30 14:59 [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (8 preceding siblings ...)
  2021-07-30 14:59 ` [PATCH v9 09/26] x86/fpu/xstate: Update the XSTATE save function to support dynamic states Chang S. Bae
@ 2021-07-30 14:59 ` Chang S. Bae
  2021-08-18 11:33   ` Borislav Petkov
  2021-07-30 14:59 ` [PATCH v9 11/26] x86/fpu/xstate: Update the XSTATE context copy function " Chang S. Bae
                   ` (15 subsequent siblings)
  25 siblings, 1 reply; 91+ messages in thread
From: Chang S. Bae @ 2021-07-30 14:59 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae

__raw_xsave_addr() returns a pointer to the requested component in an
XSTATE buffer by simply looking it up in the offset table. The offsets used
to be fixed, but with dynamic user states they become variable.

get_xstate_size() already contains a routine that finds an offset at
runtime. Factor it out and reuse it in the address finder.
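
A worked example with made-up numbers: in the compacted format, the legacy
area plus the XSAVE header occupy 512 + 64 = 576 bytes. If the buffer's
mask then contains an 8-byte component followed by a 64-byte-aligned one,
the aligned component starts at ALIGN(576 + 8, 64) = 640 rather than at its
fixed offset in the full-feature layout; hence the runtime lookup.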

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v5:
* Updated for the future-proofed __raw_xsave_addr().

Changes from v3:
* Added the function description in the kernel-doc style. (Borislav Petkov)
* Removed 'no functional change' in the changelog. (Borislav Petkov)
---
 arch/x86/kernel/fpu/xstate.c | 78 ++++++++++++++++++++++++------------
 1 file changed, 53 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 26f6d5e0f1ed..98ab10e4da3b 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -182,6 +182,38 @@ static bool xfeature_is_supervisor(int xfeature_nr)
 	return ecx & 1;
 }
 
+/**
+ * get_xstate_comp_offset - Find the feature's offset in the compacted
+ *			    format.
+ * @mask:	This bitmap tells which components are reserved in the format.
+ * @feature_nr:	The feature number
+ *
+ * Returns:	The offset value
+ */
+static unsigned int get_xstate_comp_offset(u64 mask, int feature_nr)
+{
+	u64 xmask = BIT_ULL(feature_nr + 1) - 1;
+	unsigned int next_offset, offset = 0;
+	int i;
+
+	if ((xfeatures_mask_all & xmask) == (mask & xmask))
+		return xstate_comp_offsets[feature_nr];
+
+	/*
+	 * With the given mask, no precomputed offset applies. Calculate it
+	 * by summing up the size of each preceding state.
+	 */
+	for (next_offset = FXSAVE_SIZE + XSAVE_HDR_SIZE, i = FIRST_EXTENDED_XFEATURE;
+	     i <= feature_nr; i++) {
+		if (!(mask & BIT_ULL(i)))
+			continue;
+
+		offset = xstate_aligns[i] ? ALIGN(next_offset, 64) : next_offset;
+		next_offset += xstate_sizes[i];
+	}
+	return offset;
+}
+
 /**
  * get_xstate_size - Calculate an xstate buffer size
  * @mask:	This bitmap tells which components are reserved in the buffer.
@@ -193,8 +225,8 @@ static bool xfeature_is_supervisor(int xfeature_nr)
  */
 unsigned int get_xstate_size(u64 mask)
 {
-	unsigned int size;
-	int i, nr;
+	unsigned int offset;
+	int nr;
 
 	if (!mask)
 		return 0;
@@ -213,22 +245,8 @@ unsigned int get_xstate_size(u64 mask)
 	if (!boot_cpu_has(X86_FEATURE_XSAVES))
 		return xstate_offsets[nr] + xstate_sizes[nr];
 
-	if ((xfeatures_mask_all & (BIT_ULL(nr + 1) - 1)) == mask)
-		return xstate_comp_offsets[nr] + xstate_sizes[nr];
-
-	/*
-	 * With the given mask, no relevant size is found so far. So,
-	 * calculate it by summing up each state size.
-	 */
-	for (size = FXSAVE_SIZE + XSAVE_HDR_SIZE, i = FIRST_EXTENDED_XFEATURE; i <= nr; i++) {
-		if (!(mask & BIT_ULL(i)))
-			continue;
-
-		if (xstate_aligns[i])
-			size = ALIGN(size, 64);
-		size += xstate_sizes[i];
-	}
-	return size;
+	offset = get_xstate_comp_offset(mask, nr);
+	return offset + xstate_sizes[nr];
 }
 
 /*
@@ -980,19 +998,29 @@ void fpu__resume_cpu(void)
  */
 static void *__raw_xsave_addr(struct fpu *fpu, int xfeature_nr)
 {
+	unsigned int offset;
 	void *xsave;
 
 	if (!xfeature_enabled(xfeature_nr)) {
-		WARN_ON_FPU(1);
-		return NULL;
-	}
+		goto not_found;
+	} else if (!fpu) {
+		xsave = &init_fpstate.xsave;
 
-	if (fpu)
+		offset = get_xstate_comp_offset(xfeatures_mask_all, xfeature_nr);
+		if (offset > sizeof(init_fpstate))
+			goto not_found;
+	} else if (!(fpu->state_mask & BIT_ULL(xfeature_nr))) {
+		goto not_found;
+	} else {
 		xsave = &fpu->state->xsave;
-	else
-		xsave = &init_fpstate.xsave;
+		offset = get_xstate_comp_offset(fpu->state_mask, xfeature_nr);
+	}
+
+	return xsave + offset;
 
-	return xsave + xstate_comp_offsets[xfeature_nr];
+not_found:
+	WARN_ON_FPU(1);
+	return NULL;
 }
 /*
  * Given the xsave area and a state inside, this function returns the
-- 
2.17.1


* [PATCH v9 11/26] x86/fpu/xstate: Update the XSTATE context copy function to support dynamic states
  2021-07-30 14:59 [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (9 preceding siblings ...)
  2021-07-30 14:59 ` [PATCH v9 10/26] x86/fpu/xstate: Update the XSTATE buffer address finder " Chang S. Bae
@ 2021-07-30 14:59 ` Chang S. Bae
  2021-08-18 12:03   ` Borislav Petkov
  2021-07-30 14:59 ` [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state Chang S. Bae
                   ` (14 subsequent siblings)
  25 siblings, 1 reply; 91+ messages in thread
From: Chang S. Bae @ 2021-07-30 14:59 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae

ptrace() and signal return paths use XSTATE context copy functions. They
allow callers to read (or write) XSTATE values in the target's buffer. With
dynamic user states, a component's position in the buffer may vary and the
init fpstate is not always large enough to cover all the states.

Adjust the helpers to find a component's offset correctly. Also, update the
copy loop in the ptrace read path to support dynamic states.
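
For context, the read path in question is the one user space reaches via
PTRACE_GETREGSET with the xstate register set. A minimal user-space sketch
(error handling omitted):

	struct iovec iov = {
		.iov_base = xsave_buf,		/* caller-provided buffer */
		.iov_len  = xsave_buf_size,
	};

	/* Returns the tracee's xstate in non-compacted XSAVE format: */
	ptrace(PTRACE_GETREGSET, pid, NT_X86_XSTATE, &iov);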

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v5:
* Updated to ensure xstate_bv is aligned with the target.
* Rewrote the xstate copy loop for the ptrace() read path in open-coded
  form.
* Adjusted the changelog.

Changes from v3:
* Cleaned up the code change with more comments.
* Removed 'no functional change' in the changelog. (Borislav Petkov)

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
---
 arch/x86/kernel/fpu/xstate.c | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 98ab10e4da3b..3b56e7612c45 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1273,6 +1273,7 @@ void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
 	zerofrom = offsetof(struct xregs_state, extended_state_area);
 
 	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
+		u64 mask = BIT_ULL(i);
 		/*
 		 * The ptrace buffer is in non-compacted XSAVE format.
 		 * In non-compacted format disabled features still occupy
@@ -1280,7 +1281,7 @@ void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
 		 * compacted init_fpstate. The gap tracking will zero this
 		 * later.
 		 */
-		if (!(xfeatures_mask_uabi() & BIT_ULL(i)))
+		if (!(xfeatures_mask_uabi() & mask))
 			continue;
 
 		/*
@@ -1300,10 +1301,24 @@ void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
 			pkru.pkru = tsk->thread.pkru;
 			membuf_write(&to, &pkru, sizeof(pkru));
 		} else {
-			copy_feature(header.xfeatures & BIT_ULL(i), &to,
-				     __raw_xsave_addr(&tsk->thread.fpu, i),
-				     __raw_xsave_addr(NULL, i),
-				     xstate_sizes[i]);
+			unsigned int size = xstate_sizes[i];
+			void *from = NULL;
+
+			/*
+			 * Copy the xstate if available. Otherwise, copy the
+			 * non-zero init states for legacy states (FP and
+			 * SSE) or fill zeros.
+			 */
+
+			if (header.xfeatures & mask)
+				from = __raw_xsave_addr(&tsk->thread.fpu, i);
+			else if (XFEATURE_MASK_FPSSE & mask)
+				from = __raw_xsave_addr(NULL, i);
+
+			if (from)
+				membuf_write(&to, from, size);
+			else
+				membuf_zero(&to, size);
 		}
 		/*
 		 * Keep track of the last copied state in the non-compacted
@@ -1345,6 +1360,8 @@ static int copy_uabi_to_xstate(struct fpu *fpu, const void *kbuf,
 	if (validate_user_xstate_header(&hdr))
 		return -EINVAL;
 
+	hdr.xfeatures &= fpu->state_mask;
+
 	/* Validate MXCSR when any of the related features is in use */
 	mask = XFEATURE_MASK_FP | XFEATURE_MASK_SSE | XFEATURE_MASK_YMM;
 	if (hdr.xfeatures & mask) {
@@ -1371,6 +1388,9 @@ static int copy_uabi_to_xstate(struct fpu *fpu, const void *kbuf,
 		if (hdr.xfeatures & mask) {
 			void *dst = __raw_xsave_addr(fpu, i);
 
+			if (!dst)
+				continue;
+
 			offset = xstate_offsets[i];
 			size = xstate_sizes[i];
 
-- 
2.17.1


* [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-07-30 14:59 [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (10 preceding siblings ...)
  2021-07-30 14:59 ` [PATCH v9 11/26] x86/fpu/xstate: Update the XSTATE context copy function " Chang S. Bae
@ 2021-07-30 14:59 ` Chang S. Bae
  2021-08-18 16:24   ` Borislav Petkov
  2021-07-30 14:59 ` [PATCH v9 13/26] x86/fpu/xstate: Support ptracer-induced XSTATE buffer expansion Chang S. Bae
                   ` (13 subsequent siblings)
  25 siblings, 1 reply; 91+ messages in thread
From: Chang S. Bae @ 2021-07-30 14:59 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae

Intel's Extended Feature Disable (XFD) feature is an extension of the XSAVE
architecture. XFD allows the kernel to enable a feature state in XCR0 and
to receive a #NM trap when a task uses instructions accessing that state.
In this way, Linux can defer allocating the large XSAVE buffer until tasks
need it.

XFD introduces two MSRs: IA32_XFD to enable/disable the feature and
IA32_XFD_ERR to assist the #NM trap handler. Both use the same
xstate-component bitmap format as XCR0.

Use this hardware capability to expand the XSTATE buffer only when it is
actually needed: the #NM handler performs the buffer expansion.

Introduce helper functions:
    xfd_write()   - write IA32_XFD MSR
    xfd_read()    - read IA32_XFD MSR
    xfd_switch()  - switch IA32_XFD MSR on context switch
    xfd_capable() - return the mask of XFD-capable xfeatures

In the event of vzalloc() failure, send SIGILL with si_code ILL_ILLOPC.
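
For illustration only (not part of this patch), the arming rule reduces to
a small runnable sketch; the mask values below are hypothetical stand-ins
for xfd_capable() and fpu->state_mask:

    /* A set bit in IA32_XFD disables that component and arms #NM. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t xfd_capable = 1ULL << 18; /* assume XTILEDATA only */
        uint64_t state_mask = 0x7;         /* FP | SSE | YMM allocated */
        uint64_t xfd;

        /* Armed: XFD-capable features with no buffer allocated yet. */
        xfd = xfd_capable ^ (state_mask & xfd_capable);
        printf("IA32_XFD = %#llx\n", (unsigned long long)xfd); /* 0x40000 */

        /* After the #NM handler expands the buffer: */
        state_mask |= xfd_capable;
        xfd = xfd_capable ^ (state_mask & xfd_capable);
        printf("IA32_XFD = %#llx\n", (unsigned long long)xfd); /* 0x0 */
        return 0;
    }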

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v7:
* Update #NM handler to raise SIGILL rather than SIGSEGV. (Thiago
  Macieira)

Changes from v6:
* Update the #NM handler a little bit.
* Clean up the code comment.

Changes from v5:
* Excluded the access request check here and included the buffer allocation
  again in the #NM handler. The access request will be dealt with in the
  next patch.
* Updated the title. (Dave Hansen)
* Updated the code comment.

Changes from v4:
* Changed to use XFD to support the access request policy. Updated #NM
  handler to raise a signal instead of buffer allocation.
* Decoupled XFD from the use of XSAVE compacted format.
* Updated helper functions.
* Updated function descriptions in a proper format.
* Updated some code comments.

Changes from v3:
* Removed 'no functional change' in the changelog. (Borislav Petkov)

Changes from v2:
* Changed to enable XFD only when the compacted format is used.
* Updated the changelog with task->fpu removed. (Borislav Petkov)

Changes from v1:
* Inlined the XFD-induced #NM handling code (Andy Lutomirski)
---
 arch/x86/include/asm/cpufeatures.h  |  1 +
 arch/x86/include/asm/fpu/internal.h | 45 +++++++++++++++++++++++++++--
 arch/x86/include/asm/msr-index.h    |  2 ++
 arch/x86/kernel/cpu/cpuid-deps.c    |  1 +
 arch/x86/kernel/fpu/xstate.c        | 44 ++++++++++++++++++++++++++--
 arch/x86/kernel/process.c           |  6 ++++
 arch/x86/kernel/process_32.c        |  2 +-
 arch/x86/kernel/process_64.c        |  2 +-
 arch/x86/kernel/traps.c             | 39 +++++++++++++++++++++++++
 9 files changed, 136 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d0ce5cfd3ac1..37150b7a8e44 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -277,6 +277,7 @@
 #define X86_FEATURE_XSAVEC		(10*32+ 1) /* XSAVEC instruction */
 #define X86_FEATURE_XGETBV1		(10*32+ 2) /* XGETBV with ECX = 1 instruction */
 #define X86_FEATURE_XSAVES		(10*32+ 3) /* XSAVES/XRSTORS instructions */
+#define X86_FEATURE_XFD			(10*32+ 4) /* eXtended Feature Disabling */
 
 /*
  * Extended auxiliary flags: Linux defined - for features scattered in various
diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 263e349ff85a..e3590cf55325 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -535,14 +535,55 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
  * Misc helper functions:
  */
 
+/* The Extended Feature Disable (XFD) helpers: */
+
+static inline void xfd_write(u64 value)
+{
+	wrmsrl_safe(MSR_IA32_XFD, value);
+}
+
+static inline u64 xfd_read(void)
+{
+	u64 value;
+
+	rdmsrl_safe(MSR_IA32_XFD, &value);
+	return value;
+}
+
+static inline u64 xfd_capable(void)
+{
+	return xfeatures_mask_user_dynamic;
+}
+
+/**
+ * xfd_switch - Switches the MSR IA32_XFD context if needed.
+ * @prev:	The previous task's struct fpu pointer
+ * @next:	The next task's struct fpu pointer
+ */
+static inline void xfd_switch(struct fpu *prev, struct fpu *next)
+{
+	u64 prev_xfd_mask, next_xfd_mask;
+
+	if (!static_cpu_has(X86_FEATURE_XFD) || !xfd_capable())
+		return;
+
+	prev_xfd_mask = prev->state_mask & xfd_capable();
+	next_xfd_mask = next->state_mask & xfd_capable();
+
+	if (unlikely(prev_xfd_mask != next_xfd_mask))
+		xfd_write(xfd_capable() ^ next_xfd_mask);
+}
+
 /*
  * Delay loading of the complete FPU state until the return to userland.
  * PKRU is handled separately.
  */
-static inline void switch_fpu_finish(struct fpu *new_fpu)
+static inline void switch_fpu_finish(struct fpu *old_fpu, struct fpu *new_fpu)
 {
-	if (cpu_feature_enabled(X86_FEATURE_FPU))
+	if (cpu_feature_enabled(X86_FEATURE_FPU)) {
 		set_thread_flag(TIF_NEED_FPU_LOAD);
+		xfd_switch(old_fpu, new_fpu);
+	}
 }
 
 #endif /* _ASM_X86_FPU_INTERNAL_H */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a7c413432b33..eac0cfd9210b 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -626,6 +626,8 @@
 #define MSR_IA32_BNDCFGS_RSVD		0x00000ffc
 
 #define MSR_IA32_XSS			0x00000da0
+#define MSR_IA32_XFD			0x000001c4
+#define MSR_IA32_XFD_ERR		0x000001c5
 
 #define MSR_IA32_APICBASE		0x0000001b
 #define MSR_IA32_APICBASE_BSP		(1<<8)
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index defda61f372d..7f891d2eb52e 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -75,6 +75,7 @@ static const struct cpuid_dep cpuid_deps[] = {
 	{ X86_FEATURE_SGX_LC,			X86_FEATURE_SGX	      },
 	{ X86_FEATURE_SGX1,			X86_FEATURE_SGX       },
 	{ X86_FEATURE_SGX2,			X86_FEATURE_SGX1      },
+	{ X86_FEATURE_XFD,			X86_FEATURE_XSAVE     },
 	{}
 };
 
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 3b56e7612c45..c6ff0575d87d 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -182,6 +182,26 @@ static bool xfeature_is_supervisor(int xfeature_nr)
 	return ecx & 1;
 }
 
+/**
+ * xfd_supported - Check if the feature supports Extended Feature Disable (XFD).
+ * @feature_nr:	The feature number.
+ *
+ * Returns:	True if supported; otherwise, false.
+ */
+static bool xfd_supported(int feature_nr)
+{
+	u32 eax, ebx, ecx, edx;
+
+	if (!boot_cpu_has(X86_FEATURE_XFD))
+		return false;
+
+	/*
+	 * If state component 'i' supports XFD, ECX[2] returns 1; otherwise, 0.
+	 */
+	cpuid_count(XSTATE_CPUID, feature_nr, &eax, &ebx, &ecx, &edx);
+	return ecx & 4;
+}
+
 /**
  * get_xstate_comp_offset - Find the feature's offset in the compacted
  *			    format.
@@ -274,6 +294,9 @@ void fpu__init_cpu_xstate(void)
 		wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() |
 				     xfeatures_mask_independent());
 	}
+
+	if (boot_cpu_has(X86_FEATURE_XFD))
+		xfd_write(xfd_capable());
 }
 
 static bool xfeature_enabled(enum xfeature xfeature)
@@ -473,8 +496,9 @@ static void __init print_xstate_offset_size(void)
 	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
 		if (!xfeature_enabled(i))
 			continue;
-		pr_info("x86/fpu: xstate_offset[%d]: %4d, xstate_sizes[%d]: %4d\n",
-			 i, xstate_comp_offsets[i], i, xstate_sizes[i]);
+		pr_info("x86/fpu: xstate_offset[%d]: %4d, xstate_sizes[%d]: %4d (%s)\n",
+			i, xstate_comp_offsets[i], i, xstate_sizes[i],
+			(xfeatures_mask_user_dynamic & BIT_ULL(i)) ? "dynamic" : "default");
 	}
 }
 
@@ -920,6 +944,16 @@ void __init fpu__init_system_xstate(void)
 	/* Do not support the dynamically allocated buffer yet. */
 	xfeatures_mask_user_dynamic = 0;
 
+	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
+		u64 feature_mask = BIT_ULL(i);
+
+		if (!(xfeatures_mask_uabi() & feature_mask))
+			continue;
+
+		if (xfd_supported(i))
+			xfeatures_mask_user_dynamic |= feature_mask;
+	}
+
 	/* Enable xstate instructions to be able to continue with initialization: */
 	fpu__init_cpu_xstate();
 	err = init_xstate_size();
@@ -981,6 +1015,12 @@ void fpu__resume_cpu(void)
 		wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor()  |
 				     xfeatures_mask_independent());
 	}
+
+	if (boot_cpu_has(X86_FEATURE_XFD)) {
+		u64 fpu_xfd_mask = current->thread.fpu.state_mask & xfd_capable();
+
+		xfd_write(xfd_capable() ^ fpu_xfd_mask);
+	}
 }
 
 /**
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 534b9fb7e7ee..b85fa499f195 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -97,6 +97,12 @@ void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
 	*size = get_xstate_config(XSTATE_MIN_SIZE);
 }
 
+void arch_release_task_struct(struct task_struct *task)
+{
+	if (cpu_feature_enabled(X86_FEATURE_FPU))
+		free_xstate_buffer(&task->thread.fpu);
+}
+
 /*
  * Free thread data structures etc..
  */
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 4f2f54e1281c..7bd5d08eeb41 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -213,7 +213,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 
 	this_cpu_write(current_task, next_p);
 
-	switch_fpu_finish(next_fpu);
+	switch_fpu_finish(prev_fpu, next_fpu);
 
 	/* Load the Intel cache allocation PQR MSR. */
 	resctrl_sched_in();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index ec0d836a13b1..41c9855158d6 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -620,7 +620,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	this_cpu_write(current_task, next_p);
 	this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
 
-	switch_fpu_finish(next_fpu);
+	switch_fpu_finish(prev_fpu, next_fpu);
 
 	/* Reload sp0. */
 	update_task_stack(next_p);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a58800973aed..dd66d528afd8 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1112,6 +1112,45 @@ DEFINE_IDTENTRY(exc_device_not_available)
 {
 	unsigned long cr0 = read_cr0();
 
+	if (boot_cpu_has(X86_FEATURE_XFD)) {
+		u64 xfd_err;
+
+		rdmsrl_safe(MSR_IA32_XFD_ERR, &xfd_err);
+		wrmsrl_safe(MSR_IA32_XFD_ERR, 0);
+
+		if (xfd_err) {
+			u64 xfd_event = xfd_err & xfd_capable();
+
+			if (WARN_ON(!xfd_event)) {
+				/*
+				 * An unexpected event was raised. Update the XFD
+				 * state anyway to unblock the task.
+				 */
+				xfd_write(xfd_read() & ~xfd_err);
+			} else {
+				struct fpu *fpu = &current->thread.fpu;
+				int err = -1;
+
+				/*
+				 * Make sure not in interrupt context as handling a
+				 * trap from userspace.
+				 */
+				if (!WARN_ON(in_interrupt())) {
+					err = alloc_xstate_buffer(fpu, xfd_event);
+					if (!err)
+						xfd_write((fpu->state_mask & xfd_capable()) ^
+							  xfd_capable());
+				}
+
+				/* Raise a signal when it failed to handle. */
+				if (err)
+					force_sig_fault(SIGILL, ILL_ILLOPC,
+							error_get_trap_addr(regs));
+			}
+			return;
+		}
+	}
+
 #ifdef CONFIG_MATH_EMULATION
 	if (!boot_cpu_has(X86_FEATURE_FPU) && (cr0 & X86_CR0_EM)) {
 		struct math_emu_info info = { };
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v9 13/26] x86/fpu/xstate: Support ptracer-induced XSTATE buffer expansion
  2021-07-30 14:59 [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (11 preceding siblings ...)
  2021-07-30 14:59 ` [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state Chang S. Bae
@ 2021-07-30 14:59 ` Chang S. Bae
  2021-07-30 14:59 ` [PATCH v9 14/26] x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE Chang S. Bae
                   ` (12 subsequent siblings)
  25 siblings, 0 replies; 91+ messages in thread
From: Chang S. Bae @ 2021-07-30 14:59 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae

ptrace() may update XSTATE data before the target task has taken an XFD
fault and expanded the XSTATE buffer. Detect this case and allocate a
sufficient buffer to support the request. Also, disable the (now
unnecessary) associated first-use fault.
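
For context, a hedged userspace sketch of the ptracer-side write that can
now trigger this expansion, using the standard PTRACE_SETREGSET interface
with NT_X86_XSTATE (the buffer is a non-compacted XSAVE image):

    #include <elf.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    /* Inject an XSAVE image into a stopped tracee. */
    static long write_target_xstate(pid_t pid, void *xsave_buf, size_t size)
    {
        struct iovec iov = {
            .iov_base = xsave_buf,
            .iov_len  = size,
        };

        return ptrace(PTRACE_SETREGSET, pid, (void *)NT_X86_XSTATE, &iov);
    }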

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v5:
* Adjusted to use 'tmpbuf' for the new base code.

Changes from v4:
* Improved the condition check for the expansion.
* Simplified the XSTATE_BV retrieval.
* Updated the code comment.

Changes from v3:
* Removed 'no functional changes' in the changelog. (Borislav Petkov)

Changes from v2:
* Updated the changelog with task->fpu removed. (Borislav Petkov)
* Updated the code comments.
---
 arch/x86/kernel/fpu/regset.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 73d7d7b489fe..244e672c3e3d 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -163,6 +163,30 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
 		}
 	}
 
+	/*
+	 * When a ptracer attempts to write any dynamic user state in the
+	 * target buffer but not sufficiently allocated, it dynamically
+	 * expands the buffer.
+	 *
+	 * Check if the expansion is possibly needed.
+	 */
+	if (xfeatures_mask_user_dynamic &&
+	    ((fpu->state_mask & xfeatures_mask_user_dynamic) != xfeatures_mask_user_dynamic)) {
+		u64 state_mask;
+
+		/* Retrieve XSTATE_BV. */
+		memcpy(&state_mask, (kbuf ?: tmpbuf) + offsetof(struct xregs_state, header),
+		       sizeof(u64));
+
+		/* Expand the xstate buffer based on the XSTATE_BV. */
+		state_mask &= xfeatures_mask_user_dynamic;
+		if (state_mask) {
+			ret = alloc_xstate_buffer(fpu, state_mask);
+			if (ret)
+				goto out;
+		}
+	}
+
 	fpu_force_restore(fpu);
 	ret = copy_uabi_from_kernel_to_xstate(fpu, kbuf ?: tmpbuf);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v9 14/26] x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE
  2021-07-30 14:59 [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (12 preceding siblings ...)
  2021-07-30 14:59 ` [PATCH v9 13/26] x86/fpu/xstate: Support ptracer-induced XSTATE buffer expansion Chang S. Bae
@ 2021-07-30 14:59 ` Chang S. Bae
  2021-08-06 16:46   ` Thiago Macieira
  2021-07-30 14:59 ` [PATCH v9 15/26] x86/fpu/xstate: Support both legacy and expanded signal XSTATE size Chang S. Bae
                   ` (11 subsequent siblings)
  25 siblings, 1 reply; 91+ messages in thread
From: Chang S. Bae @ 2021-07-30 14:59 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae

arch_prctl(ARCH_SET_STATE_ENABLE, u64 bitmask)
    Some XSTATE features, such as AMX, are unavailable to applications
    until that process explicitly requests them via this call. Requests can
    be made for any number of valid user XSTATEs in a single call. This
    call is intended to be invoked very early in process initialization. A
    forked child inherits access, but permission is reset upon exec. There
    is no concept of un-requesting XSTATE access.
    Return codes:
        0: success (including repeated calls)
        EINVAL: a requested feature is not supported by the hardware
        EBUSY: failed to update all threads in the process (fork race)

arch_prctl(ARCH_GET_STATE_ENABLE, u64 *bitmask)
    Return the bitmask of permitted user XSTATE features. If XSAVE
    is disabled, the bitmask indicates only legacy states.

The permission is checked at every XSTATE buffer expansion, e.g. an
XFD-induced #NM event or a ptracer's XSTATE injection. When permission has
not been granted, inform userspace via SIGILL or an error code.

The notion of granted permission is broadcast to all threads in a process.
(This approach follows the PR_SET_FP_MODE prctl(2) implementation.)

Detect a fork race by aborting and returning -EBUSY if the number of
threads at the end of the call has changed.

[ An alternative implementation would not save the permission bitmap in
  every task. But instead would extend the per-process signal data, and
  that would not be subject to this race. ]

Rename the third argument for do_arch_prctl_common() to reflect its generic
use.
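
A hedged userspace sketch of the opt-in flow; the option values match the
asm/prctl.h additions below, while XFEATURE_MASK_XTILE (bits 17 and 18)
anticipates the AMX patches later in this series:

    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define ARCH_SET_STATE_ENABLE 0x1021
    #define ARCH_GET_STATE_ENABLE 0x1022
    #define XFEATURE_MASK_XTILE   (0x3ULL << 17) /* XTILECFG | XTILEDATA */

    int main(void)
    {
        unsigned long perm;

        /* Request AMX access as early as possible, before first use. */
        if (syscall(SYS_arch_prctl, ARCH_SET_STATE_ENABLE, XFEATURE_MASK_XTILE))
            perror("ARCH_SET_STATE_ENABLE");

        if (!syscall(SYS_arch_prctl, ARCH_GET_STATE_ENABLE, &perm))
            printf("permitted xstates: %#lx\n", perm);
        return 0;
    }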

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v8:
* Update arch_prctl prototype for consistency with other arch_prctl's. It
  now takes an address of return bitmask as a parameter.
* Optimize the reset function.

Changes from v7:
* Rename the syscalls. (Thiago Macieira and Dave Hansen)
* If XSAVE is disabled, ensure that the syscall correctly indicates legacy
  states. (Thiago Macieira and Dave Hansen)

Changes from v6:
* Add state bitmap param to proposed syscall. (Thiago Macieira)
* Add companion syscall to return the current permission bitmap.
* Update the ptrace path to return EFAULT when no permission to write
  XTILEDATA.
* Update do_arch_prctl_common().

Changes from v5:
* Switched to per-process permission. (Based on the discussion on LKML)
---
 arch/x86/include/asm/fpu/types.h  |  8 +++
 arch/x86/include/asm/fpu/xstate.h |  5 ++
 arch/x86/include/asm/proto.h      |  2 +-
 arch/x86/include/uapi/asm/prctl.h |  3 +
 arch/x86/kernel/fpu/regset.c      | 17 ++++--
 arch/x86/kernel/fpu/xstate.c      | 96 +++++++++++++++++++++++++++++++
 arch/x86/kernel/process.c         |  8 ++-
 arch/x86/kernel/process_64.c      |  6 ++
 arch/x86/kernel/traps.c           |  8 ++-
 9 files changed, 141 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index c0192e16cadb..03160a1a79ad 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -336,6 +336,14 @@ struct fpu {
 	 */
 	unsigned long			avx512_timestamp;
 
+	/*
+	 * @dynamic_state_perm:
+	 *
+	 * The bitmap that indicates which dynamically-enabled state
+	 * components this task is permitted to use.
+	 */
+	u64				dynamic_state_perm;
+
 	/*
 	 * @state_mask:
 	 *
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 45735441fbe8..9fb6308aaf07 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -149,6 +149,11 @@ void *get_xsave_addr(struct fpu *fpu, int xfeature_nr);
 unsigned int get_xstate_size(u64 mask);
 int alloc_xstate_buffer(struct fpu *fpu, u64 mask);
 void free_xstate_buffer(struct fpu *fpu);
+
+long set_process_xstate_perm(struct task_struct *tsk, u64 state_perm);
+void reset_task_xstate_perm(struct task_struct *tsk);
+unsigned long get_task_state_perm(struct task_struct *tsk);
+
 int xfeature_size(int xfeature_nr);
 int copy_uabi_from_kernel_to_xstate(struct fpu *fpu, const void *kbuf);
 int copy_sigframe_from_user_to_xstate(struct fpu *fpu, const void __user *ubuf);
diff --git a/arch/x86/include/asm/proto.h b/arch/x86/include/asm/proto.h
index 8c5d1910a848..feed36d44d04 100644
--- a/arch/x86/include/asm/proto.h
+++ b/arch/x86/include/asm/proto.h
@@ -40,6 +40,6 @@ void x86_report_nx(void);
 extern int reboot_force;
 
 long do_arch_prctl_common(struct task_struct *task, int option,
-			  unsigned long cpuid_enabled);
+			  unsigned long arg2);
 
 #endif /* _ASM_X86_PROTO_H */
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 5a6aac9fa41f..c73e141ce90a 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -10,6 +10,9 @@
 #define ARCH_GET_CPUID		0x1011
 #define ARCH_SET_CPUID		0x1012
 
+#define ARCH_SET_STATE_ENABLE	0x1021
+#define ARCH_GET_STATE_ENABLE	0x1022
+
 #define ARCH_MAP_VDSO_X32	0x2001
 #define ARCH_MAP_VDSO_32	0x2002
 #define ARCH_MAP_VDSO_64	0x2003
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 244e672c3e3d..ee71ffd7c221 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -166,22 +166,27 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
 	/*
 	 * When a ptracer attempts to write any dynamic user state in the
 	 * target buffer but not sufficiently allocated, it dynamically
-	 * expands the buffer.
+	 * expands the buffer if permitted.
 	 *
 	 * Check if the expansion is possibly needed.
 	 */
 	if (xfeatures_mask_user_dynamic &&
 	    ((fpu->state_mask & xfeatures_mask_user_dynamic) != xfeatures_mask_user_dynamic)) {
-		u64 state_mask;
+		u64 state_mask, dynstate_mask;
 
 		/* Retrieve XSTATE_BV. */
 		memcpy(&state_mask, (kbuf ?: tmpbuf) + offsetof(struct xregs_state, header),
 		       sizeof(u64));
 
-		/* Expand the xstate buffer based on the XSTATE_BV. */
-		state_mask &= xfeatures_mask_user_dynamic;
-		if (state_mask) {
-			ret = alloc_xstate_buffer(fpu, state_mask);
+		/* Check the permission and expand the xstate buffer. */
+		dynstate_mask = state_mask & xfeatures_mask_user_dynamic;
+		if (dynstate_mask) {
+			if ((dynstate_mask & fpu->dynamic_state_perm) != dynstate_mask) {
+				ret = -EFAULT;
+				goto out;
+			}
+
+			ret = alloc_xstate_buffer(fpu, dynstate_mask);
 			if (ret)
 				goto out;
 		}
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c6ff0575d87d..84dda445386e 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -961,6 +961,7 @@ void __init fpu__init_system_xstate(void)
 		goto out_disable;
 
 	/* Make sure init_task does not include the dynamic user states. */
+	current->thread.fpu.dynamic_state_perm = 0;
 	current->thread.fpu.state_mask = (xfeatures_mask_all & ~xfeatures_mask_user_dynamic);
 
 	/*
@@ -1233,6 +1234,101 @@ int alloc_xstate_buffer(struct fpu *fpu, u64 mask)
 	return 0;
 }
 
+/**
+ * set_process_xstate_perm - Set a per-process permission to use dynamic
+ *			     user xstates.
+ * @tsk:	A struct task_struct * pointer
+ * @state_perm:	A bitmap indicating which states' permission is to be set.
+ * Return:	0 if successful; otherwise, error code.
+ */
+long set_process_xstate_perm(struct task_struct *tsk, u64 state_perm)
+{
+	u64 req_dynstate_perm, old_dynstate_perm;
+	struct task_struct *t;
+	int nr_threads = 0;
+	u64 features_mask;
+
+	if (!boot_cpu_has(X86_FEATURE_FPU))
+		features_mask = 0;
+	else if (use_xsave())
+		features_mask = xfeatures_mask_uabi();
+	else if (use_fxsr())
+		features_mask = XFEATURE_MASK_FPSSE;
+	else
+		features_mask = XFEATURE_MASK_FP;
+
+	if (state_perm & ~features_mask)
+		return -EINVAL;
+
+	req_dynstate_perm = state_perm & xfeatures_mask_user_dynamic;
+	if (!req_dynstate_perm)
+		return 0;
+
+	old_dynstate_perm = tsk->thread.fpu.dynamic_state_perm;
+
+	for_each_thread(tsk, t) {
+		t->thread.fpu.dynamic_state_perm |= req_dynstate_perm;
+		nr_threads++;
+	}
+
+	if (nr_threads != tsk->signal->nr_threads) {
+		for_each_thread(tsk, t)
+			t->thread.fpu.dynamic_state_perm = old_dynstate_perm;
+		pr_err("x86/fpu: ARCH_SET_STATE_ENABLE failed as the thread count changed.\n");
+		return -EBUSY;
+	}
+	return 0;
+}
+
+/**
+ * reset_task_xstate_perm - Reset a task's permission to use dynamic user
+ *			    xstates.
+ *
+ * Expected to be called at exec time, when a single task runs in the
+ * process.
+ *
+ * @tsk:	A struct task_struct * pointer
+ */
+void reset_task_xstate_perm(struct task_struct *tsk)
+{
+	struct fpu *fpu = &tsk->thread.fpu;
+
+	if (!xfeatures_mask_user_dynamic ||
+	    !(fpu->state_mask & xfeatures_mask_user_dynamic))
+		return;
+
+	WARN_ON(tsk->signal->nr_threads > 1);
+
+	fpu->state_mask = (xfeatures_mask_all & ~xfeatures_mask_user_dynamic);
+	free_xstate_buffer(fpu);
+	fpu->state = &fpu->__default_state;
+	if (boot_cpu_has(X86_FEATURE_XSAVES))
+		fpstate_init_xstate(&fpu->state->xsave, fpu->state_mask);
+
+	xfd_write(xfd_capable() ^ (fpu->state_mask & xfd_capable()));
+
+	fpu->dynamic_state_perm = 0;
+}
+
+/**
+ * get_task_state_perm - get the state permission bitmap
+ * @tsk:	A struct task_struct * pointer
+ * Return:	A bitmap indicating which states' permission is set.
+ */
+unsigned long get_task_state_perm(struct task_struct *tsk)
+{
+	if (!boot_cpu_has(X86_FEATURE_FPU))
+		return 0;
+
+	if (use_xsave())
+		return (xfeatures_mask_uabi() & ~xfeatures_mask_user_dynamic) |
+		       tsk->thread.fpu.dynamic_state_perm;
+
+	if (use_fxsr())
+		return XFEATURE_MASK_FPSSE;
+
+	return XFEATURE_MASK_FP;
+}
+
 static void copy_feature(bool from_xstate, struct membuf *to, void *xstate,
 			 void *init_xstate, unsigned int size)
 {
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b85fa499f195..5b4f9b82aea1 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -1012,13 +1012,17 @@ unsigned long get_wchan(struct task_struct *p)
 }
 
 long do_arch_prctl_common(struct task_struct *task, int option,
-			  unsigned long cpuid_enabled)
+			  unsigned long arg2)
 {
 	switch (option) {
 	case ARCH_GET_CPUID:
 		return get_cpuid_mode();
 	case ARCH_SET_CPUID:
-		return set_cpuid_mode(task, cpuid_enabled);
+		return set_cpuid_mode(task, arg2);
+	case ARCH_SET_STATE_ENABLE:
+		return set_process_xstate_perm(task, arg2);
+	case ARCH_GET_STATE_ENABLE:
+		return put_user(get_task_state_perm(task), (unsigned long __user *)arg2);
 	}
 
 	return -EINVAL;
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 41c9855158d6..065ea28328b9 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -678,6 +678,9 @@ void set_personality_64bit(void)
 	   so it's not too bad. The main problem is just that
 	   32bit children are affected again. */
 	current->personality &= ~READ_IMPLIES_EXEC;
+
+	/* Make sure to reset the dynamic state permission. */
+	reset_task_xstate_perm(current);
 }
 
 static void __set_personality_x32(void)
@@ -723,6 +726,9 @@ void set_personality_ia32(bool x32)
 	/* Make sure to be in 32bit mode */
 	set_thread_flag(TIF_ADDR32);
 
+	/* Make sure to reset the dynamic state permission. */
+	reset_task_xstate_perm(current);
+
 	if (x32)
 		__set_personality_x32();
 	else
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index dd66d528afd8..c94f3b76c126 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1132,10 +1132,12 @@ DEFINE_IDTENTRY(exc_device_not_available)
 				int err = -1;
 
 				/*
-				 * Make sure not in interrupt context as handling a
-				 * trap from userspace.
+				 * Make sure that dynamic buffer expansion is permitted
+				 * and that this is not interrupt context, since this
+				 * path handles a trap from userspace.
 				 */
-				if (!WARN_ON(in_interrupt())) {
+				if (((xfd_event & fpu->dynamic_state_perm) == xfd_event) &&
+				    !WARN_ON(in_interrupt())) {
 					err = alloc_xstate_buffer(fpu, xfd_event);
 					if (!err)
 						xfd_write((fpu->state_mask & xfd_capable()) ^
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v9 15/26] x86/fpu/xstate: Support both legacy and expanded signal XSTATE size
  2021-07-30 14:59 [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (13 preceding siblings ...)
  2021-07-30 14:59 ` [PATCH v9 14/26] x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE Chang S. Bae
@ 2021-07-30 14:59 ` Chang S. Bae
  2021-07-30 14:59 ` [PATCH v9 16/26] x86/fpu/xstate: Adjust the XSAVE feature table to address gaps in state component numbers Chang S. Bae
                   ` (10 subsequent siblings)
  25 siblings, 0 replies; 91+ messages in thread
From: Chang S. Bae @ 2021-07-30 14:59 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae

Prepare to support two XSTATE sizes on the signal stack -- legacy and
expanded. Legacy programs have not requested access to AMX (or later
features), and the XSTATE on their signal stack can include up through
AVX-512.

Programs that request access to AMX (and/or later features) will have an
uncompressed XSTATE that includes those features. If such programs also
use an alternate signal stack, they must ensure that it is large enough to
hold that full XSTATE format. (This is most easily done by using signal.h
from glibc 2.34 or later.)

Introduce a new XSTATE size variable for the legacy stack and some helpers.
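
A hedged sketch of the sigaltstack sizing an opt-in program might do,
assuming the kernel exposes its minimum via AT_MINSIGSTKSZ:

    #include <signal.h>
    #include <stdlib.h>
    #include <sys/auxv.h>

    /* Size the alternate stack from the kernel's advertised minimum. */
    static int setup_altstack(void)
    {
        unsigned long min = getauxval(AT_MINSIGSTKSZ);
        stack_t ss = { 0 };

        /* Fall back to the legacy constant on older kernels. */
        ss.ss_size = min > SIGSTKSZ ? min : SIGSTKSZ;
        ss.ss_sp = malloc(ss.ss_size);
        if (!ss.ss_sp)
            return -1;

        return sigaltstack(&ss, NULL);
    }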

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v6:
* Massage the code comments.

Changes form v5:
* Added as a new patch.
---
 arch/x86/include/asm/fpu/internal.h | 23 +++++++++--
 arch/x86/include/asm/fpu/xstate.h   |  3 +-
 arch/x86/kernel/fpu/init.c          |  1 +
 arch/x86/kernel/fpu/signal.c        | 63 ++++++++++++++++++++---------
 arch/x86/kernel/fpu/xstate.c        | 25 +++++++++++-
 5 files changed, 89 insertions(+), 26 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index e3590cf55325..3b52cfb62ab5 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -337,15 +337,30 @@ static inline void os_xrstor(struct xregs_state *xstate, u64 mask)
  */
 static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
 {
+	u32 lmask, hmask;
+	u64 mask;
+	int err;
+
 	/*
 	 * Include the features which are not xsaved/rstored by the kernel
 	 * internally, e.g. PKRU. That's user space ABI and also required
 	 * to allow the signal handler to modify PKRU.
 	 */
-	u64 mask = xfeatures_mask_uabi();
-	u32 lmask = mask;
-	u32 hmask = mask >> 32;
-	int err;
+	mask = xfeatures_mask_uabi();
+
+	/*
+	 * Exclude dynamic user states for non-opt-in threads.
+	 */
+	if (xfeatures_mask_user_dynamic) {
+		struct fpu *fpu = &current->thread.fpu;
+
+		mask &= fpu->dynamic_state_perm ?
+			fpu->state_mask :
+			~xfeatures_mask_user_dynamic;
+	}
+
+	lmask = mask;
+	hmask = mask >> 32;
 
 	/*
 	 * Clear the xsave header first, so that reserved fields are
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 9fb6308aaf07..c39ea8bac68f 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -139,7 +139,8 @@ extern void __init update_regset_xstate_info(unsigned int size,
 enum xstate_config {
 	XSTATE_MIN_SIZE,
 	XSTATE_MAX_SIZE,
-	XSTATE_USER_SIZE
+	XSTATE_USER_SIZE,
+	XSTATE_USER_MINSIG_SIZE,
 };
 
 extern unsigned int get_xstate_config(enum xstate_config cfg);
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 3e4e14ca723b..acbd3da0e022 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -210,6 +210,7 @@ static void __init fpu__init_system_xstate_size_legacy(void)
 	set_xstate_config(XSTATE_MIN_SIZE, xstate_size);
 	set_xstate_config(XSTATE_MAX_SIZE, xstate_size);
 	set_xstate_config(XSTATE_USER_SIZE, xstate_size);
+	set_xstate_config(XSTATE_USER_MINSIG_SIZE, xstate_size);
 }
 
 /* Legacy code to initialize eager fpu mode. */
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index f70f84d53442..78696b412b56 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -15,9 +15,26 @@
 #include <asm/sigframe.h>
 #include <asm/trace/fpu.h>
 
+/*
+ * Record the signal xstate size and feature bits. Exclude dynamic user
+ * states. See fpu__init_prepare_fx_sw_frame(). Opt-in tasks adjust
+ * these values dynamically.
+ */
 static struct _fpx_sw_bytes fx_sw_reserved __ro_after_init;
 static struct _fpx_sw_bytes fx_sw_reserved_ia32 __ro_after_init;
 
+static unsigned int current_sig_xstate_size(void)
+{
+	return current->thread.fpu.dynamic_state_perm ?
+	       get_xstate_config(XSTATE_USER_SIZE) :
+	       get_xstate_config(XSTATE_USER_MINSIG_SIZE);
+}
+
+static inline int extend_sig_xstate_size(unsigned int size)
+{
+	return use_xsave() ? size + FP_XSTATE_MAGIC2_SIZE : size;
+}
+
 /*
  * Check for the presence of extended state information in the
  * user fpstate pointer in the sigcontext.
@@ -36,7 +53,7 @@ static inline int check_xstate_in_sigframe(struct fxregs_state __user *fxbuf,
 	/* Check for the first magic field and other error scenarios. */
 	if (fx_sw->magic1 != FP_XSTATE_MAGIC1 ||
 	    fx_sw->xstate_size < min_xstate_size ||
-	    fx_sw->xstate_size > get_xstate_config(XSTATE_USER_SIZE) ||
+	    fx_sw->xstate_size > current_sig_xstate_size() ||
 	    fx_sw->xstate_size > fx_sw->extended_size)
 		goto setfx;
 
@@ -94,20 +111,32 @@ static inline int save_fsave_header(struct task_struct *tsk, void __user *buf)
 
 static inline int save_xstate_epilog(void __user *buf, int ia32_frame)
 {
+	unsigned int current_xstate_size = current_sig_xstate_size();
 	struct xregs_state __user *x = buf;
-	struct _fpx_sw_bytes *sw_bytes;
+	struct _fpx_sw_bytes sw_bytes;
 	u32 xfeatures;
 	int err;
 
-	/* Setup the bytes not touched by the [f]xsave and reserved for SW. */
-	sw_bytes = ia32_frame ? &fx_sw_reserved_ia32 : &fx_sw_reserved;
-	err = __copy_to_user(&x->i387.sw_reserved, sw_bytes, sizeof(*sw_bytes));
+	/*
+	 * Setup the bytes not touched by the [f]xsave and reserved for SW.
+	 *
+	 * Use the recorded values if they match the current task. Otherwise,
+	 * adjust them.
+	 */
+	sw_bytes = ia32_frame ? fx_sw_reserved_ia32 : fx_sw_reserved;
+	if (sw_bytes.xstate_size != current_xstate_size) {
+		unsigned int default_xstate_size = sw_bytes.xstate_size;
+
+		sw_bytes.xfeatures = xfeatures_mask_uabi();
+		sw_bytes.xstate_size = current_xstate_size;
+		sw_bytes.extended_size += (current_xstate_size - default_xstate_size);
+	}
+	err = __copy_to_user(&x->i387.sw_reserved, &sw_bytes, sizeof(sw_bytes));
 
 	if (!use_xsave())
 		return err;
 
-	err |= __put_user(FP_XSTATE_MAGIC2,
-			  (__u32 __user *)(buf + get_xstate_config(XSTATE_USER_SIZE)));
+	err |= __put_user(FP_XSTATE_MAGIC2, (__u32 __user *)(buf + current_xstate_size));
 
 	/*
 	 * Read the xfeatures which we copied (directly from the cpu or
@@ -144,7 +173,7 @@ static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf)
 	else
 		err = fnsave_to_user_sigframe((struct fregs_state __user *) buf);
 
-	if (unlikely(err) && __clear_user(buf, get_xstate_config(XSTATE_USER_SIZE)))
+	if (unlikely(err) && __clear_user(buf, current_sig_xstate_size()))
 		err = -EFAULT;
 	return err;
 }
@@ -205,7 +234,7 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
 	fpregs_unlock();
 
 	if (ret) {
-		if (!fault_in_pages_writeable(buf_fx, get_xstate_config(XSTATE_USER_SIZE)))
+		if (!fault_in_pages_writeable(buf_fx, current_sig_xstate_size()))
 			goto retry;
 		return -EFAULT;
 	}
@@ -418,19 +447,13 @@ static int __fpu_restore_sig(void __user *buf, void __user *buf_fx,
 	fpregs_unlock();
 	return ret;
 }
-static inline int xstate_sigframe_size(void)
-{
-	int xstate_size = get_xstate_config(XSTATE_USER_SIZE);
-
-	return use_xsave() ? xstate_size + FP_XSTATE_MAGIC2_SIZE : xstate_size;
-}
 
 /*
  * Restore FPU state from a sigframe:
  */
 int fpu__restore_sig(void __user *buf, int ia32_frame)
 {
-	unsigned int size = xstate_sigframe_size();
+	unsigned int size = extend_sig_xstate_size(current_sig_xstate_size());
 	struct fpu *fpu = &current->thread.fpu;
 	void __user *buf_fx = buf;
 	bool ia32_fxstate = false;
@@ -477,7 +500,7 @@ unsigned long
 fpu__alloc_mathframe(unsigned long sp, int ia32_frame,
 		     unsigned long *buf_fx, unsigned long *size)
 {
-	unsigned long frame_size = xstate_sigframe_size();
+	unsigned long frame_size = extend_sig_xstate_size(current_sig_xstate_size());
 
 	*buf_fx = sp = round_down(sp - frame_size, 64);
 	if (ia32_frame && use_fxsr()) {
@@ -492,7 +515,7 @@ fpu__alloc_mathframe(unsigned long sp, int ia32_frame,
 
 unsigned long fpu__get_fpstate_size(void)
 {
-	unsigned long ret = xstate_sigframe_size();
+	unsigned long ret = extend_sig_xstate_size(get_xstate_config(XSTATE_USER_SIZE));
 
 	/*
 	 * This space is needed on (most) 32-bit kernels, or when a 32-bit
@@ -517,12 +540,12 @@ unsigned long fpu__get_fpstate_size(void)
  */
 void fpu__init_prepare_fx_sw_frame(void)
 {
-	int xstate_size = get_xstate_config(XSTATE_USER_SIZE);
+	int xstate_size = get_xstate_config(XSTATE_USER_MINSIG_SIZE);
 	int ext_size = xstate_size + FP_XSTATE_MAGIC2_SIZE;
 
 	fx_sw_reserved.magic1 = FP_XSTATE_MAGIC1;
 	fx_sw_reserved.extended_size = ext_size;
-	fx_sw_reserved.xfeatures = xfeatures_mask_uabi();
+	fx_sw_reserved.xfeatures = xfeatures_mask_uabi() & ~xfeatures_mask_user_dynamic;
 	fx_sw_reserved.xstate_size = xstate_size;
 
 	if (IS_ENABLED(CONFIG_IA32_EMULATION) ||
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 84dda445386e..c539e02965a6 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -94,10 +94,13 @@ static bool xstate_aligns[XFEATURE_MAX] __ro_after_init =
  *				contains all the enabled state components.
  * @user_size:			The size of user-space buffer for signal and
  *				ptrace frames, in the non-compacted format.
+ * @user_minsig_size:		The non-compacted legacy xstate size for signal.
+ *				Legacy programs do not request to access dynamic
+ *				states.
  */
 struct fpu_xstate_buffer_config {
 	unsigned int min_size, max_size;
-	unsigned int user_size;
+	unsigned int user_size, user_minsig_size;
 };
 
 static struct fpu_xstate_buffer_config buffer_config __ro_after_init;
@@ -111,6 +114,8 @@ unsigned int get_xstate_config(enum xstate_config cfg)
 		return buffer_config.max_size;
 	case XSTATE_USER_SIZE:
 		return buffer_config.user_size;
+	case XSTATE_USER_MINSIG_SIZE:
+		return buffer_config.user_minsig_size;
 	default:
 		return 0;
 	}
@@ -128,6 +133,9 @@ void set_xstate_config(enum xstate_config cfg, unsigned int value)
 		break;
 	case XSTATE_USER_SIZE:
 		buffer_config.user_size = value;
+		break;
+	case XSTATE_USER_MINSIG_SIZE:
+		buffer_config.user_minsig_size = value;
 	}
 }
 
@@ -859,6 +867,21 @@ static int __init init_xstate_size(void)
 	 * User space is always in standard format.
 	 */
 	set_xstate_config(XSTATE_USER_SIZE, xsave_size);
+
+	/*
+	 * The minimum signal xstate size is for non-opt-in user threads
+	 * that do not access dynamic states.
+	 */
+	if (xfeatures_mask_user_dynamic) {
+		int nr = fls64(xfeatures_mask_uabi() & ~xfeatures_mask_user_dynamic) - 1;
+		unsigned int size, offset, ecx, edx;
+
+		cpuid_count(XSTATE_CPUID, nr, &size, &offset, &ecx, &edx);
+		set_xstate_config(XSTATE_USER_MINSIG_SIZE, offset + size);
+	} else {
+		set_xstate_config(XSTATE_USER_MINSIG_SIZE, xsave_size);
+	}
+
 	return 0;
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v9 16/26] x86/fpu/xstate: Adjust the XSAVE feature table to address gaps in state component numbers
  2021-07-30 14:59 [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (14 preceding siblings ...)
  2021-07-30 14:59 ` [PATCH v9 15/26] x86/fpu/xstate: Support both legacy and expanded signal XSTATE size Chang S. Bae
@ 2021-07-30 14:59 ` Chang S. Bae
  2021-07-30 14:59 ` [PATCH v9 17/26] x86/fpu/xstate: Disable XSTATE support if an inconsistent state is detected Chang S. Bae
                   ` (9 subsequent siblings)
  25 siblings, 0 replies; 91+ messages in thread
From: Chang S. Bae @ 2021-07-30 14:59 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae

At compile-time xfeatures_mask_all includes all possible XCR0 features. At
run-time fpu__init_system_xstate() clears features in xfeatures_mask_all
that are not enabled in CPUID. It does this by looping through all possible
XCR0 features.

Update the code to handle the possibility that there will be gaps in the
XCR0 feature bit numbers.

No functional change.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v5:
* Folded a few lines.

Changes from v4:
* Simplified the implementation. (Thomas Gleixner)
* Updated the patch title accordingly.

Changes from v1:
* Rebased on the upstream kernel (5.10)
---
 arch/x86/kernel/fpu/xstate.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c539e02965a6..930e72f55e75 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -43,18 +43,17 @@ static const char *xfeature_names[] =
 	"unknown xstate feature"	,
 };
 
-static short xsave_cpuid_features[] __initdata = {
-	X86_FEATURE_FPU,
-	X86_FEATURE_XMM,
-	X86_FEATURE_AVX,
-	X86_FEATURE_MPX,
-	X86_FEATURE_MPX,
-	X86_FEATURE_AVX512F,
-	X86_FEATURE_AVX512F,
-	X86_FEATURE_AVX512F,
-	X86_FEATURE_INTEL_PT,
-	X86_FEATURE_PKU,
-	X86_FEATURE_ENQCMD,
+static unsigned short xsave_cpuid_features[] __initdata = {
+	[XFEATURE_SSE]				= X86_FEATURE_XMM,
+	[XFEATURE_YMM]				= X86_FEATURE_AVX,
+	[XFEATURE_BNDREGS]			= X86_FEATURE_MPX,
+	[XFEATURE_BNDCSR]			= X86_FEATURE_MPX,
+	[XFEATURE_OPMASK]			= X86_FEATURE_AVX512F,
+	[XFEATURE_ZMM_Hi256]			= X86_FEATURE_AVX512F,
+	[XFEATURE_Hi16_ZMM]			= X86_FEATURE_AVX512F,
+	[XFEATURE_PT_UNIMPLEMENTED_SO_FAR]	= X86_FEATURE_INTEL_PT,
+	[XFEATURE_PKRU]				= X86_FEATURE_PKU,
+	[XFEATURE_PASID]			= X86_FEATURE_ENQCMD,
 };
 
 /*
@@ -955,7 +954,8 @@ void __init fpu__init_system_xstate(void)
 	 * Clear XSAVE features that are disabled in the normal CPUID.
 	 */
 	for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
-		if (!boot_cpu_has(xsave_cpuid_features[i]))
+		if (((i == 0) || xsave_cpuid_features[i]) &&
+		    !boot_cpu_has(xsave_cpuid_features[i]))
 			xfeatures_mask_all &= ~BIT_ULL(i);
 	}
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v9 17/26] x86/fpu/xstate: Disable XSTATE support if an inconsistent state is detected
  2021-07-30 14:59 [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (15 preceding siblings ...)
  2021-07-30 14:59 ` [PATCH v9 16/26] x86/fpu/xstate: Adjust the XSAVE feature table to address gaps in state component numbers Chang S. Bae
@ 2021-07-30 14:59 ` Chang S. Bae
  2021-07-30 14:59 ` [PATCH v9 18/26] x86/cpufeatures/amx: Enumerate Advanced Matrix Extension (AMX) feature bits Chang S. Bae
                   ` (8 subsequent siblings)
  25 siblings, 0 replies; 91+ messages in thread
From: Chang S. Bae @ 2021-07-30 14:59 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae

The kernel has a sanity check between two methods to calculate XSTATE size.
In the unlikely event that they disagree, disable the use of XSTATE.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v4:
* Added as a new patch. (Thomas Gleixner)
---
 arch/x86/kernel/fpu/xstate.c | 40 ++++++++++++++++++++++++------------
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 930e72f55e75..a36e24028ca7 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -654,11 +654,11 @@ static void __xstate_dump_leaves(void)
 } while (0)
 
 #define XCHECK_SZ(sz, nr, nr_macro, __struct) do {			\
-	if ((nr == nr_macro) &&						\
-	    WARN_ONCE(sz != sizeof(__struct),				\
-		"%s: struct is %zu bytes, cpu state %d bytes\n",	\
-		__stringify(nr_macro), sizeof(__struct), sz)) {		\
+	if ((nr == nr_macro) &&	(sz != sizeof(__struct))) {		\
+		pr_err("%s: struct is %zu bytes, cpu state %d bytes\n",	\
+		       __stringify(nr_macro), sizeof(__struct), sz);	\
 		__xstate_dump_leaves();					\
+		return -EINVAL;						\
 	}								\
 } while (0)
 
@@ -667,7 +667,7 @@ static void __xstate_dump_leaves(void)
  * that our software representation matches what the CPU
  * tells us about the state's size.
  */
-static void check_xstate_against_struct(int nr)
+static int check_xstate_against_struct(int nr)
 {
 	/*
 	 * Ask the CPU for the size of the state.
@@ -695,9 +695,12 @@ static void check_xstate_against_struct(int nr)
 	    (nr >= XFEATURE_MAX) ||
 	    (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR) ||
 	    ((nr >= XFEATURE_RSRVD_COMP_11) && (nr <= XFEATURE_LBR))) {
-		WARN_ONCE(1, "no structure for xstate: %d\n", nr);
+		pr_err("no structure for xstate: %d\n", nr);
 		XSTATE_WARN_ON(1);
+		return -EINVAL;
 	}
+
+	return 0;
 }
 
 /**
@@ -707,13 +710,14 @@ static void check_xstate_against_struct(int nr)
  * excluded. Only the size of the buffer for task->fpu is checked here.
  *
  * @include_dynamic_states:	A knob to include dynamic states or not.
+ * @size:			A pointer to record the size.
  *
- * Return:			The calculated xstate size.
+ * Return:			0 if successful; otherwise, error code.
  */
-static unsigned int calculate_xstate_size(bool include_dynamic_states)
+static int calculate_xstate_size(bool include_dynamic_states, unsigned int *size)
 {
 	unsigned int xstate_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
-	int i;
+	int i, err;
 
 	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
 		if (!xfeature_enabled(i))
@@ -722,7 +726,10 @@ static unsigned int calculate_xstate_size(bool include_dynamic_states)
 		if (!include_dynamic_states && (xfeatures_mask_user_dynamic & BIT_ULL(i)))
 			continue;
 
-		check_xstate_against_struct(i);
+		err = check_xstate_against_struct(i);
+		if (err)
+			return err;
+
 		/*
 		 * Supervisor state components can be managed only by
 		 * XSAVES.
@@ -748,7 +755,9 @@ static unsigned int calculate_xstate_size(bool include_dynamic_states)
 		xstate_size += xfeature_size(i);
 	}
 
-	return xstate_size;
+	if (size)
+		*size = xstate_size;
+	return 0;
 }
 
 
@@ -835,6 +844,7 @@ static int __init init_xstate_size(void)
 	/* Recompute the context size for enabled features: */
 	unsigned int possible_xstate_size, xstate_size;
 	unsigned int xsave_size;
+	int err;
 
 	xsave_size = get_xsave_size();
 
@@ -848,7 +858,9 @@ static int __init init_xstate_size(void)
 	 * 'true' to include dynamic states. Cross-check with the CPUID-
 	 * provided size and record it.
 	 */
-	xstate_size = calculate_xstate_size(true);
+	err = calculate_xstate_size(true, &xstate_size);
+	if (err)
+		return err;
 	XSTATE_WARN_ON(possible_xstate_size != xstate_size);
 	set_xstate_config(XSTATE_MAX_SIZE, possible_xstate_size);
 
@@ -857,7 +869,9 @@ static int __init init_xstate_size(void)
 	 * 'false' to exclude dynamic states. Ensure the size fits in
 	 * the statically-allocated buffer and record it.
 	 */
-	xstate_size = calculate_xstate_size(false);
+	err = calculate_xstate_size(false, &xstate_size);
+	if (err)
+		return err;
 	if (!is_supported_xstate_size(xstate_size))
 		return -EINVAL;
 	set_xstate_config(XSTATE_MIN_SIZE, xstate_size);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v9 18/26] x86/cpufeatures/amx: Enumerate Advanced Matrix Extension (AMX) feature bits
  2021-07-30 14:59 [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (16 preceding siblings ...)
  2021-07-30 14:59 ` [PATCH v9 17/26] x86/fpu/xstate: Disable XSTATE support if an inconsistent state is detected Chang S. Bae
@ 2021-07-30 14:59 ` Chang S. Bae
  2021-07-30 14:59 ` [PATCH v9 19/26] x86/fpu/amx: Define AMX state components and have it used for boot-time checks Chang S. Bae
                   ` (7 subsequent siblings)
  25 siblings, 0 replies; 91+ messages in thread
From: Chang S. Bae @ 2021-07-30 14:59 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae

Intel's Advanced Matrix Extension (AMX) is a new 64-bit extended feature
consisting of two-dimensional registers and an accelerator unit. The first
implementation of the latter is the tile matrix multiply unit (TMUL). TMUL
performs SIMD dot-products on four bytes (INT8) or two bfloat16
floating-point (BF16) elements.

Enumerate this hardware capability, which appears as 'amx_tile',
'amx_bf16', and 'amx_int8' in /proc/cpuinfo.
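
For reference, a hedged userspace sketch reading the same enumeration
directly; the bit positions match the word-18 (CPUID(7,0):EDX) definitions
above:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        __cpuid_count(7, 0, eax, ebx, ecx, edx);
        printf("amx_bf16: %u\n", (edx >> 22) & 1);
        printf("amx_tile: %u\n", (edx >> 24) & 1);
        printf("amx_int8: %u\n", (edx >> 25) & 1);
        return 0;
    }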

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v4:
* Massaged the changelog a bit.
---
 arch/x86/include/asm/cpufeatures.h | 3 +++
 arch/x86/kernel/cpu/cpuid-deps.c   | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 37150b7a8e44..9e9763ec7713 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -386,7 +386,10 @@
 #define X86_FEATURE_TSXLDTRK		(18*32+16) /* TSX Suspend Load Address Tracking */
 #define X86_FEATURE_PCONFIG		(18*32+18) /* Intel PCONFIG */
 #define X86_FEATURE_ARCH_LBR		(18*32+19) /* Intel ARCH LBR */
+#define X86_FEATURE_AMX_BF16		(18*32+22) /* AMX BF16 Support */
 #define X86_FEATURE_AVX512_FP16		(18*32+23) /* AVX512 FP16 */
+#define X86_FEATURE_AMX_TILE		(18*32+24) /* AMX tile Support */
+#define X86_FEATURE_AMX_INT8		(18*32+25) /* AMX INT8 Support */
 #define X86_FEATURE_SPEC_CTRL		(18*32+26) /* "" Speculation Control (IBRS + IBPB) */
 #define X86_FEATURE_INTEL_STIBP		(18*32+27) /* "" Single Thread Indirect Branch Predictors */
 #define X86_FEATURE_FLUSH_L1D		(18*32+28) /* Flush L1D cache */
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index 7f891d2eb52e..9a520ab259ac 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -76,6 +76,9 @@ static const struct cpuid_dep cpuid_deps[] = {
 	{ X86_FEATURE_SGX1,			X86_FEATURE_SGX       },
 	{ X86_FEATURE_SGX2,			X86_FEATURE_SGX1      },
 	{ X86_FEATURE_XFD,			X86_FEATURE_XSAVE     },
+	{ X86_FEATURE_AMX_TILE,			X86_FEATURE_XSAVE     },
+	{ X86_FEATURE_AMX_INT8,			X86_FEATURE_AMX_TILE  },
+	{ X86_FEATURE_AMX_BF16,			X86_FEATURE_AMX_TILE  },
 	{}
 };
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v9 19/26] x86/fpu/amx: Define AMX state components and have it used for boot-time checks
  2021-07-30 14:59 [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (17 preceding siblings ...)
  2021-07-30 14:59 ` [PATCH v9 18/26] x86/cpufeatures/amx: Enumerate Advanced Matrix Extension (AMX) feature bits Chang S. Bae
@ 2021-07-30 14:59 ` Chang S. Bae
  2021-07-30 14:59 ` [PATCH v9 20/26] x86/fpu/amx: Initialize child's AMX state Chang S. Bae
                   ` (6 subsequent siblings)
  25 siblings, 0 replies; 91+ messages in thread
From: Chang S. Bae @ 2021-07-30 14:59 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae

Linux uses check_xstate_against_struct() to sanity check the size of
XSTATE-enabled features. AMX is an XSAVE-enabled feature whose size is
not hard-coded but discoverable at run-time via CPUID.

The AMX state is composed of state components 17 and 18, both of which are
user state components. Component 17, XTILECFG, is a 64-byte tile-related
control register. Component 18, XTILEDATA, contains the actual tile data,
and its size varies between implementations. The architectural maximum, as
defined in CPUID(0x1d, 1): EAX[15:0], is one byte less than 64KB. The
first implementation supports 8KB.

Check the XTILEDATA state size dynamically. The feature introduces the new
tile register, TMM. Define one register struct only and read the number of
registers from CPUID. Cross-check the overall size with CPUID again.
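
A hedged userspace sketch of the same CPUID(0x1d) geometry query, assuming
GCC/Clang's <cpuid.h> and an AMX-capable CPU:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* Palette 1: the first (and so far only) defined tile palette. */
        __cpuid_count(0x1d, 1, eax, ebx, ecx, edx);
        printf("total tile bytes: %u\n", eax & 0xffff); /* 8192 today */
        printf("bytes per tile:   %u\n", eax >> 16);    /* 1024 */
        printf("tile registers:   %u\n", ebx >> 16);    /* 8 */
        return 0;
    }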

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v8:
* bugfix: Fix off-by-one-error in check_xstate_against_struct() feature
  number argument.

Changes from v4:
* Changed to return an error when tile data size mismatches. (Thomas Gleixner)
* Updated the function description and code comments.

Changes from v2:
* Updated the code comments.

Changes from v1:
* Rebased on the upstream kernel (5.10)
---
 arch/x86/include/asm/fpu/types.h  | 27 +++++++++++
 arch/x86/include/asm/fpu/xstate.h |  2 +
 arch/x86/kernel/fpu/xstate.c      | 80 ++++++++++++++++++++++++++++++-
 3 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 03160a1a79ad..f24b58b606dc 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -120,6 +120,9 @@ enum xfeature {
 	XFEATURE_RSRVD_COMP_13,
 	XFEATURE_RSRVD_COMP_14,
 	XFEATURE_LBR,
+	XFEATURE_RSRVD_COMP_16,
+	XFEATURE_XTILE_CFG,
+	XFEATURE_XTILE_DATA,
 
 	XFEATURE_MAX,
 };
@@ -136,11 +139,15 @@ enum xfeature {
 #define XFEATURE_MASK_PKRU		(1 << XFEATURE_PKRU)
 #define XFEATURE_MASK_PASID		(1 << XFEATURE_PASID)
 #define XFEATURE_MASK_LBR		(1 << XFEATURE_LBR)
+#define XFEATURE_MASK_XTILE_CFG	(1 << XFEATURE_XTILE_CFG)
+#define XFEATURE_MASK_XTILE_DATA	(1 << XFEATURE_XTILE_DATA)
 
 #define XFEATURE_MASK_FPSSE		(XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
 #define XFEATURE_MASK_AVX512		(XFEATURE_MASK_OPMASK \
 					 | XFEATURE_MASK_ZMM_Hi256 \
 					 | XFEATURE_MASK_Hi16_ZMM)
+#define XFEATURE_MASK_XTILE		(XFEATURE_MASK_XTILE_DATA \
+					 | XFEATURE_MASK_XTILE_CFG)
 
 #define FIRST_EXTENDED_XFEATURE	XFEATURE_YMM
 
@@ -153,6 +160,9 @@ struct reg_256_bit {
 struct reg_512_bit {
 	u8	regbytes[512/8];
 };
+struct reg_1024_byte {
+	u8	regbytes[1024];
+};
 
 /*
  * State component 2:
@@ -255,6 +265,23 @@ struct arch_lbr_state {
 	u64 ler_to;
 	u64 ler_info;
 	struct lbr_entry		entries[];
+};
+
+/*
+ * State component 17: 64-byte tile configuration register.
+ */
+struct xtile_cfg {
+	u64				tcfg[8];
+} __packed;
+
+/*
+ * State component 18: 1KB tile data register.
+ * Each register represents 16 64-byte rows of the matrix
+ * data. But the number of registers depends on the actual
+ * implementation.
+ */
+struct xtile_data {
+	struct reg_1024_byte		tmm;
 } __packed;
 
 /*
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index c39ea8bac68f..e7c6396261ca 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -14,6 +14,8 @@
 
 #define XSTATE_CPUID		0x0000000d
 
+#define TILE_CPUID		0x0000001d
+
 #define FXSAVE_SIZE	512
 
 #define XSAVE_HDR_SIZE	    64
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index a36e24028ca7..dac01e4d7654 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -41,6 +41,14 @@ static const char *xfeature_names[] =
 	"Protection Keys User registers",
 	"PASID state",
 	"unknown xstate feature"	,
+	"unknown xstate feature"	,
+	"unknown xstate feature"	,
+	"unknown xstate feature"	,
+	"unknown xstate feature"	,
+	"unknown xstate feature"	,
+	"AMX Tile config"		,
+	"AMX Tile data"			,
+	"unknown xstate feature"	,
 };
 
 static unsigned short xsave_cpuid_features[] __initdata = {
@@ -54,6 +62,8 @@ static unsigned short xsave_cpuid_features[] __initdata = {
 	[XFEATURE_PT_UNIMPLEMENTED_SO_FAR]	= X86_FEATURE_INTEL_PT,
 	[XFEATURE_PKRU]				= X86_FEATURE_PKU,
 	[XFEATURE_PASID]			= X86_FEATURE_ENQCMD,
+	[XFEATURE_XTILE_CFG]			= X86_FEATURE_AMX_TILE,
+	[XFEATURE_XTILE_DATA]			= X86_FEATURE_AMX_TILE,
 };
 
 /*
@@ -389,6 +399,8 @@ static void __init print_xstate_features(void)
 	print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
 	print_xstate_feature(XFEATURE_MASK_PKRU);
 	print_xstate_feature(XFEATURE_MASK_PASID);
+	print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
+	print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
 }
 
 /*
@@ -662,6 +674,67 @@ static void __xstate_dump_leaves(void)
 	}								\
 } while (0)
 
+/**
+ * check_xtile_data_against_struct - Check tile data state size.
+ *
+ * Calculate the state size by multiplying the single tile size, as
+ * recorded in a C struct, by the number of tiles reported by the CPU.
+ * Compare the provided size with the calculation.
+ *
+ * @size:	The tile data state size
+ *
+ * Returns:	0 on success, -EINVAL on mismatch.
+ */
+static int check_xtile_data_against_struct(int size)
+{
+	u32 max_palid, palid, state_size;
+	u32 eax, ebx, ecx, edx;
+	u16 max_tile;
+
+	/*
+	 * Check the maximum palette id:
+	 *   eax: the highest numbered palette subleaf.
+	 */
+	cpuid_count(TILE_CPUID, 0, &max_palid, &ebx, &ecx, &edx);
+
+	/*
+	 * Cross-check each tile size and find the maximum number of
+	 * supported tiles.
+	 */
+	for (palid = 1, max_tile = 0; palid <= max_palid; palid++) {
+		u16 tile_size, max;
+
+		/*
+		 * Check the tile size info:
+		 *   eax[31:16]:  bytes per tile
+		 *   ebx[31:16]:  the max names (or max number of tiles)
+		 */
+		cpuid_count(TILE_CPUID, palid, &eax, &ebx, &ecx, &edx);
+		tile_size = eax >> 16;
+		max = ebx >> 16;
+
+		if (tile_size != sizeof(struct xtile_data)) {
+			pr_err("%s: struct is %zu bytes, cpu xtile %d bytes\n",
+			       __stringify(XFEATURE_XTILE_DATA),
+			       sizeof(struct xtile_data), tile_size);
+			__xstate_dump_leaves();
+			return -EINVAL;
+		}
+
+		if (max > max_tile)
+			max_tile = max;
+	}
+
+	state_size = sizeof(struct xtile_data) * max_tile;
+	if (size != state_size) {
+		pr_err("%s: calculated size is %u bytes, cpu state %d bytes\n",
+		       __stringify(XFEATURE_XTILE_DATA), state_size, size);
+		__xstate_dump_leaves();
+		return -EINVAL;
+	}
+	return 0;
+}
+
 /*
  * We have a C struct for each 'xstate'.  We need to ensure
  * that our software representation matches what the CPU
@@ -685,6 +758,11 @@ static int check_xstate_against_struct(int nr)
 	XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM,  struct avx_512_hi16_state);
 	XCHECK_SZ(sz, nr, XFEATURE_PKRU,      struct pkru_state);
 	XCHECK_SZ(sz, nr, XFEATURE_PASID,     struct ia32_pasid_state);
+	XCHECK_SZ(sz, nr, XFEATURE_XTILE_CFG, struct xtile_cfg);
+
+	/* The tile data size varies between implementations. */
+	if (nr == XFEATURE_XTILE_DATA)
+		check_xtile_data_against_struct(sz);
 
 	/*
 	 * Make *SURE* to add any feature numbers in below if
@@ -694,7 +772,7 @@ static int check_xstate_against_struct(int nr)
 	if ((nr < XFEATURE_YMM) ||
 	    (nr >= XFEATURE_MAX) ||
 	    (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR) ||
-	    ((nr >= XFEATURE_RSRVD_COMP_11) && (nr <= XFEATURE_LBR))) {
+	    ((nr >= XFEATURE_RSRVD_COMP_11) && (nr <= XFEATURE_RSRVD_COMP_16))) {
 		pr_err("no structure for xstate: %d\n", nr);
 		XSTATE_WARN_ON(1);
 		return -EINVAL;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v9 20/26] x86/fpu/amx: Initialize child's AMX state
  2021-07-30 14:59 [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (18 preceding siblings ...)
  2021-07-30 14:59 ` [PATCH v9 19/26] x86/fpu/amx: Define AMX state components and have it used for boot-time checks Chang S. Bae
@ 2021-07-30 14:59 ` Chang S. Bae
  2021-07-30 14:59 ` [PATCH v9 21/26] x86/fpu/amx: Enable the AMX feature in 64-bit mode Chang S. Bae
                   ` (5 subsequent siblings)
  25 siblings, 0 replies; 91+ messages in thread
From: Chang S. Bae @ 2021-07-30 14:59 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae

Ensure that a forked child starts with its AMX registers in the INIT-state.
Clearing the AMX bits in the child's XSTATE_BV header is sufficient: the
next XRSTOR from that buffer then re-initializes the tile state.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v5:
* Introduced a new define. (Andy Lutomirski)

Changes from v4:
* Added as a new patch. This was missing on previous versions.
---
 arch/x86/include/asm/fpu/xstate.h | 3 +++
 arch/x86/kernel/fpu/core.c        | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index e7c6396261ca..912b420cb148 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -80,6 +80,9 @@
 				      XFEATURE_MASK_INDEPENDENT | \
 				      XFEATURE_MASK_SUPERVISOR_UNSUPPORTED)
 
+/* Volatile states that a child does not inherit. */
+#define XFEATURE_MASK_CLEARED_ON_CLONE	XFEATURE_MASK_XTILE
+
 #ifdef CONFIG_X86_64
 #define REX_PREFIX	"0x48, "
 #else
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 541628bfc8c0..387118127f93 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -299,6 +299,9 @@ int fpu_clone(struct task_struct *dst)
 		save_fpregs_to_fpstate(dst_fpu);
 	fpregs_unlock();
 
+	if (xfeatures_mask_all & XFEATURE_MASK_CLEARED_ON_CLONE)
+		dst_fpu->state->xsave.header.xfeatures &= ~XFEATURE_MASK_CLEARED_ON_CLONE;
+
 	set_tsk_thread_flag(dst, TIF_NEED_FPU_LOAD);
 
 	trace_x86_fpu_copy_src(src_fpu);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v9 21/26] x86/fpu/amx: Enable the AMX feature in 64-bit mode
  2021-07-30 14:59 [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (19 preceding siblings ...)
  2021-07-30 14:59 ` [PATCH v9 20/26] x86/fpu/amx: Initialize child's AMX state Chang S. Bae
@ 2021-07-30 14:59 ` Chang S. Bae
  2021-07-30 14:59 ` [PATCH v9 22/26] x86/fpu/xstate: Skip writing zeros to signal frame for dynamic user states if in INIT-state Chang S. Bae
                   ` (4 subsequent siblings)
  25 siblings, 0 replies; 91+ messages in thread
From: Chang S. Bae @ 2021-07-30 14:59 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae

In 64-bit mode, include the AMX state components in
XFEATURE_MASK_USER_SUPPORTED.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v5:
* Adjusted the macro changes and moved the code that disables AMX in
  non-64-bit mode to fit the new base changes.

Changes from v4:
* Removed the irrelevant line from the changelog. (Thomas Gleixner)
---
 arch/x86/include/asm/fpu/xstate.h | 3 ++-
 arch/x86/kernel/fpu/xstate.c      | 6 +++++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 912b420cb148..f934ce88c048 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -35,7 +35,8 @@
 				      XFEATURE_MASK_Hi16_ZMM	 | \
 				      XFEATURE_MASK_PKRU | \
 				      XFEATURE_MASK_BNDREGS | \
-				      XFEATURE_MASK_BNDCSR)
+				      XFEATURE_MASK_BNDCSR | \
+				      XFEATURE_MASK_XTILE)
 
 /*
  * Features which are restored when returning to user space.
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index dac01e4d7654..96056f49bcff 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -538,7 +538,8 @@ static void __init print_xstate_offset_size(void)
 	 XFEATURE_MASK_PKRU |			\
 	 XFEATURE_MASK_BNDREGS |		\
 	 XFEATURE_MASK_BNDCSR |			\
-	 XFEATURE_MASK_PASID)
+	 XFEATURE_MASK_PASID |			\
+	 XFEATURE_MASK_XTILE)
 
 /*
  * setup the xstate image representing the init state
@@ -1054,6 +1055,9 @@ void __init fpu__init_system_xstate(void)
 	xfeatures_mask_all &= XFEATURE_MASK_USER_SUPPORTED |
 			      XFEATURE_MASK_SUPERVISOR_SUPPORTED;
 
+	if (!IS_ENABLED(CONFIG_X86_64))
+		xfeatures_mask_all &= ~XFEATURE_MASK_XTILE;
+
 	/* Store it for paranoia check at the end */
 	xfeatures = xfeatures_mask_all;
 	/* Do not support the dynamically allocated buffer yet. */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v9 22/26] x86/fpu/xstate: Skip writing zeros to signal frame for dynamic user states if in INIT-state
  2021-07-30 14:59 [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (20 preceding siblings ...)
  2021-07-30 14:59 ` [PATCH v9 21/26] x86/fpu/amx: Enable the AMX feature in 64-bit mode Chang S. Bae
@ 2021-07-30 14:59 ` Chang S. Bae
  2021-07-30 14:59 ` [PATCH v9 23/26] selftest/x86/amx: Test cases for the AMX state management Chang S. Bae
                   ` (3 subsequent siblings)
  25 siblings, 0 replies; 91+ messages in thread
From: Chang S. Bae @ 2021-07-30 14:59 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae

By default, for XSTATE features in the INIT-state, XSAVE writes zeros to
the uncompressed destination buffer.

E.g., if you are not using AVX-512, you will still get a bunch of zeros on
the signal stack where live AVX-512 data would go.

For 'dynamic user state' (currently only XTILEDATA), explicitly skip this
data transfer. The result is that the user buffer for the AMX region will
not be touched by XSAVE.

[ Reading XINUSE takes about 20-30 cycles, while writing zeros costs
  roughly five times that or more, e.g., for XTILEDATA. ]
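
For illustration, a minimal userspace sketch of the same XINUSE-based
pruning (not the kernel code itself; xsave_pruned() is a made-up name,
and XGETBV1 support plus a 64-byte-aligned, CPUID-sized buffer are
assumed):

	#include <stdint.h>

	static inline uint64_t xgetbv(uint32_t idx)
	{
		uint32_t eax, edx;

		asm volatile("xgetbv" : "=a" (eax), "=d" (edx) : "c" (idx));
		return eax | ((uint64_t)edx << 32);
	}

	static inline void xsave(void *buf, uint64_t mask)
	{
		asm volatile("xsave (%%rdi)"
			     : : "D" (buf), "a" ((uint32_t)mask),
				 "d" ((uint32_t)(mask >> 32))
			     : "memory");
	}

	/* Save only the dynamic features that XINUSE reports as live: */
	static void xsave_pruned(void *buf, uint64_t requested, uint64_t dynamic)
	{
		uint64_t xinuse = xgetbv(1);	/* the XINUSE bitmap */

		/* Replace the dynamic part of the request with the in-use bits. */
		if (~xinuse & dynamic)
			requested = (requested & ~dynamic) | (xinuse & dynamic);

		xsave(buf, requested);
	}

This mirrors the mask computation in the hunk below.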

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v5:
* Mentioned the optimization trade-offs in the changelog. (Dave Hansen)
* Added code comment.

Changes from v4:
* Added as a new patch.
---
 arch/x86/include/asm/fpu/internal.h | 38 +++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 3b52cfb62ab5..04021f0b7dd7 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -337,8 +337,9 @@ static inline void os_xrstor(struct xregs_state *xstate, u64 mask)
  */
 static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
 {
+	struct fpu *fpu = &current->thread.fpu;
 	u32 lmask, hmask;
-	u64 mask;
+	u64 state_mask;
 	int err;
 
 	/*
@@ -346,21 +347,38 @@ static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
 	 * internally, e.g. PKRU. That's user space ABI and also required
 	 * to allow the signal handler to modify PKRU.
 	 */
-	mask = xfeatures_mask_uabi();
+	state_mask = xfeatures_mask_uabi();
+
+	if (!xfeatures_mask_user_dynamic)
+		goto mask_ready;
 
 	/*
 	 * Exclude dynamic user states for non-opt-in threads.
 	 */
-	if (xfeatures_mask_user_dynamic) {
-		struct fpu *fpu = &current->thread.fpu;
-
-		mask &= fpu->dynamic_state_perm ?
-			fpu->state_mask :
-			~xfeatures_mask_user_dynamic;
+	if (!fpu->dynamic_state_perm) {
+		state_mask &= ~xfeatures_mask_user_dynamic;
+	} else {
+		u64 dynamic_state_mask;
+
+		state_mask &= fpu->state_mask;
+
+		dynamic_state_mask = state_mask & xfeatures_mask_user_dynamic;
+		if (dynamic_state_mask && boot_cpu_has(X86_FEATURE_XGETBV1)) {
+			u64 dynamic_xinuse, dynamic_init;
+			u64 xinuse = xgetbv(1);
+
+			dynamic_xinuse = xinuse & dynamic_state_mask;
+			dynamic_init = ~xinuse & dynamic_state_mask;
+			if (dynamic_init) {
+				state_mask &= ~xfeatures_mask_user_dynamic;
+				state_mask |= dynamic_xinuse;
+			}
+		}
 	}
 
-	lmask = mask;
-	hmask = mask >> 32;
+mask_ready:
+	lmask = state_mask;
+	hmask = state_mask >> 32;
 
 	/*
 	 * Clear the xsave header first, so that reserved fields are
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v9 23/26] selftest/x86/amx: Test cases for the AMX state management
  2021-07-30 14:59 [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (21 preceding siblings ...)
  2021-07-30 14:59 ` [PATCH v9 22/26] x86/fpu/xstate: Skip writing zeros to signal frame for dynamic user states if in INIT-state Chang S. Bae
@ 2021-07-30 14:59 ` Chang S. Bae
  2021-07-30 14:59 ` [PATCH v9 24/26] x86/insn/amx: Add TILERELEASE instruction to the opcode map Chang S. Bae
                   ` (2 subsequent siblings)
  25 siblings, 0 replies; 91+ messages in thread
From: Chang S. Bae @ 2021-07-30 14:59 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae, linux-kselftest

This selftest verifies that the XSTATE arch_prctl works for AMX state and
that a forked task has the AMX state in the INIT-state.

In addition, this test verifies that the kernel correctly context switches
unique AMX data, when multiple threads are using AMX. The test also
verifies that ptrace() can insert data into existing threads.

Finally, add a test case to verify that unused states are excluded, by
leaving a known pattern on the signal stack and verifying that it is still
intact after taking a subsequent signal.

These test cases do not depend on AMX compiler support, as they use
XSAVE directly from userspace to access AMX state.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
---
Changes from v8:
* Adjust for the arch_prctl change.
* Ensure XTILECFG is recovered upon sigreturn.

Changes from v7:
* Adjust for SIGILL.
* Test XTILECFG for legacy signal delivery.

Changes from v6:
* Adjust for the syscall and ptrace path changes.

Changes from v5:
* Adjusted arch_prctl for the updated ABI.
* Added test for the dynamic signal xstate buffer.
* Fixed XSAVE buffer's header data.

Changes from v4:
* Added test for arch_prctl.
* Excluded tile config details to focus on testing the kernel's ability to
  manage dynamic user state.
* Removed tile instructions.
* Simplified the fork() and ptrace() test routine.
* Massaged the changelog.

Changes from v2:
* Updated the test messages and the changelog as tile data is no longer
  inherited by a child.
* Removed bytecode for the instructions already supported by binutils.
* Changed to check the XSAVE availability in a reliable way.

Changes from v1:
* Removed signal testing code
---
 tools/testing/selftests/x86/Makefile |   2 +-
 tools/testing/selftests/x86/amx.c    | 968 +++++++++++++++++++++++++++
 2 files changed, 969 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/x86/amx.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index b4142cd1c5c2..8a1f62ab3c8e 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -18,7 +18,7 @@ TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
 			test_FCMOV test_FCOMI test_FISTTP \
 			vdso_restorer
 TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering \
-			corrupt_xstate_header
+			corrupt_xstate_header amx
 # Some selftests require 32bit support enabled also on 64bit systems
 TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall
 
diff --git a/tools/testing/selftests/x86/amx.c b/tools/testing/selftests/x86/amx.c
new file mode 100644
index 000000000000..afd8c66ca206
--- /dev/null
+++ b/tools/testing/selftests/x86/amx.c
@@ -0,0 +1,968 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <err.h>
+#include <errno.h>
+#include <elf.h>
+#include <pthread.h>
+#include <setjmp.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdbool.h>
+#include <unistd.h>
+#include <x86intrin.h>
+
+#include <linux/futex.h>
+
+#include <sys/ptrace.h>
+#include <sys/shm.h>
+#include <sys/syscall.h>
+#include <sys/wait.h>
+#include <sys/uio.h>
+
+#ifndef __x86_64__
+# error This test is 64-bit only
+#endif
+
+static inline uint64_t xgetbv(uint32_t index)
+{
+	uint32_t eax, edx;
+
+	asm volatile("xgetbv;"
+		     : "=a" (eax), "=d" (edx)
+		     : "c" (index));
+	return eax + ((uint64_t)edx << 32);
+}
+
+static inline void cpuid(uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx)
+{
+	asm volatile("cpuid;"
+		     : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
+		     : "0" (*eax), "2" (*ecx));
+}
+
+static inline void xsave(void *xbuf, uint32_t lo, uint32_t hi)
+{
+	asm volatile("xsave (%%rdi)"
+		     : : "D" (xbuf), "a" (lo), "d" (hi)
+		     : "memory");
+}
+
+static inline void xrstor(void *xbuf, uint32_t lo, uint32_t hi)
+{
+	asm volatile("xrstor (%%rdi)"
+		     : : "D" (xbuf), "a" (lo), "d" (hi));
+}
+
+static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *),
+		       int flags)
+{
+	struct sigaction sa;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.sa_sigaction = handler;
+	sa.sa_flags = SA_SIGINFO | flags;
+	sigemptyset(&sa.sa_mask);
+	if (sigaction(sig, &sa, 0))
+		err(1, "sigaction");
+}
+
+static void clearhandler(int sig)
+{
+	struct sigaction sa;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.sa_handler = SIG_DFL;
+	sigemptyset(&sa.sa_mask);
+	if (sigaction(sig, &sa, 0))
+		err(1, "sigaction");
+}
+
+static jmp_buf jmpbuf;
+
+/* Hardware info check: */
+
+static bool noxsave;
+
+static void handle_noxsave(int sig, siginfo_t *si, void *ctx_void)
+{
+	noxsave = true;
+	siglongjmp(jmpbuf, 1);
+}
+
+#define XFEATURE_XTILECFG	17
+#define XFEATURE_XTILEDATA	18
+#define XFEATURE_MASK_XTILECFG	(1 << XFEATURE_XTILECFG)
+#define XFEATURE_MASK_XTILEDATA	(1 << XFEATURE_XTILEDATA)
+#define XFEATURE_MASK_XTILE	(XFEATURE_MASK_XTILECFG | XFEATURE_MASK_XTILEDATA)
+
+static inline bool check_xtile(void)
+{
+	bool xtile_enable;
+
+	sethandler(SIGILL, handle_noxsave, 0);
+
+	if ((!sigsetjmp(jmpbuf, 1)) && (xgetbv(0) & XFEATURE_MASK_XTILE)) {
+		xtile_enable = true;
+		goto out;
+	}
+	xtile_enable = false;
+out:
+	clearhandler(SIGILL);
+	return xtile_enable;
+}
+
+static uint32_t xsave_size;
+static uint32_t xsave_xtiledata_offset, xsave_xtilecfg_offset;
+static uint32_t xtiledata_size, xtilecfg_size;
+
+static struct _tile_spec {
+	uint16_t bytes_per_row;
+	uint16_t max_names;
+	uint16_t max_rows;
+} tile_spec;
+
+#define XSTATE_CPUID			0xd
+#define XSTATE_USER_STATE_SUBLEAVE	0x0
+#define TILE_CPUID			0x1d
+#define TILE_PALETTE_ID			0x1
+
+static void check_cpuid(void)
+{
+	uint32_t eax, ebx, ecx, edx;
+
+	eax = XSTATE_CPUID;
+	ecx = XSTATE_USER_STATE_SUBLEAVE;
+
+	cpuid(&eax, &ebx, &ecx, &edx);
+	if (!ebx)
+		err(1, "xstate cpuid: xsave size");
+
+	xsave_size = ebx;
+
+	eax = XSTATE_CPUID;
+	ecx = XFEATURE_XTILECFG;
+
+	cpuid(&eax, &ebx, &ecx, &edx);
+	if (!eax || !ebx)
+		err(1, "xstate cpuid: tile config state");
+
+	xtilecfg_size = eax;
+	xsave_xtilecfg_offset = ebx;
+
+	eax = XSTATE_CPUID;
+	ecx = XFEATURE_XTILEDATA;
+
+	cpuid(&eax, &ebx, &ecx, &edx);
+	if (!eax || !ebx)
+		err(1, "xstate cpuid: tile data state");
+
+	xtiledata_size = eax;
+	xsave_xtiledata_offset = ebx;
+
+	eax = TILE_CPUID;
+	ecx = TILE_PALETTE_ID;
+
+	cpuid(&eax, &ebx, &ecx, &edx);
+	if (!eax || !ebx || !ecx)
+		err(1, "tile cpuid: palette 1");
+
+	tile_spec.max_names = ebx >> 16;
+	tile_spec.bytes_per_row = ebx;
+	tile_spec.max_rows = ecx;
+}
+
+/* The helpers for managing XSAVE buffer and tile states: */
+
+void *alloc_xsave_buffer(void)
+{
+	void *xbuf;
+
+	/* XSAVE buffer should be 64B-aligned. */
+	xbuf = aligned_alloc(64, xsave_size);
+	if (!xbuf)
+		err(1, "aligned_alloc()");
+	return xbuf;
+}
+
+#define XSAVE_HDR_OFFSET	512
+#define XSAVE_HDR_SIZE		64
+
+static inline void clear_xstate_header(void *buffer)
+{
+	memset(buffer + XSAVE_HDR_OFFSET, 0, XSAVE_HDR_SIZE);
+}
+
+static inline uint64_t get_xstatebv(void *buffer)
+{
+	return *(uint64_t *)(buffer + XSAVE_HDR_OFFSET);
+}
+
+static inline void set_xstatebv(void *buffer, uint64_t bv)
+{
+	*(uint64_t *)(buffer + XSAVE_HDR_OFFSET) = bv;
+}
+
+static void set_rand_tiledata(void *tiledata)
+{
+	int *ptr = tiledata;
+	int data = rand();
+	int i;
+
+	for (i = 0; i < xtiledata_size / sizeof(int); i++, ptr++)
+		*ptr = data;
+}
+
+#define	MAX_TILES		16
+#define RESERVED_BYTES		14
+
+struct tile_config {
+	uint8_t  palette_id;
+	uint8_t  start_row;
+	uint8_t  reserved[RESERVED_BYTES];
+	uint16_t colsb[MAX_TILES];
+	uint8_t  rows[MAX_TILES];
+};
+
+static void set_tilecfg(void *tilecfg)
+{
+	struct tile_config *cfg = tilecfg;
+	int i;
+
+	memset(cfg, 0, sizeof(*cfg));
+	cfg->palette_id = TILE_PALETTE_ID;
+	for (i = 0; i < tile_spec.max_names; i++) {
+		cfg->colsb[i] = tile_spec.bytes_per_row;
+		cfg->rows[i] = tile_spec.max_rows;
+	}
+}
+
+static void *xsave_buffer, *tiledata, *tilecfg;
+static int nerrs, errs;
+
+/* See 'struct _fpx_sw_bytes' at sigcontext.h */
+#define SW_BYTES_OFFSET		464
+/* N.B. The struct's field name varies so read from the offset. */
+#define SW_BYTES_BV_OFFSET	(SW_BYTES_OFFSET + 8)
+
+static inline struct _fpx_sw_bytes *get_fpx_sw_bytes(void *buffer)
+{
+	return (struct _fpx_sw_bytes *)(buffer + SW_BYTES_OFFSET);
+}
+
+static inline uint64_t get_fpx_sw_bytes_xstatebv(void *buffer)
+{
+	return *(uint64_t *)(buffer + SW_BYTES_BV_OFFSET);
+}
+
+static volatile bool noperm;
+static bool check_tilecfg_sigframe;
+
+static void handle_noperm(int sig, siginfo_t *si, void *ctx_void)
+{
+	ucontext_t *ctx = (ucontext_t *)ctx_void;
+	void *xbuf = ctx->uc_mcontext.fpregs;
+	struct _fpx_sw_bytes *sw_bytes;
+
+	printf("\tAt SIGILL handler,\n");
+
+	if (si->si_code != ILL_ILLOPC) {
+		errs++;
+		printf("[FAIL]\tInvalid signal code (%x).\n", si->si_code);
+	} else {
+		printf("[OK]\tValid signal code (ILL_ILLOPC).\n");
+	}
+
+	sw_bytes = get_fpx_sw_bytes(xbuf);
+	if (!(sw_bytes->xstate_size < xsave_xtiledata_offset) &&
+	    !(get_fpx_sw_bytes_xstatebv(xbuf) & XFEATURE_MASK_XTILEDATA)) {
+		printf("[OK]\tValid xstate size and mask in the SW data of xstate buffer.\n");
+	} else {
+		errs++;
+		printf("[FAIL]\tInvalid xstate size and/or mask in the SW data of xstate buf.\n");
+	}
+
+	if (check_tilecfg_sigframe) {
+		if (memcmp(tilecfg, xbuf + xsave_xtilecfg_offset, xtilecfg_size)) {
+			errs++;
+			printf("[FAIL]\tTILECFG is corrupted.\n");
+		} else {
+			printf("[OK]\tTILECFG is successfully delivered.\n");
+		}
+	}
+
+	noperm = true;
+	ctx->uc_mcontext.gregs[REG_RIP] += 3; /* Skip the faulting XRSTOR */
+}
+
+/* Return true if XRSTOR is successful; otherwise, false.  */
+static inline bool xrstor_safe(void *buffer, uint32_t lo, uint32_t hi)
+{
+	noperm = false;
+	xrstor(buffer, lo, hi);
+	return !noperm;
+}
+
+/* arch_prctl test */
+
+#define ARCH_SET_STATE_ENABLE	0x1021
+#define ARCH_GET_STATE_ENABLE	0x1022
+
+static void enable_tiledata(void)
+{
+	unsigned long bitmask;
+	long rc;
+
+	rc = syscall(SYS_arch_prctl, ARCH_SET_STATE_ENABLE, XFEATURE_MASK_XTILEDATA);
+	if (rc)
+		goto fail;
+
+	rc = syscall(SYS_arch_prctl, ARCH_GET_STATE_ENABLE, &bitmask);
+	if (rc)
+		err(1, "ARCH_GET_STATE_ENABLE");
+	else if (bitmask & XFEATURE_MASK_XTILEDATA)
+		return;
+
+fail:
+	err(1, "ARCH_SET_STATE_ENABLE");
+}
+
+#define TEST_EXECV_ARG		"nested"
+
+static void test_arch_prctl(int argc, char **argv)
+{
+	pid_t parent, child, grandchild;
+
+	parent = fork();
+	if (parent < 0) {
+		err(1, "fork");
+	} else if (parent > 0) {
+		int status;
+
+		wait(&status);
+		if (!WIFEXITED(status) || WEXITSTATUS(status))
+			err(1, "arch_prctl test parent exit");
+		return;
+	}
+
+	printf("[RUN]\tCheck ARCH_SET_STATE_ENABLE around process fork().\n");
+
+	printf("\tFork a child.\n");
+	child = fork();
+	if (child < 0) {
+		err(1, "fork");
+	} else if (child > 0) {
+		int status;
+
+		enable_tiledata();
+		printf("\tDo ARCH_SET_STATE_ENABLE at parent\n");
+
+		wait(&status);
+		if (!WIFEXITED(status) || WEXITSTATUS(status))
+			err(1, "arch_prctl test child exit");
+		_exit(0);
+	}
+
+	clear_xstate_header(xsave_buffer);
+
+	/* By default, use of XTILECFG is permitted. */
+	set_xstatebv(xsave_buffer, XFEATURE_MASK_XTILECFG);
+	set_tilecfg(xsave_buffer + xsave_xtilecfg_offset);
+	xrstor(xsave_buffer, -1, -1);
+	memcpy(tilecfg, xsave_buffer + xsave_xtilecfg_offset, xtilecfg_size);
+
+	set_xstatebv(xsave_buffer, XFEATURE_MASK_XTILEDATA);
+	set_rand_tiledata(xsave_buffer + xsave_xtiledata_offset);
+
+	printf("\tLoad tile data without ARCH_SET_STATE_ENABLE at child.\n");
+	/*
+	 * Test XTILECFG state delivery via signal, when XTILEDATA is not
+	 * permitted.
+	 */
+	check_tilecfg_sigframe = true;
+	if (xrstor_safe(xsave_buffer, -1, -1)) {
+		nerrs++;
+		printf("[FAIL]\tSucceeded at child.\n");
+	} else {
+		printf("[OK]\tBlocked at child.\n");
+
+		/* Assure XTILECFG state recovery at sigreturn. */
+		printf("\tReturn from signal handler,\n");
+		xsave(xsave_buffer, XFEATURE_MASK_XTILECFG, 0);
+		if (memcmp(tilecfg, xsave_buffer + xsave_xtilecfg_offset, xtilecfg_size)) {
+			nerrs++;
+			printf("[FAIL]\tTilecfg is not restored.\n");
+		} else {
+			printf("[OK]\tTilecfg is restored.\n");
+		}
+	}
+
+	printf("\tDo ARCH_SET_STATE_ENABLE at child.\n");
+	enable_tiledata();
+
+	printf("\tLoad tile data with ARCH_SET_STATE_ENABLE at child:\n");
+	check_tilecfg_sigframe = false;
+	if (xrstor_safe(xsave_buffer, -1, -1)) {
+		printf("[OK]\tSucceeded at child.\n");
+	} else {
+		nerrs++;
+		printf("[FAIL]\tBlocked at child.\n");
+	}
+
+	printf("\tFork a grandchild.\n");
+	grandchild = fork();
+	if (grandchild < 0) {
+		err(1, "fork");
+	} else if (!grandchild) {
+		char *args[] = {argv[0], TEST_EXECV_ARG, NULL};
+
+		if (xrstor_safe(xsave_buffer, -1, -1)) {
+			printf("[OK]\tSucceeded at grandchild.\n");
+		} else {
+			nerrs++;
+			printf("[FAIL]\tBlocked at grandchild.\n");
+		}
+		/* execv() returns only on failure: */
+		if (execv(args[0], args))
+			err(1, "execv()");
+	} else {
+		int status;
+
+		wait(&status);
+		if (!WIFEXITED(status) || WEXITSTATUS(status))
+			err(1, "fork test grandchild");
+	}
+	_exit(0);
+}
+
+/* Testing tile data inheritance */
+
+static void test_fork(void)
+{
+	pid_t child, grandchild;
+
+	child = fork();
+	if (child < 0) {
+		err(1, "fork");
+	} else if (child > 0) {
+		int status;
+
+		wait(&status);
+		if (!WIFEXITED(status) || WEXITSTATUS(status))
+			err(1, "fork test child");
+		return;
+	}
+
+	printf("[RUN]\tCheck tile data inheritance.\n\tBefore fork(), load tile data -- yes:\n");
+
+	clear_xstate_header(xsave_buffer);
+	set_xstatebv(xsave_buffer, XFEATURE_MASK_XTILE);
+	set_tilecfg(xsave_buffer + xsave_xtilecfg_offset);
+	set_rand_tiledata(xsave_buffer + xsave_xtiledata_offset);
+	xrstor_safe(xsave_buffer, -1, -1);
+
+	grandchild = fork();
+	if (grandchild < 0) {
+		err(1, "fork");
+	} else if (grandchild > 0) {
+		int status;
+
+		wait(&status);
+		if (!WIFEXITED(status) || WEXITSTATUS(status))
+			err(1, "fork test grand child");
+		_exit(0);
+	}
+
+	if (xgetbv(1) & XFEATURE_MASK_XTILE) {
+		nerrs++;
+		printf("[FAIL]\tIn a child, AMX state is not initialized.\n");
+	} else {
+		printf("[OK]\tIn a child, AMX state is initialized.\n");
+	}
+	_exit(0);
+}
+
+/* Context switching test */
+
+#define ITERATIONS	10
+#define NUM_THREADS	5
+
+struct futex_info {
+	int current;
+	int *futex;
+	int next;
+};
+
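+/*
+ * The shared futex carries a token: value 2*id hands thread 'id' one
+ * iteration of work (id 0 is the main thread), and value 2*id+1 tells
+ * thread 'id' to finish. Each thread wakes its ring neighbor when it
+ * is done, so exactly one context runs the check at a time.
+ */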
+static inline void command_wait(struct futex_info *info, int value)
+{
+	do {
+		sched_yield();
+	} while (syscall(SYS_futex, info->futex, FUTEX_WAIT, value, 0, 0, 0));
+}
+
+static inline void command_wake(struct futex_info *info, int value)
+{
+	do {
+		*info->futex = value;
+		while (!syscall(SYS_futex, info->futex, FUTEX_WAKE, 1, 0, 0, 0))
+			sched_yield();
+	} while (0);
+}
+
+static inline int get_iterative_value(int id)
+{
+	return ((id << 1) & ~0x1);
+}
+
+static inline int get_endpoint_value(int id)
+{
+	return ((id << 1) | 0x1);
+}
+
+static void *check_tiledata(void *info)
+{
+	struct futex_info *finfo = (struct futex_info *)info;
+	void *xbuf, *tdata;
+	int i;
+
+	xbuf = alloc_xsave_buffer();
+	tdata = malloc(xtiledata_size);
+	if (!tdata)
+		err(1, "malloc()");
+
+	set_xstatebv(xbuf, XFEATURE_MASK_XTILEDATA);
+	set_rand_tiledata(xbuf + xsave_xtiledata_offset);
+	xrstor_safe(xbuf, -1, -1);
+	memcpy(tdata, xbuf + xsave_xtiledata_offset, xtiledata_size);
+
+	for (i = 0; i < ITERATIONS; i++) {
+		command_wait(finfo, get_iterative_value(finfo->current));
+
+		xsave(xbuf, XFEATURE_MASK_XTILEDATA, 0);
+		if (memcmp(tdata, xbuf + xsave_xtiledata_offset, xtiledata_size))
+			errs++;
+
+		set_rand_tiledata(xbuf + xsave_xtiledata_offset);
+		xrstor_safe(xbuf, -1, -1);
+		memcpy(tdata, xbuf + xsave_xtiledata_offset, xtiledata_size);
+
+		command_wake(finfo, get_iterative_value(finfo->next));
+	}
+
+	command_wait(finfo, get_endpoint_value(finfo->current));
+
+	free(xbuf);
+	free(tdata);
+	return NULL;
+}
+
+static int create_threads(int num, struct futex_info *finfo)
+{
+	const int shm_id = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0666);
+	int *futex = shmat(shm_id, NULL, 0);
+	pthread_t thread;
+	int i;
+
+	for (i = 0; i < num; i++) {
+		finfo[i].futex = futex;
+		finfo[i].current = i + 1;
+		finfo[i].next = (i + 2) % (num + 1);
+
+		if (pthread_create(&thread, NULL, check_tiledata, &finfo[i]))
+			err(1, "pthread_create()");
+	}
+	return 0;
+}
+
+static void test_context_switch(void)
+{
+	struct futex_info *finfo;
+	int i;
+
+	printf("[RUN]\tCheck tile data context switches.\n");
+	printf("\t# of context switches -- %u, # of threads -- %d:\n",
+	       ITERATIONS * NUM_THREADS, NUM_THREADS);
+
+	errs = 0;
+
+	finfo = malloc(sizeof(*finfo) * NUM_THREADS);
+	if (!finfo)
+		err(1, "malloc()");
+
+	create_threads(NUM_THREADS, finfo);
+
+	for (i = 0; i < ITERATIONS; i++) {
+		command_wake(finfo, get_iterative_value(1));
+		command_wait(finfo, get_iterative_value(0));
+	}
+
+	for (i = 1; i <= NUM_THREADS; i++)
+		command_wake(finfo, get_endpoint_value(i));
+
+	if (errs) {
+		nerrs += errs;
+		printf("[FAIL]\tIncorrect cases were found -- (%d / %u).\n",
+		       errs, ITERATIONS * NUM_THREADS);
+	} else {
+		printf("[OK]\tNo incorrect case was found.\n");
+	}
+
+	free(finfo);
+}
+
+/* Ptrace test */
+
+static bool ptracee_state_perm;
+
+static int inject_tiledata(pid_t target)
+{
+	struct iovec iov;
+
+	iov.iov_base = xsave_buffer;
+	iov.iov_len = xsave_size;
+
+	clear_xstate_header(xsave_buffer);
+	set_xstatebv(xsave_buffer, XFEATURE_MASK_XTILEDATA);
+	set_rand_tiledata(xsave_buffer + xsave_xtiledata_offset);
+	memcpy(tiledata, xsave_buffer + xsave_xtiledata_offset, xtiledata_size);
+
+	if (ptrace(PTRACE_SETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov)) {
+		if (errno != EFAULT)
+			err(1, "PTRACE_SETREGSET");
+		else
+			return errno;
+	}
+
+	if (ptrace(PTRACE_GETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov))
+		err(1, "PTRACE_GETREGSET");
+
+	if (!memcmp(tiledata, xsave_buffer + xsave_xtiledata_offset, xtiledata_size))
+		return 0;
+	else
+		return -1;
+}
+
+static void test_tile_write(void)
+{
+	int status, rc;
+	pid_t child;
+	bool pass;
+
+	child = fork();
+	if (child < 0) {
+		err(1, "fork");
+	} else if (!child) {
+		if (ptracee_state_perm)
+			enable_tiledata();
+
+		if (ptrace(PTRACE_TRACEME, 0, NULL, NULL))
+			err(1, "PTRACE_TRACEME");
+
+		raise(SIGTRAP);
+		_exit(0);
+	}
+
+	do {
+		wait(&status);
+	} while (WSTOPSIG(status) != SIGTRAP);
+
+	printf("\tInject tile data %s ARCH_SET_STATE_ENABLE\n",
+	       ptracee_state_perm ? "with" : "without");
+
+	rc = inject_tiledata(child);
+	pass = (rc == EFAULT && !ptracee_state_perm) ||
+	       (!rc && ptracee_state_perm);
+	if (!pass)
+		nerrs++;
+	printf("[%s]\tTile data was %swritten on ptracee.\n",
+	       pass ? "OK" : "FAIL", rc ? "not " : "");
+
+	ptrace(PTRACE_DETACH, child, NULL, NULL);
+	wait(&status);
+	if (!WIFEXITED(status) || WEXITSTATUS(status))
+		err(1, "ptrace test");
+}
+
+static void test_ptrace(void)
+{
+	printf("[RUN]\tCheck ptrace() to inject tile data.\n");
+
+	ptracee_state_perm = false;
+	test_tile_write();
+
+	ptracee_state_perm = true;
+	test_tile_write();
+}
+
+/* Signal handling test */
+
+static bool init_tiledata, load_tiledata;
+static volatile bool signaled, sigstk_prefill;
+
+#define SIGFRAME_TILEDATA_SIGNATURE	0xEE
+
+static void handle_sigstk_prefill(int sig, siginfo_t *info, void *ctx_void)
+{
+	void *xbuf = ((ucontext_t *)ctx_void)->uc_mcontext.fpregs;
+	struct _fpx_sw_bytes *sw_bytes = get_fpx_sw_bytes(xbuf);
+
+	if (sw_bytes->xstate_size >= (xsave_xtiledata_offset + xtiledata_size)) {
+		memset(xbuf + xsave_xtiledata_offset, SIGFRAME_TILEDATA_SIGNATURE,
+		       xtiledata_size);
+	}
+
+	sigstk_prefill = true;
+}
+
+static void handle_signal(int sig, siginfo_t *info, void *ctx_void)
+{
+	bool tiledata_area, tiledata_bit, tiledata_inuse;
+	void *xbuf = ((ucontext_t *)ctx_void)->uc_mcontext.fpregs;
+	struct _fpx_sw_bytes *sw_bytes = get_fpx_sw_bytes(xbuf);
+	char d = SIGFRAME_TILEDATA_SIGNATURE;
+	int i;
+
+	printf("\tAt signal delivery,\n");
+
+	/* Check SW reserved data in the buffer: */
+	if ((sw_bytes->xstate_size >= (xsave_xtiledata_offset + xtiledata_size)) &&
+	    (get_fpx_sw_bytes_xstatebv(xbuf) & XFEATURE_MASK_XTILEDATA)) {
+		printf("[OK]\tValid xstate size and mask in the SW data of xstate buffer\n");
+	} else {
+		errs++;
+		printf("[FAIL]\tInvalid xstate size and/or mask in the SW data of xstate buffer\n");
+	}
+
+	/* Check XSAVE buffer header: */
+	tiledata_inuse = (load_tiledata && !init_tiledata);
+	tiledata_bit = get_xstatebv(xbuf) & XFEATURE_MASK_XTILEDATA;
+
+	if (tiledata_bit == tiledata_inuse) {
+		printf("[OK]\tTiledata bit is %sset in XSTATE_BV of xstate buffer.\n",
+		       tiledata_bit ? "" : "not ");
+	} else {
+		errs++;
+		printf("[FAIL]\tTiledata bit is %sset in XSTATE_BV of xstate buffer.\n",
+		       tiledata_bit ? "" : "not ");
+	}
+
+	/*
+	 * Check the sigframe data:
+	 */
+
+	tiledata_inuse = (load_tiledata && !init_tiledata);
+	tiledata_area = false;
+	if (sw_bytes->xstate_size >= (xsave_xtiledata_offset + xtiledata_size)) {
+		for (i = 0; i < xtiledata_size; i++) {
+			if (memcmp(xbuf + xsave_xtiledata_offset + i, &d, 1)) {
+				tiledata_area = true;
+				break;
+			}
+		}
+	}
+
+	if (tiledata_area == tiledata_inuse) {
+		printf("[OK]\tTiledata is %ssaved in signal buffer.\n",
+		       tiledata_area ? "" : "not ");
+	} else {
+		errs++;
+		printf("[FAIL]\tTiledata is %ssaved in signal buffer.\n",
+		       tiledata_area ? "" : "not ");
+	}
+
+	/* Load random tiledata to test sigreturn: */
+	clear_xstate_header(xsave_buffer);
+	set_xstatebv(xsave_buffer, XFEATURE_MASK_XTILEDATA);
+	set_rand_tiledata(xsave_buffer + xsave_xtiledata_offset);
+	xrstor_safe(xsave_buffer, -1, -1);
+	signaled = true;
+}
+
+static void test_signal_handling(void)
+{
+	pid_t child;
+
+	signaled = false;
+	sigstk_prefill = false;
+
+	child = fork();
+	if (child < 0) {
+		err(1, "fork");
+	} else if (child > 0) {
+		do {
+			int status;
+
+			wait(&status);
+			if (WIFSTOPPED(status))
+				kill(child, SIGCONT);
+			else if (WIFEXITED(status) && !WEXITSTATUS(status))
+				break;
+			else
+				err(1, "signal test child");
+		} while (1);
+		return;
+	}
+
+	printf("\tBefore signal, load tile data -- %s", load_tiledata ? "yes, " : "no:\n");
+	if (load_tiledata)
+		printf("re-initialized -- %s:\n", init_tiledata ? "yes" : "no");
+
+	/*
+	 * Raise SIGUSR1 to pre-fill sig stack. Also, load tiledata to size the pre-fill.
+	 */
+
+	if (load_tiledata) {
+		clear_xstate_header(xsave_buffer);
+		set_xstatebv(xsave_buffer, XFEATURE_MASK_XTILEDATA);
+		xrstor_safe(xsave_buffer, -1, -1);
+	}
+
+	raise(SIGUSR1);
+	if (!sigstk_prefill)
+		err(1, "SIGUSR1");
+
+	/*
+	 * Raise SIGALRM to test AMX state handling in signal delivery. Set up the state and
+	 * data before the test.
+	 */
+
+	if (load_tiledata) {
+		clear_xstate_header(xsave_buffer);
+		set_xstatebv(xsave_buffer, XFEATURE_MASK_XTILEDATA);
+		set_rand_tiledata(xsave_buffer + xsave_xtiledata_offset);
+		xrstor_safe(xsave_buffer, -1, -1);
+
+		if (init_tiledata) {
+			clear_xstate_header(xsave_buffer);
+			set_xstatebv(xsave_buffer, 0);
+			xrstor_safe(xsave_buffer, -1, -1);
+			memset(tiledata, 0, xtiledata_size);
+		} else {
+			memcpy(tiledata, xsave_buffer + xsave_xtiledata_offset,
+			       xtiledata_size);
+		}
+	} else {
+		memset(tiledata, 0, xtiledata_size);
+	}
+
+	raise(SIGALRM);
+	if (!signaled)
+		err(1, "SIGALRM");
+
+	printf("\tReturn from signal handler,\n");
+	xsave(xsave_buffer, XFEATURE_MASK_XTILEDATA, 0);
+	if (memcmp(tiledata, xsave_buffer + xsave_xtiledata_offset, xtiledata_size)) {
+		errs++;
+		printf("[FAIL]\tTiledata is not restored.\n");
+	} else {
+		printf("[OK]\tTiledata is restored.\n");
+	}
+
+	if (errs)
+		nerrs++;
+	_exit(0);
+}
+
+static void test_signal(void)
+{
+	printf("[RUN]\tCheck tile data state in signal path:\n");
+
+	sethandler(SIGALRM, handle_signal, 0);
+	sethandler(SIGUSR1, handle_sigstk_prefill, 0);
+
+	load_tiledata = false;
+	init_tiledata = false;
+	errs = 0;
+	test_signal_handling();
+
+	load_tiledata = true;
+	init_tiledata = false;
+	errs = 0;
+	test_signal_handling();
+
+	load_tiledata = true;
+	init_tiledata = true;
+	errs = 0;
+	test_signal_handling();
+
+	clearhandler(SIGALRM);
+	clearhandler(SIGUSR1);
+}
+
+int main(int argc, char **argv)
+{
+	cpu_set_t cpuset;
+
+	if (argc == 2) {
+		int ret;
+
+		if (strcmp(argv[1], TEST_EXECV_ARG))
+			return 0;
+
+		printf("\tRun after execv().\n");
+
+		xsave_buffer = alloc_xsave_buffer();
+		clear_xstate_header(xsave_buffer);
+
+		set_xstatebv(xsave_buffer, XFEATURE_MASK_XTILE);
+		set_rand_tiledata(xsave_buffer + xsave_xtiledata_offset);
+
+		sethandler(SIGILL, handle_noperm, 0);
+
+		if (xrstor_safe(xsave_buffer, -1, -1)) {
+			printf("[FAIL]\tSucceeded after execv().\n");
+			ret = 1;
+		} else {
+			printf("[OK]\tBlocked after execv().\n");
+			ret = 0;
+		}
+
+		clearhandler(SIGILL);
+		free(xsave_buffer);
+		_exit(ret);
+	}
+
+	/* Check hardware availability at first */
+
+	if (!check_xtile()) {
+		printf("%s is disabled.\n", noxsave ? "XSAVE" : "AMX");
+		return 0;
+	}
+
+	check_cpuid();
+
+	xsave_buffer = alloc_xsave_buffer();
+	clear_xstate_header(xsave_buffer);
+
+	tiledata = malloc(xtiledata_size);
+	if (!tiledata)
+		err(1, "malloc()");
+
+	tilecfg = malloc(xtilecfg_size);
+	if (!tilecfg)
+		err(1, "malloc()");
+	set_tilecfg(tilecfg);
+
+	nerrs = 0;
+
+	sethandler(SIGILL, handle_noperm, 0);
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(0, &cpuset);
+
+	if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0)
+		err(1, "sched_setaffinity to CPU 0");
+
+	test_arch_prctl(argc, argv);
+	test_ptrace();
+
+	enable_tiledata();
+	test_context_switch();
+	test_fork();
+	test_signal();
+
+	clearhandler(SIGILL);
+
+	free(tilecfg);
+	free(tiledata);
+	free(xsave_buffer);
+	return nerrs ? 1 : 0;
+}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v9 24/26] x86/insn/amx: Add TILERELEASE instruction to the opcode map
  2021-07-30 14:59 [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (22 preceding siblings ...)
  2021-07-30 14:59 ` [PATCH v9 23/26] selftest/x86/amx: Test cases for the AMX state management Chang S. Bae
@ 2021-07-30 14:59 ` Chang S. Bae
  2021-07-30 14:59 ` [PATCH v9 25/26] intel_idle/amx: Add SPR support with XTILEDATA capability Chang S. Bae
  2021-07-30 14:59 ` [PATCH v9 26/26] x86/fpu/xstate: Add a sanity check for XFD state when saving XSTATE Chang S. Bae
  25 siblings, 0 replies; 91+ messages in thread
From: Chang S. Bae @ 2021-07-30 14:59 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae

Add the opcode for TILERELEASE, which returns all AMX state to the
INIT-state.
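
For reference, a sketch of how the kernel can emit this instruction as
raw bytes until assembler support is widespread; the byte sequence
matches the one used later in this series, and the decoding comments are
annotations of the VEX.128.NP.0F38.W0 49 C0 form, not part of the patch:

	static inline void tile_release(void)
	{
		/*
		 * TILERELEASE = VEX.128.NP.0F38.W0 49 C0:
		 *   0xc4  three-byte VEX escape
		 *   0xe2  R/X/B clear (inverted), opcode map 0F38
		 *   0x78  W=0, vvvv=0000 (inverted), L=128, pp=none
		 *   0x49  opcode byte
		 *   0xc0  ModRM, register form (mod=11B)
		 */
		asm volatile(".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0");
	}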

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v4:
* Added as a new patch as preparatory to use the instruction in the kernel.
---
 arch/x86/lib/x86-opcode-map.txt       | 8 +++++++-
 tools/arch/x86/lib/x86-opcode-map.txt | 8 +++++++-
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
index ec31f5b60323..dbc5078ccafe 100644
--- a/arch/x86/lib/x86-opcode-map.txt
+++ b/arch/x86/lib/x86-opcode-map.txt
@@ -690,7 +690,9 @@ AVXcode: 2
 45: vpsrlvd/q Vx,Hx,Wx (66),(v)
 46: vpsravd Vx,Hx,Wx (66),(v) | vpsravd/q Vx,Hx,Wx (66),(evo)
 47: vpsllvd/q Vx,Hx,Wx (66),(v)
-# Skip 0x48-0x4b
+# Skip 0x48
+49: Grp22 (1A)
+# Skip 0x4a-0x4b
 4c: vrcp14ps/d Vpd,Wpd (66),(ev)
 4d: vrcp14ss/d Vsd,Hpd,Wsd (66),(ev)
 4e: vrsqrt14ps/d Vpd,Wpd (66),(ev)
@@ -1082,6 +1084,10 @@ GrpTable: Grp21
 7: ENDBR64 (F3),(010),(11B) | ENDBR32 (F3),(011),(11B)
 EndTable
 
+GrpTable: Grp22
+0: TILERELEASE (!F3),(v1),(11B)
+EndTable
+
 # AMD's Prefetch Group
 GrpTable: GrpP
 0: PREFETCH
diff --git a/tools/arch/x86/lib/x86-opcode-map.txt b/tools/arch/x86/lib/x86-opcode-map.txt
index ec31f5b60323..dbc5078ccafe 100644
--- a/tools/arch/x86/lib/x86-opcode-map.txt
+++ b/tools/arch/x86/lib/x86-opcode-map.txt
@@ -690,7 +690,9 @@ AVXcode: 2
 45: vpsrlvd/q Vx,Hx,Wx (66),(v)
 46: vpsravd Vx,Hx,Wx (66),(v) | vpsravd/q Vx,Hx,Wx (66),(evo)
 47: vpsllvd/q Vx,Hx,Wx (66),(v)
-# Skip 0x48-0x4b
+# Skip 0x48
+49: Grp22 (1A)
+# Skip 0x4a-0x4b
 4c: vrcp14ps/d Vpd,Wpd (66),(ev)
 4d: vrcp14ss/d Vsd,Hpd,Wsd (66),(ev)
 4e: vrsqrt14ps/d Vpd,Wpd (66),(ev)
@@ -1082,6 +1084,10 @@ GrpTable: Grp21
 7: ENDBR64 (F3),(010),(11B) | ENDBR32 (F3),(011),(11B)
 EndTable
 
+GrpTable: Grp22
+0: TILERELEASE (!F3),(v1),(11B)
+EndTable
+
 # AMD's Prefetch Group
 GrpTable: GrpP
 0: PREFETCH
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v9 25/26] intel_idle/amx: Add SPR support with XTILEDATA capability
  2021-07-30 14:59 [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (23 preceding siblings ...)
  2021-07-30 14:59 ` [PATCH v9 24/26] x86/insn/amx: Add TILERELEASE instruction to the opcode map Chang S. Bae
@ 2021-07-30 14:59 ` Chang S. Bae
  2021-07-30 18:41   ` Dave Hansen
  2021-07-30 20:15   ` Dave Hansen
  2021-07-30 14:59 ` [PATCH v9 26/26] x86/fpu/xstate: Add a sanity check for XFD state when saving XSTATE Chang S. Bae
  25 siblings, 2 replies; 91+ messages in thread
From: Chang S. Bae @ 2021-07-30 14:59 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae, linux-pm

Add a custom Sapphire Rapids (SPR) C-state table to the intel_idle driver. The
parameters in this table are preferred over those supplied by ACPI.

SPR supports AMX, and so this custom table uses idle entry points that know
how to initialize AMX TMM state, if necessary.

This guarantees that AMX TMM state will never be the cause of hardware
C-state demotion from C6 to C1E. Under some conditions this may result in
improved power savings, and thus a higher available turbo frequency budget.

[ Based on patch by Artem Bityutskiy <artem.bityutskiy@linux.intel.com>. ]

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
---
Changes from v6:
* Update the changelog and function description. (Rafael J. Wysocki)

Changes from v5:
* Moved the code to intel_idle. (Peter Zijlstra)
* Fixed to deactivate fpregs. (Andy Lutomirski and Dave Hansen)
* Updated the code comment. (Dave Hansen)

Changes from v4:
* Added as a new patch. (Thomas Gleixner)
---
 arch/x86/include/asm/special_insns.h |  6 +++
 drivers/idle/intel_idle.c            | 79 ++++++++++++++++++++++++++++
 2 files changed, 85 insertions(+)

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index f3fbb84ff8a7..fada1bb82c7b 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -294,6 +294,12 @@ static inline int enqcmds(void __iomem *dst, const void *src)
 	return 0;
 }
 
+static inline void tile_release(void)
+{
+	/* Instruction opcode for TILERELEASE; supported in binutils >= 2.36. */
+	asm volatile(".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0");
+}
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_X86_SPECIAL_INSNS_H */
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index e6c543b5ee1d..fe1ba26cc797 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -54,6 +54,8 @@
 #include <asm/intel-family.h>
 #include <asm/mwait.h>
 #include <asm/msr.h>
+#include <asm/fpu/internal.h>
+#include <asm/special_insns.h>
 
 #define INTEL_IDLE_VERSION "0.5.1"
 
@@ -155,6 +157,55 @@ static __cpuidle int intel_idle_s2idle(struct cpuidle_device *dev,
 	return 0;
 }
 
+/**
+ * idle_tile - Put the TILE registers into their INIT-state
+ *
+ * Leaving dirty state in the TILE registers may prevent the processor from
+ * entering lower-power idle states. Use TILERELEASE to initialize the
+ * state. Destroying the fpregs state is safe after the fpstate update.
+ */
+static inline void idle_tile(void)
+{
+	if (boot_cpu_has(X86_FEATURE_XGETBV1) && (xgetbv(1) & XFEATURE_MASK_XTILE)) {
+		tile_release();
+		fpregs_deactivate(&current->thread.fpu);
+	}
+}
+
+/**
+ * intel_idle_tile - Ask the processor to enter the given idle state.
+ * @dev: cpuidle device of the target CPU.
+ * @drv: cpuidle driver (assumed to point to intel_idle_driver).
+ * @index: Target idle state index.
+ *
+ * Ensure the TILE registers are in the INIT-state before using
+ * intel_idle() to enter the idle state.
+ */
+static __cpuidle int intel_idle_tile(struct cpuidle_device *dev,
+				     struct cpuidle_driver *drv, int index)
+{
+	idle_tile();
+
+	return intel_idle(dev, drv, index);
+}
+
+/**
+ * intel_idle_s2idle_tile - Ask the processor to enter the given idle state.
+ * @dev: cpuidle device of the target CPU.
+ * @drv: cpuidle driver (assumed to point to intel_idle_driver).
+ * @index: Target idle state index.
+ *
+ * Ensure the TILE registers are in the INIT-state before using
+ * intel_idle_s2idle() to enter the idle state.
+ */
+static __cpuidle int intel_idle_s2idle_tile(struct cpuidle_device *dev,
+					    struct cpuidle_driver *drv, int index)
+{
+	idle_tile();
+
+	return intel_idle_s2idle(dev, drv, index);
+}
+
 /*
  * States are indexed by the cstate number,
  * which is also the index into the MWAIT hint array.
@@ -752,6 +803,27 @@ static struct cpuidle_state icx_cstates[] __initdata = {
 		.enter = NULL }
 };
 
+static struct cpuidle_state spr_cstates[] __initdata = {
+	{
+		.name = "C1",
+		.desc = "MWAIT 0x00",
+		.flags = MWAIT2flg(0x00),
+		.exit_latency = 1,
+		.target_residency = 1,
+		.enter = &intel_idle,
+		.enter_s2idle = intel_idle_s2idle, },
+	{
+		.name = "C6",
+		.desc = "MWAIT 0x20",
+		.flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED,
+		.exit_latency = 128,
+		.target_residency = 384,
+		.enter = &intel_idle_tile,
+		.enter_s2idle = intel_idle_s2idle_tile, },
+	{
+		.enter = NULL }
+};
+
 static struct cpuidle_state atom_cstates[] __initdata = {
 	{
 		.name = "C1E",
@@ -1095,6 +1167,12 @@ static const struct idle_cpu idle_cpu_icx __initconst = {
 	.use_acpi = true,
 };
 
+static const struct idle_cpu idle_cpu_spr __initconst = {
+	.state_table = spr_cstates,
+	.disable_promotion_to_c1e = true,
+	.use_acpi = true,
+};
+
 static const struct idle_cpu idle_cpu_avn __initconst = {
 	.state_table = avn_cstates,
 	.disable_promotion_to_c1e = true,
@@ -1157,6 +1235,7 @@ static const struct x86_cpu_id intel_idle_ids[] __initconst = {
 	X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_X,		&idle_cpu_skx),
 	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X,		&idle_cpu_icx),
 	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_D,		&idle_cpu_icx),
+	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X,	&idle_cpu_spr),
 	X86_MATCH_INTEL_FAM6_MODEL(XEON_PHI_KNL,	&idle_cpu_knl),
 	X86_MATCH_INTEL_FAM6_MODEL(XEON_PHI_KNM,	&idle_cpu_knl),
 	X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT,	&idle_cpu_bxt),
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v9 26/26] x86/fpu/xstate: Add a sanity check for XFD state when saving XSTATE
  2021-07-30 14:59 [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (24 preceding siblings ...)
  2021-07-30 14:59 ` [PATCH v9 25/26] intel_idle/amx: Add SPR support with XTILEDATA capability Chang S. Bae
@ 2021-07-30 14:59 ` Chang S. Bae
  25 siblings, 0 replies; 91+ messages in thread
From: Chang S. Bae @ 2021-07-30 14:59 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86
  Cc: len.brown, dave.hansen, thiago.macieira, jing2.liu,
	ravi.v.shankar, linux-kernel, chang.seok.bae

Add a DEBUG sanity check that the XFD state matches the XINUSE state.

Instead of reading MSR IA32_XFD directly, read a per-cpu value that is
recorded at every MSR write.

Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes from v5:
* Added as a new patch. (Dave Hansen)
---
 arch/x86/include/asm/fpu/internal.h | 15 +++++++++++++++
 arch/x86/kernel/fpu/core.c          | 13 +++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 04021f0b7dd7..dd845829ac15 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -570,10 +570,25 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
 
 /* The Extended Feature Disable (XFD) helpers: */
 
+#ifdef CONFIG_X86_DEBUG_FPU
+DECLARE_PER_CPU(u64, xfd_shadow);
+static inline u64 xfd_debug_shadow(void)
+{
+	return this_cpu_read(xfd_shadow);
+}
+
+static inline void xfd_write(u64 value)
+{
+	wrmsrl_safe(MSR_IA32_XFD, value);
+	this_cpu_write(xfd_shadow, value);
+}
+#else
+#define xfd_debug_shadow()	0
 static inline void xfd_write(u64 value)
 {
 	wrmsrl_safe(MSR_IA32_XFD, value);
 }
+#endif
 
 static inline u64 xfd_read(void)
 {
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 387118127f93..650c2d3cc45d 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -82,6 +82,10 @@ bool irq_fpu_usable(void)
 }
 EXPORT_SYMBOL(irq_fpu_usable);
 
+#ifdef CONFIG_X86_DEBUG_FPU
+DEFINE_PER_CPU(u64, xfd_shadow);
+#endif
+
 /*
  * Save the FPU register state in fpu->state. The register state is
  * preserved.
@@ -99,6 +103,15 @@ EXPORT_SYMBOL(irq_fpu_usable);
 void save_fpregs_to_fpstate(struct fpu *fpu)
 {
 	if (likely(use_xsave())) {
+		/*
+		 * If XFD is armed for an xfeature, XSAVE* will not save
+		 * its state. Verify XFD is clear for all features that
+		 * are in use before XSAVE*.
+		 */
+		if (IS_ENABLED(CONFIG_X86_DEBUG_FPU) && xfd_capable() &&
+		    boot_cpu_has(X86_FEATURE_XGETBV1))
+			WARN_ON_FPU(xgetbv(1) & xfd_debug_shadow());
+
 		os_xsave(&fpu->state->xsave, fpu->state_mask);
 
 		/*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 25/26] intel_idle/amx: Add SPR support with XTILEDATA capability
  2021-07-30 14:59 ` [PATCH v9 25/26] intel_idle/amx: Add SPR support with XTILEDATA capability Chang S. Bae
@ 2021-07-30 18:41   ` Dave Hansen
  2021-08-03 21:32     ` Bae, Chang Seok
  2021-07-30 20:15   ` Dave Hansen
  1 sibling, 1 reply; 91+ messages in thread
From: Dave Hansen @ 2021-07-30 18:41 UTC (permalink / raw)
  To: Chang S. Bae, bp, luto, tglx, mingo, x86
  Cc: len.brown, thiago.macieira, jing2.liu, ravi.v.shankar,
	linux-kernel, linux-pm

>  #endif /* _ASM_X86_SPECIAL_INSNS_H */
> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index e6c543b5ee1d..fe1ba26cc797 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -54,6 +54,8 @@
>  #include <asm/intel-family.h>
>  #include <asm/mwait.h>
>  #include <asm/msr.h>
> +#include <asm/fpu/internal.h>
> +#include <asm/special_insns.h>
>  
>  #define INTEL_IDLE_VERSION "0.5.1"
>  
> @@ -155,6 +157,55 @@ static __cpuidle int intel_idle_s2idle(struct cpuidle_device *dev,
>  	return 0;
>  }
>  
> +/**
> + * idle_tile - Initialize TILE registers in INIT-state
> + *
> + * Leaving state in the dirty TILE registers may prevent the processor from
> + * entering lower-power idle states. Use TILERELEASE to initialize the
> + * state. Destroying fpregs state is safe after the fpstate update.
> + */
> +static inline void idle_tile(void)
> +{
> +	if (boot_cpu_has(X86_FEATURE_XGETBV1) && (xgetbv(1) & XFEATURE_MASK_XTILE)) {
> +		tile_release();
> +		fpregs_deactivate(&current->thread.fpu);
> +	}
> +}

This isn't obviously safe code.  There's a window in there when we have
bogus, destroyed FPU register state but where we might be rescheduled.

I would assume that preempt is off *somewhere* in this, but it would be
nice to make sure of that, or at least mention the requirement for it to
be off before this code is safe.

I'm also not sure TILERELEASE is *technically* what you want here.
xgetbv(1) tells you whether a feature is being tracked by the processor
as being in its init state.  tile_release() gets a feature into its init
state, but does not guarantee that the processor will *track* it as
being in the init state.

TILERELEASE is not documented to have an effect on XINUSE (init tracking).

XRSTOR, on the other hand, is at least documented to affect XINUSE.

It sounds like we either need a documentation update, or a clear
explanation why TILERELEASE is being used over XRSTOR.
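
For concreteness, a sketch of what the XRSTOR-based variant might look
like, assuming the series' os_xrstor() helper and the existing
init_fpstate buffer (untested):

	static inline void idle_tile(void)
	{
		if (boot_cpu_has(X86_FEATURE_XGETBV1) && (xgetbv(1) & XFEATURE_MASK_XTILE)) {
			/*
			 * init_fpstate has the XTILE bits clear in its
			 * XSTATE_BV, so this XRSTOR initializes the tile
			 * registers and is documented to clear
			 * XINUSE[XTILE] as well.
			 */
			os_xrstor(&init_fpstate.xsave, XFEATURE_MASK_XTILE);
			fpregs_deactivate(&current->thread.fpu);
		}
	}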



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 25/26] intel_idle/amx: Add SPR support with XTILEDATA capability
  2021-07-30 14:59 ` [PATCH v9 25/26] intel_idle/amx: Add SPR support with XTILEDATA capability Chang S. Bae
  2021-07-30 18:41   ` Dave Hansen
@ 2021-07-30 20:15   ` Dave Hansen
  1 sibling, 0 replies; 91+ messages in thread
From: Dave Hansen @ 2021-07-30 20:15 UTC (permalink / raw)
  To: Chang S. Bae, bp, luto, tglx, mingo, x86
  Cc: len.brown, thiago.macieira, jing2.liu, ravi.v.shankar,
	linux-kernel, linux-pm

On 7/30/21 7:59 AM, Chang S. Bae wrote:
> SPR supports AMX, and so this custom table uses idle entry points that know
> how to initialize AMX TMM state, if necessary.

That's pretty direct with where this is showing up.

But, the cover letter is quite a bit more cagey:

> Intel Advanced Matrix Extensions (AMX)[1][2] will be shipping on servers
> soon.

If you do another version of these, it might be handy to make sure those
are as consistent as possible.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 25/26] intel_idle/amx: Add SPR support with XTILEDATA capability
  2021-07-30 18:41   ` Dave Hansen
@ 2021-08-03 21:32     ` Bae, Chang Seok
  2021-08-03 21:38       ` Dave Hansen
  0 siblings, 1 reply; 91+ messages in thread
From: Bae, Chang Seok @ 2021-08-03 21:32 UTC (permalink / raw)
  To: Hansen, Dave
  Cc: Borislav Petkov, Lutomirski, Andy, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Brown, Len, Macieira, Thiago, Liu,
	Jing2, Shankar, Ravi V, Linux Kernel Mailing List, linux-pm

On Jul 30, 2021, at 11:41, Hansen, Dave <dave.hansen@intel.com> wrote:
>> +/**
>> + * idle_tile - Initialize TILE registers in INIT-state
>> + *
>> + * Leaving state in the dirty TILE registers may prevent the processor from
>> + * entering lower-power idle states. Use TILERELEASE to initialize the
>> + * state. Destroying fpregs state is safe after the fpstate update.
>> + */
>> +static inline void idle_tile(void)
>> +{
>> +	if (boot_cpu_has(X86_FEATURE_XGETBV1) && (xgetbv(1) & XFEATURE_MASK_XTILE)) {
>> +		tile_release();
>> +		fpregs_deactivate(&current->thread.fpu);
>> +	}
>> +}
> 
> This isn't obviously safe code.  There's a window in there when we have
> bogus, destroyed FPU register state but where we might be rescheduled.
> 
> I would assume that preempt is off *somewhere* in this, but it would be
> nice to make sure of that, or at least mention the requirement for it to
> be off before this code is safe.

I can see preempt_disable() in this path:

$kernel/sched/idle.c::play_idle_precise()
--> preempt_disable()
...
--> do_idle()
    --> cpuidle_idle_call()
        --> call_cpuidle()
            --> $drivers/cpuidle/cpuidle.c::cpuidle_enter()
                --> cpuidle_enter_state()
                    --> target_state->enter()
                        --> $drivers/idle/intel_idle.c::intel_idle_tile()
                            --> idle_tile()
...
--> preempt_enable()

> I'm also not sure TILERELEASE is *technically* what you want here.
> xgetbv(1) tells you whether a feature is being tracked by the processor
> as being in its init state.  tile_release() gets a feature into its init
> state, but does not guarantee that the processor will *track* it as
> being in the init state.
> 
> TILERELEASE is not documented to have an effect on XINUSE (init tracking).
> 
> XRSTOR, on the other hand, is at least documented to affect XINUSE.
> 
> It sounds like we either need a documentation update, or a clear
> explanation why TILERELEASE is being used over XRSTOR.

TILERELEASE impacts INIT tracking at least on the first AMX implementation. I
agree that the documentation needs some update.

Thanks,
Chang

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 25/26] intel_idle/amx: Add SPR support with XTILEDATA capability
  2021-08-03 21:32     ` Bae, Chang Seok
@ 2021-08-03 21:38       ` Dave Hansen
  2021-08-03 21:43         ` Brown, Len
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Hansen @ 2021-08-03 21:38 UTC (permalink / raw)
  To: Bae, Chang Seok
  Cc: Borislav Petkov, Lutomirski, Andy, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Brown, Len, Macieira, Thiago, Liu,
	Jing2, Shankar, Ravi V, Linux Kernel Mailing List, linux-pm

On 8/3/21 2:32 PM, Bae, Chang Seok wrote:
>>> +static inline void idle_tile(void)
>>> +{
>>> +    if (boot_cpu_has(X86_FEATURE_XGETBV1) && (xgetbv(1) & XFEATURE_MASK_XTILE)) {
>>> +            tile_release();
>>> +            fpregs_deactivate(&current->thread.fpu);
>>> +    }
>>> +}
>> This isn't obviously safe code.  There's a window in there when we have
>> bogus, destroyed FPU register state but where we might be rescheduled.
>>
>> I would assume that preempt is off *somewhere* in this, but it would be
>> nice to make sure of that, or at least mention the requirement for it to
>> be off before this code is safe.
> I can see preempt_disable() in this path:
> 
> $kernel/sched/idle.c::play_idle_precise()
> --> preempt_disable()
> ...
> --> do_idle()
>     --> cpuidle_idle_call()
>         --> call_cpuidle()
>             --> $drivers/cpuidle/cpuidle.c::cpuidle_enter()
>                 --> cpuidle_enter_state()
>                     --> target_state->enter()
>                         --> $drivers/idle/intel_idle.c::intel_idle_tile()
>                             --> idle_tile()
> ...
> --> preempt_enable()

OK, that's good.  Can we comment about the preempt requirement
somewhere?  Or, maybe add a !in_atomic() warning?

Also, should this have something like a fpregs_state_valid() check?  If
the registers are invalid, should this be calling tile_release()?
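
For the first point, one way to encode the preempt-off requirement directly
in the helper -- a sketch only, not the posted patch; it assumes
lockdep_assert_preemption_disabled() is usable from the idle path:

static inline void idle_tile(void)
{
	/* Only safe from the idle path, with preemption disabled. */
	lockdep_assert_preemption_disabled();

	if (boot_cpu_has(X86_FEATURE_XGETBV1) && (xgetbv(1) & XFEATURE_MASK_XTILE)) {
		tile_release();
		fpregs_deactivate(&current->thread.fpu);
	}
}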

^ permalink raw reply	[flat|nested] 91+ messages in thread

* RE: [PATCH v9 25/26] intel_idle/amx: Add SPR support with XTILEDATA capability
  2021-08-03 21:38       ` Dave Hansen
@ 2021-08-03 21:43         ` Brown, Len
  0 siblings, 0 replies; 91+ messages in thread
From: Brown, Len @ 2021-08-03 21:43 UTC (permalink / raw)
  To: Hansen, Dave, Bae, Chang Seok
  Cc: Borislav Petkov, Lutomirski, Andy, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Macieira, Thiago, Liu, Jing2, Shankar,
	Ravi V, Linux Kernel Mailing List, linux-pm

> Also, should this have something like a fpregs_state_valid() check?  If the registers are invalid, should this be calling tile_release()?

From a correctness point of view, it is valid to always call tile_release() here.
From a performance point of view, tile_release() is very fast.




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 14/26] x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE
  2021-07-30 14:59 ` [PATCH v9 14/26] x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE Chang S. Bae
@ 2021-08-06 16:46   ` Thiago Macieira
  2021-08-09 22:08     ` Bae, Chang Seok
  0 siblings, 1 reply; 91+ messages in thread
From: Thiago Macieira @ 2021-08-06 16:46 UTC (permalink / raw)
  To: bp, luto, tglx, mingo, x86, Chang S. Bae
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

On Friday, 30 July 2021 07:59:45 PDT Chang S. Bae wrote:
> +       for_each_thread(tsk, t) {
> +               t->thread.fpu.dynamic_state_perm |= req_dynstate_perm;
> +               nr_threads++;
> +       }
> +
> +       if (nr_threads != tsk->signal->nr_threads) {
> +               for_each_thread(tsk, t)
> +                       t->thread.fpu.dynamic_state_perm = old_dynstate_perm;
> +               pr_err("x86/fpu: ARCH_XSTATE_PERM failed as thread number mismatched.\n");
> +               return -EBUSY;
> +       }
> +       return 0;
> +}

Hello all

As I was trying to write the matching userspace code, I think the solution
above has two problems.

First, the simpler one: that EBUSY. It must go, and you can do that with a
lock. Library code cannot ensure that it is running in a single-threaded
state, or that no other threads start or exit while it makes the system call.
There's nothing the library in question can do if it gets an EBUSY. Do you
want me to try again? What if it fails again? What's the state of the
dynamically permitted states after an EBUSY? It's probably inconsistent.
Moreover, there's an ABA problem: what happens if a thread starts and another
exits while this system call is running? And what happens if two threads make
this system call concurrently?
(also, shouldn't tsk->signal->nr_threads be an atomic read?)

The second and bigger problem is the consequence of not issuing the
ARCH_SET_STATE_ENABLE call: a SIGILL. Up until now, nothing like this has
been required, so I expect it to surprise people in the worst possible way.
The Intel Software Developer's Manual and every single tutorial out there say
that the sequence of actions is:
 1) check that OSXSAVE is enabled
 2) check that the AVX, AVX512 or AMX instructions are supported with CPUID
 3) execute XGETBV EAX=0
 4) disable any instructions whose matching state is not enabled by the OS

This is what software developers will write for AMX and any future state,
until they learn better. This is also all that other OSes will require to run.
Moreover, until developers can actually run their software on CPUs with AMX
support, they will not notice the missing system call (the Software
Development Emulator tool will execute the instructions whether you've issued
the syscall or not).
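
For reference, a condensed userspace sketch of that canonical sequence (bit
positions per the SDM; amx_usable() is a made-up helper):

#include <cpuid.h>
#include <stdbool.h>

#define XCR0_XTILE	((1ULL << 17) | (1ULL << 18))	/* XTILECFG | XTILEDATA */

static bool amx_usable(void)
{
	unsigned int eax, ebx, ecx, edx, lo, hi;

	/* 1) OSXSAVE: CPUID.1:ECX[27] */
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
		return false;

	/* 2) AMX-TILE: CPUID.(EAX=7,ECX=0):EDX[24] */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) || !(edx & (1u << 24)))
		return false;

	/* 3)+4) XGETBV(EAX=0): both XTILE bits must be enabled in XCR0 */
	asm volatile("xgetbv" : "=a" (lo), "=d" (hi) : "c" (0));
	return ((((unsigned long long)hi << 32) | lo) & XCR0_XTILE) == XCR0_XTILE;
}

Note that no system call appears anywhere in that sequence -- which is
exactly the problem.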

As a consequence, there's a large chance that a test escape like that will
cause software to start crashing once AMX-capable CPUs show up and get
enabled in public clouds.

So I have to insist that the XGETBV instruction's result match exactly what is 
permitted to run. That means we either enable AMX unconditionally with no need 
for system calls (with or without XFD trapping to dynamically allocate more 
state), or that the XCR0 register be set without the AMX bits by default, 
until the system call is issued.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 14/26] x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE
  2021-08-06 16:46   ` Thiago Macieira
@ 2021-08-09 22:08     ` Bae, Chang Seok
  2021-08-09 23:42       ` Thiago Macieira
  0 siblings, 1 reply; 91+ messages in thread
From: Bae, Chang Seok @ 2021-08-09 22:08 UTC (permalink / raw)
  To: Macieira, Thiago
  Cc: bp, Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Hansen, Dave,
	Liu, Jing2, Shankar, Ravi V, linux-kernel

On Aug 6, 2021, at 09:46, Macieira, Thiago <thiago.macieira@intel.com> wrote:
> On Friday, 30 July 2021 07:59:45 PDT Chang S. Bae wrote:
>> +       for_each_thread(tsk, t) {
>> +               t->thread.fpu.dynamic_state_perm |= req_dynstate_perm;
>> +               nr_threads++;
>> +       }
>> +
>> +       if (nr_threads != tsk->signal->nr_threads) {
>> +               for_each_thread(tsk, t)
>> +                       t->thread.fpu.dynamic_state_perm = old_dynstate_perm;
>> +               pr_err("x86/fpu: ARCH_XSTATE_PERM failed as thread number mismatched.\n");
>> +               return -EBUSY;
>> +       }
>> +       return 0;
>> +}

<snip>

> First, the simpler one: that EBUSY. It must go, and you can do that with a
> lock. Library code cannot ensure that it is running in a single-threaded
> state, or that no other threads start or exit while it makes the system
> call. There's nothing the library in question can do if it gets an EBUSY.
> Do you want me to try again? What if it fails again? What's the state of the
> dynamically permitted states after an EBUSY? It's probably inconsistent.
> Moreover, there's an ABA problem: what happens if a thread starts and
> another exits while this system call is running? And what happens if two
> threads make this system call concurrently?
> (also, shouldn't tsk->signal->nr_threads be an atomic read?)

I suspect the EBUSY situation is somewhat hypothetical. In theory, it might
arise if one thread calls this syscall at the point when a new task is being
created -- after the task data is duplicated [1] and before the task is
enlisted [2].

As stated in the changelog, an alternative is possible:
> An alternative implementation would not save the permission bitmap in
> every task. But instead would extend the per-process signal data, and
> that would not be subject to this race.
But it involves quite a bit of code complexity, and this is pretty much a
backend detail. I think it is possible to follow up with an update if the
case ever turns out to be real. At least, I'm not aware of any report against
the PR_SET_FP_MODE prctl(2) [3], which took the same approach -- walk and
update the task list.

Perhaps the hunk above can be made atomic.

<snip>

> So I have to insist that the XGETBV instruction's result match exactly what is 
> permitted to run. That means we either enable AMX unconditionally with no need 
> for system calls (with or without XFD trapping to dynamically allocate more 
> state), or that the XCR0 register be set without the AMX bits by default, 
> until the system call is issued.

Writing XCR0 provokes a VMEXIT, which would hurt performance badly. At least
the opt-in model is the consensus that came out of the long debate [4]. That
still leaves the question of how well this new syscall can be advertised,
though.

Thanks,
Chang

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c#n2128
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c#n2320
[3]: https://man7.org/linux/man-pages/man2/prctl.2.html
[4]: https://lore.kernel.org/lkml/CALCETrW2QHa2TLvnUuVxAAheqcbSZ-5_WRXtDSAGcbG8N+gtdQ@mail.gmail.com/



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 14/26] x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE
  2021-08-09 22:08     ` Bae, Chang Seok
@ 2021-08-09 23:42       ` Thiago Macieira
  2021-08-10  0:57         ` Bae, Chang Seok
  0 siblings, 1 reply; 91+ messages in thread
From: Thiago Macieira @ 2021-08-09 23:42 UTC (permalink / raw)
  To: Bae, Chang Seok
  Cc: bp, Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Hansen, Dave,
	Liu, Jing2, Shankar, Ravi V, linux-kernel

On Monday, 9 August 2021 15:08:19 PDT Bae, Chang Seok wrote:
> I suspect the EBUSY situation is somewhat hypothetical. In theory, it might
> arise if one thread calls this syscall at the point when a new task is being
> created -- after the task data is duplicated [1] and before the task is
> enlisted [2].
> 
> As stated in the changelog, an alternative is possible:
> > An alternative implementation would not save the permission bitmap in
> > every task. But instead would extend the per-process signal data, and
> > that would not be subject to this race.
> 
> But it involves quite a bit of code complexity, and this is pretty much a
> backend detail. I think it is possible to follow up with an update if the
> case ever turns out to be real. At least, I'm not aware of any report
> against the PR_SET_FP_MODE prctl(2) [3], which took the same approach --
> walk and update the task list.
> 
> Perhaps the hunk above can be made atomic.
> 
> <snip>

Hello Chang

Thanks for taking a look at this. I agree that this is a very, very tiny
corner case and the issue can be treated as a bugfix later. The API between
userspace and kernel -- the big issue right now -- is fine.

But to explain the issue I see: a userspace library cannot enforce that other
threads in the same process aren't either making this same system call or
starting/exiting threads. So I see two scenarios where the corner case can be
reached:

1) two simultaneous ARCH_SET_STATE_ENABLE calls
Imagine two threads, each setting a different bit (say bits 18 and 19). Since 
they race with each other and this line:
              t->thread.fpu.dynamic_state_perm |= req_dynstate_perm;
is not using an atomic, the compiler won't emit a LOCK OR, so it's possible
the two calls will step over each other and partially undo the other's work.
The result after the two calls is completely indeterminate, yet both calls
return success.

Since right now there's only one bit that can be set, we know that the two 
calls are doing the same thing, so they're not effectively racing each other. 
So this case is not an issue *right* *now*. There's only duplicate work.
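
For comparison, a per-thread update that cannot be torn by a concurrent
caller would look something like this (a sketch, assuming dynamic_state_perm
were declared as atomic64_t):

	/* LOCK OR: concurrent callers cannot clobber each other's bits */
	for_each_thread(tsk, t)
		atomic64_or(req_dynstate_perm, &t->thread.fpu.dynamic_state_perm);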

2) one thread calls ARCH_SET_STATE_ENABLE while another thread exits/starts
In this case, the nr_threads != tsk->signal->nr_threads test will fail,
resulting in the dynamic state being rolled back and the EBUSY condition. I'd
like a recommendation from the kernel on how to deal with that EBUSY: should
I retry? For now, I'm going to treat EBUSY like EINVAL and assume I cannot
use the feature.

1+2) both situations at the same time
This means the corruption can get worse since the rollback code can undo or 
partially undo the progression of the other ARCH_SET_STATE_ENABLE.

> > So I have to insist that the XGETBV instruction's result match exactly
> > what is permitted to run. That means we either enable AMX unconditionally
> > with no need for system calls (with or without XFD trapping to
> > dynamically allocate more state), or that the XCR0 register be set
> > without the AMX bits by default, until the system call is issued.
> 
> Writing XCR0 provokes a VMEXIT, which would hurt performance badly. At least
> the opt-in model is the consensus that came out of the long debate [4]. That
> still leaves the question of how well this new syscall can be advertised,
> though.

I understand.

I am pointing out that this will cause crashes because of improperly / 
insufficiently-tested software. That is, software that violates the contract 
of the new API because we inserted a new requirement that didn't exist for old 
features. Yes, said software is buggy.

The problem is that the crashes can be surprising and will only show up after
the software gets run on an AMX-capable machine. That may happen, for example,
if a cloud provider "upgrades" a VM instance from a previous processor
generation to a new one, or if a batch-job pool includes the new instance
type. That is, the crashes will not happen for the developer of the software
in question, but for its users.

However, given the requirements that:
 a) XCR0 not be context-switched
 b) a new API call be required to allow the new instructions

Then there's no alternative.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 14/26] x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE
  2021-08-09 23:42       ` Thiago Macieira
@ 2021-08-10  0:57         ` Bae, Chang Seok
  2021-08-13 19:44           ` Bae, Chang Seok
  0 siblings, 1 reply; 91+ messages in thread
From: Bae, Chang Seok @ 2021-08-10  0:57 UTC (permalink / raw)
  To: Macieira, Thiago
  Cc: bp, Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Hansen, Dave,
	Liu, Jing2, Shankar, Ravi V, linux-kernel

On Aug 9, 2021, at 16:42, Macieira, Thiago <thiago.macieira@intel.com> wrote:
> 
> This means the corruption can get worse since the rollback code can undo or 
> partially undo the progression of the other ARCH_SET_STATE_ENABLE.

Maybe something like this can help here to ensure a valid rollback.

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 96056f49bcff..3468bc0ee654 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1353,6 +1353,8 @@ int alloc_xstate_buffer(struct fpu *fpu, u64 mask)
       return 0;
}

+static DEFINE_SPINLOCK(set_xstate_perm_lock);
+
/**
 * set_process_xstate_perm - Set a per-process permission to use dynamic
 *                          user xstates.
@@ -1383,6 +1385,8 @@ long set_process_xstate_perm(struct task_struct *tsk, u64 state_perm)
       if (!req_dynstate_perm)
               return 0;

+       spin_lock(&set_xstate_perm_lock);
+
       old_dynstate_perm = tsk->thread.fpu.dynamic_state_perm;

       for_each_thread(tsk, t) {
@@ -1396,6 +1400,9 @@ long set_process_xstate_perm(struct task_struct *tsk, u64 state_perm)
               pr_err("x86/fpu: ARCH_XSTATE_PERM failed as thread number mismatched.\n");
+               spin_unlock(&set_xstate_perm_lock);
               return -EBUSY;
       }
+
+       spin_unlock(&set_xstate_perm_lock);
       return 0;
}

Thanks,
Chang

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 05/26] x86/fpu/xstate: Add new variables to indicate dynamic XSTATE buffer size
  2021-07-30 14:59 ` [PATCH v9 05/26] x86/fpu/xstate: Add new variables to indicate dynamic XSTATE buffer size Chang S. Bae
@ 2021-08-12 15:03   ` Borislav Petkov
  0 siblings, 0 replies; 91+ messages in thread
From: Borislav Petkov @ 2021-08-12 15:03 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: luto, tglx, mingo, x86, len.brown, dave.hansen, thiago.macieira,
	jing2.liu, ravi.v.shankar, linux-kernel, kvm

On Fri, Jul 30, 2021 at 07:59:36AM -0700, Chang S. Bae wrote:
> @@ -167,8 +158,10 @@ static void __init fpu__init_task_struct_size(void)
>  	/*
>  	 * Add back the dynamically-calculated register state
>  	 * size.
> +	 *
> +	 * Use the minimum size as embedded to task_struct.

embedded in...

> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 74e608c6ad6c..12caf1a56ce0 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -77,12 +77,51 @@ static unsigned int xstate_comp_offsets[XFEATURE_MAX] __ro_after_init =
>  static unsigned int xstate_supervisor_only_offsets[XFEATURE_MAX] __ro_after_init =
>  	{ [ 0 ... XFEATURE_MAX - 1] = -1};
>  
> -/*
> - * The XSAVE area of kernel can be in standard or compacted format;
> - * it is always in standard format for user mode. This is the user
> - * mode standard format size used for signal and ptrace frames.
> +/**
> + * struct fpu_xstate_buffer_config - xstate buffer configuration
> + * @max_size:			The CPUID-enumerated all-feature "maximum" size
> + *				for xstate per-task buffer.
> + * @min_size:			The size to fit into the statically-allocated
> + *				buffer. With dynamic states, this buffer no longer
> + *				contains all the enabled state components.
> + * @user_size:			The size of user-space buffer for signal and
> + *				ptrace frames, in the non-compacted format.
>   */
> -unsigned int fpu_user_xstate_size __ro_after_init;
> +struct fpu_xstate_buffer_config {
> +	unsigned int min_size, max_size;
> +	unsigned int user_size;
> +};
> +
> +static struct fpu_xstate_buffer_config buffer_config __ro_after_init;

I know I had asked for the accessors below, but if this is going to be
read-only after init and is not going to change for the duration of the
system's lifetime, then you don't really need those accessors.

I.e., you can do

struct fpu_xstate_buffer_config {
	unsigned int min_size, max_size;
	unsigned int user_size;
};

static struct fpu_xstate_buffer_config fpu_buf_cfg __ro_after_init;

and then access those values through fpu_buf_cfg.<value>
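
For example, a former get_xstate_config(XSTATE_MIN_SIZE) call site would
then simply read (sketch):

	/* plain load; fpu_buf_cfg is __ro_after_init */
	unsigned int min_size = fpu_buf_cfg.min_size;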

Thx.

> +
> +unsigned int get_xstate_config(enum xstate_config cfg)
> +{
> +	switch (cfg) {
> +	case XSTATE_MIN_SIZE:
> +		return buffer_config.min_size;
> +	case XSTATE_MAX_SIZE:
> +		return buffer_config.max_size;
> +	case XSTATE_USER_SIZE:
> +		return buffer_config.user_size;
> +	default:
> +		return 0;
> +	}
> +}
> +EXPORT_SYMBOL_GPL(get_xstate_config);
> +
> +void set_xstate_config(enum xstate_config cfg, unsigned int value)
> +{
> +	switch (cfg) {
> +	case XSTATE_MIN_SIZE:
> +		buffer_config.min_size = value;
> +		break;
> +	case XSTATE_MAX_SIZE:
> +		buffer_config.max_size = value;
> +		break;
> +	case XSTATE_USER_SIZE:
> +		buffer_config.user_size = value;
> +	}
> +}

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 06/26] x86/fpu/xstate: Calculate and remember dynamic XSTATE buffer sizes
  2021-07-30 14:59 ` [PATCH v9 06/26] x86/fpu/xstate: Calculate and remember dynamic XSTATE buffer sizes Chang S. Bae
@ 2021-08-12 16:36   ` Borislav Petkov
  0 siblings, 0 replies; 91+ messages in thread
From: Borislav Petkov @ 2021-08-12 16:36 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: luto, tglx, mingo, x86, len.brown, dave.hansen, thiago.macieira,
	jing2.liu, ravi.v.shankar, linux-kernel

On Fri, Jul 30, 2021 at 07:59:37AM -0700, Chang S. Bae wrote:
> The CPUID instruction separately enumerates sizes and alignments of
> individual xfeatures. It independently enumerates the required size of an
> entire XSAVE buffer to store all enabled features.
> 
> calculate_xstate_sizes() currently uses the individual feature
> size/alignment enumeration to independently recalculate the required XSAVE
> buffer size. This is compared against the CPUID-provided value.
> 
> Extend the function to accept an option to exclude dynamic states. With
> that, calculate the maximum size that contains all the enabled states, and
> the minimum size that fits in the statically-allocated buffer by excluding
> dynamic states.

This explains *what* this patch does but not *why*. *What* I can more or
less see but for *why* I'd need my crystal ball...

> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 12caf1a56ce0..cd709408efb5 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -591,24 +591,28 @@ static void check_xstate_against_struct(int nr)
>  	}
>  }
>  
> -/*
> - * This essentially double-checks what the cpu told us about
> - * how large the XSAVE buffer needs to be.  We are recalculating
> - * it to be safe.

Why are you removing that comment? Are we not recalculating anymore?

> +/**
> + * calculate_xstate_size - Calculate the xstate per-task buffer size.
> + *
> + * Independent XSAVE features allocate their own buffers and are always
> + * excluded. Only the size of the buffer for task->fpu is checked here.
>   *
> - * Independent XSAVE features allocate their own buffers and are not
> - * covered by these checks. Only the size of the buffer for task->fpu
> - * is checked here.
> + * @include_dynamic_states:	A knob to include dynamic states or not.
> + *
> + * Return:			The calculated xstate size.
>   */
> -static void do_extra_xstate_size_checks(void)
> +static unsigned int calculate_xstate_size(bool include_dynamic_states)
>  {
> -	int paranoid_xstate_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
> +	unsigned int xstate_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
>  	int i;
>  
>  	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
>  		if (!xfeature_enabled(i))
>  			continue;
>  
> +		if (!include_dynamic_states && (xfeatures_mask_user_dynamic & BIT_ULL(i)))

The order should be flipped: if (dynamic_state and !include_dynamic_states)

> +			continue;
> +
>  		check_xstate_against_struct(i);
>  		/*
>  		 * Supervisor state components can be managed only by
> @@ -619,7 +623,7 @@ static void do_extra_xstate_size_checks(void)
>  
>  		/* Align from the end of the previous feature */
>  		if (xfeature_is_aligned(i))
> -			paranoid_xstate_size = ALIGN(paranoid_xstate_size, 64);
> +			xstate_size = ALIGN(xstate_size, 64);
>  		/*
>  		 * The offset of a given state in the non-compacted
>  		 * format is given to us in a CPUID leaf.  We check
> @@ -627,18 +631,15 @@ static void do_extra_xstate_size_checks(void)
>  		 * setup_xstate_features(). XSAVES uses compacted format.
>  		 */
>  		if (!cpu_feature_enabled(X86_FEATURE_XSAVES))
> -			paranoid_xstate_size = xfeature_uncompacted_offset(i);
> +			xstate_size = xfeature_uncompacted_offset(i);
>  		/*
>  		 * The compacted-format offset always depends on where
>  		 * the previous state ended.
>  		 */
> -		paranoid_xstate_size += xfeature_size(i);
> +		xstate_size += xfeature_size(i);
>  	}
> -	/*
> -	 * The size accounts for all the possible states reserved in the
> -	 * per-task buffer.  Check against the maximum size.
> -	 */
> -	XSTATE_WARN_ON(paranoid_xstate_size != get_xstate_config(XSTATE_MAX_SIZE));
> +
> +	return xstate_size;
>  }
>  
>  

<--- You can remove one of the newlines here, while at it.

> @@ -723,7 +724,7 @@ static bool is_supported_xstate_size(unsigned int test_xstate_size)
>  static int __init init_xstate_size(void)
>  {
>  	/* Recompute the context size for enabled features: */
> -	unsigned int possible_xstate_size;
> +	unsigned int possible_xstate_size, xstate_size;
>  	unsigned int xsave_size;
>  
>  	xsave_size = get_xsave_size();
> @@ -734,23 +735,23 @@ static int __init init_xstate_size(void)
>  		possible_xstate_size = xsave_size;
>  
>  	/*
> -	 * The size accounts for all the possible states reserved in the
> -	 * per-task buffer.  Set the maximum with this value.
> +	 * Calculate xstate size for all the possible states by setting
> +	 * 'true' to include dynamic states.

"Calculate the maximum xstate size, including the dynamic states."

> Cross-check with the CPUID-
> +	 * provided size and record it.
>  	 */
> +	xstate_size = calculate_xstate_size(true);
> +	XSTATE_WARN_ON(possible_xstate_size != xstate_size);
>  	set_xstate_config(XSTATE_MAX_SIZE, possible_xstate_size);
>  
> -	/* Perform an extra check for the maximum size. */
> -	do_extra_xstate_size_checks();
> -
>  	/*
> -	 * Set the minimum to be the same as the maximum. The dynamic
> -	 * user states are not supported yet.
> +	 * Calculate the xstate size without dynamic states by setting
> +	 * 'false' to exclude dynamic states.

"Calculate the minimum xstate size, i.e., excluding the dynamic xstates."

> Ensure the size fits in
> +	 * the statically-allocated buffer and record it.
>  	 */
> -	set_xstate_config(XSTATE_MIN_SIZE, possible_xstate_size);
> -
> -	/* Ensure the minimum size fits in the statically-allocated buffer: */
> -	if (!is_supported_xstate_size(get_xstate_config(XSTATE_MIN_SIZE)))
> +	xstate_size = calculate_xstate_size(false);
> +	if (!is_supported_xstate_size(xstate_size))
>  		return -EINVAL;

<---- newline here.

> +	set_xstate_config(XSTATE_MIN_SIZE, xstate_size);
>  
>  	/*
>  	 * User space is always in standard format.
> -- 
> 2.17.1
> 

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 07/26] x86/fpu/xstate: Convert the struct fpu 'state' field to a pointer
  2021-07-30 14:59 ` [PATCH v9 07/26] x86/fpu/xstate: Convert the struct fpu 'state' field to a pointer Chang S. Bae
@ 2021-08-12 17:09   ` Borislav Petkov
  0 siblings, 0 replies; 91+ messages in thread
From: Borislav Petkov @ 2021-08-12 17:09 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: luto, tglx, mingo, x86, len.brown, dave.hansen, thiago.macieira,
	jing2.liu, ravi.v.shankar, linux-kernel, kvm

On Fri, Jul 30, 2021 at 07:59:38AM -0700, Chang S. Bae wrote:
> diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
> index f5a38a5f3ae1..c7826708f27f 100644
> --- a/arch/x86/include/asm/fpu/types.h
> +++ b/arch/x86/include/asm/fpu/types.h
> @@ -339,13 +339,30 @@ struct fpu {
>  	/*
>  	 * @state:
>  	 *
> -	 * In-memory copy of all FPU registers that we save/restore
> -	 * over context switches. If the task is using the FPU then
> -	 * the registers in the FPU are more recent than this state
> -	 * copy. If the task context-switches away then they get
> -	 * saved here and represent the FPU state.
> +	 * A pointer to indicate the in-memory copy of all FPU registers
> +	 * that are saved/restored over context switches.
> +	 *
> +	 * Initially @state points to @__default_state. When dynamic states
> +	 * get used, memory is allocated for the larger state copy and
> +	 * @state is updated to point to it. Then, the state in ->state
> +	 * supersedes and invalidates the state in @__default_state.
> +	 *
> +	 * In general, if the task is using the FPU then the registers in
> +	 * the FPU are more recent than the state copy. If the task
> +	 * context-switches away then they get saved in ->state and
> +	 * represent the FPU state.
> +	 */
> +	union fpregs_state		*state;
> +
> +	/*
> +	 * @__default_state:
> +	 *
> +	 * Initial in-memory copy of all FPU registers that is saved/restored
> +	 * over context switches. When the task is switched to dynamic
> +	 * states, this copy is replaced with the new in-memory copy in
> +	 * ->state.
>  	 */
> -	union fpregs_state		state;
> +	union fpregs_state		__default_state;
>  	/*
>  	 * WARNING: 'state' is dynamically-sized.  Do not put
		    ^^^^^^

that needs to be __default_state, as that is the member which is dynamically-sized.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 08/26] x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer dynamically
  2021-07-30 14:59 ` [PATCH v9 08/26] x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer dynamically Chang S. Bae
@ 2021-08-12 19:44   ` Borislav Petkov
  2021-08-13  8:04     ` Bae, Chang Seok
  2021-08-16 18:33     ` Bae, Chang Seok
  2021-08-30 17:45   ` Dave Hansen
  1 sibling, 2 replies; 91+ messages in thread
From: Borislav Petkov @ 2021-08-12 19:44 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: luto, tglx, mingo, x86, len.brown, dave.hansen, thiago.macieira,
	jing2.liu, ravi.v.shankar, linux-kernel

On Fri, Jul 30, 2021 at 07:59:39AM -0700, Chang S. Bae wrote:
> diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
> index c7826708f27f..c0192e16cadb 100644
> --- a/arch/x86/include/asm/fpu/types.h
> +++ b/arch/x86/include/asm/fpu/types.h
> @@ -336,6 +336,14 @@ struct fpu {
>  	 */
>  	unsigned long			avx512_timestamp;
>  
> +	/*
> +	 * @state_mask:
> +	 *
> +	 * The bitmap represents state components reserved to be saved in
> +	 * ->state.

What does "reserved to be saved in" even mean?

> +	 */
> +	u64				state_mask;
> +
>  	/*
>  	 * @state:
>  	 *
> diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
> index d722e774a9f9..45735441fbe8 100644
> --- a/arch/x86/include/asm/fpu/xstate.h
> +++ b/arch/x86/include/asm/fpu/xstate.h
> @@ -146,6 +146,9 @@ extern unsigned int get_xstate_config(enum xstate_config cfg);
>  void set_xstate_config(enum xstate_config cfg, unsigned int value);
>  
>  void *get_xsave_addr(struct fpu *fpu, int xfeature_nr);
> +unsigned int get_xstate_size(u64 mask);
> +int alloc_xstate_buffer(struct fpu *fpu, u64 mask);
> +void free_xstate_buffer(struct fpu *fpu);
>  int xfeature_size(int xfeature_nr);
>  int copy_uabi_from_kernel_to_xstate(struct fpu *fpu, const void *kbuf);
>  int copy_sigframe_from_user_to_xstate(struct fpu *fpu, const void __user *ubuf);
> diff --git a/arch/x86/include/asm/trace/fpu.h b/arch/x86/include/asm/trace/fpu.h
> index ef82f4824ce7..b691c2db47c7 100644
> --- a/arch/x86/include/asm/trace/fpu.h
> +++ b/arch/x86/include/asm/trace/fpu.h
> @@ -89,6 +89,11 @@ DEFINE_EVENT(x86_fpu, x86_fpu_xstate_check_failed,
>  	TP_ARGS(fpu)
>  );
>  
> +DEFINE_EVENT(x86_fpu, x86_fpu_xstate_alloc_failed,
> +	TP_PROTO(struct fpu *fpu),
> +	TP_ARGS(fpu)

Last time I said:

"Yes, add it when it is really needed. Not slapping it proactively and
hoping for any potential usage."

Why is that thing still here?!

> @@ -380,6 +381,9 @@ static void fpu_reset_fpstate(void)
>  	 * flush_thread().
>  	 */
>  	memcpy(fpu->state, &init_fpstate, init_fpstate_copy_size());
> +	/* Adjust the xstate buffer format for current. */
> +	if (boot_cpu_has(X86_FEATURE_XSAVES))

cpu_feature_enabled

> +		fpstate_init_xstate(&fpu->state->xsave, fpu->state_mask);

<---- newline here.

>  	set_thread_flag(TIF_NEED_FPU_LOAD);
>  	fpregs_unlock();
>  }
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 5f58dca4c6b7..26f6d5e0f1ed 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -10,6 +10,7 @@
>  #include <linux/pkeys.h>
>  #include <linux/seq_file.h>
>  #include <linux/proc_fs.h>
> +#include <linux/vmalloc.h>
>  
>  #include <asm/fpu/api.h>
>  #include <asm/fpu/internal.h>
> @@ -19,6 +20,7 @@
>  
>  #include <asm/tlbflush.h>
>  #include <asm/cpufeature.h>
> +#include <asm/trace/fpu.h>
>  
>  /*
>   * Although we spell it out in here, the Processor Trace
> @@ -76,6 +78,12 @@ static unsigned int xstate_comp_offsets[XFEATURE_MAX] __ro_after_init =
>  	{ [ 0 ... XFEATURE_MAX - 1] = -1};
>  static unsigned int xstate_supervisor_only_offsets[XFEATURE_MAX] __ro_after_init =
>  	{ [ 0 ... XFEATURE_MAX - 1] = -1};
> +/*
> + * True if the buffer of the corresponding XFEATURE is located on the next 64
> + * byte boundary. Otherwise, it follows the preceding component immediately.
> + */
> +static bool xstate_aligns[XFEATURE_MAX] __ro_after_init =

Then call that thing xstate_64byte_aligned[] to denote *exactly* what it
contains.

> +	{ [ 0 ... XFEATURE_MAX - 1] = false};
>  
>  /**
>   * struct fpu_xstate_buffer_config - xstate buffer configuration
> @@ -174,6 +182,55 @@ static bool xfeature_is_supervisor(int xfeature_nr)
>  	return ecx & 1;
>  }
>  
> +/**
> + * get_xstate_size - Calculate an xstate buffer size

calculate_xstate_buf_size_from_mask()

if anything. This name is deceivingly generic.

> + * @mask:	This bitmap tells which components reserved in the buffer.

are reserved?

What's this notion of reservation here? The mask is dictating what gets
reserved in the buffer or what?

Looking at the usage, that mask is simply saying which components are
going to be saved in the buffer. So all this "reserved" bla is only
confusing - drop it.

> + *
> + * Available once those arrays for the offset, size, and alignment info are
> + * set up, by setup_xstate_features().
> + *
> + * Returns:	The buffer size
> + */
> +unsigned int get_xstate_size(u64 mask)
> +{
> +	unsigned int size;
> +	int i, nr;
> +
> +	if (!mask)
> +		return 0;
> +
> +	/*
> +	 * The minimum buffer size excludes the dynamic user state. When a
> +	 * task uses the state, the buffer can grow up to the max size.
> +	 */
> +	if (mask == (xfeatures_mask_all & ~xfeatures_mask_user_dynamic))
> +		return get_xstate_config(XSTATE_MIN_SIZE);
> +	else if (mask == xfeatures_mask_all)
> +		return get_xstate_config(XSTATE_MAX_SIZE);
> +
> +	nr = fls64(mask) - 1;
> +
> +	if (!boot_cpu_has(X86_FEATURE_XSAVES))

cpu_feature_enabled()

> +		return xstate_offsets[nr] + xstate_sizes[nr];

From all the superfluous commenting, where a comment is really needed is
here but there's none.

What's that doing? No compacted states enabled so take the offset and
size of the *last* state and use that as the buffer size?

> +
> +	if ((xfeatures_mask_all & (BIT_ULL(nr + 1) - 1)) == mask)
				  ^^^^^^^^^^^^^^^^^^^^^


That thing looks like a GENMASK_ULL() thing. Use it?

Also, what is that test doing?!

If a mask up to nr ANDed with mask_all is == mask?!

You need to explain yourself a lot more here what you're doing. Why
those two special cases if you can simply iterate over the extended
states and be done with it? Except maybe the first two special cases
which are trivial...

> @@ -848,6 +908,9 @@ void __init fpu__init_system_xstate(void)
>  	if (err)
>  		goto out_disable;
>  
> +	/* Make sure init_task does not include the dynamic user states. */

My constant review question: why?

I probably should put it on a t-shirt.

> +	current->thread.fpu.state_mask = (xfeatures_mask_all & ~xfeatures_mask_user_dynamic);
> +
>  	/*
>  	 * Update info used for ptrace frames; use standard-format size and no
>  	 * supervisor xstates:
> @@ -1038,6 +1101,70 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
>  }
>  #endif /* ! CONFIG_ARCH_HAS_PKEYS */
>  
> +void free_xstate_buffer(struct fpu *fpu)
> +{
> +	/* Free up only the dynamically-allocated memory. */

This belongs above the function along with an explanation when it needs
to be called.

> +	if (fpu->state != &fpu->__default_state)
> +		vfree(fpu->state);
> +}


> +
> +/**
> + * alloc_xstate_buffer - Allocate a buffer with the size calculated from

This name doesn't even begin to tell me that this function deals with
enlarging the xstate buffer with dynamic states. How is the caller
supposed to know?

Also, you need to move all possible xfeatures_mask_user_dynamic querying
inside it so that its user doesn't have to do it. I'm looking at the
callsite in xstateregs_set().

The other callsite in exc_device_not_available() seems to not check the
dynamic states but uses only XFD. I guess I'll parse that properly when
I get there but right now I have no clue why you're not checking the
dynamic mask there.

> + *			 @mask.
> + *
> + * @fpu:	A struct fpu * pointer
> + * @mask:	The bitmap tells which components to be reserved in the new
> + *		buffer.
> + *
> + * Use vmalloc() simply here. If the task with a vmalloc()-allocated buffer

vzalloc

> + * tends to terminate quickly, vfree()-induced IPIs may be a concern.
> + * Caching may be helpful for this. But the task with large state is likely
> + * to live longer.
> + *
> + * Also, this method does not shrink or reclaim the buffer.
> + *
> + * Returns 0 on success, -ENOMEM on allocation error.
> + */
> +int alloc_xstate_buffer(struct fpu *fpu, u64 mask)
> +{
> +	union fpregs_state *state;
> +	unsigned int oldsz, newsz;
> +	u64 state_mask;
> +
> +	state_mask = fpu->state_mask | mask;
> +
> +	oldsz = get_xstate_size(fpu->state_mask);
> +	newsz = get_xstate_size(state_mask);
> +
> +	if (oldsz >= newsz)
> +		return 0;

Why?

Why not simply:

	if (fpu->state_mask == mask)
		return 0;

	/* vzalloc */

	/* free the old buffer */
	free_xstate_buffer(fpu);

	fpu->state = state;
	...

?

Our FPU code is a mess - you should try not to make it an even bigger
one without a good reason.

> +
> +	state = vzalloc(newsz);
> +	if (!state) {
> +		/*
> +		 * When allocation requested from #NM, the error code may
> +		 * not be populated well. Then, this tracepoint is useful
> +		 * for providing the failure context.
> +		 */
> +		trace_x86_fpu_xstate_alloc_failed(fpu);
> +		return -ENOMEM;

What happens with the old buffer here? It seems we leak it...

> +	}
> +
> +	if (boot_cpu_has(X86_FEATURE_XSAVES))

cpu_feature_enabled


-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 08/26] x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer dynamically
  2021-08-12 19:44   ` Borislav Petkov
@ 2021-08-13  8:04     ` Bae, Chang Seok
  2021-08-13 10:04       ` Borislav Petkov
  2021-08-16 18:33     ` Bae, Chang Seok
  1 sibling, 1 reply; 91+ messages in thread
From: Bae, Chang Seok @ 2021-08-13  8:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Hansen, Dave,
	Macieira, Thiago, Liu, Jing2, Shankar, Ravi V, linux-kernel

On Aug 12, 2021, at 12:44, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Jul 30, 2021 at 07:59:39AM -0700, Chang S. Bae wrote:
>> 
>> --- a/arch/x86/include/asm/trace/fpu.h
>> +++ b/arch/x86/include/asm/trace/fpu.h
>> @@ -89,6 +89,11 @@ DEFINE_EVENT(x86_fpu, x86_fpu_xstate_check_failed,
>> 	TP_ARGS(fpu)
>> );
>> 
>> +DEFINE_EVENT(x86_fpu, x86_fpu_xstate_alloc_failed,
>> +	TP_PROTO(struct fpu *fpu),
>> +	TP_ARGS(fpu)
> 
> Last time I said:
> 
> "Yes, add it when it is really needed. Not slapping it proactively and
> hoping for any potential usage."
> 
> Why is that thing still here?!

There was no clear path to emit an error code before; I thought that was the
reason for this tracepoint. But now a signal or an error-code return is
established, so I should have removed it along with that change.

>> + * @mask:	This bitmap tells which components reserved in the buffer.
> 
> are reserved?
> 
> What's this notion of reservation here? The mask is dictating what gets
> reserved in the buffer or what?
> 
> Looking at the usage, that mask is simply saying which components are
> going to be saved in the buffer. So all this "reserved" bla is only
> confusing - drop it.

Okay. I remember this "reserved" wording started from a changelog. Given the
confusion, let me make sure it is all removed.

>> + *
>> + * Available once those arrays for the offset, size, and alignment info are
>> + * set up, by setup_xstate_features().
>> + *
>> + * Returns:	The buffer size
>> + */
>> +unsigned int get_xstate_size(u64 mask)
>> +{
>> +	unsigned int size;
>> +	int i, nr;
>> +
>> +	if (!mask)
>> +		return 0;
>> +
>> +	/*
>> +	 * The minimum buffer size excludes the dynamic user state. When a
>> +	 * task uses the state, the buffer can grow up to the max size.
>> +	 */
>> +	if (mask == (xfeatures_mask_all & ~xfeatures_mask_user_dynamic))
>> +		return get_xstate_config(XSTATE_MIN_SIZE);
>> +	else if (mask == xfeatures_mask_all)
>> +		return get_xstate_config(XSTATE_MAX_SIZE);
>> +
>> +	nr = fls64(mask) - 1;
>> +
>> +	if (!boot_cpu_has(X86_FEATURE_XSAVES))
> 
> cpu_feature_enabled()
> 
>> +		return xstate_offsets[nr] + xstate_sizes[nr];
> 
> From all the superfluous commenting, where a comment is really needed is
> here but there's none.
> 
> What's that doing? No compacted states enabled so take the offset and
> size of the *last* state and use that as the buffer size?

Yes, each state's offset in the non-compacted format is fixed on a given
machine regardless of RFBM. So the size can simply be taken like that.

>> +
>> +	if ((xfeatures_mask_all & (BIT_ULL(nr + 1) - 1)) == mask)
> 				  ^^^^^^^^^^^^^^^^^^^^^
> 
> That thing looks like a GENMASK_ULL() thing. Use it?

Looks like I was not familiar with this macro:
   if ((xfeatures_mask_all & GENMASK_ULL(nr, 0)) == mask)

> Also, what is that test doing?!
> 
> If a mask up to nr ANDed with mask_all is == mask?!
> 
> You need to explain yourself a lot more here what you're doing. Why
> those two special cases if you can simply iterate over the extended
> states and be done with it? Except maybe the first two special cases
> which are trivial...

xstate_comp_offsets[] is computed for the compacted format with
xfeatures_mask_all. If the enabled feature bits are all the same up to 'nr',
this recorded offset can be taken.

But it might be better to simplify this hunk for readability. I suspect its
call sites are not that performance-critical.
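
Spelled out, the fast path under discussion is something like this (a sketch
using GENMASK_ULL(); the final return is an assumption about the unquoted
part of the patch):

	/*
	 * If every enabled feature up to 'nr' is also present in @mask,
	 * the layout matches the one xstate_comp_offsets[] was computed
	 * for, so the recorded offset of the last feature is valid.
	 */
	if ((xfeatures_mask_all & GENMASK_ULL(nr, 0)) == mask)
		return xstate_comp_offsets[nr] + xstate_sizes[nr];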

>> @@ -848,6 +908,9 @@ void __init fpu__init_system_xstate(void)
>> 	if (err)
>> 		goto out_disable;
>> 
>> +	/* Make sure init_task does not include the dynamic user states. */
> 
> My constant review question: why?

Every task's state_mask should begin aligned with the default buffer.
fpu_clone() sets this for all tasks except init_task.
Maybe:
    "Make sure init_task's state_mask is aligned with its __default_state"

>> +	current->thread.fpu.state_mask = (xfeatures_mask_all & ~xfeatures_mask_user_dynamic);


>> +
>> +/**
>> + * alloc_xstate_buffer - Allocate a buffer with the size calculated from
> 
> This name doesn't even begin to tell me that this function deals with
> enlarging the xstate buffer with dynamic states. How is the caller
> supposed to know?

How about enlarge_xstate_buffer() or realloc_xstate_buffer()?

> 
> Also, you need to move all possible xfeatures_mask_user_dynamic querying
> inside it so that its user doesn't have to do it. I'm looking at the
> callsite in xstateregs_set().

The query is intended to check whether the xstate buffer is fully expanded or
not -- no need to enlarge.

If the buffer is already the maximum, the code to retrieve XSTATE_BV, this
call, etc should be skipped there.  

If the query is moved here, I guess this call site code becomes a bit ugly.

> The other callsite in exc_device_not_available() seems to not check the
> dynamic states but uses only XFD. I guess I'll parse that properly when
> I get there but right now I have no clue why you're not checking the
> dynamic mask there.

In this case, I think it makes sense to move it into this function. But it is
not yet clear how to adjust the above case.

>> +int alloc_xstate_buffer(struct fpu *fpu, u64 mask)
>> +{
>> +	union fpregs_state *state;
>> +	unsigned int oldsz, newsz;
>> +	u64 state_mask;
>> +
>> +	state_mask = fpu->state_mask | mask;
>> +
>> +	oldsz = get_xstate_size(fpu->state_mask);
>> +	newsz = get_xstate_size(state_mask);
>> +
>> +	if (oldsz >= newsz)
>> +		return 0;
> 
> Why?
> 
> Why not simply:
> 
> 	if (fpu->state_mask == mask)
> 		return 0;
> 
> 	/* vzalloc */
> 
> 	/* free the old buffer */
> 	free_xstate_buffer(fpu);
> 
> 	fpu->state = state;
> 	...
> 
> ?
> 
> Our FPU code is a mess - you should try not to make it an even bigger
> one without a good reason.

Okay, maybe get_xstate_size() is overkill. But I think a sanity check like
this is still needed:
    if ((mask & fpu->state_mask) == mask) 
        return 0; 

>> +
>> +	state = vzalloc(newsz);
>> +	if (!state) {
>> +		/*
>> +		 * When allocation requested from #NM, the error code may
>> +		 * not be populated well. Then, this tracepoint is useful
>> +		 * for providing the failure context.
>> +		 */
>> +		trace_x86_fpu_xstate_alloc_failed(fpu);
>> +		return -ENOMEM;
> 
> What happens with the old buffer here? It seems we leak it…

No, it is still pointed to by fpu->state and will be freed in the exit path.

Thanks,
Chang



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 08/26] x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer dynamically
  2021-08-13  8:04     ` Bae, Chang Seok
@ 2021-08-13 10:04       ` Borislav Petkov
  2021-08-13 19:43         ` Bae, Chang Seok
  0 siblings, 1 reply; 91+ messages in thread
From: Borislav Petkov @ 2021-08-13 10:04 UTC (permalink / raw)
  To: Bae, Chang Seok
  Cc: Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Hansen, Dave,
	Macieira, Thiago, Liu, Jing2, Shankar, Ravi V, linux-kernel

On Fri, Aug 13, 2021 at 08:04:54AM +0000, Bae, Chang Seok wrote:
> Yes, each state's offset in the non-compacted format is fixed on a given
> machine regardless of RFBM. So the size can simply be taken like that.

Comment above it please.

Also, why is this special case needed at all?

> But it might be better to simplify this hunk for readability. I
> suspect its call sites are not that performance-critical.

That's *exactly* what I'm driving at!

> Every task's state_mask should begin aligned with the default buffer.
> fpu_clone() sets this for all tasks except init_task.
> Maybe:
>     "Make sure init_task's state_mask is aligned with its __default_state"

Why "make sure"?

There's nothing to make sure - it is simply so that initially, the FPU
buffer used is the static one, without dynamic states. Just say that
instead.

> How about enlarge_xstate_buffer() or realloc_xstate_buffer()?

realloc is fine along with a proper explanation above it why the realloc
is done/needed.

> The query is intended to check whether the xstate buffer is fully expanded or
> not -- no need to enlarge.
> 
> If the buffer is already the maximum, the code to retrieve XSTATE_BV, this
> call, etc should be skipped there.  
> 
> If the query is moved here, I guess this call site code becomes a bit ugly.

Why does it become ugly?

You simply return early without touching the buffer at all.

> No, it is still pointed to by fpu->state and will be freed in the exit path.

Exit path of the task?

All I see is "return -ENOMEM" and no callers of alloc_xstate_buffer()
are calling free_xstate_buffer()...

And looking further into the patchset:

exc_device_not_available does not call free_xstate_buffer() I'm assuming

	force_sig_fault(SIGILL, ILL_ILLOPC,..

later will cause arch_release_task_struct() to happen which will call
free_xstate_buffer(). Yes, no?

I don't see any freeing in xstateregs_set() either, so what's happening
there when it returns -ENOMEM?

I guess there we remain with the old buffer, i.e., the ptrace operation
fails.

Am I close?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 08/26] x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer dynamically
  2021-08-13 10:04       ` Borislav Petkov
@ 2021-08-13 19:43         ` Bae, Chang Seok
  2021-08-18  9:28           ` Borislav Petkov
  0 siblings, 1 reply; 91+ messages in thread
From: Bae, Chang Seok @ 2021-08-13 19:43 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Hansen, Dave,
	Macieira, Thiago, Liu, Jing2, Shankar, Ravi V, linux-kernel

On Aug 13, 2021, at 03:04, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Aug 13, 2021 at 08:04:54AM +0000, Bae, Chang Seok wrote:
>> Yes, each state's offset in the non-compacted format is fixed on a given
>> machine regardless of RFBM. So the size can simply be taken like that.
> 
> Comment above it please.
> 
> Also, why is this special case needed at all?

Without the “compacted” notion in the function name, one might call this even
with !XSAVES. But chances are very low in practice.

>> The query is intended to check whether the xstate buffer is fully expanded or
>> not -- no need to enlarge.
>> 
>> If the buffer is already the maximum, the code to retrieve XSTATE_BV, this
>> call, etc should be skipped there.  
>> 
>> If the query is moved here, I guess this call site code becomes a bit ugly.
> 
> Why does it become ugly?
> 
> You simply return early without touching the buffer at all.

Perhaps the call site in the ptrace path becomes something like this:

+	if (xfeatures_mask_user_dynamic) {
+		u64 state_mask;
+
+		/* Retrieve XSTATE_BV. */
+		memcpy(&state_mask, (kbuf ?: tmpbuf) + offsetof(struct xregs_state, header),
+		       sizeof(u64));
+
+		/* Expand the xstate buffer based on the XSTATE_BV. */
+		ret = realloc_xstate_buffer(fpu, state_mask & xfeatures_mask_user_dynamic);
+		if (ret)
+			goto out;
+	}

Maybe retrieving XSTATE_BV is inevitable here. Then it is not that ugly.

>> No, it is still pointed by fpu->state and will be freed in the exit path.
> 
> Exit path of the task?
> 
> All I see is "return -ENOMEM" and no callers of alloc_xstate_buffer()
> are calling free_xstate_buffer()...
> 
> And looking further into the patchset:
> 
> exc_device_not_available does not call free_xstate_buffer() I'm assuming
> 
> 	force_sig_fault(SIGILL, ILL_ILLOPC,..
> 
> later will cause arch_release_task_struct() to happen which will call
> free_xstate_buffer(). Yes, no?

Yes.

> I don't see any freeing in xstateregs_set() either, so what's happening
> there when it returns -ENOMEM?
> 
> I guess there we remain with the old buffer, i.e., the ptrace operation
> fails.
> 
> Am I close?

In this case, the ptracer just failed to inject some context. But the
ptracee’s context in the (old) buffer is intact. It will resume and eventually
exit. I think arch_release_task_struct()->free_xstate_buffer() will take care
of the old buffer.

Thanks,
Chang




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 14/26] x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE
  2021-08-10  0:57         ` Bae, Chang Seok
@ 2021-08-13 19:44           ` Bae, Chang Seok
  0 siblings, 0 replies; 91+ messages in thread
From: Bae, Chang Seok @ 2021-08-13 19:44 UTC (permalink / raw)
  To: Macieira, Thiago
  Cc: bp, Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Hansen, Dave,
	Liu, Jing2, Shankar, Ravi V, linux-kernel

On Aug 9, 2021, at 17:57, Bae, Chang Seok <chang.seok.bae@intel.com> wrote:
> On Aug 9, 2021, at 16:42, Macieira, Thiago <thiago.macieira@intel.com> wrote:
>> 
>> This means the corruption can get worse since the rollback code can undo or 
>> partially undo the progression of the other ARCH_SET_STATE_ENABLE.
> 
> Maybe something like this can help here to ensure a valid rollback.

After reconsidering this, I think the group_leader task's permission value is
the reliable one. Perhaps reference the group_leader's permission everywhere,
instead of each task's. I think that resolves the corner case in a simple way.
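
A sketch of that direction (hypothetical helper; the field name follows the
posted patch):

/*
 * The group leader's copy is the single source of truth for the whole
 * process, so readers cannot observe a half-updated per-thread bitmap.
 */
static inline u64 xstate_get_group_perm(struct task_struct *tsk)
{
	return tsk->group_leader->thread.fpu.dynamic_state_perm;
}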

Thanks,
Chang


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 08/26] x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer dynamically
  2021-08-12 19:44   ` Borislav Petkov
  2021-08-13  8:04     ` Bae, Chang Seok
@ 2021-08-16 18:33     ` Bae, Chang Seok
  2021-08-16 18:53       ` Borislav Petkov
  1 sibling, 1 reply; 91+ messages in thread
From: Bae, Chang Seok @ 2021-08-16 18:33 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Hansen, Dave,
	Macieira, Thiago, Liu, Jing2, Shankar, Ravi V, linux-kernel

On Aug 12, 2021, at 12:44, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Jul 30, 2021 at 07:59:39AM -0700, Chang S. Bae wrote:
>> 
>> +	if (boot_cpu_has(X86_FEATURE_XSAVES))
> 
> cpu_feature_enabled

Without DISABLE_XSAVES or something under an ifdef CONFIG_X86_XX in
$arch/x86/include/asm/disable-features.h, I don't see what difference this
macro makes. Am I missing anything here? Or is boot_cpu_has() going to be
deprecated everywhere?

Thanks,
Chang


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 08/26] x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer dynamically
  2021-08-16 18:33     ` Bae, Chang Seok
@ 2021-08-16 18:53       ` Borislav Petkov
  0 siblings, 0 replies; 91+ messages in thread
From: Borislav Petkov @ 2021-08-16 18:53 UTC (permalink / raw)
  To: Bae, Chang Seok
  Cc: Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Hansen, Dave,
	Macieira, Thiago, Liu, Jing2, Shankar, Ravi V, linux-kernel

On Mon, Aug 16, 2021 at 06:33:37PM +0000, Bae, Chang Seok wrote:
> Without DISABLE_XSAVES or something under an ifdef CONFIG_X86_XX in
> $arch/x86/include/asm/disable-features.h, I don't see what difference this
> macro makes. Am I missing anything here? Or is boot_cpu_has() going to be
> deprecated everywhere?

There's:

cpu_has
this_cpu_has
cpu_feature_enabled
boot_cpu_has
static_cpu_has

All code where it doesn't matter which CPU, should use
cpu_feature_enabled() and simplicity will ensue in these here lands.
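
As a concrete example, the hunk reviewed earlier would become (sketch):

	/* can compile to a static branch and honors build-time disabled features */
	if (cpu_feature_enabled(X86_FEATURE_XSAVES))
		fpstate_init_xstate(&fpu->state->xsave, fpu->state_mask);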

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 08/26] x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer dynamically
  2021-08-13 19:43         ` Bae, Chang Seok
@ 2021-08-18  9:28           ` Borislav Petkov
  2021-08-18 19:46             ` Bae, Chang Seok
  0 siblings, 1 reply; 91+ messages in thread
From: Borislav Petkov @ 2021-08-18  9:28 UTC (permalink / raw)
  To: Bae, Chang Seok
  Cc: Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Hansen, Dave,
	Macieira, Thiago, Liu, Jing2, Shankar, Ravi V, linux-kernel

On Fri, Aug 13, 2021 at 07:43:53PM +0000, Bae, Chang Seok wrote:
> Without the “compacted” notion in the function name, one might
> call this even with !XSAVES. But chances are very low in practice.

So leave only the first two which are obvious and are more likely to
happen - the first one is going to be the most likely on non-dynamic
setups and the second one is on dynamic systems.

For all the other configurations, just do the loop and that's it.

*IF* an optimization needs to happen there, then it can happen later,
supplied with perf numbers to justify it.

> Perhaps the call site in the ptrace path becomes something like this:
> 
> +	if (xfeatures_mask_user_dynamic) {
> +		u64 state_mask;
> +
> +		/* Retrieve XSTATE_BV. */
> +		memcpy(&state_mask, (kbuf ?: tmpbuf) + offsetof(struct xregs_state, header),
> +		       sizeof(u64));
> +
> +		/* Expand the xstate buffer based on the XSTATE_BV. */
> +		ret = realloc_xstate_buffer(fpu, state_mask & xfeatures_mask_user_dynamic);
> +		if (ret)
> +			goto out;
> +	}
> 
> Maybe retrieving XSTATE_BV is inevitable here. Then it is not that ugly.

Lemme see if I can follow: here, a task is being ptraced and the tracer
process does PTRACE_SETREGS to set the xregs and you want to go and read
out the XSTATE_BV vector from the supplied xstate buffer to see how much
to enlarge the buffer.

Which makes me go, whut?

Why doesn't the task already have a large enough buffer?

IOW and IIUC, you should not have to ever resize the xstate buffer of a
task in ptrace.

> In this case, the ptracer just failed to inject some context. But the
> ptracee’s context in the (old) buffer is intact. It will resume and eventually
> exit. I think arch_release_task_struct()->free_xstate_buffer() will take care
> of the old buffer.

You think or you know?

How about verifying it.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 10/26] x86/fpu/xstate: Update the XSTATE buffer address finder to support dynamic states
  2021-07-30 14:59 ` [PATCH v9 10/26] x86/fpu/xstate: Update the XSTATE buffer address finder " Chang S. Bae
@ 2021-08-18 11:33   ` Borislav Petkov
  2021-08-18 19:47     ` Bae, Chang Seok
  0 siblings, 1 reply; 91+ messages in thread
From: Borislav Petkov @ 2021-08-18 11:33 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: luto, tglx, mingo, x86, len.brown, dave.hansen, thiago.macieira,
	jing2.liu, ravi.v.shankar, linux-kernel

On Fri, Jul 30, 2021 at 07:59:41AM -0700, Chang S. Bae wrote:
> __raw_xsave_addr() returns the requested component's pointer in an XSTATE
> buffer, by simply looking up the offset table. The offset used to be fixed,
> but, with dynamic user states, it becomes variable.
> 
> get_xstate_size() has a routine to find an offset at runtime. Refactor to
> use it for the address finder.
> 
> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
> Reviewed-by: Len Brown <len.brown@intel.com>
> Cc: x86@kernel.org
> Cc: linux-kernel@vger.kernel.org
> ---
> Changes from v5:
> * Updated for future proofed __raw_xsave_addr().
> 
> Changes from v3:
> * Added the function description in the kernel-doc style. (Borislav Petkov)
> * Removed 'no functional change' in the changelog. (Borislav Petkov)
> ---
>  arch/x86/kernel/fpu/xstate.c | 78 ++++++++++++++++++++++++------------
>  1 file changed, 53 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 26f6d5e0f1ed..98ab10e4da3b 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -182,6 +182,38 @@ static bool xfeature_is_supervisor(int xfeature_nr)
>  	return ecx & 1;
>  }
>  
> +/**
> + * get_xstate_comp_offset - Find the feature's offset in the compacted
> + *			    format.
> + * @mask:	This bitmap tells which components reserved in the format.

There's that "reserved" confusion thing. Rewrite pls.

> + * @feature_nr:	The feature number
> + *
> + * Returns:	The offset value
> + */
> +static unsigned int get_xstate_comp_offset(u64 mask, int feature_nr)
> +{
> +	u64 xmask = BIT_ULL(feature_nr + 1) - 1;
> +	unsigned int next_offset, offset = 0;
> +	int i;
> +
> +	if ((xfeatures_mask_all & xmask) == (mask & xmask))
> +		return xstate_comp_offsets[feature_nr];
> +
> +	/*
> +	 * With the given mask, no relevant size is found. Calculate it by
> +	 * summing up each state size.
> +	 */
> +	for (next_offset = FXSAVE_SIZE + XSAVE_HDR_SIZE, i = FIRST_EXTENDED_XFEATURE;
> +	     i <= feature_nr; i++) {
> +		if (!(mask & BIT_ULL(i)))
> +			continue;
> +
> +		offset = xstate_aligns[i] ? ALIGN(next_offset, 64) : next_offset;
> +		next_offset += xstate_sizes[i];

Why is this more complex than it has to be?

IOW, why can't you simply do:

        offset = FXSAVE_SIZE + XSAVE_HDR_SIZE;
        for (i = FIRST_EXTENDED_XFEATURE; i <= feature_nr; i++) {
                if (!(mask & BIT_ULL(i)))
                        continue;

                if (xstate_aligns[i])
                        offset = ALIGN(offset, 64);

                offset += xstate_sizes[i];
        }
        return offset;

like it was before?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 11/26] x86/fpu/xstate: Update the XSTATE context copy function to support dynamic states
  2021-07-30 14:59 ` [PATCH v9 11/26] x86/fpu/xstate: Update the XSTATE context copy function " Chang S. Bae
@ 2021-08-18 12:03   ` Borislav Petkov
  2021-08-18 19:47     ` Bae, Chang Seok
  0 siblings, 1 reply; 91+ messages in thread
From: Borislav Petkov @ 2021-08-18 12:03 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: luto, tglx, mingo, x86, len.brown, dave.hansen, thiago.macieira,
	jing2.liu, ravi.v.shankar, linux-kernel

On Fri, Jul 30, 2021 at 07:59:42AM -0700, Chang S. Bae wrote:
> ptrace() and signal return paths use XSTATE context copy functions. They
> allow callers to read (or write) XSTATE values in the target's buffer. With
> dynamic user states, a component's position in the buffer may vary and the
> init fpstate is not always large enough to cover all the states.
> 
> Adjust the helpers to find a component's offset correctly. Also, update the
> copy loop in the ptrace read path to support dynamic states.
> 
> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
> Reviewed-by: Len Brown <len.brown@intel.com>
> Cc: x86@kernel.org
> Cc: linux-kernel@vger.kernel.org
> ---
> Changes from v5:
> * Updated to ensure xstate_bv aligned with the target.
> * Rewrote the xstate copy loop, for the ptrace() read path, in an open
>   code.
> * Adjusted the changelog.
> 
> Changes from v3:
> * Cleaned up the code change with more comments.
> * Removed 'no functional change' in the changelog. (Borislav Petkov)
> 
> Changes from v2:
> * Updated the changelog with task->fpu removed. (Borislav Petkov)
> ---
>  arch/x86/kernel/fpu/xstate.c | 30 +++++++++++++++++++++++++-----
>  1 file changed, 25 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 98ab10e4da3b..3b56e7612c45 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -1273,6 +1273,7 @@ void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
>  	zerofrom = offsetof(struct xregs_state, extended_state_area);
>  
>  	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
> +		u64 mask = BIT_ULL(i);
>  		/*
>  		 * The ptrace buffer is in non-compacted XSAVE format.
>  		 * In non-compacted format disabled features still occupy
> @@ -1280,7 +1281,7 @@ void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
>  		 * compacted init_fpstate. The gap tracking will zero this
>  		 * later.
>  		 */
> -		if (!(xfeatures_mask_uabi() & BIT_ULL(i)))
> +		if (!(xfeatures_mask_uabi() & mask))
>  			continue;
>  
>  		/*
> @@ -1300,10 +1301,24 @@ void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
>  			pkru.pkru = tsk->thread.pkru;
>  			membuf_write(&to, &pkru, sizeof(pkru));
>  		} else {

So this chunk, the else branch starting here is begging to be a separate
function.

> -			copy_feature(header.xfeatures & BIT_ULL(i), &to,
> -				     __raw_xsave_addr(&tsk->thread.fpu, i),
> -				     __raw_xsave_addr(NULL, i),
> -				     xstate_sizes[i]);
> +			unsigned int size = xstate_sizes[i];
> +			void *from = NULL;
> +
> +			/*
> +			 * Copy the xstate if available. Otherwise, copy the
> +			 * non-zero init states for legacy states (FP and
> +			 * SSE) or fill zeros.
> +			 */
> +
> +			if (header.xfeatures & mask)
> +				from = __raw_xsave_addr(&tsk->thread.fpu, i);
> +			else if (XFEATURE_MASK_FPSSE & mask)

The i loop variable above starts from FIRST_EXTENDED_XFEATURE - why is
this XFEATURE_MASK_FPSSE check even here?

> +				from = __raw_xsave_addr(NULL, i);
> +
> +			if (from)
> +				membuf_write(&to, from, size);
> +			else
> +				membuf_zero(&to, size);
>  		}
>  		/*
>  		 * Keep track of the last copied state in the non-compacted
> @@ -1345,6 +1360,8 @@ static int copy_uabi_to_xstate(struct fpu *fpu, const void *kbuf,
>  	if (validate_user_xstate_header(&hdr))
>  		return -EINVAL;
>  
> +	hdr.xfeatures &= fpu->state_mask;
> +

This hunk looks arbitrary here and wants to be together with the patch
which adds ->state_mask.

>  	/* Validate MXCSR when any of the related features is in use */
>  	mask = XFEATURE_MASK_FP | XFEATURE_MASK_SSE | XFEATURE_MASK_YMM;
>  	if (hdr.xfeatures & mask) {
> @@ -1371,6 +1388,9 @@ static int copy_uabi_to_xstate(struct fpu *fpu, const void *kbuf,
>  		if (hdr.xfeatures & mask) {
>  			void *dst = __raw_xsave_addr(fpu, i);
>  
> +			if (!dst)
> +				continue;
> +
>  			offset = xstate_offsets[i];
>  			size = xstate_sizes[i];

I don't know where this hunk belongs to...

Maybe as a completely separate patch which fixes the case where
__raw_xsave_addr() can in the very unlikely event return NULL...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-07-30 14:59 ` [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state Chang S. Bae
@ 2021-08-18 16:24   ` Borislav Petkov
  2021-08-18 17:20     ` Thiago Macieira
                       ` (3 more replies)
  0 siblings, 4 replies; 91+ messages in thread
From: Borislav Petkov @ 2021-08-18 16:24 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: luto, tglx, mingo, x86, len.brown, dave.hansen, thiago.macieira,
	jing2.liu, ravi.v.shankar, linux-kernel

On Fri, Jul 30, 2021 at 07:59:43AM -0700, Chang S. Bae wrote:
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index d0ce5cfd3ac1..37150b7a8e44 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -277,6 +277,7 @@
>  #define X86_FEATURE_XSAVEC		(10*32+ 1) /* XSAVEC instruction */
>  #define X86_FEATURE_XGETBV1		(10*32+ 2) /* XGETBV with ECX = 1 instruction */
>  #define X86_FEATURE_XSAVES		(10*32+ 3) /* XSAVES/XRSTORS instructions */
> +#define X86_FEATURE_XFD			(10*32+ 4) /* eXtended Feature Disabling */
							     ^
							     |

Add "" at the marker above - it doesn't look like we wanna show "xfd" in
/proc/cpuinfo.
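
For reference, the cpufeatures.h convention being asked for looks like
this -- an empty quoted string at the start of the comment keeps the flag
out of /proc/cpuinfo:

	#define X86_FEATURE_XFD			(10*32+ 4) /* "" eXtended Feature Disabling */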

>   * Extended auxiliary flags: Linux defined - for features scattered in various
> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> index 263e349ff85a..e3590cf55325 100644
> --- a/arch/x86/include/asm/fpu/internal.h
> +++ b/arch/x86/include/asm/fpu/internal.h
> @@ -535,14 +535,55 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
>   * Misc helper functions:
>   */
>  
> +/* The Extended Feature Disable (XFD) helpers: */
> +
> +static inline void xfd_write(u64 value)
> +{
> +	wrmsrl_safe(MSR_IA32_XFD, value);
> +}
> +
> +static inline u64 xfd_read(void)
> +{
> +	u64 value;
> +
> +	rdmsrl_safe(MSR_IA32_XFD, &value);
> +	return value;
> +}

Those look useless. Imagine we had to add wrappers around *every* MSR we
touch in the kernel...

> +
> +static inline u64 xfd_capable(void)
> +{
> +	return xfeatures_mask_user_dynamic;
> +}

A small helper which returns an u64 but is used in boolean context?

Also, this name is wrong: XFD capable is a system which has
X86_FEATURE_XFD set. You should simply use xfeatures_mask_user_dynamic
everywhere since it is __ro_after_init.

> +/**
> + * xfd_switch - Switches the MSR IA32_XFD context if needed.
> + * @prev:	The previous task's struct fpu pointer
> + * @next:	The next task's struct fpu pointer
> + */
> +static inline void xfd_switch(struct fpu *prev, struct fpu *next)
> +{
> +	u64 prev_xfd_mask, next_xfd_mask;
> +
> +	if (!static_cpu_has(X86_FEATURE_XFD) || !xfd_capable())

cpu_feature_enabled(). Use that everywhere in your patchset. But you
know already...

> +		return;
> +
> +	prev_xfd_mask = prev->state_mask & xfd_capable();
> +	next_xfd_mask = next->state_mask & xfd_capable();

This is just plain misleading:

You're *AND*ing a mask with xfd_capable?!?

Just use xfeatures_mask_user_dynamic directly instead, as already
mentioned.

> +	if (unlikely(prev_xfd_mask != next_xfd_mask))
> +		xfd_write(xfd_capable() ^ next_xfd_mask);
> +}

Here too.

Also, I must be missing something. Let's play with some imaginary masks:

prev->state_mask = 110b
next->state_mask = 111b
dyn		 = 101b

("dyn" is short for xfeatures_mask_user_dynamic)

prev_xfd_mask = 100b
next_xfd_mask = 101b

if (unlikely(100b != 101b))
	xfd_write(101b ^ 101b) == xfd_write(0)

so next has bits 2 and 0 set but the xfd write zaps them so next won't
get any more #NMs for those states.

Why?

>  /*
>   * Delay loading of the complete FPU state until the return to userland.
>   * PKRU is handled separately.
>   */
> -static inline void switch_fpu_finish(struct fpu *new_fpu)
> +static inline void switch_fpu_finish(struct fpu *old_fpu, struct fpu *new_fpu)
>  {
> -	if (cpu_feature_enabled(X86_FEATURE_FPU))
> +	if (cpu_feature_enabled(X86_FEATURE_FPU)) {
>  		set_thread_flag(TIF_NEED_FPU_LOAD);
> +		xfd_switch(old_fpu, new_fpu);
> +	}
>  }
>  
>  #endif /* _ASM_X86_FPU_INTERNAL_H */
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index a7c413432b33..eac0cfd9210b 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -626,6 +626,8 @@
>  #define MSR_IA32_BNDCFGS_RSVD		0x00000ffc
>  
>  #define MSR_IA32_XSS			0x00000da0
> +#define MSR_IA32_XFD			0x000001c4
> +#define MSR_IA32_XFD_ERR		0x000001c5

At least try to keep those numerically sorted, at least among the
architectural MSR_IA32_ ones. That is, provided those XFD things are
architectural...

>  #define MSR_IA32_APICBASE		0x0000001b
>  #define MSR_IA32_APICBASE_BSP		(1<<8)
> diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
> index defda61f372d..7f891d2eb52e 100644
> --- a/arch/x86/kernel/cpu/cpuid-deps.c
> +++ b/arch/x86/kernel/cpu/cpuid-deps.c
> @@ -75,6 +75,7 @@ static const struct cpuid_dep cpuid_deps[] = {
>  	{ X86_FEATURE_SGX_LC,			X86_FEATURE_SGX	      },
>  	{ X86_FEATURE_SGX1,			X86_FEATURE_SGX       },
>  	{ X86_FEATURE_SGX2,			X86_FEATURE_SGX1      },
> +	{ X86_FEATURE_XFD,			X86_FEATURE_XSAVE     },
>  	{}
>  };
>  
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 3b56e7612c45..c6ff0575d87d 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -182,6 +182,26 @@ static bool xfeature_is_supervisor(int xfeature_nr)
>  	return ecx & 1;
>  }
>  
> +/**
> + * xfd_supported - Check if the feature supports Extended Feature Disable (XFD).
> + * @feature_nr:	The feature number.
> + *
> + * Returns:	True if supported; otherwise, false.
> + */
> +static bool xfd_supported(int feature_nr)

xfeature_supports_xfd()

> +{
> +	u32 eax, ebx, ecx, edx;
> +
> +	if (!boot_cpu_has(X86_FEATURE_XFD))
> +		return false;
> +
> +	/*
> +	 * If state component 'i' supports it, ECX[2] return 1; otherwise, 0.
> +	 */
> +	cpuid_count(XSTATE_CPUID, feature_nr, &eax, &ebx, &ecx, &edx);
> +	return ecx & 4;
> +}
> +
>  /**
>   * get_xstate_comp_offset - Find the feature's offset in the compacted
>   *			    format.
> @@ -274,6 +294,9 @@ void fpu__init_cpu_xstate(void)
>  		wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() |
>  				     xfeatures_mask_independent());
>  	}
> +
> +	if (boot_cpu_has(X86_FEATURE_XFD))
> +		xfd_write(xfd_capable());
>  }
>  
>  static bool xfeature_enabled(enum xfeature xfeature)
> @@ -473,8 +496,9 @@ static void __init print_xstate_offset_size(void)
>  	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
>  		if (!xfeature_enabled(i))
>  			continue;
> -		pr_info("x86/fpu: xstate_offset[%d]: %4d, xstate_sizes[%d]: %4d\n",
> -			 i, xstate_comp_offsets[i], i, xstate_sizes[i]);
> +		pr_info("x86/fpu: xstate_offset[%d]: %4d, xstate_sizes[%d]: %4d (%s)\n",
> +			i, xstate_comp_offsets[i], i, xstate_sizes[i],
> +			(xfeatures_mask_user_dynamic & BIT_ULL(i)) ? "dynamic" : "default");

Make that

			(xfeatures_mask_user_dynamic & BIT_ULL(i)) ? "(dynamic)" : "");

> @@ -920,6 +944,16 @@ void __init fpu__init_system_xstate(void)
>  	/* Do not support the dynamically allocated buffer yet. */
>  	xfeatures_mask_user_dynamic = 0;
>  
> +	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
> +		u64 feature_mask = BIT_ULL(i);
> +
> +		if (!(xfeatures_mask_uabi() & feature_mask))
> +			continue;
> +
> +		if (xfd_supported(i))
> +			xfeatures_mask_user_dynamic |= feature_mask;
> +	}
> +
>  	/* Enable xstate instructions to be able to continue with initialization: */
>  	fpu__init_cpu_xstate();
>  	err = init_xstate_size();
> @@ -981,6 +1015,12 @@ void fpu__resume_cpu(void)
>  		wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor()  |
>  				     xfeatures_mask_independent());
>  	}
> +
> +	if (boot_cpu_has(X86_FEATURE_XFD)) {
> +		u64 fpu_xfd_mask = current->thread.fpu.state_mask & xfd_capable();
> +
> +		xfd_write(xfd_capable() ^ fpu_xfd_mask);
> +	}
>  }
>  
>  /**
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 534b9fb7e7ee..b85fa499f195 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -97,6 +97,12 @@ void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
>  	*size = get_xstate_config(XSTATE_MIN_SIZE);
>  }
>  
> +void arch_release_task_struct(struct task_struct *task)
> +{
> +	if (cpu_feature_enabled(X86_FEATURE_FPU))
> +		free_xstate_buffer(&task->thread.fpu);
> +}
> +
>  /*
>   * Free thread data structures etc..
>   */
> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> index 4f2f54e1281c..7bd5d08eeb41 100644
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -213,7 +213,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>  
>  	this_cpu_write(current_task, next_p);
>  
> -	switch_fpu_finish(next_fpu);
> +	switch_fpu_finish(prev_fpu, next_fpu);
>  
>  	/* Load the Intel cache allocation PQR MSR. */
>  	resctrl_sched_in();
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index ec0d836a13b1..41c9855158d6 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -620,7 +620,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>  	this_cpu_write(current_task, next_p);
>  	this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
>  
> -	switch_fpu_finish(next_fpu);
> +	switch_fpu_finish(prev_fpu, next_fpu);
>  
>  	/* Reload sp0. */
>  	update_task_stack(next_p);
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index a58800973aed..dd66d528afd8 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -1112,6 +1112,45 @@ DEFINE_IDTENTRY(exc_device_not_available)
>  {
>  	unsigned long cr0 = read_cr0();
>  
> +	if (boot_cpu_has(X86_FEATURE_XFD)) {

This whole thing wants to be in a separate function. Even the
indentation level is begging for it.

> +		u64 xfd_err;
> +
> +		rdmsrl_safe(MSR_IA32_XFD_ERR, &xfd_err);
> +		wrmsrl_safe(MSR_IA32_XFD_ERR, 0);
> +
> +		if (xfd_err) {
> +			u64 xfd_event = xfd_err & xfd_capable();
> +
> +			if (WARN_ON(!xfd_event)) {
> +				/*
> +				 * Unexpected event is raised. But update XFD state to
> +				 * unblock the task.
> +				 */
> +				xfd_write(xfd_read() & ~xfd_err);

So AFAIU, xfd_err points to some other feature which caused this
exception.

So if that feature bit is set in XFD, you're clearing it here. Why?

So that it doesn't raise that #NM for it anymore?

This looks weird.

> +			} else {
> +				struct fpu *fpu = &current->thread.fpu;
> +				int err = -1;
> +
> +				/*
> +				 * Make sure not in interrupt context as handling a
> +				 * trap from userspace.
> +				 */
> +				if (!WARN_ON(in_interrupt())) {

I'm guessing that's supposed to stop people from using AMX and other
dynamic states in the kernel?

> +					err = alloc_xstate_buffer(fpu, xfd_event);
> +					if (!err)
> +						xfd_write((fpu->state_mask & xfd_capable()) ^
> +							  xfd_capable());
> +				}
> +
> +				/* Raise a signal when it failed to handle. */
> +				if (err)
> +					force_sig_fault(SIGILL, ILL_ILLOPC,
> +							error_get_trap_addr(regs));

Where is it documented that that configuration of SIG* types means,
failure to allocate the dynamic buffer?

To the general picture: why is this thing even allocating a buffer in #NM?

Why isn't the buffer pre-allocated for the process after latter having
done prctl() so that when an #NM happens, no allocation happens at all?

And with those buffers preallocated, all that XFD muck is not needed
either.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-18 16:24   ` Borislav Petkov
@ 2021-08-18 17:20     ` Thiago Macieira
  2021-08-18 17:46       ` Borislav Petkov
  2021-08-18 19:47     ` Bae, Chang Seok
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 91+ messages in thread
From: Thiago Macieira @ 2021-08-18 17:20 UTC (permalink / raw)
  To: Chang S. Bae, Borislav Petkov
  Cc: luto, tglx, mingo, x86, len.brown, dave.hansen, jing2.liu,
	ravi.v.shankar, linux-kernel

On Wednesday, 18 August 2021 09:24:51 PDT Borislav Petkov wrote:
> > +#define X86_FEATURE_XFD                        (10*32+ 4) /* eXtended
> > Feature Disabling */
> 
> Add "" at the marker above - it doesn't look like we wanna show "xfd" in
> /proc/cpuinfo.

Why not?

It could help diagnose why this code fails if XFD is somehow missing.
That can happen with hypervisors or future CPUs.

> > +                               /* Raise a signal when it failed to
> > handle. */ +                               if (err)
> > +                                       force_sig_fault(SIGILL,
> > ILL_ILLOPC,
> > +                                                      
> > error_get_trap_addr(regs));> 
> Where is it documented that that configuration of SIG* types means,
> failure to allocate the dynamic buffer?

This wasn't about the memory failure, but now that you've pointed it out, yes,
we do get a SIGILL when the kernel fails to allocate memory, too.

This is the same code path we get if the task executes an AMX instruction 
without first requesting support for it via the system call. At my request, 
Chang changed it from SIGSEGV to SIGILL, because that's the behaviour one 
would see if the kernel did not support AMX at all, hadn't enabled it in XCR0 
or the CPU didn't support the instructions.

I don't know how to best handle killing the application if the kernel is OOM 
(see below, though). Maybe it should SIGKILL instead. The problem with sending 
a SIGSEGV is debuggability: if I get a core dump of this crash, which is 
likely going to happen in a load instruction, I'll spend a lot of time trying to 
understand why the pointer in that instruction wasn't correct. Very few people 
will ever consider it may have another reason.

> To the general picture: why is this thing even allocating a buffer in #NM?
> 
> Why isn't the buffer pre-allocated for the process after latter having
> done prctl() so that when an #NM happens, no allocation happens at all?

That's a good question, but I thought it had been discussed and agreed that we 
didn't want to extend the buffers at the moment the application requested the 
bits, because it may never need them. This was probably a defence against 
applications requesting all bits without knowing whether they'll need them at 
all.

The way the API to userspace is implemented, the only way to find out if the 
kernel supports a given state is to enable it. It's not currently possible to 
ask "do you support AMX tile data?" and then go about the application's merry 
way until it determines it really wants to do matrix multiplications. In the 
case of applications with plugins, they need to have that answer before they 
load the plugin, which usually happens at application start.

I was going to suggest a new API to return the supported bits, but hadn't yet 
because it wasn't required for this patchset to work. So long as that API 
landed at or before the time a new bit was added, userspace would be able to 
cope. But if the kernel is going to allocate the bits at the moment of the 
system call *and* we wish for userspace not to request more than it really 
needs, then we'll need this extra API right now.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-18 17:20     ` Thiago Macieira
@ 2021-08-18 17:46       ` Borislav Petkov
  2021-08-18 17:58         ` Thiago Macieira
  2021-08-18 20:43         ` Bae, Chang Seok
  0 siblings, 2 replies; 91+ messages in thread
From: Borislav Petkov @ 2021-08-18 17:46 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: Chang S. Bae, luto, tglx, mingo, x86, len.brown, dave.hansen,
	jing2.liu, ravi.v.shankar, linux-kernel

On Wed, Aug 18, 2021 at 10:20:58AM -0700, Thiago Macieira wrote:
> Why not?
>
> It could help diagnosing why this code has a failure if XFD is somehow 
> missing. That can happen with hypervisors or future CPUs.

You're new to this...

tools/arch/x86/kcpuid/kcpuid.c should be used for all CPUID querying
needs.

<snipping the SIGILL question for now because it might become moot>

> That's a good question, but I thought it had been discussed and agreed that we 
> didn't want to extend the buffers at the moment the application requested the 
> bits, because it may never need them.

Huh? An application doing prctl(GIVE_ME_AMX) and then it might never
need it? That's only that application's fault then.

> This was probably a defence against applications requesting all bits
> without knowing whether they'll need them at all.

That sounds like a badly programmed application.

> The way the API to userspace is implemented, the only way to find
> out if the kernel supports a given state is to enable it. It's not
> currently possible to ask "do you support AMX tile data?"

Then our API needs improving. An app should be able to ask the kernel
"Do you support AMX?" get a proper answer and act accordingly.

> and then go about the application's merry way until it determines it
> really wants to do matrix multiplications. In the case of applications
> with plugins, they need to have that answer before the load the
> plugin, which usually happens at application start.

I don't see a problem with the app doing at load time:

A: Heey, kernel, do you support AMX?
K: Yes
A: Allocate a dynamic FPU buffer for me then pls.

> I was going to suggest a new API to return the supported bits, but
> hadn't yet because it wasn't required for this patchset to work.

I think you should. The important part is having the API good and
complete.

> So long as that API landed at or before the time a new bit was added,
> userspace would be able to cope. But if the kernel is going to
> allocate the bits at the moment of the system call *and* we wish for
> userspace not to request more than it really needs, then we'll need
> this extra API right now.

No no, once the API hits upstream, it is cast in stone. So it better
be done in full with the patchset, in one go. No later significant API
additions or changes, none especially after apps start using it.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-18 17:46       ` Borislav Petkov
@ 2021-08-18 17:58         ` Thiago Macieira
  2021-08-18 18:10           ` Borislav Petkov
  2021-08-18 20:43         ` Bae, Chang Seok
  1 sibling, 1 reply; 91+ messages in thread
From: Thiago Macieira @ 2021-08-18 17:58 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Chang S. Bae, luto, tglx, mingo, x86, len.brown, dave.hansen,
	jing2.liu, ravi.v.shankar, linux-kernel

On Wednesday, 18 August 2021 10:46:09 PDT Borislav Petkov wrote:
> You're new to this...
> 
> tools/arch/x86/kcpuid/kcpuid.c should be used for all CPUID querying
> needs.

That tells me what the CPU supports, not what the kernel does. By omitting the 
"xfd" entry in /proc/cpuinfo, we are assuming that all kernels with "amxtile" 
also implicitly support xfd. That is a valid assumption.

> I don't see a problem with the app doing at load time:
> 
> A: Heey, kernel, do you support AMX?
> K: Yes
> A: Allocate a dynamic FPU buffer for me then pls.

Many applications need to determine which plugins and code paths to enable 
before getting the data that will tell them what to do. It's entirely possible 
for them to never need to run the AMX instructions, so they may wish to defer 
the request to allocate the XSAVE state until they have read their input data.

It's indeed possible that the allocation then fails and the application be 
unable to continue. But OOM conditions are unlikely, so it may be an 
acceptable price to pay. In fact, by *not* allocating the extra state for 
every thread in the current process, it may avoid the OOM.

> > I was going to suggest a new API to return the supported bits, but
> > hadn't yet because it wasn't required for this patchset to work.
> 
> I think you should. The important part is having the API good and
> complete.
> 
> > So long as that API landed at or before the time a new bit was added,
> > userspace would be able to cope. But if the kernel is going to
> > allocate the bits at the moment of the system call *and* we wish for
> > userspace not to request more than it really needs, then we'll need
> > this extra API right now.
> 
> No no, once the API hits upstream, it is cast in stone. So it better
> be done in full with the patchset, in one go. No later significant API
> additions or changes, none especially after apps start using it.

Sorry, that's not what I meant. I was going to request an extra API, a third 
call. We'd have:
 - get current state
 - set new state
 - get available bits to set

The first two are in Chang's patch set, the third one is not. Right now, 
there's a single bit that can be set, so there's no need to have the third 
one. Any future software that wants to request a new bit will know if the 
kernel supports it by the very presence of the API. That is, if they ask and 
the API fails with -EINVAL, then this new bit isn't supported.

I didn't make the request because, as I said, it didn't seem required. 
Therefore, I didn't want to add further work before the minimum functionality 
got merged.

Now, if we are going to have this API anyway, it might be a good idea to 
combine the two getters into one by adding a second pointer parameter.
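
A minimal sketch of how the three calls might compose from userspace; the
arch_prctl(2) command names and numbers below are purely hypothetical
placeholders, not the ones in this patchset:

	#include <stdio.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Hypothetical arch_prctl(2) commands, for illustration only. */
	#define ARCH_GET_STATE_ENABLE	0x1021	/* get currently enabled bits */
	#define ARCH_SET_STATE_ENABLE	0x1022	/* request feature bits */
	#define ARCH_GET_STATE_AVAIL	0x1023	/* proposed: query supported bits */

	#define XFEATURE_MASK_XTILE	(3ULL << 17)	/* XTILECFG | XTILEDATA */

	int main(void)
	{
		unsigned long long avail = 0;

		/* Discover support up front, e.g. before loading plugins. */
		if (syscall(SYS_arch_prctl, ARCH_GET_STATE_AVAIL, &avail) ||
		    (avail & XFEATURE_MASK_XTILE) != XFEATURE_MASK_XTILE) {
			fprintf(stderr, "AMX not available\n");
			return 1;
		}

		/* Request the feature only once it is actually needed. */
		if (syscall(SYS_arch_prctl, ARCH_SET_STATE_ENABLE,
			    XFEATURE_MASK_XTILE))
			return 1;

		return 0;
	}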

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-18 17:58         ` Thiago Macieira
@ 2021-08-18 18:10           ` Borislav Petkov
  2021-08-24 22:51             ` Len Brown
  0 siblings, 1 reply; 91+ messages in thread
From: Borislav Petkov @ 2021-08-18 18:10 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: Chang S. Bae, luto, tglx, mingo, x86, len.brown, dave.hansen,
	jing2.liu, ravi.v.shankar, linux-kernel

On Wed, Aug 18, 2021 at 10:58:42AM -0700, Thiago Macieira wrote:
> That tells me what the CPU supports, not what the kernel does. By
> omitting the "xfd" entry in /proc/cpuinfo, we are assuming that all
> kernels with "amxtile" also implicitly support xfd. That is a valid
> assumption.

What relevance does it have for userspace whether the kernel
supports XFD or not?

IOW, userspace cares about AMX and the other features which are supposed
to use XFD - not how those features are implemented: whether with
faulting or with pre-allocation or whatever.

> Many applications need to determine which plugins and code paths to
> enable before getting the data that will tell them what to do. It's
> entirely possible for them to never need to run the AMX instructions,
> so they may wish to defer the request to allocate the XSAVE state
> until they have read their input data.
>
> It's indeed possible that the allocation then fails and the
> application be unable to continue. But OOM conditions are unlikely, so
> it may be an acceptable price to pay. In fact, by *not* allocating the
> extra state for every thread in the current process, it may avoid the
> OOM.

And?

That doesn't conflict with my suggestion. The app goes and asks the kernel
what it supports and then requests the buffers.

> Sorry, that's not what I meant. I was going to request an extra API, a third 
> call. We'd have:
>  - get current state
>  - set new state
>  - get available bits to set

Yes, this should have been the API from the very beginning. Of course
you need to be able to query what bits can be set at all.

> ...
> Now, if we are going to have this API any way, it might be a good
> idea to combine the two getters in one by adding a second pointer
> parameter.

Yeah, I'll get to that patch in the coming days and have a look. So far,
it only makes sense to have a querying API too so that we can provide
support for more "fat" features.

Unless Intel folks decide to stop using XSAVE for that - it was a bad
idea in the first place anyway, TBH - but it's not like hw people listen
to sw folk so...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 08/26] x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer dynamically
  2021-08-18  9:28           ` Borislav Petkov
@ 2021-08-18 19:46             ` Bae, Chang Seok
  2021-08-25 16:01               ` Bae, Chang Seok
  2021-08-30 17:07               ` Borislav Petkov
  0 siblings, 2 replies; 91+ messages in thread
From: Bae, Chang Seok @ 2021-08-18 19:46 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Hansen, Dave,
	Macieira, Thiago, Liu, Jing2, Shankar, Ravi V, linux-kernel

On Aug 18, 2021, at 02:28, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Aug 13, 2021 at 07:43:53PM +0000, Bae, Chang Seok wrote:
>> Without the “compacted” notion in the function name, one might
>> call this even with !XSAVES. But chances are very low in practice.
> 
> So leave only the first two which are obvious and are more likely to
> happen - the first one is going to be the most likely on non-dynamic
> setups and the second one is on dynamic systems.
> 
> For all the other configurations, just do the loop and that's it.
> 
> *IF* an optimization needs to happen there, then it can happen latter,
> supplied with perf numbers to justify it.

No, this non-compacted thing is not for optimization. The SDM is not quite
clear about the logic behind the non-compacted format -- some state's offset
does not always match the 'size + offset' of the previous one, even without
64B alignment. So, the loop is only for the compacted format, not the
non-compacted one. 

It was refactored for use in the new helper to find feature_nr's start point.
If the size is added up here, it is no longer 'i's start point.

>> Perhaps, the call site in the ptrace path becomes like this, I think:
>> 
>> +	if (xfeatures_mask_user_dynamic) {
>> +		u64 state_mask;
>> +
>> +		/* Retrieve XSTATE_BV. */
>> +		memcpy(&state_mask, (kbuf ?: tmpbuf) + offsetof(struct xregs_state, header),
>> +		       sizeof(u64));
>> +
>> +		/* Expand the xstate buffer based on the XSTATE_BV. */
>> +		ret = realloc_xstate_buffer(fpu, state_mask & xfeatures_mask_user_dynamic);
>> +		if (ret)
>> +			goto out;
>> +	}
>> 
>> Maybe retrieve XSTATE_BV is inevitable here. Then, it is not that ugly.
> 
> Lemme see if I can follow: here, a task is being ptraced and the tracer
> process does PTRACE_SETREGS to set the xregs and you want to go and read
> out the XSTATE_BV vector from the supplied xstate buffer to see how much
> to enlarge the buffer.
> 
> Which makes me go, whut?
> 
> Why doesn't the task already have a large enough buffer?
> 
> IOW and IIUC, you should not have to ever resize the xstate buffer of a
> task in ptrace.

Sorry, it looks like I missed adding the permission check in the above.

[ I saw the discussion has (re-)started for the allocation API though, assume
  that the resize happens transparently for now. ]

Say the ptracee has never used AMX -- it has a small buffer. Then, if the
ptracer attempts to inject tile data and the buffer is not resized here, it
fails. This precludes AMX state injection unless the ptracee has already used
the state. My concern is that this may confuse ptrace users.

>> In this case, the ptracer just failed to inject some context. But the
>> ptracee’s context in the (old) buffer is intact. It will resume and eventually
>> exit. I think arch_release_task_struct()->free_xstate_buffer() will take care
>> of the old buffer.
> 
> You think or you know?
> 
> How about verifying it.

Let me consider a case to mimic the situation somehow.

Thanks,
Chang




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 10/26] x86/fpu/xstate: Update the XSTATE buffer address finder to support dynamic states
  2021-08-18 11:33   ` Borislav Petkov
@ 2021-08-18 19:47     ` Bae, Chang Seok
  2021-08-30 17:18       ` Borislav Petkov
  0 siblings, 1 reply; 91+ messages in thread
From: Bae, Chang Seok @ 2021-08-18 19:47 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Hansen, Dave,
	Macieira, Thiago, Liu, Jing2, Shankar, Ravi V, linux-kernel

On Aug 18, 2021, at 04:33, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Jul 30, 2021 at 07:59:41AM -0700, Chang S. Bae wrote:
>> 
>> + * @feature_nr:	The feature number
>> + *
>> + * Returns:	The offset value
>> + */
>> +static unsigned int get_xstate_comp_offset(u64 mask, int feature_nr)
>> +{
>> +	u64 xmask = BIT_ULL(feature_nr + 1) - 1;
>> +	unsigned int next_offset, offset = 0;
>> +	int i;
>> +
>> +	if ((xfeatures_mask_all & xmask) == (mask & xmask))
>> +		return xstate_comp_offsets[feature_nr];
>> +
>> +	/*
>> +	 * With the given mask, no relevant size is found. Calculate it by
>> +	 * summing up each state size.
>> +	 */
>> +	for (next_offset = FXSAVE_SIZE + XSAVE_HDR_SIZE, i = FIRST_EXTENDED_XFEATURE;
>> +	     i <= feature_nr; i++) {
>> +		if (!(mask & BIT_ULL(i)))
>> +			continue;
>> +
>> +		offset = xstate_aligns[i] ? ALIGN(next_offset, 64) : next_offset;
>> +		next_offset += xstate_sizes[i];
> 
> Why is this more complex than it has to be?
> 
> IOW, why can't you simply do:
> 
>        offset = FXSAVE_SIZE + XSAVE_HDR_SIZE;
>        for (i = FIRST_EXTENDED_XFEATURE; i <= feature_nr; i++) {
>                if (!(mask & BIT_ULL(i)))
>                        continue;
> 
>                if (xstate_aligns[i])
>                        offset = ALIGN(offset, 64);
> 
>                offset += xstate_sizes[i];
>        }
>        return offset;
> 
> like it was before?

It was refactored for use in the new helper to find feature_nr's start point.
If the size is added up here, it is no longer 'i's start point.
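
A worked illustration of the difference (feature sizes invented for the
example): take consecutive features X and Y with xstate_sizes[X] = 256,
xstate_sizes[Y] = 64, no 64B alignment, and both bits set in the mask:

	base = FXSAVE_SIZE + XSAVE_HDR_SIZE

	two-variable loop (above):   offset(Y) = base + 256        /* Y's start */
	one-variable loop (Boris's): offset(Y) = base + 256 + 64   /* Y's end */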

Thanks,
Chang


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 11/26] x86/fpu/xstate: Update the XSTATE context copy function to support dynamic states
  2021-08-18 12:03   ` Borislav Petkov
@ 2021-08-18 19:47     ` Bae, Chang Seok
  0 siblings, 0 replies; 91+ messages in thread
From: Bae, Chang Seok @ 2021-08-18 19:47 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Hansen, Dave,
	Macieira, Thiago, Liu, Jing2, Shankar, Ravi V, linux-kernel

On Aug 18, 2021, at 05:03, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Jul 30, 2021 at 07:59:42AM -0700, Chang S. Bae wrote:
>> 
>> -			copy_feature(header.xfeatures & BIT_ULL(i), &to,
>> -				     __raw_xsave_addr(&tsk->thread.fpu, i),
>> -				     __raw_xsave_addr(NULL, i),
>> -				     xstate_sizes[i]);
>> +			unsigned int size = xstate_sizes[i];
>> +			void *from = NULL;
>> +
>> +			/*
>> +			 * Copy the xstate if available. Otherwise, copy the
>> +			 * non-zero init states for legacy states (FP and
>> +			 * SSE) or fill zeros.
>> +			 */
>> +
>> +			if (header.xfeatures & mask)
>> +				from = __raw_xsave_addr(&tsk->thread.fpu, i);
>> +			else if (XFEATURE_MASK_FPSSE & mask)
> 
> The i loop variable above starts from FIRST_EXTENDED_XFEATURE - why is
> this XFEATURE_MASK_FPSSE check even here?

!(header.xfeatures & mask) means the init state should be copied. Except for
the legacy states (FP and SSE), the init value is zero (as also noted here
[1]). So, check this to copy the correct init data when the current iteration
covers the legacy states.

At least, I may need to improve the readability here.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n416

Thanks,
Chang


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-18 16:24   ` Borislav Petkov
  2021-08-18 17:20     ` Thiago Macieira
@ 2021-08-18 19:47     ` Bae, Chang Seok
  2021-08-24 22:21     ` Len Brown
  2021-08-24 23:17     ` Len Brown
  3 siblings, 0 replies; 91+ messages in thread
From: Bae, Chang Seok @ 2021-08-18 19:47 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Hansen, Dave,
	Macieira, Thiago, Liu, Jing2, Shankar, Ravi V, linux-kernel

[ Cut out the API comments and other obvious ones here. ]

On Aug 18, 2021, at 09:24, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Jul 30, 2021 at 07:59:43AM -0700, Chang S. Bae wrote:
> 
>> +/**
>> + * xfd_switch - Switches the MSR IA32_XFD context if needed.
>> + * @prev:	The previous task's struct fpu pointer
>> + * @next:	The next task's struct fpu pointer
>> + */
>> +static inline void xfd_switch(struct fpu *prev, struct fpu *next)
>> +{
>> +	u64 prev_xfd_mask, next_xfd_mask;
>> +
>> +	if (!static_cpu_has(X86_FEATURE_XFD) || !xfd_capable())
> 
> cpu_feature_enabled(). Use that everywhere in your patchset. But you
> know already...

Yes, I did on my local.

>> +		return;
>> +
>> +	prev_xfd_mask = prev->state_mask & xfd_capable();
>> +	next_xfd_mask = next->state_mask & xfd_capable();
> 
> This is just plain misleading:
> 
> You're *AND*ing a mask with xfd_capable?!?
> 
> Just use xfeatures_mask_user_dynamic directly instead, as already
> mentioned.

Okay.

>> +	if (unlikely(prev_xfd_mask != next_xfd_mask))
>> +		xfd_write(xfd_capable() ^ next_xfd_mask);
>> +}
> 
> Here too.
> 
> Also, I must be missing something. Let's play with some imaginary masks:
> 
> prev->state_mask = 110b
> next->state_mask = 111b
> dyn		 = 101b
> 
> ("dyn" is short for xfeatures_mask_user_dynamic)
> 
> prev_xfd_mask = 100b
> next_xfd_mask = 101b
> 
> if (unlikely(100b != 101b))
> 	xfd_write(101b ^ 101b) == xfd_write(0)
> 
> so next has bits 2 and 0 set but the xfd write zaps them so next won't
> get any more #NMs for those states.
> 
> Why?

Because the next task has already fully expanded its buffer -- its state_mask
covers all of xfeatures_mask_user_dynamic.

No more XFD event is needed for the task.
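
Restating with the example masks from earlier in the thread:

	dyn              = 101b
	next->state_mask = 111b
	next_xfd_mask    = 101b

	xfd_write(101b ^ 101b) == xfd_write(0)

i.e., XFD is fully disarmed for the next task precisely because its buffer
already holds every dynamic state, so no further #NM is wanted.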

>> 
>> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>> index a58800973aed..dd66d528afd8 100644
>> --- a/arch/x86/kernel/traps.c
>> +++ b/arch/x86/kernel/traps.c
>> @@ -1112,6 +1112,45 @@ DEFINE_IDTENTRY(exc_device_not_available)
>> {
>> 	unsigned long cr0 = read_cr0();
>> 
>> +	if (boot_cpu_has(X86_FEATURE_XFD)) {
> 
> This whole thing wants to be in a separate function. Even the
> indentation level is begging for it.

Ah, it was in a separate function until v4. It was trimmed down quite a bit
in v5 and has grown back since.

Let me fix this.
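
A rough sketch of pulling that handling out into its own function, reusing
only the logic from the hunk above plus the cpu_feature_enabled() and
xfeatures_mask_user_dynamic cleanups already agreed on (the helper name is
illustrative):

	static bool handle_xfd_event(struct pt_regs *regs)
	{
		u64 xfd_err, xfd_event;
		struct fpu *fpu;
		int err = -1;

		if (!cpu_feature_enabled(X86_FEATURE_XFD))
			return false;

		rdmsrl_safe(MSR_IA32_XFD_ERR, &xfd_err);
		wrmsrl_safe(MSR_IA32_XFD_ERR, 0);
		if (!xfd_err)
			return false;

		xfd_event = xfd_err & xfeatures_mask_user_dynamic;
		if (WARN_ON(!xfd_event)) {
			/* Unexpected event; clear the bits to unblock the task. */
			xfd_write(xfd_read() & ~xfd_err);
			return true;
		}

		fpu = &current->thread.fpu;

		/* Only handle a trap from userspace, not interrupt context. */
		if (!WARN_ON(in_interrupt())) {
			err = alloc_xstate_buffer(fpu, xfd_event);
			if (!err)
				xfd_write((fpu->state_mask & xfeatures_mask_user_dynamic) ^
					  xfeatures_mask_user_dynamic);
		}

		/* Raise a signal when the event could not be handled. */
		if (err)
			force_sig_fault(SIGILL, ILL_ILLOPC,
					error_get_trap_addr(regs));

		return true;
	}

with exc_device_not_available() then starting with:

	if (handle_xfd_event(regs))
		return;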

>> +		u64 xfd_err;
>> +
>> +		rdmsrl_safe(MSR_IA32_XFD_ERR, &xfd_err);
>> +		wrmsrl_safe(MSR_IA32_XFD_ERR, 0);
>> +
>> +		if (xfd_err) {
>> +			u64 xfd_event = xfd_err & xfd_capable();
>> +
>> +			if (WARN_ON(!xfd_event)) {
>> +				/*
>> +				 * Unexpected event is raised. But update XFD state to
>> +				 * unblock the task.
>> +				 */
>> +				xfd_write(xfd_read() & ~xfd_err);
> 
> So AFAIU, xfd_err points to some other feature which caused this
> exception.
> 
> So if that feature bit is set in XFD, you're clearing it here. Why?
> 
> So that it doesn't raise that #NM for it anymore?
> 
> This looks weird.

If this ever happens, something might be wrong with the hardware.

If the bit is not reset, the task will get stuck with repeated, unhandled
#NMs, which I think is even worse.

>> +			} else {
>> +				struct fpu *fpu = &current->thread.fpu;
>> +				int err = -1;
>> +
>> +				/*
>> +				 * Make sure not in interrupt context as handling a
>> +				 * trap from userspace.
>> +				 */
>> +				if (!WARN_ON(in_interrupt())) {
> 
> I'm guessing that's supposed to stop people from using AMX and other
> dynamic states in the kernel?

But the kernel can handle this without XFD?

Thanks,
Chang



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-18 17:46       ` Borislav Petkov
  2021-08-18 17:58         ` Thiago Macieira
@ 2021-08-18 20:43         ` Bae, Chang Seok
  2021-08-18 21:04           ` Thiago Macieira
  2021-08-18 21:17           ` Borislav Petkov
  1 sibling, 2 replies; 91+ messages in thread
From: Bae, Chang Seok @ 2021-08-18 20:43 UTC (permalink / raw)
  To: Borislav Petkov, Macieira, Thiago
  Cc: Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Hansen, Dave,
	Liu, Jing2, Shankar, Ravi V, linux-kernel

On Aug 18, 2021, at 10:46, Borislav Petkov <bp@alien8.de> wrote:
> On Wed, Aug 18, 2021 at 10:20:58AM -0700, Thiago Macieira wrote
>> The way the API to userspace is implemented, the only way to find
>> out if the kernel supports a given state is to enable it. It's not
>> currently possible to ask "do you support AMX tile data?"
> 
> Then our API needs improving. An app should be able to ask the kernel
> "Do you support AMX?" get a proper answer and act accordingly.

Maybe I’m missing something, but I wonder what’s the difference from reading
XCR0.

Thanks,
Chang

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-18 20:43         ` Bae, Chang Seok
@ 2021-08-18 21:04           ` Thiago Macieira
  2021-08-18 21:12             ` Bae, Chang Seok
  2021-08-19  1:21             ` Andy Lutomirski
  2021-08-18 21:17           ` Borislav Petkov
  1 sibling, 2 replies; 91+ messages in thread
From: Thiago Macieira @ 2021-08-18 21:04 UTC (permalink / raw)
  To: Borislav Petkov, Bae, Chang Seok
  Cc: Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Hansen, Dave,
	Liu, Jing2, Shankar, Ravi V, linux-kernel

On Wednesday, 18 August 2021 13:43:50 PDT Bae, Chang Seok wrote:
> > Then our API needs improving. An app should be able to ask the kernel
> > "Do you support AMX?" get a proper answer and act accordingly.
> 
> Maybe I’m missing something, but I wonder what’s the difference from
> reading  XCR0.

That assumes the kernel will always enable the bits in XCR0, like it is doing 
today and with your patch, because modifying it is a VM exit.

But it's not the only possible solution. A future kernel could decide to leave 
some bits off and only enable upon request. That's how macOS/Darwin does its 
AVX512 support.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-18 21:04           ` Thiago Macieira
@ 2021-08-18 21:12             ` Bae, Chang Seok
  2021-08-18 22:27               ` Thiago Macieira
  2021-08-19  1:21             ` Andy Lutomirski
  1 sibling, 1 reply; 91+ messages in thread
From: Bae, Chang Seok @ 2021-08-18 21:12 UTC (permalink / raw)
  To: Macieira, Thiago
  Cc: Borislav Petkov, Lutomirski, Andy, tglx, mingo, x86, Brown, Len,
	Hansen, Dave, Liu, Jing2, Shankar, Ravi V, linux-kernel

On Aug 18, 2021, at 14:04, Thiago Macieira <thiago.macieira@intel.com> wrote:
> But it's not the only possible solution. A future kernel could decide to leave 
> some bits off and only enable upon request. That's how macOS/Darwin does its 
> AVX512 support.

Even if XCR0 is ever switched, doesn’t XGETBV(0) return it for the *current*
task?

Thanks,
Chang

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-18 20:43         ` Bae, Chang Seok
  2021-08-18 21:04           ` Thiago Macieira
@ 2021-08-18 21:17           ` Borislav Petkov
  2021-08-18 21:37             ` Bae, Chang Seok
  2021-08-24 23:22             ` Len Brown
  1 sibling, 2 replies; 91+ messages in thread
From: Borislav Petkov @ 2021-08-18 21:17 UTC (permalink / raw)
  To: Bae, Chang Seok
  Cc: Macieira, Thiago, Lutomirski, Andy, tglx, mingo, x86, Brown, Len,
	Hansen, Dave, Liu, Jing2, Shankar, Ravi V, linux-kernel

On Wed, Aug 18, 2021 at 08:43:50PM +0000, Bae, Chang Seok wrote:
> Maybe I’m missing something, but I wonder what’s the difference
> from reading XCR0.

Wny, because adding another prctl() is too damn hard?

What if this modus operandi of features userspace can use with kernel
assistance but need an explicit request and are off otherwise, gets
extended beyond XSAVE-managed features?

You would wish then that you had defined a

	prctl(GET_FEATURES_WITH_KERNEL_ASSISTANCE);

at the time...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-18 21:17           ` Borislav Petkov
@ 2021-08-18 21:37             ` Bae, Chang Seok
  2021-08-19  8:00               ` Borislav Petkov
  2021-08-24 23:22             ` Len Brown
  1 sibling, 1 reply; 91+ messages in thread
From: Bae, Chang Seok @ 2021-08-18 21:37 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Macieira, Thiago, Lutomirski, Andy, tglx, mingo, x86, Brown, Len,
	Hansen, Dave, Liu, Jing2, Shankar, Ravi V, linux-kernel

On Aug 18, 2021, at 14:17, Borislav Petkov <bp@alien8.de> wrote:
> On Wed, Aug 18, 2021 at 08:43:50PM +0000, Bae, Chang Seok wrote:
>> Maybe I’m missing something, but I wonder what’s the difference
>> from reading XCR0.
> 
> Wny, because adding another prctl() is too damn hard?

Well, IIUC, that would merely do XGETBV(0) in the kernel instead of in userspace.

> What if this modus operandi of features userspace can use with kernel
> assistance but need an explicit request and are off otherwise, gets
> extended beyond XSAVE-managed features?

What if it never happens? It will be just the same as XGETBV(0). I think, on
the flip side, there is also a benefit to keeping the API as simple as possible.
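
For context, a minimal sketch of the userspace-side XGETBV(0) query being
referred to (standard OSXSAVE/XGETBV detection; the AMX bit positions are
XTILECFG=17 and XTILEDATA=18 per the SDM):

	#include <cpuid.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx, lo, hi;
		unsigned long long xcr0;

		/* CPUID.1:ECX.OSXSAVE[bit 27] must be set before XGETBV. */
		if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) ||
		    !(ecx & (1u << 27)))
			return 1;

		/* XGETBV with ECX=0 returns XCR0 in EDX:EAX. */
		__asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
		xcr0 = ((unsigned long long)hi << 32) | lo;

		printf("AMX tile state enabled: %s\n",
		       (xcr0 & (3ULL << 17)) == (3ULL << 17) ? "yes" : "no");
		return 0;
	}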

Thanks,
Chang

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-18 21:12             ` Bae, Chang Seok
@ 2021-08-18 22:27               ` Thiago Macieira
  0 siblings, 0 replies; 91+ messages in thread
From: Thiago Macieira @ 2021-08-18 22:27 UTC (permalink / raw)
  To: Bae, Chang Seok
  Cc: Borislav Petkov, Lutomirski, Andy, tglx, mingo, x86, Brown, Len,
	Hansen, Dave, Liu, Jing2, Shankar, Ravi V, linux-kernel

On Wednesday, 18 August 2021 14:12:06 PDT Bae, Chang Seok wrote:
> On Aug 18, 2021, at 14:04, Thiago Macieira <thiago.macieira@intel.com>
> wrote:
> > But it's not the only possible solution. A future kernel could decide to
> > leave some bits off and only enable upon request. That's how
> > macOS/Darwin does its AVX512 support.
> 
> 
> Even if XCR0 is ever switched, doesn’t XGETBV(0) return it for the
> *current*  task?

That's the point. If the kernel decides that feature bit 19 will be left off 
in XCR0, how shall userspace know the kernel supports the feature through the 
arch_prctl syscall you added?

Not that I am advising we adopt this strategy. We don't need more 
fragmentation on how we enable the features. But having this syscall gives us 
flexibility in case we do need it in the future.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-18 21:04           ` Thiago Macieira
  2021-08-18 21:12             ` Bae, Chang Seok
@ 2021-08-19  1:21             ` Andy Lutomirski
  2021-08-19 16:06               ` Thiago Macieira
  1 sibling, 1 reply; 91+ messages in thread
From: Andy Lutomirski @ 2021-08-19  1:21 UTC (permalink / raw)
  To: Thiago Macieira, Borislav Petkov, Bae, Chang Seok
  Cc: Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Brown,
	Len, Dave Hansen, Liu, Jing2, Shankar, Ravi V,
	Linux Kernel Mailing List



On Wed, Aug 18, 2021, at 2:04 PM, Thiago Macieira wrote:
> On Wednesday, 18 August 2021 13:43:50 PDT Bae, Chang Seok wrote:
> > > Then our API needs improving. An app should be able to ask the kernel
> > > "Do you support AMX?" get a proper answer and act accordingly.
> > 
> > Maybe I’m missing something, but I wonder what’s the difference from
> > reading  XCR0.
> 
> That assumes the kernel will always enable the bits in XCR0, like it is doing 
> today and with your patch, because modifying it is a VM exit.
> 
> But it's not the only possible solution. A future kernel could decide to leave 
> some bits off and only enable upon request. That's how macOS/Darwin does its 
> AVX512 support.

The fact that Darwin does this strongly suggests that real programs can handle it, which increases my inclination for Linux to do the same thing.

> 
> -- 
> Thiago Macieira - thiago.macieira (AT) intel.com
>   Software Architect - Intel DPG Cloud Engineering
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-18 21:37             ` Bae, Chang Seok
@ 2021-08-19  8:00               ` Borislav Petkov
  2021-08-19 15:24                 ` Bae, Chang Seok
  0 siblings, 1 reply; 91+ messages in thread
From: Borislav Petkov @ 2021-08-19  8:00 UTC (permalink / raw)
  To: Bae, Chang Seok
  Cc: Macieira, Thiago, Lutomirski, Andy, tglx, mingo, x86, Brown, Len,
	Hansen, Dave, Liu, Jing2, Shankar, Ravi V, linux-kernel

On Wed, Aug 18, 2021 at 09:37:58PM +0000, Bae, Chang Seok wrote:
> What if it never happens? It will be just the same as XGETBV(0). I
> think on the flip side there is also a benefit of maintaining a simple
> API as possible.

Dude, why are you still pointlessly harping on this?

How is adding another trivial prctl, which will simply forward XCR0 for now,
making the API more complex?

If you don't wanna do it just say so - someone else will.

Geez.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-19  8:00               ` Borislav Petkov
@ 2021-08-19 15:24                 ` Bae, Chang Seok
  0 siblings, 0 replies; 91+ messages in thread
From: Bae, Chang Seok @ 2021-08-19 15:24 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Macieira, Thiago, Lutomirski, Andy, tglx, mingo, x86, Brown, Len,
	Hansen, Dave, Liu, Jing2, Shankar, Ravi V, linux-kernel

On Aug 19, 2021, at 01:00, Borislav Petkov <bp@alien8.de> wrote:
> If you don't wanna do it just say so - someone else will.

Okay, looks like you’re so sure about it.

Thanks,
Chang

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-19  1:21             ` Andy Lutomirski
@ 2021-08-19 16:06               ` Thiago Macieira
  0 siblings, 0 replies; 91+ messages in thread
From: Thiago Macieira @ 2021-08-19 16:06 UTC (permalink / raw)
  To: Borislav Petkov, Bae, Chang Seok, Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers, Brown,
	Len, Dave Hansen, Liu, Jing2, Shankar, Ravi V,
	Linux Kernel Mailing List

On Wednesday, 18 August 2021 18:21:11 PDT Andy Lutomirski wrote:
> But it's not the only possible solution. A future kernel could decide to
> leave
> > some bits off and only enable upon request. That's how macOS/Darwin does
> > its AVX512 support.
> 
> The fact that Darwin does this strongly suggests that real programs can
> handle it, which increases my inclination for Linux to do the same thing.

Yes and no... yes, programs could be made to handle this. I've reached out to
the Intel team responsible for the manual's instructions on how to detect
AVX512 and AMX, so the content gets extended to say there's an OS-specific
part that software developers need to be aware of. But until then, it's not
very discoverable. As a result, there's plenty of software that could enable
AVX512 on the Xeon-based Mac Pros but never does, because the developers
didn't know there was more to it than what the manual said. But the worst case
that can happen here is that the software gracefully falls back to AVX2 or an
earlier instruction set (unlike the Linux solution).

No, because XSETBV causes a VM exit, so we don't want to execute it on a
context switch, for performance reasons. That has probably not been a concern
for Apple developers, but it is for Linux.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-18 16:24   ` Borislav Petkov
  2021-08-18 17:20     ` Thiago Macieira
  2021-08-18 19:47     ` Bae, Chang Seok
@ 2021-08-24 22:21     ` Len Brown
  2021-08-30 17:41       ` Borislav Petkov
  2021-08-24 23:17     ` Len Brown
  3 siblings, 1 reply; 91+ messages in thread
From: Len Brown @ 2021-08-24 22:21 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Chang S. Bae, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, thiago.macieira, Liu, Jing2,
	Ravi V. Shankar, Linux Kernel Mailing List

On Wed, Aug 18, 2021 at 12:24 PM Borislav Petkov <bp@alien8.de> wrote:

> > diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> > index a7c413432b33..eac0cfd9210b 100644
> > --- a/arch/x86/include/asm/msr-index.h
> > +++ b/arch/x86/include/asm/msr-index.h
> > @@ -626,6 +626,8 @@
> >  #define MSR_IA32_BNDCFGS_RSVD                0x00000ffc
> >
> >  #define MSR_IA32_XSS                 0x00000da0
> > +#define MSR_IA32_XFD                 0x000001c4
> > +#define MSR_IA32_XFD_ERR             0x000001c5
>
> At least try to keep those numerically sorted, at least among the
> architectural MSR_IA32_ ones.

agreed

> That is, provided those XFD things are architectural...

Yes.
MSR_IA32_XFD and MSR_IA32_XFD_ERR are architectural.

(which is why they follow the convention of having an "IA32" in their name)

https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html

Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-18 18:10           ` Borislav Petkov
@ 2021-08-24 22:51             ` Len Brown
  0 siblings, 0 replies; 91+ messages in thread
From: Len Brown @ 2021-08-24 22:51 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thiago Macieira, Chang S. Bae, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, X86 ML, Brown, Len, Dave Hansen, Liu, Jing2,
	Ravi V. Shankar, Linux Kernel Mailing List

On Wed, Aug 18, 2021 at 2:09 PM Borislav Petkov <bp@alien8.de> wrote:

> What relevance does the fact have for userspace whether the kernel
> supports XFD or not?

Right.
If user space needs to know that XFD exists, then we have done
something very wrong.

-- 
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-18 16:24   ` Borislav Petkov
                       ` (2 preceding siblings ...)
  2021-08-24 22:21     ` Len Brown
@ 2021-08-24 23:17     ` Len Brown
  2021-08-30 17:53       ` Borislav Petkov
  2021-08-30 18:04       ` Dave Hansen
  3 siblings, 2 replies; 91+ messages in thread
From: Len Brown @ 2021-08-24 23:17 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Chang S. Bae, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, thiago.macieira, Liu, Jing2,
	Ravi V. Shankar, Linux Kernel Mailing List

On Wed, Aug 18, 2021 at 12:24 PM Borislav Petkov <bp@alien8.de> wrote:

> Why isn't the buffer pre-allocated for the process after latter having
> done prctl() so that when an #NM happens, no allocation happens at all?

The problem with a system call to pre-allocate an AMX context switch buffer
is that it doesn't actually deliver on the goal of guaranteeing no subsequent
run-time failures due to OOM.

Even if your AMX thread pool threads were to invoke this system call
as soon as possible...
Who is to say that the thread pool is created only at a time when memory
is available?  A thread could be created 24 hours into program execution
under OOM conditions; this system call will return ENOMEM, and your
program will in all likelihood throw up its hands and exit at the exact
same place it would exit for transparently allocated buffers.

What if you don't care about 24 hours in, and you do care about
allocating at program start?
The program can equally cause the kernel to allocate an AMX context switch
buffer by simply touching a TMM register -- and this can be done at exactly
the same place in the program where it would call a pre-allocation system call.

The difference between these two methods is that the system call returns
a synchronous ENOMEM, while touching a TMM register sends you a signal at
that location.
In theory, a program may have a thoughtfully implemented and thoroughly tested
*else* clause for ENOMEM -- but you and I know that is a fantasy --
it will exit anyway.

The advantage of #NM over the syscall is that the programmer doesn't
actually have to do anything.  Also, transparently allocated buffers offer
a theoretical benefit: a program may have many threads, but only a few
may actually touch AMX, so there are savings to be had by allocating
buffers only for the threads that actually use them.

Finally, the XFD/#NM mechanism opens the door for the kernel to someday
harvest allocated but unused buffers -- though we didn't bother
implementing that for AMX.

> And with those buffers preallocated, all that XFD muck is not needed either.

Independent of context switch buffer allocation...

XFD is used to *enforce* that AMX is not used without permission.
Were we to not use the XFD feature, users would be able to stash
data in TMM registers and even use TMUL without the kernel
being able to prevent them from doing so.  Then when they
context switch or take a signal, the data in their TMM registers
would mysteriously vanish...

Much better to be able to tell them immediately that they are doing it wrong...
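
To make that concrete, the whole flow is roughly the sketch below.  This
is a simplified illustration, not the actual patch code: the permission
field and xfd_clear_bits() are stand-in names (alloc_xstate_buffer() is
the real helper from this series):

/* Simplified sketch of the XFD #NM handler -- illustrative names. */
static void handle_xfd_nm(void)
{
	u64 xfd_err;

	/* MSR_IA32_XFD_ERR reports which disarmed feature was touched. */
	rdmsrl(MSR_IA32_XFD_ERR, xfd_err);
	wrmsrl(MSR_IA32_XFD_ERR, 0);

	/* No prior permission request: enforce, don't allocate. */
	if (!(xfd_err & current->group_leader->thread.fpu.state_perm)) {
		force_sig(SIGILL);
		return;
	}

	/* Expand this task's context switch buffer for the new state. */
	if (alloc_xstate_buffer(&current->thread.fpu, xfd_err)) {
		force_sig(SIGSEGV);	/* allocation failed at the fault */
		return;
	}

	/* Clear this feature's XFD bit so the retried instruction runs. */
	xfd_clear_bits(xfd_err);
}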

-- 
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-18 21:17           ` Borislav Petkov
  2021-08-18 21:37             ` Bae, Chang Seok
@ 2021-08-24 23:22             ` Len Brown
  2021-08-30 17:31               ` Borislav Petkov
  1 sibling, 1 reply; 91+ messages in thread
From: Len Brown @ 2021-08-24 23:22 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Bae, Chang Seok, Macieira, Thiago, Lutomirski, Andy, tglx, mingo,
	x86, Brown, Len, Hansen, Dave, Liu, Jing2, Shankar, Ravi V,
	linux-kernel

On Wed, Aug 18, 2021 at 5:16 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Wed, Aug 18, 2021 at 08:43:50PM +0000, Bae, Chang Seok wrote:
> > Maybe I’m missing something, but I wonder what’s the difference
> > from reading XCR0.
>
> Why, because adding another prctl() is too damn hard?

Adding complexity is easy.  Removing it is the hard part. ;-)

Programmers today know what CPUID and xgetbv(XCR0) mean:
1. feature exists in the HW
2. OS has ability to handle state

This is true for all features.

We are forced to complicate their life for AMX (and subsequent features)
because of the legacy Linux signal ABI.
We require that new apps invoke a system call to tell us that they are
indeed not a legacy program, but one that understands that if it uses an
alt-sig-stack, the stack must be big enough to handle whatever the
current hardware requires.

The secondary motivation for the system call is the desire to give the
kernel a hook
so that it can refuse to give permission for some apps to use AMX,
should the need arise.

Programmers don't like this, but nobody has figured out a more
programmer-friendly way to meet these requirements.
So if they want to use AMX on Linux, they *must* use this new SET
syscall -- a minimal sketch of such a request follows below.  Since
Linux enforces its use, they will use it if they want AMX (or
subsequent features).
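
As a rough illustration, the opt-in boils down to something like this
sketch (ARCH_SET_STATE_ENABLE is the PATCH14 name, but its numeric value
and the mask-style argument below are assumptions, not this series' ABI):

#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_SET_STATE_ENABLE	0x1021		/* assumed value */
#define XFEATURE_MASK_XTILEDATA	(1ULL << 18)

static int request_amx_permission(void)
{
	/* Permission granted here applies to every task in the process. */
	if (syscall(SYS_arch_prctl, ARCH_SET_STATE_ENABLE,
		    XFEATURE_MASK_XTILEDATA))
		return -errno;	/* e.g. EINVAL on kernels without AMX */
	return 0;
}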

> prctl(GET_FEATURES_WITH_KERNEL_ASSISTANCE);

The problem is that it adds zero value over the currently used xgetbv(XCR0).
As it adds no value, programmers will not use it.

Sure, if the hardware is re-designed, and Linux is re-designed, and XCR0
can then change at run-time during the lifetime of a program, we have additional
challenges.  (such as legacy code that doesn't expect XCR0 to change
at run-time).
I don't think that this additional system call even begins to address
that theoretical
new world.

But this discussion is moot.  If it has no use, it will not get used.
--
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 08/26] x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer dynamically
  2021-08-18 19:46             ` Bae, Chang Seok
@ 2021-08-25 16:01               ` Bae, Chang Seok
  2021-08-30 17:07               ` Borislav Petkov
  1 sibling, 0 replies; 91+ messages in thread
From: Bae, Chang Seok @ 2021-08-25 16:01 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Hansen, Dave,
	Macieira, Thiago, Liu, Jing2, Shankar, Ravi V, linux-kernel

On Aug 18, 2021, at 12:46, Bae, Chang Seok <chang.seok.bae@intel.com> wrote:
> Let me consider a case to mimic the situation somehow.

Once the changes below were applied on top of v9 of this series, the
self-test ran:

    $ ./tools/testing/selftests/x86/amx_64
            Inject tile data
    Tile data was not written on ptracee.

Check the kernel messages:

    $ sudo dmesg | tail -n 2
    [   82.780882] x86/fpu: Assume new re-allocation fails here and fpu->state
    retains the old re-allocation (0x000000009f3a83cc)

The ptracee loaded tile data, so its per-task XSTATE buffer had been
re-allocated. Then the ptracer attempted to inject new tile data, but the
kernel returned ENOMEM along with the message. This emulates the behavior
when reallocation fails on the ptrace path.

    [   82.793127] process: x86/fpu: Free the re-allocated buffer at
    0x000000009f3a83cc

The program exited. This message indicates the old buffer was freed at that
moment.

Thanks,
Chang


diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index ee71ffd7c221..3153dc91c715 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -170,8 +170,7 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
 	 *
 	 * Check if the expansion is possibly needed.
 	 */
-	if (xfeatures_mask_user_dynamic &&
-	    ((fpu->state_mask & xfeatures_mask_user_dynamic) != xfeatures_mask_user_dynamic)) {
+	if (xfeatures_mask_user_dynamic) {
 		u64 state_mask, dynstate_mask;
 
 		/* Retrieve XSTATE_BV. */
@@ -186,9 +185,13 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
 				goto out;
 			}
 
-			ret = alloc_xstate_buffer(fpu, dynstate_mask);
-			if (ret)
+			if (fpu->state != &fpu->__default_state) {
+				pr_info("x86/fpu: Assume new re-allocation fails here and "
+					"fpu->state retains the old re-allocation (0x%p)\n",
+					fpu->state);
+				ret = -ENOMEM;
 				goto out;
+			}
 		}
 	}
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 5b4f9b82aea1..c04098db58b6 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -99,8 +99,15 @@ void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
 
 void arch_release_task_struct(struct task_struct *task)
 {
-	if (cpu_feature_enabled(X86_FEATURE_FPU))
-		free_xstate_buffer(&task->thread.fpu);
+	if (cpu_feature_enabled(X86_FEATURE_FPU)) {
+		struct fpu *fpu = &task->thread.fpu;
+
+		if (fpu->state != &fpu->__default_state)
+			pr_info("x86/fpu: Free the re-allocated buffer at 0x%p\n",
+				fpu->state);
+
+		free_xstate_buffer(fpu);
+	}
 }
 
 /*
diff --git a/tools/testing/selftests/x86/amx.c b/tools/testing/selftests/x86/amx.c
index afd8c66ca206..6393ec01a9a1 100644
--- a/tools/testing/selftests/x86/amx.c
+++ b/tools/testing/selftests/x86/amx.c
@@ -610,8 +610,6 @@ static void test_context_switch(void)
 
 /* Ptrace test */
 
-static bool ptracee_state_perm;
-
 static int inject_tiledata(pid_t target)
 {
 	struct iovec iov;
@@ -624,12 +622,8 @@ static int inject_tiledata(pid_t target)
 	set_rand_tiledata(xsave_buffer + xsave_xtiledata_offset);
 	memcpy(tiledata, xsave_buffer + xsave_xtiledata_offset, xtiledata_size);
 
-	if (ptrace(PTRACE_SETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov)) {
-		if (errno != EFAULT)
-			err(1, "PTRACE_SETREGSET");
-		else
-			return errno;
-	}
+	if (ptrace(PTRACE_SETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov))
+		return errno;
 
 	if (ptrace(PTRACE_GETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov))
 		err(1, "PTRACE_GETREGSET");
@@ -640,18 +634,19 @@ static int inject_tiledata(pid_t target)
 		return -1;
 }
 
-static void test_tile_write(void)
+static void test_kernel_xbuffer_free_with_ptrace_failure(void)
 {
 	int status, rc;
 	pid_t child;
-	bool pass;
 
 	child = fork();
 	if (child < 0) {
 		err(1, "fork");
 	} else if (!child) {
-		if (ptracee_state_perm)
-			enable_tiledata();
+		clear_xstate_header(xsave_buffer);
+		set_xstatebv(xsave_buffer, XFEATURE_MASK_XTILEDATA);
+		set_rand_tiledata(xsave_buffer + xsave_xtiledata_offset);
+		xrstor_safe(xsave_buffer, -1, -1);
 
 		if (ptrace(PTRACE_TRACEME, 0, NULL, NULL))
 			err(1, "PTRACE_TRACEME");
@@ -664,16 +659,11 @@ static void test_tile_write(void)
 		wait(&status);
 	} while (WSTOPSIG(status) != SIGTRAP);
 
-	printf("\tInject tile data %s ARCH_SET_STATE_ENABLE\n",
-	       ptracee_state_perm ? "with" : "without");
+	printf("\tInject tile data\n");
 
 	rc = inject_tiledata(child);
-	pass = (rc == EFAULT && !ptracee_state_perm) ||
-	       (!rc && ptracee_state_perm);
-	if (!pass)
-		nerrs++;
-	printf("[%s]\tTile data was %swritten on ptracee.\n",
-	       pass ? "OK" : "FAIL", errs ? "not " : "");
+	if (rc)
+		printf("Tile data was not written on ptracee.\n");
 
 	ptrace(PTRACE_DETACH, child, NULL, NULL);
 	wait(&status);
@@ -681,17 +671,6 @@ static void test_tile_write(void)
 		err(1, "ptrace test");
 }
 
-static void test_ptrace(void)
-{
-	printf("[RUN]\tCheck ptrace() to inject tile data.\n");
-
-	ptracee_state_perm = false;
-	test_tile_write();
-
-	ptracee_state_perm = true;
-	test_tile_write();
-}
-
 /* Signal handling test */
 
 static bool init_tiledata, load_tiledata;
@@ -951,13 +930,8 @@ int main(int argc, char **argv)
 	if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0)
 		err(1, "sched_setaffinity to CPU 0");
 
-	test_arch_prctl(argc, argv);
-	test_ptrace();
-
 	enable_tiledata();
-	test_context_switch();
-	test_fork();
-	test_signal();
+	test_kernel_xbuffer_free_with_ptrace_failure();
 
 	clearhandler(SIGILL);


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 08/26] x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer dynamically
  2021-08-18 19:46             ` Bae, Chang Seok
  2021-08-25 16:01               ` Bae, Chang Seok
@ 2021-08-30 17:07               ` Borislav Petkov
  2021-08-30 23:39                 ` Bae, Chang Seok
  1 sibling, 1 reply; 91+ messages in thread
From: Borislav Petkov @ 2021-08-30 17:07 UTC (permalink / raw)
  To: Bae, Chang Seok
  Cc: Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Hansen, Dave,
	Macieira, Thiago, Liu, Jing2, Shankar, Ravi V, linux-kernel

On Wed, Aug 18, 2021 at 07:46:55PM +0000, Bae, Chang Seok wrote:
> No, this non-compacted thing is not for optimization. SDM is not quite clear
> about the logic behind the non-compacted format -- some state’s offset does
> not always match with the 'size + offset' of the previous one, even without
> 64B-alignment. So, the loop is only for the compacted format, not the
> non-compacted one. 
> 
> It was refactored to use in the new helper to find feature_nr’s start point.
> If the size is added up here, it is not ‘i’'s start point anymore.

Let's see, we're still talking about this thing, right:

        nr = fls64(mask) - 1;

        if (!boot_cpu_has(X86_FEATURE_XSAVES))
                return xstate_offsets[nr] + xstate_sizes[nr];

?

That @mask is "which components reserved in the buffer."

Which buffer? The mask being passed is independent from whatever buffer.

So you need to do a lot more explaining here before this goes anywhere.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 10/26] x86/fpu/xstate: Update the XSTATE buffer address finder to support dynamic states
  2021-08-18 19:47     ` Bae, Chang Seok
@ 2021-08-30 17:18       ` Borislav Petkov
  2021-08-30 23:38         ` Bae, Chang Seok
  0 siblings, 1 reply; 91+ messages in thread
From: Borislav Petkov @ 2021-08-30 17:18 UTC (permalink / raw)
  To: Bae, Chang Seok
  Cc: Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Hansen, Dave,
	Macieira, Thiago, Liu, Jing2, Shankar, Ravi V, linux-kernel

On Wed, Aug 18, 2021 at 07:47:04PM +0000, Bae, Chang Seok wrote:
> It was refactored to use in the new helper to find feature_nr’s start point.

Which new helper?

> If the size is added up here, it is not ‘i’'s start point anymore.

Yeah, sorry, I have only a very slight idea what you mean here - you'll
have to try again.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-24 23:22             ` Len Brown
@ 2021-08-30 17:31               ` Borislav Petkov
  2021-09-17  3:48                 ` Len Brown
  0 siblings, 1 reply; 91+ messages in thread
From: Borislav Petkov @ 2021-08-30 17:31 UTC (permalink / raw)
  To: Len Brown
  Cc: Bae, Chang Seok, Macieira, Thiago, Lutomirski, Andy, tglx, mingo,
	x86, Brown, Len, Hansen, Dave, Liu, Jing2, Shankar, Ravi V,
	linux-kernel

On Tue, Aug 24, 2021 at 07:22:18PM -0400, Len Brown wrote:
> We are forced to complicate their life for AMX (and subsequent features)
> because of the legacy Linux signal ABI.

No, we need to design this interface properly because you folks went and
put this AMX thing in xstates. Where it doesn't belong at all.

> We require that new apps invoke a system call to tell us that they
> are indeed not a legacy program, but one that understands that if it
> uses an alt-sig-stack, the stack must be big enough to handle whatever
> the current hardware requires.

Yes, because of the reason I gave above. If this additional 8K of fat
weren't an xstate, we wouldn't be having this conversation.

> The secondary motivation for the system call is the desire to give the
> kernel a hook so that it can refuse to give permission for some apps
> to use AMX, should the need arise.

Yes.

> > prctl(GET_FEATURES_WITH_KERNEL_ASSISTANCE);
>
> The problem is that it adds zero value over the currently used xgetbv(XCR0).
> As it adds no value, programmers will not use it.

Bullsh*t.

First of all, it is a new interface we're introducing, and if it is
there from the get-go along with examples of how to use it and proper
documentation, people will use it.

Secondly, from a previous email of mine: "What if this modus operandi of
features userspace can use with kernel assistance (but which need an
explicit request and are off otherwise) gets extended beyond
XSAVE-managed features?"

In that case you can xgetbv() all you want but the new fat feature is
not even in XCR0. So *then* you *have* to introduce a new prctl() to
query supported features. And right then and there you wish you would've
done that from the very beginning!
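
To spell out the shape being argued for, a purely hypothetical sketch --
nothing below exists in this series (ARCH_GET_STATE_ENABLE in PATCH14 is
the closest real relative), and the fallback assumes -mxsave:

#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <immintrin.h>

#define HYPOTHETICAL_GET_FEATURES	0x1022	/* made-up constant */

static uint64_t usable_feature_mask(void)
{
	uint64_t mask;

	/* Ask the kernel; this could cover non-XSAVE features too. */
	if (syscall(SYS_arch_prctl, HYPOTHETICAL_GET_FEATURES, &mask) == 0)
		return mask;

	/* Fallback on older kernels: XSAVE-managed features only. */
	return _xgetbv(0);
}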

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-24 22:21     ` Len Brown
@ 2021-08-30 17:41       ` Borislav Petkov
  2021-08-31 21:44         ` Len Brown
  0 siblings, 1 reply; 91+ messages in thread
From: Borislav Petkov @ 2021-08-30 17:41 UTC (permalink / raw)
  To: Len Brown
  Cc: Chang S. Bae, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, thiago.macieira, Liu, Jing2,
	Ravi V. Shankar, Linux Kernel Mailing List

On Tue, Aug 24, 2021 at 06:21:23PM -0400, Len Brown wrote:
> MSR_IA32_XFD and MSR_IA32_XFD_ERR are architectural.
> 
> (which is why they follow the convention of having an "IA32" in their name)

Where is that official statement I can refer to that says that MSRs with
"IA32" in the name are architectural?

Perhaps that section of the SDM:

"2.1 ARCHITECTURAL MSRS"

?

In any case, those MSRs are not there yet, maybe they need to trickle
from the ISA to the SDM docs at some point first.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 08/26] x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer dynamically
  2021-07-30 14:59 ` [PATCH v9 08/26] x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer dynamically Chang S. Bae
  2021-08-12 19:44   ` Borislav Petkov
@ 2021-08-30 17:45   ` Dave Hansen
  2021-08-30 23:39     ` Bae, Chang Seok
  1 sibling, 1 reply; 91+ messages in thread
From: Dave Hansen @ 2021-08-30 17:45 UTC (permalink / raw)
  To: Chang S. Bae, bp, luto, tglx, mingo, x86
  Cc: len.brown, thiago.macieira, jing2.liu, ravi.v.shankar, linux-kernel

On 7/30/21 7:59 AM, Chang S. Bae wrote:
> +/**
> + * get_xstate_size - Calculate an xstate buffer size

Calculate the amount of space needed to store an xstate buffer with the
given features.

> + * @mask:	This bitmap tells which components reserved in the buffer.

The set of components for which the space is needed.

> + * Available once those arrays for the offset, size, and alignment info are
> + * set up, by setup_xstate_features().

Please just say:

	Consults values populated in setup_xstate_features().  Must be
	called after that setup.


> + * Returns:	The buffer size
> + */
> +unsigned int get_xstate_size(u64 mask)
> +{
> +	unsigned int size;
> +	int i, nr;
> +
> +	if (!mask)
> +		return 0;
> +
> +	/*
> +	 * The minimum buffer size excludes the dynamic user state. When a
> +	 * task uses the state, the buffer can grow up to the max size.
> +	 */
> +	if (mask == (xfeatures_mask_all & ~xfeatures_mask_user_dynamic))
> +		return get_xstate_config(XSTATE_MIN_SIZE);
> +	else if (mask == xfeatures_mask_all)
> +		return get_xstate_config(XSTATE_MAX_SIZE);

Is this just an optimization?  It seems redundant with everything below.
 I think that adds to the confusion.

> +	nr = fls64(mask) - 1;

"nr" is a really, really, confusing name for this.  "last_feature_nr"
might be better.  Otherwise, this might be read as "number of features".
 Comment might have helped, had there been any.

> +	if (!boot_cpu_has(X86_FEATURE_XSAVES))
> +		return xstate_offsets[nr] + xstate_sizes[nr];

Doesn't xstate_comp_offsets[] also work for non-compacted features?
setup_xstate_comp_offsets() says so and __raw_xsave_addr() depends on
that behavior.

> +	if ((xfeatures_mask_all & (BIT_ULL(nr + 1) - 1)) == mask)
> +		return xstate_comp_offsets[nr] + xstate_sizes[nr];

OK, so this is basically saying, "Is the size I'm looking for already
calculated and stored in xstate_comp_offsets[] because the mask is a
subset of xfeatures_mask_all".  Right?

I guess that works.  But that's a *LOT* of logic to go uncommented.

> +	/*
> +	 * With the given mask, no relevant size is found so far. So,
> +	 * calculate it by summing up each state size.
> +	 */
> +	for (size = FXSAVE_SIZE + XSAVE_HDR_SIZE, i = FIRST_EXTENDED_XFEATURE; i <= nr; i++) {
> +		if (!(mask & BIT_ULL(i)))
> +			continue;
> +
> +		if (xstate_aligns[i])
> +			size = ALIGN(size, 64);
> +		size += xstate_sizes[i];
> +	}
> +	return size;
> +}

OK, so this finally reveals something important about the function.  It
is *trying* to avoid running this loop.  All of the above is really just
optimizations to try and avoid doing this loop.

That makes me wonder why you chose that particular set of optimizations.
 It also makes me wonder if they're even necessary.

So, first of all, why is this a new loop?  Can't it share code with the
XSAVE setup code?  That code also calculates the amount of space needed
for an XSAVE buffer given a mask.

Second, which of those optimizations do we *need*?  I worry that this is
trying to be way too generic and be *optimized* for being generic code
when it will never really get random masks as input.

For instance, who is going to be calling this with
mask!=xfeatures_mask_all with !boot_cpu_has(X86_FEATURE_XSAVES)?  That
seems rather improbable.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-24 23:17     ` Len Brown
@ 2021-08-30 17:53       ` Borislav Petkov
  2021-08-31 22:07         ` Len Brown
  2021-08-30 18:04       ` Dave Hansen
  1 sibling, 1 reply; 91+ messages in thread
From: Borislav Petkov @ 2021-08-30 17:53 UTC (permalink / raw)
  To: Len Brown
  Cc: Chang S. Bae, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, thiago.macieira, Liu, Jing2,
	Ravi V. Shankar, Linux Kernel Mailing List

On Tue, Aug 24, 2021 at 07:17:49PM -0400, Len Brown wrote:
> The problem with a system call to pre-allocate an AMX context switch
> buffer is that it doesn't actually deliver on the goal of guaranteeing
> no subsequent run-time failures due to OOM.

You mean you pre-allocate *everything*, i.e., you won't do any more
allocations but then you still might OOM?

Yeah, right.

Maybe from something else but not from your AMX-using process which has
prepared everything already.

> Even if your AMX thread pool threads were to invoke this system call
> as soon as possible... Who is to say that the thread pool is created
> only at a time when memory is available? A thread could be created
> 24 hours into program execution under OOM conditions; this system
> call will return ENOMEM, and your program will in all likelihood
> throw up its hands and exit at the exact same place it would exit for
> transparently allocated buffers.

Well, if you preallocate everything you won't have to run for 24 hours,
encounter -ENOMEM and *lose* 24 hours worth of AMX computation. And then
the kernel won't have to do all kinds of crazy dance with dynamically
resizing buffers just because some small percentage of luserspace apps
decided to do AMX stuff.

> The program can equally cause the kernel to allocate an AMX context
> switch buffer by simply touching a TMM register -- and this can
> be done at exactly the same place in the program where it would
> call a pre-allocation system call.

If the program touches a TMM register and it hasn't requested AMX
support upfront, it'll get killed.

> The advantage of #NM over the syscall is that the programmer
> doesn't actually have to do anything. Also, transparently allocated
> buffers offer a theoretical benefit: a program may have many
> threads, but only a few may actually touch AMX, so there are
> savings to be had by allocating buffers only for the threads that
> actually use them.

The program already asked the kernel whether it can use AMX - it can
allocate the buffers for the threads too.

> XFD is used to *enforce* that AMX is not used without permission.
> Were we to not use the XFD feature, users would be able to stash
> data in TMM registers and even use TMUL without the kernel
> being able to prevent them from doing so.  Then when they
> context switch or take a signal, the data in their TMM registers
> would mysteriously vanish...
>
> Much better to be able to tell them immediately that they are doing it
> wrong...

Ok, killing the program in the #NM handler if it hasn't requested AMX
prior makes sense.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-24 23:17     ` Len Brown
  2021-08-30 17:53       ` Borislav Petkov
@ 2021-08-30 18:04       ` Dave Hansen
  2021-08-31 22:15         ` Len Brown
  1 sibling, 1 reply; 91+ messages in thread
From: Dave Hansen @ 2021-08-30 18:04 UTC (permalink / raw)
  To: Len Brown, Borislav Petkov
  Cc: Chang S. Bae, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	X86 ML, Brown, Len, thiago.macieira, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List

On 8/24/21 4:17 PM, Len Brown wrote:
> Even if your AMX thread pool threads were to invoke this system call
> as soon as possible...
> Who is to say that the thread pool is created only at a time when memory
> is available?  A thread could be created 24 hours into program execution
> under OOM conditions; this system call will return ENOMEM, and your
> program will in all likelihood throw up its hands and exit at the exact
> same place it would exit for transparently allocated buffers.

I tried this exact line of reasoning with Thomas: it doesn't matter
where we run out of memory, we still need the same memory and we're
screwed either way.

However, Thomas expressed a clear preference for ABIs which return
memory failures explicitly at syscalls versus implicit failures which
can happen on random instructions.

One might say that the odds of checking for and handling a NULL value
(or ENOMEM) are the same as installing a signal handler.  *But*, it's
infinitely easier to unroll state and recover from a NULL than it is to
handle it from within a signal handler.  In other words, the explicit
ones *encourage* better programming.

I'd prefer removing the demand-driven allocation at this point.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 10/26] x86/fpu/xstate: Update the XSTATE buffer address finder to support dynamic states
  2021-08-30 17:18       ` Borislav Petkov
@ 2021-08-30 23:38         ` Bae, Chang Seok
  0 siblings, 0 replies; 91+ messages in thread
From: Bae, Chang Seok @ 2021-08-30 23:38 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Hansen, Dave,
	Macieira, Thiago, Liu, Jing2, Shankar, Ravi V, linux-kernel

On Aug 30, 2021, at 10:18, Borislav Petkov <bp@alien8.de> wrote:
> On Wed, Aug 18, 2021 at 07:47:04PM +0000, Bae, Chang Seok wrote:
>> It was refactored to use in the new helper to find feature_nr’s start point.
> 
> Which new helper?
> 
>> If the size is added up here, it is not ‘i’'s start point anymore.
> 
> Yeah, sorry, I have only a very slight idea what you mean here - you'll
> have to try again.

Just recap what is said on SDM, Vol1 13.4.3 Extended Region of an XSAVE Area:

  - In the compacted format, each state component i is located at a byte offset
    from the base address of the XSAVE area.
  - location_i is determined by location_j and size_j, where j is the
    greatest index with 1 <= j < i and XCOMP_BV[j] = 1:
    location_i = location_j + size_j
    (let's simplify without the 64-byte alignment here)

This loop was moved from calculate_xstate_buf_size_from_mask() (if
renamed) into the new helper -- get_xstate_comp_offset().

But the returned value is different.  get_xstate_comp_offset(mask, i)
should return location_i, while calculate_xstate_buf_size_from_mask(mask)
calculates the buffer size, which is location_i + size_i when the greatest
feature number is i.
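
A sketch of that rule with the 64-byte alignment put back (the function
name is illustrative; xstate_sizes[] and xstate_aligns[] are this
series' arrays):

static unsigned int xstate_comp_offset_of(u64 xcomp_bv, int feature_nr)
{
	unsigned int offset = FXSAVE_SIZE + XSAVE_HDR_SIZE;	/* 512 + 64 */
	int i;

	/* Sum up every component reserved in the buffer before ours. */
	for (i = FIRST_EXTENDED_XFEATURE; i < feature_nr; i++) {
		if (!(xcomp_bv & BIT_ULL(i)))
			continue;
		if (xstate_aligns[i])
			offset = ALIGN(offset, 64);
		offset += xstate_sizes[i];
	}
	if (xstate_aligns[feature_nr])
		offset = ALIGN(offset, 64);

	/* location_i; add xstate_sizes[feature_nr] for the buffer size. */
	return offset;
}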

Thanks,
Chang

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 08/26] x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer dynamically
  2021-08-30 17:45   ` Dave Hansen
@ 2021-08-30 23:39     ` Bae, Chang Seok
  0 siblings, 0 replies; 91+ messages in thread
From: Bae, Chang Seok @ 2021-08-30 23:39 UTC (permalink / raw)
  To: Hansen, Dave
  Cc: bp, Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Macieira,
	Thiago, Liu, Jing2, Shankar, Ravi V, linux-kernel

On Aug 30, 2021, at 10:45, Hansen, Dave <dave.hansen@intel.com> wrote:
<snip> 
> On 7/30/21 7:59 AM, Chang S. Bae wrote:
>> 
>> +	/*
>> +	 * The minimum buffer size excludes the dynamic user state. When a
>> +	 * task uses the state, the buffer can grow up to the max size.
>> +	 */
>> +	if (mask == (xfeatures_mask_all & ~xfeatures_mask_user_dynamic))
>> +		return get_xstate_config(XSTATE_MIN_SIZE);
>> +	else if (mask == xfeatures_mask_all)
>> +		return get_xstate_config(XSTATE_MAX_SIZE);
> 
> Is this just an optimization?  It seems redundant with everything below.
> I think that adds to the confusion.

Boris suggested removing the below instead [1]:

    "So leave only the first two which are obvious and are more likely to
     happen - the first one is going to be the most likely on non-dynamic
     setups and the second one is on dynamic systems."

>> +	nr = fls64(mask) - 1;
> 
> "nr" is a really, really, confusing name for this.  "last_feature_nr"
> might be better.  Otherwise, this might be read as "number of features".
> Comment might have helped, had there been any.

Yes, it seems to be the case.

>> +	if (!boot_cpu_has(X86_FEATURE_XSAVES))
>> +		return xstate_offsets[nr] + xstate_sizes[nr];
> 
> Doesn't xstate_comp_offsets[] also work for non-compacted features?
> setup_xstate_comp_offsets() says so and __raw_xsave_addr() depends on
> that behavior.

Yes, but I think using xstate_comp_offsets[] for the non-compacted format
instead of xstate_offsets[] here just creates confusion.

>> +	if ((xfeatures_mask_all & (BIT_ULL(nr + 1) - 1)) == mask)
>> +		return xstate_comp_offsets[nr] + xstate_sizes[nr];
> 
> OK, so this is basically saying, "Is the size I'm looking for already
> calculated and stored in xstate_comp_offsets[] because the mask is a
> subset of xfeatures_mask_all".  Right?
> 
> I guess that work.  But, that's a *LOT* of logic to go uncommented.

Boris suggested simplifying the function by removing this [2]:
    > But it might be better to simplify this hunk for readability. I
    > suspect its call sites are not that performance-critical.
    That's *exactly* what I'm driving at!

And I applied that in v10 [3].

>> +	/*
>> +	 * With the given mask, no relevant size is found so far. So,
>> +	 * calculate it by summing up each state size.
>> +	 */
>> +	for (size = FXSAVE_SIZE + XSAVE_HDR_SIZE, i = FIRST_EXTENDED_XFEATURE; i <= nr; i++) {
>> +		if (!(mask & BIT_ULL(i)))
>> +			continue;
>> +
>> +		if (xstate_aligns[i])
>> +			size = ALIGN(size, 64);
>> +		size += xstate_sizes[i];
>> +	}
>> +	return size;
>> +}
> 
> OK, so this finally reveals something important about the function.  It
> is *trying* to avoid running this loop.  All of the above is really just
> optimizations to try and avoid doing this loop.
> 
> That makes me wonder why you chose that particular set of optimizations.
> It also makes me wonder if they're even necessary.
> 
> So, first of all, why is this a new loop?  Can't it share code with the
> XSAVE setup code?  That code also calculates the amount of space needed
> for an XSAVE buffer given a mask.

This runtime function uses the recorded values for offset, size, and alignment
instead of performing CPUID. The loop in the setup function references CPUID
values.

> Second, which of those optimizations do we *need*?  I worry that this is
> trying to be way too generic and be *optimized* for being generic code
> when it will never really get random masks as input.
> 
> For instance, who is going to be calling this with
> mask!=xfeatures_mask_all with !boot_cpu_has(X86_FEATURE_XSAVES)?  That
> seems rather improbable.

This function is intended to help the dynamic state allocation function and
some others. Avoiding the loop might be helpful in the future, especially
when some other dynamic states are enabled.

V10 has a much-trimmed version [3] now as that optimization is not needed with
AMX enabling.

Thanks,
Chang

[1]: https://lore.kernel.org/lkml/YRzSuC25eHEOgj6h@zn.tnic/
[2]: https://lore.kernel.org/lkml/YRZDu2Rk+KdRhh1U@zn.tnic/
[3]: https://lore.kernel.org/lkml/20210825155413.19673-10-chang.seok.bae@intel.com/



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 08/26] x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer dynamically
  2021-08-30 17:07               ` Borislav Petkov
@ 2021-08-30 23:39                 ` Bae, Chang Seok
  0 siblings, 0 replies; 91+ messages in thread
From: Bae, Chang Seok @ 2021-08-30 23:39 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Lutomirski, Andy, tglx, mingo, x86, Brown, Len, Hansen, Dave,
	Macieira, Thiago, Liu, Jing2, Shankar, Ravi V, linux-kernel

On Aug 30, 2021, at 10:07, Borislav Petkov <bp@alien8.de> wrote:
> On Wed, Aug 18, 2021 at 07:46:55PM +0000, Bae, Chang Seok wrote:
>> No, this non-compacted thing is not for optimization. SDM is not quite clear
>> about the logic behind the non-compacted format -- some state’s offset does
>> not always match with the 'size + offset' of the previous one, even without
>> 64B-alignment. So, the loop is only for the compacted format, not the
>> non-compacted one. 
>> 
>> It was refactored to use in the new helper to find feature_nr’s start point.
>> If the size is added up here, it is not ‘i’'s start point anymore.
> 
> Let's see, we're still talking about this thing, right:
> 
>        nr = fls64(mask) - 1;
> 
>        if (!boot_cpu_has(X86_FEATURE_XSAVES))
>                return xstate_offsets[nr] + xstate_sizes[nr];
> 
> ?
> 
> That @mask is "which components reserved in the buffer."
> 
> Which buffer? The mask being passed is independent from whatever buffer.
> 
> So you need to do a lot more explaining here before this goes anywhere.

Yes, this function, which you suggested renaming to something like
calculate_xstate_buf_size_from_mask(), takes no input tied to any buffer.

Perhaps, as Dave suggested [1],
    "@mask		The set of components for which the space is needed."

Along with these,
    "calculate_xstate_buf_size_from_mask -- Calculate the amount of space
     needed to store an xstate buffer with the given features.”

    s/nr/last_feature_nr/

[1]: https://lore.kernel.org/lkml/bb49fdc9-2228-8bd1-bcc5-7c498daf0887@intel.com/

Thanks,
Chang

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-30 17:41       ` Borislav Petkov
@ 2021-08-31 21:44         ` Len Brown
  0 siblings, 0 replies; 91+ messages in thread
From: Len Brown @ 2021-08-31 21:44 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Chang S. Bae, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, thiago.macieira, Liu, Jing2,
	Ravi V. Shankar, Linux Kernel Mailing List

On Mon, Aug 30, 2021 at 1:41 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Aug 24, 2021 at 06:21:23PM -0400, Len Brown wrote:
> > MSR_IA32_XFD and MSR_IA32_XFD_ERR are architectural.
> >
> > (which is why they follow the convention of having an "IA32" in their name)
>
> Where is that official statement I can refer to that says that MSRs with
> "IA32" in the name are architectural?
>
> Perhaps that section of the SDM:
>
> "2.1 ARCHITECTURAL MSRS"

Yes.

> In any case, those MSRs are not there yet, maybe they need to trickle
> from the ISA to the SDM docs at some point first.

Right.
These new MSRs are already named IA32... even though the info from
the ISA Extensions Manual hasn't yet trickled into the SDM, because
they are defined to be architectural from the get-go.

Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-30 17:53       ` Borislav Petkov
@ 2021-08-31 22:07         ` Len Brown
  2021-08-31 22:11           ` Dave Hansen
  0 siblings, 1 reply; 91+ messages in thread
From: Len Brown @ 2021-08-31 22:07 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Chang S. Bae, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	X86 ML, Brown, Len, Dave Hansen, Thiago Macieira, Liu, Jing2,
	Ravi V. Shankar, Linux Kernel Mailing List

On Mon, Aug 30, 2021 at 1:52 PM Borislav Petkov <bp@alien8.de> wrote:

> Well, if you preallocate everything...

Nothing prevents, say, a pthread_create() or anything
else where the kernel consumes memory on behalf of a process
from failing at run-time...  AMX does not add a unique OOM risk here.

> > The advantage of #NM over the syscall is that the programmer
> > doesn't actually have to do anything. Also, transparently allocated
> > buffers offer a theoretical benefit: a program may have many
> > threads, but only a few may actually touch AMX, so there are
> > savings to be had by allocating buffers only for the threads that
> > actually use them.
>
> The program already asked the kernel whether it can use AMX - it can
> allocate the buffers for the threads too.

The result is that if one thread in a 1,000-thread process requests
and touches AMX, the kernel would allocate 8MB (1,000 x 8KB) instead
of 8KB of context switch buffers for that process, no?

Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-31 22:07         ` Len Brown
@ 2021-08-31 22:11           ` Dave Hansen
  0 siblings, 0 replies; 91+ messages in thread
From: Dave Hansen @ 2021-08-31 22:11 UTC (permalink / raw)
  To: Len Brown, Borislav Petkov
  Cc: Chang S. Bae, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	X86 ML, Brown, Len, Thiago Macieira, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List

On 8/31/21 3:07 PM, Len Brown wrote:
> On Mon, Aug 30, 2021 at 1:52 PM Borislav Petkov <bp@alien8.de> wrote:
> 
>> Well, if you preallocate everything...
> Nothing prevents, say, a pthread_create() or anything
> else where the kernel consumes memory on behalf of a process
> from failing at run-time...  AMX does not add a unique OOM risk here.
> 
>>> The advantage of #NM over the syscall is that the programmer
>>> doesn't actually have to do anything. Also, transparently allocated
>>> buffers offer a theoretical benefit: a program may have many
>>> threads, but only a few may actually touch AMX, so there are
>>> savings to be had by allocating buffers only for the threads that
>>> actually use them.
>> The program already asked the kernel whether it can use AMX - it can
>> allocate the buffers for the threads too.
> The result is that if one thread in a 1,000-thread process requests
> and touches AMX, the kernel would allocate 8MB (1,000 x 8KB) instead
> of 8KB of context switch buffers for that process, no?

Yes, but that's a pretty natural consequence of the process-level ABI
which was chosen.  A per-thread permission scheme would not have had
this particular trade-off.

If you have a big process (lots of threads) and you use a process-level
ABI, there are going to be big implications.  I don't think we can get
away from this.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-30 18:04       ` Dave Hansen
@ 2021-08-31 22:15         ` Len Brown
  2021-08-31 22:16           ` Len Brown
  2021-08-31 22:39           ` Thiago Macieira
  0 siblings, 2 replies; 91+ messages in thread
From: Len Brown @ 2021-08-31 22:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Borislav Petkov, Chang S. Bae, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, X86 ML, Brown, Len, Thiago Macieira, Liu, Jing2,
	Ravi V. Shankar, Linux Kernel Mailing List

On Mon, Aug 30, 2021 at 2:04 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 8/24/21 4:17 PM, Len Brown wrote:
> > Even if your AMX thread pool threads were to invoke this system call
> > as soon as possible...
> > Who is to say that the thread pool is created only at a time when memory
> > is available?  A thread could be created 24 hours into program execution
> > under OOM conditions; this system call will return ENOMEM, and your
> > program will in all likelihood throw up its hands and exit at the exact
> > same place it would exit for transparently allocated buffers.
>
> I tried this exact line of reasoning with Thomas: it doesn't matter
> where we run out of memory, we still need the same memory and we're
> screwed either way.
>
> However, Thomas expressed a clear preference for ABIs which return
> memory failures explicitly at syscalls versus implicit failures which
> can happen on random instructions.
>
> One might say that the odds of checking for and handling a NULL value
> (or ENOMEM) are the same as installing a signal handler.  *But*, it's
> infinitely easier to unroll state and recover from a NULL than it is to
> handle it from within a signal handler.  In other words, the explicit
> ones *encourage* better programming.

I agree.
Indeed, I believe that there is universal agreement that a synchronous
return code
from a system call is a far superior programming model than decoding
the location of a failure in a system call.  (no, the IP isn't random -- it is
always the 1st instruction in that thread to touch a TMM register).

> I'd prefer removing the demand-driven allocation at this point.

Adding a pre-allocation system call that can gracefully fail
(even though it never will) is independent of removing
demand-driven allocation.  I would leave this to application
developers.  Honestly, the kernel shouldn't care.

-- 
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-31 22:15         ` Len Brown
@ 2021-08-31 22:16           ` Len Brown
  2021-08-31 22:39           ` Thiago Macieira
  1 sibling, 0 replies; 91+ messages in thread
From: Len Brown @ 2021-08-31 22:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Borislav Petkov, Chang S. Bae, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, X86 ML, Brown, Len, Thiago Macieira, Liu, Jing2,
	Ravi V. Shankar, Linux Kernel Mailing List

On Tue, Aug 31, 2021 at 6:15 PM Len Brown <lenb@kernel.org> wrote:
>
> On Mon, Aug 30, 2021 at 2:04 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 8/24/21 4:17 PM, Len Brown wrote:
> > > Even if your AMX thread pool threads were to invoke this system call
> > > as soon as possible...
> > > Who is to say that the thread pool is created only at a time when
> > > memory is available?  A thread could be created 24 hours into program
> > > execution under OOM conditions; this system call will return ENOMEM,
> > > and your program will in all likelihood throw up its hands and exit at
> > > the exact same place it would exit for transparently allocated buffers.
> >
> > I tried this exact line of reasoning with Thomas: it doesn't matter
> > where we run out of memory, we still need the same memory and we're
> > screwed either way.
> >
> > However, Thomas expressed a clear preference for ABIs which return
> > memory failures explicitly at syscalls versus implicit failures which
> > can happen on random instructions.
> >
> > One might say that the odds of checking for and handling a NULL value
> > (or ENOMEM) are the same as installing a signal handler.  *But*, it's
> > infinitely easier to unroll state and recover from a NULL than it is to
> > handle it from within a signal handler.  In other words, the explicit
> > ones *encourage* better programming.
>
> I agree.
> Indeed, I believe that there is universal agreement that a synchronous
> return code
> from a system call is a far superior programming model than decoding
> the location of a failure in a system call.  (no, the IP isn't random -- it is

decoding the location of the failure in a *signal handler*

> always the 1st instruction in that thread to touch a TMM register).
>
> > I'd prefer removing the demand-driven allocation at this point.
>
> Adding a pre-allocation system call that can gracefully fail
> (even though it never will) is independent of removing
> demand-driven allocation.  I would leave this to application
> developers.  Honestly, the kernel shouldn't care.
>
> --
> Len Brown, Intel Open Source Technology Center



-- 
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-31 22:15         ` Len Brown
  2021-08-31 22:16           ` Len Brown
@ 2021-08-31 22:39           ` Thiago Macieira
  2021-08-31 22:44             ` Len Brown
  1 sibling, 1 reply; 91+ messages in thread
From: Thiago Macieira @ 2021-08-31 22:39 UTC (permalink / raw)
  To: Dave Hansen, Len Brown
  Cc: Borislav Petkov, Chang S. Bae, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, X86 ML, Brown, Len, Liu, Jing2, Ravi V. Shankar,
	Linux Kernel Mailing List

On Tuesday, 31 August 2021 15:15:55 PDT Len Brown wrote:
> Indeed, I believe that there is universal agreement that a synchronous
> return code
> from a system call is a far superior programming model than decoding
> the location of a failure in a system call.  (no, the IP isn't random -- it
> is always the 1st instruction in that thread to touch a TMM register).

That instruction is actually likely going to be a memory load, probably an 
LDTILECFG. So the developer will see a crashing instruction with a pointer and 
will spend time trying to figure out why that pointer was wrong, when there 
was nothing wrong with it.

That's why I suggested (and Chang implemented) a SIGILL for when #NM is 
received and the arch_prctl() wasn't previously done. The OOM condition, if 
the extra state is dynamically allocated, was meant to stay a SIGSEGV, but 
maybe should change to SIGKILL.

On the other hand, if it's allocated at the syscall, then the kernel can
return -ENOMEM for it (which would allow for graceful degradation) or for
a later clone() syscall starting a new thread (which I don't expect to
ever gracefully degrade).

> decoding the location of the failure in a *signal handler*

That's a separate problem.

We can't be sure that the portion of userspace doing the alt-stack crash
handler is aware of the portion using AMX. There's no way to enforce this.
The prctl() is a good indication, but I have no clue how high the
correlation will be.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-31 22:39           ` Thiago Macieira
@ 2021-08-31 22:44             ` Len Brown
  0 siblings, 0 replies; 91+ messages in thread
From: Len Brown @ 2021-08-31 22:44 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: Dave Hansen, Borislav Petkov, Chang S. Bae, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, X86 ML, Brown, Len, Liu, Jing2,
	Ravi V. Shankar, Linux Kernel Mailing List

On Tue, Aug 31, 2021 at 6:39 PM Thiago Macieira
<thiago.macieira@intel.com> wrote:
>
> On Tuesday, 31 August 2021 15:15:55 PDT Len Brown wrote:
> > Indeed, I believe that there is universal agreement that a synchronous
> > return code
> > from a system call is a far superior programming model than decoding
> > the location of a failure in a system call.  (no, the IP isn't random -- it
> > is always the 1st instruction in that thread to touch a TMM register).
>
> That instruction is actually likely going to be a memory load, probably an
> LDTILECFG.

There is no fault on LDTILECFG; the fault will occur on the tile data load.
But yes, still a memory load (with a TMM destination).

Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state
  2021-08-30 17:31               ` Borislav Petkov
@ 2021-09-17  3:48                 ` Len Brown
  0 siblings, 0 replies; 91+ messages in thread
From: Len Brown @ 2021-09-17  3:48 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Bae, Chang Seok, Macieira, Thiago, Lutomirski, Andy, tglx, mingo,
	x86, Brown, Len, Hansen, Dave, Liu, Jing2, Shankar, Ravi V,
	linux-kernel

On Mon, Aug 30, 2021 at 1:31 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Aug 24, 2021 at 07:22:18PM -0400, Len Brown wrote:
> > We are forced to complicate their life for AMX (and subsequent features)
> > because of the legacy Linux signal ABI.
>
> No, we need to design this interface properly because you folks went and
> put this AMX thing in xstates. Where it doesn't belong at all.

Years ago, somebody, other than you or I, decided to use uncompacted
XSTATE on the Linux signal stack.

Years ago, somebody else, also other than you or I, decided that AMX should
be implemented using XSTATE.

Today, we are all working together to deal with this collision, in as
graceful a manner as possible.  Yes?

> > We require that new apps invoke a system call to tell us that they
> > are indeed not a legacy program, but one that understands that if it
> > uses an alt-sig-stack, the stack must be big enough to handle whatever
> > the current hardware requires.
>
> Yes, because of the reason I gave above. If this additional 8K of fat
> weren't an xstate, we wouldn't be having this conversation.

While not as huge, AVX-512 has the same XSTATE bloat issue as AMX --
including the demonstrated ability to overflow the signal stack and
kill apps.

The silver lining is that, due to the AMX enabling effort, we updated
the glibc ABI to comprehend a variable sigstacksize.  So glibc 2.34,
which was released Aug 1st, comprehends whatever the current hardware
supports.
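
A new-style app can then size its alt-sig-stack against the
kernel-reported minimum, roughly like the sketch below (AT_MINSIGSTKSZ
is the x86 auxv entry the glibc 2.34 ABI builds on; the fallback define
is only for older headers):

#include <signal.h>
#include <stdlib.h>
#include <sys/auxv.h>

#ifndef AT_MINSIGSTKSZ
#define AT_MINSIGSTKSZ 51
#endif

static int install_big_enough_altstack(void)
{
	stack_t ss;
	unsigned long min = getauxval(AT_MINSIGSTKSZ);	/* 0 if absent */

	/* Use the kernel-reported minimum, plus headroom for the handler. */
	ss.ss_size = (min > (unsigned long)SIGSTKSZ ? min : SIGSTKSZ) + 16384;
	ss.ss_flags = 0;
	ss.ss_sp = malloc(ss.ss_size);
	if (!ss.ss_sp)
		return -1;
	return sigaltstack(&ss, NULL);
}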

> > The secondary motivation for the system call is the desire to give the
> > kernel a hook so that it can refuse to give permission for some apps
> > to use AMX, should the need arise.
>
> Yes.
>
> > > prctl(GET_FEATURES_WITH_KERNEL_ASSISTANCE);
> >
> > The problem is that it adds zero value over the currently used xgetbv(XCR0).
> > As it adds no value, programmers will not use it.

[expletive deleted]

> First of all, it is a new interface we're introducing, and if it is
> there from the get-go along with examples of how to use it and proper
> documentation, people will use it.

The application people I talk to are not asking for more system calls.
They would prefer zero system calls (which was our initial proposal).

> Secondly, from a previous email of mine: "What if this modus operandi of
> features userspace can use with kernel assistance (but which need an
> explicit request and are off otherwise) gets extended beyond
> XSAVE-managed features?"
>
> In that case you can xgetbv() all you want but the new fat feature is
> not even in XCR0. So *then* you *have* to introduce a new prctl() to
> query supported features. And right then and there you wish you would've
> done that from the very beginning!

Sorry, I don't recall seeing that previous note -- maybe it flew past
when I was out.

I have no problem with the quest to develop a universal ABI
to layer over or otherwise replace CPUID and XCR0 and allow kernel override etc.

My point is simply that I haven't seen a case where somebody wanting to use AMX
would need it, and so I don't think developing such an ABI should gate
AMX support.

thanks,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 91+ messages in thread

end of thread, other threads:[~2021-09-17  3:48 UTC | newest]

Thread overview: 91+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-30 14:59 [PATCH v9 00/26] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
2021-07-30 14:59 ` [PATCH v9 01/26] x86/fpu/xstate: Modify the initialization helper to handle both static and dynamic buffers Chang S. Bae
2021-07-30 14:59 ` [PATCH v9 02/26] x86/fpu/xstate: Modify state copy helpers " Chang S. Bae
2021-07-30 14:59 ` [PATCH v9 03/26] x86/fpu/xstate: Modify address finders " Chang S. Bae
2021-07-30 14:59 ` [PATCH v9 04/26] x86/fpu/xstate: Add a new variable to indicate dynamic user states Chang S. Bae
2021-07-30 14:59 ` [PATCH v9 05/26] x86/fpu/xstate: Add new variables to indicate dynamic XSTATE buffer size Chang S. Bae
2021-08-12 15:03   ` Borislav Petkov
2021-07-30 14:59 ` [PATCH v9 06/26] x86/fpu/xstate: Calculate and remember dynamic XSTATE buffer sizes Chang S. Bae
2021-08-12 16:36   ` Borislav Petkov
2021-07-30 14:59 ` [PATCH v9 07/26] x86/fpu/xstate: Convert the struct fpu 'state' field to a pointer Chang S. Bae
2021-08-12 17:09   ` Borislav Petkov
2021-07-30 14:59 ` [PATCH v9 08/26] x86/fpu/xstate: Introduce helpers to manage the XSTATE buffer dynamically Chang S. Bae
2021-08-12 19:44   ` Borislav Petkov
2021-08-13  8:04     ` Bae, Chang Seok
2021-08-13 10:04       ` Borislav Petkov
2021-08-13 19:43         ` Bae, Chang Seok
2021-08-18  9:28           ` Borislav Petkov
2021-08-18 19:46             ` Bae, Chang Seok
2021-08-25 16:01               ` Bae, Chang Seok
2021-08-30 17:07               ` Borislav Petkov
2021-08-30 23:39                 ` Bae, Chang Seok
2021-08-16 18:33     ` Bae, Chang Seok
2021-08-16 18:53       ` Borislav Petkov
2021-08-30 17:45   ` Dave Hansen
2021-08-30 23:39     ` Bae, Chang Seok
2021-07-30 14:59 ` [PATCH v9 09/26] x86/fpu/xstate: Update the XSTATE save function to support dynamic states Chang S. Bae
2021-07-30 14:59 ` [PATCH v9 10/26] x86/fpu/xstate: Update the XSTATE buffer address finder " Chang S. Bae
2021-08-18 11:33   ` Borislav Petkov
2021-08-18 19:47     ` Bae, Chang Seok
2021-08-30 17:18       ` Borislav Petkov
2021-08-30 23:38         ` Bae, Chang Seok
2021-07-30 14:59 ` [PATCH v9 11/26] x86/fpu/xstate: Update the XSTATE context copy function " Chang S. Bae
2021-08-18 12:03   ` Borislav Petkov
2021-08-18 19:47     ` Bae, Chang Seok
2021-07-30 14:59 ` [PATCH v9 12/26] x86/fpu/xstate: Use feature disable (XFD) to protect dynamic user state Chang S. Bae
2021-08-18 16:24   ` Borislav Petkov
2021-08-18 17:20     ` Thiago Macieira
2021-08-18 17:46       ` Borislav Petkov
2021-08-18 17:58         ` Thiago Macieira
2021-08-18 18:10           ` Borislav Petkov
2021-08-24 22:51             ` Len Brown
2021-08-18 20:43         ` Bae, Chang Seok
2021-08-18 21:04           ` Thiago Macieira
2021-08-18 21:12             ` Bae, Chang Seok
2021-08-18 22:27               ` Thiago Macieira
2021-08-19  1:21             ` Andy Lutomirski
2021-08-19 16:06               ` Thiago Macieira
2021-08-18 21:17           ` Borislav Petkov
2021-08-18 21:37             ` Bae, Chang Seok
2021-08-19  8:00               ` Borislav Petkov
2021-08-19 15:24                 ` Bae, Chang Seok
2021-08-24 23:22             ` Len Brown
2021-08-30 17:31               ` Borislav Petkov
2021-09-17  3:48                 ` Len Brown
2021-08-18 19:47     ` Bae, Chang Seok
2021-08-24 22:21     ` Len Brown
2021-08-30 17:41       ` Borislav Petkov
2021-08-31 21:44         ` Len Brown
2021-08-24 23:17     ` Len Brown
2021-08-30 17:53       ` Borislav Petkov
2021-08-31 22:07         ` Len Brown
2021-08-31 22:11           ` Dave Hansen
2021-08-30 18:04       ` Dave Hansen
2021-08-31 22:15         ` Len Brown
2021-08-31 22:16           ` Len Brown
2021-08-31 22:39           ` Thiago Macieira
2021-08-31 22:44             ` Len Brown
2021-07-30 14:59 ` [PATCH v9 13/26] x86/fpu/xstate: Support ptracer-induced XSTATE buffer expansion Chang S. Bae
2021-07-30 14:59 ` [PATCH v9 14/26] x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE Chang S. Bae
2021-08-06 16:46   ` Thiago Macieira
2021-08-09 22:08     ` Bae, Chang Seok
2021-08-09 23:42       ` Thiago Macieira
2021-08-10  0:57         ` Bae, Chang Seok
2021-08-13 19:44           ` Bae, Chang Seok
2021-07-30 14:59 ` [PATCH v9 15/26] x86/fpu/xstate: Support both legacy and expanded signal XSTATE size Chang S. Bae
2021-07-30 14:59 ` [PATCH v9 16/26] x86/fpu/xstate: Adjust the XSAVE feature table to address gaps in state component numbers Chang S. Bae
2021-07-30 14:59 ` [PATCH v9 17/26] x86/fpu/xstate: Disable XSTATE support if an inconsistent state is detected Chang S. Bae
2021-07-30 14:59 ` [PATCH v9 18/26] x86/cpufeatures/amx: Enumerate Advanced Matrix Extension (AMX) feature bits Chang S. Bae
2021-07-30 14:59 ` [PATCH v9 19/26] x86/fpu/amx: Define AMX state components and have it used for boot-time checks Chang S. Bae
2021-07-30 14:59 ` [PATCH v9 20/26] x86/fpu/amx: Initialize child's AMX state Chang S. Bae
2021-07-30 14:59 ` [PATCH v9 21/26] x86/fpu/amx: Enable the AMX feature in 64-bit mode Chang S. Bae
2021-07-30 14:59 ` [PATCH v9 22/26] x86/fpu/xstate: Skip writing zeros to signal frame for dynamic user states if in INIT-state Chang S. Bae
2021-07-30 14:59 ` [PATCH v9 23/26] selftest/x86/amx: Test cases for the AMX state management Chang S. Bae
2021-07-30 14:59 ` [PATCH v9 24/26] x86/insn/amx: Add TILERELEASE instruction to the opcode map Chang S. Bae
2021-07-30 14:59 ` [PATCH v9 25/26] intel_idle/amx: Add SPR support with XTILEDATA capability Chang S. Bae
2021-07-30 18:41   ` Dave Hansen
2021-08-03 21:32     ` Bae, Chang Seok
2021-08-03 21:38       ` Dave Hansen
2021-08-03 21:43         ` Brown, Len
2021-07-30 20:15   ` Dave Hansen
2021-07-30 14:59 ` [PATCH v9 26/26] x86/fpu/xstate: Add a sanity check for XFD state when saving XSTATE Chang S. Bae
