From: Thomas Gleixner <tglx@linutronix.de>
To: LKML <linux-kernel@vger.kernel.org>
Cc: x86@kernel.org, Andrew Cooper <andrew.cooper3@citrix.com>,
	"Edgecombe, Rick P" <rick.p.edgecombe@intel.com>,
	Andrew Cooper <Andrew.Cooper3@citrix.com>
Subject: [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported
Date: Mon,  4 Apr 2022 14:11:27 +0200 (CEST)
Message-ID: <20220404104820.713066297@linutronix.de>
In-Reply-To: 20220404103741.809025935@linutronix.de

XSAVEC/S store the FPU state in compacted format, which avoids holes in the
memory image. The kernel uses this feature in a very naive way and only
avoids the holes which come from unsupported features, like PT. That's a
marginal saving of 128 bytes vs. the uncompacted format on a SKL-X.

The first 576 bytes are fixed: 512 bytes of legacy (FP/SSE) state and the
64-byte XSAVE header. On a SKL-X machine the other components are stored at
the following offsets:
 
 xstate_offset[2]:  576, xstate_sizes[2]:  256
 xstate_offset[3]:  832, xstate_sizes[3]:   64
 xstate_offset[4]:  896, xstate_sizes[4]:   64
 xstate_offset[5]:  960, xstate_sizes[5]:   64
 xstate_offset[6]: 1024, xstate_sizes[6]:  512
 xstate_offset[7]: 1536, xstate_sizes[7]: 1024
 xstate_offset[9]: 2560, xstate_sizes[9]:    8

XSAVEC/S use the init optimization, which does not write the data of a
component when the component is in init state. Which components were
actually written is recorded in the XSTATE_BV bitmap of the XSTATE header.

The kernel requests to save all enabled components, which results in a
suboptimal write/read pattern when the set of active components is sparse.

A typical scenario is an active set of 0x202 (PKRU + SSE) out of the full
supported set of 0x2FF. That means XSAVEC/S writes and XRSTOR[S] reads:

  - SSE in the legacy area (0-511)
  - Part of the XSTATE header (512-575)
  - PKRU at offset 2560

which is suboptimal. Prefetch works better when the access is linear. But
what's worse is that PKRU can be located in a different page which
obviously affects dTLB.
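
The page effect can be illustrated with a small userspace sketch. It is not
kernel code: the helper name and the 4 KiB page size are assumptions made
for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SIZE	4096UL

/*
 * Hypothetical helper: returns true when the component stored at
 * @offset lands in a different page than the start of the XSAVE
 * buffer located at address @buf.
 */
static bool component_on_other_page(uintptr_t buf, unsigned int offset)
{
	return (buf / SKETCH_PAGE_SIZE) != ((buf + offset) / SKETCH_PAGE_SIZE);
}
```

For a buffer which starts 2048 bytes into a page, PKRU at offset 2560 is in
the next page, while PKRU at offset 576 stays in the same page. A buffer
close enough to the end of a page can still straddle two pages either way.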

XSAVEC/S allow a further reduction of the memory footprint when the active
feature set is sparse and the CPU supports XGETBV1. XGETBV1 returns the
in-use state of the XSTATE components as a bitmap. This bitmap can be fed
into XSAVEC/S to request only the storage of the active components, which
changes the layout of the state buffer to:

  - SSE in the legacy area (0-511)
  - Part of the XSTATE header (512-575)
  - PKRU at offset 576
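
How the requested bitmap determines the compacted layout can be sketched in
plain C. This is an illustration and not the kernel implementation: the
helper name is hypothetical, the sizes are copied from the SKL-X table
above, and the optional 64-byte alignment of components is ignored.

```c
#include <stdint.h>

/* Component sizes from the SKL-X table; index is the xfeature number. */
static const unsigned int sketch_xstate_sizes[10] = {
	[2] = 256, [3] = 64, [4] = 64, [5] = 64,
	[6] = 512, [7] = 1024, [9] = 8,
};

/* 512 bytes legacy area plus 64 bytes header */
#define SKETCH_FIXED_SIZE	576

/*
 * Hypothetical helper: offset of @feature in a compacted buffer which
 * was written with the requested bitmap @mask. Returns 0 when @feature
 * is not part of the extended area (features 0 and 1 live in the
 * legacy area) or is not set in @mask.
 */
static unsigned int compacted_offset(uint64_t mask, unsigned int feature)
{
	unsigned int offset = SKETCH_FIXED_SIZE;
	unsigned int i;

	if (feature < 2 || feature > 9 || !(mask & (1ULL << feature)))
		return 0;

	/* Only components which are set in @mask occupy space */
	for (i = 2; i < feature; i++) {
		if (mask & (1ULL << i))
			offset += sketch_xstate_sizes[i];
	}
	return offset;
}
```

With the full set 0x2FF the sketch places PKRU (feature 9) at offset 2560,
with the sparse set 0x202 at offset 576, which matches the two layouts
described above.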

This optimization does not gain much for e.g. a kernel build, but for
context switch heavy applications it's very visible. Perf stats from
hackbench:

Before:

        242,618.89 msec task-clock                #  102.928 CPUs utilized            ( +-  0.20% )
         1,038,988      context-switches          #    0.004 M/sec                    ( +-  0.54% )
           460,081      cpu-migrations            #    0.002 M/sec                    ( +-  0.56% )
            10,813      page-faults               #    0.045 K/sec                    ( +-  0.62% )
   506,912,353,968      cycles                    #    2.089 GHz                      ( +-  0.20% )
   167,267,811,210      instructions              #    0.33  insn per cycle           ( +-  0.04% )
    34,481,978,727      branches                  #  142.124 M/sec                    ( +-  0.04% )
       305,975,304      branch-misses             #    0.89% of all branches          ( +-  0.09% )

           2.35717 +- 0.00607 seconds time elapsed  ( +-  0.26% )

   506,064,738,921      cycles                                                        ( +-  0.43% )
     3,334,160,871      L1-dcache-load-misses                                         ( +-  0.77% )
       135,271,979      dTLB-load-misses                                              ( +-  2.12% )
        18,169,634      dTLB-store-misses                                             ( +-  1.78% )

            2.3323 +- 0.0117 seconds time elapsed  ( +-  0.50% )

After:

        222,252.90 msec task-clock                #  103.800 CPUs utilized            ( +-  0.51% )
         1,004,665      context-switches          #    0.005 M/sec                    ( +-  0.42% )
           459,123      cpu-migrations            #    0.002 M/sec                    ( +-  0.33% )
            10,677      page-faults               #    0.048 K/sec                    ( +-  0.79% )
   464,356,465,870      cycles                    #    2.089 GHz                      ( +-  0.51% )
   166,615,501,152      instructions              #    0.36  insn per cycle           ( +-  0.05% )
    34,355,848,663      branches                  #  154.580 M/sec                    ( +-  0.05% )
       300,049,704      branch-misses             #    0.87% of all branches          ( +-  0.14% )

            2.1412 +- 0.0117 seconds time elapsed  ( +-  0.55% )

   473,864,807,936      cycles                                                        ( +-  0.64% )
     3,198,078,809      L1-dcache-load-misses                                         ( +-  0.24% )
        27,798,721      dTLB-load-misses                                              ( +-  2.33% )
         4,981,069      dTLB-store-misses                                             ( +-  1.80% )

            2.1733 +- 0.0132 seconds time elapsed  ( +-  0.61% )

The most significant change is in dTLB misses: load misses drop by roughly
79% and store misses by roughly 73%.

The effect depends on the application scenario, the kernel configuration
and the allocation placement of task_struct, so it might not be noticeable
at all. As the XGETBV1 optimization does not introduce a measurable
overhead, it's worth using it if supported by the hardware.

Enable it when available with a static key and mask out the non-active
states in the requested bitmap for XSAVEC/S.
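
The masking step can be modelled in userspace as follows. This is a sketch
under assumptions: the struct and function names are hypothetical and the
in-use bitmap is passed in as a parameter instead of being read with
XGETBV.

```c
#include <stdint.h>

/* Hypothetical container for the EDX:EAX pair handed to XSAVEC/S */
struct sketch_rfbm {
	uint32_t lo;	/* goes into EAX */
	uint32_t hi;	/* goes into EDX */
};

/*
 * Restrict the enabled feature set to the components which are in use
 * and split the result into two 32-bit halves.
 */
static struct sketch_rfbm narrow_rfbm(uint64_t enabled, uint64_t in_use)
{
	uint64_t mask = enabled & in_use;
	struct sketch_rfbm r = {
		.lo = (uint32_t)mask,
		.hi = (uint32_t)(mask >> 32),
	};

	return r;
}
```

For the scenario in this changelog, narrow_rfbm(0x2FF, 0x202) yields a low
half of 0x202 and a high half of 0.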

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/fpu/xstate.c |   10 ++++++++--
 arch/x86/kernel/fpu/xstate.h |   16 +++++++++++++---
 2 files changed, 21 insertions(+), 5 deletions(-)

--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -86,6 +86,8 @@ static unsigned int xstate_flags[XFEATUR
 #define XSTATE_FLAG_SUPERVISOR	BIT(0)
 #define XSTATE_FLAG_ALIGNED64	BIT(1)
 
+DEFINE_STATIC_KEY_FALSE(__xsave_use_xgetbv1);
+
 /*
  * Return whether the system supports a given xfeature.
  *
@@ -1481,7 +1483,7 @@ void xfd_validate_state(struct fpstate *
 }
 #endif /* CONFIG_X86_DEBUG_FPU */
 
-static int __init xfd_update_static_branch(void)
+static int __init fpu_update_static_branches(void)
 {
 	/*
 	 * If init_fpstate.xfd has bits set then dynamic features are
@@ -1489,9 +1491,13 @@ static int __init xfd_update_static_bran
 	 */
 	if (init_fpstate.xfd)
 		static_branch_enable(&__fpu_state_size_dynamic);
+
+	if (cpu_feature_enabled(X86_FEATURE_XGETBV1) &&
+	    cpu_feature_enabled(X86_FEATURE_XCOMPACTED))
+		static_branch_enable(&__xsave_use_xgetbv1);
 	return 0;
 }
-arch_initcall(xfd_update_static_branch)
+arch_initcall(fpu_update_static_branches)
 
 void fpstate_free(struct fpu *fpu)
 {
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -10,7 +10,12 @@
 DECLARE_PER_CPU(u64, xfd_state);
 #endif
 
-static inline bool xsave_use_xgetbv1(void) { return false; }
+DECLARE_STATIC_KEY_FALSE(__xsave_use_xgetbv1);
+
+static __always_inline __pure bool xsave_use_xgetbv1(void)
+{
+	return static_branch_likely(&__xsave_use_xgetbv1);
+}
 
 static inline void __xstate_init_xcomp_bv(struct xregs_state *xsave, u64 mask)
 {
@@ -185,13 +190,18 @@ static inline int __xfd_enable_feature(u
 static inline void os_xsave(struct fpstate *fpstate)
 {
 	u64 mask = fpstate->xfeatures;
-	u32 lmask = mask;
-	u32 hmask = mask >> 32;
+	u32 lmask, hmask;
 	int err;
 
 	WARN_ON_FPU(!alternatives_patched);
 	xfd_validate_state(fpstate, mask, false);
 
+	if (xsave_use_xgetbv1())
+		mask &= xgetbv(1);
+
+	lmask = mask;
+	hmask = mask >> 32;
+
 	XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);
 
 	/* We should never fault when copying to a kernel buffer: */

Thread overview: 16+ messages
2022-04-04 12:11 [patch 0/3] x86/fpu/xsave: Add XSAVEC support and XGETBV1 utilization Thomas Gleixner
2022-04-04 12:11 ` [patch 1/3] x86/fpu/xsave: Support XSAVEC in the kernel Thomas Gleixner
2022-04-04 16:10   ` Andrew Cooper
2022-04-14 14:43   ` Dave Hansen
2022-04-25 13:11   ` [tip: x86/fpu] " tip-bot2 for Thomas Gleixner
2022-04-04 12:11 ` [patch 2/3] x86/fpu/xsave: Prepare for optimized compaction Thomas Gleixner
2022-04-14 15:46   ` Dave Hansen
2022-04-19 12:39     ` Thomas Gleixner
2022-04-19 13:33       ` Thomas Gleixner
2022-04-04 12:11 ` Thomas Gleixner [this message]
2022-04-14 17:24   ` [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported Dave Hansen
2022-04-19 13:43     ` Thomas Gleixner
2022-04-19 21:22       ` Thomas Gleixner
2022-04-20 18:15         ` Tom Lendacky
2022-04-22 19:30           ` Thomas Gleixner
2022-04-23 15:20             ` Dave Hansen