* [PATCH v2 0/4] x86/fpu: Reduce unnecessary FNINIT and MXCSR usage
From: Andy Lutomirski @ 2021-01-19 17:38 UTC (permalink / raw)
  To: x86
  Cc: LKML, Krzysztof Mazur, Krzysztof Olędzki, Arnd Bergmann,
	Andy Lutomirski

This series fixes two regressions: a boot failure on AMD K7 and a
performance regression on everything.

I did a double-take here -- the regressions were reported by different
people, both named Krzysztof :)

Changes from v1:
 - Fix MMX better -- MMX really does need FNINIT.
 - Improve the EFI code.
 - Rename the KFPU constants.
 - Changelog improvements.

Andy Lutomirski (4):
  x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
  x86/mmx: Use KFPU_387 for MMX string operations
  x86/fpu: Make the EFI FPU calling convention explicit
  x86/fpu/64: Don't FNINIT in kernel_fpu_begin()

 arch/x86/include/asm/efi.h     | 24 ++++++++++++++++++++----
 arch/x86/include/asm/fpu/api.h | 27 +++++++++++++++++++++++++--
 arch/x86/kernel/fpu/core.c     | 17 +++++++++++------
 arch/x86/lib/mmx_32.c          | 20 +++++++++++++++-----
 arch/x86/platform/efi/efi_64.c |  4 ++--
 5 files changed, 73 insertions(+), 19 deletions(-)

-- 
2.29.2



* [PATCH v2 1/4] x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
From: Andy Lutomirski @ 2021-01-19 17:38 UTC (permalink / raw)
  To: x86
  Cc: LKML, Krzysztof Mazur, Krzysztof Olędzki, Arnd Bergmann,
	Andy Lutomirski

Currently, requesting kernel FPU access doesn't distinguish which parts of
the extended ("FPU") state are needed.  This is nice for simplicity, but
there are a few cases in which it's suboptimal:

 - The vast majority of in-kernel FPU users want XMM/YMM/ZMM state but do
   not use legacy 387 state.  These users want MXCSR initialized but don't
   care about the FPU control word.  Skipping FNINIT would save time.
   (Empirically, FNINIT is several times slower than LDMXCSR.)

 - Code that wants MMX doesn't want or need MXCSR initialized.
   _mmx_memcpy(), for example, can run before CR4.OSFXSR gets set, and
   initializing MXCSR will fail.

 - Any future in-kernel users of XFD (eXtended Feature Disable)-capable
   dynamic states will need special handling.

This patch adds a more specific API that allows callers to specify exactly
what they want.
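
For illustration, a hypothetical caller under the new API could look like
the sketch below (only kernel_fpu_begin_mask(), KFPU_MXCSR and
kernel_fpu_end() come from this series; the function and its body are
made up):

	/*
	 * Hypothetical SSE/AVX-only consumer: MXCSR must be sanitized, but
	 * the legacy 387 state is never touched, so FNINIT can be skipped.
	 */
	static void example_xor_blocks(unsigned long *dst,
				       const unsigned long *src, size_t words)
	{
		kernel_fpu_begin_mask(KFPU_MXCSR);  /* LDMXCSR only, no FNINIT */
		/* ... XMM/YMM register work on dst/src ... */
		kernel_fpu_end();
	}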

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/fpu/api.h | 15 +++++++++++++--
 arch/x86/kernel/fpu/core.c     | 17 +++++++++++------
 2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index dcd9503b1098..38f4936045ab 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -16,14 +16,25 @@
  * Use kernel_fpu_begin/end() if you intend to use FPU in kernel context. It
  * disables preemption so be careful if you intend to use it for long periods
  * of time.
- * If you intend to use the FPU in softirq you need to check first with
+ * If you intend to use the FPU in irq/softirq you need to check first with
  * irq_fpu_usable() if it is possible.
  */
-extern void kernel_fpu_begin(void);
+
+/* Kernel FPU states to initialize in kernel_fpu_begin_mask() */
+#define KFPU_387	_BITUL(0)	/* 387 state will be initialized */
+#define KFPU_MXCSR	_BITUL(1)	/* MXCSR will be initialized */
+
+extern void kernel_fpu_begin_mask(unsigned int kfpu_mask);
 extern void kernel_fpu_end(void);
 extern bool irq_fpu_usable(void);
 extern void fpregs_mark_activate(void);
 
+/* Code that is unaware of kernel_fpu_begin_mask() can use this */
+static inline void kernel_fpu_begin(void)
+{
+	kernel_fpu_begin_mask(KFPU_387 | KFPU_MXCSR);
+}
+
 /*
  * Use fpregs_lock() while editing CPU's FPU registers or fpu->state.
  * A context switch will (and softirq might) save CPU's FPU registers to
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index eb86a2b831b1..d4a71596c41e 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -121,7 +121,7 @@ int copy_fpregs_to_fpstate(struct fpu *fpu)
 }
 EXPORT_SYMBOL(copy_fpregs_to_fpstate);
 
-void kernel_fpu_begin(void)
+void kernel_fpu_begin_mask(unsigned int kfpu_mask)
 {
 	preempt_disable();
 
@@ -141,13 +141,18 @@ void kernel_fpu_begin(void)
 	}
 	__cpu_invalidate_fpregs_state();
 
-	if (boot_cpu_has(X86_FEATURE_XMM))
-		ldmxcsr(MXCSR_DEFAULT);
+	/* Put sane initial values into the control registers. */
+	if (likely(kfpu_mask & KFPU_MXCSR)) {
+		if (boot_cpu_has(X86_FEATURE_XMM))
+			ldmxcsr(MXCSR_DEFAULT);
+	}
 
-	if (boot_cpu_has(X86_FEATURE_FPU))
-		asm volatile ("fninit");
+	if (unlikely(kfpu_mask & KFPU_387)) {
+		if (boot_cpu_has(X86_FEATURE_FPU))
+			asm volatile ("fninit");
+	}
 }
-EXPORT_SYMBOL_GPL(kernel_fpu_begin);
+EXPORT_SYMBOL_GPL(kernel_fpu_begin_mask);
 
 void kernel_fpu_end(void)
 {
-- 
2.29.2



* [PATCH v2 2/4] x86/mmx: Use KFPU_387 for MMX string operations
From: Andy Lutomirski @ 2021-01-19 17:39 UTC (permalink / raw)
  To: x86
  Cc: LKML, Krzysztof Mazur, Krzysztof Olędzki, Arnd Bergmann,
	Andy Lutomirski

The default kernel_fpu_begin() doesn't work on systems that support XMM but
haven't yet enabled CR4.OSFXSR.  This causes crashes when _mmx_memcpy() is
called too early because LDMXCSR generates #UD when the aforementioned bit
is clear.

Fix it by using kernel_fpu_begin_mask(KFPU_387) explicitly.

Fixes: 7ad816762f9b ("x86/fpu: Reset MXCSR to default in kernel_fpu_begin()")
Reported-by: Krzysztof Mazur <krzysiek@podlesie.net>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/lib/mmx_32.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/arch/x86/lib/mmx_32.c b/arch/x86/lib/mmx_32.c
index 4321fa02e18d..2a6ad7aa148a 100644
--- a/arch/x86/lib/mmx_32.c
+++ b/arch/x86/lib/mmx_32.c
@@ -26,6 +26,16 @@
 #include <asm/fpu/api.h>
 #include <asm/asm.h>
 
+/*
+ * For MMX, we use KFPU_387.  MMX instructions are not affected by MXCSR,
+ * but both AMD and Intel documentation states that even integer MMX
+ * operations will result in #MF if an exception is pending in FCW.
+ *
+ * We don't need EMMS afterwards because, after we call kernel_fpu_end(),
+ * any subsequent user of the 387 stack will reinitialize it using
+ * KFPU_387.
+ */
+
 void *_mmx_memcpy(void *to, const void *from, size_t len)
 {
 	void *p;
@@ -37,7 +47,7 @@ void *_mmx_memcpy(void *to, const void *from, size_t len)
 	p = to;
 	i = len >> 6; /* len/64 */
 
-	kernel_fpu_begin();
+	kernel_fpu_begin_mask(KFPU_387);
 
 	__asm__ __volatile__ (
 		"1: prefetch (%0)\n"		/* This set is 28 bytes */
@@ -127,7 +137,7 @@ static void fast_clear_page(void *page)
 {
 	int i;
 
-	kernel_fpu_begin();
+	kernel_fpu_begin_mask(KFPU_387);
 
 	__asm__ __volatile__ (
 		"  pxor %%mm0, %%mm0\n" : :
@@ -160,7 +170,7 @@ static void fast_copy_page(void *to, void *from)
 {
 	int i;
 
-	kernel_fpu_begin();
+	kernel_fpu_begin_mask(KFPU_387);
 
 	/*
 	 * maybe the prefetch stuff can go before the expensive fnsave...
@@ -247,7 +257,7 @@ static void fast_clear_page(void *page)
 {
 	int i;
 
-	kernel_fpu_begin();
+	kernel_fpu_begin_mask(KFPU_387);
 
 	__asm__ __volatile__ (
 		"  pxor %%mm0, %%mm0\n" : :
@@ -282,7 +292,7 @@ static void fast_copy_page(void *to, void *from)
 {
 	int i;
 
-	kernel_fpu_begin();
+	kernel_fpu_begin_mask(KFPU_387);
 
 	__asm__ __volatile__ (
 		"1: prefetch (%0)\n"
-- 
2.29.2



* [PATCH v2 3/4] x86/fpu: Make the EFI FPU calling convention explicit
From: Andy Lutomirski @ 2021-01-19 17:39 UTC (permalink / raw)
  To: x86
  Cc: LKML, Krzysztof Mazur, Krzysztof Olędzki, Arnd Bergmann,
	Andy Lutomirski

EFI uses kernel_fpu_begin() to conform to the UEFI calling convention.
This specifically requires initializing FCW, whereas no sane 64-bit kernel
code should use legacy 387 operations that reference FCW.

This should enable us to safely change the default semantics of
kernel_fpu_begin() to stop initializing FCW on 64-bit kernels.

Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/efi.h     | 24 ++++++++++++++++++++----
 arch/x86/platform/efi/efi_64.c |  4 ++--
 2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index bc9758ef292e..f3201544fbf8 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -68,17 +68,33 @@ extern unsigned long efi_fw_vendor, efi_config_table;
 		#f " called with too many arguments (" #p ">" #n ")");	\
 })
 
+static inline void efi_fpu_begin(void)
+{
+	/*
+	 * The UEFI calling convention (UEFI spec 2.3.2 and 2.3.4) requires
+	 * that FCW and MXCSR (64-bit) must be initialized prior to calling
+	 * UEFI code.  (Oddly the spec does not require that the FPU stack
+	 * be empty.)
+	 */
+	kernel_fpu_begin_mask(KFPU_387 | KFPU_MXCSR);
+}
+
+static inline void efi_fpu_end(void)
+{
+	kernel_fpu_end();
+}
+
 #ifdef CONFIG_X86_32
 #define arch_efi_call_virt_setup()					\
 ({									\
-	kernel_fpu_begin();						\
+	efi_fpu_begin();						\
 	firmware_restrict_branch_speculation_start();			\
 })
 
 #define arch_efi_call_virt_teardown()					\
 ({									\
 	firmware_restrict_branch_speculation_end();			\
-	kernel_fpu_end();						\
+	efi_fpu_end();							\
 })
 
 #define arch_efi_call_virt(p, f, args...)	p->f(args)
@@ -107,7 +123,7 @@ struct efi_scratch {
 #define arch_efi_call_virt_setup()					\
 ({									\
 	efi_sync_low_kernel_mappings();					\
-	kernel_fpu_begin();						\
+	efi_fpu_begin();						\
 	firmware_restrict_branch_speculation_start();			\
 	efi_switch_mm(&efi_mm);						\
 })
@@ -119,7 +135,7 @@ struct efi_scratch {
 ({									\
 	efi_switch_mm(efi_scratch.prev_mm);				\
 	firmware_restrict_branch_speculation_end();			\
-	kernel_fpu_end();						\
+	efi_fpu_end();							\
 })
 
 #ifdef CONFIG_KASAN
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 8f5759df7776..f042040b5da1 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -848,7 +848,7 @@ efi_set_virtual_address_map(unsigned long memory_map_size,
 							 virtual_map);
 	efi_switch_mm(&efi_mm);
 
-	kernel_fpu_begin();
+	efi_fpu_begin();
 
 	/* Disable interrupts around EFI calls: */
 	local_irq_save(flags);
@@ -857,7 +857,7 @@ efi_set_virtual_address_map(unsigned long memory_map_size,
 			  descriptor_version, virtual_map);
 	local_irq_restore(flags);
 
-	kernel_fpu_end();
+	efi_fpu_end();
 
 	/* grab the virtually remapped EFI runtime services table pointer */
 	efi.runtime = READ_ONCE(systab->runtime);
-- 
2.29.2



* [PATCH v2 4/4] x86/fpu/64: Don't FNINIT in kernel_fpu_begin()
From: Andy Lutomirski @ 2021-01-19 17:39 UTC (permalink / raw)
  To: x86
  Cc: LKML, Krzysztof Mazur, Krzysztof Olędzki, Arnd Bergmann,
	Andy Lutomirski

The remaining callers of kernel_fpu_begin() in 64-bit kernels don't use 387
instructions, so there's no need to sanitize the FPU state.  Skip it to get
most of the performance we lost back.

Reported-by: Krzysztof Olędzki <ole@ans.pl>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/fpu/api.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 38f4936045ab..435bc59d539b 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -32,7 +32,19 @@ extern void fpregs_mark_activate(void);
 /* Code that is unaware of kernel_fpu_begin_mask() can use this */
 static inline void kernel_fpu_begin(void)
 {
+#ifdef CONFIG_X86_64
+	/*
+	 * Any 64-bit code that uses 387 instructions must explicitly request
+	 * KFPU_387.
+	 */
+	kernel_fpu_begin_mask(KFPU_MXCSR);
+#else
+	/*
+	 * 32-bit kernel code may use 387 operations as well as SSE2, etc,
+	 * as long as it checks that the CPU has the required capability.
+	 */
 	kernel_fpu_begin_mask(KFPU_387 | KFPU_MXCSR);
+#endif
 }
 
 /*
-- 
2.29.2



* Re: [PATCH v2 1/4] x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
From: Sean Christopherson @ 2021-01-19 19:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, LKML, Krzysztof Mazur, Krzysztof Olędzki, Arnd Bergmann

On Tue, Jan 19, 2021, Andy Lutomirski wrote:
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index eb86a2b831b1..d4a71596c41e 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -121,7 +121,7 @@ int copy_fpregs_to_fpstate(struct fpu *fpu)
>  }
>  EXPORT_SYMBOL(copy_fpregs_to_fpstate);
>  
> -void kernel_fpu_begin(void)
> +void kernel_fpu_begin_mask(unsigned int kfpu_mask)
>  {
>  	preempt_disable();
>  
> @@ -141,13 +141,18 @@ void kernel_fpu_begin(void)
>  	}
>  	__cpu_invalidate_fpregs_state();
>  
> -	if (boot_cpu_has(X86_FEATURE_XMM))
> -		ldmxcsr(MXCSR_DEFAULT);
> +	/* Put sane initial values into the control registers. */
> +	if (likely(kfpu_mask & KFPU_MXCSR)) {
> +		if (boot_cpu_has(X86_FEATURE_XMM))
> +			ldmxcsr(MXCSR_DEFAULT);
> +	}
>  
> -	if (boot_cpu_has(X86_FEATURE_FPU))
> -		asm volatile ("fninit");
> +	if (unlikely(kfpu_mask & KFPU_387)) {
> +		if (boot_cpu_has(X86_FEATURE_FPU))
> +			asm volatile ("fninit");
> +	}

Why not combine these into a single if statement?  Easier on the eyes (IMO), and
would generate a smaller diff.

	if (likely(kfpu_mask & KFPU_MXCSR) && boot_cpu_has(X86_FEATURE_XMM))
		ldmxcsr(MXCSR_DEFAULT);

	if (unlikely(kfpu_mask & KFPU_387) && boot_cpu_has(X86_FEATURE_FPU))
		asm volatile ("fninit");

>  }
> -EXPORT_SYMBOL_GPL(kernel_fpu_begin);
> +EXPORT_SYMBOL_GPL(kernel_fpu_begin_mask);
>  
>  void kernel_fpu_end(void)
>  {
> -- 
> 2.29.2
> 


* Re: [PATCH v2 0/4] x86/fpu: Reduce unnecessary FNINIT and MXCSR usage
From: Krzysztof Olędzki @ 2021-01-20  7:51 UTC (permalink / raw)
  To: Andy Lutomirski, x86; +Cc: LKML, Krzysztof Mazur, Arnd Bergmann

On 2021-01-19 at 09:38, Andy Lutomirski wrote:
> This series fixes two regressions: a boot failure on AMD K7 and a
> performance regression on everything.
> 
> I did a double-take here -- the regressions were reported by different
> people, both named Krzysztof :)
> 
> Changes from v1:
>   - Fix MMX better -- MMX really does need FNINIT.
>   - Improve the EFI code.
>   - Rename the KFPU constants.
>   - Changelog improvements.
> 
> Andy Lutomirski (4):
>    x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
>    x86/mmx: Use KFPU_387 for MMX string operations
>    x86/fpu: Make the EFI FPU calling convention explicit
>    x86/fpu/64: Don't FNINIT in kernel_fpu_begin()

Hi Andy.

I have quickly tested the new version on E3-1280 V2.

* 5.10.9 + 7ad816762f9bf89e940e618ea40c43138b479e10 reverted (aka unfixed)
xor: measuring software checksum speed
    avx             : 38616 MB/sec
    prefetch64-sse  : 25797 MB/sec
    generic_sse     : 23147 MB/sec
xor: using function: avx (38616 MB/sec)

* 5.10.9 (the original)
xor: measuring software checksum speed
    avx             : 23318 MB/sec
    prefetch64-sse  : 22562 MB/sec
    generic_sse     : 20431 MB/sec
xor: using function: avx (23318 MB/sec)

* 5.10.9 + "Reduce unnecessary FNINIT and MXCSR usage" v2
xor: measuring software checksum speed
    avx             : 26451 MB/sec
    prefetch64-sse  : 25777 MB/sec
    generic_sse     : 23178 MB/sec
xor: using function: avx (26451 MB/sec)

Overall, the kernel xor benchmark reports better performance on 5.10.9 than 
on 5.4.90 (see my previous e-mail), but the general trend is the same.

The "unfixed" kernel is much faster for all of avx, prefetch64-sse and 
generic_sse. With the fix, we see the expected perf regression.

Now, with your patchset, both prefetch64-sse and generic_sse are able to 
recover the full performance, as seen on 5.4. However, this is not the 
case for avx. While there is still an improvement, it is nowhere close 
to where it used to be.

I wonder why AVX still sees a regression and if anything more can be 
done about it?

Will do more tests tomorrow.

Thanks,
  Krzysztof


* Re: [PATCH v2 4/4] x86/fpu/64: Don't FNINIT in kernel_fpu_begin()
From: Peter Zijlstra @ 2021-01-20 10:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, LKML, Krzysztof Mazur, Krzysztof Olędzki,
	Arnd Bergmann, Josh Poimboeuf

On Tue, Jan 19, 2021 at 09:39:02AM -0800, Andy Lutomirski wrote:
> The remaining callers of kernel_fpu_begin() in 64-bit kernels don't use 387
> instructions, so there's no need to sanitize the FPU state.  Skip it to get
> most of the performance we lost back.
> 
> Reported-by: Krzysztof Olędzki <ole@ans.pl>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/include/asm/fpu/api.h | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
> index 38f4936045ab..435bc59d539b 100644
> --- a/arch/x86/include/asm/fpu/api.h
> +++ b/arch/x86/include/asm/fpu/api.h
> @@ -32,7 +32,19 @@ extern void fpregs_mark_activate(void);
>  /* Code that is unaware of kernel_fpu_begin_mask() can use this */
>  static inline void kernel_fpu_begin(void)
>  {
> +#ifdef CONFIG_X86_64
> +	/*
> +	 * Any 64-bit code that uses 387 instructions must explicitly request
> +	 * KFPU_387.
> +	 */
> +	kernel_fpu_begin_mask(KFPU_MXCSR);

I'm also still sitting on this:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git x86/fpu

what do we do with that?


* Re: [PATCH v2 1/4] x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
From: Borislav Petkov @ 2021-01-20 11:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, LKML, Krzysztof Mazur, Krzysztof Olędzki, Arnd Bergmann

On Tue, Jan 19, 2021 at 09:38:59AM -0800, Andy Lutomirski wrote:
> Currently, requesting kernel FPU access doesn't distinguish which parts of
> the extended ("FPU") state are needed.  This is nice for simplicity, but
> there are a few cases in which it's suboptimal:
> 
>  - The vast majority of in-kernel FPU users want XMM/YMM/ZMM state but do
>    not use legacy 387 state.  These users want MXCSR initialized but don't
>    care about the FPU control word.  Skipping FNINIT would save time.
>    (Empirically, FNINIT is several times slower than LDMXCSR.)
> 
>  - Code that wants MMX doesn't want or need MXCSR initialized.
>    _mmx_memcpy(), for example, can run before CR4.OSFXSR gets set, and
>    initializing MXCSR will fail.
> 
>  - Any future in-kernel users of XFD (eXtended Feature Disable)-capable
>    dynamic states will need special handling.
> 
> This patch adds a more specific API that allows callers to specify exactly
> what they want.

Same nitpicks:

https://lkml.kernel.org/r/20210119110834.GH27433@zn.tnic

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v2 1/4] x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
From: Andy Lutomirski @ 2021-01-20 12:19 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, x86, LKML, Krzysztof Mazur,
	Krzysztof Olędzki, Arnd Bergmann


> On Jan 20, 2021, at 3:53 AM, Borislav Petkov <bp@alien8.de> wrote:
> 
> On Tue, Jan 19, 2021 at 09:38:59AM -0800, Andy Lutomirski wrote:
>> Currently, requesting kernel FPU access doesn't distinguish which parts of
>> the extended ("FPU") state are needed.  This is nice for simplicity, but
>> there are a few cases in which it's suboptimal:
>> 
>> - The vast majority of in-kernel FPU users want XMM/YMM/ZMM state but do
>>   not use legacy 387 state.  These users want MXCSR initialized but don't
>>   care about the FPU control word.  Skipping FNINIT would save time.
>>   (Empirically, FNINIT is several times slower than LDMXCSR.)
>> 
>> - Code that wants MMX doesn't want or need MXCSR initialized.
>>   _mmx_memcpy(), for example, can run before CR4.OSFXSR gets set, and
>>   initializing MXCSR will fail.
>> 
>> - Any future in-kernel users of XFD (eXtended Feature Disable)-capable
>>   dynamic states will need special handling.
>> 
>> This patch adds a more specific API that allows callers to specify exactly
>> what they want.
> 
> Same nitpicks:
> 
> https://lkml.kernel.org/r/20210119110834.GH27433@zn.tnic

I would have sworn I fixed those.  Sorry!

> 
> -- 
> Regards/Gruss,
>    Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v2 2/4] x86/mmx: Use KFPU_387 for MMX string operations
From: Borislav Petkov @ 2021-01-20 18:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, LKML, Krzysztof Mazur, Krzysztof Olędzki, Arnd Bergmann

On Tue, Jan 19, 2021 at 09:39:00AM -0800, Andy Lutomirski wrote:
> The default kernel_fpu_begin() doesn't work on systems that support XMM but
> haven't yet enabled CR4.OSFXSR.  This causes crashes when _mmx_memcpy() is
> called too early because LDMXCSR generates #UD when the aforementioned bit
> is clear.
> 
> Fix it by using kernel_fpu_begin_mask(KFPU_387) explicitly.
> 
> Fixes: 7ad816762f9b ("x86/fpu: Reset MXCSR to default in kernel_fpu_begin()")
> Reported-by: Krzysztof Mazur <krzysiek@podlesie.net>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Cc: <stable@vger.kernel.org> I guess.

> ---
>  arch/x86/lib/mmx_32.c | 20 +++++++++++++++-----
>  1 file changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/lib/mmx_32.c b/arch/x86/lib/mmx_32.c
> index 4321fa02e18d..2a6ad7aa148a 100644
> --- a/arch/x86/lib/mmx_32.c
> +++ b/arch/x86/lib/mmx_32.c
> @@ -26,6 +26,16 @@
>  #include <asm/fpu/api.h>
>  #include <asm/asm.h>
>  
> +/*
> + * For MMX, we use KFPU_387.  MMX instructions are not affected by MXCSR,
> + * but both AMD and Intel documentation states that even integer MMX
> + * operations will result in #MF if an exception is pending in FCW.
> + *
> + * We don't need EMMS afterwards because, after we call kernel_fpu_end(),
> + * any subsequent user of the 387 stack will reinitialize it using
> + * KFPU_387.

Please use passive voice and convert the "we" to something impersonal.
For example:

"Use KFPU_387 for MMX. MMX instructions are not affected by MXCSR, but
both AMD and Intel documentation states that even integer MMX operations
will result in #MF if an exception is pending in FCW.

EMMS afterwards is not needed because, after kernel_fpu_end(), any
subsequent user of the 387 stack will reinitialize it using KFPU_387."

Voila, de-we-fied!

:-)

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v2 4/4] x86/fpu/64: Don't FNINIT in kernel_fpu_begin()
From: Borislav Petkov @ 2021-01-20 18:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, x86, LKML, Krzysztof Mazur,
	Krzysztof Olędzki, Arnd Bergmann, Josh Poimboeuf

On Wed, Jan 20, 2021 at 11:07:11AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 19, 2021 at 09:39:02AM -0800, Andy Lutomirski wrote:
> > The remaining callers of kernel_fpu_begin() in 64-bit kernels don't use 387
> > instructions, so there's no need to sanitize the FPU state.  Skip it to get
> > most of the performance we lost back.
> > 
> > Reported-by: Krzysztof Olędzki <ole@ans.pl>
> > Signed-off-by: Andy Lutomirski <luto@kernel.org>
> > ---
> >  arch/x86/include/asm/fpu/api.h | 12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
> > index 38f4936045ab..435bc59d539b 100644
> > --- a/arch/x86/include/asm/fpu/api.h
> > +++ b/arch/x86/include/asm/fpu/api.h
> > @@ -32,7 +32,19 @@ extern void fpregs_mark_activate(void);
> >  /* Code that is unaware of kernel_fpu_begin_mask() can use this */
> >  static inline void kernel_fpu_begin(void)
> >  {
> > +#ifdef CONFIG_X86_64
> > +	/*
> > +	 * Any 64-bit code that uses 387 instructions must explicitly request
> > +	 * KFPU_387.
> > +	 */
> > +	kernel_fpu_begin_mask(KFPU_MXCSR);
> 
> I'm also still sitting on this:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git x86/fpu
> 
> what do we do with that?

Yah, I'd prefer an actual explicit check infra for stuff like that
instead of us expecting callers to know what bits they would need to
supply in the mask and then inadvertently goofing it up, leading to
funky context corruption bugs...
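
A minimal sketch of what such checking infrastructure could look like --
purely hypothetical, nothing like this exists in the posted series, and
all names below are made up:

	/* Remember which state the current section asked for ... */
	static DEFINE_PER_CPU(unsigned int, kfpu_active_mask);

	void kernel_fpu_begin_mask(unsigned int kfpu_mask)
	{
		preempt_disable();
		/* ... existing save/invalidate/init logic ... */
		this_cpu_write(kfpu_active_mask, kfpu_mask);
	}

	/* ... and let 387/SSE users assert that it was actually requested: */
	#define kfpu_assert(mask) \
		WARN_ON_ONCE(!(this_cpu_read(kfpu_active_mask) & (mask)))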

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v2 0/4] x86/fpu: Reduce unnecessary FNINIT and MXCSR usage
From: Andy Lutomirski @ 2021-01-21  5:03 UTC (permalink / raw)
  To: Krzysztof Olędzki
  Cc: Andy Lutomirski, X86 ML, LKML, Krzysztof Mazur, Arnd Bergmann

On Tue, Jan 19, 2021 at 11:51 PM Krzysztof Olędzki <ole@ans.pl> wrote:
>
> On 2021-01-19 at 09:38, Andy Lutomirski wrote:
> > This series fixes two regressions: a boot failure on AMD K7 and a
> > performance regression on everything.
> >
> > I did a double-take here -- the regressions were reported by different
> > people, both named Krzysztof :)
> >
> > Changes from v1:
> >   - Fix MMX better -- MMX really does need FNINIT.
> >   - Improve the EFI code.
> >   - Rename the KFPU constants.
> >   - Changelog improvements.
> >
> > Andy Lutomirski (4):
> >    x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
> >    x86/mmx: Use KFPU_387 for MMX string operations
> >    x86/fpu: Make the EFI FPU calling convention explicit
> >    x86/fpu/64: Don't FNINIT in kernel_fpu_begin()
>
> Hi Andy.
>
> I have quickly tested the new version on E3-1280 V2.
>
> * 5.10.9 + 7ad816762f9bf89e940e618ea40c43138b479e10 reverted (aka unfixed)
> xor: measuring software checksum speed
>     avx             : 38616 MB/sec
>     prefetch64-sse  : 25797 MB/sec
>     generic_sse     : 23147 MB/sec
> xor: using function: avx (38616 MB/sec)
>
> * 5.10.9 (the original)
> xor: measuring software checksum speed
>     avx             : 23318 MB/sec
>     prefetch64-sse  : 22562 MB/sec
>     generic_sse     : 20431 MB/sec
> xor: using function: avx (23318 MB/sec)
>
> * 5.10.9 + "Reduce unnecessary FNINIT and MXCSR usage" v2
> xor: measuring software checksum speed
>     avx             : 26451 MB/sec
>     prefetch64-sse  : 25777 MB/sec
>     generic_sse     : 23178 MB/sec
> xor: using function: avx (26451 MB/sec)
>
> Overall, the kernel xor benchmark reports better performance on 5.10.9 than
> on 5.4.90 (see my previous e-mail), but the general trend is the same.
>
> The "unfixed" kernel is much faster for all of avx, prefetch64-sse and
> generic_sse. With the fix, we see the expected perf regression.
>
> Now, with your patchset, both prefetch64-sse and generic_sse are able to
> recover the full performance, as seen on 5.4. However, this is not the
> case for avx. While there is still an improvement, it is nowhere close
> to where it used to be.
>
> I wonder why AVX still sees a regression and if anything more can be
> done about it?
>
> Will do more tests tomorrow.

I'm moderately confident that the problem is that LDMXCSR is
considered a "legacy SSE" instruction and it's triggering the
VEX-to-SSE and SSE-to-VEX penalties.  perf could tell you for sure,
and testing with VLDMXCSR might be informative.

I'm not sure whether the best solution is to try to use VLDMXCSR, to
throw some VZEROUPPER instructions in, or to try to get rid of MXCSR
initialization entirely for integer code.  VZEROUPPER might be good
regardless, since for all we know, we're coming from user mode and
user mode could have been using SSE.  If we go the VZEROUPPER route, we
should probably arrange to issue it just once per user-FP-to-kernel-FP
transition.
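
To make the VZEROUPPER idea concrete, an untested sketch of how the MXCSR
init in kernel_fpu_begin_mask() might be adapted (assuming VZEROUPPER is
safe to execute at that point):

	/*
	 * Clear the upper YMM halves first so the legacy-SSE LDMXCSR does
	 * not take the VEX-to-SSE transition penalty.
	 */
	if (likely(kfpu_mask & KFPU_MXCSR) && boot_cpu_has(X86_FEATURE_XMM)) {
		if (boot_cpu_has(X86_FEATURE_AVX))
			asm volatile ("vzeroupper");
		ldmxcsr(MXCSR_DEFAULT);
	}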

--Andy

