* [patch 0/3] x86/fpu: Prevent FPU state corruption
@ 2022-05-01 19:31 Thomas Gleixner
  2022-05-01 19:31 ` [patch 1/3] " Thomas Gleixner
                   ` (4 more replies)
  0 siblings, 5 replies; 38+ messages in thread
From: Thomas Gleixner @ 2022-05-01 19:31 UTC (permalink / raw)
  To: LKML; +Cc: x86, Filipe Manana

The recent changes in the random code unearthed a long standing FPU state
corruption due to a buggy condition for granting in-kernel FPU usage.

The following series addresses this issue and makes the code more robust.

Thanks,

	tglx
---
 arch/um/include/asm/fpu/api.h       |    2 
 arch/x86/include/asm/fpu/api.h      |   21 +-------
 arch/x86/include/asm/simd.h         |    2 
 arch/x86/kernel/fpu/core.c          |   92 +++++++++++++++++++-----------------
 net/netfilter/nft_set_pipapo_avx2.c |    2 
 5 files changed, 57 insertions(+), 62 deletions(-)





* [patch 1/3] x86/fpu: Prevent FPU state corruption
  2022-05-01 19:31 [patch 0/3] x86/fpu: Prevent FPU state corruption Thomas Gleixner
@ 2022-05-01 19:31 ` Thomas Gleixner
  2022-05-02 13:16   ` Borislav Petkov
  2022-05-05  0:42   ` [tip: x86/urgent] " tip-bot2 for Thomas Gleixner
  2022-05-01 19:31 ` [patch 2/3] x86/fpu: Rename irq_fpu_usable() Thomas Gleixner
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 38+ messages in thread
From: Thomas Gleixner @ 2022-05-01 19:31 UTC (permalink / raw)
  To: LKML; +Cc: x86, Filipe Manana, stable

The FPU usage related to task FPU management is either protected by
disabling interrupts (switch_to, return to user) or via fpregs_lock() which
is a wrapper around local_bh_disable(). When kernel code wants to use the
FPU then it has to check whether it is possible by calling irq_fpu_usable().

But the condition in irq_fpu_usable() is wrong. It allows FPU to be used
when:

   !in_interrupt() || interrupted_user_mode() || interrupted_kernel_fpu_idle()

The latter is checking whether some other context already uses FPU in the
kernel, but if that's not the case then it allows FPU to be used
unconditionally even if the calling context interupted a fpregs_lock()
critical region. If that happens then the FPU state of the interrupted
context becomes corrupted.
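
A minimal sketch of the problematic interleaving (illustrative only, all
function names as in mainline):

   fpregs_lock();                     /* task context, local_bh_disable() */
   ...operate on the task's FPU registers/state...
      --> hard interrupt
          irq_fpu_usable();           /* true: no other in-kernel FPU user */
          kernel_fpu_begin();         /* clobbers the live FPU registers */
          ...SIMD work...
          kernel_fpu_end();
      <-- return from interrupt
   ...continues with corrupted FPU register state...
   fpregs_unlock();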

Allow in kernel FPU usage only when no other context has in kernel FPU
usage and either the calling context is not hard interrupt context or the
hard interrupt did not interrupt a local bottomhalf disabled region.

It's hard to find a proper Fixes tag as the condition was broken in one way
or the other for a very long time and the eager/lazy FPU changes caused a
lot of churn. Picked something remotely connected from the history.

This survived undetected for quite some time as FPU usage in interrupt
context is rare, but the recent changes to the random code unearthed it at
least on a kernel which had FPU debugging enabled. There is probably a
higher rate of silent corruption as not all issues can be detected by the
FPU debugging code. This will be addressed in a subsequent change.

Fixes: 5d2bd7009f30 ("x86, fpu: decouple non-lazy/eager fpu restore from xsave")
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
---
 arch/x86/kernel/fpu/core.c |   67 +++++++++++++++++----------------------------
 1 file changed, 26 insertions(+), 41 deletions(-)

--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -41,17 +41,7 @@ struct fpu_state_config fpu_user_cfg __r
  */
 struct fpstate init_fpstate __ro_after_init;
 
-/*
- * Track whether the kernel is using the FPU state
- * currently.
- *
- * This flag is used:
- *
- *   - by IRQ context code to potentially use the FPU
- *     if it's unused.
- *
- *   - to debug kernel_fpu_begin()/end() correctness
- */
+/* Track in-kernel FPU usage */
 static DEFINE_PER_CPU(bool, in_kernel_fpu);
 
 /*
@@ -59,42 +49,37 @@ static DEFINE_PER_CPU(bool, in_kernel_fp
  */
 DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
 
-static bool kernel_fpu_disabled(void)
-{
-	return this_cpu_read(in_kernel_fpu);
-}
-
-static bool interrupted_kernel_fpu_idle(void)
-{
-	return !kernel_fpu_disabled();
-}
-
-/*
- * Were we in user mode (or vm86 mode) when we were
- * interrupted?
- *
- * Doing kernel_fpu_begin/end() is ok if we are running
- * in an interrupt context from user mode - we'll just
- * save the FPU state as required.
- */
-static bool interrupted_user_mode(void)
-{
-	struct pt_regs *regs = get_irq_regs();
-	return regs && user_mode(regs);
-}
-
 /*
  * Can we use the FPU in kernel mode with the
  * whole "kernel_fpu_begin/end()" sequence?
- *
- * It's always ok in process context (ie "not interrupt")
- * but it is sometimes ok even from an irq.
  */
 bool irq_fpu_usable(void)
 {
-	return !in_interrupt() ||
-		interrupted_user_mode() ||
-		interrupted_kernel_fpu_idle();
+	if (WARN_ON_ONCE(in_nmi()))
+		return false;
+
+	/* In kernel FPU usage already active? */
+	if (this_cpu_read(in_kernel_fpu))
+		return false;
+
+	/*
+	 * When not in NMI or hard interrupt context, FPU can be used:
+	 *
+	 * - Task context is safe except from within fpregs_lock()'ed
+	 *   critical regions.
+	 *
+	 * - Soft interrupt processing context which cannot happen
+	 *   while in a fpregs_lock()'ed critical region.
+	 */
+	if (!in_hardirq())
+		return true;
+
+	/*
+	 * In hard interrupt context it's safe when soft interrupts
+	 * are enabled, which means the interrupt did not hit in
+	 * a fpregs_lock()'ed critical region.
+	 */
+	return !softirq_count();
 }
 EXPORT_SYMBOL(irq_fpu_usable);
 



* [patch 2/3] x86/fpu: Rename irq_fpu_usable()
  2022-05-01 19:31 [patch 0/3] x86/fpu: Prevent FPU state corruption Thomas Gleixner
  2022-05-01 19:31 ` [patch 1/3] " Thomas Gleixner
@ 2022-05-01 19:31 ` Thomas Gleixner
  2022-05-02 13:57   ` Borislav Petkov
  2022-05-01 19:31 ` [patch 3/3] x86/fpu: Make FPU protection more robust Thomas Gleixner
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2022-05-01 19:31 UTC (permalink / raw)
  To: LKML; +Cc: x86, Filipe Manana

irq_fpu_usable() can be invoked from almost any context except NMI. The
only requirement is that the calling context is not migratable.

Rename it to kernel_fpu_usable().
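
The calling convention does not change. A typical caller looks like this
(hand-written illustration; do_thing_simd()/do_thing_generic() are
placeholder names, not existing functions):

   if (!kernel_fpu_usable())
           return do_thing_generic(data, len);

   kernel_fpu_begin();
   do_thing_simd(data, len);          /* may use SSE/AVX registers */
   kernel_fpu_end();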

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/um/include/asm/fpu/api.h       |    2 +-
 arch/x86/include/asm/fpu/api.h      |    4 ++--
 arch/x86/include/asm/simd.h         |    2 +-
 arch/x86/kernel/fpu/core.c          |   13 +++++++------
 net/netfilter/nft_set_pipapo_avx2.c |    2 +-
 5 files changed, 12 insertions(+), 11 deletions(-)

--- a/arch/um/include/asm/fpu/api.h
+++ b/arch/um/include/asm/fpu/api.h
@@ -11,7 +11,7 @@
 #define kernel_fpu_begin() (void)0
 #define kernel_fpu_end() (void)0
 
-static inline bool irq_fpu_usable(void)
+static inline bool kernel_fpu_usable(void)
 {
 	return true;
 }
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -19,7 +19,7 @@
  * disables preemption so be careful if you intend to use it for long periods
  * of time.
  * If you intend to use the FPU in irq/softirq you need to check first with
- * irq_fpu_usable() if it is possible.
+ * kernel_fpu_usable() if it is possible.
  */
 
 /* Kernel FPU states to initialize in kernel_fpu_begin_mask() */
@@ -28,7 +28,7 @@
 
 extern void kernel_fpu_begin_mask(unsigned int kfpu_mask);
 extern void kernel_fpu_end(void);
-extern bool irq_fpu_usable(void);
+extern bool kernel_fpu_usable(void);
 extern void fpregs_mark_activate(void);
 
 /* Code that is unaware of kernel_fpu_begin_mask() can use this */
--- a/arch/x86/include/asm/simd.h
+++ b/arch/x86/include/asm/simd.h
@@ -8,5 +8,5 @@
  */
 static __must_check inline bool may_use_simd(void)
 {
-	return irq_fpu_usable();
+	return kernel_fpu_usable();
 }
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -49,11 +49,12 @@ static DEFINE_PER_CPU(bool, in_kernel_fp
  */
 DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
 
-/*
- * Can we use the FPU in kernel mode with the
- * whole "kernel_fpu_begin/end()" sequence?
+/**
+ * kernel_fpu_usable - Check whether kernel FPU usage is possible
+ *
+ * Has to be invoked before calling kernel_fpu_begin().
  */
-bool irq_fpu_usable(void)
+bool kernel_fpu_usable(void)
 {
 	if (WARN_ON_ONCE(in_nmi()))
 		return false;
@@ -81,7 +82,7 @@ bool irq_fpu_usable(void)
 	 */
 	return !softirq_count();
 }
-EXPORT_SYMBOL(irq_fpu_usable);
+EXPORT_SYMBOL(kernel_fpu_usable);
 
 /*
  * Track AVX512 state use because it is known to slow the max clock
@@ -409,7 +410,7 @@ void kernel_fpu_begin_mask(unsigned int
 {
 	preempt_disable();
 
-	WARN_ON_FPU(!irq_fpu_usable());
+	WARN_ON_FPU(!kernel_fpu_usable());
 	WARN_ON_FPU(this_cpu_read(in_kernel_fpu));
 
 	this_cpu_write(in_kernel_fpu, true);
--- a/net/netfilter/nft_set_pipapo_avx2.c
+++ b/net/netfilter/nft_set_pipapo_avx2.c
@@ -1128,7 +1128,7 @@ bool nft_pipapo_avx2_lookup(const struct
 	bool map_index;
 	int i, ret = 0;
 
-	if (unlikely(!irq_fpu_usable()))
+	if (unlikely(!kernel_fpu_usable()))
 		return nft_pipapo_lookup(net, set, key, ext);
 
 	m = rcu_dereference(priv->match);



* [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-01 19:31 [patch 0/3] x86/fpu: Prevent FPU state corruption Thomas Gleixner
  2022-05-01 19:31 ` [patch 1/3] " Thomas Gleixner
  2022-05-01 19:31 ` [patch 2/3] x86/fpu: Rename irq_fpu_usable() Thomas Gleixner
@ 2022-05-01 19:31 ` Thomas Gleixner
  2022-05-02 14:35   ` Borislav Petkov
  2022-05-03  9:03   ` Peter Zijlstra
  2022-05-02 10:02 ` [patch 0/3] x86/fpu: Prevent FPU state corruption Filipe Manana
  2022-05-04 15:40 ` Jason A. Donenfeld
  4 siblings, 2 replies; 38+ messages in thread
From: Thomas Gleixner @ 2022-05-01 19:31 UTC (permalink / raw)
  To: LKML; +Cc: x86, Filipe Manana

FPU state maintenance is protected by fpregs_lock(), which is a wrapper
around local_bh_disable() on non-RT kernels and preempt_disable() on RT
kernels. In-kernel FPU usage has its own protection via a per CPU
variable.

This separation is pointless and error-prone as a recently discovered wrong
condition for granting in-kernel FPU usage has shown.

Make the whole FPU state protection simpler and more robust by using the
per CPU usage variable for all FPU operations so state is tracked
consistently.

Change the related WARN_ON_FPU() instances to WARN_ON_ONCE() as WARN_ON_FPU()
depends on the optional CONFIG_X86_DEBUG_FPU and can therefore hide
inconsistencies for a potentially long time.
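
For reference, WARN_ON_FPU() is roughly the following (paraphrased, not a
verbatim quote of the FPU internals):

   #ifdef CONFIG_X86_DEBUG_FPU
   # define WARN_ON_FPU(x)  WARN_ON_ONCE(x)
   #else
   # define WARN_ON_FPU(x)  ({ (void)(x); 0; })
   #endif

i.e. without CONFIG_X86_DEBUG_FPU the check compiles away entirely, while
WARN_ON_ONCE() is always built in.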

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/fpu/api.h |   17 +-------
 arch/x86/kernel/fpu/core.c     |   78 ++++++++++++++++++++++++++---------------
 2 files changed, 52 insertions(+), 43 deletions(-)

--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -66,21 +66,8 @@ static inline void kernel_fpu_begin(void
  *
  * Disabling preemption also serializes against kernel_fpu_begin().
  */
-static inline void fpregs_lock(void)
-{
-	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
-		local_bh_disable();
-	else
-		preempt_disable();
-}
-
-static inline void fpregs_unlock(void)
-{
-	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
-		local_bh_enable();
-	else
-		preempt_enable();
-}
+extern void fpregs_lock(void);
+extern void fpregs_unlock(void);
 
 #ifdef CONFIG_X86_DEBUG_FPU
 extern void fpregs_assert_state_consistent(void);
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -42,7 +42,7 @@ struct fpu_state_config fpu_user_cfg __r
 struct fpstate init_fpstate __ro_after_init;
 
 /* Track in-kernel FPU usage */
-static DEFINE_PER_CPU(bool, in_kernel_fpu);
+static DEFINE_PER_CPU(bool, fpu_in_use);
 
 /*
  * Track which context is using the FPU on the CPU:
@@ -50,6 +50,50 @@ static DEFINE_PER_CPU(bool, in_kernel_fp
 DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
 
 /**
+ * fpregs_lock - Lock FPU state for maintenance operations
+ *
+ * This protects against preemption, soft interrupts and in-kernel FPU
+ * usage on both !RT and RT enabled kernels.
+ *
+ * !RT kernels use local_bh_disable() to prevent soft interrupt processing
+ * and preemption.
+ *
+ * On RT kernels local_bh_disable() is not sufficient because it only
+ * serializes soft interrupt related sections via a local lock, but stays
+ * preemptible. Disabling preemption is the right choice here as bottom
+ * half processing is always in thread context on RT kernels so it
+ * implicitly prevents bottom half processing as well.
+ */
+void fpregs_lock(void)
+{
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_bh_disable();
+	else
+		preempt_disable();
+
+	WARN_ON_ONCE(this_cpu_read(fpu_in_use));
+	this_cpu_write(fpu_in_use, true);
+}
+EXPORT_SYMBOL_GPL(fpregs_lock);
+
+/**
+ * fpregs_unlock - Unlock FPU state after maintenance operations
+ *
+ * Counterpart to fpregs_lock().
+ */
+void fpregs_unlock(void)
+{
+	WARN_ON_ONCE(!this_cpu_read(fpu_in_use));
+	this_cpu_write(fpu_in_use, false);
+
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_bh_enable();
+	else
+		preempt_enable();
+}
+EXPORT_SYMBOL_GPL(fpregs_unlock);
+
+/**
  * kernel_fpu_usable - Check whether kernel FPU usage is possible
  *
  * Has to be invoked before calling kernel_fpu_begin().
@@ -59,28 +103,7 @@ bool kernel_fpu_usable(void)
 	if (WARN_ON_ONCE(in_nmi()))
 		return false;
 
-	/* In kernel FPU usage already active? */
-	if (this_cpu_read(in_kernel_fpu))
-		return false;
-
-	/*
-	 * When not in NMI or hard interrupt context, FPU can be used:
-	 *
-	 * - Task context is safe except from within fpregs_lock()'ed
-	 *   critical regions.
-	 *
-	 * - Soft interrupt processing context which cannot happen
-	 *   while in a fpregs_lock()'ed critical region.
-	 */
-	if (!in_hardirq())
-		return true;
-
-	/*
-	 * In hard interrupt context it's safe when soft interrupts
-	 * are enabled, which means the interrupt did not hit in
-	 * a fpregs_lock()'ed critical region.
-	 */
-	return !softirq_count();
+	return !this_cpu_read(fpu_in_use);
 }
 EXPORT_SYMBOL(kernel_fpu_usable);
 
@@ -410,10 +433,9 @@ void kernel_fpu_begin_mask(unsigned int
 {
 	preempt_disable();
 
-	WARN_ON_FPU(!kernel_fpu_usable());
-	WARN_ON_FPU(this_cpu_read(in_kernel_fpu));
+	WARN_ON_ONCE(!kernel_fpu_usable());
 
-	this_cpu_write(in_kernel_fpu, true);
+	this_cpu_write(fpu_in_use, true);
 
 	if (!(current->flags & PF_KTHREAD) &&
 	    !test_thread_flag(TIF_NEED_FPU_LOAD)) {
@@ -433,9 +455,9 @@ EXPORT_SYMBOL_GPL(kernel_fpu_begin_mask)
 
 void kernel_fpu_end(void)
 {
-	WARN_ON_FPU(!this_cpu_read(in_kernel_fpu));
+	WARN_ON_ONCE(!this_cpu_read(fpu_in_use));
 
-	this_cpu_write(in_kernel_fpu, false);
+	this_cpu_write(fpu_in_use, false);
 	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(kernel_fpu_end);



* Re: [patch 0/3] x86/fpu: Prevent FPU state corruption
  2022-05-01 19:31 [patch 0/3] x86/fpu: Prevent FPU state corruption Thomas Gleixner
                   ` (2 preceding siblings ...)
  2022-05-01 19:31 ` [patch 3/3] x86/fpu: Make FPU protection more robust Thomas Gleixner
@ 2022-05-02 10:02 ` Filipe Manana
  2022-05-02 12:22   ` Borislav Petkov
  2022-05-04 15:40 ` Jason A. Donenfeld
  4 siblings, 1 reply; 38+ messages in thread
From: Filipe Manana @ 2022-05-02 10:02 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, x86, Filipe Manana

On Sun, May 01, 2022 at 09:31:42PM +0200, Thomas Gleixner wrote:
> The recent changes in the random code unearthed a long standing FPU state
> corruption due to a buggy condition for granting in-kernel FPU usage.

Thanks for fixing this. I had been reliably hitting that fpu state splat
since 5.18-rc1 with fstests, which was annoying because it made a few tests
fail (fstests reports a failure whenever there's a splat or warning in
dmesg).

I confirm that with this patchset applied the issue no longer happens and
everything seems to be working fine, so:

Tested-by: Filipe Manana <fdmanana@suse.com>

> 
> The following series addresses this issue and makes the code more robust.
> 
> Thanks,
> 
> 	tglx
> ---
>  arch/um/include/asm/fpu/api.h       |    2 
>  arch/x86/include/asm/fpu/api.h      |   21 +-------
>  arch/x86/include/asm/simd.h         |    2 
>  arch/x86/kernel/fpu/core.c          |   92 +++++++++++++++++++-----------------
>  net/netfilter/nft_set_pipapo_avx2.c |    2 
>  5 files changed, 57 insertions(+), 62 deletions(-)
> 
> 
> 


* Re: [patch 0/3] x86/fpu: Prevent FPU state corruption
  2022-05-02 10:02 ` [patch 0/3] x86/fpu: Prevent FPU state corruption Filipe Manana
@ 2022-05-02 12:22   ` Borislav Petkov
  0 siblings, 0 replies; 38+ messages in thread
From: Borislav Petkov @ 2022-05-02 12:22 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Thomas Gleixner, LKML, x86, Filipe Manana

On Mon, May 02, 2022 at 11:02:26AM +0100, Filipe Manana wrote:
> Thanks for fixing this. I had been reliably hitting that fpu state splat
> since 5.18-rc1 with fstests, which was annoying because it made a few tests
> fail (fstests reports a failure whenever there's a splat or warning in
> dmesg).
> 
> I confirm that with this patchset applied the issue no longer happens and
> everything seems to be working fine, so:
> 
> Tested-by: Filipe Manana <fdmanana@suse.com>

Thanks for taking the time to bisect it, trace it - which basically
showed where the problem is - and test it!

Much appreciated.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [patch 1/3] x86/fpu: Prevent FPU state corruption
  2022-05-01 19:31 ` [patch 1/3] " Thomas Gleixner
@ 2022-05-02 13:16   ` Borislav Petkov
  2022-05-05  0:42   ` [tip: x86/urgent] " tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 38+ messages in thread
From: Borislav Petkov @ 2022-05-02 13:16 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, x86, Filipe Manana, stable

On Sun, May 01, 2022 at 09:31:43PM +0200, Thomas Gleixner wrote:
> The latter is checking whether some other context already uses FPU in the
> kernel, but if that's not the case then it allows FPU to be used
> unconditionally even if the calling context interupted a fpregs_lock()

Unknown word [interupted] in commit message.
Suggestions: ['interrupted', ...

>  /*
>   * Can we use the FPU in kernel mode with the
>   * whole "kernel_fpu_begin/end()" sequence?

While at it, drop the "we": "Can the FPU be used in kernel mode... "

> - *
> - * It's always ok in process context (ie "not interrupt")
> - * but it is sometimes ok even from an irq.
>   */
>  bool irq_fpu_usable(void)
>  {
> -	return !in_interrupt() ||
> -		interrupted_user_mode() ||
> -		interrupted_kernel_fpu_idle();
> +	if (WARN_ON_ONCE(in_nmi()))
> +		return false;
> +
> +	/* In kernel FPU usage already active? */
> +	if (this_cpu_read(in_kernel_fpu))
> +		return false;
> +
> +	/*
> +	 * When not in NMI or hard interrupt context, FPU can be used:

"... can be used in:"

> +	 *
> +	 * - Task context is safe except from within fpregs_lock()'ed
> +	 *   critical regions.
> +	 *
> +	 * - Soft interrupt processing context which cannot happen
> +	 *   while in a fpregs_lock()'ed critical region.

But those are only nitpicks. With those fixed:

Reviewed-by: Borislav Petkov <bp@suse.de>

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [patch 2/3] x86/fpu: Rename irq_fpu_usable()
  2022-05-01 19:31 ` [patch 2/3] x86/fpu: Rename irq_fpu_usable() Thomas Gleixner
@ 2022-05-02 13:57   ` Borislav Petkov
  0 siblings, 0 replies; 38+ messages in thread
From: Borislav Petkov @ 2022-05-02 13:57 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, x86, Filipe Manana

On Sun, May 01, 2022 at 09:31:45PM +0200, Thomas Gleixner wrote:
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -49,11 +49,12 @@ static DEFINE_PER_CPU(bool, in_kernel_fp
>   */
>  DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
>  
> -/*
> - * Can we use the FPU in kernel mode with the
> - * whole "kernel_fpu_begin/end()" sequence?
> +/**
> + * kernel_fpu_usable - Check whether kernel FPU usage is possible
> + *
> + * Has to be invoked before calling kernel_fpu_begin().
>   */

Ah, there it is.

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-01 19:31 ` [patch 3/3] x86/fpu: Make FPU protection more robust Thomas Gleixner
@ 2022-05-02 14:35   ` Borislav Petkov
  2022-05-02 15:58     ` Thomas Gleixner
  2022-05-03  9:03   ` Peter Zijlstra
  1 sibling, 1 reply; 38+ messages in thread
From: Borislav Petkov @ 2022-05-02 14:35 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, x86, Filipe Manana

On Sun, May 01, 2022 at 09:31:47PM +0200, Thomas Gleixner wrote:
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -42,7 +42,7 @@ struct fpu_state_config fpu_user_cfg __r
>  struct fpstate init_fpstate __ro_after_init;
>  
>  /* Track in-kernel FPU usage */
> -static DEFINE_PER_CPU(bool, in_kernel_fpu);
> +static DEFINE_PER_CPU(bool, fpu_in_use);
>  
>  /*
>   * Track which context is using the FPU on the CPU:
> @@ -50,6 +50,50 @@ static DEFINE_PER_CPU(bool, in_kernel_fp
>  DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
>  
>  /**
> + * fpregs_lock - Lock FPU state for maintenance operations

"maintenance"?

> + *
> + * This protects against preemption, soft interrupts and in-kernel FPU
> + * usage on both !RT and RT enabled kernels.
> + *
> + * !RT kernels use local_bh_disable() to prevent soft interrupt processing
> + * and preemption.
> + *
> + * On RT kernels local_bh_disable() is not sufficient because it only
> + * serializes soft interrupt related sections via a local lock, but stays
> + * preemptible. Disabling preemption is the right choice here as bottom
> + * half processing is always in thread context on RT kernels so it
> + * implicitly prevents bottom half processing as well.
> + */
> +void fpregs_lock(void)
> +{
> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> +		local_bh_disable();
> +	else
> +		preempt_disable();

So I'm wondering: can we get rid of this distinction and simply do
preempt_disable()?

Or can FPU be used in softirq processing too so we want to block that
there?

But even if, fpu_in_use will already state that fact...

...

> @@ -410,10 +433,9 @@ void kernel_fpu_begin_mask(unsigned int
>  {
>  	preempt_disable();
>  
> -	WARN_ON_FPU(!kernel_fpu_usable());
> -	WARN_ON_FPU(this_cpu_read(in_kernel_fpu));
> +	WARN_ON_ONCE(!kernel_fpu_usable());
>  
> -	this_cpu_write(in_kernel_fpu, true);
> +	this_cpu_write(fpu_in_use, true);

This starts to look awfully similar to fpregs_lock()...

>  
>  	if (!(current->flags & PF_KTHREAD) &&
>  	    !test_thread_flag(TIF_NEED_FPU_LOAD)) {
> @@ -433,9 +455,9 @@ EXPORT_SYMBOL_GPL(kernel_fpu_begin_mask)
>  
>  void kernel_fpu_end(void)
>  {
> -	WARN_ON_FPU(!this_cpu_read(in_kernel_fpu));
> +	WARN_ON_ONCE(!this_cpu_read(fpu_in_use));
>  
> -	this_cpu_write(in_kernel_fpu, false);
> +	this_cpu_write(fpu_in_use, false);
>  	preempt_enable();

... and this to fpregs_unlock().

Can we use those here too instead of open-coding them mostly?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-02 14:35   ` Borislav Petkov
@ 2022-05-02 15:58     ` Thomas Gleixner
  2022-05-03  9:06       ` Peter Zijlstra
  0 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2022-05-02 15:58 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: LKML, x86, Filipe Manana

On Mon, May 02 2022 at 16:35, Borislav Petkov wrote:
> On Sun, May 01, 2022 at 09:31:47PM +0200, Thomas Gleixner wrote:
>>  /**
>> + * fpregs_lock - Lock FPU state for maintenance operations
>
> "maintenance"?

I meant maintenance of user thread FPU state. Let me rephrase. 

>> + *
>> + * This protects against preemption, soft interrupts and in-kernel FPU
>> + * usage on both !RT and RT enabled kernels.
>> + *
>> + * !RT kernels use local_bh_disable() to prevent soft interrupt processing
>> + * and preemption.
>> + *
>> + * On RT kernels local_bh_disable() is not sufficient because it only
>> + * serializes soft interrupt related sections via a local lock, but stays
>> + * preemptible. Disabling preemption is the right choice here as bottom
>> + * half processing is always in thread context on RT kernels so it
>> + * implicitly prevents bottom half processing as well.
>> + */
>> +void fpregs_lock(void)
>> +{
>> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
>> +		local_bh_disable();
>> +	else
>> +		preempt_disable();
>
> So I'm wondering: can we get rid of this distinction and simply do
> preempt_disable()?
> 
> Or can FPU be used in softirq processing too so we want to block that
> there?

Yes, FPU can be used legitimately in softirq processing context.

> But even if, fpu_in_use will already state that fact...

Right, though currently it's guaranteed that softirq processing context
can use the FPU. Quite some of the network crypto work runs in softirq
context, so this might cause a regression. If so, then this needs to be
an explicit commit on top which is easy to revert. Let me stare at it
some more.

>> @@ -410,10 +433,9 @@ void kernel_fpu_begin_mask(unsigned int
>>  {
>>  	preempt_disable();
>>  
>> -	WARN_ON_FPU(!kernel_fpu_usable());
>> -	WARN_ON_FPU(this_cpu_read(in_kernel_fpu));
>> +	WARN_ON_ONCE(!kernel_fpu_usable());
>>  
>> -	this_cpu_write(in_kernel_fpu, true);
>> +	this_cpu_write(fpu_in_use, true);
>
> This starts to look awfully similar to fpregs_lock()...

Similar, but not identical and we cannot use fpregs_lock() here as we
don't want to have local_bh_disable() when in hard interrupt context.

>>  	if (!(current->flags & PF_KTHREAD) &&
>>  	    !test_thread_flag(TIF_NEED_FPU_LOAD)) {
>> @@ -433,9 +455,9 @@ EXPORT_SYMBOL_GPL(kernel_fpu_begin_mask)
>>  
>>  void kernel_fpu_end(void)
>>  {
>> -	WARN_ON_FPU(!this_cpu_read(in_kernel_fpu));
>> +	WARN_ON_ONCE(!this_cpu_read(fpu_in_use));
>>  
>> -	this_cpu_write(in_kernel_fpu, false);
>> +	this_cpu_write(fpu_in_use, false);
>>  	preempt_enable();
>
> ... and this to fpregs_unlock().
>
> Can we use those here too instead of open-coding them mostly?

Not really, unless we drop the use FPU in softirq processing context
guarantee. See above.

Let me think about it.

Thanks,

        tglx


* Re: [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-01 19:31 ` [patch 3/3] x86/fpu: Make FPU protection more robust Thomas Gleixner
  2022-05-02 14:35   ` Borislav Petkov
@ 2022-05-03  9:03   ` Peter Zijlstra
  1 sibling, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2022-05-03  9:03 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, x86, Filipe Manana

On Sun, May 01, 2022 at 09:31:47PM +0200, Thomas Gleixner wrote:
> +void fpregs_lock(void)
> +{
> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> +		local_bh_disable();
> +	else
> +		preempt_disable();
> +
> +	WARN_ON_ONCE(this_cpu_read(fpu_in_use));
> +	this_cpu_write(fpu_in_use, true);

	barrier();
> +}
> +EXPORT_SYMBOL_GPL(fpregs_lock);

> +void fpregs_unlock(void)
> +{
	barrier();

> +	WARN_ON_ONCE(!this_cpu_read(fpu_in_use));
> +	this_cpu_write(fpu_in_use, false);
> +
> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> +		local_bh_enable();
> +	else
> +		preempt_enable();
> +}
> +EXPORT_SYMBOL_GPL(fpregs_unlock);

I think this isn't currently a problem because a function call is a C
sequence point, but 'funnily' C doesn't preserve sequence points when
inlining so LTO can actually break this without barrier() on.


* Re: [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-02 15:58     ` Thomas Gleixner
@ 2022-05-03  9:06       ` Peter Zijlstra
  2022-05-04 15:36         ` Thomas Gleixner
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2022-05-03  9:06 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Borislav Petkov, LKML, x86, Filipe Manana

On Mon, May 02, 2022 at 05:58:40PM +0200, Thomas Gleixner wrote:

> >> +void fpregs_lock(void)
> >> +{
> >> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> >> +		local_bh_disable();
> >> +	else
> >> +		preempt_disable();
> >
> > So I'm wondering: can we get rid of this distinction and simply do
> > preempt_disable()?
> > 
> > Or can FPU be used in softirq processing too so we want to block that
> > there?
> 
> Yes, FPU can be used legitimately in softirq processing context.
> 
> > But even if, fpu_in_use will already state that fact...
> 
> Right, though currently it's guaranteed that softirq processing context
> can use the FPU. Quite some of the network crypto work runs in softirq
> context, so this might cause a regression. If so, then this needs to be
> an explicit commit on top which is easy to revert. Let me stare at it
> some more.

Right, so with the:

	preempt_disable();
	this_cpu_write(fpu_in_use, true);
	barrier();

sequence it is safe against both softirq and hardirq fpu usage. The only
concern is performance not correctness when dropping that
local_bh_disable() thing.

So what Thomas proposes makes sense to me.


* Re: [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-03  9:06       ` Peter Zijlstra
@ 2022-05-04 15:36         ` Thomas Gleixner
  2022-05-04 15:55           ` Jason A. Donenfeld
  0 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2022-05-04 15:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, LKML, x86, Filipe Manana, Jason A. Donenfeld,
	linux-crypto

On Tue, May 03 2022 at 11:06, Peter Zijlstra wrote:
> On Mon, May 02, 2022 at 05:58:40PM +0200, Thomas Gleixner wrote:
>> Right, though currently it's guaranteed that softirq processing context
>> can use the FPU. Quite some of the network crypto work runs in softirq
>> context, so this might cause a regression. If so, then this needs to be
>> an explicit commit on top which is easy to revert. Let me stare at it
>> some more.
>
> Right, so with the:
>
> 	preempt_disable();
> 	this_cpu_write(fpu_in_use, true);
> 	barrier();
>
> sequence it is safe against both softirq and hardirq fpu usage. The only
> concern is performance not correctness when dropping that
> local_bh_disable() thing.
>
> So what Thomas proposes makes sense to me.

Now I was looking at it the other way round too, i.e. to use
local_bh_disable() for both fpregs_lock() and kernel_fpu_begin().

That is not possible with the current constraints, because
kernel_fpu_begin() can be called from hard interrupt context.

But the only use case which utilizes FPU from hard interrupt context is
the random generator via add_randomness_...().

I did a benchmark of these functions, which invoke blake2s_update()
three times in a row, on a SKL-X and a ZEN3. The generic code and the
FPU accelerated code are pretty much on par vs. execution time of the
algorithm itself plus/minus noise.
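
For reference, the shape of such a measurement (illustrative sketch only,
buffer size and exact counting are not from the actual harness):

   u8 buf[64] = {};                   /* input size is illustrative */
   struct blake2s_state state;
   u64 t0, t1;

   blake2s_init(&state, BLAKE2S_HASH_SIZE);
   t0 = rdtsc_ordered();
   blake2s_update(&state, buf, sizeof(buf));
   blake2s_update(&state, buf, sizeof(buf));
   blake2s_update(&state, buf, sizeof(buf));
   t1 = rdtsc_ordered();
   /* t1 - t0: cycles spent in the update path under test */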

But in case that the interrupt hits a userspace task the FPU needs to be
saved and if the interrupt does not result in rescheduling then the
return to user space has to restore it. That's _expensive_ and the
actual cost depends on the FPU state, but 200-300 cycles for save and
200-700 cycles for restore are to be expected.

Even if we ignore the save/restore part and assume that it averages out
vs. schedule() having to save FPU state anyway, then there is another
aspect to this: power consumption which affects also thermal budget and
capacity.

Though that made me more curious and I did the same comparison for crc32
which is heavily used by ext4. crc32c_pcl_intel_update() already
contains a switch to software when the buffer length is less than 512
bytes. But even on larger buffers, typically ~4k, FPU is not necessarily
a win. It's consistently slower by a factor of ~1.4x. And that's not due
to xsave/xrstor overhead because these computations run on a worker
thread which does not do that dance at all.
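
For reference, that gate has roughly the following shape (paraphrased from
arch/x86/crypto/crc32c-intel_glue.c, not a verbatim quote):

   if (len >= 512 && crypto_simd_usable()) {
           kernel_fpu_begin();
           crc = crc_pcl(data, len, crc);             /* PCLMULQDQ path */
           kernel_fpu_end();
   } else {
           crc = crc32c_intel_le_hw(crc, data, len);  /* CRC32 instruction */
   }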

IOW, using the FPU blindly for this kind of computations is not
necessarily a good plan. I have no idea how these things are analyzed
and evaluated if at all. Maybe the crypto people can shed some light on
this.

Thanks,

        tglx


* Re: [patch 0/3] x86/fpu: Prevent FPU state corruption
  2022-05-01 19:31 [patch 0/3] x86/fpu: Prevent FPU state corruption Thomas Gleixner
                   ` (3 preceding siblings ...)
  2022-05-02 10:02 ` [patch 0/3] x86/fpu: Prevent FPU state corruption Filipe Manana
@ 2022-05-04 15:40 ` Jason A. Donenfeld
  2022-05-04 18:05   ` Thomas Gleixner
  2022-05-18  1:02   ` Jason A. Donenfeld
  4 siblings, 2 replies; 38+ messages in thread
From: Jason A. Donenfeld @ 2022-05-04 15:40 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, x86, Filipe Manana

Hi Thomas,

On Sun, May 01, 2022 at 09:31:42PM +0200, Thomas Gleixner wrote:
> The recent changes in the random code unearthed a long standing FPU state
> corruption due to a buggy condition for granting in-kernel FPU usage.
 
Thanks for working that out. I've been banging my head over [1] for a
few days now trying to see if it's a mis-bisect or a real thing. I'll
ask Larry to retry with this patchset.

Jason

[1] https://lore.kernel.org/lkml/7f01221d-f693-adf8-f5a5-d71944b44162@lwfinger.net/


* Re: [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-04 15:36         ` Thomas Gleixner
@ 2022-05-04 15:55           ` Jason A. Donenfeld
  2022-05-04 16:45             ` Thomas Gleixner
  0 siblings, 1 reply; 38+ messages in thread
From: Jason A. Donenfeld @ 2022-05-04 15:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Borislav Petkov, LKML, x86, Filipe Manana, linux-crypto

Hi Thomas,

On Wed, May 04, 2022 at 05:36:38PM +0200, Thomas Gleixner wrote:
> But the only use case which utilizes FPU from hard interrupt context is
> the random generator via add_randomness_...().
> 
> I did a benchmark of these functions, which invoke blake2s_update()
> three times in a row, on a SKL-X and a ZEN3. The generic code and the
> FPU accelerated code are pretty much on par vs. execution time of the
> algorithm itself plus/minus noise.
>
> IOW, using the FPU blindly for this kind of computations is not
> necessarily a good plan. I have no idea how these things are analyzed
> and evaluated if at all. Maybe the crypto people can shed some light on
> this.

drivers/net/wireguard/{noise,cookie}.c makes pretty heavy use of BLAKE2s
in hot paths where the FPU is already being used for other algorithms,
and so there the save/restore is worth it (assuming restore finally
works lazily). In benchmarks, the SIMD code made a real difference.

But this presumably regards mix_pool_bytes() in the RNG. If it turns out
that supporting the FPU in hard IRQ context is a major PITA, and the RNG
is the only thing making use of it, then sure, drop hard IRQ context
support for it. However... This may be unearthing a larger bug.
Sebastian and I put in a decent amount of work during 5.18 to remove all
calls to mix_pool_bytes() (and hence to blake2s_compress()) from
add_interrupt_randomness(). Have a look:

https://git.kernel.org/pub/scm/linux/kernel/git/crng/random.git/tree/drivers/char/random.c#n1289

It now accumulates in some per-CPU buffer, and then every 64 interrupts
a worker runs that does the actual mix_pool_bytes() from kthread
context.
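
The shape of that logic, heavily paraphrased (the real code is at the link
above; this is not a verbatim quote):

   struct fast_pool {
           struct work_struct mix;
           unsigned long pool[4];
           unsigned int count;
   };
   static DEFINE_PER_CPU(struct fast_pool, irq_randomness);

   void add_interrupt_randomness(int irq)     /* hard irq path */
   {
           struct fast_pool *fast_pool = this_cpu_ptr(&irq_randomness);

           /* cheap siphash-style round; no blake2s, no FPU (simplified) */
           fast_mix(fast_pool->pool, random_get_entropy());

           if (++fast_pool->count < 64)
                   return;

           /* the expensive mix_pool_bytes() runs later from the worker */
           queue_work_on(raw_smp_processor_id(), system_highpri_wq,
                         &fast_pool->mix);
   }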

So the question is: what is still hitting mix_pool_bytes() from hard IRQ
context? I'll investigate a bit and see.

Jason


* Re: [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-04 15:55           ` Jason A. Donenfeld
@ 2022-05-04 16:45             ` Thomas Gleixner
  2022-05-04 19:05               ` Jason A. Donenfeld
  0 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2022-05-04 16:45 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Peter Zijlstra, Borislav Petkov, LKML, x86, Filipe Manana, linux-crypto

Jason,

On Wed, May 04 2022 at 17:55, Jason A. Donenfeld wrote:
> On Wed, May 04, 2022 at 05:36:38PM +0200, Thomas Gleixner wrote:
>> But the only use case which utilizes FPU from hard interrupt context is
>> the random generator via add_randomness_...().
>> 
>> I did a benchmark of these functions, which invoke blake2s_update()
>> three times in a row, on a SKL-X and a ZEN3. The generic code and the
>> FPU accelerated code are pretty much on par vs. execution time of the
>> algorithm itself plus/minus noise.
>>
>> IOW, using the FPU blindly for this kind of computations is not
>> necessarily a good plan. I have no idea how these things are analyzed
>> and evaluated if at all. Maybe the crypto people can shed some light on
>> this.
>
> drivers/net/wireguard/{noise,cookie}.c makes pretty heavy use of BLAKE2s
> in hot paths where the FPU is already being used for other algorithms,
> and so there the save/restore is worth it (assuming restore finally
> works lazily). In benchmarks, the SIMD code made a real difference.

I'm sure there are very valid use cases, but just the two things I
looked at turned out to be questionable at least.

> But this presumably regards mix_pool_bytes() in the RNG. If it turns out
> that supporting the FPU in hard IRQ context is a major PITA, and the
> RNG

Supporting FPU in hard interrupt context is possible if required and the
preexisting bug which survived 10+ years has been fixed.

I just started to look into this because of that bug and due to the
inconsistency between the FPU protections we have. The inconsistency
comes from the hardirq requirement.

> is the only thing making use of it, then sure, drop hard IRQ context
> support for it. However... This may be unearthing a larger bug.
> Sebastian and I put in a decent amount of work during 5.18 to remove all
> calls to mix_pool_bytes() (and hence to blake2s_compress()) from
> add_interrupt_randomness(). Have a look:

I know.

> It now accumulates in some per-CPU buffer, and then every 64 interrupts
> a worker runs that does the actual mix_pool_bytes() from kthread
> context.

That's add_interrupt_randomness() and not affected by this.

> So the question is: what is still hitting mix_pool_bytes() from hard IRQ
> context? I'll investigate a bit and see.

add_disk_randomness() on !RT kernels. That's what made me look into this
in the first place as it unearthed the long standing FPU protection
bug. See the first patch in this thread.

Possibly add_device_randomness() too, but I haven't seen evidence so far.

Thanks,

        tglx


* Re: [patch 0/3] x86/fpu: Prevent FPU state corruption
  2022-05-04 15:40 ` Jason A. Donenfeld
@ 2022-05-04 18:05   ` Thomas Gleixner
  2022-05-18  1:02   ` Jason A. Donenfeld
  1 sibling, 0 replies; 38+ messages in thread
From: Thomas Gleixner @ 2022-05-04 18:05 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: LKML, x86, Filipe Manana, Larry Finger

On Wed, May 04 2022 at 17:40, Jason A. Donenfeld wrote:
> On Sun, May 01, 2022 at 09:31:42PM +0200, Thomas Gleixner wrote:
>> The recent changes in the random code unearthed a long standing FPU state
>> corruption due to a buggy condition for granting in-kernel FPU usage.
>  
> Thanks for working that out. I've been banging my head over [1] for a
> few days now trying to see if it's a mis-bisect or a real thing. I'll
> ask Larry to retry with this patchset.

You could have just Cc'ed him. :) Did so now.

Larry, can you please test:

 https://lore.kernel.org/lkml/20220501193102.588689270@linutronix.de

Thanks,

        Thomas


* Re: [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-04 16:45             ` Thomas Gleixner
@ 2022-05-04 19:05               ` Jason A. Donenfeld
  2022-05-04 21:04                 ` Thomas Gleixner
  2022-05-06 22:15                 ` Jason A. Donenfeld
  0 siblings, 2 replies; 38+ messages in thread
From: Jason A. Donenfeld @ 2022-05-04 19:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Borislav Petkov, LKML, x86, Filipe Manana, linux-crypto

Hey Thomas,

On Wed, May 04, 2022 at 06:45:45PM +0200, Thomas Gleixner wrote:
> add_disk_randomness() on !RT kernels. That's what made me look into this
> in the first place as it unearthed the long standing FPU protection
> bug. See the first patch in this thread.
> 
> Possibly add_device_randomness() too, but I haven't seen evidence so far.

It looks like it's being hit from add_input_randomness() via
input_handle_event() too. There are two positions we could take toward
this:

One stance to take would be that if an event happens in an interrupt,
add_interrupt_randomness() should in theory already be siphashing in a
cycle counter so additional calls to add_{input,disk}_randomness() don't
contribute substantially (unless you assume the num field has some
entropic value). If that's compelling, then the next thing to do would
be adding a `if (in_interrupt()) return;` early on in some function, and
then we forever after impose a rule, "never mix into the input pool
directly from an irq".

The other stance is that these input/disk events are relatively rare --
compared to, say, a storm of interrupts from a NIC -- so mixing into the
input pool from there isn't actually a problem, and we benefit from the
quasi domain-specific accounting and the superior mixing function,
there, so keep it around. And the non-raw spinlock on the input pool
won't negatively affect RT from this context, because all its callers on
RT should be threaded.

The second stance seems easier and more conservative from a certain
perspective -- we don't need to change anything -- so I'm more inclined
toward it. And given that you've fixed the bug now, it sounds like
that's fine with you too. But if you're thinking about it differently in
fact, let me know.

Jason


* Re: [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-04 19:05               ` Jason A. Donenfeld
@ 2022-05-04 21:04                 ` Thomas Gleixner
  2022-05-04 23:52                   ` Jason A. Donenfeld
  2022-05-06 22:15                 ` Jason A. Donenfeld
  1 sibling, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2022-05-04 21:04 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Peter Zijlstra, Borislav Petkov, LKML, x86, Filipe Manana, linux-crypto

Jason,

On Wed, May 04 2022 at 21:05, Jason A. Donenfeld wrote:
> The other stance is that these input/disk events are relatively rare --
> compared to, say, a storm of interrupts from a NIC -- so mixing into the
> input pool from there isn't actually a problem, and we benefit from the
> quasi domain-specific accounting and the superior mixing function,
> there, so keep it around. And the non-raw spinlock on the input pool
> won't negatively affect RT from this context, because all its callers on
> RT should be threaded.

I'm not worried about RT here.

> The second stance seems easier and more conservative from a certain
> perspective -- we don't need to change anything -- so I'm more inclined
> toward it.

That's not conservative, that's lazy and lame. Staying with the status
quo and piling more stuff on top because we can is just increasing
technical debt. Works for a while by some definition of works.

> And given that you've fixed the bug now, it sounds like that's fine
> with you too. But if you're thinking about it differently in fact, let
> me know.

That still does not address my observation that using the FPU for this
mixing, which is handling a couple of bytes per invocation, is not
really beneficial.

Which in turn raises the question of why we have to maintain an asymmetric
FPU protection mechanism in order to support hard interrupt FPU usage
for no or questionable benefit.

The current implementation, courtesy of hard interrupt support, has the
following downside:

  Any FPU usage in task context where soft interrupts are enabled will
  prevent FPU usage in soft interrupt processing when the interrupt hits
  into the FPU usage region. That means the softirq processing has to
  fall back to the generic implementations.
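
  A sketch of that situation (illustrative only):

     kernel_fpu_begin();             /* task context, softirqs enabled */
        --> hard interrupt
        <-- irq exit runs pending softirq processing
            kernel_fpu_usable();     /* false: fpu_in_use is already set */
            ...softirq crypto falls back to the scalar implementation...
     kernel_fpu_end();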

Sure, the protection could be context dependent, but that's generally
frowned upon. If we go there, then there has to be a really convincing
technical argument.

Thanks,

        tglx


* Re: [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-04 21:04                 ` Thomas Gleixner
@ 2022-05-04 23:52                   ` Jason A. Donenfeld
  2022-05-05  0:55                     ` Thomas Gleixner
  0 siblings, 1 reply; 38+ messages in thread
From: Jason A. Donenfeld @ 2022-05-04 23:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Borislav Petkov, LKML, x86, Filipe Manana, linux-crypto

Hi Thomas,

On Wed, May 04, 2022 at 11:04:12PM +0200, Thomas Gleixner wrote:

> > The second stance seems easier and more conservative from a certain
> > perspective -- we don't need to change anything -- so I'm more inclined
> > toward it.
> 
> That's not conservative, that's lazy and lame. Staying with the status
> quo and piling more stuff on top because we can is just increasing
> technical debt. Works for a while by some definition of works.

I actually find this minorly upsetting :(. Considering the amount of
technical debt I've been tirelessly cleaning up over the last 5 months,
"lazy" certainly can't be correct here. None of this has anything to do
with laziness, but rather how the entropy collection logic works out.
It'd be lazy to just wave my hands around and say, "oh well it's handled
in add_interrupt_randomness() anyway, so let's torch the other thing,"
without having taken the time to do the analysis to see if that's
actually true or not. One way or another, it _will_ require real
analysis. Which obviously I'm volunteering to do here (it's an
interesting question to me) and have already started poking around with
it.

> > And given that you've fixed the bug now, it sounds like that's fine
> > with you too. But if you're thinking about it differently in fact, let
> > me know.
> 
> That still does not address my observation that using the FPU for this
> mixing, which is handling a couple of bytes per invocation, is not
> really beneficial.
> 
> Which in turn raises the question of why we have to maintain an asymmetric
> FPU protection mechanism in order to support hard interrupt FPU usage
> for no or questionable benefit.
> 
> The current implementation, courtesy of hard interrupt support, has the
> following downside:
> 
>   Any FPU usage in task context where soft interrupts are enabled will
>   prevent FPU usage in soft interrupt processing when the interrupt hits
>   into the FPU usage region. That means the softirq processing has to
>   fall back to the generic implementations.
> 
> Sure, the protection could be context dependent, but that's generally
> frowned upon. If we go there, then there has to be a really convincing
> technical argument.

That's curious about the softirq; I hadn't realized that. Indeed it
sounds like the technical burden of supporting it may not be worth it.
From the perspective of high-speed crypto, I know two areas pretty well:
pcrypt/padata and wireguard's queueing. Both of these run in workqueues
pinned to a CPU with queue_work_on(). In some cases, I believe locks may
be held during various crypto operations though. Also, I suspect some
paths of xfrm and mac80211 may process during softirq. Anyway, none of
those cases is hardirq. I haven't looked at dmcrypt, but I'd be
surprised if that was doing anything in hardirqs either.

So if truly the only user of this is random.c as of 5.18 (is it? I'm
assuming from a not very thorough survey...), and if the performance
boost doesn't even exist, then yeah, I think it'd make sense to just get
rid of it, and have kernel_fpu_usable() return false in those cases.
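
Roughly, that would be something like (sketch only, not a finished patch):

   bool kernel_fpu_usable(void)
   {
           if (WARN_ON_ONCE(in_nmi()) || in_hardirq())
                   return false;

           return !this_cpu_read(fpu_in_use);
   }

so that hardirq callers always take the generic fallback path.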

I'll run some benchmarks on a little bit more hardware in representative
cases and see.

Jason


* [tip: x86/urgent] x86/fpu: Prevent FPU state corruption
  2022-05-01 19:31 ` [patch 1/3] " Thomas Gleixner
  2022-05-02 13:16   ` Borislav Petkov
@ 2022-05-05  0:42   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 38+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-05-05  0:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Filipe Manana, Thomas Gleixner, Borislav Petkov, stable, x86,
	linux-kernel

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     59f5ede3bc0f00eb856425f636dab0c10feb06d8
Gitweb:        https://git.kernel.org/tip/59f5ede3bc0f00eb856425f636dab0c10feb06d8
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Sun, 01 May 2022 21:31:43 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 05 May 2022 02:40:19 +02:00

x86/fpu: Prevent FPU state corruption

The FPU usage related to task FPU management is either protected by
disabling interrupts (switch_to, return to user) or via fpregs_lock() which
is a wrapper around local_bh_disable(). When kernel code wants to use the
FPU then it has to check whether it is possible by calling irq_fpu_usable().

But the condition in irq_fpu_usable() is wrong. It allows FPU to be used
when:

   !in_interrupt() || interrupted_user_mode() || interrupted_kernel_fpu_idle()

The latter is checking whether some other context already uses FPU in the
kernel, but if that's not the case then it allows FPU to be used
unconditionally even if the calling context interrupted a fpregs_lock()
critical region. If that happens then the FPU state of the interrupted
context becomes corrupted.

Allow in kernel FPU usage only when no other context has in kernel FPU
usage and either the calling context is not hard interrupt context or the
hard interrupt did not interrupt a local bottomhalf disabled region.

It's hard to find a proper Fixes tag as the condition was broken in one way
or the other for a very long time and the eager/lazy FPU changes caused a
lot of churn. Picked something remotely connected from the history.

This survived undetected for quite some time as FPU usage in interrupt
context is rare, but the recent changes to the random code unearthed it at
least on a kernel which had FPU debugging enabled. There is probably a
higher rate of silent corruption as not all issues can be detected by the
FPU debugging code. This will be addressed in a subsequent change.

Fixes: 5d2bd7009f30 ("x86, fpu: decouple non-lazy/eager fpu restore from xsave")
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220501193102.588689270@linutronix.de

---
 arch/x86/kernel/fpu/core.c | 67 ++++++++++++++-----------------------
 1 file changed, 26 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index c049561..e28ab0e 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -41,17 +41,7 @@ struct fpu_state_config fpu_user_cfg __ro_after_init;
  */
 struct fpstate init_fpstate __ro_after_init;
 
-/*
- * Track whether the kernel is using the FPU state
- * currently.
- *
- * This flag is used:
- *
- *   - by IRQ context code to potentially use the FPU
- *     if it's unused.
- *
- *   - to debug kernel_fpu_begin()/end() correctness
- */
+/* Track in-kernel FPU usage */
 static DEFINE_PER_CPU(bool, in_kernel_fpu);
 
 /*
@@ -59,42 +49,37 @@ static DEFINE_PER_CPU(bool, in_kernel_fpu);
  */
 DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
 
-static bool kernel_fpu_disabled(void)
-{
-	return this_cpu_read(in_kernel_fpu);
-}
-
-static bool interrupted_kernel_fpu_idle(void)
-{
-	return !kernel_fpu_disabled();
-}
-
-/*
- * Were we in user mode (or vm86 mode) when we were
- * interrupted?
- *
- * Doing kernel_fpu_begin/end() is ok if we are running
- * in an interrupt context from user mode - we'll just
- * save the FPU state as required.
- */
-static bool interrupted_user_mode(void)
-{
-	struct pt_regs *regs = get_irq_regs();
-	return regs && user_mode(regs);
-}
-
 /*
  * Can we use the FPU in kernel mode with the
  * whole "kernel_fpu_begin/end()" sequence?
- *
- * It's always ok in process context (ie "not interrupt")
- * but it is sometimes ok even from an irq.
  */
 bool irq_fpu_usable(void)
 {
-	return !in_interrupt() ||
-		interrupted_user_mode() ||
-		interrupted_kernel_fpu_idle();
+	if (WARN_ON_ONCE(in_nmi()))
+		return false;
+
+	/* In kernel FPU usage already active? */
+	if (this_cpu_read(in_kernel_fpu))
+		return false;
+
+	/*
+	 * When not in NMI or hard interrupt context, FPU can be used in:
+	 *
+	 * - Task context except from within fpregs_lock()'ed critical
+	 *   regions.
+	 *
+	 * - Soft interrupt processing context which cannot happen
+	 *   while in a fpregs_lock()'ed critical region.
+	 */
+	if (!in_hardirq())
+		return true;
+
+	/*
+	 * In hard interrupt context it's safe when soft interrupts
+	 * are enabled, which means the interrupt did not hit in
+	 * a fpregs_lock()'ed critical region.
+	 */
+	return !softirq_count();
 }
 EXPORT_SYMBOL(irq_fpu_usable);
 


* Re: [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-04 23:52                   ` Jason A. Donenfeld
@ 2022-05-05  0:55                     ` Thomas Gleixner
  2022-05-05  1:11                       ` Jason A. Donenfeld
  0 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2022-05-05  0:55 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Peter Zijlstra, Borislav Petkov, LKML, x86, Filipe Manana, linux-crypto

Jason,

On Thu, May 05 2022 at 01:52, Jason A. Donenfeld wrote:
> On Wed, May 04, 2022 at 11:04:12PM +0200, Thomas Gleixner wrote:
>
>> > The second stance seems easier and more conservative from a certain
>> > perspective -- we don't need to change anything -- so I'm more inclined
>> > toward it.
>> 
>> That's not conservative, that's lazy and lame. Staying with the status
>> quo and piling more stuff on top because we can is just increasing
>> technical debt. Works for a while by some definition of works.
>
> I actually find this minorly upsetting :(.

I'm glad it's only minorly. This was not meant to be offensive, but I have
heard the "let's be more conservative -- we don't need to change
anything -- " song so many times that it became one of those buttons...

> Considering the amount of technical debt I've been tirelessly cleaning
> up over the last 5 months, "lazy" certainly can't be correct here.

I'm aware of this and I'm thankful that you are tackling these things.

> So if truly the only user of this is random.c as of 5.18 (is it? I'm
> assuming from a not very thorough survey...), and if the performance
> boost doesn't even exist, then yeah, I think it'd make sense to just get
> rid of it, and have kernel_fpu_usable() return false in those cases.
>
> I'll run some benchmarks on a little bit more hardware in representative
> cases and see.

Find below a combo patch which makes use of strict softirq serialization
at the price of not supporting hardirq FPU usage.

Thanks,

        tglx
---
diff --git a/arch/um/include/asm/fpu/api.h b/arch/um/include/asm/fpu/api.h
index 71bfd9ef3938..e3deb051b596 100644
--- a/arch/um/include/asm/fpu/api.h
+++ b/arch/um/include/asm/fpu/api.h
@@ -11,7 +11,7 @@
 #define kernel_fpu_begin() (void)0
 #define kernel_fpu_end() (void)0
 
-static inline bool irq_fpu_usable(void)
+static inline bool kernel_fpu_usable(void)
 {
 	return true;
 }
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index c83b3020350a..21b7ef7e0cfb 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -19,7 +19,7 @@
  * disables preemption so be careful if you intend to use it for long periods
  * of time.
  * If you intend to use the FPU in irq/softirq you need to check first with
- * irq_fpu_usable() if it is possible.
+ * kernel_fpu_usable() if it is possible.
  */
 
 /* Kernel FPU states to initialize in kernel_fpu_begin_mask() */
@@ -28,7 +28,7 @@
 
 extern void kernel_fpu_begin_mask(unsigned int kfpu_mask);
 extern void kernel_fpu_end(void);
-extern bool irq_fpu_usable(void);
+extern bool kernel_fpu_usable(void);
 extern void fpregs_mark_activate(void);
 
 /* Code that is unaware of kernel_fpu_begin_mask() can use this */
@@ -66,21 +66,8 @@ static inline void kernel_fpu_begin(void)
  *
  * Disabling preemption also serializes against kernel_fpu_begin().
  */
-static inline void fpregs_lock(void)
-{
-	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
-		local_bh_disable();
-	else
-		preempt_disable();
-}
-
-static inline void fpregs_unlock(void)
-{
-	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
-		local_bh_enable();
-	else
-		preempt_enable();
-}
+extern void fpregs_lock(void);
+extern void fpregs_unlock(void);
 
 #ifdef CONFIG_X86_DEBUG_FPU
 extern void fpregs_assert_state_consistent(void);
diff --git a/arch/x86/include/asm/simd.h b/arch/x86/include/asm/simd.h
index a341c878e977..bdc0629bd987 100644
--- a/arch/x86/include/asm/simd.h
+++ b/arch/x86/include/asm/simd.h
@@ -8,5 +8,5 @@
  */
 static __must_check inline bool may_use_simd(void)
 {
-	return irq_fpu_usable();
+	return kernel_fpu_usable();
 }
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index c049561f373a..7361583a5cfb 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -41,62 +41,127 @@ struct fpu_state_config fpu_user_cfg __ro_after_init;
  */
 struct fpstate init_fpstate __ro_after_init;
 
+/* Track in-kernel FPU usage */
+static DEFINE_PER_CPU(bool, fpu_in_use);
+
 /*
- * Track whether the kernel is using the FPU state
- * currently.
- *
- * This flag is used:
+ * This protects against preemption, soft interrupts and in-kernel FPU
+ * usage on both !RT and RT enabled kernels.
  *
- *   - by IRQ context code to potentially use the FPU
- *     if it's unused.
+ * !RT kernels use local_bh_disable() to prevent soft interrupt processing
+ * and preemption.
  *
- *   - to debug kernel_fpu_begin()/end() correctness
+ * On RT kernels local_bh_disable() is not sufficient because it only
+ * serializes soft interrupt related sections via a local lock, but stays
+ * preemptible. Disabling preemption is the right choice here as bottom
+ * half processing is always in thread context on RT kernels so disabling
+ * preemption implicitly prevents bottom half processing.
  */
-static DEFINE_PER_CPU(bool, in_kernel_fpu);
+static inline void fpu_in_use_begin(void)
+{
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_bh_disable();
+	else
+		preempt_disable();
+
+	WARN_ON_ONCE(__this_cpu_read(fpu_in_use));
+	__this_cpu_write(fpu_in_use, true);
+}
+
+static inline void fpu_in_use_end(void)
+{
+	WARN_ON_ONCE(!__this_cpu_read(fpu_in_use));
+	__this_cpu_write(fpu_in_use, false);
+
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_bh_enable();
+	else
+		preempt_enable();
+}
 
 /*
  * Track which context is using the FPU on the CPU:
  */
 DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
 
-static bool kernel_fpu_disabled(void)
+/**
+ * fpregs_lock - Lock FPU state for current's FPU state maintenance operations
+ *
+ * For FPU internal usage to initialize, save, restore FPU state of
+ * the current task. Not for in-kernel FPU usage.
+ */
+void fpregs_lock(void)
 {
-	return this_cpu_read(in_kernel_fpu);
+	fpu_in_use_begin();
 }
+EXPORT_SYMBOL_GPL(fpregs_lock);
 
-static bool interrupted_kernel_fpu_idle(void)
+/**
+ * fpregs_unlock - Unlock FPU state after maintenance operations
+ *
+ * Counterpart to fpregs_lock().
+ */
+void fpregs_unlock(void)
 {
-	return !kernel_fpu_disabled();
+	fpu_in_use_end();
 }
+EXPORT_SYMBOL_GPL(fpregs_unlock);
 
-/*
- * Were we in user mode (or vm86 mode) when we were
- * interrupted?
+/**
+ * kernel_fpu_usable - Check whether kernel FPU usage is possible
  *
- * Doing kernel_fpu_begin/end() is ok if we are running
- * in an interrupt context from user mode - we'll just
- * save the FPU state as required.
+ * Must be checked before calling kernel_fpu_begin().
  */
-static bool interrupted_user_mode(void)
+bool kernel_fpu_usable(void)
 {
-	struct pt_regs *regs = get_irq_regs();
-	return regs && user_mode(regs);
+	if (WARN_ON_ONCE(in_nmi()))
+		return false;
+
+	return !in_hardirq() && !this_cpu_read(fpu_in_use);
 }
+EXPORT_SYMBOL(kernel_fpu_usable);
 
-/*
- * Can we use the FPU in kernel mode with the
- * whole "kernel_fpu_begin/end()" sequence?
+/**
+ * kernel_fpu_begin_mask - Start an in-kernel FPU usage section
+ * @kfpu_mask:		Bitmask to indicate initialization requirements
  *
- * It's always ok in process context (ie "not interrupt")
- * but it is sometimes ok even from an irq.
+ * Has to be invoked before in-kernel FPU usage. This saves the current
+ * task's FPU register state if necessary and initializes MXCSR and/or the
+ * legacy FPU depending on @kfpu_mask.
+ *
+ * The function returns with softirqs disabled, so the call site has to
+ * ensure that the FPU usage is limited in terms of runtime.
  */
-bool irq_fpu_usable(void)
+void kernel_fpu_begin_mask(unsigned int kfpu_mask)
 {
-	return !in_interrupt() ||
-		interrupted_user_mode() ||
-		interrupted_kernel_fpu_idle();
+	fpu_in_use_begin();
+
+	if (!(current->flags & PF_KTHREAD) &&
+	    !test_thread_flag(TIF_NEED_FPU_LOAD)) {
+		set_thread_flag(TIF_NEED_FPU_LOAD);
+		save_fpregs_to_fpstate(&current->thread.fpu);
+	}
+	__cpu_invalidate_fpregs_state();
+
+	/* Put sane initial values into the control registers. */
+	if (likely(kfpu_mask & KFPU_MXCSR) && boot_cpu_has(X86_FEATURE_XMM))
+		ldmxcsr(MXCSR_DEFAULT);
+
+	if (unlikely(kfpu_mask & KFPU_387) && boot_cpu_has(X86_FEATURE_FPU))
+		asm volatile ("fninit");
 }
-EXPORT_SYMBOL(irq_fpu_usable);
+EXPORT_SYMBOL_GPL(kernel_fpu_begin_mask);
+
+/**
+ * kernel_fpu_end - End an in-kernel FPU usage section
+ *
+ * Counterpart to kernel_fpu_begin_mask().
+ */
+void kernel_fpu_end(void)
+{
+	fpu_in_use_end();
+}
+EXPORT_SYMBOL_GPL(kernel_fpu_end);
 
 /*
  * Track AVX512 state use because it is known to slow the max clock
@@ -420,40 +485,6 @@ int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf,
 EXPORT_SYMBOL_GPL(fpu_copy_uabi_to_guest_fpstate);
 #endif /* CONFIG_KVM */
 
-void kernel_fpu_begin_mask(unsigned int kfpu_mask)
-{
-	preempt_disable();
-
-	WARN_ON_FPU(!irq_fpu_usable());
-	WARN_ON_FPU(this_cpu_read(in_kernel_fpu));
-
-	this_cpu_write(in_kernel_fpu, true);
-
-	if (!(current->flags & PF_KTHREAD) &&
-	    !test_thread_flag(TIF_NEED_FPU_LOAD)) {
-		set_thread_flag(TIF_NEED_FPU_LOAD);
-		save_fpregs_to_fpstate(&current->thread.fpu);
-	}
-	__cpu_invalidate_fpregs_state();
-
-	/* Put sane initial values into the control registers. */
-	if (likely(kfpu_mask & KFPU_MXCSR) && boot_cpu_has(X86_FEATURE_XMM))
-		ldmxcsr(MXCSR_DEFAULT);
-
-	if (unlikely(kfpu_mask & KFPU_387) && boot_cpu_has(X86_FEATURE_FPU))
-		asm volatile ("fninit");
-}
-EXPORT_SYMBOL_GPL(kernel_fpu_begin_mask);
-
-void kernel_fpu_end(void)
-{
-	WARN_ON_FPU(!this_cpu_read(in_kernel_fpu));
-
-	this_cpu_write(in_kernel_fpu, false);
-	preempt_enable();
-}
-EXPORT_SYMBOL_GPL(kernel_fpu_end);
-
 /*
  * Sync the FPU register state to current's memory register state when the
  * current task owns the FPU. The hardware register state is preserved.
diff --git a/net/netfilter/nft_set_pipapo_avx2.c b/net/netfilter/nft_set_pipapo_avx2.c
index 52e0d026d30a..dde1d2260f51 100644
--- a/net/netfilter/nft_set_pipapo_avx2.c
+++ b/net/netfilter/nft_set_pipapo_avx2.c
@@ -1128,7 +1128,7 @@ bool nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set,
 	bool map_index;
 	int i, ret = 0;
 
-	if (unlikely(!irq_fpu_usable()))
+	if (unlikely(!kernel_fpu_usable()))
 		return nft_pipapo_lookup(net, set, key, ext);
 
 	m = rcu_dereference(priv->match);

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-05  0:55                     ` Thomas Gleixner
@ 2022-05-05  1:11                       ` Jason A. Donenfeld
  2022-05-05  1:21                         ` Thomas Gleixner
  0 siblings, 1 reply; 38+ messages in thread
From: Jason A. Donenfeld @ 2022-05-05  1:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Borislav Petkov, LKML, x86, Filipe Manana, linux-crypto

Hi Thomas,

On Thu, May 05, 2022 at 02:55:58AM +0200, Thomas Gleixner wrote:
> > So if truly the only user of this is random.c as of 5.18 (is it? I'm
> > assuming from a not very thorough survey...), and if the performance
> > boost doesn't even exist, then yeah, I think it'd make sense to just get
> > rid of it, and have kernel_fpu_usable() return false in those cases.
> >
> > I'll run some benchmarks on a little bit more hardware in representative
> > cases and see.
> 
> Find below a combo patch which makes use of strict softirq serialization
> for the price of not supporting the hardirq FPU usage. 

Thanks, I'll give it a shot in the morning (3am) when trying to do a
more realistic benchmark. But just as a synthetic thing, I ran the
numbers in kBench900 and am getting:

     generic:    430 cycles per call
       ssse3:    315 cycles per call
      avx512:    277 cycles per call

for a single call to the compression function, which is the most any of
those mix_pool_bytes() calls do from add_{input,disk}_randomness(), on
Tiger Lake, using RDPMC from kernel space.

This _doesn't_ take into account the price of calling kernel_fpu_begin().
That's a little hard to bench synthetically by running it in a loop and
taking medians because of the lazy restoration. But that's an indication
anyway that I should be looking at the cost of the actual function as
it's running in random.c, rather than the synthetic test. Will keep this
thread updated.

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-05  1:11                       ` Jason A. Donenfeld
@ 2022-05-05  1:21                         ` Thomas Gleixner
  2022-05-05 11:02                           ` Jason A. Donenfeld
  0 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2022-05-05  1:21 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Peter Zijlstra, Borislav Petkov, LKML, x86, Filipe Manana, linux-crypto

Jason,

On Thu, May 05 2022 at 03:11, Jason A. Donenfeld wrote:
> On Thu, May 05, 2022 at 02:55:58AM +0200, Thomas Gleixner wrote:
>> > So if truly the only user of this is random.c as of 5.18 (is it? I'm
>> > assuming from a not very thorough survey...), and if the performance
>> > boost doesn't even exist, then yeah, I think it'd make sense to just get
>> > rid of it, and have kernel_fpu_usable() return false in those cases.
>> >
>> > I'll run some benchmarks on a little bit more hardware in representative
>> > cases and see.
>> 
>> Find below a combo patch which makes use of strict softirq serialization
>> for the price of not supporting the hardirq FPU usage. 
>
> Thanks, I'll give it a shot in the morning (3am) when trying to do a
> more realistic benchmark. But just as a synthetic thing, I ran the
> numbers in kBench900 and am getting:
>
>      generic:    430 cycles per call
>        ssse3:    315 cycles per call
>       avx512:    277 cycles per call
>
> for a single call to the compression function, which is the most any of
> those mix_pool_bytes() calls do from add_{input,disk}_randomness(), on
> Tiger Lake, using RDPMC from kernel space.

I'm well aware of the difference between synthetic benchmarks and real
world scenarios and with the more in depth instrumentation of these
things I'm even more concerned that the difference is underestimated.

> This _doesn't_ take into account the price of calling kernel_fpu_begin().
> That's a little hard to bench synthetically by running it in a loop and
> taking medians because of the lazy restoration. But that's an indication
> anyway that I should be looking at the cost of the actual function as
> its running in random.c, rather than the synthetic test. Will keep this
> thread updated.

Appreciated.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-05  1:21                         ` Thomas Gleixner
@ 2022-05-05 11:02                           ` Jason A. Donenfeld
  2022-05-05 11:34                             ` David Laight
  2022-05-05 13:48                             ` Jason A. Donenfeld
  0 siblings, 2 replies; 38+ messages in thread
From: Jason A. Donenfeld @ 2022-05-05 11:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Borislav Petkov, LKML, x86, Filipe Manana, linux-crypto

Hey again,

On Thu, May 05, 2022 at 03:21:43AM +0200, Thomas Gleixner wrote:
> > Thanks, I'll give it a shot in the morning (3am) when trying to do a
> > more realistic benchmark. But just as a synthetic thing, I ran the
> > numbers in kBench900 and am getting:
> >
> >      generic:    430 cycles per call
> >        ssse3:    315 cycles per call
> >       avx512:    277 cycles per call
> >
> > for a single call to the compression function, which is the most any of
> > those mix_pool_bytes() calls do from add_{input,disk}_randomness(), on
> > Tiger Lake, using RDPMC from kernel space.
> 
> I'm well aware of the difference between synthetic benchmarks and real
> world scenarios and with the more in depth instrumentation of these
> things I'm even more concerned that the difference is underestimated.
> 
> > This _doesn't_ take into account the price of calling kernel_fpu_begin().
> > That's a little hard to bench synthetically by running it in a loop and
> > taking medians because of the lazy restoration. But that's an indication
> > anyway that I should be looking at the cost of the actual function as
> > its running in random.c, rather than the synthetic test. Will keep this
> > thread updated.
> 
> Appreciated.

Interestingly, disabling the simd paths makes things around 283 cycles
slower on my Tiger Lake laptop, just doing ordinary things. I'm actually
slightly surprised, so I'll probably keep playing with this. My patch
for this is attached. Let me know if you have a different methodology in
mind...

Jason

diff --git a/drivers/char/random.c b/drivers/char/random.c
index bd292927654c..7a70a186c26b 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -53,6 +53,7 @@
 #include <linux/uuid.h>
 #include <linux/uaccess.h>
 #include <linux/suspend.h>
+#include <linux/sort.h>
 #include <crypto/chacha.h>
 #include <crypto/blake2s.h>
 #include <asm/processor.h>
@@ -755,9 +756,48 @@ static struct {
 	.lock = __SPIN_LOCK_UNLOCKED(input_pool.lock),
 };

+struct {
+	u32 durations[1 << 20];
+	u32 pos, len;
+} irqbench;
+
 static void _mix_pool_bytes(const void *in, size_t nbytes)
 {
+	u32 ctr = input_pool.hash.t[0];
+	cycles_t end, start = get_cycles();
 	blake2s_update(&input_pool.hash, in, nbytes);
+	end = get_cycles();
+
+	if (ctr == input_pool.hash.t[0] || !in_hardirq())
+		return;
+
+	irqbench.durations[irqbench.pos++ % ARRAY_SIZE(irqbench.durations)] = end - start;
+	irqbench.len = min_t(u32, irqbench.len + 1, ARRAY_SIZE(irqbench.durations));
+}
+
+static int cmp_u32(const void *a, const void *b)
+{
+	return *(const u32 *)a - *(const u32 *)b;
+}
+
+static int proc_do_irqbench_median(struct ctl_table *table, int write, void *buffer,
+				   size_t *lenp, loff_t *ppos)
+{
+	u32 len = READ_ONCE(irqbench.len), median, *sorted;
+	struct ctl_table fake_table = {
+		.data = &median,
+		.maxlen = sizeof(median)
+	};
+	if (!len)
+		return -ENODATA;
+	sorted = kmalloc_array(len, sizeof(*sorted), GFP_KERNEL);
+	if (!sorted)
+		return -ENOMEM;
+	memcpy(sorted, irqbench.durations, len * sizeof(*sorted));
+	sort(sorted, len, sizeof(*sorted), cmp_u32, NULL);
+	median = sorted[len / 2];
+	kfree(sorted);
+	return write ? 0 : proc_douintvec(&fake_table, 0, buffer, lenp, ppos);
 }

 /*
@@ -1709,6 +1749,18 @@ static struct ctl_table random_table[] = {
 		.mode		= 0444,
 		.proc_handler	= proc_do_uuid,
 	},
+	{
+		.procname	= "irqbench_median",
+		.mode		= 0444,
+		.proc_handler	= proc_do_irqbench_median,
+	},
+	{
+		.procname	= "irqbench_count",
+		.data		= &irqbench.len,
+		.maxlen		= sizeof(irqbench.len),
+		.mode		= 0444,
+		.proc_handler	= proc_douintvec,
+	},
 	{ }
 };



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* RE: [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-05 11:02                           ` Jason A. Donenfeld
@ 2022-05-05 11:34                             ` David Laight
  2022-05-05 11:35                               ` Jason A. Donenfeld
  2022-05-06 22:34                               ` Jason A. Donenfeld
  2022-05-05 13:48                             ` Jason A. Donenfeld
  1 sibling, 2 replies; 38+ messages in thread
From: David Laight @ 2022-05-05 11:34 UTC (permalink / raw)
  To: 'Jason A. Donenfeld', Thomas Gleixner
  Cc: Peter Zijlstra, Borislav Petkov, LKML, x86, Filipe Manana, linux-crypto

...
> +	cycles_t end, start = get_cycles();
>  	blake2s_update(&input_pool.hash, in, nbytes);
> +	end = get_cycles();

If get_cycles() is rdtsc then that gives meaningless numbers.
The cpu clock frequency will change on you.

You can use one of the performance counters to get an actual
cycle count - although that is only stable for 'hot cache'
as any memory accesses are clock speed dependent.

OTOH the entropy mixing is very likely to be 'cold cache'
and all the unrolling in blakes7 will completely kill
performance.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-05 11:34                             ` David Laight
@ 2022-05-05 11:35                               ` Jason A. Donenfeld
  2022-05-05 11:53                                 ` David Laight
  2022-05-06 22:34                               ` Jason A. Donenfeld
  1 sibling, 1 reply; 38+ messages in thread
From: Jason A. Donenfeld @ 2022-05-05 11:35 UTC (permalink / raw)
  To: David Laight
  Cc: Thomas Gleixner, Peter Zijlstra, Borislav Petkov, LKML, x86,
	Filipe Manana, linux-crypto

On Thu, May 5, 2022 at 1:34 PM David Laight <David.Laight@aculab.com> wrote:
>
> ...
> > +     cycles_t end, start = get_cycles();
> >       blake2s_update(&input_pool.hash, in, nbytes);
> > +     end = get_cycles();
>
> If get_cycles() is rdtsc then that gives meaningless numbers.
> The cpu clock frequency will change on you.
>
> You can use one of the performance counters to get an actual

Indeed. In the process of wiring up rdpmc now.

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-05 11:35                               ` Jason A. Donenfeld
@ 2022-05-05 11:53                                 ` David Laight
  0 siblings, 0 replies; 38+ messages in thread
From: David Laight @ 2022-05-05 11:53 UTC (permalink / raw)
  To: 'Jason A. Donenfeld'
  Cc: Thomas Gleixner, Peter Zijlstra, Borislav Petkov, LKML, x86,
	Filipe Manana, linux-crypto



> -----Original Message-----
> From: Jason A. Donenfeld <Jason@zx2c4.com>
> Sent: 05 May 2022 12:36
> To: David Laight <David.Laight@ACULAB.COM>
> Cc: Thomas Gleixner <tglx@linutronix.de>; Peter Zijlstra <peterz@infradead.org>; Borislav Petkov
> <bp@alien8.de>; LKML <linux-kernel@vger.kernel.org>; x86@kernel.org; Filipe Manana
> <fdmanana@suse.com>; linux-crypto@vger.kernel.org
> Subject: Re: [patch 3/3] x86/fpu: Make FPU protection more robust
> 
> On Thu, May 5, 2022 at 1:34 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > ...
> > > +     cycles_t end, start = get_cycles();
> > >       blake2s_update(&input_pool.hash, in, nbytes);
> > > +     end = get_cycles();
> >
> > If get_cycles() is rdtsc then that gives meaningless numbers.
> > The cpu clock frequency will change on you.
> >
> > You can use one of the performance counters to get an actual
> 
> Indeed. In the process of wiring up rdpmc now.

I've used this before now.
But the loop getting the pmc value is non-deterministic.
So I sometimes remove it.
Also you need to add a serialising instruction - otherwise
the pmc gets read before the code you are measuring
actually finishes.

I've used similar code to measure iterations on the ip checksum code.
Can show how many bytes/clock that achieves in its inner loop.
Which can match what you might expect the instructions to generate.

	David

static inline unsigned int rdpmc(unsigned int counter)
{
        unsigned int low, high;

        asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));

        // return low bits, counter might be 32 or 40 bits wide.
        return low;
}

static inline unsigned int rdtsc(void)
{
        unsigned int low, high;

        asm volatile("rdtsc" : "=a" (low), "=d" (high));

        return low;
}

unsigned int read_cpu_cycles(void)
{
        static struct perf_event_attr perf_attr = {
                .type = PERF_TYPE_HARDWARE,
                .config = PERF_COUNT_HW_CPU_CYCLES,
                .pinned = 1,
        };
        static __thread struct perf_event_mmap_page *pc;
        static int pmc_failed;

        unsigned int seq, idx, count;

        if (pmc_failed)
                return rdtsc();

        if (!pc) {
                int perf_fd;
                perf_fd = syscall(__NR_perf_event_open, &perf_attr, 0, -1, -1, 0);
                if (perf_fd < 0) {
                        pmc_failed = 1;
                        return rdtsc();
                }
                pc = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, perf_fd, 0);
                close(perf_fd);
                if (pc == MAP_FAILED) {
                        pmc_failed = 1;
                        return rdtsc();
                }
        }

        do {
                seq = pc->lock;
                asm volatile("":::"memory");
                idx = pc->index;
                if (!idx) //  || !pc->cap_user_rdpmc)
                        return 0;
                count = pc->offset + rdpmc(idx - 1);
                asm volatile("":::"memory");
        } while (pc->lock != seq);

        return count;
}
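
For completeness, a serialised variant of the counter read along the lines suggested above (a sketch only; the lfence placement is my assumption, not something from the original mail):

static inline unsigned int rdpmc_serialised(unsigned int counter)
{
        unsigned int low, high;

        /* Fence on both sides so the measured code cannot drift past
         * the counter reads (assumes lfence provides enough ordering
         * for this purpose on the CPUs being measured). */
        asm volatile("lfence; rdpmc" : "=a" (low), "=d" (high) : "c" (counter));
        asm volatile("lfence" ::: "memory");

        // return low bits, counter might be 32 or 40 bits wide.
        return low;
}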

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-05 11:02                           ` Jason A. Donenfeld
  2022-05-05 11:34                             ` David Laight
@ 2022-05-05 13:48                             ` Jason A. Donenfeld
  1 sibling, 0 replies; 38+ messages in thread
From: Jason A. Donenfeld @ 2022-05-05 13:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Borislav Petkov, LKML, x86, Filipe Manana, linux-crypto

Hey again Thomas,

On Thu, May 05, 2022 at 01:02:02PM +0200, Jason A. Donenfeld wrote:
> Interestingly, disabling the simd paths makes things around 283 cycles
> slower on my Tiger Lake laptop, just doing ordinary things. I'm actually
> slightly surprised, so I'll probably keep playing with this. My patch
> for this is attached. Let me know if you have a different methodology in
> mind...

Using RDPMC/perf, the performance is shown to be even closer for real
world cases, with the simd code only ~80 cycles faster. Bench code
follows below. If the observation on this hardware holds for other
hardware, we can probably improve the performance of the generic code a
bit, and then the difference really won't matter. Any thoughts about
this and the test code?

Jason

diff --git a/drivers/char/random.c b/drivers/char/random.c
index bd292927654c..6577e9f2f3b7 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -53,6 +53,7 @@
 #include <linux/uuid.h>
 #include <linux/uaccess.h>
 #include <linux/suspend.h>
+#include <linux/sort.h>
 #include <crypto/chacha.h>
 #include <crypto/blake2s.h>
 #include <asm/processor.h>
@@ -755,9 +756,54 @@ static struct {
 	.lock = __SPIN_LOCK_UNLOCKED(input_pool.lock),
 };
 
+static DEFINE_PER_CPU(int, pmc_index) = -1;
+static struct {
+	u32 durations[1 << 20];
+	u32 pos, len;
+} irqbench;
+
 static void _mix_pool_bytes(const void *in, size_t nbytes)
 {
+	int idx = *this_cpu_ptr(&pmc_index);
+	u32 ctr = input_pool.hash.t[0], reg = 0;
+	cycles_t end, start;
+
+
+	native_cpuid(&reg, &reg, &reg, &reg);
+	start = idx == -1 ? 0 : native_read_pmc(idx);
 	blake2s_update(&input_pool.hash, in, nbytes);
+	end = idx == -1 ? 0 : native_read_pmc(idx);
+
+	if (ctr == input_pool.hash.t[0] || !in_hardirq() || idx == -1)
+		return;
+
+	irqbench.durations[irqbench.pos++ % ARRAY_SIZE(irqbench.durations)] = end - start;
+	irqbench.len = min_t(u32, irqbench.len + 1, ARRAY_SIZE(irqbench.durations));
+}
+
+static int cmp_u32(const void *a, const void *b)
+{
+	return *(const u32 *)a - *(const u32 *)b;
+}
+
+static int proc_do_irqbench_median(struct ctl_table *table, int write, void *buffer,
+				   size_t *lenp, loff_t *ppos)
+{
+	u32 len = READ_ONCE(irqbench.len), median, *sorted;
+	struct ctl_table fake_table = {
+		.data = &median,
+		.maxlen = sizeof(median)
+	};
+	if (!len)
+		return -ENODATA;
+	sorted = kmalloc_array(len, sizeof(*sorted), GFP_KERNEL);
+	if (!sorted)
+		return -ENOMEM;
+	memcpy(sorted, irqbench.durations, len * sizeof(*sorted));
+	sort(sorted, len, sizeof(*sorted), cmp_u32, NULL);
+	median = sorted[len / 2];
+	kfree(sorted);
+	return write ? 0 : proc_douintvec(&fake_table, 0, buffer, lenp, ppos);
 }
 
 /*
@@ -1709,6 +1755,18 @@ static struct ctl_table random_table[] = {
 		.mode		= 0444,
 		.proc_handler	= proc_do_uuid,
 	},
+	{
+		.procname	= "irqbench_median",
+		.mode		= 0444,
+		.proc_handler	= proc_do_irqbench_median,
+	},
+	{
+		.procname	= "irqbench_count",
+		.data		= &irqbench.len,
+		.maxlen		= sizeof(irqbench.len),
+		.mode		= 0444,
+		.proc_handler	= proc_douintvec,
+	},
 	{ }
 };
 
@@ -1718,6 +1776,21 @@ static struct ctl_table random_table[] = {
  */
 static int __init random_sysctls_init(void)
 {
+	int i;
+	struct perf_event *cycles_event;
+	struct perf_event_attr perf_cycles_attr = {
+		.type = PERF_TYPE_HARDWARE,
+		.config = PERF_COUNT_HW_CPU_CYCLES,
+		.size = sizeof(struct perf_event_attr),
+		.pinned = true
+	};
+	for_each_possible_cpu(i) {
+		cycles_event = perf_event_create_kernel_counter(&perf_cycles_attr, i, NULL, NULL, NULL);
+		if (IS_ERR(cycles_event))
+			pr_err("unable to create perf counter on cpu %d: %ld\n", i, PTR_ERR(cycles_event));
+		else
+			*per_cpu_ptr(&pmc_index, i) = cycles_event->hw.event_base_rdpmc;
+	}
 	register_sysctl_init("kernel/random", random_table);
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-04 19:05               ` Jason A. Donenfeld
  2022-05-04 21:04                 ` Thomas Gleixner
@ 2022-05-06 22:15                 ` Jason A. Donenfeld
  1 sibling, 0 replies; 38+ messages in thread
From: Jason A. Donenfeld @ 2022-05-06 22:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Borislav Petkov, LKML, x86, Filipe Manana, linux-crypto

Hi Thomas,

On Wed, May 04, 2022 at 09:05:01PM +0200, Jason A. Donenfeld wrote:
> Hey Thomas,
> 
> On Wed, May 04, 2022 at 06:45:45PM +0200, Thomas Gleixner wrote:
> > add_disk_randomness() on !RT kernels. That's what made me look into this
> > in the first place as it unearthed the long standing FPU protection
> > bug. See the first patch in this thread.
> > 
> > Possibly add_device_randomness() too, but I haven't seen evidence so far.
> 
> It looks like it's being hit from add_input_randomness() via
> input_handle_event() too. There are two positions we could take toward
> this:
> 
> One stance to take would be that if an event happens in an interrupt,
> add_interrupt_randomness() should in theory already be siphashing in a
> cycle counter so additional calls to add_{input,disk}_randomness() don't
> contribute substantially (unless you assume the num field has some
> entropic value). If that's compelling, then the next thing to do would
> be adding a `if (in_interrupt()) return;` early on in some function, and
> then we forever after impose a rule, "never mix into the input pool
> directly from an irq".
> 
> The other stance is that these input/disk events are relatively rare --
> compared to, say, a storm of interrupts from a NIC -- so mixing into the
> input pool from there isn't actually a problem, and we benefit from the
> quasi domain-specific accounting and the superior mixing function,
> there, so keep it around. And the non-raw spinlock on the input pool
> won't negatively affect RT from this context, because all its callers on
> RT should be threaded.

It turned out to be more complicated than this. Sometimes a given
interrupt handler would have multiple calls to add_disk_randomness(),
followed by a single call to add_interrupt_randomness(). Since those
multiple calls all represent different things being measured, the second
option of discarding them didn't seem like a good idea, but the first
option seemed pretty bad too, since the status quo is way too much
hashing from an irq handler. Further complicating things is the fact
that add_{input,disk}_randomness() is never just called from irq or from
elsewhere, as different drivers do different things. Luckily the way the
entropy estimator currently works allows for a pretty straight forward
compromise, which I posted here:

https://lore.kernel.org/lkml/20220506215454.1671-2-Jason@zx2c4.com/

The upshot for this discussion is that it means we can probably get rid
of hard irq FPU support. (This is in addition to the performance
argument, where it tentatively seemed like the difference wasn't
/that/ much to justify the complexity.)

I intend to mull this over a bit more, though, so if my patch is to make
it in, it'll be for 5.19 and certainly not 5.18. So for 5.18, I think
it's probably a good idea to apply the patches in this thread. And then
for 5.19 we can take the next step.

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-05 11:34                             ` David Laight
  2022-05-05 11:35                               ` Jason A. Donenfeld
@ 2022-05-06 22:34                               ` Jason A. Donenfeld
  2022-05-07 13:50                                 ` David Laight
  1 sibling, 1 reply; 38+ messages in thread
From: Jason A. Donenfeld @ 2022-05-06 22:34 UTC (permalink / raw)
  To: David Laight
  Cc: Thomas Gleixner, Peter Zijlstra, Borislav Petkov, LKML, x86,
	Filipe Manana, linux-crypto

Hi David,

On Thu, May 05, 2022 at 11:34:40AM +0000, David Laight wrote:
> OTOH the entropy mixing is very likely to be 'cold cache'
> and all the unrolling in blakes7 will completely kill
> performance.

I've seen you mention the BLAKE2s unrolling in like 8 different threads
now, and I'm not convinced that you're entirely wrong, nor am I
convinced that you're entirely right. My response to you is the same as
always: please send a patch with some measurements! I'd love to get this
worked out in a real way.

The last time I went benching these, the unrolled code was ~100 cycles
faster, if I recall correctly, than the rolled code, when used from
WireGuard's hot path. I don't doubt that a cold path would be more
fraught, though, as that's a decent amount of code. So the question is
how to re-roll the rounds without sacrificing those 100 cycles.

In order to begin to figure that out, we have to look at why the
re-rolled loop is slow and the unrolled loop fast. It's not because of
complicated pipeline things. It's because the BLAKE2s permutation is
actually 10 different permutations, one for each round. Take a look at
the core function, G, and its uses of the round number, r:

    #define G(r, i, a, b, c, d) do { \
        a += b + m[blake2s_sigma[r][2 * i + 0]]; \
        d = ror32(d ^ a, 16); \
        c += d; \
        b = ror32(b ^ c, 12); \
        a += b + m[blake2s_sigma[r][2 * i + 1]]; \
        d = ror32(d ^ a, 8); \
        c += d; \
        b = ror32(b ^ c, 7); \
    } while (0)

The blake2s_sigma array is a `static const u8 blake2s_sigma[10][16]`,
with a row for every one of the 10 rounds. What this is actually doing
is reading the message words in a different order each round, so that
the whole permutation is different.

When the loop is unrolled, blake2s_sigma gets inlined, and then there
are no memory accesses. When it's re-rolled, every round accesses
blake2s_sigma 16 times, which hinders performance.

You'll notice, on the other hand, that the SIMD hand coded assembly
implementations do not unroll. The trick is to hide the cost of the
blake2s_sigma indirection in the data dependencies, so that performance
isn't affected. Naively re-rolling the generic code does not inspire the
compiler to do that. But maybe you can figure something out?

Anyway, that's about where my thinking is on this, but I'd love to see
some patches from you at some point if you're interested.

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [patch 3/3] x86/fpu: Make FPU protection more robust
  2022-05-06 22:34                               ` Jason A. Donenfeld
@ 2022-05-07 13:50                                 ` David Laight
  0 siblings, 0 replies; 38+ messages in thread
From: David Laight @ 2022-05-07 13:50 UTC (permalink / raw)
  To: 'Jason A. Donenfeld'
  Cc: Thomas Gleixner, Peter Zijlstra, Borislav Petkov, LKML, x86,
	Filipe Manana, linux-crypto

From: Jason A. Donenfeld
> Sent: 06 May 2022 23:34
> 
> Hi David,
> 
> On Thu, May 05, 2022 at 11:34:40AM +0000, David Laight wrote:
> > OTOH the entropy mixing is very likely to be 'cold cache'
> > and all the unrolling in blakes7 will completely kill
> > performance.
> 
> I've seen you mention the BLAKE2s unrolling in like 8 different threads
> now, and I'm not convinced that you're entirely wrong, nor am I
> convinced that you're entirely right. My response to you is the same as
> always: please send a patch with some measurements! I'd love to get this
> worked out in a real way.
> 
> The last time I went benching these, the unrolled code was ~100 cycles
> faster, if I recall correctly, than the rolled code, when used from
> WireGuard's hot path. I don't doubt that a cold path would be more
> fraught, though, as that's a decent amount of code. So the question is
> how to re-roll the rounds without sacrificing those 100 cycles.
> 
> In order to begin to figure that out, we have to look at why the
> re-rolled loop is slow and the unrolled loop fast. It's not because of
> complicated pipeline things. It's because the BLAKE2s permutation is
> actually 10 different permutations, one for each round. Take a look at
> the core function, G, and its uses of the round number, r:
> 
>     #define G(r, i, a, b, c, d) do { \
>         a += b + m[blake2s_sigma[r][2 * i + 0]]; \
>         d = ror32(d ^ a, 16); \
>         c += d; \
>         b = ror32(b ^ c, 12); \
>         a += b + m[blake2s_sigma[r][2 * i + 1]]; \
>         d = ror32(d ^ a, 8); \
>         c += d; \
>         b = ror32(b ^ c, 7); \
>     } while (0)

Each of those lines is a couple of instructions and they are
all dependent on the preceding value.
I count 14 - excluding the m[] accesses.
There are 80 copies of G() - total 1120, or 17.5/byte.
To get any faster than that you need to get the compiler
to interleave the generated code for multiple expansions of G().

> The blake2s_sigma array is a `static const u8 blake2s_sigma[10][16]`,
> with a row for every one of the 10 rounds. What this is actually doing
> is reading the message words in a different order each round, so that
> the whole permutation is different.
> 
> When the loop is unrolled, blake2s_sigma gets inlined, and then there
> are no memory accesses. When it's re-rolled, every round accesses
> blake2s_sigma 16 times, which hinders performance.

It shouldn't really make much difference.
There are only two memory reads for each 14 arithmetic ops.
So unless you manage to maintain 4 instructions/clock there
are spare clocks for the extra memory cycles.
And that is assuming one read/clock, modern x86 can do 2
(with a following wind!)
On x86 the array index is free.

> You'll notice, on the other hand, that the SIMD hand coded assembly
> implementations do not unroll. The trick is to hide the cost of the
> blake2s_sigma indirection in the data dependencies, so that performance
> isn't affected. Naively re-rolling the generic code does not inspire the
> compiler to do that. But maybe you can figure something out?

I've not looked at that version.
I have written AVX/SSE code - hard work finding the asm mnemonics!

> Anyway, that's about where my thinking is on this, but I'd love to see
> some patches from you at some point if you're interested.

Ok I just ran some tests looping over the ROUND() without updating v[].
These are using rdpmc to time single calls - not averaging over
a large number of iterations.

On my i7-7700 the unrolled loop is about 6.2 clocks/byte.
The 'cold cache' for a single 64 byte block is about 20 clocks/byte.

If I use:
	for (u32 i = 0; i < 10; i++) ROUND(i);
it drops to 7.8 clocks/byte but the single block is about 15.
Part of the problem is extra register spills to stack.

So we play some games:
Remove the blake2s_sigma[] out of G() into ROUND()
so the 'normal' code uses ROUND(blake2s_sigma[0]) (etc).
Then we can use the loop:
	for (const u8 *bs = &blake2s_sigma[0][0]; bs < end; bs += 16)
		ROUND(bs);
(with some bits to make it compile.)
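
For illustration, the G() side of that rework might look something like this (a sketch only, based on the macro quoted earlier with the sigma row passed in as a pointer; the in-tree macro takes the round number r instead):

#define G(bs, i, a, b, c, d) do { \
	a += b + m[(bs)[2 * i + 0]]; \
	d = ror32(d ^ a, 16); \
	c += d; \
	b = ror32(b ^ c, 12); \
	a += b + m[(bs)[2 * i + 1]]; \
	d = ror32(d ^ a, 8); \
	c += d; \
	b = ror32(b ^ c, 7); \
} while (0)

/* ROUND(bs) would then expand eight G(bs, i, ...) calls and the round
 * loop walks blake2s_sigma sixteen bytes at a time, as in the loop above. */
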
Annoyingly that makes the compiler spill all the bs[] to stack.
A few carefully placed barrier() would help.
But simpler is 'volatile const u8 *bs'.
This has the desired effect of interleaving the 'sigma' reads
with the buffer ones.
This gave me 7.2 clocks/byte, single block maybe 14.

So about 1 clock/byte (maybe 15%) slower.
OTOH in many cases this isn't critical.

The cold cache cutoff for the unrolled loop is around 256 bytes.

The other big effect the unrolled loop has for small buffers
is that it displaces a lot of code from the i-cache.
That will make the unrolled code worse for many real situations.

I suspect that the compiler ends up spilling another register (or 2)
to the stack.
Hand assembly could avoid that.

I've not looked at what gets spilled, gcc can make horrid choices.
It might be worth making (say) a and c volatile (by splitting v[])
and using a temporary in G(), eg:
	c = t = c + d;
	b = ror32(b ^ t, 12);

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch 0/3] x86/fpu: Prevent FPU state corruption
  2022-05-04 15:40 ` Jason A. Donenfeld
  2022-05-04 18:05   ` Thomas Gleixner
@ 2022-05-18  1:02   ` Jason A. Donenfeld
  2022-05-18 11:14     ` Jason A. Donenfeld
  2022-05-18 13:09     ` Thomas Gleixner
  1 sibling, 2 replies; 38+ messages in thread
From: Jason A. Donenfeld @ 2022-05-18  1:02 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, x86, Filipe Manana, Vadim Galitsin

Hey Thomas,

On Wed, May 04, 2022 at 05:40:26PM +0200, Jason A. Donenfeld wrote:
> Hi Thomas,
> 
> On Sun, May 01, 2022 at 09:31:42PM +0200, Thomas Gleixner wrote:
> > The recent changes in the random code unearthed a long standing FPU state
> > corruption due do a buggy condition for granting in-kernel FPU usage.
>  
> Thanks for working that out. I've been banging my head over [1] for a
> few days now trying to see if it's a mis-bisect or a real thing. I'll
> ask Larry to retry with this patchset.

So, Larry's debugging was inconsistent and didn't result in anything I
could piece together into basic cause and effect. But luckily Vadim, who
maintains the VirtualBox drivers for Oracle, was able to reproduce the
issue and was able to conduct some real debugging. I've CC'd him here.
From talking with Vadim, here are some findings thus far:

  - Certain Linux guest processes crash under high load.
  - Windows kernel guest panics.

Observation: the Windows kernel uses SSSE3 in their kernel all over the
place, generated by the compiler.

  - Moving the mouse around helps induce the crash.

Observation: add_input_randomness() -> .. -> kernel_fpu_begin() -> blake2s_compress().

  - The problem exhibits itself in rc7, so this patchset does not fix
    the issue.
  - Applying https://xn--4db.cc/ttEUSvdC fixes the issue.

Observation: the problem is definitely related to using the FPU in a
hard IRQ.

I went reading KVM to get some idea of why KVM does *not* have this
problem, and it looks like there's some careful code there about doing
xsave and such around IRQs. So my current theory is that VirtualBox's
VMM just forgot to do this, and until now this bug went unnoticed.

Since VirtualBox is out of tree (and extremely messy of a codebase), and
this appears to be an out of tree module problem rather than a kernel
problem, I'm inclined to think that there's not much for us to do, at
least until we receive information to the contrary of this presumption.

But in case you do want to do something proactively, I don't have any
objections to just disabling the FPU in hard IRQ for 5.18. And in 5.19,
add_input_randomness() isn't going to hit that path anyway. But also,
doing nothing and letting the VirtualBox people figure out their bug
would be fine with me too. Either way, just wanted to give you a heads
up.

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch 0/3] x86/fpu: Prevent FPU state corruption
  2022-05-18  1:02   ` Jason A. Donenfeld
@ 2022-05-18 11:14     ` Jason A. Donenfeld
  2022-05-18 11:18       ` Jason A. Donenfeld
  2022-05-18 13:09     ` Thomas Gleixner
  1 sibling, 1 reply; 38+ messages in thread
From: Jason A. Donenfeld @ 2022-05-18 11:14 UTC (permalink / raw)
  To: Thomas Gleixner, Vadim Galitsin; +Cc: LKML, x86, Filipe Manana, Vadim Galitsin

Hi Vadim,

On Wed, May 18, 2022 at 03:02:05AM +0200, Jason A. Donenfeld wrote:
> Observation: the problem is definitely related to using the FPU in a
> hard IRQ.

I wrote a tiny reproducer that should be pretty reliable for testing
this, attached below. I think this proves my working theory. Run this in
a VirtualBox VM, and then move your mouse around or hit the keyboard, or
do something that triggers the add_{input,disk}_randomness() path from a
hardirq handler. On my laptop, for example, the trackpoint goes via
hardirq, but the touchpad does not. As soon as I move the trackpoint
around, the below program prints "XSAVE is borked!".

Also, note that this isn't just "corruption" of the guest VM, but also
leaking secret contents of the host VM into the guest. So you might
really want to make sure VirtualBox issues a fix for this before 5.18,
as it's arguably security sensitive.

Regards,
Jason


#include <unistd.h>
#include <stdio.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/prctl.h>

int main(int argc, char *argv[])
{
	int status = 0;

	for (int i = 0, nproc = sysconf(_SC_NPROCESSORS_ONLN); i < nproc; ++i) {
		if (!fork()) {
			prctl(PR_SET_PDEATHSIG, SIGKILL);
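			/* Park 42 in %xmm0 and spin, checking that it is
			 * still there; if the FPU state gets clobbered,
			 * the compare fails and the child exits with 77,
			 * which the parent reports below. */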
			asm("movq $42, %%rax\n"
			    "movq %%rax, %%xmm0\n"
			    "0:\n"
			    "movq %%xmm0, %%rbx\n"
			    "cmpq %%rax, %%rbx\n"
			    "je 0b\n"
			    : : : "rax", "rbx", "xmm0", "cc");
			_exit(77);
		}
	}
	wait(&status);
	if (WEXITSTATUS(status) == 77)
		printf("XSAVE is borked!\n");
	return 1;
}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch 0/3] x86/fpu: Prevent FPU state corruption
  2022-05-18 11:14     ` Jason A. Donenfeld
@ 2022-05-18 11:18       ` Jason A. Donenfeld
  0 siblings, 0 replies; 38+ messages in thread
From: Jason A. Donenfeld @ 2022-05-18 11:18 UTC (permalink / raw)
  To: Thomas Gleixner, Vadim Galitsin; +Cc: LKML, x86, Filipe Manana

On Wed, May 18, 2022 at 01:14:40PM +0200, Jason A. Donenfeld wrote:
> I wrote a tiny reproducer that should be pretty reliable for testing
> this, attached below. I think this proves my working theory. Run this in
> a VirtualBox VM, and then move your mouse around or hit the keyboard, or
> do something that triggers the add_{input,disk}_randomness() path from a
> hardirq handler. On my laptop, for example, the trackpoint goes via
> hardirq, but the touchpad does not. As soon as I move the trackpoint
> around, the below program prints "XSAVE is borked!".

I should also add that the VM should have as many CPUs as the host, to
better the chances that the irq arrives on the same CPU that the guest
VM is running on.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch 0/3] x86/fpu: Prevent FPU state corruption
  2022-05-18  1:02   ` Jason A. Donenfeld
  2022-05-18 11:14     ` Jason A. Donenfeld
@ 2022-05-18 13:09     ` Thomas Gleixner
  2022-05-18 14:08       ` Jason A. Donenfeld
  1 sibling, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2022-05-18 13:09 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: LKML, x86, Filipe Manana, Vadim Galitsin

On Wed, May 18 2022 at 03:02, Jason A. Donenfeld wrote:
> On Wed, May 04, 2022 at 05:40:26PM +0200, Jason A. Donenfeld wrote:
>> On Sun, May 01, 2022 at 09:31:42PM +0200, Thomas Gleixner wrote:
>> > The recent changes in the random code unearthed a long standing FPU state
>> > corruption due do a buggy condition for granting in-kernel FPU usage.
>>  
>> Thanks for working that out. I've been banging my head over [1] for a
>> few days now trying to see if it's a mis-bisect or a real thing. I'll
>> ask Larry to retry with this patchset.
>
> So, Larry's debugging was inconsistent and didn't result in anything I
> could piece together into basic cause and effect. But luckily Vadim, who
> maintains the VirtualBox drivers for Oracle, was able to reproduce the
> issue and was able to conduct some real debugging. I've CC'd him here.
> From talking with Vadim, here are some findings thus far:
>
>   - Certain Linux guest processes crash under high load.
>   - Windows kernel guest panics.
>
> Observation: the Windows kernel uses SSSE3 in their kernel all over the
> place, generated by the compiler.
>
>   - Moving the mouse around helps induce the crash.
>
> Observation: add_input_randomness() -> .. -> kernel_fpu_begin() -> blake2s_compress().
>
>   - The problem exhibits itself in rc7, so this patchset does not fix
>     the issue.
>   - Applying https://xn--4db.cc/ttEUSvdC fixes the issue.
>
> Observation: the problem is definitely related to using the FPU in a
> hard IRQ.
>
> I went reading KVM to get some idea of why KVM does *not* have this
> problem, and it looks like there's some careful code there about doing
> xsave and such around IRQs. So my current theory is that VirtualBox's
> VMM just forgot to do this, and until now this bug went unnoticed.

That's a very valid assumption. I audited all places which fiddle with
FPU in Linus tree and with the fix applied they're all safe.

> Since VirtualBox is out of tree (and extremely messy of a codebase), and
> this appears to be an out of tree module problem rather than a kernel
> problem, I'm inclined to think that there's not much for us to do, at
> least until we receive information to the contrary of this presumption.

Agreed in all points.

> But in case you do want to do something proactively, I don't have any
> objections to just disabling the FPU in hard IRQ for 5.18. And in 5.19,
> add_input_randomness() isn't going to hit that path anyway. But also,
> doing nothing and letting the VirtualBox people figure out their bug
> would be fine with me too. Either way, just wanted to give you a heads
> up.

That VirtualBox bug has to be fixed in any case as this problem has existed
forever and there have been drivers using the FPU in hard interrupt context
sporadically in the past, so it's sheer luck that this didn't explode
before. AFAICT all of this has been moved to softirq context over the
years, so the random code is probably the sole hard interrupt FPU user in
mainline today.

In the interest of users we should probably bite the bullet and just
disable hard interrupt FPU usage upstream and Cc stable. The stable
kernel updates probably reach users faster.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch 0/3] x86/fpu: Prevent FPU state corruption
  2022-05-18 13:09     ` Thomas Gleixner
@ 2022-05-18 14:08       ` Jason A. Donenfeld
  2022-05-25 20:36         ` Jason A. Donenfeld
  0 siblings, 1 reply; 38+ messages in thread
From: Jason A. Donenfeld @ 2022-05-18 14:08 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, x86, Filipe Manana, Vadim Galitsin

Hi Thomas,

On Wed, May 18, 2022 at 03:09:54PM +0200, Thomas Gleixner wrote:
> In the interest of users we should probably bite the bullet and just
> disable hard interrupt FPU usage upstream and Cc stable. The stable
> kernel updates probably reach users faster.

Considering <https://git.zx2c4.com/linux-rng/commit/?id=e3e33fc2e> is
slated for 5.19, that seems fine, as this will remove what is hopefully
the last hardirq FPU user.

The bigger motivation, though, for removing the hardirq FPU code would
be for just simplifying the FPU handling. The VirtualBox people are
going to have to fix this bug anyway, since it also affects old kernels
they support.

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch 0/3] x86/fpu: Prevent FPU state corruption
  2022-05-18 14:08       ` Jason A. Donenfeld
@ 2022-05-25 20:36         ` Jason A. Donenfeld
  0 siblings, 0 replies; 38+ messages in thread
From: Jason A. Donenfeld @ 2022-05-25 20:36 UTC (permalink / raw)
  To: LKML, x86, Filipe Manana, Thomas Gleixner

Hey folks,

An update on the VirtualBox thing (not that anybody here actually
cares, but at least I've been nerd sniped):

It looks like they've got a patch out now for it:
https://www.virtualbox.org/attachment/ticket/20914/vbox-linux-5.18.patch
It seems like they call kernel_fpu_begin() before loading guest fpu
regs, but then immediately re-enable preemption. Yikes. So if another
kernel task preempts that one and uses the fpu... And it makes me
wonder the extent of this breakage prior (maybe not just hardirq?
unsure.). Also, it apparently is only enabled for 5.18, so that
doesn't help with old kernels. Oh well. All that development happens
behind closed doors so probably not worth losing sleep over.

Anyway, from a kernel perspective, there's now no urgency for us to do
anything about this. VirtualBox won't compile for 5.18 without that
patch, and given that Oracle fixed it, it doesn't appear to be our
bug. Case closed. So if we go ahead with removing hardirq fpu support,
it won't be due to this but for other motivations.

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2022-05-25 20:36 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-01 19:31 [patch 0/3] x86/fpu: Prevent FPU state corruption Thomas Gleixner
2022-05-01 19:31 ` [patch 1/3] " Thomas Gleixner
2022-05-02 13:16   ` Borislav Petkov
2022-05-05  0:42   ` [tip: x86/urgent] " tip-bot2 for Thomas Gleixner
2022-05-01 19:31 ` [patch 2/3] x86/fpu: Rename irq_fpu_usable() Thomas Gleixner
2022-05-02 13:57   ` Borislav Petkov
2022-05-01 19:31 ` [patch 3/3] x86/fpu: Make FPU protection more robust Thomas Gleixner
2022-05-02 14:35   ` Borislav Petkov
2022-05-02 15:58     ` Thomas Gleixner
2022-05-03  9:06       ` Peter Zijlstra
2022-05-04 15:36         ` Thomas Gleixner
2022-05-04 15:55           ` Jason A. Donenfeld
2022-05-04 16:45             ` Thomas Gleixner
2022-05-04 19:05               ` Jason A. Donenfeld
2022-05-04 21:04                 ` Thomas Gleixner
2022-05-04 23:52                   ` Jason A. Donenfeld
2022-05-05  0:55                     ` Thomas Gleixner
2022-05-05  1:11                       ` Jason A. Donenfeld
2022-05-05  1:21                         ` Thomas Gleixner
2022-05-05 11:02                           ` Jason A. Donenfeld
2022-05-05 11:34                             ` David Laight
2022-05-05 11:35                               ` Jason A. Donenfeld
2022-05-05 11:53                                 ` David Laight
2022-05-06 22:34                               ` Jason A. Donenfeld
2022-05-07 13:50                                 ` David Laight
2022-05-05 13:48                             ` Jason A. Donenfeld
2022-05-06 22:15                 ` Jason A. Donenfeld
2022-05-03  9:03   ` Peter Zijlstra
2022-05-02 10:02 ` [patch 0/3] x86/fpu: Prevent FPU state corruption Filipe Manana
2022-05-02 12:22   ` Borislav Petkov
2022-05-04 15:40 ` Jason A. Donenfeld
2022-05-04 18:05   ` Thomas Gleixner
2022-05-18  1:02   ` Jason A. Donenfeld
2022-05-18 11:14     ` Jason A. Donenfeld
2022-05-18 11:18       ` Jason A. Donenfeld
2022-05-18 13:09     ` Thomas Gleixner
2022-05-18 14:08       ` Jason A. Donenfeld
2022-05-25 20:36         ` Jason A. Donenfeld

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).