* [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled
@ 2020-12-18 17:01 ` Ard Biesheuvel
  0 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2020-12-18 17:01 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, linux-kernel, Ard Biesheuvel, Dave Martin,
	Mark Brown, Herbert Xu, Eric Biggers, Will Deacon,
	Catalin Marinas, Thomas Gleixner, Peter Zijlstra,
	Sebastian Andrzej Siewior, Ingo Molnar

[ TL;DR for the non-ARM folks on CC: disabling softirq processing when using
  SIMD in kernel mode could reduce complexity and improve performance, but we
  need to decide whether we can do this, and how much softirq processing
  latency we can tolerate. If we can find a satisfactory solution for this,
  we might do the same for x86 and 32-bit ARM as well. ]

The crypto API provides two ways to invoke symmetric encryption algorithms:
- synchronously, where the transformation is guaranteed to be done by the
  time the function returns;
- asynchronously, where the function may return with a -EINPROGRESS return code,
  and a completion will be signalled when the transformation is done.
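
[ For illustration only, not something this series touches: a minimal sketch
  of how a caller typically copes with both return conventions by using the
  existing crypto_wait_req() helper. The crypto API calls below are real;
  the surrounding function, its parameters and the key/tag handling are made
  up for the example. ]

	#include <crypto/aead.h>
	#include <linux/err.h>
	#include <linux/scatterlist.h>

	static int example_gcm_encrypt(struct scatterlist *sg, unsigned int len,
				       u8 *iv, const u8 *key, unsigned int keylen)
	{
		struct crypto_aead *tfm;
		struct aead_request *req;
		DECLARE_CRYPTO_WAIT(wait);
		int err;

		tfm = crypto_alloc_aead("gcm(aes)", 0, 0);
		if (IS_ERR(tfm))
			return PTR_ERR(tfm);

		err = crypto_aead_setkey(tfm, key, keylen);
		if (err)
			goto out_free_tfm;

		req = aead_request_alloc(tfm, GFP_KERNEL);
		if (!req) {
			err = -ENOMEM;
			goto out_free_tfm;
		}

		/* crypto_req_done() turns a -EINPROGRESS result into a completion */
		aead_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
					  crypto_req_done, &wait);
		aead_request_set_crypt(req, sg, sg, len, iv); /* sg must have room for the tag */
		aead_request_set_ad(req, 0);

		/* returns 0 immediately for sync algos, sleeps until done otherwise */
		err = crypto_wait_req(crypto_aead_encrypt(req), &wait);

		aead_request_free(req);
	out_free_tfm:
		crypto_free_aead(tfm);
		return err;
	}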

The latter is mainly intended for h/w accelerators, whose throughput would
otherwise be severely limited by latency. However, it is also being used for
software algorithms based on SIMD instructions, which cannot be issued from
just any context (the rules differ per architecture, but typically, SIMD can
be used in task context, or in softirq context provided that the softirq was
not taken while the SIMD unit was already in use in kernel mode).
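
[ For reference, the typical pattern this results in on arm64 today - the
  crypto_simd_usable()/kernel_neon_begin()/kernel_neon_end() helpers are the
  existing ones, the two crypt routines are hypothetical: ]

	#include <crypto/internal/simd.h>
	#include <asm/neon.h>

	static void example_do_crypt(u8 *dst, const u8 *src, int len)
	{
		if (crypto_simd_usable()) {
			/* NEON may be used in this context */
			kernel_neon_begin();
			example_crypt_neon(dst, src, len);	/* hypothetical */
			kernel_neon_end();
		} else {
			/* e.g. a softirq that interrupted kernel mode NEON use */
			example_crypt_scalar(dst, src, len);	/* hypothetical */
		}
	}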

Many users of the crypto API in the kernel today opt out of this
asynchronous interface (802.11, macsec, kerberos, sw kTLS), or use a library
interface which is fundamentally synchronous (wireguard). This means we end
up using a degraded mode in the contended case (a scalar fallback) as well
as in the uncontended case (generic GCM/CCM/CTR chaining mode templates
wrapped around the SIMD cipher, as opposed to accelerated implementations of
the full chaining modes in question). Note that scalar AES runs ~20x slower
than the SIMD instruction-based version.
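
[ For reference: such users typically request a synchronous transform by
  masking out CRYPTO_ALG_ASYNC at allocation time, e.g.: ]

	/* only match implementations that complete synchronously */
	aead = crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);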

So let's address this for arm64 by reorganizing kernel mode SIMD support so
that the SIMD unit can always be assumed to be available. This means we need
to defer softirq processing when grabbing the NEON unit in task context, so
that any use of it in softirq context is guaranteed not to interrupt any code
that was already using the NEON.

This obviously impacts softirq processing latency, which is why the existing
conditional NEON yield support is modified to take pending softirqs into
account.
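
[ Conceptually, the fpsimd change in this series boils down to the sketch
  below - a heavily simplified view of kernel_neon_begin()/end(); the actual
  change is in get/put_cpu_fpsimd_context() in arch/arm64/kernel/fpsimd.c: ]

	void kernel_neon_begin(void)
	{
		local_bh_disable();	/* was: preempt_disable() */
		/* ... save and invalidate the current FP/SIMD state ... */
	}

	void kernel_neon_end(void)
	{
		/* ... */
		local_bh_enable();	/* was: preempt_enable() */
	}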

As an example of how this impacts the code, the existing arm64 GCM driver is
updated to:
- Add yield support - currently, the pending softirq check is performed every
  64 bytes of input, which is way too often - one of the desired outcomes of
  this RFC is getting a reasonable ballpark for how long we want to run with
  softirqs disabled.
- Remove the existing scalar fallbacks, which are no longer needed.

Questions:
- what did I miss or break horribly?
- does any of this matter for RT? AIUI, RT runs softirqs from a dedicated
  kthread, so I don't think it cares.
- what would be a reasonable upper bound on how long to keep softirqs disabled?
  I suppose limiting it to 100s of cycles or less would be overkill, but I'm not
  sure how to derive a better answer.
- could we do the same on x86, now that kernel_fpu_begin/end is no longer
  expensive?

Cc: Dave Martin <dave.martin@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>

Ard Biesheuvel (5):
  crypto: aead - disallow en/decrypt for non-task or non-softirq context
  crypto: skcipher - disallow en/decrypt for non-task or non-softirq
    context
  crypto: arm64/gcm-aes-ce - add NEON yield support
  arm64: fpsimd: run kernel mode NEON with softirqs disabled
  crypto: arm64/gcm-aes-ce - remove non-SIMD fallback path

 arch/arm64/crypto/ghash-ce-core.S  | 115 ++++++-----
 arch/arm64/crypto/ghash-ce-glue.c  | 209 +++++---------------
 arch/arm64/include/asm/assembler.h |  19 +-
 arch/arm64/kernel/asm-offsets.c    |   2 +
 arch/arm64/kernel/fpsimd.c         |   4 +-
 crypto/aead.c                      |  10 +
 crypto/skcipher.c                  |  10 +
 7 files changed, 155 insertions(+), 214 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH 1/5] crypto: aead - disallow en/decrypt for non-task or non-softirq context
  2020-12-18 17:01 ` Ard Biesheuvel
@ 2020-12-18 17:01   ` Ard Biesheuvel
  -1 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2020-12-18 17:01 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, linux-kernel, Ard Biesheuvel, Dave Martin,
	Mark Brown, Herbert Xu, Eric Biggers, Will Deacon,
	Catalin Marinas, Thomas Gleixner, Peter Zijlstra,
	Sebastian Andrzej Siewior, Ingo Molnar

In order to ensure that kernel mode SIMD routines will not need a scalar
fallback if they run with softirqs disabled, disallow any use of the
AEAD encrypt and decrypt routines from outside of task or softirq context.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 crypto/aead.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/crypto/aead.c b/crypto/aead.c
index 16991095270d..b5304b3d3314 100644
--- a/crypto/aead.c
+++ b/crypto/aead.c
@@ -87,6 +87,11 @@ int crypto_aead_encrypt(struct aead_request *req)
 	unsigned int cryptlen = req->cryptlen;
 	int ret;
 
+	if (!(alg->cra_flags & CRYPTO_ALG_ASYNC) &&
+	    WARN_ONCE(!in_task() && !in_serving_softirq(),
+		      "synchronous call from invalid context\n"))
+		return -EBUSY;
+
 	crypto_stats_get(alg);
 	if (crypto_aead_get_flags(aead) & CRYPTO_TFM_NEED_KEY)
 		ret = -ENOKEY;
@@ -104,6 +109,11 @@ int crypto_aead_decrypt(struct aead_request *req)
 	unsigned int cryptlen = req->cryptlen;
 	int ret;
 
+	if (!(alg->cra_flags & CRYPTO_ALG_ASYNC) &&
+	    WARN_ONCE(!in_task() && !in_serving_softirq(),
+		      "synchronous call from invalid context\n"))
+		return -EBUSY;
+
 	crypto_stats_get(alg);
 	if (crypto_aead_get_flags(aead) & CRYPTO_TFM_NEED_KEY)
 		ret = -ENOKEY;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 2/5] crypto: skcipher - disallow en/decrypt for non-task or non-softirq context
  2020-12-18 17:01 ` Ard Biesheuvel
@ 2020-12-18 17:01   ` Ard Biesheuvel
  -1 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2020-12-18 17:01 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, linux-kernel, Ard Biesheuvel, Dave Martin,
	Mark Brown, Herbert Xu, Eric Biggers, Will Deacon,
	Catalin Marinas, Thomas Gleixner, Peter Zijlstra,
	Sebastian Andrzej Siewior, Ingo Molnar

In order to ensure that kernel mode SIMD routines will not need a scalar
fallback if they run with softirqs disabled, disallow any use of the
skcipher encrypt and decrypt routines from outside of task or softirq
context.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 crypto/skcipher.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/crypto/skcipher.c b/crypto/skcipher.c
index b4dae640de9f..d1a656d7d498 100644
--- a/crypto/skcipher.c
+++ b/crypto/skcipher.c
@@ -628,6 +628,11 @@ int crypto_skcipher_encrypt(struct skcipher_request *req)
 	unsigned int cryptlen = req->cryptlen;
 	int ret;
 
+	if (!(alg->cra_flags & CRYPTO_ALG_ASYNC) &&
+	    WARN_ONCE(!in_task() && !in_serving_softirq(),
+		      "synchronous call from invalid context\n"))
+		return -EBUSY;
+
 	crypto_stats_get(alg);
 	if (crypto_skcipher_get_flags(tfm) & CRYPTO_TFM_NEED_KEY)
 		ret = -ENOKEY;
@@ -645,6 +650,11 @@ int crypto_skcipher_decrypt(struct skcipher_request *req)
 	unsigned int cryptlen = req->cryptlen;
 	int ret;
 
+	if (!(alg->cra_flags & CRYPTO_ALG_ASYNC) &&
+	    WARN_ONCE(!in_task() && !in_serving_softirq(),
+		      "synchronous call from invalid context\n"))
+		return -EBUSY;
+
 	crypto_stats_get(alg);
 	if (crypto_skcipher_get_flags(tfm) & CRYPTO_TFM_NEED_KEY)
 		ret = -ENOKEY;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 3/5] crypto: arm64/gcm-aes-ce - add NEON yield support
  2020-12-18 17:01 ` Ard Biesheuvel
@ 2020-12-18 17:01   ` Ard Biesheuvel
  -1 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2020-12-18 17:01 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, linux-kernel, Ard Biesheuvel, Dave Martin,
	Mark Brown, Herbert Xu, Eric Biggers, Will Deacon,
	Catalin Marinas, Thomas Gleixner, Peter Zijlstra,
	Sebastian Andrzej Siewior, Ingo Molnar

GCM mode is typically used for IPsec, but is now also used for software
kTLS, which means that we may end up operating on much larger inputs.

So let's wire up conditional NEON yield support for this driver, to ensure
that it does not cause scheduling blackouts when used with large inputs.

Given that GCM executes at 1-2 cycles per byte and operates on 64-byte
chunks, doing a yield check every iteration should limit the scheduling
(or softirq) latency to < 200 cycles, which is a very conservative upper
bound.
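
(Rough arithmetic rather than a measurement: 64 bytes per iteration at ~2
cycles per byte is ~128 cycles of crypto between checks, so even with some
per-iteration overhead the window between yield checks should stay below the
~200 cycle figure quoted above.)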

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/crypto/ghash-ce-core.S | 115 ++++++++++++--------
 1 file changed, 67 insertions(+), 48 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index 7868330dd54e..533cfb810232 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -435,14 +435,23 @@ SYM_FUNC_END(pmull_ghash_update_p8)
 
 	.align		6
 	.macro		pmull_gcm_do_crypt, enc
-	stp		x29, x30, [sp, #-32]!
-	mov		x29, sp
-	str		x19, [sp, #24]
+	frame_push	8
 
-	load_round_keys	x7, x6, x8
+	mov		x19, x0
+	mov		x20, x1
+	mov		x21, x2
+	mov		x22, x3
+	mov		x23, x4
+	mov		x24, x5
+	mov		x25, x6
+	mov		x26, x7
 
-	ld1		{SHASH.2d}, [x3], #16
-	ld1		{HH.2d-HH4.2d}, [x3]
+.Lrestart\enc:
+	load_round_keys	x26, x25, x8
+
+	add		x0, x22, #16
+	ld1		{SHASH.2d}, [x22]
+	ld1		{HH.2d-HH4.2d}, [x0]
 
 	trn1		SHASH2.2d, SHASH.2d, HH.2d
 	trn2		T1.2d, SHASH.2d, HH.2d
@@ -452,23 +461,23 @@ SYM_FUNC_END(pmull_ghash_update_p8)
 	trn2		T1.2d, HH3.2d, HH4.2d
 	eor		HH34.16b, HH34.16b, T1.16b
 
-	ld1		{XL.2d}, [x4]
+	ld1		{XL.2d}, [x23]
 
-	cbz		x0, 3f				// tag only?
+	cbz		x19, 3f				// tag only?
 
-	ldr		w8, [x5, #12]			// load lower counter
+	ldr		w8, [x24, #12]			// load lower counter
 CPU_LE(	rev		w8, w8		)
 
 0:	mov		w9, #4				// max blocks per round
-	add		x10, x0, #0xf
+	add		x10, x19, #0xf
 	lsr		x10, x10, #4			// remaining blocks
 
-	subs		x0, x0, #64
+	subs		x19, x19, #64
 	csel		w9, w10, w9, mi
 	add		w8, w8, w9
 
 	bmi		1f
-	ld1		{INP0.16b-INP3.16b}, [x2], #64
+	ld1		{INP0.16b-INP3.16b}, [x21], #64
 	.subsection	1
 	/*
 	 * Populate the four input registers right to left with up to 63 bytes
@@ -487,30 +496,30 @@ CPU_LE(	rev		w8, w8		)
 	 * input size is < 16 bytes)
 	 */
 1:	mov		x15, #16
-	ands		x19, x0, #0xf
-	csel		x19, x19, x15, ne
+	ands		x0, x19, #0xf
+	csel		x0, x0, x15, ne
 	adr_l		x17, .Lpermute_table + 16
 
-	sub		x11, x15, x19
+	sub		x11, x15, x0
 	add		x12, x17, x11
 	sub		x17, x17, x11
 	ld1		{T1.16b}, [x12]
-	sub		x10, x1, x11
-	sub		x11, x2, x11
+	sub		x10, x20, x11
+	sub		x11, x21, x11
 
-	cmp		x0, #-16
+	cmp		x19, #-16
 	csel		x14, x15, xzr, gt
-	cmp		x0, #-32
+	cmp		x19, #-32
 	csel		x15, x15, xzr, gt
-	cmp		x0, #-48
-	csel		x16, x19, xzr, gt
-	csel		x1, x1, x10, gt
-	csel		x2, x2, x11, gt
-
-	ld1		{INP0.16b}, [x2], x14
-	ld1		{INP1.16b}, [x2], x15
-	ld1		{INP2.16b}, [x2], x16
-	ld1		{INP3.16b}, [x2]
+	cmp		x19, #-48
+	csel		x16, x0, xzr, gt
+	csel		x20, x20, x10, gt
+	csel		x21, x21, x11, gt
+
+	ld1		{INP0.16b}, [x21], x14
+	ld1		{INP1.16b}, [x21], x15
+	ld1		{INP2.16b}, [x21], x16
+	ld1		{INP3.16b}, [x21]
 	tbl		INP3.16b, {INP3.16b}, T1.16b
 	b		2f
 	.previous
@@ -521,14 +530,24 @@ CPU_LE(	rev		w8, w8		)
 
 	bl		pmull_gcm_enc_4x
 
-	tbnz		x0, #63, 6f
-	st1		{INP0.16b-INP3.16b}, [x1], #64
+	tbnz		x19, #63, 6f
+	st1		{INP0.16b-INP3.16b}, [x20], #64
 	.if		\enc == 1
 	bl		pmull_gcm_ghash_4x
 	.endif
-	bne		0b
 
-3:	ldp		x19, x10, [sp, #24]
+	beq		3f
+
+	if_will_cond_yield_neon
+CPU_LE(	rev		w8, w8		)
+	str		w8, [x24, #12]			// store lower counter
+	st1		{XL.2d}, [x23]
+	do_cond_yield_neon
+	b		.Lrestart\enc
+	endif_yield_neon
+	b		0b
+
+3:	ldr		x10, [sp, #.Lframe_local_offset]
 	cbz		x10, 5f				// output tag?
 
 	ld1		{INP3.16b}, [x10]		// load lengths[]
@@ -536,10 +555,10 @@ CPU_LE(	rev		w8, w8		)
 	bl		pmull_gcm_ghash_4x
 
 	mov		w11, #(0x1 << 24)		// BE '1U'
-	ld1		{KS0.16b}, [x5]
+	ld1		{KS0.16b}, [x24]
 	mov		KS0.s[3], w11
 
-	enc_block	KS0, x7, x6, x12
+	enc_block	KS0, x26, x25, x12
 
 	ext		XL.16b, XL.16b, XL.16b, #8
 	rev64		XL.16b, XL.16b
@@ -548,7 +567,7 @@ CPU_LE(	rev		w8, w8		)
 	.if		\enc == 1
 	st1		{XL.16b}, [x10]			// store tag
 	.else
-	ldp		x11, x12, [sp, #40]		// load tag pointer and authsize
+	ldp		x11, x12, [sp, #.Lframe_local_offset + 8] // load tag pointer and authsize
 	adr_l		x17, .Lpermute_table
 	ld1		{KS0.16b}, [x11]		// load supplied tag
 	add		x17, x17, x12
@@ -561,33 +580,33 @@ CPU_LE(	rev		w8, w8		)
 	smov		w0, v0.b[0]			// return b0
 	.endif
 
-4:	ldp		x29, x30, [sp], #32
+4:	frame_pop
 	ret
 
 5:
 CPU_LE(	rev		w8, w8		)
-	str		w8, [x5, #12]			// store lower counter
-	st1		{XL.2d}, [x4]
+	str		w8, [x24, #12]			// store lower counter
+	st1		{XL.2d}, [x23]
 	b		4b
 
 6:	ld1		{T1.16b-T2.16b}, [x17], #32	// permute vectors
-	sub		x17, x17, x19, lsl #1
+	sub		x17, x17, x0, lsl #1
 
 	cmp		w9, #1
 	beq		7f
 	.subsection	1
-7:	ld1		{INP2.16b}, [x1]
+7:	ld1		{INP2.16b}, [x20]
 	tbx		INP2.16b, {INP3.16b}, T1.16b
 	mov		INP3.16b, INP2.16b
 	b		8f
 	.previous
 
-	st1		{INP0.16b}, [x1], x14
-	st1		{INP1.16b}, [x1], x15
-	st1		{INP2.16b}, [x1], x16
+	st1		{INP0.16b}, [x20], x14
+	st1		{INP1.16b}, [x20], x15
+	st1		{INP2.16b}, [x20], x16
 	tbl		INP3.16b, {INP3.16b}, T1.16b
 	tbx		INP3.16b, {INP2.16b}, T2.16b
-8:	st1		{INP3.16b}, [x1]
+8:	st1		{INP3.16b}, [x20]
 
 	.if		\enc == 1
 	ld1		{T1.16b}, [x17]
@@ -699,7 +718,7 @@ SYM_FUNC_START_LOCAL(pmull_gcm_ghash_4x)
 SYM_FUNC_END(pmull_gcm_ghash_4x)
 
 SYM_FUNC_START_LOCAL(pmull_gcm_enc_4x)
-	ld1		{KS0.16b}, [x5]			// load upper counter
+	ld1		{KS0.16b}, [x24]		// load upper counter
 	sub		w10, w8, #4
 	sub		w11, w8, #3
 	sub		w12, w8, #2
@@ -716,13 +735,13 @@ SYM_FUNC_START_LOCAL(pmull_gcm_enc_4x)
 	ins		KS2.s[3], w12
 	ins		KS3.s[3], w13
 
-	add		x10, x6, #96			// round key pointer
+	add		x10, x25, #96			// round key pointer
 	ld1		{K6.4s-K7.4s}, [x10], #32
 	.irp		key, K0, K1, K2, K3, K4, K5
 	enc_qround	KS0, KS1, KS2, KS3, \key
 	.endr
 
-	tbnz		x7, #2, .Lnot128
+	tbnz		x26, #2, .Lnot128
 	.subsection	1
 .Lnot128:
 	ld1		{K8.4s-K9.4s}, [x10], #32
@@ -733,7 +752,7 @@ SYM_FUNC_START_LOCAL(pmull_gcm_enc_4x)
 	.irp		key, K8, K9
 	enc_qround	KS0, KS1, KS2, KS3, \key
 	.endr
-	tbz		x7, #1, .Lout192
+	tbz		x26, #1, .Lout192
 	b		.Lout256
 	.previous
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 4/5] arm64: fpsimd: run kernel mode NEON with softirqs disabled
  2020-12-18 17:01 ` Ard Biesheuvel
@ 2020-12-18 17:01   ` Ard Biesheuvel
  -1 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2020-12-18 17:01 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, linux-kernel, Ard Biesheuvel, Dave Martin,
	Mark Brown, Herbert Xu, Eric Biggers, Will Deacon,
	Catalin Marinas, Thomas Gleixner, Peter Zijlstra,
	Sebastian Andrzej Siewior, Ingo Molnar

Kernel mode NEON can be used in task or softirq context, but only in
a non-nesting manner, i.e., softirq context is only permitted if the
interrupt was not taken at a point where the kernel was using the NEON
in task context.

This means all users of kernel mode NEON have to be aware of this
limitation, and either need to provide scalar fallbacks that may be much
slower (up to 20x for AES instructions) and potentially less safe, or
use an asynchronous interface that defers processing to a later time
when the NEON is guaranteed to be available.

Given that grabbing and releasing the NEON is cheap, we can relax this
restriction by increasing the granularity of kernel mode NEON code and
always disabling softirq processing while the NEON is being used in task
context.
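
For illustration, the usage pattern that this enables for NEON code that may
be called from both task and softirq context (the crypt helper below is made
up):

	void example_neon_helper(u8 *dst, const u8 *src, int len)
	{
		/*
		 * No scalar fallback needed: a task context user keeps
		 * softirqs disabled while it owns the NEON, so a softirq
		 * can never find the NEON partially in use.
		 */
		kernel_neon_begin();
		example_crypt_neon(dst, src, len);	/* hypothetical */
		kernel_neon_end();
	}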

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/include/asm/assembler.h | 19 +++++++++++++------
 arch/arm64/kernel/asm-offsets.c    |  2 ++
 arch/arm64/kernel/fpsimd.c         |  4 ++--
 3 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index ddbe6bf00e33..74ce46ed55ac 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -15,6 +15,7 @@
 #include <asm-generic/export.h>
 
 #include <asm/asm-offsets.h>
+#include <asm/alternative.h>
 #include <asm/cpufeature.h>
 #include <asm/cputype.h>
 #include <asm/debug-monitors.h>
@@ -717,17 +718,23 @@ USER(\label, ic	ivau, \tmp2)			// invalidate I line PoU
 	.endm
 
 	.macro		if_will_cond_yield_neon
-#ifdef CONFIG_PREEMPTION
 	get_current_task	x0
 	ldr		x0, [x0, #TSK_TI_PREEMPT]
-	sub		x0, x0, #PREEMPT_DISABLE_OFFSET
-	cbz		x0, .Lyield_\@
+#ifdef CONFIG_PREEMPTION
+	cmp		x0, #PREEMPT_DISABLE_OFFSET
+	beq		.Lyield_\@	// yield on need_resched in task context
+#endif
+	/* never yield while serving a softirq */
+	tbnz		x0, #SOFTIRQ_SHIFT, .Lnoyield_\@
+
+	adr_l		x0, irq_stat + IRQ_CPUSTAT_SOFTIRQ_PENDING
+	this_cpu_offset	x1
+	ldr		w0, [x0, x1]
+	cbnz		w0, .Lyield_\@	// yield on pending softirq in task context
+.Lnoyield_\@:
 	/* fall through to endif_yield_neon */
 	.subsection	1
 .Lyield_\@ :
-#else
-	.section	".discard.cond_yield_neon", "ax"
-#endif
 	.endm
 
 	.macro		do_cond_yield_neon
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 7d32fc959b1a..34ef70877de4 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -93,6 +93,8 @@ int main(void)
   DEFINE(DMA_FROM_DEVICE,	DMA_FROM_DEVICE);
   BLANK();
   DEFINE(PREEMPT_DISABLE_OFFSET, PREEMPT_DISABLE_OFFSET);
+  DEFINE(SOFTIRQ_SHIFT, SOFTIRQ_SHIFT);
+  DEFINE(IRQ_CPUSTAT_SOFTIRQ_PENDING, offsetof(irq_cpustat_t, __softirq_pending));
   BLANK();
   DEFINE(CPU_BOOT_STACK,	offsetof(struct secondary_data, stack));
   DEFINE(CPU_BOOT_TASK,		offsetof(struct secondary_data, task));
diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index 062b21f30f94..823e3a8a8871 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -180,7 +180,7 @@ static void __get_cpu_fpsimd_context(void)
  */
 static void get_cpu_fpsimd_context(void)
 {
-	preempt_disable();
+	local_bh_disable();
 	__get_cpu_fpsimd_context();
 }
 
@@ -201,7 +201,7 @@ static void __put_cpu_fpsimd_context(void)
 static void put_cpu_fpsimd_context(void)
 {
 	__put_cpu_fpsimd_context();
-	preempt_enable();
+	local_bh_enable();
 }
 
 static bool have_cpu_fpsimd_context(void)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 5/5] crypto: arm64/gcm-aes-ce - remove non-SIMD fallback path
  2020-12-18 17:01 ` Ard Biesheuvel
@ 2020-12-18 17:01   ` Ard Biesheuvel
  -1 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2020-12-18 17:01 UTC (permalink / raw)
  To: linux-crypto
  Cc: linux-arm-kernel, linux-kernel, Ard Biesheuvel, Dave Martin,
	Mark Brown, Herbert Xu, Eric Biggers, Will Deacon,
	Catalin Marinas, Thomas Gleixner, Peter Zijlstra,
	Sebastian Andrzej Siewior, Ingo Molnar

Now that kernel mode SIMD is guaranteed to be available when executing
in task or softirq context, we no longer need scalar fallbacks to use
when the NEON is unavailable. So get rid of them.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/crypto/ghash-ce-glue.c | 209 +++++---------------
 1 file changed, 51 insertions(+), 158 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-glue.c b/arch/arm64/crypto/ghash-ce-glue.c
index 720cd3a58da3..15794fe21a0b 100644
--- a/arch/arm64/crypto/ghash-ce-glue.c
+++ b/arch/arm64/crypto/ghash-ce-glue.c
@@ -362,84 +362,36 @@ static int gcm_encrypt(struct aead_request *req)
 
 	err = skcipher_walk_aead_encrypt(&walk, req, false);
 
-	if (likely(crypto_simd_usable())) {
-		do {
-			const u8 *src = walk.src.virt.addr;
-			u8 *dst = walk.dst.virt.addr;
-			int nbytes = walk.nbytes;
-
-			tag = (u8 *)&lengths;
-
-			if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE)) {
-				src = dst = memcpy(buf + sizeof(buf) - nbytes,
-						   src, nbytes);
-			} else if (nbytes < walk.total) {
-				nbytes &= ~(AES_BLOCK_SIZE - 1);
-				tag = NULL;
-			}
-
-			kernel_neon_begin();
-			pmull_gcm_encrypt(nbytes, dst, src, ctx->ghash_key.h,
-					  dg, iv, ctx->aes_key.key_enc, nrounds,
-					  tag);
-			kernel_neon_end();
-
-			if (unlikely(!nbytes))
-				break;
-
-			if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE))
-				memcpy(walk.dst.virt.addr,
-				       buf + sizeof(buf) - nbytes, nbytes);
-
-			err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
-		} while (walk.nbytes);
-	} else {
-		while (walk.nbytes >= AES_BLOCK_SIZE) {
-			int blocks = walk.nbytes / AES_BLOCK_SIZE;
-			const u8 *src = walk.src.virt.addr;
-			u8 *dst = walk.dst.virt.addr;
-			int remaining = blocks;
-
-			do {
-				aes_encrypt(&ctx->aes_key, buf, iv);
-				crypto_xor_cpy(dst, src, buf, AES_BLOCK_SIZE);
-				crypto_inc(iv, AES_BLOCK_SIZE);
-
-				dst += AES_BLOCK_SIZE;
-				src += AES_BLOCK_SIZE;
-			} while (--remaining > 0);
-
-			ghash_do_update(blocks, dg, walk.dst.virt.addr,
-					&ctx->ghash_key, NULL);
-
-			err = skcipher_walk_done(&walk,
-						 walk.nbytes % AES_BLOCK_SIZE);
-		}
-
-		/* handle the tail */
-		if (walk.nbytes) {
-			aes_encrypt(&ctx->aes_key, buf, iv);
+	do {
+		const u8 *src = walk.src.virt.addr;
+		u8 *dst = walk.dst.virt.addr;
+		int nbytes = walk.nbytes;
 
-			crypto_xor_cpy(walk.dst.virt.addr, walk.src.virt.addr,
-				       buf, walk.nbytes);
+		tag = (u8 *)&lengths;
 
-			memcpy(buf, walk.dst.virt.addr, walk.nbytes);
-			memset(buf + walk.nbytes, 0, sizeof(buf) - walk.nbytes);
+		if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE)) {
+			src = dst = memcpy(buf + sizeof(buf) - nbytes,
+					   src, nbytes);
+		} else if (nbytes < walk.total) {
+			nbytes &= ~(AES_BLOCK_SIZE - 1);
+			tag = NULL;
 		}
 
-		tag = (u8 *)&lengths;
-		ghash_do_update(1, dg, tag, &ctx->ghash_key,
-				walk.nbytes ? buf : NULL);
+		kernel_neon_begin();
+		pmull_gcm_encrypt(nbytes, dst, src, ctx->ghash_key.h,
+				  dg, iv, ctx->aes_key.key_enc, nrounds,
+				  tag);
+		kernel_neon_end();
 
-		if (walk.nbytes)
-			err = skcipher_walk_done(&walk, 0);
+		if (unlikely(!nbytes))
+			break;
 
-		put_unaligned_be64(dg[1], tag);
-		put_unaligned_be64(dg[0], tag + 8);
-		put_unaligned_be32(1, iv + GCM_IV_SIZE);
-		aes_encrypt(&ctx->aes_key, iv, iv);
-		crypto_xor(tag, iv, AES_BLOCK_SIZE);
-	}
+		if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE))
+			memcpy(walk.dst.virt.addr,
+			       buf + sizeof(buf) - nbytes, nbytes);
+
+		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+	} while (walk.nbytes);
 
 	if (err)
 		return err;
@@ -464,6 +416,7 @@ static int gcm_decrypt(struct aead_request *req)
 	u64 dg[2] = {};
 	be128 lengths;
 	u8 *tag;
+	int ret;
 	int err;
 
 	lengths.a = cpu_to_be64(req->assoclen * 8);
@@ -481,101 +434,41 @@ static int gcm_decrypt(struct aead_request *req)
 
 	err = skcipher_walk_aead_decrypt(&walk, req, false);
 
-	if (likely(crypto_simd_usable())) {
-		int ret;
-
-		do {
-			const u8 *src = walk.src.virt.addr;
-			u8 *dst = walk.dst.virt.addr;
-			int nbytes = walk.nbytes;
-
-			tag = (u8 *)&lengths;
-
-			if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE)) {
-				src = dst = memcpy(buf + sizeof(buf) - nbytes,
-						   src, nbytes);
-			} else if (nbytes < walk.total) {
-				nbytes &= ~(AES_BLOCK_SIZE - 1);
-				tag = NULL;
-			}
-
-			kernel_neon_begin();
-			ret = pmull_gcm_decrypt(nbytes, dst, src,
-						ctx->ghash_key.h,
-						dg, iv, ctx->aes_key.key_enc,
-						nrounds, tag, otag, authsize);
-			kernel_neon_end();
-
-			if (unlikely(!nbytes))
-				break;
-
-			if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE))
-				memcpy(walk.dst.virt.addr,
-				       buf + sizeof(buf) - nbytes, nbytes);
-
-			err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
-		} while (walk.nbytes);
-
-		if (err)
-			return err;
-		if (ret)
-			return -EBADMSG;
-	} else {
-		while (walk.nbytes >= AES_BLOCK_SIZE) {
-			int blocks = walk.nbytes / AES_BLOCK_SIZE;
-			const u8 *src = walk.src.virt.addr;
-			u8 *dst = walk.dst.virt.addr;
-
-			ghash_do_update(blocks, dg, walk.src.virt.addr,
-					&ctx->ghash_key, NULL);
-
-			do {
-				aes_encrypt(&ctx->aes_key, buf, iv);
-				crypto_xor_cpy(dst, src, buf, AES_BLOCK_SIZE);
-				crypto_inc(iv, AES_BLOCK_SIZE);
-
-				dst += AES_BLOCK_SIZE;
-				src += AES_BLOCK_SIZE;
-			} while (--blocks > 0);
+	do {
+		const u8 *src = walk.src.virt.addr;
+		u8 *dst = walk.dst.virt.addr;
+		int nbytes = walk.nbytes;
 
-			err = skcipher_walk_done(&walk,
-						 walk.nbytes % AES_BLOCK_SIZE);
-		}
+		tag = (u8 *)&lengths;
 
-		/* handle the tail */
-		if (walk.nbytes) {
-			memcpy(buf, walk.src.virt.addr, walk.nbytes);
-			memset(buf + walk.nbytes, 0, sizeof(buf) - walk.nbytes);
+		if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE)) {
+			src = dst = memcpy(buf + sizeof(buf) - nbytes,
+					   src, nbytes);
+		} else if (nbytes < walk.total) {
+			nbytes &= ~(AES_BLOCK_SIZE - 1);
+			tag = NULL;
 		}
 
-		tag = (u8 *)&lengths;
-		ghash_do_update(1, dg, tag, &ctx->ghash_key,
-				walk.nbytes ? buf : NULL);
-
-		if (walk.nbytes) {
-			aes_encrypt(&ctx->aes_key, buf, iv);
+		kernel_neon_begin();
+		ret = pmull_gcm_decrypt(nbytes, dst, src, ctx->ghash_key.h,
+					dg, iv, ctx->aes_key.key_enc,
+					nrounds, tag, otag, authsize);
+		kernel_neon_end();
 
-			crypto_xor_cpy(walk.dst.virt.addr, walk.src.virt.addr,
-				       buf, walk.nbytes);
+		if (unlikely(!nbytes))
+			break;
 
-			err = skcipher_walk_done(&walk, 0);
-		}
+		if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE))
+			memcpy(walk.dst.virt.addr,
+			       buf + sizeof(buf) - nbytes, nbytes);
 
-		if (err)
-			return err;
+		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+	} while (walk.nbytes);
 
-		put_unaligned_be64(dg[1], tag);
-		put_unaligned_be64(dg[0], tag + 8);
-		put_unaligned_be32(1, iv + GCM_IV_SIZE);
-		aes_encrypt(&ctx->aes_key, iv, iv);
-		crypto_xor(tag, iv, AES_BLOCK_SIZE);
+	if (err)
+		return err;
 
-		if (crypto_memneq(tag, otag, authsize)) {
-			memzero_explicit(tag, AES_BLOCK_SIZE);
-			return -EBADMSG;
-		}
-	}
-	return 0;
+	return ret ? -EBADMSG : 0;
 }
 
 static struct aead_alg gcm_aes_alg = {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled
  2020-12-18 17:01 ` Ard Biesheuvel
@ 2020-12-19  2:04   ` Herbert Xu
  -1 siblings, 0 replies; 28+ messages in thread
From: Herbert Xu @ 2020-12-19  2:04 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-crypto, linux-arm-kernel, linux-kernel, Dave Martin,
	Mark Brown, Eric Biggers, Will Deacon, Catalin Marinas,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Ingo Molnar

On Fri, Dec 18, 2020 at 06:01:01PM +0100, Ard Biesheuvel wrote:
>
> Questions:
> - what did I miss or break horribly?
> - does any of this matter for RT? AIUI, RT runs softirqs from a dedicated
>   kthread, so I don't think it cares.
> - what would be a reasonable upper bound to keep softirqs disabled? I suppose
>   100s of cycles or less is overkill, but I'm not sure how to derive a better
>   answer.
> - could we do the same on x86, now that kernel_fpu_begin/end is no longer
>   expensive?

If this approach works, not only would it allow us to support the
synchronous users better, it would also allow us to remove loads
of cruft in the Crypto API that exists solely to support these SIMD
code paths.

So I eagerly await the assessment of the scheduler/RT folks on this
approach.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled
  2020-12-19  2:04   ` Herbert Xu
@ 2021-01-14  8:22     ` Ard Biesheuvel
  -1 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2021-01-14  8:22 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Linux Crypto Mailing List, Linux ARM, Linux Kernel Mailing List,
	Dave Martin, Mark Brown, Eric Biggers, Will Deacon,
	Catalin Marinas, Thomas Gleixner, Peter Zijlstra,
	Sebastian Andrzej Siewior, Ingo Molnar

On Sat, 19 Dec 2020 at 03:05, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Fri, Dec 18, 2020 at 06:01:01PM +0100, Ard Biesheuvel wrote:
> >
> > Questions:
> > - what did I miss or break horribly?
> > - does any of this matter for RT? AIUI, RT runs softirqs from a dedicated
> >   kthread, so I don't think it cares.
> > - what would be a reasonable upper bound to keep softirqs disabled? I suppose
> >   100s of cycles or less is overkill, but I'm not sure how to derive a better
> >   answer.
> > - could we do the same on x86, now that kernel_fpu_begin/end is no longer
> >   expensive?
>
> If this approach works not only would it allow us to support the
> synchronous users better, it would also allow us to remove loads
> of cruft in the Crypto API that exist solely to support these SIMD
> code paths.
>
> So I eagerly await the assessment of the scheduler/RT folks on this
> approach.
>

Any insights here? Is there a ballpark upper bound for the duration of
a softirq disabled section? Other reasons why dis/enabling softirq
handling is a bad idea?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 4/5] arm64: fpsimd: run kernel mode NEON with softirqs disabled
  2020-12-18 17:01   ` Ard Biesheuvel
@ 2021-01-19 16:00     ` Dave Martin
  -1 siblings, 0 replies; 28+ messages in thread
From: Dave Martin @ 2021-01-19 16:00 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-crypto, Ingo Molnar, Herbert Xu, Peter Zijlstra,
	Catalin Marinas, Sebastian Andrzej Siewior, linux-kernel,
	Eric Biggers, Mark Brown, Thomas Gleixner, Will Deacon,
	linux-arm-kernel

On Fri, Dec 18, 2020 at 06:01:05PM +0100, Ard Biesheuvel wrote:
> Kernel mode NEON can be used in task or softirq context, but only in
> a non-nesting manner, i.e., softirq context is only permitted if the
> interrupt was not taken at a point where the kernel was using the NEON
> in task context.
> 
> This means all users of kernel mode NEON have to be aware of this
> limitation, and either need to provide scalar fallbacks that may be much
> slower (up to 20x for AES instructions) and potentially less safe, or
> use an asynchronous interface that defers processing to a later time
> when the NEON is guaranteed to be available.
> 
> Given that grabbing and releasing the NEON is cheap, we can relax this
> restriction, by increasing the granularity of kernel mode NEON code, and
> always disabling softirq processing while the NEON is being used in task
> context.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

Sorry for the slow reply on this...  it looks reasonable, but I have a
few comments below.

> ---
>  arch/arm64/include/asm/assembler.h | 19 +++++++++++++------
>  arch/arm64/kernel/asm-offsets.c    |  2 ++
>  arch/arm64/kernel/fpsimd.c         |  4 ++--
>  3 files changed, 17 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> index ddbe6bf00e33..74ce46ed55ac 100644
> --- a/arch/arm64/include/asm/assembler.h
> +++ b/arch/arm64/include/asm/assembler.h
> @@ -15,6 +15,7 @@
>  #include <asm-generic/export.h>
>  
>  #include <asm/asm-offsets.h>
> +#include <asm/alternative.h>
>  #include <asm/cpufeature.h>
>  #include <asm/cputype.h>
>  #include <asm/debug-monitors.h>
> @@ -717,17 +718,23 @@ USER(\label, ic	ivau, \tmp2)			// invalidate I line PoU
>  	.endm
>  
>  	.macro		if_will_cond_yield_neon
> -#ifdef CONFIG_PREEMPTION
>  	get_current_task	x0
>  	ldr		x0, [x0, #TSK_TI_PREEMPT]
> -	sub		x0, x0, #PREEMPT_DISABLE_OFFSET
> -	cbz		x0, .Lyield_\@
> +#ifdef CONFIG_PREEMPTION
> +	cmp		x0, #PREEMPT_DISABLE_OFFSET
> +	beq		.Lyield_\@	// yield on need_resched in task context
> +#endif
> +	/* never yield while serving a softirq */
> +	tbnz		x0, #SOFTIRQ_SHIFT, .Lnoyield_\@

Can you explain the rationale here?

Using if_will_cond_yield_neon suggests the algo thinks it may run for
too long to stall preemption until completion, but we happily stall
preemption _and_ softirqs here.

Is it actually a bug to use the NEON conditional yield helpers in
softirq context?

Ideally, if processing in softirq context takes an unreasonable amount of
time, the work would be handed off to an asynchronous worker, but that
does seem to conflict rather with the purpose of this series...

> +
> +	adr_l		x0, irq_stat + IRQ_CPUSTAT_SOFTIRQ_PENDING
> +	this_cpu_offset	x1
> +	ldr		w0, [x0, x1]
> +	cbnz		w0, .Lyield_\@	// yield on pending softirq in task context
> +.Lnoyield_\@:
>  	/* fall through to endif_yield_neon */
>  	.subsection	1
>  .Lyield_\@ :
> -#else
> -	.section	".discard.cond_yield_neon", "ax"
> -#endif
>  	.endm
>  
>  	.macro		do_cond_yield_neon
> diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> index 7d32fc959b1a..34ef70877de4 100644
> --- a/arch/arm64/kernel/asm-offsets.c
> +++ b/arch/arm64/kernel/asm-offsets.c
> @@ -93,6 +93,8 @@ int main(void)
>    DEFINE(DMA_FROM_DEVICE,	DMA_FROM_DEVICE);
>    BLANK();
>    DEFINE(PREEMPT_DISABLE_OFFSET, PREEMPT_DISABLE_OFFSET);
> +  DEFINE(SOFTIRQ_SHIFT, SOFTIRQ_SHIFT);
> +  DEFINE(IRQ_CPUSTAT_SOFTIRQ_PENDING, offsetof(irq_cpustat_t, __softirq_pending));
>    BLANK();
>    DEFINE(CPU_BOOT_STACK,	offsetof(struct secondary_data, stack));
>    DEFINE(CPU_BOOT_TASK,		offsetof(struct secondary_data, task));
> diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> index 062b21f30f94..823e3a8a8871 100644
> --- a/arch/arm64/kernel/fpsimd.c
> +++ b/arch/arm64/kernel/fpsimd.c
> @@ -180,7 +180,7 @@ static void __get_cpu_fpsimd_context(void)
>   */
>  static void get_cpu_fpsimd_context(void)
>  {
> -	preempt_disable();
> +	local_bh_disable();
>  	__get_cpu_fpsimd_context();
>  }
>  
> @@ -201,7 +201,7 @@ static void __put_cpu_fpsimd_context(void)
>  static void put_cpu_fpsimd_context(void)
>  {
>  	__put_cpu_fpsimd_context();
> -	preempt_enable();
> +	local_bh_enable();
>  }
>  
>  static bool have_cpu_fpsimd_context(void)

I was concerned about catching all the relevant preempt_disable()s, but
it had slipped my memory that Julien had factored these into one place.

I can't see off the top of my head any reason why this shouldn't work.


In theory, switching to local_bh_enable() here will add a check for
pending softirqs onto context handling fast paths.  I haven't dug into
how that works, so perhaps this is trivial on top of the preemption
check in preempt_enable().  Do you see any difference in hackbench or
similar benchmarks?

Cheers
---Dave
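
For readers trying to follow the assembly quoted above, the yield decision
the modified if_will_cond_yield_neon macro encodes corresponds roughly to
the C paraphrase below.  This is illustrative only, not code from the
series; the real check reads the combined preempt-count/need_resched word
from thread_info in assembly:

#include <linux/interrupt.h>
#include <linux/preempt.h>
#include <linux/sched.h>
#include <linux/types.h>

/* Illustrative paraphrase of the yield decision in if_will_cond_yield_neon. */
static bool should_yield_neon(void)
{
	/* Yield if a reschedule is due and our own disable is the only one. */
	if (IS_ENABLED(CONFIG_PREEMPTION) &&
	    preempt_count() == PREEMPT_DISABLE_OFFSET && need_resched())
		return true;

	/* Never yield while serving a softirq. */
	if (in_serving_softirq())
		return false;

	/* In task context, yield if softirqs are pending on this CPU. */
	return local_softirq_pending();
}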

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 4/5] arm64: fpsimd: run kernel mode NEON with softirqs disabled
  2021-01-19 16:00     ` Dave Martin
@ 2021-01-19 16:29       ` Ard Biesheuvel
  -1 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2021-01-19 16:29 UTC (permalink / raw)
  To: Dave Martin
  Cc: Linux Crypto Mailing List, Ingo Molnar, Herbert Xu,
	Peter Zijlstra, Catalin Marinas, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List, Eric Biggers, Mark Brown,
	Thomas Gleixner, Will Deacon, Linux ARM

On Tue, 19 Jan 2021 at 17:01, Dave Martin <Dave.Martin@arm.com> wrote:
>
> On Fri, Dec 18, 2020 at 06:01:05PM +0100, Ard Biesheuvel wrote:
> > Kernel mode NEON can be used in task or softirq context, but only in
> > a non-nesting manner, i.e., softirq context is only permitted if the
> > interrupt was not taken at a point where the kernel was using the NEON
> > in task context.
> >
> > This means all users of kernel mode NEON have to be aware of this
> > limitation, and either need to provide scalar fallbacks that may be much
> > slower (up to 20x for AES instructions) and potentially less safe, or
> > use an asynchronous interface that defers processing to a later time
> > when the NEON is guaranteed to be available.
> >
> > Given that grabbing and releasing the NEON is cheap, we can relax this
> > restriction, by increasing the granularity of kernel mode NEON code, and
> > always disabling softirq processing while the NEON is being used in task
> > context.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>
> Sorry for the slow reply on this...  it looks reasonable, but I have a
> few comments below.
>

No worries - thanks for taking a look.

> > ---
> >  arch/arm64/include/asm/assembler.h | 19 +++++++++++++------
> >  arch/arm64/kernel/asm-offsets.c    |  2 ++
> >  arch/arm64/kernel/fpsimd.c         |  4 ++--
> >  3 files changed, 17 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> > index ddbe6bf00e33..74ce46ed55ac 100644
> > --- a/arch/arm64/include/asm/assembler.h
> > +++ b/arch/arm64/include/asm/assembler.h
> > @@ -15,6 +15,7 @@
> >  #include <asm-generic/export.h>
> >
> >  #include <asm/asm-offsets.h>
> > +#include <asm/alternative.h>
> >  #include <asm/cpufeature.h>
> >  #include <asm/cputype.h>
> >  #include <asm/debug-monitors.h>
> > @@ -717,17 +718,23 @@ USER(\label, ic ivau, \tmp2)                    // invalidate I line PoU
> >       .endm
> >
> >       .macro          if_will_cond_yield_neon
> > -#ifdef CONFIG_PREEMPTION
> >       get_current_task        x0
> >       ldr             x0, [x0, #TSK_TI_PREEMPT]
> > -     sub             x0, x0, #PREEMPT_DISABLE_OFFSET
> > -     cbz             x0, .Lyield_\@
> > +#ifdef CONFIG_PREEMPTION
> > +     cmp             x0, #PREEMPT_DISABLE_OFFSET
> > +     beq             .Lyield_\@      // yield on need_resched in task context
> > +#endif
> > +     /* never yield while serving a softirq */
> > +     tbnz            x0, #SOFTIRQ_SHIFT, .Lnoyield_\@
>
> Can you explain the rationale here?
>
> Using if_will_cond_yield_neon suggests the algo thinks it may run for
> too long to stall preemption until completion, but we happily stall
> preemption _and_ softirqs here.
>
> Is it actually a bug to use the NEON conditional yield helpers in
> softirq context?
>

No, it is not. But calling kernel_neon_end() from softirq context will
not cause it to finish any faster, so there is really no point in
doing so.

> Ideally, if processing in softirq context takes an unreasonable amount of
> time, the work would be handed off to an asynchronous worker, but that
> does seem to conflict rather with the purpose of this series...
>

Agreed, but this is not something we can police at this level. If the
caller does an unreasonable amount of work from a softirq, no amount
of yielding is going to make a difference.

> > +
> > +     adr_l           x0, irq_stat + IRQ_CPUSTAT_SOFTIRQ_PENDING
> > +     this_cpu_offset x1
> > +     ldr             w0, [x0, x1]
> > +     cbnz            w0, .Lyield_\@  // yield on pending softirq in task context
> > +.Lnoyield_\@:
> >       /* fall through to endif_yield_neon */
> >       .subsection     1
> >  .Lyield_\@ :
> > -#else
> > -     .section        ".discard.cond_yield_neon", "ax"
> > -#endif
> >       .endm
> >
> >       .macro          do_cond_yield_neon
> > diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> > index 7d32fc959b1a..34ef70877de4 100644
> > --- a/arch/arm64/kernel/asm-offsets.c
> > +++ b/arch/arm64/kernel/asm-offsets.c
> > @@ -93,6 +93,8 @@ int main(void)
> >    DEFINE(DMA_FROM_DEVICE,    DMA_FROM_DEVICE);
> >    BLANK();
> >    DEFINE(PREEMPT_DISABLE_OFFSET, PREEMPT_DISABLE_OFFSET);
> > +  DEFINE(SOFTIRQ_SHIFT, SOFTIRQ_SHIFT);
> > +  DEFINE(IRQ_CPUSTAT_SOFTIRQ_PENDING, offsetof(irq_cpustat_t, __softirq_pending));
> >    BLANK();
> >    DEFINE(CPU_BOOT_STACK,     offsetof(struct secondary_data, stack));
> >    DEFINE(CPU_BOOT_TASK,              offsetof(struct secondary_data, task));
> > diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> > index 062b21f30f94..823e3a8a8871 100644
> > --- a/arch/arm64/kernel/fpsimd.c
> > +++ b/arch/arm64/kernel/fpsimd.c
> > @@ -180,7 +180,7 @@ static void __get_cpu_fpsimd_context(void)
> >   */
> >  static void get_cpu_fpsimd_context(void)
> >  {
> > -     preempt_disable();
> > +     local_bh_disable();
> >       __get_cpu_fpsimd_context();
> >  }
> >
> > @@ -201,7 +201,7 @@ static void __put_cpu_fpsimd_context(void)
> >  static void put_cpu_fpsimd_context(void)
> >  {
> >       __put_cpu_fpsimd_context();
> > -     preempt_enable();
> > +     local_bh_enable();
> >  }
> >
> >  static bool have_cpu_fpsimd_context(void)
>
> I was concerned about catching all the relevant preempt_disable()s, but
> it had slipped my memory that Julien had factored these into one place.
>
> I can't see off the top of my head any reason why this shouldn't work.
>

Thanks.

>
> In theory, switching to local_bh_enable() here will add a check for
> pending softirqs onto context handling fast paths.  I haven't dug into
> how that works, so perhaps this is trivial on top of the preemption
> check in preempt_enable().  Do you see any difference in hackbench or
> similar benchmarks?
>

I haven't tried, tbh. But by context handling fast paths, you mean
managing the FP/SIMD state at context switch time, right? Checking for
pending softirqs amounts to a single per-CPU load plus compare, so
that should be negligible AFAICT. Obviously, actually handling the
softirq may take additional time, but that penalty has to be taken
somewhere - I don't see how that would create extra work that we
wouldn't have to do otherwise.

I'll do some experiments with hackbench once I get back to this series.
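
Concretely, the model the series aims for is that task-context users grab
the NEON only for bounded stretches of input, so the window in which
softirqs are deferred stays short.  A hypothetical caller might look like
the sketch below; CHUNK and do_simd_crypt() are made-up names for
illustration, not from the series:

#include <asm/neon.h>
#include <linux/kernel.h>
#include <linux/sizes.h>
#include <linux/types.h>

#define CHUNK	SZ_4K		/* illustrative bound, not a value from the series */

/* Hypothetical SIMD helper standing in for the real asm routine. */
void do_simd_crypt(u8 *dst, const u8 *src, size_t len);

/* Process bounded chunks so softirqs are only deferred for one chunk at a time. */
static void crypt_all(u8 *dst, const u8 *src, size_t len)
{
	while (len) {
		size_t n = min_t(size_t, len, CHUNK);

		kernel_neon_begin();		/* defers softirq handling in task context */
		do_simd_crypt(dst, src, n);
		kernel_neon_end();		/* any pending softirqs run here */

		dst += n;
		src += n;
		len -= n;
	}
}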

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 4/5] arm64: fpsimd: run kernel mode NEON with softirqs disabled
  2021-01-19 16:29       ` Ard Biesheuvel
@ 2021-01-20 15:44         ` Dave Martin
  -1 siblings, 0 replies; 28+ messages in thread
From: Dave Martin @ 2021-01-20 15:44 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Herbert Xu, Peter Zijlstra, Catalin Marinas,
	Sebastian Andrzej Siewior, Linux Kernel Mailing List,
	Eric Biggers, Mark Brown, Linux Crypto Mailing List,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Linux ARM

On Tue, Jan 19, 2021 at 05:29:05PM +0100, Ard Biesheuvel wrote:
> On Tue, 19 Jan 2021 at 17:01, Dave Martin <Dave.Martin@arm.com> wrote:
> >
> > On Fri, Dec 18, 2020 at 06:01:05PM +0100, Ard Biesheuvel wrote:
> > > Kernel mode NEON can be used in task or softirq context, but only in
> > > a non-nesting manner, i.e., softirq context is only permitted if the
> > > interrupt was not taken at a point where the kernel was using the NEON
> > > in task context.
> > >
> > > This means all users of kernel mode NEON have to be aware of this
> > > limitation, and either need to provide scalar fallbacks that may be much
> > > slower (up to 20x for AES instructions) and potentially less safe, or
> > > use an asynchronous interface that defers processing to a later time
> > > when the NEON is guaranteed to be available.
> > >
> > > Given that grabbing and releasing the NEON is cheap, we can relax this
> > > restriction, by increasing the granularity of kernel mode NEON code, and
> > > always disabling softirq processing while the NEON is being used in task
> > > context.
> > >
> > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> >
> > Sorry for the slow reply on this...  it looks reasonable, but I have a
> > few comments below.
> >
> 
> No worries - thanks for taking a look.
> 
> > > ---
> > >  arch/arm64/include/asm/assembler.h | 19 +++++++++++++------
> > >  arch/arm64/kernel/asm-offsets.c    |  2 ++
> > >  arch/arm64/kernel/fpsimd.c         |  4 ++--
> > >  3 files changed, 17 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> > > index ddbe6bf00e33..74ce46ed55ac 100644
> > > --- a/arch/arm64/include/asm/assembler.h
> > > +++ b/arch/arm64/include/asm/assembler.h
> > > @@ -15,6 +15,7 @@
> > >  #include <asm-generic/export.h>
> > >
> > >  #include <asm/asm-offsets.h>
> > > +#include <asm/alternative.h>
> > >  #include <asm/cpufeature.h>
> > >  #include <asm/cputype.h>
> > >  #include <asm/debug-monitors.h>
> > > @@ -717,17 +718,23 @@ USER(\label, ic ivau, \tmp2)                    // invalidate I line PoU
> > >       .endm
> > >
> > >       .macro          if_will_cond_yield_neon
> > > -#ifdef CONFIG_PREEMPTION
> > >       get_current_task        x0
> > >       ldr             x0, [x0, #TSK_TI_PREEMPT]
> > > -     sub             x0, x0, #PREEMPT_DISABLE_OFFSET
> > > -     cbz             x0, .Lyield_\@
> > > +#ifdef CONFIG_PREEMPTION
> > > +     cmp             x0, #PREEMPT_DISABLE_OFFSET
> > > +     beq             .Lyield_\@      // yield on need_resched in task context
> > > +#endif
> > > +     /* never yield while serving a softirq */
> > > +     tbnz            x0, #SOFTIRQ_SHIFT, .Lnoyield_\@
> >
> > Can you explain the rationale here?
> >
> > Using if_will_cond_yield_neon suggests the algo thinks it may run for
> > too long to stall preemption until completion, but we happily stall
> > preemption _and_ softirqs here.
> >
> > Is it actually a bug to use the NEON conditional yield helpers in
> > softirq context?
> >
> 
> No, it is not. But calling kernel_neon_end() from softirq context will
> not cause it to finish any faster, so there is really no point in
> doing so.
> 
> > Ideally, if processing in softirq context takes an unreasonable amount of
> > time, the work would be handed off to an asynchronous worker, but that
> > does seem to conflict rather with the purpose of this series...
> >
> 
> Agreed, but this is not something we can police at this level. If the
> caller does an unreasonable amount of work from a softirq, no amount
> of yielding is going to make a difference.

Ack, just wanted to make sure I wasn't missing something.

Anyone writing softirq code can starve preemption, so I agree that we
should trust people to know what they're doing.


> > > +
> > > +     adr_l           x0, irq_stat + IRQ_CPUSTAT_SOFTIRQ_PENDING
> > > +     this_cpu_offset x1
> > > +     ldr             w0, [x0, x1]
> > > +     cbnz            w0, .Lyield_\@  // yield on pending softirq in task context
> > > +.Lnoyield_\@:
> > >       /* fall through to endif_yield_neon */
> > >       .subsection     1
> > >  .Lyield_\@ :
> > > -#else
> > > -     .section        ".discard.cond_yield_neon", "ax"
> > > -#endif
> > >       .endm
> > >
> > >       .macro          do_cond_yield_neon
> > > diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> > > index 7d32fc959b1a..34ef70877de4 100644
> > > --- a/arch/arm64/kernel/asm-offsets.c
> > > +++ b/arch/arm64/kernel/asm-offsets.c
> > > @@ -93,6 +93,8 @@ int main(void)
> > >    DEFINE(DMA_FROM_DEVICE,    DMA_FROM_DEVICE);
> > >    BLANK();
> > >    DEFINE(PREEMPT_DISABLE_OFFSET, PREEMPT_DISABLE_OFFSET);
> > > +  DEFINE(SOFTIRQ_SHIFT, SOFTIRQ_SHIFT);
> > > +  DEFINE(IRQ_CPUSTAT_SOFTIRQ_PENDING, offsetof(irq_cpustat_t, __softirq_pending));
> > >    BLANK();
> > >    DEFINE(CPU_BOOT_STACK,     offsetof(struct secondary_data, stack));
> > >    DEFINE(CPU_BOOT_TASK,              offsetof(struct secondary_data, task));
> > > diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> > > index 062b21f30f94..823e3a8a8871 100644
> > > --- a/arch/arm64/kernel/fpsimd.c
> > > +++ b/arch/arm64/kernel/fpsimd.c
> > > @@ -180,7 +180,7 @@ static void __get_cpu_fpsimd_context(void)
> > >   */
> > >  static void get_cpu_fpsimd_context(void)
> > >  {
> > > -     preempt_disable();
> > > +     local_bh_disable();
> > >       __get_cpu_fpsimd_context();
> > >  }
> > >
> > > @@ -201,7 +201,7 @@ static void __put_cpu_fpsimd_context(void)
> > >  static void put_cpu_fpsimd_context(void)
> > >  {
> > >       __put_cpu_fpsimd_context();
> > > -     preempt_enable();
> > > +     local_bh_enable();
> > >  }
> > >
> > >  static bool have_cpu_fpsimd_context(void)
> >
> > I was concerned about catching all the relevant preempt_disable()s, but
> > it had slipped my memory that Julien had factored these into one place.
> >
> > I can't see off the top of my head any reason why this shouldn't work.
> >
> 
> Thanks.
> 
> >
> > In theory, switching to local_bh_enable() here will add a check for
> > pending softirqs onto context handling fast paths.  I haven't dug into
> > how that works, so perhaps this is trivial on top of the preemption
> > check in preempt_enable().  Do you see any difference in hackbench or
> > similar benchmarks?
> >
> 
> I haven't tried, tbh. But by context handling fast paths, you mean
> managing the FP/SIMD state at context switch time, right? Checking for
> pending softirqs amounts to a single per-CPU load plus compare, so
> that should be negligible AFAICT. Obviously, actually handling the

Yes.  I've tended to assume, rather than prove, that this kind of thing
is negligible -- so I confess I had not attempted to measure these
effects when writing the original code.

> softirq may take additional time, but that penalty has to be taken
> somewhere - I don't see how that would create extra work that we
> wouldn't have to do otherwise.
> 
> I'll do some experiments with hackbench once I get back to this series.

That sounds fine.

Probably you won't find a significant difference anyway.

Cheers
---Dave
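
For reference, the extra work local_bh_enable() does compared with
preempt_enable() on the fast path is essentially just that pending-softirq
test.  Roughly, as an illustrative paraphrase of the non-RT code in
kernel/softirq.c rather than the literal implementation:

#include <linux/bottom_half.h>
#include <linux/compiler.h>
#include <linux/interrupt.h>
#include <linux/preempt.h>

/*
 * Illustrative paraphrase of the tail of __local_bh_enable_ip(): drop the
 * softirq-disable count, run any softirqs raised while bottom halves were
 * off, then perform the same resched check that preempt_enable() does.
 */
static inline void local_bh_enable_sketch(void)
{
	preempt_count_sub(SOFTIRQ_DISABLE_OFFSET - 1);

	if (unlikely(!in_interrupt() && local_softirq_pending()))
		do_softirq();			/* pending softirqs run here */

	preempt_count_dec();
	preempt_check_resched();
}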

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 4/5] arm64: fpsimd: run kernel mode NEON with softirqs disabled
  2021-01-20 15:44         ` Dave Martin
@ 2021-02-15 18:30           ` Ard Biesheuvel
  -1 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2021-02-15 18:30 UTC (permalink / raw)
  To: Dave Martin
  Cc: Herbert Xu, Peter Zijlstra, Catalin Marinas,
	Sebastian Andrzej Siewior, Linux Kernel Mailing List,
	Eric Biggers, Mark Brown, Linux Crypto Mailing List,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Linux ARM

On Wed, 20 Jan 2021 at 16:44, Dave Martin <Dave.Martin@arm.com> wrote:
>
> On Tue, Jan 19, 2021 at 05:29:05PM +0100, Ard Biesheuvel wrote:
> > On Tue, 19 Jan 2021 at 17:01, Dave Martin <Dave.Martin@arm.com> wrote:
> > >
> > > On Fri, Dec 18, 2020 at 06:01:05PM +0100, Ard Biesheuvel wrote:
> > > > Kernel mode NEON can be used in task or softirq context, but only in
> > > > a non-nesting manner, i.e., softirq context is only permitted if the
> > > > interrupt was not taken at a point where the kernel was using the NEON
> > > > in task context.
> > > >
> > > > This means all users of kernel mode NEON have to be aware of this
> > > > limitation, and either need to provide scalar fallbacks that may be much
> > > > slower (up to 20x for AES instructions) and potentially less safe, or
> > > > use an asynchronous interface that defers processing to a later time
> > > > when the NEON is guaranteed to be available.
> > > >
> > > > Given that grabbing and releasing the NEON is cheap, we can relax this
> > > > restriction, by increasing the granularity of kernel mode NEON code, and
> > > > always disabling softirq processing while the NEON is being used in task
> > > > context.
> > > >
> > > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > >
> > > Sorry for the slow reply on this...  it looks reasonable, but I have a
> > > few comments below.
> > >
> >
> > No worries - thanks for taking a look.
> >
> > > > ---
> > > >  arch/arm64/include/asm/assembler.h | 19 +++++++++++++------
> > > >  arch/arm64/kernel/asm-offsets.c    |  2 ++
> > > >  arch/arm64/kernel/fpsimd.c         |  4 ++--
> > > >  3 files changed, 17 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> > > > index ddbe6bf00e33..74ce46ed55ac 100644
> > > > --- a/arch/arm64/include/asm/assembler.h
> > > > +++ b/arch/arm64/include/asm/assembler.h
> > > > @@ -15,6 +15,7 @@
> > > >  #include <asm-generic/export.h>
> > > >
> > > >  #include <asm/asm-offsets.h>
> > > > +#include <asm/alternative.h>
> > > >  #include <asm/cpufeature.h>
> > > >  #include <asm/cputype.h>
> > > >  #include <asm/debug-monitors.h>
> > > > @@ -717,17 +718,23 @@ USER(\label, ic ivau, \tmp2)                    // invalidate I line PoU
> > > >       .endm
> > > >
> > > >       .macro          if_will_cond_yield_neon
> > > > -#ifdef CONFIG_PREEMPTION
> > > >       get_current_task        x0
> > > >       ldr             x0, [x0, #TSK_TI_PREEMPT]
> > > > -     sub             x0, x0, #PREEMPT_DISABLE_OFFSET
> > > > -     cbz             x0, .Lyield_\@
> > > > +#ifdef CONFIG_PREEMPTION
> > > > +     cmp             x0, #PREEMPT_DISABLE_OFFSET
> > > > +     beq             .Lyield_\@      // yield on need_resched in task context
> > > > +#endif
> > > > +     /* never yield while serving a softirq */
> > > > +     tbnz            x0, #SOFTIRQ_SHIFT, .Lnoyield_\@
> > >
> > > Can you explain the rationale here?
> > >
> > > Using if_will_cond_yield_neon suggests the algo thinks it may run for
> > > too long to stall preemption until completion, but we happily stall
> > > preemption _and_ softirqs here.
> > >
> > > Is it actually a bug to use the NEON conditional yield helpers in
> > > softirq context?
> > >
> >
> > No, it is not. But calling kernel_neon_end() from softirq context will
> > not cause it to finish any faster, so there is really no point in
> > doing so.
> >
> > > Ideally, if processing in softirq context takes an unreasonable amount of
> > > time, the work would be handed off to an asynchronous worker, but that
> > > does seem to conflict rather with the purpose of this series...
> > >
> >
> > Agreed, but this is not something we can police at this level. If the
> > caller does an unreasonable amount of work from a softirq, no amount
> > of yielding is going to make a difference.
>
> Ack, just wanted to make sure I wasn't missing something.
>
> Anyone writing softirq code can starve preemption, so I agree that we
> should trust people to know what they're doing.
>
>
> > > > +
> > > > +     adr_l           x0, irq_stat + IRQ_CPUSTAT_SOFTIRQ_PENDING
> > > > +     this_cpu_offset x1
> > > > +     ldr             w0, [x0, x1]
> > > > +     cbnz            w0, .Lyield_\@  // yield on pending softirq in task context
> > > > +.Lnoyield_\@:
> > > >       /* fall through to endif_yield_neon */
> > > >       .subsection     1
> > > >  .Lyield_\@ :
> > > > -#else
> > > > -     .section        ".discard.cond_yield_neon", "ax"
> > > > -#endif
> > > >       .endm
> > > >
> > > >       .macro          do_cond_yield_neon
> > > > diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> > > > index 7d32fc959b1a..34ef70877de4 100644
> > > > --- a/arch/arm64/kernel/asm-offsets.c
> > > > +++ b/arch/arm64/kernel/asm-offsets.c
> > > > @@ -93,6 +93,8 @@ int main(void)
> > > >    DEFINE(DMA_FROM_DEVICE,    DMA_FROM_DEVICE);
> > > >    BLANK();
> > > >    DEFINE(PREEMPT_DISABLE_OFFSET, PREEMPT_DISABLE_OFFSET);
> > > > +  DEFINE(SOFTIRQ_SHIFT, SOFTIRQ_SHIFT);
> > > > +  DEFINE(IRQ_CPUSTAT_SOFTIRQ_PENDING, offsetof(irq_cpustat_t, __softirq_pending));
> > > >    BLANK();
> > > >    DEFINE(CPU_BOOT_STACK,     offsetof(struct secondary_data, stack));
> > > >    DEFINE(CPU_BOOT_TASK,              offsetof(struct secondary_data, task));
> > > > diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> > > > index 062b21f30f94..823e3a8a8871 100644
> > > > --- a/arch/arm64/kernel/fpsimd.c
> > > > +++ b/arch/arm64/kernel/fpsimd.c
> > > > @@ -180,7 +180,7 @@ static void __get_cpu_fpsimd_context(void)
> > > >   */
> > > >  static void get_cpu_fpsimd_context(void)
> > > >  {
> > > > -     preempt_disable();
> > > > +     local_bh_disable();
> > > >       __get_cpu_fpsimd_context();
> > > >  }
> > > >
> > > > @@ -201,7 +201,7 @@ static void __put_cpu_fpsimd_context(void)
> > > >  static void put_cpu_fpsimd_context(void)
> > > >  {
> > > >       __put_cpu_fpsimd_context();
> > > > -     preempt_enable();
> > > > +     local_bh_enable();
> > > >  }
> > > >
> > > >  static bool have_cpu_fpsimd_context(void)
> > >
> > > I was concerned about catching all the relevant preempt_disable()s, but
> > > it had slipped my memory that Julien had factored these into one place.
> > >
> > > I can't see off the top of my head any reason why this shouldn't work.
> > >
> >
> > Thanks.
> >
> > >
> > > In theory, switching to local_bh_enable() here will add a check for
> > > pending softirqs onto context handling fast paths.  I haven't dug into
> > > how that works, so perhaps this is trivial on top of the preemption
> > > check in preempt_enable().  Do you see any difference in hackbench or
> > > similar benchmarks?
> > >
> >
> > I haven't tried, tbh. But by context handling fast paths, you mean
> > managing the FP/SIMD state at context switch time, right? Checking for
> > pending softirqs amounts to a single per-CPU load plus compare, so
> > that should be negligible AFAICT. Obviously, actually handling the
>
> Yes.  I've tended to assume, rather than prove, that this kind of thing
> is negligible -- so I confess I had not attempted to measure these
> effects when writing the original code.
>
> > softirq may take additional time, but that penalty has to be taken
> > somewhere - I don't see how that would create extra work that we
> > wouldn't have to do otherwise.
> >
> > I'll do some experiments with hackbench once I get back to this series.
>
> That sounds fine.
>
> Probably you won't find a significant difference anyway.
>

Finally got around to trying this: as expected, I don't see any
difference at all between the two versions (tested on TX2)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 4/5] arm64: fpsimd: run kernel mode NEON with softirqs disabled
@ 2021-02-15 18:30           ` Ard Biesheuvel
  0 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2021-02-15 18:30 UTC (permalink / raw)
  To: Dave Martin
  Cc: Herbert Xu, Peter Zijlstra, Catalin Marinas,
	Sebastian Andrzej Siewior, Linux Kernel Mailing List,
	Eric Biggers, Mark Brown, Linux Crypto Mailing List,
	Thomas Gleixner, Will Deacon, Ingo Molnar, Linux ARM

On Wed, 20 Jan 2021 at 16:44, Dave Martin <Dave.Martin@arm.com> wrote:
>
> On Tue, Jan 19, 2021 at 05:29:05PM +0100, Ard Biesheuvel wrote:
> > On Tue, 19 Jan 2021 at 17:01, Dave Martin <Dave.Martin@arm.com> wrote:
> > >
> > > On Fri, Dec 18, 2020 at 06:01:05PM +0100, Ard Biesheuvel wrote:
> > > > Kernel mode NEON can be used in task or softirq context, but only in
> > > > a non-nesting manner, i.e., softirq context is only permitted if the
> > > > interrupt was not taken at a point where the kernel was using the NEON
> > > > in task context.
> > > >
> > > > This means all users of kernel mode NEON have to be aware of this
> > > > limitation, and either need to provide scalar fallbacks that may be much
> > > > slower (up to 20x for AES instructions) and potentially less safe, or
> > > > use an asynchronous interface that defers processing to a later time
> > > > when the NEON is guaranteed to be available.
> > > >
> > > > Given that grabbing and releasing the NEON is cheap, we can relax this
> > > > restriction, by increasing the granularity of kernel mode NEON code, and
> > > > always disabling softirq processing while the NEON is being used in task
> > > > context.
> > > >
> > > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > >
> > > Sorry for the slow reply on this...  it looks reasonable, but I have a
> > > few comments below.
> > >
> >
> > No worries - thanks for taking a look.
> >
> > > > ---
> > > >  arch/arm64/include/asm/assembler.h | 19 +++++++++++++------
> > > >  arch/arm64/kernel/asm-offsets.c    |  2 ++
> > > >  arch/arm64/kernel/fpsimd.c         |  4 ++--
> > > >  3 files changed, 17 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> > > > index ddbe6bf00e33..74ce46ed55ac 100644
> > > > --- a/arch/arm64/include/asm/assembler.h
> > > > +++ b/arch/arm64/include/asm/assembler.h
> > > > @@ -15,6 +15,7 @@
> > > >  #include <asm-generic/export.h>
> > > >
> > > >  #include <asm/asm-offsets.h>
> > > > +#include <asm/alternative.h>
> > > >  #include <asm/cpufeature.h>
> > > >  #include <asm/cputype.h>
> > > >  #include <asm/debug-monitors.h>
> > > > @@ -717,17 +718,23 @@ USER(\label, ic ivau, \tmp2)                    // invalidate I line PoU
> > > >       .endm
> > > >
> > > >       .macro          if_will_cond_yield_neon
> > > > -#ifdef CONFIG_PREEMPTION
> > > >       get_current_task        x0
> > > >       ldr             x0, [x0, #TSK_TI_PREEMPT]
> > > > -     sub             x0, x0, #PREEMPT_DISABLE_OFFSET
> > > > -     cbz             x0, .Lyield_\@
> > > > +#ifdef CONFIG_PREEMPTION
> > > > +     cmp             x0, #PREEMPT_DISABLE_OFFSET
> > > > +     beq             .Lyield_\@      // yield on need_resched in task context
> > > > +#endif
> > > > +     /* never yield while serving a softirq */
> > > > +     tbnz            x0, #SOFTIRQ_SHIFT, .Lnoyield_\@
> > >
> > > Can you explain the rationale here?
> > >
> > > Using if_will_cond_yield_neon suggests the algo thinks it may run for
> > > too long and stall preemption until completion, but we happily stall
> > > preemption _and_ softirqs here.
> > >
> > > Is it actually a bug to use the NEON conditional yield helpers in
> > > softirq context?
> > >
> >
> > No, it is not. But calling kernel_neon_end() from softirq context will
> > not cause it to finish any faster, so there is really no point in
> > doing so.
> >
> > > Ideally, if processing in softirq context takes an unreasonable amount of
> > > time, the work would be handed off to an asynchronous worker, but that
> > > does seem to conflict rather with the purpose of this series...
> > >
> >
> > Agreed, but this is not something we can police at this level. If the
> > caller does an unreasonable amount of work from a softirq, no amount
> > of yielding is going to make a difference.
>
> Ack, just wanted to make sure I wasn't missing something.
>
> Anyone writing softirq code can starve preemption, so I agree that we
> should trust people to know what they're doing.
>
>
> > > > +
> > > > +     adr_l           x0, irq_stat + IRQ_CPUSTAT_SOFTIRQ_PENDING
> > > > +     this_cpu_offset x1
> > > > +     ldr             w0, [x0, x1]
> > > > +     cbnz            w0, .Lyield_\@  // yield on pending softirq in task context
> > > > +.Lnoyield_\@:
> > > >       /* fall through to endif_yield_neon */
> > > >       .subsection     1
> > > >  .Lyield_\@ :
> > > > -#else
> > > > -     .section        ".discard.cond_yield_neon", "ax"
> > > > -#endif
> > > >       .endm
> > > >
> > > >       .macro          do_cond_yield_neon
> > > > diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> > > > index 7d32fc959b1a..34ef70877de4 100644
> > > > --- a/arch/arm64/kernel/asm-offsets.c
> > > > +++ b/arch/arm64/kernel/asm-offsets.c
> > > > @@ -93,6 +93,8 @@ int main(void)
> > > >    DEFINE(DMA_FROM_DEVICE,    DMA_FROM_DEVICE);
> > > >    BLANK();
> > > >    DEFINE(PREEMPT_DISABLE_OFFSET, PREEMPT_DISABLE_OFFSET);
> > > > +  DEFINE(SOFTIRQ_SHIFT, SOFTIRQ_SHIFT);
> > > > +  DEFINE(IRQ_CPUSTAT_SOFTIRQ_PENDING, offsetof(irq_cpustat_t, __softirq_pending));
> > > >    BLANK();
> > > >    DEFINE(CPU_BOOT_STACK,     offsetof(struct secondary_data, stack));
> > > >    DEFINE(CPU_BOOT_TASK,              offsetof(struct secondary_data, task));
> > > > diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> > > > index 062b21f30f94..823e3a8a8871 100644
> > > > --- a/arch/arm64/kernel/fpsimd.c
> > > > +++ b/arch/arm64/kernel/fpsimd.c
> > > > @@ -180,7 +180,7 @@ static void __get_cpu_fpsimd_context(void)
> > > >   */
> > > >  static void get_cpu_fpsimd_context(void)
> > > >  {
> > > > -     preempt_disable();
> > > > +     local_bh_disable();
> > > >       __get_cpu_fpsimd_context();
> > > >  }
> > > >
> > > > @@ -201,7 +201,7 @@ static void __put_cpu_fpsimd_context(void)
> > > >  static void put_cpu_fpsimd_context(void)
> > > >  {
> > > >       __put_cpu_fpsimd_context();
> > > > -     preempt_enable();
> > > > +     local_bh_enable();
> > > >  }
> > > >
> > > >  static bool have_cpu_fpsimd_context(void)
> > >
> > > I was concerned about catching all the relevant preempt_disable()s, but
> > > it had slipped my memory that Julien had factored these into one place.
> > >
> > > I can't see off the top of my head any reason why this shouldn't work.
> > >
> >
> > Thanks.
> >
> > >
> > > In theory, switching to local_bh_enable() here will add a check for
> > > pending softirqs onto context handling fast paths.  I haven't dug into
> > > how that works, so perhaps this is trivial on top of the preemption
> > > check in preempt_enable().  Do you see any difference in hackbench or
> > > similar benchmarks?
> > >
> >
> > I haven't tried, tbh. But by context handling fast paths, you mean
> > managing the FP/SIMD state at context switch time, right? Checking for
> > pending softirqs amounts to a single per-CPU load plus compare, so
> > that should be negligible AFAICT. Obviously, actually handling the
>
> Yes.  I've tended to assume, rather than prove, that this kind of thing
> is negligible -- so I confess I had not attempted to measure these
> effects when writing the original code.
>
> > softirq may take additional time, but that penalty has to be taken
> > somewhere - I don't see how that would create extra work that we
> > wouldn't have to do otherwise.
> >
> > I'll do some experiments with hackbench once I get back to this series.
>
> That sounds fine.
>
> Probably you won't find a significant difference anyway.
>

Finally got around to trying this: as expected, I don't see any
difference at all between the two versions (tested on TX2)
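
As an aside, the yield decision made by the if_will_cond_yield_neon hunk
quoted above can be summarised by the rough C equivalent below. This is
only an illustrative sketch under the assumption that it mirrors the
assembly faithfully: the function name is made up, and the real code
reads the packed preempt/need_resched word directly.

#include <linux/interrupt.h>
#include <linux/preempt.h>
#include <linux/sched.h>

/*
 * Illustrative sketch only: a C rendering of the yield decision made
 * by the if_will_cond_yield_neon assembly macro quoted above. The
 * function name is made up; the real implementation is the assembly.
 */
static bool will_cond_yield_neon_sketch(void)
{
	/* yield on need_resched in task context */
	if (IS_ENABLED(CONFIG_PREEMPTION) &&
	    preempt_count() == PREEMPT_DISABLE_OFFSET && need_resched())
		return true;

	/* never yield while serving a softirq */
	if (in_serving_softirq())
		return false;

	/* yield on pending softirq in task context */
	return local_softirq_pending() != 0;
}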


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled
  2020-12-18 17:01 ` Ard Biesheuvel
@ 2021-02-16 10:09   ` Peter Zijlstra
  -1 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2021-02-16 10:09 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-crypto, linux-arm-kernel, linux-kernel, Dave Martin,
	Mark Brown, Herbert Xu, Eric Biggers, Will Deacon,
	Catalin Marinas, Thomas Gleixner, Sebastian Andrzej Siewior,
	Ingo Molnar

On Fri, Dec 18, 2020 at 06:01:01PM +0100, Ard Biesheuvel wrote:
> [ TL;DR for the non-ARM folks on CC: disabling softirq processing when using
>   SIMD in kernel mode could reduce complexity and improve performance, but we
>   need to decide whether we can do this, and how much softirq processing
>   latency we can tolerate. If we can find a satisfactory solution for this,
>   we might do the same for x86 and 32-bit ARM as well. ]

> - could we do the same on x86, now that kernel_fpu_begin/end is no longer
>   expensive?

Can't we simply save/restore the relevant register set?

So something like (note amluto was wanting to add a regset argument):

	<task>
	kernel_fpu_begin(MMX)
		<SIRQ>
		kernel_fpu_begin(SSE)
		kernel_fpu_end();
		</SIRQ>
	...
	kernel_fpu_end()

We would have to save the MMX regs on the first SIRQ invocation of
kernel_fpu_begin(), and then have softirq context termination (</SIRQ>
above) restore them.

I mean, we already do much the same for the first kernel_fpu_begin(),
that has to save the user registers, which will be restored when we go
back to userspace.

So why not do exactly the same for softirq context?
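
Hypothetically, the bookkeeping for such a per-context save could look
like the sketch below. Every name in it (struct fpu_save_area,
fpu_save_regs(), fpu_restore_regs(), the per-CPU variables) is made up
for illustration and does not exist in the kernel; only the per-CPU
accessors and in_serving_softirq() are real.

#include <linux/percpu.h>
#include <linux/preempt.h>
#include <linux/types.h>

/* Hypothetical save area; the size is arbitrary for this sketch. */
struct fpu_save_area { u8 regs[512]; };

/* Made-up helpers standing in for the actual register save/restore. */
static void fpu_save_regs(struct fpu_save_area *buf);
static void fpu_restore_regs(struct fpu_save_area *buf);

static DEFINE_PER_CPU(struct fpu_save_area, softirq_fpu_buf);
static DEFINE_PER_CPU(bool, softirq_fpu_saved);

/*
 * Sketch: the first kernel-mode FPU use inside a softirq stashes the
 * interrupted task-context registers into the per-CPU buffer.
 */
static void kernel_fpu_begin_softirq_sketch(void)
{
	if (in_serving_softirq() && !this_cpu_read(softirq_fpu_saved)) {
		fpu_save_regs(this_cpu_ptr(&softirq_fpu_buf));
		this_cpu_write(softirq_fpu_saved, true);
	}
	/* ...followed by the usual kernel_fpu_begin() bookkeeping... */
}

/* Sketch: softirq exit (</SIRQ> above) restores the stashed state. */
static void softirq_exit_fpu_sketch(void)
{
	if (this_cpu_read(softirq_fpu_saved)) {
		fpu_restore_regs(this_cpu_ptr(&softirq_fpu_buf));
		this_cpu_write(softirq_fpu_saved, false);
	}
}

How attractive this is depends on the size of the state that has to be
stashed, which is exactly the SVE concern raised in the reply that
follows.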

^ permalink raw reply	[flat|nested] 28+ messages in thread


* Re: [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled
  2021-02-16 10:09   ` Peter Zijlstra
@ 2021-02-16 10:35     ` Ard Biesheuvel
  -1 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2021-02-16 10:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux Crypto Mailing List, Linux ARM, Linux Kernel Mailing List,
	Dave Martin, Mark Brown, Herbert Xu, Eric Biggers, Will Deacon,
	Catalin Marinas, Thomas Gleixner, Sebastian Andrzej Siewior,
	Ingo Molnar

On Tue, 16 Feb 2021 at 11:10, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Dec 18, 2020 at 06:01:01PM +0100, Ard Biesheuvel wrote:
> > [ TL;DR for the non-ARM folks on CC: disabling softirq processing when using
> >   SIMD in kernel mode could reduce complexity and improve performance, but we
> >   need to decide whether we can do this, and how much softirq processing
> >   latency we can tolerate. If we can find a satisfactory solution for this,
> >   we might do the same for x86 and 32-bit ARM as well. ]
>
> > - could we do the same on x86, now that kernel_fpu_begin/end is no longer
> >   expensive?
>
> Can't we simply save/restore the relevant register set?
>
> So something like (note amluto was wanting to add a regset argument):
>
>         <task>
>         kernel_fpu_begin(MMX)
>                 <SIRQ>
>                 kernel_fpu_begin(SSE)
>                 kernel_fpu_end();
>                 </SIRQ>
>         ...
>         kernel_fpu_end()
>
> We would have to save the MMX regs on the first SIRQ invocation of
> kernel_fpu_begin(), and then have softirq context termination (</SIRQ>
> above) restore them.
>
> I mean, we already do much the same for the first kernel_fpu_begin(),
> that has to save the user registers, which will be restored when we go
> back to userspace.
>
> So why not do exactly the same for softirq context?

That is what we originally had on arm64, with per-CPU buffers of the
appropriate size. This became a bit messy when SVE support was added,
because the register file is so large (32 registers of up to 2048 bits
each), and since the kernel does not use SVE itself, we want the inner
per-CPU buffer to only cover 128 bits per register. This means we
cannot allow the <sirq></sirq> region above to interrupt the outer
preserve (which is implemented entirely in software), since resuming
the outer preserve after a sirq would preserve content that was
corrupted by the inner preserve/restore. This could be addressed by
disabling interrupts across the outer preserve, but this caused a
problem somewhere else (Dave might remember the details better than I
do). So we ended up making SIMD in task context mutually exclusive
with SIMD in softirq context, also because that is what 32-bit ARM and
x86 were already doing.

But I understand that these concerns may not apply to x86 at all, so
perhaps the answer there is indeed to have an alternate buffer. And
actually, Andy L. suggested the same when I asked him about it on IRC.
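
For reference, the caller-side shape of that mutual-exclusion model on
arm64 looks roughly like the sketch below; may_use_simd(),
kernel_neon_begin() and kernel_neon_end() are the existing helpers,
while the cipher routines are made-up names used only for illustration.

#include <linux/types.h>
#include <asm/neon.h>
#include <asm/simd.h>

/* Made-up cipher routines, standing in for a real driver. */
static void my_cipher_neon(u8 *dst, const u8 *src, int len);
static void my_cipher_scalar(u8 *dst, const u8 *src, int len);

/*
 * Sketch of the current arm64 caller pattern: SIMD may be unusable if
 * we interrupted task-context NEON use, so callers check
 * may_use_simd() and fall back to scalar code otherwise.
 */
static void my_cipher(u8 *dst, const u8 *src, int len)
{
	if (may_use_simd()) {
		kernel_neon_begin();
		my_cipher_neon(dst, src, len);	/* NEON-accelerated path */
		kernel_neon_end();
	} else {
		/* scalar fallback; ~20x slower for AES */
		my_cipher_scalar(dst, src, len);
	}
}

The series under discussion aims to make the else branch unnecessary,
by guaranteeing that the NEON is always usable in task and softirq
context.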

^ permalink raw reply	[flat|nested] 28+ messages in thread


end of thread, other threads:[~2021-02-16 10:37 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-18 17:01 [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled Ard Biesheuvel
2020-12-18 17:01 ` Ard Biesheuvel
2020-12-18 17:01 ` [RFC PATCH 1/5] crypto: aead - disallow en/decrypt for non-task or non-softirq context Ard Biesheuvel
2020-12-18 17:01   ` Ard Biesheuvel
2020-12-18 17:01 ` [RFC PATCH 2/5] crypto: skcipher " Ard Biesheuvel
2020-12-18 17:01   ` Ard Biesheuvel
2020-12-18 17:01 ` [RFC PATCH 3/5] crypto: arm64/gcm-aes-ce - add NEON yield support Ard Biesheuvel
2020-12-18 17:01   ` Ard Biesheuvel
2020-12-18 17:01 ` [RFC PATCH 4/5] arm64: fpsimd: run kernel mode NEON with softirqs disabled Ard Biesheuvel
2020-12-18 17:01   ` Ard Biesheuvel
2021-01-19 16:00   ` Dave Martin
2021-01-19 16:00     ` Dave Martin
2021-01-19 16:29     ` Ard Biesheuvel
2021-01-19 16:29       ` Ard Biesheuvel
2021-01-20 15:44       ` Dave Martin
2021-01-20 15:44         ` Dave Martin
2021-02-15 18:30         ` Ard Biesheuvel
2021-02-15 18:30           ` Ard Biesheuvel
2020-12-18 17:01 ` [RFC PATCH 5/5] crypto: arm64/gcm-aes-ce - remove non-SIMD fallback path Ard Biesheuvel
2020-12-18 17:01   ` Ard Biesheuvel
2020-12-19  2:04 ` [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled Herbert Xu
2020-12-19  2:04   ` Herbert Xu
2021-01-14  8:22   ` Ard Biesheuvel
2021-01-14  8:22     ` Ard Biesheuvel
2021-02-16 10:09 ` Peter Zijlstra
2021-02-16 10:09   ` Peter Zijlstra
2021-02-16 10:35   ` Ard Biesheuvel
2021-02-16 10:35     ` Ard Biesheuvel
