* [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled
@ 2020-12-18 17:01 Ard Biesheuvel
2020-12-18 17:01 ` [RFC PATCH 1/5] crypto: aead - disallow en/decrypt for non-task or non-softirq context Ard Biesheuvel
From: Ard Biesheuvel @ 2020-12-18 17:01 UTC (permalink / raw)
To: linux-crypto
Cc: linux-arm-kernel, linux-kernel, Ard Biesheuvel, Dave Martin,
Mark Brown, Herbert Xu, Eric Biggers, Will Deacon,
Catalin Marinas, Thomas Gleixner, Peter Zijlstra,
Sebastian Andrzej Siewior, Ingo Molnar
[ TL;DR for the non-ARM folks on CC: disabling softirq processing when using
SIMD in kernel mode could reduce complexity and improve performance, but we
need to decide whether we can do this, and how much softirq processing
latency we can tolerate. If we can find a satisfactory solution for this,
we might do the same for x86 and 32-bit ARM as well. ]
The crypto API provides two ways to invoke symmetric encryption algorithms:
- synchronously, where the transformation is guaranteed to be done by the
time the function returns;
- asynchronously, where the function may return with a -EINPROGRESS return code,
and a completion will be signalled when the transformation is done.
The latter is mainly intended for h/w accelerators, where the throughput would
be severely limited by the latency otherwise. However, it is also being used
for software algorithms based on SIMD instructions, which cannot be issued from
arbitrary contexts (the rules are not the same on each architecture, but
typically, SIMD can be used in task context, or in softirq context provided
that the softirq was not taken while the SIMD was already in use in kernel
mode).
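For reference, driving the asynchronous interface from the caller's side looks
roughly like the sketch below ("cbc(aes)" is just an example algorithm name;
key setup and the src/dst scatterlists and IV are assumed to have been set up
by the caller):

  struct crypto_skcipher *tfm;
  struct skcipher_request *req;
  DECLARE_CRYPTO_WAIT(wait);
  int err;

  tfm = crypto_alloc_skcipher("cbc(aes)", 0, 0);
  if (IS_ERR(tfm))
          return PTR_ERR(tfm);

  req = skcipher_request_alloc(tfm, GFP_KERNEL);
  if (!req) {
          crypto_free_skcipher(tfm);
          return -ENOMEM;
  }

  /* completion callback signals 'wait' once the transform is done */
  skcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG |
                                     CRYPTO_TFM_REQ_MAY_SLEEP,
                                crypto_req_done, &wait);
  skcipher_request_set_crypt(req, src, dst, len, iv);

  /* may return -EINPROGRESS/-EBUSY; crypto_wait_req() sleeps until done */
  err = crypto_wait_req(crypto_skcipher_encrypt(req), &wait);

  skcipher_request_free(req);
  crypto_free_skcipher(tfm);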
Many users of the crypto API exist in the kernel today that opt out of this
asynchronous interface (802.11, macsec, kerberos, sw kTLS), or use a library
interface which is fundamentally synchronous (wireguard). This means we end
up using a degraded mode for the contended case (a scalar fallback) as well
as the uncontended case (generic GCM/CCM/CTR chaining mode templates wrapped
around the SIMD cipher as opposed to accelerated implementations of the full
chaining modes in question). Note that scalar AES runs ~20x slower than the
SIMD instruction based version.
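To illustrate, the synchronous SIMD pattern in question looks roughly like the
below (sketch only - do_neon() and do_scalar() stand in for the actual
transforms); it is the scalar branch that we would like to get rid of:

  if (crypto_simd_usable()) {
          kernel_neon_begin();
          do_neon(dst, src, len);         /* accelerated path */
          kernel_neon_end();
  } else {
          do_scalar(dst, src, len);       /* ~20x slower for AES */
  }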
So let's address this for arm64, by reorganizing kernel mode SIMD support so
that the SIMD unit can always be assumed to be available. This means we need
to defer softirq processing when grabbing the NEON unit in task context, so
that any use of it in softirq context is guaranteed not to interrupt any code
that was already using the NEON.
This obviously impacts softirq processing latency, which is why the existing
conditional NEON yield support is modified to take pending softirqs into
account.
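In other words, the intended usage model is roughly the following (sketch
only, assuming kernel_neon_begin() now disables softirq processing; CHUNK and
do_neon_chunk() are made up for illustration): bulk data is processed in
bounded chunks, so softirqs never remain pending for longer than one chunk's
worth of work.

  while (len) {
          size_t n = min_t(size_t, len, CHUNK);

          kernel_neon_begin();            /* softirqs deferred from here */
          do_neon_chunk(dst, src, n);
          kernel_neon_end();              /* pending softirqs run here */

          dst += n;
          src += n;
          len -= n;
  }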
As an example of how this impacts the code, the existing arm64 GCM driver is
updated to:
- Add yield support - currently, the pending softirq check is performed every
64 bytes of input, which is way too often - one of the desired outcomes of
this RFC is getting a reasonable ballpark for how long we want to run with
softirqs disabled.
- Remove the existing scalar fallbacks, which are no longer needed.
Questions:
- what did I miss or break horribly?
- does any of this matter for RT? AIUI, RT runs softirqs from a dedicated
kthread, so I don't think it cares.
- what would be a reasonable upper bound to keep softirqs disabled? I suppose
100s of cycles or less is overkill, but I'm not sure how to derive a better
answer.
- could we do the same on x86, now that kernel_fpu_begin/end is no longer
expensive?
Cc: Dave Martin <dave.martin@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Ard Biesheuvel (5):
crypto: aead - disallow en/decrypt for non-task or non-softirq context
crypto: skcipher - disallow en/decrypt for non-task or non-softirq
context
crypto: arm64/gcm-aes-ce - add NEON yield support
arm64: fpsimd: run kernel mode NEON with softirqs disabled
crypto: arm64/gcm-aes-ce - remove non-SIMD fallback path
arch/arm64/crypto/ghash-ce-core.S | 115 ++++++-----
arch/arm64/crypto/ghash-ce-glue.c | 209 +++++---------------
arch/arm64/include/asm/assembler.h | 19 +-
arch/arm64/kernel/asm-offsets.c | 2 +
arch/arm64/kernel/fpsimd.c | 4 +-
crypto/aead.c | 10 +
crypto/skcipher.c | 10 +
7 files changed, 155 insertions(+), 214 deletions(-)
--
2.17.1
* [RFC PATCH 1/5] crypto: aead - disallow en/decrypt for non-task or non-softirq context
2020-12-18 17:01 [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled Ard Biesheuvel
@ 2020-12-18 17:01 ` Ard Biesheuvel
2020-12-18 17:01 ` [RFC PATCH 2/5] crypto: skcipher " Ard Biesheuvel
From: Ard Biesheuvel @ 2020-12-18 17:01 UTC (permalink / raw)
To: linux-crypto
Cc: linux-arm-kernel, linux-kernel, Ard Biesheuvel, Dave Martin,
Mark Brown, Herbert Xu, Eric Biggers, Will Deacon,
Catalin Marinas, Thomas Gleixner, Peter Zijlstra,
Sebastian Andrzej Siewior, Ingo Molnar
In order to ensure that kernel mode SIMD routines will not need a scalar
fallback if they run with softirqs disabled, disallow any use of the
AEAD encrypt and decrypt routines from outside of task or softirq context.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
crypto/aead.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/crypto/aead.c b/crypto/aead.c
index 16991095270d..b5304b3d3314 100644
--- a/crypto/aead.c
+++ b/crypto/aead.c
@@ -87,6 +87,11 @@ int crypto_aead_encrypt(struct aead_request *req)
unsigned int cryptlen = req->cryptlen;
int ret;
+ if (!(alg->cra_flags & CRYPTO_ALG_ASYNC) &&
+ WARN_ONCE(!in_task() && !in_serving_softirq(),
+ "synchronous call from invalid context\n"))
+ return -EBUSY;
+
crypto_stats_get(alg);
if (crypto_aead_get_flags(aead) & CRYPTO_TFM_NEED_KEY)
ret = -ENOKEY;
@@ -104,6 +109,11 @@ int crypto_aead_decrypt(struct aead_request *req)
unsigned int cryptlen = req->cryptlen;
int ret;
+ if (!(alg->cra_flags & CRYPTO_ALG_ASYNC) &&
+ WARN_ONCE(!in_task() && !in_serving_softirq(),
+ "synchronous call from invalid context\n"))
+ return -EBUSY;
+
crypto_stats_get(alg);
if (crypto_aead_get_flags(aead) & CRYPTO_TFM_NEED_KEY)
ret = -ENOKEY;
--
2.17.1
* [RFC PATCH 2/5] crypto: skcipher - disallow en/decrypt for non-task or non-softirq context
2020-12-18 17:01 [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled Ard Biesheuvel
2020-12-18 17:01 ` [RFC PATCH 1/5] crypto: aead - disallow en/decrypt for non-task or non-softirq context Ard Biesheuvel
@ 2020-12-18 17:01 ` Ard Biesheuvel
2020-12-18 17:01 ` [RFC PATCH 3/5] crypto: arm64/gcm-aes-ce - add NEON yield support Ard Biesheuvel
From: Ard Biesheuvel @ 2020-12-18 17:01 UTC (permalink / raw)
To: linux-crypto
Cc: linux-arm-kernel, linux-kernel, Ard Biesheuvel, Dave Martin,
Mark Brown, Herbert Xu, Eric Biggers, Will Deacon,
Catalin Marinas, Thomas Gleixner, Peter Zijlstra,
Sebastian Andrzej Siewior, Ingo Molnar
In order to ensure that kernel mode SIMD routines will not need a scalar
fallback if they run with softirqs disabled, disallow any use of the
skcipher encrypt and decrypt routines from outside of task or softirq
context.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
crypto/skcipher.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/crypto/skcipher.c b/crypto/skcipher.c
index b4dae640de9f..d1a656d7d498 100644
--- a/crypto/skcipher.c
+++ b/crypto/skcipher.c
@@ -628,6 +628,11 @@ int crypto_skcipher_encrypt(struct skcipher_request *req)
unsigned int cryptlen = req->cryptlen;
int ret;
+ if (!(alg->cra_flags & CRYPTO_ALG_ASYNC) &&
+ WARN_ONCE(!in_task() && !in_serving_softirq(),
+ "synchronous call from invalid context\n"))
+ return -EBUSY;
+
crypto_stats_get(alg);
if (crypto_skcipher_get_flags(tfm) & CRYPTO_TFM_NEED_KEY)
ret = -ENOKEY;
@@ -645,6 +650,11 @@ int crypto_skcipher_decrypt(struct skcipher_request *req)
unsigned int cryptlen = req->cryptlen;
int ret;
+ if (!(alg->cra_flags & CRYPTO_ALG_ASYNC) &&
+ WARN_ONCE(!in_task() && !in_serving_softirq(),
+ "synchronous call from invalid context\n"))
+ return -EBUSY;
+
crypto_stats_get(alg);
if (crypto_skcipher_get_flags(tfm) & CRYPTO_TFM_NEED_KEY)
ret = -ENOKEY;
--
2.17.1
* [RFC PATCH 3/5] crypto: arm64/gcm-aes-ce - add NEON yield support
2020-12-18 17:01 [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled Ard Biesheuvel
2020-12-18 17:01 ` [RFC PATCH 1/5] crypto: aead - disallow en/decrypt for non-task or non-softirq context Ard Biesheuvel
2020-12-18 17:01 ` [RFC PATCH 2/5] crypto: skcipher " Ard Biesheuvel
@ 2020-12-18 17:01 ` Ard Biesheuvel
2020-12-18 17:01 ` [RFC PATCH 4/5] arm64: fpsimd: run kernel mode NEON with softirqs disabled Ard Biesheuvel
From: Ard Biesheuvel @ 2020-12-18 17:01 UTC (permalink / raw)
To: linux-crypto
Cc: linux-arm-kernel, linux-kernel, Ard Biesheuvel, Dave Martin,
Mark Brown, Herbert Xu, Eric Biggers, Will Deacon,
Catalin Marinas, Thomas Gleixner, Peter Zijlstra,
Sebastian Andrzej Siewior, Ingo Molnar
GCM mode is typically used for IPsec, but is now also used for software
kTLS, which means that we may end up operating on much larger inputs.
So let's wire up conditional NEON yield support for this driver, to ensure
that it does not cause scheduling blackouts when used with large inputs.
Given that GCM executes at 1-2 cycles per byte and operates on 64 byte
chunks (i.e., at most ~128 cycles of work per iteration), doing a yield check
every iteration should limit the scheduling (or softirq) latency to < 200
cycles, which is a very conservative upper bound.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/arm64/crypto/ghash-ce-core.S | 115 ++++++++++++--------
1 file changed, 67 insertions(+), 48 deletions(-)
diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index 7868330dd54e..533cfb810232 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -435,14 +435,23 @@ SYM_FUNC_END(pmull_ghash_update_p8)
.align 6
.macro pmull_gcm_do_crypt, enc
- stp x29, x30, [sp, #-32]!
- mov x29, sp
- str x19, [sp, #24]
+ frame_push 8
- load_round_keys x7, x6, x8
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
+ mov x25, x6
+ mov x26, x7
- ld1 {SHASH.2d}, [x3], #16
- ld1 {HH.2d-HH4.2d}, [x3]
+.Lrestart\enc:
+ load_round_keys x26, x25, x8
+
+ add x0, x22, #16
+ ld1 {SHASH.2d}, [x22]
+ ld1 {HH.2d-HH4.2d}, [x0]
trn1 SHASH2.2d, SHASH.2d, HH.2d
trn2 T1.2d, SHASH.2d, HH.2d
@@ -452,23 +461,23 @@ SYM_FUNC_END(pmull_ghash_update_p8)
trn2 T1.2d, HH3.2d, HH4.2d
eor HH34.16b, HH34.16b, T1.16b
- ld1 {XL.2d}, [x4]
+ ld1 {XL.2d}, [x23]
- cbz x0, 3f // tag only?
+ cbz x19, 3f // tag only?
- ldr w8, [x5, #12] // load lower counter
+ ldr w8, [x24, #12] // load lower counter
CPU_LE( rev w8, w8 )
0: mov w9, #4 // max blocks per round
- add x10, x0, #0xf
+ add x10, x19, #0xf
lsr x10, x10, #4 // remaining blocks
- subs x0, x0, #64
+ subs x19, x19, #64
csel w9, w10, w9, mi
add w8, w8, w9
bmi 1f
- ld1 {INP0.16b-INP3.16b}, [x2], #64
+ ld1 {INP0.16b-INP3.16b}, [x21], #64
.subsection 1
/*
* Populate the four input registers right to left with up to 63 bytes
@@ -487,30 +496,30 @@ CPU_LE( rev w8, w8 )
* input size is < 16 bytes)
*/
1: mov x15, #16
- ands x19, x0, #0xf
- csel x19, x19, x15, ne
+ ands x0, x19, #0xf
+ csel x0, x0, x15, ne
adr_l x17, .Lpermute_table + 16
- sub x11, x15, x19
+ sub x11, x15, x0
add x12, x17, x11
sub x17, x17, x11
ld1 {T1.16b}, [x12]
- sub x10, x1, x11
- sub x11, x2, x11
+ sub x10, x20, x11
+ sub x11, x21, x11
- cmp x0, #-16
+ cmp x19, #-16
csel x14, x15, xzr, gt
- cmp x0, #-32
+ cmp x19, #-32
csel x15, x15, xzr, gt
- cmp x0, #-48
- csel x16, x19, xzr, gt
- csel x1, x1, x10, gt
- csel x2, x2, x11, gt
-
- ld1 {INP0.16b}, [x2], x14
- ld1 {INP1.16b}, [x2], x15
- ld1 {INP2.16b}, [x2], x16
- ld1 {INP3.16b}, [x2]
+ cmp x19, #-48
+ csel x16, x0, xzr, gt
+ csel x20, x20, x10, gt
+ csel x21, x21, x11, gt
+
+ ld1 {INP0.16b}, [x21], x14
+ ld1 {INP1.16b}, [x21], x15
+ ld1 {INP2.16b}, [x21], x16
+ ld1 {INP3.16b}, [x21]
tbl INP3.16b, {INP3.16b}, T1.16b
b 2f
.previous
@@ -521,14 +530,24 @@ CPU_LE( rev w8, w8 )
bl pmull_gcm_enc_4x
- tbnz x0, #63, 6f
- st1 {INP0.16b-INP3.16b}, [x1], #64
+ tbnz x19, #63, 6f
+ st1 {INP0.16b-INP3.16b}, [x20], #64
.if \enc == 1
bl pmull_gcm_ghash_4x
.endif
- bne 0b
-3: ldp x19, x10, [sp, #24]
+ beq 3f
+
+ if_will_cond_yield_neon
+CPU_LE( rev w8, w8 )
+ str w8, [x24, #12] // store lower counter
+ st1 {XL.2d}, [x23]
+ do_cond_yield_neon
+ b .Lrestart\enc
+ endif_yield_neon
+ b 0b
+
+3: ldr x10, [sp, #.Lframe_local_offset]
cbz x10, 5f // output tag?
ld1 {INP3.16b}, [x10] // load lengths[]
@@ -536,10 +555,10 @@ CPU_LE( rev w8, w8 )
bl pmull_gcm_ghash_4x
mov w11, #(0x1 << 24) // BE '1U'
- ld1 {KS0.16b}, [x5]
+ ld1 {KS0.16b}, [x24]
mov KS0.s[3], w11
- enc_block KS0, x7, x6, x12
+ enc_block KS0, x26, x25, x12
ext XL.16b, XL.16b, XL.16b, #8
rev64 XL.16b, XL.16b
@@ -548,7 +567,7 @@ CPU_LE( rev w8, w8 )
.if \enc == 1
st1 {XL.16b}, [x10] // store tag
.else
- ldp x11, x12, [sp, #40] // load tag pointer and authsize
+ ldp x11, x12, [sp, #.Lframe_local_offset + 8] // load tag pointer and authsize
adr_l x17, .Lpermute_table
ld1 {KS0.16b}, [x11] // load supplied tag
add x17, x17, x12
@@ -561,33 +580,33 @@ CPU_LE( rev w8, w8 )
smov w0, v0.b[0] // return b0
.endif
-4: ldp x29, x30, [sp], #32
+4: frame_pop
ret
5:
CPU_LE( rev w8, w8 )
- str w8, [x5, #12] // store lower counter
- st1 {XL.2d}, [x4]
+ str w8, [x24, #12] // store lower counter
+ st1 {XL.2d}, [x23]
b 4b
6: ld1 {T1.16b-T2.16b}, [x17], #32 // permute vectors
- sub x17, x17, x19, lsl #1
+ sub x17, x17, x0, lsl #1
cmp w9, #1
beq 7f
.subsection 1
-7: ld1 {INP2.16b}, [x1]
+7: ld1 {INP2.16b}, [x20]
tbx INP2.16b, {INP3.16b}, T1.16b
mov INP3.16b, INP2.16b
b 8f
.previous
- st1 {INP0.16b}, [x1], x14
- st1 {INP1.16b}, [x1], x15
- st1 {INP2.16b}, [x1], x16
+ st1 {INP0.16b}, [x20], x14
+ st1 {INP1.16b}, [x20], x15
+ st1 {INP2.16b}, [x20], x16
tbl INP3.16b, {INP3.16b}, T1.16b
tbx INP3.16b, {INP2.16b}, T2.16b
-8: st1 {INP3.16b}, [x1]
+8: st1 {INP3.16b}, [x20]
.if \enc == 1
ld1 {T1.16b}, [x17]
@@ -699,7 +718,7 @@ SYM_FUNC_START_LOCAL(pmull_gcm_ghash_4x)
SYM_FUNC_END(pmull_gcm_ghash_4x)
SYM_FUNC_START_LOCAL(pmull_gcm_enc_4x)
- ld1 {KS0.16b}, [x5] // load upper counter
+ ld1 {KS0.16b}, [x24] // load upper counter
sub w10, w8, #4
sub w11, w8, #3
sub w12, w8, #2
@@ -716,13 +735,13 @@ SYM_FUNC_START_LOCAL(pmull_gcm_enc_4x)
ins KS2.s[3], w12
ins KS3.s[3], w13
- add x10, x6, #96 // round key pointer
+ add x10, x25, #96 // round key pointer
ld1 {K6.4s-K7.4s}, [x10], #32
.irp key, K0, K1, K2, K3, K4, K5
enc_qround KS0, KS1, KS2, KS3, \key
.endr
- tbnz x7, #2, .Lnot128
+ tbnz x26, #2, .Lnot128
.subsection 1
.Lnot128:
ld1 {K8.4s-K9.4s}, [x10], #32
@@ -733,7 +752,7 @@ SYM_FUNC_START_LOCAL(pmull_gcm_enc_4x)
.irp key, K8, K9
enc_qround KS0, KS1, KS2, KS3, \key
.endr
- tbz x7, #1, .Lout192
+ tbz x26, #1, .Lout192
b .Lout256
.previous
--
2.17.1
* [RFC PATCH 4/5] arm64: fpsimd: run kernel mode NEON with softirqs disabled
2020-12-18 17:01 [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled Ard Biesheuvel
2020-12-18 17:01 ` [RFC PATCH 3/5] crypto: arm64/gcm-aes-ce - add NEON yield support Ard Biesheuvel
@ 2020-12-18 17:01 ` Ard Biesheuvel
2021-01-19 16:00 ` Dave Martin
2020-12-18 17:01 ` [RFC PATCH 5/5] crypto: arm64/gcm-aes-ce - remove non-SIMD fallback path Ard Biesheuvel
From: Ard Biesheuvel @ 2020-12-18 17:01 UTC (permalink / raw)
To: linux-crypto
Cc: linux-arm-kernel, linux-kernel, Ard Biesheuvel, Dave Martin,
Mark Brown, Herbert Xu, Eric Biggers, Will Deacon,
Catalin Marinas, Thomas Gleixner, Peter Zijlstra,
Sebastian Andrzej Siewior, Ingo Molnar
Kernel mode NEON can be used in task or softirq context, but only in
a non-nesting manner, i.e., softirq context is only permitted if the
interrupt was not taken at a point where the kernel was using the NEON
in task context.
This means all users of kernel mode NEON have to be aware of this
limitation, and either need to provide scalar fallbacks that may be much
slower (up to 20x for AES instructions) and potentially less safe, or
use an asynchronous interface that defers processing to a later time
when the NEON is guaranteed to be available.
Given that grabbing and releasing the NEON is cheap, we can relax this
restriction, by increasing the granularity of kernel mode NEON code, and
always disabling softirq processing while the NEON is being used in task
context.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/arm64/include/asm/assembler.h | 19 +++++++++++++------
arch/arm64/kernel/asm-offsets.c | 2 ++
arch/arm64/kernel/fpsimd.c | 4 ++--
3 files changed, 17 insertions(+), 8 deletions(-)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index ddbe6bf00e33..74ce46ed55ac 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -15,6 +15,7 @@
#include <asm-generic/export.h>
#include <asm/asm-offsets.h>
+#include <asm/alternative.h>
#include <asm/cpufeature.h>
#include <asm/cputype.h>
#include <asm/debug-monitors.h>
@@ -717,17 +718,23 @@ USER(\label, ic ivau, \tmp2) // invalidate I line PoU
.endm
.macro if_will_cond_yield_neon
-#ifdef CONFIG_PREEMPTION
get_current_task x0
ldr x0, [x0, #TSK_TI_PREEMPT]
- sub x0, x0, #PREEMPT_DISABLE_OFFSET
- cbz x0, .Lyield_\@
+#ifdef CONFIG_PREEMPTION
+ cmp x0, #PREEMPT_DISABLE_OFFSET
+ beq .Lyield_\@ // yield on need_resched in task context
+#endif
+ /* never yield while serving a softirq */
+ tbnz x0, #SOFTIRQ_SHIFT, .Lnoyield_\@
+
+ adr_l x0, irq_stat + IRQ_CPUSTAT_SOFTIRQ_PENDING
+ this_cpu_offset x1
+ ldr w0, [x0, x1]
+ cbnz w0, .Lyield_\@ // yield on pending softirq in task context
+.Lnoyield_\@:
/* fall through to endif_yield_neon */
.subsection 1
.Lyield_\@ :
-#else
- .section ".discard.cond_yield_neon", "ax"
-#endif
.endm
.macro do_cond_yield_neon
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 7d32fc959b1a..34ef70877de4 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -93,6 +93,8 @@ int main(void)
DEFINE(DMA_FROM_DEVICE, DMA_FROM_DEVICE);
BLANK();
DEFINE(PREEMPT_DISABLE_OFFSET, PREEMPT_DISABLE_OFFSET);
+ DEFINE(SOFTIRQ_SHIFT, SOFTIRQ_SHIFT);
+ DEFINE(IRQ_CPUSTAT_SOFTIRQ_PENDING, offsetof(irq_cpustat_t, __softirq_pending));
BLANK();
DEFINE(CPU_BOOT_STACK, offsetof(struct secondary_data, stack));
DEFINE(CPU_BOOT_TASK, offsetof(struct secondary_data, task));
diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index 062b21f30f94..823e3a8a8871 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -180,7 +180,7 @@ static void __get_cpu_fpsimd_context(void)
*/
static void get_cpu_fpsimd_context(void)
{
- preempt_disable();
+ local_bh_disable();
__get_cpu_fpsimd_context();
}
@@ -201,7 +201,7 @@ static void __put_cpu_fpsimd_context(void)
static void put_cpu_fpsimd_context(void)
{
__put_cpu_fpsimd_context();
- preempt_enable();
+ local_bh_enable();
}
static bool have_cpu_fpsimd_context(void)
--
2.17.1
* [RFC PATCH 5/5] crypto: arm64/gcm-aes-ce - remove non-SIMD fallback path
2020-12-18 17:01 [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled Ard Biesheuvel
2020-12-18 17:01 ` [RFC PATCH 4/5] arm64: fpsimd: run kernel mode NEON with softirqs disabled Ard Biesheuvel
@ 2020-12-18 17:01 ` Ard Biesheuvel
2020-12-19 2:04 ` [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled Herbert Xu
2021-02-16 10:09 ` Peter Zijlstra
From: Ard Biesheuvel @ 2020-12-18 17:01 UTC (permalink / raw)
To: linux-crypto
Cc: linux-arm-kernel, linux-kernel, Ard Biesheuvel, Dave Martin,
Mark Brown, Herbert Xu, Eric Biggers, Will Deacon,
Catalin Marinas, Thomas Gleixner, Peter Zijlstra,
Sebastian Andrzej Siewior, Ingo Molnar
Now that kernel mode SIMD is guaranteed to be available when executing
in task or softirq context, we no longer need scalar fallbacks to use
when the NEON is unavailable. So get rid of them.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/arm64/crypto/ghash-ce-glue.c | 209 +++++---------------
1 file changed, 51 insertions(+), 158 deletions(-)
diff --git a/arch/arm64/crypto/ghash-ce-glue.c b/arch/arm64/crypto/ghash-ce-glue.c
index 720cd3a58da3..15794fe21a0b 100644
--- a/arch/arm64/crypto/ghash-ce-glue.c
+++ b/arch/arm64/crypto/ghash-ce-glue.c
@@ -362,84 +362,36 @@ static int gcm_encrypt(struct aead_request *req)
err = skcipher_walk_aead_encrypt(&walk, req, false);
- if (likely(crypto_simd_usable())) {
- do {
- const u8 *src = walk.src.virt.addr;
- u8 *dst = walk.dst.virt.addr;
- int nbytes = walk.nbytes;
-
- tag = (u8 *)&lengths;
-
- if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE)) {
- src = dst = memcpy(buf + sizeof(buf) - nbytes,
- src, nbytes);
- } else if (nbytes < walk.total) {
- nbytes &= ~(AES_BLOCK_SIZE - 1);
- tag = NULL;
- }
-
- kernel_neon_begin();
- pmull_gcm_encrypt(nbytes, dst, src, ctx->ghash_key.h,
- dg, iv, ctx->aes_key.key_enc, nrounds,
- tag);
- kernel_neon_end();
-
- if (unlikely(!nbytes))
- break;
-
- if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE))
- memcpy(walk.dst.virt.addr,
- buf + sizeof(buf) - nbytes, nbytes);
-
- err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
- } while (walk.nbytes);
- } else {
- while (walk.nbytes >= AES_BLOCK_SIZE) {
- int blocks = walk.nbytes / AES_BLOCK_SIZE;
- const u8 *src = walk.src.virt.addr;
- u8 *dst = walk.dst.virt.addr;
- int remaining = blocks;
-
- do {
- aes_encrypt(&ctx->aes_key, buf, iv);
- crypto_xor_cpy(dst, src, buf, AES_BLOCK_SIZE);
- crypto_inc(iv, AES_BLOCK_SIZE);
-
- dst += AES_BLOCK_SIZE;
- src += AES_BLOCK_SIZE;
- } while (--remaining > 0);
-
- ghash_do_update(blocks, dg, walk.dst.virt.addr,
- &ctx->ghash_key, NULL);
-
- err = skcipher_walk_done(&walk,
- walk.nbytes % AES_BLOCK_SIZE);
- }
-
- /* handle the tail */
- if (walk.nbytes) {
- aes_encrypt(&ctx->aes_key, buf, iv);
+ do {
+ const u8 *src = walk.src.virt.addr;
+ u8 *dst = walk.dst.virt.addr;
+ int nbytes = walk.nbytes;
- crypto_xor_cpy(walk.dst.virt.addr, walk.src.virt.addr,
- buf, walk.nbytes);
+ tag = (u8 *)&lengths;
- memcpy(buf, walk.dst.virt.addr, walk.nbytes);
- memset(buf + walk.nbytes, 0, sizeof(buf) - walk.nbytes);
+ if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE)) {
+ src = dst = memcpy(buf + sizeof(buf) - nbytes,
+ src, nbytes);
+ } else if (nbytes < walk.total) {
+ nbytes &= ~(AES_BLOCK_SIZE - 1);
+ tag = NULL;
}
- tag = (u8 *)&lengths;
- ghash_do_update(1, dg, tag, &ctx->ghash_key,
- walk.nbytes ? buf : NULL);
+ kernel_neon_begin();
+ pmull_gcm_encrypt(nbytes, dst, src, ctx->ghash_key.h,
+ dg, iv, ctx->aes_key.key_enc, nrounds,
+ tag);
+ kernel_neon_end();
- if (walk.nbytes)
- err = skcipher_walk_done(&walk, 0);
+ if (unlikely(!nbytes))
+ break;
- put_unaligned_be64(dg[1], tag);
- put_unaligned_be64(dg[0], tag + 8);
- put_unaligned_be32(1, iv + GCM_IV_SIZE);
- aes_encrypt(&ctx->aes_key, iv, iv);
- crypto_xor(tag, iv, AES_BLOCK_SIZE);
- }
+ if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE))
+ memcpy(walk.dst.virt.addr,
+ buf + sizeof(buf) - nbytes, nbytes);
+
+ err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+ } while (walk.nbytes);
if (err)
return err;
@@ -464,6 +416,7 @@ static int gcm_decrypt(struct aead_request *req)
u64 dg[2] = {};
be128 lengths;
u8 *tag;
+ int ret;
int err;
lengths.a = cpu_to_be64(req->assoclen * 8);
@@ -481,101 +434,41 @@ static int gcm_decrypt(struct aead_request *req)
err = skcipher_walk_aead_decrypt(&walk, req, false);
- if (likely(crypto_simd_usable())) {
- int ret;
-
- do {
- const u8 *src = walk.src.virt.addr;
- u8 *dst = walk.dst.virt.addr;
- int nbytes = walk.nbytes;
-
- tag = (u8 *)&lengths;
-
- if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE)) {
- src = dst = memcpy(buf + sizeof(buf) - nbytes,
- src, nbytes);
- } else if (nbytes < walk.total) {
- nbytes &= ~(AES_BLOCK_SIZE - 1);
- tag = NULL;
- }
-
- kernel_neon_begin();
- ret = pmull_gcm_decrypt(nbytes, dst, src,
- ctx->ghash_key.h,
- dg, iv, ctx->aes_key.key_enc,
- nrounds, tag, otag, authsize);
- kernel_neon_end();
-
- if (unlikely(!nbytes))
- break;
-
- if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE))
- memcpy(walk.dst.virt.addr,
- buf + sizeof(buf) - nbytes, nbytes);
-
- err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
- } while (walk.nbytes);
-
- if (err)
- return err;
- if (ret)
- return -EBADMSG;
- } else {
- while (walk.nbytes >= AES_BLOCK_SIZE) {
- int blocks = walk.nbytes / AES_BLOCK_SIZE;
- const u8 *src = walk.src.virt.addr;
- u8 *dst = walk.dst.virt.addr;
-
- ghash_do_update(blocks, dg, walk.src.virt.addr,
- &ctx->ghash_key, NULL);
-
- do {
- aes_encrypt(&ctx->aes_key, buf, iv);
- crypto_xor_cpy(dst, src, buf, AES_BLOCK_SIZE);
- crypto_inc(iv, AES_BLOCK_SIZE);
-
- dst += AES_BLOCK_SIZE;
- src += AES_BLOCK_SIZE;
- } while (--blocks > 0);
+ do {
+ const u8 *src = walk.src.virt.addr;
+ u8 *dst = walk.dst.virt.addr;
+ int nbytes = walk.nbytes;
- err = skcipher_walk_done(&walk,
- walk.nbytes % AES_BLOCK_SIZE);
- }
+ tag = (u8 *)&lengths;
- /* handle the tail */
- if (walk.nbytes) {
- memcpy(buf, walk.src.virt.addr, walk.nbytes);
- memset(buf + walk.nbytes, 0, sizeof(buf) - walk.nbytes);
+ if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE)) {
+ src = dst = memcpy(buf + sizeof(buf) - nbytes,
+ src, nbytes);
+ } else if (nbytes < walk.total) {
+ nbytes &= ~(AES_BLOCK_SIZE - 1);
+ tag = NULL;
}
- tag = (u8 *)&lengths;
- ghash_do_update(1, dg, tag, &ctx->ghash_key,
- walk.nbytes ? buf : NULL);
-
- if (walk.nbytes) {
- aes_encrypt(&ctx->aes_key, buf, iv);
+ kernel_neon_begin();
+ ret = pmull_gcm_decrypt(nbytes, dst, src, ctx->ghash_key.h,
+ dg, iv, ctx->aes_key.key_enc,
+ nrounds, tag, otag, authsize);
+ kernel_neon_end();
- crypto_xor_cpy(walk.dst.virt.addr, walk.src.virt.addr,
- buf, walk.nbytes);
+ if (unlikely(!nbytes))
+ break;
- err = skcipher_walk_done(&walk, 0);
- }
+ if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE))
+ memcpy(walk.dst.virt.addr,
+ buf + sizeof(buf) - nbytes, nbytes);
- if (err)
- return err;
+ err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+ } while (walk.nbytes);
- put_unaligned_be64(dg[1], tag);
- put_unaligned_be64(dg[0], tag + 8);
- put_unaligned_be32(1, iv + GCM_IV_SIZE);
- aes_encrypt(&ctx->aes_key, iv, iv);
- crypto_xor(tag, iv, AES_BLOCK_SIZE);
+ if (err)
+ return err;
- if (crypto_memneq(tag, otag, authsize)) {
- memzero_explicit(tag, AES_BLOCK_SIZE);
- return -EBADMSG;
- }
- }
- return 0;
+ return ret ? -EBADMSG : 0;
}
static struct aead_alg gcm_aes_alg = {
--
2.17.1
* Re: [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled
2020-12-18 17:01 [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled Ard Biesheuvel
2020-12-18 17:01 ` [RFC PATCH 5/5] crypto: arm64/gcm-aes-ce - remove non-SIMD fallback path Ard Biesheuvel
@ 2020-12-19 2:04 ` Herbert Xu
2021-01-14 8:22 ` Ard Biesheuvel
2021-02-16 10:09 ` Peter Zijlstra
From: Herbert Xu @ 2020-12-19 2:04 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: linux-crypto, linux-arm-kernel, linux-kernel, Dave Martin,
Mark Brown, Eric Biggers, Will Deacon, Catalin Marinas,
Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
Ingo Molnar
On Fri, Dec 18, 2020 at 06:01:01PM +0100, Ard Biesheuvel wrote:
>
> Questions:
> - what did I miss or break horribly?
> - does any of this matter for RT? AIUI, RT runs softirqs from a dedicated
> kthread, so I don't think it cares.
> - what would be a reasonable upper bound to keep softirqs disabled? I suppose
> 100s of cycles or less is overkill, but I'm not sure how to derive a better
> answer.
> - could we do the same on x86, now that kernel_fpu_begin/end is no longer
> expensive?
If this approach works, not only would it allow us to support the
synchronous users better, it would also allow us to remove loads
of cruft in the Crypto API that exists solely to support these SIMD
code paths.
So I eagerly await the assessment of the scheduler/RT folks on this
approach.
Thanks,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled
2020-12-19 2:04 ` [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled Herbert Xu
@ 2021-01-14 8:22 ` Ard Biesheuvel
From: Ard Biesheuvel @ 2021-01-14 8:22 UTC (permalink / raw)
To: Herbert Xu
Cc: Linux Crypto Mailing List, Linux ARM, Linux Kernel Mailing List,
Dave Martin, Mark Brown, Eric Biggers, Will Deacon,
Catalin Marinas, Thomas Gleixner, Peter Zijlstra,
Sebastian Andrzej Siewior, Ingo Molnar
On Sat, 19 Dec 2020 at 03:05, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Fri, Dec 18, 2020 at 06:01:01PM +0100, Ard Biesheuvel wrote:
> >
> > Questions:
> > - what did I miss or break horribly?
> > - does any of this matter for RT? AIUI, RT runs softirqs from a dedicated
> > kthread, so I don't think it cares.
> > - what would be a reasonable upper bound to keep softirqs disabled? I suppose
> > 100s of cycles or less is overkill, but I'm not sure how to derive a better
> > answer.
> > - could we do the same on x86, now that kernel_fpu_begin/end is no longer
> > expensive?
>
> If this approach works not only would it allow us to support the
> synchronous users better, it would also allow us to remove loads
> of cruft in the Crypto API that exist solely to support these SIMD
> code paths.
>
> So I eagerly await the assessment of the scheduler/RT folks on this
> approach.
>
Any insights here? Is there a ballpark upper bound for the duration of
a softirq disabled section? Other reasons why dis/enabling softirq
handling is a bad idea?
* Re: [RFC PATCH 4/5] arm64: fpsimd: run kernel mode NEON with softirqs disabled
2020-12-18 17:01 ` [RFC PATCH 4/5] arm64: fpsimd: run kernel mode NEON with softirqs disabled Ard Biesheuvel
@ 2021-01-19 16:00 ` Dave Martin
2021-01-19 16:29 ` Ard Biesheuvel
From: Dave Martin @ 2021-01-19 16:00 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: linux-crypto, Ingo Molnar, Herbert Xu, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, linux-kernel,
Eric Biggers, Mark Brown, Thomas Gleixner, Will Deacon,
linux-arm-kernel
On Fri, Dec 18, 2020 at 06:01:05PM +0100, Ard Biesheuvel wrote:
> Kernel mode NEON can be used in task or softirq context, but only in
> a non-nesting manner, i.e., softirq context is only permitted if the
> interrupt was not taken at a point where the kernel was using the NEON
> in task context.
>
> This means all users of kernel mode NEON have to be aware of this
> limitation, and either need to provide scalar fallbacks that may be much
> slower (up to 20x for AES instructions) and potentially less safe, or
> use an asynchronous interface that defers processing to a later time
> when the NEON is guaranteed to be available.
>
> Given that grabbing and releasing the NEON is cheap, we can relax this
> restriction, by increasing the granularity of kernel mode NEON code, and
> always disabling softirq processing while the NEON is being used in task
> context.
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Sorry for the slow reply on this... it looks reasonable, but I have a
few comments below.
> ---
> arch/arm64/include/asm/assembler.h | 19 +++++++++++++------
> arch/arm64/kernel/asm-offsets.c | 2 ++
> arch/arm64/kernel/fpsimd.c | 4 ++--
> 3 files changed, 17 insertions(+), 8 deletions(-)
>
> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> index ddbe6bf00e33..74ce46ed55ac 100644
> --- a/arch/arm64/include/asm/assembler.h
> +++ b/arch/arm64/include/asm/assembler.h
> @@ -15,6 +15,7 @@
> #include <asm-generic/export.h>
>
> #include <asm/asm-offsets.h>
> +#include <asm/alternative.h>
> #include <asm/cpufeature.h>
> #include <asm/cputype.h>
> #include <asm/debug-monitors.h>
> @@ -717,17 +718,23 @@ USER(\label, ic ivau, \tmp2) // invalidate I line PoU
> .endm
>
> .macro if_will_cond_yield_neon
> -#ifdef CONFIG_PREEMPTION
> get_current_task x0
> ldr x0, [x0, #TSK_TI_PREEMPT]
> - sub x0, x0, #PREEMPT_DISABLE_OFFSET
> - cbz x0, .Lyield_\@
> +#ifdef CONFIG_PREEMPTION
> + cmp x0, #PREEMPT_DISABLE_OFFSET
> + beq .Lyield_\@ // yield on need_resched in task context
> +#endif
> + /* never yield while serving a softirq */
> + tbnz x0, #SOFTIRQ_SHIFT, .Lnoyield_\@
Can you explain the rationale here?
Using if_will_cond_yield_neon suggests the algo thinks it may run for
too long and stall preemption until completion, but we happily stall
preemption _and_ softirqs here.
Is it actually a bug to use the NEON conditional yield helpers in
softirq context?
Ideally, if processing in softirq context takes an unreasonable amount of
time, the work would be handed off to an asynchronous worker, but that
does seem to conflict rather with the purpose of this series...
> +
> + adr_l x0, irq_stat + IRQ_CPUSTAT_SOFTIRQ_PENDING
> + this_cpu_offset x1
> + ldr w0, [x0, x1]
> + cbnz w0, .Lyield_\@ // yield on pending softirq in task context
> +.Lnoyield_\@:
> /* fall through to endif_yield_neon */
> .subsection 1
> .Lyield_\@ :
> -#else
> - .section ".discard.cond_yield_neon", "ax"
> -#endif
> .endm
>
> .macro do_cond_yield_neon
> diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> index 7d32fc959b1a..34ef70877de4 100644
> --- a/arch/arm64/kernel/asm-offsets.c
> +++ b/arch/arm64/kernel/asm-offsets.c
> @@ -93,6 +93,8 @@ int main(void)
> DEFINE(DMA_FROM_DEVICE, DMA_FROM_DEVICE);
> BLANK();
> DEFINE(PREEMPT_DISABLE_OFFSET, PREEMPT_DISABLE_OFFSET);
> + DEFINE(SOFTIRQ_SHIFT, SOFTIRQ_SHIFT);
> + DEFINE(IRQ_CPUSTAT_SOFTIRQ_PENDING, offsetof(irq_cpustat_t, __softirq_pending));
> BLANK();
> DEFINE(CPU_BOOT_STACK, offsetof(struct secondary_data, stack));
> DEFINE(CPU_BOOT_TASK, offsetof(struct secondary_data, task));
> diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> index 062b21f30f94..823e3a8a8871 100644
> --- a/arch/arm64/kernel/fpsimd.c
> +++ b/arch/arm64/kernel/fpsimd.c
> @@ -180,7 +180,7 @@ static void __get_cpu_fpsimd_context(void)
> */
> static void get_cpu_fpsimd_context(void)
> {
> - preempt_disable();
> + local_bh_disable();
> __get_cpu_fpsimd_context();
> }
>
> @@ -201,7 +201,7 @@ static void __put_cpu_fpsimd_context(void)
> static void put_cpu_fpsimd_context(void)
> {
> __put_cpu_fpsimd_context();
> - preempt_enable();
> + local_bh_enable();
> }
>
> static bool have_cpu_fpsimd_context(void)
I was concerned about catching all the relevant preempt_disable()s, but
it had slipped my memory that Julien had factored these into one place.
I can't see off the top of my head any reason why this shouldn't work.
In theory, switching to local_bh_enable() here will add a check for
pending softirqs onto context handling fast paths. I haven't dug into
how that works, so perhaps this is trivial on top of the preemption
check in preempt_enable(). Do you see any difference in hackbench or
similar benchmarks?
Cheers
---Dave
* Re: [RFC PATCH 4/5] arm64: fpsimd: run kernel mode NEON with softirqs disabled
2021-01-19 16:00 ` Dave Martin
@ 2021-01-19 16:29 ` Ard Biesheuvel
2021-01-20 15:44 ` Dave Martin
From: Ard Biesheuvel @ 2021-01-19 16:29 UTC (permalink / raw)
To: Dave Martin
Cc: Linux Crypto Mailing List, Ingo Molnar, Herbert Xu,
Peter Zijlstra, Catalin Marinas, Sebastian Andrzej Siewior,
Linux Kernel Mailing List, Eric Biggers, Mark Brown,
Thomas Gleixner, Will Deacon, Linux ARM
On Tue, 19 Jan 2021 at 17:01, Dave Martin <Dave.Martin@arm.com> wrote:
>
> On Fri, Dec 18, 2020 at 06:01:05PM +0100, Ard Biesheuvel wrote:
> > Kernel mode NEON can be used in task or softirq context, but only in
> > a non-nesting manner, i.e., softirq context is only permitted if the
> > interrupt was not taken at a point where the kernel was using the NEON
> > in task context.
> >
> > This means all users of kernel mode NEON have to be aware of this
> > limitation, and either need to provide scalar fallbacks that may be much
> > slower (up to 20x for AES instructions) and potentially less safe, or
> > use an asynchronous interface that defers processing to a later time
> > when the NEON is guaranteed to be available.
> >
> > Given that grabbing and releasing the NEON is cheap, we can relax this
> > restriction, by increasing the granularity of kernel mode NEON code, and
> > always disabling softirq processing while the NEON is being used in task
> > context.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>
> Sorry for the slow reply on this... it looks reasonable, but I have a
> few comments below.
>
No worries - thanks for taking a look.
> > ---
> > arch/arm64/include/asm/assembler.h | 19 +++++++++++++------
> > arch/arm64/kernel/asm-offsets.c | 2 ++
> > arch/arm64/kernel/fpsimd.c | 4 ++--
> > 3 files changed, 17 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> > index ddbe6bf00e33..74ce46ed55ac 100644
> > --- a/arch/arm64/include/asm/assembler.h
> > +++ b/arch/arm64/include/asm/assembler.h
> > @@ -15,6 +15,7 @@
> > #include <asm-generic/export.h>
> >
> > #include <asm/asm-offsets.h>
> > +#include <asm/alternative.h>
> > #include <asm/cpufeature.h>
> > #include <asm/cputype.h>
> > #include <asm/debug-monitors.h>
> > @@ -717,17 +718,23 @@ USER(\label, ic ivau, \tmp2) // invalidate I line PoU
> > .endm
> >
> > .macro if_will_cond_yield_neon
> > -#ifdef CONFIG_PREEMPTION
> > get_current_task x0
> > ldr x0, [x0, #TSK_TI_PREEMPT]
> > - sub x0, x0, #PREEMPT_DISABLE_OFFSET
> > - cbz x0, .Lyield_\@
> > +#ifdef CONFIG_PREEMPTION
> > + cmp x0, #PREEMPT_DISABLE_OFFSET
> > + beq .Lyield_\@ // yield on need_resched in task context
> > +#endif
> > + /* never yield while serving a softirq */
> > + tbnz x0, #SOFTIRQ_SHIFT, .Lnoyield_\@
>
> Can you explain the rationale here?
>
> Using if_will_cond_yield_neon suggests the algo thinks it may run for
> too long and stall preemption until completion, but we happily stall
> preemption _and_ softirqs here.
>
> Is it actually a bug to use the NEON conditional yield helpers in
> softirq context?
>
No, it is not. But calling kernel_neon_end() from softirq context will
not cause it to finish any faster, so there is really no point in
doing so.
> Ideally, if processing in softirq context takes an unreasonable amount of
> time, the work would be handed off to an asynchronous worker, but that
> does seem to conflict rather with the purpose of this series...
>
Agreed, but this is not something we can police at this level. If the
caller does an unreasonable amount of work from a softirq, no amount
of yielding is going to make a difference.
> > +
> > + adr_l x0, irq_stat + IRQ_CPUSTAT_SOFTIRQ_PENDING
> > + this_cpu_offset x1
> > + ldr w0, [x0, x1]
> > + cbnz w0, .Lyield_\@ // yield on pending softirq in task context
> > +.Lnoyield_\@:
> > /* fall through to endif_yield_neon */
> > .subsection 1
> > .Lyield_\@ :
> > -#else
> > - .section ".discard.cond_yield_neon", "ax"
> > -#endif
> > .endm
> >
> > .macro do_cond_yield_neon
> > diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> > index 7d32fc959b1a..34ef70877de4 100644
> > --- a/arch/arm64/kernel/asm-offsets.c
> > +++ b/arch/arm64/kernel/asm-offsets.c
> > @@ -93,6 +93,8 @@ int main(void)
> > DEFINE(DMA_FROM_DEVICE, DMA_FROM_DEVICE);
> > BLANK();
> > DEFINE(PREEMPT_DISABLE_OFFSET, PREEMPT_DISABLE_OFFSET);
> > + DEFINE(SOFTIRQ_SHIFT, SOFTIRQ_SHIFT);
> > + DEFINE(IRQ_CPUSTAT_SOFTIRQ_PENDING, offsetof(irq_cpustat_t, __softirq_pending));
> > BLANK();
> > DEFINE(CPU_BOOT_STACK, offsetof(struct secondary_data, stack));
> > DEFINE(CPU_BOOT_TASK, offsetof(struct secondary_data, task));
> > diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> > index 062b21f30f94..823e3a8a8871 100644
> > --- a/arch/arm64/kernel/fpsimd.c
> > +++ b/arch/arm64/kernel/fpsimd.c
> > @@ -180,7 +180,7 @@ static void __get_cpu_fpsimd_context(void)
> > */
> > static void get_cpu_fpsimd_context(void)
> > {
> > - preempt_disable();
> > + local_bh_disable();
> > __get_cpu_fpsimd_context();
> > }
> >
> > @@ -201,7 +201,7 @@ static void __put_cpu_fpsimd_context(void)
> > static void put_cpu_fpsimd_context(void)
> > {
> > __put_cpu_fpsimd_context();
> > - preempt_enable();
> > + local_bh_enable();
> > }
> >
> > static bool have_cpu_fpsimd_context(void)
>
> I was concerned about catching all the relevant preempt_disable()s, but
> it had slipped my memory that Julien had factored these into one place.
>
> I can't see off the top of my head any reason why this shouldn't work.
>
Thanks.
>
> In theory, switching to local_bh_enable() here will add a check for
> pending softirqs onto context handling fast paths. I haven't dug into
> how that works, so perhaps this is trivial on top of the preemption
> check in preempt_enable(). Do you see any difference in hackbench or
> similar benchmarks?
>
I haven't tried, tbh. But by context handling fast paths, you mean
managing the FP/SIMD state at context switch time, right? Checking for
pending softirqs amounts to a single per-CPU load plus compare, so
that should be negligible AFAICT. Obviously, actually handling the
softirq may take additional time, but that penalty has to be taken
somewhere - I don't see how that would create extra work that we
wouldn't have to do otherwise.
I'll do some experiments with hackbench once I get back to this series.
* Re: [RFC PATCH 4/5] arm64: fpsimd: run kernel mode NEON with softirqs disabled
2021-01-19 16:29 ` Ard Biesheuvel
@ 2021-01-20 15:44 ` Dave Martin
2021-02-15 18:30 ` Ard Biesheuvel
From: Dave Martin @ 2021-01-20 15:44 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Herbert Xu, Peter Zijlstra, Catalin Marinas,
Sebastian Andrzej Siewior, Linux Kernel Mailing List,
Eric Biggers, Mark Brown, Linux Crypto Mailing List,
Thomas Gleixner, Will Deacon, Ingo Molnar, Linux ARM
On Tue, Jan 19, 2021 at 05:29:05PM +0100, Ard Biesheuvel wrote:
> On Tue, 19 Jan 2021 at 17:01, Dave Martin <Dave.Martin@arm.com> wrote:
> >
> > On Fri, Dec 18, 2020 at 06:01:05PM +0100, Ard Biesheuvel wrote:
> > > Kernel mode NEON can be used in task or softirq context, but only in
> > > a non-nesting manner, i.e., softirq context is only permitted if the
> > > interrupt was not taken at a point where the kernel was using the NEON
> > > in task context.
> > >
> > > This means all users of kernel mode NEON have to be aware of this
> > > limitation, and either need to provide scalar fallbacks that may be much
> > > slower (up to 20x for AES instructions) and potentially less safe, or
> > > use an asynchronous interface that defers processing to a later time
> > > when the NEON is guaranteed to be available.
> > >
> > > Given that grabbing and releasing the NEON is cheap, we can relax this
> > > restriction, by increasing the granularity of kernel mode NEON code, and
> > > always disabling softirq processing while the NEON is being used in task
> > > context.
> > >
> > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> >
> > Sorry for the slow reply on this... it looks reasonable, but I have a
> > few comments below.
> >
>
> No worries - thanks for taking a look.
>
> > > ---
> > > arch/arm64/include/asm/assembler.h | 19 +++++++++++++------
> > > arch/arm64/kernel/asm-offsets.c | 2 ++
> > > arch/arm64/kernel/fpsimd.c | 4 ++--
> > > 3 files changed, 17 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> > > index ddbe6bf00e33..74ce46ed55ac 100644
> > > --- a/arch/arm64/include/asm/assembler.h
> > > +++ b/arch/arm64/include/asm/assembler.h
> > > @@ -15,6 +15,7 @@
> > > #include <asm-generic/export.h>
> > >
> > > #include <asm/asm-offsets.h>
> > > +#include <asm/alternative.h>
> > > #include <asm/cpufeature.h>
> > > #include <asm/cputype.h>
> > > #include <asm/debug-monitors.h>
> > > @@ -717,17 +718,23 @@ USER(\label, ic ivau, \tmp2) // invalidate I line PoU
> > > .endm
> > >
> > > .macro if_will_cond_yield_neon
> > > -#ifdef CONFIG_PREEMPTION
> > > get_current_task x0
> > > ldr x0, [x0, #TSK_TI_PREEMPT]
> > > - sub x0, x0, #PREEMPT_DISABLE_OFFSET
> > > - cbz x0, .Lyield_\@
> > > +#ifdef CONFIG_PREEMPTION
> > > + cmp x0, #PREEMPT_DISABLE_OFFSET
> > > + beq .Lyield_\@ // yield on need_resched in task context
> > > +#endif
> > > + /* never yield while serving a softirq */
> > > + tbnz x0, #SOFTIRQ_SHIFT, .Lnoyield_\@
> >
> > Can you explain the rationale here?
> >
> > Using if_will_cond_yield_neon suggests the algo thinks it may run for
> > too long and stall preemption until completion, but we happily stall
> > preemption _and_ softirqs here.
> >
> > Is it actually a bug to use the NEON conditional yield helpers in
> > softirq context?
> >
>
> No, it is not. But calling kernel_neon_end() from softirq context will
> not cause it to finish any faster, so there is really no point in
> doing so.
>
> > Ideally, if processing in softirq context takes an unreasonable amount of
> > time, the work would be handed off to an asynchronous worker, but that
> > does seem to conflict rather with the purpose of this series...
> >
>
> Agreed, but this is not something we can police at this level. If the
> caller does an unreasonable amount of work from a softirq, no amount
> of yielding is going to make a difference.
Ack, just wanted to make sure I wasn't missing something.
Anyone writing softirq code can starve preemption, so I agree that we
should trust people to know what they're doing.
> > > +
> > > + adr_l x0, irq_stat + IRQ_CPUSTAT_SOFTIRQ_PENDING
> > > + this_cpu_offset x1
> > > + ldr w0, [x0, x1]
> > > + cbnz w0, .Lyield_\@ // yield on pending softirq in task context
> > > +.Lnoyield_\@:
> > > /* fall through to endif_yield_neon */
> > > .subsection 1
> > > .Lyield_\@ :
> > > -#else
> > > - .section ".discard.cond_yield_neon", "ax"
> > > -#endif
> > > .endm
> > >
> > > .macro do_cond_yield_neon
> > > diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> > > index 7d32fc959b1a..34ef70877de4 100644
> > > --- a/arch/arm64/kernel/asm-offsets.c
> > > +++ b/arch/arm64/kernel/asm-offsets.c
> > > @@ -93,6 +93,8 @@ int main(void)
> > > DEFINE(DMA_FROM_DEVICE, DMA_FROM_DEVICE);
> > > BLANK();
> > > DEFINE(PREEMPT_DISABLE_OFFSET, PREEMPT_DISABLE_OFFSET);
> > > + DEFINE(SOFTIRQ_SHIFT, SOFTIRQ_SHIFT);
> > > + DEFINE(IRQ_CPUSTAT_SOFTIRQ_PENDING, offsetof(irq_cpustat_t, __softirq_pending));
> > > BLANK();
> > > DEFINE(CPU_BOOT_STACK, offsetof(struct secondary_data, stack));
> > > DEFINE(CPU_BOOT_TASK, offsetof(struct secondary_data, task));
> > > diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> > > index 062b21f30f94..823e3a8a8871 100644
> > > --- a/arch/arm64/kernel/fpsimd.c
> > > +++ b/arch/arm64/kernel/fpsimd.c
> > > @@ -180,7 +180,7 @@ static void __get_cpu_fpsimd_context(void)
> > > */
> > > static void get_cpu_fpsimd_context(void)
> > > {
> > > - preempt_disable();
> > > + local_bh_disable();
> > > __get_cpu_fpsimd_context();
> > > }
> > >
> > > @@ -201,7 +201,7 @@ static void __put_cpu_fpsimd_context(void)
> > > static void put_cpu_fpsimd_context(void)
> > > {
> > > __put_cpu_fpsimd_context();
> > > - preempt_enable();
> > > + local_bh_enable();
> > > }
> > >
> > > static bool have_cpu_fpsimd_context(void)
> >
> > I was concerned about catching all the relevant preempt_disable()s, but
> > it had slipped my memory that Julien had factored these into one place.
> >
> > I can't see off the top of my head any reason why this shouldn't work.
> >
>
> Thanks.
>
> >
> > In theory, switching to local_bh_enable() here will add a check for
> > pending softirqs onto context handling fast paths. I haven't dug into
> > how that works, so perhaps this is trivial on top of the preemption
> > check in preempt_enable(). Do you see any difference in hackbench or
> > similar benchmarks?
> >
>
> I haven't tried, tbh. But by context handling fast paths, you mean
> managing the FP/SIMD state at context switch time, right? Checking for
> pending softirqs amounts to a single per-CPU load plus compare, so
> that should be negligible AFAICT. Obviously, actually handling the
Yes. I've tended to assume, rather than prove, that this kind of thing
is negligible -- so I confess I had not attempted to measure these
effects when writing the original code.
> softirq may take additional time, but that penalty has to be taken
> somewhere - I don't see how that would create extra work that we
> wouldn't have to do otherwise.
>
> I'll do some experiments with hackbench once I get back to this series.
That sounds fine.
Probably you won't find a significant difference anyway.
Cheers
---Dave
* Re: [RFC PATCH 4/5] arm64: fpsimd: run kernel mode NEON with softirqs disabled
2021-01-20 15:44 ` Dave Martin
@ 2021-02-15 18:30 ` Ard Biesheuvel
From: Ard Biesheuvel @ 2021-02-15 18:30 UTC (permalink / raw)
To: Dave Martin
Cc: Herbert Xu, Peter Zijlstra, Catalin Marinas,
Sebastian Andrzej Siewior, Linux Kernel Mailing List,
Eric Biggers, Mark Brown, Linux Crypto Mailing List,
Thomas Gleixner, Will Deacon, Ingo Molnar, Linux ARM
On Wed, 20 Jan 2021 at 16:44, Dave Martin <Dave.Martin@arm.com> wrote:
>
> On Tue, Jan 19, 2021 at 05:29:05PM +0100, Ard Biesheuvel wrote:
> > On Tue, 19 Jan 2021 at 17:01, Dave Martin <Dave.Martin@arm.com> wrote:
> > >
> > > On Fri, Dec 18, 2020 at 06:01:05PM +0100, Ard Biesheuvel wrote:
> > > > Kernel mode NEON can be used in task or softirq context, but only in
> > > > a non-nesting manner, i.e., softirq context is only permitted if the
> > > > interrupt was not taken at a point where the kernel was using the NEON
> > > > in task context.
> > > >
> > > > This means all users of kernel mode NEON have to be aware of this
> > > > limitation, and either need to provide scalar fallbacks that may be much
> > > > slower (up to 20x for AES instructions) and potentially less safe, or
> > > > use an asynchronous interface that defers processing to a later time
> > > > when the NEON is guaranteed to be available.
> > > >
> > > > Given that grabbing and releasing the NEON is cheap, we can relax this
> > > > restriction, by increasing the granularity of kernel mode NEON code, and
> > > > always disabling softirq processing while the NEON is being used in task
> > > > context.
> > > >
> > > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > >
> > > Sorry for the slow reply on this... it looks reasonable, but I have a
> > > few comments below.
> > >
> >
> > No worries - thanks for taking a look.
> >
> > > > ---
> > > > arch/arm64/include/asm/assembler.h | 19 +++++++++++++------
> > > > arch/arm64/kernel/asm-offsets.c | 2 ++
> > > > arch/arm64/kernel/fpsimd.c | 4 ++--
> > > > 3 files changed, 17 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> > > > index ddbe6bf00e33..74ce46ed55ac 100644
> > > > --- a/arch/arm64/include/asm/assembler.h
> > > > +++ b/arch/arm64/include/asm/assembler.h
> > > > @@ -15,6 +15,7 @@
> > > > #include <asm-generic/export.h>
> > > >
> > > > #include <asm/asm-offsets.h>
> > > > +#include <asm/alternative.h>
> > > > #include <asm/cpufeature.h>
> > > > #include <asm/cputype.h>
> > > > #include <asm/debug-monitors.h>
> > > > @@ -717,17 +718,23 @@ USER(\label, ic ivau, \tmp2) // invalidate I line PoU
> > > > .endm
> > > >
> > > > .macro if_will_cond_yield_neon
> > > > -#ifdef CONFIG_PREEMPTION
> > > > get_current_task x0
> > > > ldr x0, [x0, #TSK_TI_PREEMPT]
> > > > - sub x0, x0, #PREEMPT_DISABLE_OFFSET
> > > > - cbz x0, .Lyield_\@
> > > > +#ifdef CONFIG_PREEMPTION
> > > > + cmp x0, #PREEMPT_DISABLE_OFFSET
> > > > + beq .Lyield_\@ // yield on need_resched in task context
> > > > +#endif
> > > > + /* never yield while serving a softirq */
> > > > + tbnz x0, #SOFTIRQ_SHIFT, .Lnoyield_\@
> > >
> > > Can you explain the rationale here?
> > >
> > > Using if_will_cond_yield_neon suggests the algo thinks it may run for
> > > too long and stall preemption until completion, but we happily stall
> > > preemption _and_ softirqs here.
> > >
> > > Is it actually a bug to use the NEON conditional yield helpers in
> > > softirq context?
> > >
> >
> > No, it is not. But calling kernel_neon_end() from softirq context will
> > not cause it to finish any faster, so there is really no point in
> > doing so.
> >
> > > Ideally, if processing in softirq context takes an unreasonable amount of
> > > time, the work would be handed off to an asynchronous worker, but that
> > > does seem to conflict rather with the purpose of this series...
> > >
> >
> > Agreed, but this is not something we can police at this level. If the
> > caller does an unreasonable amount of work from a softirq, no amount
> > of yielding is going to make a difference.
>
> Ack, just wanted to make sure I wasn't missing something.
>
> Anyone writing softirq code can starve preemption, so I agree that we
> should trust people to know what they're doing.
>
>
> > > > +
> > > > + adr_l x0, irq_stat + IRQ_CPUSTAT_SOFTIRQ_PENDING
> > > > + this_cpu_offset x1
> > > > + ldr w0, [x0, x1]
> > > > + cbnz w0, .Lyield_\@ // yield on pending softirq in task context
> > > > +.Lnoyield_\@:
> > > > /* fall through to endif_yield_neon */
> > > > .subsection 1
> > > > .Lyield_\@ :
> > > > -#else
> > > > - .section ".discard.cond_yield_neon", "ax"
> > > > -#endif
> > > > .endm
> > > >
> > > > .macro do_cond_yield_neon
> > > > diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> > > > index 7d32fc959b1a..34ef70877de4 100644
> > > > --- a/arch/arm64/kernel/asm-offsets.c
> > > > +++ b/arch/arm64/kernel/asm-offsets.c
> > > > @@ -93,6 +93,8 @@ int main(void)
> > > > DEFINE(DMA_FROM_DEVICE, DMA_FROM_DEVICE);
> > > > BLANK();
> > > > DEFINE(PREEMPT_DISABLE_OFFSET, PREEMPT_DISABLE_OFFSET);
> > > > + DEFINE(SOFTIRQ_SHIFT, SOFTIRQ_SHIFT);
> > > > + DEFINE(IRQ_CPUSTAT_SOFTIRQ_PENDING, offsetof(irq_cpustat_t, __softirq_pending));
> > > > BLANK();
> > > > DEFINE(CPU_BOOT_STACK, offsetof(struct secondary_data, stack));
> > > > DEFINE(CPU_BOOT_TASK, offsetof(struct secondary_data, task));
> > > > diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> > > > index 062b21f30f94..823e3a8a8871 100644
> > > > --- a/arch/arm64/kernel/fpsimd.c
> > > > +++ b/arch/arm64/kernel/fpsimd.c
> > > > @@ -180,7 +180,7 @@ static void __get_cpu_fpsimd_context(void)
> > > > */
> > > > static void get_cpu_fpsimd_context(void)
> > > > {
> > > > - preempt_disable();
> > > > + local_bh_disable();
> > > > __get_cpu_fpsimd_context();
> > > > }
> > > >
> > > > @@ -201,7 +201,7 @@ static void __put_cpu_fpsimd_context(void)
> > > > static void put_cpu_fpsimd_context(void)
> > > > {
> > > > __put_cpu_fpsimd_context();
> > > > - preempt_enable();
> > > > + local_bh_enable();
> > > > }
> > > >
> > > > static bool have_cpu_fpsimd_context(void)
> > >
> > > I was concerned about catching all the relevant preempt_disable()s, but
> > > it had slipped my memory that Julien had factored these into one place.
> > >
> > > I can't see off the top of my head any reason why this shouldn't work.
> > >
> >
> > Thanks.
> >
> > >
> > > In theory, switching to local_bh_enable() here will add a check for
> > > pending softirqs onto context handling fast paths. I haven't dug into
> > > how that works, so perhaps this is trivial on top of the preemption
> > > check in preempt_enable(). Do you see any difference in hackbench or
> > > similar benchmarks?
> > >
> >
> > I haven't tried, tbh. But by context handling fast paths, you mean
> > managing the FP/SIMD state at context switch time, right? Checking for
> > pending softirqs amounts to a single per-CPU load plus compare, so
> > that should be negligible AFAICT. Obviously, actually handling the
>
> Yes. I've tended to assume, rather than prove, that this kind of thing
> is negligible -- so I confess I had not attempted to measure these
> effects when writing the original code.
>
> > softirq may take additional time, but that penalty has to be taken
> > somewhere - I don't see how that would create extra work that we
> > wouldn't have to do otherwise.
> >
> > I'll do some experiments with hackbench once I get back to this series.
>
> That sounds fine.
>
> Probably you won't find a significant difference anyway.
>
Finally got around to trying this: as expected, I don't see any
difference at all between the two versions (tested on TX2)
* Re: [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled
2020-12-18 17:01 [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled Ard Biesheuvel
2020-12-19 2:04 ` [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled Herbert Xu
@ 2021-02-16 10:09 ` Peter Zijlstra
2021-02-16 10:35 ` Ard Biesheuvel
From: Peter Zijlstra @ 2021-02-16 10:09 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: linux-crypto, linux-arm-kernel, linux-kernel, Dave Martin,
Mark Brown, Herbert Xu, Eric Biggers, Will Deacon,
Catalin Marinas, Thomas Gleixner, Sebastian Andrzej Siewior,
Ingo Molnar
On Fri, Dec 18, 2020 at 06:01:01PM +0100, Ard Biesheuvel wrote:
> [ TL;DR for the non-ARM folks on CC: disabling softirq processing when using
> SIMD in kernel mode could reduce complexity and improve performance, but we
> need to decide whether we can do this, and how much softirq processing
> latency we can tolerate. If we can find a satisfactory solution for this,
> we might do the same for x86 and 32-bit ARM as well. ]
> - could we do the same on x86, now that kernel_fpu_begin/end is no longer
> expensive?
Can't we simply save/restore the relevant register set?
So something like (note amluto was wanting to add a regset argument):
<task>
  kernel_fpu_begin(MMX)
  <SIRQ>
    kernel_fpu_begin(SSE)
    kernel_fpu_end();
  </SIRQ>
  ...
  kernel_fpu_end()
Would have to save the MMX regs on the first SIRQ invocation of
kernel_fpu_begin(), and then have softirq context termination (</SIRQ>
above) restore it.
I mean, we already do much the same for the first kernel_fpu_begin(),
that has to save the user registers, which will be restore when we go
back to userspace.
So why not do exactly the same for softirq context?
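Roughly, and entirely hypothetically (the regset argument, the fpu_state_buf
type and the save/restore helpers below don't exist in the current x86 code):

  static DEFINE_PER_CPU(struct fpu_state_buf, sirq_fpu_save);
  static DEFINE_PER_CPU(bool, sirq_fpu_saved);

  void kernel_fpu_begin(int regset)       /* hypothetical signature */
  {
          /* first use from softirq: stash the interrupted task state */
          if (in_serving_softirq() && !__this_cpu_read(sirq_fpu_saved)) {
                  save_fpu_regs(this_cpu_ptr(&sirq_fpu_save), regset);
                  __this_cpu_write(sirq_fpu_saved, true);
          }
          /* ... existing begin logic ... */
  }

  /* called on softirq context termination, i.e. at </SIRQ> above */
  void sirq_fpu_restore(void)
  {
          if (__this_cpu_read(sirq_fpu_saved)) {
                  restore_fpu_regs(this_cpu_ptr(&sirq_fpu_save));
                  __this_cpu_write(sirq_fpu_saved, false);
          }
  }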
* Re: [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled
2021-02-16 10:09 ` Peter Zijlstra
@ 2021-02-16 10:35 ` Ard Biesheuvel
0 siblings, 0 replies; 14+ messages in thread
From: Ard Biesheuvel @ 2021-02-16 10:35 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linux Crypto Mailing List, Linux ARM, Linux Kernel Mailing List,
Dave Martin, Mark Brown, Herbert Xu, Eric Biggers, Will Deacon,
Catalin Marinas, Thomas Gleixner, Sebastian Andrzej Siewior,
Ingo Molnar
On Tue, 16 Feb 2021 at 11:10, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Dec 18, 2020 at 06:01:01PM +0100, Ard Biesheuvel wrote:
> > [ TL;DR for the non-ARM folks on CC: disabling softirq processing when using
> > SIMD in kernel mode could reduce complexity and improve performance, but we
> > need to decide whether we can do this, and how much softirq processing
> > latency we can tolerate. If we can find a satisfactory solution for this,
> > we might do the same for x86 and 32-bit ARM as well. ]
>
> > - could we do the same on x86, now that kernel_fpu_begin/end is no longer
> > expensive?
>
> Can't we simply save/restore the relevant register set?
>
> So something like (note amluto was wanting to add a regset argument):
>
> <task>
>   kernel_fpu_begin(MMX)
>   <SIRQ>
>     kernel_fpu_begin(SSE)
>     kernel_fpu_end();
>   </SIRQ>
>   ...
>   kernel_fpu_end()
>
> Would have to save the MMX regs on the first SIRQ invocation of
> kernel_fpu_begin(), and then have softirq context termination (</SIRQ>
> above) restore it.
>
> I mean, we already do much the same for the first kernel_fpu_begin(),
> that has to save the user registers, which will be restore when we go
> back to userspace.
>
> So why not do exactly the same for softirq context?
That is what we originally had on arm64, with per-CPU buffers of the
appropriate size. This became a bit messy when SVE support was added,
because the register file is so large (32 registers of up to 2048 bits
each), and since the kernel does not use SVE itself, we want the inner
per-CPU buffer to only cover 128 bits per register. This means we
cannot allow the <sirq></sirq> region above to interrupt the outer
preserve (which is implemented entirely in software), since resuming
the outer preserve after a sirq would preserve content that was
corrupted by the inner preserve/restore. This could be addressed by
disabling interrupts across the outer preserve, but this caused a
problem somewhere else (Dave might remember the details better than I
do). So we ended up making SIMD in task context mutually exclusive
with SIMD in softirq context, also because that is what 32-bit ARM and
x86 were already doing.
But I understand that these concerns may not apply to x86 at all, so
perhaps the answer there is indeed to have an alternate buffer. And
actually, Andy L. suggested the same when I asked him about it on IRC.
end of thread
Thread overview: 14+ messages
2020-12-18 17:01 [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled Ard Biesheuvel
2020-12-18 17:01 ` [RFC PATCH 1/5] crypto: aead - disallow en/decrypt for non-task or non-softirq context Ard Biesheuvel
2020-12-18 17:01 ` [RFC PATCH 2/5] crypto: skcipher " Ard Biesheuvel
2020-12-18 17:01 ` [RFC PATCH 3/5] crypto: arm64/gcm-aes-ce - add NEON yield support Ard Biesheuvel
2020-12-18 17:01 ` [RFC PATCH 4/5] arm64: fpsimd: run kernel mode NEON with softirqs disabled Ard Biesheuvel
2021-01-19 16:00 ` Dave Martin
2021-01-19 16:29 ` Ard Biesheuvel
2021-01-20 15:44 ` Dave Martin
2021-02-15 18:30 ` Ard Biesheuvel
2020-12-18 17:01 ` [RFC PATCH 5/5] crypto: arm64/gcm-aes-ce - remove non-SIMD fallback path Ard Biesheuvel
2020-12-19 2:04 ` [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled Herbert Xu
2021-01-14 8:22 ` Ard Biesheuvel
2021-02-16 10:09 ` Peter Zijlstra
2021-02-16 10:35 ` Ard Biesheuvel