* [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
@ 2017-12-01 21:19 ` Ard Biesheuvel
0 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2017-12-01 21:19 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
As reported by Sebastian, the way the arm64 NEON crypto code currently
keeps kernel mode NEON enabled across calls into skcipher_walk_xxx() is
causing problems with RT builds, given that the skcipher walk API may
allocate and free temporary buffers it uses to present the input and
output arrays to the crypto algorithm in blocksize sized chunks (where
blocksize is the natural blocksize of the crypto algorithm), and doing
so with NEON enabled means we're alloc/free'ing memory with preemption
disabled.
This was deliberate: when this code was introduced, each kernel_neon_begin()
and kernel_neon_end() call incurred a fixed penalty of storing resp.
loading the contents of all NEON registers to/from memory, and so doing
it less often had an obvious performance benefit. However, in the mean time,
we have refactored the core kernel mode NEON code, and now kernel_neon_begin()
only incurs this penalty the first time it is called after entering the kernel,
and the NEON register restore is deferred until returning to userland. This
means pulling those calls into the loops that iterate over the input/output
of the crypto algorithm is not a big deal anymore (although there are some
places in the code where we relied on the NEON registers retaining their
values between calls)
So let's clean this up for arm64: update the NEON based skcipher drivers to
no longer keep the NEON enabled when calling into the skcipher walk API.
Note that the remaining crypto drivers simply operate on fixed buffers, so
while the RT crowd may still feel the need to disable those (and the ones
below as well, perhaps), they don't call back into the crypto layer like
the ones updated by this series, and so there's no room for improvement
there AFAICT.
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-rt-users@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Ard Biesheuvel (5):
crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop
crypto: arm64/aes-blk - move kernel mode neon en/disable into loop
crypto: arm64/aes-bs - move kernel mode neon en/disable into loop
crypto: arm64/chacha20 - move kernel mode neon en/disable into loop
crypto: arm64/ghash - move kernel mode neon en/disable into loop
arch/arm64/crypto/aes-ce-ccm-glue.c | 47 +++++-----
arch/arm64/crypto/aes-glue.c | 81 +++++++++---------
arch/arm64/crypto/aes-modes.S | 90 ++++++++++----------
arch/arm64/crypto/aes-neonbs-glue.c | 38 ++++-----
arch/arm64/crypto/chacha20-neon-glue.c | 4 +-
arch/arm64/crypto/ghash-ce-glue.c | 9 +-
6 files changed, 132 insertions(+), 137 deletions(-)
--
2.11.0
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
@ 2017-12-01 21:19 ` Ard Biesheuvel
0 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2017-12-01 21:19 UTC (permalink / raw)
To: linux-arm-kernel
As reported by Sebastian, the way the arm64 NEON crypto code currently
keeps kernel mode NEON enabled across calls into skcipher_walk_xxx() is
causing problems with RT builds, given that the skcipher walk API may
allocate and free temporary buffers it uses to present the input and
output arrays to the crypto algorithm in blocksize sized chunks (where
blocksize is the natural blocksize of the crypto algorithm), and doing
so with NEON enabled means we're alloc/free'ing memory with preemption
disabled.
This was deliberate: when this code was introduced, each kernel_neon_begin()
and kernel_neon_end() call incurred a fixed penalty of storing resp.
loading the contents of all NEON registers to/from memory, and so doing
it less often had an obvious performance benefit. However, in the mean time,
we have refactored the core kernel mode NEON code, and now kernel_neon_begin()
only incurs this penalty the first time it is called after entering the kernel,
and the NEON register restore is deferred until returning to userland. This
means pulling those calls into the loops that iterate over the input/output
of the crypto algorithm is not a big deal anymore (although there are some
places in the code where we relied on the NEON registers retaining their
values between calls)
So let's clean this up for arm64: update the NEON based skcipher drivers to
no longer keep the NEON enabled when calling into the skcipher walk API.
Note that the remaining crypto drivers simply operate on fixed buffers, so
while the RT crowd may still feel the need to disable those (and the ones
below as well, perhaps), they don't call back into the crypto layer like
the ones updated by this series, and so there's no room for improvement
there AFAICT.
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-rt-users at vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Ard Biesheuvel (5):
crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop
crypto: arm64/aes-blk - move kernel mode neon en/disable into loop
crypto: arm64/aes-bs - move kernel mode neon en/disable into loop
crypto: arm64/chacha20 - move kernel mode neon en/disable into loop
crypto: arm64/ghash - move kernel mode neon en/disable into loop
arch/arm64/crypto/aes-ce-ccm-glue.c | 47 +++++-----
arch/arm64/crypto/aes-glue.c | 81 +++++++++---------
arch/arm64/crypto/aes-modes.S | 90 ++++++++++----------
arch/arm64/crypto/aes-neonbs-glue.c | 38 ++++-----
arch/arm64/crypto/chacha20-neon-glue.c | 4 +-
arch/arm64/crypto/ghash-ce-glue.c | 9 +-
6 files changed, 132 insertions(+), 137 deletions(-)
--
2.11.0
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 1/5] crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop
2017-12-01 21:19 ` Ard Biesheuvel
@ 2017-12-01 21:19 ` Ard Biesheuvel
-1 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2017-12-01 21:19 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.
Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.
So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled)
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-ce-ccm-glue.c | 47 ++++++++++----------
1 file changed, 23 insertions(+), 24 deletions(-)
diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
index a1254036f2b1..68b11aa690e4 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -107,11 +107,13 @@ static int ccm_init_mac(struct aead_request *req, u8 maciv[], u32 msglen)
}
static void ccm_update_mac(struct crypto_aes_ctx *key, u8 mac[], u8 const in[],
- u32 abytes, u32 *macp, bool use_neon)
+ u32 abytes, u32 *macp)
{
- if (likely(use_neon)) {
+ if (may_use_simd()) {
+ kernel_neon_begin();
ce_aes_ccm_auth_data(mac, in, abytes, macp, key->key_enc,
num_rounds(key));
+ kernel_neon_end();
} else {
if (*macp > 0 && *macp < AES_BLOCK_SIZE) {
int added = min(abytes, AES_BLOCK_SIZE - *macp);
@@ -143,8 +145,7 @@ static void ccm_update_mac(struct crypto_aes_ctx *key, u8 mac[], u8 const in[],
}
}
-static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[],
- bool use_neon)
+static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[])
{
struct crypto_aead *aead = crypto_aead_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_aead_ctx(aead);
@@ -163,7 +164,7 @@ static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[],
ltag.len = 6;
}
- ccm_update_mac(ctx, mac, (u8 *)<ag, ltag.len, &macp, use_neon);
+ ccm_update_mac(ctx, mac, (u8 *)<ag, ltag.len, &macp);
scatterwalk_start(&walk, req->src);
do {
@@ -175,7 +176,7 @@ static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[],
n = scatterwalk_clamp(&walk, len);
}
p = scatterwalk_map(&walk);
- ccm_update_mac(ctx, mac, p, n, &macp, use_neon);
+ ccm_update_mac(ctx, mac, p, n, &macp);
len -= n;
scatterwalk_unmap(p);
@@ -242,43 +243,42 @@ static int ccm_encrypt(struct aead_request *req)
u8 __aligned(8) mac[AES_BLOCK_SIZE];
u8 buf[AES_BLOCK_SIZE];
u32 len = req->cryptlen;
- bool use_neon = may_use_simd();
int err;
err = ccm_init_mac(req, mac, len);
if (err)
return err;
- if (likely(use_neon))
- kernel_neon_begin();
-
if (req->assoclen)
- ccm_calculate_auth_mac(req, mac, use_neon);
+ ccm_calculate_auth_mac(req, mac);
/* preserve the original iv for the final round */
memcpy(buf, req->iv, AES_BLOCK_SIZE);
err = skcipher_walk_aead_encrypt(&walk, req, true);
- if (likely(use_neon)) {
+ if (may_use_simd()) {
while (walk.nbytes) {
u32 tail = walk.nbytes % AES_BLOCK_SIZE;
if (walk.nbytes == walk.total)
tail = 0;
+ kernel_neon_begin();
ce_aes_ccm_encrypt(walk.dst.virt.addr,
walk.src.virt.addr,
walk.nbytes - tail, ctx->key_enc,
num_rounds(ctx), mac, walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, tail);
}
- if (!err)
+ if (!err) {
+ kernel_neon_begin();
ce_aes_ccm_final(mac, buf, ctx->key_enc,
num_rounds(ctx));
-
- kernel_neon_end();
+ kernel_neon_end();
+ }
} else {
err = ccm_crypt_fallback(&walk, mac, buf, ctx, true);
}
@@ -301,43 +301,42 @@ static int ccm_decrypt(struct aead_request *req)
u8 __aligned(8) mac[AES_BLOCK_SIZE];
u8 buf[AES_BLOCK_SIZE];
u32 len = req->cryptlen - authsize;
- bool use_neon = may_use_simd();
int err;
err = ccm_init_mac(req, mac, len);
if (err)
return err;
- if (likely(use_neon))
- kernel_neon_begin();
-
if (req->assoclen)
- ccm_calculate_auth_mac(req, mac, use_neon);
+ ccm_calculate_auth_mac(req, mac);
/* preserve the original iv for the final round */
memcpy(buf, req->iv, AES_BLOCK_SIZE);
err = skcipher_walk_aead_decrypt(&walk, req, true);
- if (likely(use_neon)) {
+ if (may_use_simd()) {
while (walk.nbytes) {
u32 tail = walk.nbytes % AES_BLOCK_SIZE;
if (walk.nbytes == walk.total)
tail = 0;
+ kernel_neon_begin();
ce_aes_ccm_decrypt(walk.dst.virt.addr,
walk.src.virt.addr,
walk.nbytes - tail, ctx->key_enc,
num_rounds(ctx), mac, walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, tail);
}
- if (!err)
+ if (!err) {
+ kernel_neon_begin();
ce_aes_ccm_final(mac, buf, ctx->key_enc,
num_rounds(ctx));
-
- kernel_neon_end();
+ kernel_neon_end();
+ }
} else {
err = ccm_crypt_fallback(&walk, mac, buf, ctx, false);
}
--
2.11.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 1/5] crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop
@ 2017-12-01 21:19 ` Ard Biesheuvel
0 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2017-12-01 21:19 UTC (permalink / raw)
To: linux-arm-kernel
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.
Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.
So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled)
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-ce-ccm-glue.c | 47 ++++++++++----------
1 file changed, 23 insertions(+), 24 deletions(-)
diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
index a1254036f2b1..68b11aa690e4 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -107,11 +107,13 @@ static int ccm_init_mac(struct aead_request *req, u8 maciv[], u32 msglen)
}
static void ccm_update_mac(struct crypto_aes_ctx *key, u8 mac[], u8 const in[],
- u32 abytes, u32 *macp, bool use_neon)
+ u32 abytes, u32 *macp)
{
- if (likely(use_neon)) {
+ if (may_use_simd()) {
+ kernel_neon_begin();
ce_aes_ccm_auth_data(mac, in, abytes, macp, key->key_enc,
num_rounds(key));
+ kernel_neon_end();
} else {
if (*macp > 0 && *macp < AES_BLOCK_SIZE) {
int added = min(abytes, AES_BLOCK_SIZE - *macp);
@@ -143,8 +145,7 @@ static void ccm_update_mac(struct crypto_aes_ctx *key, u8 mac[], u8 const in[],
}
}
-static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[],
- bool use_neon)
+static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[])
{
struct crypto_aead *aead = crypto_aead_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_aead_ctx(aead);
@@ -163,7 +164,7 @@ static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[],
ltag.len = 6;
}
- ccm_update_mac(ctx, mac, (u8 *)<ag, ltag.len, &macp, use_neon);
+ ccm_update_mac(ctx, mac, (u8 *)<ag, ltag.len, &macp);
scatterwalk_start(&walk, req->src);
do {
@@ -175,7 +176,7 @@ static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[],
n = scatterwalk_clamp(&walk, len);
}
p = scatterwalk_map(&walk);
- ccm_update_mac(ctx, mac, p, n, &macp, use_neon);
+ ccm_update_mac(ctx, mac, p, n, &macp);
len -= n;
scatterwalk_unmap(p);
@@ -242,43 +243,42 @@ static int ccm_encrypt(struct aead_request *req)
u8 __aligned(8) mac[AES_BLOCK_SIZE];
u8 buf[AES_BLOCK_SIZE];
u32 len = req->cryptlen;
- bool use_neon = may_use_simd();
int err;
err = ccm_init_mac(req, mac, len);
if (err)
return err;
- if (likely(use_neon))
- kernel_neon_begin();
-
if (req->assoclen)
- ccm_calculate_auth_mac(req, mac, use_neon);
+ ccm_calculate_auth_mac(req, mac);
/* preserve the original iv for the final round */
memcpy(buf, req->iv, AES_BLOCK_SIZE);
err = skcipher_walk_aead_encrypt(&walk, req, true);
- if (likely(use_neon)) {
+ if (may_use_simd()) {
while (walk.nbytes) {
u32 tail = walk.nbytes % AES_BLOCK_SIZE;
if (walk.nbytes == walk.total)
tail = 0;
+ kernel_neon_begin();
ce_aes_ccm_encrypt(walk.dst.virt.addr,
walk.src.virt.addr,
walk.nbytes - tail, ctx->key_enc,
num_rounds(ctx), mac, walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, tail);
}
- if (!err)
+ if (!err) {
+ kernel_neon_begin();
ce_aes_ccm_final(mac, buf, ctx->key_enc,
num_rounds(ctx));
-
- kernel_neon_end();
+ kernel_neon_end();
+ }
} else {
err = ccm_crypt_fallback(&walk, mac, buf, ctx, true);
}
@@ -301,43 +301,42 @@ static int ccm_decrypt(struct aead_request *req)
u8 __aligned(8) mac[AES_BLOCK_SIZE];
u8 buf[AES_BLOCK_SIZE];
u32 len = req->cryptlen - authsize;
- bool use_neon = may_use_simd();
int err;
err = ccm_init_mac(req, mac, len);
if (err)
return err;
- if (likely(use_neon))
- kernel_neon_begin();
-
if (req->assoclen)
- ccm_calculate_auth_mac(req, mac, use_neon);
+ ccm_calculate_auth_mac(req, mac);
/* preserve the original iv for the final round */
memcpy(buf, req->iv, AES_BLOCK_SIZE);
err = skcipher_walk_aead_decrypt(&walk, req, true);
- if (likely(use_neon)) {
+ if (may_use_simd()) {
while (walk.nbytes) {
u32 tail = walk.nbytes % AES_BLOCK_SIZE;
if (walk.nbytes == walk.total)
tail = 0;
+ kernel_neon_begin();
ce_aes_ccm_decrypt(walk.dst.virt.addr,
walk.src.virt.addr,
walk.nbytes - tail, ctx->key_enc,
num_rounds(ctx), mac, walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, tail);
}
- if (!err)
+ if (!err) {
+ kernel_neon_begin();
ce_aes_ccm_final(mac, buf, ctx->key_enc,
num_rounds(ctx));
-
- kernel_neon_end();
+ kernel_neon_end();
+ }
} else {
err = ccm_crypt_fallback(&walk, mac, buf, ctx, false);
}
--
2.11.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 2/5] crypto: arm64/aes-blk - move kernel mode neon en/disable into loop
2017-12-01 21:19 ` Ard Biesheuvel
@ 2017-12-01 21:19 ` Ard Biesheuvel
-1 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2017-12-01 21:19 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.
Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.
So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled)
Note that this requires some reshuffling of the registers in the asm
code, because the XTS routines can no longer rely on the registers to
retain their contents between invocations.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-glue.c | 81 +++++++++---------
arch/arm64/crypto/aes-modes.S | 90 ++++++++++----------
arch/arm64/crypto/aes-neonbs-glue.c | 14 ++-
3 files changed, 90 insertions(+), 95 deletions(-)
diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 998ba519a026..c983a2bfebe3 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -64,17 +64,17 @@ MODULE_LICENSE("GPL v2");
/* defined in aes-modes.S */
asmlinkage void aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[],
- int rounds, int blocks, int first);
+ int rounds, int blocks);
asmlinkage void aes_ecb_decrypt(u8 out[], u8 const in[], u8 const rk[],
- int rounds, int blocks, int first);
+ int rounds, int blocks);
asmlinkage void aes_cbc_encrypt(u8 out[], u8 const in[], u8 const rk[],
- int rounds, int blocks, u8 iv[], int first);
+ int rounds, int blocks, u8 iv[]);
asmlinkage void aes_cbc_decrypt(u8 out[], u8 const in[], u8 const rk[],
- int rounds, int blocks, u8 iv[], int first);
+ int rounds, int blocks, u8 iv[]);
asmlinkage void aes_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[],
- int rounds, int blocks, u8 ctr[], int first);
+ int rounds, int blocks, u8 ctr[]);
asmlinkage void aes_xts_encrypt(u8 out[], u8 const in[], u8 const rk1[],
int rounds, int blocks, u8 const rk2[], u8 iv[],
@@ -133,19 +133,19 @@ static int ecb_encrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
- int err, first, rounds = 6 + ctx->key_length / 4;
+ int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
err = skcipher_walk_virt(&walk, req, true);
- kernel_neon_begin();
- for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
aes_ecb_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_enc, rounds, blocks, first);
+ (u8 *)ctx->key_enc, rounds, blocks);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -153,19 +153,19 @@ static int ecb_decrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
- int err, first, rounds = 6 + ctx->key_length / 4;
+ int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
err = skcipher_walk_virt(&walk, req, true);
- kernel_neon_begin();
- for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
aes_ecb_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_dec, rounds, blocks, first);
+ (u8 *)ctx->key_dec, rounds, blocks);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -173,20 +173,19 @@ static int cbc_encrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
- int err, first, rounds = 6 + ctx->key_length / 4;
+ int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
err = skcipher_walk_virt(&walk, req, true);
- kernel_neon_begin();
- for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_enc, rounds, blocks, walk.iv,
- first);
+ (u8 *)ctx->key_enc, rounds, blocks, walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -194,20 +193,19 @@ static int cbc_decrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
- int err, first, rounds = 6 + ctx->key_length / 4;
+ int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
err = skcipher_walk_virt(&walk, req, true);
- kernel_neon_begin();
- for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
aes_cbc_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_dec, rounds, blocks, walk.iv,
- first);
+ (u8 *)ctx->key_dec, rounds, blocks, walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -215,20 +213,18 @@ static int ctr_encrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
- int err, first, rounds = 6 + ctx->key_length / 4;
+ int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
int blocks;
err = skcipher_walk_virt(&walk, req, true);
- first = 1;
- kernel_neon_begin();
while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
aes_ctr_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_enc, rounds, blocks, walk.iv,
- first);
+ (u8 *)ctx->key_enc, rounds, blocks, walk.iv);
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
- first = 0;
+ kernel_neon_end();
}
if (walk.nbytes) {
u8 __aligned(8) tail[AES_BLOCK_SIZE];
@@ -241,12 +237,13 @@ static int ctr_encrypt(struct skcipher_request *req)
*/
blocks = -1;
+ kernel_neon_begin();
aes_ctr_encrypt(tail, NULL, (u8 *)ctx->key_enc, rounds,
- blocks, walk.iv, first);
+ blocks, walk.iv);
+ kernel_neon_end();
crypto_xor_cpy(tdst, tsrc, tail, nbytes);
err = skcipher_walk_done(&walk, 0);
}
- kernel_neon_end();
return err;
}
@@ -272,14 +269,14 @@ static int xts_encrypt(struct skcipher_request *req)
err = skcipher_walk_virt(&walk, req, true);
- kernel_neon_begin();
for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ kernel_neon_begin();
aes_xts_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
(u8 *)ctx->key1.key_enc, rounds, blocks,
(u8 *)ctx->key2.key_enc, walk.iv, first);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -294,14 +291,14 @@ static int xts_decrypt(struct skcipher_request *req)
err = skcipher_walk_virt(&walk, req, true);
- kernel_neon_begin();
for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ kernel_neon_begin();
aes_xts_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
(u8 *)ctx->key1.key_dec, rounds, blocks,
(u8 *)ctx->key2.key_enc, walk.iv, first);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -425,7 +422,7 @@ static int cmac_setkey(struct crypto_shash *tfm, const u8 *in_key,
/* encrypt the zero vector */
kernel_neon_begin();
- aes_ecb_encrypt(ctx->consts, (u8[AES_BLOCK_SIZE]){}, rk, rounds, 1, 1);
+ aes_ecb_encrypt(ctx->consts, (u8[AES_BLOCK_SIZE]){}, rk, rounds, 1);
kernel_neon_end();
cmac_gf128_mul_by_x(consts, consts);
@@ -454,8 +451,8 @@ static int xcbc_setkey(struct crypto_shash *tfm, const u8 *in_key,
return err;
kernel_neon_begin();
- aes_ecb_encrypt(key, ks[0], rk, rounds, 1, 1);
- aes_ecb_encrypt(ctx->consts, ks[1], rk, rounds, 2, 0);
+ aes_ecb_encrypt(key, ks[0], rk, rounds, 1);
+ aes_ecb_encrypt(ctx->consts, ks[1], rk, rounds, 2);
kernel_neon_end();
return cbcmac_setkey(tfm, key, sizeof(key));
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 2674d43d1384..65b273667b34 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -40,24 +40,24 @@
#if INTERLEAVE == 2
aes_encrypt_block2x:
- encrypt_block2x v0, v1, w3, x2, x6, w7
+ encrypt_block2x v0, v1, w3, x2, x8, w7
ret
ENDPROC(aes_encrypt_block2x)
aes_decrypt_block2x:
- decrypt_block2x v0, v1, w3, x2, x6, w7
+ decrypt_block2x v0, v1, w3, x2, x8, w7
ret
ENDPROC(aes_decrypt_block2x)
#elif INTERLEAVE == 4
aes_encrypt_block4x:
- encrypt_block4x v0, v1, v2, v3, w3, x2, x6, w7
+ encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
ret
ENDPROC(aes_encrypt_block4x)
aes_decrypt_block4x:
- decrypt_block4x v0, v1, v2, v3, w3, x2, x6, w7
+ decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
ret
ENDPROC(aes_decrypt_block4x)
@@ -86,33 +86,32 @@ ENDPROC(aes_decrypt_block4x)
#define FRAME_POP
.macro do_encrypt_block2x
- encrypt_block2x v0, v1, w3, x2, x6, w7
+ encrypt_block2x v0, v1, w3, x2, x8, w7
.endm
.macro do_decrypt_block2x
- decrypt_block2x v0, v1, w3, x2, x6, w7
+ decrypt_block2x v0, v1, w3, x2, x8, w7
.endm
.macro do_encrypt_block4x
- encrypt_block4x v0, v1, v2, v3, w3, x2, x6, w7
+ encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
.endm
.macro do_decrypt_block4x
- decrypt_block4x v0, v1, v2, v3, w3, x2, x6, w7
+ decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
.endm
#endif
/*
* aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
- * int blocks, int first)
+ * int blocks)
* aes_ecb_decrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
- * int blocks, int first)
+ * int blocks)
*/
AES_ENTRY(aes_ecb_encrypt)
FRAME_PUSH
- cbz w5, .LecbencloopNx
enc_prepare w3, x2, x5
@@ -148,7 +147,6 @@ AES_ENDPROC(aes_ecb_encrypt)
AES_ENTRY(aes_ecb_decrypt)
FRAME_PUSH
- cbz w5, .LecbdecloopNx
dec_prepare w3, x2, x5
@@ -184,14 +182,12 @@ AES_ENDPROC(aes_ecb_decrypt)
/*
* aes_cbc_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
- * int blocks, u8 iv[], int first)
+ * int blocks, u8 iv[])
* aes_cbc_decrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
- * int blocks, u8 iv[], int first)
+ * int blocks, u8 iv[])
*/
AES_ENTRY(aes_cbc_encrypt)
- cbz w6, .Lcbcencloop
-
ld1 {v0.16b}, [x5] /* get iv */
enc_prepare w3, x2, x6
@@ -209,7 +205,6 @@ AES_ENDPROC(aes_cbc_encrypt)
AES_ENTRY(aes_cbc_decrypt)
FRAME_PUSH
- cbz w6, .LcbcdecloopNx
ld1 {v7.16b}, [x5] /* get iv */
dec_prepare w3, x2, x6
@@ -264,20 +259,19 @@ AES_ENDPROC(aes_cbc_decrypt)
/*
* aes_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
- * int blocks, u8 ctr[], int first)
+ * int blocks, u8 ctr[])
*/
AES_ENTRY(aes_ctr_encrypt)
FRAME_PUSH
- cbz w6, .Lctrnotfirst /* 1st time around? */
+
enc_prepare w3, x2, x6
ld1 {v4.16b}, [x5]
-.Lctrnotfirst:
- umov x8, v4.d[1] /* keep swabbed ctr in reg */
- rev x8, x8
+ umov x6, v4.d[1] /* keep swabbed ctr in reg */
+ rev x6, x6
#if INTERLEAVE >= 2
- cmn w8, w4 /* 32 bit overflow? */
+ cmn w6, w4 /* 32 bit overflow? */
bcs .Lctrloop
.LctrloopNx:
subs w4, w4, #INTERLEAVE
@@ -285,11 +279,11 @@ AES_ENTRY(aes_ctr_encrypt)
#if INTERLEAVE == 2
mov v0.8b, v4.8b
mov v1.8b, v4.8b
- rev x7, x8
- add x8, x8, #1
+ rev x7, x6
+ add x6, x6, #1
ins v0.d[1], x7
- rev x7, x8
- add x8, x8, #1
+ rev x7, x6
+ add x6, x6, #1
ins v1.d[1], x7
ld1 {v2.16b-v3.16b}, [x1], #32 /* get 2 input blocks */
do_encrypt_block2x
@@ -298,7 +292,7 @@ AES_ENTRY(aes_ctr_encrypt)
st1 {v0.16b-v1.16b}, [x0], #32
#else
ldr q8, =0x30000000200000001 /* addends 1,2,3[,0] */
- dup v7.4s, w8
+ dup v7.4s, w6
mov v0.16b, v4.16b
add v7.4s, v7.4s, v8.4s
mov v1.16b, v4.16b
@@ -316,9 +310,9 @@ AES_ENTRY(aes_ctr_encrypt)
eor v2.16b, v7.16b, v2.16b
eor v3.16b, v5.16b, v3.16b
st1 {v0.16b-v3.16b}, [x0], #64
- add x8, x8, #INTERLEAVE
+ add x6, x6, #INTERLEAVE
#endif
- rev x7, x8
+ rev x7, x6
ins v4.d[1], x7
cbz w4, .Lctrout
b .LctrloopNx
@@ -328,10 +322,10 @@ AES_ENTRY(aes_ctr_encrypt)
#endif
.Lctrloop:
mov v0.16b, v4.16b
- encrypt_block v0, w3, x2, x6, w7
+ encrypt_block v0, w3, x2, x8, w7
- adds x8, x8, #1 /* increment BE ctr */
- rev x7, x8
+ adds x6, x6, #1 /* increment BE ctr */
+ rev x7, x6
ins v4.d[1], x7
bcs .Lctrcarry /* overflow? */
@@ -385,15 +379,17 @@ CPU_BE( .quad 0x87, 1 )
AES_ENTRY(aes_xts_encrypt)
FRAME_PUSH
- cbz w7, .LxtsencloopNx
-
ld1 {v4.16b}, [x6]
- enc_prepare w3, x5, x6
- encrypt_block v4, w3, x5, x6, w7 /* first tweak */
- enc_switch_key w3, x2, x6
+ cbz w7, .Lxtsencnotfirst
+
+ enc_prepare w3, x5, x8
+ encrypt_block v4, w3, x5, x8, w7 /* first tweak */
+ enc_switch_key w3, x2, x8
ldr q7, .Lxts_mul_x
b .LxtsencNx
+.Lxtsencnotfirst:
+ enc_prepare w3, x2, x8
.LxtsencloopNx:
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
@@ -442,7 +438,7 @@ AES_ENTRY(aes_xts_encrypt)
.Lxtsencloop:
ld1 {v1.16b}, [x1], #16
eor v0.16b, v1.16b, v4.16b
- encrypt_block v0, w3, x2, x6, w7
+ encrypt_block v0, w3, x2, x8, w7
eor v0.16b, v0.16b, v4.16b
st1 {v0.16b}, [x0], #16
subs w4, w4, #1
@@ -450,6 +446,7 @@ AES_ENTRY(aes_xts_encrypt)
next_tweak v4, v4, v7, v8
b .Lxtsencloop
.Lxtsencout:
+ st1 {v4.16b}, [x6]
FRAME_POP
ret
AES_ENDPROC(aes_xts_encrypt)
@@ -457,15 +454,17 @@ AES_ENDPROC(aes_xts_encrypt)
AES_ENTRY(aes_xts_decrypt)
FRAME_PUSH
- cbz w7, .LxtsdecloopNx
-
ld1 {v4.16b}, [x6]
- enc_prepare w3, x5, x6
- encrypt_block v4, w3, x5, x6, w7 /* first tweak */
- dec_prepare w3, x2, x6
+ cbz w7, .Lxtsdecnotfirst
+
+ enc_prepare w3, x5, x8
+ encrypt_block v4, w3, x5, x8, w7 /* first tweak */
+ dec_prepare w3, x2, x8
ldr q7, .Lxts_mul_x
b .LxtsdecNx
+.Lxtsdecnotfirst:
+ dec_prepare w3, x2, x8
.LxtsdecloopNx:
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
@@ -514,7 +513,7 @@ AES_ENTRY(aes_xts_decrypt)
.Lxtsdecloop:
ld1 {v1.16b}, [x1], #16
eor v0.16b, v1.16b, v4.16b
- decrypt_block v0, w3, x2, x6, w7
+ decrypt_block v0, w3, x2, x8, w7
eor v0.16b, v0.16b, v4.16b
st1 {v0.16b}, [x0], #16
subs w4, w4, #1
@@ -522,6 +521,7 @@ AES_ENTRY(aes_xts_decrypt)
next_tweak v4, v4, v7, v8
b .Lxtsdecloop
.Lxtsdecout:
+ st1 {v4.16b}, [x6]
FRAME_POP
ret
AES_ENDPROC(aes_xts_decrypt)
diff --git a/arch/arm64/crypto/aes-neonbs-glue.c b/arch/arm64/crypto/aes-neonbs-glue.c
index c55d68ccb89f..9d823c77ec84 100644
--- a/arch/arm64/crypto/aes-neonbs-glue.c
+++ b/arch/arm64/crypto/aes-neonbs-glue.c
@@ -46,10 +46,9 @@ asmlinkage void aesbs_xts_decrypt(u8 out[], u8 const in[], u8 const rk[],
/* borrowed from aes-neon-blk.ko */
asmlinkage void neon_aes_ecb_encrypt(u8 out[], u8 const in[], u32 const rk[],
- int rounds, int blocks, int first);
+ int rounds, int blocks);
asmlinkage void neon_aes_cbc_encrypt(u8 out[], u8 const in[], u32 const rk[],
- int rounds, int blocks, u8 iv[],
- int first);
+ int rounds, int blocks, u8 iv[]);
struct aesbs_ctx {
u8 rk[13 * (8 * AES_BLOCK_SIZE) + 32];
@@ -157,7 +156,7 @@ static int cbc_encrypt(struct skcipher_request *req)
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct aesbs_cbc_ctx *ctx = crypto_skcipher_ctx(tfm);
struct skcipher_walk walk;
- int err, first = 1;
+ int err;
err = skcipher_walk_virt(&walk, req, true);
@@ -167,10 +166,9 @@ static int cbc_encrypt(struct skcipher_request *req)
/* fall back to the non-bitsliced NEON implementation */
neon_aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- ctx->enc, ctx->key.rounds, blocks, walk.iv,
- first);
+ ctx->enc, ctx->key.rounds, blocks,
+ walk.iv);
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
- first = 0;
}
kernel_neon_end();
return err;
@@ -311,7 +309,7 @@ static int __xts_crypt(struct skcipher_request *req,
kernel_neon_begin();
neon_aes_ecb_encrypt(walk.iv, walk.iv, ctx->twkey,
- ctx->key.rounds, 1, 1);
+ ctx->key.rounds, 1);
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
--
2.11.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 2/5] crypto: arm64/aes-blk - move kernel mode neon en/disable into loop
@ 2017-12-01 21:19 ` Ard Biesheuvel
0 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2017-12-01 21:19 UTC (permalink / raw)
To: linux-arm-kernel
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.
Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.
So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled)
Note that this requires some reshuffling of the registers in the asm
code, because the XTS routines can no longer rely on the registers to
retain their contents between invocations.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-glue.c | 81 +++++++++---------
arch/arm64/crypto/aes-modes.S | 90 ++++++++++----------
arch/arm64/crypto/aes-neonbs-glue.c | 14 ++-
3 files changed, 90 insertions(+), 95 deletions(-)
diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 998ba519a026..c983a2bfebe3 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -64,17 +64,17 @@ MODULE_LICENSE("GPL v2");
/* defined in aes-modes.S */
asmlinkage void aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[],
- int rounds, int blocks, int first);
+ int rounds, int blocks);
asmlinkage void aes_ecb_decrypt(u8 out[], u8 const in[], u8 const rk[],
- int rounds, int blocks, int first);
+ int rounds, int blocks);
asmlinkage void aes_cbc_encrypt(u8 out[], u8 const in[], u8 const rk[],
- int rounds, int blocks, u8 iv[], int first);
+ int rounds, int blocks, u8 iv[]);
asmlinkage void aes_cbc_decrypt(u8 out[], u8 const in[], u8 const rk[],
- int rounds, int blocks, u8 iv[], int first);
+ int rounds, int blocks, u8 iv[]);
asmlinkage void aes_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[],
- int rounds, int blocks, u8 ctr[], int first);
+ int rounds, int blocks, u8 ctr[]);
asmlinkage void aes_xts_encrypt(u8 out[], u8 const in[], u8 const rk1[],
int rounds, int blocks, u8 const rk2[], u8 iv[],
@@ -133,19 +133,19 @@ static int ecb_encrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
- int err, first, rounds = 6 + ctx->key_length / 4;
+ int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
err = skcipher_walk_virt(&walk, req, true);
- kernel_neon_begin();
- for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
aes_ecb_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_enc, rounds, blocks, first);
+ (u8 *)ctx->key_enc, rounds, blocks);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -153,19 +153,19 @@ static int ecb_decrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
- int err, first, rounds = 6 + ctx->key_length / 4;
+ int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
err = skcipher_walk_virt(&walk, req, true);
- kernel_neon_begin();
- for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
aes_ecb_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_dec, rounds, blocks, first);
+ (u8 *)ctx->key_dec, rounds, blocks);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -173,20 +173,19 @@ static int cbc_encrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
- int err, first, rounds = 6 + ctx->key_length / 4;
+ int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
err = skcipher_walk_virt(&walk, req, true);
- kernel_neon_begin();
- for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_enc, rounds, blocks, walk.iv,
- first);
+ (u8 *)ctx->key_enc, rounds, blocks, walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -194,20 +193,19 @@ static int cbc_decrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
- int err, first, rounds = 6 + ctx->key_length / 4;
+ int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
err = skcipher_walk_virt(&walk, req, true);
- kernel_neon_begin();
- for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
aes_cbc_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_dec, rounds, blocks, walk.iv,
- first);
+ (u8 *)ctx->key_dec, rounds, blocks, walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -215,20 +213,18 @@ static int ctr_encrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
- int err, first, rounds = 6 + ctx->key_length / 4;
+ int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
int blocks;
err = skcipher_walk_virt(&walk, req, true);
- first = 1;
- kernel_neon_begin();
while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
aes_ctr_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_enc, rounds, blocks, walk.iv,
- first);
+ (u8 *)ctx->key_enc, rounds, blocks, walk.iv);
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
- first = 0;
+ kernel_neon_end();
}
if (walk.nbytes) {
u8 __aligned(8) tail[AES_BLOCK_SIZE];
@@ -241,12 +237,13 @@ static int ctr_encrypt(struct skcipher_request *req)
*/
blocks = -1;
+ kernel_neon_begin();
aes_ctr_encrypt(tail, NULL, (u8 *)ctx->key_enc, rounds,
- blocks, walk.iv, first);
+ blocks, walk.iv);
+ kernel_neon_end();
crypto_xor_cpy(tdst, tsrc, tail, nbytes);
err = skcipher_walk_done(&walk, 0);
}
- kernel_neon_end();
return err;
}
@@ -272,14 +269,14 @@ static int xts_encrypt(struct skcipher_request *req)
err = skcipher_walk_virt(&walk, req, true);
- kernel_neon_begin();
for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ kernel_neon_begin();
aes_xts_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
(u8 *)ctx->key1.key_enc, rounds, blocks,
(u8 *)ctx->key2.key_enc, walk.iv, first);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -294,14 +291,14 @@ static int xts_decrypt(struct skcipher_request *req)
err = skcipher_walk_virt(&walk, req, true);
- kernel_neon_begin();
for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ kernel_neon_begin();
aes_xts_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
(u8 *)ctx->key1.key_dec, rounds, blocks,
(u8 *)ctx->key2.key_enc, walk.iv, first);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -425,7 +422,7 @@ static int cmac_setkey(struct crypto_shash *tfm, const u8 *in_key,
/* encrypt the zero vector */
kernel_neon_begin();
- aes_ecb_encrypt(ctx->consts, (u8[AES_BLOCK_SIZE]){}, rk, rounds, 1, 1);
+ aes_ecb_encrypt(ctx->consts, (u8[AES_BLOCK_SIZE]){}, rk, rounds, 1);
kernel_neon_end();
cmac_gf128_mul_by_x(consts, consts);
@@ -454,8 +451,8 @@ static int xcbc_setkey(struct crypto_shash *tfm, const u8 *in_key,
return err;
kernel_neon_begin();
- aes_ecb_encrypt(key, ks[0], rk, rounds, 1, 1);
- aes_ecb_encrypt(ctx->consts, ks[1], rk, rounds, 2, 0);
+ aes_ecb_encrypt(key, ks[0], rk, rounds, 1);
+ aes_ecb_encrypt(ctx->consts, ks[1], rk, rounds, 2);
kernel_neon_end();
return cbcmac_setkey(tfm, key, sizeof(key));
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 2674d43d1384..65b273667b34 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -40,24 +40,24 @@
#if INTERLEAVE == 2
aes_encrypt_block2x:
- encrypt_block2x v0, v1, w3, x2, x6, w7
+ encrypt_block2x v0, v1, w3, x2, x8, w7
ret
ENDPROC(aes_encrypt_block2x)
aes_decrypt_block2x:
- decrypt_block2x v0, v1, w3, x2, x6, w7
+ decrypt_block2x v0, v1, w3, x2, x8, w7
ret
ENDPROC(aes_decrypt_block2x)
#elif INTERLEAVE == 4
aes_encrypt_block4x:
- encrypt_block4x v0, v1, v2, v3, w3, x2, x6, w7
+ encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
ret
ENDPROC(aes_encrypt_block4x)
aes_decrypt_block4x:
- decrypt_block4x v0, v1, v2, v3, w3, x2, x6, w7
+ decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
ret
ENDPROC(aes_decrypt_block4x)
@@ -86,33 +86,32 @@ ENDPROC(aes_decrypt_block4x)
#define FRAME_POP
.macro do_encrypt_block2x
- encrypt_block2x v0, v1, w3, x2, x6, w7
+ encrypt_block2x v0, v1, w3, x2, x8, w7
.endm
.macro do_decrypt_block2x
- decrypt_block2x v0, v1, w3, x2, x6, w7
+ decrypt_block2x v0, v1, w3, x2, x8, w7
.endm
.macro do_encrypt_block4x
- encrypt_block4x v0, v1, v2, v3, w3, x2, x6, w7
+ encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
.endm
.macro do_decrypt_block4x
- decrypt_block4x v0, v1, v2, v3, w3, x2, x6, w7
+ decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
.endm
#endif
/*
* aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
- * int blocks, int first)
+ * int blocks)
* aes_ecb_decrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
- * int blocks, int first)
+ * int blocks)
*/
AES_ENTRY(aes_ecb_encrypt)
FRAME_PUSH
- cbz w5, .LecbencloopNx
enc_prepare w3, x2, x5
@@ -148,7 +147,6 @@ AES_ENDPROC(aes_ecb_encrypt)
AES_ENTRY(aes_ecb_decrypt)
FRAME_PUSH
- cbz w5, .LecbdecloopNx
dec_prepare w3, x2, x5
@@ -184,14 +182,12 @@ AES_ENDPROC(aes_ecb_decrypt)
/*
* aes_cbc_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
- * int blocks, u8 iv[], int first)
+ * int blocks, u8 iv[])
* aes_cbc_decrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
- * int blocks, u8 iv[], int first)
+ * int blocks, u8 iv[])
*/
AES_ENTRY(aes_cbc_encrypt)
- cbz w6, .Lcbcencloop
-
ld1 {v0.16b}, [x5] /* get iv */
enc_prepare w3, x2, x6
@@ -209,7 +205,6 @@ AES_ENDPROC(aes_cbc_encrypt)
AES_ENTRY(aes_cbc_decrypt)
FRAME_PUSH
- cbz w6, .LcbcdecloopNx
ld1 {v7.16b}, [x5] /* get iv */
dec_prepare w3, x2, x6
@@ -264,20 +259,19 @@ AES_ENDPROC(aes_cbc_decrypt)
/*
* aes_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
- * int blocks, u8 ctr[], int first)
+ * int blocks, u8 ctr[])
*/
AES_ENTRY(aes_ctr_encrypt)
FRAME_PUSH
- cbz w6, .Lctrnotfirst /* 1st time around? */
+
enc_prepare w3, x2, x6
ld1 {v4.16b}, [x5]
-.Lctrnotfirst:
- umov x8, v4.d[1] /* keep swabbed ctr in reg */
- rev x8, x8
+ umov x6, v4.d[1] /* keep swabbed ctr in reg */
+ rev x6, x6
#if INTERLEAVE >= 2
- cmn w8, w4 /* 32 bit overflow? */
+ cmn w6, w4 /* 32 bit overflow? */
bcs .Lctrloop
.LctrloopNx:
subs w4, w4, #INTERLEAVE
@@ -285,11 +279,11 @@ AES_ENTRY(aes_ctr_encrypt)
#if INTERLEAVE == 2
mov v0.8b, v4.8b
mov v1.8b, v4.8b
- rev x7, x8
- add x8, x8, #1
+ rev x7, x6
+ add x6, x6, #1
ins v0.d[1], x7
- rev x7, x8
- add x8, x8, #1
+ rev x7, x6
+ add x6, x6, #1
ins v1.d[1], x7
ld1 {v2.16b-v3.16b}, [x1], #32 /* get 2 input blocks */
do_encrypt_block2x
@@ -298,7 +292,7 @@ AES_ENTRY(aes_ctr_encrypt)
st1 {v0.16b-v1.16b}, [x0], #32
#else
ldr q8, =0x30000000200000001 /* addends 1,2,3[,0] */
- dup v7.4s, w8
+ dup v7.4s, w6
mov v0.16b, v4.16b
add v7.4s, v7.4s, v8.4s
mov v1.16b, v4.16b
@@ -316,9 +310,9 @@ AES_ENTRY(aes_ctr_encrypt)
eor v2.16b, v7.16b, v2.16b
eor v3.16b, v5.16b, v3.16b
st1 {v0.16b-v3.16b}, [x0], #64
- add x8, x8, #INTERLEAVE
+ add x6, x6, #INTERLEAVE
#endif
- rev x7, x8
+ rev x7, x6
ins v4.d[1], x7
cbz w4, .Lctrout
b .LctrloopNx
@@ -328,10 +322,10 @@ AES_ENTRY(aes_ctr_encrypt)
#endif
.Lctrloop:
mov v0.16b, v4.16b
- encrypt_block v0, w3, x2, x6, w7
+ encrypt_block v0, w3, x2, x8, w7
- adds x8, x8, #1 /* increment BE ctr */
- rev x7, x8
+ adds x6, x6, #1 /* increment BE ctr */
+ rev x7, x6
ins v4.d[1], x7
bcs .Lctrcarry /* overflow? */
@@ -385,15 +379,17 @@ CPU_BE( .quad 0x87, 1 )
AES_ENTRY(aes_xts_encrypt)
FRAME_PUSH
- cbz w7, .LxtsencloopNx
-
ld1 {v4.16b}, [x6]
- enc_prepare w3, x5, x6
- encrypt_block v4, w3, x5, x6, w7 /* first tweak */
- enc_switch_key w3, x2, x6
+ cbz w7, .Lxtsencnotfirst
+
+ enc_prepare w3, x5, x8
+ encrypt_block v4, w3, x5, x8, w7 /* first tweak */
+ enc_switch_key w3, x2, x8
ldr q7, .Lxts_mul_x
b .LxtsencNx
+.Lxtsencnotfirst:
+ enc_prepare w3, x2, x8
.LxtsencloopNx:
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
@@ -442,7 +438,7 @@ AES_ENTRY(aes_xts_encrypt)
.Lxtsencloop:
ld1 {v1.16b}, [x1], #16
eor v0.16b, v1.16b, v4.16b
- encrypt_block v0, w3, x2, x6, w7
+ encrypt_block v0, w3, x2, x8, w7
eor v0.16b, v0.16b, v4.16b
st1 {v0.16b}, [x0], #16
subs w4, w4, #1
@@ -450,6 +446,7 @@ AES_ENTRY(aes_xts_encrypt)
next_tweak v4, v4, v7, v8
b .Lxtsencloop
.Lxtsencout:
+ st1 {v4.16b}, [x6]
FRAME_POP
ret
AES_ENDPROC(aes_xts_encrypt)
@@ -457,15 +454,17 @@ AES_ENDPROC(aes_xts_encrypt)
AES_ENTRY(aes_xts_decrypt)
FRAME_PUSH
- cbz w7, .LxtsdecloopNx
-
ld1 {v4.16b}, [x6]
- enc_prepare w3, x5, x6
- encrypt_block v4, w3, x5, x6, w7 /* first tweak */
- dec_prepare w3, x2, x6
+ cbz w7, .Lxtsdecnotfirst
+
+ enc_prepare w3, x5, x8
+ encrypt_block v4, w3, x5, x8, w7 /* first tweak */
+ dec_prepare w3, x2, x8
ldr q7, .Lxts_mul_x
b .LxtsdecNx
+.Lxtsdecnotfirst:
+ dec_prepare w3, x2, x8
.LxtsdecloopNx:
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
@@ -514,7 +513,7 @@ AES_ENTRY(aes_xts_decrypt)
.Lxtsdecloop:
ld1 {v1.16b}, [x1], #16
eor v0.16b, v1.16b, v4.16b
- decrypt_block v0, w3, x2, x6, w7
+ decrypt_block v0, w3, x2, x8, w7
eor v0.16b, v0.16b, v4.16b
st1 {v0.16b}, [x0], #16
subs w4, w4, #1
@@ -522,6 +521,7 @@ AES_ENTRY(aes_xts_decrypt)
next_tweak v4, v4, v7, v8
b .Lxtsdecloop
.Lxtsdecout:
+ st1 {v4.16b}, [x6]
FRAME_POP
ret
AES_ENDPROC(aes_xts_decrypt)
diff --git a/arch/arm64/crypto/aes-neonbs-glue.c b/arch/arm64/crypto/aes-neonbs-glue.c
index c55d68ccb89f..9d823c77ec84 100644
--- a/arch/arm64/crypto/aes-neonbs-glue.c
+++ b/arch/arm64/crypto/aes-neonbs-glue.c
@@ -46,10 +46,9 @@ asmlinkage void aesbs_xts_decrypt(u8 out[], u8 const in[], u8 const rk[],
/* borrowed from aes-neon-blk.ko */
asmlinkage void neon_aes_ecb_encrypt(u8 out[], u8 const in[], u32 const rk[],
- int rounds, int blocks, int first);
+ int rounds, int blocks);
asmlinkage void neon_aes_cbc_encrypt(u8 out[], u8 const in[], u32 const rk[],
- int rounds, int blocks, u8 iv[],
- int first);
+ int rounds, int blocks, u8 iv[]);
struct aesbs_ctx {
u8 rk[13 * (8 * AES_BLOCK_SIZE) + 32];
@@ -157,7 +156,7 @@ static int cbc_encrypt(struct skcipher_request *req)
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct aesbs_cbc_ctx *ctx = crypto_skcipher_ctx(tfm);
struct skcipher_walk walk;
- int err, first = 1;
+ int err;
err = skcipher_walk_virt(&walk, req, true);
@@ -167,10 +166,9 @@ static int cbc_encrypt(struct skcipher_request *req)
/* fall back to the non-bitsliced NEON implementation */
neon_aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- ctx->enc, ctx->key.rounds, blocks, walk.iv,
- first);
+ ctx->enc, ctx->key.rounds, blocks,
+ walk.iv);
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
- first = 0;
}
kernel_neon_end();
return err;
@@ -311,7 +309,7 @@ static int __xts_crypt(struct skcipher_request *req,
kernel_neon_begin();
neon_aes_ecb_encrypt(walk.iv, walk.iv, ctx->twkey,
- ctx->key.rounds, 1, 1);
+ ctx->key.rounds, 1);
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
--
2.11.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 3/5] crypto: arm64/aes-bs - move kernel mode neon en/disable into loop
2017-12-01 21:19 ` Ard Biesheuvel
@ 2017-12-01 21:19 ` Ard Biesheuvel
-1 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2017-12-01 21:19 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.
Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.
So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled)
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-neonbs-glue.c | 26 +++++++++-----------
1 file changed, 12 insertions(+), 14 deletions(-)
diff --git a/arch/arm64/crypto/aes-neonbs-glue.c b/arch/arm64/crypto/aes-neonbs-glue.c
index 9d823c77ec84..fa09dc340a1e 100644
--- a/arch/arm64/crypto/aes-neonbs-glue.c
+++ b/arch/arm64/crypto/aes-neonbs-glue.c
@@ -101,7 +101,6 @@ static int __ecb_crypt(struct skcipher_request *req,
err = skcipher_walk_virt(&walk, req, true);
- kernel_neon_begin();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
@@ -109,12 +108,13 @@ static int __ecb_crypt(struct skcipher_request *req,
blocks = round_down(blocks,
walk.stride / AES_BLOCK_SIZE);
+ kernel_neon_begin();
fn(walk.dst.virt.addr, walk.src.virt.addr, ctx->rk,
ctx->rounds, blocks);
+ kernel_neon_end();
err = skcipher_walk_done(&walk,
walk.nbytes - blocks * AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -160,17 +160,17 @@ static int cbc_encrypt(struct skcipher_request *req)
err = skcipher_walk_virt(&walk, req, true);
- kernel_neon_begin();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
/* fall back to the non-bitsliced NEON implementation */
+ kernel_neon_begin();
neon_aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
ctx->enc, ctx->key.rounds, blocks,
walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -183,7 +183,6 @@ static int cbc_decrypt(struct skcipher_request *req)
err = skcipher_walk_virt(&walk, req, true);
- kernel_neon_begin();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
@@ -191,13 +190,14 @@ static int cbc_decrypt(struct skcipher_request *req)
blocks = round_down(blocks,
walk.stride / AES_BLOCK_SIZE);
+ kernel_neon_begin();
aesbs_cbc_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
ctx->key.rk, ctx->key.rounds, blocks,
walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk,
walk.nbytes - blocks * AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -231,7 +231,6 @@ static int ctr_encrypt(struct skcipher_request *req)
err = skcipher_walk_virt(&walk, req, true);
- kernel_neon_begin();
while (walk.nbytes > 0) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
u8 *final = (walk.total % AES_BLOCK_SIZE) ? buf : NULL;
@@ -242,8 +241,10 @@ static int ctr_encrypt(struct skcipher_request *req)
final = NULL;
}
+ kernel_neon_begin();
aesbs_ctr_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
ctx->rk, ctx->rounds, blocks, walk.iv, final);
+ kernel_neon_end();
if (final) {
u8 *dst = walk.dst.virt.addr + blocks * AES_BLOCK_SIZE;
@@ -258,8 +259,6 @@ static int ctr_encrypt(struct skcipher_request *req)
err = skcipher_walk_done(&walk,
walk.nbytes - blocks * AES_BLOCK_SIZE);
}
- kernel_neon_end();
-
return err;
}
@@ -307,9 +306,8 @@ static int __xts_crypt(struct skcipher_request *req,
err = skcipher_walk_virt(&walk, req, true);
kernel_neon_begin();
-
- neon_aes_ecb_encrypt(walk.iv, walk.iv, ctx->twkey,
- ctx->key.rounds, 1);
+ neon_aes_ecb_encrypt(walk.iv, walk.iv, ctx->twkey, ctx->key.rounds, 1);
+ kernel_neon_end();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
@@ -318,13 +316,13 @@ static int __xts_crypt(struct skcipher_request *req,
blocks = round_down(blocks,
walk.stride / AES_BLOCK_SIZE);
+ kernel_neon_begin();
fn(walk.dst.virt.addr, walk.src.virt.addr, ctx->key.rk,
ctx->key.rounds, blocks, walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk,
walk.nbytes - blocks * AES_BLOCK_SIZE);
}
- kernel_neon_end();
-
return err;
}
--
2.11.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 3/5] crypto: arm64/aes-bs - move kernel mode neon en/disable into loop
@ 2017-12-01 21:19 ` Ard Biesheuvel
0 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2017-12-01 21:19 UTC (permalink / raw)
To: linux-arm-kernel
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.
Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.
So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled)
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-neonbs-glue.c | 26 +++++++++-----------
1 file changed, 12 insertions(+), 14 deletions(-)
diff --git a/arch/arm64/crypto/aes-neonbs-glue.c b/arch/arm64/crypto/aes-neonbs-glue.c
index 9d823c77ec84..fa09dc340a1e 100644
--- a/arch/arm64/crypto/aes-neonbs-glue.c
+++ b/arch/arm64/crypto/aes-neonbs-glue.c
@@ -101,7 +101,6 @@ static int __ecb_crypt(struct skcipher_request *req,
err = skcipher_walk_virt(&walk, req, true);
- kernel_neon_begin();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
@@ -109,12 +108,13 @@ static int __ecb_crypt(struct skcipher_request *req,
blocks = round_down(blocks,
walk.stride / AES_BLOCK_SIZE);
+ kernel_neon_begin();
fn(walk.dst.virt.addr, walk.src.virt.addr, ctx->rk,
ctx->rounds, blocks);
+ kernel_neon_end();
err = skcipher_walk_done(&walk,
walk.nbytes - blocks * AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -160,17 +160,17 @@ static int cbc_encrypt(struct skcipher_request *req)
err = skcipher_walk_virt(&walk, req, true);
- kernel_neon_begin();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
/* fall back to the non-bitsliced NEON implementation */
+ kernel_neon_begin();
neon_aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
ctx->enc, ctx->key.rounds, blocks,
walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -183,7 +183,6 @@ static int cbc_decrypt(struct skcipher_request *req)
err = skcipher_walk_virt(&walk, req, true);
- kernel_neon_begin();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
@@ -191,13 +190,14 @@ static int cbc_decrypt(struct skcipher_request *req)
blocks = round_down(blocks,
walk.stride / AES_BLOCK_SIZE);
+ kernel_neon_begin();
aesbs_cbc_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
ctx->key.rk, ctx->key.rounds, blocks,
walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk,
walk.nbytes - blocks * AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -231,7 +231,6 @@ static int ctr_encrypt(struct skcipher_request *req)
err = skcipher_walk_virt(&walk, req, true);
- kernel_neon_begin();
while (walk.nbytes > 0) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
u8 *final = (walk.total % AES_BLOCK_SIZE) ? buf : NULL;
@@ -242,8 +241,10 @@ static int ctr_encrypt(struct skcipher_request *req)
final = NULL;
}
+ kernel_neon_begin();
aesbs_ctr_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
ctx->rk, ctx->rounds, blocks, walk.iv, final);
+ kernel_neon_end();
if (final) {
u8 *dst = walk.dst.virt.addr + blocks * AES_BLOCK_SIZE;
@@ -258,8 +259,6 @@ static int ctr_encrypt(struct skcipher_request *req)
err = skcipher_walk_done(&walk,
walk.nbytes - blocks * AES_BLOCK_SIZE);
}
- kernel_neon_end();
-
return err;
}
@@ -307,9 +306,8 @@ static int __xts_crypt(struct skcipher_request *req,
err = skcipher_walk_virt(&walk, req, true);
kernel_neon_begin();
-
- neon_aes_ecb_encrypt(walk.iv, walk.iv, ctx->twkey,
- ctx->key.rounds, 1);
+ neon_aes_ecb_encrypt(walk.iv, walk.iv, ctx->twkey, ctx->key.rounds, 1);
+ kernel_neon_end();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
@@ -318,13 +316,13 @@ static int __xts_crypt(struct skcipher_request *req,
blocks = round_down(blocks,
walk.stride / AES_BLOCK_SIZE);
+ kernel_neon_begin();
fn(walk.dst.virt.addr, walk.src.virt.addr, ctx->key.rk,
ctx->key.rounds, blocks, walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk,
walk.nbytes - blocks * AES_BLOCK_SIZE);
}
- kernel_neon_end();
-
return err;
}
--
2.11.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 4/5] crypto: arm64/chacha20 - move kernel mode neon en/disable into loop
2017-12-01 21:19 ` Ard Biesheuvel
@ 2017-12-01 21:19 ` Ard Biesheuvel
-1 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2017-12-01 21:19 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.
Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.
So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled)
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/chacha20-neon-glue.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/crypto/chacha20-neon-glue.c b/arch/arm64/crypto/chacha20-neon-glue.c
index cbdb75d15cd0..b9ca7bef428a 100644
--- a/arch/arm64/crypto/chacha20-neon-glue.c
+++ b/arch/arm64/crypto/chacha20-neon-glue.c
@@ -36,6 +36,7 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
{
u8 buf[CHACHA20_BLOCK_SIZE];
+ kernel_neon_begin();
while (bytes >= CHACHA20_BLOCK_SIZE * 4) {
chacha20_4block_xor_neon(state, dst, src);
bytes -= CHACHA20_BLOCK_SIZE * 4;
@@ -55,6 +56,7 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
chacha20_block_xor_neon(state, buf, buf);
memcpy(dst, buf, bytes);
}
+ kernel_neon_end();
}
static int chacha20_neon(struct skcipher_request *req)
@@ -72,7 +74,6 @@ static int chacha20_neon(struct skcipher_request *req)
crypto_chacha20_init(state, ctx, walk.iv);
- kernel_neon_begin();
while (walk.nbytes > 0) {
unsigned int nbytes = walk.nbytes;
@@ -83,7 +84,6 @@ static int chacha20_neon(struct skcipher_request *req)
nbytes);
err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
}
- kernel_neon_end();
return err;
}
--
2.11.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 4/5] crypto: arm64/chacha20 - move kernel mode neon en/disable into loop
@ 2017-12-01 21:19 ` Ard Biesheuvel
0 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2017-12-01 21:19 UTC (permalink / raw)
To: linux-arm-kernel
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.
Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.
So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled)
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/chacha20-neon-glue.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/crypto/chacha20-neon-glue.c b/arch/arm64/crypto/chacha20-neon-glue.c
index cbdb75d15cd0..b9ca7bef428a 100644
--- a/arch/arm64/crypto/chacha20-neon-glue.c
+++ b/arch/arm64/crypto/chacha20-neon-glue.c
@@ -36,6 +36,7 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
{
u8 buf[CHACHA20_BLOCK_SIZE];
+ kernel_neon_begin();
while (bytes >= CHACHA20_BLOCK_SIZE * 4) {
chacha20_4block_xor_neon(state, dst, src);
bytes -= CHACHA20_BLOCK_SIZE * 4;
@@ -55,6 +56,7 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
chacha20_block_xor_neon(state, buf, buf);
memcpy(dst, buf, bytes);
}
+ kernel_neon_end();
}
static int chacha20_neon(struct skcipher_request *req)
@@ -72,7 +74,6 @@ static int chacha20_neon(struct skcipher_request *req)
crypto_chacha20_init(state, ctx, walk.iv);
- kernel_neon_begin();
while (walk.nbytes > 0) {
unsigned int nbytes = walk.nbytes;
@@ -83,7 +84,6 @@ static int chacha20_neon(struct skcipher_request *req)
nbytes);
err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
}
- kernel_neon_end();
return err;
}
--
2.11.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 5/5] crypto: arm64/ghash - move kernel mode neon en/disable into loop
2017-12-01 21:19 ` Ard Biesheuvel
@ 2017-12-01 21:19 ` Ard Biesheuvel
-1 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2017-12-01 21:19 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.
Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.
So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled)
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/ghash-ce-glue.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/crypto/ghash-ce-glue.c b/arch/arm64/crypto/ghash-ce-glue.c
index cfc9c92814fd..ddd9b565d775 100644
--- a/arch/arm64/crypto/ghash-ce-glue.c
+++ b/arch/arm64/crypto/ghash-ce-glue.c
@@ -368,20 +368,22 @@ static int gcm_encrypt(struct aead_request *req)
pmull_gcm_encrypt_block(ks, iv, NULL,
num_rounds(&ctx->aes_key));
put_unaligned_be32(3, iv + GCM_IV_SIZE);
+ kernel_neon_end();
err = skcipher_walk_aead_encrypt(&walk, req, true);
while (walk.nbytes >= AES_BLOCK_SIZE) {
int blocks = walk.nbytes / AES_BLOCK_SIZE;
+ kernel_neon_begin();
pmull_gcm_encrypt(blocks, dg, walk.dst.virt.addr,
walk.src.virt.addr, &ctx->ghash_key,
iv, num_rounds(&ctx->aes_key), ks);
+ kernel_neon_end();
err = skcipher_walk_done(&walk,
walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
} else {
__aes_arm64_encrypt(ctx->aes_key.key_enc, tag, iv,
num_rounds(&ctx->aes_key));
@@ -467,15 +469,18 @@ static int gcm_decrypt(struct aead_request *req)
pmull_gcm_encrypt_block(tag, iv, ctx->aes_key.key_enc,
num_rounds(&ctx->aes_key));
put_unaligned_be32(2, iv + GCM_IV_SIZE);
+ kernel_neon_end();
err = skcipher_walk_aead_decrypt(&walk, req, true);
while (walk.nbytes >= AES_BLOCK_SIZE) {
int blocks = walk.nbytes / AES_BLOCK_SIZE;
+ kernel_neon_begin();
pmull_gcm_decrypt(blocks, dg, walk.dst.virt.addr,
walk.src.virt.addr, &ctx->ghash_key,
iv, num_rounds(&ctx->aes_key));
+ kernel_neon_end();
err = skcipher_walk_done(&walk,
walk.nbytes % AES_BLOCK_SIZE);
@@ -483,8 +488,6 @@ static int gcm_decrypt(struct aead_request *req)
if (walk.nbytes)
pmull_gcm_encrypt_block(iv, iv, NULL,
num_rounds(&ctx->aes_key));
-
- kernel_neon_end();
} else {
__aes_arm64_encrypt(ctx->aes_key.key_enc, tag, iv,
num_rounds(&ctx->aes_key));
--
2.11.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 5/5] crypto: arm64/ghash - move kernel mode neon en/disable into loop
@ 2017-12-01 21:19 ` Ard Biesheuvel
0 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2017-12-01 21:19 UTC (permalink / raw)
To: linux-arm-kernel
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.
Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.
So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled)
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/ghash-ce-glue.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/crypto/ghash-ce-glue.c b/arch/arm64/crypto/ghash-ce-glue.c
index cfc9c92814fd..ddd9b565d775 100644
--- a/arch/arm64/crypto/ghash-ce-glue.c
+++ b/arch/arm64/crypto/ghash-ce-glue.c
@@ -368,20 +368,22 @@ static int gcm_encrypt(struct aead_request *req)
pmull_gcm_encrypt_block(ks, iv, NULL,
num_rounds(&ctx->aes_key));
put_unaligned_be32(3, iv + GCM_IV_SIZE);
+ kernel_neon_end();
err = skcipher_walk_aead_encrypt(&walk, req, true);
while (walk.nbytes >= AES_BLOCK_SIZE) {
int blocks = walk.nbytes / AES_BLOCK_SIZE;
+ kernel_neon_begin();
pmull_gcm_encrypt(blocks, dg, walk.dst.virt.addr,
walk.src.virt.addr, &ctx->ghash_key,
iv, num_rounds(&ctx->aes_key), ks);
+ kernel_neon_end();
err = skcipher_walk_done(&walk,
walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
} else {
__aes_arm64_encrypt(ctx->aes_key.key_enc, tag, iv,
num_rounds(&ctx->aes_key));
@@ -467,15 +469,18 @@ static int gcm_decrypt(struct aead_request *req)
pmull_gcm_encrypt_block(tag, iv, ctx->aes_key.key_enc,
num_rounds(&ctx->aes_key));
put_unaligned_be32(2, iv + GCM_IV_SIZE);
+ kernel_neon_end();
err = skcipher_walk_aead_decrypt(&walk, req, true);
while (walk.nbytes >= AES_BLOCK_SIZE) {
int blocks = walk.nbytes / AES_BLOCK_SIZE;
+ kernel_neon_begin();
pmull_gcm_decrypt(blocks, dg, walk.dst.virt.addr,
walk.src.virt.addr, &ctx->ghash_key,
iv, num_rounds(&ctx->aes_key));
+ kernel_neon_end();
err = skcipher_walk_done(&walk,
walk.nbytes % AES_BLOCK_SIZE);
@@ -483,8 +488,6 @@ static int gcm_decrypt(struct aead_request *req)
if (walk.nbytes)
pmull_gcm_encrypt_block(iv, iv, NULL,
num_rounds(&ctx->aes_key));
-
- kernel_neon_end();
} else {
__aes_arm64_encrypt(ctx->aes_key.key_enc, tag, iv,
num_rounds(&ctx->aes_key));
--
2.11.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
2017-12-01 21:19 ` Ard Biesheuvel
@ 2017-12-02 9:01 ` Peter Zijlstra
-1 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2017-12-02 9:01 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Mark Rutland, herbert, Catalin Marinas,
Sebastian Andrzej Siewior, Will Deacon, Russell King - ARM Linux,
Steven Rostedt, linux-crypto, Thomas Gleixner, Dave Martin,
linux-arm-kernel, linux-rt-users
On Fri, Dec 01, 2017 at 09:19:22PM +0000, Ard Biesheuvel wrote:
> Note that the remaining crypto drivers simply operate on fixed buffers, so
> while the RT crowd may still feel the need to disable those (and the ones
> below as well, perhaps), they don't call back into the crypto layer like
> the ones updated by this series, and so there's no room for improvement
> there AFAICT.
Do these other drivers process all the blocks fed to them in one go
under a single NEON section, or do they do a single fixed block per
NEON invocation?
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
@ 2017-12-02 9:01 ` Peter Zijlstra
0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2017-12-02 9:01 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, Dec 01, 2017 at 09:19:22PM +0000, Ard Biesheuvel wrote:
> Note that the remaining crypto drivers simply operate on fixed buffers, so
> while the RT crowd may still feel the need to disable those (and the ones
> below as well, perhaps), they don't call back into the crypto layer like
> the ones updated by this series, and so there's no room for improvement
> there AFAICT.
Do these other drivers process all the blocks fed to them in one go
under a single NEON section, or do they do a single fixed block per
NEON invocation?
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
2017-12-02 9:01 ` Peter Zijlstra
@ 2017-12-02 9:11 ` Ard Biesheuvel
-1 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2017-12-02 9:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-crypto, Herbert Xu, linux-arm-kernel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Catalin Marinas, Will Deacon,
Steven Rostedt, Thomas Gleixner
On 2 December 2017 at 09:01, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Dec 01, 2017 at 09:19:22PM +0000, Ard Biesheuvel wrote:
>> Note that the remaining crypto drivers simply operate on fixed buffers, so
>> while the RT crowd may still feel the need to disable those (and the ones
>> below as well, perhaps), they don't call back into the crypto layer like
>> the ones updated by this series, and so there's no room for improvement
>> there AFAICT.
>
> Do these other drivers process all the blocks fed to them in one go
> under a single NEON section, or do they do a single fixed block per
> NEON invocation?
They consume the entire input in a single go, yes. But making it more
granular than that is going to hurt performance, unless we introduce
some kind of kernel_neon_yield(), which does a end+begin but only if
the task is being scheduled out.
For example, the SHA256 keeps 256 bytes of round constants in NEON
registers, and reloading those from memory for each 64 byte block of
input is going to be noticeable. The same applies to the AES code
(although the numbers are slightly different)
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
@ 2017-12-02 9:11 ` Ard Biesheuvel
0 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2017-12-02 9:11 UTC (permalink / raw)
To: linux-arm-kernel
On 2 December 2017 at 09:01, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Dec 01, 2017 at 09:19:22PM +0000, Ard Biesheuvel wrote:
>> Note that the remaining crypto drivers simply operate on fixed buffers, so
>> while the RT crowd may still feel the need to disable those (and the ones
>> below as well, perhaps), they don't call back into the crypto layer like
>> the ones updated by this series, and so there's no room for improvement
>> there AFAICT.
>
> Do these other drivers process all the blocks fed to them in one go
> under a single NEON section, or do they do a single fixed block per
> NEON invocation?
They consume the entire input in a single go, yes. But making it more
granular than that is going to hurt performance, unless we introduce
some kind of kernel_neon_yield(), which does a end+begin but only if
the task is being scheduled out.
For example, the SHA256 keeps 256 bytes of round constants in NEON
registers, and reloading those from memory for each 64 byte block of
input is going to be noticeable. The same applies to the AES code
(although the numbers are slightly different)
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
2017-12-02 9:11 ` Ard Biesheuvel
@ 2017-12-02 11:15 ` Ard Biesheuvel
-1 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2017-12-02 11:15 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-crypto, Herbert Xu, linux-arm-kernel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Catalin Marinas, Will Deacon,
Steven Rostedt, Thomas Gleixner
On 2 December 2017 at 09:11, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> On 2 December 2017 at 09:01, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Fri, Dec 01, 2017 at 09:19:22PM +0000, Ard Biesheuvel wrote:
>>> Note that the remaining crypto drivers simply operate on fixed buffers, so
>>> while the RT crowd may still feel the need to disable those (and the ones
>>> below as well, perhaps), they don't call back into the crypto layer like
>>> the ones updated by this series, and so there's no room for improvement
>>> there AFAICT.
>>
>> Do these other drivers process all the blocks fed to them in one go
>> under a single NEON section, or do they do a single fixed block per
>> NEON invocation?
>
> They consume the entire input in a single go, yes. But making it more
> granular than that is going to hurt performance, unless we introduce
> some kind of kernel_neon_yield(), which does a end+begin but only if
> the task is being scheduled out.
>
> For example, the SHA256 keeps 256 bytes of round constants in NEON
> registers, and reloading those from memory for each 64 byte block of
> input is going to be noticeable. The same applies to the AES code
> (although the numbers are slightly different)
Something like below should do the trick I think (apologies for the
patch soup). I.e., check TIF_NEED_RESCHED at a point where only very
few NEON registers are live, and preserve/restore the live registers
across calls to kernel_neon_end + kernel_neon_begin. Would that work
for RT?
diff --git a/arch/arm64/crypto/sha2-ce-core.S b/arch/arm64/crypto/sha2-ce-core.S
index 679c6c002f4f..4f12038574f3 100644
--- a/arch/arm64/crypto/sha2-ce-core.S
+++ b/arch/arm64/crypto/sha2-ce-core.S
@@ -77,6 +77,10 @@
* int blocks)
*/
ENTRY(sha2_ce_transform)
+ stp x29, x30, [sp, #-48]!
+ mov x29, sp
+
+restart:
/* load round constants */
adr x8, .Lsha2_rcon
ld1 { v0.4s- v3.4s}, [x8], #64
@@ -129,14 +133,17 @@ CPU_LE( rev32 v19.16b, v19.16b )
add dgbv.4s, dgbv.4s, dg1v.4s
/* handled all input blocks? */
- cbnz w2, 0b
+ cbz w2, 2f
+
+ tif_need_resched 4f, 5
+ b 0b
/*
* Final block: add padding and total bit count.
* Skip if the input size was not a round multiple of the block size,
* the padding is handled by the C code in that case.
*/
- cbz x4, 3f
+2: cbz x4, 3f
ldr_l w4, sha256_ce_offsetof_count, x4
ldr x4, [x0, x4]
movi v17.2d, #0
@@ -151,5 +158,15 @@ CPU_LE( rev32 v19.16b, v19.16b )
/* store new state */
3: st1 {dgav.4s, dgbv.4s}, [x0]
+ ldp x29, x30, [sp], #48
ret
+
+4: st1 {dgav.4s, dgbv.4s}, [x0]
+ stp x0, x1, [sp, #16]
+ stp x2, x4, [sp, #32]
+ bl kernel_neon_end
+ bl kernel_neon_begin
+ ldp x0, x1, [sp, #16]
+ ldp x2, x4, [sp, #32]
+ b restart
ENDPROC(sha2_ce_transform)
diff --git a/arch/arm64/include/asm/assembler.h
b/arch/arm64/include/asm/assembler.h
index aef72d886677..e3e7e15ebefd 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -512,4 +512,15 @@ alternative_else_nop_endif
#endif
.endm
+/*
+ * Check TIF_NEED_RESCHED flag from assembler (for kernel mode NEON)
+ */
+ .macro tif_need_resched, lbl:req, regnum:req
+#ifdef CONFIG_PREEMPT
+ get_thread_info x\regnum
+ ldr w\regnum, [x\regnum, #TSK_TI_FLAGS] // get flags
+ tbnz w\regnum, #TIF_NEED_RESCHED, \lbl // needs rescheduling?
+#endif
+ .endm
+
#endif /* __ASM_ASSEMBLER_H */
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
@ 2017-12-02 11:15 ` Ard Biesheuvel
0 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2017-12-02 11:15 UTC (permalink / raw)
To: linux-arm-kernel
On 2 December 2017 at 09:11, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> On 2 December 2017 at 09:01, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Fri, Dec 01, 2017 at 09:19:22PM +0000, Ard Biesheuvel wrote:
>>> Note that the remaining crypto drivers simply operate on fixed buffers, so
>>> while the RT crowd may still feel the need to disable those (and the ones
>>> below as well, perhaps), they don't call back into the crypto layer like
>>> the ones updated by this series, and so there's no room for improvement
>>> there AFAICT.
>>
>> Do these other drivers process all the blocks fed to them in one go
>> under a single NEON section, or do they do a single fixed block per
>> NEON invocation?
>
> They consume the entire input in a single go, yes. But making it more
> granular than that is going to hurt performance, unless we introduce
> some kind of kernel_neon_yield(), which does a end+begin but only if
> the task is being scheduled out.
>
> For example, the SHA256 keeps 256 bytes of round constants in NEON
> registers, and reloading those from memory for each 64 byte block of
> input is going to be noticeable. The same applies to the AES code
> (although the numbers are slightly different)
Something like below should do the trick I think (apologies for the
patch soup). I.e., check TIF_NEED_RESCHED at a point where only very
few NEON registers are live, and preserve/restore the live registers
across calls to kernel_neon_end + kernel_neon_begin. Would that work
for RT?
diff --git a/arch/arm64/crypto/sha2-ce-core.S b/arch/arm64/crypto/sha2-ce-core.S
index 679c6c002f4f..4f12038574f3 100644
--- a/arch/arm64/crypto/sha2-ce-core.S
+++ b/arch/arm64/crypto/sha2-ce-core.S
@@ -77,6 +77,10 @@
* int blocks)
*/
ENTRY(sha2_ce_transform)
+ stp x29, x30, [sp, #-48]!
+ mov x29, sp
+
+restart:
/* load round constants */
adr x8, .Lsha2_rcon
ld1 { v0.4s- v3.4s}, [x8], #64
@@ -129,14 +133,17 @@ CPU_LE( rev32 v19.16b, v19.16b )
add dgbv.4s, dgbv.4s, dg1v.4s
/* handled all input blocks? */
- cbnz w2, 0b
+ cbz w2, 2f
+
+ tif_need_resched 4f, 5
+ b 0b
/*
* Final block: add padding and total bit count.
* Skip if the input size was not a round multiple of the block size,
* the padding is handled by the C code in that case.
*/
- cbz x4, 3f
+2: cbz x4, 3f
ldr_l w4, sha256_ce_offsetof_count, x4
ldr x4, [x0, x4]
movi v17.2d, #0
@@ -151,5 +158,15 @@ CPU_LE( rev32 v19.16b, v19.16b )
/* store new state */
3: st1 {dgav.4s, dgbv.4s}, [x0]
+ ldp x29, x30, [sp], #48
ret
+
+4: st1 {dgav.4s, dgbv.4s}, [x0]
+ stp x0, x1, [sp, #16]
+ stp x2, x4, [sp, #32]
+ bl kernel_neon_end
+ bl kernel_neon_begin
+ ldp x0, x1, [sp, #16]
+ ldp x2, x4, [sp, #32]
+ b restart
ENDPROC(sha2_ce_transform)
diff --git a/arch/arm64/include/asm/assembler.h
b/arch/arm64/include/asm/assembler.h
index aef72d886677..e3e7e15ebefd 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -512,4 +512,15 @@ alternative_else_nop_endif
#endif
.endm
+/*
+ * Check TIF_NEED_RESCHED flag from assembler (for kernel mode NEON)
+ */
+ .macro tif_need_resched, lbl:req, regnum:req
+#ifdef CONFIG_PREEMPT
+ get_thread_info x\regnum
+ ldr w\regnum, [x\regnum, #TSK_TI_FLAGS] // get flags
+ tbnz w\regnum, #TIF_NEED_RESCHED, \lbl // needs rescheduling?
+#endif
+ .endm
+
#endif /* __ASM_ASSEMBLER_H */
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
2017-12-02 9:11 ` Ard Biesheuvel
@ 2017-12-02 13:54 ` Peter Zijlstra
-1 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2017-12-02 13:54 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Mark Rutland, Herbert Xu, Catalin Marinas,
Sebastian Andrzej Siewior, Will Deacon, Russell King - ARM Linux,
Steven Rostedt, linux-crypto, Thomas Gleixner, Dave Martin,
linux-arm-kernel, linux-rt-users
On Sat, Dec 02, 2017 at 09:11:46AM +0000, Ard Biesheuvel wrote:
> On 2 December 2017 at 09:01, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Dec 01, 2017 at 09:19:22PM +0000, Ard Biesheuvel wrote:
> >> Note that the remaining crypto drivers simply operate on fixed buffers, so
> >> while the RT crowd may still feel the need to disable those (and the ones
> >> below as well, perhaps), they don't call back into the crypto layer like
> >> the ones updated by this series, and so there's no room for improvement
> >> there AFAICT.
> >
> > Do these other drivers process all the blocks fed to them in one go
> > under a single NEON section, or do they do a single fixed block per
> > NEON invocation?
>
> They consume the entire input in a single go, yes. But making it more
> granular than that is going to hurt performance, unless we introduce
> some kind of kernel_neon_yield(), which does a end+begin but only if
> the task is being scheduled out.
A little something like this:
https://lkml.kernel.org/r/20171201113235.6tmkwtov5cg2locv@hirez.programming.kicks-ass.net
> For example, the SHA256 keeps 256 bytes of round constants in NEON
> registers, and reloading those from memory for each 64 byte block of
> input is going to be noticeable. The same applies to the AES code
> (although the numbers are slightly different)
Quite. We could augment the above function with a return value that says
if we actually did a end/begin and registers were clobbered.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
@ 2017-12-02 13:54 ` Peter Zijlstra
0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2017-12-02 13:54 UTC (permalink / raw)
To: linux-arm-kernel
On Sat, Dec 02, 2017 at 09:11:46AM +0000, Ard Biesheuvel wrote:
> On 2 December 2017 at 09:01, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Dec 01, 2017 at 09:19:22PM +0000, Ard Biesheuvel wrote:
> >> Note that the remaining crypto drivers simply operate on fixed buffers, so
> >> while the RT crowd may still feel the need to disable those (and the ones
> >> below as well, perhaps), they don't call back into the crypto layer like
> >> the ones updated by this series, and so there's no room for improvement
> >> there AFAICT.
> >
> > Do these other drivers process all the blocks fed to them in one go
> > under a single NEON section, or do they do a single fixed block per
> > NEON invocation?
>
> They consume the entire input in a single go, yes. But making it more
> granular than that is going to hurt performance, unless we introduce
> some kind of kernel_neon_yield(), which does a end+begin but only if
> the task is being scheduled out.
A little something like this:
https://lkml.kernel.org/r/20171201113235.6tmkwtov5cg2locv at hirez.programming.kicks-ass.net
> For example, the SHA256 keeps 256 bytes of round constants in NEON
> registers, and reloading those from memory for each 64 byte block of
> input is going to be noticeable. The same applies to the AES code
> (although the numbers are slightly different)
Quite. We could augment the above function with a return value that says
if we actually did a end/begin and registers were clobbered.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
2017-12-02 11:15 ` Ard Biesheuvel
@ 2017-12-02 13:59 ` Peter Zijlstra
-1 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2017-12-02 13:59 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Mark Rutland, Herbert Xu, Catalin Marinas,
Sebastian Andrzej Siewior, Will Deacon, Russell King - ARM Linux,
Steven Rostedt, linux-crypto, Thomas Gleixner, Dave Martin,
linux-arm-kernel, linux-rt-users
On Sat, Dec 02, 2017 at 11:15:14AM +0000, Ard Biesheuvel wrote:
> On 2 December 2017 at 09:11, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> > They consume the entire input in a single go, yes. But making it more
> > granular than that is going to hurt performance, unless we introduce
> > some kind of kernel_neon_yield(), which does a end+begin but only if
> > the task is being scheduled out.
> >
> > For example, the SHA256 keeps 256 bytes of round constants in NEON
> > registers, and reloading those from memory for each 64 byte block of
> > input is going to be noticeable. The same applies to the AES code
> > (although the numbers are slightly different)
>
> Something like below should do the trick I think (apologies for the
> patch soup). I.e., check TIF_NEED_RESCHED at a point where only very
> few NEON registers are live, and preserve/restore the live registers
> across calls to kernel_neon_end + kernel_neon_begin. Would that work
> for RT?
Probably yes. The important point is that preempt latencies (and thus by
extension NEON regions) are bounded and preferably small.
Unbounded stuff (like depends on the amount of data fed) are a complete
no-no for RT since then you cannot make predictions on how long things
will take.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
@ 2017-12-02 13:59 ` Peter Zijlstra
0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2017-12-02 13:59 UTC (permalink / raw)
To: linux-arm-kernel
On Sat, Dec 02, 2017 at 11:15:14AM +0000, Ard Biesheuvel wrote:
> On 2 December 2017 at 09:11, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> > They consume the entire input in a single go, yes. But making it more
> > granular than that is going to hurt performance, unless we introduce
> > some kind of kernel_neon_yield(), which does a end+begin but only if
> > the task is being scheduled out.
> >
> > For example, the SHA256 keeps 256 bytes of round constants in NEON
> > registers, and reloading those from memory for each 64 byte block of
> > input is going to be noticeable. The same applies to the AES code
> > (although the numbers are slightly different)
>
> Something like below should do the trick I think (apologies for the
> patch soup). I.e., check TIF_NEED_RESCHED at a point where only very
> few NEON registers are live, and preserve/restore the live registers
> across calls to kernel_neon_end + kernel_neon_begin. Would that work
> for RT?
Probably yes. The important point is that preempt latencies (and thus by
extension NEON regions) are bounded and preferably small.
Unbounded stuff (like depends on the amount of data fed) are a complete
no-no for RT since then you cannot make predictions on how long things
will take.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
2017-12-02 13:59 ` Peter Zijlstra
@ 2017-12-04 9:08 ` Ard Biesheuvel
-1 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 9:08 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mark Rutland, Herbert Xu, Catalin Marinas,
Sebastian Andrzej Siewior, Will Deacon, Russell King - ARM Linux,
Steven Rostedt, linux-crypto, Thomas Gleixner, Dave Martin,
linux-arm-kernel, linux-rt-users
On 2 December 2017 at 13:59, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sat, Dec 02, 2017 at 11:15:14AM +0000, Ard Biesheuvel wrote:
>> On 2 December 2017 at 09:11, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>
>> > They consume the entire input in a single go, yes. But making it more
>> > granular than that is going to hurt performance, unless we introduce
>> > some kind of kernel_neon_yield(), which does a end+begin but only if
>> > the task is being scheduled out.
>> >
>> > For example, the SHA256 keeps 256 bytes of round constants in NEON
>> > registers, and reloading those from memory for each 64 byte block of
>> > input is going to be noticeable. The same applies to the AES code
>> > (although the numbers are slightly different)
>>
>> Something like below should do the trick I think (apologies for the
>> patch soup). I.e., check TIF_NEED_RESCHED at a point where only very
>> few NEON registers are live, and preserve/restore the live registers
>> across calls to kernel_neon_end + kernel_neon_begin. Would that work
>> for RT?
>
> Probably yes. The important point is that preempt latencies (and thus by
> extension NEON regions) are bounded and preferably small.
>
> Unbounded stuff (like depends on the amount of data fed) are a complete
> no-no for RT since then you cannot make predictions on how long things
> will take.
>
OK, that makes sense. But I do wonder what the parameters should be here.
For instance, the AES instructions on ARMv8 operate at <1 cycle per
byte, and so checking the TIF_NEED_RESCHED flag for every iteration of
the inner loop (i.e., every 64 bytes ~ 64 cycles) is clearly going to
be noticeable, and is probably overkill. The pure NEON version (which
is instantiated from the same block mode wrappers) uses ~25 cycles per
byte, and the bit sliced NEON version runs at ~20 cycles per byte but
can only operate at 8 blocks (128 bytes) at a time.
So rather than simply polling the bit at each iteration of the inner
loop in each algorithm, I'd prefer to aim for a ballpark number of
cycles to execute, in the order 1000 - 2000. Would that be OK or too
coarse?
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
@ 2017-12-04 9:08 ` Ard Biesheuvel
0 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 9:08 UTC (permalink / raw)
To: linux-arm-kernel
On 2 December 2017 at 13:59, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sat, Dec 02, 2017 at 11:15:14AM +0000, Ard Biesheuvel wrote:
>> On 2 December 2017 at 09:11, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>
>> > They consume the entire input in a single go, yes. But making it more
>> > granular than that is going to hurt performance, unless we introduce
>> > some kind of kernel_neon_yield(), which does a end+begin but only if
>> > the task is being scheduled out.
>> >
>> > For example, the SHA256 keeps 256 bytes of round constants in NEON
>> > registers, and reloading those from memory for each 64 byte block of
>> > input is going to be noticeable. The same applies to the AES code
>> > (although the numbers are slightly different)
>>
>> Something like below should do the trick I think (apologies for the
>> patch soup). I.e., check TIF_NEED_RESCHED at a point where only very
>> few NEON registers are live, and preserve/restore the live registers
>> across calls to kernel_neon_end + kernel_neon_begin. Would that work
>> for RT?
>
> Probably yes. The important point is that preempt latencies (and thus by
> extension NEON regions) are bounded and preferably small.
>
> Unbounded stuff (like depends on the amount of data fed) are a complete
> no-no for RT since then you cannot make predictions on how long things
> will take.
>
OK, that makes sense. But I do wonder what the parameters should be here.
For instance, the AES instructions on ARMv8 operate at <1 cycle per
byte, and so checking the TIF_NEED_RESCHED flag for every iteration of
the inner loop (i.e., every 64 bytes ~ 64 cycles) is clearly going to
be noticeable, and is probably overkill. The pure NEON version (which
is instantiated from the same block mode wrappers) uses ~25 cycles per
byte, and the bit sliced NEON version runs at ~20 cycles per byte but
can only operate at 8 blocks (128 bytes) at a time.
So rather than simply polling the bit@each iteration of the inner
loop in each algorithm, I'd prefer to aim for a ballpark number of
cycles to execute, in the order 1000 - 2000. Would that be OK or too
coarse?
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2017-12-04 9:08 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-01 21:19 [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls Ard Biesheuvel
2017-12-01 21:19 ` Ard Biesheuvel
2017-12-01 21:19 ` [PATCH 1/5] crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop Ard Biesheuvel
2017-12-01 21:19 ` Ard Biesheuvel
2017-12-01 21:19 ` [PATCH 2/5] crypto: arm64/aes-blk " Ard Biesheuvel
2017-12-01 21:19 ` Ard Biesheuvel
2017-12-01 21:19 ` [PATCH 3/5] crypto: arm64/aes-bs " Ard Biesheuvel
2017-12-01 21:19 ` Ard Biesheuvel
2017-12-01 21:19 ` [PATCH 4/5] crypto: arm64/chacha20 " Ard Biesheuvel
2017-12-01 21:19 ` Ard Biesheuvel
2017-12-01 21:19 ` [PATCH 5/5] crypto: arm64/ghash " Ard Biesheuvel
2017-12-01 21:19 ` Ard Biesheuvel
2017-12-02 9:01 ` [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls Peter Zijlstra
2017-12-02 9:01 ` Peter Zijlstra
2017-12-02 9:11 ` Ard Biesheuvel
2017-12-02 9:11 ` Ard Biesheuvel
2017-12-02 11:15 ` Ard Biesheuvel
2017-12-02 11:15 ` Ard Biesheuvel
2017-12-02 13:59 ` Peter Zijlstra
2017-12-02 13:59 ` Peter Zijlstra
2017-12-04 9:08 ` Ard Biesheuvel
2017-12-04 9:08 ` Ard Biesheuvel
2017-12-02 13:54 ` Peter Zijlstra
2017-12-02 13:54 ` Peter Zijlstra
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.