* [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2)
@ 2017-01-23 14:05 Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 01/10] crypto: arm64/aes-neon-bs - honour iv_out requirement in CTR mode Ard Biesheuvel
` (9 more replies)
0 siblings, 10 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
To: linux-crypto, herbert; +Cc: Ard Biesheuvel
Patch #1 is a fix for the CBC chaining issue that was discussed on the
mailing list. The driver itself is queued for v4.11, so this fix can go
right on top.
Patches #2 - #6 clear the cra_alignmasks of various drivers: all NEON
capable CPUs can perform unaligned accesses, and the advantage of using
the slightly faster aligned accessors (which only exist on ARM not arm64)
is certainly outweighed by the cost of copying data to suitably aligned
buffers.
NOTE: patch #5 won't apply unless 'crypto: arm64/aes-blk - honour iv_out
requirement in CBC and CTR modes' is applied first, which was sent out
separately as a bugfix for v3.16 - v4.9. If this is a problem, this patch
can wait.
Patch #7 and #8 are minor tweaks to the new scalar AES code.
Patch #9 improves the performance of the plain NEON AES code, to make it
more suitable as a fallback for the new bitsliced NEON code, which can
only operate on 8 blocks in parallel, and needs another driver to perform
CBC encryption or XTS tweak generation.
Patch #10 updates the new bitsliced AES NEON code to switch to the plain
NEON driver as a fallback.
Patches #9 and #10 improve the performance of CBC encryption by ~35% on
low end cores such as the Cortex-A53 that can be found in the Raspberry Pi3
Changes since v1:
- shave off another few cycles from the sequential AES NEON code (patch #9)
Ard Biesheuvel (10):
crypto: arm64/aes-neon-bs - honour iv_out requirement in CTR mode
crypto: arm/aes-ce - remove cra_alignmask
crypto: arm/chacha20 - remove cra_alignmask
crypto: arm64/aes-ce-ccm - remove cra_alignmask
crypto: arm64/aes-blk - remove cra_alignmask
crypto: arm64/chacha20 - remove cra_alignmask
crypto: arm64/aes - avoid literals for cross-module symbol references
crypto: arm64/aes - performance tweak
crypto: arm64/aes-neon-blk - tweak performance for low end cores
crypto: arm64/aes - replace scalar fallback with plain NEON fallback
arch/arm/crypto/aes-ce-core.S | 84 ++++----
arch/arm/crypto/aes-ce-glue.c | 15 +-
arch/arm/crypto/chacha20-neon-glue.c | 1 -
arch/arm64/crypto/Kconfig | 2 +-
arch/arm64/crypto/aes-ce-ccm-glue.c | 1 -
arch/arm64/crypto/aes-cipher-core.S | 59 ++----
arch/arm64/crypto/aes-glue.c | 18 +-
arch/arm64/crypto/aes-modes.S | 8 +-
arch/arm64/crypto/aes-neon.S | 210 ++++++++------------
arch/arm64/crypto/aes-neonbs-core.S | 25 ++-
arch/arm64/crypto/aes-neonbs-glue.c | 38 +++-
arch/arm64/crypto/chacha20-neon-glue.c | 1 -
12 files changed, 203 insertions(+), 259 deletions(-)
--
2.7.4
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v2 01/10] crypto: arm64/aes-neon-bs - honour iv_out requirement in CTR mode
2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
@ 2017-01-23 14:05 ` Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 02/10] crypto: arm/aes-ce - remove cra_alignmask Ard Biesheuvel
` (8 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
To: linux-crypto, herbert; +Cc: Ard Biesheuvel
Update the new bitsliced NEON AES implementation in CTR mode to return
the next IV back to the skcipher API client. This is necessary for
chaining to work correctly.
Note that this is only done if the request is a round multiple of the
block size, since otherwise, chaining is impossible anyway.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-neonbs-core.S | 25 +++++++++++++-------
1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/crypto/aes-neonbs-core.S b/arch/arm64/crypto/aes-neonbs-core.S
index 8d0cdaa2768d..2ada12dd768e 100644
--- a/arch/arm64/crypto/aes-neonbs-core.S
+++ b/arch/arm64/crypto/aes-neonbs-core.S
@@ -874,12 +874,19 @@ CPU_LE( rev x8, x8 )
csel x4, x4, xzr, pl
csel x9, x9, xzr, le
+ tbnz x9, #1, 0f
next_ctr v1
+ tbnz x9, #2, 0f
next_ctr v2
+ tbnz x9, #3, 0f
next_ctr v3
+ tbnz x9, #4, 0f
next_ctr v4
+ tbnz x9, #5, 0f
next_ctr v5
+ tbnz x9, #6, 0f
next_ctr v6
+ tbnz x9, #7, 0f
next_ctr v7
0: mov bskey, x2
@@ -928,11 +935,11 @@ CPU_LE( rev x8, x8 )
eor v5.16b, v5.16b, v15.16b
st1 {v5.16b}, [x0], #16
- next_ctr v0
+8: next_ctr v0
cbnz x4, 99b
0: st1 {v0.16b}, [x5]
-8: ldp x29, x30, [sp], #16
+9: ldp x29, x30, [sp], #16
ret
/*
@@ -941,23 +948,23 @@ CPU_LE( rev x8, x8 )
*/
1: cbz x6, 8b
st1 {v1.16b}, [x5]
- b 8b
+ b 9b
2: cbz x6, 8b
st1 {v4.16b}, [x5]
- b 8b
+ b 9b
3: cbz x6, 8b
st1 {v6.16b}, [x5]
- b 8b
+ b 9b
4: cbz x6, 8b
st1 {v3.16b}, [x5]
- b 8b
+ b 9b
5: cbz x6, 8b
st1 {v7.16b}, [x5]
- b 8b
+ b 9b
6: cbz x6, 8b
st1 {v2.16b}, [x5]
- b 8b
+ b 9b
7: cbz x6, 8b
st1 {v5.16b}, [x5]
- b 8b
+ b 9b
ENDPROC(aesbs_ctr_encrypt)
--
2.7.4
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v2 02/10] crypto: arm/aes-ce - remove cra_alignmask
2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 01/10] crypto: arm64/aes-neon-bs - honour iv_out requirement in CTR mode Ard Biesheuvel
@ 2017-01-23 14:05 ` Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 03/10] crypto: arm/chacha20 " Ard Biesheuvel
` (7 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
To: linux-crypto, herbert; +Cc: Ard Biesheuvel
Remove the unnecessary alignmask: it is much more efficient to deal with
the misalignment in the core algorithm than relying on the crypto API to
copy the data to a suitably aligned buffer.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm/crypto/aes-ce-core.S | 84 ++++++++++----------
arch/arm/crypto/aes-ce-glue.c | 15 ++--
2 files changed, 47 insertions(+), 52 deletions(-)
diff --git a/arch/arm/crypto/aes-ce-core.S b/arch/arm/crypto/aes-ce-core.S
index 987aa632c9f0..ba8e6a32fdc9 100644
--- a/arch/arm/crypto/aes-ce-core.S
+++ b/arch/arm/crypto/aes-ce-core.S
@@ -169,19 +169,19 @@ ENTRY(ce_aes_ecb_encrypt)
.Lecbencloop3x:
subs r4, r4, #3
bmi .Lecbenc1x
- vld1.8 {q0-q1}, [r1, :64]!
- vld1.8 {q2}, [r1, :64]!
+ vld1.8 {q0-q1}, [r1]!
+ vld1.8 {q2}, [r1]!
bl aes_encrypt_3x
- vst1.8 {q0-q1}, [r0, :64]!
- vst1.8 {q2}, [r0, :64]!
+ vst1.8 {q0-q1}, [r0]!
+ vst1.8 {q2}, [r0]!
b .Lecbencloop3x
.Lecbenc1x:
adds r4, r4, #3
beq .Lecbencout
.Lecbencloop:
- vld1.8 {q0}, [r1, :64]!
+ vld1.8 {q0}, [r1]!
bl aes_encrypt
- vst1.8 {q0}, [r0, :64]!
+ vst1.8 {q0}, [r0]!
subs r4, r4, #1
bne .Lecbencloop
.Lecbencout:
@@ -195,19 +195,19 @@ ENTRY(ce_aes_ecb_decrypt)
.Lecbdecloop3x:
subs r4, r4, #3
bmi .Lecbdec1x
- vld1.8 {q0-q1}, [r1, :64]!
- vld1.8 {q2}, [r1, :64]!
+ vld1.8 {q0-q1}, [r1]!
+ vld1.8 {q2}, [r1]!
bl aes_decrypt_3x
- vst1.8 {q0-q1}, [r0, :64]!
- vst1.8 {q2}, [r0, :64]!
+ vst1.8 {q0-q1}, [r0]!
+ vst1.8 {q2}, [r0]!
b .Lecbdecloop3x
.Lecbdec1x:
adds r4, r4, #3
beq .Lecbdecout
.Lecbdecloop:
- vld1.8 {q0}, [r1, :64]!
+ vld1.8 {q0}, [r1]!
bl aes_decrypt
- vst1.8 {q0}, [r0, :64]!
+ vst1.8 {q0}, [r0]!
subs r4, r4, #1
bne .Lecbdecloop
.Lecbdecout:
@@ -226,10 +226,10 @@ ENTRY(ce_aes_cbc_encrypt)
vld1.8 {q0}, [r5]
prepare_key r2, r3
.Lcbcencloop:
- vld1.8 {q1}, [r1, :64]! @ get next pt block
+ vld1.8 {q1}, [r1]! @ get next pt block
veor q0, q0, q1 @ ..and xor with iv
bl aes_encrypt
- vst1.8 {q0}, [r0, :64]!
+ vst1.8 {q0}, [r0]!
subs r4, r4, #1
bne .Lcbcencloop
vst1.8 {q0}, [r5]
@@ -244,8 +244,8 @@ ENTRY(ce_aes_cbc_decrypt)
.Lcbcdecloop3x:
subs r4, r4, #3
bmi .Lcbcdec1x
- vld1.8 {q0-q1}, [r1, :64]!
- vld1.8 {q2}, [r1, :64]!
+ vld1.8 {q0-q1}, [r1]!
+ vld1.8 {q2}, [r1]!
vmov q3, q0
vmov q4, q1
vmov q5, q2
@@ -254,19 +254,19 @@ ENTRY(ce_aes_cbc_decrypt)
veor q1, q1, q3
veor q2, q2, q4
vmov q6, q5
- vst1.8 {q0-q1}, [r0, :64]!
- vst1.8 {q2}, [r0, :64]!
+ vst1.8 {q0-q1}, [r0]!
+ vst1.8 {q2}, [r0]!
b .Lcbcdecloop3x
.Lcbcdec1x:
adds r4, r4, #3
beq .Lcbcdecout
vmov q15, q14 @ preserve last round key
.Lcbcdecloop:
- vld1.8 {q0}, [r1, :64]! @ get next ct block
+ vld1.8 {q0}, [r1]! @ get next ct block
veor q14, q15, q6 @ combine prev ct with last key
vmov q6, q0
bl aes_decrypt
- vst1.8 {q0}, [r0, :64]!
+ vst1.8 {q0}, [r0]!
subs r4, r4, #1
bne .Lcbcdecloop
.Lcbcdecout:
@@ -300,15 +300,15 @@ ENTRY(ce_aes_ctr_encrypt)
rev ip, r6
add r6, r6, #1
vmov s11, ip
- vld1.8 {q3-q4}, [r1, :64]!
- vld1.8 {q5}, [r1, :64]!
+ vld1.8 {q3-q4}, [r1]!
+ vld1.8 {q5}, [r1]!
bl aes_encrypt_3x
veor q0, q0, q3
veor q1, q1, q4
veor q2, q2, q5
rev ip, r6
- vst1.8 {q0-q1}, [r0, :64]!
- vst1.8 {q2}, [r0, :64]!
+ vst1.8 {q0-q1}, [r0]!
+ vst1.8 {q2}, [r0]!
vmov s27, ip
b .Lctrloop3x
.Lctr1x:
@@ -318,10 +318,10 @@ ENTRY(ce_aes_ctr_encrypt)
vmov q0, q6
bl aes_encrypt
subs r4, r4, #1
- bmi .Lctrhalfblock @ blocks < 0 means 1/2 block
- vld1.8 {q3}, [r1, :64]!
+ bmi .Lctrtailblock @ blocks < 0 means tail block
+ vld1.8 {q3}, [r1]!
veor q3, q0, q3
- vst1.8 {q3}, [r0, :64]!
+ vst1.8 {q3}, [r0]!
adds r6, r6, #1 @ increment BE ctr
rev ip, r6
@@ -333,10 +333,8 @@ ENTRY(ce_aes_ctr_encrypt)
vst1.8 {q6}, [r5]
pop {r4-r6, pc}
-.Lctrhalfblock:
- vld1.8 {d1}, [r1, :64]
- veor d0, d0, d1
- vst1.8 {d0}, [r0, :64]
+.Lctrtailblock:
+ vst1.8 {q0}, [r0, :64] @ return just the key stream
pop {r4-r6, pc}
.Lctrcarry:
@@ -405,8 +403,8 @@ ENTRY(ce_aes_xts_encrypt)
.Lxtsenc3x:
subs r4, r4, #3
bmi .Lxtsenc1x
- vld1.8 {q0-q1}, [r1, :64]! @ get 3 pt blocks
- vld1.8 {q2}, [r1, :64]!
+ vld1.8 {q0-q1}, [r1]! @ get 3 pt blocks
+ vld1.8 {q2}, [r1]!
next_tweak q4, q3, q7, q6
veor q0, q0, q3
next_tweak q5, q4, q7, q6
@@ -416,8 +414,8 @@ ENTRY(ce_aes_xts_encrypt)
veor q0, q0, q3
veor q1, q1, q4
veor q2, q2, q5
- vst1.8 {q0-q1}, [r0, :64]! @ write 3 ct blocks
- vst1.8 {q2}, [r0, :64]!
+ vst1.8 {q0-q1}, [r0]! @ write 3 ct blocks
+ vst1.8 {q2}, [r0]!
vmov q3, q5
teq r4, #0
beq .Lxtsencout
@@ -426,11 +424,11 @@ ENTRY(ce_aes_xts_encrypt)
adds r4, r4, #3
beq .Lxtsencout
.Lxtsencloop:
- vld1.8 {q0}, [r1, :64]!
+ vld1.8 {q0}, [r1]!
veor q0, q0, q3
bl aes_encrypt
veor q0, q0, q3
- vst1.8 {q0}, [r0, :64]!
+ vst1.8 {q0}, [r0]!
subs r4, r4, #1
beq .Lxtsencout
next_tweak q3, q3, q7, q6
@@ -456,8 +454,8 @@ ENTRY(ce_aes_xts_decrypt)
.Lxtsdec3x:
subs r4, r4, #3
bmi .Lxtsdec1x
- vld1.8 {q0-q1}, [r1, :64]! @ get 3 ct blocks
- vld1.8 {q2}, [r1, :64]!
+ vld1.8 {q0-q1}, [r1]! @ get 3 ct blocks
+ vld1.8 {q2}, [r1]!
next_tweak q4, q3, q7, q6
veor q0, q0, q3
next_tweak q5, q4, q7, q6
@@ -467,8 +465,8 @@ ENTRY(ce_aes_xts_decrypt)
veor q0, q0, q3
veor q1, q1, q4
veor q2, q2, q5
- vst1.8 {q0-q1}, [r0, :64]! @ write 3 pt blocks
- vst1.8 {q2}, [r0, :64]!
+ vst1.8 {q0-q1}, [r0]! @ write 3 pt blocks
+ vst1.8 {q2}, [r0]!
vmov q3, q5
teq r4, #0
beq .Lxtsdecout
@@ -477,12 +475,12 @@ ENTRY(ce_aes_xts_decrypt)
adds r4, r4, #3
beq .Lxtsdecout
.Lxtsdecloop:
- vld1.8 {q0}, [r1, :64]!
+ vld1.8 {q0}, [r1]!
veor q0, q0, q3
add ip, r2, #32 @ 3rd round key
bl aes_decrypt
veor q0, q0, q3
- vst1.8 {q0}, [r0, :64]!
+ vst1.8 {q0}, [r0]!
subs r4, r4, #1
beq .Lxtsdecout
next_tweak q3, q3, q7, q6
diff --git a/arch/arm/crypto/aes-ce-glue.c b/arch/arm/crypto/aes-ce-glue.c
index 8857531915bf..883b84d828c5 100644
--- a/arch/arm/crypto/aes-ce-glue.c
+++ b/arch/arm/crypto/aes-ce-glue.c
@@ -278,14 +278,15 @@ static int ctr_encrypt(struct skcipher_request *req)
u8 *tsrc = walk.src.virt.addr;
/*
- * Minimum alignment is 8 bytes, so if nbytes is <= 8, we need
- * to tell aes_ctr_encrypt() to only read half a block.
+ * Tell aes_ctr_encrypt() to process a tail block.
*/
- blocks = (nbytes <= 8) ? -1 : 1;
+ blocks = -1;
- ce_aes_ctr_encrypt(tail, tsrc, (u8 *)ctx->key_enc,
+ ce_aes_ctr_encrypt(tail, NULL, (u8 *)ctx->key_enc,
num_rounds(ctx), blocks, walk.iv);
- memcpy(tdst, tail, nbytes);
+ if (tdst != tsrc)
+ memcpy(tdst, tsrc, nbytes);
+ crypto_xor(tdst, tail, nbytes);
err = skcipher_walk_done(&walk, 0);
}
kernel_neon_end();
@@ -345,7 +346,6 @@ static struct skcipher_alg aes_algs[] = { {
.cra_flags = CRYPTO_ALG_INTERNAL,
.cra_blocksize = AES_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct crypto_aes_ctx),
- .cra_alignmask = 7,
.cra_module = THIS_MODULE,
},
.min_keysize = AES_MIN_KEY_SIZE,
@@ -361,7 +361,6 @@ static struct skcipher_alg aes_algs[] = { {
.cra_flags = CRYPTO_ALG_INTERNAL,
.cra_blocksize = AES_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct crypto_aes_ctx),
- .cra_alignmask = 7,
.cra_module = THIS_MODULE,
},
.min_keysize = AES_MIN_KEY_SIZE,
@@ -378,7 +377,6 @@ static struct skcipher_alg aes_algs[] = { {
.cra_flags = CRYPTO_ALG_INTERNAL,
.cra_blocksize = 1,
.cra_ctxsize = sizeof(struct crypto_aes_ctx),
- .cra_alignmask = 7,
.cra_module = THIS_MODULE,
},
.min_keysize = AES_MIN_KEY_SIZE,
@@ -396,7 +394,6 @@ static struct skcipher_alg aes_algs[] = { {
.cra_flags = CRYPTO_ALG_INTERNAL,
.cra_blocksize = AES_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct crypto_aes_xts_ctx),
- .cra_alignmask = 7,
.cra_module = THIS_MODULE,
},
.min_keysize = 2 * AES_MIN_KEY_SIZE,
--
2.7.4
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v2 03/10] crypto: arm/chacha20 - remove cra_alignmask
2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 01/10] crypto: arm64/aes-neon-bs - honour iv_out requirement in CTR mode Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 02/10] crypto: arm/aes-ce - remove cra_alignmask Ard Biesheuvel
@ 2017-01-23 14:05 ` Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 04/10] crypto: arm64/aes-ce-ccm " Ard Biesheuvel
` (6 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
To: linux-crypto, herbert; +Cc: Ard Biesheuvel
Remove the unnecessary alignmask: it is much more efficient to deal with
the misalignment in the core algorithm than relying on the crypto API to
copy the data to a suitably aligned buffer.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm/crypto/chacha20-neon-glue.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/arch/arm/crypto/chacha20-neon-glue.c b/arch/arm/crypto/chacha20-neon-glue.c
index 592f75ae4fa1..59a7be08e80c 100644
--- a/arch/arm/crypto/chacha20-neon-glue.c
+++ b/arch/arm/crypto/chacha20-neon-glue.c
@@ -94,7 +94,6 @@ static struct skcipher_alg alg = {
.base.cra_priority = 300,
.base.cra_blocksize = 1,
.base.cra_ctxsize = sizeof(struct chacha20_ctx),
- .base.cra_alignmask = 1,
.base.cra_module = THIS_MODULE,
.min_keysize = CHACHA20_KEY_SIZE,
--
2.7.4
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v2 04/10] crypto: arm64/aes-ce-ccm - remove cra_alignmask
2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
` (2 preceding siblings ...)
2017-01-23 14:05 ` [PATCH v2 03/10] crypto: arm/chacha20 " Ard Biesheuvel
@ 2017-01-23 14:05 ` Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 05/10] crypto: arm64/aes-blk " Ard Biesheuvel
` (5 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
To: linux-crypto, herbert; +Cc: Ard Biesheuvel
Remove the unnecessary alignmask: it is much more efficient to deal with
the misalignment in the core algorithm than relying on the crypto API to
copy the data to a suitably aligned buffer.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-ce-ccm-glue.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
index cc5515dac74a..6a7dbc7c83a6 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -258,7 +258,6 @@ static struct aead_alg ccm_aes_alg = {
.cra_priority = 300,
.cra_blocksize = 1,
.cra_ctxsize = sizeof(struct crypto_aes_ctx),
- .cra_alignmask = 7,
.cra_module = THIS_MODULE,
},
.ivsize = AES_BLOCK_SIZE,
--
2.7.4
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v2 05/10] crypto: arm64/aes-blk - remove cra_alignmask
2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
` (3 preceding siblings ...)
2017-01-23 14:05 ` [PATCH v2 04/10] crypto: arm64/aes-ce-ccm " Ard Biesheuvel
@ 2017-01-23 14:05 ` Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 06/10] crypto: arm64/chacha20 " Ard Biesheuvel
` (4 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
To: linux-crypto, herbert; +Cc: Ard Biesheuvel
Remove the unnecessary alignmask: it is much more efficient to deal with
the misalignment in the core algorithm than relying on the crypto API to
copy the data to a suitably aligned buffer.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-glue.c | 16 ++++++----------
arch/arm64/crypto/aes-modes.S | 8 +++-----
2 files changed, 9 insertions(+), 15 deletions(-)
diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 5164aaf82c6a..8ee1fb7aaa4f 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -215,14 +215,15 @@ static int ctr_encrypt(struct skcipher_request *req)
u8 *tsrc = walk.src.virt.addr;
/*
- * Minimum alignment is 8 bytes, so if nbytes is <= 8, we need
- * to tell aes_ctr_encrypt() to only read half a block.
+ * Tell aes_ctr_encrypt() to process a tail block.
*/
- blocks = (nbytes <= 8) ? -1 : 1;
+ blocks = -1;
- aes_ctr_encrypt(tail, tsrc, (u8 *)ctx->key_enc, rounds,
+ aes_ctr_encrypt(tail, NULL, (u8 *)ctx->key_enc, rounds,
blocks, walk.iv, first);
- memcpy(tdst, tail, nbytes);
+ if (tdst != tsrc)
+ memcpy(tdst, tsrc, nbytes);
+ crypto_xor(tdst, tail, nbytes);
err = skcipher_walk_done(&walk, 0);
}
kernel_neon_end();
@@ -282,7 +283,6 @@ static struct skcipher_alg aes_algs[] = { {
.cra_flags = CRYPTO_ALG_INTERNAL,
.cra_blocksize = AES_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct crypto_aes_ctx),
- .cra_alignmask = 7,
.cra_module = THIS_MODULE,
},
.min_keysize = AES_MIN_KEY_SIZE,
@@ -298,7 +298,6 @@ static struct skcipher_alg aes_algs[] = { {
.cra_flags = CRYPTO_ALG_INTERNAL,
.cra_blocksize = AES_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct crypto_aes_ctx),
- .cra_alignmask = 7,
.cra_module = THIS_MODULE,
},
.min_keysize = AES_MIN_KEY_SIZE,
@@ -315,7 +314,6 @@ static struct skcipher_alg aes_algs[] = { {
.cra_flags = CRYPTO_ALG_INTERNAL,
.cra_blocksize = 1,
.cra_ctxsize = sizeof(struct crypto_aes_ctx),
- .cra_alignmask = 7,
.cra_module = THIS_MODULE,
},
.min_keysize = AES_MIN_KEY_SIZE,
@@ -332,7 +330,6 @@ static struct skcipher_alg aes_algs[] = { {
.cra_priority = PRIO - 1,
.cra_blocksize = 1,
.cra_ctxsize = sizeof(struct crypto_aes_ctx),
- .cra_alignmask = 7,
.cra_module = THIS_MODULE,
},
.min_keysize = AES_MIN_KEY_SIZE,
@@ -350,7 +347,6 @@ static struct skcipher_alg aes_algs[] = { {
.cra_flags = CRYPTO_ALG_INTERNAL,
.cra_blocksize = AES_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct crypto_aes_xts_ctx),
- .cra_alignmask = 7,
.cra_module = THIS_MODULE,
},
.min_keysize = 2 * AES_MIN_KEY_SIZE,
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 838dad5c209f..92b982a8b112 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -337,7 +337,7 @@ AES_ENTRY(aes_ctr_encrypt)
.Lctrcarrydone:
subs w4, w4, #1
- bmi .Lctrhalfblock /* blocks < 0 means 1/2 block */
+ bmi .Lctrtailblock /* blocks <0 means tail block */
ld1 {v3.16b}, [x1], #16
eor v3.16b, v0.16b, v3.16b
st1 {v3.16b}, [x0], #16
@@ -348,10 +348,8 @@ AES_ENTRY(aes_ctr_encrypt)
FRAME_POP
ret
-.Lctrhalfblock:
- ld1 {v3.8b}, [x1]
- eor v3.8b, v0.8b, v3.8b
- st1 {v3.8b}, [x0]
+.Lctrtailblock:
+ st1 {v0.16b}, [x0]
FRAME_POP
ret
--
2.7.4
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v2 06/10] crypto: arm64/chacha20 - remove cra_alignmask
2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
` (4 preceding siblings ...)
2017-01-23 14:05 ` [PATCH v2 05/10] crypto: arm64/aes-blk " Ard Biesheuvel
@ 2017-01-23 14:05 ` Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 07/10] crypto: arm64/aes - avoid literals for cross-module symbol references Ard Biesheuvel
` (3 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
To: linux-crypto, herbert; +Cc: Ard Biesheuvel
Remove the unnecessary alignmask: it is much more efficient to deal with
the misalignment in the core algorithm than relying on the crypto API to
copy the data to a suitably aligned buffer.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/chacha20-neon-glue.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/arch/arm64/crypto/chacha20-neon-glue.c b/arch/arm64/crypto/chacha20-neon-glue.c
index a7f2337d46cf..a7cd575ea223 100644
--- a/arch/arm64/crypto/chacha20-neon-glue.c
+++ b/arch/arm64/crypto/chacha20-neon-glue.c
@@ -93,7 +93,6 @@ static struct skcipher_alg alg = {
.base.cra_priority = 300,
.base.cra_blocksize = 1,
.base.cra_ctxsize = sizeof(struct chacha20_ctx),
- .base.cra_alignmask = 1,
.base.cra_module = THIS_MODULE,
.min_keysize = CHACHA20_KEY_SIZE,
--
2.7.4
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v2 07/10] crypto: arm64/aes - avoid literals for cross-module symbol references
2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
` (5 preceding siblings ...)
2017-01-23 14:05 ` [PATCH v2 06/10] crypto: arm64/chacha20 " Ard Biesheuvel
@ 2017-01-23 14:05 ` Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 08/10] crypto: arm64/aes - performance tweak Ard Biesheuvel
` (2 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
To: linux-crypto, herbert; +Cc: Ard Biesheuvel
Using simple adrp/add pairs to refer to the AES lookup tables exposed by
the generic AES driver (which could be loaded far away from this driver
when KASLR is in effect) was unreliable at module load time before commit
41c066f2c4d4 ("arm64: assembler: make adr_l work in modules under KASLR"),
which is why the AES code used literals instead.
So now we can get rid of the literals, and switch to the adr_l macro.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-cipher-core.S | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/crypto/aes-cipher-core.S b/arch/arm64/crypto/aes-cipher-core.S
index 37590ab8121a..cd58c61e6677 100644
--- a/arch/arm64/crypto/aes-cipher-core.S
+++ b/arch/arm64/crypto/aes-cipher-core.S
@@ -89,8 +89,8 @@ CPU_BE( rev w8, w8 )
eor w7, w7, w11
eor w8, w8, w12
- ldr tt, =\ttab
- ldr lt, =\ltab
+ adr_l tt, \ttab
+ adr_l lt, \ltab
tbnz rounds, #1, 1f
@@ -111,9 +111,6 @@ CPU_BE( rev w8, w8 )
stp w5, w6, [out]
stp w7, w8, [out, #8]
ret
-
- .align 4
- .ltorg
.endm
.align 5
--
2.7.4
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v2 08/10] crypto: arm64/aes - performance tweak
2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
` (6 preceding siblings ...)
2017-01-23 14:05 ` [PATCH v2 07/10] crypto: arm64/aes - avoid literals for cross-module symbol references Ard Biesheuvel
@ 2017-01-23 14:05 ` Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 09/10] crypto: arm64/aes-neon-blk - tweak performance for low end cores Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 10/10] crypto: arm64/aes - replace scalar fallback with plain NEON fallback Ard Biesheuvel
9 siblings, 0 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
To: linux-crypto, herbert; +Cc: Ard Biesheuvel
Shuffle some instructions around in the __hround macro to shave off
0.1 cycles per byte on Cortex-A57.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-cipher-core.S | 52 +++++++-------------
1 file changed, 19 insertions(+), 33 deletions(-)
diff --git a/arch/arm64/crypto/aes-cipher-core.S b/arch/arm64/crypto/aes-cipher-core.S
index cd58c61e6677..f2f9cc519309 100644
--- a/arch/arm64/crypto/aes-cipher-core.S
+++ b/arch/arm64/crypto/aes-cipher-core.S
@@ -20,46 +20,32 @@
tt .req x4
lt .req x2
- .macro __hround, out0, out1, in0, in1, in2, in3, t0, t1, enc
- ldp \out0, \out1, [rk], #8
-
- ubfx w13, \in0, #0, #8
- ubfx w14, \in1, #8, #8
- ldr w13, [tt, w13, uxtw #2]
- ldr w14, [tt, w14, uxtw #2]
-
+ .macro __pair, enc, reg0, reg1, in0, in1e, in1d, shift
+ ubfx \reg0, \in0, #\shift, #8
.if \enc
- ubfx w17, \in1, #0, #8
- ubfx w18, \in2, #8, #8
+ ubfx \reg1, \in1e, #\shift, #8
.else
- ubfx w17, \in3, #0, #8
- ubfx w18, \in0, #8, #8
+ ubfx \reg1, \in1d, #\shift, #8
.endif
- ldr w17, [tt, w17, uxtw #2]
- ldr w18, [tt, w18, uxtw #2]
+ ldr \reg0, [tt, \reg0, uxtw #2]
+ ldr \reg1, [tt, \reg1, uxtw #2]
+ .endm
- ubfx w15, \in2, #16, #8
- ubfx w16, \in3, #24, #8
- ldr w15, [tt, w15, uxtw #2]
- ldr w16, [tt, w16, uxtw #2]
+ .macro __hround, out0, out1, in0, in1, in2, in3, t0, t1, enc
+ ldp \out0, \out1, [rk], #8
- .if \enc
- ubfx \t0, \in3, #16, #8
- ubfx \t1, \in0, #24, #8
- .else
- ubfx \t0, \in1, #16, #8
- ubfx \t1, \in2, #24, #8
- .endif
- ldr \t0, [tt, \t0, uxtw #2]
- ldr \t1, [tt, \t1, uxtw #2]
+ __pair \enc, w13, w14, \in0, \in1, \in3, 0
+ __pair \enc, w15, w16, \in1, \in2, \in0, 8
+ __pair \enc, w17, w18, \in2, \in3, \in1, 16
+ __pair \enc, \t0, \t1, \in3, \in0, \in2, 24
eor \out0, \out0, w13
- eor \out1, \out1, w17
- eor \out0, \out0, w14, ror #24
- eor \out1, \out1, w18, ror #24
- eor \out0, \out0, w15, ror #16
- eor \out1, \out1, \t0, ror #16
- eor \out0, \out0, w16, ror #8
+ eor \out1, \out1, w14
+ eor \out0, \out0, w15, ror #24
+ eor \out1, \out1, w16, ror #24
+ eor \out0, \out0, w17, ror #16
+ eor \out1, \out1, w18, ror #16
+ eor \out0, \out0, \t0, ror #8
eor \out1, \out1, \t1, ror #8
.endm
--
2.7.4
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v2 09/10] crypto: arm64/aes-neon-blk - tweak performance for low end cores
2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
` (7 preceding siblings ...)
2017-01-23 14:05 ` [PATCH v2 08/10] crypto: arm64/aes - performance tweak Ard Biesheuvel
@ 2017-01-23 14:05 ` Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 10/10] crypto: arm64/aes - replace scalar fallback with plain NEON fallback Ard Biesheuvel
9 siblings, 0 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
To: linux-crypto, herbert; +Cc: Ard Biesheuvel
The non-bitsliced AES implementation using the NEON is highly sensitive
to micro-architectural details, and, as it turns out, the Cortex-A53 on
the Raspberry Pi 3 is a core that can benefit from this code, given that
its scalar AES performance is abysmal (32.9 cycles per byte).
The new bitsliced AES code manages 19.8 cycles per byte on this core,
but can only operate on 8 blocks at a time, which is not supported by
all chaining modes. With a bit of tweaking, we can get the plain NEON
code to run at 22.0 cycles per byte, making it useful for sequential
modes like CBC encryption. (Like bitsliced NEON, the plain NEON
implementation does not use any lookup tables, which makes it easy on
the D-cache, and invulnerable to cache timing attacks)
So tweak the plain NEON AES code to use tbl instructions rather than
shl/sri pairs, and to avoid the need to reload permutation vectors or
other constants from memory in every round.
To allow the ECB and CBC encrypt routines to be reused by the bitsliced
NEON code in a subsequent patch, export them from the module.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-glue.c | 2 +
arch/arm64/crypto/aes-neon.S | 210 ++++++++------------
2 files changed, 81 insertions(+), 131 deletions(-)
diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 8ee1fb7aaa4f..055bc3f61138 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -409,5 +409,7 @@ static int __init aes_init(void)
module_cpu_feature_match(AES, aes_init);
#else
module_init(aes_init);
+EXPORT_SYMBOL(neon_aes_ecb_encrypt);
+EXPORT_SYMBOL(neon_aes_cbc_encrypt);
#endif
module_exit(aes_exit);
diff --git a/arch/arm64/crypto/aes-neon.S b/arch/arm64/crypto/aes-neon.S
index 85f07ead7c5c..0d111a6fc229 100644
--- a/arch/arm64/crypto/aes-neon.S
+++ b/arch/arm64/crypto/aes-neon.S
@@ -1,7 +1,7 @@
/*
* linux/arch/arm64/crypto/aes-neon.S - AES cipher for ARMv8 NEON
*
- * Copyright (C) 2013 Linaro Ltd <ard.biesheuvel@linaro.org>
+ * Copyright (C) 2013 - 2017 Linaro Ltd. <ard.biesheuvel@linaro.org>
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2 as
@@ -25,9 +25,9 @@
/* preload the entire Sbox */
.macro prepare, sbox, shiftrows, temp
adr \temp, \sbox
- movi v12.16b, #0x40
+ movi v12.16b, #0x1b
ldr q13, \shiftrows
- movi v14.16b, #0x1b
+ ldr q14, .Lror32by8
ld1 {v16.16b-v19.16b}, [\temp], #64
ld1 {v20.16b-v23.16b}, [\temp], #64
ld1 {v24.16b-v27.16b}, [\temp], #64
@@ -50,37 +50,32 @@
/* apply SubBytes transformation using the the preloaded Sbox */
.macro sub_bytes, in
- sub v9.16b, \in\().16b, v12.16b
+ sub v9.16b, \in\().16b, v15.16b
tbl \in\().16b, {v16.16b-v19.16b}, \in\().16b
- sub v10.16b, v9.16b, v12.16b
+ sub v10.16b, v9.16b, v15.16b
tbx \in\().16b, {v20.16b-v23.16b}, v9.16b
- sub v11.16b, v10.16b, v12.16b
+ sub v11.16b, v10.16b, v15.16b
tbx \in\().16b, {v24.16b-v27.16b}, v10.16b
tbx \in\().16b, {v28.16b-v31.16b}, v11.16b
.endm
/* apply MixColumns transformation */
- .macro mix_columns, in
- mul_by_x v10.16b, \in\().16b, v9.16b, v14.16b
- rev32 v8.8h, \in\().8h
- eor \in\().16b, v10.16b, \in\().16b
- shl v9.4s, v8.4s, #24
- shl v11.4s, \in\().4s, #24
- sri v9.4s, v8.4s, #8
- sri v11.4s, \in\().4s, #8
- eor v9.16b, v9.16b, v8.16b
- eor v10.16b, v10.16b, v9.16b
- eor \in\().16b, v10.16b, v11.16b
- .endm
-
+ .macro mix_columns, in, enc
+ .if \enc == 0
/* Inverse MixColumns: pre-multiply by { 5, 0, 4, 0 } */
- .macro inv_mix_columns, in
- mul_by_x v11.16b, \in\().16b, v10.16b, v14.16b
- mul_by_x v11.16b, v11.16b, v10.16b, v14.16b
- eor \in\().16b, \in\().16b, v11.16b
- rev32 v11.8h, v11.8h
- eor \in\().16b, \in\().16b, v11.16b
- mix_columns \in
+ mul_by_x v8.16b, \in\().16b, v9.16b, v12.16b
+ mul_by_x v8.16b, v8.16b, v9.16b, v12.16b
+ eor \in\().16b, \in\().16b, v8.16b
+ rev32 v8.8h, v8.8h
+ eor \in\().16b, \in\().16b, v8.16b
+ .endif
+
+ mul_by_x v9.16b, \in\().16b, v8.16b, v12.16b
+ rev32 v8.8h, \in\().8h
+ eor v8.16b, v8.16b, v9.16b
+ eor \in\().16b, \in\().16b, v8.16b
+ tbl \in\().16b, {\in\().16b}, v14.16b
+ eor \in\().16b, \in\().16b, v8.16b
.endm
.macro do_block, enc, in, rounds, rk, rkp, i
@@ -88,16 +83,13 @@
add \rkp, \rk, #16
mov \i, \rounds
1111: eor \in\().16b, \in\().16b, v15.16b /* ^round key */
+ movi v15.16b, #0x40
tbl \in\().16b, {\in\().16b}, v13.16b /* ShiftRows */
sub_bytes \in
- ld1 {v15.4s}, [\rkp], #16
subs \i, \i, #1
+ ld1 {v15.4s}, [\rkp], #16
beq 2222f
- .if \enc == 1
- mix_columns \in
- .else
- inv_mix_columns \in
- .endif
+ mix_columns \in, \enc
b 1111b
2222: eor \in\().16b, \in\().16b, v15.16b /* ^round key */
.endm
@@ -116,48 +108,48 @@
*/
.macro sub_bytes_2x, in0, in1
- sub v8.16b, \in0\().16b, v12.16b
- sub v9.16b, \in1\().16b, v12.16b
+ sub v8.16b, \in0\().16b, v15.16b
tbl \in0\().16b, {v16.16b-v19.16b}, \in0\().16b
+ sub v9.16b, \in1\().16b, v15.16b
tbl \in1\().16b, {v16.16b-v19.16b}, \in1\().16b
- sub v10.16b, v8.16b, v12.16b
- sub v11.16b, v9.16b, v12.16b
+ sub v10.16b, v8.16b, v15.16b
tbx \in0\().16b, {v20.16b-v23.16b}, v8.16b
+ sub v11.16b, v9.16b, v15.16b
tbx \in1\().16b, {v20.16b-v23.16b}, v9.16b
- sub v8.16b, v10.16b, v12.16b
- sub v9.16b, v11.16b, v12.16b
+ sub v8.16b, v10.16b, v15.16b
tbx \in0\().16b, {v24.16b-v27.16b}, v10.16b
+ sub v9.16b, v11.16b, v15.16b
tbx \in1\().16b, {v24.16b-v27.16b}, v11.16b
tbx \in0\().16b, {v28.16b-v31.16b}, v8.16b
tbx \in1\().16b, {v28.16b-v31.16b}, v9.16b
.endm
.macro sub_bytes_4x, in0, in1, in2, in3
- sub v8.16b, \in0\().16b, v12.16b
+ sub v8.16b, \in0\().16b, v15.16b
tbl \in0\().16b, {v16.16b-v19.16b}, \in0\().16b
- sub v9.16b, \in1\().16b, v12.16b
+ sub v9.16b, \in1\().16b, v15.16b
tbl \in1\().16b, {v16.16b-v19.16b}, \in1\().16b
- sub v10.16b, \in2\().16b, v12.16b
+ sub v10.16b, \in2\().16b, v15.16b
tbl \in2\().16b, {v16.16b-v19.16b}, \in2\().16b
- sub v11.16b, \in3\().16b, v12.16b
+ sub v11.16b, \in3\().16b, v15.16b
tbl \in3\().16b, {v16.16b-v19.16b}, \in3\().16b
tbx \in0\().16b, {v20.16b-v23.16b}, v8.16b
tbx \in1\().16b, {v20.16b-v23.16b}, v9.16b
- sub v8.16b, v8.16b, v12.16b
+ sub v8.16b, v8.16b, v15.16b
tbx \in2\().16b, {v20.16b-v23.16b}, v10.16b
- sub v9.16b, v9.16b, v12.16b
+ sub v9.16b, v9.16b, v15.16b
tbx \in3\().16b, {v20.16b-v23.16b}, v11.16b
- sub v10.16b, v10.16b, v12.16b
+ sub v10.16b, v10.16b, v15.16b
tbx \in0\().16b, {v24.16b-v27.16b}, v8.16b
- sub v11.16b, v11.16b, v12.16b
+ sub v11.16b, v11.16b, v15.16b
tbx \in1\().16b, {v24.16b-v27.16b}, v9.16b
- sub v8.16b, v8.16b, v12.16b
+ sub v8.16b, v8.16b, v15.16b
tbx \in2\().16b, {v24.16b-v27.16b}, v10.16b
- sub v9.16b, v9.16b, v12.16b
+ sub v9.16b, v9.16b, v15.16b
tbx \in3\().16b, {v24.16b-v27.16b}, v11.16b
- sub v10.16b, v10.16b, v12.16b
+ sub v10.16b, v10.16b, v15.16b
tbx \in0\().16b, {v28.16b-v31.16b}, v8.16b
- sub v11.16b, v11.16b, v12.16b
+ sub v11.16b, v11.16b, v15.16b
tbx \in1\().16b, {v28.16b-v31.16b}, v9.16b
tbx \in2\().16b, {v28.16b-v31.16b}, v10.16b
tbx \in3\().16b, {v28.16b-v31.16b}, v11.16b
@@ -174,59 +166,30 @@
eor \out1\().16b, \out1\().16b, \tmp1\().16b
.endm
- .macro mix_columns_2x, in0, in1
- mul_by_x_2x v8, v9, \in0, \in1, v10, v11, v14
- rev32 v10.8h, \in0\().8h
- rev32 v11.8h, \in1\().8h
- eor \in0\().16b, v8.16b, \in0\().16b
- eor \in1\().16b, v9.16b, \in1\().16b
- shl v12.4s, v10.4s, #24
- shl v13.4s, v11.4s, #24
- eor v8.16b, v8.16b, v10.16b
- sri v12.4s, v10.4s, #8
- shl v10.4s, \in0\().4s, #24
- eor v9.16b, v9.16b, v11.16b
- sri v13.4s, v11.4s, #8
- shl v11.4s, \in1\().4s, #24
- sri v10.4s, \in0\().4s, #8
- eor \in0\().16b, v8.16b, v12.16b
- sri v11.4s, \in1\().4s, #8
- eor \in1\().16b, v9.16b, v13.16b
- eor \in0\().16b, v10.16b, \in0\().16b
- eor \in1\().16b, v11.16b, \in1\().16b
- .endm
-
- .macro inv_mix_cols_2x, in0, in1
- mul_by_x_2x v8, v9, \in0, \in1, v10, v11, v14
- mul_by_x_2x v8, v9, v8, v9, v10, v11, v14
+ .macro mix_columns_2x, in0, in1, enc
+ .if \enc == 0
+ /* Inverse MixColumns: pre-multiply by { 5, 0, 4, 0 } */
+ mul_by_x_2x v8, v9, \in0, \in1, v10, v11, v12
+ mul_by_x_2x v8, v9, v8, v9, v10, v11, v12
eor \in0\().16b, \in0\().16b, v8.16b
- eor \in1\().16b, \in1\().16b, v9.16b
rev32 v8.8h, v8.8h
- rev32 v9.8h, v9.8h
- eor \in0\().16b, \in0\().16b, v8.16b
eor \in1\().16b, \in1\().16b, v9.16b
- mix_columns_2x \in0, \in1
- .endm
-
- .macro inv_mix_cols_4x, in0, in1, in2, in3
- mul_by_x_2x v8, v9, \in0, \in1, v10, v11, v14
- mul_by_x_2x v10, v11, \in2, \in3, v12, v13, v14
- mul_by_x_2x v8, v9, v8, v9, v12, v13, v14
- mul_by_x_2x v10, v11, v10, v11, v12, v13, v14
- eor \in0\().16b, \in0\().16b, v8.16b
- eor \in1\().16b, \in1\().16b, v9.16b
- eor \in2\().16b, \in2\().16b, v10.16b
- eor \in3\().16b, \in3\().16b, v11.16b
- rev32 v8.8h, v8.8h
rev32 v9.8h, v9.8h
- rev32 v10.8h, v10.8h
- rev32 v11.8h, v11.8h
eor \in0\().16b, \in0\().16b, v8.16b
eor \in1\().16b, \in1\().16b, v9.16b
- eor \in2\().16b, \in2\().16b, v10.16b
- eor \in3\().16b, \in3\().16b, v11.16b
- mix_columns_2x \in0, \in1
- mix_columns_2x \in2, \in3
+ .endif
+
+ mul_by_x_2x v8, v9, \in0, \in1, v10, v11, v12
+ rev32 v10.8h, \in0\().8h
+ rev32 v11.8h, \in1\().8h
+ eor v10.16b, v10.16b, v8.16b
+ eor v11.16b, v11.16b, v9.16b
+ eor \in0\().16b, \in0\().16b, v10.16b
+ eor \in1\().16b, \in1\().16b, v11.16b
+ tbl \in0\().16b, {\in0\().16b}, v14.16b
+ tbl \in1\().16b, {\in1\().16b}, v14.16b
+ eor \in0\().16b, \in0\().16b, v10.16b
+ eor \in1\().16b, \in1\().16b, v11.16b
.endm
.macro do_block_2x, enc, in0, in1 rounds, rk, rkp, i
@@ -235,20 +198,14 @@
mov \i, \rounds
1111: eor \in0\().16b, \in0\().16b, v15.16b /* ^round key */
eor \in1\().16b, \in1\().16b, v15.16b /* ^round key */
- sub_bytes_2x \in0, \in1
+ movi v15.16b, #0x40
tbl \in0\().16b, {\in0\().16b}, v13.16b /* ShiftRows */
tbl \in1\().16b, {\in1\().16b}, v13.16b /* ShiftRows */
- ld1 {v15.4s}, [\rkp], #16
+ sub_bytes_2x \in0, \in1
subs \i, \i, #1
+ ld1 {v15.4s}, [\rkp], #16
beq 2222f
- .if \enc == 1
- mix_columns_2x \in0, \in1
- ldr q13, .LForward_ShiftRows
- .else
- inv_mix_cols_2x \in0, \in1
- ldr q13, .LReverse_ShiftRows
- .endif
- movi v12.16b, #0x40
+ mix_columns_2x \in0, \in1, \enc
b 1111b
2222: eor \in0\().16b, \in0\().16b, v15.16b /* ^round key */
eor \in1\().16b, \in1\().16b, v15.16b /* ^round key */
@@ -262,23 +219,17 @@
eor \in1\().16b, \in1\().16b, v15.16b /* ^round key */
eor \in2\().16b, \in2\().16b, v15.16b /* ^round key */
eor \in3\().16b, \in3\().16b, v15.16b /* ^round key */
- sub_bytes_4x \in0, \in1, \in2, \in3
+ movi v15.16b, #0x40
tbl \in0\().16b, {\in0\().16b}, v13.16b /* ShiftRows */
tbl \in1\().16b, {\in1\().16b}, v13.16b /* ShiftRows */
tbl \in2\().16b, {\in2\().16b}, v13.16b /* ShiftRows */
tbl \in3\().16b, {\in3\().16b}, v13.16b /* ShiftRows */
- ld1 {v15.4s}, [\rkp], #16
+ sub_bytes_4x \in0, \in1, \in2, \in3
subs \i, \i, #1
+ ld1 {v15.4s}, [\rkp], #16
beq 2222f
- .if \enc == 1
- mix_columns_2x \in0, \in1
- mix_columns_2x \in2, \in3
- ldr q13, .LForward_ShiftRows
- .else
- inv_mix_cols_4x \in0, \in1, \in2, \in3
- ldr q13, .LReverse_ShiftRows
- .endif
- movi v12.16b, #0x40
+ mix_columns_2x \in0, \in1, \enc
+ mix_columns_2x \in2, \in3, \enc
b 1111b
2222: eor \in0\().16b, \in0\().16b, v15.16b /* ^round key */
eor \in1\().16b, \in1\().16b, v15.16b /* ^round key */
@@ -305,19 +256,7 @@
#include "aes-modes.S"
.text
- .align 4
-.LForward_ShiftRows:
-CPU_LE( .byte 0x0, 0x5, 0xa, 0xf, 0x4, 0x9, 0xe, 0x3 )
-CPU_LE( .byte 0x8, 0xd, 0x2, 0x7, 0xc, 0x1, 0x6, 0xb )
-CPU_BE( .byte 0xb, 0x6, 0x1, 0xc, 0x7, 0x2, 0xd, 0x8 )
-CPU_BE( .byte 0x3, 0xe, 0x9, 0x4, 0xf, 0xa, 0x5, 0x0 )
-
-.LReverse_ShiftRows:
-CPU_LE( .byte 0x0, 0xd, 0xa, 0x7, 0x4, 0x1, 0xe, 0xb )
-CPU_LE( .byte 0x8, 0x5, 0x2, 0xf, 0xc, 0x9, 0x6, 0x3 )
-CPU_BE( .byte 0x3, 0x6, 0x9, 0xc, 0xf, 0x2, 0x5, 0x8 )
-CPU_BE( .byte 0xb, 0xe, 0x1, 0x4, 0x7, 0xa, 0xd, 0x0 )
-
+ .align 6
.LForward_Sbox:
.byte 0x63, 0x7c, 0x77, 0x7b, 0xf2, 0x6b, 0x6f, 0xc5
.byte 0x30, 0x01, 0x67, 0x2b, 0xfe, 0xd7, 0xab, 0x76
@@ -385,3 +324,12 @@ CPU_BE( .byte 0xb, 0xe, 0x1, 0x4, 0x7, 0xa, 0xd, 0x0 )
.byte 0xc8, 0xeb, 0xbb, 0x3c, 0x83, 0x53, 0x99, 0x61
.byte 0x17, 0x2b, 0x04, 0x7e, 0xba, 0x77, 0xd6, 0x26
.byte 0xe1, 0x69, 0x14, 0x63, 0x55, 0x21, 0x0c, 0x7d
+
+.LForward_ShiftRows:
+ .octa 0x0b06010c07020d08030e09040f0a0500
+
+.LReverse_ShiftRows:
+ .octa 0x0306090c0f0205080b0e0104070a0d00
+
+.Lror32by8:
+ .octa 0x0c0f0e0d080b0a090407060500030201
--
2.7.4
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v2 10/10] crypto: arm64/aes - replace scalar fallback with plain NEON fallback
2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
` (8 preceding siblings ...)
2017-01-23 14:05 ` [PATCH v2 09/10] crypto: arm64/aes-neon-blk - tweak performance for low end cores Ard Biesheuvel
@ 2017-01-23 14:05 ` Ard Biesheuvel
9 siblings, 0 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
To: linux-crypto, herbert; +Cc: Ard Biesheuvel
The new bitsliced NEON implementation of AES uses a fallback in two
places: CBC encryption (which is strictly sequential, whereas this
driver can only operate efficiently on 8 blocks at a time), and the
XTS tweak generation, which involves encrypting a single AES block
with a different key schedule.
The plain (i.e., non-bitsliced) NEON code is more suitable as a fallback,
given that it is faster than scalar on low end cores (which is what
the NEON implementations target, since high end cores have dedicated
instructions for AES), and shows similar behavior in terms of D-cache
footprint and sensitivity to cache timing attacks. So switch the fallback
handling to the plain NEON driver.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/Kconfig | 2 +-
arch/arm64/crypto/aes-neonbs-glue.c | 38 ++++++++++++++------
2 files changed, 29 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 5de75c3dcbd4..bed7feddfeed 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -86,7 +86,7 @@ config CRYPTO_AES_ARM64_BS
tristate "AES in ECB/CBC/CTR/XTS modes using bit-sliced NEON algorithm"
depends on KERNEL_MODE_NEON
select CRYPTO_BLKCIPHER
- select CRYPTO_AES_ARM64
+ select CRYPTO_AES_ARM64_NEON_BLK
select CRYPTO_SIMD
endif
diff --git a/arch/arm64/crypto/aes-neonbs-glue.c b/arch/arm64/crypto/aes-neonbs-glue.c
index 323dd76ae5f0..863e436ecf89 100644
--- a/arch/arm64/crypto/aes-neonbs-glue.c
+++ b/arch/arm64/crypto/aes-neonbs-glue.c
@@ -10,7 +10,6 @@
#include <asm/neon.h>
#include <crypto/aes.h>
-#include <crypto/cbc.h>
#include <crypto/internal/simd.h>
#include <crypto/internal/skcipher.h>
#include <crypto/xts.h>
@@ -42,7 +41,12 @@ asmlinkage void aesbs_xts_encrypt(u8 out[], u8 const in[], u8 const rk[],
asmlinkage void aesbs_xts_decrypt(u8 out[], u8 const in[], u8 const rk[],
int rounds, int blocks, u8 iv[]);
-asmlinkage void __aes_arm64_encrypt(u32 *rk, u8 *out, const u8 *in, int rounds);
+/* borrowed from aes-neon-blk.ko */
+asmlinkage void neon_aes_ecb_encrypt(u8 out[], u8 const in[], u32 const rk[],
+ int rounds, int blocks, int first);
+asmlinkage void neon_aes_cbc_encrypt(u8 out[], u8 const in[], u32 const rk[],
+ int rounds, int blocks, u8 iv[],
+ int first);
struct aesbs_ctx {
u8 rk[13 * (8 * AES_BLOCK_SIZE) + 32];
@@ -140,16 +144,28 @@ static int aesbs_cbc_setkey(struct crypto_skcipher *tfm, const u8 *in_key,
return 0;
}
-static void cbc_encrypt_one(struct crypto_skcipher *tfm, const u8 *src, u8 *dst)
+static int cbc_encrypt(struct skcipher_request *req)
{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct aesbs_cbc_ctx *ctx = crypto_skcipher_ctx(tfm);
+ struct skcipher_walk walk;
+ int err, first = 1;
- __aes_arm64_encrypt(ctx->enc, dst, src, ctx->key.rounds);
-}
+ err = skcipher_walk_virt(&walk, req, true);
-static int cbc_encrypt(struct skcipher_request *req)
-{
- return crypto_cbc_encrypt_walk(req, cbc_encrypt_one);
+ kernel_neon_begin();
+ while (walk.nbytes >= AES_BLOCK_SIZE) {
+ unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
+
+ /* fall back to the non-bitsliced NEON implementation */
+ neon_aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
+ ctx->enc, ctx->key.rounds, blocks, walk.iv,
+ first);
+ err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
+ first = 0;
+ }
+ kernel_neon_end();
+ return err;
}
static int cbc_decrypt(struct skcipher_request *req)
@@ -254,9 +270,11 @@ static int __xts_crypt(struct skcipher_request *req,
err = skcipher_walk_virt(&walk, req, true);
- __aes_arm64_encrypt(ctx->twkey, walk.iv, walk.iv, ctx->key.rounds);
-
kernel_neon_begin();
+
+ neon_aes_ecb_encrypt(walk.iv, walk.iv, ctx->twkey,
+ ctx->key.rounds, 1, 1);
+
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
--
2.7.4
^ permalink raw reply related [flat|nested] 11+ messages in thread
end of thread, other threads:[~2017-01-23 14:05 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 01/10] crypto: arm64/aes-neon-bs - honour iv_out requirement in CTR mode Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 02/10] crypto: arm/aes-ce - remove cra_alignmask Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 03/10] crypto: arm/chacha20 " Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 04/10] crypto: arm64/aes-ce-ccm " Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 05/10] crypto: arm64/aes-blk " Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 06/10] crypto: arm64/chacha20 " Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 07/10] crypto: arm64/aes - avoid literals for cross-module symbol references Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 08/10] crypto: arm64/aes - performance tweak Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 09/10] crypto: arm64/aes-neon-blk - tweak performance for low end cores Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 10/10] crypto: arm64/aes - replace scalar fallback with plain NEON fallback Ard Biesheuvel
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).