* [PATCH 0/4] crypto: arm64/aes-blk - cleanups and optimizations for XTS/CTS-CBC
@ 2018-09-10 14:41 ` Ard Biesheuvel
0 siblings, 0 replies; 14+ messages in thread
From: Ard Biesheuvel @ 2018-09-10 14:41 UTC (permalink / raw)
To: linux-crypto
Cc: Ard Biesheuvel, Theodore Ts'o, herbert, Steve Capper,
Eric Biggers, linux-arm-kernel
Some cleanups and optimizations for the arm64 AES skcipher routines.
Patch #1 fixes the peculiar use of u8 arrays to refer to AES round keys,
which are natively arrays of u32.
Patch #2 partially reverts the use of NEON yield calls, which is not
needed for skciphers.
Patch #3 adds support for cts(cbc(aes)) in the NEON chaining mode handling.
Patch #4 tweaks the XTS handling to remove a literal load from the inner
loop.
Cc: Eric Biggers <ebiggers@google.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Steve Capper <steve.capper@arm.com>
Ard Biesheuvel (4):
crypto: arm64/aes-blk - remove pointless (u8 *) casts
crypto: arm64/aes-blk - revert NEON yield for skciphers
crypto: arm64/aes-blk - add support for CTS-CBC mode
crypto: aes/arm64-blk - improve XTS mask handling
arch/arm64/crypto/aes-ce.S | 5 +
arch/arm64/crypto/aes-glue.c | 212 +++++++++--
arch/arm64/crypto/aes-modes.S | 400 ++++++++++----------
arch/arm64/crypto/aes-neon.S | 6 +
4 files changed, 406 insertions(+), 217 deletions(-)
--
2.18.0
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 0/4] crypto: arm64/aes-blk - cleanups and optimizations for XTS/CTS-CBC
@ 2018-09-10 14:41 ` Ard Biesheuvel
0 siblings, 0 replies; 14+ messages in thread
From: Ard Biesheuvel @ 2018-09-10 14:41 UTC (permalink / raw)
To: linux-arm-kernel
Some cleanups and optimizations for the arm64 AES skcipher routines.
Patch #1 fixes the peculiar use of u8 arrays to refer to AES round keys,
which are natively arrays of u32.
Patch #2 partially reverts the use of NEON yield calls, which is not
needed for skciphers.
Patch #3 adds support for cts(cbc(aes)) in the NEON chaining mode handling.
Patch #4 tweaks the XTS handling to remove a literal load from the inner
loop.
Cc: Eric Biggers <ebiggers@google.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Steve Capper <steve.capper@arm.com>
Ard Biesheuvel (4):
crypto: arm64/aes-blk - remove pointless (u8 *) casts
crypto: arm64/aes-blk - revert NEON yield for skciphers
crypto: arm64/aes-blk - add support for CTS-CBC mode
crypto: aes/arm64-blk - improve XTS mask handling
arch/arm64/crypto/aes-ce.S | 5 +
arch/arm64/crypto/aes-glue.c | 212 +++++++++--
arch/arm64/crypto/aes-modes.S | 400 ++++++++++----------
arch/arm64/crypto/aes-neon.S | 6 +
4 files changed, 406 insertions(+), 217 deletions(-)
--
2.18.0
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/4] crypto: arm64/aes-blk - remove pointless (u8 *) casts
2018-09-10 14:41 ` Ard Biesheuvel
@ 2018-09-10 14:41 ` Ard Biesheuvel
-1 siblings, 0 replies; 14+ messages in thread
From: Ard Biesheuvel @ 2018-09-10 14:41 UTC (permalink / raw)
To: linux-crypto
Cc: Ard Biesheuvel, Theodore Ts'o, herbert, Steve Capper,
Eric Biggers, linux-arm-kernel
For some reason, the asmlinkage prototypes of the NEON routines take
u8[] arguments for the round key arrays, while the actual round keys
are arrays of u32, and so passing them into those routines requires
u8* casts at each occurrence. Fix that.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-glue.c | 47 ++++++++++----------
1 file changed, 23 insertions(+), 24 deletions(-)
diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index adcb83eb683c..1c6934544c1f 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -63,24 +63,24 @@ MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
MODULE_LICENSE("GPL v2");
/* defined in aes-modes.S */
-asmlinkage void aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[],
+asmlinkage void aes_ecb_encrypt(u8 out[], u8 const in[], u32 const rk[],
int rounds, int blocks);
-asmlinkage void aes_ecb_decrypt(u8 out[], u8 const in[], u8 const rk[],
+asmlinkage void aes_ecb_decrypt(u8 out[], u8 const in[], u32 const rk[],
int rounds, int blocks);
-asmlinkage void aes_cbc_encrypt(u8 out[], u8 const in[], u8 const rk[],
+asmlinkage void aes_cbc_encrypt(u8 out[], u8 const in[], u32 const rk[],
int rounds, int blocks, u8 iv[]);
-asmlinkage void aes_cbc_decrypt(u8 out[], u8 const in[], u8 const rk[],
+asmlinkage void aes_cbc_decrypt(u8 out[], u8 const in[], u32 const rk[],
int rounds, int blocks, u8 iv[]);
-asmlinkage void aes_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[],
+asmlinkage void aes_ctr_encrypt(u8 out[], u8 const in[], u32 const rk[],
int rounds, int blocks, u8 ctr[]);
-asmlinkage void aes_xts_encrypt(u8 out[], u8 const in[], u8 const rk1[],
- int rounds, int blocks, u8 const rk2[], u8 iv[],
+asmlinkage void aes_xts_encrypt(u8 out[], u8 const in[], u32 const rk1[],
+ int rounds, int blocks, u32 const rk2[], u8 iv[],
int first);
-asmlinkage void aes_xts_decrypt(u8 out[], u8 const in[], u8 const rk1[],
- int rounds, int blocks, u8 const rk2[], u8 iv[],
+asmlinkage void aes_xts_decrypt(u8 out[], u8 const in[], u32 const rk1[],
+ int rounds, int blocks, u32 const rk2[], u8 iv[],
int first);
asmlinkage void aes_mac_update(u8 const in[], u32 const rk[], int rounds,
@@ -142,7 +142,7 @@ static int ecb_encrypt(struct skcipher_request *req)
while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
kernel_neon_begin();
aes_ecb_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_enc, rounds, blocks);
+ ctx->key_enc, rounds, blocks);
kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
@@ -162,7 +162,7 @@ static int ecb_decrypt(struct skcipher_request *req)
while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
kernel_neon_begin();
aes_ecb_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_dec, rounds, blocks);
+ ctx->key_dec, rounds, blocks);
kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
@@ -182,7 +182,7 @@ static int cbc_encrypt(struct skcipher_request *req)
while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
kernel_neon_begin();
aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_enc, rounds, blocks, walk.iv);
+ ctx->key_enc, rounds, blocks, walk.iv);
kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
@@ -202,7 +202,7 @@ static int cbc_decrypt(struct skcipher_request *req)
while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
kernel_neon_begin();
aes_cbc_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_dec, rounds, blocks, walk.iv);
+ ctx->key_dec, rounds, blocks, walk.iv);
kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
@@ -222,7 +222,7 @@ static int ctr_encrypt(struct skcipher_request *req)
while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
kernel_neon_begin();
aes_ctr_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_enc, rounds, blocks, walk.iv);
+ ctx->key_enc, rounds, blocks, walk.iv);
kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
@@ -238,7 +238,7 @@ static int ctr_encrypt(struct skcipher_request *req)
blocks = -1;
kernel_neon_begin();
- aes_ctr_encrypt(tail, NULL, (u8 *)ctx->key_enc, rounds,
+ aes_ctr_encrypt(tail, NULL, ctx->key_enc, rounds,
blocks, walk.iv);
kernel_neon_end();
crypto_xor_cpy(tdst, tsrc, tail, nbytes);
@@ -272,8 +272,8 @@ static int xts_encrypt(struct skcipher_request *req)
for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
kernel_neon_begin();
aes_xts_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key1.key_enc, rounds, blocks,
- (u8 *)ctx->key2.key_enc, walk.iv, first);
+ ctx->key1.key_enc, rounds, blocks,
+ ctx->key2.key_enc, walk.iv, first);
kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
@@ -294,8 +294,8 @@ static int xts_decrypt(struct skcipher_request *req)
for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
kernel_neon_begin();
aes_xts_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key1.key_dec, rounds, blocks,
- (u8 *)ctx->key2.key_enc, walk.iv, first);
+ ctx->key1.key_dec, rounds, blocks,
+ ctx->key2.key_enc, walk.iv, first);
kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
@@ -412,7 +412,6 @@ static int cmac_setkey(struct crypto_shash *tfm, const u8 *in_key,
{
struct mac_tfm_ctx *ctx = crypto_shash_ctx(tfm);
be128 *consts = (be128 *)ctx->consts;
- u8 *rk = (u8 *)ctx->key.key_enc;
int rounds = 6 + key_len / 4;
int err;
@@ -422,7 +421,8 @@ static int cmac_setkey(struct crypto_shash *tfm, const u8 *in_key,
/* encrypt the zero vector */
kernel_neon_begin();
- aes_ecb_encrypt(ctx->consts, (u8[AES_BLOCK_SIZE]){}, rk, rounds, 1);
+ aes_ecb_encrypt(ctx->consts, (u8[AES_BLOCK_SIZE]){}, ctx->key.key_enc,
+ rounds, 1);
kernel_neon_end();
cmac_gf128_mul_by_x(consts, consts);
@@ -441,7 +441,6 @@ static int xcbc_setkey(struct crypto_shash *tfm, const u8 *in_key,
};
struct mac_tfm_ctx *ctx = crypto_shash_ctx(tfm);
- u8 *rk = (u8 *)ctx->key.key_enc;
int rounds = 6 + key_len / 4;
u8 key[AES_BLOCK_SIZE];
int err;
@@ -451,8 +450,8 @@ static int xcbc_setkey(struct crypto_shash *tfm, const u8 *in_key,
return err;
kernel_neon_begin();
- aes_ecb_encrypt(key, ks[0], rk, rounds, 1);
- aes_ecb_encrypt(ctx->consts, ks[1], rk, rounds, 2);
+ aes_ecb_encrypt(key, ks[0], ctx->key.key_enc, rounds, 1);
+ aes_ecb_encrypt(ctx->consts, ks[1], ctx->key.key_enc, rounds, 2);
kernel_neon_end();
return cbcmac_setkey(tfm, key, sizeof(key));
--
2.18.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 1/4] crypto: arm64/aes-blk - remove pointless (u8 *) casts
@ 2018-09-10 14:41 ` Ard Biesheuvel
0 siblings, 0 replies; 14+ messages in thread
From: Ard Biesheuvel @ 2018-09-10 14:41 UTC (permalink / raw)
To: linux-arm-kernel
For some reason, the asmlinkage prototypes of the NEON routines take
u8[] arguments for the round key arrays, while the actual round keys
are arrays of u32, and so passing them into those routines requires
u8* casts at each occurrence. Fix that.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-glue.c | 47 ++++++++++----------
1 file changed, 23 insertions(+), 24 deletions(-)
diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index adcb83eb683c..1c6934544c1f 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -63,24 +63,24 @@ MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
MODULE_LICENSE("GPL v2");
/* defined in aes-modes.S */
-asmlinkage void aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[],
+asmlinkage void aes_ecb_encrypt(u8 out[], u8 const in[], u32 const rk[],
int rounds, int blocks);
-asmlinkage void aes_ecb_decrypt(u8 out[], u8 const in[], u8 const rk[],
+asmlinkage void aes_ecb_decrypt(u8 out[], u8 const in[], u32 const rk[],
int rounds, int blocks);
-asmlinkage void aes_cbc_encrypt(u8 out[], u8 const in[], u8 const rk[],
+asmlinkage void aes_cbc_encrypt(u8 out[], u8 const in[], u32 const rk[],
int rounds, int blocks, u8 iv[]);
-asmlinkage void aes_cbc_decrypt(u8 out[], u8 const in[], u8 const rk[],
+asmlinkage void aes_cbc_decrypt(u8 out[], u8 const in[], u32 const rk[],
int rounds, int blocks, u8 iv[]);
-asmlinkage void aes_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[],
+asmlinkage void aes_ctr_encrypt(u8 out[], u8 const in[], u32 const rk[],
int rounds, int blocks, u8 ctr[]);
-asmlinkage void aes_xts_encrypt(u8 out[], u8 const in[], u8 const rk1[],
- int rounds, int blocks, u8 const rk2[], u8 iv[],
+asmlinkage void aes_xts_encrypt(u8 out[], u8 const in[], u32 const rk1[],
+ int rounds, int blocks, u32 const rk2[], u8 iv[],
int first);
-asmlinkage void aes_xts_decrypt(u8 out[], u8 const in[], u8 const rk1[],
- int rounds, int blocks, u8 const rk2[], u8 iv[],
+asmlinkage void aes_xts_decrypt(u8 out[], u8 const in[], u32 const rk1[],
+ int rounds, int blocks, u32 const rk2[], u8 iv[],
int first);
asmlinkage void aes_mac_update(u8 const in[], u32 const rk[], int rounds,
@@ -142,7 +142,7 @@ static int ecb_encrypt(struct skcipher_request *req)
while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
kernel_neon_begin();
aes_ecb_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_enc, rounds, blocks);
+ ctx->key_enc, rounds, blocks);
kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
@@ -162,7 +162,7 @@ static int ecb_decrypt(struct skcipher_request *req)
while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
kernel_neon_begin();
aes_ecb_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_dec, rounds, blocks);
+ ctx->key_dec, rounds, blocks);
kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
@@ -182,7 +182,7 @@ static int cbc_encrypt(struct skcipher_request *req)
while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
kernel_neon_begin();
aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_enc, rounds, blocks, walk.iv);
+ ctx->key_enc, rounds, blocks, walk.iv);
kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
@@ -202,7 +202,7 @@ static int cbc_decrypt(struct skcipher_request *req)
while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
kernel_neon_begin();
aes_cbc_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_dec, rounds, blocks, walk.iv);
+ ctx->key_dec, rounds, blocks, walk.iv);
kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
@@ -222,7 +222,7 @@ static int ctr_encrypt(struct skcipher_request *req)
while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
kernel_neon_begin();
aes_ctr_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_enc, rounds, blocks, walk.iv);
+ ctx->key_enc, rounds, blocks, walk.iv);
kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
@@ -238,7 +238,7 @@ static int ctr_encrypt(struct skcipher_request *req)
blocks = -1;
kernel_neon_begin();
- aes_ctr_encrypt(tail, NULL, (u8 *)ctx->key_enc, rounds,
+ aes_ctr_encrypt(tail, NULL, ctx->key_enc, rounds,
blocks, walk.iv);
kernel_neon_end();
crypto_xor_cpy(tdst, tsrc, tail, nbytes);
@@ -272,8 +272,8 @@ static int xts_encrypt(struct skcipher_request *req)
for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
kernel_neon_begin();
aes_xts_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key1.key_enc, rounds, blocks,
- (u8 *)ctx->key2.key_enc, walk.iv, first);
+ ctx->key1.key_enc, rounds, blocks,
+ ctx->key2.key_enc, walk.iv, first);
kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
@@ -294,8 +294,8 @@ static int xts_decrypt(struct skcipher_request *req)
for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
kernel_neon_begin();
aes_xts_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key1.key_dec, rounds, blocks,
- (u8 *)ctx->key2.key_enc, walk.iv, first);
+ ctx->key1.key_dec, rounds, blocks,
+ ctx->key2.key_enc, walk.iv, first);
kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
@@ -412,7 +412,6 @@ static int cmac_setkey(struct crypto_shash *tfm, const u8 *in_key,
{
struct mac_tfm_ctx *ctx = crypto_shash_ctx(tfm);
be128 *consts = (be128 *)ctx->consts;
- u8 *rk = (u8 *)ctx->key.key_enc;
int rounds = 6 + key_len / 4;
int err;
@@ -422,7 +421,8 @@ static int cmac_setkey(struct crypto_shash *tfm, const u8 *in_key,
/* encrypt the zero vector */
kernel_neon_begin();
- aes_ecb_encrypt(ctx->consts, (u8[AES_BLOCK_SIZE]){}, rk, rounds, 1);
+ aes_ecb_encrypt(ctx->consts, (u8[AES_BLOCK_SIZE]){}, ctx->key.key_enc,
+ rounds, 1);
kernel_neon_end();
cmac_gf128_mul_by_x(consts, consts);
@@ -441,7 +441,6 @@ static int xcbc_setkey(struct crypto_shash *tfm, const u8 *in_key,
};
struct mac_tfm_ctx *ctx = crypto_shash_ctx(tfm);
- u8 *rk = (u8 *)ctx->key.key_enc;
int rounds = 6 + key_len / 4;
u8 key[AES_BLOCK_SIZE];
int err;
@@ -451,8 +450,8 @@ static int xcbc_setkey(struct crypto_shash *tfm, const u8 *in_key,
return err;
kernel_neon_begin();
- aes_ecb_encrypt(key, ks[0], rk, rounds, 1);
- aes_ecb_encrypt(ctx->consts, ks[1], rk, rounds, 2);
+ aes_ecb_encrypt(key, ks[0], ctx->key.key_enc, rounds, 1);
+ aes_ecb_encrypt(ctx->consts, ks[1], ctx->key.key_enc, rounds, 2);
kernel_neon_end();
return cbcmac_setkey(tfm, key, sizeof(key));
--
2.18.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 2/4] crypto: arm64/aes-blk - revert NEON yield for skciphers
2018-09-10 14:41 ` Ard Biesheuvel
@ 2018-09-10 14:41 ` Ard Biesheuvel
-1 siblings, 0 replies; 14+ messages in thread
From: Ard Biesheuvel @ 2018-09-10 14:41 UTC (permalink / raw)
To: linux-crypto
Cc: Ard Biesheuvel, Theodore Ts'o, herbert, Steve Capper,
Eric Biggers, linux-arm-kernel
The reasoning of commit f10dc56c64bb ("crypto: arm64 - revert NEON yield
for fast AEAD implementations") applies equally to skciphers: the walk
API already guarantees that the input size of each call into the NEON
code is bounded to the size of a page, and so there is no need for an
additional TIF_NEED_RESCHED flag check inside the inner loop. So revert
the skcipher changes to aes-modes.S (but retain the mac ones)
This partially reverts commit 0c8f838a52fe9fd82761861a934f16ef9896b4e5.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-modes.S | 281 ++++++++------------
1 file changed, 108 insertions(+), 173 deletions(-)
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 496c243de4ac..35632d11200f 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -14,12 +14,12 @@
.align 4
aes_encrypt_block4x:
- encrypt_block4x v0, v1, v2, v3, w22, x21, x8, w7
+ encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
ret
ENDPROC(aes_encrypt_block4x)
aes_decrypt_block4x:
- decrypt_block4x v0, v1, v2, v3, w22, x21, x8, w7
+ decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
ret
ENDPROC(aes_decrypt_block4x)
@@ -31,71 +31,57 @@ ENDPROC(aes_decrypt_block4x)
*/
AES_ENTRY(aes_ecb_encrypt)
- frame_push 5
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
- mov x19, x0
- mov x20, x1
- mov x21, x2
- mov x22, x3
- mov x23, x4
-
-.Lecbencrestart:
- enc_prepare w22, x21, x5
+ enc_prepare w3, x2, x5
.LecbencloopNx:
- subs w23, w23, #4
+ subs w4, w4, #4
bmi .Lecbenc1x
- ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 pt blocks */
+ ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
bl aes_encrypt_block4x
- st1 {v0.16b-v3.16b}, [x19], #64
- cond_yield_neon .Lecbencrestart
+ st1 {v0.16b-v3.16b}, [x0], #64
b .LecbencloopNx
.Lecbenc1x:
- adds w23, w23, #4
+ adds w4, w4, #4
beq .Lecbencout
.Lecbencloop:
- ld1 {v0.16b}, [x20], #16 /* get next pt block */
- encrypt_block v0, w22, x21, x5, w6
- st1 {v0.16b}, [x19], #16
- subs w23, w23, #1
+ ld1 {v0.16b}, [x1], #16 /* get next pt block */
+ encrypt_block v0, w3, x2, x5, w6
+ st1 {v0.16b}, [x0], #16
+ subs w4, w4, #1
bne .Lecbencloop
.Lecbencout:
- frame_pop
+ ldp x29, x30, [sp], #16
ret
AES_ENDPROC(aes_ecb_encrypt)
AES_ENTRY(aes_ecb_decrypt)
- frame_push 5
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
- mov x19, x0
- mov x20, x1
- mov x21, x2
- mov x22, x3
- mov x23, x4
-
-.Lecbdecrestart:
- dec_prepare w22, x21, x5
+ dec_prepare w3, x2, x5
.LecbdecloopNx:
- subs w23, w23, #4
+ subs w4, w4, #4
bmi .Lecbdec1x
- ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 ct blocks */
+ ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
bl aes_decrypt_block4x
- st1 {v0.16b-v3.16b}, [x19], #64
- cond_yield_neon .Lecbdecrestart
+ st1 {v0.16b-v3.16b}, [x0], #64
b .LecbdecloopNx
.Lecbdec1x:
- adds w23, w23, #4
+ adds w4, w4, #4
beq .Lecbdecout
.Lecbdecloop:
- ld1 {v0.16b}, [x20], #16 /* get next ct block */
- decrypt_block v0, w22, x21, x5, w6
- st1 {v0.16b}, [x19], #16
- subs w23, w23, #1
+ ld1 {v0.16b}, [x1], #16 /* get next ct block */
+ decrypt_block v0, w3, x2, x5, w6
+ st1 {v0.16b}, [x0], #16
+ subs w4, w4, #1
bne .Lecbdecloop
.Lecbdecout:
- frame_pop
+ ldp x29, x30, [sp], #16
ret
AES_ENDPROC(aes_ecb_decrypt)
@@ -108,100 +94,78 @@ AES_ENDPROC(aes_ecb_decrypt)
*/
AES_ENTRY(aes_cbc_encrypt)
- frame_push 6
-
- mov x19, x0
- mov x20, x1
- mov x21, x2
- mov x22, x3
- mov x23, x4
- mov x24, x5
-
-.Lcbcencrestart:
- ld1 {v4.16b}, [x24] /* get iv */
- enc_prepare w22, x21, x6
+ ld1 {v4.16b}, [x5] /* get iv */
+ enc_prepare w3, x2, x6
.Lcbcencloop4x:
- subs w23, w23, #4
+ subs w4, w4, #4
bmi .Lcbcenc1x
- ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 pt blocks */
+ ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
eor v0.16b, v0.16b, v4.16b /* ..and xor with iv */
- encrypt_block v0, w22, x21, x6, w7
+ encrypt_block v0, w3, x2, x6, w7
eor v1.16b, v1.16b, v0.16b
- encrypt_block v1, w22, x21, x6, w7
+ encrypt_block v1, w3, x2, x6, w7
eor v2.16b, v2.16b, v1.16b
- encrypt_block v2, w22, x21, x6, w7
+ encrypt_block v2, w3, x2, x6, w7
eor v3.16b, v3.16b, v2.16b
- encrypt_block v3, w22, x21, x6, w7
- st1 {v0.16b-v3.16b}, [x19], #64
+ encrypt_block v3, w3, x2, x6, w7
+ st1 {v0.16b-v3.16b}, [x0], #64
mov v4.16b, v3.16b
- st1 {v4.16b}, [x24] /* return iv */
- cond_yield_neon .Lcbcencrestart
b .Lcbcencloop4x
.Lcbcenc1x:
- adds w23, w23, #4
+ adds w4, w4, #4
beq .Lcbcencout
.Lcbcencloop:
- ld1 {v0.16b}, [x20], #16 /* get next pt block */
+ ld1 {v0.16b}, [x1], #16 /* get next pt block */
eor v4.16b, v4.16b, v0.16b /* ..and xor with iv */
- encrypt_block v4, w22, x21, x6, w7
- st1 {v4.16b}, [x19], #16
- subs w23, w23, #1
+ encrypt_block v4, w3, x2, x6, w7
+ st1 {v4.16b}, [x0], #16
+ subs w4, w4, #1
bne .Lcbcencloop
.Lcbcencout:
- st1 {v4.16b}, [x24] /* return iv */
- frame_pop
+ st1 {v4.16b}, [x5] /* return iv */
ret
AES_ENDPROC(aes_cbc_encrypt)
AES_ENTRY(aes_cbc_decrypt)
- frame_push 6
-
- mov x19, x0
- mov x20, x1
- mov x21, x2
- mov x22, x3
- mov x23, x4
- mov x24, x5
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
-.Lcbcdecrestart:
- ld1 {v7.16b}, [x24] /* get iv */
- dec_prepare w22, x21, x6
+ ld1 {v7.16b}, [x5] /* get iv */
+ dec_prepare w3, x2, x6
.LcbcdecloopNx:
- subs w23, w23, #4
+ subs w4, w4, #4
bmi .Lcbcdec1x
- ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 ct blocks */
+ ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
mov v4.16b, v0.16b
mov v5.16b, v1.16b
mov v6.16b, v2.16b
bl aes_decrypt_block4x
- sub x20, x20, #16
+ sub x1, x1, #16
eor v0.16b, v0.16b, v7.16b
eor v1.16b, v1.16b, v4.16b
- ld1 {v7.16b}, [x20], #16 /* reload 1 ct block */
+ ld1 {v7.16b}, [x1], #16 /* reload 1 ct block */
eor v2.16b, v2.16b, v5.16b
eor v3.16b, v3.16b, v6.16b
- st1 {v0.16b-v3.16b}, [x19], #64
- st1 {v7.16b}, [x24] /* return iv */
- cond_yield_neon .Lcbcdecrestart
+ st1 {v0.16b-v3.16b}, [x0], #64
b .LcbcdecloopNx
.Lcbcdec1x:
- adds w23, w23, #4
+ adds w4, w4, #4
beq .Lcbcdecout
.Lcbcdecloop:
- ld1 {v1.16b}, [x20], #16 /* get next ct block */
+ ld1 {v1.16b}, [x1], #16 /* get next ct block */
mov v0.16b, v1.16b /* ...and copy to v0 */
- decrypt_block v0, w22, x21, x6, w7
+ decrypt_block v0, w3, x2, x6, w7
eor v0.16b, v0.16b, v7.16b /* xor with iv => pt */
mov v7.16b, v1.16b /* ct is next iv */
- st1 {v0.16b}, [x19], #16
- subs w23, w23, #1
+ st1 {v0.16b}, [x0], #16
+ subs w4, w4, #1
bne .Lcbcdecloop
.Lcbcdecout:
- st1 {v7.16b}, [x24] /* return iv */
- frame_pop
+ st1 {v7.16b}, [x5] /* return iv */
+ ldp x29, x30, [sp], #16
ret
AES_ENDPROC(aes_cbc_decrypt)
@@ -212,26 +176,19 @@ AES_ENDPROC(aes_cbc_decrypt)
*/
AES_ENTRY(aes_ctr_encrypt)
- frame_push 6
-
- mov x19, x0
- mov x20, x1
- mov x21, x2
- mov x22, x3
- mov x23, x4
- mov x24, x5
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
-.Lctrrestart:
- enc_prepare w22, x21, x6
- ld1 {v4.16b}, [x24]
+ enc_prepare w3, x2, x6
+ ld1 {v4.16b}, [x5]
umov x6, v4.d[1] /* keep swabbed ctr in reg */
rev x6, x6
+ cmn w6, w4 /* 32 bit overflow? */
+ bcs .Lctrloop
.LctrloopNx:
- subs w23, w23, #4
+ subs w4, w4, #4
bmi .Lctr1x
- cmn w6, #4 /* 32 bit overflow? */
- bcs .Lctr1x
add w7, w6, #1
mov v0.16b, v4.16b
add w8, w6, #2
@@ -245,27 +202,25 @@ AES_ENTRY(aes_ctr_encrypt)
rev w9, w9
mov v2.s[3], w8
mov v3.s[3], w9
- ld1 {v5.16b-v7.16b}, [x20], #48 /* get 3 input blocks */
+ ld1 {v5.16b-v7.16b}, [x1], #48 /* get 3 input blocks */
bl aes_encrypt_block4x
eor v0.16b, v5.16b, v0.16b
- ld1 {v5.16b}, [x20], #16 /* get 1 input block */
+ ld1 {v5.16b}, [x1], #16 /* get 1 input block */
eor v1.16b, v6.16b, v1.16b
eor v2.16b, v7.16b, v2.16b
eor v3.16b, v5.16b, v3.16b
- st1 {v0.16b-v3.16b}, [x19], #64
+ st1 {v0.16b-v3.16b}, [x0], #64
add x6, x6, #4
rev x7, x6
ins v4.d[1], x7
- cbz w23, .Lctrout
- st1 {v4.16b}, [x24] /* return next CTR value */
- cond_yield_neon .Lctrrestart
+ cbz w4, .Lctrout
b .LctrloopNx
.Lctr1x:
- adds w23, w23, #4
+ adds w4, w4, #4
beq .Lctrout
.Lctrloop:
mov v0.16b, v4.16b
- encrypt_block v0, w22, x21, x8, w7
+ encrypt_block v0, w3, x2, x8, w7
adds x6, x6, #1 /* increment BE ctr */
rev x7, x6
@@ -273,22 +228,22 @@ AES_ENTRY(aes_ctr_encrypt)
bcs .Lctrcarry /* overflow? */
.Lctrcarrydone:
- subs w23, w23, #1
+ subs w4, w4, #1
bmi .Lctrtailblock /* blocks <0 means tail block */
- ld1 {v3.16b}, [x20], #16
+ ld1 {v3.16b}, [x1], #16
eor v3.16b, v0.16b, v3.16b
- st1 {v3.16b}, [x19], #16
+ st1 {v3.16b}, [x0], #16
bne .Lctrloop
.Lctrout:
- st1 {v4.16b}, [x24] /* return next CTR value */
-.Lctrret:
- frame_pop
+ st1 {v4.16b}, [x5] /* return next CTR value */
+ ldp x29, x30, [sp], #16
ret
.Lctrtailblock:
- st1 {v0.16b}, [x19]
- b .Lctrret
+ st1 {v0.16b}, [x0]
+ ldp x29, x30, [sp], #16
+ ret
.Lctrcarry:
umov x7, v4.d[0] /* load upper word of ctr */
@@ -321,16 +276,10 @@ CPU_LE( .quad 1, 0x87 )
CPU_BE( .quad 0x87, 1 )
AES_ENTRY(aes_xts_encrypt)
- frame_push 6
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
- mov x19, x0
- mov x20, x1
- mov x21, x2
- mov x22, x3
- mov x23, x4
- mov x24, x6
-
- ld1 {v4.16b}, [x24]
+ ld1 {v4.16b}, [x6]
cbz w7, .Lxtsencnotfirst
enc_prepare w3, x5, x8
@@ -339,17 +288,15 @@ AES_ENTRY(aes_xts_encrypt)
ldr q7, .Lxts_mul_x
b .LxtsencNx
-.Lxtsencrestart:
- ld1 {v4.16b}, [x24]
.Lxtsencnotfirst:
- enc_prepare w22, x21, x8
+ enc_prepare w3, x2, x8
.LxtsencloopNx:
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
.LxtsencNx:
- subs w23, w23, #4
+ subs w4, w4, #4
bmi .Lxtsenc1x
- ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 pt blocks */
+ ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
next_tweak v5, v4, v7, v8
eor v0.16b, v0.16b, v4.16b
next_tweak v6, v5, v7, v8
@@ -362,43 +309,35 @@ AES_ENTRY(aes_xts_encrypt)
eor v0.16b, v0.16b, v4.16b
eor v1.16b, v1.16b, v5.16b
eor v2.16b, v2.16b, v6.16b
- st1 {v0.16b-v3.16b}, [x19], #64
+ st1 {v0.16b-v3.16b}, [x0], #64
mov v4.16b, v7.16b
- cbz w23, .Lxtsencout
- st1 {v4.16b}, [x24]
- cond_yield_neon .Lxtsencrestart
+ cbz w4, .Lxtsencout
b .LxtsencloopNx
.Lxtsenc1x:
- adds w23, w23, #4
+ adds w4, w4, #4
beq .Lxtsencout
.Lxtsencloop:
- ld1 {v1.16b}, [x20], #16
+ ld1 {v1.16b}, [x1], #16
eor v0.16b, v1.16b, v4.16b
- encrypt_block v0, w22, x21, x8, w7
+ encrypt_block v0, w3, x2, x8, w7
eor v0.16b, v0.16b, v4.16b
- st1 {v0.16b}, [x19], #16
- subs w23, w23, #1
+ st1 {v0.16b}, [x0], #16
+ subs w4, w4, #1
beq .Lxtsencout
next_tweak v4, v4, v7, v8
b .Lxtsencloop
.Lxtsencout:
- st1 {v4.16b}, [x24]
- frame_pop
+ st1 {v4.16b}, [x6]
+ ldp x29, x30, [sp], #16
ret
AES_ENDPROC(aes_xts_encrypt)
AES_ENTRY(aes_xts_decrypt)
- frame_push 6
-
- mov x19, x0
- mov x20, x1
- mov x21, x2
- mov x22, x3
- mov x23, x4
- mov x24, x6
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
- ld1 {v4.16b}, [x24]
+ ld1 {v4.16b}, [x6]
cbz w7, .Lxtsdecnotfirst
enc_prepare w3, x5, x8
@@ -407,17 +346,15 @@ AES_ENTRY(aes_xts_decrypt)
ldr q7, .Lxts_mul_x
b .LxtsdecNx
-.Lxtsdecrestart:
- ld1 {v4.16b}, [x24]
.Lxtsdecnotfirst:
- dec_prepare w22, x21, x8
+ dec_prepare w3, x2, x8
.LxtsdecloopNx:
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
.LxtsdecNx:
- subs w23, w23, #4
+ subs w4, w4, #4
bmi .Lxtsdec1x
- ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 ct blocks */
+ ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
next_tweak v5, v4, v7, v8
eor v0.16b, v0.16b, v4.16b
next_tweak v6, v5, v7, v8
@@ -430,28 +367,26 @@ AES_ENTRY(aes_xts_decrypt)
eor v0.16b, v0.16b, v4.16b
eor v1.16b, v1.16b, v5.16b
eor v2.16b, v2.16b, v6.16b
- st1 {v0.16b-v3.16b}, [x19], #64
+ st1 {v0.16b-v3.16b}, [x0], #64
mov v4.16b, v7.16b
- cbz w23, .Lxtsdecout
- st1 {v4.16b}, [x24]
- cond_yield_neon .Lxtsdecrestart
+ cbz w4, .Lxtsdecout
b .LxtsdecloopNx
.Lxtsdec1x:
- adds w23, w23, #4
+ adds w4, w4, #4
beq .Lxtsdecout
.Lxtsdecloop:
- ld1 {v1.16b}, [x20], #16
+ ld1 {v1.16b}, [x1], #16
eor v0.16b, v1.16b, v4.16b
- decrypt_block v0, w22, x21, x8, w7
+ decrypt_block v0, w3, x2, x8, w7
eor v0.16b, v0.16b, v4.16b
- st1 {v0.16b}, [x19], #16
- subs w23, w23, #1
+ st1 {v0.16b}, [x0], #16
+ subs w4, w4, #1
beq .Lxtsdecout
next_tweak v4, v4, v7, v8
b .Lxtsdecloop
.Lxtsdecout:
- st1 {v4.16b}, [x24]
- frame_pop
+ st1 {v4.16b}, [x6]
+ ldp x29, x30, [sp], #16
ret
AES_ENDPROC(aes_xts_decrypt)
--
2.18.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 2/4] crypto: arm64/aes-blk - revert NEON yield for skciphers
@ 2018-09-10 14:41 ` Ard Biesheuvel
0 siblings, 0 replies; 14+ messages in thread
From: Ard Biesheuvel @ 2018-09-10 14:41 UTC (permalink / raw)
To: linux-arm-kernel
The reasoning of commit f10dc56c64bb ("crypto: arm64 - revert NEON yield
for fast AEAD implementations") applies equally to skciphers: the walk
API already guarantees that the input size of each call into the NEON
code is bounded to the size of a page, and so there is no need for an
additional TIF_NEED_RESCHED flag check inside the inner loop. So revert
the skcipher changes to aes-modes.S (but retain the mac ones)
This partially reverts commit 0c8f838a52fe9fd82761861a934f16ef9896b4e5.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-modes.S | 281 ++++++++------------
1 file changed, 108 insertions(+), 173 deletions(-)
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 496c243de4ac..35632d11200f 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -14,12 +14,12 @@
.align 4
aes_encrypt_block4x:
- encrypt_block4x v0, v1, v2, v3, w22, x21, x8, w7
+ encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
ret
ENDPROC(aes_encrypt_block4x)
aes_decrypt_block4x:
- decrypt_block4x v0, v1, v2, v3, w22, x21, x8, w7
+ decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
ret
ENDPROC(aes_decrypt_block4x)
@@ -31,71 +31,57 @@ ENDPROC(aes_decrypt_block4x)
*/
AES_ENTRY(aes_ecb_encrypt)
- frame_push 5
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
- mov x19, x0
- mov x20, x1
- mov x21, x2
- mov x22, x3
- mov x23, x4
-
-.Lecbencrestart:
- enc_prepare w22, x21, x5
+ enc_prepare w3, x2, x5
.LecbencloopNx:
- subs w23, w23, #4
+ subs w4, w4, #4
bmi .Lecbenc1x
- ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 pt blocks */
+ ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
bl aes_encrypt_block4x
- st1 {v0.16b-v3.16b}, [x19], #64
- cond_yield_neon .Lecbencrestart
+ st1 {v0.16b-v3.16b}, [x0], #64
b .LecbencloopNx
.Lecbenc1x:
- adds w23, w23, #4
+ adds w4, w4, #4
beq .Lecbencout
.Lecbencloop:
- ld1 {v0.16b}, [x20], #16 /* get next pt block */
- encrypt_block v0, w22, x21, x5, w6
- st1 {v0.16b}, [x19], #16
- subs w23, w23, #1
+ ld1 {v0.16b}, [x1], #16 /* get next pt block */
+ encrypt_block v0, w3, x2, x5, w6
+ st1 {v0.16b}, [x0], #16
+ subs w4, w4, #1
bne .Lecbencloop
.Lecbencout:
- frame_pop
+ ldp x29, x30, [sp], #16
ret
AES_ENDPROC(aes_ecb_encrypt)
AES_ENTRY(aes_ecb_decrypt)
- frame_push 5
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
- mov x19, x0
- mov x20, x1
- mov x21, x2
- mov x22, x3
- mov x23, x4
-
-.Lecbdecrestart:
- dec_prepare w22, x21, x5
+ dec_prepare w3, x2, x5
.LecbdecloopNx:
- subs w23, w23, #4
+ subs w4, w4, #4
bmi .Lecbdec1x
- ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 ct blocks */
+ ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
bl aes_decrypt_block4x
- st1 {v0.16b-v3.16b}, [x19], #64
- cond_yield_neon .Lecbdecrestart
+ st1 {v0.16b-v3.16b}, [x0], #64
b .LecbdecloopNx
.Lecbdec1x:
- adds w23, w23, #4
+ adds w4, w4, #4
beq .Lecbdecout
.Lecbdecloop:
- ld1 {v0.16b}, [x20], #16 /* get next ct block */
- decrypt_block v0, w22, x21, x5, w6
- st1 {v0.16b}, [x19], #16
- subs w23, w23, #1
+ ld1 {v0.16b}, [x1], #16 /* get next ct block */
+ decrypt_block v0, w3, x2, x5, w6
+ st1 {v0.16b}, [x0], #16
+ subs w4, w4, #1
bne .Lecbdecloop
.Lecbdecout:
- frame_pop
+ ldp x29, x30, [sp], #16
ret
AES_ENDPROC(aes_ecb_decrypt)
@@ -108,100 +94,78 @@ AES_ENDPROC(aes_ecb_decrypt)
*/
AES_ENTRY(aes_cbc_encrypt)
- frame_push 6
-
- mov x19, x0
- mov x20, x1
- mov x21, x2
- mov x22, x3
- mov x23, x4
- mov x24, x5
-
-.Lcbcencrestart:
- ld1 {v4.16b}, [x24] /* get iv */
- enc_prepare w22, x21, x6
+ ld1 {v4.16b}, [x5] /* get iv */
+ enc_prepare w3, x2, x6
.Lcbcencloop4x:
- subs w23, w23, #4
+ subs w4, w4, #4
bmi .Lcbcenc1x
- ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 pt blocks */
+ ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
eor v0.16b, v0.16b, v4.16b /* ..and xor with iv */
- encrypt_block v0, w22, x21, x6, w7
+ encrypt_block v0, w3, x2, x6, w7
eor v1.16b, v1.16b, v0.16b
- encrypt_block v1, w22, x21, x6, w7
+ encrypt_block v1, w3, x2, x6, w7
eor v2.16b, v2.16b, v1.16b
- encrypt_block v2, w22, x21, x6, w7
+ encrypt_block v2, w3, x2, x6, w7
eor v3.16b, v3.16b, v2.16b
- encrypt_block v3, w22, x21, x6, w7
- st1 {v0.16b-v3.16b}, [x19], #64
+ encrypt_block v3, w3, x2, x6, w7
+ st1 {v0.16b-v3.16b}, [x0], #64
mov v4.16b, v3.16b
- st1 {v4.16b}, [x24] /* return iv */
- cond_yield_neon .Lcbcencrestart
b .Lcbcencloop4x
.Lcbcenc1x:
- adds w23, w23, #4
+ adds w4, w4, #4
beq .Lcbcencout
.Lcbcencloop:
- ld1 {v0.16b}, [x20], #16 /* get next pt block */
+ ld1 {v0.16b}, [x1], #16 /* get next pt block */
eor v4.16b, v4.16b, v0.16b /* ..and xor with iv */
- encrypt_block v4, w22, x21, x6, w7
- st1 {v4.16b}, [x19], #16
- subs w23, w23, #1
+ encrypt_block v4, w3, x2, x6, w7
+ st1 {v4.16b}, [x0], #16
+ subs w4, w4, #1
bne .Lcbcencloop
.Lcbcencout:
- st1 {v4.16b}, [x24] /* return iv */
- frame_pop
+ st1 {v4.16b}, [x5] /* return iv */
ret
AES_ENDPROC(aes_cbc_encrypt)
AES_ENTRY(aes_cbc_decrypt)
- frame_push 6
-
- mov x19, x0
- mov x20, x1
- mov x21, x2
- mov x22, x3
- mov x23, x4
- mov x24, x5
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
-.Lcbcdecrestart:
- ld1 {v7.16b}, [x24] /* get iv */
- dec_prepare w22, x21, x6
+ ld1 {v7.16b}, [x5] /* get iv */
+ dec_prepare w3, x2, x6
.LcbcdecloopNx:
- subs w23, w23, #4
+ subs w4, w4, #4
bmi .Lcbcdec1x
- ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 ct blocks */
+ ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
mov v4.16b, v0.16b
mov v5.16b, v1.16b
mov v6.16b, v2.16b
bl aes_decrypt_block4x
- sub x20, x20, #16
+ sub x1, x1, #16
eor v0.16b, v0.16b, v7.16b
eor v1.16b, v1.16b, v4.16b
- ld1 {v7.16b}, [x20], #16 /* reload 1 ct block */
+ ld1 {v7.16b}, [x1], #16 /* reload 1 ct block */
eor v2.16b, v2.16b, v5.16b
eor v3.16b, v3.16b, v6.16b
- st1 {v0.16b-v3.16b}, [x19], #64
- st1 {v7.16b}, [x24] /* return iv */
- cond_yield_neon .Lcbcdecrestart
+ st1 {v0.16b-v3.16b}, [x0], #64
b .LcbcdecloopNx
.Lcbcdec1x:
- adds w23, w23, #4
+ adds w4, w4, #4
beq .Lcbcdecout
.Lcbcdecloop:
- ld1 {v1.16b}, [x20], #16 /* get next ct block */
+ ld1 {v1.16b}, [x1], #16 /* get next ct block */
mov v0.16b, v1.16b /* ...and copy to v0 */
- decrypt_block v0, w22, x21, x6, w7
+ decrypt_block v0, w3, x2, x6, w7
eor v0.16b, v0.16b, v7.16b /* xor with iv => pt */
mov v7.16b, v1.16b /* ct is next iv */
- st1 {v0.16b}, [x19], #16
- subs w23, w23, #1
+ st1 {v0.16b}, [x0], #16
+ subs w4, w4, #1
bne .Lcbcdecloop
.Lcbcdecout:
- st1 {v7.16b}, [x24] /* return iv */
- frame_pop
+ st1 {v7.16b}, [x5] /* return iv */
+ ldp x29, x30, [sp], #16
ret
AES_ENDPROC(aes_cbc_decrypt)
@@ -212,26 +176,19 @@ AES_ENDPROC(aes_cbc_decrypt)
*/
AES_ENTRY(aes_ctr_encrypt)
- frame_push 6
-
- mov x19, x0
- mov x20, x1
- mov x21, x2
- mov x22, x3
- mov x23, x4
- mov x24, x5
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
-.Lctrrestart:
- enc_prepare w22, x21, x6
- ld1 {v4.16b}, [x24]
+ enc_prepare w3, x2, x6
+ ld1 {v4.16b}, [x5]
umov x6, v4.d[1] /* keep swabbed ctr in reg */
rev x6, x6
+ cmn w6, w4 /* 32 bit overflow? */
+ bcs .Lctrloop
.LctrloopNx:
- subs w23, w23, #4
+ subs w4, w4, #4
bmi .Lctr1x
- cmn w6, #4 /* 32 bit overflow? */
- bcs .Lctr1x
add w7, w6, #1
mov v0.16b, v4.16b
add w8, w6, #2
@@ -245,27 +202,25 @@ AES_ENTRY(aes_ctr_encrypt)
rev w9, w9
mov v2.s[3], w8
mov v3.s[3], w9
- ld1 {v5.16b-v7.16b}, [x20], #48 /* get 3 input blocks */
+ ld1 {v5.16b-v7.16b}, [x1], #48 /* get 3 input blocks */
bl aes_encrypt_block4x
eor v0.16b, v5.16b, v0.16b
- ld1 {v5.16b}, [x20], #16 /* get 1 input block */
+ ld1 {v5.16b}, [x1], #16 /* get 1 input block */
eor v1.16b, v6.16b, v1.16b
eor v2.16b, v7.16b, v2.16b
eor v3.16b, v5.16b, v3.16b
- st1 {v0.16b-v3.16b}, [x19], #64
+ st1 {v0.16b-v3.16b}, [x0], #64
add x6, x6, #4
rev x7, x6
ins v4.d[1], x7
- cbz w23, .Lctrout
- st1 {v4.16b}, [x24] /* return next CTR value */
- cond_yield_neon .Lctrrestart
+ cbz w4, .Lctrout
b .LctrloopNx
.Lctr1x:
- adds w23, w23, #4
+ adds w4, w4, #4
beq .Lctrout
.Lctrloop:
mov v0.16b, v4.16b
- encrypt_block v0, w22, x21, x8, w7
+ encrypt_block v0, w3, x2, x8, w7
adds x6, x6, #1 /* increment BE ctr */
rev x7, x6
@@ -273,22 +228,22 @@ AES_ENTRY(aes_ctr_encrypt)
bcs .Lctrcarry /* overflow? */
.Lctrcarrydone:
- subs w23, w23, #1
+ subs w4, w4, #1
bmi .Lctrtailblock /* blocks <0 means tail block */
- ld1 {v3.16b}, [x20], #16
+ ld1 {v3.16b}, [x1], #16
eor v3.16b, v0.16b, v3.16b
- st1 {v3.16b}, [x19], #16
+ st1 {v3.16b}, [x0], #16
bne .Lctrloop
.Lctrout:
- st1 {v4.16b}, [x24] /* return next CTR value */
-.Lctrret:
- frame_pop
+ st1 {v4.16b}, [x5] /* return next CTR value */
+ ldp x29, x30, [sp], #16
ret
.Lctrtailblock:
- st1 {v0.16b}, [x19]
- b .Lctrret
+ st1 {v0.16b}, [x0]
+ ldp x29, x30, [sp], #16
+ ret
.Lctrcarry:
umov x7, v4.d[0] /* load upper word of ctr */
@@ -321,16 +276,10 @@ CPU_LE( .quad 1, 0x87 )
CPU_BE( .quad 0x87, 1 )
AES_ENTRY(aes_xts_encrypt)
- frame_push 6
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
- mov x19, x0
- mov x20, x1
- mov x21, x2
- mov x22, x3
- mov x23, x4
- mov x24, x6
-
- ld1 {v4.16b}, [x24]
+ ld1 {v4.16b}, [x6]
cbz w7, .Lxtsencnotfirst
enc_prepare w3, x5, x8
@@ -339,17 +288,15 @@ AES_ENTRY(aes_xts_encrypt)
ldr q7, .Lxts_mul_x
b .LxtsencNx
-.Lxtsencrestart:
- ld1 {v4.16b}, [x24]
.Lxtsencnotfirst:
- enc_prepare w22, x21, x8
+ enc_prepare w3, x2, x8
.LxtsencloopNx:
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
.LxtsencNx:
- subs w23, w23, #4
+ subs w4, w4, #4
bmi .Lxtsenc1x
- ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 pt blocks */
+ ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
next_tweak v5, v4, v7, v8
eor v0.16b, v0.16b, v4.16b
next_tweak v6, v5, v7, v8
@@ -362,43 +309,35 @@ AES_ENTRY(aes_xts_encrypt)
eor v0.16b, v0.16b, v4.16b
eor v1.16b, v1.16b, v5.16b
eor v2.16b, v2.16b, v6.16b
- st1 {v0.16b-v3.16b}, [x19], #64
+ st1 {v0.16b-v3.16b}, [x0], #64
mov v4.16b, v7.16b
- cbz w23, .Lxtsencout
- st1 {v4.16b}, [x24]
- cond_yield_neon .Lxtsencrestart
+ cbz w4, .Lxtsencout
b .LxtsencloopNx
.Lxtsenc1x:
- adds w23, w23, #4
+ adds w4, w4, #4
beq .Lxtsencout
.Lxtsencloop:
- ld1 {v1.16b}, [x20], #16
+ ld1 {v1.16b}, [x1], #16
eor v0.16b, v1.16b, v4.16b
- encrypt_block v0, w22, x21, x8, w7
+ encrypt_block v0, w3, x2, x8, w7
eor v0.16b, v0.16b, v4.16b
- st1 {v0.16b}, [x19], #16
- subs w23, w23, #1
+ st1 {v0.16b}, [x0], #16
+ subs w4, w4, #1
beq .Lxtsencout
next_tweak v4, v4, v7, v8
b .Lxtsencloop
.Lxtsencout:
- st1 {v4.16b}, [x24]
- frame_pop
+ st1 {v4.16b}, [x6]
+ ldp x29, x30, [sp], #16
ret
AES_ENDPROC(aes_xts_encrypt)
AES_ENTRY(aes_xts_decrypt)
- frame_push 6
-
- mov x19, x0
- mov x20, x1
- mov x21, x2
- mov x22, x3
- mov x23, x4
- mov x24, x6
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
- ld1 {v4.16b}, [x24]
+ ld1 {v4.16b}, [x6]
cbz w7, .Lxtsdecnotfirst
enc_prepare w3, x5, x8
@@ -407,17 +346,15 @@ AES_ENTRY(aes_xts_decrypt)
ldr q7, .Lxts_mul_x
b .LxtsdecNx
-.Lxtsdecrestart:
- ld1 {v4.16b}, [x24]
.Lxtsdecnotfirst:
- dec_prepare w22, x21, x8
+ dec_prepare w3, x2, x8
.LxtsdecloopNx:
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
.LxtsdecNx:
- subs w23, w23, #4
+ subs w4, w4, #4
bmi .Lxtsdec1x
- ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 ct blocks */
+ ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
next_tweak v5, v4, v7, v8
eor v0.16b, v0.16b, v4.16b
next_tweak v6, v5, v7, v8
@@ -430,28 +367,26 @@ AES_ENTRY(aes_xts_decrypt)
eor v0.16b, v0.16b, v4.16b
eor v1.16b, v1.16b, v5.16b
eor v2.16b, v2.16b, v6.16b
- st1 {v0.16b-v3.16b}, [x19], #64
+ st1 {v0.16b-v3.16b}, [x0], #64
mov v4.16b, v7.16b
- cbz w23, .Lxtsdecout
- st1 {v4.16b}, [x24]
- cond_yield_neon .Lxtsdecrestart
+ cbz w4, .Lxtsdecout
b .LxtsdecloopNx
.Lxtsdec1x:
- adds w23, w23, #4
+ adds w4, w4, #4
beq .Lxtsdecout
.Lxtsdecloop:
- ld1 {v1.16b}, [x20], #16
+ ld1 {v1.16b}, [x1], #16
eor v0.16b, v1.16b, v4.16b
- decrypt_block v0, w22, x21, x8, w7
+ decrypt_block v0, w3, x2, x8, w7
eor v0.16b, v0.16b, v4.16b
- st1 {v0.16b}, [x19], #16
- subs w23, w23, #1
+ st1 {v0.16b}, [x0], #16
+ subs w4, w4, #1
beq .Lxtsdecout
next_tweak v4, v4, v7, v8
b .Lxtsdecloop
.Lxtsdecout:
- st1 {v4.16b}, [x24]
- frame_pop
+ st1 {v4.16b}, [x6]
+ ldp x29, x30, [sp], #16
ret
AES_ENDPROC(aes_xts_decrypt)
--
2.18.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 3/4] crypto: arm64/aes-blk - add support for CTS-CBC mode
2018-09-10 14:41 ` Ard Biesheuvel
@ 2018-09-10 14:41 ` Ard Biesheuvel
-1 siblings, 0 replies; 14+ messages in thread
From: Ard Biesheuvel @ 2018-09-10 14:41 UTC (permalink / raw)
To: linux-crypto
Cc: Ard Biesheuvel, Theodore Ts'o, herbert, Steve Capper,
Eric Biggers, linux-arm-kernel
Currently, we rely on the generic CTS chaining mode wrapper to
instantiate the cts(cbc(aes)) skcipher. Due to the high performance
of the ARMv8 Crypto Extensions AES instructions (~1 cycles per byte),
any overhead in the chaining mode layers is amplified, and so it pays
off considerably to fold the CTS handling into the SIMD routines.
On Cortex-A53, this results in a ~50% speedup for smaller input sizes.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
This patch supersedes '[RFC/RFT PATCH] crypto: arm64/aes-ce - add support
for CTS-CBC mode' sent out last Saturday.
Changes:
- keep subreq and scatterlist in request ctx structure
- optimize away second scatterwalk_ffwd() invocation when encrypting in-place
- keep permute table in .rodata section
- polish asm code (drop literal + offset reference, reorder insns)
Raw performance numbers after the patch.
arch/arm64/crypto/aes-glue.c | 165 ++++++++++++++++++++
arch/arm64/crypto/aes-modes.S | 79 +++++++++-
2 files changed, 243 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 1c6934544c1f..26d2b0263ba6 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -15,6 +15,7 @@
#include <crypto/internal/hash.h>
#include <crypto/internal/simd.h>
#include <crypto/internal/skcipher.h>
+#include <crypto/scatterwalk.h>
#include <linux/module.h>
#include <linux/cpufeature.h>
#include <crypto/xts.h>
@@ -31,6 +32,8 @@
#define aes_ecb_decrypt ce_aes_ecb_decrypt
#define aes_cbc_encrypt ce_aes_cbc_encrypt
#define aes_cbc_decrypt ce_aes_cbc_decrypt
+#define aes_cbc_cts_encrypt ce_aes_cbc_cts_encrypt
+#define aes_cbc_cts_decrypt ce_aes_cbc_cts_decrypt
#define aes_ctr_encrypt ce_aes_ctr_encrypt
#define aes_xts_encrypt ce_aes_xts_encrypt
#define aes_xts_decrypt ce_aes_xts_decrypt
@@ -45,6 +48,8 @@ MODULE_DESCRIPTION("AES-ECB/CBC/CTR/XTS using ARMv8 Crypto Extensions");
#define aes_ecb_decrypt neon_aes_ecb_decrypt
#define aes_cbc_encrypt neon_aes_cbc_encrypt
#define aes_cbc_decrypt neon_aes_cbc_decrypt
+#define aes_cbc_cts_encrypt neon_aes_cbc_cts_encrypt
+#define aes_cbc_cts_decrypt neon_aes_cbc_cts_decrypt
#define aes_ctr_encrypt neon_aes_ctr_encrypt
#define aes_xts_encrypt neon_aes_xts_encrypt
#define aes_xts_decrypt neon_aes_xts_decrypt
@@ -73,6 +78,11 @@ asmlinkage void aes_cbc_encrypt(u8 out[], u8 const in[], u32 const rk[],
asmlinkage void aes_cbc_decrypt(u8 out[], u8 const in[], u32 const rk[],
int rounds, int blocks, u8 iv[]);
+asmlinkage void aes_cbc_cts_encrypt(u8 out[], u8 const in[], u32 const rk[],
+ int rounds, int bytes, u8 const iv[]);
+asmlinkage void aes_cbc_cts_decrypt(u8 out[], u8 const in[], u32 const rk[],
+ int rounds, int bytes, u8 const iv[]);
+
asmlinkage void aes_ctr_encrypt(u8 out[], u8 const in[], u32 const rk[],
int rounds, int blocks, u8 ctr[]);
@@ -87,6 +97,12 @@ asmlinkage void aes_mac_update(u8 const in[], u32 const rk[], int rounds,
int blocks, u8 dg[], int enc_before,
int enc_after);
+struct cts_cbc_req_ctx {
+ struct scatterlist sg_src[2];
+ struct scatterlist sg_dst[2];
+ struct skcipher_request subreq;
+};
+
struct crypto_aes_xts_ctx {
struct crypto_aes_ctx key1;
struct crypto_aes_ctx __aligned(8) key2;
@@ -209,6 +225,136 @@ static int cbc_decrypt(struct skcipher_request *req)
return err;
}
+static int cts_cbc_init_tfm(struct crypto_skcipher *tfm)
+{
+ crypto_skcipher_set_reqsize(tfm, sizeof(struct cts_cbc_req_ctx));
+ return 0;
+}
+
+static int cts_cbc_encrypt(struct skcipher_request *req)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
+ struct cts_cbc_req_ctx *rctx = skcipher_request_ctx(req);
+ int err, rounds = 6 + ctx->key_length / 4;
+ int cbc_blocks = DIV_ROUND_UP(req->cryptlen, AES_BLOCK_SIZE) - 2;
+ struct scatterlist *src = req->src, *dst = req->dst;
+ struct skcipher_walk walk;
+
+ skcipher_request_set_tfm(&rctx->subreq, tfm);
+
+ if (req->cryptlen == AES_BLOCK_SIZE)
+ cbc_blocks = 1;
+
+ if (cbc_blocks > 0) {
+ unsigned int blocks;
+
+ skcipher_request_set_crypt(&rctx->subreq, req->src, req->dst,
+ cbc_blocks * AES_BLOCK_SIZE,
+ req->iv);
+
+ err = skcipher_walk_virt(&walk, &rctx->subreq, false);
+
+ while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
+ aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
+ ctx->key_enc, rounds, blocks, walk.iv);
+ kernel_neon_end();
+ err = skcipher_walk_done(&walk,
+ walk.nbytes % AES_BLOCK_SIZE);
+ }
+ if (err)
+ return err;
+
+ if (req->cryptlen == AES_BLOCK_SIZE)
+ return 0;
+
+ dst = src = scatterwalk_ffwd(rctx->sg_src, req->src,
+ rctx->subreq.cryptlen);
+ if (req->dst != req->src)
+ dst = scatterwalk_ffwd(rctx->sg_dst, req->dst,
+ rctx->subreq.cryptlen);
+ }
+
+ /* handle ciphertext stealing */
+ skcipher_request_set_crypt(&rctx->subreq, src, dst,
+ req->cryptlen - cbc_blocks * AES_BLOCK_SIZE,
+ req->iv);
+
+ err = skcipher_walk_virt(&walk, &rctx->subreq, false);
+ if (err)
+ return err;
+
+ kernel_neon_begin();
+ aes_cbc_cts_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
+ ctx->key_enc, rounds, walk.nbytes, walk.iv);
+ kernel_neon_end();
+
+ return skcipher_walk_done(&walk, 0);
+}
+
+static int cts_cbc_decrypt(struct skcipher_request *req)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
+ struct cts_cbc_req_ctx *rctx = skcipher_request_ctx(req);
+ int err, rounds = 6 + ctx->key_length / 4;
+ int cbc_blocks = DIV_ROUND_UP(req->cryptlen, AES_BLOCK_SIZE) - 2;
+ struct scatterlist *src = req->src, *dst = req->dst;
+ struct skcipher_walk walk;
+
+ skcipher_request_set_tfm(&rctx->subreq, tfm);
+
+ if (req->cryptlen == AES_BLOCK_SIZE)
+ cbc_blocks = 1;
+
+ if (cbc_blocks > 0) {
+ unsigned int blocks;
+
+ skcipher_request_set_crypt(&rctx->subreq, req->src, req->dst,
+ cbc_blocks * AES_BLOCK_SIZE,
+ req->iv);
+
+ err = skcipher_walk_virt(&walk, &rctx->subreq, false);
+
+ while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
+ aes_cbc_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
+ ctx->key_dec, rounds, blocks, walk.iv);
+ kernel_neon_end();
+ err = skcipher_walk_done(&walk,
+ walk.nbytes % AES_BLOCK_SIZE);
+ }
+ if (err)
+ return err;
+
+ if (req->cryptlen == AES_BLOCK_SIZE)
+ return 0;
+
+ dst = src = scatterwalk_ffwd(rctx->sg_src, req->src,
+ rctx->subreq.cryptlen);
+ if (req->dst != req->src)
+ dst = scatterwalk_ffwd(rctx->sg_dst, req->dst,
+ rctx->subreq.cryptlen);
+ }
+
+ /* handle ciphertext stealing */
+ skcipher_request_set_crypt(&rctx->subreq, src, dst,
+ req->cryptlen - cbc_blocks * AES_BLOCK_SIZE,
+ req->iv);
+
+ err = skcipher_walk_virt(&walk, &rctx->subreq, false);
+ if (err)
+ return err;
+
+ kernel_neon_begin();
+ aes_cbc_cts_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
+ ctx->key_dec, rounds, walk.nbytes, walk.iv);
+ kernel_neon_end();
+
+ return skcipher_walk_done(&walk, 0);
+}
+
static int ctr_encrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
@@ -334,6 +480,25 @@ static struct skcipher_alg aes_algs[] = { {
.setkey = skcipher_aes_setkey,
.encrypt = cbc_encrypt,
.decrypt = cbc_decrypt,
+}, {
+ .base = {
+ .cra_name = "__cts(cbc(aes))",
+ .cra_driver_name = "__cts-cbc-aes-" MODE,
+ .cra_priority = PRIO,
+ .cra_flags = CRYPTO_ALG_INTERNAL,
+ .cra_blocksize = 1,
+ .cra_ctxsize = sizeof(struct crypto_aes_ctx),
+ .cra_module = THIS_MODULE,
+ },
+ .min_keysize = AES_MIN_KEY_SIZE,
+ .max_keysize = AES_MAX_KEY_SIZE,
+ .ivsize = AES_BLOCK_SIZE,
+ .chunksize = AES_BLOCK_SIZE,
+ .walksize = 2 * AES_BLOCK_SIZE,
+ .setkey = skcipher_aes_setkey,
+ .encrypt = cts_cbc_encrypt,
+ .decrypt = cts_cbc_decrypt,
+ .init = cts_cbc_init_tfm,
}, {
.base = {
.cra_name = "__ctr(aes)",
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 35632d11200f..82931fba53d2 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -170,6 +170,84 @@ AES_ENTRY(aes_cbc_decrypt)
AES_ENDPROC(aes_cbc_decrypt)
+ /*
+ * aes_cbc_cts_encrypt(u8 out[], u8 const in[], u32 const rk[],
+ * int rounds, int bytes, u8 const iv[])
+ * aes_cbc_cts_decrypt(u8 out[], u8 const in[], u32 const rk[],
+ * int rounds, int bytes, u8 const iv[])
+ */
+
+AES_ENTRY(aes_cbc_cts_encrypt)
+ adr_l x8, .Lcts_permute_table
+ sub x4, x4, #16
+ add x9, x8, #32
+ add x8, x8, x4
+ sub x9, x9, x4
+ ld1 {v3.16b}, [x8]
+ ld1 {v4.16b}, [x9]
+
+ ld1 {v0.16b}, [x1], x4 /* overlapping loads */
+ ld1 {v1.16b}, [x1]
+
+ ld1 {v5.16b}, [x5] /* get iv */
+ enc_prepare w3, x2, x6
+
+ eor v0.16b, v0.16b, v5.16b /* xor with iv */
+ tbl v1.16b, {v1.16b}, v4.16b
+ encrypt_block v0, w3, x2, x6, w7
+
+ eor v1.16b, v1.16b, v0.16b
+ tbl v0.16b, {v0.16b}, v3.16b
+ encrypt_block v1, w3, x2, x6, w7
+
+ add x4, x0, x4
+ st1 {v0.16b}, [x4] /* overlapping stores */
+ st1 {v1.16b}, [x0]
+ ret
+AES_ENDPROC(aes_cbc_cts_encrypt)
+
+AES_ENTRY(aes_cbc_cts_decrypt)
+ adr_l x8, .Lcts_permute_table
+ sub x4, x4, #16
+ add x9, x8, #32
+ add x8, x8, x4
+ sub x9, x9, x4
+ ld1 {v3.16b}, [x8]
+ ld1 {v4.16b}, [x9]
+
+ ld1 {v0.16b}, [x1], x4 /* overlapping loads */
+ ld1 {v1.16b}, [x1]
+
+ ld1 {v5.16b}, [x5] /* get iv */
+ dec_prepare w3, x2, x6
+
+ tbl v2.16b, {v1.16b}, v4.16b
+ decrypt_block v0, w3, x2, x6, w7
+ eor v2.16b, v2.16b, v0.16b
+
+ tbx v0.16b, {v1.16b}, v4.16b
+ tbl v2.16b, {v2.16b}, v3.16b
+ decrypt_block v0, w3, x2, x6, w7
+ eor v0.16b, v0.16b, v5.16b /* xor with iv */
+
+ add x4, x0, x4
+ st1 {v2.16b}, [x4] /* overlapping stores */
+ st1 {v0.16b}, [x0]
+ ret
+AES_ENDPROC(aes_cbc_cts_decrypt)
+
+ .section ".rodata", "a"
+ .align 6
+.Lcts_permute_table:
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+ .byte 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7
+ .byte 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+ .previous
+
+
/*
* aes_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
* int blocks, u8 ctr[])
@@ -253,7 +331,6 @@ AES_ENTRY(aes_ctr_encrypt)
ins v4.d[0], x7
b .Lctrcarrydone
AES_ENDPROC(aes_ctr_encrypt)
- .ltorg
/*
--
2.18.0
Cortex-A53 @ 1 GHz
BEFORE:
testing speed of async cts(cbc(aes)) (cts(cbc-aes-ce)) encryption
0 (128 bit key, 16 byte blocks): 1407866 ops in 1 secs ( 22525856 bytes)
1 (128 bit key, 64 byte blocks): 466814 ops in 1 secs ( 29876096 bytes)
2 (128 bit key, 256 byte blocks): 401023 ops in 1 secs (102661888 bytes)
3 (128 bit key, 1024 byte blocks): 258238 ops in 1 secs (264435712 bytes)
4 (128 bit key, 8192 byte blocks): 57905 ops in 1 secs (474357760 bytes)
5 (192 bit key, 16 byte blocks): 1388333 ops in 1 secs ( 22213328 bytes)
6 (192 bit key, 64 byte blocks): 448595 ops in 1 secs ( 28710080 bytes)
7 (192 bit key, 256 byte blocks): 376951 ops in 1 secs ( 96499456 bytes)
8 (192 bit key, 1024 byte blocks): 231635 ops in 1 secs (237194240 bytes)
9 (192 bit key, 8192 byte blocks): 43345 ops in 1 secs (355082240 bytes)
10 (256 bit key, 16 byte blocks): 1370820 ops in 1 secs ( 21933120 bytes)
11 (256 bit key, 64 byte blocks): 452352 ops in 1 secs ( 28950528 bytes)
12 (256 bit key, 256 byte blocks): 376506 ops in 1 secs ( 96385536 bytes)
13 (256 bit key, 1024 byte blocks): 223219 ops in 1 secs (228576256 bytes)
14 (256 bit key, 8192 byte blocks): 44874 ops in 1 secs (367607808 bytes)
testing speed of async cts(cbc(aes)) (cts(cbc-aes-ce)) decryption
0 (128 bit key, 16 byte blocks): 1402795 ops in 1 secs ( 22444720 bytes)
1 (128 bit key, 64 byte blocks): 403300 ops in 1 secs ( 25811200 bytes)
2 (128 bit key, 256 byte blocks): 367710 ops in 1 secs ( 94133760 bytes)
3 (128 bit key, 1024 byte blocks): 269118 ops in 1 secs (275576832 bytes)
4 (128 bit key, 8192 byte blocks): 74706 ops in 1 secs (611991552 bytes)
5 (192 bit key, 16 byte blocks): 1381390 ops in 1 secs ( 22102240 bytes)
6 (192 bit key, 64 byte blocks): 388555 ops in 1 secs ( 24867520 bytes)
7 (192 bit key, 256 byte blocks): 350282 ops in 1 secs ( 89672192 bytes)
8 (192 bit key, 1024 byte blocks): 251268 ops in 1 secs (257298432 bytes)
9 (192 bit key, 8192 byte blocks): 56535 ops in 1 secs (463134720 bytes)
10 (256 bit key, 16 byte blocks): 1364334 ops in 1 secs ( 21829344 bytes)
11 (256 bit key, 64 byte blocks): 392610 ops in 1 secs ( 25127040 bytes)
12 (256 bit key, 256 byte blocks): 351150 ops in 1 secs ( 89894400 bytes)
13 (256 bit key, 1024 byte blocks): 247455 ops in 1 secs (253393920 bytes)
14 (256 bit key, 8192 byte blocks): 62530 ops in 1 secs (512245760 bytes)
AFTER:
testing speed of async cts(cbc(aes)) (cts-cbc-aes-ce) encryption
0 (128 bit key, 16 byte blocks): 1380568 ops in 1 secs ( 22089088 bytes)
1 (128 bit key, 64 byte blocks): 692731 ops in 1 secs ( 44334784 bytes)
2 (128 bit key, 256 byte blocks): 556393 ops in 1 secs (142436608 bytes)
3 (128 bit key, 1024 byte blocks): 314635 ops in 1 secs (322186240 bytes)
4 (128 bit key, 8192 byte blocks): 57550 ops in 1 secs (471449600 bytes)
5 (192 bit key, 16 byte blocks): 1367027 ops in 1 secs ( 21872432 bytes)
6 (192 bit key, 64 byte blocks): 675058 ops in 1 secs ( 43203712 bytes)
7 (192 bit key, 256 byte blocks): 523177 ops in 1 secs (133933312 bytes)
8 (192 bit key, 1024 byte blocks): 279235 ops in 1 secs (285936640 bytes)
9 (192 bit key, 8192 byte blocks): 46316 ops in 1 secs (379420672 bytes)
10 (256 bit key, 16 byte blocks): 1353576 ops in 1 secs ( 21657216 bytes)
11 (256 bit key, 64 byte blocks): 664523 ops in 1 secs ( 42529472 bytes)
12 (256 bit key, 256 byte blocks): 508141 ops in 1 secs (130084096 bytes)
13 (256 bit key, 1024 byte blocks): 264386 ops in 1 secs (270731264 bytes)
14 (256 bit key, 8192 byte blocks): 47224 ops in 1 secs (386859008 bytes)
testing speed of async cts(cbc(aes)) (cts-cbc-aes-ce) decryption
0 (128 bit key, 16 byte blocks): 1388553 ops in 1 secs ( 22216848 bytes)
1 (128 bit key, 64 byte blocks): 688402 ops in 1 secs ( 44057728 bytes)
2 (128 bit key, 256 byte blocks): 589268 ops in 1 secs (150852608 bytes)
3 (128 bit key, 1024 byte blocks): 372238 ops in 1 secs (381171712 bytes)
4 (128 bit key, 8192 byte blocks): 75691 ops in 1 secs (620060672 bytes)
5 (192 bit key, 16 byte blocks): 1366220 ops in 1 secs ( 21859520 bytes)
6 (192 bit key, 64 byte blocks): 666889 ops in 1 secs ( 42680896 bytes)
7 (192 bit key, 256 byte blocks): 561809 ops in 1 secs (143823104 bytes)
8 (192 bit key, 1024 byte blocks): 344117 ops in 1 secs (352375808 bytes)
9 (192 bit key, 8192 byte blocks): 63150 ops in 1 secs (517324800 bytes)
10 (256 bit key, 16 byte blocks): 1349266 ops in 1 secs ( 21588256 bytes)
11 (256 bit key, 64 byte blocks): 661056 ops in 1 secs ( 42307584 bytes)
12 (256 bit key, 256 byte blocks): 550261 ops in 1 secs (140866816 bytes)
13 (256 bit key, 1024 byte blocks): 332947 ops in 1 secs (340937728 bytes)
14 (256 bit key, 8192 byte blocks): 68759 ops in 1 secs (563273728 bytes)
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 3/4] crypto: arm64/aes-blk - add support for CTS-CBC mode
@ 2018-09-10 14:41 ` Ard Biesheuvel
0 siblings, 0 replies; 14+ messages in thread
From: Ard Biesheuvel @ 2018-09-10 14:41 UTC (permalink / raw)
To: linux-arm-kernel
Currently, we rely on the generic CTS chaining mode wrapper to
instantiate the cts(cbc(aes)) skcipher. Due to the high performance
of the ARMv8 Crypto Extensions AES instructions (~1 cycles per byte),
any overhead in the chaining mode layers is amplified, and so it pays
off considerably to fold the CTS handling into the SIMD routines.
On Cortex-A53, this results in a ~50% speedup for smaller input sizes.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
This patch supersedes '[RFC/RFT PATCH] crypto: arm64/aes-ce - add support
for CTS-CBC mode' sent out last Saturday.
Changes:
- keep subreq and scatterlist in request ctx structure
- optimize away second scatterwalk_ffwd() invocation when encrypting in-place
- keep permute table in .rodata section
- polish asm code (drop literal + offset reference, reorder insns)
Raw performance numbers after the patch.
arch/arm64/crypto/aes-glue.c | 165 ++++++++++++++++++++
arch/arm64/crypto/aes-modes.S | 79 +++++++++-
2 files changed, 243 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 1c6934544c1f..26d2b0263ba6 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -15,6 +15,7 @@
#include <crypto/internal/hash.h>
#include <crypto/internal/simd.h>
#include <crypto/internal/skcipher.h>
+#include <crypto/scatterwalk.h>
#include <linux/module.h>
#include <linux/cpufeature.h>
#include <crypto/xts.h>
@@ -31,6 +32,8 @@
#define aes_ecb_decrypt ce_aes_ecb_decrypt
#define aes_cbc_encrypt ce_aes_cbc_encrypt
#define aes_cbc_decrypt ce_aes_cbc_decrypt
+#define aes_cbc_cts_encrypt ce_aes_cbc_cts_encrypt
+#define aes_cbc_cts_decrypt ce_aes_cbc_cts_decrypt
#define aes_ctr_encrypt ce_aes_ctr_encrypt
#define aes_xts_encrypt ce_aes_xts_encrypt
#define aes_xts_decrypt ce_aes_xts_decrypt
@@ -45,6 +48,8 @@ MODULE_DESCRIPTION("AES-ECB/CBC/CTR/XTS using ARMv8 Crypto Extensions");
#define aes_ecb_decrypt neon_aes_ecb_decrypt
#define aes_cbc_encrypt neon_aes_cbc_encrypt
#define aes_cbc_decrypt neon_aes_cbc_decrypt
+#define aes_cbc_cts_encrypt neon_aes_cbc_cts_encrypt
+#define aes_cbc_cts_decrypt neon_aes_cbc_cts_decrypt
#define aes_ctr_encrypt neon_aes_ctr_encrypt
#define aes_xts_encrypt neon_aes_xts_encrypt
#define aes_xts_decrypt neon_aes_xts_decrypt
@@ -73,6 +78,11 @@ asmlinkage void aes_cbc_encrypt(u8 out[], u8 const in[], u32 const rk[],
asmlinkage void aes_cbc_decrypt(u8 out[], u8 const in[], u32 const rk[],
int rounds, int blocks, u8 iv[]);
+asmlinkage void aes_cbc_cts_encrypt(u8 out[], u8 const in[], u32 const rk[],
+ int rounds, int bytes, u8 const iv[]);
+asmlinkage void aes_cbc_cts_decrypt(u8 out[], u8 const in[], u32 const rk[],
+ int rounds, int bytes, u8 const iv[]);
+
asmlinkage void aes_ctr_encrypt(u8 out[], u8 const in[], u32 const rk[],
int rounds, int blocks, u8 ctr[]);
@@ -87,6 +97,12 @@ asmlinkage void aes_mac_update(u8 const in[], u32 const rk[], int rounds,
int blocks, u8 dg[], int enc_before,
int enc_after);
+struct cts_cbc_req_ctx {
+ struct scatterlist sg_src[2];
+ struct scatterlist sg_dst[2];
+ struct skcipher_request subreq;
+};
+
struct crypto_aes_xts_ctx {
struct crypto_aes_ctx key1;
struct crypto_aes_ctx __aligned(8) key2;
@@ -209,6 +225,136 @@ static int cbc_decrypt(struct skcipher_request *req)
return err;
}
+static int cts_cbc_init_tfm(struct crypto_skcipher *tfm)
+{
+ crypto_skcipher_set_reqsize(tfm, sizeof(struct cts_cbc_req_ctx));
+ return 0;
+}
+
+static int cts_cbc_encrypt(struct skcipher_request *req)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
+ struct cts_cbc_req_ctx *rctx = skcipher_request_ctx(req);
+ int err, rounds = 6 + ctx->key_length / 4;
+ int cbc_blocks = DIV_ROUND_UP(req->cryptlen, AES_BLOCK_SIZE) - 2;
+ struct scatterlist *src = req->src, *dst = req->dst;
+ struct skcipher_walk walk;
+
+ skcipher_request_set_tfm(&rctx->subreq, tfm);
+
+ if (req->cryptlen == AES_BLOCK_SIZE)
+ cbc_blocks = 1;
+
+ if (cbc_blocks > 0) {
+ unsigned int blocks;
+
+ skcipher_request_set_crypt(&rctx->subreq, req->src, req->dst,
+ cbc_blocks * AES_BLOCK_SIZE,
+ req->iv);
+
+ err = skcipher_walk_virt(&walk, &rctx->subreq, false);
+
+ while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
+ aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
+ ctx->key_enc, rounds, blocks, walk.iv);
+ kernel_neon_end();
+ err = skcipher_walk_done(&walk,
+ walk.nbytes % AES_BLOCK_SIZE);
+ }
+ if (err)
+ return err;
+
+ if (req->cryptlen == AES_BLOCK_SIZE)
+ return 0;
+
+ dst = src = scatterwalk_ffwd(rctx->sg_src, req->src,
+ rctx->subreq.cryptlen);
+ if (req->dst != req->src)
+ dst = scatterwalk_ffwd(rctx->sg_dst, req->dst,
+ rctx->subreq.cryptlen);
+ }
+
+ /* handle ciphertext stealing */
+ skcipher_request_set_crypt(&rctx->subreq, src, dst,
+ req->cryptlen - cbc_blocks * AES_BLOCK_SIZE,
+ req->iv);
+
+ err = skcipher_walk_virt(&walk, &rctx->subreq, false);
+ if (err)
+ return err;
+
+ kernel_neon_begin();
+ aes_cbc_cts_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
+ ctx->key_enc, rounds, walk.nbytes, walk.iv);
+ kernel_neon_end();
+
+ return skcipher_walk_done(&walk, 0);
+}
+
+static int cts_cbc_decrypt(struct skcipher_request *req)
+{
+ struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+ struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
+ struct cts_cbc_req_ctx *rctx = skcipher_request_ctx(req);
+ int err, rounds = 6 + ctx->key_length / 4;
+ int cbc_blocks = DIV_ROUND_UP(req->cryptlen, AES_BLOCK_SIZE) - 2;
+ struct scatterlist *src = req->src, *dst = req->dst;
+ struct skcipher_walk walk;
+
+ skcipher_request_set_tfm(&rctx->subreq, tfm);
+
+ if (req->cryptlen == AES_BLOCK_SIZE)
+ cbc_blocks = 1;
+
+ if (cbc_blocks > 0) {
+ unsigned int blocks;
+
+ skcipher_request_set_crypt(&rctx->subreq, req->src, req->dst,
+ cbc_blocks * AES_BLOCK_SIZE,
+ req->iv);
+
+ err = skcipher_walk_virt(&walk, &rctx->subreq, false);
+
+ while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
+ aes_cbc_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
+ ctx->key_dec, rounds, blocks, walk.iv);
+ kernel_neon_end();
+ err = skcipher_walk_done(&walk,
+ walk.nbytes % AES_BLOCK_SIZE);
+ }
+ if (err)
+ return err;
+
+ if (req->cryptlen == AES_BLOCK_SIZE)
+ return 0;
+
+ dst = src = scatterwalk_ffwd(rctx->sg_src, req->src,
+ rctx->subreq.cryptlen);
+ if (req->dst != req->src)
+ dst = scatterwalk_ffwd(rctx->sg_dst, req->dst,
+ rctx->subreq.cryptlen);
+ }
+
+ /* handle ciphertext stealing */
+ skcipher_request_set_crypt(&rctx->subreq, src, dst,
+ req->cryptlen - cbc_blocks * AES_BLOCK_SIZE,
+ req->iv);
+
+ err = skcipher_walk_virt(&walk, &rctx->subreq, false);
+ if (err)
+ return err;
+
+ kernel_neon_begin();
+ aes_cbc_cts_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
+ ctx->key_dec, rounds, walk.nbytes, walk.iv);
+ kernel_neon_end();
+
+ return skcipher_walk_done(&walk, 0);
+}
+
static int ctr_encrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
@@ -334,6 +480,25 @@ static struct skcipher_alg aes_algs[] = { {
.setkey = skcipher_aes_setkey,
.encrypt = cbc_encrypt,
.decrypt = cbc_decrypt,
+}, {
+ .base = {
+ .cra_name = "__cts(cbc(aes))",
+ .cra_driver_name = "__cts-cbc-aes-" MODE,
+ .cra_priority = PRIO,
+ .cra_flags = CRYPTO_ALG_INTERNAL,
+ .cra_blocksize = 1,
+ .cra_ctxsize = sizeof(struct crypto_aes_ctx),
+ .cra_module = THIS_MODULE,
+ },
+ .min_keysize = AES_MIN_KEY_SIZE,
+ .max_keysize = AES_MAX_KEY_SIZE,
+ .ivsize = AES_BLOCK_SIZE,
+ .chunksize = AES_BLOCK_SIZE,
+ .walksize = 2 * AES_BLOCK_SIZE,
+ .setkey = skcipher_aes_setkey,
+ .encrypt = cts_cbc_encrypt,
+ .decrypt = cts_cbc_decrypt,
+ .init = cts_cbc_init_tfm,
}, {
.base = {
.cra_name = "__ctr(aes)",
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 35632d11200f..82931fba53d2 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -170,6 +170,84 @@ AES_ENTRY(aes_cbc_decrypt)
AES_ENDPROC(aes_cbc_decrypt)
+ /*
+ * aes_cbc_cts_encrypt(u8 out[], u8 const in[], u32 const rk[],
+ * int rounds, int bytes, u8 const iv[])
+ * aes_cbc_cts_decrypt(u8 out[], u8 const in[], u32 const rk[],
+ * int rounds, int bytes, u8 const iv[])
+ */
+
+AES_ENTRY(aes_cbc_cts_encrypt)
+ adr_l x8, .Lcts_permute_table
+ sub x4, x4, #16
+ add x9, x8, #32
+ add x8, x8, x4
+ sub x9, x9, x4
+ ld1 {v3.16b}, [x8]
+ ld1 {v4.16b}, [x9]
+
+ ld1 {v0.16b}, [x1], x4 /* overlapping loads */
+ ld1 {v1.16b}, [x1]
+
+ ld1 {v5.16b}, [x5] /* get iv */
+ enc_prepare w3, x2, x6
+
+ eor v0.16b, v0.16b, v5.16b /* xor with iv */
+ tbl v1.16b, {v1.16b}, v4.16b
+ encrypt_block v0, w3, x2, x6, w7
+
+ eor v1.16b, v1.16b, v0.16b
+ tbl v0.16b, {v0.16b}, v3.16b
+ encrypt_block v1, w3, x2, x6, w7
+
+ add x4, x0, x4
+ st1 {v0.16b}, [x4] /* overlapping stores */
+ st1 {v1.16b}, [x0]
+ ret
+AES_ENDPROC(aes_cbc_cts_encrypt)
+
+AES_ENTRY(aes_cbc_cts_decrypt)
+ adr_l x8, .Lcts_permute_table
+ sub x4, x4, #16
+ add x9, x8, #32
+ add x8, x8, x4
+ sub x9, x9, x4
+ ld1 {v3.16b}, [x8]
+ ld1 {v4.16b}, [x9]
+
+ ld1 {v0.16b}, [x1], x4 /* overlapping loads */
+ ld1 {v1.16b}, [x1]
+
+ ld1 {v5.16b}, [x5] /* get iv */
+ dec_prepare w3, x2, x6
+
+ tbl v2.16b, {v1.16b}, v4.16b
+ decrypt_block v0, w3, x2, x6, w7
+ eor v2.16b, v2.16b, v0.16b
+
+ tbx v0.16b, {v1.16b}, v4.16b
+ tbl v2.16b, {v2.16b}, v3.16b
+ decrypt_block v0, w3, x2, x6, w7
+ eor v0.16b, v0.16b, v5.16b /* xor with iv */
+
+ add x4, x0, x4
+ st1 {v2.16b}, [x4] /* overlapping stores */
+ st1 {v0.16b}, [x0]
+ ret
+AES_ENDPROC(aes_cbc_cts_decrypt)
+
+ .section ".rodata", "a"
+ .align 6
+.Lcts_permute_table:
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+ .byte 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7
+ .byte 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+ .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+ .previous
+
+
/*
* aes_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
* int blocks, u8 ctr[])
@@ -253,7 +331,6 @@ AES_ENTRY(aes_ctr_encrypt)
ins v4.d[0], x7
b .Lctrcarrydone
AES_ENDPROC(aes_ctr_encrypt)
- .ltorg
/*
--
2.18.0
Cortex-A53 @ 1 GHz
BEFORE:
testing speed of async cts(cbc(aes)) (cts(cbc-aes-ce)) encryption
0 (128 bit key, 16 byte blocks): 1407866 ops in 1 secs ( 22525856 bytes)
1 (128 bit key, 64 byte blocks): 466814 ops in 1 secs ( 29876096 bytes)
2 (128 bit key, 256 byte blocks): 401023 ops in 1 secs (102661888 bytes)
3 (128 bit key, 1024 byte blocks): 258238 ops in 1 secs (264435712 bytes)
4 (128 bit key, 8192 byte blocks): 57905 ops in 1 secs (474357760 bytes)
5 (192 bit key, 16 byte blocks): 1388333 ops in 1 secs ( 22213328 bytes)
6 (192 bit key, 64 byte blocks): 448595 ops in 1 secs ( 28710080 bytes)
7 (192 bit key, 256 byte blocks): 376951 ops in 1 secs ( 96499456 bytes)
8 (192 bit key, 1024 byte blocks): 231635 ops in 1 secs (237194240 bytes)
9 (192 bit key, 8192 byte blocks): 43345 ops in 1 secs (355082240 bytes)
10 (256 bit key, 16 byte blocks): 1370820 ops in 1 secs ( 21933120 bytes)
11 (256 bit key, 64 byte blocks): 452352 ops in 1 secs ( 28950528 bytes)
12 (256 bit key, 256 byte blocks): 376506 ops in 1 secs ( 96385536 bytes)
13 (256 bit key, 1024 byte blocks): 223219 ops in 1 secs (228576256 bytes)
14 (256 bit key, 8192 byte blocks): 44874 ops in 1 secs (367607808 bytes)
testing speed of async cts(cbc(aes)) (cts(cbc-aes-ce)) decryption
0 (128 bit key, 16 byte blocks): 1402795 ops in 1 secs ( 22444720 bytes)
1 (128 bit key, 64 byte blocks): 403300 ops in 1 secs ( 25811200 bytes)
2 (128 bit key, 256 byte blocks): 367710 ops in 1 secs ( 94133760 bytes)
3 (128 bit key, 1024 byte blocks): 269118 ops in 1 secs (275576832 bytes)
4 (128 bit key, 8192 byte blocks): 74706 ops in 1 secs (611991552 bytes)
5 (192 bit key, 16 byte blocks): 1381390 ops in 1 secs ( 22102240 bytes)
6 (192 bit key, 64 byte blocks): 388555 ops in 1 secs ( 24867520 bytes)
7 (192 bit key, 256 byte blocks): 350282 ops in 1 secs ( 89672192 bytes)
8 (192 bit key, 1024 byte blocks): 251268 ops in 1 secs (257298432 bytes)
9 (192 bit key, 8192 byte blocks): 56535 ops in 1 secs (463134720 bytes)
10 (256 bit key, 16 byte blocks): 1364334 ops in 1 secs ( 21829344 bytes)
11 (256 bit key, 64 byte blocks): 392610 ops in 1 secs ( 25127040 bytes)
12 (256 bit key, 256 byte blocks): 351150 ops in 1 secs ( 89894400 bytes)
13 (256 bit key, 1024 byte blocks): 247455 ops in 1 secs (253393920 bytes)
14 (256 bit key, 8192 byte blocks): 62530 ops in 1 secs (512245760 bytes)
AFTER:
testing speed of async cts(cbc(aes)) (cts-cbc-aes-ce) encryption
0 (128 bit key, 16 byte blocks): 1380568 ops in 1 secs ( 22089088 bytes)
1 (128 bit key, 64 byte blocks): 692731 ops in 1 secs ( 44334784 bytes)
2 (128 bit key, 256 byte blocks): 556393 ops in 1 secs (142436608 bytes)
3 (128 bit key, 1024 byte blocks): 314635 ops in 1 secs (322186240 bytes)
4 (128 bit key, 8192 byte blocks): 57550 ops in 1 secs (471449600 bytes)
5 (192 bit key, 16 byte blocks): 1367027 ops in 1 secs ( 21872432 bytes)
6 (192 bit key, 64 byte blocks): 675058 ops in 1 secs ( 43203712 bytes)
7 (192 bit key, 256 byte blocks): 523177 ops in 1 secs (133933312 bytes)
8 (192 bit key, 1024 byte blocks): 279235 ops in 1 secs (285936640 bytes)
9 (192 bit key, 8192 byte blocks): 46316 ops in 1 secs (379420672 bytes)
10 (256 bit key, 16 byte blocks): 1353576 ops in 1 secs ( 21657216 bytes)
11 (256 bit key, 64 byte blocks): 664523 ops in 1 secs ( 42529472 bytes)
12 (256 bit key, 256 byte blocks): 508141 ops in 1 secs (130084096 bytes)
13 (256 bit key, 1024 byte blocks): 264386 ops in 1 secs (270731264 bytes)
14 (256 bit key, 8192 byte blocks): 47224 ops in 1 secs (386859008 bytes)
testing speed of async cts(cbc(aes)) (cts-cbc-aes-ce) decryption
0 (128 bit key, 16 byte blocks): 1388553 ops in 1 secs ( 22216848 bytes)
1 (128 bit key, 64 byte blocks): 688402 ops in 1 secs ( 44057728 bytes)
2 (128 bit key, 256 byte blocks): 589268 ops in 1 secs (150852608 bytes)
3 (128 bit key, 1024 byte blocks): 372238 ops in 1 secs (381171712 bytes)
4 (128 bit key, 8192 byte blocks): 75691 ops in 1 secs (620060672 bytes)
5 (192 bit key, 16 byte blocks): 1366220 ops in 1 secs ( 21859520 bytes)
6 (192 bit key, 64 byte blocks): 666889 ops in 1 secs ( 42680896 bytes)
7 (192 bit key, 256 byte blocks): 561809 ops in 1 secs (143823104 bytes)
8 (192 bit key, 1024 byte blocks): 344117 ops in 1 secs (352375808 bytes)
9 (192 bit key, 8192 byte blocks): 63150 ops in 1 secs (517324800 bytes)
10 (256 bit key, 16 byte blocks): 1349266 ops in 1 secs ( 21588256 bytes)
11 (256 bit key, 64 byte blocks): 661056 ops in 1 secs ( 42307584 bytes)
12 (256 bit key, 256 byte blocks): 550261 ops in 1 secs (140866816 bytes)
13 (256 bit key, 1024 byte blocks): 332947 ops in 1 secs (340937728 bytes)
14 (256 bit key, 8192 byte blocks): 68759 ops in 1 secs (563273728 bytes)
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 4/4] crypto: arm64/aes-blk - improve XTS mask handling
2018-09-10 14:41 ` Ard Biesheuvel
@ 2018-09-10 14:41 ` Ard Biesheuvel
-1 siblings, 0 replies; 14+ messages in thread
From: Ard Biesheuvel @ 2018-09-10 14:41 UTC (permalink / raw)
To: linux-crypto
Cc: Ard Biesheuvel, Theodore Ts'o, herbert, Steve Capper,
Eric Biggers, linux-arm-kernel
The Crypto Extension instantiation of the aes-modes.S collection of
skciphers uses only 15 NEON registers for the round key array, whereas
the pure NEON flavor uses 16 NEON registers for the AES S-box.
This means we have a spare register available that we can use to hold
the XTS mask vector, removing the need to reload it at every iteration
of the inner loop.
Since the pure NEON version does not permit this optimization, tweak
the macros so we can factor out this functionality. Also, replace the
literal load with a short sequence to compose the mask vector.
On Cortex-A53, this results in a ~4% speedup.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
Raw performance numbers after the patch.
arch/arm64/crypto/aes-ce.S | 5 +++
arch/arm64/crypto/aes-modes.S | 40 ++++++++++----------
arch/arm64/crypto/aes-neon.S | 6 +++
3 files changed, 32 insertions(+), 19 deletions(-)
diff --git a/arch/arm64/crypto/aes-ce.S b/arch/arm64/crypto/aes-ce.S
index 623e74ed1c67..143070510809 100644
--- a/arch/arm64/crypto/aes-ce.S
+++ b/arch/arm64/crypto/aes-ce.S
@@ -17,6 +17,11 @@
.arch armv8-a+crypto
+ xtsmask .req v16
+
+ .macro xts_reload_mask, tmp
+ .endm
+
/* preload all round keys */
.macro load_round_keys, rounds, rk
cmp \rounds, #12
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 82931fba53d2..5c0fa7905d24 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -340,17 +340,19 @@ AES_ENDPROC(aes_ctr_encrypt)
* int blocks, u8 const rk2[], u8 iv[], int first)
*/
- .macro next_tweak, out, in, const, tmp
+ .macro next_tweak, out, in, tmp
sshr \tmp\().2d, \in\().2d, #63
- and \tmp\().16b, \tmp\().16b, \const\().16b
+ and \tmp\().16b, \tmp\().16b, xtsmask.16b
add \out\().2d, \in\().2d, \in\().2d
ext \tmp\().16b, \tmp\().16b, \tmp\().16b, #8
eor \out\().16b, \out\().16b, \tmp\().16b
.endm
-.Lxts_mul_x:
-CPU_LE( .quad 1, 0x87 )
-CPU_BE( .quad 0x87, 1 )
+ .macro xts_load_mask, tmp
+ movi xtsmask.2s, #0x1
+ movi \tmp\().2s, #0x87
+ uzp1 xtsmask.4s, xtsmask.4s, \tmp\().4s
+ .endm
AES_ENTRY(aes_xts_encrypt)
stp x29, x30, [sp, #-16]!
@@ -362,24 +364,24 @@ AES_ENTRY(aes_xts_encrypt)
enc_prepare w3, x5, x8
encrypt_block v4, w3, x5, x8, w7 /* first tweak */
enc_switch_key w3, x2, x8
- ldr q7, .Lxts_mul_x
+ xts_load_mask v8
b .LxtsencNx
.Lxtsencnotfirst:
enc_prepare w3, x2, x8
.LxtsencloopNx:
- ldr q7, .Lxts_mul_x
- next_tweak v4, v4, v7, v8
+ xts_reload_mask v8
+ next_tweak v4, v4, v8
.LxtsencNx:
subs w4, w4, #4
bmi .Lxtsenc1x
ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
- next_tweak v5, v4, v7, v8
+ next_tweak v5, v4, v8
eor v0.16b, v0.16b, v4.16b
- next_tweak v6, v5, v7, v8
+ next_tweak v6, v5, v8
eor v1.16b, v1.16b, v5.16b
eor v2.16b, v2.16b, v6.16b
- next_tweak v7, v6, v7, v8
+ next_tweak v7, v6, v8
eor v3.16b, v3.16b, v7.16b
bl aes_encrypt_block4x
eor v3.16b, v3.16b, v7.16b
@@ -401,7 +403,7 @@ AES_ENTRY(aes_xts_encrypt)
st1 {v0.16b}, [x0], #16
subs w4, w4, #1
beq .Lxtsencout
- next_tweak v4, v4, v7, v8
+ next_tweak v4, v4, v8
b .Lxtsencloop
.Lxtsencout:
st1 {v4.16b}, [x6]
@@ -420,24 +422,24 @@ AES_ENTRY(aes_xts_decrypt)
enc_prepare w3, x5, x8
encrypt_block v4, w3, x5, x8, w7 /* first tweak */
dec_prepare w3, x2, x8
- ldr q7, .Lxts_mul_x
+ xts_load_mask v8
b .LxtsdecNx
.Lxtsdecnotfirst:
dec_prepare w3, x2, x8
.LxtsdecloopNx:
- ldr q7, .Lxts_mul_x
- next_tweak v4, v4, v7, v8
+ xts_reload_mask v8
+ next_tweak v4, v4, v8
.LxtsdecNx:
subs w4, w4, #4
bmi .Lxtsdec1x
ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
- next_tweak v5, v4, v7, v8
+ next_tweak v5, v4, v8
eor v0.16b, v0.16b, v4.16b
- next_tweak v6, v5, v7, v8
+ next_tweak v6, v5, v8
eor v1.16b, v1.16b, v5.16b
eor v2.16b, v2.16b, v6.16b
- next_tweak v7, v6, v7, v8
+ next_tweak v7, v6, v8
eor v3.16b, v3.16b, v7.16b
bl aes_decrypt_block4x
eor v3.16b, v3.16b, v7.16b
@@ -459,7 +461,7 @@ AES_ENTRY(aes_xts_decrypt)
st1 {v0.16b}, [x0], #16
subs w4, w4, #1
beq .Lxtsdecout
- next_tweak v4, v4, v7, v8
+ next_tweak v4, v4, v8
b .Lxtsdecloop
.Lxtsdecout:
st1 {v4.16b}, [x6]
diff --git a/arch/arm64/crypto/aes-neon.S b/arch/arm64/crypto/aes-neon.S
index 1c7b45b7268e..29100f692e8a 100644
--- a/arch/arm64/crypto/aes-neon.S
+++ b/arch/arm64/crypto/aes-neon.S
@@ -14,6 +14,12 @@
#define AES_ENTRY(func) ENTRY(neon_ ## func)
#define AES_ENDPROC(func) ENDPROC(neon_ ## func)
+ xtsmask .req v7
+
+ .macro xts_reload_mask, tmp
+ xts_load_mask \tmp
+ .endm
+
/* multiply by polynomial 'x' in GF(2^8) */
.macro mul_by_x, out, in, temp, const
sshr \temp, \in, #7
--
2.18.0
Cortex-A53 @ 1 GHz
BEFORE:
testing speed of async xts(aes) (xts-aes-ce) encryption
0 (256 bit key, 16 byte blocks): 1338059 ops in 1 secs ( 21408944 bytes)
1 (256 bit key, 64 byte blocks): 1249191 ops in 1 secs ( 79948224 bytes)
2 (256 bit key, 256 byte blocks): 918979 ops in 1 secs (235258624 bytes)
3 (256 bit key, 1024 byte blocks): 456993 ops in 1 secs (467960832 bytes)
4 (256 bit key, 8192 byte blocks): 74937 ops in 1 secs (613883904 bytes)
5 (512 bit key, 16 byte blocks): 1269281 ops in 1 secs ( 20308496 bytes)
6 (512 bit key, 64 byte blocks): 1176362 ops in 1 secs ( 75287168 bytes)
7 (512 bit key, 256 byte blocks): 840553 ops in 1 secs (215181568 bytes)
8 (512 bit key, 1024 byte blocks): 400329 ops in 1 secs (409936896 bytes)
9 (512 bit key, 8192 byte blocks): 64268 ops in 1 secs (526483456 bytes)
testing speed of async xts(aes) (xts-aes-ce) decryption
0 (256 bit key, 16 byte blocks): 1333819 ops in 1 secs ( 21341104 bytes)
1 (256 bit key, 64 byte blocks): 1239393 ops in 1 secs ( 79321152 bytes)
2 (256 bit key, 256 byte blocks): 913715 ops in 1 secs (233911040 bytes)
3 (256 bit key, 1024 byte blocks): 455176 ops in 1 secs (466100224 bytes)
4 (256 bit key, 8192 byte blocks): 74343 ops in 1 secs (609017856 bytes)
5 (512 bit key, 16 byte blocks): 1274941 ops in 1 secs ( 20399056 bytes)
6 (512 bit key, 64 byte blocks): 1182107 ops in 1 secs ( 75654848 bytes)
7 (512 bit key, 256 byte blocks): 844930 ops in 1 secs (216302080 bytes)
8 (512 bit key, 1024 byte blocks): 401614 ops in 1 secs (411252736 bytes)
9 (512 bit key, 8192 byte blocks): 63913 ops in 1 secs (523575296 bytes)
AFTER:
testing speed of async xts(aes) (xts-aes-ce) encryption
0 (256 bit key, 16 byte blocks): 1398063 ops in 1 secs ( 22369008 bytes)
1 (256 bit key, 64 byte blocks): 1302694 ops in 1 secs ( 83372416 bytes)
2 (256 bit key, 256 byte blocks): 951692 ops in 1 secs (243633152 bytes)
3 (256 bit key, 1024 byte blocks): 473198 ops in 1 secs (484554752 bytes)
4 (256 bit key, 8192 byte blocks): 77204 ops in 1 secs (632455168 bytes)
5 (512 bit key, 16 byte blocks): 1323582 ops in 1 secs ( 21177312 bytes)
6 (512 bit key, 64 byte blocks): 1222306 ops in 1 secs ( 78227584 bytes)
7 (512 bit key, 256 byte blocks): 871791 ops in 1 secs (223178496 bytes)
8 (512 bit key, 1024 byte blocks): 413557 ops in 1 secs (423482368 bytes)
9 (512 bit key, 8192 byte blocks): 66014 ops in 1 secs (540786688 bytes)
testing speed of async xts(aes) (xts-aes-ce) decryption
0 (256 bit key, 16 byte blocks): 1399388 ops in 1 secs ( 22390208 bytes)
1 (256 bit key, 64 byte blocks): 1300861 ops in 1 secs ( 83255104 bytes)
2 (256 bit key, 256 byte blocks): 951950 ops in 1 secs (243699200 bytes)
3 (256 bit key, 1024 byte blocks): 473399 ops in 1 secs (484760576 bytes)
4 (256 bit key, 8192 byte blocks): 77168 ops in 1 secs (632160256 bytes)
5 (512 bit key, 16 byte blocks): 1317833 ops in 1 secs ( 21085328 bytes)
6 (512 bit key, 64 byte blocks): 1217145 ops in 1 secs ( 77897280 bytes)
7 (512 bit key, 256 byte blocks): 868323 ops in 1 secs (222290688 bytes)
8 (512 bit key, 1024 byte blocks): 412821 ops in 1 secs (422728704 bytes)
9 (512 bit key, 8192 byte blocks): 65919 ops in 1 secs (540008448 bytes)
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 4/4] crypto: arm64/aes-blk - improve XTS mask handling
@ 2018-09-10 14:41 ` Ard Biesheuvel
0 siblings, 0 replies; 14+ messages in thread
From: Ard Biesheuvel @ 2018-09-10 14:41 UTC (permalink / raw)
To: linux-arm-kernel
The Crypto Extension instantiation of the aes-modes.S collection of
skciphers uses only 15 NEON registers for the round key array, whereas
the pure NEON flavor uses 16 NEON registers for the AES S-box.
This means we have a spare register available that we can use to hold
the XTS mask vector, removing the need to reload it at every iteration
of the inner loop.
Since the pure NEON version does not permit this optimization, tweak
the macros so we can factor out this functionality. Also, replace the
literal load with a short sequence to compose the mask vector.
On Cortex-A53, this results in a ~4% speedup.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
Raw performance numbers after the patch.
arch/arm64/crypto/aes-ce.S | 5 +++
arch/arm64/crypto/aes-modes.S | 40 ++++++++++----------
arch/arm64/crypto/aes-neon.S | 6 +++
3 files changed, 32 insertions(+), 19 deletions(-)
diff --git a/arch/arm64/crypto/aes-ce.S b/arch/arm64/crypto/aes-ce.S
index 623e74ed1c67..143070510809 100644
--- a/arch/arm64/crypto/aes-ce.S
+++ b/arch/arm64/crypto/aes-ce.S
@@ -17,6 +17,11 @@
.arch armv8-a+crypto
+ xtsmask .req v16
+
+ .macro xts_reload_mask, tmp
+ .endm
+
/* preload all round keys */
.macro load_round_keys, rounds, rk
cmp \rounds, #12
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 82931fba53d2..5c0fa7905d24 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -340,17 +340,19 @@ AES_ENDPROC(aes_ctr_encrypt)
* int blocks, u8 const rk2[], u8 iv[], int first)
*/
- .macro next_tweak, out, in, const, tmp
+ .macro next_tweak, out, in, tmp
sshr \tmp\().2d, \in\().2d, #63
- and \tmp\().16b, \tmp\().16b, \const\().16b
+ and \tmp\().16b, \tmp\().16b, xtsmask.16b
add \out\().2d, \in\().2d, \in\().2d
ext \tmp\().16b, \tmp\().16b, \tmp\().16b, #8
eor \out\().16b, \out\().16b, \tmp\().16b
.endm
-.Lxts_mul_x:
-CPU_LE( .quad 1, 0x87 )
-CPU_BE( .quad 0x87, 1 )
+ .macro xts_load_mask, tmp
+ movi xtsmask.2s, #0x1
+ movi \tmp\().2s, #0x87
+ uzp1 xtsmask.4s, xtsmask.4s, \tmp\().4s
+ .endm
AES_ENTRY(aes_xts_encrypt)
stp x29, x30, [sp, #-16]!
@@ -362,24 +364,24 @@ AES_ENTRY(aes_xts_encrypt)
enc_prepare w3, x5, x8
encrypt_block v4, w3, x5, x8, w7 /* first tweak */
enc_switch_key w3, x2, x8
- ldr q7, .Lxts_mul_x
+ xts_load_mask v8
b .LxtsencNx
.Lxtsencnotfirst:
enc_prepare w3, x2, x8
.LxtsencloopNx:
- ldr q7, .Lxts_mul_x
- next_tweak v4, v4, v7, v8
+ xts_reload_mask v8
+ next_tweak v4, v4, v8
.LxtsencNx:
subs w4, w4, #4
bmi .Lxtsenc1x
ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
- next_tweak v5, v4, v7, v8
+ next_tweak v5, v4, v8
eor v0.16b, v0.16b, v4.16b
- next_tweak v6, v5, v7, v8
+ next_tweak v6, v5, v8
eor v1.16b, v1.16b, v5.16b
eor v2.16b, v2.16b, v6.16b
- next_tweak v7, v6, v7, v8
+ next_tweak v7, v6, v8
eor v3.16b, v3.16b, v7.16b
bl aes_encrypt_block4x
eor v3.16b, v3.16b, v7.16b
@@ -401,7 +403,7 @@ AES_ENTRY(aes_xts_encrypt)
st1 {v0.16b}, [x0], #16
subs w4, w4, #1
beq .Lxtsencout
- next_tweak v4, v4, v7, v8
+ next_tweak v4, v4, v8
b .Lxtsencloop
.Lxtsencout:
st1 {v4.16b}, [x6]
@@ -420,24 +422,24 @@ AES_ENTRY(aes_xts_decrypt)
enc_prepare w3, x5, x8
encrypt_block v4, w3, x5, x8, w7 /* first tweak */
dec_prepare w3, x2, x8
- ldr q7, .Lxts_mul_x
+ xts_load_mask v8
b .LxtsdecNx
.Lxtsdecnotfirst:
dec_prepare w3, x2, x8
.LxtsdecloopNx:
- ldr q7, .Lxts_mul_x
- next_tweak v4, v4, v7, v8
+ xts_reload_mask v8
+ next_tweak v4, v4, v8
.LxtsdecNx:
subs w4, w4, #4
bmi .Lxtsdec1x
ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
- next_tweak v5, v4, v7, v8
+ next_tweak v5, v4, v8
eor v0.16b, v0.16b, v4.16b
- next_tweak v6, v5, v7, v8
+ next_tweak v6, v5, v8
eor v1.16b, v1.16b, v5.16b
eor v2.16b, v2.16b, v6.16b
- next_tweak v7, v6, v7, v8
+ next_tweak v7, v6, v8
eor v3.16b, v3.16b, v7.16b
bl aes_decrypt_block4x
eor v3.16b, v3.16b, v7.16b
@@ -459,7 +461,7 @@ AES_ENTRY(aes_xts_decrypt)
st1 {v0.16b}, [x0], #16
subs w4, w4, #1
beq .Lxtsdecout
- next_tweak v4, v4, v7, v8
+ next_tweak v4, v4, v8
b .Lxtsdecloop
.Lxtsdecout:
st1 {v4.16b}, [x6]
diff --git a/arch/arm64/crypto/aes-neon.S b/arch/arm64/crypto/aes-neon.S
index 1c7b45b7268e..29100f692e8a 100644
--- a/arch/arm64/crypto/aes-neon.S
+++ b/arch/arm64/crypto/aes-neon.S
@@ -14,6 +14,12 @@
#define AES_ENTRY(func) ENTRY(neon_ ## func)
#define AES_ENDPROC(func) ENDPROC(neon_ ## func)
+ xtsmask .req v7
+
+ .macro xts_reload_mask, tmp
+ xts_load_mask \tmp
+ .endm
+
/* multiply by polynomial 'x' in GF(2^8) */
.macro mul_by_x, out, in, temp, const
sshr \temp, \in, #7
--
2.18.0
Cortex-A53 @ 1 GHz
BEFORE:
testing speed of async xts(aes) (xts-aes-ce) encryption
0 (256 bit key, 16 byte blocks): 1338059 ops in 1 secs ( 21408944 bytes)
1 (256 bit key, 64 byte blocks): 1249191 ops in 1 secs ( 79948224 bytes)
2 (256 bit key, 256 byte blocks): 918979 ops in 1 secs (235258624 bytes)
3 (256 bit key, 1024 byte blocks): 456993 ops in 1 secs (467960832 bytes)
4 (256 bit key, 8192 byte blocks): 74937 ops in 1 secs (613883904 bytes)
5 (512 bit key, 16 byte blocks): 1269281 ops in 1 secs ( 20308496 bytes)
6 (512 bit key, 64 byte blocks): 1176362 ops in 1 secs ( 75287168 bytes)
7 (512 bit key, 256 byte blocks): 840553 ops in 1 secs (215181568 bytes)
8 (512 bit key, 1024 byte blocks): 400329 ops in 1 secs (409936896 bytes)
9 (512 bit key, 8192 byte blocks): 64268 ops in 1 secs (526483456 bytes)
testing speed of async xts(aes) (xts-aes-ce) decryption
0 (256 bit key, 16 byte blocks): 1333819 ops in 1 secs ( 21341104 bytes)
1 (256 bit key, 64 byte blocks): 1239393 ops in 1 secs ( 79321152 bytes)
2 (256 bit key, 256 byte blocks): 913715 ops in 1 secs (233911040 bytes)
3 (256 bit key, 1024 byte blocks): 455176 ops in 1 secs (466100224 bytes)
4 (256 bit key, 8192 byte blocks): 74343 ops in 1 secs (609017856 bytes)
5 (512 bit key, 16 byte blocks): 1274941 ops in 1 secs ( 20399056 bytes)
6 (512 bit key, 64 byte blocks): 1182107 ops in 1 secs ( 75654848 bytes)
7 (512 bit key, 256 byte blocks): 844930 ops in 1 secs (216302080 bytes)
8 (512 bit key, 1024 byte blocks): 401614 ops in 1 secs (411252736 bytes)
9 (512 bit key, 8192 byte blocks): 63913 ops in 1 secs (523575296 bytes)
AFTER:
testing speed of async xts(aes) (xts-aes-ce) encryption
0 (256 bit key, 16 byte blocks): 1398063 ops in 1 secs ( 22369008 bytes)
1 (256 bit key, 64 byte blocks): 1302694 ops in 1 secs ( 83372416 bytes)
2 (256 bit key, 256 byte blocks): 951692 ops in 1 secs (243633152 bytes)
3 (256 bit key, 1024 byte blocks): 473198 ops in 1 secs (484554752 bytes)
4 (256 bit key, 8192 byte blocks): 77204 ops in 1 secs (632455168 bytes)
5 (512 bit key, 16 byte blocks): 1323582 ops in 1 secs ( 21177312 bytes)
6 (512 bit key, 64 byte blocks): 1222306 ops in 1 secs ( 78227584 bytes)
7 (512 bit key, 256 byte blocks): 871791 ops in 1 secs (223178496 bytes)
8 (512 bit key, 1024 byte blocks): 413557 ops in 1 secs (423482368 bytes)
9 (512 bit key, 8192 byte blocks): 66014 ops in 1 secs (540786688 bytes)
testing speed of async xts(aes) (xts-aes-ce) decryption
0 (256 bit key, 16 byte blocks): 1399388 ops in 1 secs ( 22390208 bytes)
1 (256 bit key, 64 byte blocks): 1300861 ops in 1 secs ( 83255104 bytes)
2 (256 bit key, 256 byte blocks): 951950 ops in 1 secs (243699200 bytes)
3 (256 bit key, 1024 byte blocks): 473399 ops in 1 secs (484760576 bytes)
4 (256 bit key, 8192 byte blocks): 77168 ops in 1 secs (632160256 bytes)
5 (512 bit key, 16 byte blocks): 1317833 ops in 1 secs ( 21085328 bytes)
6 (512 bit key, 64 byte blocks): 1217145 ops in 1 secs ( 77897280 bytes)
7 (512 bit key, 256 byte blocks): 868323 ops in 1 secs (222290688 bytes)
8 (512 bit key, 1024 byte blocks): 412821 ops in 1 secs (422728704 bytes)
9 (512 bit key, 8192 byte blocks): 65919 ops in 1 secs (540008448 bytes)
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 0/4] crypto: arm64/aes-blk - cleanups and optimizations for XTS/CTS-CBC
2018-09-10 14:41 ` Ard Biesheuvel
@ 2018-09-20 14:13 ` Ard Biesheuvel
-1 siblings, 0 replies; 14+ messages in thread
From: Ard Biesheuvel @ 2018-09-20 14:13 UTC (permalink / raw)
To: open list:HARDWARE RANDOM NUMBER GENERATOR CORE
Cc: Ard Biesheuvel, Theodore Ts'o, Herbert Xu, Steve Capper,
Eric Biggers, linux-arm-kernel
On 10 September 2018 at 07:41, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> Some cleanups and optimizations for the arm64 AES skcipher routines.
>
> Patch #1 fixes the peculiar use of u8 arrays to refer to AES round keys,
> which are natively arrays of u32.
>
> Patch #2 partially reverts the use of NEON yield calls, which is not
> needed for skciphers.
>
> Patch #3 adds support for cts(cbc(aes)) in the NEON chaining mode handling.
>
> Patch #4 tweaks the XTS handling to remove a literal load from the inner
> loop.
>
> Cc: Eric Biggers <ebiggers@google.com>
> Cc: Theodore Ts'o <tytso@mit.edu>
> Cc: Steve Capper <steve.capper@arm.com>
>
> Ard Biesheuvel (4):
> crypto: arm64/aes-blk - remove pointless (u8 *) casts
> crypto: arm64/aes-blk - revert NEON yield for skciphers
> crypto: arm64/aes-blk - add support for CTS-CBC mode
> crypto: aes/arm64-blk - improve XTS mask handling
>
> arch/arm64/crypto/aes-ce.S | 5 +
> arch/arm64/crypto/aes-glue.c | 212 +++++++++--
> arch/arm64/crypto/aes-modes.S | 400 ++++++++++----------
> arch/arm64/crypto/aes-neon.S | 6 +
> 4 files changed, 406 insertions(+), 217 deletions(-)
>
Eric, any thoughts on this?
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 0/4] crypto: arm64/aes-blk - cleanups and optimizations for XTS/CTS-CBC
@ 2018-09-20 14:13 ` Ard Biesheuvel
0 siblings, 0 replies; 14+ messages in thread
From: Ard Biesheuvel @ 2018-09-20 14:13 UTC (permalink / raw)
To: linux-arm-kernel
On 10 September 2018 at 07:41, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> Some cleanups and optimizations for the arm64 AES skcipher routines.
>
> Patch #1 fixes the peculiar use of u8 arrays to refer to AES round keys,
> which are natively arrays of u32.
>
> Patch #2 partially reverts the use of NEON yield calls, which is not
> needed for skciphers.
>
> Patch #3 adds support for cts(cbc(aes)) in the NEON chaining mode handling.
>
> Patch #4 tweaks the XTS handling to remove a literal load from the inner
> loop.
>
> Cc: Eric Biggers <ebiggers@google.com>
> Cc: Theodore Ts'o <tytso@mit.edu>
> Cc: Steve Capper <steve.capper@arm.com>
>
> Ard Biesheuvel (4):
> crypto: arm64/aes-blk - remove pointless (u8 *) casts
> crypto: arm64/aes-blk - revert NEON yield for skciphers
> crypto: arm64/aes-blk - add support for CTS-CBC mode
> crypto: aes/arm64-blk - improve XTS mask handling
>
> arch/arm64/crypto/aes-ce.S | 5 +
> arch/arm64/crypto/aes-glue.c | 212 +++++++++--
> arch/arm64/crypto/aes-modes.S | 400 ++++++++++----------
> arch/arm64/crypto/aes-neon.S | 6 +
> 4 files changed, 406 insertions(+), 217 deletions(-)
>
Eric, any thoughts on this?
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 0/4] crypto: arm64/aes-blk - cleanups and optimizations for XTS/CTS-CBC
2018-09-10 14:41 ` Ard Biesheuvel
@ 2018-09-21 5:44 ` Herbert Xu
-1 siblings, 0 replies; 14+ messages in thread
From: Herbert Xu @ 2018-09-21 5:44 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Steve Capper, Theodore Ts'o, linux-crypto, linux-arm-kernel,
Eric Biggers
On Mon, Sep 10, 2018 at 04:41:11PM +0200, Ard Biesheuvel wrote:
> Some cleanups and optimizations for the arm64 AES skcipher routines.
>
> Patch #1 fixes the peculiar use of u8 arrays to refer to AES round keys,
> which are natively arrays of u32.
>
> Patch #2 partially reverts the use of NEON yield calls, which is not
> needed for skciphers.
>
> Patch #3 adds support for cts(cbc(aes)) in the NEON chaining mode handling.
>
> Patch #4 tweaks the XTS handling to remove a literal load from the inner
> loop.
>
> Cc: Eric Biggers <ebiggers@google.com>
> Cc: Theodore Ts'o <tytso@mit.edu>
> Cc: Steve Capper <steve.capper@arm.com>
>
> Ard Biesheuvel (4):
> crypto: arm64/aes-blk - remove pointless (u8 *) casts
> crypto: arm64/aes-blk - revert NEON yield for skciphers
> crypto: arm64/aes-blk - add support for CTS-CBC mode
> crypto: aes/arm64-blk - improve XTS mask handling
>
> arch/arm64/crypto/aes-ce.S | 5 +
> arch/arm64/crypto/aes-glue.c | 212 +++++++++--
> arch/arm64/crypto/aes-modes.S | 400 ++++++++++----------
> arch/arm64/crypto/aes-neon.S | 6 +
> 4 files changed, 406 insertions(+), 217 deletions(-)
All applied. Thanks.
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 0/4] crypto: arm64/aes-blk - cleanups and optimizations for XTS/CTS-CBC
@ 2018-09-21 5:44 ` Herbert Xu
0 siblings, 0 replies; 14+ messages in thread
From: Herbert Xu @ 2018-09-21 5:44 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Sep 10, 2018 at 04:41:11PM +0200, Ard Biesheuvel wrote:
> Some cleanups and optimizations for the arm64 AES skcipher routines.
>
> Patch #1 fixes the peculiar use of u8 arrays to refer to AES round keys,
> which are natively arrays of u32.
>
> Patch #2 partially reverts the use of NEON yield calls, which is not
> needed for skciphers.
>
> Patch #3 adds support for cts(cbc(aes)) in the NEON chaining mode handling.
>
> Patch #4 tweaks the XTS handling to remove a literal load from the inner
> loop.
>
> Cc: Eric Biggers <ebiggers@google.com>
> Cc: Theodore Ts'o <tytso@mit.edu>
> Cc: Steve Capper <steve.capper@arm.com>
>
> Ard Biesheuvel (4):
> crypto: arm64/aes-blk - remove pointless (u8 *) casts
> crypto: arm64/aes-blk - revert NEON yield for skciphers
> crypto: arm64/aes-blk - add support for CTS-CBC mode
> crypto: aes/arm64-blk - improve XTS mask handling
>
> arch/arm64/crypto/aes-ce.S | 5 +
> arch/arm64/crypto/aes-glue.c | 212 +++++++++--
> arch/arm64/crypto/aes-modes.S | 400 ++++++++++----------
> arch/arm64/crypto/aes-neon.S | 6 +
> 4 files changed, 406 insertions(+), 217 deletions(-)
All applied. Thanks.
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2018-09-21 5:44 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-10 14:41 [PATCH 0/4] crypto: arm64/aes-blk - cleanups and optimizations for XTS/CTS-CBC Ard Biesheuvel
2018-09-10 14:41 ` Ard Biesheuvel
2018-09-10 14:41 ` [PATCH 1/4] crypto: arm64/aes-blk - remove pointless (u8 *) casts Ard Biesheuvel
2018-09-10 14:41 ` Ard Biesheuvel
2018-09-10 14:41 ` [PATCH 2/4] crypto: arm64/aes-blk - revert NEON yield for skciphers Ard Biesheuvel
2018-09-10 14:41 ` Ard Biesheuvel
2018-09-10 14:41 ` [PATCH 3/4] crypto: arm64/aes-blk - add support for CTS-CBC mode Ard Biesheuvel
2018-09-10 14:41 ` Ard Biesheuvel
2018-09-10 14:41 ` [PATCH 4/4] crypto: arm64/aes-blk - improve XTS mask handling Ard Biesheuvel
2018-09-10 14:41 ` Ard Biesheuvel
2018-09-20 14:13 ` [PATCH 0/4] crypto: arm64/aes-blk - cleanups and optimizations for XTS/CTS-CBC Ard Biesheuvel
2018-09-20 14:13 ` Ard Biesheuvel
2018-09-21 5:44 ` Herbert Xu
2018-09-21 5:44 ` Herbert Xu
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.