[PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2)

linux-crypto.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2)
@ 2017-01-23 14:05 Ard Biesheuvel
  2017-01-23 14:05 ` [PATCH v2 01/10] crypto: arm64/aes-neon-bs - honour iv_out requirement in CTR mode Ard Biesheuvel
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
  To: linux-crypto, herbert; +Cc: Ard Biesheuvel

Patch #1 is a fix for the CBC chaining issue that was discussed on the
mailing list. The driver itself is queued for v4.11, so this fix can go
right on top.

Patches #2 - #6 clear the cra_alignmasks of various drivers: all NEON
capable CPUs can perform unaligned accesses, and the advantage of using
the slightly faster aligned accessors (which only exist on ARM not arm64)
is certainly outweighed by the cost of copying data to suitably aligned
buffers.

NOTE: patch #5 won't apply unless 'crypto: arm64/aes-blk - honour iv_out
requirement in CBC and CTR modes' is applied first, which was sent out
separately as a bugfix for v3.16 - v4.9. If this is a problem, this patch
can wait.

Patch #7 and #8 are minor tweaks to the new scalar AES code.

Patch #9 improves the performance of the plain NEON AES code, to make it
more suitable as a fallback for the new bitsliced NEON code, which can
only operate on 8 blocks in parallel, and needs another driver to perform
CBC encryption or XTS tweak generation.

Patch #10 updates the new bitsliced AES NEON code to switch to the plain
NEON driver as a fallback.

Patches #9 and #10 improve the performance of CBC encryption by ~35% on
low end cores such as the Cortex-A53 that can be found in the Raspberry Pi3

Changes since v1:
- shave off another few cycles from the sequential AES NEON code (patch #9)

Ard Biesheuvel (10):
  crypto: arm64/aes-neon-bs - honour iv_out requirement in CTR mode
  crypto: arm/aes-ce - remove cra_alignmask
  crypto: arm/chacha20 - remove cra_alignmask
  crypto: arm64/aes-ce-ccm - remove cra_alignmask
  crypto: arm64/aes-blk - remove cra_alignmask
  crypto: arm64/chacha20 - remove cra_alignmask
  crypto: arm64/aes - avoid literals for cross-module symbol references
  crypto: arm64/aes - performance tweak
  crypto: arm64/aes-neon-blk - tweak performance for low end cores
  crypto: arm64/aes - replace scalar fallback with plain NEON fallback

 arch/arm/crypto/aes-ce-core.S          |  84 ++++----
 arch/arm/crypto/aes-ce-glue.c          |  15 +-
 arch/arm/crypto/chacha20-neon-glue.c   |   1 -
 arch/arm64/crypto/Kconfig              |   2 +-
 arch/arm64/crypto/aes-ce-ccm-glue.c    |   1 -
 arch/arm64/crypto/aes-cipher-core.S    |  59 ++----
 arch/arm64/crypto/aes-glue.c           |  18 +-
 arch/arm64/crypto/aes-modes.S          |   8 +-
 arch/arm64/crypto/aes-neon.S           | 210 ++++++++------------
 arch/arm64/crypto/aes-neonbs-core.S    |  25 ++-
 arch/arm64/crypto/aes-neonbs-glue.c    |  38 +++-
 arch/arm64/crypto/chacha20-neon-glue.c |   1 -
 12 files changed, 203 insertions(+), 259 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 01/10] crypto: arm64/aes-neon-bs - honour iv_out requirement in CTR mode
  2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
@ 2017-01-23 14:05 ` Ard Biesheuvel
  2017-01-23 14:05 ` [PATCH v2 02/10] crypto: arm/aes-ce - remove cra_alignmask Ard Biesheuvel
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
  To: linux-crypto, herbert; +Cc: Ard Biesheuvel

Update the new bitsliced NEON AES implementation in CTR mode to return
the next IV back to the skcipher API client. This is necessary for
chaining to work correctly.

Note that this is only done if the request is a round multiple of the
block size, since otherwise, chaining is impossible anyway.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/aes-neonbs-core.S | 25 +++++++++++++-------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/crypto/aes-neonbs-core.S b/arch/arm64/crypto/aes-neonbs-core.S
index 8d0cdaa2768d..2ada12dd768e 100644
--- a/arch/arm64/crypto/aes-neonbs-core.S
+++ b/arch/arm64/crypto/aes-neonbs-core.S
@@ -874,12 +874,19 @@ CPU_LE(	rev		x8, x8		)
 	csel		x4, x4, xzr, pl
 	csel		x9, x9, xzr, le
 
+	tbnz		x9, #1, 0f
 	next_ctr	v1
+	tbnz		x9, #2, 0f
 	next_ctr	v2
+	tbnz		x9, #3, 0f
 	next_ctr	v3
+	tbnz		x9, #4, 0f
 	next_ctr	v4
+	tbnz		x9, #5, 0f
 	next_ctr	v5
+	tbnz		x9, #6, 0f
 	next_ctr	v6
+	tbnz		x9, #7, 0f
 	next_ctr	v7
 
 0:	mov		bskey, x2
@@ -928,11 +935,11 @@ CPU_LE(	rev		x8, x8		)
 	eor		v5.16b, v5.16b, v15.16b
 	st1		{v5.16b}, [x0], #16
 
-	next_ctr	v0
+8:	next_ctr	v0
 	cbnz		x4, 99b
 
 0:	st1		{v0.16b}, [x5]
-8:	ldp		x29, x30, [sp], #16
+9:	ldp		x29, x30, [sp], #16
 	ret
 
 	/*
@@ -941,23 +948,23 @@ CPU_LE(	rev		x8, x8		)
 	 */
 1:	cbz		x6, 8b
 	st1		{v1.16b}, [x5]
-	b		8b
+	b		9b
 2:	cbz		x6, 8b
 	st1		{v4.16b}, [x5]
-	b		8b
+	b		9b
 3:	cbz		x6, 8b
 	st1		{v6.16b}, [x5]
-	b		8b
+	b		9b
 4:	cbz		x6, 8b
 	st1		{v3.16b}, [x5]
-	b		8b
+	b		9b
 5:	cbz		x6, 8b
 	st1		{v7.16b}, [x5]
-	b		8b
+	b		9b
 6:	cbz		x6, 8b
 	st1		{v2.16b}, [x5]
-	b		8b
+	b		9b
 7:	cbz		x6, 8b
 	st1		{v5.16b}, [x5]
-	b		8b
+	b		9b
 ENDPROC(aesbs_ctr_encrypt)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v2 02/10] crypto: arm/aes-ce - remove cra_alignmask
  2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
  2017-01-23 14:05 ` [PATCH v2 01/10] crypto: arm64/aes-neon-bs - honour iv_out requirement in CTR mode Ard Biesheuvel
@ 2017-01-23 14:05 ` Ard Biesheuvel
  2017-01-23 14:05 ` [PATCH v2 03/10] crypto: arm/chacha20 " Ard Biesheuvel
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
  To: linux-crypto, herbert; +Cc: Ard Biesheuvel

Remove the unnecessary alignmask: it is much more efficient to deal with
the misalignment in the core algorithm than relying on the crypto API to
copy the data to a suitably aligned buffer.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm/crypto/aes-ce-core.S | 84 ++++++++++----------
 arch/arm/crypto/aes-ce-glue.c | 15 ++--
 2 files changed, 47 insertions(+), 52 deletions(-)

diff --git a/arch/arm/crypto/aes-ce-core.S b/arch/arm/crypto/aes-ce-core.S
index 987aa632c9f0..ba8e6a32fdc9 100644
--- a/arch/arm/crypto/aes-ce-core.S
+++ b/arch/arm/crypto/aes-ce-core.S
@@ -169,19 +169,19 @@ ENTRY(ce_aes_ecb_encrypt)
 .Lecbencloop3x:
 	subs		r4, r4, #3
 	bmi		.Lecbenc1x
-	vld1.8		{q0-q1}, [r1, :64]!
-	vld1.8		{q2}, [r1, :64]!
+	vld1.8		{q0-q1}, [r1]!
+	vld1.8		{q2}, [r1]!
 	bl		aes_encrypt_3x
-	vst1.8		{q0-q1}, [r0, :64]!
-	vst1.8		{q2}, [r0, :64]!
+	vst1.8		{q0-q1}, [r0]!
+	vst1.8		{q2}, [r0]!
 	b		.Lecbencloop3x
 .Lecbenc1x:
 	adds		r4, r4, #3
 	beq		.Lecbencout
 .Lecbencloop:
-	vld1.8		{q0}, [r1, :64]!
+	vld1.8		{q0}, [r1]!
 	bl		aes_encrypt
-	vst1.8		{q0}, [r0, :64]!
+	vst1.8		{q0}, [r0]!
 	subs		r4, r4, #1
 	bne		.Lecbencloop
 .Lecbencout:
@@ -195,19 +195,19 @@ ENTRY(ce_aes_ecb_decrypt)
 .Lecbdecloop3x:
 	subs		r4, r4, #3
 	bmi		.Lecbdec1x
-	vld1.8		{q0-q1}, [r1, :64]!
-	vld1.8		{q2}, [r1, :64]!
+	vld1.8		{q0-q1}, [r1]!
+	vld1.8		{q2}, [r1]!
 	bl		aes_decrypt_3x
-	vst1.8		{q0-q1}, [r0, :64]!
-	vst1.8		{q2}, [r0, :64]!
+	vst1.8		{q0-q1}, [r0]!
+	vst1.8		{q2}, [r0]!
 	b		.Lecbdecloop3x
 .Lecbdec1x:
 	adds		r4, r4, #3
 	beq		.Lecbdecout
 .Lecbdecloop:
-	vld1.8		{q0}, [r1, :64]!
+	vld1.8		{q0}, [r1]!
 	bl		aes_decrypt
-	vst1.8		{q0}, [r0, :64]!
+	vst1.8		{q0}, [r0]!
 	subs		r4, r4, #1
 	bne		.Lecbdecloop
 .Lecbdecout:
@@ -226,10 +226,10 @@ ENTRY(ce_aes_cbc_encrypt)
 	vld1.8		{q0}, [r5]
 	prepare_key	r2, r3
 .Lcbcencloop:
-	vld1.8		{q1}, [r1, :64]!	@ get next pt block
+	vld1.8		{q1}, [r1]!		@ get next pt block
 	veor		q0, q0, q1		@ ..and xor with iv
 	bl		aes_encrypt
-	vst1.8		{q0}, [r0, :64]!
+	vst1.8		{q0}, [r0]!
 	subs		r4, r4, #1
 	bne		.Lcbcencloop
 	vst1.8		{q0}, [r5]
@@ -244,8 +244,8 @@ ENTRY(ce_aes_cbc_decrypt)
 .Lcbcdecloop3x:
 	subs		r4, r4, #3
 	bmi		.Lcbcdec1x
-	vld1.8		{q0-q1}, [r1, :64]!
-	vld1.8		{q2}, [r1, :64]!
+	vld1.8		{q0-q1}, [r1]!
+	vld1.8		{q2}, [r1]!
 	vmov		q3, q0
 	vmov		q4, q1
 	vmov		q5, q2
@@ -254,19 +254,19 @@ ENTRY(ce_aes_cbc_decrypt)
 	veor		q1, q1, q3
 	veor		q2, q2, q4
 	vmov		q6, q5
-	vst1.8		{q0-q1}, [r0, :64]!
-	vst1.8		{q2}, [r0, :64]!
+	vst1.8		{q0-q1}, [r0]!
+	vst1.8		{q2}, [r0]!
 	b		.Lcbcdecloop3x
 .Lcbcdec1x:
 	adds		r4, r4, #3
 	beq		.Lcbcdecout
 	vmov		q15, q14		@ preserve last round key
 .Lcbcdecloop:
-	vld1.8		{q0}, [r1, :64]!	@ get next ct block
+	vld1.8		{q0}, [r1]!		@ get next ct block
 	veor		q14, q15, q6		@ combine prev ct with last key
 	vmov		q6, q0
 	bl		aes_decrypt
-	vst1.8		{q0}, [r0, :64]!
+	vst1.8		{q0}, [r0]!
 	subs		r4, r4, #1
 	bne		.Lcbcdecloop
 .Lcbcdecout:
@@ -300,15 +300,15 @@ ENTRY(ce_aes_ctr_encrypt)
 	rev		ip, r6
 	add		r6, r6, #1
 	vmov		s11, ip
-	vld1.8		{q3-q4}, [r1, :64]!
-	vld1.8		{q5}, [r1, :64]!
+	vld1.8		{q3-q4}, [r1]!
+	vld1.8		{q5}, [r1]!
 	bl		aes_encrypt_3x
 	veor		q0, q0, q3
 	veor		q1, q1, q4
 	veor		q2, q2, q5
 	rev		ip, r6
-	vst1.8		{q0-q1}, [r0, :64]!
-	vst1.8		{q2}, [r0, :64]!
+	vst1.8		{q0-q1}, [r0]!
+	vst1.8		{q2}, [r0]!
 	vmov		s27, ip
 	b		.Lctrloop3x
 .Lctr1x:
@@ -318,10 +318,10 @@ ENTRY(ce_aes_ctr_encrypt)
 	vmov		q0, q6
 	bl		aes_encrypt
 	subs		r4, r4, #1
-	bmi		.Lctrhalfblock		@ blocks < 0 means 1/2 block
-	vld1.8		{q3}, [r1, :64]!
+	bmi		.Lctrtailblock		@ blocks < 0 means tail block
+	vld1.8		{q3}, [r1]!
 	veor		q3, q0, q3
-	vst1.8		{q3}, [r0, :64]!
+	vst1.8		{q3}, [r0]!
 
 	adds		r6, r6, #1		@ increment BE ctr
 	rev		ip, r6
@@ -333,10 +333,8 @@ ENTRY(ce_aes_ctr_encrypt)
 	vst1.8		{q6}, [r5]
 	pop		{r4-r6, pc}
 
-.Lctrhalfblock:
-	vld1.8		{d1}, [r1, :64]
-	veor		d0, d0, d1
-	vst1.8		{d0}, [r0, :64]
+.Lctrtailblock:
+	vst1.8		{q0}, [r0, :64]		@ return just the key stream
 	pop		{r4-r6, pc}
 
 .Lctrcarry:
@@ -405,8 +403,8 @@ ENTRY(ce_aes_xts_encrypt)
 .Lxtsenc3x:
 	subs		r4, r4, #3
 	bmi		.Lxtsenc1x
-	vld1.8		{q0-q1}, [r1, :64]!	@ get 3 pt blocks
-	vld1.8		{q2}, [r1, :64]!
+	vld1.8		{q0-q1}, [r1]!		@ get 3 pt blocks
+	vld1.8		{q2}, [r1]!
 	next_tweak	q4, q3, q7, q6
 	veor		q0, q0, q3
 	next_tweak	q5, q4, q7, q6
@@ -416,8 +414,8 @@ ENTRY(ce_aes_xts_encrypt)
 	veor		q0, q0, q3
 	veor		q1, q1, q4
 	veor		q2, q2, q5
-	vst1.8		{q0-q1}, [r0, :64]!	@ write 3 ct blocks
-	vst1.8		{q2}, [r0, :64]!
+	vst1.8		{q0-q1}, [r0]!		@ write 3 ct blocks
+	vst1.8		{q2}, [r0]!
 	vmov		q3, q5
 	teq		r4, #0
 	beq		.Lxtsencout
@@ -426,11 +424,11 @@ ENTRY(ce_aes_xts_encrypt)
 	adds		r4, r4, #3
 	beq		.Lxtsencout
 .Lxtsencloop:
-	vld1.8		{q0}, [r1, :64]!
+	vld1.8		{q0}, [r1]!
 	veor		q0, q0, q3
 	bl		aes_encrypt
 	veor		q0, q0, q3
-	vst1.8		{q0}, [r0, :64]!
+	vst1.8		{q0}, [r0]!
 	subs		r4, r4, #1
 	beq		.Lxtsencout
 	next_tweak	q3, q3, q7, q6
@@ -456,8 +454,8 @@ ENTRY(ce_aes_xts_decrypt)
 .Lxtsdec3x:
 	subs		r4, r4, #3
 	bmi		.Lxtsdec1x
-	vld1.8		{q0-q1}, [r1, :64]!	@ get 3 ct blocks
-	vld1.8		{q2}, [r1, :64]!
+	vld1.8		{q0-q1}, [r1]!		@ get 3 ct blocks
+	vld1.8		{q2}, [r1]!
 	next_tweak	q4, q3, q7, q6
 	veor		q0, q0, q3
 	next_tweak	q5, q4, q7, q6
@@ -467,8 +465,8 @@ ENTRY(ce_aes_xts_decrypt)
 	veor		q0, q0, q3
 	veor		q1, q1, q4
 	veor		q2, q2, q5
-	vst1.8		{q0-q1}, [r0, :64]!	@ write 3 pt blocks
-	vst1.8		{q2}, [r0, :64]!
+	vst1.8		{q0-q1}, [r0]!		@ write 3 pt blocks
+	vst1.8		{q2}, [r0]!
 	vmov		q3, q5
 	teq		r4, #0
 	beq		.Lxtsdecout
@@ -477,12 +475,12 @@ ENTRY(ce_aes_xts_decrypt)
 	adds		r4, r4, #3
 	beq		.Lxtsdecout
 .Lxtsdecloop:
-	vld1.8		{q0}, [r1, :64]!
+	vld1.8		{q0}, [r1]!
 	veor		q0, q0, q3
 	add		ip, r2, #32		@ 3rd round key
 	bl		aes_decrypt
 	veor		q0, q0, q3
-	vst1.8		{q0}, [r0, :64]!
+	vst1.8		{q0}, [r0]!
 	subs		r4, r4, #1
 	beq		.Lxtsdecout
 	next_tweak	q3, q3, q7, q6
diff --git a/arch/arm/crypto/aes-ce-glue.c b/arch/arm/crypto/aes-ce-glue.c
index 8857531915bf..883b84d828c5 100644
--- a/arch/arm/crypto/aes-ce-glue.c
+++ b/arch/arm/crypto/aes-ce-glue.c
@@ -278,14 +278,15 @@ static int ctr_encrypt(struct skcipher_request *req)
 		u8 *tsrc = walk.src.virt.addr;
 
 		/*
-		 * Minimum alignment is 8 bytes, so if nbytes is <= 8, we need
-		 * to tell aes_ctr_encrypt() to only read half a block.
+		 * Tell aes_ctr_encrypt() to process a tail block.
 		 */
-		blocks = (nbytes <= 8) ? -1 : 1;
+		blocks = -1;
 
-		ce_aes_ctr_encrypt(tail, tsrc, (u8 *)ctx->key_enc,
+		ce_aes_ctr_encrypt(tail, NULL, (u8 *)ctx->key_enc,
 				   num_rounds(ctx), blocks, walk.iv);
-		memcpy(tdst, tail, nbytes);
+		if (tdst != tsrc)
+			memcpy(tdst, tsrc, nbytes);
+		crypto_xor(tdst, tail, nbytes);
 		err = skcipher_walk_done(&walk, 0);
 	}
 	kernel_neon_end();
@@ -345,7 +346,6 @@ static struct skcipher_alg aes_algs[] = { {
 		.cra_flags		= CRYPTO_ALG_INTERNAL,
 		.cra_blocksize		= AES_BLOCK_SIZE,
 		.cra_ctxsize		= sizeof(struct crypto_aes_ctx),
-		.cra_alignmask		= 7,
 		.cra_module		= THIS_MODULE,
 	},
 	.min_keysize	= AES_MIN_KEY_SIZE,
@@ -361,7 +361,6 @@ static struct skcipher_alg aes_algs[] = { {
 		.cra_flags		= CRYPTO_ALG_INTERNAL,
 		.cra_blocksize		= AES_BLOCK_SIZE,
 		.cra_ctxsize		= sizeof(struct crypto_aes_ctx),
-		.cra_alignmask		= 7,
 		.cra_module		= THIS_MODULE,
 	},
 	.min_keysize	= AES_MIN_KEY_SIZE,
@@ -378,7 +377,6 @@ static struct skcipher_alg aes_algs[] = { {
 		.cra_flags		= CRYPTO_ALG_INTERNAL,
 		.cra_blocksize		= 1,
 		.cra_ctxsize		= sizeof(struct crypto_aes_ctx),
-		.cra_alignmask		= 7,
 		.cra_module		= THIS_MODULE,
 	},
 	.min_keysize	= AES_MIN_KEY_SIZE,
@@ -396,7 +394,6 @@ static struct skcipher_alg aes_algs[] = { {
 		.cra_flags		= CRYPTO_ALG_INTERNAL,
 		.cra_blocksize		= AES_BLOCK_SIZE,
 		.cra_ctxsize		= sizeof(struct crypto_aes_xts_ctx),
-		.cra_alignmask		= 7,
 		.cra_module		= THIS_MODULE,
 	},
 	.min_keysize	= 2 * AES_MIN_KEY_SIZE,
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v2 03/10] crypto: arm/chacha20 - remove cra_alignmask
  2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
  2017-01-23 14:05 ` [PATCH v2 01/10] crypto: arm64/aes-neon-bs - honour iv_out requirement in CTR mode Ard Biesheuvel
  2017-01-23 14:05 ` [PATCH v2 02/10] crypto: arm/aes-ce - remove cra_alignmask Ard Biesheuvel
@ 2017-01-23 14:05 ` Ard Biesheuvel
  2017-01-23 14:05 ` [PATCH v2 04/10] crypto: arm64/aes-ce-ccm " Ard Biesheuvel
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
  To: linux-crypto, herbert; +Cc: Ard Biesheuvel

Remove the unnecessary alignmask: it is much more efficient to deal with
the misalignment in the core algorithm than relying on the crypto API to
copy the data to a suitably aligned buffer.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm/crypto/chacha20-neon-glue.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/arm/crypto/chacha20-neon-glue.c b/arch/arm/crypto/chacha20-neon-glue.c
index 592f75ae4fa1..59a7be08e80c 100644
--- a/arch/arm/crypto/chacha20-neon-glue.c
+++ b/arch/arm/crypto/chacha20-neon-glue.c
@@ -94,7 +94,6 @@ static struct skcipher_alg alg = {
 	.base.cra_priority	= 300,
 	.base.cra_blocksize	= 1,
 	.base.cra_ctxsize	= sizeof(struct chacha20_ctx),
-	.base.cra_alignmask	= 1,
 	.base.cra_module	= THIS_MODULE,
 
 	.min_keysize		= CHACHA20_KEY_SIZE,
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v2 04/10] crypto: arm64/aes-ce-ccm - remove cra_alignmask
  2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
                   ` (2 preceding siblings ...)
  2017-01-23 14:05 ` [PATCH v2 03/10] crypto: arm/chacha20 " Ard Biesheuvel
@ 2017-01-23 14:05 ` Ard Biesheuvel
  2017-01-23 14:05 ` [PATCH v2 05/10] crypto: arm64/aes-blk " Ard Biesheuvel
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
  To: linux-crypto, herbert; +Cc: Ard Biesheuvel

Remove the unnecessary alignmask: it is much more efficient to deal with
the misalignment in the core algorithm than relying on the crypto API to
copy the data to a suitably aligned buffer.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/aes-ce-ccm-glue.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
index cc5515dac74a..6a7dbc7c83a6 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -258,7 +258,6 @@ static struct aead_alg ccm_aes_alg = {
 		.cra_priority		= 300,
 		.cra_blocksize		= 1,
 		.cra_ctxsize		= sizeof(struct crypto_aes_ctx),
-		.cra_alignmask		= 7,
 		.cra_module		= THIS_MODULE,
 	},
 	.ivsize		= AES_BLOCK_SIZE,
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v2 05/10] crypto: arm64/aes-blk - remove cra_alignmask
  2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
                   ` (3 preceding siblings ...)
  2017-01-23 14:05 ` [PATCH v2 04/10] crypto: arm64/aes-ce-ccm " Ard Biesheuvel
@ 2017-01-23 14:05 ` Ard Biesheuvel
  2017-01-23 14:05 ` [PATCH v2 06/10] crypto: arm64/chacha20 " Ard Biesheuvel
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
  To: linux-crypto, herbert; +Cc: Ard Biesheuvel

Remove the unnecessary alignmask: it is much more efficient to deal with
the misalignment in the core algorithm than relying on the crypto API to
copy the data to a suitably aligned buffer.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/aes-glue.c  | 16 ++++++----------
 arch/arm64/crypto/aes-modes.S |  8 +++-----
 2 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 5164aaf82c6a..8ee1fb7aaa4f 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -215,14 +215,15 @@ static int ctr_encrypt(struct skcipher_request *req)
 		u8 *tsrc = walk.src.virt.addr;
 
 		/*
-		 * Minimum alignment is 8 bytes, so if nbytes is <= 8, we need
-		 * to tell aes_ctr_encrypt() to only read half a block.
+		 * Tell aes_ctr_encrypt() to process a tail block.
 		 */
-		blocks = (nbytes <= 8) ? -1 : 1;
+		blocks = -1;
 
-		aes_ctr_encrypt(tail, tsrc, (u8 *)ctx->key_enc, rounds,
+		aes_ctr_encrypt(tail, NULL, (u8 *)ctx->key_enc, rounds,
 				blocks, walk.iv, first);
-		memcpy(tdst, tail, nbytes);
+		if (tdst != tsrc)
+			memcpy(tdst, tsrc, nbytes);
+		crypto_xor(tdst, tail, nbytes);
 		err = skcipher_walk_done(&walk, 0);
 	}
 	kernel_neon_end();
@@ -282,7 +283,6 @@ static struct skcipher_alg aes_algs[] = { {
 		.cra_flags		= CRYPTO_ALG_INTERNAL,
 		.cra_blocksize		= AES_BLOCK_SIZE,
 		.cra_ctxsize		= sizeof(struct crypto_aes_ctx),
-		.cra_alignmask		= 7,
 		.cra_module		= THIS_MODULE,
 	},
 	.min_keysize	= AES_MIN_KEY_SIZE,
@@ -298,7 +298,6 @@ static struct skcipher_alg aes_algs[] = { {
 		.cra_flags		= CRYPTO_ALG_INTERNAL,
 		.cra_blocksize		= AES_BLOCK_SIZE,
 		.cra_ctxsize		= sizeof(struct crypto_aes_ctx),
-		.cra_alignmask		= 7,
 		.cra_module		= THIS_MODULE,
 	},
 	.min_keysize	= AES_MIN_KEY_SIZE,
@@ -315,7 +314,6 @@ static struct skcipher_alg aes_algs[] = { {
 		.cra_flags		= CRYPTO_ALG_INTERNAL,
 		.cra_blocksize		= 1,
 		.cra_ctxsize		= sizeof(struct crypto_aes_ctx),
-		.cra_alignmask		= 7,
 		.cra_module		= THIS_MODULE,
 	},
 	.min_keysize	= AES_MIN_KEY_SIZE,
@@ -332,7 +330,6 @@ static struct skcipher_alg aes_algs[] = { {
 		.cra_priority		= PRIO - 1,
 		.cra_blocksize		= 1,
 		.cra_ctxsize		= sizeof(struct crypto_aes_ctx),
-		.cra_alignmask		= 7,
 		.cra_module		= THIS_MODULE,
 	},
 	.min_keysize	= AES_MIN_KEY_SIZE,
@@ -350,7 +347,6 @@ static struct skcipher_alg aes_algs[] = { {
 		.cra_flags		= CRYPTO_ALG_INTERNAL,
 		.cra_blocksize		= AES_BLOCK_SIZE,
 		.cra_ctxsize		= sizeof(struct crypto_aes_xts_ctx),
-		.cra_alignmask		= 7,
 		.cra_module		= THIS_MODULE,
 	},
 	.min_keysize	= 2 * AES_MIN_KEY_SIZE,
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 838dad5c209f..92b982a8b112 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -337,7 +337,7 @@ AES_ENTRY(aes_ctr_encrypt)
 
 .Lctrcarrydone:
 	subs		w4, w4, #1
-	bmi		.Lctrhalfblock		/* blocks < 0 means 1/2 block */
+	bmi		.Lctrtailblock		/* blocks <0 means tail block */
 	ld1		{v3.16b}, [x1], #16
 	eor		v3.16b, v0.16b, v3.16b
 	st1		{v3.16b}, [x0], #16
@@ -348,10 +348,8 @@ AES_ENTRY(aes_ctr_encrypt)
 	FRAME_POP
 	ret
 
-.Lctrhalfblock:
-	ld1		{v3.8b}, [x1]
-	eor		v3.8b, v0.8b, v3.8b
-	st1		{v3.8b}, [x0]
+.Lctrtailblock:
+	st1		{v0.16b}, [x0]
 	FRAME_POP
 	ret
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v2 06/10] crypto: arm64/chacha20 - remove cra_alignmask
  2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
                   ` (4 preceding siblings ...)
  2017-01-23 14:05 ` [PATCH v2 05/10] crypto: arm64/aes-blk " Ard Biesheuvel
@ 2017-01-23 14:05 ` Ard Biesheuvel
  2017-01-23 14:05 ` [PATCH v2 07/10] crypto: arm64/aes - avoid literals for cross-module symbol references Ard Biesheuvel
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
  To: linux-crypto, herbert; +Cc: Ard Biesheuvel

Remove the unnecessary alignmask: it is much more efficient to deal with
the misalignment in the core algorithm than relying on the crypto API to
copy the data to a suitably aligned buffer.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/chacha20-neon-glue.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/arm64/crypto/chacha20-neon-glue.c b/arch/arm64/crypto/chacha20-neon-glue.c
index a7f2337d46cf..a7cd575ea223 100644
--- a/arch/arm64/crypto/chacha20-neon-glue.c
+++ b/arch/arm64/crypto/chacha20-neon-glue.c
@@ -93,7 +93,6 @@ static struct skcipher_alg alg = {
 	.base.cra_priority	= 300,
 	.base.cra_blocksize	= 1,
 	.base.cra_ctxsize	= sizeof(struct chacha20_ctx),
-	.base.cra_alignmask	= 1,
 	.base.cra_module	= THIS_MODULE,
 
 	.min_keysize		= CHACHA20_KEY_SIZE,
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v2 07/10] crypto: arm64/aes - avoid literals for cross-module symbol references
  2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
                   ` (5 preceding siblings ...)
  2017-01-23 14:05 ` [PATCH v2 06/10] crypto: arm64/chacha20 " Ard Biesheuvel
@ 2017-01-23 14:05 ` Ard Biesheuvel
  2017-01-23 14:05 ` [PATCH v2 08/10] crypto: arm64/aes - performance tweak Ard Biesheuvel
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
  To: linux-crypto, herbert; +Cc: Ard Biesheuvel

Using simple adrp/add pairs to refer to the AES lookup tables exposed by
the generic AES driver (which could be loaded far away from this driver
when KASLR is in effect) was unreliable at module load time before commit
41c066f2c4d4 ("arm64: assembler: make adr_l work in modules under KASLR"),
which is why the AES code used literals instead.

So now we can get rid of the literals, and switch to the adr_l macro.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/aes-cipher-core.S | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/crypto/aes-cipher-core.S b/arch/arm64/crypto/aes-cipher-core.S
index 37590ab8121a..cd58c61e6677 100644
--- a/arch/arm64/crypto/aes-cipher-core.S
+++ b/arch/arm64/crypto/aes-cipher-core.S
@@ -89,8 +89,8 @@ CPU_BE(	rev		w8, w8		)
 	eor		w7, w7, w11
 	eor		w8, w8, w12
 
-	ldr		tt, =\ttab
-	ldr		lt, =\ltab
+	adr_l		tt, \ttab
+	adr_l		lt, \ltab
 
 	tbnz		rounds, #1, 1f
 
@@ -111,9 +111,6 @@ CPU_BE(	rev		w8, w8		)
 	stp		w5, w6, [out]
 	stp		w7, w8, [out, #8]
 	ret
-
-	.align		4
-	.ltorg
 	.endm
 
 	.align		5
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v2 08/10] crypto: arm64/aes - performance tweak
  2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
                   ` (6 preceding siblings ...)
  2017-01-23 14:05 ` [PATCH v2 07/10] crypto: arm64/aes - avoid literals for cross-module symbol references Ard Biesheuvel
@ 2017-01-23 14:05 ` Ard Biesheuvel
  2017-01-23 14:05 ` [PATCH v2 09/10] crypto: arm64/aes-neon-blk - tweak performance for low end cores Ard Biesheuvel
  2017-01-23 14:05 ` [PATCH v2 10/10] crypto: arm64/aes - replace scalar fallback with plain NEON fallback Ard Biesheuvel
  9 siblings, 0 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
  To: linux-crypto, herbert; +Cc: Ard Biesheuvel

Shuffle some instructions around in the __hround macro to shave off
0.1 cycles per byte on Cortex-A57.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/aes-cipher-core.S | 52 +++++++-------------
 1 file changed, 19 insertions(+), 33 deletions(-)

diff --git a/arch/arm64/crypto/aes-cipher-core.S b/arch/arm64/crypto/aes-cipher-core.S
index cd58c61e6677..f2f9cc519309 100644
--- a/arch/arm64/crypto/aes-cipher-core.S
+++ b/arch/arm64/crypto/aes-cipher-core.S
@@ -20,46 +20,32 @@
 	tt		.req	x4
 	lt		.req	x2
 
-	.macro		__hround, out0, out1, in0, in1, in2, in3, t0, t1, enc
-	ldp		\out0, \out1, [rk], #8
-
-	ubfx		w13, \in0, #0, #8
-	ubfx		w14, \in1, #8, #8
-	ldr		w13, [tt, w13, uxtw #2]
-	ldr		w14, [tt, w14, uxtw #2]
-
+	.macro		__pair, enc, reg0, reg1, in0, in1e, in1d, shift
+	ubfx		\reg0, \in0, #\shift, #8
 	.if		\enc
-	ubfx		w17, \in1, #0, #8
-	ubfx		w18, \in2, #8, #8
+	ubfx		\reg1, \in1e, #\shift, #8
 	.else
-	ubfx		w17, \in3, #0, #8
-	ubfx		w18, \in0, #8, #8
+	ubfx		\reg1, \in1d, #\shift, #8
 	.endif
-	ldr		w17, [tt, w17, uxtw #2]
-	ldr		w18, [tt, w18, uxtw #2]
+	ldr		\reg0, [tt, \reg0, uxtw #2]
+	ldr		\reg1, [tt, \reg1, uxtw #2]
+	.endm
 
-	ubfx		w15, \in2, #16, #8
-	ubfx		w16, \in3, #24, #8
-	ldr		w15, [tt, w15, uxtw #2]
-	ldr		w16, [tt, w16, uxtw #2]
+	.macro		__hround, out0, out1, in0, in1, in2, in3, t0, t1, enc
+	ldp		\out0, \out1, [rk], #8
 
-	.if		\enc
-	ubfx		\t0, \in3, #16, #8
-	ubfx		\t1, \in0, #24, #8
-	.else
-	ubfx		\t0, \in1, #16, #8
-	ubfx		\t1, \in2, #24, #8
-	.endif
-	ldr		\t0, [tt, \t0, uxtw #2]
-	ldr		\t1, [tt, \t1, uxtw #2]
+	__pair		\enc, w13, w14, \in0, \in1, \in3, 0
+	__pair		\enc, w15, w16, \in1, \in2, \in0, 8
+	__pair		\enc, w17, w18, \in2, \in3, \in1, 16
+	__pair		\enc, \t0, \t1, \in3, \in0, \in2, 24
 
 	eor		\out0, \out0, w13
-	eor		\out1, \out1, w17
-	eor		\out0, \out0, w14, ror #24
-	eor		\out1, \out1, w18, ror #24
-	eor		\out0, \out0, w15, ror #16
-	eor		\out1, \out1, \t0, ror #16
-	eor		\out0, \out0, w16, ror #8
+	eor		\out1, \out1, w14
+	eor		\out0, \out0, w15, ror #24
+	eor		\out1, \out1, w16, ror #24
+	eor		\out0, \out0, w17, ror #16
+	eor		\out1, \out1, w18, ror #16
+	eor		\out0, \out0, \t0, ror #8
 	eor		\out1, \out1, \t1, ror #8
 	.endm
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v2 09/10] crypto: arm64/aes-neon-blk - tweak performance for low end cores
  2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
                   ` (7 preceding siblings ...)
  2017-01-23 14:05 ` [PATCH v2 08/10] crypto: arm64/aes - performance tweak Ard Biesheuvel
@ 2017-01-23 14:05 ` Ard Biesheuvel
  2017-01-23 14:05 ` [PATCH v2 10/10] crypto: arm64/aes - replace scalar fallback with plain NEON fallback Ard Biesheuvel
  9 siblings, 0 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
  To: linux-crypto, herbert; +Cc: Ard Biesheuvel

The non-bitsliced AES implementation using the NEON is highly sensitive
to micro-architectural details, and, as it turns out, the Cortex-A53 on
the Raspberry Pi 3 is a core that can benefit from this code, given that
its scalar AES performance is abysmal (32.9 cycles per byte).

The new bitsliced AES code manages 19.8 cycles per byte on this core,
but can only operate on 8 blocks at a time, which is not supported by
all chaining modes. With a bit of tweaking, we can get the plain NEON
code to run at 22.0 cycles per byte, making it useful for sequential
modes like CBC encryption. (Like bitsliced NEON, the plain NEON
implementation does not use any lookup tables, which makes it easy on
the D-cache, and invulnerable to cache timing attacks)

So tweak the plain NEON AES code to use tbl instructions rather than
shl/sri pairs, and to avoid the need to reload permutation vectors or
other constants from memory in every round.

To allow the ECB and CBC encrypt routines to be reused by the bitsliced
NEON code in a subsequent patch, export them from the module.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/aes-glue.c |   2 +
 arch/arm64/crypto/aes-neon.S | 210 ++++++++------------
 2 files changed, 81 insertions(+), 131 deletions(-)

diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 8ee1fb7aaa4f..055bc3f61138 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -409,5 +409,7 @@ static int __init aes_init(void)
 module_cpu_feature_match(AES, aes_init);
 #else
 module_init(aes_init);
+EXPORT_SYMBOL(neon_aes_ecb_encrypt);
+EXPORT_SYMBOL(neon_aes_cbc_encrypt);
 #endif
 module_exit(aes_exit);
diff --git a/arch/arm64/crypto/aes-neon.S b/arch/arm64/crypto/aes-neon.S
index 85f07ead7c5c..0d111a6fc229 100644
--- a/arch/arm64/crypto/aes-neon.S
+++ b/arch/arm64/crypto/aes-neon.S
@@ -1,7 +1,7 @@
 /*
  * linux/arch/arm64/crypto/aes-neon.S - AES cipher for ARMv8 NEON
  *
- * Copyright (C) 2013 Linaro Ltd <ard.biesheuvel@linaro.org>
+ * Copyright (C) 2013 - 2017 Linaro Ltd. <ard.biesheuvel@linaro.org>
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
@@ -25,9 +25,9 @@
 	/* preload the entire Sbox */
 	.macro		prepare, sbox, shiftrows, temp
 	adr		\temp, \sbox
-	movi		v12.16b, #0x40
+	movi		v12.16b, #0x1b
 	ldr		q13, \shiftrows
-	movi		v14.16b, #0x1b
+	ldr		q14, .Lror32by8
 	ld1		{v16.16b-v19.16b}, [\temp], #64
 	ld1		{v20.16b-v23.16b}, [\temp], #64
 	ld1		{v24.16b-v27.16b}, [\temp], #64
@@ -50,37 +50,32 @@
 
 	/* apply SubBytes transformation using the the preloaded Sbox */
 	.macro		sub_bytes, in
-	sub		v9.16b, \in\().16b, v12.16b
+	sub		v9.16b, \in\().16b, v15.16b
 	tbl		\in\().16b, {v16.16b-v19.16b}, \in\().16b
-	sub		v10.16b, v9.16b, v12.16b
+	sub		v10.16b, v9.16b, v15.16b
 	tbx		\in\().16b, {v20.16b-v23.16b}, v9.16b
-	sub		v11.16b, v10.16b, v12.16b
+	sub		v11.16b, v10.16b, v15.16b
 	tbx		\in\().16b, {v24.16b-v27.16b}, v10.16b
 	tbx		\in\().16b, {v28.16b-v31.16b}, v11.16b
 	.endm
 
 	/* apply MixColumns transformation */
-	.macro		mix_columns, in
-	mul_by_x	v10.16b, \in\().16b, v9.16b, v14.16b
-	rev32		v8.8h, \in\().8h
-	eor		\in\().16b, v10.16b, \in\().16b
-	shl		v9.4s, v8.4s, #24
-	shl		v11.4s, \in\().4s, #24
-	sri		v9.4s, v8.4s, #8
-	sri		v11.4s, \in\().4s, #8
-	eor		v9.16b, v9.16b, v8.16b
-	eor		v10.16b, v10.16b, v9.16b
-	eor		\in\().16b, v10.16b, v11.16b
-	.endm
-
+	.macro		mix_columns, in, enc
+	.if		\enc == 0
 	/* Inverse MixColumns: pre-multiply by { 5, 0, 4, 0 } */
-	.macro		inv_mix_columns, in
-	mul_by_x	v11.16b, \in\().16b, v10.16b, v14.16b
-	mul_by_x	v11.16b, v11.16b, v10.16b, v14.16b
-	eor		\in\().16b, \in\().16b, v11.16b
-	rev32		v11.8h, v11.8h
-	eor		\in\().16b, \in\().16b, v11.16b
-	mix_columns	\in
+	mul_by_x	v8.16b, \in\().16b, v9.16b, v12.16b
+	mul_by_x	v8.16b, v8.16b, v9.16b, v12.16b
+	eor		\in\().16b, \in\().16b, v8.16b
+	rev32		v8.8h, v8.8h
+	eor		\in\().16b, \in\().16b, v8.16b
+	.endif
+
+	mul_by_x	v9.16b, \in\().16b, v8.16b, v12.16b
+	rev32		v8.8h, \in\().8h
+	eor		v8.16b, v8.16b, v9.16b
+	eor		\in\().16b, \in\().16b, v8.16b
+	tbl		\in\().16b, {\in\().16b}, v14.16b
+	eor		\in\().16b, \in\().16b, v8.16b
 	.endm
 
 	.macro		do_block, enc, in, rounds, rk, rkp, i
@@ -88,16 +83,13 @@
 	add		\rkp, \rk, #16
 	mov		\i, \rounds
 1111:	eor		\in\().16b, \in\().16b, v15.16b		/* ^round key */
+	movi		v15.16b, #0x40
 	tbl		\in\().16b, {\in\().16b}, v13.16b	/* ShiftRows */
 	sub_bytes	\in
-	ld1		{v15.4s}, [\rkp], #16
 	subs		\i, \i, #1
+	ld1		{v15.4s}, [\rkp], #16
 	beq		2222f
-	.if		\enc == 1
-	mix_columns	\in
-	.else
-	inv_mix_columns	\in
-	.endif
+	mix_columns	\in, \enc
 	b		1111b
 2222:	eor		\in\().16b, \in\().16b, v15.16b		/* ^round key */
 	.endm
@@ -116,48 +108,48 @@
 	 */
 
 	.macro		sub_bytes_2x, in0, in1
-	sub		v8.16b, \in0\().16b, v12.16b
-	sub		v9.16b, \in1\().16b, v12.16b
+	sub		v8.16b, \in0\().16b, v15.16b
 	tbl		\in0\().16b, {v16.16b-v19.16b}, \in0\().16b
+	sub		v9.16b, \in1\().16b, v15.16b
 	tbl		\in1\().16b, {v16.16b-v19.16b}, \in1\().16b
-	sub		v10.16b, v8.16b, v12.16b
-	sub		v11.16b, v9.16b, v12.16b
+	sub		v10.16b, v8.16b, v15.16b
 	tbx		\in0\().16b, {v20.16b-v23.16b}, v8.16b
+	sub		v11.16b, v9.16b, v15.16b
 	tbx		\in1\().16b, {v20.16b-v23.16b}, v9.16b
-	sub		v8.16b, v10.16b, v12.16b
-	sub		v9.16b, v11.16b, v12.16b
+	sub		v8.16b, v10.16b, v15.16b
 	tbx		\in0\().16b, {v24.16b-v27.16b}, v10.16b
+	sub		v9.16b, v11.16b, v15.16b
 	tbx		\in1\().16b, {v24.16b-v27.16b}, v11.16b
 	tbx		\in0\().16b, {v28.16b-v31.16b}, v8.16b
 	tbx		\in1\().16b, {v28.16b-v31.16b}, v9.16b
 	.endm
 
 	.macro		sub_bytes_4x, in0, in1, in2, in3
-	sub		v8.16b, \in0\().16b, v12.16b
+	sub		v8.16b, \in0\().16b, v15.16b
 	tbl		\in0\().16b, {v16.16b-v19.16b}, \in0\().16b
-	sub		v9.16b, \in1\().16b, v12.16b
+	sub		v9.16b, \in1\().16b, v15.16b
 	tbl		\in1\().16b, {v16.16b-v19.16b}, \in1\().16b
-	sub		v10.16b, \in2\().16b, v12.16b
+	sub		v10.16b, \in2\().16b, v15.16b
 	tbl		\in2\().16b, {v16.16b-v19.16b}, \in2\().16b
-	sub		v11.16b, \in3\().16b, v12.16b
+	sub		v11.16b, \in3\().16b, v15.16b
 	tbl		\in3\().16b, {v16.16b-v19.16b}, \in3\().16b
 	tbx		\in0\().16b, {v20.16b-v23.16b}, v8.16b
 	tbx		\in1\().16b, {v20.16b-v23.16b}, v9.16b
-	sub		v8.16b, v8.16b, v12.16b
+	sub		v8.16b, v8.16b, v15.16b
 	tbx		\in2\().16b, {v20.16b-v23.16b}, v10.16b
-	sub		v9.16b, v9.16b, v12.16b
+	sub		v9.16b, v9.16b, v15.16b
 	tbx		\in3\().16b, {v20.16b-v23.16b}, v11.16b
-	sub		v10.16b, v10.16b, v12.16b
+	sub		v10.16b, v10.16b, v15.16b
 	tbx		\in0\().16b, {v24.16b-v27.16b}, v8.16b
-	sub		v11.16b, v11.16b, v12.16b
+	sub		v11.16b, v11.16b, v15.16b
 	tbx		\in1\().16b, {v24.16b-v27.16b}, v9.16b
-	sub		v8.16b, v8.16b, v12.16b
+	sub		v8.16b, v8.16b, v15.16b
 	tbx		\in2\().16b, {v24.16b-v27.16b}, v10.16b
-	sub		v9.16b, v9.16b, v12.16b
+	sub		v9.16b, v9.16b, v15.16b
 	tbx		\in3\().16b, {v24.16b-v27.16b}, v11.16b
-	sub		v10.16b, v10.16b, v12.16b
+	sub		v10.16b, v10.16b, v15.16b
 	tbx		\in0\().16b, {v28.16b-v31.16b}, v8.16b
-	sub		v11.16b, v11.16b, v12.16b
+	sub		v11.16b, v11.16b, v15.16b
 	tbx		\in1\().16b, {v28.16b-v31.16b}, v9.16b
 	tbx		\in2\().16b, {v28.16b-v31.16b}, v10.16b
 	tbx		\in3\().16b, {v28.16b-v31.16b}, v11.16b
@@ -174,59 +166,30 @@
 	eor		\out1\().16b, \out1\().16b, \tmp1\().16b
 	.endm
 
-	.macro		mix_columns_2x, in0, in1
-	mul_by_x_2x	v8, v9, \in0, \in1, v10, v11, v14
-	rev32		v10.8h, \in0\().8h
-	rev32		v11.8h, \in1\().8h
-	eor		\in0\().16b, v8.16b, \in0\().16b
-	eor		\in1\().16b, v9.16b, \in1\().16b
-	shl		v12.4s, v10.4s, #24
-	shl		v13.4s, v11.4s, #24
-	eor		v8.16b, v8.16b, v10.16b
-	sri		v12.4s, v10.4s, #8
-	shl		v10.4s, \in0\().4s, #24
-	eor		v9.16b, v9.16b, v11.16b
-	sri		v13.4s, v11.4s, #8
-	shl		v11.4s, \in1\().4s, #24
-	sri		v10.4s, \in0\().4s, #8
-	eor		\in0\().16b, v8.16b, v12.16b
-	sri		v11.4s, \in1\().4s, #8
-	eor		\in1\().16b, v9.16b, v13.16b
-	eor		\in0\().16b, v10.16b, \in0\().16b
-	eor		\in1\().16b, v11.16b, \in1\().16b
-	.endm
-
-	.macro		inv_mix_cols_2x, in0, in1
-	mul_by_x_2x	v8, v9, \in0, \in1, v10, v11, v14
-	mul_by_x_2x	v8, v9, v8, v9, v10, v11, v14
+	.macro		mix_columns_2x, in0, in1, enc
+	.if		\enc == 0
+	/* Inverse MixColumns: pre-multiply by { 5, 0, 4, 0 } */
+	mul_by_x_2x	v8, v9, \in0, \in1, v10, v11, v12
+	mul_by_x_2x	v8, v9, v8, v9, v10, v11, v12
 	eor		\in0\().16b, \in0\().16b, v8.16b
-	eor		\in1\().16b, \in1\().16b, v9.16b
 	rev32		v8.8h, v8.8h
-	rev32		v9.8h, v9.8h
-	eor		\in0\().16b, \in0\().16b, v8.16b
 	eor		\in1\().16b, \in1\().16b, v9.16b
-	mix_columns_2x	\in0, \in1
-	.endm
-
-	.macro		inv_mix_cols_4x, in0, in1, in2, in3
-	mul_by_x_2x	v8, v9, \in0, \in1, v10, v11, v14
-	mul_by_x_2x	v10, v11, \in2, \in3, v12, v13, v14
-	mul_by_x_2x	v8, v9, v8, v9, v12, v13, v14
-	mul_by_x_2x	v10, v11, v10, v11, v12, v13, v14
-	eor		\in0\().16b, \in0\().16b, v8.16b
-	eor		\in1\().16b, \in1\().16b, v9.16b
-	eor		\in2\().16b, \in2\().16b, v10.16b
-	eor		\in3\().16b, \in3\().16b, v11.16b
-	rev32		v8.8h, v8.8h
 	rev32		v9.8h, v9.8h
-	rev32		v10.8h, v10.8h
-	rev32		v11.8h, v11.8h
 	eor		\in0\().16b, \in0\().16b, v8.16b
 	eor		\in1\().16b, \in1\().16b, v9.16b
-	eor		\in2\().16b, \in2\().16b, v10.16b
-	eor		\in3\().16b, \in3\().16b, v11.16b
-	mix_columns_2x	\in0, \in1
-	mix_columns_2x	\in2, \in3
+	.endif
+
+	mul_by_x_2x	v8, v9, \in0, \in1, v10, v11, v12
+	rev32		v10.8h, \in0\().8h
+	rev32		v11.8h, \in1\().8h
+	eor		v10.16b, v10.16b, v8.16b
+	eor		v11.16b, v11.16b, v9.16b
+	eor		\in0\().16b, \in0\().16b, v10.16b
+	eor		\in1\().16b, \in1\().16b, v11.16b
+	tbl		\in0\().16b, {\in0\().16b}, v14.16b
+	tbl		\in1\().16b, {\in1\().16b}, v14.16b
+	eor		\in0\().16b, \in0\().16b, v10.16b
+	eor		\in1\().16b, \in1\().16b, v11.16b
 	.endm
 
 	.macro		do_block_2x, enc, in0, in1 rounds, rk, rkp, i
@@ -235,20 +198,14 @@
 	mov		\i, \rounds
 1111:	eor		\in0\().16b, \in0\().16b, v15.16b	/* ^round key */
 	eor		\in1\().16b, \in1\().16b, v15.16b	/* ^round key */
-	sub_bytes_2x	\in0, \in1
+	movi		v15.16b, #0x40
 	tbl		\in0\().16b, {\in0\().16b}, v13.16b	/* ShiftRows */
 	tbl		\in1\().16b, {\in1\().16b}, v13.16b	/* ShiftRows */
-	ld1		{v15.4s}, [\rkp], #16
+	sub_bytes_2x	\in0, \in1
 	subs		\i, \i, #1
+	ld1		{v15.4s}, [\rkp], #16
 	beq		2222f
-	.if		\enc == 1
-	mix_columns_2x	\in0, \in1
-	ldr		q13, .LForward_ShiftRows
-	.else
-	inv_mix_cols_2x	\in0, \in1
-	ldr		q13, .LReverse_ShiftRows
-	.endif
-	movi		v12.16b, #0x40
+	mix_columns_2x	\in0, \in1, \enc
 	b		1111b
 2222:	eor		\in0\().16b, \in0\().16b, v15.16b	/* ^round key */
 	eor		\in1\().16b, \in1\().16b, v15.16b	/* ^round key */
@@ -262,23 +219,17 @@
 	eor		\in1\().16b, \in1\().16b, v15.16b	/* ^round key */
 	eor		\in2\().16b, \in2\().16b, v15.16b	/* ^round key */
 	eor		\in3\().16b, \in3\().16b, v15.16b	/* ^round key */
-	sub_bytes_4x	\in0, \in1, \in2, \in3
+	movi		v15.16b, #0x40
 	tbl		\in0\().16b, {\in0\().16b}, v13.16b	/* ShiftRows */
 	tbl		\in1\().16b, {\in1\().16b}, v13.16b	/* ShiftRows */
 	tbl		\in2\().16b, {\in2\().16b}, v13.16b	/* ShiftRows */
 	tbl		\in3\().16b, {\in3\().16b}, v13.16b	/* ShiftRows */
-	ld1		{v15.4s}, [\rkp], #16
+	sub_bytes_4x	\in0, \in1, \in2, \in3
 	subs		\i, \i, #1
+	ld1		{v15.4s}, [\rkp], #16
 	beq		2222f
-	.if		\enc == 1
-	mix_columns_2x	\in0, \in1
-	mix_columns_2x	\in2, \in3
-	ldr		q13, .LForward_ShiftRows
-	.else
-	inv_mix_cols_4x	\in0, \in1, \in2, \in3
-	ldr		q13, .LReverse_ShiftRows
-	.endif
-	movi		v12.16b, #0x40
+	mix_columns_2x	\in0, \in1, \enc
+	mix_columns_2x	\in2, \in3, \enc
 	b		1111b
 2222:	eor		\in0\().16b, \in0\().16b, v15.16b	/* ^round key */
 	eor		\in1\().16b, \in1\().16b, v15.16b	/* ^round key */
@@ -305,19 +256,7 @@
 #include "aes-modes.S"
 
 	.text
-	.align		4
-.LForward_ShiftRows:
-CPU_LE(	.byte		0x0, 0x5, 0xa, 0xf, 0x4, 0x9, 0xe, 0x3	)
-CPU_LE(	.byte		0x8, 0xd, 0x2, 0x7, 0xc, 0x1, 0x6, 0xb	)
-CPU_BE(	.byte		0xb, 0x6, 0x1, 0xc, 0x7, 0x2, 0xd, 0x8	)
-CPU_BE(	.byte		0x3, 0xe, 0x9, 0x4, 0xf, 0xa, 0x5, 0x0	)
-
-.LReverse_ShiftRows:
-CPU_LE(	.byte		0x0, 0xd, 0xa, 0x7, 0x4, 0x1, 0xe, 0xb	)
-CPU_LE(	.byte		0x8, 0x5, 0x2, 0xf, 0xc, 0x9, 0x6, 0x3	)
-CPU_BE(	.byte		0x3, 0x6, 0x9, 0xc, 0xf, 0x2, 0x5, 0x8	)
-CPU_BE(	.byte		0xb, 0xe, 0x1, 0x4, 0x7, 0xa, 0xd, 0x0	)
-
+	.align		6
 .LForward_Sbox:
 	.byte		0x63, 0x7c, 0x77, 0x7b, 0xf2, 0x6b, 0x6f, 0xc5
 	.byte		0x30, 0x01, 0x67, 0x2b, 0xfe, 0xd7, 0xab, 0x76
@@ -385,3 +324,12 @@ CPU_BE(	.byte		0xb, 0xe, 0x1, 0x4, 0x7, 0xa, 0xd, 0x0	)
 	.byte		0xc8, 0xeb, 0xbb, 0x3c, 0x83, 0x53, 0x99, 0x61
 	.byte		0x17, 0x2b, 0x04, 0x7e, 0xba, 0x77, 0xd6, 0x26
 	.byte		0xe1, 0x69, 0x14, 0x63, 0x55, 0x21, 0x0c, 0x7d
+
+.LForward_ShiftRows:
+	.octa		0x0b06010c07020d08030e09040f0a0500
+
+.LReverse_ShiftRows:
+	.octa		0x0306090c0f0205080b0e0104070a0d00
+
+.Lror32by8:
+	.octa		0x0c0f0e0d080b0a090407060500030201
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v2 10/10] crypto: arm64/aes - replace scalar fallback with plain NEON fallback
  2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
                   ` (8 preceding siblings ...)
  2017-01-23 14:05 ` [PATCH v2 09/10] crypto: arm64/aes-neon-blk - tweak performance for low end cores Ard Biesheuvel
@ 2017-01-23 14:05 ` Ard Biesheuvel
  9 siblings, 0 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2017-01-23 14:05 UTC (permalink / raw)
  To: linux-crypto, herbert; +Cc: Ard Biesheuvel

The new bitsliced NEON implementation of AES uses a fallback in two
places: CBC encryption (which is strictly sequential, whereas this
driver can only operate efficiently on 8 blocks at a time), and the
XTS tweak generation, which involves encrypting a single AES block
with a different key schedule.

The plain (i.e., non-bitsliced) NEON code is more suitable as a fallback,
given that it is faster than scalar on low end cores (which is what
the NEON implementations target, since high end cores have dedicated
instructions for AES), and shows similar behavior in terms of D-cache
footprint and sensitivity to cache timing attacks. So switch the fallback
handling to the plain NEON driver.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/Kconfig           |  2 +-
 arch/arm64/crypto/aes-neonbs-glue.c | 38 ++++++++++++++------
 2 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 5de75c3dcbd4..bed7feddfeed 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -86,7 +86,7 @@ config CRYPTO_AES_ARM64_BS
 	tristate "AES in ECB/CBC/CTR/XTS modes using bit-sliced NEON algorithm"
 	depends on KERNEL_MODE_NEON
 	select CRYPTO_BLKCIPHER
-	select CRYPTO_AES_ARM64
+	select CRYPTO_AES_ARM64_NEON_BLK
 	select CRYPTO_SIMD
 
 endif
diff --git a/arch/arm64/crypto/aes-neonbs-glue.c b/arch/arm64/crypto/aes-neonbs-glue.c
index 323dd76ae5f0..863e436ecf89 100644
--- a/arch/arm64/crypto/aes-neonbs-glue.c
+++ b/arch/arm64/crypto/aes-neonbs-glue.c
@@ -10,7 +10,6 @@
 
 #include <asm/neon.h>
 #include <crypto/aes.h>
-#include <crypto/cbc.h>
 #include <crypto/internal/simd.h>
 #include <crypto/internal/skcipher.h>
 #include <crypto/xts.h>
@@ -42,7 +41,12 @@ asmlinkage void aesbs_xts_encrypt(u8 out[], u8 const in[], u8 const rk[],
 asmlinkage void aesbs_xts_decrypt(u8 out[], u8 const in[], u8 const rk[],
 				  int rounds, int blocks, u8 iv[]);
 
-asmlinkage void __aes_arm64_encrypt(u32 *rk, u8 *out, const u8 *in, int rounds);
+/* borrowed from aes-neon-blk.ko */
+asmlinkage void neon_aes_ecb_encrypt(u8 out[], u8 const in[], u32 const rk[],
+				     int rounds, int blocks, int first);
+asmlinkage void neon_aes_cbc_encrypt(u8 out[], u8 const in[], u32 const rk[],
+				     int rounds, int blocks, u8 iv[],
+				     int first);
 
 struct aesbs_ctx {
 	u8	rk[13 * (8 * AES_BLOCK_SIZE) + 32];
@@ -140,16 +144,28 @@ static int aesbs_cbc_setkey(struct crypto_skcipher *tfm, const u8 *in_key,
 	return 0;
 }
 
-static void cbc_encrypt_one(struct crypto_skcipher *tfm, const u8 *src, u8 *dst)
+static int cbc_encrypt(struct skcipher_request *req)
 {
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	struct aesbs_cbc_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct skcipher_walk walk;
+	int err, first = 1;
 
-	__aes_arm64_encrypt(ctx->enc, dst, src, ctx->key.rounds);
-}
+	err = skcipher_walk_virt(&walk, req, true);
 
-static int cbc_encrypt(struct skcipher_request *req)
-{
-	return crypto_cbc_encrypt_walk(req, cbc_encrypt_one);
+	kernel_neon_begin();
+	while (walk.nbytes >= AES_BLOCK_SIZE) {
+		unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
+
+		/* fall back to the non-bitsliced NEON implementation */
+		neon_aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
+				     ctx->enc, ctx->key.rounds, blocks, walk.iv,
+				     first);
+		err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
+		first = 0;
+	}
+	kernel_neon_end();
+	return err;
 }
 
 static int cbc_decrypt(struct skcipher_request *req)
@@ -254,9 +270,11 @@ static int __xts_crypt(struct skcipher_request *req,
 
 	err = skcipher_walk_virt(&walk, req, true);
 
-	__aes_arm64_encrypt(ctx->twkey, walk.iv, walk.iv, ctx->key.rounds);
-
 	kernel_neon_begin();
+
+	neon_aes_ecb_encrypt(walk.iv, walk.iv, ctx->twkey,
+			     ctx->key.rounds, 1, 1);
+
 	while (walk.nbytes >= AES_BLOCK_SIZE) {
 		unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2017-01-23 14:05 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-23 14:05 [PATCH v2 00/10] crypto - AES for ARM/arm64 updates for v4.11 (round #2) Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 01/10] crypto: arm64/aes-neon-bs - honour iv_out requirement in CTR mode Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 02/10] crypto: arm/aes-ce - remove cra_alignmask Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 03/10] crypto: arm/chacha20 " Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 04/10] crypto: arm64/aes-ce-ccm " Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 05/10] crypto: arm64/aes-blk " Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 06/10] crypto: arm64/chacha20 " Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 07/10] crypto: arm64/aes - avoid literals for cross-module symbol references Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 08/10] crypto: arm64/aes - performance tweak Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 09/10] crypto: arm64/aes-neon-blk - tweak performance for low end cores Ard Biesheuvel
2017-01-23 14:05 ` [PATCH v2 10/10] crypto: arm64/aes - replace scalar fallback with plain NEON fallback Ard Biesheuvel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).