* [PATCH v5 00/23] crypto: arm64 - play nice with CONFIG_PREEMPT
@ 2018-03-10 15:21 ` Ard Biesheuvel
0 siblings, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:21 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
As reported by Sebastian, the way the arm64 NEON crypto code currently
keeps kernel mode NEON enabled across calls into skcipher_walk_xxx() is
causing problems with RT builds. The skcipher walk API may allocate and
free the temporary buffers it uses to present the input and output
arrays to the crypto algorithm in blocksize-sized chunks (where
blocksize is the natural block size of the crypto algorithm), and doing
so with NEON enabled means we are allocating and freeing memory with
preemption disabled.
This was deliberate: when this code was introduced, each kernel_neon_begin()
and kernel_neon_end() call incurred a fixed penalty of storing, respectively
loading, the contents of all NEON registers to/from memory, so doing it less
often had an obvious performance benefit. However, in the meantime, we have
refactored the core kernel mode NEON code: kernel_neon_begin() now only
incurs this penalty the first time it is called after entering the kernel,
and the NEON register restore is deferred until returning to userland. This
means pulling those calls into the loops that iterate over the input/output
of the crypto algorithm is no longer a big deal (although there are some
places in the code where we relied on the NEON registers retaining their
values between calls).
So let's clean this up for arm64: update the NEON based skcipher drivers to
no longer keep the NEON enabled when calling into the skcipher walk API.
As pointed out by Peter, this only solves part of the problem. So let's
tackle it more thoroughly, and update the algorithms to test the NEED_RESCHED
flag after processing each fixed-size chunk of input.
Given that this issue was flagged by the RT people, I would appreciate it
if they could confirm whether they are happy with this approach.
Changes since v4:
- rebase onto v4.16-rc3
- apply the same treatment to new SHA512, SHA-3 and SM3 code that landed
in v4.16-rc1
Changes since v3:
- incorporate Dave's feedback on the asm macros to push/pop frames and to yield
the NEON conditionally
- make frame_push/pop easier to use, by recording the arguments to
frame_push, removing the need to specify them again when calling frame_pop
- emit local symbol .Lframe_local_offset to allow code using the frame push/pop
macros to index the stack more easily
- use the magic \@ macro invocation counter provided by GAS to generate unique
labels in the NEON yield macros, rather than relying on chance
Changes since v2:
- Drop logic to yield only after so many blocks - as it turns out, the
throughput of the algorithms that are most likely to be affected by the
overhead (GHASH and AES-CE) only drops by ~1% (on Cortex-A57), and if that
is unacceptable, you are probably not using CONFIG_PREEMPT in the first
place.
- Add yield support to the AES-CCM driver
- Clean up macros based on feedback from Dave
- Given that I had to add stack frame logic to many of these functions, factor
it out and wrap it in a couple of macros
- Merge the changes to the core asm driver and glue code of the GHASH/GCM
driver. The latter was not correct without the former.
Changes since v1:
- add CRC-T10DIF test vector (#1)
- stop using GFP_ATOMIC in scatterwalk API calls, now that they are executed
with preemption enabled (#2 - #6)
- do some preparatory refactoring on the AES block mode code (#7 - #9)
- add yield patches (#10 - #18)
- add test patch (#19) - DO NOT MERGE
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-rt-users@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Ard Biesheuvel (23):
crypto: testmgr - add a new test case for CRC-T10DIF
crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop
crypto: arm64/aes-blk - move kernel mode neon en/disable into loop
crypto: arm64/aes-bs - move kernel mode neon en/disable into loop
crypto: arm64/chacha20 - move kernel mode neon en/disable into loop
crypto: arm64/aes-blk - remove configurable interleave
crypto: arm64/aes-blk - add 4 way interleave to CBC encrypt path
crypto: arm64/aes-blk - add 4 way interleave to CBC-MAC encrypt path
crypto: arm64/sha256-neon - play nice with CONFIG_PREEMPT kernels
arm64: assembler: add utility macros to push/pop stack frames
arm64: assembler: add macros to conditionally yield the NEON under
PREEMPT
crypto: arm64/sha1-ce - yield NEON after every block of input
crypto: arm64/sha2-ce - yield NEON after every block of input
crypto: arm64/aes-ccm - yield NEON after every block of input
crypto: arm64/aes-blk - yield NEON after every block of input
crypto: arm64/aes-bs - yield NEON after every block of input
crypto: arm64/aes-ghash - yield NEON after every block of input
crypto: arm64/crc32-ce - yield NEON after every block of input
crypto: arm64/crct10dif-ce - yield NEON after every block of input
crypto: arm64/sha3-ce - yield NEON after every block of input
crypto: arm64/sha512-ce - yield NEON after every block of input
crypto: arm64/sm3-ce - yield NEON after every block of input
DO NOT MERGE
arch/arm64/crypto/Makefile | 3 -
arch/arm64/crypto/aes-ce-ccm-core.S | 150 ++++--
arch/arm64/crypto/aes-ce-ccm-glue.c | 47 +-
arch/arm64/crypto/aes-ce.S | 15 +-
arch/arm64/crypto/aes-glue.c | 95 ++--
arch/arm64/crypto/aes-modes.S | 562 +++++++++-----------
arch/arm64/crypto/aes-neonbs-core.S | 305 ++++++-----
arch/arm64/crypto/aes-neonbs-glue.c | 48 +-
arch/arm64/crypto/chacha20-neon-glue.c | 12 +-
arch/arm64/crypto/crc32-ce-core.S | 40 +-
arch/arm64/crypto/crct10dif-ce-core.S | 32 +-
arch/arm64/crypto/ghash-ce-core.S | 113 ++--
arch/arm64/crypto/ghash-ce-glue.c | 28 +-
arch/arm64/crypto/sha1-ce-core.S | 42 +-
arch/arm64/crypto/sha2-ce-core.S | 37 +-
arch/arm64/crypto/sha256-glue.c | 36 +-
arch/arm64/crypto/sha3-ce-core.S | 77 ++-
arch/arm64/crypto/sha512-ce-core.S | 27 +-
arch/arm64/crypto/sm3-ce-core.S | 30 +-
arch/arm64/include/asm/assembler.h | 167 ++++++
arch/arm64/kernel/asm-offsets.c | 2 +
crypto/testmgr.h | 259 +++++++++
22 files changed, 1392 insertions(+), 735 deletions(-)
--
2.15.1
^ permalink raw reply [flat|nested] 58+ messages in thread
* [PATCH v5 01/23] crypto: testmgr - add a new test case for CRC-T10DIF
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-10 15:21 ` Ard Biesheuvel
0 siblings, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:21 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
In order to be able to test yield support under preempt, add a test
vector for CRC-T10DIF that is long enough to require multiple iterations
of the primary loop of the accelerated x86 and arm64 implementations
(and thus allow possible preemption in between).
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
crypto/testmgr.h | 259 ++++++++++++++++++++
1 file changed, 259 insertions(+)
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 6044f6906bd6..52d9ff93beac 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -2044,6 +2044,265 @@ static const struct hash_testvec crct10dif_tv_template[] = {
.digest = (u8 *)(u16 []){ 0x44c6 },
.np = 4,
.tap = { 1, 255, 57, 6 },
+ }, {
+ .plaintext = "\x6e\x05\x79\x10\xa7\x1b\xb2\x49"
+ "\xe0\x54\xeb\x82\x19\x8d\x24\xbb"
+ "\x2f\xc6\x5d\xf4\x68\xff\x96\x0a"
+ "\xa1\x38\xcf\x43\xda\x71\x08\x7c"
+ "\x13\xaa\x1e\xb5\x4c\xe3\x57\xee"
+ "\x85\x1c\x90\x27\xbe\x32\xc9\x60"
+ "\xf7\x6b\x02\x99\x0d\xa4\x3b\xd2"
+ "\x46\xdd\x74\x0b\x7f\x16\xad\x21"
+ "\xb8\x4f\xe6\x5a\xf1\x88\x1f\x93"
+ "\x2a\xc1\x35\xcc\x63\xfa\x6e\x05"
+ "\x9c\x10\xa7\x3e\xd5\x49\xe0\x77"
+ "\x0e\x82\x19\xb0\x24\xbb\x52\xe9"
+ "\x5d\xf4\x8b\x22\x96\x2d\xc4\x38"
+ "\xcf\x66\xfd\x71\x08\x9f\x13\xaa"
+ "\x41\xd8\x4c\xe3\x7a\x11\x85\x1c"
+ "\xb3\x27\xbe\x55\xec\x60\xf7\x8e"
+ "\x02\x99\x30\xc7\x3b\xd2\x69\x00"
+ "\x74\x0b\xa2\x16\xad\x44\xdb\x4f"
+ "\xe6\x7d\x14\x88\x1f\xb6\x2a\xc1"
+ "\x58\xef\x63\xfa\x91\x05\x9c\x33"
+ "\xca\x3e\xd5\x6c\x03\x77\x0e\xa5"
+ "\x19\xb0\x47\xde\x52\xe9\x80\x17"
+ "\x8b\x22\xb9\x2d\xc4\x5b\xf2\x66"
+ "\xfd\x94\x08\x9f\x36\xcd\x41\xd8"
+ "\x6f\x06\x7a\x11\xa8\x1c\xb3\x4a"
+ "\xe1\x55\xec\x83\x1a\x8e\x25\xbc"
+ "\x30\xc7\x5e\xf5\x69\x00\x97\x0b"
+ "\xa2\x39\xd0\x44\xdb\x72\x09\x7d"
+ "\x14\xab\x1f\xb6\x4d\xe4\x58\xef"
+ "\x86\x1d\x91\x28\xbf\x33\xca\x61"
+ "\xf8\x6c\x03\x9a\x0e\xa5\x3c\xd3"
+ "\x47\xde\x75\x0c\x80\x17\xae\x22"
+ "\xb9\x50\xe7\x5b\xf2\x89\x20\x94"
+ "\x2b\xc2\x36\xcd\x64\xfb\x6f\x06"
+ "\x9d\x11\xa8\x3f\xd6\x4a\xe1\x78"
+ "\x0f\x83\x1a\xb1\x25\xbc\x53\xea"
+ "\x5e\xf5\x8c\x00\x97\x2e\xc5\x39"
+ "\xd0\x67\xfe\x72\x09\xa0\x14\xab"
+ "\x42\xd9\x4d\xe4\x7b\x12\x86\x1d"
+ "\xb4\x28\xbf\x56\xed\x61\xf8\x8f"
+ "\x03\x9a\x31\xc8\x3c\xd3\x6a\x01"
+ "\x75\x0c\xa3\x17\xae\x45\xdc\x50"
+ "\xe7\x7e\x15\x89\x20\xb7\x2b\xc2"
+ "\x59\xf0\x64\xfb\x92\x06\x9d\x34"
+ "\xcb\x3f\xd6\x6d\x04\x78\x0f\xa6"
+ "\x1a\xb1\x48\xdf\x53\xea\x81\x18"
+ "\x8c\x23\xba\x2e\xc5\x5c\xf3\x67"
+ "\xfe\x95\x09\xa0\x37\xce\x42\xd9"
+ "\x70\x07\x7b\x12\xa9\x1d\xb4\x4b"
+ "\xe2\x56\xed\x84\x1b\x8f\x26\xbd"
+ "\x31\xc8\x5f\xf6\x6a\x01\x98\x0c"
+ "\xa3\x3a\xd1\x45\xdc\x73\x0a\x7e"
+ "\x15\xac\x20\xb7\x4e\xe5\x59\xf0"
+ "\x87\x1e\x92\x29\xc0\x34\xcb\x62"
+ "\xf9\x6d\x04\x9b\x0f\xa6\x3d\xd4"
+ "\x48\xdf\x76\x0d\x81\x18\xaf\x23"
+ "\xba\x51\xe8\x5c\xf3\x8a\x21\x95"
+ "\x2c\xc3\x37\xce\x65\xfc\x70\x07"
+ "\x9e\x12\xa9\x40\xd7\x4b\xe2\x79"
+ "\x10\x84\x1b\xb2\x26\xbd\x54\xeb"
+ "\x5f\xf6\x8d\x01\x98\x2f\xc6\x3a"
+ "\xd1\x68\xff\x73\x0a\xa1\x15\xac"
+ "\x43\xda\x4e\xe5\x7c\x13\x87\x1e"
+ "\xb5\x29\xc0\x57\xee\x62\xf9\x90"
+ "\x04\x9b\x32\xc9\x3d\xd4\x6b\x02"
+ "\x76\x0d\xa4\x18\xaf\x46\xdd\x51"
+ "\xe8\x7f\x16\x8a\x21\xb8\x2c\xc3"
+ "\x5a\xf1\x65\xfc\x93\x07\x9e\x35"
+ "\xcc\x40\xd7\x6e\x05\x79\x10\xa7"
+ "\x1b\xb2\x49\xe0\x54\xeb\x82\x19"
+ "\x8d\x24\xbb\x2f\xc6\x5d\xf4\x68"
+ "\xff\x96\x0a\xa1\x38\xcf\x43\xda"
+ "\x71\x08\x7c\x13\xaa\x1e\xb5\x4c"
+ "\xe3\x57\xee\x85\x1c\x90\x27\xbe"
+ "\x32\xc9\x60\xf7\x6b\x02\x99\x0d"
+ "\xa4\x3b\xd2\x46\xdd\x74\x0b\x7f"
+ "\x16\xad\x21\xb8\x4f\xe6\x5a\xf1"
+ "\x88\x1f\x93\x2a\xc1\x35\xcc\x63"
+ "\xfa\x6e\x05\x9c\x10\xa7\x3e\xd5"
+ "\x49\xe0\x77\x0e\x82\x19\xb0\x24"
+ "\xbb\x52\xe9\x5d\xf4\x8b\x22\x96"
+ "\x2d\xc4\x38\xcf\x66\xfd\x71\x08"
+ "\x9f\x13\xaa\x41\xd8\x4c\xe3\x7a"
+ "\x11\x85\x1c\xb3\x27\xbe\x55\xec"
+ "\x60\xf7\x8e\x02\x99\x30\xc7\x3b"
+ "\xd2\x69\x00\x74\x0b\xa2\x16\xad"
+ "\x44\xdb\x4f\xe6\x7d\x14\x88\x1f"
+ "\xb6\x2a\xc1\x58\xef\x63\xfa\x91"
+ "\x05\x9c\x33\xca\x3e\xd5\x6c\x03"
+ "\x77\x0e\xa5\x19\xb0\x47\xde\x52"
+ "\xe9\x80\x17\x8b\x22\xb9\x2d\xc4"
+ "\x5b\xf2\x66\xfd\x94\x08\x9f\x36"
+ "\xcd\x41\xd8\x6f\x06\x7a\x11\xa8"
+ "\x1c\xb3\x4a\xe1\x55\xec\x83\x1a"
+ "\x8e\x25\xbc\x30\xc7\x5e\xf5\x69"
+ "\x00\x97\x0b\xa2\x39\xd0\x44\xdb"
+ "\x72\x09\x7d\x14\xab\x1f\xb6\x4d"
+ "\xe4\x58\xef\x86\x1d\x91\x28\xbf"
+ "\x33\xca\x61\xf8\x6c\x03\x9a\x0e"
+ "\xa5\x3c\xd3\x47\xde\x75\x0c\x80"
+ "\x17\xae\x22\xb9\x50\xe7\x5b\xf2"
+ "\x89\x20\x94\x2b\xc2\x36\xcd\x64"
+ "\xfb\x6f\x06\x9d\x11\xa8\x3f\xd6"
+ "\x4a\xe1\x78\x0f\x83\x1a\xb1\x25"
+ "\xbc\x53\xea\x5e\xf5\x8c\x00\x97"
+ "\x2e\xc5\x39\xd0\x67\xfe\x72\x09"
+ "\xa0\x14\xab\x42\xd9\x4d\xe4\x7b"
+ "\x12\x86\x1d\xb4\x28\xbf\x56\xed"
+ "\x61\xf8\x8f\x03\x9a\x31\xc8\x3c"
+ "\xd3\x6a\x01\x75\x0c\xa3\x17\xae"
+ "\x45\xdc\x50\xe7\x7e\x15\x89\x20"
+ "\xb7\x2b\xc2\x59\xf0\x64\xfb\x92"
+ "\x06\x9d\x34\xcb\x3f\xd6\x6d\x04"
+ "\x78\x0f\xa6\x1a\xb1\x48\xdf\x53"
+ "\xea\x81\x18\x8c\x23\xba\x2e\xc5"
+ "\x5c\xf3\x67\xfe\x95\x09\xa0\x37"
+ "\xce\x42\xd9\x70\x07\x7b\x12\xa9"
+ "\x1d\xb4\x4b\xe2\x56\xed\x84\x1b"
+ "\x8f\x26\xbd\x31\xc8\x5f\xf6\x6a"
+ "\x01\x98\x0c\xa3\x3a\xd1\x45\xdc"
+ "\x73\x0a\x7e\x15\xac\x20\xb7\x4e"
+ "\xe5\x59\xf0\x87\x1e\x92\x29\xc0"
+ "\x34\xcb\x62\xf9\x6d\x04\x9b\x0f"
+ "\xa6\x3d\xd4\x48\xdf\x76\x0d\x81"
+ "\x18\xaf\x23\xba\x51\xe8\x5c\xf3"
+ "\x8a\x21\x95\x2c\xc3\x37\xce\x65"
+ "\xfc\x70\x07\x9e\x12\xa9\x40\xd7"
+ "\x4b\xe2\x79\x10\x84\x1b\xb2\x26"
+ "\xbd\x54\xeb\x5f\xf6\x8d\x01\x98"
+ "\x2f\xc6\x3a\xd1\x68\xff\x73\x0a"
+ "\xa1\x15\xac\x43\xda\x4e\xe5\x7c"
+ "\x13\x87\x1e\xb5\x29\xc0\x57\xee"
+ "\x62\xf9\x90\x04\x9b\x32\xc9\x3d"
+ "\xd4\x6b\x02\x76\x0d\xa4\x18\xaf"
+ "\x46\xdd\x51\xe8\x7f\x16\x8a\x21"
+ "\xb8\x2c\xc3\x5a\xf1\x65\xfc\x93"
+ "\x07\x9e\x35\xcc\x40\xd7\x6e\x05"
+ "\x79\x10\xa7\x1b\xb2\x49\xe0\x54"
+ "\xeb\x82\x19\x8d\x24\xbb\x2f\xc6"
+ "\x5d\xf4\x68\xff\x96\x0a\xa1\x38"
+ "\xcf\x43\xda\x71\x08\x7c\x13\xaa"
+ "\x1e\xb5\x4c\xe3\x57\xee\x85\x1c"
+ "\x90\x27\xbe\x32\xc9\x60\xf7\x6b"
+ "\x02\x99\x0d\xa4\x3b\xd2\x46\xdd"
+ "\x74\x0b\x7f\x16\xad\x21\xb8\x4f"
+ "\xe6\x5a\xf1\x88\x1f\x93\x2a\xc1"
+ "\x35\xcc\x63\xfa\x6e\x05\x9c\x10"
+ "\xa7\x3e\xd5\x49\xe0\x77\x0e\x82"
+ "\x19\xb0\x24\xbb\x52\xe9\x5d\xf4"
+ "\x8b\x22\x96\x2d\xc4\x38\xcf\x66"
+ "\xfd\x71\x08\x9f\x13\xaa\x41\xd8"
+ "\x4c\xe3\x7a\x11\x85\x1c\xb3\x27"
+ "\xbe\x55\xec\x60\xf7\x8e\x02\x99"
+ "\x30\xc7\x3b\xd2\x69\x00\x74\x0b"
+ "\xa2\x16\xad\x44\xdb\x4f\xe6\x7d"
+ "\x14\x88\x1f\xb6\x2a\xc1\x58\xef"
+ "\x63\xfa\x91\x05\x9c\x33\xca\x3e"
+ "\xd5\x6c\x03\x77\x0e\xa5\x19\xb0"
+ "\x47\xde\x52\xe9\x80\x17\x8b\x22"
+ "\xb9\x2d\xc4\x5b\xf2\x66\xfd\x94"
+ "\x08\x9f\x36\xcd\x41\xd8\x6f\x06"
+ "\x7a\x11\xa8\x1c\xb3\x4a\xe1\x55"
+ "\xec\x83\x1a\x8e\x25\xbc\x30\xc7"
+ "\x5e\xf5\x69\x00\x97\x0b\xa2\x39"
+ "\xd0\x44\xdb\x72\x09\x7d\x14\xab"
+ "\x1f\xb6\x4d\xe4\x58\xef\x86\x1d"
+ "\x91\x28\xbf\x33\xca\x61\xf8\x6c"
+ "\x03\x9a\x0e\xa5\x3c\xd3\x47\xde"
+ "\x75\x0c\x80\x17\xae\x22\xb9\x50"
+ "\xe7\x5b\xf2\x89\x20\x94\x2b\xc2"
+ "\x36\xcd\x64\xfb\x6f\x06\x9d\x11"
+ "\xa8\x3f\xd6\x4a\xe1\x78\x0f\x83"
+ "\x1a\xb1\x25\xbc\x53\xea\x5e\xf5"
+ "\x8c\x00\x97\x2e\xc5\x39\xd0\x67"
+ "\xfe\x72\x09\xa0\x14\xab\x42\xd9"
+ "\x4d\xe4\x7b\x12\x86\x1d\xb4\x28"
+ "\xbf\x56\xed\x61\xf8\x8f\x03\x9a"
+ "\x31\xc8\x3c\xd3\x6a\x01\x75\x0c"
+ "\xa3\x17\xae\x45\xdc\x50\xe7\x7e"
+ "\x15\x89\x20\xb7\x2b\xc2\x59\xf0"
+ "\x64\xfb\x92\x06\x9d\x34\xcb\x3f"
+ "\xd6\x6d\x04\x78\x0f\xa6\x1a\xb1"
+ "\x48\xdf\x53\xea\x81\x18\x8c\x23"
+ "\xba\x2e\xc5\x5c\xf3\x67\xfe\x95"
+ "\x09\xa0\x37\xce\x42\xd9\x70\x07"
+ "\x7b\x12\xa9\x1d\xb4\x4b\xe2\x56"
+ "\xed\x84\x1b\x8f\x26\xbd\x31\xc8"
+ "\x5f\xf6\x6a\x01\x98\x0c\xa3\x3a"
+ "\xd1\x45\xdc\x73\x0a\x7e\x15\xac"
+ "\x20\xb7\x4e\xe5\x59\xf0\x87\x1e"
+ "\x92\x29\xc0\x34\xcb\x62\xf9\x6d"
+ "\x04\x9b\x0f\xa6\x3d\xd4\x48\xdf"
+ "\x76\x0d\x81\x18\xaf\x23\xba\x51"
+ "\xe8\x5c\xf3\x8a\x21\x95\x2c\xc3"
+ "\x37\xce\x65\xfc\x70\x07\x9e\x12"
+ "\xa9\x40\xd7\x4b\xe2\x79\x10\x84"
+ "\x1b\xb2\x26\xbd\x54\xeb\x5f\xf6"
+ "\x8d\x01\x98\x2f\xc6\x3a\xd1\x68"
+ "\xff\x73\x0a\xa1\x15\xac\x43\xda"
+ "\x4e\xe5\x7c\x13\x87\x1e\xb5\x29"
+ "\xc0\x57\xee\x62\xf9\x90\x04\x9b"
+ "\x32\xc9\x3d\xd4\x6b\x02\x76\x0d"
+ "\xa4\x18\xaf\x46\xdd\x51\xe8\x7f"
+ "\x16\x8a\x21\xb8\x2c\xc3\x5a\xf1"
+ "\x65\xfc\x93\x07\x9e\x35\xcc\x40"
+ "\xd7\x6e\x05\x79\x10\xa7\x1b\xb2"
+ "\x49\xe0\x54\xeb\x82\x19\x8d\x24"
+ "\xbb\x2f\xc6\x5d\xf4\x68\xff\x96"
+ "\x0a\xa1\x38\xcf\x43\xda\x71\x08"
+ "\x7c\x13\xaa\x1e\xb5\x4c\xe3\x57"
+ "\xee\x85\x1c\x90\x27\xbe\x32\xc9"
+ "\x60\xf7\x6b\x02\x99\x0d\xa4\x3b"
+ "\xd2\x46\xdd\x74\x0b\x7f\x16\xad"
+ "\x21\xb8\x4f\xe6\x5a\xf1\x88\x1f"
+ "\x93\x2a\xc1\x35\xcc\x63\xfa\x6e"
+ "\x05\x9c\x10\xa7\x3e\xd5\x49\xe0"
+ "\x77\x0e\x82\x19\xb0\x24\xbb\x52"
+ "\xe9\x5d\xf4\x8b\x22\x96\x2d\xc4"
+ "\x38\xcf\x66\xfd\x71\x08\x9f\x13"
+ "\xaa\x41\xd8\x4c\xe3\x7a\x11\x85"
+ "\x1c\xb3\x27\xbe\x55\xec\x60\xf7"
+ "\x8e\x02\x99\x30\xc7\x3b\xd2\x69"
+ "\x00\x74\x0b\xa2\x16\xad\x44\xdb"
+ "\x4f\xe6\x7d\x14\x88\x1f\xb6\x2a"
+ "\xc1\x58\xef\x63\xfa\x91\x05\x9c"
+ "\x33\xca\x3e\xd5\x6c\x03\x77\x0e"
+ "\xa5\x19\xb0\x47\xde\x52\xe9\x80"
+ "\x17\x8b\x22\xb9\x2d\xc4\x5b\xf2"
+ "\x66\xfd\x94\x08\x9f\x36\xcd\x41"
+ "\xd8\x6f\x06\x7a\x11\xa8\x1c\xb3"
+ "\x4a\xe1\x55\xec\x83\x1a\x8e\x25"
+ "\xbc\x30\xc7\x5e\xf5\x69\x00\x97"
+ "\x0b\xa2\x39\xd0\x44\xdb\x72\x09"
+ "\x7d\x14\xab\x1f\xb6\x4d\xe4\x58"
+ "\xef\x86\x1d\x91\x28\xbf\x33\xca"
+ "\x61\xf8\x6c\x03\x9a\x0e\xa5\x3c"
+ "\xd3\x47\xde\x75\x0c\x80\x17\xae"
+ "\x22\xb9\x50\xe7\x5b\xf2\x89\x20"
+ "\x94\x2b\xc2\x36\xcd\x64\xfb\x6f"
+ "\x06\x9d\x11\xa8\x3f\xd6\x4a\xe1"
+ "\x78\x0f\x83\x1a\xb1\x25\xbc\x53"
+ "\xea\x5e\xf5\x8c\x00\x97\x2e\xc5"
+ "\x39\xd0\x67\xfe\x72\x09\xa0\x14"
+ "\xab\x42\xd9\x4d\xe4\x7b\x12\x86"
+ "\x1d\xb4\x28\xbf\x56\xed\x61\xf8"
+ "\x8f\x03\x9a\x31\xc8\x3c\xd3\x6a"
+ "\x01\x75\x0c\xa3\x17\xae\x45\xdc"
+ "\x50\xe7\x7e\x15\x89\x20\xb7\x2b"
+ "\xc2\x59\xf0\x64\xfb\x92\x06\x9d"
+ "\x34\xcb\x3f\xd6\x6d\x04\x78\x0f"
+ "\xa6\x1a\xb1\x48\xdf\x53\xea\x81"
+ "\x18\x8c\x23\xba\x2e\xc5\x5c\xf3"
+ "\x67\xfe\x95\x09\xa0\x37\xce\x42"
+ "\xd9\x70\x07\x7b\x12\xa9\x1d\xb4"
+ "\x4b\xe2\x56\xed\x84\x1b\x8f\x26"
+ "\xbd\x31\xc8\x5f\xf6\x6a\x01\x98",
+ .psize = 2048,
+ .digest = (u8 *)(u16 []){ 0x23ca },
}
};
--
2.15.1
+ "\xab\x42\xd9\x4d\xe4\x7b\x12\x86"
+ "\x1d\xb4\x28\xbf\x56\xed\x61\xf8"
+ "\x8f\x03\x9a\x31\xc8\x3c\xd3\x6a"
+ "\x01\x75\x0c\xa3\x17\xae\x45\xdc"
+ "\x50\xe7\x7e\x15\x89\x20\xb7\x2b"
+ "\xc2\x59\xf0\x64\xfb\x92\x06\x9d"
+ "\x34\xcb\x3f\xd6\x6d\x04\x78\x0f"
+ "\xa6\x1a\xb1\x48\xdf\x53\xea\x81"
+ "\x18\x8c\x23\xba\x2e\xc5\x5c\xf3"
+ "\x67\xfe\x95\x09\xa0\x37\xce\x42"
+ "\xd9\x70\x07\x7b\x12\xa9\x1d\xb4"
+ "\x4b\xe2\x56\xed\x84\x1b\x8f\x26"
+ "\xbd\x31\xc8\x5f\xf6\x6a\x01\x98",
+ .psize = 2048,
+ .digest = (u8 *)(u16 []){ 0x23ca },
}
};
--
2.15.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v5 02/23] crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-10 15:21 ` Ard Biesheuvel
-1 siblings, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:21 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.
Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.
So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled).
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-ce-ccm-glue.c | 47 ++++++++++----------
1 file changed, 23 insertions(+), 24 deletions(-)
diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
index a1254036f2b1..68b11aa690e4 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -107,11 +107,13 @@ static int ccm_init_mac(struct aead_request *req, u8 maciv[], u32 msglen)
}
static void ccm_update_mac(struct crypto_aes_ctx *key, u8 mac[], u8 const in[],
- u32 abytes, u32 *macp, bool use_neon)
+ u32 abytes, u32 *macp)
{
- if (likely(use_neon)) {
+ if (may_use_simd()) {
+ kernel_neon_begin();
ce_aes_ccm_auth_data(mac, in, abytes, macp, key->key_enc,
num_rounds(key));
+ kernel_neon_end();
} else {
if (*macp > 0 && *macp < AES_BLOCK_SIZE) {
int added = min(abytes, AES_BLOCK_SIZE - *macp);
@@ -143,8 +145,7 @@ static void ccm_update_mac(struct crypto_aes_ctx *key, u8 mac[], u8 const in[],
}
}
-static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[],
- bool use_neon)
+static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[])
{
struct crypto_aead *aead = crypto_aead_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_aead_ctx(aead);
@@ -163,7 +164,7 @@ static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[],
ltag.len = 6;
}
- ccm_update_mac(ctx, mac, (u8 *)&ltag, ltag.len, &macp, use_neon);
+ ccm_update_mac(ctx, mac, (u8 *)&ltag, ltag.len, &macp);
scatterwalk_start(&walk, req->src);
do {
@@ -175,7 +176,7 @@ static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[],
n = scatterwalk_clamp(&walk, len);
}
p = scatterwalk_map(&walk);
- ccm_update_mac(ctx, mac, p, n, &macp, use_neon);
+ ccm_update_mac(ctx, mac, p, n, &macp);
len -= n;
scatterwalk_unmap(p);
@@ -242,43 +243,42 @@ static int ccm_encrypt(struct aead_request *req)
u8 __aligned(8) mac[AES_BLOCK_SIZE];
u8 buf[AES_BLOCK_SIZE];
u32 len = req->cryptlen;
- bool use_neon = may_use_simd();
int err;
err = ccm_init_mac(req, mac, len);
if (err)
return err;
- if (likely(use_neon))
- kernel_neon_begin();
-
if (req->assoclen)
- ccm_calculate_auth_mac(req, mac, use_neon);
+ ccm_calculate_auth_mac(req, mac);
/* preserve the original iv for the final round */
memcpy(buf, req->iv, AES_BLOCK_SIZE);
err = skcipher_walk_aead_encrypt(&walk, req, true);
- if (likely(use_neon)) {
+ if (may_use_simd()) {
while (walk.nbytes) {
u32 tail = walk.nbytes % AES_BLOCK_SIZE;
if (walk.nbytes == walk.total)
tail = 0;
+ kernel_neon_begin();
ce_aes_ccm_encrypt(walk.dst.virt.addr,
walk.src.virt.addr,
walk.nbytes - tail, ctx->key_enc,
num_rounds(ctx), mac, walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, tail);
}
- if (!err)
+ if (!err) {
+ kernel_neon_begin();
ce_aes_ccm_final(mac, buf, ctx->key_enc,
num_rounds(ctx));
-
- kernel_neon_end();
+ kernel_neon_end();
+ }
} else {
err = ccm_crypt_fallback(&walk, mac, buf, ctx, true);
}
@@ -301,43 +301,42 @@ static int ccm_decrypt(struct aead_request *req)
u8 __aligned(8) mac[AES_BLOCK_SIZE];
u8 buf[AES_BLOCK_SIZE];
u32 len = req->cryptlen - authsize;
- bool use_neon = may_use_simd();
int err;
err = ccm_init_mac(req, mac, len);
if (err)
return err;
- if (likely(use_neon))
- kernel_neon_begin();
-
if (req->assoclen)
- ccm_calculate_auth_mac(req, mac, use_neon);
+ ccm_calculate_auth_mac(req, mac);
/* preserve the original iv for the final round */
memcpy(buf, req->iv, AES_BLOCK_SIZE);
err = skcipher_walk_aead_decrypt(&walk, req, true);
- if (likely(use_neon)) {
+ if (may_use_simd()) {
while (walk.nbytes) {
u32 tail = walk.nbytes % AES_BLOCK_SIZE;
if (walk.nbytes == walk.total)
tail = 0;
+ kernel_neon_begin();
ce_aes_ccm_decrypt(walk.dst.virt.addr,
walk.src.virt.addr,
walk.nbytes - tail, ctx->key_enc,
num_rounds(ctx), mac, walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, tail);
}
- if (!err)
+ if (!err) {
+ kernel_neon_begin();
ce_aes_ccm_final(mac, buf, ctx->key_enc,
num_rounds(ctx));
-
- kernel_neon_end();
+ kernel_neon_end();
+ }
} else {
err = ccm_crypt_fallback(&walk, mac, buf, ctx, false);
}
--
2.15.1
* [PATCH v5 03/23] crypto: arm64/aes-blk - move kernel mode neon en/disable into loop
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-10 15:21 ` Ard Biesheuvel
-1 siblings, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:21 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.
Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.
So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled).
Note that this requires some reshuffling of the registers in the asm
code, because the XTS routines can no longer rely on the registers to
retain their contents between invocations.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-glue.c | 95 ++++++++++----------
arch/arm64/crypto/aes-modes.S | 90 +++++++++----------
arch/arm64/crypto/aes-neonbs-glue.c | 14 ++-
3 files changed, 97 insertions(+), 102 deletions(-)
diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 2fa850e86aa8..253188fb8cb0 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -64,17 +64,17 @@ MODULE_LICENSE("GPL v2");
/* defined in aes-modes.S */
asmlinkage void aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[],
- int rounds, int blocks, int first);
+ int rounds, int blocks);
asmlinkage void aes_ecb_decrypt(u8 out[], u8 const in[], u8 const rk[],
- int rounds, int blocks, int first);
+ int rounds, int blocks);
asmlinkage void aes_cbc_encrypt(u8 out[], u8 const in[], u8 const rk[],
- int rounds, int blocks, u8 iv[], int first);
+ int rounds, int blocks, u8 iv[]);
asmlinkage void aes_cbc_decrypt(u8 out[], u8 const in[], u8 const rk[],
- int rounds, int blocks, u8 iv[], int first);
+ int rounds, int blocks, u8 iv[]);
asmlinkage void aes_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[],
- int rounds, int blocks, u8 ctr[], int first);
+ int rounds, int blocks, u8 ctr[]);
asmlinkage void aes_xts_encrypt(u8 out[], u8 const in[], u8 const rk1[],
int rounds, int blocks, u8 const rk2[], u8 iv[],
@@ -133,19 +133,19 @@ static int ecb_encrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
- int err, first, rounds = 6 + ctx->key_length / 4;
+ int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- kernel_neon_begin();
- for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
aes_ecb_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_enc, rounds, blocks, first);
+ (u8 *)ctx->key_enc, rounds, blocks);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -153,19 +153,19 @@ static int ecb_decrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
- int err, first, rounds = 6 + ctx->key_length / 4;
+ int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- kernel_neon_begin();
- for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
aes_ecb_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_dec, rounds, blocks, first);
+ (u8 *)ctx->key_dec, rounds, blocks);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -173,20 +173,19 @@ static int cbc_encrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
- int err, first, rounds = 6 + ctx->key_length / 4;
+ int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- kernel_neon_begin();
- for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_enc, rounds, blocks, walk.iv,
- first);
+ (u8 *)ctx->key_enc, rounds, blocks, walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -194,20 +193,19 @@ static int cbc_decrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
- int err, first, rounds = 6 + ctx->key_length / 4;
+ int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- kernel_neon_begin();
- for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
aes_cbc_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_dec, rounds, blocks, walk.iv,
- first);
+ (u8 *)ctx->key_dec, rounds, blocks, walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -215,20 +213,18 @@ static int ctr_encrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
- int err, first, rounds = 6 + ctx->key_length / 4;
+ int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
int blocks;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- first = 1;
- kernel_neon_begin();
while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
aes_ctr_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_enc, rounds, blocks, walk.iv,
- first);
+ (u8 *)ctx->key_enc, rounds, blocks, walk.iv);
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
- first = 0;
+ kernel_neon_end();
}
if (walk.nbytes) {
u8 __aligned(8) tail[AES_BLOCK_SIZE];
@@ -241,12 +237,13 @@ static int ctr_encrypt(struct skcipher_request *req)
*/
blocks = -1;
+ kernel_neon_begin();
aes_ctr_encrypt(tail, NULL, (u8 *)ctx->key_enc, rounds,
- blocks, walk.iv, first);
+ blocks, walk.iv);
+ kernel_neon_end();
crypto_xor_cpy(tdst, tsrc, tail, nbytes);
err = skcipher_walk_done(&walk, 0);
}
- kernel_neon_end();
return err;
}
@@ -270,16 +267,16 @@ static int xts_encrypt(struct skcipher_request *req)
struct skcipher_walk walk;
unsigned int blocks;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- kernel_neon_begin();
for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ kernel_neon_begin();
aes_xts_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
(u8 *)ctx->key1.key_enc, rounds, blocks,
(u8 *)ctx->key2.key_enc, walk.iv, first);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -292,16 +289,16 @@ static int xts_decrypt(struct skcipher_request *req)
struct skcipher_walk walk;
unsigned int blocks;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- kernel_neon_begin();
for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ kernel_neon_begin();
aes_xts_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
(u8 *)ctx->key1.key_dec, rounds, blocks,
(u8 *)ctx->key2.key_enc, walk.iv, first);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -425,7 +422,7 @@ static int cmac_setkey(struct crypto_shash *tfm, const u8 *in_key,
/* encrypt the zero vector */
kernel_neon_begin();
- aes_ecb_encrypt(ctx->consts, (u8[AES_BLOCK_SIZE]){}, rk, rounds, 1, 1);
+ aes_ecb_encrypt(ctx->consts, (u8[AES_BLOCK_SIZE]){}, rk, rounds, 1);
kernel_neon_end();
cmac_gf128_mul_by_x(consts, consts);
@@ -454,8 +451,8 @@ static int xcbc_setkey(struct crypto_shash *tfm, const u8 *in_key,
return err;
kernel_neon_begin();
- aes_ecb_encrypt(key, ks[0], rk, rounds, 1, 1);
- aes_ecb_encrypt(ctx->consts, ks[1], rk, rounds, 2, 0);
+ aes_ecb_encrypt(key, ks[0], rk, rounds, 1);
+ aes_ecb_encrypt(ctx->consts, ks[1], rk, rounds, 2);
kernel_neon_end();
return cbcmac_setkey(tfm, key, sizeof(key));
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 2674d43d1384..65b273667b34 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -40,24 +40,24 @@
#if INTERLEAVE == 2
aes_encrypt_block2x:
- encrypt_block2x v0, v1, w3, x2, x6, w7
+ encrypt_block2x v0, v1, w3, x2, x8, w7
ret
ENDPROC(aes_encrypt_block2x)
aes_decrypt_block2x:
- decrypt_block2x v0, v1, w3, x2, x6, w7
+ decrypt_block2x v0, v1, w3, x2, x8, w7
ret
ENDPROC(aes_decrypt_block2x)
#elif INTERLEAVE == 4
aes_encrypt_block4x:
- encrypt_block4x v0, v1, v2, v3, w3, x2, x6, w7
+ encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
ret
ENDPROC(aes_encrypt_block4x)
aes_decrypt_block4x:
- decrypt_block4x v0, v1, v2, v3, w3, x2, x6, w7
+ decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
ret
ENDPROC(aes_decrypt_block4x)
@@ -86,33 +86,32 @@ ENDPROC(aes_decrypt_block4x)
#define FRAME_POP
.macro do_encrypt_block2x
- encrypt_block2x v0, v1, w3, x2, x6, w7
+ encrypt_block2x v0, v1, w3, x2, x8, w7
.endm
.macro do_decrypt_block2x
- decrypt_block2x v0, v1, w3, x2, x6, w7
+ decrypt_block2x v0, v1, w3, x2, x8, w7
.endm
.macro do_encrypt_block4x
- encrypt_block4x v0, v1, v2, v3, w3, x2, x6, w7
+ encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
.endm
.macro do_decrypt_block4x
- decrypt_block4x v0, v1, v2, v3, w3, x2, x6, w7
+ decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
.endm
#endif
/*
* aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
- * int blocks, int first)
+ * int blocks)
* aes_ecb_decrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
- * int blocks, int first)
+ * int blocks)
*/
AES_ENTRY(aes_ecb_encrypt)
FRAME_PUSH
- cbz w5, .LecbencloopNx
enc_prepare w3, x2, x5
@@ -148,7 +147,6 @@ AES_ENDPROC(aes_ecb_encrypt)
AES_ENTRY(aes_ecb_decrypt)
FRAME_PUSH
- cbz w5, .LecbdecloopNx
dec_prepare w3, x2, x5
@@ -184,14 +182,12 @@ AES_ENDPROC(aes_ecb_decrypt)
/*
* aes_cbc_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
- * int blocks, u8 iv[], int first)
+ * int blocks, u8 iv[])
* aes_cbc_decrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
- * int blocks, u8 iv[], int first)
+ * int blocks, u8 iv[])
*/
AES_ENTRY(aes_cbc_encrypt)
- cbz w6, .Lcbcencloop
-
ld1 {v0.16b}, [x5] /* get iv */
enc_prepare w3, x2, x6
@@ -209,7 +205,6 @@ AES_ENDPROC(aes_cbc_encrypt)
AES_ENTRY(aes_cbc_decrypt)
FRAME_PUSH
- cbz w6, .LcbcdecloopNx
ld1 {v7.16b}, [x5] /* get iv */
dec_prepare w3, x2, x6
@@ -264,20 +259,19 @@ AES_ENDPROC(aes_cbc_decrypt)
/*
* aes_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
- * int blocks, u8 ctr[], int first)
+ * int blocks, u8 ctr[])
*/
AES_ENTRY(aes_ctr_encrypt)
FRAME_PUSH
- cbz w6, .Lctrnotfirst /* 1st time around? */
+
enc_prepare w3, x2, x6
ld1 {v4.16b}, [x5]
-.Lctrnotfirst:
- umov x8, v4.d[1] /* keep swabbed ctr in reg */
- rev x8, x8
+ umov x6, v4.d[1] /* keep swabbed ctr in reg */
+ rev x6, x6
#if INTERLEAVE >= 2
- cmn w8, w4 /* 32 bit overflow? */
+ cmn w6, w4 /* 32 bit overflow? */
bcs .Lctrloop
.LctrloopNx:
subs w4, w4, #INTERLEAVE
@@ -285,11 +279,11 @@ AES_ENTRY(aes_ctr_encrypt)
#if INTERLEAVE == 2
mov v0.8b, v4.8b
mov v1.8b, v4.8b
- rev x7, x8
- add x8, x8, #1
+ rev x7, x6
+ add x6, x6, #1
ins v0.d[1], x7
- rev x7, x8
- add x8, x8, #1
+ rev x7, x6
+ add x6, x6, #1
ins v1.d[1], x7
ld1 {v2.16b-v3.16b}, [x1], #32 /* get 2 input blocks */
do_encrypt_block2x
@@ -298,7 +292,7 @@ AES_ENTRY(aes_ctr_encrypt)
st1 {v0.16b-v1.16b}, [x0], #32
#else
ldr q8, =0x30000000200000001 /* addends 1,2,3[,0] */
- dup v7.4s, w8
+ dup v7.4s, w6
mov v0.16b, v4.16b
add v7.4s, v7.4s, v8.4s
mov v1.16b, v4.16b
@@ -316,9 +310,9 @@ AES_ENTRY(aes_ctr_encrypt)
eor v2.16b, v7.16b, v2.16b
eor v3.16b, v5.16b, v3.16b
st1 {v0.16b-v3.16b}, [x0], #64
- add x8, x8, #INTERLEAVE
+ add x6, x6, #INTERLEAVE
#endif
- rev x7, x8
+ rev x7, x6
ins v4.d[1], x7
cbz w4, .Lctrout
b .LctrloopNx
@@ -328,10 +322,10 @@ AES_ENTRY(aes_ctr_encrypt)
#endif
.Lctrloop:
mov v0.16b, v4.16b
- encrypt_block v0, w3, x2, x6, w7
+ encrypt_block v0, w3, x2, x8, w7
- adds x8, x8, #1 /* increment BE ctr */
- rev x7, x8
+ adds x6, x6, #1 /* increment BE ctr */
+ rev x7, x6
ins v4.d[1], x7
bcs .Lctrcarry /* overflow? */
@@ -385,15 +379,17 @@ CPU_BE( .quad 0x87, 1 )
AES_ENTRY(aes_xts_encrypt)
FRAME_PUSH
- cbz w7, .LxtsencloopNx
-
ld1 {v4.16b}, [x6]
- enc_prepare w3, x5, x6
- encrypt_block v4, w3, x5, x6, w7 /* first tweak */
- enc_switch_key w3, x2, x6
+ cbz w7, .Lxtsencnotfirst
+
+ enc_prepare w3, x5, x8
+ encrypt_block v4, w3, x5, x8, w7 /* first tweak */
+ enc_switch_key w3, x2, x8
ldr q7, .Lxts_mul_x
b .LxtsencNx
+.Lxtsencnotfirst:
+ enc_prepare w3, x2, x8
.LxtsencloopNx:
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
@@ -442,7 +438,7 @@ AES_ENTRY(aes_xts_encrypt)
.Lxtsencloop:
ld1 {v1.16b}, [x1], #16
eor v0.16b, v1.16b, v4.16b
- encrypt_block v0, w3, x2, x6, w7
+ encrypt_block v0, w3, x2, x8, w7
eor v0.16b, v0.16b, v4.16b
st1 {v0.16b}, [x0], #16
subs w4, w4, #1
@@ -450,6 +446,7 @@ AES_ENTRY(aes_xts_encrypt)
next_tweak v4, v4, v7, v8
b .Lxtsencloop
.Lxtsencout:
+ st1 {v4.16b}, [x6]
FRAME_POP
ret
AES_ENDPROC(aes_xts_encrypt)
@@ -457,15 +454,17 @@ AES_ENDPROC(aes_xts_encrypt)
AES_ENTRY(aes_xts_decrypt)
FRAME_PUSH
- cbz w7, .LxtsdecloopNx
-
ld1 {v4.16b}, [x6]
- enc_prepare w3, x5, x6
- encrypt_block v4, w3, x5, x6, w7 /* first tweak */
- dec_prepare w3, x2, x6
+ cbz w7, .Lxtsdecnotfirst
+
+ enc_prepare w3, x5, x8
+ encrypt_block v4, w3, x5, x8, w7 /* first tweak */
+ dec_prepare w3, x2, x8
ldr q7, .Lxts_mul_x
b .LxtsdecNx
+.Lxtsdecnotfirst:
+ dec_prepare w3, x2, x8
.LxtsdecloopNx:
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
@@ -514,7 +513,7 @@ AES_ENTRY(aes_xts_decrypt)
.Lxtsdecloop:
ld1 {v1.16b}, [x1], #16
eor v0.16b, v1.16b, v4.16b
- decrypt_block v0, w3, x2, x6, w7
+ decrypt_block v0, w3, x2, x8, w7
eor v0.16b, v0.16b, v4.16b
st1 {v0.16b}, [x0], #16
subs w4, w4, #1
@@ -522,6 +521,7 @@ AES_ENTRY(aes_xts_decrypt)
next_tweak v4, v4, v7, v8
b .Lxtsdecloop
.Lxtsdecout:
+ st1 {v4.16b}, [x6]
FRAME_POP
ret
AES_ENDPROC(aes_xts_decrypt)
diff --git a/arch/arm64/crypto/aes-neonbs-glue.c b/arch/arm64/crypto/aes-neonbs-glue.c
index c55d68ccb89f..9d823c77ec84 100644
--- a/arch/arm64/crypto/aes-neonbs-glue.c
+++ b/arch/arm64/crypto/aes-neonbs-glue.c
@@ -46,10 +46,9 @@ asmlinkage void aesbs_xts_decrypt(u8 out[], u8 const in[], u8 const rk[],
/* borrowed from aes-neon-blk.ko */
asmlinkage void neon_aes_ecb_encrypt(u8 out[], u8 const in[], u32 const rk[],
- int rounds, int blocks, int first);
+ int rounds, int blocks);
asmlinkage void neon_aes_cbc_encrypt(u8 out[], u8 const in[], u32 const rk[],
- int rounds, int blocks, u8 iv[],
- int first);
+ int rounds, int blocks, u8 iv[]);
struct aesbs_ctx {
u8 rk[13 * (8 * AES_BLOCK_SIZE) + 32];
@@ -157,7 +156,7 @@ static int cbc_encrypt(struct skcipher_request *req)
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct aesbs_cbc_ctx *ctx = crypto_skcipher_ctx(tfm);
struct skcipher_walk walk;
- int err, first = 1;
+ int err;
err = skcipher_walk_virt(&walk, req, true);
@@ -167,10 +166,9 @@ static int cbc_encrypt(struct skcipher_request *req)
/* fall back to the non-bitsliced NEON implementation */
neon_aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- ctx->enc, ctx->key.rounds, blocks, walk.iv,
- first);
+ ctx->enc, ctx->key.rounds, blocks,
+ walk.iv);
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
- first = 0;
}
kernel_neon_end();
return err;
@@ -311,7 +309,7 @@ static int __xts_crypt(struct skcipher_request *req,
kernel_neon_begin();
neon_aes_ecb_encrypt(walk.iv, walk.iv, ctx->twkey,
- ctx->key.rounds, 1, 1);
+ ctx->key.rounds, 1);
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
--
2.15.1
* [PATCH v5 04/23] crypto: arm64/aes-bs - move kernel mode neon en/disable into loop
@ 2018-03-10 15:21 ` Ard Biesheuvel
From: Ard Biesheuvel @ 2018-03-10 15:21 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.
Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.
So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled).
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-neonbs-glue.c | 36 +++++++++-----------
1 file changed, 17 insertions(+), 19 deletions(-)
diff --git a/arch/arm64/crypto/aes-neonbs-glue.c b/arch/arm64/crypto/aes-neonbs-glue.c
index 9d823c77ec84..e7a95a566462 100644
--- a/arch/arm64/crypto/aes-neonbs-glue.c
+++ b/arch/arm64/crypto/aes-neonbs-glue.c
@@ -99,9 +99,8 @@ static int __ecb_crypt(struct skcipher_request *req,
struct skcipher_walk walk;
int err;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- kernel_neon_begin();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
@@ -109,12 +108,13 @@ static int __ecb_crypt(struct skcipher_request *req,
blocks = round_down(blocks,
walk.stride / AES_BLOCK_SIZE);
+ kernel_neon_begin();
fn(walk.dst.virt.addr, walk.src.virt.addr, ctx->rk,
ctx->rounds, blocks);
+ kernel_neon_end();
err = skcipher_walk_done(&walk,
walk.nbytes - blocks * AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -158,19 +158,19 @@ static int cbc_encrypt(struct skcipher_request *req)
struct skcipher_walk walk;
int err;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- kernel_neon_begin();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
/* fall back to the non-bitsliced NEON implementation */
+ kernel_neon_begin();
neon_aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
ctx->enc, ctx->key.rounds, blocks,
walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -181,9 +181,8 @@ static int cbc_decrypt(struct skcipher_request *req)
struct skcipher_walk walk;
int err;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- kernel_neon_begin();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
@@ -191,13 +190,14 @@ static int cbc_decrypt(struct skcipher_request *req)
blocks = round_down(blocks,
walk.stride / AES_BLOCK_SIZE);
+ kernel_neon_begin();
aesbs_cbc_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
ctx->key.rk, ctx->key.rounds, blocks,
walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk,
walk.nbytes - blocks * AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -229,9 +229,8 @@ static int ctr_encrypt(struct skcipher_request *req)
u8 buf[AES_BLOCK_SIZE];
int err;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- kernel_neon_begin();
while (walk.nbytes > 0) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
u8 *final = (walk.total % AES_BLOCK_SIZE) ? buf : NULL;
@@ -242,8 +241,10 @@ static int ctr_encrypt(struct skcipher_request *req)
final = NULL;
}
+ kernel_neon_begin();
aesbs_ctr_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
ctx->rk, ctx->rounds, blocks, walk.iv, final);
+ kernel_neon_end();
if (final) {
u8 *dst = walk.dst.virt.addr + blocks * AES_BLOCK_SIZE;
@@ -258,8 +259,6 @@ static int ctr_encrypt(struct skcipher_request *req)
err = skcipher_walk_done(&walk,
walk.nbytes - blocks * AES_BLOCK_SIZE);
}
- kernel_neon_end();
-
return err;
}
@@ -304,12 +303,11 @@ static int __xts_crypt(struct skcipher_request *req,
struct skcipher_walk walk;
int err;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
kernel_neon_begin();
-
- neon_aes_ecb_encrypt(walk.iv, walk.iv, ctx->twkey,
- ctx->key.rounds, 1);
+ neon_aes_ecb_encrypt(walk.iv, walk.iv, ctx->twkey, ctx->key.rounds, 1);
+ kernel_neon_end();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
@@ -318,13 +316,13 @@ static int __xts_crypt(struct skcipher_request *req,
blocks = round_down(blocks,
walk.stride / AES_BLOCK_SIZE);
+ kernel_neon_begin();
fn(walk.dst.virt.addr, walk.src.virt.addr, ctx->key.rk,
ctx->key.rounds, blocks, walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk,
walk.nbytes - blocks * AES_BLOCK_SIZE);
}
- kernel_neon_end();
-
return err;
}
--
2.15.1
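The bit-sliced loops in this patch also keep the walk aligned to the algorithm's stride: when more data remains beyond the current chunk, the block count is rounded down to a multiple of the stride (8 AES blocks for the bit-sliced code) and the remainder is handed back via skcipher_walk_done(). A minimal sketch of that chunking decision, assuming the kernel's power-of-2 round_down() semantics (function name and parameters here are hypothetical):

```c
/* Sketch of the chunking logic in the bit-sliced loops above. */
#include <assert.h>

#define AES_BLOCK_SIZE 16
/* Mirrors the kernel's round_down() for power-of-2 alignment. */
#define round_down(x, y) ((x) & ~((y) - 1))

/*
 * How many blocks to process this pass, given the bytes mapped in this
 * chunk (nbytes), the total bytes left in the request (total), and the
 * algorithm's stride in bytes (e.g. 8 * AES_BLOCK_SIZE = 128).
 */
unsigned int blocks_this_pass(unsigned int nbytes, unsigned int total,
                              unsigned int stride_bytes)
{
    unsigned int blocks = nbytes / AES_BLOCK_SIZE;

    if (nbytes < total)   /* more chunks follow: keep stride alignment */
        blocks = round_down(blocks, stride_bytes / AES_BLOCK_SIZE);
    return blocks;
}
```

On the final chunk no rounding is applied, so a tail that is not stride-aligned is still processed in full.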
* [PATCH v5 05/23] crypto: arm64/chacha20 - move kernel mode neon en/disable into loop
@ 2018-03-10 15:21 ` Ard Biesheuvel
From: Ard Biesheuvel @ 2018-03-10 15:21 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.
Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.
So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled).
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/chacha20-neon-glue.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/crypto/chacha20-neon-glue.c b/arch/arm64/crypto/chacha20-neon-glue.c
index cbdb75d15cd0..727579c93ded 100644
--- a/arch/arm64/crypto/chacha20-neon-glue.c
+++ b/arch/arm64/crypto/chacha20-neon-glue.c
@@ -37,12 +37,19 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
u8 buf[CHACHA20_BLOCK_SIZE];
while (bytes >= CHACHA20_BLOCK_SIZE * 4) {
+ kernel_neon_begin();
chacha20_4block_xor_neon(state, dst, src);
+ kernel_neon_end();
bytes -= CHACHA20_BLOCK_SIZE * 4;
src += CHACHA20_BLOCK_SIZE * 4;
dst += CHACHA20_BLOCK_SIZE * 4;
state[12] += 4;
}
+
+ if (!bytes)
+ return;
+
+ kernel_neon_begin();
while (bytes >= CHACHA20_BLOCK_SIZE) {
chacha20_block_xor_neon(state, dst, src);
bytes -= CHACHA20_BLOCK_SIZE;
@@ -55,6 +62,7 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
chacha20_block_xor_neon(state, buf, buf);
memcpy(dst, buf, bytes);
}
+ kernel_neon_end();
}
static int chacha20_neon(struct skcipher_request *req)
@@ -68,11 +76,10 @@ static int chacha20_neon(struct skcipher_request *req)
if (!may_use_simd() || req->cryptlen <= CHACHA20_BLOCK_SIZE)
return crypto_chacha20_crypt(req);
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
crypto_chacha20_init(state, ctx, walk.iv);
- kernel_neon_begin();
while (walk.nbytes > 0) {
unsigned int nbytes = walk.nbytes;
@@ -83,7 +90,6 @@ static int chacha20_neon(struct skcipher_request *req)
nbytes);
err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
}
- kernel_neon_end();
return err;
}
--
2.15.1
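The chacha20_doneon() restructuring above splits the work into two NEON regions: the heavy 4-block fast path toggles NEON per iteration (each call does enough work to amortize the begin/end cost), while the short tail blocks share a single begin/end pair, with an early return so no empty pair is opened. A stand-alone sketch of that control flow (the mock_* names are counting stubs, not kernel APIs):

```c
/* Sketch of the two-region NEON toggling in chacha20_doneon() above. */
#include <assert.h>

#define BLOCK 64   /* CHACHA20_BLOCK_SIZE */

static int neon_depth, toggles;

static void mock_neon_begin(void) { neon_depth++; toggles++; }
static void mock_neon_end(void)   { neon_depth--; }
static void mock_4block(void)     { assert(neon_depth == 1); }
static void mock_1block(void)     { assert(neon_depth == 1); }

void doneon_sketch(unsigned int bytes)
{
    /* Fast path: one begin/end pair per 4-block call. */
    while (bytes >= BLOCK * 4) {
        mock_neon_begin();
        mock_4block();
        mock_neon_end();
        bytes -= BLOCK * 4;
    }

    if (!bytes)
        return;          /* avoid opening an empty begin/end pair */

    /* Tail: all remaining short blocks under one begin/end pair. */
    mock_neon_begin();
    while (bytes >= BLOCK) {
        mock_1block();
        bytes -= BLOCK;
    }
    if (bytes)
        mock_1block();   /* partial final block, via a bounce buffer */
    mock_neon_end();
}
```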
* [PATCH v5 06/23] crypto: arm64/aes-blk - remove configurable interleave
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-10 15:21 ` Ard Biesheuvel
0 siblings, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:21 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
The AES block mode implementation using Crypto Extensions or plain NEON
was written before real hardware existed, and so its interleave factor
was made build-time configurable, along with an option to instantiate
all interleaved sequences inline rather than as subroutines.
We ended up using INTERLEAVE=4 with inlining disabled for both flavors
of the core AES routines, so let's stick with that, and remove the option
to configure this at build time. This makes the code easier to modify,
which is nice now that we're adding yield support.
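The fixed 4-way interleave keeps the control flow of the assembly loops (subtract 4, branch if minus, then a single-block tail loop). A hypothetical C rendering of that loop shape, with counters standing in for the block routines:

```c
#include <assert.h>

static int four_block_calls, one_block_calls;

/* Stand-ins for the 4x-interleaved and single-block AES routines. */
static void encrypt_block4x(void) { four_block_calls++; }
static void encrypt_block(void)   { one_block_calls++; }

/* Same control flow as the .LecbencloopNx/.Lecbencloop pair: consume
 * blocks four at a time, then handle the remainder one block at a time. */
static void ecb_encrypt(int blocks)
{
	while ((blocks -= 4) >= 0)	/* subs w4, w4, #4; bmi ... */
		encrypt_block4x();
	blocks += 4;			/* adds w4, w4, #4 */
	while (blocks-- > 0)
		encrypt_block();
}
```

Hardcoding the factor at 4 means only this one shape has to be annotated when yield points are added.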
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/Makefile | 3 -
arch/arm64/crypto/aes-modes.S | 237 ++++----------------
2 files changed, 40 insertions(+), 200 deletions(-)
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index cee9b8d9830b..b6b624602582 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -59,9 +59,6 @@ aes-arm64-y := aes-cipher-core.o aes-cipher-glue.o
obj-$(CONFIG_CRYPTO_AES_ARM64_BS) += aes-neon-bs.o
aes-neon-bs-y := aes-neonbs-core.o aes-neonbs-glue.o
-AFLAGS_aes-ce.o := -DINTERLEAVE=4
-AFLAGS_aes-neon.o := -DINTERLEAVE=4
-
CFLAGS_aes-glue-ce.o := -DUSE_V8_CRYPTO_EXTENSIONS
$(obj)/aes-glue-%.o: $(src)/aes-glue.c FORCE
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 65b273667b34..27a235b2ddee 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -13,44 +13,6 @@
.text
.align 4
-/*
- * There are several ways to instantiate this code:
- * - no interleave, all inline
- * - 2-way interleave, 2x calls out of line (-DINTERLEAVE=2)
- * - 2-way interleave, all inline (-DINTERLEAVE=2 -DINTERLEAVE_INLINE)
- * - 4-way interleave, 4x calls out of line (-DINTERLEAVE=4)
- * - 4-way interleave, all inline (-DINTERLEAVE=4 -DINTERLEAVE_INLINE)
- *
- * Macros imported by this code:
- * - enc_prepare - setup NEON registers for encryption
- * - dec_prepare - setup NEON registers for decryption
- * - enc_switch_key - change to new key after having prepared for encryption
- * - encrypt_block - encrypt a single block
- * - decrypt block - decrypt a single block
- * - encrypt_block2x - encrypt 2 blocks in parallel (if INTERLEAVE == 2)
- * - decrypt_block2x - decrypt 2 blocks in parallel (if INTERLEAVE == 2)
- * - encrypt_block4x - encrypt 4 blocks in parallel (if INTERLEAVE == 4)
- * - decrypt_block4x - decrypt 4 blocks in parallel (if INTERLEAVE == 4)
- */
-
-#if defined(INTERLEAVE) && !defined(INTERLEAVE_INLINE)
-#define FRAME_PUSH stp x29, x30, [sp,#-16]! ; mov x29, sp
-#define FRAME_POP ldp x29, x30, [sp],#16
-
-#if INTERLEAVE == 2
-
-aes_encrypt_block2x:
- encrypt_block2x v0, v1, w3, x2, x8, w7
- ret
-ENDPROC(aes_encrypt_block2x)
-
-aes_decrypt_block2x:
- decrypt_block2x v0, v1, w3, x2, x8, w7
- ret
-ENDPROC(aes_decrypt_block2x)
-
-#elif INTERLEAVE == 4
-
aes_encrypt_block4x:
encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
ret
@@ -61,48 +23,6 @@ aes_decrypt_block4x:
ret
ENDPROC(aes_decrypt_block4x)
-#else
-#error INTERLEAVE should equal 2 or 4
-#endif
-
- .macro do_encrypt_block2x
- bl aes_encrypt_block2x
- .endm
-
- .macro do_decrypt_block2x
- bl aes_decrypt_block2x
- .endm
-
- .macro do_encrypt_block4x
- bl aes_encrypt_block4x
- .endm
-
- .macro do_decrypt_block4x
- bl aes_decrypt_block4x
- .endm
-
-#else
-#define FRAME_PUSH
-#define FRAME_POP
-
- .macro do_encrypt_block2x
- encrypt_block2x v0, v1, w3, x2, x8, w7
- .endm
-
- .macro do_decrypt_block2x
- decrypt_block2x v0, v1, w3, x2, x8, w7
- .endm
-
- .macro do_encrypt_block4x
- encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
- .endm
-
- .macro do_decrypt_block4x
- decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
- .endm
-
-#endif
-
/*
* aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
* int blocks)
@@ -111,28 +31,21 @@ ENDPROC(aes_decrypt_block4x)
*/
AES_ENTRY(aes_ecb_encrypt)
- FRAME_PUSH
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
enc_prepare w3, x2, x5
.LecbencloopNx:
-#if INTERLEAVE >= 2
- subs w4, w4, #INTERLEAVE
+ subs w4, w4, #4
bmi .Lecbenc1x
-#if INTERLEAVE == 2
- ld1 {v0.16b-v1.16b}, [x1], #32 /* get 2 pt blocks */
- do_encrypt_block2x
- st1 {v0.16b-v1.16b}, [x0], #32
-#else
ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
- do_encrypt_block4x
+ bl aes_encrypt_block4x
st1 {v0.16b-v3.16b}, [x0], #64
-#endif
b .LecbencloopNx
.Lecbenc1x:
- adds w4, w4, #INTERLEAVE
+ adds w4, w4, #4
beq .Lecbencout
-#endif
.Lecbencloop:
ld1 {v0.16b}, [x1], #16 /* get next pt block */
encrypt_block v0, w3, x2, x5, w6
@@ -140,34 +53,27 @@ AES_ENTRY(aes_ecb_encrypt)
subs w4, w4, #1
bne .Lecbencloop
.Lecbencout:
- FRAME_POP
+ ldp x29, x30, [sp], #16
ret
AES_ENDPROC(aes_ecb_encrypt)
AES_ENTRY(aes_ecb_decrypt)
- FRAME_PUSH
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
dec_prepare w3, x2, x5
.LecbdecloopNx:
-#if INTERLEAVE >= 2
- subs w4, w4, #INTERLEAVE
+ subs w4, w4, #4
bmi .Lecbdec1x
-#if INTERLEAVE == 2
- ld1 {v0.16b-v1.16b}, [x1], #32 /* get 2 ct blocks */
- do_decrypt_block2x
- st1 {v0.16b-v1.16b}, [x0], #32
-#else
ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
- do_decrypt_block4x
+ bl aes_decrypt_block4x
st1 {v0.16b-v3.16b}, [x0], #64
-#endif
b .LecbdecloopNx
.Lecbdec1x:
- adds w4, w4, #INTERLEAVE
+ adds w4, w4, #4
beq .Lecbdecout
-#endif
.Lecbdecloop:
ld1 {v0.16b}, [x1], #16 /* get next ct block */
decrypt_block v0, w3, x2, x5, w6
@@ -175,7 +81,7 @@ AES_ENTRY(aes_ecb_decrypt)
subs w4, w4, #1
bne .Lecbdecloop
.Lecbdecout:
- FRAME_POP
+ ldp x29, x30, [sp], #16
ret
AES_ENDPROC(aes_ecb_decrypt)
@@ -204,30 +110,20 @@ AES_ENDPROC(aes_cbc_encrypt)
AES_ENTRY(aes_cbc_decrypt)
- FRAME_PUSH
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
ld1 {v7.16b}, [x5] /* get iv */
dec_prepare w3, x2, x6
.LcbcdecloopNx:
-#if INTERLEAVE >= 2
- subs w4, w4, #INTERLEAVE
+ subs w4, w4, #4
bmi .Lcbcdec1x
-#if INTERLEAVE == 2
- ld1 {v0.16b-v1.16b}, [x1], #32 /* get 2 ct blocks */
- mov v2.16b, v0.16b
- mov v3.16b, v1.16b
- do_decrypt_block2x
- eor v0.16b, v0.16b, v7.16b
- eor v1.16b, v1.16b, v2.16b
- mov v7.16b, v3.16b
- st1 {v0.16b-v1.16b}, [x0], #32
-#else
ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
mov v4.16b, v0.16b
mov v5.16b, v1.16b
mov v6.16b, v2.16b
- do_decrypt_block4x
+ bl aes_decrypt_block4x
sub x1, x1, #16
eor v0.16b, v0.16b, v7.16b
eor v1.16b, v1.16b, v4.16b
@@ -235,12 +131,10 @@ AES_ENTRY(aes_cbc_decrypt)
eor v2.16b, v2.16b, v5.16b
eor v3.16b, v3.16b, v6.16b
st1 {v0.16b-v3.16b}, [x0], #64
-#endif
b .LcbcdecloopNx
.Lcbcdec1x:
- adds w4, w4, #INTERLEAVE
+ adds w4, w4, #4
beq .Lcbcdecout
-#endif
.Lcbcdecloop:
ld1 {v1.16b}, [x1], #16 /* get next ct block */
mov v0.16b, v1.16b /* ...and copy to v0 */
@@ -251,8 +145,8 @@ AES_ENTRY(aes_cbc_decrypt)
subs w4, w4, #1
bne .Lcbcdecloop
.Lcbcdecout:
- FRAME_POP
st1 {v7.16b}, [x5] /* return iv */
+ ldp x29, x30, [sp], #16
ret
AES_ENDPROC(aes_cbc_decrypt)
@@ -263,34 +157,19 @@ AES_ENDPROC(aes_cbc_decrypt)
*/
AES_ENTRY(aes_ctr_encrypt)
- FRAME_PUSH
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
enc_prepare w3, x2, x6
ld1 {v4.16b}, [x5]
umov x6, v4.d[1] /* keep swabbed ctr in reg */
rev x6, x6
-#if INTERLEAVE >= 2
cmn w6, w4 /* 32 bit overflow? */
bcs .Lctrloop
.LctrloopNx:
- subs w4, w4, #INTERLEAVE
+ subs w4, w4, #4
bmi .Lctr1x
-#if INTERLEAVE == 2
- mov v0.8b, v4.8b
- mov v1.8b, v4.8b
- rev x7, x6
- add x6, x6, #1
- ins v0.d[1], x7
- rev x7, x6
- add x6, x6, #1
- ins v1.d[1], x7
- ld1 {v2.16b-v3.16b}, [x1], #32 /* get 2 input blocks */
- do_encrypt_block2x
- eor v0.16b, v0.16b, v2.16b
- eor v1.16b, v1.16b, v3.16b
- st1 {v0.16b-v1.16b}, [x0], #32
-#else
ldr q8, =0x30000000200000001 /* addends 1,2,3[,0] */
dup v7.4s, w6
mov v0.16b, v4.16b
@@ -303,23 +182,21 @@ AES_ENTRY(aes_ctr_encrypt)
mov v2.s[3], v8.s[1]
mov v3.s[3], v8.s[2]
ld1 {v5.16b-v7.16b}, [x1], #48 /* get 3 input blocks */
- do_encrypt_block4x
+ bl aes_encrypt_block4x
eor v0.16b, v5.16b, v0.16b
ld1 {v5.16b}, [x1], #16 /* get 1 input block */
eor v1.16b, v6.16b, v1.16b
eor v2.16b, v7.16b, v2.16b
eor v3.16b, v5.16b, v3.16b
st1 {v0.16b-v3.16b}, [x0], #64
- add x6, x6, #INTERLEAVE
-#endif
+ add x6, x6, #4
rev x7, x6
ins v4.d[1], x7
cbz w4, .Lctrout
b .LctrloopNx
.Lctr1x:
- adds w4, w4, #INTERLEAVE
+ adds w4, w4, #4
beq .Lctrout
-#endif
.Lctrloop:
mov v0.16b, v4.16b
encrypt_block v0, w3, x2, x8, w7
@@ -339,12 +216,12 @@ AES_ENTRY(aes_ctr_encrypt)
.Lctrout:
st1 {v4.16b}, [x5] /* return next CTR value */
- FRAME_POP
+ ldp x29, x30, [sp], #16
ret
.Lctrtailblock:
st1 {v0.16b}, [x0]
- FRAME_POP
+ ldp x29, x30, [sp], #16
ret
.Lctrcarry:
@@ -378,7 +255,9 @@ CPU_LE( .quad 1, 0x87 )
CPU_BE( .quad 0x87, 1 )
AES_ENTRY(aes_xts_encrypt)
- FRAME_PUSH
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
+
ld1 {v4.16b}, [x6]
cbz w7, .Lxtsencnotfirst
@@ -394,25 +273,8 @@ AES_ENTRY(aes_xts_encrypt)
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
.LxtsencNx:
-#if INTERLEAVE >= 2
- subs w4, w4, #INTERLEAVE
+ subs w4, w4, #4
bmi .Lxtsenc1x
-#if INTERLEAVE == 2
- ld1 {v0.16b-v1.16b}, [x1], #32 /* get 2 pt blocks */
- next_tweak v5, v4, v7, v8
- eor v0.16b, v0.16b, v4.16b
- eor v1.16b, v1.16b, v5.16b
- do_encrypt_block2x
- eor v0.16b, v0.16b, v4.16b
- eor v1.16b, v1.16b, v5.16b
- st1 {v0.16b-v1.16b}, [x0], #32
- cbz w4, .LxtsencoutNx
- next_tweak v4, v5, v7, v8
- b .LxtsencNx
-.LxtsencoutNx:
- mov v4.16b, v5.16b
- b .Lxtsencout
-#else
ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
next_tweak v5, v4, v7, v8
eor v0.16b, v0.16b, v4.16b
@@ -421,7 +283,7 @@ AES_ENTRY(aes_xts_encrypt)
eor v2.16b, v2.16b, v6.16b
next_tweak v7, v6, v7, v8
eor v3.16b, v3.16b, v7.16b
- do_encrypt_block4x
+ bl aes_encrypt_block4x
eor v3.16b, v3.16b, v7.16b
eor v0.16b, v0.16b, v4.16b
eor v1.16b, v1.16b, v5.16b
@@ -430,11 +292,9 @@ AES_ENTRY(aes_xts_encrypt)
mov v4.16b, v7.16b
cbz w4, .Lxtsencout
b .LxtsencloopNx
-#endif
.Lxtsenc1x:
- adds w4, w4, #INTERLEAVE
+ adds w4, w4, #4
beq .Lxtsencout
-#endif
.Lxtsencloop:
ld1 {v1.16b}, [x1], #16
eor v0.16b, v1.16b, v4.16b
@@ -447,13 +307,15 @@ AES_ENTRY(aes_xts_encrypt)
b .Lxtsencloop
.Lxtsencout:
st1 {v4.16b}, [x6]
- FRAME_POP
+ ldp x29, x30, [sp], #16
ret
AES_ENDPROC(aes_xts_encrypt)
AES_ENTRY(aes_xts_decrypt)
- FRAME_PUSH
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
+
ld1 {v4.16b}, [x6]
cbz w7, .Lxtsdecnotfirst
@@ -469,25 +331,8 @@ AES_ENTRY(aes_xts_decrypt)
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
.LxtsdecNx:
-#if INTERLEAVE >= 2
- subs w4, w4, #INTERLEAVE
+ subs w4, w4, #4
bmi .Lxtsdec1x
-#if INTERLEAVE == 2
- ld1 {v0.16b-v1.16b}, [x1], #32 /* get 2 ct blocks */
- next_tweak v5, v4, v7, v8
- eor v0.16b, v0.16b, v4.16b
- eor v1.16b, v1.16b, v5.16b
- do_decrypt_block2x
- eor v0.16b, v0.16b, v4.16b
- eor v1.16b, v1.16b, v5.16b
- st1 {v0.16b-v1.16b}, [x0], #32
- cbz w4, .LxtsdecoutNx
- next_tweak v4, v5, v7, v8
- b .LxtsdecNx
-.LxtsdecoutNx:
- mov v4.16b, v5.16b
- b .Lxtsdecout
-#else
ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
next_tweak v5, v4, v7, v8
eor v0.16b, v0.16b, v4.16b
@@ -496,7 +341,7 @@ AES_ENTRY(aes_xts_decrypt)
eor v2.16b, v2.16b, v6.16b
next_tweak v7, v6, v7, v8
eor v3.16b, v3.16b, v7.16b
- do_decrypt_block4x
+ bl aes_decrypt_block4x
eor v3.16b, v3.16b, v7.16b
eor v0.16b, v0.16b, v4.16b
eor v1.16b, v1.16b, v5.16b
@@ -505,11 +350,9 @@ AES_ENTRY(aes_xts_decrypt)
mov v4.16b, v7.16b
cbz w4, .Lxtsdecout
b .LxtsdecloopNx
-#endif
.Lxtsdec1x:
- adds w4, w4, #INTERLEAVE
+ adds w4, w4, #4
beq .Lxtsdecout
-#endif
.Lxtsdecloop:
ld1 {v1.16b}, [x1], #16
eor v0.16b, v1.16b, v4.16b
@@ -522,7 +365,7 @@ AES_ENTRY(aes_xts_decrypt)
b .Lxtsdecloop
.Lxtsdecout:
st1 {v4.16b}, [x6]
- FRAME_POP
+ ldp x29, x30, [sp], #16
ret
AES_ENDPROC(aes_xts_decrypt)
--
2.15.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v5 07/23] crypto: arm64/aes-blk - add 4 way interleave to CBC encrypt path
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-10 15:21 ` Ard Biesheuvel
0 siblings, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:21 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
CBC encryption is strictly sequential, and so the current AES code
simply processes the input one block at a time. However, we are
about to add yield support, which adds a bit of overhead, and which
we prefer to align with the other modes in terms of granularity (i.e.,
it is better to have all routines yield every 64 bytes than to make an
exception for CBC encrypt, which would yield every 16 bytes).
So unroll the loop by 4. We still cannot perform the AES algorithm in
parallel, but we can at least merge the loads and stores.
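The data dependency that forces sequential processing is easy to see in C. A sketch with a toy one-byte "cipher" standing in for AES (purely illustrative): each ciphertext block feeds the next encryption, so only the loads and stores can be batched, not the cipher invocations.

```c
#include <assert.h>
#include <stddef.h>

/* Toy one-byte block cipher standing in for AES (illustration only). */
static unsigned char toy_encrypt(unsigned char b)
{
	return (unsigned char)(b * 7 + 3);
}

/* CBC encrypt: c[i] = E(p[i] ^ c[i-1]), with c[-1] = iv.  Block i
 * cannot be encrypted before block i-1 finishes, which is why the
 * 4-way unroll merges loads/stores but keeps the encryptions serial. */
static void cbc_encrypt(unsigned char *c, const unsigned char *p,
                        size_t n, unsigned char iv)
{
	unsigned char prev = iv;

	for (size_t i = 0; i < n; i++) {
		c[i] = toy_encrypt(p[i] ^ prev);
		prev = c[i];
	}
}
```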
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-modes.S | 31 ++++++++++++++++----
1 file changed, 25 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 27a235b2ddee..e86535a1329d 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -94,17 +94,36 @@ AES_ENDPROC(aes_ecb_decrypt)
*/
AES_ENTRY(aes_cbc_encrypt)
- ld1 {v0.16b}, [x5] /* get iv */
+ ld1 {v4.16b}, [x5] /* get iv */
enc_prepare w3, x2, x6
-.Lcbcencloop:
- ld1 {v1.16b}, [x1], #16 /* get next pt block */
- eor v0.16b, v0.16b, v1.16b /* ..and xor with iv */
+.Lcbcencloop4x:
+ subs w4, w4, #4
+ bmi .Lcbcenc1x
+ ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
+ eor v0.16b, v0.16b, v4.16b /* ..and xor with iv */
encrypt_block v0, w3, x2, x6, w7
- st1 {v0.16b}, [x0], #16
+ eor v1.16b, v1.16b, v0.16b
+ encrypt_block v1, w3, x2, x6, w7
+ eor v2.16b, v2.16b, v1.16b
+ encrypt_block v2, w3, x2, x6, w7
+ eor v3.16b, v3.16b, v2.16b
+ encrypt_block v3, w3, x2, x6, w7
+ st1 {v0.16b-v3.16b}, [x0], #64
+ mov v4.16b, v3.16b
+ b .Lcbcencloop4x
+.Lcbcenc1x:
+ adds w4, w4, #4
+ beq .Lcbcencout
+.Lcbcencloop:
+ ld1 {v0.16b}, [x1], #16 /* get next pt block */
+ eor v4.16b, v4.16b, v0.16b /* ..and xor with iv */
+ encrypt_block v4, w3, x2, x6, w7
+ st1 {v4.16b}, [x0], #16
subs w4, w4, #1
bne .Lcbcencloop
- st1 {v0.16b}, [x5] /* return iv */
+.Lcbcencout:
+ st1 {v4.16b}, [x5] /* return iv */
ret
AES_ENDPROC(aes_cbc_encrypt)
--
2.15.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v5 08/23] crypto: arm64/aes-blk - add 4 way interleave to CBC-MAC encrypt path
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-10 15:21 ` Ard Biesheuvel
0 siblings, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:21 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
CBC-MAC is strictly sequential, and so the current AES code simply
processes the input one block at a time. However, we are about to add
yield support, which adds a bit of overhead, and which we prefer to
align with the other modes in terms of granularity (i.e., it is better
to have all routines yield every 64 bytes than to make an exception for
CBC-MAC, which would yield every 16 bytes).
So unroll the loop by 4. We still cannot perform the AES algorithm in
parallel, but we can at least merge the loads and stores.
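CBC-MAC folds every input block into a single running digest, so even when four message blocks are loaded at once (as in the unrolled .Lmacloop4x), the per-block encryptions stay chained. A hypothetical sketch, again with a toy cipher in place of AES:

```c
#include <assert.h>
#include <stddef.h>

/* Toy one-byte block cipher standing in for AES (illustration only). */
static unsigned char toy_encrypt(unsigned char b)
{
	return (unsigned char)(b ^ 0x5c);
}

/* CBC-MAC: dg = E(dg ^ m[i]) for each block.  The digest is one chained
 * value, so the encryptions are inherently serial; the 4-way unroll can
 * only merge the message loads. */
static unsigned char cbc_mac(const unsigned char *m, size_t n)
{
	unsigned char dg = 0;

	for (size_t i = 0; i < n; i++)
		dg = toy_encrypt(dg ^ m[i]);
	return dg;
}
```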
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-modes.S | 23 ++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index e86535a1329d..a68412e1e3a4 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -395,8 +395,28 @@ AES_ENDPROC(aes_xts_decrypt)
AES_ENTRY(aes_mac_update)
ld1 {v0.16b}, [x4] /* get dg */
enc_prepare w2, x1, x7
- cbnz w5, .Lmacenc
+ cbz w5, .Lmacloop4x
+ encrypt_block v0, w2, x1, x7, w8
+
+.Lmacloop4x:
+ subs w3, w3, #4
+ bmi .Lmac1x
+ ld1 {v1.16b-v4.16b}, [x0], #64 /* get next pt block */
+ eor v0.16b, v0.16b, v1.16b /* ..and xor with dg */
+ encrypt_block v0, w2, x1, x7, w8
+ eor v0.16b, v0.16b, v2.16b
+ encrypt_block v0, w2, x1, x7, w8
+ eor v0.16b, v0.16b, v3.16b
+ encrypt_block v0, w2, x1, x7, w8
+ eor v0.16b, v0.16b, v4.16b
+ cmp w3, wzr
+ csinv x5, x6, xzr, eq
+ cbz w5, .Lmacout
+ encrypt_block v0, w2, x1, x7, w8
+ b .Lmacloop4x
+.Lmac1x:
+ add w3, w3, #4
.Lmacloop:
cbz w3, .Lmacout
ld1 {v1.16b}, [x0], #16 /* get next pt block */
@@ -406,7 +426,6 @@ AES_ENTRY(aes_mac_update)
csinv x5, x6, xzr, eq
cbz w5, .Lmacout
-.Lmacenc:
encrypt_block v0, w2, x1, x7, w8
b .Lmacloop
--
2.15.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v5 09/23] crypto: arm64/sha256-neon - play nice with CONFIG_PREEMPT kernels
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-10 15:21 ` Ard Biesheuvel
0 siblings, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:21 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
Tweak the SHA256 update routines to invoke the SHA256 block transform
block by block, to avoid excessive scheduling delays caused by the
NEON algorithm running with preemption disabled.
Also, remove a stale comment which no longer applies now that kernel
mode NEON is actually disallowed in some contexts.
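The chunking rule used below can be stated on its own: each kernel_neon_begin()/kernel_neon_end() section processes just enough input to complete the current 64-byte block, so a preemption point occurs at every block boundary. A standalone sketch of that computation (same arithmetic as the patch, outside any kernel context):

```c
#include <assert.h>
#include <stddef.h>

#define SHA256_BLOCK_SIZE 64

/* Given the bytes hashed so far (count) and the remaining input
 * length, return how much to feed to the NEON block function before
 * the next preemption point: enough to finish the partially filled
 * block, or everything if it already fits in one block. */
static unsigned int next_chunk(size_t count, unsigned int len)
{
	unsigned int partial = count % SHA256_BLOCK_SIZE;

	if (len + partial > SHA256_BLOCK_SIZE)
		return SHA256_BLOCK_SIZE - partial;
	return len;
}
```

With an empty buffer and a large update, each iteration hands over exactly one block; with ten bytes already buffered, the first chunk is 54 bytes so the yield points stay block-aligned.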
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/sha256-glue.c | 36 +++++++++++++-------
1 file changed, 23 insertions(+), 13 deletions(-)
diff --git a/arch/arm64/crypto/sha256-glue.c b/arch/arm64/crypto/sha256-glue.c
index b064d925fe2a..e8880ccdc71f 100644
--- a/arch/arm64/crypto/sha256-glue.c
+++ b/arch/arm64/crypto/sha256-glue.c
@@ -89,21 +89,32 @@ static struct shash_alg algs[] = { {
static int sha256_update_neon(struct shash_desc *desc, const u8 *data,
unsigned int len)
{
- /*
- * Stacking and unstacking a substantial slice of the NEON register
- * file may significantly affect performance for small updates when
- * executing in interrupt context, so fall back to the scalar code
- * in that case.
- */
+ struct sha256_state *sctx = shash_desc_ctx(desc);
+
if (!may_use_simd())
return sha256_base_do_update(desc, data, len,
(sha256_block_fn *)sha256_block_data_order);
- kernel_neon_begin();
- sha256_base_do_update(desc, data, len,
- (sha256_block_fn *)sha256_block_neon);
- kernel_neon_end();
+ while (len > 0) {
+ unsigned int chunk = len;
+
+ /*
+ * Don't hog the CPU for the entire time it takes to process all
+ * input when running on a preemptible kernel, but process the
+ * data block by block instead.
+ */
+ if (IS_ENABLED(CONFIG_PREEMPT) &&
+ chunk + sctx->count % SHA256_BLOCK_SIZE > SHA256_BLOCK_SIZE)
+ chunk = SHA256_BLOCK_SIZE -
+ sctx->count % SHA256_BLOCK_SIZE;
+ kernel_neon_begin();
+ sha256_base_do_update(desc, data, chunk,
+ (sha256_block_fn *)sha256_block_neon);
+ kernel_neon_end();
+ data += chunk;
+ len -= chunk;
+ }
return 0;
}
@@ -117,10 +128,9 @@ static int sha256_finup_neon(struct shash_desc *desc, const u8 *data,
sha256_base_do_finalize(desc,
(sha256_block_fn *)sha256_block_data_order);
} else {
- kernel_neon_begin();
if (len)
- sha256_base_do_update(desc, data, len,
- (sha256_block_fn *)sha256_block_neon);
+ sha256_update_neon(desc, data, len);
+ kernel_neon_begin();
sha256_base_do_finalize(desc,
(sha256_block_fn *)sha256_block_neon);
kernel_neon_end();
--
2.15.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v5 10/23] arm64: assembler: add utility macros to push/pop stack frames
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-10 15:21 ` Ard Biesheuvel
0 siblings, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:21 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
We are going to add code to all the NEON crypto routines that will
turn them into non-leaf functions, so we need to manage the stack
frames. To make this less tedious and error prone, add some macros
that take the number of callee saved registers to preserve and the
extra size to allocate in the stack frame (for locals) and emit
the ldp/stp sequences.
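The frame size arithmetic in the macros is easy to check in isolation. A small sketch (C restatement of the .Lframe_local_offset expression, not kernel code): x29/x30 always take one 16-byte slot, and each pair of callee saved registers takes another, with the `+ 3` rounding an odd regcount up so sp stays 16-byte aligned.

```c
#include <assert.h>

/* Bytes reserved by frame_push below the saved registers, i.e. the
 * offset of the "extra" local area: ((regcount + 3) / 2) * 16. */
static int frame_local_offset(int regcount)
{
	return ((regcount + 3) / 2) * 16;
}
```

So regcount 0 reserves 16 bytes (just the x29/x30 pair), regcount 1 and 2 both reserve 32, and the maximum of 10 (x19..x28) reserves 96.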
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/include/asm/assembler.h | 70 ++++++++++++++++++++
1 file changed, 70 insertions(+)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index 053d83e8db6f..eef1fd2c1c0b 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -555,6 +555,19 @@ USER(\label, ic ivau, \tmp2) // invalidate I line PoU
#endif
.endm
+/*
+ * Errata workaround post TTBR0_EL1 update.
+ */
+ .macro post_ttbr0_update_workaround
+#ifdef CONFIG_CAVIUM_ERRATUM_27456
+alternative_if ARM64_WORKAROUND_CAVIUM_27456
+ ic iallu
+ dsb nsh
+ isb
+alternative_else_nop_endif
+#endif
+ .endm
+
/**
* Errata workaround prior to disable MMU. Insert an ISB immediately prior
* to executing the MSR that will change SCTLR_ELn[M] from a value of 1 to 0.
@@ -565,4 +578,61 @@ USER(\label, ic ivau, \tmp2) // invalidate I line PoU
#endif
.endm
+ /*
+ * frame_push - Push @regcount callee saved registers to the stack,
+ * starting at x19, as well as x29/x30, and set x29 to
+ * the new value of sp. Add @extra bytes of stack space
+ * for locals.
+ */
+ .macro frame_push, regcount:req, extra
+ __frame st, \regcount, \extra
+ .endm
+
+ /*
+ * frame_pop - Pop the callee saved registers from the stack that were
+ * pushed in the most recent call to frame_push, as well
+ * as x29/x30 and any extra stack space that may have been
+ * allocated.
+ */
+ .macro frame_pop
+ __frame ld
+ .endm
+
+ .macro __frame_regs, reg1, reg2, op, num
+ .if .Lframe_regcount == \num
+ \op\()r \reg1, [sp, #(\num + 1) * 8]
+ .elseif .Lframe_regcount > \num
+ \op\()p \reg1, \reg2, [sp, #(\num + 1) * 8]
+ .endif
+ .endm
+
+ .macro __frame, op, regcount, extra=0
+ .ifc \op, st
+ .if (\regcount) < 0 || (\regcount) > 10
+ .error "regcount should be in the range [0 ... 10]"
+ .endif
+ .if ((\extra) % 16) != 0
+ .error "extra should be a multiple of 16 bytes"
+ .endif
+ .set .Lframe_regcount, \regcount
+ .set .Lframe_extra, \extra
+ .set .Lframe_local_offset, ((\regcount + 3) / 2) * 16
+ stp x29, x30, [sp, #-.Lframe_local_offset - .Lframe_extra]!
+ mov x29, sp
+ .elseif .Lframe_regcount == -1 // && op == 'ld'
+ .error "frame_push/frame_pop may not be nested"
+ .endif
+
+ __frame_regs x19, x20, \op, 1
+ __frame_regs x21, x22, \op, 3
+ __frame_regs x23, x24, \op, 5
+ __frame_regs x25, x26, \op, 7
+ __frame_regs x27, x28, \op, 9
+
+ .ifc \op, ld
+ ldp x29, x30, [sp], #.Lframe_local_offset + .Lframe_extra
+ .set .Lframe_regcount, -1
+ .endif
+ .endm
+
#endif /* __ASM_ASSEMBLER_H */
--
2.15.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v5 11/23] arm64: assembler: add macros to conditionally yield the NEON under PREEMPT
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-10 15:21 ` Ard Biesheuvel
0 siblings, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:21 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
Add support macros to conditionally yield the NEON (and thus the CPU)
that may be called from the assembler code.
In some cases, yielding the NEON involves saving and restoring a
non-trivial amount of context (especially in the CRC folding algorithms),
and so the macro is split into three, and the code in between is only
executed when the yield path is taken, allowing the context to be preserved.
The third macro takes an optional label argument that marks the resume
path after a yield has been performed.
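The decision taken by if_will_cond_yield_neon can be modelled in C. This is a sketch of the predicate only, with illustrative constant values (the real PREEMPT_DISABLE_OFFSET and TIF_NEED_RESCHED come from asm-offsets and the arch headers): yield only when dropping the preempt count once would make the task preemptible, and the scheduler has actually asked for a reschedule.

```c
#include <assert.h>

/* Illustrative values; the kernel supplies the real ones. */
#define PREEMPT_DISABLE_OFFSET	1
#define TIF_NEED_RESCHED	1

/* Mirrors the cmp/csel/tbnz sequence: a nested preempt_disable()
 * makes yielding pointless, so the flags are only consulted when the
 * count is exactly PREEMPT_DISABLE_OFFSET. */
static int should_yield_neon(unsigned int preempt_count,
			     unsigned long ti_flags)
{
	if (preempt_count != PREEMPT_DISABLE_OFFSET)
		return 0;
	return !!(ti_flags & (1UL << TIF_NEED_RESCHED));
}
```

When the predicate fires, do_cond_yield_neon calls kernel_neon_end()/kernel_neon_begin(), and the re-enable of preemption in between is what triggers the reschedule.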
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/include/asm/assembler.h | 64 ++++++++++++++++++++
arch/arm64/kernel/asm-offsets.c | 2 +
2 files changed, 66 insertions(+)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index eef1fd2c1c0b..61168cbe9781 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -635,4 +635,68 @@ alternative_else_nop_endif
.endif
.endm
+/*
+ * Check whether to yield to another runnable task from kernel mode NEON code
+ * (which runs with preemption disabled).
+ *
+ * if_will_cond_yield_neon
+ * // pre-yield patchup code
+ * do_cond_yield_neon
+ * // post-yield patchup code
+ * endif_yield_neon <label>
+ *
+ * where <label> is optional, and marks the point where execution will resume
+ * after a yield has been performed. If omitted, execution resumes right after
+ * the endif_yield_neon invocation.
+ *
+ * Note that the patchup code does not support assembler directives that change
+ * the output section, any use of such directives is undefined.
+ *
+ * The yield itself consists of the following:
+ * - Check whether the preempt count is exactly 1, in which case disabling
+ * preemption once will make the task preemptible. If this is not the case,
+ * yielding is pointless.
+ * - Check whether TIF_NEED_RESCHED is set, and if so, disable and re-enable
+ * kernel mode NEON (which will trigger a reschedule), and branch to the
+ * yield fixup code.
+ *
+ * This macro sequence clobbers x0, x1 and the flags register unconditionally,
+ * and may clobber x2 .. x18 if the yield path is taken.
+ */
+
+ .macro cond_yield_neon, lbl
+ if_will_cond_yield_neon
+ do_cond_yield_neon
+ endif_yield_neon \lbl
+ .endm
+
+ .macro if_will_cond_yield_neon
+#ifdef CONFIG_PREEMPT
+ get_thread_info x0
+ ldr w1, [x0, #TSK_TI_PREEMPT]
+ ldr x0, [x0, #TSK_TI_FLAGS]
+ cmp w1, #PREEMPT_DISABLE_OFFSET
+ csel x0, x0, xzr, eq
+ tbnz x0, #TIF_NEED_RESCHED, .Lyield_\@ // needs rescheduling?
+#endif
+ /* fall through to endif_yield_neon */
+ .subsection 1
+.Lyield_\@ :
+ .endm
+
+ .macro do_cond_yield_neon
+ bl kernel_neon_end
+ bl kernel_neon_begin
+ .endm
+
+ .macro endif_yield_neon, lbl
+ .ifnb \lbl
+ b \lbl
+ .else
+ b .Lyield_out_\@
+ .endif
+ .previous
+.Lyield_out_\@ :
+ .endm
+
#endif /* __ASM_ASSEMBLER_H */
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 1303e04110cd..1e2ea2e51acb 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -93,6 +93,8 @@ int main(void)
DEFINE(DMA_TO_DEVICE, DMA_TO_DEVICE);
DEFINE(DMA_FROM_DEVICE, DMA_FROM_DEVICE);
BLANK();
+ DEFINE(PREEMPT_DISABLE_OFFSET, PREEMPT_DISABLE_OFFSET);
+ BLANK();
DEFINE(CLOCK_REALTIME, CLOCK_REALTIME);
DEFINE(CLOCK_MONOTONIC, CLOCK_MONOTONIC);
DEFINE(CLOCK_MONOTONIC_RAW, CLOCK_MONOTONIC_RAW);
--
2.15.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v5 12/23] crypto: arm64/sha1-ce - yield NEON after every block of input
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-10 15:21 ` Ard Biesheuvel
0 siblings, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:21 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.
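The restructured control flow below can be sketched in C: state lives in memory, each completed block is a potential yield point where the digest is spilled, and after a yield execution re-enters at the top so state and round constants are reloaded. The helpers here (maybe_yield, process_block) are stand-ins for the assembly, and the toy transform is not SHA-1 — the sketch only shows that taking the yield path at every opportunity cannot change the result.

```c
#include <assert.h>

static int yields_taken;

/* Stand-in for if_will_cond_yield_neon; assumption: always yield,
 * to exercise the spill/reload path on every block boundary. */
static int maybe_yield(void)
{
	yields_taken++;
	return 1;
}

/* Toy "block transform" folding one block into the digest. */
static unsigned int process_block(unsigned int dg, unsigned int blk)
{
	return dg * 33 + blk;
}

/* Shape of sha1_ce_transform after the patch: the digest (dg) is
 * conceptually in memory, reloaded at the loop head (label 0:), and
 * each finished block (labels 1:/2:) may spill it and yield. */
static unsigned int run_blocks(unsigned int dg, const unsigned int *in,
			       int blocks)
{
	while (blocks > 0) {
		dg = process_block(dg, *in++);
		blocks--;
		if (blocks && maybe_yield())
			continue;	/* b 0b: reload state, go again */
	}
	return dg;
}
```

With four blocks, three yields are taken (none after the final block) and the digest matches a straight-line fold of the same input.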
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/sha1-ce-core.S | 42 ++++++++++++++------
1 file changed, 29 insertions(+), 13 deletions(-)
diff --git a/arch/arm64/crypto/sha1-ce-core.S b/arch/arm64/crypto/sha1-ce-core.S
index 46049850727d..78eb35fb5056 100644
--- a/arch/arm64/crypto/sha1-ce-core.S
+++ b/arch/arm64/crypto/sha1-ce-core.S
@@ -69,30 +69,36 @@
* int blocks)
*/
ENTRY(sha1_ce_transform)
+ frame_push 3
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+
/* load round constants */
- loadrc k0.4s, 0x5a827999, w6
+0: loadrc k0.4s, 0x5a827999, w6
loadrc k1.4s, 0x6ed9eba1, w6
loadrc k2.4s, 0x8f1bbcdc, w6
loadrc k3.4s, 0xca62c1d6, w6
/* load state */
- ld1 {dgav.4s}, [x0]
- ldr dgb, [x0, #16]
+ ld1 {dgav.4s}, [x19]
+ ldr dgb, [x19, #16]
/* load sha1_ce_state::finalize */
ldr_l w4, sha1_ce_offsetof_finalize, x4
- ldr w4, [x0, x4]
+ ldr w4, [x19, x4]
/* load input */
-0: ld1 {v8.4s-v11.4s}, [x1], #64
- sub w2, w2, #1
+1: ld1 {v8.4s-v11.4s}, [x20], #64
+ sub w21, w21, #1
CPU_LE( rev32 v8.16b, v8.16b )
CPU_LE( rev32 v9.16b, v9.16b )
CPU_LE( rev32 v10.16b, v10.16b )
CPU_LE( rev32 v11.16b, v11.16b )
-1: add t0.4s, v8.4s, k0.4s
+2: add t0.4s, v8.4s, k0.4s
mov dg0v.16b, dgav.16b
add_update c, ev, k0, 8, 9, 10, 11, dgb
@@ -123,16 +129,25 @@ CPU_LE( rev32 v11.16b, v11.16b )
add dgbv.2s, dgbv.2s, dg1v.2s
add dgav.4s, dgav.4s, dg0v.4s
- cbnz w2, 0b
+ cbz w21, 3f
+
+ if_will_cond_yield_neon
+ st1 {dgav.4s}, [x19]
+ str dgb, [x19, #16]
+ do_cond_yield_neon
+ b 0b
+ endif_yield_neon
+
+ b 1b
/*
* Final block: add padding and total bit count.
* Skip if the input size was not a round multiple of the block size,
* the padding is handled by the C code in that case.
*/
- cbz x4, 3f
+3: cbz x4, 4f
ldr_l w4, sha1_ce_offsetof_count, x4
- ldr x4, [x0, x4]
+ ldr x4, [x19, x4]
movi v9.2d, #0
mov x8, #0x80000000
movi v10.2d, #0
@@ -141,10 +156,11 @@ CPU_LE( rev32 v11.16b, v11.16b )
mov x4, #0
mov v11.d[0], xzr
mov v11.d[1], x7
- b 1b
+ b 2b
/* store new state */
-3: st1 {dgav.4s}, [x0]
- str dgb, [x0, #16]
+4: st1 {dgav.4s}, [x19]
+ str dgb, [x19, #16]
+ frame_pop
ret
ENDPROC(sha1_ce_transform)
--
2.15.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v5 13/23] crypto: arm64/sha2-ce - yield NEON after every block of input
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-10 15:21 ` Ard Biesheuvel
0 siblings, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:21 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/sha2-ce-core.S | 37 ++++++++++++++------
1 file changed, 26 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/crypto/sha2-ce-core.S b/arch/arm64/crypto/sha2-ce-core.S
index 4c3c89b812ce..cd8b36412469 100644
--- a/arch/arm64/crypto/sha2-ce-core.S
+++ b/arch/arm64/crypto/sha2-ce-core.S
@@ -79,30 +79,36 @@
*/
.text
ENTRY(sha2_ce_transform)
+ frame_push 3
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+
/* load round constants */
- adr_l x8, .Lsha2_rcon
+0: adr_l x8, .Lsha2_rcon
ld1 { v0.4s- v3.4s}, [x8], #64
ld1 { v4.4s- v7.4s}, [x8], #64
ld1 { v8.4s-v11.4s}, [x8], #64
ld1 {v12.4s-v15.4s}, [x8]
/* load state */
- ld1 {dgav.4s, dgbv.4s}, [x0]
+ ld1 {dgav.4s, dgbv.4s}, [x19]
/* load sha256_ce_state::finalize */
ldr_l w4, sha256_ce_offsetof_finalize, x4
- ldr w4, [x0, x4]
+ ldr w4, [x19, x4]
/* load input */
-0: ld1 {v16.4s-v19.4s}, [x1], #64
- sub w2, w2, #1
+1: ld1 {v16.4s-v19.4s}, [x20], #64
+ sub w21, w21, #1
CPU_LE( rev32 v16.16b, v16.16b )
CPU_LE( rev32 v17.16b, v17.16b )
CPU_LE( rev32 v18.16b, v18.16b )
CPU_LE( rev32 v19.16b, v19.16b )
-1: add t0.4s, v16.4s, v0.4s
+2: add t0.4s, v16.4s, v0.4s
mov dg0v.16b, dgav.16b
mov dg1v.16b, dgbv.16b
@@ -131,16 +137,24 @@ CPU_LE( rev32 v19.16b, v19.16b )
add dgbv.4s, dgbv.4s, dg1v.4s
/* handled all input blocks? */
- cbnz w2, 0b
+ cbz w21, 3f
+
+ if_will_cond_yield_neon
+ st1 {dgav.4s, dgbv.4s}, [x19]
+ do_cond_yield_neon
+ b 0b
+ endif_yield_neon
+
+ b 1b
/*
* Final block: add padding and total bit count.
* Skip if the input size was not a round multiple of the block size,
* the padding is handled by the C code in that case.
*/
- cbz x4, 3f
+3: cbz x4, 4f
ldr_l w4, sha256_ce_offsetof_count, x4
- ldr x4, [x0, x4]
+ ldr x4, [x19, x4]
movi v17.2d, #0
mov x8, #0x80000000
movi v18.2d, #0
@@ -149,9 +163,10 @@ CPU_LE( rev32 v19.16b, v19.16b )
mov x4, #0
mov v19.d[0], xzr
mov v19.d[1], x7
- b 1b
+ b 2b
/* store new state */
-3: st1 {dgav.4s, dgbv.4s}, [x0]
+4: st1 {dgav.4s, dgbv.4s}, [x19]
+ frame_pop
ret
ENDPROC(sha2_ce_transform)
--
2.15.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v5 14/23] crypto: arm64/aes-ccm - yield NEON after every block of input
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-10 15:21 ` Ard Biesheuvel
0 siblings, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:21 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-ce-ccm-core.S | 150 +++++++++++++-------
1 file changed, 95 insertions(+), 55 deletions(-)
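The if_will_cond_yield_neon / do_cond_yield_neon / endif_yield_neon idiom used in this patch spills any live NEON state (here the CCM MAC in v0) to memory before yielding and reloads it after resuming, because a context switch may clobber the entire vector register file. A rough Python model of that protocol (toy block function and hypothetical names, not the kernel code):

```python
def mac_update(blocks, mac0, yield_after):
    """Toy model of the yield protocol in ce_aes_ccm_auth_data: the running
    MAC lives in a register during the loop; before a conditional yield it
    is spilled to memory and reloaded afterwards, because yielding the NEON
    may clobber every vector register."""
    memory = mac0                 # the mac buffer the C caller passed in
    reg = memory                  # ld1 {v0.16b}, [x0]: load mac into a register
    for i, blk in enumerate(blocks):
        reg = (reg * 31 + blk) % (1 << 32)  # stand-in for one CBC-MAC block
        if i in yield_after:      # if_will_cond_yield_neon taken
            memory = reg          # st1 {v0.16b}, [x19]: store mac
            reg = None            # do_cond_yield_neon: register contents lost
            reg = memory          # ld1 {v0.16b}, [x19]: reload mac
    memory = reg                  # final store of the mac
    return memory
```

Running the loop with and without simulated yields must produce the same MAC; that invariance is exactly what the store/reload pair around do_cond_yield_neon preserves.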
diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S b/arch/arm64/crypto/aes-ce-ccm-core.S
index e3a375c4cb83..88f5aef7934c 100644
--- a/arch/arm64/crypto/aes-ce-ccm-core.S
+++ b/arch/arm64/crypto/aes-ce-ccm-core.S
@@ -19,24 +19,33 @@
* u32 *macp, u8 const rk[], u32 rounds);
*/
ENTRY(ce_aes_ccm_auth_data)
- ldr w8, [x3] /* leftover from prev round? */
+ frame_push 7
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
+
+ ldr w25, [x22] /* leftover from prev round? */
ld1 {v0.16b}, [x0] /* load mac */
- cbz w8, 1f
- sub w8, w8, #16
+ cbz w25, 1f
+ sub w25, w25, #16
eor v1.16b, v1.16b, v1.16b
-0: ldrb w7, [x1], #1 /* get 1 byte of input */
- subs w2, w2, #1
- add w8, w8, #1
+0: ldrb w7, [x20], #1 /* get 1 byte of input */
+ subs w21, w21, #1
+ add w25, w25, #1
ins v1.b[0], w7
ext v1.16b, v1.16b, v1.16b, #1 /* rotate in the input bytes */
beq 8f /* out of input? */
- cbnz w8, 0b
+ cbnz w25, 0b
eor v0.16b, v0.16b, v1.16b
-1: ld1 {v3.4s}, [x4] /* load first round key */
- prfm pldl1strm, [x1]
- cmp w5, #12 /* which key size? */
- add x6, x4, #16
- sub w7, w5, #2 /* modified # of rounds */
+1: ld1 {v3.4s}, [x23] /* load first round key */
+ prfm pldl1strm, [x20]
+ cmp w24, #12 /* which key size? */
+ add x6, x23, #16
+ sub w7, w24, #2 /* modified # of rounds */
bmi 2f
bne 5f
mov v5.16b, v3.16b
@@ -55,33 +64,43 @@ ENTRY(ce_aes_ccm_auth_data)
ld1 {v5.4s}, [x6], #16 /* load next round key */
bpl 3b
aese v0.16b, v4.16b
- subs w2, w2, #16 /* last data? */
+ subs w21, w21, #16 /* last data? */
eor v0.16b, v0.16b, v5.16b /* final round */
bmi 6f
- ld1 {v1.16b}, [x1], #16 /* load next input block */
+ ld1 {v1.16b}, [x20], #16 /* load next input block */
eor v0.16b, v0.16b, v1.16b /* xor with mac */
- bne 1b
-6: st1 {v0.16b}, [x0] /* store mac */
+ beq 6f
+
+ if_will_cond_yield_neon
+ st1 {v0.16b}, [x19] /* store mac */
+ do_cond_yield_neon
+ ld1 {v0.16b}, [x19] /* reload mac */
+ endif_yield_neon
+
+ b 1b
+6: st1 {v0.16b}, [x19] /* store mac */
beq 10f
- adds w2, w2, #16
+ adds w21, w21, #16
beq 10f
- mov w8, w2
-7: ldrb w7, [x1], #1
+ mov w25, w21
+7: ldrb w7, [x20], #1
umov w6, v0.b[0]
eor w6, w6, w7
- strb w6, [x0], #1
- subs w2, w2, #1
+ strb w6, [x19], #1
+ subs w21, w21, #1
beq 10f
ext v0.16b, v0.16b, v0.16b, #1 /* rotate out the mac bytes */
b 7b
-8: mov w7, w8
- add w8, w8, #16
+8: mov w7, w25
+ add w25, w25, #16
9: ext v1.16b, v1.16b, v1.16b, #1
adds w7, w7, #1
bne 9b
eor v0.16b, v0.16b, v1.16b
- st1 {v0.16b}, [x0]
-10: str w8, [x3]
+ st1 {v0.16b}, [x19]
+10: str w25, [x22]
+
+ frame_pop
ret
ENDPROC(ce_aes_ccm_auth_data)
@@ -126,19 +145,29 @@ ENTRY(ce_aes_ccm_final)
ENDPROC(ce_aes_ccm_final)
.macro aes_ccm_do_crypt,enc
- ldr x8, [x6, #8] /* load lower ctr */
- ld1 {v0.16b}, [x5] /* load mac */
-CPU_LE( rev x8, x8 ) /* keep swabbed ctr in reg */
+ frame_push 8
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
+ mov x25, x6
+
+ ldr x26, [x25, #8] /* load lower ctr */
+ ld1 {v0.16b}, [x24] /* load mac */
+CPU_LE( rev x26, x26 ) /* keep swabbed ctr in reg */
0: /* outer loop */
- ld1 {v1.8b}, [x6] /* load upper ctr */
- prfm pldl1strm, [x1]
- add x8, x8, #1
- rev x9, x8
- cmp w4, #12 /* which key size? */
- sub w7, w4, #2 /* get modified # of rounds */
+ ld1 {v1.8b}, [x25] /* load upper ctr */
+ prfm pldl1strm, [x20]
+ add x26, x26, #1
+ rev x9, x26
+ cmp w23, #12 /* which key size? */
+ sub w7, w23, #2 /* get modified # of rounds */
ins v1.d[1], x9 /* no carry in lower ctr */
- ld1 {v3.4s}, [x3] /* load first round key */
- add x10, x3, #16
+ ld1 {v3.4s}, [x22] /* load first round key */
+ add x10, x22, #16
bmi 1f
bne 4f
mov v5.16b, v3.16b
@@ -165,9 +194,9 @@ CPU_LE( rev x8, x8 ) /* keep swabbed ctr in reg */
bpl 2b
aese v0.16b, v4.16b
aese v1.16b, v4.16b
- subs w2, w2, #16
- bmi 6f /* partial block? */
- ld1 {v2.16b}, [x1], #16 /* load next input block */
+ subs w21, w21, #16
+ bmi 7f /* partial block? */
+ ld1 {v2.16b}, [x20], #16 /* load next input block */
.if \enc == 1
eor v2.16b, v2.16b, v5.16b /* final round enc+mac */
eor v1.16b, v1.16b, v2.16b /* xor with crypted ctr */
@@ -176,18 +205,29 @@ CPU_LE( rev x8, x8 ) /* keep swabbed ctr in reg */
eor v1.16b, v2.16b, v5.16b /* final round enc */
.endif
eor v0.16b, v0.16b, v2.16b /* xor mac with pt ^ rk[last] */
- st1 {v1.16b}, [x0], #16 /* write output block */
- bne 0b
-CPU_LE( rev x8, x8 )
- st1 {v0.16b}, [x5] /* store mac */
- str x8, [x6, #8] /* store lsb end of ctr (BE) */
-5: ret
-
-6: eor v0.16b, v0.16b, v5.16b /* final round mac */
+ st1 {v1.16b}, [x19], #16 /* write output block */
+ beq 5f
+
+ if_will_cond_yield_neon
+ st1 {v0.16b}, [x24] /* store mac */
+ do_cond_yield_neon
+ ld1 {v0.16b}, [x24] /* reload mac */
+ endif_yield_neon
+
+ b 0b
+5:
+CPU_LE( rev x26, x26 )
+ st1 {v0.16b}, [x24] /* store mac */
+ str x26, [x25, #8] /* store lsb end of ctr (BE) */
+
+6: frame_pop
+ ret
+
+7: eor v0.16b, v0.16b, v5.16b /* final round mac */
eor v1.16b, v1.16b, v5.16b /* final round enc */
- st1 {v0.16b}, [x5] /* store mac */
- add w2, w2, #16 /* process partial tail block */
-7: ldrb w9, [x1], #1 /* get 1 byte of input */
+ st1 {v0.16b}, [x24] /* store mac */
+ add w21, w21, #16 /* process partial tail block */
+8: ldrb w9, [x20], #1 /* get 1 byte of input */
umov w6, v1.b[0] /* get top crypted ctr byte */
umov w7, v0.b[0] /* get top mac byte */
.if \enc == 1
@@ -197,13 +237,13 @@ CPU_LE( rev x8, x8 )
eor w9, w9, w6
eor w7, w7, w9
.endif
- strb w9, [x0], #1 /* store out byte */
- strb w7, [x5], #1 /* store mac byte */
- subs w2, w2, #1
- beq 5b
+ strb w9, [x19], #1 /* store out byte */
+ strb w7, [x24], #1 /* store mac byte */
+ subs w21, w21, #1
+ beq 6b
ext v0.16b, v0.16b, v0.16b, #1 /* shift out mac byte */
ext v1.16b, v1.16b, v1.16b, #1 /* shift out ctr byte */
- b 7b
+ b 8b
.endm
/*
--
2.15.1
* [PATCH v5 15/23] crypto: arm64/aes-blk - yield NEON after every block of input
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-10 15:22 ` Ard Biesheuvel
0 siblings, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:22 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-ce.S | 15 +-
arch/arm64/crypto/aes-modes.S | 331 ++++++++++++--------
2 files changed, 216 insertions(+), 130 deletions(-)
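For these modes the patch uses cond_yield_neon with a restart label (e.g. .Lecbencrestart): after a yield, control branches back to the prologue so that enc_prepare reloads the round keys, which were held in NEON registers and did not survive the yield. A rough Python sketch of that idiom (toy cipher and hypothetical names, not the kernel code):

```python
def ecb_encrypt(blocks, key, need_resched):
    """Toy model of the cond_yield_neon <restart-label> idiom in aes-modes.S:
    when the NEON is yielded, control jumps back to a restart label that
    re-runs the prologue, because the key schedule kept in NEON registers
    does not survive the yield."""
    out = []
    i = 0
    while i < len(blocks):
        round_keys = [key ^ r for r in range(11)]   # prologue: enc_prepare
        while i < len(blocks):
            out.append(blocks[i] ^ round_keys[-1])  # stand-in for one AES block
            i += 1
            if need_resched():
                break  # cond_yield_neon: yield, then branch to restart label
    return out
```

As with the MAC case, the ciphertext must be identical whether or not need_resched ever fires; only the number of times the prologue runs differs.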
diff --git a/arch/arm64/crypto/aes-ce.S b/arch/arm64/crypto/aes-ce.S
index 50330f5c3adc..623e74ed1c67 100644
--- a/arch/arm64/crypto/aes-ce.S
+++ b/arch/arm64/crypto/aes-ce.S
@@ -30,18 +30,21 @@
.endm
/* prepare for encryption with key in rk[] */
- .macro enc_prepare, rounds, rk, ignore
- load_round_keys \rounds, \rk
+ .macro enc_prepare, rounds, rk, temp
+ mov \temp, \rk
+ load_round_keys \rounds, \temp
.endm
/* prepare for encryption (again) but with new key in rk[] */
- .macro enc_switch_key, rounds, rk, ignore
- load_round_keys \rounds, \rk
+ .macro enc_switch_key, rounds, rk, temp
+ mov \temp, \rk
+ load_round_keys \rounds, \temp
.endm
/* prepare for decryption with key in rk[] */
- .macro dec_prepare, rounds, rk, ignore
- load_round_keys \rounds, \rk
+ .macro dec_prepare, rounds, rk, temp
+ mov \temp, \rk
+ load_round_keys \rounds, \temp
.endm
.macro do_enc_Nx, de, mc, k, i0, i1, i2, i3
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index a68412e1e3a4..483a7130cf0e 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -14,12 +14,12 @@
.align 4
aes_encrypt_block4x:
- encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
+ encrypt_block4x v0, v1, v2, v3, w22, x21, x8, w7
ret
ENDPROC(aes_encrypt_block4x)
aes_decrypt_block4x:
- decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
+ decrypt_block4x v0, v1, v2, v3, w22, x21, x8, w7
ret
ENDPROC(aes_decrypt_block4x)
@@ -31,57 +31,71 @@ ENDPROC(aes_decrypt_block4x)
*/
AES_ENTRY(aes_ecb_encrypt)
- stp x29, x30, [sp, #-16]!
- mov x29, sp
+ frame_push 5
- enc_prepare w3, x2, x5
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+
+.Lecbencrestart:
+ enc_prepare w22, x21, x5
.LecbencloopNx:
- subs w4, w4, #4
+ subs w23, w23, #4
bmi .Lecbenc1x
- ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
+ ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 pt blocks */
bl aes_encrypt_block4x
- st1 {v0.16b-v3.16b}, [x0], #64
+ st1 {v0.16b-v3.16b}, [x19], #64
+ cond_yield_neon .Lecbencrestart
b .LecbencloopNx
.Lecbenc1x:
- adds w4, w4, #4
+ adds w23, w23, #4
beq .Lecbencout
.Lecbencloop:
- ld1 {v0.16b}, [x1], #16 /* get next pt block */
- encrypt_block v0, w3, x2, x5, w6
- st1 {v0.16b}, [x0], #16
- subs w4, w4, #1
+ ld1 {v0.16b}, [x20], #16 /* get next pt block */
+ encrypt_block v0, w22, x21, x5, w6
+ st1 {v0.16b}, [x19], #16
+ subs w23, w23, #1
bne .Lecbencloop
.Lecbencout:
- ldp x29, x30, [sp], #16
+ frame_pop
ret
AES_ENDPROC(aes_ecb_encrypt)
AES_ENTRY(aes_ecb_decrypt)
- stp x29, x30, [sp, #-16]!
- mov x29, sp
+ frame_push 5
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
- dec_prepare w3, x2, x5
+.Lecbdecrestart:
+ dec_prepare w22, x21, x5
.LecbdecloopNx:
- subs w4, w4, #4
+ subs w23, w23, #4
bmi .Lecbdec1x
- ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
+ ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 ct blocks */
bl aes_decrypt_block4x
- st1 {v0.16b-v3.16b}, [x0], #64
+ st1 {v0.16b-v3.16b}, [x19], #64
+ cond_yield_neon .Lecbdecrestart
b .LecbdecloopNx
.Lecbdec1x:
- adds w4, w4, #4
+ adds w23, w23, #4
beq .Lecbdecout
.Lecbdecloop:
- ld1 {v0.16b}, [x1], #16 /* get next ct block */
- decrypt_block v0, w3, x2, x5, w6
- st1 {v0.16b}, [x0], #16
- subs w4, w4, #1
+ ld1 {v0.16b}, [x20], #16 /* get next ct block */
+ decrypt_block v0, w22, x21, x5, w6
+ st1 {v0.16b}, [x19], #16
+ subs w23, w23, #1
bne .Lecbdecloop
.Lecbdecout:
- ldp x29, x30, [sp], #16
+ frame_pop
ret
AES_ENDPROC(aes_ecb_decrypt)
@@ -94,78 +108,100 @@ AES_ENDPROC(aes_ecb_decrypt)
*/
AES_ENTRY(aes_cbc_encrypt)
- ld1 {v4.16b}, [x5] /* get iv */
- enc_prepare w3, x2, x6
+ frame_push 6
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
+
+.Lcbcencrestart:
+ ld1 {v4.16b}, [x24] /* get iv */
+ enc_prepare w22, x21, x6
.Lcbcencloop4x:
- subs w4, w4, #4
+ subs w23, w23, #4
bmi .Lcbcenc1x
- ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
+ ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 pt blocks */
eor v0.16b, v0.16b, v4.16b /* ..and xor with iv */
- encrypt_block v0, w3, x2, x6, w7
+ encrypt_block v0, w22, x21, x6, w7
eor v1.16b, v1.16b, v0.16b
- encrypt_block v1, w3, x2, x6, w7
+ encrypt_block v1, w22, x21, x6, w7
eor v2.16b, v2.16b, v1.16b
- encrypt_block v2, w3, x2, x6, w7
+ encrypt_block v2, w22, x21, x6, w7
eor v3.16b, v3.16b, v2.16b
- encrypt_block v3, w3, x2, x6, w7
- st1 {v0.16b-v3.16b}, [x0], #64
+ encrypt_block v3, w22, x21, x6, w7
+ st1 {v0.16b-v3.16b}, [x19], #64
mov v4.16b, v3.16b
+ st1 {v4.16b}, [x24] /* return iv */
+ cond_yield_neon .Lcbcencrestart
b .Lcbcencloop4x
.Lcbcenc1x:
- adds w4, w4, #4
+ adds w23, w23, #4
beq .Lcbcencout
.Lcbcencloop:
- ld1 {v0.16b}, [x1], #16 /* get next pt block */
+ ld1 {v0.16b}, [x20], #16 /* get next pt block */
eor v4.16b, v4.16b, v0.16b /* ..and xor with iv */
- encrypt_block v4, w3, x2, x6, w7
- st1 {v4.16b}, [x0], #16
- subs w4, w4, #1
+ encrypt_block v4, w22, x21, x6, w7
+ st1 {v4.16b}, [x19], #16
+ subs w23, w23, #1
bne .Lcbcencloop
.Lcbcencout:
- st1 {v4.16b}, [x5] /* return iv */
+ st1 {v4.16b}, [x24] /* return iv */
+ frame_pop
ret
AES_ENDPROC(aes_cbc_encrypt)
AES_ENTRY(aes_cbc_decrypt)
- stp x29, x30, [sp, #-16]!
- mov x29, sp
+ frame_push 6
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
- ld1 {v7.16b}, [x5] /* get iv */
- dec_prepare w3, x2, x6
+.Lcbcdecrestart:
+ ld1 {v7.16b}, [x24] /* get iv */
+ dec_prepare w22, x21, x6
.LcbcdecloopNx:
- subs w4, w4, #4
+ subs w23, w23, #4
bmi .Lcbcdec1x
- ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
+ ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 ct blocks */
mov v4.16b, v0.16b
mov v5.16b, v1.16b
mov v6.16b, v2.16b
bl aes_decrypt_block4x
- sub x1, x1, #16
+ sub x20, x20, #16
eor v0.16b, v0.16b, v7.16b
eor v1.16b, v1.16b, v4.16b
- ld1 {v7.16b}, [x1], #16 /* reload 1 ct block */
+ ld1 {v7.16b}, [x20], #16 /* reload 1 ct block */
eor v2.16b, v2.16b, v5.16b
eor v3.16b, v3.16b, v6.16b
- st1 {v0.16b-v3.16b}, [x0], #64
+ st1 {v0.16b-v3.16b}, [x19], #64
+ st1 {v7.16b}, [x24] /* return iv */
+ cond_yield_neon .Lcbcdecrestart
b .LcbcdecloopNx
.Lcbcdec1x:
- adds w4, w4, #4
+ adds w23, w23, #4
beq .Lcbcdecout
.Lcbcdecloop:
- ld1 {v1.16b}, [x1], #16 /* get next ct block */
+ ld1 {v1.16b}, [x20], #16 /* get next ct block */
mov v0.16b, v1.16b /* ...and copy to v0 */
- decrypt_block v0, w3, x2, x6, w7
+ decrypt_block v0, w22, x21, x6, w7
eor v0.16b, v0.16b, v7.16b /* xor with iv => pt */
mov v7.16b, v1.16b /* ct is next iv */
- st1 {v0.16b}, [x0], #16
- subs w4, w4, #1
+ st1 {v0.16b}, [x19], #16
+ subs w23, w23, #1
bne .Lcbcdecloop
.Lcbcdecout:
- st1 {v7.16b}, [x5] /* return iv */
- ldp x29, x30, [sp], #16
+ st1 {v7.16b}, [x24] /* return iv */
+ frame_pop
ret
AES_ENDPROC(aes_cbc_decrypt)
@@ -176,19 +212,26 @@ AES_ENDPROC(aes_cbc_decrypt)
*/
AES_ENTRY(aes_ctr_encrypt)
- stp x29, x30, [sp, #-16]!
- mov x29, sp
+ frame_push 6
- enc_prepare w3, x2, x6
- ld1 {v4.16b}, [x5]
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
+
+.Lctrrestart:
+ enc_prepare w22, x21, x6
+ ld1 {v4.16b}, [x24]
umov x6, v4.d[1] /* keep swabbed ctr in reg */
rev x6, x6
- cmn w6, w4 /* 32 bit overflow? */
- bcs .Lctrloop
.LctrloopNx:
- subs w4, w4, #4
+ subs w23, w23, #4
bmi .Lctr1x
+ cmn w6, #4 /* 32 bit overflow? */
+ bcs .Lctr1x
ldr q8, =0x30000000200000001 /* addends 1,2,3[,0] */
dup v7.4s, w6
mov v0.16b, v4.16b
@@ -200,25 +243,27 @@ AES_ENTRY(aes_ctr_encrypt)
mov v1.s[3], v8.s[0]
mov v2.s[3], v8.s[1]
mov v3.s[3], v8.s[2]
- ld1 {v5.16b-v7.16b}, [x1], #48 /* get 3 input blocks */
+ ld1 {v5.16b-v7.16b}, [x20], #48 /* get 3 input blocks */
bl aes_encrypt_block4x
eor v0.16b, v5.16b, v0.16b
- ld1 {v5.16b}, [x1], #16 /* get 1 input block */
+ ld1 {v5.16b}, [x20], #16 /* get 1 input block */
eor v1.16b, v6.16b, v1.16b
eor v2.16b, v7.16b, v2.16b
eor v3.16b, v5.16b, v3.16b
- st1 {v0.16b-v3.16b}, [x0], #64
+ st1 {v0.16b-v3.16b}, [x19], #64
add x6, x6, #4
rev x7, x6
ins v4.d[1], x7
- cbz w4, .Lctrout
+ cbz w23, .Lctrout
+ st1 {v4.16b}, [x24] /* return next CTR value */
+ cond_yield_neon .Lctrrestart
b .LctrloopNx
.Lctr1x:
- adds w4, w4, #4
+ adds w23, w23, #4
beq .Lctrout
.Lctrloop:
mov v0.16b, v4.16b
- encrypt_block v0, w3, x2, x8, w7
+ encrypt_block v0, w22, x21, x8, w7
adds x6, x6, #1 /* increment BE ctr */
rev x7, x6
@@ -226,22 +271,22 @@ AES_ENTRY(aes_ctr_encrypt)
bcs .Lctrcarry /* overflow? */
.Lctrcarrydone:
- subs w4, w4, #1
+ subs w23, w23, #1
bmi .Lctrtailblock /* blocks <0 means tail block */
- ld1 {v3.16b}, [x1], #16
+ ld1 {v3.16b}, [x20], #16
eor v3.16b, v0.16b, v3.16b
- st1 {v3.16b}, [x0], #16
+ st1 {v3.16b}, [x19], #16
bne .Lctrloop
.Lctrout:
- st1 {v4.16b}, [x5] /* return next CTR value */
- ldp x29, x30, [sp], #16
+ st1 {v4.16b}, [x24] /* return next CTR value */
+.Lctrret:
+ frame_pop
ret
.Lctrtailblock:
- st1 {v0.16b}, [x0]
- ldp x29, x30, [sp], #16
- ret
+ st1 {v0.16b}, [x19]
+ b .Lctrret
.Lctrcarry:
umov x7, v4.d[0] /* load upper word of ctr */
@@ -274,10 +319,16 @@ CPU_LE( .quad 1, 0x87 )
CPU_BE( .quad 0x87, 1 )
AES_ENTRY(aes_xts_encrypt)
- stp x29, x30, [sp, #-16]!
- mov x29, sp
+ frame_push 6
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x6
- ld1 {v4.16b}, [x6]
+ ld1 {v4.16b}, [x24]
cbz w7, .Lxtsencnotfirst
enc_prepare w3, x5, x8
@@ -286,15 +337,17 @@ AES_ENTRY(aes_xts_encrypt)
ldr q7, .Lxts_mul_x
b .LxtsencNx
+.Lxtsencrestart:
+ ld1 {v4.16b}, [x24]
.Lxtsencnotfirst:
- enc_prepare w3, x2, x8
+ enc_prepare w22, x21, x8
.LxtsencloopNx:
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
.LxtsencNx:
- subs w4, w4, #4
+ subs w23, w23, #4
bmi .Lxtsenc1x
- ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
+ ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 pt blocks */
next_tweak v5, v4, v7, v8
eor v0.16b, v0.16b, v4.16b
next_tweak v6, v5, v7, v8
@@ -307,35 +360,43 @@ AES_ENTRY(aes_xts_encrypt)
eor v0.16b, v0.16b, v4.16b
eor v1.16b, v1.16b, v5.16b
eor v2.16b, v2.16b, v6.16b
- st1 {v0.16b-v3.16b}, [x0], #64
+ st1 {v0.16b-v3.16b}, [x19], #64
mov v4.16b, v7.16b
- cbz w4, .Lxtsencout
+ cbz w23, .Lxtsencout
+ st1 {v4.16b}, [x24]
+ cond_yield_neon .Lxtsencrestart
b .LxtsencloopNx
.Lxtsenc1x:
- adds w4, w4, #4
+ adds w23, w23, #4
beq .Lxtsencout
.Lxtsencloop:
- ld1 {v1.16b}, [x1], #16
+ ld1 {v1.16b}, [x20], #16
eor v0.16b, v1.16b, v4.16b
- encrypt_block v0, w3, x2, x8, w7
+ encrypt_block v0, w22, x21, x8, w7
eor v0.16b, v0.16b, v4.16b
- st1 {v0.16b}, [x0], #16
- subs w4, w4, #1
+ st1 {v0.16b}, [x19], #16
+ subs w23, w23, #1
beq .Lxtsencout
next_tweak v4, v4, v7, v8
b .Lxtsencloop
.Lxtsencout:
- st1 {v4.16b}, [x6]
- ldp x29, x30, [sp], #16
+ st1 {v4.16b}, [x24]
+ frame_pop
ret
AES_ENDPROC(aes_xts_encrypt)
AES_ENTRY(aes_xts_decrypt)
- stp x29, x30, [sp, #-16]!
- mov x29, sp
+ frame_push 6
- ld1 {v4.16b}, [x6]
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x6
+
+ ld1 {v4.16b}, [x24]
cbz w7, .Lxtsdecnotfirst
enc_prepare w3, x5, x8
@@ -344,15 +405,17 @@ AES_ENTRY(aes_xts_decrypt)
ldr q7, .Lxts_mul_x
b .LxtsdecNx
+.Lxtsdecrestart:
+ ld1 {v4.16b}, [x24]
.Lxtsdecnotfirst:
- dec_prepare w3, x2, x8
+ dec_prepare w22, x21, x8
.LxtsdecloopNx:
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
.LxtsdecNx:
- subs w4, w4, #4
+ subs w23, w23, #4
bmi .Lxtsdec1x
- ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
+ ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 ct blocks */
next_tweak v5, v4, v7, v8
eor v0.16b, v0.16b, v4.16b
next_tweak v6, v5, v7, v8
@@ -365,26 +428,28 @@ AES_ENTRY(aes_xts_decrypt)
eor v0.16b, v0.16b, v4.16b
eor v1.16b, v1.16b, v5.16b
eor v2.16b, v2.16b, v6.16b
- st1 {v0.16b-v3.16b}, [x0], #64
+ st1 {v0.16b-v3.16b}, [x19], #64
mov v4.16b, v7.16b
- cbz w4, .Lxtsdecout
+ cbz w23, .Lxtsdecout
+ st1 {v4.16b}, [x24]
+ cond_yield_neon .Lxtsdecrestart
b .LxtsdecloopNx
.Lxtsdec1x:
- adds w4, w4, #4
+ adds w23, w23, #4
beq .Lxtsdecout
.Lxtsdecloop:
- ld1 {v1.16b}, [x1], #16
+ ld1 {v1.16b}, [x20], #16
eor v0.16b, v1.16b, v4.16b
- decrypt_block v0, w3, x2, x8, w7
+ decrypt_block v0, w22, x21, x8, w7
eor v0.16b, v0.16b, v4.16b
- st1 {v0.16b}, [x0], #16
- subs w4, w4, #1
+ st1 {v0.16b}, [x19], #16
+ subs w23, w23, #1
beq .Lxtsdecout
next_tweak v4, v4, v7, v8
b .Lxtsdecloop
.Lxtsdecout:
- st1 {v4.16b}, [x6]
- ldp x29, x30, [sp], #16
+ st1 {v4.16b}, [x24]
+ frame_pop
ret
AES_ENDPROC(aes_xts_decrypt)
@@ -393,43 +458,61 @@ AES_ENDPROC(aes_xts_decrypt)
* int blocks, u8 dg[], int enc_before, int enc_after)
*/
AES_ENTRY(aes_mac_update)
- ld1 {v0.16b}, [x4] /* get dg */
+ frame_push 6
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x6
+
+ ld1 {v0.16b}, [x23] /* get dg */
enc_prepare w2, x1, x7
cbz w5, .Lmacloop4x
encrypt_block v0, w2, x1, x7, w8
.Lmacloop4x:
- subs w3, w3, #4
+ subs w22, w22, #4
bmi .Lmac1x
- ld1 {v1.16b-v4.16b}, [x0], #64 /* get next pt block */
+ ld1 {v1.16b-v4.16b}, [x19], #64 /* get next pt block */
eor v0.16b, v0.16b, v1.16b /* ..and xor with dg */
- encrypt_block v0, w2, x1, x7, w8
+ encrypt_block v0, w21, x20, x7, w8
eor v0.16b, v0.16b, v2.16b
- encrypt_block v0, w2, x1, x7, w8
+ encrypt_block v0, w21, x20, x7, w8
eor v0.16b, v0.16b, v3.16b
- encrypt_block v0, w2, x1, x7, w8
+ encrypt_block v0, w21, x20, x7, w8
eor v0.16b, v0.16b, v4.16b
- cmp w3, wzr
- csinv x5, x6, xzr, eq
+ cmp w22, wzr
+ csinv x5, x24, xzr, eq
cbz w5, .Lmacout
- encrypt_block v0, w2, x1, x7, w8
+ encrypt_block v0, w21, x20, x7, w8
+ st1 {v0.16b}, [x23] /* return dg */
+ cond_yield_neon .Lmacrestart
b .Lmacloop4x
.Lmac1x:
- add w3, w3, #4
+ add w22, w22, #4
.Lmacloop:
- cbz w3, .Lmacout
- ld1 {v1.16b}, [x0], #16 /* get next pt block */
+ cbz w22, .Lmacout
+ ld1 {v1.16b}, [x19], #16 /* get next pt block */
eor v0.16b, v0.16b, v1.16b /* ..and xor with dg */
- subs w3, w3, #1
- csinv x5, x6, xzr, eq
+ subs w22, w22, #1
+ csinv x5, x24, xzr, eq
cbz w5, .Lmacout
- encrypt_block v0, w2, x1, x7, w8
+.Lmacenc:
+ encrypt_block v0, w21, x20, x7, w8
b .Lmacloop
.Lmacout:
- st1 {v0.16b}, [x4] /* return dg */
+ st1 {v0.16b}, [x23] /* return dg */
+ frame_pop
ret
+
+.Lmacrestart:
+ ld1 {v0.16b}, [x23] /* get dg */
+ enc_prepare w21, x20, x0
+ b .Lmacloop4x
AES_ENDPROC(aes_mac_update)
--
2.15.1
+ frame_pop
ret
AES_ENDPROC(aes_ecb_decrypt)
@@ -94,78 +108,100 @@ AES_ENDPROC(aes_ecb_decrypt)
*/
AES_ENTRY(aes_cbc_encrypt)
- ld1 {v4.16b}, [x5] /* get iv */
- enc_prepare w3, x2, x6
+ frame_push 6
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
+
+.Lcbcencrestart:
+ ld1 {v4.16b}, [x24] /* get iv */
+ enc_prepare w22, x21, x6
.Lcbcencloop4x:
- subs w4, w4, #4
+ subs w23, w23, #4
bmi .Lcbcenc1x
- ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
+ ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 pt blocks */
eor v0.16b, v0.16b, v4.16b /* ..and xor with iv */
- encrypt_block v0, w3, x2, x6, w7
+ encrypt_block v0, w22, x21, x6, w7
eor v1.16b, v1.16b, v0.16b
- encrypt_block v1, w3, x2, x6, w7
+ encrypt_block v1, w22, x21, x6, w7
eor v2.16b, v2.16b, v1.16b
- encrypt_block v2, w3, x2, x6, w7
+ encrypt_block v2, w22, x21, x6, w7
eor v3.16b, v3.16b, v2.16b
- encrypt_block v3, w3, x2, x6, w7
- st1 {v0.16b-v3.16b}, [x0], #64
+ encrypt_block v3, w22, x21, x6, w7
+ st1 {v0.16b-v3.16b}, [x19], #64
mov v4.16b, v3.16b
+ st1 {v4.16b}, [x24] /* return iv */
+ cond_yield_neon .Lcbcencrestart
b .Lcbcencloop4x
.Lcbcenc1x:
- adds w4, w4, #4
+ adds w23, w23, #4
beq .Lcbcencout
.Lcbcencloop:
- ld1 {v0.16b}, [x1], #16 /* get next pt block */
+ ld1 {v0.16b}, [x20], #16 /* get next pt block */
eor v4.16b, v4.16b, v0.16b /* ..and xor with iv */
- encrypt_block v4, w3, x2, x6, w7
- st1 {v4.16b}, [x0], #16
- subs w4, w4, #1
+ encrypt_block v4, w22, x21, x6, w7
+ st1 {v4.16b}, [x19], #16
+ subs w23, w23, #1
bne .Lcbcencloop
.Lcbcencout:
- st1 {v4.16b}, [x5] /* return iv */
+ st1 {v4.16b}, [x24] /* return iv */
+ frame_pop
ret
AES_ENDPROC(aes_cbc_encrypt)
AES_ENTRY(aes_cbc_decrypt)
- stp x29, x30, [sp, #-16]!
- mov x29, sp
+ frame_push 6
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
- ld1 {v7.16b}, [x5] /* get iv */
- dec_prepare w3, x2, x6
+.Lcbcdecrestart:
+ ld1 {v7.16b}, [x24] /* get iv */
+ dec_prepare w22, x21, x6
.LcbcdecloopNx:
- subs w4, w4, #4
+ subs w23, w23, #4
bmi .Lcbcdec1x
- ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
+ ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 ct blocks */
mov v4.16b, v0.16b
mov v5.16b, v1.16b
mov v6.16b, v2.16b
bl aes_decrypt_block4x
- sub x1, x1, #16
+ sub x20, x20, #16
eor v0.16b, v0.16b, v7.16b
eor v1.16b, v1.16b, v4.16b
- ld1 {v7.16b}, [x1], #16 /* reload 1 ct block */
+ ld1 {v7.16b}, [x20], #16 /* reload 1 ct block */
eor v2.16b, v2.16b, v5.16b
eor v3.16b, v3.16b, v6.16b
- st1 {v0.16b-v3.16b}, [x0], #64
+ st1 {v0.16b-v3.16b}, [x19], #64
+ st1 {v7.16b}, [x24] /* return iv */
+ cond_yield_neon .Lcbcdecrestart
b .LcbcdecloopNx
.Lcbcdec1x:
- adds w4, w4, #4
+ adds w23, w23, #4
beq .Lcbcdecout
.Lcbcdecloop:
- ld1 {v1.16b}, [x1], #16 /* get next ct block */
+ ld1 {v1.16b}, [x20], #16 /* get next ct block */
mov v0.16b, v1.16b /* ...and copy to v0 */
- decrypt_block v0, w3, x2, x6, w7
+ decrypt_block v0, w22, x21, x6, w7
eor v0.16b, v0.16b, v7.16b /* xor with iv => pt */
mov v7.16b, v1.16b /* ct is next iv */
- st1 {v0.16b}, [x0], #16
- subs w4, w4, #1
+ st1 {v0.16b}, [x19], #16
+ subs w23, w23, #1
bne .Lcbcdecloop
.Lcbcdecout:
- st1 {v7.16b}, [x5] /* return iv */
- ldp x29, x30, [sp], #16
+ st1 {v7.16b}, [x24] /* return iv */
+ frame_pop
ret
AES_ENDPROC(aes_cbc_decrypt)
@@ -176,19 +212,26 @@ AES_ENDPROC(aes_cbc_decrypt)
*/
AES_ENTRY(aes_ctr_encrypt)
- stp x29, x30, [sp, #-16]!
- mov x29, sp
+ frame_push 6
- enc_prepare w3, x2, x6
- ld1 {v4.16b}, [x5]
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
+
+.Lctrrestart:
+ enc_prepare w22, x21, x6
+ ld1 {v4.16b}, [x24]
umov x6, v4.d[1] /* keep swabbed ctr in reg */
rev x6, x6
- cmn w6, w4 /* 32 bit overflow? */
- bcs .Lctrloop
.LctrloopNx:
- subs w4, w4, #4
+ subs w23, w23, #4
bmi .Lctr1x
+ cmn w6, #4 /* 32 bit overflow? */
+ bcs .Lctr1x
ldr q8, =0x30000000200000001 /* addends 1,2,3[,0] */
dup v7.4s, w6
mov v0.16b, v4.16b
@@ -200,25 +243,27 @@ AES_ENTRY(aes_ctr_encrypt)
mov v1.s[3], v8.s[0]
mov v2.s[3], v8.s[1]
mov v3.s[3], v8.s[2]
- ld1 {v5.16b-v7.16b}, [x1], #48 /* get 3 input blocks */
+ ld1 {v5.16b-v7.16b}, [x20], #48 /* get 3 input blocks */
bl aes_encrypt_block4x
eor v0.16b, v5.16b, v0.16b
- ld1 {v5.16b}, [x1], #16 /* get 1 input block */
+ ld1 {v5.16b}, [x20], #16 /* get 1 input block */
eor v1.16b, v6.16b, v1.16b
eor v2.16b, v7.16b, v2.16b
eor v3.16b, v5.16b, v3.16b
- st1 {v0.16b-v3.16b}, [x0], #64
+ st1 {v0.16b-v3.16b}, [x19], #64
add x6, x6, #4
rev x7, x6
ins v4.d[1], x7
- cbz w4, .Lctrout
+ cbz w23, .Lctrout
+ st1 {v4.16b}, [x24] /* return next CTR value */
+ cond_yield_neon .Lctrrestart
b .LctrloopNx
.Lctr1x:
- adds w4, w4, #4
+ adds w23, w23, #4
beq .Lctrout
.Lctrloop:
mov v0.16b, v4.16b
- encrypt_block v0, w3, x2, x8, w7
+ encrypt_block v0, w22, x21, x8, w7
adds x6, x6, #1 /* increment BE ctr */
rev x7, x6
@@ -226,22 +271,22 @@ AES_ENTRY(aes_ctr_encrypt)
bcs .Lctrcarry /* overflow? */
.Lctrcarrydone:
- subs w4, w4, #1
+ subs w23, w23, #1
bmi .Lctrtailblock /* blocks <0 means tail block */
- ld1 {v3.16b}, [x1], #16
+ ld1 {v3.16b}, [x20], #16
eor v3.16b, v0.16b, v3.16b
- st1 {v3.16b}, [x0], #16
+ st1 {v3.16b}, [x19], #16
bne .Lctrloop
.Lctrout:
- st1 {v4.16b}, [x5] /* return next CTR value */
- ldp x29, x30, [sp], #16
+ st1 {v4.16b}, [x24] /* return next CTR value */
+.Lctrret:
+ frame_pop
ret
.Lctrtailblock:
- st1 {v0.16b}, [x0]
- ldp x29, x30, [sp], #16
- ret
+ st1 {v0.16b}, [x19]
+ b .Lctrret
.Lctrcarry:
umov x7, v4.d[0] /* load upper word of ctr */
@@ -274,10 +319,16 @@ CPU_LE( .quad 1, 0x87 )
CPU_BE( .quad 0x87, 1 )
AES_ENTRY(aes_xts_encrypt)
- stp x29, x30, [sp, #-16]!
- mov x29, sp
+ frame_push 6
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x6
- ld1 {v4.16b}, [x6]
+ ld1 {v4.16b}, [x24]
cbz w7, .Lxtsencnotfirst
enc_prepare w3, x5, x8
@@ -286,15 +337,17 @@ AES_ENTRY(aes_xts_encrypt)
ldr q7, .Lxts_mul_x
b .LxtsencNx
+.Lxtsencrestart:
+ ld1 {v4.16b}, [x24]
.Lxtsencnotfirst:
- enc_prepare w3, x2, x8
+ enc_prepare w22, x21, x8
.LxtsencloopNx:
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
.LxtsencNx:
- subs w4, w4, #4
+ subs w23, w23, #4
bmi .Lxtsenc1x
- ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
+ ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 pt blocks */
next_tweak v5, v4, v7, v8
eor v0.16b, v0.16b, v4.16b
next_tweak v6, v5, v7, v8
@@ -307,35 +360,43 @@ AES_ENTRY(aes_xts_encrypt)
eor v0.16b, v0.16b, v4.16b
eor v1.16b, v1.16b, v5.16b
eor v2.16b, v2.16b, v6.16b
- st1 {v0.16b-v3.16b}, [x0], #64
+ st1 {v0.16b-v3.16b}, [x19], #64
mov v4.16b, v7.16b
- cbz w4, .Lxtsencout
+ cbz w23, .Lxtsencout
+ st1 {v4.16b}, [x24]
+ cond_yield_neon .Lxtsencrestart
b .LxtsencloopNx
.Lxtsenc1x:
- adds w4, w4, #4
+ adds w23, w23, #4
beq .Lxtsencout
.Lxtsencloop:
- ld1 {v1.16b}, [x1], #16
+ ld1 {v1.16b}, [x20], #16
eor v0.16b, v1.16b, v4.16b
- encrypt_block v0, w3, x2, x8, w7
+ encrypt_block v0, w22, x21, x8, w7
eor v0.16b, v0.16b, v4.16b
- st1 {v0.16b}, [x0], #16
- subs w4, w4, #1
+ st1 {v0.16b}, [x19], #16
+ subs w23, w23, #1
beq .Lxtsencout
next_tweak v4, v4, v7, v8
b .Lxtsencloop
.Lxtsencout:
- st1 {v4.16b}, [x6]
- ldp x29, x30, [sp], #16
+ st1 {v4.16b}, [x24]
+ frame_pop
ret
AES_ENDPROC(aes_xts_encrypt)
AES_ENTRY(aes_xts_decrypt)
- stp x29, x30, [sp, #-16]!
- mov x29, sp
+ frame_push 6
- ld1 {v4.16b}, [x6]
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x6
+
+ ld1 {v4.16b}, [x24]
cbz w7, .Lxtsdecnotfirst
enc_prepare w3, x5, x8
@@ -344,15 +405,17 @@ AES_ENTRY(aes_xts_decrypt)
ldr q7, .Lxts_mul_x
b .LxtsdecNx
+.Lxtsdecrestart:
+ ld1 {v4.16b}, [x24]
.Lxtsdecnotfirst:
- dec_prepare w3, x2, x8
+ dec_prepare w22, x21, x8
.LxtsdecloopNx:
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
.LxtsdecNx:
- subs w4, w4, #4
+ subs w23, w23, #4
bmi .Lxtsdec1x
- ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
+ ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 ct blocks */
next_tweak v5, v4, v7, v8
eor v0.16b, v0.16b, v4.16b
next_tweak v6, v5, v7, v8
@@ -365,26 +428,28 @@ AES_ENTRY(aes_xts_decrypt)
eor v0.16b, v0.16b, v4.16b
eor v1.16b, v1.16b, v5.16b
eor v2.16b, v2.16b, v6.16b
- st1 {v0.16b-v3.16b}, [x0], #64
+ st1 {v0.16b-v3.16b}, [x19], #64
mov v4.16b, v7.16b
- cbz w4, .Lxtsdecout
+ cbz w23, .Lxtsdecout
+ st1 {v4.16b}, [x24]
+ cond_yield_neon .Lxtsdecrestart
b .LxtsdecloopNx
.Lxtsdec1x:
- adds w4, w4, #4
+ adds w23, w23, #4
beq .Lxtsdecout
.Lxtsdecloop:
- ld1 {v1.16b}, [x1], #16
+ ld1 {v1.16b}, [x20], #16
eor v0.16b, v1.16b, v4.16b
- decrypt_block v0, w3, x2, x8, w7
+ decrypt_block v0, w22, x21, x8, w7
eor v0.16b, v0.16b, v4.16b
- st1 {v0.16b}, [x0], #16
- subs w4, w4, #1
+ st1 {v0.16b}, [x19], #16
+ subs w23, w23, #1
beq .Lxtsdecout
next_tweak v4, v4, v7, v8
b .Lxtsdecloop
.Lxtsdecout:
- st1 {v4.16b}, [x6]
- ldp x29, x30, [sp], #16
+ st1 {v4.16b}, [x24]
+ frame_pop
ret
AES_ENDPROC(aes_xts_decrypt)
@@ -393,43 +458,61 @@ AES_ENDPROC(aes_xts_decrypt)
* int blocks, u8 dg[], int enc_before, int enc_after)
*/
AES_ENTRY(aes_mac_update)
- ld1 {v0.16b}, [x4] /* get dg */
+ frame_push 6
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x6
+
+ ld1 {v0.16b}, [x23] /* get dg */
enc_prepare w2, x1, x7
cbz w5, .Lmacloop4x
encrypt_block v0, w2, x1, x7, w8
.Lmacloop4x:
- subs w3, w3, #4
+ subs w22, w22, #4
bmi .Lmac1x
- ld1 {v1.16b-v4.16b}, [x0], #64 /* get next pt block */
+ ld1 {v1.16b-v4.16b}, [x19], #64 /* get next pt block */
eor v0.16b, v0.16b, v1.16b /* ..and xor with dg */
- encrypt_block v0, w2, x1, x7, w8
+ encrypt_block v0, w21, x20, x7, w8
eor v0.16b, v0.16b, v2.16b
- encrypt_block v0, w2, x1, x7, w8
+ encrypt_block v0, w21, x20, x7, w8
eor v0.16b, v0.16b, v3.16b
- encrypt_block v0, w2, x1, x7, w8
+ encrypt_block v0, w21, x20, x7, w8
eor v0.16b, v0.16b, v4.16b
- cmp w3, wzr
- csinv x5, x6, xzr, eq
+ cmp w22, wzr
+ csinv x5, x24, xzr, eq
cbz w5, .Lmacout
- encrypt_block v0, w2, x1, x7, w8
+ encrypt_block v0, w21, x20, x7, w8
+ st1 {v0.16b}, [x23] /* return dg */
+ cond_yield_neon .Lmacrestart
b .Lmacloop4x
.Lmac1x:
- add w3, w3, #4
+ add w22, w22, #4
.Lmacloop:
- cbz w3, .Lmacout
- ld1 {v1.16b}, [x0], #16 /* get next pt block */
+ cbz w22, .Lmacout
+ ld1 {v1.16b}, [x19], #16 /* get next pt block */
eor v0.16b, v0.16b, v1.16b /* ..and xor with dg */
- subs w3, w3, #1
- csinv x5, x6, xzr, eq
+ subs w22, w22, #1
+ csinv x5, x24, xzr, eq
cbz w5, .Lmacout
- encrypt_block v0, w2, x1, x7, w8
+.Lmacenc:
+ encrypt_block v0, w21, x20, x7, w8
b .Lmacloop
.Lmacout:
- st1 {v0.16b}, [x4] /* return dg */
+ st1 {v0.16b}, [x23] /* return dg */
+ frame_pop
ret
+
+.Lmacrestart:
+ ld1 {v0.16b}, [x23] /* get dg */
+ enc_prepare w21, x20, x0
+ b .Lmacloop4x
AES_ENDPROC(aes_mac_update)
--
2.15.1
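The CTR path above keeps the byte-swabbed low half of the counter in a general register and branches to .Lctrcarry when incrementing it overflows. A minimal C model of that carry propagation (illustrative names; the real counter lives split across v4.d[0]/v4.d[1] in big-endian byte order):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the 128-bit big-endian CTR counter handling in aes-modes.S:
 * the counter is split into an upper and a lower 64-bit word, and an
 * increment of the lower word must propagate a carry into the upper
 * word -- the job of the .Lctrcarry path in the assembly.
 */
struct be128_ctr { uint64_t hi, lo; };

static void ctr_increment(struct be128_ctr *c)
{
    c->lo++;
    if (c->lo == 0)     /* low word wrapped: carry into high word */
        c->hi++;
}
```

The `cmn w6, #4` test in the 4-way path serves a related purpose: it bails out to the one-block loop whenever adding 4 to the low 32-bit lane would overflow, so the vectorized addend trick stays correct.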
* [PATCH v5 16/23] crypto: arm64/aes-bs - yield NEON after every block of input
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-10 15:22 ` Ard Biesheuvel
0 siblings, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:22 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-neonbs-core.S | 305 +++++++++++---------
1 file changed, 170 insertions(+), 135 deletions(-)
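The bit-sliced code below handles tails of fewer than 8 blocks with a mask trick: `mov x5, #1; lsl x5, x5, x4` builds a mask whose bit i is set exactly when `blocks == i`, so after loading block i the `tbnz x5, #(i+1)` test stops the load sequence early; when 8 or more blocks remain, `csel` forces the mask to zero and all 8 registers are loaded. A C model (names illustrative):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the "1 << blocks" tail mask in aes-neonbs-core.S: returns
 * how many of the 8 vector registers get loaded for a given residual
 * block count.
 */
static int blocks_loaded(int blocks)
{
    /* csel x5, x5, xzr, mi: keep the mask only when blocks < 8 */
    uint64_t mask = blocks < 8 ? (uint64_t)1 << blocks : 0;
    int loaded = 0;

    for (int i = 0; i < 8; i++) {
        loaded++;                        /* ld1 {v<i>.16b}, [x20], #16 */
        if (mask & ((uint64_t)1 << (i + 1)))
            break;                       /* tbnz x5, #(i+1), 0f */
    }
    return loaded;
}
```

The same mask is reused on the store side, which is why the assembly tests the identical bit positions after the `bl \do8` call.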
diff --git a/arch/arm64/crypto/aes-neonbs-core.S b/arch/arm64/crypto/aes-neonbs-core.S
index ca0472500433..e613a87f8b53 100644
--- a/arch/arm64/crypto/aes-neonbs-core.S
+++ b/arch/arm64/crypto/aes-neonbs-core.S
@@ -565,54 +565,61 @@ ENDPROC(aesbs_decrypt8)
* int blocks)
*/
.macro __ecb_crypt, do8, o0, o1, o2, o3, o4, o5, o6, o7
- stp x29, x30, [sp, #-16]!
- mov x29, sp
+ frame_push 5
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
99: mov x5, #1
- lsl x5, x5, x4
- subs w4, w4, #8
- csel x4, x4, xzr, pl
+ lsl x5, x5, x23
+ subs w23, w23, #8
+ csel x23, x23, xzr, pl
csel x5, x5, xzr, mi
- ld1 {v0.16b}, [x1], #16
+ ld1 {v0.16b}, [x20], #16
tbnz x5, #1, 0f
- ld1 {v1.16b}, [x1], #16
+ ld1 {v1.16b}, [x20], #16
tbnz x5, #2, 0f
- ld1 {v2.16b}, [x1], #16
+ ld1 {v2.16b}, [x20], #16
tbnz x5, #3, 0f
- ld1 {v3.16b}, [x1], #16
+ ld1 {v3.16b}, [x20], #16
tbnz x5, #4, 0f
- ld1 {v4.16b}, [x1], #16
+ ld1 {v4.16b}, [x20], #16
tbnz x5, #5, 0f
- ld1 {v5.16b}, [x1], #16
+ ld1 {v5.16b}, [x20], #16
tbnz x5, #6, 0f
- ld1 {v6.16b}, [x1], #16
+ ld1 {v6.16b}, [x20], #16
tbnz x5, #7, 0f
- ld1 {v7.16b}, [x1], #16
+ ld1 {v7.16b}, [x20], #16
-0: mov bskey, x2
- mov rounds, x3
+0: mov bskey, x21
+ mov rounds, x22
bl \do8
- st1 {\o0\().16b}, [x0], #16
+ st1 {\o0\().16b}, [x19], #16
tbnz x5, #1, 1f
- st1 {\o1\().16b}, [x0], #16
+ st1 {\o1\().16b}, [x19], #16
tbnz x5, #2, 1f
- st1 {\o2\().16b}, [x0], #16
+ st1 {\o2\().16b}, [x19], #16
tbnz x5, #3, 1f
- st1 {\o3\().16b}, [x0], #16
+ st1 {\o3\().16b}, [x19], #16
tbnz x5, #4, 1f
- st1 {\o4\().16b}, [x0], #16
+ st1 {\o4\().16b}, [x19], #16
tbnz x5, #5, 1f
- st1 {\o5\().16b}, [x0], #16
+ st1 {\o5\().16b}, [x19], #16
tbnz x5, #6, 1f
- st1 {\o6\().16b}, [x0], #16
+ st1 {\o6\().16b}, [x19], #16
tbnz x5, #7, 1f
- st1 {\o7\().16b}, [x0], #16
+ st1 {\o7\().16b}, [x19], #16
- cbnz x4, 99b
+ cbz x23, 1f
+ cond_yield_neon
+ b 99b
-1: ldp x29, x30, [sp], #16
+1: frame_pop
ret
.endm
@@ -632,43 +639,49 @@ ENDPROC(aesbs_ecb_decrypt)
*/
.align 4
ENTRY(aesbs_cbc_decrypt)
- stp x29, x30, [sp, #-16]!
- mov x29, sp
+ frame_push 6
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
99: mov x6, #1
- lsl x6, x6, x4
- subs w4, w4, #8
- csel x4, x4, xzr, pl
+ lsl x6, x6, x23
+ subs w23, w23, #8
+ csel x23, x23, xzr, pl
csel x6, x6, xzr, mi
- ld1 {v0.16b}, [x1], #16
+ ld1 {v0.16b}, [x20], #16
mov v25.16b, v0.16b
tbnz x6, #1, 0f
- ld1 {v1.16b}, [x1], #16
+ ld1 {v1.16b}, [x20], #16
mov v26.16b, v1.16b
tbnz x6, #2, 0f
- ld1 {v2.16b}, [x1], #16
+ ld1 {v2.16b}, [x20], #16
mov v27.16b, v2.16b
tbnz x6, #3, 0f
- ld1 {v3.16b}, [x1], #16
+ ld1 {v3.16b}, [x20], #16
mov v28.16b, v3.16b
tbnz x6, #4, 0f
- ld1 {v4.16b}, [x1], #16
+ ld1 {v4.16b}, [x20], #16
mov v29.16b, v4.16b
tbnz x6, #5, 0f
- ld1 {v5.16b}, [x1], #16
+ ld1 {v5.16b}, [x20], #16
mov v30.16b, v5.16b
tbnz x6, #6, 0f
- ld1 {v6.16b}, [x1], #16
+ ld1 {v6.16b}, [x20], #16
mov v31.16b, v6.16b
tbnz x6, #7, 0f
- ld1 {v7.16b}, [x1]
+ ld1 {v7.16b}, [x20]
-0: mov bskey, x2
- mov rounds, x3
+0: mov bskey, x21
+ mov rounds, x22
bl aesbs_decrypt8
- ld1 {v24.16b}, [x5] // load IV
+ ld1 {v24.16b}, [x24] // load IV
eor v1.16b, v1.16b, v25.16b
eor v6.16b, v6.16b, v26.16b
@@ -679,34 +692,36 @@ ENTRY(aesbs_cbc_decrypt)
eor v3.16b, v3.16b, v30.16b
eor v5.16b, v5.16b, v31.16b
- st1 {v0.16b}, [x0], #16
+ st1 {v0.16b}, [x19], #16
mov v24.16b, v25.16b
tbnz x6, #1, 1f
- st1 {v1.16b}, [x0], #16
+ st1 {v1.16b}, [x19], #16
mov v24.16b, v26.16b
tbnz x6, #2, 1f
- st1 {v6.16b}, [x0], #16
+ st1 {v6.16b}, [x19], #16
mov v24.16b, v27.16b
tbnz x6, #3, 1f
- st1 {v4.16b}, [x0], #16
+ st1 {v4.16b}, [x19], #16
mov v24.16b, v28.16b
tbnz x6, #4, 1f
- st1 {v2.16b}, [x0], #16
+ st1 {v2.16b}, [x19], #16
mov v24.16b, v29.16b
tbnz x6, #5, 1f
- st1 {v7.16b}, [x0], #16
+ st1 {v7.16b}, [x19], #16
mov v24.16b, v30.16b
tbnz x6, #6, 1f
- st1 {v3.16b}, [x0], #16
+ st1 {v3.16b}, [x19], #16
mov v24.16b, v31.16b
tbnz x6, #7, 1f
- ld1 {v24.16b}, [x1], #16
- st1 {v5.16b}, [x0], #16
-1: st1 {v24.16b}, [x5] // store IV
+ ld1 {v24.16b}, [x20], #16
+ st1 {v5.16b}, [x19], #16
+1: st1 {v24.16b}, [x24] // store IV
- cbnz x4, 99b
+ cbz x23, 2f
+ cond_yield_neon
+ b 99b
- ldp x29, x30, [sp], #16
+2: frame_pop
ret
ENDPROC(aesbs_cbc_decrypt)
@@ -731,87 +746,93 @@ CPU_BE( .quad 0x87, 1 )
*/
__xts_crypt8:
mov x6, #1
- lsl x6, x6, x4
- subs w4, w4, #8
- csel x4, x4, xzr, pl
+ lsl x6, x6, x23
+ subs w23, w23, #8
+ csel x23, x23, xzr, pl
csel x6, x6, xzr, mi
- ld1 {v0.16b}, [x1], #16
+ ld1 {v0.16b}, [x20], #16
next_tweak v26, v25, v30, v31
eor v0.16b, v0.16b, v25.16b
tbnz x6, #1, 0f
- ld1 {v1.16b}, [x1], #16
+ ld1 {v1.16b}, [x20], #16
next_tweak v27, v26, v30, v31
eor v1.16b, v1.16b, v26.16b
tbnz x6, #2, 0f
- ld1 {v2.16b}, [x1], #16
+ ld1 {v2.16b}, [x20], #16
next_tweak v28, v27, v30, v31
eor v2.16b, v2.16b, v27.16b
tbnz x6, #3, 0f
- ld1 {v3.16b}, [x1], #16
+ ld1 {v3.16b}, [x20], #16
next_tweak v29, v28, v30, v31
eor v3.16b, v3.16b, v28.16b
tbnz x6, #4, 0f
- ld1 {v4.16b}, [x1], #16
- str q29, [sp, #16]
+ ld1 {v4.16b}, [x20], #16
+ str q29, [sp, #.Lframe_local_offset]
eor v4.16b, v4.16b, v29.16b
next_tweak v29, v29, v30, v31
tbnz x6, #5, 0f
- ld1 {v5.16b}, [x1], #16
- str q29, [sp, #32]
+ ld1 {v5.16b}, [x20], #16
+ str q29, [sp, #.Lframe_local_offset + 16]
eor v5.16b, v5.16b, v29.16b
next_tweak v29, v29, v30, v31
tbnz x6, #6, 0f
- ld1 {v6.16b}, [x1], #16
- str q29, [sp, #48]
+ ld1 {v6.16b}, [x20], #16
+ str q29, [sp, #.Lframe_local_offset + 32]
eor v6.16b, v6.16b, v29.16b
next_tweak v29, v29, v30, v31
tbnz x6, #7, 0f
- ld1 {v7.16b}, [x1], #16
- str q29, [sp, #64]
+ ld1 {v7.16b}, [x20], #16
+ str q29, [sp, #.Lframe_local_offset + 48]
eor v7.16b, v7.16b, v29.16b
next_tweak v29, v29, v30, v31
-0: mov bskey, x2
- mov rounds, x3
+0: mov bskey, x21
+ mov rounds, x22
br x7
ENDPROC(__xts_crypt8)
.macro __xts_crypt, do8, o0, o1, o2, o3, o4, o5, o6, o7
- stp x29, x30, [sp, #-80]!
- mov x29, sp
+ frame_push 6, 64
- ldr q30, .Lxts_mul_x
- ld1 {v25.16b}, [x5]
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
+
+0: ldr q30, .Lxts_mul_x
+ ld1 {v25.16b}, [x24]
99: adr x7, \do8
bl __xts_crypt8
- ldp q16, q17, [sp, #16]
- ldp q18, q19, [sp, #48]
+ ldp q16, q17, [sp, #.Lframe_local_offset]
+ ldp q18, q19, [sp, #.Lframe_local_offset + 32]
eor \o0\().16b, \o0\().16b, v25.16b
eor \o1\().16b, \o1\().16b, v26.16b
eor \o2\().16b, \o2\().16b, v27.16b
eor \o3\().16b, \o3\().16b, v28.16b
- st1 {\o0\().16b}, [x0], #16
+ st1 {\o0\().16b}, [x19], #16
mov v25.16b, v26.16b
tbnz x6, #1, 1f
- st1 {\o1\().16b}, [x0], #16
+ st1 {\o1\().16b}, [x19], #16
mov v25.16b, v27.16b
tbnz x6, #2, 1f
- st1 {\o2\().16b}, [x0], #16
+ st1 {\o2\().16b}, [x19], #16
mov v25.16b, v28.16b
tbnz x6, #3, 1f
- st1 {\o3\().16b}, [x0], #16
+ st1 {\o3\().16b}, [x19], #16
mov v25.16b, v29.16b
tbnz x6, #4, 1f
@@ -820,18 +841,22 @@ ENDPROC(__xts_crypt8)
eor \o6\().16b, \o6\().16b, v18.16b
eor \o7\().16b, \o7\().16b, v19.16b
- st1 {\o4\().16b}, [x0], #16
+ st1 {\o4\().16b}, [x19], #16
tbnz x6, #5, 1f
- st1 {\o5\().16b}, [x0], #16
+ st1 {\o5\().16b}, [x19], #16
tbnz x6, #6, 1f
- st1 {\o6\().16b}, [x0], #16
+ st1 {\o6\().16b}, [x19], #16
tbnz x6, #7, 1f
- st1 {\o7\().16b}, [x0], #16
+ st1 {\o7\().16b}, [x19], #16
+
+ cbz x23, 1f
+ st1 {v25.16b}, [x24]
- cbnz x4, 99b
+ cond_yield_neon 0b
+ b 99b
-1: st1 {v25.16b}, [x5]
- ldp x29, x30, [sp], #80
+1: st1 {v25.16b}, [x24]
+ frame_pop
ret
.endm
@@ -856,24 +881,31 @@ ENDPROC(aesbs_xts_decrypt)
* int rounds, int blocks, u8 iv[], u8 final[])
*/
ENTRY(aesbs_ctr_encrypt)
- stp x29, x30, [sp, #-16]!
- mov x29, sp
-
- cmp x6, #0
- cset x10, ne
- add x4, x4, x10 // do one extra block if final
-
- ldp x7, x8, [x5]
- ld1 {v0.16b}, [x5]
+ frame_push 8
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
+ mov x25, x6
+
+ cmp x25, #0
+ cset x26, ne
+ add x23, x23, x26 // do one extra block if final
+
+98: ldp x7, x8, [x24]
+ ld1 {v0.16b}, [x24]
CPU_LE( rev x7, x7 )
CPU_LE( rev x8, x8 )
adds x8, x8, #1
adc x7, x7, xzr
99: mov x9, #1
- lsl x9, x9, x4
- subs w4, w4, #8
- csel x4, x4, xzr, pl
+ lsl x9, x9, x23
+ subs w23, w23, #8
+ csel x23, x23, xzr, pl
csel x9, x9, xzr, le
tbnz x9, #1, 0f
@@ -891,82 +923,85 @@ CPU_LE( rev x8, x8 )
tbnz x9, #7, 0f
next_ctr v7
-0: mov bskey, x2
- mov rounds, x3
+0: mov bskey, x21
+ mov rounds, x22
bl aesbs_encrypt8
- lsr x9, x9, x10 // disregard the extra block
+ lsr x9, x9, x26 // disregard the extra block
tbnz x9, #0, 0f
- ld1 {v8.16b}, [x1], #16
+ ld1 {v8.16b}, [x20], #16
eor v0.16b, v0.16b, v8.16b
- st1 {v0.16b}, [x0], #16
+ st1 {v0.16b}, [x19], #16
tbnz x9, #1, 1f
- ld1 {v9.16b}, [x1], #16
+ ld1 {v9.16b}, [x20], #16
eor v1.16b, v1.16b, v9.16b
- st1 {v1.16b}, [x0], #16
+ st1 {v1.16b}, [x19], #16
tbnz x9, #2, 2f
- ld1 {v10.16b}, [x1], #16
+ ld1 {v10.16b}, [x20], #16
eor v4.16b, v4.16b, v10.16b
- st1 {v4.16b}, [x0], #16
+ st1 {v4.16b}, [x19], #16
tbnz x9, #3, 3f
- ld1 {v11.16b}, [x1], #16
+ ld1 {v11.16b}, [x20], #16
eor v6.16b, v6.16b, v11.16b
- st1 {v6.16b}, [x0], #16
+ st1 {v6.16b}, [x19], #16
tbnz x9, #4, 4f
- ld1 {v12.16b}, [x1], #16
+ ld1 {v12.16b}, [x20], #16
eor v3.16b, v3.16b, v12.16b
- st1 {v3.16b}, [x0], #16
+ st1 {v3.16b}, [x19], #16
tbnz x9, #5, 5f
- ld1 {v13.16b}, [x1], #16
+ ld1 {v13.16b}, [x20], #16
eor v7.16b, v7.16b, v13.16b
- st1 {v7.16b}, [x0], #16
+ st1 {v7.16b}, [x19], #16
tbnz x9, #6, 6f
- ld1 {v14.16b}, [x1], #16
+ ld1 {v14.16b}, [x20], #16
eor v2.16b, v2.16b, v14.16b
- st1 {v2.16b}, [x0], #16
+ st1 {v2.16b}, [x19], #16
tbnz x9, #7, 7f
- ld1 {v15.16b}, [x1], #16
+ ld1 {v15.16b}, [x20], #16
eor v5.16b, v5.16b, v15.16b
- st1 {v5.16b}, [x0], #16
+ st1 {v5.16b}, [x19], #16
8: next_ctr v0
- cbnz x4, 99b
+ st1 {v0.16b}, [x24]
+ cbz x23, 0f
+
+ cond_yield_neon 98b
+ b 99b
-0: st1 {v0.16b}, [x5]
- ldp x29, x30, [sp], #16
+0: frame_pop
ret
/*
* If we are handling the tail of the input (x6 != NULL), return the
* final keystream block back to the caller.
*/
-1: cbz x6, 8b
- st1 {v1.16b}, [x6]
+1: cbz x25, 8b
+ st1 {v1.16b}, [x25]
b 8b
-2: cbz x6, 8b
- st1 {v4.16b}, [x6]
+2: cbz x25, 8b
+ st1 {v4.16b}, [x25]
b 8b
-3: cbz x6, 8b
- st1 {v6.16b}, [x6]
+3: cbz x25, 8b
+ st1 {v6.16b}, [x25]
b 8b
-4: cbz x6, 8b
- st1 {v3.16b}, [x6]
+4: cbz x25, 8b
+ st1 {v3.16b}, [x25]
b 8b
-5: cbz x6, 8b
- st1 {v7.16b}, [x6]
+5: cbz x25, 8b
+ st1 {v7.16b}, [x25]
b 8b
-6: cbz x6, 8b
- st1 {v2.16b}, [x6]
+6: cbz x25, 8b
+ st1 {v2.16b}, [x25]
b 8b
-7: cbz x6, 8b
- st1 {v5.16b}, [x6]
+7: cbz x25, 8b
+ st1 {v5.16b}, [x25]
b 8b
ENDPROC(aesbs_ctr_encrypt)
--
2.15.1
- st1 {v7.16b}, [x0], #16
+ st1 {v7.16b}, [x19], #16
tbnz x9, #6, 6f
- ld1 {v14.16b}, [x1], #16
+ ld1 {v14.16b}, [x20], #16
eor v2.16b, v2.16b, v14.16b
- st1 {v2.16b}, [x0], #16
+ st1 {v2.16b}, [x19], #16
tbnz x9, #7, 7f
- ld1 {v15.16b}, [x1], #16
+ ld1 {v15.16b}, [x20], #16
eor v5.16b, v5.16b, v15.16b
- st1 {v5.16b}, [x0], #16
+ st1 {v5.16b}, [x19], #16
8: next_ctr v0
- cbnz x4, 99b
+ st1 {v0.16b}, [x24]
+ cbz x23, 0f
+
+ cond_yield_neon 98b
+ b 99b
-0: st1 {v0.16b}, [x5]
- ldp x29, x30, [sp], #16
+0: frame_pop
ret
/*
* If we are handling the tail of the input (x6 != NULL), return the
* final keystream block back to the caller.
*/
-1: cbz x6, 8b
- st1 {v1.16b}, [x6]
+1: cbz x25, 8b
+ st1 {v1.16b}, [x25]
b 8b
-2: cbz x6, 8b
- st1 {v4.16b}, [x6]
+2: cbz x25, 8b
+ st1 {v4.16b}, [x25]
b 8b
-3: cbz x6, 8b
- st1 {v6.16b}, [x6]
+3: cbz x25, 8b
+ st1 {v6.16b}, [x25]
b 8b
-4: cbz x6, 8b
- st1 {v3.16b}, [x6]
+4: cbz x25, 8b
+ st1 {v3.16b}, [x25]
b 8b
-5: cbz x6, 8b
- st1 {v7.16b}, [x6]
+5: cbz x25, 8b
+ st1 {v7.16b}, [x25]
b 8b
-6: cbz x6, 8b
- st1 {v2.16b}, [x6]
+6: cbz x25, 8b
+ st1 {v2.16b}, [x25]
b 8b
-7: cbz x6, 8b
- st1 {v5.16b}, [x6]
+7: cbz x25, 8b
+ st1 {v5.16b}, [x25]
b 8b
ENDPROC(aesbs_ctr_encrypt)
--
2.15.1
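In `aesbs_ctr_encrypt` above, the counter is kept as a 128-bit big-endian value split across two general-purpose registers and bumped with an `adds`/`adc` pair (`adds x8, x8, #1` / `adc x7, x7, xzr`). A minimal C sketch of that carry-propagating increment — the struct and helper names are illustrative, not kernel code:

```c
#include <stdint.h>

/* 128-bit big-endian counter split into two 64-bit halves,
 * mirroring the x7/x8 register pair in aesbs_ctr_encrypt. */
struct ctr128 {
    uint64_t hi;   /* upper 64 bits (x7) */
    uint64_t lo;   /* lower 64 bits (x8) */
};

/* Equivalent of: adds x8, x8, #1 ; adc x7, x7, xzr */
static void ctr128_inc(struct ctr128 *c)
{
    c->lo += 1;
    if (c->lo == 0)        /* carry out of the low half */
        c->hi += 1;
}
```

The assembly gets the carry for free from the flags; in C it has to be detected by the wrap of the low half.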
* [PATCH v5 17/23] crypto: arm64/aes-ghash - yield NEON after every block of input
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-10 15:22 ` Ard Biesheuvel
1 sibling, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:22 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/ghash-ce-core.S | 113 ++++++++++++++------
arch/arm64/crypto/ghash-ce-glue.c | 28 +++--
2 files changed, 97 insertions(+), 44 deletions(-)
diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index 11ebf1ae248a..dcffb9e77589 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -213,22 +213,31 @@
.endm
.macro __pmull_ghash, pn
- ld1 {SHASH.2d}, [x3]
- ld1 {XL.2d}, [x1]
+ frame_push 5
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+
+0: ld1 {SHASH.2d}, [x22]
+ ld1 {XL.2d}, [x20]
ext SHASH2.16b, SHASH.16b, SHASH.16b, #8
eor SHASH2.16b, SHASH2.16b, SHASH.16b
__pmull_pre_\pn
/* do the head block first, if supplied */
- cbz x4, 0f
- ld1 {T1.2d}, [x4]
- b 1f
+ cbz x23, 1f
+ ld1 {T1.2d}, [x23]
+ mov x23, xzr
+ b 2f
-0: ld1 {T1.2d}, [x2], #16
- sub w0, w0, #1
+1: ld1 {T1.2d}, [x21], #16
+ sub w19, w19, #1
-1: /* multiply XL by SHASH in GF(2^128) */
+2: /* multiply XL by SHASH in GF(2^128) */
CPU_LE( rev64 T1.16b, T1.16b )
ext T2.16b, XL.16b, XL.16b, #8
@@ -250,9 +259,18 @@ CPU_LE( rev64 T1.16b, T1.16b )
eor T2.16b, T2.16b, XH.16b
eor XL.16b, XL.16b, T2.16b
- cbnz w0, 0b
+ cbz w19, 3f
+
+ if_will_cond_yield_neon
+ st1 {XL.2d}, [x20]
+ do_cond_yield_neon
+ b 0b
+ endif_yield_neon
+
+ b 1b
- st1 {XL.2d}, [x1]
+3: st1 {XL.2d}, [x20]
+ frame_pop
ret
.endm
@@ -304,38 +322,55 @@ ENDPROC(pmull_ghash_update_p8)
.endm
.macro pmull_gcm_do_crypt, enc
- ld1 {SHASH.2d}, [x4]
- ld1 {XL.2d}, [x1]
- ldr x8, [x5, #8] // load lower counter
+ frame_push 10
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
+ mov x25, x6
+ mov x26, x7
+ .if \enc == 1
+ ldr x27, [sp, #96] // first stacked arg
+ .endif
+
+ ldr x28, [x24, #8] // load lower counter
+CPU_LE( rev x28, x28 )
+
+0: mov x0, x25
+ load_round_keys w26, x0
+ ld1 {SHASH.2d}, [x23]
+ ld1 {XL.2d}, [x20]
movi MASK.16b, #0xe1
ext SHASH2.16b, SHASH.16b, SHASH.16b, #8
-CPU_LE( rev x8, x8 )
shl MASK.2d, MASK.2d, #57
eor SHASH2.16b, SHASH2.16b, SHASH.16b
.if \enc == 1
- ld1 {KS.16b}, [x7]
+ ld1 {KS.16b}, [x27]
.endif
-0: ld1 {CTR.8b}, [x5] // load upper counter
- ld1 {INP.16b}, [x3], #16
- rev x9, x8
- add x8, x8, #1
- sub w0, w0, #1
+1: ld1 {CTR.8b}, [x24] // load upper counter
+ ld1 {INP.16b}, [x22], #16
+ rev x9, x28
+ add x28, x28, #1
+ sub w19, w19, #1
ins CTR.d[1], x9 // set lower counter
.if \enc == 1
eor INP.16b, INP.16b, KS.16b // encrypt input
- st1 {INP.16b}, [x2], #16
+ st1 {INP.16b}, [x21], #16
.endif
rev64 T1.16b, INP.16b
- cmp w6, #12
- b.ge 2f // AES-192/256?
+ cmp w26, #12
+ b.ge 4f // AES-192/256?
-1: enc_round CTR, v21
+2: enc_round CTR, v21
ext T2.16b, XL.16b, XL.16b, #8
ext IN1.16b, T1.16b, T1.16b, #8
@@ -390,27 +425,39 @@ CPU_LE( rev x8, x8 )
.if \enc == 0
eor INP.16b, INP.16b, KS.16b
- st1 {INP.16b}, [x2], #16
+ st1 {INP.16b}, [x21], #16
.endif
- cbnz w0, 0b
+ cbz w19, 3f
-CPU_LE( rev x8, x8 )
- st1 {XL.2d}, [x1]
- str x8, [x5, #8] // store lower counter
+ if_will_cond_yield_neon
+ st1 {XL.2d}, [x20]
+ .if \enc == 1
+ st1 {KS.16b}, [x27]
+ .endif
+ do_cond_yield_neon
+ b 0b
+ endif_yield_neon
+ b 1b
+
+3: st1 {XL.2d}, [x20]
.if \enc == 1
- st1 {KS.16b}, [x7]
+ st1 {KS.16b}, [x27]
.endif
+CPU_LE( rev x28, x28 )
+ str x28, [x24, #8] // store lower counter
+
+ frame_pop
ret
-2: b.eq 3f // AES-192?
+4: b.eq 5f // AES-192?
enc_round CTR, v17
enc_round CTR, v18
-3: enc_round CTR, v19
+5: enc_round CTR, v19
enc_round CTR, v20
- b 1b
+ b 2b
.endm
/*
diff --git a/arch/arm64/crypto/ghash-ce-glue.c b/arch/arm64/crypto/ghash-ce-glue.c
index cfc9c92814fd..7cf0b1aa6ea8 100644
--- a/arch/arm64/crypto/ghash-ce-glue.c
+++ b/arch/arm64/crypto/ghash-ce-glue.c
@@ -63,11 +63,12 @@ static void (*pmull_ghash_update)(int blocks, u64 dg[], const char *src,
asmlinkage void pmull_gcm_encrypt(int blocks, u64 dg[], u8 dst[],
const u8 src[], struct ghash_key const *k,
- u8 ctr[], int rounds, u8 ks[]);
+ u8 ctr[], u32 const rk[], int rounds,
+ u8 ks[]);
asmlinkage void pmull_gcm_decrypt(int blocks, u64 dg[], u8 dst[],
const u8 src[], struct ghash_key const *k,
- u8 ctr[], int rounds);
+ u8 ctr[], u32 const rk[], int rounds);
asmlinkage void pmull_gcm_encrypt_block(u8 dst[], u8 const src[],
u32 const rk[], int rounds);
@@ -368,26 +369,29 @@ static int gcm_encrypt(struct aead_request *req)
pmull_gcm_encrypt_block(ks, iv, NULL,
num_rounds(&ctx->aes_key));
put_unaligned_be32(3, iv + GCM_IV_SIZE);
+ kernel_neon_end();
- err = skcipher_walk_aead_encrypt(&walk, req, true);
+ err = skcipher_walk_aead_encrypt(&walk, req, false);
while (walk.nbytes >= AES_BLOCK_SIZE) {
int blocks = walk.nbytes / AES_BLOCK_SIZE;
+ kernel_neon_begin();
pmull_gcm_encrypt(blocks, dg, walk.dst.virt.addr,
walk.src.virt.addr, &ctx->ghash_key,
- iv, num_rounds(&ctx->aes_key), ks);
+ iv, ctx->aes_key.key_enc,
+ num_rounds(&ctx->aes_key), ks);
+ kernel_neon_end();
err = skcipher_walk_done(&walk,
walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
} else {
__aes_arm64_encrypt(ctx->aes_key.key_enc, tag, iv,
num_rounds(&ctx->aes_key));
put_unaligned_be32(2, iv + GCM_IV_SIZE);
- err = skcipher_walk_aead_encrypt(&walk, req, true);
+ err = skcipher_walk_aead_encrypt(&walk, req, false);
while (walk.nbytes >= AES_BLOCK_SIZE) {
int blocks = walk.nbytes / AES_BLOCK_SIZE;
@@ -467,15 +471,19 @@ static int gcm_decrypt(struct aead_request *req)
pmull_gcm_encrypt_block(tag, iv, ctx->aes_key.key_enc,
num_rounds(&ctx->aes_key));
put_unaligned_be32(2, iv + GCM_IV_SIZE);
+ kernel_neon_end();
- err = skcipher_walk_aead_decrypt(&walk, req, true);
+ err = skcipher_walk_aead_decrypt(&walk, req, false);
while (walk.nbytes >= AES_BLOCK_SIZE) {
int blocks = walk.nbytes / AES_BLOCK_SIZE;
+ kernel_neon_begin();
pmull_gcm_decrypt(blocks, dg, walk.dst.virt.addr,
walk.src.virt.addr, &ctx->ghash_key,
- iv, num_rounds(&ctx->aes_key));
+ iv, ctx->aes_key.key_enc,
+ num_rounds(&ctx->aes_key));
+ kernel_neon_end();
err = skcipher_walk_done(&walk,
walk.nbytes % AES_BLOCK_SIZE);
@@ -483,14 +491,12 @@ static int gcm_decrypt(struct aead_request *req)
if (walk.nbytes)
pmull_gcm_encrypt_block(iv, iv, NULL,
num_rounds(&ctx->aes_key));
-
- kernel_neon_end();
} else {
__aes_arm64_encrypt(ctx->aes_key.key_enc, tag, iv,
num_rounds(&ctx->aes_key));
put_unaligned_be32(2, iv + GCM_IV_SIZE);
- err = skcipher_walk_aead_decrypt(&walk, req, true);
+ err = skcipher_walk_aead_decrypt(&walk, req, false);
while (walk.nbytes >= AES_BLOCK_SIZE) {
int blocks = walk.nbytes / AES_BLOCK_SIZE;
--
2.15.1
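The glue-code half of the patch above moves `kernel_neon_end()` out of the long-running section and brackets each `pmull_gcm_*` call individually, so the skcipher walk (which may allocate or free memory) runs with preemption enabled. A sketch of the resulting pattern, with the kernel primitives replaced by counting stubs — everything here is a simulation, not the real fpsimd API:

```c
/* Stand-ins for kernel_neon_begin()/kernel_neon_end(): here they
 * only track nesting depth and how often NEON was claimed. */
static int neon_depth, neon_claims;
static void kernel_neon_begin(void) { neon_depth++; neon_claims++; }
static void kernel_neon_end(void)   { neon_depth--; }

/* Stand-in for one pmull_gcm_encrypt() call on a walk chunk. */
static void process_chunk(int blocks) { (void)blocks; }

/* After the patch: NEON is claimed per walk chunk, so preemption
 * is disabled only while actual crypto runs, never across the
 * (possibly sleeping) skcipher walk step between chunks. */
static int gcm_encrypt_sketch(const int *chunks, int n)
{
    for (int i = 0; i < n; i++) {
        kernel_neon_begin();
        process_chunk(chunks[i]);
        kernel_neon_end();
        /* skcipher_walk_done() would run here, with
         * preemption enabled again */
    }
    return neon_depth;   /* 0: every begin paired with an end */
}
```

Before the patch, the begin/end pair sat outside the loop: one claim, but held across every walk step.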
* [PATCH v5 18/23] crypto: arm64/crc32-ce - yield NEON after every block of input
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-10 15:22 ` Ard Biesheuvel
1 sibling, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:22 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/crc32-ce-core.S | 40 +++++++++++++++-----
1 file changed, 30 insertions(+), 10 deletions(-)
diff --git a/arch/arm64/crypto/crc32-ce-core.S b/arch/arm64/crypto/crc32-ce-core.S
index 16ed3c7ebd37..8061bf0f9c66 100644
--- a/arch/arm64/crypto/crc32-ce-core.S
+++ b/arch/arm64/crypto/crc32-ce-core.S
@@ -100,9 +100,10 @@
dCONSTANT .req d0
qCONSTANT .req q0
- BUF .req x0
- LEN .req x1
- CRC .req x2
+ BUF .req x19
+ LEN .req x20
+ CRC .req x21
+ CONST .req x22
vzr .req v9
@@ -123,7 +124,14 @@ ENTRY(crc32_pmull_le)
ENTRY(crc32c_pmull_le)
adr_l x3, .Lcrc32c_constants
-0: bic LEN, LEN, #15
+0: frame_push 4, 64
+
+ mov BUF, x0
+ mov LEN, x1
+ mov CRC, x2
+ mov CONST, x3
+
+ bic LEN, LEN, #15
ld1 {v1.16b-v4.16b}, [BUF], #0x40
movi vzr.16b, #0
fmov dCONSTANT, CRC
@@ -132,7 +140,7 @@ ENTRY(crc32c_pmull_le)
cmp LEN, #0x40
b.lt less_64
- ldr qCONSTANT, [x3]
+ ldr qCONSTANT, [CONST]
loop_64: /* 64 bytes Full cache line folding */
sub LEN, LEN, #0x40
@@ -162,10 +170,21 @@ loop_64: /* 64 bytes Full cache line folding */
eor v4.16b, v4.16b, v8.16b
cmp LEN, #0x40
- b.ge loop_64
+ b.lt less_64
+
+ if_will_cond_yield_neon
+ stp q1, q2, [sp, #.Lframe_local_offset]
+ stp q3, q4, [sp, #.Lframe_local_offset + 32]
+ do_cond_yield_neon
+ ldp q1, q2, [sp, #.Lframe_local_offset]
+ ldp q3, q4, [sp, #.Lframe_local_offset + 32]
+ ldr qCONSTANT, [CONST]
+ movi vzr.16b, #0
+ endif_yield_neon
+ b loop_64
less_64: /* Folding cache line into 128bit */
- ldr qCONSTANT, [x3, #16]
+ ldr qCONSTANT, [CONST, #16]
pmull2 v5.1q, v1.2d, vCONSTANT.2d
pmull v1.1q, v1.1d, vCONSTANT.1d
@@ -204,8 +223,8 @@ fold_64:
eor v1.16b, v1.16b, v2.16b
/* final 32-bit fold */
- ldr dCONSTANT, [x3, #32]
- ldr d3, [x3, #40]
+ ldr dCONSTANT, [CONST, #32]
+ ldr d3, [CONST, #40]
ext v2.16b, v1.16b, vzr.16b, #4
and v1.16b, v1.16b, v3.16b
@@ -213,7 +232,7 @@ fold_64:
eor v1.16b, v1.16b, v2.16b
/* Finish up with the bit-reversed barrett reduction 64 ==> 32 bits */
- ldr qCONSTANT, [x3, #48]
+ ldr qCONSTANT, [CONST, #48]
and v2.16b, v1.16b, v3.16b
ext v2.16b, vzr.16b, v2.16b, #8
@@ -223,6 +242,7 @@ fold_64:
eor v1.16b, v1.16b, v2.16b
mov w0, v1.s[1]
+ frame_pop
ret
ENDPROC(crc32_pmull_le)
ENDPROC(crc32c_pmull_le)
--
2.15.1
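The PMULL routine above folds 64-byte cache lines in parallel, but the value it produces is plain CRC-32C (Castagnoli, reflected, polynomial 0x1EDC6F41). For reference, a bit-serial implementation of the same CRC — this mirrors only the result, not the folded structure of the assembly:

```c
#include <stdint.h>
#include <stddef.h>

/* Bit-serial CRC-32C: reflected polynomial 0x82F63B78, with the
 * conventional 0xFFFFFFFF pre- and post-inversion. Computes the
 * same value as the PMULL folding code, one bit at a time. */
static uint32_t crc32c_ref(uint32_t crc, const uint8_t *p, size_t len)
{
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1));
    }
    return ~crc;
}
```

The standard check value for CRC-32C over the string "123456789" is 0xE3069283.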
* [PATCH v5 19/23] crypto: arm64/crct10dif-ce - yield NEON after every block of input
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-10 15:22 ` Ard Biesheuvel
1 sibling, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:22 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/crct10dif-ce-core.S | 32 +++++++++++++++++---
1 file changed, 28 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/crypto/crct10dif-ce-core.S b/arch/arm64/crypto/crct10dif-ce-core.S
index f179c01bd55c..663ea71cdb38 100644
--- a/arch/arm64/crypto/crct10dif-ce-core.S
+++ b/arch/arm64/crypto/crct10dif-ce-core.S
@@ -74,13 +74,19 @@
.text
.cpu generic+crypto
- arg1_low32 .req w0
- arg2 .req x1
- arg3 .req x2
+ arg1_low32 .req w19
+ arg2 .req x20
+ arg3 .req x21
vzr .req v13
ENTRY(crc_t10dif_pmull)
+ frame_push 3, 128
+
+ mov arg1_low32, w0
+ mov arg2, x1
+ mov arg3, x2
+
movi vzr.16b, #0 // init zero register
// adjust the 16-bit initial_crc value, scale it to 32 bits
@@ -175,8 +181,25 @@ CPU_LE( ext v12.16b, v12.16b, v12.16b, #8 )
subs arg3, arg3, #128
// check if there is another 64B in the buffer to be able to fold
- b.ge _fold_64_B_loop
+ b.lt _fold_64_B_end
+
+ if_will_cond_yield_neon
+ stp q0, q1, [sp, #.Lframe_local_offset]
+ stp q2, q3, [sp, #.Lframe_local_offset + 32]
+ stp q4, q5, [sp, #.Lframe_local_offset + 64]
+ stp q6, q7, [sp, #.Lframe_local_offset + 96]
+ do_cond_yield_neon
+ ldp q0, q1, [sp, #.Lframe_local_offset]
+ ldp q2, q3, [sp, #.Lframe_local_offset + 32]
+ ldp q4, q5, [sp, #.Lframe_local_offset + 64]
+ ldp q6, q7, [sp, #.Lframe_local_offset + 96]
+ ldr_l q10, rk3, x8
+ movi vzr.16b, #0 // init zero register
+ endif_yield_neon
+
+ b _fold_64_B_loop
+_fold_64_B_end:
// at this point, the buffer pointer is pointing at the last y Bytes
// of the buffer the 64B of folded data is in 4 of the vector
// registers: v0, v1, v2, v3
@@ -304,6 +327,7 @@ _barrett:
_cleanup:
// scale the result back to 16 bits
lsr x0, x0, #16
+ frame_pop
ret
_less_than_128:
--
2.15.1
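For reference, the CRC computed by `crc_t10dif_pmull` is CRC-16/T10-DIF: polynomial 0x8BB7, not reflected, zero init. A bit-serial sketch plus an independently derived table-driven variant, cross-checked against each other (neither resembles the folded assembly; they only reproduce its result):

```c
#include <stdint.h>
#include <stddef.h>

/* CRC-16/T10-DIF, bit at a time: poly 0x8BB7, MSB-first, init 0. */
static uint16_t crc_t10dif_bit(uint16_t crc, const uint8_t *p, size_t len)
{
    while (len--) {
        crc ^= (uint16_t)(*p++ << 8);
        for (int k = 0; k < 8; k++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Same CRC, byte at a time via a lazily generated table; used here
 * purely to cross-check the bit-serial version. */
static uint16_t crc_t10dif_table(uint16_t crc, const uint8_t *p, size_t len)
{
    static uint16_t tab[256];
    if (!tab[1]) {
        for (int i = 0; i < 256; i++) {
            uint16_t c = (uint16_t)(i << 8);
            for (int k = 0; k < 8; k++)
                c = (c & 0x8000) ? (uint16_t)((c << 1) ^ 0x8BB7)
                                 : (uint16_t)(c << 1);
            tab[i] = c;
        }
    }
    while (len--)
        crc = (uint16_t)((crc << 8) ^ tab[(crc >> 8) ^ *p++]);
    return crc;
}
```

Both variants must agree on every input; the table version is what a software fallback would typically use.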
* [PATCH v5 20/23] crypto: arm64/sha3-ce - yield NEON after every block of input
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-10 15:22 ` Ard Biesheuvel
1 sibling, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:22 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/sha3-ce-core.S | 77 +++++++++++++-------
1 file changed, 50 insertions(+), 27 deletions(-)
diff --git a/arch/arm64/crypto/sha3-ce-core.S b/arch/arm64/crypto/sha3-ce-core.S
index 332ad7530690..a7d587fa54f6 100644
--- a/arch/arm64/crypto/sha3-ce-core.S
+++ b/arch/arm64/crypto/sha3-ce-core.S
@@ -41,9 +41,16 @@
*/
.text
ENTRY(sha3_ce_transform)
- /* load state */
- add x8, x0, #32
- ld1 { v0.1d- v3.1d}, [x0]
+ frame_push 4
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+
+0: /* load state */
+ add x8, x19, #32
+ ld1 { v0.1d- v3.1d}, [x19]
ld1 { v4.1d- v7.1d}, [x8], #32
ld1 { v8.1d-v11.1d}, [x8], #32
ld1 {v12.1d-v15.1d}, [x8], #32
@@ -51,13 +58,13 @@ ENTRY(sha3_ce_transform)
ld1 {v20.1d-v23.1d}, [x8], #32
ld1 {v24.1d}, [x8]
-0: sub w2, w2, #1
+1: sub w21, w21, #1
mov w8, #24
adr_l x9, .Lsha3_rcon
/* load input */
- ld1 {v25.8b-v28.8b}, [x1], #32
- ld1 {v29.8b-v31.8b}, [x1], #24
+ ld1 {v25.8b-v28.8b}, [x20], #32
+ ld1 {v29.8b-v31.8b}, [x20], #24
eor v0.8b, v0.8b, v25.8b
eor v1.8b, v1.8b, v26.8b
eor v2.8b, v2.8b, v27.8b
@@ -66,10 +73,10 @@ ENTRY(sha3_ce_transform)
eor v5.8b, v5.8b, v30.8b
eor v6.8b, v6.8b, v31.8b
- tbnz x3, #6, 2f // SHA3-512
+ tbnz x22, #6, 3f // SHA3-512
- ld1 {v25.8b-v28.8b}, [x1], #32
- ld1 {v29.8b-v30.8b}, [x1], #16
+ ld1 {v25.8b-v28.8b}, [x20], #32
+ ld1 {v29.8b-v30.8b}, [x20], #16
eor v7.8b, v7.8b, v25.8b
eor v8.8b, v8.8b, v26.8b
eor v9.8b, v9.8b, v27.8b
@@ -77,34 +84,34 @@ ENTRY(sha3_ce_transform)
eor v11.8b, v11.8b, v29.8b
eor v12.8b, v12.8b, v30.8b
- tbnz x3, #4, 1f // SHA3-384 or SHA3-224
+ tbnz x22, #4, 2f // SHA3-384 or SHA3-224
// SHA3-256
- ld1 {v25.8b-v28.8b}, [x1], #32
+ ld1 {v25.8b-v28.8b}, [x20], #32
eor v13.8b, v13.8b, v25.8b
eor v14.8b, v14.8b, v26.8b
eor v15.8b, v15.8b, v27.8b
eor v16.8b, v16.8b, v28.8b
- b 3f
+ b 4f
-1: tbz x3, #2, 3f // bit 2 cleared? SHA-384
+2: tbz x22, #2, 4f // bit 2 cleared? SHA-384
// SHA3-224
- ld1 {v25.8b-v28.8b}, [x1], #32
- ld1 {v29.8b}, [x1], #8
+ ld1 {v25.8b-v28.8b}, [x20], #32
+ ld1 {v29.8b}, [x20], #8
eor v13.8b, v13.8b, v25.8b
eor v14.8b, v14.8b, v26.8b
eor v15.8b, v15.8b, v27.8b
eor v16.8b, v16.8b, v28.8b
eor v17.8b, v17.8b, v29.8b
- b 3f
+ b 4f
// SHA3-512
-2: ld1 {v25.8b-v26.8b}, [x1], #16
+3: ld1 {v25.8b-v26.8b}, [x20], #16
eor v7.8b, v7.8b, v25.8b
eor v8.8b, v8.8b, v26.8b
-3: sub w8, w8, #1
+4: sub w8, w8, #1
eor3 v29.16b, v4.16b, v9.16b, v14.16b
eor3 v26.16b, v1.16b, v6.16b, v11.16b
@@ -183,17 +190,33 @@ ENTRY(sha3_ce_transform)
eor v0.16b, v0.16b, v31.16b
- cbnz w8, 3b
- cbnz w2, 0b
+ cbnz w8, 4b
+ cbz w21, 5f
+
+ if_will_cond_yield_neon
+ add x8, x19, #32
+ st1 { v0.1d- v3.1d}, [x19]
+ st1 { v4.1d- v7.1d}, [x8], #32
+ st1 { v8.1d-v11.1d}, [x8], #32
+ st1 {v12.1d-v15.1d}, [x8], #32
+ st1 {v16.1d-v19.1d}, [x8], #32
+ st1 {v20.1d-v23.1d}, [x8], #32
+ st1 {v24.1d}, [x8]
+ do_cond_yield_neon
+ b 0b
+ endif_yield_neon
+
+ b 1b
/* save state */
- st1 { v0.1d- v3.1d}, [x0], #32
- st1 { v4.1d- v7.1d}, [x0], #32
- st1 { v8.1d-v11.1d}, [x0], #32
- st1 {v12.1d-v15.1d}, [x0], #32
- st1 {v16.1d-v19.1d}, [x0], #32
- st1 {v20.1d-v23.1d}, [x0], #32
- st1 {v24.1d}, [x0]
+5: st1 { v0.1d- v3.1d}, [x19], #32
+ st1 { v4.1d- v7.1d}, [x19], #32
+ st1 { v8.1d-v11.1d}, [x19], #32
+ st1 {v12.1d-v15.1d}, [x19], #32
+ st1 {v16.1d-v19.1d}, [x19], #32
+ st1 {v20.1d-v23.1d}, [x19], #32
+ st1 {v24.1d}, [x19]
+ frame_pop
ret
ENDPROC(sha3_ce_transform)
--
2.15.1
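The control flow introduced above — spill the Keccak state inside `if_will_cond_yield_neon`, yield, then branch back to label `0:` to reload — can be modelled in plain C. This is a simulation of the structure only: `need_resched` here is a stub that pretends a reschedule is pending every third block, not the real `TIF_NEED_RESCHED` test:

```c
#include <string.h>

/* Simulated "NEON register file" and in-memory hash state. */
static int regs[25], state[25];
static int yields;

/* Stub: pretend a reschedule is pending every third block. */
static int need_resched(int block) { return block % 3 == 2; }

/* Control flow of sha3_ce_transform after the patch:
 *   0:  load state from memory
 *   1:  process one block
 *       if blocks remain and a reschedule is pending:
 *           spill state, yield, branch back to 0
 *   5:  save state and return
 */
static int sha3_transform_sketch(int blocks)
{
    memcpy(regs, state, sizeof(regs));          /* 0: load state */
    for (int b = 0; b < blocks; b++) {
        regs[0] += b + 1;                       /* 1: one block's worth */
        if (b + 1 < blocks && need_resched(b)) {
            memcpy(state, regs, sizeof(state)); /* spill before yield */
            yields++;                           /* do_cond_yield_neon */
            memcpy(regs, state, sizeof(regs));  /* b 0b: reload */
        }
    }
    memcpy(state, regs, sizeof(state));         /* 5: save state */
    return state[0];
}
```

The key property, preserved from the assembly, is that the result is identical whether or not a yield happens: the state round-trips through memory losslessly at every yield point.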
* [PATCH v5 21/23] crypto: arm64/sha512-ce - yield NEON after every block of input
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-10 15:22 ` Ard Biesheuvel
1 sibling, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:22 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/sha512-ce-core.S | 27 +++++++++++++++-----
1 file changed, 21 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/crypto/sha512-ce-core.S b/arch/arm64/crypto/sha512-ce-core.S
index 7f3bca5c59a2..ce65e3abe4f2 100644
--- a/arch/arm64/crypto/sha512-ce-core.S
+++ b/arch/arm64/crypto/sha512-ce-core.S
@@ -107,17 +107,23 @@
*/
.text
ENTRY(sha512_ce_transform)
+ frame_push 3
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+
/* load state */
- ld1 {v8.2d-v11.2d}, [x0]
+0: ld1 {v8.2d-v11.2d}, [x19]
/* load first 4 round constants */
adr_l x3, .Lsha512_rcon
ld1 {v20.2d-v23.2d}, [x3], #64
/* load input */
-0: ld1 {v12.2d-v15.2d}, [x1], #64
- ld1 {v16.2d-v19.2d}, [x1], #64
- sub w2, w2, #1
+1: ld1 {v12.2d-v15.2d}, [x20], #64
+ ld1 {v16.2d-v19.2d}, [x20], #64
+ sub w21, w21, #1
CPU_LE( rev64 v12.16b, v12.16b )
CPU_LE( rev64 v13.16b, v13.16b )
@@ -196,9 +202,18 @@ CPU_LE( rev64 v19.16b, v19.16b )
add v11.2d, v11.2d, v3.2d
/* handled all input blocks? */
- cbnz w2, 0b
+ cbz w21, 3f
+
+ if_will_cond_yield_neon
+ st1 {v8.2d-v11.2d}, [x19]
+ do_cond_yield_neon
+ b 0b
+ endif_yield_neon
+
+ b 1b
/* store new state */
-3: st1 {v8.2d-v11.2d}, [x0]
+3: st1 {v8.2d-v11.2d}, [x19]
+ frame_pop
ret
ENDPROC(sha512_ce_transform)
--
2.15.1
* [PATCH v5 22/23] crypto: arm64/sm3-ce - yield NEON after every block of input
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-10 15:22 ` Ard Biesheuvel
1 sibling, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:22 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
Avoid excessive scheduling delays under a preemptible kernel by
conditionally yielding the NEON after every block of input.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/sm3-ce-core.S | 30 +++++++++++++++-----
1 file changed, 23 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/crypto/sm3-ce-core.S b/arch/arm64/crypto/sm3-ce-core.S
index 27169fe07a68..5a116c8d0cee 100644
--- a/arch/arm64/crypto/sm3-ce-core.S
+++ b/arch/arm64/crypto/sm3-ce-core.S
@@ -77,19 +77,25 @@
*/
.text
ENTRY(sm3_ce_transform)
+ frame_push 3
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+
/* load state */
- ld1 {v8.4s-v9.4s}, [x0]
+ ld1 {v8.4s-v9.4s}, [x19]
rev64 v8.4s, v8.4s
rev64 v9.4s, v9.4s
ext v8.16b, v8.16b, v8.16b, #8
ext v9.16b, v9.16b, v9.16b, #8
- adr_l x8, .Lt
+0: adr_l x8, .Lt
ldp s13, s14, [x8]
/* load input */
-0: ld1 {v0.16b-v3.16b}, [x1], #64
- sub w2, w2, #1
+1: ld1 {v0.16b-v3.16b}, [x20], #64
+ sub w21, w21, #1
mov v15.16b, v8.16b
mov v16.16b, v9.16b
@@ -125,14 +131,24 @@ CPU_LE( rev32 v3.16b, v3.16b )
eor v9.16b, v9.16b, v16.16b
/* handled all input blocks? */
- cbnz w2, 0b
+ cbz w21, 2f
+
+ if_will_cond_yield_neon
+ st1 {v8.4s-v9.4s}, [x19]
+ do_cond_yield_neon
+ ld1 {v8.4s-v9.4s}, [x19]
+ b 0b
+ endif_yield_neon
+
+ b 1b
/* save state */
- rev64 v8.4s, v8.4s
+2: rev64 v8.4s, v8.4s
rev64 v9.4s, v9.4s
ext v8.16b, v8.16b, v8.16b, #8
ext v9.16b, v9.16b, v9.16b, #8
- st1 {v8.4s-v9.4s}, [x0]
+ st1 {v8.4s-v9.4s}, [x19]
+ frame_pop
ret
ENDPROC(sm3_ce_transform)
--
2.15.1
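One detail worth noting in this hunk: unlike the sha512 version, whose `0:` label sits before the `ld1` that loads the digest (so `b 0b` after the yield reloads it implicitly), sm3's `0:` label falls after the state load, so the yield path reloads v8/v9 explicitly before branching back. Either way, the state only survives a yield because it round-trips through memory; the registers themselves may be clobbered. A minimal C model of that round trip, with illustrative names and sizes:

```c
#include <assert.h>
#include <string.h>

struct digest { unsigned int w[8]; };  /* models the v8.4s/v9.4s state */

static struct digest state_mem;        /* models the buffer at x19 */

/* Save the live "registers", let the yield clobber them, reload. */
static void yield_roundtrip(struct digest *regs)
{
    state_mem = *regs;                 /* st1 {v8.4s-v9.4s}, [x19]       */
    memset(regs, 0x55, sizeof(*regs)); /* NEON may be reused while yielded */
    *regs = state_mem;                 /* ld1 {v8.4s-v9.4s}, [x19]       */
}
```

After the round trip the digest words come back unchanged even though the "registers" were overwritten in between.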
* [PATCH v5 23/23] DO NOT MERGE
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-10 15:22 ` Ard Biesheuvel
1 sibling, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-10 15:22 UTC (permalink / raw)
To: linux-crypto
Cc: Mark Rutland, herbert, Ard Biesheuvel, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
Dave Martin, linux-arm-kernel, linux-rt-users
Test code to force a kernel_neon_end+begin sequence at every yield point,
and wipe the entire NEON state before resuming the algorithm.
---
arch/arm64/include/asm/assembler.h | 33 ++++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index 61168cbe9781..b471b0bbdfe6 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -678,6 +678,7 @@ alternative_else_nop_endif
cmp w1, #PREEMPT_DISABLE_OFFSET
csel x0, x0, xzr, eq
tbnz x0, #TIF_NEED_RESCHED, .Lyield_\@ // needs rescheduling?
+ b .Lyield_\@
#endif
/* fall through to endif_yield_neon */
.subsection 1
@@ -687,6 +688,38 @@ alternative_else_nop_endif
.macro do_cond_yield_neon
bl kernel_neon_end
bl kernel_neon_begin
+ movi v0.16b, #0x55
+ movi v1.16b, #0x55
+ movi v2.16b, #0x55
+ movi v3.16b, #0x55
+ movi v4.16b, #0x55
+ movi v5.16b, #0x55
+ movi v6.16b, #0x55
+ movi v7.16b, #0x55
+ movi v8.16b, #0x55
+ movi v9.16b, #0x55
+ movi v10.16b, #0x55
+ movi v11.16b, #0x55
+ movi v12.16b, #0x55
+ movi v13.16b, #0x55
+ movi v14.16b, #0x55
+ movi v15.16b, #0x55
+ movi v16.16b, #0x55
+ movi v17.16b, #0x55
+ movi v18.16b, #0x55
+ movi v19.16b, #0x55
+ movi v20.16b, #0x55
+ movi v21.16b, #0x55
+ movi v22.16b, #0x55
+ movi v23.16b, #0x55
+ movi v24.16b, #0x55
+ movi v25.16b, #0x55
+ movi v26.16b, #0x55
+ movi v27.16b, #0x55
+ movi v28.16b, #0x55
+ movi v29.16b, #0x55
+ movi v30.16b, #0x55
+ movi v31.16b, #0x55
.endm
.macro endif_yield_neon, lbl
--
2.15.1
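The two hunks above force the yield path to be taken unconditionally (the added `b .Lyield_\@`) and then poison all 32 vector registers with 0x55 once the NEON has been re-acquired. Any transform that silently relies on registers surviving a yield will then fail the crypto self-tests, while the patched transforms, which reload everything from memory, keep passing. A toy model, with a plain array standing in for the register file:

```c
#include <assert.h>
#include <string.h>

static unsigned char vregs[32][16];    /* stand-in for v0-v31 */

/* Model of the modified do_cond_yield_neon: after kernel_neon_end()/
 * kernel_neon_begin(), overwrite every register with 0x55 (the movi
 * sequence in the patch), simulating a worst-case clobber. */
static void do_cond_yield_neon_wipe(void)
{
    memset(vregs, 0x55, sizeof(vregs));
}
```

Running a transform against this model makes any hidden dependency on register retention across a yield show up immediately as corrupted output.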
* RE: [PATCH v5 00/23] crypto: arm64 - play nice with CONFIG_PREEMPT
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-11 5:16 ` Vakul Garg
1 sibling, 0 replies; 58+ messages in thread
From: Vakul Garg @ 2018-03-11 5:16 UTC (permalink / raw)
To: Ard Biesheuvel, linux-crypto
Cc: Mark Rutland, herbert, Peter Zijlstra, Catalin Marinas,
Sebastian Andrzej Siewior, Will Deacon, Russell King - ARM Linux,
Steven Rostedt, Thomas Gleixner, Dave Martin, linux-arm-kernel,
linux-rt-users
Hi
How does this patchset affect the throughput performance of crypto?
Is it expected to increase?
Regards
Vakul
> -----Original Message-----
> From: linux-crypto-owner@vger.kernel.org [mailto:linux-crypto-
> owner@vger.kernel.org] On Behalf Of Ard Biesheuvel
> Sent: Saturday, March 10, 2018 8:52 PM
> To: linux-crypto@vger.kernel.org
> Cc: herbert@gondor.apana.org.au; linux-arm-kernel@lists.infradead.org;
> Ard Biesheuvel <ard.biesheuvel@linaro.org>; Dave Martin
> <Dave.Martin@arm.com>; Russell King - ARM Linux
> <linux@armlinux.org.uk>; Sebastian Andrzej Siewior
> <bigeasy@linutronix.de>; Mark Rutland <mark.rutland@arm.com>; linux-rt-
> users@vger.kernel.org; Peter Zijlstra <peterz@infradead.org>; Catalin
> Marinas <catalin.marinas@arm.com>; Will Deacon
> <will.deacon@arm.com>; Steven Rostedt <rostedt@goodmis.org>; Thomas
> Gleixner <tglx@linutronix.de>
> Subject: [PATCH v5 00/23] crypto: arm64 - play nice with CONFIG_PREEMPT
>
> As reported by Sebastian, the way the arm64 NEON crypto code currently
> keeps kernel mode NEON enabled across calls into skcipher_walk_xxx() is
> causing problems with RT builds, given that the skcipher walk API may
> allocate and free temporary buffers it uses to present the input and output
> arrays to the crypto algorithm in blocksize sized chunks (where blocksize is
> the natural blocksize of the crypto algorithm), and doing so with NEON
> enabled means we're alloc/free'ing memory with preemption disabled.
>
> This was deliberate: when this code was introduced, each
> kernel_neon_begin() and kernel_neon_end() call incurred a fixed penalty of
> storing resp.
> loading the contents of all NEON registers to/from memory, and so doing it
> less often had an obvious performance benefit. However, in the mean time,
> we have refactored the core kernel mode NEON code, and now
> kernel_neon_begin() only incurs this penalty the first time it is called after
> entering the kernel, and the NEON register restore is deferred until returning
> to userland. This means pulling those calls into the loops that iterate over the
> input/output of the crypto algorithm is not a big deal anymore (although
> there are some places in the code where we relied on the NEON registers
> retaining their values between calls)
>
> So let's clean this up for arm64: update the NEON based skcipher drivers to
> no longer keep the NEON enabled when calling into the skcipher walk API.
>
> As pointed out by Peter, this only solves part of the problem. So let's tackle it
> more thoroughly, and update the algorithms to test the NEED_RESCHED flag
> each time after processing a fixed chunk of input.
>
> Given that this issue was flagged by the RT people, I would appreciate it if
> they could confirm whether they are happy with this approach.
>
> Changes since v4:
> - rebase onto v4.16-rc3
> - apply the same treatment to new SHA512, SHA-3 and SM3 code that landed
> in v4.16-rc1
>
> Changes since v3:
> - incorporate Dave's feedback on the asm macros to push/pop frames and to
> yield
> the NEON conditionally
> - make frame_push/pop easier to use, by recording the arguments to
> frame_push, removing the need to specify them again when calling
> frame_pop
> - emit local symbol .Lframe_local_offset to allow code using the frame
> push/pop
> macros to index the stack more easily
> - use the magic \@ macro invocation counter provided by GAS to generate
> unique
> labels in the NEON yield macros, rather than relying on chance
>
> Changes since v2:
> - Drop logic to yield only after so many blocks - as it turns out, the
> throughput of the algorithms that are most likely to be affected by the
> overhead (GHASH and AES-CE) only drops by ~1% (on Cortex-A57), and if
> that
> is unacceptable, you are probably not using CONFIG_PREEMPT in the first
> place.
> - Add yield support to the AES-CCM driver
> - Clean up macros based on feedback from Dave
> - Given that I had to add stack frame logic to many of these functions, factor
> it out and wrap it in a couple of macros
> - Merge the changes to the core asm driver and glue code of the
> GHASH/GCM
> driver. The latter was not correct without the former.
>
> Changes since v1:
> - add CRC-T10DIF test vector (#1)
> - stop using GFP_ATOMIC in scatterwalk API calls, now that they are
> executed
> with preemption enabled (#2 - #6)
> - do some preparatory refactoring on the AES block mode code (#7 - #9)
> - add yield patches (#10 - #18)
> - add test patch (#19) - DO NOT MERGE
>
> Cc: Dave Martin <Dave.Martin@arm.com>
> Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: linux-rt-users@vger.kernel.org
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
>
> Ard Biesheuvel (23):
> crypto: testmgr - add a new test case for CRC-T10DIF
> crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop
> crypto: arm64/aes-blk - move kernel mode neon en/disable into loop
> crypto: arm64/aes-bs - move kernel mode neon en/disable into loop
> crypto: arm64/chacha20 - move kernel mode neon en/disable into loop
> crypto: arm64/aes-blk - remove configurable interleave
> crypto: arm64/aes-blk - add 4 way interleave to CBC encrypt path
> crypto: arm64/aes-blk - add 4 way interleave to CBC-MAC encrypt path
> crypto: arm64/sha256-neon - play nice with CONFIG_PREEMPT kernels
> arm64: assembler: add utility macros to push/pop stack frames
> arm64: assembler: add macros to conditionally yield the NEON under
> PREEMPT
> crypto: arm64/sha1-ce - yield NEON after every block of input
> crypto: arm64/sha2-ce - yield NEON after every block of input
> crypto: arm64/aes-ccm - yield NEON after every block of input
> crypto: arm64/aes-blk - yield NEON after every block of input
> crypto: arm64/aes-bs - yield NEON after every block of input
> crypto: arm64/aes-ghash - yield NEON after every block of input
> crypto: arm64/crc32-ce - yield NEON after every block of input
> crypto: arm64/crct10dif-ce - yield NEON after every block of input
> crypto: arm64/sha3-ce - yield NEON after every block of input
> crypto: arm64/sha512-ce - yield NEON after every block of input
> crypto: arm64/sm3-ce - yield NEON after every block of input
> DO NOT MERGE
>
> arch/arm64/crypto/Makefile | 3 -
> arch/arm64/crypto/aes-ce-ccm-core.S | 150 ++++--
> arch/arm64/crypto/aes-ce-ccm-glue.c | 47 +-
> arch/arm64/crypto/aes-ce.S | 15 +-
> arch/arm64/crypto/aes-glue.c | 95 ++--
> arch/arm64/crypto/aes-modes.S | 562 +++++++++-----------
> arch/arm64/crypto/aes-neonbs-core.S | 305 ++++++-----
> arch/arm64/crypto/aes-neonbs-glue.c | 48 +-
> arch/arm64/crypto/chacha20-neon-glue.c | 12 +-
> arch/arm64/crypto/crc32-ce-core.S | 40 +-
> arch/arm64/crypto/crct10dif-ce-core.S | 32 +-
> arch/arm64/crypto/ghash-ce-core.S | 113 ++--
> arch/arm64/crypto/ghash-ce-glue.c | 28 +-
> arch/arm64/crypto/sha1-ce-core.S | 42 +-
> arch/arm64/crypto/sha2-ce-core.S | 37 +-
> arch/arm64/crypto/sha256-glue.c | 36 +-
> arch/arm64/crypto/sha3-ce-core.S | 77 ++-
> arch/arm64/crypto/sha512-ce-core.S | 27 +-
> arch/arm64/crypto/sm3-ce-core.S | 30 +-
> arch/arm64/include/asm/assembler.h | 167 ++++++
> arch/arm64/kernel/asm-offsets.c | 2 +
> crypto/testmgr.h | 259 +++++++++
> 22 files changed, 1392 insertions(+), 735 deletions(-)
>
> --
> 2.15.1
* Re: [PATCH v5 00/23] crypto: arm64 - play nice with CONFIG_PREEMPT
2018-03-11 5:16 ` Vakul Garg
@ 2018-03-11 8:55 ` Ard Biesheuvel
1 sibling, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-11 8:55 UTC (permalink / raw)
To: Vakul Garg
Cc: Mark Rutland, herbert, Peter Zijlstra, Catalin Marinas,
Sebastian Andrzej Siewior, Will Deacon, Russell King - ARM Linux,
Steven Rostedt, linux-crypto, Thomas Gleixner, Dave Martin,
linux-arm-kernel, linux-rt-users
On 11 March 2018 at 05:16, Vakul Garg <vakul.garg@nxp.com> wrote:
> Hi
>
> How does this patchset affect the throughput performance of crypto?
> Is it expected to increase?
>
This is about latency, not throughput. The throughput may decrease
slightly (<1%), but spikes in scheduling latency due to NEON-based
crypto should be a thing of the past.
Note that if you require maximum throughput without regard for
scheduling latency, you should disable CONFIG_PREEMPT in your kernel,
in which case these patches do absolutely nothing.
>> -----Original Message-----
>> From: linux-crypto-owner@vger.kernel.org [mailto:linux-crypto-
>> owner@vger.kernel.org] On Behalf Of Ard Biesheuvel
>> Sent: Saturday, March 10, 2018 8:52 PM
>> To: linux-crypto@vger.kernel.org
>> Cc: herbert@gondor.apana.org.au; linux-arm-kernel@lists.infradead.org;
>> Ard Biesheuvel <ard.biesheuvel@linaro.org>; Dave Martin
>> <Dave.Martin@arm.com>; Russell King - ARM Linux
>> <linux@armlinux.org.uk>; Sebastian Andrzej Siewior
>> <bigeasy@linutronix.de>; Mark Rutland <mark.rutland@arm.com>; linux-rt-
>> users@vger.kernel.org; Peter Zijlstra <peterz@infradead.org>; Catalin
>> Marinas <catalin.marinas@arm.com>; Will Deacon
>> <will.deacon@arm.com>; Steven Rostedt <rostedt@goodmis.org>; Thomas
>> Gleixner <tglx@linutronix.de>
>> Subject: [PATCH v5 00/23] crypto: arm64 - play nice with CONFIG_PREEMPT
>>
>> As reported by Sebastian, the way the arm64 NEON crypto code currently
>> keeps kernel mode NEON enabled across calls into skcipher_walk_xxx() is
>> causing problems with RT builds, given that the skcipher walk API may
>> allocate and free temporary buffers it uses to present the input and output
>> arrays to the crypto algorithm in blocksize sized chunks (where blocksize is
>> the natural blocksize of the crypto algorithm), and doing so with NEON
>> enabled means we're alloc/free'ing memory with preemption disabled.
>>
>> This was deliberate: when this code was introduced, each
>> kernel_neon_begin() and kernel_neon_end() call incurred a fixed penalty of
>> storing resp.
>> loading the contents of all NEON registers to/from memory, and so doing it
>> less often had an obvious performance benefit. However, in the meantime,
>> we have refactored the core kernel mode NEON code, and now
>> kernel_neon_begin() only incurs this penalty the first time it is called after
>> entering the kernel, and the NEON register restore is deferred until returning
>> to userland. This means pulling those calls into the loops that iterate over the
>> input/output of the crypto algorithm is not a big deal anymore (although
>> there are some places in the code where we relied on the NEON registers
>> retaining their values between calls)
>>
>> So let's clean this up for arm64: update the NEON based skcipher drivers to
>> no longer keep the NEON enabled when calling into the skcipher walk API.
>>
>> As pointed out by Peter, this only solves part of the problem. So let's tackle it
>> more thoroughly, and update the algorithms to test the NEED_RESCHED flag
>> each time after processing a fixed chunk of input.
>>
>> Given that this issue was flagged by the RT people, I would appreciate it if
>> they could confirm whether they are happy with this approach.
>>
>> Changes since v4:
>> - rebase onto v4.16-rc3
>> - apply the same treatment to new SHA512, SHA-3 and SM3 code that landed
>> in v4.16-rc1
>>
>> Changes since v3:
>> - incorporate Dave's feedback on the asm macros to push/pop frames and to
>> yield
>> the NEON conditionally
>> - make frame_push/pop easier to use, by recording the arguments to
>> frame_push, removing the need to specify them again when calling
>> frame_pop
>> - emit local symbol .Lframe_local_offset to allow code using the frame
>> push/pop
>> macros to index the stack more easily
>> - use the magic \@ macro invocation counter provided by GAS to generate
>> unique
>> labels in the NEON yield macros, rather than relying on chance
>>
>> Changes since v2:
>> - Drop logic to yield only after so many blocks - as it turns out, the
>> throughput of the algorithms that are most likely to be affected by the
>> overhead (GHASH and AES-CE) only drops by ~1% (on Cortex-A57), and if
>> that
>> is unacceptable, you are probably not using CONFIG_PREEMPT in the first
>> place.
>> - Add yield support to the AES-CCM driver
>> - Clean up macros based on feedback from Dave
>> - Given that I had to add stack frame logic to many of these functions, factor
>> it out and wrap it in a couple of macros
>> - Merge the changes to the core asm driver and glue code of the
>> GHASH/GCM
>> driver. The latter was not correct without the former.
>>
>> Changes since v1:
>> - add CRC-T10DIF test vector (#1)
>> - stop using GFP_ATOMIC in scatterwalk API calls, now that they are
>> executed
>> with preemption enabled (#2 - #6)
>> - do some preparatory refactoring on the AES block mode code (#7 - #9)
>> - add yield patches (#10 - #18)
>> - add test patch (#19) - DO NOT MERGE
>>
>> Cc: Dave Martin <Dave.Martin@arm.com>
>> Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
>> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>> Cc: Mark Rutland <mark.rutland@arm.com>
>> Cc: linux-rt-users@vger.kernel.org
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Will Deacon <will.deacon@arm.com>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>
>> Ard Biesheuvel (23):
>> crypto: testmgr - add a new test case for CRC-T10DIF
>> crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop
>> crypto: arm64/aes-blk - move kernel mode neon en/disable into loop
>> crypto: arm64/aes-bs - move kernel mode neon en/disable into loop
>> crypto: arm64/chacha20 - move kernel mode neon en/disable into loop
>> crypto: arm64/aes-blk - remove configurable interleave
>> crypto: arm64/aes-blk - add 4 way interleave to CBC encrypt path
>> crypto: arm64/aes-blk - add 4 way interleave to CBC-MAC encrypt path
>> crypto: arm64/sha256-neon - play nice with CONFIG_PREEMPT kernels
>> arm64: assembler: add utility macros to push/pop stack frames
>> arm64: assembler: add macros to conditionally yield the NEON under
>> PREEMPT
>> crypto: arm64/sha1-ce - yield NEON after every block of input
>> crypto: arm64/sha2-ce - yield NEON after every block of input
>> crypto: arm64/aes-ccm - yield NEON after every block of input
>> crypto: arm64/aes-blk - yield NEON after every block of input
>> crypto: arm64/aes-bs - yield NEON after every block of input
>> crypto: arm64/aes-ghash - yield NEON after every block of input
>> crypto: arm64/crc32-ce - yield NEON after every block of input
>> crypto: arm64/crct10dif-ce - yield NEON after every block of input
>> crypto: arm64/sha3-ce - yield NEON after every block of input
>> crypto: arm64/sha512-ce - yield NEON after every block of input
>> crypto: arm64/sm3-ce - yield NEON after every block of input
>> DO NOT MERGE
>>
>> arch/arm64/crypto/Makefile | 3 -
>> arch/arm64/crypto/aes-ce-ccm-core.S | 150 ++++--
>> arch/arm64/crypto/aes-ce-ccm-glue.c | 47 +-
>> arch/arm64/crypto/aes-ce.S | 15 +-
>> arch/arm64/crypto/aes-glue.c | 95 ++--
>> arch/arm64/crypto/aes-modes.S | 562 +++++++++-----------
>> arch/arm64/crypto/aes-neonbs-core.S | 305 ++++++-----
>> arch/arm64/crypto/aes-neonbs-glue.c | 48 +-
>> arch/arm64/crypto/chacha20-neon-glue.c | 12 +-
>> arch/arm64/crypto/crc32-ce-core.S | 40 +-
>> arch/arm64/crypto/crct10dif-ce-core.S | 32 +-
>> arch/arm64/crypto/ghash-ce-core.S | 113 ++--
>> arch/arm64/crypto/ghash-ce-glue.c | 28 +-
>> arch/arm64/crypto/sha1-ce-core.S | 42 +-
>> arch/arm64/crypto/sha2-ce-core.S | 37 +-
>> arch/arm64/crypto/sha256-glue.c | 36 +-
>> arch/arm64/crypto/sha3-ce-core.S | 77 ++-
>> arch/arm64/crypto/sha512-ce-core.S | 27 +-
>> arch/arm64/crypto/sm3-ce-core.S | 30 +-
>> arch/arm64/include/asm/assembler.h | 167 ++++++
>> arch/arm64/kernel/asm-offsets.c | 2 +
>> crypto/testmgr.h | 259 +++++++++
>> 22 files changed, 1392 insertions(+), 735 deletions(-)
>>
>> --
>> 2.15.1
>
* Re: [PATCH v5 00/23] crypto: arm64 - play nice with CONFIG_PREEMPT
2018-03-10 15:21 ` Ard Biesheuvel
@ 2018-03-16 15:57 ` Herbert Xu
-1 siblings, 0 replies; 58+ messages in thread
From: Herbert Xu @ 2018-03-16 15:57 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Sebastian Andrzej Siewior, Will Deacon, Russell King - ARM Linux,
Steven Rostedt, linux-crypto, Thomas Gleixner, Dave Martin,
linux-arm-kernel
On Sat, Mar 10, 2018 at 03:21:45PM +0000, Ard Biesheuvel wrote:
> As reported by Sebastian, the way the arm64 NEON crypto code currently
> keeps kernel mode NEON enabled across calls into skcipher_walk_xxx() is
> causing problems with RT builds, given that the skcipher walk API may
> allocate and free temporary buffers it uses to present the input and
> output arrays to the crypto algorithm in blocksize sized chunks (where
> blocksize is the natural blocksize of the crypto algorithm), and doing
> so with NEON enabled means we're alloc/free'ing memory with preemption
> disabled.
>
> This was deliberate: when this code was introduced, each kernel_neon_begin()
> and kernel_neon_end() call incurred a fixed penalty of storing resp.
> loading the contents of all NEON registers to/from memory, and so doing
> it less often had an obvious performance benefit. However, in the meantime,
> we have refactored the core kernel mode NEON code, and now kernel_neon_begin()
> only incurs this penalty the first time it is called after entering the kernel,
> and the NEON register restore is deferred until returning to userland. This
> means pulling those calls into the loops that iterate over the input/output
> of the crypto algorithm is not a big deal anymore (although there are some
> places in the code where we relied on the NEON registers retaining their
> values between calls)
>
> So let's clean this up for arm64: update the NEON based skcipher drivers to
> no longer keep the NEON enabled when calling into the skcipher walk API.
>
> As pointed out by Peter, this only solves part of the problem. So let's
> tackle it more thoroughly, and update the algorithms to test the NEED_RESCHED
> flag each time after processing a fixed chunk of input.
>
> Given that this issue was flagged by the RT people, I would appreciate it
> if they could confirm whether they are happy with this approach.
>
> Changes since v4:
> - rebase onto v4.16-rc3
> - apply the same treatment to new SHA512, SHA-3 and SM3 code that landed
> in v4.16-rc1
Looks good to me. If more work is needed we can always do
incremental fixes.
Patches 1-22 applied. Thanks.
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH v5 00/23] crypto: arm64 - play nice with CONFIG_PREEMPT
2018-03-16 15:57 ` Herbert Xu
@ 2018-03-19 15:31 ` Ard Biesheuvel
-1 siblings, 0 replies; 58+ messages in thread
From: Ard Biesheuvel @ 2018-03-19 15:31 UTC (permalink / raw)
To: Herbert Xu, Will Deacon
Cc: Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Sebastian Andrzej Siewior, Russell King - ARM Linux,
Steven Rostedt, open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
Thomas Gleixner, Dave Martin, linux-arm-kernel
On 16 March 2018 at 23:57, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Sat, Mar 10, 2018 at 03:21:45PM +0000, Ard Biesheuvel wrote:
>> As reported by Sebastian, the way the arm64 NEON crypto code currently
>> keeps kernel mode NEON enabled across calls into skcipher_walk_xxx() is
>> causing problems with RT builds, given that the skcipher walk API may
>> allocate and free temporary buffers it uses to present the input and
>> output arrays to the crypto algorithm in blocksize sized chunks (where
>> blocksize is the natural blocksize of the crypto algorithm), and doing
>> so with NEON enabled means we're alloc/free'ing memory with preemption
>> disabled.
>>
>> This was deliberate: when this code was introduced, each kernel_neon_begin()
>> and kernel_neon_end() call incurred a fixed penalty of storing resp.
>> loading the contents of all NEON registers to/from memory, and so doing
>> it less often had an obvious performance benefit. However, in the meantime,
>> we have refactored the core kernel mode NEON code, and now kernel_neon_begin()
>> only incurs this penalty the first time it is called after entering the kernel,
>> and the NEON register restore is deferred until returning to userland. This
>> means pulling those calls into the loops that iterate over the input/output
>> of the crypto algorithm is not a big deal anymore (although there are some
>> places in the code where we relied on the NEON registers retaining their
>> values between calls)
>>
>> So let's clean this up for arm64: update the NEON based skcipher drivers to
>> no longer keep the NEON enabled when calling into the skcipher walk API.
>>
>> As pointed out by Peter, this only solves part of the problem. So let's
>> tackle it more thoroughly, and update the algorithms to test the NEED_RESCHED
>> flag each time after processing a fixed chunk of input.
>>
>> Given that this issue was flagged by the RT people, I would appreciate it
>> if they could confirm whether they are happy with this approach.
>>
>> Changes since v4:
>> - rebase onto v4.16-rc3
>> - apply the same treatment to new SHA512, SHA-3 and SM3 code that landed
>> in v4.16-rc1
>
> Looks good to me. If more work is needed we can always do
> incremental fixes.
>
> Patches 1-22 applied. Thanks.
Thanks Herbert.
Apologies if this wasn't clear, but there are some cross dependencies
with the arm64 tree, which receives non-trivial modifications in
patches 10 and 11, which are subsequently depended upon by patches 12
- 23.
Without acks from them, we should really not be merging this code yet,
especially because I noticed a rebase issue in patch #10 (my bad).
Would you mind reverting 10 - 22? I will revisit this asap, and try to
get acks for the arm64 patches. If that means waiting for the next
cycle, so be it.
Thanks,
Ard.
* Re: [PATCH v5 00/23] crypto: arm64 - play nice with CONFIG_PREEMPT
2018-03-19 15:31 ` Ard Biesheuvel
@ 2018-03-19 23:36 ` Herbert Xu
-1 siblings, 0 replies; 58+ messages in thread
From: Herbert Xu @ 2018-03-19 23:36 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Sebastian Andrzej Siewior, Will Deacon, Russell King - ARM Linux,
Steven Rostedt, open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
Thomas Gleixner, Dave Martin, linux-arm-kernel
On Mon, Mar 19, 2018 at 11:31:24PM +0800, Ard Biesheuvel wrote:
>
> Apologies if this wasn't clear, but there are some cross dependencies
> with the arm64 tree, which receives non-trivial modifications in
> patches 10 and 11, which are subsequently depended upon by patches 12
> - 23.
>
> Without acks from them, we should really not be merging this code yet,
> especially because I noticed a rebase issue in patch #10 (my bad).
>
> Would you mind reverting 10 - 22? I will revisit this asap, and try to
> get acks for the arm64 patches. If that means waiting for the next
> cycle, so be it.
Sure it's done now.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
end of thread, other threads:[~2018-03-19 23:36 UTC | newest]
Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-10 15:21 [PATCH v5 00/23] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
2018-03-10 15:21 ` [PATCH v5 01/23] crypto: testmgr - add a new test case for CRC-T10DIF Ard Biesheuvel
2018-03-10 15:21 ` [PATCH v5 02/23] crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop Ard Biesheuvel
2018-03-10 15:21 ` [PATCH v5 03/23] crypto: arm64/aes-blk " Ard Biesheuvel
2018-03-10 15:21 ` [PATCH v5 04/23] crypto: arm64/aes-bs " Ard Biesheuvel
2018-03-10 15:21 ` [PATCH v5 05/23] crypto: arm64/chacha20 " Ard Biesheuvel
2018-03-10 15:21 ` [PATCH v5 06/23] crypto: arm64/aes-blk - remove configurable interleave Ard Biesheuvel
2018-03-10 15:21 ` [PATCH v5 07/23] crypto: arm64/aes-blk - add 4 way interleave to CBC encrypt path Ard Biesheuvel
2018-03-10 15:21 ` [PATCH v5 08/23] crypto: arm64/aes-blk - add 4 way interleave to CBC-MAC " Ard Biesheuvel
2018-03-10 15:21 ` [PATCH v5 09/23] crypto: arm64/sha256-neon - play nice with CONFIG_PREEMPT kernels Ard Biesheuvel
2018-03-10 15:21 ` [PATCH v5 10/23] arm64: assembler: add utility macros to push/pop stack frames Ard Biesheuvel
2018-03-10 15:21 ` [PATCH v5 11/23] arm64: assembler: add macros to conditionally yield the NEON under PREEMPT Ard Biesheuvel
2018-03-10 15:21 ` [PATCH v5 12/23] crypto: arm64/sha1-ce - yield NEON after every block of input Ard Biesheuvel
2018-03-10 15:21 ` [PATCH v5 13/23] crypto: arm64/sha2-ce " Ard Biesheuvel
2018-03-10 15:21 ` [PATCH v5 14/23] crypto: arm64/aes-ccm " Ard Biesheuvel
2018-03-10 15:22 ` [PATCH v5 15/23] crypto: arm64/aes-blk " Ard Biesheuvel
2018-03-10 15:22 ` [PATCH v5 16/23] crypto: arm64/aes-bs " Ard Biesheuvel
2018-03-10 15:22 ` [PATCH v5 17/23] crypto: arm64/aes-ghash " Ard Biesheuvel
2018-03-10 15:22 ` [PATCH v5 18/23] crypto: arm64/crc32-ce " Ard Biesheuvel
2018-03-10 15:22 ` [PATCH v5 19/23] crypto: arm64/crct10dif-ce " Ard Biesheuvel
2018-03-10 15:22 ` [PATCH v5 20/23] crypto: arm64/sha3-ce " Ard Biesheuvel
2018-03-10 15:22 ` [PATCH v5 21/23] crypto: arm64/sha512-ce " Ard Biesheuvel
2018-03-10 15:22 ` [PATCH v5 22/23] crypto: arm64/sm3-ce " Ard Biesheuvel
2018-03-10 15:22 ` [PATCH v5 23/23] DO NOT MERGE Ard Biesheuvel
2018-03-11  5:16 ` [PATCH v5 00/23] crypto: arm64 - play nice with CONFIG_PREEMPT Vakul Garg
2018-03-11  8:55 ` Ard Biesheuvel
2018-03-16 15:57 ` Herbert Xu
2018-03-19 15:31 ` Ard Biesheuvel
2018-03-19 23:36 ` Herbert Xu