* [PATCH v3 00/20] crypto: arm64 - play nice with CONFIG_PREEMPT
@ 2017-12-06 19:43 ` Ard Biesheuvel
0 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 19:43 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
This is the second followup to 'crypto: arm64 - disable NEON across scatterwalk
API calls', which was sent out last Friday.
As reported by Sebastian, the way the arm64 NEON crypto code currently
keeps kernel mode NEON enabled across calls into skcipher_walk_xxx() is
causing problems with RT builds: the skcipher walk API may allocate and
free temporary buffers that it uses to present the input and output
arrays to the crypto algorithm in blocksize-sized chunks (where blocksize
is the natural block size of the crypto algorithm), and doing so with
NEON enabled means we are allocating and freeing memory with preemption
disabled.
This was deliberate: when this code was introduced, each kernel_neon_begin()
and kernel_neon_end() call incurred a fixed penalty of storing or loading,
respectively, the contents of all NEON registers to/from memory, so doing
it less often had an obvious performance benefit. However, in the meantime,
we have refactored the core kernel mode NEON code, and now kernel_neon_begin()
only incurs this penalty the first time it is called after entering the kernel,
while the NEON register restore is deferred until returning to userland. This
means pulling those calls into the loops that iterate over the input/output
of the crypto algorithm is no longer a big deal (although there are some
places in the code where we relied on the NEON registers retaining their
values between calls).
So let's clean this up for arm64: update the NEON-based skcipher drivers to
no longer keep the NEON enabled when calling into the skcipher walk API.
As pointed out by Peter, this only solves part of the problem. So let's
tackle it more thoroughly, and update the algorithms to test the NEED_RESCHED
flag after processing each fixed-size chunk of input.
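To make the intended structure concrete, here is a rough C sketch of what a
typical skcipher walk loop looks like once the begin/end calls are moved
inside it. This is not code from any of the patches, and neon_do_blocks() is
just a made-up stand-in for the per-algorithm asm routine; the point is that
the walk's buffer management (and any rescheduling) now happens with
preemption enabled:

  static int blk_encrypt_sketch(struct skcipher_request *req)
  {
          struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
          struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
          struct skcipher_walk walk;
          int err;

          /* may sleep and allocate: the NEON is no longer held here */
          err = skcipher_walk_virt(&walk, req, false);

          while (walk.nbytes > 0) {
                  unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;

                  kernel_neon_begin();            /* preemption disabled */
                  neon_do_blocks(walk.dst.virt.addr, walk.src.virt.addr,
                                 ctx->key_enc, blocks, walk.iv);
                  kernel_neon_end();              /* preemption enabled again */

                  /* may free/allocate buffers, and may reschedule */
                  err = skcipher_walk_done(&walk,
                                           walk.nbytes % AES_BLOCK_SIZE);
          }
          return err;
  }

The remaining gap, asm routines that run for a long time without returning to
C, is what the conditional yield macros later in the series address: the asm
checks the NEED_RESCHED flag after each block and briefly drops the NEON so
that a reschedule can take place.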
Changes since v2:
- Drop logic to yield only after so many blocks - as it turns out, the
  throughput of the algorithms that are most likely to be affected by the
  overhead (GHASH and AES-CE) only drops by ~1% (on Cortex-A57), and if that
  is unacceptable, you are probably not using CONFIG_PREEMPT in the first
  place. (Speed comparison at the end of this cover letter)
- Add yield support to the AES-CCM driver
- Clean up macros based on feedback from Dave
- Given that I had to add stack frame logic to many of these functions, factor
  it out and wrap it in a couple of macros
- Merge the changes to the core asm driver and glue code of the GHASH/GCM
  driver. The latter was not correct without the former.
Changes since v1:
- add CRC-T10DIF test vector (#1)
- stop using GFP_ATOMIC in scatterwalk API calls, now that they are executed
  with preemption enabled (#2 - #6)
- do some preparatory refactoring on the AES block mode code (#7 - #9)
- add yield patches (#10 - #18)
- add test patch (#19) - DO NOT MERGE
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-rt-users@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Ard Biesheuvel (20):
crypto: testmgr - add a new test case for CRC-T10DIF
crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop
crypto: arm64/aes-blk - move kernel mode neon en/disable into loop
crypto: arm64/aes-bs - move kernel mode neon en/disable into loop
crypto: arm64/chacha20 - move kernel mode neon en/disable into loop
crypto: arm64/aes-blk - remove configurable interleave
crypto: arm64/aes-blk - add 4 way interleave to CBC encrypt path
crypto: arm64/aes-blk - add 4 way interleave to CBC-MAC encrypt path
crypto: arm64/sha256-neon - play nice with CONFIG_PREEMPT kernels
arm64: assembler: add utility macros to push/pop stack frames
arm64: assembler: add macros to conditionally yield the NEON under
PREEMPT
crypto: arm64/sha1-ce - yield NEON after every block of input
crypto: arm64/sha2-ce - yield NEON after every block of input
crypto: arm64/aes-ccm - yield NEON after every block of input
crypto: arm64/aes-blk - yield NEON after every block of input
crypto: arm64/aes-bs - yield NEON after every block of input
crypto: arm64/aes-ghash - yield NEON after every block of input
crypto: arm64/crc32-ce - yield NEON after every block of input
crypto: arm64/crct10dif-ce - yield NEON after every block of input
DO NOT MERGE
arch/arm64/crypto/Makefile | 3 -
arch/arm64/crypto/aes-ce-ccm-core.S | 150 ++++--
arch/arm64/crypto/aes-ce-ccm-glue.c | 47 +-
arch/arm64/crypto/aes-ce.S | 15 +-
arch/arm64/crypto/aes-glue.c | 95 ++--
arch/arm64/crypto/aes-modes.S | 562 +++++++++-----------
arch/arm64/crypto/aes-neonbs-core.S | 305 ++++++-----
arch/arm64/crypto/aes-neonbs-glue.c | 48 +-
arch/arm64/crypto/chacha20-neon-glue.c | 12 +-
arch/arm64/crypto/crc32-ce-core.S | 44 +-
arch/arm64/crypto/crct10dif-ce-core.S | 32 +-
arch/arm64/crypto/ghash-ce-core.S | 113 ++--
arch/arm64/crypto/ghash-ce-glue.c | 28 +-
arch/arm64/crypto/sha1-ce-core.S | 42 +-
arch/arm64/crypto/sha2-ce-core.S | 37 +-
arch/arm64/crypto/sha256-glue.c | 36 +-
arch/arm64/include/asm/assembler.h | 144 +++++
crypto/testmgr.h | 259 +++++++++
18 files changed, 1275 insertions(+), 697 deletions(-)
--
2.11.0
BEFORE
======
testing speed of async ctr(aes) (ctr-aes-ce) encryption
tcrypt: test 0 (128 bit key, 16 byte blocks): 5891675 operations in 1 seconds ( 94266800 bytes)
tcrypt: test 1 (128 bit key, 64 byte blocks): 5169493 operations in 1 seconds ( 330847552 bytes)
tcrypt: test 2 (128 bit key, 256 byte blocks): 3430554 operations in 1 seconds ( 878221824 bytes)
tcrypt: test 3 (128 bit key, 1024 byte blocks): 1433293 operations in 1 seconds (1467692032 bytes)
tcrypt: test 4 (128 bit key, 8192 byte blocks): 214314 operations in 1 seconds (1755660288 bytes)
tcrypt: test 5 (192 bit key, 16 byte blocks): 5845561 operations in 1 seconds ( 93528976 bytes)
tcrypt: test 6 (192 bit key, 64 byte blocks): 5051812 operations in 1 seconds ( 323315968 bytes)
tcrypt: test 7 (192 bit key, 256 byte blocks): 3135307 operations in 1 seconds ( 802638592 bytes)
tcrypt: test 8 (192 bit key, 1024 byte blocks): 1308804 operations in 1 seconds (1340215296 bytes)
tcrypt: test 9 (192 bit key, 8192 byte blocks): 174947 operations in 1 seconds (1433165824 bytes)
tcrypt: test 10 (256 bit key, 16 byte blocks): 5711495 operations in 1 seconds ( 91383920 bytes)
tcrypt: test 11 (256 bit key, 64 byte blocks): 4931516 operations in 1 seconds ( 315617024 bytes)
tcrypt: test 12 (256 bit key, 256 byte blocks): 3057619 operations in 1 seconds ( 782750464 bytes)
tcrypt: test 13 (256 bit key, 1024 byte blocks): 1205799 operations in 1 seconds (1234738176 bytes)
tcrypt: test 14 (256 bit key, 8192 byte blocks): 174553 operations in 1 seconds (1429938176 bytes)
testing speed of async ghash (ghash-ce)
tcrypt: test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 6043898 opers/sec, 96702368 bytes/sec
tcrypt: test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 1654308 opers/sec, 105875712 bytes/sec
tcrypt: test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 4610615 opers/sec, 295079360 bytes/sec
tcrypt: test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 440479 opers/sec, 112762624 bytes/sec
tcrypt: test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 1225272 opers/sec, 313669632 bytes/sec
tcrypt: test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 2282970 opers/sec, 584440320 bytes/sec
tcrypt: test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 111741 opers/sec, 114422784 bytes/sec
tcrypt: test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 590457 opers/sec, 604627968 bytes/sec
tcrypt: test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 781719 opers/sec, 800480256 bytes/sec
tcrypt: test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 56889 opers/sec, 116508672 bytes/sec
tcrypt: test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 301876 opers/sec, 618242048 bytes/sec
tcrypt: test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 392222 opers/sec, 803270656 bytes/sec
tcrypt: test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 417255 opers/sec, 854538240 bytes/sec
tcrypt: test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 28383 opers/sec, 116256768 bytes/sec
tcrypt: test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 152114 opers/sec, 623058944 bytes/sec
tcrypt: test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 197840 opers/sec, 810352640 bytes/sec
tcrypt: test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 214064 opers/sec, 876806144 bytes/sec
tcrypt: test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 14173 opers/sec, 116105216 bytes/sec
tcrypt: test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 76121 opers/sec, 623583232 bytes/sec
tcrypt: test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 99424 opers/sec, 814481408 bytes/sec
tcrypt: test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 107896 opers/sec, 883884032 bytes/sec
tcrypt: test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 107200 opers/sec, 878182400 bytes/sec
AFTER
=====
testing speed of async ctr(aes) (ctr-aes-ce) encryption
tcrypt: test 0 (128 bit key, 16 byte blocks): 5991064 operations in 1 seconds ( 95857024 bytes)
tcrypt: test 1 (128 bit key, 64 byte blocks): 5146397 operations in 1 seconds ( 329369408 bytes)
tcrypt: test 2 (128 bit key, 256 byte blocks): 3398949 operations in 1 seconds ( 870130944 bytes)
tcrypt: test 3 (128 bit key, 1024 byte blocks): 1423337 operations in 1 seconds (1457497088 bytes)
tcrypt: test 4 (128 bit key, 8192 byte blocks): 212705 operations in 1 seconds (1742479360 bytes)
tcrypt: test 5 (192 bit key, 16 byte blocks): 5859040 operations in 1 seconds ( 93744640 bytes)
tcrypt: test 6 (192 bit key, 64 byte blocks): 5043498 operations in 1 seconds ( 322783872 bytes)
tcrypt: test 7 (192 bit key, 256 byte blocks): 3117600 operations in 1 seconds ( 798105600 bytes)
tcrypt: test 8 (192 bit key, 1024 byte blocks): 1297050 operations in 1 seconds (1328179200 bytes)
tcrypt: test 9 (192 bit key, 8192 byte blocks): 174041 operations in 1 seconds (1425743872 bytes)
tcrypt: test 10 (256 bit key, 16 byte blocks): 5722483 operations in 1 seconds ( 91559728 bytes)
tcrypt: test 11 (256 bit key, 64 byte blocks): 4908481 operations in 1 seconds ( 314142784 bytes)
tcrypt: test 12 (256 bit key, 256 byte blocks): 2969432 operations in 1 seconds ( 760174592 bytes)
tcrypt: test 13 (256 bit key, 1024 byte blocks): 1196411 operations in 1 seconds (1225124864 bytes)
tcrypt: test 14 (256 bit key, 8192 byte blocks): 173121 operations in 1 seconds (1418207232 bytes)
testing speed of async ghash (ghash-ce)
tcrypt: test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 5756550 opers/sec, 92104800 bytes/sec
tcrypt: test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 1652111 opers/sec, 105735104 bytes/sec
tcrypt: test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 4471887 opers/sec, 286200768 bytes/sec
tcrypt: test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 437829 opers/sec, 112084224 bytes/sec
tcrypt: test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 1223258 opers/sec, 313154048 bytes/sec
tcrypt: test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 2274306 opers/sec, 582222336 bytes/sec
tcrypt: test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 111543 opers/sec, 114220032 bytes/sec
tcrypt: test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 589121 opers/sec, 603259904 bytes/sec
tcrypt: test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 768426 opers/sec, 786868224 bytes/sec
tcrypt: test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 55944 opers/sec, 114573312 bytes/sec
tcrypt: test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 299002 opers/sec, 612356096 bytes/sec
tcrypt: test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 387658 opers/sec, 793923584 bytes/sec
tcrypt: test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 410061 opers/sec, 839804928 bytes/sec
tcrypt: test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 28007 opers/sec, 114716672 bytes/sec
tcrypt: test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 150661 opers/sec, 617107456 bytes/sec
tcrypt: test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 195701 opers/sec, 801591296 bytes/sec
tcrypt: test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 211312 opers/sec, 865533952 bytes/sec
tcrypt: test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 14017 opers/sec, 114827264 bytes/sec
tcrypt: test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 75569 opers/sec, 619061248 bytes/sec
tcrypt: test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 98301 opers/sec, 805281792 bytes/sec
tcrypt: test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 106329 opers/sec, 871047168 bytes/sec
tcrypt: test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 106061 opers/sec, 868851712 bytes/sec
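To put the ~1% figure from the v2 changelog in concrete terms using the
largest buffer sizes above: ctr-aes-ce with a 128 bit key on 8192 byte blocks
drops from 1755660288 to 1742479360 bytes/sec (about 0.75%), and ghash-ce with
8192 byte updates drops from 878182400 to 868851712 bytes/sec (about 1.06%).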
* [PATCH v3 01/20] crypto: testmgr - add a new test case for CRC-T10DIF
2017-12-06 19:43 ` Ard Biesheuvel
@ 2017-12-06 19:43 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 19:43 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
In order to be able to test yield support under preempt, add a test
vector for CRC-T10DIF that is long enough to require multiple iterations
of the primary loop of the accelerated x86 and arm64 implementations,
and thus allow preemption to occur between those iterations.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
crypto/testmgr.h | 259 ++++++++++++++++++++
1 file changed, 259 insertions(+)
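As an aside, for anyone who wants to sanity check the digest of the new
2048 byte vector independently of the kernel: CRC-T10DIF is the 16-bit CRC
with polynomial 0x8bb7, processed MSB first with a zero initial value. A
minimal, unoptimized userspace sketch (crc_t10dif_ref() is just an
illustrative name, not a kernel function) would be:

  #include <stdint.h>
  #include <stddef.h>

  static uint16_t crc_t10dif_ref(const uint8_t *buf, size_t len)
  {
          uint16_t crc = 0;

          while (len--) {
                  crc ^= (uint16_t)*buf++ << 8;       /* MSB-first */
                  for (int i = 0; i < 8; i++)
                          crc = (crc & 0x8000) ? (crc << 1) ^ 0x8bb7
                                               : (uint16_t)(crc << 1);
          }
          return crc;
  }

Running the .plaintext below through this should reproduce the 0x23ca value
in .digest.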
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index a714b6293959..0c849aec161d 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -1494,6 +1494,265 @@ static const struct hash_testvec crct10dif_tv_template[] = {
.digest = (u8 *)(u16 []){ 0x44c6 },
.np = 4,
.tap = { 1, 255, 57, 6 },
+ }, {
+ .plaintext = "\x6e\x05\x79\x10\xa7\x1b\xb2\x49"
+ "\xe0\x54\xeb\x82\x19\x8d\x24\xbb"
+ "\x2f\xc6\x5d\xf4\x68\xff\x96\x0a"
+ "\xa1\x38\xcf\x43\xda\x71\x08\x7c"
+ "\x13\xaa\x1e\xb5\x4c\xe3\x57\xee"
+ "\x85\x1c\x90\x27\xbe\x32\xc9\x60"
+ "\xf7\x6b\x02\x99\x0d\xa4\x3b\xd2"
+ "\x46\xdd\x74\x0b\x7f\x16\xad\x21"
+ "\xb8\x4f\xe6\x5a\xf1\x88\x1f\x93"
+ "\x2a\xc1\x35\xcc\x63\xfa\x6e\x05"
+ "\x9c\x10\xa7\x3e\xd5\x49\xe0\x77"
+ "\x0e\x82\x19\xb0\x24\xbb\x52\xe9"
+ "\x5d\xf4\x8b\x22\x96\x2d\xc4\x38"
+ "\xcf\x66\xfd\x71\x08\x9f\x13\xaa"
+ "\x41\xd8\x4c\xe3\x7a\x11\x85\x1c"
+ "\xb3\x27\xbe\x55\xec\x60\xf7\x8e"
+ "\x02\x99\x30\xc7\x3b\xd2\x69\x00"
+ "\x74\x0b\xa2\x16\xad\x44\xdb\x4f"
+ "\xe6\x7d\x14\x88\x1f\xb6\x2a\xc1"
+ "\x58\xef\x63\xfa\x91\x05\x9c\x33"
+ "\xca\x3e\xd5\x6c\x03\x77\x0e\xa5"
+ "\x19\xb0\x47\xde\x52\xe9\x80\x17"
+ "\x8b\x22\xb9\x2d\xc4\x5b\xf2\x66"
+ "\xfd\x94\x08\x9f\x36\xcd\x41\xd8"
+ "\x6f\x06\x7a\x11\xa8\x1c\xb3\x4a"
+ "\xe1\x55\xec\x83\x1a\x8e\x25\xbc"
+ "\x30\xc7\x5e\xf5\x69\x00\x97\x0b"
+ "\xa2\x39\xd0\x44\xdb\x72\x09\x7d"
+ "\x14\xab\x1f\xb6\x4d\xe4\x58\xef"
+ "\x86\x1d\x91\x28\xbf\x33\xca\x61"
+ "\xf8\x6c\x03\x9a\x0e\xa5\x3c\xd3"
+ "\x47\xde\x75\x0c\x80\x17\xae\x22"
+ "\xb9\x50\xe7\x5b\xf2\x89\x20\x94"
+ "\x2b\xc2\x36\xcd\x64\xfb\x6f\x06"
+ "\x9d\x11\xa8\x3f\xd6\x4a\xe1\x78"
+ "\x0f\x83\x1a\xb1\x25\xbc\x53\xea"
+ "\x5e\xf5\x8c\x00\x97\x2e\xc5\x39"
+ "\xd0\x67\xfe\x72\x09\xa0\x14\xab"
+ "\x42\xd9\x4d\xe4\x7b\x12\x86\x1d"
+ "\xb4\x28\xbf\x56\xed\x61\xf8\x8f"
+ "\x03\x9a\x31\xc8\x3c\xd3\x6a\x01"
+ "\x75\x0c\xa3\x17\xae\x45\xdc\x50"
+ "\xe7\x7e\x15\x89\x20\xb7\x2b\xc2"
+ "\x59\xf0\x64\xfb\x92\x06\x9d\x34"
+ "\xcb\x3f\xd6\x6d\x04\x78\x0f\xa6"
+ "\x1a\xb1\x48\xdf\x53\xea\x81\x18"
+ "\x8c\x23\xba\x2e\xc5\x5c\xf3\x67"
+ "\xfe\x95\x09\xa0\x37\xce\x42\xd9"
+ "\x70\x07\x7b\x12\xa9\x1d\xb4\x4b"
+ "\xe2\x56\xed\x84\x1b\x8f\x26\xbd"
+ "\x31\xc8\x5f\xf6\x6a\x01\x98\x0c"
+ "\xa3\x3a\xd1\x45\xdc\x73\x0a\x7e"
+ "\x15\xac\x20\xb7\x4e\xe5\x59\xf0"
+ "\x87\x1e\x92\x29\xc0\x34\xcb\x62"
+ "\xf9\x6d\x04\x9b\x0f\xa6\x3d\xd4"
+ "\x48\xdf\x76\x0d\x81\x18\xaf\x23"
+ "\xba\x51\xe8\x5c\xf3\x8a\x21\x95"
+ "\x2c\xc3\x37\xce\x65\xfc\x70\x07"
+ "\x9e\x12\xa9\x40\xd7\x4b\xe2\x79"
+ "\x10\x84\x1b\xb2\x26\xbd\x54\xeb"
+ "\x5f\xf6\x8d\x01\x98\x2f\xc6\x3a"
+ "\xd1\x68\xff\x73\x0a\xa1\x15\xac"
+ "\x43\xda\x4e\xe5\x7c\x13\x87\x1e"
+ "\xb5\x29\xc0\x57\xee\x62\xf9\x90"
+ "\x04\x9b\x32\xc9\x3d\xd4\x6b\x02"
+ "\x76\x0d\xa4\x18\xaf\x46\xdd\x51"
+ "\xe8\x7f\x16\x8a\x21\xb8\x2c\xc3"
+ "\x5a\xf1\x65\xfc\x93\x07\x9e\x35"
+ "\xcc\x40\xd7\x6e\x05\x79\x10\xa7"
+ "\x1b\xb2\x49\xe0\x54\xeb\x82\x19"
+ "\x8d\x24\xbb\x2f\xc6\x5d\xf4\x68"
+ "\xff\x96\x0a\xa1\x38\xcf\x43\xda"
+ "\x71\x08\x7c\x13\xaa\x1e\xb5\x4c"
+ "\xe3\x57\xee\x85\x1c\x90\x27\xbe"
+ "\x32\xc9\x60\xf7\x6b\x02\x99\x0d"
+ "\xa4\x3b\xd2\x46\xdd\x74\x0b\x7f"
+ "\x16\xad\x21\xb8\x4f\xe6\x5a\xf1"
+ "\x88\x1f\x93\x2a\xc1\x35\xcc\x63"
+ "\xfa\x6e\x05\x9c\x10\xa7\x3e\xd5"
+ "\x49\xe0\x77\x0e\x82\x19\xb0\x24"
+ "\xbb\x52\xe9\x5d\xf4\x8b\x22\x96"
+ "\x2d\xc4\x38\xcf\x66\xfd\x71\x08"
+ "\x9f\x13\xaa\x41\xd8\x4c\xe3\x7a"
+ "\x11\x85\x1c\xb3\x27\xbe\x55\xec"
+ "\x60\xf7\x8e\x02\x99\x30\xc7\x3b"
+ "\xd2\x69\x00\x74\x0b\xa2\x16\xad"
+ "\x44\xdb\x4f\xe6\x7d\x14\x88\x1f"
+ "\xb6\x2a\xc1\x58\xef\x63\xfa\x91"
+ "\x05\x9c\x33\xca\x3e\xd5\x6c\x03"
+ "\x77\x0e\xa5\x19\xb0\x47\xde\x52"
+ "\xe9\x80\x17\x8b\x22\xb9\x2d\xc4"
+ "\x5b\xf2\x66\xfd\x94\x08\x9f\x36"
+ "\xcd\x41\xd8\x6f\x06\x7a\x11\xa8"
+ "\x1c\xb3\x4a\xe1\x55\xec\x83\x1a"
+ "\x8e\x25\xbc\x30\xc7\x5e\xf5\x69"
+ "\x00\x97\x0b\xa2\x39\xd0\x44\xdb"
+ "\x72\x09\x7d\x14\xab\x1f\xb6\x4d"
+ "\xe4\x58\xef\x86\x1d\x91\x28\xbf"
+ "\x33\xca\x61\xf8\x6c\x03\x9a\x0e"
+ "\xa5\x3c\xd3\x47\xde\x75\x0c\x80"
+ "\x17\xae\x22\xb9\x50\xe7\x5b\xf2"
+ "\x89\x20\x94\x2b\xc2\x36\xcd\x64"
+ "\xfb\x6f\x06\x9d\x11\xa8\x3f\xd6"
+ "\x4a\xe1\x78\x0f\x83\x1a\xb1\x25"
+ "\xbc\x53\xea\x5e\xf5\x8c\x00\x97"
+ "\x2e\xc5\x39\xd0\x67\xfe\x72\x09"
+ "\xa0\x14\xab\x42\xd9\x4d\xe4\x7b"
+ "\x12\x86\x1d\xb4\x28\xbf\x56\xed"
+ "\x61\xf8\x8f\x03\x9a\x31\xc8\x3c"
+ "\xd3\x6a\x01\x75\x0c\xa3\x17\xae"
+ "\x45\xdc\x50\xe7\x7e\x15\x89\x20"
+ "\xb7\x2b\xc2\x59\xf0\x64\xfb\x92"
+ "\x06\x9d\x34\xcb\x3f\xd6\x6d\x04"
+ "\x78\x0f\xa6\x1a\xb1\x48\xdf\x53"
+ "\xea\x81\x18\x8c\x23\xba\x2e\xc5"
+ "\x5c\xf3\x67\xfe\x95\x09\xa0\x37"
+ "\xce\x42\xd9\x70\x07\x7b\x12\xa9"
+ "\x1d\xb4\x4b\xe2\x56\xed\x84\x1b"
+ "\x8f\x26\xbd\x31\xc8\x5f\xf6\x6a"
+ "\x01\x98\x0c\xa3\x3a\xd1\x45\xdc"
+ "\x73\x0a\x7e\x15\xac\x20\xb7\x4e"
+ "\xe5\x59\xf0\x87\x1e\x92\x29\xc0"
+ "\x34\xcb\x62\xf9\x6d\x04\x9b\x0f"
+ "\xa6\x3d\xd4\x48\xdf\x76\x0d\x81"
+ "\x18\xaf\x23\xba\x51\xe8\x5c\xf3"
+ "\x8a\x21\x95\x2c\xc3\x37\xce\x65"
+ "\xfc\x70\x07\x9e\x12\xa9\x40\xd7"
+ "\x4b\xe2\x79\x10\x84\x1b\xb2\x26"
+ "\xbd\x54\xeb\x5f\xf6\x8d\x01\x98"
+ "\x2f\xc6\x3a\xd1\x68\xff\x73\x0a"
+ "\xa1\x15\xac\x43\xda\x4e\xe5\x7c"
+ "\x13\x87\x1e\xb5\x29\xc0\x57\xee"
+ "\x62\xf9\x90\x04\x9b\x32\xc9\x3d"
+ "\xd4\x6b\x02\x76\x0d\xa4\x18\xaf"
+ "\x46\xdd\x51\xe8\x7f\x16\x8a\x21"
+ "\xb8\x2c\xc3\x5a\xf1\x65\xfc\x93"
+ "\x07\x9e\x35\xcc\x40\xd7\x6e\x05"
+ "\x79\x10\xa7\x1b\xb2\x49\xe0\x54"
+ "\xeb\x82\x19\x8d\x24\xbb\x2f\xc6"
+ "\x5d\xf4\x68\xff\x96\x0a\xa1\x38"
+ "\xcf\x43\xda\x71\x08\x7c\x13\xaa"
+ "\x1e\xb5\x4c\xe3\x57\xee\x85\x1c"
+ "\x90\x27\xbe\x32\xc9\x60\xf7\x6b"
+ "\x02\x99\x0d\xa4\x3b\xd2\x46\xdd"
+ "\x74\x0b\x7f\x16\xad\x21\xb8\x4f"
+ "\xe6\x5a\xf1\x88\x1f\x93\x2a\xc1"
+ "\x35\xcc\x63\xfa\x6e\x05\x9c\x10"
+ "\xa7\x3e\xd5\x49\xe0\x77\x0e\x82"
+ "\x19\xb0\x24\xbb\x52\xe9\x5d\xf4"
+ "\x8b\x22\x96\x2d\xc4\x38\xcf\x66"
+ "\xfd\x71\x08\x9f\x13\xaa\x41\xd8"
+ "\x4c\xe3\x7a\x11\x85\x1c\xb3\x27"
+ "\xbe\x55\xec\x60\xf7\x8e\x02\x99"
+ "\x30\xc7\x3b\xd2\x69\x00\x74\x0b"
+ "\xa2\x16\xad\x44\xdb\x4f\xe6\x7d"
+ "\x14\x88\x1f\xb6\x2a\xc1\x58\xef"
+ "\x63\xfa\x91\x05\x9c\x33\xca\x3e"
+ "\xd5\x6c\x03\x77\x0e\xa5\x19\xb0"
+ "\x47\xde\x52\xe9\x80\x17\x8b\x22"
+ "\xb9\x2d\xc4\x5b\xf2\x66\xfd\x94"
+ "\x08\x9f\x36\xcd\x41\xd8\x6f\x06"
+ "\x7a\x11\xa8\x1c\xb3\x4a\xe1\x55"
+ "\xec\x83\x1a\x8e\x25\xbc\x30\xc7"
+ "\x5e\xf5\x69\x00\x97\x0b\xa2\x39"
+ "\xd0\x44\xdb\x72\x09\x7d\x14\xab"
+ "\x1f\xb6\x4d\xe4\x58\xef\x86\x1d"
+ "\x91\x28\xbf\x33\xca\x61\xf8\x6c"
+ "\x03\x9a\x0e\xa5\x3c\xd3\x47\xde"
+ "\x75\x0c\x80\x17\xae\x22\xb9\x50"
+ "\xe7\x5b\xf2\x89\x20\x94\x2b\xc2"
+ "\x36\xcd\x64\xfb\x6f\x06\x9d\x11"
+ "\xa8\x3f\xd6\x4a\xe1\x78\x0f\x83"
+ "\x1a\xb1\x25\xbc\x53\xea\x5e\xf5"
+ "\x8c\x00\x97\x2e\xc5\x39\xd0\x67"
+ "\xfe\x72\x09\xa0\x14\xab\x42\xd9"
+ "\x4d\xe4\x7b\x12\x86\x1d\xb4\x28"
+ "\xbf\x56\xed\x61\xf8\x8f\x03\x9a"
+ "\x31\xc8\x3c\xd3\x6a\x01\x75\x0c"
+ "\xa3\x17\xae\x45\xdc\x50\xe7\x7e"
+ "\x15\x89\x20\xb7\x2b\xc2\x59\xf0"
+ "\x64\xfb\x92\x06\x9d\x34\xcb\x3f"
+ "\xd6\x6d\x04\x78\x0f\xa6\x1a\xb1"
+ "\x48\xdf\x53\xea\x81\x18\x8c\x23"
+ "\xba\x2e\xc5\x5c\xf3\x67\xfe\x95"
+ "\x09\xa0\x37\xce\x42\xd9\x70\x07"
+ "\x7b\x12\xa9\x1d\xb4\x4b\xe2\x56"
+ "\xed\x84\x1b\x8f\x26\xbd\x31\xc8"
+ "\x5f\xf6\x6a\x01\x98\x0c\xa3\x3a"
+ "\xd1\x45\xdc\x73\x0a\x7e\x15\xac"
+ "\x20\xb7\x4e\xe5\x59\xf0\x87\x1e"
+ "\x92\x29\xc0\x34\xcb\x62\xf9\x6d"
+ "\x04\x9b\x0f\xa6\x3d\xd4\x48\xdf"
+ "\x76\x0d\x81\x18\xaf\x23\xba\x51"
+ "\xe8\x5c\xf3\x8a\x21\x95\x2c\xc3"
+ "\x37\xce\x65\xfc\x70\x07\x9e\x12"
+ "\xa9\x40\xd7\x4b\xe2\x79\x10\x84"
+ "\x1b\xb2\x26\xbd\x54\xeb\x5f\xf6"
+ "\x8d\x01\x98\x2f\xc6\x3a\xd1\x68"
+ "\xff\x73\x0a\xa1\x15\xac\x43\xda"
+ "\x4e\xe5\x7c\x13\x87\x1e\xb5\x29"
+ "\xc0\x57\xee\x62\xf9\x90\x04\x9b"
+ "\x32\xc9\x3d\xd4\x6b\x02\x76\x0d"
+ "\xa4\x18\xaf\x46\xdd\x51\xe8\x7f"
+ "\x16\x8a\x21\xb8\x2c\xc3\x5a\xf1"
+ "\x65\xfc\x93\x07\x9e\x35\xcc\x40"
+ "\xd7\x6e\x05\x79\x10\xa7\x1b\xb2"
+ "\x49\xe0\x54\xeb\x82\x19\x8d\x24"
+ "\xbb\x2f\xc6\x5d\xf4\x68\xff\x96"
+ "\x0a\xa1\x38\xcf\x43\xda\x71\x08"
+ "\x7c\x13\xaa\x1e\xb5\x4c\xe3\x57"
+ "\xee\x85\x1c\x90\x27\xbe\x32\xc9"
+ "\x60\xf7\x6b\x02\x99\x0d\xa4\x3b"
+ "\xd2\x46\xdd\x74\x0b\x7f\x16\xad"
+ "\x21\xb8\x4f\xe6\x5a\xf1\x88\x1f"
+ "\x93\x2a\xc1\x35\xcc\x63\xfa\x6e"
+ "\x05\x9c\x10\xa7\x3e\xd5\x49\xe0"
+ "\x77\x0e\x82\x19\xb0\x24\xbb\x52"
+ "\xe9\x5d\xf4\x8b\x22\x96\x2d\xc4"
+ "\x38\xcf\x66\xfd\x71\x08\x9f\x13"
+ "\xaa\x41\xd8\x4c\xe3\x7a\x11\x85"
+ "\x1c\xb3\x27\xbe\x55\xec\x60\xf7"
+ "\x8e\x02\x99\x30\xc7\x3b\xd2\x69"
+ "\x00\x74\x0b\xa2\x16\xad\x44\xdb"
+ "\x4f\xe6\x7d\x14\x88\x1f\xb6\x2a"
+ "\xc1\x58\xef\x63\xfa\x91\x05\x9c"
+ "\x33\xca\x3e\xd5\x6c\x03\x77\x0e"
+ "\xa5\x19\xb0\x47\xde\x52\xe9\x80"
+ "\x17\x8b\x22\xb9\x2d\xc4\x5b\xf2"
+ "\x66\xfd\x94\x08\x9f\x36\xcd\x41"
+ "\xd8\x6f\x06\x7a\x11\xa8\x1c\xb3"
+ "\x4a\xe1\x55\xec\x83\x1a\x8e\x25"
+ "\xbc\x30\xc7\x5e\xf5\x69\x00\x97"
+ "\x0b\xa2\x39\xd0\x44\xdb\x72\x09"
+ "\x7d\x14\xab\x1f\xb6\x4d\xe4\x58"
+ "\xef\x86\x1d\x91\x28\xbf\x33\xca"
+ "\x61\xf8\x6c\x03\x9a\x0e\xa5\x3c"
+ "\xd3\x47\xde\x75\x0c\x80\x17\xae"
+ "\x22\xb9\x50\xe7\x5b\xf2\x89\x20"
+ "\x94\x2b\xc2\x36\xcd\x64\xfb\x6f"
+ "\x06\x9d\x11\xa8\x3f\xd6\x4a\xe1"
+ "\x78\x0f\x83\x1a\xb1\x25\xbc\x53"
+ "\xea\x5e\xf5\x8c\x00\x97\x2e\xc5"
+ "\x39\xd0\x67\xfe\x72\x09\xa0\x14"
+ "\xab\x42\xd9\x4d\xe4\x7b\x12\x86"
+ "\x1d\xb4\x28\xbf\x56\xed\x61\xf8"
+ "\x8f\x03\x9a\x31\xc8\x3c\xd3\x6a"
+ "\x01\x75\x0c\xa3\x17\xae\x45\xdc"
+ "\x50\xe7\x7e\x15\x89\x20\xb7\x2b"
+ "\xc2\x59\xf0\x64\xfb\x92\x06\x9d"
+ "\x34\xcb\x3f\xd6\x6d\x04\x78\x0f"
+ "\xa6\x1a\xb1\x48\xdf\x53\xea\x81"
+ "\x18\x8c\x23\xba\x2e\xc5\x5c\xf3"
+ "\x67\xfe\x95\x09\xa0\x37\xce\x42"
+ "\xd9\x70\x07\x7b\x12\xa9\x1d\xb4"
+ "\x4b\xe2\x56\xed\x84\x1b\x8f\x26"
+ "\xbd\x31\xc8\x5f\xf6\x6a\x01\x98",
+ .psize = 2048,
+ .digest = (u8 *)(u16 []){ 0x23ca },
}
};
--
2.11.0
* [PATCH v3 02/20] crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop
2017-12-06 19:43 ` Ard Biesheuvel
@ 2017-12-06 19:43 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 19:43 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.
Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.
So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled).
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-ce-ccm-glue.c | 47 ++++++++++----------
1 file changed, 23 insertions(+), 24 deletions(-)
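Since the hunks below are somewhat fragmented, this is the shape the SIMD
path of ccm_encrypt() ends up with after this patch (condensed from the diff
below; ccm_decrypt() is restructured the same way, and the non-SIMD fallback
is unchanged):

  err = skcipher_walk_aead_encrypt(&walk, req, true);

  if (may_use_simd()) {
          while (walk.nbytes) {
                  u32 tail = walk.nbytes % AES_BLOCK_SIZE;

                  if (walk.nbytes == walk.total)
                          tail = 0;

                  kernel_neon_begin();
                  ce_aes_ccm_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
                                     walk.nbytes - tail, ctx->key_enc,
                                     num_rounds(ctx), mac, walk.iv);
                  kernel_neon_end();

                  err = skcipher_walk_done(&walk, tail);
          }
          if (!err) {
                  kernel_neon_begin();
                  ce_aes_ccm_final(mac, buf, ctx->key_enc, num_rounds(ctx));
                  kernel_neon_end();
          }
  } else {
          err = ccm_crypt_fallback(&walk, mac, buf, ctx, true);
  }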
diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
index a1254036f2b1..68b11aa690e4 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -107,11 +107,13 @@ static int ccm_init_mac(struct aead_request *req, u8 maciv[], u32 msglen)
}
static void ccm_update_mac(struct crypto_aes_ctx *key, u8 mac[], u8 const in[],
- u32 abytes, u32 *macp, bool use_neon)
+ u32 abytes, u32 *macp)
{
- if (likely(use_neon)) {
+ if (may_use_simd()) {
+ kernel_neon_begin();
ce_aes_ccm_auth_data(mac, in, abytes, macp, key->key_enc,
num_rounds(key));
+ kernel_neon_end();
} else {
if (*macp > 0 && *macp < AES_BLOCK_SIZE) {
int added = min(abytes, AES_BLOCK_SIZE - *macp);
@@ -143,8 +145,7 @@ static void ccm_update_mac(struct crypto_aes_ctx *key, u8 mac[], u8 const in[],
}
}
-static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[],
- bool use_neon)
+static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[])
{
struct crypto_aead *aead = crypto_aead_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_aead_ctx(aead);
@@ -163,7 +164,7 @@ static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[],
ltag.len = 6;
}
- ccm_update_mac(ctx, mac, (u8 *)&ltag, ltag.len, &macp, use_neon);
+ ccm_update_mac(ctx, mac, (u8 *)&ltag, ltag.len, &macp);
scatterwalk_start(&walk, req->src);
do {
@@ -175,7 +176,7 @@ static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[],
n = scatterwalk_clamp(&walk, len);
}
p = scatterwalk_map(&walk);
- ccm_update_mac(ctx, mac, p, n, &macp, use_neon);
+ ccm_update_mac(ctx, mac, p, n, &macp);
len -= n;
scatterwalk_unmap(p);
@@ -242,43 +243,42 @@ static int ccm_encrypt(struct aead_request *req)
u8 __aligned(8) mac[AES_BLOCK_SIZE];
u8 buf[AES_BLOCK_SIZE];
u32 len = req->cryptlen;
- bool use_neon = may_use_simd();
int err;
err = ccm_init_mac(req, mac, len);
if (err)
return err;
- if (likely(use_neon))
- kernel_neon_begin();
-
if (req->assoclen)
- ccm_calculate_auth_mac(req, mac, use_neon);
+ ccm_calculate_auth_mac(req, mac);
/* preserve the original iv for the final round */
memcpy(buf, req->iv, AES_BLOCK_SIZE);
err = skcipher_walk_aead_encrypt(&walk, req, true);
- if (likely(use_neon)) {
+ if (may_use_simd()) {
while (walk.nbytes) {
u32 tail = walk.nbytes % AES_BLOCK_SIZE;
if (walk.nbytes == walk.total)
tail = 0;
+ kernel_neon_begin();
ce_aes_ccm_encrypt(walk.dst.virt.addr,
walk.src.virt.addr,
walk.nbytes - tail, ctx->key_enc,
num_rounds(ctx), mac, walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, tail);
}
- if (!err)
+ if (!err) {
+ kernel_neon_begin();
ce_aes_ccm_final(mac, buf, ctx->key_enc,
num_rounds(ctx));
-
- kernel_neon_end();
+ kernel_neon_end();
+ }
} else {
err = ccm_crypt_fallback(&walk, mac, buf, ctx, true);
}
@@ -301,43 +301,42 @@ static int ccm_decrypt(struct aead_request *req)
u8 __aligned(8) mac[AES_BLOCK_SIZE];
u8 buf[AES_BLOCK_SIZE];
u32 len = req->cryptlen - authsize;
- bool use_neon = may_use_simd();
int err;
err = ccm_init_mac(req, mac, len);
if (err)
return err;
- if (likely(use_neon))
- kernel_neon_begin();
-
if (req->assoclen)
- ccm_calculate_auth_mac(req, mac, use_neon);
+ ccm_calculate_auth_mac(req, mac);
/* preserve the original iv for the final round */
memcpy(buf, req->iv, AES_BLOCK_SIZE);
err = skcipher_walk_aead_decrypt(&walk, req, true);
- if (likely(use_neon)) {
+ if (may_use_simd()) {
while (walk.nbytes) {
u32 tail = walk.nbytes % AES_BLOCK_SIZE;
if (walk.nbytes == walk.total)
tail = 0;
+ kernel_neon_begin();
ce_aes_ccm_decrypt(walk.dst.virt.addr,
walk.src.virt.addr,
walk.nbytes - tail, ctx->key_enc,
num_rounds(ctx), mac, walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, tail);
}
- if (!err)
+ if (!err) {
+ kernel_neon_begin();
ce_aes_ccm_final(mac, buf, ctx->key_enc,
num_rounds(ctx));
-
- kernel_neon_end();
+ kernel_neon_end();
+ }
} else {
err = ccm_crypt_fallback(&walk, mac, buf, ctx, false);
}
--
2.11.0
* [PATCH v3 03/20] crypto: arm64/aes-blk - move kernel mode neon en/disable into loop
@ 2017-12-06 19:43 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 19:43 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.
Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.
So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled).
Note that this requires some reshuffling of the registers in the asm
code, because the XTS routines can no longer rely on the registers to
retain their contents between invocations.
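The resulting shape of the ECB/CBC/CTR glue routines is the same throughout the file; as a minimal sketch, with process_blocks() used as a hypothetical stand-in for the aes_*_encrypt/aes_*_decrypt asm helpers:

        static int sketch_crypt(struct skcipher_request *req)
        {
                struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
                struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
                int err, rounds = 6 + ctx->key_length / 4;
                struct skcipher_walk walk;
                unsigned int blocks;

                /* atomic == false: the walk may now sleep and allocate with GFP_KERNEL */
                err = skcipher_walk_virt(&walk, req, false);

                while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
                        kernel_neon_begin();    /* NEON only around the asm call */
                        process_blocks(walk.dst.virt.addr, walk.src.virt.addr,
                                       (u8 *)ctx->key_enc, rounds, blocks);
                        kernel_neon_end();
                        err = skcipher_walk_done(&walk,
                                                 walk.nbytes % AES_BLOCK_SIZE);
                }
                return err;
        }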
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-glue.c | 95 ++++++++++----------
arch/arm64/crypto/aes-modes.S | 90 +++++++++----------
arch/arm64/crypto/aes-neonbs-glue.c | 14 ++-
3 files changed, 97 insertions(+), 102 deletions(-)
diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 998ba519a026..00a3e2fd6a48 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -64,17 +64,17 @@ MODULE_LICENSE("GPL v2");
/* defined in aes-modes.S */
asmlinkage void aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[],
- int rounds, int blocks, int first);
+ int rounds, int blocks);
asmlinkage void aes_ecb_decrypt(u8 out[], u8 const in[], u8 const rk[],
- int rounds, int blocks, int first);
+ int rounds, int blocks);
asmlinkage void aes_cbc_encrypt(u8 out[], u8 const in[], u8 const rk[],
- int rounds, int blocks, u8 iv[], int first);
+ int rounds, int blocks, u8 iv[]);
asmlinkage void aes_cbc_decrypt(u8 out[], u8 const in[], u8 const rk[],
- int rounds, int blocks, u8 iv[], int first);
+ int rounds, int blocks, u8 iv[]);
asmlinkage void aes_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[],
- int rounds, int blocks, u8 ctr[], int first);
+ int rounds, int blocks, u8 ctr[]);
asmlinkage void aes_xts_encrypt(u8 out[], u8 const in[], u8 const rk1[],
int rounds, int blocks, u8 const rk2[], u8 iv[],
@@ -133,19 +133,19 @@ static int ecb_encrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
- int err, first, rounds = 6 + ctx->key_length / 4;
+ int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- kernel_neon_begin();
- for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
aes_ecb_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_enc, rounds, blocks, first);
+ (u8 *)ctx->key_enc, rounds, blocks);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -153,19 +153,19 @@ static int ecb_decrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
- int err, first, rounds = 6 + ctx->key_length / 4;
+ int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- kernel_neon_begin();
- for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
aes_ecb_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_dec, rounds, blocks, first);
+ (u8 *)ctx->key_dec, rounds, blocks);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -173,20 +173,19 @@ static int cbc_encrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
- int err, first, rounds = 6 + ctx->key_length / 4;
+ int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- kernel_neon_begin();
- for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_enc, rounds, blocks, walk.iv,
- first);
+ (u8 *)ctx->key_enc, rounds, blocks, walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -194,20 +193,19 @@ static int cbc_decrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
- int err, first, rounds = 6 + ctx->key_length / 4;
+ int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
unsigned int blocks;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- kernel_neon_begin();
- for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
aes_cbc_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_dec, rounds, blocks, walk.iv,
- first);
+ (u8 *)ctx->key_dec, rounds, blocks, walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -215,20 +213,18 @@ static int ctr_encrypt(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
- int err, first, rounds = 6 + ctx->key_length / 4;
+ int err, rounds = 6 + ctx->key_length / 4;
struct skcipher_walk walk;
int blocks;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- first = 1;
- kernel_neon_begin();
while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+ kernel_neon_begin();
aes_ctr_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- (u8 *)ctx->key_enc, rounds, blocks, walk.iv,
- first);
+ (u8 *)ctx->key_enc, rounds, blocks, walk.iv);
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
- first = 0;
+ kernel_neon_end();
}
if (walk.nbytes) {
u8 __aligned(8) tail[AES_BLOCK_SIZE];
@@ -241,12 +237,13 @@ static int ctr_encrypt(struct skcipher_request *req)
*/
blocks = -1;
+ kernel_neon_begin();
aes_ctr_encrypt(tail, NULL, (u8 *)ctx->key_enc, rounds,
- blocks, walk.iv, first);
+ blocks, walk.iv);
+ kernel_neon_end();
crypto_xor_cpy(tdst, tsrc, tail, nbytes);
err = skcipher_walk_done(&walk, 0);
}
- kernel_neon_end();
return err;
}
@@ -270,16 +267,16 @@ static int xts_encrypt(struct skcipher_request *req)
struct skcipher_walk walk;
unsigned int blocks;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- kernel_neon_begin();
for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ kernel_neon_begin();
aes_xts_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
(u8 *)ctx->key1.key_enc, rounds, blocks,
(u8 *)ctx->key2.key_enc, walk.iv, first);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -292,16 +289,16 @@ static int xts_decrypt(struct skcipher_request *req)
struct skcipher_walk walk;
unsigned int blocks;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- kernel_neon_begin();
for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+ kernel_neon_begin();
aes_xts_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
(u8 *)ctx->key1.key_dec, rounds, blocks,
(u8 *)ctx->key2.key_enc, walk.iv, first);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -425,7 +422,7 @@ static int cmac_setkey(struct crypto_shash *tfm, const u8 *in_key,
/* encrypt the zero vector */
kernel_neon_begin();
- aes_ecb_encrypt(ctx->consts, (u8[AES_BLOCK_SIZE]){}, rk, rounds, 1, 1);
+ aes_ecb_encrypt(ctx->consts, (u8[AES_BLOCK_SIZE]){}, rk, rounds, 1);
kernel_neon_end();
cmac_gf128_mul_by_x(consts, consts);
@@ -454,8 +451,8 @@ static int xcbc_setkey(struct crypto_shash *tfm, const u8 *in_key,
return err;
kernel_neon_begin();
- aes_ecb_encrypt(key, ks[0], rk, rounds, 1, 1);
- aes_ecb_encrypt(ctx->consts, ks[1], rk, rounds, 2, 0);
+ aes_ecb_encrypt(key, ks[0], rk, rounds, 1);
+ aes_ecb_encrypt(ctx->consts, ks[1], rk, rounds, 2);
kernel_neon_end();
return cbcmac_setkey(tfm, key, sizeof(key));
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 2674d43d1384..65b273667b34 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -40,24 +40,24 @@
#if INTERLEAVE == 2
aes_encrypt_block2x:
- encrypt_block2x v0, v1, w3, x2, x6, w7
+ encrypt_block2x v0, v1, w3, x2, x8, w7
ret
ENDPROC(aes_encrypt_block2x)
aes_decrypt_block2x:
- decrypt_block2x v0, v1, w3, x2, x6, w7
+ decrypt_block2x v0, v1, w3, x2, x8, w7
ret
ENDPROC(aes_decrypt_block2x)
#elif INTERLEAVE == 4
aes_encrypt_block4x:
- encrypt_block4x v0, v1, v2, v3, w3, x2, x6, w7
+ encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
ret
ENDPROC(aes_encrypt_block4x)
aes_decrypt_block4x:
- decrypt_block4x v0, v1, v2, v3, w3, x2, x6, w7
+ decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
ret
ENDPROC(aes_decrypt_block4x)
@@ -86,33 +86,32 @@ ENDPROC(aes_decrypt_block4x)
#define FRAME_POP
.macro do_encrypt_block2x
- encrypt_block2x v0, v1, w3, x2, x6, w7
+ encrypt_block2x v0, v1, w3, x2, x8, w7
.endm
.macro do_decrypt_block2x
- decrypt_block2x v0, v1, w3, x2, x6, w7
+ decrypt_block2x v0, v1, w3, x2, x8, w7
.endm
.macro do_encrypt_block4x
- encrypt_block4x v0, v1, v2, v3, w3, x2, x6, w7
+ encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
.endm
.macro do_decrypt_block4x
- decrypt_block4x v0, v1, v2, v3, w3, x2, x6, w7
+ decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
.endm
#endif
/*
* aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
- * int blocks, int first)
+ * int blocks)
* aes_ecb_decrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
- * int blocks, int first)
+ * int blocks)
*/
AES_ENTRY(aes_ecb_encrypt)
FRAME_PUSH
- cbz w5, .LecbencloopNx
enc_prepare w3, x2, x5
@@ -148,7 +147,6 @@ AES_ENDPROC(aes_ecb_encrypt)
AES_ENTRY(aes_ecb_decrypt)
FRAME_PUSH
- cbz w5, .LecbdecloopNx
dec_prepare w3, x2, x5
@@ -184,14 +182,12 @@ AES_ENDPROC(aes_ecb_decrypt)
/*
* aes_cbc_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
- * int blocks, u8 iv[], int first)
+ * int blocks, u8 iv[])
* aes_cbc_decrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
- * int blocks, u8 iv[], int first)
+ * int blocks, u8 iv[])
*/
AES_ENTRY(aes_cbc_encrypt)
- cbz w6, .Lcbcencloop
-
ld1 {v0.16b}, [x5] /* get iv */
enc_prepare w3, x2, x6
@@ -209,7 +205,6 @@ AES_ENDPROC(aes_cbc_encrypt)
AES_ENTRY(aes_cbc_decrypt)
FRAME_PUSH
- cbz w6, .LcbcdecloopNx
ld1 {v7.16b}, [x5] /* get iv */
dec_prepare w3, x2, x6
@@ -264,20 +259,19 @@ AES_ENDPROC(aes_cbc_decrypt)
/*
* aes_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
- * int blocks, u8 ctr[], int first)
+ * int blocks, u8 ctr[])
*/
AES_ENTRY(aes_ctr_encrypt)
FRAME_PUSH
- cbz w6, .Lctrnotfirst /* 1st time around? */
+
enc_prepare w3, x2, x6
ld1 {v4.16b}, [x5]
-.Lctrnotfirst:
- umov x8, v4.d[1] /* keep swabbed ctr in reg */
- rev x8, x8
+ umov x6, v4.d[1] /* keep swabbed ctr in reg */
+ rev x6, x6
#if INTERLEAVE >= 2
- cmn w8, w4 /* 32 bit overflow? */
+ cmn w6, w4 /* 32 bit overflow? */
bcs .Lctrloop
.LctrloopNx:
subs w4, w4, #INTERLEAVE
@@ -285,11 +279,11 @@ AES_ENTRY(aes_ctr_encrypt)
#if INTERLEAVE == 2
mov v0.8b, v4.8b
mov v1.8b, v4.8b
- rev x7, x8
- add x8, x8, #1
+ rev x7, x6
+ add x6, x6, #1
ins v0.d[1], x7
- rev x7, x8
- add x8, x8, #1
+ rev x7, x6
+ add x6, x6, #1
ins v1.d[1], x7
ld1 {v2.16b-v3.16b}, [x1], #32 /* get 2 input blocks */
do_encrypt_block2x
@@ -298,7 +292,7 @@ AES_ENTRY(aes_ctr_encrypt)
st1 {v0.16b-v1.16b}, [x0], #32
#else
ldr q8, =0x30000000200000001 /* addends 1,2,3[,0] */
- dup v7.4s, w8
+ dup v7.4s, w6
mov v0.16b, v4.16b
add v7.4s, v7.4s, v8.4s
mov v1.16b, v4.16b
@@ -316,9 +310,9 @@ AES_ENTRY(aes_ctr_encrypt)
eor v2.16b, v7.16b, v2.16b
eor v3.16b, v5.16b, v3.16b
st1 {v0.16b-v3.16b}, [x0], #64
- add x8, x8, #INTERLEAVE
+ add x6, x6, #INTERLEAVE
#endif
- rev x7, x8
+ rev x7, x6
ins v4.d[1], x7
cbz w4, .Lctrout
b .LctrloopNx
@@ -328,10 +322,10 @@ AES_ENTRY(aes_ctr_encrypt)
#endif
.Lctrloop:
mov v0.16b, v4.16b
- encrypt_block v0, w3, x2, x6, w7
+ encrypt_block v0, w3, x2, x8, w7
- adds x8, x8, #1 /* increment BE ctr */
- rev x7, x8
+ adds x6, x6, #1 /* increment BE ctr */
+ rev x7, x6
ins v4.d[1], x7
bcs .Lctrcarry /* overflow? */
@@ -385,15 +379,17 @@ CPU_BE( .quad 0x87, 1 )
AES_ENTRY(aes_xts_encrypt)
FRAME_PUSH
- cbz w7, .LxtsencloopNx
-
ld1 {v4.16b}, [x6]
- enc_prepare w3, x5, x6
- encrypt_block v4, w3, x5, x6, w7 /* first tweak */
- enc_switch_key w3, x2, x6
+ cbz w7, .Lxtsencnotfirst
+
+ enc_prepare w3, x5, x8
+ encrypt_block v4, w3, x5, x8, w7 /* first tweak */
+ enc_switch_key w3, x2, x8
ldr q7, .Lxts_mul_x
b .LxtsencNx
+.Lxtsencnotfirst:
+ enc_prepare w3, x2, x8
.LxtsencloopNx:
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
@@ -442,7 +438,7 @@ AES_ENTRY(aes_xts_encrypt)
.Lxtsencloop:
ld1 {v1.16b}, [x1], #16
eor v0.16b, v1.16b, v4.16b
- encrypt_block v0, w3, x2, x6, w7
+ encrypt_block v0, w3, x2, x8, w7
eor v0.16b, v0.16b, v4.16b
st1 {v0.16b}, [x0], #16
subs w4, w4, #1
@@ -450,6 +446,7 @@ AES_ENTRY(aes_xts_encrypt)
next_tweak v4, v4, v7, v8
b .Lxtsencloop
.Lxtsencout:
+ st1 {v4.16b}, [x6]
FRAME_POP
ret
AES_ENDPROC(aes_xts_encrypt)
@@ -457,15 +454,17 @@ AES_ENDPROC(aes_xts_encrypt)
AES_ENTRY(aes_xts_decrypt)
FRAME_PUSH
- cbz w7, .LxtsdecloopNx
-
ld1 {v4.16b}, [x6]
- enc_prepare w3, x5, x6
- encrypt_block v4, w3, x5, x6, w7 /* first tweak */
- dec_prepare w3, x2, x6
+ cbz w7, .Lxtsdecnotfirst
+
+ enc_prepare w3, x5, x8
+ encrypt_block v4, w3, x5, x8, w7 /* first tweak */
+ dec_prepare w3, x2, x8
ldr q7, .Lxts_mul_x
b .LxtsdecNx
+.Lxtsdecnotfirst:
+ dec_prepare w3, x2, x8
.LxtsdecloopNx:
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
@@ -514,7 +513,7 @@ AES_ENTRY(aes_xts_decrypt)
.Lxtsdecloop:
ld1 {v1.16b}, [x1], #16
eor v0.16b, v1.16b, v4.16b
- decrypt_block v0, w3, x2, x6, w7
+ decrypt_block v0, w3, x2, x8, w7
eor v0.16b, v0.16b, v4.16b
st1 {v0.16b}, [x0], #16
subs w4, w4, #1
@@ -522,6 +521,7 @@ AES_ENTRY(aes_xts_decrypt)
next_tweak v4, v4, v7, v8
b .Lxtsdecloop
.Lxtsdecout:
+ st1 {v4.16b}, [x6]
FRAME_POP
ret
AES_ENDPROC(aes_xts_decrypt)
diff --git a/arch/arm64/crypto/aes-neonbs-glue.c b/arch/arm64/crypto/aes-neonbs-glue.c
index c55d68ccb89f..9d823c77ec84 100644
--- a/arch/arm64/crypto/aes-neonbs-glue.c
+++ b/arch/arm64/crypto/aes-neonbs-glue.c
@@ -46,10 +46,9 @@ asmlinkage void aesbs_xts_decrypt(u8 out[], u8 const in[], u8 const rk[],
/* borrowed from aes-neon-blk.ko */
asmlinkage void neon_aes_ecb_encrypt(u8 out[], u8 const in[], u32 const rk[],
- int rounds, int blocks, int first);
+ int rounds, int blocks);
asmlinkage void neon_aes_cbc_encrypt(u8 out[], u8 const in[], u32 const rk[],
- int rounds, int blocks, u8 iv[],
- int first);
+ int rounds, int blocks, u8 iv[]);
struct aesbs_ctx {
u8 rk[13 * (8 * AES_BLOCK_SIZE) + 32];
@@ -157,7 +156,7 @@ static int cbc_encrypt(struct skcipher_request *req)
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct aesbs_cbc_ctx *ctx = crypto_skcipher_ctx(tfm);
struct skcipher_walk walk;
- int err, first = 1;
+ int err;
err = skcipher_walk_virt(&walk, req, true);
@@ -167,10 +166,9 @@ static int cbc_encrypt(struct skcipher_request *req)
/* fall back to the non-bitsliced NEON implementation */
neon_aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
- ctx->enc, ctx->key.rounds, blocks, walk.iv,
- first);
+ ctx->enc, ctx->key.rounds, blocks,
+ walk.iv);
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
- first = 0;
}
kernel_neon_end();
return err;
@@ -311,7 +309,7 @@ static int __xts_crypt(struct skcipher_request *req,
kernel_neon_begin();
neon_aes_ecb_encrypt(walk.iv, walk.iv, ctx->twkey,
- ctx->key.rounds, 1, 1);
+ ctx->key.rounds, 1);
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
--
2.11.0
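One detail of the XTS changes above deserves a callout: with kernel mode NEON dropped between invocations, the current tweak can no longer stay resident in v4, so it now round-trips through memory. Every call to aes_xts_encrypt()/aes_xts_decrypt() loads the tweak from the iv buffer ([x6]), and the st1 instructions added at .Lxtsencout/.Lxtsdecout store the updated tweak back before returning; only the very first call (first != 0) derives the initial tweak by encrypting the caller's IV with the second key. That is why the glue loop still passes first, as in the hunk above:

        for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
                kernel_neon_begin();
                aes_xts_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
                                (u8 *)ctx->key1.key_enc, rounds, blocks,
                                (u8 *)ctx->key2.key_enc, walk.iv, first);
                kernel_neon_end();
                err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
        }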
* [PATCH v3 04/20] crypto: arm64/aes-bs - move kernel mode neon en/disable into loop
@ 2017-12-06 19:43 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 19:43 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.
Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.
So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled).
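The bit-sliced glue follows the same pattern, with one extra wrinkle: for every chunk except the last, the block count is rounded down to a whole multiple of the walk stride so the asm always sees complete groups. A condensed sketch of __ecb_crypt() after this patch (fn is the aesbs_ecb_encrypt/decrypt routine passed in by the caller):

        err = skcipher_walk_virt(&walk, req, false);

        while (walk.nbytes >= AES_BLOCK_SIZE) {
                unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;

                if (walk.nbytes < walk.total)
                        blocks = round_down(blocks,
                                            walk.stride / AES_BLOCK_SIZE);

                kernel_neon_begin();            /* per-chunk NEON section */
                fn(walk.dst.virt.addr, walk.src.virt.addr, ctx->rk,
                   ctx->rounds, blocks);
                kernel_neon_end();

                err = skcipher_walk_done(&walk,
                                         walk.nbytes - blocks * AES_BLOCK_SIZE);
        }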
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-neonbs-glue.c | 36 +++++++++-----------
1 file changed, 17 insertions(+), 19 deletions(-)
diff --git a/arch/arm64/crypto/aes-neonbs-glue.c b/arch/arm64/crypto/aes-neonbs-glue.c
index 9d823c77ec84..e7a95a566462 100644
--- a/arch/arm64/crypto/aes-neonbs-glue.c
+++ b/arch/arm64/crypto/aes-neonbs-glue.c
@@ -99,9 +99,8 @@ static int __ecb_crypt(struct skcipher_request *req,
struct skcipher_walk walk;
int err;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- kernel_neon_begin();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
@@ -109,12 +108,13 @@ static int __ecb_crypt(struct skcipher_request *req,
blocks = round_down(blocks,
walk.stride / AES_BLOCK_SIZE);
+ kernel_neon_begin();
fn(walk.dst.virt.addr, walk.src.virt.addr, ctx->rk,
ctx->rounds, blocks);
+ kernel_neon_end();
err = skcipher_walk_done(&walk,
walk.nbytes - blocks * AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -158,19 +158,19 @@ static int cbc_encrypt(struct skcipher_request *req)
struct skcipher_walk walk;
int err;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- kernel_neon_begin();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
/* fall back to the non-bitsliced NEON implementation */
+ kernel_neon_begin();
neon_aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
ctx->enc, ctx->key.rounds, blocks,
walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -181,9 +181,8 @@ static int cbc_decrypt(struct skcipher_request *req)
struct skcipher_walk walk;
int err;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- kernel_neon_begin();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
@@ -191,13 +190,14 @@ static int cbc_decrypt(struct skcipher_request *req)
blocks = round_down(blocks,
walk.stride / AES_BLOCK_SIZE);
+ kernel_neon_begin();
aesbs_cbc_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
ctx->key.rk, ctx->key.rounds, blocks,
walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk,
walk.nbytes - blocks * AES_BLOCK_SIZE);
}
- kernel_neon_end();
return err;
}
@@ -229,9 +229,8 @@ static int ctr_encrypt(struct skcipher_request *req)
u8 buf[AES_BLOCK_SIZE];
int err;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
- kernel_neon_begin();
while (walk.nbytes > 0) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
u8 *final = (walk.total % AES_BLOCK_SIZE) ? buf : NULL;
@@ -242,8 +241,10 @@ static int ctr_encrypt(struct skcipher_request *req)
final = NULL;
}
+ kernel_neon_begin();
aesbs_ctr_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
ctx->rk, ctx->rounds, blocks, walk.iv, final);
+ kernel_neon_end();
if (final) {
u8 *dst = walk.dst.virt.addr + blocks * AES_BLOCK_SIZE;
@@ -258,8 +259,6 @@ static int ctr_encrypt(struct skcipher_request *req)
err = skcipher_walk_done(&walk,
walk.nbytes - blocks * AES_BLOCK_SIZE);
}
- kernel_neon_end();
-
return err;
}
@@ -304,12 +303,11 @@ static int __xts_crypt(struct skcipher_request *req,
struct skcipher_walk walk;
int err;
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
kernel_neon_begin();
-
- neon_aes_ecb_encrypt(walk.iv, walk.iv, ctx->twkey,
- ctx->key.rounds, 1);
+ neon_aes_ecb_encrypt(walk.iv, walk.iv, ctx->twkey, ctx->key.rounds, 1);
+ kernel_neon_end();
while (walk.nbytes >= AES_BLOCK_SIZE) {
unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
@@ -318,13 +316,13 @@ static int __xts_crypt(struct skcipher_request *req,
blocks = round_down(blocks,
walk.stride / AES_BLOCK_SIZE);
+ kernel_neon_begin();
fn(walk.dst.virt.addr, walk.src.virt.addr, ctx->key.rk,
ctx->key.rounds, blocks, walk.iv);
+ kernel_neon_end();
err = skcipher_walk_done(&walk,
walk.nbytes - blocks * AES_BLOCK_SIZE);
}
- kernel_neon_end();
-
return err;
}
--
2.11.0
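Note how __xts_crypt() handles the tweak in this scheme: encrypting walk.iv with the tweak key gets its own short NEON section before the loop, and the running tweak then lives in walk.iv (i.e. in memory) rather than in NEON registers, so nothing has to survive across the per-chunk kernel_neon_begin()/kernel_neon_end() pairs. Condensed from the hunks above:

        err = skcipher_walk_virt(&walk, req, false);

        kernel_neon_begin();
        neon_aes_ecb_encrypt(walk.iv, walk.iv, ctx->twkey, ctx->key.rounds, 1);
        kernel_neon_end();

        while (walk.nbytes >= AES_BLOCK_SIZE) {
                unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;

                if (walk.nbytes < walk.total)
                        blocks = round_down(blocks,
                                            walk.stride / AES_BLOCK_SIZE);

                kernel_neon_begin();
                fn(walk.dst.virt.addr, walk.src.virt.addr, ctx->key.rk,
                   ctx->key.rounds, blocks, walk.iv);
                kernel_neon_end();

                err = skcipher_walk_done(&walk,
                                         walk.nbytes - blocks * AES_BLOCK_SIZE);
        }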
* [PATCH v3 05/20] crypto: arm64/chacha20 - move kernel mode neon en/disable into loop
@ 2017-12-06 19:43 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 19:43 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.
Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.
So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled).
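Concretely, chacha20_doneon() now opens and closes a NEON section around each 4-block chunk, and handles whatever is left (fewer than four blocks, possibly including a partial block) in one final section, so the pointer bookkeeping between chunks runs with preemption enabled. Condensed from the hunks below:

        while (bytes >= CHACHA20_BLOCK_SIZE * 4) {
                kernel_neon_begin();
                chacha20_4block_xor_neon(state, dst, src);
                kernel_neon_end();

                bytes -= CHACHA20_BLOCK_SIZE * 4;
                src += CHACHA20_BLOCK_SIZE * 4;
                dst += CHACHA20_BLOCK_SIZE * 4;
                state[12] += 4;
        }

        if (!bytes)
                return;

        kernel_neon_begin();
        /* single-block chacha20_block_xor_neon() calls for the remaining
         * blocks, plus the bounce-buffer copy for a partial final block */
        kernel_neon_end();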
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/chacha20-neon-glue.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/crypto/chacha20-neon-glue.c b/arch/arm64/crypto/chacha20-neon-glue.c
index cbdb75d15cd0..727579c93ded 100644
--- a/arch/arm64/crypto/chacha20-neon-glue.c
+++ b/arch/arm64/crypto/chacha20-neon-glue.c
@@ -37,12 +37,19 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
u8 buf[CHACHA20_BLOCK_SIZE];
while (bytes >= CHACHA20_BLOCK_SIZE * 4) {
+ kernel_neon_begin();
chacha20_4block_xor_neon(state, dst, src);
+ kernel_neon_end();
bytes -= CHACHA20_BLOCK_SIZE * 4;
src += CHACHA20_BLOCK_SIZE * 4;
dst += CHACHA20_BLOCK_SIZE * 4;
state[12] += 4;
}
+
+ if (!bytes)
+ return;
+
+ kernel_neon_begin();
while (bytes >= CHACHA20_BLOCK_SIZE) {
chacha20_block_xor_neon(state, dst, src);
bytes -= CHACHA20_BLOCK_SIZE;
@@ -55,6 +62,7 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
chacha20_block_xor_neon(state, buf, buf);
memcpy(dst, buf, bytes);
}
+ kernel_neon_end();
}
static int chacha20_neon(struct skcipher_request *req)
@@ -68,11 +76,10 @@ static int chacha20_neon(struct skcipher_request *req)
if (!may_use_simd() || req->cryptlen <= CHACHA20_BLOCK_SIZE)
return crypto_chacha20_crypt(req);
- err = skcipher_walk_virt(&walk, req, true);
+ err = skcipher_walk_virt(&walk, req, false);
crypto_chacha20_init(state, ctx, walk.iv);
- kernel_neon_begin();
while (walk.nbytes > 0) {
unsigned int nbytes = walk.nbytes;
@@ -83,7 +90,6 @@ static int chacha20_neon(struct skcipher_request *req)
nbytes);
err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
}
- kernel_neon_end();
return err;
}
--
2.11.0
* [PATCH v3 06/20] crypto: arm64/aes-blk - remove configurable interleave
@ 2017-12-06 19:43 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 19:43 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
The AES block mode implementation using Crypto Extensions or plain NEON
was written before real hardware existed, and so its interleave factor
was made build-time configurable (along with an option to instantiate
all interleaved sequences inline rather than as subroutines).
We ended up using INTERLEAVE=4 with inlining disabled for both flavors
of the core AES routines, so let's stick with that, and remove the option
to configure this at build time. This makes the code easier to modify,
which is nice now that we're adding yield support.
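With the interleave fixed at 4, the control flow that remains per mode is just a 4-way main loop followed by a one-block tail loop; in C-like pseudocode (a sketch only, with encrypt_4_blocks()/encrypt_1_block() as hypothetical stand-ins for the bl aes_encrypt_block4x call and the encrypt_block macro):

        while (blocks >= 4) {                   /* .LecbencloopNx */
                encrypt_4_blocks(out, in, rk, rounds);
                in += 4 * AES_BLOCK_SIZE;
                out += 4 * AES_BLOCK_SIZE;
                blocks -= 4;
        }
        while (blocks > 0) {                    /* .Lecbencloop: one block at a time */
                encrypt_1_block(out, in, rk, rounds);
                in += AES_BLOCK_SIZE;
                out += AES_BLOCK_SIZE;
                blocks--;
        }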
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/Makefile | 3 -
arch/arm64/crypto/aes-modes.S | 237 ++++----------------
2 files changed, 40 insertions(+), 200 deletions(-)
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index b5edc5918c28..aaf4e9afd750 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -50,9 +50,6 @@ aes-arm64-y := aes-cipher-core.o aes-cipher-glue.o
obj-$(CONFIG_CRYPTO_AES_ARM64_BS) += aes-neon-bs.o
aes-neon-bs-y := aes-neonbs-core.o aes-neonbs-glue.o
-AFLAGS_aes-ce.o := -DINTERLEAVE=4
-AFLAGS_aes-neon.o := -DINTERLEAVE=4
-
CFLAGS_aes-glue-ce.o := -DUSE_V8_CRYPTO_EXTENSIONS
$(obj)/aes-glue-%.o: $(src)/aes-glue.c FORCE
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 65b273667b34..27a235b2ddee 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -13,44 +13,6 @@
.text
.align 4
-/*
- * There are several ways to instantiate this code:
- * - no interleave, all inline
- * - 2-way interleave, 2x calls out of line (-DINTERLEAVE=2)
- * - 2-way interleave, all inline (-DINTERLEAVE=2 -DINTERLEAVE_INLINE)
- * - 4-way interleave, 4x calls out of line (-DINTERLEAVE=4)
- * - 4-way interleave, all inline (-DINTERLEAVE=4 -DINTERLEAVE_INLINE)
- *
- * Macros imported by this code:
- * - enc_prepare - setup NEON registers for encryption
- * - dec_prepare - setup NEON registers for decryption
- * - enc_switch_key - change to new key after having prepared for encryption
- * - encrypt_block - encrypt a single block
- * - decrypt block - decrypt a single block
- * - encrypt_block2x - encrypt 2 blocks in parallel (if INTERLEAVE == 2)
- * - decrypt_block2x - decrypt 2 blocks in parallel (if INTERLEAVE == 2)
- * - encrypt_block4x - encrypt 4 blocks in parallel (if INTERLEAVE == 4)
- * - decrypt_block4x - decrypt 4 blocks in parallel (if INTERLEAVE == 4)
- */
-
-#if defined(INTERLEAVE) && !defined(INTERLEAVE_INLINE)
-#define FRAME_PUSH stp x29, x30, [sp,#-16]! ; mov x29, sp
-#define FRAME_POP ldp x29, x30, [sp],#16
-
-#if INTERLEAVE == 2
-
-aes_encrypt_block2x:
- encrypt_block2x v0, v1, w3, x2, x8, w7
- ret
-ENDPROC(aes_encrypt_block2x)
-
-aes_decrypt_block2x:
- decrypt_block2x v0, v1, w3, x2, x8, w7
- ret
-ENDPROC(aes_decrypt_block2x)
-
-#elif INTERLEAVE == 4
-
aes_encrypt_block4x:
encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
ret
@@ -61,48 +23,6 @@ aes_decrypt_block4x:
ret
ENDPROC(aes_decrypt_block4x)
-#else
-#error INTERLEAVE should equal 2 or 4
-#endif
-
- .macro do_encrypt_block2x
- bl aes_encrypt_block2x
- .endm
-
- .macro do_decrypt_block2x
- bl aes_decrypt_block2x
- .endm
-
- .macro do_encrypt_block4x
- bl aes_encrypt_block4x
- .endm
-
- .macro do_decrypt_block4x
- bl aes_decrypt_block4x
- .endm
-
-#else
-#define FRAME_PUSH
-#define FRAME_POP
-
- .macro do_encrypt_block2x
- encrypt_block2x v0, v1, w3, x2, x8, w7
- .endm
-
- .macro do_decrypt_block2x
- decrypt_block2x v0, v1, w3, x2, x8, w7
- .endm
-
- .macro do_encrypt_block4x
- encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
- .endm
-
- .macro do_decrypt_block4x
- decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
- .endm
-
-#endif
-
/*
* aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
* int blocks)
@@ -111,28 +31,21 @@ ENDPROC(aes_decrypt_block4x)
*/
AES_ENTRY(aes_ecb_encrypt)
- FRAME_PUSH
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
enc_prepare w3, x2, x5
.LecbencloopNx:
-#if INTERLEAVE >= 2
- subs w4, w4, #INTERLEAVE
+ subs w4, w4, #4
bmi .Lecbenc1x
-#if INTERLEAVE == 2
- ld1 {v0.16b-v1.16b}, [x1], #32 /* get 2 pt blocks */
- do_encrypt_block2x
- st1 {v0.16b-v1.16b}, [x0], #32
-#else
ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
- do_encrypt_block4x
+ bl aes_encrypt_block4x
st1 {v0.16b-v3.16b}, [x0], #64
-#endif
b .LecbencloopNx
.Lecbenc1x:
- adds w4, w4, #INTERLEAVE
+ adds w4, w4, #4
beq .Lecbencout
-#endif
.Lecbencloop:
ld1 {v0.16b}, [x1], #16 /* get next pt block */
encrypt_block v0, w3, x2, x5, w6
@@ -140,34 +53,27 @@ AES_ENTRY(aes_ecb_encrypt)
subs w4, w4, #1
bne .Lecbencloop
.Lecbencout:
- FRAME_POP
+ ldp x29, x30, [sp], #16
ret
AES_ENDPROC(aes_ecb_encrypt)
AES_ENTRY(aes_ecb_decrypt)
- FRAME_PUSH
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
dec_prepare w3, x2, x5
.LecbdecloopNx:
-#if INTERLEAVE >= 2
- subs w4, w4, #INTERLEAVE
+ subs w4, w4, #4
bmi .Lecbdec1x
-#if INTERLEAVE == 2
- ld1 {v0.16b-v1.16b}, [x1], #32 /* get 2 ct blocks */
- do_decrypt_block2x
- st1 {v0.16b-v1.16b}, [x0], #32
-#else
ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
- do_decrypt_block4x
+ bl aes_decrypt_block4x
st1 {v0.16b-v3.16b}, [x0], #64
-#endif
b .LecbdecloopNx
.Lecbdec1x:
- adds w4, w4, #INTERLEAVE
+ adds w4, w4, #4
beq .Lecbdecout
-#endif
.Lecbdecloop:
ld1 {v0.16b}, [x1], #16 /* get next ct block */
decrypt_block v0, w3, x2, x5, w6
@@ -175,7 +81,7 @@ AES_ENTRY(aes_ecb_decrypt)
subs w4, w4, #1
bne .Lecbdecloop
.Lecbdecout:
- FRAME_POP
+ ldp x29, x30, [sp], #16
ret
AES_ENDPROC(aes_ecb_decrypt)
@@ -204,30 +110,20 @@ AES_ENDPROC(aes_cbc_encrypt)
AES_ENTRY(aes_cbc_decrypt)
- FRAME_PUSH
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
ld1 {v7.16b}, [x5] /* get iv */
dec_prepare w3, x2, x6
.LcbcdecloopNx:
-#if INTERLEAVE >= 2
- subs w4, w4, #INTERLEAVE
+ subs w4, w4, #4
bmi .Lcbcdec1x
-#if INTERLEAVE == 2
- ld1 {v0.16b-v1.16b}, [x1], #32 /* get 2 ct blocks */
- mov v2.16b, v0.16b
- mov v3.16b, v1.16b
- do_decrypt_block2x
- eor v0.16b, v0.16b, v7.16b
- eor v1.16b, v1.16b, v2.16b
- mov v7.16b, v3.16b
- st1 {v0.16b-v1.16b}, [x0], #32
-#else
ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
mov v4.16b, v0.16b
mov v5.16b, v1.16b
mov v6.16b, v2.16b
- do_decrypt_block4x
+ bl aes_decrypt_block4x
sub x1, x1, #16
eor v0.16b, v0.16b, v7.16b
eor v1.16b, v1.16b, v4.16b
@@ -235,12 +131,10 @@ AES_ENTRY(aes_cbc_decrypt)
eor v2.16b, v2.16b, v5.16b
eor v3.16b, v3.16b, v6.16b
st1 {v0.16b-v3.16b}, [x0], #64
-#endif
b .LcbcdecloopNx
.Lcbcdec1x:
- adds w4, w4, #INTERLEAVE
+ adds w4, w4, #4
beq .Lcbcdecout
-#endif
.Lcbcdecloop:
ld1 {v1.16b}, [x1], #16 /* get next ct block */
mov v0.16b, v1.16b /* ...and copy to v0 */
@@ -251,8 +145,8 @@ AES_ENTRY(aes_cbc_decrypt)
subs w4, w4, #1
bne .Lcbcdecloop
.Lcbcdecout:
- FRAME_POP
st1 {v7.16b}, [x5] /* return iv */
+ ldp x29, x30, [sp], #16
ret
AES_ENDPROC(aes_cbc_decrypt)
@@ -263,34 +157,19 @@ AES_ENDPROC(aes_cbc_decrypt)
*/
AES_ENTRY(aes_ctr_encrypt)
- FRAME_PUSH
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
enc_prepare w3, x2, x6
ld1 {v4.16b}, [x5]
umov x6, v4.d[1] /* keep swabbed ctr in reg */
rev x6, x6
-#if INTERLEAVE >= 2
cmn w6, w4 /* 32 bit overflow? */
bcs .Lctrloop
.LctrloopNx:
- subs w4, w4, #INTERLEAVE
+ subs w4, w4, #4
bmi .Lctr1x
-#if INTERLEAVE == 2
- mov v0.8b, v4.8b
- mov v1.8b, v4.8b
- rev x7, x6
- add x6, x6, #1
- ins v0.d[1], x7
- rev x7, x6
- add x6, x6, #1
- ins v1.d[1], x7
- ld1 {v2.16b-v3.16b}, [x1], #32 /* get 2 input blocks */
- do_encrypt_block2x
- eor v0.16b, v0.16b, v2.16b
- eor v1.16b, v1.16b, v3.16b
- st1 {v0.16b-v1.16b}, [x0], #32
-#else
ldr q8, =0x30000000200000001 /* addends 1,2,3[,0] */
dup v7.4s, w6
mov v0.16b, v4.16b
@@ -303,23 +182,21 @@ AES_ENTRY(aes_ctr_encrypt)
mov v2.s[3], v8.s[1]
mov v3.s[3], v8.s[2]
ld1 {v5.16b-v7.16b}, [x1], #48 /* get 3 input blocks */
- do_encrypt_block4x
+ bl aes_encrypt_block4x
eor v0.16b, v5.16b, v0.16b
ld1 {v5.16b}, [x1], #16 /* get 1 input block */
eor v1.16b, v6.16b, v1.16b
eor v2.16b, v7.16b, v2.16b
eor v3.16b, v5.16b, v3.16b
st1 {v0.16b-v3.16b}, [x0], #64
- add x6, x6, #INTERLEAVE
-#endif
+ add x6, x6, #4
rev x7, x6
ins v4.d[1], x7
cbz w4, .Lctrout
b .LctrloopNx
.Lctr1x:
- adds w4, w4, #INTERLEAVE
+ adds w4, w4, #4
beq .Lctrout
-#endif
.Lctrloop:
mov v0.16b, v4.16b
encrypt_block v0, w3, x2, x8, w7
@@ -339,12 +216,12 @@ AES_ENTRY(aes_ctr_encrypt)
.Lctrout:
st1 {v4.16b}, [x5] /* return next CTR value */
- FRAME_POP
+ ldp x29, x30, [sp], #16
ret
.Lctrtailblock:
st1 {v0.16b}, [x0]
- FRAME_POP
+ ldp x29, x30, [sp], #16
ret
.Lctrcarry:
@@ -378,7 +255,9 @@ CPU_LE( .quad 1, 0x87 )
CPU_BE( .quad 0x87, 1 )
AES_ENTRY(aes_xts_encrypt)
- FRAME_PUSH
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
+
ld1 {v4.16b}, [x6]
cbz w7, .Lxtsencnotfirst
@@ -394,25 +273,8 @@ AES_ENTRY(aes_xts_encrypt)
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
.LxtsencNx:
-#if INTERLEAVE >= 2
- subs w4, w4, #INTERLEAVE
+ subs w4, w4, #4
bmi .Lxtsenc1x
-#if INTERLEAVE == 2
- ld1 {v0.16b-v1.16b}, [x1], #32 /* get 2 pt blocks */
- next_tweak v5, v4, v7, v8
- eor v0.16b, v0.16b, v4.16b
- eor v1.16b, v1.16b, v5.16b
- do_encrypt_block2x
- eor v0.16b, v0.16b, v4.16b
- eor v1.16b, v1.16b, v5.16b
- st1 {v0.16b-v1.16b}, [x0], #32
- cbz w4, .LxtsencoutNx
- next_tweak v4, v5, v7, v8
- b .LxtsencNx
-.LxtsencoutNx:
- mov v4.16b, v5.16b
- b .Lxtsencout
-#else
ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
next_tweak v5, v4, v7, v8
eor v0.16b, v0.16b, v4.16b
@@ -421,7 +283,7 @@ AES_ENTRY(aes_xts_encrypt)
eor v2.16b, v2.16b, v6.16b
next_tweak v7, v6, v7, v8
eor v3.16b, v3.16b, v7.16b
- do_encrypt_block4x
+ bl aes_encrypt_block4x
eor v3.16b, v3.16b, v7.16b
eor v0.16b, v0.16b, v4.16b
eor v1.16b, v1.16b, v5.16b
@@ -430,11 +292,9 @@ AES_ENTRY(aes_xts_encrypt)
mov v4.16b, v7.16b
cbz w4, .Lxtsencout
b .LxtsencloopNx
-#endif
.Lxtsenc1x:
- adds w4, w4, #INTERLEAVE
+ adds w4, w4, #4
beq .Lxtsencout
-#endif
.Lxtsencloop:
ld1 {v1.16b}, [x1], #16
eor v0.16b, v1.16b, v4.16b
@@ -447,13 +307,15 @@ AES_ENTRY(aes_xts_encrypt)
b .Lxtsencloop
.Lxtsencout:
st1 {v4.16b}, [x6]
- FRAME_POP
+ ldp x29, x30, [sp], #16
ret
AES_ENDPROC(aes_xts_encrypt)
AES_ENTRY(aes_xts_decrypt)
- FRAME_PUSH
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
+
ld1 {v4.16b}, [x6]
cbz w7, .Lxtsdecnotfirst
@@ -469,25 +331,8 @@ AES_ENTRY(aes_xts_decrypt)
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
.LxtsdecNx:
-#if INTERLEAVE >= 2
- subs w4, w4, #INTERLEAVE
+ subs w4, w4, #4
bmi .Lxtsdec1x
-#if INTERLEAVE == 2
- ld1 {v0.16b-v1.16b}, [x1], #32 /* get 2 ct blocks */
- next_tweak v5, v4, v7, v8
- eor v0.16b, v0.16b, v4.16b
- eor v1.16b, v1.16b, v5.16b
- do_decrypt_block2x
- eor v0.16b, v0.16b, v4.16b
- eor v1.16b, v1.16b, v5.16b
- st1 {v0.16b-v1.16b}, [x0], #32
- cbz w4, .LxtsdecoutNx
- next_tweak v4, v5, v7, v8
- b .LxtsdecNx
-.LxtsdecoutNx:
- mov v4.16b, v5.16b
- b .Lxtsdecout
-#else
ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
next_tweak v5, v4, v7, v8
eor v0.16b, v0.16b, v4.16b
@@ -496,7 +341,7 @@ AES_ENTRY(aes_xts_decrypt)
eor v2.16b, v2.16b, v6.16b
next_tweak v7, v6, v7, v8
eor v3.16b, v3.16b, v7.16b
- do_decrypt_block4x
+ bl aes_decrypt_block4x
eor v3.16b, v3.16b, v7.16b
eor v0.16b, v0.16b, v4.16b
eor v1.16b, v1.16b, v5.16b
@@ -505,11 +350,9 @@ AES_ENTRY(aes_xts_decrypt)
mov v4.16b, v7.16b
cbz w4, .Lxtsdecout
b .LxtsdecloopNx
-#endif
.Lxtsdec1x:
- adds w4, w4, #INTERLEAVE
+ adds w4, w4, #4
beq .Lxtsdecout
-#endif
.Lxtsdecloop:
ld1 {v1.16b}, [x1], #16
eor v0.16b, v1.16b, v4.16b
@@ -522,7 +365,7 @@ AES_ENTRY(aes_xts_decrypt)
b .Lxtsdecloop
.Lxtsdecout:
st1 {v4.16b}, [x6]
- FRAME_POP
+ ldp x29, x30, [sp], #16
ret
AES_ENDPROC(aes_xts_decrypt)
--
2.11.0
^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH v3 07/20] crypto: arm64/aes-blk - add 4 way interleave to CBC encrypt path
2017-12-06 19:43 ` Ard Biesheuvel
@ 2017-12-06 19:43 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 19:43 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
CBC encryption is strictly sequential, and so the current AES code
simply processes the input one block at a time. However, we are
about to add yield support, which adds a bit of overhead, and which
we prefer to align with other modes in terms of granularity (i.e.,
it is better to have all routines yield every 64 bytes and not have
an exception for CBC encrypt, which yields every 16 bytes).
So unroll the loop by 4. We still cannot perform the AES algorithm in
parallel, but we can at least merge the loads and stores.
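To illustrate why the AES invocations themselves cannot be parallelised,
here is a minimal C sketch of the CBC recurrence c[i] = E_k(p[i] ^ c[i-1]);
aes_encrypt() is a hypothetical stand-in for the real encrypt_block macro:

    #include <stdint.h>
    #include <string.h>

    /* hypothetical stand-in for the single-block AES primitive */
    void aes_encrypt(uint8_t out[16], const uint8_t in[16], const void *key);

    static void cbc_encrypt_sketch(uint8_t *ct, const uint8_t *pt, int blocks,
                                   const void *key, uint8_t iv[16])
    {
            for (int i = 0; i < blocks; i++) {
                    uint8_t buf[16];

                    /* xor the plaintext with the previous ciphertext ... */
                    for (int j = 0; j < 16; j++)
                            buf[j] = pt[16 * i + j] ^ iv[j];
                    /* ... so each call depends on the previous iteration */
                    aes_encrypt(iv, buf, key);
                    memcpy(&ct[16 * i], iv, 16);
            }
            /* iv now holds the last ciphertext block, i.e. the returned iv */
    }

The 4-way unrolled loop in the hunk below keeps exactly this dependency
chain; it only batches the ld1/st1 of four blocks into single instructions.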
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-modes.S | 31 ++++++++++++++++----
1 file changed, 25 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 27a235b2ddee..e86535a1329d 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -94,17 +94,36 @@ AES_ENDPROC(aes_ecb_decrypt)
*/
AES_ENTRY(aes_cbc_encrypt)
- ld1 {v0.16b}, [x5] /* get iv */
+ ld1 {v4.16b}, [x5] /* get iv */
enc_prepare w3, x2, x6
-.Lcbcencloop:
- ld1 {v1.16b}, [x1], #16 /* get next pt block */
- eor v0.16b, v0.16b, v1.16b /* ..and xor with iv */
+.Lcbcencloop4x:
+ subs w4, w4, #4
+ bmi .Lcbcenc1x
+ ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
+ eor v0.16b, v0.16b, v4.16b /* ..and xor with iv */
encrypt_block v0, w3, x2, x6, w7
- st1 {v0.16b}, [x0], #16
+ eor v1.16b, v1.16b, v0.16b
+ encrypt_block v1, w3, x2, x6, w7
+ eor v2.16b, v2.16b, v1.16b
+ encrypt_block v2, w3, x2, x6, w7
+ eor v3.16b, v3.16b, v2.16b
+ encrypt_block v3, w3, x2, x6, w7
+ st1 {v0.16b-v3.16b}, [x0], #64
+ mov v4.16b, v3.16b
+ b .Lcbcencloop4x
+.Lcbcenc1x:
+ adds w4, w4, #4
+ beq .Lcbcencout
+.Lcbcencloop:
+ ld1 {v0.16b}, [x1], #16 /* get next pt block */
+ eor v4.16b, v4.16b, v0.16b /* ..and xor with iv */
+ encrypt_block v4, w3, x2, x6, w7
+ st1 {v4.16b}, [x0], #16
subs w4, w4, #1
bne .Lcbcencloop
- st1 {v0.16b}, [x5] /* return iv */
+.Lcbcencout:
+ st1 {v4.16b}, [x5] /* return iv */
ret
AES_ENDPROC(aes_cbc_encrypt)
--
2.11.0
^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH v3 08/20] crypto: arm64/aes-blk - add 4 way interleave to CBC-MAC encrypt path
2017-12-06 19:43 ` Ard Biesheuvel
@ 2017-12-06 19:43 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 19:43 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
CBC MAC is strictly sequential, and so the current AES code simply
processes the input one block at a time. However, we are about to add
yield support, which adds a bit of overhead, and which we prefer to
align with other modes in terms of granularity (i.e., it is better to
have all routines yield every 64 bytes and not have an exception for
CBC MAC, which yields every 16 bytes).
So unroll the loop by 4. We still cannot perform the AES algorithm in
parallel, but we can at least merge the loads and stores.
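As a minimal sketch (not the actual kernel code) of the unrolled shape:
four plaintext blocks are fetched with a single load, but each one is still
folded into the single digest value sequentially, as dg = E_k(dg ^ p[i]);
aes_encrypt() is again a hypothetical stand-in for the real primitive:

    #include <stdint.h>
    #include <string.h>

    /* hypothetical stand-in for the single-block AES primitive */
    void aes_encrypt(uint8_t out[16], const uint8_t in[16], const void *key);

    static void cbc_mac_sketch(uint8_t dg[16], const uint8_t *pt, int blocks,
                               const void *key)
    {
            while (blocks >= 4) {
                    uint8_t in[4][16];

                    memcpy(in, pt, 64);          /* one 64 byte load (ld1) */
                    for (int i = 0; i < 4; i++) {
                            for (int j = 0; j < 16; j++)
                                    dg[j] ^= in[i][j];
                            aes_encrypt(dg, dg, key);
                    }
                    pt += 64;
                    blocks -= 4;
            }
            while (blocks-- > 0) {               /* single-block tail loop */
                    for (int j = 0; j < 16; j++)
                            dg[j] ^= pt[j];
                    aes_encrypt(dg, dg, key);
                    pt += 16;
            }
    }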
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/aes-modes.S | 23 ++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index e86535a1329d..a68412e1e3a4 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -395,8 +395,28 @@ AES_ENDPROC(aes_xts_decrypt)
AES_ENTRY(aes_mac_update)
ld1 {v0.16b}, [x4] /* get dg */
enc_prepare w2, x1, x7
- cbnz w5, .Lmacenc
+ cbz w5, .Lmacloop4x
+ encrypt_block v0, w2, x1, x7, w8
+
+.Lmacloop4x:
+ subs w3, w3, #4
+ bmi .Lmac1x
+ ld1 {v1.16b-v4.16b}, [x0], #64 /* get next 4 pt blocks */
+ eor v0.16b, v0.16b, v1.16b /* ..and xor with dg */
+ encrypt_block v0, w2, x1, x7, w8
+ eor v0.16b, v0.16b, v2.16b
+ encrypt_block v0, w2, x1, x7, w8
+ eor v0.16b, v0.16b, v3.16b
+ encrypt_block v0, w2, x1, x7, w8
+ eor v0.16b, v0.16b, v4.16b
+ cmp w3, wzr
+ csinv x5, x6, xzr, eq
+ cbz w5, .Lmacout
+ encrypt_block v0, w2, x1, x7, w8
+ b .Lmacloop4x
+.Lmac1x:
+ add w3, w3, #4
.Lmacloop:
cbz w3, .Lmacout
ld1 {v1.16b}, [x0], #16 /* get next pt block */
@@ -406,7 +426,6 @@ AES_ENTRY(aes_mac_update)
csinv x5, x6, xzr, eq
cbz w5, .Lmacout
-.Lmacenc:
encrypt_block v0, w2, x1, x7, w8
b .Lmacloop
--
2.11.0
^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH v3 09/20] crypto: arm64/sha256-neon - play nice with CONFIG_PREEMPT kernels
2017-12-06 19:43 ` Ard Biesheuvel
@ 2017-12-06 19:43 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 19:43 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
Tweak the SHA256 update routines to invoke the SHA256 block transform
block by block, to avoid excessive scheduling delays caused by the
NEON algorithm running with preemption disabled.
Also, remove a stale comment which no longer applies now that kernel
mode NEON is actually disallowed in some contexts.
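For illustration, using the chunking logic in the hunk below: with 16 bytes
already buffered in the descriptor (sctx->count % SHA256_BLOCK_SIZE == 16)
and a 200 byte update on a CONFIG_PREEMPT kernel, the loop runs four
kernel_neon_begin()/kernel_neon_end() sections covering 48, 64, 64 and 24
bytes respectively, so no single NEON section spans more than one 64 byte
SHA-256 block transform.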
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/sha256-glue.c | 36 +++++++++++++-------
1 file changed, 23 insertions(+), 13 deletions(-)
diff --git a/arch/arm64/crypto/sha256-glue.c b/arch/arm64/crypto/sha256-glue.c
index b064d925fe2a..e8880ccdc71f 100644
--- a/arch/arm64/crypto/sha256-glue.c
+++ b/arch/arm64/crypto/sha256-glue.c
@@ -89,21 +89,32 @@ static struct shash_alg algs[] = { {
static int sha256_update_neon(struct shash_desc *desc, const u8 *data,
unsigned int len)
{
- /*
- * Stacking and unstacking a substantial slice of the NEON register
- * file may significantly affect performance for small updates when
- * executing in interrupt context, so fall back to the scalar code
- * in that case.
- */
+ struct sha256_state *sctx = shash_desc_ctx(desc);
+
if (!may_use_simd())
return sha256_base_do_update(desc, data, len,
(sha256_block_fn *)sha256_block_data_order);
- kernel_neon_begin();
- sha256_base_do_update(desc, data, len,
- (sha256_block_fn *)sha256_block_neon);
- kernel_neon_end();
+ while (len > 0) {
+ unsigned int chunk = len;
+
+ /*
+ * Don't hog the CPU for the entire time it takes to process all
+ * input when running on a preemptible kernel, but process the
+ * data block by block instead.
+ */
+ if (IS_ENABLED(CONFIG_PREEMPT) &&
+ chunk + sctx->count % SHA256_BLOCK_SIZE > SHA256_BLOCK_SIZE)
+ chunk = SHA256_BLOCK_SIZE -
+ sctx->count % SHA256_BLOCK_SIZE;
+ kernel_neon_begin();
+ sha256_base_do_update(desc, data, chunk,
+ (sha256_block_fn *)sha256_block_neon);
+ kernel_neon_end();
+ data += chunk;
+ len -= chunk;
+ }
return 0;
}
@@ -117,10 +128,9 @@ static int sha256_finup_neon(struct shash_desc *desc, const u8 *data,
sha256_base_do_finalize(desc,
(sha256_block_fn *)sha256_block_data_order);
} else {
- kernel_neon_begin();
if (len)
- sha256_base_do_update(desc, data, len,
- (sha256_block_fn *)sha256_block_neon);
+ sha256_update_neon(desc, data, len);
+ kernel_neon_begin();
sha256_base_do_finalize(desc,
(sha256_block_fn *)sha256_block_neon);
kernel_neon_end();
--
2.11.0
^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH v3 10/20] arm64: assembler: add utility macros to push/pop stack frames
2017-12-06 19:43 ` Ard Biesheuvel
@ 2017-12-06 19:43 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 19:43 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
We are going to add code to all the NEON crypto routines that will
turn them into non-leaf functions, so we need to manage the stack
frames. To make this less tedious and error-prone, add some macros
that take the number of callee-saved registers to preserve and the
extra size to allocate in the stack frame (for locals) and emit
the ldp/stp sequences.
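For example (following the macro definitions below), "frame_push 7" with no
extra locals reserves ((7 + 3) / 2) * 16 = 80 bytes and lays the frame out
as x29/x30 at [sp], x19/x20 at [sp, #16], x21/x22 at [sp, #32], x23/x24 at
[sp, #48] and x25 alone at [sp, #64]; the matching "frame_pop 7" reloads the
same registers and releases the 80 bytes.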
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/include/asm/assembler.h | 60 ++++++++++++++++++++
1 file changed, 60 insertions(+)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index aef72d886677..5f61487e9f93 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -499,6 +499,66 @@ alternative_else_nop_endif
#endif
.endm
+ /*
+ * frame_push - Push @regcount callee saved registers to the stack,
+ * starting at x19, as well as x29/x30, and set x29 to
+ * the new value of sp. Add @extra bytes of stack space
+ * for locals.
+ */
+ .macro frame_push, regcount:req, extra
+ __frame st, \regcount, \extra
+ .endm
+
+ /*
+ * frame_pop - Pop @regcount callee saved registers from the stack,
+ * starting at x19, as well as x29/x30. Also pop @extra
+ * bytes of stack space for locals.
+ */
+ .macro frame_pop, regcount:req, extra
+ __frame ld, \regcount, \extra
+ .endm
+
+ .macro __frame, op, regcount:req, extra=0
+ .ifc \op, st
+ stp x29, x30, [sp, #-((\regcount + 3) / 2) * 16 - \extra]!
+ mov x29, sp
+ .endif
+ .if \regcount < 0 || \regcount > 10
+ .error "regcount should be in the range [0 ... 10]"
+ .endif
+ .if (\extra % 16) != 0
+ .error "extra should be a multiple of 16 bytes"
+ .endif
+ .if \regcount > 1
+ \op\()p x19, x20, [sp, #16]
+ .if \regcount > 3
+ \op\()p x21, x22, [sp, #32]
+ .if \regcount > 5
+ \op\()p x23, x24, [sp, #48]
+ .if \regcount > 7
+ \op\()p x25, x26, [sp, #64]
+ .if \regcount > 9
+ \op\()p x27, x28, [sp, #80]
+ .elseif \regcount == 9
+ \op\()r x27, [sp, #80]
+ .endif
+ .elseif \regcount == 7
+ \op\()r x25, [sp, #64]
+ .endif
+ .elseif \regcount == 5
+ \op\()r x23, [sp, #48]
+ .endif
+ .elseif \regcount == 3
+ \op\()r x21, [sp, #32]
+ .endif
+ .elseif \regcount == 1
+ \op\()r x19, [sp, #16]
+ .endif
+ .ifc \op, ld
+ ldp x29, x30, [sp], #((\regcount + 3) / 2) * 16 + \extra
+ .endif
+ .endm
+
/*
* Errata workaround post TTBR0_EL1 update.
*/
--
2.11.0
^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH v3 11/20] arm64: assembler: add macros to conditionally yield the NEON under PREEMPT
2017-12-06 19:43 ` Ard Biesheuvel
@ 2017-12-06 19:43 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 19:43 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
Add support macros, which may be called from assembler code, to
conditionally yield the NEON (and thus the CPU).
In some cases, yielding the NEON involves saving and restoring a
non-trivial amount of context (especially in the CRC folding algorithms),
and so the macro is split into three, and the code in between is only
executed when the yield path is taken, allowing the context to be preserved.
The third macro takes an optional label argument that marks the resume
path after a yield has been performed.
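In C terms, the decision that these macros implement is roughly the
following (an illustrative sketch only; the real sequence lives in
assembler, and the actual save/restore of NEON state is left to the
patchup code between the macros):

    #include <linux/preempt.h>
    #include <linux/thread_info.h>
    #include <asm/neon.h>

    /*
     * Illustrative C equivalent of if_will_cond_yield_neon plus
     * do_cond_yield_neon: only yield if our own kernel_neon_begin() is
     * the sole reason preemption is disabled and a reschedule was
     * actually requested.
     */
    static void cond_yield_neon_sketch(void)
    {
            if (IS_ENABLED(CONFIG_PREEMPT) &&
                preempt_count() == PREEMPT_OFFSET &&
                test_thread_flag(TIF_NEED_RESCHED)) {
                    kernel_neon_end();   /* preempt_enable() may reschedule */
                    kernel_neon_begin();
            }
    }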
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/include/asm/assembler.h | 51 ++++++++++++++++++++
1 file changed, 51 insertions(+)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index 5f61487e9f93..c54e408fd5a7 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -572,4 +572,55 @@ alternative_else_nop_endif
#endif
.endm
+/*
+ * Check whether to yield to another runnable task from kernel mode NEON code
+ * (which runs with preemption disabled).
+ *
+ * if_will_cond_yield_neon
+ * // pre-yield patchup code
+ * do_cond_yield_neon
+ * // post-yield patchup code
+ * endif_yield_neon
+ *
+ * - Check whether the preempt count is exactly 1, in which case enabling
+ * preemption once will make the task preemptible. If this is not the case,
+ * yielding is pointless.
+ * - Check whether TIF_NEED_RESCHED is set, and if so, disable and re-enable
+ * kernel mode NEON (which will trigger a reschedule), and branch to the
+ * yield fixup code.
+ *
+ * This macro sequence clobbers x0, x1 and the flags register unconditionally,
+ * and may clobber x2 .. x18 if the yield path is taken.
+ */
+
+ .macro cond_yield_neon, lbl
+ if_will_cond_yield_neon
+ do_cond_yield_neon
+ endif_yield_neon \lbl
+ .endm
+
+ .macro if_will_cond_yield_neon
+#ifdef CONFIG_PREEMPT
+ get_thread_info x0
+ ldr w1, [x0, #TSK_TI_PREEMPT]
+ ldr x0, [x0, #TSK_TI_FLAGS]
+ cmp w1, #1 // == PREEMPT_OFFSET
+ csel x0, x0, xzr, eq
+ tbnz x0, #TIF_NEED_RESCHED, 5555f // needs rescheduling?
+#endif
+ .subsection 1
+5555:
+ .endm
+
+ .macro do_cond_yield_neon
+ bl kernel_neon_end
+ bl kernel_neon_begin
+ .endm
+
+ .macro endif_yield_neon, lbl=6666f
+ b \lbl
+ .previous
+6666:
+ .endm
+
#endif /* __ASM_ASSEMBLER_H */
--
2.11.0
^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH v3 12/20] crypto: arm64/sha1-ce - yield NEON after every block of input
2017-12-06 19:43 ` Ard Biesheuvel
@ 2017-12-06 19:43 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 19:43 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/sha1-ce-core.S | 42 ++++++++++++++------
1 file changed, 29 insertions(+), 13 deletions(-)
diff --git a/arch/arm64/crypto/sha1-ce-core.S b/arch/arm64/crypto/sha1-ce-core.S
index 8550408735a0..3139206e8787 100644
--- a/arch/arm64/crypto/sha1-ce-core.S
+++ b/arch/arm64/crypto/sha1-ce-core.S
@@ -70,31 +70,37 @@
* int blocks)
*/
ENTRY(sha1_ce_transform)
+ frame_push 3
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+
/* load round constants */
- adr x6, .Lsha1_rcon
+0: adr x6, .Lsha1_rcon
ld1r {k0.4s}, [x6], #4
ld1r {k1.4s}, [x6], #4
ld1r {k2.4s}, [x6], #4
ld1r {k3.4s}, [x6]
/* load state */
- ld1 {dgav.4s}, [x0]
- ldr dgb, [x0, #16]
+ ld1 {dgav.4s}, [x19]
+ ldr dgb, [x19, #16]
/* load sha1_ce_state::finalize */
ldr_l w4, sha1_ce_offsetof_finalize, x4
- ldr w4, [x0, x4]
+ ldr w4, [x19, x4]
/* load input */
-0: ld1 {v8.4s-v11.4s}, [x1], #64
- sub w2, w2, #1
+1: ld1 {v8.4s-v11.4s}, [x20], #64
+ sub w21, w21, #1
CPU_LE( rev32 v8.16b, v8.16b )
CPU_LE( rev32 v9.16b, v9.16b )
CPU_LE( rev32 v10.16b, v10.16b )
CPU_LE( rev32 v11.16b, v11.16b )
-1: add t0.4s, v8.4s, k0.4s
+2: add t0.4s, v8.4s, k0.4s
mov dg0v.16b, dgav.16b
add_update c, ev, k0, 8, 9, 10, 11, dgb
@@ -125,16 +131,25 @@ CPU_LE( rev32 v11.16b, v11.16b )
add dgbv.2s, dgbv.2s, dg1v.2s
add dgav.4s, dgav.4s, dg0v.4s
- cbnz w2, 0b
+ cbz w21, 3f
+
+ if_will_cond_yield_neon
+ st1 {dgav.4s}, [x19]
+ str dgb, [x19, #16]
+ do_cond_yield_neon
+ b 0b
+ endif_yield_neon
+
+ b 1b
/*
* Final block: add padding and total bit count.
* Skip if the input size was not a round multiple of the block size,
* the padding is handled by the C code in that case.
*/
- cbz x4, 3f
+3: cbz x4, 4f
ldr_l w4, sha1_ce_offsetof_count, x4
- ldr x4, [x0, x4]
+ ldr x4, [x19, x4]
movi v9.2d, #0
mov x8, #0x80000000
movi v10.2d, #0
@@ -143,10 +158,11 @@ CPU_LE( rev32 v11.16b, v11.16b )
mov x4, #0
mov v11.d[0], xzr
mov v11.d[1], x7
- b 1b
+ b 2b
/* store new state */
-3: st1 {dgav.4s}, [x0]
- str dgb, [x0, #16]
+4: st1 {dgav.4s}, [x19]
+ str dgb, [x19, #16]
+ frame_pop 3
ret
ENDPROC(sha1_ce_transform)
--
2.11.0
^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH v3 13/20] crypto: arm64/sha2-ce - yield NEON after every block of input
2017-12-06 19:43 ` Ard Biesheuvel
@ 2017-12-06 19:43 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 19:43 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/sha2-ce-core.S | 37 ++++++++++++++------
1 file changed, 26 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/crypto/sha2-ce-core.S b/arch/arm64/crypto/sha2-ce-core.S
index 679c6c002f4f..7709455dae92 100644
--- a/arch/arm64/crypto/sha2-ce-core.S
+++ b/arch/arm64/crypto/sha2-ce-core.S
@@ -77,30 +77,36 @@
* int blocks)
*/
ENTRY(sha2_ce_transform)
+ frame_push 3
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+
/* load round constants */
- adr x8, .Lsha2_rcon
+0: adr x8, .Lsha2_rcon
ld1 { v0.4s- v3.4s}, [x8], #64
ld1 { v4.4s- v7.4s}, [x8], #64
ld1 { v8.4s-v11.4s}, [x8], #64
ld1 {v12.4s-v15.4s}, [x8]
/* load state */
- ld1 {dgav.4s, dgbv.4s}, [x0]
+ ld1 {dgav.4s, dgbv.4s}, [x19]
/* load sha256_ce_state::finalize */
ldr_l w4, sha256_ce_offsetof_finalize, x4
- ldr w4, [x0, x4]
+ ldr w4, [x19, x4]
/* load input */
-0: ld1 {v16.4s-v19.4s}, [x1], #64
- sub w2, w2, #1
+1: ld1 {v16.4s-v19.4s}, [x20], #64
+ sub w21, w21, #1
CPU_LE( rev32 v16.16b, v16.16b )
CPU_LE( rev32 v17.16b, v17.16b )
CPU_LE( rev32 v18.16b, v18.16b )
CPU_LE( rev32 v19.16b, v19.16b )
-1: add t0.4s, v16.4s, v0.4s
+2: add t0.4s, v16.4s, v0.4s
mov dg0v.16b, dgav.16b
mov dg1v.16b, dgbv.16b
@@ -129,16 +135,24 @@ CPU_LE( rev32 v19.16b, v19.16b )
add dgbv.4s, dgbv.4s, dg1v.4s
/* handled all input blocks? */
- cbnz w2, 0b
+ cbz w21, 3f
+
+ if_will_cond_yield_neon
+ st1 {dgav.4s, dgbv.4s}, [x19]
+ do_cond_yield_neon
+ b 0b
+ endif_yield_neon
+
+ b 1b
/*
* Final block: add padding and total bit count.
* Skip if the input size was not a round multiple of the block size,
* the padding is handled by the C code in that case.
*/
- cbz x4, 3f
+3: cbz x4, 4f
ldr_l w4, sha256_ce_offsetof_count, x4
- ldr x4, [x0, x4]
+ ldr x4, [x19, x4]
movi v17.2d, #0
mov x8, #0x80000000
movi v18.2d, #0
@@ -147,9 +161,10 @@ CPU_LE( rev32 v19.16b, v19.16b )
mov x4, #0
mov v19.d[0], xzr
mov v19.d[1], x7
- b 1b
+ b 2b
/* store new state */
-3: st1 {dgav.4s, dgbv.4s}, [x0]
+4: st1 {dgav.4s, dgbv.4s}, [x19]
+ frame_pop 3
ret
ENDPROC(sha2_ce_transform)
--
2.11.0
^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH v3 14/20] crypto: arm64/aes-ccm - yield NEON after every block of input
2017-12-06 19:43 ` Ard Biesheuvel
@ 2017-12-06 19:43 ` Ard Biesheuvel
0 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 19:43 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
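What is specific to the CCM MAC path is that the running CBC-MAC value stays live in v0 across outer-loop iterations, so the yield has to be bracketed by a store to and a reload from memory. Below is a minimal C sketch of that round-trip, with hypothetical names only: resched_pending() and yield_neon() stand in for the NEED_RESCHED test and for the kernel_neon_end()/kernel_neon_begin() sequence around the reschedule, and the per-block transform is a placeholder.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins; illustration only, not kernel code. */
static bool resched_pending(void) { return true; }  /* pretend a resched is always due */
static void yield_neon(void)      { }               /* ~ end NEON, reschedule, begin NEON */

struct ccm_ctx { unsigned char mac[16]; };          /* memory-backed MAC, [x19] in the asm */

static void auth_block(unsigned char mac[16], const unsigned char *in)
{
        for (int i = 0; i < 16; i++)
                mac[i] ^= in[i];                    /* placeholder for the AES-CBC-MAC round */
}

static void ccm_auth(struct ccm_ctx *ctx, const unsigned char *in, int blocks)
{
        unsigned char mac[16];                      /* analogue of the v0 register */

        memcpy(mac, ctx->mac, 16);                  /* ld1 {v0.16b} on entry */
        while (blocks--) {
                auth_block(mac, in);
                in += 16;

                if (blocks && resched_pending()) {
                        memcpy(ctx->mac, mac, 16);  /* st1 before do_cond_yield_neon */
                        yield_neon();               /* NEON registers may be clobbered here */
                        memcpy(mac, ctx->mac, 16);  /* "reload mac" after the yield */
                }
        }
        memcpy(ctx->mac, mac, 16);                  /* final store of the MAC */
}

int main(void)
{
        struct ccm_ctx ctx = { { 0 } };
        unsigned char data[2 * 16] = { 0xaa };

        ccm_auth(&ctx, data, 2);
        printf("mac[0] = %02x\n", ctx.mac[0]);
        return 0;
}

The encrypt/decrypt path in aes_ccm_do_crypt follows the same shape, storing and reloading the MAC at [x24] around its yield.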
arch/arm64/crypto/aes-ce-ccm-core.S | 150 +++++++++++++-------
1 file changed, 95 insertions(+), 55 deletions(-)
diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S b/arch/arm64/crypto/aes-ce-ccm-core.S
index e3a375c4cb83..22ee196cae00 100644
--- a/arch/arm64/crypto/aes-ce-ccm-core.S
+++ b/arch/arm64/crypto/aes-ce-ccm-core.S
@@ -19,24 +19,33 @@
* u32 *macp, u8 const rk[], u32 rounds);
*/
ENTRY(ce_aes_ccm_auth_data)
- ldr w8, [x3] /* leftover from prev round? */
+ frame_push 7
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
+
+ ldr w25, [x22] /* leftover from prev round? */
ld1 {v0.16b}, [x0] /* load mac */
- cbz w8, 1f
- sub w8, w8, #16
+ cbz w25, 1f
+ sub w25, w25, #16
eor v1.16b, v1.16b, v1.16b
-0: ldrb w7, [x1], #1 /* get 1 byte of input */
- subs w2, w2, #1
- add w8, w8, #1
+0: ldrb w7, [x20], #1 /* get 1 byte of input */
+ subs w21, w21, #1
+ add w25, w25, #1
ins v1.b[0], w7
ext v1.16b, v1.16b, v1.16b, #1 /* rotate in the input bytes */
beq 8f /* out of input? */
- cbnz w8, 0b
+ cbnz w25, 0b
eor v0.16b, v0.16b, v1.16b
-1: ld1 {v3.4s}, [x4] /* load first round key */
- prfm pldl1strm, [x1]
- cmp w5, #12 /* which key size? */
- add x6, x4, #16
- sub w7, w5, #2 /* modified # of rounds */
+1: ld1 {v3.4s}, [x23] /* load first round key */
+ prfm pldl1strm, [x20]
+ cmp w24, #12 /* which key size? */
+ add x6, x23, #16
+ sub w7, w24, #2 /* modified # of rounds */
bmi 2f
bne 5f
mov v5.16b, v3.16b
@@ -55,33 +64,43 @@ ENTRY(ce_aes_ccm_auth_data)
ld1 {v5.4s}, [x6], #16 /* load next round key */
bpl 3b
aese v0.16b, v4.16b
- subs w2, w2, #16 /* last data? */
+ subs w21, w21, #16 /* last data? */
eor v0.16b, v0.16b, v5.16b /* final round */
bmi 6f
- ld1 {v1.16b}, [x1], #16 /* load next input block */
+ ld1 {v1.16b}, [x20], #16 /* load next input block */
eor v0.16b, v0.16b, v1.16b /* xor with mac */
- bne 1b
-6: st1 {v0.16b}, [x0] /* store mac */
+ beq 6f
+
+ if_will_cond_yield_neon
+ st1 {v0.16b}, [x19] /* store mac */
+ do_cond_yield_neon
+ ld1 {v0.16b}, [x19] /* reload mac */
+ endif_yield_neon
+
+ b 1b
+6: st1 {v0.16b}, [x19] /* store mac */
beq 10f
- adds w2, w2, #16
+ adds w21, w21, #16
beq 10f
- mov w8, w2
-7: ldrb w7, [x1], #1
+ mov w25, w21
+7: ldrb w7, [x20], #1
umov w6, v0.b[0]
eor w6, w6, w7
- strb w6, [x0], #1
- subs w2, w2, #1
+ strb w6, [x19], #1
+ subs w21, w21, #1
beq 10f
ext v0.16b, v0.16b, v0.16b, #1 /* rotate out the mac bytes */
b 7b
-8: mov w7, w8
- add w8, w8, #16
+8: mov w7, w25
+ add w25, w25, #16
9: ext v1.16b, v1.16b, v1.16b, #1
adds w7, w7, #1
bne 9b
eor v0.16b, v0.16b, v1.16b
- st1 {v0.16b}, [x0]
-10: str w8, [x3]
+ st1 {v0.16b}, [x19]
+10: str w25, [x22]
+
+ frame_pop 7
ret
ENDPROC(ce_aes_ccm_auth_data)
@@ -126,19 +145,29 @@ ENTRY(ce_aes_ccm_final)
ENDPROC(ce_aes_ccm_final)
.macro aes_ccm_do_crypt,enc
- ldr x8, [x6, #8] /* load lower ctr */
- ld1 {v0.16b}, [x5] /* load mac */
-CPU_LE( rev x8, x8 ) /* keep swabbed ctr in reg */
+ frame_push 8
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
+ mov x25, x6
+
+ ldr x26, [x25, #8] /* load lower ctr */
+ ld1 {v0.16b}, [x24] /* load mac */
+CPU_LE( rev x26, x26 ) /* keep swabbed ctr in reg */
0: /* outer loop */
- ld1 {v1.8b}, [x6] /* load upper ctr */
- prfm pldl1strm, [x1]
- add x8, x8, #1
- rev x9, x8
- cmp w4, #12 /* which key size? */
- sub w7, w4, #2 /* get modified # of rounds */
+ ld1 {v1.8b}, [x25] /* load upper ctr */
+ prfm pldl1strm, [x20]
+ add x26, x26, #1
+ rev x9, x26
+ cmp w23, #12 /* which key size? */
+ sub w7, w23, #2 /* get modified # of rounds */
ins v1.d[1], x9 /* no carry in lower ctr */
- ld1 {v3.4s}, [x3] /* load first round key */
- add x10, x3, #16
+ ld1 {v3.4s}, [x22] /* load first round key */
+ add x10, x22, #16
bmi 1f
bne 4f
mov v5.16b, v3.16b
@@ -165,9 +194,9 @@ CPU_LE( rev x8, x8 ) /* keep swabbed ctr in reg */
bpl 2b
aese v0.16b, v4.16b
aese v1.16b, v4.16b
- subs w2, w2, #16
- bmi 6f /* partial block? */
- ld1 {v2.16b}, [x1], #16 /* load next input block */
+ subs w21, w21, #16
+ bmi 7f /* partial block? */
+ ld1 {v2.16b}, [x20], #16 /* load next input block */
.if \enc == 1
eor v2.16b, v2.16b, v5.16b /* final round enc+mac */
eor v1.16b, v1.16b, v2.16b /* xor with crypted ctr */
@@ -176,18 +205,29 @@ CPU_LE( rev x8, x8 ) /* keep swabbed ctr in reg */
eor v1.16b, v2.16b, v5.16b /* final round enc */
.endif
eor v0.16b, v0.16b, v2.16b /* xor mac with pt ^ rk[last] */
- st1 {v1.16b}, [x0], #16 /* write output block */
- bne 0b
-CPU_LE( rev x8, x8 )
- st1 {v0.16b}, [x5] /* store mac */
- str x8, [x6, #8] /* store lsb end of ctr (BE) */
-5: ret
-
-6: eor v0.16b, v0.16b, v5.16b /* final round mac */
+ st1 {v1.16b}, [x19], #16 /* write output block */
+ beq 5f
+
+ if_will_cond_yield_neon
+ st1 {v0.16b}, [x24] /* store mac */
+ do_cond_yield_neon
+ ld1 {v0.16b}, [x24] /* reload mac */
+ endif_yield_neon
+
+ b 0b
+5:
+CPU_LE( rev x26, x26 )
+ st1 {v0.16b}, [x24] /* store mac */
+ str x26, [x25, #8] /* store lsb end of ctr (BE) */
+
+6: frame_pop 8
+ ret
+
+7: eor v0.16b, v0.16b, v5.16b /* final round mac */
eor v1.16b, v1.16b, v5.16b /* final round enc */
- st1 {v0.16b}, [x5] /* store mac */
- add w2, w2, #16 /* process partial tail block */
-7: ldrb w9, [x1], #1 /* get 1 byte of input */
+ st1 {v0.16b}, [x24] /* store mac */
+ add w21, w21, #16 /* process partial tail block */
+8: ldrb w9, [x20], #1 /* get 1 byte of input */
umov w6, v1.b[0] /* get top crypted ctr byte */
umov w7, v0.b[0] /* get top mac byte */
.if \enc == 1
@@ -197,13 +237,13 @@ CPU_LE( rev x8, x8 )
eor w9, w9, w6
eor w7, w7, w9
.endif
- strb w9, [x0], #1 /* store out byte */
- strb w7, [x5], #1 /* store mac byte */
- subs w2, w2, #1
- beq 5b
+ strb w9, [x19], #1 /* store out byte */
+ strb w7, [x24], #1 /* store mac byte */
+ subs w21, w21, #1
+ beq 6b
ext v0.16b, v0.16b, v0.16b, #1 /* shift out mac byte */
ext v1.16b, v1.16b, v1.16b, #1 /* shift out ctr byte */
- b 7b
+ b 8b
.endm
/*
--
2.11.0
^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH v3 15/20] crypto: arm64/aes-blk - yield NEON after every block of input
2017-12-06 19:43 ` Ard Biesheuvel
@ 2017-12-06 19:43 ` Ard Biesheuvel
0 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 19:43 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
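The pattern added throughout aes-modes.S is "write live state back, yield, branch to a restart label that re-derives everything that lived in NEON registers": the round keys are reloaded via enc_prepare/dec_prepare and, for the chaining modes, the IV or counter is reread from memory. Here is a minimal C sketch of that shape for CBC encryption, with hypothetical names only; the block transform is a placeholder and yield_neon() stands in for what cond_yield_neon arranges.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins; illustration only. */
static bool resched_pending(void) { return true; }
static void yield_neon(void)      { }

struct aes_key { unsigned char rk[16]; };   /* placeholder round-key material */

struct cached {                             /* analogue of round keys + IV in NEON regs */
        unsigned char round_keys[16];
        unsigned char iv[16];
};

static void load_cached(struct cached *c, const struct aes_key *key,
                        const unsigned char iv[16])
{
        memcpy(c->round_keys, key->rk, 16); /* ~ enc_prepare / load_round_keys */
        memcpy(c->iv, iv, 16);              /* ~ ld1 {v4.16b}, [x24] */
}

static void cbc_enc_block(struct cached *c, unsigned char *dst,
                          const unsigned char *src)
{
        for (int i = 0; i < 16; i++)        /* placeholder for encrypt_block */
                dst[i] = src[i] ^ c->iv[i] ^ c->round_keys[i];
        memcpy(c->iv, dst, 16);
}

static void cbc_encrypt(unsigned char *dst, const unsigned char *src, int blocks,
                        const struct aes_key *key, unsigned char iv[16])
{
        struct cached c;

restart:                                    /* ~ .Lcbcencrestart */
        load_cached(&c, key, iv);
        while (blocks--) {
                cbc_enc_block(&c, dst, src);
                dst += 16;
                src += 16;
                memcpy(iv, c.iv, 16);       /* IV back in memory before any yield */
                if (blocks && resched_pending()) {
                        yield_neon();       /* register contents can no longer be trusted */
                        goto restart;       /* ~ cond_yield_neon .Lcbcencrestart */
                }
        }
}

int main(void)
{
        struct aes_key key = { { 0x42 } };
        unsigned char iv[16] = { 0 }, in[2 * 16] = { 0 }, out[2 * 16];

        cbc_encrypt(out, in, 2, &key, iv);
        printf("out[0] = %02x\n", out[0]);
        return 0;
}

This is presumably also why enc_prepare/enc_switch_key/dec_prepare in aes-ce.S gain a real scratch register: the key pointer now has to stay intact so the round keys can be reloaded at the restart label after a yield.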
arch/arm64/crypto/aes-ce.S | 15 +-
arch/arm64/crypto/aes-modes.S | 331 ++++++++++++--------
2 files changed, 216 insertions(+), 130 deletions(-)
diff --git a/arch/arm64/crypto/aes-ce.S b/arch/arm64/crypto/aes-ce.S
index 50330f5c3adc..623e74ed1c67 100644
--- a/arch/arm64/crypto/aes-ce.S
+++ b/arch/arm64/crypto/aes-ce.S
@@ -30,18 +30,21 @@
.endm
/* prepare for encryption with key in rk[] */
- .macro enc_prepare, rounds, rk, ignore
- load_round_keys \rounds, \rk
+ .macro enc_prepare, rounds, rk, temp
+ mov \temp, \rk
+ load_round_keys \rounds, \temp
.endm
/* prepare for encryption (again) but with new key in rk[] */
- .macro enc_switch_key, rounds, rk, ignore
- load_round_keys \rounds, \rk
+ .macro enc_switch_key, rounds, rk, temp
+ mov \temp, \rk
+ load_round_keys \rounds, \temp
.endm
/* prepare for decryption with key in rk[] */
- .macro dec_prepare, rounds, rk, ignore
- load_round_keys \rounds, \rk
+ .macro dec_prepare, rounds, rk, temp
+ mov \temp, \rk
+ load_round_keys \rounds, \temp
.endm
.macro do_enc_Nx, de, mc, k, i0, i1, i2, i3
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index a68412e1e3a4..ab05772ce385 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -14,12 +14,12 @@
.align 4
aes_encrypt_block4x:
- encrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
+ encrypt_block4x v0, v1, v2, v3, w22, x21, x8, w7
ret
ENDPROC(aes_encrypt_block4x)
aes_decrypt_block4x:
- decrypt_block4x v0, v1, v2, v3, w3, x2, x8, w7
+ decrypt_block4x v0, v1, v2, v3, w22, x21, x8, w7
ret
ENDPROC(aes_decrypt_block4x)
@@ -31,57 +31,71 @@ ENDPROC(aes_decrypt_block4x)
*/
AES_ENTRY(aes_ecb_encrypt)
- stp x29, x30, [sp, #-16]!
- mov x29, sp
+ frame_push 5
- enc_prepare w3, x2, x5
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+
+.Lecbencrestart:
+ enc_prepare w22, x21, x5
.LecbencloopNx:
- subs w4, w4, #4
+ subs w23, w23, #4
bmi .Lecbenc1x
- ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
+ ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 pt blocks */
bl aes_encrypt_block4x
- st1 {v0.16b-v3.16b}, [x0], #64
+ st1 {v0.16b-v3.16b}, [x19], #64
+ cond_yield_neon .Lecbencrestart
b .LecbencloopNx
.Lecbenc1x:
- adds w4, w4, #4
+ adds w23, w23, #4
beq .Lecbencout
.Lecbencloop:
- ld1 {v0.16b}, [x1], #16 /* get next pt block */
- encrypt_block v0, w3, x2, x5, w6
- st1 {v0.16b}, [x0], #16
- subs w4, w4, #1
+ ld1 {v0.16b}, [x20], #16 /* get next pt block */
+ encrypt_block v0, w22, x21, x5, w6
+ st1 {v0.16b}, [x19], #16
+ subs w23, w23, #1
bne .Lecbencloop
.Lecbencout:
- ldp x29, x30, [sp], #16
+ frame_pop 5
ret
AES_ENDPROC(aes_ecb_encrypt)
AES_ENTRY(aes_ecb_decrypt)
- stp x29, x30, [sp, #-16]!
- mov x29, sp
+ frame_push 5
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
- dec_prepare w3, x2, x5
+.Lecbdecrestart:
+ dec_prepare w22, x21, x5
.LecbdecloopNx:
- subs w4, w4, #4
+ subs w23, w23, #4
bmi .Lecbdec1x
- ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
+ ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 ct blocks */
bl aes_decrypt_block4x
- st1 {v0.16b-v3.16b}, [x0], #64
+ st1 {v0.16b-v3.16b}, [x19], #64
+ cond_yield_neon .Lecbdecrestart
b .LecbdecloopNx
.Lecbdec1x:
- adds w4, w4, #4
+ adds w23, w23, #4
beq .Lecbdecout
.Lecbdecloop:
- ld1 {v0.16b}, [x1], #16 /* get next ct block */
- decrypt_block v0, w3, x2, x5, w6
- st1 {v0.16b}, [x0], #16
- subs w4, w4, #1
+ ld1 {v0.16b}, [x20], #16 /* get next ct block */
+ decrypt_block v0, w22, x21, x5, w6
+ st1 {v0.16b}, [x19], #16
+ subs w23, w23, #1
bne .Lecbdecloop
.Lecbdecout:
- ldp x29, x30, [sp], #16
+ frame_pop 5
ret
AES_ENDPROC(aes_ecb_decrypt)
@@ -94,78 +108,100 @@ AES_ENDPROC(aes_ecb_decrypt)
*/
AES_ENTRY(aes_cbc_encrypt)
- ld1 {v4.16b}, [x5] /* get iv */
- enc_prepare w3, x2, x6
+ frame_push 6
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
+
+.Lcbcencrestart:
+ ld1 {v4.16b}, [x24] /* get iv */
+ enc_prepare w22, x21, x6
.Lcbcencloop4x:
- subs w4, w4, #4
+ subs w23, w23, #4
bmi .Lcbcenc1x
- ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
+ ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 pt blocks */
eor v0.16b, v0.16b, v4.16b /* ..and xor with iv */
- encrypt_block v0, w3, x2, x6, w7
+ encrypt_block v0, w22, x21, x6, w7
eor v1.16b, v1.16b, v0.16b
- encrypt_block v1, w3, x2, x6, w7
+ encrypt_block v1, w22, x21, x6, w7
eor v2.16b, v2.16b, v1.16b
- encrypt_block v2, w3, x2, x6, w7
+ encrypt_block v2, w22, x21, x6, w7
eor v3.16b, v3.16b, v2.16b
- encrypt_block v3, w3, x2, x6, w7
- st1 {v0.16b-v3.16b}, [x0], #64
+ encrypt_block v3, w22, x21, x6, w7
+ st1 {v0.16b-v3.16b}, [x19], #64
mov v4.16b, v3.16b
+ st1 {v4.16b}, [x24] /* return iv */
+ cond_yield_neon .Lcbcencrestart
b .Lcbcencloop4x
.Lcbcenc1x:
- adds w4, w4, #4
+ adds w23, w23, #4
beq .Lcbcencout
.Lcbcencloop:
- ld1 {v0.16b}, [x1], #16 /* get next pt block */
+ ld1 {v0.16b}, [x20], #16 /* get next pt block */
eor v4.16b, v4.16b, v0.16b /* ..and xor with iv */
- encrypt_block v4, w3, x2, x6, w7
- st1 {v4.16b}, [x0], #16
- subs w4, w4, #1
+ encrypt_block v4, w22, x21, x6, w7
+ st1 {v4.16b}, [x19], #16
+ subs w23, w23, #1
bne .Lcbcencloop
.Lcbcencout:
- st1 {v4.16b}, [x5] /* return iv */
+ st1 {v4.16b}, [x24] /* return iv */
+ frame_pop 6
ret
AES_ENDPROC(aes_cbc_encrypt)
AES_ENTRY(aes_cbc_decrypt)
- stp x29, x30, [sp, #-16]!
- mov x29, sp
+ frame_push 6
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
- ld1 {v7.16b}, [x5] /* get iv */
- dec_prepare w3, x2, x6
+.Lcbcdecrestart:
+ ld1 {v7.16b}, [x24] /* get iv */
+ dec_prepare w22, x21, x6
.LcbcdecloopNx:
- subs w4, w4, #4
+ subs w23, w23, #4
bmi .Lcbcdec1x
- ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
+ ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 ct blocks */
mov v4.16b, v0.16b
mov v5.16b, v1.16b
mov v6.16b, v2.16b
bl aes_decrypt_block4x
- sub x1, x1, #16
+ sub x20, x20, #16
eor v0.16b, v0.16b, v7.16b
eor v1.16b, v1.16b, v4.16b
- ld1 {v7.16b}, [x1], #16 /* reload 1 ct block */
+ ld1 {v7.16b}, [x20], #16 /* reload 1 ct block */
eor v2.16b, v2.16b, v5.16b
eor v3.16b, v3.16b, v6.16b
- st1 {v0.16b-v3.16b}, [x0], #64
+ st1 {v0.16b-v3.16b}, [x19], #64
+ st1 {v7.16b}, [x24] /* return iv */
+ cond_yield_neon .Lcbcdecrestart
b .LcbcdecloopNx
.Lcbcdec1x:
- adds w4, w4, #4
+ adds w23, w23, #4
beq .Lcbcdecout
.Lcbcdecloop:
- ld1 {v1.16b}, [x1], #16 /* get next ct block */
+ ld1 {v1.16b}, [x20], #16 /* get next ct block */
mov v0.16b, v1.16b /* ...and copy to v0 */
- decrypt_block v0, w3, x2, x6, w7
+ decrypt_block v0, w22, x21, x6, w7
eor v0.16b, v0.16b, v7.16b /* xor with iv => pt */
mov v7.16b, v1.16b /* ct is next iv */
- st1 {v0.16b}, [x0], #16
- subs w4, w4, #1
+ st1 {v0.16b}, [x19], #16
+ subs w23, w23, #1
bne .Lcbcdecloop
.Lcbcdecout:
- st1 {v7.16b}, [x5] /* return iv */
- ldp x29, x30, [sp], #16
+ st1 {v7.16b}, [x24] /* return iv */
+ frame_pop 6
ret
AES_ENDPROC(aes_cbc_decrypt)
@@ -176,19 +212,26 @@ AES_ENDPROC(aes_cbc_decrypt)
*/
AES_ENTRY(aes_ctr_encrypt)
- stp x29, x30, [sp, #-16]!
- mov x29, sp
+ frame_push 6
- enc_prepare w3, x2, x6
- ld1 {v4.16b}, [x5]
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
+
+.Lctrrestart:
+ enc_prepare w22, x21, x6
+ ld1 {v4.16b}, [x24]
umov x6, v4.d[1] /* keep swabbed ctr in reg */
rev x6, x6
- cmn w6, w4 /* 32 bit overflow? */
- bcs .Lctrloop
.LctrloopNx:
- subs w4, w4, #4
+ subs w23, w23, #4
bmi .Lctr1x
+ cmn w6, #4 /* 32 bit overflow? */
+ bcs .Lctr1x
ldr q8, =0x30000000200000001 /* addends 1,2,3[,0] */
dup v7.4s, w6
mov v0.16b, v4.16b
@@ -200,25 +243,27 @@ AES_ENTRY(aes_ctr_encrypt)
mov v1.s[3], v8.s[0]
mov v2.s[3], v8.s[1]
mov v3.s[3], v8.s[2]
- ld1 {v5.16b-v7.16b}, [x1], #48 /* get 3 input blocks */
+ ld1 {v5.16b-v7.16b}, [x20], #48 /* get 3 input blocks */
bl aes_encrypt_block4x
eor v0.16b, v5.16b, v0.16b
- ld1 {v5.16b}, [x1], #16 /* get 1 input block */
+ ld1 {v5.16b}, [x20], #16 /* get 1 input block */
eor v1.16b, v6.16b, v1.16b
eor v2.16b, v7.16b, v2.16b
eor v3.16b, v5.16b, v3.16b
- st1 {v0.16b-v3.16b}, [x0], #64
+ st1 {v0.16b-v3.16b}, [x19], #64
add x6, x6, #4
rev x7, x6
ins v4.d[1], x7
- cbz w4, .Lctrout
+ cbz w23, .Lctrout
+ st1 {v4.16b}, [x24] /* return next CTR value */
+ cond_yield_neon .Lctrrestart
b .LctrloopNx
.Lctr1x:
- adds w4, w4, #4
+ adds w23, w23, #4
beq .Lctrout
.Lctrloop:
mov v0.16b, v4.16b
- encrypt_block v0, w3, x2, x8, w7
+ encrypt_block v0, w22, x21, x8, w7
adds x6, x6, #1 /* increment BE ctr */
rev x7, x6
@@ -226,22 +271,22 @@ AES_ENTRY(aes_ctr_encrypt)
bcs .Lctrcarry /* overflow? */
.Lctrcarrydone:
- subs w4, w4, #1
+ subs w23, w23, #1
bmi .Lctrtailblock /* blocks <0 means tail block */
- ld1 {v3.16b}, [x1], #16
+ ld1 {v3.16b}, [x20], #16
eor v3.16b, v0.16b, v3.16b
- st1 {v3.16b}, [x0], #16
+ st1 {v3.16b}, [x19], #16
bne .Lctrloop
.Lctrout:
- st1 {v4.16b}, [x5] /* return next CTR value */
- ldp x29, x30, [sp], #16
+ st1 {v4.16b}, [x24] /* return next CTR value */
+.Lctrret:
+ frame_pop 6
ret
.Lctrtailblock:
- st1 {v0.16b}, [x0]
- ldp x29, x30, [sp], #16
- ret
+ st1 {v0.16b}, [x19]
+ b .Lctrret
.Lctrcarry:
umov x7, v4.d[0] /* load upper word of ctr */
@@ -274,10 +319,16 @@ CPU_LE( .quad 1, 0x87 )
CPU_BE( .quad 0x87, 1 )
AES_ENTRY(aes_xts_encrypt)
- stp x29, x30, [sp, #-16]!
- mov x29, sp
+ frame_push 6
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x6
- ld1 {v4.16b}, [x6]
+ ld1 {v4.16b}, [x24]
cbz w7, .Lxtsencnotfirst
enc_prepare w3, x5, x8
@@ -286,15 +337,17 @@ AES_ENTRY(aes_xts_encrypt)
ldr q7, .Lxts_mul_x
b .LxtsencNx
+.Lxtsencrestart:
+ ld1 {v4.16b}, [x24]
.Lxtsencnotfirst:
- enc_prepare w3, x2, x8
+ enc_prepare w22, x21, x8
.LxtsencloopNx:
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
.LxtsencNx:
- subs w4, w4, #4
+ subs w23, w23, #4
bmi .Lxtsenc1x
- ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 pt blocks */
+ ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 pt blocks */
next_tweak v5, v4, v7, v8
eor v0.16b, v0.16b, v4.16b
next_tweak v6, v5, v7, v8
@@ -307,35 +360,43 @@ AES_ENTRY(aes_xts_encrypt)
eor v0.16b, v0.16b, v4.16b
eor v1.16b, v1.16b, v5.16b
eor v2.16b, v2.16b, v6.16b
- st1 {v0.16b-v3.16b}, [x0], #64
+ st1 {v0.16b-v3.16b}, [x19], #64
mov v4.16b, v7.16b
- cbz w4, .Lxtsencout
+ cbz w23, .Lxtsencout
+ st1 {v4.16b}, [x24]
+ cond_yield_neon .Lxtsencrestart
b .LxtsencloopNx
.Lxtsenc1x:
- adds w4, w4, #4
+ adds w23, w23, #4
beq .Lxtsencout
.Lxtsencloop:
- ld1 {v1.16b}, [x1], #16
+ ld1 {v1.16b}, [x20], #16
eor v0.16b, v1.16b, v4.16b
- encrypt_block v0, w3, x2, x8, w7
+ encrypt_block v0, w22, x21, x8, w7
eor v0.16b, v0.16b, v4.16b
- st1 {v0.16b}, [x0], #16
- subs w4, w4, #1
+ st1 {v0.16b}, [x19], #16
+ subs w23, w23, #1
beq .Lxtsencout
next_tweak v4, v4, v7, v8
b .Lxtsencloop
.Lxtsencout:
- st1 {v4.16b}, [x6]
- ldp x29, x30, [sp], #16
+ st1 {v4.16b}, [x24]
+ frame_pop 6
ret
AES_ENDPROC(aes_xts_encrypt)
AES_ENTRY(aes_xts_decrypt)
- stp x29, x30, [sp, #-16]!
- mov x29, sp
+ frame_push 6
- ld1 {v4.16b}, [x6]
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x6
+
+ ld1 {v4.16b}, [x24]
cbz w7, .Lxtsdecnotfirst
enc_prepare w3, x5, x8
@@ -344,15 +405,17 @@ AES_ENTRY(aes_xts_decrypt)
ldr q7, .Lxts_mul_x
b .LxtsdecNx
+.Lxtsdecrestart:
+ ld1 {v4.16b}, [x24]
.Lxtsdecnotfirst:
- dec_prepare w3, x2, x8
+ dec_prepare w22, x21, x8
.LxtsdecloopNx:
ldr q7, .Lxts_mul_x
next_tweak v4, v4, v7, v8
.LxtsdecNx:
- subs w4, w4, #4
+ subs w23, w23, #4
bmi .Lxtsdec1x
- ld1 {v0.16b-v3.16b}, [x1], #64 /* get 4 ct blocks */
+ ld1 {v0.16b-v3.16b}, [x20], #64 /* get 4 ct blocks */
next_tweak v5, v4, v7, v8
eor v0.16b, v0.16b, v4.16b
next_tweak v6, v5, v7, v8
@@ -365,26 +428,28 @@ AES_ENTRY(aes_xts_decrypt)
eor v0.16b, v0.16b, v4.16b
eor v1.16b, v1.16b, v5.16b
eor v2.16b, v2.16b, v6.16b
- st1 {v0.16b-v3.16b}, [x0], #64
+ st1 {v0.16b-v3.16b}, [x19], #64
mov v4.16b, v7.16b
- cbz w4, .Lxtsdecout
+ cbz w23, .Lxtsdecout
+ st1 {v4.16b}, [x24]
+ cond_yield_neon .Lxtsdecrestart
b .LxtsdecloopNx
.Lxtsdec1x:
- adds w4, w4, #4
+ adds w23, w23, #4
beq .Lxtsdecout
.Lxtsdecloop:
- ld1 {v1.16b}, [x1], #16
+ ld1 {v1.16b}, [x20], #16
eor v0.16b, v1.16b, v4.16b
- decrypt_block v0, w3, x2, x8, w7
+ decrypt_block v0, w22, x21, x8, w7
eor v0.16b, v0.16b, v4.16b
- st1 {v0.16b}, [x0], #16
- subs w4, w4, #1
+ st1 {v0.16b}, [x19], #16
+ subs w23, w23, #1
beq .Lxtsdecout
next_tweak v4, v4, v7, v8
b .Lxtsdecloop
.Lxtsdecout:
- st1 {v4.16b}, [x6]
- ldp x29, x30, [sp], #16
+ st1 {v4.16b}, [x24]
+ frame_pop 6
ret
AES_ENDPROC(aes_xts_decrypt)
@@ -393,43 +458,61 @@ AES_ENDPROC(aes_xts_decrypt)
* int blocks, u8 dg[], int enc_before, int enc_after)
*/
AES_ENTRY(aes_mac_update)
- ld1 {v0.16b}, [x4] /* get dg */
+ frame_push 6
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x6
+
+ ld1 {v0.16b}, [x23] /* get dg */
enc_prepare w2, x1, x7
cbz w5, .Lmacloop4x
encrypt_block v0, w2, x1, x7, w8
.Lmacloop4x:
- subs w3, w3, #4
+ subs w22, w22, #4
bmi .Lmac1x
- ld1 {v1.16b-v4.16b}, [x0], #64 /* get next pt block */
+ ld1 {v1.16b-v4.16b}, [x19], #64 /* get next pt block */
eor v0.16b, v0.16b, v1.16b /* ..and xor with dg */
- encrypt_block v0, w2, x1, x7, w8
+ encrypt_block v0, w21, x20, x7, w8
eor v0.16b, v0.16b, v2.16b
- encrypt_block v0, w2, x1, x7, w8
+ encrypt_block v0, w21, x20, x7, w8
eor v0.16b, v0.16b, v3.16b
- encrypt_block v0, w2, x1, x7, w8
+ encrypt_block v0, w21, x20, x7, w8
eor v0.16b, v0.16b, v4.16b
- cmp w3, wzr
- csinv x5, x6, xzr, eq
+ cmp w22, wzr
+ csinv x5, x24, xzr, eq
cbz w5, .Lmacout
- encrypt_block v0, w2, x1, x7, w8
+ encrypt_block v0, w21, x20, x7, w8
+ st1 {v0.16b}, [x23] /* return dg */
+ cond_yield_neon .Lmacrestart
b .Lmacloop4x
.Lmac1x:
- add w3, w3, #4
+ add w22, w22, #4
.Lmacloop:
- cbz w3, .Lmacout
- ld1 {v1.16b}, [x0], #16 /* get next pt block */
+ cbz w22, .Lmacout
+ ld1 {v1.16b}, [x19], #16 /* get next pt block */
eor v0.16b, v0.16b, v1.16b /* ..and xor with dg */
- subs w3, w3, #1
- csinv x5, x6, xzr, eq
+ subs w22, w22, #1
+ csinv x5, x24, xzr, eq
cbz w5, .Lmacout
- encrypt_block v0, w2, x1, x7, w8
+.Lmacenc:
+ encrypt_block v0, w21, x20, x7, w8
b .Lmacloop
.Lmacout:
- st1 {v0.16b}, [x4] /* return dg */
+ st1 {v0.16b}, [x23] /* return dg */
+ frame_pop 6
ret
+
+.Lmacrestart:
+ ld1 {v0.16b}, [x23] /* get dg */
+ enc_prepare w21, x20, x0
+ b .Lmacloop4x
AES_ENDPROC(aes_mac_update)
--
2.11.0
^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH v3 16/20] crypto: arm64/aes-bs - yield NEON after every block of input
2017-12-06 19:43 ` Ard Biesheuvel
@ 2017-12-06 19:43 ` Ard Biesheuvel
0 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 19:43 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
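The bit-sliced code already consumes its input in bursts of up to eight blocks, so the yield check slots in between bursts. For ECB there is no chaining state to preserve; the CBC, CTR and XTS variants make sure the IV, counter or tweak is back in memory before the check, and the CTR and XTS paths restart from a label that rereads it. A minimal C sketch of the stateless (ECB-style) case follows, with hypothetical names only and a placeholder in place of aesbs_encrypt8.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins; illustration only. */
static bool resched_pending(void) { return true; }
static void yield_neon(void)      { }   /* ~ cond_yield_neon with no state to save */

static void encrypt_up_to_8(unsigned char *dst, const unsigned char *src, int n)
{
        for (int i = 0; i < n * 16; i++)
                dst[i] = src[i] ^ 0x5a;              /* placeholder for aesbs_encrypt8 */
}

static void ecb_encrypt(unsigned char *dst, const unsigned char *src, int blocks)
{
        while (blocks > 0) {
                int n = blocks < 8 ? blocks : 8;     /* burst of up to 8 blocks */

                encrypt_up_to_8(dst, src, n);
                dst += n * 16;
                src += n * 16;
                blocks -= n;

                /* ECB keeps no chaining state, so nothing needs saving here */
                if (blocks && resched_pending())
                        yield_neon();
        }
}

int main(void)
{
        unsigned char in[10 * 16] = { 0 }, out[10 * 16];

        ecb_encrypt(out, in, 10);
        printf("out[0] = %02x\n", out[0]);
        return 0;
}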
arch/arm64/crypto/aes-neonbs-core.S | 305 +++++++++++---------
1 file changed, 170 insertions(+), 135 deletions(-)
diff --git a/arch/arm64/crypto/aes-neonbs-core.S b/arch/arm64/crypto/aes-neonbs-core.S
index ca0472500433..23659369da78 100644
--- a/arch/arm64/crypto/aes-neonbs-core.S
+++ b/arch/arm64/crypto/aes-neonbs-core.S
@@ -565,54 +565,61 @@ ENDPROC(aesbs_decrypt8)
* int blocks)
*/
.macro __ecb_crypt, do8, o0, o1, o2, o3, o4, o5, o6, o7
- stp x29, x30, [sp, #-16]!
- mov x29, sp
+ frame_push 5
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
99: mov x5, #1
- lsl x5, x5, x4
- subs w4, w4, #8
- csel x4, x4, xzr, pl
+ lsl x5, x5, x23
+ subs w23, w23, #8
+ csel x23, x23, xzr, pl
csel x5, x5, xzr, mi
- ld1 {v0.16b}, [x1], #16
+ ld1 {v0.16b}, [x20], #16
tbnz x5, #1, 0f
- ld1 {v1.16b}, [x1], #16
+ ld1 {v1.16b}, [x20], #16
tbnz x5, #2, 0f
- ld1 {v2.16b}, [x1], #16
+ ld1 {v2.16b}, [x20], #16
tbnz x5, #3, 0f
- ld1 {v3.16b}, [x1], #16
+ ld1 {v3.16b}, [x20], #16
tbnz x5, #4, 0f
- ld1 {v4.16b}, [x1], #16
+ ld1 {v4.16b}, [x20], #16
tbnz x5, #5, 0f
- ld1 {v5.16b}, [x1], #16
+ ld1 {v5.16b}, [x20], #16
tbnz x5, #6, 0f
- ld1 {v6.16b}, [x1], #16
+ ld1 {v6.16b}, [x20], #16
tbnz x5, #7, 0f
- ld1 {v7.16b}, [x1], #16
+ ld1 {v7.16b}, [x20], #16
-0: mov bskey, x2
- mov rounds, x3
+0: mov bskey, x21
+ mov rounds, x22
bl \do8
- st1 {\o0\().16b}, [x0], #16
+ st1 {\o0\().16b}, [x19], #16
tbnz x5, #1, 1f
- st1 {\o1\().16b}, [x0], #16
+ st1 {\o1\().16b}, [x19], #16
tbnz x5, #2, 1f
- st1 {\o2\().16b}, [x0], #16
+ st1 {\o2\().16b}, [x19], #16
tbnz x5, #3, 1f
- st1 {\o3\().16b}, [x0], #16
+ st1 {\o3\().16b}, [x19], #16
tbnz x5, #4, 1f
- st1 {\o4\().16b}, [x0], #16
+ st1 {\o4\().16b}, [x19], #16
tbnz x5, #5, 1f
- st1 {\o5\().16b}, [x0], #16
+ st1 {\o5\().16b}, [x19], #16
tbnz x5, #6, 1f
- st1 {\o6\().16b}, [x0], #16
+ st1 {\o6\().16b}, [x19], #16
tbnz x5, #7, 1f
- st1 {\o7\().16b}, [x0], #16
+ st1 {\o7\().16b}, [x19], #16
- cbnz x4, 99b
+ cbz x23, 1f
+ cond_yield_neon
+ b 99b
-1: ldp x29, x30, [sp], #16
+1: frame_pop 5
ret
.endm
@@ -632,43 +639,49 @@ ENDPROC(aesbs_ecb_decrypt)
*/
.align 4
ENTRY(aesbs_cbc_decrypt)
- stp x29, x30, [sp, #-16]!
- mov x29, sp
+ frame_push 6
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
99: mov x6, #1
- lsl x6, x6, x4
- subs w4, w4, #8
- csel x4, x4, xzr, pl
+ lsl x6, x6, x23
+ subs w23, w23, #8
+ csel x23, x23, xzr, pl
csel x6, x6, xzr, mi
- ld1 {v0.16b}, [x1], #16
+ ld1 {v0.16b}, [x20], #16
mov v25.16b, v0.16b
tbnz x6, #1, 0f
- ld1 {v1.16b}, [x1], #16
+ ld1 {v1.16b}, [x20], #16
mov v26.16b, v1.16b
tbnz x6, #2, 0f
- ld1 {v2.16b}, [x1], #16
+ ld1 {v2.16b}, [x20], #16
mov v27.16b, v2.16b
tbnz x6, #3, 0f
- ld1 {v3.16b}, [x1], #16
+ ld1 {v3.16b}, [x20], #16
mov v28.16b, v3.16b
tbnz x6, #4, 0f
- ld1 {v4.16b}, [x1], #16
+ ld1 {v4.16b}, [x20], #16
mov v29.16b, v4.16b
tbnz x6, #5, 0f
- ld1 {v5.16b}, [x1], #16
+ ld1 {v5.16b}, [x20], #16
mov v30.16b, v5.16b
tbnz x6, #6, 0f
- ld1 {v6.16b}, [x1], #16
+ ld1 {v6.16b}, [x20], #16
mov v31.16b, v6.16b
tbnz x6, #7, 0f
- ld1 {v7.16b}, [x1]
+ ld1 {v7.16b}, [x20]
-0: mov bskey, x2
- mov rounds, x3
+0: mov bskey, x21
+ mov rounds, x22
bl aesbs_decrypt8
- ld1 {v24.16b}, [x5] // load IV
+ ld1 {v24.16b}, [x24] // load IV
eor v1.16b, v1.16b, v25.16b
eor v6.16b, v6.16b, v26.16b
@@ -679,34 +692,36 @@ ENTRY(aesbs_cbc_decrypt)
eor v3.16b, v3.16b, v30.16b
eor v5.16b, v5.16b, v31.16b
- st1 {v0.16b}, [x0], #16
+ st1 {v0.16b}, [x19], #16
mov v24.16b, v25.16b
tbnz x6, #1, 1f
- st1 {v1.16b}, [x0], #16
+ st1 {v1.16b}, [x19], #16
mov v24.16b, v26.16b
tbnz x6, #2, 1f
- st1 {v6.16b}, [x0], #16
+ st1 {v6.16b}, [x19], #16
mov v24.16b, v27.16b
tbnz x6, #3, 1f
- st1 {v4.16b}, [x0], #16
+ st1 {v4.16b}, [x19], #16
mov v24.16b, v28.16b
tbnz x6, #4, 1f
- st1 {v2.16b}, [x0], #16
+ st1 {v2.16b}, [x19], #16
mov v24.16b, v29.16b
tbnz x6, #5, 1f
- st1 {v7.16b}, [x0], #16
+ st1 {v7.16b}, [x19], #16
mov v24.16b, v30.16b
tbnz x6, #6, 1f
- st1 {v3.16b}, [x0], #16
+ st1 {v3.16b}, [x19], #16
mov v24.16b, v31.16b
tbnz x6, #7, 1f
- ld1 {v24.16b}, [x1], #16
- st1 {v5.16b}, [x0], #16
-1: st1 {v24.16b}, [x5] // store IV
+ ld1 {v24.16b}, [x20], #16
+ st1 {v5.16b}, [x19], #16
+1: st1 {v24.16b}, [x24] // store IV
- cbnz x4, 99b
+ cbz x23, 2f
+ cond_yield_neon
+ b 99b
- ldp x29, x30, [sp], #16
+2: frame_pop 6
ret
ENDPROC(aesbs_cbc_decrypt)
@@ -731,87 +746,93 @@ CPU_BE( .quad 0x87, 1 )
*/
__xts_crypt8:
mov x6, #1
- lsl x6, x6, x4
- subs w4, w4, #8
- csel x4, x4, xzr, pl
+ lsl x6, x6, x23
+ subs w23, w23, #8
+ csel x23, x23, xzr, pl
csel x6, x6, xzr, mi
- ld1 {v0.16b}, [x1], #16
+ ld1 {v0.16b}, [x20], #16
next_tweak v26, v25, v30, v31
eor v0.16b, v0.16b, v25.16b
tbnz x6, #1, 0f
- ld1 {v1.16b}, [x1], #16
+ ld1 {v1.16b}, [x20], #16
next_tweak v27, v26, v30, v31
eor v1.16b, v1.16b, v26.16b
tbnz x6, #2, 0f
- ld1 {v2.16b}, [x1], #16
+ ld1 {v2.16b}, [x20], #16
next_tweak v28, v27, v30, v31
eor v2.16b, v2.16b, v27.16b
tbnz x6, #3, 0f
- ld1 {v3.16b}, [x1], #16
+ ld1 {v3.16b}, [x20], #16
next_tweak v29, v28, v30, v31
eor v3.16b, v3.16b, v28.16b
tbnz x6, #4, 0f
- ld1 {v4.16b}, [x1], #16
- str q29, [sp, #16]
+ ld1 {v4.16b}, [x20], #16
+ str q29, [sp, #64]
eor v4.16b, v4.16b, v29.16b
next_tweak v29, v29, v30, v31
tbnz x6, #5, 0f
- ld1 {v5.16b}, [x1], #16
- str q29, [sp, #32]
+ ld1 {v5.16b}, [x20], #16
+ str q29, [sp, #80]
eor v5.16b, v5.16b, v29.16b
next_tweak v29, v29, v30, v31
tbnz x6, #6, 0f
- ld1 {v6.16b}, [x1], #16
- str q29, [sp, #48]
+ ld1 {v6.16b}, [x20], #16
+ str q29, [sp, #96]
eor v6.16b, v6.16b, v29.16b
next_tweak v29, v29, v30, v31
tbnz x6, #7, 0f
- ld1 {v7.16b}, [x1], #16
- str q29, [sp, #64]
+ ld1 {v7.16b}, [x20], #16
+ str q29, [sp, #112]
eor v7.16b, v7.16b, v29.16b
next_tweak v29, v29, v30, v31
-0: mov bskey, x2
- mov rounds, x3
+0: mov bskey, x21
+ mov rounds, x22
br x7
ENDPROC(__xts_crypt8)
.macro __xts_crypt, do8, o0, o1, o2, o3, o4, o5, o6, o7
- stp x29, x30, [sp, #-80]!
- mov x29, sp
+ frame_push 6, 64
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
- ldr q30, .Lxts_mul_x
- ld1 {v25.16b}, [x5]
+0: ldr q30, .Lxts_mul_x
+ ld1 {v25.16b}, [x24]
99: adr x7, \do8
bl __xts_crypt8
- ldp q16, q17, [sp, #16]
- ldp q18, q19, [sp, #48]
+ ldp q16, q17, [sp, #64]
+ ldp q18, q19, [sp, #96]
eor \o0\().16b, \o0\().16b, v25.16b
eor \o1\().16b, \o1\().16b, v26.16b
eor \o2\().16b, \o2\().16b, v27.16b
eor \o3\().16b, \o3\().16b, v28.16b
- st1 {\o0\().16b}, [x0], #16
+ st1 {\o0\().16b}, [x19], #16
mov v25.16b, v26.16b
tbnz x6, #1, 1f
- st1 {\o1\().16b}, [x0], #16
+ st1 {\o1\().16b}, [x19], #16
mov v25.16b, v27.16b
tbnz x6, #2, 1f
- st1 {\o2\().16b}, [x0], #16
+ st1 {\o2\().16b}, [x19], #16
mov v25.16b, v28.16b
tbnz x6, #3, 1f
- st1 {\o3\().16b}, [x0], #16
+ st1 {\o3\().16b}, [x19], #16
mov v25.16b, v29.16b
tbnz x6, #4, 1f
@@ -820,18 +841,22 @@ ENDPROC(__xts_crypt8)
eor \o6\().16b, \o6\().16b, v18.16b
eor \o7\().16b, \o7\().16b, v19.16b
- st1 {\o4\().16b}, [x0], #16
+ st1 {\o4\().16b}, [x19], #16
tbnz x6, #5, 1f
- st1 {\o5\().16b}, [x0], #16
+ st1 {\o5\().16b}, [x19], #16
tbnz x6, #6, 1f
- st1 {\o6\().16b}, [x0], #16
+ st1 {\o6\().16b}, [x19], #16
tbnz x6, #7, 1f
- st1 {\o7\().16b}, [x0], #16
+ st1 {\o7\().16b}, [x19], #16
- cbnz x4, 99b
+ cbz x23, 1f
+ st1 {v25.16b}, [x24]
-1: st1 {v25.16b}, [x5]
- ldp x29, x30, [sp], #80
+ cond_yield_neon 0b
+ b 99b
+
+1: st1 {v25.16b}, [x24]
+ frame_pop 6, 64
ret
.endm
@@ -856,24 +881,31 @@ ENDPROC(aesbs_xts_decrypt)
* int rounds, int blocks, u8 iv[], u8 final[])
*/
ENTRY(aesbs_ctr_encrypt)
- stp x29, x30, [sp, #-16]!
- mov x29, sp
-
- cmp x6, #0
- cset x10, ne
- add x4, x4, x10 // do one extra block if final
-
- ldp x7, x8, [x5]
- ld1 {v0.16b}, [x5]
+ frame_push 8
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
+ mov x25, x6
+
+ cmp x25, #0
+ cset x26, ne
+ add x23, x23, x26 // do one extra block if final
+
+98: ldp x7, x8, [x24]
+ ld1 {v0.16b}, [x24]
CPU_LE( rev x7, x7 )
CPU_LE( rev x8, x8 )
adds x8, x8, #1
adc x7, x7, xzr
99: mov x9, #1
- lsl x9, x9, x4
- subs w4, w4, #8
- csel x4, x4, xzr, pl
+ lsl x9, x9, x23
+ subs w23, w23, #8
+ csel x23, x23, xzr, pl
csel x9, x9, xzr, le
tbnz x9, #1, 0f
@@ -891,82 +923,85 @@ CPU_LE( rev x8, x8 )
tbnz x9, #7, 0f
next_ctr v7
-0: mov bskey, x2
- mov rounds, x3
+0: mov bskey, x21
+ mov rounds, x22
bl aesbs_encrypt8
- lsr x9, x9, x10 // disregard the extra block
+ lsr x9, x9, x26 // disregard the extra block
tbnz x9, #0, 0f
- ld1 {v8.16b}, [x1], #16
+ ld1 {v8.16b}, [x20], #16
eor v0.16b, v0.16b, v8.16b
- st1 {v0.16b}, [x0], #16
+ st1 {v0.16b}, [x19], #16
tbnz x9, #1, 1f
- ld1 {v9.16b}, [x1], #16
+ ld1 {v9.16b}, [x20], #16
eor v1.16b, v1.16b, v9.16b
- st1 {v1.16b}, [x0], #16
+ st1 {v1.16b}, [x19], #16
tbnz x9, #2, 2f
- ld1 {v10.16b}, [x1], #16
+ ld1 {v10.16b}, [x20], #16
eor v4.16b, v4.16b, v10.16b
- st1 {v4.16b}, [x0], #16
+ st1 {v4.16b}, [x19], #16
tbnz x9, #3, 3f
- ld1 {v11.16b}, [x1], #16
+ ld1 {v11.16b}, [x20], #16
eor v6.16b, v6.16b, v11.16b
- st1 {v6.16b}, [x0], #16
+ st1 {v6.16b}, [x19], #16
tbnz x9, #4, 4f
- ld1 {v12.16b}, [x1], #16
+ ld1 {v12.16b}, [x20], #16
eor v3.16b, v3.16b, v12.16b
- st1 {v3.16b}, [x0], #16
+ st1 {v3.16b}, [x19], #16
tbnz x9, #5, 5f
- ld1 {v13.16b}, [x1], #16
+ ld1 {v13.16b}, [x20], #16
eor v7.16b, v7.16b, v13.16b
- st1 {v7.16b}, [x0], #16
+ st1 {v7.16b}, [x19], #16
tbnz x9, #6, 6f
- ld1 {v14.16b}, [x1], #16
+ ld1 {v14.16b}, [x20], #16
eor v2.16b, v2.16b, v14.16b
- st1 {v2.16b}, [x0], #16
+ st1 {v2.16b}, [x19], #16
tbnz x9, #7, 7f
- ld1 {v15.16b}, [x1], #16
+ ld1 {v15.16b}, [x20], #16
eor v5.16b, v5.16b, v15.16b
- st1 {v5.16b}, [x0], #16
+ st1 {v5.16b}, [x19], #16
8: next_ctr v0
- cbnz x4, 99b
+ st1 {v0.16b}, [x24]
+ cbz x23, 0f
+
+ cond_yield_neon 98b
+ b 99b
-0: st1 {v0.16b}, [x5]
- ldp x29, x30, [sp], #16
+0: frame_pop 8
ret
/*
* If we are handling the tail of the input (x6 != NULL), return the
* final keystream block back to the caller.
*/
-1: cbz x6, 8b
- st1 {v1.16b}, [x6]
+1: cbz x25, 8b
+ st1 {v1.16b}, [x25]
b 8b
-2: cbz x6, 8b
- st1 {v4.16b}, [x6]
+2: cbz x25, 8b
+ st1 {v4.16b}, [x25]
b 8b
-3: cbz x6, 8b
- st1 {v6.16b}, [x6]
+3: cbz x25, 8b
+ st1 {v6.16b}, [x25]
b 8b
-4: cbz x6, 8b
- st1 {v3.16b}, [x6]
+4: cbz x25, 8b
+ st1 {v3.16b}, [x25]
b 8b
-5: cbz x6, 8b
- st1 {v7.16b}, [x6]
+5: cbz x25, 8b
+ st1 {v7.16b}, [x25]
b 8b
-6: cbz x6, 8b
- st1 {v2.16b}, [x6]
+6: cbz x25, 8b
+ st1 {v2.16b}, [x25]
b 8b
-7: cbz x6, 8b
- st1 {v5.16b}, [x6]
+7: cbz x25, 8b
+ st1 {v5.16b}, [x25]
b 8b
ENDPROC(aesbs_ctr_encrypt)
--
2.11.0
^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH v3 17/20] crypto: arm64/aes-ghash - yield NEON after every block of input
2017-12-06 19:43 ` Ard Biesheuvel
@ 2017-12-06 19:43 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 19:43 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/ghash-ce-core.S | 113 ++++++++++++++------
arch/arm64/crypto/ghash-ce-glue.c | 28 +++--
2 files changed, 97 insertions(+), 44 deletions(-)
diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index 11ebf1ae248a..8da87cfcce66 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -213,22 +213,31 @@
.endm
.macro __pmull_ghash, pn
- ld1 {SHASH.2d}, [x3]
- ld1 {XL.2d}, [x1]
+ frame_push 5
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+
+0: ld1 {SHASH.2d}, [x22]
+ ld1 {XL.2d}, [x20]
ext SHASH2.16b, SHASH.16b, SHASH.16b, #8
eor SHASH2.16b, SHASH2.16b, SHASH.16b
__pmull_pre_\pn
/* do the head block first, if supplied */
- cbz x4, 0f
- ld1 {T1.2d}, [x4]
- b 1f
+ cbz x23, 1f
+ ld1 {T1.2d}, [x23]
+ mov x23, xzr
+ b 2f
-0: ld1 {T1.2d}, [x2], #16
- sub w0, w0, #1
+1: ld1 {T1.2d}, [x21], #16
+ sub w19, w19, #1
-1: /* multiply XL by SHASH in GF(2^128) */
+2: /* multiply XL by SHASH in GF(2^128) */
CPU_LE( rev64 T1.16b, T1.16b )
ext T2.16b, XL.16b, XL.16b, #8
@@ -250,9 +259,18 @@ CPU_LE( rev64 T1.16b, T1.16b )
eor T2.16b, T2.16b, XH.16b
eor XL.16b, XL.16b, T2.16b
- cbnz w0, 0b
+ cbz w19, 3f
+
+ if_will_cond_yield_neon
+ st1 {XL.2d}, [x20]
+ do_cond_yield_neon
+ b 0b
+ endif_yield_neon
+
+ b 1b
- st1 {XL.2d}, [x1]
+3: st1 {XL.2d}, [x20]
+ frame_pop 5
ret
.endm
@@ -304,38 +322,55 @@ ENDPROC(pmull_ghash_update_p8)
.endm
.macro pmull_gcm_do_crypt, enc
- ld1 {SHASH.2d}, [x4]
- ld1 {XL.2d}, [x1]
- ldr x8, [x5, #8] // load lower counter
+ frame_push 10
+
+ mov x19, x0
+ mov x20, x1
+ mov x21, x2
+ mov x22, x3
+ mov x23, x4
+ mov x24, x5
+ mov x25, x6
+ mov x26, x7
+ .if \enc == 1
+ ldr x27, [sp, #96] // first stacked arg
+ .endif
+
+ ldr x28, [x24, #8] // load lower counter
+CPU_LE( rev x28, x28 )
+
+0: mov x0, x25
+ load_round_keys w26, x0
+ ld1 {SHASH.2d}, [x23]
+ ld1 {XL.2d}, [x20]
movi MASK.16b, #0xe1
ext SHASH2.16b, SHASH.16b, SHASH.16b, #8
-CPU_LE( rev x8, x8 )
shl MASK.2d, MASK.2d, #57
eor SHASH2.16b, SHASH2.16b, SHASH.16b
.if \enc == 1
- ld1 {KS.16b}, [x7]
+ ld1 {KS.16b}, [x27]
.endif
-0: ld1 {CTR.8b}, [x5] // load upper counter
- ld1 {INP.16b}, [x3], #16
- rev x9, x8
- add x8, x8, #1
- sub w0, w0, #1
+1: ld1 {CTR.8b}, [x24] // load upper counter
+ ld1 {INP.16b}, [x22], #16
+ rev x9, x28
+ add x28, x28, #1
+ sub w19, w19, #1
ins CTR.d[1], x9 // set lower counter
.if \enc == 1
eor INP.16b, INP.16b, KS.16b // encrypt input
- st1 {INP.16b}, [x2], #16
+ st1 {INP.16b}, [x21], #16
.endif
rev64 T1.16b, INP.16b
- cmp w6, #12
- b.ge 2f // AES-192/256?
+ cmp w26, #12
+ b.ge 4f // AES-192/256?
-1: enc_round CTR, v21
+2: enc_round CTR, v21
ext T2.16b, XL.16b, XL.16b, #8
ext IN1.16b, T1.16b, T1.16b, #8
@@ -390,27 +425,39 @@ CPU_LE( rev x8, x8 )
.if \enc == 0
eor INP.16b, INP.16b, KS.16b
- st1 {INP.16b}, [x2], #16
+ st1 {INP.16b}, [x21], #16
.endif
- cbnz w0, 0b
+ cbz w19, 3f
-CPU_LE( rev x8, x8 )
- st1 {XL.2d}, [x1]
- str x8, [x5, #8] // store lower counter
+ if_will_cond_yield_neon
+ st1 {XL.2d}, [x20]
+ .if \enc == 1
+ st1 {KS.16b}, [x27]
+ .endif
+ do_cond_yield_neon
+ b 0b
+ endif_yield_neon
+ b 1b
+
+3: st1 {XL.2d}, [x20]
.if \enc == 1
- st1 {KS.16b}, [x7]
+ st1 {KS.16b}, [x27]
.endif
+CPU_LE( rev x28, x28 )
+ str x28, [x24, #8] // store lower counter
+
+ frame_pop 10
ret
-2: b.eq 3f // AES-192?
+4: b.eq 5f // AES-192?
enc_round CTR, v17
enc_round CTR, v18
-3: enc_round CTR, v19
+5: enc_round CTR, v19
enc_round CTR, v20
- b 1b
+ b 2b
.endm
/*
diff --git a/arch/arm64/crypto/ghash-ce-glue.c b/arch/arm64/crypto/ghash-ce-glue.c
index cfc9c92814fd..7cf0b1aa6ea8 100644
--- a/arch/arm64/crypto/ghash-ce-glue.c
+++ b/arch/arm64/crypto/ghash-ce-glue.c
@@ -63,11 +63,12 @@ static void (*pmull_ghash_update)(int blocks, u64 dg[], const char *src,
asmlinkage void pmull_gcm_encrypt(int blocks, u64 dg[], u8 dst[],
const u8 src[], struct ghash_key const *k,
- u8 ctr[], int rounds, u8 ks[]);
+ u8 ctr[], u32 const rk[], int rounds,
+ u8 ks[]);
asmlinkage void pmull_gcm_decrypt(int blocks, u64 dg[], u8 dst[],
const u8 src[], struct ghash_key const *k,
- u8 ctr[], int rounds);
+ u8 ctr[], u32 const rk[], int rounds);
asmlinkage void pmull_gcm_encrypt_block(u8 dst[], u8 const src[],
u32 const rk[], int rounds);
@@ -368,26 +369,29 @@ static int gcm_encrypt(struct aead_request *req)
pmull_gcm_encrypt_block(ks, iv, NULL,
num_rounds(&ctx->aes_key));
put_unaligned_be32(3, iv + GCM_IV_SIZE);
+ kernel_neon_end();
- err = skcipher_walk_aead_encrypt(&walk, req, true);
+ err = skcipher_walk_aead_encrypt(&walk, req, false);
while (walk.nbytes >= AES_BLOCK_SIZE) {
int blocks = walk.nbytes / AES_BLOCK_SIZE;
+ kernel_neon_begin();
pmull_gcm_encrypt(blocks, dg, walk.dst.virt.addr,
walk.src.virt.addr, &ctx->ghash_key,
- iv, num_rounds(&ctx->aes_key), ks);
+ iv, ctx->aes_key.key_enc,
+ num_rounds(&ctx->aes_key), ks);
+ kernel_neon_end();
err = skcipher_walk_done(&walk,
walk.nbytes % AES_BLOCK_SIZE);
}
- kernel_neon_end();
} else {
__aes_arm64_encrypt(ctx->aes_key.key_enc, tag, iv,
num_rounds(&ctx->aes_key));
put_unaligned_be32(2, iv + GCM_IV_SIZE);
- err = skcipher_walk_aead_encrypt(&walk, req, true);
+ err = skcipher_walk_aead_encrypt(&walk, req, false);
while (walk.nbytes >= AES_BLOCK_SIZE) {
int blocks = walk.nbytes / AES_BLOCK_SIZE;
@@ -467,15 +471,19 @@ static int gcm_decrypt(struct aead_request *req)
pmull_gcm_encrypt_block(tag, iv, ctx->aes_key.key_enc,
num_rounds(&ctx->aes_key));
put_unaligned_be32(2, iv + GCM_IV_SIZE);
+ kernel_neon_end();
- err = skcipher_walk_aead_decrypt(&walk, req, true);
+ err = skcipher_walk_aead_decrypt(&walk, req, false);
while (walk.nbytes >= AES_BLOCK_SIZE) {
int blocks = walk.nbytes / AES_BLOCK_SIZE;
+ kernel_neon_begin();
pmull_gcm_decrypt(blocks, dg, walk.dst.virt.addr,
walk.src.virt.addr, &ctx->ghash_key,
- iv, num_rounds(&ctx->aes_key));
+ iv, ctx->aes_key.key_enc,
+ num_rounds(&ctx->aes_key));
+ kernel_neon_end();
err = skcipher_walk_done(&walk,
walk.nbytes % AES_BLOCK_SIZE);
@@ -483,14 +491,12 @@ static int gcm_decrypt(struct aead_request *req)
if (walk.nbytes)
pmull_gcm_encrypt_block(iv, iv, NULL,
num_rounds(&ctx->aes_key));
-
- kernel_neon_end();
} else {
__aes_arm64_encrypt(ctx->aes_key.key_enc, tag, iv,
num_rounds(&ctx->aes_key));
put_unaligned_be32(2, iv + GCM_IV_SIZE);
- err = skcipher_walk_aead_decrypt(&walk, req, true);
+ err = skcipher_walk_aead_decrypt(&walk, req, false);
while (walk.nbytes >= AES_BLOCK_SIZE) {
int blocks = walk.nbytes / AES_BLOCK_SIZE;
--
2.11.0
^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH v3 18/20] crypto: arm64/crc32-ce - yield NEON after every block of input
2017-12-06 19:43 ` Ard Biesheuvel
@ 2017-12-06 19:43 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 19:43 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/crc32-ce-core.S | 44 ++++++++++++++------
1 file changed, 32 insertions(+), 12 deletions(-)
diff --git a/arch/arm64/crypto/crc32-ce-core.S b/arch/arm64/crypto/crc32-ce-core.S
index 18f5a8442276..b4ddbb2027e5 100644
--- a/arch/arm64/crypto/crc32-ce-core.S
+++ b/arch/arm64/crypto/crc32-ce-core.S
@@ -100,9 +100,9 @@
dCONSTANT .req d0
qCONSTANT .req q0
- BUF .req x0
- LEN .req x1
- CRC .req x2
+ BUF .req x19
+ LEN .req x20
+ CRC .req x21
vzr .req v9
@@ -116,13 +116,21 @@
* size_t len, uint crc32)
*/
ENTRY(crc32_pmull_le)
- adr x3, .Lcrc32_constants
+ frame_push 4, 64
+
+ adr x22, .Lcrc32_constants
b 0f
ENTRY(crc32c_pmull_le)
- adr x3, .Lcrc32c_constants
+ frame_push 4, 64
+
+ adr x22, .Lcrc32c_constants
+
+0: mov BUF, x0
+ mov LEN, x1
+ mov CRC, x2
-0: bic LEN, LEN, #15
+ bic LEN, LEN, #15
ld1 {v1.16b-v4.16b}, [BUF], #0x40
movi vzr.16b, #0
fmov dCONSTANT, CRC
@@ -131,7 +139,7 @@ ENTRY(crc32c_pmull_le)
cmp LEN, #0x40
b.lt less_64
- ldr qCONSTANT, [x3]
+ ldr qCONSTANT, [x22]
loop_64: /* 64 bytes Full cache line folding */
sub LEN, LEN, #0x40
@@ -161,10 +169,21 @@ loop_64: /* 64 bytes Full cache line folding */
eor v4.16b, v4.16b, v8.16b
cmp LEN, #0x40
- b.ge loop_64
+ b.lt less_64
+
+ if_will_cond_yield_neon
+ stp q1, q2, [sp, #48]
+ stp q3, q4, [sp, #80]
+ do_cond_yield_neon
+ ldp q1, q2, [sp, #48]
+ ldp q3, q4, [sp, #80]
+ ldr qCONSTANT, [x22]
+ movi vzr.16b, #0
+ endif_yield_neon
+ b loop_64
less_64: /* Folding cache line into 128bit */
- ldr qCONSTANT, [x3, #16]
+ ldr qCONSTANT, [x22, #16]
pmull2 v5.1q, v1.2d, vCONSTANT.2d
pmull v1.1q, v1.1d, vCONSTANT.1d
@@ -203,8 +222,8 @@ fold_64:
eor v1.16b, v1.16b, v2.16b
/* final 32-bit fold */
- ldr dCONSTANT, [x3, #32]
- ldr d3, [x3, #40]
+ ldr dCONSTANT, [x22, #32]
+ ldr d3, [x22, #40]
ext v2.16b, v1.16b, vzr.16b, #4
and v1.16b, v1.16b, v3.16b
@@ -212,7 +231,7 @@ fold_64:
eor v1.16b, v1.16b, v2.16b
/* Finish up with the bit-reversed barrett reduction 64 ==> 32 bits */
- ldr qCONSTANT, [x3, #48]
+ ldr qCONSTANT, [x22, #48]
and v2.16b, v1.16b, v3.16b
ext v2.16b, vzr.16b, v2.16b, #8
@@ -222,6 +241,7 @@ fold_64:
eor v1.16b, v1.16b, v2.16b
mov w0, v1.s[1]
+ frame_pop 4, 64
ret
ENDPROC(crc32_pmull_le)
ENDPROC(crc32c_pmull_le)
--
2.11.0
^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH v3 19/20] crypto: arm64/crct10dif-ce - yield NEON after every block of input
2017-12-06 19:43 ` Ard Biesheuvel
@ 2017-12-06 19:43 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 19:43 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON after every block of input.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm64/crypto/crct10dif-ce-core.S | 32 +++++++++++++++++---
1 file changed, 28 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/crypto/crct10dif-ce-core.S b/arch/arm64/crypto/crct10dif-ce-core.S
index d5b5a8c038c8..111675f7bad5 100644
--- a/arch/arm64/crypto/crct10dif-ce-core.S
+++ b/arch/arm64/crypto/crct10dif-ce-core.S
@@ -74,13 +74,19 @@
.text
.cpu generic+crypto
- arg1_low32 .req w0
- arg2 .req x1
- arg3 .req x2
+ arg1_low32 .req w19
+ arg2 .req x20
+ arg3 .req x21
vzr .req v13
ENTRY(crc_t10dif_pmull)
+ frame_push 3, 128
+
+ mov arg1_low32, w0
+ mov arg2, x1
+ mov arg3, x2
+
movi vzr.16b, #0 // init zero register
// adjust the 16-bit initial_crc value, scale it to 32 bits
@@ -175,8 +181,25 @@ CPU_LE( ext v12.16b, v12.16b, v12.16b, #8 )
subs arg3, arg3, #128
// check if there is another 64B in the buffer to be able to fold
- b.ge _fold_64_B_loop
+ b.lt _fold_64_B_end
+
+ if_will_cond_yield_neon
+ stp q0, q1, [sp, #48]
+ stp q2, q3, [sp, #80]
+ stp q4, q5, [sp, #112]
+ stp q6, q7, [sp, #144]
+ do_cond_yield_neon
+ ldp q0, q1, [sp, #48]
+ ldp q2, q3, [sp, #80]
+ ldp q4, q5, [sp, #112]
+ ldp q6, q7, [sp, #144]
+ ldr q10, rk3
+ movi vzr.16b, #0 // init zero register
+ endif_yield_neon
+
+ b _fold_64_B_loop
+_fold_64_B_end:
// at this point, the buffer pointer is pointing at the last y Bytes
// of the buffer the 64B of folded data is in 4 of the vector
// registers: v0, v1, v2, v3
@@ -304,6 +327,7 @@ _barrett:
_cleanup:
// scale the result back to 16 bits
lsr x0, x0, #16
+ frame_pop 3, 128
ret
_less_than_128:
--
2.11.0
^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH v3 20/20] DO NOT MERGE
2017-12-06 19:43 ` Ard Biesheuvel
@ 2017-12-06 19:43 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 19:43 UTC (permalink / raw)
To: linux-crypto
Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
Russell King - ARM Linux, Sebastian Andrzej Siewior,
Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
Will Deacon, Steven Rostedt, Thomas Gleixner
Test code to force a kernel_neon_end+begin sequence at every yield point,
and wipe the entire NEON state before resuming the algorithm.
---
arch/arm64/include/asm/assembler.h | 33 ++++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index c54e408fd5a7..7072c29b4e83 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -607,6 +607,7 @@ alternative_else_nop_endif
cmp w1, #1 // == PREEMPT_OFFSET
csel x0, x0, xzr, eq
tbnz x0, #TIF_NEED_RESCHED, 5555f // needs rescheduling?
+ b 5555f
#endif
.subsection 1
5555:
@@ -615,6 +616,38 @@ alternative_else_nop_endif
.macro do_cond_yield_neon
bl kernel_neon_end
bl kernel_neon_begin
+ movi v0.16b, #0x55
+ movi v1.16b, #0x55
+ movi v2.16b, #0x55
+ movi v3.16b, #0x55
+ movi v4.16b, #0x55
+ movi v5.16b, #0x55
+ movi v6.16b, #0x55
+ movi v7.16b, #0x55
+ movi v8.16b, #0x55
+ movi v9.16b, #0x55
+ movi v10.16b, #0x55
+ movi v11.16b, #0x55
+ movi v12.16b, #0x55
+ movi v13.16b, #0x55
+ movi v14.16b, #0x55
+ movi v15.16b, #0x55
+ movi v16.16b, #0x55
+ movi v17.16b, #0x55
+ movi v18.16b, #0x55
+ movi v19.16b, #0x55
+ movi v20.16b, #0x55
+ movi v21.16b, #0x55
+ movi v22.16b, #0x55
+ movi v23.16b, #0x55
+ movi v24.16b, #0x55
+ movi v25.16b, #0x55
+ movi v26.16b, #0x55
+ movi v27.16b, #0x55
+ movi v28.16b, #0x55
+ movi v29.16b, #0x55
+ movi v30.16b, #0x55
+ movi v31.16b, #0x55
.endm
.macro endif_yield_neon, lbl=6666f
--
2.11.0
^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [PATCH v3 10/20] arm64: assembler: add utility macros to push/pop stack frames
2017-12-06 19:43 ` Ard Biesheuvel
@ 2017-12-07 14:11 ` Dave Martin
-1 siblings, 0 replies; 62+ messages in thread
From: Dave Martin @ 2017-12-07 14:11 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: linux-crypto, Mark Rutland, herbert, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
linux-arm-kernel, linux-rt-users
On Wed, Dec 06, 2017 at 07:43:36PM +0000, Ard Biesheuvel wrote:
> We are going to add code to all the NEON crypto routines that will
> turn them into non-leaf functions, so we need to manage the stack
> frames. To make this less tedious and error prone, add some macros
> that take the number of callee saved registers to preserve and the
> extra size to allocate in the stack frame (for locals) and emit
> the ldp/stp sequences.
>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
> arch/arm64/include/asm/assembler.h | 60 ++++++++++++++++++++
> 1 file changed, 60 insertions(+)
>
> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> index aef72d886677..5f61487e9f93 100644
> --- a/arch/arm64/include/asm/assembler.h
> +++ b/arch/arm64/include/asm/assembler.h
> @@ -499,6 +499,66 @@ alternative_else_nop_endif
> #endif
> .endm
>
> + /*
> + * frame_push - Push @regcount callee saved registers to the stack,
> + * starting at x19, as well as x29/x30, and set x29 to
> + * the new value of sp. Add @extra bytes of stack space
> + * for locals.
> + */
> + .macro frame_push, regcount:req, extra
> + __frame st, \regcount, \extra
> + .endm
> +
> + /*
> + * frame_pop - Pop @regcount callee saved registers from the stack,
> + * starting at x19, as well as x29/x30. Also pop @extra
> + * bytes of stack space for locals.
> + */
> + .macro frame_pop, regcount:req, extra
> + __frame ld, \regcount, \extra
> + .endm
> +
> + .macro __frame, op, regcount:req, extra=0
> + .ifc \op, st
> + stp x29, x30, [sp, #-((\regcount + 3) / 2) * 16 - \extra]!
> + mov x29, sp
> + .endif
> + .if \regcount < 0 || \regcount > 10
> + .error "regcount should be in the range [0 ... 10]"
> + .endif
> + .if (\extra % 16) != 0
> + .error "extra should be a multiple of 16 bytes"
> + .endif
> + .if \regcount > 1
> + \op\()p x19, x20, [sp, #16]
> + .if \regcount > 3
> + \op\()p x21, x22, [sp, #32]
> + .if \regcount > 5
> + \op\()p x23, x24, [sp, #48]
> + .if \regcount > 7
> + \op\()p x25, x26, [sp, #64]
> + .if \regcount > 9
> + \op\()p x27, x28, [sp, #80]
Can the _for thing I introduced in fpsimdmacros.h be any use here?
Alternatively, the following could replace that .if-slide,
providing the calling macro does .altmacro .. .noaltmacro somewhere.
.macro _pushpop2 op, n1, n2, offset
\op x\n1, x\n2, [sp, #\offset]
.endm
.macro _pushpop op, first, last, offset
.if \first < \last
_pushpop2 \op\()p, \first, %\first + 1, \offset
_pushpop \op, %\first + 2, \last, %\offset + 16
.elseif \first == \last
\op\()r x\first, [sp, #\offset]
.endif
.endm
Also, I wonder whether it would be more readable at the call site
to specify the first and last reg numbers instead of the reg count,
e.g.:
frame_push first_reg=19, last_reg=23
(or whatever). Just syntactic sugar though.
[...]
Cheers
---Dave
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v3 10/20] arm64: assembler: add utility macros to push/pop stack frames
2017-12-07 14:11 ` Dave Martin
@ 2017-12-07 14:21 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-07 14:21 UTC (permalink / raw)
To: Dave Martin
Cc: linux-crypto, Mark Rutland, Herbert Xu, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
linux-arm-kernel, linux-rt-users
On 7 December 2017 at 14:11, Dave Martin <Dave.Martin@arm.com> wrote:
> On Wed, Dec 06, 2017 at 07:43:36PM +0000, Ard Biesheuvel wrote:
>> We are going to add code to all the NEON crypto routines that will
>> turn them into non-leaf functions, so we need to manage the stack
>> frames. To make this less tedious and error prone, add some macros
>> that take the number of callee saved registers to preserve and the
>> extra size to allocate in the stack frame (for locals) and emit
>> the ldp/stp sequences.
>>
>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> ---
>> arch/arm64/include/asm/assembler.h | 60 ++++++++++++++++++++
>> 1 file changed, 60 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
>> index aef72d886677..5f61487e9f93 100644
>> --- a/arch/arm64/include/asm/assembler.h
>> +++ b/arch/arm64/include/asm/assembler.h
>> @@ -499,6 +499,66 @@ alternative_else_nop_endif
>> #endif
>> .endm
>>
>> + /*
>> + * frame_push - Push @regcount callee saved registers to the stack,
>> + * starting at x19, as well as x29/x30, and set x29 to
>> + * the new value of sp. Add @extra bytes of stack space
>> + * for locals.
>> + */
>> + .macro frame_push, regcount:req, extra
>> + __frame st, \regcount, \extra
>> + .endm
>> +
>> + /*
>> + * frame_pop - Pop @regcount callee saved registers from the stack,
>> + * starting at x19, as well as x29/x30. Also pop @extra
>> + * bytes of stack space for locals.
>> + */
>> + .macro frame_pop, regcount:req, extra
>> + __frame ld, \regcount, \extra
>> + .endm
>> +
>> + .macro __frame, op, regcount:req, extra=0
>> + .ifc \op, st
>> + stp x29, x30, [sp, #-((\regcount + 3) / 2) * 16 - \extra]!
>> + mov x29, sp
>> + .endif
>> + .if \regcount < 0 || \regcount > 10
>> + .error "regcount should be in the range [0 ... 10]"
>> + .endif
>> + .if (\extra % 16) != 0
>> + .error "extra should be a multiple of 16 bytes"
>> + .endif
>> + .if \regcount > 1
>> + \op\()p x19, x20, [sp, #16]
>> + .if \regcount > 3
>> + \op\()p x21, x22, [sp, #32]
>> + .if \regcount > 5
>> + \op\()p x23, x24, [sp, #48]
>> + .if \regcount > 7
>> + \op\()p x25, x26, [sp, #64]
>> + .if \regcount > 9
>> + \op\()p x27, x28, [sp, #80]
>
> Can the _for thing I introduced in fpsimdmacros.h be any use here?
> Alternatively, the following could replace that .if-slide,
> providing the calling macro does .altmacro .. .noaltmacro somewhere.
>
> .macro _pushpop2 op, n1, n2, offset
> \op x\n1, x\n2, [sp, #\offset]
> .endm
>
> .macro _pushpop op, first, last, offset
> .if \first < \last
> _pushpop2 \op\()p, \first, %\first + 1, \offset
> _pushpop \op, %\first + 2, \last, %\offset + 16
> .elseif \first == \last
> \op\()r x\first, [sp, #\offset]
> .endif
> .endm
>
I'd prefer not to rely on altmacro, for reasons you pointed out
yourself a while ago IIRC.
I agree your version is more compact, but for a write once thing, I'm
not sure if it matters.
> Also, I wonder whether it would be more readable at the call site
> to specify the first and last reg numbers instead of the reg count,
> e.g.:
>
> frame_push first_reg=19, last_reg=23
>
> (or whatever). Just syntactic sugar though.
>
Again, this will involve arithmetic on macro arguments, which implies
altmacro. Relying on altmacro being set is dodgy, and unfortunately,
we can't enable it in the macro without keeping it enabled (or we may
disable it on behalf of the caller). I guess we could try to come up
with a smart way to infer whether altmacro was enabled, and only
disable it afterwards if it wasn't, using some directives that get
interpreted differently, but to be honest, I factored out this
sequence so I could think about more important things :-)
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v3 11/20] arm64: assembler: add macros to conditionally yield the NEON under PREEMPT
2017-12-06 19:43 ` Ard Biesheuvel
@ 2017-12-07 14:39 ` Dave Martin
-1 siblings, 0 replies; 62+ messages in thread
From: Dave Martin @ 2017-12-07 14:39 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: linux-crypto, Mark Rutland, herbert, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
linux-arm-kernel, linux-rt-users
On Wed, Dec 06, 2017 at 07:43:37PM +0000, Ard Biesheuvel wrote:
> Add support macros to conditionally yield the NEON (and thus the CPU)
> that may be called from the assembler code.
>
> In some cases, yielding the NEON involves saving and restoring a non
> trivial amount of context (especially in the CRC folding algorithms),
> and so the macro is split into three, and the code in between is only
> executed when the yield path is taken, allowing the context to be preserved.
> The third macro takes an optional label argument that marks the resume
> path after a yield has been performed.
>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
> arch/arm64/include/asm/assembler.h | 51 ++++++++++++++++++++
> 1 file changed, 51 insertions(+)
>
> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> index 5f61487e9f93..c54e408fd5a7 100644
> --- a/arch/arm64/include/asm/assembler.h
> +++ b/arch/arm64/include/asm/assembler.h
> @@ -572,4 +572,55 @@ alternative_else_nop_endif
> #endif
> .endm
>
> +/*
> + * Check whether to yield to another runnable task from kernel mode NEON code
> + * (which runs with preemption disabled).
> + *
> + * if_will_cond_yield_neon
> + * // pre-yield patchup code
> + * do_cond_yield_neon
> + * // post-yield patchup code
> + * endif_yield_neon
^ Mention the lbl argument?
> + *
> + * - Check whether the preempt count is exactly 1, in which case disabling
enabling ^
> + * preemption once will make the task preemptible. If this is not the case,
> + * yielding is pointless.
> + * - Check whether TIF_NEED_RESCHED is set, and if so, disable and re-enable
> + * kernel mode NEON (which will trigger a reschedule), and branch to the
> + * yield fixup code.
Mention that neither patchup sequence is allowed to use section-changing
directives?
For example:
if_will_cond_yield_neon
// some code
.pushsection .rodata, "a"
foo: .quad // some literal data for some reason
.popsection
// some code
do_cond_yield_neon
is not safe, because .previous is now .rodata.
(You could protect against this with
.pushsection .text; .previous; .subsection 1; // ...
.popsection
but it may be overkill.)
> + *
> + * This macro sequence clobbers x0, x1 and the flags register unconditionally,
> + * and may clobber x2 .. x18 if the yield path is taken.
> + */
> +
> + .macro cond_yield_neon, lbl
> + if_will_cond_yield_neon
> + do_cond_yield_neon
> + endif_yield_neon \lbl
> + .endm
> +
> + .macro if_will_cond_yield_neon
> +#ifdef CONFIG_PREEMPT
> + get_thread_info x0
> + ldr w1, [x0, #TSK_TI_PREEMPT]
> + ldr x0, [x0, #TSK_TI_FLAGS]
> + cmp w1, #1 // == PREEMPT_OFFSET
Can we at least drop a BUILD_BUG_ON() somewhere to check this?
Maybe in kernel_neon_begin() since this is intimately kernel-mode NEON
related.
> + csel x0, x0, xzr, eq
> + tbnz x0, #TIF_NEED_RESCHED, 5555f // needs rescheduling?
> +#endif
A comment that we will fall through to 6666f here may be helpful.
> + .subsection 1
> +5555:
> + .endm
> +
> + .macro do_cond_yield_neon
> + bl kernel_neon_end
> + bl kernel_neon_begin
> + .endm
> +
> + .macro endif_yield_neon, lbl=6666f
> + b \lbl
> + .previous
> +6666:
Could have slightly more random "random" labels here, but otherwise
it looks ok to me.
I might go through and replace all the random labels with something
more robust sometime, but I've never been sure it was worth the
effort...
Cheers
---Dave
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v3 11/20] arm64: assembler: add macros to conditionally yield the NEON under PREEMPT
2017-12-07 14:39 ` Dave Martin
@ 2017-12-07 14:50 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-07 14:50 UTC (permalink / raw)
To: Dave Martin
Cc: linux-crypto, Mark Rutland, Herbert Xu, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
linux-arm-kernel, linux-rt-users
On 7 December 2017 at 14:39, Dave Martin <Dave.Martin@arm.com> wrote:
> On Wed, Dec 06, 2017 at 07:43:37PM +0000, Ard Biesheuvel wrote:
>> Add support macros to conditionally yield the NEON (and thus the CPU)
>> that may be called from the assembler code.
>>
>> In some cases, yielding the NEON involves saving and restoring a non
>> trivial amount of context (especially in the CRC folding algorithms),
>> and so the macro is split into three, and the code in between is only
>> executed when the yield path is taken, allowing the context to be preserved.
>> The third macro takes an optional label argument that marks the resume
>> path after a yield has been performed.
>>
>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> ---
>> arch/arm64/include/asm/assembler.h | 51 ++++++++++++++++++++
>> 1 file changed, 51 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
>> index 5f61487e9f93..c54e408fd5a7 100644
>> --- a/arch/arm64/include/asm/assembler.h
>> +++ b/arch/arm64/include/asm/assembler.h
>> @@ -572,4 +572,55 @@ alternative_else_nop_endif
>> #endif
>> .endm
>>
>> +/*
>> + * Check whether to yield to another runnable task from kernel mode NEON code
>> + * (which runs with preemption disabled).
>> + *
>> + * if_will_cond_yield_neon
>> + * // pre-yield patchup code
>> + * do_cond_yield_neon
>> + * // post-yield patchup code
>> + * endif_yield_neon
>
> ^ Mention the lbl argument?
>
Yep will do
>> + *
>> + * - Check whether the preempt count is exactly 1, in which case disabling
>
> enabling ^
>
>> + * preemption once will make the task preemptible. If this is not the case,
>> + * yielding is pointless.
>> + * - Check whether TIF_NEED_RESCHED is set, and if so, disable and re-enable
>> + * kernel mode NEON (which will trigger a reschedule), and branch to the
>> + * yield fixup code.
>
> Mention that neither patchup sequence is allowed to use section-changing
> directives?
>
> For example:
>
> if_will_cond_yield_neon
> // some code
>
> .pushsection .rodata, "a"
> foo: .quad // some literal data for some reason
> .popsection
>
> // some code
> do_cond_yield_neon
>
> is not safe, because .previous is now .rodata.
>
Are you sure this is true?
The gas info page for .previous tells me
In terms of the section stack, this directive swaps the current
section with the top section on the section stack.
and it seems to me that .rodata is no longer on the section stack
after .popsection. In that sense, push/pop should be safe, but
section/subsection/previous is not (I think). So yes, let's put a note
in to mention that section directives are unsupported.
> (You could protect against this with
> .pushsection .text; .previous; .subsection 1; // ...
> .popsection
> but it may be overkill.)
>
>> + *
>> + * This macro sequence clobbers x0, x1 and the flags register unconditionally,
>> + * and may clobber x2 .. x18 if the yield path is taken.
>> + */
>> +
>> + .macro cond_yield_neon, lbl
>> + if_will_cond_yield_neon
>> + do_cond_yield_neon
>> + endif_yield_neon \lbl
>> + .endm
>> +
>> + .macro if_will_cond_yield_neon
>> +#ifdef CONFIG_PREEMPT
>> + get_thread_info x0
>> + ldr w1, [x0, #TSK_TI_PREEMPT]
>> + ldr x0, [x0, #TSK_TI_FLAGS]
>> + cmp w1, #1 // == PREEMPT_OFFSET
>
> Can we at least drop a BUILD_BUG_ON() somewhere to check this?
>
> Maybe in kernel_neon_begin() since this is intimately kernel-mode NEON
> related.
>
Sure.
>> + csel x0, x0, xzr, eq
>> + tbnz x0, #TIF_NEED_RESCHED, 5555f // needs rescheduling?
>> +#endif
>
> A comment that we will fall through to 6666f here may be helpful.
>
Indeed. Will add that.
>> + .subsection 1
>> +5555:
>> + .endm
>> +
>> + .macro do_cond_yield_neon
>> + bl kernel_neon_end
>> + bl kernel_neon_begin
>> + .endm
>> +
>> + .macro endif_yield_neon, lbl=6666f
>> + b \lbl
>> + .previous
>> +6666:
>
> Could have slightly more random "random" labels here, but otherwise
> it looks ok to me.
>
Which number did you have in mind that is more random than 6666? :-)
> I might go through and replace all the random labels with something
> more robust sometime, but I've never been sure it was worth the
> effort...
>
I guess we could invent all kinds of elaborate schemes but as you say,
having 4 digit numbers and grep'ing the source before you add a new
one has been working fine so far, so I don't think it should be a
priority.
^ permalink raw reply [flat|nested] 62+ messages in thread
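To make the control flow under discussion easier to follow, this is roughly what a cond_yield_neon invocation boils down to, written out linearly. It is a sketch derived from the macro bodies quoted above (the numeric labels are kept for reference), not a literal assembler listing:

	// in line (subsection 0):
	//	get_thread_info	x0
	//	ldr	w1, [x0, #TSK_TI_PREEMPT]
	//	ldr	x0, [x0, #TSK_TI_FLAGS]
	//	cmp	w1, #1				// preempt count == PREEMPT_OFFSET?
	//	csel	x0, x0, xzr, eq			// only look at the flags if so
	//	tbnz	x0, #TIF_NEED_RESCHED, 5555f	// yield: jump out of line
	// 6666:					// no-yield path falls through to here
	//	<code following endif_yield_neon>
	//
	// out of line (subsection 1):
	// 5555:	<pre-yield patchup code>
	//	bl	kernel_neon_end
	//	bl	kernel_neon_begin
	//		<post-yield patchup code>
	//	b	\lbl				// resume; defaults to 6666f, the in-line label above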
* Re: [PATCH v3 10/20] arm64: assembler: add utility macros to push/pop stack frames
2017-12-07 14:21 ` Ard Biesheuvel
@ 2017-12-07 14:53 ` Dave Martin
-1 siblings, 0 replies; 62+ messages in thread
From: Dave Martin @ 2017-12-07 14:53 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Mark Rutland, Herbert Xu, Peter Zijlstra, Catalin Marinas,
Sebastian Andrzej Siewior, Will Deacon, Russell King - ARM Linux,
Steven Rostedt, linux-crypto, Thomas Gleixner, linux-arm-kernel,
linux-rt-users
On Thu, Dec 07, 2017 at 02:21:17PM +0000, Ard Biesheuvel wrote:
> On 7 December 2017 at 14:11, Dave Martin <Dave.Martin@arm.com> wrote:
> > On Wed, Dec 06, 2017 at 07:43:36PM +0000, Ard Biesheuvel wrote:
> >> We are going to add code to all the NEON crypto routines that will
> >> turn them into non-leaf functions, so we need to manage the stack
> >> frames. To make this less tedious and error prone, add some macros
> >> that take the number of callee saved registers to preserve and the
> >> extra size to allocate in the stack frame (for locals) and emit
> >> the ldp/stp sequences.
> >>
> >> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> >> ---
> >> arch/arm64/include/asm/assembler.h | 60 ++++++++++++++++++++
> >> 1 file changed, 60 insertions(+)
> >>
> >> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> >> index aef72d886677..5f61487e9f93 100644
> >> --- a/arch/arm64/include/asm/assembler.h
> >> +++ b/arch/arm64/include/asm/assembler.h
> >> @@ -499,6 +499,66 @@ alternative_else_nop_endif
> >> #endif
> >> .endm
> >>
> >> + /*
> >> + * frame_push - Push @regcount callee saved registers to the stack,
> >> + * starting at x19, as well as x29/x30, and set x29 to
> >> + * the new value of sp. Add @extra bytes of stack space
> >> + * for locals.
> >> + */
> >> + .macro frame_push, regcount:req, extra
> >> + __frame st, \regcount, \extra
> >> + .endm
> >> +
> >> + /*
> >> + * frame_pop - Pop @regcount callee saved registers from the stack,
> >> + * starting at x19, as well as x29/x30. Also pop @extra
> >> + * bytes of stack space for locals.
> >> + */
> >> + .macro frame_pop, regcount:req, extra
> >> + __frame ld, \regcount, \extra
> >> + .endm
> >> +
> >> + .macro __frame, op, regcount:req, extra=0
> >> + .ifc \op, st
> >> + stp x29, x30, [sp, #-((\regcount + 3) / 2) * 16 - \extra]!
> >> + mov x29, sp
> >> + .endif
> >> + .if \regcount < 0 || \regcount > 10
> >> + .error "regcount should be in the range [0 ... 10]"
> >> + .endif
> >> + .if (\extra % 16) != 0
> >> + .error "extra should be a multiple of 16 bytes"
> >> + .endif
> >> + .if \regcount > 1
> >> + \op\()p x19, x20, [sp, #16]
> >> + .if \regcount > 3
> >> + \op\()p x21, x22, [sp, #32]
> >> + .if \regcount > 5
> >> + \op\()p x23, x24, [sp, #48]
> >> + .if \regcount > 7
> >> + \op\()p x25, x26, [sp, #64]
> >> + .if \regcount > 9
> >> + \op\()p x27, x28, [sp, #80]
> >
> > Can the _for thing I introduced in fpsimdmacros.h be any use here?
> > Alternatively, the following could replace that .if-slide,
> > providing the calling macro does .altmacro .. .noaltmacro somewhere.
> >
> > .macro _pushpop2 op, n1, n2, offset
> > \op x\n1, x\n2, [sp, #\offset]
> > .endm
> >
> > .macro _pushpop op, first, last, offset
> > .if \first < \last
> > _pushpop2 \op\()p, \first, %\first + 1, \offset
> > _pushpop \op, %\first + 2, \last, %\offset + 16
> > .elseif \first == \last
> > \op\()r x\first, [sp, #\offset]
> > .endif
> > .endm
> >
>
> I'd prefer not to rely on altmacro, for reasons you pointed out
> yourself a while ago IIRC.
>
> I agree your version is more compact, but for a write once thing, I'm
> not sure if it matters.
>
> > Also, I wonder whether it would be more readable at the call site
> > to specify the first and last reg numbers instead of the reg count,
> > e.g.:
> >
> > frame_push first_reg=19, last_reg=23
> >
> > (or whatever). Just syntactic sugar though.
> >
>
> Again, this will involve arithmetic on macro arguments, which implies
> altmacro. Relying on altmacro being set is dodgy, and unfortunately,
> we can't enable it in the macro without keeping it enabled (or we may
> disable it on behalf of the caller). I guess we could try to come up
> with a smart way to infer whether altmacro was enabled, and only
> disable it afterwards if it wasn't, using some directives that get
> interpreted differently, but to be honest, I factored out this
> sequence so I could think about more important things :-)
Sure, no worries.
I've changed my mind a bit about .altmacro, in that it is not really
usable at all unless turned on explicitly, and then off again, only
where it's needed. So if you just assume it's always off, things are
sane (and that's what happens in practice).
But it's not really needed here -- my main confusion was with the
deeply nested .ifs, but perhaps that could be avoided more
straightforwardly:
.if \regcount > 1
\op\()p x19, x20, [sp, #16]
.endif
.if \regcount > 3
\op\()p x21, x22, [sp, #32]
.endif
// ...
.if \regcount > 9
\op\()p x27, x28, [sp, #80]
.endif
.if \regcount == 1
\op\()r x19, [sp, #20]
.endif
.if \regcount == 3
\op\()r x21, [sp, #22]
.endif
// ...
.if \regcount == 9
\op\()r x27, [sp, #28]
.endif
One other thing, should you be protecting the macro args with ()?
It seems unlikely that an expression would be passed for regcount,
but for extra it's a bit more plausible.
Cheers
---Dave
^ permalink raw reply [flat|nested] 62+ messages in thread
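For reference, this is what an invocation such as "frame_push 4, 16" expands to with the version quoted above; the offsets follow directly from the macro body, and the trailing 16 bytes are the caller's locals:

	stp	x29, x30, [sp, #-64]!		// ((4 + 3) / 2) * 16 + 16 == 64
	mov	x29, sp
	stp	x19, x20, [sp, #16]
	stp	x21, x22, [sp, #32]
						// [sp, #48] .. [sp, #63]: 16 bytes for locals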
* Re: [PATCH v3 10/20] arm64: assembler: add utility macros to push/pop stack frames
2017-12-07 14:53 ` Dave Martin
@ 2017-12-07 14:58 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-07 14:58 UTC (permalink / raw)
To: Dave Martin
Cc: Mark Rutland, Herbert Xu, Peter Zijlstra, Catalin Marinas,
Sebastian Andrzej Siewior, Will Deacon, Russell King - ARM Linux,
Steven Rostedt, linux-crypto, Thomas Gleixner, linux-arm-kernel,
linux-rt-users
On 7 December 2017 at 14:53, Dave Martin <Dave.Martin@arm.com> wrote:
> On Thu, Dec 07, 2017 at 02:21:17PM +0000, Ard Biesheuvel wrote:
>> On 7 December 2017 at 14:11, Dave Martin <Dave.Martin@arm.com> wrote:
>> > On Wed, Dec 06, 2017 at 07:43:36PM +0000, Ard Biesheuvel wrote:
>> >> We are going to add code to all the NEON crypto routines that will
>> >> turn them into non-leaf functions, so we need to manage the stack
>> >> frames. To make this less tedious and error prone, add some macros
>> >> that take the number of callee saved registers to preserve and the
>> >> extra size to allocate in the stack frame (for locals) and emit
>> >> the ldp/stp sequences.
>> >>
>> >> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> >> ---
>> >> arch/arm64/include/asm/assembler.h | 60 ++++++++++++++++++++
>> >> 1 file changed, 60 insertions(+)
>> >>
>> >> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
>> >> index aef72d886677..5f61487e9f93 100644
>> >> --- a/arch/arm64/include/asm/assembler.h
>> >> +++ b/arch/arm64/include/asm/assembler.h
>> >> @@ -499,6 +499,66 @@ alternative_else_nop_endif
>> >> #endif
>> >> .endm
>> >>
>> >> + /*
>> >> + * frame_push - Push @regcount callee saved registers to the stack,
>> >> + * starting at x19, as well as x29/x30, and set x29 to
>> >> + * the new value of sp. Add @extra bytes of stack space
>> >> + * for locals.
>> >> + */
>> >> + .macro frame_push, regcount:req, extra
>> >> + __frame st, \regcount, \extra
>> >> + .endm
>> >> +
>> >> + /*
>> >> + * frame_pop - Pop @regcount callee saved registers from the stack,
>> >> + * starting at x19, as well as x29/x30. Also pop @extra
>> >> + * bytes of stack space for locals.
>> >> + */
>> >> + .macro frame_pop, regcount:req, extra
>> >> + __frame ld, \regcount, \extra
>> >> + .endm
>> >> +
>> >> + .macro __frame, op, regcount:req, extra=0
>> >> + .ifc \op, st
>> >> + stp x29, x30, [sp, #-((\regcount + 3) / 2) * 16 - \extra]!
>> >> + mov x29, sp
>> >> + .endif
>> >> + .if \regcount < 0 || \regcount > 10
>> >> + .error "regcount should be in the range [0 ... 10]"
>> >> + .endif
>> >> + .if (\extra % 16) != 0
>> >> + .error "extra should be a multiple of 16 bytes"
>> >> + .endif
>> >> + .if \regcount > 1
>> >> + \op\()p x19, x20, [sp, #16]
>> >> + .if \regcount > 3
>> >> + \op\()p x21, x22, [sp, #32]
>> >> + .if \regcount > 5
>> >> + \op\()p x23, x24, [sp, #48]
>> >> + .if \regcount > 7
>> >> + \op\()p x25, x26, [sp, #64]
>> >> + .if \regcount > 9
>> >> + \op\()p x27, x28, [sp, #80]
>> >
>> > Can the _for thing I introduced in fpsimdmacros.h be any use here?
>> > Alternatively, the following could replace that .if-slide,
>> > providing the calling macro does .altmacro .. .noaltmacro somewhere.
>> >
>> > .macro _pushpop2 op, n1, n2, offset
>> > \op x\n1, x\n2, [sp, #\offset]
>> > .endm
>> >
>> > .macro _pushpop op, first, last, offset
>> > .if \first < \last
>> > _pushpop2 \op\()p, \first, %\first + 1, \offset
>> > _pushpop \op, %\first + 2, \last, %\offset + 16
>> > .elseif \first == \last
>> > \op\()r x\first, [sp, #\offset]
>> > .endif
>> > .endm
>> >
>>
>> I'd prefer not to rely on altmacro, for reasons you pointed out
>> yourself a while ago IIRC.
>>
>> I agree your version is more compact, but for a write once thing, I'm
>> not sure if it matters.
>>
>> > Also, I wonder whether it would be more readable at the call site
>> > to specify the first and last reg numbers instead of the reg count,
>> > e.g.:
>> >
>> > frame_push first_reg=19, last_reg=23
>> >
>> > (or whatever). Just syntactic sugar though.
>> >
>>
>> Again, this will involve arithmetic on macro arguments, which implies
>> altmacro. Relying on altmacro being set is dodgy, and unfortunately,
>> we can't enable it in the macro without keeping it enabled (or we may
>> disable it on behalf of the caller). I guess we could try to come up
>> with a smart way to infer whether altmacro was enabled, and only
>> disable it afterwards if it wasn't, using some directives that get
>> interpreted differently, but to be honest, I factored out this
>> sequence so I could think about more important things :-)
>
> Sure, no worries.
>
> I've changed my mind a bit about .altmacro, in that it is not really
> usable at all unless turned on explicitly, and then off again, only
> where it's needed. So if you just assume it's always off, things are
> sane (and that's what happens in practice).
>
> But it's not really needed here -- my main confusion was with the
> deeply nested .ifs, but perhaps that could be avoided more
> straightforwardly:
>
> .if \regcount > 1
> \op\()p x19, x20, [sp, #16]
> .endif
> .if \regcount > 3
> \op\()p x21, x22, [sp, #32]
> .endif
> // ...
> .if \regcount > 9
> \op\()p x27, x28, [sp, #80]
> .endif
>
> .if \regcount == 1
> \op\()r x19, [sp, #20]
> .endif
> .if \regcount == 3
> \op\()r x21, [sp, #22]
> .endif
> // ...
> .if \regcount == 9
> \op\()r x27, [sp, #28]
> .endif
>
Yes, that does look better.
>
> One other thing, should you be protecting the macro args with ()?
>
> It seems unlikely that an expression would be passed for regcount,
> but for extra it's a bit more plausible.
>
Good point, given that I subtract \extra from the frame size in the ldp case.
> Cheers
> ---Dave
^ permalink raw reply [flat|nested] 62+ messages in thread
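For completeness, the bracketing being suggested would look something like this in the prologue (a sketch of a possible respin, not the code that was posted), so that a hypothetical argument such as "extra=MY_CTX_SIZE + 16" is subtracted as a whole rather than term by term:

	stp	x29, x30, [sp, #-((\regcount + 3) / 2) * 16 - (\extra)]!
	mov	x29, sp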
* Re: [PATCH v3 11/20] arm64: assembler: add macros to conditionally yield the NEON under PREEMPT
2017-12-07 14:50 ` Ard Biesheuvel
@ 2017-12-07 15:47 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-07 15:47 UTC (permalink / raw)
To: Dave Martin
Cc: Mark Rutland, Herbert Xu, Peter Zijlstra, Catalin Marinas,
Sebastian Andrzej Siewior, Will Deacon, Russell King - ARM Linux,
Steven Rostedt, linux-crypto, Thomas Gleixner, linux-arm-kernel,
linux-rt-users
On 7 December 2017 at 14:50, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> On 7 December 2017 at 14:39, Dave Martin <Dave.Martin@arm.com> wrote:
>> On Wed, Dec 06, 2017 at 07:43:37PM +0000, Ard Biesheuvel wrote:
>>> Add support macros to conditionally yield the NEON (and thus the CPU)
>>> that may be called from the assembler code.
>>>
>>> In some cases, yielding the NEON involves saving and restoring a non
>>> trivial amount of context (especially in the CRC folding algorithms),
>>> and so the macro is split into three, and the code in between is only
>>> executed when the yield path is taken, allowing the context to be preserved.
>>> The third macro takes an optional label argument that marks the resume
>>> path after a yield has been performed.
>>>
>>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>>> ---
>>> arch/arm64/include/asm/assembler.h | 51 ++++++++++++++++++++
>>> 1 file changed, 51 insertions(+)
>>>
>>> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
>>> index 5f61487e9f93..c54e408fd5a7 100644
>>> --- a/arch/arm64/include/asm/assembler.h
>>> +++ b/arch/arm64/include/asm/assembler.h
>>> @@ -572,4 +572,55 @@ alternative_else_nop_endif
>>> #endif
>>> .endm
>>>
>>> +/*
>>> + * Check whether to yield to another runnable task from kernel mode NEON code
>>> + * (which runs with preemption disabled).
>>> + *
>>> + * if_will_cond_yield_neon
>>> + * // pre-yield patchup code
>>> + * do_cond_yield_neon
>>> + * // post-yield patchup code
>>> + * endif_yield_neon
>>
>> ^ Mention the lbl argument?
>>
>
> Yep will do
>
>>> + *
>>> + * - Check whether the preempt count is exactly 1, in which case disabling
>>
>> enabling ^
>>
>>> + * preemption once will make the task preemptible. If this is not the case,
>>> + * yielding is pointless.
>>> + * - Check whether TIF_NEED_RESCHED is set, and if so, disable and re-enable
>>> + * kernel mode NEON (which will trigger a reschedule), and branch to the
>>> + * yield fixup code.
>>
>> Mention that neither patchup sequence is allowed to use section-changing
>> directives?
>>
>> For example:
>>
>> if_will_cond_yield_neon
>> // some code
>>
>> .pushsection .rodata, "a"
>> foo: .quad // some literal data for some reason
>> .popsection
>>
>> // some code
>> do_cond_yield_neon
>>
>> is not safe, because .previous is now .rodata.
>>
>
> Are you sure this is true?
>
> The gas info page for .previous tells me
>
> In terms of the section stack, this directive swaps the current
> section with the top section on the section stack.
>
> and it seems to me that .rodata is no longer on the section stack
> after .popsection. In that sense, push/pop should be safe, but
> section/subsection/previous is not (I think). So yes, let's put a note
> in to mention that section directives are unsupported.
>
>> (You could protect against this with
>> .pushsection .text; .previous; .subsection 1; // ...
>> .popsection
>> but it may be overkill.)
>>
>>> + *
>>> + * This macro sequence clobbers x0, x1 and the flags register unconditionally,
>>> + * and may clobber x2 .. x18 if the yield path is taken.
>>> + */
>>> +
>>> + .macro cond_yield_neon, lbl
>>> + if_will_cond_yield_neon
>>> + do_cond_yield_neon
>>> + endif_yield_neon \lbl
>>> + .endm
>>> +
>>> + .macro if_will_cond_yield_neon
>>> +#ifdef CONFIG_PREEMPT
>>> + get_thread_info x0
>>> + ldr w1, [x0, #TSK_TI_PREEMPT]
>>> + ldr x0, [x0, #TSK_TI_FLAGS]
>>> + cmp w1, #1 // == PREEMPT_OFFSET
>>
>> Can we at least drop a BUILD_BUG_ON() somewhere to check this?
>>
>> Maybe in kernel_neon_begin() since this is intimately kernel-mode NEON
>> related.
>>
>
> Sure.
>
I only just understood your asm-offsets remark earlier. I wasn't aware
that it allows exposing random constants as well (although it is
fairly obvious now that I do). So I will expose PREEMPT_OFFSET rather
than open code it
>>> + csel x0, x0, xzr, eq
>>> + tbnz x0, #TIF_NEED_RESCHED, 5555f // needs rescheduling?
>>> +#endif
>>
>> A comment that we will fall through to 6666f here may be helpful.
>>
>
> Indeed. Will add that.
>
>>> + .subsection 1
>>> +5555:
>>> + .endm
>>> +
>>> + .macro do_cond_yield_neon
>>> + bl kernel_neon_end
>>> + bl kernel_neon_begin
>>> + .endm
>>> +
>>> + .macro endif_yield_neon, lbl=6666f
>>> + b \lbl
>>> + .previous
>>> +6666:
>>
>> Could have slightly more random "random" labels here, but otherwise
>> it looks ok to me.
>>
>
> Which number did you have in mind that is more random than 6666? :-)
>
>> I might go through and replace all the random labels with something
>> more robust sometime, but I've never been sure it was worth the
>> effort...
>>
>
> I guess we could invent all kinds of elaborate schemes but as you say,
> having 4 digit numbers and grep'ing the source before you add a new
> one has been working fine so far, so I don't think it should be a
> priority.
^ permalink raw reply [flat|nested] 62+ messages in thread
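The change being described would presumably make the check look something like this (an assumption about the respin, not code from this posting), with the constant emitted into asm-offsets.h instead of being open coded:

	ldr	w1, [x0, #TSK_TI_PREEMPT]
	ldr	x0, [x0, #TSK_TI_FLAGS]
	cmp	w1, #PREEMPT_OFFSET		// hypothetical: assumes PREEMPT_OFFSET is exported via asm-offsets
	csel	x0, x0, xzr, eq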
* Re: [PATCH v3 11/20] arm64: assembler: add macros to conditionally yield the NEON under PREEMPT
2017-12-07 15:47 ` Ard Biesheuvel
@ 2017-12-07 15:51 ` Ard Biesheuvel
-1 siblings, 0 replies; 62+ messages in thread
From: Ard Biesheuvel @ 2017-12-07 15:51 UTC (permalink / raw)
To: Dave Martin
Cc: linux-crypto, Mark Rutland, Herbert Xu, Peter Zijlstra,
Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
linux-arm-kernel, linux-rt-users
On 7 December 2017 at 15:47, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> On 7 December 2017 at 14:50, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>> On 7 December 2017 at 14:39, Dave Martin <Dave.Martin@arm.com> wrote:
>>> On Wed, Dec 06, 2017 at 07:43:37PM +0000, Ard Biesheuvel wrote:
>>>> Add support macros to conditionally yield the NEON (and thus the CPU)
>>>> that may be called from the assembler code.
>>>>
>>>> In some cases, yielding the NEON involves saving and restoring a non
>>>> trivial amount of context (especially in the CRC folding algorithms),
>>>> and so the macro is split into three, and the code in between is only
>>>> executed when the yield path is taken, allowing the context to be preserved.
>>>> The third macro takes an optional label argument that marks the resume
>>>> path after a yield has been performed.
>>>>
>>>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>>>> ---
>>>> arch/arm64/include/asm/assembler.h | 51 ++++++++++++++++++++
>>>> 1 file changed, 51 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
>>>> index 5f61487e9f93..c54e408fd5a7 100644
>>>> --- a/arch/arm64/include/asm/assembler.h
>>>> +++ b/arch/arm64/include/asm/assembler.h
>>>> @@ -572,4 +572,55 @@ alternative_else_nop_endif
>>>> #endif
>>>> .endm
>>>>
>>>> +/*
>>>> + * Check whether to yield to another runnable task from kernel mode NEON code
>>>> + * (which runs with preemption disabled).
>>>> + *
>>>> + * if_will_cond_yield_neon
>>>> + * // pre-yield patchup code
>>>> + * do_cond_yield_neon
>>>> + * // post-yield patchup code
>>>> + * endif_yield_neon
>>>
>>> ^ Mention the lbl argument?
>>>
>>
>> Yep will do
>>
>>>> + *
>>>> + * - Check whether the preempt count is exactly 1, in which case disabling
>>>
>>> enabling ^
>>>
>>>> + * preemption once will make the task preemptible. If this is not the case,
>>>> + * yielding is pointless.
>>>> + * - Check whether TIF_NEED_RESCHED is set, and if so, disable and re-enable
>>>> + * kernel mode NEON (which will trigger a reschedule), and branch to the
>>>> + * yield fixup code.
>>>
>>> Mention that neither patchup sequence is allowed to use section-changing
>>> directives?
>>>
>>> For example:
>>>
>>> if_will_cond_yield_neon
>>> // some code
>>>
>>> .pushsection .rodata, "a"
>>> foo: .quad // some literal data for some reason
>>> .popsection
>>>
>>> // some code
>>> do_cond_yield_neon
>>>
>>> is not safe, because .previous is now .rodata.
>>>
>>
>> Are you sure this is true?
>>
>> The gas info page for .previous tells me
>>
>> In terms of the section stack, this directive swaps the current
>> section with the top section on the section stack.
>>
>> and it seems to me that .rodata is no longer on the section stack
>> after .popsection. In that sense, push/pop should be safe, but
>> section/subsection/previous is not (I think). So yes, let's put a note
>> in to mention that section directives are unsupported.
>>
>>> (You could protect against this with
>>> .pushsection .text; .previous; .subsection 1; // ...
>>> .popsection
>>> but it may be overkill.)
>>>
>>>> + *
>>>> + * This macro sequence clobbers x0, x1 and the flags register unconditionally,
>>>> + * and may clobber x2 .. x18 if the yield path is taken.
>>>> + */
>>>> +
>>>> + .macro cond_yield_neon, lbl
>>>> + if_will_cond_yield_neon
>>>> + do_cond_yield_neon
>>>> + endif_yield_neon \lbl
>>>> + .endm
>>>> +
>>>> + .macro if_will_cond_yield_neon
>>>> +#ifdef CONFIG_PREEMPT
>>>> + get_thread_info x0
>>>> + ldr w1, [x0, #TSK_TI_PREEMPT]
>>>> + ldr x0, [x0, #TSK_TI_FLAGS]
>>>> + cmp w1, #1 // == PREEMPT_OFFSET
>>>
>>> Can we at least drop a BUILD_BUG_ON() somewhere to check this?
>>>
>>> Maybe in kernel_neon_begin() since this is intimately kernel-mode NEON
>>> related.
>>>
>>
>> Sure.
>>
>
> I only just understood your asm-offsets remark earlier. I wasn't aware
> that it allows exposing random constants as well (although it is
> fairly obvious now that I do). So I will expose PREEMPT_OFFSET rather
> than open code it
>
Of course, I mean 'arbitrary' not 'random' (like 6666)
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v3 11/20] arm64: assembler: add macros to conditionally yield the NEON under PREEMPT
2017-12-07 14:50 ` Ard Biesheuvel
@ 2017-12-07 16:11 ` Dave Martin
-1 siblings, 0 replies; 62+ messages in thread
From: Dave Martin @ 2017-12-07 16:11 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Mark Rutland, Herbert Xu, Peter Zijlstra, Catalin Marinas,
Sebastian Andrzej Siewior, Will Deacon, Russell King - ARM Linux,
Steven Rostedt, linux-crypto, Thomas Gleixner, linux-arm-kernel,
linux-rt-users
On Thu, Dec 07, 2017 at 02:50:11PM +0000, Ard Biesheuvel wrote:
> On 7 December 2017 at 14:39, Dave Martin <Dave.Martin@arm.com> wrote:
> > On Wed, Dec 06, 2017 at 07:43:37PM +0000, Ard Biesheuvel wrote:
> >> Add support macros to conditionally yield the NEON (and thus the CPU)
> >> that may be called from the assembler code.
> >>
> >> In some cases, yielding the NEON involves saving and restoring a non
> >> trivial amount of context (especially in the CRC folding algorithms),
> >> and so the macro is split into three, and the code in between is only
> >> executed when the yield path is taken, allowing the context to be preserved.
> >> The third macro takes an optional label argument that marks the resume
> >> path after a yield has been performed.
> >>
> >> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> >> ---
> >> arch/arm64/include/asm/assembler.h | 51 ++++++++++++++++++++
> >> 1 file changed, 51 insertions(+)
> >>
> >> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> >> index 5f61487e9f93..c54e408fd5a7 100644
> >> --- a/arch/arm64/include/asm/assembler.h
> >> +++ b/arch/arm64/include/asm/assembler.h
> >> @@ -572,4 +572,55 @@ alternative_else_nop_endif
> >> #endif
> >> .endm
> >>
> >> +/*
> >> + * Check whether to yield to another runnable task from kernel mode NEON code
> >> + * (which runs with preemption disabled).
> >> + *
> >> + * if_will_cond_yield_neon
> >> + * // pre-yield patchup code
> >> + * do_cond_yield_neon
> >> + * // post-yield patchup code
> >> + * endif_yield_neon
> >
> > ^ Mention the lbl argument?
> >
>
> Yep will do
>
> >> + *
> >> + * - Check whether the preempt count is exactly 1, in which case disabling
> >
> > enabling ^
> >
> >> + * preemption once will make the task preemptible. If this is not the case,
> >> + * yielding is pointless.
> >> + * - Check whether TIF_NEED_RESCHED is set, and if so, disable and re-enable
> >> + * kernel mode NEON (which will trigger a reschedule), and branch to the
> >> + * yield fixup code.
> >
> > Mention that neither patchup sequence is allowed to use section-changing
> > directives?
> >
> > For example:
> >
> > if_will_cond_yield_neon
> > // some code
> >
> > .pushsection .rodata, "a"
> > foo: .quad // some literal data for some reason
> > .popsection
> >
> > // some code
> > do_cond_yield_neon
> >
> > is not safe, because .previous is now .rodata.
> >
>
> Are you sure this is true?
>
> The gas info page for .previous tells me
>
> In terms of the section stack, this directive swaps the current
> section with the top section on the section stack.
That statement is either misleading or wrong, but the actual behaviour
doesn't seem straightforward either.
> and it seems to me that .rodata is no longer on the section stack
> after .popsection. In that sense, push/pop should be safe, but
My suggestion does seem to work here (I've used it in the past) but
it's probably best not to rely on it unnecessarily... One would
have to read the gas code and get the docs fixed first.
> section/subsection/previous is not (I think). So yes, let's put a note
> in to mention that section directives are unsupported.
... here I'd agree: relying on dubious tricks doesn't seem justified,
since there's doubt about whether my suggestion is really safe.
>
> > (You could protect against this with
> > .pushsection .text; .previous; .subsection 1; // ...
> > .popsection
> > but it may be overkill.)
> >
> >> + *
> >> + * This macro sequence clobbers x0, x1 and the flags register unconditionally,
> >> + * and may clobber x2 .. x18 if the yield path is taken.
> >> + */
> >> +
> >> + .macro cond_yield_neon, lbl
> >> + if_will_cond_yield_neon
> >> + do_cond_yield_neon
> >> + endif_yield_neon \lbl
> >> + .endm
> >> +
> >> + .macro if_will_cond_yield_neon
> >> +#ifdef CONFIG_PREEMPT
> >> + get_thread_info x0
> >> + ldr w1, [x0, #TSK_TI_PREEMPT]
> >> + ldr x0, [x0, #TSK_TI_FLAGS]
> >> + cmp w1, #1 // == PREEMPT_OFFSET
> >
> > Can we at least drop a BUILD_BUG_ON() somewhere to check this?
> >
> > Maybe in kernel_neon_begin() since this is intimately kernel-mode NEON
> > related.
> >
>
> Sure.
>
> >> + csel x0, x0, xzr, eq
> >> + tbnz x0, #TIF_NEED_RESCHED, 5555f // needs rescheduling?
> >> +#endif
> >
> > A comment that we will fall through to 6666f here may be helpful.
> >
>
> Indeed. Will add that.
>
> >> + .subsection 1
> >> +5555:
> >> + .endm
> >> +
> >> + .macro do_cond_yield_neon
> >> + bl kernel_neon_end
> >> + bl kernel_neon_begin
> >> + .endm
> >> +
> >> + .macro endif_yield_neon, lbl=6666f
> >> + b \lbl
> >> + .previous
> >> +6666:
> >
> > Could have slightly more random "random" labels here, but otherwise
> > it looks ok to me.
> >
>
> Which number did you have in mind that is more random than 6666? :-)
>
> > I might go through and replace all the random labels with something
> > more robust sometime, but I've never been sure it was worth the
> > effort...
> >
>
> I guess we could invent all kinds of elaborate schemes but as you say,
> having 4 digit numbers and grep'ing the source before you add a new
> one has been working fine so far, so I don't think it should be a
> priority.
You could try $RANDOM for inspiration.
Nested macro use is rare, but a scheme with only 10 possible random
numbers seems a little too optimistic -- and in practice people don't
always remember to grep when adding new ones.
9999, 8888, 1111 and 2222 are already taken even without this patch.
Cheers
---Dave
^ permalink raw reply [flat|nested] 62+ messages in thread
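To make the intended calling convention of the three macros concrete, a
hypothetical caller might be laid out as below. This is not taken from the
series: the register choices (x19 as a state buffer, w20 as a block count) and
the spilled q0/q1 state are purely illustrative, and it assumes the function
prologue has already saved lr (e.g. with the stack-frame push macro added
earlier in the series), since do_cond_yield_neon makes two bl calls.

  	0:	// ... process one block of input using the NEON ...

  		if_will_cond_yield_neon
  		// pre-yield patchup: assembled into another subsection, runs only if we yield
  		stp	q0, q1, [x19]		// x19: caller-provided buffer for live NEON state
  		do_cond_yield_neon		// kernel_neon_end() + kernel_neon_begin()
  		// post-yield patchup: reload the state now that the NEON is enabled again
  		ldp	q0, q1, [x19]
  		endif_yield_neon		// default label: resume right after this point

  		subs	w20, w20, #1		// w20: number of blocks remaining
  		b.ne	0b
  		ret

When nothing needs to be preserved across the yield, the combined
cond_yield_neon macro quoted above expresses the same sequence with empty
patchup sections. Because x2..x18 may be clobbered on the yield path, anything
live at this point must either be in callee-saved registers (as x19/x20 are
here) or be spilled in the pre-yield patchup code.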
* Re: [PATCH v3 11/20] arm64: assembler: add macros to conditionally yield the NEON under PREEMPT
2017-12-07 15:47 ` Ard Biesheuvel
@ 2017-12-07 16:15 ` Dave Martin
-1 siblings, 0 replies; 62+ messages in thread
From: Dave Martin @ 2017-12-07 16:15 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Mark Rutland, Herbert Xu, Peter Zijlstra, Catalin Marinas,
Sebastian Andrzej Siewior, Will Deacon, Russell King - ARM Linux,
Steven Rostedt, linux-crypto, Thomas Gleixner, linux-arm-kernel,
linux-rt-users
On Thu, Dec 07, 2017 at 03:47:43PM +0000, Ard Biesheuvel wrote:
> On 7 December 2017 at 14:50, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> > On 7 December 2017 at 14:39, Dave Martin <Dave.Martin@arm.com> wrote:
> >> On Wed, Dec 06, 2017 at 07:43:37PM +0000, Ard Biesheuvel wrote:
[...]
> >>> + .macro if_will_cond_yield_neon
> >>> +#ifdef CONFIG_PREEMPT
> >>> + get_thread_info x0
> >>> + ldr w1, [x0, #TSK_TI_PREEMPT]
> >>> + ldr x0, [x0, #TSK_TI_FLAGS]
> >>> + cmp w1, #1 // == PREEMPT_OFFSET
> >>
> >> Can we at least drop a BUILD_BUG_ON() somewhere to check this?
> >>
> >> Maybe in kernel_neon_begin() since this is intimately kernel-mode NEON
> >> related.
> >>
> >
> > Sure.
> >
>
> I only just understood your asm-offsets remark earlier. I wasn't aware
> that it allows exposing random constants as well (although it is
> fairly obvious now that I do). So I will expose PREEMPT_OFFSET rather
> than open code it
[...]
OK, yes, this works for any C expression that is compile-time constant
but requires evaluation that the assembler doesn't understand.
Cheers
---Dave
^ permalink raw reply [flat|nested] 62+ messages in thread
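For reference, exposing such a constant to assembly usually means adding a
DEFINE() entry to arch/arm64/kernel/asm-offsets.c, roughly as sketched below.
The symbol name and the added line are assumptions for illustration; the series
may end up exposing the constant differently.

  	/* arch/arm64/kernel/asm-offsets.c -- sketch, added entry only */
  	#include <linux/kbuild.h>	/* DEFINE() */
  	#include <linux/preempt.h>	/* PREEMPT_OFFSET */

  	int main(void)
  	{
  		/* ... existing entries ... */

  		/* Emits "#define PREEMPT_OFFSET 1" into the generated asm-offsets.h,
  		 * so .S files can reference the symbolic name directly. */
  		DEFINE(PREEMPT_OFFSET, PREEMPT_OFFSET);

  		return 0;
  	}

With that in place, the macro could read "cmp w1, #PREEMPT_OFFSET" instead of
hard-coding "#1" with a comment, which is what Ard proposes above.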
end of thread, other threads:[~2017-12-07 16:15 UTC | newest]
Thread overview: 62+ messages
2017-12-06 19:43 [PATCH v3 00/20] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
2017-12-06 19:43 ` [PATCH v3 01/20] crypto: testmgr - add a new test case for CRC-T10DIF Ard Biesheuvel
2017-12-06 19:43 ` [PATCH v3 02/20] crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop Ard Biesheuvel
2017-12-06 19:43 ` [PATCH v3 03/20] crypto: arm64/aes-blk " Ard Biesheuvel
2017-12-06 19:43 ` [PATCH v3 04/20] crypto: arm64/aes-bs " Ard Biesheuvel
2017-12-06 19:43 ` [PATCH v3 05/20] crypto: arm64/chacha20 " Ard Biesheuvel
2017-12-06 19:43 ` [PATCH v3 06/20] crypto: arm64/aes-blk - remove configurable interleave Ard Biesheuvel
2017-12-06 19:43 ` [PATCH v3 07/20] crypto: arm64/aes-blk - add 4 way interleave to CBC encrypt path Ard Biesheuvel
2017-12-06 19:43 ` [PATCH v3 08/20] crypto: arm64/aes-blk - add 4 way interleave to CBC-MAC " Ard Biesheuvel
2017-12-06 19:43 ` [PATCH v3 09/20] crypto: arm64/sha256-neon - play nice with CONFIG_PREEMPT kernels Ard Biesheuvel
2017-12-06 19:43 ` [PATCH v3 10/20] arm64: assembler: add utility macros to push/pop stack frames Ard Biesheuvel
2017-12-07 14:11 ` Dave Martin
2017-12-07 14:21 ` Ard Biesheuvel
2017-12-07 14:53 ` Dave Martin
2017-12-07 14:58 ` Ard Biesheuvel
2017-12-06 19:43 ` [PATCH v3 11/20] arm64: assembler: add macros to conditionally yield the NEON under PREEMPT Ard Biesheuvel
2017-12-07 14:39 ` Dave Martin
2017-12-07 14:50 ` Ard Biesheuvel
2017-12-07 15:47 ` Ard Biesheuvel
2017-12-07 15:51 ` Ard Biesheuvel
2017-12-07 16:15 ` Dave Martin
2017-12-07 16:11 ` Dave Martin
2017-12-06 19:43 ` [PATCH v3 12/20] crypto: arm64/sha1-ce - yield NEON after every block of input Ard Biesheuvel
2017-12-06 19:43 ` [PATCH v3 13/20] crypto: arm64/sha2-ce " Ard Biesheuvel
2017-12-06 19:43 ` [PATCH v3 14/20] crypto: arm64/aes-ccm " Ard Biesheuvel
2017-12-06 19:43 ` [PATCH v3 15/20] crypto: arm64/aes-blk " Ard Biesheuvel
2017-12-06 19:43 ` [PATCH v3 16/20] crypto: arm64/aes-bs " Ard Biesheuvel
2017-12-06 19:43 ` [PATCH v3 17/20] crypto: arm64/aes-ghash " Ard Biesheuvel
2017-12-06 19:43 ` [PATCH v3 18/20] crypto: arm64/crc32-ce " Ard Biesheuvel
2017-12-06 19:43 ` [PATCH v3 19/20] crypto: arm64/crct10dif-ce " Ard Biesheuvel
2017-12-06 19:43 ` [PATCH v3 20/20] DO NOT MERGE Ard Biesheuvel