linux-rt-users.vger.kernel.org archive mirror
* [PATCH v2 00/19] crypto: arm64 - play nice with CONFIG_PREEMPT
@ 2017-12-04 12:26 Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 01/19] crypto: testmgr - add a new test case for CRC-T10DIF Ard Biesheuvel
                   ` (18 more replies)
  0 siblings, 19 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 12:26 UTC (permalink / raw)
  To: linux-crypto
  Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
	Russell King - ARM Linux, Sebastian Andrzej Siewior,
	Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
	Will Deacon, Steven Rostedt, Thomas Gleixner

This is a followup to 'crypto: arm64 - disable NEON across scatterwalk
API calls', which was sent out last Friday.

As reported by Sebastian, the way the arm64 NEON crypto code currently
keeps kernel mode NEON enabled across calls into skcipher_walk_xxx() is
causing problems with RT builds: the skcipher walk API may allocate and
free the temporary buffers it uses to present the input and output
arrays to the crypto algorithm in chunks of the algorithm's natural
block size, and doing so with NEON enabled means we are allocating and
freeing memory with preemption disabled.

This was deliberate: when this code was introduced, each
kernel_neon_begin() and kernel_neon_end() call incurred a fixed penalty
of storing or loading, respectively, the contents of all NEON registers
to/from memory, so doing it less often had an obvious performance
benefit. However, in the meantime, we have refactored the core kernel
mode NEON code: kernel_neon_begin() now only incurs this penalty the
first time it is called after entering the kernel, and the NEON
register restore is deferred until returning to userland. This means
pulling those calls into the loops that iterate over the input/output
of the crypto algorithm is no longer a big deal (although there are
some places in the code where we relied on the NEON registers retaining
their values between calls).

So let's clean this up for arm64: update the NEON based skcipher drivers to
no longer keep the NEON enabled when calling into the skcipher walk API.

As pointed out by Peter, this only solves part of the problem. So let's
tackle it more thoroughly, and update the algorithms to test the
NEED_RESCHED flag after each fixed chunk of input is processed. An
attempt was made to align the different algorithms with regard to how
much work such a fixed chunk entails: e.g., yielding after every block
for an algorithm that operates on 16 byte blocks at < 1 cycle per byte
would be rather pointless.

Changes since v1:
- add CRC-T10DIF test vector (#1)
- stop using GFP_ATOMIC in scatterwalk API calls, now that they are executed
  with preemption enabled (#2 - #6)
- do some preparatory refactoring on the AES block mode code (#7 - #9)
- add yield patches (#10 - #18)
- add test patch (#19) - DO NOT MERGE

Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-rt-users@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>

Ard Biesheuvel (19):
  crypto: testmgr - add a new test case for CRC-T10DIF
  crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop
  crypto: arm64/aes-blk - move kernel mode neon en/disable into loop
  crypto: arm64/aes-bs - move kernel mode neon en/disable into loop
  crypto: arm64/chacha20 - move kernel mode neon en/disable into loop
  crypto: arm64/ghash - move kernel mode neon en/disable into loop
  crypto: arm64/aes-blk - remove configurable interleave
  crypto: arm64/aes-blk - add 4 way interleave to CBC encrypt path
  crypto: arm64/aes-blk - add 4 way interleave to CBC-MAC encrypt path
  crypto: arm64/sha256-neon - play nice with CONFIG_PREEMPT kernels
  arm64: assembler: add macro to conditionally yield the NEON under
    PREEMPT
  crypto: arm64/sha1-ce - yield every 8 blocks of input
  crypto: arm64/sha2-ce - yield every 8 blocks of input
  crypto: arm64/aes-blk - yield after processing each 64 bytes of input
  crypto: arm64/aes-bs - yield after processing each 128 bytes of input
  crypto: arm64/aes-ghash - yield after processing fixed number of
    blocks
  crypto: arm64/crc32-ce - yield NEON every 16 blocks of input
  crypto: arm64/crct10dif-ce - yield NEON every 8 blocks of input
  DO NOT MERGE

 arch/arm64/crypto/Makefile             |   3 -
 arch/arm64/crypto/aes-ce-ccm-glue.c    |  47 +-
 arch/arm64/crypto/aes-ce.S             |  17 +-
 arch/arm64/crypto/aes-glue.c           |  95 ++-
 arch/arm64/crypto/aes-modes.S          | 624 ++++++++++----------
 arch/arm64/crypto/aes-neon.S           |   2 +
 arch/arm64/crypto/aes-neonbs-core.S    | 317 ++++++----
 arch/arm64/crypto/aes-neonbs-glue.c    |  48 +-
 arch/arm64/crypto/chacha20-neon-glue.c |  12 +-
 arch/arm64/crypto/crc32-ce-core.S      |  55 +-
 arch/arm64/crypto/crct10dif-ce-core.S  |  39 +-
 arch/arm64/crypto/ghash-ce-core.S      | 128 ++--
 arch/arm64/crypto/ghash-ce-glue.c      |  17 +-
 arch/arm64/crypto/sha1-ce-core.S       |  45 +-
 arch/arm64/crypto/sha2-ce-core.S       |  40 +-
 arch/arm64/crypto/sha256-glue.c        |  36 +-
 arch/arm64/include/asm/assembler.h     |  83 +++
 crypto/testmgr.h                       | 259 ++++++++
 18 files changed, 1231 insertions(+), 636 deletions(-)

-- 
2.11.0


* [PATCH v2 01/19] crypto: testmgr - add a new test case for CRC-T10DIF
  2017-12-04 12:26 [PATCH v2 00/19] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
@ 2017-12-04 12:26 ` Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 02/19] crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop Ard Biesheuvel
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 12:26 UTC (permalink / raw)
  To: linux-crypto
  Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
	Russell King - ARM Linux, Sebastian Andrzej Siewior,
	Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
	Will Deacon, Steven Rostedt, Thomas Gleixner

In order to be able to test yield support under preempt, add a test
vector for CRC-T10DIF that is long enough to require multiple
iterations of the primary loop of the accelerated x86 and arm64
implementations, and thus allows preemption to occur between them.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/testmgr.h | 259 ++++++++++++++++++++
 1 file changed, 259 insertions(+)

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index a714b6293959..0c849aec161d 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -1494,6 +1494,265 @@ static const struct hash_testvec crct10dif_tv_template[] = {
 		.digest		= (u8 *)(u16 []){ 0x44c6 },
 		.np		= 4,
 		.tap		= { 1, 255, 57, 6 },
+	}, {
+		.plaintext =	"\x6e\x05\x79\x10\xa7\x1b\xb2\x49"
+				"\xe0\x54\xeb\x82\x19\x8d\x24\xbb"
+				"\x2f\xc6\x5d\xf4\x68\xff\x96\x0a"
+				"\xa1\x38\xcf\x43\xda\x71\x08\x7c"
+				"\x13\xaa\x1e\xb5\x4c\xe3\x57\xee"
+				"\x85\x1c\x90\x27\xbe\x32\xc9\x60"
+				"\xf7\x6b\x02\x99\x0d\xa4\x3b\xd2"
+				"\x46\xdd\x74\x0b\x7f\x16\xad\x21"
+				"\xb8\x4f\xe6\x5a\xf1\x88\x1f\x93"
+				"\x2a\xc1\x35\xcc\x63\xfa\x6e\x05"
+				"\x9c\x10\xa7\x3e\xd5\x49\xe0\x77"
+				"\x0e\x82\x19\xb0\x24\xbb\x52\xe9"
+				"\x5d\xf4\x8b\x22\x96\x2d\xc4\x38"
+				"\xcf\x66\xfd\x71\x08\x9f\x13\xaa"
+				"\x41\xd8\x4c\xe3\x7a\x11\x85\x1c"
+				"\xb3\x27\xbe\x55\xec\x60\xf7\x8e"
+				"\x02\x99\x30\xc7\x3b\xd2\x69\x00"
+				"\x74\x0b\xa2\x16\xad\x44\xdb\x4f"
+				"\xe6\x7d\x14\x88\x1f\xb6\x2a\xc1"
+				"\x58\xef\x63\xfa\x91\x05\x9c\x33"
+				"\xca\x3e\xd5\x6c\x03\x77\x0e\xa5"
+				"\x19\xb0\x47\xde\x52\xe9\x80\x17"
+				"\x8b\x22\xb9\x2d\xc4\x5b\xf2\x66"
+				"\xfd\x94\x08\x9f\x36\xcd\x41\xd8"
+				"\x6f\x06\x7a\x11\xa8\x1c\xb3\x4a"
+				"\xe1\x55\xec\x83\x1a\x8e\x25\xbc"
+				"\x30\xc7\x5e\xf5\x69\x00\x97\x0b"
+				"\xa2\x39\xd0\x44\xdb\x72\x09\x7d"
+				"\x14\xab\x1f\xb6\x4d\xe4\x58\xef"
+				"\x86\x1d\x91\x28\xbf\x33\xca\x61"
+				"\xf8\x6c\x03\x9a\x0e\xa5\x3c\xd3"
+				"\x47\xde\x75\x0c\x80\x17\xae\x22"
+				"\xb9\x50\xe7\x5b\xf2\x89\x20\x94"
+				"\x2b\xc2\x36\xcd\x64\xfb\x6f\x06"
+				"\x9d\x11\xa8\x3f\xd6\x4a\xe1\x78"
+				"\x0f\x83\x1a\xb1\x25\xbc\x53\xea"
+				"\x5e\xf5\x8c\x00\x97\x2e\xc5\x39"
+				"\xd0\x67\xfe\x72\x09\xa0\x14\xab"
+				"\x42\xd9\x4d\xe4\x7b\x12\x86\x1d"
+				"\xb4\x28\xbf\x56\xed\x61\xf8\x8f"
+				"\x03\x9a\x31\xc8\x3c\xd3\x6a\x01"
+				"\x75\x0c\xa3\x17\xae\x45\xdc\x50"
+				"\xe7\x7e\x15\x89\x20\xb7\x2b\xc2"
+				"\x59\xf0\x64\xfb\x92\x06\x9d\x34"
+				"\xcb\x3f\xd6\x6d\x04\x78\x0f\xa6"
+				"\x1a\xb1\x48\xdf\x53\xea\x81\x18"
+				"\x8c\x23\xba\x2e\xc5\x5c\xf3\x67"
+				"\xfe\x95\x09\xa0\x37\xce\x42\xd9"
+				"\x70\x07\x7b\x12\xa9\x1d\xb4\x4b"
+				"\xe2\x56\xed\x84\x1b\x8f\x26\xbd"
+				"\x31\xc8\x5f\xf6\x6a\x01\x98\x0c"
+				"\xa3\x3a\xd1\x45\xdc\x73\x0a\x7e"
+				"\x15\xac\x20\xb7\x4e\xe5\x59\xf0"
+				"\x87\x1e\x92\x29\xc0\x34\xcb\x62"
+				"\xf9\x6d\x04\x9b\x0f\xa6\x3d\xd4"
+				"\x48\xdf\x76\x0d\x81\x18\xaf\x23"
+				"\xba\x51\xe8\x5c\xf3\x8a\x21\x95"
+				"\x2c\xc3\x37\xce\x65\xfc\x70\x07"
+				"\x9e\x12\xa9\x40\xd7\x4b\xe2\x79"
+				"\x10\x84\x1b\xb2\x26\xbd\x54\xeb"
+				"\x5f\xf6\x8d\x01\x98\x2f\xc6\x3a"
+				"\xd1\x68\xff\x73\x0a\xa1\x15\xac"
+				"\x43\xda\x4e\xe5\x7c\x13\x87\x1e"
+				"\xb5\x29\xc0\x57\xee\x62\xf9\x90"
+				"\x04\x9b\x32\xc9\x3d\xd4\x6b\x02"
+				"\x76\x0d\xa4\x18\xaf\x46\xdd\x51"
+				"\xe8\x7f\x16\x8a\x21\xb8\x2c\xc3"
+				"\x5a\xf1\x65\xfc\x93\x07\x9e\x35"
+				"\xcc\x40\xd7\x6e\x05\x79\x10\xa7"
+				"\x1b\xb2\x49\xe0\x54\xeb\x82\x19"
+				"\x8d\x24\xbb\x2f\xc6\x5d\xf4\x68"
+				"\xff\x96\x0a\xa1\x38\xcf\x43\xda"
+				"\x71\x08\x7c\x13\xaa\x1e\xb5\x4c"
+				"\xe3\x57\xee\x85\x1c\x90\x27\xbe"
+				"\x32\xc9\x60\xf7\x6b\x02\x99\x0d"
+				"\xa4\x3b\xd2\x46\xdd\x74\x0b\x7f"
+				"\x16\xad\x21\xb8\x4f\xe6\x5a\xf1"
+				"\x88\x1f\x93\x2a\xc1\x35\xcc\x63"
+				"\xfa\x6e\x05\x9c\x10\xa7\x3e\xd5"
+				"\x49\xe0\x77\x0e\x82\x19\xb0\x24"
+				"\xbb\x52\xe9\x5d\xf4\x8b\x22\x96"
+				"\x2d\xc4\x38\xcf\x66\xfd\x71\x08"
+				"\x9f\x13\xaa\x41\xd8\x4c\xe3\x7a"
+				"\x11\x85\x1c\xb3\x27\xbe\x55\xec"
+				"\x60\xf7\x8e\x02\x99\x30\xc7\x3b"
+				"\xd2\x69\x00\x74\x0b\xa2\x16\xad"
+				"\x44\xdb\x4f\xe6\x7d\x14\x88\x1f"
+				"\xb6\x2a\xc1\x58\xef\x63\xfa\x91"
+				"\x05\x9c\x33\xca\x3e\xd5\x6c\x03"
+				"\x77\x0e\xa5\x19\xb0\x47\xde\x52"
+				"\xe9\x80\x17\x8b\x22\xb9\x2d\xc4"
+				"\x5b\xf2\x66\xfd\x94\x08\x9f\x36"
+				"\xcd\x41\xd8\x6f\x06\x7a\x11\xa8"
+				"\x1c\xb3\x4a\xe1\x55\xec\x83\x1a"
+				"\x8e\x25\xbc\x30\xc7\x5e\xf5\x69"
+				"\x00\x97\x0b\xa2\x39\xd0\x44\xdb"
+				"\x72\x09\x7d\x14\xab\x1f\xb6\x4d"
+				"\xe4\x58\xef\x86\x1d\x91\x28\xbf"
+				"\x33\xca\x61\xf8\x6c\x03\x9a\x0e"
+				"\xa5\x3c\xd3\x47\xde\x75\x0c\x80"
+				"\x17\xae\x22\xb9\x50\xe7\x5b\xf2"
+				"\x89\x20\x94\x2b\xc2\x36\xcd\x64"
+				"\xfb\x6f\x06\x9d\x11\xa8\x3f\xd6"
+				"\x4a\xe1\x78\x0f\x83\x1a\xb1\x25"
+				"\xbc\x53\xea\x5e\xf5\x8c\x00\x97"
+				"\x2e\xc5\x39\xd0\x67\xfe\x72\x09"
+				"\xa0\x14\xab\x42\xd9\x4d\xe4\x7b"
+				"\x12\x86\x1d\xb4\x28\xbf\x56\xed"
+				"\x61\xf8\x8f\x03\x9a\x31\xc8\x3c"
+				"\xd3\x6a\x01\x75\x0c\xa3\x17\xae"
+				"\x45\xdc\x50\xe7\x7e\x15\x89\x20"
+				"\xb7\x2b\xc2\x59\xf0\x64\xfb\x92"
+				"\x06\x9d\x34\xcb\x3f\xd6\x6d\x04"
+				"\x78\x0f\xa6\x1a\xb1\x48\xdf\x53"
+				"\xea\x81\x18\x8c\x23\xba\x2e\xc5"
+				"\x5c\xf3\x67\xfe\x95\x09\xa0\x37"
+				"\xce\x42\xd9\x70\x07\x7b\x12\xa9"
+				"\x1d\xb4\x4b\xe2\x56\xed\x84\x1b"
+				"\x8f\x26\xbd\x31\xc8\x5f\xf6\x6a"
+				"\x01\x98\x0c\xa3\x3a\xd1\x45\xdc"
+				"\x73\x0a\x7e\x15\xac\x20\xb7\x4e"
+				"\xe5\x59\xf0\x87\x1e\x92\x29\xc0"
+				"\x34\xcb\x62\xf9\x6d\x04\x9b\x0f"
+				"\xa6\x3d\xd4\x48\xdf\x76\x0d\x81"
+				"\x18\xaf\x23\xba\x51\xe8\x5c\xf3"
+				"\x8a\x21\x95\x2c\xc3\x37\xce\x65"
+				"\xfc\x70\x07\x9e\x12\xa9\x40\xd7"
+				"\x4b\xe2\x79\x10\x84\x1b\xb2\x26"
+				"\xbd\x54\xeb\x5f\xf6\x8d\x01\x98"
+				"\x2f\xc6\x3a\xd1\x68\xff\x73\x0a"
+				"\xa1\x15\xac\x43\xda\x4e\xe5\x7c"
+				"\x13\x87\x1e\xb5\x29\xc0\x57\xee"
+				"\x62\xf9\x90\x04\x9b\x32\xc9\x3d"
+				"\xd4\x6b\x02\x76\x0d\xa4\x18\xaf"
+				"\x46\xdd\x51\xe8\x7f\x16\x8a\x21"
+				"\xb8\x2c\xc3\x5a\xf1\x65\xfc\x93"
+				"\x07\x9e\x35\xcc\x40\xd7\x6e\x05"
+				"\x79\x10\xa7\x1b\xb2\x49\xe0\x54"
+				"\xeb\x82\x19\x8d\x24\xbb\x2f\xc6"
+				"\x5d\xf4\x68\xff\x96\x0a\xa1\x38"
+				"\xcf\x43\xda\x71\x08\x7c\x13\xaa"
+				"\x1e\xb5\x4c\xe3\x57\xee\x85\x1c"
+				"\x90\x27\xbe\x32\xc9\x60\xf7\x6b"
+				"\x02\x99\x0d\xa4\x3b\xd2\x46\xdd"
+				"\x74\x0b\x7f\x16\xad\x21\xb8\x4f"
+				"\xe6\x5a\xf1\x88\x1f\x93\x2a\xc1"
+				"\x35\xcc\x63\xfa\x6e\x05\x9c\x10"
+				"\xa7\x3e\xd5\x49\xe0\x77\x0e\x82"
+				"\x19\xb0\x24\xbb\x52\xe9\x5d\xf4"
+				"\x8b\x22\x96\x2d\xc4\x38\xcf\x66"
+				"\xfd\x71\x08\x9f\x13\xaa\x41\xd8"
+				"\x4c\xe3\x7a\x11\x85\x1c\xb3\x27"
+				"\xbe\x55\xec\x60\xf7\x8e\x02\x99"
+				"\x30\xc7\x3b\xd2\x69\x00\x74\x0b"
+				"\xa2\x16\xad\x44\xdb\x4f\xe6\x7d"
+				"\x14\x88\x1f\xb6\x2a\xc1\x58\xef"
+				"\x63\xfa\x91\x05\x9c\x33\xca\x3e"
+				"\xd5\x6c\x03\x77\x0e\xa5\x19\xb0"
+				"\x47\xde\x52\xe9\x80\x17\x8b\x22"
+				"\xb9\x2d\xc4\x5b\xf2\x66\xfd\x94"
+				"\x08\x9f\x36\xcd\x41\xd8\x6f\x06"
+				"\x7a\x11\xa8\x1c\xb3\x4a\xe1\x55"
+				"\xec\x83\x1a\x8e\x25\xbc\x30\xc7"
+				"\x5e\xf5\x69\x00\x97\x0b\xa2\x39"
+				"\xd0\x44\xdb\x72\x09\x7d\x14\xab"
+				"\x1f\xb6\x4d\xe4\x58\xef\x86\x1d"
+				"\x91\x28\xbf\x33\xca\x61\xf8\x6c"
+				"\x03\x9a\x0e\xa5\x3c\xd3\x47\xde"
+				"\x75\x0c\x80\x17\xae\x22\xb9\x50"
+				"\xe7\x5b\xf2\x89\x20\x94\x2b\xc2"
+				"\x36\xcd\x64\xfb\x6f\x06\x9d\x11"
+				"\xa8\x3f\xd6\x4a\xe1\x78\x0f\x83"
+				"\x1a\xb1\x25\xbc\x53\xea\x5e\xf5"
+				"\x8c\x00\x97\x2e\xc5\x39\xd0\x67"
+				"\xfe\x72\x09\xa0\x14\xab\x42\xd9"
+				"\x4d\xe4\x7b\x12\x86\x1d\xb4\x28"
+				"\xbf\x56\xed\x61\xf8\x8f\x03\x9a"
+				"\x31\xc8\x3c\xd3\x6a\x01\x75\x0c"
+				"\xa3\x17\xae\x45\xdc\x50\xe7\x7e"
+				"\x15\x89\x20\xb7\x2b\xc2\x59\xf0"
+				"\x64\xfb\x92\x06\x9d\x34\xcb\x3f"
+				"\xd6\x6d\x04\x78\x0f\xa6\x1a\xb1"
+				"\x48\xdf\x53\xea\x81\x18\x8c\x23"
+				"\xba\x2e\xc5\x5c\xf3\x67\xfe\x95"
+				"\x09\xa0\x37\xce\x42\xd9\x70\x07"
+				"\x7b\x12\xa9\x1d\xb4\x4b\xe2\x56"
+				"\xed\x84\x1b\x8f\x26\xbd\x31\xc8"
+				"\x5f\xf6\x6a\x01\x98\x0c\xa3\x3a"
+				"\xd1\x45\xdc\x73\x0a\x7e\x15\xac"
+				"\x20\xb7\x4e\xe5\x59\xf0\x87\x1e"
+				"\x92\x29\xc0\x34\xcb\x62\xf9\x6d"
+				"\x04\x9b\x0f\xa6\x3d\xd4\x48\xdf"
+				"\x76\x0d\x81\x18\xaf\x23\xba\x51"
+				"\xe8\x5c\xf3\x8a\x21\x95\x2c\xc3"
+				"\x37\xce\x65\xfc\x70\x07\x9e\x12"
+				"\xa9\x40\xd7\x4b\xe2\x79\x10\x84"
+				"\x1b\xb2\x26\xbd\x54\xeb\x5f\xf6"
+				"\x8d\x01\x98\x2f\xc6\x3a\xd1\x68"
+				"\xff\x73\x0a\xa1\x15\xac\x43\xda"
+				"\x4e\xe5\x7c\x13\x87\x1e\xb5\x29"
+				"\xc0\x57\xee\x62\xf9\x90\x04\x9b"
+				"\x32\xc9\x3d\xd4\x6b\x02\x76\x0d"
+				"\xa4\x18\xaf\x46\xdd\x51\xe8\x7f"
+				"\x16\x8a\x21\xb8\x2c\xc3\x5a\xf1"
+				"\x65\xfc\x93\x07\x9e\x35\xcc\x40"
+				"\xd7\x6e\x05\x79\x10\xa7\x1b\xb2"
+				"\x49\xe0\x54\xeb\x82\x19\x8d\x24"
+				"\xbb\x2f\xc6\x5d\xf4\x68\xff\x96"
+				"\x0a\xa1\x38\xcf\x43\xda\x71\x08"
+				"\x7c\x13\xaa\x1e\xb5\x4c\xe3\x57"
+				"\xee\x85\x1c\x90\x27\xbe\x32\xc9"
+				"\x60\xf7\x6b\x02\x99\x0d\xa4\x3b"
+				"\xd2\x46\xdd\x74\x0b\x7f\x16\xad"
+				"\x21\xb8\x4f\xe6\x5a\xf1\x88\x1f"
+				"\x93\x2a\xc1\x35\xcc\x63\xfa\x6e"
+				"\x05\x9c\x10\xa7\x3e\xd5\x49\xe0"
+				"\x77\x0e\x82\x19\xb0\x24\xbb\x52"
+				"\xe9\x5d\xf4\x8b\x22\x96\x2d\xc4"
+				"\x38\xcf\x66\xfd\x71\x08\x9f\x13"
+				"\xaa\x41\xd8\x4c\xe3\x7a\x11\x85"
+				"\x1c\xb3\x27\xbe\x55\xec\x60\xf7"
+				"\x8e\x02\x99\x30\xc7\x3b\xd2\x69"
+				"\x00\x74\x0b\xa2\x16\xad\x44\xdb"
+				"\x4f\xe6\x7d\x14\x88\x1f\xb6\x2a"
+				"\xc1\x58\xef\x63\xfa\x91\x05\x9c"
+				"\x33\xca\x3e\xd5\x6c\x03\x77\x0e"
+				"\xa5\x19\xb0\x47\xde\x52\xe9\x80"
+				"\x17\x8b\x22\xb9\x2d\xc4\x5b\xf2"
+				"\x66\xfd\x94\x08\x9f\x36\xcd\x41"
+				"\xd8\x6f\x06\x7a\x11\xa8\x1c\xb3"
+				"\x4a\xe1\x55\xec\x83\x1a\x8e\x25"
+				"\xbc\x30\xc7\x5e\xf5\x69\x00\x97"
+				"\x0b\xa2\x39\xd0\x44\xdb\x72\x09"
+				"\x7d\x14\xab\x1f\xb6\x4d\xe4\x58"
+				"\xef\x86\x1d\x91\x28\xbf\x33\xca"
+				"\x61\xf8\x6c\x03\x9a\x0e\xa5\x3c"
+				"\xd3\x47\xde\x75\x0c\x80\x17\xae"
+				"\x22\xb9\x50\xe7\x5b\xf2\x89\x20"
+				"\x94\x2b\xc2\x36\xcd\x64\xfb\x6f"
+				"\x06\x9d\x11\xa8\x3f\xd6\x4a\xe1"
+				"\x78\x0f\x83\x1a\xb1\x25\xbc\x53"
+				"\xea\x5e\xf5\x8c\x00\x97\x2e\xc5"
+				"\x39\xd0\x67\xfe\x72\x09\xa0\x14"
+				"\xab\x42\xd9\x4d\xe4\x7b\x12\x86"
+				"\x1d\xb4\x28\xbf\x56\xed\x61\xf8"
+				"\x8f\x03\x9a\x31\xc8\x3c\xd3\x6a"
+				"\x01\x75\x0c\xa3\x17\xae\x45\xdc"
+				"\x50\xe7\x7e\x15\x89\x20\xb7\x2b"
+				"\xc2\x59\xf0\x64\xfb\x92\x06\x9d"
+				"\x34\xcb\x3f\xd6\x6d\x04\x78\x0f"
+				"\xa6\x1a\xb1\x48\xdf\x53\xea\x81"
+				"\x18\x8c\x23\xba\x2e\xc5\x5c\xf3"
+				"\x67\xfe\x95\x09\xa0\x37\xce\x42"
+				"\xd9\x70\x07\x7b\x12\xa9\x1d\xb4"
+				"\x4b\xe2\x56\xed\x84\x1b\x8f\x26"
+				"\xbd\x31\xc8\x5f\xf6\x6a\x01\x98",
+		.psize = 2048,
+		.digest		= (u8 *)(u16 []){ 0x23ca },
 	}
 };
 
-- 
2.11.0


* [PATCH v2 02/19] crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop
  2017-12-04 12:26 [PATCH v2 00/19] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 01/19] crypto: testmgr - add a new test case for CRC-T10DIF Ard Biesheuvel
@ 2017-12-04 12:26 ` Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 03/19] crypto: arm64/aes-blk " Ard Biesheuvel
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 12:26 UTC (permalink / raw)
  To: linux-crypto
  Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
	Russell King - ARM Linux, Sebastian Andrzej Siewior,
	Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
	Will Deacon, Steven Rostedt, Thomas Gleixner

When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.

Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.

So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled).

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/aes-ce-ccm-glue.c | 47 ++++++++++----------
 1 file changed, 23 insertions(+), 24 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce-ccm-glue.c b/arch/arm64/crypto/aes-ce-ccm-glue.c
index a1254036f2b1..68b11aa690e4 100644
--- a/arch/arm64/crypto/aes-ce-ccm-glue.c
+++ b/arch/arm64/crypto/aes-ce-ccm-glue.c
@@ -107,11 +107,13 @@ static int ccm_init_mac(struct aead_request *req, u8 maciv[], u32 msglen)
 }
 
 static void ccm_update_mac(struct crypto_aes_ctx *key, u8 mac[], u8 const in[],
-			   u32 abytes, u32 *macp, bool use_neon)
+			   u32 abytes, u32 *macp)
 {
-	if (likely(use_neon)) {
+	if (may_use_simd()) {
+		kernel_neon_begin();
 		ce_aes_ccm_auth_data(mac, in, abytes, macp, key->key_enc,
 				     num_rounds(key));
+		kernel_neon_end();
 	} else {
 		if (*macp > 0 && *macp < AES_BLOCK_SIZE) {
 			int added = min(abytes, AES_BLOCK_SIZE - *macp);
@@ -143,8 +145,7 @@ static void ccm_update_mac(struct crypto_aes_ctx *key, u8 mac[], u8 const in[],
 	}
 }
 
-static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[],
-				   bool use_neon)
+static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[])
 {
 	struct crypto_aead *aead = crypto_aead_reqtfm(req);
 	struct crypto_aes_ctx *ctx = crypto_aead_ctx(aead);
@@ -163,7 +164,7 @@ static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[],
 		ltag.len = 6;
 	}
 
-	ccm_update_mac(ctx, mac, (u8 *)&ltag, ltag.len, &macp, use_neon);
+	ccm_update_mac(ctx, mac, (u8 *)&ltag, ltag.len, &macp);
 	scatterwalk_start(&walk, req->src);
 
 	do {
@@ -175,7 +176,7 @@ static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[],
 			n = scatterwalk_clamp(&walk, len);
 		}
 		p = scatterwalk_map(&walk);
-		ccm_update_mac(ctx, mac, p, n, &macp, use_neon);
+		ccm_update_mac(ctx, mac, p, n, &macp);
 		len -= n;
 
 		scatterwalk_unmap(p);
@@ -242,43 +243,42 @@ static int ccm_encrypt(struct aead_request *req)
 	u8 __aligned(8) mac[AES_BLOCK_SIZE];
 	u8 buf[AES_BLOCK_SIZE];
 	u32 len = req->cryptlen;
-	bool use_neon = may_use_simd();
 	int err;
 
 	err = ccm_init_mac(req, mac, len);
 	if (err)
 		return err;
 
-	if (likely(use_neon))
-		kernel_neon_begin();
-
 	if (req->assoclen)
-		ccm_calculate_auth_mac(req, mac, use_neon);
+		ccm_calculate_auth_mac(req, mac);
 
 	/* preserve the original iv for the final round */
 	memcpy(buf, req->iv, AES_BLOCK_SIZE);
 
 	err = skcipher_walk_aead_encrypt(&walk, req, true);
 
-	if (likely(use_neon)) {
+	if (may_use_simd()) {
 		while (walk.nbytes) {
 			u32 tail = walk.nbytes % AES_BLOCK_SIZE;
 
 			if (walk.nbytes == walk.total)
 				tail = 0;
 
+			kernel_neon_begin();
 			ce_aes_ccm_encrypt(walk.dst.virt.addr,
 					   walk.src.virt.addr,
 					   walk.nbytes - tail, ctx->key_enc,
 					   num_rounds(ctx), mac, walk.iv);
+			kernel_neon_end();
 
 			err = skcipher_walk_done(&walk, tail);
 		}
-		if (!err)
+		if (!err) {
+			kernel_neon_begin();
 			ce_aes_ccm_final(mac, buf, ctx->key_enc,
 					 num_rounds(ctx));
-
-		kernel_neon_end();
+			kernel_neon_end();
+		}
 	} else {
 		err = ccm_crypt_fallback(&walk, mac, buf, ctx, true);
 	}
@@ -301,43 +301,42 @@ static int ccm_decrypt(struct aead_request *req)
 	u8 __aligned(8) mac[AES_BLOCK_SIZE];
 	u8 buf[AES_BLOCK_SIZE];
 	u32 len = req->cryptlen - authsize;
-	bool use_neon = may_use_simd();
 	int err;
 
 	err = ccm_init_mac(req, mac, len);
 	if (err)
 		return err;
 
-	if (likely(use_neon))
-		kernel_neon_begin();
-
 	if (req->assoclen)
-		ccm_calculate_auth_mac(req, mac, use_neon);
+		ccm_calculate_auth_mac(req, mac);
 
 	/* preserve the original iv for the final round */
 	memcpy(buf, req->iv, AES_BLOCK_SIZE);
 
 	err = skcipher_walk_aead_decrypt(&walk, req, true);
 
-	if (likely(use_neon)) {
+	if (may_use_simd()) {
 		while (walk.nbytes) {
 			u32 tail = walk.nbytes % AES_BLOCK_SIZE;
 
 			if (walk.nbytes == walk.total)
 				tail = 0;
 
+			kernel_neon_begin();
 			ce_aes_ccm_decrypt(walk.dst.virt.addr,
 					   walk.src.virt.addr,
 					   walk.nbytes - tail, ctx->key_enc,
 					   num_rounds(ctx), mac, walk.iv);
+			kernel_neon_end();
 
 			err = skcipher_walk_done(&walk, tail);
 		}
-		if (!err)
+		if (!err) {
+			kernel_neon_begin();
 			ce_aes_ccm_final(mac, buf, ctx->key_enc,
 					 num_rounds(ctx));
-
-		kernel_neon_end();
+			kernel_neon_end();
+		}
 	} else {
 		err = ccm_crypt_fallback(&walk, mac, buf, ctx, false);
 	}
-- 
2.11.0



* [PATCH v2 03/19] crypto: arm64/aes-blk - move kernel mode neon en/disable into loop
  2017-12-04 12:26 [PATCH v2 00/19] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 01/19] crypto: testmgr - add a new test case for CRC-T10DIF Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 02/19] crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop Ard Biesheuvel
@ 2017-12-04 12:26 ` Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 04/19] crypto: arm64/aes-bs " Ard Biesheuvel
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 12:26 UTC (permalink / raw)
  To: linux-crypto
  Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
	Russell King - ARM Linux, Sebastian Andrzej Siewior,
	Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
	Will Deacon, Steven Rostedt, Thomas Gleixner

When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.

Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.

So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled).

Note that this requires some reshuffling of the registers in the asm
code, because the XTS routines can no longer rely on the registers to
retain their contents between invocations.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/aes-glue.c        | 95 ++++++++++----------
 arch/arm64/crypto/aes-modes.S       | 90 +++++++++----------
 arch/arm64/crypto/aes-neonbs-glue.c | 14 ++-
 3 files changed, 97 insertions(+), 102 deletions(-)

diff --git a/arch/arm64/crypto/aes-glue.c b/arch/arm64/crypto/aes-glue.c
index 998ba519a026..00a3e2fd6a48 100644
--- a/arch/arm64/crypto/aes-glue.c
+++ b/arch/arm64/crypto/aes-glue.c
@@ -64,17 +64,17 @@ MODULE_LICENSE("GPL v2");
 
 /* defined in aes-modes.S */
 asmlinkage void aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[],
-				int rounds, int blocks, int first);
+				int rounds, int blocks);
 asmlinkage void aes_ecb_decrypt(u8 out[], u8 const in[], u8 const rk[],
-				int rounds, int blocks, int first);
+				int rounds, int blocks);
 
 asmlinkage void aes_cbc_encrypt(u8 out[], u8 const in[], u8 const rk[],
-				int rounds, int blocks, u8 iv[], int first);
+				int rounds, int blocks, u8 iv[]);
 asmlinkage void aes_cbc_decrypt(u8 out[], u8 const in[], u8 const rk[],
-				int rounds, int blocks, u8 iv[], int first);
+				int rounds, int blocks, u8 iv[]);
 
 asmlinkage void aes_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[],
-				int rounds, int blocks, u8 ctr[], int first);
+				int rounds, int blocks, u8 ctr[]);
 
 asmlinkage void aes_xts_encrypt(u8 out[], u8 const in[], u8 const rk1[],
 				int rounds, int blocks, u8 const rk2[], u8 iv[],
@@ -133,19 +133,19 @@ static int ecb_encrypt(struct skcipher_request *req)
 {
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
-	int err, first, rounds = 6 + ctx->key_length / 4;
+	int err, rounds = 6 + ctx->key_length / 4;
 	struct skcipher_walk walk;
 	unsigned int blocks;
 
-	err = skcipher_walk_virt(&walk, req, true);
+	err = skcipher_walk_virt(&walk, req, false);
 
-	kernel_neon_begin();
-	for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+	while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+		kernel_neon_begin();
 		aes_ecb_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
-				(u8 *)ctx->key_enc, rounds, blocks, first);
+				(u8 *)ctx->key_enc, rounds, blocks);
+		kernel_neon_end();
 		err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
 	}
-	kernel_neon_end();
 	return err;
 }
 
@@ -153,19 +153,19 @@ static int ecb_decrypt(struct skcipher_request *req)
 {
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
-	int err, first, rounds = 6 + ctx->key_length / 4;
+	int err, rounds = 6 + ctx->key_length / 4;
 	struct skcipher_walk walk;
 	unsigned int blocks;
 
-	err = skcipher_walk_virt(&walk, req, true);
+	err = skcipher_walk_virt(&walk, req, false);
 
-	kernel_neon_begin();
-	for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+	while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+		kernel_neon_begin();
 		aes_ecb_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
-				(u8 *)ctx->key_dec, rounds, blocks, first);
+				(u8 *)ctx->key_dec, rounds, blocks);
+		kernel_neon_end();
 		err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
 	}
-	kernel_neon_end();
 	return err;
 }
 
@@ -173,20 +173,19 @@ static int cbc_encrypt(struct skcipher_request *req)
 {
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
-	int err, first, rounds = 6 + ctx->key_length / 4;
+	int err, rounds = 6 + ctx->key_length / 4;
 	struct skcipher_walk walk;
 	unsigned int blocks;
 
-	err = skcipher_walk_virt(&walk, req, true);
+	err = skcipher_walk_virt(&walk, req, false);
 
-	kernel_neon_begin();
-	for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+	while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+		kernel_neon_begin();
 		aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
-				(u8 *)ctx->key_enc, rounds, blocks, walk.iv,
-				first);
+				(u8 *)ctx->key_enc, rounds, blocks, walk.iv);
+		kernel_neon_end();
 		err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
 	}
-	kernel_neon_end();
 	return err;
 }
 
@@ -194,20 +193,19 @@ static int cbc_decrypt(struct skcipher_request *req)
 {
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
-	int err, first, rounds = 6 + ctx->key_length / 4;
+	int err, rounds = 6 + ctx->key_length / 4;
 	struct skcipher_walk walk;
 	unsigned int blocks;
 
-	err = skcipher_walk_virt(&walk, req, true);
+	err = skcipher_walk_virt(&walk, req, false);
 
-	kernel_neon_begin();
-	for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+	while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+		kernel_neon_begin();
 		aes_cbc_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
-				(u8 *)ctx->key_dec, rounds, blocks, walk.iv,
-				first);
+				(u8 *)ctx->key_dec, rounds, blocks, walk.iv);
+		kernel_neon_end();
 		err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
 	}
-	kernel_neon_end();
 	return err;
 }
 
@@ -215,20 +213,18 @@ static int ctr_encrypt(struct skcipher_request *req)
 {
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	struct crypto_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
-	int err, first, rounds = 6 + ctx->key_length / 4;
+	int err, rounds = 6 + ctx->key_length / 4;
 	struct skcipher_walk walk;
 	int blocks;
 
-	err = skcipher_walk_virt(&walk, req, true);
+	err = skcipher_walk_virt(&walk, req, false);
 
-	first = 1;
-	kernel_neon_begin();
 	while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) {
+		kernel_neon_begin();
 		aes_ctr_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
-				(u8 *)ctx->key_enc, rounds, blocks, walk.iv,
-				first);
+				(u8 *)ctx->key_enc, rounds, blocks, walk.iv);
 		err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
-		first = 0;
+		kernel_neon_end();
 	}
 	if (walk.nbytes) {
 		u8 __aligned(8) tail[AES_BLOCK_SIZE];
@@ -241,12 +237,13 @@ static int ctr_encrypt(struct skcipher_request *req)
 		 */
 		blocks = -1;
 
+		kernel_neon_begin();
 		aes_ctr_encrypt(tail, NULL, (u8 *)ctx->key_enc, rounds,
-				blocks, walk.iv, first);
+				blocks, walk.iv);
+		kernel_neon_end();
 		crypto_xor_cpy(tdst, tsrc, tail, nbytes);
 		err = skcipher_walk_done(&walk, 0);
 	}
-	kernel_neon_end();
 
 	return err;
 }
@@ -270,16 +267,16 @@ static int xts_encrypt(struct skcipher_request *req)
 	struct skcipher_walk walk;
 	unsigned int blocks;
 
-	err = skcipher_walk_virt(&walk, req, true);
+	err = skcipher_walk_virt(&walk, req, false);
 
-	kernel_neon_begin();
 	for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+		kernel_neon_begin();
 		aes_xts_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
 				(u8 *)ctx->key1.key_enc, rounds, blocks,
 				(u8 *)ctx->key2.key_enc, walk.iv, first);
+		kernel_neon_end();
 		err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
 	}
-	kernel_neon_end();
 
 	return err;
 }
@@ -292,16 +289,16 @@ static int xts_decrypt(struct skcipher_request *req)
 	struct skcipher_walk walk;
 	unsigned int blocks;
 
-	err = skcipher_walk_virt(&walk, req, true);
+	err = skcipher_walk_virt(&walk, req, false);
 
-	kernel_neon_begin();
 	for (first = 1; (blocks = (walk.nbytes / AES_BLOCK_SIZE)); first = 0) {
+		kernel_neon_begin();
 		aes_xts_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
 				(u8 *)ctx->key1.key_dec, rounds, blocks,
 				(u8 *)ctx->key2.key_enc, walk.iv, first);
+		kernel_neon_end();
 		err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
 	}
-	kernel_neon_end();
 
 	return err;
 }
@@ -425,7 +422,7 @@ static int cmac_setkey(struct crypto_shash *tfm, const u8 *in_key,
 
 	/* encrypt the zero vector */
 	kernel_neon_begin();
-	aes_ecb_encrypt(ctx->consts, (u8[AES_BLOCK_SIZE]){}, rk, rounds, 1, 1);
+	aes_ecb_encrypt(ctx->consts, (u8[AES_BLOCK_SIZE]){}, rk, rounds, 1);
 	kernel_neon_end();
 
 	cmac_gf128_mul_by_x(consts, consts);
@@ -454,8 +451,8 @@ static int xcbc_setkey(struct crypto_shash *tfm, const u8 *in_key,
 		return err;
 
 	kernel_neon_begin();
-	aes_ecb_encrypt(key, ks[0], rk, rounds, 1, 1);
-	aes_ecb_encrypt(ctx->consts, ks[1], rk, rounds, 2, 0);
+	aes_ecb_encrypt(key, ks[0], rk, rounds, 1);
+	aes_ecb_encrypt(ctx->consts, ks[1], rk, rounds, 2);
 	kernel_neon_end();
 
 	return cbcmac_setkey(tfm, key, sizeof(key));
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 2674d43d1384..65b273667b34 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -40,24 +40,24 @@
 #if INTERLEAVE == 2
 
 aes_encrypt_block2x:
-	encrypt_block2x	v0, v1, w3, x2, x6, w7
+	encrypt_block2x	v0, v1, w3, x2, x8, w7
 	ret
 ENDPROC(aes_encrypt_block2x)
 
 aes_decrypt_block2x:
-	decrypt_block2x	v0, v1, w3, x2, x6, w7
+	decrypt_block2x	v0, v1, w3, x2, x8, w7
 	ret
 ENDPROC(aes_decrypt_block2x)
 
 #elif INTERLEAVE == 4
 
 aes_encrypt_block4x:
-	encrypt_block4x	v0, v1, v2, v3, w3, x2, x6, w7
+	encrypt_block4x	v0, v1, v2, v3, w3, x2, x8, w7
 	ret
 ENDPROC(aes_encrypt_block4x)
 
 aes_decrypt_block4x:
-	decrypt_block4x	v0, v1, v2, v3, w3, x2, x6, w7
+	decrypt_block4x	v0, v1, v2, v3, w3, x2, x8, w7
 	ret
 ENDPROC(aes_decrypt_block4x)
 
@@ -86,33 +86,32 @@ ENDPROC(aes_decrypt_block4x)
 #define FRAME_POP
 
 	.macro		do_encrypt_block2x
-	encrypt_block2x	v0, v1, w3, x2, x6, w7
+	encrypt_block2x	v0, v1, w3, x2, x8, w7
 	.endm
 
 	.macro		do_decrypt_block2x
-	decrypt_block2x	v0, v1, w3, x2, x6, w7
+	decrypt_block2x	v0, v1, w3, x2, x8, w7
 	.endm
 
 	.macro		do_encrypt_block4x
-	encrypt_block4x	v0, v1, v2, v3, w3, x2, x6, w7
+	encrypt_block4x	v0, v1, v2, v3, w3, x2, x8, w7
 	.endm
 
 	.macro		do_decrypt_block4x
-	decrypt_block4x	v0, v1, v2, v3, w3, x2, x6, w7
+	decrypt_block4x	v0, v1, v2, v3, w3, x2, x8, w7
 	.endm
 
 #endif
 
 	/*
 	 * aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
-	 *		   int blocks, int first)
+	 *		   int blocks)
 	 * aes_ecb_decrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
-	 *		   int blocks, int first)
+	 *		   int blocks)
 	 */
 
 AES_ENTRY(aes_ecb_encrypt)
 	FRAME_PUSH
-	cbz		w5, .LecbencloopNx
 
 	enc_prepare	w3, x2, x5
 
@@ -148,7 +147,6 @@ AES_ENDPROC(aes_ecb_encrypt)
 
 AES_ENTRY(aes_ecb_decrypt)
 	FRAME_PUSH
-	cbz		w5, .LecbdecloopNx
 
 	dec_prepare	w3, x2, x5
 
@@ -184,14 +182,12 @@ AES_ENDPROC(aes_ecb_decrypt)
 
 	/*
 	 * aes_cbc_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
-	 *		   int blocks, u8 iv[], int first)
+	 *		   int blocks, u8 iv[])
 	 * aes_cbc_decrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
-	 *		   int blocks, u8 iv[], int first)
+	 *		   int blocks, u8 iv[])
 	 */
 
 AES_ENTRY(aes_cbc_encrypt)
-	cbz		w6, .Lcbcencloop
-
 	ld1		{v0.16b}, [x5]			/* get iv */
 	enc_prepare	w3, x2, x6
 
@@ -209,7 +205,6 @@ AES_ENDPROC(aes_cbc_encrypt)
 
 AES_ENTRY(aes_cbc_decrypt)
 	FRAME_PUSH
-	cbz		w6, .LcbcdecloopNx
 
 	ld1		{v7.16b}, [x5]			/* get iv */
 	dec_prepare	w3, x2, x6
@@ -264,20 +259,19 @@ AES_ENDPROC(aes_cbc_decrypt)
 
 	/*
 	 * aes_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
-	 *		   int blocks, u8 ctr[], int first)
+	 *		   int blocks, u8 ctr[])
 	 */
 
 AES_ENTRY(aes_ctr_encrypt)
 	FRAME_PUSH
-	cbz		w6, .Lctrnotfirst	/* 1st time around? */
+
 	enc_prepare	w3, x2, x6
 	ld1		{v4.16b}, [x5]
 
-.Lctrnotfirst:
-	umov		x8, v4.d[1]		/* keep swabbed ctr in reg */
-	rev		x8, x8
+	umov		x6, v4.d[1]		/* keep swabbed ctr in reg */
+	rev		x6, x6
 #if INTERLEAVE >= 2
-	cmn		w8, w4			/* 32 bit overflow? */
+	cmn		w6, w4			/* 32 bit overflow? */
 	bcs		.Lctrloop
 .LctrloopNx:
 	subs		w4, w4, #INTERLEAVE
@@ -285,11 +279,11 @@ AES_ENTRY(aes_ctr_encrypt)
 #if INTERLEAVE == 2
 	mov		v0.8b, v4.8b
 	mov		v1.8b, v4.8b
-	rev		x7, x8
-	add		x8, x8, #1
+	rev		x7, x6
+	add		x6, x6, #1
 	ins		v0.d[1], x7
-	rev		x7, x8
-	add		x8, x8, #1
+	rev		x7, x6
+	add		x6, x6, #1
 	ins		v1.d[1], x7
 	ld1		{v2.16b-v3.16b}, [x1], #32	/* get 2 input blocks */
 	do_encrypt_block2x
@@ -298,7 +292,7 @@ AES_ENTRY(aes_ctr_encrypt)
 	st1		{v0.16b-v1.16b}, [x0], #32
 #else
 	ldr		q8, =0x30000000200000001	/* addends 1,2,3[,0] */
-	dup		v7.4s, w8
+	dup		v7.4s, w6
 	mov		v0.16b, v4.16b
 	add		v7.4s, v7.4s, v8.4s
 	mov		v1.16b, v4.16b
@@ -316,9 +310,9 @@ AES_ENTRY(aes_ctr_encrypt)
 	eor		v2.16b, v7.16b, v2.16b
 	eor		v3.16b, v5.16b, v3.16b
 	st1		{v0.16b-v3.16b}, [x0], #64
-	add		x8, x8, #INTERLEAVE
+	add		x6, x6, #INTERLEAVE
 #endif
-	rev		x7, x8
+	rev		x7, x6
 	ins		v4.d[1], x7
 	cbz		w4, .Lctrout
 	b		.LctrloopNx
@@ -328,10 +322,10 @@ AES_ENTRY(aes_ctr_encrypt)
 #endif
 .Lctrloop:
 	mov		v0.16b, v4.16b
-	encrypt_block	v0, w3, x2, x6, w7
+	encrypt_block	v0, w3, x2, x8, w7
 
-	adds		x8, x8, #1		/* increment BE ctr */
-	rev		x7, x8
+	adds		x6, x6, #1		/* increment BE ctr */
+	rev		x7, x6
 	ins		v4.d[1], x7
 	bcs		.Lctrcarry		/* overflow? */
 
@@ -385,15 +379,17 @@ CPU_BE(	.quad		0x87, 1		)
 
 AES_ENTRY(aes_xts_encrypt)
 	FRAME_PUSH
-	cbz		w7, .LxtsencloopNx
-
 	ld1		{v4.16b}, [x6]
-	enc_prepare	w3, x5, x6
-	encrypt_block	v4, w3, x5, x6, w7		/* first tweak */
-	enc_switch_key	w3, x2, x6
+	cbz		w7, .Lxtsencnotfirst
+
+	enc_prepare	w3, x5, x8
+	encrypt_block	v4, w3, x5, x8, w7		/* first tweak */
+	enc_switch_key	w3, x2, x8
 	ldr		q7, .Lxts_mul_x
 	b		.LxtsencNx
 
+.Lxtsencnotfirst:
+	enc_prepare	w3, x2, x8
 .LxtsencloopNx:
 	ldr		q7, .Lxts_mul_x
 	next_tweak	v4, v4, v7, v8
@@ -442,7 +438,7 @@ AES_ENTRY(aes_xts_encrypt)
 .Lxtsencloop:
 	ld1		{v1.16b}, [x1], #16
 	eor		v0.16b, v1.16b, v4.16b
-	encrypt_block	v0, w3, x2, x6, w7
+	encrypt_block	v0, w3, x2, x8, w7
 	eor		v0.16b, v0.16b, v4.16b
 	st1		{v0.16b}, [x0], #16
 	subs		w4, w4, #1
@@ -450,6 +446,7 @@ AES_ENTRY(aes_xts_encrypt)
 	next_tweak	v4, v4, v7, v8
 	b		.Lxtsencloop
 .Lxtsencout:
+	st1		{v4.16b}, [x6]
 	FRAME_POP
 	ret
 AES_ENDPROC(aes_xts_encrypt)
@@ -457,15 +454,17 @@ AES_ENDPROC(aes_xts_encrypt)
 
 AES_ENTRY(aes_xts_decrypt)
 	FRAME_PUSH
-	cbz		w7, .LxtsdecloopNx
-
 	ld1		{v4.16b}, [x6]
-	enc_prepare	w3, x5, x6
-	encrypt_block	v4, w3, x5, x6, w7		/* first tweak */
-	dec_prepare	w3, x2, x6
+	cbz		w7, .Lxtsdecnotfirst
+
+	enc_prepare	w3, x5, x8
+	encrypt_block	v4, w3, x5, x8, w7		/* first tweak */
+	dec_prepare	w3, x2, x8
 	ldr		q7, .Lxts_mul_x
 	b		.LxtsdecNx
 
+.Lxtsdecnotfirst:
+	dec_prepare	w3, x2, x8
 .LxtsdecloopNx:
 	ldr		q7, .Lxts_mul_x
 	next_tweak	v4, v4, v7, v8
@@ -514,7 +513,7 @@ AES_ENTRY(aes_xts_decrypt)
 .Lxtsdecloop:
 	ld1		{v1.16b}, [x1], #16
 	eor		v0.16b, v1.16b, v4.16b
-	decrypt_block	v0, w3, x2, x6, w7
+	decrypt_block	v0, w3, x2, x8, w7
 	eor		v0.16b, v0.16b, v4.16b
 	st1		{v0.16b}, [x0], #16
 	subs		w4, w4, #1
@@ -522,6 +521,7 @@ AES_ENTRY(aes_xts_decrypt)
 	next_tweak	v4, v4, v7, v8
 	b		.Lxtsdecloop
 .Lxtsdecout:
+	st1		{v4.16b}, [x6]
 	FRAME_POP
 	ret
 AES_ENDPROC(aes_xts_decrypt)
diff --git a/arch/arm64/crypto/aes-neonbs-glue.c b/arch/arm64/crypto/aes-neonbs-glue.c
index c55d68ccb89f..9d823c77ec84 100644
--- a/arch/arm64/crypto/aes-neonbs-glue.c
+++ b/arch/arm64/crypto/aes-neonbs-glue.c
@@ -46,10 +46,9 @@ asmlinkage void aesbs_xts_decrypt(u8 out[], u8 const in[], u8 const rk[],
 
 /* borrowed from aes-neon-blk.ko */
 asmlinkage void neon_aes_ecb_encrypt(u8 out[], u8 const in[], u32 const rk[],
-				     int rounds, int blocks, int first);
+				     int rounds, int blocks);
 asmlinkage void neon_aes_cbc_encrypt(u8 out[], u8 const in[], u32 const rk[],
-				     int rounds, int blocks, u8 iv[],
-				     int first);
+				     int rounds, int blocks, u8 iv[]);
 
 struct aesbs_ctx {
 	u8	rk[13 * (8 * AES_BLOCK_SIZE) + 32];
@@ -157,7 +156,7 @@ static int cbc_encrypt(struct skcipher_request *req)
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	struct aesbs_cbc_ctx *ctx = crypto_skcipher_ctx(tfm);
 	struct skcipher_walk walk;
-	int err, first = 1;
+	int err;
 
 	err = skcipher_walk_virt(&walk, req, true);
 
@@ -167,10 +166,9 @@ static int cbc_encrypt(struct skcipher_request *req)
 
 		/* fall back to the non-bitsliced NEON implementation */
 		neon_aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
-				     ctx->enc, ctx->key.rounds, blocks, walk.iv,
-				     first);
+				     ctx->enc, ctx->key.rounds, blocks,
+				     walk.iv);
 		err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
-		first = 0;
 	}
 	kernel_neon_end();
 	return err;
@@ -311,7 +309,7 @@ static int __xts_crypt(struct skcipher_request *req,
 	kernel_neon_begin();
 
 	neon_aes_ecb_encrypt(walk.iv, walk.iv, ctx->twkey,
-			     ctx->key.rounds, 1, 1);
+			     ctx->key.rounds, 1);
 
 	while (walk.nbytes >= AES_BLOCK_SIZE) {
 		unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
-- 
2.11.0


* [PATCH v2 04/19] crypto: arm64/aes-bs - move kernel mode neon en/disable into loop
  2017-12-04 12:26 [PATCH v2 00/19] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
                   ` (2 preceding siblings ...)
  2017-12-04 12:26 ` [PATCH v2 03/19] crypto: arm64/aes-blk " Ard Biesheuvel
@ 2017-12-04 12:26 ` Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 05/19] crypto: arm64/chacha20 " Ard Biesheuvel
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 12:26 UTC (permalink / raw)
  To: linux-crypto
  Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
	Russell King - ARM Linux, Sebastian Andrzej Siewior,
	Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
	Will Deacon, Steven Rostedt, Thomas Gleixner

When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.

Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.

So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled).

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/aes-neonbs-glue.c | 36 +++++++++-----------
 1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/arch/arm64/crypto/aes-neonbs-glue.c b/arch/arm64/crypto/aes-neonbs-glue.c
index 9d823c77ec84..e7a95a566462 100644
--- a/arch/arm64/crypto/aes-neonbs-glue.c
+++ b/arch/arm64/crypto/aes-neonbs-glue.c
@@ -99,9 +99,8 @@ static int __ecb_crypt(struct skcipher_request *req,
 	struct skcipher_walk walk;
 	int err;
 
-	err = skcipher_walk_virt(&walk, req, true);
+	err = skcipher_walk_virt(&walk, req, false);
 
-	kernel_neon_begin();
 	while (walk.nbytes >= AES_BLOCK_SIZE) {
 		unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
 
@@ -109,12 +108,13 @@ static int __ecb_crypt(struct skcipher_request *req,
 			blocks = round_down(blocks,
 					    walk.stride / AES_BLOCK_SIZE);
 
+		kernel_neon_begin();
 		fn(walk.dst.virt.addr, walk.src.virt.addr, ctx->rk,
 		   ctx->rounds, blocks);
+		kernel_neon_end();
 		err = skcipher_walk_done(&walk,
 					 walk.nbytes - blocks * AES_BLOCK_SIZE);
 	}
-	kernel_neon_end();
 
 	return err;
 }
@@ -158,19 +158,19 @@ static int cbc_encrypt(struct skcipher_request *req)
 	struct skcipher_walk walk;
 	int err;
 
-	err = skcipher_walk_virt(&walk, req, true);
+	err = skcipher_walk_virt(&walk, req, false);
 
-	kernel_neon_begin();
 	while (walk.nbytes >= AES_BLOCK_SIZE) {
 		unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
 
 		/* fall back to the non-bitsliced NEON implementation */
+		kernel_neon_begin();
 		neon_aes_cbc_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
 				     ctx->enc, ctx->key.rounds, blocks,
 				     walk.iv);
+		kernel_neon_end();
 		err = skcipher_walk_done(&walk, walk.nbytes % AES_BLOCK_SIZE);
 	}
-	kernel_neon_end();
 	return err;
 }
 
@@ -181,9 +181,8 @@ static int cbc_decrypt(struct skcipher_request *req)
 	struct skcipher_walk walk;
 	int err;
 
-	err = skcipher_walk_virt(&walk, req, true);
+	err = skcipher_walk_virt(&walk, req, false);
 
-	kernel_neon_begin();
 	while (walk.nbytes >= AES_BLOCK_SIZE) {
 		unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
 
@@ -191,13 +190,14 @@ static int cbc_decrypt(struct skcipher_request *req)
 			blocks = round_down(blocks,
 					    walk.stride / AES_BLOCK_SIZE);
 
+		kernel_neon_begin();
 		aesbs_cbc_decrypt(walk.dst.virt.addr, walk.src.virt.addr,
 				  ctx->key.rk, ctx->key.rounds, blocks,
 				  walk.iv);
+		kernel_neon_end();
 		err = skcipher_walk_done(&walk,
 					 walk.nbytes - blocks * AES_BLOCK_SIZE);
 	}
-	kernel_neon_end();
 
 	return err;
 }
@@ -229,9 +229,8 @@ static int ctr_encrypt(struct skcipher_request *req)
 	u8 buf[AES_BLOCK_SIZE];
 	int err;
 
-	err = skcipher_walk_virt(&walk, req, true);
+	err = skcipher_walk_virt(&walk, req, false);
 
-	kernel_neon_begin();
 	while (walk.nbytes > 0) {
 		unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
 		u8 *final = (walk.total % AES_BLOCK_SIZE) ? buf : NULL;
@@ -242,8 +241,10 @@ static int ctr_encrypt(struct skcipher_request *req)
 			final = NULL;
 		}
 
+		kernel_neon_begin();
 		aesbs_ctr_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
 				  ctx->rk, ctx->rounds, blocks, walk.iv, final);
+		kernel_neon_end();
 
 		if (final) {
 			u8 *dst = walk.dst.virt.addr + blocks * AES_BLOCK_SIZE;
@@ -258,8 +259,6 @@ static int ctr_encrypt(struct skcipher_request *req)
 		err = skcipher_walk_done(&walk,
 					 walk.nbytes - blocks * AES_BLOCK_SIZE);
 	}
-	kernel_neon_end();
-
 	return err;
 }
 
@@ -304,12 +303,11 @@ static int __xts_crypt(struct skcipher_request *req,
 	struct skcipher_walk walk;
 	int err;
 
-	err = skcipher_walk_virt(&walk, req, true);
+	err = skcipher_walk_virt(&walk, req, false);
 
 	kernel_neon_begin();
-
-	neon_aes_ecb_encrypt(walk.iv, walk.iv, ctx->twkey,
-			     ctx->key.rounds, 1);
+	neon_aes_ecb_encrypt(walk.iv, walk.iv, ctx->twkey, ctx->key.rounds, 1);
+	kernel_neon_end();
 
 	while (walk.nbytes >= AES_BLOCK_SIZE) {
 		unsigned int blocks = walk.nbytes / AES_BLOCK_SIZE;
@@ -318,13 +316,13 @@ static int __xts_crypt(struct skcipher_request *req,
 			blocks = round_down(blocks,
 					    walk.stride / AES_BLOCK_SIZE);
 
+		kernel_neon_begin();
 		fn(walk.dst.virt.addr, walk.src.virt.addr, ctx->key.rk,
 		   ctx->key.rounds, blocks, walk.iv);
+		kernel_neon_end();
 		err = skcipher_walk_done(&walk,
 					 walk.nbytes - blocks * AES_BLOCK_SIZE);
 	}
-	kernel_neon_end();
-
 	return err;
 }
 
-- 
2.11.0



* [PATCH v2 05/19] crypto: arm64/chacha20 - move kernel mode neon en/disable into loop
  2017-12-04 12:26 [PATCH v2 00/19] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
                   ` (3 preceding siblings ...)
  2017-12-04 12:26 ` [PATCH v2 04/19] crypto: arm64/aes-bs " Ard Biesheuvel
@ 2017-12-04 12:26 ` Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 06/19] crypto: arm64/ghash " Ard Biesheuvel
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 12:26 UTC (permalink / raw)
  To: linux-crypto
  Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
	Russell King - ARM Linux, Sebastian Andrzej Siewior,
	Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
	Will Deacon, Steven Rostedt, Thomas Gleixner

When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.

Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.

So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled).

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/chacha20-neon-glue.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/crypto/chacha20-neon-glue.c b/arch/arm64/crypto/chacha20-neon-glue.c
index cbdb75d15cd0..727579c93ded 100644
--- a/arch/arm64/crypto/chacha20-neon-glue.c
+++ b/arch/arm64/crypto/chacha20-neon-glue.c
@@ -37,12 +37,19 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
 	u8 buf[CHACHA20_BLOCK_SIZE];
 
 	while (bytes >= CHACHA20_BLOCK_SIZE * 4) {
+		kernel_neon_begin();
 		chacha20_4block_xor_neon(state, dst, src);
+		kernel_neon_end();
 		bytes -= CHACHA20_BLOCK_SIZE * 4;
 		src += CHACHA20_BLOCK_SIZE * 4;
 		dst += CHACHA20_BLOCK_SIZE * 4;
 		state[12] += 4;
 	}
+
+	if (!bytes)
+		return;
+
+	kernel_neon_begin();
 	while (bytes >= CHACHA20_BLOCK_SIZE) {
 		chacha20_block_xor_neon(state, dst, src);
 		bytes -= CHACHA20_BLOCK_SIZE;
@@ -55,6 +62,7 @@ static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
 		chacha20_block_xor_neon(state, buf, buf);
 		memcpy(dst, buf, bytes);
 	}
+	kernel_neon_end();
 }
 
 static int chacha20_neon(struct skcipher_request *req)
@@ -68,11 +76,10 @@ static int chacha20_neon(struct skcipher_request *req)
 	if (!may_use_simd() || req->cryptlen <= CHACHA20_BLOCK_SIZE)
 		return crypto_chacha20_crypt(req);
 
-	err = skcipher_walk_virt(&walk, req, true);
+	err = skcipher_walk_virt(&walk, req, false);
 
 	crypto_chacha20_init(state, ctx, walk.iv);
 
-	kernel_neon_begin();
 	while (walk.nbytes > 0) {
 		unsigned int nbytes = walk.nbytes;
 
@@ -83,7 +90,6 @@ static int chacha20_neon(struct skcipher_request *req)
 				nbytes);
 		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
 	}
-	kernel_neon_end();
 
 	return err;
 }
-- 
2.11.0


* [PATCH v2 06/19] crypto: arm64/ghash - move kernel mode neon en/disable into loop
  2017-12-04 12:26 [PATCH v2 00/19] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
                   ` (4 preceding siblings ...)
  2017-12-04 12:26 ` [PATCH v2 05/19] crypto: arm64/chacha20 " Ard Biesheuvel
@ 2017-12-04 12:26 ` Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 07/19] crypto: arm64/aes-blk - remove configurable interleave Ard Biesheuvel
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 12:26 UTC (permalink / raw)
  To: linux-crypto
  Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
	Russell King - ARM Linux, Sebastian Andrzej Siewior,
	Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
	Will Deacon, Steven Rostedt, Thomas Gleixner

When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.

Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.

So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled).

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/ghash-ce-glue.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-glue.c b/arch/arm64/crypto/ghash-ce-glue.c
index cfc9c92814fd..cb39503673d4 100644
--- a/arch/arm64/crypto/ghash-ce-glue.c
+++ b/arch/arm64/crypto/ghash-ce-glue.c
@@ -368,26 +368,28 @@ static int gcm_encrypt(struct aead_request *req)
 		pmull_gcm_encrypt_block(ks, iv, NULL,
 					num_rounds(&ctx->aes_key));
 		put_unaligned_be32(3, iv + GCM_IV_SIZE);
+		kernel_neon_end();
 
-		err = skcipher_walk_aead_encrypt(&walk, req, true);
+		err = skcipher_walk_aead_encrypt(&walk, req, false);
 
 		while (walk.nbytes >= AES_BLOCK_SIZE) {
 			int blocks = walk.nbytes / AES_BLOCK_SIZE;
 
+			kernel_neon_begin();
 			pmull_gcm_encrypt(blocks, dg, walk.dst.virt.addr,
 					  walk.src.virt.addr, &ctx->ghash_key,
 					  iv, num_rounds(&ctx->aes_key), ks);
+			kernel_neon_end();
 
 			err = skcipher_walk_done(&walk,
 						 walk.nbytes % AES_BLOCK_SIZE);
 		}
-		kernel_neon_end();
 	} else {
 		__aes_arm64_encrypt(ctx->aes_key.key_enc, tag, iv,
 				    num_rounds(&ctx->aes_key));
 		put_unaligned_be32(2, iv + GCM_IV_SIZE);
 
-		err = skcipher_walk_aead_encrypt(&walk, req, true);
+		err = skcipher_walk_aead_encrypt(&walk, req, false);
 
 		while (walk.nbytes >= AES_BLOCK_SIZE) {
 			int blocks = walk.nbytes / AES_BLOCK_SIZE;
@@ -467,15 +469,18 @@ static int gcm_decrypt(struct aead_request *req)
 		pmull_gcm_encrypt_block(tag, iv, ctx->aes_key.key_enc,
 					num_rounds(&ctx->aes_key));
 		put_unaligned_be32(2, iv + GCM_IV_SIZE);
+		kernel_neon_end();
 
-		err = skcipher_walk_aead_decrypt(&walk, req, true);
+		err = skcipher_walk_aead_decrypt(&walk, req, false);
 
 		while (walk.nbytes >= AES_BLOCK_SIZE) {
 			int blocks = walk.nbytes / AES_BLOCK_SIZE;
 
+			kernel_neon_begin();
 			pmull_gcm_decrypt(blocks, dg, walk.dst.virt.addr,
 					  walk.src.virt.addr, &ctx->ghash_key,
 					  iv, num_rounds(&ctx->aes_key));
+			kernel_neon_end();
 
 			err = skcipher_walk_done(&walk,
 						 walk.nbytes % AES_BLOCK_SIZE);
@@ -483,14 +488,12 @@ static int gcm_decrypt(struct aead_request *req)
 		if (walk.nbytes)
 			pmull_gcm_encrypt_block(iv, iv, NULL,
 						num_rounds(&ctx->aes_key));
-
-		kernel_neon_end();
 	} else {
 		__aes_arm64_encrypt(ctx->aes_key.key_enc, tag, iv,
 				    num_rounds(&ctx->aes_key));
 		put_unaligned_be32(2, iv + GCM_IV_SIZE);
 
-		err = skcipher_walk_aead_decrypt(&walk, req, true);
+		err = skcipher_walk_aead_decrypt(&walk, req, false);
 
 		while (walk.nbytes >= AES_BLOCK_SIZE) {
 			int blocks = walk.nbytes / AES_BLOCK_SIZE;
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 07/19] crypto: arm64/aes-blk - remove configurable interleave
  2017-12-04 12:26 [PATCH v2 00/19] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
                   ` (5 preceding siblings ...)
  2017-12-04 12:26 ` [PATCH v2 06/19] crypto: arm64/ghash " Ard Biesheuvel
@ 2017-12-04 12:26 ` Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 08/19] crypto: arm64/aes-blk - add 4 way interleave to CBC encrypt path Ard Biesheuvel
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 12:26 UTC (permalink / raw)
  To: linux-crypto
  Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
	Russell King - ARM Linux, Sebastian Andrzej Siewior,
	Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
	Will Deacon, Steven Rostedt, Thomas Gleixner

The AES block mode implementation using Crypto Extensions or plain NEON
was written before real hardware existed, and so its interleave factor
was made build-time configurable (as was the option to instantiate
all interleaved sequences inline rather than as subroutines).

We ended up using INTERLEAVE=4 with inlining disabled for both flavors
of the core AES routines, so let's stick with that, and remove the option
to configure this at build time. This makes the code easier to modify,
which is nice now that we're adding yield support.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/Makefile    |   3 -
 arch/arm64/crypto/aes-modes.S | 237 ++++----------------
 2 files changed, 40 insertions(+), 200 deletions(-)

diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index b5edc5918c28..aaf4e9afd750 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -50,9 +50,6 @@ aes-arm64-y := aes-cipher-core.o aes-cipher-glue.o
 obj-$(CONFIG_CRYPTO_AES_ARM64_BS) += aes-neon-bs.o
 aes-neon-bs-y := aes-neonbs-core.o aes-neonbs-glue.o
 
-AFLAGS_aes-ce.o		:= -DINTERLEAVE=4
-AFLAGS_aes-neon.o	:= -DINTERLEAVE=4
-
 CFLAGS_aes-glue-ce.o	:= -DUSE_V8_CRYPTO_EXTENSIONS
 
 $(obj)/aes-glue-%.o: $(src)/aes-glue.c FORCE
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 65b273667b34..27a235b2ddee 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -13,44 +13,6 @@
 	.text
 	.align		4
 
-/*
- * There are several ways to instantiate this code:
- * - no interleave, all inline
- * - 2-way interleave, 2x calls out of line (-DINTERLEAVE=2)
- * - 2-way interleave, all inline (-DINTERLEAVE=2 -DINTERLEAVE_INLINE)
- * - 4-way interleave, 4x calls out of line (-DINTERLEAVE=4)
- * - 4-way interleave, all inline (-DINTERLEAVE=4 -DINTERLEAVE_INLINE)
- *
- * Macros imported by this code:
- * - enc_prepare	- setup NEON registers for encryption
- * - dec_prepare	- setup NEON registers for decryption
- * - enc_switch_key	- change to new key after having prepared for encryption
- * - encrypt_block	- encrypt a single block
- * - decrypt block	- decrypt a single block
- * - encrypt_block2x	- encrypt 2 blocks in parallel (if INTERLEAVE == 2)
- * - decrypt_block2x	- decrypt 2 blocks in parallel (if INTERLEAVE == 2)
- * - encrypt_block4x	- encrypt 4 blocks in parallel (if INTERLEAVE == 4)
- * - decrypt_block4x	- decrypt 4 blocks in parallel (if INTERLEAVE == 4)
- */
-
-#if defined(INTERLEAVE) && !defined(INTERLEAVE_INLINE)
-#define FRAME_PUSH	stp x29, x30, [sp,#-16]! ; mov x29, sp
-#define FRAME_POP	ldp x29, x30, [sp],#16
-
-#if INTERLEAVE == 2
-
-aes_encrypt_block2x:
-	encrypt_block2x	v0, v1, w3, x2, x8, w7
-	ret
-ENDPROC(aes_encrypt_block2x)
-
-aes_decrypt_block2x:
-	decrypt_block2x	v0, v1, w3, x2, x8, w7
-	ret
-ENDPROC(aes_decrypt_block2x)
-
-#elif INTERLEAVE == 4
-
 aes_encrypt_block4x:
 	encrypt_block4x	v0, v1, v2, v3, w3, x2, x8, w7
 	ret
@@ -61,48 +23,6 @@ aes_decrypt_block4x:
 	ret
 ENDPROC(aes_decrypt_block4x)
 
-#else
-#error INTERLEAVE should equal 2 or 4
-#endif
-
-	.macro		do_encrypt_block2x
-	bl		aes_encrypt_block2x
-	.endm
-
-	.macro		do_decrypt_block2x
-	bl		aes_decrypt_block2x
-	.endm
-
-	.macro		do_encrypt_block4x
-	bl		aes_encrypt_block4x
-	.endm
-
-	.macro		do_decrypt_block4x
-	bl		aes_decrypt_block4x
-	.endm
-
-#else
-#define FRAME_PUSH
-#define FRAME_POP
-
-	.macro		do_encrypt_block2x
-	encrypt_block2x	v0, v1, w3, x2, x8, w7
-	.endm
-
-	.macro		do_decrypt_block2x
-	decrypt_block2x	v0, v1, w3, x2, x8, w7
-	.endm
-
-	.macro		do_encrypt_block4x
-	encrypt_block4x	v0, v1, v2, v3, w3, x2, x8, w7
-	.endm
-
-	.macro		do_decrypt_block4x
-	decrypt_block4x	v0, v1, v2, v3, w3, x2, x8, w7
-	.endm
-
-#endif
-
 	/*
 	 * aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
 	 *		   int blocks)
@@ -111,28 +31,21 @@ ENDPROC(aes_decrypt_block4x)
 	 */
 
 AES_ENTRY(aes_ecb_encrypt)
-	FRAME_PUSH
+	stp		x29, x30, [sp, #-16]!
+	mov		x29, sp
 
 	enc_prepare	w3, x2, x5
 
 .LecbencloopNx:
-#if INTERLEAVE >= 2
-	subs		w4, w4, #INTERLEAVE
+	subs		w4, w4, #4
 	bmi		.Lecbenc1x
-#if INTERLEAVE == 2
-	ld1		{v0.16b-v1.16b}, [x1], #32	/* get 2 pt blocks */
-	do_encrypt_block2x
-	st1		{v0.16b-v1.16b}, [x0], #32
-#else
 	ld1		{v0.16b-v3.16b}, [x1], #64	/* get 4 pt blocks */
-	do_encrypt_block4x
+	bl		aes_encrypt_block4x
 	st1		{v0.16b-v3.16b}, [x0], #64
-#endif
 	b		.LecbencloopNx
 .Lecbenc1x:
-	adds		w4, w4, #INTERLEAVE
+	adds		w4, w4, #4
 	beq		.Lecbencout
-#endif
 .Lecbencloop:
 	ld1		{v0.16b}, [x1], #16		/* get next pt block */
 	encrypt_block	v0, w3, x2, x5, w6
@@ -140,34 +53,27 @@ AES_ENTRY(aes_ecb_encrypt)
 	subs		w4, w4, #1
 	bne		.Lecbencloop
 .Lecbencout:
-	FRAME_POP
+	ldp		x29, x30, [sp], #16
 	ret
 AES_ENDPROC(aes_ecb_encrypt)
 
 
 AES_ENTRY(aes_ecb_decrypt)
-	FRAME_PUSH
+	stp		x29, x30, [sp, #-16]!
+	mov		x29, sp
 
 	dec_prepare	w3, x2, x5
 
 .LecbdecloopNx:
-#if INTERLEAVE >= 2
-	subs		w4, w4, #INTERLEAVE
+	subs		w4, w4, #4
 	bmi		.Lecbdec1x
-#if INTERLEAVE == 2
-	ld1		{v0.16b-v1.16b}, [x1], #32	/* get 2 ct blocks */
-	do_decrypt_block2x
-	st1		{v0.16b-v1.16b}, [x0], #32
-#else
 	ld1		{v0.16b-v3.16b}, [x1], #64	/* get 4 ct blocks */
-	do_decrypt_block4x
+	bl		aes_decrypt_block4x
 	st1		{v0.16b-v3.16b}, [x0], #64
-#endif
 	b		.LecbdecloopNx
 .Lecbdec1x:
-	adds		w4, w4, #INTERLEAVE
+	adds		w4, w4, #4
 	beq		.Lecbdecout
-#endif
 .Lecbdecloop:
 	ld1		{v0.16b}, [x1], #16		/* get next ct block */
 	decrypt_block	v0, w3, x2, x5, w6
@@ -175,7 +81,7 @@ AES_ENTRY(aes_ecb_decrypt)
 	subs		w4, w4, #1
 	bne		.Lecbdecloop
 .Lecbdecout:
-	FRAME_POP
+	ldp		x29, x30, [sp], #16
 	ret
 AES_ENDPROC(aes_ecb_decrypt)
 
@@ -204,30 +110,20 @@ AES_ENDPROC(aes_cbc_encrypt)
 
 
 AES_ENTRY(aes_cbc_decrypt)
-	FRAME_PUSH
+	stp		x29, x30, [sp, #-16]!
+	mov		x29, sp
 
 	ld1		{v7.16b}, [x5]			/* get iv */
 	dec_prepare	w3, x2, x6
 
 .LcbcdecloopNx:
-#if INTERLEAVE >= 2
-	subs		w4, w4, #INTERLEAVE
+	subs		w4, w4, #4
 	bmi		.Lcbcdec1x
-#if INTERLEAVE == 2
-	ld1		{v0.16b-v1.16b}, [x1], #32	/* get 2 ct blocks */
-	mov		v2.16b, v0.16b
-	mov		v3.16b, v1.16b
-	do_decrypt_block2x
-	eor		v0.16b, v0.16b, v7.16b
-	eor		v1.16b, v1.16b, v2.16b
-	mov		v7.16b, v3.16b
-	st1		{v0.16b-v1.16b}, [x0], #32
-#else
 	ld1		{v0.16b-v3.16b}, [x1], #64	/* get 4 ct blocks */
 	mov		v4.16b, v0.16b
 	mov		v5.16b, v1.16b
 	mov		v6.16b, v2.16b
-	do_decrypt_block4x
+	bl		aes_decrypt_block4x
 	sub		x1, x1, #16
 	eor		v0.16b, v0.16b, v7.16b
 	eor		v1.16b, v1.16b, v4.16b
@@ -235,12 +131,10 @@ AES_ENTRY(aes_cbc_decrypt)
 	eor		v2.16b, v2.16b, v5.16b
 	eor		v3.16b, v3.16b, v6.16b
 	st1		{v0.16b-v3.16b}, [x0], #64
-#endif
 	b		.LcbcdecloopNx
 .Lcbcdec1x:
-	adds		w4, w4, #INTERLEAVE
+	adds		w4, w4, #4
 	beq		.Lcbcdecout
-#endif
 .Lcbcdecloop:
 	ld1		{v1.16b}, [x1], #16		/* get next ct block */
 	mov		v0.16b, v1.16b			/* ...and copy to v0 */
@@ -251,8 +145,8 @@ AES_ENTRY(aes_cbc_decrypt)
 	subs		w4, w4, #1
 	bne		.Lcbcdecloop
 .Lcbcdecout:
-	FRAME_POP
 	st1		{v7.16b}, [x5]			/* return iv */
+	ldp		x29, x30, [sp], #16
 	ret
 AES_ENDPROC(aes_cbc_decrypt)
 
@@ -263,34 +157,19 @@ AES_ENDPROC(aes_cbc_decrypt)
 	 */
 
 AES_ENTRY(aes_ctr_encrypt)
-	FRAME_PUSH
+	stp		x29, x30, [sp, #-16]!
+	mov		x29, sp
 
 	enc_prepare	w3, x2, x6
 	ld1		{v4.16b}, [x5]
 
 	umov		x6, v4.d[1]		/* keep swabbed ctr in reg */
 	rev		x6, x6
-#if INTERLEAVE >= 2
 	cmn		w6, w4			/* 32 bit overflow? */
 	bcs		.Lctrloop
 .LctrloopNx:
-	subs		w4, w4, #INTERLEAVE
+	subs		w4, w4, #4
 	bmi		.Lctr1x
-#if INTERLEAVE == 2
-	mov		v0.8b, v4.8b
-	mov		v1.8b, v4.8b
-	rev		x7, x6
-	add		x6, x6, #1
-	ins		v0.d[1], x7
-	rev		x7, x6
-	add		x6, x6, #1
-	ins		v1.d[1], x7
-	ld1		{v2.16b-v3.16b}, [x1], #32	/* get 2 input blocks */
-	do_encrypt_block2x
-	eor		v0.16b, v0.16b, v2.16b
-	eor		v1.16b, v1.16b, v3.16b
-	st1		{v0.16b-v1.16b}, [x0], #32
-#else
 	ldr		q8, =0x30000000200000001	/* addends 1,2,3[,0] */
 	dup		v7.4s, w6
 	mov		v0.16b, v4.16b
@@ -303,23 +182,21 @@ AES_ENTRY(aes_ctr_encrypt)
 	mov		v2.s[3], v8.s[1]
 	mov		v3.s[3], v8.s[2]
 	ld1		{v5.16b-v7.16b}, [x1], #48	/* get 3 input blocks */
-	do_encrypt_block4x
+	bl		aes_encrypt_block4x
 	eor		v0.16b, v5.16b, v0.16b
 	ld1		{v5.16b}, [x1], #16		/* get 1 input block  */
 	eor		v1.16b, v6.16b, v1.16b
 	eor		v2.16b, v7.16b, v2.16b
 	eor		v3.16b, v5.16b, v3.16b
 	st1		{v0.16b-v3.16b}, [x0], #64
-	add		x6, x6, #INTERLEAVE
-#endif
+	add		x6, x6, #4
 	rev		x7, x6
 	ins		v4.d[1], x7
 	cbz		w4, .Lctrout
 	b		.LctrloopNx
 .Lctr1x:
-	adds		w4, w4, #INTERLEAVE
+	adds		w4, w4, #4
 	beq		.Lctrout
-#endif
 .Lctrloop:
 	mov		v0.16b, v4.16b
 	encrypt_block	v0, w3, x2, x8, w7
@@ -339,12 +216,12 @@ AES_ENTRY(aes_ctr_encrypt)
 
 .Lctrout:
 	st1		{v4.16b}, [x5]		/* return next CTR value */
-	FRAME_POP
+	ldp		x29, x30, [sp], #16
 	ret
 
 .Lctrtailblock:
 	st1		{v0.16b}, [x0]
-	FRAME_POP
+	ldp		x29, x30, [sp], #16
 	ret
 
 .Lctrcarry:
@@ -378,7 +255,9 @@ CPU_LE(	.quad		1, 0x87		)
 CPU_BE(	.quad		0x87, 1		)
 
 AES_ENTRY(aes_xts_encrypt)
-	FRAME_PUSH
+	stp		x29, x30, [sp, #-16]!
+	mov		x29, sp
+
 	ld1		{v4.16b}, [x6]
 	cbz		w7, .Lxtsencnotfirst
 
@@ -394,25 +273,8 @@ AES_ENTRY(aes_xts_encrypt)
 	ldr		q7, .Lxts_mul_x
 	next_tweak	v4, v4, v7, v8
 .LxtsencNx:
-#if INTERLEAVE >= 2
-	subs		w4, w4, #INTERLEAVE
+	subs		w4, w4, #4
 	bmi		.Lxtsenc1x
-#if INTERLEAVE == 2
-	ld1		{v0.16b-v1.16b}, [x1], #32	/* get 2 pt blocks */
-	next_tweak	v5, v4, v7, v8
-	eor		v0.16b, v0.16b, v4.16b
-	eor		v1.16b, v1.16b, v5.16b
-	do_encrypt_block2x
-	eor		v0.16b, v0.16b, v4.16b
-	eor		v1.16b, v1.16b, v5.16b
-	st1		{v0.16b-v1.16b}, [x0], #32
-	cbz		w4, .LxtsencoutNx
-	next_tweak	v4, v5, v7, v8
-	b		.LxtsencNx
-.LxtsencoutNx:
-	mov		v4.16b, v5.16b
-	b		.Lxtsencout
-#else
 	ld1		{v0.16b-v3.16b}, [x1], #64	/* get 4 pt blocks */
 	next_tweak	v5, v4, v7, v8
 	eor		v0.16b, v0.16b, v4.16b
@@ -421,7 +283,7 @@ AES_ENTRY(aes_xts_encrypt)
 	eor		v2.16b, v2.16b, v6.16b
 	next_tweak	v7, v6, v7, v8
 	eor		v3.16b, v3.16b, v7.16b
-	do_encrypt_block4x
+	bl		aes_encrypt_block4x
 	eor		v3.16b, v3.16b, v7.16b
 	eor		v0.16b, v0.16b, v4.16b
 	eor		v1.16b, v1.16b, v5.16b
@@ -430,11 +292,9 @@ AES_ENTRY(aes_xts_encrypt)
 	mov		v4.16b, v7.16b
 	cbz		w4, .Lxtsencout
 	b		.LxtsencloopNx
-#endif
 .Lxtsenc1x:
-	adds		w4, w4, #INTERLEAVE
+	adds		w4, w4, #4
 	beq		.Lxtsencout
-#endif
 .Lxtsencloop:
 	ld1		{v1.16b}, [x1], #16
 	eor		v0.16b, v1.16b, v4.16b
@@ -447,13 +307,15 @@ AES_ENTRY(aes_xts_encrypt)
 	b		.Lxtsencloop
 .Lxtsencout:
 	st1		{v4.16b}, [x6]
-	FRAME_POP
+	ldp		x29, x30, [sp], #16
 	ret
 AES_ENDPROC(aes_xts_encrypt)
 
 
 AES_ENTRY(aes_xts_decrypt)
-	FRAME_PUSH
+	stp		x29, x30, [sp, #-16]!
+	mov		x29, sp
+
 	ld1		{v4.16b}, [x6]
 	cbz		w7, .Lxtsdecnotfirst
 
@@ -469,25 +331,8 @@ AES_ENTRY(aes_xts_decrypt)
 	ldr		q7, .Lxts_mul_x
 	next_tweak	v4, v4, v7, v8
 .LxtsdecNx:
-#if INTERLEAVE >= 2
-	subs		w4, w4, #INTERLEAVE
+	subs		w4, w4, #4
 	bmi		.Lxtsdec1x
-#if INTERLEAVE == 2
-	ld1		{v0.16b-v1.16b}, [x1], #32	/* get 2 ct blocks */
-	next_tweak	v5, v4, v7, v8
-	eor		v0.16b, v0.16b, v4.16b
-	eor		v1.16b, v1.16b, v5.16b
-	do_decrypt_block2x
-	eor		v0.16b, v0.16b, v4.16b
-	eor		v1.16b, v1.16b, v5.16b
-	st1		{v0.16b-v1.16b}, [x0], #32
-	cbz		w4, .LxtsdecoutNx
-	next_tweak	v4, v5, v7, v8
-	b		.LxtsdecNx
-.LxtsdecoutNx:
-	mov		v4.16b, v5.16b
-	b		.Lxtsdecout
-#else
 	ld1		{v0.16b-v3.16b}, [x1], #64	/* get 4 ct blocks */
 	next_tweak	v5, v4, v7, v8
 	eor		v0.16b, v0.16b, v4.16b
@@ -496,7 +341,7 @@ AES_ENTRY(aes_xts_decrypt)
 	eor		v2.16b, v2.16b, v6.16b
 	next_tweak	v7, v6, v7, v8
 	eor		v3.16b, v3.16b, v7.16b
-	do_decrypt_block4x
+	bl		aes_decrypt_block4x
 	eor		v3.16b, v3.16b, v7.16b
 	eor		v0.16b, v0.16b, v4.16b
 	eor		v1.16b, v1.16b, v5.16b
@@ -505,11 +350,9 @@ AES_ENTRY(aes_xts_decrypt)
 	mov		v4.16b, v7.16b
 	cbz		w4, .Lxtsdecout
 	b		.LxtsdecloopNx
-#endif
 .Lxtsdec1x:
-	adds		w4, w4, #INTERLEAVE
+	adds		w4, w4, #4
 	beq		.Lxtsdecout
-#endif
 .Lxtsdecloop:
 	ld1		{v1.16b}, [x1], #16
 	eor		v0.16b, v1.16b, v4.16b
@@ -522,7 +365,7 @@ AES_ENTRY(aes_xts_decrypt)
 	b		.Lxtsdecloop
 .Lxtsdecout:
 	st1		{v4.16b}, [x6]
-	FRAME_POP
+	ldp		x29, x30, [sp], #16
 	ret
 AES_ENDPROC(aes_xts_decrypt)
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 08/19] crypto: arm64/aes-blk - add 4 way interleave to CBC encrypt path
  2017-12-04 12:26 [PATCH v2 00/19] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
                   ` (6 preceding siblings ...)
  2017-12-04 12:26 ` [PATCH v2 07/19] crypto: arm64/aes-blk - remove configurable interleave Ard Biesheuvel
@ 2017-12-04 12:26 ` Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 09/19] crypto: arm64/aes-blk - add 4 way interleave to CBC-MAC " Ard Biesheuvel
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 12:26 UTC (permalink / raw)
  To: linux-crypto
  Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
	Russell King - ARM Linux, Sebastian Andrzej Siewior,
	Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
	Will Deacon, Steven Rostedt, Thomas Gleixner

CBC encryption is strictly sequential, and so the current AES code
simply processes the input one block at a time. However, we are
about to add yield support, which adds a bit of overhead, and which
we prefer to align with other modes in terms of granularity (i.e.,
it is better to have all routines yield every 64 bytes and not have
an exception for CBC encrypt, which would yield every 16 bytes).

So unroll the loop by 4. We still cannot perform the AES algorithm in
parallel, but we can at least merge the loads and stores.
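
As a sketch only (plain C; enc() stands in for the actual AES rounds and
none of these names appear in the patch), the unrolled loop amounts to:

  #include <stdint.h>
  #include <string.h>

  /* 4-way unrolled CBC encrypt: the cipher calls stay strictly serial,
   * but the four loads and four stores are merged into 64-byte chunks,
   * mirroring the single ld1/st1 of four registers in the assembly. */
  static void cbc_encrypt_4x(uint8_t *out, const uint8_t *in, int blocks,
  			     uint8_t iv[16],
  			     void (*enc)(uint8_t blk[16], const void *key),
  			     const void *key)
  {
  	while (blocks >= 4) {
  		uint8_t buf[64];

  		memcpy(buf, in, 64);			/* merged 64-byte load */
  		for (int i = 0; i < 4; i++) {
  			for (int j = 0; j < 16; j++)
  				buf[16 * i + j] ^= iv[j];	/* xor with iv / prev ct */
  			enc(&buf[16 * i], key);
  			memcpy(iv, &buf[16 * i], 16);	/* ct becomes next iv */
  		}
  		memcpy(out, buf, 64);			/* merged 64-byte store */
  		in += 64;
  		out += 64;
  		blocks -= 4;
  	}
  	/* remaining 1..3 blocks are handled one at a time, as in .Lcbcencloop */
  }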

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/aes-modes.S | 31 ++++++++++++++++----
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index 27a235b2ddee..e86535a1329d 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -94,17 +94,36 @@ AES_ENDPROC(aes_ecb_decrypt)
 	 */
 
 AES_ENTRY(aes_cbc_encrypt)
-	ld1		{v0.16b}, [x5]			/* get iv */
+	ld1		{v4.16b}, [x5]			/* get iv */
 	enc_prepare	w3, x2, x6
 
-.Lcbcencloop:
-	ld1		{v1.16b}, [x1], #16		/* get next pt block */
-	eor		v0.16b, v0.16b, v1.16b		/* ..and xor with iv */
+.Lcbcencloop4x:
+	subs		w4, w4, #4
+	bmi		.Lcbcenc1x
+	ld1		{v0.16b-v3.16b}, [x1], #64	/* get 4 pt blocks */
+	eor		v0.16b, v0.16b, v4.16b		/* ..and xor with iv */
 	encrypt_block	v0, w3, x2, x6, w7
-	st1		{v0.16b}, [x0], #16
+	eor		v1.16b, v1.16b, v0.16b
+	encrypt_block	v1, w3, x2, x6, w7
+	eor		v2.16b, v2.16b, v1.16b
+	encrypt_block	v2, w3, x2, x6, w7
+	eor		v3.16b, v3.16b, v2.16b
+	encrypt_block	v3, w3, x2, x6, w7
+	st1		{v0.16b-v3.16b}, [x0], #64
+	mov		v4.16b, v3.16b
+	b		.Lcbcencloop4x
+.Lcbcenc1x:
+	adds		w4, w4, #4
+	beq		.Lcbcencout
+.Lcbcencloop:
+	ld1		{v0.16b}, [x1], #16		/* get next pt block */
+	eor		v4.16b, v4.16b, v0.16b		/* ..and xor with iv */
+	encrypt_block	v4, w3, x2, x6, w7
+	st1		{v4.16b}, [x0], #16
 	subs		w4, w4, #1
 	bne		.Lcbcencloop
-	st1		{v0.16b}, [x5]			/* return iv */
+.Lcbcencout:
+	st1		{v4.16b}, [x5]			/* return iv */
 	ret
 AES_ENDPROC(aes_cbc_encrypt)
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 09/19] crypto: arm64/aes-blk - add 4 way interleave to CBC-MAC encrypt path
  2017-12-04 12:26 [PATCH v2 00/19] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
                   ` (7 preceding siblings ...)
  2017-12-04 12:26 ` [PATCH v2 08/19] crypto: arm64/aes-blk - add 4 way interleave to CBC encrypt path Ard Biesheuvel
@ 2017-12-04 12:26 ` Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 10/19] crypto: arm64/sha256-neon - play nice with CONFIG_PREEMPT kernels Ard Biesheuvel
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 12:26 UTC (permalink / raw)
  To: linux-crypto
  Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
	Russell King - ARM Linux, Sebastian Andrzej Siewior,
	Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
	Will Deacon, Steven Rostedt, Thomas Gleixner

CBC MAC is strictly sequential, and so the current AES code simply
processes the input one block at a time. However, we are about to add
yield support, which adds a bit of overhead, and which we prefer to
align with other modes in terms of granularity (i.e., it is better to
have all routines yield every 64 bytes and not have an exception for
CBC MAC, which would yield every 16 bytes).

So unroll the loop by 4. We still cannot perform the AES algorithm in
parallel, but we can at least merge the loads and stores.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/aes-modes.S | 23 ++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index e86535a1329d..a68412e1e3a4 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -395,8 +395,28 @@ AES_ENDPROC(aes_xts_decrypt)
 AES_ENTRY(aes_mac_update)
 	ld1		{v0.16b}, [x4]			/* get dg */
 	enc_prepare	w2, x1, x7
-	cbnz		w5, .Lmacenc
+	cbz		w5, .Lmacloop4x
 
+	encrypt_block	v0, w2, x1, x7, w8
+
+.Lmacloop4x:
+	subs		w3, w3, #4
+	bmi		.Lmac1x
+	ld1		{v1.16b-v4.16b}, [x0], #64	/* get next pt block */
+	eor		v0.16b, v0.16b, v1.16b		/* ..and xor with dg */
+	encrypt_block	v0, w2, x1, x7, w8
+	eor		v0.16b, v0.16b, v2.16b
+	encrypt_block	v0, w2, x1, x7, w8
+	eor		v0.16b, v0.16b, v3.16b
+	encrypt_block	v0, w2, x1, x7, w8
+	eor		v0.16b, v0.16b, v4.16b
+	cmp		w3, wzr
+	csinv		x5, x6, xzr, eq
+	cbz		w5, .Lmacout
+	encrypt_block	v0, w2, x1, x7, w8
+	b		.Lmacloop4x
+.Lmac1x:
+	add		w3, w3, #4
 .Lmacloop:
 	cbz		w3, .Lmacout
 	ld1		{v1.16b}, [x0], #16		/* get next pt block */
@@ -406,7 +426,6 @@ AES_ENTRY(aes_mac_update)
 	csinv		x5, x6, xzr, eq
 	cbz		w5, .Lmacout
 
-.Lmacenc:
 	encrypt_block	v0, w2, x1, x7, w8
 	b		.Lmacloop
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 10/19] crypto: arm64/sha256-neon - play nice with CONFIG_PREEMPT kernels
  2017-12-04 12:26 [PATCH v2 00/19] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
                   ` (8 preceding siblings ...)
  2017-12-04 12:26 ` [PATCH v2 09/19] crypto: arm64/aes-blk - add 4 way interleave to CBC-MAC " Ard Biesheuvel
@ 2017-12-04 12:26 ` Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 11/19] arm64: assembler: add macro to conditionally yield the NEON under PREEMPT Ard Biesheuvel
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 12:26 UTC (permalink / raw)
  To: linux-crypto
  Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
	Russell King - ARM Linux, Sebastian Andrzej Siewior,
	Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
	Will Deacon, Steven Rostedt, Thomas Gleixner

Tweak the SHA256 update routines to invoke the SHA256 block transform
block by block, to avoid excessive scheduling delays caused by the
NEON algorithm running with preemption disabled.

Also, remove a stale comment which no longer applies now that kernel
mode NEON is actually disallowed in some contexts.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/sha256-glue.c | 36 +++++++++++++-------
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/crypto/sha256-glue.c b/arch/arm64/crypto/sha256-glue.c
index b064d925fe2a..e8880ccdc71f 100644
--- a/arch/arm64/crypto/sha256-glue.c
+++ b/arch/arm64/crypto/sha256-glue.c
@@ -89,21 +89,32 @@ static struct shash_alg algs[] = { {
 static int sha256_update_neon(struct shash_desc *desc, const u8 *data,
 			      unsigned int len)
 {
-	/*
-	 * Stacking and unstacking a substantial slice of the NEON register
-	 * file may significantly affect performance for small updates when
-	 * executing in interrupt context, so fall back to the scalar code
-	 * in that case.
-	 */
+	struct sha256_state *sctx = shash_desc_ctx(desc);
+
 	if (!may_use_simd())
 		return sha256_base_do_update(desc, data, len,
 				(sha256_block_fn *)sha256_block_data_order);
 
-	kernel_neon_begin();
-	sha256_base_do_update(desc, data, len,
-				(sha256_block_fn *)sha256_block_neon);
-	kernel_neon_end();
+	while (len > 0) {
+		unsigned int chunk = len;
+
+		/*
+		 * Don't hog the CPU for the entire time it takes to process all
+		 * input when running on a preemptible kernel, but process the
+		 * data block by block instead.
+		 */
+		if (IS_ENABLED(CONFIG_PREEMPT) &&
+		    chunk + sctx->count % SHA256_BLOCK_SIZE > SHA256_BLOCK_SIZE)
+			chunk = SHA256_BLOCK_SIZE -
+				sctx->count % SHA256_BLOCK_SIZE;
 
+		kernel_neon_begin();
+		sha256_base_do_update(desc, data, chunk,
+				      (sha256_block_fn *)sha256_block_neon);
+		kernel_neon_end();
+		data += chunk;
+		len -= chunk;
+	}
 	return 0;
 }
 
@@ -117,10 +128,9 @@ static int sha256_finup_neon(struct shash_desc *desc, const u8 *data,
 		sha256_base_do_finalize(desc,
 				(sha256_block_fn *)sha256_block_data_order);
 	} else {
-		kernel_neon_begin();
 		if (len)
-			sha256_base_do_update(desc, data, len,
-				(sha256_block_fn *)sha256_block_neon);
+			sha256_update_neon(desc, data, len);
+		kernel_neon_begin();
 		sha256_base_do_finalize(desc,
 				(sha256_block_fn *)sha256_block_neon);
 		kernel_neon_end();
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 11/19] arm64: assembler: add macro to conditionally yield the NEON under PREEMPT
  2017-12-04 12:26 [PATCH v2 00/19] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
                   ` (9 preceding siblings ...)
  2017-12-04 12:26 ` [PATCH v2 10/19] crypto: arm64/sha256-neon - play nice with CONFIG_PREEMPT kernels Ard Biesheuvel
@ 2017-12-04 12:26 ` Ard Biesheuvel
  2017-12-05 12:28   ` Dave Martin
  2017-12-04 12:26 ` [PATCH v2 12/19] crypto: arm64/sha1-ce - yield every 8 blocks of input Ard Biesheuvel
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 12:26 UTC (permalink / raw)
  To: linux-crypto
  Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
	Russell King - ARM Linux, Sebastian Andrzej Siewior,
	Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
	Will Deacon, Steven Rostedt, Thomas Gleixner

Add a support macro to conditionally yield the NEON (and thus the CPU)
that may be called from the assembler code. Given that the instruction
based accelerated crypto code in particular may use very tight loops, add
some parametrization so that the TIF_NEED_RESCHED flag test is only
executed every so many loop iterations.

In some cases, yielding the NEON involves saving and restoring a
non-trivial amount of context (especially in the CRC folding algorithms),
and so the macro is split into two, and the code in between is only
executed when the yield path is taken, allowing the context to be preserved.
The second macro takes a label argument that marks the resume-from-yield
path, which should restore the preserved context again.
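
For illustration, the check performed by the macro pair corresponds
roughly to the following C (maybe_yield_neon() and its arguments are
purely illustrative and not part of the patch):

  #include <linux/types.h>
  #include <linux/preempt.h>
  #include <linux/sched.h>
  #include <linux/thread_info.h>
  #include <asm/neon.h>

  static bool maybe_yield_neon(unsigned long ctr, int order,
  			       unsigned long stride)
  {
  	/* only poll TIF_NEED_RESCHED once every 2^order strides */
  	if (order > 1 &&
  	    (ctr & (((1UL << order) * stride - 1) & ~(stride - 1))))
  		return false;

  	/* yielding is pointless unless ending kernel mode NEON makes the
  	 * task preemptible, i.e., the preempt count is exactly 1 */
  	if (preempt_count() != 1 || !test_thread_flag(TIF_NEED_RESCHED))
  		return false;

  	kernel_neon_end();	/* lets the pending reschedule happen */
  	kernel_neon_begin();
  	return true;		/* caller branches to its resume-from-yield label */
  }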

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/include/asm/assembler.h | 50 ++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index aef72d886677..917b026d3e00 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -512,4 +512,54 @@ alternative_else_nop_endif
 #endif
 	.endm
 
+/*
+ * yield_neon - check whether to yield to another runnable task from
+ *		kernel mode NEON code (running with preemption disabled)
+ *
+ * - Check whether the preempt count is exactly 1, in which case disabling
+ *   preemption once will make the task preemptible. If this is not the case,
+ *   yielding is pointless.
+ * - Check whether TIF_NEED_RESCHED is set, and if so, disable and re-enable
+ *   kernel mode NEON (which will trigger a reschedule), and branch to the
+ *   yield fixup code at @lbl.
+ */
+	.macro		yield_neon, lbl:req, ctr, order, stride, loop
+	yield_neon_pre	\ctr, \order, \stride, \loop
+	yield_neon_post	\lbl
+	.endm
+
+	.macro		yield_neon_pre, ctr, order=0, stride, loop=4444f
+#ifdef CONFIG_PREEMPT
+	/*
+	 * With some algorithms, it makes little sense to poll the
+	 * TIF_NEED_RESCHED flag after every iteration, so only perform
+	 * the check every 2^order strides.
+	 */
+	.if		\order > 1
+	.if		(\stride & (\stride - 1)) != 0
+	.error		"stride should be a power of 2"
+	.endif
+	tst		\ctr, #((1 << \order) * \stride - 1) & ~(\stride - 1)
+	b.ne		\loop
+	.endif
+
+	get_thread_info	x0
+	ldr		w1, [x0, #TSK_TI_PREEMPT]
+	ldr		x0, [x0, #TSK_TI_FLAGS]
+	cmp		w1, #1 // == PREEMPT_OFFSET
+	csel		x0, x0, xzr, eq
+	tbnz		x0, #TIF_NEED_RESCHED, 5555f	// needs rescheduling?
+4444:
+#endif
+	.subsection	1
+5555:
+	.endm
+
+	.macro		yield_neon_post, lbl:req
+	bl		kernel_neon_end
+	bl		kernel_neon_begin
+	b		\lbl
+	.previous
+	.endm
+
 #endif	/* __ASM_ASSEMBLER_H */
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 12/19] crypto: arm64/sha1-ce - yield every 8 blocks of input
  2017-12-04 12:26 [PATCH v2 00/19] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
                   ` (10 preceding siblings ...)
  2017-12-04 12:26 ` [PATCH v2 11/19] arm64: assembler: add macro to conditionally yield the NEON under PREEMPT Ard Biesheuvel
@ 2017-12-04 12:26 ` Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 13/19] crypto: arm64/sha2-ce " Ard Biesheuvel
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 12:26 UTC (permalink / raw)
  To: linux-crypto
  Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
	Russell King - ARM Linux, Sebastian Andrzej Siewior,
	Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
	Will Deacon, Steven Rostedt, Thomas Gleixner

Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON every 8 blocks of input.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/sha1-ce-core.S | 45 ++++++++++++++------
 1 file changed, 32 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/crypto/sha1-ce-core.S b/arch/arm64/crypto/sha1-ce-core.S
index 8550408735a0..7ae0dd369e0a 100644
--- a/arch/arm64/crypto/sha1-ce-core.S
+++ b/arch/arm64/crypto/sha1-ce-core.S
@@ -70,31 +70,40 @@
 	 *			  int blocks)
 	 */
 ENTRY(sha1_ce_transform)
+	stp		x29, x30, [sp, #-48]!
+	mov		x29, sp
+	stp		x19, x20, [sp, #16]
+	str		x21, [sp, #32]
+
+	mov		x19, x0
+	mov		x20, x1
+	mov		x21, x2
+
 	/* load round constants */
-	adr		x6, .Lsha1_rcon
+0:	adr		x6, .Lsha1_rcon
 	ld1r		{k0.4s}, [x6], #4
 	ld1r		{k1.4s}, [x6], #4
 	ld1r		{k2.4s}, [x6], #4
 	ld1r		{k3.4s}, [x6]
 
 	/* load state */
-	ld1		{dgav.4s}, [x0]
-	ldr		dgb, [x0, #16]
+	ld1		{dgav.4s}, [x19]
+	ldr		dgb, [x19, #16]
 
 	/* load sha1_ce_state::finalize */
 	ldr_l		w4, sha1_ce_offsetof_finalize, x4
-	ldr		w4, [x0, x4]
+	ldr		w4, [x19, x4]
 
 	/* load input */
-0:	ld1		{v8.4s-v11.4s}, [x1], #64
-	sub		w2, w2, #1
+1:	ld1		{v8.4s-v11.4s}, [x20], #64
+	sub		w21, w21, #1
 
 CPU_LE(	rev32		v8.16b, v8.16b		)
 CPU_LE(	rev32		v9.16b, v9.16b		)
 CPU_LE(	rev32		v10.16b, v10.16b	)
 CPU_LE(	rev32		v11.16b, v11.16b	)
 
-1:	add		t0.4s, v8.4s, k0.4s
+2:	add		t0.4s, v8.4s, k0.4s
 	mov		dg0v.16b, dgav.16b
 
 	add_update	c, ev, k0,  8,  9, 10, 11, dgb
@@ -125,16 +134,23 @@ CPU_LE(	rev32		v11.16b, v11.16b	)
 	add		dgbv.2s, dgbv.2s, dg1v.2s
 	add		dgav.4s, dgav.4s, dg0v.4s
 
-	cbnz		w2, 0b
+	cbz		w21, 3f
+
+	yield_neon_pre	w21, 3, 1, 1b			// yield every 8 blocks
+	st1		{dgav.4s}, [x19]
+	str		dgb, [x19, #16]
+	yield_neon_post	0b
+
+	b		1b
 
 	/*
 	 * Final block: add padding and total bit count.
 	 * Skip if the input size was not a round multiple of the block size,
 	 * the padding is handled by the C code in that case.
 	 */
-	cbz		x4, 3f
+3:	cbz		x4, 4f
 	ldr_l		w4, sha1_ce_offsetof_count, x4
-	ldr		x4, [x0, x4]
+	ldr		x4, [x19, x4]
 	movi		v9.2d, #0
 	mov		x8, #0x80000000
 	movi		v10.2d, #0
@@ -143,10 +159,13 @@ CPU_LE(	rev32		v11.16b, v11.16b	)
 	mov		x4, #0
 	mov		v11.d[0], xzr
 	mov		v11.d[1], x7
-	b		1b
+	b		2b
 
 	/* store new state */
-3:	st1		{dgav.4s}, [x0]
-	str		dgb, [x0, #16]
+4:	st1		{dgav.4s}, [x19]
+	str		dgb, [x19, #16]
+	ldp		x19, x20, [sp, #16]
+	ldr		x21, [sp, #32]
+	ldp		x29, x30, [sp], #48
 	ret
 ENDPROC(sha1_ce_transform)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 13/19] crypto: arm64/sha2-ce - yield every 8 blocks of input
  2017-12-04 12:26 [PATCH v2 00/19] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
                   ` (11 preceding siblings ...)
  2017-12-04 12:26 ` [PATCH v2 12/19] crypto: arm64/sha1-ce - yield every 8 blocks of input Ard Biesheuvel
@ 2017-12-04 12:26 ` Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 14/19] crypto: arm64/aes-blk - yield after processing a fixed chunk " Ard Biesheuvel
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 12:26 UTC (permalink / raw)
  To: linux-crypto
  Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
	Russell King - ARM Linux, Sebastian Andrzej Siewior,
	Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
	Will Deacon, Steven Rostedt, Thomas Gleixner

Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON every 8 blocks of input.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/sha2-ce-core.S | 40 ++++++++++++++------
 1 file changed, 29 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/crypto/sha2-ce-core.S b/arch/arm64/crypto/sha2-ce-core.S
index 679c6c002f4f..d156b3ae967c 100644
--- a/arch/arm64/crypto/sha2-ce-core.S
+++ b/arch/arm64/crypto/sha2-ce-core.S
@@ -77,30 +77,39 @@
 	 *			  int blocks)
 	 */
 ENTRY(sha2_ce_transform)
+	stp		x29, x30, [sp, #-48]!
+	mov		x29, sp
+	stp		x19, x20, [sp, #16]
+	str		x21, [sp, #32]
+
+	mov		x19, x0
+	mov		x20, x1
+	mov		x21, x2
+
 	/* load round constants */
-	adr		x8, .Lsha2_rcon
+0:	adr		x8, .Lsha2_rcon
 	ld1		{ v0.4s- v3.4s}, [x8], #64
 	ld1		{ v4.4s- v7.4s}, [x8], #64
 	ld1		{ v8.4s-v11.4s}, [x8], #64
 	ld1		{v12.4s-v15.4s}, [x8]
 
 	/* load state */
-	ld1		{dgav.4s, dgbv.4s}, [x0]
+	ld1		{dgav.4s, dgbv.4s}, [x19]
 
 	/* load sha256_ce_state::finalize */
 	ldr_l		w4, sha256_ce_offsetof_finalize, x4
-	ldr		w4, [x0, x4]
+	ldr		w4, [x19, x4]
 
 	/* load input */
-0:	ld1		{v16.4s-v19.4s}, [x1], #64
-	sub		w2, w2, #1
+1:	ld1		{v16.4s-v19.4s}, [x20], #64
+	sub		w21, w21, #1
 
 CPU_LE(	rev32		v16.16b, v16.16b	)
 CPU_LE(	rev32		v17.16b, v17.16b	)
 CPU_LE(	rev32		v18.16b, v18.16b	)
 CPU_LE(	rev32		v19.16b, v19.16b	)
 
-1:	add		t0.4s, v16.4s, v0.4s
+2:	add		t0.4s, v16.4s, v0.4s
 	mov		dg0v.16b, dgav.16b
 	mov		dg1v.16b, dgbv.16b
 
@@ -129,16 +138,22 @@ CPU_LE(	rev32		v19.16b, v19.16b	)
 	add		dgbv.4s, dgbv.4s, dg1v.4s
 
 	/* handled all input blocks? */
-	cbnz		w2, 0b
+	cbz		w21, 3f
+
+	yield_neon_pre	w21, 3, 1, 1b			// yield every 8 blocks
+	st1		{dgav.4s, dgbv.4s}, [x19]
+	yield_neon_post	0b
+
+	b		1b
 
 	/*
 	 * Final block: add padding and total bit count.
 	 * Skip if the input size was not a round multiple of the block size,
 	 * the padding is handled by the C code in that case.
 	 */
-	cbz		x4, 3f
+3:	cbz		x4, 4f
 	ldr_l		w4, sha256_ce_offsetof_count, x4
-	ldr		x4, [x0, x4]
+	ldr		x4, [x19, x4]
 	movi		v17.2d, #0
 	mov		x8, #0x80000000
 	movi		v18.2d, #0
@@ -147,9 +162,12 @@ CPU_LE(	rev32		v19.16b, v19.16b	)
 	mov		x4, #0
 	mov		v19.d[0], xzr
 	mov		v19.d[1], x7
-	b		1b
+	b		2b
 
 	/* store new state */
-3:	st1		{dgav.4s, dgbv.4s}, [x0]
+4:	st1		{dgav.4s, dgbv.4s}, [x19]
+	ldp		x19, x20, [sp, #16]
+	ldr		x21, [sp, #32]
+	ldp		x29, x30, [sp], #48
 	ret
 ENDPROC(sha2_ce_transform)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 14/19] crypto: arm64/aes-blk - yield after processing a fixed chunk of input
  2017-12-04 12:26 [PATCH v2 00/19] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
                   ` (12 preceding siblings ...)
  2017-12-04 12:26 ` [PATCH v2 13/19] crypto: arm64/sha2-ce " Ard Biesheuvel
@ 2017-12-04 12:26 ` Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 15/19] crypto: arm64/aes-bs - yield after processing each 128 bytes " Ard Biesheuvel
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 12:26 UTC (permalink / raw)
  To: linux-crypto
  Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
	Russell King - ARM Linux, Sebastian Andrzej Siewior,
	Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
	Will Deacon, Steven Rostedt, Thomas Gleixner

Currently, the AES block code may keep preemption disabled for as long
as it takes to process each contiguous chunk of input, which could be as
large as a page or skb, depending on the context.

For this code to be usable in RT context, it needs to operate on fixed
chunks of limited size. So let's add a yield after every 16 blocks (for
the CE case) or after every block (for the pure NEON case), which will
disable and re-enable kernel mode NEON if a reschedule is pending.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/aes-ce.S    |  17 +-
 arch/arm64/crypto/aes-modes.S | 379 +++++++++++++-------
 arch/arm64/crypto/aes-neon.S  |   2 +
 3 files changed, 272 insertions(+), 126 deletions(-)

diff --git a/arch/arm64/crypto/aes-ce.S b/arch/arm64/crypto/aes-ce.S
index 50330f5c3adc..ccb17b65005a 100644
--- a/arch/arm64/crypto/aes-ce.S
+++ b/arch/arm64/crypto/aes-ce.S
@@ -15,6 +15,8 @@
 #define AES_ENTRY(func)		ENTRY(ce_ ## func)
 #define AES_ENDPROC(func)	ENDPROC(ce_ ## func)
 
+#define AES_YIELD_ORDER		4
+
 	.arch		armv8-a+crypto
 
 	/* preload all round keys */
@@ -30,18 +32,21 @@
 	.endm
 
 	/* prepare for encryption with key in rk[] */
-	.macro		enc_prepare, rounds, rk, ignore
-	load_round_keys	\rounds, \rk
+	.macro		enc_prepare, rounds, rk, temp
+	mov		\temp, \rk
+	load_round_keys	\rounds, \temp
 	.endm
 
 	/* prepare for encryption (again) but with new key in rk[] */
-	.macro		enc_switch_key, rounds, rk, ignore
-	load_round_keys	\rounds, \rk
+	.macro		enc_switch_key, rounds, rk, temp
+	mov		\temp, \rk
+	load_round_keys	\rounds, \temp
 	.endm
 
 	/* prepare for decryption with key in rk[] */
-	.macro		dec_prepare, rounds, rk, ignore
-	load_round_keys	\rounds, \rk
+	.macro		dec_prepare, rounds, rk, temp
+	mov		\temp, \rk
+	load_round_keys	\rounds, \temp
 	.endm
 
 	.macro		do_enc_Nx, de, mc, k, i0, i1, i2, i3
diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
index a68412e1e3a4..6fcdf82fa295 100644
--- a/arch/arm64/crypto/aes-modes.S
+++ b/arch/arm64/crypto/aes-modes.S
@@ -14,12 +14,12 @@
 	.align		4
 
 aes_encrypt_block4x:
-	encrypt_block4x	v0, v1, v2, v3, w3, x2, x8, w7
+	encrypt_block4x	v0, v1, v2, v3, w22, x21, x8, w7
 	ret
 ENDPROC(aes_encrypt_block4x)
 
 aes_decrypt_block4x:
-	decrypt_block4x	v0, v1, v2, v3, w3, x2, x8, w7
+	decrypt_block4x	v0, v1, v2, v3, w22, x21, x8, w7
 	ret
 ENDPROC(aes_decrypt_block4x)
 
@@ -31,57 +31,85 @@ ENDPROC(aes_decrypt_block4x)
 	 */
 
 AES_ENTRY(aes_ecb_encrypt)
-	stp		x29, x30, [sp, #-16]!
+	stp		x29, x30, [sp, #-64]!
 	mov		x29, sp
+	stp		x19, x20, [sp, #16]
+	stp		x21, x22, [sp, #32]
+	str		x23, [sp, #48]
 
-	enc_prepare	w3, x2, x5
+	mov		x19, x0
+	mov		x20, x1
+	mov		x21, x2
+	mov		x22, x3
+	mov		x23, x4
+
+.Lecbencrestart:
+	enc_prepare	w22, x21, x5
 
 .LecbencloopNx:
-	subs		w4, w4, #4
+	subs		w23, w23, #4
 	bmi		.Lecbenc1x
-	ld1		{v0.16b-v3.16b}, [x1], #64	/* get 4 pt blocks */
+	ld1		{v0.16b-v3.16b}, [x20], #64	/* get 4 pt blocks */
 	bl		aes_encrypt_block4x
-	st1		{v0.16b-v3.16b}, [x0], #64
+	st1		{v0.16b-v3.16b}, [x19], #64
+	yield_neon	.Lecbencrestart, w23, AES_YIELD_ORDER, 4, .LecbencloopNx
 	b		.LecbencloopNx
 .Lecbenc1x:
-	adds		w4, w4, #4
+	adds		w23, w23, #4
 	beq		.Lecbencout
 .Lecbencloop:
-	ld1		{v0.16b}, [x1], #16		/* get next pt block */
-	encrypt_block	v0, w3, x2, x5, w6
-	st1		{v0.16b}, [x0], #16
-	subs		w4, w4, #1
+	ld1		{v0.16b}, [x20], #16		/* get next pt block */
+	encrypt_block	v0, w22, x21, x5, w6
+	st1		{v0.16b}, [x19], #16
+	subs		w23, w23, #1
 	bne		.Lecbencloop
 .Lecbencout:
-	ldp		x29, x30, [sp], #16
+	ldp		x19, x20, [sp, #16]
+	ldp		x21, x22, [sp, #32]
+	ldr		x23, [sp, #48]
+	ldp		x29, x30, [sp], #64
 	ret
 AES_ENDPROC(aes_ecb_encrypt)
 
 
 AES_ENTRY(aes_ecb_decrypt)
-	stp		x29, x30, [sp, #-16]!
+	stp		x29, x30, [sp, #-64]!
 	mov		x29, sp
+	stp		x19, x20, [sp, #16]
+	stp		x21, x22, [sp, #32]
+	str		x23, [sp, #48]
+
+	mov		x19, x0
+	mov		x20, x1
+	mov		x21, x2
+	mov		x22, x3
+	mov		x23, x4
 
-	dec_prepare	w3, x2, x5
+.Lecbdecrestart:
+	dec_prepare	w22, x21, x5
 
 .LecbdecloopNx:
-	subs		w4, w4, #4
+	subs		w23, w23, #4
 	bmi		.Lecbdec1x
-	ld1		{v0.16b-v3.16b}, [x1], #64	/* get 4 ct blocks */
+	ld1		{v0.16b-v3.16b}, [x20], #64	/* get 4 ct blocks */
 	bl		aes_decrypt_block4x
-	st1		{v0.16b-v3.16b}, [x0], #64
+	st1		{v0.16b-v3.16b}, [x19], #64
+	yield_neon	.Lecbdecrestart, w23, AES_YIELD_ORDER, 4, .LecbdecloopNx
 	b		.LecbdecloopNx
 .Lecbdec1x:
-	adds		w4, w4, #4
+	adds		w23, w23, #4
 	beq		.Lecbdecout
 .Lecbdecloop:
-	ld1		{v0.16b}, [x1], #16		/* get next ct block */
-	decrypt_block	v0, w3, x2, x5, w6
-	st1		{v0.16b}, [x0], #16
-	subs		w4, w4, #1
+	ld1		{v0.16b}, [x20], #16		/* get next ct block */
+	decrypt_block	v0, w22, x21, x5, w6
+	st1		{v0.16b}, [x19], #16
+	subs		w23, w23, #1
 	bne		.Lecbdecloop
 .Lecbdecout:
-	ldp		x29, x30, [sp], #16
+	ldp		x19, x20, [sp, #16]
+	ldp		x21, x22, [sp, #32]
+	ldr		x23, [sp, #48]
+	ldp		x29, x30, [sp], #64
 	ret
 AES_ENDPROC(aes_ecb_decrypt)
 
@@ -94,78 +122,114 @@ AES_ENDPROC(aes_ecb_decrypt)
 	 */
 
 AES_ENTRY(aes_cbc_encrypt)
-	ld1		{v4.16b}, [x5]			/* get iv */
-	enc_prepare	w3, x2, x6
+	stp		x29, x30, [sp, #-64]!
+	mov		x29, sp
+	stp		x19, x20, [sp, #16]
+	stp		x21, x22, [sp, #32]
+	stp		x23, x24, [sp, #48]
+
+	mov		x19, x0
+	mov		x20, x1
+	mov		x21, x2
+	mov		x22, x3
+	mov		x23, x4
+	mov		x24, x5
+
+.Lcbcencrestart:
+	ld1		{v4.16b}, [x24]			/* get iv */
+	enc_prepare	w22, x21, x6
 
 .Lcbcencloop4x:
-	subs		w4, w4, #4
+	subs		w23, w23, #4
 	bmi		.Lcbcenc1x
-	ld1		{v0.16b-v3.16b}, [x1], #64	/* get 4 pt blocks */
+	ld1		{v0.16b-v3.16b}, [x20], #64	/* get 4 pt blocks */
 	eor		v0.16b, v0.16b, v4.16b		/* ..and xor with iv */
-	encrypt_block	v0, w3, x2, x6, w7
+	encrypt_block	v0, w22, x21, x6, w7
 	eor		v1.16b, v1.16b, v0.16b
-	encrypt_block	v1, w3, x2, x6, w7
+	encrypt_block	v1, w22, x21, x6, w7
 	eor		v2.16b, v2.16b, v1.16b
-	encrypt_block	v2, w3, x2, x6, w7
+	encrypt_block	v2, w22, x21, x6, w7
 	eor		v3.16b, v3.16b, v2.16b
-	encrypt_block	v3, w3, x2, x6, w7
-	st1		{v0.16b-v3.16b}, [x0], #64
+	encrypt_block	v3, w22, x21, x6, w7
+	st1		{v0.16b-v3.16b}, [x19], #64
 	mov		v4.16b, v3.16b
+	st1		{v4.16b}, [x24]			/* return iv */
+	yield_neon	.Lcbcencrestart, w23, AES_YIELD_ORDER, 4, .Lcbcencloop4x
 	b		.Lcbcencloop4x
 .Lcbcenc1x:
-	adds		w4, w4, #4
+	adds		w23, w23, #4
 	beq		.Lcbcencout
 .Lcbcencloop:
-	ld1		{v0.16b}, [x1], #16		/* get next pt block */
+	ld1		{v0.16b}, [x20], #16		/* get next pt block */
 	eor		v4.16b, v4.16b, v0.16b		/* ..and xor with iv */
-	encrypt_block	v4, w3, x2, x6, w7
-	st1		{v4.16b}, [x0], #16
-	subs		w4, w4, #1
+	encrypt_block	v4, w22, x21, x6, w7
+	st1		{v4.16b}, [x19], #16
+	subs		w23, w23, #1
 	bne		.Lcbcencloop
 .Lcbcencout:
-	st1		{v4.16b}, [x5]			/* return iv */
+	st1		{v4.16b}, [x24]			/* return iv */
+	ldp		x19, x20, [sp, #16]
+	ldp		x21, x22, [sp, #32]
+	ldp		x23, x24, [sp, #48]
+	ldp		x29, x30, [sp], #64
 	ret
 AES_ENDPROC(aes_cbc_encrypt)
 
 
 AES_ENTRY(aes_cbc_decrypt)
-	stp		x29, x30, [sp, #-16]!
+	stp		x29, x30, [sp, #-64]!
 	mov		x29, sp
+	stp		x19, x20, [sp, #16]
+	stp		x21, x22, [sp, #32]
+	stp		x23, x24, [sp, #48]
 
-	ld1		{v7.16b}, [x5]			/* get iv */
-	dec_prepare	w3, x2, x6
+	mov		x19, x0
+	mov		x20, x1
+	mov		x21, x2
+	mov		x22, x3
+	mov		x23, x4
+	mov		x24, x5
+
+.Lcbcdecrestart:
+	ld1		{v7.16b}, [x24]			/* get iv */
+	dec_prepare	w22, x21, x6
 
 .LcbcdecloopNx:
-	subs		w4, w4, #4
+	subs		w23, w23, #4
 	bmi		.Lcbcdec1x
-	ld1		{v0.16b-v3.16b}, [x1], #64	/* get 4 ct blocks */
+	ld1		{v0.16b-v3.16b}, [x20], #64	/* get 4 ct blocks */
 	mov		v4.16b, v0.16b
 	mov		v5.16b, v1.16b
 	mov		v6.16b, v2.16b
 	bl		aes_decrypt_block4x
-	sub		x1, x1, #16
+	sub		x20, x20, #16
 	eor		v0.16b, v0.16b, v7.16b
 	eor		v1.16b, v1.16b, v4.16b
-	ld1		{v7.16b}, [x1], #16		/* reload 1 ct block */
+	ld1		{v7.16b}, [x20], #16		/* reload 1 ct block */
 	eor		v2.16b, v2.16b, v5.16b
 	eor		v3.16b, v3.16b, v6.16b
-	st1		{v0.16b-v3.16b}, [x0], #64
+	st1		{v0.16b-v3.16b}, [x19], #64
+	st1		{v7.16b}, [x24]			/* return iv */
+	yield_neon	.Lcbcdecrestart, w23, AES_YIELD_ORDER, 4, .LcbcdecloopNx
 	b		.LcbcdecloopNx
 .Lcbcdec1x:
-	adds		w4, w4, #4
+	adds		w23, w23, #4
 	beq		.Lcbcdecout
 .Lcbcdecloop:
-	ld1		{v1.16b}, [x1], #16		/* get next ct block */
+	ld1		{v1.16b}, [x20], #16		/* get next ct block */
 	mov		v0.16b, v1.16b			/* ...and copy to v0 */
-	decrypt_block	v0, w3, x2, x6, w7
+	decrypt_block	v0, w22, x21, x6, w7
 	eor		v0.16b, v0.16b, v7.16b		/* xor with iv => pt */
 	mov		v7.16b, v1.16b			/* ct is next iv */
-	st1		{v0.16b}, [x0], #16
-	subs		w4, w4, #1
+	st1		{v0.16b}, [x19], #16
+	subs		w23, w23, #1
 	bne		.Lcbcdecloop
 .Lcbcdecout:
-	st1		{v7.16b}, [x5]			/* return iv */
-	ldp		x29, x30, [sp], #16
+	st1		{v7.16b}, [x24]			/* return iv */
+	ldp		x19, x20, [sp, #16]
+	ldp		x21, x22, [sp, #32]
+	ldp		x23, x24, [sp, #48]
+	ldp		x29, x30, [sp], #64
 	ret
 AES_ENDPROC(aes_cbc_decrypt)
 
@@ -176,19 +240,30 @@ AES_ENDPROC(aes_cbc_decrypt)
 	 */
 
 AES_ENTRY(aes_ctr_encrypt)
-	stp		x29, x30, [sp, #-16]!
+	stp		x29, x30, [sp, #-64]!
 	mov		x29, sp
+	stp		x19, x20, [sp, #16]
+	stp		x21, x22, [sp, #32]
+	stp		x23, x24, [sp, #48]
+
+	mov		x19, x0
+	mov		x20, x1
+	mov		x21, x2
+	mov		x22, x3
+	mov		x23, x4
+	mov		x24, x5
 
-	enc_prepare	w3, x2, x6
-	ld1		{v4.16b}, [x5]
+.Lctrrestart:
+	enc_prepare	w22, x21, x6
+	ld1		{v4.16b}, [x24]
 
 	umov		x6, v4.d[1]		/* keep swabbed ctr in reg */
 	rev		x6, x6
-	cmn		w6, w4			/* 32 bit overflow? */
-	bcs		.Lctrloop
 .LctrloopNx:
-	subs		w4, w4, #4
+	subs		w23, w23, #4
 	bmi		.Lctr1x
+	cmn		w6, #4			/* 32 bit overflow? */
+	bcs		.Lctr1x
 	ldr		q8, =0x30000000200000001	/* addends 1,2,3[,0] */
 	dup		v7.4s, w6
 	mov		v0.16b, v4.16b
@@ -200,25 +275,27 @@ AES_ENTRY(aes_ctr_encrypt)
 	mov		v1.s[3], v8.s[0]
 	mov		v2.s[3], v8.s[1]
 	mov		v3.s[3], v8.s[2]
-	ld1		{v5.16b-v7.16b}, [x1], #48	/* get 3 input blocks */
+	ld1		{v5.16b-v7.16b}, [x20], #48	/* get 3 input blocks */
 	bl		aes_encrypt_block4x
 	eor		v0.16b, v5.16b, v0.16b
-	ld1		{v5.16b}, [x1], #16		/* get 1 input block  */
+	ld1		{v5.16b}, [x20], #16		/* get 1 input block  */
 	eor		v1.16b, v6.16b, v1.16b
 	eor		v2.16b, v7.16b, v2.16b
 	eor		v3.16b, v5.16b, v3.16b
-	st1		{v0.16b-v3.16b}, [x0], #64
+	st1		{v0.16b-v3.16b}, [x19], #64
 	add		x6, x6, #4
 	rev		x7, x6
 	ins		v4.d[1], x7
-	cbz		w4, .Lctrout
+	cbz		w23, .Lctrout
+	st1		{v4.16b}, [x24]		/* return next CTR value */
+	yield_neon	.Lctrrestart, w23, AES_YIELD_ORDER, 4, .LctrloopNx
 	b		.LctrloopNx
 .Lctr1x:
-	adds		w4, w4, #4
+	adds		w23, w23, #4
 	beq		.Lctrout
 .Lctrloop:
 	mov		v0.16b, v4.16b
-	encrypt_block	v0, w3, x2, x8, w7
+	encrypt_block	v0, w22, x21, x8, w7
 
 	adds		x6, x6, #1		/* increment BE ctr */
 	rev		x7, x6
@@ -226,22 +303,25 @@ AES_ENTRY(aes_ctr_encrypt)
 	bcs		.Lctrcarry		/* overflow? */
 
 .Lctrcarrydone:
-	subs		w4, w4, #1
+	subs		w23, w23, #1
 	bmi		.Lctrtailblock		/* blocks <0 means tail block */
-	ld1		{v3.16b}, [x1], #16
+	ld1		{v3.16b}, [x20], #16
 	eor		v3.16b, v0.16b, v3.16b
-	st1		{v3.16b}, [x0], #16
+	st1		{v3.16b}, [x19], #16
 	bne		.Lctrloop
 
 .Lctrout:
-	st1		{v4.16b}, [x5]		/* return next CTR value */
-	ldp		x29, x30, [sp], #16
+	st1		{v4.16b}, [x24]		/* return next CTR value */
+.Lctrret:
+	ldp		x19, x20, [sp, #16]
+	ldp		x21, x22, [sp, #32]
+	ldp		x23, x24, [sp, #48]
+	ldp		x29, x30, [sp], #64
 	ret
 
 .Lctrtailblock:
-	st1		{v0.16b}, [x0]
-	ldp		x29, x30, [sp], #16
-	ret
+	st1		{v0.16b}, [x19]
+	b		.Lctrret
 
 .Lctrcarry:
 	umov		x7, v4.d[0]		/* load upper word of ctr  */
@@ -274,10 +354,20 @@ CPU_LE(	.quad		1, 0x87		)
 CPU_BE(	.quad		0x87, 1		)
 
 AES_ENTRY(aes_xts_encrypt)
-	stp		x29, x30, [sp, #-16]!
+	stp		x29, x30, [sp, #-64]!
 	mov		x29, sp
-
-	ld1		{v4.16b}, [x6]
+	stp		x19, x20, [sp, #16]
+	stp		x21, x22, [sp, #32]
+	stp		x23, x24, [sp, #48]
+
+	mov		x19, x0
+	mov		x20, x1
+	mov		x21, x2
+	mov		x22, x3
+	mov		x23, x4
+	mov		x24, x6
+
+	ld1		{v4.16b}, [x24]
 	cbz		w7, .Lxtsencnotfirst
 
 	enc_prepare	w3, x5, x8
@@ -286,15 +376,17 @@ AES_ENTRY(aes_xts_encrypt)
 	ldr		q7, .Lxts_mul_x
 	b		.LxtsencNx
 
+.Lxtsencrestart:
+	ld1		{v4.16b}, [x24]
 .Lxtsencnotfirst:
-	enc_prepare	w3, x2, x8
+	enc_prepare	w22, x21, x8
 .LxtsencloopNx:
 	ldr		q7, .Lxts_mul_x
 	next_tweak	v4, v4, v7, v8
 .LxtsencNx:
-	subs		w4, w4, #4
+	subs		w23, w23, #4
 	bmi		.Lxtsenc1x
-	ld1		{v0.16b-v3.16b}, [x1], #64	/* get 4 pt blocks */
+	ld1		{v0.16b-v3.16b}, [x20], #64	/* get 4 pt blocks */
 	next_tweak	v5, v4, v7, v8
 	eor		v0.16b, v0.16b, v4.16b
 	next_tweak	v6, v5, v7, v8
@@ -307,35 +399,50 @@ AES_ENTRY(aes_xts_encrypt)
 	eor		v0.16b, v0.16b, v4.16b
 	eor		v1.16b, v1.16b, v5.16b
 	eor		v2.16b, v2.16b, v6.16b
-	st1		{v0.16b-v3.16b}, [x0], #64
+	st1		{v0.16b-v3.16b}, [x19], #64
 	mov		v4.16b, v7.16b
-	cbz		w4, .Lxtsencout
+	cbz		w23, .Lxtsencout
+	st1		{v4.16b}, [x24]
+	yield_neon	.Lxtsencrestart, w23, AES_YIELD_ORDER, 4, .LxtsencloopNx
 	b		.LxtsencloopNx
 .Lxtsenc1x:
-	adds		w4, w4, #4
+	adds		w23, w23, #4
 	beq		.Lxtsencout
 .Lxtsencloop:
-	ld1		{v1.16b}, [x1], #16
+	ld1		{v1.16b}, [x20], #16
 	eor		v0.16b, v1.16b, v4.16b
-	encrypt_block	v0, w3, x2, x8, w7
+	encrypt_block	v0, w22, x21, x8, w7
 	eor		v0.16b, v0.16b, v4.16b
-	st1		{v0.16b}, [x0], #16
-	subs		w4, w4, #1
+	st1		{v0.16b}, [x19], #16
+	subs		w23, w23, #1
 	beq		.Lxtsencout
 	next_tweak	v4, v4, v7, v8
 	b		.Lxtsencloop
 .Lxtsencout:
-	st1		{v4.16b}, [x6]
-	ldp		x29, x30, [sp], #16
+	st1		{v4.16b}, [x24]
+	ldp		x19, x20, [sp, #16]
+	ldp		x21, x22, [sp, #32]
+	ldp		x23, x24, [sp, #48]
+	ldp		x29, x30, [sp], #64
 	ret
 AES_ENDPROC(aes_xts_encrypt)
 
 
 AES_ENTRY(aes_xts_decrypt)
-	stp		x29, x30, [sp, #-16]!
+	stp		x29, x30, [sp, #-64]!
 	mov		x29, sp
-
-	ld1		{v4.16b}, [x6]
+	stp		x19, x20, [sp, #16]
+	stp		x21, x22, [sp, #32]
+	stp		x23, x24, [sp, #48]
+
+	mov		x19, x0
+	mov		x20, x1
+	mov		x21, x2
+	mov		x22, x3
+	mov		x23, x4
+	mov		x24, x6
+
+	ld1		{v4.16b}, [x24]
 	cbz		w7, .Lxtsdecnotfirst
 
 	enc_prepare	w3, x5, x8
@@ -344,15 +451,17 @@ AES_ENTRY(aes_xts_decrypt)
 	ldr		q7, .Lxts_mul_x
 	b		.LxtsdecNx
 
+.Lxtsdecrestart:
+	ld1		{v4.16b}, [x24]
 .Lxtsdecnotfirst:
-	dec_prepare	w3, x2, x8
+	dec_prepare	w22, x21, x8
 .LxtsdecloopNx:
 	ldr		q7, .Lxts_mul_x
 	next_tweak	v4, v4, v7, v8
 .LxtsdecNx:
-	subs		w4, w4, #4
+	subs		w23, w23, #4
 	bmi		.Lxtsdec1x
-	ld1		{v0.16b-v3.16b}, [x1], #64	/* get 4 ct blocks */
+	ld1		{v0.16b-v3.16b}, [x20], #64	/* get 4 ct blocks */
 	next_tweak	v5, v4, v7, v8
 	eor		v0.16b, v0.16b, v4.16b
 	next_tweak	v6, v5, v7, v8
@@ -365,26 +474,31 @@ AES_ENTRY(aes_xts_decrypt)
 	eor		v0.16b, v0.16b, v4.16b
 	eor		v1.16b, v1.16b, v5.16b
 	eor		v2.16b, v2.16b, v6.16b
-	st1		{v0.16b-v3.16b}, [x0], #64
+	st1		{v0.16b-v3.16b}, [x19], #64
 	mov		v4.16b, v7.16b
-	cbz		w4, .Lxtsdecout
+	cbz		w23, .Lxtsdecout
+	st1		{v4.16b}, [x24]
+	yield_neon	.Lxtsdecrestart, w23, AES_YIELD_ORDER, 4, .LxtsdecloopNx
 	b		.LxtsdecloopNx
 .Lxtsdec1x:
-	adds		w4, w4, #4
+	adds		w23, w23, #4
 	beq		.Lxtsdecout
 .Lxtsdecloop:
-	ld1		{v1.16b}, [x1], #16
+	ld1		{v1.16b}, [x20], #16
 	eor		v0.16b, v1.16b, v4.16b
-	decrypt_block	v0, w3, x2, x8, w7
+	decrypt_block	v0, w22, x21, x8, w7
 	eor		v0.16b, v0.16b, v4.16b
-	st1		{v0.16b}, [x0], #16
-	subs		w4, w4, #1
+	st1		{v0.16b}, [x19], #16
+	subs		w23, w23, #1
 	beq		.Lxtsdecout
 	next_tweak	v4, v4, v7, v8
 	b		.Lxtsdecloop
 .Lxtsdecout:
-	st1		{v4.16b}, [x6]
-	ldp		x29, x30, [sp], #16
+	st1		{v4.16b}, [x24]
+	ldp		x19, x20, [sp, #16]
+	ldp		x21, x22, [sp, #32]
+	ldp		x23, x24, [sp, #48]
+	ldp		x29, x30, [sp], #64
 	ret
 AES_ENDPROC(aes_xts_decrypt)
 
@@ -393,43 +507,68 @@ AES_ENDPROC(aes_xts_decrypt)
 	 *		  int blocks, u8 dg[], int enc_before, int enc_after)
 	 */
 AES_ENTRY(aes_mac_update)
-	ld1		{v0.16b}, [x4]			/* get dg */
+	stp		x29, x30, [sp, #-64]!
+	mov		x29, sp
+	stp		x19, x20, [sp, #16]
+	stp		x21, x22, [sp, #32]
+	stp		x23, x24, [sp, #48]
+
+	mov		x19, x0
+	mov		x20, x1
+	mov		x21, x2
+	mov		x22, x3
+	mov		x23, x4
+	mov		x24, x6
+
+	ld1		{v0.16b}, [x23]			/* get dg */
 	enc_prepare	w2, x1, x7
 	cbz		w5, .Lmacloop4x
 
 	encrypt_block	v0, w2, x1, x7, w8
 
 .Lmacloop4x:
-	subs		w3, w3, #4
+	subs		w22, w22, #4
 	bmi		.Lmac1x
-	ld1		{v1.16b-v4.16b}, [x0], #64	/* get next pt block */
+	ld1		{v1.16b-v4.16b}, [x19], #64	/* get next pt block */
 	eor		v0.16b, v0.16b, v1.16b		/* ..and xor with dg */
-	encrypt_block	v0, w2, x1, x7, w8
+	encrypt_block	v0, w21, x20, x7, w8
 	eor		v0.16b, v0.16b, v2.16b
-	encrypt_block	v0, w2, x1, x7, w8
+	encrypt_block	v0, w21, x20, x7, w8
 	eor		v0.16b, v0.16b, v3.16b
-	encrypt_block	v0, w2, x1, x7, w8
+	encrypt_block	v0, w21, x20, x7, w8
 	eor		v0.16b, v0.16b, v4.16b
-	cmp		w3, wzr
-	csinv		x5, x6, xzr, eq
+	cmp		w22, wzr
+	csinv		x5, x24, xzr, eq
 	cbz		w5, .Lmacout
-	encrypt_block	v0, w2, x1, x7, w8
+	encrypt_block	v0, w21, x20, x7, w8
+	st1		{v0.16b}, [x23]			/* return dg */
+	yield_neon	.Lmacrestart, w22, AES_YIELD_ORDER, 4, .Lmacloop4x
 	b		.Lmacloop4x
 .Lmac1x:
-	add		w3, w3, #4
+	add		w22, w22, #4
 .Lmacloop:
-	cbz		w3, .Lmacout
-	ld1		{v1.16b}, [x0], #16		/* get next pt block */
+	cbz		w22, .Lmacout
+	ld1		{v1.16b}, [x19], #16		/* get next pt block */
 	eor		v0.16b, v0.16b, v1.16b		/* ..and xor with dg */
 
-	subs		w3, w3, #1
-	csinv		x5, x6, xzr, eq
+	subs		w22, w22, #1
+	csinv		x5, x24, xzr, eq
 	cbz		w5, .Lmacout
 
-	encrypt_block	v0, w2, x1, x7, w8
+.Lmacenc:
+	encrypt_block	v0, w21, x20, x7, w8
 	b		.Lmacloop
 
 .Lmacout:
-	st1		{v0.16b}, [x4]			/* return dg */
+	st1		{v0.16b}, [x23]			/* return dg */
+	ldp		x19, x20, [sp, #16]
+	ldp		x21, x22, [sp, #32]
+	ldp		x23, x24, [sp, #48]
+	ldp		x29, x30, [sp], #64
 	ret
+
+.Lmacrestart:
+	ld1		{v0.16b}, [x23]			/* get dg */
+	enc_prepare	w21, x20, x0
+	b		.Lmacloop4x
 AES_ENDPROC(aes_mac_update)
diff --git a/arch/arm64/crypto/aes-neon.S b/arch/arm64/crypto/aes-neon.S
index f1e3aa2732f9..dab7be7d3628 100644
--- a/arch/arm64/crypto/aes-neon.S
+++ b/arch/arm64/crypto/aes-neon.S
@@ -14,6 +14,8 @@
 #define AES_ENTRY(func)		ENTRY(neon_ ## func)
 #define AES_ENDPROC(func)	ENDPROC(neon_ ## func)
 
+#define AES_YIELD_ORDER		0
+
 	/* multiply by polynomial 'x' in GF(2^8) */
 	.macro		mul_by_x, out, in, temp, const
 	sshr		\temp, \in, #7
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 15/19] crypto: arm64/aes-bs - yield after processing each 128 bytes of input
  2017-12-04 12:26 [PATCH v2 00/19] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
                   ` (13 preceding siblings ...)
  2017-12-04 12:26 ` [PATCH v2 14/19] crypto: arm64/aes-blk - yield after processing a fixed chunk " Ard Biesheuvel
@ 2017-12-04 12:26 ` Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 16/19] crypto: arm64/aes-ghash - yield after processing fixed number of blocks Ard Biesheuvel
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 12:26 UTC (permalink / raw)
  To: linux-crypto
  Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
	Russell King - ARM Linux, Sebastian Andrzej Siewior,
	Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
	Will Deacon, Steven Rostedt, Thomas Gleixner

Currently, the bit-sliced AES code may keep preemption disabled for as
long as it takes to process each contiguous chunk of input, which could
be as large as a page or skb, depending on the context.

For this code to be usable in RT context, it needs to operate on fixed
chunks of limited size. So let's add a yield after every 128 bytes of input
(i.e., 8x the AES block size, which is the natural granularity for a
bit-sliced algorithm). This will disable and re-enable kernel mode NEON if a
reschedule is pending.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/aes-neonbs-core.S | 317 ++++++++++++--------
 1 file changed, 190 insertions(+), 127 deletions(-)

diff --git a/arch/arm64/crypto/aes-neonbs-core.S b/arch/arm64/crypto/aes-neonbs-core.S
index ca0472500433..4532a2262742 100644
--- a/arch/arm64/crypto/aes-neonbs-core.S
+++ b/arch/arm64/crypto/aes-neonbs-core.S
@@ -565,54 +565,68 @@ ENDPROC(aesbs_decrypt8)
 	 *		     int blocks)
 	 */
 	.macro		__ecb_crypt, do8, o0, o1, o2, o3, o4, o5, o6, o7
-	stp		x29, x30, [sp, #-16]!
+	stp		x29, x30, [sp, #-64]!
 	mov		x29, sp
+	stp		x19, x20, [sp, #16]
+	stp		x21, x22, [sp, #32]
+	str		x23, [sp, #48]
+
+	mov		x19, x0
+	mov		x20, x1
+	mov		x21, x2
+	mov		x22, x3
+	mov		x23, x4
 
 99:	mov		x5, #1
-	lsl		x5, x5, x4
-	subs		w4, w4, #8
-	csel		x4, x4, xzr, pl
+	lsl		x5, x5, x23
+	subs		w23, w23, #8
+	csel		x23, x23, xzr, pl
 	csel		x5, x5, xzr, mi
 
-	ld1		{v0.16b}, [x1], #16
+	ld1		{v0.16b}, [x20], #16
 	tbnz		x5, #1, 0f
-	ld1		{v1.16b}, [x1], #16
+	ld1		{v1.16b}, [x20], #16
 	tbnz		x5, #2, 0f
-	ld1		{v2.16b}, [x1], #16
+	ld1		{v2.16b}, [x20], #16
 	tbnz		x5, #3, 0f
-	ld1		{v3.16b}, [x1], #16
+	ld1		{v3.16b}, [x20], #16
 	tbnz		x5, #4, 0f
-	ld1		{v4.16b}, [x1], #16
+	ld1		{v4.16b}, [x20], #16
 	tbnz		x5, #5, 0f
-	ld1		{v5.16b}, [x1], #16
+	ld1		{v5.16b}, [x20], #16
 	tbnz		x5, #6, 0f
-	ld1		{v6.16b}, [x1], #16
+	ld1		{v6.16b}, [x20], #16
 	tbnz		x5, #7, 0f
-	ld1		{v7.16b}, [x1], #16
+	ld1		{v7.16b}, [x20], #16
 
-0:	mov		bskey, x2
-	mov		rounds, x3
+0:	mov		bskey, x21
+	mov		rounds, x22
 	bl		\do8
 
-	st1		{\o0\().16b}, [x0], #16
+	st1		{\o0\().16b}, [x19], #16
 	tbnz		x5, #1, 1f
-	st1		{\o1\().16b}, [x0], #16
+	st1		{\o1\().16b}, [x19], #16
 	tbnz		x5, #2, 1f
-	st1		{\o2\().16b}, [x0], #16
+	st1		{\o2\().16b}, [x19], #16
 	tbnz		x5, #3, 1f
-	st1		{\o3\().16b}, [x0], #16
+	st1		{\o3\().16b}, [x19], #16
 	tbnz		x5, #4, 1f
-	st1		{\o4\().16b}, [x0], #16
+	st1		{\o4\().16b}, [x19], #16
 	tbnz		x5, #5, 1f
-	st1		{\o5\().16b}, [x0], #16
+	st1		{\o5\().16b}, [x19], #16
 	tbnz		x5, #6, 1f
-	st1		{\o6\().16b}, [x0], #16
+	st1		{\o6\().16b}, [x19], #16
 	tbnz		x5, #7, 1f
-	st1		{\o7\().16b}, [x0], #16
+	st1		{\o7\().16b}, [x19], #16
 
-	cbnz		x4, 99b
+	cbz		x23, 1f
+	yield_neon	99b
+	b		99b
 
-1:	ldp		x29, x30, [sp], #16
+1:	ldp		x19, x20, [sp, #16]
+	ldp		x21, x22, [sp, #32]
+	ldr		x23, [sp, #48]
+	ldp		x29, x30, [sp], #64
 	ret
 	.endm
 
@@ -632,43 +646,53 @@ ENDPROC(aesbs_ecb_decrypt)
 	 */
 	.align		4
 ENTRY(aesbs_cbc_decrypt)
-	stp		x29, x30, [sp, #-16]!
+	stp		x29, x30, [sp, #-64]!
 	mov		x29, sp
+	stp		x19, x20, [sp, #16]
+	stp		x21, x22, [sp, #32]
+	stp		x23, x24, [sp, #48]
+
+	mov		x19, x0
+	mov		x20, x1
+	mov		x21, x2
+	mov		x22, x3
+	mov		x23, x4
+	mov		x24, x5
 
 99:	mov		x6, #1
-	lsl		x6, x6, x4
-	subs		w4, w4, #8
-	csel		x4, x4, xzr, pl
+	lsl		x6, x6, x23
+	subs		w23, w23, #8
+	csel		x23, x23, xzr, pl
 	csel		x6, x6, xzr, mi
 
-	ld1		{v0.16b}, [x1], #16
+	ld1		{v0.16b}, [x20], #16
 	mov		v25.16b, v0.16b
 	tbnz		x6, #1, 0f
-	ld1		{v1.16b}, [x1], #16
+	ld1		{v1.16b}, [x20], #16
 	mov		v26.16b, v1.16b
 	tbnz		x6, #2, 0f
-	ld1		{v2.16b}, [x1], #16
+	ld1		{v2.16b}, [x20], #16
 	mov		v27.16b, v2.16b
 	tbnz		x6, #3, 0f
-	ld1		{v3.16b}, [x1], #16
+	ld1		{v3.16b}, [x20], #16
 	mov		v28.16b, v3.16b
 	tbnz		x6, #4, 0f
-	ld1		{v4.16b}, [x1], #16
+	ld1		{v4.16b}, [x20], #16
 	mov		v29.16b, v4.16b
 	tbnz		x6, #5, 0f
-	ld1		{v5.16b}, [x1], #16
+	ld1		{v5.16b}, [x20], #16
 	mov		v30.16b, v5.16b
 	tbnz		x6, #6, 0f
-	ld1		{v6.16b}, [x1], #16
+	ld1		{v6.16b}, [x20], #16
 	mov		v31.16b, v6.16b
 	tbnz		x6, #7, 0f
-	ld1		{v7.16b}, [x1]
+	ld1		{v7.16b}, [x20]
 
-0:	mov		bskey, x2
-	mov		rounds, x3
+0:	mov		bskey, x21
+	mov		rounds, x22
 	bl		aesbs_decrypt8
 
-	ld1		{v24.16b}, [x5]			// load IV
+	ld1		{v24.16b}, [x24]		// load IV
 
 	eor		v1.16b, v1.16b, v25.16b
 	eor		v6.16b, v6.16b, v26.16b
@@ -679,34 +703,39 @@ ENTRY(aesbs_cbc_decrypt)
 	eor		v3.16b, v3.16b, v30.16b
 	eor		v5.16b, v5.16b, v31.16b
 
-	st1		{v0.16b}, [x0], #16
+	st1		{v0.16b}, [x19], #16
 	mov		v24.16b, v25.16b
 	tbnz		x6, #1, 1f
-	st1		{v1.16b}, [x0], #16
+	st1		{v1.16b}, [x19], #16
 	mov		v24.16b, v26.16b
 	tbnz		x6, #2, 1f
-	st1		{v6.16b}, [x0], #16
+	st1		{v6.16b}, [x19], #16
 	mov		v24.16b, v27.16b
 	tbnz		x6, #3, 1f
-	st1		{v4.16b}, [x0], #16
+	st1		{v4.16b}, [x19], #16
 	mov		v24.16b, v28.16b
 	tbnz		x6, #4, 1f
-	st1		{v2.16b}, [x0], #16
+	st1		{v2.16b}, [x19], #16
 	mov		v24.16b, v29.16b
 	tbnz		x6, #5, 1f
-	st1		{v7.16b}, [x0], #16
+	st1		{v7.16b}, [x19], #16
 	mov		v24.16b, v30.16b
 	tbnz		x6, #6, 1f
-	st1		{v3.16b}, [x0], #16
+	st1		{v3.16b}, [x19], #16
 	mov		v24.16b, v31.16b
 	tbnz		x6, #7, 1f
-	ld1		{v24.16b}, [x1], #16
-	st1		{v5.16b}, [x0], #16
-1:	st1		{v24.16b}, [x5]			// store IV
-
-	cbnz		x4, 99b
-
-	ldp		x29, x30, [sp], #16
+	ld1		{v24.16b}, [x20], #16
+	st1		{v5.16b}, [x19], #16
+1:	st1		{v24.16b}, [x24]		// store IV
+
+	cbz		x23, 2f
+	yield_neon	99b
+	b		99b
+
+2:	ldp		x19, x20, [sp, #16]
+	ldp		x21, x22, [sp, #32]
+	ldp		x23, x24, [sp, #48]
+	ldp		x29, x30, [sp], #64
 	ret
 ENDPROC(aesbs_cbc_decrypt)
 
@@ -731,65 +760,75 @@ CPU_BE(	.quad		0x87, 1		)
 	 */
 __xts_crypt8:
 	mov		x6, #1
-	lsl		x6, x6, x4
-	subs		w4, w4, #8
-	csel		x4, x4, xzr, pl
+	lsl		x6, x6, x23
+	subs		w23, w23, #8
+	csel		x23, x23, xzr, pl
 	csel		x6, x6, xzr, mi
 
-	ld1		{v0.16b}, [x1], #16
+	ld1		{v0.16b}, [x20], #16
 	next_tweak	v26, v25, v30, v31
 	eor		v0.16b, v0.16b, v25.16b
 	tbnz		x6, #1, 0f
 
-	ld1		{v1.16b}, [x1], #16
+	ld1		{v1.16b}, [x20], #16
 	next_tweak	v27, v26, v30, v31
 	eor		v1.16b, v1.16b, v26.16b
 	tbnz		x6, #2, 0f
 
-	ld1		{v2.16b}, [x1], #16
+	ld1		{v2.16b}, [x20], #16
 	next_tweak	v28, v27, v30, v31
 	eor		v2.16b, v2.16b, v27.16b
 	tbnz		x6, #3, 0f
 
-	ld1		{v3.16b}, [x1], #16
+	ld1		{v3.16b}, [x20], #16
 	next_tweak	v29, v28, v30, v31
 	eor		v3.16b, v3.16b, v28.16b
 	tbnz		x6, #4, 0f
 
-	ld1		{v4.16b}, [x1], #16
+	ld1		{v4.16b}, [x20], #16
 	str		q29, [sp, #16]
 	eor		v4.16b, v4.16b, v29.16b
 	next_tweak	v29, v29, v30, v31
 	tbnz		x6, #5, 0f
 
-	ld1		{v5.16b}, [x1], #16
+	ld1		{v5.16b}, [x20], #16
 	str		q29, [sp, #32]
 	eor		v5.16b, v5.16b, v29.16b
 	next_tweak	v29, v29, v30, v31
 	tbnz		x6, #6, 0f
 
-	ld1		{v6.16b}, [x1], #16
+	ld1		{v6.16b}, [x20], #16
 	str		q29, [sp, #48]
 	eor		v6.16b, v6.16b, v29.16b
 	next_tweak	v29, v29, v30, v31
 	tbnz		x6, #7, 0f
 
-	ld1		{v7.16b}, [x1], #16
+	ld1		{v7.16b}, [x20], #16
 	str		q29, [sp, #64]
 	eor		v7.16b, v7.16b, v29.16b
 	next_tweak	v29, v29, v30, v31
 
-0:	mov		bskey, x2
-	mov		rounds, x3
+0:	mov		bskey, x21
+	mov		rounds, x22
 	br		x7
 ENDPROC(__xts_crypt8)
 
 	.macro		__xts_crypt, do8, o0, o1, o2, o3, o4, o5, o6, o7
-	stp		x29, x30, [sp, #-80]!
+	stp		x29, x30, [sp, #-128]!
 	mov		x29, sp
+	stp		x19, x20, [sp, #80]
+	stp		x21, x22, [sp, #96]
+	stp		x23, x24, [sp, #112]
+
+	mov		x19, x0
+	mov		x20, x1
+	mov		x21, x2
+	mov		x22, x3
+	mov		x23, x4
+	mov		x24, x5
 
-	ldr		q30, .Lxts_mul_x
-	ld1		{v25.16b}, [x5]
+0:	ldr		q30, .Lxts_mul_x
+	ld1		{v25.16b}, [x24]
 
 99:	adr		x7, \do8
 	bl		__xts_crypt8
@@ -802,16 +841,16 @@ ENDPROC(__xts_crypt8)
 	eor		\o2\().16b, \o2\().16b, v27.16b
 	eor		\o3\().16b, \o3\().16b, v28.16b
 
-	st1		{\o0\().16b}, [x0], #16
+	st1		{\o0\().16b}, [x19], #16
 	mov		v25.16b, v26.16b
 	tbnz		x6, #1, 1f
-	st1		{\o1\().16b}, [x0], #16
+	st1		{\o1\().16b}, [x19], #16
 	mov		v25.16b, v27.16b
 	tbnz		x6, #2, 1f
-	st1		{\o2\().16b}, [x0], #16
+	st1		{\o2\().16b}, [x19], #16
 	mov		v25.16b, v28.16b
 	tbnz		x6, #3, 1f
-	st1		{\o3\().16b}, [x0], #16
+	st1		{\o3\().16b}, [x19], #16
 	mov		v25.16b, v29.16b
 	tbnz		x6, #4, 1f
 
@@ -820,18 +859,24 @@ ENDPROC(__xts_crypt8)
 	eor		\o6\().16b, \o6\().16b, v18.16b
 	eor		\o7\().16b, \o7\().16b, v19.16b
 
-	st1		{\o4\().16b}, [x0], #16
+	st1		{\o4\().16b}, [x19], #16
 	tbnz		x6, #5, 1f
-	st1		{\o5\().16b}, [x0], #16
+	st1		{\o5\().16b}, [x19], #16
 	tbnz		x6, #6, 1f
-	st1		{\o6\().16b}, [x0], #16
+	st1		{\o6\().16b}, [x19], #16
 	tbnz		x6, #7, 1f
-	st1		{\o7\().16b}, [x0], #16
-
-	cbnz		x4, 99b
-
-1:	st1		{v25.16b}, [x5]
-	ldp		x29, x30, [sp], #80
+	st1		{\o7\().16b}, [x19], #16
+
+	cbz		x23, 1f
+	st1		{v25.16b}, [x24]
+	yield_neon	0b
+	b		99b
+
+1:	st1		{v25.16b}, [x24]
+	ldp		x19, x20, [sp, #80]
+	ldp		x21, x22, [sp, #96]
+	ldp		x23, x24, [sp, #112]
+	ldp		x29, x30, [sp], #128
 	ret
 	.endm
 
@@ -856,24 +901,36 @@ ENDPROC(aesbs_xts_decrypt)
 	 *		     int rounds, int blocks, u8 iv[], u8 final[])
 	 */
 ENTRY(aesbs_ctr_encrypt)
-	stp		x29, x30, [sp, #-16]!
+	stp		x29, x30, [sp, #-80]!
 	mov		x29, sp
-
-	cmp		x6, #0
+	stp		x19, x20, [sp, #16]
+	stp		x21, x22, [sp, #32]
+	stp		x23, x24, [sp, #48]
+	str		x25, [sp, #64]
+
+	mov		x19, x0
+	mov		x20, x1
+	mov		x21, x2
+	mov		x22, x3
+	mov		x23, x4
+	mov		x24, x5
+	mov		x25, x6
+
+	cmp		x25, #0
 	cset		x10, ne
-	add		x4, x4, x10		// do one extra block if final
+	add		x23, x23, x10		// do one extra block if final
 
-	ldp		x7, x8, [x5]
-	ld1		{v0.16b}, [x5]
+98:	ldp		x7, x8, [x24]
+	ld1		{v0.16b}, [x24]
 CPU_LE(	rev		x7, x7		)
 CPU_LE(	rev		x8, x8		)
 	adds		x8, x8, #1
 	adc		x7, x7, xzr
 
 99:	mov		x9, #1
-	lsl		x9, x9, x4
-	subs		w4, w4, #8
-	csel		x4, x4, xzr, pl
+	lsl		x9, x9, x23
+	subs		w23, w23, #8
+	csel		x23, x23, xzr, pl
 	csel		x9, x9, xzr, le
 
 	tbnz		x9, #1, 0f
@@ -891,82 +948,88 @@ CPU_LE(	rev		x8, x8		)
 	tbnz		x9, #7, 0f
 	next_ctr	v7
 
-0:	mov		bskey, x2
-	mov		rounds, x3
+0:	mov		bskey, x21
+	mov		rounds, x22
 	bl		aesbs_encrypt8
 
 	lsr		x9, x9, x10		// disregard the extra block
 	tbnz		x9, #0, 0f
 
-	ld1		{v8.16b}, [x1], #16
+	ld1		{v8.16b}, [x20], #16
 	eor		v0.16b, v0.16b, v8.16b
-	st1		{v0.16b}, [x0], #16
+	st1		{v0.16b}, [x19], #16
 	tbnz		x9, #1, 1f
 
-	ld1		{v9.16b}, [x1], #16
+	ld1		{v9.16b}, [x20], #16
 	eor		v1.16b, v1.16b, v9.16b
-	st1		{v1.16b}, [x0], #16
+	st1		{v1.16b}, [x19], #16
 	tbnz		x9, #2, 2f
 
-	ld1		{v10.16b}, [x1], #16
+	ld1		{v10.16b}, [x20], #16
 	eor		v4.16b, v4.16b, v10.16b
-	st1		{v4.16b}, [x0], #16
+	st1		{v4.16b}, [x19], #16
 	tbnz		x9, #3, 3f
 
-	ld1		{v11.16b}, [x1], #16
+	ld1		{v11.16b}, [x20], #16
 	eor		v6.16b, v6.16b, v11.16b
-	st1		{v6.16b}, [x0], #16
+	st1		{v6.16b}, [x19], #16
 	tbnz		x9, #4, 4f
 
-	ld1		{v12.16b}, [x1], #16
+	ld1		{v12.16b}, [x20], #16
 	eor		v3.16b, v3.16b, v12.16b
-	st1		{v3.16b}, [x0], #16
+	st1		{v3.16b}, [x19], #16
 	tbnz		x9, #5, 5f
 
-	ld1		{v13.16b}, [x1], #16
+	ld1		{v13.16b}, [x20], #16
 	eor		v7.16b, v7.16b, v13.16b
-	st1		{v7.16b}, [x0], #16
+	st1		{v7.16b}, [x19], #16
 	tbnz		x9, #6, 6f
 
-	ld1		{v14.16b}, [x1], #16
+	ld1		{v14.16b}, [x20], #16
 	eor		v2.16b, v2.16b, v14.16b
-	st1		{v2.16b}, [x0], #16
+	st1		{v2.16b}, [x19], #16
 	tbnz		x9, #7, 7f
 
-	ld1		{v15.16b}, [x1], #16
+	ld1		{v15.16b}, [x20], #16
 	eor		v5.16b, v5.16b, v15.16b
-	st1		{v5.16b}, [x0], #16
+	st1		{v5.16b}, [x19], #16
 
 8:	next_ctr	v0
-	cbnz		x4, 99b
-
-0:	st1		{v0.16b}, [x5]
-	ldp		x29, x30, [sp], #16
+	st1		{v0.16b}, [x24]
+	cbz		x23, 0f
+	yield_neon	98b
+	b		99b
+
+0:	ldp		x19, x20, [sp, #16]
+	ldp		x21, x22, [sp, #32]
+	ldp		x23, x24, [sp, #48]
+	ldr		x25, [sp, #64]
+	ldp		x29, x30, [sp], #80
 	ret
 
 	/*
 	 * If we are handling the tail of the input (x6 != NULL), return the
 	 * final keystream block back to the caller.
 	 */
-1:	cbz		x6, 8b
-	st1		{v1.16b}, [x6]
+1:	cbz		x25, 8b
+	st1		{v1.16b}, [x25]
 	b		8b
-2:	cbz		x6, 8b
-	st1		{v4.16b}, [x6]
+2:	cbz		x25, 8b
+	st1		{v4.16b}, [x25]
 	b		8b
-3:	cbz		x6, 8b
-	st1		{v6.16b}, [x6]
+3:	cbz		x25, 8b
+	st1		{v6.16b}, [x25]
 	b		8b
-4:	cbz		x6, 8b
-	st1		{v3.16b}, [x6]
+4:	cbz		x25, 8b
+	st1		{v3.16b}, [x25]
 	b		8b
-5:	cbz		x6, 8b
-	st1		{v7.16b}, [x6]
+5:	cbz		x25, 8b
+	st1		{v7.16b}, [x25]
 	b		8b
-6:	cbz		x6, 8b
-	st1		{v2.16b}, [x6]
+6:	cbz		x25, 8b
+	st1		{v2.16b}, [x25]
 	b		8b
-7:	cbz		x6, 8b
-	st1		{v5.16b}, [x6]
+7:	cbz		x25, 8b
+	st1		{v5.16b}, [x25]
 	b		8b
 ENDPROC(aesbs_ctr_encrypt)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 16/19] crypto: arm64/aes-ghash - yield after processing fixed number of blocks
  2017-12-04 12:26 [PATCH v2 00/19] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
                   ` (14 preceding siblings ...)
  2017-12-04 12:26 ` [PATCH v2 15/19] crypto: arm64/aes-bs - yield after processing each 128 bytes " Ard Biesheuvel
@ 2017-12-04 12:26 ` Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 17/19] crypto: arm64/crc32-ce - yield NEON every 16 blocks of input Ard Biesheuvel
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 12:26 UTC (permalink / raw)
  To: linux-crypto
  Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
	Russell King - ARM Linux, Sebastian Andrzej Siewior,
	Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
	Will Deacon, Steven Rostedt, Thomas Gleixner

This updates both the core GHASH and the AES-GCM algorithm to yield
each time after processing a fixed chunk of input. For the GCM driver,
we align with the other AES/CE block mode drivers and use a chunk size
of 64 bytes. The core GHASH routine is much shorter, so let's use a
chunk size of 128 bytes for that one.
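
Since yielding means a kernel_neon_end()+kernel_neon_begin() pair, any
NEON registers that are live across the yield point have to be written
back to memory first and reloaded afterwards, which is what the st1/ld1
sequences around the yield in the diff below do. As a rough C-level
sketch (editorial only; the gcm_* helpers and the state type are
invented names, the kernel_neon_*() calls are the real interfaces):

	#include <asm/neon.h>

	struct gcm_sketch_state;				/* invented */

	void gcm_save_digest_and_keystream(struct gcm_sketch_state *st);	/* invented */
	void gcm_reload_hash_key_and_digest(struct gcm_sketch_state *st);	/* invented */

	/* Called at a yield point when a reschedule is pending. */
	static void gcm_yield_neon(struct gcm_sketch_state *st)
	{
		gcm_save_digest_and_keystream(st);	/* write back live NEON state */
		kernel_neon_end();			/* re-enables preemption; may reschedule */
		kernel_neon_begin();
		gcm_reload_hash_key_and_digest(st);	/* reload before resuming the loop */
	}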

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/ghash-ce-core.S | 128 ++++++++++++++------
 1 file changed, 92 insertions(+), 36 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index 11ebf1ae248a..fbfd4681675d 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -212,23 +212,36 @@
 	ushr		XL.2d, XL.2d, #1
 	.endm
 
-	.macro		__pmull_ghash, pn
-	ld1		{SHASH.2d}, [x3]
-	ld1		{XL.2d}, [x1]
+	.macro		__pmull_ghash, pn, yield
+	stp		x29, x30, [sp, #-64]!
+	mov		x29, sp
+	stp		x19, x20, [sp, #16]
+	stp		x21, x22, [sp, #32]
+	str		x23, [sp, #48]
+
+	mov		x19, x0
+	mov		x20, x1
+	mov		x21, x2
+	mov		x22, x3
+	mov		x23, x4
+
+0:	ld1		{SHASH.2d}, [x22]
+	ld1		{XL.2d}, [x20]
 	ext		SHASH2.16b, SHASH.16b, SHASH.16b, #8
 	eor		SHASH2.16b, SHASH2.16b, SHASH.16b
 
 	__pmull_pre_\pn
 
 	/* do the head block first, if supplied */
-	cbz		x4, 0f
-	ld1		{T1.2d}, [x4]
-	b		1f
+	cbz		x23, 1f
+	ld1		{T1.2d}, [x23]
+	mov		x23, xzr
+	b		2f
 
-0:	ld1		{T1.2d}, [x2], #16
-	sub		w0, w0, #1
+1:	ld1		{T1.2d}, [x21], #16
+	sub		w19, w19, #1
 
-1:	/* multiply XL by SHASH in GF(2^128) */
+2:	/* multiply XL by SHASH in GF(2^128) */
 CPU_LE(	rev64		T1.16b, T1.16b	)
 
 	ext		T2.16b, XL.16b, XL.16b, #8
@@ -250,9 +263,19 @@ CPU_LE(	rev64		T1.16b, T1.16b	)
 	eor		T2.16b, T2.16b, XH.16b
 	eor		XL.16b, XL.16b, T2.16b
 
-	cbnz		w0, 0b
+	cbz		w19, 3f
 
-	st1		{XL.2d}, [x1]
+	yield_neon_pre	w19, \yield, 1, 1b
+	st1		{XL.2d}, [x20]
+	yield_neon_post	0b
+
+	b		1b
+
+3:	st1		{XL.2d}, [x20]
+	ldp		x19, x20, [sp, #16]
+	ldp		x21, x22, [sp, #32]
+	ldr		x23, [sp, #48]
+	ldp		x29, x30, [sp], #64
 	ret
 	.endm
 
@@ -261,11 +284,11 @@ CPU_LE(	rev64		T1.16b, T1.16b	)
 	 *			   struct ghash_key const *k, const char *head)
 	 */
 ENTRY(pmull_ghash_update_p64)
-	__pmull_ghash	p64
+	__pmull_ghash	p64, 5
 ENDPROC(pmull_ghash_update_p64)
 
 ENTRY(pmull_ghash_update_p8)
-	__pmull_ghash	p8
+	__pmull_ghash	p8, 2
 ENDPROC(pmull_ghash_update_p8)
 
 	KS		.req	v8
@@ -304,38 +327,56 @@ ENDPROC(pmull_ghash_update_p8)
 	.endm
 
 	.macro		pmull_gcm_do_crypt, enc
-	ld1		{SHASH.2d}, [x4]
-	ld1		{XL.2d}, [x1]
-	ldr		x8, [x5, #8]			// load lower counter
+	stp		x29, x30, [sp, #-96]!
+	mov		x29, sp
+	stp		x19, x20, [sp, #16]
+	stp		x21, x22, [sp, #32]
+	stp		x23, x24, [sp, #48]
+	stp		x25, x26, [sp, #64]
+	str		x27, [sp, #80]
+
+	mov		x19, x0
+	mov		x20, x1
+	mov		x21, x2
+	mov		x22, x3
+	mov		x23, x4
+	mov		x24, x5
+	mov		x25, x6
+	mov		x26, x7
+
+	ldr		x27, [x24, #8]			// load lower counter
+CPU_LE(	rev		x27, x27	)
+
+0:	ld1		{SHASH.2d}, [x23]
+	ld1		{XL.2d}, [x20]
 
 	movi		MASK.16b, #0xe1
 	ext		SHASH2.16b, SHASH.16b, SHASH.16b, #8
-CPU_LE(	rev		x8, x8		)
 	shl		MASK.2d, MASK.2d, #57
 	eor		SHASH2.16b, SHASH2.16b, SHASH.16b
 
 	.if		\enc == 1
-	ld1		{KS.16b}, [x7]
+	ld1		{KS.16b}, [x26]
 	.endif
 
-0:	ld1		{CTR.8b}, [x5]			// load upper counter
-	ld1		{INP.16b}, [x3], #16
-	rev		x9, x8
-	add		x8, x8, #1
-	sub		w0, w0, #1
+1:	ld1		{CTR.8b}, [x24]			// load upper counter
+	ld1		{INP.16b}, [x22], #16
+	rev		x9, x27
+	add		x27, x27, #1
+	sub		w19, w19, #1
 	ins		CTR.d[1], x9			// set lower counter
 
 	.if		\enc == 1
 	eor		INP.16b, INP.16b, KS.16b	// encrypt input
-	st1		{INP.16b}, [x2], #16
+	st1		{INP.16b}, [x21], #16
 	.endif
 
 	rev64		T1.16b, INP.16b
 
-	cmp		w6, #12
-	b.ge		2f				// AES-192/256?
+	cmp		w25, #12
+	b.ge		4f				// AES-192/256?
 
-1:	enc_round	CTR, v21
+2:	enc_round	CTR, v21
 
 	ext		T2.16b, XL.16b, XL.16b, #8
 	ext		IN1.16b, T1.16b, T1.16b, #8
@@ -390,27 +431,42 @@ CPU_LE(	rev		x8, x8		)
 
 	.if		\enc == 0
 	eor		INP.16b, INP.16b, KS.16b
-	st1		{INP.16b}, [x2], #16
+	st1		{INP.16b}, [x21], #16
 	.endif
 
-	cbnz		w0, 0b
+	cbz		w19, 3f
 
-CPU_LE(	rev		x8, x8		)
-	st1		{XL.2d}, [x1]
-	str		x8, [x5, #8]			// store lower counter
+	yield_neon_pre	w19, 8, 1, 1b			// yield every 8 blocks
+	st1		{XL.2d}, [x20]
+	.if		\enc == 1
+	st1		{KS.16b}, [x26]
+	.endif
+	yield_neon_post	0b
 
+	b		1b
+
+3:	st1		{XL.2d}, [x20]
 	.if		\enc == 1
-	st1		{KS.16b}, [x7]
+	st1		{KS.16b}, [x26]
 	.endif
 
+CPU_LE(	rev		x27, x27	)
+	str		x27, [x24, #8]			// store lower counter
+
+	ldp		x19, x20, [sp, #16]
+	ldp		x21, x22, [sp, #32]
+	ldp		x23, x24, [sp, #48]
+	ldp		x25, x26, [sp, #64]
+	ldr		x27, [sp, #80]
+	ldp		x29, x30, [sp], #96
 	ret
 
-2:	b.eq		3f				// AES-192?
+4:	b.eq		5f				// AES-192?
 	enc_round	CTR, v17
 	enc_round	CTR, v18
-3:	enc_round	CTR, v19
+5:	enc_round	CTR, v19
 	enc_round	CTR, v20
-	b		1b
+	b		2b
 	.endm
 
 	/*
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 17/19] crypto: arm64/crc32-ce - yield NEON every 16 blocks of input
  2017-12-04 12:26 [PATCH v2 00/19] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
                   ` (15 preceding siblings ...)
  2017-12-04 12:26 ` [PATCH v2 16/19] crypto: arm64/aes-ghash - yield after processing fixed number of blocks Ard Biesheuvel
@ 2017-12-04 12:26 ` Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 18/19] crypto: arm64/crct10dif-ce - yield NEON every 8 " Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 19/19] DO NOT MERGE Ard Biesheuvel
  18 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 12:26 UTC (permalink / raw)
  To: linux-crypto
  Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
	Russell King - ARM Linux, Sebastian Andrzej Siewior,
	Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
	Will Deacon, Steven Rostedt, Thomas Gleixner

Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON every 16 blocks of input.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/crc32-ce-core.S | 55 +++++++++++++++-----
 1 file changed, 43 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/crypto/crc32-ce-core.S b/arch/arm64/crypto/crc32-ce-core.S
index 18f5a8442276..bca3d22fae7b 100644
--- a/arch/arm64/crypto/crc32-ce-core.S
+++ b/arch/arm64/crypto/crc32-ce-core.S
@@ -100,9 +100,9 @@
 	dCONSTANT	.req	d0
 	qCONSTANT	.req	q0
 
-	BUF		.req	x0
-	LEN		.req	x1
-	CRC		.req	x2
+	BUF		.req	x19
+	LEN		.req	x20
+	CRC		.req	x21
 
 	vzr		.req	v9
 
@@ -116,13 +116,27 @@
 	 *                     size_t len, uint crc32)
 	 */
 ENTRY(crc32_pmull_le)
-	adr		x3, .Lcrc32_constants
+	stp		x29, x30, [sp, #-112]!
+	mov		x29, sp
+	stp		x19, x20, [sp, #16]
+	stp		x21, x22, [sp, #32]
+
+	adr		x22, .Lcrc32_constants
 	b		0f
 
 ENTRY(crc32c_pmull_le)
-	adr		x3, .Lcrc32c_constants
+	stp		x29, x30, [sp, #-112]!
+	mov		x29, sp
+	stp		x19, x20, [sp, #16]
+	stp		x21, x22, [sp, #32]
+
+	adr		x22, .Lcrc32c_constants
 
-0:	bic		LEN, LEN, #15
+0:	mov		BUF, x0
+	mov		LEN, x1
+	mov		CRC, x2
+
+	bic		LEN, LEN, #15
 	ld1		{v1.16b-v4.16b}, [BUF], #0x40
 	movi		vzr.16b, #0
 	fmov		dCONSTANT, CRC
@@ -131,7 +145,7 @@ ENTRY(crc32c_pmull_le)
 	cmp		LEN, #0x40
 	b.lt		less_64
 
-	ldr		qCONSTANT, [x3]
+	ldr		qCONSTANT, [x22]
 
 loop_64:		/* 64 bytes Full cache line folding */
 	sub		LEN, LEN, #0x40
@@ -161,10 +175,24 @@ loop_64:		/* 64 bytes Full cache line folding */
 	eor		v4.16b, v4.16b, v8.16b
 
 	cmp		LEN, #0x40
-	b.ge		loop_64
+	b.lt		less_64
+
+	yield_neon_pre	LEN, 4, 64, loop_64		// yield every 16 blocks
+	stp		q1, q2, [sp, #48]
+	stp		q3, q4, [sp, #80]
+	yield_neon_post	2f
+	b		loop_64
+
+	.subsection	1
+2:	ldp		q1, q2, [sp, #48]
+	ldp		q3, q4, [sp, #80]
+	ldr		qCONSTANT, [x22]
+	movi		vzr.16b, #0
+	b		loop_64
+	.previous
 
 less_64:		/* Folding cache line into 128bit */
-	ldr		qCONSTANT, [x3, #16]
+	ldr		qCONSTANT, [x22, #16]
 
 	pmull2		v5.1q, v1.2d, vCONSTANT.2d
 	pmull		v1.1q, v1.1d, vCONSTANT.1d
@@ -203,8 +231,8 @@ fold_64:
 	eor		v1.16b, v1.16b, v2.16b
 
 	/* final 32-bit fold */
-	ldr		dCONSTANT, [x3, #32]
-	ldr		d3, [x3, #40]
+	ldr		dCONSTANT, [x22, #32]
+	ldr		d3, [x22, #40]
 
 	ext		v2.16b, v1.16b, vzr.16b, #4
 	and		v1.16b, v1.16b, v3.16b
@@ -212,7 +240,7 @@ fold_64:
 	eor		v1.16b, v1.16b, v2.16b
 
 	/* Finish up with the bit-reversed barrett reduction 64 ==> 32 bits */
-	ldr		qCONSTANT, [x3, #48]
+	ldr		qCONSTANT, [x22, #48]
 
 	and		v2.16b, v1.16b, v3.16b
 	ext		v2.16b, vzr.16b, v2.16b, #8
@@ -222,6 +250,9 @@ fold_64:
 	eor		v1.16b, v1.16b, v2.16b
 	mov		w0, v1.s[1]
 
+	ldp		x19, x20, [sp, #16]
+	ldp		x21, x22, [sp, #32]
+	ldp		x29, x30, [sp], #112
 	ret
 ENDPROC(crc32_pmull_le)
 ENDPROC(crc32c_pmull_le)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 18/19] crypto: arm64/crct10dif-ce - yield NEON every 8 blocks of input
  2017-12-04 12:26 [PATCH v2 00/19] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
                   ` (16 preceding siblings ...)
  2017-12-04 12:26 ` [PATCH v2 17/19] crypto: arm64/crc32-ce - yield NEON every 16 blocks of input Ard Biesheuvel
@ 2017-12-04 12:26 ` Ard Biesheuvel
  2017-12-04 12:26 ` [PATCH v2 19/19] DO NOT MERGE Ard Biesheuvel
  18 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 12:26 UTC (permalink / raw)
  To: linux-crypto
  Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
	Russell King - ARM Linux, Sebastian Andrzej Siewior,
	Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
	Will Deacon, Steven Rostedt, Thomas Gleixner

Avoid excessive scheduling delays under a preemptible kernel by
yielding the NEON every 8 blocks of input.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/crct10dif-ce-core.S | 39 ++++++++++++++++++--
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/crypto/crct10dif-ce-core.S b/arch/arm64/crypto/crct10dif-ce-core.S
index d5b5a8c038c8..d57067e80bae 100644
--- a/arch/arm64/crypto/crct10dif-ce-core.S
+++ b/arch/arm64/crypto/crct10dif-ce-core.S
@@ -74,13 +74,22 @@
 	.text
 	.cpu		generic+crypto
 
-	arg1_low32	.req	w0
-	arg2		.req	x1
-	arg3		.req	x2
+	arg1_low32	.req	w19
+	arg2		.req	x20
+	arg3		.req	x21
 
 	vzr		.req	v13
 
 ENTRY(crc_t10dif_pmull)
+	stp		x29, x30, [sp, #-176]!
+	mov		x29, sp
+	stp		x19, x20, [sp, #16]
+	stp		x21, x22, [sp, #32]
+
+	mov		arg1_low32, w0
+	mov		arg2, x1
+	mov		arg3, x2
+
 	movi		vzr.16b, #0		// init zero register
 
 	// adjust the 16-bit initial_crc value, scale it to 32 bits
@@ -175,8 +184,27 @@ CPU_LE(	ext		v12.16b, v12.16b, v12.16b, #8	)
 	subs		arg3, arg3, #128
 
 	// check if there is another 64B in the buffer to be able to fold
-	b.ge		_fold_64_B_loop
+	b.lt		_fold_64_B_end
+
+	yield_neon_pre	arg3, 3, 128, _fold_64_B_loop	// yield every 8 blocks
+	stp		q0, q1, [sp, #48]
+	stp		q2, q3, [sp, #80]
+	stp		q4, q5, [sp, #112]
+	stp		q6, q7, [sp, #144]
+	yield_neon_post	2f
+	b		_fold_64_B_loop
+
+	.subsection	1
+2:	ldp		q0, q1, [sp, #48]
+	ldp		q2, q3, [sp, #80]
+	ldp		q4, q5, [sp, #112]
+	ldp		q6, q7, [sp, #144]
+	ldr		q10, rk3
+	movi		vzr.16b, #0		// init zero register
+	b		_fold_64_B_loop
+	.previous
 
+_fold_64_B_end:
 	// at this point, the buffer pointer is pointing at the last y Bytes
 	// of the buffer the 64B of folded data is in 4 of the vector
 	// registers: v0, v1, v2, v3
@@ -304,6 +332,9 @@ _barrett:
 _cleanup:
 	// scale the result back to 16 bits
 	lsr		x0, x0, #16
+	ldp		x19, x20, [sp, #16]
+	ldp		x21, x22, [sp, #32]
+	ldp		x29, x30, [sp], #176
 	ret
 
 _less_than_128:
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 19/19] DO NOT MERGE
  2017-12-04 12:26 [PATCH v2 00/19] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
                   ` (17 preceding siblings ...)
  2017-12-04 12:26 ` [PATCH v2 18/19] crypto: arm64/crct10dif-ce - yield NEON every 8 " Ard Biesheuvel
@ 2017-12-04 12:26 ` Ard Biesheuvel
  18 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-04 12:26 UTC (permalink / raw)
  To: linux-crypto
  Cc: herbert, linux-arm-kernel, Ard Biesheuvel, Dave Martin,
	Russell King - ARM Linux, Sebastian Andrzej Siewior,
	Mark Rutland, linux-rt-users, Peter Zijlstra, Catalin Marinas,
	Will Deacon, Steven Rostedt, Thomas Gleixner

Test code to force a kernel_neon_end+begin sequence at every yield point,
and wipe the entire NEON state before resuming the algorithm.
---
 arch/arm64/include/asm/assembler.h | 33 ++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index 917b026d3e00..dfee20246592 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -549,6 +549,7 @@ alternative_else_nop_endif
 	cmp		w1, #1 // == PREEMPT_OFFSET
 	csel		x0, x0, xzr, eq
 	tbnz		x0, #TIF_NEED_RESCHED, 5555f	// needs rescheduling?
+	b		5555f
 4444:
 #endif
 	.subsection	1
@@ -558,6 +559,38 @@ alternative_else_nop_endif
 	.macro		yield_neon_post, lbl:req
 	bl		kernel_neon_end
 	bl		kernel_neon_begin
+	movi		v0.16b, #0x55
+	movi		v1.16b, #0x55
+	movi		v2.16b, #0x55
+	movi		v3.16b, #0x55
+	movi		v4.16b, #0x55
+	movi		v5.16b, #0x55
+	movi		v6.16b, #0x55
+	movi		v7.16b, #0x55
+	movi		v8.16b, #0x55
+	movi		v9.16b, #0x55
+	movi		v10.16b, #0x55
+	movi		v11.16b, #0x55
+	movi		v12.16b, #0x55
+	movi		v13.16b, #0x55
+	movi		v14.16b, #0x55
+	movi		v15.16b, #0x55
+	movi		v16.16b, #0x55
+	movi		v17.16b, #0x55
+	movi		v18.16b, #0x55
+	movi		v19.16b, #0x55
+	movi		v20.16b, #0x55
+	movi		v21.16b, #0x55
+	movi		v22.16b, #0x55
+	movi		v23.16b, #0x55
+	movi		v24.16b, #0x55
+	movi		v25.16b, #0x55
+	movi		v26.16b, #0x55
+	movi		v27.16b, #0x55
+	movi		v28.16b, #0x55
+	movi		v29.16b, #0x55
+	movi		v30.16b, #0x55
+	movi		v31.16b, #0x55
 	b		\lbl
 	.previous
 	.endm
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 11/19] arm64: assembler: add macro to conditionally yield the NEON under PREEMPT
  2017-12-04 12:26 ` [PATCH v2 11/19] arm64: assembler: add macro to conditionally yield the NEON under PREEMPT Ard Biesheuvel
@ 2017-12-05 12:28   ` Dave Martin
  2017-12-05 12:45     ` Ard Biesheuvel
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Martin @ 2017-12-05 12:28 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-crypto, Mark Rutland, herbert, Peter Zijlstra,
	Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
	Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
	linux-arm-kernel, linux-rt-users

On Mon, Dec 04, 2017 at 12:26:37PM +0000, Ard Biesheuvel wrote:
> Add a support macro to conditionally yield the NEON (and thus the CPU)
> that may be called from the assembler code. Given that especially the
> instruction based accelerated crypto code may use very tight loops, add
> some parametrization so that the TIF_NEED_RESCHED flag test is only
> executed every so many loop iterations.
> 
> In some cases, yielding the NEON involves saving and restoring a non
> trivial amount of context (especially in the CRC folding algorithms),
> and so the macro is split into two, and the code in between is only
> executed when the yield path is taken, allowing the contex to be preserved.
> The second macro takes a label argument that marks the resume-from-yield
> path, which should restore the preserved context again.
> 
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>  arch/arm64/include/asm/assembler.h | 50 ++++++++++++++++++++
>  1 file changed, 50 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> index aef72d886677..917b026d3e00 100644
> --- a/arch/arm64/include/asm/assembler.h
> +++ b/arch/arm64/include/asm/assembler.h
> @@ -512,4 +512,54 @@ alternative_else_nop_endif
>  #endif
>  	.endm
>  
> +/*
> + * yield_neon - check whether to yield to another runnable task from
> + *		kernel mode NEON code (running with preemption disabled)
> + *
> + * - Check whether the preempt count is exactly 1, in which case disabling
> + *   preemption once will make the task preemptible. If this is not the case,
> + *   yielding is pointless.
> + * - Check whether TIF_NEED_RESCHED is set, and if so, disable and re-enable
> + *   kernel mode NEON (which will trigger a reschedule), and branch to the
> + *   yield fixup code at @lbl.
> + */
> +	.macro		yield_neon, lbl:req, ctr, order, stride, loop
> +	yield_neon_pre	\ctr, \order, \stride, \loop
> +	yield_neon_post	\lbl
> +	.endm
> +
> +	.macro		yield_neon_pre, ctr, order=0, stride, loop=4444f
> +#ifdef CONFIG_PREEMPT
> +	/*
> +	 * With some algorithms, it makes little sense to poll the
> +	 * TIF_NEED_RESCHED flag after every iteration, so only perform
> +	 * the check every 2^order strides.
> +	 */
> +	.if		\order > 1
> +	.if		(\stride & (\stride - 1)) != 0
> +	.error		"stride should be a power of 2"
> +	.endif
> +	tst		\ctr, #((1 << \order) * \stride - 1) & ~(\stride - 1)
> +	b.ne		\loop
> +	.endif

I'm not sure what baking in this check gives us, and this seems to
conflate two rather different things: yielding and defining a
"heartbeat" frequency for the calling code.

Can we separate out the crypto-loop-helper semantics from the yield-
neon stuff?

If we had

	if_cond_yield_neon
	// patchup code
	endif_yield_neon

then the caller is free to conditionally branch over that as appropriate
like

loop:
	// crypto stuff
	tst x0, #0xf
	b.ne 	loop

	if_cond_yield_neon
	// patchup code
	endif_cond_yield_neon

	b	loop

I think this is clearer than burying checks and branches in a macro that
is trying to be generic.

Label arguments can be added to elide some branches of course, at a
corresponding cost to clarity...  in the common case the cache will
be hot and the branches won't be mispredicted though.  Is it really
worth it?

> +
> +	get_thread_info	x0
> +	ldr		w1, [x0, #TSK_TI_PREEMPT]
> +	ldr		x0, [x0, #TSK_TI_FLAGS]
> +	cmp		w1, #1 // == PREEMPT_OFFSET

asm-offsets?

[...]

Cheers
---Dave

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 11/19] arm64: assembler: add macro to conditionally yield the NEON under PREEMPT
  2017-12-05 12:28   ` Dave Martin
@ 2017-12-05 12:45     ` Ard Biesheuvel
  2017-12-05 18:04       ` Ard Biesheuvel
  0 siblings, 1 reply; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-05 12:45 UTC (permalink / raw)
  To: Dave Martin
  Cc: linux-crypto, Mark Rutland, herbert, Peter Zijlstra,
	Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
	Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
	linux-arm-kernel, linux-rt-users



> On 5 Dec 2017, at 12:28, Dave Martin <Dave.Martin@arm.com> wrote:
> 
>> On Mon, Dec 04, 2017 at 12:26:37PM +0000, Ard Biesheuvel wrote:
>> Add a support macro to conditionally yield the NEON (and thus the CPU)
>> that may be called from the assembler code. Given that especially the
>> instruction based accelerated crypto code may use very tight loops, add
>> some parametrization so that the TIF_NEED_RESCHED flag test is only
>> executed every so many loop iterations.
>> 
>> In some cases, yielding the NEON involves saving and restoring a non
>> trivial amount of context (especially in the CRC folding algorithms),
>> and so the macro is split into two, and the code in between is only
>> executed when the yield path is taken, allowing the contex to be preserved.
>> The second macro takes a label argument that marks the resume-from-yield
>> path, which should restore the preserved context again.
>> 
>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> ---
>> arch/arm64/include/asm/assembler.h | 50 ++++++++++++++++++++
>> 1 file changed, 50 insertions(+)
>> 
>> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
>> index aef72d886677..917b026d3e00 100644
>> --- a/arch/arm64/include/asm/assembler.h
>> +++ b/arch/arm64/include/asm/assembler.h
>> @@ -512,4 +512,54 @@ alternative_else_nop_endif
>> #endif
>>    .endm
>> 
>> +/*
>> + * yield_neon - check whether to yield to another runnable task from
>> + *        kernel mode NEON code (running with preemption disabled)
>> + *
>> + * - Check whether the preempt count is exactly 1, in which case disabling
>> + *   preemption once will make the task preemptible. If this is not the case,
>> + *   yielding is pointless.
>> + * - Check whether TIF_NEED_RESCHED is set, and if so, disable and re-enable
>> + *   kernel mode NEON (which will trigger a reschedule), and branch to the
>> + *   yield fixup code at @lbl.
>> + */
>> +    .macro        yield_neon, lbl:req, ctr, order, stride, loop
>> +    yield_neon_pre    \ctr, \order, \stride, \loop
>> +    yield_neon_post    \lbl
>> +    .endm
>> +
>> +    .macro        yield_neon_pre, ctr, order=0, stride, loop=4444f
>> +#ifdef CONFIG_PREEMPT
>> +    /*
>> +     * With some algorithms, it makes little sense to poll the
>> +     * TIF_NEED_RESCHED flag after every iteration, so only perform
>> +     * the check every 2^order strides.
>> +     */
>> +    .if        \order > 1
>> +    .if        (\stride & (\stride - 1)) != 0
>> +    .error        "stride should be a power of 2"
>> +    .endif
>> +    tst        \ctr, #((1 << \order) * \stride - 1) & ~(\stride - 1)
>> +    b.ne        \loop
>> +    .endif
> 
> I'm not sure what baking in this check gives us, and this seems to
> conflate two rather different things: yielding and defining a
> "heartbeat" frequency for the calling code.
> 
> Can we separate out the crypto-loop-helper semantics from the yield-
> neon stuff?
> 

Fair enough. I incorporated the check here so it disappears from the code entirely when !CONFIG_PREEMPT, because otherwise, you end up with a sequence that is mispredicted every # iterations without any benefit.
I guess I could macroise that separately though.

> If we had
> 
>    if_cond_yield_neon
>    // patchup code
>    endif_yield_neon
> 
> then the caller is free to conditionally branch over that as appropriate
> like
> 
> loop:
>    // crypto stuff
>    tst x0, #0xf
>    b.ne    loop
> 
>    if_cond_yield_neon
>    // patchup code
>    endif_cond_yield_neon
> 
>    b    loop
> 
> I think this is clearer than burying checks and branches in a macro that
> is trying to be generic.
> 

Agreed.

> Label arguments can be added to elide some branches of course, at a
> corresponding cost to clarity...  in the common case the cache will
> be hot and the branches won't be mispredicted though.  Is it really
> worth it?
> 

Perhaps not. And I have not made any attempts yet to benchmark in great detail, given that I need some feedback from the rt crowd first on whether this is likely to work as desired.

>> +
>> +    get_thread_info    x0
>> +    ldr        w1, [x0, #TSK_TI_PREEMPT]
>> +    ldr        x0, [x0, #TSK_TI_FLAGS]
>> +    cmp        w1, #1 // == PREEMPT_OFFSET
> 
> asm-offsets?
> 

This is not an offset in that regard, but the header that defines it is not asm-safe.

> [...]
> 
> Cheers
> ---Dave

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 11/19] arm64: assembler: add macro to conditionally yield the NEON under PREEMPT
  2017-12-05 12:45     ` Ard Biesheuvel
@ 2017-12-05 18:04       ` Ard Biesheuvel
  2017-12-06 11:51         ` Dave Martin
  0 siblings, 1 reply; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-05 18:04 UTC (permalink / raw)
  To: Dave Martin
  Cc: linux-crypto, Mark Rutland, Herbert Xu, Peter Zijlstra,
	Catalin Marinas, Sebastian Andrzej Siewior, Will Deacon,
	Russell King - ARM Linux, Steven Rostedt, Thomas Gleixner,
	linux-arm-kernel, linux-rt-users

On 5 December 2017 at 12:45, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>
>
>> On 5 Dec 2017, at 12:28, Dave Martin <Dave.Martin@arm.com> wrote:
>>
>>> On Mon, Dec 04, 2017 at 12:26:37PM +0000, Ard Biesheuvel wrote:
>>> Add a support macro to conditionally yield the NEON (and thus the CPU)
>>> that may be called from the assembler code. Given that especially the
>>> instruction based accelerated crypto code may use very tight loops, add
>>> some parametrization so that the TIF_NEED_RESCHED flag test is only
>>> executed every so many loop iterations.
>>>
>>> In some cases, yielding the NEON involves saving and restoring a non
>>> trivial amount of context (especially in the CRC folding algorithms),
>>> and so the macro is split into two, and the code in between is only
>>> executed when the yield path is taken, allowing the contex to be preserved.
>>> The second macro takes a label argument that marks the resume-from-yield
>>> path, which should restore the preserved context again.
>>>
>>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>>> ---
>>> arch/arm64/include/asm/assembler.h | 50 ++++++++++++++++++++
>>> 1 file changed, 50 insertions(+)
>>>
>>> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
>>> index aef72d886677..917b026d3e00 100644
>>> --- a/arch/arm64/include/asm/assembler.h
>>> +++ b/arch/arm64/include/asm/assembler.h
>>> @@ -512,4 +512,54 @@ alternative_else_nop_endif
>>> #endif
>>>    .endm
>>>
>>> +/*
>>> + * yield_neon - check whether to yield to another runnable task from
>>> + *        kernel mode NEON code (running with preemption disabled)
>>> + *
>>> + * - Check whether the preempt count is exactly 1, in which case disabling
>>> + *   preemption once will make the task preemptible. If this is not the case,
>>> + *   yielding is pointless.
>>> + * - Check whether TIF_NEED_RESCHED is set, and if so, disable and re-enable
>>> + *   kernel mode NEON (which will trigger a reschedule), and branch to the
>>> + *   yield fixup code at @lbl.
>>> + */
>>> +    .macro        yield_neon, lbl:req, ctr, order, stride, loop
>>> +    yield_neon_pre    \ctr, \order, \stride, \loop
>>> +    yield_neon_post    \lbl
>>> +    .endm
>>> +
>>> +    .macro        yield_neon_pre, ctr, order=0, stride, loop=4444f
>>> +#ifdef CONFIG_PREEMPT
>>> +    /*
>>> +     * With some algorithms, it makes little sense to poll the
>>> +     * TIF_NEED_RESCHED flag after every iteration, so only perform
>>> +     * the check every 2^order strides.
>>> +     */
>>> +    .if        \order > 1
>>> +    .if        (\stride & (\stride - 1)) != 0
>>> +    .error        "stride should be a power of 2"
>>> +    .endif
>>> +    tst        \ctr, #((1 << \order) * \stride - 1) & ~(\stride - 1)
>>> +    b.ne        \loop
>>> +    .endif
>>
>> I'm not sure what baking in this check gives us, and this seems to
>> conflate two rather different things: yielding and defining a
>> "heartbeat" frequency for the calling code.
>>
>> Can we separate out the crypto-loop-helper semantics from the yield-
>> neon stuff?
>>
>
> Fair enough. I incorporated the check here so it disappears from the code entirely when !CONFIG_PREEMPT, because otherwise, you end up with a sequence that is mispredicted every # iterations without any benefit.
> I guess i could macroise that separately though.
>
>> If we had
>>
>>    if_cond_yield_neon
>>    // patchup code
>>    endif_yield_neon
>>

I like this, but ...

>> then the caller is free to conditionally branch over that as appropriate
>> like
>>
>> loop:
>>    // crypto stuff
>>    tst x0, #0xf
>>    b.ne    loop
>>
>>    if_cond_yield_neon
>>    // patchup code
>>    endif_cond_yield_neon
>>

I need to put the post patchup code somewhere too. Please refer to the
CRC patches for the best examples of this.


>>    b    loop
>>
>> I think this is clearer than burying checks and branches in a macro that
>> is trying to be generic.
>>
>
> Agreed.
>
>> Label arguments can be added to elide some branches of course, at a
>> corresponding cost to clarity...  in the common case the cache will
>> be hot and the branches won't be mispredicted though.  Is it really
>> worth it?
>>
>
> Perhaps not. And i have not made any attempts yet to benchmark at great detail, given that i need some feedback from the rt crowd first whether this is likely to work as desired.
>
>>> +
>>> +    get_thread_info    x0
>>> +    ldr        w1, [x0, #TSK_TI_PREEMPT]
>>> +    ldr        x0, [x0, #TSK_TI_FLAGS]
>>> +    cmp        w1, #1 // == PREEMPT_OFFSET
>>
>> asm-offsets?
>>
>
> This is not an offset in that regard, but the header that defines it is not asm safe
>
>> [...]
>>
>> Cheers
>> ---Dave

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 11/19] arm64: assembler: add macro to conditionally yield the NEON under PREEMPT
  2017-12-05 18:04       ` Ard Biesheuvel
@ 2017-12-06 11:51         ` Dave Martin
  2017-12-06 11:57           ` Ard Biesheuvel
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Martin @ 2017-12-06 11:51 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Mark Rutland, Herbert Xu, Peter Zijlstra, Catalin Marinas,
	Sebastian Andrzej Siewior, Will Deacon, Russell King - ARM Linux,
	Steven Rostedt, linux-crypto, Thomas Gleixner, linux-arm-kernel,
	linux-rt-users

On Tue, Dec 05, 2017 at 06:04:34PM +0000, Ard Biesheuvel wrote:
> On 5 December 2017 at 12:45, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> >
> >
> >> On 5 Dec 2017, at 12:28, Dave Martin <Dave.Martin@arm.com> wrote:
> >>
> >>> On Mon, Dec 04, 2017 at 12:26:37PM +0000, Ard Biesheuvel wrote:
> >>> Add a support macro to conditionally yield the NEON (and thus the CPU)
> >>> that may be called from the assembler code. Given that especially the
> >>> instruction based accelerated crypto code may use very tight loops, add
> >>> some parametrization so that the TIF_NEED_RESCHED flag test is only
> >>> executed every so many loop iterations.
> >>>
> >>> In some cases, yielding the NEON involves saving and restoring a non
> >>> trivial amount of context (especially in the CRC folding algorithms),
> >>> and so the macro is split into two, and the code in between is only
> >>> executed when the yield path is taken, allowing the contex to be preserved.
> >>> The second macro takes a label argument that marks the resume-from-yield
> >>> path, which should restore the preserved context again.
> >>>
> >>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> >>> ---
> >>> arch/arm64/include/asm/assembler.h | 50 ++++++++++++++++++++
> >>> 1 file changed, 50 insertions(+)
> >>>
> >>> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> >>> index aef72d886677..917b026d3e00 100644
> >>> --- a/arch/arm64/include/asm/assembler.h
> >>> +++ b/arch/arm64/include/asm/assembler.h
> >>> @@ -512,4 +512,54 @@ alternative_else_nop_endif
> >>> #endif
> >>>    .endm
> >>>
> >>> +/*
> >>> + * yield_neon - check whether to yield to another runnable task from
> >>> + *        kernel mode NEON code (running with preemption disabled)
> >>> + *
> >>> + * - Check whether the preempt count is exactly 1, in which case disabling
> >>> + *   preemption once will make the task preemptible. If this is not the case,
> >>> + *   yielding is pointless.
> >>> + * - Check whether TIF_NEED_RESCHED is set, and if so, disable and re-enable
> >>> + *   kernel mode NEON (which will trigger a reschedule), and branch to the
> >>> + *   yield fixup code at @lbl.
> >>> + */
> >>> +    .macro        yield_neon, lbl:req, ctr, order, stride, loop
> >>> +    yield_neon_pre    \ctr, \order, \stride, \loop
> >>> +    yield_neon_post    \lbl
> >>> +    .endm
> >>> +
> >>> +    .macro        yield_neon_pre, ctr, order=0, stride, loop=4444f
> >>> +#ifdef CONFIG_PREEMPT
> >>> +    /*
> >>> +     * With some algorithms, it makes little sense to poll the
> >>> +     * TIF_NEED_RESCHED flag after every iteration, so only perform
> >>> +     * the check every 2^order strides.
> >>> +     */
> >>> +    .if        \order > 1
> >>> +    .if        (\stride & (\stride - 1)) != 0
> >>> +    .error        "stride should be a power of 2"
> >>> +    .endif
> >>> +    tst        \ctr, #((1 << \order) * \stride - 1) & ~(\stride - 1)
> >>> +    b.ne        \loop
> >>> +    .endif
> >>
> >> I'm not sure what baking in this check gives us, and this seems to
> >> conflate two rather different things: yielding and defining a
> >> "heartbeat" frequency for the calling code.
> >>
> >> Can we separate out the crypto-loop-helper semantics from the yield-
> >> neon stuff?
> >>
> >
> > Fair enough. I incorporated the check here so it disappears from the code entirely when !CONFIG_PREEMPT, because otherwise, you end up with a sequence that is mispredicted every # iterations without any benefit.
> > I guess i could macroise that separately though.
> >
> >> If we had
> >>
> >>    if_cond_yield_neon
> >>    // patchup code
> >>    endif_yield_neon
> >>
> 
> I like this, but ...
> 
> >> then the caller is free to conditionally branch over that as appropriate
> >> like
> >>
> >> loop:
> >>    // crypto stuff
> >>    tst x0, #0xf
> >>    b.ne    loop
> >>
> >>    if_cond_yield_neon
> >>    // patchup code
> >>    endif_cond_yield_neon
> >>
> 
> I need to put the post patchup code somewhere too. Please refer to the
> CRC patches for the best examples of this.

I'm not sure I follow.  If you need to do something different after
patchup, can't you either pull that code into the if...endif too, or
otherwise put a branch before the endif?

[...]

Cheers
---Dave

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 11/19] arm64: assembler: add macro to conditionally yield the NEON under PREEMPT
  2017-12-06 11:51         ` Dave Martin
@ 2017-12-06 11:57           ` Ard Biesheuvel
  2017-12-06 12:12             ` Dave P Martin
  0 siblings, 1 reply; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 11:57 UTC (permalink / raw)
  To: Dave Martin
  Cc: Mark Rutland, Herbert Xu, Peter Zijlstra, Catalin Marinas,
	Sebastian Andrzej Siewior, Will Deacon, Russell King - ARM Linux,
	Steven Rostedt, linux-crypto, Thomas Gleixner, linux-arm-kernel,
	linux-rt-users

On 6 December 2017 at 11:51, Dave Martin <Dave.Martin@arm.com> wrote:
> On Tue, Dec 05, 2017 at 06:04:34PM +0000, Ard Biesheuvel wrote:
>> On 5 December 2017 at 12:45, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>> >
>> >
>> >> On 5 Dec 2017, at 12:28, Dave Martin <Dave.Martin@arm.com> wrote:
>> >>
>> >>> On Mon, Dec 04, 2017 at 12:26:37PM +0000, Ard Biesheuvel wrote:
>> >>> Add a support macro to conditionally yield the NEON (and thus the CPU)
>> >>> that may be called from the assembler code. Given that especially the
>> >>> instruction based accelerated crypto code may use very tight loops, add
>> >>> some parametrization so that the TIF_NEED_RESCHED flag test is only
>> >>> executed every so many loop iterations.
>> >>>
>> >>> In some cases, yielding the NEON involves saving and restoring a non
>> >>> trivial amount of context (especially in the CRC folding algorithms),
>> >>> and so the macro is split into two, and the code in between is only
>> >>> executed when the yield path is taken, allowing the contex to be preserved.
>> >>> The second macro takes a label argument that marks the resume-from-yield
>> >>> path, which should restore the preserved context again.
>> >>>
>> >>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> >>> ---
>> >>> arch/arm64/include/asm/assembler.h | 50 ++++++++++++++++++++
>> >>> 1 file changed, 50 insertions(+)
>> >>>
>> >>> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
>> >>> index aef72d886677..917b026d3e00 100644
>> >>> --- a/arch/arm64/include/asm/assembler.h
>> >>> +++ b/arch/arm64/include/asm/assembler.h
>> >>> @@ -512,4 +512,54 @@ alternative_else_nop_endif
>> >>> #endif
>> >>>    .endm
>> >>>
>> >>> +/*
>> >>> + * yield_neon - check whether to yield to another runnable task from
>> >>> + *        kernel mode NEON code (running with preemption disabled)
>> >>> + *
>> >>> + * - Check whether the preempt count is exactly 1, in which case disabling
>> >>> + *   preemption once will make the task preemptible. If this is not the case,
>> >>> + *   yielding is pointless.
>> >>> + * - Check whether TIF_NEED_RESCHED is set, and if so, disable and re-enable
>> >>> + *   kernel mode NEON (which will trigger a reschedule), and branch to the
>> >>> + *   yield fixup code at @lbl.
>> >>> + */
>> >>> +    .macro        yield_neon, lbl:req, ctr, order, stride, loop
>> >>> +    yield_neon_pre    \ctr, \order, \stride, \loop
>> >>> +    yield_neon_post    \lbl
>> >>> +    .endm
>> >>> +
>> >>> +    .macro        yield_neon_pre, ctr, order=0, stride, loop=4444f
>> >>> +#ifdef CONFIG_PREEMPT
>> >>> +    /*
>> >>> +     * With some algorithms, it makes little sense to poll the
>> >>> +     * TIF_NEED_RESCHED flag after every iteration, so only perform
>> >>> +     * the check every 2^order strides.
>> >>> +     */
>> >>> +    .if        \order > 1
>> >>> +    .if        (\stride & (\stride - 1)) != 0
>> >>> +    .error        "stride should be a power of 2"
>> >>> +    .endif
>> >>> +    tst        \ctr, #((1 << \order) * \stride - 1) & ~(\stride - 1)
>> >>> +    b.ne        \loop
>> >>> +    .endif
>> >>
>> >> I'm not sure what baking in this check gives us, and this seems to
>> >> conflate two rather different things: yielding and defining a
>> >> "heartbeat" frequency for the calling code.
>> >>
>> >> Can we separate out the crypto-loop-helper semantics from the yield-
>> >> neon stuff?
>> >>
>> >
>> > Fair enough. I incorporated the check here so it disappears from the code entirely when !CONFIG_PREEMPT, because otherwise, you end up with a sequence that is mispredicted every # iterations without any benefit.
>> > I guess i could macroise that separately though.
>> >
>> >> If we had
>> >>
>> >>    if_cond_yield_neon
>> >>    // patchup code
>> >>    endif_yield_neon
>> >>
>>
>> I like this, but ...
>>
>> >> then the caller is free to conditionally branch over that as appropriate
>> >> like
>> >>
>> >> loop:
>> >>    // crypto stuff
>> >>    tst x0, #0xf
>> >>    b.ne    loop
>> >>
>> >>    if_cond_yield_neon
>> >>    // patchup code
>> >>    endif_cond_yield_neon
>> >>
>>
>> I need to put the post patchup code somewhere too. Please refer to the
>> CRC patches for the best examples of this.
>
> I'm not sure I follow.  If you need to do something different after
> patchup, can't you either pull that code into the if...endif too, or
> otherwise put a branch before the endif?
>

No, not really.

What I currently have is

>    if_cond_yield_neon
>    // patchup code
>    endif_cond_yield_neon

which is being turned into

    get_thread_info x0
    ldr w1, [x0, #TSK_TI_PREEMPT]
    ldr x0, [x0, #TSK_TI_FLAGS]
    cmp w1, #1 // == PREEMPT_OFFSET
    csel x0, x0, xzr, eq
    tbnz x0, #TIF_NEED_RESCHED, 5555f // needs rescheduling?
4444:

    .subsection 1
5555:
    // patchup code
    bl kernel_neon_end
    bl kernel_neon_begin
    // post patchup code goes here
    b 4444b
    .subsection

so what I basically need is a place to put the post patchup code, unless
I open-code it and branch to a different label right after
kernel_neon_begin.
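
In C terms, that check amounts to roughly the following fragment (sketch
only; preempt_count(), test_thread_flag() and the kernel_neon_*() calls
are the real kernel interfaces, the comments mark where the macro
arguments would place the patchup code):

	if (preempt_count() == 1 && test_thread_flag(TIF_NEED_RESCHED)) {
		/* patchup code */
		kernel_neon_end();
		kernel_neon_begin();
		/* post patchup code goes here */
	}
	/* 4444: back into the tight loop */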

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 11/19] arm64: assembler: add macro to conditionally yield the NEON under PREEMPT
  2017-12-06 11:57           ` Ard Biesheuvel
@ 2017-12-06 12:12             ` Dave P Martin
  2017-12-06 12:25               ` Ard Biesheuvel
  0 siblings, 1 reply; 28+ messages in thread
From: Dave P Martin @ 2017-12-06 12:12 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Mark Rutland, Herbert Xu, Peter Zijlstra, Catalin Marinas,
	Sebastian Andrzej Siewior, Will Deacon, Russell King - ARM Linux,
	Steven Rostedt, linux-crypto, Thomas Gleixner, linux-arm-kernel,
	linux-rt-users

On Wed, Dec 06, 2017 at 11:57:12AM +0000, Ard Biesheuvel wrote:
> On 6 December 2017 at 11:51, Dave Martin <Dave.Martin@arm.com> wrote:
> > On Tue, Dec 05, 2017 at 06:04:34PM +0000, Ard Biesheuvel wrote:
> >> On 5 December 2017 at 12:45, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> >> >
> >> >
> >> >> On 5 Dec 2017, at 12:28, Dave Martin <Dave.Martin@arm.com> wrote:
> >> >>
> >> >>> On Mon, Dec 04, 2017 at 12:26:37PM +0000, Ard Biesheuvel wrote:
> >> >>> Add a support macro to conditionally yield the NEON (and thus the CPU)
> >> >>> that may be called from the assembler code. Given that especially the
> >> >>> instruction based accelerated crypto code may use very tight loops, add
> >> >>> some parametrization so that the TIF_NEED_RESCHED flag test is only
> >> >>> executed every so many loop iterations.
> >> >>>
> >> >>> In some cases, yielding the NEON involves saving and restoring a non
> >> >>> trivial amount of context (especially in the CRC folding algorithms),
> >> >>> and so the macro is split into two, and the code in between is only
> >> >>> executed when the yield path is taken, allowing the contex to be preserved.
> >> >>> The second macro takes a label argument that marks the resume-from-yield
> >> >>> path, which should restore the preserved context again.
> >> >>>
> >> >>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> >> >>> ---
> >> >>> arch/arm64/include/asm/assembler.h | 50 ++++++++++++++++++++
> >> >>> 1 file changed, 50 insertions(+)
> >> >>>
> >> >>> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> >> >>> index aef72d886677..917b026d3e00 100644
> >> >>> --- a/arch/arm64/include/asm/assembler.h
> >> >>> +++ b/arch/arm64/include/asm/assembler.h
> >> >>> @@ -512,4 +512,54 @@ alternative_else_nop_endif
> >> >>> #endif
> >> >>>    .endm
> >> >>>
> >> >>> +/*
> >> >>> + * yield_neon - check whether to yield to another runnable task from
> >> >>> + *        kernel mode NEON code (running with preemption disabled)
> >> >>> + *
> >> >>> + * - Check whether the preempt count is exactly 1, in which case disabling
> >> >>> + *   preemption once will make the task preemptible. If this is not the case,
> >> >>> + *   yielding is pointless.
> >> >>> + * - Check whether TIF_NEED_RESCHED is set, and if so, disable and re-enable
> >> >>> + *   kernel mode NEON (which will trigger a reschedule), and branch to the
> >> >>> + *   yield fixup code at @lbl.
> >> >>> + */
> >> >>> +    .macro        yield_neon, lbl:req, ctr, order, stride, loop
> >> >>> +    yield_neon_pre    \ctr, \order, \stride, \loop
> >> >>> +    yield_neon_post    \lbl
> >> >>> +    .endm
> >> >>> +
> >> >>> +    .macro        yield_neon_pre, ctr, order=0, stride, loop=4444f
> >> >>> +#ifdef CONFIG_PREEMPT
> >> >>> +    /*
> >> >>> +     * With some algorithms, it makes little sense to poll the
> >> >>> +     * TIF_NEED_RESCHED flag after every iteration, so only perform
> >> >>> +     * the check every 2^order strides.
> >> >>> +     */
> >> >>> +    .if        \order > 1
> >> >>> +    .if        (\stride & (\stride - 1)) != 0
> >> >>> +    .error        "stride should be a power of 2"
> >> >>> +    .endif
> >> >>> +    tst        \ctr, #((1 << \order) * \stride - 1) & ~(\stride - 1)
> >> >>> +    b.ne        \loop
> >> >>> +    .endif
> >> >>
> >> >> I'm not sure what baking in this check gives us, and this seems to
> >> >> conflate two rather different things: yielding and defining a
> >> >> "heartbeat" frequency for the calling code.
> >> >>
> >> >> Can we separate out the crypto-loop-helper semantics from the yield-
> >> >> neon stuff?
> >> >>
> >> >
> >> > Fair enough. I incorporated the check here so it disappears from the code entirely when !CONFIG_PREEMPT, because otherwise, you end up with a sequence that is mispredicted every # iterations without any benefit.
> >> > I guess i could macroise that separately though.
> >> >
> >> >> If we had
> >> >>
> >> >>    if_cond_yield_neon
> >> >>    // patchup code
> >> >>    endif_yield_neon
> >> >>
> >>
> >> I like this, but ...
> >>
> >> >> then the caller is free to conditionally branch over that as appropriate
> >> >> like
> >> >>
> >> >> loop:
> >> >>    // crypto stuff
> >> >>    tst x0, #0xf
> >> >>    b.ne    loop
> >> >>
> >> >>    if_cond_yield_neon
> >> >>    // patchup code
> >> >>    endif_cond_yield_neon
> >> >>
> >>
> >> I need to put the post patchup code somewhere too. Please refer to the
> >> CRC patches for the best examples of this.
> >
> > I'm not sure I follow.  If you need to do something different after
> > patchup, can't you either pull that code into the if...endif too, or
> > otherwise put a branch before the endif?
> >
>
> No, not really.
>
> What I currently have is
>
> >    if_cond_yield_neon
> >    // patchup code
> >    endif_cond_yield_neon
>
> which is being turned into
>
>     get_thread_info x0
>     ldr w1, [x0, #TSK_TI_PREEMPT]
>     ldr x0, [x0, #TSK_TI_FLAGS]
>     cmp w1, #1 // == PREEMPT_OFFSET
>     csel x0, x0, xzr, eq
>     tbnz x0, #TIF_NEED_RESCHED, 5555f // needs rescheduling?
> 4444:
>
>     .subsection 1
> 5555:
>     // patchup code
>     bl kernel_neon_end
>     bl kernel_neon_begin
>     // post patchup code goes here
>     b 4444b
>     .subsection
>
> so what I basically need is a place to put the post patchup code,
> unless I open code it and branch to a different label right after
> kernel_neon_begin

Ah, I get you.  I had convinced myself that there was only post-
patchup code.

So you conceptually want something like

if_will_cond_yield_neon
        // pre-yield patchup code
do_cond_yield_neon
        // post-yield patchup code
endif_yield_neon

.macro if_will_cond_yield_neon
        // do preempt_count == 1 && TIF_NEED_RESCHED check
        b.wont_yield 3432f
.endm

.macro do_cond_yield_neon
        bl      kernel_neon_end
        bl      kernel_neon_begin
.endm

.macro endif_yield_neon
3432:
.endm

The main difference I see is that this doesn't get the slow-case
code out of line.

Or have I still missed something?

Cheers
---Dave
^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 11/19] arm64: assembler: add macro to conditionally yield the NEON under PREEMPT
  2017-12-06 12:12             ` Dave P Martin
@ 2017-12-06 12:25               ` Ard Biesheuvel
  2017-12-06 14:37                 ` Dave Martin
  0 siblings, 1 reply; 28+ messages in thread
From: Ard Biesheuvel @ 2017-12-06 12:25 UTC (permalink / raw)
  To: Dave P Martin
  Cc: Mark Rutland, Herbert Xu, Peter Zijlstra, Catalin Marinas,
	Sebastian Andrzej Siewior, Will Deacon, Russell King - ARM Linux,
	Steven Rostedt, linux-crypto, Thomas Gleixner, linux-arm-kernel,
	linux-rt-users

On 6 December 2017 at 12:12, Dave P Martin <Dave.Martin@arm.com> wrote:
> On Wed, Dec 06, 2017 at 11:57:12AM +0000, Ard Biesheuvel wrote:
>> On 6 December 2017 at 11:51, Dave Martin <Dave.Martin@arm.com> wrote:
>> > On Tue, Dec 05, 2017 at 06:04:34PM +0000, Ard Biesheuvel wrote:
>> >> On 5 December 2017 at 12:45, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>> >> >
>> >> >
>> >> >> On 5 Dec 2017, at 12:28, Dave Martin <Dave.Martin@arm.com> wrote:
>> >> >>
>> >> >>> On Mon, Dec 04, 2017 at 12:26:37PM +0000, Ard Biesheuvel wrote:
>> >> >>> Add a support macro to conditionally yield the NEON (and thus the CPU)
>> >> >>> that may be called from the assembler code. Given that especially the
>> >> >>> instruction based accelerated crypto code may use very tight loops, add
>> >> >>> some parametrization so that the TIF_NEED_RESCHED flag test is only
>> >> >>> executed every so many loop iterations.
>> >> >>>
>> >> >>> In some cases, yielding the NEON involves saving and restoring a non
>> >> >>> trivial amount of context (especially in the CRC folding algorithms),
>> >> >>> and so the macro is split into two, and the code in between is only
>> >> >>> executed when the yield path is taken, allowing the context to be preserved.
>> >> >>> The second macro takes a label argument that marks the resume-from-yield
>> >> >>> path, which should restore the preserved context again.
>> >> >>>
>> >> >>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> >> >>> ---
>> >> >>> arch/arm64/include/asm/assembler.h | 50 ++++++++++++++++++++
>> >> >>> 1 file changed, 50 insertions(+)
>> >> >>>
>> >> >>> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
>> >> >>> index aef72d886677..917b026d3e00 100644
>> >> >>> --- a/arch/arm64/include/asm/assembler.h
>> >> >>> +++ b/arch/arm64/include/asm/assembler.h
>> >> >>> @@ -512,4 +512,54 @@ alternative_else_nop_endif
>> >> >>> #endif
>> >> >>>    .endm
>> >> >>>
>> >> >>> +/*
>> >> >>> + * yield_neon - check whether to yield to another runnable task from
>> >> >>> + *        kernel mode NEON code (running with preemption disabled)
>> >> >>> + *
>> >> >>> + * - Check whether the preempt count is exactly 1, in which case disabling
>> >> >>> + *   preemption once will make the task preemptible. If this is not the case,
>> >> >>> + *   yielding is pointless.
>> >> >>> + * - Check whether TIF_NEED_RESCHED is set, and if so, disable and re-enable
>> >> >>> + *   kernel mode NEON (which will trigger a reschedule), and branch to the
>> >> >>> + *   yield fixup code at @lbl.
>> >> >>> + */
>> >> >>> +    .macro        yield_neon, lbl:req, ctr, order, stride, loop
>> >> >>> +    yield_neon_pre    \ctr, \order, \stride, \loop
>> >> >>> +    yield_neon_post    \lbl
>> >> >>> +    .endm
>> >> >>> +
>> >> >>> +    .macro        yield_neon_pre, ctr, order=0, stride, loop=4444f
>> >> >>> +#ifdef CONFIG_PREEMPT
>> >> >>> +    /*
>> >> >>> +     * With some algorithms, it makes little sense to poll the
>> >> >>> +     * TIF_NEED_RESCHED flag after every iteration, so only perform
>> >> >>> +     * the check every 2^order strides.
>> >> >>> +     */
>> >> >>> +    .if        \order > 1
>> >> >>> +    .if        (\stride & (\stride - 1)) != 0
>> >> >>> +    .error        "stride should be a power of 2"
>> >> >>> +    .endif
>> >> >>> +    tst        \ctr, #((1 << \order) * \stride - 1) & ~(\stride - 1)
>> >> >>> +    b.ne        \loop
>> >> >>> +    .endif
>> >> >>
>> >> >> I'm not sure what baking in this check gives us, and this seems to
>> >> >> conflate two rather different things: yielding and defining a
>> >> >> "heartbeat" frequency for the calling code.
>> >> >>
>> >> >> Can we separate out the crypto-loop-helper semantics from the yield-
>> >> >> neon stuff?
>> >> >>
>> >> >
>> >> > Fair enough. I incorporated the check here so it disappears from the code entirely when !CONFIG_PREEMPT, because otherwise, you end up with a sequence that is mispredicted every # iterations without any benefit.
>> >> > I guess I could macroise that separately though.
>> >> >
>> >> >> If we had
>> >> >>
>> >> >>    if_cond_yield_neon
>> >> >>    // patchup code
>> >> >>    endif_yield_neon
>> >> >>
>> >>
>> >> I like this, but ...
>> >>
>> >> >> then the caller is free to conditionally branch over that as appropriate
>> >> >> like
>> >> >>
>> >> >> loop:
>> >> >>    // crypto stuff
>> >> >>    tst x0, #0xf
>> >> >>    b.ne    loop
>> >> >>
>> >> >>    if_cond_yield_neon
>> >> >>    // patchup code
>> >> >>    endif_cond_yield_neon
>> >> >>
>> >>
>> >> I need to put the post patchup code somewhere too. Please refer to the
>> >> CRC patches for the best examples of this.
>> >
>> > I'm not sure I follow.  If you need to do something different after
>> > patchup, can't you either pull that code into the if...endif too, or
>> > otherwise put a branch before the endif?
>> >
>>
>> No, not really.
>>
>> What I currently have is
>>
>> >    if_cond_yield_neon
>> >    // patchup code
>> >    endif_cond_yield_neon
>>
>> which is being turned into
>>
>>     get_thread_info x0
>>     ldr w1, [x0, #TSK_TI_PREEMPT]
>>     ldr x0, [x0, #TSK_TI_FLAGS]
>>     cmp w1, #1 // == PREEMPT_OFFSET
>>     csel x0, x0, xzr, eq
>>     tbnz x0, #TIF_NEED_RESCHED, 5555f // needs rescheduling?
>> 4444:
>>
>>     .subsection 1
>> 5555:
>>     // patchup code
>>     bl kernel_neon_end
>>     bl kernel_neon_begin
>>     // post patchup code goes here
>>     b 4444b
>>     .subsection
>>
>> so what I basically need is a place to put the post patchup code,
>> unless I open code it and branch to a different label right after
>> kernel_neon_begin
>
> Ah, I get you.  I had convinced myself that there was only post-
> patchup code.
>
> So you conceptually want something like
>
> if_will_cond_yield_neon
>         // pre-yield patchup code
> do_cond_yield_neon
>         // post-yield patchup code
> endif_yield_neon
>
> .macro if_will_cond_yield_neon
>         // do preempt_count == 1 && TIF_NEED_RESCHED check
>         b.wont_yield 3432f
> .endm
>
> .macro do_cond_yield_neon
>         bl      kernel_neon_end
>         bl      kernel_neon_begin
> .endm
>
> .macro endif_yield_neon
> 3432:
> .endm
>
> The main difference I see is that this doesn't get the slow-case
> code out of line.
>
> Or have I still missed something?
>

No, that seems accurate. I think the idiom looks much better, but I'd
still like to get the code out of line, so that everything disappears
completely for !CONFIG_PREEMPT.
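
Concretely, that could be the three macros above combined with the
subsection trick from the current expansion. A rough sketch only,
reusing the 4444/5555 labels and TSK_TI_* offsets quoted earlier in
the thread; the !CONFIG_PREEMPT leg assumes the unused code can be
emitted into a section the linker already discards (the section name
below is just a placeholder):

	.macro		if_will_cond_yield_neon
#ifdef CONFIG_PREEMPT
	get_thread_info	x0
	ldr		w1, [x0, #TSK_TI_PREEMPT]
	ldr		x0, [x0, #TSK_TI_FLAGS]
	cmp		w1, #1				// == PREEMPT_OFFSET
	csel		x0, x0, xzr, eq
	tbnz		x0, #TIF_NEED_RESCHED, 5555f	// needs rescheduling?
	// not yielding: execution resumes at 4444 below, after endif_yield_neon
	.subsection	1				// slow path goes out of line
5555:
#else
	// without preemption, let the yield code vanish into a discarded section
	.section	".discard.cond_yield_neon", "ax"
#endif
	.endm

	.macro		do_cond_yield_neon
	bl		kernel_neon_end
	bl		kernel_neon_begin
	.endm

	.macro		endif_yield_neon
	b		4444f				// resume the fast path
	.previous
4444:
	.endm

The caller's pre-yield patchup code (between if_will_cond_yield_neon
and do_cond_yield_neon) and post-yield code (before endif_yield_neon)
then land out of line automatically, and with the discarded section
they disappear entirely for !CONFIG_PREEMPT builds.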

I'll try to rework my stuff using this as a starting point.


* Re: [PATCH v2 11/19] arm64: assembler: add macro to conditionally yield the NEON under PREEMPT
  2017-12-06 12:25               ` Ard Biesheuvel
@ 2017-12-06 14:37                 ` Dave Martin
  0 siblings, 0 replies; 28+ messages in thread
From: Dave Martin @ 2017-12-06 14:37 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Mark Rutland, Herbert Xu, Peter Zijlstra, Catalin Marinas,
	Sebastian Andrzej Siewior, Will Deacon, Russell King - ARM Linux,
	Steven Rostedt, linux-crypto, Thomas Gleixner, linux-arm-kernel,
	linux-rt-users

On Wed, Dec 06, 2017 at 12:25:44PM +0000, Ard Biesheuvel wrote:
> On 6 December 2017 at 12:12, Dave P Martin <Dave.Martin@arm.com> wrote:
> > On Wed, Dec 06, 2017 at 11:57:12AM +0000, Ard Biesheuvel wrote:
> >> On 6 December 2017 at 11:51, Dave Martin <Dave.Martin@arm.com> wrote:
> >> > On Tue, Dec 05, 2017 at 06:04:34PM +0000, Ard Biesheuvel wrote:
> >> >> On 5 December 2017 at 12:45, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> >> >> >
> >> >> >
> >> >> >> On 5 Dec 2017, at 12:28, Dave Martin <Dave.Martin@arm.com> wrote:

[...]

> >> >> >> If we had
> >> >> >>
> >> >> >>    if_cond_yield_neon
> >> >> >>    // patchup code
> >> >> >>    endif_yield_neon
> >> >> >>
> >> >>
> >> >> I like this, but ...
> >> >>
> >> >> >> then the caller is free to conditionally branch over that as appropriate
> >> >> >> like
> >> >> >>
> >> >> >> loop:
> >> >> >>    // crypto stuff
> >> >> >>    tst x0, #0xf
> >> >> >>    b.ne    loop
> >> >> >>
> >> >> >>    if_cond_yield_neon
> >> >> >>    // patchup code
> >> >> >>    endif_cond_yield_neon
> >> >> >>
> >> >>
> >> >> I need to put the post patchup code somewhere too. Please refer to the
> >> >> CRC patches for the best examples of this.
> >> >
> >> > I'm not sure I follow.  If you need to do something different after
> >> > patchup, can't you either pull that code into the if...endif too, or
> >> > otherwise put a branch before the endif?
> >> >
> >>
> >> No, not really.
> >>
> >> What I currently have is
> >>
> >> >    if_cond_yield_neon
> >> >    // patchup code
> >> >    endif_cond_yield_neon
> >>
> >> which is being turned into
> >>
> >>     get_thread_info x0
> >>     ldr w1, [x0, #TSK_TI_PREEMPT]
> >>     ldr x0, [x0, #TSK_TI_FLAGS]
> >>     cmp w1, #1 // == PREEMPT_OFFSET
> >>     csel x0, x0, xzr, eq
> >>     tbnz x0, #TIF_NEED_RESCHED, 5555f // needs rescheduling?
> >> 4444:
> >>
> >>     .subsection 1
> >> 5555:
> >>     // patchup code
> >>     bl kernel_neon_end
> >>     bl kernel_neon_begin
> >>     // post patchup code goes here
> >>     b 4444b
> >>     .subsection
> >>
> >> so what I basically need is a place to put the post patchup code,
> >> unless I open code it and branch to a different label right after
> >> kernel_neon_begin
> >
> > Ah, I get you.  I had convinced myself that there was only post-
> > patchup code.
> >
> > So you conceptually want something like
> >
> > if_will_cond_yield_neon
> >         // pre-yield patchup code
> > do_cond_yield_neon
> >         // post-yield patchup code
> > endif_yield_neon
> >
> > .macro if_will_cond_yield_neon
> >         // do preempt_count == 1 && TIF_NEED_RESCHED check
> >         b.wont_yield 3432f
> > .endm
> >
> > .macro do_cond_yield_neon
> >         bl      kernel_neon_end
> >         bl      kernel_neon_begin
> > .endm
> >
> > .macro endif_yield_neon
> > 3432:
> > .endm
> >
> > The main difference I see is that this doesn't get the slow-case
> > code out of line.
> >
> > Or have I still missed something?
> >
> 
> No, that seems accurate. I think the idiom looks much better, but I'd
> still like to get the code out of line, so that everything disappears
> completely for !CONFIG_PREEMPT.

Sure, out-of-lining should still be possible by spilling into another
section or subsection similarly to what you were already doing -- and
it may still make sense to have (optional?) label arguments to allow
some branch elision.  I didn't want to complicate things initially.
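
For instance, endif_yield_neon could take an optional resume label, so
a caller that already has a convenient branch target avoids the extra
hop (again just a sketch, extending the macros discussed above):

	.macro		endif_yield_neon, lbl
	.ifnb		\lbl
	b		\lbl				// caller-supplied resume point
	.else
	b		4444f				// default: resume right after the macro
	.endif
	.previous
4444:
	.endm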

> I'll try to rework my stuff using this as a starting point.

Cool, let me know.

Cheers
---Dave


end of thread

Thread overview: 28+ messages
2017-12-04 12:26 [PATCH v2 00/19] crypto: arm64 - play nice with CONFIG_PREEMPT Ard Biesheuvel
2017-12-04 12:26 ` [PATCH v2 01/19] crypto: testmgr - add a new test case for CRC-T10DIF Ard Biesheuvel
2017-12-04 12:26 ` [PATCH v2 02/19] crypto: arm64/aes-ce-ccm - move kernel mode neon en/disable into loop Ard Biesheuvel
2017-12-04 12:26 ` [PATCH v2 03/19] crypto: arm64/aes-blk " Ard Biesheuvel
2017-12-04 12:26 ` [PATCH v2 04/19] crypto: arm64/aes-bs " Ard Biesheuvel
2017-12-04 12:26 ` [PATCH v2 05/19] crypto: arm64/chacha20 " Ard Biesheuvel
2017-12-04 12:26 ` [PATCH v2 06/19] crypto: arm64/ghash " Ard Biesheuvel
2017-12-04 12:26 ` [PATCH v2 07/19] crypto: arm64/aes-blk - remove configurable interleave Ard Biesheuvel
2017-12-04 12:26 ` [PATCH v2 08/19] crypto: arm64/aes-blk - add 4 way interleave to CBC encrypt path Ard Biesheuvel
2017-12-04 12:26 ` [PATCH v2 09/19] crypto: arm64/aes-blk - add 4 way interleave to CBC-MAC " Ard Biesheuvel
2017-12-04 12:26 ` [PATCH v2 10/19] crypto: arm64/sha256-neon - play nice with CONFIG_PREEMPT kernels Ard Biesheuvel
2017-12-04 12:26 ` [PATCH v2 11/19] arm64: assembler: add macro to conditionally yield the NEON under PREEMPT Ard Biesheuvel
2017-12-05 12:28   ` Dave Martin
2017-12-05 12:45     ` Ard Biesheuvel
2017-12-05 18:04       ` Ard Biesheuvel
2017-12-06 11:51         ` Dave Martin
2017-12-06 11:57           ` Ard Biesheuvel
2017-12-06 12:12             ` Dave P Martin
2017-12-06 12:25               ` Ard Biesheuvel
2017-12-06 14:37                 ` Dave Martin
2017-12-04 12:26 ` [PATCH v2 12/19] crypto: arm64/sha1-ce - yield every 8 blocks of input Ard Biesheuvel
2017-12-04 12:26 ` [PATCH v2 13/19] crypto: arm64/sha2-ce " Ard Biesheuvel
2017-12-04 12:26 ` [PATCH v2 14/19] crypto: arm64/aes-blk - yield after processing a fixed chunk " Ard Biesheuvel
2017-12-04 12:26 ` [PATCH v2 15/19] crypto: arm64/aes-bs - yield after processing each 128 bytes " Ard Biesheuvel
2017-12-04 12:26 ` [PATCH v2 16/19] crypto: arm64/aes-ghash - yield after processing fixed number of blocks Ard Biesheuvel
2017-12-04 12:26 ` [PATCH v2 17/19] crypto: arm64/crc32-ce - yield NEON every 16 blocks of input Ard Biesheuvel
2017-12-04 12:26 ` [PATCH v2 18/19] crypto: arm64/crct10dif-ce - yield NEON every 8 " Ard Biesheuvel
2017-12-04 12:26 ` [PATCH v2 19/19] DO NOT MERGE Ard Biesheuvel
