linux-crypto.vger.kernel.org archive mirror
* [PATCH 0/4] crypto: CRCT10DIF support for ARM and arm64
@ 2016-11-24 15:43 Ard Biesheuvel
  2016-11-24 15:43 ` [PATCH 1/4] crypto: testmgr - avoid overlap in chunked tests Ard Biesheuvel
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Ard Biesheuvel @ 2016-11-24 15:43 UTC (permalink / raw)
  To: linux-crypto, herbert, linux-arm-kernel, catalin.marinas,
	will.deacon, linux
  Cc: steve.capper, dingtianhong, yangshengkai, yuehaibing, hanjun.guo,
	Ard Biesheuvel

First of all, apologies to Yue Haibing for stealing his thunder, to some
extent. But after reviewing (and replying to) his patch, I noticed that his
code is not original: it is simply a transliteration of the existing Intel
code that resides in arch/x86/crypto/crct10dif-pcl-asm_64.S, with the
license and copyright statement removed.

So, if we are going to transliterate code, let's credit the original authors,
even if the resulting code no longer looks like the code we started out with.

Then, I noticed that we could stay *much* closer to the original, and that
there is no need for jump tables or computed gotos at all. So I got a bit
carried away, and ended up reimplementing the whole thing, for both arm64
and ARM.

Patch #1 fixes an issue in testmgr that results in spurious failures in the
chunking tests if the third chunk exceeds 31 bytes.

Patch #2 expands the existing CRCT10DIF test cases, to ensure that all
code paths are actually covered.

Patch #3 is a straight transliteration of the Intel code to arm64.

Patch #4 is a straight transliteration of the Intel code to ARM. This patch
was generated against patch #3 (using --find-copies-harder) so that it is
easy to see how the ARM code deviates from the arm64 code.

NOTE: this code uses the 64x64->128 bit polynomial multiply instruction,
which is only available on cores that implement the v8 Crypto Extensions.
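
For reference, CRC-T10DIF is a plain 16-bit CRC over the polynomial 0x8bb7
(0x18bb7 with the implicit x^16 term), computed MSB first with a zero
initial value and no reflection. A minimal bit-at-a-time sketch in C of the
function the accelerated code has to reproduce (the kernel's
crc_t10dif_generic() is a table-driven equivalent; this is for illustration
only):

#include <stdint.h>
#include <stddef.h>

/* Bit-at-a-time CRC-T10DIF: generator 0x8bb7, MSB first, init 0,
 * no reflection, no final XOR. */
static uint16_t crc_t10dif_ref(uint16_t crc, const uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		crc ^= (uint16_t)buf[i] << 8;
		for (int j = 0; j < 8; j++)
			crc = (crc & 0x8000) ? (crc << 1) ^ 0x8bb7
					     : crc << 1;
	}
	return crc;
}

crc_t10dif_ref(0, "abc", 3) yields 0x443b, matching the first test vector in
patch #2.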

Ard Biesheuvel (4):
  crypto: testmgr - avoid overlap in chunked tests
  crypto: testmgr - add/enhance test cases for CRC-T10DIF
  crypto: arm64/crct10dif - port x86 SSE implementation to arm64
  crypto: arm/crct10dif - port x86 SSE implementation to ARM

 arch/arm/crypto/Kconfig               |   5 +
 arch/arm/crypto/Makefile              |   2 +
 arch/arm/crypto/crct10dif-ce-core.S   | 569 ++++++++++++++++++++
 arch/arm/crypto/crct10dif-ce-glue.c   |  89 +++
 arch/arm64/crypto/Kconfig             |   5 +
 arch/arm64/crypto/Makefile            |   3 +
 arch/arm64/crypto/crct10dif-ce-core.S | 518 ++++++++++++++++++
 arch/arm64/crypto/crct10dif-ce-glue.c |  80 +++
 crypto/testmgr.c                      |   2 +-
 crypto/testmgr.h                      |  70 ++-
 10 files changed, 1314 insertions(+), 29 deletions(-)
 create mode 100644 arch/arm/crypto/crct10dif-ce-core.S
 create mode 100644 arch/arm/crypto/crct10dif-ce-glue.c
 create mode 100644 arch/arm64/crypto/crct10dif-ce-core.S
 create mode 100644 arch/arm64/crypto/crct10dif-ce-glue.c

-- 
2.7.4

* [PATCH 1/4] crypto: testmgr - avoid overlap in chunked tests
  2016-11-24 15:43 [PATCH 0/4] crypto: CRCT10DIF support for ARM and arm64 Ard Biesheuvel
@ 2016-11-24 15:43 ` Ard Biesheuvel
  2016-11-24 15:43 ` [PATCH 2/4] crypto: testmgr - add/enhance test cases for CRC-T10DIF Ard Biesheuvel
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Ard Biesheuvel @ 2016-11-24 15:43 UTC (permalink / raw)
  To: linux-crypto, herbert, linux-arm-kernel, catalin.marinas,
	will.deacon, linux
  Cc: steve.capper, dingtianhong, yangshengkai, yuehaibing, hanjun.guo,
	Ard Biesheuvel

The IDXn offsets are currently chosen such that chunks whose tap values
approach the maximum of 255 may end up overlapping in the xbuf allocation.
In particular, IDX1 (32) and IDX3 (1) are too close together, so update
IDX3 to 511 to avoid this issue.
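
To illustrate: assuming the chunked-test path copies chunk k to
xbuf[IDXk >> PAGE_SHIFT] + offset_in_page(IDXk) (a sketch of the layout, not
the literal testmgr code), the old values put IDX1 and IDX3 in the same page
at offsets 32 and 1, so the IDX3 chunk runs into the IDX1 chunk as soon as
its tap value exceeds 31:

#include <linux/mm.h>
#include <linux/types.h>

/* Overlap check for two chunks of tap_a/tap_b bytes placed at offsets
 * derived from idx_a/idx_b as assumed above.
 *
 *   IDX1 = 32:  page 0, bytes [32, 32 + tap)
 *   IDX3 = 1:   page 0, bytes [ 1,  1 + tap)   -> overlaps once tap > 31
 *   IDX3 = 511: page 0, bytes [511, 511 + tap) -> disjoint for tap <= 255
 */
static bool chunks_overlap(unsigned int idx_a, unsigned int tap_a,
			   unsigned int idx_b, unsigned int tap_b)
{
	unsigned int a = offset_in_page(idx_a);
	unsigned int b = offset_in_page(idx_b);

	if ((idx_a >> PAGE_SHIFT) != (idx_b >> PAGE_SHIFT))
		return false;
	return a < b + tap_b && b < a + tap_a;
}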

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/testmgr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 62dffa0028ac..15650597dcc9 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -62,7 +62,7 @@ int alg_test(const char *driver, const char *alg, u32 type, u32 mask)
  */
 #define IDX1		32
 #define IDX2		32400
-#define IDX3		1
+#define IDX3		511
 #define IDX4		8193
 #define IDX5		22222
 #define IDX6		17101
-- 
2.7.4

* [PATCH 2/4] crypto: testmgr - add/enhance test cases for CRC-T10DIF
  2016-11-24 15:43 [PATCH 0/4] crypto: CRCT10DIF support for ARM and arm64 Ard Biesheuvel
  2016-11-24 15:43 ` [PATCH 1/4] crypto: testmgr - avoid overlap in chunked tests Ard Biesheuvel
@ 2016-11-24 15:43 ` Ard Biesheuvel
  2016-11-24 15:43 ` [PATCH 3/4] crypto: arm64/crct10dif - port x86 SSE implementation to arm64 Ard Biesheuvel
  2016-11-24 15:43 ` [PATCH 4/4] crypto: arm/crct10dif - port x86 SSE implementation to ARM Ard Biesheuvel
  3 siblings, 0 replies; 8+ messages in thread
From: Ard Biesheuvel @ 2016-11-24 15:43 UTC (permalink / raw)
  To: linux-crypto, herbert, linux-arm-kernel, catalin.marinas,
	will.deacon, linux
  Cc: steve.capper, dingtianhong, yangshengkai, yuehaibing, hanjun.guo,
	Ard Biesheuvel

The existing test cases only exercise a small slice of the various
possible code paths through the x86 SSE/PCLMULQDQ implementation and
its upcoming ARM and arm64 ports. So add a test vector that exceeds 256
bytes in size, and convert another to a chunked test.
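
For context: for a chunked vector the .tap entries split .plaintext into
consecutive pieces that reach the driver as separate partial updates, so a
tap list such as { 1, 255, 57, 6 } covers the single-byte, sub-16-byte and
16-byte-reduction paths of the accelerated code, while the new 319-byte
vector covers the main 64-byte folding loop, which only engages at 256
bytes and up. Roughly what a chunked test boils down to, as a simplified
sketch against the shash API (testmgr itself drives this through the ahash
interface and a scatterlist):

#include <crypto/hash.h>
#include <linux/err.h>

/* Feed plaintext to the "crct10dif" shash in tap-sized pieces and compare
 * the result with the expected digest; error handling abbreviated. */
static int crct10dif_chunked_check(const u8 *plaintext,
				   const unsigned int *tap, int np,
				   u16 expected)
{
	struct crypto_shash *tfm = crypto_alloc_shash("crct10dif", 0, 0);
	u16 out = 0;
	int k, err;

	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	{
		SHASH_DESC_ON_STACK(desc, tfm);

		desc->tfm = tfm;
		desc->flags = 0;
		err = crypto_shash_init(desc);
		for (k = 0; !err && k < np; k++) {
			err = crypto_shash_update(desc, plaintext, tap[k]);
			plaintext += tap[k];
		}
		if (!err)
			err = crypto_shash_final(desc, (u8 *)&out);
	}

	crypto_free_shash(tfm);
	if (err)
		return err;
	return out == expected ? 0 : -EINVAL;
}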

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/testmgr.h | 70 ++++++++++++--------
 1 file changed, 42 insertions(+), 28 deletions(-)

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index e64a4ef9d8ca..b7cd41b25a2a 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -1334,36 +1334,50 @@ static struct hash_testvec rmd320_tv_template[] = {
 	}
 };
 
-#define CRCT10DIF_TEST_VECTORS	3
+#define CRCT10DIF_TEST_VECTORS	ARRAY_SIZE(crct10dif_tv_template)
 static struct hash_testvec crct10dif_tv_template[] = {
 	{
-		.plaintext = "abc",
-		.psize  = 3,
-#ifdef __LITTLE_ENDIAN
-		.digest = "\x3b\x44",
-#else
-		.digest = "\x44\x3b",
-#endif
-	}, {
-		.plaintext = "1234567890123456789012345678901234567890"
-			     "123456789012345678901234567890123456789",
-		.psize	= 79,
-#ifdef __LITTLE_ENDIAN
-		.digest	= "\x70\x4b",
-#else
-		.digest	= "\x4b\x70",
-#endif
-	}, {
-		.plaintext =
-		"abcddddddddddddddddddddddddddddddddddddddddddddddddddddd",
-		.psize  = 56,
-#ifdef __LITTLE_ENDIAN
-		.digest = "\xe3\x9c",
-#else
-		.digest = "\x9c\xe3",
-#endif
-		.np     = 2,
-		.tap    = { 28, 28 }
+		.plaintext	= "abc",
+		.psize		= 3,
+		.digest		= (u8 *)(u16 []){ 0x443b },
+	}, {
+		.plaintext 	= "1234567890123456789012345678901234567890"
+				  "123456789012345678901234567890123456789",
+		.psize		= 79,
+		.digest 	= (u8 *)(u16 []){ 0x4b70 },
+		.np		= 2,
+		.tap		= { 63, 16 },
+	}, {
+		.plaintext	= "abcdddddddddddddddddddddddddddddddddddddddd"
+				  "ddddddddddddd",
+		.psize		= 56,
+		.digest		= (u8 *)(u16 []){ 0x9ce3 },
+		.np		= 8,
+		.tap		= { 1, 2, 28, 7, 6, 5, 4, 3 },
+	}, {
+		.plaintext 	= "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "123456789012345678901234567890123456789",
+		.psize		= 319,
+		.digest		= (u8 *)((u16 []){ 0x44c6 }),
+	}, {
+		.plaintext 	= "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "1234567890123456789012345678901234567890"
+				  "123456789012345678901234567890123456789",
+		.psize		= 319,
+		.digest		= (u8 *)((u16 []){ 0x44c6 }),
+		.np		= 4,
+		.tap		= { 1, 255, 57, 6 },
 	}
 };
 
-- 
2.7.4

* [PATCH 3/4] crypto: arm64/crct10dif - port x86 SSE implementation to arm64
  2016-11-24 15:43 [PATCH 0/4] crypto: CRCT10DIF support for ARM and arm64 Ard Biesheuvel
  2016-11-24 15:43 ` [PATCH 1/4] crypto: testmgr - avoid overlap in chunked tests Ard Biesheuvel
  2016-11-24 15:43 ` [PATCH 2/4] crypto: testmgr - add/enhance test cases for CRC-T10DIF Ard Biesheuvel
@ 2016-11-24 15:43 ` Ard Biesheuvel
  2016-11-24 15:43 ` [PATCH 4/4] crypto: arm/crct10dif - port x86 SSE implementation to ARM Ard Biesheuvel
  3 siblings, 0 replies; 8+ messages in thread
From: Ard Biesheuvel @ 2016-11-24 15:43 UTC (permalink / raw)
  To: linux-crypto, herbert, linux-arm-kernel, catalin.marinas,
	will.deacon, linux
  Cc: steve.capper, dingtianhong, yangshengkai, yuehaibing, hanjun.guo,
	Ard Biesheuvel

This is a straight transliteration of the Intel algorithm implemented
using SSE and PCLMULQDQ instructions, which resides in the file
arch/x86/crypto/crct10dif-pcl-asm_64.S.
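
The heart of the algorithm is the folding step described in the white paper
referenced in the source: the 128-bit running remainder is carry-less
multiplied by precomputed constants (powers of x modulo the CRC polynomial,
the rk values at the end of the file) and XORed into the next block of
input, so the buffer shrinks without any per-bit division. A conceptual
sketch in C of a single step, using a hypothetical clmul64() helper standing
in for pmull/pmull2; it is only meant to explain the fold16/fold64 macros
below, not to replace them:

#include <stdint.h>

struct gf2x128 {			/* 128-bit polynomial over GF(2) */
	uint64_t hi, lo;
};

/* Hypothetical helper: 64x64 -> 128-bit carry-less multiplication, i.e.
 * what pmull/pmull2 (arm64), vmull.p64 (ARM) and PCLMULQDQ (x86) provide. */
struct gf2x128 clmul64(uint64_t a, uint64_t b);

/* One 16-byte fold: multiply the low and high halves of the accumulator by
 * the two 64-bit constants and XOR in the next 16 bytes of input.  The
 * constants are chosen so that the CRC of the complete message is left
 * unchanged by the operation. */
static struct gf2x128 fold_step(struct gf2x128 acc, struct gf2x128 next,
				uint64_t k_lo, uint64_t k_hi)
{
	struct gf2x128 p_lo = clmul64(acc.lo, k_lo);
	struct gf2x128 p_hi = clmul64(acc.hi, k_hi);

	acc.lo = p_lo.lo ^ p_hi.lo ^ next.lo;
	acc.hi = p_lo.hi ^ p_hi.hi ^ next.hi;
	return acc;
}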

Suggested-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/Kconfig             |   5 +
 arch/arm64/crypto/Makefile            |   3 +
 arch/arm64/crypto/crct10dif-ce-core.S | 518 ++++++++++++++++++++
 arch/arm64/crypto/crct10dif-ce-glue.c |  80 +++
 4 files changed, 606 insertions(+)

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 2cf32e9887e1..1b50671ffec3 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -23,6 +23,11 @@ config CRYPTO_GHASH_ARM64_CE
 	depends on ARM64 && KERNEL_MODE_NEON
 	select CRYPTO_HASH
 
+config CRYPTO_CRCT10DIF_ARM64_CE
+	tristate "CRCT10DIF digest algorithm using PMULL instructions"
+	depends on ARM64 && KERNEL_MODE_NEON
+	select CRYPTO_HASH
+
 config CRYPTO_AES_ARM64_CE
 	tristate "AES core cipher using ARMv8 Crypto Extensions"
 	depends on ARM64 && KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index abb79b3cfcfe..36fd3eb4201b 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -17,6 +17,9 @@ sha2-ce-y := sha2-ce-glue.o sha2-ce-core.o
 obj-$(CONFIG_CRYPTO_GHASH_ARM64_CE) += ghash-ce.o
 ghash-ce-y := ghash-ce-glue.o ghash-ce-core.o
 
+obj-$(CONFIG_CRYPTO_CRCT10DIF_ARM64_CE) += crct10dif-ce.o
+crct10dif-ce-y := crct10dif-ce-core.o crct10dif-ce-glue.o
+
 obj-$(CONFIG_CRYPTO_AES_ARM64_CE) += aes-ce-cipher.o
 CFLAGS_aes-ce-cipher.o += -march=armv8-a+crypto
 
diff --git a/arch/arm64/crypto/crct10dif-ce-core.S b/arch/arm64/crypto/crct10dif-ce-core.S
new file mode 100644
index 000000000000..9148ebd3470a
--- /dev/null
+++ b/arch/arm64/crypto/crct10dif-ce-core.S
@@ -0,0 +1,518 @@
+//
+// Accelerated CRC-T10DIF using arm64 NEON and Crypto Extensions instructions
+//
+// Copyright (C) 2016 Linaro Ltd <ard.biesheuvel@linaro.org>
+//
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License version 2 as
+// published by the Free Software Foundation.
+//
+
+//
+// Implement fast CRC-T10DIF computation with SSE and PCLMULQDQ instructions
+//
+// Copyright (c) 2013, Intel Corporation
+//
+// Authors:
+//     Erdinc Ozturk <erdinc.ozturk@intel.com>
+//     Vinodh Gopal <vinodh.gopal@intel.com>
+//     James Guilford <james.guilford@intel.com>
+//     Tim Chen <tim.c.chen@linux.intel.com>
+//
+// This software is available to you under a choice of one of two
+// licenses.  You may choose to be licensed under the terms of the GNU
+// General Public License (GPL) Version 2, available from the file
+// COPYING in the main directory of this source tree, or the
+// OpenIB.org BSD license below:
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//
+// * Redistributions of source code must retain the above copyright
+//   notice, this list of conditions and the following disclaimer.
+//
+// * Redistributions in binary form must reproduce the above copyright
+//   notice, this list of conditions and the following disclaimer in the
+//   documentation and/or other materials provided with the
+//   distribution.
+//
+// * Neither the name of the Intel Corporation nor the names of its
+//   contributors may be used to endorse or promote products derived from
+//   this software without specific prior written permission.
+//
+//
+// THIS SOFTWARE IS PROVIDED BY INTEL CORPORATION ""AS IS"" AND ANY
+// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL INTEL CORPORATION OR
+// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+//       Function API:
+//       UINT16 crc_t10dif_pcl(
+//               UINT16 init_crc, //initial CRC value, 16 bits
+//               const unsigned char *buf, //buffer pointer to calculate CRC on
+//               UINT64 len //buffer length in bytes (64-bit data)
+//       );
+//
+//       Reference paper titled "Fast CRC Computation for Generic
+//	Polynomials Using PCLMULQDQ Instruction"
+//       URL: http://www.intel.com/content/dam/www/public/us/en/documents
+//  /white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf
+//
+//
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+	.text
+	.cpu		generic+crypto
+
+	arg1_low32	.req	w0
+	arg2		.req	x1
+	arg3		.req	x2
+
+	vzr		.req	v13
+
+ENTRY(crc_t10dif_pmull)
+	stp		x29, x30, [sp, #-32]!
+	mov		x29, sp
+
+	movi		vzr.16b, #0		// init zero register
+
+	// adjust the 16-bit initial_crc value, scale it to 32 bits
+	lsl		arg1_low32, arg1_low32, #16
+
+	// check if smaller than 256
+	cmp		arg3, #256
+
+	// for sizes less than 128, we can't fold 64B at a time...
+	b.lt		_less_than_128
+
+	// load the initial crc value
+	// crc value does not need to be byte-reflected, but it needs
+	// to be moved to the high part of the register.
+	// because data will be byte-reflected and will align with
+	// initial crc at correct place.
+	movi		v10.16b, #0
+	mov		v10.s[3], arg1_low32		// initial crc
+
+	// receive the initial 64B data, xor the initial crc value
+	ld1		{v0.2d-v3.2d}, [arg2], #0x40
+	ld1		{v4.2d-v7.2d}, [arg2], #0x40
+CPU_LE(	rev64		v0.16b, v0.16b		)
+CPU_LE(	rev64		v1.16b, v1.16b		)
+CPU_LE(	rev64		v2.16b, v2.16b		)
+CPU_LE(	rev64		v3.16b, v3.16b		)
+CPU_LE(	rev64		v4.16b, v4.16b		)
+CPU_LE(	rev64		v5.16b, v5.16b		)
+CPU_LE(	rev64		v6.16b, v6.16b		)
+CPU_LE(	rev64		v7.16b, v7.16b		)
+
+	ext		v0.16b, v0.16b, v0.16b, #8
+	ext		v1.16b, v1.16b, v1.16b, #8
+	ext		v2.16b, v2.16b, v2.16b, #8
+	ext		v3.16b, v3.16b, v3.16b, #8
+	ext		v4.16b, v4.16b, v4.16b, #8
+	ext		v5.16b, v5.16b, v5.16b, #8
+	ext		v6.16b, v6.16b, v6.16b, #8
+	ext		v7.16b, v7.16b, v7.16b, #8
+
+	// XOR the initial_crc value
+	eor		v0.16b, v0.16b, v10.16b
+
+	ldr		q10, rk3	// xmm10 has rk3 and rk4
+					// type of pmull instruction
+					// will determine which constant to use
+
+	//
+	// we subtract 256 instead of 128 to save one instruction from the loop
+	//
+	sub		arg3, arg3, #256
+
+	// at this section of the code, there is 64*x+y (0<=y<64) bytes of
+	// buffer. The _fold_64_B_loop will fold 64B at a time
+	// until we have 64+y Bytes of buffer
+
+
+	// fold 64B at a time. This section of the code folds 4 vector
+	// registers in parallel
+_fold_64_B_loop:
+
+	.macro		fold64, reg1, reg2
+	ld1		{v11.2d-v12.2d}, [arg2], #0x20
+CPU_LE(	rev64		v11.16b, v11.16b	)
+CPU_LE(	rev64		v12.16b, v12.16b	)
+	ext		v11.16b, v11.16b, v11.16b, #8
+	ext		v12.16b, v12.16b, v12.16b, #8
+
+	pmull2		v8.1q, \reg1\().2d, v10.2d
+	pmull		\reg1\().1q, \reg1\().1d, v10.1d
+	pmull2		v9.1q, \reg2\().2d, v10.2d
+	pmull		\reg2\().1q, \reg2\().1d, v10.1d
+
+	eor		\reg1\().16b, \reg1\().16b, v11.16b
+	eor		\reg2\().16b, \reg2\().16b, v12.16b
+	eor		\reg1\().16b, \reg1\().16b, v8.16b
+	eor		\reg2\().16b, \reg2\().16b, v9.16b
+	.endm
+
+	fold64		v0, v1
+	fold64		v2, v3
+	fold64		v4, v5
+	fold64		v6, v7
+
+	subs		arg3, arg3, #128
+
+	// check if there is another 64B in the buffer to be able to fold
+	b.ge		_fold_64_B_loop
+
+	// at this point, the buffer pointer is pointing at the last y Bytes
+	// of the buffer the 64B of folded data is in 4 of the vector
+	// registers: v0, v1, v2, v3
+
+	// fold the 8 vector registers to 1 vector register with different
+	// constants
+
+	.macro		fold16, rk, reg
+	ldr		q10, \rk
+	pmull		v8.1q, \reg\().1d, v10.1d
+	pmull2		\reg\().1q, \reg\().2d, v10.2d
+	eor		v7.16b, v7.16b, v8.16b
+	eor		v7.16b, v7.16b, \reg\().16b
+	.endm
+
+	fold16		rk9, v0
+	fold16		rk11, v1
+	fold16		rk13, v2
+	fold16		rk15, v3
+	fold16		rk17, v4
+	fold16		rk19, v5
+	fold16		rk1, v6
+
+	// instead of 64, we add 48 to the loop counter to save 1 instruction
+	// from the loop instead of a cmp instruction, we use the negative
+	// flag with the jl instruction
+	adds		arg3, arg3, #(128-16)
+	b.lt		_final_reduction_for_128
+
+	// now we have 16+y bytes left to reduce. 16 Bytes is in register v7
+	// and the rest is in memory. We can fold 16 bytes at a time if y>=16
+	// continue folding 16B at a time
+
+_16B_reduction_loop:
+	pmull		v8.1q, v7.1d, v10.1d
+	pmull2		v7.1q, v7.2d, v10.2d
+	eor		v7.16b, v7.16b, v8.16b
+
+	ld1		{v0.2d}, [arg2], #16
+CPU_LE(	rev64		v0.16b, v0.16b		)
+	ext		v0.16b, v0.16b, v0.16b, #8
+	eor		v7.16b, v7.16b, v0.16b
+	subs		arg3, arg3, #16
+
+	// instead of a cmp instruction, we utilize the flags with the
+	// jge instruction equivalent of: cmp arg3, 16-16
+	// check if there is any more 16B in the buffer to be able to fold
+	b.ge		_16B_reduction_loop
+
+	// now we have 16+z bytes left to reduce, where 0<= z < 16.
+	// first, we reduce the data in the xmm7 register
+
+_final_reduction_for_128:
+	// check if any more data to fold. If not, compute the CRC of
+	// the final 128 bits
+	adds		arg3, arg3, #16
+	b.eq		_128_done
+
+	// here we are getting data that is less than 16 bytes.
+	// since we know that there was data before the pointer, we can
+	// offset the input pointer before the actual point, to receive
+	// exactly 16 bytes. after that the registers need to be adjusted.
+_get_last_two_regs:
+	mov		v2.16b, v7.16b
+
+	add		arg2, arg2, arg3
+	sub		arg2, arg2, #16
+	ld1		{v1.2d}, [arg2]
+CPU_LE(	rev64		v1.16b, v1.16b		)
+	ext		v1.16b, v1.16b, v1.16b, #8
+
+	// get rid of the extra data that was loaded before
+	// load the shift constant
+	adr		x4, tbl_shf_table + 16
+	sub		x4, x4, arg3
+	ld1		{v0.16b}, [x4]
+
+	// shift v2 to the left by arg3 bytes
+	tbl		v2.16b, {v2.16b}, v0.16b
+
+	// shift v7 to the right by 16-arg3 bytes
+	movi		v9.16b, #0x80
+	eor		v0.16b, v0.16b, v9.16b
+	tbl		v7.16b, {v7.16b}, v0.16b
+
+	// blend
+	sshr		v0.16b, v0.16b, #7	// convert to 8-bit mask
+	bsl		v0.16b, v2.16b, v1.16b
+
+	// fold 16 Bytes
+	pmull		v8.1q, v7.1d, v10.1d
+	pmull2		v7.1q, v7.2d, v10.2d
+	eor		v7.16b, v7.16b, v8.16b
+	eor		v7.16b, v7.16b, v0.16b
+
+_128_done:
+	// compute crc of a 128-bit value
+	ldr		q10, rk5		// rk5 and rk6 in xmm10
+
+	// 64b fold
+	mov		v0.16b, v7.16b
+	ext		v7.16b, v7.16b, v7.16b, #8
+	pmull		v7.1q, v7.1d, v10.1d
+	ext		v0.16b, vzr.16b, v0.16b, #8
+	eor		v7.16b, v7.16b, v0.16b
+
+	// 32b fold
+	mov		v0.16b, v7.16b
+	mov		v0.s[3], vzr.s[0]
+	ext		v7.16b, v7.16b, vzr.16b, #12
+	ext		v9.16b, v10.16b, v10.16b, #8
+	pmull		v7.1q, v7.1d, v9.1d
+	eor		v7.16b, v7.16b, v0.16b
+
+	// barrett reduction
+_barrett:
+	ldr		q10, rk7
+	mov		v0.16b, v7.16b
+	ext		v7.16b, v7.16b, v7.16b, #8
+
+	pmull		v7.1q, v7.1d, v10.1d
+	ext		v7.16b, vzr.16b, v7.16b, #12
+	pmull2		v7.1q, v7.2d, v10.2d
+	ext		v7.16b, vzr.16b, v7.16b, #12
+	eor		v7.16b, v7.16b, v0.16b
+	mov		w0, v7.s[1]
+
+_cleanup:
+	// scale the result back to 16 bits
+	lsr		x0, x0, #16
+	ldp		x29, x30, [sp], #32
+	ret
+
+	.align		4
+_less_than_128:
+
+	// check if there is enough buffer to be able to fold 16B at a time
+	cmp		arg3, #32
+	b.lt		_less_than_32
+
+	// now if there is, load the constants
+	ldr		q10, rk1		// rk1 and rk2 in xmm10
+
+	movi		v0.16b, #0
+	mov		v0.s[3], arg1_low32	// get the initial crc value
+	ld1		{v7.2d}, [arg2], #0x10
+CPU_LE(	rev64		v7.16b, v7.16b		)
+	ext		v7.16b, v7.16b, v7.16b, #8
+	eor		v7.16b, v7.16b, v0.16b
+
+	// update the counter. subtract 32 instead of 16 to save one
+	// instruction from the loop
+	sub		arg3, arg3, #32
+
+	b		_16B_reduction_loop
+
+	.align		4
+_less_than_32:
+	cbz		arg3, _cleanup
+
+	movi		v0.16b, #0
+	mov		v0.s[3], arg1_low32	// get the initial crc value
+
+	cmp		arg3, #16
+	b.eq		_exact_16_left
+	b.lt		_less_than_16_left
+
+	ld1		{v7.2d}, [arg2], #0x10
+CPU_LE(	rev64		v7.16b, v7.16b		)
+	ext		v7.16b, v7.16b, v7.16b, #8
+	eor		v7.16b, v7.16b, v0.16b
+	sub		arg3, arg3, #16
+	ldr		q10, rk1		// rk1 and rk2 in xmm10
+	b		_get_last_two_regs
+
+	.align		4
+_less_than_16_left:
+	// use stack space to load data less than 16 bytes, zero-out
+	// the 16B in memory first.
+
+	add		x11, sp, #0x10
+	stp		xzr, xzr, [x11]
+
+	cmp		arg3, #4
+	b.lt		_only_less_than_4
+
+	// backup the counter value
+	mov		x9, arg3
+	tbz		arg3, #3, _less_than_8_left
+
+	// load 8 Bytes
+	ldr		x0, [arg2], #8
+	str		x0, [x11], #8
+	sub		arg3, arg3, #8
+
+_less_than_8_left:
+	tbz		arg3, #2, _less_than_4_left
+
+	// load 4 Bytes
+	ldr		w0, [arg2], #4
+	str		w0, [x11], #4
+	sub		arg3, arg3, #4
+
+_less_than_4_left:
+	tbz		arg3, #1, _less_than_2_left
+
+	// load 2 Bytes
+	ldrh		w0, [arg2], #2
+	strh		w0, [x11], #2
+	sub		arg3, arg3, #2
+
+_less_than_2_left:
+	cbz		arg3, _zero_left
+
+	// load 1 Byte
+	ldrb		w0, [arg2]
+	strb		w0, [x11]
+
+_zero_left:
+	add		x11, sp, #0x10
+	ld1		{v7.2d}, [x11]
+CPU_LE(	rev64		v7.16b, v7.16b		)
+	ext		v7.16b, v7.16b, v7.16b, #8
+	eor		v7.16b, v7.16b, v0.16b
+
+	// shl r9, 4
+	adr		x0, tbl_shf_table + 16
+	sub		x0, x0, x9
+	ld1		{v0.16b}, [x0]
+	movi		v9.16b, #0x80
+	eor		v0.16b, v0.16b, v9.16b
+	tbl		v7.16b, {v7.16b}, v0.16b
+
+	b		_128_done
+
+	.align		4
+_exact_16_left:
+	ld1		{v7.2d}, [arg2]
+CPU_LE(	rev64		v7.16b, v7.16b		)
+	ext		v7.16b, v7.16b, v7.16b, #8
+	eor		v7.16b, v7.16b, v0.16b	// xor the initial crc value
+
+	b		_128_done
+
+_only_less_than_4:
+	cmp		arg3, #3
+	b.lt		_only_less_than_3
+
+	// load 3 Bytes
+	ldrh		w0, [arg2]
+	strh		w0, [x11]
+
+	ldrb		w0, [arg2, #2]
+	strb		w0, [x11, #2]
+
+	ld1		{v7.2d}, [x11]
+CPU_LE(	rev64		v7.16b, v7.16b		)
+	ext		v7.16b, v7.16b, v7.16b, #8
+	eor		v7.16b, v7.16b, v0.16b
+
+	ext		v7.16b, v7.16b, vzr.16b, #5
+	b		_barrett
+
+_only_less_than_3:
+	cmp		arg3, #2
+	b.lt		_only_less_than_2
+
+	// load 2 Bytes
+	ldrh		w0, [arg2]
+	strh		w0, [x11]
+
+	ld1		{v7.2d}, [x11]
+CPU_LE(	rev64		v7.16b, v7.16b		)
+	ext		v7.16b, v7.16b, v7.16b, #8
+	eor		v7.16b, v7.16b, v0.16b
+
+	ext		v7.16b, v7.16b, vzr.16b, #6
+	b		_barrett
+
+_only_less_than_2:
+
+	// load 1 Byte
+	ldrb		w0, [arg2]
+	strb		w0, [x11]
+
+	ld1		{v7.2d}, [x11]
+CPU_LE(	rev64		v7.16b, v7.16b		)
+	ext		v7.16b, v7.16b, v7.16b, #8
+	eor		v7.16b, v7.16b, v0.16b
+
+	ext		v7.16b, v7.16b, vzr.16b, #7
+	b		_barrett
+
+ENDPROC(crc_t10dif_pmull)
+
+// precomputed constants
+// these constants are precomputed from the poly:
+// 0x8bb70000 (0x8bb7 scaled to 32 bits)
+	.align		4
+// Q = 0x18BB70000
+// rk1 = 2^(32*3) mod Q << 32
+// rk2 = 2^(32*5) mod Q << 32
+// rk3 = 2^(32*15) mod Q << 32
+// rk4 = 2^(32*17) mod Q << 32
+// rk5 = 2^(32*3) mod Q << 32
+// rk6 = 2^(32*2) mod Q << 32
+// rk7 = floor(2^64/Q)
+// rk8 = Q
+
+rk1:	.octa		0x06df0000000000002d56000000000000
+rk3:	.octa		0x7cf50000000000009d9d000000000000
+rk5:	.octa		0x13680000000000002d56000000000000
+rk7:	.octa		0x000000018bb7000000000001f65a57f8
+rk9:	.octa		0xbfd6000000000000ceae000000000000
+rk11:	.octa		0x713c0000000000001e16000000000000
+rk13:	.octa		0x80a6000000000000f7f9000000000000
+rk15:	.octa		0xe658000000000000044c000000000000
+rk17:	.octa		0xa497000000000000ad18000000000000
+rk19:	.octa		0xe7b50000000000006ee3000000000000
+
+tbl_shf_table:
+// use these values for shift constants for the tbl/tbx instruction
+// different alignments result in values as shown:
+//	DDQ 0x008f8e8d8c8b8a898887868584838281 # shl 15 (16-1) / shr1
+//	DDQ 0x01008f8e8d8c8b8a8988878685848382 # shl 14 (16-3) / shr2
+//	DDQ 0x0201008f8e8d8c8b8a89888786858483 # shl 13 (16-4) / shr3
+//	DDQ 0x030201008f8e8d8c8b8a898887868584 # shl 12 (16-4) / shr4
+//	DDQ 0x04030201008f8e8d8c8b8a8988878685 # shl 11 (16-5) / shr5
+//	DDQ 0x0504030201008f8e8d8c8b8a89888786 # shl 10 (16-6) / shr6
+//	DDQ 0x060504030201008f8e8d8c8b8a898887 # shl 9  (16-7) / shr7
+//	DDQ 0x07060504030201008f8e8d8c8b8a8988 # shl 8  (16-8) / shr8
+//	DDQ 0x0807060504030201008f8e8d8c8b8a89 # shl 7  (16-9) / shr9
+//	DDQ 0x090807060504030201008f8e8d8c8b8a # shl 6  (16-10) / shr10
+//	DDQ 0x0a090807060504030201008f8e8d8c8b # shl 5  (16-11) / shr11
+//	DDQ 0x0b0a090807060504030201008f8e8d8c # shl 4  (16-12) / shr12
+//	DDQ 0x0c0b0a090807060504030201008f8e8d # shl 3  (16-13) / shr13
+//	DDQ 0x0d0c0b0a090807060504030201008f8e # shl 2  (16-14) / shr14
+//	DDQ 0x0e0d0c0b0a090807060504030201008f # shl 1  (16-15) / shr15
+
+	.byte		 0x0, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87
+	.byte		0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f
+	.byte		 0x0,  0x1,  0x2,  0x3,  0x4,  0x5,  0x6,  0x7
+	.byte		 0x8,  0x9,  0xa,  0xb,  0xc,  0xd,  0xe , 0x0
diff --git a/arch/arm64/crypto/crct10dif-ce-glue.c b/arch/arm64/crypto/crct10dif-ce-glue.c
new file mode 100644
index 000000000000..d11f33dae79c
--- /dev/null
+++ b/arch/arm64/crypto/crct10dif-ce-glue.c
@@ -0,0 +1,80 @@
+/*
+ * Accelerated CRC-T10DIF using arm64 NEON and Crypto Extensions instructions
+ *
+ * Copyright (C) 2016 Linaro Ltd <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/cpufeature.h>
+#include <linux/crc-t10dif.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+#include <crypto/internal/hash.h>
+
+#include <asm/neon.h>
+
+asmlinkage u16 crc_t10dif_pmull(u16 init_crc, const u8 buf[], u64 len);
+
+static int crct10dif_init(struct shash_desc *desc)
+{
+	u16 *crc = shash_desc_ctx(desc);
+
+	*crc = 0;
+	return 0;
+}
+
+static int crct10dif_update(struct shash_desc *desc, const u8 *data,
+			 unsigned int length)
+{
+	u16 *crc = shash_desc_ctx(desc);
+
+	kernel_neon_begin_partial(14);
+	*crc = crc_t10dif_pmull(*crc, data, length);
+	kernel_neon_end();
+
+	return 0;
+}
+
+static int crct10dif_final(struct shash_desc *desc, u8 *out)
+{
+	u16 *crc = shash_desc_ctx(desc);
+
+	*(u16 *)out = *crc;
+	return 0;
+}
+
+static struct shash_alg crc_t10dif_alg = {
+	.digestsize		= CRC_T10DIF_DIGEST_SIZE,
+	.init			= crct10dif_init,
+	.update			= crct10dif_update,
+	.final			= crct10dif_final,
+
+	.descsize		= CRC_T10DIF_DIGEST_SIZE,
+	.base.cra_name		= "crct10dif",
+	.base.cra_driver_name	= "crct10dif-arm64-ce",
+	.base.cra_priority	= 200,
+	.base.cra_blocksize	= CRC_T10DIF_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+};
+
+static int __init crc_t10dif_mod_init(void)
+{
+	return crypto_register_shash(&crc_t10dif_alg);
+}
+
+static void __exit crc_t10dif_mod_exit(void)
+{
+	crypto_unregister_shash(&crc_t10dif_alg);
+}
+
+module_cpu_feature_match(PMULL, crc_t10dif_mod_init);
+module_exit(crc_t10dif_mod_exit);
+
+MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
+MODULE_LICENSE("GPL v2");
-- 
2.7.4

* [PATCH 4/4] crypto: arm/crct10dif - port x86 SSE implementation to ARM
  2016-11-24 15:43 [PATCH 0/4] crypto: CRCT10DIF support for ARM and arm64 Ard Biesheuvel
                   ` (2 preceding siblings ...)
  2016-11-24 15:43 ` [PATCH 3/4] crypto: arm64/crct10dif - port x86 SSE implementation to arm64 Ard Biesheuvel
@ 2016-11-24 15:43 ` Ard Biesheuvel
  2016-11-24 17:32   ` Ard Biesheuvel
  3 siblings, 1 reply; 8+ messages in thread
From: Ard Biesheuvel @ 2016-11-24 15:43 UTC (permalink / raw)
  To: linux-crypto, herbert, linux-arm-kernel, catalin.marinas,
	will.deacon, linux
  Cc: steve.capper, Ard Biesheuvel, yuehaibing, hanjun.guo,
	dingtianhong, yangshengkai

This is a straight transliteration of the Intel algorithm implemented
using SSE and PCLMULQDQ instructions, which resides in the file
arch/x86/crypto/crct10dif-pcl-asm_64.S.
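
Since the Kconfig entry below depends on CRC_T10DIF, the accelerated driver
can be sanity-checked against the generic table-driven code at runtime. A
rough sketch (kernel context, error handling abbreviated; allocating the tfm
by its cra_driver_name relies on the module alias added in this patch):

#include <crypto/hash.h>
#include <linux/crc-t10dif.h>
#include <linux/err.h>

/* Digest a buffer through the new driver and compare the result with
 * crc_t10dif_generic(). */
static int crct10dif_arm_ce_check(const u8 *data, unsigned int len)
{
	struct crypto_shash *tfm;
	u16 out = 0;
	int err;

	tfm = crypto_alloc_shash("crct10dif-arm-ce", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	{
		SHASH_DESC_ON_STACK(desc, tfm);

		desc->tfm = tfm;
		desc->flags = 0;
		err = crypto_shash_digest(desc, data, len, (u8 *)&out);
	}

	crypto_free_shash(tfm);
	if (err)
		return err;

	return out == crc_t10dif_generic(0, data, len) ? 0 : -EINVAL;
}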

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm/crypto/Kconfig                        |   5 +
 arch/arm/crypto/Makefile                       |   2 +
 arch/{arm64 => arm}/crypto/crct10dif-ce-core.S | 457 +++++++++++---------
 arch/{arm64 => arm}/crypto/crct10dif-ce-glue.c |  23 +-
 4 files changed, 277 insertions(+), 210 deletions(-)

diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
index 27ed1b1cd1d7..fce801fa52a1 100644
--- a/arch/arm/crypto/Kconfig
+++ b/arch/arm/crypto/Kconfig
@@ -120,4 +120,9 @@ config CRYPTO_GHASH_ARM_CE
 	  that uses the 64x64 to 128 bit polynomial multiplication (vmull.p64)
 	  that is part of the ARMv8 Crypto Extensions
 
+config CRYPTO_CRCT10DIF_ARM_CE
+	tristate "CRCT10DIF digest algorithm using PMULL instructions"
+	depends on KERNEL_MODE_NEON && CRC_T10DIF
+	select CRYPTO_HASH
+
 endif
diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index fc5150702b64..fc77265014b7 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -13,6 +13,7 @@ ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
 ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
 ce-obj-$(CONFIG_CRYPTO_SHA2_ARM_CE) += sha2-arm-ce.o
 ce-obj-$(CONFIG_CRYPTO_GHASH_ARM_CE) += ghash-arm-ce.o
+ce-obj-$(CONFIG_CRYPTO_CRCT10DIF_ARM_CE) += crct10dif-arm-ce.o
 
 ifneq ($(ce-obj-y)$(ce-obj-m),)
 ifeq ($(call as-instr,.fpu crypto-neon-fp-armv8,y,n),y)
@@ -36,6 +37,7 @@ sha1-arm-ce-y	:= sha1-ce-core.o sha1-ce-glue.o
 sha2-arm-ce-y	:= sha2-ce-core.o sha2-ce-glue.o
 aes-arm-ce-y	:= aes-ce-core.o aes-ce-glue.o
 ghash-arm-ce-y	:= ghash-ce-core.o ghash-ce-glue.o
+crct10dif-arm-ce-y	:= crct10dif-ce-core.o crct10dif-ce-glue.o
 
 quiet_cmd_perl = PERL    $@
       cmd_perl = $(PERL) $(<) > $(@)
diff --git a/arch/arm64/crypto/crct10dif-ce-core.S b/arch/arm/crypto/crct10dif-ce-core.S
similarity index 60%
copy from arch/arm64/crypto/crct10dif-ce-core.S
copy to arch/arm/crypto/crct10dif-ce-core.S
index 9148ebd3470a..30168b0f8581 100644
--- a/arch/arm64/crypto/crct10dif-ce-core.S
+++ b/arch/arm/crypto/crct10dif-ce-core.S
@@ -1,5 +1,5 @@
 //
-// Accelerated CRC-T10DIF using arm64 NEON and Crypto Extensions instructions
+// Accelerated CRC-T10DIF using ARM NEON and Crypto Extensions instructions
 //
 // Copyright (C) 2016 Linaro Ltd <ard.biesheuvel@linaro.org>
 //
@@ -71,20 +71,43 @@
 #include <linux/linkage.h>
 #include <asm/assembler.h>
 
-	.text
-	.cpu		generic+crypto
-
-	arg1_low32	.req	w0
-	arg2		.req	x1
-	arg3		.req	x2
+#ifdef CONFIG_CPU_ENDIAN_BE8
+#define CPU_LE(code...)
+#else
+#define CPU_LE(code...)		code
+#endif
 
-	vzr		.req	v13
+	.text
+	.fpu		crypto-neon-fp-armv8
+
+	arg1_low32	.req	r0
+	arg2		.req	r1
+	arg3		.req	r2
+
+	qzr		.req	q13
+
+	q0l		.req	d0
+	q0h		.req	d1
+	q1l		.req	d2
+	q1h		.req	d3
+	q2l		.req	d4
+	q2h		.req	d5
+	q3l		.req	d6
+	q3h		.req	d7
+	q4l		.req	d8
+	q4h		.req	d9
+	q5l		.req	d10
+	q5h		.req	d11
+	q6l		.req	d12
+	q6h		.req	d13
+	q7l		.req	d14
+	q7h		.req	d15
 
 ENTRY(crc_t10dif_pmull)
-	stp		x29, x30, [sp, #-32]!
-	mov		x29, sp
+	push		{r4, lr}
+	sub		sp, sp, #0x10
 
-	movi		vzr.16b, #0		// init zero register
+	vmov.i8		qzr, #0			// init zero register
 
 	// adjust the 16-bit initial_crc value, scale it to 32 bits
 	lsl		arg1_low32, arg1_low32, #16
@@ -93,41 +116,44 @@ ENTRY(crc_t10dif_pmull)
 	cmp		arg3, #256
 
 	// for sizes less than 128, we can't fold 64B at a time...
-	b.lt		_less_than_128
+	blt		_less_than_128
 
 	// load the initial crc value
 	// crc value does not need to be byte-reflected, but it needs
 	// to be moved to the high part of the register.
 	// because data will be byte-reflected and will align with
 	// initial crc at correct place.
-	movi		v10.16b, #0
-	mov		v10.s[3], arg1_low32		// initial crc
+	vmov		s0, arg1_low32		// initial crc
+	vext.8		q10, qzr, q0, #4
 
 	// receive the initial 64B data, xor the initial crc value
-	ld1		{v0.2d-v3.2d}, [arg2], #0x40
-	ld1		{v4.2d-v7.2d}, [arg2], #0x40
-CPU_LE(	rev64		v0.16b, v0.16b		)
-CPU_LE(	rev64		v1.16b, v1.16b		)
-CPU_LE(	rev64		v2.16b, v2.16b		)
-CPU_LE(	rev64		v3.16b, v3.16b		)
-CPU_LE(	rev64		v4.16b, v4.16b		)
-CPU_LE(	rev64		v5.16b, v5.16b		)
-CPU_LE(	rev64		v6.16b, v6.16b		)
-CPU_LE(	rev64		v7.16b, v7.16b		)
-
-	ext		v0.16b, v0.16b, v0.16b, #8
-	ext		v1.16b, v1.16b, v1.16b, #8
-	ext		v2.16b, v2.16b, v2.16b, #8
-	ext		v3.16b, v3.16b, v3.16b, #8
-	ext		v4.16b, v4.16b, v4.16b, #8
-	ext		v5.16b, v5.16b, v5.16b, #8
-	ext		v6.16b, v6.16b, v6.16b, #8
-	ext		v7.16b, v7.16b, v7.16b, #8
+	vld1.64		{q0-q1}, [arg2]!
+	vld1.64		{q2-q3}, [arg2]!
+	vld1.64		{q4-q5}, [arg2]!
+	vld1.64		{q6-q7}, [arg2]!
+CPU_LE(	vrev64.8	q0, q0			)
+CPU_LE(	vrev64.8	q1, q1			)
+CPU_LE(	vrev64.8	q2, q2			)
+CPU_LE(	vrev64.8	q3, q3			)
+CPU_LE(	vrev64.8	q4, q4			)
+CPU_LE(	vrev64.8	q5, q5			)
+CPU_LE(	vrev64.8	q6, q6			)
+CPU_LE(	vrev64.8	q7, q7			)
+
+	vext.8		q0, q0, q0, #8
+	vext.8		q1, q1, q1, #8
+	vext.8		q2, q2, q2, #8
+	vext.8		q3, q3, q3, #8
+	vext.8		q4, q4, q4, #8
+	vext.8		q5, q5, q5, #8
+	vext.8		q6, q6, q6, #8
+	vext.8		q7, q7, q7, #8
 
 	// XOR the initial_crc value
-	eor		v0.16b, v0.16b, v10.16b
+	veor.8		q0, q0, q10
 
-	ldr		q10, rk3	// xmm10 has rk3 and rk4
+	adrl		ip, rk3
+	vld1.64		{q10}, [ip]	// xmm10 has rk3 and rk4
 					// type of pmull instruction
 					// will determine which constant to use
 
@@ -146,32 +172,32 @@ CPU_LE(	rev64		v7.16b, v7.16b		)
 _fold_64_B_loop:
 
 	.macro		fold64, reg1, reg2
-	ld1		{v11.2d-v12.2d}, [arg2], #0x20
-CPU_LE(	rev64		v11.16b, v11.16b	)
-CPU_LE(	rev64		v12.16b, v12.16b	)
-	ext		v11.16b, v11.16b, v11.16b, #8
-	ext		v12.16b, v12.16b, v12.16b, #8
-
-	pmull2		v8.1q, \reg1\().2d, v10.2d
-	pmull		\reg1\().1q, \reg1\().1d, v10.1d
-	pmull2		v9.1q, \reg2\().2d, v10.2d
-	pmull		\reg2\().1q, \reg2\().1d, v10.1d
-
-	eor		\reg1\().16b, \reg1\().16b, v11.16b
-	eor		\reg2\().16b, \reg2\().16b, v12.16b
-	eor		\reg1\().16b, \reg1\().16b, v8.16b
-	eor		\reg2\().16b, \reg2\().16b, v9.16b
+	vld1.64		{q11-q12}, [arg2]!
+CPU_LE(	vrev64.8	q11, q11		)
+CPU_LE(	vrev64.8	q12, q12		)
+	vext.8		q11, q11, q11, #8
+	vext.8		q12, q12, q12, #8
+
+	vmull.p64	q8, \reg1\()h, d21
+	vmull.p64	\reg1\(), \reg1\()l, d20
+	vmull.p64	q9, \reg2\()h, d21
+	vmull.p64	\reg2\(), \reg2\()l, d20
+
+	veor.8		\reg1, \reg1, q11
+	veor.8		\reg2, \reg2, q12
+	veor.8		\reg1, \reg1, q8
+	veor.8		\reg2, \reg2, q9
 	.endm
 
-	fold64		v0, v1
-	fold64		v2, v3
-	fold64		v4, v5
-	fold64		v6, v7
+	fold64		q0, q1
+	fold64		q2, q3
+	fold64		q4, q5
+	fold64		q6, q7
 
 	subs		arg3, arg3, #128
 
 	// check if there is another 64B in the buffer to be able to fold
-	b.ge		_fold_64_B_loop
+	bge		_fold_64_B_loop
 
 	// at this point, the buffer pointer is pointing at the last y Bytes
 	// of the buffer the 64B of folded data is in 4 of the vector
@@ -181,46 +207,47 @@ CPU_LE(	rev64		v12.16b, v12.16b	)
 	// constants
 
 	.macro		fold16, rk, reg
-	ldr		q10, \rk
-	pmull		v8.1q, \reg\().1d, v10.1d
-	pmull2		\reg\().1q, \reg\().2d, v10.2d
-	eor		v7.16b, v7.16b, v8.16b
-	eor		v7.16b, v7.16b, \reg\().16b
+	vldr		d20, \rk
+	vldr		d21, \rk + 8
+	vmull.p64	q8, \reg\()l, d20
+	vmull.p64	\reg\(), \reg\()h, d21
+	veor.8		q7, q7, q8
+	veor.8		q7, q7, \reg
 	.endm
 
-	fold16		rk9, v0
-	fold16		rk11, v1
-	fold16		rk13, v2
-	fold16		rk15, v3
-	fold16		rk17, v4
-	fold16		rk19, v5
-	fold16		rk1, v6
+	fold16		rk9, q0
+	fold16		rk11, q1
+	fold16		rk13, q2
+	fold16		rk15, q3
+	fold16		rk17, q4
+	fold16		rk19, q5
+	fold16		rk1, q6
 
 	// instead of 64, we add 48 to the loop counter to save 1 instruction
 	// from the loop instead of a cmp instruction, we use the negative
 	// flag with the jl instruction
 	adds		arg3, arg3, #(128-16)
-	b.lt		_final_reduction_for_128
+	blt		_final_reduction_for_128
 
 	// now we have 16+y bytes left to reduce. 16 Bytes is in register v7
 	// and the rest is in memory. We can fold 16 bytes at a time if y>=16
 	// continue folding 16B at a time
 
 _16B_reduction_loop:
-	pmull		v8.1q, v7.1d, v10.1d
-	pmull2		v7.1q, v7.2d, v10.2d
-	eor		v7.16b, v7.16b, v8.16b
-
-	ld1		{v0.2d}, [arg2], #16
-CPU_LE(	rev64		v0.16b, v0.16b		)
-	ext		v0.16b, v0.16b, v0.16b, #8
-	eor		v7.16b, v7.16b, v0.16b
+	vmull.p64	q8, d14, d20
+	vmull.p64	q7, d15, d21
+	veor.8		q7, q7, q8
+
+	vld1.64		{q0}, [arg2]!
+CPU_LE(	vrev64.8	q0, q0		)
+	vext.8		q0, q0, q0, #8
+	veor.8		q7, q7, q0
 	subs		arg3, arg3, #16
 
 	// instead of a cmp instruction, we utilize the flags with the
 	// jge instruction equivalent of: cmp arg3, 16-16
 	// check if there is any more 16B in the buffer to be able to fold
-	b.ge		_16B_reduction_loop
+	bge		_16B_reduction_loop
 
 	// now we have 16+z bytes left to reduce, where 0<= z < 16.
 	// first, we reduce the data in the xmm7 register
@@ -229,99 +256,104 @@ _final_reduction_for_128:
 	// check if any more data to fold. If not, compute the CRC of
 	// the final 128 bits
 	adds		arg3, arg3, #16
-	b.eq		_128_done
+	beq		_128_done
 
 	// here we are getting data that is less than 16 bytes.
 	// since we know that there was data before the pointer, we can
 	// offset the input pointer before the actual point, to receive
 	// exactly 16 bytes. after that the registers need to be adjusted.
 _get_last_two_regs:
-	mov		v2.16b, v7.16b
+	vmov		q2, q7
 
 	add		arg2, arg2, arg3
 	sub		arg2, arg2, #16
-	ld1		{v1.2d}, [arg2]
-CPU_LE(	rev64		v1.16b, v1.16b		)
-	ext		v1.16b, v1.16b, v1.16b, #8
+	vld1.64		{q1}, [arg2]
+CPU_LE(	vrev64.8	q1, q1			)
+	vext.8		q1, q1, q1, #8
 
 	// get rid of the extra data that was loaded before
 	// load the shift constant
-	adr		x4, tbl_shf_table + 16
-	sub		x4, x4, arg3
-	ld1		{v0.16b}, [x4]
+	adr		lr, tbl_shf_table + 16
+	sub		lr, lr, arg3
+	vld1.8		{q0}, [lr]
 
 	// shift v2 to the left by arg3 bytes
-	tbl		v2.16b, {v2.16b}, v0.16b
+	vmov		q9, q2
+	vtbl.8		d4, {d18-d19}, d0
+	vtbl.8		d5, {d18-d19}, d1
 
 	// shift v7 to the right by 16-arg3 bytes
-	movi		v9.16b, #0x80
-	eor		v0.16b, v0.16b, v9.16b
-	tbl		v7.16b, {v7.16b}, v0.16b
+	vmov.i8		q9, #0x80
+	veor.8		q0, q0, q9
+	vmov		q9, q7
+	vtbl.8		d14, {d18-d19}, d0
+	vtbl.8		d15, {d18-d19}, d1
 
 	// blend
-	sshr		v0.16b, v0.16b, #7	// convert to 8-bit mask
-	bsl		v0.16b, v2.16b, v1.16b
+	vshr.s8		q0, q0, #7		// convert to 8-bit mask
+	vbsl.8		q0, q2, q1
 
 	// fold 16 Bytes
-	pmull		v8.1q, v7.1d, v10.1d
-	pmull2		v7.1q, v7.2d, v10.2d
-	eor		v7.16b, v7.16b, v8.16b
-	eor		v7.16b, v7.16b, v0.16b
+	vmull.p64	q8, d14, d20
+	vmull.p64	q7, d15, d21
+	veor.8		q7, q7, q8
+	veor.8		q7, q7, q0
 
 _128_done:
 	// compute crc of a 128-bit value
-	ldr		q10, rk5		// rk5 and rk6 in xmm10
+	vldr		d20, rk5
+	vldr		d21, rk6		// rk5 and rk6 in xmm10
 
 	// 64b fold
-	mov		v0.16b, v7.16b
-	ext		v7.16b, v7.16b, v7.16b, #8
-	pmull		v7.1q, v7.1d, v10.1d
-	ext		v0.16b, vzr.16b, v0.16b, #8
-	eor		v7.16b, v7.16b, v0.16b
+	vmov		q0, q7
+	vmull.p64	q7, d15, d20
+	vext.8		q0, qzr, q0, #8
+	veor.8		q7, q7, q0
 
 	// 32b fold
-	mov		v0.16b, v7.16b
-	mov		v0.s[3], vzr.s[0]
-	ext		v7.16b, v7.16b, vzr.16b, #12
-	ext		v9.16b, v10.16b, v10.16b, #8
-	pmull		v7.1q, v7.1d, v9.1d
-	eor		v7.16b, v7.16b, v0.16b
+	veor.8		d1, d1, d1
+	vmov		d0, d14
+	vmov		s2, s30
+	vext.8		q7, q7, qzr, #12
+	vmull.p64	q7, d14, d21
+	veor.8		q7, q7, q0
 
 	// barrett reduction
 _barrett:
-	ldr		q10, rk7
-	mov		v0.16b, v7.16b
-	ext		v7.16b, v7.16b, v7.16b, #8
+	vldr		d20, rk7
+	vldr		d21, rk8
+	vmov.8		q0, q7
 
-	pmull		v7.1q, v7.1d, v10.1d
-	ext		v7.16b, vzr.16b, v7.16b, #12
-	pmull2		v7.1q, v7.2d, v10.2d
-	ext		v7.16b, vzr.16b, v7.16b, #12
-	eor		v7.16b, v7.16b, v0.16b
-	mov		w0, v7.s[1]
+	vmull.p64	q7, d15, d20
+	vext.8		q7, qzr, q7, #12
+	vmull.p64	q7, d15, d21
+	vext.8		q7, qzr, q7, #12
+	veor.8		q7, q7, q0
+	vmov		r0, s29
 
 _cleanup:
 	// scale the result back to 16 bits
-	lsr		x0, x0, #16
-	ldp		x29, x30, [sp], #32
-	ret
+	lsr		r0, r0, #16
+	add		sp, sp, #0x10
+	pop		{r4, pc}
 
 	.align		4
 _less_than_128:
 
 	// check if there is enough buffer to be able to fold 16B at a time
 	cmp		arg3, #32
-	b.lt		_less_than_32
+	blt		_less_than_32
 
 	// now if there is, load the constants
-	ldr		q10, rk1		// rk1 and rk2 in xmm10
+	vldr		d20, rk1
+	vldr		d21, rk2		// rk1 and rk2 in xmm10
 
-	movi		v0.16b, #0
-	mov		v0.s[3], arg1_low32	// get the initial crc value
-	ld1		{v7.2d}, [arg2], #0x10
-CPU_LE(	rev64		v7.16b, v7.16b		)
-	ext		v7.16b, v7.16b, v7.16b, #8
-	eor		v7.16b, v7.16b, v0.16b
+	vmov.i8		q0, #0
+	vmov		s3, arg1_low32		// get the initial crc value
+	vld1.64		{q7}, [arg2]!
+CPU_LE(	vrev64.8	q7, q7		)
+	vext.8		q7, q7, q7, #8
+	veor.8		q7, q7, q0
 
 	// update the counter. subtract 32 instead of 16 to save one
 	// instruction from the loop
@@ -331,21 +363,23 @@ CPU_LE(	rev64		v7.16b, v7.16b		)
 
 	.align		4
 _less_than_32:
-	cbz		arg3, _cleanup
+	teq		arg3, #0
+	beq		_cleanup
 
-	movi		v0.16b, #0
-	mov		v0.s[3], arg1_low32	// get the initial crc value
+	vmov.i8		q0, #0
+	vmov		s3, arg1_low32		// get the initial crc value
 
 	cmp		arg3, #16
-	b.eq		_exact_16_left
-	b.lt		_less_than_16_left
+	beq		_exact_16_left
+	blt		_less_than_16_left
 
-	ld1		{v7.2d}, [arg2], #0x10
-CPU_LE(	rev64		v7.16b, v7.16b		)
-	ext		v7.16b, v7.16b, v7.16b, #8
-	eor		v7.16b, v7.16b, v0.16b
+	vld1.64		{q7}, [arg2]!
+CPU_LE(	vrev64.8	q7, q7		)
+	vext.8		q7, q7, q7, #8
+	veor.8		q7, q7, q0
 	sub		arg3, arg3, #16
-	ldr		q10, rk1		// rk1 and rk2 in xmm10
+	vldr		d20, rk1
+	vldr		d21, rk2		// rk1 and rk2 in xmm10
 	b		_get_last_two_regs
 
 	.align		4
@@ -353,117 +387,124 @@ _less_than_16_left:
 	// use stack space to load data less than 16 bytes, zero-out
 	// the 16B in memory first.
 
-	add		x11, sp, #0x10
-	stp		xzr, xzr, [x11]
+	vst1.8		{qzr}, [sp]
+	mov		ip, sp
 
 	cmp		arg3, #4
-	b.lt		_only_less_than_4
+	blt		_only_less_than_4
 
 	// backup the counter value
-	mov		x9, arg3
-	tbz		arg3, #3, _less_than_8_left
+	mov		lr, arg3
+	cmp		arg3, #8
+	blt		_less_than_8_left
 
 	// load 8 Bytes
-	ldr		x0, [arg2], #8
-	str		x0, [x11], #8
+	ldr		r0, [arg2], #4
+	ldr		r3, [arg2], #4
+	str		r0, [ip], #4
+	str		r3, [ip], #4
 	sub		arg3, arg3, #8
 
 _less_than_8_left:
-	tbz		arg3, #2, _less_than_4_left
+	cmp		arg3, #4
+	blt		_less_than_4_left
 
 	// load 4 Bytes
-	ldr		w0, [arg2], #4
-	str		w0, [x11], #4
+	ldr		r0, [arg2], #4
+	str		r0, [ip], #4
 	sub		arg3, arg3, #4
 
 _less_than_4_left:
-	tbz		arg3, #1, _less_than_2_left
+	cmp		arg3, #2
+	blt		_less_than_2_left
 
 	// load 2 Bytes
-	ldrh		w0, [arg2], #2
-	strh		w0, [x11], #2
+	ldrh		r0, [arg2], #2
+	strh		r0, [ip], #2
 	sub		arg3, arg3, #2
 
 _less_than_2_left:
-	cbz		arg3, _zero_left
+	cmp		arg3, #1
+	blt		_zero_left
 
 	// load 1 Byte
-	ldrb		w0, [arg2]
-	strb		w0, [x11]
+	ldrb		r0, [arg2]
+	strb		r0, [ip]
 
 _zero_left:
-	add		x11, sp, #0x10
-	ld1		{v7.2d}, [x11]
-CPU_LE(	rev64		v7.16b, v7.16b		)
-	ext		v7.16b, v7.16b, v7.16b, #8
-	eor		v7.16b, v7.16b, v0.16b
+	vld1.64		{q7}, [sp]
+CPU_LE(	vrev64.8	q7, q7		)
+	vext.8		q7, q7, q7, #8
+	veor.8		q7, q7, q0
 
 	// shl r9, 4
-	adr		x0, tbl_shf_table + 16
-	sub		x0, x0, x9
-	ld1		{v0.16b}, [x0]
-	movi		v9.16b, #0x80
-	eor		v0.16b, v0.16b, v9.16b
-	tbl		v7.16b, {v7.16b}, v0.16b
+	adr		ip, tbl_shf_table + 16
+	sub		ip, ip, lr
+	vld1.8		{q0}, [ip]
+	vmov.i8		q9, #0x80
+	veor.8		q0, q0, q9
+	vmov		q9, q7
+	vtbl.8		d14, {d18-d19}, d0
+	vtbl.8		d15, {d18-d19}, d1
 
 	b		_128_done
 
 	.align		4
 _exact_16_left:
-	ld1		{v7.2d}, [arg2]
-CPU_LE(	rev64		v7.16b, v7.16b		)
-	ext		v7.16b, v7.16b, v7.16b, #8
-	eor		v7.16b, v7.16b, v0.16b	// xor the initial crc value
+	vld1.64		{q7}, [arg2]
+CPU_LE(	vrev64.8	q7, q7			)
+	vext.8		q7, q7, q7, #8
+	veor.8		q7, q7, q0		// xor the initial crc value
 
 	b		_128_done
 
 _only_less_than_4:
 	cmp		arg3, #3
-	b.lt		_only_less_than_3
+	blt		_only_less_than_3
 
 	// load 3 Bytes
-	ldrh		w0, [arg2]
-	strh		w0, [x11]
+	ldrh		r0, [arg2]
+	strh		r0, [ip]
 
-	ldrb		w0, [arg2, #2]
-	strb		w0, [x11, #2]
+	ldrb		r0, [arg2, #2]
+	strb		r0, [ip, #2]
 
-	ld1		{v7.2d}, [x11]
-CPU_LE(	rev64		v7.16b, v7.16b		)
-	ext		v7.16b, v7.16b, v7.16b, #8
-	eor		v7.16b, v7.16b, v0.16b
+	vld1.64		{q7}, [ip]
+CPU_LE(	vrev64.8	q7, q7			)
+	vext.8		q7, q7, q7, #8
+	veor.8		q7, q7, q0
 
-	ext		v7.16b, v7.16b, vzr.16b, #5
+	vext.8		q7, q7, qzr, #5
 	b		_barrett
 
 _only_less_than_3:
 	cmp		arg3, #2
-	b.lt		_only_less_than_2
+	blt		_only_less_than_2
 
 	// load 2 Bytes
-	ldrh		w0, [arg2]
-	strh		w0, [x11]
+	ldrh		r0, [arg2]
+	strh		r0, [ip]
 
-	ld1		{v7.2d}, [x11]
-CPU_LE(	rev64		v7.16b, v7.16b		)
-	ext		v7.16b, v7.16b, v7.16b, #8
-	eor		v7.16b, v7.16b, v0.16b
+	vld1.64		{q7}, [ip]
+CPU_LE(	vrev64.8	q7, q7			)
+	vext.8		q7, q7, q7, #8
+	veor.8		q7, q7, q0
 
-	ext		v7.16b, v7.16b, vzr.16b, #6
+	vext.8		q7, q7, qzr, #6
 	b		_barrett
 
 _only_less_than_2:
 
 	// load 1 Byte
-	ldrb		w0, [arg2]
-	strb		w0, [x11]
+	ldrb		r0, [arg2]
+	strb		r0, [ip]
 
-	ld1		{v7.2d}, [x11]
-CPU_LE(	rev64		v7.16b, v7.16b		)
-	ext		v7.16b, v7.16b, v7.16b, #8
-	eor		v7.16b, v7.16b, v0.16b
+	vld1.64		{q7}, [ip]
+CPU_LE(	vrev64.8	q7, q7			)
+	vext.8		q7, q7, q7, #8
+	veor.8		q7, q7, q0
 
-	ext		v7.16b, v7.16b, vzr.16b, #7
+	vext.8		q7, q7, qzr, #7
 	b		_barrett
 
 ENDPROC(crc_t10dif_pmull)
@@ -482,16 +523,26 @@ ENDPROC(crc_t10dif_pmull)
 // rk7 = floor(2^64/Q)
 // rk8 = Q
 
-rk1:	.octa		0x06df0000000000002d56000000000000
-rk3:	.octa		0x7cf50000000000009d9d000000000000
-rk5:	.octa		0x13680000000000002d56000000000000
-rk7:	.octa		0x000000018bb7000000000001f65a57f8
-rk9:	.octa		0xbfd6000000000000ceae000000000000
-rk11:	.octa		0x713c0000000000001e16000000000000
-rk13:	.octa		0x80a6000000000000f7f9000000000000
-rk15:	.octa		0xe658000000000000044c000000000000
-rk17:	.octa		0xa497000000000000ad18000000000000
-rk19:	.octa		0xe7b50000000000006ee3000000000000
+rk1:	.quad		0x2d56000000000000
+rk2:	.quad		0x06df000000000000
+rk3:	.quad		0x9d9d000000000000
+rk4:	.quad		0x7cf5000000000000
+rk5:	.quad		0x2d56000000000000
+rk6:	.quad		0x1368000000000000
+rk7:	.quad		0x00000001f65a57f8
+rk8:	.quad		0x000000018bb70000
+rk9:	.quad		0xceae000000000000
+rk10:	.quad		0xbfd6000000000000
+rk11:	.quad		0x1e16000000000000
+rk12:	.quad		0x713c000000000000
+rk13:	.quad		0xf7f9000000000000
+rk14:	.quad		0x80a6000000000000
+rk15:	.quad		0x044c000000000000
+rk16:	.quad		0xe658000000000000
+rk17:	.quad		0xad18000000000000
+rk18:	.quad		0xa497000000000000
+rk19:	.quad		0x6ee3000000000000
+rk20:	.quad		0xe7b5000000000000
 
 tbl_shf_table:
 // use these values for shift constants for the tbl/tbx instruction
diff --git a/arch/arm64/crypto/crct10dif-ce-glue.c b/arch/arm/crypto/crct10dif-ce-glue.c
similarity index 76%
copy from arch/arm64/crypto/crct10dif-ce-glue.c
copy to arch/arm/crypto/crct10dif-ce-glue.c
index d11f33dae79c..e717538d902c 100644
--- a/arch/arm64/crypto/crct10dif-ce-glue.c
+++ b/arch/arm/crypto/crct10dif-ce-glue.c
@@ -1,5 +1,5 @@
 /*
- * Accelerated CRC-T10DIF using arm64 NEON and Crypto Extensions instructions
+ * Accelerated CRC-T10DIF using ARM NEON and Crypto Extensions instructions
  *
  * Copyright (C) 2016 Linaro Ltd <ard.biesheuvel@linaro.org>
  *
@@ -8,7 +8,6 @@
  * published by the Free Software Foundation.
  */
 
-#include <linux/cpufeature.h>
 #include <linux/crc-t10dif.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
@@ -18,6 +17,7 @@
 #include <crypto/internal/hash.h>
 
 #include <asm/neon.h>
+#include <asm/simd.h>
 
 asmlinkage u16 crc_t10dif_pmull(u16 init_crc, const u8 buf[], u64 len);
 
@@ -34,9 +34,13 @@ static int crct10dif_update(struct shash_desc *desc, const u8 *data,
 {
 	u16 *crc = shash_desc_ctx(desc);
 
-	kernel_neon_begin_partial(14);
-	*crc = crc_t10dif_pmull(*crc, data, length);
-	kernel_neon_end();
+	if (may_use_simd()) {
+		kernel_neon_begin();
+		*crc = crc_t10dif_pmull(*crc, data, length);
+		kernel_neon_end();
+	} else {
+		*crc = crc_t10dif_generic(*crc, data, length);
+	}
 
 	return 0;
 }
@@ -57,7 +61,7 @@ static struct shash_alg crc_t10dif_alg = {
 
 	.descsize		= CRC_T10DIF_DIGEST_SIZE,
 	.base.cra_name		= "crct10dif",
-	.base.cra_driver_name	= "crct10dif-arm64-ce",
+	.base.cra_driver_name	= "crct10dif-arm-ce",
 	.base.cra_priority	= 200,
 	.base.cra_blocksize	= CRC_T10DIF_BLOCK_SIZE,
 	.base.cra_module	= THIS_MODULE,
@@ -65,6 +69,9 @@ static struct shash_alg crc_t10dif_alg = {
 
 static int __init crc_t10dif_mod_init(void)
 {
+	if (!(elf_hwcap2 & HWCAP2_PMULL))
+		return -ENODEV;
+
 	return crypto_register_shash(&crc_t10dif_alg);
 }
 
@@ -73,8 +80,10 @@ static void __exit crc_t10dif_mod_exit(void)
 	crypto_unregister_shash(&crc_t10dif_alg);
 }
 
-module_cpu_feature_match(PMULL, crc_t10dif_mod_init);
+module_init(crc_t10dif_mod_init);
 module_exit(crc_t10dif_mod_exit);
 
 MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
 MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_CRYPTO("crct10dif");
+MODULE_ALIAS_CRYPTO("crct10dif-arm-ce");
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 4/4] crypto: arm/crct10dif - port x86 SSE implementation to ARM
  2016-11-24 15:43 ` [PATCH 4/4] crypto: arm/crct10dif - port x86 SSE implementation to ARM Ard Biesheuvel
@ 2016-11-24 17:32   ` Ard Biesheuvel
  2016-11-28 13:17     ` Herbert Xu
  0 siblings, 1 reply; 8+ messages in thread
From: Ard Biesheuvel @ 2016-11-24 17:32 UTC (permalink / raw)
  To: linux-crypto, Herbert Xu, linux-arm-kernel, Catalin Marinas,
	Will Deacon, Russell King - ARM Linux
  Cc: Steve Capper, dingtinahong, yangshengkai, YueHaibing, Hanjun Guo,
	Ard Biesheuvel

On 24 November 2016 at 15:43, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> This is a straight transliteration of the Intel algorithm implemented
> using SSE and PCLMULQDQ instructions that resides under in the file
> arch/x86/crypto/crct10dif-pcl-asm_64.S.
>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>  arch/arm/crypto/Kconfig                        |   5 +
>  arch/arm/crypto/Makefile                       |   2 +
>  arch/{arm64 => arm}/crypto/crct10dif-ce-core.S | 457 +++++++++++---------
>  arch/{arm64 => arm}/crypto/crct10dif-ce-glue.c |  23 +-
>  4 files changed, 277 insertions(+), 210 deletions(-)
>

This patch needs the following hunk folded in to avoid breaking the
Thumb2 build (the adrl pseudo-instruction is not available when assembling
for Thumb-2, while the Thumb-2 adr encoding has enough range to reach rk3
here):

"""
diff --git a/arch/arm/crypto/crct10dif-ce-core.S b/arch/arm/crypto/crct10dif-ce-core.S
index 30168b0f8581..4fdbca94dd0c 100644
--- a/arch/arm/crypto/crct10dif-ce-core.S
+++ b/arch/arm/crypto/crct10dif-ce-core.S
@@ -152,7 +152,8 @@ CPU_LE(     vrev64.8        q7, q7                  )
        // XOR the initial_crc value
        veor.8          q0, q0, q10

-       adrl            ip, rk3
+ARM(   adrl            ip, rk3         )
+THUMB( adr             ip, rk3         )
        vld1.64         {q10}, [ip]     // xmm10 has rk3 and rk4
                                        // type of pmull instruction
                                        // will determine which constant to use
"""

Updated patch(es) can be found here
https://git.kernel.org/cgit/linux/kernel/git/ardb/linux.git/log/?h=arm-crct10dif

* Re: [PATCH 4/4] crypto: arm/crct10dif - port x86 SSE implementation to ARM
  2016-11-24 17:32   ` Ard Biesheuvel
@ 2016-11-28 13:17     ` Herbert Xu
  2016-11-28 14:59       ` Ard Biesheuvel
  0 siblings, 1 reply; 8+ messages in thread
From: Herbert Xu @ 2016-11-28 13:17 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-crypto, linux-arm-kernel, Catalin Marinas, Will Deacon,
	Russell King - ARM Linux, Steve Capper, dingtinahong,
	yangshengkai, YueHaibing, Hanjun Guo

On Thu, Nov 24, 2016 at 05:32:42PM +0000, Ard Biesheuvel wrote:
> On 24 November 2016 at 15:43, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> > This is a straight transliteration of the Intel algorithm implemented
> > using SSE and PCLMULQDQ instructions that resides under in the file
> > arch/x86/crypto/crct10dif-pcl-asm_64.S.
> >
> > Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > ---
> >  arch/arm/crypto/Kconfig                        |   5 +
> >  arch/arm/crypto/Makefile                       |   2 +
> >  arch/{arm64 => arm}/crypto/crct10dif-ce-core.S | 457 +++++++++++---------
> >  arch/{arm64 => arm}/crypto/crct10dif-ce-glue.c |  23 +-
> >  4 files changed, 277 insertions(+), 210 deletions(-)
> >
> 
> This patch needs the following hunk folded in to avoid breaking the
> Thumb2 build:
> 
> """
> diff --git a/arch/arm/crypto/crct10dif-ce-core.S b/arch/arm/crypto/crct10dif-ce-core.S
> index 30168b0f8581..4fdbca94dd0c 100644
> --- a/arch/arm/crypto/crct10dif-ce-core.S
> +++ b/arch/arm/crypto/crct10dif-ce-core.S
> @@ -152,7 +152,8 @@ CPU_LE(     vrev64.8        q7, q7                  )
>         // XOR the initial_crc value
>         veor.8          q0, q0, q10
> 
> -       adrl            ip, rk3
> +ARM(   adrl            ip, rk3         )
> +THUMB( adr             ip, rk3         )
>         vld1.64         {q10}, [ip]     // xmm10 has rk3 and rk4
>                                         // type of pmull instruction
>                                         // will determine which constant to use
> """

I'm sorry but this patch doesn't apply on top of the other four.
So please resend the whole series.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: [PATCH 4/4] crypto: arm/crct10dif - port x86 SSE implementation to ARM
  2016-11-28 13:17     ` Herbert Xu
@ 2016-11-28 14:59       ` Ard Biesheuvel
  0 siblings, 0 replies; 8+ messages in thread
From: Ard Biesheuvel @ 2016-11-28 14:59 UTC (permalink / raw)
  To: Herbert Xu
  Cc: linux-crypto, linux-arm-kernel, Catalin Marinas, Will Deacon,
	Russell King - ARM Linux, Steve Capper, dingtinahong,
	yangshengkai, YueHaibing, Hanjun Guo

On 28 November 2016 at 14:17, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Thu, Nov 24, 2016 at 05:32:42PM +0000, Ard Biesheuvel wrote:
>> On 24 November 2016 at 15:43, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>> > This is a straight transliteration of the Intel algorithm implemented
>> > using SSE and PCLMULQDQ instructions that resides under in the file
>> > arch/x86/crypto/crct10dif-pcl-asm_64.S.
>> >
>> > Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> > ---
>> >  arch/arm/crypto/Kconfig                        |   5 +
>> >  arch/arm/crypto/Makefile                       |   2 +
>> >  arch/{arm64 => arm}/crypto/crct10dif-ce-core.S | 457 +++++++++++---------
>> >  arch/{arm64 => arm}/crypto/crct10dif-ce-glue.c |  23 +-
>> >  4 files changed, 277 insertions(+), 210 deletions(-)
>> >
>>
>> This patch needs the following hunk folded in to avoid breaking the
>> Thumb2 build:
>>
>> """
>> diff --git a/arch/arm/crypto/crct10dif-ce-core.S b/arch/arm/crypto/crct10dif-ce-core.S
>> index 30168b0f8581..4fdbca94dd0c 100644
>> --- a/arch/arm/crypto/crct10dif-ce-core.S
>> +++ b/arch/arm/crypto/crct10dif-ce-core.S
>> @@ -152,7 +152,8 @@ CPU_LE(     vrev64.8        q7, q7                  )
>>         // XOR the initial_crc value
>>         veor.8          q0, q0, q10
>>
>> -       adrl            ip, rk3
>> +ARM(   adrl            ip, rk3         )
>> +THUMB( adr             ip, rk3         )
>>         vld1.64         {q10}, [ip]     // xmm10 has rk3 and rk4
>>                                         // type of pmull instruction
>>                                         // will determine which constant to use
>> """
>
> I'm sorry but this patch doesn't apply on top of the other four.
> So please resend the whole series.
>

Yes, please disregard all CRC ARM/arm64 patches for now, I will
consolidate them into a single v2 and send it out after the merge
window.
