linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] Add optimized SHA-1 implementations for x86 and x86_64
@ 2007-06-08 21:42 Benjamin Gilbert
  2007-06-08 21:42 ` [PATCH 1/3] [CRYPTO] Move sha_init() into cryptohash.h Benjamin Gilbert
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Benjamin Gilbert @ 2007-06-08 21:42 UTC (permalink / raw)
  To: akpm; +Cc: herbert, linux-crypto, linux-kernel

The following 3-part series adds assembly implementations of the SHA-1
transform for x86 and x86_64.  For x86_64 the optimized code is always
selected; on x86 it is selected if the kernel is compiled for i486 or above
(since the code needs BSWAP).  These changes primarily improve the
performance of the CryptoAPI SHA-1 module and of /dev/urandom.  I've
included some performance data from my test boxes below.

This version incorporates feedback from Herbert Xu.  Andrew, I'm sending
this to you because of the (admittedly tiny) intersection with arm and s390
in part 1.

-

tcrypt performance tests:

=== Pentium IV in 32-bit mode, average of 5 trials ===
Test#  Bytes/  Bytes/  Cyc/B  Cyc/B  Change
        block  update    (C)  (asm)
    0      16      16    229    114     50%
    1      64      16    142     76     46%
    2      64      64     79     35     56%
    3     256      16     59     34     42%
    4     256      64     44     24     45%
    5     256     256     43     17     60%
    6    1024      16     51     36     29%
    7    1024     256     30     13     57%
    8    1024    1024     28     12     57%
    9    2048      16     66     30     55%
   10    2048     256     31     12     61%
   11    2048    1024     27     13     52%
   12    2048    2048     26     13     50%
   13    4096      16     49     30     39%
   14    4096     256     28     12     57%
   15    4096    1024     28     11     61%
   16    4096    4096     26     13     50%
   17    8192      16     49     29     41%
   18    8192     256     27     11     59%
   19    8192    1024     26     11     58%
   20    8192    4096     25     10     60%
   21    8192    8192     25     10     60%

=== Intel Core 2 in 64-bit mode, average of 5 trials ===
Test#  Bytes/  Bytes/  Cyc/B  Cyc/B  Change
        block  update    (C)  (asm)
    0      16      16    112     81     28%
    1      64      16     55     39     29%
    2      64      64     42     27     36%
    3     256      16     35     25     29%
    4     256      64     24     14     42%
    5     256     256     22     12     45%
    6    1024      16     31     22     29%
    7    1024     256     17      9     47%
    8    1024    1024     16      9     44%
    9    2048      16     30     22     27%
   10    2048     256     16      8     50%
   11    2048    1024     16      8     50%
   12    2048    2048     16      8     50%
   13    4096      16     29     21     28%
   14    4096     256     16      8     50%
   15    4096    1024     15      8     47%
   16    4096    4096     15      7     53%
   17    8192      16     29     22     24%
   18    8192     256     16      8     50%
   19    8192    1024     15      7     53%
   20    8192    4096     15      7     53%
   21    8192    8192     15      7     53%

I've also done informal tests on other boxes, and the performance
improvement has been in the same ballpark.

On the aforementioned Pentium IV, /dev/urandom throughput goes from 3.7 MB/s
to 5.6 MB/s with the patches; on the Core 2, it increases from 5.5 MB/s to
8.1 MB/s.

Signed-off-by: Benjamin Gilbert <bgilbert@cs.cmu.edu>


* [PATCH 1/3] [CRYPTO] Move sha_init() into cryptohash.h
  2007-06-08 21:42 [PATCH 0/3] Add optimized SHA-1 implementations for x86 and x86_64 Benjamin Gilbert
@ 2007-06-08 21:42 ` Benjamin Gilbert
  2007-06-08 21:42 ` [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+ Benjamin Gilbert
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: Benjamin Gilbert @ 2007-06-08 21:42 UTC (permalink / raw)
  To: akpm; +Cc: herbert, linux-crypto, linux-kernel

Make sha_init() a static inline in cryptohash.h rather than an (unexported)
function in lib/sha1.c, in preparation for making sha1.c optional.  This
also allows some cleanups:

- Modular code can now use sha_init() rather than reimplementing it

- The optimized implementation of SHA-1 for ARM no longer needs to
reimplement sha_init() in assembly

Signed-off-by: Benjamin Gilbert <bgilbert@cs.cmu.edu>
---

 arch/arm/lib/sha1.S          |   16 ----------------
 arch/s390/crypto/sha1_s390.c |    6 +-----
 drivers/crypto/padlock-sha.c |    8 ++------
 include/linux/cryptohash.h   |   14 +++++++++++++-
 lib/sha1.c                   |   14 --------------
 5 files changed, 16 insertions(+), 42 deletions(-)

diff --git a/arch/arm/lib/sha1.S b/arch/arm/lib/sha1.S
index ff6ece4..5be800c 100644
--- a/arch/arm/lib/sha1.S
+++ b/arch/arm/lib/sha1.S
@@ -188,19 +188,3 @@ ENTRY(sha_transform)
 .L_sha_K:
 	.word	0x5a827999, 0x6ed9eba1, 0x8f1bbcdc, 0xca62c1d6
 
-
-/*
- * void sha_init(__u32 *buf)
- */
-
-.L_sha_initial_digest:
-	.word	0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476, 0xc3d2e1f0
-
-ENTRY(sha_init)
-
-	str	lr, [sp, #-4]!
-	adr	r1, .L_sha_initial_digest
-	ldmia	r1, {r1, r2, r3, ip, lr}
-	stmia	r0, {r1, r2, r3, ip, lr}
-	ldr	pc, [sp], #4
-
diff --git a/arch/s390/crypto/sha1_s390.c b/arch/s390/crypto/sha1_s390.c
index af4460e..fed9a2e 100644
--- a/arch/s390/crypto/sha1_s390.c
+++ b/arch/s390/crypto/sha1_s390.c
@@ -42,11 +42,7 @@ static void sha1_init(struct crypto_tfm *tfm)
 {
 	struct s390_sha1_ctx *sctx = crypto_tfm_ctx(tfm);
 
-	sctx->state[0] = 0x67452301;
-	sctx->state[1] = 0xEFCDAB89;
-	sctx->state[2] = 0x98BADCFE;
-	sctx->state[3] = 0x10325476;
-	sctx->state[4] = 0xC3D2E1F0;
+	sha_init(sctx->state);
 	sctx->count = 0;
 }
 
diff --git a/drivers/crypto/padlock-sha.c b/drivers/crypto/padlock-sha.c
index a781fd2..b47d708 100644
--- a/drivers/crypto/padlock-sha.c
+++ b/drivers/crypto/padlock-sha.c
@@ -107,12 +107,8 @@ static void padlock_do_sha1(const char *in, char *out, int count)
 	char buf[128+16];
 	char *result = NEAREST_ALIGNED(buf);
 
-	((uint32_t *)result)[0] = 0x67452301;
-	((uint32_t *)result)[1] = 0xEFCDAB89;
-	((uint32_t *)result)[2] = 0x98BADCFE;
-	((uint32_t *)result)[3] = 0x10325476;
-	((uint32_t *)result)[4] = 0xC3D2E1F0;
- 
+	sha_init((uint32_t *)result);
+
 	asm volatile (".byte 0xf3,0x0f,0xa6,0xc8" /* rep xsha1 */
 		      : "+S"(in), "+D"(result)
 		      : "c"(count), "a"(0));
diff --git a/include/linux/cryptohash.h b/include/linux/cryptohash.h
index c118b2a..a172401 100644
--- a/include/linux/cryptohash.h
+++ b/include/linux/cryptohash.h
@@ -4,7 +4,19 @@
 #define SHA_DIGEST_WORDS 5
 #define SHA_WORKSPACE_WORDS 80
 
-void sha_init(__u32 *buf);
+/**
+ * sha_init - initialize the vectors for a SHA1 digest
+ * @buf: vector to initialize
+ */
+static inline void sha_init(__u32 *buf)
+{
+	buf[0] = 0x67452301;
+	buf[1] = 0xefcdab89;
+	buf[2] = 0x98badcfe;
+	buf[3] = 0x10325476;
+	buf[4] = 0xc3d2e1f0;
+}
+
 void sha_transform(__u32 *digest, const char *data, __u32 *W);
 
 __u32 half_md4_transform(__u32 buf[4], __u32 const in[8]);
diff --git a/lib/sha1.c b/lib/sha1.c
index 4c45fd5..815816f 100644
--- a/lib/sha1.c
+++ b/lib/sha1.c
@@ -79,17 +79,3 @@ void sha_transform(__u32 *digest, const char *in, __u32 *W)
 	digest[4] += e;
 }
 EXPORT_SYMBOL(sha_transform);
-
-/**
- * sha_init - initialize the vectors for a SHA1 digest
- * @buf: vector to initialize
- */
-void sha_init(__u32 *buf)
-{
-	buf[0] = 0x67452301;
-	buf[1] = 0xefcdab89;
-	buf[2] = 0x98badcfe;
-	buf[3] = 0x10325476;
-	buf[4] = 0xc3d2e1f0;
-}
-



* [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
  2007-06-08 21:42 [PATCH 0/3] Add optimized SHA-1 implementations for x86 and x86_64 Benjamin Gilbert
  2007-06-08 21:42 ` [PATCH 1/3] [CRYPTO] Move sha_init() into cryptohash.h Benjamin Gilbert
@ 2007-06-08 21:42 ` Benjamin Gilbert
  2007-06-09  7:32   ` Jan Engelhardt
  2007-06-09 20:11   ` [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+ Matt Mackall
  2007-06-08 21:42 ` [PATCH 3/3] [CRYPTO] Add optimized SHA-1 implementation for x86_64 Benjamin Gilbert
  2007-06-11 20:30 ` [PATCH 0/3] Add optimized SHA-1 implementations for x86 and x86_64 Adrian Bunk
  3 siblings, 2 replies; 26+ messages in thread
From: Benjamin Gilbert @ 2007-06-08 21:42 UTC (permalink / raw)
  To: akpm; +Cc: herbert, linux-crypto, linux-kernel

Add x86-optimized implementation of the SHA-1 hash function, taken from
Nettle under the LGPL.  This code will be enabled on kernels compiled for
486es or better; kernels which support 386es will use the generic
implementation (since we need BSWAP).

We disable building lib/sha1.o when an optimized implementation is
available, as the library link order for x86 (and x86_64) would otherwise
ignore the optimized version.  The existing optimized implementation for ARM
does not do this; the library link order for that architecture appears to
favor the arch/arm/ version automatically.  I've left this situation alone
since I'm not familiar with the ARM code, but a !ARM condition could be
added to CONFIG_SHA1_GENERIC if it makes sense.

The code has been tested with tcrypt and the NIST test vectors.
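
For anyone who wants to spot-check the transform by hand, here is a rough
sketch of the standard FIPS 180-1 "abc" vector driven through the
cryptohash.h interface (the helper name is made up for illustration;
tcrypt remains the real test harness):

#include <linux/cryptohash.h>
#include <linux/errno.h>
#include <linux/string.h>
#include <linux/types.h>

/* Hash the single padded block for "abc" and compare against the
 * published digest a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d. */
static int sha1_abc_check(void)
{
	static const __u32 expected[SHA_DIGEST_WORDS] = {
		0xa9993e36, 0x4706816a, 0xba3e2571, 0x7850c26c, 0x9cd0d89d
	};
	__u32 digest[SHA_DIGEST_WORDS];
	__u32 workspace[SHA_WORKSPACE_WORDS];	/* zero-sized with SHA1_X86 */
	char block[64];

	/* standard SHA-1 padding: message, 0x80, zeros, 64-bit bit count */
	memset(block, 0, sizeof(block));
	memcpy(block, "abc", 3);
	block[3] = (char)0x80;
	block[63] = 24;				/* 3 bytes = 24 bits */

	sha_init(digest);
	sha_transform(digest, block, workspace);
	return memcmp(digest, expected, sizeof(expected)) ? -EINVAL : 0;
}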

Signed-off-by: Benjamin Gilbert <bgilbert@cs.cmu.edu>
---

 arch/i386/kernel/i386_ksyms.c |    5 +
 arch/i386/lib/Makefile        |    1 
 arch/i386/lib/sha1.S          |  299 +++++++++++++++++++++++++++++++++++++++++
 include/linux/cryptohash.h    |    9 +
 lib/Kconfig                   |   13 ++
 lib/Makefile                  |    3 
 6 files changed, 328 insertions(+), 2 deletions(-)

diff --git a/arch/i386/kernel/i386_ksyms.c b/arch/i386/kernel/i386_ksyms.c
index e3d4b73..812bc4e 100644
--- a/arch/i386/kernel/i386_ksyms.c
+++ b/arch/i386/kernel/i386_ksyms.c
@@ -1,4 +1,5 @@
 #include <linux/module.h>
+#include <linux/cryptohash.h>
 #include <asm/checksum.h>
 #include <asm/desc.h>
 
@@ -28,3 +29,7 @@ EXPORT_SYMBOL(__read_lock_failed);
 #endif
 
 EXPORT_SYMBOL(csum_partial);
+
+#ifdef CONFIG_SHA1_X86
+EXPORT_SYMBOL(sha_transform);
+#endif
diff --git a/arch/i386/lib/Makefile b/arch/i386/lib/Makefile
index 22d8ac5..69f4845 100644
--- a/arch/i386/lib/Makefile
+++ b/arch/i386/lib/Makefile
@@ -6,6 +6,7 @@
 lib-y = checksum.o delay.o usercopy.o getuser.o putuser.o memcpy.o strstr.o \
 	bitops.o semaphore.o
 
+lib-$(CONFIG_SHA1_X86) += sha1.o
 lib-$(CONFIG_X86_USE_3DNOW) += mmx.o
 
 obj-$(CONFIG_SMP)	+= msr-on-cpu.o
diff --git a/arch/i386/lib/sha1.S b/arch/i386/lib/sha1.S
new file mode 100644
index 0000000..28aa4b7
--- /dev/null
+++ b/arch/i386/lib/sha1.S
@@ -0,0 +1,299 @@
+/*
+ * x86-optimized SHA1 hash algorithm (i486 and above)
+ *
+ * Originally from Nettle
+ * Ported from M4 to cpp by Benjamin Gilbert <bgilbert@cs.cmu.edu>
+ *
+ * Copyright (C) 2004, Niels Möller
+ * Copyright (C) 2006-2007 Carnegie Mellon University
+ *
+ * This library is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License as
+ * published by the Free Software Foundation.  A copy of the GNU Lesser General
+ * Public License should have been distributed along with this library in the
+ * file LICENSE.LGPL.
+ *
+ * This library is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
+ * for more details.
+ */
+
+#include <linux/linkage.h>
+#include <asm/dwarf2.h>
+
+/* Register usage */
+#define SA	%eax
+#define SB	%ebx
+#define SC	%ecx
+#define SD	%edx
+#define SE	%ebp
+#define DATA	%esp
+#define TMP	%edi
+#define TMP2	%esi			/* Used by SWAP and F3 */
+#define TMP3	64(%esp)
+
+/* Constants */
+#define K1VALUE	$0x5A827999		/* Rounds  0-19 */
+#define K2VALUE	$0x6ED9EBA1		/* Rounds 20-39 */
+#define K3VALUE	$0x8F1BBCDC		/* Rounds 40-59 */
+#define K4VALUE	$0xCA62C1D6		/* Rounds 60-79 */
+
+/* Convert stack offsets in words to offsets in bytes */
+#define OFFSET(i) 4*(i)
+
+/* Reads the input via TMP2 into register, byteswaps it, and stores it in
+   the DATA array. */
+#define SWAP(index, register)					\
+	movl	OFFSET(index)(TMP2), register;			\
+	bswap	register;					\
+	movl	register, OFFSET(index)(DATA)
+
+/* Sets the workspace word at the given index to TMP. */
+#define CLEAR(index)						\
+	movl	TMP, OFFSET(index)(DATA)
+
+/* pushl/popl wrappers that update the DWARF unwind table */
+#define PUSH(regname)						\
+	pushl			%regname;			\
+	CFI_ADJUST_CFA_OFFSET	4;				\
+	CFI_REL_OFFSET		regname, 0
+
+#define POP(regname)						\
+	popl			%regname;			\
+	CFI_ADJUST_CFA_OFFSET	-4;				\
+	CFI_RESTORE		regname
+
+/*
+ * expand(i) is the expansion function
+ *
+ *   W[i] = (W[i - 16] ^ W[i - 14] ^ W[i - 8] ^ W[i - 3]) <<< 1
+ *
+ * where W[i] is stored in DATA[i mod 16].
+ *
+ * Result is stored back in W[i], and also left in TMP, the only
+ * register that is used.
+ */
+#define EXPAND(i)						\
+	movl	OFFSET(i % 16)(DATA), TMP;			\
+	xorl	OFFSET((i + 2) % 16)(DATA), TMP;		\
+	xorl	OFFSET((i + 8) % 16)(DATA), TMP;		\
+	xorl	OFFSET((i + 13) % 16)(DATA), TMP;		\
+	roll	$1, TMP;					\
+	movl	TMP, OFFSET(i % 16)(DATA)
+
+/*
+ * The f functions,
+ *
+ *  f1(x,y,z) = z ^ (x & (y ^ z))
+ *  f2(x,y,z) = x ^ y ^ z
+ *  f3(x,y,z) = (x & y) | (z & (x | y))
+ *  f4 = f2
+ *
+ * The macro Fk(x,y,z) computes = fk(x,y,z).
+ * Result is left in TMP.
+ */
+#define F1(x,y,z)						\
+	movl	z, TMP;						\
+	xorl	y, TMP;						\
+	andl	x, TMP;						\
+	xorl	z, TMP
+
+#define F2(x,y,z)						\
+	movl	x, TMP;						\
+	xorl	y, TMP;						\
+	xorl	z, TMP
+
+#define F3(x,y,z)						\
+	movl	x, TMP2;					\
+	andl	y, TMP2;					\
+	movl	x, TMP;						\
+	orl	y, TMP;						\
+	andl	z, TMP;						\
+	orl	TMP2, TMP
+
+/*
+ * The form of one sha1 round is
+ *
+ *   a' = e + a <<< 5 + f( b, c, d ) + k + w;
+ *   b' = a;
+ *   c' = b <<< 30;
+ *   d' = c;
+ *   e' = d;
+ *
+ * where <<< denotes rotation. We permute our variables, so that we
+ * instead get
+ *
+ *   e += a <<< 5 + f( b, c, d ) + k + w;
+ *   b <<<= 30
+ *
+ * Using the TMP register for the rotate could be avoided, by rotating
+ * %a in place, adding, and then rotating back.
+ */
+#define ROUND(a,b,c,d,e,f,k,w)					\
+	addl	k, e;						\
+	addl	w, e;						\
+	f(b,c,d);						\
+	addl	TMP, e;						\
+	movl	a, TMP;						\
+	roll	$5, TMP;					\
+	addl	TMP, e;						\
+	roll	$30, b;
+
+/* sha_transform(__u32 *digest, const char *in, __u32 *W) */
+/* The workspace argument is ignored; we don't have enough registers to handle
+   a workspace that isn't on our own stack. */
+.text
+ENTRY(sha_transform)
+	CFI_STARTPROC
+	/* save all registers that need to be saved */
+	PUSH(ebx)		/* 80(%esp) */
+	PUSH(ebp)		/* 76(%esp) */
+	PUSH(esi)		/* 72(%esp) */
+	PUSH(edi)		/* 68(%esp) */
+
+	subl	$68, %esp	/* %esp = W */
+	CFI_ADJUST_CFA_OFFSET	68
+
+	/* Load and byteswap data */
+	movl	92(%esp), TMP2
+
+	SWAP( 0, %eax); SWAP( 1, %ebx); SWAP( 2, %ecx); SWAP( 3, %edx)
+	SWAP( 4, %eax); SWAP( 5, %ebx); SWAP( 6, %ecx); SWAP( 7, %edx)
+	SWAP( 8, %eax); SWAP( 9, %ebx); SWAP(10, %ecx); SWAP(11, %edx)
+	SWAP(12, %eax); SWAP(13, %ebx); SWAP(14, %ecx); SWAP(15, %edx)
+
+	/* load the state vector */
+	movl	88(%esp),TMP
+	movl	(TMP),   SA
+	movl	4(TMP),  SB
+	movl	8(TMP),  SC
+	movl	12(TMP), SD
+	movl	16(TMP), SE
+
+	movl	K1VALUE, TMP2
+	ROUND(SA, SB, SC, SD, SE, F1, TMP2, OFFSET( 0)(DATA))
+	ROUND(SE, SA, SB, SC, SD, F1, TMP2, OFFSET( 1)(DATA))
+	ROUND(SD, SE, SA, SB, SC, F1, TMP2, OFFSET( 2)(DATA))
+	ROUND(SC, SD, SE, SA, SB, F1, TMP2, OFFSET( 3)(DATA))
+	ROUND(SB, SC, SD, SE, SA, F1, TMP2, OFFSET( 4)(DATA))
+
+	ROUND(SA, SB, SC, SD, SE, F1, TMP2, OFFSET( 5)(DATA))
+	ROUND(SE, SA, SB, SC, SD, F1, TMP2, OFFSET( 6)(DATA))
+	ROUND(SD, SE, SA, SB, SC, F1, TMP2, OFFSET( 7)(DATA))
+	ROUND(SC, SD, SE, SA, SB, F1, TMP2, OFFSET( 8)(DATA))
+	ROUND(SB, SC, SD, SE, SA, F1, TMP2, OFFSET( 9)(DATA))
+
+	ROUND(SA, SB, SC, SD, SE, F1, TMP2, OFFSET(10)(DATA))
+	ROUND(SE, SA, SB, SC, SD, F1, TMP2, OFFSET(11)(DATA))
+	ROUND(SD, SE, SA, SB, SC, F1, TMP2, OFFSET(12)(DATA))
+	ROUND(SC, SD, SE, SA, SB, F1, TMP2, OFFSET(13)(DATA))
+	ROUND(SB, SC, SD, SE, SA, F1, TMP2, OFFSET(14)(DATA))
+
+	ROUND(SA, SB, SC, SD, SE, F1, TMP2, OFFSET(15)(DATA))
+	EXPAND(16); ROUND(SE, SA, SB, SC, SD, F1, TMP2, TMP)
+	EXPAND(17); ROUND(SD, SE, SA, SB, SC, F1, TMP2, TMP)
+	EXPAND(18); ROUND(SC, SD, SE, SA, SB, F1, TMP2, TMP)
+	EXPAND(19); ROUND(SB, SC, SD, SE, SA, F1, TMP2, TMP)
+
+	/* TMP2 is free to use in these rounds */
+	movl	K2VALUE, TMP2
+	EXPAND(20); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+	EXPAND(21); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+	EXPAND(22); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+	EXPAND(23); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+	EXPAND(24); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+	EXPAND(25); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+	EXPAND(26); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+	EXPAND(27); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+	EXPAND(28); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+	EXPAND(29); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+	EXPAND(30); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+	EXPAND(31); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+	EXPAND(32); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+	EXPAND(33); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+	EXPAND(34); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+	EXPAND(35); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+	EXPAND(36); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+	EXPAND(37); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+	EXPAND(38); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+	EXPAND(39); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+	/* We have to put this constant on the stack */
+	movl	K3VALUE, TMP3
+	EXPAND(40); ROUND(SA, SB, SC, SD, SE, F3, TMP3, TMP)
+	EXPAND(41); ROUND(SE, SA, SB, SC, SD, F3, TMP3, TMP)
+	EXPAND(42); ROUND(SD, SE, SA, SB, SC, F3, TMP3, TMP)
+	EXPAND(43); ROUND(SC, SD, SE, SA, SB, F3, TMP3, TMP)
+	EXPAND(44); ROUND(SB, SC, SD, SE, SA, F3, TMP3, TMP)
+
+	EXPAND(45); ROUND(SA, SB, SC, SD, SE, F3, TMP3, TMP)
+	EXPAND(46); ROUND(SE, SA, SB, SC, SD, F3, TMP3, TMP)
+	EXPAND(47); ROUND(SD, SE, SA, SB, SC, F3, TMP3, TMP)
+	EXPAND(48); ROUND(SC, SD, SE, SA, SB, F3, TMP3, TMP)
+	EXPAND(49); ROUND(SB, SC, SD, SE, SA, F3, TMP3, TMP)
+
+	EXPAND(50); ROUND(SA, SB, SC, SD, SE, F3, TMP3, TMP)
+	EXPAND(51); ROUND(SE, SA, SB, SC, SD, F3, TMP3, TMP)
+	EXPAND(52); ROUND(SD, SE, SA, SB, SC, F3, TMP3, TMP)
+	EXPAND(53); ROUND(SC, SD, SE, SA, SB, F3, TMP3, TMP)
+	EXPAND(54); ROUND(SB, SC, SD, SE, SA, F3, TMP3, TMP)
+
+	EXPAND(55); ROUND(SA, SB, SC, SD, SE, F3, TMP3, TMP)
+	EXPAND(56); ROUND(SE, SA, SB, SC, SD, F3, TMP3, TMP)
+	EXPAND(57); ROUND(SD, SE, SA, SB, SC, F3, TMP3, TMP)
+	EXPAND(58); ROUND(SC, SD, SE, SA, SB, F3, TMP3, TMP)
+	EXPAND(59); ROUND(SB, SC, SD, SE, SA, F3, TMP3, TMP)
+
+	movl	K4VALUE, TMP2
+	EXPAND(60); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+	EXPAND(61); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+	EXPAND(62); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+	EXPAND(63); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+	EXPAND(64); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+	EXPAND(65); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+	EXPAND(66); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+	EXPAND(67); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+	EXPAND(68); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+	EXPAND(69); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+	EXPAND(70); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+	EXPAND(71); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+	EXPAND(72); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+	EXPAND(73); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+	EXPAND(74); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+	EXPAND(75); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+	EXPAND(76); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+	EXPAND(77); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+	EXPAND(78); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+	EXPAND(79); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+	/* Update the state vector */
+	movl	88(%esp),TMP
+	addl	SA, (TMP)
+	addl	SB, 4(TMP)
+	addl	SC, 8(TMP)
+	addl	SD, 12(TMP)
+	addl	SE, 16(TMP)
+
+	/* Clear the workspace for security */
+	xorl	TMP, TMP
+	CLEAR( 0); CLEAR( 1); CLEAR( 2); CLEAR( 3);
+	CLEAR( 4); CLEAR( 5); CLEAR( 6); CLEAR( 7);
+	CLEAR( 8); CLEAR( 9); CLEAR(10); CLEAR(11);
+	CLEAR(12); CLEAR(13); CLEAR(14); CLEAR(15);
+
+	addl	$68, %esp
+	CFI_ADJUST_CFA_OFFSET	-68
+	POP(edi)
+	POP(esi)
+	POP(ebp)
+	POP(ebx)
+	ret
+	CFI_ENDPROC
+ENDPROC(sha_transform)
diff --git a/include/linux/cryptohash.h b/include/linux/cryptohash.h
index a172401..0da331c 100644
--- a/include/linux/cryptohash.h
+++ b/include/linux/cryptohash.h
@@ -1,8 +1,15 @@
 #ifndef __CRYPTOHASH_H
 #define __CRYPTOHASH_H
 
+#include <linux/linkage.h>
+
 #define SHA_DIGEST_WORDS 5
+
+#if defined(CONFIG_SHA1_X86)
+#define SHA_WORKSPACE_WORDS 0
+#else
 #define SHA_WORKSPACE_WORDS 80
+#endif
 
 /**
  * sha_init - initialize the vectors for a SHA1 digest
@@ -17,7 +24,7 @@ static inline void sha_init(__u32 *buf)
 	buf[4] = 0xc3d2e1f0;
 }
 
-void sha_transform(__u32 *digest, const char *data, __u32 *W);
+asmlinkage void sha_transform(__u32 *digest, const char *data, __u32 *W);
 
 __u32 half_md4_transform(__u32 buf[4], __u32 const in[8]);
 
diff --git a/lib/Kconfig b/lib/Kconfig
index 2e7ae6b..69fdb64 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -124,4 +124,17 @@ config HAS_DMA
 	depends on !NO_DMA
 	default y
 
+#
+# Optimized SHA-1 support is autoconfigured
+#
+config SHA1_X86
+	bool
+	depends on (X86 || UML_X86) && !64BIT && X86_BSWAP
+	default y
+
+config SHA1_GENERIC
+	bool
+	depends on !SHA1_X86
+	default y
+
 endmenu
diff --git a/lib/Makefile b/lib/Makefile
index c8c8e20..f67be29 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -5,10 +5,11 @@
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o \
 	 idr.o int_sqrt.o bitmap.o extable.o prio_tree.o \
-	 sha1.o irq_regs.o reciprocal_div.o
+	 irq_regs.o reciprocal_div.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
+lib-$(CONFIG_SHA1_GENERIC) += sha1.o
 
 lib-y	+= kobject.o kref.o kobject_uevent.o klist.o
 

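For readers following the assembly, the EXPAND and ROUND macros above
compute roughly the following; this is a plain-C sketch with invented
helper names, not code from the patch and not how the generic lib/sha1.c
is written:

#include <linux/types.h>

/* stand-in for the roll instruction */
static inline __u32 rol32_sketch(__u32 x, unsigned int n)
{
	return (x << n) | (x >> (32 - n));
}

/* EXPAND(i): W[i] = (W[i-16] ^ W[i-14] ^ W[i-8] ^ W[i-3]) <<< 1, with
 * W[] kept as a 16-word circular buffer on the stack (the DATA array),
 * so the indices reduce to i, i+2, i+8 and i+13 mod 16. */
static __u32 expand_sketch(__u32 W[16], unsigned int i)
{
	__u32 tmp = W[i % 16] ^ W[(i + 2) % 16] ^
		    W[(i + 8) % 16] ^ W[(i + 13) % 16];

	tmp = rol32_sketch(tmp, 1);
	W[i % 16] = tmp;
	return tmp;
}

/* One F1 round after the variable permutation described in the comments:
 * only e and b change, and the caller rotates the roles of a..e. */
static void f1_round_sketch(__u32 a, __u32 *b, __u32 c, __u32 d, __u32 *e,
			    __u32 k, __u32 w)
{
	__u32 f = d ^ (*b & (c ^ d));	/* F1(x,y,z) = z ^ (x & (y ^ z)) */

	*e += rol32_sketch(a, 5) + f + k + w;
	*b = rol32_sketch(*b, 30);
}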


* [PATCH 3/3] [CRYPTO] Add optimized SHA-1 implementation for x86_64
  2007-06-08 21:42 [PATCH 0/3] Add optimized SHA-1 implementations for x86 and x86_64 Benjamin Gilbert
  2007-06-08 21:42 ` [PATCH 1/3] [CRYPTO] Move sha_init() into cryptohash.h Benjamin Gilbert
  2007-06-08 21:42 ` [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+ Benjamin Gilbert
@ 2007-06-08 21:42 ` Benjamin Gilbert
  2007-06-11 12:01   ` Andi Kleen
  2007-06-11 20:30 ` [PATCH 0/3] Add optimized SHA-1 implementations for x86 and x86_64 Adrian Bunk
  3 siblings, 1 reply; 26+ messages in thread
From: Benjamin Gilbert @ 2007-06-08 21:42 UTC (permalink / raw)
  To: akpm; +Cc: herbert, linux-crypto, linux-kernel

Add optimized implementation of the SHA-1 hash function for x86_64, ported
from the x86 implementation in Nettle (which is LGPLed).

The code has been tested with tcrypt and the NIST test vectors.

Signed-off-by: Benjamin Gilbert <bgilbert@cs.cmu.edu>
---

 arch/x86_64/kernel/x8664_ksyms.c |    3 
 arch/x86_64/lib/Makefile         |    2 
 arch/x86_64/lib/sha1.S           |  281 ++++++++++++++++++++++++++++++++++++++
 include/linux/cryptohash.h       |    2 
 lib/Kconfig                      |    7 +
 5 files changed, 293 insertions(+), 2 deletions(-)

diff --git a/arch/x86_64/kernel/x8664_ksyms.c b/arch/x86_64/kernel/x8664_ksyms.c
index 77c25b3..bc641ab 100644
--- a/arch/x86_64/kernel/x8664_ksyms.c
+++ b/arch/x86_64/kernel/x8664_ksyms.c
@@ -3,6 +3,7 @@
 
 #include <linux/module.h>
 #include <linux/smp.h>
+#include <linux/cryptohash.h>
 
 #include <asm/semaphore.h>
 #include <asm/processor.h>
@@ -60,3 +61,5 @@ EXPORT_SYMBOL(init_level4_pgt);
 EXPORT_SYMBOL(load_gs_index);
 
 EXPORT_SYMBOL(_proxy_pda);
+
+EXPORT_SYMBOL(sha_transform);
diff --git a/arch/x86_64/lib/Makefile b/arch/x86_64/lib/Makefile
index c943271..6c8110b 100644
--- a/arch/x86_64/lib/Makefile
+++ b/arch/x86_64/lib/Makefile
@@ -9,5 +9,5 @@ obj-$(CONFIG_SMP)	+= msr-on-cpu.o
 
 lib-y := csum-partial.o csum-copy.o csum-wrappers.o delay.o \
 	usercopy.o getuser.o putuser.o  \
-	thunk.o clear_page.o copy_page.o bitstr.o bitops.o
+	thunk.o clear_page.o copy_page.o bitstr.o bitops.o sha1.o
 lib-y += memcpy.o memmove.o memset.o copy_user.o rwlock.o copy_user_nocache.o
diff --git a/arch/x86_64/lib/sha1.S b/arch/x86_64/lib/sha1.S
new file mode 100644
index 0000000..48f4fde
--- /dev/null
+++ b/arch/x86_64/lib/sha1.S
@@ -0,0 +1,281 @@
+/*
+ * sha1-x86_64 - x86_64-optimized SHA1 hash algorithm
+ *
+ * Originally from Nettle
+ * Ported from M4 to cpp by Benjamin Gilbert <bgilbert@cs.cmu.edu>
+ * Ported from x86 to x86_64 by Benjamin Gilbert
+ *
+ * Copyright (C) 2004, Niels Möller
+ * Copyright (C) 2006-2007 Carnegie Mellon University
+ *
+ * This library is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License as
+ * published by the Free Software Foundation.  A copy of the GNU Lesser General
+ * Public License should have been distributed along with this library in the
+ * file LICENSE.LGPL.
+ *
+ * This library is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
+ * for more details.
+ */
+
+#include <linux/linkage.h>
+#include <asm/dwarf2.h>
+
+/* Register usage.  r12-15 must be saved if they will be used.  Accessing
+   r8-r15 takes an extra instruction byte. */
+#define P_STATE	%rdi			/* Pointer parameter */
+#define P_DATA	%rsi			/* Pointer parameter */
+#define DATA	%rdx			/* Pointer parameter */
+#define SA	%edi			/* Reuses P_STATE */
+#define SB	%esi			/* Reuses P_DATA */
+#define SC	%eax
+#define SD	%ebx			/* Callee-saved */
+#define SE	%ebp			/* Callee-saved */
+#define TMP	%ecx
+#define TMP2	%r8d			/* Used by F3 */
+#define CONST	%r9d
+#define STATE	%r10
+
+/* Constants */
+#define K1VALUE	$0x5A827999		/* Rounds  0-19 */
+#define K2VALUE	$0x6ED9EBA1		/* Rounds 20-39 */
+#define K3VALUE	$0x8F1BBCDC		/* Rounds 40-59 */
+#define K4VALUE	$0xCA62C1D6		/* Rounds 60-79 */
+
+/* Convert stack offsets in 32-bit words to offsets in bytes */
+#define OFFSET(i) 4*(i)
+
+/* Reads the input via P_DATA into register, byteswaps it, and stores it in
+   the DATA array. */
+#define SWAP(index, register)					\
+	movl	OFFSET(index)(P_DATA), register;		\
+	bswap	register;					\
+	movl	register, OFFSET(index)(DATA)
+
+/* push/pop wrappers that update the DWARF unwind table */
+#define PUSH(regname)						\
+	push			%regname;			\
+	CFI_ADJUST_CFA_OFFSET	8;				\
+	CFI_REL_OFFSET		regname, 0
+
+#define POP(regname)						\
+	pop			%regname;			\
+	CFI_ADJUST_CFA_OFFSET	-8;				\
+	CFI_RESTORE		regname
+
+/*
+ * expand(i) is the expansion function
+ *
+ *   W[i] = (W[i - 16] ^ W[i - 14] ^ W[i - 8] ^ W[i - 3]) <<< 1
+ *
+ * where W[i] is stored in DATA[i mod 16].
+ *
+ * Result is stored back in W[i], and also left in TMP, the only
+ * register that is used.
+ */
+#define EXPAND(i)						\
+	movl	OFFSET(i % 16)(DATA), TMP;			\
+	xorl	OFFSET((i + 2) % 16)(DATA), TMP;		\
+	xorl	OFFSET((i + 8) % 16)(DATA), TMP;		\
+	xorl	OFFSET((i + 13) % 16)(DATA), TMP;		\
+	roll	$1, TMP;					\
+	movl	TMP, OFFSET(i % 16)(DATA)
+
+/*
+ * The f functions,
+ *
+ *  f1(x,y,z) = z ^ (x & (y ^ z))
+ *  f2(x,y,z) = x ^ y ^ z
+ *  f3(x,y,z) = (x & y) | (z & (x | y))
+ *  f4 = f2
+ *
+ * The macro Fk(x,y,z) computes = fk(x,y,z).
+ * Result is left in TMP.
+ */
+#define F1(x,y,z)						\
+	movl	z, TMP;						\
+	xorl	y, TMP;						\
+	andl	x, TMP;						\
+	xorl	z, TMP
+
+#define F2(x,y,z)						\
+	movl	x, TMP;						\
+	xorl	y, TMP;						\
+	xorl	z, TMP
+
+#define F3(x,y,z)						\
+	movl	x, TMP2;					\
+	andl	y, TMP2;					\
+	movl	x, TMP;						\
+	orl	y, TMP;						\
+	andl	z, TMP;						\
+	orl	TMP2, TMP
+
+/*
+ * The form of one sha1 round is
+ *
+ *   a' = e + a <<< 5 + f( b, c, d ) + k + w;
+ *   b' = a;
+ *   c' = b <<< 30;
+ *   d' = c;
+ *   e' = d;
+ *
+ * where <<< denotes rotation. We permute our variables, so that we
+ * instead get
+ *
+ *   e += a <<< 5 + f( b, c, d ) + k + w;
+ *   b <<<= 30
+ *
+ * k is not an explicit parameter; it's always stored in CONST.
+ *
+ * Using the TMP register for the rotate could be avoided, by rotating
+ * %a in place, adding, and then rotating back.
+ */
+#define ROUND(a,b,c,d,e,f,w)					\
+	addl	CONST, e;					\
+	addl	w, e;						\
+	f(b,c,d);						\
+	addl	TMP, e;						\
+	movl	a, TMP;						\
+	roll	$5, TMP;					\
+	addl	TMP, e;						\
+	roll	$30, b;
+
+/* sha_transform(__u32 *digest, const char *in, __u32 *W) */
+/* We expect a 64-byte workspace. */
+.text
+ENTRY(sha_transform)
+	CFI_STARTPROC
+	PUSH(rbx)
+	PUSH(rbp)
+
+	/* Load and byteswap data */
+	SWAP( 0,  %eax); SWAP( 1,  %ebx); SWAP( 2,  %ecx); SWAP( 3,  %ebp)
+	SWAP( 4,  %r8d); SWAP( 5,  %r9d); SWAP( 6, %r10d); SWAP( 7, %r11d)
+	SWAP( 8,  %eax); SWAP( 9,  %ebx); SWAP(10,  %ecx); SWAP(11,  %ebp)
+	SWAP(12,  %r8d); SWAP(13,  %r9d); SWAP(14, %r10d); SWAP(15, %r11d)
+
+	/* P_DATA now dead; free up P_STATE for other uses */
+	movq	P_STATE, STATE
+
+	/* load the state vector */
+	movl	(STATE),   SA
+	movl	4(STATE),  SB
+	movl	8(STATE),  SC
+	movl	12(STATE), SD
+	movl	16(STATE), SE
+
+	movl	K1VALUE, CONST
+	ROUND(SA, SB, SC, SD, SE, F1, OFFSET( 0)(DATA))
+	ROUND(SE, SA, SB, SC, SD, F1, OFFSET( 1)(DATA))
+	ROUND(SD, SE, SA, SB, SC, F1, OFFSET( 2)(DATA))
+	ROUND(SC, SD, SE, SA, SB, F1, OFFSET( 3)(DATA))
+	ROUND(SB, SC, SD, SE, SA, F1, OFFSET( 4)(DATA))
+
+	ROUND(SA, SB, SC, SD, SE, F1, OFFSET( 5)(DATA))
+	ROUND(SE, SA, SB, SC, SD, F1, OFFSET( 6)(DATA))
+	ROUND(SD, SE, SA, SB, SC, F1, OFFSET( 7)(DATA))
+	ROUND(SC, SD, SE, SA, SB, F1, OFFSET( 8)(DATA))
+	ROUND(SB, SC, SD, SE, SA, F1, OFFSET( 9)(DATA))
+
+	ROUND(SA, SB, SC, SD, SE, F1, OFFSET(10)(DATA))
+	ROUND(SE, SA, SB, SC, SD, F1, OFFSET(11)(DATA))
+	ROUND(SD, SE, SA, SB, SC, F1, OFFSET(12)(DATA))
+	ROUND(SC, SD, SE, SA, SB, F1, OFFSET(13)(DATA))
+	ROUND(SB, SC, SD, SE, SA, F1, OFFSET(14)(DATA))
+
+	ROUND(SA, SB, SC, SD, SE, F1, OFFSET(15)(DATA))
+	EXPAND(16); ROUND(SE, SA, SB, SC, SD, F1, TMP)
+	EXPAND(17); ROUND(SD, SE, SA, SB, SC, F1, TMP)
+	EXPAND(18); ROUND(SC, SD, SE, SA, SB, F1, TMP)
+	EXPAND(19); ROUND(SB, SC, SD, SE, SA, F1, TMP)
+
+	movl	K2VALUE, CONST
+	EXPAND(20); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+	EXPAND(21); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+	EXPAND(22); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+	EXPAND(23); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+	EXPAND(24); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+	EXPAND(25); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+	EXPAND(26); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+	EXPAND(27); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+	EXPAND(28); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+	EXPAND(29); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+	EXPAND(30); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+	EXPAND(31); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+	EXPAND(32); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+	EXPAND(33); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+	EXPAND(34); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+	EXPAND(35); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+	EXPAND(36); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+	EXPAND(37); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+	EXPAND(38); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+	EXPAND(39); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+	movl	K3VALUE, CONST
+	EXPAND(40); ROUND(SA, SB, SC, SD, SE, F3, TMP)
+	EXPAND(41); ROUND(SE, SA, SB, SC, SD, F3, TMP)
+	EXPAND(42); ROUND(SD, SE, SA, SB, SC, F3, TMP)
+	EXPAND(43); ROUND(SC, SD, SE, SA, SB, F3, TMP)
+	EXPAND(44); ROUND(SB, SC, SD, SE, SA, F3, TMP)
+
+	EXPAND(45); ROUND(SA, SB, SC, SD, SE, F3, TMP)
+	EXPAND(46); ROUND(SE, SA, SB, SC, SD, F3, TMP)
+	EXPAND(47); ROUND(SD, SE, SA, SB, SC, F3, TMP)
+	EXPAND(48); ROUND(SC, SD, SE, SA, SB, F3, TMP)
+	EXPAND(49); ROUND(SB, SC, SD, SE, SA, F3, TMP)
+
+	EXPAND(50); ROUND(SA, SB, SC, SD, SE, F3, TMP)
+	EXPAND(51); ROUND(SE, SA, SB, SC, SD, F3, TMP)
+	EXPAND(52); ROUND(SD, SE, SA, SB, SC, F3, TMP)
+	EXPAND(53); ROUND(SC, SD, SE, SA, SB, F3, TMP)
+	EXPAND(54); ROUND(SB, SC, SD, SE, SA, F3, TMP)
+
+	EXPAND(55); ROUND(SA, SB, SC, SD, SE, F3, TMP)
+	EXPAND(56); ROUND(SE, SA, SB, SC, SD, F3, TMP)
+	EXPAND(57); ROUND(SD, SE, SA, SB, SC, F3, TMP)
+	EXPAND(58); ROUND(SC, SD, SE, SA, SB, F3, TMP)
+	EXPAND(59); ROUND(SB, SC, SD, SE, SA, F3, TMP)
+
+	movl	K4VALUE, CONST
+	EXPAND(60); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+	EXPAND(61); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+	EXPAND(62); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+	EXPAND(63); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+	EXPAND(64); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+	EXPAND(65); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+	EXPAND(66); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+	EXPAND(67); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+	EXPAND(68); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+	EXPAND(69); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+	EXPAND(70); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+	EXPAND(71); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+	EXPAND(72); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+	EXPAND(73); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+	EXPAND(74); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+	EXPAND(75); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+	EXPAND(76); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+	EXPAND(77); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+	EXPAND(78); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+	EXPAND(79); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+	/* Update the state vector */
+	addl	SA, (STATE)
+	addl	SB, 4(STATE)
+	addl	SC, 8(STATE)
+	addl	SD, 12(STATE)
+	addl	SE, 16(STATE)
+
+	POP(rbp)
+	POP(rbx)
+	ret
+	CFI_ENDPROC
+ENDPROC(sha_transform)
diff --git a/include/linux/cryptohash.h b/include/linux/cryptohash.h
index 0da331c..b3b14a3 100644
--- a/include/linux/cryptohash.h
+++ b/include/linux/cryptohash.h
@@ -7,6 +7,8 @@
 
 #if defined(CONFIG_SHA1_X86)
 #define SHA_WORKSPACE_WORDS 0
+#elif defined(CONFIG_SHA1_X86_64)
+#define SHA_WORKSPACE_WORDS 16
 #else
 #define SHA_WORKSPACE_WORDS 80
 #endif
diff --git a/lib/Kconfig b/lib/Kconfig
index 69fdb64..23a84ed 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -132,9 +132,14 @@ config SHA1_X86
 	depends on (X86 || UML_X86) && !64BIT && X86_BSWAP
 	default y
 
+config SHA1_X86_64
+	bool
+	depends on (X86 || UML_X86) && 64BIT
+	default y
+
 config SHA1_GENERIC
 	bool
-	depends on !SHA1_X86
+	depends on !SHA1_X86 && !SHA1_X86_64
 	default y
 
 endmenu

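Callers size their scratch buffer with SHA_WORKSPACE_WORDS, so the
per-architecture values above (80 words for the generic C code, 16 for the
x86_64 assembly, 0 for the x86 assembly) directly shrink the stack frame
at each call site.  A sketch of the usual calling pattern, with an
invented helper name:

#include <linux/cryptohash.h>
#include <linux/string.h>
#include <linux/types.h>

/* Hashes exactly one 64-byte block; padding and length handling are the
 * caller's problem.  The workspace declaration is all that the optimized
 * implementations change for callers. */
static void sha1_one_block_sketch(__u32 digest[SHA_DIGEST_WORDS],
				  const char block[64])
{
	__u32 workspace[SHA_WORKSPACE_WORDS];

	sha_init(digest);
	sha_transform(digest, block, workspace);

	/* scrub the schedule if the input is sensitive */
	memset(workspace, 0, sizeof(workspace));
}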


* Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
  2007-06-08 21:42 ` [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+ Benjamin Gilbert
@ 2007-06-09  7:32   ` Jan Engelhardt
  2007-06-10  1:15     ` Benjamin Gilbert
  2007-06-09 20:11   ` [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+ Matt Mackall
  1 sibling, 1 reply; 26+ messages in thread
From: Jan Engelhardt @ 2007-06-09  7:32 UTC (permalink / raw)
  To: Benjamin Gilbert; +Cc: akpm, herbert, linux-crypto, linux-kernel

On Jun 8 2007 17:42, Benjamin Gilbert wrote:

>@@ -0,0 +1,299 @@
>+/*
>+ * x86-optimized SHA1 hash algorithm (i486 and above)
>+ *
>+ * Originally from Nettle
>+ * Ported from M4 to cpp by Benjamin Gilbert <bgilbert@cs.cmu.edu>
>+ *
>+ * Copyright (C) 2004, Niels M?ller
>+ * Copyright (C) 2006-2007 Carnegie Mellon University

UTF-8 please. Hint: it should most likely be an ö.


	Jan
-- 


* Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
  2007-06-08 21:42 ` [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+ Benjamin Gilbert
  2007-06-09  7:32   ` Jan Engelhardt
@ 2007-06-09 20:11   ` Matt Mackall
  2007-06-09 20:23     ` Jeff Garzik
  2007-06-11 12:04     ` Andi Kleen
  1 sibling, 2 replies; 26+ messages in thread
From: Matt Mackall @ 2007-06-09 20:11 UTC (permalink / raw)
  To: Benjamin Gilbert; +Cc: akpm, herbert, linux-crypto, linux-kernel

On Fri, Jun 08, 2007 at 05:42:53PM -0400, Benjamin Gilbert wrote:
> Add x86-optimized implementation of the SHA-1 hash function, taken from
> Nettle under the LGPL.  This code will be enabled on kernels compiled for
> 486es or better; kernels which support 386es will use the generic
> implementation (since we need BSWAP).
> 
> We disable building lib/sha1.o when an optimized implementation is
> available, as the library link order for x86 (and x86_64) would otherwise
> ignore the optimized version.  The existing optimized implementation for ARM
> does not do this; the library link order for that architecture appears to
> favor the arch/arm/ version automatically.  I've left this situation alone
> since I'm not familiar with the ARM code, but a !ARM condition could be
> added to CONFIG_SHA1_GENERIC if it makes sense.
> 
> The code has been tested with tcrypt and the NIST test vectors.

Have you benchmarked this against lib/sha1.c? Please post the results.
Until then, I'm frankly skeptical that your unrolled version is faster
because when I introduced lib/sha1.c the rolled version therein won by
a significant margin and had 1/10th the cache footprint.

-- 
Mathematics is the supreme nostalgia of our time.


* Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
  2007-06-09 20:11   ` [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+ Matt Mackall
@ 2007-06-09 20:23     ` Jeff Garzik
  2007-06-09 21:34       ` Matt Mackall
  2007-06-10  0:33       ` Benjamin Gilbert
  2007-06-11 12:04     ` Andi Kleen
  1 sibling, 2 replies; 26+ messages in thread
From: Jeff Garzik @ 2007-06-09 20:23 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Benjamin Gilbert, akpm, herbert, linux-crypto, linux-kernel

Matt Mackall wrote:
> On Fri, Jun 08, 2007 at 05:42:53PM -0400, Benjamin Gilbert wrote:
>> Add x86-optimized implementation of the SHA-1 hash function, taken from
>> Nettle under the LGPL.  This code will be enabled on kernels compiled for
>> 486es or better; kernels which support 386es will use the generic
>> implementation (since we need BSWAP).
>>
>> We disable building lib/sha1.o when an optimized implementation is
>> available, as the library link order for x86 (and x86_64) would otherwise
>> ignore the optimized version.  The existing optimized implementation for ARM
>> does not do this; the library link order for that architecture appears to
>> favor the arch/arm/ version automatically.  I've left this situation alone
>> since I'm not familiar with the ARM code, but a !ARM condition could be
>> added to CONFIG_SHA1_GENERIC if it makes sense.
>>
>> The code has been tested with tcrypt and the NIST test vectors.
> 
> Have you benchmarked this against lib/sha1.c? Please post the results.
> Until then, I'm frankly skeptical that your unrolled version is faster
> because when I introduced lib/sha1.c the rolled version therein won by
> a significant margin and had 1/10th the cache footprint.

Yes. And it also depends on the CPU.  Testing on a server-class
x86 CPU (often with bigger L2, and perhaps even L1, cache) will produce
different results than popular but less-capable "value" CPUs do.

	Jeff




* Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
  2007-06-09 20:23     ` Jeff Garzik
@ 2007-06-09 21:34       ` Matt Mackall
  2007-06-10  0:33       ` Benjamin Gilbert
  1 sibling, 0 replies; 26+ messages in thread
From: Matt Mackall @ 2007-06-09 21:34 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Benjamin Gilbert, akpm, herbert, linux-crypto, linux-kernel

On Sat, Jun 09, 2007 at 04:23:27PM -0400, Jeff Garzik wrote:
> Matt Mackall wrote:
> >On Fri, Jun 08, 2007 at 05:42:53PM -0400, Benjamin Gilbert wrote:
> >>Add x86-optimized implementation of the SHA-1 hash function, taken from
> >>Nettle under the LGPL.  This code will be enabled on kernels compiled for
> >>486es or better; kernels which support 386es will use the generic
> >>implementation (since we need BSWAP).
> >>
> >>We disable building lib/sha1.o when an optimized implementation is
> >>available, as the library link order for x86 (and x86_64) would otherwise
> >>ignore the optimized version.  The existing optimized implementation for 
> >>ARM
> >>does not do this; the library link order for that architecture appears to
> >>favor the arch/arm/ version automatically.  I've left this situation alone
> >>since I'm not familiar with the ARM code, but a !ARM condition could be
> >>added to CONFIG_SHA1_GENERIC if it makes sense.
> >>
> >>The code has been tested with tcrypt and the NIST test vectors.
> >
> >Have you benchmarked this against lib/sha1.c? Please post the results.
> >Until then, I'm frankly skeptical that your unrolled version is faster
> >because when I introduced lib/sha1.c the rolled version therein won by
> >a significant margin and had 1/10th the cache footprint.
> 
> Yes. And it also depends on the CPU.  Testing on a server-class
> x86 CPU (often with bigger L2, and perhaps even L1, cache) will produce
> different results than popular but less-capable "value" CPUs do.

In particular, any optimization made for "486+" CPUs is highly suspect
on any machine which runs the core at >1x the memory bus speeds.

-- 
Mathematics is the supreme nostalgia of our time.


* Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
  2007-06-09 20:23     ` Jeff Garzik
  2007-06-09 21:34       ` Matt Mackall
@ 2007-06-10  0:33       ` Benjamin Gilbert
  2007-06-10 13:59         ` Matt Mackall
  1 sibling, 1 reply; 26+ messages in thread
From: Benjamin Gilbert @ 2007-06-10  0:33 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Matt Mackall, akpm, herbert, linux-crypto, linux-kernel

Jeff Garzik wrote:
> Matt Mackall wrote:
>> Have you benchmarked this against lib/sha1.c? Please post the results.
>> Until then, I'm frankly skeptical that your unrolled version is faster
>> because when I introduced lib/sha1.c the rolled version therein won by
>> a significant margin and had 1/10th the cache footprint.

See the benchmark tables in patch 0 at the head of this thread. 
Performance improved by at least 25% in every test, and 40-60% was more 
common for the 32-bit version (on a Pentium IV).

It's not just the loop unrolling; it's the register allocation and 
spilling.  For comparison, I built SHATransform() from the 
drivers/char/random.c in 2.6.11, using gcc 3.3.5 with -O2 and 
SHA_CODE_SIZE == 3 (i.e., fully unrolled); I'm guessing this is pretty 
close to what you tested back then.  The resulting code is 49% MOV 
instructions, and 80% of *those* involve memory.  gcc4 is somewhat 
better, but it still spills a whole lot, both for the 2.6.11 unrolled 
code and for the current lib/sha1.c.

In contrast, the assembly implementation in this patch only has to go to 
memory for data and workspace (with one small exception in the F3 
rounds), and the workspace has a fifth of the cache footprint of the 
default implementation.

> Yes. And it also depends on the CPU.  Testing on a server-class
> x86 CPU (often with bigger L2, and perhaps even L1, cache) will produce
> different results than popular but less-capable "value" CPUs do.

Good point.  I benchmarked the 32-bit assembly code on a couple more boxes:

=== AMD Duron, average of 5 trials ===
Test#  Bytes/  Bytes/  Cyc/B  Cyc/B  Change
         block  update    (C)  (asm)
     0      16      16    104     72     31%
     1      64      16     52     36     31%
     2      64      64     45     29     36%
     3     256      16     33     23     30%
     4     256      64     27     17     37%
     5     256     256     24     14     42%
     6    1024      16     29     20     31%
     7    1024     256     20     11     45%
     8    1024    1024     19     11     42%
     9    2048      16     28     20     29%
    10    2048     256     19     11     42%
    11    2048    1024     18     10     44%
    12    2048    2048     18     10     44%
    13    4096      16     28     19     32%
    14    4096     256     18     10     44%
    15    4096    1024     18     10     44%
    16    4096    4096     18     10     44%
    17    8192      16     27     19     30%
    18    8192     256     18     10     44%
    19    8192    1024     18     10     44%
    20    8192    4096     17     10     41%
    21    8192    8192     17     10     41%

=== Classic Pentium, average of 5 trials ===
Test#  Bytes/  Bytes/  Cyc/B  Cyc/B  Change
         block  update    (C)  (asm)
     0      16      16    145    144      1%
     1      64      16     72     61     15%
     2      64      64     65     52     20%
     3     256      16     46     39     15%
     4     256      64     39     32     18%
     5     256     256     36     29     19%
     6    1024      16     40     33     18%
     7    1024     256     30     23     23%
     8    1024    1024     29     23     21%
     9    2048      16     39     32     18%
    10    2048     256     29     22     24%
    11    2048    1024     28     22     21%
    12    2048    2048     28     22     21%
    13    4096      16     38     32     16%
    14    4096     256     28     22     21%
    15    4096    1024     28     21     25%
    16    4096    4096     27     21     22%
    17    8192      16     38     32     16%
    18    8192     256     28     22     21%
    19    8192    1024     28     21     25%
    20    8192    4096     27     21     22%
    21    8192    8192     27     21     22%

The improvement isn't as good, but it's still noticeable.

--Benjamin Gilbert



* Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
  2007-06-09  7:32   ` Jan Engelhardt
@ 2007-06-10  1:15     ` Benjamin Gilbert
  2007-06-11 19:47       ` Benjamin Gilbert
  0 siblings, 1 reply; 26+ messages in thread
From: Benjamin Gilbert @ 2007-06-10  1:15 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: akpm, herbert, linux-crypto, linux-kernel

Jan Engelhardt wrote:
> On Jun 8 2007 17:42, Benjamin Gilbert wrote:
>> @@ -0,0 +1,299 @@
>> +/*
>> + * x86-optimized SHA1 hash algorithm (i486 and above)
>> + *
>> + * Originally from Nettle
>> + * Ported from M4 to cpp by Benjamin Gilbert <bgilbert@cs.cmu.edu>
>> + *
>> + * Copyright (C) 2004, Niels M?ller
>> + * Copyright (C) 2006-2007 Carnegie Mellon University
> 
> UTF-8 please. Hint: it should most likely be an ö.

Whoops, I had thought I had gotten that right.  I'll get updates for 
parts 2 and 3 sent out on Monday.

Thanks
--Benjamin Gilbert


* Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
  2007-06-10  0:33       ` Benjamin Gilbert
@ 2007-06-10 13:59         ` Matt Mackall
  2007-06-10 16:47           ` Benjamin Gilbert
  2007-06-11 17:39           ` Benjamin Gilbert
  0 siblings, 2 replies; 26+ messages in thread
From: Matt Mackall @ 2007-06-10 13:59 UTC (permalink / raw)
  To: Benjamin Gilbert; +Cc: Jeff Garzik, akpm, herbert, linux-crypto, linux-kernel

On Sat, Jun 09, 2007 at 08:33:25PM -0400, Benjamin Gilbert wrote:
> Jeff Garzik wrote:
> >Matt Mackall wrote:
> >>Have you benchmarked this against lib/sha1.c? Please post the results.
> >>Until then, I'm frankly skeptical that your unrolled version is faster
> >>because when I introduced lib/sha1.c the rolled version therein won by
> >>a significant margin and had 1/10th the cache footprint.
> 
> See the benchmark tables in patch 0 at the head of this thread. 
> Performance improved by at least 25% in every test, and 40-60% was more 
> common for the 32-bit version (on a Pentium IV).
> 
> It's not just the loop unrolling; it's the register allocation and 
> spilling.  For comparison, I built SHATransform() from the 
> drivers/char/random.c in 2.6.11, using gcc 3.3.5 with -O2 and 
> SHA_CODE_SIZE == 3 (i.e., fully unrolled); I'm guessing this is pretty 
> close to what you tested back then.  The resulting code is 49% MOV 
> instructions, and 80% of *those* involve memory.  gcc4 is somewhat 
> better, but it still spills a whole lot, both for the 2.6.11 unrolled 
> code and for the current lib/sha1.c.

Wait, your benchmark is comparing against the unrolled code?

> In contrast, the assembly implementation in this patch only has to go to 
> memory for data and workspace (with one small exception in the F3 
> rounds), and the workspace has a fifth of the cache footprint of the 
> default implementation.

How big is the -code- footprint?

Earlier you wrote:

> On the aforementioned Pentium IV, /dev/urandom throughput goes from
> 3.7 MB/s to 5.6 MB/s with the patches; on the Core 2, it increases
> from 5.5 MB/s to 8.1 MB/s.

Whoa. We've regressed something horrible here:

http://groups.google.com/group/linux.kernel/msg/fba056363c99d4f9?dmode=source&hl=en

In 2003, I was getting 17MB/s out of my Athlon. Now I'm getting 2.7MB/s.
Were your tests with or without the latest /dev/urandom fixes? This
one in particular:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.21.y.git;a=commitdiff;h=374f167dfb97c1785515a0c41e32a66b414859a8

-- 
Mathematics is the supreme nostalgia of our time.


* Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
  2007-06-10 13:59         ` Matt Mackall
@ 2007-06-10 16:47           ` Benjamin Gilbert
  2007-06-10 17:33             ` Matt Mackall
  2007-06-11 17:39           ` Benjamin Gilbert
  1 sibling, 1 reply; 26+ messages in thread
From: Benjamin Gilbert @ 2007-06-10 16:47 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Jeff Garzik, akpm, herbert, linux-crypto, linux-kernel

Matt Mackall wrote:
> On Sat, Jun 09, 2007 at 08:33:25PM -0400, Benjamin Gilbert wrote:
>> It's not just the loop unrolling; it's the register allocation and 
>> spilling.  For comparison, I built SHATransform() from the 
>> drivers/char/random.c in 2.6.11, using gcc 3.3.5 with -O2 and 
>> SHA_CODE_SIZE == 3 (i.e., fully unrolled); I'm guessing this is pretty 
>> close to what you tested back then.  The resulting code is 49% MOV 
>> instructions, and 80% of *those* involve memory.  gcc4 is somewhat 
>> better, but it still spills a whole lot, both for the 2.6.11 unrolled 
>> code and for the current lib/sha1.c.
> 
> Wait, your benchmark is comparing against the unrolled code?

No, it's comparing the current lib/sha1.c to the optimized code in the 
patch.  I was just pointing out that the unrolled code you were likely 
testing against, back then, may not have been very good.  (Though I 
assumed that you were talking about the unrolled code in random.c, not 
the code in CryptoAPI, so that might change the numbers some.  It 
appears from the post you linked below that the unrolled CryptoAPI code 
still beat the rolled version?)

> How big is the -code- footprint?

About 3700 bytes for the 32-bit version of sha_transform().

> Whoa. We've regressed something horrible here:
> 
> http://groups.google.com/group/linux.kernel/msg/fba056363c99d4f9?dmode=source&hl=en
> 
> In 2003, I was getting 17MB/s out of my Athlon. Now I'm getting 2.7MB/s.
> Were your tests with or without the latest /dev/urandom fixes? This
> one in particular:
> 
> http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.21.y.git;a=commitdiff;h=374f167dfb97c1785515a0c41e32a66b414859a8

I'm not in front of that machine right now; I can check tomorrow.  For 
what it's worth, I've seen equivalent performance (a few MB/s) on a 
range of fairly-recent kernels.

--Benjamin Gilbert


* Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
  2007-06-10 16:47           ` Benjamin Gilbert
@ 2007-06-10 17:33             ` Matt Mackall
  0 siblings, 0 replies; 26+ messages in thread
From: Matt Mackall @ 2007-06-10 17:33 UTC (permalink / raw)
  To: Benjamin Gilbert; +Cc: Jeff Garzik, akpm, herbert, linux-crypto, linux-kernel

On Sun, Jun 10, 2007 at 12:47:19PM -0400, Benjamin Gilbert wrote:
> Matt Mackall wrote:
> >On Sat, Jun 09, 2007 at 08:33:25PM -0400, Benjamin Gilbert wrote:
> >>It's not just the loop unrolling; it's the register allocation and 
> >>spilling.  For comparison, I built SHATransform() from the 
> >>drivers/char/random.c in 2.6.11, using gcc 3.3.5 with -O2 and 
> >>SHA_CODE_SIZE == 3 (i.e., fully unrolled); I'm guessing this is pretty 
> >>close to what you tested back then.  The resulting code is 49% MOV 
> >>instructions, and 80% of *those* involve memory.  gcc4 is somewhat 
> >>better, but it still spills a whole lot, both for the 2.6.11 unrolled 
> >>code and for the current lib/sha1.c.
> >
> >Wait, your benchmark is comparing against the unrolled code?
> 
> No, it's comparing the current lib/sha1.c to the optimized code in the 
> patch.  I was just pointing out that the unrolled code you were likely 
> testing against, back then, may not have been very good.  (Though I 
> assumed that you were talking about the unrolled code in random.c, not 
> the code in CryptoAPI, so that might change the numbers some.  It 
> appears from the post you linked below that the unrolled CryptoAPI code 
> still beat the rolled version?)

That predates lib/sha1.c by a while.

> >How big is the -code- footprint?
> 
> About 3700 bytes for the 32-bit version of sha_transform().

lib/sha1.c's footprint is... 621 bytes today. Huh. That's up from 466
bytes when it was introduced and no one's touched it:

http://search.luky.org/ML/linux-kernel.2005/msg06648.html

Stupid compilers.

But anyway. Cache footprint matters. The two big users of SHA1 in the
kernel are /dev/random and IPSec, both of which typically operate on
small chunks of data.

-- 
Mathematics is the supreme nostalgia of our time.


* Re: [PATCH 3/3] [CRYPTO] Add optimized SHA-1 implementation for x86_64
  2007-06-08 21:42 ` [PATCH 3/3] [CRYPTO] Add optimized SHA-1 implementation for x86_64 Benjamin Gilbert
@ 2007-06-11 12:01   ` Andi Kleen
  2007-06-11 19:45     ` Benjamin Gilbert
  0 siblings, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2007-06-11 12:01 UTC (permalink / raw)
  To: Benjamin Gilbert; +Cc: akpm, herbert, linux-crypto, linux-kernel

Benjamin Gilbert <bgilbert@cs.cmu.edu> writes:

> +/* push/pop wrappers that update the DWARF unwind table */
> +#define PUSH(regname)						\
> +	push			%regname;			\
> +	CFI_ADJUST_CFA_OFFSET	8;				\
> +	CFI_REL_OFFSET		regname, 0
> +
> +#define POP(regname)						\
> +	pop			%regname;			\
> +	CFI_ADJUST_CFA_OFFSET	-8;				\
> +	CFI_RESTORE		regname

Please don't do these kinds of wrappers. They just obfuscate the code.

And BTW plain gas macros (.macro) are much nicer to read
than cpp macros, too.


> +#define EXPAND(i)						\
> +	movl	OFFSET(i % 16)(DATA), TMP;			\
> +	xorl	OFFSET((i + 2) % 16)(DATA), TMP;		\

Such overlapping memory accesses are somewhat dangerous as they tend
to stall some CPUs.  Better probably to do a quad load and then extract.

If you care about the last cycle I would suggest you run 
it at least once through the Pipeline simulator in the Linux
version of AMD CodeAnalyst or through vtune.

I haven't checked in detail if it's possible but it's suspicious you
never use quad operations for anything. You keep at least half
the CPU's bits idle all the time.

> +	EXPAND(75); ROUND(SA, SB, SC, SD, SE, F2, TMP)
> +	EXPAND(76); ROUND(SE, SA, SB, SC, SD, F2, TMP)
> +	EXPAND(77); ROUND(SD, SE, SA, SB, SC, F2, TMP)
> +	EXPAND(78); ROUND(SC, SD, SE, SA, SB, F2, TMP)
> +	EXPAND(79); ROUND(SB, SC, SD, SE, SA, F2, TMP)

Gut feeling is that the unroll factor is far too large.
Have you tried a smaller one? That would save icache,
which is very important in the kernel. Unlike in your micro benchmark,
when kernel code runs normally the caches are cold, and smaller is faster then.
And most kernel SHA applications don't process very much data anyway,
so startup costs are important.

> diff --git a/lib/Kconfig b/lib/Kconfig
> index 69fdb64..23a84ed 100644
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -132,9 +132,14 @@ config SHA1_X86
>  	depends on (X86 || UML_X86) && !64BIT && X86_BSWAP
>  	default y
>  
> +config SHA1_X86_64
> +	bool
> +	depends on (X86 || UML_X86) && 64BIT
> +	default y
> +
>  config SHA1_GENERIC
>  	bool
> -	depends on !SHA1_X86
> +	depends on !SHA1_X86 && !SHA1_X86_64

Better define a SHA_ARCH_OPTIMIZED helper symbol, otherwise
this will get messy as more architectures add optimized versions.

-Andi

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
  2007-06-09 20:11   ` [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+ Matt Mackall
  2007-06-09 20:23     ` Jeff Garzik
@ 2007-06-11 12:04     ` Andi Kleen
  1 sibling, 0 replies; 26+ messages in thread
From: Andi Kleen @ 2007-06-11 12:04 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Benjamin Gilbert, akpm, herbert, linux-crypto, linux-kernel

Matt Mackall <mpm@selenic.com> writes:
> 
> Have you benchmarked this against lib/sha1.c? Please post the results.
> Until then, I'm frankly skeptical that your unrolled version is faster
> because when I introduced lib/sha1.c the rolled version therein won by
> a significant margin and had 1/10th the cache footprint.

I would always suggest benchmarking such functions with forced cold i/d caches

(memset(x, 0, 5*1024*1024) to flush the data cache, and running some very
large generated function every few iterations of the benchmark to flush the icache)
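
A minimal sketch of what such a cold-cache harness could look like (the 5 MB
figure is taken from the suggestion above; cache_buster() stands in for the
large generated function and is hypothetical, not part of the patches):

/* Hedged sketch: evict the D-cache with a large memset and the I-cache by
 * calling a huge generated function before each timed call, so every
 * iteration measures the cold-cache cost of the transform. */
#include <string.h>
#include <stdint.h>

#define EVICT_SIZE (5 * 1024 * 1024)
static char evict[EVICT_SIZE];

extern void cache_buster(void);  /* several hundred KB of generated code */
extern void sha_transform(uint32_t *digest, const char *in, uint32_t *W);

static void cold_iteration(uint32_t *digest, const char *in, uint32_t *W)
{
	memset(evict, 0, sizeof(evict));  /* force the data cache cold */
	cache_buster();                   /* force the instruction cache cold */
	sha_transform(digest, in, W);     /* the call actually being measured */
}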

-Andi

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
  2007-06-10 13:59         ` Matt Mackall
  2007-06-10 16:47           ` Benjamin Gilbert
@ 2007-06-11 17:39           ` Benjamin Gilbert
  1 sibling, 0 replies; 26+ messages in thread
From: Benjamin Gilbert @ 2007-06-11 17:39 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Jeff Garzik, akpm, herbert, linux-crypto, linux-kernel

Matt Mackall wrote:
> In 2003, I was getting 17MB/s out of my Athlon. Now I'm getting 2.7MB/s.
> Were your tests with or without the latest /dev/urandom fixes? This
> one in particular:
> 
> http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.21.y.git;a=commitdiff;h=374f167dfb97c1785515a0c41e32a66b414859a8

With.  I just tried 2.6.11 (the oldest that will boot) on the Pentium IV 
box and got 3.7 MB/s, so if it's a regression it's been around for a while.

--Benjamin Gilbert

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/3] [CRYPTO] Add optimized SHA-1 implementation for x86_64
  2007-06-11 12:01   ` Andi Kleen
@ 2007-06-11 19:45     ` Benjamin Gilbert
  0 siblings, 0 replies; 26+ messages in thread
From: Benjamin Gilbert @ 2007-06-11 19:45 UTC (permalink / raw)
  To: Andi Kleen; +Cc: akpm, herbert, linux-crypto, linux-kernel, linux

Andi Kleen wrote:
> Benjamin Gilbert <bgilbert@cs.cmu.edu> writes:
>> +#define EXPAND(i)						\
>> +	movl	OFFSET(i % 16)(DATA), TMP;			\
>> +	xorl	OFFSET((i + 2) % 16)(DATA), TMP;		\
> 
> Such overlapping memory accesses are somewhat dangerous as they tend
> to stall some CPUs.  Better probably to do a quad load and then extract.

OFFSET(i) is defined as 4*(i), so they don't actually overlap. 
(Arguably that macro should go away.)

> I haven't checked in detail if it's possible but it's suspicious you
> never use quad operations for anything. You keep at least half
> the CPU's bits idle all the time.

SHA-1 fundamentally wants to work with 32-bit quantities.  It might be 
possible to use quad operations for some things, with sufficient 
cleverness, but I doubt it'd be worth the effort.

> Gut feeling is that the unroll factor is far too large.
> Have you tried a smaller one? That would save icache
> which is very important in the kernel.

That seems to be the consensus.  I'll see if I can find some time to try 
linux@horizon.com's suggestion and report back.

I don't think, though, that cache footprint is the *only* thing that 
matters.  Leaving aside /dev/urandom, there are cases where throughput 
matters a lot.  This patch set came out of some work on a hashing block 
device driver in which SHA is, by far, the biggest CPU user.  One could 
imagine content-addressable filesystems, or even IPsec under the right 
workloads, being in a similar situation.

Would it be more palatable to roll the patch as an optimized CryptoAPI 
module rather than as a lib/sha1.c replacement?  That wouldn't help 
/dev/urandom, of course, but for other cases it would allow the user to 
ask for the optimized version if needed, and not pay the footprint costs 
otherwise.

--Benjamin Gilbert


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
  2007-06-10  1:15     ` Benjamin Gilbert
@ 2007-06-11 19:47       ` Benjamin Gilbert
  2007-06-11 19:50         ` [PATCH] " Benjamin Gilbert
  2007-06-11 19:52         ` [PATCH] [CRYPTO] Add optimized SHA-1 implementation for x86_64 Benjamin Gilbert
  0 siblings, 2 replies; 26+ messages in thread
From: Benjamin Gilbert @ 2007-06-11 19:47 UTC (permalink / raw)
  To: Benjamin Gilbert
  Cc: Jan Engelhardt, akpm, herbert, linux-crypto, linux-kernel

Benjamin Gilbert wrote:
> Jan Engelhardt wrote:
>> UTF-8 please. Hint: it should most likely be an ö.
> 
> Whoops, I had thought I had gotten that right.  I'll get updates for 
> parts 2 and 3 sent out on Monday.

I'm sending the corrected parts 2 and 3 as replies to this email.  The 
UTF-8 fix is the *only* thing that has changed.  The patches themselves 
are moot in their current form, but I wanted to make sure they were 
archived with the correct attribution.

--Benjamin Gilbert

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH] [CRYPTO] Add optimized SHA-1 implementation for i486+
  2007-06-11 19:47       ` Benjamin Gilbert
@ 2007-06-11 19:50         ` Benjamin Gilbert
  2007-06-11 19:52         ` [PATCH] [CRYPTO] Add optimized SHA-1 implementation for x86_64 Benjamin Gilbert
  1 sibling, 0 replies; 26+ messages in thread
From: Benjamin Gilbert @ 2007-06-11 19:50 UTC (permalink / raw)
  To: jengelh; +Cc: akpm, herbert, linux-crypto, linux-kernel

Add x86-optimized implementation of the SHA-1 hash function, taken from
Nettle under the LGPL.  This code will be enabled on kernels compiled for
486es or better; kernels which support 386es will use the generic
implementation (since we need BSWAP).

We disable building lib/sha1.o when an optimized implementation is
available, as the library link order for x86 (and x86_64) would otherwise
ignore the optimized version.  The existing optimized implementation for ARM
does not do this; the library link order for that architecture appears to
favor the arch/arm/ version automatically.  I've left this situation alone
since I'm not familiar with the ARM code, but a !ARM condition could be
added to CONFIG_SHA1_GENERIC if it makes sense.

The code has been tested with tcrypt and the NIST test vectors.

Signed-off-by: Benjamin Gilbert <bgilbert@cs.cmu.edu>
---

 arch/i386/kernel/i386_ksyms.c |    5 +
 arch/i386/lib/Makefile        |    1 
 arch/i386/lib/sha1.S          |  299 +++++++++++++++++++++++++++++++++++++++++
 include/linux/cryptohash.h    |    9 +
 lib/Kconfig                   |   13 ++
 lib/Makefile                  |    3 
 6 files changed, 328 insertions(+), 2 deletions(-)

diff --git a/arch/i386/kernel/i386_ksyms.c b/arch/i386/kernel/i386_ksyms.c
index e3d4b73..812bc4e 100644
--- a/arch/i386/kernel/i386_ksyms.c
+++ b/arch/i386/kernel/i386_ksyms.c
@@ -1,4 +1,5 @@
 #include <linux/module.h>
+#include <linux/cryptohash.h>
 #include <asm/checksum.h>
 #include <asm/desc.h>
 
@@ -28,3 +29,7 @@ EXPORT_SYMBOL(__read_lock_failed);
 #endif
 
 EXPORT_SYMBOL(csum_partial);
+
+#ifdef CONFIG_SHA1_X86
+EXPORT_SYMBOL(sha_transform);
+#endif
diff --git a/arch/i386/lib/Makefile b/arch/i386/lib/Makefile
index 22d8ac5..69f4845 100644
--- a/arch/i386/lib/Makefile
+++ b/arch/i386/lib/Makefile
@@ -6,6 +6,7 @@
 lib-y = checksum.o delay.o usercopy.o getuser.o putuser.o memcpy.o strstr.o \
 	bitops.o semaphore.o
 
+lib-$(CONFIG_SHA1_X86) += sha1.o
 lib-$(CONFIG_X86_USE_3DNOW) += mmx.o
 
 obj-$(CONFIG_SMP)	+= msr-on-cpu.o
diff --git a/arch/i386/lib/sha1.S b/arch/i386/lib/sha1.S
new file mode 100644
index 0000000..a84d829
--- /dev/null
+++ b/arch/i386/lib/sha1.S
@@ -0,0 +1,299 @@
+/*
+ * x86-optimized SHA1 hash algorithm (i486 and above)
+ *
+ * Originally from Nettle
+ * Ported from M4 to cpp by Benjamin Gilbert <bgilbert@cs.cmu.edu>
+ *
+ * Copyright (C) 2004, Niels Möller
+ * Copyright (C) 2006-2007 Carnegie Mellon University
+ *
+ * This library is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License as
+ * published by the Free Software Foundation.  A copy of the GNU Lesser General
+ * Public License should have been distributed along with this library in the
+ * file LICENSE.LGPL.
+ *
+ * This library is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
+ * for more details.
+ */
+
+#include <linux/linkage.h>
+#include <asm/dwarf2.h>
+
+/* Register usage */
+#define SA	%eax
+#define SB	%ebx
+#define SC	%ecx
+#define SD	%edx
+#define SE	%ebp
+#define DATA	%esp
+#define TMP	%edi
+#define TMP2	%esi			/* Used by SWAP and F3 */
+#define TMP3	64(%esp)
+
+/* Constants */
+#define K1VALUE	$0x5A827999		/* Rounds  0-19 */
+#define K2VALUE	$0x6ED9EBA1		/* Rounds 20-39 */
+#define K3VALUE	$0x8F1BBCDC		/* Rounds 40-59 */
+#define K4VALUE	$0xCA62C1D6		/* Rounds 60-79 */
+
+/* Convert stack offsets in words to offsets in bytes */
+#define OFFSET(i) 4*(i)
+
+/* Reads the input via TMP2 into register, byteswaps it, and stores it in
+   the DATA array. */
+#define SWAP(index, register)					\
+	movl	OFFSET(index)(TMP2), register;			\
+	bswap	register;					\
+	movl	register, OFFSET(index)(DATA)
+
+/* Sets the workspace word at the given index to TMP. */
+#define CLEAR(index)						\
+	movl	TMP, OFFSET(index)(DATA)
+
+/* pushl/popl wrappers that update the DWARF unwind table */
+#define PUSH(regname)						\
+	pushl			%regname;			\
+	CFI_ADJUST_CFA_OFFSET	4;				\
+	CFI_REL_OFFSET		regname, 0
+
+#define POP(regname)						\
+	popl			%regname;			\
+	CFI_ADJUST_CFA_OFFSET	-4;				\
+	CFI_RESTORE		regname
+
+/*
+ * expand(i) is the expansion function
+ *
+ *   W[i] = (W[i - 16] ^ W[i - 14] ^ W[i - 8] ^ W[i - 3]) <<< 1
+ *
+ * where W[i] is stored in DATA[i mod 16].
+ *
+ * Result is stored back in W[i], and also left in TMP, the only
+ * register that is used.
+ */
+#define EXPAND(i)						\
+	movl	OFFSET(i % 16)(DATA), TMP;			\
+	xorl	OFFSET((i + 2) % 16)(DATA), TMP;		\
+	xorl	OFFSET((i + 8) % 16)(DATA), TMP;		\
+	xorl	OFFSET((i + 13) % 16)(DATA), TMP;		\
+	roll	$1, TMP;					\
+	movl	TMP, OFFSET(i % 16)(DATA)
+
+/*
+ * The f functions,
+ *
+ *  f1(x,y,z) = z ^ (x & (y ^ z))
+ *  f2(x,y,z) = x ^ y ^ z
+ *  f3(x,y,z) = (x & y) | (z & (x | y))
+ *  f4 = f2
+ *
+ * The macro Fk(x,y,z) computes fk(x,y,z).
+ * Result is left in TMP.
+ */
+#define F1(x,y,z)						\
+	movl	z, TMP;						\
+	xorl	y, TMP;						\
+	andl	x, TMP;						\
+	xorl	z, TMP
+
+#define F2(x,y,z)						\
+	movl	x, TMP;						\
+	xorl	y, TMP;						\
+	xorl	z, TMP
+
+#define F3(x,y,z)						\
+	movl	x, TMP2;					\
+	andl	y, TMP2;					\
+	movl	x, TMP;						\
+	orl	y, TMP;						\
+	andl	z, TMP;						\
+	orl	TMP2, TMP
+
+/*
+ * The form of one sha1 round is
+ *
+ *   a' = e + a <<< 5 + f( b, c, d ) + k + w;
+ *   b' = a;
+ *   c' = b <<< 30;
+ *   d' = c;
+ *   e' = d;
+ *
+ * where <<< denotes rotation. We permute our variables, so that we
+ * instead get
+ *
+ *   e += a <<< 5 + f( b, c, d ) + k + w;
+ *   b <<<= 30
+ *
+ * Using the TMP register for the rotate could be avoided, by rotating
+ * %a in place, adding, and then rotating back.
+ */
+#define ROUND(a,b,c,d,e,f,k,w)					\
+	addl	k, e;						\
+	addl	w, e;						\
+	f(b,c,d);						\
+	addl	TMP, e;						\
+	movl	a, TMP;						\
+	roll	$5, TMP;					\
+	addl	TMP, e;						\
+	roll	$30, b;
+
+/* sha_transform(__u32 *digest, const char *in, __u32 *W) */
+/* The workspace argument is ignored; we don't have enough registers to handle
+   a workspace that isn't on our own stack. */
+.text
+ENTRY(sha_transform)
+	CFI_STARTPROC
+	/* save all registers that need to be saved */
+	PUSH(ebx)		/* 80(%esp) */
+	PUSH(ebp)		/* 76(%esp) */
+	PUSH(esi)		/* 72(%esp) */
+	PUSH(edi)		/* 68(%esp) */
+
+	subl	$68, %esp	/* %esp = W */
+	CFI_ADJUST_CFA_OFFSET	68
+
+	/* Load and byteswap data */
+	movl	92(%esp), TMP2
+
+	SWAP( 0, %eax); SWAP( 1, %ebx); SWAP( 2, %ecx); SWAP( 3, %edx)
+	SWAP( 4, %eax); SWAP( 5, %ebx); SWAP( 6, %ecx); SWAP( 7, %edx)
+	SWAP( 8, %eax); SWAP( 9, %ebx); SWAP(10, %ecx); SWAP(11, %edx)
+	SWAP(12, %eax); SWAP(13, %ebx); SWAP(14, %ecx); SWAP(15, %edx)
+
+	/* load the state vector */
+	movl	88(%esp),TMP
+	movl	(TMP),   SA
+	movl	4(TMP),  SB
+	movl	8(TMP),  SC
+	movl	12(TMP), SD
+	movl	16(TMP), SE
+
+	movl	K1VALUE, TMP2
+	ROUND(SA, SB, SC, SD, SE, F1, TMP2, OFFSET( 0)(DATA))
+	ROUND(SE, SA, SB, SC, SD, F1, TMP2, OFFSET( 1)(DATA))
+	ROUND(SD, SE, SA, SB, SC, F1, TMP2, OFFSET( 2)(DATA))
+	ROUND(SC, SD, SE, SA, SB, F1, TMP2, OFFSET( 3)(DATA))
+	ROUND(SB, SC, SD, SE, SA, F1, TMP2, OFFSET( 4)(DATA))
+
+	ROUND(SA, SB, SC, SD, SE, F1, TMP2, OFFSET( 5)(DATA))
+	ROUND(SE, SA, SB, SC, SD, F1, TMP2, OFFSET( 6)(DATA))
+	ROUND(SD, SE, SA, SB, SC, F1, TMP2, OFFSET( 7)(DATA))
+	ROUND(SC, SD, SE, SA, SB, F1, TMP2, OFFSET( 8)(DATA))
+	ROUND(SB, SC, SD, SE, SA, F1, TMP2, OFFSET( 9)(DATA))
+
+	ROUND(SA, SB, SC, SD, SE, F1, TMP2, OFFSET(10)(DATA))
+	ROUND(SE, SA, SB, SC, SD, F1, TMP2, OFFSET(11)(DATA))
+	ROUND(SD, SE, SA, SB, SC, F1, TMP2, OFFSET(12)(DATA))
+	ROUND(SC, SD, SE, SA, SB, F1, TMP2, OFFSET(13)(DATA))
+	ROUND(SB, SC, SD, SE, SA, F1, TMP2, OFFSET(14)(DATA))
+
+	ROUND(SA, SB, SC, SD, SE, F1, TMP2, OFFSET(15)(DATA))
+	EXPAND(16); ROUND(SE, SA, SB, SC, SD, F1, TMP2, TMP)
+	EXPAND(17); ROUND(SD, SE, SA, SB, SC, F1, TMP2, TMP)
+	EXPAND(18); ROUND(SC, SD, SE, SA, SB, F1, TMP2, TMP)
+	EXPAND(19); ROUND(SB, SC, SD, SE, SA, F1, TMP2, TMP)
+
+	/* TMP2 is free to use in these rounds */
+	movl	K2VALUE, TMP2
+	EXPAND(20); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+	EXPAND(21); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+	EXPAND(22); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+	EXPAND(23); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+	EXPAND(24); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+	EXPAND(25); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+	EXPAND(26); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+	EXPAND(27); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+	EXPAND(28); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+	EXPAND(29); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+	EXPAND(30); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+	EXPAND(31); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+	EXPAND(32); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+	EXPAND(33); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+	EXPAND(34); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+	EXPAND(35); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+	EXPAND(36); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+	EXPAND(37); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+	EXPAND(38); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+	EXPAND(39); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+	/* We have to put this constant on the stack */
+	movl	K3VALUE, TMP3
+	EXPAND(40); ROUND(SA, SB, SC, SD, SE, F3, TMP3, TMP)
+	EXPAND(41); ROUND(SE, SA, SB, SC, SD, F3, TMP3, TMP)
+	EXPAND(42); ROUND(SD, SE, SA, SB, SC, F3, TMP3, TMP)
+	EXPAND(43); ROUND(SC, SD, SE, SA, SB, F3, TMP3, TMP)
+	EXPAND(44); ROUND(SB, SC, SD, SE, SA, F3, TMP3, TMP)
+
+	EXPAND(45); ROUND(SA, SB, SC, SD, SE, F3, TMP3, TMP)
+	EXPAND(46); ROUND(SE, SA, SB, SC, SD, F3, TMP3, TMP)
+	EXPAND(47); ROUND(SD, SE, SA, SB, SC, F3, TMP3, TMP)
+	EXPAND(48); ROUND(SC, SD, SE, SA, SB, F3, TMP3, TMP)
+	EXPAND(49); ROUND(SB, SC, SD, SE, SA, F3, TMP3, TMP)
+
+	EXPAND(50); ROUND(SA, SB, SC, SD, SE, F3, TMP3, TMP)
+	EXPAND(51); ROUND(SE, SA, SB, SC, SD, F3, TMP3, TMP)
+	EXPAND(52); ROUND(SD, SE, SA, SB, SC, F3, TMP3, TMP)
+	EXPAND(53); ROUND(SC, SD, SE, SA, SB, F3, TMP3, TMP)
+	EXPAND(54); ROUND(SB, SC, SD, SE, SA, F3, TMP3, TMP)
+
+	EXPAND(55); ROUND(SA, SB, SC, SD, SE, F3, TMP3, TMP)
+	EXPAND(56); ROUND(SE, SA, SB, SC, SD, F3, TMP3, TMP)
+	EXPAND(57); ROUND(SD, SE, SA, SB, SC, F3, TMP3, TMP)
+	EXPAND(58); ROUND(SC, SD, SE, SA, SB, F3, TMP3, TMP)
+	EXPAND(59); ROUND(SB, SC, SD, SE, SA, F3, TMP3, TMP)
+
+	movl	K4VALUE, TMP2
+	EXPAND(60); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+	EXPAND(61); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+	EXPAND(62); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+	EXPAND(63); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+	EXPAND(64); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+	EXPAND(65); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+	EXPAND(66); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+	EXPAND(67); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+	EXPAND(68); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+	EXPAND(69); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+	EXPAND(70); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+	EXPAND(71); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+	EXPAND(72); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+	EXPAND(73); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+	EXPAND(74); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+	EXPAND(75); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+	EXPAND(76); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+	EXPAND(77); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+	EXPAND(78); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+	EXPAND(79); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+	/* Update the state vector */
+	movl	88(%esp),TMP
+	addl	SA, (TMP)
+	addl	SB, 4(TMP)
+	addl	SC, 8(TMP)
+	addl	SD, 12(TMP)
+	addl	SE, 16(TMP)
+
+	/* Clear the workspace for security */
+	xorl	TMP, TMP
+	CLEAR( 0); CLEAR( 1); CLEAR( 2); CLEAR( 3);
+	CLEAR( 4); CLEAR( 5); CLEAR( 6); CLEAR( 7);
+	CLEAR( 8); CLEAR( 9); CLEAR(10); CLEAR(11);
+	CLEAR(12); CLEAR(13); CLEAR(14); CLEAR(15);
+
+	addl	$68, %esp
+	CFI_ADJUST_CFA_OFFSET	-68
+	POP(edi)
+	POP(esi)
+	POP(ebp)
+	POP(ebx)
+	ret
+	CFI_ENDPROC
+ENDPROC(sha_transform)
diff --git a/include/linux/cryptohash.h b/include/linux/cryptohash.h
index a172401..0da331c 100644
--- a/include/linux/cryptohash.h
+++ b/include/linux/cryptohash.h
@@ -1,8 +1,15 @@
 #ifndef __CRYPTOHASH_H
 #define __CRYPTOHASH_H
 
+#include <linux/linkage.h>
+
 #define SHA_DIGEST_WORDS 5
+
+#if defined(CONFIG_SHA1_X86)
+#define SHA_WORKSPACE_WORDS 0
+#else
 #define SHA_WORKSPACE_WORDS 80
+#endif
 
 /**
  * sha_init - initialize the vectors for a SHA1 digest
@@ -17,7 +24,7 @@ static inline void sha_init(__u32 *buf)
 	buf[4] = 0xc3d2e1f0;
 }
 
-void sha_transform(__u32 *digest, const char *data, __u32 *W);
+asmlinkage void sha_transform(__u32 *digest, const char *data, __u32 *W);
 
 __u32 half_md4_transform(__u32 buf[4], __u32 const in[8]);
 
diff --git a/lib/Kconfig b/lib/Kconfig
index 2e7ae6b..69fdb64 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -124,4 +124,17 @@ config HAS_DMA
 	depends on !NO_DMA
 	default y
 
+#
+# Optimized SHA-1 support is autoconfigured
+#
+config SHA1_X86
+	bool
+	depends on (X86 || UML_X86) && !64BIT && X86_BSWAP
+	default y
+
+config SHA1_GENERIC
+	bool
+	depends on !SHA1_X86
+	default y
+
 endmenu
diff --git a/lib/Makefile b/lib/Makefile
index c8c8e20..f67be29 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -5,10 +5,11 @@
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o \
 	 idr.o int_sqrt.o bitmap.o extable.o prio_tree.o \
-	 sha1.o irq_regs.o reciprocal_div.o
+	 irq_regs.o reciprocal_div.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
+lib-$(CONFIG_SHA1_GENERIC) += sha1.o
 
 lib-y	+= kobject.o kref.o kobject_uevent.o klist.o
 


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH] [CRYPTO] Add optimized SHA-1 implementation for x86_64
  2007-06-11 19:47       ` Benjamin Gilbert
  2007-06-11 19:50         ` [PATCH] " Benjamin Gilbert
@ 2007-06-11 19:52         ` Benjamin Gilbert
  1 sibling, 0 replies; 26+ messages in thread
From: Benjamin Gilbert @ 2007-06-11 19:52 UTC (permalink / raw)
  To: jengelh; +Cc: akpm, herbert, linux-crypto, linux-kernel

Add optimized implementation of the SHA-1 hash function for x86_64, ported
from the x86 implementation in Nettle (which is LGPLed).

The code has been tested with tcrypt and the NIST test vectors.

Signed-off-by: Benjamin Gilbert <bgilbert@cs.cmu.edu>
---

 arch/x86_64/kernel/x8664_ksyms.c |    3 
 arch/x86_64/lib/Makefile         |    2 
 arch/x86_64/lib/sha1.S           |  281 ++++++++++++++++++++++++++++++++++++++
 include/linux/cryptohash.h       |    2 
 lib/Kconfig                      |    7 +
 5 files changed, 293 insertions(+), 2 deletions(-)

diff --git a/arch/x86_64/kernel/x8664_ksyms.c b/arch/x86_64/kernel/x8664_ksyms.c
index 77c25b3..bc641ab 100644
--- a/arch/x86_64/kernel/x8664_ksyms.c
+++ b/arch/x86_64/kernel/x8664_ksyms.c
@@ -3,6 +3,7 @@
 
 #include <linux/module.h>
 #include <linux/smp.h>
+#include <linux/cryptohash.h>
 
 #include <asm/semaphore.h>
 #include <asm/processor.h>
@@ -60,3 +61,5 @@ EXPORT_SYMBOL(init_level4_pgt);
 EXPORT_SYMBOL(load_gs_index);
 
 EXPORT_SYMBOL(_proxy_pda);
+
+EXPORT_SYMBOL(sha_transform);
diff --git a/arch/x86_64/lib/Makefile b/arch/x86_64/lib/Makefile
index c943271..6c8110b 100644
--- a/arch/x86_64/lib/Makefile
+++ b/arch/x86_64/lib/Makefile
@@ -9,5 +9,5 @@ obj-$(CONFIG_SMP)	+= msr-on-cpu.o
 
 lib-y := csum-partial.o csum-copy.o csum-wrappers.o delay.o \
 	usercopy.o getuser.o putuser.o  \
-	thunk.o clear_page.o copy_page.o bitstr.o bitops.o
+	thunk.o clear_page.o copy_page.o bitstr.o bitops.o sha1.o
 lib-y += memcpy.o memmove.o memset.o copy_user.o rwlock.o copy_user_nocache.o
diff --git a/arch/x86_64/lib/sha1.S b/arch/x86_64/lib/sha1.S
new file mode 100644
index 0000000..f928ac3
--- /dev/null
+++ b/arch/x86_64/lib/sha1.S
@@ -0,0 +1,281 @@
+/*
+ * sha1-x86_64 - x86_64-optimized SHA1 hash algorithm
+ *
+ * Originally from Nettle
+ * Ported from M4 to cpp by Benjamin Gilbert <bgilbert@cs.cmu.edu>
+ * Ported from x86 to x86_64 by Benjamin Gilbert
+ *
+ * Copyright (C) 2004, Niels Möller
+ * Copyright (C) 2006-2007 Carnegie Mellon University
+ *
+ * This library is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License as
+ * published by the Free Software Foundation.  A copy of the GNU Lesser General
+ * Public License should have been distributed along with this library in the
+ * file LICENSE.LGPL.
+ *
+ * This library is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
+ * for more details.
+ */
+
+#include <linux/linkage.h>
+#include <asm/dwarf2.h>
+
+/* Register usage.  r12-15 must be saved if they will be used.  Accessing
+   r8-r15 takes an extra instruction byte. */
+#define P_STATE	%rdi			/* Pointer parameter */
+#define P_DATA	%rsi			/* Pointer parameter */
+#define DATA	%rdx			/* Pointer parameter */
+#define SA	%edi			/* Reuses P_STATE */
+#define SB	%esi			/* Reuses P_DATA */
+#define SC	%eax
+#define SD	%ebx			/* Callee-saved */
+#define SE	%ebp			/* Callee-saved */
+#define TMP	%ecx
+#define TMP2	%r8d			/* Used by F3 */
+#define CONST	%r9d
+#define STATE	%r10
+
+/* Constants */
+#define K1VALUE	$0x5A827999		/* Rounds  0-19 */
+#define K2VALUE	$0x6ED9EBA1		/* Rounds 20-39 */
+#define K3VALUE	$0x8F1BBCDC		/* Rounds 40-59 */
+#define K4VALUE	$0xCA62C1D6		/* Rounds 60-79 */
+
+/* Convert stack offsets in 32-bit words to offsets in bytes */
+#define OFFSET(i) 4*(i)
+
+/* Reads the input via P_DATA into register, byteswaps it, and stores it in
+   the DATA array. */
+#define SWAP(index, register)					\
+	movl	OFFSET(index)(P_DATA), register;		\
+	bswap	register;					\
+	movl	register, OFFSET(index)(DATA)
+
+/* push/pop wrappers that update the DWARF unwind table */
+#define PUSH(regname)						\
+	push			%regname;			\
+	CFI_ADJUST_CFA_OFFSET	8;				\
+	CFI_REL_OFFSET		regname, 0
+
+#define POP(regname)						\
+	pop			%regname;			\
+	CFI_ADJUST_CFA_OFFSET	-8;				\
+	CFI_RESTORE		regname
+
+/*
+ * expand(i) is the expansion function
+ *
+ *   W[i] = (W[i - 16] ^ W[i - 14] ^ W[i - 8] ^ W[i - 3]) <<< 1
+ *
+ * where W[i] is stored in DATA[i mod 16].
+ *
+ * Result is stored back in W[i], and also left in TMP, the only
+ * register that is used.
+ */
+#define EXPAND(i)						\
+	movl	OFFSET(i % 16)(DATA), TMP;			\
+	xorl	OFFSET((i + 2) % 16)(DATA), TMP;		\
+	xorl	OFFSET((i + 8) % 16)(DATA), TMP;		\
+	xorl	OFFSET((i + 13) % 16)(DATA), TMP;		\
+	roll	$1, TMP;					\
+	movl	TMP, OFFSET(i % 16)(DATA)
+
+/*
+ * The f functions,
+ *
+ *  f1(x,y,z) = z ^ (x & (y ^ z))
+ *  f2(x,y,z) = x ^ y ^ z
+ *  f3(x,y,z) = (x & y) | (z & (x | y))
+ *  f4 = f2
+ *
+ * The macro Fk(x,y,z) computes fk(x,y,z).
+ * Result is left in TMP.
+ */
+#define F1(x,y,z)						\
+	movl	z, TMP;						\
+	xorl	y, TMP;						\
+	andl	x, TMP;						\
+	xorl	z, TMP
+
+#define F2(x,y,z)						\
+	movl	x, TMP;						\
+	xorl	y, TMP;						\
+	xorl	z, TMP
+
+#define F3(x,y,z)						\
+	movl	x, TMP2;					\
+	andl	y, TMP2;					\
+	movl	x, TMP;						\
+	orl	y, TMP;						\
+	andl	z, TMP;						\
+	orl	TMP2, TMP
+
+/*
+ * The form of one sha1 round is
+ *
+ *   a' = e + a <<< 5 + f( b, c, d ) + k + w;
+ *   b' = a;
+ *   c' = b <<< 30;
+ *   d' = c;
+ *   e' = d;
+ *
+ * where <<< denotes rotation. We permute our variables, so that we
+ * instead get
+ *
+ *   e += a <<< 5 + f( b, c, d ) + k + w;
+ *   b <<<= 30
+ *
+ * k is not an explicit parameter; it's always stored in CONST.
+ *
+ * Using the TMP register for the rotate could be avoided, by rotating
+ * %a in place, adding, and then rotating back.
+ */
+#define ROUND(a,b,c,d,e,f,w)					\
+	addl	CONST, e;					\
+	addl	w, e;						\
+	f(b,c,d);						\
+	addl	TMP, e;						\
+	movl	a, TMP;						\
+	roll	$5, TMP;					\
+	addl	TMP, e;						\
+	roll	$30, b;
+
+/* sha_transform(__u32 *digest, const char *in, __u32 *W) */
+/* We expect a 64-byte workspace. */
+.text
+ENTRY(sha_transform)
+	CFI_STARTPROC
+	PUSH(rbx)
+	PUSH(rbp)
+
+	/* Load and byteswap data */
+	SWAP( 0,  %eax); SWAP( 1,  %ebx); SWAP( 2,  %ecx); SWAP( 3,  %ebp)
+	SWAP( 4,  %r8d); SWAP( 5,  %r9d); SWAP( 6, %r10d); SWAP( 7, %r11d)
+	SWAP( 8,  %eax); SWAP( 9,  %ebx); SWAP(10,  %ecx); SWAP(11,  %ebp)
+	SWAP(12,  %r8d); SWAP(13,  %r9d); SWAP(14, %r10d); SWAP(15, %r11d)
+
+	/* P_DATA now dead; free up P_STATE for other uses */
+	movq	P_STATE, STATE
+
+	/* load the state vector */
+	movl	(STATE),   SA
+	movl	4(STATE),  SB
+	movl	8(STATE),  SC
+	movl	12(STATE), SD
+	movl	16(STATE), SE
+
+	movl	K1VALUE, CONST
+	ROUND(SA, SB, SC, SD, SE, F1, OFFSET( 0)(DATA))
+	ROUND(SE, SA, SB, SC, SD, F1, OFFSET( 1)(DATA))
+	ROUND(SD, SE, SA, SB, SC, F1, OFFSET( 2)(DATA))
+	ROUND(SC, SD, SE, SA, SB, F1, OFFSET( 3)(DATA))
+	ROUND(SB, SC, SD, SE, SA, F1, OFFSET( 4)(DATA))
+
+	ROUND(SA, SB, SC, SD, SE, F1, OFFSET( 5)(DATA))
+	ROUND(SE, SA, SB, SC, SD, F1, OFFSET( 6)(DATA))
+	ROUND(SD, SE, SA, SB, SC, F1, OFFSET( 7)(DATA))
+	ROUND(SC, SD, SE, SA, SB, F1, OFFSET( 8)(DATA))
+	ROUND(SB, SC, SD, SE, SA, F1, OFFSET( 9)(DATA))
+
+	ROUND(SA, SB, SC, SD, SE, F1, OFFSET(10)(DATA))
+	ROUND(SE, SA, SB, SC, SD, F1, OFFSET(11)(DATA))
+	ROUND(SD, SE, SA, SB, SC, F1, OFFSET(12)(DATA))
+	ROUND(SC, SD, SE, SA, SB, F1, OFFSET(13)(DATA))
+	ROUND(SB, SC, SD, SE, SA, F1, OFFSET(14)(DATA))
+
+	ROUND(SA, SB, SC, SD, SE, F1, OFFSET(15)(DATA))
+	EXPAND(16); ROUND(SE, SA, SB, SC, SD, F1, TMP)
+	EXPAND(17); ROUND(SD, SE, SA, SB, SC, F1, TMP)
+	EXPAND(18); ROUND(SC, SD, SE, SA, SB, F1, TMP)
+	EXPAND(19); ROUND(SB, SC, SD, SE, SA, F1, TMP)
+
+	movl	K2VALUE, CONST
+	EXPAND(20); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+	EXPAND(21); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+	EXPAND(22); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+	EXPAND(23); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+	EXPAND(24); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+	EXPAND(25); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+	EXPAND(26); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+	EXPAND(27); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+	EXPAND(28); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+	EXPAND(29); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+	EXPAND(30); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+	EXPAND(31); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+	EXPAND(32); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+	EXPAND(33); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+	EXPAND(34); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+	EXPAND(35); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+	EXPAND(36); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+	EXPAND(37); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+	EXPAND(38); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+	EXPAND(39); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+	movl	K3VALUE, CONST
+	EXPAND(40); ROUND(SA, SB, SC, SD, SE, F3, TMP)
+	EXPAND(41); ROUND(SE, SA, SB, SC, SD, F3, TMP)
+	EXPAND(42); ROUND(SD, SE, SA, SB, SC, F3, TMP)
+	EXPAND(43); ROUND(SC, SD, SE, SA, SB, F3, TMP)
+	EXPAND(44); ROUND(SB, SC, SD, SE, SA, F3, TMP)
+
+	EXPAND(45); ROUND(SA, SB, SC, SD, SE, F3, TMP)
+	EXPAND(46); ROUND(SE, SA, SB, SC, SD, F3, TMP)
+	EXPAND(47); ROUND(SD, SE, SA, SB, SC, F3, TMP)
+	EXPAND(48); ROUND(SC, SD, SE, SA, SB, F3, TMP)
+	EXPAND(49); ROUND(SB, SC, SD, SE, SA, F3, TMP)
+
+	EXPAND(50); ROUND(SA, SB, SC, SD, SE, F3, TMP)
+	EXPAND(51); ROUND(SE, SA, SB, SC, SD, F3, TMP)
+	EXPAND(52); ROUND(SD, SE, SA, SB, SC, F3, TMP)
+	EXPAND(53); ROUND(SC, SD, SE, SA, SB, F3, TMP)
+	EXPAND(54); ROUND(SB, SC, SD, SE, SA, F3, TMP)
+
+	EXPAND(55); ROUND(SA, SB, SC, SD, SE, F3, TMP)
+	EXPAND(56); ROUND(SE, SA, SB, SC, SD, F3, TMP)
+	EXPAND(57); ROUND(SD, SE, SA, SB, SC, F3, TMP)
+	EXPAND(58); ROUND(SC, SD, SE, SA, SB, F3, TMP)
+	EXPAND(59); ROUND(SB, SC, SD, SE, SA, F3, TMP)
+
+	movl	K4VALUE, CONST
+	EXPAND(60); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+	EXPAND(61); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+	EXPAND(62); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+	EXPAND(63); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+	EXPAND(64); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+	EXPAND(65); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+	EXPAND(66); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+	EXPAND(67); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+	EXPAND(68); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+	EXPAND(69); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+	EXPAND(70); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+	EXPAND(71); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+	EXPAND(72); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+	EXPAND(73); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+	EXPAND(74); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+	EXPAND(75); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+	EXPAND(76); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+	EXPAND(77); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+	EXPAND(78); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+	EXPAND(79); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+	/* Update the state vector */
+	addl	SA, (STATE)
+	addl	SB, 4(STATE)
+	addl	SC, 8(STATE)
+	addl	SD, 12(STATE)
+	addl	SE, 16(STATE)
+
+	POP(rbp)
+	POP(rbx)
+	ret
+	CFI_ENDPROC
+ENDPROC(sha_transform)
diff --git a/include/linux/cryptohash.h b/include/linux/cryptohash.h
index 0da331c..b3b14a3 100644
--- a/include/linux/cryptohash.h
+++ b/include/linux/cryptohash.h
@@ -7,6 +7,8 @@
 
 #if defined(CONFIG_SHA1_X86)
 #define SHA_WORKSPACE_WORDS 0
+#elif defined(CONFIG_SHA1_X86_64)
+#define SHA_WORKSPACE_WORDS 16
 #else
 #define SHA_WORKSPACE_WORDS 80
 #endif
diff --git a/lib/Kconfig b/lib/Kconfig
index 69fdb64..23a84ed 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -132,9 +132,14 @@ config SHA1_X86
 	depends on (X86 || UML_X86) && !64BIT && X86_BSWAP
 	default y
 
+config SHA1_X86_64
+	bool
+	depends on (X86 || UML_X86) && 64BIT
+	default y
+
 config SHA1_GENERIC
 	bool
-	depends on !SHA1_X86
+	depends on !SHA1_X86 && !SHA1_X86_64
 	default y
 
 endmenu


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] Add optimized SHA-1 implementations for x86 and x86_64
  2007-06-08 21:42 [PATCH 0/3] Add optimized SHA-1 implementations for x86 and x86_64 Benjamin Gilbert
                   ` (2 preceding siblings ...)
  2007-06-08 21:42 ` [PATCH 3/3] [CRYPTO] Add optimized SHA-1 implementation for x86_64 Benjamin Gilbert
@ 2007-06-11 20:30 ` Adrian Bunk
  3 siblings, 0 replies; 26+ messages in thread
From: Adrian Bunk @ 2007-06-11 20:30 UTC (permalink / raw)
  To: Benjamin Gilbert; +Cc: akpm, herbert, linux-crypto, linux-kernel

On Fri, Jun 08, 2007 at 05:42:42PM -0400, Benjamin Gilbert wrote:
> The following 3-part series adds assembly implementations of the SHA-1
> transform for x86 and x86_64.  For x86_64 the optimized code is always
> selected; on x86 it is selected if the kernel is compiled for i486 or above
> (since the code needs BSWAP).  These changes primarily improve the
> performance of the CryptoAPI SHA-1 module and of /dev/urandom.  I've
> included some performance data from my test boxes below.
> 
> This version incorporates feedback from Herbert Xu.  Andrew, I'm sending
> this to you because of the (admittedly tiny) intersection with arm and s390
> in part 1.
> 
> -
> 
> tcrypt performance tests:
> 
> === Pentium IV in 32-bit mode, average of 5 trials ===
> Test#  Bytes/  Bytes/  Cyc/B  Cyc/B  Change
>...
> I've also done informal tests on other boxes, and the performance
> improvement has been in the same ballpark.
> 
> On the aforementioned Pentium IV, /dev/urandom throughput goes from 3.7 MB/s
> to 5.6 MB/s with the patches; on the Core 2, it increases from 5.5 MB/s to
> 8.1 MB/s.
>...

With which gcc version and compiler flags?

And why is the C code slower?
Problems in the C code?
gcc problems?

Generally, I'd really prefer one C implementation that works well on all 
platforms over getting dozens of different assembler implementations, 
each potentially with different bugs.

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
  2007-06-13  5:50     ` Matt Mackall
@ 2007-06-13  6:46       ` linux
  0 siblings, 0 replies; 26+ messages in thread
From: linux @ 2007-06-13  6:46 UTC (permalink / raw)
  To: linux, mpm; +Cc: bgilbert, linux-kernel

>> The names are the order they were written in.  "One" is the lib/sha1.c
>> code (547 bytes with -Os).  "Four" is a 5x unrolled C version (1106 bytes).
>
> I'd like to see your version four.

Here's the test driver wrapped around the earlier assembly code.
It's an ugly mess of copy & paste code, of course.

I suspect it could be shrunk by allocating the W[] array locally,
thereby freeing up a register.  The sizes quoted are with -Os -fomit-frame-pointer.


/*
 * SHA transform algorithm, originally taken from code written by
 * Peter Gutmann, and placed in the public domain.
 */
#include <stdint.h>
#include <stdio.h>

#define rol32(x, s) ((x)<<(s) | (x)>>(32-(s)))
static inline uint32_t __attribute__((const))
be32_to_cpu(unsigned x)
{
	asm("bswap	%0" : "+r"(x));
	return x;
}


/* The SHA f()-functions.  */

#define f1(x,y,z)   (z ^ (x & (y ^ z)))		/* x ? y : z */
#define f2(x,y,z)   (x ^ y ^ z)			/* XOR */
#define f3(x,y,z)   ((x & y) + (z & (x ^ y)))	/* majority */

/* The SHA Mysterious Constants */

#define K1  0x5A827999L			/* Rounds  0-19: sqrt(2) * 2^30 */
#define K2  0x6ED9EBA1L			/* Rounds 20-39: sqrt(3) * 2^30 */
#define K3  0x8F1BBCDCL			/* Rounds 40-59: sqrt(5) * 2^30 */
#define K4  0xCA62C1D6L			/* Rounds 60-79: sqrt(10) * 2^30 */

/**
 * sha_transform - single block SHA1 transform
 *
 * @digest: 160 bit digest to update
 * @data:   512 bits of data to hash
 * @W:      80 words of workspace (see note)
 *
 * This function generates a SHA1 digest for a single 512-bit block.
 * Be warned, it does not handle padding and message digest, do not
 * confuse it with the full FIPS 180-1 digest algorithm for variable
 * length messages.
 *
 * Note: If the hash is security sensitive, the caller should be sure
 * to clear the workspace. This is left to the caller to avoid
 * unnecessary clears between chained hashing operations.
 */
void sha_transform(uint32_t digest[5], const char in[64], uint32_t W[80])
{
	register uint32_t a, b, c, d, e, t, i;

	for (i = 0; i < 16; i++)
		W[i] = be32_to_cpu(((const uint32_t *)in)[i]);

	for (i = 0; i < 64; i++)
		W[i+16] = rol32(W[i+13] ^ W[i+8] ^ W[i+2] ^ W[i], 1);

	a = digest[0];
	b = digest[1];
	c = digest[2];
	d = digest[3];
	e = digest[4];

	for (i = 0; i < 20; i++) {
		t = f1(b, c, d) + K1 + rol32(a, 5) + e + W[i];
		e = d; d = c; c = rol32(b, 30); b = a; a = t;
	}

	for (; i < 40; i ++) {
		t = f2(b, c, d) + K2 + rol32(a, 5) + e + W[i];
		e = d; d = c; c = rol32(b, 30); b = a; a = t;
	}

	for (; i < 60; i ++) {
		t = f3(b, c, d) + K3 + rol32(a, 5) + e + W[i];
		e = d; d = c; c = rol32(b, 30); b = a; a = t;
	}

	for (; i < 80; i ++) {
		t = f2(b, c, d) + K4 + rol32(a, 5) + e + W[i];
		e = d; d = c; c = rol32(b, 30); b = a; a = t;
	}

	digest[0] += a;
	digest[1] += b;
	digest[2] += c;
	digest[3] += d;
	digest[4] += e;
}

#define ROUND(a,b,c,d,e,f,add)	\
	( e += add + f(b,c,d),	\
	  b = rol32(b, 30),	\
	  e += rol32(a, 5) )

void sha_transform4(uint32_t digest[5], const char in[64], uint32_t W[80])
{
	register uint32_t a, b, c, d, e, i;

	for (i = 0; i < 16; i++)
		W[i] = be32_to_cpu(((const uint32_t *)in)[i]);

	for (i = 0; i < 64; i++) {
		a = W[i+13] ^ W[i+8] ^ W[i+2] ^ W[i];
		W[i+16] = rol32(a, 1);
	}

	a = digest[0];
	b = digest[1];
	c = digest[2];
	d = digest[3];
	e = digest[4];

	for (i = 0; i < 20; i += 5) {
		ROUND(a,b,c,d,e,f1,W[i  ]+K1);
		ROUND(e,a,b,c,d,f1,W[i+1]+K1);
		ROUND(d,e,a,b,c,f1,W[i+2]+K1);
		ROUND(c,d,e,a,b,f1,W[i+3]+K1);
		ROUND(b,c,d,e,a,f1,W[i+4]+K1);
	}

	for (; i < 40; i += 5) {
		ROUND(a,b,c,d,e,f2,W[i  ]+K2);
		ROUND(e,a,b,c,d,f2,W[i+1]+K2);
		ROUND(d,e,a,b,c,f2,W[i+2]+K2);
		ROUND(c,d,e,a,b,f2,W[i+3]+K2);
		ROUND(b,c,d,e,a,f2,W[i+4]+K2);
	}
	for (; i < 60; i += 5) {
		ROUND(a,b,c,d,e,f3,W[i  ]+K3);
		ROUND(e,a,b,c,d,f3,W[i+1]+K3);
		ROUND(d,e,a,b,c,f3,W[i+2]+K3);
		ROUND(c,d,e,a,b,f3,W[i+3]+K3);
		ROUND(b,c,d,e,a,f3,W[i+4]+K3);
	}
	for (; i < 80; i += 5) {
		ROUND(a,b,c,d,e,f2,W[i  ]+K4);
		ROUND(e,a,b,c,d,f2,W[i+1]+K4);
		ROUND(d,e,a,b,c,f2,W[i+2]+K4);
		ROUND(c,d,e,a,b,f2,W[i+3]+K4);
		ROUND(b,c,d,e,a,f2,W[i+4]+K4);
	}

	digest[0] += a;
	digest[1] += b;
	digest[2] += c;
	digest[3] += d;
	digest[4] += e;
}

extern void sha_transform2(uint32_t digest[5], const char in[64]);
extern void sha_transform3(uint32_t digest[5], const char in[64]);
extern void sha_transform5(uint32_t digest[5], const char in[64]);
extern void sha_stackwipe(void);

void sha_init(uint32_t buf[5])
{
	buf[0] = 0x67452301;
	buf[1] = 0xefcdab89;
	buf[2] = 0x98badcfe;
	buf[3] = 0x10325476;
	buf[4] = 0xc3d2e1f0;
}

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#if 1
void sha_stackwipe2(void)
{
	uint32_t buf[90];
	memset(buf, 0, sizeof buf);
	asm("" : : "r" (&buf));	/* Force the compiler to do the memset */
}
#endif


#define TEST_SIZE (10*1024*1024)

int main(void)
{
	uint32_t W[80];
	uint32_t out[5];
	char const text[64] = "Hello, world!\n";
	char *buf;
	uint32_t *p;
	unsigned i;
	struct timeval start, stop;

	sha_init(out);
	sha_transform(out, text, W);
	printf("  One: %08x %08x %08x %08x %08x\n",
		out[0], out[1], out[2], out[3], out[4]);

	sha_init(out);
	sha_transform4(out, text, W);
	printf(" Four: %08x %08x %08x %08x %08x\n",
		out[0], out[1], out[2], out[3], out[4]);

	sha_init(out);
	sha_transform2(out, text);
	printf("  Two: %08x %08x %08x %08x %08x\n",
		out[0], out[1], out[2], out[3], out[4]);

	sha_init(out);
	sha_transform3(out, text);
	printf("Three: %08x %08x %08x %08x %08x\n",
		out[0], out[1], out[2], out[3], out[4]);

	sha_init(out);
	sha_transform5(out, text);
	printf(" Five: %08x %08x %08x %08x %08x\n",
		out[0], out[1], out[2], out[3], out[4]);

	sha_stackwipe();
#if 1

	/* Set up a large buffer full of stuff */
	buf = malloc(TEST_SIZE);
	p = (uint32_t *)buf;
	memcpy(p, W+80-16, 16*sizeof *p);
	for (i = 0; i < TEST_SIZE/sizeof *p - 16; i++) {
		uint32_t a = p[i+13] ^ p[i+8] ^ p[i+2] ^ p[i];
		p[i+16] = rol32(a, 1);
	}

	sha_init(out);
	gettimeofday(&start, 0);
	for (i = 0; i < TEST_SIZE; i += 64)
		sha_transform(out, buf+i, W);
	gettimeofday(&stop, 0);
	printf("  One: %08x %08x %08x %08x %08x -- %lu us\n",
		out[0], out[1], out[2], out[3], out[4],
		1000000*(stop.tv_sec-start.tv_sec)+stop.tv_usec-start.tv_usec);

	sha_init(out);
	gettimeofday(&start, 0);
	for (i = 0; i < TEST_SIZE; i += 64)
		sha_transform4(out, buf+i, W);
	gettimeofday(&stop, 0);
	printf(" Four: %08x %08x %08x %08x %08x -- %lu us\n",
		out[0], out[1], out[2], out[3], out[4],
		1000000*(stop.tv_sec-start.tv_sec)+stop.tv_usec-start.tv_usec);

	sha_init(out);
	gettimeofday(&start, 0);
	for (i = 0; i < TEST_SIZE; i += 64)
		sha_transform2(out, buf+i);
	gettimeofday(&stop, 0);
	printf("  Two: %08x %08x %08x %08x %08x -- %lu us\n",
		out[0], out[1], out[2], out[3], out[4],
		1000000*(stop.tv_sec-start.tv_sec)+stop.tv_usec-start.tv_usec);

	sha_init(out);
	gettimeofday(&start, 0);
	for (i = 0; i < TEST_SIZE; i += 64)
		sha_transform3(out, buf+i);
	gettimeofday(&stop, 0);
	printf("Three: %08x %08x %08x %08x %08x -- %lu us\n",
		out[0], out[1], out[2], out[3], out[4],
		1000000*(stop.tv_sec-start.tv_sec)+stop.tv_usec-start.tv_usec);

	sha_init(out);
	gettimeofday(&start, 0);
	for (i = 0; i < TEST_SIZE; i += 64)
		sha_transform5(out, buf+i);
	gettimeofday(&stop, 0);
	printf(" Five: %08x %08x %08x %08x %08x -- %lu us\n",
		out[0], out[1], out[2], out[3], out[4],
		1000000*(stop.tv_sec-start.tv_sec)+stop.tv_usec-start.tv_usec);
	
	sha_stackwipe();
#endif	
	
	return 0;
}

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
  2007-06-12  5:05   ` linux
@ 2007-06-13  5:50     ` Matt Mackall
  2007-06-13  6:46       ` linux
  0 siblings, 1 reply; 26+ messages in thread
From: Matt Mackall @ 2007-06-13  5:50 UTC (permalink / raw)
  To: linux; +Cc: bgilbert, linux-kernel

On Tue, Jun 12, 2007 at 01:05:44AM -0400, linux@horizon.com wrote:
> > I got this code from Nettle, originally, and I never looked at the SHA-1 
> > round structure very closely.  I'll give that approach a try.
> 
> Attached is some (tested, working, and public domain) assembly code for
> three different sha_transform implementations.  Compared to C code, the
> timings to hash 10 MiB on a 600 MHz PIII are:
> 
> The names are the order they were written in.  "One" is the lib/sha1.c
> code (547 bytes with -Os).  "Four" is a 5x unrolled C version (1106 bytes).

I'd like to see your version four.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
  2007-06-11 19:17 ` Benjamin Gilbert
@ 2007-06-12  5:05   ` linux
  2007-06-13  5:50     ` Matt Mackall
  0 siblings, 1 reply; 26+ messages in thread
From: linux @ 2007-06-12  5:05 UTC (permalink / raw)
  To: bgilbert, linux; +Cc: linux-kernel

> I got this code from Nettle, originally, and I never looked at the SHA-1 
> round structure very closely.  I'll give that approach a try.

Attached is some (tested, working, and public domain) assembly code for
three different sha_transform implementations.  Compared to C code, the
timings to hash 10 MiB on a 600 MHz PIII are:

  One: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 564819 us
 Four: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 391086 us
  Two: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 399134 us
Three: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 345986 us
 Five: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 301152 us

  One: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 558652 us
 Four: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 390980 us
  Two: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 407661 us
Three: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 412434 us
 Five: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 266809 us

  One: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 559053 us
 Four: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 396506 us
  Two: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 401661 us
Three: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 349668 us
 Five: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 265861 us

  One: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 556082 us
 Four: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 392967 us
  Two: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 406381 us
Three: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 338959 us
 Five: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 274712 us

Um.. some more runs, nice --19, that come out a bit more stable:
  One: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 552971 us
 Four: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 388167 us
  Two: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 398721 us
Three: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 337220 us
 Five: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 259790 us

  One: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 551240 us
 Four: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 387812 us
  Two: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 398519 us
Three: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 336903 us
 Five: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 260161 us

  One: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 551934 us
 Four: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 387639 us
  Two: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 398094 us
Three: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 335860 us
 Five: e02e77e3 bb7af36d bbf79d4a 46956044 2aaea172 -- 259805 us

This is hot-cache testing; I haven't got around to writing macro
tricks that expand to a megabyte of object code.  The challenge is to
purge not only the I- and D-caches, but also the branch predictor!


The names are the order they were written in.  "One" is the lib/sha1.c
code (547 bytes with -Os).  "Four" is a 5x unrolled C version (1106 bytes).

"Two" is a space-optimized ASM version, 266 bytes long.  "Three" is 5x
unrolled, 722 bytes long.  "Five" is a fully unrolled version, 3558
bytes long.

(Further space savings are possible, but it doesn't seem worth it.)

I have noticed that every caller of sha_transform in the kernel tree
allocates the W[] array on the stack, so we might as well do that inside
sha_transform.  The point of passing in the buffer is to amortize the
wiping afterwards, but see sha_stackwipe for ideas on how to do that.
(It can even be done mostly portably in C, given a good guess about the
C function's stack usage.)
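
As a rough illustration of that suggestion, here is the rolled
lib/sha1.c-style transform from earlier in the thread with W[] moved inside
the function and wiped on exit.  The function name and the empty asm barrier
are illustrative only, not something from the posted patches:

/* Sketch: same algorithm as the "One" code, but the workspace lives in the
 * function's own frame and is cleared before returning, so callers need not
 * pass in or wipe a W[] array themselves. */
#include <stdint.h>
#include <string.h>

#define rol32(x, s) ((x)<<(s) | (x)>>(32-(s)))

static inline uint32_t load_be32(const void *p)
{
	uint32_t x = *(const uint32_t *)p;
	asm("bswap	%0" : "+r"(x));
	return x;
}

void sha_transform_localw(uint32_t digest[5], const char in[64])
{
	uint32_t W[80];
	uint32_t a, b, c, d, e, t, i;

	for (i = 0; i < 16; i++)
		W[i] = load_be32(in + 4*i);
	for (i = 0; i < 64; i++)
		W[i+16] = rol32(W[i+13] ^ W[i+8] ^ W[i+2] ^ W[i], 1);

	a = digest[0]; b = digest[1]; c = digest[2];
	d = digest[3]; e = digest[4];

	for (i = 0; i < 20; i++) {
		t = (d ^ (b & (c ^ d))) + 0x5A827999 + rol32(a, 5) + e + W[i];
		e = d; d = c; c = rol32(b, 30); b = a; a = t;
	}
	for (; i < 40; i++) {
		t = (b ^ c ^ d) + 0x6ED9EBA1 + rol32(a, 5) + e + W[i];
		e = d; d = c; c = rol32(b, 30); b = a; a = t;
	}
	for (; i < 60; i++) {
		t = ((b & c) + (d & (b ^ c))) + 0x8F1BBCDC + rol32(a, 5) + e + W[i];
		e = d; d = c; c = rol32(b, 30); b = a; a = t;
	}
	for (; i < 80; i++) {
		t = (b ^ c ^ d) + 0xCA62C1D6 + rol32(a, 5) + e + W[i];
		e = d; d = c; c = rol32(b, 30); b = a; a = t;
	}

	digest[0] += a; digest[1] += b; digest[2] += c;
	digest[3] += d; digest[4] += e;

	/* Wipe the message schedule; the empty asm keeps the memset from
	 * being optimized away. */
	memset(W, 0, sizeof(W));
	asm("" : : "r" (W) : "memory");
}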


I also noticed a glaring BUG in the folding at the end of extract_buf at
drivers/char/random.c:797.  That should be:

	/*
	 * In case the hash function has some recognizable
	 * output pattern, we fold it in half.
	 */

	buf[0] ^= buf[4];
	buf[1] ^= buf[3];
	buf[2] ^= rol32(buf[2], 16);	// <--- Bug was here
	memcpy(out, buf, EXTRACT_SIZE);
	memset(buf, 0, sizeof(buf));

if the code is to match the comment.



=== sha1asm.S ===
#define A %eax
#define B %ebx
#define C %ecx
#define D %edx
#define E %ebp
#define I %esi
#define T %edi

# f1(x,y,z) = bitwise x ? y : z = (z ^ (x & (y ^ z)))
#define F1(x,y,z,dest)	\
	movl	z,T;	\
	xorl	y,T;	\
	andl	x,T;	\
	xorl	z,T

# f2(x,y,z) = x ^ y ^ z
#define F2(x,y,z,dest)	\
	movl	z,T;	\
	xorl	x,T;	\
	xorl	y,T

# f3(x,y,z) = majority(x,y,z) = ((x & z) + (y & (x ^ z)))
#define F3(x,y,z,dest)	\
	movl	z,T;	\
	andl	x,T;	\
	addl	T,dest; \
	movl	z,T;	\
	xorl	x,T;	\
	andl	y,T

#define K1  0x5A827999			/* Rounds  0-19: sqrt(2) * 2^30 */
#define K2  0x6ED9EBA1			/* Rounds 20-39: sqrt(3) * 2^30 */
#define K3  0x8F1BBCDC			/* Rounds 40-59: sqrt(5) * 2^30 */
#define K4  0xCA62C1D6			/* Rounds 60-79: sqrt(10) * 2^30 */

# e += W[i] + K + f(b, c, d) + rol32(a, 5); b = rol32(b, 30); i++;
#define ROUND(a,b,c,d,e,f,k)	\
	addl (%esp,I,4),e;	\
	incl I;			\
	f(b,c,d,e);		\
	leal k(T,e),e;		\
	movl a,T;		\
	roll $5,T;		\
	rorl $2,b;		\
	addl T,e

	.text
.globl sha_transform3
	.type	sha_transform3, @function
# void sha_transform3(__u32 digest[5], const char in[64])
sha_transform3:
	pushl	%ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
# Args start at 20(%esp)
	xorl	I,I
	movl	24(%esp),B	# B = in
	movl	20(%esp),T	# T = digest
	subl	$320,%esp
1:
	movl	(B,I,4),A
	bswap	A
	movl	A,(%esp,I,4)
	incl	I
	cmpl	$16,I
	jne	1b

2:
	movl	-64(%esp,I,4),A
	xorl	-56(%esp,I,4),A
	xorl	-32(%esp,I,4),A
	xorl	-12(%esp,I,4),A
	roll	$1,A
	movl	A,(%esp,I,4)
	incl	I
	cmpl	$80,I
	jne	2b

	movl	(T),A
	movl	4(T),B
	movl	8(T),C
	movl	12(T),D
	movl	16(T),E

	xorl	I,I
3:
	ROUND(A,B,C,D,E,F1,K1)
	ROUND(E,A,B,C,D,F1,K1)
	ROUND(D,E,A,B,C,F1,K1)
	ROUND(C,D,E,A,B,F1,K1)
	ROUND(B,C,D,E,A,F1,K1)
	cmp	$20,I
	jne	3b
4:
	ROUND(A,B,C,D,E,F2,K2)
	ROUND(E,A,B,C,D,F2,K2)
	ROUND(D,E,A,B,C,F2,K2)
	ROUND(C,D,E,A,B,F2,K2)
	ROUND(B,C,D,E,A,F2,K2)
	cmp	$40,I
	jne	4b
5:
	ROUND(A,B,C,D,E,F3,K3)
	ROUND(E,A,B,C,D,F3,K3)
	ROUND(D,E,A,B,C,F3,K3)
	ROUND(C,D,E,A,B,F3,K3)
	ROUND(B,C,D,E,A,F3,K3)
	cmp	$60,I
	jne	5b
6:
	ROUND(A,B,C,D,E,F2,K4)
	ROUND(E,A,B,C,D,F2,K4)
	ROUND(D,E,A,B,C,F2,K4)
	ROUND(C,D,E,A,B,F2,K4)
	ROUND(B,C,D,E,A,F2,K4)
	cmp	$80,I
	jne	6b

	addl	$320,%esp
	movl	20(%esp),T

	addl	A,(T)
	addl	B,4(T)
	addl	C,8(T)
	addl	D,12(T)
	addl	E,16(T)

	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp

	ret
	.size	sha_transform3, .-sha_transform3
	# Size is 0x2D2 = 722 bytes

# A smaller variant
#define ROUND2(a,b,c,d,e,f,k)	\
	addl (%esp,I,4),e;	\
	incl I;			\
	f(b,c,d,e);		\
	leal k(T,e),T;		\
	movl d,e;		\
	movl c,d;		\
	movl b,c;		\
	rorl $2,c;		\
	movl a,b;		\
	roll $5,a;		\
	addl T,a

.globl sha_transform2
	.type	sha_transform2, @function
# void sha_transform2(__u32 digest[5], const char in[64])
sha_transform2:
	pushl	%ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
# Args start at 20(%esp)
	xorl	I,I
	movl	24(%esp),B	# B = in
	movl	20(%esp),T	# T = digest
	subl	$320,%esp
1:
	movl	(B,I,4),A
	bswap	A
	movl	A,(%esp,I,4)
	incl	I
	cmpl	$16,I
	jne	1b

2:
	movl	-64(%esp,I,4),A
	xorl	-56(%esp,I,4),A
	xorl	-32(%esp,I,4),A
	xorl	-12(%esp,I,4),A
	roll	$1,A
	movl	A,(%esp,I,4)
	incl	I
	cmpl	$80,I
	jne	2b

	movl	(T),A
	movl	4(T),B
	movl	8(T),C
	movl	12(T),D
	movl	16(T),E

	xorl	I,I
3:
	ROUND2(A,B,C,D,E,F1,K1)
	cmp	$20,I
	jne	3b
4:
	ROUND2(A,B,C,D,E,F2,K2)
	cmp	$40,I
	jne	4b
5:
	ROUND2(A,B,C,D,E,F3,K3)
	cmp	$60,I
	jne	5b
6:
	ROUND2(A,B,C,D,E,F2,K4)
	cmp	$80,I
	jne	6b

	addl	$320,%esp
	movl	20(%esp),T
	addl	A,(T)
	addl	B,4(T)
	addl	C,8(T)
	addl	D,12(T)
	addl	E,16(T)

	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp

	ret
	.size	sha_transform2, .-sha_transform2
	# Size is 0x10A = 266 bytes

# The three cases of the next input word...
# Fetch big-endian (first 16 rounds)
#define FETCH(i)		\
	movl	4*i(I),T;	\
	bswap	T;		\
	movl	T,4*i(%esp)

# Calculate but don't store (last 3 rounds)
#define CALCX(i)			\
	movl	4*(i&15)(%esp),T;	\
	xorl	4*((i+2)&15)(%esp),T;	\
	xorl	4*((i+8)&15)(%esp),T;	\
	xorl	4*((i+13)&15)(%esp),T;	\
	roll	$1,T

# Calculate and store on stack (middle 61 rounds)
#define CALC(i)			\
	CALCX(i);			\
	movl	T,4*(i&15)(%esp)

# e += W[i] + K + f(b, c, d) + rol32(a, 5); b = rol32(b, 30); i++;
#define ROUND5a(a,b,c,d,e,f,k)	\
	leal k(T,e),e;		\
	f(b,c,d,e);		\
	addl T,e;		\
	movl a,T;		\
	roll $5,T;		\
	rorl $2,b;		\
	addl T,e

# A variant that assumes that k is stored in I
#define ROUND5b(a,b,c,d,e,f)	\
	addl I,e;		\
	addl T,e;		\
	f(b,c,d,e);		\
	addl T,e;		\
	movl a,T;		\
	roll $5,T;		\
	rorl $2,b;		\
	addl T,e

.globl sha_transform5
	.type	sha_transform5, @function
# void sha_transform5(__u32 digest[5], const char in[64])
sha_transform5:
	pushl	%ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
# Args start at 20(%esp)
	movl	24(%esp),I	# I = in
	movl	20(%esp),T	# T = digest
	subl	$64,%esp

	movl	(T),A
	movl	4(T),B
	movl	8(T),C
	movl	12(T),D
	movl	16(T),E

	FETCH(0); ROUND5a(A,B,C,D,E,F1,K1)
	FETCH(1); ROUND5a(E,A,B,C,D,F1,K1)
	FETCH(2); ROUND5a(D,E,A,B,C,F1,K1)
	FETCH(3); ROUND5a(C,D,E,A,B,F1,K1)
	FETCH(4); ROUND5a(B,C,D,E,A,F1,K1)

	FETCH(5); ROUND5a(A,B,C,D,E,F1,K1)
	FETCH(6); ROUND5a(E,A,B,C,D,F1,K1)
	FETCH(7); ROUND5a(D,E,A,B,C,F1,K1)
	FETCH(8); ROUND5a(C,D,E,A,B,F1,K1)
	FETCH(9); ROUND5a(B,C,D,E,A,F1,K1)

	FETCH(10); ROUND5a(A,B,C,D,E,F1,K1)
	FETCH(11); ROUND5a(E,A,B,C,D,F1,K1)
	FETCH(12); ROUND5a(D,E,A,B,C,F1,K1)
	FETCH(13); ROUND5a(C,D,E,A,B,F1,K1)
	FETCH(14); ROUND5a(B,C,D,E,A,F1,K1)

	FETCH(15); movl $K1,I; ROUND5b(A,B,C,D,E,F1)
	CALC(16); ROUND5b(E,A,B,C,D,F1)
	CALC(17); ROUND5b(D,E,A,B,C,F1)
	CALC(18); ROUND5b(C,D,E,A,B,F1)
	CALC(19); ROUND5b(B,C,D,E,A,F1)

	movl $K2,I

	CALC(20); ROUND5b(A,B,C,D,E,F2)
	CALC(21); ROUND5b(E,A,B,C,D,F2)
	CALC(22); ROUND5b(D,E,A,B,C,F2)
	CALC(23); ROUND5b(C,D,E,A,B,F2)
	CALC(24); ROUND5b(B,C,D,E,A,F2)

	CALC(25); ROUND5b(A,B,C,D,E,F2)
	CALC(26); ROUND5b(E,A,B,C,D,F2)
	CALC(27); ROUND5b(D,E,A,B,C,F2)
	CALC(28); ROUND5b(C,D,E,A,B,F2)
	CALC(29); ROUND5b(B,C,D,E,A,F2)

	CALC(30); ROUND5b(A,B,C,D,E,F2)
	CALC(31); ROUND5b(E,A,B,C,D,F2)
	CALC(32); ROUND5b(D,E,A,B,C,F2)
	CALC(33); ROUND5b(C,D,E,A,B,F2)
	CALC(34); ROUND5b(B,C,D,E,A,F2)

	CALC(35); ROUND5b(A,B,C,D,E,F2)
	CALC(36); ROUND5b(E,A,B,C,D,F2)
	CALC(37); ROUND5b(D,E,A,B,C,F2)
	CALC(38); ROUND5b(C,D,E,A,B,F2)
	CALC(39); ROUND5b(B,C,D,E,A,F2)

	movl $K3,I

	CALC(40); ROUND5b(A,B,C,D,E,F3)
	CALC(41); ROUND5b(E,A,B,C,D,F3)
	CALC(42); ROUND5b(D,E,A,B,C,F3)
	CALC(43); ROUND5b(C,D,E,A,B,F3)
	CALC(44); ROUND5b(B,C,D,E,A,F3)

	CALC(45); ROUND5b(A,B,C,D,E,F3)
	CALC(46); ROUND5b(E,A,B,C,D,F3)
	CALC(47); ROUND5b(D,E,A,B,C,F3)
	CALC(48); ROUND5b(C,D,E,A,B,F3)
	CALC(49); ROUND5b(B,C,D,E,A,F3)

	CALC(50); ROUND5b(A,B,C,D,E,F3)
	CALC(51); ROUND5b(E,A,B,C,D,F3)
	CALC(52); ROUND5b(D,E,A,B,C,F3)
	CALC(53); ROUND5b(C,D,E,A,B,F3)
	CALC(54); ROUND5b(B,C,D,E,A,F3)

	CALC(55); ROUND5b(A,B,C,D,E,F3)
	CALC(56); ROUND5b(E,A,B,C,D,F3)
	CALC(57); ROUND5b(D,E,A,B,C,F3)
	CALC(58); ROUND5b(C,D,E,A,B,F3)
	CALC(59); ROUND5b(B,C,D,E,A,F3)

	movl $K4,I

	CALC(60); ROUND5b(A,B,C,D,E,F2)
	CALC(61); ROUND5b(E,A,B,C,D,F2)
	CALC(62); ROUND5b(D,E,A,B,C,F2)
	CALC(63); ROUND5b(C,D,E,A,B,F2)
	CALC(64); ROUND5b(B,C,D,E,A,F2)

	CALC(65); ROUND5b(A,B,C,D,E,F2)
	CALC(66); ROUND5b(E,A,B,C,D,F2)
	CALC(67); ROUND5b(D,E,A,B,C,F2)
	CALC(68); ROUND5b(C,D,E,A,B,F2)
	CALC(69); ROUND5b(B,C,D,E,A,F2)

	CALC(70); ROUND5b(A,B,C,D,E,F2)
	CALC(71); ROUND5b(E,A,B,C,D,F2)
	CALC(72); ROUND5b(D,E,A,B,C,F2)
	CALC(73); ROUND5b(C,D,E,A,B,F2)
	CALC(74); ROUND5b(B,C,D,E,A,F2)

	CALC(75); ROUND5b(A,B,C,D,E,F2)
	CALC(76); ROUND5b(E,A,B,C,D,F2)
	CALCX(77); ROUND5b(D,E,A,B,C,F2)
	CALCX(78); ROUND5b(C,D,E,A,B,F2)
	CALCX(79); ROUND5b(B,C,D,E,A,F2)

	addl	$64,%esp
	movl	20(%esp),I

#if 1
	addl	A,(I)
	addl	B,4(I)
	addl	C,8(I)
	addl	D,12(I)
	addl	E,16(I)
#else
	movl	A,(I)
	movl	B,4(I)
	movl	C,8(I)
	movl	D,12(I)
	movl	E,16(I)
#endif

	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp

	ret
	.size	sha_transform5, .-sha_transform5
	# Size is 0xDE6 = 3558 bytes


.globl sha_stackwipe
	.type	sha_stackwipe, @function
# void sha_stackwipe(void)
# After one or more sha_transform calls, we have left the contents of W[]
# on the stack, and from any 16 of those 80 words, the entire input
# can be reconstructed.  If the caller cares, this function obliterates
# the relevant portion of the stack.
# 2 words of argument + 4 words of saved registers + 80 words of W[]
sha_stackwipe:
	xorl	%eax,%eax
	movl	$86,%ecx
# Damn, I had hoped that loop; pushl %eax would work..
1:
	decl	%ecx
	pushl	%eax
	jne	1b

	addl	$4*86,%esp
	ret
	.size	sha_stackwipe, .-sha_stackwipe

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
  2007-06-11  7:53 [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+ linux
@ 2007-06-11 19:17 ` Benjamin Gilbert
  2007-06-12  5:05   ` linux
  0 siblings, 1 reply; 26+ messages in thread
From: Benjamin Gilbert @ 2007-06-11 19:17 UTC (permalink / raw)
  To: linux; +Cc: linux-kernel

linux@horizon.com wrote:
> /* Majority: (x&y)|(y&z)|(z&x) = (x & z) + ((x ^ z) & y) */
> #define F3(x,y,z,dest)		\
> 	movl	z, TMP;		\
> 	andl	x, TMP;		\
> 	addl	TMP, dest;	\
> 	movl	z, TMP;		\
> 	xorl	x, TMP;		\
> 	andl	y, TMP;		\
> 	addl	TMP, dest
> 
> Since y is the most recently computed result (it's rotated in the
> previous round), I arranged the code to delay its use as late as
> possible.
> 
> 
> Now you have one more register to play with.

Okay, thanks.  It doesn't actually give one more register except in the 
F3 rounds (TMP2 is normally used to hold the magic constants) but it's a 
good cleanup.

> A faster way is to unroll 5 iterations and do:
> 	e += F(b, c, d) + K + rol32(a, 5) + W[i  ]; b = rol32(b, 30);
> 	d += F(a, b, c) + K + rol32(e, 5) + W[i+1]; a = rol32(a, 30);
> 	c += F(e, a, b) + K + rol32(d, 5) + W[i+2]; e = rol32(e, 30);
> 	b += F(d, e, a) + K + rol32(c, 5) + W[i+3]; d = rol32(d, 30);
> 	a += F(c, d, e) + K + rol32(b, 5) + W[i+4]; c = rol32(c, 30);
> then loop over that 4 times each.  This is somewhat larger, but
> still reasonably compact; only 20 of the 80 rounds are written out
> long-hand.

I got this code from Nettle, originally, and I never looked at the SHA-1 
round structure very closely.  I'll give that approach a try.

Thanks
--Benjamin Gilbert

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
@ 2007-06-11  7:53 linux
  2007-06-11 19:17 ` Benjamin Gilbert
  0 siblings, 1 reply; 26+ messages in thread
From: linux @ 2007-06-11  7:53 UTC (permalink / raw)
  To: bgilbert, linux-kernel; +Cc: linux

+#define F3(x,y,z)						\
+	movl	x, TMP2;					\
+	andl	y, TMP2;					\
+	movl	x, TMP;						\
+	orl	y, TMP;						\
+	andl	z, TMP;						\
+       orl	TMP2, TMP

*Sigh*.  You don't need TMP2 to compute the majority function.

You're implementing it as (x & y) | ((x | y) & z).
Look at the rephrasing in lib/sha1.c:
#define f3(x,y,z)	((x & y) + (z & (x ^ y)))	/* majority */

By changing the second OR to x^y, you ensure that the two halves
of the first disjunction are disjoint, so you can replace the OR
with XOR, or better yet, +.
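
Since x&y and z&(x^y) can never both be 1 in the same bit position, OR,
XOR and + all agree there.  A quick brute-force check in C (just an
illustration, not part of the patch; bitwise ops act on each bit
independently, so single-bit inputs cover every case):

#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint32_t x, y, z;

	for (x = 0; x < 2; x++)
		for (y = 0; y < 2; y++)
			for (z = 0; z < 2; z++) {
				uint32_t maj = (x & y) | (y & z) | (z & x);

				/* the patch's phrasing */
				assert(((x & y) | ((x | y) & z)) == maj);
				/* lib/sha1.c's phrasing */
				assert(((x & y) + (z & (x ^ y))) == maj);
				/* the phrasing suggested below */
				assert(((x & z) + ((x ^ z) & y)) == maj);
			}
	return 0;
}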

Then you can just do two adds to e.  That is, write:

/* Bitwise select: x ? y : z, which is (z ^ (x & (y ^ z))) */
#define F1(x,y,z,dest)		\
	movl	z, TMP;		\
	xorl	y, TMP;		\
	andl	x, TMP;		\
	xorl	z, TMP;		\
	addl	TMP, dest

/* Three-way XOR (x ^ y ^ z) */
#define F2(x,y,z,dest)		\
	movl	z, TMP;		\
	xorl	x, TMP;		\
	xorl	y, TMP;		\
	addl	TMP, dest

/* Majority: (x&y)|(y&z)|(z&x) = (x & z) + ((x ^ z) & y) */
#define F3(x,y,z,dest)		\
	movl	z, TMP;		\
	andl	x, TMP;		\
	addl	TMP, dest;	\
	movl	z, TMP;		\
	xorl	x, TMP;		\
	andl	y, TMP;		\
	addl	TMP, dest

Since y is the most recently computed result (it's rotated in the
previous round), I arranged the code to delay its use as late as
possible.


Now you have one more register to play with.


I thought I had some good sha1 asm code lying around, but I
can't seem to find it.  (I have some excellent PowerPC asm if anyone
wants it.)  Here's a basic implementation question:

SHA-1 is made up of 80 rounds, 20 of each of 4 types.
There are 5 working variables, a through e.

The basic round is:
	t = F(b, c, d) + K + rol32(a, 5) + e + W[i];
	e = d; d = c; c = rol32(b, 30); b = a; a = t;
where W[] is the input array.  W[0..15] are the input words, and
W[16..79] are computed by a sort of LFSR from W[0..15].

Each group of 20 rounds has a different F() and K.
This is the smallest way to write the function, but all the
register shuffling makes for a bit of a speed penalty.
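
Written out in C, the fully rolled-up form is roughly this (a sketch only,
with illustrative names; the hex values are the standard SHA-1 round
constants, and W[0..79] is assumed to be expanded already):

static uint32_t rol32(uint32_t v, int n)
{
	return (v << n) | (v >> (32 - n));
}

static void sha1_block(uint32_t state[5], const uint32_t W[80])
{
	uint32_t a = state[0], b = state[1], c = state[2];
	uint32_t d = state[3], e = state[4];
	int i;

	for (i = 0; i < 80; i++) {
		uint32_t f, k, t;

		if (i < 20)      { f = d ^ (b & (c ^ d));      k = 0x5a827999; }
		else if (i < 40) { f = b ^ c ^ d;               k = 0x6ed9eba1; }
		else if (i < 60) { f = (b & c) + (d & (b ^ c)); k = 0x8f1bbcdc; }
		else             { f = b ^ c ^ d;               k = 0xca62c1d6; }

		t = f + k + rol32(a, 5) + e + W[i];
		e = d; d = c; c = rol32(b, 30); b = a; a = t;
	}

	state[0] += a; state[1] += b; state[2] += c;
	state[3] += d; state[4] += e;
}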

A faster way is to unroll 5 iterations and do:
	e += F(b, c, d) + K + rol32(a, 5) + W[i  ]; b = rol32(b, 30);
	d += F(a, b, c) + K + rol32(e, 5) + W[i+1]; a = rol32(a, 30);
	c += F(e, a, b) + K + rol32(d, 5) + W[i+2]; e = rol32(e, 30);
	b += F(d, e, a) + K + rol32(c, 5) + W[i+3]; d = rol32(d, 30);
	a += F(c, d, e) + K + rol32(b, 5) + W[i+4]; c = rol32(c, 30);
then loop over that 4 times each.  This is somewhat larger, but
still reasonably compact; only 20 of the 80 rounds are written out
long-hand.
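
Concretely, that can be written as one 5-round macro used four times per
group, something like this sketch (illustrative names only; it assumes
rol32 as above and f1/f2/f3 and K1..K4 defined along the lines of
lib/sha1.c):

static void sha1_block5(uint32_t state[5], const uint32_t W[80])
{
	uint32_t a = state[0], b = state[1], c = state[2];
	uint32_t d = state[3], e = state[4];
	unsigned int i;

#define R5(F, K, i)	do {						\
	e += F(b, c, d) + (K) + rol32(a, 5) + W[(i)];	  b = rol32(b, 30); \
	d += F(a, b, c) + (K) + rol32(e, 5) + W[(i) + 1]; a = rol32(a, 30); \
	c += F(e, a, b) + (K) + rol32(d, 5) + W[(i) + 2]; e = rol32(e, 30); \
	b += F(d, e, a) + (K) + rol32(c, 5) + W[(i) + 3]; d = rol32(d, 30); \
	a += F(c, d, e) + (K) + rol32(b, 5) + W[(i) + 4]; c = rol32(c, 30); \
	} while (0)

	for (i = 0; i < 20; i += 5)
		R5(f1, K1, i);
	for (; i < 40; i += 5)
		R5(f2, K2, i);
	for (; i < 60; i += 5)
		R5(f3, K3, i);
	for (; i < 80; i += 5)
		R5(f2, K4, i);
#undef R5

	state[0] += a; state[1] += b; state[2] += c;
	state[3] += d; state[4] += e;
}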

Faster yet is to unroll all 80 rounds directly.  But it also takes the
most code space, and as we have learned, when your code is not the
execution-time hot spot, the less cache it occupies, the faster it runs
overall.

Is there a preferred implementation?


Another implementation choice has to do with the computation of W[].
W[i] is a function of W[i-3], W[i-8], W[i-14] and W[i-16].
It is possible to keep a 16-word circular buffer with only the most
recent 16 values of W[i%16] and compute each new word as it is needed.
However, the offsets i%16 repeat every 16 rounds, which is an awkward
fit with the 5-round repeating pattern of the main computation.
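
For what it's worth, the 16-word circular-buffer update is just this
(a sketch; rol32 as above, valid for i >= 16, and note that i-16 and i
coincide mod 16, so the oldest slot is overwritten in place):

static uint32_t sha1_next_W(uint32_t W[16], unsigned int i)
{
	uint32_t w = W[(i + 13) & 15] ^ W[(i + 8) & 15] ^
		     W[(i + 2) & 15] ^ W[i & 15];

	W[i & 15] = rol32(w, 1);
	return W[i & 15];
}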

One option is to compute all the W[] values in a pre-pass beforehand.
Simple and small, but uses 320 bytes of data on the stack or wherever.
An intermediate one is to keep a 20-word buffer, and compute 20 words
at a time just before each of the four 20-round groups.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2007-06-13  6:46 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-06-08 21:42 [PATCH 0/3] Add optimized SHA-1 implementations for x86 and x86_64 Benjamin Gilbert
2007-06-08 21:42 ` [PATCH 1/3] [CRYPTO] Move sha_init() into cryptohash.h Benjamin Gilbert
2007-06-08 21:42 ` [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+ Benjamin Gilbert
2007-06-09  7:32   ` Jan Engelhardt
2007-06-10  1:15     ` Benjamin Gilbert
2007-06-11 19:47       ` Benjamin Gilbert
2007-06-11 19:50         ` [PATCH] " Benjamin Gilbert
2007-06-11 19:52         ` [PATCH] [CRYPTO] Add optimized SHA-1 implementation for x86_64 Benjamin Gilbert
2007-06-09 20:11   ` [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+ Matt Mackall
2007-06-09 20:23     ` Jeff Garzik
2007-06-09 21:34       ` Matt Mackall
2007-06-10  0:33       ` Benjamin Gilbert
2007-06-10 13:59         ` Matt Mackall
2007-06-10 16:47           ` Benjamin Gilbert
2007-06-10 17:33             ` Matt Mackall
2007-06-11 17:39           ` Benjamin Gilbert
2007-06-11 12:04     ` Andi Kleen
2007-06-08 21:42 ` [PATCH 3/3] [CRYPTO] Add optimized SHA-1 implementation for x86_64 Benjamin Gilbert
2007-06-11 12:01   ` Andi Kleen
2007-06-11 19:45     ` Benjamin Gilbert
2007-06-11 20:30 ` [PATCH 0/3] Add optimized SHA-1 implementations for x86 and x86_64 Adrian Bunk
2007-06-11  7:53 [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+ linux
2007-06-11 19:17 ` Benjamin Gilbert
2007-06-12  5:05   ` linux
2007-06-13  5:50     ` Matt Mackall
2007-06-13  6:46       ` linux
