All of lore.kernel.org
* [PATCH 0/7] block-sha1: improved SHA1 hashing
@ 2009-08-06 15:13 Linus Torvalds
  2009-08-06 15:15 ` [PATCH 1/7] block-sha1: add new optimized C 'block-sha1' routines Linus Torvalds
  2009-08-06 17:22 ` [PATCH 0/7] block-sha1: improved SHA1 hashing Artur Skawina
  0 siblings, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 15:13 UTC (permalink / raw)
  To: Git Mailing List, Junio C Hamano


The bulk of this is the patches I sent out yesterday, but there are a few 
added tweaks from today, and it's now one nice series instead of a 
patch followed by a "Haa, I improved on it" etc, so you can get the whole 
thing without having to hunt for the pieces.

It's a series of 7 patches:

      block-sha1: add new optimized C 'block-sha1' routines
      block-sha1: try to use rol/ror appropriately
      block-sha1: make the 'ntohl()' part of the first SHA1 loop
      block-sha1: re-use the temporary array as we calculate the SHA1
      block-sha1: macroize the rounds a bit further
      block-sha1: Use '(B&C)+(D&(B^C))' instead of '(B&C)|(D&(B|C))' in round 3
      block-sha1: get rid of redundant 'lenW' context

where the thing is loosely based on the Mozilla SHA1 routines but by the 
end doesn't really resemble them all that much. The Mozilla ones suck 
donkey d*ck in so many ways - unnecessary copies, idiotic byte-at-a-time 
build-up of the hash input etc.

The end result is pretty much equivalent in performance to the OpenSSL 
SHA1 code for me on x86-64. Getting rid of OpenSSL gets rid of another 
couple of shared library loadings, and as a result my "make -j64 test" is 
now a couple of seconds faster with this than with OpenSSL in my rather 
unscientific tests.

The code isn't very big:

 Makefile          |    9 +++
 block-sha1/sha1.c |  158 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 block-sha1/sha1.h |   20 +++++++
 3 files changed, 187 insertions(+), 0 deletions(-)
 create mode 100644 block-sha1/sha1.c
 create mode 100644 block-sha1/sha1.h

and passes all my tests. It's really hard to get SHA1 wrong and still 
pass any tests at all, so it should all be good.

Enable them with BLK_SHA1=1.

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 1/7] block-sha1: add new optimized C 'block-sha1' routines
  2009-08-06 15:13 [PATCH 0/7] block-sha1: improved SHA1 hashing Linus Torvalds
@ 2009-08-06 15:15 ` Linus Torvalds
  2009-08-06 15:16   ` [PATCH 2/7] block-sha1: try to use rol/ror appropriately Linus Torvalds
  2009-08-06 17:22 ` [PATCH 0/7] block-sha1: improved SHA1 hashing Artur Skawina
  1 sibling, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 15:15 UTC (permalink / raw)
  To: Git Mailing List, Junio C Hamano



From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 5 Aug 2009 14:49:32 -0700
Subject: [PATCH 1/7] block-sha1: add new optimized C 'block-sha1' routines

Based on the Mozilla SHA1 routine, but doing the input data accesses a
word at a time and with 'htonl()' instead of loading bytes and shifting.

It requires an architecture that is ok with unaligned 32-bit loads and a
fast htonl().

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

Side note: we can get rid of the "arch must support unaligned loads" 
requirement later on by telling the compiler which ones _can_ be 
unaligned. So the limitations aren't all that fundamental, really.

 Makefile          |    9 +++
 block-sha1/sha1.c |  145 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 block-sha1/sha1.h |   21 ++++++++
 3 files changed, 175 insertions(+), 0 deletions(-)

diff --git a/Makefile b/Makefile
index d7669b1..f12024c 100644
--- a/Makefile
+++ b/Makefile
@@ -84,6 +84,10 @@ all::
 # specify your own (or DarwinPort's) include directories and
 # library directories by defining CFLAGS and LDFLAGS appropriately.
 #
+# Define BLK_SHA1 environment variable if you want the C version
+# of the SHA1 that assumes you can do unaligned 32-bit loads and
+# have a fast htonl() function.
+#
 # Define PPC_SHA1 environment variable when running make to make use of
 # a bundled SHA1 routine optimized for PowerPC.
 #
@@ -1166,6 +1170,10 @@ ifdef NO_DEFLATE_BOUND
 	BASIC_CFLAGS += -DNO_DEFLATE_BOUND
 endif
 
+ifdef BLK_SHA1
+	SHA1_HEADER = "block-sha1/sha1.h"
+	LIB_OBJS += block-sha1/sha1.o
+else
 ifdef PPC_SHA1
 	SHA1_HEADER = "ppc/sha1.h"
 	LIB_OBJS += ppc/sha1.o ppc/sha1ppc.o
@@ -1183,6 +1191,7 @@ else
 endif
 endif
 endif
+endif
 ifdef NO_PERL_MAKEMAKER
 	export NO_PERL_MAKEMAKER
 endif
diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
new file mode 100644
index 0000000..eef32f7
--- /dev/null
+++ b/block-sha1/sha1.c
@@ -0,0 +1,145 @@
+/*
+ * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.c),
+ * optimized to do word accesses rather than byte accesses,
+ * and to avoid unnecessary copies into the context array.
+ */
+
+#include <string.h>
+#include <arpa/inet.h>
+
+#include "sha1.h"
+
+/* Hash one 64-byte block of data */
+static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data);
+
+void blk_SHA1_Init(blk_SHA_CTX *ctx)
+{
+	ctx->lenW = 0;
+	ctx->size = 0;
+
+	/* Initialize H with the magic constants (see FIPS180 for constants)
+	 */
+	ctx->H[0] = 0x67452301;
+	ctx->H[1] = 0xefcdab89;
+	ctx->H[2] = 0x98badcfe;
+	ctx->H[3] = 0x10325476;
+	ctx->H[4] = 0xc3d2e1f0;
+}
+
+
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
+{
+	int lenW = ctx->lenW;
+
+	ctx->size += (unsigned long long) len << 3;
+
+	/* Read the data into W and process blocks as they get full
+	 */
+	if (lenW) {
+		int left = 64 - lenW;
+		if (len < left)
+			left = len;
+		memcpy(lenW + (char *)ctx->W, data, left);
+		lenW = (lenW + left) & 63;
+		len -= left;
+		data += left;
+		ctx->lenW = lenW;
+		if (lenW)
+			return;
+		blk_SHA1Block(ctx, ctx->W);
+	}
+	while (len >= 64) {
+		blk_SHA1Block(ctx, data);
+		data += 64;
+		len -= 64;
+	}
+	if (len) {
+		memcpy(ctx->W, data, len);
+		ctx->lenW = len;
+	}
+}
+
+
+void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
+{
+	static const unsigned char pad[64] = { 0x80 };
+	unsigned int padlen[2];
+	int i;
+
+	/* Pad with a binary 1 (ie 0x80), then zeroes, then length
+	 */
+	padlen[0] = htonl(ctx->size >> 32);
+	padlen[1] = htonl(ctx->size);
+
+	blk_SHA1_Update(ctx, pad, 1+ (63 & (55 - ctx->lenW)));
+	blk_SHA1_Update(ctx, padlen, 8);
+
+	/* Output hash
+	 */
+	for (i = 0; i < 5; i++)
+		((unsigned int *)hashout)[i] = htonl(ctx->H[i]);
+}
+
+#define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n))))
+
+static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
+{
+	int t;
+	unsigned int A,B,C,D,E,TEMP;
+	unsigned int W[80];
+
+	for (t = 0; t < 16; t++)
+		W[t] = htonl(data[t]);
+
+	/* Unroll it? */
+	for (t = 16; t <= 79; t++)
+		W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
+
+	A = ctx->H[0];
+	B = ctx->H[1];
+	C = ctx->H[2];
+	D = ctx->H[3];
+	E = ctx->H[4];
+
+#define T_0_19(t) \
+	TEMP = SHA_ROT(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
+	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+
+	T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4);
+	T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9);
+	T_0_19(10); T_0_19(11); T_0_19(12); T_0_19(13); T_0_19(14);
+	T_0_19(15); T_0_19(16); T_0_19(17); T_0_19(18); T_0_19(19);
+
+#define T_20_39(t) \
+	TEMP = SHA_ROT(A,5) + (B^C^D)           + E + W[t] + 0x6ed9eba1; \
+	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+
+	T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
+	T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29);
+	T_20_39(30); T_20_39(31); T_20_39(32); T_20_39(33); T_20_39(34);
+	T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39);
+
+#define T_40_59(t) \
+	TEMP = SHA_ROT(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \
+	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+
+	T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
+	T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49);
+	T_40_59(50); T_40_59(51); T_40_59(52); T_40_59(53); T_40_59(54);
+	T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59);
+
+#define T_60_79(t) \
+	TEMP = SHA_ROT(A,5) + (B^C^D)           + E + W[t] + 0xca62c1d6; \
+	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+
+	T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64);
+	T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69);
+	T_60_79(70); T_60_79(71); T_60_79(72); T_60_79(73); T_60_79(74);
+	T_60_79(75); T_60_79(76); T_60_79(77); T_60_79(78); T_60_79(79);
+
+	ctx->H[0] += A;
+	ctx->H[1] += B;
+	ctx->H[2] += C;
+	ctx->H[3] += D;
+	ctx->H[4] += E;
+}
diff --git a/block-sha1/sha1.h b/block-sha1/sha1.h
new file mode 100644
index 0000000..7be2d93
--- /dev/null
+++ b/block-sha1/sha1.h
@@ -0,0 +1,21 @@
+/*
+ * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.h),
+ * optimized to do word accesses rather than byte accesses,
+ * and to avoid unnecessary copies into the context array.
+ */
+
+typedef struct {
+	unsigned int H[5];
+	unsigned int W[16];
+	int lenW;
+	unsigned long long size;
+} blk_SHA_CTX;
+
+void blk_SHA1_Init(blk_SHA_CTX *ctx);
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *dataIn, unsigned long len);
+void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx);
+
+#define git_SHA_CTX	blk_SHA_CTX
+#define git_SHA1_Init	blk_SHA1_Init
+#define git_SHA1_Update	blk_SHA1_Update
+#define git_SHA1_Final	blk_SHA1_Final
-- 
1.6.4.31.g154b2.dirty


* [PATCH 2/7] block-sha1: try to use rol/ror appropriately
  2009-08-06 15:15 ` [PATCH 1/7] block-sha1: add new optimized C 'block-sha1' routines Linus Torvalds
@ 2009-08-06 15:16   ` Linus Torvalds
  2009-08-06 15:18     ` [PATCH 3/7] block-sha1: make the 'ntohl()' part of the first SHA1 loop Linus Torvalds
  2009-08-06 18:25     ` [PATCH 2/7] block-sha1: try to use rol/ror appropriately Bert Wesarg
  0 siblings, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 15:16 UTC (permalink / raw)
  To: Git Mailing List, Junio C Hamano



From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 5 Aug 2009 19:42:15 -0700
Subject: [PATCH 2/7] block-sha1: try to use rol/ror appropriately

Use the one with the smaller constant.  It _can_ generate slightly
smaller code (a constant of 1 is special), but perhaps more importantly
it's possibly faster on any uarch that does a rotate with a loop.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

Ok, this is the hackiest of the bunch. I considered dropping it. But I do 
suspect we want something like this, even if we might end up massaging the 
asm for different compilers etc.

 block-sha1/sha1.c |   32 ++++++++++++++++++++++----------
 1 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index eef32f7..a45a3de 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -80,7 +80,19 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
 		((unsigned int *)hashout)[i] = htonl(ctx->H[i]);
 }
 
-#define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n))))
+#if defined(__i386__) || defined(__x86_64__)
+
+#define SHA_ASM(op, x, n) ({ unsigned int __res; asm(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; })
+#define SHA_ROL(x,n)	SHA_ASM("rol", x, n)
+#define SHA_ROR(x,n)	SHA_ASM("ror", x, n)
+
+#else
+
+#define SHA_ROT(X,l,r)	(((X) << (l)) | ((X) >> (r)))
+#define SHA_ROL(X,n)	SHA_ROT(X,n,32-(n))
+#define SHA_ROR(X,n)	SHA_ROT(X,32-(n),n)
+
+#endif
 
 static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 {
@@ -93,7 +105,7 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 
 	/* Unroll it? */
 	for (t = 16; t <= 79; t++)
-		W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
+		W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
 
 	A = ctx->H[0];
 	B = ctx->H[1];
@@ -102,8 +114,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	E = ctx->H[4];
 
 #define T_0_19(t) \
-	TEMP = SHA_ROT(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
-	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+	TEMP = SHA_ROL(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
 	T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4);
 	T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9);
@@ -111,8 +123,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	T_0_19(15); T_0_19(16); T_0_19(17); T_0_19(18); T_0_19(19);
 
 #define T_20_39(t) \
-	TEMP = SHA_ROT(A,5) + (B^C^D)           + E + W[t] + 0x6ed9eba1; \
-	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+	TEMP = SHA_ROL(A,5) + (B^C^D)           + E + W[t] + 0x6ed9eba1; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
 	T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
 	T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29);
@@ -120,8 +132,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39);
 
 #define T_40_59(t) \
-	TEMP = SHA_ROT(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \
-	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+	TEMP = SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
 	T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
 	T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49);
@@ -129,8 +141,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59);
 
 #define T_60_79(t) \
-	TEMP = SHA_ROT(A,5) + (B^C^D)           + E + W[t] + 0xca62c1d6; \
-	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+	TEMP = SHA_ROL(A,5) + (B^C^D)           + E + W[t] + 0xca62c1d6; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
 	T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64);
 	T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69);
-- 
1.6.4.31.g154b2.dirty


* [PATCH 3/7] block-sha1: make the 'ntohl()' part of the first SHA1 loop
  2009-08-06 15:16   ` [PATCH 2/7] block-sha1: try to use rol/ror appropriately Linus Torvalds
@ 2009-08-06 15:18     ` Linus Torvalds
  2009-08-06 15:20       ` [PATCH 4/7] block-sha1: re-use the temporary array as we calculate the SHA1 Linus Torvalds
  2009-08-06 18:25     ` [PATCH 2/7] block-sha1: try to use rol/ror appropriately Bert Wesarg
  1 sibling, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 15:18 UTC (permalink / raw)
  To: Git Mailing List, Junio C Hamano


From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 5 Aug 2009 20:28:07 -0700
Subject: [PATCH 3/7] block-sha1: make the 'ntohl()' part of the first SHA1 loop

This helps a teeny bit.  But what I -really- want to do is to avoid the
whole 80-array loop, and do the xor updates as I go along..

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

This is a pretty trivial one, but it was the first stage in getting rid of 
the annoying 80-word array, which not only wastes precious L1 cache, but 
whose initialization loop was a noticeable cost.

 block-sha1/sha1.c |   28 ++++++++++++++++------------
 1 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index a45a3de..39a5bbb 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -100,27 +100,31 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	unsigned int A,B,C,D,E,TEMP;
 	unsigned int W[80];
 
-	for (t = 0; t < 16; t++)
-		W[t] = htonl(data[t]);
-
-	/* Unroll it? */
-	for (t = 16; t <= 79; t++)
-		W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
-
 	A = ctx->H[0];
 	B = ctx->H[1];
 	C = ctx->H[2];
 	D = ctx->H[3];
 	E = ctx->H[4];
 
-#define T_0_19(t) \
+#define T_0_15(t) \
+	TEMP = htonl(data[t]); W[t] = TEMP; \
+	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D)     + E + 0x5a827999; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
+
+	T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4);
+	T_0_15( 5); T_0_15( 6); T_0_15( 7); T_0_15( 8); T_0_15( 9);
+	T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14);
+	T_0_15(15);
+
+	/* Unroll it? */
+	for (t = 16; t <= 79; t++)
+		W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
+
+#define T_16_19(t) \
 	TEMP = SHA_ROL(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
 	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
-	T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4);
-	T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9);
-	T_0_19(10); T_0_19(11); T_0_19(12); T_0_19(13); T_0_19(14);
-	T_0_19(15); T_0_19(16); T_0_19(17); T_0_19(18); T_0_19(19);
+	T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19);
 
 #define T_20_39(t) \
 	TEMP = SHA_ROL(A,5) + (B^C^D)           + E + W[t] + 0x6ed9eba1; \
-- 
1.6.4.31.g154b2.dirty


* [PATCH 4/7] block-sha1: re-use the temporary array as we calculate the SHA1
  2009-08-06 15:18     ` [PATCH 3/7] block-sha1: make the 'ntohl()' part of the first SHA1 loop Linus Torvalds
@ 2009-08-06 15:20       ` Linus Torvalds
  2009-08-06 15:22         ` [PATCH 5/7] block-sha1: macroize the rounds a bit further Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 15:20 UTC (permalink / raw)
  To: Git Mailing List, Junio C Hamano


From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 5 Aug 2009 20:49:41 -0700
Subject: [PATCH 4/7] block-sha1: re-use the temporary array as we calculate the SHA1

The mozilla-SHA1 code used an 80-word array for the 80 iterations.  But
the SHA1 state is really just 512 bits, and you can actually keep it in
a kind of "circular queue" of just 16 words instead.

This requires us to do the xor updates as we go along (rather than as a
pre-phase), but that's really what we want to do anyway.

This gets me really close to the OpenSSL performance on my Nehalem.
Look ma, all C code (ok, there's the rol/ror hack, but that one doesn't
strictly even matter on my Nehalem, it's just a local optimization).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

This gets rid of the annoying pre-initialization and its need for that 
extra space. We can dynamically iterate over a 512-bit circular buffer 
instead.

 block-sha1/sha1.c |   28 ++++++++++++++++------------
 1 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 39a5bbb..80193d4 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -96,9 +96,8 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
 
 static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 {
-	int t;
 	unsigned int A,B,C,D,E,TEMP;
-	unsigned int W[80];
+	unsigned int array[16];
 
 	A = ctx->H[0];
 	B = ctx->H[1];
@@ -107,8 +106,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	E = ctx->H[4];
 
 #define T_0_15(t) \
-	TEMP = htonl(data[t]); W[t] = TEMP; \
-	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D)     + E + 0x5a827999; \
+	TEMP = htonl(data[t]); array[t] = TEMP; \
+	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \
 	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
 
 	T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4);
@@ -116,18 +115,21 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14);
 	T_0_15(15);
 
-	/* Unroll it? */
-	for (t = 16; t <= 79; t++)
-		W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
+/* This "rolls" over the 512-bit array */
+#define W(x) (array[(x)&15])
+#define SHA_XOR(t) \
+	TEMP = SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1); W(t) = TEMP;
 
 #define T_16_19(t) \
-	TEMP = SHA_ROL(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
-	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
+	SHA_XOR(t); \
+	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
 
 	T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19);
 
 #define T_20_39(t) \
-	TEMP = SHA_ROL(A,5) + (B^C^D)           + E + W[t] + 0x6ed9eba1; \
+	SHA_XOR(t); \
+	TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0x6ed9eba1; \
 	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
 	T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
@@ -136,7 +138,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39);
 
 #define T_40_59(t) \
-	TEMP = SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \
+	SHA_XOR(t); \
+	TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + 0x8f1bbcdc; \
 	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
 	T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
@@ -145,7 +148,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59);
 
 #define T_60_79(t) \
-	TEMP = SHA_ROL(A,5) + (B^C^D)           + E + W[t] + 0xca62c1d6; \
+	SHA_XOR(t); \
+	TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0xca62c1d6; \
 	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
 	T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64);
-- 
1.6.4.31.g154b2.dirty


* [PATCH 5/7] block-sha1: macroize the rounds a bit further
  2009-08-06 15:20       ` [PATCH 4/7] block-sha1: re-use the temporary array as we calculate the SHA1 Linus Torvalds
@ 2009-08-06 15:22         ` Linus Torvalds
  2009-08-06 15:24           ` [PATCH 6/7] block-sha1: Use '(B&C)+(D&(B^C))' instead of '(B&C)|(D&(B|C))' in round 3 Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 15:22 UTC (permalink / raw)
  To: Git Mailing List, Junio C Hamano


From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, 6 Aug 2009 07:20:54 -0700
Subject: [PATCH 5/7] block-sha1: macroize the rounds a bit further

Avoid repeating the shared parts of the different rounds by adding a
macro layer or two. It was already more cpp than C.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

This makes things denser, and puts all the core rules in one place. That 
first hunk really contains just about all of the important parts of SHA1. 
The rest is just fluff and the necessary expansions etc.

 block-sha1/sha1.c |   56 ++++++++++++++++++++++++----------------------------
 1 files changed, 26 insertions(+), 30 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 80193d4..4837d58 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -94,6 +94,27 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
 
 #endif
 
+/* This "rolls" over the 512-bit array */
+#define W(x) (array[(x)&15])
+
+/*
+ * Where do we get the source from? The first 16 iterations get it from
+ * the input data, the next mix it from the 512-bit array.
+ */
+#define SHA_SRC(t) htonl(data[t])
+#define SHA_MIX(t) SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1)
+
+#define SHA_ROUND(t, input, fn, constant) \
+	TEMP = input(t); W(t) = TEMP; \
+	TEMP += SHA_ROL(A,5) + (fn) + E + (constant); \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP
+
+#define T_0_15(t)  SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999 )
+#define T_16_19(t) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999 )
+#define T_20_39(t) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0x6ed9eba1 )
+#define T_40_59(t) SHA_ROUND(t, SHA_MIX, ((B&C)|(D&(B|C))) , 0x8f1bbcdc )
+#define T_60_79(t) SHA_ROUND(t, SHA_MIX, (B^C^D) ,  0xca62c1d6 )
+
 static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 {
 	unsigned int A,B,C,D,E,TEMP;
@@ -105,53 +126,28 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	D = ctx->H[3];
 	E = ctx->H[4];
 
-#define T_0_15(t) \
-	TEMP = htonl(data[t]); array[t] = TEMP; \
-	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \
-	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
-
+	/* Round 1 - iterations 0-15 take their input from 'data' */
 	T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4);
 	T_0_15( 5); T_0_15( 6); T_0_15( 7); T_0_15( 8); T_0_15( 9);
 	T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14);
 	T_0_15(15);
 
-/* This "rolls" over the 512-bit array */
-#define W(x) (array[(x)&15])
-#define SHA_XOR(t) \
-	TEMP = SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1); W(t) = TEMP;
-
-#define T_16_19(t) \
-	SHA_XOR(t); \
-	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \
-	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
-
+	/* Round 1 - tail. Input from 512-bit mixing array */
 	T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19);
 
-#define T_20_39(t) \
-	SHA_XOR(t); \
-	TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0x6ed9eba1; \
-	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
-
+	/* Round 2 */
 	T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
 	T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29);
 	T_20_39(30); T_20_39(31); T_20_39(32); T_20_39(33); T_20_39(34);
 	T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39);
 
-#define T_40_59(t) \
-	SHA_XOR(t); \
-	TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + 0x8f1bbcdc; \
-	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
-
+	/* Round 3 */
 	T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
 	T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49);
 	T_40_59(50); T_40_59(51); T_40_59(52); T_40_59(53); T_40_59(54);
 	T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59);
 
-#define T_60_79(t) \
-	SHA_XOR(t); \
-	TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0xca62c1d6; \
-	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
-
+	/* Round 4 */
 	T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64);
 	T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69);
 	T_60_79(70); T_60_79(71); T_60_79(72); T_60_79(73); T_60_79(74);
-- 
1.6.4.31.g154b2.dirty


* [PATCH 6/7] block-sha1: Use '(B&C)+(D&(B^C))' instead of '(B&C)|(D&(B|C))' in round 3
  2009-08-06 15:22         ` [PATCH 5/7] block-sha1: macroize the rounds a bit further Linus Torvalds
@ 2009-08-06 15:24           ` Linus Torvalds
  2009-08-06 15:25             ` [PATCH 7/7] block-sha1: get rid of redundant 'lenW' context Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 15:24 UTC (permalink / raw)
  To: Git Mailing List, Junio C Hamano


From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, 6 Aug 2009 07:27:57 -0700
Subject: [PATCH 6/7] block-sha1: Use '(B&C)+(D&(B^C))' instead of '(B&C)|(D&(B|C))' in round 3

It's an equivalent expression, but the '+' gives us some freedom in
instruction selection (for example, we can use 'lea' rather than 'add'),
and associates with the other additions around it to give some minor
scheduling freedom.

Suggested-by: linux@horizon.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

A small micro-optimization. It may not matter hugely, but it's definitely 
the right thing to do. And with the new macroized thing, you can see 
clearly what part of the SHA1 rounds it affects, and how.

 block-sha1/sha1.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 4837d58..9a060a6 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -112,7 +112,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
 #define T_0_15(t)  SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999 )
 #define T_16_19(t) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999 )
 #define T_20_39(t) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0x6ed9eba1 )
-#define T_40_59(t) SHA_ROUND(t, SHA_MIX, ((B&C)|(D&(B|C))) , 0x8f1bbcdc )
+#define T_40_59(t) SHA_ROUND(t, SHA_MIX, ((B&C)+(D&(B^C))) , 0x8f1bbcdc )
 #define T_60_79(t) SHA_ROUND(t, SHA_MIX, (B^C^D) ,  0xca62c1d6 )
 
 static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
-- 
1.6.4.31.g154b2.dirty


* [PATCH 7/7] block-sha1: get rid of redundant 'lenW' context
  2009-08-06 15:24           ` [PATCH 6/7] block-sha1: Use '(B&C)+(D&(B^C))' instead of '(B&C)|(D&(B|C))' in round 3 Linus Torvalds
@ 2009-08-06 15:25             ` Linus Torvalds
  0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 15:25 UTC (permalink / raw)
  To: Git Mailing List, Junio C Hamano


From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, 6 Aug 2009 07:45:46 -0700
Subject: [PATCH 7/7] block-sha1: get rid of redundant 'lenW' context

.. and simplify the ctx->size logic.

We now count the size in bytes, which means that 'lenW' was always just
the low 6 bits of the total size, so we don't carry it around separately
any more.  And we do the 'size in bits' shift at the end.

Suggested by Nicolas Pitre and linux@horizon.com.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

Some final cleanup. Based on two separate discussions yesterday. Trivial 
and doesn't make any difference that I can tell, but definitely the right 
thing to do.

 block-sha1/sha1.c |   17 +++++++----------
 block-sha1/sha1.h |    1 -
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 9a060a6..78dcb0c 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -14,7 +14,6 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data);
 
 void blk_SHA1_Init(blk_SHA_CTX *ctx)
 {
-	ctx->lenW = 0;
 	ctx->size = 0;
 
 	/* Initialize H with the magic constants (see FIPS180 for constants)
@@ -29,9 +28,9 @@ void blk_SHA1_Init(blk_SHA_CTX *ctx)
 
 void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
 {
-	int lenW = ctx->lenW;
+	int lenW = ctx->size & 63;
 
-	ctx->size += (unsigned long long) len << 3;
+	ctx->size += len;
 
 	/* Read the data into W and process blocks as they get full
 	 */
@@ -43,7 +42,6 @@ void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
 		lenW = (lenW + left) & 63;
 		len -= left;
 		data += left;
-		ctx->lenW = lenW;
 		if (lenW)
 			return;
 		blk_SHA1Block(ctx, ctx->W);
@@ -53,10 +51,8 @@ void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
 		data += 64;
 		len -= 64;
 	}
-	if (len) {
+	if (len)
 		memcpy(ctx->W, data, len);
-		ctx->lenW = len;
-	}
 }
 
 
@@ -68,10 +64,11 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
 
 	/* Pad with a binary 1 (ie 0x80), then zeroes, then length
 	 */
-	padlen[0] = htonl(ctx->size >> 32);
-	padlen[1] = htonl(ctx->size);
+	padlen[0] = htonl(ctx->size >> 29);
+	padlen[1] = htonl(ctx->size << 3);
 
-	blk_SHA1_Update(ctx, pad, 1+ (63 & (55 - ctx->lenW)));
+	i = ctx->size & 63;
+	blk_SHA1_Update(ctx, pad, 1+ (63 & (55 - i)));
 	blk_SHA1_Update(ctx, padlen, 8);
 
 	/* Output hash
diff --git a/block-sha1/sha1.h b/block-sha1/sha1.h
index 7be2d93..c1ae74d 100644
--- a/block-sha1/sha1.h
+++ b/block-sha1/sha1.h
@@ -7,7 +7,6 @@
 typedef struct {
 	unsigned int H[5];
 	unsigned int W[16];
-	int lenW;
 	unsigned long long size;
 } blk_SHA_CTX;
 
-- 
1.6.4.31.g154b2.dirty


* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 15:13 [PATCH 0/7] block-sha1: improved SHA1 hashing Linus Torvalds
  2009-08-06 15:15 ` [PATCH 1/7] block-sha1: add new optimized C 'block-sha1' routines Linus Torvalds
@ 2009-08-06 17:22 ` Artur Skawina
  2009-08-06 18:09   ` Linus Torvalds
  1 sibling, 1 reply; 38+ messages in thread
From: Artur Skawina @ 2009-08-06 17:22 UTC (permalink / raw)
  To: Git Mailing List

Linus Torvalds wrote:
> where the thing is loosely based on the Mozilla SHA1 routines but by the 
> end doesn't really resemble them all that much. The Mozilla ones suck 
> donkey d*ck in so many ways - unnecessary copies, idiotic byte-at-a-time 
> build-up of the hash input etc.
> 
> The end result is pretty much equivalent in performance to the OpenSSL 
> SHA1 code for me on x86-64. Getting rid of OpenSSL gets rid of another 

For those curious just how close the C version is to the various
asm and C implementations, the q&d microbenchmark is at 
http://www.src.multimo.pl/YDpqIo7Li27O0L0h/sha1bench.tar.gz

In short: 88% of openssl speed on P3, 42% on P4, 66% on Atom.

artur

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 17:22 ` [PATCH 0/7] block-sha1: improved SHA1 hashing Artur Skawina
@ 2009-08-06 18:09   ` Linus Torvalds
  2009-08-06 19:10     ` Artur Skawina
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 18:09 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Git Mailing List



On Thu, 6 Aug 2009, Artur Skawina wrote:
> 
> For those curious just how close the C version is to the various
> asm and C implementations, the q&d microbenchmark is at 
> http://www.src.multimo.pl/YDpqIo7Li27O0L0h/sha1bench.tar.gz

Hmm. That thing doesn't work at all on x86-64. Even apart from the asm 
sources, your timing thing does some really odd things (why do you do that 
odd "iret" in GETCYCLES and GETTIME?). You're better off using 
lfence/mfence/cpuid, and I think you could make it work on 64-bit that 
way too.

I just hacked it away for testing.

> In short: 88% of openssl speed on P3, 42% on P4, 66% on Atom.

I'll use this to see if I can improve the 32-bit case.

On Nehalem, with your benchmark, I get:

	#             TIME[s] SPEED[MB/s]
	rfc3174         5.122       119.2
	# New hash result: d829b9e028e64840094ab6702f9acdf11bec3937
	rfc3174         5.153       118.5
	linus           2.092       291.8
	linusas         2.056       296.8
	linusas2        1.909       319.8
	mozilla         5.139       118.8
	mozillaas       5.775       105.7
	openssl         1.627       375.1
	spelvin         1.678       363.7
	spelvina        1.603       380.8
	nettle          1.592       383.4

And with the hacked version to get some 64-bit numbers:

	#             TIME[s] SPEED[MB/s]
	rfc3174         3.992       152.9
	# New hash result: b78fd74c0033a4dfe0ededccb85ab00cb56880ab
	rfc3174         3.991       152.9
	linus            1.54       396.3
	linusas         1.533       398.1
	linusas2        1.603       380.9
	mozilla         4.352       140.3
	mozillaas       4.227       144.4

so as you can see, your improvements in 32-bit mode are actually 
de-provements in 64-bit mode (ok, your first one seems to be a tiny 
improvement, but I think it's in the noise).

But you're right, I need to try to improve the 32-bit case.

			Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/7] block-sha1: try to use rol/ror appropriately
  2009-08-06 15:16   ` [PATCH 2/7] block-sha1: try to use rol/ror appropriately Linus Torvalds
  2009-08-06 15:18     ` [PATCH 3/7] block-sha1: make the 'ntohl()' part of the first SHA1 loop Linus Torvalds
@ 2009-08-06 18:25     ` Bert Wesarg
  1 sibling, 0 replies; 38+ messages in thread
From: Bert Wesarg @ 2009-08-06 18:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano

Hi,

On Thu, Aug 6, 2009 at 17:16, Linus
Torvalds<torvalds@linux-foundation.org> wrote:
> diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
> index eef32f7..a45a3de 100644
> --- a/block-sha1/sha1.c
> +++ b/block-sha1/sha1.c
> @@ -80,7 +80,19 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
>                ((unsigned int *)hashout)[i] = htonl(ctx->H[i]);
>  }
>
> -#define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n))))
> +#if defined(__i386__) || defined(__x86_64__)
> +
> +#define SHA_ASM(op, x, n) ({ unsigned int __res; asm(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; })
> +#define SHA_ROL(x,n)   SHA_ASM("rol", x, n)
> +#define SHA_ROR(x,n)   SHA_ASM("ror", x, n)
> +
> +#else
> +
> +#define SHA_ROT(X,n)   (((X) << (l)) | ((X) >> (r)))
I suspect this should be:
#define SHA_ROT(X,l,r)   (((X) << (l)) | ((X) >> (r)))

> +#define SHA_ROL(X,n)   SHA_ROT(X,n,32-(n))
> +#define SHA_ROR(X,n)   SHA_ROT(X,32-(n),n)
> +
> +#endif
>
>  static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
>  {

Bert

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 18:09   ` Linus Torvalds
@ 2009-08-06 19:10     ` Artur Skawina
  2009-08-06 19:41       ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Artur Skawina @ 2009-08-06 19:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds wrote:
> 
> On Thu, 6 Aug 2009, Artur Skawina wrote:
>> For those curious just how close the C version is to the various
>> asm and C implementations, the q&d microbenchmark is at 
>> http://www.src.multimo.pl/YDpqIo7Li27O0L0h/sha1bench.tar.gz
> 
> Hmm. That thing doesn't work at all on x86-64. Even apart from the asm 
> sources, your timing thing does some really odd things (why do you do that 
> odd "iret" in GETCYCLES and GETTIME?). You're better off using 
> lfence/mfence/cpuid, and I think you could make it work on 64-bit that 
> way too.

Yes, it's 32-bit only; I should have mentioned that. The timing
code was written more than a decade ago and it really works on P2;
I haven't updated it, it's all just been copied-and-pasted ever since. All
of it can be safely disabled; on P2 you could account for every cycle,
but nowadays gettimeofday is more than enough.

> I just hacked it away for testing.
> 
>> In short: 88% of openssl speed on P3, 42% on P4, 66% on Atom.
> 
> I'll use this to see if I can improve the 32-bit case.
> 
> On Nehalem, with your benchmark, I get:
> 
> 	#             TIME[s] SPEED[MB/s]
> 	rfc3174         5.122       119.2
> 	# New hash result: d829b9e028e64840094ab6702f9acdf11bec3937
> 	rfc3174         5.153       118.5
> 	linus           2.092       291.8
> 	linusas         2.056       296.8
> 	linusas2        1.909       319.8
> 	mozilla         5.139       118.8
> 	mozillaas       5.775       105.7
> 	openssl         1.627       375.1
> 	spelvin         1.678       363.7
> 	spelvina        1.603       380.8
> 	nettle          1.592       383.4
> 
> And with the hacked version to get some 64-bit numbers:
> 
> 	#             TIME[s] SPEED[MB/s]
> 	rfc3174         3.992       152.9
> 	# New hash result: b78fd74c0033a4dfe0ededccb85ab00cb56880ab
> 	rfc3174         3.991       152.9
> 	linus            1.54       396.3
> 	linusas         1.533       398.1
> 	linusas2        1.603       380.9
> 	mozilla         4.352       140.3
> 	mozillaas       4.227       144.4
> 
> so as you can see, your improvements in 32-bit mode are actually 
> de-provements in 64-bit mode (ok, your first one seems to be a tiny 
> improvement, but I think it's in the noise).

Actually, I didn't keep anything that wasn't a win. One reason
linusas2 stayed was that it really surprised me: I'd have
expected gcc to do a lot worse with the many temporaries, yet
the compiler came up with a 70% gain; gcc really must have improved
when I wasn't looking.

> But you're right, I need to try to improve the 32-bit case.

I never said anything like that. :) There probably isn't all that
much that can be done. I tried a few things, but never saw any
improvement above measurement noise (a few percent). I would have
thought that overlapping the iterations a bit would be a gain, but
that didn't do much (-20%..0); maybe on 64-bit, with more registers...

Oh, i noticed that '-mtune' makes quite a difference, it can change
the relative performance of the functions significantly, in unobvious
ways; depending on which cpu gcc tunes for (build config or -mtune);
some implementations slow down, others become a bit faster.

artur

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 19:10     ` Artur Skawina
@ 2009-08-06 19:41       ` Linus Torvalds
  2009-08-06 20:08         ` Artur Skawina
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 19:41 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Git Mailing List



On Thu, 6 Aug 2009, Artur Skawina wrote:
> 
> Oh, i noticed that '-mtune' makes quite a difference, it can change
> the relative performance of the functions significantly, in unobvious
> ways; depending on which cpu gcc tunes for (build config or -mtune);
> some implementations slow down, others become a bit faster.

That probably is mainly true for P4, although it's quite possible that it 
has an effect for just what the register allocator does, and then for 
spilling.

And it looks like _all_ the tweakability is in the spilling. Nothing else 
matters.

How does this patch work for you? It avoids doing that C-level register 
rotation, and instead rotates the register names with the preprocessor.

I realize it's ugly as hell, but it does make it easier for gcc to see 
what's going on.

The patch is against my git patches, but I think it should apply pretty 
much as-is to your sha1bench sources too. Does it make any difference for 
you?

		Linus

---
 block-sha1/sha1.c |  117 ++++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 90 insertions(+), 27 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 78dcb0c..ac47162 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -101,20 +101,20 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
 #define SHA_SRC(t) htonl(data[t])
 #define SHA_MIX(t) SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1)
 
-#define SHA_ROUND(t, input, fn, constant) \
-	TEMP = input(t); W(t) = TEMP; \
-	TEMP += SHA_ROL(A,5) + (fn) + E + (constant); \
-	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP
+#define SHA_ROUND(t, input, fn, constant, A, B, C, D, E) do { \
+	unsigned int TEMP = input(t); W(t) = TEMP; \
+	TEMP += E + SHA_ROL(A,5) + (fn) + (constant); \
+	B = SHA_ROR(B, 2); E = TEMP; } while (0)
 
-#define T_0_15(t)  SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999 )
-#define T_16_19(t) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999 )
-#define T_20_39(t) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0x6ed9eba1 )
-#define T_40_59(t) SHA_ROUND(t, SHA_MIX, ((B&C)+(D&(B^C))) , 0x8f1bbcdc )
-#define T_60_79(t) SHA_ROUND(t, SHA_MIX, (B^C^D) ,  0xca62c1d6 )
+#define T_0_15(t, A, B, C, D, E)  SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
+#define T_16_19(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
+#define T_20_39(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0x6ed9eba1, A, B, C, D, E )
+#define T_40_59(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, ((B&C)+(D&(B^C))) , 0x8f1bbcdc, A, B, C, D, E )
+#define T_60_79(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (B^C^D) ,  0xca62c1d6, A, B, C, D, E )
 
 static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 {
-	unsigned int A,B,C,D,E,TEMP;
+	unsigned int A,B,C,D,E;
 	unsigned int array[16];
 
 	A = ctx->H[0];
@@ -124,31 +124,94 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	E = ctx->H[4];
 
 	/* Round 1 - iterations 0-16 take their input from 'data' */
-	T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4);
-	T_0_15( 5); T_0_15( 6); T_0_15( 7); T_0_15( 8); T_0_15( 9);
-	T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14);
-	T_0_15(15);
+	T_0_15( 0, A, B, C, D, E);
+	T_0_15( 1, E, A, B, C, D);
+	T_0_15( 2, D, E, A, B, C);
+	T_0_15( 3, C, D, E, A, B);
+	T_0_15( 4, B, C, D, E, A);
+	T_0_15( 5, A, B, C, D, E);
+	T_0_15( 6, E, A, B, C, D);
+	T_0_15( 7, D, E, A, B, C);
+	T_0_15( 8, C, D, E, A, B);
+	T_0_15( 9, B, C, D, E, A);
+	T_0_15(10, A, B, C, D, E);
+	T_0_15(11, E, A, B, C, D);
+	T_0_15(12, D, E, A, B, C);
+	T_0_15(13, C, D, E, A, B);
+	T_0_15(14, B, C, D, E, A);
+	T_0_15(15, A, B, C, D, E);
 
 	/* Round 1 - tail. Input from 512-bit mixing array */
-	T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19);
+	T_16_19(16, E, A, B, C, D);
+	T_16_19(17, D, E, A, B, C);
+	T_16_19(18, C, D, E, A, B);
+	T_16_19(19, B, C, D, E, A);
 
 	/* Round 2 */
-	T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
-	T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29);
-	T_20_39(30); T_20_39(31); T_20_39(32); T_20_39(33); T_20_39(34);
-	T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39);
+	T_20_39(20, A, B, C, D, E);
+	T_20_39(21, E, A, B, C, D);
+	T_20_39(22, D, E, A, B, C);
+	T_20_39(23, C, D, E, A, B);
+	T_20_39(24, B, C, D, E, A);
+	T_20_39(25, A, B, C, D, E);
+	T_20_39(26, E, A, B, C, D);
+	T_20_39(27, D, E, A, B, C);
+	T_20_39(28, C, D, E, A, B);
+	T_20_39(29, B, C, D, E, A);
+	T_20_39(30, A, B, C, D, E);
+	T_20_39(31, E, A, B, C, D);
+	T_20_39(32, D, E, A, B, C);
+	T_20_39(33, C, D, E, A, B);
+	T_20_39(34, B, C, D, E, A);
+	T_20_39(35, A, B, C, D, E);
+	T_20_39(36, E, A, B, C, D);
+	T_20_39(37, D, E, A, B, C);
+	T_20_39(38, C, D, E, A, B);
+	T_20_39(39, B, C, D, E, A);
 
 	/* Round 3 */
-	T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
-	T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49);
-	T_40_59(50); T_40_59(51); T_40_59(52); T_40_59(53); T_40_59(54);
-	T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59);
+	T_40_59(40, A, B, C, D, E);
+	T_40_59(41, E, A, B, C, D);
+	T_40_59(42, D, E, A, B, C);
+	T_40_59(43, C, D, E, A, B);
+	T_40_59(44, B, C, D, E, A);
+	T_40_59(45, A, B, C, D, E);
+	T_40_59(46, E, A, B, C, D);
+	T_40_59(47, D, E, A, B, C);
+	T_40_59(48, C, D, E, A, B);
+	T_40_59(49, B, C, D, E, A);
+	T_40_59(50, A, B, C, D, E);
+	T_40_59(51, E, A, B, C, D);
+	T_40_59(52, D, E, A, B, C);
+	T_40_59(53, C, D, E, A, B);
+	T_40_59(54, B, C, D, E, A);
+	T_40_59(55, A, B, C, D, E);
+	T_40_59(56, E, A, B, C, D);
+	T_40_59(57, D, E, A, B, C);
+	T_40_59(58, C, D, E, A, B);
+	T_40_59(59, B, C, D, E, A);
 
 	/* Round 4 */
-	T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64);
-	T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69);
-	T_60_79(70); T_60_79(71); T_60_79(72); T_60_79(73); T_60_79(74);
-	T_60_79(75); T_60_79(76); T_60_79(77); T_60_79(78); T_60_79(79);
+	T_60_79(60, A, B, C, D, E);
+	T_60_79(61, E, A, B, C, D);
+	T_60_79(62, D, E, A, B, C);
+	T_60_79(63, C, D, E, A, B);
+	T_60_79(64, B, C, D, E, A);
+	T_60_79(65, A, B, C, D, E);
+	T_60_79(66, E, A, B, C, D);
+	T_60_79(67, D, E, A, B, C);
+	T_60_79(68, C, D, E, A, B);
+	T_60_79(69, B, C, D, E, A);
+	T_60_79(70, A, B, C, D, E);
+	T_60_79(71, E, A, B, C, D);
+	T_60_79(72, D, E, A, B, C);
+	T_60_79(73, C, D, E, A, B);
+	T_60_79(74, B, C, D, E, A);
+	T_60_79(75, A, B, C, D, E);
+	T_60_79(76, E, A, B, C, D);
+	T_60_79(77, D, E, A, B, C);
+	T_60_79(78, C, D, E, A, B);
+	T_60_79(79, B, C, D, E, A);
 
 	ctx->H[0] += A;
 	ctx->H[1] += B;

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 19:41       ` Linus Torvalds
@ 2009-08-06 20:08         ` Artur Skawina
  2009-08-06 20:53           ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Artur Skawina @ 2009-08-06 20:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds wrote:
> 
> On Thu, 6 Aug 2009, Artur Skawina wrote:
>> Oh, i noticed that '-mtune' makes quite a difference, it can change
>> the relative performance of the functions significantly, in unobvious
>> ways; depending on which cpu gcc tunes for (build config or -mtune);
>> some implementations slow down, others become a bit faster.
> 
> That probably is mainly true for P4, although it's quite possible that it 
> has an effect for just what the register allocator does, and then for 
> spilling.
> 
> And it looks like _all_ the tweakability is in the spilling. Nothing else 
> matters.
> 
> How does this patch work for you? It avoids doing that C-level register 
> rotation, and instead rotates the register names with the preprocessor.
> 
> I realize it's ugly as hell, but it does make it easier for gcc to see 
> what's going on.
> 
> The patch is against my git patches, but I think it should apply pretty 
> much as-is to your sha1bench sources too. Does it make any difference for 
> you?

it's a bit slower (P4):

before: linus          0.6288       97.06
after:  linus          0.6604       92.42

I was trying similar things, like the example below, too, but it wasn't a
win on 32-bit...

artur

[the iteration below is functionally correct, but the scheduling is most
 likely fubared, as it wasn't a win and I was only checking how much of a
 difference it made on P4 -- ~-20..~0%, but never faster (relative to
 linusas2; it _is_ faster than 'linus'). I dropped this version when
 merging your new preprocessor macros.]

@@ -125,6 +127,8 @@
 #define W(x) (array[(x)&15])
 #define SHA_XOR(t) \
        TEMP = SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1); W(t) = TEMP;
+#define SHA_XOR2(t) \
+       SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1)
 
 #define T_16_19(t) \
         { unsigned TEMP;\
@@ -139,10 +143,27 @@
 #endif
 
 #define T_20_39(t) \
-        { unsigned TEMP;\
-       SHA_XOR(t); \
-       TEMP += (B^C^D) + E + 0x6ed9eba1; \
-       E = D; D = C; C = SHA_ROR(B, 2); B = A; TEMP += SHA_ROL(A,5); A = TEMP; }
+        if (t%2==0) {\
+               unsigned TEMP;\
+               unsigned TEMP2;\
+               \
+               TEMP   = SHA_XOR2(t); \
+               TEMP2  = SHA_XOR2(t+1); \
+               W(t)   = TEMP;\
+               W(t+1) = TEMP2;\
+               TEMP   += E + 0x6ed9eba1; \
+               E      = C;\
+               TEMP   += (B^E^D); \
+               TEMP2  += D + 0x6ed9eba1; \
+               D      = SHA_ROR(B, 2);\
+               B      = SHA_ROL(A, 5);\
+               B      += TEMP;\
+               C      = SHA_ROR(A, 2);\
+               A      ^= E; \
+               A      ^= D; \
+               A      += TEMP2;\
+               A      += SHA_ROL(B, 5);\
+       }
 
 #if UNROLL
        T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 20:08         ` Artur Skawina
@ 2009-08-06 20:53           ` Linus Torvalds
  2009-08-06 21:24             ` Linus Torvalds
  2009-08-06 21:39             ` Artur Skawina
  0 siblings, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 20:53 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Git Mailing List



On Thu, 6 Aug 2009, Artur Skawina wrote:
> 
> it's a bit slower (P4):
> 
> before: linus          0.6288       97.06
> after:  linus          0.6604       92.42

Hmm. Ok, I just tested with your harness, and I get

	#             TIME[s] SPEED[MB/s]
	rfc3174           5.1       119.7
	rfc3174         5.097       119.7
	linus           1.836       332.5
	linusas         2.006       304.3
	linusas2        1.879       324.9
	mozilla         5.562       109.7
	mozillaas       5.913       103.2
	openssl         1.613       378.5
	spelvin         1.698       359.5
	spelvina        1.602         381
	nettle          1.594       382.9

with it, so it is faster for me. So your slowdown seems to be yet another 
P4 thing. Dang crazy micro-architecture.

Of course, it might be a compiler version difference too. I'm using 
gcc-4.4.0.

With the cpp variable renaming, the compiler really has less to be smart 
about, but spill decisions will still matter a lot.

(My old 32-bit numbers were 

        linus           2.092       291.8

so it's a clear improvement on my machine and with my compiler).

It also seems to improve the 64-bit numbers a small bit; I'm getting:

	#             TIME[s] SPEED[MB/s]
	rfc3174          3.98       153.3
	rfc3174         3.972       153.7
	linus           1.514       403.1
	linusas         1.555       392.6
	linusas2        1.599       381.7
	mozilla          4.34       140.6
	mozillaas       4.223       144.5

with my 64-bit compile, so on a Nehalem it's the best one of the C ones by 
a noticeable margin. (My original 64-bit numbers were

        linus            1.54       396.3

and while the numbers seem to fluctuate a bit, the fluctuation is roughly 
in the 1% range, so that improvement seems to be statistically 
significant.)

Oh, I did make a small change, but I doubt it matters. Instead of doing

	TEMP += E + SHA_ROL(A,5) + (fn) + (constant); \
	B = SHA_ROR(B, 2); E = TEMP; } while (0)

I now do

	E += TEMP + SHA_ROL(A,5) + (fn) + (constant); \
	B = SHA_ROR(B, 2); } while (0)

which is a bit more logical (the old TEMP usage was just due to a fairly 
mindless conversion). That _might_ have lower register pressure if the 
compiler is silly enough to not notice that it can do it. Maybe that 
matters.

			Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 20:53           ` Linus Torvalds
@ 2009-08-06 21:24             ` Linus Torvalds
  2009-08-06 21:39             ` Artur Skawina
  1 sibling, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 21:24 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Git Mailing List



On Thu, 6 Aug 2009, Linus Torvalds wrote:
> 
> Hmm. Ok, I just tested with your harness, and I get
> 
> 	#             TIME[s] SPEED[MB/s]
> 	rfc3174           5.1       119.7
> 	rfc3174         5.097       119.7
> 	linus           1.836       332.5
> 	linusas         2.006       304.3
> 	linusas2        1.879       324.9
> 	mozilla         5.562       109.7
> 	mozillaas       5.913       103.2
> 	openssl         1.613       378.5
> 	spelvin         1.698       359.5
> 	spelvina        1.602         381
> 	nettle          1.594       382.9

On atom, I get things like:

	#             TIME[s] SPEED[MB/s]
	rfc3174         2.186       27.92
	rfc3174         2.186       27.92
	linus          0.9492        64.3
	linusas        0.9656       63.21
	linusas2        1.012       60.29
	mozilla         2.492       24.49
	mozillaas         2.5       24.41
	openssl        0.6411        95.2
	spelvin        0.6052       100.8
	spelvina       0.6655       91.71
	nettle         0.7149       85.37

but quite frankly, those timings aren't stable enough to say anything. 
Another few runs got me:

	#             TIME[s] SPEED[MB/s]
	rfc3174         2.207       27.65
	rfc3174          2.21       27.62
	linus           1.022       59.74
	linusas         1.058        57.7
	linusas2        1.008       60.58
	mozilla         2.485       24.56
	mozillaas       2.522        24.2
	openssl        0.6421       95.06
	spelvin        0.5989       101.9
	spelvina       0.6638       91.94
	nettle         0.7132       85.58

	#             TIME[s] SPEED[MB/s]
	rfc3174         2.224       27.44
	rfc3174         2.205       27.68
	linus          0.9727       62.75
	linusas        0.9766        62.5
	linusas2        1.026        59.5
	mozilla          2.52       24.22
	mozillaas       2.547       23.96
	openssl        0.6459        94.5
	spelvin        0.6074       100.5
	spelvina       0.6751       90.41
	nettle         0.7254       84.14

so whatever differences there are between linus*, they seem to be in the 
noise, and the hand-scheduled asm beats all the C versions senseless.

I'd like to get closer to the hand-tuned ones, but I don't see anything to 
do any more. It's all about gcc register choice and avoiding spilling. So 
compiler flags changing small details can have _huge_ differences in 
performance. Here's the Atom numbers with gcc given the "-Os" flag (just 
because I wanted to try):

	linus           1.072       56.94
	linusas        0.9573       63.76
	linusas2       0.9906       61.61

Why did 'linus' numbers go down? No idea. With -O3, it's the other way 
around:

	linus          0.9537          64
	linusas        0.9566        63.8
	linusas2        1.013       60.26

but again, there's variation enough that I'd probably need to do ten runs 
just to see how much is noise. But the "linusas2 sucks with -O3" is clear, 
as is the "linus sucks with -Os" thing. Very odd, and very random.

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 20:53           ` Linus Torvalds
  2009-08-06 21:24             ` Linus Torvalds
@ 2009-08-06 21:39             ` Artur Skawina
  2009-08-06 21:52               ` Artur Skawina
  1 sibling, 1 reply; 38+ messages in thread
From: Artur Skawina @ 2009-08-06 21:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds wrote:
> 
> with it, so it is faster for me. So your slowdown seems to be yet another 
> P4 thing. Dang crazy micro-architecture.
> 
> Of course, it might be a compiler version difference too. I'm using 
> gcc-4.4.0.

gcc version 4.4.1 20090603 (prerelease)

> Oh, I did make a small change, but I doubt it matters. Instead of doing
> 
> 	TEMP += E + SHA_ROL(A,5) + (fn) + (constant); \
> 	B = SHA_ROR(B, 2); E = TEMP; } while (0)
> 
> I now do
> 
> 	E += TEMP + SHA_ROL(A,5) + (fn) + (constant); \
> 	B = SHA_ROR(B, 2); } while (0)
> 
> which is a bit more logical (the old TEMP usage was just due to a fairly 
> mindless conversion). That _might_ have lower register pressure if the 
> compiler is silly enough to not notice that it can do it. Maybe that 
> matters.

before: linus          0.6622       92.17
after:  linus          0.6631       92.05
after:  linus          0.6601       92.46
after:  linus          0.6624       92.14

IOW, no difference, just noise.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 21:39             ` Artur Skawina
@ 2009-08-06 21:52               ` Artur Skawina
  2009-08-06 22:27                 ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Artur Skawina @ 2009-08-06 21:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Artur Skawina wrote:
> Linus Torvalds wrote:
>> Oh, I did make a small change, but I doubt it matters. Instead of doing
>>
>> 	TEMP += E + SHA_ROL(A,5) + (fn) + (constant); \
>> 	B = SHA_ROR(B, 2); E = TEMP; } while (0)
>>
>> I now do
>>
>> 	E += TEMP + SHA_ROL(A,5) + (fn) + (constant); \
>> 	B = SHA_ROR(B, 2); } while (0)
>>
>> which is a bit more logical (the old TEMP usage was just due to a fairly 
>> mindless conversion). That _might_ have lower register pressure if the 
>> compiler is silly enough to not notice that it can do it. Maybe that 
>> matters.
> 
> before: linus          0.6622       92.17
> after:  linus          0.6631       92.05
> after:  linus          0.6601       92.46
> after:  linus          0.6624       92.14
> 
> IOW, no difference, just noise.

Just to check, i did this: 

diff -urNp sha1bench-linus.org/block-sha1/sha1.c sha1bench-linus/block-sha1/sha1.c
--- sha1bench-linus.org/block-sha1/sha1.c	2009-08-06 23:26:15.607321815 +0200
+++ sha1bench-linus/block-sha1/sha1.c	2009-08-06 23:41:36.858325807 +0200
@@ -103,8 +103,8 @@ void blk_SHA1_Final(unsigned char hashou
 
 #define SHA_ROUND(t, input, fn, constant, A, B, C, D, E) do { \
 	unsigned int TEMP = input(t); W(t) = TEMP; \
-	E += TEMP + SHA_ROL(A,5) + (fn) + (constant); \
-	B = SHA_ROR(B, 2); } while (0)
+	E += TEMP + (fn) + (constant); \
+	B = SHA_ROR(B, 2); E += SHA_ROL(A,5); } while (0)
 
 #define T_0_15(t, A, B, C, D, E)  SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
 #define T_16_19(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )

and the result was:

#             TIME[s] SPEED[MB/s]
rfc3174          1.47       41.51
rfc3174         1.474       41.42
linus          0.3564       171.2
linusas        0.5736       106.4
linusas2       0.3568       171.1
mozilla          1.17       52.19
mozillaas       1.382       44.17
openssl        0.2636       231.5
spelvin        0.2662       229.2
spelvina       0.2515       242.7
nettle         0.4386       139.2

Hmm. 
Does this make any difference for you? For me it's the best one so far
(the linusas2 number clearly shows that for me the register renaming does
nothing; other than that, the functions should be very similar).

artur

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 21:52               ` Artur Skawina
@ 2009-08-06 22:27                 ` Linus Torvalds
  2009-08-06 22:33                   ` Linus Torvalds
                                     ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 22:27 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Git Mailing List



On Thu, 6 Aug 2009, Artur Skawina wrote:
> 
> Does this make any difference for you? For me it's the best one so far
> (the linusas2 number clearly shows that for me the register renaming does
> nothing; other than that the functions should be very similar)

Nope. If anything, it's a bit slower, but it might be in the noise. I 
generally got 330MB/s with my "cpp renaming" on Nehalem (32-bit - the 
64-bit numbers are ~400MB/s), but with this I got 325MB/s twice in a row, 
which matches the linusas2 numbers pretty exactly.

But it seems to make a big difference for you.

Btw, _what_ P4 do you have (Northwood or Prescott)?

The Intel optimization manuals very much talk about avoiding rotates. And 
they mention "with a CPUID signature corresponding to family 15 and model 
encoding of 0, 1, or 2" specifically as being longer latency. That's 
basically pre-prescott P4, I think.

Anyway, on P4 I think you have two double-speed integer issue ports (ie 
max four ops per cycle), but only one of them takes a rotate, and only in 
the first half of the cycle (ie just one shift per cycle).

And afaik, that is actually the _improved_ state in Prescott. The older 
P4's didn't have a full shifter unit at all, iirc: shifts were "complex 
instructions" in Northwood and weren't even single-clock.

In Core 2, I think there's still just one shifter unit, but at least it's 
as fast as all the other units. So P4 really does stand out as sucking as 
far as shifts are concerned, and if you have an older P4, it will be even 
worse.

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 22:27                 ` Linus Torvalds
@ 2009-08-06 22:33                   ` Linus Torvalds
  2009-08-06 23:19                     ` Artur Skawina
  2009-08-06 22:55                   ` Artur Skawina
  2009-08-07  2:23                   ` Linus Torvalds
  2 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 22:33 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Git Mailing List



On Thu, 6 Aug 2009, Linus Torvalds wrote:
> 
> Anyway, on P4 I think you have two double-speed integer issue ports (ie 
> max four ops per cycle), but only one of them takes a rotate, and only in 
> the first half of the cycle (ie just one shift per cycle).
> 
> And afaik, that is actually the _improved_ state in Prescott. The older 
> P4's didn't have a full shifter unit at all, iirc: shifts were "complex 
> instructions" in Northwood and weren't even single-clock.

Yeah, verified. Google for

	northwood "barrel shifter"

and you'll find a lot of it.

Basically, older P4's will I think shift one bit at a time. So while even 
Prescott is relatively weak in the shifter department, pre-prescott 
(Willamette and Northwood) are _really_ weak. If your P4 is one of those, 
you really shouldn't use it to decide on optimizations.

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 22:27                 ` Linus Torvalds
  2009-08-06 22:33                   ` Linus Torvalds
@ 2009-08-06 22:55                   ` Artur Skawina
  2009-08-06 23:04                     ` Linus Torvalds
  2009-08-07  2:23                   ` Linus Torvalds
  2 siblings, 1 reply; 38+ messages in thread
From: Artur Skawina @ 2009-08-06 22:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds wrote:
> 
> On Thu, 6 Aug 2009, Artur Skawina wrote:
>> Does this make any difference for you? For me it's the best one so far
>> (the linusas2 number clearly shows that for me the register renaming does
>> nothing; other than that the functions should be very similar)
> 
> Nope. If anything, it's a bit slower, but it might be in the noise. I 
> generally got 330MB/s with my "cpp renaming" on Nehalem (32-bit - the 
> 64-bit numbers are ~400MB/s), but with this I got 325MB/s twice in a row, 
> which matches the linusas2 numbers pretty exactly.
> 
> But it seems to make a big difference for you.

It seems to do well on P2 and P4 here, if it works for core2 this could
be a good generic candidate. It only does 62% on an Atom, but the best C
version so far exceeds it only by ~2%.

> Btw, _what_ P4 do you have (Northwood or Prescott)?

northwood

> The Intel optimization manuals very much talk about avoiding rotates. And 
> they mention "with a CPUID signature corresponding to family 15 and model 
> encoding of 0, 1, or 2" specifically as being longer latency. That's 
> basically pre-prescott P4, I think.

cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 2.80GHz
stepping        : 5

> Anyway, on P4 I think you have two double-speed integer issue ports (ie 
> max four ops per cycle), but only one of them takes a rotate, and only in 
> the first half of the cycle (ie just one shift per cycle).
> 
> And afaik, that is actually the _improved_ state in Prescott. The older 
> P4's didn't have a full shifter unit at all, iirc: shifts were "complex 
> instructions" in Northwood and weren't even single-clock.
> 
> In Core 2, I think there's still just one shifter unit, but at least it's 
> as fast as all the other units. So P4 really does stand out as sucking as 
> far as shifts are concerned, and if you have an older P4, it will be even 
> worse.

hmm, I might be able to try it on some old willamette, but my prescott's
mobo died, so i can't verify that right now.

I'll upload an updated sha1bench, maybe somebody else feels like checking...

artur 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 22:55                   ` Artur Skawina
@ 2009-08-06 23:04                     ` Linus Torvalds
  2009-08-06 23:25                       ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 23:04 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Git Mailing List



On Fri, 7 Aug 2009, Artur Skawina wrote:
> 
> hmm, I might be able to try it on some old willamette, but my prescott's
> mobo died, so i can't verify that right now.

I think willamette and northwood are basically the same wrt shifters (and 
pretty much everything else too, for that matter).  I think northwood is a 
shrink, and had an increased cache size (and higher frequencies). But I 
think core-wise, they're very similar.

It was prescott that changed a lot (mostly for the worse - the shifter was 
one of the few upsides of prescott, although increased frequency often 
made up for the downsides).

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 22:33                   ` Linus Torvalds
@ 2009-08-06 23:19                     ` Artur Skawina
  2009-08-06 23:42                       ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Artur Skawina @ 2009-08-06 23:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds wrote:
> 
> Yeah, verified. Google for
> 
> 	northwood "barrel shifter"
> 
> and you'll find a lot of it.
> 
> Basically, older P4's will I think shift one bit at a time. So while even 
> Prescott is relatively weak in the shifter department, pre-prescott 
> (Willamette and Northwood) are _really_ weak. If your P4 is one of those, 
> you really shouldn't use it to decide on optimizations.

Actually that's even more of a reason to make sure the code doesn't suck :)
The difference on less perverse cpus will usually be small, but on P4 it
can be huge.

A few years back I found my old ip checksum microbenchmark, and when I ran
it on a P4 (prescott iirc) i didn't believe my eyes. The straightforward 
32-bit C implementation was running circles around the in-kernel one...
And a few tweaks to the assembler version got me another ~100% speedup.[1]

After that the P4 became the very first cpu to test any code on... :)

artur

[1] just reran the benchmark on this p4; true on northwood too:

IACCK 0.9.30  Artur Skawina <...>
[ exec time; lower is better  ] [speed ] [ time ]  [ok?]
TIME-N+S TIME32 TIME33 TIME1480 MBYTES/S TIMEXXXX  CSUM FUNCTION ( rdtsc_overhead=0  null=0 )
   17901    510    557     3010   393.36    59772  56dd csum_partial_cdumb16
    3019    154    156      431  2747.10    43106  56dd csum_partial_c32
    2413    170    177      328  3609.76    37501  56dd csum_partial_c32l
    2437    170    170      328  3609.76    37488  56dd csum_partial_c32i
    5078    205    254      767  1543.68    48117  56dd csum_partial_std
    5612    299    291      851  1391.30    53673  56dd csum_partial_686
    1584     99    127      227  5215.86    14495  56dd csum_partial_586f
    1738    107    121      229  5170.31    14785  56dd csum_partial_586fs
    4893    175    171      759  1559.95    52347  56dd csum_partial_copy_generic_std
    4949    151    189      756  1566.14    67847  56dd csum_partial_copy_generic_686
    2072    110    134      302  3920.53    39061  56dd csum_partial_copy_generic_p4as1

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 23:04                     ` Linus Torvalds
@ 2009-08-06 23:25                       ` Linus Torvalds
  2009-08-07  0:13                         ` Linus Torvalds
  2009-08-07  0:53                         ` Artur Skawina
  0 siblings, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 23:25 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Git Mailing List



On Thu, 6 Aug 2009, Linus Torvalds wrote:
> 
> It was prescott that changed a lot (mostly for the worse - the shifter was 
> one of the few upsides of prescott, although increased frequency often 
> made up for the downsides).

Anyway, since you have a Northwood, I bet that the #1 issue for you is to 
spread out the shift instructions in a way that simply doesn't matter 
anywhere else.

In netburst, if I remember the details correctly, a "complex instruction" 
will basically get fed into the trace cache from the microcode ROMs. I'm not sure 
how it interacts with the TC entries around it, but it's entirely possible 
that it basically disables any instruction scheduling (the microcode 
traces are presumably "pre-scheduled"), so you'd basically see stalls 
where there's little out-of-order execution.

That then explains why you see huge differences from what is basically 
trivial scheduling decisions, and why some random placement of a shift 
makes a big difference.

Just out of curiosity, does anything change if you change the

	B = SHA_ROR(B,2)

into a

	B = SHA_ROR(SHA_ROR(B,1),1)

instead? It's very possible that it becomes _much_ worse, but I guess it's 
also possible in theory that a single-bit rotate ends up being a simple 
instruction and that doing two single-bit ROR's is actually faster than 
one 2-bit ROR (assuming the 2-bit one is microcoded and the 1-bit one isn't).

In particular, I'm thinking about the warning in the intel optimization 
manual:

	The rotate by immediate and rotate by register instructions are 
	more expensive than a shift. The rotate by 1 instruction has the 
	same latency as a shift.

so it's very possible that "rotate by 1" is much better than other 
rotates.


			Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 23:19                     ` Artur Skawina
@ 2009-08-06 23:42                       ` Linus Torvalds
  0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 23:42 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Git Mailing List



On Fri, 7 Aug 2009, Artur Skawina wrote:
> 
> Actually that's even more of a reason to make sure the code doesn't suck :)
> The difference on less perverse cpus will usually be small, but on P4 it
> can be huge.

No. First off, the things you have to do on P4 are just insane. See the 
email I just sent out asking you to test whether two 1-bit rotates might 
be faster than 1 2-bit rotate.

So optimizing for P4 is often the wrong thing.

Secondly, P4's are going away. You may have one, but they are getting 
rare. So optimizing for them is a losing proposition in the long run.

> A few years back I found my old ip checksum microbenchmark, and when I ran
> it on a P4 (prescott iirc) i didn't believe my eyes. The straightforward 
> 32-bit C implementation was running circles around the in-kernel one...
> And a few tweaks to the assembler version got me another ~100% speedup.[1]

Yeah, not very surprising. The P4 is very good at the simplest possible 
kind of code that does _nothing_ fancy.

But then it completely chokes on some code. I mean _totally_. It slows 
down by a huge amount if there is anything but the most trivial kinds of 
instructions. And by "trivial", I mean _really_ trivial. Shifts (as in 
SHA1), but iirc also things like "adc" (add with carry) etc.

So it's not hard to write code that works well on other uarchs, and then 
totally blow up on P4. I think it doesn't rename the flags at all, so any 
flag dependency (carry being the most common one) will stall things 
horrible.

There's also a very subtle store forwarding failure thing (and a lot of 
other events) that causes a nasty micro-architectural replay trap, and 
again you go from "running like a bat out of hell" to "slower than a i486 
at a tenth the frequency".

Really. It's disgusting. Perfectly fine code can run really slowly on the 
P4 just because it hits some random internal micro-architectural flaw. And 
there's a _lot_ of those "glass jaw" issues.

The best way to avoid them is to use _only_ simple ALU instructions (add, 
sub, and/or/not), and to be _very_ careful with loads and stores. 

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 23:25                       ` Linus Torvalds
@ 2009-08-07  0:13                         ` Linus Torvalds
  2009-08-07  1:30                           ` Artur Skawina
  2009-08-07  0:53                         ` Artur Skawina
  1 sibling, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-07  0:13 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Git Mailing List



On Thu, 6 Aug 2009, Linus Torvalds wrote:
> 
> In particular, I'm thinking about the warning in the intel optimization 
> manual:
> 
> 	The rotate by immediate and rotate by register instructions are 
> 	more expensive than a shift. The rotate by 1 instruction has the 
> 	same latency as a shift.
> 
> so it's very possible that "rotate by 1" is much better than other 
> rotates.

Hmm. Probably not. Googling more seems to indicate that rotates and shifts 
have a fixed 4-cycle latency on Northwood. I'm not seeing anything that 
indicates that a single-bit rotate/shift would be any faster.

(And remember, if 4 cycles doesn't sound so bad: that's enough of a 
latency to do _16_ "simple" ALU's, since they can be double-pumped in the 
two regular ALU's).

I think long-running ALU ops that feed into a store (spill) also happen to 
be the thing that makes the dreaded store-buffer replay trap nasties 
happen more (load vs store scheduled badly, and then you end up spending 
tens of cycles just replaying).

			Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 23:25                       ` Linus Torvalds
  2009-08-07  0:13                         ` Linus Torvalds
@ 2009-08-07  0:53                         ` Artur Skawina
  1 sibling, 0 replies; 38+ messages in thread
From: Artur Skawina @ 2009-08-07  0:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds wrote:
> 
> Just out of curiosity, does anything change if you change the
> 
> 	B = SHA_ROR(B,2)
> 
> into a
> 
> 	B = SHA_ROR(SHA_ROR(B,1),1)
> 
> instead? It's very possible that it becomes _much_ worse, but I guess it's 

Did try that yesterday, didn't help. Will recheck now.. yep:

before: linus          0.3554       171.7
after:  linus           0.407         150

still true for the current version.

> So optimizing for P4 is often the wrong thing.
> 
> Secondly, P4's are going away. You may have one, but they are getting 
> rare. So optimizing for them is a losing proposition in the long run.

Sure, no argument; it's just that avoiding the P4 pitfalls is usually
not that hard and the impact on other, non-netburst, archs is low.
There are a lot of P4s out there and they're not going away soon.
(i'm still keeping most of my git trees on a P3...)

For generic C code such as this the difference for your i7 was -2% and
+70% for my P4; all the other (but one, i think) optimizations which
worked on P4 also applied to 32-bit i7. As i happen to have a p4 i can
just as well test the code on it, many improvements will likely apply
to other cpus too. That's all, i doubt anybody seriously considered
"optimizing for P4"; there is a reason intel discontinued them :)

The atom is a more important target, but only the asm versions did well
there so far.

artur

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-07  0:13                         ` Linus Torvalds
@ 2009-08-07  1:30                           ` Artur Skawina
  2009-08-07  1:55                             ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Artur Skawina @ 2009-08-07  1:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds wrote:
> 
> On Thu, 6 Aug 2009, Linus Torvalds wrote:
>> In particular, I'm thinking about the warning in the intel optimization 
>> manual:
>>
>> 	The rotate by immediate and rotate by register instructions are 
>> 	more expensive than a shift. The rotate by 1 instruction has the 
>> 	same latency as a shift.
>>
>> so it's very possible that "rotate by 1" is much better than other 
>> rotates.
> 
> Hmm. Probably not. Googling more seems to indicate that rotates and shifts 
> have a fixed 4-cycle latency on Northwood. I'm not seeing anything that 
> indicates that a single-bit rotate/shift would be any faster.
> 
> (And remember, if 4 cycles doesn't sound so bad: that's enough of a 
> latency to do _16_ "simple" ALU's, since they can be double-pumped in the 
> two regular ALU's).

looking at the generated code, there is a lot of ro[rl] movement, so it's
likely that contributes to the problem.

I also see 44 extra lea instructions, 44 fewer adds and changes like:
        [...]
        mov    XX(%eRX),%eRX
        xor    XX(%eRX),%eRX
-       and    %eRX,%eRX
+       and    XX(%eRX),%eRX
        xor    XX(%eRX),%eRX
-       add    %eRX,%eRX
-       ror    $0x2,%eRX
-       mov    %eRX,XX(%eRX)
+       lea    (%eRX,%eRX,1),%eRX
        mov    XX(%eRX),%eRX
        bswap  %eRX
        mov    %eRX,XX(%eRX)
        mov    %eRX,%eRX
+       ror    $0x2,%eRX
+       mov    %eRX,XX(%eRX)
+       mov    %eRX,%eRX
        rol    $0x5,%eRX
        mov    XX(%eRX),%eRX
-       mov    XX(%eRX),%eRX
        [...]
which could mean that gcc did a better job of register allocation
(where "better job" might be just luck).

artur

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-07  1:30                           ` Artur Skawina
@ 2009-08-07  1:55                             ` Linus Torvalds
  0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2009-08-07  1:55 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Git Mailing List



On Fri, 7 Aug 2009, Artur Skawina wrote:
> 
> I also see 44 extra lea instructions, 44 fewer adds

add and lea (as long as the lea shift is 1) should be the same on a P4 
(they are not the same on some other microarchitectures and lea can have 
address generation stalls etc).

Lea, of course, gives the potential for register movement at the same time 
(three-address op), and that's likely the reason for lea-vs-adds.

> and changes like:
>         [...]
>         mov    XX(%eRX),%eRX
>         xor    XX(%eRX),%eRX
> -       and    %eRX,%eRX
> +       and    XX(%eRX),%eRX

Yeah, different spill patterns. That's the biggest issue, I think.

In particular, on P4, with unlucky spills, you may end up with things like

	ror $2,reg
	mov reg,x(%esp)
	.. a few instructions ..
	xor x(%esp), reg

and the above is exactly when one of the worst P4 problems hit: a store, 
followed a few cycles later by a load from the same address (and "a few 
cycles later" can be quite a few instructions if they are the nice ones).

What can happen is that if the store data isn't ready yet (because it 
comes from a long-latency op like a shift or a multiply), then you hit a 
store buffer replay thing. The P4 (with its long pipeline) basically 
starts the load speculatively, and if anything bad happens for the load 
(L1 cache miss, TLB miss, store buffer fault, you name it), it will cause 
a replay of the whole pipeline.

Which can take tens of cycles. 

[ That said, it's been a long time since I did a lot of P4 worrying. So I 
  may mis-remember the details. But that whole store buffer forwarding had 
  some really nasty replay issues ]

> which could mean that gcc did a better job of register allocation
> (where "better job" might be just luck).

I suspect that's the biggest issue. Just _happening_ to get the spills so 
that they don't hurt. And with unlucky scheduling, you might hit some of 
the P4 replay issues every single time.

There are some P4 optimizations that are simple:
 - avoid complex instructions
 - don't blow the trace cache
 - predictable branches
but the replay faults can really get you.

			Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-06 22:27                 ` Linus Torvalds
  2009-08-06 22:33                   ` Linus Torvalds
  2009-08-06 22:55                   ` Artur Skawina
@ 2009-08-07  2:23                   ` Linus Torvalds
  2009-08-07  4:16                     ` Artur Skawina
       [not found]                     ` <alpine.LFD.2.01.0908071614310.3288@localhost.localdomain>
  2 siblings, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2009-08-07  2:23 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Git Mailing List



On Thu, 6 Aug 2009, Linus Torvalds wrote:

> 
> 
> On Thu, 6 Aug 2009, Artur Skawina wrote:
> > 
> > Does this make any difference for you? For me it's the best one so far
> > (the linusas2 number clearly shows that for me the register renaming does
> > nothing; other than that the functions should be very similar)
> 
> Nope. If anything, it's a bit slower, but it might be in the noise. I 
> generally got 330MB/s with my "cpp renaming" on Nehalem (32-bit - the 
> 64-bit numbers are ~400MB/s), but with this I got 325MB/s twice in a row, 
> which matches the linusas2 numbers pretty exactly.

I actually found a P4 I have access to, except that one is a Prescott.

And I can't run it in 32-bit mode, because I only have a regular user 
login, and it only has the 64-bit development environment.

But I can do the hacked-for-64bit sha1bench runs, and I tested your patch.

It's horrible.

Here's the plain "linus" baseline (ie the "Do register rotation in cpp") 
thing, with the fixed "E += TEMP .." thing:

	#             TIME[s] SPEED[MB/s]
	rfc3174         1.648       37.03
	rfc3174         1.677        36.4
	linus          0.4018       151.9
	linusas        0.4439       137.5
	linusas2       0.4381       139.3
	mozilla        0.9587       63.66
	mozillaas      0.9434        64.7

and here it is with your patch:

	#             TIME[s] SPEED[MB/s]
	rfc3174         1.667       36.61
	rfc3174         1.644       37.12
	linus          0.4653       131.2
	linusas        0.4412       138.3
	linusas2       0.4388       139.1
	mozilla        0.9466       64.48
	mozillaas      0.9449       64.59

(ok, so the numbers aren't horribly stable, but the "plain linus" thing 
consistently outperforms here - and underperforms with your patch).

However, note that since this is the 64-bit thing, there likely aren't any 
spill issues, but it's simply an issue of "just how did the array[] 
accesses get scheduled" etc. And since this is a Prescott (or rather 
"Xeon") P4, the shifter isn't quite as horrible as yours is. _And_ this is 
a different gcc version (4.0.3).

So the numbers aren't really all that comparable. It's more an example of 
"optimizing for P4 is futile, because you're just playing with total 
randomness". That's like a 20MB/s difference, just from moving a few ALU 
ops around a bit.

And it's entirely possible that if I had gcc-4.4 on that machine, your 
patch would magically do the right thing ;)

Sadly, that machine is just a ssh gateway, so there's no real development 
tools on it at all - no way to get good profiles etc. So I can't really 
say exactly what the problem pattern is :(

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-07  2:23                   ` Linus Torvalds
@ 2009-08-07  4:16                     ` Artur Skawina
       [not found]                     ` <alpine.LFD.2.01.0908071614310.3288@localhost.localdomain>
  1 sibling, 0 replies; 38+ messages in thread
From: Artur Skawina @ 2009-08-07  4:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds wrote:
> 
> Here's the plain "linus" baseline (ie the "Do register rotation in cpp") 
> thing, with the fixed "E += TEMP .." thing:

> 	linus          0.4018       151.9

> and here it is with your patch:

> 	linus          0.4653       131.2

> (ok, so the numbers aren't horribly stable, but the "plain linus" thing 
> consistently outperforms here - and underperforms with your patch).

Well, I'd be surprised if one C version would always be the winner on
every single cpu; that 13% loss[1] I think would be an acceptable compromise,
if the goal is to have one implementation that does reasonably well on all
cpus.

That's why i asked how the change did on nehalem; if it's a measurable loss
on anything modern (core2+), then of course the P4s must suffer; and one
could always blame the compiler ;)
It's not like the difference in sha1 overhead will be noticeable in normal
git use.

artur

[1] I suspect the old gcc is a factor (4.0.4 does <100M/s here).

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
       [not found]                             ` <4A7CC380.3070008@gmail.com>
@ 2009-08-08  4:16                               ` Linus Torvalds
  2009-08-08  5:34                                 ` Artur Skawina
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-08  4:16 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Git Mailing List



On Sat, 8 Aug 2009, Artur Skawina wrote:
> 
> i was seeing such large variations depending on the -mtune flags that
> i gave up and now do just -march=i686; that's what i would expect
> for generic x86 binaries.

I think I have found a way to avoid the gcc craziness.

Lookie here:

	#             TIME[s] SPEED[MB/s]
	rfc3174         5.094       119.8
	rfc3174         5.098       119.7
	linus           1.462       417.5
	linusas         2.008         304
	linusas2        1.878         325
	mozilla         5.566       109.6
	mozillaas       5.866       104.1
	openssl         1.609       379.3
	spelvin         1.675       364.5
	spelvina        1.601       381.3
	nettle          1.591       383.6

notice? I outperform all the hand-tuned asm on 32-bit too. By quite a 
margin, in fact.

Now, I didn't try a P4, and it's possible that it won't do that there, but 
the 32-bit code generation sure looks impressive on my Nehalem box. The 
magic? I force the stores to the 512-bit hash bucket to be done in order. 
That seems to help a lot.

The diff is trivial (on top of the "rename registers with cpp" patch), as 
appended. And it does seem to fix the P4 issues too, although I can 
obviously (once again) only test Prescott, and only in 64-bit mode:

	#             TIME[s] SPEED[MB/s]
	rfc3174         1.662       36.73
	rfc3174          1.64       37.22
	linus          0.2523       241.9
	linusas        0.4367       139.8
	linusas2       0.4487         136
	mozilla        0.9704        62.9
	mozillaas      0.9399       64.94

that's some really impressive improvement. All from just saying "do the 
stores in the order I told you to, dammit!" to the compiler.

		Linus

---
 block-sha1/sha1.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 19dc41d..f70e1ba 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -93,6 +93,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
 
 /* This "rolls" over the 512-bit array */
 #define W(x) (array[(x)&15])
+#define setW(x, val) (*(volatile unsigned int *)&W(x) = (val))
 
 /*
  * Where do we get the source from? The first 16 iterations get it from
@@ -102,7 +103,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
 #define SHA_MIX(t) SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1)
 
 #define SHA_ROUND(t, input, fn, constant, A, B, C, D, E) do { \
-	unsigned int TEMP = input(t); W(t) = TEMP; \
+	unsigned int TEMP = input(t); setW(t, TEMP); \
 	E += TEMP + SHA_ROL(A,5) + (fn) + (constant); \
 	B = SHA_ROR(B, 2); } while (0)
 

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-08  4:16                               ` Linus Torvalds
@ 2009-08-08  5:34                                 ` Artur Skawina
  2009-08-08 17:10                                   ` Linus Torvalds
  2009-08-08 22:58                                   ` Artur Skawina
  0 siblings, 2 replies; 38+ messages in thread
From: Artur Skawina @ 2009-08-08  5:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds wrote:
> 
> I think I have found a way to avoid the gcc craziness.
> 
> Lookie here:
> 
> 	#             TIME[s] SPEED[MB/s]
> 	rfc3174         5.094       119.8
> 	rfc3174         5.098       119.7
> 	linus           1.462       417.5
> 	linusas         2.008         304
> 	linusas2        1.878         325
> 	mozilla         5.566       109.6
> 	mozillaas       5.866       104.1
> 	openssl         1.609       379.3
> 	spelvin         1.675       364.5
> 	spelvina        1.601       381.3
> 	nettle          1.591       383.6
> 
> notice? I outperform all the hand-tuned asm on 32-bit too. By quite a 
> margin, in fact.
> 
> Now, I didn't try a P4, and it's possible that it won't do that there, but 
> the 32-bit code generation sure looks impressive on my Nehalem box. The 
> magic? I force the stores to the 512-bit hash bucket to be done in order. 
> That seems to help a lot.

I named it 'linusv':

P4/i686:
#             TIME[s] SPEED[MB/s]
rfc3174         1.456       41.92
rfc3174         1.445       42.22
linus          0.5865       104.1
linusph        0.5643       108.2
linusv         0.3697       165.1
linusvph       0.3618       168.7
linusp4        0.4312       141.5
linusas        0.4091       149.2
linusas2       0.4364       139.9
mozilla         1.102       55.37
mozillaas       1.297       47.07
openssl         0.261       233.9
opensslb       0.2395       254.9
spelvin        0.2653         230
nettle          0.438       139.4

and when tuning for prescott:

linus          0.6544       93.27
linusph        0.6523       93.57
linusv         0.3439       177.5
linusvph       0.3547       172.1
linusp4        0.3585       170.3

so it isn't as fast as the openssl asm ones, but it does win
in the C category.

> I outperform all the hand-tuned asm on 32-bit too. By quite a 
> margin, in fact.

I've inlined the byteswapping in 'opensslb', maybe that one will
do a bit better.

http://www.src.multimo.pl/YDpqIo7Li27O0L0h/sha1bench.tar.gz

artur

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-08  5:34                                 ` Artur Skawina
@ 2009-08-08 17:10                                   ` Linus Torvalds
  2009-08-08 18:12                                     ` Artur Skawina
  2009-08-08 22:58                                   ` Artur Skawina
  1 sibling, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-08 17:10 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Git Mailing List



On Sat, 8 Aug 2009, Artur Skawina wrote:
> 
> I've inlined the byteswapping in 'opensslb', maybe that one will
> do a bit better.
> 
> http://www.src.multimo.pl/YDpqIo7Li27O0L0h/sha1bench.tar.gz

Hmm. Testing on my atom, the inlined bswap is worse, but the asm versions 
are generally superior to any C one:

	#             TIME[s] SPEED[MB/s]
	rfc3174         2.194       27.82
	rfc3174          2.19       27.87
	linus           0.947       64.45
	linusph        0.9381       65.06
	linusv         0.8943       68.25
	linusvph       0.8803       69.34
	linusasm       0.9349       65.29
	linusp4         1.006       60.66
	linusas         1.062       57.48
	linusas2        1.009        60.5
	mozilla         2.264       26.96
	mozillaas       2.197       27.78
	openssl         0.648       94.19
	opensslb       0.7419       82.27
	spelvin         0.636       95.96
	spelvina       0.6671       91.49
	nettle          0.717       85.12
	nettle-ror     0.7137       85.52
	nettle-p4sch   0.7158       85.27

Interestingly, -mtune=prescott does well for that 'linusv' version on atom 
too, and gets it up to

	linusv         0.8365       72.96

and it's the only one that improves. Odd interactions.

			Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-08 17:10                                   ` Linus Torvalds
@ 2009-08-08 18:12                                     ` Artur Skawina
  0 siblings, 0 replies; 38+ messages in thread
From: Artur Skawina @ 2009-08-08 18:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds wrote:
> 
> On Sat, 8 Aug 2009, Artur Skawina wrote:
>> I've inlined the byteswapping in 'opensslb', maybe that one will
>> do a bit better.
> Hmm. Testing on my atom, the inlined bswap is worse, but the asm versions 
> are generally superior to any C one:

It loses on atom, but is the best one on both P3 and P4 here.
Based on your other numbers I was expecting it to win on 32-bit
nehalem too. gcc doing a better job of scheduling w/ 'linusv'
wouldn't surprise though (since there are no spills, the data reads
are about the only other thing that could make a difference. And, yes,
they show up in the profiles; if x86 only had one more register...)

artur

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-08  5:34                                 ` Artur Skawina
  2009-08-08 17:10                                   ` Linus Torvalds
@ 2009-08-08 22:58                                   ` Artur Skawina
  2009-08-08 23:36                                     ` Artur Skawina
  1 sibling, 1 reply; 38+ messages in thread
From: Artur Skawina @ 2009-08-08 22:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Artur Skawina wrote:
> Linus Torvalds wrote:
>> magic? I force the stores to the 512-bit hash bucket to be done in order. 
>> That seems to help a lot.
> 
> I named it 'linusv':

> linusv         0.3697       165.1

I was not going to spend even more time on the C version, but after looking
at what gcc does to it, I tried this:

diff --git a/block-sha1/sha1vol.c b/block-sha1/sha1vol.c
--- a/block-sha1/sha1vol.c
+++ b/block-sha1/sha1vol.c
@@ -93,7 +93,7 @@ void blk_SHA1_Finalv(unsigned char hashout[20], blk_SHA_CTX *ctx)
 
 /* This "rolls" over the 512-bit array */
 #define W(x) (array[(x)&15])
-#define setW(x, val) (*(volatile unsigned int *)&W(x) = (val))
+#define setW(x, val) W(x) = (val); __asm__ volatile ("": "+m" (W(x)))
 
 /*
  * Where do we get the source from? The first 16 iterations get it from

and got a nice improvement:

rfc3174         1.436       42.49
linus          0.5843       104.5
linusph        0.5639       108.2
linusv         0.3098         197
linusvph       0.3082       198.1
linusasm       0.5849       104.3
linusp4         0.433         141
linusas        0.4077       149.7
linusas2        0.436         140
mozilla         1.099       55.54
mozillaas       1.295       47.11
openssl        0.2632       231.9
opensslb       0.2395       254.8
spelvin        0.2687       227.2
spelvina       0.2526       241.7
nettle         0.4378       139.4
nettle-ror     0.4379       139.4
nettle-p4sch   0.4231       144.2

The atom numbers didn't change much.

artur


* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
  2009-08-08 22:58                                   ` Artur Skawina
@ 2009-08-08 23:36                                     ` Artur Skawina
  0 siblings, 0 replies; 38+ messages in thread
From: Artur Skawina @ 2009-08-08 23:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Artur Skawina wrote:
> Artur Skawina wrote:

> -#define setW(x, val) (*(volatile unsigned int *)&W(x) = (val))
> +#define setW(x, val) W(x) = (val); __asm__ volatile ("": "+m" (W(x)))

and w/ this on top:

diff --git a/block-sha1/sha1vol.c b/block-sha1/sha1vol.c
--- a/block-sha1/sha1vol.c
+++ b/block-sha1/sha1vol.c
@@ -103,9 +103,9 @@ void blk_SHA1_Finalv(unsigned char hashout[20], blk_SHA_CTX *ctx)
 #define SHA_MIX(t) SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1)
 
 #define SHA_ROUND(t, input, fn, constant, A, B, C, D, E) do { \
-	unsigned int TEMP = input(t); setW(t, TEMP); \
-	E += TEMP + SHA_ROL(A,5) + (fn) + (constant); \
-	B = SHA_ROR(B, 2); } while (0)
+	unsigned int TEMP = SHA_ROL(A,5); E+= (fn); \
+	E += (constant) + TEMP; TEMP = input(t); setW(t, TEMP); \
+	B = SHA_ROR(B, 2); E += TEMP; } while (0)
 
 #define T_0_15(t, A, B, C, D, E)  SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
 #define T_16_19(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )

I see an improvement on atom and reach ~200M/s on P4 (i686).
When compiled w/ '-mtune=prescott':

rfc3174         1.459       41.84
linus          0.6574       92.85
linusph        0.6613       92.29
linusv         0.2682       227.6
linusvph       0.2681       227.7
linusasm       0.5868         104
linusp4        0.3586       170.2
linusas        0.3795       160.8
linusas2       0.3583       170.3
mozilla         1.171       52.11
mozillaas       1.381        44.2
openssl        0.2623       232.7
opensslb       0.2404       253.9
spelvin        0.2659       229.6
spelvina       0.2492       244.9
nettle         0.4362       139.9
nettle-ror      0.436         140
nettle-p4sch   0.4204       145.2

it's now just 2% slower than the openssl assembler version.

artur


* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
@ 2009-08-07  7:36 George Spelvin
  0 siblings, 0 replies; 38+ messages in thread
From: George Spelvin @ 2009-08-07  7:36 UTC (permalink / raw)
  To: torvalds; +Cc: art.08.09, git, linux

> Basically, older P4's will I think shift one bit at a time. So while even 
> Prescott is relatively weak in the shifter department, pre-prescott 
> (Willamette and Northwood) are _really_ weak. If your P4 is one of those, 
> you really shouldn't use it to decide on optimizations.

Thanks for the warning.  The only P4 I have a login on is a Willamette, so
I guess it's a bad optimization target:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 1
model name      : Intel(R) Pentium(R) 4 CPU 1.60GHz
stepping        : 2
cpu MHz         : 1593.888
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pebs bts
bogomips        : 3187.77
clflush size    : 64
power management:



BTW, I'm having a bit of a problem with sha1bench.  Because the number
of rounds is multiplied by 10 until it takes more than mintime, and
the watchdog timer is set to 12*mintime, if I just barely miss the
threshold, I can hit the watchdog on any code that's more than 20% slower
than the reference rfc3174 code:

# /tmp/sha1bench 
#Initializing... Rounds: 10000000, size: 625000K, time: 6.904s, speed: 88.41MB/s
#             TIME[s] SPEED[MB/s]
rfc3174         6.912       88.31
# New hash result: 0489b02aee9fbd82b0bb0cba96f8047e42f543b8
rfc3174         6.911       88.31
linus           3.002       203.3
linusas         3.253       187.7
linusas2        3.059       199.6
mozilla         10.86       56.22
mozillaas       10.33       59.09
openssl         2.145       284.5
spelvin         1.933       315.8
spelvina        1.933       315.8
nettle          2.161       282.4

I had to bump it to 20 times to be able to get past the mozilla code.

You can still have nice round numbers with smaller increments of 2x or 2.5x:

    do {
-      rounds *= 10;
+      rounds *= 2;
+      if (rounds % 9 == 4)
+        rounds += rounds/4;     /* Step 1, 2, 5, 10, 20, 50, 100... */
       uWATCHDOG(mintime*20);
       t1 = GETTIME();
       for (i=0; i<rounds; i++) {
          SHA1Input(sc, arr512b, sizeof(arr512b));
          if (i<64) {
             SHA1Result(sc, arr512b+
                     (i+arr512b[i]+arr512b[arr512b[i]%64]+
                      arr512b[63-i]+arr512b[arr512b[63-i]%64]) %
                     (sizeof(arr512b)-20));
             SHA1Reset(sc);
             }
          }
       t2 = GETTIME(); td = t2-t1;
       }
    while (td<mintime);

And yeah, my P4 is touchy:
# /var/tmp/sha1bench 
#Initializing... Rounds: 500000, size: 31250K, time: 0.9736s, speed: 31.35MB/s
#             TIME[s] SPEED[MB/s]
rfc3174        0.9931       30.73
# New hash result: 2872616106e163ae9c7c8d12a38bef032323c844
rfc3174        0.9491       32.16
linus          0.4906        62.2
linusas        0.5799       52.62
linusas2       0.4859       62.81
mozilla         1.302       23.44
mozillaas       1.234       24.74
openssl         0.226         135
spelvin        0.2298       132.8
spelvina       0.2494       122.4
nettle         0.3687       82.78


end of thread, other threads:[~2009-08-08 23:37 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-06 15:13 [PATCH 0/7] block-sha1: improved SHA1 hashing Linus Torvalds
2009-08-06 15:15 ` [PATCH 1/7] block-sha1: add new optimized C 'block-sha1' routines Linus Torvalds
2009-08-06 15:16   ` [PATCH 2/7] block-sha1: try to use rol/ror appropriately Linus Torvalds
2009-08-06 15:18     ` [PATCH 3/7] block-sha1: make the 'ntohl()' part of the first SHA1 loop Linus Torvalds
2009-08-06 15:20       ` [PATCH 4/7] block-sha1: re-use the temporary array as we calculate the SHA1 Linus Torvalds
2009-08-06 15:22         ` [PATCH 5/7] block-sha1: macroize the rounds a bit further Linus Torvalds
2009-08-06 15:24           ` [PATCH 6/7] block-sha1: Use '(B&C)+(D&(B^C))' instead of '(B&C)|(D&(B|C))' in round 3 Linus Torvalds
2009-08-06 15:25             ` [PATCH 7/7] block-sha1: get rid of redundant 'lenW' context Linus Torvalds
2009-08-06 18:25     ` [PATCH 2/7] block-sha1: try to use rol/ror appropriately Bert Wesarg
2009-08-06 17:22 ` [PATCH 0/7] block-sha1: improved SHA1 hashing Artur Skawina
2009-08-06 18:09   ` Linus Torvalds
2009-08-06 19:10     ` Artur Skawina
2009-08-06 19:41       ` Linus Torvalds
2009-08-06 20:08         ` Artur Skawina
2009-08-06 20:53           ` Linus Torvalds
2009-08-06 21:24             ` Linus Torvalds
2009-08-06 21:39             ` Artur Skawina
2009-08-06 21:52               ` Artur Skawina
2009-08-06 22:27                 ` Linus Torvalds
2009-08-06 22:33                   ` Linus Torvalds
2009-08-06 23:19                     ` Artur Skawina
2009-08-06 23:42                       ` Linus Torvalds
2009-08-06 22:55                   ` Artur Skawina
2009-08-06 23:04                     ` Linus Torvalds
2009-08-06 23:25                       ` Linus Torvalds
2009-08-07  0:13                         ` Linus Torvalds
2009-08-07  1:30                           ` Artur Skawina
2009-08-07  1:55                             ` Linus Torvalds
2009-08-07  0:53                         ` Artur Skawina
2009-08-07  2:23                   ` Linus Torvalds
2009-08-07  4:16                     ` Artur Skawina
     [not found]                     ` <alpine.LFD.2.01.0908071614310.3288@localhost.localdomain>
     [not found]                       ` <4A7CBD28.6070306@gmail.com>
     [not found]                         ` <4A7CBF47.9000903@gmail.com>
     [not found]                           ` <alpine.LFD.2.01.0908071700290.3288@localhost.localdomain>
     [not found]                             ` <4A7CC380.3070008@gmail.com>
2009-08-08  4:16                               ` Linus Torvalds
2009-08-08  5:34                                 ` Artur Skawina
2009-08-08 17:10                                   ` Linus Torvalds
2009-08-08 18:12                                     ` Artur Skawina
2009-08-08 22:58                                   ` Artur Skawina
2009-08-08 23:36                                     ` Artur Skawina
2009-08-07  7:36 George Spelvin
