* [PATCH 0/7] block-sha1: improved SHA1 hashing
@ 2009-08-06 15:13 Linus Torvalds
2009-08-06 15:15 ` [PATCH 1/7] block-sha1: add new optimized C 'block-sha1' routines Linus Torvalds
2009-08-06 17:22 ` [PATCH 0/7] block-sha1: improved SHA1 hashing Artur Skawina
0 siblings, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 15:13 UTC (permalink / raw)
To: Git Mailing List, Junio C Hamano
The bulk of this is the patches I sent out yesterday, but there's a few
added tweaks from today there, and it's now one nice series instead of a
patch followed by a "Haa, I improved on it" etc, so you can get the whole
thing without actually having to hunt for the pieces.
It's a series of 7 patches:
block-sha1: add new optimized C 'block-sha1' routines
block-sha1: try to use rol/ror appropriately
block-sha1: make the 'ntohl()' part of the first SHA1 loop
block-sha1: re-use the temporary array as we calculate the SHA1
block-sha1: macroize the rounds a bit further
block-sha1: Use '(B&C)+(D&(B^C))' instead of '(B&C)|(D&(B|C))' in round 3
block-sha1: get rid of redundant 'lenW' context
where the thing is loosely based on the Mozilla SHA1 routines but by the
end doesn't really resemble them all that much. The Mozilla ones suck
donkey d*ck in so many ways - unnecessary copies, idiotic byte-at-a-time
build-up of the hash input etc.
The end result is pretty much equivalent in performance to the OpenSSL
SHA1 code for me on x86-64. Getting rid of OpenSSL gets rid of another
couple of shared library loadings, and as a result my "make -j64 test" is
now a couple of seconds faster with this than with OpenSSL in my rather
unscientific tests.
The code isn't very big:
Makefile | 9 +++
block-sha1/sha1.c | 158 +++++++++++++++++++++++++++++++++++++++++++++++++++++
block-sha1/sha1.h | 20 +++++++
3 files changed, 187 insertions(+), 0 deletions(-)
create mode 100644 block-sha1/sha1.c
create mode 100644 block-sha1/sha1.h
and passes all my tests. It's really hard to get the SHA1 wrong and still
pass any tests at all, so it should all be good.
Enable them with BLK_SHA1=1.
Linus
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH 1/7] block-sha1: add new optimized C 'block-sha1' routines
2009-08-06 15:13 [PATCH 0/7] block-sha1: improved SHA1 hashing Linus Torvalds
@ 2009-08-06 15:15 ` Linus Torvalds
2009-08-06 15:16 ` [PATCH 2/7] block-sha1: try to use rol/ror appropriately Linus Torvalds
2009-08-06 17:22 ` [PATCH 0/7] block-sha1: improved SHA1 hashing Artur Skawina
1 sibling, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 15:15 UTC (permalink / raw)
To: Git Mailing List, Junio C Hamano
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 5 Aug 2009 14:49:32 -0700
Subject: [PATCH 1/7] block-sha1: add new optimized C 'block-sha1' routines
Based on the Mozilla SHA1 routine, but doing the input data accesses a
word at a time and with 'htonl()' instead of loading bytes and shifting.
It requires an architecture that is ok with unaligned 32-bit loads and a
fast htonl().
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
Side note: we can get rid of the "arch must support unaligned loads"
requirement later on by telling the compiler which ones _can_ be
unaligned. So the limitations aren't all that fundamental, really.
Makefile | 9 +++
block-sha1/sha1.c | 145 +++++++++++++++++++++++++++++++++++++++++++++++++++++
block-sha1/sha1.h | 21 ++++++++
3 files changed, 175 insertions(+), 0 deletions(-)
diff --git a/Makefile b/Makefile
index d7669b1..f12024c 100644
--- a/Makefile
+++ b/Makefile
@@ -84,6 +84,10 @@ all::
# specify your own (or DarwinPort's) include directories and
# library directories by defining CFLAGS and LDFLAGS appropriately.
#
+# Define BLK_SHA1 environment variable if you want the C version
+# of the SHA1 that assumes you can do unaligned 32-bit loads and
+# have a fast htonl() function.
+#
# Define PPC_SHA1 environment variable when running make to make use of
# a bundled SHA1 routine optimized for PowerPC.
#
@@ -1166,6 +1170,10 @@ ifdef NO_DEFLATE_BOUND
BASIC_CFLAGS += -DNO_DEFLATE_BOUND
endif
+ifdef BLK_SHA1
+ SHA1_HEADER = "block-sha1/sha1.h"
+ LIB_OBJS += block-sha1/sha1.o
+else
ifdef PPC_SHA1
SHA1_HEADER = "ppc/sha1.h"
LIB_OBJS += ppc/sha1.o ppc/sha1ppc.o
@@ -1183,6 +1191,7 @@ else
endif
endif
endif
+endif
ifdef NO_PERL_MAKEMAKER
export NO_PERL_MAKEMAKER
endif
diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
new file mode 100644
index 0000000..eef32f7
--- /dev/null
+++ b/block-sha1/sha1.c
@@ -0,0 +1,145 @@
+/*
+ * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.c),
+ * optimized to do word accesses rather than byte accesses,
+ * and to avoid unnecessary copies into the context array.
+ */
+
+#include <string.h>
+#include <arpa/inet.h>
+
+#include "sha1.h"
+
+/* Hash one 64-byte block of data */
+static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data);
+
+void blk_SHA1_Init(blk_SHA_CTX *ctx)
+{
+ ctx->lenW = 0;
+ ctx->size = 0;
+
+ /* Initialize H with the magic constants (see FIPS180 for constants)
+ */
+ ctx->H[0] = 0x67452301;
+ ctx->H[1] = 0xefcdab89;
+ ctx->H[2] = 0x98badcfe;
+ ctx->H[3] = 0x10325476;
+ ctx->H[4] = 0xc3d2e1f0;
+}
+
+
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
+{
+ int lenW = ctx->lenW;
+
+ ctx->size += (unsigned long long) len << 3;
+
+ /* Read the data into W and process blocks as they get full
+ */
+ if (lenW) {
+ int left = 64 - lenW;
+ if (len < left)
+ left = len;
+ memcpy(lenW + (char *)ctx->W, data, left);
+ lenW = (lenW + left) & 63;
+ len -= left;
+ data += left;
+ ctx->lenW = lenW;
+ if (lenW)
+ return;
+ blk_SHA1Block(ctx, ctx->W);
+ }
+ while (len >= 64) {
+ blk_SHA1Block(ctx, data);
+ data += 64;
+ len -= 64;
+ }
+ if (len) {
+ memcpy(ctx->W, data, len);
+ ctx->lenW = len;
+ }
+}
+
+
+void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
+{
+ static const unsigned char pad[64] = { 0x80 };
+ unsigned int padlen[2];
+ int i;
+
+ /* Pad with a binary 1 (ie 0x80), then zeroes, then length
+ */
+ padlen[0] = htonl(ctx->size >> 32);
+ padlen[1] = htonl(ctx->size);
+
+ blk_SHA1_Update(ctx, pad, 1+ (63 & (55 - ctx->lenW)));
+ blk_SHA1_Update(ctx, padlen, 8);
+
+ /* Output hash
+ */
+ for (i = 0; i < 5; i++)
+ ((unsigned int *)hashout)[i] = htonl(ctx->H[i]);
+}
+
+#define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n))))
+
+static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
+{
+ int t;
+ unsigned int A,B,C,D,E,TEMP;
+ unsigned int W[80];
+
+ for (t = 0; t < 16; t++)
+ W[t] = htonl(data[t]);
+
+ /* Unroll it? */
+ for (t = 16; t <= 79; t++)
+ W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
+
+ A = ctx->H[0];
+ B = ctx->H[1];
+ C = ctx->H[2];
+ D = ctx->H[3];
+ E = ctx->H[4];
+
+#define T_0_19(t) \
+ TEMP = SHA_ROT(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \
+ E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+
+ T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4);
+ T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9);
+ T_0_19(10); T_0_19(11); T_0_19(12); T_0_19(13); T_0_19(14);
+ T_0_19(15); T_0_19(16); T_0_19(17); T_0_19(18); T_0_19(19);
+
+#define T_20_39(t) \
+ TEMP = SHA_ROT(A,5) + (B^C^D) + E + W[t] + 0x6ed9eba1; \
+ E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+
+ T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
+ T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29);
+ T_20_39(30); T_20_39(31); T_20_39(32); T_20_39(33); T_20_39(34);
+ T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39);
+
+#define T_40_59(t) \
+ TEMP = SHA_ROT(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \
+ E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+
+ T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
+ T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49);
+ T_40_59(50); T_40_59(51); T_40_59(52); T_40_59(53); T_40_59(54);
+ T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59);
+
+#define T_60_79(t) \
+ TEMP = SHA_ROT(A,5) + (B^C^D) + E + W[t] + 0xca62c1d6; \
+ E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+
+ T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64);
+ T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69);
+ T_60_79(70); T_60_79(71); T_60_79(72); T_60_79(73); T_60_79(74);
+ T_60_79(75); T_60_79(76); T_60_79(77); T_60_79(78); T_60_79(79);
+
+ ctx->H[0] += A;
+ ctx->H[1] += B;
+ ctx->H[2] += C;
+ ctx->H[3] += D;
+ ctx->H[4] += E;
+}
diff --git a/block-sha1/sha1.h b/block-sha1/sha1.h
new file mode 100644
index 0000000..7be2d93
--- /dev/null
+++ b/block-sha1/sha1.h
@@ -0,0 +1,21 @@
+/*
+ * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.h),
+ * optimized to do word accesses rather than byte accesses,
+ * and to avoid unnecessary copies into the context array.
+ */
+
+typedef struct {
+ unsigned int H[5];
+ unsigned int W[16];
+ int lenW;
+ unsigned long long size;
+} blk_SHA_CTX;
+
+void blk_SHA1_Init(blk_SHA_CTX *ctx);
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *dataIn, unsigned long len);
+void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx);
+
+#define git_SHA_CTX blk_SHA_CTX
+#define git_SHA1_Init blk_SHA1_Init
+#define git_SHA1_Update blk_SHA1_Update
+#define git_SHA1_Final blk_SHA1_Final
--
1.6.4.31.g154b2.dirty
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 2/7] block-sha1: try to use rol/ror appropriately
2009-08-06 15:15 ` [PATCH 1/7] block-sha1: add new optimized C 'block-sha1' routines Linus Torvalds
@ 2009-08-06 15:16 ` Linus Torvalds
2009-08-06 15:18 ` [PATCH 3/7] block-sha1: make the 'ntohl()' part of the first SHA1 loop Linus Torvalds
2009-08-06 18:25 ` [PATCH 2/7] block-sha1: try to use rol/ror appropriately Bert Wesarg
0 siblings, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 15:16 UTC (permalink / raw)
To: Git Mailing List, Junio C Hamano
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 5 Aug 2009 19:42:15 -0700
Subject: [PATCH 2/7] block-sha1: try to use rol/ror appropriately
Use the one with the smaller constant. It _can_ generate slightly
smaller code (a constant of 1 is special), but perhaps more importantly
it's possibly faster on any uarch that does a rotate with a loop.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
Ok, this is the hackiest of the bunch. I considered dropping it. But I do
suspect we want something like this, even if we might end up massaging the
asm for different compilers etc.
block-sha1/sha1.c | 32 ++++++++++++++++++++++----------
1 files changed, 22 insertions(+), 10 deletions(-)
diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index eef32f7..a45a3de 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -80,7 +80,19 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
((unsigned int *)hashout)[i] = htonl(ctx->H[i]);
}
-#define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n))))
+#if defined(__i386__) || defined(__x86_64__)
+
+#define SHA_ASM(op, x, n) ({ unsigned int __res; asm(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; })
+#define SHA_ROL(x,n) SHA_ASM("rol", x, n)
+#define SHA_ROR(x,n) SHA_ASM("ror", x, n)
+
+#else
+
+#define SHA_ROT(X,l,r) (((X) << (l)) | ((X) >> (r)))
+#define SHA_ROL(X,n) SHA_ROT(X,n,32-(n))
+#define SHA_ROR(X,n) SHA_ROT(X,32-(n),n)
+
+#endif
static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
{
@@ -93,7 +105,7 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
/* Unroll it? */
for (t = 16; t <= 79; t++)
- W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
+ W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
A = ctx->H[0];
B = ctx->H[1];
@@ -102,8 +114,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
E = ctx->H[4];
#define T_0_19(t) \
- TEMP = SHA_ROT(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \
- E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+ TEMP = SHA_ROL(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \
+ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4);
T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9);
@@ -111,8 +123,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
T_0_19(15); T_0_19(16); T_0_19(17); T_0_19(18); T_0_19(19);
#define T_20_39(t) \
- TEMP = SHA_ROT(A,5) + (B^C^D) + E + W[t] + 0x6ed9eba1; \
- E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+ TEMP = SHA_ROL(A,5) + (B^C^D) + E + W[t] + 0x6ed9eba1; \
+ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29);
@@ -120,8 +132,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39);
#define T_40_59(t) \
- TEMP = SHA_ROT(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \
- E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+ TEMP = SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \
+ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49);
@@ -129,8 +141,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59);
#define T_60_79(t) \
- TEMP = SHA_ROT(A,5) + (B^C^D) + E + W[t] + 0xca62c1d6; \
- E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+ TEMP = SHA_ROL(A,5) + (B^C^D) + E + W[t] + 0xca62c1d6; \
+ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64);
T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69);
--
1.6.4.31.g154b2.dirty
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 3/7] block-sha1: make the 'ntohl()' part of the first SHA1 loop
2009-08-06 15:16 ` [PATCH 2/7] block-sha1: try to use rol/ror appropriately Linus Torvalds
@ 2009-08-06 15:18 ` Linus Torvalds
2009-08-06 15:20 ` [PATCH 4/7] block-sha1: re-use the temporary array as we calculate the SHA1 Linus Torvalds
2009-08-06 18:25 ` [PATCH 2/7] block-sha1: try to use rol/ror appropriately Bert Wesarg
1 sibling, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 15:18 UTC (permalink / raw)
To: Git Mailing List, Junio C Hamano
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 5 Aug 2009 20:28:07 -0700
Subject: [PATCH 3/7] block-sha1: make the 'ntohl()' part of the first SHA1 loop
This helps a teeny bit. But what I -really- want to do is to avoid the
whole 80-array loop, and do the xor updates as I go along..
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
This is a pretty trivial one, but it was the first stage to getting rid of
the annoying 80-word array that not only wastes precious L1 cache, but
that loop to initialize it was a noticeable cost.
block-sha1/sha1.c | 28 ++++++++++++++++------------
1 files changed, 16 insertions(+), 12 deletions(-)
diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index a45a3de..39a5bbb 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -100,27 +100,31 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
unsigned int A,B,C,D,E,TEMP;
unsigned int W[80];
- for (t = 0; t < 16; t++)
- W[t] = htonl(data[t]);
-
- /* Unroll it? */
- for (t = 16; t <= 79; t++)
- W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
-
A = ctx->H[0];
B = ctx->H[1];
C = ctx->H[2];
D = ctx->H[3];
E = ctx->H[4];
-#define T_0_19(t) \
+#define T_0_15(t) \
+ TEMP = htonl(data[t]); W[t] = TEMP; \
+ TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \
+ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
+
+ T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4);
+ T_0_15( 5); T_0_15( 6); T_0_15( 7); T_0_15( 8); T_0_15( 9);
+ T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14);
+ T_0_15(15);
+
+ /* Unroll it? */
+ for (t = 16; t <= 79; t++)
+ W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
+
+#define T_16_19(t) \
TEMP = SHA_ROL(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \
E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
- T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4);
- T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9);
- T_0_19(10); T_0_19(11); T_0_19(12); T_0_19(13); T_0_19(14);
- T_0_19(15); T_0_19(16); T_0_19(17); T_0_19(18); T_0_19(19);
+ T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19);
#define T_20_39(t) \
TEMP = SHA_ROL(A,5) + (B^C^D) + E + W[t] + 0x6ed9eba1; \
--
1.6.4.31.g154b2.dirty
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 4/7] block-sha1: re-use the temporary array as we calculate the SHA1
2009-08-06 15:18 ` [PATCH 3/7] block-sha1: make the 'ntohl()' part of the first SHA1 loop Linus Torvalds
@ 2009-08-06 15:20 ` Linus Torvalds
2009-08-06 15:22 ` [PATCH 5/7] block-sha1: macroize the rounds a bit further Linus Torvalds
0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 15:20 UTC (permalink / raw)
To: Git Mailing List, Junio C Hamano
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 5 Aug 2009 20:49:41 -0700
Subject: [PATCH 4/7] block-sha1: re-use the temporary array as we calculate the SHA1
The mozilla-SHA1 code used an 80-word array for the 80 iterations. But
the SHA1 state is really just 512 bits, and you can actually keep it in
a kind of "circular queue" of just 16 words instead.
This requires us to do the xor updates as we go along (rather than as a
pre-phase), but that's really what we want to do anyway.
This gets me really close to the OpenSSL performance on my Nehalem.
Look ma, all C code (ok, there's the rol/ror hack, but that one doesn't
strictly even matter on my Nehalem, it's just a local optimization).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
This gets rid of the annoying pre-initialization and its need for that
extra space. We can dynamically iterate over a 512-bit circular buffer
instead.
block-sha1/sha1.c | 28 ++++++++++++++++------------
1 files changed, 16 insertions(+), 12 deletions(-)
diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 39a5bbb..80193d4 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -96,9 +96,8 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
{
- int t;
unsigned int A,B,C,D,E,TEMP;
- unsigned int W[80];
+ unsigned int array[16];
A = ctx->H[0];
B = ctx->H[1];
@@ -107,8 +106,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
E = ctx->H[4];
#define T_0_15(t) \
- TEMP = htonl(data[t]); W[t] = TEMP; \
- TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \
+ TEMP = htonl(data[t]); array[t] = TEMP; \
+ TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \
E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4);
@@ -116,18 +115,21 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14);
T_0_15(15);
- /* Unroll it? */
- for (t = 16; t <= 79; t++)
- W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
+/* This "rolls" over the 512-bit array */
+#define W(x) (array[(x)&15])
+#define SHA_XOR(t) \
+ TEMP = SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1); W(t) = TEMP;
#define T_16_19(t) \
- TEMP = SHA_ROL(A,5) + (((C^D)&B)^D) + E + W[t] + 0x5a827999; \
- E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
+ SHA_XOR(t); \
+ TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \
+ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19);
#define T_20_39(t) \
- TEMP = SHA_ROL(A,5) + (B^C^D) + E + W[t] + 0x6ed9eba1; \
+ SHA_XOR(t); \
+ TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0x6ed9eba1; \
E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
@@ -136,7 +138,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39);
#define T_40_59(t) \
- TEMP = SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \
+ SHA_XOR(t); \
+ TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + 0x8f1bbcdc; \
E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
@@ -145,7 +148,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59);
#define T_60_79(t) \
- TEMP = SHA_ROL(A,5) + (B^C^D) + E + W[t] + 0xca62c1d6; \
+ SHA_XOR(t); \
+ TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0xca62c1d6; \
E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64);
--
1.6.4.31.g154b2.dirty
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 5/7] block-sha1: macroize the rounds a bit further
2009-08-06 15:20 ` [PATCH 4/7] block-sha1: re-use the temporary array as we calculate the SHA1 Linus Torvalds
@ 2009-08-06 15:22 ` Linus Torvalds
2009-08-06 15:24 ` [PATCH 6/7] block-sha1: Use '(B&C)+(D&(B^C))' instead of '(B&C)|(D&(B|C))' in round 3 Linus Torvalds
0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 15:22 UTC (permalink / raw)
To: Git Mailing List, Junio C Hamano
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, 6 Aug 2009 07:20:54 -0700
Subject: [PATCH 5/7] block-sha1: macroize the rounds a bit further
Avoid repeating the shared parts of the different rounds by adding a
macro layer or two. It was already more cpp than C.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
This makes things denser, and puts all the core rules in one place. That
first hunk really contains just about all of the important parts of SHA1.
The rest is just fluff and the necessary expansions etc.
block-sha1/sha1.c | 56 ++++++++++++++++++++++++----------------------------
1 files changed, 26 insertions(+), 30 deletions(-)
diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 80193d4..4837d58 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -94,6 +94,27 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
#endif
+/* This "rolls" over the 512-bit array */
+#define W(x) (array[(x)&15])
+
+/*
+ * Where do we get the source from? The first 16 iterations get it from
+ * the input data, the next mix it from the 512-bit array.
+ */
+#define SHA_SRC(t) htonl(data[t])
+#define SHA_MIX(t) SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1)
+
+#define SHA_ROUND(t, input, fn, constant) \
+ TEMP = input(t); W(t) = TEMP; \
+ TEMP += SHA_ROL(A,5) + (fn) + E + (constant); \
+ E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP
+
+#define T_0_15(t) SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999 )
+#define T_16_19(t) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999 )
+#define T_20_39(t) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0x6ed9eba1 )
+#define T_40_59(t) SHA_ROUND(t, SHA_MIX, ((B&C)|(D&(B|C))) , 0x8f1bbcdc )
+#define T_60_79(t) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0xca62c1d6 )
+
static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
{
unsigned int A,B,C,D,E,TEMP;
@@ -105,53 +126,28 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
D = ctx->H[3];
E = ctx->H[4];
-#define T_0_15(t) \
- TEMP = htonl(data[t]); array[t] = TEMP; \
- TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \
- E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
-
+ /* Round 1 - iterations 0-15 take their input from 'data' */
T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4);
T_0_15( 5); T_0_15( 6); T_0_15( 7); T_0_15( 8); T_0_15( 9);
T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14);
T_0_15(15);
-/* This "rolls" over the 512-bit array */
-#define W(x) (array[(x)&15])
-#define SHA_XOR(t) \
- TEMP = SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1); W(t) = TEMP;
-
-#define T_16_19(t) \
- SHA_XOR(t); \
- TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \
- E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
-
+ /* Round 1 - tail. Input from 512-bit mixing array */
T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19);
-#define T_20_39(t) \
- SHA_XOR(t); \
- TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0x6ed9eba1; \
- E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
-
+ /* Round 2 */
T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29);
T_20_39(30); T_20_39(31); T_20_39(32); T_20_39(33); T_20_39(34);
T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39);
-#define T_40_59(t) \
- SHA_XOR(t); \
- TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + 0x8f1bbcdc; \
- E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
-
+ /* Round 3 */
T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49);
T_40_59(50); T_40_59(51); T_40_59(52); T_40_59(53); T_40_59(54);
T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59);
-#define T_60_79(t) \
- SHA_XOR(t); \
- TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0xca62c1d6; \
- E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
-
+ /* Round 4 */
T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64);
T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69);
T_60_79(70); T_60_79(71); T_60_79(72); T_60_79(73); T_60_79(74);
--
1.6.4.31.g154b2.dirty
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 6/7] block-sha1: Use '(B&C)+(D&(B^C))' instead of '(B&C)|(D&(B|C))' in round 3
2009-08-06 15:22 ` [PATCH 5/7] block-sha1: macroize the rounds a bit further Linus Torvalds
@ 2009-08-06 15:24 ` Linus Torvalds
2009-08-06 15:25 ` [PATCH 7/7] block-sha1: get rid of redundant 'lenW' context Linus Torvalds
0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 15:24 UTC (permalink / raw)
To: Git Mailing List, Junio C Hamano
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, 6 Aug 2009 07:27:57 -0700
Subject: [PATCH 6/7] block-sha1: Use '(B&C)+(D&(B^C))' instead of '(B&C)|(D&(B|C))' in round 3
It's an equivalent expression, but the '+' gives us some freedom in
instruction selection (for example, we can use 'lea' rather than 'add'),
and associates with the other additions around it to give some minor
scheduling freedom.
Suggested-by: linux@horizon.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
A small micro-optimization. It may not matter hugely, but it's definitely
the right thing to do. And with the new macroized thing, you can see
clearly what part of the SHA1 rounds it affects, and how.
block-sha1/sha1.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 4837d58..9a060a6 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -112,7 +112,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
#define T_0_15(t) SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999 )
#define T_16_19(t) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999 )
#define T_20_39(t) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0x6ed9eba1 )
-#define T_40_59(t) SHA_ROUND(t, SHA_MIX, ((B&C)|(D&(B|C))) , 0x8f1bbcdc )
+#define T_40_59(t) SHA_ROUND(t, SHA_MIX, ((B&C)+(D&(B^C))) , 0x8f1bbcdc )
#define T_60_79(t) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0xca62c1d6 )
static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
--
1.6.4.31.g154b2.dirty
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 7/7] block-sha1: get rid of redundant 'lenW' context
2009-08-06 15:24 ` [PATCH 6/7] block-sha1: Use '(B&C)+(D&(B^C))' instead of '(B&C)|(D&(B|C))' in round 3 Linus Torvalds
@ 2009-08-06 15:25 ` Linus Torvalds
0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 15:25 UTC (permalink / raw)
To: Git Mailing List, Junio C Hamano
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, 6 Aug 2009 07:45:46 -0700
Subject: [PATCH 7/7] block-sha1: get rid of redundant 'lenW' context
.. and simplify the ctx->size logic.
We now count the size in bytes, which means that 'lenW' was always just
the low 6 bits of the total size, so we don't carry it around separately
any more. And we do the 'size in bits' shift at the end.
Suggested by Nicolas Pitre and linux@horizon.com.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
Some final cleanup. Based on two separate discussions yesterday. Trivial
and doesn't make any difference that I can tell, but definitely the right
thing to do.
block-sha1/sha1.c | 17 +++++++----------
block-sha1/sha1.h | 1 -
2 files changed, 7 insertions(+), 11 deletions(-)
diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 9a060a6..78dcb0c 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -14,7 +14,6 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data);
void blk_SHA1_Init(blk_SHA_CTX *ctx)
{
- ctx->lenW = 0;
ctx->size = 0;
/* Initialize H with the magic constants (see FIPS180 for constants)
@@ -29,9 +28,9 @@ void blk_SHA1_Init(blk_SHA_CTX *ctx)
void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
{
- int lenW = ctx->lenW;
+ int lenW = ctx->size & 63;
- ctx->size += (unsigned long long) len << 3;
+ ctx->size += len;
/* Read the data into W and process blocks as they get full
*/
@@ -43,7 +42,6 @@ void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
lenW = (lenW + left) & 63;
len -= left;
data += left;
- ctx->lenW = lenW;
if (lenW)
return;
blk_SHA1Block(ctx, ctx->W);
@@ -53,10 +51,8 @@ void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
data += 64;
len -= 64;
}
- if (len) {
+ if (len)
memcpy(ctx->W, data, len);
- ctx->lenW = len;
- }
}
@@ -68,10 +64,11 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
/* Pad with a binary 1 (ie 0x80), then zeroes, then length
*/
- padlen[0] = htonl(ctx->size >> 32);
- padlen[1] = htonl(ctx->size);
+ padlen[0] = htonl(ctx->size >> 29);
+ padlen[1] = htonl(ctx->size << 3);
- blk_SHA1_Update(ctx, pad, 1+ (63 & (55 - ctx->lenW)));
+ i = ctx->size & 63;
+ blk_SHA1_Update(ctx, pad, 1+ (63 & (55 - i)));
blk_SHA1_Update(ctx, padlen, 8);
/* Output hash
diff --git a/block-sha1/sha1.h b/block-sha1/sha1.h
index 7be2d93..c1ae74d 100644
--- a/block-sha1/sha1.h
+++ b/block-sha1/sha1.h
@@ -7,7 +7,6 @@
typedef struct {
unsigned int H[5];
unsigned int W[16];
- int lenW;
unsigned long long size;
} blk_SHA_CTX;
--
1.6.4.31.g154b2.dirty
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-06 15:13 [PATCH 0/7] block-sha1: improved SHA1 hashing Linus Torvalds
2009-08-06 15:15 ` [PATCH 1/7] block-sha1: add new optimized C 'block-sha1' routines Linus Torvalds
@ 2009-08-06 17:22 ` Artur Skawina
2009-08-06 18:09 ` Linus Torvalds
1 sibling, 1 reply; 38+ messages in thread
From: Artur Skawina @ 2009-08-06 17:22 UTC (permalink / raw)
To: Git Mailing List
Linus Torvalds wrote:
> where the thing is loosely based on the Mozilla SHA1 routines but by the
> end doesn't really resemble them all that much. The Mozilla ones suck
> donkey d*ck in so many ways - unnecessary copies, idiotic byte-at-a-time
> build-up of the hash input etc.
>
> The end result is pretty much equivalent in performance to the OpenSSL
> SHA1 code for me on x86-64. Getting rid of OpenSSL gets rid of another
For those curious just how close the C version is to the various
asm and C implementations, the q&d microbenchmark is at
http://www.src.multimo.pl/YDpqIo7Li27O0L0h/sha1bench.tar.gz
In short: 88% of openssl speed on P3, 42% on P4, 66% on Atom.
artur
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-06 17:22 ` [PATCH 0/7] block-sha1: improved SHA1 hashing Artur Skawina
@ 2009-08-06 18:09 ` Linus Torvalds
2009-08-06 19:10 ` Artur Skawina
0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 18:09 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List
On Thu, 6 Aug 2009, Artur Skawina wrote:
>
> For those curious just how close the C version is to the various
> asm and C implementations, the q&d microbenchmark is at
> http://www.src.multimo.pl/YDpqIo7Li27O0L0h/sha1bench.tar.gz
Hmm. That thing doesn't work at all on x86-64. Even apart from the asm
sources, your timing thing does some really odd things (why do you do that
odd "iret" in GETCYCLES and GETTIME?). You're better off using
lfence/mfence/cpuid, and I think you could make it work on 64-bit that
way too.
I just hacked it away for testing.
> In short: 88% of openssl speed on P3, 42% on P4, 66% on Atom.
I'll use this to see if I can improve the 32-bit case.
On Nehalem, with your benchmark, I get:
# TIME[s] SPEED[MB/s]
rfc3174 5.122 119.2
# New hash result: d829b9e028e64840094ab6702f9acdf11bec3937
rfc3174 5.153 118.5
linus 2.092 291.8
linusas 2.056 296.8
linusas2 1.909 319.8
mozilla 5.139 118.8
mozillaas 5.775 105.7
openssl 1.627 375.1
spelvin 1.678 363.7
spelvina 1.603 380.8
nettle 1.592 383.4
And with the hacked version to get some 64-bit numbers:
# TIME[s] SPEED[MB/s]
rfc3174 3.992 152.9
# New hash result: b78fd74c0033a4dfe0ededccb85ab00cb56880ab
rfc3174 3.991 152.9
linus 1.54 396.3
linusas 1.533 398.1
linusas2 1.603 380.9
mozilla 4.352 140.3
mozillaas 4.227 144.4
so as you can see, your improvements in 32-bit mode are actually
de-provements in 64-bit mode (ok, your first one seems to be a tiny
improvement, but I think it's in the noise).
But you're right, I need to try to improve the 32-bit case.
Linus
* Re: [PATCH 2/7] block-sha1: try to use rol/ror appropriately
2009-08-06 15:16 ` [PATCH 2/7] block-sha1: try to use rol/ror appropriately Linus Torvalds
2009-08-06 15:18 ` [PATCH 3/7] block-sha1: make the 'ntohl()' part of the first SHA1 loop Linus Torvalds
@ 2009-08-06 18:25 ` Bert Wesarg
1 sibling, 0 replies; 38+ messages in thread
From: Bert Wesarg @ 2009-08-06 18:25 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano
Hi,
On Thu, Aug 6, 2009 at 17:16, Linus
Torvalds<torvalds@linux-foundation.org> wrote:
> diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
> index eef32f7..a45a3de 100644
> --- a/block-sha1/sha1.c
> +++ b/block-sha1/sha1.c
> @@ -80,7 +80,19 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
> ((unsigned int *)hashout)[i] = htonl(ctx->H[i]);
> }
>
> -#define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n))))
> +#if defined(__i386__) || defined(__x86_64__)
> +
> +#define SHA_ASM(op, x, n) ({ unsigned int __res; asm(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; })
> +#define SHA_ROL(x,n) SHA_ASM("rol", x, n)
> +#define SHA_ROR(x,n) SHA_ASM("ror", x, n)
> +
> +#else
> +
> +#define SHA_ROT(X,n) (((X) << (l)) | ((X) >> (r)))
I suspect, this should be:
#define SHA_ROT(X,l,r) (((X) << (l)) | ((X) >> (r)))
> +#define SHA_ROL(X,n) SHA_ROT(X,n,32-(n))
> +#define SHA_ROR(X,n) SHA_ROT(X,32-(n),n)
> +
> +#endif
>
> static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
> {
Bert
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-06 18:09 ` Linus Torvalds
@ 2009-08-06 19:10 ` Artur Skawina
2009-08-06 19:41 ` Linus Torvalds
0 siblings, 1 reply; 38+ messages in thread
From: Artur Skawina @ 2009-08-06 19:10 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
Linus Torvalds wrote:
>
> On Thu, 6 Aug 2009, Artur Skawina wrote:
>> For those curious just how close the C version is to the various
>> asm and C implementations, the q&d microbenchmark is at
>> http://www.src.multimo.pl/YDpqIo7Li27O0L0h/sha1bench.tar.gz
>
> Hmm. That thing doesn't work at all on x86-64. Even apart from the asm
> sources, your timing thing does some really odd things (why do you do that
> odd "iret" in GETCYCLES and GETTIME?). You're better off using
> lfence/mfence/cpuid, and I think you could make it work on 64-bit that
> way too.
yes, it's 32-bit only, i should have mentioned that. The timing
code was written more than a decade ago, it really works on p2,
haven't updated it, it's all just c&p'ed ever since. All of it
can be safely disabled; on p2 you could account for every cycle,
nowadays gettimeofday is more than enough.
> I just hacked it away for testing.
>
>> In short: 88% of openssl speed on P3, 42% on P4, 66% on Atom.
>
> I'll use this to see if I can improve the 32-bit case.
>
> On Nehalem, with your benchmark, I get:
>
> # TIME[s] SPEED[MB/s]
> rfc3174 5.122 119.2
> # New hash result: d829b9e028e64840094ab6702f9acdf11bec3937
> rfc3174 5.153 118.5
> linus 2.092 291.8
> linusas 2.056 296.8
> linusas2 1.909 319.8
> mozilla 5.139 118.8
> mozillaas 5.775 105.7
> openssl 1.627 375.1
> spelvin 1.678 363.7
> spelvina 1.603 380.8
> nettle 1.592 383.4
>
> And with the hacked version to get some 64-bit numbers:
>
> # TIME[s] SPEED[MB/s]
> rfc3174 3.992 152.9
> # New hash result: b78fd74c0033a4dfe0ededccb85ab00cb56880ab
> rfc3174 3.991 152.9
> linus 1.54 396.3
> linusas 1.533 398.1
> linusas2 1.603 380.9
> mozilla 4.352 140.3
> mozillaas 4.227 144.4
>
> so as you can see, your improvements in 32-bit mode are actually
> de-provements in 64-bit mode (ok, your first one seems to be a tiny
> improvement, but I think it's in the noise).
Actually i didn't keep anything that wasn't a win, one reason
why linusas2 stayed was that it really surprised me, i'd have
expected gcc to do a lot worse w/ the many temporaries, and
the compiler came up w/ a 70% gain; gcc really must have improved
when i wasn't looking.
> But you're right, I need to try to improve the 32-bit case.
I never said anything like that. :) there probably isn't all that
much that can be done. I tried a few things, but never saw any
improvement above measurement noise (a few percent). Would have
thought that overlapping the iterations a bit would be a gain, but
that didn't do much (-20%..0), maybe on 64 bit, with more registers...
Oh, i noticed that '-mtune' makes quite a difference, it can change
the relative performance of the functions significantly, in unobvious
ways; depending on which cpu gcc tunes for (build config or -mtune);
some implementations slow down, others become a bit faster.
artur
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-06 19:10 ` Artur Skawina
@ 2009-08-06 19:41 ` Linus Torvalds
2009-08-06 20:08 ` Artur Skawina
0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 19:41 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List
On Thu, 6 Aug 2009, Artur Skawina wrote:
>
> Oh, i noticed that '-mtune' makes quite a difference, it can change
> the relative performance of the functions significantly, in unobvious
> ways; depending on which cpu gcc tunes for (build config or -mtune);
> some implementations slow down, others become a bit faster.
That probably is mainly true for P4, although it's quite possible that it
has an effect for just what the register allocator does, and then for
spilling.
And it looks like _all_ the tweakability is in the spilling. Nothing else
matters.
How does this patch work for you? It avoids doing that C-level register
rotation, and instead rotates the register names with the preprocessor.
I realize it's ugly as hell, but it does make it easier for gcc to see
what's going on.
The patch is against my git patches, but I think it should apply pretty
much as-is to your sha1bench sources too. Does it make any difference for
you?
Linus
---
block-sha1/sha1.c | 117 ++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 90 insertions(+), 27 deletions(-)
diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 78dcb0c..ac47162 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -101,20 +101,20 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
#define SHA_SRC(t) htonl(data[t])
#define SHA_MIX(t) SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1)
-#define SHA_ROUND(t, input, fn, constant) \
- TEMP = input(t); W(t) = TEMP; \
- TEMP += SHA_ROL(A,5) + (fn) + E + (constant); \
- E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP
+#define SHA_ROUND(t, input, fn, constant, A, B, C, D, E) do { \
+ unsigned int TEMP = input(t); W(t) = TEMP; \
+ TEMP += E + SHA_ROL(A,5) + (fn) + (constant); \
+ B = SHA_ROR(B, 2); E = TEMP; } while (0)
-#define T_0_15(t) SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999 )
-#define T_16_19(t) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999 )
-#define T_20_39(t) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0x6ed9eba1 )
-#define T_40_59(t) SHA_ROUND(t, SHA_MIX, ((B&C)+(D&(B^C))) , 0x8f1bbcdc )
-#define T_60_79(t) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0xca62c1d6 )
+#define T_0_15(t, A, B, C, D, E) SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
+#define T_16_19(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
+#define T_20_39(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0x6ed9eba1, A, B, C, D, E )
+#define T_40_59(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, ((B&C)+(D&(B^C))) , 0x8f1bbcdc, A, B, C, D, E )
+#define T_60_79(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0xca62c1d6, A, B, C, D, E )
static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
{
- unsigned int A,B,C,D,E,TEMP;
+ unsigned int A,B,C,D,E;
unsigned int array[16];
A = ctx->H[0];
@@ -124,31 +124,94 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
E = ctx->H[4];
/* Round 1 - iterations 0-16 take their input from 'data' */
- T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4);
- T_0_15( 5); T_0_15( 6); T_0_15( 7); T_0_15( 8); T_0_15( 9);
- T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14);
- T_0_15(15);
+ T_0_15( 0, A, B, C, D, E);
+ T_0_15( 1, E, A, B, C, D);
+ T_0_15( 2, D, E, A, B, C);
+ T_0_15( 3, C, D, E, A, B);
+ T_0_15( 4, B, C, D, E, A);
+ T_0_15( 5, A, B, C, D, E);
+ T_0_15( 6, E, A, B, C, D);
+ T_0_15( 7, D, E, A, B, C);
+ T_0_15( 8, C, D, E, A, B);
+ T_0_15( 9, B, C, D, E, A);
+ T_0_15(10, A, B, C, D, E);
+ T_0_15(11, E, A, B, C, D);
+ T_0_15(12, D, E, A, B, C);
+ T_0_15(13, C, D, E, A, B);
+ T_0_15(14, B, C, D, E, A);
+ T_0_15(15, A, B, C, D, E);
/* Round 1 - tail. Input from 512-bit mixing array */
- T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19);
+ T_16_19(16, E, A, B, C, D);
+ T_16_19(17, D, E, A, B, C);
+ T_16_19(18, C, D, E, A, B);
+ T_16_19(19, B, C, D, E, A);
/* Round 2 */
- T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
- T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29);
- T_20_39(30); T_20_39(31); T_20_39(32); T_20_39(33); T_20_39(34);
- T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39);
+ T_20_39(20, A, B, C, D, E);
+ T_20_39(21, E, A, B, C, D);
+ T_20_39(22, D, E, A, B, C);
+ T_20_39(23, C, D, E, A, B);
+ T_20_39(24, B, C, D, E, A);
+ T_20_39(25, A, B, C, D, E);
+ T_20_39(26, E, A, B, C, D);
+ T_20_39(27, D, E, A, B, C);
+ T_20_39(28, C, D, E, A, B);
+ T_20_39(29, B, C, D, E, A);
+ T_20_39(30, A, B, C, D, E);
+ T_20_39(31, E, A, B, C, D);
+ T_20_39(32, D, E, A, B, C);
+ T_20_39(33, C, D, E, A, B);
+ T_20_39(34, B, C, D, E, A);
+ T_20_39(35, A, B, C, D, E);
+ T_20_39(36, E, A, B, C, D);
+ T_20_39(37, D, E, A, B, C);
+ T_20_39(38, C, D, E, A, B);
+ T_20_39(39, B, C, D, E, A);
/* Round 3 */
- T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
- T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49);
- T_40_59(50); T_40_59(51); T_40_59(52); T_40_59(53); T_40_59(54);
- T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59);
+ T_40_59(40, A, B, C, D, E);
+ T_40_59(41, E, A, B, C, D);
+ T_40_59(42, D, E, A, B, C);
+ T_40_59(43, C, D, E, A, B);
+ T_40_59(44, B, C, D, E, A);
+ T_40_59(45, A, B, C, D, E);
+ T_40_59(46, E, A, B, C, D);
+ T_40_59(47, D, E, A, B, C);
+ T_40_59(48, C, D, E, A, B);
+ T_40_59(49, B, C, D, E, A);
+ T_40_59(50, A, B, C, D, E);
+ T_40_59(51, E, A, B, C, D);
+ T_40_59(52, D, E, A, B, C);
+ T_40_59(53, C, D, E, A, B);
+ T_40_59(54, B, C, D, E, A);
+ T_40_59(55, A, B, C, D, E);
+ T_40_59(56, E, A, B, C, D);
+ T_40_59(57, D, E, A, B, C);
+ T_40_59(58, C, D, E, A, B);
+ T_40_59(59, B, C, D, E, A);
/* Round 4 */
- T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64);
- T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69);
- T_60_79(70); T_60_79(71); T_60_79(72); T_60_79(73); T_60_79(74);
- T_60_79(75); T_60_79(76); T_60_79(77); T_60_79(78); T_60_79(79);
+ T_60_79(60, A, B, C, D, E);
+ T_60_79(61, E, A, B, C, D);
+ T_60_79(62, D, E, A, B, C);
+ T_60_79(63, C, D, E, A, B);
+ T_60_79(64, B, C, D, E, A);
+ T_60_79(65, A, B, C, D, E);
+ T_60_79(66, E, A, B, C, D);
+ T_60_79(67, D, E, A, B, C);
+ T_60_79(68, C, D, E, A, B);
+ T_60_79(69, B, C, D, E, A);
+ T_60_79(70, A, B, C, D, E);
+ T_60_79(71, E, A, B, C, D);
+ T_60_79(72, D, E, A, B, C);
+ T_60_79(73, C, D, E, A, B);
+ T_60_79(74, B, C, D, E, A);
+ T_60_79(75, A, B, C, D, E);
+ T_60_79(76, E, A, B, C, D);
+ T_60_79(77, D, E, A, B, C);
+ T_60_79(78, C, D, E, A, B);
+ T_60_79(79, B, C, D, E, A);
ctx->H[0] += A;
ctx->H[1] += B;
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-06 19:41 ` Linus Torvalds
@ 2009-08-06 20:08 ` Artur Skawina
2009-08-06 20:53 ` Linus Torvalds
0 siblings, 1 reply; 38+ messages in thread
From: Artur Skawina @ 2009-08-06 20:08 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
Linus Torvalds wrote:
>
> On Thu, 6 Aug 2009, Artur Skawina wrote:
>> Oh, i noticed that '-mtune' makes quite a difference, it can change
>> the relative performance of the functions significantly, in unobvious
>> ways; depending on which cpu gcc tunes for (build config or -mtune);
>> some implementations slow down, others become a bit faster.
>
> That probably is mainly true for P4, although it's quite possible that it
> has an effect for just what the register allocator does, and then for
> spilling.
>
> And it looks like _all_ the tweakability is in the spilling. Nothing else
> matters.
>
> How does this patch work for you? It avoids doing that C-level register
> rotation, and instead rotates the register names with the preprocessor.
>
> I realize it's ugly as hell, but it does make it easier for gcc to see
> what's going on.
>
> The patch is against my git patches, but I think it should apply pretty
> much as-is to your sha1bench sources too. Does it make any difference for
> you?
it's a bit slower (P4):
before: linus 0.6288 97.06
after: linus 0.6604 92.42
i was trying similar things, like the example below, too, but it wasn't a
win on 32 bit...
artur
[the iteration below is functionally correct, but scheduling is most likely
fubared as it wasn't a win and i was checking how much a difference it made
on P4 -- ~-20..~0%, but never faster (relative to linusas2; it _is_ faster
than 'linus'). Dropped this version when merging your new preprocessor macros.]
@@ -125,6 +127,8 @@
#define W(x) (array[(x)&15])
#define SHA_XOR(t) \
TEMP = SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1); W(t) = TEMP;
+#define SHA_XOR2(t) \
+ SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1)
#define T_16_19(t) \
{ unsigned TEMP;\
@@ -139,10 +143,27 @@
#endif
#define T_20_39(t) \
- { unsigned TEMP;\
- SHA_XOR(t); \
- TEMP += (B^C^D) + E + 0x6ed9eba1; \
- E = D; D = C; C = SHA_ROR(B, 2); B = A; TEMP += SHA_ROL(A,5); A = TEMP; }
+ if (t%2==0) {\
+ unsigned TEMP;\
+ unsigned TEMP2;\
+ \
+ TEMP = SHA_XOR2(t); \
+ TEMP2 = SHA_XOR2(t+1); \
+ W(t) = TEMP;\
+ W(t+1) = TEMP2;\
+ TEMP += E + 0x6ed9eba1; \
+ E = C;\
+ TEMP += (B^E^D); \
+ TEMP2 += D + 0x6ed9eba1; \
+ D = SHA_ROR(B, 2);\
+ B = SHA_ROL(A, 5);\
+ B += TEMP;\
+ C = SHA_ROR(A, 2);\
+ A ^= E; \
+ A ^= D; \
+ A += TEMP2;\
+ A += SHA_ROL(B, 5);\
+ }
#if UNROLL
T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-06 20:08 ` Artur Skawina
@ 2009-08-06 20:53 ` Linus Torvalds
2009-08-06 21:24 ` Linus Torvalds
2009-08-06 21:39 ` Artur Skawina
0 siblings, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 20:53 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List
On Thu, 6 Aug 2009, Artur Skawina wrote:
>
> it's a bit slower (P4):
>
> before: linus 0.6288 97.06
> after: linus 0.6604 92.42
Hmm. Ok, I just tested with your harness, and I get
# TIME[s] SPEED[MB/s]
rfc3174 5.1 119.7
rfc3174 5.097 119.7
linus 1.836 332.5
linusas 2.006 304.3
linusas2 1.879 324.9
mozilla 5.562 109.7
mozillaas 5.913 103.2
openssl 1.613 378.5
spelvin 1.698 359.5
spelvina 1.602 381
nettle 1.594 382.9
with it, so it is faster for me. So your slowdown seems to be yet another
P4 thing. Dang crazy micro-architecture.
Of course, it might be a compiler version difference too. I'm using
gcc-4.4.0.
With the cpp variable renaming, the compiler really has less to be smart
about, but spill decisions will still matter a lot.
(My old 32-bit numbers were
linus 2.092 291.8
so it's a clear improvement on my machine and with my compiler).
It also seems to improve the 64-bit numbers a small bit, I'm getting
# TIME[s] SPEED[MB/s]
rfc3174 3.98 153.3
rfc3174 3.972 153.7
linus 1.514 403.1
linusas 1.555 392.6
linusas2 1.599 381.7
mozilla 4.34 140.6
mozillaas 4.223 144.5
with my 64-bit compile, so on a Nehalem it's the best one of the C ones by
a noticeable margin. (My original 64-bit numbers were
linus 1.54 396.3
and while the numbers seem to fluctuate a bit, the fluctuation is roughly
in the 1% range, so that improvement seems to be statistically
significant.)
Oh, I did make a small change, but I doubt it matters. Instead of doing
TEMP += E + SHA_ROL(A,5) + (fn) + (constant); \
B = SHA_ROR(B, 2); E = TEMP; } while (0)
I now do
E += TEMP + SHA_ROL(A,5) + (fn) + (constant); \
B = SHA_ROR(B, 2); } while (0)
which is a bit more logical (the old TEMP usage was just due to a fairly
mindless conversion). That _might_ have lower register pressure if the
compiler is silly enough to not notice that it can do it. Maybe that
matters.
Linus
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-06 20:53 ` Linus Torvalds
@ 2009-08-06 21:24 ` Linus Torvalds
2009-08-06 21:39 ` Artur Skawina
1 sibling, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 21:24 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List
On Thu, 6 Aug 2009, Linus Torvalds wrote:
>
> Hmm. Ok, I just tested with your harness, and I get
>
> # TIME[s] SPEED[MB/s]
> rfc3174 5.1 119.7
> rfc3174 5.097 119.7
> linus 1.836 332.5
> linusas 2.006 304.3
> linusas2 1.879 324.9
> mozilla 5.562 109.7
> mozillaas 5.913 103.2
> openssl 1.613 378.5
> spelvin 1.698 359.5
> spelvina 1.602 381
> nettle 1.594 382.9
On atom, I get things like:
# TIME[s] SPEED[MB/s]
rfc3174 2.186 27.92
rfc3174 2.186 27.92
linus 0.9492 64.3
linusas 0.9656 63.21
linusas2 1.012 60.29
mozilla 2.492 24.49
mozillaas 2.5 24.41
openssl 0.6411 95.2
spelvin 0.6052 100.8
spelvina 0.6655 91.71
nettle 0.7149 85.37
but quite frankly, those timings aren't stable enough to say anything.
Another few runs got me:
# TIME[s] SPEED[MB/s]
rfc3174 2.207 27.65
rfc3174 2.21 27.62
linus 1.022 59.74
linusas 1.058 57.7
linusas2 1.008 60.58
mozilla 2.485 24.56
mozillaas 2.522 24.2
openssl 0.6421 95.06
spelvin 0.5989 101.9
spelvina 0.6638 91.94
nettle 0.7132 85.58
# TIME[s] SPEED[MB/s]
rfc3174 2.224 27.44
rfc3174 2.205 27.68
linus 0.9727 62.75
linusas 0.9766 62.5
linusas2 1.026 59.5
mozilla 2.52 24.22
mozillaas 2.547 23.96
openssl 0.6459 94.5
spelvin 0.6074 100.5
spelvina 0.6751 90.41
nettle 0.7254 84.14
so whatever differences there are between linus*, they seem to be in the
noise, and the hand-scheduled asm beats all the C versions senseless.
I'd like to get closer to the hand-tuned ones, but I don't see anything to
do any more. It's all about gcc register choice and avoiding spilling. So
compiler flags changing small details can have _huge_ differences in
performance. Here's the Atom numbers with gcc given the "-Os" flag (just
because I wanted to try):
linus 1.072 56.94
linusas 0.9573 63.76
linusas2 0.9906 61.61
Why did 'linus' numbers go down? No idea. With -O3, it's the other way
around:
linus 0.9537 64
linusas 0.9566 63.8
linusas2 1.013 60.26
but again, there's variation enough that I'd probably need to run ten runs
just to see how much is noise. But the "linusas2 sucks with -O3" is clear,
as is the "linus sucks with -Os" thing. Very odd, and very random.
Linus
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-06 20:53 ` Linus Torvalds
2009-08-06 21:24 ` Linus Torvalds
@ 2009-08-06 21:39 ` Artur Skawina
2009-08-06 21:52 ` Artur Skawina
1 sibling, 1 reply; 38+ messages in thread
From: Artur Skawina @ 2009-08-06 21:39 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
Linus Torvalds wrote:
>
> with it, so it is faster for me. So your slowdown seems to be yet another
> P4 thing. Dang crazy micro-architecture.
>
> Of course, it might be a compiler version difference too. I'm using
> gcc-4.4.0.
gcc version 4.4.1 20090603 (prerelease)
> Oh, I did make a small change, but I doubt it matters. Instead of doing
>
> TEMP += E + SHA_ROL(A,5) + (fn) + (constant); \
> B = SHA_ROR(B, 2); E = TEMP; } while (0)
>
> I now do
>
> E += TEMP + SHA_ROL(A,5) + (fn) + (constant); \
> B = SHA_ROR(B, 2); } while (0)
>
> which is a bit more logical (the old TEMP usage was just due to a fairly
> mindless conversion). That _might_ have lower register pressure if the
> compiler is silly enough to not notice that it can do it. Maybe that
> matters.
before: linus 0.6622 92.17
after: linus 0.6631 92.05
after: linus 0.6601 92.46
after: linus 0.6624 92.14
IOW, no difference, just noise.
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-06 21:39 ` Artur Skawina
@ 2009-08-06 21:52 ` Artur Skawina
2009-08-06 22:27 ` Linus Torvalds
0 siblings, 1 reply; 38+ messages in thread
From: Artur Skawina @ 2009-08-06 21:52 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
Artur Skawina wrote:
> Linus Torvalds wrote:
>> Oh, I did make a small change, but I doubt it matters. Instead of doing
>>
>> TEMP += E + SHA_ROL(A,5) + (fn) + (constant); \
>> B = SHA_ROR(B, 2); E = TEMP; } while (0)
>>
>> I now do
>>
>> E += TEMP + SHA_ROL(A,5) + (fn) + (constant); \
>> B = SHA_ROR(B, 2); } while (0)
>>
>> which is a bit more logical (the old TEMP usage was just due to a fairly
>> mindless conversion). That _might_ have lower register pressure if the
>> compiler is silly enough to not notice that it can do it. Maybe that
>> matters.
>
> before: linus 0.6622 92.17
> after: linus 0.6631 92.05
> after: linus 0.6601 92.46
> after: linus 0.6624 92.14
>
> IOW, no difference, just noise.
Just to check, i did this:
diff -urNp sha1bench-linus.org/block-sha1/sha1.c sha1bench-linus/block-sha1/sha1.c
--- sha1bench-linus.org/block-sha1/sha1.c 2009-08-06 23:26:15.607321815 +0200
+++ sha1bench-linus/block-sha1/sha1.c 2009-08-06 23:41:36.858325807 +0200
@@ -103,8 +103,8 @@ void blk_SHA1_Final(unsigned char hashou
#define SHA_ROUND(t, input, fn, constant, A, B, C, D, E) do { \
unsigned int TEMP = input(t); W(t) = TEMP; \
- E += TEMP + SHA_ROL(A,5) + (fn) + (constant); \
- B = SHA_ROR(B, 2); } while (0)
+ E += TEMP + (fn) + (constant); \
+ B = SHA_ROR(B, 2); E += SHA_ROL(A,5); } while (0)
#define T_0_15(t, A, B, C, D, E) SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
#define T_16_19(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
and the result was:
# TIME[s] SPEED[MB/s]
rfc3174 1.47 41.51
rfc3174 1.474 41.42
linus 0.3564 171.2
linusas 0.5736 106.4
linusas2 0.3568 171.1
mozilla 1.17 52.19
mozillaas 1.382 44.17
openssl 0.2636 231.5
spelvin 0.2662 229.2
spelvina 0.2515 242.7
nettle 0.4386 139.2
Hmm.
Does this make any difference for you? For me it's the best one so far
(the linusas2 number clearly shows that for me the register renaming does
nothing; other than that the functions should be very similar)
artur
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-06 21:52 ` Artur Skawina
@ 2009-08-06 22:27 ` Linus Torvalds
2009-08-06 22:33 ` Linus Torvalds
` (2 more replies)
0 siblings, 3 replies; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 22:27 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List
On Thu, 6 Aug 2009, Artur Skawina wrote:
>
> Does this make any difference for you? For me it's the best one so far
> (the linusas2 number clearly shows that for me the register renaming does
> nothing; other than that the functions should be very similar)
Nope. If anything, it's a bit slower, but it might be in the noise. I
generally got 330MB/s with my "cpp renaming" on Nehalem (32-bit - the
64-bit numbers are ~400MB/s), but with this I got 325MB/s twice in a row,
which matches the linusas2 numbers pretty exactly.
But it seems to make a big difference for you.
Btw, _what_ P4 do you have (Northwood or Prescott)?
The Intel optimization manuals very much talk about avoiding rotates. And
they mention "with a CPUID signature corresponding to family 15 and model
encoding of 0, 1, or 2" specifically as being longer latency. That's
basically pre-prescott P4, I think.
Anyway, on P4 I think you have two double-speed integer issue ports (ie
max four ops per cycle), but only one of them takes a rotate, and only in
the first half of the cycle (ie just one shift per cycle).
And afaik, that is actually the _improved_ state in Prescott. The older
P4's didn't have a full shifter unit at all, iirc: shifts were "complex
instructions" in Northwood and weren't even single-clock.
In Core 2, I think there's still just one shifter unit, but at least it's
as fast as all the other units. So P4 really does stand out as sucking as
far as shifts are concerned, and if you have an older P4, it will be even
worse.
Linus
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-06 22:27 ` Linus Torvalds
@ 2009-08-06 22:33 ` Linus Torvalds
2009-08-06 23:19 ` Artur Skawina
2009-08-06 22:55 ` Artur Skawina
2009-08-07 2:23 ` Linus Torvalds
2 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 22:33 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List
On Thu, 6 Aug 2009, Linus Torvalds wrote:
>
> Anyway, on P4 I think you have two double-speed integer issue ports (ie
> max four ops per cycle), but only one of them takes a rotate, and only in
> the first half of the cycle (ie just one shift per cycle).
>
> And afaik, that is actually the _improved_ state in Prescott. The older
> P4's didn't have a full shifter unit at all, iirc: shifts were "complex
> instructions" in Northwood and weren't even single-clock.
Yeah, verified. Google for
northwood "barrel shifter"
and you'll find a lot of it.
Basically, older P4's will I think shift one bit at a time. So while even
Prescott is relatively weak in the shifter department, pre-prescott
(Willamette and Northwood) are _really_ weak. If your P4 is one of those,
you really shouldn't use it to decide on optimizations.
Linus
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-06 22:27 ` Linus Torvalds
2009-08-06 22:33 ` Linus Torvalds
@ 2009-08-06 22:55 ` Artur Skawina
2009-08-06 23:04 ` Linus Torvalds
2009-08-07 2:23 ` Linus Torvalds
2 siblings, 1 reply; 38+ messages in thread
From: Artur Skawina @ 2009-08-06 22:55 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
Linus Torvalds wrote:
>
> On Thu, 6 Aug 2009, Artur Skawina wrote:
>> Does this make any difference for you? For me it's the best one so far
>> (the linusas2 number clearly shows that for me the register renaming does
>> nothing; other than that the functions should be very similar)
>
> Nope. If anything, it's bit slower, but it might be in the noise. I
> generally got 330MB/s with my "cpp renaming" on Nehalem (32-bit - the
> 64-bit numbers are ~400MB/s), but with this I got 325MB/s twice in a row,
> which matches the linusas2 numbers pretty exactly.
>
> But it seems to make a big difference for you.
It seems to do well on P2 and P4 here; if it works for core2 this could
be a good generic candidate. It only does 62% on an Atom, but the best C
version so far exceeds it only by ~2%.
> Btw, _what_ P4 do you have (Northwood or Prescott)?
northwood
> The Intel optimization manuals very much talk about avoiding rotates. And
> they mention "with a CPUID signature corresponding to family 15 and model
> encoding of 0, 1, or 2" specifically as being longer latency. That's
> basically pre-prescott P4, I think.
cpu family : 15
model : 2
model name : Intel(R) Pentium(R) 4 CPU 2.80GHz
stepping : 5
> Anyway, on P4 I think you have two double-speed integer issue ports (ie
> max four ops per cycle), but only one of them takes a rotate, and only in
> the first half of the cycle (ie just one shift per cycle).
>
> And afaik, that is actually the _improved_ state in Prescott. The older
> P4's didn't have a full shifter unit at all, iirc: shifts were "complex
> instructions" in Northwood and weren't even single-clock.
>
> In Core 2, I think there's still just one shifter unit, but at least it's
> as fast as all the other units. So P4 really does stand out as sucking as
> far as shifts are concerned, and if you have an older P4, it will be even
> worse.
hmm, I might be able to try it on some old willamette, but my prescott's
mobo died, so i can't verify that right now.
I'll upload an updated sha1bench, maybe somebody else feels like checking...
artur
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-06 22:55 ` Artur Skawina
@ 2009-08-06 23:04 ` Linus Torvalds
2009-08-06 23:25 ` Linus Torvalds
0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 23:04 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List
On Fri, 7 Aug 2009, Artur Skawina wrote:
>
> hmm, I might be able to try it on some old willamette, but my prescott's
> mobo died, so i can't verify that right now.
I think willamette and northwood are basically the same wrt shifters (and
pretty much everything else too, for that matter). I think northwood is a
shrink, and had an increased cache size (and higher frequencies). But I
think core-wise, they're very similar.
It was prescott that changed a lot (mostly for the worse - the shifter was
one of the few upsides of prescott, although increased frequency often
made up for the downsides).
Linus
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-06 22:33 ` Linus Torvalds
@ 2009-08-06 23:19 ` Artur Skawina
2009-08-06 23:42 ` Linus Torvalds
0 siblings, 1 reply; 38+ messages in thread
From: Artur Skawina @ 2009-08-06 23:19 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
Linus Torvalds wrote:
>
> Yeah, verified. Google for
>
> northwood "barrel shifter"
>
> and you'll find a lot of it.
>
> Basically, older P4's will I think shift one bit at a time. So while even
> Prescott is relatively weak in the shifter department, pre-prescott
> (Willamette and Northwood) are _really_ weak. If your P4 is one of those,
> you really shouldn't use it to decide on optimizations.
Actually that's even more of a reason to make sure the code doesn't suck :)
The difference on less perverse cpus will usually be small, but on P4 it
can be huge.
A few years back I found my old ip checksum microbenchmark, and when I ran
it on a P4 (prescott iirc) i didn't believe my eyes. The straightforward
32-bit C implementation was running circles around the in-kernel one...
And a few tweaks to the assembler version got me another ~100% speedup.[1]
After that the P4 became the very first cpu to test any code on... :)
artur
[1] just reran the benchmark on this p4; true on northwood too:
IACCK 0.9.30 Artur Skawina <...>
[ exec time; lower is better ] [speed ] [ time ] [ok?]
TIME-N+S TIME32 TIME33 TIME1480 MBYTES/S TIMEXXXX CSUM FUNCTION ( rdtsc_overhead=0 null=0 )
17901 510 557 3010 393.36 59772 56dd csum_partial_cdumb16
3019 154 156 431 2747.10 43106 56dd csum_partial_c32
2413 170 177 328 3609.76 37501 56dd csum_partial_c32l
2437 170 170 328 3609.76 37488 56dd csum_partial_c32i
5078 205 254 767 1543.68 48117 56dd csum_partial_std
5612 299 291 851 1391.30 53673 56dd csum_partial_686
1584 99 127 227 5215.86 14495 56dd csum_partial_586f
1738 107 121 229 5170.31 14785 56dd csum_partial_586fs
4893 175 171 759 1559.95 52347 56dd csum_partial_copy_generic_std
4949 151 189 756 1566.14 67847 56dd csum_partial_copy_generic_686
2072 110 134 302 3920.53 39061 56dd csum_partial_copy_generic_p4as1
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-06 23:04 ` Linus Torvalds
@ 2009-08-06 23:25 ` Linus Torvalds
2009-08-07 0:13 ` Linus Torvalds
2009-08-07 0:53 ` Artur Skawina
0 siblings, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 23:25 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List
On Thu, 6 Aug 2009, Linus Torvalds wrote:
>
> It was prescott that changed a lot (mostly for the worse - the shifter was
> one of the few upsides of prescott, although increased frequency often
> made up for the downsides).
Anyway, since you have a Northwood, I bet that the #1 issue for you is to
spread out the shift instructions in a way that simply doesn't matter
anywhere else.
In NetBurst, if I remember the details correctly, a "complex instruction"
will basically get its trace cache entries from the microcode ROMs. I'm not sure
how it interacts with the TC entries around it, but it's entirely possible
that it basically disables any instruction scheduling (the microcode
traces are presumably "pre-scheduled"), so you'd basically see stalls
where there's little out-of-order execution.
That then explains why you see huge differences from what is basically
trivial scheduling decisions, and why some random placement of a shift
makes a big difference.
Just out of curiosity, does anything change if you change the
B = SHA_ROR(B,2)
into a
B = SHA_ROR(SHA_ROR(B,1),1)
instead? It's very possible that it becomes _much_ worse, but I guess it's
also possible in theory that a single-bit rotate ends up being a simple
instruction and that doing two single-bit ROR's is actually faster than
one 2-bit ROR (assuming the second one is microcoded and the first one isn't).
In particular, I'm thinking about the warning in the Intel optimization
manual:
The rotate by immediate and rotate by register instructions are
more expensive than a shift. The rotate by 1 instruction has the
same latency as a shift.
so it's very possible that "rotate by 1" is much better than other
rotates.
Linus
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-06 23:19 ` Artur Skawina
@ 2009-08-06 23:42 ` Linus Torvalds
0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2009-08-06 23:42 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List
On Fri, 7 Aug 2009, Artur Skawina wrote:
>
> Actually that's even more of a reason to make sure the code doesn't suck :)
> The difference on less perverse cpus will usually be small, but on P4 it
> can be huge.
No. First off, the things you have to do on P4 are just insane. See the
email I just sent out asking you to test whether two 1-bit rotates might
be faster than 1 2-bit rotate.
So optimizing for P4 is often the wrong thing.
Secondly, P4's are going away. You may have one, but they are getting
rare. So optimizing for them is a losing proposition in the long run.
> A few years back I found my old ip checksum microbenchmark, and when I ran
> it on a P4 (prescott iirc) i didn't believe my eyes. The straightforward
> 32-bit C implementation was running circles around the in-kernel one...
> And a few tweaks to the assembler version got me another ~100% speedup.[1]
Yeah, not very surprising. The P4 is very good at the simplest possible
kind of code that does _nothing_ fancy.
But then it completely chokes on some code. I mean _totally_. It slows
down by a huge amount if there is anything but the most trivial kinds of
instructions. And by "trivial", I mean _really_ trivial. Shifts (as in
SHA1), but iirc also things like "adc" (add with carry) etc.
So it's not hard to write code that works well on other uarchs, and then
totally blow up on P4. I think it doesn't rename the flags at all, so any
flag dependency (carry being the most common one) will stall things
horribly.
There's also a very subtle store forwarding failure thing (and a lot of
other events) that causes a nasty micro-architectural replay trap, and
again you go from "running like a bat out of hell" to "slower than an i486
at a tenth the frequency".
Really. It's disgusting. Perfectly fine code can run really slowly on the
P4 just because it hits some random internal micro-architectural flaw. And
there's a _lot_ of those "glass jaw" issues.
The best way to avoid them is to use _only_ simple ALU instructions (add,
sub, and/or/not), and to be _very_ careful with loads and stores.
Linus
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-06 23:25 ` Linus Torvalds
@ 2009-08-07 0:13 ` Linus Torvalds
2009-08-07 1:30 ` Artur Skawina
2009-08-07 0:53 ` Artur Skawina
1 sibling, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-07 0:13 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List
On Thu, 6 Aug 2009, Linus Torvalds wrote:
>
> In particular, I'm thinking about the warning in the Intel optimization
> manual:
>
> The rotate by immediate and rotate by register instructions are
> more expensive than a shift. The rotate by 1 instruction has the
> same latency as a shift.
>
> so it's very possible that "rotate by 1" is much better than other
> rotates.
Hmm. Probably not. Googling more seems to indicate that rotates and shifts
have a fixed 4-cycle latency on Northwood. I'm not seeing anything that
indicates that a single-bit rotate/shift would be any faster.
(And remember, if 4 cycles doesn't sound so bad: that's enough of a
latency to do _16_ "simple" ALU's, since they can be double-pumped in the
two regular ALU's).
I think long-running ALU ops that feed into a store (spill) also happen to
be the thing that makes the dreaded store-buffer replay trap nasties
happen more (load vs store scheduled badly, and then you end up spending
tens of cycles just replaying).
Linus
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-06 23:25 ` Linus Torvalds
2009-08-07 0:13 ` Linus Torvalds
@ 2009-08-07 0:53 ` Artur Skawina
1 sibling, 0 replies; 38+ messages in thread
From: Artur Skawina @ 2009-08-07 0:53 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
Linus Torvalds wrote:
>
> Just out of curiosity, does anything change if you change the
>
> B = SHA_ROR(B,2)
>
> into a
>
> B = SHA_ROR(SHA_ROR(B,1),1)
>
> instead? It's very possible that it becomes _much_ worse, but I guess it's
Did try that yesterday, didn't help. Will recheck now.. yep:
before: linus 0.3554 171.7
after: linus 0.407 150
still true for the current version.
> So optimizing for P4 is often the wrong thing.
>
> Secondly, P4's are going away. You may have one, but they are getting
> rare. So optimizing for them is a losing proposition in the long run.
Sure, no argument; it's just that avoiding the P4 pitfalls is usually
not that hard and the impact on other, non-netburst, archs is low.
There are a lot of P4s out there and they're not going away soon.
(i'm still keeping most of my git trees on a P3...)
For generic C code such as this the difference for your i7 was -2% and
+70% for my P4; all the other (but one, i think) optimizations which
worked on P4 also applied to 32-bit i7. As i happen to have a p4 i can
just as well test the code on it, many improvements will likely apply
to other cpus too. That's all, i doubt anybody seriously considered
"optimizing for P4"; there is a reason intel discontinued them :)
The atom is a more important target, but only the asm versions did well
there so far.
artur
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-07 0:13 ` Linus Torvalds
@ 2009-08-07 1:30 ` Artur Skawina
2009-08-07 1:55 ` Linus Torvalds
0 siblings, 1 reply; 38+ messages in thread
From: Artur Skawina @ 2009-08-07 1:30 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
Linus Torvalds wrote:
>
> On Thu, 6 Aug 2009, Linus Torvalds wrote:
>> In particular, I'm thinking about the warning in the Intel optimization
>> manual:
>>
>> The rotate by immediate and rotate by register instructions are
>> more expensive than a shift. The rotate by 1 instruction has the
>> same latency as a shift.
>>
>> so it's very possible that "rotate by 1" is much better than other
>> rotates.
>
> Hmm. Probably not. Googling more seems to indicate that rotates and shifts
> have a fixed 4-cycle latency on Northwood. I'm not seeing anything that
> indicates that a single-bit rotate/shift would be any faster.
>
> (And remember, if 4 cycles doesn't sound so bad: that's enough of a
> latency to do _16_ "simple" ALU's, since they can be double-pumped in the
> two regular ALU's).
Looking at the generated code, there is a lot of ro[rl] movement, so it's
likely that contributes to the problem.
I also see 44 extra lea instructions, 44 less adds and changes like:
[...]
mov XX(%eRX),%eRX
xor XX(%eRX),%eRX
- and %eRX,%eRX
+ and XX(%eRX),%eRX
xor XX(%eRX),%eRX
- add %eRX,%eRX
- ror $0x2,%eRX
- mov %eRX,XX(%eRX)
+ lea (%eRX,%eRX,1),%eRX
mov XX(%eRX),%eRX
bswap %eRX
mov %eRX,XX(%eRX)
mov %eRX,%eRX
+ ror $0x2,%eRX
+ mov %eRX,XX(%eRX)
+ mov %eRX,%eRX
rol $0x5,%eRX
mov XX(%eRX),%eRX
- mov XX(%eRX),%eRX
[...]
which could mean that gcc did a better job of register allocation
(where "better job" might be just luck).
artur
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-07 1:30 ` Artur Skawina
@ 2009-08-07 1:55 ` Linus Torvalds
0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2009-08-07 1:55 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List
On Fri, 7 Aug 2009, Artur Skawina wrote:
>
> I also see 44 extra lea instructions, 44 less adds
add and lea (as long as the lea shift is 1) should be the same on a P4
(they are not the same on some other microarchitectures and lea can have
address generation stalls etc).
Lea, of course, gives the potential for register movement at the same time
(three-address op), and that's likely the reason for lea-vs-adds.
> and changes like:
> [...]
> mov XX(%eRX),%eRX
> xor XX(%eRX),%eRX
> - and %eRX,%eRX
> + and XX(%eRX),%eRX
Yeah, different spill patterns. That's the biggest issue, I think.
In particular, on P4, with unlucky spills, you may end up with things like
ror $2,reg
mov reg,x(%esp)
.. a few instructions ..
xor x(%esp), reg
and the above is exactly when one of the worst P4 problems hit: a store,
followed a few cycles later by a load from the same address (and "a few
cycles later" can be quite a few instructions if they are the nice ones).
What can happen is that if the store data isn't ready yet (because it
comes from a long-latency op like a shift or a multiply), then you hit a
store buffer replay thing. The P4 (with its long pipeline) basically
starts the load speculatively, and if anything bad happens for the load
(L1 cache miss, TLB miss, store buffer fault, you name it), it will cause
a replay of the whole pipeline.
Which can take tens of cycles.
[ That said, it's been a long time since I did a lot of P4 worrying. So I
may mis-remember the details. But that whole store buffer forwarding had
some really nasty replay issues ]
> which could mean that gcc did a better job of register allocation
> (where "better job" might be just luck).
I suspect that's the biggest issue. Just _happening_ to get the spills so
that they don't hurt. And with unlucky scheduling, you might hit some of
the P4 replay issues every single time.
There are some P4 optimizations that are simple:
- avoid complex instructions
- don't blow the trace cache
- predictable branches
but the replay faults can really get you.
Linus
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-06 22:27 ` Linus Torvalds
2009-08-06 22:33 ` Linus Torvalds
2009-08-06 22:55 ` Artur Skawina
@ 2009-08-07 2:23 ` Linus Torvalds
2009-08-07 4:16 ` Artur Skawina
[not found] ` <alpine.LFD.2.01.0908071614310.3288@localhost.localdomain>
2 siblings, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2009-08-07 2:23 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List
On Thu, 6 Aug 2009, Linus Torvalds wrote:
>
>
> On Thu, 6 Aug 2009, Artur Skawina wrote:
> >
> > Does this make any difference for you? For me it's the best one so far
> > (the linusas2 number clearly shows that for me the register renaming does
> > nothing; other than that the functions should be very similar)
>
> Nope. If anything, it's bit slower, but it might be in the noise. I
> generally got 330MB/s with my "cpp renaming" on Nehalem (32-bit - the
> 64-bit numbers are ~400MB/s), but with this I got 325MB/s twice in a row,
> which matches the linusas2 numbers pretty exactly.
I actually found a P4 I have access to, except that one is a Prescott.
And I can't run it in 32-bit mode, because I only have a regular user
login, and it only has the 64-bit development environment.
But I can do the hacked-for-64bit sha1bench runs, and I tested your patch.
It's horrible.
Here's the plain "linus" baseline (ie the "Do register rotation in cpp"
thing, with the fixed "E += TEMP .." thing):
# TIME[s] SPEED[MB/s]
rfc3174 1.648 37.03
rfc3174 1.677 36.4
linus 0.4018 151.9
linusas 0.4439 137.5
linusas2 0.4381 139.3
mozilla 0.9587 63.66
mozillaas 0.9434 64.7
and here it is with your patch:
# TIME[s] SPEED[MB/s]
rfc3174 1.667 36.61
rfc3174 1.644 37.12
linus 0.4653 131.2
linusas 0.4412 138.3
linusas2 0.4388 139.1
mozilla 0.9466 64.48
mozillaas 0.9449 64.59
(ok, so the numbers aren't horribly stable, but the "plain linus" thing
consistently outperforms here - and underperforms with your patch).
However, note that since this is the 64-bit thing, there likely aren't any
spill issues, but it's simply an issue of "just how did the array[]
accesses get scheduled" etc. And since this is a Prescott (or rather
"Xeon") P4, the shifter isn't quite as horrible as yours is. _And_ this is
a different gcc version (4.0.3).
So the numbers aren't really all that comparable. It's more an example of
"optimizing for P4 is futile, because you're just playing with total
randomness". That's like a 20MB/s difference, just from moving a few ALU
ops around a bit.
And it's entirely possible that if I had gcc-4.4 on that machine, your
patch would magically do the right thing ;)
Sadly, that machine is just a ssh gateway, so there's no real development
tools on it at all - no way to get good profiles etc. So I can't really
say exactly what the problem pattern is :(
Linus
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-07 2:23 ` Linus Torvalds
@ 2009-08-07 4:16 ` Artur Skawina
[not found] ` <alpine.LFD.2.01.0908071614310.3288@localhost.localdomain>
1 sibling, 0 replies; 38+ messages in thread
From: Artur Skawina @ 2009-08-07 4:16 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
Linus Torvalds wrote:
>
> Here's the plain "linus" baseline (ie the "Do register rotation in cpp"
> thing, with the fixed "E += TEMP .." thing):
> linus 0.4018 151.9
> and here it is with your patch:
> linus 0.4653 131.2
> (ok, so the numbers aren't horribly stable, but the "plain linus" thing
> consistently outperforms here - and underperforms with your patch).
Well, I'd be surprised if one C version would always be the winner on
every single cpu; that 13% loss[1] I think would be an acceptable compromise,
if the goal is to have one implementation that does reasonably well on all
cpus.
That's why i asked how the change did on nehalem; if it's a measurable loss
on anything modern (core2+), then of course the P4s must suffer; and one
could always blame the compiler ;)
It's not like the difference in sha1 overhead will be noticeable in normal
git use.
artur
[1] I suspect the old gcc is a factor (4.0.4 does <100M/s here).
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
[not found] ` <4A7CC380.3070008@gmail.com>
@ 2009-08-08 4:16 ` Linus Torvalds
2009-08-08 5:34 ` Artur Skawina
0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-08 4:16 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List
On Sat, 8 Aug 2009, Artur Skawina wrote:
>
> i was seeing such large variations depending on the -mtune flags that
> i gave up and now do just -march=i686; that's what i would expect
> for generic x86 binaries.
I think I have found a way to avoid the gcc craziness.
Lookie here:
# TIME[s] SPEED[MB/s]
rfc3174 5.094 119.8
rfc3174 5.098 119.7
linus 1.462 417.5
linusas 2.008 304
linusas2 1.878 325
mozilla 5.566 109.6
mozillaas 5.866 104.1
openssl 1.609 379.3
spelvin 1.675 364.5
spelvina 1.601 381.3
nettle 1.591 383.6
notice? I outperform all the hand-tuned asm on 32-bit too. By quite a
margin, in fact.
Now, I didn't try a P4, and it's possible that it won't do that there, but
the 32-bit code generation sure looks impressive on my Nehalem box. The
magic? I force the stores to the 512-bit hash bucket to be done in order.
That seems to help a lot.
The diff is trivial (on top of the "rename registers with cpp" patch), as
appended. And it does seem to fix the P4 issues too, although I can
obviously (once again) only test Prescott, and only in 64-bit mode:
# TIME[s] SPEED[MB/s]
rfc3174 1.662 36.73
rfc3174 1.64 37.22
linus 0.2523 241.9
linusas 0.4367 139.8
linusas2 0.4487 136
mozilla 0.9704 62.9
mozillaas 0.9399 64.94
that's some really impressive improvement. All from just saying "do the
stores in the order I told you to, dammit!" to the compiler.
Linus
---
block-sha1/sha1.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 19dc41d..f70e1ba 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -93,6 +93,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
/* This "rolls" over the 512-bit array */
#define W(x) (array[(x)&15])
+#define setW(x, val) (*(volatile unsigned int *)&W(x) = (val))
/*
* Where do we get the source from? The first 16 iterations get it from
@@ -102,7 +103,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
#define SHA_MIX(t) SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1)
#define SHA_ROUND(t, input, fn, constant, A, B, C, D, E) do { \
- unsigned int TEMP = input(t); W(t) = TEMP; \
+ unsigned int TEMP = input(t); setW(t, TEMP); \
E += TEMP + SHA_ROL(A,5) + (fn) + (constant); \
B = SHA_ROR(B, 2); } while (0)
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-08 4:16 ` Linus Torvalds
@ 2009-08-08 5:34 ` Artur Skawina
2009-08-08 17:10 ` Linus Torvalds
2009-08-08 22:58 ` Artur Skawina
0 siblings, 2 replies; 38+ messages in thread
From: Artur Skawina @ 2009-08-08 5:34 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
Linus Torvalds wrote:
>
> I think I have found a way to avoid the gcc craziness.
>
> Lookie here:
>
> # TIME[s] SPEED[MB/s]
> rfc3174 5.094 119.8
> rfc3174 5.098 119.7
> linus 1.462 417.5
> linusas 2.008 304
> linusas2 1.878 325
> mozilla 5.566 109.6
> mozillaas 5.866 104.1
> openssl 1.609 379.3
> spelvin 1.675 364.5
> spelvina 1.601 381.3
> nettle 1.591 383.6
>
> notice? I outperform all the hand-tuned asm on 32-bit too. By quite a
> margin, in fact.
>
> Now, I didn't try a P4, and it's possible that it won't do that there, but
> the 32-bit code generation sure looks impressive on my Nehalem box. The
> magic? I force the stores to the 512-bit hash bucket to be done in order.
> That seems to help a lot.
I named it 'linusv':
P4/i686:
# TIME[s] SPEED[MB/s]
rfc3174 1.456 41.92
rfc3174 1.445 42.22
linus 0.5865 104.1
linusph 0.5643 108.2
linusv 0.3697 165.1
linusvph 0.3618 168.7
linusp4 0.4312 141.5
linusas 0.4091 149.2
linusas2 0.4364 139.9
mozilla 1.102 55.37
mozillaas 1.297 47.07
openssl 0.261 233.9
opensslb 0.2395 254.9
spelvin 0.2653 230
nettle 0.438 139.4
and when tuning for prescott:
linus 0.6544 93.27
linusph 0.6523 93.57
linusv 0.3439 177.5
linusvph 0.3547 172.1
linusp4 0.3585 170.3
so it isn't as fast as the openssl asm ones, but it does win
in the C category.
> I outperform all the hand-tuned asm on 32-bit too. By quite a
> margin, in fact.
I've inlined the byteswapping in 'opensslb', maybe that one will
do a bit better.
http://www.src.multimo.pl/YDpqIo7Li27O0L0h/sha1bench.tar.gz
artur
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-08 5:34 ` Artur Skawina
@ 2009-08-08 17:10 ` Linus Torvalds
2009-08-08 18:12 ` Artur Skawina
2009-08-08 22:58 ` Artur Skawina
1 sibling, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2009-08-08 17:10 UTC (permalink / raw)
To: Artur Skawina; +Cc: Git Mailing List
On Sat, 8 Aug 2009, Artur Skawina wrote:
>
> I've inlined the byteswapping in 'opensslb', maybe that one will
> do a bit better.
>
> http://www.src.multimo.pl/YDpqIo7Li27O0L0h/sha1bench.tar.gz
Hmm. Testing on my atom, the inlined bswap is worse, but the asm versions
are generally superior to any C one:
# TIME[s] SPEED[MB/s]
rfc3174 2.194 27.82
rfc3174 2.19 27.87
linus 0.947 64.45
linusph 0.9381 65.06
linusv 0.8943 68.25
linusvph 0.8803 69.34
linusasm 0.9349 65.29
linusp4 1.006 60.66
linusas 1.062 57.48
linusas2 1.009 60.5
mozilla 2.264 26.96
mozillaas 2.197 27.78
openssl 0.648 94.19
opensslb 0.7419 82.27
spelvin 0.636 95.96
spelvina 0.6671 91.49
nettle 0.717 85.12
nettle-ror 0.7137 85.52
nettle-p4sch 0.7158 85.27
Interestingly, -mtune=prescott does well for that 'linusv' version on atom
too, and gets it up to
linusv 0.8365 72.96
and it's the only one that improves. Odd interactions.
Linus
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-08 17:10 ` Linus Torvalds
@ 2009-08-08 18:12 ` Artur Skawina
0 siblings, 0 replies; 38+ messages in thread
From: Artur Skawina @ 2009-08-08 18:12 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
Linus Torvalds wrote:
>
> On Sat, 8 Aug 2009, Artur Skawina wrote:
>> I've inlined the byteswapping in 'opensslb', maybe that one will
>> do a bit better.
> Hmm. Testing on my atom, the inlined bswap is worse, but the asm versions
> are generally superior to any C one:
It loses on atom, but is the best one on both P3 and P4 here.
Based on your other numbers I was expecting it to win on 32-bit
nehalem too. gcc doing a better job of scheduling w/ 'linusv'
wouldn't surprise me, though (since there are no spills, the data reads
are about the only other thing that could make a difference. And, yes,
they show up in the profiles; if x86 only had one more register...)
artur
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-08 5:34 ` Artur Skawina
2009-08-08 17:10 ` Linus Torvalds
@ 2009-08-08 22:58 ` Artur Skawina
2009-08-08 23:36 ` Artur Skawina
1 sibling, 1 reply; 38+ messages in thread
From: Artur Skawina @ 2009-08-08 22:58 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
Artur Skawina wrote:
> Linus Torvalds wrote:
>> magic? I force the stores to the 512-bit hash bucket to be done in order.
>> That seems to help a lot.
>
> I named it 'linusv':
> linusv 0.3697 165.1
I was not going to spend even more time on the C version, but after looking
at what gcc does to it, tried this:
diff --git a/block-sha1/sha1vol.c b/block-sha1/sha1vol.c
--- a/block-sha1/sha1vol.c
+++ b/block-sha1/sha1vol.c
@@ -93,7 +93,7 @@ void blk_SHA1_Finalv(unsigned char hashout[20], blk_SHA_CTX *ctx)
/* This "rolls" over the 512-bit array */
#define W(x) (array[(x)&15])
-#define setW(x, val) (*(volatile unsigned int *)&W(x) = (val))
+#define setW(x, val) W(x) = (val); __asm__ volatile ("": "+m" (W(x)))
/*
* Where do we get the source from? The first 16 iterations get it from
and got a nice improvement:
rfc3174 1.436 42.49
linus 0.5843 104.5
linusph 0.5639 108.2
linusv 0.3098 197
linusvph 0.3082 198.1
linusasm 0.5849 104.3
linusp4 0.433 141
linusas 0.4077 149.7
linusas2 0.436 140
mozilla 1.099 55.54
mozillaas 1.295 47.11
openssl 0.2632 231.9
opensslb 0.2395 254.8
spelvin 0.2687 227.2
spelvina 0.2526 241.7
nettle 0.4378 139.4
nettle-ror 0.4379 139.4
nettle-p4sch 0.4231 144.2
The atom numbers didn't change much.
artur
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
2009-08-08 22:58 ` Artur Skawina
@ 2009-08-08 23:36 ` Artur Skawina
0 siblings, 0 replies; 38+ messages in thread
From: Artur Skawina @ 2009-08-08 23:36 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List
Artur Skawina wrote:
> Artur Skawina wrote:
> -#define setW(x, val) (*(volatile unsigned int *)&W(x) = (val))
> +#define setW(x, val) W(x) = (val); __asm__ volatile ("": "+m" (W(x)))
and w/ this on top:
diff --git a/block-sha1/sha1vol.c b/block-sha1/sha1vol.c
--- a/block-sha1/sha1vol.c
+++ b/block-sha1/sha1vol.c
@@ -103,9 +103,9 @@ void blk_SHA1_Finalv(unsigned char hashout[20], blk_SHA_CTX *ctx)
#define SHA_MIX(t) SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1)
#define SHA_ROUND(t, input, fn, constant, A, B, C, D, E) do { \
- unsigned int TEMP = input(t); setW(t, TEMP); \
- E += TEMP + SHA_ROL(A,5) + (fn) + (constant); \
- B = SHA_ROR(B, 2); } while (0)
+ unsigned int TEMP = SHA_ROL(A,5); E+= (fn); \
+ E += (constant) + TEMP; TEMP = input(t); setW(t, TEMP); \
+ B = SHA_ROR(B, 2); E += TEMP; } while (0)
#define T_0_15(t, A, B, C, D, E) SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
#define T_16_19(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
I see an improvement on atom and reach ~200M/s on P4 (i686).
When compiled w/ '-mtune=prescott':
rfc3174 1.459 41.84
linus 0.6574 92.85
linusph 0.6613 92.29
linusv 0.2682 227.6
linusvph 0.2681 227.7
linusasm 0.5868 104
linusp4 0.3586 170.2
linusas 0.3795 160.8
linusas2 0.3583 170.3
mozilla 1.171 52.11
mozillaas 1.381 44.2
openssl 0.2623 232.7
opensslb 0.2404 253.9
spelvin 0.2659 229.6
spelvina 0.2492 244.9
nettle 0.4362 139.9
nettle-ror 0.436 140
nettle-p4sch 0.4204 145.2
it's now just 2% slower than the openssl assembler version.
artur
* Re: [PATCH 0/7] block-sha1: improved SHA1 hashing
@ 2009-08-07 7:36 George Spelvin
0 siblings, 0 replies; 38+ messages in thread
From: George Spelvin @ 2009-08-07 7:36 UTC (permalink / raw)
To: torvalds; +Cc: art.08.09, git, linux
> Basically, older P4's will I think shift one bit at a time. So while even
> Prescott is relatively weak in the shifter department, pre-prescott
> (Willamette and Northwood) are _really_ weak. If your P4 is one of those,
> you really shouldn't use it to decide on optimizations.
Thanks for the warning. The only P4 I have a login on is a Willamette, so
I guess it's a bad optimization target:
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 1
model name : Intel(R) Pentium(R) 4 CPU 1.60GHz
stepping : 2
cpu MHz : 1593.888
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pebs bts
bogomips : 3187.77
clflush size : 64
power management:
BTW, I'm having a bit of a problem with sha1bench. Because the number
of rounds is multiplied by 10 until it takes more than mintime, and
the watchdog timer is set to 12*mintime, if I just barely miss the
threshold, I can hit the watchdog on any code that's more than 20% slower
than the reference rfc3174 code:
# /tmp/sha1bench
#Initializing... Rounds: 10000000, size: 625000K, time: 6.904s, speed: 88.41MB/s
# TIME[s] SPEED[MB/s]
rfc3174 6.912 88.31
# New hash result: 0489b02aee9fbd82b0bb0cba96f8047e42f543b8
rfc3174 6.911 88.31
linus 3.002 203.3
linusas 3.253 187.7
linusas2 3.059 199.6
mozilla 10.86 56.22
mozillaas 10.33 59.09
openssl 2.145 284.5
spelvin 1.933 315.8
spelvina 1.933 315.8
nettle 2.161 282.4
I had to bump it to 20 times to be able to get past the mozilla code.
You can still have nice round numbers with smaller increments of 2x or 2.5x:
do {
- rounds *= 10;
+ rounds *= 2;
+ if (rounds % 9 == 4)
+ rounds += rounds/4; /* Step 1, 2, 5, 10, 20, 50, 100... */
uWATCHDOG(mintime*20);
t1 = GETTIME();
for (i=0; i<rounds; i++) {
SHA1Input(sc, arr512b, sizeof(arr512b));
if (i<64) {
SHA1Result(sc, arr512b+
(i+arr512b[i]+arr512b[arr512b[i]%64]+
arr512b[63-i]+arr512b[arr512b[63-i]%64]) %
(sizeof(arr512b)-20));
SHA1Reset(sc);
}
}
t2 = GETTIME(); td = t2-t1;
}
while (td<mintime);
And yeah, my P4 is touchy:
# /var/tmp/sha1bench
#Initializing... Rounds: 500000, size: 31250K, time: 0.9736s, speed: 31.35MB/s
# TIME[s] SPEED[MB/s]
rfc3174 0.9931 30.73
# New hash result: 2872616106e163ae9c7c8d12a38bef032323c844
rfc3174 0.9491 32.16
linus 0.4906 62.2
linusas 0.5799 52.62
linusas2 0.4859 62.81
mozilla 1.302 23.44
mozillaas 1.234 24.74
openssl 0.226 135
spelvin 0.2298 132.8
spelvina 0.2494 122.4
nettle 0.3687 82.78
end of thread, other threads:[~2009-08-08 23:37 UTC | newest]
Thread overview: 38+ messages
2009-08-06 15:13 [PATCH 0/7] block-sha1: improved SHA1 hashing Linus Torvalds
2009-08-06 15:15 ` [PATCH 1/7] block-sha1: add new optimized C 'block-sha1' routines Linus Torvalds
2009-08-06 15:16 ` [PATCH 2/7] block-sha1: try to use rol/ror appropriately Linus Torvalds
2009-08-06 15:18 ` [PATCH 3/7] block-sha1: make the 'ntohl()' part of the first SHA1 loop Linus Torvalds
2009-08-06 15:20 ` [PATCH 4/7] block-sha1: re-use the temporary array as we calculate the SHA1 Linus Torvalds
2009-08-06 15:22 ` [PATCH 5/7] block-sha1: macroize the rounds a bit further Linus Torvalds
2009-08-06 15:24 ` [PATCH 6/7] block-sha1: Use '(B&C)+(D&(B^C))' instead of '(B&C)|(D&(B|C))' in round 3 Linus Torvalds
2009-08-06 15:25 ` [PATCH 7/7] block-sha1: get rid of redundant 'lenW' context Linus Torvalds
2009-08-06 18:25 ` [PATCH 2/7] block-sha1: try to use rol/ror appropriately Bert Wesarg
2009-08-06 17:22 ` [PATCH 0/7] block-sha1: improved SHA1 hashing Artur Skawina
2009-08-06 18:09 ` Linus Torvalds
2009-08-06 19:10 ` Artur Skawina
2009-08-06 19:41 ` Linus Torvalds
2009-08-06 20:08 ` Artur Skawina
2009-08-06 20:53 ` Linus Torvalds
2009-08-06 21:24 ` Linus Torvalds
2009-08-06 21:39 ` Artur Skawina
2009-08-06 21:52 ` Artur Skawina
2009-08-06 22:27 ` Linus Torvalds
2009-08-06 22:33 ` Linus Torvalds
2009-08-06 23:19 ` Artur Skawina
2009-08-06 23:42 ` Linus Torvalds
2009-08-06 22:55 ` Artur Skawina
2009-08-06 23:04 ` Linus Torvalds
2009-08-06 23:25 ` Linus Torvalds
2009-08-07 0:13 ` Linus Torvalds
2009-08-07 1:30 ` Artur Skawina
2009-08-07 1:55 ` Linus Torvalds
2009-08-07 0:53 ` Artur Skawina
2009-08-07 2:23 ` Linus Torvalds
2009-08-07 4:16 ` Artur Skawina
[not found] ` <alpine.LFD.2.01.0908071614310.3288@localhost.localdomain>
[not found] ` <4A7CBD28.6070306@gmail.com>
[not found] ` <4A7CBF47.9000903@gmail.com>
[not found] ` <alpine.LFD.2.01.0908071700290.3288@localhost.localdomain>
[not found] ` <4A7CC380.3070008@gmail.com>
2009-08-08 4:16 ` Linus Torvalds
2009-08-08 5:34 ` Artur Skawina
2009-08-08 17:10 ` Linus Torvalds
2009-08-08 18:12 ` Artur Skawina
2009-08-08 22:58 ` Artur Skawina
2009-08-08 23:36 ` Artur Skawina
2009-08-07 7:36 George Spelvin