* Re: Performance issue of 'git branch'
@ 2009-07-26 23:21 George Spelvin
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
  0 siblings, 1 reply; 60+ messages in thread
From: George Spelvin @ 2009-07-26 23:21 UTC (permalink / raw)
  To: git, torvalds; +Cc: linux

> It's a bit sad, since the _only_ thing we load all of libcrypto for is the 
> (fairly trivial) SHA1 code. 
>
> But at the same time, last time I benchmarked the different SHA1 
> libraries, the openssl one was the fastest. I think it has tuned assembly 
> language for most architectures. Our regular mozilla-based C code is 
> perfectly fine, but it doesn't hold a candle to assembler tuning.

Actually, openssl only has assembly for x86, x86_64, and ia64.
Truthfully, once you have 32 registers, SHA1 is awfully easy to
compile near-optimally.

Git currently includes some hand-tuned assembly that isn't in OpenSSL:
- ARM (only 16 registers, and the rotate+op support can be used nicely)
- PPC (3-way superscalar *without* OO execution benefits from careful
  scheduling)

Further, all of the core hand-tuned SHA1 assembly code in OpenSSL is by
Andy Polyakov and is dual-licensed GPL/3-clause BSD *in addition to*
the OpenSSL license.  So we can just import it:

See http://www.openssl.org/~appro/cryptogams/
and http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz

(Ooh, look, he has PPC code in there, too.  Does anyone with a PPC machine
want to compare it with Git's?)

It'll take some massaging because that's just the core SHA1_Transform
function and not the wrappers, but it's hardly a heroic effort.

I'm pretty deep in the weeds at $DAY_JOB and can't get to it for a while,
but would a patch be appreciated?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Request for benchmarking: x86 SHA1 code
  2009-07-26 23:21 Performance issue of 'git branch' George Spelvin
@ 2009-07-31 10:46 ` George Spelvin
  2009-07-31 11:11   ` Erik Faye-Lund
                     ` (8 more replies)
  0 siblings, 9 replies; 60+ messages in thread
From: George Spelvin @ 2009-07-31 10:46 UTC (permalink / raw)
  To: git; +Cc: linux

After studying Andy Polyakov's optimized x86 SHA-1 in OpenSSL, I've
got a version that's 1.6% slower on a P4 and 15% faster on a Phenom.
I'm curious about the performance on other CPUs I don't have access to,
particularly the Core 2 Duo and Core i7.

Could someone do some benchmarking for me?  Old (486/Pentium/P2/P3)
machines are also interesting, but I'm optimizing for newer ones.

I haven't packaged this nicely, but it's not that complicated.
- Download Andy's original code from
  http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
- Unpack and cd to the cryptogams-0/x86 directory
- "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile",
   and "sha1-x86.pl".
- "make" 
- Run ./586test (before) and ./x86test (after) and note the timings.

Thank you!

--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ Makefile	2009-07-31 06:22:42.000000000 -0400
@@ -0,0 +1,16 @@
+CC := gcc
+CFLAGS := -m32 -W -Wall -Os -g
+ASFLAGS := --32
+
+all: 586test x86test
+
+586test : sha1test.c sha1-586.o
+	$(CC) $(CFLAGS) -o $@ sha1test.c sha1-586.o
+
+x86test : sha1test.c sha1-x86.o
+	$(CC) $(CFLAGS) -o $@ sha1test.c sha1-x86.o
+
+586test x86test : sha1-586.h
+
+%.s : %.pl x86asm.pl x86unix.pl
+	perl $< elf > $@
--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ sha1test.c	2009-07-28 09:24:09.000000000 -0400
@@ -0,0 +1,67 @@
+#include <stdint.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <sys/time.h>
+
+#include "sha1-586.h"
+
+#define SIZE 1000000
+
+#if SIZE % 64
+# error SIZE must be a multiple of 64!
+#endif
+
+int
+main(void)
+{
+	uint32_t iv[5] = {
+		0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476, 0xc3d2e1f0
+	};
+	/* Simplest known-answer test, "abc" */
+	static uint8_t const testbuf[64] = {
+		'a','b','c', 0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 24
+	};
+	/* Expected: A9993E364706816ABA3E25717850C26C9CD0D89D */
+	static uint32_t const expected[5] = {
+		0xa9993e36, 0x4706816a, 0xba3e2571, 0x7850c26c, 0x9cd0d89d };
+	unsigned i;
+	char *p = malloc(SIZE);
+	struct timeval tv0, tv1;
+
+	if (!p) {
+		perror("malloc");
+		return 1;
+	}
+
+	sha1_block_data_order(iv, testbuf, 1);
+	printf("Result:  %08x %08x %08x %08x %08x\n"
+	       "Expected:%08x %08x %08x %08x %08x\n",
+		iv[0], iv[1], iv[2], iv[3], iv[4], expected[0],
+		expected[1], expected[2], expected[3], expected[4]);
+	for (i = 0; i < 5; i++)
+		if (iv[i] != expected[i])
+			printf("MISMATCH in word %u\n", i);
+
+	if (gettimeofday(&tv0, NULL) < 0) {
+		perror("gettimeofday");
+		return 1;
+	}
+	for (i = 0; i < 500; i++)
+		sha1_block_data_order(iv, p, SIZE/64);
+	if (gettimeofday(&tv1, NULL) < 0) {
+		perror("gettimeofday");
+		return 1;
+	}
+	tv1.tv_sec -= tv0.tv_sec;
+	tv1.tv_usec -= tv0.tv_usec;
+	if (tv1.tv_usec < 0) {
+		tv1.tv_sec--;
+		tv1.tv_usec += 1000000;
+	}
+	printf("%u bytes: %u.%06u s\n", i * SIZE, (unsigned)tv1.tv_sec,
+		(unsigned)tv1.tv_usec);
+	return 0;
+}
--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ sha1-586.h	2009-07-27 09:34:03.000000000 -0400
@@ -0,0 +1,3 @@
+#include <stdint.h>
+
+void sha1_block_data_order(uint32_t iv[5], void const *in, unsigned len);
--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ sha1-x86.pl	2009-07-31 06:10:18.000000000 -0400
@@ -0,0 +1,398 @@
+#!/usr/bin/env perl
+
+# ====================================================================
+# [Re]written by Andy Polyakov <appro@fy.chalmers.se> for the OpenSSL
+# project. The module is, however, dual licensed under OpenSSL and
+# CRYPTOGAMS licenses depending on where you obtain it. For further
+# details see http://www.openssl.org/~appro/cryptogams/.
+# ====================================================================
+
+# "[Re]written" was achieved in two major overhauls. In 2004 BODY_*
+# functions were re-implemented to address P4 performance issue [see
+# commentary below], and in 2006 the rest was rewritten in order to
+# gain freedom to liberate licensing terms.
+
+# It was noted that Intel IA-32 C compiler generates code which
+# performs ~30% *faster* on P4 CPU than original *hand-coded*
+# SHA1 assembler implementation. To address this problem (and
+# prove that humans are still better than machines:-), the
+# original code was overhauled, which resulted in following
+# performance changes:
+#
+#		compared with original	compared with Intel cc
+#		assembler impl.		generated code
+# Pentium	-16%			+48%
+# PIII/AMD	+8%			+16%
+# P4		+85%(!)			+45%
+#
+# As you can see, Pentium came out as the loser:-( Yet I reckoned that
+# the improvement on P4 outweighs the loss and incorporated this
+# re-tuned code into 0.9.7 and later.
+# ----------------------------------------------------------------
+#					<appro@fy.chalmers.se>
+
+$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
+push(@INC,"${dir}","${dir}../../perlasm");
+require "x86asm.pl";
+
+&asm_init($ARGV[0],"sha1-586.pl",$ARGV[$#ARGV] eq "386");
+
+$A="eax";
+$B="ebx";
+$C="ecx";
+$D="edx";
+$E="ebp";
+
+# Two temporaries
+$S="esi";
+$T="edi";
+
+# The round constants
+use constant K1 => 0x5a827999;
+use constant K2 => 0x6ED9EBA1;
+use constant K3 => 0x8F1BBCDC;
+use constant K4 => 0xCA62C1D6;
+
+@V=($A,$B,$C,$D,$E);
+
+# Given unlimited registers and functional units, it would be
+# possible to compute SHA-1 at two cycles per round, using 7
+# operations per round.  Remember, each round computes a new
+# value for E, which is used as A in the following round and B
+# in the round after that.  There are two critical paths:
+# - A must be rotated and added to the output E
+# - B must go through two boolean operations before being added
+#   to the result E.  Since this latter addition can't be done
+#   in the same-cycle as the critical addition of a<<<5, this is
+#   a total of 2+1+1 = 4 cycles.
+# Additionally, if you want to avoid copying B, you have to
+# rotate it soon after use in this round so it is ready for use
+# as the following round's C.
+
+# f = (a <<< 5) + e + K + in[i] + (d^(b&(c^d)))		(0..19)
+# f = (a <<< 5) + e + K + in[i] + (b^c^d)		(20..39, 60..79)
+# f = (a <<< 5) + e + K + in[i] + (c&d) + (b&(c^d))	(40..59)
+# The hard part is doing this with only two temporary registers.
+# Let's divide it into 4 parts.  These can be executed in a 7-cycle
+# loop, assuming triple (quadruple counting xor separately) issue:
+#
+#	in[i]		F1(c,d)		F2(b,c,d)	a<<<5
+#	mov in[i],T			(addl S,A)	(movl B,S)
+#	xor in[i+1],T					(rorl 5,S)
+#	xor in[i+2],T	movl D,S			(addl S,A)
+#	xor in[i+3],T	andl C,S
+#	rotl 1,T	addl S,E	movl D,S
+#	(movl T,in[i])                  xorl C,S
+#	addl T+K,E			andl B,S	rorl 2,B	//
+#	(mov in[i],T)			addl S,E	movl A,S
+#	(xor in[i+1],T)					rorl 5,S
+#	(xor in[i+2],T)	(movl C,S)			addl S,E
+#
+# (The last 3 rounds can omit the store of T.)
+# The "addl T+K,E" line is actually implemented using the lea instruction,
+# which (on a Pentium) requires that neither T not K was modified on the
+# previous cycle.
+#
+# The other two rounds are a bit simpler, and can therefore be "pulled in"
+# one cycle, to 6.  The bit-select function (0..19):
+#	in[i]		F(b,c,d)	a<<<5
+#	mov in[i],T	(xorl E,S)			(addl T+K,A)
+#	xor in[i+1],T	(addl S,A)	(movl A,S)
+#	xor in[i+2],T	(rorl 2,C)	(roll 5,S)
+#	xor in[i+3],T	movl D,S	(addl S,A)
+#	roll 1,T	xorl C,S
+#	movl T,in[i]	andl B,S			rorl 2,B
+#	addl T+K,E	xorl D,S		//	(mov in[i],T)
+#	(xor in[i+1],T)	addl S,E	movl A,S
+#	(xor in[i+2],T)			roll 5,S
+#	(xor in[i+3],T)	(movl C,S)	addl S,E
+#
+# And the XOR function (also 6, limited by the in[i] forming) used in
+# rounds 20..39 and 60..79:
+#	in[i]		F(b,c,d)	a<<<5
+#	mov in[i],T	(xorl C,S)			(addl T+K,A)
+#	xor in[i+1],T	(addl S,A)	(movl A,S)
+#	xor in[i+2],T			(roll 5,S)	
+#	xor in[i+3],T			(addl S,A)
+#	roll 1,T	movl D,S
+#	movl T,in[i]	xorl B,S			rorl 2,B
+#	addl T+K,E	xorl C,S		//	(mov in[i],T)
+#	(xor in[i+1],T)	addl S,E	movl A,S
+#	(xor in[i+2],T)			roll 5,S
+#	(xor in[i+3],T)			addl S,E
+#
+# The first 16 rounds don't need to form the in[i] equation, letting
+# us pull it in another 2 cycles, to 4, after some reassignment of
+# temporaries:
+#	in[i]		F(b,c,d)	a<<<5
+#			movl D,S	(roll 5,T)	(addl S,A)
+#	mov in[i],T	xorl C,S	(addl T,A)
+#			andl B,S			rorl 2,B
+#	addl T+K,E	xorl D,S	movl A,T
+#			addl S,E	roll 5,T	(movl C,S)
+#	(mov in[i],T)	(xorl B,S)	addl T,E
+#
+
+# The transition between rounds 15 and 16 will be a bit tricky... the best
+# thing to do is to delay the computation of a<<<5 one cycle and move it back
+# to the S register.  That way, T is free as early as possible.
+#	in[i]		F(b,c,d)	a<<<5
+#	(addl T+K,A)	(xorl E,S)	(movl A,T)
+#			movl D,S	(roll 5,T)	(addl S,A)
+#	mov in[i],T	xorl C,S	(addl T,A)
+#			andl B,S			rorl 2,B
+#	addl T+K,E	xorl D,S			(movl in[1],T)
+#	(xor in[i+1],T)	addl S,E	movl A,S
+#	(xor in[i+2],T)	rorl 2,B	roll 5,S
+#	(xor in[i+3],T)	(movl C,S)	addl S,E
+
+
+
+
+
+# This expects the first copy of D to S to have been done already
+#			movl D,S	(roll 5,T)	(addl S,A)	//
+#	mov in[i],T	xorl C,S	(addl T,A)
+#			andl B,S			rorl 2,B
+#	addl T+K,E	xorl D,S	movl A,T
+#			addl S,E	roll 5,T	(movl C,S)	//
+#	(mov in[i],T)	(xorl B,S)	addl T,E	
+
+sub BODY_00_15
+{
+	local($n,$a,$b,$c,$d,$e)=@_;
+
+	&comment("00_15 $n");
+		&mov($S,$d) if ($n == 0);
+	&mov($T,&swtmp($n%16));		# Load Xi.
+		&xor($S,$c);		# Continue computing F() = d^(b&(c^d))
+		&and($S,$b);
+			&rotr($b,2);
+	&lea($e,&DWP(K1,$e,$T));	# Add Xi and K
+    if ($n < 15) {
+			&mov($T,$a);
+		&xor($S,$d);
+			&rotl($T,5);
+		&add($e,$S);
+		&mov($S,$c);		# Start of NEXT round's F()
+			&add($e,$T);
+    } else {
+	# This version provides the correct start for BODY_20_39
+	&mov($T,&swtmp(($n+1)%16));	# Start computing next Xi.
+		&xor($S,$d);
+	&xor($T,&swtmp(($n+3)%16));
+		&add($e,$S);		# Add F()
+			&mov($S,$a);	# Start computing a<<<5
+	&xor($T,&swtmp(($n+9)%16));
+			&rotl($S,5);
+    }
+
+}
+
+# The transition between rounds 15 and 16 will be a bit tricky... the best
+# thing to do is to delay the computation of a<<<5 one cycle and move it back
+# to the S register.  That way, T is free as early as possible.
+#	in[i]		F(b,c,d)	a<<<5
+#	(addl T+K,A)	(xorl E,S)	(movl B,T)
+#			movl D,S	(roll 5,T)	(addl S,A)	//
+#	mov in[i],T	xorl C,S	(addl T,A)
+#			andl B,S			rorl 2,B
+#	addl T+K,E 	xorl D,S			(movl in[1],T)
+#	(xor in[i+1],T)	addl S,E	movl A,S
+#	(xor in[i+2],T)	rorl 2,B	roll 5,S	//
+#	(xor in[i+3],T)	(movl C,S)	addl S,E
+
+
+# This starts just before starting to compute F(); the Xi should have XORed
+# the first three values together.  (Break is at //)
+#
+#	in[i]		F(b,c,d)	a<<<5
+#	mov in[i],T	(xorl E,S)			(addl T+K,A)
+#	xor in[i+1],T	(addl S,A)	(movl B,S)
+#	xor in[i+2],T			(roll 5,S)	//
+#	xor in[i+3],T 	movl D,S  	(addl S,A)
+#	roll 1,T	xorl C,S
+#	movl T,in[i]	andl B,S			rorl 2,B
+#	addl T+K,E 	xorl D,S			(mov in[i],T)
+#	(xor in[i+1],T)	addl S,E	movl A,S
+#	(xor in[i+2],T)			roll 5,S	//
+#	(xor in[i+3],T)	(movl C,S)	addl S,E
+
+sub BODY_16_19
+{
+	local($n,$a,$b,$c,$d,$e)=@_;
+
+	&comment("16_19 $n");
+
+	&xor($T,&swtmp(($n+13)%16));
+			&add($a,$S);	# End of previous round
+		&mov($S,$d);		# Start current round's F()
+	&rotl($T,1);
+		&xor($S,$c);
+	&mov(&swtmp($n%16),$T);		# Store computed Xi.
+		&and($S,$b);
+		&rotr($b,2);
+	&lea($e,&DWP(K1,$e,$T));	# Add Xi and K
+	&mov($T,&swtmp(($n+1)%16));	# Start computing next Xi.
+		&xor($S,$d);
+	&xor($T,&swtmp(($n+3)%16));
+		&add($e,$S);		# Add F()
+			&mov($S,$a);	# Start computing a<<<5
+	&xor($T,&swtmp(($n+9)%16));
+			&rotl($S,5);
+}
+
+# This is just like BODY_16_19, but computes a different F() = b^c^d
+#
+#	in[i]		F(b,c,d)	a<<<5
+#	mov in[i],T	(xorl E,S)			(addl T+K,A)
+#	xor in[i+1],T	(addl S,A)	(movl B,S)
+#	xor in[i+2],T			(roll 5,S)	//
+#	xor in[i+3],T 			(addl S,A)
+#	roll 1,T	movl C,S
+#	movl T,in[i]	xorl B,S			rorl 2,B
+#	addl T+K,E 	xorl C,S			(mov in[i],T)
+#	(xor in[i+1],T)	addl S,E	movl A,S
+#	(xor in[i+2],T)			roll 5,S	//
+#	(xor in[i+3],T)	(movl C,S)	addl S,E
+
+sub BODY_20_39	# And 60..79
+{
+	local($n,$a,$b,$c,$d,$e)=@_;
+	local $K=($n<40) ? K2 : K4;
+
+	&comment("20_39 $n");
+
+	&xor($T,&swtmp(($n+13)%16));
+			&add($a,$S);	# End of previous round
+	&rotl($T,1);
+		&mov($S,$d);		# Start current round's F()
+	&mov(&swtmp($n%16),$T) if ($n < 77);	# Store computed Xi.
+		&xor($S,$b);
+		&rotr($b,2);
+	&lea($e,&DWP($K,$e,$T));	# Add Xi and K
+	&mov($T,&swtmp(($n+1)%16)) if ($n < 79); # Start computing next Xi.
+		&xor($S,$c);
+	&xor($T,&swtmp(($n+3)%16)) if ($n < 79);
+		&add($e,$S);		# Add F()
+			&mov($S,$a);	# Start computing a<<<5
+	&xor($T,&swtmp(($n+9)%16)) if ($n < 79);
+			&rotl($S,5);
+
+			&add($e,$S) if ($n == 79);
+}
+
+
+# This starts immediately after the LEA, and expects to need to finish
+# the previous round. (break is at //)
+#
+#	in[i]		F1(c,d)		F2(b,c,d)	a<<<5
+#	(addl T+K,E)			(andl C,S)	(rorl 2,C)
+#	mov in[i],T			(addl S,A)	(movl B,S)
+#	xor in[i+1],T					(rorl 5,S)
+#	xor in[i+2],T /	movl D,S			(addl S,A)
+#	xor in[i+3],T	andl C,S
+#	rotl 1,T	addl S,E	movl D,S
+#	(movl T,in[i])                  xorl C,S
+#	addl T+K,E			andl B,S	rorl 2,B
+#	(mov in[i],T)			addl S,E	movl A,S
+#	(xor in[i+1],T)					rorl 5,S
+#	(xor in[i+2],T)	// (movl C,S)			addl S,E
+
+sub BODY_40_59
+{
+	local($n,$a,$b,$c,$d,$e)=@_;
+
+	&comment("40_59 $n");
+
+			&add($a,$S);	# End of previous round
+		&mov($S,$d);		# Start current round's F1()
+	&xor($T,&swtmp(($n+13)%16));
+		&and($S,$c);
+	&rotl($T,1);
+		&add($e,$S);		# Add F1()
+		&mov($S,$d);		# Start current round's F2()
+	&mov(&swtmp($n%16),$T);		# Store computed Xi.
+		&xor($S,$c);
+	&lea($e,&DWP(K3,$e,$T));
+		&and($S,$b);
+		&rotr($b,2);
+	&mov($T,&swtmp(($n+1)%16));	# Start computing next Xi.
+		&add($e,$S);		# Add F2()
+		&mov($S,$a);		# Start computing a<<<5
+	&xor($T,&swtmp(($n+3)%16));
+			&rotl($S,5);
+	&xor($T,&swtmp(($n+9)%16));
+}
+
+&function_begin("sha1_block_data_order",16);
+	&mov($S,&wparam(0));	# SHA_CTX *c
+	&mov($T,&wparam(1));	# const void *input
+	&mov($A,&wparam(2));	# size_t num
+	&stack_push(16);	# allocate X[16]
+	&shl($A,6);
+	&add($A,$T);
+	&mov(&wparam(2),$A);	# pointer beyond the end of input
+	&mov($E,&DWP(16,$S));# pre-load E
+
+	&set_label("loop",16);
+
+	# copy input chunk to X, but reversing byte order!
+	for ($i=0; $i<16; $i+=4)
+		{
+		&mov($A,&DWP(4*($i+0),$T));
+		&mov($B,&DWP(4*($i+1),$T));
+		&mov($C,&DWP(4*($i+2),$T));
+		&mov($D,&DWP(4*($i+3),$T));
+		&bswap($A);
+		&bswap($B);
+		&bswap($C);
+		&bswap($D);
+		&mov(&swtmp($i+0),$A);
+		&mov(&swtmp($i+1),$B);
+		&mov(&swtmp($i+2),$C);
+		&mov(&swtmp($i+3),$D);
+		}
+	&mov(&wparam(1),$T);	# redundant in 1st spin
+
+	&mov($A,&DWP(0,$S));	# load SHA_CTX
+	&mov($B,&DWP(4,$S));
+	&mov($C,&DWP(8,$S));
+	&mov($D,&DWP(12,$S));
+	# E is pre-loaded
+
+	for($i=0;$i<16;$i++)	{ &BODY_00_15($i,@V); unshift(@V,pop(@V)); }
+	for(;$i<20;$i++)	{ &BODY_16_19($i,@V); unshift(@V,pop(@V)); }
+	for(;$i<40;$i++)	{ &BODY_20_39($i,@V); unshift(@V,pop(@V)); }
+	for(;$i<60;$i++)	{ &BODY_40_59($i,@V); unshift(@V,pop(@V)); }
+	for(;$i<80;$i++)	{ &BODY_20_39($i,@V); unshift(@V,pop(@V)); }
+
+	(($V[4] eq $E) and ($V[0] eq $A)) or die;	# double-check
+
+	&comment("Loop trailer");
+
+	&mov($S,&wparam(0));	# re-load SHA_CTX*
+	&mov($T,&wparam(1));	# re-load input pointer
+
+	&add($A,&DWP(0,$S));	# E is last "A"...
+	&add($B,&DWP(4,$S));
+	&add($C,&DWP(8,$S));
+	&add($D,&DWP(12,$S));
+	&add($E,&DWP(16,$S));
+
+	&mov(&DWP(0,$S),$A);	# update SHA_CTX
+	 &add($T,64);		# advance input pointer
+	&mov(&DWP(4,$S),$B);
+	 &cmp($T,&wparam(2));	# have we reached the end yet?
+	&mov(&DWP(8,$S),$C);
+	&mov(&DWP(12,$S),$D);
+	&mov(&DWP(16,$S),$E);
+	&jb(&label("loop"));
+
+	&stack_pop(16);
+&function_end("sha1_block_data_order");
+&asciz("SHA1 block transform for x86, CRYPTOGAMS by <appro\@openssl.org>");
+
+&asm_finish();


* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
@ 2009-07-31 11:11   ` Erik Faye-Lund
  2009-07-31 11:31     ` George Spelvin
  2009-07-31 11:37     ` Michael J Gruber
  2009-07-31 11:21   ` Michael J Gruber
                     ` (7 subsequent siblings)
  8 siblings, 2 replies; 60+ messages in thread
From: Erik Faye-Lund @ 2009-07-31 11:11 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

On Fri, Jul 31, 2009 at 12:46 PM, George Spelvin<linux@horizon.com> wrote:
> I haven't packaged this nicely, but it's not that complicated.
> - Download Andy's original code from
>  http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
> - Unpack and cd to the cryptogams-0/x86 directory
> - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile",
>   and "sha1-x86.pl".
> - "make"

$ patch < ../sha1-opt.patch.eml
patching file `Makefile'
patching file `sha1test.c'
patching file `sha1-586.h'
patching file `sha1-x86.pl'

$ make
make: *** No rule to make target `sha1-586.o', needed by `586test'.  Stop.

What did I do wrong? :)
Would it be easier if you pushed it out somewhere?

-- 
Erik "kusma" Faye-Lund
kusmabite@gmail.com
(+47) 986 59 656


* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
  2009-07-31 11:11   ` Erik Faye-Lund
@ 2009-07-31 11:21   ` Michael J Gruber
  2009-07-31 11:26   ` Michael J Gruber
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Michael J Gruber @ 2009-07-31 11:21 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

George Spelvin venit, vidit, dixit 31.07.2009 12:46:
> After studying Andy Polyakov's optimized x86 SHA-1 in OpenSSL, I've
> got a version that's 1.6% slower on a P4 and 15% faster on a Phenom.
> I'm curious about the performance on other CPUs I don't have access to,
> particularly Core 2 duo and i7.
> 
> Could someone do some benchmarking for me?  Old (486/Pentium/P2/P3)
> machines are also interesting, but I'm optimizing for newer ones.
> 
> I haven't packaged this nicely, but it's not that complicated.
> - Download Andy's original code from
>   http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
> - Unpack and cd to the cryptogams-0/x86 directory
> - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile",
>    and "sha1-x86.pl".
> - "make" 
> - Run ./586test (before) and ./x86test (after) and note the timings.
> 
> Thank you!

Best of 6 runs:
./586test
Result:  a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
500000000 bytes: 1.642336 s
./x86test
Result:  a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
500000000 bytes: 1.532153 s

System:
uname -a
Linux localhost.localdomain 2.6.29.6-213.fc11.x86_64 #1 SMP Tue Jul 7
21:02:57 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 Duo CPU     T7500  @ 2.20GHz
stepping        : 11
cpu MHz         : 800.000
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor
ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm ida tpr_shadow vnmi
flexpriority
bogomips        : 4389.20
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 Duo CPU     T7500  @ 2.20GHz
stepping        : 11
cpu MHz         : 800.000
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor
ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm ida tpr_shadow vnmi
flexpriority
bogomips        : 4388.78
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:


* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
  2009-07-31 11:11   ` Erik Faye-Lund
  2009-07-31 11:21   ` Michael J Gruber
@ 2009-07-31 11:26   ` Michael J Gruber
  2009-07-31 12:31   ` Carlos R. Mafra
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Michael J Gruber @ 2009-07-31 11:26 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

George Spelvin venit, vidit, dixit 31.07.2009 12:46:
> After studying Andy Polyakov's optimized x86 SHA-1 in OpenSSL, I've
> got a version that's 1.6% slower on a P4 and 15% faster on a Phenom.
> I'm curious about the performance on other CPUs I don't have access to,
> particularly Core 2 duo and i7.
> 
> Could someone do some benchmarking for me?  Old (486/Pentium/P2/P3)
> machines are also interesting, but I'm optimizing for newer ones.
> 
> I haven't packaged this nicely, but it's not that complicated.
> - Download Andy's original code from
>   http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
> - Unpack and cd to the cryptogams-0/x86 directory
> - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile",
>    and "sha1-x86.pl".
> - "make" 
> - Run ./586test (before) and ./x86test (after) and note the timings.
> 
> Thank you!

Best of 6 runs:
./586test
Result:  a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
500000000 bytes: 1.258031 s
./x86test
Result:  a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
500000000 bytes: 1.171770 s

System:
uname -a
Linux whatever 2.6.22-14-generic #1 SMP Tue Feb 12 07:42:25 UTC 2008
i686 GNU/Linux
cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz
stepping        : 10
cpu MHz         : 2000.000
cache size      : 6144 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm
constant_tsc pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips        : 5988.92
clflush size    : 64

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz
stepping        : 10
cpu MHz         : 2000.000
cache size      : 6144 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm
constant_tsc pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips        : 5984.92
clflush size    : 64


* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 11:11   ` Erik Faye-Lund
@ 2009-07-31 11:31     ` George Spelvin
  2009-07-31 11:37     ` Michael J Gruber
  1 sibling, 0 replies; 60+ messages in thread
From: George Spelvin @ 2009-07-31 11:31 UTC (permalink / raw)
  To: kusmabite, linux; +Cc: git

> $ make
> make: *** No rule to make target `sha1-586.o', needed by `586test'.  Stop.
>
> What did I do wrong? :)
> Would it be easier if you pushed it out somewhere?

H'm.... It *should* do

perl sha1-586.pl elf > sha1-586.s
as --32  -o sha1-586.o sha1-586.s
gcc -m32 -W -Wall -Os -g -o 586test sha1test.c sha1-586.o
(And likewise for the "x86test" binary.)

which is what happened when I tested it.  Obviously I have something
non-portable in the Makefile.

You could try "make sha1-586.s" and "sha1-586.o" and see which rule is
f*ed up.

Um, you *are* in a directory which contains a sha1-586.pl file, right?


Thanks!


* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 11:11   ` Erik Faye-Lund
  2009-07-31 11:31     ` George Spelvin
@ 2009-07-31 11:37     ` Michael J Gruber
  2009-07-31 12:24       ` Erik Faye-Lund
  1 sibling, 1 reply; 60+ messages in thread
From: Michael J Gruber @ 2009-07-31 11:37 UTC (permalink / raw)
  To: Erik Faye-Lund; +Cc: George Spelvin, git

Erik Faye-Lund venit, vidit, dixit 31.07.2009 13:11:
> On Fri, Jul 31, 2009 at 12:46 PM, George Spelvin<linux@horizon.com> wrote:
>> I haven't packaged this nicely, but it's not that complicated.
>> - Download Andy's original code from
>>  http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
>> - Unpack and cd to the cryptogams-0/x86 directory
>> - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile",
>>   and "sha1-x86.pl".
>> - "make"
> 
> $ patch < ../sha1-opt.patch.eml
> patching file `Makefile'
> patching file `sha1test.c'
> patching file `sha1-586.h'
> patching file `sha1-x86.pl'
> 
> $ make
> make: *** No rule to make target `sha1-586.o', needed by `586test'.  Stop.
> 
> What did I do wrong? :)
> Would it be easier if you pushed it out somewhere?
> 

You need to go to the x86 directory, apply the patch and run make there.
(I made the same mistake.)  Also, you need the i586 (32-bit) glibc-devel
if you're on a 64-bit system.

Michael


* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 11:37     ` Michael J Gruber
@ 2009-07-31 12:24       ` Erik Faye-Lund
  2009-07-31 12:29         ` Johannes Schindelin
  2009-07-31 12:32         ` George Spelvin
  0 siblings, 2 replies; 60+ messages in thread
From: Erik Faye-Lund @ 2009-07-31 12:24 UTC (permalink / raw)
  To: Michael J Gruber; +Cc: George Spelvin, git

On Fri, Jul 31, 2009 at 1:37 PM, Michael J
Gruber<git@drmicha.warpmail.net> wrote:
>> What did I do wrong? :)
>
> You need to go to the x86 directory, apply the patch and run make there.
> > (I made the same mistake.)  Also, you need the i586 (32-bit) glibc-devel
> > if you're on a 64-bit system.

Aha, thanks :)

Now I'm getting a different error:
$ make
as   -o sha1-586.o sha1-586.s
sha1-586.s: Assembler messages:
sha1-586.s:4: Warning: .type pseudo-op used outside of .def/.endef ignored.
sha1-586.s:4: Error: junk at end of line, first unrecognized character is `s'
sha1-586.s:1438: Warning: .size pseudo-op used outside of .def/.endef ignored.
sha1-586.s:1438: Error: junk at end of line, first unrecognized character is `s'

make: *** [sha1-586.o] Error 1

What might be relevant is that I'm trying this on Windows (Vista,
64-bit).  I'd still think GNU as should be able to assemble the source,
though.  I've got an i7, so I thought the result might be interesting.

-- 
Erik "kusma" Faye-Lund
kusmabite@gmail.com
(+47) 986 59 656


* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 12:24       ` Erik Faye-Lund
@ 2009-07-31 12:29         ` Johannes Schindelin
  2009-07-31 12:32         ` George Spelvin
  1 sibling, 0 replies; 60+ messages in thread
From: Johannes Schindelin @ 2009-07-31 12:29 UTC (permalink / raw)
  To: Erik Faye-Lund; +Cc: Michael J Gruber, George Spelvin, git

Hi,

On Fri, 31 Jul 2009, Erik Faye-Lund wrote:

> On Fri, Jul 31, 2009 at 1:37 PM, Michael J
> Gruber<git@drmicha.warpmail.net> wrote:
> >> What did I do wrong? :)
> >
> > You need to go to the x86 directory, apply the patch and run make there.
> > (I made the same mistake.) Also, you i586 (32bit) glibc-devel if you're
> > on a 64 bit system.
> 
> Aha, thanks :)
> 
> Now I'm getting a different error:
> $ make
> as   -o sha1-586.o sha1-586.s
> sha1-586.s: Assembler messages:
> sha1-586.s:4: Warning: .type pseudo-op used outside of .def/.endef ignored.
> sha1-586.s:4: Error: junk at end of line, first unrecognized character is `s'
> sha1-586.s:1438: Warning: .size pseudo-op used outside of .def/.endef ignored.
> sha1-586.s:1438: Error: junk at end of line, first unrecognized character is `s'
> 
> make: *** [sha1-586.o] Error 1
> 
> What might be relevant, is that I'm trying this on Windows (Vista
> 64bit).

Probably using msysGit?  Then you're still using the 32-bit environment, 
as MSys is 32-bit only for now.

Ciao,
Dscho


* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
                     ` (2 preceding siblings ...)
  2009-07-31 11:26   ` Michael J Gruber
@ 2009-07-31 12:31   ` Carlos R. Mafra
  2009-07-31 13:27   ` Brian Ristuccia
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Carlos R. Mafra @ 2009-07-31 12:31 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

On Fri 31.Jul'09 at  6:46:02 -0400, George Spelvin wrote:
> - Run ./586test (before) and ./x86test (after) and note the timings.

For 8 runs on an Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz,

before: 1.75 +- 0.02
after: 1.62 +- 0.02


* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 12:24       ` Erik Faye-Lund
  2009-07-31 12:29         ` Johannes Schindelin
@ 2009-07-31 12:32         ` George Spelvin
  2009-07-31 12:45           ` Erik Faye-Lund
  1 sibling, 1 reply; 60+ messages in thread
From: George Spelvin @ 2009-07-31 12:32 UTC (permalink / raw)
  To: git, kusmabite; +Cc: git, linux

> Now I'm getting a different error:
> $ make
> as   -o sha1-586.o sha1-586.s
> sha1-586.s: Assembler messages:
> sha1-586.s:4: Warning: .type pseudo-op used outside of .def/.endef ignored.
> sha1-586.s:4: Error: junk at end of line, first unrecognized character is `s'
> sha1-586.s:1438: Warning: .size pseudo-op used outside of .def/.endef ignored.
> sha1-586.s:1438: Error: junk at end of line, first unrecognized character is `s'
> 
> make: *** [sha1-586.o] Error 1

> What might be relevant, is that I'm trying this on Windows (Vista
> 64bit). I'd still think GNU as should be able to assemble the source,
> though. I've got an i7, so I thought the result might be interesting.

Ah... which assembler?  The perl preprocessor supports multiple
output formats:
	elf     - Linux, FreeBSD, Solaris x86, etc.
	a.out   - DJGPP, elder OpenBSD, etc.
	coff    - GAS/COFF such as Win32 targets
	win32n  - Windows 95/Windows NT NASM format
	nw-nasm - NetWare NASM format
	nw-mwasm- NetWare Metrowerks Assembler

Maybe you need to replace "elf" with "coff"?
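
For scripting that choice, a rough host-to-format mapping might look like
the sketch below.  The mapping is my guess based on the list above, not
something the cryptogams build defines:

```shell
# Rough sketch: pick a cryptogams x86 output format from the host OS name.
pick_fmt() {
	case "$1" in
	MINGW*|MSYS*|CYGWIN*) echo coff ;;  # GAS/COFF such as Win32 targets
	*) echo elf ;;                      # Linux, FreeBSD, Solaris x86, etc.
	esac
}
# Regenerate the assembly with the chosen format:
# perl sha1-x86.pl "$(pick_fmt "$(uname -s)")" > sha1-x86.s
pick_fmt "$(uname -s)"
```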


* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 12:32         ` George Spelvin
@ 2009-07-31 12:45           ` Erik Faye-Lund
  2009-07-31 13:02             ` George Spelvin
  0 siblings, 1 reply; 60+ messages in thread
From: Erik Faye-Lund @ 2009-07-31 12:45 UTC (permalink / raw)
  To: George Spelvin; +Cc: git, git

On Fri, Jul 31, 2009 at 2:32 PM, George Spelvin<linux@horizon.com> wrote:
> Maybe you need to replace "elf" with "coff"?

That did the trick, thanks!

Best of 6 runs on an Intel Core i7 920 @ 2.67GHz:

before (586test): 1.415
after (x86test): 1.470

-- 
Erik "kusma" Faye-Lund
kusmabite@gmail.com
(+47) 986 59 656


* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 12:45           ` Erik Faye-Lund
@ 2009-07-31 13:02             ` George Spelvin
  0 siblings, 0 replies; 60+ messages in thread
From: George Spelvin @ 2009-07-31 13:02 UTC (permalink / raw)
  To: kusmabite, linux; +Cc: git, git

> That did the trick, thanks!
>
> Best of 6 runs on an Intel Core i7 920 @ 2.67GHz:
> 
> before (586test): 1.415
> after (x86test): 1.470

So it's slower.  Bummer. :-(  Obviously I have some work to do.

But thank you very much for the result!


* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
                     ` (3 preceding siblings ...)
  2009-07-31 12:31   ` Carlos R. Mafra
@ 2009-07-31 13:27   ` Brian Ristuccia
  2009-07-31 14:05     ` George Spelvin
  2009-07-31 13:27   ` Jakub Narebski
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 60+ messages in thread
From: Brian Ristuccia @ 2009-07-31 13:27 UTC (permalink / raw)
  To: git; +Cc: George Spelvin

The revised code is faster on Intel Atom N270 by around 15% (results below
typical of several runs):

$ ./586test ; ./x86test 
Result:  a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
500000000 bytes: 4.981760 s
Result:  a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
500000000 bytes: 4.323324 s

$ cat /proc/cpuinfo

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 28
model name	: Intel(R) Atom(TM) CPU N270   @ 1.60GHz
stepping	: 2
cpu MHz		: 800.000
cache size	: 512 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx constant_tsc
arch_perfmon pebs bts pni dtes64 monitor ds_cpl est tm2 ssse3 xtpr pdcm
lahf_lm
bogomips	: 3191.59
clflush size	: 64
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 28
model name	: Intel(R) Atom(TM) CPU N270   @ 1.60GHz
stepping	: 2
cpu MHz		: 800.000
cache size	: 512 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 1
initial apicid	: 1
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx constant_tsc
arch_perfmon pebs bts pni dtes64 monitor ds_cpl est tm2 ssse3 xtpr pdcm
lahf_lm
bogomips	: 3191.91
clflush size	: 64
power management:

-- 
Brian Ristuccia
brian@ristuccia.com


* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
                     ` (4 preceding siblings ...)
  2009-07-31 13:27   ` Brian Ristuccia
@ 2009-07-31 13:27   ` Jakub Narebski
  2009-07-31 15:05   ` Peter Harris
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 60+ messages in thread
From: Jakub Narebski @ 2009-07-31 13:27 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

"George Spelvin" <linux@horizon.com> writes:
> After studying Andy Polyakov's optimized x86 SHA-1 in OpenSSL, I've
> got a version that's 1.6% slower on a P4 and 15% faster on a Phenom.
> I'm curious about the performance on other CPUs I don't have access to,
> particularly Core 2 duo and i7.
> 
> Could someone do some benchmarking for me?  Old (486/Pentium/P2/P3)
> machines are also interesting, but I'm optimizing for newer ones.

----------
$ [time] ./586test 
Result:  a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
500000000 bytes: 5.376325 s

real    0m5.384s
user    0m5.108s
sys     0m0.008s

500000000 bytes: 5.367261 s

5.09user 0.00system 0:05.38elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+378minor)pagefaults 0swaps

----------
$ [time] ./x86test 
Result:  a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
500000000 bytes: 5.312238 s

real    0m5.325s
user    0m5.060s
sys     0m0.008s

500000000 bytes: 5.323783 s

5.06user 0.00system 0:05.34elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+377minor)pagefaults 0swaps

==========
System:
$ uname -a
Linux roke 2.6.14-11.1.aur.2 #1 Tue Jan 31 16:05:05 CET 2006 \
 i686 athlon i386 GNU/Linux

$ cat /proc/cpuinfo
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 6
model		: 4
model name	: AMD Athlon(tm) processor
stepping	: 2
cpu MHz		: 1000.188
cache size	: 256 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 mtrr pge \
 mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips	: 2002.43

$ free
             total       used       free     shared    buffers     cached
Mem:        515616     495812      19804          0       6004     103160
-/+ buffers/cache:     386648     128968
Swap:      1052248     279544     772704

-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 13:27   ` Brian Ristuccia
@ 2009-07-31 14:05     ` George Spelvin
  0 siblings, 0 replies; 60+ messages in thread
From: George Spelvin @ 2009-07-31 14:05 UTC (permalink / raw)
  To: brian, git; +Cc: linux

> The revised code is faster on Intel Atom N270 by around 15% (results below
> typical of several runs):
> 
> $ ./586test ; ./x86test 
> Result:  a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
> Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
> 500000000 bytes: 4.981760 s
> Result:  a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
> Expected:a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d
> 500000000 bytes: 4.323324 s

Cool, thanks!  I hadn't optimized it at all for Atom's
in-order pipe, so I'm pleasantly surprised.


* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
                     ` (5 preceding siblings ...)
  2009-07-31 13:27   ` Jakub Narebski
@ 2009-07-31 15:05   ` Peter Harris
  2009-07-31 15:22   ` Peter Harris
  2009-08-03  3:47   ` x86 SHA1: Faster than OpenSSL George Spelvin
  8 siblings, 0 replies; 60+ messages in thread
From: Peter Harris @ 2009-07-31 15:05 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

On Fri, Jul 31, 2009 at 6:46 AM, George Spelvin wrote:
> Could someone do some benchmarking for me?  Old (486/Pentium/P2/P3)
> machines are also interesting, but I'm optimizing for newer ones.

The new code appears to be marginally faster on a Pentium 3 Xeon:

Best of five runs:
586test: 11.658661 s
x86test: 11.209024 s

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 7
model name      : Pentium III (Katmai)
stepping        : 3
cpu MHz         : 547.630
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 mmx fxsr sse
bogomips        : 1097.12
clflush size    : 32
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 7
model name      : Pentium III (Katmai)
stepping        : 3
cpu MHz         : 547.630
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 mmx fxsr sse
bogomips        : 1095.37
clflush size    : 32
power management:


* Re: Request for benchmarking: x86 SHA1 code
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
                     ` (6 preceding siblings ...)
  2009-07-31 15:05   ` Peter Harris
@ 2009-07-31 15:22   ` Peter Harris
  2009-08-03  3:47   ` x86 SHA1: Faster than OpenSSL George Spelvin
  8 siblings, 0 replies; 60+ messages in thread
From: Peter Harris @ 2009-07-31 15:22 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

On Fri, Jul 31, 2009 at 6:46 AM, George Spelvin wrote:
> Could someone do some benchmarking for me?  Old (486/Pentium/P2/P3)
> machines are also interesting, but I'm optimizing for newer ones.

My Geode isn't old in age, but I admit it's old in design (and the
vendor switched to Atom right after I bought it...)

Best of three runs:

586test: 26.536597 s
x86test: 26.111148 s

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 5
model           : 10
model name      : Geode(TM) Integrated Processor by AMD PCS
stepping        : 2
cpu MHz         : 499.927
cache size      : 128 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu de pse tsc msr cx8 sep pge cmov clflush mmx
mmxext 3dnowext 3dnow
bogomips        : 1001.72
clflush size    : 32
power management:

Peter Harris


* x86 SHA1: Faster than OpenSSL
  2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
                     ` (7 preceding siblings ...)
  2009-07-31 15:22   ` Peter Harris
@ 2009-08-03  3:47   ` George Spelvin
  2009-08-03  7:36     ` Jonathan del Strother
                       ` (3 more replies)
  8 siblings, 4 replies; 60+ messages in thread
From: George Spelvin @ 2009-08-03  3:47 UTC (permalink / raw)
  To: git; +Cc: appro, appro, linux

(Work in progress, state dump to mailing list archives.)

This started when discussing git startup overhead due to the dynamic
linker.  One big contributor is the openssl library, which is used only
for its optimized x86 SHA-1 implementation.  So I took a look at it,
with an eye to importing the code directly into the git source tree,
and decided that I felt like trying to do better.

The original code was excellent, but it was optimized when the P4 was new.
After a bit of tweaking, I've inflicted a slight (1.4%) slowdown on the
P4, but a small-but-noticeable speedup on a variety of other processors.

Before      After       Gain    Processor
1.585248    1.353314	+17%	2500 MHz Phenom
3.249614    3.295619	-1.4%	1594 MHz P4
1.414512    1.352843	+4.5%	2.66 GHz i7
3.460635    3.284221	+5.4%	1596 MHz Athlon XP
4.077993    3.891826	+4.8%	1144 MHz Athlon
1.912161    1.623212	+17%	2100 MHz Athlon 64 X2
2.956432    2.940210	+0.55%	1794 MHz Mobile Celeron (fam 15 model 2)

(Seconds to hash 500x 1 MB, best of 10 runs in all cases.)

This is based on Andy Polyakov's GPL/BSD licensed cryptogams code, and
(for now) uses the same perl preprocessor.   To test it, do the following:
- Download Andy's original code from
  http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
- "tar xz cryptogams-0.tar.gz"
- "cd cryptogams-0/x86"
- "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile",
   and "sha1-x86.pl".
- "make" 
- Run ./586test (before) and ./x86test (after) and note the timings.
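
For the last step, a minimal best-of-N wrapper can automate "note the
timings".  This is only a sketch (it assumes GNU date with %N support),
not part of the patch:

```shell
# Sketch of a best-of-N timing wrapper; reports the minimum elapsed
# time in nanoseconds, matching the "best of N runs" protocol.
best_of() {
	n=$1; shift
	best= i=0
	while [ "$i" -lt "$n" ]; do
		t0=$(date +%s%N)
		"$@" >/dev/null 2>&1
		t1=$(date +%s%N)
		d=$((t1 - t0))
		if [ -z "$best" ] || [ "$d" -lt "$best" ]; then
			best=$d
		fi
		i=$((i + 1))
	done
	echo "$best"
}
# e.g.: best_of 10 ./586test ; best_of 10 ./x86test
```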

The code is currently only the core SHA transform.  Adding the appropriate
init/update/finish wrappers is straightforward.

An open question is how to add appropriate CPU detection to the git
build scripts.  (Note that `uname -m`, which it currently uses to select
the ARM code, does NOT produce the right answer if you're using a 32-bit
compiler on a 64-bit platform.)
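
One sketch of a probe (not what git's build actually does): ask the
compiler itself, which answers correctly even for a 32-bit compiler on a
64-bit kernel:

```shell
# Probe the compiler's target word size instead of trusting `uname -m`.
# Sketch only; running ./conftest would need a different approach when
# cross-compiling.
cat > conftest.c <<'EOF'
int main(void) { return sizeof(void *) == 8 ? 0 : 1; }
EOF
if ${CC:-cc} -o conftest conftest.c 2>/dev/null && ./conftest; then
	bits=64
else
	bits=32
fi
rm -f conftest conftest.c
echo "compiler targets ${bits}-bit code"
```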

I try to explain it in the comments, but with all the software pipelining
that makes the rounds overlap (and there are at least 4 different kinds
of rounds, which overlap with each other), it's a bit intricate.  If you
feel really masochistic, make commenting suggestions...

--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ Makefile	2009-08-02 06:44:44.000000000 -0400
@@ -0,0 +1,16 @@
+CC := gcc
+CFLAGS := -m32 -W -Wall -Os -g
+ASFLAGS := --32
+
+all: 586test x86test
+
+586test : sha1test.c sha1-586.o
+	$(CC) $(CFLAGS) -o $@ sha1test.c sha1-586.o
+
+x86test : sha1test.c sha1-x86.o
+	$(CC) $(CFLAGS) -o $@ sha1test.c sha1-x86.o
+
+586test x86test : sha1-586.h
+
+%.s : %.pl x86asm.pl x86unix.pl
+	perl $< elf > $@
--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ sha1-586.h	2009-08-02 06:44:17.000000000 -0400
@@ -0,0 +1,3 @@
+#include <stdint.h>
+
+void sha1_block_data_order(uint32_t iv[5], void const *in, unsigned len);
--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ sha1test.c	2009-08-02 08:27:48.449609504 -0400
@@ -0,0 +1,85 @@
+#include <stdint.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <sys/time.h>
+
+#include "sha1-586.h"
+
+#define SIZE 1000000
+
+#if SIZE % 64
+# error SIZE must be a multiple of 64!
+#endif
+
+static unsigned long
+timing_test(uint32_t iv[5], unsigned iter)
+{
+	unsigned i;
+	struct timeval tv0, tv1;
+	static char *p;	/* Very large buffer */
+
+	if (!p) {
+		p = malloc(SIZE);
+		if (!p) {
+			perror("malloc");
+			exit(1);
+		}
+	}
+
+	if (gettimeofday(&tv0, NULL) < 0) {
+		perror("gettimeofday");
+		exit(1);
+	}
+	for (i = 0; i < iter; i++)
+		sha1_block_data_order(iv, p, SIZE/64);
+	if (gettimeofday(&tv1, NULL) < 0) {
+		perror("gettimeofday");
+		exit(1);
+	}
+	return 1000000ul * (tv1.tv_sec-tv0.tv_sec) + tv1.tv_usec-tv0.tv_usec;
+}
+
+int
+main(void)
+{
+	uint32_t iv[5] = {
+		0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476, 0xc3d2e1f0
+	};
+	/* Simplest known-answer test, "abc" */
+	static uint8_t const testbuf[64] = {
+		'a','b','c', 0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 24
+	};
+	/* Expected: A9993E364706816ABA3E25717850C26C9CD0D89D */
+	static uint32_t const expected[5] = {
+		0xa9993e36, 0x4706816a, 0xba3e2571, 0x7850c26c, 0x9cd0d89d };
+	unsigned i;
+	unsigned long min_usec = -1ul;
+
+	/* quick correct-answer test.  silent unless successful. */
+	sha1_block_data_order(iv, testbuf, 1);
+	for (i = 0; i < 5; i++) {
+		if (iv[i] != expected[i]) {
+			printf("Result:  %08x %08x %08x %08x %08x\n"
+			       "Expected:%08x %08x %08x %08x %08x\n",
+				iv[0], iv[1], iv[2], iv[3], iv[4], expected[0],
+				expected[1], expected[2], expected[3],
+				expected[4]);
+			break;
+		}
+	}
+
+	for (i = 0; i < 10; i++) {
+		unsigned long usec = timing_test(iv, 500);
+		printf("%2u/10: %u.%06u s\n", i+1, (unsigned)(usec/1000000),
+			(unsigned)(usec % 1000000));
+		if (usec < min_usec)
+			min_usec = usec;
+	}
+	printf("Minimum time to hash %u bytes: %u.%06u\n", 
+		500 * SIZE, (unsigned)(min_usec/1000000),
+		(unsigned)(min_usec % 1000000));
+	return 0;
+}
--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ sha1-x86.pl	2009-08-02 08:51:01.069614130 -0400
@@ -0,0 +1,389 @@
+#!/usr/bin/env perl
+
+# ====================================================================
+# [Re]written by Andy Polyakov <appro@fy.chalmers.se> for the OpenSSL
+# project. The module is, however, dual licensed under OpenSSL and
+# CRYPTOGAMS licenses depending on where you obtain it. For further
+# details see http://www.openssl.org/~appro/cryptogams/.
+# ====================================================================
+
+# "[Re]written" was achieved in two major overhauls. In 2004 BODY_*
+# functions were re-implemented to address P4 performance issue [see
+# commentary below], and in 2006 the rest was rewritten in order to
+# gain freedom to liberate licensing terms.
+
+# It was noted that Intel IA-32 C compiler generates code which
+# performs ~30% *faster* on P4 CPU than original *hand-coded*
+# SHA1 assembler implementation. To address this problem (and
+# prove that humans are still better than machines:-), the
+# original code was overhauled, which resulted in the following
+# performance changes:
+#
+#		compared with original	compared with Intel cc
+#		assembler impl.		generated code
+# Pentium	-16%			+48%
+# PIII/AMD	+8%			+16%
+# P4		+85%(!)			+45%
+#
+# As you can see, Pentium came out the loser:-( Yet I reckoned that the
+# improvement on P4 outweighs the loss, and incorporated this
+# re-tuned code into 0.9.7 and later.
+# ----------------------------------------------------------------
+#					<appro@fy.chalmers.se>
+
+$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
+push(@INC,"${dir}","${dir}../../perlasm");
+require "x86asm.pl";
+
+&asm_init($ARGV[0],"sha1-586.pl",$ARGV[$#ARGV] eq "386");
+
+$A="eax";
+$B="ebx";
+$C="ecx";
+$D="edx";
+$E="ebp";
+
+# Two temporaries
+$S="edi";
+$T="esi";
+
+# The round constants
+use constant K1 => 0x5a827999;
+use constant K2 => 0x6ED9EBA1;
+use constant K3 => 0x8F1BBCDC;
+use constant K4 => 0xCA62C1D6;
+
+# Given unlimited registers and functional units, it would be possible to
+# compute SHA-1 at two cycles per round, using 7 operations per round.
+# Remember, each round computes a new value for E, which is used as A
+# in the following round and B in the round after that.  There are two
+# critical paths:
+# - A must be rotated and added to the output E (2 cycles between rounds)
+# - B must go through two boolean operations before being added to
+#   the result E.  Since this latter addition can't be done in the
+#   same-cycle as the critical addition of a<<<5, this is a total of
+#   2+1+1 = 4 cycles per 2 rounds.
+# Additionally, if you want to avoid copying B, you have to rotate it
+# soon after use in this round so it is ready for use as the following
+# round's C.
+
+# e += (a <<< 5) + K + in[i] + (d^(b&(c^d)))		(0..19)
+# e += (a <<< 5) + K + in[i] + (b^c^d)			(20..39, 60..79)
+# e += (a <<< 5) + K + in[i] + (c&d) + (b&(c^d))	(40..59)
+#
+# The hard part is doing this with only two temporary registers.
+# Taking the most complex F(b,c,d) function, writing it as above
+# breaks it into two parts which can be accumulated into e separately.
+# Let's divide it into 4 parts.  These can be executed in a 7-cycle
+# loop, assuming an in-order triple issue machine
+# (quadruple counting xor-from-memory as 2):
+#
+#	in[i]		F1(c,d)		F2(b,c,d)	a<<<5
+#	mov in[i],T			(addl S,A)	(movl B,S)
+#	xor in[i+1],T					(rorl 5,S)
+#	xor in[i+2],T	movl D,S			(addl S,A)
+#	xor in[i+3],T	andl C,S
+#	rotl 1,T	addl S,E	movl D,S
+#	movl T,in[i]			xorl C,S
+#	addl T+K,E			andl B,S	rorl 2,B	//
+#	(mov in[i],T)			addl S,E	movl A,S
+#	(xor in[i+1],T)					rorl 5,S
+#	(xor in[i+2],T)	(movl C,S)			addl S,E
+#
+# In the above, we routinely read and write a register on the same cycle,
+# overlapping the beginning of one computation with the end of another.
+# I've tried to place the reads to the left of the writes, but some of the
+# overlapping operations from adjacent rounds (given in parentheses)
+# violate that.
+#
+# The "addl T+K,E" line is actually implemented using the lea instruction,
+# which (on a Pentium) requires that neither T nor K was modified on the
+# previous cycle.
+#
+# As you can see, in the absence of out-of-order execution, the first
+# column takes a minimum of 6 cycles (fetch, 3 XORs, rotate, add to E),
+# and I reserve a seventh cycle before the add to E so that I can use a
+# Pentium's lea instruction.
+#
+# The other three columns take 3, 4 and 3 cycles, respectively.
+# These can all be overlapped by 1 cycle assuming a superscalar
+# processor, for a total of 2+2+3 = 7 cycles.
+#
+# The other F() functions require 5 and 4 cycles, respectively.
+# Overlapped with the 3-cycle a<<<5 computation, that makes a total of 6
+# and 5 cycles, respectively.  If we overlap the beginning and end of the
+# Xi computation, we can get it down to 6 cycles, but below that, we'll
+# just have to waste a cycle.
+#
+# For the first 16 rounds, forming Xi is just a fetch, and the F()
+# function only requires 5 cycles, so the whole round can be pulled in
+# to 4 cycles.
+
+
+# Atom pairing rules (not yet optimized):
+# The Atom has a dual-issue in-order pipeline, similar to the
+# original Pentium.  However, the issue restrictions are different.
+# In particular, all memory-source operations must use port 0,
+# as must all rotates.
+#
+# Given that a round uses 4 fetches and 3 rotates, that's going to
+# require significant care to pair well.  It may take a completely
+# different implementation.
+#
+# LEA must use port 1, but apparently it has even worse address generation
+# interlock latency than the Pentium.  Oh well, it's still the best way
+# to do a 3-way add with a 32-bit immediate.
+
+
+# The first 16 rounds use the simplest F(b,c,d) = d^(b&(c^d)), and
+# don't need to form the in[i] equation, letting us reduce the round to
+# 4 cycles, after some reassignment of temporaries:
+
+#			movl D,S	(roll 5,T)	(addl S,A)	//
+#	mov in[i],T	xorl C,S	(addl T,A)
+#			andl B,S			rorl 2,B
+#	addl T+K,E	xorl D,S	movl A,T
+#			addl S,E	roll 5,T	(movl C,S)	//
+#	(mov in[i],T)	(xorl B,S)	addl T,E	
+
+# The // mark where the round function starts.  Each round expects the
+# first copy of D to S to have been done already.
+
+# The transition between rounds 15 and 16 is a bit tricky... the best
+# thing to do is to delay the computation of a<<<5 one cycle and move it back
+# to the S register.  That way, T is free as early as possible.
+#	in[i]		F(b,c,d)	a<<<5
+#	(addl T+K,A)	(xorl E,S)	(movl B,T)
+#			movl D,S	(roll 5,T)	(addl S,A)	//
+#	mov in[i],T	xorl C,S	(addl T,A)
+#			andl B,S			rorl 2,B
+#	addl T+K,E 	xorl D,S			(movl in[1],T)
+#	(xor in[i+1],T)	addl S,E	movl A,S
+#	(xor in[i+2],T)	rorl 2,B	roll 5,S	//
+#	(xor in[i+3],T)	(movl C,S)	addl S,E
+
+sub BODY_00_15
+{
+	local($n,$a,$b,$c,$d,$e)=@_;
+
+	&comment("00_15 $n");
+		&mov($S,$d) if ($n == 0);
+	&mov($T,&swtmp($n%16));		#  V Load Xi.
+		&xor($S,$c);		# U  Continue F() = d^(b&(c^d))
+		&and($S,$b);		#  V
+			&rotr($b,2);	# NP
+	&lea($e,&DWP(K1,$e,$T));	# U  Add Xi and K
+    if ($n < 15) {
+			&mov($T,$a);	#  V
+		&xor($S,$d);		# U 
+			&rotl($T,5);	# NP
+		&add($e,$S);		# U 
+		&mov($S,$c);		#  V Start of NEXT round's F()
+			&add($e,$T);	# U 
+    } else {
+	# This version provides the correct start for BODY_20_39
+		&xor($S,$d);		#  V
+	&mov($T,&swtmp(($n+1)%16));	# U  Start computing next Xi.
+		&add($e,$S);		#  V Add F()
+			&mov($S,$a);	# U  Start computing a<<<5
+	&xor($T,&swtmp(($n+3)%16));	#  V
+			&rotl($S,5);	# U 
+	&xor($T,&swtmp(($n+9)%16));	#  V
+    }
+}
+
+
+# A full round using F(b,c,d) = b^c^d.  6 cycles of dependency chain.
+# This starts just before starting to compute F(); the Xi should have XORed
+# the first three values together.  (Break is at //)
+#
+#	in[i]		F(b,c,d)	a<<<5
+#	mov in[i],T	(xorl E,S)			(addl T+K,A)
+#	xor in[i+1],T	(addl S,A)	(movl B,S)
+#	xor in[i+2],T			(roll 5,S)	//
+#	xor in[i+3],T 	movl D,S  	(addl S,A)
+#	roll 1,T	xorl C,S
+#	movl T,in[i]	andl B,S			rorl 2,B
+#	addl T+K,E 	xorl D,S			(mov in[i],T)
+#	(xor in[i+1],T)	addl S,E	movl A,S
+#	(xor in[i+2],T)			roll 5,S	//
+#	(xor in[i+3],T)	(movl C,S)	addl S,E
+
+sub BODY_16_19
+{
+	local($n,$a,$b,$c,$d,$e)=@_;
+
+	&comment("16_19 $n");
+
+	&xor($T,&swtmp(($n+13)%16));	# U 
+			&add($a,$S);	#  V End of previous round
+		&mov($S,$d);		# U  Start current round's F()
+	&rotl($T,1);			#  V
+		&xor($S,$c);		# U 
+	&mov(&swtmp($n%16),$T);		# U Store computed Xi.
+		&and($S,$b);		#  V
+		&rotr($b,2);		# NP
+	&lea($e,&DWP(K1,$e,$T));	# U  Add Xi and K
+	&mov($T,&swtmp(($n+1)%16));	#  V Start computing next Xi.
+		&xor($S,$d);		# U
+	&xor($T,&swtmp(($n+3)%16));	#  V
+		&add($e,$S);		# U  Add F()
+			&mov($S,$a);	#  V Start computing a<<<5
+	&xor($T,&swtmp(($n+9)%16));	# U
+			&rotl($S,5);	# NP
+}
+
+# This is just like BODY_16_19, but computes a different F() = b^c^d
+#
+#	in[i]		F(b,c,d)	a<<<5
+#	mov in[i],T	(xorl E,S)			(addl T+K,A)
+#	xor in[i+1],T	(addl S,A)	(movl B,S)
+#	xor in[i+2],T			(roll 5,S)	//
+#	xor in[i+3],T 			(addl S,A)
+#	roll 1,T	movl D,S
+#	movl T,in[i]	xorl B,S			rorl 2,B
+#	addl T+K,E 	xorl C,S			(mov in[i],T)
+#	(xor in[i+1],T)	addl S,E	movl A,S
+#	(xor in[i+2],T)			roll 5,S	//
+#	(xor in[i+3],T)	(movl C,S)	addl S,E
+
+sub BODY_20_39	# And 60..79
+{
+	local($n,$a,$b,$c,$d,$e)=@_;
+	local $K=($n<40) ? K2 : K4;
+
+	&comment("20_39 $n");
+
+	&xor($T,&swtmp(($n+13)%16));	# U 
+			&add($a,$S);	#  V End of previous round
+	&rotl($T,1);			# U
+		&mov($S,$d);		#  V Start current round's F()
+	&mov(&swtmp($n%16),$T) if ($n < 77);	# Store computed Xi.
+		&xor($S,$b);		#  V
+		&rotr($b,2);		# NP
+	&lea($e,&DWP($K,$e,$T));	# U  Add Xi and K
+	&mov($T,&swtmp(($n+1)%16)) if ($n < 79); # Start computing next Xi.
+		&xor($S,$c);		# U
+	&xor($T,&swtmp(($n+3)%16)) if ($n < 79);
+		&add($e,$S);		# U  Add F()
+			&mov($S,$a);	#  V Start computing a<<<5
+	&xor($T,&swtmp(($n+9)%16)) if ($n < 79);
+			&rotl($S,5);	# NP
+
+			&add($e,$S) if ($n == 79);
+}
+
+
+# This starts immediately after the LEA, and expects to need to finish
+# the previous round. (break is at //)
+#
+#	in[i]		F1(c,d)		F2(b,c,d)	a<<<5
+#	(addl T+K,E)			(andl C,S)	(rorl 2,C)
+#	mov in[i],T			(addl S,A)	(movl B,S)
+#	xor in[i+1],T					(rorl 5,S)
+#	xor in[i+2],T /	movl D,S			(addl S,A)
+#	xor in[i+3],T	andl C,S
+#	rotl 1,T	addl S,E	movl D,S
+#	(movl T,in[i])			xorl C,S
+#	addl T+K,E			andl B,S	rorl 2,B
+#	(mov in[i],T)			addl S,E	movl A,S
+#	(xor in[i+1],T)					rorl 5,S
+#	(xor in[i+2],T)	// (movl C,S)			addl S,E
+
+sub BODY_40_59
+{
+	local($n,$a,$b,$c,$d,$e)=@_;
+
+	&comment("40_59 $n");
+
+			&add($a,$S);	#  V End of previous round
+		&mov($S,$d);		# U  Start current round's F(1)
+	&xor($T,&swtmp(($n+13)%16));	#  V
+		&and($S,$c);		# U
+	&rotl($T,1);			# U	XXX Missed pairing
+		&add($e,$S);		#  V Add F1()
+		&mov($S,$d);		# U  Start current round's F2()
+	&mov(&swtmp($n%16),$T);		#  V Store computed Xi.
+		&xor($S,$c);		# U
+	&lea($e,&DWP(K3,$e,$T));	#  V
+		&and($S,$b);		# U	XXX Missed pairing
+		&rotr($b,2);		# NP
+	&mov($T,&swtmp(($n+1)%16));	# U  Start computing next Xi.
+		&add($e,$S);		#  V Add F2()
+		&mov($S,$a);		# U  Start computing a<<<5
+	&xor($T,&swtmp(($n+3)%16));	#  V
+			&rotl($S,5);	# NP
+	&xor($T,&swtmp(($n+9)%16));	# U
+}
+# The above code is NOT optimally paired for the Pentium.  (And thus,
+# presumably, Atom, which has a very similar dual-issue in-order pipeline.)
+# However, attempts to improve it make it slower on Phenom & i7.
+
+&function_begin("sha1_block_data_order",16);
+
+	local @V = ($A,$B,$C,$D,$E);
+	local @W = ($A,$B,$C);
+
+	&mov($S,&wparam(0));	# SHA_CTX *c
+	&mov($T,&wparam(1));	# const void *input
+	&mov($A,&wparam(2));	# size_t num
+	&stack_push(16);	# allocate X[16]
+	&shl($A,6);
+	&add($A,$T);
+	&mov(&wparam(2),$A);	# pointer beyond the end of input
+	&mov($E,&DWP(16,$S));# pre-load E
+	&mov($D,&DWP(12,$S));# pre-load D
+
+	&set_label("loop",16);
+
+	# copy input chunk to X, but reversing byte order!
+	&mov($W[2],&DWP(4*(0),$T));
+	&mov($W[1],&DWP(4*(1),$T));
+	&bswap($W[2]);
+	for ($i=0; $i<14; $i++) {
+		&mov($W[0],&DWP(4*($i+2),$T));
+		&bswap($W[1]);
+		&mov(&swtmp($i+0),$W[2]);
+		unshift(@W,pop(@W));
+	}
+	&bswap($W[1]);
+	&mov(&swtmp($i+0),$W[2]);
+	&mov(&swtmp($i+1),$W[1]);
+
+	&mov(&wparam(1),$T);	# redundant in 1st spin
+
+	# Reload A, B and C, which we use as temporaries in the copying
+	&mov($C,&DWP(8,$S));
+	&mov($B,&DWP(4,$S));
+	&mov($A,&DWP(0,$S));
+
+	for($i=0;$i<16;$i++)	{ &BODY_00_15($i,@V); unshift(@V,pop(@V)); }
+	for(;$i<20;$i++)	{ &BODY_16_19($i,@V); unshift(@V,pop(@V)); }
+	for(;$i<40;$i++)	{ &BODY_20_39($i,@V); unshift(@V,pop(@V)); }
+	for(;$i<60;$i++)	{ &BODY_40_59($i,@V); unshift(@V,pop(@V)); }
+	for(;$i<80;$i++)	{ &BODY_20_39($i,@V); unshift(@V,pop(@V)); }
+
+	(($V[4] eq $E) and ($V[0] eq $A)) or die;	# double-check
+
+	&comment("Loop trailer");
+
+	&mov($S,&wparam(0));	# re-load SHA_CTX*
+	&mov($T,&wparam(1));	# re-load input pointer
+
+	&add($E,&DWP(16,$S));
+	&add($D,&DWP(12,$S));
+	&add(&DWP(8,$S),$C);
+	&add(&DWP(4,$S),$B);
+	 &add($T,64);		# advance input pointer
+	&add(&DWP(0,$S),$A);
+	&mov(&DWP(12,$S),$D);
+	&mov(&DWP(16,$S),$E);
+
+	&cmp($T,&wparam(2));	# have we reached the end yet?
+	&jb(&label("loop"));
+
+	&stack_pop(16);
+&function_end("sha1_block_data_order");
+&asciz("SHA1 block transform for x86, CRYPTOGAMS by <appro\@openssl.org>");
+
+&asm_finish();
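
Incidentally, the expected digest hard-coded in sha1test.c for the "abc"
known-answer test can be cross-checked against coreutils:

```shell
# Cross-check sha1test.c's expected value for the "abc" test vector.
printf 'abc' | sha1sum
# a9993e364706816aba3e25717850c26c9cd0d89d  -
```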


* Re: x86 SHA1: Faster than OpenSSL
  2009-08-03  3:47   ` x86 SHA1: Faster than OpenSSL George Spelvin
@ 2009-08-03  7:36     ` Jonathan del Strother
  2009-08-04  1:40     ` Mark Lodato
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 60+ messages in thread
From: Jonathan del Strother @ 2009-08-03  7:36 UTC (permalink / raw)
  To: George Spelvin; +Cc: git, appro, appro

On Mon, Aug 3, 2009 at 4:47 AM, George Spelvin<linux@horizon.com> wrote:
> (Work in progress, state dump to mailing list archives.)
>
> This started when discussing git startup overhead due to the dynamic
> linker.  One big contributor is the openssl library, which is used only
> for its optimized x86 SHA-1 implementation.  So I took a look at it,
> with an eye to importing the code directly into the git source tree,
> and decided that I felt like trying to do better.
>

FWIW, this doesn't work on OS X / Darwin.  'as' doesn't take a --32
flag, it takes an -arch i386 flag.  After changing that, I get:

as -arch i386  -o sha1-586.o sha1-586.s
sha1-586.s:4:Unknown pseudo-op: .type
sha1-586.s:4:Rest of line ignored. 1st junk character valued 115 (s).
sha1-586.s:5:Alignment too large: 15. assumed.
sha1-586.s:19:Alignment too large: 15. assumed.
sha1-586.s:1438:Unknown pseudo-op: .size
sha1-586.s:1438:Rest of line ignored. 1st junk character valued 115 (s).
make: *** [sha1-586.o] Error 1

- at which point I have no idea how to fix it.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-03  3:47   ` x86 SHA1: Faster than OpenSSL George Spelvin
  2009-08-03  7:36     ` Jonathan del Strother
@ 2009-08-04  1:40     ` Mark Lodato
  2009-08-04  2:30     ` Linus Torvalds
  2009-08-18 21:26     ` Andy Polyakov
  3 siblings, 0 replies; 60+ messages in thread
From: Mark Lodato @ 2009-08-04  1:40 UTC (permalink / raw)
  To: George Spelvin; +Cc: git, appro, appro

On Sun, Aug 2, 2009 at 11:47 PM, George Spelvin<linux@horizon.com> wrote:
> Before      After       Gain    Processor
> 1.585248    1.353314    +17%    2500 MHz Phenom
> 3.249614    3.295619    -1.4%   1594 MHz P4
> 1.414512    1.352843    +4.5%   2.66 GHz i7
> 3.460635    3.284221    +5.4%   1596 MHz Athlon XP
> 4.077993    3.891826    +4.8%   1144 MHz Athlon
> 1.912161    1.623212    +17%    2100 MHz Athlon 64 X2
> 2.956432    2.940210    +0.55%  1794 MHz Mobile Celeron (fam 15 model 2)
>
> (Seconds to hash 500x 1 MB, best of 10 runs in all cases.)
>
> This is based on Andy Polyakov's GPL/BSD licensed cryptogams code, and
> (for now) uses the same perl preprocessor.   To test it, do the following:
> - Download Andy's original code from
>  http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
> - "tar xzf cryptogams-0.tar.gz"
> - "cd cryptogams-0/x86"
> - "patch < this_email" to create "sha1test.c", "sha1-586.h", "Makefile",
>   and "sha1-x86.pl".
> - "make"
> - Run ./586test (before) and ./x86test (after) and note the timings.

Note, to compile this on Ubuntu x86-64, I had to:
$ sudo apt-get install libc6-dev-i386

$ ./586test
 1/10: 2.016621 s
 2/10: 2.030742 s
 3/10: 2.027333 s
 4/10: 2.024018 s
 5/10: 2.022306 s
 6/10: 2.022418 s
 7/10: 2.047103 s
 8/10: 2.035467 s
 9/10: 2.032237 s
10/10: 2.029231 s
Minimum time to hash 500000000 bytes: 2.016621
$ ./x86test
 1/10: 1.818661 s
 2/10: 1.814856 s
 3/10: 1.816232 s
 4/10: 1.815208 s
 5/10: 1.834047 s
 6/10: 1.843020 s
 7/10: 1.819564 s
 8/10: 1.815560 s
 9/10: 1.824232 s
10/10: 1.820943 s
Minimum time to hash 500000000 bytes: 1.814856
$ python -c 'print 2.016621 / 1.814856'
1.11117410968
$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU          6300  @ 1.86GHz
stepping        : 2
cpu MHz         : 1861.825
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant
_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3
cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips        : 3723.65
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU          6300  @ 1.86GHz
stepping        : 2
cpu MHz         : 1861.825
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant
_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3
cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips        : 3724.01
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:



I imagine that you can get a bigger speedup by making a 64-bit version
(but maybe not).  Either way, it would be nice if x86-64 users did not
have to install an additional package to compile.

Cheers,
Mark

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-03  3:47   ` x86 SHA1: Faster than OpenSSL George Spelvin
  2009-08-03  7:36     ` Jonathan del Strother
  2009-08-04  1:40     ` Mark Lodato
@ 2009-08-04  2:30     ` Linus Torvalds
  2009-08-04  2:51       ` Linus Torvalds
  2009-08-04  4:48       ` George Spelvin
  2009-08-18 21:26     ` Andy Polyakov
  3 siblings, 2 replies; 60+ messages in thread
From: Linus Torvalds @ 2009-08-04  2:30 UTC (permalink / raw)
  To: George Spelvin; +Cc: git, appro, appro



On Sun, 2 Aug 2009, George Spelvin wrote:
> 
> The original code was excellent, but it was optimized when the P4 was new.
> After a bit of tweaking, I've inflicted a slight (1.4%) slowdown on the
> P4, but a small-but-noticeable speedup on a variety of other processors.
> 
> Before      After       Gain    Processor
> 1.585248    1.353314	+17%	2500 MHz Phenom
> 3.249614    3.295619	-1.4%	1594 MHz P4
> 1.414512    1.352843	+4.5%	2.66 GHz i7
> 3.460635    3.284221	+5.4%	1596 MHz Athlon XP
> 4.077993    3.891826	+4.8%	1144 MHz Athlon
> 1.912161    1.623212	+17%	2100 MHz Athlon 64 X2
> 2.956432    2.940210	+0.55%	1794 MHz Mobile Celeron (fam 15 model 2)

It would be better to have a more git-centric benchmark that actually 
shows some real git load, rather than a sha1-only microbenchmark.

The thing that I'd prefer is simply

	git fsck --full

on the Linux kernel archive. For me (with a fast machine), it takes about 
4m30s with the OpenSSL SHA1, and takes 6m40s with the Mozilla SHA1 (ie 
using a NO_OPENSSL=1 build).

So that's an example of a load that is actually very sensitive to SHA1 
performance (more so than _most_ git loads, I suspect), and at the same 
time is a real git load rather than some SHA1-only microbenchmark. It also 
shows very clearly why we default to the OpenSSL version over the Mozilla 
one.

NOTE! I didn't do multiple runs to see how stable the numbers are, and 
so it's possible that I exaggerated the OpenSSL advantage over the 
Mozilla-SHA1 code. Or vice versa. My point is really only that I don't 
know how meaningful a "500 x 1 MB SHA1" benchmark is, while I know that a 
"git fsck" benchmark has at least _some_ real life value.

		Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  2:30     ` Linus Torvalds
@ 2009-08-04  2:51       ` Linus Torvalds
  2009-08-04  3:07         ` Jon Smirl
  2009-08-18 21:50         ` Andy Polyakov
  2009-08-04  4:48       ` George Spelvin
  1 sibling, 2 replies; 60+ messages in thread
From: Linus Torvalds @ 2009-08-04  2:51 UTC (permalink / raw)
  To: George Spelvin; +Cc: git, appro, appro



On Mon, 3 Aug 2009, Linus Torvalds wrote:
> 
> The thing that I'd prefer is simply
> 
> 	git fsck --full
> 
> on the Linux kernel archive. For me (with a fast machine), it takes about 
> 4m30s with the OpenSSL SHA1, and takes 6m40s with the Mozilla SHA1 (ie 
> using a NO_OPENSSL=1 build).
> 
> So that's an example of a load that is actually very sensitive to SHA1 
> performance (more so than _most_ git loads, I suspect), and at the same 
> time is a real git load rather than some SHA1-only microbenchmark. It also 
> shows very clearly why we default to the OpenSSL version over the Mozilla 
> one.

"perf report --sort comm,dso,symbol" profiling shows the following for 
'git fsck --full' on the kernel repo, using the Mozilla SHA1:

    47.69%               git  /home/torvalds/git/git     [.] moz_SHA1_Update
    22.98%               git  /lib64/libz.so.1.2.3       [.] inflate_fast
     7.32%               git  /lib64/libc-2.10.1.so      [.] __GI_memcpy
     4.66%               git  /lib64/libz.so.1.2.3       [.] inflate
     3.76%               git  /lib64/libz.so.1.2.3       [.] adler32
     2.86%               git  /lib64/libz.so.1.2.3       [.] inflate_table
     2.41%               git  /home/torvalds/git/git     [.] lookup_object
     1.31%               git  /lib64/libc-2.10.1.so      [.] _int_malloc
     0.84%               git  /home/torvalds/git/git     [.] patch_delta
     0.78%               git  [kernel]                   [k] hpet_next_event

so yeah, SHA1 performance matters. Judging by the OpenSSL numbers, the 
OpenSSL SHA1 implementation must be about twice as fast as the C version 
we use.

That said, under "normal" git usage models, the SHA1 costs are almost 
invisible. So git-fsck is definitely a fairly unusual case that stresses 
the SHA1 performance more than most git loads.

		Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  2:51       ` Linus Torvalds
@ 2009-08-04  3:07         ` Jon Smirl
  2009-08-04  5:01           ` George Spelvin
  2009-08-18 21:50         ` Andy Polyakov
  1 sibling, 1 reply; 60+ messages in thread
From: Jon Smirl @ 2009-08-04  3:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: George Spelvin, git, appro, appro

On Mon, Aug 3, 2009 at 10:51 PM, Linus
Torvalds<torvalds@linux-foundation.org> wrote:
>
>
> On Mon, 3 Aug 2009, Linus Torvalds wrote:
>>
>> The thing that I'd prefer is simply
>>
>>       git fsck --full
>>
>> on the Linux kernel archive. For me (with a fast machine), it takes about
>> 4m30s with the OpenSSL SHA1, and takes 6m40s with the Mozilla SHA1 (ie
>> using a NO_OPENSSL=1 build).
>>
>> So that's an example of a load that is actually very sensitive to SHA1
>> performance (more so than _most_ git loads, I suspect), and at the same
>> time is a real git load rather than some SHA1-only microbenchmark. It also
>> shows very clearly why we default to the OpenSSL version over the Mozilla
>> one.
>
> "perf report --sort comm,dso,symbol" profiling shows the following for
> 'git fsck --full' on the kernel repo, using the Mozilla SHA1:
>
>    47.69%               git  /home/torvalds/git/git     [.] moz_SHA1_Update
>    22.98%               git  /lib64/libz.so.1.2.3       [.] inflate_fast
>     7.32%               git  /lib64/libc-2.10.1.so      [.] __GI_memcpy
>     4.66%               git  /lib64/libz.so.1.2.3       [.] inflate
>     3.76%               git  /lib64/libz.so.1.2.3       [.] adler32
>     2.86%               git  /lib64/libz.so.1.2.3       [.] inflate_table
>     2.41%               git  /home/torvalds/git/git     [.] lookup_object
>     1.31%               git  /lib64/libc-2.10.1.so      [.] _int_malloc
>     0.84%               git  /home/torvalds/git/git     [.] patch_delta
>     0.78%               git  [kernel]                   [k] hpet_next_event
>
> so yeah, SHA1 performance matters. Judging by the OpenSSL numbers, the
> OpenSSL SHA1 implementation must be about twice as fast as the C version
> we use.

Would there happen to be a SHA1 implementation around that can compute
the SHA1 without first decompressing the data? Databases gain a lot of
speed by using special algorithms that can directly operate on the
compressed data.

>
> That said, under "normal" git usage models, the SHA1 costs are almost
> invisible. So git-fsck is definitely a fairly unusual case that stresses
> the SHA1 performance more than most git loads.
>
>                Linus



-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  2:30     ` Linus Torvalds
  2009-08-04  2:51       ` Linus Torvalds
@ 2009-08-04  4:48       ` George Spelvin
  2009-08-04  6:30         ` Linus Torvalds
  2009-08-04  6:40         ` Linus Torvalds
  1 sibling, 2 replies; 60+ messages in thread
From: George Spelvin @ 2009-08-04  4:48 UTC (permalink / raw)
  To: torvalds; +Cc: git, linux

> It would be better to have a more git-centric benchmark that actually 
> shows some real git load, rather than a sha1-only microbenchmark.
> 
> The thing that I'd prefer is simply
>
>	git fsck --full
>
> on the Linux kernel archive. For me (with a fast machine), it takes about 
> 4m30s with the OpenSSL SHA1, and takes 6m40s with the Mozilla SHA1 (ie 
> using a NO_OPENSSL=1 build).

The actual goal of this effort is to address the dynamic linker startup
time issues by removing the second-largest contributor after libcurl,
namely openssl.  Optimizing the assembly code is just the fun part. ;-)

Anyway, on the git repository:

[1273]$ time x/git-fsck --full			(New SHA1 code)
dangling tree 524973049a7e4593df4af41e0564912f678a41ac
dangling tree 7da7d73185a1df5c2a477d2ee5599ac8a58cad56

real    0m59.306s
user    0m58.760s
sys     0m0.550s
[1274]$ time ./git-fsck --full			(OpenSSL)
dangling tree 524973049a7e4593df4af41e0564912f678a41ac
dangling tree 7da7d73185a1df5c2a477d2ee5599ac8a58cad56

real    1m0.364s
user    0m59.970s
sys     0m0.400s

1.6% is a pretty minor difference, especially as the machine was running
a backup at the time (but it's a quad-core with near-zero CPU usage;
the backup's activity is all I/O).

On the full Linux repository, I repacked it first to make sure that
everything was in RAM, and I have the first result:

[517]$ time ~/git/x/git-fsck --full		(New SHA1 code)

real    10m12.702s
user    9m48.410s
sys     0m23.350s
[518]$ time ~/git/git-fsck --full		(OpenSSL)

real    10m26.083s
user    10m2.800s
sys     0m22.000s

Again, 2.2% is not a huge improvement.  But my only goal was not to be worse.

> So that's an example of a load that is actually very sensitive to SHA1 
> performance (more so than _most_ git loads, I suspect), and at the same 
> time is a real git load rather than some SHA1-only microbenchmark. It also 
> shows very clearly why we default to the OpenSSL version over the Mozilla 
> one.

I wasn't questioning *that*.  As I said, I was just doing the fun part
of importing a heavily-optimized OpenSSL-like SHA1 implementation into
the git source tree.

(The un-fun part is modifying the build process to detect the target
processor and include the right asm automatically.)

Anyway, if you want to test it, here's a crude x86_32-only patch to the
git tree.  "make NO_OPENSSL=1" to use the new code.

diff --git a/Makefile b/Makefile
index daf4296..8531c39 100644
--- a/Makefile
+++ b/Makefile
@@ -1176,8 +1176,10 @@ ifdef ARM_SHA1
 	LIB_OBJS += arm/sha1.o arm/sha1_arm.o
 else
 ifdef MOZILLA_SHA1
-	SHA1_HEADER = "mozilla-sha1/sha1.h"
-	LIB_OBJS += mozilla-sha1/sha1.o
+#	SHA1_HEADER = "mozilla-sha1/sha1.h"
+#	LIB_OBJS += mozilla-sha1/sha1.o
+	SHA1_HEADER = "x86/sha1.h"
+	LIB_OBJS += x86/sha1.o x86/sha1-x86.o
 else
 	SHA1_HEADER = <openssl/sha.h>
 	EXTLIBS += $(LIB_4_CRYPTO)
diff --git a/x86/sha1-x86.s b/x86/sha1-x86.s
new file mode 100644
index 0000000..96796d4
--- /dev/null
+++ b/x86/sha1-x86.s
@@ -0,0 +1,1372 @@
+.file	"sha1-586.s"
+.text
+.globl	sha1_block_data_order
+.type	sha1_block_data_order,@function
+.align	16
+sha1_block_data_order:
+	pushl	%ebp
+	pushl	%ebx
+	pushl	%esi
+	pushl	%edi
+	movl	20(%esp),%edi
+	movl	24(%esp),%esi
+	movl	28(%esp),%eax
+	subl	$64,%esp
+	shll	$6,%eax
+	addl	%esi,%eax
+	movl	%eax,92(%esp)
+	movl	16(%edi),%ebp
+	movl	12(%edi),%edx
+.align	16
+.L000loop:
+	movl	(%esi),%ecx
+	movl	4(%esi),%ebx
+	bswap	%ecx
+	movl	8(%esi),%eax
+	bswap	%ebx
+	movl	%ecx,(%esp)
+	movl	12(%esi),%ecx
+	bswap	%eax
+	movl	%ebx,4(%esp)
+	movl	16(%esi),%ebx
+	bswap	%ecx
+	movl	%eax,8(%esp)
+	movl	20(%esi),%eax
+	bswap	%ebx
+	movl	%ecx,12(%esp)
+	movl	24(%esi),%ecx
+	bswap	%eax
+	movl	%ebx,16(%esp)
+	movl	28(%esi),%ebx
+	bswap	%ecx
+	movl	%eax,20(%esp)
+	movl	32(%esi),%eax
+	bswap	%ebx
+	movl	%ecx,24(%esp)
+	movl	36(%esi),%ecx
+	bswap	%eax
+	movl	%ebx,28(%esp)
+	movl	40(%esi),%ebx
+	bswap	%ecx
+	movl	%eax,32(%esp)
+	movl	44(%esi),%eax
+	bswap	%ebx
+	movl	%ecx,36(%esp)
+	movl	48(%esi),%ecx
+	bswap	%eax
+	movl	%ebx,40(%esp)
+	movl	52(%esi),%ebx
+	bswap	%ecx
+	movl	%eax,44(%esp)
+	movl	56(%esi),%eax
+	bswap	%ebx
+	movl	%ecx,48(%esp)
+	movl	60(%esi),%ecx
+	bswap	%eax
+	movl	%ebx,52(%esp)
+	bswap	%ecx
+	movl	%eax,56(%esp)
+	movl	%ecx,60(%esp)
+	movl	%esi,88(%esp)
+	movl	8(%edi),%ecx
+	movl	4(%edi),%ebx
+	movl	(%edi),%eax
+	/* 00_15 0 */
+	movl	%edx,%edi
+	movl	(%esp),%esi
+	xorl	%ecx,%edi
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1518500249(%ebp,%esi,1),%ebp
+	movl	%eax,%esi
+	xorl	%edx,%edi
+	roll	$5,%esi
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	addl	%esi,%ebp
+	/* 00_15 1 */
+	movl	4(%esp),%esi
+	xorl	%ebx,%edi
+	andl	%eax,%edi
+	rorl	$2,%eax
+	leal	1518500249(%edx,%esi,1),%edx
+	movl	%ebp,%esi
+	xorl	%ecx,%edi
+	roll	$5,%esi
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	addl	%esi,%edx
+	/* 00_15 2 */
+	movl	8(%esp),%esi
+	xorl	%eax,%edi
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1518500249(%ecx,%esi,1),%ecx
+	movl	%edx,%esi
+	xorl	%ebx,%edi
+	roll	$5,%esi
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	addl	%esi,%ecx
+	/* 00_15 3 */
+	movl	12(%esp),%esi
+	xorl	%ebp,%edi
+	andl	%edx,%edi
+	rorl	$2,%edx
+	leal	1518500249(%ebx,%esi,1),%ebx
+	movl	%ecx,%esi
+	xorl	%eax,%edi
+	roll	$5,%esi
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	addl	%esi,%ebx
+	/* 00_15 4 */
+	movl	16(%esp),%esi
+	xorl	%edx,%edi
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1518500249(%eax,%esi,1),%eax
+	movl	%ebx,%esi
+	xorl	%ebp,%edi
+	roll	$5,%esi
+	addl	%edi,%eax
+	movl	%edx,%edi
+	addl	%esi,%eax
+	/* 00_15 5 */
+	movl	20(%esp),%esi
+	xorl	%ecx,%edi
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1518500249(%ebp,%esi,1),%ebp
+	movl	%eax,%esi
+	xorl	%edx,%edi
+	roll	$5,%esi
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	addl	%esi,%ebp
+	/* 00_15 6 */
+	movl	24(%esp),%esi
+	xorl	%ebx,%edi
+	andl	%eax,%edi
+	rorl	$2,%eax
+	leal	1518500249(%edx,%esi,1),%edx
+	movl	%ebp,%esi
+	xorl	%ecx,%edi
+	roll	$5,%esi
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	addl	%esi,%edx
+	/* 00_15 7 */
+	movl	28(%esp),%esi
+	xorl	%eax,%edi
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1518500249(%ecx,%esi,1),%ecx
+	movl	%edx,%esi
+	xorl	%ebx,%edi
+	roll	$5,%esi
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	addl	%esi,%ecx
+	/* 00_15 8 */
+	movl	32(%esp),%esi
+	xorl	%ebp,%edi
+	andl	%edx,%edi
+	rorl	$2,%edx
+	leal	1518500249(%ebx,%esi,1),%ebx
+	movl	%ecx,%esi
+	xorl	%eax,%edi
+	roll	$5,%esi
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	addl	%esi,%ebx
+	/* 00_15 9 */
+	movl	36(%esp),%esi
+	xorl	%edx,%edi
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1518500249(%eax,%esi,1),%eax
+	movl	%ebx,%esi
+	xorl	%ebp,%edi
+	roll	$5,%esi
+	addl	%edi,%eax
+	movl	%edx,%edi
+	addl	%esi,%eax
+	/* 00_15 10 */
+	movl	40(%esp),%esi
+	xorl	%ecx,%edi
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1518500249(%ebp,%esi,1),%ebp
+	movl	%eax,%esi
+	xorl	%edx,%edi
+	roll	$5,%esi
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	addl	%esi,%ebp
+	/* 00_15 11 */
+	movl	44(%esp),%esi
+	xorl	%ebx,%edi
+	andl	%eax,%edi
+	rorl	$2,%eax
+	leal	1518500249(%edx,%esi,1),%edx
+	movl	%ebp,%esi
+	xorl	%ecx,%edi
+	roll	$5,%esi
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	addl	%esi,%edx
+	/* 00_15 12 */
+	movl	48(%esp),%esi
+	xorl	%eax,%edi
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1518500249(%ecx,%esi,1),%ecx
+	movl	%edx,%esi
+	xorl	%ebx,%edi
+	roll	$5,%esi
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	addl	%esi,%ecx
+	/* 00_15 13 */
+	movl	52(%esp),%esi
+	xorl	%ebp,%edi
+	andl	%edx,%edi
+	rorl	$2,%edx
+	leal	1518500249(%ebx,%esi,1),%ebx
+	movl	%ecx,%esi
+	xorl	%eax,%edi
+	roll	$5,%esi
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	addl	%esi,%ebx
+	/* 00_15 14 */
+	movl	56(%esp),%esi
+	xorl	%edx,%edi
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1518500249(%eax,%esi,1),%eax
+	movl	%ebx,%esi
+	xorl	%ebp,%edi
+	roll	$5,%esi
+	addl	%edi,%eax
+	movl	%edx,%edi
+	addl	%esi,%eax
+	/* 00_15 15 */
+	movl	60(%esp),%esi
+	xorl	%ecx,%edi
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1518500249(%ebp,%esi,1),%ebp
+	xorl	%edx,%edi
+	movl	(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	8(%esp),%esi
+	roll	$5,%edi
+	xorl	32(%esp),%esi
+	/* 16_19 16 */
+	xorl	52(%esp),%esi
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	roll	$1,%esi
+	xorl	%ebx,%edi
+	movl	%esi,(%esp)
+	andl	%eax,%edi
+	rorl	$2,%eax
+	leal	1518500249(%edx,%esi,1),%edx
+	movl	4(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	12(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	36(%esp),%esi
+	roll	$5,%edi
+	/* 16_19 17 */
+	xorl	56(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	roll	$1,%esi
+	xorl	%eax,%edi
+	movl	%esi,4(%esp)
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1518500249(%ecx,%esi,1),%ecx
+	movl	8(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	16(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	40(%esp),%esi
+	roll	$5,%edi
+	/* 16_19 18 */
+	xorl	60(%esp),%esi
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	roll	$1,%esi
+	xorl	%ebp,%edi
+	movl	%esi,8(%esp)
+	andl	%edx,%edi
+	rorl	$2,%edx
+	leal	1518500249(%ebx,%esi,1),%ebx
+	movl	12(%esp),%esi
+	xorl	%eax,%edi
+	xorl	20(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	44(%esp),%esi
+	roll	$5,%edi
+	/* 16_19 19 */
+	xorl	(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	roll	$1,%esi
+	xorl	%edx,%edi
+	movl	%esi,12(%esp)
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1518500249(%eax,%esi,1),%eax
+	movl	16(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	24(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	48(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 20 */
+	xorl	4(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,16(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1859775393(%ebp,%esi,1),%ebp
+	movl	20(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	28(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	52(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 21 */
+	xorl	8(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,20(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	1859775393(%edx,%esi,1),%edx
+	movl	24(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	32(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	56(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 22 */
+	xorl	12(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,24(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1859775393(%ecx,%esi,1),%ecx
+	movl	28(%esp),%esi
+	xorl	%eax,%edi
+	xorl	36(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	60(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 23 */
+	xorl	16(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,28(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	1859775393(%ebx,%esi,1),%ebx
+	movl	32(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	40(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 24 */
+	xorl	20(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,32(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1859775393(%eax,%esi,1),%eax
+	movl	36(%esp),%esi
+	xorl	%edx,%edi
+	xorl	44(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	4(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 25 */
+	xorl	24(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,36(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1859775393(%ebp,%esi,1),%ebp
+	movl	40(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	48(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	8(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 26 */
+	xorl	28(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,40(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	1859775393(%edx,%esi,1),%edx
+	movl	44(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	52(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	12(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 27 */
+	xorl	32(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,44(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1859775393(%ecx,%esi,1),%ecx
+	movl	48(%esp),%esi
+	xorl	%eax,%edi
+	xorl	56(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	16(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 28 */
+	xorl	36(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,48(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	1859775393(%ebx,%esi,1),%ebx
+	movl	52(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	60(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	20(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 29 */
+	xorl	40(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,52(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1859775393(%eax,%esi,1),%eax
+	movl	56(%esp),%esi
+	xorl	%edx,%edi
+	xorl	(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	24(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 30 */
+	xorl	44(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,56(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1859775393(%ebp,%esi,1),%ebp
+	movl	60(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	4(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	28(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 31 */
+	xorl	48(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,60(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	1859775393(%edx,%esi,1),%edx
+	movl	(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	8(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	32(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 32 */
+	xorl	52(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1859775393(%ecx,%esi,1),%ecx
+	movl	4(%esp),%esi
+	xorl	%eax,%edi
+	xorl	12(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	36(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 33 */
+	xorl	56(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,4(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	1859775393(%ebx,%esi,1),%ebx
+	movl	8(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	16(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	40(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 34 */
+	xorl	60(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,8(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1859775393(%eax,%esi,1),%eax
+	movl	12(%esp),%esi
+	xorl	%edx,%edi
+	xorl	20(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	44(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 35 */
+	xorl	(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,12(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1859775393(%ebp,%esi,1),%ebp
+	movl	16(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	24(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	48(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 36 */
+	xorl	4(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,16(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	1859775393(%edx,%esi,1),%edx
+	movl	20(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	28(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	52(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 37 */
+	xorl	8(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,20(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1859775393(%ecx,%esi,1),%ecx
+	movl	24(%esp),%esi
+	xorl	%eax,%edi
+	xorl	32(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	56(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 38 */
+	xorl	12(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,24(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	1859775393(%ebx,%esi,1),%ebx
+	movl	28(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	36(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	60(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 39 */
+	xorl	16(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,28(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1859775393(%eax,%esi,1),%eax
+	movl	32(%esp),%esi
+	xorl	%edx,%edi
+	xorl	40(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	(%esp),%esi
+	roll	$5,%edi
+	/* 40_59 40 */
+	addl	%edi,%eax
+	movl	%edx,%edi
+	xorl	20(%esp),%esi
+	andl	%ecx,%edi
+	roll	$1,%esi
+	addl	%edi,%ebp
+	movl	%edx,%edi
+	movl	%esi,32(%esp)
+	xorl	%ecx,%edi
+	leal	2400959708(%ebp,%esi,1),%ebp
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	movl	36(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	44(%esp),%esi
+	roll	$5,%edi
+	xorl	4(%esp),%esi
+	/* 40_59 41 */
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	xorl	24(%esp),%esi
+	andl	%ebx,%edi
+	roll	$1,%esi
+	addl	%edi,%edx
+	movl	%ecx,%edi
+	movl	%esi,36(%esp)
+	xorl	%ebx,%edi
+	leal	2400959708(%edx,%esi,1),%edx
+	andl	%eax,%edi
+	rorl	$2,%eax
+	movl	40(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	48(%esp),%esi
+	roll	$5,%edi
+	xorl	8(%esp),%esi
+	/* 40_59 42 */
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	xorl	28(%esp),%esi
+	andl	%eax,%edi
+	roll	$1,%esi
+	addl	%edi,%ecx
+	movl	%ebx,%edi
+	movl	%esi,40(%esp)
+	xorl	%eax,%edi
+	leal	2400959708(%ecx,%esi,1),%ecx
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	movl	44(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	52(%esp),%esi
+	roll	$5,%edi
+	xorl	12(%esp),%esi
+	/* 40_59 43 */
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	xorl	32(%esp),%esi
+	andl	%ebp,%edi
+	roll	$1,%esi
+	addl	%edi,%ebx
+	movl	%eax,%edi
+	movl	%esi,44(%esp)
+	xorl	%ebp,%edi
+	leal	2400959708(%ebx,%esi,1),%ebx
+	andl	%edx,%edi
+	rorl	$2,%edx
+	movl	48(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	56(%esp),%esi
+	roll	$5,%edi
+	xorl	16(%esp),%esi
+	/* 40_59 44 */
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	xorl	36(%esp),%esi
+	andl	%edx,%edi
+	roll	$1,%esi
+	addl	%edi,%eax
+	movl	%ebp,%edi
+	movl	%esi,48(%esp)
+	xorl	%edx,%edi
+	leal	2400959708(%eax,%esi,1),%eax
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	movl	52(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	60(%esp),%esi
+	roll	$5,%edi
+	xorl	20(%esp),%esi
+	/* 40_59 45 */
+	addl	%edi,%eax
+	movl	%edx,%edi
+	xorl	40(%esp),%esi
+	andl	%ecx,%edi
+	roll	$1,%esi
+	addl	%edi,%ebp
+	movl	%edx,%edi
+	movl	%esi,52(%esp)
+	xorl	%ecx,%edi
+	leal	2400959708(%ebp,%esi,1),%ebp
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	movl	56(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	(%esp),%esi
+	roll	$5,%edi
+	xorl	24(%esp),%esi
+	/* 40_59 46 */
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	xorl	44(%esp),%esi
+	andl	%ebx,%edi
+	roll	$1,%esi
+	addl	%edi,%edx
+	movl	%ecx,%edi
+	movl	%esi,56(%esp)
+	xorl	%ebx,%edi
+	leal	2400959708(%edx,%esi,1),%edx
+	andl	%eax,%edi
+	rorl	$2,%eax
+	movl	60(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	4(%esp),%esi
+	roll	$5,%edi
+	xorl	28(%esp),%esi
+	/* 40_59 47 */
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	xorl	48(%esp),%esi
+	andl	%eax,%edi
+	roll	$1,%esi
+	addl	%edi,%ecx
+	movl	%ebx,%edi
+	movl	%esi,60(%esp)
+	xorl	%eax,%edi
+	leal	2400959708(%ecx,%esi,1),%ecx
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	movl	(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	8(%esp),%esi
+	roll	$5,%edi
+	xorl	32(%esp),%esi
+	/* 40_59 48 */
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	xorl	52(%esp),%esi
+	andl	%ebp,%edi
+	roll	$1,%esi
+	addl	%edi,%ebx
+	movl	%eax,%edi
+	movl	%esi,(%esp)
+	xorl	%ebp,%edi
+	leal	2400959708(%ebx,%esi,1),%ebx
+	andl	%edx,%edi
+	rorl	$2,%edx
+	movl	4(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	12(%esp),%esi
+	roll	$5,%edi
+	xorl	36(%esp),%esi
+	/* 40_59 49 */
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	xorl	56(%esp),%esi
+	andl	%edx,%edi
+	roll	$1,%esi
+	addl	%edi,%eax
+	movl	%ebp,%edi
+	movl	%esi,4(%esp)
+	xorl	%edx,%edi
+	leal	2400959708(%eax,%esi,1),%eax
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	movl	8(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	16(%esp),%esi
+	roll	$5,%edi
+	xorl	40(%esp),%esi
+	/* 40_59 50 */
+	addl	%edi,%eax
+	movl	%edx,%edi
+	xorl	60(%esp),%esi
+	andl	%ecx,%edi
+	roll	$1,%esi
+	addl	%edi,%ebp
+	movl	%edx,%edi
+	movl	%esi,8(%esp)
+	xorl	%ecx,%edi
+	leal	2400959708(%ebp,%esi,1),%ebp
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	movl	12(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	20(%esp),%esi
+	roll	$5,%edi
+	xorl	44(%esp),%esi
+	/* 40_59 51 */
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	xorl	(%esp),%esi
+	andl	%ebx,%edi
+	roll	$1,%esi
+	addl	%edi,%edx
+	movl	%ecx,%edi
+	movl	%esi,12(%esp)
+	xorl	%ebx,%edi
+	leal	2400959708(%edx,%esi,1),%edx
+	andl	%eax,%edi
+	rorl	$2,%eax
+	movl	16(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	24(%esp),%esi
+	roll	$5,%edi
+	xorl	48(%esp),%esi
+	/* 40_59 52 */
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	xorl	4(%esp),%esi
+	andl	%eax,%edi
+	roll	$1,%esi
+	addl	%edi,%ecx
+	movl	%ebx,%edi
+	movl	%esi,16(%esp)
+	xorl	%eax,%edi
+	leal	2400959708(%ecx,%esi,1),%ecx
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	movl	20(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	28(%esp),%esi
+	roll	$5,%edi
+	xorl	52(%esp),%esi
+	/* 40_59 53 */
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	xorl	8(%esp),%esi
+	andl	%ebp,%edi
+	roll	$1,%esi
+	addl	%edi,%ebx
+	movl	%eax,%edi
+	movl	%esi,20(%esp)
+	xorl	%ebp,%edi
+	leal	2400959708(%ebx,%esi,1),%ebx
+	andl	%edx,%edi
+	rorl	$2,%edx
+	movl	24(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	32(%esp),%esi
+	roll	$5,%edi
+	xorl	56(%esp),%esi
+	/* 40_59 54 */
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	xorl	12(%esp),%esi
+	andl	%edx,%edi
+	roll	$1,%esi
+	addl	%edi,%eax
+	movl	%ebp,%edi
+	movl	%esi,24(%esp)
+	xorl	%edx,%edi
+	leal	2400959708(%eax,%esi,1),%eax
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	movl	28(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	36(%esp),%esi
+	roll	$5,%edi
+	xorl	60(%esp),%esi
+	/* 40_59 55 */
+	addl	%edi,%eax
+	movl	%edx,%edi
+	xorl	16(%esp),%esi
+	andl	%ecx,%edi
+	roll	$1,%esi
+	addl	%edi,%ebp
+	movl	%edx,%edi
+	movl	%esi,28(%esp)
+	xorl	%ecx,%edi
+	leal	2400959708(%ebp,%esi,1),%ebp
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	movl	32(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	40(%esp),%esi
+	roll	$5,%edi
+	xorl	(%esp),%esi
+	/* 40_59 56 */
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	xorl	20(%esp),%esi
+	andl	%ebx,%edi
+	roll	$1,%esi
+	addl	%edi,%edx
+	movl	%ecx,%edi
+	movl	%esi,32(%esp)
+	xorl	%ebx,%edi
+	leal	2400959708(%edx,%esi,1),%edx
+	andl	%eax,%edi
+	rorl	$2,%eax
+	movl	36(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	44(%esp),%esi
+	roll	$5,%edi
+	xorl	4(%esp),%esi
+	/* 40_59 57 */
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	xorl	24(%esp),%esi
+	andl	%eax,%edi
+	roll	$1,%esi
+	addl	%edi,%ecx
+	movl	%ebx,%edi
+	movl	%esi,36(%esp)
+	xorl	%eax,%edi
+	leal	2400959708(%ecx,%esi,1),%ecx
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	movl	40(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	48(%esp),%esi
+	roll	$5,%edi
+	xorl	8(%esp),%esi
+	/* 40_59 58 */
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	xorl	28(%esp),%esi
+	andl	%ebp,%edi
+	roll	$1,%esi
+	addl	%edi,%ebx
+	movl	%eax,%edi
+	movl	%esi,40(%esp)
+	xorl	%ebp,%edi
+	leal	2400959708(%ebx,%esi,1),%ebx
+	andl	%edx,%edi
+	rorl	$2,%edx
+	movl	44(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	52(%esp),%esi
+	roll	$5,%edi
+	xorl	12(%esp),%esi
+	/* 40_59 59 */
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	xorl	32(%esp),%esi
+	andl	%edx,%edi
+	roll	$1,%esi
+	addl	%edi,%eax
+	movl	%ebp,%edi
+	movl	%esi,44(%esp)
+	xorl	%edx,%edi
+	leal	2400959708(%eax,%esi,1),%eax
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	movl	48(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	56(%esp),%esi
+	roll	$5,%edi
+	xorl	16(%esp),%esi
+	/* 20_39 60 */
+	xorl	36(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,48(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	3395469782(%ebp,%esi,1),%ebp
+	movl	52(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	60(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	20(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 61 */
+	xorl	40(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,52(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	3395469782(%edx,%esi,1),%edx
+	movl	56(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	24(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 62 */
+	xorl	44(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,56(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	3395469782(%ecx,%esi,1),%ecx
+	movl	60(%esp),%esi
+	xorl	%eax,%edi
+	xorl	4(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	28(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 63 */
+	xorl	48(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,60(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	3395469782(%ebx,%esi,1),%ebx
+	movl	(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	8(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	32(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 64 */
+	xorl	52(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	3395469782(%eax,%esi,1),%eax
+	movl	4(%esp),%esi
+	xorl	%edx,%edi
+	xorl	12(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	36(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 65 */
+	xorl	56(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,4(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	3395469782(%ebp,%esi,1),%ebp
+	movl	8(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	16(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	40(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 66 */
+	xorl	60(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,8(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	3395469782(%edx,%esi,1),%edx
+	movl	12(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	20(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	44(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 67 */
+	xorl	(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,12(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	3395469782(%ecx,%esi,1),%ecx
+	movl	16(%esp),%esi
+	xorl	%eax,%edi
+	xorl	24(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	48(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 68 */
+	xorl	4(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,16(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	3395469782(%ebx,%esi,1),%ebx
+	movl	20(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	28(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	52(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 69 */
+	xorl	8(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,20(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	3395469782(%eax,%esi,1),%eax
+	movl	24(%esp),%esi
+	xorl	%edx,%edi
+	xorl	32(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	56(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 70 */
+	xorl	12(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,24(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	3395469782(%ebp,%esi,1),%ebp
+	movl	28(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	36(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	60(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 71 */
+	xorl	16(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,28(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	3395469782(%edx,%esi,1),%edx
+	movl	32(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	40(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 72 */
+	xorl	20(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,32(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	3395469782(%ecx,%esi,1),%ecx
+	movl	36(%esp),%esi
+	xorl	%eax,%edi
+	xorl	44(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	4(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 73 */
+	xorl	24(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,36(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	3395469782(%ebx,%esi,1),%ebx
+	movl	40(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	48(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	8(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 74 */
+	xorl	28(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,40(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	3395469782(%eax,%esi,1),%eax
+	movl	44(%esp),%esi
+	xorl	%edx,%edi
+	xorl	52(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	12(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 75 */
+	xorl	32(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,44(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	3395469782(%ebp,%esi,1),%ebp
+	movl	48(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	56(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	16(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 76 */
+	xorl	36(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,48(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	3395469782(%edx,%esi,1),%edx
+	movl	52(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	60(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	20(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 77 */
+	xorl	40(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	3395469782(%ecx,%esi,1),%ecx
+	movl	56(%esp),%esi
+	xorl	%eax,%edi
+	xorl	(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	24(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 78 */
+	xorl	44(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	3395469782(%ebx,%esi,1),%ebx
+	movl	60(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	4(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	28(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 79 */
+	xorl	48(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	3395469782(%eax,%esi,1),%eax
+	xorl	%edx,%edi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	roll	$5,%edi
+	addl	%edi,%eax
+	/* Loop trailer */
+	movl	84(%esp),%edi
+	movl	88(%esp),%esi
+	addl	16(%edi),%ebp
+	addl	12(%edi),%edx
+	addl	%ecx,8(%edi)
+	addl	%ebx,4(%edi)
+	addl	$64,%esi
+	addl	%eax,(%edi)
+	movl	%edx,12(%edi)
+	movl	%ebp,16(%edi)
+	cmpl	92(%esp),%esi
+	jb	.L000loop
+	addl	$64,%esp
+	popl	%edi
+	popl	%esi
+	popl	%ebx
+	popl	%ebp
+	ret
+.L_sha1_block_data_order_end:
+.size	sha1_block_data_order,.L_sha1_block_data_order_end-sha1_block_data_order
+.byte	83,72,65,49,32,98,108,111,99,107,32,116,114,97,110,115,102,111,114,109,32,102,111,114,32,120,56,54,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0
diff --git a/x86/sha1.c b/x86/sha1.c
new file mode 100644
index 0000000..4c1a569
--- /dev/null
+++ b/x86/sha1.c
@@ -0,0 +1,81 @@
+/*
+ * SHA-1 implementation.
+ *
+ * Copyright (C) 2005 Paul Mackerras <paulus@samba.org>
+ *
+ * This version assumes we are running on a big-endian machine.
+ * It calls an external sha1_core() to process blocks of 64 bytes.
+ */
+#include <stdio.h>
+#include <string.h>
+#include <arpa/inet.h>	/* For htonl */
+#include "sha1.h"
+
+#define x86_sha1_core sha1_block_data_order
+extern void x86_sha1_core(uint32_t hash[5], const unsigned char *p,
+			  unsigned int nblocks);
+
+void x86_SHA1_Init(x86_SHA_CTX *c)
+{
+	/* Matches prefix of the context structure */
+	static struct {
+		uint32_t hash[5];
+		uint64_t len;
+	} const iv = {
+		{ 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0 },
+		0
+	};
+
+	memcpy(c, &iv, sizeof iv);
+}
+
+void x86_SHA1_Update(x86_SHA_CTX *c, const void *p, unsigned long n)
+{
+	unsigned pos = (unsigned)c->len & 63;
+	unsigned long nb;
+
+	c->len += n;
+
+	/* Initial partial block */
+	if (pos) {
+		unsigned space = 64 - pos;
+		if (space > n)
+			goto end;
+		memcpy(c->buf + pos, p, space);
+		p += space;
+		n -= space;
+		x86_sha1_core(c->hash, c->buf, 1);
+	}
+
+	/* The big impressive middle */
+	nb = n >> 6;
+	if (nb) {
+		x86_sha1_core(c->hash, p, nb);
+		p += nb << 6;
+		n &= 63;
+	}
+	pos = 0;
+end:
+	/* Final partial block */
+	memcpy(c->buf + pos, p, n);
+}
+
+void x86_SHA1_Final(unsigned char *hash, x86_SHA_CTX *c)
+{
+	unsigned pos = (unsigned)c->len & 63;
+
+	c->buf[pos++] = 0x80;
+	if (pos > 56) {
+		memset(c->buf + pos, 0, 64 - pos);
+		x86_sha1_core(c->hash, c->buf, 1);
+		pos = 0;
+	}
+	memset(c->buf + pos, 0, 56 - pos);
+	/* Last two words are 64-bit *bit* count */
+	*(uint32_t *)(c->buf + 56) = htonl((uint32_t)(c->len >> 29));
+	*(uint32_t *)(c->buf + 60) = htonl((uint32_t)c->len << 3);
+	x86_sha1_core(c->hash, c->buf, 1);
+
+	for (pos = 0; pos < 5; pos++)
+		((uint32_t *)hash)[pos] = htonl(c->hash[pos]);
+}
diff --git a/x86/sha1.h b/x86/sha1.h
new file mode 100644
index 0000000..8988da9
--- /dev/null
+++ b/x86/sha1.h
@@ -0,0 +1,21 @@
+/*
+ * SHA-1 implementation.
+ *
+ * Copyright (C) 2005 Paul Mackerras <paulus@samba.org>
+ */
+#include <stdint.h>
+
+typedef struct {
+	uint32_t hash[5];
+	uint64_t len;
+	unsigned char buf[64];	/* Keep this aligned */
+} x86_SHA_CTX;
+
+void x86_SHA1_Init(x86_SHA_CTX *c);
+void x86_SHA1_Update(x86_SHA_CTX *c, const void *p, unsigned long n);
+void x86_SHA1_Final(unsigned char *hash, x86_SHA_CTX *c);
+
+#define git_SHA_CTX	x86_SHA_CTX
+#define git_SHA1_Init	x86_SHA1_Init
+#define git_SHA1_Update	x86_SHA1_Update
+#define git_SHA1_Final	x86_SHA1_Final

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  3:07         ` Jon Smirl
@ 2009-08-04  5:01           ` George Spelvin
  2009-08-04 12:56             ` Jon Smirl
  0 siblings, 1 reply; 60+ messages in thread
From: George Spelvin @ 2009-08-04  5:01 UTC (permalink / raw)
  To: jonsmirl; +Cc: git, linux

> Would there happen to be a SHA1 implementation around that can compute
> the SHA1 without first decompressing the data? Databases gain a lot of
> speed by using special algorithms that can directly operate on the
> compressed data.

I can't imagine how.  In general, this requires that the compression
be carefully designed to be compatible with the algorithms, and SHA1
is specifically designed to depend on every bit of the input in
an un-analyzable way.

Also, git normally avoids hashing objects that it doesn't need
uncompressed for some other reason.  git-fsck is a notable exception,
but I think the idea of creating special optimized code paths for that
interferes with its reliability and robustness goals.

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  4:48       ` George Spelvin
@ 2009-08-04  6:30         ` Linus Torvalds
  2009-08-04  8:01           ` George Spelvin
  2009-08-04  6:40         ` Linus Torvalds
  1 sibling, 1 reply; 60+ messages in thread
From: Linus Torvalds @ 2009-08-04  6:30 UTC (permalink / raw)
  To: George Spelvin; +Cc: Git Mailing List



On Mon, 4 Aug 2009, George Spelvin wrote:
> 
> The actual goal of this effort is to address the dynamic linker startup
> time issues by removing the second-largest contributor after libcurl,
> namely openssl.  Optimizing the assembly code is just the fun part. ;-)

Now, I agree that it would be wonderful to get rid of the linker startup, 
but the startup costs of openssl are very low compared to the equivalent 
curl ones. So we can't lose _too_ much performance - especially for 
long-running jobs where startup costs really don't even matter - in the 
quest to get rid of those.

That said, your numbers are impressive. Improving fsck by 1.1-2.2% is very 
good. That means that you not only avoided the startup costs, you actually 
improved on the openssl code. So it's a win-win situation.

That said, it would be even better if the SHA1 code was also somewhat 
portable to other environments (it looks like your current patch is very 
GNU as specific), and if you had a solution for x86-64 too ;)

Yeah, I'm a whiny little b*tch, aren't I?

		Linus

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  4:48       ` George Spelvin
  2009-08-04  6:30         ` Linus Torvalds
@ 2009-08-04  6:40         ` Linus Torvalds
  1 sibling, 0 replies; 60+ messages in thread
From: Linus Torvalds @ 2009-08-04  6:40 UTC (permalink / raw)
  To: George Spelvin; +Cc: Git Mailing List



On Mon, 4 Aug 2009, George Spelvin wrote:
> +sha1_block_data_order:
> +	pushl	%ebp
> +	pushl	%ebx
> +	pushl	%esi
> +	pushl	%edi
> +	movl	20(%esp),%edi
> +	movl	24(%esp),%esi
> +	movl	28(%esp),%eax
> +	subl	$64,%esp
> +	shll	$6,%eax
> +	addl	%esi,%eax
> +	movl	%eax,92(%esp)
> +	movl	16(%edi),%ebp
> +	movl	12(%edi),%edx
> +.align	16
> +.L000loop:
> +	movl	(%esi),%ecx
> +	movl	4(%esi),%ebx
> +	bswap	%ecx
> +	movl	8(%esi),%eax
> +	bswap	%ebx
> +	movl	%ecx,(%esp)

...

Hmm. Does it really help to do the bswap as a separate initial phase?

As far as I can tell, you load the result of the bswap just a single time 
for each value. So the initial "bswap all 64 bytes" seems pointless.

> +	/* 00_15 0 */
> +	movl	%edx,%edi
> +	movl	(%esp),%esi

Why not do the bswap here instead?

Is it because you're running out of registers for scheduling, and want to 
use the stack pointer rather than the original source?

Or does the data dependency end up being so much better that you're better 
off doing a separate bswap loop?

Or is it just because the code was written that way?

Intriguing, either way.

		Linus

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  6:30         ` Linus Torvalds
@ 2009-08-04  8:01           ` George Spelvin
  2009-08-04 20:41             ` Junio C Hamano
  0 siblings, 1 reply; 60+ messages in thread
From: George Spelvin @ 2009-08-04  8:01 UTC (permalink / raw)
  To: linux, torvalds; +Cc: git

> Now, I agree that it would be wonderful to get rid of the linker startup, 
> but the startup costs of openssl are very low compared to the equivalent 
> curl ones. So we can't lose _too_ much performance - especially for 
> long-running jobs where startup costs really don't even matter - in the 
> quest to get rid of those.
>
> That said, your numbers are impressive. Improving fsck by 1.1-2.2% is very 
> good. That means that you not only avoided the startup costs, you actually 
> improved on the openssl code. So it's a win-win situation.

Er, yes, that *is* what the subject line is advertising.  I started
with the OpenSSL core SHA1 code (which is BSD/GPL dual-licensed by its
author) and tweaked it further for more recent processors.

> That said, it would be even better if the SHA1 code was also somewhat 
> portable to other environments (it looks like your current patch is very 
> GNU as specific), and if you had a solution for x86-64 too ;)

Done and will be done.

The code is *actually* written (see the first e-mail in this thread)
in the perl-preprocessor that OpenSSL uses, which can generate quite a
few output syntaxes (including Intel).  I just included the preprocessed
version to reduce the complexity of the rough-draft patch.

The one question I have is that currently perl is not a critical
compile-time dependency; it's needed for some extra stuff, but AFAIK you
can get most of git working without it.  Whether to add that dependency
or what is a Junio question.

As for x86-64, I haven't actually *written* it yet, but it'll be a very
simple adaptation.  Mostly it's just a matter of using the additional
registers effectively.

> Yeah, I'm a whiny little b*tch, aren't I?

Not at all; I expected all of that.  Getting rid of OpenSSL kind of
requires those things.

> Hmm. Does it really help to do the bswap as a separate initial phase?
> 
> As far as I can tell, you load the result of the bswap just a single time 
> for each value. So the initial "bswap all 64 bytes" seems pointless.

>> +	/* 00_15 0 */
>> +	movl	%edx,%edi
>> +	movl	(%esp),%esi

> Why not do the bswap here instead?
>
> Is it because you're running out of registers for scheduling, and want to 
> use the stack pointer rather than the original source?

Exactly.  I looked hard at it, but that means that I'd have to write the
first 16 rounds with only one temp register, because the other is being
used as an input pointer.

Here's the pipelined loop for the first 16 rounds (when in[i] is the
stack buffer), showing parallel operations on the same line.
(Operations in parens belong to adjacent rounds.)
#                       movl D,S        (roll 5,T)      (addl S,A)      //
#       mov in[i],T     xorl C,S        (addl T,A)
#                       andl B,S                        rorl 2,B
#       addl T+K,E      xorl D,S        movl A,T
#                       addl S,E        roll 5,T        (movl C,S)      //
#       (mov in[i],T)   (xorl B,S)      addl T,E        

which translates in perl code to:

sub BODY_00_15
{
        local($n,$a,$b,$c,$d,$e)=@_;

        &comment("00_15 $n");
                &mov($S,$d) if ($n == 0);
        &mov($T,&swtmp($n%16));         #  V Load Xi.
                &xor($S,$c);            # U  Continue F() = d^(b&(c^d))
                &and($S,$b);            #  V
                        &rotr($b,2);    # NP
        &lea($e,&DWP(K1,$e,$T));        # U  Add Xi and K
    if ($n < 15) {
                        &mov($T,$a);    #  V
                &xor($S,$d);            # U 
                        &rotl($T,5);    # NP
                &add($e,$S);            # U 
                &mov($S,$c);            #  V Start of NEXT round's F()
                        &add($e,$T);    # U 
    } else {
        # This version provides the correct start for BODY_20_39
                &xor($S,$d);            #  V
        &mov($T,&swtmp(($n+1)%16));     # U  Start computing next Xi.
                &add($e,$S);            #  V Add F()
                        &mov($S,$a);    # U  Start computing a<<<5
        &xor($T,&swtmp(($n+3)%16));     #  V
                        &rotl($S,5);    # U 
        &xor($T,&swtmp(($n+9)%16));     #  V
    }
}

Anyway, the round is:

#define K1 0x5a827999
e += bswap(in[i]) + K1 + (d^(b&(c^d))) + ROTL(a,5).
b = ROTR(b,2);

Notice how I use one temp (T) for in[i] and ROTL(a,5), and the other (S)
for F1(b,c,d) = d^(b&(c^d)).

If I only had one temporary, I'd have to seriously un-overlap it:
	mov	S[i],T
	bswap	T
	mov	T,in[i]
	lea	K1(T,e),e
	  mov	  d,T
	  xor	  c,T
	  and	  b,T
	  xor	  d,T
	  add	  T,e
	mov	a,T
	roll	5,T
	add	T,e

Current processors probably have enough out-of-order scheduling resources to
find the parallelism there, but something like an Atom would be doomed.

I just cobbled together a test implementation, and it looks pretty similar
on my Phenom here (minimum of 30 runs):

Separate copy loop: 1.355603
In-line:            1.350444 (+0.4% faster)

A hint of being faster, but not much.

It is a couple of percent faster on a P4:
Separate copy loop: 3.297174
In-line:            3.237354 (+1.8% faster)

And on an i7:
Separate copy loop: 1.353641
In-line:            1.336766 (+1.2% faster)

but I worry about in-order machines.  An Athlon XP:
Separate copy loop: 3.252682
In-line:            3.313870 (-1.8% slower)

H'm... it's not bad.  And the code is smaller.  Maybe I'll work on
it a bit.

If you want to try it, the modified sha1-x86.s file is appended.

--- /dev/null	2009-05-12 02:55:38.579106460 -0400
+++ sha1-x86.s	2009-08-04 03:42:31.073284734 -0400
@@ -0,0 +1,1359 @@
+.file	"sha1-586.s"
+.text
+.globl	sha1_block_data_order
+.type	sha1_block_data_order,@function
+.align	16
+sha1_block_data_order:
+	pushl	%ebp
+	pushl	%ebx
+	pushl	%esi
+	pushl	%edi
+	movl	20(%esp),%edi
+	movl	24(%esp),%esi
+	movl	28(%esp),%eax
+	subl	$64,%esp
+	shll	$6,%eax
+	addl	%esi,%eax
+	movl	%eax,92(%esp)
+	movl	16(%edi),%ebp
+	movl	12(%edi),%edx
+	movl	8(%edi),%ecx
+	movl	4(%edi),%ebx
+	movl	(%edi),%eax
+.align	16
+.L000loop:
+	movl	%esi,88(%esp)
+	/* 00_15 0 */
+	movl	(%esi),%edi
+	bswap	%edi
+	movl	%edi,(%esp)
+	leal	1518500249(%ebp,%edi,1),%ebp
+	movl	%edx,%edi
+	xorl	%ecx,%edi
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	xorl	%edx,%edi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	roll	$5,%edi
+	addl	%edi,%ebp
+	/* 00_15 1 */
+	movl	4(%esi),%edi
+	bswap	%edi
+	movl	%edi,4(%esp)
+	leal	1518500249(%edx,%edi,1),%edx
+	movl	%ecx,%edi
+	xorl	%ebx,%edi
+	andl	%eax,%edi
+	rorl	$2,%eax
+	xorl	%ecx,%edi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	roll	$5,%edi
+	addl	%edi,%edx
+	/* 00_15 2 */
+	movl	8(%esi),%edi
+	bswap	%edi
+	movl	%edi,8(%esp)
+	leal	1518500249(%ecx,%edi,1),%ecx
+	movl	%ebx,%edi
+	xorl	%eax,%edi
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	xorl	%ebx,%edi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	roll	$5,%edi
+	addl	%edi,%ecx
+	/* 00_15 3 */
+	movl	12(%esi),%edi
+	bswap	%edi
+	movl	%edi,12(%esp)
+	leal	1518500249(%ebx,%edi,1),%ebx
+	movl	%eax,%edi
+	xorl	%ebp,%edi
+	andl	%edx,%edi
+	rorl	$2,%edx
+	xorl	%eax,%edi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	roll	$5,%edi
+	addl	%edi,%ebx
+	/* 00_15 4 */
+	movl	16(%esi),%edi
+	bswap	%edi
+	movl	%edi,16(%esp)
+	leal	1518500249(%eax,%edi,1),%eax
+	movl	%ebp,%edi
+	xorl	%edx,%edi
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	xorl	%ebp,%edi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	roll	$5,%edi
+	addl	%edi,%eax
+	/* 00_15 5 */
+	movl	20(%esi),%edi
+	bswap	%edi
+	movl	%edi,20(%esp)
+	leal	1518500249(%ebp,%edi,1),%ebp
+	movl	%edx,%edi
+	xorl	%ecx,%edi
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	xorl	%edx,%edi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	roll	$5,%edi
+	addl	%edi,%ebp
+	/* 00_15 6 */
+	movl	24(%esi),%edi
+	bswap	%edi
+	movl	%edi,24(%esp)
+	leal	1518500249(%edx,%edi,1),%edx
+	movl	%ecx,%edi
+	xorl	%ebx,%edi
+	andl	%eax,%edi
+	rorl	$2,%eax
+	xorl	%ecx,%edi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	roll	$5,%edi
+	addl	%edi,%edx
+	/* 00_15 7 */
+	movl	28(%esi),%edi
+	bswap	%edi
+	movl	%edi,28(%esp)
+	leal	1518500249(%ecx,%edi,1),%ecx
+	movl	%ebx,%edi
+	xorl	%eax,%edi
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	xorl	%ebx,%edi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	roll	$5,%edi
+	addl	%edi,%ecx
+	/* 00_15 8 */
+	movl	32(%esi),%edi
+	bswap	%edi
+	movl	%edi,32(%esp)
+	leal	1518500249(%ebx,%edi,1),%ebx
+	movl	%eax,%edi
+	xorl	%ebp,%edi
+	andl	%edx,%edi
+	rorl	$2,%edx
+	xorl	%eax,%edi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	roll	$5,%edi
+	addl	%edi,%ebx
+	/* 00_15 9 */
+	movl	36(%esi),%edi
+	bswap	%edi
+	movl	%edi,36(%esp)
+	leal	1518500249(%eax,%edi,1),%eax
+	movl	%ebp,%edi
+	xorl	%edx,%edi
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	xorl	%ebp,%edi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	roll	$5,%edi
+	addl	%edi,%eax
+	/* 00_15 10 */
+	movl	40(%esi),%edi
+	bswap	%edi
+	movl	%edi,40(%esp)
+	leal	1518500249(%ebp,%edi,1),%ebp
+	movl	%edx,%edi
+	xorl	%ecx,%edi
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	xorl	%edx,%edi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	roll	$5,%edi
+	addl	%edi,%ebp
+	/* 00_15 11 */
+	movl	44(%esi),%edi
+	bswap	%edi
+	movl	%edi,44(%esp)
+	leal	1518500249(%edx,%edi,1),%edx
+	movl	%ecx,%edi
+	xorl	%ebx,%edi
+	andl	%eax,%edi
+	rorl	$2,%eax
+	xorl	%ecx,%edi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	roll	$5,%edi
+	addl	%edi,%edx
+	/* 00_15 12 */
+	movl	48(%esi),%edi
+	bswap	%edi
+	movl	%edi,48(%esp)
+	leal	1518500249(%ecx,%edi,1),%ecx
+	movl	%ebx,%edi
+	xorl	%eax,%edi
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	xorl	%ebx,%edi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	roll	$5,%edi
+	addl	%edi,%ecx
+	/* 00_15 13 */
+	movl	52(%esi),%edi
+	bswap	%edi
+	movl	%edi,52(%esp)
+	leal	1518500249(%ebx,%edi,1),%ebx
+	movl	%eax,%edi
+	xorl	%ebp,%edi
+	andl	%edx,%edi
+	rorl	$2,%edx
+	xorl	%eax,%edi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	roll	$5,%edi
+	addl	%edi,%ebx
+	/* 00_15 14 */
+	movl	56(%esi),%edi
+	movl	60(%esi),%esi
+	bswap	%edi
+	movl	%edi,56(%esp)
+	leal	1518500249(%eax,%edi,1),%eax
+	movl	%ebp,%edi
+	xorl	%edx,%edi
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	xorl	%ebp,%edi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	roll	$5,%edi
+	addl	%edi,%eax
+	/* 00_15 15 */
+	movl	%edx,%edi
+	bswap	%esi
+	xorl	%ecx,%edi
+	movl	%esi,60(%esp)
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	xorl	%edx,%edi
+	leal	1518500249(%ebp,%esi,1),%ebp
+	movl	(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	8(%esp),%esi
+	roll	$5,%edi
+	xorl	32(%esp),%esi
+	/* 16_19 16 */
+	xorl	52(%esp),%esi
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	roll	$1,%esi
+	xorl	%ebx,%edi
+	movl	%esi,(%esp)
+	andl	%eax,%edi
+	rorl	$2,%eax
+	leal	1518500249(%edx,%esi,1),%edx
+	movl	4(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	12(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	36(%esp),%esi
+	roll	$5,%edi
+	/* 16_19 17 */
+	xorl	56(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	roll	$1,%esi
+	xorl	%eax,%edi
+	movl	%esi,4(%esp)
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1518500249(%ecx,%esi,1),%ecx
+	movl	8(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	16(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	40(%esp),%esi
+	roll	$5,%edi
+	/* 16_19 18 */
+	xorl	60(%esp),%esi
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	roll	$1,%esi
+	xorl	%ebp,%edi
+	movl	%esi,8(%esp)
+	andl	%edx,%edi
+	rorl	$2,%edx
+	leal	1518500249(%ebx,%esi,1),%ebx
+	movl	12(%esp),%esi
+	xorl	%eax,%edi
+	xorl	20(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	44(%esp),%esi
+	roll	$5,%edi
+	/* 16_19 19 */
+	xorl	(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	roll	$1,%esi
+	xorl	%edx,%edi
+	movl	%esi,12(%esp)
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1518500249(%eax,%esi,1),%eax
+	movl	16(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	24(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	48(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 20 */
+	xorl	4(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,16(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1859775393(%ebp,%esi,1),%ebp
+	movl	20(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	28(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	52(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 21 */
+	xorl	8(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,20(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	1859775393(%edx,%esi,1),%edx
+	movl	24(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	32(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	56(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 22 */
+	xorl	12(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,24(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1859775393(%ecx,%esi,1),%ecx
+	movl	28(%esp),%esi
+	xorl	%eax,%edi
+	xorl	36(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	60(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 23 */
+	xorl	16(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,28(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	1859775393(%ebx,%esi,1),%ebx
+	movl	32(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	40(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 24 */
+	xorl	20(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,32(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1859775393(%eax,%esi,1),%eax
+	movl	36(%esp),%esi
+	xorl	%edx,%edi
+	xorl	44(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	4(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 25 */
+	xorl	24(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,36(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1859775393(%ebp,%esi,1),%ebp
+	movl	40(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	48(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	8(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 26 */
+	xorl	28(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,40(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	1859775393(%edx,%esi,1),%edx
+	movl	44(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	52(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	12(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 27 */
+	xorl	32(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,44(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1859775393(%ecx,%esi,1),%ecx
+	movl	48(%esp),%esi
+	xorl	%eax,%edi
+	xorl	56(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	16(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 28 */
+	xorl	36(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,48(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	1859775393(%ebx,%esi,1),%ebx
+	movl	52(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	60(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	20(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 29 */
+	xorl	40(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,52(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1859775393(%eax,%esi,1),%eax
+	movl	56(%esp),%esi
+	xorl	%edx,%edi
+	xorl	(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	24(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 30 */
+	xorl	44(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,56(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1859775393(%ebp,%esi,1),%ebp
+	movl	60(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	4(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	28(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 31 */
+	xorl	48(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,60(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	1859775393(%edx,%esi,1),%edx
+	movl	(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	8(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	32(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 32 */
+	xorl	52(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1859775393(%ecx,%esi,1),%ecx
+	movl	4(%esp),%esi
+	xorl	%eax,%edi
+	xorl	12(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	36(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 33 */
+	xorl	56(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,4(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	1859775393(%ebx,%esi,1),%ebx
+	movl	8(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	16(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	40(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 34 */
+	xorl	60(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,8(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1859775393(%eax,%esi,1),%eax
+	movl	12(%esp),%esi
+	xorl	%edx,%edi
+	xorl	20(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	44(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 35 */
+	xorl	(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,12(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	1859775393(%ebp,%esi,1),%ebp
+	movl	16(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	24(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	48(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 36 */
+	xorl	4(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,16(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	1859775393(%edx,%esi,1),%edx
+	movl	20(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	28(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	52(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 37 */
+	xorl	8(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,20(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	1859775393(%ecx,%esi,1),%ecx
+	movl	24(%esp),%esi
+	xorl	%eax,%edi
+	xorl	32(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	56(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 38 */
+	xorl	12(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,24(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	1859775393(%ebx,%esi,1),%ebx
+	movl	28(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	36(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	60(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 39 */
+	xorl	16(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,28(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	1859775393(%eax,%esi,1),%eax
+	movl	32(%esp),%esi
+	xorl	%edx,%edi
+	xorl	40(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	(%esp),%esi
+	roll	$5,%edi
+	/* 40_59 40 */
+	addl	%edi,%eax
+	movl	%edx,%edi
+	xorl	20(%esp),%esi
+	andl	%ecx,%edi
+	roll	$1,%esi
+	addl	%edi,%ebp
+	movl	%edx,%edi
+	movl	%esi,32(%esp)
+	xorl	%ecx,%edi
+	leal	2400959708(%ebp,%esi,1),%ebp
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	movl	36(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	44(%esp),%esi
+	roll	$5,%edi
+	xorl	4(%esp),%esi
+	/* 40_59 41 */
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	xorl	24(%esp),%esi
+	andl	%ebx,%edi
+	roll	$1,%esi
+	addl	%edi,%edx
+	movl	%ecx,%edi
+	movl	%esi,36(%esp)
+	xorl	%ebx,%edi
+	leal	2400959708(%edx,%esi,1),%edx
+	andl	%eax,%edi
+	rorl	$2,%eax
+	movl	40(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	48(%esp),%esi
+	roll	$5,%edi
+	xorl	8(%esp),%esi
+	/* 40_59 42 */
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	xorl	28(%esp),%esi
+	andl	%eax,%edi
+	roll	$1,%esi
+	addl	%edi,%ecx
+	movl	%ebx,%edi
+	movl	%esi,40(%esp)
+	xorl	%eax,%edi
+	leal	2400959708(%ecx,%esi,1),%ecx
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	movl	44(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	52(%esp),%esi
+	roll	$5,%edi
+	xorl	12(%esp),%esi
+	/* 40_59 43 */
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	xorl	32(%esp),%esi
+	andl	%ebp,%edi
+	roll	$1,%esi
+	addl	%edi,%ebx
+	movl	%eax,%edi
+	movl	%esi,44(%esp)
+	xorl	%ebp,%edi
+	leal	2400959708(%ebx,%esi,1),%ebx
+	andl	%edx,%edi
+	rorl	$2,%edx
+	movl	48(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	56(%esp),%esi
+	roll	$5,%edi
+	xorl	16(%esp),%esi
+	/* 40_59 44 */
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	xorl	36(%esp),%esi
+	andl	%edx,%edi
+	roll	$1,%esi
+	addl	%edi,%eax
+	movl	%ebp,%edi
+	movl	%esi,48(%esp)
+	xorl	%edx,%edi
+	leal	2400959708(%eax,%esi,1),%eax
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	movl	52(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	60(%esp),%esi
+	roll	$5,%edi
+	xorl	20(%esp),%esi
+	/* 40_59 45 */
+	addl	%edi,%eax
+	movl	%edx,%edi
+	xorl	40(%esp),%esi
+	andl	%ecx,%edi
+	roll	$1,%esi
+	addl	%edi,%ebp
+	movl	%edx,%edi
+	movl	%esi,52(%esp)
+	xorl	%ecx,%edi
+	leal	2400959708(%ebp,%esi,1),%ebp
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	movl	56(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	(%esp),%esi
+	roll	$5,%edi
+	xorl	24(%esp),%esi
+	/* 40_59 46 */
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	xorl	44(%esp),%esi
+	andl	%ebx,%edi
+	roll	$1,%esi
+	addl	%edi,%edx
+	movl	%ecx,%edi
+	movl	%esi,56(%esp)
+	xorl	%ebx,%edi
+	leal	2400959708(%edx,%esi,1),%edx
+	andl	%eax,%edi
+	rorl	$2,%eax
+	movl	60(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	4(%esp),%esi
+	roll	$5,%edi
+	xorl	28(%esp),%esi
+	/* 40_59 47 */
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	xorl	48(%esp),%esi
+	andl	%eax,%edi
+	roll	$1,%esi
+	addl	%edi,%ecx
+	movl	%ebx,%edi
+	movl	%esi,60(%esp)
+	xorl	%eax,%edi
+	leal	2400959708(%ecx,%esi,1),%ecx
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	movl	(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	8(%esp),%esi
+	roll	$5,%edi
+	xorl	32(%esp),%esi
+	/* 40_59 48 */
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	xorl	52(%esp),%esi
+	andl	%ebp,%edi
+	roll	$1,%esi
+	addl	%edi,%ebx
+	movl	%eax,%edi
+	movl	%esi,(%esp)
+	xorl	%ebp,%edi
+	leal	2400959708(%ebx,%esi,1),%ebx
+	andl	%edx,%edi
+	rorl	$2,%edx
+	movl	4(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	12(%esp),%esi
+	roll	$5,%edi
+	xorl	36(%esp),%esi
+	/* 40_59 49 */
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	xorl	56(%esp),%esi
+	andl	%edx,%edi
+	roll	$1,%esi
+	addl	%edi,%eax
+	movl	%ebp,%edi
+	movl	%esi,4(%esp)
+	xorl	%edx,%edi
+	leal	2400959708(%eax,%esi,1),%eax
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	movl	8(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	16(%esp),%esi
+	roll	$5,%edi
+	xorl	40(%esp),%esi
+	/* 40_59 50 */
+	addl	%edi,%eax
+	movl	%edx,%edi
+	xorl	60(%esp),%esi
+	andl	%ecx,%edi
+	roll	$1,%esi
+	addl	%edi,%ebp
+	movl	%edx,%edi
+	movl	%esi,8(%esp)
+	xorl	%ecx,%edi
+	leal	2400959708(%ebp,%esi,1),%ebp
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	movl	12(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	20(%esp),%esi
+	roll	$5,%edi
+	xorl	44(%esp),%esi
+	/* 40_59 51 */
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	xorl	(%esp),%esi
+	andl	%ebx,%edi
+	roll	$1,%esi
+	addl	%edi,%edx
+	movl	%ecx,%edi
+	movl	%esi,12(%esp)
+	xorl	%ebx,%edi
+	leal	2400959708(%edx,%esi,1),%edx
+	andl	%eax,%edi
+	rorl	$2,%eax
+	movl	16(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	24(%esp),%esi
+	roll	$5,%edi
+	xorl	48(%esp),%esi
+	/* 40_59 52 */
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	xorl	4(%esp),%esi
+	andl	%eax,%edi
+	roll	$1,%esi
+	addl	%edi,%ecx
+	movl	%ebx,%edi
+	movl	%esi,16(%esp)
+	xorl	%eax,%edi
+	leal	2400959708(%ecx,%esi,1),%ecx
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	movl	20(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	28(%esp),%esi
+	roll	$5,%edi
+	xorl	52(%esp),%esi
+	/* 40_59 53 */
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	xorl	8(%esp),%esi
+	andl	%ebp,%edi
+	roll	$1,%esi
+	addl	%edi,%ebx
+	movl	%eax,%edi
+	movl	%esi,20(%esp)
+	xorl	%ebp,%edi
+	leal	2400959708(%ebx,%esi,1),%ebx
+	andl	%edx,%edi
+	rorl	$2,%edx
+	movl	24(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	32(%esp),%esi
+	roll	$5,%edi
+	xorl	56(%esp),%esi
+	/* 40_59 54 */
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	xorl	12(%esp),%esi
+	andl	%edx,%edi
+	roll	$1,%esi
+	addl	%edi,%eax
+	movl	%ebp,%edi
+	movl	%esi,24(%esp)
+	xorl	%edx,%edi
+	leal	2400959708(%eax,%esi,1),%eax
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	movl	28(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	36(%esp),%esi
+	roll	$5,%edi
+	xorl	60(%esp),%esi
+	/* 40_59 55 */
+	addl	%edi,%eax
+	movl	%edx,%edi
+	xorl	16(%esp),%esi
+	andl	%ecx,%edi
+	roll	$1,%esi
+	addl	%edi,%ebp
+	movl	%edx,%edi
+	movl	%esi,28(%esp)
+	xorl	%ecx,%edi
+	leal	2400959708(%ebp,%esi,1),%ebp
+	andl	%ebx,%edi
+	rorl	$2,%ebx
+	movl	32(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	40(%esp),%esi
+	roll	$5,%edi
+	xorl	(%esp),%esi
+	/* 40_59 56 */
+	addl	%edi,%ebp
+	movl	%ecx,%edi
+	xorl	20(%esp),%esi
+	andl	%ebx,%edi
+	roll	$1,%esi
+	addl	%edi,%edx
+	movl	%ecx,%edi
+	movl	%esi,32(%esp)
+	xorl	%ebx,%edi
+	leal	2400959708(%edx,%esi,1),%edx
+	andl	%eax,%edi
+	rorl	$2,%eax
+	movl	36(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	44(%esp),%esi
+	roll	$5,%edi
+	xorl	4(%esp),%esi
+	/* 40_59 57 */
+	addl	%edi,%edx
+	movl	%ebx,%edi
+	xorl	24(%esp),%esi
+	andl	%eax,%edi
+	roll	$1,%esi
+	addl	%edi,%ecx
+	movl	%ebx,%edi
+	movl	%esi,36(%esp)
+	xorl	%eax,%edi
+	leal	2400959708(%ecx,%esi,1),%ecx
+	andl	%ebp,%edi
+	rorl	$2,%ebp
+	movl	40(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	48(%esp),%esi
+	roll	$5,%edi
+	xorl	8(%esp),%esi
+	/* 40_59 58 */
+	addl	%edi,%ecx
+	movl	%eax,%edi
+	xorl	28(%esp),%esi
+	andl	%ebp,%edi
+	roll	$1,%esi
+	addl	%edi,%ebx
+	movl	%eax,%edi
+	movl	%esi,40(%esp)
+	xorl	%ebp,%edi
+	leal	2400959708(%ebx,%esi,1),%ebx
+	andl	%edx,%edi
+	rorl	$2,%edx
+	movl	44(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	52(%esp),%esi
+	roll	$5,%edi
+	xorl	12(%esp),%esi
+	/* 40_59 59 */
+	addl	%edi,%ebx
+	movl	%ebp,%edi
+	xorl	32(%esp),%esi
+	andl	%edx,%edi
+	roll	$1,%esi
+	addl	%edi,%eax
+	movl	%ebp,%edi
+	movl	%esi,44(%esp)
+	xorl	%edx,%edi
+	leal	2400959708(%eax,%esi,1),%eax
+	andl	%ecx,%edi
+	rorl	$2,%ecx
+	movl	48(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	56(%esp),%esi
+	roll	$5,%edi
+	xorl	16(%esp),%esi
+	/* 20_39 60 */
+	xorl	36(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,48(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	3395469782(%ebp,%esi,1),%ebp
+	movl	52(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	60(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	20(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 61 */
+	xorl	40(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,52(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	3395469782(%edx,%esi,1),%edx
+	movl	56(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	24(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 62 */
+	xorl	44(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,56(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	3395469782(%ecx,%esi,1),%ecx
+	movl	60(%esp),%esi
+	xorl	%eax,%edi
+	xorl	4(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	28(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 63 */
+	xorl	48(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,60(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	3395469782(%ebx,%esi,1),%ebx
+	movl	(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	8(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	32(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 64 */
+	xorl	52(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	3395469782(%eax,%esi,1),%eax
+	movl	4(%esp),%esi
+	xorl	%edx,%edi
+	xorl	12(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	36(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 65 */
+	xorl	56(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,4(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	3395469782(%ebp,%esi,1),%ebp
+	movl	8(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	16(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	40(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 66 */
+	xorl	60(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,8(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	3395469782(%edx,%esi,1),%edx
+	movl	12(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	20(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	44(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 67 */
+	xorl	(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,12(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	3395469782(%ecx,%esi,1),%ecx
+	movl	16(%esp),%esi
+	xorl	%eax,%edi
+	xorl	24(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	48(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 68 */
+	xorl	4(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,16(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	3395469782(%ebx,%esi,1),%ebx
+	movl	20(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	28(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	52(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 69 */
+	xorl	8(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,20(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	3395469782(%eax,%esi,1),%eax
+	movl	24(%esp),%esi
+	xorl	%edx,%edi
+	xorl	32(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	56(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 70 */
+	xorl	12(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,24(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	3395469782(%ebp,%esi,1),%ebp
+	movl	28(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	36(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	60(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 71 */
+	xorl	16(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,28(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	3395469782(%edx,%esi,1),%edx
+	movl	32(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	40(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 72 */
+	xorl	20(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	movl	%esi,32(%esp)
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	3395469782(%ecx,%esi,1),%ecx
+	movl	36(%esp),%esi
+	xorl	%eax,%edi
+	xorl	44(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	4(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 73 */
+	xorl	24(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	movl	%esi,36(%esp)
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	3395469782(%ebx,%esi,1),%ebx
+	movl	40(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	48(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	8(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 74 */
+	xorl	28(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	movl	%esi,40(%esp)
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	3395469782(%eax,%esi,1),%eax
+	movl	44(%esp),%esi
+	xorl	%edx,%edi
+	xorl	52(%esp),%esi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	xorl	12(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 75 */
+	xorl	32(%esp),%esi
+	addl	%edi,%eax
+	roll	$1,%esi
+	movl	%edx,%edi
+	movl	%esi,44(%esp)
+	xorl	%ebx,%edi
+	rorl	$2,%ebx
+	leal	3395469782(%ebp,%esi,1),%ebp
+	movl	48(%esp),%esi
+	xorl	%ecx,%edi
+	xorl	56(%esp),%esi
+	addl	%edi,%ebp
+	movl	%eax,%edi
+	xorl	16(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 76 */
+	xorl	36(%esp),%esi
+	addl	%edi,%ebp
+	roll	$1,%esi
+	movl	%ecx,%edi
+	movl	%esi,48(%esp)
+	xorl	%eax,%edi
+	rorl	$2,%eax
+	leal	3395469782(%edx,%esi,1),%edx
+	movl	52(%esp),%esi
+	xorl	%ebx,%edi
+	xorl	60(%esp),%esi
+	addl	%edi,%edx
+	movl	%ebp,%edi
+	xorl	20(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 77 */
+	xorl	40(%esp),%esi
+	addl	%edi,%edx
+	roll	$1,%esi
+	movl	%ebx,%edi
+	xorl	%ebp,%edi
+	rorl	$2,%ebp
+	leal	3395469782(%ecx,%esi,1),%ecx
+	movl	56(%esp),%esi
+	xorl	%eax,%edi
+	xorl	(%esp),%esi
+	addl	%edi,%ecx
+	movl	%edx,%edi
+	xorl	24(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 78 */
+	xorl	44(%esp),%esi
+	addl	%edi,%ecx
+	roll	$1,%esi
+	movl	%eax,%edi
+	xorl	%edx,%edi
+	rorl	$2,%edx
+	leal	3395469782(%ebx,%esi,1),%ebx
+	movl	60(%esp),%esi
+	xorl	%ebp,%edi
+	xorl	4(%esp),%esi
+	addl	%edi,%ebx
+	movl	%ecx,%edi
+	xorl	28(%esp),%esi
+	roll	$5,%edi
+	/* 20_39 79 */
+	xorl	48(%esp),%esi
+	addl	%edi,%ebx
+	roll	$1,%esi
+	movl	%ebp,%edi
+	xorl	%ecx,%edi
+	rorl	$2,%ecx
+	leal	3395469782(%eax,%esi,1),%eax
+	xorl	%edx,%edi
+	addl	%edi,%eax
+	movl	%ebx,%edi
+	roll	$5,%edi
+	addl	%edi,%eax
+	/* Loop trailer */
+	movl	84(%esp),%edi
+	movl	88(%esp),%esi
+	addl	16(%edi),%ebp
+	addl	12(%edi),%edx
+	addl	8(%edi),%ecx
+	addl	4(%edi),%ebx
+	addl	(%edi),%eax
+	addl	$64,%esi
+	movl	%ebp,16(%edi)
+	movl	%edx,12(%edi)
+	cmpl	92(%esp),%esi
+	movl	%ecx,8(%edi)
+	movl	%ebx,4(%edi)
+	movl	%eax,(%edi)
+	jb	.L000loop
+	addl	$64,%esp
+	popl	%edi
+	popl	%esi
+	popl	%ebx
+	popl	%ebp
+	ret
+.L_sha1_block_data_order_end:
+.size	sha1_block_data_order,.L_sha1_block_data_order_end-sha1_block_data_order
+.byte	83,72,65,49,32,98,108,111,99,107,32,116,114,97,110,115,102,111,114,109,32,102,111,114,32,120,56,54,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  5:01           ` George Spelvin
@ 2009-08-04 12:56             ` Jon Smirl
  2009-08-04 14:29               ` Dmitry Potapov
  0 siblings, 1 reply; 60+ messages in thread
From: Jon Smirl @ 2009-08-04 12:56 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

On Tue, Aug 4, 2009 at 1:01 AM, George Spelvin<linux@horizon.com> wrote:
>> Would there happen to be a SHA1 implementation around that can compute
>> the SHA1 without first decompressing the data? Databases gain a lot of
>> speed by using special algorithms that can directly operate on the
>> compressed data.
>
> I can't imagine how.  In general, this requires that the compression
> be carefully designed to be compatible with the algorithms, and SHA1
> is specifically designed to depend on every bit of the input in
> an un-analyzable way.

A simple start would be to feed each byte as it is decompressed
directly into the SHA code and avoid the intermediate buffer. Removing
the buffer reduces cache pressure.

> Also, git normally avoids hashing objects that it doesn't need
> uncompressed for some other reason.  git-fsck is a notable exception,
> but I think the idea of creating special optimized code paths for that
> interferes with its reliability and robustness goals.

Agreed that there is no real need for this, just something to play
with if you are trying for a speed record.

I'd much rather have a solution for the rebase problem where one side
of the diff has moved to a different file and rebase can't figure it
out.




-- 
Jon Smirl
jonsmirl@gmail.com


* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04 12:56             ` Jon Smirl
@ 2009-08-04 14:29               ` Dmitry Potapov
  0 siblings, 0 replies; 60+ messages in thread
From: Dmitry Potapov @ 2009-08-04 14:29 UTC (permalink / raw)
  To: Jon Smirl; +Cc: George Spelvin, git

On Tue, Aug 04, 2009 at 08:56:48AM -0400, Jon Smirl wrote:
> 
> A simple start would be to feed each byte as it is decompressed
> directly into the sha code and avoid the intermediate buffer. Removing
> the buffer reduces cache pressure.

First, you still have to preserve every decoded byte in the compression
window, which is 32KB by default. Typical files in Git repositories are
not that big; many are under 32KB, and practically all of them fit into
the L2 cache of modern processors. Second, the complication of the
assembler code from coupling the two algorithms would be enormous;
there are not enough registers on x86 for SHA-1 alone. Third, SHA-1 is
very computationally intensive and has a predictable (linear) access
pattern, so you do not wait on L2 -- the data will already be in L1.
So, I don't see where you can gain significantly. Perhaps you could
win from rewriting inflate in assembler, but I do not expect any
significant gains beyond that. And coupling has obvious disadvantages
when it comes to maintenance...


Dmitry


* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  8:01           ` George Spelvin
@ 2009-08-04 20:41             ` Junio C Hamano
  2009-08-05 18:17               ` George Spelvin
  0 siblings, 1 reply; 60+ messages in thread
From: Junio C Hamano @ 2009-08-04 20:41 UTC (permalink / raw)
  To: George Spelvin; +Cc: torvalds, git

"George Spelvin" <linux@horizon.com> writes:

> The one question I have is that currently perl is not a critical
> compile-time dependency; it's needed for some extra stuff, but AFAIK you
> can get most of git working without it.  Whether to add that dependency
> or what is a Junio question.

I actually feel a lot more uneasy about applying a patch signed off by
somebody who calls himself George Spelvin, though.

Three classes of people compile git from the source:

 * People who want to be on the bleeding edge and compile git for
   themselves, even though they are on mainstream platforms where they
   could choose distro-packaged one;

 * People who produce binary packages for distribution.

 * People who are on minority platforms and have no other way to get git
   than compiling for themselves;

We do not have to worry about the first two groups of people.  It won't
be too involved for them to install Perl on their system; after all they
are already coping with asciidoc and xmlto ;-)

We can continue shipping the Mozilla one to help the last group.

In the Makefile, we say:

    # Define NO_OPENSSL environment variable if you do not have OpenSSL.
    # This also implies MOZILLA_SHA1.

and with your change, we would start implying STANDALONE_OPENSSL_SHA1
instead.  But if MOZILLA_SHA1 was given explicitly, we could use that.

If they really really really want the extra performance out of statically
linked OpenSSL derivative, they could prepare a preprocessed assembly on
some other machine and use it as the last resort if they do not have/want
Perl.  The situation is exactly the same as the documentation set.  They
are using HTML/man prepared on another machine (namely, mine) as the last
resort if they do not have/want AsciiDoc toolchain.


* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04 20:41             ` Junio C Hamano
@ 2009-08-05 18:17               ` George Spelvin
  2009-08-05 20:36                 ` Johannes Schindelin
                                   ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: George Spelvin @ 2009-08-05 18:17 UTC (permalink / raw)
  To: gitster; +Cc: git, linux, torvalds

> Three classes of people compile git from the source:
>
> * People who want to be on the bleeding edge and compile git for
>   themselves, even though they are on mainstream platforms where they
>   could choose distro-packaged one;
>
> * People who produce binary packages for distribution.
>
> * People who are on minority platforms and have no other way to get git
>   than compiling for themselves;
>
> We do not have to worry about the first two groups of people.  It won't
> be too involved for them to install Perl on their system; after all they
> are already coping with asciidoc and xmlto ;-)

Actually, I'd get rid of the Perl entirely, but I'm not sure whether
the other-assembler-syntax features are needed by the folks on
MacOS X and Windows (msysGit).

> We can continue shipping the Mozilla one to help the last group.

Of course, we always need a C fallback.  Would you like a faster one?

> In the Makefile, we say:
>
>    # Define NO_OPENSSL environment variable if you do not have OpenSSL.
>    # This also implies MOZILLA_SHA1.
>
> and with your change, we would start implying STANDALONE_OPENSSL_SHA1
> instead.  But if MOZILLA_SHA1 was given explicitly, we could use that.

Well, I'd really like to auto-detect the processor.  Current gcc's
"gcc -v" output includes a "Target: " line that will do nicely.  I can,
of course, fall back to C if it fails, but is there a significant user
base using a non-GCC compiler?


* Re: x86 SHA1: Faster than OpenSSL
  2009-08-05 18:17               ` George Spelvin
@ 2009-08-05 20:36                 ` Johannes Schindelin
  2009-08-05 20:44                 ` Junio C Hamano
  2009-08-05 20:55                 ` Linus Torvalds
  2 siblings, 0 replies; 60+ messages in thread
From: Johannes Schindelin @ 2009-08-05 20:36 UTC (permalink / raw)
  To: George Spelvin; +Cc: gitster, git, torvalds

Hi,

On Wed, 5 Aug 2009, George Spelvin wrote:

> > Three classes of people compile git from the source:
> >
> > * People who want to be on the bleeding edge and compile git for
> >   themselves, even though they are on mainstream platforms where they
> >   could choose distro-packaged one;
> >
> > * People who produce binary packages for distribution.
> >
> > * People who are on minority platforms and have no other way to get git
> >   than compiling for themselves;
> >
> > We do not have to worry about the first two groups of people.  It won't
> > be too involved for them to install Perl on their system; after all they
> > are already coping with asciidoc and xmlto ;-)
> 
> Actually, I'd get rid of the Perl entirely, but I'm not sure whether
> the other-assembler-syntax features are needed by the folks on
> MacOS X and Windows (msysGit).

Don't worry for MacOSX and msysGit (or Cygwin, for that matter): all of 
them use GCC.

> > We can continue shipping the Mozilla one to help the last group.
> 
> Of course, we always need a C fallback.  Would you like a faster one?

Is that a trick question?

:-)

> > In the Makefile, we say:
> >
> >    # Define NO_OPENSSL environment variable if you do not have OpenSSL.
> >    # This also implies MOZILLA_SHA1.
> >
> > and with your change, we would start implying STANDALONE_OPENSSL_SHA1
> > instead.  But if MOZILLA_SHA1 was given explicitly, we could use that.
> 
> Well, I'd really like to auto-detect the processor.  Current gcc's
> "gcc -v" output includes a "Target: " line that will do nicely.  I can,
> of course, fall back to C if it fails, but is there a significant user
> base using a non-GCC compiler?

Do you really want to determine which processor to optimize for at compile 
time?  Build system and target system are often different...

Ciao,
Dscho


* Re: x86 SHA1: Faster than OpenSSL
  2009-08-05 18:17               ` George Spelvin
  2009-08-05 20:36                 ` Johannes Schindelin
@ 2009-08-05 20:44                 ` Junio C Hamano
  2009-08-05 20:55                 ` Linus Torvalds
  2 siblings, 0 replies; 60+ messages in thread
From: Junio C Hamano @ 2009-08-05 20:44 UTC (permalink / raw)
  To: George Spelvin; +Cc: git, torvalds

"George Spelvin" <linux@horizon.com> writes:

>> We can continue shipping the Mozilla one to help the last group.
>
> Of course, we always need a C fallback.  Would you like a faster one?

No.  I'd rather keep the tried and tested one while a better
alternative is in a work-in-progress state.


* Re: x86 SHA1: Faster than OpenSSL
  2009-08-05 18:17               ` George Spelvin
  2009-08-05 20:36                 ` Johannes Schindelin
  2009-08-05 20:44                 ` Junio C Hamano
@ 2009-08-05 20:55                 ` Linus Torvalds
  2009-08-05 23:13                   ` Linus Torvalds
  2 siblings, 1 reply; 60+ messages in thread
From: Linus Torvalds @ 2009-08-05 20:55 UTC (permalink / raw)
  To: George Spelvin; +Cc: gitster, git



On Wed, 5 Aug 2009, George Spelvin wrote:
> 
> > We can continue shipping the Mozilla one to help the last group.
> 
> Of course, we always need a C fallback.  Would you like a faster one?

I actually looked at code generation (on x86-64) for the C fallback, and 
it should be quite doable to re-write the C one to generate good code on 
x86-64.

On 32-bit x86, I suspect the register pressures are so intense that it's 
unrealistic to expect gcc to do a good job, but the Mozilla SHA1 C code 
really seems _designed_ to be slow in stupid ways (that whole "byte at a 
time into a word buffer with shifts" is a really really sucky way to 
handle the endianness issues).

So if you'd like to look at the C version, that's definitely worth it. 
Much bigger bang for the buck than trying to schedule asm language and 
having to deal with different assemblers/linkers/whatnot.

		Linus


* Re: x86 SHA1: Faster than OpenSSL
  2009-08-05 20:55                 ` Linus Torvalds
@ 2009-08-05 23:13                   ` Linus Torvalds
  2009-08-06  1:18                     ` Linus Torvalds
  0 siblings, 1 reply; 60+ messages in thread
From: Linus Torvalds @ 2009-08-05 23:13 UTC (permalink / raw)
  To: George Spelvin; +Cc: gitster, git



On Wed, 5 Aug 2009, Linus Torvalds wrote:
> 
> I actually looked at code generation (on x86-64) for the C fallback, and 
> it should be quite doable to re-write the C one to generate good code on 
> x86-64.

Ok, here's a try.

It's based on the mozilla SHA1 code, but with quite a bit of surgery. 
Enable with "make BLK_SHA1=1".

Timings for "git fsck --full" on the git directory:

 - Mozilla SHA1 portable C-code (sucky sucky): MOZILLA_SHA1=1

	real	0m38.194s
	user	0m37.838s
	sys	0m0.356s

 - This code ("half-portable C code"): BLK_SHA1=1

	real	0m28.120s
	user	0m27.930s
	sys	0m0.192s

 - OpenSSL assembler code:

	real	0m26.327s
	user	0m26.194s
	sys	0m0.136s

ie this is slightly slower than the OpenSSL SHA1 routines, but that's only 
true on something very SHA1-intensive like "git fsck", and this is 
_almost_ portable code. I say "almost" because it really does require that 
we can do unaligned word loads, and do a good job of 'htonl()', and it 
assumes that 'unsigned int' is 32-bit (the latter would be easy to change 
by using 'uint32_t', but since it's not the relevant portability issue, I 
don't think it matters).

In other words, unlike the Mozilla SHA1, this one doesn't suck. It's 
certainly not great either, but it's probably good enough in practice, 
without the headaches of actually making people use an assembler version.

And maybe somebody can see how to improve it further?

		Linus
---
From: Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH] Add new optimized C 'block-sha1' routines

Based on the mozilla SHA1 routine, but doing the input data accesses a
word at a time and with 'htonl()' instead of loading bytes and shifting.

It requires an architecture that is ok with unaligned 32-bit loads and a
fast htonl().

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 Makefile          |    9 +++
 block-sha1/sha1.c |  145 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 block-sha1/sha1.h |   21 ++++++++
 3 files changed, 175 insertions(+), 0 deletions(-)

diff --git a/Makefile b/Makefile
index d7669b1..f12024c 100644
--- a/Makefile
+++ b/Makefile
@@ -84,6 +84,10 @@ all::
 # specify your own (or DarwinPort's) include directories and
 # library directories by defining CFLAGS and LDFLAGS appropriately.
 #
+# Define BLK_SHA1 environment variable if you want the C version
+# of the SHA1 that assumes you can do unaligned 32-bit loads and
+# have a fast htonl() function.
+#
 # Define PPC_SHA1 environment variable when running make to make use of
 # a bundled SHA1 routine optimized for PowerPC.
 #
@@ -1166,6 +1170,10 @@ ifdef NO_DEFLATE_BOUND
 	BASIC_CFLAGS += -DNO_DEFLATE_BOUND
 endif
 
+ifdef BLK_SHA1
+	SHA1_HEADER = "block-sha1/sha1.h"
+	LIB_OBJS += block-sha1/sha1.o
+else
 ifdef PPC_SHA1
 	SHA1_HEADER = "ppc/sha1.h"
 	LIB_OBJS += ppc/sha1.o ppc/sha1ppc.o
@@ -1183,6 +1191,7 @@ else
 endif
 endif
 endif
+endif
 ifdef NO_PERL_MAKEMAKER
 	export NO_PERL_MAKEMAKER
 endif
diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
new file mode 100644
index 0000000..8fd90b0
--- /dev/null
+++ b/block-sha1/sha1.c
@@ -0,0 +1,145 @@
+/*
+ * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.c),
+ * optimized to do word accesses rather than byte accesses,
+ * and to avoid unnecessary copies into the context array.
+ */
+
+#include <string.h>
+#include <arpa/inet.h>
+
+#include "sha1.h"
+
+/* Hash one 64-byte block of data */
+static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data);
+
+void blk_SHA1_Init(blk_SHA_CTX *ctx)
+{
+	ctx->lenW = 0;
+	ctx->size = 0;
+
+	/* Initialize H with the magic constants (see FIPS180 for constants)
+	 */
+	ctx->H[0] = 0x67452301;
+	ctx->H[1] = 0xefcdab89;
+	ctx->H[2] = 0x98badcfe;
+	ctx->H[3] = 0x10325476;
+	ctx->H[4] = 0xc3d2e1f0;
+}
+
+
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, int len)
+{
+	int lenW = ctx->lenW;
+
+	ctx->size += len << 3;
+
+	/* Read the data into W and process blocks as they get full
+	 */
+	if (lenW) {
+		int left = 64 - lenW;
+		if (len < left)
+			left = len;
+		memcpy(lenW + (char *)ctx->W, data, left);
+		lenW = (lenW + left) & 63;
+		len -= left;
+		data += left;
+		ctx->lenW = lenW;
+		if (lenW)
+			return;
+		blk_SHA1Block(ctx, ctx->W);
+	}
+	while (len >= 64) {
+		blk_SHA1Block(ctx, data);
+		data += 64;
+		len -= 64;
+	}
+	if (len) {
+		memcpy(ctx->W, data, len);
+		ctx->lenW = len;
+	}
+}
+
+
+void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
+{
+	static const unsigned char pad[64] = { 0x80 };
+	unsigned int padlen[2];
+	int i;
+
+	/* Pad with a binary 1 (ie 0x80), then zeroes, then length
+	 */
+	padlen[0] = htonl(ctx->size >> 32);
+	padlen[1] = htonl(ctx->size);
+
+	blk_SHA1_Update(ctx, pad, 1+ (63 & (55 - ctx->lenW)));
+	blk_SHA1_Update(ctx, padlen, 8);
+
+	/* Output hash
+	 */
+	for (i = 0; i < 5; i++)
+		((unsigned int *)hashout)[i] = htonl(ctx->H[i]);
+}
+
+#define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n))))
+
+static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
+{
+	int t;
+	unsigned int A,B,C,D,E,TEMP;
+	unsigned int W[80];
+
+	for (t = 0; t < 16; t++)
+		W[t] = htonl(data[t]);
+
+	/* Unroll it? */
+	for (t = 16; t <= 79; t++)
+		W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
+
+	A = ctx->H[0];
+	B = ctx->H[1];
+	C = ctx->H[2];
+	D = ctx->H[3];
+	E = ctx->H[4];
+
+#define T_0_19(t) \
+	TEMP = SHA_ROT(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
+	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+
+	T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4);
+	T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9);
+	T_0_19(10); T_0_19(11); T_0_19(12); T_0_19(13); T_0_19(14);
+	T_0_19(15); T_0_19(16); T_0_19(17); T_0_19(18); T_0_19(19);
+
+#define T_20_39(t) \
+	TEMP = SHA_ROT(A,5) + (B^C^D)           + E + W[t] + 0x6ed9eba1; \
+	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+
+	T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
+	T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29);
+	T_20_39(30); T_20_39(31); T_20_39(32); T_20_39(33); T_20_39(34);
+	T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39);
+
+#define T_40_59(t) \
+	TEMP = SHA_ROT(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \
+	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+
+	T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
+	T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49);
+	T_40_59(50); T_40_59(51); T_40_59(52); T_40_59(53); T_40_59(54);
+	T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59);
+
+#define T_60_79(t) \
+	TEMP = SHA_ROT(A,5) + (B^C^D)           + E + W[t] + 0xca62c1d6; \
+	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+
+	T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64);
+	T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69);
+	T_60_79(70); T_60_79(71); T_60_79(72); T_60_79(73); T_60_79(74);
+	T_60_79(75); T_60_79(76); T_60_79(77); T_60_79(78); T_60_79(79);
+
+	ctx->H[0] += A;
+	ctx->H[1] += B;
+	ctx->H[2] += C;
+	ctx->H[3] += D;
+	ctx->H[4] += E;
+}
diff --git a/block-sha1/sha1.h b/block-sha1/sha1.h
new file mode 100644
index 0000000..dbc719f
--- /dev/null
+++ b/block-sha1/sha1.h
@@ -0,0 +1,21 @@
+/*
+ * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.h),
+ * optimized to do word accesses rather than byte accesses,
+ * and to avoid unnecessary copies into the context array.
+ */
+
+typedef struct {
+	unsigned int H[5];
+	unsigned int W[16];
+	int lenW;
+	unsigned long long size;
+} blk_SHA_CTX;
+
+void blk_SHA1_Init(blk_SHA_CTX *ctx);
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *dataIn, int len);
+void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx);
+
+#define git_SHA_CTX	blk_SHA_CTX
+#define git_SHA1_Init	blk_SHA1_Init
+#define git_SHA1_Update	blk_SHA1_Update
+#define git_SHA1_Final	blk_SHA1_Final

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-05 23:13                   ` Linus Torvalds
@ 2009-08-06  1:18                     ` Linus Torvalds
  2009-08-06  1:52                       ` Nicolas Pitre
  2009-08-06 18:49                       ` Erik Faye-Lund
  0 siblings, 2 replies; 60+ messages in thread
From: Linus Torvalds @ 2009-08-06  1:18 UTC (permalink / raw)
  To: George Spelvin; +Cc: gitster, git



On Wed, 5 Aug 2009, Linus Torvalds wrote:
> 
> Timings for "git fsck --full" on the git directory:
> 
>  - Mozilla SHA1 portable C-code (sucky sucky): MOZILLA_SHA1=1
> 
> 	real	0m38.194s
> 	user	0m37.838s
> 	sys	0m0.356s
> 
>  - This code ("half-portable C code"): BLK_SHA1=1
> 
> 	real	0m28.120s
> 	user	0m27.930s
> 	sys	0m0.192s
> 
>  - OpenSSL assembler code:
> 
> 	real	0m26.327s
> 	user	0m26.194s
> 	sys	0m0.136s

Ok, I installed the 32-bit libraries too, to see what it looks like for 
that case. As expected, the compiler is not able to do a great job due to 
it being somewhat register starved, but on the other hand, the old Mozilla 
code did even worse, so..

 - Mozilla SHA:

	real	0m47.063s
	user	0m46.815s
	sys	0m0.252s

 - BLK_SHA1=1

	real	0m34.705s
	user	0m34.394s
	sys	0m0.312s

 - OPENSSL:

	real	0m29.754s
	user	0m29.446s
	sys	0m0.288s

so the tuned asm from OpenSSL does kick ass, but the C code version isn't 
_that_ far away. It's quite a reasonable alternative if you don't have the 
OpenSSL libraries installed, for example.

I note that MINGW does NO_OPENSSL by default, for example, and maybe the 
MINGW people want to test the patch out and enable BLK_SHA1 rather than 
the original Mozilla one.

But while looking at 32-bit issues, I noticed that I really should also 
cast 'len' when shifting it. Otherwise the thing is limited to fairly 
small areas (28 bits - 256MB). This is not just a 32-bit problem ("int" is 
a signed 32-bit thing even in a 64-bit build), but I only noticed it when 
looking at 32-bit issues.

So here's an incremental patch to fix that. 

		Linus

---
 block-sha1/sha1.c |    4 ++--
 block-sha1/sha1.h |    2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 8fd90b0..eef32f7 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -27,11 +27,11 @@ void blk_SHA1_Init(blk_SHA_CTX *ctx)
 }
 
 
-void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, int len)
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
 {
 	int lenW = ctx->lenW;
 
-	ctx->size += len << 3;
+	ctx->size += (unsigned long long) len << 3;
 
 	/* Read the data into W and process blocks as they get full
 	 */
diff --git a/block-sha1/sha1.h b/block-sha1/sha1.h
index dbc719f..7be2d93 100644
--- a/block-sha1/sha1.h
+++ b/block-sha1/sha1.h
@@ -12,7 +12,7 @@ typedef struct {
 } blk_SHA_CTX;
 
 void blk_SHA1_Init(blk_SHA_CTX *ctx);
-void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *dataIn, int len);
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *dataIn, unsigned long len);
 void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx);
 
 #define git_SHA_CTX	blk_SHA_CTX

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  1:18                     ` Linus Torvalds
@ 2009-08-06  1:52                       ` Nicolas Pitre
  2009-08-06  2:04                         ` Junio C Hamano
  2009-08-06  2:08                         ` Linus Torvalds
  2009-08-06 18:49                       ` Erik Faye-Lund
  1 sibling, 2 replies; 60+ messages in thread
From: Nicolas Pitre @ 2009-08-06  1:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: George Spelvin, Junio C Hamano, git

On Wed, 5 Aug 2009, Linus Torvalds wrote:

> But while looking at 32-bit issues, I noticed that I really should also 
> cast 'len' when shifting it. Otherwise the thing is limited to fairly 
> small areas (28 bits - 256MB). This is not just a 32-bit problem ("int" is 
> a signed 32-bit thing even in a 64-bit build), but I only noticed it when 
> looking at 32-bit issues.

Even better is to not shift len at all in SHA_update() but to shift 
ctx->size only at the end in SHA_final().  It is not as if SHA_update() 
could operate on partial bytes, so counting total bytes instead of total 
bits is all you need.  This way you need no cast there, and the code is 
slightly faster.


Nicolas

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  1:52                       ` Nicolas Pitre
@ 2009-08-06  2:04                         ` Junio C Hamano
  2009-08-06  2:10                           ` Linus Torvalds
  2009-08-06  2:20                           ` Nicolas Pitre
  2009-08-06  2:08                         ` Linus Torvalds
  1 sibling, 2 replies; 60+ messages in thread
From: Junio C Hamano @ 2009-08-06  2:04 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, George Spelvin, git

Nicolas Pitre <nico@cam.org> writes:

> On Wed, 5 Aug 2009, Linus Torvalds wrote:
>
>> But while looking at 32-bit issues, I noticed that I really should also 
>> cast 'len' when shifting it. Otherwise the thing is limited to fairly 
>> small areas (28 bits - 256MB). This is not just a 32-bit problem ("int" is 
>> a signed 32-bit thing even in a 64-bit build), but I only noticed it when 
>> looking at 32-bit issues.
>
> Even better is to not shift len at all in SHA_update() but shift 
> ctx->size only at the end in SHA_final().  It is not like if 
> SHA_update() could operate on partial bytes, so counting total bytes 
> instead of total bits is all you need.  This way you need no cast there 
> and make the code slightly faster.

Like this?

By the way, the Mozilla one calls Init at the end of Final but block-sha1
doesn't.  I do not think it matters for our callers, but on the other hand
Final is not a performance-critical path, nor is Init heavy, so it may not
be a bad idea to imitate that as well.  Or am I missing something?

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index eef32f7..8293f7b 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -31,7 +31,7 @@ void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
 {
 	int lenW = ctx->lenW;
 
-	ctx->size += (unsigned long long) len << 3;
+	ctx->size += (unsigned long long) len;
 
 	/* Read the data into W and process blocks as they get full
 	 */
@@ -68,6 +68,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
 
 	/* Pad with a binary 1 (ie 0x80), then zeroes, then length
 	 */
+	ctx->size <<= 3; /* bytes to bits */
 	padlen[0] = htonl(ctx->size >> 32);
 	padlen[1] = htonl(ctx->size);
 

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  1:52                       ` Nicolas Pitre
  2009-08-06  2:04                         ` Junio C Hamano
@ 2009-08-06  2:08                         ` Linus Torvalds
  2009-08-06  3:19                           ` Artur Skawina
  1 sibling, 1 reply; 60+ messages in thread
From: Linus Torvalds @ 2009-08-06  2:08 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: George Spelvin, Junio C Hamano, git



On Wed, 5 Aug 2009, Nicolas Pitre wrote:
> 
> Even better is to not shift len at all in SHA_update() but shift 
> ctx->size only at the end in SHA_final().  It is not like if 
> SHA_update() could operate on partial bytes, so counting total bytes 
> instead of total bits is all you need.  This way you need no cast there 
> and make the code slightly faster.

Yeah, I tried it, but it's not noticeable.

The bigger issue seems to be that it's shifter-limited, or that's what I 
take away from my profiles. I suspect it's even _more_ shifter-limited on 
some other micro-architectures, because gcc is being stupid, and generates

	ror $31,%eax

from the "left shift + right shift" combination. It seems to -always- 
generate a "ror", rather than trying to generate 'rol' if the shift count 
would be smaller that way.

And I know _some_ old micro-architectures will literally internally loop 
on the rol/ror counts, so "ror $31" can be _much_ more expensive than "rol 
$1".

That isn't the case on my Nehalem, though. But I can't seem to get gcc to 
generate better code without actually using inline asm..

(So to clarify: this patch makes no difference that I can see to 
performance, but I suspect it could matter on other CPU's like an old 
Pentium or maybe an Atom).

		Linus

---
 block-sha1/sha1.c |   36 ++++++++++++++++++++++++------------
 block-sha1/sha1.h |    2 +-
 2 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 8fd90b0..a45a3de 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -80,7 +80,19 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
 		((unsigned int *)hashout)[i] = htonl(ctx->H[i]);
 }
 
-#define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n))))
+#if defined(__i386__) || defined(__x86_64__)
+
+#define SHA_ASM(op, x, n) ({ unsigned int __res; asm(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; })
+#define SHA_ROL(x,n)	SHA_ASM("rol", x, n)
+#define SHA_ROR(x,n)	SHA_ASM("ror", x, n)
+
+#else
+
+#define SHA_ROT(X,l,r)	(((X) << (l)) | ((X) >> (r)))
+#define SHA_ROL(X,n)	SHA_ROT(X,n,32-(n))
+#define SHA_ROR(X,n)	SHA_ROT(X,32-(n),n)
+
+#endif
 
 static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 {
@@ -93,7 +105,7 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 
 	/* Unroll it? */
 	for (t = 16; t <= 79; t++)
-		W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
+		W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
 
 	A = ctx->H[0];
 	B = ctx->H[1];
@@ -102,8 +114,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	E = ctx->H[4];
 
 #define T_0_19(t) \
-	TEMP = SHA_ROT(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
-	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+	TEMP = SHA_ROL(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
 	T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4);
 	T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9);
@@ -111,8 +123,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	T_0_19(15); T_0_19(16); T_0_19(17); T_0_19(18); T_0_19(19);
 
 #define T_20_39(t) \
-	TEMP = SHA_ROT(A,5) + (B^C^D)           + E + W[t] + 0x6ed9eba1; \
-	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+	TEMP = SHA_ROL(A,5) + (B^C^D)           + E + W[t] + 0x6ed9eba1; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
 	T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
 	T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29);
@@ -120,8 +132,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39);
 
 #define T_40_59(t) \
-	TEMP = SHA_ROT(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \
-	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+	TEMP = SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
 	T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
 	T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49);
@@ -129,8 +141,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59);
 
 #define T_60_79(t) \
-	TEMP = SHA_ROT(A,5) + (B^C^D)           + E + W[t] + 0xca62c1d6; \
-	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
+	TEMP = SHA_ROL(A,5) + (B^C^D)           + E + W[t] + 0xca62c1d6; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
 	T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64);
 	T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69);

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  2:04                         ` Junio C Hamano
@ 2009-08-06  2:10                           ` Linus Torvalds
  2009-08-06  2:20                           ` Nicolas Pitre
  1 sibling, 0 replies; 60+ messages in thread
From: Linus Torvalds @ 2009-08-06  2:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nicolas Pitre, George Spelvin, git



On Wed, 5 Aug 2009, Junio C Hamano wrote:
> 
> Like this?

No, combine it with the other shifts:

Yes:

> -	ctx->size += (unsigned long long) len << 3;
> +	ctx->size += (unsigned long long) len;

No:

> +	ctx->size <<= 3; /* bytes to bits */
>  	padlen[0] = htonl(ctx->size >> 32);
>  	padlen[1] = htonl(ctx->size);

Do

	padlen[0] = htonl(ctx->size >> 29);
	padlen[1] = htonl(ctx->size << 3);

instead. Or whatever.

		Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  2:04                         ` Junio C Hamano
  2009-08-06  2:10                           ` Linus Torvalds
@ 2009-08-06  2:20                           ` Nicolas Pitre
  1 sibling, 0 replies; 60+ messages in thread
From: Nicolas Pitre @ 2009-08-06  2:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, George Spelvin, git

On Wed, 5 Aug 2009, Junio C Hamano wrote:

> Nicolas Pitre <nico@cam.org> writes:
> 
> > On Wed, 5 Aug 2009, Linus Torvalds wrote:
> >
> >> But while looking at 32-bit issues, I noticed that I really should also 
> >> cast 'len' when shifting it. Otherwise the thing is limited to fairly 
> >> small areas (28 bits - 256MB). This is not just a 32-bit problem ("int" is 
> >> a signed 32-bit thing even in a 64-bit build), but I only noticed it when 
> >> looking at 32-bit issues.
> >
> > Even better is to not shift len at all in SHA_update() but shift 
> > ctx->size only at the end in SHA_final().  It is not like if 
> > SHA_update() could operate on partial bytes, so counting total bytes 
> > instead of total bits is all you need.  This way you need no cast there 
> > and make the code slightly faster.
> 
> Like this?

Almost (see below).

> By the way, Mozilla one calls Init at the end of Final but block-sha1
> doesn't.  I do not think it matters for our callers, but on the other hand
> FInal is not performance critical part nor Init is heavy, so it may not be
> a bad idea to imitate them as well.  Or am I missing something?

It is done only to make sure potentially crypto sensitive information is 
wiped out of the ctx structure instance.  In our case we have no such 
concerns.

> diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
> index eef32f7..8293f7b 100644
> --- a/block-sha1/sha1.c
> +++ b/block-sha1/sha1.c
> @@ -31,7 +31,7 @@ void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
>  {
>  	int lenW = ctx->lenW;
>  
> -	ctx->size += (unsigned long long) len << 3;
> +	ctx->size += (unsigned long long) len;

You can get rid of the cast as well now.

>  	/* Read the data into W and process blocks as they get full
>  	 */
> @@ -68,6 +68,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
>  
>  	/* Pad with a binary 1 (ie 0x80), then zeroes, then length
>  	 */
> +	ctx->size <<= 3; /* bytes to bits */
>  	padlen[0] = htonl(ctx->size >> 32);
>  	padlen[1] = htonl(ctx->size);

Instead, I'd do:

	padlen[0] = htonl(ctx->size >> (32 - 3));
	padlen[1] = htonl(ctx->size << 3);

That would eliminate a redundant write back of ctx->size.


Nicolas

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  2:08                         ` Linus Torvalds
@ 2009-08-06  3:19                           ` Artur Skawina
  2009-08-06  3:31                             ` Linus Torvalds
  0 siblings, 1 reply; 60+ messages in thread
From: Artur Skawina @ 2009-08-06  3:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git

Linus Torvalds wrote:
> 
> The bigger issue seems to be that it's shifter-limited, or that's what I 
> take away from my profiles. I suspect it's even _more_ shifter-limited on 
> some other micro-architectures, because gcc is being stupid, and generates
> 
> 	ror $31,%eax
> 
> from the "left shift + right shift" combination. It seems to -always- 
> generate a "ror", rather than trying to generate 'rot' if the shift count 
> would be smaller that way.
> 
> And I know _some_ old micro-architectures will literally internally loop 
> on the rol/ror counts, so "ror $31" can be _much_ more expensive than "rol 
> $1".
> 
> That isn't the case on my Nehalem, though. But I can't seem to get gcc to 
> generate better code without actually using inline asm..

The compiler does the right thing w/ something like this:

+#if __GNUC__>1 && defined(__i386)
+#define SHA_ROT(data,bits) ({ \
+  unsigned d = (data); \
+  if (bits<16) \
+    __asm__ ("roll %1,%0" : "=r" (d) : "I" (bits), "0" (d)); \
+  else \
+    __asm__ ("rorl %1,%0" : "=r" (d) : "I" (32-bits), "0" (d)); \
+  d; \
+  })
+#else
 #define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n))))
+#endif
 
which doesn't obfuscate the code as much.
(I needed the asm on p4 anyway, as w/o it the mozilla version is even
 slower than an rfc3174 one. rol vs ror makes no measurable difference)

>  static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
>  {
> @@ -93,7 +105,7 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
>  
>  	/* Unroll it? */
>  	for (t = 16; t <= 79; t++)
> -		W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
> +		W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);

unrolling this once (but not more) is a win, at least on p4.

>  #define T_0_19(t) \
> -	TEMP = SHA_ROT(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
> -	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
> +	TEMP = SHA_ROL(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
> +	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
>  
>  	T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4);
>  	T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9);

unrolling these otoh is a clear loss (iirc ~10%). 

artur

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  3:19                           ` Artur Skawina
@ 2009-08-06  3:31                             ` Linus Torvalds
  2009-08-06  3:48                               ` Linus Torvalds
  2009-08-06  4:08                               ` Artur Skawina
  0 siblings, 2 replies; 60+ messages in thread
From: Linus Torvalds @ 2009-08-06  3:31 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git



On Thu, 6 Aug 2009, Artur Skawina wrote:
> 
> >  #define T_0_19(t) \
> > -	TEMP = SHA_ROT(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
> > -	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
> > +	TEMP = SHA_ROL(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
> > +	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
> >  
> >  	T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4);
> >  	T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9);
> 
> unrolling these otoh is a clear loss (iirc ~10%). 

I can well imagine. The P4 decode bandwidth is abysmal unless you get 
things into the trace cache, and the trace cache is of a very limited 
size.

However, on at least Nehalem, unrolling it all is quite a noticeable win.

The way it's written, I can easily make it do one or the other by just 
turning the macro inside a loop (and we can have a preprocessor flag to 
choose one or the other), but let me work on it a bit more first.

I'm trying to move the htonl() inside the loops (the same way I suggested 
George do with his assembly), and it seems to help a tiny bit. But I may 
be measuring noise.

However, right now my biggest profile hit is on this irritating loop:

	/* Unroll it? */
	for (t = 16; t <= 79; t++)
		W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);

and I haven't been able to move _that_ into the other iterations yet.

Here's my micro-optimization update. It does the first 16 rounds (of the 
first 20-round thing) specially, and takes the data directly from the 
input array. I'm _this_ close to breaking the 28s second barrier on 
git-fsck, but not quite yet.

			Linus

---
From: Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH] block-sha1: make the 'ntohl()' part of the first SHA1 loop

This helps a teeny bit.  But what I -really- want to do is to avoid the
whole 80-array loop, and do the xor updates as I go along..

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 block-sha1/sha1.c |   28 ++++++++++++++++------------
 1 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index a45a3de..39a5bbb 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -100,27 +100,31 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	unsigned int A,B,C,D,E,TEMP;
 	unsigned int W[80];
 
-	for (t = 0; t < 16; t++)
-		W[t] = htonl(data[t]);
-
-	/* Unroll it? */
-	for (t = 16; t <= 79; t++)
-		W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
-
 	A = ctx->H[0];
 	B = ctx->H[1];
 	C = ctx->H[2];
 	D = ctx->H[3];
 	E = ctx->H[4];
 
-#define T_0_19(t) \
+#define T_0_15(t) \
+	TEMP = htonl(data[t]); W[t] = TEMP; \
+	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D)     + E + 0x5a827999; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
+
+	T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4);
+	T_0_15( 5); T_0_15( 6); T_0_15( 7); T_0_15( 8); T_0_15( 9);
+	T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14);
+	T_0_15(15);
+
+	/* Unroll it? */
+	for (t = 16; t <= 79; t++)
+		W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
+
+#define T_16_19(t) \
 	TEMP = SHA_ROL(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
 	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
-	T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4);
-	T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9);
-	T_0_19(10); T_0_19(11); T_0_19(12); T_0_19(13); T_0_19(14);
-	T_0_19(15); T_0_19(16); T_0_19(17); T_0_19(18); T_0_19(19);
+	T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19);
 
 #define T_20_39(t) \
 	TEMP = SHA_ROL(A,5) + (B^C^D)           + E + W[t] + 0x6ed9eba1; \

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  3:31                             ` Linus Torvalds
@ 2009-08-06  3:48                               ` Linus Torvalds
  2009-08-06  4:01                                 ` Linus Torvalds
  2009-08-06  4:52                                 ` George Spelvin
  2009-08-06  4:08                               ` Artur Skawina
  1 sibling, 2 replies; 60+ messages in thread
From: Linus Torvalds @ 2009-08-06  3:48 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git



On Wed, 5 Aug 2009, Linus Torvalds wrote:
> 
> However, right now my biggest profile hit is on this irritating loop:
> 
> 	/* Unroll it? */
> 	for (t = 16; t <= 79; t++)
> 		W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
> 
> and I haven't been able to move _that_ into the other iterations yet.

Oh yes I have.

Here's the patch that gets me sub-28s git-fsck times. In fact, it gives me 
sub-27s times. In fact, it's really close to the OpenSSL times.

And all using plain C.

Again - this is all on x86-64. I suspect 32-bit code ends up having 
spills due to register pressure. That said, I did get rid of that big 
temporary array, and it now basically only uses that 512-bit array as one 
circular queue.

		Linus

PS. Ok, so my definition of "plain C" is a bit odd. There's nothing plain 
about it. It's disgusting C preprocessor misuse. But dang, it's kind of 
fun to abuse the compiler this way.

---
 block-sha1/sha1.c |   28 ++++++++++++++++------------
 1 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index 39a5bbb..80193d4 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -96,9 +96,8 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
 
 static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 {
-	int t;
 	unsigned int A,B,C,D,E,TEMP;
-	unsigned int W[80];
+	unsigned int array[16];
 
 	A = ctx->H[0];
 	B = ctx->H[1];
@@ -107,8 +106,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	E = ctx->H[4];
 
 #define T_0_15(t) \
-	TEMP = htonl(data[t]); W[t] = TEMP; \
-	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D)     + E + 0x5a827999; \
+	TEMP = htonl(data[t]); array[t] = TEMP; \
+	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \
 	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
 
 	T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4);
@@ -116,18 +115,21 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14);
 	T_0_15(15);
 
-	/* Unroll it? */
-	for (t = 16; t <= 79; t++)
-		W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
+/* This "rolls" over the 512-bit array */
+#define W(x) (array[(x)&15])
+#define SHA_XOR(t) \
+	TEMP = SHA_ROL(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1); W(t) = TEMP;
 
 #define T_16_19(t) \
-	TEMP = SHA_ROL(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
-	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
+	SHA_XOR(t); \
+	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
 
 	T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19);
 
 #define T_20_39(t) \
-	TEMP = SHA_ROL(A,5) + (B^C^D)           + E + W[t] + 0x6ed9eba1; \
+	SHA_XOR(t); \
+	TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0x6ed9eba1; \
 	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
 	T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
@@ -136,7 +138,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39);
 
 #define T_40_59(t) \
-	TEMP = SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + W[t] + 0x8f1bbcdc; \
+	SHA_XOR(t); \
+	TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + 0x8f1bbcdc; \
 	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
 	T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
@@ -145,7 +148,8 @@ static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned int *data)
 	T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59);
 
 #define T_60_79(t) \
-	TEMP = SHA_ROL(A,5) + (B^C^D)           + E + W[t] + 0xca62c1d6; \
+	SHA_XOR(t); \
+	TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0xca62c1d6; \
 	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
 	T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64);

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  3:48                               ` Linus Torvalds
@ 2009-08-06  4:01                                 ` Linus Torvalds
  2009-08-06  4:28                                   ` Artur Skawina
  2009-08-06  4:52                                 ` George Spelvin
  1 sibling, 1 reply; 60+ messages in thread
From: Linus Torvalds @ 2009-08-06  4:01 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git



On Wed, 5 Aug 2009, Linus Torvalds wrote:
> 
> Here's the patch that gets me sub-28s git-fsck times. In fact, it gives me 
> sub-27s times. In fact, it's really close to the OpenSSL times.

Just to back that up:

 - OpenSSL:

	real	0m26.363s
	user	0m26.174s
	sys	0m0.188s

 - This C implementation:

	real	0m26.594s
	user	0m26.310s
	sys	0m0.256s

so I'm still slower, but now you really have to look closely to see the 
difference. In fact, you have to do multiple runs to make sure, because 
the error bars are bigger than the difference - but openssl definitely 
edges my C code out by a small amount, and the above numbers are fairly 
normal.

		Linus


* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  3:31                             ` Linus Torvalds
  2009-08-06  3:48                               ` Linus Torvalds
@ 2009-08-06  4:08                               ` Artur Skawina
  2009-08-06  4:27                                 ` Linus Torvalds
  1 sibling, 1 reply; 60+ messages in thread
From: Artur Skawina @ 2009-08-06  4:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git

Linus Torvalds wrote:
> 
> On Thu, 6 Aug 2009, Artur Skawina wrote:
>>>  #define T_0_19(t) \
>>> -	TEMP = SHA_ROT(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
>>> -	E = D; D = C; C = SHA_ROT(B, 30); B = A; A = TEMP;
>>> +	TEMP = SHA_ROL(A,5) + (((C^D)&B)^D)     + E + W[t] + 0x5a827999; \
>>> +	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
>>>  
>>>  	T_0_19( 0); T_0_19( 1); T_0_19( 2); T_0_19( 3); T_0_19( 4);
>>>  	T_0_19( 5); T_0_19( 6); T_0_19( 7); T_0_19( 8); T_0_19( 9);
>> unrolling these otoh is a clear loss (iirc ~10%). 
> 
> I can well imagine. The P4 decode bandwidth is abysmal unless you get 
> things into the trace cache, and the trace cache is of a very limited 
> size.
> 
> However, on at least Nehalem, unrolling it all is quite a noticeable win.
> 
> The way it's written, I can easily make it do one or the other by just 
> turning the macro inside a loop (and we can have a preprocessor flag to 
> choose one or the other), but let me work on it a bit more first.

that's of course how i measured it.. :)

> I'm trying to move the htonl() inside the loops (the same way I suggested 
> George do with his assembly), and it seems to help a tiny bit. But I may 
> be measuring noise.

i haven't tried your version at all yet (just applied the rol/ror and
unrolling changes, but neither was a win on p4)

> However, right now my biggest profile hit is on this irritating loop:
> 
> 	/* Unroll it? */
> 	for (t = 16; t <= 79; t++)
> 		W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
> 
> and I haven't been able to move _that_ into the other iterations yet.

i've done that before -- was a small loss -- maybe because of the small
trace cache. deleted that attempt while cleaning up the #if mess, so don't
have the patch, but it was basically

#define newW(t) (W[t] = SHA_ROL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1))

and then s/W[t]/newW(t)/ in rounds 16..79.

I've only tested on p4 and there the winner so far is still:

-  for (t = 16; t <= 79; t++)
+  for (t = 16; t <= 79; t+=2) {
     ctx->W[t] =
-      SHA_ROT(ctx->W[t-3] ^ ctx->W[t-8] ^ ctx->W[t-14] ^ ctx->W[t-16], 1);
+      SHA_ROT(ctx->W[t-16] ^ ctx->W[t-14] ^ ctx->W[t-8] ^ ctx->W[t-3], 1);
+    ctx->W[t+1] =
+      SHA_ROT(ctx->W[t-15] ^ ctx->W[t-13] ^ ctx->W[t-7] ^ ctx->W[t-2], 1);
+  }

> Here's my micro-optimization update. It does the first 16 rounds (of the 
> first 20-round thing) specially, and takes the data directly from the 
> input array. I'm _this_ close to breaking the 28s second barrier on 
> git-fsck, but not quite yet.

tried this before too -- doesn't help. Not much of a surprise --
if unrolling didn't help adding another loop (for rounds 17..20) won't.

artur


* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  4:08                               ` Artur Skawina
@ 2009-08-06  4:27                                 ` Linus Torvalds
  2009-08-06  5:44                                   ` Artur Skawina
  0 siblings, 1 reply; 60+ messages in thread
From: Linus Torvalds @ 2009-08-06  4:27 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git



On Thu, 6 Aug 2009, Artur Skawina wrote:
> > 
> > The way it's written, I can easily make it do one or the other by just 
> > turning the macro inside a loop (and we can have a preprocessor flag to 
> > choose one or the other), but let me work on it a bit more first.
> 
> that's of course how i measured it.. :)

Well, with my "rolling 512-bit array" I can't do that easily any more.

Now it actually depends on the compiler being able to statically do that 
circular list calculation. If I were to turn it back into the chunks of 
loops, my new code would suck, because it would have all those nasty 
dynamic address calculations.
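A minimal sketch of what that means (the function names below are mine, not from the thread): the 16-word circular buffer aliases W[t] onto W[t-16], and the `&15` index math only folds away when the round number is a compile-time constant, i.e. when the rounds are fully unrolled. Inside a loop it becomes exactly the dynamic address calculation described above.

```c
#include <stdint.h>

#define SHA_ROL(x, n) (((x) << (n)) | ((x) >> (32 - (n))))
/* Circular 16-word window: W(t) aliases W(t-16), so each slot is read
 * (as W[t-16]) just before being overwritten with W[t]. With a literal
 * t, (t) & 15 folds at compile time; with a loop variable it doesn't. */
#define W(t) (array[(t) & 15])

/* Reference: full 80-word schedule, computed up front. */
void expand_full(const uint32_t block[16], uint32_t out[80])
{
	int t;
	for (t = 0; t < 16; t++)
		out[t] = block[t];
	for (t = 16; t < 80; t++)
		out[t] = SHA_ROL(out[t-3] ^ out[t-8] ^ out[t-14] ^ out[t-16], 1);
}

/* Rolling version: only 16 words of live state. */
void expand_rolling(const uint32_t block[16], uint32_t out[80])
{
	uint32_t array[16];
	int t;
	for (t = 0; t < 16; t++)
		out[t] = array[t] = block[t];
	for (t = 16; t < 80; t++)
		out[t] = W(t) = SHA_ROL(W(t-3) ^ W(t-8) ^ W(t-14) ^ W(t-16), 1);
}

/* Check that the two expansions agree on an arbitrary block. */
int schedules_agree(void)
{
	uint32_t block[16], a[80], b[80];
	int t;
	for (t = 0; t < 16; t++)
		block[t] = 0x9e3779b9u * (uint32_t)(t + 1);
	expand_full(block, a);
	expand_rolling(block, b);
	for (t = 0; t < 80; t++)
		if (a[t] != b[t])
			return 0;
	return 1;
}
```
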

> I've only tested on p4 and there the winner so far is still:

Yeah, well, I refuse to touch that crappy micro-architecture any more. I 
complained to Intel people for years that their best CPU was only 
available as a laptop chip (Pentium-M), and I'm really happy to have 
gotten rid of all my horrid P4's.

(Ok, so it was great when the P4 ran at 2x the frequency of the 
competition, and then it smoked them all. Except on OS loads, where the P4 
exception handling took ten times longer than anything else).

So I'm a bit biased against P4. 

I'll try it on my Atoms, though. They're pretty crappy CPUs, but they 
have a fairly good _reason_ to be crappy.

			Linus


* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  4:01                                 ` Linus Torvalds
@ 2009-08-06  4:28                                   ` Artur Skawina
  2009-08-06  4:50                                     ` Linus Torvalds
  0 siblings, 1 reply; 60+ messages in thread
From: Artur Skawina @ 2009-08-06  4:28 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git

Linus Torvalds wrote:
> 
> On Wed, 5 Aug 2009, Linus Torvalds wrote:
>> Here's the patch that gets me sub-28s git-fsck times. In fact, it gives me 
>> sub-27s times. In fact, it's really close to the OpenSSL times.
> 
> Just to back that up:
> 
>  - OpenSSL:
> 
> 	real	0m26.363s
> 	user	0m26.174s
> 	sys	0m0.188s
> 
>  - This C implementation:
> 
> 	real	0m26.594s
> 	user	0m26.310s
> 	sys	0m0.256s
> 
> so I'm still slower, but now you really have to look closely to see the 
> difference. In fact, you have to do multiple runs to make sure, because 
> the error bars are bigger than the difference - but openssl definitely 
> edges my C code out by a small amount, and the above numbers are fairly 
> normal.

nice, the p4 microbenchmark #s:

#             TIME[s] SPEED[MB/s]
rfc3174         1.357       44.99
rfc3174         1.352       45.13
mozilla         1.509       40.44
mozillaas       1.133       53.87
linus          0.5818       104.9

so it's more than twice as fast as the mozilla implementation.

artur


* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  4:28                                   ` Artur Skawina
@ 2009-08-06  4:50                                     ` Linus Torvalds
  2009-08-06  5:19                                       ` Artur Skawina
  0 siblings, 1 reply; 60+ messages in thread
From: Linus Torvalds @ 2009-08-06  4:50 UTC (permalink / raw)
  To: Artur Skawina; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git



On Thu, 6 Aug 2009, Artur Skawina wrote:
> 
> #             TIME[s] SPEED[MB/s]
> rfc3174         1.357       44.99
> rfc3174         1.352       45.13
> mozilla         1.509       40.44
> mozillaas       1.133       53.87
> linus          0.5818       104.9
> 
> so it's more than twice as fast as the mozilla implementation.

So that's some general SHA1 benchmark you have?

I hope it tests correctness too. 

Although I can't imagine it being wrong - I've made mistakes (oh, yes, 
many mistakes) when trying to convert the code to something efficient, and 
even the smallest mistake results in 'git fsck' immediately complaining 
about every single object.

But still. I literally haven't tested it any other way (well, the git 
test-suite ends up doing a fair amount of testing too, and I _have_ run 
that).

As to my atom testing: my poor little atom is a sad little thing, and 
it's almost painful to benchmark that thing. But it's worth it to look at 
how the 32-bit code compares to the openssl asm code too:

 - BLK_SHA1:

	real	2m27.160s
	user	2m23.651s
	sys	0m2.392s

 - OpenSSL:

	real	2m12.580s
	user	2m9.998s
	sys	0m1.811s

 - Mozilla-SHA1:

	real	3m21.836s
	user	3m18.369s
	sys	0m2.862s

As expected, the hand-tuned assembly does better (and by a bigger margin). 
Probably partly because scheduling is important when in-order, and partly 
because gcc will have a harder time with the small register set.

But it's still a big improvement over the mozilla one.

(This is, as always, 'git fsck --full'. It spends about 50% on that SHA1 
calculation, so the SHA1 speedup is larger than you see from just the 
numbers)

		Linus


* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  3:48                               ` Linus Torvalds
  2009-08-06  4:01                                 ` Linus Torvalds
@ 2009-08-06  4:52                                 ` George Spelvin
  1 sibling, 0 replies; 60+ messages in thread
From: George Spelvin @ 2009-08-06  4:52 UTC (permalink / raw)
  To: art.08.09, torvalds; +Cc: git, gitster, linux, nico

On Wed, 5 Aug 2009, Linus Torvalds wrote:
> Oh yes I have.
> 
> Here's the patch that gets me sub-28s git-fsck times. In fact, it gives me 
> sub-27s times. In fact, it's really close to the OpenSSL times.
> 
> And all using plain C.
> 
> Again - this is all on x86-64. I suspect 32-bit code ends up having 
> spills due to register pressure. That said, I did get rid of that big 
> temporary array, and it now basically only uses that 512-bit array as one 
> circular queue.
> 
> 		Linus
> 
> PS. Ok, so my definition of "plain C" is a bit odd. There's nothing plain 
> about it. It's disgusting C preprocessor misuse. But dang, it's kind of 
> fun to abuse the compiler this way.

You're still missing three tricks, which give a slight speedup 
on my machine:

1) (major)
	Instead of reassigning all those variables all the time,
	make the round function
		E += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + W[t] + 0x8f1bbcdc; \
		B = SHA_ROR(B, 2);
	and rename the variables between rounds.

2) (minor)
	One of the round functions ((B&C)|(D&(B|C))) can be rewritten
		  (B&C) | (C&D) | (D&B)
		= (B&C) | (D&(B|C))
		= (B&C) | (D&(B^C))
		= (B&C) ^ (D&(B^C))
		= (B&C) + (D&(B^C))
	to expose more associativity (and thus scheduling flexibility)
	to the compiler.

3) (minor)
	ctx->lenW is always simply a copy of the low 6 bits of ctx->size,
	so there's no need to bother with it.
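As a sanity check (mine, not from the thread), the chain of identities in trick 2 can be verified mechanically. The final '+' step is safe because (B&C) and (D&(B^C)) can never both have the same bit set -- a bit set in B&C means B and C agree there, so B^C is clear -- hence the addition generates no carries:

```c
#include <stdint.h>

/* The canonical SHA-1 majority function. */
uint32_t maj_ref(uint32_t b, uint32_t c, uint32_t d)
{
	return (b & c) | (c & d) | (d & b);
}

/* The rewritten form: '+' instead of '|' or '^'. Equivalent because
 * the two operands are bitwise disjoint, so the add never carries. */
uint32_t maj_add(uint32_t b, uint32_t c, uint32_t d)
{
	return (b & c) + (d & (b ^ c));
}

/* Verify every intermediate identity for one (b,c,d) triple. */
int maj_identities_hold(uint32_t b, uint32_t c, uint32_t d)
{
	uint32_t m = maj_ref(b, c, d);
	return m == ((b & c) | (d & (b | c)))
	    && m == ((b & c) | (d & (b ^ c)))
	    && m == ((b & c) ^ (d & (b ^ c)))
	    && m == maj_add(b, c, d);
}

int maj_identities_hold_broadly(void)
{
	uint32_t x = 0x12345678u;
	int i;
	/* All 8 per-bit combinations, as uniform words... */
	for (i = 0; i < 8; i++)
		if (!maj_identities_hold((i & 1) ? ~0u : 0,
					 (i & 2) ? ~0u : 0,
					 (i & 4) ? ~0u : 0))
			return 0;
	/* ...plus pseudorandom mixed words (xorshift32), to exercise
	 * the carry-free '+' on non-uniform inputs. */
	for (i = 0; i < 10000; i++) {
		uint32_t b, c, d;
		x ^= x << 13; x ^= x >> 17; x ^= x << 5; b = x;
		x ^= x << 13; x ^= x >> 17; x ^= x << 5; c = x;
		x ^= x << 13; x ^= x >> 17; x ^= x << 5; d = x;
		if (!maj_identities_hold(b, c, d))
			return 0;
	}
	return 1;
}
```
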

Actually, looking at the code, GCC manages to figure out the first
(major) one by itself.  Way to go, GCC authors!

But avoiding the extra temporary in trick 2 also gets rid of
some extra REX prefixes, saving 240 bytes in blk_SHA1Block, which is
kind of nice in an inner loop.

Here's my modified version of your earlier code.  I haven't
incorporated the W[] formation into the round functions as in
your latest version.

I'm sure you can bash the two together in very little time.  Or I'll
get to it later; I really should attend to $DAY_JOB at the moment.

diff --git a/Makefile b/Makefile
index daf4296..e6df8ec 100644
--- a/Makefile
+++ b/Makefile
@@ -84,6 +84,10 @@ all::
 # specify your own (or DarwinPort's) include directories and
 # library directories by defining CFLAGS and LDFLAGS appropriately.
 #
+# Define BLK_SHA1 environment variable if you want the C version
+# of the SHA1 that assumes you can do unaligned 32-bit loads and
+# have a fast htonl() function.
+#
 # Define PPC_SHA1 environment variable when running make to make use of
 # a bundled SHA1 routine optimized for PowerPC.
 #
@@ -1167,6 +1171,10 @@ ifdef NO_DEFLATE_BOUND
 	BASIC_CFLAGS += -DNO_DEFLATE_BOUND
 endif
 
+ifdef BLK_SHA1
+	SHA1_HEADER = "block-sha1/sha1.h"
+	LIB_OBJS += block-sha1/sha1.o
+else
 ifdef PPC_SHA1
 	SHA1_HEADER = "ppc/sha1.h"
 	LIB_OBJS += ppc/sha1.o ppc/sha1ppc.o
@@ -1184,6 +1192,7 @@ else
 endif
 endif
 endif
+endif
 ifdef NO_PERL_MAKEMAKER
 	export NO_PERL_MAKEMAKER
 endif
diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
new file mode 100644
index 0000000..261eae7
--- /dev/null
+++ b/block-sha1/sha1.c
@@ -0,0 +1,141 @@
+/*
+ * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.c),
+ * optimized to do word accesses rather than byte accesses,
+ * and to avoid unnecessary copies into the context array.
+ */
+
+#include <string.h>
+#include <arpa/inet.h>
+
+#include "sha1.h"
+
+/* Hash one 64-byte block of data */
+static void blk_SHA1Block(blk_SHA_CTX *ctx, const uint32_t *data);
+
+void blk_SHA1_Init(blk_SHA_CTX *ctx)
+{
+	/* Initialize H with the magic constants (see FIPS180 for constants)
+	 */
+	ctx->H[0] = 0x67452301;
+	ctx->H[1] = 0xefcdab89;
+	ctx->H[2] = 0x98badcfe;
+	ctx->H[3] = 0x10325476;
+	ctx->H[4] = 0xc3d2e1f0;
+	ctx->size = 0;
+}
+
+
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, int len)
+{
+	int lenW = (int)ctx->size & 63;
+
+	ctx->size += len;
+
+	/* Read the data into W and process blocks as they get full
+	 */
+	if (lenW) {
+		int left = 64 - lenW;
+		if (len < left)
+			left = len;
+		memcpy(lenW + (char *)ctx->W, data, left);
+		if (left + lenW != 64)
+			return;
+		len -= left;
+		data += left;
+		blk_SHA1Block(ctx, ctx->W);
+	}
+	while (len >= 64) {
+		blk_SHA1Block(ctx, data);
+		data += 64;
+		len -= 64;
+	}
+	memcpy(ctx->W, data, len);
+}
+
+
+void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
+{
+	int i, lenW = (int)ctx->size & 63;
+
+	/* Pad with a binary 1 (ie 0x80), then zeroes, then length
+	 */
+	((char *)ctx->W)[lenW++] = 0x80;
+	if (lenW > 56) {
+		memset((char *)ctx->W + lenW, 0, 64 - lenW);
+		blk_SHA1Block(ctx, ctx->W);
+		lenW = 0;
+	}
+	memset((char *)ctx->W + lenW, 0, 56 - lenW);
+	ctx->W[14] = htonl(ctx->size >> 29);
+	ctx->W[15] = htonl((uint32_t)ctx->size << 3);
+	blk_SHA1Block(ctx, ctx->W);
+
+	/* Output hash
+	 */
+	for (i = 0; i < 5; i++)
+		((unsigned int *)hashout)[i] = htonl(ctx->H[i]);
+}
+
+/* SHA-1 helper macros */
+#define SHA_ROT(X,n) (((X) << (n)) | ((X) >> (32-(n))))
+#define F1(b,c,d) (((d^c)&b)^d)
+#define F2(b,c,d) (b^c^d)
+/* This version lets the compiler use the fact that + is associative. */
+#define F3(b,c,d) (c&d) + (b & (c^d))
+
+/* The basic SHA-1 round */
+#define ROUND(a, b, c, d, e, f, k, t) \
+	e += SHA_ROT(a,5) + f(b,c,d) + W[t] + k;  b = SHA_ROT(b, 30)
+/* Five SHA-1 rounds */
+#define FIVE(f, k, t) \
+	ROUND(A, B, C, D, E, f, k, t  ); \
+	ROUND(E, A, B, C, D, f, k, t+1); \
+	ROUND(D, E, A, B, C, f, k, t+2); \
+	ROUND(C, D, E, A, B, f, k, t+3); \
+	ROUND(B, C, D, E, A, f, k, t+4)
+
+static void blk_SHA1Block(blk_SHA_CTX *ctx, const uint32_t *data)
+{
+	int t;
+	uint32_t A,B,C,D,E;
+	uint32_t W[80];
+
+	for (t = 0; t < 16; t++)
+		W[t] = htonl(data[t]);
+
+	/* Unroll it? */
+	for (t = 16; t <= 79; t++)
+		W[t] = SHA_ROT(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
+
+	A = ctx->H[0];
+	B = ctx->H[1];
+	C = ctx->H[2];
+	D = ctx->H[3];
+	E = ctx->H[4];
+
+	FIVE(F1, 0x5a827999,  0);
+	FIVE(F1, 0x5a827999,  5);
+	FIVE(F1, 0x5a827999, 10);
+	FIVE(F1, 0x5a827999, 15);
+
+	FIVE(F2, 0x6ed9eba1, 20);
+	FIVE(F2, 0x6ed9eba1, 25);
+	FIVE(F2, 0x6ed9eba1, 30);
+	FIVE(F2, 0x6ed9eba1, 35);
+
+	FIVE(F3, 0x8f1bbcdc, 40);
+	FIVE(F3, 0x8f1bbcdc, 45);
+	FIVE(F3, 0x8f1bbcdc, 50);
+	FIVE(F3, 0x8f1bbcdc, 55);
+
+	FIVE(F2, 0xca62c1d6, 60);
+	FIVE(F2, 0xca62c1d6, 65);
+	FIVE(F2, 0xca62c1d6, 70);
+	FIVE(F2, 0xca62c1d6, 75);
+
+	ctx->H[0] += A;
+	ctx->H[1] += B;
+	ctx->H[2] += C;
+	ctx->H[3] += D;
+	ctx->H[4] += E;
+}
diff --git a/block-sha1/sha1.h b/block-sha1/sha1.h
new file mode 100644
index 0000000..c9dc156
--- /dev/null
+++ b/block-sha1/sha1.h
@@ -0,0 +1,21 @@
+/*
+ * Based on the Mozilla SHA1 (see mozilla-sha1/sha1.h),
+ * optimized to do word accesses rather than byte accesses,
+ * and to avoid unnecessary copies into the context array.
+ */
+ #include <stdint.h>
+
+typedef struct {
+	uint32_t H[5];
+	uint64_t size;
+	uint32_t W[16];
+} blk_SHA_CTX;
+
+void blk_SHA1_Init(blk_SHA_CTX *ctx);
+void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *dataIn, int len);
+void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx);
+
+#define git_SHA_CTX	blk_SHA_CTX
+#define git_SHA1_Init	blk_SHA1_Init
+#define git_SHA1_Update	blk_SHA1_Update
+#define git_SHA1_Final	blk_SHA1_Final


* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  4:50                                     ` Linus Torvalds
@ 2009-08-06  5:19                                       ` Artur Skawina
  2009-08-06  7:03                                         ` George Spelvin
  0 siblings, 1 reply; 60+ messages in thread
From: Artur Skawina @ 2009-08-06  5:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git

Linus Torvalds wrote:
> 
> On Thu, 6 Aug 2009, Artur Skawina wrote:
>> #             TIME[s] SPEED[MB/s]
>> rfc3174         1.357       44.99
>> rfc3174         1.352       45.13
>> mozilla         1.509       40.44
>> mozillaas       1.133       53.87
>> linus          0.5818       104.9
>>
>> so it's more than twice as fast as the mozilla implementation.
> 
> So that's some general SHA1 benchmark you have?
> 
> I hope it tests correctness too. 

yep, sort of, i just check that all versions return the same result
when hashing some pseudorandom data.

> As to my atom testing: my poor little atom is a sad little thing, and 
> it's almost painful to benchmark that thing. But it's worth it to look at 
> how the 32-bit code compares to the openssl asm code too:
> 
>  - BLK_SHA1:
> 	real	2m27.160s
>  - OpenSSL:
> 	real	2m12.580s
>  - Mozilla-SHA1:
> 	real	3m21.836s
> 
> As expected, the hand-tuned assembly does better (and by a bigger margin). 
> Probably partly because scheduling is important when in-order, and partly 
> because gcc will have a harder time with the small register set.
> 
> But it's still a big improvement over the mozilla one.
> 
> (This is, as always, 'git fsck --full'. It spends about 50% on that SHA1 
> calculation, so the SHA1 speedup is larger than you see from just the 
> numbers)

I'll start looking at other cpus once i integrate the asm versions into
my benchmark. 

P4s really are "special". Even something as simple as this on top of your
version:

@@ -129,8 +133,8 @@
 
 #define T_20_39(t) \
        SHA_XOR(t); \
-       TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0x6ed9eba1; \
-       E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
+       TEMP += SHA_ROL(A,5) + (B^C^D) + E; \
+       E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x6ed9eba1;
 
        T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
        T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29);
@@ -139,8 +143,8 @@
 
 #define T_40_59(t) \
        SHA_XOR(t); \
-       TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + 0x8f1bbcdc; \
-       E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
+       TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E; \
+       E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x8f1bbcdc;
 
        T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
        T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49);

saves another 10% or so:

#Initializing... Rounds: 1000000, size: 62500K, time: 1.421s, speed: 42.97MB/s
#             TIME[s] SPEED[MB/s]
rfc3174         1.403        43.5
# New hash result: b747042d9f4f1fdabd2ac53076f8f830dea7fe0f
rfc3174         1.403       43.51
linus          0.5891       103.6
linusas        0.5337       114.4
mozilla         1.535       39.76
mozillaas       1.128       54.13


artur


* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  4:27                                 ` Linus Torvalds
@ 2009-08-06  5:44                                   ` Artur Skawina
  2009-08-06  5:56                                     ` Artur Skawina
  0 siblings, 1 reply; 60+ messages in thread
From: Artur Skawina @ 2009-08-06  5:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git

Linus Torvalds wrote:
> 
> On Thu, 6 Aug 2009, Artur Skawina wrote:
>>> The way it's written, I can easily make it do one or the other by just 
>>> turning the macro inside a loop (and we can have a preprocessor flag to 
>>> choose one or the other), but let me work on it a bit more first.
>> that's of course how i measured it.. :)
> 
> Well, with my "rolling 512-bit array" I can't do that easily any more.
> 
> Now it actually depends on the compiler being able to statically do that 
> circular list calculation. If I were to turn it back into the chunks of 
> loops, my new code would suck, because it would have all those nasty 
> dynamic address calculations.

i did try (obvious patch below) and in fact the loops still win on p4:

#Initializing... Rounds: 1000000, size: 62500K, time: 1.428s, speed: 42.76MB/s
#             TIME[s] SPEED[MB/s]
rfc3174         1.437       42.47
rfc3174         1.438       42.45
linus          0.5791       105.4
linusas        0.5052       120.8
mozilla         1.525       40.01
mozillaas       1.192       51.19

artur

--- block-sha1/sha1.c	2009-08-06 06:45:03.407322970 +0200
+++ block-sha1/sha1as.c	2009-08-06 07:36:41.332318683 +0200
@@ -107,13 +107,17 @@
 
 #define T_0_15(t) \
 	TEMP = htonl(data[t]); array[t] = TEMP; \
-	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \
-	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
+	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x5a827999; \
 
+#if UNROLL
 	T_0_15( 0); T_0_15( 1); T_0_15( 2); T_0_15( 3); T_0_15( 4);
 	T_0_15( 5); T_0_15( 6); T_0_15( 7); T_0_15( 8); T_0_15( 9);
 	T_0_15(10); T_0_15(11); T_0_15(12); T_0_15(13); T_0_15(14);
 	T_0_15(15);
+#else
+	for (int t = 0; t <= 15; t++) { T_0_15(t); }
+#endif
 
 /* This "rolls" over the 512-bit array */
 #define W(x) (array[(x)&15])
@@ -125,37 +129,53 @@
 	TEMP += SHA_ROL(A,5) + (((C^D)&B)^D) + E + 0x5a827999; \
 	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP; \
 
+#if UNROLL
 	T_16_19(16); T_16_19(17); T_16_19(18); T_16_19(19);
+#else
+	for (int t = 16; t <= 19; t++) { T_16_19(t); }
+#endif
 
 #define T_20_39(t) \
 	SHA_XOR(t); \
-	TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0x6ed9eba1; \
-	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
+	TEMP += SHA_ROL(A,5) + (B^C^D) + E; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x6ed9eba1;
 
+#if UNROLL
 	T_20_39(20); T_20_39(21); T_20_39(22); T_20_39(23); T_20_39(24);
 	T_20_39(25); T_20_39(26); T_20_39(27); T_20_39(28); T_20_39(29);
 	T_20_39(30); T_20_39(31); T_20_39(32); T_20_39(33); T_20_39(34);
 	T_20_39(35); T_20_39(36); T_20_39(37); T_20_39(38); T_20_39(39);
+#else
+	for (int t = 20; t <= 39; t++) { T_20_39(t); }
+#endif
 
 #define T_40_59(t) \
 	SHA_XOR(t); \
-	TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E + 0x8f1bbcdc; \
-	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
+	TEMP += SHA_ROL(A,5) + ((B&C)|(D&(B|C))) + E; \
+	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP + 0x8f1bbcdc;
 
+#if UNROLL
 	T_40_59(40); T_40_59(41); T_40_59(42); T_40_59(43); T_40_59(44);
 	T_40_59(45); T_40_59(46); T_40_59(47); T_40_59(48); T_40_59(49);
 	T_40_59(50); T_40_59(51); T_40_59(52); T_40_59(53); T_40_59(54);
 	T_40_59(55); T_40_59(56); T_40_59(57); T_40_59(58); T_40_59(59);
+#else
+	for (int t = 40; t <= 59; t++) { T_40_59(t); }
+#endif
 
 #define T_60_79(t) \
 	SHA_XOR(t); \
 	TEMP += SHA_ROL(A,5) + (B^C^D) + E + 0xca62c1d6; \
 	E = D; D = C; C = SHA_ROR(B, 2); B = A; A = TEMP;
 
+#if UNROLL
 	T_60_79(60); T_60_79(61); T_60_79(62); T_60_79(63); T_60_79(64);
 	T_60_79(65); T_60_79(66); T_60_79(67); T_60_79(68); T_60_79(69);
 	T_60_79(70); T_60_79(71); T_60_79(72); T_60_79(73); T_60_79(74);
 	T_60_79(75); T_60_79(76); T_60_79(77); T_60_79(78); T_60_79(79);
+#else
+	for (int t = 60; t <= 79; t++) { T_60_79(t); }
+#endif
 
 	ctx->H[0] += A;
 	ctx->H[1] += B;


* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  5:44                                   ` Artur Skawina
@ 2009-08-06  5:56                                     ` Artur Skawina
  2009-08-06  7:45                                       ` Artur Skawina
  0 siblings, 1 reply; 60+ messages in thread
From: Artur Skawina @ 2009-08-06  5:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git

Artur Skawina wrote:
> i did try (obvious patch below) and in fact the loops still win on p4:
> 
> #Initializing... Rounds: 1000000, size: 62500K, time: 1.428s, speed: 42.76MB/s
> #             TIME[s] SPEED[MB/s]
> rfc3174         1.437       42.47
> rfc3174         1.438       42.45
> linus          0.5791       105.4
> linusas        0.5052       120.8
> mozilla         1.525       40.01
> mozillaas       1.192       51.19

and my atom seems to like the compact loops too: 

#Initializing... Rounds: 1000000, size: 62500K, time: 4.379s, speed: 13.94MB/s
#             TIME[s] SPEED[MB/s]
rfc3174         4.429       13.78
rfc3174         4.414       13.83
linus           1.733       35.22
linusas           1.5        40.7
mozilla         2.818       21.66
mozillaas       2.539       24.04

artur


* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  5:19                                       ` Artur Skawina
@ 2009-08-06  7:03                                         ` George Spelvin
  0 siblings, 0 replies; 60+ messages in thread
From: George Spelvin @ 2009-08-06  7:03 UTC (permalink / raw)
  To: art.08.09, torvalds; +Cc: git, gitster, linux, nico

> On Thu, 6 Aug 2009, Artur Skawina wrote:
>> #             TIME[s] SPEED[MB/s]
>> rfc3174         1.357       44.99
>> rfc3174         1.352       45.13
>> mozilla         1.509       40.44
>> mozillaas       1.133       53.87
>> linus          0.5818       104.9

> #Initializing... Rounds: 1000000, size: 62500K, time: 1.421s, speed: 42.97MB/s
> #             TIME[s] SPEED[MB/s]
> rfc3174         1.403        43.5
> # New hash result: b747042d9f4f1fdabd2ac53076f8f830dea7fe0f
> rfc3174         1.403       43.51
> linus          0.5891       103.6
> linusas        0.5337       114.4
> mozilla         1.535       39.76
> mozillaas       1.128       54.13

I'm trying to absorb what you're learning about P4 performance, but
I'm getting confused... what is what in these benchmarks?

The major architectural decisions I see are:

1) Three possible ways to compute the W[] array for rounds 16..79:
	1a) Compute W[16..79] in a loop beforehand (you noted that unrolling
	    two copies helped significantly.)
	1b) Compute W[16..79] as part of hash rounds 16..79.
	1c) Compute W[0..15] in-place as part of hash rounds 16..79

2) The main hashing can be rolled up or unrolled:
	2a) Four 20-round loops.  (In case of options 1b and 1c, the
	    first one might be split into a 16 and a 4.)
	2b) Four 4-round loops, each unrolled 5x.  (See the ARM assembly.)
	2c) all 80 rounds unrolled.

As Linus noted, 1c is not friends with options 2a and 2b, because the
W() indexing math is no longer a compile-time constant.

Linus has posted 1a+2c and 1c+2c.  You posted some code that could be
2a or 2c depending on an UNROLL preprocessor #define.  Which combinations
are your "linus" and "linusas" code?

You talk about "and my atom seems to like the compact loops too", but
I'm not sure which loops those are.

Thanks.


* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  5:56                                     ` Artur Skawina
@ 2009-08-06  7:45                                       ` Artur Skawina
  0 siblings, 0 replies; 60+ messages in thread
From: Artur Skawina @ 2009-08-06  7:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, George Spelvin, Junio C Hamano, git

Artur Skawina wrote:
> 
> and my atom seems to like the compact loops too: 

no, that was wrong, i forgot to turn off the ondemand governor...

the unrolled loops are in fact much faster and the numbers
look more reasonable, after a few tweaks even on a P4.
Now i just need to check how well it does compared to the
asm implementations...

artur

#             TIME[s] SPEED[MB/s]
# ATOM
rfc3174         2.199       27.75
linus          0.8642       70.62
linusas         1.606       38.01
linusas2       0.8763       69.65
mozilla         2.813        21.7
mozillaas       2.539       24.04
# P4
rfc3174         1.402       43.53
linus          0.5835       104.6
linusas        0.4625         132
linusas2       0.4456         137
mozilla         1.529       39.91
mozillaas       1.131       53.96
# P3
rfc3174         5.019       12.16
linus            1.86       32.81
linusas         3.108       19.64
linusas2        1.812       33.68
mozilla         6.431        9.49
mozillaas       5.868        10.4


* Re: x86 SHA1: Faster than OpenSSL
  2009-08-06  1:18                     ` Linus Torvalds
  2009-08-06  1:52                       ` Nicolas Pitre
@ 2009-08-06 18:49                       ` Erik Faye-Lund
  1 sibling, 0 replies; 60+ messages in thread
From: Erik Faye-Lund @ 2009-08-06 18:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: George Spelvin, gitster, git

On Thu, Aug 6, 2009 at 3:18 AM, Linus
Torvalds<torvalds@linux-foundation.org> wrote:
> I note that MINGW does NO_OPENSSL by default, for example, and maybe the
> MINGW people want to test the patch out and enable BLK_SHA1 rather than
> the original Mozilla one.

We recently got OpenSSL in msysgit. The NO_OPENSSL-switch hasn't been
flipped yet, though. (We did OpenSSL to get https-support in cURL...)

-- 
Erik "kusma" Faye-Lund
kusmabite@gmail.com
(+47) 986 59 656


* Re: x86 SHA1: Faster than OpenSSL
  2009-08-03  3:47   ` x86 SHA1: Faster than OpenSSL George Spelvin
                       ` (2 preceding siblings ...)
  2009-08-04  2:30     ` Linus Torvalds
@ 2009-08-18 21:26     ` Andy Polyakov
  3 siblings, 0 replies; 60+ messages in thread
From: Andy Polyakov @ 2009-08-18 21:26 UTC (permalink / raw)
  To: George Spelvin; +Cc: git

George Spelvin wrote:
> (Work in progress, state dump to mailing list archives.)
> 
> This started when discussing git startup overhead due to the dynamic
> linker.  One big contributor is the openssl library, which is used only
> for its optimized x86 SHA-1 implementation.  So I took a look at it,
> with an eye to importing the code directly into the git source tree,
> and decided that I felt like trying to do better.
> 
> The original code was excellent, but it was optimized when the P4 was new.

Even though the last revision took place when "the P4 was new" and was even
triggered by its appearance, *all-round* performance was and will always
be the prime goal. This means that improvements on some particular
micro-architecture are always weighed against losses on others [and a
compromise is considered if so required]. Please note that I'm *not*
trying to diminish George's effort by saying that the proposed code is
inappropriate, on the contrary I'm nothing but grateful! Thanks, George!
I'm only saying that it will be given thorough consideration. Well, I've
actually given it the consideration and the outcome is already committed:-) See
http://cvs.openssl.org/chngview?cn=18513. I don't deliver +17%, only
+12%, but at the cost of Intel Atom-specific optimizations. I used this
opportunity to optimize even for the Intel Atom core, something I was
planning to do at some point anyway...

>   http://www.openssl.org/~appro/cryptogams/cryptogams-0.tar.gz
> - "tar xzf cryptogams-0.tar.gz"

If there is interest I can pack new tar ball with updated modules.

> An open question is how to add appropriate CPU detection to the git
> build scripts. (Note that `uname -m`, which it currently uses to select
> the ARM code, does NOT produce the right answer if you're using a 32-bit
> compiler on a 64-bit platform.)

It's not only that. As the next subscriber noted, there is a problem on
MacOS X: it uses a slightly different assembler convention, so the ELF
modules can't be assembled there. The OpenSSL perlasm framework takes
care of several assembler flavors and executable formats, including
MacOS X. I'm talking about

> +++ Makefile	2009-08-02 06:44:44.000000000 -0400
> +%.s : %.pl x86asm.pl x86unix.pl
> +	perl $< elf > $@
                ^^^ this argument.
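
To illustrate the point, here is a hedged sketch of how that Makefile
rule could select the perlasm flavor per platform instead of hardcoding
"elf". The `PERLASM_SCHEME` variable and the `uname -s` test are my
invention for illustration, not git's or OpenSSL's actual build logic:

```make
# Hypothetical flavor selection: OpenSSL's perlasm accepts the output
# format as an argument, and Andy mentions a MacOS X (Mach-O) flavor
# in addition to ELF.
uname_S := $(shell uname -s)
ifeq ($(uname_S),Darwin)
	PERLASM_SCHEME = macosx
else
	PERLASM_SCHEME = elf
endif

%.s : %.pl x86asm.pl x86unix.pl
	perl $< $(PERLASM_SCHEME) > $@
```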

Cheers. A.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: x86 SHA1: Faster than OpenSSL
  2009-08-04  2:51       ` Linus Torvalds
  2009-08-04  3:07         ` Jon Smirl
@ 2009-08-18 21:50         ` Andy Polyakov
  1 sibling, 0 replies; 60+ messages in thread
From: Andy Polyakov @ 2009-08-18 21:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: George Spelvin, git

> On Mon, 3 Aug 2009, Linus Torvalds wrote:
>> The thing that I'd prefer is simply
>>
>> 	git fsck --full
>>
>> on the Linux kernel archive. For me (with a fast machine), it takes about 
>> 4m30s with the OpenSSL SHA1, and takes 6m40s with the Mozilla SHA1 (ie 
>> using a NO_OPENSSL=1 build).
>>
>> So that's an example of a load that is actually very sensitive to SHA1 
>> performance (more so than _most_ git loads, I suspect), and at the same 
>> time is a real git load rather than some SHA1-only microbenchmark.

I couldn't agree more that real-life benchmarks are of greater value
than algorithm-specific micro-benchmarks. And given the provided
profiling data, one can argue that a +17% (or my +12%) improvement on a
micro-benchmark isn't really worth bothering about. But it's a kind of
sport [at least for me], so don't judge too harshly:-)

>> It also 
>> shows very clearly why we default to the OpenSSL version over the Mozilla 
>> one.

As George implicitly mentioned, most OpenSSL assembler modules are
available under a more permissive license, and if there is interest I'm
ready to assist...

> "perf report --sort comm,dso,symbol" profiling shows the following for 
> 'git fsck --full' on the kernel repo, using the Mozilla SHA1:
> 
>     47.69%               git  /home/torvalds/git/git     [.] moz_SHA1_Update
>     22.98%               git  /lib64/libz.so.1.2.3       [.] inflate_fast
>      7.32%               git  /lib64/libc-2.10.1.so      [.] __GI_memcpy
>      4.66%               git  /lib64/libz.so.1.2.3       [.] inflate
>      3.76%               git  /lib64/libz.so.1.2.3       [.] adler32
>      2.86%               git  /lib64/libz.so.1.2.3       [.] inflate_table
>      2.41%               git  /home/torvalds/git/git     [.] lookup_object
>      1.31%               git  /lib64/libc-2.10.1.so      [.] _int_malloc
>      0.84%               git  /home/torvalds/git/git     [.] patch_delta
>      0.78%               git  [kernel]                   [k] hpet_next_event
> 
> so yeah, SHA1 performance matters. Judging by the OpenSSL numbers, the 
> OpenSSL SHA1 implementation must be about twice as fast as the C version 
> we use.

And given the /lib64 path, this is 64-bit compiler-generated C code
compared to 32-bit assembler? Either way, in this context I have an
extra comment addressing a previous subscriber, Mark Lodato, who
effectively wondered how 64-bit assembler would compare to the 32-bit
version. First of all, there *is* a 64-bit assembler version. But as
SHA1 is essentially a 32-bit algorithm, the 64-bit implementation is
only nominally faster, +20% at most. It is faster thanks to the larger
register bank, which facilitates more efficient instruction scheduling.

Cheers. A.

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2009-08-18 21:50 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-07-26 23:21 Performance issue of 'git branch' George Spelvin
2009-07-31 10:46 ` Request for benchmarking: x86 SHA1 code George Spelvin
2009-07-31 11:11   ` Erik Faye-Lund
2009-07-31 11:31     ` George Spelvin
2009-07-31 11:37     ` Michael J Gruber
2009-07-31 12:24       ` Erik Faye-Lund
2009-07-31 12:29         ` Johannes Schindelin
2009-07-31 12:32         ` George Spelvin
2009-07-31 12:45           ` Erik Faye-Lund
2009-07-31 13:02             ` George Spelvin
2009-07-31 11:21   ` Michael J Gruber
2009-07-31 11:26   ` Michael J Gruber
2009-07-31 12:31   ` Carlos R. Mafra
2009-07-31 13:27   ` Brian Ristuccia
2009-07-31 14:05     ` George Spelvin
2009-07-31 13:27   ` Jakub Narebski
2009-07-31 15:05   ` Peter Harris
2009-07-31 15:22   ` Peter Harris
2009-08-03  3:47   ` x86 SHA1: Faster than OpenSSL George Spelvin
2009-08-03  7:36     ` Jonathan del Strother
2009-08-04  1:40     ` Mark Lodato
2009-08-04  2:30     ` Linus Torvalds
2009-08-04  2:51       ` Linus Torvalds
2009-08-04  3:07         ` Jon Smirl
2009-08-04  5:01           ` George Spelvin
2009-08-04 12:56             ` Jon Smirl
2009-08-04 14:29               ` Dmitry Potapov
2009-08-18 21:50         ` Andy Polyakov
2009-08-04  4:48       ` George Spelvin
2009-08-04  6:30         ` Linus Torvalds
2009-08-04  8:01           ` George Spelvin
2009-08-04 20:41             ` Junio C Hamano
2009-08-05 18:17               ` George Spelvin
2009-08-05 20:36                 ` Johannes Schindelin
2009-08-05 20:44                 ` Junio C Hamano
2009-08-05 20:55                 ` Linus Torvalds
2009-08-05 23:13                   ` Linus Torvalds
2009-08-06  1:18                     ` Linus Torvalds
2009-08-06  1:52                       ` Nicolas Pitre
2009-08-06  2:04                         ` Junio C Hamano
2009-08-06  2:10                           ` Linus Torvalds
2009-08-06  2:20                           ` Nicolas Pitre
2009-08-06  2:08                         ` Linus Torvalds
2009-08-06  3:19                           ` Artur Skawina
2009-08-06  3:31                             ` Linus Torvalds
2009-08-06  3:48                               ` Linus Torvalds
2009-08-06  4:01                                 ` Linus Torvalds
2009-08-06  4:28                                   ` Artur Skawina
2009-08-06  4:50                                     ` Linus Torvalds
2009-08-06  5:19                                       ` Artur Skawina
2009-08-06  7:03                                         ` George Spelvin
2009-08-06  4:52                                 ` George Spelvin
2009-08-06  4:08                               ` Artur Skawina
2009-08-06  4:27                                 ` Linus Torvalds
2009-08-06  5:44                                   ` Artur Skawina
2009-08-06  5:56                                     ` Artur Skawina
2009-08-06  7:45                                       ` Artur Skawina
2009-08-06 18:49                       ` Erik Faye-Lund
2009-08-04  6:40         ` Linus Torvalds
2009-08-18 21:26     ` Andy Polyakov
