* [PATCH] add slice by 8 algorithm to crc32.c
@ 2011-08-08  9:28 George Spelvin
  2011-08-08 10:31 ` Joakim Tjernlund
  2011-08-08 16:50 ` Bob Pearson
  0 siblings, 2 replies; 27+ messages in thread
From: George Spelvin @ 2011-08-08  9:28 UTC (permalink / raw)
  To: fzago, linux-kernel; +Cc: akpm, joakim.tjernlund, linux, rpearson

Sorry I didn't see this when first posted.

The "slice by 8" terminology is pretty confusing.  How about
"Extended Joakim Tjernlund's optimization from commit
836e2af92503f1642dbc3c3281ec68ec1dd39d2e to 8-way parallelism."

Which is essentially what you're doing.  The renaming of tab[0] to t0_le
and t0_be, and the removal of the DO_CRC4 macro, just increase the diff size.

If you're looking at speeding up the CRC through larger tables, have
you tried using 10+11+11-bit tables?  That would require 20K of tables
rather than 8K, but would reduce the number of table lookups per byte.
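
For instance, the inner step might look something like this (a sketch
only; the 10/11/11 split and the tNN table names are hypothetical here,
with tNN[v] built as the CRC advance of a 32-bit word holding v at bit
offset NN):

	/* 10+11+11 split: t00[2048], t11[2048], t22[1024] = 20K total,
	   three lookups per 32-bit word instead of four. */
	crc ^= *p32++;
	crc = t00[crc & 0x7ff] ^ t11[(crc >> 11) & 0x7ff] ^ t22[crc >> 22];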


One more stunt you could try to increase parallelism: rather than maintain
the CRC in one register, maintain it in several, and only XOR and collapse
them at the end.

Start with your 64-bit code, but imagine that the second code block's
"q = *p32++" always loads 0, and therefore the whole block can be skipped.
(Since tab[0] = 0 for all CRC tables.)

This computes the CRC of the even words.  Then do a second one in parallel
for the odd words into a separate CRC register.  Then combine them at the end.
(Shift one up by 32 bits and XOR into the other.)

This would let you get away with 5K of tables: t4 through t7, and t0.
t1 through t3 could be skipped.
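
A rough, untested sketch of the idea (little-endian, word count even and
at least 2; t4..t7 are the 64-bit-advance tables and t0 the byte table):

	u32 even = crc;	/* CRC of words 0, 2, 4, ... */
	u32 odd = 0;	/* CRC of words 1, 3, 5, ... */
	u32 a, b;
	int i;

	while (words > 2) {
		a = *p32++ ^ even;
		b = *p32++ ^ odd;
		/* two independent chains; no data flows between them */
		even = t7[a & 255] ^ t6[(a >> 8) & 255] ^
		       t5[(a >> 16) & 255] ^ t4[a >> 24];
		odd  = t7[b & 255] ^ t6[(b >> 8) & 255] ^
		       t5[(b >> 16) & 255] ^ t4[b >> 24];
		words -= 2;
	}
	/* Final pair: the last odd word only needs a 32-bit advance,
	   i.e. the "shift up by 32 bits": four zero-byte t0 steps. */
	a = *p32++ ^ even;
	even = t7[a & 255] ^ t6[(a >> 8) & 255] ^
	       t5[(a >> 16) & 255] ^ t4[a >> 24];
	odd ^= *p32++;
	for (i = 0; i < 4; i++)
		odd = t0[odd & 255] ^ (odd >> 8);
	crc = even ^ odd;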


Ideally, I'd write all this code myself, but I'm a bit crunched at work
right now so wouldn't be able to get to it for a few days.



Another possible simplification to the startup code.  There's no need
to compute init_bytes explicitly; just loop until the pointer is aligned:

	while ((unsigned)buf & 3) {
		if (!len--)
			goto done;
#ifdef __LITTLE_ENDIAN
		i0 = *buf++ ^ crc;
		crc = t0_le[i0] ^ (crc >> 8);
#else
		i0 = *buf++ ^ (crc >> 24);
		crc = t0_be[i0] ^ (crc << 8);
#endif  
	}
	p32 = (u32 const *)buf;
	words = len >> 2;
	end_bytes = len & 3;


... although I'd prefer to keep the DO_CRC() and DO_CRC4 macros, and
extend them to the 64-bit case, to avoid the nested #ifdefs.  That would
make:

	while ((unsigned)buf & 3) {
		if (!len--)
			goto done;
		DO_CRC(*buf++);
	}
	p32 = (u32 const *)buf;
	words = len >> 2;
	end_bytes = len & 3;


* Re: [PATCH] add slice by 8 algorithm to crc32.c
  2011-08-08  9:28 [PATCH] add slice by 8 algorithm to crc32.c George Spelvin
@ 2011-08-08 10:31 ` Joakim Tjernlund
  2011-08-08 10:52   ` George Spelvin
  2011-08-08 16:50 ` Bob Pearson
  1 sibling, 1 reply; 27+ messages in thread
From: Joakim Tjernlund @ 2011-08-08 10:31 UTC (permalink / raw)
  To: George Spelvin; +Cc: akpm, fzago, linux-kernel, rpearson

Hi George, just a quick comment or two.

"George Spelvin" <linux@horizon.com> wrote on 2011/08/08 11:28:26:
>
> Sorry I didn't see this when first posted.
>
> The "slice by 8" terminology is pretty confusing.  How about
> "Extended Joakim Tjernlund's optimization from commit
> 836e2af92503f1642dbc3c3281ec68ec1dd39d2e to 8-way parallelism."
>
> Which is essentially what you're doing.  The renaming of tab[0] to t0_le
> and t0_be, and the removal of the DO_CRC4 macro, just increase the diff size.

Yes, and it also increases the ppc code size. The matrix needs to stay. I
have proposed to add this instead:
@@ -51,20 +51,21 @@ static inline u32
 crc32_body(u32 crc, unsigned char const *buf, size_t len, const u32 (*tab)[256])
 {
 # ifdef __LITTLE_ENDIAN
-#  define DO_CRC(x) crc = tab[0][(crc ^ (x)) & 255] ^ (crc >> 8)
-#  define DO_CRC4 crc = tab[3][(crc) & 255] ^ \
-		tab[2][(crc >> 8) & 255] ^ \
-		tab[1][(crc >> 16) & 255] ^ \
-		tab[0][(crc >> 24) & 255]
+#  define DO_CRC(x) crc = t0[(crc ^ (x)) & 255] ^ (crc >> 8)
+#  define DO_CRC4 crc = t3[(crc) & 255] ^ \
+		t2[(crc >> 8) & 255] ^ \
+		t1[(crc >> 16) & 255] ^ \
+		t0[(crc >> 24) & 255]
 # else
-#  define DO_CRC(x) crc = tab[0][((crc >> 24) ^ (x)) & 255] ^ (crc << 8)
-#  define DO_CRC4 crc = tab[0][(crc) & 255] ^ \
-		tab[1][(crc >> 8) & 255] ^ \
-		tab[2][(crc >> 16) & 255] ^ \
-		tab[3][(crc >> 24) & 255]
+#  define DO_CRC(x) crc = t0[((crc >> 24) ^ (x)) & 255] ^ (crc << 8)
+#  define DO_CRC4 crc = t0[(crc) & 255] ^ \
+		t1[(crc >> 8) & 255] ^  \
+		t2[(crc >> 16) & 255] ^	\
+		t3[(crc >> 24) & 255]
 # endif
 	const u32 *b;
 	size_t    rem_len;
+	const u32 *t0=tab[0], *t1=tab[1], *t2=tab[2], *t3=tab[3];

This saves 3 insns for ppc in the hot loop; hopefully x86 is happy with this too.

>
> If you're looking at speeding up the CRC through larger tables, have
> you tried using 10+11+11-bit tables?  That would require 20K of tables
> rather than 8K, but would reduce the number of table lookups per byte.
>
>
> One more stunt you could try to increase parallelism: rather than maintain
> the CRC in one register, maintain it in several, and only XOR and collapse
> them at the end.
>
> Start with your 64-bit code, but imagine that the second code block's
> "q = *p32++" always loads 0, and therefore the whole block can be skipped.
> (Since tab[0] = 0 for all CRC tables.)
>
> This computes the CRC of the even words.  Then do a second one in parallel
> for the odd words into a separate CRC register.  Then combine them at the end.
> (Shift one up by 32 bits and XOR into the other.)
>
> This would let you get away with 5K of tables: t4 through t7, and t0.
> t1 through t3 could be skipped.
>
>
> Ideally, I'd write all this code myself, but I'm a bit crunched at work
> right now so wouldn't be able to get to it for a few days.
>
>
>
> Another possible simplification to the startup code.  There's no need
> to compute init_bytes explicitly; just loop until the pointer is aligned:
>
>    while ((unsigned)buf & 3) {
>       if (!len--)
>          goto done;
> #ifdef __LITTLE_ENDIAN
>       i0 = *buf++ ^ crc;
>       crc = t0_le[i0] ^ (crc >> 8);
> #else
>       i0 = *buf++ ^ (crc >> 24);
>       crc = t0_be[i0] ^ (crc << 8);
> #endif
>    }
>    p32 = (u32 const *)buf;
>    words = len >> 2;
>    end_bytes = len & 3;
>
>
> ... although I'd prefer to keep the DO_CRC() and DO_CRC4 macros, and
> extend them to the 64-bit case, to avoid the nested #ifdefs.  That would
> make:
>
>    while ((unsigned)buf & 3) {
>       if (!len--)
>          goto done;
>       DO_CRC(*buf++);
>    }
>    p32 = (u32 const *)buf;
>    words = len >> 2;
>    end_bytes = len & 3;

I prefer to keep the current code which (at the time) generated good code
for at least ppc:
	/* Align it */
	if (unlikely((long)buf & 3 && len)) {
		do {
			DO_CRC(*buf++);
		} while ((--len) && ((long)buf)&3);
	}



* Re: [PATCH] add slice by 8 algorithm to crc32.c
  2011-08-08 10:31 ` Joakim Tjernlund
@ 2011-08-08 10:52   ` George Spelvin
  2011-08-08 11:11     ` Joakim Tjernlund
                       ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: George Spelvin @ 2011-08-08 10:52 UTC (permalink / raw)
  To: joakim.tjernlund, linux; +Cc: akpm, fzago, linux-kernel, rpearson

> I prefer to keep the current code which (at the time) generated good code
> for at least ppc:
> 	/* Align it */
> 	if (unlikely((long)buf & 3 && len)) {
> 		do {
> 			DO_CRC(*buf++);
> 		} while ((--len) && ((long)buf)&3);
> 	}

Ah, I was looking at fzago's initial patch; I hadn't realized you'd
tweaked it.  That's pretty much what I was talking about.

Would
 	if (unlikely((long)buf & 3) && len) {

give the compiler better hints?  len != 0 is awfully
likely, actually.
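
(For what it's worth, unlikely() is just __builtin_expect(!!(x), 0), so
the difference is whether the whole conjunction or only the alignment
test gets predicted false.)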


* Re: [PATCH] add slice by 8 algorithm to crc32.c
  2011-08-08 10:52   ` George Spelvin
@ 2011-08-08 11:11     ` Joakim Tjernlund
  2011-08-08 17:04       ` Bob Pearson
       [not found]     ` <OFEA1BD2B2.B2A7F07F-ONC12578E6.003D368C-C12578E6.003D7468@LocalDomain>
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Joakim Tjernlund @ 2011-08-08 11:11 UTC (permalink / raw)
  To: George Spelvin; +Cc: akpm, fzago, linux-kernel, rpearson

"George Spelvin" <linux@horizon.com> wrote on 2011/08/08 12:52:01:
>
> > I prefer to keep the current code which (at the time) generated good code
> > for at least ppc:
> >    /* Align it */
> >    if (unlikely((long)buf & 3 && len)) {
> >       do {
> >          DO_CRC(*buf++);
> >       } while ((--len) && ((long)buf)&3);
> >    }
>
> Ah, I was looking at fzago's initial patch; I hadn't realized you'd
> tweaked it.  That's pretty much what I was talking about.
>
> Would
>     if (unlikely((long)buf & 3) && len) {
>
> give the compiler better hints?  len != 0 is awfully
> likely, actually.

Doesn't matter on ppc (gcc 4.4.4). The whole while loop is moved out of line
in both cases and the generated asm is identical.



* Re: [PATCH] add slice by 8 algorithm to crc32.c
       [not found]     ` <OFEA1BD2B2.B2A7F07F-ONC12578E6.003D368C-C12578E6.003D7468@LocalDomain>
@ 2011-08-08 11:24       ` Joakim Tjernlund
  0 siblings, 0 replies; 27+ messages in thread
From: Joakim Tjernlund @ 2011-08-08 11:24 UTC (permalink / raw)
  To: Joakim Tjernlund; +Cc: George Spelvin, akpm, fzago, linux-kernel, rpearson

Joakim Tjernlund/Transmode wrote on 2011/08/08 13:11:14:
>
> "George Spelvin" <linux@horizon.com> wrote on 2011/08/08 12:52:01:
> >
> > > I prefer to keep the current code which (at the time) generated good code
> > > for at least ppc:
> > >    /* Align it */
> > >    if (unlikely((long)buf & 3 && len)) {
> > >       do {
> > >          DO_CRC(*buf++);
> > >       } while ((--len) && ((long)buf)&3);
> > >    }
> >
> > Ah, I was looking at fzago's initial patch; I hadn't realized you'd
> > tweaked it.  That's pretty much what I was talking about.
> >
> > Would
> >     if (unlikely((long)buf & 3) && len) {
> >
> > give the compiler better hints?  len != 0 is awfully
> > likely, actually.
>
> Doesn't matter on ppc (gcc 4.4.4). The whole while loop is moved out of line
> in both cases and the generated asm is identical.

Actually, it does make a difference (must remember to save the file before
running gcc :). Hard to tell if it's any better though. Strangely, this
impacted the inner loop as well; it gained an extra addi.

 Jocke



* Re: [PATCH] add slice by 8 algorithm to crc32.c
  2011-08-08 10:52   ` George Spelvin
  2011-08-08 11:11     ` Joakim Tjernlund
       [not found]     ` <OFEA1BD2B2.B2A7F07F-ONC12578E6.003D368C-C12578E6.003D7468@LocalDomain>
@ 2011-08-08 11:42     ` Joakim Tjernlund
  2011-08-08 12:54       ` George Spelvin
  2011-08-08 16:54     ` Bob Pearson
  3 siblings, 1 reply; 27+ messages in thread
From: Joakim Tjernlund @ 2011-08-08 11:42 UTC (permalink / raw)
  To: George Spelvin; +Cc: akpm, fzago, linux-kernel, rpearson

"George Spelvin" <linux@horizon.com> wrote on 2011/08/08 12:52:01:
>
> > I prefer to keep the current code which (at the time) generated good code
> > for at least ppc:
> >    /* Align it */
> >    if (unlikely((long)buf & 3 && len)) {
> >       do {
> >          DO_CRC(*buf++);
> >       } while ((--len) && ((long)buf)&3);
> >    }
>
> Ah, I was looking at fzago's initial patch; I hadn't realized you'd
> tweaked it.  That's pretty much what I was talking about.
>
> Would
>     if (unlikely((long)buf & 3) && len) {
>
> give the compiler better hints?  len != 0 is awfully
> likely, actually.

This is my current hack. Not ideal yet, just to give you a hint:
diff --git a/lib/crc32.c b/lib/crc32.c
index b06d1e7..d822ceb 100644
--- a/lib/crc32.c
+++ b/lib/crc32.c
@@ -28,13 +28,13 @@
 #include <linux/init.h>
 #include <asm/atomic.h>
 #include "crc32defs.h"
-#if CRC_LE_BITS == 8
+#if CRC_LE_BITS == 32 || CRC_LE_BITS == 64
 # define tole(x) __constant_cpu_to_le32(x)
 #else
 # define tole(x) (x)
 #endif

-#if CRC_BE_BITS == 8
+#if CRC_BE_BITS == 32 || CRC_BE_BITS == 64
 # define tobe(x) __constant_cpu_to_be32(x)
 #else
 # define tobe(x) (x)
@@ -45,27 +45,32 @@ MODULE_AUTHOR("Matt Domsch <Matt_Domsch@dell.com>");
 MODULE_DESCRIPTION("Ethernet CRC32 calculations");
 MODULE_LICENSE("GPL");

-#if CRC_LE_BITS == 8 || CRC_BE_BITS == 8
+#if CRC_LE_BITS == 32 || CRC_BE_BITS == 64

 static inline u32
 crc32_body(u32 crc, unsigned char const *buf, size_t len, const u32 (*tab)[256])
 {
 # ifdef __LITTLE_ENDIAN
 #  define DO_CRC(x) crc = t0[(crc ^ (x)) & 255] ^ (crc >> 8)
-#  define DO_CRC4 crc = t3[(crc) & 255] ^ \
-		t2[(crc >> 8) & 255] ^ \
-		t1[(crc >> 16) & 255] ^ \
-		t0[(crc >> 24) & 255]
+#  define DO_CRC4(crc, x0, x1, x2, x3) \
+	x3[(crc) & 255] ^		\
+		x2[(crc >> 8) & 255] ^	\
+		x1[(crc >> 16) & 255] ^ \
+		x0[(crc >> 24) & 255]
 # else
 #  define DO_CRC(x) crc = t0[((crc >> 24) ^ (x)) & 255] ^ (crc << 8)
-#  define DO_CRC4 crc = t0[(crc) & 255] ^ \
-		t1[(crc >> 8) & 255] ^  \
-		t2[(crc >> 16) & 255] ^	\
-		t3[(crc >> 24) & 255]
+#  define DO_CRC4(crc, x0, x1, x2, x3) \
+	x0[(crc) & 255] ^		\
+		x1[(crc >> 8) & 255] ^  \
+		x2[(crc >> 16) & 255] ^	\
+		x3[(crc >> 24) & 255]
 # endif
 	const u32 *b;
 	size_t    rem_len;
 	const u32 *t0=tab[0], *t1=tab[1], *t2=tab[2], *t3=tab[3];
+#if CRC_LE_BITS == 64 || CRC_BE_BITS == 64
+	const u32 *t4=tab[4], *t5=tab[5], *t6=tab[6], *t7=tab[7];
+#endif

 	/* Align it */
 	if (unlikely((long)buf & 3 && len)) {
@@ -73,13 +78,25 @@ crc32_body(u32 crc, unsigned char const *buf, size_t len, const u32 (*tab)[256])
 			DO_CRC(*buf++);
 		} while ((--len) && ((long)buf)&3);
 	}
+#if CRC_LE_BITS == 64 || CRC_BE_BITS == 64
+	rem_len = len & 7;
+	len = len >> 3;
+#else
 	rem_len = len & 3;
-	/* load data 32 bits wide, xor data 32 bits wide. */
 	len = len >> 2;
+#endif
+	/* load data 32 bits wide, xor data 32 bits wide. */
 	b = (const u32 *)buf;
 	for (--b; len; --len) {
 		crc ^= *++b; /* use pre increment for speed */
-		DO_CRC4;
+#if CRC_LE_BITS == 64 || CRC_BE_BITS == 64
+		crc = DO_CRC4(crc, t4, t5, t6, t7);
+		++b;
+		crc ^= DO_CRC4(*b, t0, t1, t2, t3);
+#else
+		crc = DO_CRC4(crc, t0, t1, t2, t3);
+#endif
+
 	}
 	len = rem_len;
 	/* And the last few bytes */
@@ -123,7 +140,7 @@ u32 __pure crc32_le(u32 crc, unsigned char const *p, size_t len)

 u32 __pure crc32_le(u32 crc, unsigned char const *p, size_t len)
 {
-# if CRC_LE_BITS == 8
+# if CRC_LE_BITS == 32 || CRC_LE_BITS == 64
 	const u32      (*tab)[] = crc32table_le;

 	crc = __cpu_to_le32(crc);
@@ -132,17 +149,17 @@ u32 __pure crc32_le(u32 crc, unsigned char const *p, size_t len)
 # elif CRC_LE_BITS == 4
 	while (len--) {
 		crc ^= *p++;
-		crc = (crc >> 4) ^ crc32table_le[crc & 15];
-		crc = (crc >> 4) ^ crc32table_le[crc & 15];
+		crc = (crc >> 4) ^ crc32table_le[0][crc & 15];
+		crc = (crc >> 4) ^ crc32table_le[0][crc & 15];
 	}
 	return crc;
 # elif CRC_LE_BITS == 2
 	while (len--) {
 		crc ^= *p++;
-		crc = (crc >> 2) ^ crc32table_le[crc & 3];
-		crc = (crc >> 2) ^ crc32table_le[crc & 3];
-		crc = (crc >> 2) ^ crc32table_le[crc & 3];
-		crc = (crc >> 2) ^ crc32table_le[crc & 3];
+		crc = (crc >> 2) ^ crc32table_le[0][crc & 3];
+		crc = (crc >> 2) ^ crc32table_le[0][crc & 3];
+		crc = (crc >> 2) ^ crc32table_le[0][crc & 3];
+		crc = (crc >> 2) ^ crc32table_le[0][crc & 3];
 	}
 	return crc;
 # endif
@@ -180,7 +197,7 @@ u32 __pure crc32_be(u32 crc, unsigned char const *p, size_t len)
 #else				/* Table-based approach */
 u32 __pure crc32_be(u32 crc, unsigned char const *p, size_t len)
 {
-# if CRC_BE_BITS == 8
+# if CRC_BE_BITS == 32 || CRC_BE_BITS == 64
 	const u32      (*tab)[] = crc32table_be;

 	crc = __cpu_to_be32(crc);
@@ -189,17 +206,17 @@ u32 __pure crc32_be(u32 crc, unsigned char const *p, size_t len)
 # elif CRC_BE_BITS == 4
 	while (len--) {
 		crc ^= *p++ << 24;
-		crc = (crc << 4) ^ crc32table_be[crc >> 28];
-		crc = (crc << 4) ^ crc32table_be[crc >> 28];
+		crc = (crc << 4) ^ crc32table_be[0][crc >> 28];
+		crc = (crc << 4) ^ crc32table_be[0][crc >> 28];
 	}
 	return crc;
 # elif CRC_BE_BITS == 2
 	while (len--) {
 		crc ^= *p++ << 24;
-		crc = (crc << 2) ^ crc32table_be[crc >> 30];
-		crc = (crc << 2) ^ crc32table_be[crc >> 30];
-		crc = (crc << 2) ^ crc32table_be[crc >> 30];
-		crc = (crc << 2) ^ crc32table_be[crc >> 30];
+		crc = (crc << 2) ^ crc32table_be[0][crc >> 30];
+		crc = (crc << 2) ^ crc32table_be[0][crc >> 30];
+		crc = (crc << 2) ^ crc32table_be[0][crc >> 30];
+		crc = (crc << 2) ^ crc32table_be[0][crc >> 30];
 	}
 	return crc;
 # endif
diff --git a/lib/crc32defs.h b/lib/crc32defs.h
index 9b6773d..c5fa3eb 100644
--- a/lib/crc32defs.h
+++ b/lib/crc32defs.h
@@ -8,25 +8,25 @@

 /* How many bits at a time to use.  Requires a table of 4<<CRC_xx_BITS bytes. */
 /* For less performance-sensitive, use 4 */
-#ifndef CRC_LE_BITS
-# define CRC_LE_BITS 8
+#ifndef CRC_LE_BITS
+# define CRC_LE_BITS 32
 #endif
 #ifndef CRC_BE_BITS
-# define CRC_BE_BITS 8
+# define CRC_BE_BITS 32
 #endif

 /*
  * Little-endian CRC computation.  Used with serial bit streams sent
  * lsbit-first.  Be sure to use cpu_to_le32() to append the computed CRC.
  */
-#if CRC_LE_BITS > 8 || CRC_LE_BITS < 1 || CRC_LE_BITS & CRC_LE_BITS-1
-# error CRC_LE_BITS must be a power of 2 between 1 and 8
+#if CRC_LE_BITS > 64 || CRC_LE_BITS < 1 || CRC_LE_BITS & CRC_LE_BITS-1
+# error CRC_LE_BITS must be a power of 2 between 1 and 64
 #endif

 /*
  * Big-endian CRC computation.  Used with serial bit streams sent
  * msbit-first.  Be sure to use cpu_to_be32() to append the computed CRC.
  */
-#if CRC_BE_BITS > 8 || CRC_BE_BITS < 1 || CRC_BE_BITS & CRC_BE_BITS-1
+#if CRC_BE_BITS > 64 || CRC_BE_BITS < 1 || CRC_BE_BITS & CRC_BE_BITS-1
 # error CRC_BE_BITS must be a power of 2 between 1 and 8
 #endif
diff --git a/lib/gen_crc32table.c b/lib/gen_crc32table.c
index 85d0e41..16336eb 100644
--- a/lib/gen_crc32table.c
+++ b/lib/gen_crc32table.c
@@ -4,11 +4,22 @@

 #define ENTRIES_PER_LINE 4

-#define LE_TABLE_SIZE (1 << CRC_LE_BITS)
-#define BE_TABLE_SIZE (1 << CRC_BE_BITS)
+#if CRC_LE_BITS > 8
+# define LE_TABLE_SIZE 256
+#else
+# define LE_TABLE_SIZE (1 << CRC_LE_BITS)
+#endif
+#if CRC_BE_BITS > 8
+# define BE_TABLE_SIZE 256
+#else
+# define BE_TABLE_SIZE (1 << CRC_BE_BITS)
+#endif

-static uint32_t crc32table_le[4][LE_TABLE_SIZE];
-static uint32_t crc32table_be[4][BE_TABLE_SIZE];
+#define LE_TABLE_ROWS ((CRC_LE_BITS - 1)/8 + 1)
+#define BE_TABLE_ROWS ((CRC_BE_BITS - 1)/8 + 1)
+
+static uint32_t crc32table_le[LE_TABLE_ROWS][LE_TABLE_SIZE];
+static uint32_t crc32table_be[BE_TABLE_ROWS][BE_TABLE_SIZE];

 /**
  * crc32init_le() - allocate and initialize LE table data
@@ -24,14 +35,14 @@ static void crc32init_le(void)

 	crc32table_le[0][0] = 0;

-	for (i = 1 << (CRC_LE_BITS - 1); i; i >>= 1) {
+	for (i = LE_TABLE_SIZE/2; i; i >>= 1) {
 		crc = (crc >> 1) ^ ((crc & 1) ? CRCPOLY_LE : 0);
 		for (j = 0; j < LE_TABLE_SIZE; j += 2 * i)
 			crc32table_le[0][i + j] = crc ^ crc32table_le[0][j];
 	}
 	for (i = 0; i < LE_TABLE_SIZE; i++) {
 		crc = crc32table_le[0][i];
-		for (j = 1; j < 4; j++) {
+		for (j = 1; j < LE_TABLE_ROWS; j++) {
 			crc = crc32table_le[0][crc & 0xff] ^ (crc >> 8);
 			crc32table_le[j][i] = crc;
 		}
@@ -55,43 +66,47 @@ static void crc32init_be(void)
 	}
 	for (i = 0; i < BE_TABLE_SIZE; i++) {
 		crc = crc32table_be[0][i];
-		for (j = 1; j < 4; j++) {
+		for (j = 1; j < BE_TABLE_ROWS; j++) {
 			crc = crc32table_be[0][(crc >> 24) & 0xff] ^ (crc << 8);
 			crc32table_be[j][i] = crc;
 		}
 	}
 }

-static void output_table(uint32_t table[4][256], int len, char *trans)
+static void output_table(uint32_t table[], int len, char *trans)
 {
-	int i, j;
-
-	for (j = 0 ; j < 4; j++) {
-		printf("{");
-		for (i = 0; i < len - 1; i++) {
-			if (i % ENTRIES_PER_LINE == 0)
-				printf("\n");
-			printf("%s(0x%8.8xL), ", trans, table[j][i]);
-		}
-		printf("%s(0x%8.8xL)},\n", trans, table[j][len - 1]);
+	int i;
+
+	printf("{");
+	for (i = 0; i < len - 1; i++) {
+		if (i % ENTRIES_PER_LINE == 0)
+			printf("\n");
+		printf("%s(0x%8.8xL), ", trans, table[i]);
 	}
+	printf("%s(0x%8.8xL)},\n", trans, table[len - 1]);
 }

 int main(int argc, char** argv)
 {
+	int i;
+
 	printf("/* this file is generated - do not edit */\n\n");

 	if (CRC_LE_BITS > 1) {
 		crc32init_le();
-		printf("static const u32 crc32table_le[4][256] = {");
-		output_table(crc32table_le, LE_TABLE_SIZE, "tole");
+		printf("static const u32 crc32table_le[%d][%d] = {",
+		       LE_TABLE_ROWS, LE_TABLE_SIZE);
+		for (i = 0 ; i < LE_TABLE_ROWS; i++)
+			output_table(crc32table_le[i], LE_TABLE_SIZE, "tole");
 		printf("};\n");
 	}

 	if (CRC_BE_BITS > 1) {
 		crc32init_be();
-		printf("static const u32 crc32table_be[4][256] = {");
-		output_table(crc32table_be, BE_TABLE_SIZE, "tobe");
+		printf("static const u32 crc32table_be[%d][%d] = {",
+		       BE_TABLE_ROWS, BE_TABLE_SIZE);
+		for (i = 0 ; i < BE_TABLE_ROWS; i++)
+			output_table(crc32table_be[i], BE_TABLE_SIZE, "tobe");
 		printf("};\n");
 	}




* Re: [PATCH] add slice by 8 algorithm to crc32.c
  2011-08-08 11:42     ` Joakim Tjernlund
@ 2011-08-08 12:54       ` George Spelvin
  2011-08-08 17:01         ` Bob Pearson
  0 siblings, 1 reply; 27+ messages in thread
From: George Spelvin @ 2011-08-08 12:54 UTC (permalink / raw)
  To: joakim.tjernlund, linux; +Cc: akpm, fzago, linux-kernel, rpearson

> -#define LE_TABLE_SIZE (1 << CRC_LE_BITS)
> -#define BE_TABLE_SIZE (1 << CRC_BE_BITS)
> +#if CRC_LE_BITS > 8
> +# define LE_TABLE_SIZE 256
> +#else
> +# define LE_TABLE_SIZE (1 << CRC_LE_BITS)
> +#endif
> +#if CRC_BE_BITS > 8
> +# define BE_TABLE_SIZE 256
> +#else
> +# define BE_TABLE_SIZE (1 << CRC_BE_BITS)
> +#endif
> 
> -static uint32_t crc32table_le[4][LE_TABLE_SIZE];
> -static uint32_t crc32table_be[4][BE_TABLE_SIZE];
> +#define LE_TABLE_ROWS ((CRC_LE_BITS - 1)/8 + 1)
> +#define BE_TABLE_ROWS ((CRC_BE_BITS - 1)/8 + 1)
> +
> +static uint32_t crc32table_le[LE_TABLE_ROWS][LE_TABLE_SIZE];
> +static uint32_t crc32table_be[BE_TABLE_ROWS][BE_TABLE_SIZE];

Minor cleanup suggestion: The two different ways of computing
xE_TABLE_SIZE and xE_TABLE_ROWS are a bit confusing.

May I recommend choosing one of the following:

#if CRC_LE_BITS > 8
# define LE_TABLE_ROWS (CRC_LE_BITS/8)
# define LE_TABLE_SIZE 256
#else
# define LE_TABLE_ROWS 1
# define LE_TABLE_SIZE (1 << CRC_LE_BITS)
#endif

or

#define LE_TABLE_ROWS ((CRC_LE_BITS - 1)/8 + 1)
#define LE_TABLE_SIZE (1 << ((CRC_LE_BITS - 1)%8 + 1))

Either one makes the relationship between the two clearer.
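
Spot-checking either form: CRC_LE_BITS = 64 gives ROWS = 8 and SIZE = 256,
and CRC_LE_BITS = 4 gives ROWS = 1 and SIZE = 16 (in the second form,
(64-1)%8 + 1 = 8 and (4-1)%8 + 1 = 4), so the two definitions agree for
all the power-of-two settings.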


* RE: [PATCH] add slice by 8 algorithm to crc32.c
  2011-08-08  9:28 [PATCH] add slice by 8 algorithm to crc32.c George Spelvin
  2011-08-08 10:31 ` Joakim Tjernlund
@ 2011-08-08 16:50 ` Bob Pearson
  1 sibling, 0 replies; 27+ messages in thread
From: Bob Pearson @ 2011-08-08 16:50 UTC (permalink / raw)
  To: 'George Spelvin', fzago, linux-kernel; +Cc: akpm, joakim.tjernlund



> -----Original Message-----
> From: George Spelvin [mailto:linux@horizon.com]
> Sent: Monday, August 08, 2011 4:28 AM
> To: fzago@systemfabricworks.com; linux-kernel@vger.kernel.org
> Cc: akpm@linux-foundation.org; joakim.tjernlund@transmode.se;
> linux@horizon.com; rpearson@systemfabricworks.com
> Subject: [PATCH] add slice by 8 algorithm to crc32.c
> 
> Sorry I didn't see this when first posted.
> 
> The "slice by 8" terminology is pretty confusing.  How about
> "Extended Joakim Tjernlund's optimization from commit
> 836e2af92503f1642dbc3c3281ec68ec1dd39d2e to 8-way parallelism."

Here is a link to the article where I first read about this algorithm. It
mentions both the 4- and 8-byte versions.
I do not know about priority between Joakim and the folks at Intel, but Intel
is usually credited with the idea in other articles I have seen. Clearly the
algorithm that is currently in crc32.c is the same as the one described in
the article. As you can see, I mis-copied the name "slicing-by-8" as "slice
by 8".

http://www.intel.com/technology/comms/perfnet/download/CRC_generators.pdf
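
For reference, the kernel of slicing-by-8 as described there looks
roughly like this (little-endian form; a sketch, not the exact code from
the patch):

	/* two 32-bit words per iteration; the earliest byte in the
	   stream gets the largest (t7) advance */
	q1 = *p32++ ^ crc;
	q2 = *p32++;
	crc = t7[q1 & 255] ^ t6[(q1 >> 8) & 255] ^
	      t5[(q1 >> 16) & 255] ^ t4[q1 >> 24] ^
	      t3[q2 & 255] ^ t2[(q2 >> 8) & 255] ^
	      t1[(q2 >> 16) & 255] ^ t0[q2 >> 24];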

> 
> Which is essentially what you're doing.  The renaming of tab[0] to t0_le
> and t0_be, and the removal of the DO_CRC4 macro, just increase the diff size.
> 
> If you're looking at speeding up the CRC through larger tables, have
> you tried using 10+11+11-bit tables?  That would require 20K of tables
> rather than 8K, but would reduce the number of table lookups per byte.
> 
> 
> One more stunt you could try to increase parallelism: rather than maintain
> the CRC in one register, maintain it in several, and only XOR and collapse
> them at the end.
> 
> Start with your 64-bit code, but imagine that the second code block's
> "q = *p32++" always loads 0, and therefore the whole block can be skipped.
> (Since tab[0] = 0 for all CRC tables.)
> 
> This computes the CRC of the even words.  Then do a second one in parallel
> for the odd words into a separate CRC register.  Then combine them at the
> end.
> (Shift one up by 32 bits and XOR into the other.)
> 
> This would let you get away with 5K of tables: t4 through t7, and t0.
> t1 through t3 could be skipped.
> 
> 
> Ideally, I'd write all this code myself, but I'm a bit crunched at work
> right now so wouldn't be able to get to it for a few days.
> 
> 
> 
> Another possible simplification to the startup code.  There's no need
> to compute init_bytes explicitly; just loop until the pointer is aligned:
> 
> 	while ((unsigned)buf & 3) {
> 		if (!len--)
> 			goto done;
> #ifdef __LITTLE_ENDIAN
> 		i0 = *buf++ ^ crc;
> 		crc = t0_le[i0] ^ (crc >> 8);
> #else
> 		i0 = *buf++ ^ (crc >> 24);
> 		crc = t0_be[i0] ^ (crc << 8);
> #endif
> 	}
> 	p32 = (u32 const *)buf;
> 	words = len >> 2;
> 	end_bytes = len & 3;
> 
> 
> ... although I'd prefer to keep the DO_CRC() and DO_CRC4 macros, and
> extend them to the 64-bit case, to avoid the nested #ifdefs.  That would
> make:
> 
> 	while ((unsigned)buf & 3) {
> 		if (!len--)
> 			goto done;
> 		DO_CRC(*buf++);
> 	}
> 	p32 = (u32 const *)buf;
> 	words = len >> 2;
> 	end_bytes = len & 3;

Personally, as you can probably tell, I don't like macros unless they are
very frequently used. The ifdefs were somewhat reduced in the second version
of the patch.



* RE: [PATCH] add slice by 8 algorithm to crc32.c
  2011-08-08 10:52   ` George Spelvin
                       ` (2 preceding siblings ...)
  2011-08-08 11:42     ` Joakim Tjernlund
@ 2011-08-08 16:54     ` Bob Pearson
  3 siblings, 0 replies; 27+ messages in thread
From: Bob Pearson @ 2011-08-08 16:54 UTC (permalink / raw)
  To: 'George Spelvin', joakim.tjernlund; +Cc: akpm, fzago, linux-kernel



> -----Original Message-----
> From: George Spelvin [mailto:linux@horizon.com]
> Sent: Monday, August 08, 2011 5:52 AM
> To: joakim.tjernlund@transmode.se; linux@horizon.com
> Cc: akpm@linux-foundation.org; fzago@systemfabricworks.com; linux-
> kernel@vger.kernel.org; rpearson@systemfabricworks.com
> Subject: Re: [PATCH] add slice by 8 algorithm to crc32.c
> 
> > I prefer to keep the current code which (at the time) generated good
code
> > for at least ppc:
> > 	/* Align it */
> > 	if (unlikely((long)buf & 3 && len)) {
> > 		do {
> > 			DO_CRC(*buf++);
> > 		} while ((--len) && ((long)buf)&3);
> > 	}
> 
> Ah, I was looking at fzago's initial patch; I hadn't realized you'd
> tweaked it.  That's pretty much what I was talking about.

Frank actually deserves none of the 'blame' for this effort. He is much more
experienced in sending in patches to linux-kernel than I am and helped me to
prepare and send the first version of the patch.

> 
> Would
>  	if (unlikely((long)buf & 3) && len) {
> 
> give the compiler better hints?  len != 0 is awfully
> likely, actually.



* RE: [PATCH] add slice by 8 algorithm to crc32.c
  2011-08-08 12:54       ` George Spelvin
@ 2011-08-08 17:01         ` Bob Pearson
  2011-08-08 20:45           ` George Spelvin
  0 siblings, 1 reply; 27+ messages in thread
From: Bob Pearson @ 2011-08-08 17:01 UTC (permalink / raw)
  To: 'George Spelvin', joakim.tjernlund; +Cc: akpm, fzago, linux-kernel

Happy to consider this. I have been asking the list for comments about the
idea of dropping the BITS=2 and BITS=4 algorithms altogether which would
make the table size just 256 entries. So far no one has claimed that they actually
care about those algorithms except as 'examples'. The 'Sarwate' algorithm
(which I added as an 8 bit version) is faster and only adds 2x4KB of table.

If no one really uses the smaller but much slower versions I don't see a
reason to keep them.
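
For reference, the Sarwate-style byte-at-a-time loop is just (LE form; a
sketch, not the exact code in the patch):

	/* one table, one lookup per byte; 256 * 4 = 1KB per endianness */
	while (len--)
		crc = tab[(crc ^ *p++) & 0xff] ^ (crc >> 8);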

> -----Original Message-----
> From: George Spelvin [mailto:linux@horizon.com]
> Sent: Monday, August 08, 2011 7:55 AM
> To: joakim.tjernlund@transmode.se; linux@horizon.com
> Cc: akpm@linux-foundation.org; fzago@systemfabricworks.com; linux-
> kernel@vger.kernel.org; rpearson@systemfabricworks.com
> Subject: Re: [PATCH] add slice by 8 algorithm to crc32.c
> 
> > -#define LE_TABLE_SIZE (1 << CRC_LE_BITS)
> > -#define BE_TABLE_SIZE (1 << CRC_BE_BITS)
> > +#if CRC_LE_BITS > 8
> > +# define LE_TABLE_SIZE 256
> > +#else
> > +# define LE_TABLE_SIZE (1 << CRC_LE_BITS)
> > +#endif
> > +#if CRC_BE_BITS > 8
> > +# define BE_TABLE_SIZE 256
> > +#else
> > +# define BE_TABLE_SIZE (1 << CRC_BE_BITS)
> > +#endif
> >
> > -static uint32_t crc32table_le[4][LE_TABLE_SIZE];
> > -static uint32_t crc32table_be[4][BE_TABLE_SIZE];
> > +#define LE_TABLE_ROWS ((CRC_LE_BITS - 1)/8 + 1)
> > +#define BE_TABLE_ROWS ((CRC_BE_BITS - 1)/8 + 1)
> > +
> > +static uint32_t crc32table_le[LE_TABLE_ROWS][LE_TABLE_SIZE];
> > +static uint32_t crc32table_be[BE_TABLE_ROWS][BE_TABLE_SIZE];
> 
> Minor cleanup suggestion: The two different ways of computing
> xE_TABLE_SIZE and xE_TABLE_ROWS are a bit confusing.
> 
> May I recommend choosing one of the following:
> 
> #if CRC_LE_BITS > 8
> # define LE_TABLE_ROWS (CRC_LE_BITS/8)
> # define LE_TABLE_SIZE 256
> #else
> # define LE_TABLE_ROWS 1
> # define LE_TABLE_SIZE (1 << CRC_LE_BITS)
> #endif
> 
> or
> 
> #define LE_TABLE_ROWS ((CRC_LE_BITS - 1)/8 + 1)
> #define LE_TABLE_SIZE (1 << ((CRC_LE_BITS - 1)%8 + 1))
> 
> Either one makes the relationship between the two clearer.



* RE: [PATCH] add slice by 8 algorithm to crc32.c
  2011-08-08 11:11     ` Joakim Tjernlund
@ 2011-08-08 17:04       ` Bob Pearson
  0 siblings, 0 replies; 27+ messages in thread
From: Bob Pearson @ 2011-08-08 17:04 UTC (permalink / raw)
  To: 'Joakim Tjernlund', 'George Spelvin'
  Cc: akpm, fzago, linux-kernel

Same for x86. The code for the pre- and post-cleanup loops is the same
there.

> -----Original Message-----
> From: Joakim Tjernlund [mailto:joakim.tjernlund@transmode.se]
> Sent: Monday, August 08, 2011 6:11 AM
> To: George Spelvin
> Cc: akpm@linux-foundation.org; fzago@systemfabricworks.com; linux-
> kernel@vger.kernel.org; rpearson@systemfabricworks.com
> Subject: Re: [PATCH] add slice by 8 algorithm to crc32.c
> 
> "George Spelvin" <linux@horizon.com> wrote on 2011/08/08 12:52:01:
> >
> > > I prefer to keep the current code which (at the time) generated good
> code
> > > for at least ppc:
> > >    /* Align it */
> > >    if (unlikely((long)buf & 3 && len)) {
> > >       do {
> > >          DO_CRC(*buf++);
> > >       } while ((--len) && ((long)buf)&3);
> > >    }
> >
> > Ah, I was looking at fzago's initial patch; I hadn't realized you'd
> > tweaked it.  That's pretty much what I was talking about.
> >
> > Would
> >     if (unlikely((long)buf & 3) && len) {
> >
> > give the compiler better hints?  len != 0 is awfully
> > likely, actually.
> 
> Doesn't matter on ppc (gcc 4.4.4). The whole while loop is moved out of
> line in both cases and the generated asm is identical.




* RE: [PATCH] add slice by 8 algorithm to crc32.c
  2011-08-08 17:01         ` Bob Pearson
@ 2011-08-08 20:45           ` George Spelvin
  2011-08-08 22:21             ` Bob Pearson
  0 siblings, 1 reply; 27+ messages in thread
From: George Spelvin @ 2011-08-08 20:45 UTC (permalink / raw)
  To: joakim.tjernlund, linux, rpearson; +Cc: akpm, fzago, linux-kernel

> Happy to consider this. I have been asking the list for comments about the
> idea of dropping the BITS=2 and BITS=4 algorithms altogether which would
> make the table size just 256. So far no one has claimed that they actually
> care about those algorithms except as 'examples'. The 'Sarwate' algorithm
> (which I added as an 8 bit version) is faster and only adds 2x4KB of table.

Um, I thought the Sarwate algorithm was what was already there as BITS=8.
(The original paper stores the CRC in two 8-bit variables and two
8-bit tables rather than using 16-bit words, but other than that,
it's identical.)

And that requires just 1KB of table per endianness (256 entries x 4 bytes).

Anyway, I thought the complaint about large tables was
not so much the memory as the L1 cache pollution.


* RE: [PATCH] add slice by 8 algorithm to crc32.c
  2011-08-08 20:45           ` George Spelvin
@ 2011-08-08 22:21             ` Bob Pearson
  0 siblings, 0 replies; 27+ messages in thread
From: Bob Pearson @ 2011-08-08 22:21 UTC (permalink / raw)
  To: 'George Spelvin', joakim.tjernlund; +Cc: akpm, fzago, linux-kernel



> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of George Spelvin
> Sent: Monday, August 08, 2011 3:45 PM
> To: joakim.tjernlund@transmode.se; linux@horizon.com;
> rpearson@systemfabricworks.com
> Cc: akpm@linux-foundation.org; fzago@systemfabricworks.com; linux-
> kernel@vger.kernel.org
> Subject: RE: [PATCH] add slice by 8 algorithm to crc32.c
> 
> > Happy to consider this. I have been asking the list for comments about
the
> > idea of dropping the BITS=2 and BITS=4 algorithms altogether which would
> > make the table size just 256. So far no one has claimed that they
actually
> > care about those algorithms except as 'examples'. The 'Sarwate'
algorithm
> > (which I added as an 8 bit version) is faster and only adds 2x4KB of
table.
> 
> Um, I thought the Sarwate algorithm was what was already there as BITS=8.
> (The original paper stores the CRC in two 8-bit variables and two
> 8-bit tables rather than using 16-bit words, but other than that,
> it's identical.)

In the current, i.e. original, version of crc32.c, BITS=8 is the 32 bit
algorithm; the Sarwate version is not used. In the patched version that was
moved to BITS=32, and another one slipped in as BITS=8, which is Sarwate.

> 
> And that requires just 1KB of table per endianness.

You're right! It's 2x1KB, not 2x4KB. There are 2 tables (le and be).

> 
> Anyway, I thought the complaint about large tables was
> not so much the memory as the L1 cache pollution.

As far as performance goes, cache pollution is the more important effect.
That is why you can't just extend the 2, 4, 8 sequence, which has the tables
growing exponentially with BITS. The Slicing-by-N algorithm, though, has
table size growing linearly with N. But if you are trying to run Linux on a
small embedded CPU, even linear table size might be important. Back in the
good old days you could strip Linux down to a very small kernel.
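
To put rough numbers on that (4-byte entries): a single table of
(4 << BITS) bytes is 1KB at BITS=8 but would be 256KB at BITS=16, while
slicing-by-N needs only N 1KB byte tables, i.e. 4KB for slicing-by-4 and
8KB for slicing-by-8, per endianness.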




* RE: [PATCH] add slice by 8 algorithm to crc32.c
  2011-08-05 17:27         ` Bob Pearson
@ 2011-08-08  7:15           ` Joakim Tjernlund
  0 siblings, 0 replies; 27+ messages in thread
From: Joakim Tjernlund @ 2011-08-08  7:15 UTC (permalink / raw)
  To: Bob Pearson; +Cc: 'Andrew Morton', 'frank zago', linux-kernel

"Bob Pearson" <rpearson@systemfabricworks.com> wrote on 2011/08/05 19:27:26:
>
> > >
> > > >
> > > > >
> > > > > Modify all 'i' loops from for (i = 0; i < foo; i++) { ... } to
> > > > > for (i = foo - 1; i >= 0; i--) { ... }
> > > >
> > > > That should be (i = foo; i ; --i) { ... }
> > >
> > > Shouldn't make much difference, branch on zero bit or branch on sign
> > > bit. But at the end of the day didn't help on Nehalem.
>
> I figured out why "for (i = 0; i < len; i++) {...}" is faster than "for (;
> len; len--) {...}" on my system.
> The current code is
>
> for (; len; len--) {
>    load *++p
>    ...
> }
>
> Which turns into (in fake assembly)
>
> top:
>    dec len
>    inc p
>    load p
>    ...
>    test len
>    branch neq top
>
> But when I replace that with
>
> for(i = 0; i < len; i++) {
>    load *++p
>    ...
> }
>
> Gcc turns it into
>
> top:
>    load p[i]
>    i++
>    ...
>    compare i, len
>    branch lt top
>
> which is fewer instructions and i++ is well scheduled. Incrementing the
> pointer has been moved out of the loop.

I see. Let's leave the pre- vs. post-increment question for now. That is
something that can be sorted separately.

 Jocke



* RE: [PATCH] add slice by 8 algorithm to crc32.c
  2011-08-05 15:51         ` Bob Pearson
@ 2011-08-08  7:11           ` Joakim Tjernlund
  0 siblings, 0 replies; 27+ messages in thread
From: Joakim Tjernlund @ 2011-08-08  7:11 UTC (permalink / raw)
  To: Bob Pearson; +Cc: 'Andrew Morton', 'frank zago', linux-kernel

"Bob Pearson" <rpearson@systemfabricworks.com> wrote on 2011/08/05 17:51:00:
>
> > > version. While I haven't done the experiment you suggest there is
> > > something to the point that the second q computation in the new
> > > version can be moved ahead of the table lookups from the first q
> > > computation. My guess is that the unrolled version will be
> > > significantly slower.
> >
> > Ah, didn't see that. Don't understand how this works though.
> > Why do you do two 32 bit loads instead of one 64 bit load?
> >
> > >
>
> The two expression trees can be computed in parallel and combined with the
> final xor. If the compiler/instruction scheduler are smart enough and can
> process enough instructions per cycle they overlap well and you get some
> speedup. I did try a 64 bit load on Nehalem but got about 2 cycles per byte
> which is a little worse than doing two loads and better than the 32 bit
> version. I'm not really sure why.

I see. There is one thing that strikes me as a bit odd. Why do 8 byte alignment
when you do 32 bit accesses? Surely the current 4 byte alignment will do?

 Jocke



* RE: [PATCH] add slice by 8 algorithm to crc32.c
  2011-08-05  9:22       ` Joakim Tjernlund
  2011-08-05 15:51         ` Bob Pearson
@ 2011-08-05 17:27         ` Bob Pearson
  2011-08-08  7:15           ` Joakim Tjernlund
  1 sibling, 1 reply; 27+ messages in thread
From: Bob Pearson @ 2011-08-05 17:27 UTC (permalink / raw)
  To: 'Joakim Tjernlund'
  Cc: 'Andrew Morton', 'frank zago', linux-kernel

> >
> > >
> > > >
> > > > Modify all 'i' loops from for (i = 0; i < foo; i++) { ... } to
> > > > for (i = foo - 1; i >= 0; i--) { ... }
> > >
> > > That should be (i = foo; i ; --i) { ... }
> >
> > Shouldn't make much difference, branch on zero bit or branch on sign
> > bit. But at the end of the day didn't help on Nehalem.

I figured out why "for (i = 0; i < len; i++) {...}" is faster than "for (;
len; len--) {...}" on my system.
The current code is

for (; len; len--) {
	load *++p
	...
}

Which turns into (in fake assembly)

top:
	dec len
	inc p
	load p
	...
	test len
	branch neq top

But when I replace that with 

for(i = 0; i < len; i++) {
	load *++p
	...
}

Gcc turns it into

top:
	load p[i]
	i++
	...
	compare i, len
	branch lt top

which is fewer instructions and i++ is well scheduled. Incrementing the
pointer has been moved out of the loop.



* RE: [PATCH] add slice by 8 algorithm to crc32.c
  2011-08-05  9:22       ` Joakim Tjernlund
@ 2011-08-05 15:51         ` Bob Pearson
  2011-08-08  7:11           ` Joakim Tjernlund
  2011-08-05 17:27         ` Bob Pearson
  1 sibling, 1 reply; 27+ messages in thread
From: Bob Pearson @ 2011-08-05 15:51 UTC (permalink / raw)
  To: 'Joakim Tjernlund'
  Cc: 'Andrew Morton', 'frank zago', linux-kernel

> > version. While I haven't done the experiment you suggest there is
> > something to the point that the second q computation in the new version
> > can be moved ahead of the table lookups from the first q computation.
> > My guess is that the unrolled version will be significantly slower.
> 
> Ah, didn't see that. Don't understand how this works though.
> Why do you do two 32 bit loads instead of one 64 bit load?
> 
> >

The two expression trees can be computed in parallel and combined with the
final xor. If the compiler/instruction scheduler are smart enough and can
process enough instructions per cycle they overlap well and you get some
speedup. I did try a 64 bit load on Nehalem but got about 2 cycles per byte
which is a little worse than doing two loads and better than the 32 bit
version. I'm not really sure why.
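
For comparison, the single 64-bit load variant I measured would look
something like this (an untested little-endian sketch):

	/* one 8-byte load per iteration instead of two 4-byte loads;
	   the eight table lookups are the same either way */
	u64 q = *p64++ ^ crc;
	crc = t7[q & 255] ^ t6[(q >> 8) & 255] ^
	      t5[(q >> 16) & 255] ^ t4[(q >> 24) & 255] ^
	      t3[(q >> 32) & 255] ^ t2[(q >> 40) & 255] ^
	      t1[(q >> 48) & 255] ^ t0[q >> 56];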



* RE: [PATCH] add slice by 8 algorithm to crc32.c
       [not found]       ` <OF14136E0E.3F2388EF-ONC12578E3.00301969-C12578E3.00338524@LocalDomain>
@ 2011-08-05 13:34         ` Joakim Tjernlund
  0 siblings, 0 replies; 27+ messages in thread
From: Joakim Tjernlund @ 2011-08-05 13:34 UTC (permalink / raw)
  Cc: Bob Pearson, 'Andrew Morton', 'frank zago', linux-kernel

Joakim Tjernlund/Transmode wrote on 2011/08/05 11:22:44:
>
> "Bob Pearson" <rpearson@systemfabricworks.com> wrote on 2011/08/04 20:53:20:
> >
> > Sure... See below.
> >
> > > -----Original Message-----
> > > From: Joakim Tjernlund [mailto:joakim.tjernlund@transmode.se]
> > > Sent: Thursday, August 04, 2011 6:54 AM
> > > To: Bob Pearson
> > > Cc: 'Andrew Morton'; 'frank zago'; linux-kernel@vger.kernel.org
> > > Subject: RE: [PATCH] add slice by 8 algorithm to crc32.c
> > >
> > > "Bob Pearson" <rpearson@systemfabricworks.com> wrote on 2011/08/02
> > > 23:14:39:
> > > >
> > > > Hi Joakim,
> > > >
> > > > Sorry to take so long to respond.
> > >
> > > No problem but please insert your answers in correct context (like I
> > > did). This makes it much easier to read and comment on.
> > >
> > > >
> > > > Here are some performance data collected from the original and modified
> > > > crc32 algorithms.
> > > > The following is a simple test loop that computes the time to compute
> > > > 1000 crc's over 4096 bytes of data aligned on an 8 byte boundary after
> > > > warming the cache. You could make other measurements but this is sort
> > > > of a best case.
> > > >
> > > > These measurements were made on a dual socket Nehalem 2.267 GHz
> > > > system.
> > >
> > > Measurements on your SPARC would be good too.
> >
> > Will do. But it is decrepit and quite slow. My main motivation is to run a
> > 10G protocol so I am mostly motivated to get x86_64 going as fast as
> > possible.
>
> 64 bits may be faster on x86_64 but not on ppc32. Your latest patch gives:
>  crc32: CRC_LE_BITS = 64, CRC_BE BITS = 64
>  crc32: self tests passed, processed 225944 bytes in 3987640 nsec
>  crc32: CRC_LE_BITS = 32, CRC_BE BITS = 32
>  crc32: self tests passed, processed 225944 bytes in 2003630 nsec
> Almost a factor 2 slower.
> So in any case I don't think 64 bits should be default for all archs.
> Probably only for 64 bit archs.

I checked the asm on ppc for 32 bit crc32 and compared yours vs. mine. PPC
suffers with your version; the startup cost is much higher. I did notice one
win with your version though: the inner loop is reduced by 3 insns if one
uses separate arrays. However, loading 4 separate array addresses takes 16
insns on PPC, so I did the best thing for ppc:

diff --git a/lib/crc32.c b/lib/crc32.c
index 4855995..e3e391f 100644
--- a/lib/crc32.c
+++ b/lib/crc32.c
@@ -51,20 +51,21 @@ static inline u32
 crc32_body(u32 crc, unsigned char const *buf, size_t len, const u32 (*tab)[256])
 {
 # ifdef __LITTLE_ENDIAN
-#  define DO_CRC(x) crc = tab[0][(crc ^ (x)) & 255] ^ (crc >> 8)
-#  define DO_CRC4 crc = tab[3][(crc) & 255] ^ \
-		tab[2][(crc >> 8) & 255] ^ \
-		tab[1][(crc >> 16) & 255] ^ \
-		tab[0][(crc >> 24) & 255]
+#  define DO_CRC(x) crc = t0[(crc ^ (x)) & 255] ^ (crc >> 8)
+#  define DO_CRC4 crc = t3[(crc) & 255] ^ \
+		t2[(crc >> 8) & 255] ^ \
+		t1[(crc >> 16) & 255] ^ \
+		t0[(crc >> 24) & 255]
 # else
-#  define DO_CRC(x) crc = tab[0][((crc >> 24) ^ (x)) & 255] ^ (crc << 8)
-#  define DO_CRC4 crc = tab[0][(crc) & 255] ^ \
-		tab[1][(crc >> 8) & 255] ^ \
-		tab[2][(crc >> 16) & 255] ^ \
-		tab[3][(crc >> 24) & 255]
+#  define DO_CRC(x) crc = t0[((crc >> 24) ^ (x)) & 255] ^ (crc << 8)
+#  define DO_CRC4 crc = t0[(crc) & 255] ^ \
+		t1[(crc >> 8) & 255] ^ \
+		t2[(crc >> 16) & 255] ^ \
+		t3[(crc >> 24) & 255]
 # endif
 	const u32 *b;
 	size_t    rem_len;
+	const u32 *t0=tab[0], *t1=t0 + 256, *t2=t1 + 256, *t3=t2 + 256;

 	/* Align it */
 	if (unlikely((long)buf & 3 && len)) {

This reduces the inner loop by 3 insns while adding only 5 insns of startup cost.
I hope this brings my crc32 (32 bits) in line with yours, even on x86_64.
Please test.

 Jocke



* RE: [PATCH] add slice by 8 algorithm to crc32.c
  2011-08-04 18:53     ` Bob Pearson
@ 2011-08-05  9:22       ` Joakim Tjernlund
  2011-08-05 15:51         ` Bob Pearson
  2011-08-05 17:27         ` Bob Pearson
       [not found]       ` <OF14136E0E.3F2388EF-ONC12578E3.00301969-C12578E3.00338524@LocalDomain>
  1 sibling, 2 replies; 27+ messages in thread
From: Joakim Tjernlund @ 2011-08-05  9:22 UTC (permalink / raw)
  To: Bob Pearson; +Cc: 'Andrew Morton', 'frank zago', linux-kernel

"Bob Pearson" <rpearson@systemfabricworks.com> wrote on 2011/08/04 20:53:20:
>
> Sure... See below.
>
> > -----Original Message-----
> > From: Joakim Tjernlund [mailto:joakim.tjernlund@transmode.se]
> > Sent: Thursday, August 04, 2011 6:54 AM
> > To: Bob Pearson
> > Cc: 'Andrew Morton'; 'frank zago'; linux-kernel@vger.kernel.org
> > Subject: RE: [PATCH] add slice by 8 algorithm to crc32.c
> >
> > "Bob Pearson" <rpearson@systemfabricworks.com> wrote on 2011/08/02
> > 23:14:39:
> > >
> > > Hi Joakim,
> > >
> > > Sorry to take so long to respond.
> >
> > No problem but please insert your answers in correct context (like I
> > did). This makes it much easier to read and comment on.
> >
> > >
> > > Here are some performance data collected from the original and modified
> > > crc32 algorithms.
> > > The following is a simple test loop that computes the time to compute
> > > 1000 crc's over 4096 bytes of data aligned on an 8 byte boundary after
> > > warming the cache. You could make other measurements but this is sort of a best
> > > case.
> > >
> > > These measurements were made on a dual socket Nehalem 2.267 GHz
> > > system.
> >
> > Measurements on your SPARC would be good too.
>
> Will do. But it is decrepit and quite slow. My main motivation is to run a
> 10G protocol so I am mostly motivated to get x86_64 going as fast as
> possible.

64 bits may be faster on x86_64 but not on ppc32. Your latest patch gives:
 crc32: CRC_LE_BITS = 64, CRC_BE BITS = 64
 crc32: self tests passed, processed 225944 bytes in 3987640 nsec
 crc32: CRC_LE_BITS = 32, CRC_BE BITS = 32
 crc32: self tests passed, processed 225944 bytes in 2003630 nsec
Almost a factor 2 slower.
So in any case I don't think 64 bits should be default for all archs.
Probably only for 64 bit archs.
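
One way to express that (an illustrative sketch for crc32defs.h, not
something from the patch) would be to key the default off the word size:

#ifndef CRC_LE_BITS
# if BITS_PER_LONG == 64
#  define CRC_LE_BITS 64	/* slice by 8 only where 64-bit regs are free */
# else
#  define CRC_LE_BITS 32
# endif
#endif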

>
> >
> > >
> > > #ifdef CONFIG_CRC32_PERFTEST
> > > #include <linux/time.h>
> > >
> > > static u8 __attribute__((__aligned__(8))) test_buf[4096];
> > >
> > > /* need to convince gcc to not optimize out all the crc calls as
> > > dead code */
> > > u32 crc;
> > >
> > > static int __init crc32_init(void)
> > > {
> > >         int i;
> > >         struct timespec start, stop;
> > >         u64 nsec_le;
> > >         u64 nsec_be;
> > >
> > >         for (i = 0; i  < 4096; i++)
> > >                 test_buf[i] = i;
> > >
> > >         crc = crc32_le(0, test_buf, 4096);
> > >         crc = crc32_le(crc, test_buf, 4096);
> > >
> > >         getnstimeofday(&start);
> > >         for (i = 0; i < 1000; i++)
> > >                 crc = crc32_le(crc, test_buf, 4096);
> > >         getnstimeofday(&stop);
> > >
> > >         nsec_le = stop.tv_nsec - start.tv_nsec +
> > >                 1000000000 * (stop.tv_sec - start.tv_sec);
> > >
> > >         crc = crc32_be(0, test_buf, 4096);
> > >         crc = crc32_be(crc, test_buf, 4096);
> > >
> > >         getnstimeofday(&start);
> > >         for (i = 0; i < 1000; i++)
> > >                 crc = crc32_be(crc, test_buf, 4096);
> > >         getnstimeofday(&stop);
> > >
> > >         nsec_be = stop.tv_nsec - start.tv_nsec +
> > >                 1000000000 * (stop.tv_sec - start.tv_sec);
> > >
> > >         pr_info("crc32: nsec_le = %lld nsec_be = %lld\n", nsec_le,
> > >                 nsec_be);
> > >
> > >         return 0;
> > > }
> > >
> > > module_init(crc32_init);
> > > #endif /* CONFIG_CRC32_PERFTEST */
> > >
> > > Here are the timings.
> > >
> > >                          CRC_?E_BITS    crc_le(nsec)    crc_be(nsec)
> > >         orig crc32.c               8        5842712         5829793
> > >
> > >         new crc32.c               64        2877530         2862515
> > >         new crc32.c               32        5174400         5105816  (Equiv. to orig. BITS = 8)
> >
> > I really can't wrap my head around why your 32 bits version is faster
> > than the current one, especially as you say that a 2D array is a little
> > faster than a 1D array. The current impl. already uses a 2D array. Can
> > you identify what makes your version faster?
>
> Sorry if the comment below was not clear. But it is the 1D version that is
> faster not the other way around. That is one difference. The original code

Oh, I had it the other way around.

> is masking the indices with 0xff which may cause the compiler to put out
> more instructions than are necessary, not sure. In the new version it uses a
> byte operation.

Don't think that should matter; gcc should optimize this to the same asm.
On the other hand gcc fails quite often to do trivial optimizations so
one will have to check the asm. It also varies with gcc version and in my
experience these simple optimizations often get worse the newer gcc is :(

>
> >
> > >
> > > Modify all 'i' loops from for (i = 0; i < foo; i++) { ... } to
> > > for (i = foo - 1; i >= 0; i--) { ... }
> >
> > That should be (i = foo; i ; --i) { ... }
>
> Shouldn't make much difference, branch on zero bit or branch on sign bit.
> But at the end of the day didn't help on Nehalem.
>
> >
> > >     + count down               64        2837860        3230433
> > >                                          2865727        3264033  (repeat)
> > >     + count down               32        5177334        5070337
> > >        Results seem ambiguous on Nehalem. Slight improvement in some
> > >        cases. 12% worse in one case.
> > >
> > > Modify main loop to use pre-increment instead of post-increment.
> > >     + pre-increment            64        2872502        2767601
> > >     + pre-increment            32        5100579        5090453
> > >        Small scale improvements for each case. Will add to next drop.
> > >
> > > Do both count down and pre increment
> > >     + ct down + pre-inc        64        3329125        3367815
> > >     + ct down + pre-inc        32        5065308        5089825
> > >        Got significantly worse for 64 bit, slightly better for 32.
> > >
> > > Separately I looked at the difference between 1D and 2D arrays and
> > > 1D arrays are a few % faster. The difference is that the loader
> > > computes the base address at compile time. In the 2D case you have
> > > to add an offset to a parameter that was passed in to the body
> > > routine which is a runtime addition.
> >
> > I imagine the difference is even bigger on RISC archs like PowerPC and
> > SPARC as these may need 2 insns to load an address. A relocatable
> > kernel would be even worse I think.
>
> Sounds like you are agreeing with me. Loading 'base[i]' is always going to
> be easier than 'base[i][j]'. This is why the newer 4 byte version is faster.
> The first one takes one instruction with the actual base address put in
> place by the loader.

No, I meant it the other way around. It isn't base[i][j] but base[0][j], base[1][j]
etc. so you only load one base and use offsets instead of loading 4 base addresses.
Possibly one can avoid some issues if you wrap a struct around the arrays:
struct crc32_arrays {
	u32 t0[256];
	u32 t1[256];
	...
};

Tests will tell though.

>
> >
> > >
> > > The first comment below relates to the behavior of gen_crc32table.c
> > > which always printed out 256x4 byte table entries at a time. The
> > > CRC32_?E_BITS = 2 and 4 cases only read the first 4 and 16 table
> > > entries respectively so if you are trying to make a really small
> > > image you can save most of 1KB by shortening the table.
> >
> > Surely this could be fixed without moving to 1D arrays?
>
> This pertains to the BITS = 2 & 4 versions which actually treat the array as
> 1D in the current code even though the array is 2D. Actually I'm not sure
> why gcc didn't complain.

Me neither, but should be easy to fix.

>
> >
> > >
> > > I think I just caused more confusion below. The point is that the two
> > > routines crc32_le and crc32_be take a byte string as argument which is
> > > invariant under big/little endian byte order and produce a u32 which
> > > is just an integer. All the byte swapping in the code is
> > > implementation detail and has no effect on the result. The caller
> > > should just do what you normally do with an integer when you use it
> > > in a network message e.g. call htonl or equivalent before putting it
> > > in a packet.
> > >
> > > Last point below. The existing code fails sparse with __CHECK_ENDIAN__.
> > > (There are several other sparse fails in lib as well BTW.) I cleaned up
> > > these warnings by forcing everything to u32.
> >
> > What about unrolling the 32 bit variant instead of doing 64 bit? The
> > 64 bit variant looks like an unrolled 32 bit variant:
>
> I agree that it has that appearance. As it happened, I started from the
> literature on the slice by 8 algorithm which is generally thought to be the
> fastest algorithm currently available, and didn't think about the slice by 4

You have any references to this literature you can post?

> version. While I haven't done the experiment you suggest there is something
> to the point that the second q computation in the new version can be moved
> ahead of the table lookups from the first q computation. My guess is that
> the unrolled version will be significantly slower.

Ah, didn't see that. Don't understand how this works though.
Why do you do two 32 bit loads instead of one 64 bit load?

>
> > +              q = *++p32 ^ crc;
> > +              i3 = q;
> > +              i2 = q >> 8;
> > +              i1 = q >> 16;
> > +              i0 = q >> 24;
> > +              crc = t3_be[i3] ^ t2_be[i2] ^ t1_be[i1] ^ t0_be[i0];
> >
> > +              q = *++p32 ^ crc;
> > +              i3 = q;
> > +              i2 = q >> 8;
> > +              i1 = q >> 16;
> > +              i0 = q >> 24;
> > +              crc = t3_be[i3] ^ t2_be[i2] ^ t1_be[i1] ^ t0_be[i0];
> >
> > I have a really hard time accepting the code duplication your patch
> > introduces.
>
> It is a taste issue. Andrew Morton was unhappy with the ifdefs mixed into
> the code so I just wrote two versions that are clean and easier to read.
> Only one gets compiled so the object code is no bigger even though some
> lines are repeated. The important thing is making it easy to read and
> follow once you have the best performance and smallest object code.
>
> > Can you not find a way to add whatever changes you want into the current
> > impl. ?
>
> That is how this got written! But, the current code has lots of issues as

I would not say lots; a few minor, easily fixable ones that have no real
impact. Your code is harder to maintain: one would have to update lots of
places if one wants to improve further.

> well. My question to current users is how important is it to have lots of
> versions of the algorithm? There are no references to the configuration
> variables CRC_LE_BITS/CRC_BE_BITS anywhere else in the kernel tree so for
> all I know only the fastest version is ever used. Since I was 'adding' to
> the current code I just added more to the list. Things would get much
> cleaner if the really slow versions could be dropped.

These were put there by Matt Domsch I think and are more for educational purposes.

>
> >
> > Finally, your last patch does too much. Changing the unit test code should
> > be a separate patch. Restructuring the code is another one, and 64 bit
> > support should be a separate patch too.
>
> That's easy to change.  I'm happy to do it.

If you start with the unit test it would be easy to test different
versions.
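
Something along these lines would do (rough sketch only; crc32_le_ref()
is a hypothetical bit-at-a-time reference implementation, it does not
exist in the tree):

	static int __init crc32_selftest(void)
	{
		u8 buf[512];
		int i;

		for (i = 0; i < 512; i++)
			buf[i] = i * 7 + 3;	/* arbitrary test pattern */

		/* every length, aligned and misaligned start */
		for (i = 0; i < 511; i++) {
			if (crc32_le(~0, buf, i) != crc32_le_ref(~0, buf, i))
				return -EINVAL;
			if (crc32_le(~0, buf + 1, i) != crc32_le_ref(~0, buf + 1, i))
				return -EINVAL;
		}
		return 0;
	}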

   Jocke
PS.
   Please don't wrap long lines.

> >
> >  Jocke
> >
> > >
> > > I will include Andrew's comments with the above and send out another patch
> > > soon.
> > >
> > > Regards,
> > >
> > > Bob Pearson
> > >
> > > > -----Original Message-----
> > > > From: Joakim Tjernlund [mailto:joakim.tjernlund@transmode.se]
> > > > Sent: Monday, August 01, 2011 4:20 AM
> > > > To: frank zago; Andrew Morton; Bob Pearson
> > > > Subject: Re: [PATCH] add slice by 8 algorithm to crc32.c
> > > >
> > > >
> > > > Hi Bob, Frank and Andrew
> > > >
> > > > I just got back from vacation and noticed this patch in the kernel mail
> > > > archive. I have pasted some bits from there and commented too.
> > > >
> > > > > Hello,
> > > > >
> > > > > Added support for slice by 8 to existing crc32 algorithm. Also
> > > > > modified gen_crc32table.c to only produce table entries that are
> > > > > actually used.
> > > >
> > > > Hmm, I don't get this. What entries are unused?
> > > >
> > > > > The parameters CRC_LE_BITS and CRC_BE_BITS determine
> > > > > the number of bits in the input array that are processed during each
> > > > > step. Generally the more bits the faster the algorithm is but the
> > > > > more table data required.
> > > > >
> > > > > Using an x86_64 Opteron machine running at 2100MHz the following table
> > > > > was collected with a pre-warmed cache by computing the crc 1000 times
> > > > > on a buffer of 4096 bytes.
> > > > >
> > > > >        BITS    Size     LE Cycles/byte    BE Cycles/byte
> > > > >        -----------------------------------------------
> > > > >        1         873        41.65             34.60
> > > > >        2        1097        25.43             29.61
> > > > >        4        1057        13.29             15.28
> > > > >        8        2913         7.13              8.19
> > > > >        32       9684         2.80              2.82
> > > > >        64      18178         1.53              1.53
> > > > >
> > > > >        BITS is the value of CRC_LE_BITS or CRC_BE_BITS. The old
> > > > >        default was 8 which actually selected the 32 bit algorithm. In
> > > > >        this version the value 8 is used to select the standard
> > > > >        8 bit algorithm and two new values: 32 and 64 are introduced
> > > > >        to select the slice by 4 and slice by 8 algorithms respectively.
> > > > >
> > > > >        Where Size is the size of crc32.o's text segment which includes
> > > > >        code and table data when both LE and BE versions are set to BITS.
> > > >
> > > > I miss the numbers from the current 32 bits impl. How does this new impl.
> > > > for 32 bits compare to the current impl?
> > > >
> > > > >
> > > > > The current version of crc32.c by default uses the slice by 4 algorithm
> > > > > which requires about 2.8 cycles per byte. The slice by 8 algorithm is
> > > > > roughly 2X faster and enables packet processing at over 1GB/sec on a
> > > > > typical 2-3GHz system.
> > > > > --- lib/crc32.c
> > > > > +++ lib/crc32.c
> > > > >
> > > > > ...
> > > > >
> > > > > @@ -28,14 +31,17 @@
> > > > >  #include <linux/init.h>
> > > > >  #include <asm/atomic.h>
> > > > >  #include "crc32defs.h"
> > > > > -#if CRC_LE_BITS == 8
> > > > > -# define tole(x) __constant_cpu_to_le32(x)
> > > > > +
> > > > > +#include <asm/msr.h>
> > > > > +#if CRC_LE_BITS > 8
> > > > > +# define tole(x) (__force u32) __constant_cpu_to_le32(x)
> > > > >  #else
> > > > >  # define tole(x) (x)
> > > > >  #endif
> > > > >
> > > > > -#if CRC_BE_BITS == 8
> > > > > -# define tobe(x) __constant_cpu_to_be32(x)
> > > > > +#if CRC_BE_BITS > 8
> > > > > +# define tobe(x) (__force u32) __constant_cpu_to_be32(x)
> > > > >  #else
> > > > >  # define tobe(x) (x)
> > > > >  #endif
> > > > > @@ -45,54 +51,228 @@ MODULE_AUTHOR("Matt Domsch
> > > > <Matt_Domsch@
> > > > >  MODULE_DESCRIPTION("Ethernet CRC32 calculations");
> > > > >  MODULE_LICENSE("GPL");
> > > > >
> > > > > -#if CRC_LE_BITS == 8 || CRC_BE_BITS == 8
> > > > > +#if CRC_LE_BITS > 8
> > > > > +static inline u32 crc32_le_body(u32 crc, u8 const *buf, size_t len)
> > > > > +{
> > > > > +       const u8 *p8;
> > > > > +       const u32 *p32;
> > > > > +       int init_bytes, end_bytes;
> > > > > +       size_t words;
> > > > > +       int i;
> > > > > +       u32 q;
> > > > > +       u8 i0, i1, i2, i3;
> > > > > +
> > > > > +       crc = (__force u32) __cpu_to_le32(crc);
> > > > > +
> > > > > +#if CRC_LE_BITS == 64
> > > > > +       p8 = buf;
> > > > > +       p32 = (u32 *)(((uintptr_t)p8 + 7) & ~7);
> > > > > +
> > > > > +       init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> > > > > +       if (init_bytes > len)
> > > > > +              init_bytes = len;
> > > > > +       words = (len - init_bytes) >> 3;
> > > > > +       end_bytes = (len - init_bytes) & 7;
> > > > > +#else
> > > > > +       p8 = buf;
> > > > > +       p32 = (u32 *)(((uintptr_t)p8 + 3) & ~3);
> > > > > +       init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> > > > > +       if (init_bytes > len)
> > > > > +              init_bytes = len;
> > > > > +       words = (len - init_bytes) >> 2;
> > > > > +       end_bytes = (len - init_bytes) & 3;
> > > > > +#endif
> > > >
> > > > The difference in the above code is just two constants. Should be possible
> > > > to avoid code duplication.
> > > >
> > > > > +
> > > > > +       for (i = 0; i < init_bytes; i++) {
> > > >
> > > > All for(..) loops should be changed to:
> > > >   for (i = init_bytes; i ; --i)
> > > > This is easier for gcc to optimize. Always compare to zero when possible.
> > > >
> > > > > +#ifdef __LITTLE_ENDIAN
> > > > > +              i0 = *p8++ ^ crc;
> > > > > +              crc = t0_le[i0] ^ (crc >> 8);
> > > > > +#else
> > > > > +              i0 = *p8++ ^ (crc >> 24);
> > > > > +              crc = t0_le[i0] ^ (crc << 8);
> > > > > +#endif
> > > > > +       }
> > > > >
> > > > Better to hide in macro as current code do.
> > > > ...
> > > >
> > > > > +       for (i = 0; i < words; i++) {
> > > > > +#ifdef __LITTLE_ENDIAN
> > > > > +#  if CRC_LE_BITS == 64
> > > > > +              /* slice by 8 algorithm */
> > > > > +              q = *p32++ ^ crc;
> > > > > +              i3 = q;
> > > > > +              i2 = q >> 8;
> > > > > +              i1 = q >> 16;
> > > > > +              i0 = q >> 24;
> > > > > +              crc = t7_le[i3] ^ t6_le[i2] ^ t5_le[i1] ^ t4_le[i0];
> > > > > +
> > > > > +              q = *p32++;
> > > > > +              i3 = q;
> > > > > +              i2 = q >> 8;
> > > > > +              i1 = q >> 16;
> > > > > +              i0 = q >> 24;
> > > > > +              crc ^= t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
> > > >
> > > > I am not convinced that converting the table matrix to several arrays is
> > > > faster.
> > > > Now you have to load 8 addresses and then index them instead of just one
> > > > address.
> > > >
> > > > Also, this looks more like unrolling the 32 bit variant. Are you sure that
> > > > you won't get just as good results by just unrolling the 32 bit variant?
> > > >
> > > > You also lost the pre increment which was faster on RISC archs like
> > > > ppc (sparc too?) last time I tested crc32 speed.
> > > >
> > > > > +#  else
> > > > > +              /* slice by 4 algorithm */
> > > > > +              q = *p32++ ^ crc;
> > > > > +              i3 = q;
> > > > > +              i2 = q >> 8;
> > > > > +              i1 = q >> 16;
> > > > > +              i0 = q >> 24;
> > > > > +              crc = t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
> > > > > +#  endif
> > > >
> > > > ...
> > > > > General comment. The use of le/be in this and the previous version of
> > > > > crc32.c is confusing because it does not refer to little/big endian cpu
> > > > > architecture but rather to the arbitrary choice made by different protocols
> > > > > as to whether the bits in a byte are presented to the CRC in little/big
> > > > > *BIT* order. Some protocols (e.g. Ethernet, InfiniBand) process the bits
> > > > > starting from the LSB and some (e.g. atm) process the bits in MSB order.
> > > > > These routines (crc32_le and crc32_be) compute the CRC as a u32 in cpu byte
> > > > > order. The caller has to then get the CRC into the correct order for
> > > > > inclusion into the message. As a result there are four versions of the
> > > >
> > > > No, the caller does not have to get the CRC into correct order, this is
> > > > done by crc32 code.
> > > >
> > > > > calculation:
> > > > > BE bit order on BE byte order cpu, BE bit order on LE byte order cpu, etc.
> > > >
> > > > What would be better? I don't see what to do. Linux requires both LE and
> > > > BE bit ordering. When we optimize heavily, CPU byte order becomes an issue
> > > > that needs to be dealt with.
> > > >
> > > > > Some calculations are simplified by arranging the bits sequentially in WORDS
> > > > > when we are going to process them in a certain order within each byte. This
> > > > > means that the tables used to compute crc32_be are easier to use in BE byte
> > > > > order and that tables used to compute crc32_le are easier to use in LE byte
> > > > > order. That is the logic behind the following weird looking macros which are
> > > > > kept the same as the previous version except for the inclusion of __force to
> > > > > pass sparse with -D__CHECK_ENDIAN__.
> > > >
> > > > Don't understand, how is your code any different from what we have today?
> > > >
> > > >  Jocke
> > >
> > >
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH] add slice by 8 algorithm to crc32.c
  2011-08-04 11:54   ` Joakim Tjernlund
@ 2011-08-04 18:53     ` Bob Pearson
  2011-08-05  9:22       ` Joakim Tjernlund
       [not found]       ` <OF14136E0E.3F2388EF-ONC12578E3.00301969-C12578E3.00338524@LocalDomain>
  0 siblings, 2 replies; 27+ messages in thread
From: Bob Pearson @ 2011-08-04 18:53 UTC (permalink / raw)
  To: 'Joakim Tjernlund'
  Cc: 'Andrew Morton', 'frank zago', linux-kernel

Sure... See below.

> -----Original Message-----
> From: Joakim Tjernlund [mailto:joakim.tjernlund@transmode.se]
> Sent: Thursday, August 04, 2011 6:54 AM
> To: Bob Pearson
> Cc: 'Andrew Morton'; 'frank zago'; linux-kernel@vger.kernel.org
> Subject: RE: [PATCH] add slice by 8 algorithm to crc32.c
> 
> "Bob Pearson" <rpearson@systemfabricworks.com> wrote on 2011/08/02
> 23:14:39:
> >
> > Hi Joakim,
> >
> > Sorry to take so long to respond.
> 
> No problem but please insert your answers in correct context (like I did).
> This makes it much easier to read and comment on.
> 
> >
> > Here are some performance data collected from the original and modified
> > crc32 algorithms.
> > The following is a simple test loop that computes the time to compute
1000
> > crc's over 4096 bytes of data aligned on an 8 byte boundary after
warming
> > the cache. You could make other measurements but this is sort of a best
> > case.
> >
> > These measurements were made on a dual socket Nehalem 2.267 GHz
> system.
> 
> Measurements on your SPARC would be good too.

Will do. But it is decrepit and quite slow. My main motivation is running a
10G protocol, so I am mostly interested in getting x86_64 going as fast as
possible.
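
For scale: 10G Ethernet carries at most about 1.25 GB/s of payload.
Working from the timings below, the new 64-bit code covers 1000 x 4096
bytes in 2877530 ns, roughly 1.4 GB/s, while the original code at
5842712 ns manages only about 0.7 GB/s, so only the new version keeps
up with 10G line rate on this machine.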

> 
> >
> > #ifdef CONFIG_CRC32_PERFTEST
> > #include <linux/time.h>
> >
> > static u8 __attribute__((__aligned__(8))) test_buf[4096];
> >
> > /* need to convince gcc to not optimize out all the crc calls as dead code */
> > u32 crc;
> >
> > static int __init crc32_init(void)
> > {
> >         int i;
> >         struct timespec start, stop;
> >         u64 nsec_le;
> >         u64 nsec_be;
> >
> >         for (i = 0; i  < 4096; i++)
> >                 test_buf[i] = i;
> >
> >         crc = crc32_le(0, test_buf, 4096);
> >         crc = crc32_le(crc, test_buf, 4096);
> >
> >         getnstimeofday(&start);
> >         for (i = 0; i < 1000; i++)
> >                 crc = crc32_le(crc, test_buf, 4096);
> >         getnstimeofday(&stop);
> >
> >         nsec_le = stop.tv_nsec - start.tv_nsec +
> >                 1000000000 * (stop.tv_sec - start.tv_sec);
> >
> >         crc = crc32_be(0, test_buf, 4096);
> >         crc = crc32_be(crc, test_buf, 4096);
> >
> >         getnstimeofday(&start);
> >         for (i = 0; i < 1000; i++)
> >                 crc = crc32_be(crc, test_buf, 4096);
> >         getnstimeofday(&stop);
> >
> >         nsec_be = stop.tv_nsec - start.tv_nsec +
> >                 1000000000 * (stop.tv_sec - start.tv_sec);
> >
> >         pr_info("crc32: nsec_le = %lld nsec_be = %lld\n", nsec_le, nsec_be);
> >
> >         return 0;
> > }
> >
> > module_init(crc32_init);
> > #endif /* CONFIG_CRC32_PERFTEST */
> >
> > Here are the timings.
> >
> >                         CRC_?E_BITS    crc_le(nsec)    crc_be(nsec)
> >         orig crc32.c              8         5842712         5829793
> >
> >         new crc32.c              64         2877530         2862515
> >         new crc32.c              32         5174400         5105816  (Equiv. to orig. BITS = 8)
> 
> I really can't wrap my head around why your 32 bits version is faster than
> the current one. Especially as you say that a 2D array is a little faster
> than a 1D array. The current impl. already uses a 2D array. Can you identify
> what makes your version faster?

Sorry if the comment below was not clear. But it is the 1D version that is
faster, not the other way around. That is one difference. The original code
masks the indices with 0xff, which may cause the compiler to emit more
instructions than necessary (not sure). The new version uses a byte
operation instead.
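
The two styles side by side (illustrative sketch only; 'tab' and 't0_le'
stand in for the real tables):

	/* old: mask the combined value down to a table index */
	crc = tab[(crc ^ *p++) & 0xff] ^ (crc >> 8);

	/* new: assign through a u8 and let the type do the truncation;
	 * gcc can often use a byte move instead of an explicit and */
	i0 = *p++ ^ crc;
	crc = t0_le[i0] ^ (crc >> 8);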

> 
> >
> > Modify all 'i' loops from for (i = 0; i < foo; i++) { ... } to
> > for (i = foo - 1; i >= 0; i--) { ... }
> 
> That should be (i = foo; i ; --i) { ... }

Shouldn't make much difference: branch on the zero flag versus branch on the
sign flag. But at the end of the day it didn't help on Nehalem.
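
For reference, the three loop shapes under discussion (the loop body
never uses i, so they compute the same thing):

	for (i = 0; i < foo; i++) { ... }	/* original: compare i to foo */
	for (i = foo - 1; i >= 0; i--) { ... }	/* count down, test sign flag */
	for (i = foo; i; --i) { ... }		/* count down, test zero flag */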

> 
> >         + count down             64         2837860         3230438
> >                                             2865727         3264033  (repeat)
> >         + count down             32         5177334         5070337
> >            Results seem ambiguous on Nehalem. Slight improvement in
> >            some cases. 12% worse in one case.
> >
> > Modify main loop to use pre-increment instead of post-increment.
> >         + pre-increment          64         2872502         2767601
> >         + pre-increment          32         5100579         5090453
> >            Small scale improvements for each case. Will add to next drop.
> >
> > Do both count down and pre increment.
> >         + ct down + pre-inc      64         3329125         3367815
> >         + ct down + pre-inc      32         5065308         5089825
> >            Got significantly worse for 64 bit, slightly better for 32.
> >
> > Separately I looked at the difference between 1D and 2D arrays and 1D arrays
> > are a few % faster. The difference is that the loader computes the base
> > address at compile time. In the 2D case you have to add an offset to a
> > parameter that was passed in to the body routine which is a runtime
> > addition.
> 
> I imagine the difference is even bigger on RISC archs like PowerPC and SPARC
> as these may need 2 insns to load an address. A relocatable kernel would be
> even worse I think.

Sounds like you are agreeing with me. Loading 'base[i]' is always going to
be easier than 'base[i][j]'. This is why the newer 4 byte version is faster.
The first one takes one instruction with the actual base address put in
place by the loader.
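
In code terms (sketch only; the function names are made up to show the
two shapes, and the tables are declared as in the patch):

	extern const u32 t0_le[256], t1_le[256];

	/* 2D/pointer style: 'tab' lives in a register, every lookup is
	 * base register + scaled index, and row 1 needs an offset add */
	static u32 lookup_2d(const u32 (*tab)[256], u8 i, u8 j)
	{
		return tab[0][i] ^ tab[1][j];
	}

	/* 1D style: the table bases are absolute addresses filled in at
	 * link/load time, so no base register or offset add is needed */
	static u32 lookup_1d(u8 i, u8 j)
	{
		return t0_le[i] ^ t1_le[j];
	}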

> 
> >
> > The first comment below relates to the behavior of gen_crc32table.c which
> > always printed out 256x4 byte table entries at a time. The CRC32_?E_BITS = 2
> > and 4 cases only read the first 4 and 16 table entries respectively so if
> > you are trying to make a really small image you can save most of 1KB by
> > shortening the table.
> 
> Surely this could be fixed without moving to 1D arrays?

This pertains to the BITS = 2 & 4 versions which actually treat the array as
1D in the current code even though the array is 2D. Actually I'm not sure
why gcc didn't complain.

> 
> >
> > I think I just caused more confusion below. The point is that the two
> > routines crc32_le and crc32_be take a byte string as argument which is
> > invariant under big/little endian byte order and produce a u32 which is just
> > an integer. All the byte swapping in the code is implementation detail and
> > has no effect on the result. The caller should just do what you normally do
> > with an integer when you use it in a network message e.g. call htonl or
> > equivalent before putting it in a packet.
> >
> > Last point below. The existing code fails sparse with __CHECK_ENDIAN__.
> > (There are several other sparse fails in lib as well BTW.) I cleaned up
> > these warnings by forcing everything to u32.
> 
> What about unrolling the 32 bit variant instead of doing 64 bit? The 64 bit
> variant looks like an unrolled 32 bit variant:

I agree that it has that appearance. As it happened, I started from the
literature on the slice by 8 algorithm, which is generally thought to be the
fastest algorithm currently available, and didn't think about the slice by 4
version. While I haven't done the experiment you suggest, there is something
to the point that the second q computation in the new version can be moved
ahead of the table lookups from the first q computation. My guess is that
the unrolled version will be significantly slower.
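
Concretely (sketch, not the patch text): an unrolled slice by 4 keeps a
serial dependency between the two halves, while slice by 8 breaks it, so
the second load and its lookups can overlap the first round:

	/* unrolled slice by 4: the second q must wait for the new crc */
	q = *p32++ ^ crc;
	crc = t3_le[(u8)q] ^ t2_le[(u8)(q >> 8)] ^
	      t1_le[(u8)(q >> 16)] ^ t0_le[(u8)(q >> 24)];
	q = *p32++ ^ crc;	/* depends on the line above */
	crc = t3_le[(u8)q] ^ t2_le[(u8)(q >> 8)] ^
	      t1_le[(u8)(q >> 16)] ^ t0_le[(u8)(q >> 24)];

	/* slice by 8: q2 does not depend on crc and can load early */
	q  = *p32++ ^ crc;
	q2 = *p32++;
	crc  = t7_le[(u8)q]  ^ t6_le[(u8)(q >> 8)]   ^
	       t5_le[(u8)(q >> 16)]  ^ t4_le[(u8)(q >> 24)];
	crc ^= t3_le[(u8)q2] ^ t2_le[(u8)(q2 >> 8)]  ^
	       t1_le[(u8)(q2 >> 16)] ^ t0_le[(u8)(q2 >> 24)];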

> +              q = *++p32 ^ crc;
> +              i3 = q;
> +              i2 = q >> 8;
> +              i1 = q >> 16;
> +              i0 = q >> 24;
> +              crc = t3_be[i3] ^ t2_be[i2] ^ t1_be[i1] ^ t0_be[i0];
> 
> +              q = *++p32 ^ crc;
> +              i3 = q;
> +              i2 = q >> 8;
> +              i1 = q >> 16;
> +              i0 = q >> 24;
> +              crc = t3_be[i3] ^ t2_be[i2] ^ t1_be[i1] ^ t0_be[i0];
> 
> I have a really hard time accepting the code duplication your patch
> introduces.

It is a taste issue. Andrew Morton was unhappy with the ifdefs mixed into
the code so I just wrote two versions that are clean and easier to read.
Only one gets compiled so the object code is no bigger even though some
lines are repeated. The important thing is making it easy to read and follow
once you have the best performance and smallest object code.

> Can you not find a way to add whatever changes you want into the current
> impl.?

That is how this got written! But, the current code has lots of issues as
well. My question to current users is how important is it to have lots of
versions of the algorithm? There are no references to the configuration
variables CRC_LE_BITS/CRC_BE_BITS anywhere else in the kernel tree so for
all I know only the fastest version is ever used. Since I was 'adding' to
the current code I just added more to the list. Things would get much
cleaner if the really slow versions could be dropped.

> 
> Finally, your last patch does too much. Changing the unit test code should
> be a separate patch. Restructuring the code is another one, and 64 bit
> support should be a separate patch too.

That's easy to change.  I'm happy to do it.
> 
>  Jocke
> 
> >
> > I will include Andrew's comments with the above and send out another patch
> > soon.
> >
> > Regards,
> >
> > Bob Pearson
> >
> > > -----Original Message-----
> > > From: Joakim Tjernlund [mailto:joakim.tjernlund@transmode.se]
> > > Sent: Monday, August 01, 2011 4:20 AM
> > > To: frank zago; Andrew Morton; Bob Pearson
> > > Subject: Re: [PATCH] add slice by 8 algorithm to crc32.c
> > >
> > >
> > > Hi Bob, Frank and Andrew
> > >
> > > I just got back from vacation and noticed this patch in the kernel mail
> > > archive. I have pasted some bits from there and commented too.
> > >
> > > > Hello,
> > > >
> > > > Added support for slice by 8 to existing crc32 algorithm. Also
> > > > modified gen_crc32table.c to only produce table entries that are
> > > > actually used.
> > >
> > > Hmm, I don't get this. What entries are unused?
> > >
> > > > The parameters CRC_LE_BITS and CRC_BE_BITS determine
> > > > the number of bits in the input array that are processed during each
> > > > step. Generally the more bits the faster the algorithm is but the
> > > > more table data required.
> > > >
> > > > Using an x86_64 Opteron machine running at 2100MHz the following table
> > > > was collected with a pre-warmed cache by computing the crc 1000 times
> > > > on a buffer of 4096 bytes.
> > > >
> > > >        BITS    Size     LE Cycles/byte    BE Cycles/byte
> > > >        -----------------------------------------------
> > > >        1         873        41.65             34.60
> > > >        2        1097        25.43             29.61
> > > >        4        1057        13.29             15.28
> > > >        8        2913         7.13              8.19
> > > >        32       9684         2.80              2.82
> > > >        64      18178         1.53              1.53
> > > >
> > > >        BITS is the value of CRC_LE_BITS or CRC_BE_BITS. The old
> > > >        default was 8 which actually selected the 32 bit algorithm. In
> > > >        this version the value 8 is used to select the standard
> > > >        8 bit algorithm and two new values: 32 and 64 are introduced
> > > >        to select the slice by 4 and slice by 8 algorithms respectively.
> > > >
> > > >        Where Size is the size of crc32.o's text segment which includes
> > > >        code and table data when both LE and BE versions are set to BITS.
> > >
> > > I miss the numbers from the current 32 bits impl. How does this new impl.
> > > for 32 bits compare to the current impl?
> > >
> > > >
> > > > The current version of crc32.c by default uses the slice by 4 algorithm
> > > > which requires about 2.8 cycles per byte. The slice by 8 algorithm is
> > > > roughly 2X faster and enables packet processing at over 1GB/sec on a
> > > > typical 2-3GHz system.
> > > > --- lib/crc32.c
> > > > +++ lib/crc32.c
> > > >
> > > > ...
> > > >
> > > > @@ -28,14 +31,17 @@
> > > >  #include <linux/init.h>
> > > >  #include <asm/atomic.h>
> > > >  #include "crc32defs.h"
> > > > -#if CRC_LE_BITS == 8
> > > > -# define tole(x) __constant_cpu_to_le32(x)
> > > > +
> > > > +#include <asm/msr.h>
> > > > +#if CRC_LE_BITS > 8
> > > > +# define tole(x) (__force u32) __constant_cpu_to_le32(x)
> > > >  #else
> > > >  # define tole(x) (x)
> > > >  #endif
> > > >
> > > > -#if CRC_BE_BITS == 8
> > > > -# define tobe(x) __constant_cpu_to_be32(x)
> > > > +#if CRC_BE_BITS > 8
> > > > +# define tobe(x) (__force u32) __constant_cpu_to_be32(x)
> > > >  #else
> > > >  # define tobe(x) (x)
> > > >  #endif
> > > > @@ -45,54 +51,228 @@ MODULE_AUTHOR("Matt Domsch
> > > <Matt_Domsch@
> > > >  MODULE_DESCRIPTION("Ethernet CRC32 calculations");
> > > >  MODULE_LICENSE("GPL");
> > > >
> > > > -#if CRC_LE_BITS == 8 || CRC_BE_BITS == 8
> > > > +#if CRC_LE_BITS > 8
> > > > +static inline u32 crc32_le_body(u32 crc, u8 const *buf, size_t len)
> > > > +{
> > > > +       const u8 *p8;
> > > > +       const u32 *p32;
> > > > +       int init_bytes, end_bytes;
> > > > +       size_t words;
> > > > +       int i;
> > > > +       u32 q;
> > > > +       u8 i0, i1, i2, i3;
> > > > +
> > > > +       crc = (__force u32) __cpu_to_le32(crc);
> > > > +
> > > > +#if CRC_LE_BITS == 64
> > > > +       p8 = buf;
> > > > +       p32 = (u32 *)(((uintptr_t)p8 + 7) & ~7);
> > > > +
> > > > +       init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> > > > +       if (init_bytes > len)
> > > > +              init_bytes = len;
> > > > +       words = (len - init_bytes) >> 3;
> > > > +       end_bytes = (len - init_bytes) & 7;
> > > > +#else
> > > > +       p8 = buf;
> > > > +       p32 = (u32 *)(((uintptr_t)p8 + 3) & ~3);
> > > > +       init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> > > > +       if (init_bytes > len)
> > > > +              init_bytes = len;
> > > > +       words = (len - init_bytes) >> 2;
> > > > +       end_bytes = (len - init_bytes) & 3;
> > > > +#endif
> > >
> > > The difference in the above code is just two constants. Should be possible
> > > to avoid code duplication.
> > >
> > > > +
> > > > +       for (i = 0; i < init_bytes; i++) {
> > >
> > > All for(..) loops should be changed to:
> > >   for (i = init_bytes; i ; --i)
> > > This is easier for gcc to optimize. Always compare to zero when possible.
> > >
> > > > +#ifdef __LITTLE_ENDIAN
> > > > +              i0 = *p8++ ^ crc;
> > > > +              crc = t0_le[i0] ^ (crc >> 8);
> > > > +#else
> > > > +              i0 = *p8++ ^ (crc >> 24);
> > > > +              crc = t0_le[i0] ^ (crc << 8);
> > > > +#endif
> > > > +       }
> > > >
> > > Better to hide in macro as current code do.
> > > ...
> > >
> > > > +       for (i = 0; i < words; i++) {
> > > > +#ifdef __LITTLE_ENDIAN
> > > > +#  if CRC_LE_BITS == 64
> > > > +              /* slice by 8 algorithm */
> > > > +              q = *p32++ ^ crc;
> > > > +              i3 = q;
> > > > +              i2 = q >> 8;
> > > > +              i1 = q >> 16;
> > > > +              i0 = q >> 24;
> > > > +              crc = t7_le[i3] ^ t6_le[i2] ^ t5_le[i1] ^ t4_le[i0];
> > > > +
> > > > +              q = *p32++;
> > > > +              i3 = q;
> > > > +              i2 = q >> 8;
> > > > +              i1 = q >> 16;
> > > > +              i0 = q >> 24;
> > > > +              crc ^= t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
> > >
> > > I am not convinced that converting the table matrix to several arrays is
> > > faster.
> > > Now you have to load 8 addresses and then index them instead of just one
> > > address.
> > >
> > > Also, this looks more like unrolling the 32 bit variant. Are you sure that
> > > you won't get just as good results by just unrolling the 32 bit variant?
> > >
> > > You also lost the pre increment which was faster on RISC archs like
> > > ppc (sparc too?) last time I tested crc32 speed.
> > >
> > > > +#  else
> > > > +              /* slice by 4 algorithm */
> > > > +              q = *p32++ ^ crc;
> > > > +              i3 = q;
> > > > +              i2 = q >> 8;
> > > > +              i1 = q >> 16;
> > > > +              i0 = q >> 24;
> > > > +              crc = t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
> > > > +#  endif
> > >
> > > ...
> > > > General comment. The use of le/be in this and the previous version of
> > > > crc32.c is confusing because it does not refer to little/big endian cpu
> > > > architecture but rather to the arbitrary choice made by different protocols
> > > > as to whether the bits in a byte are presented to the CRC in little/big
> > > > *BIT* order. Some protocols (e.g. Ethernet, InfiniBand) process the bits
> > > > starting from the LSB and some (e.g. atm) process the bits in MSB order.
> > > > These routines (crc32_le and crc32_be) compute the CRC as a u32 in cpu byte
> > > > order. The caller has to then get the CRC into the correct order for
> > > > inclusion into the message. As a result there are four versions of the
> > >
> > > No, the caller does not have to get the CRC into correct order, this is
> > > done by crc32 code.
> > >
> > > > calculation:
> > > > BE bit order on BE byte order cpu, BE bit order on LE byte order cpu, etc.
> > >
> > > What would be better? I don't see what to do. Linux requires both LE and
> > > BE bit ordering. When we optimize heavily, CPU byte order becomes an issue
> > > that needs to be dealt with.
> > >
> > > > Some calculations are simplified by arranging the bits sequentially in WORDS
> > > > when we are going to process them in a certain order within each byte. This
> > > > means that the tables used to compute crc32_be are easier to use in BE byte
> > > > order and that tables used to compute crc32_le are easier to use in LE byte
> > > > order. That is the logic behind the following weird looking macros which are
> > > > kept the same as the previous version except for the inclusion of __force to
> > > > pass sparse with -D__CHECK_ENDIAN__.
> > >
> > > Don't understand, how is your code any different from what we have today?
> > >
> > >  Jocke
> >
> >


^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH] add slice by 8 algorithm to crc32.c
  2011-08-02 21:14 ` Bob Pearson
  2011-08-02 21:19   ` Bob Pearson
@ 2011-08-04 11:54   ` Joakim Tjernlund
  2011-08-04 18:53     ` Bob Pearson
  1 sibling, 1 reply; 27+ messages in thread
From: Joakim Tjernlund @ 2011-08-04 11:54 UTC (permalink / raw)
  To: Bob Pearson; +Cc: 'Andrew Morton', 'frank zago', linux-kernel

"Bob Pearson" <rpearson@systemfabricworks.com> wrote on 2011/08/02 23:14:39:
>
> Hi Joakim,
>
> Sorry to take so long to respond.

No problem but please insert your answers in correct context (like I did). This
makes it much easier to read and comment on.

>
> Here are some performance data collected from the original and modified
> crc32 algorithms.
> The following is a simple test loop that computes the time to compute 1000
> crc's over 4096 bytes of data aligned on an 8 byte boundary after warming
> the cache. You could make other measurements but this is sort of a best
> case.
>
> These measurements were made on a dual socket Nehalem 2.267 GHz system.

Measurements on your SPARC would be good too.

>
> #ifdef CONFIG_CRC32_PERFTEST
> #include <linux/time.h>
>
> static u8 __attribute__((__aligned__(8))) test_buf[4096];
>
> /* need to convince gcc to not optimize out all the crc calls as dead code
> */
> u32 crc;
>
> static int __init crc32_init(void)
> {
>         int i;
>         struct timespec start, stop;
>         u64 nsec_le;
>         u64 nsec_be;
>
>         for (i = 0; i  < 4096; i++)
>                 test_buf[i] = i;
>
>         crc = crc32_le(0, test_buf, 4096);
>         crc = crc32_le(crc, test_buf, 4096);
>
>         getnstimeofday(&start);
>         for (i = 0; i < 1000; i++)
>                 crc = crc32_le(crc, test_buf, 4096);
>         getnstimeofday(&stop);
>
>         nsec_le = stop.tv_nsec - start.tv_nsec +
>                 1000000000 * (stop.tv_sec - start.tv_sec);
>
>         crc = crc32_be(0, test_buf, 4096);
>         crc = crc32_be(crc, test_buf, 4096);
>
>         getnstimeofday(&start);
>         for (i = 0; i < 1000; i++)
>                 crc = crc32_be(crc, test_buf, 4096);
>         getnstimeofday(&stop);
>
>         nsec_be = stop.tv_nsec - start.tv_nsec +
>                 1000000000 * (stop.tv_sec - start.tv_sec);
>
>         pr_info("crc32: nsec_le = %lld nsec_be = %lld\n", nsec_le, nsec_be);
>
>         return 0;
> }
>
> module_init(crc32_init);
> #endif /* CONFIG_CRC32_PERFTEST */
>
> Here are the timings.
>
>                         CRC_?E_BITS    crc_le(nsec)    crc_be(nsec)
>         orig crc32.c              8         5842712         5829793
>
>         new crc32.c              64         2877530         2862515
>         new crc32.c              32         5174400         5105816  (Equiv. to orig. BITS = 8)

I really can't wrap my head around why your 32 bits version is faster than the
current one. Especially as you say that a 2D array is a little faster than a
1D array. The current impl. already uses a 2D array. Can you identify what
makes your version faster?

>
> Modify all 'i' loops from for (i = 0; i < foo; i++) { ... } to for (i = foo
> - 1; i >= 0; i--) { ... }

That should be (i = foo; i ; --i) { ... }

>         + count down             64         2837860         3230438
>                                             2865727         3264033  (repeat)
>         + count down             32         5177334         5070337
>            Results seem ambiguous on Nehalem. Slight improvement in
>            some cases. 12% worse in one case.
>
> Modify main loop to use pre-increment instead of post-increment.
>         + pre-increment          64         2872502         2767601
>         + pre-increment          32         5100579         5090453
>            Small scale improvements for each case. Will add to next drop.
>
> Do both count down and pre increment.
>         + ct down + pre-inc      64         3329125         3367815
>         + ct down + pre-inc      32         5065308         5089825
>            Got significantly worse for 64 bit, slightly better for 32.
>
> Separately I looked at the difference between 1D and 2D arrays and 1D arrays
> are a few % faster. The difference is that the loader computes the base
> address at compile time. In the 2D case you have to add an offset to a
> parameter that was passed in to the body routine which is a runtime
> addition.

I imagine the difference is even bigger on RISC archs like PowerPC and SPARC as
these may need 2 insns to load an address. A relocatable kernel would be even
worse I think.

>
> The first comment below relates to the behavior of gen_crc32table.c which
> always printed out 256x4 byte table entries at a time. The CRC32_?E_BITS = 2
> and 4 cases only read the first 4 and 16 table entries respectively so if
> you are trying to make a really small image you can save most of 1KB by
> shortening the table.

Surely this could be fixed without moving to 1D arrays?

>
> I think I just caused more confusion below. The point is that the two
> routines crc32_le and crc32_be take a byte string as argument which is
> invariant under big/little endian byte order and produce a u32 which is just
> an integer. All the byte swapping in the code is implementation detail and
> has no effect on the result. The caller should just do what you normally do
> with an integer when you use it in a network message e.g. call htonl or
> equivalent before putting it in a packet.
>
> Last point below. The existing code fails sparse with __CHECK_ENDIAN__.
> (There are several other sparse fails in lib as well BTW.) I cleaned up
> these warnings by forcing everything to u32.

What about unrolling the 32 bit variant instead of doing 64 bit? The 64 bit
variant looks like an unrolled 32 bit variant:
+              q = *++p32 ^ crc;
+              i3 = q;
+              i2 = q >> 8;
+              i1 = q >> 16;
+              i0 = q >> 24;
+              crc = t3_be[i3] ^ t2_be[i2] ^ t1_be[i1] ^ t0_be[i0];

+              q = *++p32 ^ crc;
+              i3 = q;
+              i2 = q >> 8;
+              i1 = q >> 16;
+              i0 = q >> 24;
+              crc = t3_be[i3] ^ t2_be[i2] ^ t1_be[i1] ^ t0_be[i0];

I have a really hard time accepting the code duplication your patch introduces.
> Can you not find a way to add whatever changes you want into the current impl.?

> Finally, your last patch does too much. Changing the unit test code should
> be a separate patch. Restructuring the code is another one, and 64 bit support
should be a separate patch too.

 Jocke

>
> I will include Andrew's comments with the above and send out another patch
> soon.
>
> Regards,
>
> Bob Pearson
>
> > -----Original Message-----
> > From: Joakim Tjernlund [mailto:joakim.tjernlund@transmode.se]
> > Sent: Monday, August 01, 2011 4:20 AM
> > To: frank zago; Andrew Morton; Bob Pearson
> > Subject: Re: [PATCH] add slice by 8 algorithm to crc32.c
> >
> >
> > Hi Bob, Frank and Andrew
> >
> > I just got back from vacation and noticed this patch in the kernel mail
> archive.
> > I have pasted some
> > bits from there and commented too.
> >
> > > Hello,
> > >
> > > Added support for slice by 8 to existing crc32 algorithm. Also
> > > modified gen_crc32table.c to only produce table entries that are
> > > actually used.
> >
> > Hmm, I don't get this. What entries are unused?
> >
> > > The parameters CRC_LE_BITS and CRC_BE_BITS determine
> > > the number of bits in the input array that are processed during each
> > > step. Generally the more bits the faster the algorithm is but the
> > > more table data required.
> > >
> > > Using an x86_64 Opteron machine running at 2100MHz the following table
> > > was collected with a pre-warmed cache by computing the crc 1000 times
> > > on a buffer of 4096 bytes.
> > >
> > >        BITS    Size     LE Cycles/byte    BE Cycles/byte
> > >        -----------------------------------------------
> > >        1         873        41.65             34.60
> > >        2        1097        25.43             29.61
> > >        4        1057        13.29             15.28
> > >        8        2913         7.13              8.19
> > >        32       9684         2.80              2.82
> > >        64      18178         1.53              1.53
> > >
> > >        BITS is the value of CRC_LE_BITS or CRC_BE_BITS. The old
> > >        default was 8 which actually selected the 32 bit algorithm. In
> > >        this version the value 8 is used to select the standard
> > >        8 bit algorithm and two new values: 32 and 64 are introduced
> > >        to select the slice by 4 and slice by 8 algorithms respectively.
> > >
> > >        Where Size is the size of crc32.o's text segment which includes
> > >        code and table data when both LE and BE versions are set to BITS.
> >
> > I miss the numbers from the current 32 bits impl. How does this new impl.
> > for 32 bits compare to the current impl?
> >
> > >
> > > The current version of crc32.c by default uses the slice by 4 algorithm
> > > which requires about 2.8 cycles per byte. The slice by 8 algorithm is
> > > roughly 2X faster and enables packet processing at over 1GB/sec on a
> > > typical 2-3GHz system.
> > > --- lib/crc32.c
> > > +++ lib/crc32.c
> > >
> > > ...
> > >
> > > @@ -28,14 +31,17 @@
> > >  #include <linux/init.h>
> > >  #include <asm/atomic.h>
> > >  #include "crc32defs.h"
> > > -#if CRC_LE_BITS == 8
> > > -# define tole(x) __constant_cpu_to_le32(x)
> > > +
> > > +#include <asm/msr.h>
> > > +#if CRC_LE_BITS > 8
> > > +# define tole(x) (__force u32) __constant_cpu_to_le32(x)
> > >  #else
> > >  # define tole(x) (x)
> > >  #endif
> > >
> > > -#if CRC_BE_BITS == 8
> > > -# define tobe(x) __constant_cpu_to_be32(x)
> > > +#if CRC_BE_BITS > 8
> > > +# define tobe(x) (__force u32) __constant_cpu_to_be32(x)
> > >  #else
> > >  # define tobe(x) (x)
> > >  #endif
> > > @@ -45,54 +51,228 @@ MODULE_AUTHOR("Matt Domsch
> > <Matt_Domsch@
> > >  MODULE_DESCRIPTION("Ethernet CRC32 calculations");
> > >  MODULE_LICENSE("GPL");
> > >
> > > -#if CRC_LE_BITS == 8 || CRC_BE_BITS == 8
> > > +#if CRC_LE_BITS > 8
> > > +static inline u32 crc32_le_body(u32 crc, u8 const *buf, size_t len)
> > > +{
> > > +       const u8 *p8;
> > > +       const u32 *p32;
> > > +       int init_bytes, end_bytes;
> > > +       size_t words;
> > > +       int i;
> > > +       u32 q;
> > > +       u8 i0, i1, i2, i3;
> > > +
> > > +       crc = (__force u32) __cpu_to_le32(crc);
> > > +
> > > +#if CRC_LE_BITS == 64
> > > +       p8 = buf;
> > > +       p32 = (u32 *)(((uintptr_t)p8 + 7) & ~7);
> > > +
> > > +       init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> > > +       if (init_bytes > len)
> > > +              init_bytes = len;
> > > +       words = (len - init_bytes) >> 3;
> > > +       end_bytes = (len - init_bytes) & 7;
> > > +#else
> > > +       p8 = buf;
> > > +       p32 = (u32 *)(((uintptr_t)p8 + 3) & ~3);
> > > +       init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> > > +       if (init_bytes > len)
> > > +              init_bytes = len;
> > > +       words = (len - init_bytes) >> 2;
> > > +       end_bytes = (len - init_bytes) & 3;
> > > +#endif
> >
> > The difference in the above code is just two constants. Should be possible
> > to avoid code duplication.
> >
> > > +
> > > +       for (i = 0; i < init_bytes; i++) {
> >
> > All for(..) loops should be changed to:
> >   for (i = init_bytes; i ; --i)
> > This is easier for gcc to optimize. Always compare to zero when possible.
> >
> > > +#ifdef __LITTLE_ENDIAN
> > > +              i0 = *p8++ ^ crc;
> > > +              crc = t0_le[i0] ^ (crc >> 8);
> > > +#else
> > > +              i0 = *p8++ ^ (crc >> 24);
> > > +              crc = t0_le[i0] ^ (crc << 8);
> > > +#endif
> > > +       }
> > >
> > Better to hide in macro as current code do.
> > ...
> >
> > > +       for (i = 0; i < words; i++) {
> > > +#ifdef __LITTLE_ENDIAN
> > > +#  if CRC_LE_BITS == 64
> > > +              /* slice by 8 algorithm */
> > > +              q = *p32++ ^ crc;
> > > +              i3 = q;
> > > +              i2 = q >> 8;
> > > +              i1 = q >> 16;
> > > +              i0 = q >> 24;
> > > +              crc = t7_le[i3] ^ t6_le[i2] ^ t5_le[i1] ^ t4_le[i0];
> > > +
> > > +              q = *p32++;
> > > +              i3 = q;
> > > +              i2 = q >> 8;
> > > +              i1 = q >> 16;
> > > +              i0 = q >> 24;
> > > +              crc ^= t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
> >
> > I am not convinced that converting the table matrix to several arrays is
> > faster.
> > Now you have to load 8 addresses and then index them instead of just one
> > address.
> >
> > Also, this looks more like unrolling the 32 bit variant. Are you sure that
> > you won't get just as good results by just unrolling the 32 bit variant?
> >
> > You also lost the pre increment which was faster on RISC archs like
> > ppc (sparc too?) last time I tested crc32 speed.
> >
> > > +#  else
> > > +              /* slice by 4 algorithm */
> > > +              q = *p32++ ^ crc;
> > > +              i3 = q;
> > > +              i2 = q >> 8;
> > > +              i1 = q >> 16;
> > > +              i0 = q >> 24;
> > > +              crc = t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
> > > +#  endif
> >
> > ...
> > > General comment. The use of le/be in this and the previous version of
> > > crc32.c is confusing because it does not refer to little/big endian cpu
> > > architecture but rather to the arbitrary choice made by different
> protocols
> > > as to whether the bits in a byte are presented to the CRC in little/big
> > > *BIT* order. Some protocols (e.g. Ethernet, InfiniBand) process the bits
> > > starting from the LSB and some (e.g. atm) process the bits in MSB order.
> > > These routines (crc32_le and crc32_be) compute the CRC as a u32 in cpu
> > byte
> > > order. The caller has to then get the CRC into the correct order for
> > > inclusion into the message. As a result there are four versions of the
> >
> > No, the caller does not have to get the CRC into correct order, this is done
> > by crc32 code.
> >
> > > calculation:
> > > BE bit order on BE byte order cpu, BE bit order on LE byte order cpu, etc.
> >
> > What would be better? I don't see what to do. Linux requires both LE and BE
> > bit ordering. When we optimize heavily, CPU byte order becomes an issue that
> > needs to be dealt with.
> >
> > > Some calculations are simplified by arranging the bits sequentially in WORDS
> > > when we are going to process them in a certain order within each byte. This
> > > means that the tables used to compute crc32_be are easier to use in BE byte
> > > order and that tables used to compute crc32_le are easier to use in LE byte
> > > order. That is the logic behind the following weird looking macros which are
> > > kept the same as the previous version except for the inclusion of __force to
> > > pass sparse with -D__CHECK_ENDIAN__.
> >
> > Don't understand, how is your code any different from what we have today?
> >
> >  Jocke
>
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH] add slice by 8 algorithm to crc32.c
  2011-08-02 21:14 ` Bob Pearson
@ 2011-08-02 21:19   ` Bob Pearson
  2011-08-04 11:54   ` Joakim Tjernlund
  1 sibling, 0 replies; 27+ messages in thread
From: Bob Pearson @ 2011-08-02 21:19 UTC (permalink / raw)
  To: 'Joakim Tjernlund', 'frank zago',
	'Andrew Morton',
	linux-kernel

Outlook sucks. Sorry.

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Bob Pearson
> Sent: Tuesday, August 02, 2011 4:15 PM
> To: 'Joakim Tjernlund'; 'frank zago'; 'Andrew Morton'; linux-
> kernel@vger.kernel.org
> Subject: RE: [PATCH] add slice by 8 algorithm to crc32.c
> 
> Hi Joakim,
> 
> Sorry to take so long to respond.
> 
> Here are some performance data collected from the original and modified
> crc32 algorithms.
> The following is a simple test loop that computes the time to compute 1000
> crc's over 4096 bytes of data aligned on an 8 byte boundary after warming
> the cache. You could make other measurements but this is sort of a best
> case.
> 
> These measurements were made on a dual socket Nehalem 2.267 GHz
> system.
> 
> #ifdef CONFIG_CRC32_PERFTEST
> #include <linux/time.h>
> 
> static u8 __attribute__((__aligned__(8))) test_buf[4096];
> 
> /* need to convince gcc to not optimize out all the crc calls as dead code */
> u32 crc;
> 
> static int __init crc32_init(void)
> {
>         int i;
>         struct timespec start, stop;
>         u64 nsec_le;
>         u64 nsec_be;
> 
>         for (i = 0; i  < 4096; i++)
>                 test_buf[i] = i;
> 
>         crc = crc32_le(0, test_buf, 4096);
>         crc = crc32_le(crc, test_buf, 4096);
> 
>         getnstimeofday(&start);
>         for (i = 0; i < 1000; i++)
>                 crc = crc32_le(crc, test_buf, 4096);
>         getnstimeofday(&stop);
> 
>         nsec_le = stop.tv_nsec - start.tv_nsec +
>                 1000000000 * (stop.tv_sec - start.tv_sec);
> 
>         crc = crc32_be(0, test_buf, 4096);
>         crc = crc32_be(crc, test_buf, 4096);
> 
>         getnstimeofday(&start);
>         for (i = 0; i < 1000; i++)
>                 crc = crc32_be(crc, test_buf, 4096);
>         getnstimeofday(&stop);
> 
>         nsec_be = stop.tv_nsec - start.tv_nsec +
>                 1000000000 * (stop.tv_sec - start.tv_sec);
> 
>         pr_info("crc32: nsec_le = %lld nsec_be = %lld\n", nsec_le, nsec_be);
> 
>         return 0;
> }
> 
> module_init(crc32_init);
> #endif /* CONFIG_CRC32_PERFTEST */
> 
> Here are the timings.
> 
>                         CRC_?E_BITS    crc_le(nsec)    crc_be(nsec)
>         orig crc32.c              8         5842712         5829793
> 
>         new crc32.c              64         2877530         2862515
>         new crc32.c              32         5174400         5105816  (Equiv. to orig. BITS = 8)
> 
> Modify all 'i' loops from for (i = 0; i < foo; i++) { ... } to
> for (i = foo - 1; i >= 0; i--) { ... }
>         + count down             64         2837860         3230438
>                                             2865727         3264033  (repeat)
>         + count down             32         5177334         5070337
>            Results seem ambiguous on Nehalem. Slight improvement in
>            some cases. 12% worse in one case.
> 
> Modify main loop to use pre-increment instead of post-increment.
>         + pre-increment          64         2872502         2767601
>         + pre-increment          32         5100579         5090453
>            Small scale improvements for each case. Will add to next drop.
> 
> Do both count down and pre increment.
>         + ct down + pre-inc      64         3329125         3367815
>         + ct down + pre-inc      32         5065308         5089825
>            Got significantly worse for 64 bit, slightly better for 32.
> 
> Separately I looked at the difference between 1D and 2D arrays and 1D arrays
> are a few % faster. The difference is that the loader computes the base
> address at compile time. In the 2D case you have to add an offset to a
> parameter that was passed in to the body routine which is a runtime
> addition.
> 
> The first comment below relates to the behavior of gen_crc32table.c which
> always printed out 256x4 byte table entries at a time. The CRC32_?E_BITS = 2
> and 4 cases only read the first 4 and 16 table entries respectively so if
> you are trying to make a really small image you can save most of 1KB by
> shortening the table.
> 
> I think I just caused more confusion below. The point is that the two
> routines crc32_le and crc32_be take a byte string as argument which is
> invariant under big/little endian byte order and produce a u32 which is just
> an integer. All the byte swapping in the code is implementation detail and
> has no effect on the result. The caller should just do what you normally do
> with an integer when you use it in a network message e.g. call htonl or
> equivalent before putting it in a packet.
> 
> Last point below. The existing code fails sparse with __CHECK_ENDIAN__.
> (There are several other sparse fails in lib as well BTW.) I cleaned up
> these warnings by forcing everything to u32.
> 
> I will include Andrew's comments with the above and send out another patch
> soon.
> 
> Regards,
> 
> Bob Pearson
> 
> > -----Original Message-----
> > From: Joakim Tjernlund [mailto:joakim.tjernlund@transmode.se]
> > Sent: Monday, August 01, 2011 4:20 AM
> > To: frank zago; Andrew Morton; Bob Pearson
> > Subject: Re: [PATCH] add slice by 8 algorithm to crc32.c
> >
> >
> > Hi Bob, Frank and Andrew
> >
> > I just got back from vacation and noticed this patch in the kernel mail
> > archive. I have pasted some bits from there and commented too.
> >
> > > Hello,
> > >
> > > Added support for slice by 8 to existing crc32 algorithm. Also
> > > modified gen_crc32table.c to only produce table entries that are
> > > actually used.
> >
> > Hmm, I don't get this. What entries are unused?
> >
> > > The parameters CRC_LE_BITS and CRC_BE_BITS determine
> > > the number of bits in the input array that are processed during each
> > > step. Generally the more bits the faster the algorithm is but the
> > > more table data required.
> > >
> > > Using an x86_64 Opteron machine running at 2100MHz the following table
> > > was collected with a pre-warmed cache by computing the crc 1000 times
> > > on a buffer of 4096 bytes.
> > >
> > >        BITS    Size     LE Cycles/byte    BE Cycles/byte
> > >        -----------------------------------------------
> > >        1         873        41.65             34.60
> > >        2        1097        25.43             29.61
> > >        4        1057        13.29             15.28
> > >        8        2913         7.13              8.19
> > >        32       9684         2.80              2.82
> > >        64      18178         1.53              1.53
> > >
> > >        BITS is the value of CRC_LE_BITS or CRC_BE_BITS. The old
> > >        default was 8 which actually selected the 32 bit algorithm. In
> > >        this version the value 8 is used to select the standard
> > >        8 bit algorithm and two new values: 32 and 64 are introduced
> > >        to select the slice by 4 and slice by 8 algorithms respectively.
> > >
> > >        Where Size is the size of crc32.o's text segment which includes
> > >        code and table data when both LE and BE versions are set to BITS.
> >
> > I miss the numbers from the current 32 bits impl. How does this new impl.
> > for 32 bits compare to the current impl?
> >
> > >
> > > The current version of crc32.c by default uses the slice by 4 algorithm
> > > which requires about 2.8 cycles per byte. The slice by 8 algorithm is
> > > roughly 2X faster and enables packet processing at over 1GB/sec on a
> > > typical 2-3GHz system.
> > > --- lib/crc32.c
> > > +++ lib/crc32.c
> > >
> > > ...
> > >
> > > @@ -28,14 +31,17 @@
> > >  #include <linux/init.h>
> > >  #include <asm/atomic.h>
> > >  #include "crc32defs.h"
> > > -#if CRC_LE_BITS == 8
> > > -# define tole(x) __constant_cpu_to_le32(x)
> > > +
> > > +#include <asm/msr.h>
> > > +#if CRC_LE_BITS > 8
> > > +# define tole(x) (__force u32) __constant_cpu_to_le32(x)
> > >  #else
> > >  # define tole(x) (x)
> > >  #endif
> > >
> > > -#if CRC_BE_BITS == 8
> > > -# define tobe(x) __constant_cpu_to_be32(x)
> > > +#if CRC_BE_BITS > 8
> > > +# define tobe(x) (__force u32) __constant_cpu_to_be32(x)
> > >  #else
> > >  # define tobe(x) (x)
> > >  #endif
> > > @@ -45,54 +51,228 @@ MODULE_AUTHOR("Matt Domsch
> > <Matt_Domsch@
> > >  MODULE_DESCRIPTION("Ethernet CRC32 calculations");
> > >  MODULE_LICENSE("GPL");
> > >
> > > -#if CRC_LE_BITS == 8 || CRC_BE_BITS == 8
> > > +#if CRC_LE_BITS > 8
> > > +static inline u32 crc32_le_body(u32 crc, u8 const *buf, size_t len)
> > > +{
> > > +       const u8 *p8;
> > > +       const u32 *p32;
> > > +       int init_bytes, end_bytes;
> > > +       size_t words;
> > > +       int i;
> > > +       u32 q;
> > > +       u8 i0, i1, i2, i3;
> > > +
> > > +       crc = (__force u32) __cpu_to_le32(crc);
> > > +
> > > +#if CRC_LE_BITS == 64
> > > +       p8 = buf;
> > > +       p32 = (u32 *)(((uintptr_t)p8 + 7) & ~7);
> > > +
> > > +       init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> > > +       if (init_bytes > len)
> > > +              init_bytes = len;
> > > +       words = (len - init_bytes) >> 3;
> > > +       end_bytes = (len - init_bytes) & 7;
> > > +#else
> > > +       p8 = buf;
> > > +       p32 = (u32 *)(((uintptr_t)p8 + 3) & ~3);
> > > +       init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> > > +       if (init_bytes > len)
> > > +              init_bytes = len;
> > > +       words = (len - init_bytes) >> 2;
> > > +       end_bytes = (len - init_bytes) & 3;
> > > +#endif
> >
> > The difference in the above code is just two constants. Should be possible
> > to avoid code duplication.
> >
> > > +
> > > +       for (i = 0; i < init_bytes; i++) {
> >
> > All for(..) loops should be changed to:
> >   for (i = init_bytes; i ; --i)
> > This is easier for gcc to optimize. Always compare to zero when possible.
> >
> > > +#ifdef __LITTLE_ENDIAN
> > > +              i0 = *p8++ ^ crc;
> > > +              crc = t0_le[i0] ^ (crc >> 8);
> > > +#else
> > > +              i0 = *p8++ ^ (crc >> 24);
> > > +              crc = t0_le[i0] ^ (crc << 8);
> > > +#endif
> > > +       }
> > >
> > Better to hide in macro as current code do.
> > ...
> >
> > > +       for (i = 0; i < words; i++) {
> > > +#ifdef __LITTLE_ENDIAN
> > > +#  if CRC_LE_BITS == 64
> > > +              /* slice by 8 algorithm */
> > > +              q = *p32++ ^ crc;
> > > +              i3 = q;
> > > +              i2 = q >> 8;
> > > +              i1 = q >> 16;
> > > +              i0 = q >> 24;
> > > +              crc = t7_le[i3] ^ t6_le[i2] ^ t5_le[i1] ^ t4_le[i0];
> > > +
> > > +              q = *p32++;
> > > +              i3 = q;
> > > +              i2 = q >> 8;
> > > +              i1 = q >> 16;
> > > +              i0 = q >> 24;
> > > +              crc ^= t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
> >
> > I am not convinced that converting the table matrix to several arrays is
> > faster.
> > Now you have to load 8 addresses and then index them instead of just one
> > address.
> >
> > Also, this looks more like unrolling the 32 bit variant. Are you sure that
> > you won't get just as good results by just unrolling the 32 bit variant?
> >
> > You also lost the pre increment which was faster on RISC archs like
> > ppc (sparc too?) last time I tested crc32 speed.
> >
> > > +#  else
> > > +              /* slice by 4 algorithm */
> > > +              q = *p32++ ^ crc;
> > > +              i3 = q;
> > > +              i2 = q >> 8;
> > > +              i1 = q >> 16;
> > > +              i0 = q >> 24;
> > > +              crc = t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
> > > +#  endif
> >
> > ...
> > > General comment. The use of le/be in this and the previous version of
> > > crc32.c is confusing because it does not refer to little/big endian
> > > cpu architecture but rather to the arbitrary choice made by different
> > > protocols as to whether the bits in a byte are presented to the CRC
> > > in little/big *BIT* order. Some protocols (e.g. Ethernet, InfiniBand)
> > > process the bits starting from the LSB and some (e.g. atm) process
> > > the bits in MSB order. These routines (crc32_le and crc32_be) compute
> > > the CRC as a u32 in cpu byte order. The caller has to then get the
> > > CRC into the correct order for inclusion into the message. As a
> > > result there are four versions of the
> >
> > No, the caller does not have to get the CRC into the correct order;
> > this is done by the crc32 code.
> >
> > > calculation:
> > > BE bit order on BE byte order cpu, BE bit order on LE byte order cpu,
> > > etc.
> >
> > What would be better? I don't see what to do. Linux requires both LE
> > and BE bit ordering. When we optimize heavily, CPU byte order becomes
> > an issue that needs to be dealt with.
> >
> > > Some calculations are simplified by arranging the bits sequentially
> > > in WORDS when we are going to process them in a certain order within
> > > each byte. This means that the tables used to compute crc32_be are
> > > easier to use in BE byte order and that tables used to compute
> > > crc32_le are easier to use in LE byte order. That is the logic behind
> > > the following weird looking macros which are kept the same as the
> > > previous version except for the inclusion of __force to pass sparse
> > > with -D__CHECK_ENDIAN__.
> >
> > Don't understand, how is your code any different from what we have
> > today?
> >
> >  Jocke


^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH] add slice by 8 algorithm to crc32.c
       [not found] <OF4AE0115F.3AA5397E-ONC12578DF.002EC6DF-C12578DF.003348E5@transmode.se>
@ 2011-08-02 21:14 ` Bob Pearson
  2011-08-02 21:19   ` Bob Pearson
  2011-08-04 11:54   ` Joakim Tjernlund
  0 siblings, 2 replies; 27+ messages in thread
From: Bob Pearson @ 2011-08-02 21:14 UTC (permalink / raw)
  To: 'Joakim Tjernlund', 'frank zago',
	'Andrew Morton',
	linux-kernel

Hi Joakim,

Sorry to take so long to respond.

Here are some performance data collected from the original and modified
crc32 algorithms.
The following is a simple test loop that measures the time to compute 1000
CRCs over 4096 bytes of data aligned on an 8-byte boundary, after warming
the cache. You could make other measurements, but this is more or less a
best case.

These measurements were made on a dual socket Nehalem 2.267 GHz system.

#ifdef CONFIG_CRC32_PERFTEST
#include <linux/time.h>

static u8 __attribute__((__aligned__(8))) test_buf[4096];

/* need to convince gcc to not optimize out all the crc calls as dead code */
u32 crc;

static int __init crc32_init(void)
{
        int i;
        struct timespec start, stop;
        u64 nsec_le;
        u64 nsec_be;

        for (i = 0; i < 4096; i++)
                test_buf[i] = i;

        crc = crc32_le(0, test_buf, 4096);
        crc = crc32_le(crc, test_buf, 4096);

        getnstimeofday(&start);
        for (i = 0; i < 1000; i++)
                crc = crc32_le(crc, test_buf, 4096);
        getnstimeofday(&stop);

        nsec_le = stop.tv_nsec - start.tv_nsec +
                1000000000 * (stop.tv_sec - start.tv_sec);

        crc = crc32_be(0, test_buf, 4096);
        crc = crc32_be(crc, test_buf, 4096);

        getnstimeofday(&start);
        for (i = 0; i < 1000; i++)
                crc = crc32_be(crc, test_buf, 4096);
        getnstimeofday(&stop);

        nsec_be = stop.tv_nsec - start.tv_nsec +
                1000000000 * (stop.tv_sec - start.tv_sec);

        pr_info("crc32: nsec_le = %lld nsec_be = %lld\n", nsec_le, nsec_be);

        return 0;
}

module_init(crc32_init);
#endif /* CONFIG_CRC32_PERFTEST */

Here are the timings.

                        CRC_?E_BITS    crc_le (nsec)    crc_be (nsec)
    orig crc32.c             8            5842712          5829793

    new crc32.c             64            2877530          2862515
    new crc32.c             32            5174400          5105816
    (BITS = 32 is equivalent to the original BITS = 8.)

Modify all 'i' loops from "for (i = 0; i < foo; i++) {...}" to
"for (i = foo - 1; i >= 0; i--) {...}":
    + count down            64            2837860          3230438
      (repeat)              64            2865727          3264033
    + count down            32            5177334          5070337
    Results seem ambiguous on Nehalem: a slight improvement in some
    cases, 12% worse in one case.

Modify the main loop to use pre-increment instead of post-increment:
    + pre-increment         64            2872502          2767601
    + pre-increment         32            5100579          5090453
    Small improvements in each case. Will add to the next drop.

Do both count down and pre-increment:
    + ct down + pre-inc     64            3329125          3367815
    + ct down + pre-inc     32            5065308          5089825
    Got significantly worse for 64 bit, slightly better for 32.

Separately I looked at the difference between 1D and 2D arrays, and 1D
arrays are a few % faster. The difference is that the base address of each
1D table is resolved at link time, whereas in the 2D case you have to add
an offset to a table parameter that was passed in to the body routine,
which is a runtime addition.
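
To make that concrete, here is a minimal sketch of the two addressing
styles (using the patch's t0_le..t3_le names; 'tab' is the old 2D table
parameter):

	/* 2D: 'tab' arrives as a runtime parameter, so every lookup is
	 * base + j*256*4 + index*4, with the base known only at run time */
	crc = tab[3][q & 255] ^ tab[2][(q >> 8) & 255] ^
	      tab[1][(q >> 16) & 255] ^ tab[0][(q >> 24) & 255];

	/* 1D: each table is its own static array, so t3_le etc. are
	 * link-time constants folded into the addressing mode */
	crc = t3_le[q & 255] ^ t2_le[(q >> 8) & 255] ^
	      t1_le[(q >> 16) & 255] ^ t0_le[(q >> 24) & 255];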

The first comment below relates to the behavior of gen_crc32table.c, which
always printed out 256 four-byte table entries at a time. The CRC_?E_BITS =
2 and 4 cases only read the first 4 and 16 table entries respectively, so
if you are trying to make a really small image you can save most of 1KB by
shortening the table.

I think I just caused more confusion below. The point is that the two
routines crc32_le and crc32_be take a byte string as argument, which is
invariant under big/little endian byte order, and produce a u32, which is
just an integer. All the byte swapping in the code is implementation detail
and has no effect on the result. The caller should just do what you
normally do with an integer when you use it in a network message, e.g. call
htonl or equivalent before putting it in a packet.
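
For instance (a sketch only; 'seed', 'data', 'len' and 'pkt' are made-up
names):

	u32 crc = crc32_le(seed, data, len);	/* plain integer, cpu order */

	/* byte order only matters when the value goes on the wire: */
	put_unaligned_be32(crc, pkt);	/* or htonl()/cpu_to_le32(), as the
					 * protocol requires */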

Last point below. The existing code fails sparse with __CHECK_ENDIAN__.
(There are several other sparse fails in lib as well BTW.) I cleaned up
these warnings by forcing everything to u32.
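
For reference, the pattern the patch uses (from crc32_le_body) is:

	crc = (__force u32) __cpu_to_le32(crc);
	/* ... all of the table math is then done on a plain u32 ... */
	return __le32_to_cpu((__force __le32) crc);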

I will include Andrew's comments with the above and send out another patch
soon.

Regards,

Bob Pearson

> -----Original Message-----
> From: Joakim Tjernlund [mailto:joakim.tjernlund@transmode.se]
> Sent: Monday, August 01, 2011 4:20 AM
> To: frank zago; Andrew Morton; Bob Pearson
> Subject: Re: [PATCH] add slice by 8 algorithm to crc32.c
> 
> 
> Hi Bob, Frank and Andrew
> 
> I just got back from vacation and noticed this patch in the kernel mail
> archive. I have pasted some bits from there and commented too.
> 
> > Hello,
> >
> > Added support for slice by 8 to existing crc32 algorithm. Also
> > modified gen_crc32table.c to only produce table entries that are
> > actually used.
> 
> Hmm, I don't get this. What entries are unused?
> 
> > The parameters CRC_LE_BITS and CRC_BE_BITS determine
> > the number of bits in the input array that are processed during each
> > step. Generally the more bits the faster the algorithm is but the
> > more table data required.
> >
> > Using an x86_64 Opteron machine running at 2100MHz the following table
> > was collected with a pre-warmed cache by computing the crc 1000 times
> > on a buffer of 4096 bytes.
> >
> > 		 BITS	Size	LE Cycles/byte	BE Cycles/byte
> > 		 ----------------------------------------------
> > 		 1	873	41.65		34.60
> > 		 2	1097	25.43		29.61
> > 		 4	1057	13.29		15.28
> > 		 8	2913	7.13		8.19
> > 		 32	9684	2.80		2.82
> > 		 64	18178	1.53		1.53
> >
> > 		 BITS is the value of CRC_LE_BITS or CRC_BE_BITS. The old
> > 		 default was 8 which actually selected the 32 bit algorithm.
> > 		 In this version the value 8 is used to select the standard
> > 		 8 bit algorithm and two new values: 32 and 64 are introduced
> > 		 to select the slice by 4 and slice by 8 algorithms
> > 		 respectively.
> >
> > 		 Where Size is the size of crc32.o's text segment which
> > 		 includes code and table data when both LE and BE versions
> > 		 are set to BITS.
> 
> I miss the numbers from the current 32 bits impl. How does this new impl.
> for 32 bits compare to the current impl?
> 
> >
> > The current version of crc32.c by default uses the slice by 4 algorithm
> > which requires about 2.8 cycles per byte. The slice by 8 algorithm is
> > roughly 2X faster and enables packet processing at over 1GB/sec on a
> > typical 2-3GHz system.
> > --- lib/crc32.c
> > +++ lib/crc32.c
> >
> > ...
> >
> > @@ -28,14 +31,17 @@
> >  #include <linux/init.h>
> >  #include <asm/atomic.h>
> >  #include "crc32defs.h"
> > -#if CRC_LE_BITS == 8
> > -# define tole(x) __constant_cpu_to_le32(x)
> > +
> > +#include <asm/msr.h>
> > +#if CRC_LE_BITS > 8
> > +# define tole(x) (__force u32) __constant_cpu_to_le32(x)
> >  #else
> >  # define tole(x) (x)
> >  #endif
> >
> > -#if CRC_BE_BITS == 8
> > -# define tobe(x) __constant_cpu_to_be32(x)
> > +#if CRC_BE_BITS > 8
> > +# define tobe(x) (__force u32) __constant_cpu_to_be32(x)
> >  #else
> >  # define tobe(x) (x)
> >  #endif
> > @@ -45,54 +51,228 @@ MODULE_AUTHOR("Matt Domsch
> <Matt_Domsch@
> >  MODULE_DESCRIPTION("Ethernet CRC32 calculations");
> >  MODULE_LICENSE("GPL");
> >
> > -#if CRC_LE_BITS == 8 || CRC_BE_BITS == 8
> > +#if CRC_LE_BITS > 8
> > +static inline u32 crc32_le_body(u32 crc, u8 const *buf, size_t len)
> > +{
> > +		 const u8 *p8;
> > +		 const u32 *p32;
> > +		 int init_bytes, end_bytes;
> > +		 size_t words;
> > +		 int i;
> > +		 u32 q;
> > +		 u8 i0, i1, i2, i3;
> > +
> > +		 crc = (__force u32) __cpu_to_le32(crc);
> > +
> > +#if CRC_LE_BITS == 64
> > +		 p8 = buf;
> > +		 p32 = (u32 *)(((uintptr_t)p8 + 7) & ~7);
> > +
> > +		 init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> > +		 if (init_bytes > len)
> > +		 		 init_bytes = len;
> > +		 words = (len - init_bytes) >> 3;
> > +		 end_bytes = (len - init_bytes) & 7;
> > +#else
> > +		 p8 = buf;
> > +		 p32 = (u32 *)(((uintptr_t)p8 + 3) & ~3);
> > +		 init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> > +		 if (init_bytes > len)
> > +		 		 init_bytes = len;
> > +		 words = (len - init_bytes) >> 2;
> > +		 end_bytes = (len - init_bytes) & 3;
> > +#endif
> 
> The difference in the above code is just two constants. Should be possible
> to avoid code duplication.
> 
> > +
> > +		 for (i = 0; i < init_bytes; i++) {
> 
> All for(..) loops should be changed to:
>   for (i = init_bytes; i ; --i)
> This is easier for gcc to optimize. Always compare to zero when possible.
> 
> > +#ifdef __LITTLE_ENDIAN
> > +		 		 i0 = *p8++ ^ crc;
> > +		 		 crc = t0_le[i0] ^ (crc >> 8);
> > +#else
> > +		 		 i0 = *p8++ ^ (crc >> 24);
> > +		 		 crc = t0_le[i0] ^ (crc << 8);
> > +#endif
> > +		 }
> >
> Better to hide in macro as current code do.
> ...
> 
> > +		 for (i = 0; i < words; i++) {
> > +#ifdef __LITTLE_ENDIAN
> > +#  if CRC_LE_BITS == 64
> > +		 		 /* slice by 8 algorithm */
> > +		 		 q = *p32++ ^ crc;
> > +		 		 i3 = q;
> > +		 		 i2 = q >> 8;
> > +		 		 i1 = q >> 16;
> > +		 		 i0 = q >> 24;
> > +		 		 crc = t7_le[i3] ^ t6_le[i2] ^ t5_le[i1] ^ t4_le[i0];
> > +
> > +		 		 q = *p32++;
> > +		 		 i3 = q;
> > +		 		 i2 = q >> 8;
> > +		 		 i1 = q >> 16;
> > +		 		 i0 = q >> 24;
> > +		 		 crc ^= t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
> 
> I am not convinced that converting the table matrix to several arrays is
> faster. Now you have to load 8 addresses and then index them instead of
> just one address.
> 
> Also, this looks more like unrolling the 32 bit variant. Are you sure
> that you won't get just as good results by just unrolling the 32 bit
> variant?
> 
> You also lost the pre-increment which was faster on RISC archs like
> ppc (sparc too?) last time I tested crc32 speed.
> 
> > +#  else
> > +		 		 /* slice by 4 algorithm */
> > +		 		 q = *p32++ ^ crc;
> > +		 		 i3 = q;
> > +		 		 i2 = q >> 8;
> > +		 		 i1 = q >> 16;
> > +		 		 i0 = q >> 24;
> > +		 		 crc = t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
> > +#  endif
> 
> ...
> > General comment. The use of le/be in this and the previous version of
> > crc32.c is confusing because it does not refer to little/big endian cpu
> > architecture but rather to the arbitrary choice made by different
> > protocols as to whether the bits in a byte are presented to the CRC in
> > little/big *BIT* order. Some protocols (e.g. Ethernet, InfiniBand)
> > process the bits starting from the LSB and some (e.g. atm) process the
> > bits in MSB order. These routines (crc32_le and crc32_be) compute the
> > CRC as a u32 in cpu byte order. The caller has to then get the CRC into
> > the correct order for inclusion into the message. As a result there are
> > four versions of the
>
> No, the caller does not have to get the CRC into the correct order; this
> is done by the crc32 code.
>
> > calculation:
> > BE bit order on BE byte order cpu, BE bit order on LE byte order cpu,
> > etc.
> 
> What would be better? I don't see what to do. Linux requires both LE and
> BE bit ordering. When we optimize heavily, CPU byte order becomes an
> issue that needs to be dealt with.
> 
> > Some calculations are simplified by arranging the bits sequentially in
> > WORDS when we are going to process them in a certain order within each
> > byte. This means that the tables used to compute crc32_be are easier to
> > use in BE byte order and that tables used to compute crc32_le are
> > easier to use in LE byte order. That is the logic behind the following
> > weird looking macros which are kept the same as the previous version
> > except for the inclusion of __force to pass sparse with
> > -D__CHECK_ENDIAN__.
> 
> Don't understand, how is your code any different from what we have today?
> 
>  Jocke



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] add slice by 8 algorithm to crc32.c
  2011-07-29  1:47   ` Bob Pearson
@ 2011-08-01 19:39     ` Andrew Morton
  0 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2011-08-01 19:39 UTC (permalink / raw)
  To: Bob Pearson; +Cc: 'frank zago', linux-kernel, 'Roland Dreier'

On Thu, 28 Jul 2011 20:47:36 -0500
"Bob Pearson" <rpearson@systemfabricworks.com> wrote:

> > 
> > > +
> > > +	init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> > 
> > Please take a look at the types of init_bytes and end_bytes.
> > ptrdiff_t, size_t and uint would all be more appropriate than `int'.
> 
> Is there a performance difference on 32 bit machines from using a long to
> hold something that is in the range [0,7]? Which of these would you
> recommend?

I'm not aware of 64-bit being any faster than 32-bit on any machine. 
But I'm not aware of much.

Making this a signed quantity was inappropriate.  unsigned int, perhaps.

IMO, unsigned should have been the default in C - negative numbers are
relatively rare.  Whatever.

> > > +	p32 = (u32 *)(((uintptr_t)p8 + 3) & ~3);
> > 
> > ALIGN()?
> 
> Sure.
> 
> > 
> > > +	init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> > > +	if (init_bytes > len)
> > > +		init_bytes = len;
> > 
> > min()?
> 
> Maybe. First line is the main thought and the if handles the exception where
> the unrolled loop doesn't have a middle. Happy to oblige though. How about
> an unlikely in the if?

Don't care much ;) It's just that the reader looks at those two lines,
absorbs them then says "ah, it's taking the minimum".  If it had just
used "min()" then review and reading are more efficient, and reliable.  Plus min()
has typechecking features which can catch slipups.
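
E.g. (a sketch; since init_bytes is currently an int and len a size_t, the
type-checked form would be min_t()):

	init_bytes = min_t(size_t, (uintptr_t)p32 - (uintptr_t)p8, len);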


^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH] add slice by 8 algorithm to crc32.c
  2011-07-28 22:16 ` Andrew Morton
@ 2011-07-29  1:47   ` Bob Pearson
  2011-08-01 19:39     ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Bob Pearson @ 2011-07-29  1:47 UTC (permalink / raw)
  To: 'Andrew Morton', 'frank zago'
  Cc: linux-kernel, 'Roland Dreier'

Hi Andrew,

A few questions/comments below that will help in resending the patch.

Thanks,

Bob Pearson

> 
> Are you sure that the code doesn't add any unaligned accesses?  Those are
> very bad on some non-x86 architectures.

Agreed. p8 is byte aligned; p32 is 32/64-bit aligned for the 32/64-bit
algorithms. I will try to make it clearer.
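
Spelling out the invariant (a sketch of what the 64-bit setup code already
does):

	/* round p32 up to the next 8-byte boundary, so every *p32++ in
	 * the word loop is a naturally aligned load; the 0..7 leading
	 * bytes are consumed one at a time by the init loop */
	p32 = (const u32 *)(((uintptr_t)p8 + 7) & ~(uintptr_t)7);
	init_bytes = (const u8 *)p32 - p8;	/* always in [0, 7] */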

> 
> > --- lib/crc32.c
> > +++ lib/crc32.c
> 
> Please prepare patches in `patch -p1' form.  So that should have been

OK

> 
> --- a/lib/crc32.c
> +++ a/lib/crc32.c
> 
> >
> > ...
> >
> > @@ -28,14 +31,17 @@
> >  #include <linux/init.h>
> >  #include <asm/atomic.h>
> >  #include "crc32defs.h"
> > -#if CRC_LE_BITS == 8
> > -# define tole(x) __constant_cpu_to_le32(x)
> > +
> > +#include <asm/msr.h>
> 
> That breaks the build on all non-x86.  Fortunately the inclusion appears
> to be unnecessary.

Oops! Was part of some performance measurement code that got taken out.

> 

General comment. The use of le/be in this and the previous version of
crc32.c is confusing because it does not refer to little/big endian cpu
architecture but rather to the arbitrary choice made by different protocols
as to whether the bits in a byte are presented to the CRC in little/big
*BIT* order. Some protocols (e.g. Ethernet, InfiniBand) process the bits
starting from the LSB and some (e.g. atm) process the bits in MSB order.
These routines (crc32_le and crc32_be) compute the CRC as a u32 in cpu byte
order. The caller has to then get the CRC into the correct order for
inclusion into the message. As a result there are four versions of the
calculation:
BE bit order on BE byte order cpu, BE bit order on LE byte order cpu, etc.

Some calculations are simplified by arranging the bits sequentially in WORDS
when we are going to process them in a certain order within each byte. This
means that the tables used to compute crc32_be are easier to use in BE byte
order and that tables used to compute crc32_le are easier to use in LE byte
order. That is the logic behind the following weird looking macros which are
kept the same as the previous version except for the inclusion of __force to
pass sparse with -D__CHECK_ENDIAN__.

> > +#if CRC_LE_BITS > 8
> > +# define tole(x) (__force u32) __constant_cpu_to_le32(x)
> >  #else
> >  # define tole(x) (x)
> >  #endif

> > -#if CRC_LE_BITS == 8 || CRC_BE_BITS == 8
> > +#if CRC_LE_BITS > 8
> > +static inline u32 crc32_le_body(u32 crc, u8 const *buf, size_t len)
> 
> `inline' is largely ignored by gcc, with gcc-version dependencies.
> __always_inline is more likely to succeed.  But the compiler would inline
> this all by itself anyway.

Agreed. Inherited from previous version and happy to delete it.

> > +	p32 = (u32 *)(((uintptr_t)p8 + 7) & ~7);
> 
> Perhaps using ALIGN() would be clearer.

OK by me.  It's cleaner.
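
Something like (sketch; PTR_ALIGN() is the pointer variant in
linux/kernel.h):

	p32 = (const u32 *)PTR_ALIGN(buf, 8);	/* was ((uintptr_t)p8 + 7) & ~7 */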

> 
> > +
> > +	init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> 
> Please take a look at the types of init_bytes and end_bytes.
> ptrdiff_t, size_t and uint would all eb more appropriate than `int'.

Is there a performance difference on 32 bit machines from using a long to
hold something that is in the range [0,7]? Which of these would you
recommend?

> > +	p32 = (u32 *)(((uintptr_t)p8 + 3) & ~3);
> 
> ALIGN()?

Sure.

> 
> > +	init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> > +	if (init_bytes > len)
> > +		init_bytes = len;
> 
> min()?

Maybe. First line is the main thought and the if handles the exception where
the unrolled loop doesn't have a middle. Happy to oblige though. How about
an unlikely in the if?

> > +	for (i = 0; i < init_bytes; i++) {
> > +#ifdef __LITTLE_ENDIAN
> > +		i0 = *p8++ ^ crc;
> > +		crc = t0_le[i0] ^ (crc >> 8);
> > +#else
> > +		i0 = *p8++ ^ (crc >> 24);
> > +		crc = t0_le[i0] ^ (crc << 8);
> > +#endif
> > +	}
> 
> All the ifdeffing in here bursts my brain.

Agreed. But this is basically the same as the original code, except that
the #ifdefs were buried in the macros. I couldn't figure out how to keep
the DO_CRC/DO_CRC4 macros, and for consistent style decided to get rid of
all of them. I'm not sure how to avoid it except to replicate the
subroutines 4X.
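
One possibility (an untested sketch): keep a single per-byte macro and pass
the table in, so the endian #ifdef appears exactly once:

#ifdef __LITTLE_ENDIAN
# define DO_CRC(tab, x) (crc = (tab)[(crc ^ (x)) & 255] ^ (crc >> 8))
#else
# define DO_CRC(tab, x) (crc = (tab)[((crc >> 24) ^ (x)) & 255] ^ (crc << 8))
#endif

	/* the init and end loops in both bodies then collapse to: */
	while (init_bytes--)
		DO_CRC(t0_le, *p8++);	/* t0_be in crc32_be_body */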

> 
> Has this code been carefully tested in LE and BE?

Yes. My Big Endian test machine collection just consists of an old Ultra
Sparc machine, but we have 64 and 32 bit X86 systems. The patch adds one
new option to an existing collection of different ways to compute each CRC,
which have not really changed. In particular, the 1 bit versions are
identical by inspection to the original. That lets us compare results
across different inputs, the optional versions of the algorithm, and of
course the unpatched file.

One change to the original was the removal of the 2D arrays in the table
and their replacement with 1D arrays. My impression is that this makes the
inner loop shorter, but doing it required making two copies of the 'body'
subroutine. Perhaps someone can comment on this. I have looked, and I can't
see any way to shorten the code gcc generates for the slice by 8 loop;
almost anything I do to the loop makes it longer.

I have doubts about the need to have 6 versions with different trade-offs
between table size and speed but I was afraid to make radical changes to the
original code. People who make small embedded systems probably care a lot
but I don't have a good feel for the issue.

> 

I left in the unit test and documentation without change but from the looks
of it no one has run it for a while and I have doubts about whether it works
anymore. What do you think about moving the documentation to Documentation
and ditching the unit tests?



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] add slice by 8 algorithm to crc32.c
  2011-07-20 22:19 frank zago
@ 2011-07-28 22:16 ` Andrew Morton
  2011-07-29  1:47   ` Bob Pearson
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2011-07-28 22:16 UTC (permalink / raw)
  To: frank zago; +Cc: linux-kernel, Bob Pearson, Roland Dreier


(I don't think rdreier@cisco.com exists any more)

On Wed, 20 Jul 2011 17:19:42 -0500
frank zago <fzago@systemfabricworks.com> wrote:

> Hello,
> 
> Added support for slice by 8 to existing crc32 algorithm. Also
> modified gen_crc32table.c to only produce table entries that are
> actually used. The parameters CRC_LE_BITS and CRC_BE_BITS determine
> the number of bits in the input array that are processed during each
> step. Generally the more bits the faster the algorithm is but the
> more table data required.
> 
> Using an x86_64 Opteron machine running at 2100MHz the following table
> was collected with a pre-warmed cache by computing the crc 1000 times
> on a buffer of 4096 bytes.
> 
> 	BITS	Size	LE Cycles/byte	BE Cycles/byte
> 	----------------------------------------------
> 	1	873	41.65		34.60
> 	2	1097	25.43		29.61
> 	4	1057	13.29		15.28
> 	8	2913	7.13		8.19
> 	32	9684	2.80		2.82
> 	64	18178	1.53		1.53
> 
> 	BITS is the value of CRC_LE_BITS or CRC_BE_BITS. The old
> 	default was 8 which actually selected the 32 bit algorithm. In
> 	this version the value 8 is used to select the standard
> 	8 bit algorithm and two new values: 32 and 64 are introduced
> 	to select the slice by 4 and slice by 8 algorithms respectively.
> 
> 	Where Size is the size of crc32.o's text segment which includes
> 	code and table data when both LE and BE versions are set to BITS.
> 
> The current version of crc32.c by default uses the slice by 4 algorithm
> which requires about 2.8 cycles per byte. The slice by 8 algorithm is
> roughly 2X faster and enables packet processing at over 1GB/sec on a typical
> 2-3GHz system.

Are you sure that the code doesn't add any unaligned accesses?  Those
are very bad on some non-x86 architectures.

> --- lib/crc32.c
> +++ lib/crc32.c

Please prepare patches in `patch -p1' form.  So that should have been

--- a/lib/crc32.c
+++ a/lib/crc32.c

>
> ...
>
> @@ -28,14 +31,17 @@
>  #include <linux/init.h>
>  #include <asm/atomic.h>
>  #include "crc32defs.h"
> -#if CRC_LE_BITS == 8
> -# define tole(x) __constant_cpu_to_le32(x)
> +
> +#include <asm/msr.h>

That breaks the build on all non-x86.  Fortunately the inclusion
appears to be unnecessary.

> +#if CRC_LE_BITS > 8
> +# define tole(x) (__force u32) __constant_cpu_to_le32(x)
>  #else
>  # define tole(x) (x)
>  #endif
>  
> -#if CRC_BE_BITS == 8
> -# define tobe(x) __constant_cpu_to_be32(x)
> +#if CRC_BE_BITS > 8
> +# define tobe(x) (__force u32) __constant_cpu_to_be32(x)
>  #else
>  # define tobe(x) (x)
>  #endif
> @@ -45,54 +51,228 @@ MODULE_AUTHOR("Matt Domsch <Matt_Domsch@
>  MODULE_DESCRIPTION("Ethernet CRC32 calculations");
>  MODULE_LICENSE("GPL");
>  
> -#if CRC_LE_BITS == 8 || CRC_BE_BITS == 8
> +#if CRC_LE_BITS > 8
> +static inline u32 crc32_le_body(u32 crc, u8 const *buf, size_t len)

`inline' is largely ignored by gcc, with gcc-version dependencies. 
__always_inline is more likely to succeed.  But the compiler would
inline this all by itself anyway.

> +{
> +	const u8 *p8;
> +	const u32 *p32;
> +	int init_bytes, end_bytes;
> +	size_t words;
> +	int i;
> +	u32 q;
> +	u8 i0, i1, i2, i3;
> +
> +	crc = (__force u32) __cpu_to_le32(crc);
> +
> +#if CRC_LE_BITS == 64
> +	p8 = buf;
> +	p32 = (u32 *)(((uintptr_t)p8 + 7) & ~7);

Perhaps using ALIGN() would be clearer.

> +
> +	init_bytes = (uintptr_t)p32 - (uintptr_t)p8;

Please take a look at the types of init_bytes and end_bytes. 
> ptrdiff_t, size_t and uint would all be more appropriate than `int'.

> +	if (init_bytes > len)
> +		init_bytes = len;
> +	words = (len - init_bytes) >> 3;
> +	end_bytes = (len - init_bytes) & 7;
> +#else
> +	p8 = buf;
> +	p32 = (u32 *)(((uintptr_t)p8 + 3) & ~3);

ALIGN()?

> +	init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> +	if (init_bytes > len)
> +		init_bytes = len;

min()?

> +	words = (len - init_bytes) >> 2;
> +	end_bytes = (len - init_bytes) & 3;
> +#endif
> +
> +	for (i = 0; i < init_bytes; i++) {
> +#ifdef __LITTLE_ENDIAN
> +		i0 = *p8++ ^ crc;
> +		crc = t0_le[i0] ^ (crc >> 8);
> +#else
> +		i0 = *p8++ ^ (crc >> 24);
> +		crc = t0_le[i0] ^ (crc << 8);
> +#endif
> +	}

All the ifdeffing in here bursts my brain.

Has this code been carefully tested in LE and BE?

>
> ...
>
> +#if CRC_BE_BITS == 64
> +	p8 = buf;
> +	p32 = (u32 *)(((uintptr_t)p8 + 7) & ~7);
> +
> +	init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> +	if (init_bytes > len)
> +		init_bytes = len;
> +	words = (len - init_bytes) >> 3;
> +	end_bytes = (len - init_bytes) & 7;

Various dittoes.

> +#else
> +	p8 = buf;
> +	p32 = (u32 *)(((uintptr_t)p8 + 3) & ~3);
> +
> +	init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> +	if (init_bytes > len)
> +		init_bytes = len;
> +	words = (len - init_bytes) >> 2;
> +	end_bytes = (len - init_bytes) & 3;
> +#endif
> +
> +	for (i = 0; i < init_bytes; i++) {
> +#ifdef __LITTLE_ENDIAN
> +		i0 = *p8++ ^ crc;
> +		crc = t0_be[i0] ^ (crc >> 8);
> +#else
> +		i0 = *p8++ ^ (crc >> 24);
> +		crc = t0_be[i0] ^ (crc << 8);
> +#endif
> +	}
>  


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH] add slice by 8 algorithm to crc32.c
@ 2011-07-20 22:19 frank zago
  2011-07-28 22:16 ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: frank zago @ 2011-07-20 22:19 UTC (permalink / raw)
  To: linux-kernel; +Cc: Bob Pearson, Roland Dreier, Andrew Morton

Hello,

Added support for slice by 8 to existing crc32 algorithm. Also
modified gen_crc32table.c to only produce table entries that are
actually used. The parameters CRC_LE_BITS and CRC_BE_BITS determine
the number of bits in the input array that are processed during each
step. Generally the more bits the faster the algorithm is but the
more table data required.

Using an x86_64 Opteron machine running at 2100MHz the following table
was collected with a pre-warmed cache by computing the crc 1000 times
on a buffer of 4096 bytes.

	BITS	Size	LE Cycles/byte	BE Cycles/byte
	----------------------------------------------
	1	873	41.65		34.60
	2	1097	25.43		29.61
	4	1057	13.29		15.28
	8	2913	7.13		8.19
	32	9684	2.80		2.82
	64	18178	1.53		1.53

	BITS is the value of CRC_LE_BITS or CRC_BE_BITS. The old
	default was 8 which actually selected the 32 bit algorithm. In
	this version the value 8 is used to select the standard
	8 bit algorithm and two new values: 32 and 64 are introduced
	to select the slice by 4 and slice by 8 algorithms respectively.

	Where Size is the size of crc32.o's text segment which includes
	code and table data when both LE and BE versions are set to BITS.

The current version of crc32.c by default uses the slice by 4 algorithm
which requires about 2.8 cycles per byte. The slice by 8 algorithm is
roughly 2X faster and enables packet processing at over 1GB/sec on a typical
2-3GHz system.

Signed-off-by: Bob Pearson <rpearson@systemfabricworks.com>

Index: lib/crc32.c
===================================================================
--- lib/crc32.c
+++ lib/crc32.c
@@ -1,4 +1,8 @@
 /*
+ * July 20, 2011 Bob Pearson <rpearson at systemfabricworks.com>
+ * added slice by 8 algorithm to the existing conventional and
+ * slice by 4 algorithms.
+ *
  * Oct 15, 2000 Matt Domsch <Matt_Domsch@dell.com>
  * Nicer crc32 functions/docs submitted by linux@horizon.com.  Thanks!
  * Code was from the public domain, copyright abandoned.  Code was
@@ -19,7 +23,6 @@
  * This source code is licensed under the GNU General Public License,
  * Version 2.  See the file COPYING for more details.
  */
-
 #include <linux/crc32.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
@@ -28,14 +31,17 @@
 #include <linux/init.h>
 #include <asm/atomic.h>
 #include "crc32defs.h"
-#if CRC_LE_BITS == 8
-# define tole(x) __constant_cpu_to_le32(x)
+
+#include <asm/msr.h>
+
+#if CRC_LE_BITS > 8
+# define tole(x) (__force u32) __constant_cpu_to_le32(x)
 #else
 # define tole(x) (x)
 #endif
 
-#if CRC_BE_BITS == 8
-# define tobe(x) __constant_cpu_to_be32(x)
+#if CRC_BE_BITS > 8
+# define tobe(x) (__force u32) __constant_cpu_to_be32(x)
 #else
 # define tobe(x) (x)
 #endif
@@ -45,54 +51,228 @@ MODULE_AUTHOR("Matt Domsch <Matt_Domsch@
 MODULE_DESCRIPTION("Ethernet CRC32 calculations");
 MODULE_LICENSE("GPL");
 
-#if CRC_LE_BITS == 8 || CRC_BE_BITS == 8
+#if CRC_LE_BITS > 8
+static inline u32 crc32_le_body(u32 crc, u8 const *buf, size_t len)
+{
+	const u8 *p8;
+	const u32 *p32;
+	int init_bytes, end_bytes;
+	size_t words;
+	int i;
+	u32 q;
+	u8 i0, i1, i2, i3;
+
+	crc = (__force u32) __cpu_to_le32(crc);
+
+#if CRC_LE_BITS == 64
+	p8 = buf;
+	p32 = (u32 *)(((uintptr_t)p8 + 7) & ~7);
+
+	init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
+	if (init_bytes > len)
+		init_bytes = len;
+	words = (len - init_bytes) >> 3;
+	end_bytes = (len - init_bytes) & 7;
+#else
+	p8 = buf;
+	p32 = (u32 *)(((uintptr_t)p8 + 3) & ~3);
+
+	init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
+	if (init_bytes > len)
+		init_bytes = len;
+	words = (len - init_bytes) >> 2;
+	end_bytes = (len - init_bytes) & 3;
+#endif
+
+	for (i = 0; i < init_bytes; i++) {
+#ifdef __LITTLE_ENDIAN
+		i0 = *p8++ ^ crc;
+		crc = t0_le[i0] ^ (crc >> 8);
+#else
+		i0 = *p8++ ^ (crc >> 24);
+		crc = t0_le[i0] ^ (crc << 8);
+#endif
+	}
+
+	for (i = 0; i < words; i++) {
+#ifdef __LITTLE_ENDIAN
+#  if CRC_LE_BITS == 64
+		/* slice by 8 algorithm */
+		q = *p32++ ^ crc;
+		i3 = q;
+		i2 = q >> 8;
+		i1 = q >> 16;
+		i0 = q >> 24;
+		crc = t7_le[i3] ^ t6_le[i2] ^ t5_le[i1] ^ t4_le[i0];
+
+		q = *p32++;
+		i3 = q;
+		i2 = q >> 8;
+		i1 = q >> 16;
+		i0 = q >> 24;
+		crc ^= t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
+#  else
+		/* slice by 4 algorithm */
+		q = *p32++ ^ crc;
+		i3 = q;
+		i2 = q >> 8;
+		i1 = q >> 16;
+		i0 = q >> 24;
+		crc = t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
+#  endif
+#else
+#  if CRC_LE_BITS == 64
+		q = *p32++ ^ crc;
+		i3 = q >> 24;
+		i2 = q >> 16;
+		i1 = q >> 8;
+		i0 = q;
+		crc = t7_le[i3] ^ t6_le[i2] ^ t5_le[i1] ^ t4_le[i0];
+
+		q = *p32++;
+		i3 = q >> 24;
+		i2 = q >> 16;
+		i1 = q >> 8;
+		i0 = q;
+		crc ^= t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
+#  else
+		q = *p32++ ^ crc;
+		i3 = q >> 24;
+		i2 = q >> 16;
+		i1 = q >> 8;
+		i0 = q;
+		crc = t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
+#  endif
+#endif
+	}
+
+	p8 = (u8 *)p32;
+
+	for (i = 0; i < end_bytes; i++) {
+#ifdef __LITTLE_ENDIAN
+		i0 = *p8++ ^ crc;
+		crc = t0_le[i0] ^ (crc >> 8);
+#else
+		i0 = *p8++ ^ (crc >> 24);
+		crc = t0_le[i0] ^ (crc << 8);
+#endif
+	}
+
+	return __le32_to_cpu((__force __le32)crc);
+}
+#endif
 
-static inline u32
-crc32_body(u32 crc, unsigned char const *buf, size_t len, const u32 (*tab)[256])
+#if CRC_BE_BITS > 8
+static inline u32 crc32_be_body(u32 crc, u8 const *buf, size_t len)
 {
-# ifdef __LITTLE_ENDIAN
-#  define DO_CRC(x) crc = tab[0][(crc ^ (x)) & 255] ^ (crc >> 8)
-#  define DO_CRC4 crc = tab[3][(crc) & 255] ^ \
-		tab[2][(crc >> 8) & 255] ^ \
-		tab[1][(crc >> 16) & 255] ^ \
-		tab[0][(crc >> 24) & 255]
-# else
-#  define DO_CRC(x) crc = tab[0][((crc >> 24) ^ (x)) & 255] ^ (crc << 8)
-#  define DO_CRC4 crc = tab[0][(crc) & 255] ^ \
-		tab[1][(crc >> 8) & 255] ^ \
-		tab[2][(crc >> 16) & 255] ^ \
-		tab[3][(crc >> 24) & 255]
-# endif
-	const u32 *b;
-	size_t    rem_len;
+	const u8 *p8;
+	const u32 *p32;
+	int init_bytes, end_bytes;
+	size_t words;
+	int i;
+	u32 q;
+	u8 i0, i1, i2, i3;
+
+	crc = (__force u32) __cpu_to_be32(crc);
+
+#if CRC_BE_BITS == 64
+	p8 = buf;
+	p32 = (u32 *)(((uintptr_t)p8 + 7) & ~7);
+
+	init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
+	if (init_bytes > len)
+		init_bytes = len;
+	words = (len - init_bytes) >> 3;
+	end_bytes = (len - init_bytes) & 7;
+#else
+	p8 = buf;
+	p32 = (u32 *)(((uintptr_t)p8 + 3) & ~3);
+
+	init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
+	if (init_bytes > len)
+		init_bytes = len;
+	words = (len - init_bytes) >> 2;
+	end_bytes = (len - init_bytes) & 3;
+#endif
+
+	for (i = 0; i < init_bytes; i++) {
+#ifdef __LITTLE_ENDIAN
+		i0 = *p8++ ^ crc;
+		crc = t0_be[i0] ^ (crc >> 8);
+#else
+		i0 = *p8++ ^ (crc >> 24);
+		crc = t0_be[i0] ^ (crc << 8);
+#endif
+	}
 
-	/* Align it */
-	if (unlikely((long)buf & 3 && len)) {
-		do {
-			DO_CRC(*buf++);
-		} while ((--len) && ((long)buf)&3);
-	}
-	rem_len = len & 3;
-	/* load data 32 bits wide, xor data 32 bits wide. */
-	len = len >> 2;
-	b = (const u32 *)buf;
-	for (--b; len; --len) {
-		crc ^= *++b; /* use pre increment for speed */
-		DO_CRC4;
-	}
-	len = rem_len;
-	/* And the last few bytes */
-	if (len) {
-		u8 *p = (u8 *)(b + 1) - 1;
-		do {
-			DO_CRC(*++p); /* use pre increment for speed */
-		} while (--len);
+	for (i = 0; i < words; i++) {
+#ifdef __LITTLE_ENDIAN
+#  if CRC_BE_BITS == 64
+		/* slice by 8 algorithm */
+		q = *p32++ ^ crc;
+		i3 = q;
+		i2 = q >> 8;
+		i1 = q >> 16;
+		i0 = q >> 24;
+		crc = t7_be[i3] ^ t6_be[i2] ^ t5_be[i1] ^ t4_be[i0];
+
+		q = *p32++;
+		i3 = q;
+		i2 = q >> 8;
+		i1 = q >> 16;
+		i0 = q >> 24;
+		crc ^= t3_be[i3] ^ t2_be[i2] ^ t1_be[i1] ^ t0_be[i0];
+#  else
+		/* slice by 4 algorithm */
+		q = *p32++ ^ crc;
+		i3 = q;
+		i2 = q >> 8;
+		i1 = q >> 16;
+		i0 = q >> 24;
+		crc = t3_be[i3] ^ t2_be[i2] ^ t1_be[i1] ^ t0_be[i0];
+#  endif
+#else
+#  if CRC_BE_BITS == 64
+		q = *p32++ ^ crc;
+		i3 = q >> 24;
+		i2 = q >> 16;
+		i1 = q >> 8;
+		i0 = q;
+		crc = t7_be[i3] ^ t6_be[i2] ^ t5_be[i1] ^ t4_be[i0];
+
+		q = *p32++;
+		i3 = q >> 24;
+		i2 = q >> 16;
+		i1 = q >> 8;
+		i0 = q;
+		crc ^= t3_be[i3] ^ t2_be[i2] ^ t1_be[i1] ^ t0_be[i0];
+#  else
+		q = *p32++ ^ crc;
+		i3 = q >> 24;
+		i2 = q >> 16;
+		i1 = q >> 8;
+		i0 = q;
+		crc = t3_be[i3] ^ t2_be[i2] ^ t1_be[i1] ^ t0_be[i0];
+#  endif
+#endif
 	}
-	return crc;
-#undef DO_CRC
-#undef DO_CRC4
+
+	p8 = (u8 *)p32;
+
+	for (i = 0; i < end_bytes; i++) {
+#ifdef __LITTLE_ENDIAN
+		i0 = *p8++ ^ crc;
+		crc = t0_be[i0] ^ (crc >> 8);
+#else
+		i0 = *p8++ ^ (crc >> 24);
+		crc = t0_be[i0] ^ (crc << 8);
+#endif
+	}
+
+	return __be32_to_cpu((__force __be32)crc);
 }
 #endif
+
 /**
  * crc32_le() - Calculate bitwise little-endian Ethernet AUTODIN II CRC32
  * @crc: seed value for computation.  ~0 for Ethernet, sometimes 0 for
@@ -100,53 +280,40 @@ crc32_body(u32 crc, unsigned char const 
  * @p: pointer to buffer over which CRC is run
  * @len: length of buffer @p
  */
-u32 __pure crc32_le(u32 crc, unsigned char const *p, size_t len);
-
-#if CRC_LE_BITS == 1
-/*
- * In fact, the table-based code will work in this case, but it can be
- * simplified by inlining the table in ?: form.
- */
-
 u32 __pure crc32_le(u32 crc, unsigned char const *p, size_t len)
 {
+#if CRC_LE_BITS == 1
 	int i;
 	while (len--) {
 		crc ^= *p++;
 		for (i = 0; i < 8; i++)
 			crc = (crc >> 1) ^ ((crc & 1) ? CRCPOLY_LE : 0);
 	}
-	return crc;
-}
-#else				/* Table-based approach */
-
-u32 __pure crc32_le(u32 crc, unsigned char const *p, size_t len)
-{
-# if CRC_LE_BITS == 8
-	const u32      (*tab)[] = crc32table_le;
-
-	crc = __cpu_to_le32(crc);
-	crc = crc32_body(crc, p, len, tab);
-	return __le32_to_cpu(crc);
+# elif CRC_LE_BITS == 2
+	while (len--) {
+		crc ^= *p++;
+		crc = (crc >> 2) ^ t0_le[crc & 0x03];
+		crc = (crc >> 2) ^ t0_le[crc & 0x03];
+		crc = (crc >> 2) ^ t0_le[crc & 0x03];
+		crc = (crc >> 2) ^ t0_le[crc & 0x03];
+	}
 # elif CRC_LE_BITS == 4
 	while (len--) {
 		crc ^= *p++;
-		crc = (crc >> 4) ^ crc32table_le[crc & 15];
-		crc = (crc >> 4) ^ crc32table_le[crc & 15];
+		crc = (crc >> 4) ^ t0_le[crc & 0x0f];
+		crc = (crc >> 4) ^ t0_le[crc & 0x0f];
 	}
-	return crc;
-# elif CRC_LE_BITS == 2
+# elif CRC_LE_BITS == 8
 	while (len--) {
 		crc ^= *p++;
-		crc = (crc >> 2) ^ crc32table_le[crc & 3];
-		crc = (crc >> 2) ^ crc32table_le[crc & 3];
-		crc = (crc >> 2) ^ crc32table_le[crc & 3];
-		crc = (crc >> 2) ^ crc32table_le[crc & 3];
+		crc = (crc >> 8) ^ t0_le[crc & 0xff];
 	}
-	return crc;
+# else
+	crc = crc32_le_body(crc, p, len);
 # endif
+	return crc;
 }
-#endif
+EXPORT_SYMBOL(crc32_le);
 
 /**
  * crc32_be() - Calculate bitwise big-endian Ethernet AUTODIN II CRC32
@@ -155,57 +322,40 @@ u32 __pure crc32_le(u32 crc, unsigned ch
  * @p: pointer to buffer over which CRC is run
  * @len: length of buffer @p
  */
-u32 __pure crc32_be(u32 crc, unsigned char const *p, size_t len);
-
-#if CRC_BE_BITS == 1
-/*
- * In fact, the table-based code will work in this case, but it can be
- * simplified by inlining the table in ?: form.
- */
-
 u32 __pure crc32_be(u32 crc, unsigned char const *p, size_t len)
 {
+#if CRC_BE_BITS == 1
 	int i;
 	while (len--) {
 		crc ^= *p++ << 24;
 		for (i = 0; i < 8; i++)
-			crc =
-			    (crc << 1) ^ ((crc & 0x80000000) ? CRCPOLY_BE :
-					  0);
+			crc = (crc << 1) ^
+			      ((crc & 0x80000000) ? CRCPOLY_BE : 0);
+	}
+# elif CRC_BE_BITS == 2
+	while (len--) {
+		crc ^= *p++ << 24;
+		crc = (crc << 2) ^ t0_be[crc >> 30];
+		crc = (crc << 2) ^ t0_be[crc >> 30];
+		crc = (crc << 2) ^ t0_be[crc >> 30];
+		crc = (crc << 2) ^ t0_be[crc >> 30];
 	}
-	return crc;
-}
-
-#else				/* Table-based approach */
-u32 __pure crc32_be(u32 crc, unsigned char const *p, size_t len)
-{
-# if CRC_BE_BITS == 8
-	const u32      (*tab)[] = crc32table_be;
-
-	crc = __cpu_to_be32(crc);
-	crc = crc32_body(crc, p, len, tab);
-	return __be32_to_cpu(crc);
 # elif CRC_BE_BITS == 4
 	while (len--) {
 		crc ^= *p++ << 24;
-		crc = (crc << 4) ^ crc32table_be[crc >> 28];
-		crc = (crc << 4) ^ crc32table_be[crc >> 28];
+		crc = (crc << 4) ^ t0_be[crc >> 28];
+		crc = (crc << 4) ^ t0_be[crc >> 28];
 	}
-	return crc;
-# elif CRC_BE_BITS == 2
+# elif CRC_BE_BITS == 8
 	while (len--) {
 		crc ^= *p++ << 24;
-		crc = (crc << 2) ^ crc32table_be[crc >> 30];
-		crc = (crc << 2) ^ crc32table_be[crc >> 30];
-		crc = (crc << 2) ^ crc32table_be[crc >> 30];
-		crc = (crc << 2) ^ crc32table_be[crc >> 30];
+		crc = (crc << 8) ^ t0_be[crc >> 24];
 	}
-	return crc;
+# else
+	crc = crc32_be_body(crc, p, len);
 # endif
+	return crc;
 }
-#endif
-
-EXPORT_SYMBOL(crc32_le);
 EXPORT_SYMBOL(crc32_be);
 
 /*
Index: lib/crc32defs.h
===================================================================
--- lib/crc32defs.h
+++ lib/crc32defs.h
@@ -6,27 +6,29 @@
 #define CRCPOLY_LE 0xedb88320
 #define CRCPOLY_BE 0x04c11db7
 
-/* How many bits at a time to use.  Requires a table of 4<<CRC_xx_BITS bytes. */
+/* How many bits at a time to use.  Valid values are 1, 2, 4, 8, 32 and 64. */
 /* For less performance-sensitive, use 4 */
 #ifndef CRC_LE_BITS 
-# define CRC_LE_BITS 8
+# define CRC_LE_BITS 64
 #endif
 #ifndef CRC_BE_BITS
-# define CRC_BE_BITS 8
+# define CRC_BE_BITS 64
 #endif
 
 /*
  * Little-endian CRC computation.  Used with serial bit streams sent
  * lsbit-first.  Be sure to use cpu_to_le32() to append the computed CRC.
  */
-#if CRC_LE_BITS > 8 || CRC_LE_BITS < 1 || CRC_LE_BITS & CRC_LE_BITS-1
-# error CRC_LE_BITS must be a power of 2 between 1 and 8
+#if CRC_LE_BITS > 64 || CRC_LE_BITS < 1 || CRC_LE_BITS == 16 || \
+	CRC_LE_BITS & CRC_LE_BITS-1
+# error "CRC_LE_BITS must be one of {1, 2, 4, 8, 32, 64}"
 #endif
 
 /*
  * Big-endian CRC computation.  Used with serial bit streams sent
  * msbit-first.  Be sure to use cpu_to_be32() to append the computed CRC.
  */
-#if CRC_BE_BITS > 8 || CRC_BE_BITS < 1 || CRC_BE_BITS & CRC_BE_BITS-1
-# error CRC_BE_BITS must be a power of 2 between 1 and 8
+#if CRC_BE_BITS > 64 || CRC_BE_BITS < 1 || CRC_BE_BITS == 16 || \
+	CRC_BE_BITS & CRC_BE_BITS-1
+# error "CRC_BE_BITS must be one of {1, 2, 4, 8, 32, 64}"
 #endif
Index: lib/gen_crc32table.c
===================================================================
--- lib/gen_crc32table.c
+++ lib/gen_crc32table.c
@@ -4,11 +4,20 @@
 
 #define ENTRIES_PER_LINE 4
 
+#if CRC_LE_BITS <= 8
 #define LE_TABLE_SIZE (1 << CRC_LE_BITS)
+#else
+#define LE_TABLE_SIZE 256
+#endif
+
+#if CRC_BE_BITS <= 8
 #define BE_TABLE_SIZE (1 << CRC_BE_BITS)
+#else
+#define BE_TABLE_SIZE 256
+#endif
 
-static uint32_t crc32table_le[4][LE_TABLE_SIZE];
-static uint32_t crc32table_be[4][BE_TABLE_SIZE];
+static uint32_t crc32table_le[8][256];
+static uint32_t crc32table_be[8][256];
 
 /**
  * crc32init_le() - allocate and initialize LE table data
@@ -24,14 +33,14 @@ static void crc32init_le(void)
 
 	crc32table_le[0][0] = 0;
 
-	for (i = 1 << (CRC_LE_BITS - 1); i; i >>= 1) {
+	for (i = LE_TABLE_SIZE >> 1; i; i >>= 1) {
 		crc = (crc >> 1) ^ ((crc & 1) ? CRCPOLY_LE : 0);
 		for (j = 0; j < LE_TABLE_SIZE; j += 2 * i)
 			crc32table_le[0][i + j] = crc ^ crc32table_le[0][j];
 	}
 	for (i = 0; i < LE_TABLE_SIZE; i++) {
 		crc = crc32table_le[0][i];
-		for (j = 1; j < 4; j++) {
+		for (j = 1; j < 8; j++) {
 			crc = crc32table_le[0][crc & 0xff] ^ (crc >> 8);
 			crc32table_le[j][i] = crc;
 		}
@@ -55,44 +64,58 @@ static void crc32init_be(void)
 	}
 	for (i = 0; i < BE_TABLE_SIZE; i++) {
 		crc = crc32table_be[0][i];
-		for (j = 1; j < 4; j++) {
+		for (j = 1; j < 8; j++) {
 			crc = crc32table_be[0][(crc >> 24) & 0xff] ^ (crc << 8);
 			crc32table_be[j][i] = crc;
 		}
 	}
 }
 
-static void output_table(uint32_t table[4][256], int len, char *trans)
+static void output_table(uint32_t table[8][256], int len, char trans)
 {
 	int i, j;
 
-	for (j = 0 ; j < 4; j++) {
-		printf("{");
+	for (j = 0 ; j < 8; j++) {
+		printf("static const u32 t%d_%ce[] = {", j, trans);
 		for (i = 0; i < len - 1; i++) {
-			if (i % ENTRIES_PER_LINE == 0)
+			if ((i % ENTRIES_PER_LINE) == 0)
 				printf("\n");
-			printf("%s(0x%8.8xL), ", trans, table[j][i]);
+			printf("to%ce(0x%8.8xL),", trans, table[j][i]);
+			if ((i % ENTRIES_PER_LINE) != (ENTRIES_PER_LINE - 1))
+				printf(" ");
+		}
+		printf("to%ce(0x%8.8xL)};\n\n", trans, table[j][len - 1]);
+
+		if (trans == 'l') {
+			if ((j+1)*8 >= CRC_LE_BITS)
+				break;
+		} else {
+			if ((j+1)*8 >= CRC_BE_BITS)
+				break;
 		}
-		printf("%s(0x%8.8xL)},\n", trans, table[j][len - 1]);
 	}
 }
 
 int main(int argc, char** argv)
 {
-	printf("/* this file is generated - do not edit */\n\n");
+	printf("/*\n");
+	printf(" * crc32table.h - CRC32 tables\n");
+	printf(" *    this file is generated - do not edit\n");
+	printf(" *	# gen_crc32table > crc32table.h\n");
+	printf(" *    with\n");
+	printf(" *	CRC_LE_BITS = %d\n", CRC_LE_BITS);
+	printf(" *	CRC_BE_BITS = %d\n", CRC_BE_BITS);
+	printf(" */\n");
+	printf("\n");
 
 	if (CRC_LE_BITS > 1) {
 		crc32init_le();
-		printf("static const u32 crc32table_le[4][256] = {");
-		output_table(crc32table_le, LE_TABLE_SIZE, "tole");
-		printf("};\n");
+		output_table(crc32table_le, LE_TABLE_SIZE, 'l');
 	}
 
 	if (CRC_BE_BITS > 1) {
 		crc32init_be();
-		printf("static const u32 crc32table_be[4][256] = {");
-		output_table(crc32table_be, BE_TABLE_SIZE, "tobe");
-		printf("};\n");
+		output_table(crc32table_be, BE_TABLE_SIZE, 'b');
 	}
 
 	return 0;
 }

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2011-08-08 22:21 UTC | newest]

Thread overview: 27+ messages
-- links below jump to the message on this page --
2011-08-08  9:28 [PATCH] add slice by 8 algorithm to crc32.c George Spelvin
2011-08-08 10:31 ` Joakim Tjernlund
2011-08-08 10:52   ` George Spelvin
2011-08-08 11:11     ` Joakim Tjernlund
2011-08-08 17:04       ` Bob Pearson
     [not found]     ` <OFEA1BD2B2.B2A7F07F-ONC12578E6.003D368C-C12578E6.003D7468@LocalDomain>
2011-08-08 11:24       ` Joakim Tjernlund
2011-08-08 11:42     ` Joakim Tjernlund
2011-08-08 12:54       ` George Spelvin
2011-08-08 17:01         ` Bob Pearson
2011-08-08 20:45           ` George Spelvin
2011-08-08 22:21             ` Bob Pearson
2011-08-08 16:54     ` Bob Pearson
2011-08-08 16:50 ` Bob Pearson
     [not found] <OF4AE0115F.3AA5397E-ONC12578DF.002EC6DF-C12578DF.003348E5@transmode.se>
2011-08-02 21:14 ` Bob Pearson
2011-08-02 21:19   ` Bob Pearson
2011-08-04 11:54   ` Joakim Tjernlund
2011-08-04 18:53     ` Bob Pearson
2011-08-05  9:22       ` Joakim Tjernlund
2011-08-05 15:51         ` Bob Pearson
2011-08-08  7:11           ` Joakim Tjernlund
2011-08-05 17:27         ` Bob Pearson
2011-08-08  7:15           ` Joakim Tjernlund
     [not found]       ` <OF14136E0E.3F2388EF-ONC12578E3.00301969-C12578E3.00338524@LocalDomain>
2011-08-05 13:34         ` Joakim Tjernlund
  -- strict thread matches above, loose matches on Subject: below --
2011-07-20 22:19 frank zago
2011-07-28 22:16 ` Andrew Morton
2011-07-29  1:47   ` Bob Pearson
2011-08-01 19:39     ` Andrew Morton
