From: Joakim Tjernlund <joakim.tjernlund@transmode.se>
To: "Bob Pearson"
Cc: "'Andrew Morton'", "'frank zago'", linux-kernel@vger.kernel.org
Subject: RE: [PATCH] add slice by 8 algorithm to crc32.c
Date: Thu, 4 Aug 2011 13:54:13 +0200
In-Reply-To: <01dc01cc5159$317879a0$94696ce0$@systemfabricworks.com>

"Bob Pearson" wrote on 2011/08/02 23:14:39:
>
> Hi Joakim,
>
> Sorry to take so long to respond.

No problem, but please insert your answers in the correct context (like I
did). That makes it much easier to read and comment on.

>
> Here are some performance data collected from the original and modified
> crc32 algorithms.
> The following is a simple test loop that computes the time to compute
> 1000 crc's over 4096 bytes of data aligned on an 8 byte boundary after
> warming the cache. You could make other measurements but this is sort
> of a best case.
>
> These measurements were made on a dual socket Nehalem 2.267 GHz system.

Measurements on your SPARC would be good too.

>
> #ifdef CONFIG_CRC32_PERFTEST
> #include <linux/time.h>
>
> static u8 __attribute__((__aligned__(8))) test_buf[4096];
>
> /* need to convince gcc to not optimize out all the crc calls as dead
>  * code */
> u32 crc;
>
> static int __init crc32_init(void)
> {
> 	int i;
> 	struct timespec start, stop;
> 	u64 nsec_le;
> 	u64 nsec_be;
>
> 	for (i = 0; i < 4096; i++)
> 		test_buf[i] = i;
>
> 	crc = crc32_le(0, test_buf, 4096);
> 	crc = crc32_le(crc, test_buf, 4096);
>
> 	getnstimeofday(&start);
> 	for (i = 0; i < 1000; i++)
> 		crc = crc32_le(crc, test_buf, 4096);
> 	getnstimeofday(&stop);
>
> 	nsec_le = stop.tv_nsec - start.tv_nsec +
> 		1000000000 * (stop.tv_sec - start.tv_sec);
>
> 	crc = crc32_be(0, test_buf, 4096);
> 	crc = crc32_be(crc, test_buf, 4096);
>
> 	getnstimeofday(&start);
> 	for (i = 0; i < 1000; i++)
> 		crc = crc32_be(crc, test_buf, 4096);
> 	getnstimeofday(&stop);
>
> 	nsec_be = stop.tv_nsec - start.tv_nsec +
> 		1000000000 * (stop.tv_sec - start.tv_sec);
>
> 	pr_info("crc32: nsec_le = %lld nsec_be = %lld\n", nsec_le, nsec_be);
>
> 	return 0;
> }
>
> module_init(crc32_init);
> #endif /* CONFIG_CRC32_PERFTEST */
>
> Here are the timings.
>
>                     CRC_?E_BITS   crc_le (nsec)   crc_be (nsec)
> orig crc32.c             8           5842712         5829793
>
> new crc32.c             64           2877530         2862515
> new crc32.c             32           5174400         5105816   (equiv. to orig BITS = 8)

I really can't wrap my head around why your 32 bit version is faster than
the current one, especially as you say the 1D array is only a few % faster
than the 2D array, and the current impl. already uses a 2D array. Can you
identify what makes your version faster?

> Modify all 'i' loops from
>     for (i = 0; i < foo; i++) { ... }
> to
>     for (i = foo - 1; i >= 0; i--) { ... }

That should be:
    for (i = foo; i; --i) { ... }
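Applied to the init_bytes loop of your patch (quoted further down), the
shape I mean would be something like this. A sketch only, untested; note
that 'i' is then just a counter, so nothing in the body may index with it:

	/* sketch: counting down to zero lets the loop condition reuse
	 * the flags from the decrement instead of comparing against foo
	 */
	for (i = init_bytes; i; --i) {
		i0 = *p8++ ^ crc;
		crc = t0_le[i0] ^ (crc >> 8);
	}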
> + count down           64           2837860         3230438
>                                      2865727         3264033   (repeat)
> + count down           32           5177334         5070337
>       Results seem ambiguous on Nehalem. Slight improvement in some
>       cases, 12% worse in one case.
>
> Modify main loop to use pre-increment instead of post-increment.
> + pre-increment        64           2872502         2767601
> + pre-increment        32           5100579         5090453
>       Small scale improvements for each case. Will add to next drop.
>
> Do both count down and pre-increment.
> + ct down + pre-inc    64           3329125         3367815
> + ct down + pre-inc    32           5065308         5089825
>       Got significantly worse for 64 bit, slightly better for 32.
>
> Separately, I looked at the difference between 1D and 2D arrays, and 1D
> arrays are a few % faster. The difference is that the loader computes
> the base address at compile time; in the 2D case you have to add an
> offset to a parameter that was passed in to the body routine, which is
> a runtime addition.

I imagine the difference is even bigger on RISC archs like PowerPC and
SPARC, as these may need 2 insns to load an address. A relocatable kernel
would be even worse, I think.

> The first comment below relates to the behavior of gen_crc32table.c,
> which always printed out 256x4 byte table entries at a time. The
> CRC_?E_BITS = 2 and 4 cases only read the first 4 and 16 table entries
> respectively, so if you are trying to make a really small image you can
> save most of 1KB by shortening the table.

Surely this could be fixed without moving to 1D arrays?

> I think I just caused more confusion below. The point is that the two
> routines crc32_le and crc32_be take a byte string as argument, which is
> invariant under big/little endian byte order, and produce a u32 which
> is just an integer. All the byte swapping in the code is implementation
> detail and has no effect on the result. The caller should just do what
> you normally do with an integer when you use it in a network message,
> e.g. call htonl or equivalent before putting it in a packet.
>
> Last point below. The existing code fails sparse with __CHECK_ENDIAN__.
> (There are several other sparse fails in lib as well, BTW.) I cleaned
> up these warnings by forcing everything to u32.

What about unrolling the 32 bit variant instead of doing 64 bit? The 64
bit variant looks like an unrolled 32 bit variant:

+			q = *++p32 ^ crc;
+			i3 = q;
+			i2 = q >> 8;
+			i1 = q >> 16;
+			i0 = q >> 24;
+			crc = t3_be[i3] ^ t2_be[i2] ^ t1_be[i1] ^ t0_be[i0];
+			q = *++p32 ^ crc;
+			i3 = q;
+			i2 = q >> 8;
+			i1 = q >> 16;
+			i0 = q >> 24;
+			crc = t3_be[i3] ^ t2_be[i2] ^ t1_be[i1] ^ t0_be[i0];
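(For comparison: in the _le hunk quoted further down, the second word is
loaded without folding in crc, so per 8 bytes there is only one dependent
fold. A condensed sketch of that step, using the patch's table names;
q0/q1 are my names, the patch reuses a single q:)

	u32 q0, q1;
	/* sketch of the slice-by-8 step: only q0 depends on the previous
	 * crc; the q1 lookups are independent of it, so the two halves
	 * of the xor tree can execute in parallel
	 */
	q0 = *p32++ ^ crc;
	q1 = *p32++;
	crc = t7_le[q0 & 255] ^ t6_le[(q0 >> 8) & 255] ^
	      t5_le[(q0 >> 16) & 255] ^ t4_le[q0 >> 24] ^
	      t3_le[q1 & 255] ^ t2_le[(q1 >> 8) & 255] ^
	      t1_le[(q1 >> 16) & 255] ^ t0_le[q1 >> 24];

A plain unroll of the 32 bit variant would fold crc into both words,
keeping the full dependency chain, so the two are not quite equivalent.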
I have a really hard time accepting the code duplication your patch
introduces. Can you not find a way to add whatever changes you want into
the current impl.?

Finally, your last patch does too much. Changing the unit test code should
be one patch, restructuring the code another, and 64 bit support a
separate patch too.

 Jocke

> I will include Andrew's comments with the above and send out another
> patch soon.
>
> Regards,
>
> Bob Pearson
>
> > -----Original Message-----
> > From: Joakim Tjernlund [mailto:joakim.tjernlund@transmode.se]
> > Sent: Monday, August 01, 2011 4:20 AM
> > To: frank zago; Andrew Morton; Bob Pearson
> > Subject: Re: [PATCH] add slice by 8 algorithm to crc32.c
> >
> > Hi Bob, Frank and Andrew,
> >
> > I just got back from vacation and noticed this patch in the kernel
> > mail archive. I have pasted some bits from there and commented on
> > them below.
> >
> > > Hello,
> > >
> > > Added support for slice by 8 to the existing crc32 algorithm. Also
> > > modified gen_crc32table.c to only produce table entries that are
> > > actually used.
> >
> > Hmm, I don't get this. What entries are unused?
> >
> > > The parameters CRC_LE_BITS and CRC_BE_BITS determine the number of
> > > bits in the input array that are processed during each step.
> > > Generally the more bits the faster the algorithm is, but the more
> > > table data is required.
> > >
> > > Using an x86_64 Opteron machine running at 2100MHz, the following
> > > table was collected with a pre-warmed cache by computing the crc
> > > 1000 times on a buffer of 4096 bytes.
> > >
> > >	BITS	Size	LE Cycles/byte	BE Cycles/byte
> > >	----------------------------------------------
> > >	 1	  873	    41.65	    34.60
> > >	 2	 1097	    25.43	    29.61
> > >	 4	 1057	    13.29	    15.28
> > >	 8	 2913	     7.13	     8.19
> > >	32	 9684	     2.80	     2.82
> > >	64	18178	     1.53	     1.53
> > >
> > > BITS is the value of CRC_LE_BITS or CRC_BE_BITS. The old default
> > > was 8, which actually selected the 32 bit algorithm. In this
> > > version the value 8 selects the standard 8 bit algorithm, and two
> > > new values, 32 and 64, are introduced to select the slice by 4 and
> > > slice by 8 algorithms respectively.
> > >
> > > Size is the size of crc32.o's text segment, which includes code and
> > > table data, when both the LE and BE versions are set to BITS.
> >
> > I miss the numbers from the current 32 bit impl. How does this new
> > impl. for 32 bits compare to the current one?
> >
> > > The current version of crc32.c by default uses the slice by 4
> > > algorithm, which requires about 2.8 cycles per byte. The slice by 8
> > > algorithm is roughly 2X faster and enables packet processing at
> > > over 1GB/sec on a typical 2-3GHz system.
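(As a cross-check, these cycles/byte figures line up with the nsec numbers
measured at the top of this mail; the conversion, as a sketch with the
clock rate as an assumed input:)

	/* sketch: converting a measured nsec figure into cycles/byte.
	 * cpu_hz is assumed (2267000000ULL on the Nehalem box above);
	 * the test loop touches 1000 * 4096 bytes.
	 */
	u64 bytes    = 1000ULL * 4096;
	u64 cycles   = nsec_le * cpu_hz / 1000000000ULL;
	u64 cpb_x100 = 100 * cycles / bytes;	/* cycles/byte, scaled by 100 */
	/* e.g. 2877530 ns at 2.267 GHz ~= 6.52e6 cycles over 4.10e6 bytes,
	 * i.e. ~1.6 cycles/byte, in the same range as the 1.53 here.
	 */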
> > > --- lib/crc32.c
> > > +++ lib/crc32.c
> > >
> > > ...
> > >
> > > @@ -28,14 +31,17 @@
> > >  #include
> > >  #include
> > >  #include "crc32defs.h"
> > > -#if CRC_LE_BITS == 8
> > > -# define tole(x) __constant_cpu_to_le32(x)
> > > +
> > > +#include
> > > +#if CRC_LE_BITS > 8
> > > +# define tole(x) (__force u32) __constant_cpu_to_le32(x)
> > >  #else
> > >  # define tole(x) (x)
> > >  #endif
> > >
> > > -#if CRC_BE_BITS == 8
> > > -# define tobe(x) __constant_cpu_to_be32(x)
> > > +#if CRC_BE_BITS > 8
> > > +# define tobe(x) (__force u32) __constant_cpu_to_be32(x)
> > >  #else
> > >  # define tobe(x) (x)
> > >  #endif
> > > @@ -45,54 +51,228 @@ MODULE_AUTHOR("Matt Domsch
> > >  MODULE_DESCRIPTION("Ethernet CRC32 calculations");
> > >  MODULE_LICENSE("GPL");
> > >
> > > -#if CRC_LE_BITS == 8 || CRC_BE_BITS == 8
> > > +#if CRC_LE_BITS > 8
> > > +static inline u32 crc32_le_body(u32 crc, u8 const *buf, size_t len)
> > > +{
> > > +	const u8 *p8;
> > > +	const u32 *p32;
> > > +	int init_bytes, end_bytes;
> > > +	size_t words;
> > > +	int i;
> > > +	u32 q;
> > > +	u8 i0, i1, i2, i3;
> > > +
> > > +	crc = (__force u32) __cpu_to_le32(crc);
> > > +
> > > +#if CRC_LE_BITS == 64
> > > +	p8 = buf;
> > > +	p32 = (u32 *)(((uintptr_t)p8 + 7) & ~7);
> > > +
> > > +	init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> > > +	if (init_bytes > len)
> > > +		init_bytes = len;
> > > +	words = (len - init_bytes) >> 3;
> > > +	end_bytes = (len - init_bytes) & 7;
> > > +#else
> > > +	p8 = buf;
> > > +	p32 = (u32 *)(((uintptr_t)p8 + 3) & ~3);
> > > +	init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> > > +	if (init_bytes > len)
> > > +		init_bytes = len;
> > > +	words = (len - init_bytes) >> 2;
> > > +	end_bytes = (len - init_bytes) & 3;
> > > +#endif
> >
> > The difference in the above code is just two constants. It should be
> > possible to avoid the code duplication.
> >
> > > +
> > > +	for (i = 0; i < init_bytes; i++) {
> >
> > All for (...) loops should be changed to:
> > 	for (i = init_bytes; i; --i)
> > This is easier for gcc to optimize. Always compare to zero when
> > possible.
> >
> > > +#ifdef __LITTLE_ENDIAN
> > > +		i0 = *p8++ ^ crc;
> > > +		crc = t0_le[i0] ^ (crc >> 8);
> > > +#else
> > > +		i0 = *p8++ ^ (crc >> 24);
> > > +		crc = t0_le[i0] ^ (crc << 8);
> > > +#endif
> > > +	}
> >
> > Better to hide this in a macro, as the current code does.
> >
> > ...
> >
> > > +	for (i = 0; i < words; i++) {
> > > +#ifdef __LITTLE_ENDIAN
> > > +# if CRC_LE_BITS == 64
> > > +		/* slice by 8 algorithm */
> > > +		q = *p32++ ^ crc;
> > > +		i3 = q;
> > > +		i2 = q >> 8;
> > > +		i1 = q >> 16;
> > > +		i0 = q >> 24;
> > > +		crc = t7_le[i3] ^ t6_le[i2] ^ t5_le[i1] ^ t4_le[i0];
> > > +
> > > +		q = *p32++;
> > > +		i3 = q;
> > > +		i2 = q >> 8;
> > > +		i1 = q >> 16;
> > > +		i0 = q >> 24;
> > > +		crc ^= t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
> >
> > I am not convinced that converting the table matrix to several arrays
> > is faster. Now you have to load 8 addresses and then index them,
> > instead of just one address.
> >
> > Also, this looks more like unrolling the 32 bit variant. Are you sure
> > that you won't get just as good results by just unrolling the 32 bit
> > variant?
> >
> > You also lost the pre-increment, which was faster on RISC archs like
> > ppc (sparc too?) last time I tested crc32 speed.
> >
> > > +# else
> > > +		/* slice by 4 algorithm */
> > > +		q = *p32++ ^ crc;
> > > +		i3 = q;
> > > +		i2 = q >> 8;
> > > +		i1 = q >> 16;
> > > +		i0 = q >> 24;
> > > +		crc = t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
> > > +# endif
> >
> > ...
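(Re the "just two constants" comment above, something like the sketch
below is what I would expect instead of the duplicated #if branches.
Untested; STRIDE is a name I made up, 8 when CRC_LE_BITS == 64 and 4
otherwise:)

	/* sketch: one prologue for both slice widths; with STRIDE a
	 * power of two, gcc turns the / and % into a shift and a mask
	 */
	p8 = buf;
	p32 = (u32 *)(((uintptr_t)p8 + STRIDE - 1) & ~(uintptr_t)(STRIDE - 1));
	init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
	if (init_bytes > len)
		init_bytes = len;
	words = (len - init_bytes) / STRIDE;
	end_bytes = (len - init_bytes) % STRIDE;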
> > > General comment. The use of le/be in this and the previous version
> > > of crc32.c is confusing, because it does not refer to little/big
> > > endian cpu architecture but rather to the arbitrary choice made by
> > > different protocols as to whether the bits in a byte are presented
> > > to the CRC in little/big *BIT* order. Some protocols (e.g.
> > > Ethernet, InfiniBand) process the bits starting from the LSB and
> > > some (e.g. ATM) process the bits in MSB order. These routines
> > > (crc32_le and crc32_be) compute the CRC as a u32 in cpu byte order.
> > > The caller has to then get the CRC into the correct order for
> > > inclusion into the message. As a result there are four versions of
> > > the calculation: BE bit order on BE byte order cpu, BE bit order on
> > > LE byte order cpu, etc.
> >
> > No, the caller does not have to get the CRC into the correct order;
> > this is done by the crc32 code.
> >
> > What would be better? I don't see what to do. Linux requires both LE
> > and BE bit ordering. When we optimize heavily, CPU byte order becomes
> > an issue that needs to be dealt with.
> >
> > > Some calculations are simplified by arranging the bits sequentially
> > > in WORDS when we are going to process them in a certain order
> > > within each byte. This means that the tables used to compute
> > > crc32_be are easier to use in BE byte order and that the tables
> > > used to compute crc32_le are easier to use in LE byte order. That
> > > is the logic behind the following weird looking macros, which are
> > > kept the same as the previous version except for the inclusion of
> > > __force to pass sparse with -D__CHECK_ENDIAN__.
> >
> > I don't understand: how is your code any different from what we have
> > today?
> >
> > Jocke
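(For reference, the caller-side step Bob describes would look like the
sketch below. Names other than crc32_le and cpu_to_le32 are placeholders,
and whether this conversion belongs in the caller at all is exactly the
disagreement above:)

	/* sketch: crc32_le() returns a u32 in cpu byte order; a protocol
	 * that wants the CRC little-endian on the wire converts it
	 * explicitly (the ~0 init/final inversion is the usual Ethernet
	 * convention)
	 */
	u32 crc = ~crc32_le(~0, data, len);
	__le32 wire = cpu_to_le32(crc);
	memcpy(pkt + pkt_len, &wire, sizeof(wire));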