From: "Bob Pearson" <rpearson@systemfabricworks.com>
To: "'Joakim Tjernlund'", "'frank zago'", "'Andrew Morton'", linux-kernel@vger.kernel.org
Subject: RE: [PATCH] add slice by 8 algorithm to crc32.c
Date: Tue, 2 Aug 2011 16:14:39 -0500
Message-ID: <01dc01cc5159$317879a0$94696ce0$@systemfabricworks.com>

Hi Joakim,

Sorry to take so long to respond.

Here are some performance data collected from the original and modified
crc32 algorithms.  The following is a simple test loop that measures the
time taken to compute 1000 CRCs over 4096 bytes of data aligned on an
8-byte boundary, after warming the cache.  You could make other
measurements, but this is more or less a best case.  These measurements
were made on a dual-socket Nehalem 2.267 GHz system.

#ifdef CONFIG_CRC32_PERFTEST
#include <linux/time.h>

static u8 __attribute__((__aligned__(8))) test_buf[4096];

/* need to convince gcc to not optimize out all the crc calls as dead code */
u32 crc;

static int __init crc32_init(void)
{
	int i;
	struct timespec start, stop;
	u64 nsec_le;
	u64 nsec_be;

	for (i = 0; i < 4096; i++)
		test_buf[i] = i;

	crc = crc32_le(0, test_buf, 4096);
	crc = crc32_le(crc, test_buf, 4096);

	getnstimeofday(&start);
	for (i = 0; i < 1000; i++)
		crc = crc32_le(crc, test_buf, 4096);
	getnstimeofday(&stop);

	nsec_le = stop.tv_nsec - start.tv_nsec
		+ 1000000000 * (stop.tv_sec - start.tv_sec);

	crc = crc32_be(0, test_buf, 4096);
	crc = crc32_be(crc, test_buf, 4096);

	getnstimeofday(&start);
	for (i = 0; i < 1000; i++)
		crc = crc32_be(crc, test_buf, 4096);
	getnstimeofday(&stop);

	nsec_be = stop.tv_nsec - start.tv_nsec
		+ 1000000000 * (stop.tv_sec - start.tv_sec);

	pr_info("crc32: nsec_le = %lld nsec_be = %lld\n", nsec_le, nsec_be);

	return 0;
}

module_init(crc32_init);
#endif /* CONFIG_CRC32_PERFTEST */

Here are the timings.

                     CRC_?E_BITS   crc_le (nsec)   crc_be (nsec)
    orig crc32.c          8           5842712         5829793
    new  crc32.c         64           2877530         2862515
    new  crc32.c         32           5174400         5105816   (equiv. to orig BITS = 8)

Modify all the 'i' loops from

	for (i = 0; i < foo; i++) { ... }

to

	for (i = foo - 1; i >= 0; i--) { ... }

    + count down         64           2837860         3230438
                                      2865727         3264033   (repeat)
    + count down         32           5177334         5070337

Results seem ambiguous on Nehalem: a slight improvement in some cases,
12% worse in one case.

Modify the main loop to use pre-increment instead of post-increment.

    + pre-increment      64           2872502         2767601
    + pre-increment      32           5100579         5090453

Small improvements in each case.  Will add this to the next drop.
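For reference, this is roughly the shape of the pre-increment change in the
little-endian slice-by-4 word loop.  It is only a sketch to show what is
being compared: the t0_le..t3_le tables are the ones from the patch, the
helper names here are made up, and this is not the exact code that was
timed.

/* Stand-in declaration for the 256-entry tables from gen_crc32table.c. */
extern const u32 t0_le[256], t1_le[256], t2_le[256], t3_le[256];

/* Post-increment form: read through the pointer, then advance it. */
static u32 le_words_postinc(u32 crc, const u32 *p32, size_t words)
{
	size_t i;
	u32 q;
	u8 i0, i1, i2, i3;

	for (i = 0; i < words; i++) {
		q = *p32++ ^ crc;
		i3 = q;
		i2 = q >> 8;
		i1 = q >> 16;
		i0 = q >> 24;
		crc = t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
	}
	return crc;
}

/*
 * Pre-increment form: back the pointer up once, then *++p32 walks the same
 * words.  Same result; the only difference is how the address update is
 * scheduled.
 */
static u32 le_words_preinc(u32 crc, const u32 *p32, size_t words)
{
	size_t i;
	u32 q;
	u8 i0, i1, i2, i3;

	--p32;
	for (i = 0; i < words; i++) {
		q = *++p32 ^ crc;
		i3 = q;
		i2 = q >> 8;
		i1 = q >> 16;
		i0 = q >> 24;
		crc = t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
	}
	return crc;
}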
Do both count down and pre-increment.

    + ct down + pre-inc  64           3329125         3367815
    + ct down + pre-inc  32           5065308         5089825

Got significantly worse for 64 bits, slightly better for 32.

Separately, I looked at the difference between 1D and 2D arrays, and the 1D
arrays are a few % faster.  The difference is that the loader computes the
base address at compile time; in the 2D case you have to add an offset to a
parameter that was passed in to the body routine, which is a runtime
addition.

The first comment below relates to the behavior of gen_crc32table.c, which
always printed out 256 x 4-byte table entries at a time.  The
CRC_?E_BITS = 2 and 4 cases only read the first 4 and 16 table entries
respectively, so if you are trying to make a really small image you can
save most of 1 KB by shortening the table.

I think I just caused more confusion below.  The point is that the two
routines crc32_le and crc32_be take a byte string as argument, which is
invariant under big/little endian byte order, and produce a u32, which is
just an integer.  All the byte swapping in the code is implementation
detail and has no effect on the result.  The caller should just do what you
normally do with an integer when you use it in a network message, e.g. call
htonl or equivalent before putting it in a packet.  (There is a small
caller-side sketch of this appended after the quoted mail below.)

Last point below: the existing code fails sparse with __CHECK_ENDIAN__.
(There are several other sparse failures in lib as well, BTW.)  I cleaned
up these warnings by forcing everything to u32.

I will include Andrew's comments with the above and send out another patch
soon.

Regards,

Bob Pearson

> -----Original Message-----
> From: Joakim Tjernlund [mailto:joakim.tjernlund@transmode.se]
> Sent: Monday, August 01, 2011 4:20 AM
> To: frank zago; Andrew Morton; Bob Pearson
> Subject: Re: [PATCH] add slice by 8 algorithm to crc32.c
>
> Hi Bob, Frank and Andrew
>
> I just got back from vacation and noticed this patch in the kernel mail
> archive.  I have pasted some bits from there and commented too.
>
> > Hello,
> >
> > Added support for slice by 8 to existing crc32 algorithm. Also
> > modified gen_crc32table.c to only produce table entries that are
> > actually used.
>
> Hmm, I don't get this. What entries are unused?
>
> > The parameters CRC_LE_BITS and CRC_BE_BITS determine
> > the number of bits in the input array that are processed during each
> > step. Generally the more bits the faster the algorithm is but the
> > more table data required.
> >
> > Using an x86_64 Opteron machine running at 2100MHz the following table
> > was collected with a pre-warmed cache by computing the crc 1000 times
> > on a buffer of 4096 bytes.
> >
> >     BITS    Size    LE Cycles/byte    BE Cycles/byte
> >     ----------------------------------------------
> >       1      873        41.65             34.60
> >       2     1097        25.43             29.61
> >       4     1057        13.29             15.28
> >       8     2913         7.13              8.19
> >      32     9684         2.80              2.82
> >      64    18178         1.53              1.53
> >
> > BITS is the value of CRC_LE_BITS or CRC_BE_BITS.  The old
> > default was 8 which actually selected the 32 bit algorithm.  In
> > this version the value 8 is used to select the standard
> > 8 bit algorithm and two new values: 32 and 64 are introduced
> > to select the slice by 4 and slice by 8 algorithms respectively.
> >
> > Where Size is the size of crc32.o's text segment which includes
> > code and table data when both LE and BE versions are set to BITS.
>
> I miss the numbers from the current 32 bits impl.  How does this new
> impl. for 32 bits compare to the current impl?
>
> >
> > The current version of crc32.c by default uses the slice by 4 algorithm
> > which requires about 2.8 cycles per byte.  The slice by 8 algorithm is
> > roughly 2X faster and enables packet processing at over 1 GB/sec on a
> > typical 2-3 GHz system.
> >
> > --- lib/crc32.c
> > +++ lib/crc32.c
> >
> > ...
> >
> > @@ -28,14 +31,17 @@
> >  #include
> >  #include
> >  #include "crc32defs.h"
> > -#if CRC_LE_BITS == 8
> > -# define tole(x) __constant_cpu_to_le32(x)
> > +
> > +#include
> > +#if CRC_LE_BITS > 8
> > +# define tole(x) (__force u32) __constant_cpu_to_le32(x)
> >  #else
> >  # define tole(x) (x)
> >  #endif
> >
> > -#if CRC_BE_BITS == 8
> > -# define tobe(x) __constant_cpu_to_be32(x)
> > +#if CRC_BE_BITS > 8
> > +# define tobe(x) (__force u32) __constant_cpu_to_be32(x)
> >  #else
> >  # define tobe(x) (x)
> >  #endif
> > @@ -45,54 +51,228 @@ MODULE_AUTHOR("Matt Domsch
> >  MODULE_DESCRIPTION("Ethernet CRC32 calculations");
> >  MODULE_LICENSE("GPL");
> >
> > -#if CRC_LE_BITS == 8 || CRC_BE_BITS == 8
> > +#if CRC_LE_BITS > 8
> > +static inline u32 crc32_le_body(u32 crc, u8 const *buf, size_t len)
> > +{
> > +	const u8 *p8;
> > +	const u32 *p32;
> > +	int init_bytes, end_bytes;
> > +	size_t words;
> > +	int i;
> > +	u32 q;
> > +	u8 i0, i1, i2, i3;
> > +
> > +	crc = (__force u32) __cpu_to_le32(crc);
> > +
> > +#if CRC_LE_BITS == 64
> > +	p8 = buf;
> > +	p32 = (u32 *)(((uintptr_t)p8 + 7) & ~7);
> > +
> > +	init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> > +	if (init_bytes > len)
> > +		init_bytes = len;
> > +	words = (len - init_bytes) >> 3;
> > +	end_bytes = (len - init_bytes) & 7;
> > +#else
> > +	p8 = buf;
> > +	p32 = (u32 *)(((uintptr_t)p8 + 3) & ~3);
> > +	init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> > +	if (init_bytes > len)
> > +		init_bytes = len;
> > +	words = (len - init_bytes) >> 2;
> > +	end_bytes = (len - init_bytes) & 3;
> > +#endif
>
> The difference in the above code is just two constants.  Should be
> possible to avoid code duplication.
>
> > +
> > +	for (i = 0; i < init_bytes; i++) {
>
> All for(..) loops should be changed to:
> 	for (i = init_bytes; i; --i)
> This is easier for gcc to optimize.  Always compare to zero when possible.
>
> > +#ifdef __LITTLE_ENDIAN
> > +		i0 = *p8++ ^ crc;
> > +		crc = t0_le[i0] ^ (crc >> 8);
> > +#else
> > +		i0 = *p8++ ^ (crc >> 24);
> > +		crc = t0_le[i0] ^ (crc << 8);
> > +#endif
> > +	}
>
> Better to hide this in a macro as the current code does.
> ...
>
> > +	for (i = 0; i < words; i++) {
> > +#ifdef __LITTLE_ENDIAN
> > +# if CRC_LE_BITS == 64
> > +		/* slice by 8 algorithm */
> > +		q = *p32++ ^ crc;
> > +		i3 = q;
> > +		i2 = q >> 8;
> > +		i1 = q >> 16;
> > +		i0 = q >> 24;
> > +		crc = t7_le[i3] ^ t6_le[i2] ^ t5_le[i1] ^ t4_le[i0];
> > +
> > +		q = *p32++;
> > +		i3 = q;
> > +		i2 = q >> 8;
> > +		i1 = q >> 16;
> > +		i0 = q >> 24;
> > +		crc ^= t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
>
> I am not convinced that converting the table matrix to several arrays is
> faster.  Now you have to load 8 addresses and then index them instead of
> just one address.
>
> Also, this looks more like unrolling the 32 bit variant.  Are you sure
> that you won't get just as good results by just unrolling the 32 bit
> variant?
>
> You also lost the pre-increment, which was faster on RISC archs like ppc
> (sparc too?) last time I tested crc32 speed.
>
> > +# else
> > +		/* slice by 4 algorithm */
> > +		q = *p32++ ^ crc;
> > +		i3 = q;
> > +		i2 = q >> 8;
> > +		i1 = q >> 16;
> > +		i0 = q >> 24;
> > +		crc = t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
> > +# endif
> > ...
> > General comment.  The use of le/be in this and the previous version of
> > crc32.c is confusing because it does not refer to little/big endian cpu
> > architecture but rather to the arbitrary choice made by different
> > protocols as to whether the bits in a byte are presented to the CRC in
> > little/big *BIT* order.  Some protocols (e.g. Ethernet, InfiniBand)
> > process the bits starting from the LSB and some (e.g. ATM) process the
> > bits in MSB order.  These routines (crc32_le and crc32_be) compute the
> > CRC as a u32 in cpu byte order.  The caller then has to get the CRC into
> > the correct order for inclusion in the message.  As a result there are
> > four versions of the
>
> No, the caller does not have to get the CRC into the correct order; this
> is done by the crc32 code.
>
> > calculation:
> > BE bit order on BE byte order cpu, BE bit order on LE byte order cpu,
> > etc.
>
> What would be better?  I don't see what to do.  Linux requires both LE and
> BE bit ordering.  When we optimize heavily, CPU byte order becomes an
> issue that needs to be dealt with.
>
> > Some calculations are simplified by arranging the bits sequentially in
> > WORDS when we are going to process them in a certain order within each
> > byte.  This means that the tables used to compute crc32_be are easier to
> > use in BE byte order and that the tables used to compute crc32_le are
> > easier to use in LE byte order.  That is the logic behind the following
> > weird looking macros, which are kept the same as the previous version
> > except for the inclusion of __force to pass sparse with
> > -D__CHECK_ENDIAN__.
>
> Don't understand, how is your code any different from what we have today?
>
> 	Jocke
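P.S.  Since the byte-order question keeps coming up, here is a minimal
caller-side sketch of what I mean above by treating the returned u32 as a
plain integer.  The trailer layout, the helper name and the seed/final xor
below are made up for illustration; the only point is that any swapping
happens in the caller, not inside crc32_le()/crc32_be().

#include <linux/types.h>
#include <linux/string.h>
#include <linux/crc32.h>
#include <asm/byteorder.h>

/* Append the CRC over msg[0..len) as a little-endian 32-bit trailer. */
static void append_crc32_trailer(u8 *msg, size_t len)
{
	u32 crc = crc32_le(~0U, msg, len) ^ ~0U;	/* seed/xor per protocol */
	__le32 wire = cpu_to_le32(crc);			/* or cpu_to_be32()/htonl()
							 * if the wire format is BE */

	memcpy(msg + len, &wire, sizeof(wire));
}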