From: "Bob Pearson" <rpearson@systemfabricworks.com>
To: "'Joakim Tjernlund'", "'frank zago'", "'Andrew Morton'", linux-kernel@vger.kernel.org
Subject: RE: [PATCH] add slice by 8 algorithm to crc32.c
Date: Tue, 2 Aug 2011 16:14:39 -0500
Message-ID: <01dc01cc5159$317879a0$94696ce0$@systemfabricworks.com>

Hi Joakim,

Sorry to take so long to respond.

Here are some performance data collected from the original and modified
crc32 algorithms.  The following is a simple test loop that measures the
time taken to compute 1000 CRCs over 4096 bytes of data aligned on an
8-byte boundary, after warming the cache.  You could make other
measurements, but this is more or less a best case.  These measurements
were made on a dual-socket Nehalem 2.267 GHz system.

#ifdef CONFIG_CRC32_PERFTEST
#include <linux/time.h>

static u8 __attribute__((__aligned__(8))) test_buf[4096];

/* need to convince gcc to not optimize out all the crc calls as dead code */
u32 crc;

static int __init crc32_init(void)
{
	int i;
	struct timespec start, stop;
	u64 nsec_le;
	u64 nsec_be;

	for (i = 0; i < 4096; i++)
		test_buf[i] = i;

	crc = crc32_le(0, test_buf, 4096);
	crc = crc32_le(crc, test_buf, 4096);

	getnstimeofday(&start);
	for (i = 0; i < 1000; i++)
		crc = crc32_le(crc, test_buf, 4096);
	getnstimeofday(&stop);

	nsec_le = stop.tv_nsec - start.tv_nsec
		+ 1000000000 * (stop.tv_sec - start.tv_sec);

	crc = crc32_be(0, test_buf, 4096);
	crc = crc32_be(crc, test_buf, 4096);

	getnstimeofday(&start);
	for (i = 0; i < 1000; i++)
		crc = crc32_be(crc, test_buf, 4096);
	getnstimeofday(&stop);

	nsec_be = stop.tv_nsec - start.tv_nsec
		+ 1000000000 * (stop.tv_sec - start.tv_sec);

	pr_info("crc32: nsec_le = %lld nsec_be = %lld\n", nsec_le, nsec_be);

	return 0;
}

module_init(crc32_init);
#endif /* CONFIG_CRC32_PERFTEST */

Here are the timings.

                     CRC_?E_BITS   crc_le (nsec)   crc_be (nsec)
    orig crc32.c          8           5842712         5829793
    new  crc32.c         64           2877530         2862515
    new  crc32.c         32           5174400         5105816   (equiv. to orig BITS = 8)

Modify all the 'i' loops from

	for (i = 0; i < foo; i++) { ... }

to

	for (i = foo - 1; i >= 0; i--) { ... }

    + count down         64           2837860         3230438
                                      2865727         3264033   (repeat)
    + count down         32           5177334         5070337

Results seem ambiguous on Nehalem: a slight improvement in some cases,
12% worse in one case.

Modify the main loop to use pre-increment instead of post-increment.

    + pre-increment      64           2872502         2767601
    + pre-increment      32           5100579         5090453

Small improvements in each case.  Will add this to the next drop.
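For reference, this is roughly the shape of the pre-increment change in the
little-endian slice-by-4 word loop.  It is only a sketch to show what is
being compared: the t0_le..t3_le tables are the ones from the patch, the
helper names here are made up, and this is not the exact code that was
timed.

/* Stand-in declaration for the 256-entry tables from gen_crc32table.c. */
extern const u32 t0_le[256], t1_le[256], t2_le[256], t3_le[256];

/* Post-increment form: read through the pointer, then advance it. */
static u32 le_words_postinc(u32 crc, const u32 *p32, size_t words)
{
	size_t i;
	u32 q;
	u8 i0, i1, i2, i3;

	for (i = 0; i < words; i++) {
		q = *p32++ ^ crc;
		i3 = q;
		i2 = q >> 8;
		i1 = q >> 16;
		i0 = q >> 24;
		crc = t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
	}
	return crc;
}

/*
 * Pre-increment form: back the pointer up once, then *++p32 walks the same
 * words.  Same result; the only difference is how the address update is
 * scheduled.
 */
static u32 le_words_preinc(u32 crc, const u32 *p32, size_t words)
{
	size_t i;
	u32 q;
	u8 i0, i1, i2, i3;

	--p32;
	for (i = 0; i < words; i++) {
		q = *++p32 ^ crc;
		i3 = q;
		i2 = q >> 8;
		i1 = q >> 16;
		i0 = q >> 24;
		crc = t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
	}
	return crc;
}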
Do both count down and pre-increment.

    + ct down + pre-inc  64           3329125         3367815
    + ct down + pre-inc  32           5065308         5089825

Got significantly worse for 64 bits, slightly better for 32.

Separately, I looked at the difference between 1D and 2D arrays, and the 1D
arrays are a few % faster.  The difference is that the loader computes the
base address at compile time; in the 2D case you have to add an offset to a
parameter that was passed in to the body routine, which is a runtime
addition.

The first comment below relates to the behavior of gen_crc32table.c, which
always printed out 256 x 4-byte table entries at a time.  The
CRC_?E_BITS = 2 and 4 cases only read the first 4 and 16 table entries
respectively, so if you are trying to make a really small image you can
save most of 1 KB by shortening the table.

I think I just caused more confusion below.  The point is that the two
routines crc32_le and crc32_be take a byte string as argument, which is
invariant under big/little endian byte order, and produce a u32, which is
just an integer.  All the byte swapping in the code is implementation
detail and has no effect on the result.  The caller should just do what you
normally do with an integer when you use it in a network message, e.g. call
htonl or equivalent before putting it in a packet.  (There is a small
caller-side sketch of this appended after the quoted mail below.)

Last point below: the existing code fails sparse with __CHECK_ENDIAN__.
(There are several other sparse failures in lib as well, BTW.)  I cleaned
up these warnings by forcing everything to u32.

I will include Andrew's comments with the above and send out another patch
soon.

Regards,

Bob Pearson

> -----Original Message-----
> From: Joakim Tjernlund [mailto:joakim.tjernlund@transmode.se]
> Sent: Monday, August 01, 2011 4:20 AM
> To: frank zago; Andrew Morton; Bob Pearson
> Subject: Re: [PATCH] add slice by 8 algorithm to crc32.c
>
> Hi Bob, Frank and Andrew
>
> I just got back from vacation and noticed this patch in the kernel mail
> archive.  I have pasted some bits from there and commented too.
>
> > Hello,
> >
> > Added support for slice by 8 to existing crc32 algorithm. Also
> > modified gen_crc32table.c to only produce table entries that are
> > actually used.
>
> Hmm, I don't get this. What entries are unused?
>
> > The parameters CRC_LE_BITS and CRC_BE_BITS determine
> > the number of bits in the input array that are processed during each
> > step. Generally the more bits the faster the algorithm is but the
> > more table data required.
> >
> > Using an x86_64 Opteron machine running at 2100MHz the following table
> > was collected with a pre-warmed cache by computing the crc 1000 times
> > on a buffer of 4096 bytes.
> >
> >     BITS    Size    LE Cycles/byte    BE Cycles/byte
> >     ----------------------------------------------
> >       1      873        41.65             34.60
> >       2     1097        25.43             29.61
> >       4     1057        13.29             15.28
> >       8     2913         7.13              8.19
> >      32     9684         2.80              2.82
> >      64    18178         1.53              1.53
> >
> > BITS is the value of CRC_LE_BITS or CRC_BE_BITS.  The old
> > default was 8 which actually selected the 32 bit algorithm.  In
> > this version the value 8 is used to select the standard
> > 8 bit algorithm and two new values: 32 and 64 are introduced
> > to select the slice by 4 and slice by 8 algorithms respectively.
> >
> > Where Size is the size of crc32.o's text segment which includes
> > code and table data when both LE and BE versions are set to BITS.
>
> I miss the numbers from the current 32 bits impl.  How does this new
> impl. for 32 bits compare to the current impl?
>
> >
> > The current version of crc32.c by default uses the slice by 4 algorithm
> > which requires about 2.8 cycles per byte.  The slice by 8 algorithm is
> > roughly 2X faster and enables packet processing at over 1 GB/sec on a
> > typical 2-3 GHz system.
> >
> > --- lib/crc32.c
> > +++ lib/crc32.c
> >
> > ...
> >
> > @@ -28,14 +31,17 @@
> >  #include
> >  #include
> >  #include "crc32defs.h"
> > -#if CRC_LE_BITS == 8
> > -# define tole(x) __constant_cpu_to_le32(x)
> > +
> > +#include
> > +#if CRC_LE_BITS > 8
> > +# define tole(x) (__force u32) __constant_cpu_to_le32(x)
> >  #else
> >  # define tole(x) (x)
> >  #endif
> >
> > -#if CRC_BE_BITS == 8
> > -# define tobe(x) __constant_cpu_to_be32(x)
> > +#if CRC_BE_BITS > 8
> > +# define tobe(x) (__force u32) __constant_cpu_to_be32(x)
> >  #else
> >  # define tobe(x) (x)
> >  #endif
> > @@ -45,54 +51,228 @@ MODULE_AUTHOR("Matt Domsch
> >  MODULE_DESCRIPTION("Ethernet CRC32 calculations");
> >  MODULE_LICENSE("GPL");
> >
> > -#if CRC_LE_BITS == 8 || CRC_BE_BITS == 8
> > +#if CRC_LE_BITS > 8
> > +static inline u32 crc32_le_body(u32 crc, u8 const *buf, size_t len)
> > +{
> > +	const u8 *p8;
> > +	const u32 *p32;
> > +	int init_bytes, end_bytes;
> > +	size_t words;
> > +	int i;
> > +	u32 q;
> > +	u8 i0, i1, i2, i3;
> > +
> > +	crc = (__force u32) __cpu_to_le32(crc);
> > +
> > +#if CRC_LE_BITS == 64
> > +	p8 = buf;
> > +	p32 = (u32 *)(((uintptr_t)p8 + 7) & ~7);
> > +
> > +	init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> > +	if (init_bytes > len)
> > +		init_bytes = len;
> > +	words = (len - init_bytes) >> 3;
> > +	end_bytes = (len - init_bytes) & 7;
> > +#else
> > +	p8 = buf;
> > +	p32 = (u32 *)(((uintptr_t)p8 + 3) & ~3);
> > +	init_bytes = (uintptr_t)p32 - (uintptr_t)p8;
> > +	if (init_bytes > len)
> > +		init_bytes = len;
> > +	words = (len - init_bytes) >> 2;
> > +	end_bytes = (len - init_bytes) & 3;
> > +#endif
>
> The difference in the above code is just two constants.  Should be
> possible to avoid code duplication.
>
> > +
> > +	for (i = 0; i < init_bytes; i++) {
>
> All for(..) loops should be changed to:
> 	for (i = init_bytes; i; --i)
> This is easier for gcc to optimize.  Always compare to zero when possible.
>
> > +#ifdef __LITTLE_ENDIAN
> > +		i0 = *p8++ ^ crc;
> > +		crc = t0_le[i0] ^ (crc >> 8);
> > +#else
> > +		i0 = *p8++ ^ (crc >> 24);
> > +		crc = t0_le[i0] ^ (crc << 8);
> > +#endif
> > +	}
>
> Better to hide this in a macro as the current code does.
> ...
>
> > +	for (i = 0; i < words; i++) {
> > +#ifdef __LITTLE_ENDIAN
> > +# if CRC_LE_BITS == 64
> > +		/* slice by 8 algorithm */
> > +		q = *p32++ ^ crc;
> > +		i3 = q;
> > +		i2 = q >> 8;
> > +		i1 = q >> 16;
> > +		i0 = q >> 24;
> > +		crc = t7_le[i3] ^ t6_le[i2] ^ t5_le[i1] ^ t4_le[i0];
> > +
> > +		q = *p32++;
> > +		i3 = q;
> > +		i2 = q >> 8;
> > +		i1 = q >> 16;
> > +		i0 = q >> 24;
> > +		crc ^= t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
>
> I am not convinced that converting the table matrix to several arrays is
> faster.  Now you have to load 8 addresses and then index them instead of
> just one address.
>
> Also, this looks more like unrolling the 32 bit variant.  Are you sure
> that you won't get just as good results by just unrolling the 32 bit
> variant?
>
> You also lost the pre-increment, which was faster on RISC archs like ppc
> (sparc too?) last time I tested crc32 speed.
>
> > +# else
> > +		/* slice by 4 algorithm */
> > +		q = *p32++ ^ crc;
> > +		i3 = q;
> > +		i2 = q >> 8;
> > +		i1 = q >> 16;
> > +		i0 = q >> 24;
> > +		crc = t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ t0_le[i0];
> > +# endif
> > ...
> > General comment.  The use of le/be in this and the previous version of
> > crc32.c is confusing because it does not refer to little/big endian cpu
> > architecture but rather to the arbitrary choice made by different
> > protocols as to whether the bits in a byte are presented to the CRC in
> > little/big *BIT* order.  Some protocols (e.g. Ethernet, InfiniBand)
> > process the bits starting from the LSB and some (e.g. ATM) process the
> > bits in MSB order.  These routines (crc32_le and crc32_be) compute the
> > CRC as a u32 in cpu byte order.  The caller then has to get the CRC into
> > the correct order for inclusion in the message.  As a result there are
> > four versions of the
>
> No, the caller does not have to get the CRC into the correct order; this
> is done by the crc32 code.
>
> > calculation:
> > BE bit order on BE byte order cpu, BE bit order on LE byte order cpu,
> > etc.
>
> What would be better?  I don't see what to do.  Linux requires both LE and
> BE bit ordering.  When we optimize heavily, CPU byte order becomes an
> issue that needs to be dealt with.
>
> > Some calculations are simplified by arranging the bits sequentially in
> > WORDS when we are going to process them in a certain order within each
> > byte.  This means that the tables used to compute crc32_be are easier to
> > use in BE byte order and that the tables used to compute crc32_le are
> > easier to use in LE byte order.  That is the logic behind the following
> > weird looking macros, which are kept the same as the previous version
> > except for the inclusion of __force to pass sparse with
> > -D__CHECK_ENDIAN__.
>
> Don't understand, how is your code any different from what we have today?
>
> 	Jocke
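P.S.  Since the byte-order question keeps coming up, here is a minimal
caller-side sketch of what I mean above by treating the returned u32 as a
plain integer.  The trailer layout, the helper name and the seed/final xor
below are made up for illustration; the only point is that any swapping
happens in the caller, not inside crc32_le()/crc32_be().

#include <linux/types.h>
#include <linux/string.h>
#include <linux/crc32.h>
#include <asm/byteorder.h>

/* Append the CRC over msg[0..len) as a little-endian 32-bit trailer. */
static void append_crc32_trailer(u8 *msg, size_t len)
{
	u32 crc = crc32_le(~0U, msg, len) ^ ~0U;	/* seed/xor per protocol */
	__le32 wire = cpu_to_le32(crc);			/* or cpu_to_be32()/htonl()
							 * if the wire format is BE */

	memcpy(msg + len, &wire, sizeof(wire));
}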