Date: 8 Aug 2011 05:28:26 -0400
Message-ID: <20110808092826.21881.qmail@science.horizon.com>
From: "George Spelvin"
To: fzago@systemfabricworks.com, linux-kernel@vger.kernel.org
Cc: akpm@linux-foundation.org, joakim.tjernlund@transmode.se,
	linux@horizon.com, rpearson@systemfabricworks.com
Subject: [PATCH] add slice by 8 algorithm to crc32.c

Sorry I didn't see this when first posted.

The "slice by 8" terminology is pretty confusing.  How about "Extended
Joakim Tjernlund's optimization from commit
836e2af92503f1642dbc3c3281ec68ec1dd39d2e to 8-way parallelism."
Which is essentially what you're doing.

The renaming of tab[0] to t0_le and t0_be, and the removal of the
DO_CRC4 macro, just increase the diff size.

If you're looking at speeding up the CRC through larger tables, have
you tried using 10+11+11-bit tables?  That would require 20K of tables
rather than 8K, but would reduce the number of table lookups per byte.

One more stunt you could try to increase parallelism: rather than
maintain the CRC in one register, maintain it in several, and only XOR
and collapse them at the end.

Start with your 64-bit code, but imagine that the second code block's
"q = *p32++" always loads 0, and therefore the whole block can be
skipped.  (Since tab[0] = 0 for all CRC tables.)  This computes the
CRC of the even words.

Then do a second one in parallel for the odd words into a separate CRC
register.  Then combine them at the end.  (Shift one up by 32 bits and
XOR into the other.)

This would let you get away with 5K of tables: t4 through t7, and t0.
t1 through t3 could be skipped.

Ideally, I'd write all this code myself, but I'm a bit crunched at
work right now, so I wouldn't be able to get to it for a few days.

Another possible simplification to the startup code: there's no need
to compute init_bytes explicitly; just loop until the pointer is
aligned:

	while ((unsigned long)buf & 3) {
		if (!len--)
			goto done;
#ifdef __LITTLE_ENDIAN
		i0 = *buf++ ^ crc;
		crc = t0_le[i0] ^ (crc >> 8);
#else
		i0 = *buf++ ^ (crc >> 24);
		crc = t0_be[i0] ^ (crc << 8);
#endif
	}
	p32 = (u32 const *)buf;
	words = len >> 2;
	end_bytes = len & 3;

... although I'd prefer to keep the DO_CRC() and DO_CRC4() macros, and
extend them to the 64-bit case, to avoid the nested #ifdefs.  That
would make it:

	while ((unsigned long)buf & 3) {
		if (!len--)
			goto done;
		DO_CRC(*buf++);
	}
	p32 = (u32 const *)buf;
	words = len >> 2;
	end_bytes = len & 3;
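
For what it's worth, here is a rough standalone sketch of the
10+11+11-bit table idea, just to show the shape of it.  It's plain
userspace C against the reflected polynomial, not code against your
patch, and the table and function names (t10, t11a, t11b,
crc32_chunked) are invented for the example.  The point is that each
32-bit word then costs three lookups into 20K of tables instead of
four lookups into 8K:

/*
 * Sketch of a 10+11+11-bit table CRC-32: three lookups per 32-bit
 * word, 1024 + 2048 + 2048 table entries of 4 bytes each (20K).
 * Reflected polynomial, no final inversion; userspace only.
 */
#include <stdint.h>
#include <stdio.h>

#define CRCPOLY_LE 0xedb88320u

/* Advance a reflected CRC register by n zero bits. */
static uint32_t shift_bits(uint32_t r, int n)
{
	while (n--)
		r = (r & 1) ? (r >> 1) ^ CRCPOLY_LE : r >> 1;
	return r;
}

static uint32_t t10[1 << 10];	/* covers bits  0..9  of crc ^ word */
static uint32_t t11a[1 << 11];	/* covers bits 10..20 */
static uint32_t t11b[1 << 11];	/* covers bits 21..31 */

static void init_tables(void)
{
	uint32_t i;

	for (i = 0; i < 1 << 10; i++)
		t10[i] = shift_bits(i, 32);
	for (i = 0; i < 1 << 11; i++) {
		t11a[i] = shift_bits(i << 10, 32);
		t11b[i] = shift_bits(i << 21, 32);
	}
}

/* Reference: byte-at-a-time CRC, same convention. */
static uint32_t crc32_byte(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--)
		crc = shift_bits(crc ^ *p++, 8);
	return crc;
}

/* One 32-bit word per iteration, three table lookups per word. */
static uint32_t crc32_chunked(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len >= 4) {
		/* explicit little-endian load, so the demo is portable */
		uint32_t q = crc ^ (p[0] | p[1] << 8 | p[2] << 16 |
				    (uint32_t)p[3] << 24);
		crc = t10[q & 0x3ff] ^ t11a[(q >> 10) & 0x7ff] ^
		      t11b[q >> 21];
		p += 4;
		len -= 4;
	}
	return crc32_byte(crc, p, len);		/* tail bytes */
}

int main(void)
{
	uint8_t buf[64];
	size_t i;

	for (i = 0; i < sizeof(buf); i++)
		buf[i] = (uint8_t)(i * 7 + 3);
	init_tables();
	printf("byte-at-a-time: %08x\n", crc32_byte(~0u, buf, sizeof(buf)));
	printf("10+11+11:       %08x\n", crc32_chunked(~0u, buf, sizeof(buf)));
	return 0;
}

The same table-generation trick works for any chunk split that adds up
to 32 bits; 10+11+11 just keeps every table at or below 2^11 entries.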
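
And a quick sanity check of the algebra the even/odd split relies on
(not the t4-t7/t0 table usage itself, just the linearity): with a
fixed initial value the CRC is a GF(2)-linear function of the message,
so the CRC of the whole buffer equals the XOR of the CRCs of two
copies with the odd and even 32-bit words zeroed respectively, as long
as the initial value is fed into only one of them.  Again plain
userspace C with a local bit-at-a-time helper, not kernel code:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CRCPOLY_LE 0xedb88320u

/* Bit-at-a-time reflected CRC-32, no final inversion. */
static uint32_t crc32_raw(uint32_t crc, const uint8_t *p, size_t len)
{
	int k;

	while (len--) {
		crc ^= *p++;
		for (k = 0; k < 8; k++)
			crc = (crc & 1) ? (crc >> 1) ^ CRCPOLY_LE : crc >> 1;
	}
	return crc;
}

int main(void)
{
	uint8_t buf[64], even[64], odd[64];
	uint32_t whole, split;
	size_t i;

	for (i = 0; i < sizeof(buf); i++)
		buf[i] = (uint8_t)(i * 37 + 5);

	/* even[]: odd 32-bit words zeroed; odd[]: even words zeroed */
	memset(even, 0, sizeof(even));
	memset(odd, 0, sizeof(odd));
	for (i = 0; i < sizeof(buf); i++) {
		if ((i / 4) & 1)
			odd[i] = buf[i];
		else
			even[i] = buf[i];
	}

	whole = crc32_raw(~0u, buf, sizeof(buf));
	/* feed the initial value into the even stream only */
	split = crc32_raw(~0u, even, sizeof(even)) ^
		crc32_raw(0, odd, sizeof(odd));

	printf("whole: %08x  split: %08x\n", whole, split);
	assert(whole == split);
	return 0;
}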