From: Alexander Duyck
To: Tom Herbert
Cc: David Miller, Netdev, tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com,
 x86@kernel.org, Kernel Team
Subject: Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
Date: Thu, 4 Feb 2016 11:22:03 -0800
In-Reply-To: <1454527121-4007853-1-git-send-email-tom@herbertland.com>
References: <1454527121-4007853-1-git-send-email-tom@herbertland.com>

On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert wrote:
> Implement assembly routine for csum_partial for 64 bit x86. This
> primarily speeds up checksum calculation for smaller lengths such as
> those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
> conversion.
>
> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is checked to determine whether
> we need to avoid unaligned accesses. Efficient unaligned accesses
> offer a nice additional speedup as demonstrated in the results
> provided below.
>
> This implementation is similar to csum_partial implemented in
> checksum_32.S, however since we are dealing with 8 bytes at a time
> there are more cases for small lengths (and alignments)-- for that
> we employ a jump table.
>
> Testing:
>
> Correctness:
>
> Verified correctness by testing arbitrary length buffer filled with
> random data. For each buffer I compared the computed checksum
> using the original algorithm for each possible alignment (0-7 bytes).
>
> Checksum performance:
>
> Isolating old and new implementation for some common cases:
>
>              Old      NewA     NewA %   NewNoA   NewNoA %
>  Len/Aln     nsec     nsec     Improv   nsecs    Improve
> --------+--------+--------+--------+--------+-----------------------
>  1400/0    192.9    175.1    10%      174.9    10%  (Big packet)
>    40/0     13.8      7.7    44%        5.7    58%  (Ipv6 hdr cmn case)
>     8/4      8.4      6.9    18%        2.8    67%  (UDP, VXLAN in IPv4)
>    14/0     10.5      7.3    30%        5.4    48%  (Eth hdr)
>    14/4     10.8      8.7    19%        5.4    50%  (Eth hdr in IPv4)
>    14/3     11.0      9.8    11%        5.6    49%  (Eth with odd align)
>     7/1     10.0      5.8    42%        4.8    52%  (buffer in one quad)
>
> NewA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS not set
> NewNoA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set
>
> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>
> Also test on these with similar results:
>
> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>
> Branch prediction:
>
> To test the effects of poor branch prediction in the jump tables I
> tested checksum performance with runs for two combinations of length
> and alignment. As the baseline I performed the test by doing half of
> calls with the first combination, followed by using the second
> combination for the second half. In the test case, I interleave the
> two combinations so that in every call the length and alignment are
> different to defeat the effects of branch prediction. Running several
> cases, I did not see any material performance difference between
> the baseline and the interleaving test case.
>
> Signed-off-by: Tom Herbert

So a couple of general questions about all this.
First, why do you break the alignment handling out over so many reads?
It seems like you could just get the offset, subtract that from your
starting buffer, read the aligned long, and then mask off the bits you
don't want. That should take maybe 4 or 5 instructions, which would save
you a bunch of jumping around since most of it is just arithmetic.

I am still not sure about the loops, but as mentioned you probably don't
want to use the loop instruction. You would likely be better off
generating a pointer to the last valid offset and just running the loop
until you reach it.

Then on the last read, the same question as the first: why not just do
one read and mask off the bytes past the end of the buffer? It should be
faster.

Finally, I am not sure your result is safe if the buffer starts at an odd
offset. The original code collapsed the sum to 16 bits and rotated by 8
when the offset was odd; I don't see anything like that in your code.
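Roughly what I had in mind for the head read, and the same trick applies
to the last read. This is plain C rather than asm and completely
untested -- head_quad/tail_quad are names I just made up -- and it
assumes little-endian x86_64, where the bytes in front of the buffer sit
in the low-order end of the aligned quadword:

/* Head: back up to the aligned quadword and zero the leading bytes.
 * offset is (unsigned long)buff & 7, so the read stays within the same
 * aligned quadword, and therefore the same page, as buff itself. The
 * kept bytes stay in their original byte lanes; an odd starting offset
 * still needs the byte swap shown in the last sketch below. */
static unsigned long head_quad(const unsigned char *buff, unsigned int offset)
{
        unsigned long val = *(const unsigned long *)(buff - offset);

        return val & (~0UL << (8 * offset));
}

/* Tail: one read of the final quadword, keeping only the low 'len'
 * bytes (len is 1..7). If p is 8-byte aligned the over-read again
 * stays within one quadword. */
static unsigned long tail_quad(const unsigned long *p, unsigned int len)
{
        return *p & (~0UL >> (8 * (8 - len)));
}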
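And the loop shape I was suggesting in place of a counted loop -- again
just a C sketch with hypothetical names; the carry handling here is what
adc gives you for free in the asm:

/* Sum whole quadwords from p up to, but not including, end. */
static unsigned long sum_quads(const unsigned long *p,
                               const unsigned long *end)
{
        unsigned long sum = 0;

        while (p < end) {
                unsigned long v = *p++;

                sum += v;
                if (sum < v)            /* fold the carry back in */
                        sum++;
        }
        return sum;
}

The end pointer is computed once up front from the length, so there is no
counter register to maintain and the only per-iteration overhead besides
the adds is the pointer compare.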
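And this is the odd-offset fix-up I was referring to: fold the 64-bit
sum down to 16 bits and then swap the two result bytes when the buffer
started on an odd address. Again only a C sketch with a made-up name,
not the actual code from checksum_32.S:

static unsigned int fold_and_fixup(unsigned long sum64, int odd_start)
{
        unsigned long s;

        /* fold the 64-bit ones-complement sum down to 16 bits */
        s = (sum64 & 0xffffffffUL) + (sum64 >> 32);
        s = (s & 0xffffUL) + (s >> 16);
        s = (s & 0xffffUL) + (s >> 16);
        s = (s & 0xffffUL) + (s >> 16);

        /* If the data started on an odd address every byte was summed
         * in the opposite half of its 16-bit word, so the folded
         * result needs its two bytes swapped -- the rotate by 8 that
         * the old 32-bit code did. */
        if (odd_start)
                s = ((s & 0xff) << 8) | (s >> 8);

        return (unsigned int)s;
}

- Alex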