From: René Scharfe
Subject: Re: [PATCH 1/2] bswap.h: drop unaligned loads
To: Jeff King, Junio C Hamano
Cc: Han-Wen Nienhuys, git, Han-Wen Nienhuys
Date: Fri, 25 Sep 2020 00:02:38 +0200
References: <20200924191638.GA2528003@coredump.intra.peff.net> <20200924192111.GA2528225@coredump.intra.peff.net>
In-Reply-To: <20200924192111.GA2528225@coredump.intra.peff.net>
X-Mailing-List: git@vger.kernel.org

On 24.09.20 at 21:21, Jeff King wrote:
> Our put_be32() routine and its variants (get_be32(), put_be64(), etc)
> have two implementations: on some platforms we cast memory in place and
> use ntohl()/htonl(), which can cause unaligned memory access. And on
> others, we pick out the individual bytes using bitshifts.
>
> This introduces extra complexity, and sometimes causes compilers to
> generate warnings about type-punning. And it's not clear there's any
> performance advantage.
>
> This split goes back to 660231aa97 (block-sha1: support for
> architectures with memory alignment restrictions, 2009-08-12). The
> unaligned versions were part of the original block-sha1 code in
> d7c208a92e (Add new optimized C 'block-sha1' routines, 2009-08-05),
> which says it is:
>
>   Based on the mozilla SHA1 routine, but doing the input data accesses a
>   word at a time and with 'htonl()' instead of loading bytes and shifting.
>
> Back then, Linus provided timings versus the mozilla code which showed a
> 27% improvement:
>
>   https://lore.kernel.org/git/alpine.LFD.2.01.0908051545000.3390@localhost.localdomain/
>
> However, the unaligned loads were either not the useful part of that
> speedup, or perhaps compilers and processors have changed since then.
> Here are times for computing the sha1 of 4GB of random data, with and
> without -DNO_UNALIGNED_LOADS (and BLK_SHA1=1, of course). This is with
> gcc 10, -O2, and the processor is a Core i9-9880H.
>
>   [stock]
>   Benchmark #1: t/helper/test-tool sha1
>     Time (mean ± σ):      6.638 s ±  0.081 s    [User: 6.269 s, System: 0.368 s]
>     Range (min … max):    6.550 s …  6.841 s    10 runs
>
>   [-DNO_UNALIGNED_LOADS]
>   Benchmark #1: t/helper/test-tool sha1
>     Time (mean ± σ):      6.418 s ±  0.015 s    [User: 6.058 s, System: 0.360 s]
>     Range (min … max):    6.394 s …  6.447 s    10 runs
>
> And here's the same test run on an AMD A8-7600, using gcc 8.
>
>   [stock]
>   Benchmark #1: t/helper/test-tool sha1
>     Time (mean ± σ):     11.721 s ±  0.113 s    [User: 10.761 s, System: 0.951 s]
>     Range (min … max):   11.509 s … 11.861 s    10 runs
>
>   [-DNO_UNALIGNED_LOADS]
>   Benchmark #1: t/helper/test-tool sha1
>     Time (mean ± σ):     11.744 s ±  0.066 s    [User: 10.807 s, System: 0.928 s]
>     Range (min … max):   11.637 s … 11.863 s    10 runs

Yay, benchmarks!  GCC 10.2 with -O2 on an i5-9600K without
NO_UNALIGNED_LOADS:

  Benchmark #1: t/helper/test-tool sha1

> +cc René because I know he is going to feed the two of them into
> godbolt; I could do that, too, but he will provide much better analysis
> on top ;)

Weeell, I don't know about that, but I couldn't resist taking a quick
look at what some compilers do with the 32-bit functions, which are the
ones used in block-sha1: https://www.godbolt.org/z/rhKMTM.

Older versions of gcc and clang didn't see through the shifting
put_be32() implementation.  If you go back further there are also
versions that didn't optimize the shifting get_be32().  And the latest
icc still can't do that.

gcc 10.2 just optimizes all functions to a bswap and a mov.  Can't do
any better than that, can you?

But why do we then see a difference in our benchmark results?  Not sure,
but https://www.godbolt.org/z/7xh8ao shows that gcc is shuffling some
instructions around depending on the implementation.  Switch to clang if
you want to see more vigorous shuffling.
The performance of bigger pieces of code seems to be a matter of luck to
some extent. :-/

René
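
For readers without the godbolt links at hand, here is a minimal sketch
of the two styles of 32-bit big-endian accessors discussed in this
thread.  It is an illustration only, not the actual bswap.h code: the
names get_be32_shift(), put_be32_shift() and get_be32_word() are made up
for this example, and the word-at-a-time variant uses memcpy() as a
stand-in for the in-place cast, so that the sketch itself avoids the
unaligned access and type-punning that the cast-based version can run
into.

```c
#include <arpa/inet.h>	/* ntohl(); assumes a POSIX system */
#include <stdint.h>
#include <string.h>

/* Shift-based load: reads one byte at a time, so alignment never matters. */
static inline uint32_t get_be32_shift(const void *ptr)
{
	const unsigned char *p = ptr;
	return (uint32_t)p[0] << 24 |
	       (uint32_t)p[1] << 16 |
	       (uint32_t)p[2] <<  8 |
	       (uint32_t)p[3];
}

/* Shift-based store: writes the most significant byte first. */
static inline void put_be32_shift(void *ptr, uint32_t value)
{
	unsigned char *p = ptr;
	p[0] = (unsigned char)(value >> 24);
	p[1] = (unsigned char)(value >> 16);
	p[2] = (unsigned char)(value >>  8);
	p[3] = (unsigned char)value;
}

/*
 * Word-at-a-time load: fetch all four bytes at once, then convert from
 * network (big-endian) to host byte order.  memcpy() replaces the
 * "*(uint32_t *)ptr" style cast here (an assumption of this sketch).
 */
static inline uint32_t get_be32_word(const void *ptr)
{
	uint32_t v;
	memcpy(&v, ptr, sizeof(v));
	return ntohl(v);
}
```

Storing 0xdeadbeef with put_be32_shift() at buf + 1 of an unsigned char
buf[8] and reading it back with either getter should give 0xdeadbeef
again, even though buf + 1 is deliberately misaligned.  Whether a given
compiler turns the shift-based versions into a single bswap and mov is
exactly the question the godbolt links above explore.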