On 2020-09-24 at 19:21:11, Jeff King wrote:
> However, the unaligned loads were either not the useful part of that
> speedup, or perhaps compilers and processors have changed since then.
> Here are times for computing the sha1 of 4GB of random data, with and
> without -DNO_UNALIGNED_LOADS (and BLK_SHA1=1, of course). This is with
> gcc 10, -O2, and the processor is a Core i9-9880H.
>
>   [stock]
>   Benchmark #1: t/helper/test-tool sha1
>     Time (mean ± σ):      6.638 s ±  0.081 s    [User: 6.269 s, System: 0.368 s]
>     Range (min … max):    6.550 s …  6.841 s    10 runs
>
>   [-DNO_UNALIGNED_LOADS]
>   Benchmark #1: t/helper/test-tool sha1
>     Time (mean ± σ):      6.418 s ±  0.015 s    [User: 6.058 s, System: 0.360 s]
>     Range (min … max):    6.394 s …  6.447 s    10 runs
>
> And here's the same test run on an AMD A8-7600, using gcc 8.
>
>   [stock]
>   Benchmark #1: t/helper/test-tool sha1
>     Time (mean ± σ):     11.721 s ±  0.113 s    [User: 10.761 s, System: 0.951 s]
>     Range (min … max):   11.509 s … 11.861 s    10 runs
>
>   [-DNO_UNALIGNED_LOADS]
>   Benchmark #1: t/helper/test-tool sha1
>     Time (mean ± σ):     11.744 s ±  0.066 s    [User: 10.807 s, System: 0.928 s]
>     Range (min … max):   11.637 s … 11.863 s    10 runs

I think this is a fine and desirable change, both for performance and
correctness.  It is, as usual, well explained.

> So the unaligned loads don't seem to help much, and actually make things
> worse. It's possible there are platforms where they provide more
> benefit, but:
>
>   - the non-x86 platforms for which we use this code are old and obscure
>     (powerpc and s390).

I cannot speak for s390, since I have never owned one, but my
understanding on unaligned access is that typically there is a tiny
penalty on x86 (about a cycle) and a more significant penalty on
PowerPC, although that may have changed with newer POWER chips.  So my
gut tells me this is an improvement either way, although I no longer own
any such bootable hardware to measure for certain.
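To make the two approaches concrete, here is a minimal sketch of a
32-bit big-endian load done both ways.  This is not the actual Git code:
both function names are made up for illustration, and the second variant
assumes a POSIX ntohl().  It also uses memcpy() rather than a cast
through a possibly unaligned pointer, so that it stays well defined in C.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>	/* ntohl(); assumed available (POSIX) */

/*
 * Hypothetical helper: assemble the value one byte at a time.
 * This never performs a multi-byte load, so it is safe at any
 * alignment and independent of the host byte order.
 */
static inline uint32_t load_be32_bytewise(const void *ptr)
{
	const unsigned char *p = ptr;

	return (uint32_t)p[0] << 24 |
	       (uint32_t)p[1] << 16 |
	       (uint32_t)p[2] <<  8 |
	       (uint32_t)p[3];
}

/*
 * Hypothetical helper: copy the whole word out of the buffer, then
 * convert it from network (big-endian) order to host order.
 */
static inline uint32_t load_be32_word(const void *ptr)
{
	uint32_t v;

	memcpy(&v, ptr, sizeof(v));
	return ntohl(v);
}
```

Both functions should return the same value for any four input bytes,
including at odd offsets into a buffer.  Whether a particular compiler
turns either one into a single load plus byte swap is something to check
in the generated assembly rather than assume.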
Anyway, as René found, the latest versions of GCC already use the
peephole optimizer to recognize and optimize this on x86, so I expect
they'll do so on other architectures as well.  Byte swapping is a pretty
common operation.

>   - the main caller that cares about performance is block-sha1. But
>     these days it is rarely used anyway, in favor of sha1dc (which is
>     already much slower, and nobody seems to have cared that much).

I think block-sha256 uses it as well, but in any case, it's still faster
than sha1dc and people who care desperately about performance will use a
crypto library instead.
-- 
brian m. carlson: Houston, Texas, US