On 2020-09-24 at 19:21:11, Jeff King wrote:
> However, the unaligned loads were either not the useful part of that
> speedup, or perhaps compilers and processors have changed since then.
> Here are times for computing the sha1 of 4GB of random data, with and
> without -DNO_UNALIGNED_LOADS (and BLK_SHA1=1, of course). This is with
> gcc 10, -O2, and the processor is a Core i9-9880H.
> 
>   [stock]
>   Benchmark #1: t/helper/test-tool sha1 <foo.rand
>     Time (mean ± σ):      6.638 s ±  0.081 s    [User: 6.269 s, System: 0.368 s]
>     Range (min … max):    6.550 s …  6.841 s    10 runs
> 
>   [-DNO_UNALIGNED_LOADS]
>   Benchmark #1: t/helper/test-tool sha1 <foo.rand
>     Time (mean ± σ):      6.418 s ±  0.015 s    [User: 6.058 s, System: 0.360 s]
>     Range (min … max):    6.394 s …  6.447 s    10 runs
> 
> And here's the same test run on an AMD A8-7600, using gcc 8.
> 
>   [stock]
>   Benchmark #1: t/helper/test-tool sha1 <foo.rand
>     Time (mean ± σ):     11.721 s ±  0.113 s    [User: 10.761 s, System: 0.951 s]
>     Range (min … max):   11.509 s … 11.861 s    10 runs
> 
>   [-DNO_UNALIGNED_LOADS]
>   Benchmark #1: t/helper/test-tool sha1 <foo.rand
>     Time (mean ± σ):     11.744 s ±  0.066 s    [User: 10.807 s, System: 0.928 s]
>     Range (min … max):   11.637 s … 11.863 s    10 runs

I think this is a fine and desirable change, both for performance and
correctness.  It is, as usual, well explained.

> So the unaligned loads don't seem to help much, and actually make things
> worse. It's possible there are platforms where they provide more
> benefit, but:
> 
>   - the non-x86 platforms for which we use this code are old and obscure
>     (powerpc and s390).

I cannot speak for s390, since I have never owned one, but my
understanding on unaligned access is that typically there is a tiny
penalty on x86 (about a cycle) and a more significant penalty on
PowerPC, although that may have changed with newer POWER chips.  So my
gut tells me this is an improvement either way, although I no longer own
any such bootable hardware to measure for certain.

Anyway, as René found, the latest versions of GCC already use the
peephole optimizer to recognize and optimize this on x86, so I expect
they'll do so on other architectures as well.  Byte swapping is a pretty
common operation.

>   - the main caller that cares about performance is block-sha1. But
>     these days it is rarely used anyway, in favor of sha1dc (which is
>     already much slower, and nobody seems to have cared that much).

I think block-sha256 uses it as well, but in any case, it's still faster
than sha1dc and people who care desperately about performance will use a
crypto library instead.
-- 
brian m. carlson: Houston, Texas, US