On Wed, Jul 21, 2021 at 11:17 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
> I find it somewhat arbitrary that we choose to align the 2nd pointer and
> not the first.

Yeah, that's a bit odd, but I don't think it matters.

The hope is obviously that they are mutually aligned, and in that case
it doesn't matter which one you aim to align.

> So you are saying that the current memcmp could indeed use improvement
> but you don't want it to be based on the glibc's code due to the ugly
> misalignment handling?

Yeah. I suspect that this (very simple) patch gives you the same
performance improvement that the glibc code does.

NOTE! I'm not saying this patch is perfect. This one doesn't even
_try_ to do the mutual alignment, because it's really silly. But I'm
throwing this out here for discussion, because

 - it's really simple

 - I suspect it gets you 99% of the way there

 - the code generation is actually quite good with both gcc and clang.
This is gcc:

        memcmp:
                jmp     .L60
        .L52:
                movq    (%rsi), %rax
                cmpq    %rax, (%rdi)
                jne     .L53
                addq    $8, %rdi
                addq    $8, %rsi
                subq    $8, %rdx
        .L60:
                cmpq    $7, %rdx
                ja      .L52
                testq   %rdx, %rdx
                je      .L61
        .L53:
                xorl    %ecx, %ecx
                jmp     .L56
        .L62:
                addq    $1, %rcx
                cmpq    %rcx, %rdx
                je      .L51
        .L56:
                movzbl  (%rdi,%rcx), %eax
                movzbl  (%rsi,%rcx), %r8d
                subl    %r8d, %eax
                je      .L62
        .L51:
                ret
        .L61:
                xorl    %eax, %eax
                ret

and notice how there are no spills, no extra garbage, just simple and
straightforward code.

Those things ends mattering too - it's good for I$, it's good for the
small cases, and it's good for debugging and reading the code.

If this is "good enough" for your test-case, I really would prefer
something like this. "Make it as simple as possible, but no simpler"

I can do the mutual alignment too, but I'd actually prefer to do it as
a separate patch, for when there are numbers for that.

And I wouldn't do it as a byte-by-byte case, because that's just stupid.

I'd do it using a separate first single "get unaligned word from both
sources, compare them for equality, and then only add enough bytes to
align"

                  Linus