On Wed, Jul 21, 2021 at 11:17 AM Nikolay Borisov wrote:
>
> I find it somewhat arbitrary that we choose to align the 2nd pointer and
> not the first.

Yeah, that's a bit odd, but I don't think it matters.

The hope is obviously that they are mutually aligned, and in that case
it doesn't matter which one you aim to align.

> So you are saying that the current memcmp could indeed use improvement
> but you don't want it to be based on the glibc's code due to the ugly
> misalignment handling?

Yeah. I suspect that this (very simple) patch gives you the same
performance improvement that the glibc code does.

NOTE! I'm not saying this patch is perfect. This one doesn't even
_try_ to do the mutual alignment, because it's really silly. But I'm
throwing this out here for discussion, because

 - it's really simple

 - I suspect it gets you 99% of the way there

 - the code generation is actually quite good with both gcc and clang.

This is gcc:

        memcmp:
                jmp     .L60
        .L52:
                movq    (%rsi), %rax
                cmpq    %rax, (%rdi)
                jne     .L53
                addq    $8, %rdi
                addq    $8, %rsi
                subq    $8, %rdx
        .L60:
                cmpq    $7, %rdx
                ja      .L52
                testq   %rdx, %rdx
                je      .L61
        .L53:
                xorl    %ecx, %ecx
                jmp     .L56
        .L62:
                addq    $1, %rcx
                cmpq    %rcx, %rdx
                je      .L51
        .L56:
                movzbl  (%rdi,%rcx), %eax
                movzbl  (%rsi,%rcx), %r8d
                subl    %r8d, %eax
                je      .L62
        .L51:
                ret
        .L61:
                xorl    %eax, %eax
                ret

and notice how there are no spills, no extra garbage, just simple and
straightforward code.

Those things end up mattering too - it's good for I$, it's good for the
small cases, and it's good for debugging and reading the code.

If this is "good enough" for your test-case, I really would prefer
something like this. "Make it as simple as possible, but no simpler"

I can do the mutual alignment too, but I'd actually prefer to do it as
a separate patch, for when there are numbers for that.

And I wouldn't do it as a byte-by-byte case, because that's just stupid.
I'd do it using a single separate first step: "get unaligned word from
both sources, compare them for equality, and then only add enough
bytes to align".

                Linus
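
For readers without the patch in front of them, here is a rough userspace
sketch of the two ideas discussed above. It is not the patch from this
thread, only an illustration consistent with the gcc output shown: compare
whole words while they match, then fall back to bytes. The names
load_word(), cmp_bytes(), memcmp_words() and memcmp_align_first() are
hypothetical, and memcpy() stands in for the kernel's get_unaligned().
The second function is one possible reading of the "first unaligned word,
then only add enough bytes to align" idea, aligning the second pointer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Load one word from a possibly unaligned address. memcpy() is a
 * portable stand-in (an assumption here) for the kernel's
 * get_unaligned(). */
static unsigned long load_word(const void *p)
{
	unsigned long w;

	memcpy(&w, p, sizeof(w));
	return w;
}

/* Byte loop: returns the difference of the first mismatching bytes. */
static int cmp_bytes(const unsigned char *a, const unsigned char *b, size_t n)
{
	for (; n; a++, b++, n--) {
		int d = *a - *b;

		if (d)
			return d;
	}
	return 0;
}

/* Simple version: whole words while they are equal, then bytes. */
int memcmp_words(const void *cs, const void *ct, size_t count)
{
	const unsigned char *a = cs, *b = ct;

	while (count >= sizeof(unsigned long)) {
		if (load_word(a) != load_word(b))
			break;	/* the byte loop finds the differing byte */
		a += sizeof(unsigned long);
		b += sizeof(unsigned long);
		count -= sizeof(unsigned long);
	}
	return cmp_bytes(a, b, count);
}

/* Alignment variant: compare one unaligned word first, then advance
 * only far enough (1..sizeof(long) bytes) to put the second pointer
 * on a word boundary. No byte-by-byte alignment prologue. */
int memcmp_align_first(const void *cs, const void *ct, size_t count)
{
	const unsigned char *a = cs, *b = ct;
	const size_t w = sizeof(unsigned long);

	if (count >= w) {
		size_t skip;

		if (load_word(a) != load_word(b))
			return cmp_bytes(a, b, w);
		/* every skipped byte was covered by the word just compared */
		skip = w - ((uintptr_t)b & (w - 1));
		a += skip;
		b += skip;
		count -= skip;
	}
	return memcmp_words(a, b, count);
}
```

Neither function depends on byte order, because a word mismatch is always
resolved by the byte loop. Whether the alignment step is worth having is
exactly the question that needs numbers.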