On Wed, Nov 24, 2021 at 10:56 PM Noah Goldstein wrote:
>
> On Wed, Nov 24, 2021 at 10:20 PM Eric Dumazet wrote:
> >
> > On Wed, Nov 24, 2021 at 8:08 PM Eric Dumazet wrote:
> > >
> > > On Wed, Nov 24, 2021 at 8:00 PM Eric Dumazet wrote:
> > > >
> > > > It is an issue in general, not in standard cases because network
> > > > headers are aligned.
> > > >
> > > > I think it came when I folded csum_partial() and do_csum(), I forgot
> > > > to ror() the seed.
> > > >
> > > > I suspect the following would help:
> > > >
> > > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > > > index 1eb8f2d11f7c785be624eba315fe9ca7989fd56d..ee7b0e7a6055bcbef42d22f7e1d8f52ddbd6be6d 100644
> > > > --- a/arch/x86/lib/csum-partial_64.c
> > > > +++ b/arch/x86/lib/csum-partial_64.c
> > > > @@ -41,6 +41,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
> > > >         if (unlikely(odd)) {
> > > >                 if (unlikely(len == 0))
> > > >                         return sum;
> > > > +               temp64 = ror32((__force u64)sum, 8);
> > > >                 temp64 += (*(unsigned char *)buff << 8);
> > > >                 len--;
> > > >                 buff++;
> > >
> > > It is a bit late here, I will test the following later this week.
> > >
> > > We probably can remove one conditional jump at the end of the function
> > >
> > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > > index 1eb8f2d11f7c785be624eba315fe9ca7989fd56d..15986ad42ed5ccb8241ff467a34188cf901ec98e 100644
> > > --- a/arch/x86/lib/csum-partial_64.c
> > > +++ b/arch/x86/lib/csum-partial_64.c
> > > @@ -41,9 +41,11 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
> > >         if (unlikely(odd)) {
> > >                 if (unlikely(len == 0))
> > >                         return sum;
> > > +               temp64 = ror32((__force u64)sum, 8);
> > >                 temp64 += (*(unsigned char *)buff << 8);
> > >                 len--;
> > >                 buff++;
> > > +               odd = 8;
> > >         }
> > >
> > >         while (unlikely(len >= 64)) {
> > > @@ -129,10 +131,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
> > >  #endif
> > >         }
> > >         result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
> > > -       if (unlikely(odd)) {
> > > -               result = from32to16(result);
> > > -               result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
> > > -       }
> > > +       ror32(result, odd);
> >
> > this would be
> > result = ror32(result, odd);
> >
> > definitely time to stop working today for me.
> >
> > >         return (__force __wsum)result;
> > > }
> > > EXPORT_SYMBOL(csum_partial);
>
> All my tests pass with that change :)  Although I see slightly worse
> performance with aligned `buff` in the branch-free approach. I imagine
> that if non-aligned `buff` is that uncommon, it might be better to keep
> the branch and speculate past the work of `ror`.