netdev.vger.kernel.org archive mirror
* Optimising csum_fold()
From: David Laight @ 2022-11-22 13:08 UTC
  To: linux-kernel, netdev, x86
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, dave.hansen

There are currently 20 copies of csum_fold(), some in C, some in assembler.
The default C version (in asm-generic/checksum.h) is pretty horrid.
Some of the asm versions (including x86 and x86-64) aren't much better.

There are 3 pretty good C versions:
  1: (~sum - rol32(sum, 16)) >> 16
  2: ~(sum + rol32(sum, 16)) >> 16
  3: (u16)~((sum + rol32(sum, 16)) >> 16)
All three are (usually) 4 arithmetic instructions.

The first two have the advantage that the high bits of the result
are zero, which is relevant when the value is being checked rather
than set.

The first one allows better instruction scheduling: the rotate and
the invert both depend only on sum, so they can execute in the same
clock. (It computes the same value as the second, since ~sum - x
equals ~(sum + x) in two's complement arithmetic.)

The third one saves an instruction on arm, but the result may need
masking back to 16 bits.
(I've not compiled an arm kernel to see how often that happens.)
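
As a concrete sketch (not a patch), a generic csum_fold() using [1]
might look like this, with rol32() as in linux/bitops.h:

static inline __sum16 csum_fold(__wsum csum)
{
        u32 sum = (__force u32)csum;
        /* The top 16 bits of (sum + rol32(sum, 16)) are the two
         * halves added with end-around carry; [1] computes the
         * complement of that directly, since ~sum - x == ~(sum + x).
         */
        return (__force __sum16)((~sum - rol32(sum, 16)) >> 16);
}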

The only architectures where (I think) the current asm code is better
than the C above are sparc and sparc64.
Sparc doesn't have a rotate instruction, but does have a carry flag.
This makes the current asm version one instruction shorter.

For architectures like mips and risc-v, which have neither rotate
instructions nor carry flags, the C is as good as the current asm:
the rotate takes 3 instructions - the same as the extra cmp+add.

Changing everything to use [1] would improve quite a few architectures
while only adding 1 clock to some paths in arm/arm64 and sparc.

Unfortunately it is all currently a mess.
Most architectures don't include asm-generic/checksum.h at all.

Thoughts?

	David


* Re: Optimising csum_fold()
From: Willy Tarreau @ 2022-11-22 16:24 UTC
  To: David Laight
  Cc: linux-kernel, netdev, x86, Arnd Bergmann, Thomas Gleixner,
	Ingo Molnar, dave.hansen

On Tue, Nov 22, 2022 at 01:08:23PM +0000, David Laight wrote:
> There are currently 20 copies of csum_fold(), some in C, some in assembler.
> The default C version (in asm-generic/checksum.h) is pretty horrid.
> Some of the asm versions (including x86 and x86-64) aren't much better.
> 
> [...]
> 
> Unfortunately it is all currently a mess.
> Most architectures don't include asm-generic/checksum.h at all.
> 
> Thoughts?

Then why not just have one version per arch, the most efficient one,
and use it everywhere? The simple fact that we're discussing the
tradeoffs means that if we don't want to compromise performance here
(which I assume to be the case), then it needs to be per-arch and
that's all. At least that's the way I understand it.

Regards,
Willy


* RE: Optimising csum_fold()
From: David Laight @ 2022-11-22 16:55 UTC
  To: 'Willy Tarreau'
  Cc: linux-kernel, netdev, x86, Arnd Bergmann, Thomas Gleixner,
	Ingo Molnar, dave.hansen

From: Willy Tarreau <w@1wt.eu>
> Sent: 22 November 2022 16:25
> 
> On Tue, Nov 22, 2022 at 01:08:23PM +0000, David Laight wrote:
> > [...]
> 
> Then why not just have one version per arch, the most efficient one,
> and use it everywhere? The simple fact that we're discussing the
> tradeoffs means that if we don't want to compromise performance here
> (which I assume to be the case), then it needs to be per-arch and
> that's all. At least that's the way I understand it.

At the moment there are a lot of arch-specific ones that are
definitely sub-optimal.

I started doing some patches; my x86-64 kernel is about 4k
smaller with [1].
I was going to post the patches for asm-generic and x86.

	David


* Re: Optimising csum_fold()
From: Willy Tarreau @ 2022-11-22 16:59 UTC
  To: David Laight
  Cc: linux-kernel, netdev, x86, Arnd Bergmann, Thomas Gleixner,
	Ingo Molnar, dave.hansen

On Tue, Nov 22, 2022 at 04:55:27PM +0000, David Laight wrote:
> From: Willy Tarreau <w@1wt.eu>
> > Sent: 22 November 2022 16:25
> > 
> > On Tue, Nov 22, 2022 at 01:08:23PM +0000, David Laight wrote:
> > > [...]
> > 
> > Then why not just have one version per arch, the most efficient one,
> > and use it everywhere? The simple fact that we're discussing the
> > tradeoffs means that if we don't want to compromise performance here
> > (which I assume to be the case), then it needs to be per-arch and
> > that's all. At least that's the way I understand it.
> 
> At the moment there are a lot of arch-specific ones that are
> definitely sub-optimal.

Yes, very likely!

> I started doing some patches; my x86-64 kernel is about 4k
> smaller with [1].
> I was going to post the patches for asm-generic and x86.

I mean, maybe we could have your 3 versions with different names
in asm-generic, and have each asm file define csum_fold() to one
of them. That would limit the spread of variants and the auditing
difficulty.

Willy


* Optimising csum_fold()
From: David Laight @ 2022-09-06 10:08 UTC
  To: netdev

The default C version is:

static inline __sum16 csum_fold(__wsum csum)
{
        u32 sum = (__force u32)csum;
        sum = (sum & 0xffff) + (sum >> 16);     /* fold 32 bits into 17 */
        sum = (sum & 0xffff) + (sum >> 16);     /* fold the carry back in */
        return (__force __sum16)~sum;
}

This has a register dependency chain length of at least 5.
More if register moves have a cost and the final mask has to be done.

x86 (and other architectures with a carry flag) may use:
static inline __sum16 csum_fold(__wsum sum)
{
        asm("addl %1, %0                ;\n"
            "adcl $0xffff, %0   ;\n"
            : "=r" (sum)
            : "r" ((__force u32)sum << 16),
              "0" ((__force u32)sum & 0xffff0000));
        return (__force __sum16)(~(__force u32)sum >> 16);
}
This isn't actually any better!

arm64 (and a few others) have the C version:
static inline __sum16 csum_fold(__wsum csum)
{
        u32 sum = (__force u32)csum;
        sum += (sum >> 16) | (sum << 16);
        return (__force __sum16)~(sum >> 16);
}
Assuming the shifts get converted to a rotate,
this is one instruction shorter.

Finally arc has a slight variant:
static inline __sum16 csum_fold(__wsum s)
{
        unsigned r = s << 16 | s >> 16; /* ror */
        s = ~s;
        s -= r;
        return s >> 16;
}
On a multi-issue cpu the rotate and the ~ can execute in the same clock.
If the compiler is any good the final mask is never needed.
So this has a register dependency chain length of 3.
(It computes the same result as the arm64 version, since
~s - r == ~(s + r) in two's complement arithmetic.)

This looks to be better than the existing versions for
almost all architectures.
(There seem to be a few where the shifts aren't converted
to a rotate. I'd be surprised if the cpus don't have a
rotate instruction - so gcc must get confused.)

See https://godbolt.org/z/on1v6naoE

Annoyingly it isn't trivial to convert most of the architectures to
the generic version because they don't include asm-generic/checksum.h.

It has to be said that this function can generate 0x0000 where
0xffff is the value that has to be transmitted (for example, a
32-bit sum of 0xffff0000 folds to 0xffff, so the complement
comes out as 0x0000).
That definitely matters for IPv6, where a transmitted UDP
checksum of zero is invalid.
One solution is to add one to the initial constant checksum
(usually 0) and then add one to the result of csum_fold().
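
An untested sketch of that idea (the wrapper name is invented
here, it isn't from any tree):

static inline __sum16 csum_nonzero(const void *buff, int len)
{
        /* Start the one's complement sum at 1 instead of 0,
         * so the bias takes part in the end-around carries.
         */
        __wsum sum = csum_partial(buff, len, (__force __wsum)1);
        /* Add the 1 back after folding.  Every result should be
         * unchanged except 0x0000, which becomes 0xffff.
         */
        return (__force __sum16)((__force u16)csum_fold(sum) + 1);
}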

	David

