netdev.vger.kernel.org archive mirror
* Optimising csum_fold()
From: David Laight @ 2022-11-22 13:08 UTC
  To: linux-kernel, netdev, x86
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, dave.hansen

There are currently 20 copies of csum_fold(), some in C, some in assembler.
The default C version (in asm-generic/checksum.h) is pretty horrid.
Some of the asm versions (including x86 and x86-64) aren't much better.

There are 3 pretty good C versions:
  1: (~sum - rol32(sum, 16)) >> 16
  2: ~(sum + rol32(sum, 16)) >> 16
  3: (u16)~((sum + rol32(sum, 16)) >> 16)
All three are (usually) 4 arithmetic instructions.

The first two have the advantage that the high bits of the result
are zero, which is relevant when the value is being checked rather
than set.

The first one allows better instruction scheduling: the rotate and
the invert both depend only on sum, so they can execute in the same
clock. (It computes the same value as the second, since ~sum - x
equals ~(sum + x) in two's complement arithmetic.)

The third one saves an instruction on arm, but the result may need
masking back to 16 bits.
(I've not compiled an arm kernel to see how often that happens.)
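
As a concrete sketch (not a patch), a generic csum_fold() using [1]
might look like this, with rol32() as in linux/bitops.h:

static inline __sum16 csum_fold(__wsum csum)
{
        u32 sum = (__force u32)csum;
        /* The top 16 bits of (sum + rol32(sum, 16)) are the two
         * halves added with end-around carry; [1] computes the
         * complement of that directly, since ~sum - x == ~(sum + x).
         */
        return (__force __sum16)((~sum - rol32(sum, 16)) >> 16);
}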

The only architectures where (I think) the current asm code is better
than the C above are sparc and sparc64.
Sparc doesn't have a rotate instruction, but does have a carry flag.
This makes the current asm version one instruction shorter.

For architectures like mips and risc-v, which have neither rotate
instructions nor carry flags, the C is as good as the current asm:
the rotate takes 3 instructions - the same as the extra cmp+add.

Changing everything to use [1] would improve quite a few architectures
while only adding 1 clock to some paths in arm/arm64 and sparc.

Unfortunately it is all currently a mess.
Most architectures don't include asm-generic/checksum.h at all.

Thoughts?

	David


* Re: Optimising csum_fold()
From: Willy Tarreau @ 2022-11-22 16:24 UTC
  To: David Laight
  Cc: linux-kernel, netdev, x86, Arnd Bergmann, Thomas Gleixner,
	Ingo Molnar, dave.hansen

On Tue, Nov 22, 2022 at 01:08:23PM +0000, David Laight wrote:
> There are currently 20 copies of csum_fold(), some in C, some in assembler.
> The default C version (in asm-generic/checksum.h) is pretty horrid.
> Some of the asm versions (including x86 and x86-64) aren't much better.
> 
> [...]
> 
> Unfortunately it is all currently a mess.
> Most architectures don't include asm-generic/checksum.h at all.
> 
> Thoughts?

Then why not just have one version per arch, the most efficient one,
and use it everywhere? The simple fact that we're discussing the
tradeoffs means that if we don't want to compromise performance here
(which I assume to be the case), then it needs to be per-arch and
that's all. At least that's the way I understand it.

Regards,
Willy


* RE: Optimising csum_fold()
From: David Laight @ 2022-11-22 16:55 UTC
  To: 'Willy Tarreau'
  Cc: linux-kernel, netdev, x86, Arnd Bergmann, Thomas Gleixner,
	Ingo Molnar, dave.hansen

From: Willy Tarreau <w@1wt.eu>
> Sent: 22 November 2022 16:25
> 
> On Tue, Nov 22, 2022 at 01:08:23PM +0000, David Laight wrote:
> > [...]
> 
> Then why not just have one version per arch, the most efficient one,
> and use it everywhere? The simple fact that we're discussing the
> tradeoffs means that if we don't want to compromise performance here
> (which I assume to be the case), then it needs to be per-arch and
> that's all. At least that's the way I understand it.

At the moment there are a lot of arch-specific ones that are
definitely sub-optimal.

I started doing some patches; my x86-64 kernel is about 4k
smaller with [1].
I was going to post the patches for asm-generic and x86.

	David


* Re: Optimising csum_fold()
From: Willy Tarreau @ 2022-11-22 16:59 UTC
  To: David Laight
  Cc: linux-kernel, netdev, x86, Arnd Bergmann, Thomas Gleixner,
	Ingo Molnar, dave.hansen

On Tue, Nov 22, 2022 at 04:55:27PM +0000, David Laight wrote:
> From: Willy Tarreau <w@1wt.eu>
> > Sent: 22 November 2022 16:25
> > 
> > On Tue, Nov 22, 2022 at 01:08:23PM +0000, David Laight wrote:
> > > [...]
> > 
> > Then why not just have one version per arch, the most efficient one,
> > and use it everywhere? The simple fact that we're discussing the
> > tradeoffs means that if we don't want to compromise performance here
> > (which I assume to be the case), then it needs to be per-arch and
> > that's all. At least that's the way I understand it.
> 
> At the moment there are a lot of arch-specific ones that are
> definitely sub-optimal.

Yes, very likely!

> I started doing some patches; my x86-64 kernel is about 4k
> smaller with [1].
> I was going to post the patches for asm-generic and x86.

I mean, maybe we could have your 3 versions with different names
in asm-generic, and have each asm file define csum_fold() to one
of them. That would limit the spread of variants and the auditing
difficulty.

Willy


* Optimising csum_fold()
From: David Laight @ 2022-09-06 10:08 UTC
  To: netdev

The default C version is:

static inline __sum16 csum_fold(__wsum csum)
{
        u32 sum = (__force u32)csum;
        sum = (sum & 0xffff) + (sum >> 16);     /* fold 32 bits into 17 */
        sum = (sum & 0xffff) + (sum >> 16);     /* fold the carry back in */
        return (__force __sum16)~sum;
}

This has a register dependency chain length of at least 5.
More if register moves have a cost and the final mask has to be done.

x86 (and other architectures with a carry flag) may use:
static inline __sum16 csum_fold(__wsum sum)
{
        asm("addl %1, %0                ;\n"
            "adcl $0xffff, %0   ;\n"
            : "=r" (sum)
            : "r" ((__force u32)sum << 16),
              "0" ((__force u32)sum & 0xffff0000));
        return (__force __sum16)(~(__force u32)sum >> 16);
}
This isn't actually any better!

arm64 (and a few others) have the C version:
static inline __sum16 csum_fold(__wsum csum)
{
        u32 sum = (__force u32)csum;
        sum += (sum >> 16) | (sum << 16);
        return (__force __sum16)~(sum >> 16);
}
Assuming the shifts get converted to a rotate,
this is one instruction shorter.

Finally arc has a slight variant:
static inline __sum16 csum_fold(__wsum s)
{
        unsigned r = s << 16 | s >> 16; /* ror */
        s = ~s;
        s -= r;
        return s >> 16;
}
On a multi-issue cpu the rotate and the ~ can execute in the same clock.
If the compiler is any good the final mask is never needed.
So this has a register dependency chain length of 3.
(It computes the same result as the arm64 version, since
~s - r == ~(s + r) in two's complement arithmetic.)

This looks to be better than the existing versions for
almost all architectures.
(There seem to be a few where the shifts aren't converted
to a rotate. I'd be surprised if the cpus don't have a
rotate instruction - so gcc must get confused.)

See https://godbolt.org/z/on1v6naoE

Annoyingly it isn't trivial to convert most of the architectures to
the generic version because they don't include asm-generic/checksum.h.

It has to be said that this function can generate 0x0000 where
0xffff is the value that has to be transmitted (for example, a
32-bit sum of 0xffff0000 folds to 0xffff, so the complement
comes out as 0x0000).
That definitely matters for IPv6, where a transmitted UDP
checksum of zero is invalid.
One solution is to add one to the initial constant checksum
(usually 0) and then add one to the result of csum_fold().
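
An untested sketch of that idea (the wrapper name is invented
here, it isn't from any tree):

static inline __sum16 csum_nonzero(const void *buff, int len)
{
        /* Start the one's complement sum at 1 instead of 0,
         * so the bias takes part in the end-around carries.
         */
        __wsum sum = csum_partial(buff, len, (__force __wsum)1);
        /* Add the 1 back after folding.  Every result should be
         * unchanged except 0x0000, which becomes 0xffff.
         */
        return (__force __sum16)((__force u16)csum_fold(sum) + 1);
}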

	David

