From: David Laight <David.Laight@ACULAB.COM>
To: 'Eric Dumazet' <eric.dumazet@gmail.com>,
	"David S . Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>
Cc: netdev <netdev@vger.kernel.org>,
	Eric Dumazet <edumazet@google.com>,
	"x86@kernel.org" <x86@kernel.org>,
	Alexander Duyck <alexander.duyck@gmail.com>
Subject: RE: [RFC] x86/csum: rewrite csum_partial()
Date: Sun, 14 Nov 2021 13:07:06 +0000	[thread overview]
Message-ID: <e6fcc05d59974ba9afa49ba07a7251aa@AcuMS.aculab.com> (raw)
In-Reply-To: <20211111065322.1261275-1-eric.dumazet@gmail.com>

From: Eric Dumazet
> Sent: 11 November 2021 06:53
> 
> With more NICs supporting CHECKSUM_COMPLETE, and IPv6 being widely
> used, csum_partial() is heavily used with small numbers of bytes,
> and is consuming many cycles.
> 
> IPv6 header size for instance is 40 bytes.
> 
> Another thing to consider is that NET_IP_ALIGN is 0 on x86,
> meaning that network headers in RX path are not word-aligned,
> unless the driver forces this.
> 
> This means that csum_partial() fetches one u16
> to 'align the buffer', then performs seven u64 additions
> with carry in a loop, then a remaining u32, then a remaining u16.
> 
> With this new version, we perform 10 u32 adds with carry, to
> avoid the expensive 64->32 transformation. Using 5 u64 adds
> plus one add32_with_carry() is more expensive.
> 
> Also note that this avoids loops for less than ~60 bytes.

I spent far too long looking at this code a while back :-)
I did post a patch - probably 10th May 2020.

Prior to Sandy Bridge, ADC always took two clocks.
Even on Sandy Bridge there is a two clock delay for the sum;
only the carry flag is available earlier.
Broadwell (and I think all AMD cpus) do ADC in one clock.
The delay can be avoided by adding to alternate registers.
There are also issues on some cpus with partial updates
to the flags register (DEC sets Z but not C) causing unwanted
register dependencies.

I think misaligned memory reads take an extra clock.
But more recent cpus can do two memory reads per clock,
so unless the code is trying to beat 8 bytes/clock it
shouldn't have much effect.

The fastest loop I found (for large buffers) used:

+       asm(    "       bt    $4, %[len]\n"
+               "       jnc   10f\n"
+               "       add   (%[buff], %[len]), %[sum_0]\n"
+               "       adc   8(%[buff], %[len]), %[sum_1]\n"
+               "       lea   16(%[len]), %[len]\n"
+               "10:    jecxz 20f\n"
+               "       adc   (%[buff], %[len]), %[sum_0]\n"
+               "       adc   8(%[buff], %[len]), %[sum_1]\n"
+               "       lea   32(%[len]), %[len_tmp]\n"
+               "       adc   16(%[buff], %[len]), %[sum_0]\n"
+               "       adc   24(%[buff], %[len]), %[sum_1]\n"
+               "       mov   %[len_tmp], %[len]\n"
+               "       jmp   10b\n"
+               "20:    adc   %[sum_0], %[sum]\n"
+               "       adc   %[sum_1], %[sum]\n"
+               "       adc   $0, %[sum]\n"
+           : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1),
+               [len] "+&c" (len), [len_tmp] "=&r" (len_tmp)
+           : [buff] "r" (buff)
+           : "memory" );

The principle is that 'buff' points to the end of the buffer.
The 'length' (in %rcx) is negative and then increased until it hits zero.
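In case the trick is unclear, here is a plain-C sketch of the same
structure (my own naming, not kernel code): the pointer is advanced
past the end of the buffer and a negative offset counts up towards
zero; a 128-bit accumulator stands in for the carry kept in the
flags register by the asm:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sketch of the negative-offset loop; assumes n is a multiple of 8
 * (the asm handles the sub-8-byte tail separately). */
static uint64_t csum_neg_index(const void *buff, size_t n)
{
    const uint8_t *end = (const uint8_t *)buff + n; /* point past the end */
    intptr_t off = -(intptr_t)n;                    /* negative, counts up to 0 */
    unsigned __int128 sum = 0;

    while (off) {
        uint64_t v;
        memcpy(&v, end + off, sizeof(v));           /* unaligned-safe load */
        sum += v;
        off += 8;
    }
    /* fold the accumulated carries back in: the asm's final 'adc $0' */
    uint64_t lo = (uint64_t)sum, hi = (uint64_t)(sum >> 64);
    lo += hi;
    if (lo < hi)
        lo++;
    return lo;
}
```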

The asm loop above runs at 8 bytes/clock on anything recent (and approaches it on Ivy Bridge).
Splitting the 'add 32' did make a slight improvement.
If you aren't worried (too much) about cpus before Broadwell then IIRC
this loop gets close to 8 bytes/clock:

+               "10:    jecxz 20f\n"
+               "       adc   (%[buff], %[len]), %[sum]\n"
+               "       adc   8(%[buff], %[len]), %[sum]\n"
+               "       lea   16(%[len]), %[len]\n"
+               "       jmp   10b\n"
+               " 20:"

I also toyed with this loop:

            count = (lim + 7 - buf) & -64;
            buf += count;

            count = -count;
            asm(    "       xor   %[sum_odd], %[sum_odd]\n"    // Also clears carry and overflow
                    "10:    jrcxz 20f\n"
                    "       adcx    (%[buf], %[count]), %[sum]\n"
                    "       adox   8(%[buf], %[count]), %[sum_odd]\n"
                    "       adcx  16(%[buf], %[count]), %[sum]\n"
                    "       adox  24(%[buf], %[count]), %[sum_odd]\n"
                    "       adcx  32(%[buf], %[count]), %[sum]\n"
                    "       adox  40(%[buf], %[count]), %[sum_odd]\n"
                    "       adcx  48(%[buf], %[count]), %[sum]\n"
                    "       adox  56(%[buf], %[count]), %[sum_odd]\n"
                    "       lea   64(%[count]), %[count]\n"
                    "       jmp   10b\n"
                    "20:    adox  %[count], %[sum_odd]\n"  // [count] is zero
                    "       adcx  %[sum_odd], %[sum]\n"
                    "       adcx  %[count], %[sum]"
                : [sum] "=&r" (sum), [count] "=&c" (count), [sum_odd] "=&r" (sum_odd)
                : [buf] "r" (buf), "0" (sum), "1" (count)
                : "memory");

My notes say it achieved 12 bytes/clock on an i7-7700.
However it is only really useful for long aligned buffers.
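For reference, the two-carry-chain idea behind ADCX/ADOX can be
mimicked in plain C with two independent accumulators (a sketch
under my own naming, not kernel code); the point is that the two
adds per iteration have no dependency on each other:

```c
#include <stdint.h>
#include <stddef.h>

/* Accumulate alternate words into independent sums so the additions
 * can issue in parallel (like the CF and OF chains), then combine
 * with an end-around carry. */
static uint64_t csum_two_chains(const uint64_t *p, size_t nwords)
{
    unsigned __int128 even = 0, odd = 0;
    size_t i;

    for (i = 0; i + 1 < nwords; i += 2) {
        even += p[i];       /* the "adcx" chain */
        odd  += p[i + 1];   /* the "adox" chain */
    }
    if (i < nwords)
        even += p[i];       /* odd word count: pick up the tail */

    unsigned __int128 sum = even + odd;
    uint64_t lo = (uint64_t)sum, hi = (uint64_t)(sum >> 64);
    lo += hi;               /* end-around carry, like the final adcx pair */
    if (lo < hi)
        lo++;
    return lo;
}
```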

It is completely annoying that you can't use LOOP (dec %rcx and jump
non-zero) because it is very slow on Intel cpus (even ones that
support ADOX). JRCXZ is fine.

It is also possible to reduce the checksum to 16 bits using:
	sum = (sum % 0xffff) ^ 0xffff;
I think this is faster if gcc uses a 'multiply by reciprocal',
but (in my tests) it sometimes emitted a divide.
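For the curious, the modulo reduction and the usual shift-and-fold
agree, up to the one's-complement +0/-0 ambiguity when the sum is a
multiple of 0xffff; a small comparison sketch (names mine):

```c
#include <stdint.h>

/* The modulo form suggested above. */
static uint16_t fold_mod(uint64_t sum)
{
    return (uint16_t)((sum % 0xffff) ^ 0xffff);
}

/* The classic shift-and-fold, for comparison.  The two can differ
 * only when sum is a multiple of 0xffff, where one's complement has
 * two encodings of zero (0x0000 and 0xffff). */
static uint16_t fold_shift(uint64_t sum)
{
    sum = (sum & 0xffffffff) + (sum >> 32);
    sum = (sum & 0xffffffff) + (sum >> 32);   /* absorb the carry */
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)(~sum & 0xffff);
}
```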

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


Thread overview: 11+ messages
2021-11-11  6:53 [RFC] x86/csum: rewrite csum_partial() Eric Dumazet
2021-11-11  9:10 ` Peter Zijlstra
2021-11-11  9:44   ` Peter Zijlstra
2021-11-11 16:02     ` Eric Dumazet
2021-11-11 16:52       ` Eric Dumazet
2021-11-11 18:17         ` Andrew Cooper
2021-11-11 19:02           ` Eric Dumazet
2021-11-11 16:51 ` Alexander Duyck
2021-11-14 13:07 ` David Laight [this message]
2021-11-14 14:12   ` David Laight
2021-11-15 10:23     ` David Laight
