From: David Laight <David.Laight@ACULAB.COM>
To: 'Eric Dumazet' <eric.dumazet@gmail.com>,
	"David S . Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>
Cc: netdev <netdev@vger.kernel.org>,
	Eric Dumazet <edumazet@google.com>,
	"x86@kernel.org" <x86@kernel.org>,
	Alexander Duyck <alexander.duyck@gmail.com>
Subject: RE: [RFC] x86/csum: rewrite csum_partial()
Date: Sun, 14 Nov 2021 13:07:06 +0000	[thread overview]
Message-ID: <e6fcc05d59974ba9afa49ba07a7251aa@AcuMS.aculab.com> (raw)
In-Reply-To: <20211111065322.1261275-1-eric.dumazet@gmail.com>

From: Eric Dumazet
> Sent: 11 November 2021 06:53
> 
> With more NICs supporting CHECKSUM_COMPLETE, and IPv6 being widely
> used, csum_partial() is heavily used with small numbers of bytes,
> and is consuming many cycles.
> 
> IPv6 header size for instance is 40 bytes.
> 
> Another thing to consider is that NET_IP_ALIGN is 0 on x86,
> meaning that network headers in RX path are not word-aligned,
> unless the driver forces this.
> 
> This means that csum_partial() fetches one u16
> to 'align the buffer', then performs seven u64 additions
> with carry in a loop, then a remaining u32, then a remaining u16.
> 
> With this new version, we perform 10 u32 adds with carry, to
> avoid the expensive 64->32 transformation. Using 5 u64 adds
> plus one add32_with_carry() is more expensive.
> 
> Also note that this avoids loops for less than ~60 bytes.

I spent far too long looking at this code a while back :-)
I did post a patch - probably 10th May 2020.

Prior to Sandy Bridge, ADC always took two clocks.
Even on Sandy Bridge there is a two clock delay for the sum;
only the carry flag is available earlier.
Broadwell (and I think all AMD cpus) do ADC in one clock.
The delay can be avoided by adding to alternate registers.
There are also issues on some cpus with partial updates
to the flags register (DEC sets Z but not C) causing unwanted
register dependencies.

I think misaligned memory reads take an extra clock.
But more recent cpus can do two memory reads per clock,
so unless the code is trying to beat 8 bytes/clock it
shouldn't have much effect.

The fastest loop I found (for large buffers) used:

+       asm(    "       bt    $4, %[len]\n"
+               "       jnc   10f\n"
+               "       add   (%[buff], %[len]), %[sum_0]\n"
+               "       adc   8(%[buff], %[len]), %[sum_1]\n"
+               "       lea   16(%[len]), %[len]\n"
+               "10:    jecxz 20f\n"
+               "       adc   (%[buff], %[len]), %[sum_0]\n"
+               "       adc   8(%[buff], %[len]), %[sum_1]\n"
+               "       lea   32(%[len]), %[len_tmp]\n"
+               "       adc   16(%[buff], %[len]), %[sum_0]\n"
+               "       adc   24(%[buff], %[len]), %[sum_1]\n"
+               "       mov   %[len_tmp], %[len]\n"
+               "       jmp   10b\n"
+               "20:    adc   %[sum_0], %[sum]\n"
+               "       adc   %[sum_1], %[sum]\n"
+               "       adc   $0, %[sum]\n"
+           : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1),
+               [len] "+&c" (len), [len_tmp] "=&r" (len_tmp)
+           : [buff] "r" (buff)
+           : "memory" );

The principle is that 'buff' points to the end of the buffer.
The 'length' (in %rcx) is negative and then increased until it hits zero.
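In case the trick is unclear, here is a plain-C sketch of the same
structure (my own naming, not kernel code): the pointer is advanced
past the end of the buffer and a negative offset counts up towards
zero; a 128-bit accumulator stands in for the carry kept in the
flags register by the asm:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sketch of the negative-offset loop; assumes n is a multiple of 8
 * (the asm handles the sub-8-byte tail separately). */
static uint64_t csum_neg_index(const void *buff, size_t n)
{
    const uint8_t *end = (const uint8_t *)buff + n; /* point past the end */
    intptr_t off = -(intptr_t)n;                    /* negative, counts up to 0 */
    unsigned __int128 sum = 0;

    while (off) {
        uint64_t v;
        memcpy(&v, end + off, sizeof(v));           /* unaligned-safe load */
        sum += v;
        off += 8;
    }
    /* fold the accumulated carries back in: the asm's final 'adc $0' */
    uint64_t lo = (uint64_t)sum, hi = (uint64_t)(sum >> 64);
    lo += hi;
    if (lo < hi)
        lo++;
    return lo;
}
```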

The asm loop above runs at 8 bytes/clock on anything recent (and approaches it on Ivy Bridge).
Splitting the 'add 32' did make a slight improvement.
If you aren't worried (too much) about cpus before Broadwell then IIRC
this loop gets close to 8 bytes/clock:

+               "10:    jecxz 20f\n"
+               "       adc   (%[buff], %[len]), %[sum]\n"
+               "       adc   8(%[buff], %[len]), %[sum]\n"
+               "       lea   16(%[len]), %[len]\n"
+               "       jmp   10b\n"
+               " 20:"

I also toyed with this loop:

            count = (lim + 7 - buf) & -64;
            buf += count;

            count = -count;
            asm(    "       xor   %[sum_odd], %[sum_odd]\n"    // Also clears carry and overflow
                    "10:    jrcxz 20f\n"
                    "       adcx    (%[buf], %[count]), %[sum]\n"
                    "       adox   8(%[buf], %[count]), %[sum_odd]\n"
                    "       adcx  16(%[buf], %[count]), %[sum]\n"
                    "       adox  24(%[buf], %[count]), %[sum_odd]\n"
                    "       adcx  32(%[buf], %[count]), %[sum]\n"
                    "       adox  40(%[buf], %[count]), %[sum_odd]\n"
                    "       adcx  48(%[buf], %[count]), %[sum]\n"
                    "       adox  56(%[buf], %[count]), %[sum_odd]\n"
                    "       lea   64(%[count]), %[count]\n"
                    "       jmp   10b\n"
                    "20:    adox  %[count], %[sum_odd]\n"  // [count] is zero
                    "       adcx  %[sum_odd], %[sum]\n"
                    "       adcx  %[count], %[sum]"
                : [sum] "=&r" (sum), [count] "=&c" (count), [sum_odd] "=&r" (sum_odd)
                : [buf] "r" (buf), "0" (sum), "1" (count)
                : "memory");

My notes say it achieved 12 bytes/clock on an i7-7700.
However it is only really useful for long aligned buffers.
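For reference, the two-carry-chain idea behind ADCX/ADOX can be
mimicked in plain C with two independent accumulators (a sketch
under my own naming, not kernel code); the point is that the two
adds per iteration have no dependency on each other:

```c
#include <stdint.h>
#include <stddef.h>

/* Accumulate alternate words into independent sums so the additions
 * can issue in parallel (like the CF and OF chains), then combine
 * with an end-around carry. */
static uint64_t csum_two_chains(const uint64_t *p, size_t nwords)
{
    unsigned __int128 even = 0, odd = 0;
    size_t i;

    for (i = 0; i + 1 < nwords; i += 2) {
        even += p[i];       /* the "adcx" chain */
        odd  += p[i + 1];   /* the "adox" chain */
    }
    if (i < nwords)
        even += p[i];       /* odd word count: pick up the tail */

    unsigned __int128 sum = even + odd;
    uint64_t lo = (uint64_t)sum, hi = (uint64_t)(sum >> 64);
    lo += hi;               /* end-around carry, like the final adcx pair */
    if (lo < hi)
        lo++;
    return lo;
}
```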

It is completely annoying that you can't use LOOP (dec %rcx and jump
non-zero) because it is very slow on Intel cpus (even ones that
support ADOX). JRCXZ is fine.

It is also possible to reduce the checksum to 16 bits using:
	sum = (sum % 0xffff) ^ 0xffff;
I think this is faster if gcc uses a 'multiply by reciprocal',
but (in my tests) it sometimes emitted a divide.
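For the curious, the modulo reduction and the usual shift-and-fold
agree, up to the one's-complement +0/-0 ambiguity when the sum is a
multiple of 0xffff; a small comparison sketch (names mine):

```c
#include <stdint.h>

/* The modulo form suggested above. */
static uint16_t fold_mod(uint64_t sum)
{
    return (uint16_t)((sum % 0xffff) ^ 0xffff);
}

/* The classic shift-and-fold, for comparison.  The two can differ
 * only when sum is a multiple of 0xffff, where one's complement has
 * two encodings of zero (0x0000 and 0xffff). */
static uint16_t fold_shift(uint64_t sum)
{
    sum = (sum & 0xffffffff) + (sum >> 32);
    sum = (sum & 0xffffffff) + (sum >> 32);   /* absorb the carry */
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)(~sum & 0xffff);
}
```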

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


Thread overview: 11+ messages
2021-11-11  6:53 [RFC] x86/csum: rewrite csum_partial() Eric Dumazet
2021-11-11  9:10 ` Peter Zijlstra
2021-11-11  9:44   ` Peter Zijlstra
2021-11-11 16:02     ` Eric Dumazet
2021-11-11 16:52       ` Eric Dumazet
2021-11-11 18:17         ` Andrew Cooper
2021-11-11 19:02           ` Eric Dumazet
2021-11-11 16:51 ` Alexander Duyck
2021-11-14 13:07 ` David Laight [this message]
2021-11-14 14:12   ` David Laight
2021-11-15 10:23     ` David Laight
