RE: [RFC] x86/csum: rewrite csum_partial()

From: David Laight <David.Laight@ACULAB.COM>
To: 'Eric Dumazet' <eric.dumazet@gmail.com>,
	"David S . Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>
Cc: netdev <netdev@vger.kernel.org>,
	Eric Dumazet <edumazet@google.com>,
	"x86@kernel.org" <x86@kernel.org>,
	Alexander Duyck <alexander.duyck@gmail.com>
Subject: RE: [RFC] x86/csum: rewrite csum_partial()
Date: Mon, 15 Nov 2021 10:23:31 +0000
Message-ID: <e08af965e5b4422e9b38d8ccd90f8e7b@AcuMS.aculab.com>
In-Reply-To: <3f7414264ba0456b9102dd63c695272e@AcuMS.aculab.com>

From: David Laight
> Sent: 14 November 2021 14:12
> ..
> > If you aren't worried (too much) about cpu before Broadwell then IIRC
> > this loop gets close to 8 bytes/clock:
> >
> > +               "10:    jecxz 20f\n"
> > +               "       adc   (%[buff], %[len]), %[sum]\n"
> > +               "       adc   8(%[buff], %[len]), %[sum]\n"
> > +               "       lea   16(%[len]), %[tmp]\n"
> > +               "       jmp   10b\n"
> > +               " 20:"
> 
> It is even possible a loop based on:
> 	10:	adc	(%[buff], %[len], 8), %sum
> 		inc	%[len]
> 		jnz	10b
> will run at 8 bytes per clock on very recent Intel cpu.

It doesn't on an i7-7700 (which I probably tested last year).

But the first loop does run twice as fast - and will only
be beaten by the adcx/adox loop.
So there is no need to unroll to more than 2 reads/loop.

For CPUs between Ivy Bridge and Broadwell you want to use
separate 'sum' registers to avoid the 2 clock latency
of the adc result.
That should beat the 4 bytes/clock of the current loop,
but it does need an extra unroll to get near 8 bytes/clock.
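
To make the 'separate sum registers' idea concrete, here is a rough
portable C sketch of the data flow only. The names add_carry64(),
fold64(), csum_one_chain() and csum_two_chains() are made up for
illustration and are not in the patch. The compiler may or may not
turn this into adc chains, so it says nothing about the clock counts
above; it only shows that two independent accumulators give the same
ones'-complement sum as a single dependency chain.

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit add with end-around carry (what an adc chain plus 'adc $0' computes). */
static uint64_t add_carry64(uint64_t a, uint64_t b)
{
	uint64_t s = a + b;

	return s + (s < b);	/* wrap the carry-out back into bit 0 */
}

/* Fold a 64-bit ones'-complement sum down to 16 bits. */
static uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Single accumulator: every add depends on the previous result. */
static uint64_t csum_one_chain(const uint64_t *p, size_t nwords)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < nwords; i++)
		sum = add_carry64(sum, p[i]);
	return sum;
}

/* Two accumulators: the two chains are independent until the final merge. */
static uint64_t csum_two_chains(const uint64_t *p, size_t nwords)
{
	uint64_t sum0 = 0, sum1 = 0;
	size_t i = 0;

	for (; i + 1 < nwords; i += 2) {
		sum0 = add_carry64(sum0, p[i]);
		sum1 = add_carry64(sum1, p[i + 1]);
	}
	if (i < nwords)		/* odd word left over */
		sum0 = add_carry64(sum0, p[i]);
	return add_carry64(sum0, sum1);
}
```

Since end-around-carry addition is associative and commutative, the
order in which words land in sum0 or sum1 does not matter, which is
what lets the chains be split in the first place.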

For older CPUs (Nehalem/Core 2) the 'jecxz' loop is about the
only way to 'loop carry' the carry flag without the
6 clock penalty for the partial flags register update.
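
For reference, a minimal self-contained sketch of that kind of loop as
GNU C inline asm. It is untuned, and the function name
csum_carry_loop() and the operand names are hypothetical, not taken
from the patch. It assumes len is a multiple of 16, and it uses jrcxz
(the 64-bit form, testing all of %rcx) rather than jecxz. The point is
that jrcxz, lea and jmp all leave the flags alone, so the carry from
the last adc of one iteration is still there for the first adc of the
next. A plain C fallback is included for non-x86-64 builds.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sum 'len' bytes (assumed multiple of 16) into 'sum' with end-around carry. */
static uint64_t csum_carry_loop(const void *buf, size_t len, uint64_t sum)
{
#if defined(__x86_64__) && defined(__GNUC__)
	const unsigned char *end = (const unsigned char *)buf + len;
	int64_t idx = -(int64_t)len;	/* negative offset, counts up to zero */

	__asm__ volatile(
		"	clc\n"				/* start with carry clear */
		"10:	jrcxz	20f\n"			/* exit when idx == 0, flags untouched */
		"	adcq	(%[end],%[idx]), %[sum]\n"
		"	adcq	8(%[end],%[idx]), %[sum]\n"
		"	leaq	16(%[idx]), %[idx]\n"	/* advance without touching flags */
		"	jmp	10b\n"
		"20:	adcq	$0, %[sum]\n"		/* fold in the last carry */
		: [sum] "+r" (sum), [idx] "+c" (idx)
		: [end] "r" (end)
		: "memory", "cc");
	return sum;
#else
	/* Portable fallback: same result, carry folded in after every add. */
	const unsigned char *p = (const unsigned char *)buf;

	for (size_t off = 0; off < len; off += 8) {
		uint64_t w;

		memcpy(&w, p + off, sizeof(w));
		sum += w;
		sum += (sum < w);
	}
	return sum;
#endif
}
```

The index has to live in %rcx for jrcxz to work, hence the "+c"
constraint, and the loop body has to stay within the short (8-bit)
branch range of jrcxz.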

	David
