From: David Laight <David.Laight@ACULAB.COM>
To: 'Noah Goldstein' <goldstein.w.n@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	"mingo@redhat.com" <mingo@redhat.com>,
	Borislav Petkov <bp@alien8.de>,
	"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
	X86 ML <x86@kernel.org>, "hpa@zytor.com" <hpa@zytor.com>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"alexanderduyck@fb.com" <alexanderduyck@fb.com>,
	open list <linux-kernel@vger.kernel.org>
Subject: RE: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c
Date: Sun, 28 Nov 2021 22:41:28 +0000
Message-ID: <1ac1f60c643a478d84862ac264437d14@AcuMS.aculab.com>
In-Reply-To: <CAFUsyfLoEckBrnYKUgqWC7AJPTBDfarjBOgBvtK7eGVZj9muYQ@mail.gmail.com>

From: Noah Goldstein
> Sent: 28 November 2021 21:00
> 
> On Sun, Nov 28, 2021 at 1:47 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > ...
> > > Regarding the 32 byte case, adding two accumulators helps with the latency
> > > numbers but causes a regression in throughput for the 40/48 byte cases. Which
> > > is the more important metric for the usage of csum_partial()?
> > >
> > > Here are the numbers for the smaller sizes:
> > >
> > > size, lat old,    lat ver2,    lat ver1,    tput old,   tput ver2,   tput ver1
> > >    0,   4.961,       4.503,       4.901,       4.887,       4.399,       4.951
> > >    8,   5.590,       5.594,       5.620,       4.227,       4.110,       4.252
> > >   16,   6.182,       6.398,       6.202,       4.233,       4.062,       4.278
> > >   24,   7.392,       7.591,       7.380,       4.256,       4.246,       4.279
> > >   32,   7.371,       6.366,       7.390,       4.550,       4.900,       4.537
> > >   40,   8.621,       7.496,       8.601,       4.862,       5.162,       4.836
> > >   48,   9.406,       8.128,       9.374,       5.206,       5.736,       5.234
> > >   56,  10.535,       9.189,      10.522,       5.416,       5.772,       5.447
> > >   64,  10.000,       7.487,       7.590,       6.946,       6.975,       6.989
> > >   72,  11.192,       8.639,       8.763,       7.210,       7.311,       7.277
> > >   80,  11.734,       9.179,       9.409,       7.605,       7.620,       7.548
> > >   88,  12.933,      10.545,      10.584,       7.878,       7.902,       7.858
> > >   96,  12.952,       9.331,      10.625,       8.168,       8.470,       8.206
> > >  104,  14.206,      10.424,      11.839,       8.491,       8.785,       8.502
> > >  112,  14.763,      11.403,      12.416,       8.798,       9.134,       8.771
> > >  120,  15.955,      12.635,      13.651,       9.175,       9.494,       9.130
> > >  128,  15.271,      10.599,      10.724,       9.726,       9.672,       9.655
> > >
> > > 'ver2' uses two accumulators for 32 byte case and has better latency numbers
> > > but regressions in tput compared to 'old' and 'ver1'. 'ver1' is the
> > > implementation posted, which has essentially the same numbers for
> > > tput/lat as 'old' for sizes [0, 63].
> >
> > Which cpu are you testing on - it will make a big difference?
> 
> Tigerlake, although assuming `adc` as the bottleneck, the results
> should be largely independent.

The cpu definitely makes a difference.
Although the big changes for mainstream Intel cpus were between
Sandy/Ivy Bridge and Haswell/Broadwell: they improved the latency of
adc from 2 clocks to 1 clock.
The later cpus also have extra instruction ports - which can help
parallel execution.
That could well make a big difference where you've duplicated the adc chain.

Interestingly, an adc chain can only do one add (8 bytes) per clock.
But you should be able to do two 4-byte loads and adds into two
separate 64-bit registers every clock.
That is also 8 bytes/clock.
So a C loop:
	for (; buf_32 != buf_end; buf_32 += 2) {
		sum_64a += buf_32[0];
		sum_64b += buf_32[1];
	}
Might be as fast as the asm one!
It probably needs unrolling once, and may need 'adjusting'
so that the loop test is 'add $d, %reg; jnz 1b'.
The 'add' and 'jnz' should then get 'fused' into a single u-op.
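
Something like this is what I have in mind - just a rough sketch, not
the posted patch: it assumes the length is a multiple of 16 bytes and
leaves out the tail bytes and the final fold down to 16 bits:

	#include <stddef.h>
	#include <stdint.h>

	/* Two independent accumulators so the adds can issue in parallel. */
	static uint64_t csum_words(const uint32_t *buf_32, size_t len)
	{
		uint64_t sum_64a = 0, sum_64b = 0;
		const uint32_t *buf_end = buf_32 + len / 4;

		/* Unrolled once: four 32-bit loads, two adds per chain. */
		for (; buf_32 != buf_end; buf_32 += 4) {
			sum_64a += buf_32[0];
			sum_64b += buf_32[1];
			sum_64a += buf_32[2];
			sum_64b += buf_32[3];
		}
		return sum_64a + sum_64b;
	}

Whether the compiler actually closes the loop with the fused
'add $16, %reg; jnz' form would need checking in the generated asm.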

> > And what are you measuring throughput in?
> 
> Running back to back iterations with the same input without any
> dependency between iterations. The OoO window will include
> multiple iterations at once.
> 
> > And are you testing aligned or mis-aligned 64bit reads?
> 
> Aligned as that is the common case.

I was thinking that the code to align the buffer start may not
be needed - so the test can be removed without affecting performance.
Especially if aligned buffers are 'expected'.
So you just have to support variable lengths, not misaligned starts.

> > I think one of the performance counters will give 'cpu clocks'.
> 
> Time is in Ref Cycles using `rdtsc`

Hmmm...
Unless you manage to lock the cpu frequency governor (or whatever it
is called), rdtsc is almost useless for timing instructions.
The cpu hardware will change the clock multiplier on you.
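
The performance counter is the better source of 'cpu clocks'.
From user space something along these lines should do (a rough sketch,
error handling omitted):

	#include <linux/perf_event.h>
	#include <sys/syscall.h>
	#include <string.h>
	#include <unistd.h>

	/* Open a counter for actual core cycles, not rdtsc ref cycles. */
	static int open_cycle_counter(void)
	{
		struct perf_event_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = PERF_TYPE_HARDWARE;
		attr.config = PERF_COUNT_HW_CPU_CYCLES;
		attr.exclude_kernel = 1;
		attr.exclude_hv = 1;

		/* This process, whichever cpu it happens to run on. */
		return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
	}

read() a 64-bit count from the fd before and after each pass and take
the difference.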

I've used histogram[delta_cycles() >> n]++ (with bounds checking) to
get a distribution of the cost of each pass - rather than an average.
I last used that testing writev("/dev/null", ...).
Each run gave a sharp peak - but it differed run to run!
Possibly because of the relative alignment of the stacks and buffers.
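
The shape of that measurement was roughly the following - with
delta_cycles here being whatever cycle delta the pass measured
(e.g. from the counter above); the shift and table size are arbitrary:

	#include <stdint.h>

	#define BUCKETS 256

	static unsigned int histogram[BUCKETS];

	/* One histogram entry per pass; '>> 4' trades resolution for size. */
	static void record_pass(uint64_t delta_cycles)
	{
		uint64_t bucket = delta_cycles >> 4;

		if (bucket >= BUCKETS)		/* the bound check */
			bucket = BUCKETS - 1;
		histogram[bucket]++;
	}

Dumping the table at the end shows the peak (and any outliers) rather
than a single averaged number.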

..
> > Using adcx/adox together is a right PITA.
> 
> I'm a bit hesitant about adcx/adox because they are an extension so
> support will need to be tested.

Indeed.

> https://lore.kernel.org/lkml/CANn89iLpFOok_tv=DKsLX1mxZGdHQgATdW4Xs0rc6oaXQEa5Ng@mail.gmail.com/T/

Ah that is your patch that just changes the asm() block.
I was looking for Eric's changes as well.

I spent too long trying to optimise this code last year.
Mostly because I found a really crap version in one of our applications
that is checksumming UDP/IPv4 RTP packets before sending them on a RAW
socket. These are about 200 bytes each and we are sending 100s every 20ms.

	David


Thread overview: 30+ messages
2021-11-25 19:38 [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c Noah Goldstein
2021-11-26  1:50 ` Eric Dumazet
2021-11-26  2:15   ` Noah Goldstein
2021-11-26  2:18     ` Noah Goldstein
2021-11-26  2:38       ` Noah Goldstein
2021-11-28 19:47         ` David Laight
2021-11-28 20:59           ` Noah Goldstein
2021-11-28 22:41             ` David Laight [this message]
2021-12-02 14:24             ` David Laight
2021-12-02 15:01               ` Eric Dumazet
2021-12-02 20:19                 ` Noah Goldstein
2021-12-02 21:11                   ` David Laight
2021-11-26 16:08     ` Eric Dumazet
2021-11-26 18:17       ` Noah Goldstein
2021-11-26 18:27         ` Eric Dumazet
2021-11-26 18:50           ` Noah Goldstein
2021-11-26 19:14             ` Noah Goldstein
2021-11-26 19:21               ` Eric Dumazet
2021-11-26 19:50                 ` Noah Goldstein
2021-11-26 20:07                   ` Eric Dumazet
2021-11-26 20:33                     ` Noah Goldstein
2021-11-27  0:15                       ` Eric Dumazet
2021-11-27  0:39                         ` Noah Goldstein
2021-11-26 18:17       ` Eric Dumazet
2021-11-27  4:25 ` [PATCH v2] " Noah Goldstein
2021-11-27  6:03   ` Eric Dumazet
2021-11-27  6:38     ` Noah Goldstein
2021-11-27  6:39 ` [PATCH v3] " Noah Goldstein
2021-11-27  6:51   ` Eric Dumazet
2021-11-27  7:18     ` Noah Goldstein
