RE: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c

From: David Laight <David.Laight@ACULAB.COM>
To: 'Noah Goldstein' <goldstein.w.n@gmail.com>,
	Eric Dumazet <edumazet@google.com>
Cc: "tglx@linutronix.de" <tglx@linutronix.de>,
	"mingo@redhat.com" <mingo@redhat.com>,
	Borislav Petkov <bp@alien8.de>,
	"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
	X86 ML <x86@kernel.org>, "hpa@zytor.com" <hpa@zytor.com>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"alexanderduyck@fb.com" <alexanderduyck@fb.com>,
	"open list" <linux-kernel@vger.kernel.org>,
	netdev <netdev@vger.kernel.org>
Subject: RE: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c
Date: Thu, 2 Dec 2021 21:11:41 +0000
Message-ID: <ca8dcc5b6fbf47b29d55a2ab9815c182@AcuMS.aculab.com>
In-Reply-To: <CAFUsyfJticWKb3fv12r5L5QZ0AVxytWqtPVkYKeFYLW3K1SMNw@mail.gmail.com>

From: Noah Goldstein
> Sent: 02 December 2021 20:19
> 
> On Thu, Dec 2, 2021 at 9:01 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Thu, Dec 2, 2021 at 6:24 AM David Laight <David.Laight@aculab.com> wrote:
> > >
> > > I've dug out my test program and measured the performance of
> > > various copies of the inner loop - usually 64 bytes/iteration.
> > > Code is below.
> > >
> > > It uses the hardware performance counter to get the number of
> > > clocks the inner loop takes.
> > > This is reasonably stable once the branch predictor has settled down.
> > > So the difference in clocks between a 64 byte buffer and a 128 byte
> > > buffer is the number of clocks for 64 bytes.
> 
> Intuitively 10 passes is a bit low.

I'm doing 10 separate measurements.
The first one is much slower because the cache is cold.
All the ones after (typically) number 5 or 6 tend to give the same answer.
10 is plenty to give you that 'warm fuzzy feeling' that you've got
a consistent answer.

Run the program 5 or 6 times with the same parameters and you sometimes
get a different stable value - probably something to do with stack and
data physical pages.
Was more obvious when I was timing a system call.
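
As a rough sketch of that procedure (not my actual test program): time 10
separate calls per buffer length, then subtract the settled 64 byte figure
from the settled 128 byte one. The names csum_words(), now_ns(), measure()
and per_block_cost() are made up for this example, and clock_gettime() is
only a portable stand-in for the pmc read, so the units are ns, not clocks.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define PASSES 10

/* Stand-in for the loop under test: 64-bit sum with end-around carry. */
uint64_t csum_words(const unsigned char *buf, size_t len)
{
	uint64_t sum = 0, v;

	assert(len % 8 == 0);
	for (size_t i = 0; i < len; i += 8) {
		memcpy(&v, buf + i, 8);
		sum += v;
		sum += sum < v;		/* add the carry back in */
	}
	return sum;
}

/* Assumption: wall-clock ns instead of the hardware performance counter. */
uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Time PASSES separate calls; ticks[0] is the cold-cache one. */
uint64_t measure(const unsigned char *buf, size_t len, uint64_t ticks[PASSES])
{
	uint64_t sum = 0;

	for (int i = 0; i < PASSES; i++) {
		uint64_t start = now_ns();

		sum = csum_words(buf, len);
		ticks[i] = now_ns() - start;
	}
	return sum;
}

/* Cost of the extra bytes: difference between the last (settled) passes. */
int64_t per_block_cost(const uint64_t shorter[PASSES],
		       const uint64_t longer[PASSES])
{
	return (int64_t)longer[PASSES - 1] - (int64_t)shorter[PASSES - 1];
}
```

Print all 10 values of ticks[] for each length, not just the last one;
that is what shows whether the numbers have settled.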

> Also you might consider aligning
> the `csum64` function and possibly the loops.

Won't matter here, instruction decode isn't the problem.
Also the uops all come out of the loop uop cache.

> Is there a reason you put `jrcxz` at the beginning of the loops instead
> of the end?

jrcxz is 'jump if cx zero' - hard to use at the bottom of a loop!

The 'paired' loop end instruction is 'loop' - decrement %rcx and jump if non-zero.
But that is 7+ cycles on current Intel cpus (ok on amd ones).

I can get a two clock loop with jrcxz and jmp - as in the examples.
But it is more stable taken out to 4 clocks.

You can't do a one clock loop :-(
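
The loop shape looks something like the following cut-down sketch. It is
not the code from the test program: it does one adc per iteration instead
of 64 bytes, the name csum_jrcxz() is invented here, and the plain C branch
is only there so it builds on non-x86. The point is that neither jrcxz nor
lea changes the carry flag, so the adc chain survives the loop control.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sum 'len' bytes (a multiple of 8) as 64-bit words, end-around carry. */
uint64_t csum_jrcxz(const unsigned char *buf, size_t len)
{
	uint64_t sum = 0;

	assert(len % 8 == 0);
#if defined(__x86_64__) && defined(__GNUC__)
	const unsigned char *end = buf + len;
	long off = -(long)len;	/* negative offset in %rcx, counts up to 0 */

	__asm__ volatile(
		"	clc\n"
		"1:	jrcxz	2f\n"			/* exit at the top when %rcx == 0 */
		"	adcq	(%[end],%[off]), %[sum]\n"
		"	leaq	8(%[off]), %[off]\n"	/* advance, carry flag untouched */
		"	jmp	1b\n"
		"2:	adcq	$0, %[sum]\n"		/* fold in the final carry */
		: [sum] "+r" (sum), [off] "+c" (off)
		: [end] "r" (end)
		: "cc", "memory");
#else
	/* Portable fallback (assumption: only here for non-x86 builds). */
	for (size_t i = 0; i < len; i += 8) {
		uint64_t v;

		memcpy(&v, buf + i, 8);
		sum += v;
		sum += sum < v;
	}
#endif
	return sum;
}
```

A zero length falls straight through the jrcxz, which is the other reason
for having the test at the top.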

> > > (Unlike the TSC the pmc count doesn't depend on the cpu frequency.)
> > >
> > > What is interesting is that even some of the trivial loops appear
> > > to be doing 16 bytes per clock for short buffers - which is impossible.
> > > Checksum 1k bytes and you get an entirely different answer.
> > > The only loop that really exceeds 8 bytes/clock for long buffers
> > > is the adcx/adox one.
> > >
> > > What is almost certainly happening is that all the memory reads and
> > > the dependent add/adc instructions are all queued up in the 'out of
> > > order' execution unit.
> > > Since 'rdpmc' isn't a serialising instruction they can still be
> > > outstanding when the function returns.
> > > Uncomment the 'rdtsc' and you get much slower values for short buffers.
> 
> Maybe add an `lfence` before / after `csum64`

That's probably less strong than rdtsc, I might try it.
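
Something like this is what I'd try for the lfence version. It is a sketch
under assumptions: it reads the TSC through the compiler's __rdtsc()
intrinsic instead of rdpmc (which needs the counter set up and enabled for
user space), and fenced_ticks(), time_call() and byte_sum() are names
invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <x86intrin.h>

/* Counter read with an lfence on either side. */
static inline uint64_t fenced_ticks(void)
{
	uint64_t t;

	_mm_lfence();	/* let earlier loads and adds finish first */
	t = __rdtsc();
	_mm_lfence();	/* don't let the timed code start early */
	return t;
}
#else
#include <time.h>

/* Assumption: non-x86 fallback so the sketch still builds, no fencing. */
static inline uint64_t fenced_ticks(void)
{
	return (uint64_t)clock();
}
#endif

typedef uint64_t (*csum_fn)(const unsigned char *buf, size_t len);

/* Trivial stand-in for csum64(); hypothetical, just something to time. */
uint64_t byte_sum(const unsigned char *buf, size_t len)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}

/* Returns elapsed ticks for one call of fn; the sum goes to *result. */
uint64_t time_call(csum_fn fn, const unsigned char *buf, size_t len,
		   uint64_t *result)
{
	uint64_t start;

	assert(fn != NULL && result != NULL);
	start = fenced_ticks();
	*result = fn(buf, len);
	return fenced_ticks() - start;
}
```

If the short-buffer numbers go up with the fences in place, the unfenced
figures were missing work that was still in flight.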

	David
