linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: David Laight <David.Laight@ACULAB.COM>
To: 'Noah Goldstein' <goldstein.w.n@gmail.com>,
	Eric Dumazet <edumazet@google.com>
Cc: "tglx@linutronix.de" <tglx@linutronix.de>,
	"mingo@redhat.com" <mingo@redhat.com>,
	Borislav Petkov <bp@alien8.de>,
	"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
	X86 ML <x86@kernel.org>, "hpa@zytor.com" <hpa@zytor.com>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"alexanderduyck@fb.com" <alexanderduyck@fb.com>,
	"open list" <linux-kernel@vger.kernel.org>,
	netdev <netdev@vger.kernel.org>
Subject: RE: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c
Date: Thu, 2 Dec 2021 21:11:41 +0000	[thread overview]
Message-ID: <ca8dcc5b6fbf47b29d55a2ab9815c182@AcuMS.aculab.com> (raw)
In-Reply-To: <CAFUsyfJticWKb3fv12r5L5QZ0AVxytWqtPVkYKeFYLW3K1SMNw@mail.gmail.com>

From: Noah Goldstein
> Sent: 02 December 2021 20:19
> 
> On Thu, Dec 2, 2021 at 9:01 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Thu, Dec 2, 2021 at 6:24 AM David Laight <David.Laight@aculab.com> wrote:
> > >
> > > I've dug out my test program and measured the performance of
> > > various copied of the inner loop - usually 64 bytes/iteration.
> > > Code is below.
> > >
> > > It uses the hardware performance counter to get the number of
> > > clocks the inner loop takes.
> > > This is reasonable stable once the branch predictor has settled down.
> > > So the different in clocks between a 64 byte buffer and a 128 byte
> > > buffer is the number of clocks for 64 bytes.
> 
> Intuitively 10 passes is a bit low.

I'm doing 10 separate measurements.
The first one is much slower because the cache is cold.
All the ones after (typically) number 5 or 6 tend to give the same answer.
10 is plenty to give you that 'warm fuzzy feeling' that you've got
a consistent answer.

Run the program 5 or 6 times with the same parameters and you sometimes
get a different stable value - probably something to do with stack and
data physical pages.
Was more obvious when I was timing a system call.

> Also you might consider aligning
> the `csum64` function and possibly the loops.

Won't matter here, instruction decode isn't the problem.
Also the uops all come out of the loop uop cache.

> There a reason you put ` jrcxz` at the beginning of the loops instead
> of the end?

jrcxz is 'jump if cx zero' - hard to use at the bottom of a loop!

The 'paired' loop end instruction is 'loop' - decrement %cx and jump non-zero.
But that is 7+ cycles on current Intel cpu (ok on amd ones).

I can get a two clock loop with jrcxz and jmp - as in the examples.
But it is more stable taken out to 4 clocks.

You can't do a one clock loop :-(

> > > (Unlike the TSC the pmc count doesn't depend on the cpu frequency.)
> > >
> > > What is interesting is that even some of the trivial loops appear
> > > to be doing 16 bytes per clock for short buffers - which is impossible.
> > > Checksum 1k bytes and you get an entirely different answer.
> > > The only loop that really exceeds 8 bytes/clock for long buffers
> > > is the adxc/adoc one.
> > >
> > > What is almost certainly happening is that all the memory reads and
> > > the dependant add/adc instructions are all queued up in the 'out of
> > > order' execution unit.
> > > Since 'rdpmc' isn't a serialising instruction they can still be
> > > outstanding when the function returns.
> > > Uncomment the 'rdtsc' and you get much slower values for short buffers.
> 
> Maybe add an `lfence` before / after `csum64`

That's probably less strong than rdtsc, I might try it.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

  reply	other threads:[~2021-12-02 21:11 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-25 19:38 [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c Noah Goldstein
2021-11-26  1:50 ` Eric Dumazet
2021-11-26  2:15   ` Noah Goldstein
2021-11-26  2:18     ` Noah Goldstein
2021-11-26  2:38       ` Noah Goldstein
2021-11-28 19:47         ` David Laight
2021-11-28 20:59           ` Noah Goldstein
2021-11-28 22:41             ` David Laight
2021-12-02 14:24             ` David Laight
2021-12-02 15:01               ` Eric Dumazet
2021-12-02 20:19                 ` Noah Goldstein
2021-12-02 21:11                   ` David Laight [this message]
2021-11-26 16:08     ` Eric Dumazet
2021-11-26 18:17       ` Noah Goldstein
2021-11-26 18:27         ` Eric Dumazet
2021-11-26 18:50           ` Noah Goldstein
2021-11-26 19:14             ` Noah Goldstein
2021-11-26 19:21               ` Eric Dumazet
2021-11-26 19:50                 ` Noah Goldstein
2021-11-26 20:07                   ` Eric Dumazet
2021-11-26 20:33                     ` Noah Goldstein
2021-11-27  0:15                       ` Eric Dumazet
2021-11-27  0:39                         ` Noah Goldstein
2021-11-26 18:17       ` Eric Dumazet
2021-11-27  4:25 ` [PATCH v2] " Noah Goldstein
2021-11-27  6:03   ` Eric Dumazet
2021-11-27  6:38     ` Noah Goldstein
2021-11-27  6:39 ` [PATCH v3] " Noah Goldstein
2021-11-27  6:51   ` Eric Dumazet
2021-11-27  7:18     ` Noah Goldstein

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ca8dcc5b6fbf47b29d55a2ab9815c182@AcuMS.aculab.com \
    --to=david.laight@aculab.com \
    --cc=alexanderduyck@fb.com \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=edumazet@google.com \
    --cc=goldstein.w.n@gmail.com \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).