All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Laight <David.Laight@ACULAB.COM>
To: 'Arvind Sankar' <nivedita@alum.mit.edu>
Cc: Herbert Xu <herbert@gondor.apana.org.au>,
	"David S. Miller" <davem@davemloft.net>,
	"linux-crypto@vger.kernel.org" <linux-crypto@vger.kernel.org>,
	Eric Biggers <ebiggers@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Eric Biggers <ebiggers@google.com>
Subject: RE: [PATCH v4 6/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops
Date: Sun, 25 Oct 2020 23:23:52 +0000	[thread overview]
Message-ID: <5d8f86fcfe84441fa5c9877959069ff1@AcuMS.aculab.com> (raw)
In-Reply-To: <20201025201820.GA1237388@rani.riverdale.lan>

From: Arvind Sankar
> Sent: 25 October 2020 20:18
> 
> On Sun, Oct 25, 2020 at 06:51:18PM +0000, David Laight wrote:
> > From: Arvind Sankar
> > > Sent: 25 October 2020 14:31
> > >
> > > Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
> > > (tested on Broadwell Xeon) while not increasing code size too much.
> >
> > I can't believe unrolling the BLEND loop makes any difference.
> 
> It's actually the BLEND loop that accounts for almost all of the
> difference. The LOAD loop doesn't matter much in general: even replacing
> it with a plain memcpy() only increases performance by 3-4%. But
> unrolling it is low cost in code size terms, and clang actually does it
> without being asked.

(memcpy is wrong - misses the byte swaps).

That's odd, the BLEND loop is about 20 instructions.
I wouldn't expect unrolling to help - unless you manage
to use 16 registers for the active W[] values.

> > WRT patch 5.
> > On the C2758 the original unrolled code is slightly faster.
> > On the i7-7700 the 8 unroll is a bit faster 'hot cache',
> > but slower 'cold cache' - probably because of the d-cache
> > loads for K[].
> >
> > Non-x86 architectures might need to use d-cache reads for
> > the 32bit 'K' constants even in the unrolled loop.
> > X86 can use 'lea' with a 32bit offset to avoid data reads.
> > So the cold-cache case for the old code may be similar.
> 
> Not sure I follow: in the old code, the K's are 32-bit immediates, so
> they should come from the i-cache whether an add or an lea is used?

I was thinking of other instruction sets that end up using pc-relative
addressing for constants.
Might only happen for 64bint ones though.

> Why is the cold-cache case relevant anyway? If the code is only being
> executed a couple of times or so, i.e. you're hashing a single say
> 64-128 byte input once in a blue moon, the performance of the hash
> doesn't really matter, no?

I was measuring the cold cache one because I could.
I didn't note the actual figures but it was 8-10 times slower
that the hot-cache case.
While sha256 is likely to be run hot-cache (on a big buffer)
the cold-cache timing are probably relevant for things like memcpy().
I remember seeing a very long divide function for sparc32 that
was probably only a gain in a benchmark loop - it would have
displaced a lot of the working set from the i-cache!

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

  reply	other threads:[~2020-10-25 23:23 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-25 14:31 [PATCH v4 0/6] crypto: lib/sha256 - cleanup/optimization Arvind Sankar
2020-10-25 14:31 ` [PATCH v4 1/6] crypto: lib/sha256 - Use memzero_explicit() for clearing state Arvind Sankar
2020-10-26  7:59   ` Ard Biesheuvel
2020-10-25 14:31 ` [PATCH v4 2/6] crypto: " Arvind Sankar
2020-10-26  7:58   ` Ard Biesheuvel
2020-10-25 14:31 ` [PATCH v4 3/6] crypto: lib/sha256 - Don't clear temporary variables Arvind Sankar
2020-10-26  7:59   ` Ard Biesheuvel
2020-10-25 14:31 ` [PATCH v4 4/6] crypto: lib/sha256 - Clear W[] in sha256_update() instead of sha256_transform() Arvind Sankar
2020-10-26  8:00   ` Ard Biesheuvel
2020-10-25 14:31 ` [PATCH v4 5/6] crypto: lib/sha256 - Unroll SHA256 loop 8 times intead of 64 Arvind Sankar
2020-10-26  8:00   ` Ard Biesheuvel
2020-10-25 14:31 ` [PATCH v4 6/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops Arvind Sankar
2020-10-25 18:51   ` David Laight
2020-10-25 20:18     ` Arvind Sankar
2020-10-25 23:23       ` David Laight [this message]
2020-10-25 23:53         ` Arvind Sankar
2020-10-26 10:06           ` David Laight
2020-10-26  8:02   ` Ard Biesheuvel
2020-10-30  6:53 ` [PATCH v4 0/6] crypto: lib/sha256 - cleanup/optimization Herbert Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5d8f86fcfe84441fa5c9877959069ff1@AcuMS.aculab.com \
    --to=david.laight@aculab.com \
    --cc=davem@davemloft.net \
    --cc=ebiggers@google.com \
    --cc=ebiggers@kernel.org \
    --cc=herbert@gondor.apana.org.au \
    --cc=linux-crypto@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=nivedita@alum.mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.