RE: [PATCH v2 2/2] random: use BLAKE2s instead of SHA1 in extraction

From: David Laight <David.Laight@ACULAB.COM>
To: "'Jason A. Donenfeld'" <Jason@zx2c4.com>,
	Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Theodore Tso <tytso@mit.edu>,
	Greg KH <gregkh@linuxfoundation.org>,
	Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Subject: RE: [PATCH v2 2/2] random: use BLAKE2s instead of SHA1 in extraction
Date: Fri, 14 Jan 2022 17:27:43 +0000
Message-ID: <05ae373684334e6581294baa8afd3238@AcuMS.aculab.com>
In-Reply-To: <Yd18+iQ8zicsSPa0@zx2c4.com>

From: Jason A. Donenfeld
> Sent: 11 January 2022 12:50
>
> On Tue, Jan 11, 2022 at 1:28 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> > If you're really quite concerned about m68k code size, I can probably
> > do some things to reduce that. For example, blake2s256_hmac is only
> > used by wireguard and it could probably be made local there. And with
> > some trivial loop re-rolling, I can shave off another 2300 bytes. And
> > I bet I can find a few other things too. The question is: how
> > important is this to you?
> 
> And with another trick (see below), another extra 1000 bytes or so
> shaved off. Aside from moving blake2s256_hmac, I'm not really super
> enthusiastic about making these changes, but depending on how important
> this is to you, maybe we can make something work. There are probably
> additional possibilities too with the code.

Quite clearly whoever wrote the unrolled loops needs their head examined.
It is extremely unlikely that a cpu has enough registers to implement it
efficiently.
(Of course, a pipelined implementation on an FPGA is another matter.)

So every read of v[] is going to be a memory read.
Much better to do that than to spill values that change.
The memory reads won't really hit performance either.
They add a bit of latency - but that will be handled by
instruction scheduling - either by the compiler or the cpu hardware.

> -#define ROUND(r) do { \
> -	G(r, 0, v[0], v[ 4], v[ 8], v[12]); \
> -	G(r, 1, v[1], v[ 5], v[ 9], v[13]); \
> -	G(r, 2, v[2], v[ 6], v[10], v[14]); \
> -	G(r, 3, v[3], v[ 7], v[11], v[15]); \
> -	G(r, 4, v[0], v[ 5], v[10], v[15]); \
> -	G(r, 5, v[1], v[ 6], v[11], v[12]); \
> -	G(r, 6, v[2], v[ 7], v[ 8], v[13]); \
> -	G(r, 7, v[3], v[ 4], v[ 9], v[14]); \
> -} while (0)
> -		ROUND(0);
> -		ROUND(1);
> -		ROUND(2);
> -		ROUND(3);
> -		ROUND(4);
> -		ROUND(5);
> -		ROUND(6);
> -		ROUND(7);
> -		ROUND(8);
> -		ROUND(9);

The v[] indices clearly don't change in the above.
Use 4 separate arrays so you have:

#define ROUND(r) do { \
	G(r, 0, v[0], w[0], x[0], y[0]); \
	G(r, 1, v[1], w[1], x[1], y[1]); \
	G(r, 2, v[2], w[2], x[2], y[2]); \
	G(r, 3, v[3], w[3], x[3], y[3]); \
	G(r, 4, v[0], w[1], x[2], y[3]); \
	G(r, 5, v[1], w[2], x[3], y[0]); \
	G(r, 6, v[2], w[3], x[0], y[1]); \
	G(r, 7, v[3], w[0], x[1], y[2]); \
} while (0)
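
A quick way to convince yourself the split is right is to check the
indices mechanically. The sketch below assumes v, w, x and y are simply
the four rows of the flat 16-word state (v = words 0-3, w = 4-7,
x = 8-11, y = 12-15). The names flat_idx and split_matches_flat are made
up for this illustration and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* State words touched by each G call in the original unrolled ROUND(). */
static const uint8_t flat_idx[8][4] = {
	{ 0, 4,  8, 12 }, { 1, 5,  9, 13 }, { 2, 6, 10, 14 }, { 3, 7, 11, 15 },
	{ 0, 5, 10, 15 }, { 1, 6, 11, 12 }, { 2, 7,  8, 13 }, { 3, 4,  9, 14 },
};

/*
 * Assumed layout (an assumption of this sketch): v = flat[0..3],
 * w = flat[4..7], x = flat[8..11], y = flat[12..15].
 *
 * The four-array ROUND() uses v[i], w[i], x[i], y[i] for the first four
 * G calls and v[i], w[i+1], x[i+2], y[i+3] (indices mod 4) for the last
 * four. Returns 1 if that names the same words as the flat table.
 */
int split_matches_flat(void)
{
	for (int j = 0; j < 8; j++) {
		int i = j & 3;		/* which v[] element this G call uses */
		int rot = j >> 2;	/* 0 = column half, 1 = diagonal half */

		if (flat_idx[j][0] != i ||
		    flat_idx[j][1] != 4 + ((i + 1 * rot) & 3) ||
		    flat_idx[j][2] != 8 + ((i + 2 * rot) & 3) ||
		    flat_idx[j][3] != 12 + ((i + 3 * rot) & 3))
			return 0;
	}
	return 1;
}
```

split_matches_flat() should return 1, i.e. w[1] in G(r, 4, ...) is the
old v[5], x[2] is the old v[10], and so on.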

Now double the size of the v/w/x/y arrays and write the correct
values when they are created/updated and you get:

#define ROUND(r) do { \
	G(r, 0, v[0], w[0], x[0], y[0]); \
	G(r, 1, v[1], w[1], x[1], y[1]); \
	G(r, 2, v[2], w[2], x[2], y[2]); \
	G(r, 3, v[3], w[3], x[3], y[3]); \
	G(r, 4, v[4], w[4], x[4], y[4]); \
	G(r, 5, v[5], w[5], x[5], y[5]); \
	G(r, 6, v[6], w[6], x[6], y[6]); \
	G(r, 7, v[7], w[7], x[7], y[7]); \
} while (0)

Oh - that is a nice loop...
So we get:
	for (r = 0; r < 10; r++)
		for (j = 0; j < 8; j++)
			G(r, j, v[j], w[j], x[j], y[j]);

Which is likely to be just as fast as any other version.
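
For what it's worth, here is a minimal sketch of the doubled-array loop
checked against the flat form. Several things are assumptions for
illustration only: sched() is a placeholder message schedule and is NOT
the BLAKE2s sigma table, struct state8 and all the function names are
invented here, and the mirror stores after each G call are one guess at
what "write the correct values when they are created/updated" has to
mean in practice.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t ror32(uint32_t x, unsigned int n)
{
	return (x >> n) | (x << (32 - n));
}

/* The BLAKE2s quarter-round on four state words and two message words. */
static void g(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d,
	      uint32_t mx, uint32_t my)
{
	*a += *b + mx; *d = ror32(*d ^ *a, 16);
	*c += *d;      *b = ror32(*b ^ *c, 12);
	*a += *b + my; *d = ror32(*d ^ *a, 8);
	*c += *d;      *b = ror32(*b ^ *c, 7);
}

/* Placeholder schedule (hypothetical, not the real sigma table). */
static unsigned int sched(int r, int j, int k)
{
	return (unsigned int)(r + 2 * j + k) & 15;
}

static const uint8_t flat_idx[8][4] = {
	{ 0, 4,  8, 12 }, { 1, 5,  9, 13 }, { 2, 6, 10, 14 }, { 3, 7, 11, 15 },
	{ 0, 5, 10, 15 }, { 1, 6, 11, 12 }, { 2, 7,  8, 13 }, { 3, 4,  9, 14 },
};

/* Reference: ten rounds over the flat 16-word state. */
static void rounds_flat(uint32_t s[16], const uint32_t m[16])
{
	for (int r = 0; r < 10; r++)
		for (int j = 0; j < 8; j++) {
			const uint8_t *ix = flat_idx[j];

			g(&s[ix[0]], &s[ix[1]], &s[ix[2]], &s[ix[3]],
			  m[sched(r, j, 0)], m[sched(r, j, 1)]);
		}
}

/* Slots 4..7 of each array hold rotated copies of slots 0..3. */
struct state8 { uint32_t v[8], w[8], x[8], y[8]; };

static void load(struct state8 *t, const uint32_t s[16])
{
	for (int i = 0; i < 4; i++) {
		t->v[i] = s[i];      t->v[4 + i] = s[i];
		t->w[i] = s[4 + i];  t->w[4 + i] = s[4 + ((i + 1) & 3)];
		t->x[i] = s[8 + i];  t->x[4 + i] = s[8 + ((i + 2) & 3)];
		t->y[i] = s[12 + i]; t->y[4 + i] = s[12 + ((i + 3) & 3)];
	}
}

static void rounds_doubled(struct state8 *t, const uint32_t m[16])
{
	for (int r = 0; r < 10; r++)
		for (int j = 0; j < 8; j++) {
			int i = j & 3;

			g(&t->v[j], &t->w[j], &t->x[j], &t->y[j],
			  m[sched(r, j, 0)], m[sched(r, j, 1)]);
			/* Keep the other copy of each word up to date. */
			if (j < 4) {
				t->v[4 + i] = t->v[j];
				t->w[4 + ((i + 3) & 3)] = t->w[j];
				t->x[4 + ((i + 2) & 3)] = t->x[j];
				t->y[4 + ((i + 1) & 3)] = t->y[j];
			} else {
				t->v[i] = t->v[j];
				t->w[(i + 1) & 3] = t->w[j];
				t->x[(i + 2) & 3] = t->x[j];
				t->y[(i + 3) & 3] = t->y[j];
			}
		}
}

/* Returns 1 when both forms end up with identical state words. */
int doubled_matches_flat(void)
{
	uint32_t s[16], m[16];
	struct state8 t;

	for (int i = 0; i < 16; i++) {
		s[i] = 0x9e3779b9u * (uint32_t)(i + 1);
		m[i] = 0x85ebca6bu * (uint32_t)(i + 1);
	}
	load(&t, s);
	rounds_flat(s, m);
	rounds_doubled(&t, m);
	return !memcmp(t.v, s, 16) && !memcmp(t.w, s + 4, 16) &&
	       !memcmp(t.x, s + 8, 16) && !memcmp(t.y, s + 12, 16);
}
```

doubled_matches_flat() should return 1. Note the four mirror stores per
G call in this sketch; whether they cost less than the unrolled code
saves would need measuring on the target.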

You might need to give the compiler some great big hints
in order to get sensible code.
Possibly make v[], w[], x[] and y[] all volatile and replace
the inner loop body with:
			v_j = v[j]; w_j = w[j]; x_j = x[j]; y_j = y[j];
			G(r, j, v_j, w_j, x_j, y_j);

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)