All of lore.kernel.org
 help / color / mirror / Atom feed
* blake2b_compress_one_generic() stack use with gcc-12
@ 2022-06-11 19:24 Linus Torvalds
  2022-06-11 21:04 ` Jason A. Donenfeld
  0 siblings, 1 reply; 2+ messages in thread
From: Linus Torvalds @ 2022-06-11 19:24 UTC (permalink / raw)
  To: Eric Biggers, David Sterba, Arnd Bergmann, Kees Cook
  Cc: Herbert Xu, Linux Crypto Mailing List

On an i386 build, I now get

  crypto/blake2b_generic.c: In function ‘blake2b_compress_one_generic’:
  crypto/blake2b_generic.c:109:1: warning: the frame size of 2640
bytes is larger than 2048 bytes [-Wframe-larger-than=]

probably due to upgrading to gcc-12. But who knows - it's been several
months since I bothered to do a 32-bit build, so maybe it was
something else.

That stack frame is disgusting on x86-64 too, but at least a bit less
so (it's "only" 592 bytes there). I assume there are fewer spills due
to more registers or something, and then gcc doesn't go all crazy.

The actual data arrays it uses should use 256 bytes plus a few other
things, so the expansion due to spilling(?) is truly ludicrous.

I assumed it was some debug option causing the compiler to not re-use
spill slots or something like that. We've had that before. But
disabling KASAN did nothing for the stack use. Neither did disabling
UBSAN or the gcc plugins.

I then started to think it's the same issue that clang hit, and that
Arnd fixed with 0c0408e86dbe ("crypto: blake2b - Fix clang
optimization for ARMv7-M"), but adding -fno-unroll-loops
-fno-peel-loops didn't do anything either. Neither did adding a

  #pragma GCC unroll 0

where the clang #pragma is.

So this is a "Help me, Obi-Wan Kenobi. You're my only hope" email.

Anybody got any ideas?

It's worth noting that clang does much better. On both i386 and
x86-64, it does a stack frame of just over 400 bytes. So this very
much is about gcc.

                  Linus

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: blake2b_compress_one_generic() stack use with gcc-12
  2022-06-11 19:24 blake2b_compress_one_generic() stack use with gcc-12 Linus Torvalds
@ 2022-06-11 21:04 ` Jason A. Donenfeld
  0 siblings, 0 replies; 2+ messages in thread
From: Jason A. Donenfeld @ 2022-06-11 21:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Biggers, David Sterba, Arnd Bergmann, Kees Cook, Herbert Xu,
	Linux Crypto Mailing List

Hey Linus,

On Sat, Jun 11, 2022 at 12:24:29PM -0700, Linus Torvalds wrote:
> So this is a "Help me, Obi-Wan Kenobi. You're my only hope" email.
> 
> Anybody got any ideas?

The way to fix this is by re-rolling the rounds:

diff --git a/crypto/blake2b_generic.c b/crypto/blake2b_generic.c
index 6704c0355889..ac50c3086f6b 100644
--- a/crypto/blake2b_generic.c
+++ b/crypto/blake2b_generic.c
@@ -89,18 +89,9 @@ static void blake2b_compress_one_generic(struct blake2b_state *S,
 	v[14] = BLAKE2B_IV6 ^ S->f[0];
 	v[15] = BLAKE2B_IV7 ^ S->f[1];

-	ROUND(0);
-	ROUND(1);
-	ROUND(2);
-	ROUND(3);
-	ROUND(4);
-	ROUND(5);
-	ROUND(6);
-	ROUND(7);
-	ROUND(8);
-	ROUND(9);
-	ROUND(10);
-	ROUND(11);
+	for (i = 0; i < 12; ++i)
+		ROUND(i);
+
 #ifdef CONFIG_CC_IS_CLANG
 #pragma nounroll /* https://bugs.llvm.org/show_bug.cgi?id=45803 */
 #endif

The good thing about doing this is that it makes the function smaller,
so it fits easier into cache. The bad thing is that it means the
blake2b_sigma array can't be inlined, so you have accesses into that,
which means overall worse performance if called repeatedly (which maybe
it is? This is only used for btrfs disk hashing afaik). BLAKE2 is
actually 12 different permutations, rather than one, which is partially
how they got away with reducing the number of rounds from the original
BLAKE SHA-3 finalist. But without fancy permutation SIMD instructions,
it also means you're left with this trade off in the generic C code.

However, getting back to the actual issue you pointed out, this seems
like an odd gcc 12 issue. Or perhaps it's some aspect of the inliner
there? I noticed that on gcc 11, the above patch increases the frame
size from 416 to 704. But on gcc 12 it decreases it from 2632 to 680.
Huh.

Jason

^ permalink raw reply related	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2022-06-11 21:04 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-11 19:24 blake2b_compress_one_generic() stack use with gcc-12 Linus Torvalds
2022-06-11 21:04 ` Jason A. Donenfeld

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.