* [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks
@ 2020-04-20  7:57 Jason A. Donenfeld
  2020-04-20  8:32 ` David Laight
                   ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Jason A. Donenfeld @ 2020-04-20  7:57 UTC (permalink / raw)
  To: herbert, linux-crypto, linux-kernel, ebiggers, ardb
  Cc: Jason A. Donenfeld, stable

The initial Zinc patchset, after some mailing list discussion, contained
code to ensure that kernel_fpu_enable would not be kept on for more than
a PAGE_SIZE chunk, since it disables preemption. The choice of PAGE_SIZE
isn't totally scientific, but it's not a bad guess either, and it's
what's used in both the x86 poly1305 and blake2s library code already.
Unfortunately it appears to have been left out of the final patchset
that actually added the glue code. So, this commit adds back the
PAGE_SIZE chunking.

Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function")
Fixes: b3aad5bad26a ("crypto: arm64/chacha - expose arm64 ChaCha routine as library function")
Fixes: a44a3430d71b ("crypto: arm/chacha - expose ARM ChaCha routine as library function")
Fixes: f569ca164751 ("crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
Fixes: a6b803b3ddc7 ("crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
Cc: Eric Biggers <ebiggers@google.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
Eric, Ard - I'm wondering if this was in fact just an oversight in Ard's
patches, or if there was actually some later discussion in which we
concluded that the PAGE_SIZE chunking wasn't required, perhaps because
of FPU changes. If that's the case, please do let me know, in which case
I'll submit a _different_ patch that removes the chunking from x86 poly
and blake. I can't find any emails that would indicate that, but I might
be mistaken.

 arch/arm/crypto/chacha-glue.c        | 16 +++++++++++++---
 arch/arm/crypto/poly1305-glue.c      | 17 +++++++++++++----
 arch/arm64/crypto/chacha-neon-glue.c | 16 +++++++++++++---
 arch/arm64/crypto/poly1305-glue.c    | 17 +++++++++++++----
 arch/x86/crypto/chacha_glue.c        | 16 +++++++++++++---
 5 files changed, 65 insertions(+), 17 deletions(-)

diff --git a/arch/arm/crypto/chacha-glue.c b/arch/arm/crypto/chacha-glue.c
index 6fdb0ac62b3d..0e29ebac95fd 100644
--- a/arch/arm/crypto/chacha-glue.c
+++ b/arch/arm/crypto/chacha-glue.c
@@ -91,9 +91,19 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
 		return;
 	}
 
-	kernel_neon_begin();
-	chacha_doneon(state, dst, src, bytes, nrounds);
-	kernel_neon_end();
+	for (;;) {
+		unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
+
+		kernel_neon_begin();
+		chacha_doneon(state, dst, src, todo, nrounds);
+		kernel_neon_end();
+
+		bytes -= todo;
+		if (!bytes)
+			break;
+		src += todo;
+		dst += todo;
+	}
 }
 EXPORT_SYMBOL(chacha_crypt_arch);
 
diff --git a/arch/arm/crypto/poly1305-glue.c b/arch/arm/crypto/poly1305-glue.c
index ceec04ec2f40..536a4a943ebe 100644
--- a/arch/arm/crypto/poly1305-glue.c
+++ b/arch/arm/crypto/poly1305-glue.c
@@ -160,13 +160,22 @@ void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src,
 		unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);
 
 		if (static_branch_likely(&have_neon) && do_neon) {
-			kernel_neon_begin();
-			poly1305_blocks_neon(&dctx->h, src, len, 1);
-			kernel_neon_end();
+			for (;;) {
+				unsigned int todo = min_t(unsigned int, PAGE_SIZE, len);
+
+				kernel_neon_begin();
+				poly1305_blocks_neon(&dctx->h, src, todo, 1);
+				kernel_neon_end();
+
+				len -= todo;
+				if (!len)
+					break;
+				src += todo;
+			}
 		} else {
 			poly1305_blocks_arm(&dctx->h, src, len, 1);
+			src += len;
 		}
-		src += len;
 		nbytes %= POLY1305_BLOCK_SIZE;
 	}
 
diff --git a/arch/arm64/crypto/chacha-neon-glue.c b/arch/arm64/crypto/chacha-neon-glue.c
index 37ca3e889848..3eff767f4f77 100644
--- a/arch/arm64/crypto/chacha-neon-glue.c
+++ b/arch/arm64/crypto/chacha-neon-glue.c
@@ -87,9 +87,19 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
 	    !crypto_simd_usable())
 		return chacha_crypt_generic(state, dst, src, bytes, nrounds);
 
-	kernel_neon_begin();
-	chacha_doneon(state, dst, src, bytes, nrounds);
-	kernel_neon_end();
+	for (;;) {
+		unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
+
+		kernel_neon_begin();
+		chacha_doneon(state, dst, src, todo, nrounds);
+		kernel_neon_end();
+
+		bytes -= todo;
+		if (!bytes)
+			break;
+		src += todo;
+		dst += todo;
+	}
 }
 EXPORT_SYMBOL(chacha_crypt_arch);
 
diff --git a/arch/arm64/crypto/poly1305-glue.c b/arch/arm64/crypto/poly1305-glue.c
index e97b092f56b8..616134bef02c 100644
--- a/arch/arm64/crypto/poly1305-glue.c
+++ b/arch/arm64/crypto/poly1305-glue.c
@@ -143,13 +143,22 @@ void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src,
 		unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);
 
 		if (static_branch_likely(&have_neon) && crypto_simd_usable()) {
-			kernel_neon_begin();
-			poly1305_blocks_neon(&dctx->h, src, len, 1);
-			kernel_neon_end();
+			for (;;) {
+				unsigned int todo = min_t(unsigned int, PAGE_SIZE, len);
+
+				kernel_neon_begin();
+				poly1305_blocks_neon(&dctx->h, src, todo, 1);
+				kernel_neon_end();
+
+				len -= todo;
+				if (!len)
+					break;
+				src += todo;
+			}
 		} else {
 			poly1305_blocks(&dctx->h, src, len, 1);
+			src += len;
 		}
-		src += len;
 		nbytes %= POLY1305_BLOCK_SIZE;
 	}
 
diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c
index b412c21ee06e..10733035b81c 100644
--- a/arch/x86/crypto/chacha_glue.c
+++ b/arch/x86/crypto/chacha_glue.c
@@ -153,9 +153,19 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
 	    bytes <= CHACHA_BLOCK_SIZE)
 		return chacha_crypt_generic(state, dst, src, bytes, nrounds);
 
-	kernel_fpu_begin();
-	chacha_dosimd(state, dst, src, bytes, nrounds);
-	kernel_fpu_end();
+	for (;;) {
+		unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
+
+		kernel_fpu_begin();
+		chacha_dosimd(state, dst, src, todo, nrounds);
+		kernel_fpu_end();
+
+		bytes -= todo;
+		if (!bytes)
+			break;
+		src += todo;
+		dst += todo;
+	}
 }
 EXPORT_SYMBOL(chacha_crypt_arch);
 
-- 
2.26.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* RE: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks
  2020-04-20  7:57 [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks Jason A. Donenfeld
@ 2020-04-20  8:32 ` David Laight
  2020-04-21  4:02   ` Jason A. Donenfeld
  2020-04-21  4:14   ` FPU register granularity [Was: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks] Jason A. Donenfeld
  2020-04-22  4:04 ` [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks Eric Biggers
  2020-04-22 20:03 ` [PATCH crypto-stable v2] crypto: arch - limit simd usage to 4k chunks Jason A. Donenfeld
  2 siblings, 2 replies; 33+ messages in thread
From: David Laight @ 2020-04-20  8:32 UTC (permalink / raw)
  To: 'Jason A. Donenfeld',
	herbert, linux-crypto, linux-kernel, ebiggers, ardb
  Cc: stable

From: Jason A. Donenfeld
> Sent: 20 April 2020 08:57
> 
> The initial Zinc patchset, after some mailing list discussion, contained
> code to ensure that kernel_fpu_enable would not be kept on for more than
> a PAGE_SIZE chunk, since it disables preemption. The choice of PAGE_SIZE
> isn't totally scientific, but it's not a bad guess either, and it's
> what's used in both the x86 poly1305 and blake2s library code already.
> Unfortunately it appears to have been left out of the final patchset
> that actually added the glue code. So, this commit adds back the
> PAGE_SIZE chunking.
> 
...
> ---
> Eric, Ard - I'm wondering if this was in fact just an oversight in Ard's
> patches, or if there was actually some later discussion in which we
> concluded that the PAGE_SIZE chunking wasn't required, perhaps because
> of FPU changes. If that's the case, please do let me know, in which case
> I'll submit a _different_ patch that removes the chunking from x86 poly
> and blake. I can't find any emails that would indicate that, but I might
> be mistaken.

Maybe kernel_fp_begin() should be passed the address of somewhere
the address of an fpu save area buffer can be written to.
Then the pre-emption code can allocate the buffer and save the
state into it.

However that doesn't solve the problem for non-preemptive kernels.
They may need a cond_resched() in the loop if it might take 1ms (or so).
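Roughly something like this, i.e. (untested sketch, based on the x86
hunk above; the point is the reschedule outside the begin/end pair):

	for (;;) {
		unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);

		kernel_fpu_begin();
		chacha_dosimd(state, dst, src, todo, nrounds);
		kernel_fpu_end();

		bytes -= todo;
		if (!bytes)
			break;
		src += todo;
		dst += todo;
		/* preemption is enabled again here, so this is allowed */
		cond_resched();
	}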

kernel_fpu_begin() ought also be passed a parameter saying which
fpu features are required, and return which are allocated.
On x86 this could be used to check for AVX512 (etc) which may be
available in an ISR unless it interrupted inside a kernel_fpu_begin()
section (etc).
It would also allow optimisations if only 1 or 2 fpu registers are
needed (eg for some of the crypto functions) rather than the whole
fpu register set.
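
The shape might be something like this (the _features() variant and
the KFPU_AVX2 flag are invented names, just to illustrate):

	/* ask for AVX2; returns the subset that was actually granted */
	unsigned int got = kernel_fpu_begin_features(KFPU_AVX2);

	if (got & KFPU_AVX2)
		do_the_avx2_version();	/* placeholder */
	else
		do_the_sse2_version();	/* placeholder */
	kernel_fpu_end();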

	David



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks
  2020-04-20  8:32 ` David Laight
@ 2020-04-21  4:02   ` Jason A. Donenfeld
  2020-04-21  4:14   ` FPU register granularity [Was: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks] Jason A. Donenfeld
  1 sibling, 0 replies; 33+ messages in thread
From: Jason A. Donenfeld @ 2020-04-21  4:02 UTC (permalink / raw)
  To: David Laight; +Cc: herbert, linux-crypto, linux-kernel, ebiggers, ardb, stable

On Mon, Apr 20, 2020 at 2:32 AM David Laight <David.Laight@aculab.com> wrote:
> Maybe kernel_fp_begin() should be passed the address of somewhere
> the address of an fpu save area buffer can be written to.
> Then the pre-emption code can allocate the buffer and save the
> state into it.
>
> However that doesn't solve the problem for non-preemptive kernels.
> They may need a cond_resched() in the loop if it might take 1ms (or so).
>
> kernel_fpu_begin() ought also be passed a parameter saying which
> fpu features are required, and return which are allocated.
> On x86 this could be used to check for AVX512 (etc) which may be
> available in an ISR unless it interrupted inside a kernel_fpu_begin()
> section (etc).
> It would also allow optimisations if only 1 or 2 fpu registers are
> needed (eg for some of the crypto functions) rather than the whole
> fpu register set.

There might be ways to improve lots of FPU things, indeed. This patch
here is just a patch to Herbert's branch in order to make uniform
usage of our existing solution for this, fixing the existing bug. I
wouldn't mind seeing more involved and better solutions in a patchset
for crypto-next.

Will follow up with your suggestion in a different thread, so as not
to block this one.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* FPU register granularity [Was: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks]
  2020-04-20  8:32 ` David Laight
  2020-04-21  4:02   ` Jason A. Donenfeld
@ 2020-04-21  4:14   ` Jason A. Donenfeld
  2020-04-21  4:25     ` Jason A. Donenfeld
                       ` (2 more replies)
  1 sibling, 3 replies; 33+ messages in thread
From: Jason A. Donenfeld @ 2020-04-21  4:14 UTC (permalink / raw)
  To: David Laight; +Cc: herbert, linux-crypto, linux-kernel, ebiggers, ardb, stable

Hi David,

On Mon, Apr 20, 2020 at 2:32 AM David Laight <David.Laight@aculab.com> wrote:
> Maybe kernel_fp_begin() should be passed the address of somewhere
> the address of an fpu save area buffer can be written to.
> Then the pre-emption code can allocate the buffer and save the
> state into it.

Interesting idea. It looks like `struct xregs_state` is only 576
bytes. That's not exactly small, but it's not insanely huge either,
and maybe we could justifiably stick that on the stack, or even
reserve part of the stack allocation for that that the function would
know about, without needing to specify any address.
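
Something vaguely like this is what I mean (the _buf() functions are
made-up names; this is just the shape, not a real API):

	struct xregs_state save;	/* ~576 bytes without AVX-512 */

	kernel_fpu_begin_buf(&save);
	chacha_dosimd(state, dst, src, bytes, nrounds);
	kernel_fpu_end_buf(&save);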

> kernel_fpu_begin() ought also be passed a parameter saying which
> fpu features are required, and return which are allocated.
> On x86 this could be used to check for AVX512 (etc) which may be
> available in an ISR unless it interrupted inside a kernel_fpu_begin()
> section (etc).
> It would also allow optimisations if only 1 or 2 fpu registers are
> needed (eg for some of the crypto functions) rather than the whole
> fpu register set.

For AVX512 this probably makes sense, I suppose. But I'm not sure if
there are too many bits of crypto code that only use a few registers.
There are those accelerated memcpy routines in i915 though -- ever see
drivers/gpu/drm/i915/i915_memcpy.c? sort of wild. But if we did go
this way, I wonder if it'd make sense to totally overengineer it and
write a gcc/as plugin to create the register mask for us. Or, maybe
some checker inside of objtool could help here.

Actually, though, the thing I've been wondering about is actually
moving in the complete opposite direction: is there some
efficient-enough way that we could allow FPU registers in all contexts
always, without the need for kernel_fpu_begin/end? I was reversing
ntoskrnl.exe and was kind of impressed (maybe not the right word?) by
their judicious use of vectorisation everywhere. I assume a lot of
that is being generated by their compiler, which of course gcc could
do for us if we let it. Is that an interesting avenue to consider? Or
are you pretty certain that it'd be a huge mistake, with an
irreversible speed hit?

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: FPU register granularity [Was: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks]
  2020-04-21  4:14   ` FPU register granularity [Was: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks] Jason A. Donenfeld
@ 2020-04-21  4:25     ` Jason A. Donenfeld
  2020-04-21  7:02     ` Ard Biesheuvel
  2020-04-21  8:11     ` David Laight
  2 siblings, 0 replies; 33+ messages in thread
From: Jason A. Donenfeld @ 2020-04-21  4:25 UTC (permalink / raw)
  To: David Laight; +Cc: herbert, linux-crypto, linux-kernel, ebiggers, ardb, stable

On Mon, Apr 20, 2020 at 10:14 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Hi David,
>
> On Mon, Apr 20, 2020 at 2:32 AM David Laight <David.Laight@aculab.com> wrote:
> > Maybe kernel_fp_begin() should be passed the address of somewhere
> > the address of an fpu save area buffer can be written to.
> > Then the pre-emption code can allocate the buffer and save the
> > state into it.
>
> Interesting idea. It looks like `struct xregs_state` is only 576
> bytes. That's not exactly small, but it's not insanely huge either,
> and maybe we could justifiably stick that on the stack, or even
> reserve part of the stack allocation for that that the function would
> know about, without needing to specify any address.

Hah-hah, nevermind here. extended_state_area is of course huge,
bringing the whole structure to a whopping 3k with avx512. :)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: FPU register granularity [Was: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks]
  2020-04-21  4:14   ` FPU register granularity [Was: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks] Jason A. Donenfeld
  2020-04-21  4:25     ` Jason A. Donenfeld
@ 2020-04-21  7:02     ` Ard Biesheuvel
  2020-04-21  8:05       ` David Laight
  2020-04-21  8:11     ` David Laight
  2 siblings, 1 reply; 33+ messages in thread
From: Ard Biesheuvel @ 2020-04-21  7:02 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: David Laight, herbert, linux-crypto, linux-kernel, ebiggers, stable

On Tue, 21 Apr 2020 at 06:15, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Hi David,
>
> On Mon, Apr 20, 2020 at 2:32 AM David Laight <David.Laight@aculab.com> wrote:
> > Maybe kernel_fp_begin() should be passed the address of somewhere
> > the address of an fpu save area buffer can be written to.
> > Then the pre-emption code can allocate the buffer and save the
> > state into it.
>
> Interesting idea. It looks like `struct xregs_state` is only 576
> bytes. That's not exactly small, but it's not insanely huge either,
> and maybe we could justifiably stick that on the stack, or even
> reserve part of the stack allocation for that that the function would
> know about, without needing to specify any address.
>
> > kernel_fpu_begin() ought also be passed a parameter saying which
> > fpu features are required, and return which are allocated.
> > On x86 this could be used to check for AVX512 (etc) which may be
> > available in an ISR unless it interrupted inside a kernel_fpu_begin()
> > section (etc).
> > It would also allow optimisations if only 1 or 2 fpu registers are
> > needed (eg for some of the crypto functions) rather than the whole
> > fpu register set.
>
> For AVX512 this probably makes sense, I suppose. But I'm not sure if
> there are too many bits of crypto code that only use a few registers.
> There are those accelerated memcpy routines in i915 though -- ever see
> drivers/gpu/drm/i915/i915_memcpy.c? sort of wild. But if we did go
> this way, I wonder if it'd make sense to totally overengineer it and
> write a gcc/as plugin to create the register mask for us. Or, maybe
> some checker inside of objtool could help here.
>
> Actually, though, the thing I've been wondering about is actually
> moving in the complete opposite direction: is there some
> efficient-enough way that we could allow FPU registers in all contexts
> always, without the need for kernel_fpu_begin/end? I was reversing
> ntoskrnl.exe and was kind of impressed (maybe not the right word?) by
> their judicious use of vectorisation everywhere. I assume a lot of
> that is being generated by their compiler, which of course gcc could
> do for us if we let it. Is that an interesting avenue to consider? Or
> are you pretty certain that it'd be a huge mistake, with an
> irreversible speed hit?
>

When I added support for kernel mode SIMD to arm64 originally, there
was a kernel_neon_begin_partial() that took an int to specify how many
registers were being used, the reason being that NEON preserve/store
was fully eager at this point, and arm64 has 32 SIMD registers, many
of which weren't really used, e.g., in the basic implementation of AES
based on special instructions.
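
From memory, usage looked roughly like this (so treat it as a sketch):

	kernel_neon_begin_partial(4);	/* we promise to only touch v0-v3 */
	/* NEON code that uses at most four registers */
	kernel_neon_end();

with kernel_neon_begin() itself being, IIRC, shorthand for
kernel_neon_begin_partial(32).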

With the introduction of lazy restore, and SVE handling for userspace,
we decided to remove this because it didn't really help anymore, and
made the code more difficult to manage.

However, I think it would make sense to have something like this in
the general case. I.e., NEON registers 0-3 are always preserved when
an exception or interrupt (or syscall) is taken, and so they can be
used anywhere in the kernel. If you want the whole set, you will have
to use begin/end as before. This would already unlock a few
interesting cases, like memcpy, xor, and sequences that can easily be
implemented with only a few registers, such as instruction based AES.

Unfortunately, the compiler needs to be taught about this to be
completely useful, which means lots of prototyping and benchmarking
upfront, as the contract will be set in stone once the compilers get
on board.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: FPU register granularity [Was: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks]
  2020-04-21  7:02     ` Ard Biesheuvel
@ 2020-04-21  8:05       ` David Laight
  0 siblings, 0 replies; 33+ messages in thread
From: David Laight @ 2020-04-21  8:05 UTC (permalink / raw)
  To: 'Ard Biesheuvel', Jason A. Donenfeld
  Cc: herbert, linux-crypto, linux-kernel, ebiggers, stable

From: Ard Biesheuvel
> Sent: 21 April 2020 08:02
> On Tue, 21 Apr 2020 at 06:15, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > Hi David,
> >
> > On Mon, Apr 20, 2020 at 2:32 AM David Laight <David.Laight@aculab.com> wrote:
> > > Maybe kernel_fp_begin() should be passed the address of somewhere
> > > the address of an fpu save area buffer can be written to.
> > > Then the pre-emption code can allocate the buffer and save the
> > > state into it.
> >
> > Interesting idea. It looks like `struct xregs_state` is only 576
> > bytes. That's not exactly small, but it's not insanely huge either,
> > and maybe we could justifiably stick that on the stack, or even
> > reserve part of the stack allocation for that that the function would
> > know about, without needing to specify any address.
> >
> > > kernel_fpu_begin() ought also be passed a parameter saying which
> > > fpu features are required, and return which are allocated.
> > > On x86 this could be used to check for AVX512 (etc) which may be
> > > available in an ISR unless it interrupted inside a kernel_fpu_begin()
> > > section (etc).
> > > It would also allow optimisations if only 1 or 2 fpu registers are
> > > needed (eg for some of the crypto functions) rather than the whole
> > > fpu register set.
> >
> > For AVX512 this probably makes sense, I suppose. But I'm not sure if
> > there are too many bits of crypto code that only use a few registers.
> > There are those accelerated memcpy routines in i915 though -- ever see
> > drivers/gpu/drm/i915/i915_memcpy.c? sort of wild. But if we did go
> > this way, I wonder if it'd make sense to totally overengineer it and
> > write a gcc/as plugin to create the register mask for us. Or, maybe
> > some checker inside of objtool could help here.
> >
> > Actually, though, the thing I've been wondering about is actually
> > moving in the complete opposite direction: is there some
> > efficient-enough way that we could allow FPU registers in all contexts
> > always, without the need for kernel_fpu_begin/end? I was reversing
> > ntoskrnl.exe and was kind of impressed (maybe not the right word?) by
> > their judicious use of vectorisation everywhere. I assume a lot of
> > that is being generated by their compiler, which of course gcc could
> > do for us if we let it. Is that an interesting avenue to consider? Or
> > are you pretty certain that it'd be a huge mistake, with an
> > irreversible speed hit?
> >
> 
> When I added support for kernel mode SIMD to arm64 originally, there
> was a kernel_neon_begin_partial() that took an int to specify how many
> registers were being used, the reason being that NEON preserve/store
> was fully eager at this point, and arm64 has 32 SIMD registers, many
> of which weren't really used, e.g., in the basic implementation of AES
> based on special instructions.
> 
> With the introduction of lazy restore, and SVE handling for userspace,
> we decided to remove this because it didn't really help anymore, and
> made the code more difficult to manage.
> 
> However, I think it would make sense to have something like this in
> the general case. I.e., NEON registers 0-3 are always preserved when
> an exception or interrupt (or syscall) is taken, and so they can be
> used anywhere in the kernel. If you want the whole set, you will have
> to use begin/end as before. This would already unlock a few
> interesting cases, like memcpy, xor, and sequences that can easily be
> implemented with only a few registers, such as instruction based AES.
> 
> Unfortunately, the compiler needs to be taught about this to be
> completely useful, which means lots of prototyping and benchmarking
> upfront, as the contract will be set in stone once the compilers get
> on board.

You can always just use asm with explicit registers.
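
E.g. something like this (arm64 sketch, assuming u8 *dst, *a, *b) only
ever touches the two registers it names as clobbers:

	/* xor 16 bytes: dst = a ^ b, using only v0 and v1 */
	asm volatile("ld1	{v0.16b}, [%1]\n\t"
		     "ld1	{v1.16b}, [%2]\n\t"
		     "eor	v0.16b, v0.16b, v1.16b\n\t"
		     "st1	{v0.16b}, [%0]"
		     : : "r"(dst), "r"(a), "r"(b)
		     : "v0", "v1", "memory");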

	David


^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: FPU register granularity [Was: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks]
  2020-04-21  4:14   ` FPU register granularity [Was: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks] Jason A. Donenfeld
  2020-04-21  4:25     ` Jason A. Donenfeld
  2020-04-21  7:02     ` Ard Biesheuvel
@ 2020-04-21  8:11     ` David Laight
  2 siblings, 0 replies; 33+ messages in thread
From: David Laight @ 2020-04-21  8:11 UTC (permalink / raw)
  To: 'Jason A. Donenfeld'
  Cc: herbert, linux-crypto, linux-kernel, ebiggers, ardb, stable

From: Jason A. Donenfeld
> Sent: 21 April 2020 05:15
> 
> Hi David,
> 
> On Mon, Apr 20, 2020 at 2:32 AM David Laight <David.Laight@aculab.com> wrote:
> > Maybe kernel_fp_begin() should be passed the address of somewhere
> > the address of an fpu save area buffer can be written to.
> > Then the pre-emption code can allocate the buffer and save the
> > state into it.
> 
> Interesting idea. It looks like `struct xregs_state` is only 576
> bytes. That's not exactly small, but it's not insanely huge either,
> and maybe we could justifiably stick that on the stack, or even
> reserve part of the stack allocation for that that the function would
> know about, without needing to specify any address.

As you said yourself, with AVX512 it is much larger.
Which is why I suggested the save code could allocate the area.
Note that this would only be needed for nested use (for a full save).

> > kernel_fpu_begin() ought also be passed a parameter saying which
> > fpu features are required, and return which are allocated.
> > On x86 this could be used to check for AVX512 (etc) which may be
> > available in an ISR unless it interrupted inside a kernel_fpu_begin()
> > section (etc).
> > It would also allow optimisations if only 1 or 2 fpu registers are
> > needed (eg for some of the crypto functions) rather than the whole
> > fpu register set.
> 
> For AVX512 this probably makes sense, I suppose. But I'm not sure if
> there are too many bits of crypto code that only use a few registers.
> There are those accelerated memcpy routines in i915 though -- ever see
> drivers/gpu/drm/i915/i915_memcpy.c? sort of wild. But if we did go
> this way, I wonder if it'd make sense to totally overengineer it and
> write a gcc/as plugin to create the register mask for us. Or, maybe
> some checker inside of objtool could help here.

I suspect some of that code is overly unrolled.

> Actually, though, the thing I've been wondering about is actually
> moving in the complete opposite direction: is there some
> efficient-enough way that we could allow FPU registers in all contexts
> always, without the need for kernel_fpu_begin/end? I was reversing
> ntoskrnl.exe and was kind of impressed (maybe not the right word?) by
> their judicious use of vectorisation everywhere. I assume a lot of
> that is being generated by their compiler, which of course gcc could
> do for us if we let it. Is that an interesting avenue to consider? Or
> are you pretty certain that it'd be a huge mistake, with an
> irreversible speed hit?


I think Windows takes the 'hit' of saving the entire fpu state on
every kernel entry.
Note that for system calls this is actually minimal.
All the 'caller saved' registers (most of the fpu ones) can be
trashed - ie reloaded with zeros.

	David


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks
  2020-04-20  7:57 [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks Jason A. Donenfeld
  2020-04-20  8:32 ` David Laight
@ 2020-04-22  4:04 ` Eric Biggers
  2020-04-22  7:23   ` Ard Biesheuvel
  2020-04-22  7:32   ` Jason A. Donenfeld
  2020-04-22 20:03 ` [PATCH crypto-stable v2] crypto: arch - limit simd usage to 4k chunks Jason A. Donenfeld
  2 siblings, 2 replies; 33+ messages in thread
From: Eric Biggers @ 2020-04-22  4:04 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: herbert, linux-crypto, linux-kernel, ardb, stable

On Mon, Apr 20, 2020 at 01:57:11AM -0600, Jason A. Donenfeld wrote:
> The initial Zinc patchset, after some mailing list discussion, contained
> code to ensure that kernel_fpu_enable would not be kept on for more than
> a PAGE_SIZE chunk, since it disables preemption. The choice of PAGE_SIZE
> isn't totally scientific, but it's not a bad guess either, and it's
> what's used in both the x86 poly1305 and blake2s library code already.
> Unfortunately it appears to have been left out of the final patchset
> that actually added the glue code. So, this commit adds back the
> PAGE_SIZE chunking.
> 
> Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function")
> Fixes: b3aad5bad26a ("crypto: arm64/chacha - expose arm64 ChaCha routine as library function")
> Fixes: a44a3430d71b ("crypto: arm/chacha - expose ARM ChaCha routine as library function")
> Fixes: f569ca164751 ("crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
> Fixes: a6b803b3ddc7 ("crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
> Cc: Eric Biggers <ebiggers@google.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: stable@vger.kernel.org
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
> Eric, Ard - I'm wondering if this was in fact just an oversight in Ard's
> patches, or if there was actually some later discussion in which we
> concluded that the PAGE_SIZE chunking wasn't required, perhaps because
> of FPU changes. If that's the case, please do let me know, in which case
> I'll submit a _different_ patch that removes the chunking from x86 poly
> and blake. I can't find any emails that would indicate that, but I might
> be mistaken.
> 
>  arch/arm/crypto/chacha-glue.c        | 16 +++++++++++++---
>  arch/arm/crypto/poly1305-glue.c      | 17 +++++++++++++----
>  arch/arm64/crypto/chacha-neon-glue.c | 16 +++++++++++++---
>  arch/arm64/crypto/poly1305-glue.c    | 17 +++++++++++++----
>  arch/x86/crypto/chacha_glue.c        | 16 +++++++++++++---
>  5 files changed, 65 insertions(+), 17 deletions(-)

I don't think you're missing anything.  On x86, kernel_fpu_begin() and
kernel_fpu_end() did get optimized in v5.2.  But they still disable preemption,
which is the concern here.

> 
> diff --git a/arch/arm/crypto/chacha-glue.c b/arch/arm/crypto/chacha-glue.c
> index 6fdb0ac62b3d..0e29ebac95fd 100644
> --- a/arch/arm/crypto/chacha-glue.c
> +++ b/arch/arm/crypto/chacha-glue.c
> @@ -91,9 +91,19 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
>  		return;
>  	}
>  
> -	kernel_neon_begin();
> -	chacha_doneon(state, dst, src, bytes, nrounds);
> -	kernel_neon_end();
> +	for (;;) {
> +		unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
> +
> +		kernel_neon_begin();
> +		chacha_doneon(state, dst, src, todo, nrounds);
> +		kernel_neon_end();
> +
> +		bytes -= todo;
> +		if (!bytes)
> +			break;
> +		src += todo;
> +		dst += todo;
> +	}
>  }
>  EXPORT_SYMBOL(chacha_crypt_arch);

Seems this should just be a 'while' loop?

	while (bytes) {
		unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);

		kernel_neon_begin();
		chacha_doneon(state, dst, src, todo, nrounds);
		kernel_neon_end();

		bytes -= todo;
		src += todo;
		dst += todo;
	}

Likewise elsewhere in this patch.  (For Poly1305, len >= POLY1305_BLOCK_SIZE at
the beginning, so that could use a 'do' loop.)

- Eric

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks
  2020-04-22  4:04 ` [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks Eric Biggers
@ 2020-04-22  7:23   ` Ard Biesheuvel
  2020-04-22  7:38     ` Jason A. Donenfeld
  2020-04-22 11:28     ` Sebastian Andrzej Siewior
  2020-04-22  7:32   ` Jason A. Donenfeld
  1 sibling, 2 replies; 33+ messages in thread
From: Ard Biesheuvel @ 2020-04-22  7:23 UTC (permalink / raw)
  To: Eric Biggers, bigeasy, linux-rt-users
  Cc: Jason A. Donenfeld, Herbert Xu, Linux Crypto Mailing List,
	Linux Kernel Mailing List

(+ linux-rt folks)

On Wed, 22 Apr 2020 at 06:04, Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Mon, Apr 20, 2020 at 01:57:11AM -0600, Jason A. Donenfeld wrote:
> > The initial Zinc patchset, after some mailing list discussion, contained
> > code to ensure that kernel_fpu_enable would not be kept on for more than
> > a PAGE_SIZE chunk, since it disables preemption. The choice of PAGE_SIZE
> > isn't totally scientific, but it's not a bad guess either, and it's
> > what's used in both the x86 poly1305 and blake2s library code already.
> > Unfortunately it appears to have been left out of the final patchset
> > that actually added the glue code. So, this commit adds back the
> > PAGE_SIZE chunking.
> >
> > Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function")
> > Fixes: b3aad5bad26a ("crypto: arm64/chacha - expose arm64 ChaCha routine as library function")
> > Fixes: a44a3430d71b ("crypto: arm/chacha - expose ARM ChaCha routine as library function")
> > Fixes: f569ca164751 ("crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
> > Fixes: a6b803b3ddc7 ("crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
> > Cc: Eric Biggers <ebiggers@google.com>
> > Cc: Ard Biesheuvel <ardb@kernel.org>
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> > ---
> > Eric, Ard - I'm wondering if this was in fact just an oversight in Ard's
> > patches, or if there was actually some later discussion in which we
> > concluded that the PAGE_SIZE chunking wasn't required, perhaps because
> > of FPU changes. If that's the case, please do let me know, in which case
> > I'll submit a _different_ patch that removes the chunking from x86 poly
> > and blake. I can't find any emails that would indicate that, but I might
> > be mistaken.
> >
> >  arch/arm/crypto/chacha-glue.c        | 16 +++++++++++++---
> >  arch/arm/crypto/poly1305-glue.c      | 17 +++++++++++++----
> >  arch/arm64/crypto/chacha-neon-glue.c | 16 +++++++++++++---
> >  arch/arm64/crypto/poly1305-glue.c    | 17 +++++++++++++----
> >  arch/x86/crypto/chacha_glue.c        | 16 +++++++++++++---
> >  5 files changed, 65 insertions(+), 17 deletions(-)
>
> I don't think you're missing anything.  On x86, kernel_fpu_begin() and
> kernel_fpu_end() did get optimized in v5.2.  But they still disable preemption,
> which is the concern here.
>

My memory is a bit fuzzy here. I remember talking to the linux-rt guys
about what delay is actually acceptable, which was a lot higher than I
had thought based on their initial reports about scheduling blackouts
on arm64 due to preemption remaining disabled for too long. I intended
to revisit this with more accurate bounds but then I apparently
forgot.

So SIMD chacha20 and SIMD poly1305 both run in <5 cycles per byte,
both on x86 and ARM. If we take 20 microseconds as a ballpark upper
bound for how long preemption may be disabled, that gives us ~4000
bytes of ChaCha20 or Poly1305 on a hypothetical 1 GHz core.
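(That is: 20 us at 1 GHz is 20,000 cycles, and 20,000 cycles at 5
cycles per byte is 4,000 bytes.)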

So I think 4 KB is indeed a reasonable quantum of work here. Only
PAGE_SIZE is not necessarily equal to 4 KB on arm64, so we should use
SZ_4K instead.

*However*, at the time, the report was triggered by the fact that we
were keeping SIMD enabled across calls into the scatterwalk API, which
may call kmalloc()/kfree() etc. There is no need for that anymore, now
that the FPU begin/end routines all have been optimized to restore the
userland SIMD state lazily.

So do we have any callers that are likely to pass more than 4 KB of
input at a time? AF_ALG perhaps? Is this code path covered by the
tcrypt tests? Even if we all agree that this chunking is probably
needed, I'd still like us to have an idea of when/how it is likely to
get exercised.


> >
> > diff --git a/arch/arm/crypto/chacha-glue.c b/arch/arm/crypto/chacha-glue.c
> > index 6fdb0ac62b3d..0e29ebac95fd 100644
> > --- a/arch/arm/crypto/chacha-glue.c
> > +++ b/arch/arm/crypto/chacha-glue.c
> > @@ -91,9 +91,19 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
> >               return;
> >       }
> >
> > -     kernel_neon_begin();
> > -     chacha_doneon(state, dst, src, bytes, nrounds);
> > -     kernel_neon_end();
> > +     for (;;) {
> > +             unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
> > +
> > +             kernel_neon_begin();
> > +             chacha_doneon(state, dst, src, todo, nrounds);
> > +             kernel_neon_end();
> > +
> > +             bytes -= todo;
> > +             if (!bytes)
> > +                     break;
> > +             src += todo;
> > +             dst += todo;
> > +     }
> >  }
> >  EXPORT_SYMBOL(chacha_crypt_arch);
>
> Seems this should just be a 'while' loop?
>
>         while (bytes) {
>                 unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
>
>                 kernel_neon_begin();
>                 chacha_doneon(state, dst, src, todo, nrounds);
>                 kernel_neon_end();
>
>                 bytes -= todo;
>                 src += todo;
>                 dst += todo;
>         }
>
> Likewise elsewhere in this patch.  (For Poly1305, len >= POLY1305_BLOCK_SIZE at
> the beginning, so that could use a 'do' loop.)
>
> - Eric

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks
  2020-04-22  4:04 ` [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks Eric Biggers
  2020-04-22  7:23   ` Ard Biesheuvel
@ 2020-04-22  7:32   ` Jason A. Donenfeld
  2020-04-22  7:39     ` Ard Biesheuvel
  1 sibling, 1 reply; 33+ messages in thread
From: Jason A. Donenfeld @ 2020-04-22  7:32 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Herbert Xu, Linux Crypto Mailing List, LKML, Ard Biesheuvel, stable

On Tue, Apr 21, 2020 at 10:04 PM Eric Biggers <ebiggers@kernel.org> wrote:
> Seems this should just be a 'while' loop?
>
>         while (bytes) {
>                 unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
>
>                 kernel_neon_begin();
>                 chacha_doneon(state, dst, src, todo, nrounds);
>                 kernel_neon_end();
>
>                 bytes -= todo;
>                 src += todo;
>                 dst += todo;
>         }

The for(;;) is how it's done elsewhere in the kernel (that this patch
doesn't touch), because then we can break out of the loop before
having to increment src and dst unnecessarily. Likely a pointless
optimization as probably the compiler can figure out how to avoid
that. But maybe it can't. If you have a strong preference, I can
refactor everything to use `while (bytes)`, but if you don't care,
let's keep this as-is. Opinion?

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks
  2020-04-22  7:23   ` Ard Biesheuvel
@ 2020-04-22  7:38     ` Jason A. Donenfeld
  2020-04-22 11:28     ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 33+ messages in thread
From: Jason A. Donenfeld @ 2020-04-22  7:38 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Eric Biggers, Sebastian Siewior, linux-rt-users, Herbert Xu,
	Linux Crypto Mailing List, Linux Kernel Mailing List

On Wed, Apr 22, 2020 at 1:23 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> So I think 4 KB is indeed a reasonable quantum of work here. Only
> PAGE_SIZE is not necessarily equal to 4 KB on arm64, so we should use
> SZ_4K instead.

Will wait to hear the rt guys' opinion, but sure, we can do SZ_4K
explicitly. If we go with that in the end, v2 will adjust the other
places that are already using PAGE_SIZE.

> So do we have any callers that are likely to pass more than 4 KB of
> input at a time?

Network packets can be big -- ethernet jumbo packets are 9k, for
example -- so that means this could potentially ipsec, wireguard, and
maybe wifi too. (Crypto api users might go through another layer of
indirection that splits things up smaller, maybe.)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks
  2020-04-22  7:32   ` Jason A. Donenfeld
@ 2020-04-22  7:39     ` Ard Biesheuvel
  2020-04-22 19:51       ` Jason A. Donenfeld
  0 siblings, 1 reply; 33+ messages in thread
From: Ard Biesheuvel @ 2020-04-22  7:39 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Eric Biggers, Herbert Xu, Linux Crypto Mailing List, LKML

On Wed, 22 Apr 2020 at 09:32, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> On Tue, Apr 21, 2020 at 10:04 PM Eric Biggers <ebiggers@kernel.org> wrote:
> > Seems this should just be a 'while' loop?
> >
> >         while (bytes) {
> >                 unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
> >
> >                 kernel_neon_begin();
> >                 chacha_doneon(state, dst, src, todo, nrounds);
> >                 kernel_neon_end();
> >
> >                 bytes -= todo;
> >                 src += todo;
> >                 dst += todo;
> >         }
>
> The for(;;) is how it's done elsewhere in the kernel (that this patch
> doesn't touch), because then we can break out of the loop before
> having to increment src and dst unnecessarily. Likely a pointless
> optimization as probably the compiler can figure out how to avoid
> that. But maybe it can't. If you have a strong preference, I can
> refactor everything to use `while (bytes)`, but if you don't care,
> let's keep this as-is. Opinion?
>

Since we're bikeshedding, I'd prefer 'do { } while (bytes);' here,
given that bytes is guaranteed to be non-zero before we enter the
loop. But in any case, I'd prefer avoiding for(;;) or while(1) where
we can.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks
  2020-04-22  7:23   ` Ard Biesheuvel
  2020-04-22  7:38     ` Jason A. Donenfeld
@ 2020-04-22 11:28     ` Sebastian Andrzej Siewior
  2020-04-22 19:35       ` Jason A. Donenfeld
  1 sibling, 1 reply; 33+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-04-22 11:28 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Eric Biggers, linux-rt-users, Jason A. Donenfeld, Herbert Xu,
	Linux Crypto Mailing List, Linux Kernel Mailing List

On 2020-04-22 09:23:34 [+0200], Ard Biesheuvel wrote:
> My memory is a bit fuzzy here. I remember talking to the linux-rt guys
> about what delay is actually acceptable, which was a lot higher than I
> had thought based on their initial reports about scheduling blackouts
> on arm64 due to preemption remaining disabled for too long. I intended
> to revisit this with more accurate bounds but then I apparently
> forgot.
> 
> So SIMD chacha20 and SIMD poly1305 both run in <5 cycles per byte,
> both on x86 and ARM. If we take 20 microseconds as a ballpark upper
> bound for how long preemption may be disabled, that gives us ~4000
> bytes of ChaCha20 or Poly1305 on a hypothetical 1 GHz core.
> 
> So I think 4 KB is indeed a reasonable quantum of work here. Only
> PAGE_SIZE is not necessarily equal to 4 KB on arm64, so we should use
> SZ_4K instead.
> 
> *However*, at the time, the report was triggered by the fact that we
> were keeping SIMD enabled across calls into the scatterwalk API, which
> may call kmalloc()/kfree() etc. There is no need for that anymore, now
> that the FPU begin/end routines all have been optimized to restore the
> userland SIMD state lazily.

The 20usec sounds reasonable. The other concern was memory allocation
within the preempt-disable section. If this is no longer the case,
perfect.

Sebastian

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks
  2020-04-22 11:28     ` Sebastian Andrzej Siewior
@ 2020-04-22 19:35       ` Jason A. Donenfeld
  0 siblings, 0 replies; 33+ messages in thread
From: Jason A. Donenfeld @ 2020-04-22 19:35 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Ard Biesheuvel, Eric Biggers, linux-rt-users, Herbert Xu,
	Linux Crypto Mailing List, Linux Kernel Mailing List

On Wed, Apr 22, 2020 at 5:28 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2020-04-22 09:23:34 [+0200], Ard Biesheuvel wrote:
> > My memory is a bit fuzzy here. I remember talking to the linux-rt guys
> > about what delay is actually acceptable, which was a lot higher than I
> > had thought based on their initial reports about scheduling blackouts
> > on arm64 due to preemption remaining disabled for too long. I intended
> > to revisit this with more accurate bounds but then I apparently
> > forgot.
> >
> > So SIMD chacha20 and SIMD poly1305 both run in <5 cycles per byte,
> > both on x86 and ARM. If we take 20 microseconds as a ballpark upper
> > bound for how long preemption may be disabled, that gives us ~4000
> > bytes of ChaCha20 or Poly1305 on a hypothetical 1 GHz core.
> >
> > So I think 4 KB is indeed a reasonable quantum of work here. Only
> > PAGE_SIZE is not necessarily equal to 4 KB on arm64, so we should use
> > SZ_4K instead.
> >
> > *However*, at the time, the report was triggered by the fact that we
> > were keeping SIMD enabled across calls into the scatterwalk API, which
> > may call kmalloc()/kfree() etc. There is no need for that anymore, now
> > that the FPU begin/end routines all have been optimized to restore the
> > userland SIMD state lazily.
>
> The 20usec sounds reasonable. The other concern was memory allocation
> within the preempt-disable section. If this is no longer the case,
> perfect.

Cool, thanks for the confirmation. I'll get a v2 of this patch out the door.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks
  2020-04-22  7:39     ` Ard Biesheuvel
@ 2020-04-22 19:51       ` Jason A. Donenfeld
  2020-04-22 20:17         ` Jason A. Donenfeld
  0 siblings, 1 reply; 33+ messages in thread
From: Jason A. Donenfeld @ 2020-04-22 19:51 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: Eric Biggers, Herbert Xu, Linux Crypto Mailing List, LKML

On Wed, Apr 22, 2020 at 1:39 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Wed, 22 Apr 2020 at 09:32, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > On Tue, Apr 21, 2020 at 10:04 PM Eric Biggers <ebiggers@kernel.org> wrote:
> > > Seems this should just be a 'while' loop?
> > >
> > >         while (bytes) {
> > >                 unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
> > >
> > >                 kernel_neon_begin();
> > >                 chacha_doneon(state, dst, src, todo, nrounds);
> > >                 kernel_neon_end();
> > >
> > >                 bytes -= todo;
> > >                 src += todo;
> > >                 dst += todo;
> > >         }
> >
> > The for(;;) is how it's done elsewhere in the kernel (that this patch
> > doesn't touch), because then we can break out of the loop before
> > having to increment src and dst unnecessarily. Likely a pointless
> > optimization as probably the compiler can figure out how to avoid
> > that. But maybe it can't. If you have a strong preference, I can
> > refactor everything to use `while (bytes)`, but if you don't care,
> > let's keep this as-is. Opinion?
> >
>
> Since we're bikeshedding, I'd prefer 'do { } while (bytes);' here,
> given that bytes is guaranteed to be non-zero before we enter the
> loop. But in any case, I'd prefer avoiding for(;;) or while(1) where
> we can.

Okay, will do-while it up for v2.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH crypto-stable v2] crypto: arch - limit simd usage to 4k chunks
  2020-04-20  7:57 [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks Jason A. Donenfeld
  2020-04-20  8:32 ` David Laight
  2020-04-22  4:04 ` [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks Eric Biggers
@ 2020-04-22 20:03 ` Jason A. Donenfeld
  2020-04-22 22:39   ` Eric Biggers
  2020-04-22 23:18   ` [PATCH crypto-stable v3 1/2] crypto: arch/lib " Jason A. Donenfeld
  2 siblings, 2 replies; 33+ messages in thread
From: Jason A. Donenfeld @ 2020-04-22 20:03 UTC (permalink / raw)
  To: herbert, linux-crypto, linux-kernel, linux-rt-users
  Cc: Jason A. Donenfeld, Eric Biggers, Ard Biesheuvel,
	Sebastian Andrzej Siewior, stable

The initial Zinc patchset, after some mailing list discussion, contained
code to ensure that kernel_fpu_enable would not be kept on for more than
a 4k chunk, since it disables preemption. The choice of 4k isn't totally
scientific, but it's not a bad guess either, and it's what's used in
the x86 poly1305, blake2s, and nhpoly1305 code already (in the form
of PAGE_SIZE, which this commit corrects to be explicitly 4k).

Ard did some back of the envelope calculations and found that
at 5 cycles/byte (overestimate) on a 1 GHz processor (pretty slow), 4k
means we have a maximum preemption disabling of 20us, which Sebastian
confirmed was probably a good limit.

Unfortunately the chunking appears to have been left out of the final
patchset that added the glue code. So, this commit adds it back in.

Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function")
Fixes: b3aad5bad26a ("crypto: arm64/chacha - expose arm64 ChaCha routine as library function")
Fixes: a44a3430d71b ("crypto: arm/chacha - expose ARM ChaCha routine as library function")
Fixes: d7d7b8535662 ("crypto: x86/poly1305 - wire up faster implementations for kernel")
Fixes: f569ca164751 ("crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
Fixes: a6b803b3ddc7 ("crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
Fixes: 0f961f9f670e ("crypto: x86/nhpoly1305 - add AVX2 accelerated NHPoly1305")
Fixes: 012c82388c03 ("crypto: x86/nhpoly1305 - add SSE2 accelerated NHPoly1305")
Fixes: a00fa0c88774 ("crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305")
Fixes: 16aae3595a9d ("crypto: arm/nhpoly1305 - add NEON-accelerated NHPoly1305")
Fixes: ed0356eda153 ("crypto: blake2s - x86_64 SIMD implementation")
Cc: Eric Biggers <ebiggers@google.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
Changes v1->v2:
 - [Ard] Use explicit 4k chunks instead of PAGE_SIZE.
 - [Eric] Prefer do-while over for (;;).

 arch/arm/crypto/chacha-glue.c            | 14 +++++++++++---
 arch/arm/crypto/nhpoly1305-neon-glue.c   |  2 +-
 arch/arm/crypto/poly1305-glue.c          | 15 +++++++++++----
 arch/arm64/crypto/chacha-neon-glue.c     | 14 +++++++++++---
 arch/arm64/crypto/nhpoly1305-neon-glue.c |  2 +-
 arch/arm64/crypto/poly1305-glue.c        | 15 +++++++++++----
 arch/x86/crypto/blake2s-glue.c           | 10 ++++------
 arch/x86/crypto/chacha_glue.c            | 14 +++++++++++---
 arch/x86/crypto/nhpoly1305-avx2-glue.c   |  2 +-
 arch/x86/crypto/nhpoly1305-sse2-glue.c   |  2 +-
 arch/x86/crypto/poly1305_glue.c          | 13 ++++++-------
 11 files changed, 69 insertions(+), 34 deletions(-)

diff --git a/arch/arm/crypto/chacha-glue.c b/arch/arm/crypto/chacha-glue.c
index 6fdb0ac62b3d..59da6c0b63b6 100644
--- a/arch/arm/crypto/chacha-glue.c
+++ b/arch/arm/crypto/chacha-glue.c
@@ -91,9 +91,17 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
 		return;
 	}
 
-	kernel_neon_begin();
-	chacha_doneon(state, dst, src, bytes, nrounds);
-	kernel_neon_end();
+	do {
+		unsigned int todo = min_t(unsigned int, bytes, SZ_4K);
+
+		kernel_neon_begin();
+		chacha_doneon(state, dst, src, todo, nrounds);
+		kernel_neon_end();
+
+		bytes -= todo;
+		src += todo;
+		dst += todo;
+	} while (bytes);
 }
 EXPORT_SYMBOL(chacha_crypt_arch);
 
diff --git a/arch/arm/crypto/nhpoly1305-neon-glue.c b/arch/arm/crypto/nhpoly1305-neon-glue.c
index ae5aefc44a4d..ffa8d73fe722 100644
--- a/arch/arm/crypto/nhpoly1305-neon-glue.c
+++ b/arch/arm/crypto/nhpoly1305-neon-glue.c
@@ -30,7 +30,7 @@ static int nhpoly1305_neon_update(struct shash_desc *desc,
 		return crypto_nhpoly1305_update(desc, src, srclen);
 
 	do {
-		unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE);
+		unsigned int n = min_t(unsigned int, srclen, SZ_4K);
 
 		kernel_neon_begin();
 		crypto_nhpoly1305_update_helper(desc, src, n, _nh_neon);
diff --git a/arch/arm/crypto/poly1305-glue.c b/arch/arm/crypto/poly1305-glue.c
index ceec04ec2f40..13cfef4ae22e 100644
--- a/arch/arm/crypto/poly1305-glue.c
+++ b/arch/arm/crypto/poly1305-glue.c
@@ -160,13 +160,20 @@ void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src,
 		unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);
 
 		if (static_branch_likely(&have_neon) && do_neon) {
-			kernel_neon_begin();
-			poly1305_blocks_neon(&dctx->h, src, len, 1);
-			kernel_neon_end();
+			do {
+				unsigned int todo = min_t(unsigned int, len, SZ_4K);
+
+				kernel_neon_begin();
+				poly1305_blocks_neon(&dctx->h, src, todo, 1);
+				kernel_neon_end();
+
+				len -= todo;
+				src += todo;
+			} while (len);
 		} else {
 			poly1305_blocks_arm(&dctx->h, src, len, 1);
+			src += len;
 		}
-		src += len;
 		nbytes %= POLY1305_BLOCK_SIZE;
 	}
 
diff --git a/arch/arm64/crypto/chacha-neon-glue.c b/arch/arm64/crypto/chacha-neon-glue.c
index 37ca3e889848..af2bbca38e70 100644
--- a/arch/arm64/crypto/chacha-neon-glue.c
+++ b/arch/arm64/crypto/chacha-neon-glue.c
@@ -87,9 +87,17 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
 	    !crypto_simd_usable())
 		return chacha_crypt_generic(state, dst, src, bytes, nrounds);
 
-	kernel_neon_begin();
-	chacha_doneon(state, dst, src, bytes, nrounds);
-	kernel_neon_end();
+	do {
+		unsigned int todo = min_t(unsigned int, bytes, SZ_4K);
+
+		kernel_neon_begin();
+		chacha_doneon(state, dst, src, todo, nrounds);
+		kernel_neon_end();
+
+		bytes -= todo;
+		src += todo;
+		dst += todo;
+	} while (bytes);
 }
 EXPORT_SYMBOL(chacha_crypt_arch);
 
diff --git a/arch/arm64/crypto/nhpoly1305-neon-glue.c b/arch/arm64/crypto/nhpoly1305-neon-glue.c
index 895d3727c1fb..c5405e6a6db7 100644
--- a/arch/arm64/crypto/nhpoly1305-neon-glue.c
+++ b/arch/arm64/crypto/nhpoly1305-neon-glue.c
@@ -30,7 +30,7 @@ static int nhpoly1305_neon_update(struct shash_desc *desc,
 		return crypto_nhpoly1305_update(desc, src, srclen);
 
 	do {
-		unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE);
+		unsigned int n = min_t(unsigned int, srclen, SZ_4K);
 
 		kernel_neon_begin();
 		crypto_nhpoly1305_update_helper(desc, src, n, _nh_neon);
diff --git a/arch/arm64/crypto/poly1305-glue.c b/arch/arm64/crypto/poly1305-glue.c
index e97b092f56b8..f33ada70c4ed 100644
--- a/arch/arm64/crypto/poly1305-glue.c
+++ b/arch/arm64/crypto/poly1305-glue.c
@@ -143,13 +143,20 @@ void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src,
 		unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);
 
 		if (static_branch_likely(&have_neon) && crypto_simd_usable()) {
-			kernel_neon_begin();
-			poly1305_blocks_neon(&dctx->h, src, len, 1);
-			kernel_neon_end();
+			do {
+				unsigned int todo = min_t(unsigned int, len, SZ_4K);
+
+				kernel_neon_begin();
+				poly1305_blocks_neon(&dctx->h, src, todo, 1);
+				kernel_neon_end();
+
+				len -= todo;
+				src += todo;
+			} while (len);
 		} else {
 			poly1305_blocks(&dctx->h, src, len, 1);
+			src += len;
 		}
-		src += len;
 		nbytes %= POLY1305_BLOCK_SIZE;
 	}
 
diff --git a/arch/x86/crypto/blake2s-glue.c b/arch/x86/crypto/blake2s-glue.c
index 06ef2d4a4701..6737bcea1fa1 100644
--- a/arch/x86/crypto/blake2s-glue.c
+++ b/arch/x86/crypto/blake2s-glue.c
@@ -32,16 +32,16 @@ void blake2s_compress_arch(struct blake2s_state *state,
 			   const u32 inc)
 {
 	/* SIMD disables preemption, so relax after processing each page. */
-	BUILD_BUG_ON(PAGE_SIZE / BLAKE2S_BLOCK_SIZE < 8);
+	BUILD_BUG_ON(SZ_4K / BLAKE2S_BLOCK_SIZE < 8);
 
 	if (!static_branch_likely(&blake2s_use_ssse3) || !crypto_simd_usable()) {
 		blake2s_compress_generic(state, block, nblocks, inc);
 		return;
 	}
 
-	for (;;) {
+	do {
 		const size_t blocks = min_t(size_t, nblocks,
-					    PAGE_SIZE / BLAKE2S_BLOCK_SIZE);
+					    SZ_4K / BLAKE2S_BLOCK_SIZE);
 
 		kernel_fpu_begin();
 		if (IS_ENABLED(CONFIG_AS_AVX512) &&
@@ -52,10 +52,8 @@ void blake2s_compress_arch(struct blake2s_state *state,
 		kernel_fpu_end();
 
 		nblocks -= blocks;
-		if (!nblocks)
-			break;
 		block += blocks * BLAKE2S_BLOCK_SIZE;
-	}
+	} while (nblocks);
 }
 EXPORT_SYMBOL(blake2s_compress_arch);
 
diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c
index b412c21ee06e..22250091cdbe 100644
--- a/arch/x86/crypto/chacha_glue.c
+++ b/arch/x86/crypto/chacha_glue.c
@@ -153,9 +153,17 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
 	    bytes <= CHACHA_BLOCK_SIZE)
 		return chacha_crypt_generic(state, dst, src, bytes, nrounds);
 
-	kernel_fpu_begin();
-	chacha_dosimd(state, dst, src, bytes, nrounds);
-	kernel_fpu_end();
+	do {
+		unsigned int todo = min_t(unsigned int, bytes, SZ_4K);
+
+		kernel_fpu_begin();
+		chacha_dosimd(state, dst, src, todo, nrounds);
+		kernel_fpu_end();
+
+		bytes -= todo;
+		src += todo;
+		dst += todo;
+	} while (bytes);
 }
 EXPORT_SYMBOL(chacha_crypt_arch);
 
diff --git a/arch/x86/crypto/nhpoly1305-avx2-glue.c b/arch/x86/crypto/nhpoly1305-avx2-glue.c
index f7567cbd35b6..80fcb85736e1 100644
--- a/arch/x86/crypto/nhpoly1305-avx2-glue.c
+++ b/arch/x86/crypto/nhpoly1305-avx2-glue.c
@@ -29,7 +29,7 @@ static int nhpoly1305_avx2_update(struct shash_desc *desc,
 		return crypto_nhpoly1305_update(desc, src, srclen);
 
 	do {
-		unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE);
+		unsigned int n = min_t(unsigned int, srclen, SZ_4K);
 
 		kernel_fpu_begin();
 		crypto_nhpoly1305_update_helper(desc, src, n, _nh_avx2);
diff --git a/arch/x86/crypto/nhpoly1305-sse2-glue.c b/arch/x86/crypto/nhpoly1305-sse2-glue.c
index a661ede3b5cf..cc6b7c1a2705 100644
--- a/arch/x86/crypto/nhpoly1305-sse2-glue.c
+++ b/arch/x86/crypto/nhpoly1305-sse2-glue.c
@@ -29,7 +29,7 @@ static int nhpoly1305_sse2_update(struct shash_desc *desc,
 		return crypto_nhpoly1305_update(desc, src, srclen);
 
 	do {
-		unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE);
+		unsigned int n = min_t(unsigned int, srclen, SZ_4K);
 
 		kernel_fpu_begin();
 		crypto_nhpoly1305_update_helper(desc, src, n, _nh_sse2);
diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
index 6dfec19f7d57..dfe921efa9b2 100644
--- a/arch/x86/crypto/poly1305_glue.c
+++ b/arch/x86/crypto/poly1305_glue.c
@@ -91,8 +91,8 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len,
 	struct poly1305_arch_internal *state = ctx;
 
 	/* SIMD disables preemption, so relax after processing each page. */
-	BUILD_BUG_ON(PAGE_SIZE < POLY1305_BLOCK_SIZE ||
-		     PAGE_SIZE % POLY1305_BLOCK_SIZE);
+	BUILD_BUG_ON(SZ_4K < POLY1305_BLOCK_SIZE ||
+		     SZ_4K % POLY1305_BLOCK_SIZE);
 
 	if (!static_branch_likely(&poly1305_use_avx) ||
 	    (len < (POLY1305_BLOCK_SIZE * 18) && !state->is_base2_26) ||
@@ -102,8 +102,8 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len,
 		return;
 	}
 
-	for (;;) {
-		const size_t bytes = min_t(size_t, len, PAGE_SIZE);
+	do {
+		const size_t bytes = min_t(size_t, len, SZ_4K);
 
 		kernel_fpu_begin();
 		if (IS_ENABLED(CONFIG_AS_AVX512) && static_branch_likely(&poly1305_use_avx512))
@@ -113,11 +113,10 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len,
 		else
 			poly1305_blocks_avx(ctx, inp, bytes, padbit);
 		kernel_fpu_end();
+
 		len -= bytes;
-		if (!len)
-			break;
 		inp += bytes;
-	}
+	} while (len);
 }
 
 static void poly1305_simd_emit(void *ctx, u8 mac[POLY1305_DIGEST_SIZE],
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks
  2020-04-22 19:51       ` Jason A. Donenfeld
@ 2020-04-22 20:17         ` Jason A. Donenfeld
  2020-04-23  8:45           ` Ard Biesheuvel
  0 siblings, 1 reply; 33+ messages in thread
From: Jason A. Donenfeld @ 2020-04-22 20:17 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: Eric Biggers, Herbert Xu, Linux Crypto Mailing List, LKML

On Wed, Apr 22, 2020 at 1:51 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> On Wed, Apr 22, 2020 at 1:39 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > On Wed, 22 Apr 2020 at 09:32, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> > >
> > > On Tue, Apr 21, 2020 at 10:04 PM Eric Biggers <ebiggers@kernel.org> wrote:
> > > > Seems this should just be a 'while' loop?
> > > >
> > > >         while (bytes) {
> > > >                 unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
> > > >
> > > >                 kernel_neon_begin();
> > > >                 chacha_doneon(state, dst, src, todo, nrounds);
> > > >                 kernel_neon_end();
> > > >
> > > >                 bytes -= todo;
> > > >                 src += todo;
> > > >                 dst += todo;
> > > >         }
> > >
> > > The for(;;) is how it's done elsewhere in the kernel (that this patch
> > > doesn't touch), because then we can break out of the loop before
> > > having to increment src and dst unnecessarily. Likely a pointless
> > > optimization as probably the compiler can figure out how to avoid
> > > that. But maybe it can't. If you have a strong preference, I can
> > > refactor everything to use `while (bytes)`, but if you don't care,
> > > let's keep this as-is. Opinion?
> > >
> >
> > Since we're bikeshedding, I'd prefer 'do { } while (bytes);' here,
> > given that bytes is guaranteed to be non-zero before we enter the
> > loop. But in any case, I'd prefer avoiding for(;;) or while(1) where
> > we can.
>
> Okay, will do-while it up for v2.

I just sent v2 containing do-while, and I'm fine with that going in
that way. But just in the interest of curiosity in the Pantone
palette, check this out:

https://godbolt.org/z/VxXien

It looks like on mine, the compiler avoids unnecessarily calling those
adds on the last iteration, but on the other hand, it results in an
otherwise unnecessary unconditional jump for the < 4096 case. Sort of
interesting. Arm64 code is more or less the same difference too.
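
The two shapes under discussion boil down to roughly the following
standalone sketch (process_chunk() here is only a hypothetical stand-in
for the kernel_fpu_begin()/chacha_dosimd()/kernel_fpu_end() sequence or
its NEON equivalent, and 4096 stands in for SZ_4K; it's meant for
poking at codegen, not as a copy of the real glue code):

	/* Declared but not defined, so the compiler can't fold it away. */
	void process_chunk(unsigned char *dst, const unsigned char *src,
			   unsigned int len);

	/* Shape 1: for (;;) with an early break, skipping the final adds. */
	void crypt_forever(unsigned char *dst, const unsigned char *src,
			   unsigned int bytes)
	{
		for (;;) {
			unsigned int todo = bytes < 4096 ? bytes : 4096;

			process_chunk(dst, src, todo);

			bytes -= todo;
			if (!bytes)
				break;
			src += todo;
			dst += todo;
		}
	}

	/* Shape 2: the do-while form sent in v2 (callers guarantee that
	 * bytes is non-zero on entry). */
	void crypt_dowhile(unsigned char *dst, const unsigned char *src,
			   unsigned int bytes)
	{
		do {
			unsigned int todo = bytes < 4096 ? bytes : 4096;

			process_chunk(dst, src, todo);

			bytes -= todo;
			src += todo;
			dst += todo;
		} while (bytes);
	}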

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable v2] crypto: arch - limit simd usage to 4k chunks
  2020-04-22 20:03 ` [PATCH crypto-stable v2] crypto: arch - limit simd usage to 4k chunks Jason A. Donenfeld
@ 2020-04-22 22:39   ` Eric Biggers
  2020-04-22 23:09     ` Jason A. Donenfeld
  2020-04-22 23:18   ` [PATCH crypto-stable v3 1/2] crypto: arch/lib " Jason A. Donenfeld
  1 sibling, 1 reply; 33+ messages in thread
From: Eric Biggers @ 2020-04-22 22:39 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: herbert, linux-crypto, linux-kernel, linux-rt-users,
	Ard Biesheuvel, Sebastian Andrzej Siewior, stable

On Wed, Apr 22, 2020 at 02:03:44PM -0600, Jason A. Donenfeld wrote:
> The initial Zinc patchset, after some mailing list discussion, contained
> code to ensure that kernel_fpu_enable would not be kept on for more than
> a 4k chunk, since it disables preemption. The choice of 4k isn't totally
> scientific, but it's not a bad guess either, and it's what's used in
> the x86 poly1305, blake2s, and nhpoly1305 code already (in the form
> of PAGE_SIZE, which this commit corrects to be explicitly 4k).
> 
> Ard did some back of the envelope calculations and found that
> at 5 cycles/byte (overestimate) on a 1ghz processor (pretty slow), 4k
> means we have a maximum preemption disabling of 20us, which Sebastian
> confirmed was probably a good limit.
> 
> Unfortunately the chunking appears to have been left out of the final
> patchset that added the glue code. So, this commit adds it back in.
> 
> Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function")
> Fixes: b3aad5bad26a ("crypto: arm64/chacha - expose arm64 ChaCha routine as library function")
> Fixes: a44a3430d71b ("crypto: arm/chacha - expose ARM ChaCha routine as library function")
> Fixes: d7d7b8535662 ("crypto: x86/poly1305 - wire up faster implementations for kernel")
> Fixes: f569ca164751 ("crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
> Fixes: a6b803b3ddc7 ("crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
> Fixes: 0f961f9f670e ("crypto: x86/nhpoly1305 - add AVX2 accelerated NHPoly1305")
> Fixes: 012c82388c03 ("crypto: x86/nhpoly1305 - add SSE2 accelerated NHPoly1305")
> Fixes: a00fa0c88774 ("crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305")
> Fixes: 16aae3595a9d ("crypto: arm/nhpoly1305 - add NEON-accelerated NHPoly1305")
> Fixes: ed0356eda153 ("crypto: blake2s - x86_64 SIMD implementation")
> Cc: Eric Biggers <ebiggers@google.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: stable@vger.kernel.org
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
> Changes v1->v2:
>  - [Ard] Use explicit 4k chunks instead of PAGE_SIZE.
>  - [Eric] Prefer do-while over for (;;).
> 
>  arch/arm/crypto/chacha-glue.c            | 14 +++++++++++---
>  arch/arm/crypto/nhpoly1305-neon-glue.c   |  2 +-
>  arch/arm/crypto/poly1305-glue.c          | 15 +++++++++++----
>  arch/arm64/crypto/chacha-neon-glue.c     | 14 +++++++++++---
>  arch/arm64/crypto/nhpoly1305-neon-glue.c |  2 +-
>  arch/arm64/crypto/poly1305-glue.c        | 15 +++++++++++----
>  arch/x86/crypto/blake2s-glue.c           | 10 ++++------
>  arch/x86/crypto/chacha_glue.c            | 14 +++++++++++---
>  arch/x86/crypto/nhpoly1305-avx2-glue.c   |  2 +-
>  arch/x86/crypto/nhpoly1305-sse2-glue.c   |  2 +-
>  arch/x86/crypto/poly1305_glue.c          | 13 ++++++-------
>  11 files changed, 69 insertions(+), 34 deletions(-)

Can you split the nhpoly1305 changes into a separate patch?  They're a bit
different from the rest of this patch, which is fixing up the crypto library
interface that's new in v5.5.  The nhpoly1305 changes apply to v5.0 and don't
have anything to do with the crypto library interface, and they're also a bit
different since they replace PAGE_SIZE with 4K rather than unlimited with 4K.

- Eric

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable v2] crypto: arch - limit simd usage to 4k chunks
  2020-04-22 22:39   ` Eric Biggers
@ 2020-04-22 23:09     ` Jason A. Donenfeld
  0 siblings, 0 replies; 33+ messages in thread
From: Jason A. Donenfeld @ 2020-04-22 23:09 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Herbert Xu, Linux Crypto Mailing List, LKML, linux-rt-users,
	Ard Biesheuvel, Sebastian Andrzej Siewior, stable

On Wed, Apr 22, 2020 at 4:39 PM Eric Biggers <ebiggers@kernel.org> wrote:
> Can you split the nhpoly1305 changes into a separate patch?  They're a bit
> different from the rest of this patch, which is fixing up the crypto library
> interface that's new in v5.5.  The nhpoly1305 changes apply to v5.0 and don't
> have anything to do with the crypto library interface, and they're also a bit
> different since they replace PAGE_SIZE with 4K rather than unlimited with 4K.

Good point about the nhpoly change not being part of the library
interface and thus backportable differently. I'll split that out and
send a v3 shortly.

Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH crypto-stable v3 1/2] crypto: arch/lib - limit simd usage to 4k chunks
  2020-04-22 20:03 ` [PATCH crypto-stable v2] crypto: arch - limit simd usage to 4k chunks Jason A. Donenfeld
  2020-04-22 22:39   ` Eric Biggers
@ 2020-04-22 23:18   ` Jason A. Donenfeld
  2020-04-22 23:18     ` [PATCH crypto-stable v3 2/2] crypto: arch/nhpoly1305 - process in explicit " Jason A. Donenfeld
                       ` (2 more replies)
  1 sibling, 3 replies; 33+ messages in thread
From: Jason A. Donenfeld @ 2020-04-22 23:18 UTC (permalink / raw)
  To: herbert, linux-crypto, linux-kernel, linux-rt-users
  Cc: Jason A. Donenfeld, Eric Biggers, Ard Biesheuvel,
	Sebastian Andrzej Siewior, stable

The initial Zinc patchset, after some mailing list discussion, contained
code to ensure that kernel_fpu_enable would not be kept on for more than
a 4k chunk, since it disables preemption. The choice of 4k isn't totally
scientific, but it's not a bad guess either, and it's what's used in
the x86 poly1305, blake2s, and nhpoly1305 code already (in the form
of PAGE_SIZE, which this commit corrects to be explicitly 4k for the
former two).

Ard did some back of the envelope calculations and found that
at 5 cycles/byte (overestimate) on a 1ghz processor (pretty slow), 4k
means we have a maximum preemption disabling of 20us, which Sebastian
confirmed was probably a good limit.

Unfortunately the chunking appears to have been left out of the final
patchset that added the glue code. So, this commit adds it back in.

Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function")
Fixes: b3aad5bad26a ("crypto: arm64/chacha - expose arm64 ChaCha routine as library function")
Fixes: a44a3430d71b ("crypto: arm/chacha - expose ARM ChaCha routine as library function")
Fixes: d7d7b8535662 ("crypto: x86/poly1305 - wire up faster implementations for kernel")
Fixes: f569ca164751 ("crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
Fixes: a6b803b3ddc7 ("crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
Fixes: ed0356eda153 ("crypto: blake2s - x86_64 SIMD implementation")
Cc: Eric Biggers <ebiggers@google.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
Changes v2->v3:
 - [Eric] Split nhpoly1305 changes into separate commit, since it's not
   related to the library interface.

Changes v1->v2:
 - [Ard] Use explicit 4k chunks instead of PAGE_SIZE.
 - [Eric] Prefer do-while over for (;;).

 arch/arm/crypto/chacha-glue.c        | 14 +++++++++++---
 arch/arm/crypto/poly1305-glue.c      | 15 +++++++++++----
 arch/arm64/crypto/chacha-neon-glue.c | 14 +++++++++++---
 arch/arm64/crypto/poly1305-glue.c    | 15 +++++++++++----
 arch/x86/crypto/blake2s-glue.c       | 10 ++++------
 arch/x86/crypto/chacha_glue.c        | 14 +++++++++++---
 arch/x86/crypto/poly1305_glue.c      | 13 ++++++-------
 7 files changed, 65 insertions(+), 30 deletions(-)

diff --git a/arch/arm/crypto/chacha-glue.c b/arch/arm/crypto/chacha-glue.c
index 6fdb0ac62b3d..59da6c0b63b6 100644
--- a/arch/arm/crypto/chacha-glue.c
+++ b/arch/arm/crypto/chacha-glue.c
@@ -91,9 +91,17 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
 		return;
 	}
 
-	kernel_neon_begin();
-	chacha_doneon(state, dst, src, bytes, nrounds);
-	kernel_neon_end();
+	do {
+		unsigned int todo = min_t(unsigned int, bytes, SZ_4K);
+
+		kernel_neon_begin();
+		chacha_doneon(state, dst, src, todo, nrounds);
+		kernel_neon_end();
+
+		bytes -= todo;
+		src += todo;
+		dst += todo;
+	} while (bytes);
 }
 EXPORT_SYMBOL(chacha_crypt_arch);
 
diff --git a/arch/arm/crypto/poly1305-glue.c b/arch/arm/crypto/poly1305-glue.c
index ceec04ec2f40..13cfef4ae22e 100644
--- a/arch/arm/crypto/poly1305-glue.c
+++ b/arch/arm/crypto/poly1305-glue.c
@@ -160,13 +160,20 @@ void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src,
 		unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);
 
 		if (static_branch_likely(&have_neon) && do_neon) {
-			kernel_neon_begin();
-			poly1305_blocks_neon(&dctx->h, src, len, 1);
-			kernel_neon_end();
+			do {
+				unsigned int todo = min_t(unsigned int, len, SZ_4K);
+
+				kernel_neon_begin();
+				poly1305_blocks_neon(&dctx->h, src, todo, 1);
+				kernel_neon_end();
+
+				len -= todo;
+				src += todo;
+			} while (len);
 		} else {
 			poly1305_blocks_arm(&dctx->h, src, len, 1);
+			src += len;
 		}
-		src += len;
 		nbytes %= POLY1305_BLOCK_SIZE;
 	}
 
diff --git a/arch/arm64/crypto/chacha-neon-glue.c b/arch/arm64/crypto/chacha-neon-glue.c
index 37ca3e889848..af2bbca38e70 100644
--- a/arch/arm64/crypto/chacha-neon-glue.c
+++ b/arch/arm64/crypto/chacha-neon-glue.c
@@ -87,9 +87,17 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
 	    !crypto_simd_usable())
 		return chacha_crypt_generic(state, dst, src, bytes, nrounds);
 
-	kernel_neon_begin();
-	chacha_doneon(state, dst, src, bytes, nrounds);
-	kernel_neon_end();
+	do {
+		unsigned int todo = min_t(unsigned int, bytes, SZ_4K);
+
+		kernel_neon_begin();
+		chacha_doneon(state, dst, src, todo, nrounds);
+		kernel_neon_end();
+
+		bytes -= todo;
+		src += todo;
+		dst += todo;
+	} while (bytes);
 }
 EXPORT_SYMBOL(chacha_crypt_arch);
 
diff --git a/arch/arm64/crypto/poly1305-glue.c b/arch/arm64/crypto/poly1305-glue.c
index e97b092f56b8..f33ada70c4ed 100644
--- a/arch/arm64/crypto/poly1305-glue.c
+++ b/arch/arm64/crypto/poly1305-glue.c
@@ -143,13 +143,20 @@ void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src,
 		unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);
 
 		if (static_branch_likely(&have_neon) && crypto_simd_usable()) {
-			kernel_neon_begin();
-			poly1305_blocks_neon(&dctx->h, src, len, 1);
-			kernel_neon_end();
+			do {
+				unsigned int todo = min_t(unsigned int, len, SZ_4K);
+
+				kernel_neon_begin();
+				poly1305_blocks_neon(&dctx->h, src, todo, 1);
+				kernel_neon_end();
+
+				len -= todo;
+				src += todo;
+			} while (len);
 		} else {
 			poly1305_blocks(&dctx->h, src, len, 1);
+			src += len;
 		}
-		src += len;
 		nbytes %= POLY1305_BLOCK_SIZE;
 	}
 
diff --git a/arch/x86/crypto/blake2s-glue.c b/arch/x86/crypto/blake2s-glue.c
index 06ef2d4a4701..6737bcea1fa1 100644
--- a/arch/x86/crypto/blake2s-glue.c
+++ b/arch/x86/crypto/blake2s-glue.c
@@ -32,16 +32,16 @@ void blake2s_compress_arch(struct blake2s_state *state,
 			   const u32 inc)
 {
 	/* SIMD disables preemption, so relax after processing each page. */
-	BUILD_BUG_ON(PAGE_SIZE / BLAKE2S_BLOCK_SIZE < 8);
+	BUILD_BUG_ON(SZ_4K / BLAKE2S_BLOCK_SIZE < 8);
 
 	if (!static_branch_likely(&blake2s_use_ssse3) || !crypto_simd_usable()) {
 		blake2s_compress_generic(state, block, nblocks, inc);
 		return;
 	}
 
-	for (;;) {
+	do {
 		const size_t blocks = min_t(size_t, nblocks,
-					    PAGE_SIZE / BLAKE2S_BLOCK_SIZE);
+					    SZ_4K / BLAKE2S_BLOCK_SIZE);
 
 		kernel_fpu_begin();
 		if (IS_ENABLED(CONFIG_AS_AVX512) &&
@@ -52,10 +52,8 @@ void blake2s_compress_arch(struct blake2s_state *state,
 		kernel_fpu_end();
 
 		nblocks -= blocks;
-		if (!nblocks)
-			break;
 		block += blocks * BLAKE2S_BLOCK_SIZE;
-	}
+	} while (nblocks);
 }
 EXPORT_SYMBOL(blake2s_compress_arch);
 
diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c
index b412c21ee06e..22250091cdbe 100644
--- a/arch/x86/crypto/chacha_glue.c
+++ b/arch/x86/crypto/chacha_glue.c
@@ -153,9 +153,17 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
 	    bytes <= CHACHA_BLOCK_SIZE)
 		return chacha_crypt_generic(state, dst, src, bytes, nrounds);
 
-	kernel_fpu_begin();
-	chacha_dosimd(state, dst, src, bytes, nrounds);
-	kernel_fpu_end();
+	do {
+		unsigned int todo = min_t(unsigned int, bytes, SZ_4K);
+
+		kernel_fpu_begin();
+		chacha_dosimd(state, dst, src, todo, nrounds);
+		kernel_fpu_end();
+
+		bytes -= todo;
+		src += todo;
+		dst += todo;
+	} while (bytes);
 }
 EXPORT_SYMBOL(chacha_crypt_arch);
 
diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
index 6dfec19f7d57..dfe921efa9b2 100644
--- a/arch/x86/crypto/poly1305_glue.c
+++ b/arch/x86/crypto/poly1305_glue.c
@@ -91,8 +91,8 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len,
 	struct poly1305_arch_internal *state = ctx;
 
 	/* SIMD disables preemption, so relax after processing each page. */
-	BUILD_BUG_ON(PAGE_SIZE < POLY1305_BLOCK_SIZE ||
-		     PAGE_SIZE % POLY1305_BLOCK_SIZE);
+	BUILD_BUG_ON(SZ_4K < POLY1305_BLOCK_SIZE ||
+		     SZ_4K % POLY1305_BLOCK_SIZE);
 
 	if (!static_branch_likely(&poly1305_use_avx) ||
 	    (len < (POLY1305_BLOCK_SIZE * 18) && !state->is_base2_26) ||
@@ -102,8 +102,8 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len,
 		return;
 	}
 
-	for (;;) {
-		const size_t bytes = min_t(size_t, len, PAGE_SIZE);
+	do {
+		const size_t bytes = min_t(size_t, len, SZ_4K);
 
 		kernel_fpu_begin();
 		if (IS_ENABLED(CONFIG_AS_AVX512) && static_branch_likely(&poly1305_use_avx512))
@@ -113,11 +113,10 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len,
 		else
 			poly1305_blocks_avx(ctx, inp, bytes, padbit);
 		kernel_fpu_end();
+
 		len -= bytes;
-		if (!len)
-			break;
 		inp += bytes;
-	}
+	} while (len);
 }
 
 static void poly1305_simd_emit(void *ctx, u8 mac[POLY1305_DIGEST_SIZE],
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH crypto-stable v3 2/2] crypto: arch/nhpoly1305 - process in explicit 4k chunks
  2020-04-22 23:18   ` [PATCH crypto-stable v3 1/2] crypto: arch/lib " Jason A. Donenfeld
@ 2020-04-22 23:18     ` Jason A. Donenfeld
  2020-04-23 20:39       ` Eric Biggers
  2020-04-23  7:18     ` [PATCH crypto-stable v3 1/2] crypto: arch/lib - limit simd usage to " Ard Biesheuvel
  2020-04-30  5:30     ` Herbert Xu
  2 siblings, 1 reply; 33+ messages in thread
From: Jason A. Donenfeld @ 2020-04-22 23:18 UTC (permalink / raw)
  To: herbert, linux-crypto, linux-kernel, linux-rt-users
  Cc: Jason A. Donenfeld, Eric Biggers, stable

Rather than chunking via PAGE_SIZE, this commit changes the arch
implementations to chunk in explicit 4k parts, so that calculations on
maximum acceptable latency don't suddenly become invalid on platforms
where PAGE_SIZE isn't 4k, such as arm64.

Fixes: 0f961f9f670e ("crypto: x86/nhpoly1305 - add AVX2 accelerated NHPoly1305")
Fixes: 012c82388c03 ("crypto: x86/nhpoly1305 - add SSE2 accelerated NHPoly1305")
Fixes: a00fa0c88774 ("crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305")
Fixes: 16aae3595a9d ("crypto: arm/nhpoly1305 - add NEON-accelerated NHPoly1305")
Cc: Eric Biggers <ebiggers@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 arch/arm/crypto/nhpoly1305-neon-glue.c   | 2 +-
 arch/arm64/crypto/nhpoly1305-neon-glue.c | 2 +-
 arch/x86/crypto/nhpoly1305-avx2-glue.c   | 2 +-
 arch/x86/crypto/nhpoly1305-sse2-glue.c   | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/arm/crypto/nhpoly1305-neon-glue.c b/arch/arm/crypto/nhpoly1305-neon-glue.c
index ae5aefc44a4d..ffa8d73fe722 100644
--- a/arch/arm/crypto/nhpoly1305-neon-glue.c
+++ b/arch/arm/crypto/nhpoly1305-neon-glue.c
@@ -30,7 +30,7 @@ static int nhpoly1305_neon_update(struct shash_desc *desc,
 		return crypto_nhpoly1305_update(desc, src, srclen);
 
 	do {
-		unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE);
+		unsigned int n = min_t(unsigned int, srclen, SZ_4K);
 
 		kernel_neon_begin();
 		crypto_nhpoly1305_update_helper(desc, src, n, _nh_neon);
diff --git a/arch/arm64/crypto/nhpoly1305-neon-glue.c b/arch/arm64/crypto/nhpoly1305-neon-glue.c
index 895d3727c1fb..c5405e6a6db7 100644
--- a/arch/arm64/crypto/nhpoly1305-neon-glue.c
+++ b/arch/arm64/crypto/nhpoly1305-neon-glue.c
@@ -30,7 +30,7 @@ static int nhpoly1305_neon_update(struct shash_desc *desc,
 		return crypto_nhpoly1305_update(desc, src, srclen);
 
 	do {
-		unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE);
+		unsigned int n = min_t(unsigned int, srclen, SZ_4K);
 
 		kernel_neon_begin();
 		crypto_nhpoly1305_update_helper(desc, src, n, _nh_neon);
diff --git a/arch/x86/crypto/nhpoly1305-avx2-glue.c b/arch/x86/crypto/nhpoly1305-avx2-glue.c
index f7567cbd35b6..80fcb85736e1 100644
--- a/arch/x86/crypto/nhpoly1305-avx2-glue.c
+++ b/arch/x86/crypto/nhpoly1305-avx2-glue.c
@@ -29,7 +29,7 @@ static int nhpoly1305_avx2_update(struct shash_desc *desc,
 		return crypto_nhpoly1305_update(desc, src, srclen);
 
 	do {
-		unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE);
+		unsigned int n = min_t(unsigned int, srclen, SZ_4K);
 
 		kernel_fpu_begin();
 		crypto_nhpoly1305_update_helper(desc, src, n, _nh_avx2);
diff --git a/arch/x86/crypto/nhpoly1305-sse2-glue.c b/arch/x86/crypto/nhpoly1305-sse2-glue.c
index a661ede3b5cf..cc6b7c1a2705 100644
--- a/arch/x86/crypto/nhpoly1305-sse2-glue.c
+++ b/arch/x86/crypto/nhpoly1305-sse2-glue.c
@@ -29,7 +29,7 @@ static int nhpoly1305_sse2_update(struct shash_desc *desc,
 		return crypto_nhpoly1305_update(desc, src, srclen);
 
 	do {
-		unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE);
+		unsigned int n = min_t(unsigned int, srclen, SZ_4K);
 
 		kernel_fpu_begin();
 		crypto_nhpoly1305_update_helper(desc, src, n, _nh_sse2);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable v3 1/2] crypto: arch/lib - limit simd usage to 4k chunks
  2020-04-22 23:18   ` [PATCH crypto-stable v3 1/2] crypto: arch/lib " Jason A. Donenfeld
  2020-04-22 23:18     ` [PATCH crypto-stable v3 2/2] crypto: arch/nhpoly1305 - process in explicit " Jason A. Donenfeld
@ 2020-04-23  7:18     ` Ard Biesheuvel
  2020-04-23  7:40       ` Christophe Leroy
  2020-04-23 18:42       ` Greg KH
  2020-04-30  5:30     ` Herbert Xu
  2 siblings, 2 replies; 33+ messages in thread
From: Ard Biesheuvel @ 2020-04-23  7:18 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Herbert Xu, Linux Crypto Mailing List, Linux Kernel Mailing List,
	linux-rt-users, Eric Biggers, Sebastian Andrzej Siewior

FYI: you shouldn't cc stable@vger.kernel.org directly on your patches,
or add the cc: line. Only patches that are already in Linus' tree
should be sent there.

Also, the fixes tags are really quite sufficient. In fact, it is
actually rather difficult these days to prevent something from being
taken into -stable if the bots notice that it applies cleanly.

On Thu, 23 Apr 2020 at 01:19, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> The initial Zinc patchset, after some mailing list discussion, contained
> code to ensure that kernel_fpu_enable would not be kept on for more than
> a 4k chunk, since it disables preemption. The choice of 4k isn't totally
> scientific, but it's not a bad guess either, and it's what's used in
> the x86 poly1305, blake2s, and nhpoly1305 code already (in the form
> of PAGE_SIZE, which this commit corrects to be explicitly 4k for the
> former two).
>
> Ard did some back of the envelope calculations and found that
> at 5 cycles/byte (overestimate) on a 1ghz processor (pretty slow), 4k
> means we have a maximum preemption disabling of 20us, which Sebastian
> confirmed was probably a good limit.
>
> Unfortunately the chunking appears to have been left out of the final
> patchset that added the glue code. So, this commit adds it back in.
>
> Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function")
> Fixes: b3aad5bad26a ("crypto: arm64/chacha - expose arm64 ChaCha routine as library function")
> Fixes: a44a3430d71b ("crypto: arm/chacha - expose ARM ChaCha routine as library function")
> Fixes: d7d7b8535662 ("crypto: x86/poly1305 - wire up faster implementations for kernel")
> Fixes: f569ca164751 ("crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
> Fixes: a6b803b3ddc7 ("crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
> Fixes: ed0356eda153 ("crypto: blake2s - x86_64 SIMD implementation")
> Cc: Eric Biggers <ebiggers@google.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: stable@vger.kernel.org
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>

Thanks for cleaning this up

> ---
> Changes v2->v3:
>  - [Eric] Split nhpoly1305 changes into separate commit, since it's not
>    related to the library interface.
>
> Changes v1->v2:
>  - [Ard] Use explicit 4k chunks instead of PAGE_SIZE.
>  - [Eric] Prefer do-while over for (;;).
>
>  arch/arm/crypto/chacha-glue.c        | 14 +++++++++++---
>  arch/arm/crypto/poly1305-glue.c      | 15 +++++++++++----
>  arch/arm64/crypto/chacha-neon-glue.c | 14 +++++++++++---
>  arch/arm64/crypto/poly1305-glue.c    | 15 +++++++++++----
>  arch/x86/crypto/blake2s-glue.c       | 10 ++++------
>  arch/x86/crypto/chacha_glue.c        | 14 +++++++++++---
>  arch/x86/crypto/poly1305_glue.c      | 13 ++++++-------
>  7 files changed, 65 insertions(+), 30 deletions(-)
>
> diff --git a/arch/arm/crypto/chacha-glue.c b/arch/arm/crypto/chacha-glue.c
> index 6fdb0ac62b3d..59da6c0b63b6 100644
> --- a/arch/arm/crypto/chacha-glue.c
> +++ b/arch/arm/crypto/chacha-glue.c
> @@ -91,9 +91,17 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
>                 return;
>         }
>
> -       kernel_neon_begin();
> -       chacha_doneon(state, dst, src, bytes, nrounds);
> -       kernel_neon_end();
> +       do {
> +               unsigned int todo = min_t(unsigned int, bytes, SZ_4K);
> +
> +               kernel_neon_begin();
> +               chacha_doneon(state, dst, src, todo, nrounds);
> +               kernel_neon_end();
> +
> +               bytes -= todo;
> +               src += todo;
> +               dst += todo;
> +       } while (bytes);
>  }
>  EXPORT_SYMBOL(chacha_crypt_arch);
>
> diff --git a/arch/arm/crypto/poly1305-glue.c b/arch/arm/crypto/poly1305-glue.c
> index ceec04ec2f40..13cfef4ae22e 100644
> --- a/arch/arm/crypto/poly1305-glue.c
> +++ b/arch/arm/crypto/poly1305-glue.c
> @@ -160,13 +160,20 @@ void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src,
>                 unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);
>
>                 if (static_branch_likely(&have_neon) && do_neon) {
> -                       kernel_neon_begin();
> -                       poly1305_blocks_neon(&dctx->h, src, len, 1);
> -                       kernel_neon_end();
> +                       do {
> +                               unsigned int todo = min_t(unsigned int, len, SZ_4K);
> +
> +                               kernel_neon_begin();
> +                               poly1305_blocks_neon(&dctx->h, src, todo, 1);
> +                               kernel_neon_end();
> +
> +                               len -= todo;
> +                               src += todo;
> +                       } while (len);
>                 } else {
>                         poly1305_blocks_arm(&dctx->h, src, len, 1);
> +                       src += len;
>                 }
> -               src += len;
>                 nbytes %= POLY1305_BLOCK_SIZE;
>         }
>
> diff --git a/arch/arm64/crypto/chacha-neon-glue.c b/arch/arm64/crypto/chacha-neon-glue.c
> index 37ca3e889848..af2bbca38e70 100644
> --- a/arch/arm64/crypto/chacha-neon-glue.c
> +++ b/arch/arm64/crypto/chacha-neon-glue.c
> @@ -87,9 +87,17 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
>             !crypto_simd_usable())
>                 return chacha_crypt_generic(state, dst, src, bytes, nrounds);
>
> -       kernel_neon_begin();
> -       chacha_doneon(state, dst, src, bytes, nrounds);
> -       kernel_neon_end();
> +       do {
> +               unsigned int todo = min_t(unsigned int, bytes, SZ_4K);
> +
> +               kernel_neon_begin();
> +               chacha_doneon(state, dst, src, todo, nrounds);
> +               kernel_neon_end();
> +
> +               bytes -= todo;
> +               src += todo;
> +               dst += todo;
> +       } while (bytes);
>  }
>  EXPORT_SYMBOL(chacha_crypt_arch);
>
> diff --git a/arch/arm64/crypto/poly1305-glue.c b/arch/arm64/crypto/poly1305-glue.c
> index e97b092f56b8..f33ada70c4ed 100644
> --- a/arch/arm64/crypto/poly1305-glue.c
> +++ b/arch/arm64/crypto/poly1305-glue.c
> @@ -143,13 +143,20 @@ void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src,
>                 unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);
>
>                 if (static_branch_likely(&have_neon) && crypto_simd_usable()) {
> -                       kernel_neon_begin();
> -                       poly1305_blocks_neon(&dctx->h, src, len, 1);
> -                       kernel_neon_end();
> +                       do {
> +                               unsigned int todo = min_t(unsigned int, len, SZ_4K);
> +
> +                               kernel_neon_begin();
> +                               poly1305_blocks_neon(&dctx->h, src, todo, 1);
> +                               kernel_neon_end();
> +
> +                               len -= todo;
> +                               src += todo;
> +                       } while (len);
>                 } else {
>                         poly1305_blocks(&dctx->h, src, len, 1);
> +                       src += len;
>                 }
> -               src += len;
>                 nbytes %= POLY1305_BLOCK_SIZE;
>         }
>
> diff --git a/arch/x86/crypto/blake2s-glue.c b/arch/x86/crypto/blake2s-glue.c
> index 06ef2d4a4701..6737bcea1fa1 100644
> --- a/arch/x86/crypto/blake2s-glue.c
> +++ b/arch/x86/crypto/blake2s-glue.c
> @@ -32,16 +32,16 @@ void blake2s_compress_arch(struct blake2s_state *state,
>                            const u32 inc)
>  {
>         /* SIMD disables preemption, so relax after processing each page. */
> -       BUILD_BUG_ON(PAGE_SIZE / BLAKE2S_BLOCK_SIZE < 8);
> +       BUILD_BUG_ON(SZ_4K / BLAKE2S_BLOCK_SIZE < 8);
>
>         if (!static_branch_likely(&blake2s_use_ssse3) || !crypto_simd_usable()) {
>                 blake2s_compress_generic(state, block, nblocks, inc);
>                 return;
>         }
>
> -       for (;;) {
> +       do {
>                 const size_t blocks = min_t(size_t, nblocks,
> -                                           PAGE_SIZE / BLAKE2S_BLOCK_SIZE);
> +                                           SZ_4K / BLAKE2S_BLOCK_SIZE);
>
>                 kernel_fpu_begin();
>                 if (IS_ENABLED(CONFIG_AS_AVX512) &&
> @@ -52,10 +52,8 @@ void blake2s_compress_arch(struct blake2s_state *state,
>                 kernel_fpu_end();
>
>                 nblocks -= blocks;
> -               if (!nblocks)
> -                       break;
>                 block += blocks * BLAKE2S_BLOCK_SIZE;
> -       }
> +       } while (nblocks);
>  }
>  EXPORT_SYMBOL(blake2s_compress_arch);
>
> diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c
> index b412c21ee06e..22250091cdbe 100644
> --- a/arch/x86/crypto/chacha_glue.c
> +++ b/arch/x86/crypto/chacha_glue.c
> @@ -153,9 +153,17 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
>             bytes <= CHACHA_BLOCK_SIZE)
>                 return chacha_crypt_generic(state, dst, src, bytes, nrounds);
>
> -       kernel_fpu_begin();
> -       chacha_dosimd(state, dst, src, bytes, nrounds);
> -       kernel_fpu_end();
> +       do {
> +               unsigned int todo = min_t(unsigned int, bytes, SZ_4K);
> +
> +               kernel_fpu_begin();
> +               chacha_dosimd(state, dst, src, todo, nrounds);
> +               kernel_fpu_end();
> +
> +               bytes -= todo;
> +               src += todo;
> +               dst += todo;
> +       } while (bytes);
>  }
>  EXPORT_SYMBOL(chacha_crypt_arch);
>
> diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
> index 6dfec19f7d57..dfe921efa9b2 100644
> --- a/arch/x86/crypto/poly1305_glue.c
> +++ b/arch/x86/crypto/poly1305_glue.c
> @@ -91,8 +91,8 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len,
>         struct poly1305_arch_internal *state = ctx;
>
>         /* SIMD disables preemption, so relax after processing each page. */
> -       BUILD_BUG_ON(PAGE_SIZE < POLY1305_BLOCK_SIZE ||
> -                    PAGE_SIZE % POLY1305_BLOCK_SIZE);
> +       BUILD_BUG_ON(SZ_4K < POLY1305_BLOCK_SIZE ||
> +                    SZ_4K % POLY1305_BLOCK_SIZE);
>
>         if (!static_branch_likely(&poly1305_use_avx) ||
>             (len < (POLY1305_BLOCK_SIZE * 18) && !state->is_base2_26) ||
> @@ -102,8 +102,8 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len,
>                 return;
>         }
>
> -       for (;;) {
> -               const size_t bytes = min_t(size_t, len, PAGE_SIZE);
> +       do {
> +               const size_t bytes = min_t(size_t, len, SZ_4K);
>
>                 kernel_fpu_begin();
>                 if (IS_ENABLED(CONFIG_AS_AVX512) && static_branch_likely(&poly1305_use_avx512))
> @@ -113,11 +113,10 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len,
>                 else
>                         poly1305_blocks_avx(ctx, inp, bytes, padbit);
>                 kernel_fpu_end();
> +
>                 len -= bytes;
> -               if (!len)
> -                       break;
>                 inp += bytes;
> -       }
> +       } while (len);
>  }
>
>  static void poly1305_simd_emit(void *ctx, u8 mac[POLY1305_DIGEST_SIZE],
> --
> 2.26.2
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable v3 1/2] crypto: arch/lib - limit simd usage to 4k chunks
  2020-04-23  7:18     ` [PATCH crypto-stable v3 1/2] crypto: arch/lib - limit simd usage to " Ard Biesheuvel
@ 2020-04-23  7:40       ` Christophe Leroy
  2020-04-23  7:47         ` Ard Biesheuvel
  2020-04-23 18:42       ` Greg KH
  1 sibling, 1 reply; 33+ messages in thread
From: Christophe Leroy @ 2020-04-23  7:40 UTC (permalink / raw)
  To: Ard Biesheuvel, Jason A. Donenfeld
  Cc: Herbert Xu, Linux Crypto Mailing List, Linux Kernel Mailing List,
	linux-rt-users, Eric Biggers, Sebastian Andrzej Siewior



On 23/04/2020 at 09:18, Ard Biesheuvel wrote:
> FYI: you shouldn't cc stable@vger.kernel.org directly on your patches,
> or add the cc: line. Only patches that are already in Linus' tree
> should be sent there.
> 
> Also, the fixes tags are really quite sufficient. In fact, it is
> actually rather difficult these days to prevent something from being
> taken into -stable if the bots notice that it applies cleanly.

According to the kernel documentation
(https://www.kernel.org/doc/html/latest/process/submitting-patches.html):


Patches that fix a severe bug in a released kernel should be directed 
toward the stable maintainers by putting a line like this:

Cc: stable@vger.kernel.org

into the sign-off area of your patch (note, NOT an email recipient). You 
should also read Documentation/process/stable-kernel-rules.rst in 
addition to this file.


Isn't it correct anymore?

Christophe


> 
> On Thu, 23 Apr 2020 at 01:19, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>>
>> The initial Zinc patchset, after some mailing list discussion, contained
>> code to ensure that kernel_fpu_enable would not be kept on for more than
>> a 4k chunk, since it disables preemption. The choice of 4k isn't totally
>> scientific, but it's not a bad guess either, and it's what's used in
>> both the x86 poly1305, blake2s, and nhpoly1305 code already (in the form
>> of PAGE_SIZE, which this commit corrects to be explicitly 4k for the
>> former two).
>>
>> Ard did some back of the envelope calculations and found that
>> at 5 cycles/byte (overestimate) on a 1ghz processor (pretty slow), 4k
>> means we have a maximum preemption disabling of 20us, which Sebastian
>> confirmed was probably a good limit.
>>
>> Unfortunately the chunking appears to have been left out of the final
>> patchset that added the glue code. So, this commit adds it back in.
>>
>> Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function")
>> Fixes: b3aad5bad26a ("crypto: arm64/chacha - expose arm64 ChaCha routine as library function")
>> Fixes: a44a3430d71b ("crypto: arm/chacha - expose ARM ChaCha routine as library function")
>> Fixes: d7d7b8535662 ("crypto: x86/poly1305 - wire up faster implementations for kernel")
>> Fixes: f569ca164751 ("crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
>> Fixes: a6b803b3ddc7 ("crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
>> Fixes: ed0356eda153 ("crypto: blake2s - x86_64 SIMD implementation")
>> Cc: Eric Biggers <ebiggers@google.com>
>> Cc: Ard Biesheuvel <ardb@kernel.org>
>> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> 
> Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
> 
> Thanks for cleaning this up
> 
>> ---
>> Changes v2->v3:
>>   - [Eric] Split nhpoly1305 changes into separate commit, since it's not
>>     related to the library interface.
>>
>> Changes v1->v2:
>>   - [Ard] Use explicit 4k chunks instead of PAGE_SIZE.
>>   - [Eric] Prefer do-while over for (;;).
>>
>>   arch/arm/crypto/chacha-glue.c        | 14 +++++++++++---
>>   arch/arm/crypto/poly1305-glue.c      | 15 +++++++++++----
>>   arch/arm64/crypto/chacha-neon-glue.c | 14 +++++++++++---
>>   arch/arm64/crypto/poly1305-glue.c    | 15 +++++++++++----
>>   arch/x86/crypto/blake2s-glue.c       | 10 ++++------
>>   arch/x86/crypto/chacha_glue.c        | 14 +++++++++++---
>>   arch/x86/crypto/poly1305_glue.c      | 13 ++++++-------
>>   7 files changed, 65 insertions(+), 30 deletions(-)
>>
>> diff --git a/arch/arm/crypto/chacha-glue.c b/arch/arm/crypto/chacha-glue.c
>> index 6fdb0ac62b3d..59da6c0b63b6 100644
>> --- a/arch/arm/crypto/chacha-glue.c
>> +++ b/arch/arm/crypto/chacha-glue.c
>> @@ -91,9 +91,17 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
>>                  return;
>>          }
>>
>> -       kernel_neon_begin();
>> -       chacha_doneon(state, dst, src, bytes, nrounds);
>> -       kernel_neon_end();
>> +       do {
>> +               unsigned int todo = min_t(unsigned int, bytes, SZ_4K);
>> +
>> +               kernel_neon_begin();
>> +               chacha_doneon(state, dst, src, todo, nrounds);
>> +               kernel_neon_end();
>> +
>> +               bytes -= todo;
>> +               src += todo;
>> +               dst += todo;
>> +       } while (bytes);
>>   }
>>   EXPORT_SYMBOL(chacha_crypt_arch);
>>
>> diff --git a/arch/arm/crypto/poly1305-glue.c b/arch/arm/crypto/poly1305-glue.c
>> index ceec04ec2f40..13cfef4ae22e 100644
>> --- a/arch/arm/crypto/poly1305-glue.c
>> +++ b/arch/arm/crypto/poly1305-glue.c
>> @@ -160,13 +160,20 @@ void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src,
>>                  unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);
>>
>>                  if (static_branch_likely(&have_neon) && do_neon) {
>> -                       kernel_neon_begin();
>> -                       poly1305_blocks_neon(&dctx->h, src, len, 1);
>> -                       kernel_neon_end();
>> +                       do {
>> +                               unsigned int todo = min_t(unsigned int, len, SZ_4K);
>> +
>> +                               kernel_neon_begin();
>> +                               poly1305_blocks_neon(&dctx->h, src, todo, 1);
>> +                               kernel_neon_end();
>> +
>> +                               len -= todo;
>> +                               src += todo;
>> +                       } while (len);
>>                  } else {
>>                          poly1305_blocks_arm(&dctx->h, src, len, 1);
>> +                       src += len;
>>                  }
>> -               src += len;
>>                  nbytes %= POLY1305_BLOCK_SIZE;
>>          }
>>
>> diff --git a/arch/arm64/crypto/chacha-neon-glue.c b/arch/arm64/crypto/chacha-neon-glue.c
>> index 37ca3e889848..af2bbca38e70 100644
>> --- a/arch/arm64/crypto/chacha-neon-glue.c
>> +++ b/arch/arm64/crypto/chacha-neon-glue.c
>> @@ -87,9 +87,17 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
>>              !crypto_simd_usable())
>>                  return chacha_crypt_generic(state, dst, src, bytes, nrounds);
>>
>> -       kernel_neon_begin();
>> -       chacha_doneon(state, dst, src, bytes, nrounds);
>> -       kernel_neon_end();
>> +       do {
>> +               unsigned int todo = min_t(unsigned int, bytes, SZ_4K);
>> +
>> +               kernel_neon_begin();
>> +               chacha_doneon(state, dst, src, todo, nrounds);
>> +               kernel_neon_end();
>> +
>> +               bytes -= todo;
>> +               src += todo;
>> +               dst += todo;
>> +       } while (bytes);
>>   }
>>   EXPORT_SYMBOL(chacha_crypt_arch);
>>
>> diff --git a/arch/arm64/crypto/poly1305-glue.c b/arch/arm64/crypto/poly1305-glue.c
>> index e97b092f56b8..f33ada70c4ed 100644
>> --- a/arch/arm64/crypto/poly1305-glue.c
>> +++ b/arch/arm64/crypto/poly1305-glue.c
>> @@ -143,13 +143,20 @@ void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src,
>>                  unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);
>>
>>                  if (static_branch_likely(&have_neon) && crypto_simd_usable()) {
>> -                       kernel_neon_begin();
>> -                       poly1305_blocks_neon(&dctx->h, src, len, 1);
>> -                       kernel_neon_end();
>> +                       do {
>> +                               unsigned int todo = min_t(unsigned int, len, SZ_4K);
>> +
>> +                               kernel_neon_begin();
>> +                               poly1305_blocks_neon(&dctx->h, src, todo, 1);
>> +                               kernel_neon_end();
>> +
>> +                               len -= todo;
>> +                               src += todo;
>> +                       } while (len);
>>                  } else {
>>                          poly1305_blocks(&dctx->h, src, len, 1);
>> +                       src += len;
>>                  }
>> -               src += len;
>>                  nbytes %= POLY1305_BLOCK_SIZE;
>>          }
>>
>> diff --git a/arch/x86/crypto/blake2s-glue.c b/arch/x86/crypto/blake2s-glue.c
>> index 06ef2d4a4701..6737bcea1fa1 100644
>> --- a/arch/x86/crypto/blake2s-glue.c
>> +++ b/arch/x86/crypto/blake2s-glue.c
>> @@ -32,16 +32,16 @@ void blake2s_compress_arch(struct blake2s_state *state,
>>                             const u32 inc)
>>   {
>>          /* SIMD disables preemption, so relax after processing each page. */
>> -       BUILD_BUG_ON(PAGE_SIZE / BLAKE2S_BLOCK_SIZE < 8);
>> +       BUILD_BUG_ON(SZ_4K / BLAKE2S_BLOCK_SIZE < 8);
>>
>>          if (!static_branch_likely(&blake2s_use_ssse3) || !crypto_simd_usable()) {
>>                  blake2s_compress_generic(state, block, nblocks, inc);
>>                  return;
>>          }
>>
>> -       for (;;) {
>> +       do {
>>                  const size_t blocks = min_t(size_t, nblocks,
>> -                                           PAGE_SIZE / BLAKE2S_BLOCK_SIZE);
>> +                                           SZ_4K / BLAKE2S_BLOCK_SIZE);
>>
>>                  kernel_fpu_begin();
>>                  if (IS_ENABLED(CONFIG_AS_AVX512) &&
>> @@ -52,10 +52,8 @@ void blake2s_compress_arch(struct blake2s_state *state,
>>                  kernel_fpu_end();
>>
>>                  nblocks -= blocks;
>> -               if (!nblocks)
>> -                       break;
>>                  block += blocks * BLAKE2S_BLOCK_SIZE;
>> -       }
>> +       } while (nblocks);
>>   }
>>   EXPORT_SYMBOL(blake2s_compress_arch);
>>
>> diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c
>> index b412c21ee06e..22250091cdbe 100644
>> --- a/arch/x86/crypto/chacha_glue.c
>> +++ b/arch/x86/crypto/chacha_glue.c
>> @@ -153,9 +153,17 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
>>              bytes <= CHACHA_BLOCK_SIZE)
>>                  return chacha_crypt_generic(state, dst, src, bytes, nrounds);
>>
>> -       kernel_fpu_begin();
>> -       chacha_dosimd(state, dst, src, bytes, nrounds);
>> -       kernel_fpu_end();
>> +       do {
>> +               unsigned int todo = min_t(unsigned int, bytes, SZ_4K);
>> +
>> +               kernel_fpu_begin();
>> +               chacha_dosimd(state, dst, src, todo, nrounds);
>> +               kernel_fpu_end();
>> +
>> +               bytes -= todo;
>> +               src += todo;
>> +               dst += todo;
>> +       } while (bytes);
>>   }
>>   EXPORT_SYMBOL(chacha_crypt_arch);
>>
>> diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
>> index 6dfec19f7d57..dfe921efa9b2 100644
>> --- a/arch/x86/crypto/poly1305_glue.c
>> +++ b/arch/x86/crypto/poly1305_glue.c
>> @@ -91,8 +91,8 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len,
>>          struct poly1305_arch_internal *state = ctx;
>>
>>          /* SIMD disables preemption, so relax after processing each page. */
>> -       BUILD_BUG_ON(PAGE_SIZE < POLY1305_BLOCK_SIZE ||
>> -                    PAGE_SIZE % POLY1305_BLOCK_SIZE);
>> +       BUILD_BUG_ON(SZ_4K < POLY1305_BLOCK_SIZE ||
>> +                    SZ_4K % POLY1305_BLOCK_SIZE);
>>
>>          if (!static_branch_likely(&poly1305_use_avx) ||
>>              (len < (POLY1305_BLOCK_SIZE * 18) && !state->is_base2_26) ||
>> @@ -102,8 +102,8 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len,
>>                  return;
>>          }
>>
>> -       for (;;) {
>> -               const size_t bytes = min_t(size_t, len, PAGE_SIZE);
>> +       do {
>> +               const size_t bytes = min_t(size_t, len, SZ_4K);
>>
>>                  kernel_fpu_begin();
>>                  if (IS_ENABLED(CONFIG_AS_AVX512) && static_branch_likely(&poly1305_use_avx512))
>> @@ -113,11 +113,10 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len,
>>                  else
>>                          poly1305_blocks_avx(ctx, inp, bytes, padbit);
>>                  kernel_fpu_end();
>> +
>>                  len -= bytes;
>> -               if (!len)
>> -                       break;
>>                  inp += bytes;
>> -       }
>> +       } while (len);
>>   }
>>
>>   static void poly1305_simd_emit(void *ctx, u8 mac[POLY1305_DIGEST_SIZE],
>> --
>> 2.26.2
>>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable v3 1/2] crypto: arch/lib - limit simd usage to 4k chunks
  2020-04-23  7:40       ` Christophe Leroy
@ 2020-04-23  7:47         ` Ard Biesheuvel
  0 siblings, 0 replies; 33+ messages in thread
From: Ard Biesheuvel @ 2020-04-23  7:47 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Jason A. Donenfeld, Herbert Xu, Linux Crypto Mailing List,
	Linux Kernel Mailing List, linux-rt-users, Eric Biggers,
	Sebastian Andrzej Siewior

On Thu, 23 Apr 2020 at 09:40, Christophe Leroy <christophe.leroy@c-s.fr> wrote:
>
>
>
> On 23/04/2020 at 09:18, Ard Biesheuvel wrote:
> > FYI: you shouldn't cc stable@vger.kernel.org directly on your patches,
> > or add the cc: line. Only patches that are already in Linus' tree
> > should be sent there.
> >
> > Also, the fixes tags are really quite sufficient. In fact, it is
> > actually rather difficult these days to prevent something from being
> > taken into -stable if the bots notice that it applies cleanly.
>
> According to Kernel Documentation,
> https://www.kernel.org/doc/html/latest/process/submitting-patches.html :
>
>
> Patches that fix a severe bug in a released kernel should be directed
> toward the stable maintainers by putting a line like this:
>
> Cc: stable@vger.kernel.org
>
> into the sign-off area of your patch (note, NOT an email recipient). You
> should also read Documentation/process/stable-kernel-rules.rst in
> addition to this file.
>
>
> Isn't it correct anymore ?
>

So this states clearly that you should not actually cc the patch to
stable@vger.kernel.org; you should only add the cc: line to the commit
log area if it fixes a *severe* bug. Once the patch makes it into
Linus' tree, the cc: line will be used by the -stable maintainers to
locate the patch and pull it in.

But as I pointed out, even with just the fixes: tags, the patch will
be picked up by -stable anyway. Omitting the cc: line helps prevent
inadvertently sending the patch to stable@vger.kernel.org directly,
since git send-email typically honours Cc: lines in its input in its
default configuration.
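
(And if someone does want to keep the Cc: line in the commit log but
still avoid mailing the list while the patch is under review, something
like

	git send-email --suppress-cc=bodycc 0001-*.patch

should do the trick with a stock git, since --suppress-cc=bodycc tells
send-email to ignore Cc: lines found in the patch body.)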

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks
  2020-04-22 20:17         ` Jason A. Donenfeld
@ 2020-04-23  8:45           ` Ard Biesheuvel
  0 siblings, 0 replies; 33+ messages in thread
From: Ard Biesheuvel @ 2020-04-23  8:45 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Eric Biggers, Herbert Xu, Linux Crypto Mailing List, LKML

On Wed, 22 Apr 2020 at 22:17, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> On Wed, Apr 22, 2020 at 1:51 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > On Wed, Apr 22, 2020 at 1:39 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
> > > On Wed, 22 Apr 2020 at 09:32, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> > > >
> > > > On Tue, Apr 21, 2020 at 10:04 PM Eric Biggers <ebiggers@kernel.org> wrote:
> > > > > Seems this should just be a 'while' loop?
> > > > >
> > > > >         while (bytes) {
> > > > >                 unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
> > > > >
> > > > >                 kernel_neon_begin();
> > > > >                 chacha_doneon(state, dst, src, todo, nrounds);
> > > > >                 kernel_neon_end();
> > > > >
> > > > >                 bytes -= todo;
> > > > >                 src += todo;
> > > > >                 dst += todo;
> > > > >         }
> > > >
> > > > The for(;;) is how it's done elsewhere in the kernel (that this patch
> > > > doesn't touch), because then we can break out of the loop before
> > > > having to increment src and dst unnecessarily. Likely a pointless
> > > > optimization as probably the compiler can figure out how to avoid
> > > > that. But maybe it can't. If you have a strong preference, I can
> > > > reactor everything to use `while (bytes)`, but if you don't care,
> > > > let's keep this as-is. Opinion?
> > > >
> > >
> > > Since we're bikeshedding, I'd prefer 'do { } while (bytes);' here,
> > > given that bytes is guaranteed to be non-zero before we enter the
> > > loop. But in any case, I'd prefer avoiding for(;;) or while(1) where
> > > we can.
> >
> > Okay, will do-while it up for v2.
>
> I just sent v2 containing do-while, and I'm fine with that going in
> that way. But just in the interest of curiosity in the Pantone
> palette, check this out:
>
> https://godbolt.org/z/VxXien
>
> It looks like on mine, the compiler avoids unnecessarily calling those
> adds on the last iteration, but on the other hand, it results in an
> otherwise unnecessary unconditional jump for the < 4096 case. Sort of
> interesting. Arm64 code is more or less the same difference too.

Yeah, even if shaving off 1 or 2 cycles mattered here (since we've
just decided that ugh() may take up to 20,000 cycles), hiding a couple
of ALU instructions in the slots between the subs (which sets the zero
flag) and the conditional branch that tests it probably comes for free
on in-order cores anyway. And even if it didn't, backwards branches
are usually statically predicted as taken, in which case their results
are actually needed.

On out-of-order cores under speculation, none of this matters anyway.
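
For completeness, the 20,000 figure is just the commit message's
back-of-the-envelope numbers multiplied out, roughly:

	/* Rough, illustrative bound using the commit message's estimates. */
	enum {
		WORST_CYCLES_PER_BYTE  = 5,	/* deliberate overestimate */
		CHUNK_BYTES            = 4096,	/* SZ_4K */
		WORST_CYCLES_PER_CHUNK = WORST_CYCLES_PER_BYTE * CHUNK_BYTES,
	};
	/*
	 * 20480 cycles on a 1 GHz core (about 1 ns per cycle) is roughly
	 * 20 us with preemption disabled per chunk.
	 */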

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable v3 1/2] crypto: arch/lib - limit simd usage to 4k chunks
  2020-04-23  7:18     ` [PATCH crypto-stable v3 1/2] crypto: arch/lib - limit simd usage to " Ard Biesheuvel
  2020-04-23  7:40       ` Christophe Leroy
@ 2020-04-23 18:42       ` Greg KH
  2020-04-23 18:47         ` Ard Biesheuvel
  1 sibling, 1 reply; 33+ messages in thread
From: Greg KH @ 2020-04-23 18:42 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Jason A. Donenfeld, Herbert Xu, Linux Crypto Mailing List,
	Linux Kernel Mailing List, linux-rt-users, Eric Biggers,
	Sebastian Andrzej Siewior

On Thu, Apr 23, 2020 at 09:18:15AM +0200, Ard Biesheuvel wrote:
> FYI: you shouldn't cc stable@vger.kernel.org directly on your patches,
> or add the cc: line. Only patches that are already in Linus' tree
> should be sent there.

Not true at all, please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.  Please do not spread incorrect
information.

And Jason did this properly, he put cc: stable@ in the s-o-b area and
all is good, I will pick up this patch once it hits Linus's tree.

And there is no problem actually sending the patch to stable@vger while
under development like this, as it gives me a heads-up that something is
coming, and is trivial to filter out.

If you really want to be nice, you can just do:
	cc: stable@kernel.org
which goes to /dev/null on kernel.org, so no email will be sent to any
list, but my scripts still pick it up.  But no real need to do that,
it's fine.
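
As a concrete example, reusing the tags already on the patch in this
thread, that would look something like:

	Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function")
	Cc: stable@kernel.org
	Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

and my scripts will still pick it up once it hits Linus's tree.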

> Also, the fixes tags are really quite sufficient.

No it is not, I have had to dig out patches more and more because people
do NOT put the cc: stable and only put Fixes:, which is not a good thing
as I then have to "guess" what the maintainer/developer meant.

Be explicit if you know it, cc: stable please.

> In fact, it is
> actually rather difficult these days to prevent something from being
> taken into -stable if the bots notice that it applies cleanly.

Those "bots" are still driven by a lot of human work, please make it
easy on us whenever possible.

thanks,

greg k-h
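
To make the placement concrete, the trailer block Greg is describing looks
like the following (the hash and subject here are placeholders, not a real
commit):

	Fixes: 123456789abc ("crypto: example - hypothetical buggy commit")
	Cc: stable@vger.kernel.org
	Signed-off-by: Developer Name <dev@example.com>

The Cc: line rides along in the s-o-b area of the commit message itself, so
the stable scripts can pick it up once the patch lands in Linus's tree.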

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable v3 1/2] crypto: arch/lib - limit simd usage to 4k chunks
  2020-04-23 18:42       ` Greg KH
@ 2020-04-23 18:47         ` Ard Biesheuvel
  2020-04-23 20:23           ` Eric Biggers
  0 siblings, 1 reply; 33+ messages in thread
From: Ard Biesheuvel @ 2020-04-23 18:47 UTC (permalink / raw)
  To: Greg KH
  Cc: Jason A. Donenfeld, Herbert Xu, Linux Crypto Mailing List,
	Linux Kernel Mailing List, linux-rt-users, Eric Biggers,
	Sebastian Andrzej Siewior

On Thu, 23 Apr 2020 at 20:42, Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Thu, Apr 23, 2020 at 09:18:15AM +0200, Ard Biesheuvel wrote:
> > FYI: you shouldn't cc stable@vger.kernel.org directly on your patches,
> > or add the cc: line. Only patches that are already in Linus' tree
> > should be sent there.
>
> Not true at all, please read:
>     https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> for how to do this properly.  Please do not spread incorrect
> information.
>
> And Jason did this properly, he put cc: stable@ in the s-o-b area and
> all is good, I will pick up this patch once it hits Linus's tree.
>
> And there is no problem actually sending the patch to stable@vger while
> under development like this, as it gives me a heads-up that something is
> coming, and is trivial to filter out.
>
> If you really want to be nice, you can just do:
>         cc: stable@kernel.org
> which goes to /dev/null on kernel.org, so no email will be sent to any
> list, but my scripts still pick it up.  But no real need to do that,
> it's fine.
>

OK, thanks for clearing this up.

So does this mean you have stopped sending out 'formletter'
auto-replies for patches that were sent out to stable@vger.kernel.org
directly, telling people not to do that?

> > Also, the fixes tags are really quite sufficient.
>
> No it is not, I have had to dig out patches more and more because people
> do NOT put the cc: stable and only put Fixes:, which is not a good thing
> as I then have to "guess" what the maintainer/developer ment.
>
> Be explicit if you know it, cc: stable please.
>

OK

> > In fact, it is
> > actually rather difficult these days to prevent something from being
> > taken into -stable if the bots notice that it applies cleanly.
>
> Those "bots" are still driven by a lot of human work, please make it
> easy on us whenever possible.
>

Sure.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable v3 1/2] crypto: arch/lib - limit simd usage to 4k chunks
  2020-04-23 18:47         ` Ard Biesheuvel
@ 2020-04-23 20:23           ` Eric Biggers
  2020-04-23 20:49             ` Ard Biesheuvel
  0 siblings, 1 reply; 33+ messages in thread
From: Eric Biggers @ 2020-04-23 20:23 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Greg KH, Jason A. Donenfeld, Herbert Xu,
	Linux Crypto Mailing List, Linux Kernel Mailing List,
	linux-rt-users, Sebastian Andrzej Siewior

On Thu, Apr 23, 2020 at 08:47:00PM +0200, Ard Biesheuvel wrote:
> On Thu, 23 Apr 2020 at 20:42, Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Thu, Apr 23, 2020 at 09:18:15AM +0200, Ard Biesheuvel wrote:
> > > FYI: you shouldn't cc stable@vger.kernel.org directly on your patches,
> > > or add the cc: line. Only patches that are already in Linus' tree
> > > should be sent there.
> >
> > Not true at all, please read:
> >     https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> > for how to do this properly.  Please do not spread incorrect
> > information.
> >
> > And Jason did this properly, he put cc: stable@ in the s-o-b area and
> > all is good, I will pick up this patch once it hits Linus's tree.
> >
> > And there is no problem actually sending the patch to stable@vger while
> > under development like this, as it gives me a heads-up that something is
> > coming, and is trivial to filter out.
> >
> > If you really want to be nice, you can just do:
> >         cc: stable@kernel.org
> > which goes to /dev/null on kernel.org, so no email will be sent to any
> > list, but my scripts still pick it up.  But no real need to do that,
> > it's fine.
> >
> 
> OK, thanks for clearing this up.
> 
> So does this mean you have stopped sending out 'formletter'
> auto-replies for patches that were sent out to stable@vger.kernel.org
> directly, telling people not to do that?
> 

I often leave stable@vger.kernel.org in the email Cc list, and no one has ever
complained.  It's only sending patches directly "To:" stable@vger.kernel.org
that isn't allowed, except when actually sending out backports.

If there were people who had an actual issue with Cc, then I think the rules
would have changed long ago to use some other tag, like Backport-to, that
doesn't get picked up by git send-email.

- Eric
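
As a hypothetical illustration of the distinction Eric draws (the list
address and filename are placeholders), the patch carries
Cc: stable@vger.kernel.org as a trailer and is sent only To: the list:

	# git send-email reads the Cc: trailer from the patch itself;
	# stable@vger.kernel.org never appears on the To: line.
	git send-email --to=linux-crypto@vger.kernel.org 0001-example.patch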

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable v3 2/2] crypto: arch/nhpoly1305 - process in explicit 4k chunks
  2020-04-22 23:18     ` [PATCH crypto-stable v3 2/2] crypto: arch/nhpoly1305 - process in explicit " Jason A. Donenfeld
@ 2020-04-23 20:39       ` Eric Biggers
  0 siblings, 0 replies; 33+ messages in thread
From: Eric Biggers @ 2020-04-23 20:39 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: herbert, linux-crypto, linux-kernel, linux-rt-users, stable

On Wed, Apr 22, 2020 at 05:18:54PM -0600, Jason A. Donenfeld wrote:
> Rather than chunking via PAGE_SIZE, this commit changes the arch
> implementations to chunk in explicit 4k parts, so that calculations on
> maximum acceptable latency don't suddenly become invalid on platforms
> where PAGE_SIZE isn't 4k, such as arm64.
> 
> Fixes: 0f961f9f670e ("crypto: x86/nhpoly1305 - add AVX2 accelerated NHPoly1305")
> Fixes: 012c82388c03 ("crypto: x86/nhpoly1305 - add SSE2 accelerated NHPoly1305")
> Fixes: a00fa0c88774 ("crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305")
> Fixes: 16aae3595a9d ("crypto: arm/nhpoly1305 - add NEON-accelerated NHPoly1305")
> Cc: Eric Biggers <ebiggers@google.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

arm64 normally uses PAGE_SIZE == 4k, so this commit message is a little
misleading.  Anyway, I agree with using 4k, so:

Reviewed-by: Eric Biggers <ebiggers@google.com>

- Eric
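
The substance of the change under review, sketched as a diff hunk (variable
name and context are placeholders, not the literal patch):

-		unsigned int todo = min_t(unsigned int, srclen, PAGE_SIZE);
+		unsigned int todo = min_t(unsigned int, srclen, SZ_4K);

On 4k-page configurations the two are identical; on a 64k-page arm64 kernel
the explicit SZ_4K keeps the worst-case preemption-off window the same
instead of letting it grow sixteenfold.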

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable v3 1/2] crypto: arch/lib - limit simd usage to 4k chunks
  2020-04-23 20:23           ` Eric Biggers
@ 2020-04-23 20:49             ` Ard Biesheuvel
  2020-04-28 23:09               ` Jason A. Donenfeld
  0 siblings, 1 reply; 33+ messages in thread
From: Ard Biesheuvel @ 2020-04-23 20:49 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Greg KH, Jason A. Donenfeld, Herbert Xu,
	Linux Crypto Mailing List, Linux Kernel Mailing List,
	linux-rt-users, Sebastian Andrzej Siewior

On Thu, 23 Apr 2020 at 22:23, Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Thu, Apr 23, 2020 at 08:47:00PM +0200, Ard Biesheuvel wrote:
> > On Thu, 23 Apr 2020 at 20:42, Greg KH <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Thu, Apr 23, 2020 at 09:18:15AM +0200, Ard Biesheuvel wrote:
> > > > FYI: you shouldn't cc stable@vger.kernel.org directly on your patches,
> > > > or add the cc: line. Only patches that are already in Linus' tree
> > > > should be sent there.
> > >
> > > Not true at all, please read:
> > >     https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> > > for how to do this properly.  Please do not spread incorrect
> > > information.
> > >
> > > And Jason did this properly, he put cc: stable@ in the s-o-b area and
> > > all is good, I will pick up this patch once it hits Linus's tree.
> > >
> > > And there is no problem actually sending the patch to stable@vger while
> > > under development like this, as it gives me a heads-up that something is
> > > coming, and is trivial to filter out.
> > >
> > > If you really want to be nice, you can just do:
> > >         cc: stable@kernel.org
> > > which goes to /dev/null on kernel.org, so no email will be sent to any
> > > list, but my scripts still pick it up.  But no real need to do that,
> > > it's fine.
> > >
> >
> > OK, thanks for clearing this up.
> >
> > So does this mean you have stopped sending out 'formletter'
> > auto-replies for patches that were sent out to stable@vger.kernel.org
> > directly, telling people not to do that?
> >
>
> I often leave stable@vger.kernel.org in the email Cc list, and no one has ever
> complained.  It's only sending patches directly "To:" stable@vger.kernel.org
> that isn't allowed, except when actually sending out backports.
>
> If there were people who had an actual issue with Cc, then I think the rules
> would have changed long ago to using some other tag like Backport-to that
> doesn't get picked up by git send-email.
>

OK, good to know.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable v3 1/2] crypto: arch/lib - limit simd usage to 4k chunks
  2020-04-23 20:49             ` Ard Biesheuvel
@ 2020-04-28 23:09               ` Jason A. Donenfeld
  0 siblings, 0 replies; 33+ messages in thread
From: Jason A. Donenfeld @ 2020-04-28 23:09 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Eric Biggers, Ard Biesheuvel, Greg KH, Linux Crypto Mailing List,
	Linux Kernel Mailing List, linux-rt-users,
	Sebastian Andrzej Siewior

Hi Herbert,

This v3 patchset has a Reviewed-by from Ard for 1/2 and from Eric for
2/2, from last week. Could you submit this to Linus for rc4?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH crypto-stable v3 1/2] crypto: arch/lib - limit simd usage to 4k chunks
  2020-04-22 23:18   ` [PATCH crypto-stable v3 1/2] crypto: arch/lib " Jason A. Donenfeld
  2020-04-22 23:18     ` [PATCH crypto-stable v3 2/2] crypto: arch/nhpoly1305 - process in explicit " Jason A. Donenfeld
  2020-04-23  7:18     ` [PATCH crypto-stable v3 1/2] crypto: arch/lib - limit simd usage to " Ard Biesheuvel
@ 2020-04-30  5:30     ` Herbert Xu
  2 siblings, 0 replies; 33+ messages in thread
From: Herbert Xu @ 2020-04-30  5:30 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: linux-crypto, linux-kernel, linux-rt-users, Eric Biggers,
	Ard Biesheuvel, Sebastian Andrzej Siewior, stable

On Wed, Apr 22, 2020 at 05:18:53PM -0600, Jason A. Donenfeld wrote:
> The initial Zinc patchset, after some mailing list discussion, contained
> code to ensure that kernel_fpu_enable would not be kept on for more than
> a 4k chunk, since it disables preemption. The choice of 4k isn't totally
> scientific, but it's not a bad guess either, and it's what's used in
> the x86 poly1305, blake2s, and nhpoly1305 code already (in the form
> of PAGE_SIZE, which this commit corrects to be explicitly 4k for the
> former two).
> 
> Ard did some back of the envelope calculations and found that
> at 5 cycles/byte (overestimate) on a 1 GHz processor (pretty slow), 4k
> means we have a maximum preemption disabling of 20us, which Sebastian
> confirmed was probably a good limit.
> 
> Unfortunately the chunking appears to have been left out of the final
> patchset that added the glue code. So, this commit adds it back in.
> 
> Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function")
> Fixes: b3aad5bad26a ("crypto: arm64/chacha - expose arm64 ChaCha routine as library function")
> Fixes: a44a3430d71b ("crypto: arm/chacha - expose ARM ChaCha routine as library function")
> Fixes: d7d7b8535662 ("crypto: x86/poly1305 - wire up faster implementations for kernel")
> Fixes: f569ca164751 ("crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
> Fixes: a6b803b3ddc7 ("crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
> Fixes: ed0356eda153 ("crypto: blake2s - x86_64 SIMD implementation")
> Cc: Eric Biggers <ebiggers@google.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: stable@vger.kernel.org
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
> Changes v2->v3:
>  - [Eric] Split nhpoly1305 changes into separate commit, since it's not
>    related to the library interface.
> 
> Changes v1->v2:
>  - [Ard] Use explicit 4k chunks instead of PAGE_SIZE.
>  - [Eric] Prefer do-while over for (;;).
> 
>  arch/arm/crypto/chacha-glue.c        | 14 +++++++++++---
>  arch/arm/crypto/poly1305-glue.c      | 15 +++++++++++----
>  arch/arm64/crypto/chacha-neon-glue.c | 14 +++++++++++---
>  arch/arm64/crypto/poly1305-glue.c    | 15 +++++++++++----
>  arch/x86/crypto/blake2s-glue.c       | 10 ++++------
>  arch/x86/crypto/chacha_glue.c        | 14 +++++++++++---
>  arch/x86/crypto/poly1305_glue.c      | 13 ++++++-------
>  7 files changed, 65 insertions(+), 30 deletions(-)

All applied.  Thanks.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
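
For reference, the back-of-the-envelope bound quoted in the commit message
works out as follows, taking the stated figures at face value:

	4096 bytes x 5 cycles/byte = 20480 cycles
	20480 cycles / 10^9 cycles/sec ~= 20.5 us ~= 20 us per SIMD section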

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2020-04-30  5:31 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-20  7:57 [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks Jason A. Donenfeld
2020-04-20  8:32 ` David Laight
2020-04-21  4:02   ` Jason A. Donenfeld
2020-04-21  4:14   ` FPU register granularity [Was: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks] Jason A. Donenfeld
2020-04-21  4:25     ` Jason A. Donenfeld
2020-04-21  7:02     ` Ard Biesheuvel
2020-04-21  8:05       ` David Laight
2020-04-21  8:11     ` David Laight
2020-04-22  4:04 ` [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks Eric Biggers
2020-04-22  7:23   ` Ard Biesheuvel
2020-04-22  7:38     ` Jason A. Donenfeld
2020-04-22 11:28     ` Sebastian Andrzej Siewior
2020-04-22 19:35       ` Jason A. Donenfeld
2020-04-22  7:32   ` Jason A. Donenfeld
2020-04-22  7:39     ` Ard Biesheuvel
2020-04-22 19:51       ` Jason A. Donenfeld
2020-04-22 20:17         ` Jason A. Donenfeld
2020-04-23  8:45           ` Ard Biesheuvel
2020-04-22 20:03 ` [PATCH crypto-stable v2] crypto: arch - limit simd usage to 4k chunks Jason A. Donenfeld
2020-04-22 22:39   ` Eric Biggers
2020-04-22 23:09     ` Jason A. Donenfeld
2020-04-22 23:18   ` [PATCH crypto-stable v3 1/2] crypto: arch/lib " Jason A. Donenfeld
2020-04-22 23:18     ` [PATCH crypto-stable v3 2/2] crypto: arch/nhpoly1305 - process in explicit " Jason A. Donenfeld
2020-04-23 20:39       ` Eric Biggers
2020-04-23  7:18     ` [PATCH crypto-stable v3 1/2] crypto: arch/lib - limit simd usage to " Ard Biesheuvel
2020-04-23  7:40       ` Christophe Leroy
2020-04-23  7:47         ` Ard Biesheuvel
2020-04-23 18:42       ` Greg KH
2020-04-23 18:47         ` Ard Biesheuvel
2020-04-23 20:23           ` Eric Biggers
2020-04-23 20:49             ` Ard Biesheuvel
2020-04-28 23:09               ` Jason A. Donenfeld
2020-04-30  5:30     ` Herbert Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).