All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] crypto: x86/sha512 - load based on CPU features
@ 2022-08-13 23:04 Robert Elliott
  2022-08-19 11:05 ` Herbert Xu
  0 siblings, 1 reply; 6+ messages in thread
From: Robert Elliott @ 2022-08-13 23:04 UTC (permalink / raw)
  To: tim.c.chen, herbert, davem, linux-crypto, mingo, bp, dave.hansen,
	x86, toshi.kani, rwright
  Cc: Robert Elliott

x86 optimized crypto modules built as modules rather than built-in
to the kernel end up as .ko files in the filesystem, e.g., in
/usr/lib/modules. If the filesystem itself is a module, these might
not be available when the crypto API is initialized, resulting in
the generic implementation being used (e.g., sha512_transform rather
than sha512_transform_avx2).

In one test case, CPU utilization in the sha512 function dropped
from 15.34% to 7.18% after forcing loading of the optimized module.

Add module aliases for this x86 optimized crypto module based on CPU
feature bits so udev gets a chance to load them later in the boot
process when the filesystems are all running.

Signed-off-by: Robert Elliott <elliott@hpe.com>
---
 arch/x86/crypto/sha512_ssse3_glue.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/crypto/sha512_ssse3_glue.c b/arch/x86/crypto/sha512_ssse3_glue.c
index 30e70f4fe2f7..6d3b85e53d0e 100644
--- a/arch/x86/crypto/sha512_ssse3_glue.c
+++ b/arch/x86/crypto/sha512_ssse3_glue.c
@@ -36,6 +36,7 @@
 #include <linux/types.h>
 #include <crypto/sha2.h>
 #include <crypto/sha512_base.h>
+#include <asm/cpu_device_id.h>
 #include <asm/simd.h>
 
 asmlinkage void sha512_transform_ssse3(struct sha512_state *state,
@@ -284,6 +285,13 @@ static int register_sha512_avx2(void)
 			ARRAY_SIZE(sha512_avx2_algs));
 	return 0;
 }
+static const struct x86_cpu_id module_cpu_ids[] = {
+	X86_MATCH_FEATURE(X86_FEATURE_AVX2, NULL),
+	X86_MATCH_FEATURE(X86_FEATURE_AVX, NULL),
+	X86_MATCH_FEATURE(X86_FEATURE_SSSE3, NULL),
+	{}
+};
+MODULE_DEVICE_TABLE(x86cpu, module_cpu_ids);
 
 static void unregister_sha512_avx2(void)
 {
@@ -294,6 +302,8 @@ static void unregister_sha512_avx2(void)
 
 static int __init sha512_ssse3_mod_init(void)
 {
+	if (!x86_match_cpu(module_cpu_ids))
+		return -ENODEV;
 
 	if (register_sha512_ssse3())
 		goto fail;
-- 
2.37.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] crypto: x86/sha512 - load based on CPU features
  2022-08-13 23:04 [PATCH] crypto: x86/sha512 - load based on CPU features Robert Elliott
@ 2022-08-19 11:05 ` Herbert Xu
  2022-08-19 22:36   ` Elliott, Robert (Servers)
  0 siblings, 1 reply; 6+ messages in thread
From: Herbert Xu @ 2022-08-19 11:05 UTC (permalink / raw)
  To: Robert Elliott
  Cc: tim.c.chen, davem, linux-crypto, mingo, bp, dave.hansen, x86,
	toshi.kani, rwright, elliott

Robert Elliott <elliott@hpe.com> wrote:
> x86 optimized crypto modules built as modules rather than built-in
> to the kernel end up as .ko files in the filesystem, e.g., in
> /usr/lib/modules. If the filesystem itself is a module, these might
> not be available when the crypto API is initialized, resulting in
> the generic implementation being used (e.g., sha512_transform rather
> than sha512_transform_avx2).
> 
> In one test case, CPU utilization in the sha512 function dropped
> from 15.34% to 7.18% after forcing loading of the optimized module.
> 
> Add module aliases for this x86 optimized crypto module based on CPU
> feature bits so udev gets a chance to load them later in the boot
> process when the filesystems are all running.
> 
> Signed-off-by: Robert Elliott <elliott@hpe.com>
> ---
> arch/x86/crypto/sha512_ssse3_glue.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)

Patch applied.  Thanks.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [PATCH] crypto: x86/sha512 - load based on CPU features
  2022-08-19 11:05 ` Herbert Xu
@ 2022-08-19 22:36   ` Elliott, Robert (Servers)
  2022-08-26  2:40     ` Elliott, Robert (Servers)
  0 siblings, 1 reply; 6+ messages in thread
From: Elliott, Robert (Servers) @ 2022-08-19 22:36 UTC (permalink / raw)
  To: Herbert Xu
  Cc: tim.c.chen, davem, linux-crypto, mingo, bp, dave.hansen, x86,
	Kani, Toshi, Wright, Randy (HPE Servers Linux)



> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Friday, August 19, 2022 6:05 AM
> Subject: Re: [PATCH] crypto: x86/sha512 - load based on CPU features
> 
> Robert Elliott <elliott@hpe.com> wrote:
> > x86 optimized crypto modules built as modules rather than built-in
> > to the kernel end up as .ko files in the filesystem, e.g., in
> > /usr/lib/modules. If the filesystem itself is a module, these might
> > not be available when the crypto API is initialized, resulting in
> > the generic implementation being used (e.g., sha512_transform rather
> > than sha512_transform_avx2).
> >
> > In one test case, CPU utilization in the sha512 function dropped
> > from 15.34% to 7.18% after forcing loading of the optimized module.
> >
> > Add module aliases for this x86 optimized crypto module based on CPU
> > feature bits so udev gets a chance to load them later in the boot
> > process when the filesystems are all running.
> >
> > Signed-off-by: Robert Elliott <elliott@hpe.com>
> > ---
> > arch/x86/crypto/sha512_ssse3_glue.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> 
> Patch applied.  Thanks.

I'll post a series that applies this technique to all the other
x86 modules, if there are no problems reported with sha512.

Do other architectures have the same issue, or is something else done
to ensure that modules are loaded?

MODULE_DEVICE_TABLE() is used by three other architecture-specific
modules:

1. One arm (32-bit) module, matching on certain Arm extensions:
  arch/arm/crypto/crc32-ce-glue.c:MODULE_DEVICE_TABLE(cpu, crc32_cpu_feature);

2. One arm64 module, matching on certain Arm extensions:
  arch/arm64/crypto/ghash-ce-glue.c:MODULE_DEVICE_TABLE(cpu, ghash_cpu_feature);

3. All of the sparc modules share a global solution:
  arch/sparc/crypto/crop_devid.c:MODULE_DEVICE_TABLE(of, crypto_opcode_match);

/* This is a dummy device table linked into all of the crypto
 * opcode drivers.  It serves to trigger the module autoloading
 * mechanisms in userspace which scan the OF device tree and
 * load any modules which have device table entries that
 * match OF device nodes.
 */

Each module .c file includes that .c file:
  aes_glue.c:#include "crop_devid.c"
  camellia_glue.c:#include "crop_devid.c"
  crc32c_glue.c:#include "crop_devid.c"
  des_glue.c:#include "crop_devid.c"
  md5_glue.c:#include "crop_devid.c"
  sha1_glue.c:#include "crop_devid.c"
  sha256_glue.c:#include "crop_devid.c"
  sha512_glue.c:#include "crop_devid.c"



^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [PATCH] crypto: x86/sha512 - load based on CPU features
  2022-08-19 22:36   ` Elliott, Robert (Servers)
@ 2022-08-26  2:40     ` Elliott, Robert (Servers)
  2022-08-26  2:52       ` Herbert Xu
  0 siblings, 1 reply; 6+ messages in thread
From: Elliott, Robert (Servers) @ 2022-08-26  2:40 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Elliott, Robert (Servers),
	tim.c.chen, Taehee Yoo, davem, linux-crypto, mingo, bp,
	dave.hansen, x86, Kani, Toshi, Wright, Randy (HPE Servers Linux)



> -----Original Message-----
> From: Elliott, Robert (Servers) <elliott@hpe.com>
> Sent: Friday, August 19, 2022 5:37 PM
> To: Herbert Xu <herbert@gondor.apana.org.au>
> Subject: RE: [PATCH] crypto: x86/sha512 - load based on CPU features
> 
> > -----Original Message-----
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Sent: Friday, August 19, 2022 6:05 AM
> > Subject: Re: [PATCH] crypto: x86/sha512 - load based on CPU features
> >
> > Robert Elliott <elliott@hpe.com> wrote:
> > > x86 optimized crypto modules built as modules rather than built-in
> > > to the kernel end up as .ko files in the filesystem, e.g., in
> > > /usr/lib/modules. If the filesystem itself is a module, these might
> > > not be available when the crypto API is initialized, resulting in
> > > the generic implementation being used (e.g., sha512_transform rather
> > > than sha512_transform_avx2).
> > >
> > > In one test case, CPU utilization in the sha512 function dropped
> > > from 15.34% to 7.18% after forcing loading of the optimized module.
> > >
> > > Add module aliases for this x86 optimized crypto module based on CPU
> > > feature bits so udev gets a chance to load them later in the boot
> > > process when the filesystems are all running.
> > >
> > > Signed-off-by: Robert Elliott <elliott@hpe.com>
> > > ---
> > > arch/x86/crypto/sha512_ssse3_glue.c | 10 ++++++++++
> > > 1 file changed, 10 insertions(+)
> >
> > Patch applied.  Thanks.
> 
> I'll post a series that applies this technique to all the other
> x86 modules, if there are no problems reported with sha512.

Suggestion: please revert the sha512-x86 patch for a while.

I've been running with this technique applied to all x86 modules for
a few weeks and noticed a potential problem.

I've seen several cases of "rcu_preempt detected expedited stalls" 
with stack dumps pointing to the x86 crypto functions:
    rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 12-... } 22 jiffies s: 277 root: 0x1/.

(It doesn't always happen; I haven't found a definitive recipe to
trigger the problem yet.)

Three instances occurred during boot, with the stack track pointing
to the sha512-x86 function (my system is set to use SHA-512 for 
module signing, so it's using that algorithm a lot):
	module_sig_check/mod_verify_sig/
	pkcs7_verify/pkcs7_digest/
	sha512_finup/sha512_base_do_update

I also triggered three of them by running "modprobe tcrypt mode=n"
looping from 0 to 1000, all in the aes-x86 module:
	tcrypt: testing encryption speed of sync skcipher cts(cbc(aes)) using cts(cbc(aes-aesni))
	tcrypt: testing encryption speed of sync skcipher cfb(aes) using cfb(aes-aesni)
	tcrypt: testing decryption speed of sync skcipher cfb(aes) using cfb(aes-aesni)

In that case the stack traces pointed to:
	do_test/prepare_alloc_pages/test_skcipher_speed.cold

Theory:
I noticed the proposed aria-x86 module has these calls surrounding
its main data processing functions:
	kernel_fpu_begin();
	kernel_fpu_end();

The sha-x86 and aes-x86 implementations use those as well. For
example, sha512_update() is structured as:
        kernel_fpu_begin();
        sha512_base_do_update(desc, data, len, sha512_xform);
        kernel_fpu_end();

and aesni_encrypt() is structured as:
	kernel_fpu_begin();
	aesni_enc(ctx, dst, src);
	kernel_fpu_end();

I noticed that kernel_fpu_begin() includes this:
	preempt_disable();

and kernel_fpu_end() has:
	preempt_enable();

Is that sufficient to cause the RCU problems? 

Although preempt_disable isn't disabling interrupts, it's blocking the
scheduler from using the CPUs (A few of the arm AES functions mess
with interrupts, but none of the others seem to do so). So, I suspect
that could lead to RCU problems and soft lockups, but not hard
lockups.

Do these functions need to break up their processing into smaller chunks
(e.g., a few Megabytes), calling kernel_fpu_end() periodically to 
allow the scheduler to take over the CPUs if needed? If so, what
chunk size would be appropriate?



Example stack trace:
[   29.729811] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 12-... } 22 jiffies s: 277 root: 0x1/.
[   29.729815] rcu: blocking rcu_node structures (internal RCU debug): l=1:0-13:0x1000/.
[   29.729818] Task dump for CPU 12:
[   29.729819] task:modprobe        state:R  running task     stack:    0 pid: 1703 ppid:  1702 flags:0x0000400a
[   29.729822] Call Trace:
[   29.729823]  <TASK>
[   29.729825]  ? sysvec_apic_timer_interrupt+0xab/0xc0
[   29.729830]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[   29.729835]  ? loop1+0x3f2/0x98f [sha512_ssse3]
[   29.729839]  ? crypto_create_tfm_node+0x33/0x100
[   29.729843]  ? nowork+0x6/0x6 [sha512_ssse3]
[   29.729844]  ? sha512_base_do_update.isra.0+0xeb/0x160 [sha512_ssse3]
[   29.729847]  ? nowork+0x6/0x6 [sha512_ssse3]
[   29.729849]  ? sha512_finup.part.0+0x1de/0x230 [sha512_ssse3]
[   29.729851]  ? pkcs7_digest+0xaf/0x1f0
[   29.729854]  ? pkcs7_verify+0x61/0x540
[   29.729856]  ? verify_pkcs7_message_sig+0x4a/0xe0
[   29.729859]  ? pkcs7_parse_message+0x174/0x1b0
[   29.729861]  ? verify_pkcs7_signature+0x4c/0x80
[   29.729862]  ? mod_verify_sig+0x74/0x90
[   29.729867]  ? module_sig_check+0x87/0xd0
[   29.729868]  ? load_module+0x4e/0x1fc0
[   29.729871]  ? xfs_file_read_iter+0x70/0xe0 [xfs]
[   29.729955]  ? __kernel_read+0x118/0x290
[   29.729959]  ? ima_post_read_file+0xac/0xc0
[   29.729962]  ? kernel_read_file+0x211/0x2a0
[   29.729965]  ? __do_sys_finit_module+0x93/0xf0
[   29.729967]  ? __do_sys_finit_module+0x93/0xf0
[   29.729969]  ? do_syscall_64+0x58/0x80
[   29.729971]  ? syscall_exit_to_user_mode+0x17/0x40
[   29.729973]  ? do_syscall_64+0x67/0x80
[   29.729975]  ? exit_to_user_mode_prepare+0x18f/0x1f0
[   29.729976]  ? syscall_exit_to_user_mode+0x17/0x40
[   29.729977]  ? do_syscall_64+0x67/0x80
[   29.729979]  ? syscall_exit_to_user_mode+0x17/0x40
[   29.729980]  ? do_syscall_64+0x67/0x80
[   29.729982]  ? exc_page_fault+0x70/0x170
[   29.729983]  ? entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   29.729986]  </TASK>
[   29.753810] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { } 236 jiffies s: 281 root: 0x0/.



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] crypto: x86/sha512 - load based on CPU features
  2022-08-26  2:40     ` Elliott, Robert (Servers)
@ 2022-08-26  2:52       ` Herbert Xu
  2022-08-30 17:40         ` Tim Chen
  0 siblings, 1 reply; 6+ messages in thread
From: Herbert Xu @ 2022-08-26  2:52 UTC (permalink / raw)
  To: Elliott, Robert (Servers)
  Cc: tim.c.chen, Taehee Yoo, davem, linux-crypto, mingo, bp,
	dave.hansen, x86, Kani, Toshi, Wright, Randy (HPE Servers Linux)

On Fri, Aug 26, 2022 at 02:40:58AM +0000, Elliott, Robert (Servers) wrote:
>
> Suggestion: please revert the sha512-x86 patch for a while.

This problem would have existed anyway if the module was built
into the kernel.

> Do these functions need to break up their processing into smaller chunks
> (e.g., a few Megabytes), calling kernel_fpu_end() periodically to 
> allow the scheduler to take over the CPUs if needed? If so, what
> chunk size would be appropriate?

Yes these should be limited to 4K each.  It appears that all the
sha* helpers in arch/x86/crypto have the same problem.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] crypto: x86/sha512 - load based on CPU features
  2022-08-26  2:52       ` Herbert Xu
@ 2022-08-30 17:40         ` Tim Chen
  0 siblings, 0 replies; 6+ messages in thread
From: Tim Chen @ 2022-08-30 17:40 UTC (permalink / raw)
  To: Herbert Xu, Elliott, Robert (Servers)
  Cc: Taehee Yoo, davem, linux-crypto, mingo, bp, dave.hansen, x86,
	Kani, Toshi, Wright, Randy (HPE Servers Linux)

On Fri, 2022-08-26 at 10:52 +0800, Herbert Xu wrote:
> On Fri, Aug 26, 2022 at 02:40:58AM +0000, Elliott, Robert (Servers) wrote:
> > Suggestion: please revert the sha512-x86 patch for a while.
> 
> This problem would have existed anyway if the module was built
> into the kernel.
> 
> > Do these functions need to break up their processing into smaller chunks
> > (e.g., a few Megabytes), calling kernel_fpu_end() periodically to 
> > allow the scheduler to take over the CPUs if needed? If so, what
> > chunk size would be appropriate?
> 
> Yes these should be limited to 4K each.  It appears that all the
> sha* helpers in arch/x86/crypto have the same problem.
> 

I think limiting chunk to 4K is reasonable to prevent the CPU from being hogged by
the crypto code for too long.

Tim


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-08-30 17:44 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-13 23:04 [PATCH] crypto: x86/sha512 - load based on CPU features Robert Elliott
2022-08-19 11:05 ` Herbert Xu
2022-08-19 22:36   ` Elliott, Robert (Servers)
2022-08-26  2:40     ` Elliott, Robert (Servers)
2022-08-26  2:52       ` Herbert Xu
2022-08-30 17:40         ` Tim Chen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.