linux-crypto.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Elliott, Robert (Servers)" <elliott@hpe.com>
To: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: "herbert@gondor.apana.org.au" <herbert@gondor.apana.org.au>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"tim.c.chen@linux.intel.com" <tim.c.chen@linux.intel.com>,
	"ap420073@gmail.com" <ap420073@gmail.com>,
	"ardb@kernel.org" <ardb@kernel.org>,
	"David.Laight@ACULAB.COM" <David.Laight@ACULAB.COM>,
	"ebiggers@kernel.org" <ebiggers@kernel.org>,
	"linux-crypto@vger.kernel.org" <linux-crypto@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: RE: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
Date: Tue, 22 Nov 2022 05:06:18 +0000	[thread overview]
Message-ID: <MW5PR84MB1842625D9ECAF4F6244F558BAB0D9@MW5PR84MB1842.NAMPRD84.PROD.OUTLOOK.COM> (raw)
In-Reply-To: <Y3TF7/+DejcnN0eV@zx2c4.com>



> -----Original Message-----
> From: Jason A. Donenfeld <Jason@zx2c4.com>
> Sent: Wednesday, November 16, 2022 5:14 AM
> To: Elliott, Robert (Servers) <elliott@hpe.com>
> Cc: herbert@gondor.apana.org.au; davem@davemloft.net;
> tim.c.chen@linux.intel.com; ap420073@gmail.com; ardb@kernel.org;
> David.Laight@ACULAB.COM; ebiggers@kernel.org; linux-
> crypto@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
> 
> On Tue, Nov 15, 2022 at 10:13:28PM -0600, Robert Elliott wrote:
> > +/* avoid kernel_fpu_begin/end scheduler/rcu stalls */
> > +static const unsigned int bytes_per_fpu = 337 * 1024;
> > +
> 
> Use an enum for constants like this:
> 
>     enum { BYTES_PER_FPU = ... };
> 
> You can even make it function-local, so it's near the code that uses it,
> which will better justify its existence.

Using crc32c-intel as an example, the gcc 12.2.1 assembly output is the
same for:

1. const variable per-module
	static const unsigned int bytes_per_fpu = 868 * 1024;
2. enum per-module
	enum { bytes_per_fpu = 868 * 1024 } ;
3. enum per function (_update and _finup)
	enum { bytes_per_fpu = 868 * 1024 } ;

Each function gets a movl instruction with that constant, and has the
same compare, add, subtract, and jump instructions.

# ../arch/x86/crypto/crc32c-intel_glue.c:198:                   unsigned int chunk = min(len, bytes_per_fpu);
        movl    $888832, %eax   #, tmp127

# ../arch/x86/crypto/crc32c-intel_glue.c:171:                   unsigned int chunk = min(len, bytes_per_fpu);
        movl    $888832, %r13d  #, tmp128

Since enum doesn't guarantee any particular type, those variations
upset the min() macro. min_t() is necessary to eliminate the
compiler warning.

../arch/x86/crypto/crc32c-intel_glue.c: In function ‘crc32c_pcl_intel_update’:
../arch/x86/crypto/crc32c-intel_glue.c:171:97: warning: comparison of distinct pointer types lacks a cast
  171 |                         unsigned int chunk = min(len, bytes_per_fpu);

> Also, where did you get this number? Seems kind of weird.

As described in replies on the v2 patches, I created a tcrypt test that
runs each algorithm on a 1 MiB buffer with no loop limits (and irqs
disabled),  picks the best result out of 10 passes, and calculates the
number of bytes that nominally fit in 30 us (on a 2.2 GHz Cascade
Lake CPU).

Actual results with those values vary from 37 to 102 us; it is much
better than running unlimited, but still imperfect.

https://lore.kernel.org/lkml/MW5PR84MB184284FBED63E2D043C93A6FAB369@MW5PR84MB1842.NAMPRD84.PROD.OUTLOOK.COM/

The hash algorithms seem to congregate around three speeds:
- slow: 10 to 20 KiB for sha* and sm3
- medium: 200 to 400 KiB for poly*
- fast: 600 to 800 KiB for crc*

so it might be preferable to just go with three values (e.g., 16, 256, and
512 KiB). There's a lot of variability in CPU architecture, CPU speeds, and
other system activity that make this impossible to perfect.

It'd be ideal if the loops checked the CPU cycle count rather than
worked on a byte count, but the RDTSC instruction is notoriously slow and
would slow down overall performance.

The RAID6 library supports running a benchmark during each boot to pick
the best implementation to use (not always enabled):
[    3.341372] raid6: skipped pq benchmark and selected avx512x4

The crypto subsystem does something similar with its self-tests, which could
be expanded to include speed tests to tune the loop values. However, that 
slows down boot and could be misled by an NMI or SMI during the test, which
could lead to even worse results.

The selected values could be kept in each file or put in a shared .h file.
Both approaches seem difficult to maintain.

The powerpc modules have paragraph-sized comments explaining their MAX_BYTES
macros (e.g., in arch/powerpc/crypto/sha256-spe-glue.c) that might be a good
model for documenting the values:
	 * MAX_BYTES defines the number of bytes that are allowed to be processed
	 * between preempt_disable() and preempt_enable(). SHA256 takes ~2,000
	 * operations per 64 bytes. e500 cores can issue two arithmetic instructions
	 * per clock cycle using one 32/64 bit unit (SU1) and one 32 bit unit (SU2).
	 * Thus 1KB of input data will need an estimated maximum of 18,000 cycles.
	 * Headroom for cache misses included. Even with the low end model clocked
	 * at 667 MHz this equals to a critical time window of less than 27us.



> >  asmlinkage void nh_avx2(const u32 *key, const u8 *message, size_t
> message_len,
> >  			u8 hash[NH_HASH_BYTES]);
> >
> > @@ -26,18 +29,20 @@ static void _nh_avx2(const u32 *key, const u8
> *message, size_t message_len,
> >  static int nhpoly1305_avx2_update(struct shash_desc *desc,
> >  				  const u8 *src, unsigned int srclen)
> >  {
> > +	BUILD_BUG_ON(bytes_per_fpu == 0);
> 
> Make the constant function local and remove this check.

That just makes sure someone editing the source code doesn't pick a value that
will cause the loops to hang; it's stronger than a comment saying "don't set
this to 0". It's only a compile-time check, and doesn't result in any
the assembly language output that I can see.
 
> > +7
> >  	if (srclen < 64 || !crypto_simd_usable())
> >  		return crypto_nhpoly1305_update(desc, src, srclen);
> >
> > -	do {
> > -		unsigned int n = min_t(unsigned int, srclen, SZ_4K);
> > +	while (srclen) {
> 
> Does this add a needless additional check or does it generate better
> code? Would be nice to have some explanation of the rationale.

Each module's assembly function can have different handling of
- length 0
- length < block size
- length < some minimum length
- length < a performance switchover point
- length not a multiple of block size
- current buffer pointer not aligned to block size
Sometimes the C glue logic checks values upfront; sometimes
it doesn't.

The while loops help get them to follow one of two patterns:
	while (length)
or
	while (length >= BLOCK_SIZE)

and sidestep some of the special handling concerns.

Performance-wise, the patches are either 
- adding lots of kernel_fpu_begin() and kernel_fpu_end() calls 
  (all the ones that were running unlimited)
- removing lots of kernel_fpu_begin() and kernel_fpu_end() calls
  (e.g., polyval relaxed from 4 KiB to 383 KiB)

which is much more impactful than the common while loop entry.

I created tcrypt tests that try lengths around all the special
values like 0, 16, 4096, and the selected bytes per FPU size
(comparing results to the generic algorithms like the extended
self-tests), so I think these loops are functionally correct
(I've found that violating the undocumented assumptions of the
assembly functions is a good way to exercise RCU stall
reporting).




  reply	other threads:[~2022-11-22  5:06 UTC|newest]

Thread overview: 126+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-06 22:31 [RFC PATCH 0/7] crypto: x86 - fix RCU stalls Robert Elliott
2022-10-06 22:31 ` [RFC PATCH 1/7] rcu: correct CONFIG_EXT_RCU_CPU_STALL_TIMEOUT descriptions Robert Elliott
2022-10-06 22:31 ` [RFC PATCH 2/7] crypto: x86/sha - limit FPU preemption Robert Elliott
2022-10-06 22:31 ` [RFC PATCH 3/7] crypto: x86/crc " Robert Elliott
2022-10-06 22:31 ` [RFC PATCH 4/7] crypto: x86/sm3 " Robert Elliott
2022-10-06 22:31 ` [RFC PATCH 5/7] crypto: x86/ghash - restructure FPU context saving Robert Elliott
2022-10-06 22:31 ` [RFC PATCH 6/7] crypto: x86/ghash - limit FPU preemption Robert Elliott
2022-10-06 22:31 ` [RFC PATCH 7/7] crypto: x86 - use common macro for FPU limit Robert Elliott
2022-10-12 21:59 ` [PATCH v2 00/19] crypto: x86 - fix RCU stalls Robert Elliott
2022-10-12 21:59   ` [PATCH v2 01/19] crypto: tcrypt - test crc32 Robert Elliott
2022-10-12 21:59   ` [PATCH v2 02/19] crypto: tcrypt - test nhpoly1305 Robert Elliott
2022-10-12 21:59   ` [PATCH v2 03/19] crypto: tcrypt - reschedule during cycles speed tests Robert Elliott
2022-10-12 21:59   ` [PATCH v2 04/19] crypto: x86/sha - limit FPU preemption Robert Elliott
2022-10-13  0:41     ` Jason A. Donenfeld
2022-10-13 21:50       ` Elliott, Robert (Servers)
2022-10-14 11:01       ` David Laight
2022-10-13  5:57     ` Eric Biggers
2022-10-13  6:04       ` Herbert Xu
2022-10-13  6:08         ` Eric Biggers
2022-10-13  7:50           ` Herbert Xu
2022-10-13 22:41       ` :Re: " Elliott, Robert (Servers)
2022-10-12 21:59   ` [PATCH v2 05/19] crypto: x86/crc " Robert Elliott
2022-10-13  2:00     ` Herbert Xu
2022-10-13 22:34       ` Elliott, Robert (Servers)
2022-10-14  4:02     ` David Laight
2022-10-24  2:03     ` kernel test robot
2022-10-12 21:59   ` [PATCH v2 06/19] crypto: x86/sm3 " Robert Elliott
2022-10-12 21:59   ` [PATCH v2 07/19] crypto: x86/ghash - restructure FPU context saving Robert Elliott
2022-10-12 21:59   ` [PATCH v2 08/19] crypto: x86/ghash - limit FPU preemption Robert Elliott
2022-10-13  6:03     ` Eric Biggers
2022-10-13 22:52       ` Elliott, Robert (Servers)
2022-10-12 21:59   ` [PATCH v2 09/19] crypto: x86 - use common macro for FPU limit Robert Elliott
2022-10-13  0:35     ` Jason A. Donenfeld
2022-10-13 21:48       ` Elliott, Robert (Servers)
2022-10-14  1:26         ` Jason A. Donenfeld
2022-10-18  0:06           ` Elliott, Robert (Servers)
2022-10-12 21:59   ` [PATCH v2 10/19] crypto: x86/sha1, sha256 - load based on CPU features Robert Elliott
2022-10-12 21:59   ` [PATCH v2 11/19] crypto: x86/crc " Robert Elliott
2022-10-12 21:59   ` [PATCH v2 12/19] crypto: x86/sm3 " Robert Elliott
2022-10-12 21:59   ` [PATCH v2 13/19] crypto: x86/ghash " Robert Elliott
2022-10-12 21:59   ` [PATCH v2 14/19] crypto: x86 " Robert Elliott
2022-10-14 14:26     ` Elliott, Robert (Servers)
2022-10-12 21:59   ` [PATCH v2 15/19] crypto: x86 - add pr_fmt to all modules Robert Elliott
2022-10-12 21:59   ` [PATCH v2 16/19] crypto: x86 - print CPU optimized loaded messages Robert Elliott
2022-10-13  0:40     ` Jason A. Donenfeld
2022-10-13 13:47     ` kernel test robot
2022-10-13 13:48     ` kernel test robot
2022-10-12 21:59   ` [PATCH v2 17/19] crypto: x86 - standardize suboptimal prints Robert Elliott
2022-10-13  0:38     ` Jason A. Donenfeld
2022-10-12 21:59   ` [PATCH v2 18/19] crypto: x86 - standardize not loaded prints Robert Elliott
2022-10-13  0:42     ` Jason A. Donenfeld
2022-10-13 22:20       ` Elliott, Robert (Servers)
2022-11-10 22:06         ` Elliott, Robert (Servers)
2022-10-12 21:59   ` [PATCH v2 19/19] crypto: x86/sha - register only the best function Robert Elliott
2022-10-13  6:07     ` Eric Biggers
2022-10-13  7:52       ` Herbert Xu
2022-10-13 22:59         ` Elliott, Robert (Servers)
2022-10-14  8:22           ` Herbert Xu
2022-11-01 21:34   ` [PATCH v2 00/19] crypto: x86 - fix RCU stalls Elliott, Robert (Servers)
2022-11-03  4:27   ` [PATCH v3 00/17] crypt: " Robert Elliott
2022-11-03  4:27     ` [PATCH v3 01/17] crypto: tcrypt - test crc32 Robert Elliott
2022-11-03  4:27     ` [PATCH v3 02/17] crypto: tcrypt - test nhpoly1305 Robert Elliott
2022-11-03  4:27     ` [PATCH v3 03/17] crypto: tcrypt - reschedule during cycles speed tests Robert Elliott
2022-11-03  4:27     ` [PATCH v3 04/17] crypto: x86/sha - limit FPU preemption Robert Elliott
2022-11-03  4:27     ` [PATCH v3 05/17] crypto: x86/crc " Robert Elliott
2022-11-03  4:27     ` [PATCH v3 06/17] crypto: x86/sm3 " Robert Elliott
2022-11-03  4:27     ` [PATCH v3 07/17] crypto: x86/ghash - use u8 rather than char Robert Elliott
2022-11-03  4:27     ` [PATCH v3 08/17] crypto: x86/ghash - restructure FPU context saving Robert Elliott
2022-11-03  4:27     ` [PATCH v3 09/17] crypto: x86/ghash - limit FPU preemption Robert Elliott
2022-11-03  4:27     ` [PATCH v3 10/17] crypto: x86/*poly* " Robert Elliott
2022-11-03  4:27     ` [PATCH v3 11/17] crypto: x86/sha - register all variations Robert Elliott
2022-11-03  9:26       ` kernel test robot
2022-11-03  4:27     ` [PATCH v3 12/17] crypto: x86/sha - minimize time in FPU context Robert Elliott
2022-11-03  4:27     ` [PATCH v3 13/17] crypto: x86/sha1, sha256 - load based on CPU features Robert Elliott
2022-11-03  4:27     ` [PATCH v3 14/17] crypto: x86/crc " Robert Elliott
2022-11-03  4:27     ` [PATCH v3 15/17] crypto: x86/sm3 " Robert Elliott
2022-11-03  4:27     ` [PATCH v3 16/17] crypto: x86/ghash,polyval " Robert Elliott
2022-11-03  4:27     ` [PATCH v3 17/17] crypto: x86/nhpoly1305, poly1305 " Robert Elliott
2022-11-16  4:13     ` [PATCH v4 00/24] crypto: fix RCU stalls Robert Elliott
2022-11-16  4:13       ` [PATCH v4 01/24] crypto: tcrypt - test crc32 Robert Elliott
2022-11-16  4:13       ` [PATCH v4 02/24] crypto: tcrypt - test nhpoly1305 Robert Elliott
2022-11-16  4:13       ` [PATCH v4 03/24] crypto: tcrypt - reschedule during cycles speed tests Robert Elliott
2022-11-16  4:13       ` [PATCH v4 04/24] crypto: x86/sha - limit FPU preemption Robert Elliott
2022-11-16  4:13       ` [PATCH v4 05/24] crypto: x86/crc " Robert Elliott
2022-11-16  4:13       ` [PATCH v4 06/24] crypto: x86/sm3 " Robert Elliott
2022-11-16  4:13       ` [PATCH v4 07/24] crypto: x86/ghash - use u8 rather than char Robert Elliott
2022-11-16  4:13       ` [PATCH v4 08/24] crypto: x86/ghash - restructure FPU context saving Robert Elliott
2022-11-16  4:13       ` [PATCH v4 09/24] crypto: x86/ghash - limit FPU preemption Robert Elliott
2022-11-16  4:13       ` [PATCH v4 10/24] crypto: x86/poly " Robert Elliott
2022-11-16 11:13         ` Jason A. Donenfeld
2022-11-22  5:06           ` Elliott, Robert (Servers) [this message]
2022-11-22  9:07             ` David Laight
2022-11-25  8:40           ` Herbert Xu
2022-11-25  8:59             ` Ard Biesheuvel
2022-11-25  9:03               ` Herbert Xu
2022-11-28 16:57                 ` Elliott, Robert (Servers)
2022-11-28 18:48                   ` Elliott, Robert (Servers)
2022-12-02  6:21             ` Elliott, Robert (Servers)
2022-12-02  9:25               ` Herbert Xu
2022-12-02 16:15                 ` Elliott, Robert (Servers)
2022-12-06  4:27                   ` Herbert Xu
2022-12-06 14:03                     ` Peter Lafreniere
2022-12-06 14:44                       ` David Laight
2022-12-06 23:06               ` Peter Lafreniere
2022-12-10  0:34                 ` Elliott, Robert (Servers)
2022-12-16 22:12                   ` Elliott, Robert (Servers)
2022-11-16  4:13       ` [PATCH v4 11/24] crypto: x86/aegis " Robert Elliott
2022-11-16  4:13       ` [PATCH v4 12/24] crypto: x86/sha - register all variations Robert Elliott
2022-11-16  4:13       ` [PATCH v4 13/24] crypto: x86/sha - minimize time in FPU context Robert Elliott
2022-11-16  4:13       ` [PATCH v4 14/24] crypto: x86/sha - load based on CPU features Robert Elliott
2022-11-16  4:13       ` [PATCH v4 15/24] crypto: x86/crc " Robert Elliott
2022-11-16  4:13       ` [PATCH v4 16/24] crypto: x86/sm3 " Robert Elliott
2022-11-16  4:13       ` [PATCH v4 17/24] crypto: x86/poly " Robert Elliott
2022-11-16 11:19         ` Jason A. Donenfeld
2022-11-16  4:13       ` [PATCH v4 18/24] crypto: x86/ghash " Robert Elliott
2022-11-16  4:13       ` [PATCH v4 19/24] crypto: x86/aesni - avoid type conversions Robert Elliott
2022-11-16  4:13       ` [PATCH v4 20/24] crypto: x86/ciphers - load based on CPU features Robert Elliott
2022-11-16 11:30         ` Jason A. Donenfeld
2022-11-16  4:13       ` [PATCH v4 21/24] crypto: x86 - report used CPU features via module parameters Robert Elliott
2022-11-16 11:26         ` Jason A. Donenfeld
2022-11-16  4:13       ` [PATCH v4 22/24] crypto: x86 - report missing " Robert Elliott
2022-11-16  4:13       ` [PATCH v4 23/24] crypto: x86 - report suboptimal CPUs " Robert Elliott
2022-11-16  4:13       ` [PATCH v4 24/24] crypto: x86 - standarize module descriptions Robert Elliott
2022-11-17  3:58       ` [PATCH v4 00/24] crypto: fix RCU stalls Herbert Xu
2022-11-17 15:13         ` Elliott, Robert (Servers)
2022-11-17 15:15           ` Jason A. Donenfeld

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=MW5PR84MB1842625D9ECAF4F6244F558BAB0D9@MW5PR84MB1842.NAMPRD84.PROD.OUTLOOK.COM \
    --to=elliott@hpe.com \
    --cc=David.Laight@ACULAB.COM \
    --cc=Jason@zx2c4.com \
    --cc=ap420073@gmail.com \
    --cc=ardb@kernel.org \
    --cc=davem@davemloft.net \
    --cc=ebiggers@kernel.org \
    --cc=herbert@gondor.apana.org.au \
    --cc=linux-crypto@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=tim.c.chen@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).