All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Dey, Megha" <megha.dey@intel.com>
To: Ard Biesheuvel <ardb@kernel.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>,
	"David S. Miller" <davem@davemloft.net>,
	Linux Crypto Mailing List <linux-crypto@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	ravi.v.shankar@intel.com, tim.c.chen@intel.com,
	andi.kleen@intel.com, Dave Hansen <dave.hansen@intel.com>,
	wajdi.k.feghali@intel.com, greg.b.tucker@intel.com,
	robert.a.kasten@intel.com, rajendrakumar.chinnaiyan@intel.com,
	tomasz.kantecki@intel.com, ryan.d.saffores@intel.com,
	ilya.albrekht@intel.com, kyung.min.park@intel.com,
	Tony Luck <tony.luck@intel.com>,
	ira.weiny@intel.com
Subject: Re: [RFC V1 7/7] crypto: aesni - AVX512 version of AESNI-GCM using VPCLMULQDQ
Date: Wed, 20 Jan 2021 14:48:55 -0800	[thread overview]
Message-ID: <e2db5780-f65f-d253-9261-1069ff9bc482@intel.com> (raw)
In-Reply-To: <CAMj1kXEOs_x9jHTmksgRZ1bkDV2+je+JaaqLRXyL0hg5uC9gBg@mail.gmail.com>

Hi Ard,

On 1/16/2021 9:16 AM, Ard Biesheuvel wrote:
> On Fri, 18 Dec 2020 at 22:08, Megha Dey <megha.dey@intel.com> wrote:
>> Introduce the AVX512 implementation that optimizes the AESNI-GCM encode
>> and decode routines using VPCLMULQDQ.
>>
>> The glue code in AESNI module overrides the existing AVX2 GCM mode
>> encryption/decryption routines with the AX512 AES GCM mode ones when the
>> following criteria are met:
>> At compile time:
>> 1. CONFIG_CRYPTO_AVX512 is enabled
>> 2. toolchain(assembler) supports VPCLMULQDQ instructions
>> At runtime:
>> 1. VPCLMULQDQ and AVX512VL features are supported on a platform
>>     (currently only Icelake)
>> 2. aesni_intel.use_avx512 module parameter is set at boot time. For this
>>     algorithm, switching from AVX512 optimized version is not possible
>>     once set at boot time because of how the code is structured today.(Can
>>     be changed later if required)
>>
>> The functions aesni_gcm_init_avx_512, aesni_gcm_enc_update_avx_512,
>> aesni_gcm_dec_update_avx_512 and aesni_gcm_finalize_avx_512 are adapted
>> from the Intel Optimized IPSEC Cryptographic library.
>>
>> On a Icelake desktop, with turbo disabled and all CPUs running at
>> maximum frequency, the AVX512 GCM mode optimization shows better
>> performance across data & key sizes as measured by tcrypt.
>>
>> The average performance improvement of the AVX512 version over the AVX2
>> version is as follows:
>> For all key sizes(128/192/256 bits),
>>          data sizes < 128 bytes/block, negligible improvement (~7.5%)
>>          data sizes > 128 bytes/block, there is an average improvement of
>>          40% for both encryption and decryption.
>>
>> A typical run of tcrypt with AES GCM mode encryption/decryption of the
>> AVX2 and AVX512 optimization on a Icelake desktop shows the following
>> results:
>>
>>    ----------------------------------------------------------------------
>>    |   key  | bytes | cycles/op (lower is better)   | Percentage gain/  |
>>    | length |   per |   encryption  |  decryption   |      loss         |
>>    | (bits) | block |-------------------------------|-------------------|
>>    |        |       | avx2 | avx512 | avx2 | avx512 | Encrypt | Decrypt |
>>    |---------------------------------------------------------------------
>>    |  128   | 16    | 689  |  701   | 689  |  707   |  -1.7   |  -2.61  |
>>    |  128   | 64    | 731  |  660   | 771  |  649   |   9.7   |  15.82  |
>>    |  128   | 256   | 911  |  750   | 900  |  721   |  17.67  |  19.88  |
>>    |  128   | 512   | 1181 |  814   | 1161 |  782   |  31.07  |  32.64  |
>>    |  128   | 1024  | 1676 |  1052  | 1685 |  1030  |  37.23  |  38.87  |
>>    |  128   | 2048  | 2475 |  1447  | 2456 |  1419  |  41.53  |  42.22  |
>>    |  128   | 4096  | 3806 |  2154  | 3820 |  2119  |  43.41  |  44.53  |
>>    |  128   | 8192  | 9169 |  3806  | 6997 |  3718  |  58.49  |  46.86  |
>>    |  192   | 16    | 754  |  683   | 737  |  672   |   9.42  |   8.82  |
>>    |  192   | 64    | 735  |  686   | 715  |  640   |   6.66  |  10.49  |
>>    |  192   | 256   | 949  |  738   | 2435 |  729   |  22.23  |  70     |
>>    |  192   | 512   | 1235 |  854   | 1200 |  833   |  30.85  |  30.58  |
>>    |  192   | 1024  | 1777 |  1084  | 1763 |  1051  |  38.99  |  40.39  |
>>    |  192   | 2048  | 2574 |  1497  | 2592 |  1459  |  41.84  |  43.71  |
>>    |  192   | 4096  | 4086 |  2317  | 4091 |  2244  |  43.29  |  45.14  |
>>    |  192   | 8192  | 7481 |  4054  | 7505 |  3953  |  45.81  |  47.32  |
>>    |  256   | 16    | 755  |  682   | 720  |  683   |   9.68  |   5.14  |
>>    |  256   | 64    | 744  |  677   | 719  |  658   |   9     |   8.48  |
>>    |  256   | 256   | 962  |  758   | 948  |  749   |  21.21  |  21     |
>>    |  256   | 512   | 1297 |  862   | 1276 |  836   |  33.54  |  34.48  |
>>    |  256   | 1024  | 1831 |  1114  | 1819 |  1095  |  39.16  |  39.8   |
>>    |  256   | 2048  | 2767 |  1566  | 2715 |  1524  |  43.4   |  43.87  |
>>    |  256   | 4096  | 4378 |  2382  | 4368 |  2354  |  45.6   |  46.11  |
>>    |  256   | 8192  | 8075 |  4262  | 8080 |  4186  |  47.22  |  48.19  |
>>    ----------------------------------------------------------------------
>>
>> This work was inspired by the AES GCM mode optimization published in
>> Intel Optimized IPSEC Cryptographic library.
>> https://github.com/intel/intel-ipsec-mb/blob/master/lib/avx512/gcm_vaes_avx512.asm
>>
>> Co-developed-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
>> Signed-off-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
>> Signed-off-by: Megha Dey <megha.dey@intel.com>
>> ---
>>   arch/x86/crypto/Makefile                    |    1 +
>>   arch/x86/crypto/aesni-intel_avx512-x86_64.S | 1788 +++++++++++++++++++++++++++
>>   arch/x86/crypto/aesni-intel_glue.c          |   62 +-
>>   crypto/Kconfig                              |   12 +
>>   4 files changed, 1858 insertions(+), 5 deletions(-)
>>   create mode 100644 arch/x86/crypto/aesni-intel_avx512-x86_64.S
>>
> ...
>> diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
>> index 9e56cdf..8fc5bac 100644
>> --- a/arch/x86/crypto/aesni-intel_glue.c
>> +++ b/arch/x86/crypto/aesni-intel_glue.c
>> @@ -55,13 +55,16 @@ MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");
>>    * This needs to be 16 byte aligned.
>>    */
>>   struct aesni_rfc4106_gcm_ctx {
>> -       u8 hash_subkey[16] AESNI_ALIGN_ATTR;
>> +       /* AVX512 optimized algorithms use 48 hash keys to conduct
>> +        * multiple PCLMULQDQ operations in parallel
>> +        */
>> +       u8 hash_subkey[16 * 48] AESNI_ALIGN_ATTR;
>>          struct crypto_aes_ctx aes_key_expanded AESNI_ALIGN_ATTR;
>>          u8 nonce[4];
>>   };
>>
>>   struct generic_gcmaes_ctx {
>> -       u8 hash_subkey[16] AESNI_ALIGN_ATTR;
>> +       u8 hash_subkey[16 * 48] AESNI_ALIGN_ATTR;
>>          struct crypto_aes_ctx aes_key_expanded AESNI_ALIGN_ATTR;
>>   };
>>
>> @@ -82,7 +85,7 @@ struct gcm_context_data {
>>          u8 current_counter[GCM_BLOCK_LEN];
>>          u64 partial_block_len;
>>          u64 unused;
>> -       u8 hash_keys[GCM_BLOCK_LEN * 16];
>> +       u8 hash_keys[48 * 16];
>>   };
>>
> This structure gets allocated on the stack, and gets inflated
> significantly by this change, even though the code is not enabled by
> default, and not even supported for most users.
Hmm yeah, this makes sense. I will look into it.
>
> Is it really necessary for this to be per-request data? If these are
> precomputed powers of H, they can be moved into the TFM context
> structure instead, which lives on the heap (and can be shared by all
> concurrent users of the TFM)
Yeah, this is a per request data.
>
>>   asmlinkage int aesni_set_key(struct crypto_aes_ctx *ctx, const u8 *in_key,
>> @@ -266,6 +269,47 @@ static const struct aesni_gcm_tfm_s aesni_gcm_tfm_avx_gen2 = {
>>          .finalize = &aesni_gcm_finalize_avx_gen2,
>>   };
>>
>> +#ifdef CONFIG_CRYPTO_AES_GCM_AVX512
>> +/*
>> + * asmlinkage void aesni_gcm_init_avx_512()
>> + * gcm_data *my_ctx_data, context data
>> + * u8 *hash_subkey,  the Hash sub key input. Data starts on a 16-byte boundary.
>> + */
>> +asmlinkage void aesni_gcm_init_avx_512(void *my_ctx_data,
>> +                                      struct gcm_context_data *gdata,
>> +                                      u8 *iv,
>> +                                      u8 *hash_subkey,
>> +                                      const u8 *aad,
>> +                                      unsigned long aad_len);
>> +asmlinkage void aesni_gcm_enc_update_avx_512(void *ctx,
>> +                                            struct gcm_context_data *gdata,
>> +                                            u8 *out,
>> +                                            const u8 *in,
>> +                                            unsigned long plaintext_len);
>> +asmlinkage void aesni_gcm_dec_update_avx_512(void *ctx,
>> +                                            struct gcm_context_data *gdata,
>> +                                            u8 *out,
>> +                                            const u8 *in,
>> +                                            unsigned long ciphertext_len);
>> +asmlinkage void aesni_gcm_finalize_avx_512(void *ctx,
>> +                                          struct gcm_context_data *gdata,
>> +                                          u8 *auth_tag,
>> +                                          unsigned long auth_tag_len);
>> +
>> +asmlinkage void aes_gcm_precomp_avx_512(struct crypto_aes_ctx *ctx, u8 *hash_subkey);
>> +
>> +static const struct aesni_gcm_tfm_s aesni_gcm_tfm_avx_512 = {
>> +       .init = &aesni_gcm_init_avx_512,
>> +       .enc_update = &aesni_gcm_enc_update_avx_512,
>> +       .dec_update = &aesni_gcm_dec_update_avx_512,
>> +       .finalize = &aesni_gcm_finalize_avx_512,
>> +};
>> +#else
>> +static void aes_gcm_precomp_avx_512(struct crypto_aes_ctx *ctx, u8 *hash_subkey)
>> +{}
>> +static const struct aesni_gcm_tfm_s aesni_gcm_tfm_avx_512 = {};
>> +#endif
>> +
> Please drop the alternative dummy definitions.
ok
>
>>   /*
>>    * asmlinkage void aesni_gcm_init_avx_gen4()
>>    * gcm_data *my_ctx_data, context data
>> @@ -669,7 +713,11 @@ rfc4106_set_hash_subkey(u8 *hash_subkey, const u8 *key, unsigned int key_len)
>>          /* We want to cipher all zeros to create the hash sub key. */
>>          memset(hash_subkey, 0, RFC4106_HASH_SUBKEY_SIZE);
>>
>> -       aes_encrypt(&ctx, hash_subkey, hash_subkey);
>> +       if (IS_ENABLED(CONFIG_CRYPTO_AES_GCM_AVX512) && use_avx512 &&
>> +           cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ))
>> +               aes_gcm_precomp_avx_512(&ctx, hash_subkey);
>> +       else
>> +               aes_encrypt(&ctx, hash_subkey, hash_subkey);
>>
> I suppose this answers my question about the subkeys. Please find a
> way to move these out of struct gcm_context_data so they don't need to
> be copied to the stack for each request.
Hmm yeah. I will move this allocation to the heap instead.
>
>
>>          memzero_explicit(&ctx, sizeof(ctx));
>>          return 0;
>> @@ -1114,7 +1162,11 @@ static int __init aesni_init(void)
>>          if (!x86_match_cpu(aesni_cpu_id))
>>                  return -ENODEV;
>>   #ifdef CONFIG_X86_64
>> -       if (boot_cpu_has(X86_FEATURE_AVX2)) {
>> +       if (use_avx512 && IS_ENABLED(CONFIG_CRYPTO_AES_GCM_AVX512) &&
>> +           cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ)) {
>> +               pr_info("AVX512 version of gcm_enc/dec engaged.\n");
>> +               aesni_gcm_tfm = &aesni_gcm_tfm_avx_512;
> This was changed in the cryptodev tree to use static keys.

yep, will make the necessary changes.


-Megha

>
>> +       } else if (boot_cpu_has(X86_FEATURE_AVX2)) {
>>                  pr_info("AVX2 version of gcm_enc/dec engaged.\n");
>>                  aesni_gcm_tfm = &aesni_gcm_tfm_avx_gen4;
>>          } else if (boot_cpu_has(X86_FEATURE_AVX)) {
>> diff --git a/crypto/Kconfig b/crypto/Kconfig
>> index 3043849..8c8a68d 100644
>> --- a/crypto/Kconfig
>> +++ b/crypto/Kconfig
>> @@ -661,6 +661,18 @@ config CRYPTO_AES_CTR_AVX512
>>          depends on CRYPTO_AES_NI_INTEL
>>          depends on AS_VAES_AVX512
>>
>> +# We default CRYPTO_AES_GCM_AVX512 to Y but depend on CRYPTO_AVX512 in
>> +# order to have a singular option (CRYPTO_AVX512) select multiple algorithms
>> +# when supported. Specifically, if the platform and/or toolset does not
>> +# support VPLMULQDQ. Then this algorithm should not be supported as part of
>> +# the set that CRYPTO_AVX512 selects.
>> +config CRYPTO_AES_GCM_AVX512
>> +       bool
>> +       default y
>> +       depends on CRYPTO_AVX512
>> +       depends on CRYPTO_AES_NI_INTEL
>> +       depends on AS_VPCLMULQDQ
>> +
>>   config CRYPTO_CRC32C_SPARC64
>>          tristate "CRC32c CRC algorithm (SPARC64)"
>>          depends on SPARC64
>> --
>> 2.7.4
>>

  reply	other threads:[~2021-01-21  6:41 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-18 21:10 [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms Megha Dey
2020-12-18 21:10 ` [RFC V1 1/7] x86: Probe assembler capabilities for VAES and VPLCMULQDQ support Megha Dey
2021-01-16 16:54   ` Ard Biesheuvel
2021-01-20 22:38     ` Dey, Megha
2020-12-18 21:10 ` [RFC V1 2/7] crypto: crct10dif - Accelerated CRC T10 DIF with vectorized instruction Megha Dey
2021-01-16 17:00   ` Ard Biesheuvel
2021-01-20 22:46     ` Dey, Megha
2020-12-18 21:11 ` [RFC V1 3/7] crypto: ghash - Optimized GHASH computations Megha Dey
2020-12-19 17:03   ` Ard Biesheuvel
2021-01-16  0:14     ` Dey, Megha
2021-01-16  0:20       ` Dave Hansen
2021-01-16  2:04         ` Eric Biggers
2021-01-16  5:13           ` Dave Hansen
2021-01-16 16:48             ` Ard Biesheuvel
2021-01-16  1:43       ` Eric Biggers
2021-01-16  5:07         ` Dey, Megha
2020-12-18 21:11 ` [RFC V1 4/7] crypto: tcrypt - Add speed test for optimized " Megha Dey
2020-12-18 21:11 ` [RFC V1 5/7] crypto: aesni - AES CTR x86_64 "by16" AVX512 optimization Megha Dey
2021-01-16 17:03   ` Ard Biesheuvel
2021-01-20 22:46     ` Dey, Megha
2020-12-18 21:11 ` [RFC V1 6/7] crypto: aesni - fix coding style for if/else block Megha Dey
2020-12-18 21:11 ` [RFC V1 7/7] crypto: aesni - AVX512 version of AESNI-GCM using VPCLMULQDQ Megha Dey
2021-01-16 17:16   ` Ard Biesheuvel
2021-01-20 22:48     ` Dey, Megha [this message]
2020-12-21 23:20 ` [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms Eric Biggers
2020-12-28 19:10   ` Dey, Megha
2021-01-16 16:52     ` Ard Biesheuvel
2021-01-16 18:35       ` Dey, Megha

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e2db5780-f65f-d253-9261-1069ff9bc482@intel.com \
    --to=megha.dey@intel.com \
    --cc=andi.kleen@intel.com \
    --cc=ardb@kernel.org \
    --cc=dave.hansen@intel.com \
    --cc=davem@davemloft.net \
    --cc=greg.b.tucker@intel.com \
    --cc=herbert@gondor.apana.org.au \
    --cc=ilya.albrekht@intel.com \
    --cc=ira.weiny@intel.com \
    --cc=kyung.min.park@intel.com \
    --cc=linux-crypto@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=rajendrakumar.chinnaiyan@intel.com \
    --cc=ravi.v.shankar@intel.com \
    --cc=robert.a.kasten@intel.com \
    --cc=ryan.d.saffores@intel.com \
    --cc=tim.c.chen@intel.com \
    --cc=tomasz.kantecki@intel.com \
    --cc=tony.luck@intel.com \
    --cc=wajdi.k.feghali@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.