* [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms
@ 2022-01-14 15:42 Jason A. Donenfeld
  2022-01-14 15:42 ` [PATCH crypto v3 1/2] lib/crypto: blake2s: move hmac construction into wireguard Jason A. Donenfeld
  2022-01-14 15:42 ` [PATCH crypto v3 2/2] lib/crypto: sha1: re-roll loops to reduce code size Jason A. Donenfeld
  0 siblings, 2 replies; 10+ messages in thread
From: Jason A. Donenfeld @ 2022-01-14 15:42 UTC
  To: linux-crypto, linux-kernel, geert, herbert; +Cc: Jason A. Donenfeld

[ Resending this v3, because the previous one was so deeply nested inside
  other patchset threads that b4 was unable to extract it without getting
  terribly confused. And if b4 was confused, probably human readers were
  too. This new cover letter is a new root thread. ]

Hi,

Geert emailed me this afternoon concerned about blake2s codesize on m68k
and other small systems. We identified two effective ways of chopping
down the size. One of them moves some wireguard-specific things into
wireguard proper. The other one adds a slower codepath for small
machines to blake2s. This worked, and was v1 of this patchset, but I
wasn't so much of a fan. Then someone pointed out that the generic C
SHA-1 implementation is still unrolled, which is a *lot* of extra code.
Simply rerolling that saves about as much as v1 did. So, we instead do
that in this patchset. SHA-1 is being phased out, and soon it won't
be included at all (hopefully). And nothing performance-oriented has
anything to do with it anyway.

The result of these two patches mitigates Geert's feared code size
increase for 5.17.

v3 improves on v2 by making the re-rolling of SHA-1 much simpler,
resulting in even larger code size reduction and much better
performance. The reason I'm sending yet a third version in such a short
amount of time is because the trick here feels obvious and substantial
enough that I'd hate for Geert to waste time measuring the impact of the
previous commit.

Thanks,
Jason

Jason A. Donenfeld (2):
  lib/crypto: blake2s: move hmac construction into wireguard
  lib/crypto: sha1: re-roll loops to reduce code size

 drivers/net/wireguard/noise.c | 45 ++++++++++++++---
 include/crypto/blake2s.h      |  3 --
 lib/crypto/blake2s-selftest.c | 31 ------------
 lib/crypto/blake2s.c          | 37 --------------
 lib/sha1.c                    | 95 ++++++-----------------------------
 5 files changed, 53 insertions(+), 158 deletions(-)

-- 
2.34.1

* [PATCH crypto v3 1/2] lib/crypto: blake2s: move hmac construction into wireguard
  2022-01-14 15:42 [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms Jason A. Donenfeld
@ 2022-01-14 15:42 ` Jason A. Donenfeld
  2022-01-14 15:42 ` [PATCH crypto v3 2/2] lib/crypto: sha1: re-roll loops to reduce code size Jason A. Donenfeld
  1 sibling, 0 replies; 10+ messages in thread
From: Jason A. Donenfeld @ 2022-01-14 15:42 UTC
  To: linux-crypto, linux-kernel, geert, herbert
  Cc: Jason A. Donenfeld, Ard Biesheuvel

Basically nobody should use blake2s in an HMAC construction; it already
has a keyed variant. But unfortunately for historical reasons, Noise,
used by WireGuard, uses HKDF quite strictly, which means we have to use
this. Because this really shouldn't be used by others, this commit moves
it into wireguard's noise.c locally, so that kernels that aren't using
WireGuard don't get this superfluous code baked in. On m68k systems,
this shaves off ~314 bytes.

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 drivers/net/wireguard/noise.c | 45 ++++++++++++++++++++++++++++++-----
 include/crypto/blake2s.h      |  3 ---
 lib/crypto/blake2s-selftest.c | 31 ------------------------
 lib/crypto/blake2s.c          | 37 ----------------------------
 4 files changed, 39 insertions(+), 77 deletions(-)

diff --git a/drivers/net/wireguard/noise.c b/drivers/net/wireguard/noise.c
index c0cfd9b36c0b..720952b92e78 100644
--- a/drivers/net/wireguard/noise.c
+++ b/drivers/net/wireguard/noise.c
@@ -302,6 +302,41 @@ void wg_noise_set_static_identity_private_key(
 		static_identity->static_public, private_key);
 }
 
+static void hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen, const size_t keylen)
+{
+	struct blake2s_state state;
+	u8 x_key[BLAKE2S_BLOCK_SIZE] __aligned(__alignof__(u32)) = { 0 };
+	u8 i_hash[BLAKE2S_HASH_SIZE] __aligned(__alignof__(u32));
+	int i;
+
+	if (keylen > BLAKE2S_BLOCK_SIZE) {
+		blake2s_init(&state, BLAKE2S_HASH_SIZE);
+		blake2s_update(&state, key, keylen);
+		blake2s_final(&state, x_key);
+	} else
+		memcpy(x_key, key, keylen);
+
+	for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
+		x_key[i] ^= 0x36;
+
+	blake2s_init(&state, BLAKE2S_HASH_SIZE);
+	blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
+	blake2s_update(&state, in, inlen);
+	blake2s_final(&state, i_hash);
+
+	for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
+		x_key[i] ^= 0x5c ^ 0x36;
+
+	blake2s_init(&state, BLAKE2S_HASH_SIZE);
+	blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
+	blake2s_update(&state, i_hash, BLAKE2S_HASH_SIZE);
+	blake2s_final(&state, i_hash);
+
+	memcpy(out, i_hash, BLAKE2S_HASH_SIZE);
+	memzero_explicit(x_key, BLAKE2S_BLOCK_SIZE);
+	memzero_explicit(i_hash, BLAKE2S_HASH_SIZE);
+}
+
 /* This is Hugo Krawczyk's HKDF:
  *  - https://eprint.iacr.org/2010/264.pdf
  *  - https://tools.ietf.org/html/rfc5869
@@ -322,14 +357,14 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
 		 ((third_len || third_dst) && (!second_len || !second_dst))));
 
 	/* Extract entropy from data into secret */
-	blake2s256_hmac(secret, data, chaining_key, data_len, NOISE_HASH_LEN);
+	hmac(secret, data, chaining_key, data_len, NOISE_HASH_LEN);
 
 	if (!first_dst || !first_len)
 		goto out;
 
 	/* Expand first key: key = secret, data = 0x1 */
 	output[0] = 1;
-	blake2s256_hmac(output, output, secret, 1, BLAKE2S_HASH_SIZE);
+	hmac(output, output, secret, 1, BLAKE2S_HASH_SIZE);
 	memcpy(first_dst, output, first_len);
 
 	if (!second_dst || !second_len)
@@ -337,8 +372,7 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
 
 	/* Expand second key: key = secret, data = first-key || 0x2 */
 	output[BLAKE2S_HASH_SIZE] = 2;
-	blake2s256_hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1,
-			BLAKE2S_HASH_SIZE);
+	hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1, BLAKE2S_HASH_SIZE);
 	memcpy(second_dst, output, second_len);
 
 	if (!third_dst || !third_len)
@@ -346,8 +380,7 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
 
 	/* Expand third key: key = secret, data = second-key || 0x3 */
 	output[BLAKE2S_HASH_SIZE] = 3;
-	blake2s256_hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1,
-			BLAKE2S_HASH_SIZE);
+	hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1, BLAKE2S_HASH_SIZE);
 	memcpy(third_dst, output, third_len);
 
 out:
diff --git a/include/crypto/blake2s.h b/include/crypto/blake2s.h
index df3c6c2f9553..f9ffd39194eb 100644
--- a/include/crypto/blake2s.h
+++ b/include/crypto/blake2s.h
@@ -101,7 +101,4 @@ static inline void blake2s(u8 *out, const u8 *in, const u8 *key,
 	blake2s_final(&state, out);
 }
 
-void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
-		     const size_t keylen);
-
 #endif /* _CRYPTO_BLAKE2S_H */
diff --git a/lib/crypto/blake2s-selftest.c b/lib/crypto/blake2s-selftest.c
index 5d9ea53be973..409e4b728770 100644
--- a/lib/crypto/blake2s-selftest.c
+++ b/lib/crypto/blake2s-selftest.c
@@ -15,7 +15,6 @@
  * #include <stdio.h>
  *
  * #include <openssl/evp.h>
- * #include <openssl/hmac.h>
  *
  * #define BLAKE2S_TESTVEC_COUNT 256
  *
@@ -58,16 +57,6 @@
  *	}
  *	printf("};\n\n");
  *
- *	printf("static const u8 blake2s_hmac_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {\n");
- *
- *	HMAC(EVP_blake2s256(), key, sizeof(key), buf, sizeof(buf), hash, NULL);
- *	print_vec(hash, BLAKE2S_OUTBYTES);
- *
- *	HMAC(EVP_blake2s256(), buf, sizeof(buf), key, sizeof(key), hash, NULL);
- *	print_vec(hash, BLAKE2S_OUTBYTES);
- *
- *	printf("};\n");
- *
  *	return 0;
  *}
  */
@@ -554,15 +543,6 @@ static const u8 blake2s_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {
 	  0xd6, 0x98, 0x6b, 0x07, 0x10, 0x65, 0x52, 0x65, },
 };
 
-static const u8 blake2s_hmac_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {
-	{ 0xce, 0xe1, 0x57, 0x69, 0x82, 0xdc, 0xbf, 0x43, 0xad, 0x56, 0x4c, 0x70,
-	  0xed, 0x68, 0x16, 0x96, 0xcf, 0xa4, 0x73, 0xe8, 0xe8, 0xfc, 0x32, 0x79,
-	  0x08, 0x0a, 0x75, 0x82, 0xda, 0x3f, 0x05, 0x11, },
-	{ 0x77, 0x2f, 0x0c, 0x71, 0x41, 0xf4, 0x4b, 0x2b, 0xb3, 0xc6, 0xb6, 0xf9,
-	  0x60, 0xde, 0xe4, 0x52, 0x38, 0x66, 0xe8, 0xbf, 0x9b, 0x96, 0xc4, 0x9f,
-	  0x60, 0xd9, 0x24, 0x37, 0x99, 0xd6, 0xec, 0x31, },
-};
-
 bool __init blake2s_selftest(void)
 {
 	u8 key[BLAKE2S_KEY_SIZE];
@@ -607,16 +587,5 @@ bool __init blake2s_selftest(void)
 		}
 	}
 
-	if (success) {
-		blake2s256_hmac(hash, buf, key, sizeof(buf), sizeof(key));
-		success &= !memcmp(hash, blake2s_hmac_testvecs[0], BLAKE2S_HASH_SIZE);
-
-		blake2s256_hmac(hash, key, buf, sizeof(key), sizeof(buf));
-		success &= !memcmp(hash, blake2s_hmac_testvecs[1], BLAKE2S_HASH_SIZE);
-
-		if (!success)
-			pr_err("blake2s256_hmac self-test: FAIL\n");
-	}
-
 	return success;
 }
diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c
index 93f2ae051370..9364f79937b8 100644
--- a/lib/crypto/blake2s.c
+++ b/lib/crypto/blake2s.c
@@ -30,43 +30,6 @@ void blake2s_final(struct blake2s_state *state, u8 *out)
 }
 EXPORT_SYMBOL(blake2s_final);
 
-void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
-		     const size_t keylen)
-{
-	struct blake2s_state state;
-	u8 x_key[BLAKE2S_BLOCK_SIZE] __aligned(__alignof__(u32)) = { 0 };
-	u8 i_hash[BLAKE2S_HASH_SIZE] __aligned(__alignof__(u32));
-	int i;
-
-	if (keylen > BLAKE2S_BLOCK_SIZE) {
-		blake2s_init(&state, BLAKE2S_HASH_SIZE);
-		blake2s_update(&state, key, keylen);
-		blake2s_final(&state, x_key);
-	} else
-		memcpy(x_key, key, keylen);
-
-	for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
-		x_key[i] ^= 0x36;
-
-	blake2s_init(&state, BLAKE2S_HASH_SIZE);
-	blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
-	blake2s_update(&state, in, inlen);
-	blake2s_final(&state, i_hash);
-
-	for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
-		x_key[i] ^= 0x5c ^ 0x36;
-
-	blake2s_init(&state, BLAKE2S_HASH_SIZE);
-	blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
-	blake2s_update(&state, i_hash, BLAKE2S_HASH_SIZE);
-	blake2s_final(&state, i_hash);
-
-	memcpy(out, i_hash, BLAKE2S_HASH_SIZE);
-	memzero_explicit(x_key, BLAKE2S_BLOCK_SIZE);
-	memzero_explicit(i_hash, BLAKE2S_HASH_SIZE);
-}
-EXPORT_SYMBOL(blake2s256_hmac);
-
 static int __init blake2s_mod_init(void)
 {
 	if (!IS_ENABLED(CONFIG_CRYPTO_MANAGER_DISABLE_TESTS) &&
-- 
2.34.1

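Editorial sketch, not part of the patch: the second XOR loop in hmac() above relies on the fact that a byte already masked with 0x36, when XORed with (0x5c ^ 0x36), ends up masked with 0x5c. That lets the outer-pad key be derived in place without copying the key again. The standalone C below only checks that identity and performs no BLAKE2s hashing; BLOCK_SIZE, make_ipad_key(), ipad_to_opad() and opad_matches_direct() are hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 64 /* BLAKE2s block size in bytes */

/* Hypothetical helper: zero-pad a short key to one block, then apply the inner pad. */
static void make_ipad_key(uint8_t x_key[BLOCK_SIZE], const uint8_t *key, size_t keylen)
{
	size_t i;

	memset(x_key, 0, BLOCK_SIZE);
	memcpy(x_key, key, keylen); /* caller guarantees keylen <= BLOCK_SIZE */
	for (i = 0; i < BLOCK_SIZE; ++i)
		x_key[i] ^= 0x36;
}

/* Turn the inner-padded key into the outer-padded key in place. */
static void ipad_to_opad(uint8_t x_key[BLOCK_SIZE])
{
	size_t i;

	for (i = 0; i < BLOCK_SIZE; ++i)
		x_key[i] ^= 0x5c ^ 0x36; /* cancels 0x36, applies 0x5c */
}

/* Returns 1 if the in-place conversion equals (padded key ^ 0x5c) computed directly. */
static int opad_matches_direct(const uint8_t *key, size_t keylen)
{
	uint8_t x_key[BLOCK_SIZE], direct[BLOCK_SIZE];
	size_t i;

	make_ipad_key(x_key, key, keylen);
	ipad_to_opad(x_key);

	memset(direct, 0, BLOCK_SIZE);
	memcpy(direct, key, keylen);
	for (i = 0; i < BLOCK_SIZE; ++i)
		direct[i] ^= 0x5c;

	return memcmp(x_key, direct, BLOCK_SIZE) == 0;
}
```

For any key no longer than one block, opad_matches_direct(key, keylen) is expected to return 1.
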
* [PATCH crypto v3 2/2] lib/crypto: sha1: re-roll loops to reduce code size
  2022-01-14 15:42 [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms Jason A. Donenfeld
  2022-01-14 15:42 ` [PATCH crypto v3 1/2] lib/crypto: blake2s: move hmac construction into wireguard Jason A. Donenfeld
@ 2022-01-14 15:42 ` Jason A. Donenfeld
  1 sibling, 0 replies; 10+ messages in thread
From: Jason A. Donenfeld @ 2022-01-14 15:42 UTC
  To: linux-crypto, linux-kernel, geert, herbert
  Cc: Jason A. Donenfeld, Ard Biesheuvel

With SHA-1 no longer being used for anything performance oriented, and
also soon to be phased out entirely, we can make up for the space added
by unrolled BLAKE2s by simply re-rolling SHA-1. Since SHA-1 is so much
more complex, re-rolling it more or less takes care of the code size
added by BLAKE2s. And eventually, hopefully we'll see SHA-1 removed
entirely from most small kernel builds.

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 lib/sha1.c | 95 ++++++++----------------------------------------------
 1 file changed, 14 insertions(+), 81 deletions(-)

diff --git a/lib/sha1.c b/lib/sha1.c
index 9bd1935a1472..0494766fc574 100644
--- a/lib/sha1.c
+++ b/lib/sha1.c
@@ -9,6 +9,7 @@
 #include <linux/kernel.h>
 #include <linux/export.h>
 #include <linux/bitops.h>
+#include <linux/string.h>
 #include <crypto/sha1.h>
 #include <asm/unaligned.h>
 
@@ -55,7 +56,8 @@
 #define SHA_ROUND(t, input, fn, constant, A, B, C, D, E) do { \
 	__u32 TEMP = input(t); setW(t, TEMP); \
 	E += TEMP + rol32(A,5) + (fn) + (constant); \
-	B = ror32(B, 2); } while (0)
+	B = ror32(B, 2); \
+	TEMP = E; E = D; D = C; C = B; B = A; A = TEMP; } while (0)
 
 #define T_0_15(t, A, B, C, D, E)  SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
 #define T_16_19(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
@@ -84,6 +86,7 @@
 void sha1_transform(__u32 *digest, const char *data, __u32 *array)
 {
 	__u32 A, B, C, D, E;
+	unsigned int i = 0;
 
 	A = digest[0];
 	B = digest[1];
@@ -92,94 +95,24 @@ void sha1_transform(__u32 *digest, const char *data, __u32 *array)
 	E = digest[4];
 
 	/* Round 1 - iterations 0-16 take their input from 'data' */
-	T_0_15( 0, A, B, C, D, E);
-	T_0_15( 1, E, A, B, C, D);
-	T_0_15( 2, D, E, A, B, C);
-	T_0_15( 3, C, D, E, A, B);
-	T_0_15( 4, B, C, D, E, A);
-	T_0_15( 5, A, B, C, D, E);
-	T_0_15( 6, E, A, B, C, D);
-	T_0_15( 7, D, E, A, B, C);
-	T_0_15( 8, C, D, E, A, B);
-	T_0_15( 9, B, C, D, E, A);
-	T_0_15(10, A, B, C, D, E);
-	T_0_15(11, E, A, B, C, D);
-	T_0_15(12, D, E, A, B, C);
-	T_0_15(13, C, D, E, A, B);
-	T_0_15(14, B, C, D, E, A);
-	T_0_15(15, A, B, C, D, E);
+	for (; i < 16; ++i)
+		T_0_15(i, A, B, C, D, E);
 
 	/* Round 1 - tail. Input from 512-bit mixing array */
-	T_16_19(16, E, A, B, C, D);
-	T_16_19(17, D, E, A, B, C);
-	T_16_19(18, C, D, E, A, B);
-	T_16_19(19, B, C, D, E, A);
+	for (; i < 20; ++i)
+		T_16_19(i, A, B, C, D, E);
 
 	/* Round 2 */
-	T_20_39(20, A, B, C, D, E);
-	T_20_39(21, E, A, B, C, D);
-	T_20_39(22, D, E, A, B, C);
-	T_20_39(23, C, D, E, A, B);
-	T_20_39(24, B, C, D, E, A);
-	T_20_39(25, A, B, C, D, E);
-	T_20_39(26, E, A, B, C, D);
-	T_20_39(27, D, E, A, B, C);
-	T_20_39(28, C, D, E, A, B);
-	T_20_39(29, B, C, D, E, A);
-	T_20_39(30, A, B, C, D, E);
-	T_20_39(31, E, A, B, C, D);
-	T_20_39(32, D, E, A, B, C);
-	T_20_39(33, C, D, E, A, B);
-	T_20_39(34, B, C, D, E, A);
-	T_20_39(35, A, B, C, D, E);
-	T_20_39(36, E, A, B, C, D);
-	T_20_39(37, D, E, A, B, C);
-	T_20_39(38, C, D, E, A, B);
-	T_20_39(39, B, C, D, E, A);
+	for (; i < 40; ++i)
+		T_20_39(i, A, B, C, D, E);
 
 	/* Round 3 */
-	T_40_59(40, A, B, C, D, E);
-	T_40_59(41, E, A, B, C, D);
-	T_40_59(42, D, E, A, B, C);
-	T_40_59(43, C, D, E, A, B);
-	T_40_59(44, B, C, D, E, A);
-	T_40_59(45, A, B, C, D, E);
-	T_40_59(46, E, A, B, C, D);
-	T_40_59(47, D, E, A, B, C);
-	T_40_59(48, C, D, E, A, B);
-	T_40_59(49, B, C, D, E, A);
-	T_40_59(50, A, B, C, D, E);
-	T_40_59(51, E, A, B, C, D);
-	T_40_59(52, D, E, A, B, C);
-	T_40_59(53, C, D, E, A, B);
-	T_40_59(54, B, C, D, E, A);
-	T_40_59(55, A, B, C, D, E);
-	T_40_59(56, E, A, B, C, D);
-	T_40_59(57, D, E, A, B, C);
-	T_40_59(58, C, D, E, A, B);
-	T_40_59(59, B, C, D, E, A);
+	for (; i < 60; ++i)
+		T_40_59(i, A, B, C, D, E);
 
 	/* Round 4 */
-	T_60_79(60, A, B, C, D, E);
-	T_60_79(61, E, A, B, C, D);
-	T_60_79(62, D, E, A, B, C);
-	T_60_79(63, C, D, E, A, B);
-	T_60_79(64, B, C, D, E, A);
-	T_60_79(65, A, B, C, D, E);
-	T_60_79(66, E, A, B, C, D);
-	T_60_79(67, D, E, A, B, C);
-	T_60_79(68, C, D, E, A, B);
-	T_60_79(69, B, C, D, E, A);
-	T_60_79(70, A, B, C, D, E);
-	T_60_79(71, E, A, B, C, D);
-	T_60_79(72, D, E, A, B, C);
-	T_60_79(73, C, D, E, A, B);
-	T_60_79(74, B, C, D, E, A);
-	T_60_79(75, A, B, C, D, E);
-	T_60_79(76, E, A, B, C, D);
-	T_60_79(77, D, E, A, B, C);
-	T_60_79(78, C, D, E, A, B);
-	T_60_79(79, B, C, D, E, A);
+	for (; i < 80; ++i)
+		T_60_79(i, A, B, C, D, E);
 
 	digest[0] += A;
 	digest[1] += B;
-- 
2.34.1

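Editorial sketch, not part of the patch: the re-rolling above works because SHA_ROUND now rotates the five working variables itself, so every round can be invoked with the same (A, B, C, D, E) argument order. The standalone C below illustrates the same idea under simplifying assumptions. It folds all 80 rounds into one loop with a branch per round, whereas the patch keeps five separate loops, and sha1_block_rolled() and its local rol32() are hypothetical stand-ins rather than the kernel's functions.

```c
#include <stdint.h>

/* Rotate left; only called with s in {1, 5, 30}, so s is never 0 or 32. */
static uint32_t rol32(uint32_t v, unsigned int s)
{
	return (v << s) | (v >> (32 - s));
}

/* Hypothetical helper: process one 64-byte block, updating digest[] in place. */
static void sha1_block_rolled(uint32_t digest[5], const uint8_t block[64])
{
	uint32_t W[16]; /* circular message schedule, indexed modulo 16 */
	uint32_t A = digest[0], B = digest[1], C = digest[2];
	uint32_t D = digest[3], E = digest[4];
	unsigned int i;

	for (i = 0; i < 80; ++i) {
		uint32_t w, f, k, tmp;

		if (i < 16) /* rounds 0-15 read the big-endian input words */
			w = ((uint32_t)block[4 * i] << 24) |
			    ((uint32_t)block[4 * i + 1] << 16) |
			    ((uint32_t)block[4 * i + 2] << 8) |
			    (uint32_t)block[4 * i + 3];
		else /* later rounds mix earlier schedule words */
			w = rol32(W[(i + 13) & 15] ^ W[(i + 8) & 15] ^
				  W[(i + 2) & 15] ^ W[i & 15], 1);
		W[i & 15] = w;

		if (i < 20) {
			f = ((C ^ D) & B) ^ D;
			k = 0x5a827999;
		} else if (i < 40) {
			f = B ^ C ^ D;
			k = 0x6ed9eba1;
		} else if (i < 60) {
			f = (B & C) + (D & (B ^ C));
			k = 0x8f1bbcdc;
		} else {
			f = B ^ C ^ D;
			k = 0xca62c1d6;
		}

		E += w + rol32(A, 5) + f + k;
		B = rol32(B, 30); /* equivalent to ror32(B, 2) */
		/* Rotate the variables so the next round reuses the same names. */
		tmp = E; E = D; D = C; C = B; B = A; A = tmp;
	}

	digest[0] += A;
	digest[1] += B;
	digest[2] += C;
	digest[3] += D;
	digest[4] += E;
}
```

Feeding it the standard initial state and the single padded block for the message "abc" should reproduce the well-known SHA-1 digest a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d.
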
* [PATCH crypto v2 0/2] reduce code size from blake2s on m68k and other small platforms
@ 2022-01-11 18:10 Jason A. Donenfeld
  2022-01-11 22:05 ` [PATCH crypto v3 " Jason A. Donenfeld
  0 siblings, 1 reply; 10+ messages in thread
From: Jason A. Donenfeld @ 2022-01-11 18:10 UTC
  To: linux-crypto, netdev, wireguard, linux-kernel, bpf, geert, tytso,
	gregkh, jeanphilippe.aumasson, ardb
  Cc: Jason A. Donenfeld

Hi,

Geert emailed me this afternoon concerned about blake2s codesize on m68k
and other small systems. We identified two effective ways of chopping
down the size. One of them moves some wireguard-specific things into
wireguard proper. The other one adds a slower codepath for small
machines to blake2s. This worked, and was v1 of this patchset, but I
wasn't so much of a fan. Then someone pointed out that the generic C
SHA-1 implementation is still unrolled, which is a *lot* of extra code.
Simply rerolling that saves about as much as v1 did. So, we instead do
that in this v2 patchset. SHA-1 is being phased out, and soon it won't
be included at all (hopefully). And nothing performance-oriented has
anything to do with it anyway.

The result of these two patches mitigates Geert's feared code size
increase for 5.17.

Thanks,
Jason

Jason A. Donenfeld (2):
  lib/crypto: blake2s: move hmac construction into wireguard
  lib/crypto: sha1: re-roll loops to reduce code size

 drivers/net/wireguard/noise.c |  45 +++++++++++--
 include/crypto/blake2s.h      |   3 -
 lib/crypto/blake2s-selftest.c |  31 ---------
 lib/crypto/blake2s.c          |  37 -----------
 lib/sha1.c                    | 117 ++++++++--------------------------
 5 files changed, 64 insertions(+), 169 deletions(-)

-- 
2.34.1

* [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms
  2022-01-11 18:10 [PATCH crypto v2 0/2] reduce code size from blake2s on m68k and other small platforms Jason A. Donenfeld
@ 2022-01-11 22:05 ` Jason A. Donenfeld
  2022-01-12 10:59   ` Geert Uytterhoeven
  0 siblings, 1 reply; 10+ messages in thread
From: Jason A. Donenfeld @ 2022-01-11 22:05 UTC
  To: linux-crypto, netdev, wireguard, linux-kernel, bpf, geert, tytso,
	gregkh, jeanphilippe.aumasson, ardb
  Cc: Jason A. Donenfeld

Hi,

Geert emailed me this afternoon concerned about blake2s codesize on m68k
and other small systems. We identified two effective ways of chopping
down the size. One of them moves some wireguard-specific things into
wireguard proper. The other one adds a slower codepath for small
machines to blake2s. This worked, and was v1 of this patchset, but I
wasn't so much of a fan. Then someone pointed out that the generic C
SHA-1 implementation is still unrolled, which is a *lot* of extra code.
Simply rerolling that saves about as much as v1 did. So, we instead do
that in this patchset. SHA-1 is being phased out, and soon it won't
be included at all (hopefully). And nothing performance-oriented has
anything to do with it anyway.

The result of these two patches mitigates Geert's feared code size
increase for 5.17.

v3 improves on v2 by making the re-rolling of SHA-1 much simpler,
resulting in even larger code size reduction and much better
performance. The reason I'm sending yet a third version in such a short
amount of time is because the trick here feels obvious and substantial
enough that I'd hate for Geert to waste time measuring the impact of the
previous commit.

Thanks,
Jason

Jason A. Donenfeld (2):
  lib/crypto: blake2s: move hmac construction into wireguard
  lib/crypto: sha1: re-roll loops to reduce code size

 drivers/net/wireguard/noise.c | 45 ++++++++++++++---
 include/crypto/blake2s.h      |  3 --
 lib/crypto/blake2s-selftest.c | 31 ------------
 lib/crypto/blake2s.c          | 37 --------------
 lib/sha1.c                    | 95 ++++++-----------------------------
 5 files changed, 53 insertions(+), 158 deletions(-)

-- 
2.34.1

* Re: [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms
  2022-01-11 22:05 ` [PATCH crypto v3 " Jason A. Donenfeld
@ 2022-01-12 10:59   ` Geert Uytterhoeven
  2022-01-12 13:18     ` Jason A. Donenfeld
  0 siblings, 1 reply; 10+ messages in thread
From: Geert Uytterhoeven @ 2022-01-12 10:59 UTC
  To: Jason A. Donenfeld
  Cc: Linux Crypto Mailing List, netdev, wireguard,
	Linux Kernel Mailing List, bpf, Theodore Tso, Greg KH,
	jeanphilippe.aumasson, Ard Biesheuvel

Hi Jason,

On Tue, Jan 11, 2022 at 11:05 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Geert emailed me this afternoon concerned about blake2s codesize on m68k
> and other small systems. We identified two effective ways of chopping
> down the size. One of them moves some wireguard-specific things into
> wireguard proper. The other one adds a slower codepath for small
> machines to blake2s. This worked, and was v1 of this patchset, but I
> wasn't so much of a fan. Then someone pointed out that the generic C
> SHA-1 implementation is still unrolled, which is a *lot* of extra code.
> Simply rerolling that saves about as much as v1 did. So, we instead do
> that in this patchset. SHA-1 is being phased out, and soon it won't
> be included at all (hopefully). And nothing performance-oriented has
> anything to do with it anyway.
>
> The result of these two patches mitigates Geert's feared code size
> increase for 5.17.
>
> v3 improves on v2 by making the re-rolling of SHA-1 much simpler,
> resulting in even larger code size reduction and much better
> performance. The reason I'm sending yet a third version in such a short
> amount of time is because the trick here feels obvious and substantial
> enough that I'd hate for Geert to waste time measuring the impact of the
> previous commit.
>
> Thanks,
> Jason
>
> Jason A. Donenfeld (2):
>   lib/crypto: blake2s: move hmac construction into wireguard
>   lib/crypto: sha1: re-roll loops to reduce code size

Thanks for the series!

On m68k:
add/remove: 1/4 grow/shrink: 0/1 up/down: 4/-4232 (-4228)
Function                                     old     new   delta
__ksymtab_blake2s256_hmac                     12       -     -12
blake2s_init.constprop                        94       -     -94
blake2s256_hmac                              302       -    -302
sha1_transform                              4402     582   -3820
Total: Before=4230537, After=4226309, chg -0.10%

Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

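Editorial cross-check of the figures in Geert's size report above, not part of his message: the four listed rows sum to the reported net change of -4228 bytes, and Before plus that delta equals After. The summary line's up/down of 4/-4232 suggests one added and one removed symbol of 4 bytes each that the listing does not show; that reading is an assumption. struct sym_delta, listed[] and net_delta() are hypothetical names used only for this sketch.

```c
#include <stddef.h>

struct sym_delta {
	const char *name;
	long old_size; /* size in bytes before the series */
	long new_size; /* size in bytes after; 0 when the symbol was removed */
};

/* Rows copied from the report; removed symbols ("-" in the new column) get 0. */
static const struct sym_delta listed[] = {
	{ "__ksymtab_blake2s256_hmac", 12, 0 },
	{ "blake2s_init.constprop", 94, 0 },
	{ "blake2s256_hmac", 302, 0 },
	{ "sha1_transform", 4402, 582 },
};

/* Sum of (new - old) over n rows. */
static long net_delta(const struct sym_delta *d, size_t n)
{
	long sum = 0;
	size_t i;

	for (i = 0; i < n; ++i)
		sum += d[i].new_size - d[i].old_size;
	return sum;
}
```

With these rows, net_delta(listed, 4) evaluates to -4228, and 4230537 - 4228 = 4226309 matches the reported After total; 4228 / 4230537 is roughly 0.0999%, which rounds to the reported 0.10%.
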
* Re: [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms
  2022-01-12 10:59   ` Geert Uytterhoeven
@ 2022-01-12 13:18     ` Jason A. Donenfeld
  2022-01-18  6:42       ` Herbert Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Jason A. Donenfeld @ 2022-01-12 13:18 UTC
  To: Geert Uytterhoeven
  Cc: Linux Crypto Mailing List, netdev, WireGuard mailing list,
	Linux Kernel Mailing List, bpf, Theodore Tso, Greg KH,
	Jean-Philippe Aumasson, Ard Biesheuvel, Herbert Xu

Hi Geert,

On Wed, Jan 12, 2022 at 12:00 PM Geert Uytterhoeven
<geert@linux-m68k.org> wrote:
> Thanks for the series!
>
> On m68k:
> add/remove: 1/4 grow/shrink: 0/1 up/down: 4/-4232 (-4228)
> Function                                     old     new   delta
> __ksymtab_blake2s256_hmac                     12       -     -12
> blake2s_init.constprop                        94       -     -94
> blake2s256_hmac                              302       -    -302
> sha1_transform                              4402     582   -3820
> Total: Before=4230537, After=4226309, chg -0.10%
>
> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>

Excellent, thanks for the breakdown. So this shaves off ~4k, which was
about what we were shooting for here, so I think indeed this series
accomplishes its goal of counteracting the addition of BLAKE2s.
Hopefully Herbert will apply this series for 5.17.

Jason

* Re: [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms
  2022-01-12 13:18     ` Jason A. Donenfeld
@ 2022-01-18  6:42       ` Herbert Xu
  2022-01-18 11:43         ` Jason A. Donenfeld
  0 siblings, 1 reply; 10+ messages in thread
From: Herbert Xu @ 2022-01-18  6:42 UTC
  To: Jason A. Donenfeld
  Cc: geert, linux-crypto, netdev, wireguard, linux-kernel, bpf, tytso,
	gregkh, jeanphilippe.aumasson, ardb

Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Excellent, thanks for the breakdown. So this shaves off ~4k, which was
> about what we were shooting for here, so I think indeed this series
> accomplishes its goal of counteracting the addition of BLAKE2s.
> Hopefully Herbert will apply this series for 5.17.

As the patches that triggered this weren't part of the crypto
tree, this will have to go through the random tree if you want
them for 5.17.

Otherwise if you're happy to wait then I can pull them through
cryptodev.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms
  2022-01-18  6:42       ` Herbert Xu
@ 2022-01-18 11:43         ` Jason A. Donenfeld
  2022-01-18 12:44           ` David Laight
  0 siblings, 1 reply; 10+ messages in thread
From: Jason A. Donenfeld @ 2022-01-18 11:43 UTC
  To: Herbert Xu
  Cc: geert, linux-crypto, netdev, wireguard, linux-kernel, bpf, tytso,
	gregkh, jeanphilippe.aumasson, ardb

On 1/18/22, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> As the patches that triggered this weren't part of the crypto
> tree, this will have to go through the random tree if you want
> them for 5.17.

Sure, will do.

* RE: [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms
  2022-01-18 11:43         ` Jason A. Donenfeld
@ 2022-01-18 12:44           ` David Laight
  2022-01-18 12:50             ` Jason A. Donenfeld
  0 siblings, 1 reply; 10+ messages in thread
From: David Laight @ 2022-01-18 12:44 UTC
  To: 'Jason A. Donenfeld', Herbert Xu
  Cc: geert, linux-crypto, netdev, wireguard, linux-kernel, bpf, tytso,
	gregkh, jeanphilippe.aumasson, ardb

From: Jason A. Donenfeld
> Sent: 18 January 2022 11:43
>
> On 1/18/22, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > As the patches that triggered this weren't part of the crypto
> > tree, this will have to go through the random tree if you want
> > them for 5.17.
>
> Sure, will do.

I've rammed the code through godbolt... https://godbolt.org/z/Wv64z9zG8

Some things I've noticed;

1) There is no point having all the inline functions.
   Far better to have real functions to do the work.
   Given the cost of hashing 64 bytes of data the extra
   function call won't matter.
   Indeed for repeated calls it will help because the required
   code will be in the I-cache.

2) The compiles I tried do manage to remove the blake2_sigma[][]
   when unrolling everything - which is a slight gain for the full
   unroll. But I doubt it is that significant if the access can
   get sensibly optimised.
   For non-x86 that might require all the values by multiplied by 4.

3) Although G() is a massive register dependency chain the compiler
   knows that G(,[0-3],) are independent and can execute in parallel.
   This does help execution time on multi-issue cpu (like x86).
   With care it ought to be possible to use the same code for G(,[4-7],)
   without stopping the compiler interleaving all the instructions.

4) I strongly suspect that using a loop for the rounds will have
   minimal impact on performance - especially if the first call is
   'cold cache'.
   But I've not got time to test the code.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

* Re: [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms
  2022-01-18 12:44           ` David Laight
@ 2022-01-18 12:50             ` Jason A. Donenfeld
  0 siblings, 0 replies; 10+ messages in thread
From: Jason A. Donenfeld @ 2022-01-18 12:50 UTC
  To: David Laight
  Cc: Herbert Xu, geert, linux-crypto, netdev, wireguard, linux-kernel,
	bpf, tytso, gregkh, jeanphilippe.aumasson, ardb

On Tue, Jan 18, 2022 at 1:45 PM David Laight <David.Laight@aculab.com> wrote:
> I've rammed the code through godbolt... https://godbolt.org/z/Wv64z9zG8
>
> Some things I've noticed;

It seems like you've done a lot of work here but...

> But I've not got time to test the code.

But you're not going to take it all the way. So it unfortunately
amounts to mailing list armchair optimization. That's too bad because
it really seems like you might be onto something worth seeing through.

As I've mentioned a few times now, I've dropped the blake2s
optimization patch, and I won't be developing that further. But it
appears as though you've really been captured by it, so I urge you:
please send a real patch with benchmarks on various platforms! (And CC
me on the patch.) Faster reference code would really be terrific.

Jason
