* [PATCH v4 0/6] crypto: lib/sha256 - cleanup/optimization
@ 2020-10-25 14:31 Arvind Sankar
  2020-10-25 14:31 ` [PATCH v4 1/6] crypto: lib/sha256 - Use memzero_explicit() for clearing state Arvind Sankar
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Arvind Sankar @ 2020-10-25 14:31 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, linux-crypto, Eric Biggers, David Laight
  Cc: linux-kernel

Patches 1-2 -- Use memzero_explicit() instead of structure assignment/plain
memset() to clear sensitive state.

Patch 3 -- Currently the temporary variables used in the generic sha256
implementation are cleared, but the clearing is optimized away due to
lack of compiler barriers. Drop the clearing.

The last three patches are optimizations for generic sha256.

v4:
- Split the first patch into two, the first one just does
  lib/crypto/sha256.c, so that the second one can be applied or dropped
  depending on the outcome of the discussion between Herbert/Eric.

v3:
- Add some more files to patch 1
- Reword commit message for patch 2
- Reformat SHA256_K array
- Drop v2 patch combining K and W arrays

v2:
- Add patch to combine K and W arrays, suggested by David
- Reformat SHA256_ROUND() macro a little

Arvind Sankar (6):
  crypto: lib/sha256 - Use memzero_explicit() for clearing state
  crypto: Use memzero_explicit() for clearing state
  crypto: lib/sha256 - Don't clear temporary variables
  crypto: lib/sha256 - Clear W[] in sha256_update() instead of
    sha256_transform()
  crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64
  crypto: lib/sha256 - Unroll LOAD and BLEND loops

 arch/arm64/crypto/ghash-ce-glue.c |   2 +-
 arch/arm64/crypto/poly1305-glue.c |   2 +-
 arch/arm64/crypto/sha3-ce-glue.c  |   2 +-
 arch/x86/crypto/poly1305_glue.c   |   2 +-
 include/crypto/sha1_base.h        |   3 +-
 include/crypto/sha256_base.h      |   3 +-
 include/crypto/sha512_base.h      |   3 +-
 include/crypto/sm3_base.h         |   3 +-
 lib/crypto/sha256.c               | 212 +++++++++---------------------
 9 files changed, 76 insertions(+), 156 deletions(-)

-- 
2.26.2



* [PATCH v4 1/6] crypto: lib/sha256 - Use memzero_explicit() for clearing state
  2020-10-25 14:31 [PATCH v4 0/6] crypto: lib/sha256 - cleanup/optimization Arvind Sankar
@ 2020-10-25 14:31 ` Arvind Sankar
  2020-10-26  7:59   ` Ard Biesheuvel
  2020-10-25 14:31 ` [PATCH v4 2/6] crypto: " Arvind Sankar
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Arvind Sankar @ 2020-10-25 14:31 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, linux-crypto, Eric Biggers, David Laight
  Cc: linux-kernel, Eric Biggers

Without the barrier_data() inside memzero_explicit(), the compiler may
optimize away the state-clearing if it can tell that the state is not
used afterwards. At least in lib/crypto/sha256.c:__sha256_final(), the
function can get inlined into sha256(), in which case the memset is
optimized away.
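
A minimal userspace sketch of why the barrier matters (an illustration only --
the kernel's memzero_explicit() is, in essence, a memset() followed by a
barrier_data() that makes the stores observable to the compiler):

#include <string.h>

/*
 * Sketch of the memzero_explicit() idea: the empty asm claims to read the
 * buffer and clobbers memory, so the preceding memset() cannot be treated
 * as a dead store even if the buffer is never read again.
 */
static inline void memzero_explicit_sketch(void *s, size_t count)
{
	memset(s, 0, count);
	__asm__ __volatile__("" : : "r" (s) : "memory");
}

void handle_secret(void)
{
	char key[32] = "example key material";

	/* ... use key ... */

	/* A plain memset(key, 0, sizeof(key)) may be elided here; this is kept: */
	memzero_explicit_sketch(key, sizeof(key));
}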

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Reviewed-by: Eric Biggers <ebiggers@google.com>
---
 lib/crypto/sha256.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index 2321f6cb322f..d43bc39ab05e 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -265,7 +265,7 @@ static void __sha256_final(struct sha256_state *sctx, u8 *out, int digest_words)
 		put_unaligned_be32(sctx->state[i], &dst[i]);
 
 	/* Zeroize sensitive information. */
-	memset(sctx, 0, sizeof(*sctx));
+	memzero_explicit(sctx, sizeof(*sctx));
 }
 
 void sha256_final(struct sha256_state *sctx, u8 *out)
-- 
2.26.2



* [PATCH v4 2/6] crypto: Use memzero_explicit() for clearing state
  2020-10-25 14:31 [PATCH v4 0/6] crypto: lib/sha256 - cleanup/optimization Arvind Sankar
  2020-10-25 14:31 ` [PATCH v4 1/6] crypto: lib/sha256 - Use memzero_explicit() for clearing state Arvind Sankar
@ 2020-10-25 14:31 ` Arvind Sankar
  2020-10-26  7:58   ` Ard Biesheuvel
  2020-10-25 14:31 ` [PATCH v4 3/6] crypto: lib/sha256 - Don't clear temporary variables Arvind Sankar
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Arvind Sankar @ 2020-10-25 14:31 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, linux-crypto, Eric Biggers, David Laight
  Cc: linux-kernel

Without the barrier_data() inside memzero_explicit(), the compiler may
optimize away the state-clearing if it can tell that the state is not
used afterwards.

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
---
 arch/arm64/crypto/ghash-ce-glue.c | 2 +-
 arch/arm64/crypto/poly1305-glue.c | 2 +-
 arch/arm64/crypto/sha3-ce-glue.c  | 2 +-
 arch/x86/crypto/poly1305_glue.c   | 2 +-
 include/crypto/sha1_base.h        | 3 ++-
 include/crypto/sha256_base.h      | 3 ++-
 include/crypto/sha512_base.h      | 3 ++-
 include/crypto/sm3_base.h         | 3 ++-
 8 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-glue.c b/arch/arm64/crypto/ghash-ce-glue.c
index 8536008e3e35..2427e2f3a9a1 100644
--- a/arch/arm64/crypto/ghash-ce-glue.c
+++ b/arch/arm64/crypto/ghash-ce-glue.c
@@ -168,7 +168,7 @@ static int ghash_final(struct shash_desc *desc, u8 *dst)
 	put_unaligned_be64(ctx->digest[1], dst);
 	put_unaligned_be64(ctx->digest[0], dst + 8);
 
-	*ctx = (struct ghash_desc_ctx){};
+	memzero_explicit(ctx, sizeof(*ctx));
 	return 0;
 }
 
diff --git a/arch/arm64/crypto/poly1305-glue.c b/arch/arm64/crypto/poly1305-glue.c
index f33ada70c4ed..683de671741a 100644
--- a/arch/arm64/crypto/poly1305-glue.c
+++ b/arch/arm64/crypto/poly1305-glue.c
@@ -177,7 +177,7 @@ void poly1305_final_arch(struct poly1305_desc_ctx *dctx, u8 *dst)
 	}
 
 	poly1305_emit(&dctx->h, dst, dctx->s);
-	*dctx = (struct poly1305_desc_ctx){};
+	memzero_explicit(dctx, sizeof(*dctx));
 }
 EXPORT_SYMBOL(poly1305_final_arch);
 
diff --git a/arch/arm64/crypto/sha3-ce-glue.c b/arch/arm64/crypto/sha3-ce-glue.c
index 9a4bbfc45f40..e5a2936f0886 100644
--- a/arch/arm64/crypto/sha3-ce-glue.c
+++ b/arch/arm64/crypto/sha3-ce-glue.c
@@ -94,7 +94,7 @@ static int sha3_final(struct shash_desc *desc, u8 *out)
 	if (digest_size & 4)
 		put_unaligned_le32(sctx->st[i], (__le32 *)digest);
 
-	*sctx = (struct sha3_state){};
+	memzero_explicit(sctx, sizeof(*sctx));
 	return 0;
 }
 
diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
index e508dbd91813..64d09520d279 100644
--- a/arch/x86/crypto/poly1305_glue.c
+++ b/arch/x86/crypto/poly1305_glue.c
@@ -209,7 +209,7 @@ void poly1305_final_arch(struct poly1305_desc_ctx *dctx, u8 *dst)
 	}
 
 	poly1305_simd_emit(&dctx->h, dst, dctx->s);
-	*dctx = (struct poly1305_desc_ctx){};
+	memzero_explicit(dctx, sizeof(*dctx));
 }
 EXPORT_SYMBOL(poly1305_final_arch);
 
diff --git a/include/crypto/sha1_base.h b/include/crypto/sha1_base.h
index 20fd1f7468af..a5d6033efef7 100644
--- a/include/crypto/sha1_base.h
+++ b/include/crypto/sha1_base.h
@@ -12,6 +12,7 @@
 #include <crypto/sha.h>
 #include <linux/crypto.h>
 #include <linux/module.h>
+#include <linux/string.h>
 
 #include <asm/unaligned.h>
 
@@ -101,7 +102,7 @@ static inline int sha1_base_finish(struct shash_desc *desc, u8 *out)
 	for (i = 0; i < SHA1_DIGEST_SIZE / sizeof(__be32); i++)
 		put_unaligned_be32(sctx->state[i], digest++);
 
-	*sctx = (struct sha1_state){};
+	memzero_explicit(sctx, sizeof(*sctx));
 	return 0;
 }
 
diff --git a/include/crypto/sha256_base.h b/include/crypto/sha256_base.h
index 6ded110783ae..93f9fd21cc06 100644
--- a/include/crypto/sha256_base.h
+++ b/include/crypto/sha256_base.h
@@ -12,6 +12,7 @@
 #include <crypto/sha.h>
 #include <linux/crypto.h>
 #include <linux/module.h>
+#include <linux/string.h>
 
 #include <asm/unaligned.h>
 
@@ -105,7 +106,7 @@ static inline int sha256_base_finish(struct shash_desc *desc, u8 *out)
 	for (i = 0; digest_size > 0; i++, digest_size -= sizeof(__be32))
 		put_unaligned_be32(sctx->state[i], digest++);
 
-	*sctx = (struct sha256_state){};
+	memzero_explicit(sctx, sizeof(*sctx));
 	return 0;
 }
 
diff --git a/include/crypto/sha512_base.h b/include/crypto/sha512_base.h
index fb19c77494dc..93ab73baa38e 100644
--- a/include/crypto/sha512_base.h
+++ b/include/crypto/sha512_base.h
@@ -12,6 +12,7 @@
 #include <crypto/sha.h>
 #include <linux/crypto.h>
 #include <linux/module.h>
+#include <linux/string.h>
 
 #include <asm/unaligned.h>
 
@@ -126,7 +127,7 @@ static inline int sha512_base_finish(struct shash_desc *desc, u8 *out)
 	for (i = 0; digest_size > 0; i++, digest_size -= sizeof(__be64))
 		put_unaligned_be64(sctx->state[i], digest++);
 
-	*sctx = (struct sha512_state){};
+	memzero_explicit(sctx, sizeof(*sctx));
 	return 0;
 }
 
diff --git a/include/crypto/sm3_base.h b/include/crypto/sm3_base.h
index 1cbf9aa1fe52..2f3a32ab97bb 100644
--- a/include/crypto/sm3_base.h
+++ b/include/crypto/sm3_base.h
@@ -13,6 +13,7 @@
 #include <crypto/sm3.h>
 #include <linux/crypto.h>
 #include <linux/module.h>
+#include <linux/string.h>
 #include <asm/unaligned.h>
 
 typedef void (sm3_block_fn)(struct sm3_state *sst, u8 const *src, int blocks);
@@ -104,7 +105,7 @@ static inline int sm3_base_finish(struct shash_desc *desc, u8 *out)
 	for (i = 0; i < SM3_DIGEST_SIZE / sizeof(__be32); i++)
 		put_unaligned_be32(sctx->state[i], digest++);
 
-	*sctx = (struct sm3_state){};
+	memzero_explicit(sctx, sizeof(*sctx));
 	return 0;
 }
 
-- 
2.26.2



* [PATCH v4 3/6] crypto: lib/sha256 - Don't clear temporary variables
  2020-10-25 14:31 [PATCH v4 0/6] crypto: lib/sha256 - cleanup/optimization Arvind Sankar
  2020-10-25 14:31 ` [PATCH v4 1/6] crypto: lib/sha256 - Use memzero_explicit() for clearing state Arvind Sankar
  2020-10-25 14:31 ` [PATCH v4 2/6] crypto: " Arvind Sankar
@ 2020-10-25 14:31 ` Arvind Sankar
  2020-10-26  7:59   ` Ard Biesheuvel
  2020-10-25 14:31 ` [PATCH v4 4/6] crypto: lib/sha256 - Clear W[] in sha256_update() instead of sha256_transform() Arvind Sankar
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Arvind Sankar @ 2020-10-25 14:31 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, linux-crypto, Eric Biggers, David Laight
  Cc: linux-kernel, Eric Biggers

The assignments to clear a through h and t1/t2 are optimized out by the
compiler because they are unused after the assignments.

Clearing individual scalar variables is unlikely to be useful, as they
may have been assigned to registers, and even if stack spilling was
required, there may be compiler-generated temporaries that are
impossible to clear in any case.

So drop the clearing of a through h and t1/t2.
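
As a rough illustration (hypothetical userspace code, not part of the patch),
a final store to a local that is never read again is a dead store and is
dropped by gcc/clang at -O2:

/* Hypothetical example of dead-store elimination. */
unsigned int mix(unsigned int secret)
{
	unsigned int t = secret * 0x9e3779b9u;
	unsigned int r = t ^ (t >> 16);

	t = 0;	/* never read again, so the optimizer deletes this store */
	return r;
}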

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Reviewed-by: Eric Biggers <ebiggers@google.com>
---
 lib/crypto/sha256.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index d43bc39ab05e..099cd11f83c1 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -202,7 +202,6 @@ static void sha256_transform(u32 *state, const u8 *input)
 	state[4] += e; state[5] += f; state[6] += g; state[7] += h;
 
 	/* clear any sensitive info... */
-	a = b = c = d = e = f = g = h = t1 = t2 = 0;
 	memzero_explicit(W, 64 * sizeof(u32));
 }
 
-- 
2.26.2



* [PATCH v4 4/6] crypto: lib/sha256 - Clear W[] in sha256_update() instead of sha256_transform()
  2020-10-25 14:31 [PATCH v4 0/6] crypto: lib/sha256 - cleanup/optimization Arvind Sankar
                   ` (2 preceding siblings ...)
  2020-10-25 14:31 ` [PATCH v4 3/6] crypto: lib/sha256 - Don't clear temporary variables Arvind Sankar
@ 2020-10-25 14:31 ` Arvind Sankar
  2020-10-26  8:00   ` Ard Biesheuvel
  2020-10-25 14:31 ` [PATCH v4 5/6] crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64 Arvind Sankar
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Arvind Sankar @ 2020-10-25 14:31 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, linux-crypto, Eric Biggers, David Laight
  Cc: linux-kernel, Eric Biggers

The temporary W[] array is currently zeroed out once every call to
sha256_transform(), i.e. once every 64 bytes of input data. Moving it to
sha256_update() instead so that it is cleared only once per update can
save about 2-3% of the total time taken to compute the digest, with a
reasonable memset() implementation, and considerably more (~20%) with a
bad one (eg the x86 purgatory currently uses a memset() coded in C).

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Reviewed-by: Eric Biggers <ebiggers@google.com>
---
 lib/crypto/sha256.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index 099cd11f83c1..c6bfeacc5b81 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -43,10 +43,9 @@ static inline void BLEND_OP(int I, u32 *W)
 	W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
 }
 
-static void sha256_transform(u32 *state, const u8 *input)
+static void sha256_transform(u32 *state, const u8 *input, u32 *W)
 {
 	u32 a, b, c, d, e, f, g, h, t1, t2;
-	u32 W[64];
 	int i;
 
 	/* load the input */
@@ -200,15 +199,13 @@ static void sha256_transform(u32 *state, const u8 *input)
 
 	state[0] += a; state[1] += b; state[2] += c; state[3] += d;
 	state[4] += e; state[5] += f; state[6] += g; state[7] += h;
-
-	/* clear any sensitive info... */
-	memzero_explicit(W, 64 * sizeof(u32));
 }
 
 void sha256_update(struct sha256_state *sctx, const u8 *data, unsigned int len)
 {
 	unsigned int partial, done;
 	const u8 *src;
+	u32 W[64];
 
 	partial = sctx->count & 0x3f;
 	sctx->count += len;
@@ -223,11 +220,13 @@ void sha256_update(struct sha256_state *sctx, const u8 *data, unsigned int len)
 		}
 
 		do {
-			sha256_transform(sctx->state, src);
+			sha256_transform(sctx->state, src, W);
 			done += 64;
 			src = data + done;
 		} while (done + 63 < len);
 
+		memzero_explicit(W, sizeof(W));
+
 		partial = 0;
 	}
 	memcpy(sctx->buf + partial, src, len - done);
-- 
2.26.2



* [PATCH v4 5/6] crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64
  2020-10-25 14:31 [PATCH v4 0/6] crypto: lib/sha256 - cleanup/optimization Arvind Sankar
                   ` (3 preceding siblings ...)
  2020-10-25 14:31 ` [PATCH v4 4/6] crypto: lib/sha256 - Clear W[] in sha256_update() instead of sha256_transform() Arvind Sankar
@ 2020-10-25 14:31 ` Arvind Sankar
  2020-10-26  8:00   ` Ard Biesheuvel
  2020-10-25 14:31 ` [PATCH v4 6/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops Arvind Sankar
  2020-10-30  6:53 ` [PATCH v4 0/6] crypto: lib/sha256 - cleanup/optimization Herbert Xu
  6 siblings, 1 reply; 19+ messages in thread
From: Arvind Sankar @ 2020-10-25 14:31 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, linux-crypto, Eric Biggers, David Laight
  Cc: linux-kernel, Eric Biggers

This reduces code size substantially (on x86_64 with gcc-10 the size of
sha256_update() goes from 7593 bytes to 1952 bytes including the new
SHA256_K array), and on x86 is slightly faster than the full unroll
(tested on Broadwell Xeon).

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Reviewed-by: Eric Biggers <ebiggers@google.com>
---
 lib/crypto/sha256.c | 174 ++++++++++----------------------------------
 1 file changed, 38 insertions(+), 136 deletions(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index c6bfeacc5b81..e2e29d9b0ccd 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -18,6 +18,25 @@
 #include <crypto/sha.h>
 #include <asm/unaligned.h>
 
+static const u32 SHA256_K[] = {
+	0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5,
+	0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
+	0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
+	0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
+	0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc,
+	0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
+	0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7,
+	0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
+	0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
+	0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
+	0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3,
+	0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
+	0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5,
+	0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
+	0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
+	0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
+};
+
 static inline u32 Ch(u32 x, u32 y, u32 z)
 {
 	return z ^ (x & (y ^ z));
@@ -43,9 +62,17 @@ static inline void BLEND_OP(int I, u32 *W)
 	W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
 }
 
+#define SHA256_ROUND(i, a, b, c, d, e, f, g, h) do {		\
+	u32 t1, t2;						\
+	t1 = h + e1(e) + Ch(e, f, g) + SHA256_K[i] + W[i];	\
+	t2 = e0(a) + Maj(a, b, c);				\
+	d += t1;						\
+	h = t1 + t2;						\
+} while (0)
+
 static void sha256_transform(u32 *state, const u8 *input, u32 *W)
 {
-	u32 a, b, c, d, e, f, g, h, t1, t2;
+	u32 a, b, c, d, e, f, g, h;
 	int i;
 
 	/* load the input */
@@ -61,141 +88,16 @@ static void sha256_transform(u32 *state, const u8 *input, u32 *W)
 	e = state[4];  f = state[5];  g = state[6];  h = state[7];
 
 	/* now iterate */
-	t1 = h + e1(e) + Ch(e, f, g) + 0x428a2f98 + W[0];
-	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1 + t2;
-	t1 = g + e1(d) + Ch(d, e, f) + 0x71374491 + W[1];
-	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1 + t2;
-	t1 = f + e1(c) + Ch(c, d, e) + 0xb5c0fbcf + W[2];
-	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1 + t2;
-	t1 = e + e1(b) + Ch(b, c, d) + 0xe9b5dba5 + W[3];
-	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1 + t2;
-	t1 = d + e1(a) + Ch(a, b, c) + 0x3956c25b + W[4];
-	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1 + t2;
-	t1 = c + e1(h) + Ch(h, a, b) + 0x59f111f1 + W[5];
-	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1 + t2;
-	t1 = b + e1(g) + Ch(g, h, a) + 0x923f82a4 + W[6];
-	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1 + t2;
-	t1 = a + e1(f) + Ch(f, g, h) + 0xab1c5ed5 + W[7];
-	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1 + t2;
-
-	t1 = h + e1(e) + Ch(e, f, g) + 0xd807aa98 + W[8];
-	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1 + t2;
-	t1 = g + e1(d) + Ch(d, e, f) + 0x12835b01 + W[9];
-	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1 + t2;
-	t1 = f + e1(c) + Ch(c, d, e) + 0x243185be + W[10];
-	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1 + t2;
-	t1 = e + e1(b) + Ch(b, c, d) + 0x550c7dc3 + W[11];
-	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1 + t2;
-	t1 = d + e1(a) + Ch(a, b, c) + 0x72be5d74 + W[12];
-	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1 + t2;
-	t1 = c + e1(h) + Ch(h, a, b) + 0x80deb1fe + W[13];
-	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1 + t2;
-	t1 = b + e1(g) + Ch(g, h, a) + 0x9bdc06a7 + W[14];
-	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1 + t2;
-	t1 = a + e1(f) + Ch(f, g, h) + 0xc19bf174 + W[15];
-	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1 + t2;
-
-	t1 = h + e1(e) + Ch(e, f, g) + 0xe49b69c1 + W[16];
-	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1 + t2;
-	t1 = g + e1(d) + Ch(d, e, f) + 0xefbe4786 + W[17];
-	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1 + t2;
-	t1 = f + e1(c) + Ch(c, d, e) + 0x0fc19dc6 + W[18];
-	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1 + t2;
-	t1 = e + e1(b) + Ch(b, c, d) + 0x240ca1cc + W[19];
-	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1 + t2;
-	t1 = d + e1(a) + Ch(a, b, c) + 0x2de92c6f + W[20];
-	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1 + t2;
-	t1 = c + e1(h) + Ch(h, a, b) + 0x4a7484aa + W[21];
-	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1 + t2;
-	t1 = b + e1(g) + Ch(g, h, a) + 0x5cb0a9dc + W[22];
-	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1 + t2;
-	t1 = a + e1(f) + Ch(f, g, h) + 0x76f988da + W[23];
-	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1 + t2;
-
-	t1 = h + e1(e) + Ch(e, f, g) + 0x983e5152 + W[24];
-	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1 + t2;
-	t1 = g + e1(d) + Ch(d, e, f) + 0xa831c66d + W[25];
-	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1 + t2;
-	t1 = f + e1(c) + Ch(c, d, e) + 0xb00327c8 + W[26];
-	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1 + t2;
-	t1 = e + e1(b) + Ch(b, c, d) + 0xbf597fc7 + W[27];
-	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1 + t2;
-	t1 = d + e1(a) + Ch(a, b, c) + 0xc6e00bf3 + W[28];
-	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1 + t2;
-	t1 = c + e1(h) + Ch(h, a, b) + 0xd5a79147 + W[29];
-	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1 + t2;
-	t1 = b + e1(g) + Ch(g, h, a) + 0x06ca6351 + W[30];
-	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1 + t2;
-	t1 = a + e1(f) + Ch(f, g, h) + 0x14292967 + W[31];
-	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1 + t2;
-
-	t1 = h + e1(e) + Ch(e, f, g) + 0x27b70a85 + W[32];
-	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1 + t2;
-	t1 = g + e1(d) + Ch(d, e, f) + 0x2e1b2138 + W[33];
-	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1 + t2;
-	t1 = f + e1(c) + Ch(c, d, e) + 0x4d2c6dfc + W[34];
-	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1 + t2;
-	t1 = e + e1(b) + Ch(b, c, d) + 0x53380d13 + W[35];
-	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1 + t2;
-	t1 = d + e1(a) + Ch(a, b, c) + 0x650a7354 + W[36];
-	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1 + t2;
-	t1 = c + e1(h) + Ch(h, a, b) + 0x766a0abb + W[37];
-	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1 + t2;
-	t1 = b + e1(g) + Ch(g, h, a) + 0x81c2c92e + W[38];
-	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1 + t2;
-	t1 = a + e1(f) + Ch(f, g, h) + 0x92722c85 + W[39];
-	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1 + t2;
-
-	t1 = h + e1(e) + Ch(e, f, g) + 0xa2bfe8a1 + W[40];
-	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1 + t2;
-	t1 = g + e1(d) + Ch(d, e, f) + 0xa81a664b + W[41];
-	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1 + t2;
-	t1 = f + e1(c) + Ch(c, d, e) + 0xc24b8b70 + W[42];
-	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1 + t2;
-	t1 = e + e1(b) + Ch(b, c, d) + 0xc76c51a3 + W[43];
-	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1 + t2;
-	t1 = d + e1(a) + Ch(a, b, c) + 0xd192e819 + W[44];
-	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1 + t2;
-	t1 = c + e1(h) + Ch(h, a, b) + 0xd6990624 + W[45];
-	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1 + t2;
-	t1 = b + e1(g) + Ch(g, h, a) + 0xf40e3585 + W[46];
-	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1 + t2;
-	t1 = a + e1(f) + Ch(f, g, h) + 0x106aa070 + W[47];
-	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1 + t2;
-
-	t1 = h + e1(e) + Ch(e, f, g) + 0x19a4c116 + W[48];
-	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1 + t2;
-	t1 = g + e1(d) + Ch(d, e, f) + 0x1e376c08 + W[49];
-	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1 + t2;
-	t1 = f + e1(c) + Ch(c, d, e) + 0x2748774c + W[50];
-	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1 + t2;
-	t1 = e + e1(b) + Ch(b, c, d) + 0x34b0bcb5 + W[51];
-	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1 + t2;
-	t1 = d + e1(a) + Ch(a, b, c) + 0x391c0cb3 + W[52];
-	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1 + t2;
-	t1 = c + e1(h) + Ch(h, a, b) + 0x4ed8aa4a + W[53];
-	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1 + t2;
-	t1 = b + e1(g) + Ch(g, h, a) + 0x5b9cca4f + W[54];
-	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1 + t2;
-	t1 = a + e1(f) + Ch(f, g, h) + 0x682e6ff3 + W[55];
-	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1 + t2;
-
-	t1 = h + e1(e) + Ch(e, f, g) + 0x748f82ee + W[56];
-	t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1 + t2;
-	t1 = g + e1(d) + Ch(d, e, f) + 0x78a5636f + W[57];
-	t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1 + t2;
-	t1 = f + e1(c) + Ch(c, d, e) + 0x84c87814 + W[58];
-	t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1 + t2;
-	t1 = e + e1(b) + Ch(b, c, d) + 0x8cc70208 + W[59];
-	t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1 + t2;
-	t1 = d + e1(a) + Ch(a, b, c) + 0x90befffa + W[60];
-	t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1 + t2;
-	t1 = c + e1(h) + Ch(h, a, b) + 0xa4506ceb + W[61];
-	t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1 + t2;
-	t1 = b + e1(g) + Ch(g, h, a) + 0xbef9a3f7 + W[62];
-	t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1 + t2;
-	t1 = a + e1(f) + Ch(f, g, h) + 0xc67178f2 + W[63];
-	t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1 + t2;
+	for (i = 0; i < 64; i += 8) {
+		SHA256_ROUND(i + 0, a, b, c, d, e, f, g, h);
+		SHA256_ROUND(i + 1, h, a, b, c, d, e, f, g);
+		SHA256_ROUND(i + 2, g, h, a, b, c, d, e, f);
+		SHA256_ROUND(i + 3, f, g, h, a, b, c, d, e);
+		SHA256_ROUND(i + 4, e, f, g, h, a, b, c, d);
+		SHA256_ROUND(i + 5, d, e, f, g, h, a, b, c);
+		SHA256_ROUND(i + 6, c, d, e, f, g, h, a, b);
+		SHA256_ROUND(i + 7, b, c, d, e, f, g, h, a);
+	}
 
 	state[0] += a; state[1] += b; state[2] += c; state[3] += d;
 	state[4] += e; state[5] += f; state[6] += g; state[7] += h;
-- 
2.26.2



* [PATCH v4 6/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops
  2020-10-25 14:31 [PATCH v4 0/6] crypto: lib/sha256 - cleanup/optimization Arvind Sankar
                   ` (4 preceding siblings ...)
  2020-10-25 14:31 ` [PATCH v4 5/6] crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64 Arvind Sankar
@ 2020-10-25 14:31 ` Arvind Sankar
  2020-10-25 18:51   ` David Laight
  2020-10-26  8:02   ` Ard Biesheuvel
  2020-10-30  6:53 ` [PATCH v4 0/6] crypto: lib/sha256 - cleanup/optimization Herbert Xu
  6 siblings, 2 replies; 19+ messages in thread
From: Arvind Sankar @ 2020-10-25 14:31 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, linux-crypto, Eric Biggers, David Laight
  Cc: linux-kernel, Eric Biggers

Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
(tested on Broadwell Xeon) while not increasing code size too much.

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Reviewed-by: Eric Biggers <ebiggers@google.com>
---
 lib/crypto/sha256.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index e2e29d9b0ccd..cdef37c05972 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -76,12 +76,28 @@ static void sha256_transform(u32 *state, const u8 *input, u32 *W)
 	int i;
 
 	/* load the input */
-	for (i = 0; i < 16; i++)
-		LOAD_OP(i, W, input);
+	for (i = 0; i < 16; i += 8) {
+		LOAD_OP(i + 0, W, input);
+		LOAD_OP(i + 1, W, input);
+		LOAD_OP(i + 2, W, input);
+		LOAD_OP(i + 3, W, input);
+		LOAD_OP(i + 4, W, input);
+		LOAD_OP(i + 5, W, input);
+		LOAD_OP(i + 6, W, input);
+		LOAD_OP(i + 7, W, input);
+	}
 
 	/* now blend */
-	for (i = 16; i < 64; i++)
-		BLEND_OP(i, W);
+	for (i = 16; i < 64; i += 8) {
+		BLEND_OP(i + 0, W);
+		BLEND_OP(i + 1, W);
+		BLEND_OP(i + 2, W);
+		BLEND_OP(i + 3, W);
+		BLEND_OP(i + 4, W);
+		BLEND_OP(i + 5, W);
+		BLEND_OP(i + 6, W);
+		BLEND_OP(i + 7, W);
+	}
 
 	/* load the state into our registers */
 	a = state[0];  b = state[1];  c = state[2];  d = state[3];
-- 
2.26.2



* RE: [PATCH v4 6/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops
  2020-10-25 14:31 ` [PATCH v4 6/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops Arvind Sankar
@ 2020-10-25 18:51   ` David Laight
  2020-10-25 20:18     ` Arvind Sankar
  2020-10-26  8:02   ` Ard Biesheuvel
  1 sibling, 1 reply; 19+ messages in thread
From: David Laight @ 2020-10-25 18:51 UTC (permalink / raw)
  To: 'Arvind Sankar',
	Herbert Xu, David S. Miller, linux-crypto, Eric Biggers
  Cc: linux-kernel, Eric Biggers

From: Arvind Sankar
> Sent: 25 October 2020 14:31
> 
> Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
> (tested on Broadwell Xeon) while not increasing code size too much.

I can't believe unrolling the BLEND loop makes any difference.

Unrolling the LOAD one might - but you don't need 8 times,
once should be more than enough.
The LOAD loop needs a memory read, memory write and BSWAP per iteration.
The loop control is add + compare + jmp.
On sandy bridge and later the compare and jmp become a single u-op.
So the loop has the read, write (can happen together) and 3 other u-ops.
That won't run at 1 clock per iteration on Sandy Bridge.
However just unroll once and you need 4 non-memory u-ops per loop iteration.
That might run at 2 clocks per 8 bytes.

Fiddling the loop to remove the compare (ie run from -64 to 0)
should merge the 'add' and 'jnz' into a single u-op.
That might be enough to get the 'rolled up' loop to run in 1 clock
on sandy bridge, certainly on slightly later cpu.
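
A minimal C sketch of that suggestion (illustration only, not code from this
series), assuming the LOAD loop is the usual one-word-per-iteration
get_unaligned_be32():

	/*
	 * Bias the index so it runs from -16 up to 0: the increment then sets
	 * the flags the loop branch tests, and no separate compare against 16
	 * is needed. The same idea works with byte offsets (-64 up to 0).
	 */
	for (i = -16; i != 0; i++)
		W[i + 16] = get_unaligned_be32(input + 4 * (i + 16));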

That is theoretical for intel cpu sandy bridge onwards.
I've an i7-7700 (Kaby Lake?) that I believe has an extra
instruction pipeline and might run the initial loop in 1 clock.

I don't have any recent AMD cpu, nor any ARM or PPC ones.
But fully out-of-order cpu are likely to be similar.

One of the other test systems I've got is an Atom C2758.
This is 8-core but mostly in-order.
Running sha256_transform() on that tends to give one of two
TSC counts, one of which is double the other!
That is pretty consistent even for 100 iterations.

WRT patch 5.
On the C2758 the original unrolled code is slightly faster.
On the i7-7700 the 8 unroll is a bit faster 'hot cache',
but slower 'cold cache' - probably because of the d-cache
loads for K[].

Non-x86 architectures might need to use d-cache reads for
the 32bit 'K' constants even in the unrolled loop.
X86 can use 'lea' with a 32bit offset to avoid data reads.
So the cold-cache case for the old code may be similar.

Interestingly I had to write an asm ror32() to get reasonable
code (in userspace). The C version the kernel uses didn't work.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)



* Re: [PATCH v4 6/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops
  2020-10-25 18:51   ` David Laight
@ 2020-10-25 20:18     ` Arvind Sankar
  2020-10-25 23:23       ` David Laight
  0 siblings, 1 reply; 19+ messages in thread
From: Arvind Sankar @ 2020-10-25 20:18 UTC (permalink / raw)
  To: David Laight
  Cc: 'Arvind Sankar',
	Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	linux-kernel, Eric Biggers

On Sun, Oct 25, 2020 at 06:51:18PM +0000, David Laight wrote:
> From: Arvind Sankar
> > Sent: 25 October 2020 14:31
> > 
> > Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
> > (tested on Broadwell Xeon) while not increasing code size too much.
> 
> I can't believe unrolling the BLEND loop makes any difference.

It's actually the BLEND loop that accounts for almost all of the
difference. The LOAD loop doesn't matter much in general: even replacing
it with a plain memcpy() only increases performance by 3-4%. But
unrolling it is low cost in code size terms, and clang actually does it
without being asked.

> WRT patch 5.
> On the C2758 the original unrolled code is slightly faster.
> On the i7-7700 the 8 unroll is a bit faster 'hot cache',
> but slower 'cold cache' - probably because of the d-cache
> loads for K[].
> 
> Non-x86 architectures might need to use d-cache reads for
> the 32bit 'K' constants even in the unrolled loop.
> X86 can use 'lea' with a 32bit offset to avoid data reads.
> So the cold-cache case for the old code may be similar.

Not sure I follow: in the old code, the K's are 32-bit immediates, so
they should come from the i-cache whether an add or an lea is used?

Why is the cold-cache case relevant anyway? If the code is only being
executed a couple of times or so, i.e. you're hashing a single say
64-128 byte input once in a blue moon, the performance of the hash
doesn't really matter, no?

> 
> Interestingly I had to write an asm ror32() to get reasonable
> code (in userspace). The C version the kernel uses didn't work.
> 
> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
> 


* RE: [PATCH v4 6/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops
  2020-10-25 20:18     ` Arvind Sankar
@ 2020-10-25 23:23       ` David Laight
  2020-10-25 23:53         ` Arvind Sankar
  0 siblings, 1 reply; 19+ messages in thread
From: David Laight @ 2020-10-25 23:23 UTC (permalink / raw)
  To: 'Arvind Sankar'
  Cc: Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	linux-kernel, Eric Biggers

From: Arvind Sankar
> Sent: 25 October 2020 20:18
> 
> On Sun, Oct 25, 2020 at 06:51:18PM +0000, David Laight wrote:
> > From: Arvind Sankar
> > > Sent: 25 October 2020 14:31
> > >
> > > Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
> > > (tested on Broadwell Xeon) while not increasing code size too much.
> >
> > I can't believe unrolling the BLEND loop makes any difference.
> 
> It's actually the BLEND loop that accounts for almost all of the
> difference. The LOAD loop doesn't matter much in general: even replacing
> it with a plain memcpy() only increases performance by 3-4%. But
> unrolling it is low cost in code size terms, and clang actually does it
> without being asked.

(memcpy is wrong - misses the byte swaps).

That's odd, the BLEND loop is about 20 instructions.
I wouldn't expect unrolling to help - unless you manage
to use 16 registers for the active W[] values.

> > WRT patch 5.
> > On the C2758 the original unrolled code is slightly faster.
> > On the i7-7700 the 8 unroll is a bit faster 'hot cache',
> > but slower 'cold cache' - probably because of the d-cache
> > loads for K[].
> >
> > Non-x86 architectures might need to use d-cache reads for
> > the 32bit 'K' constants even in the unrolled loop.
> > X86 can use 'lea' with a 32bit offset to avoid data reads.
> > So the cold-cache case for the old code may be similar.
> 
> Not sure I follow: in the old code, the K's are 32-bit immediates, so
> they should come from the i-cache whether an add or an lea is used?

I was thinking of other instruction sets that end up using pc-relative
addressing for constants.
Might only happen for 64-bit ones though.

> Why is the cold-cache case relevant anyway? If the code is only being
> executed a couple of times or so, i.e. you're hashing a single say
> 64-128 byte input once in a blue moon, the performance of the hash
> doesn't really matter, no?

I was measuring the cold cache one because I could.
I didn't note the actual figures but it was 8-10 times slower
than the hot-cache case.
While sha256 is likely to be run hot-cache (on a big buffer)
the cold-cache timings are probably relevant for things like memcpy().
I remember seeing a very long divide function for sparc32 that
was probably only a gain in a benchmark loop - it would have
displaced a lot of the working set from the i-cache!

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: [PATCH v4 6/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops
  2020-10-25 23:23       ` David Laight
@ 2020-10-25 23:53         ` Arvind Sankar
  2020-10-26 10:06           ` David Laight
  0 siblings, 1 reply; 19+ messages in thread
From: Arvind Sankar @ 2020-10-25 23:53 UTC (permalink / raw)
  To: David Laight
  Cc: 'Arvind Sankar',
	Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	linux-kernel, Eric Biggers

On Sun, Oct 25, 2020 at 11:23:52PM +0000, David Laight wrote:
> From: Arvind Sankar
> > Sent: 25 October 2020 20:18
> > 
> > On Sun, Oct 25, 2020 at 06:51:18PM +0000, David Laight wrote:
> > > From: Arvind Sankar
> > > > Sent: 25 October 2020 14:31
> > > >
> > > > Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
> > > > (tested on Broadwell Xeon) while not increasing code size too much.
> > >
> > > I can't believe unrolling the BLEND loop makes any difference.
> > 
> > It's actually the BLEND loop that accounts for almost all of the
> > difference. The LOAD loop doesn't matter much in general: even replacing
> > it with a plain memcpy() only increases performance by 3-4%. But
> > unrolling it is low cost in code size terms, and clang actually does it
> > without being asked.
> 
> (memcpy is wrong - misses the byte swaps).

I know it's wrong, the point is that it's impossible to gain very much
from optimizing the LOAD loop because it doesn't account for much of the
total time.

> 
> That's odd, the BLEND loop is about 20 instructions.
> I wouldn't expect unrolling to help - unless you manage
> to use 16 registers for the active W[] values.
> 

I am not sure about what's going on inside the hardware, but even with
a straightforward assembly version that just reads out of memory the way
the calculation is specified, unrolling the BLEND loop 8x speeds up the
performance by 7-8%.

The compiler is actually pretty bad here, just translating everything
into assembler with no attempt to optimize anything gets a 10-12%
speedup over the C version.


* Re: [PATCH v4 2/6] crypto: Use memzero_explicit() for clearing state
  2020-10-25 14:31 ` [PATCH v4 2/6] crypto: " Arvind Sankar
@ 2020-10-26  7:58   ` Ard Biesheuvel
  0 siblings, 0 replies; 19+ messages in thread
From: Ard Biesheuvel @ 2020-10-26  7:58 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	David Laight, Linux Kernel Mailing List

On Sun, 25 Oct 2020 at 15:31, Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> Without the barrier_data() inside memzero_explicit(), the compiler may
> optimize away the state-clearing if it can tell that the state is not
> used afterwards.
>
> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>

Acked-by: Ard Biesheuvel <ardb@kernel.org>

I agree with Eric that, even though there are cases where it is
unlikely that the compiler could elide an ordinary memset() or struct
assignment (even under LTO), using memzero_explicit() is better in
these cases, as it also clarifies the intent of the operation, and
doesn't result in worse code now that memzero_explicit() is a static
inline around memset() and a barrier.





> ---
>  arch/arm64/crypto/ghash-ce-glue.c | 2 +-
>  arch/arm64/crypto/poly1305-glue.c | 2 +-
>  arch/arm64/crypto/sha3-ce-glue.c  | 2 +-
>  arch/x86/crypto/poly1305_glue.c   | 2 +-
>  include/crypto/sha1_base.h        | 3 ++-
>  include/crypto/sha256_base.h      | 3 ++-
>  include/crypto/sha512_base.h      | 3 ++-
>  include/crypto/sm3_base.h         | 3 ++-
>  8 files changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/arch/arm64/crypto/ghash-ce-glue.c b/arch/arm64/crypto/ghash-ce-glue.c
> index 8536008e3e35..2427e2f3a9a1 100644
> --- a/arch/arm64/crypto/ghash-ce-glue.c
> +++ b/arch/arm64/crypto/ghash-ce-glue.c
> @@ -168,7 +168,7 @@ static int ghash_final(struct shash_desc *desc, u8 *dst)
>         put_unaligned_be64(ctx->digest[1], dst);
>         put_unaligned_be64(ctx->digest[0], dst + 8);
>
> -       *ctx = (struct ghash_desc_ctx){};
> +       memzero_explicit(ctx, sizeof(*ctx));
>         return 0;
>  }
>
> diff --git a/arch/arm64/crypto/poly1305-glue.c b/arch/arm64/crypto/poly1305-glue.c
> index f33ada70c4ed..683de671741a 100644
> --- a/arch/arm64/crypto/poly1305-glue.c
> +++ b/arch/arm64/crypto/poly1305-glue.c
> @@ -177,7 +177,7 @@ void poly1305_final_arch(struct poly1305_desc_ctx *dctx, u8 *dst)
>         }
>
>         poly1305_emit(&dctx->h, dst, dctx->s);
> -       *dctx = (struct poly1305_desc_ctx){};
> +       memzero_explicit(dctx, sizeof(*dctx));
>  }
>  EXPORT_SYMBOL(poly1305_final_arch);
>
> diff --git a/arch/arm64/crypto/sha3-ce-glue.c b/arch/arm64/crypto/sha3-ce-glue.c
> index 9a4bbfc45f40..e5a2936f0886 100644
> --- a/arch/arm64/crypto/sha3-ce-glue.c
> +++ b/arch/arm64/crypto/sha3-ce-glue.c
> @@ -94,7 +94,7 @@ static int sha3_final(struct shash_desc *desc, u8 *out)
>         if (digest_size & 4)
>                 put_unaligned_le32(sctx->st[i], (__le32 *)digest);
>
> -       *sctx = (struct sha3_state){};
> +       memzero_explicit(sctx, sizeof(*sctx));
>         return 0;
>  }
>
> diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
> index e508dbd91813..64d09520d279 100644
> --- a/arch/x86/crypto/poly1305_glue.c
> +++ b/arch/x86/crypto/poly1305_glue.c
> @@ -209,7 +209,7 @@ void poly1305_final_arch(struct poly1305_desc_ctx *dctx, u8 *dst)
>         }
>
>         poly1305_simd_emit(&dctx->h, dst, dctx->s);
> -       *dctx = (struct poly1305_desc_ctx){};
> +       memzero_explicit(dctx, sizeof(*dctx));
>  }
>  EXPORT_SYMBOL(poly1305_final_arch);
>
> diff --git a/include/crypto/sha1_base.h b/include/crypto/sha1_base.h
> index 20fd1f7468af..a5d6033efef7 100644
> --- a/include/crypto/sha1_base.h
> +++ b/include/crypto/sha1_base.h
> @@ -12,6 +12,7 @@
>  #include <crypto/sha.h>
>  #include <linux/crypto.h>
>  #include <linux/module.h>
> +#include <linux/string.h>
>
>  #include <asm/unaligned.h>
>
> @@ -101,7 +102,7 @@ static inline int sha1_base_finish(struct shash_desc *desc, u8 *out)
>         for (i = 0; i < SHA1_DIGEST_SIZE / sizeof(__be32); i++)
>                 put_unaligned_be32(sctx->state[i], digest++);
>
> -       *sctx = (struct sha1_state){};
> +       memzero_explicit(sctx, sizeof(*sctx));
>         return 0;
>  }
>
> diff --git a/include/crypto/sha256_base.h b/include/crypto/sha256_base.h
> index 6ded110783ae..93f9fd21cc06 100644
> --- a/include/crypto/sha256_base.h
> +++ b/include/crypto/sha256_base.h
> @@ -12,6 +12,7 @@
>  #include <crypto/sha.h>
>  #include <linux/crypto.h>
>  #include <linux/module.h>
> +#include <linux/string.h>
>
>  #include <asm/unaligned.h>
>
> @@ -105,7 +106,7 @@ static inline int sha256_base_finish(struct shash_desc *desc, u8 *out)
>         for (i = 0; digest_size > 0; i++, digest_size -= sizeof(__be32))
>                 put_unaligned_be32(sctx->state[i], digest++);
>
> -       *sctx = (struct sha256_state){};
> +       memzero_explicit(sctx, sizeof(*sctx));
>         return 0;
>  }
>
> diff --git a/include/crypto/sha512_base.h b/include/crypto/sha512_base.h
> index fb19c77494dc..93ab73baa38e 100644
> --- a/include/crypto/sha512_base.h
> +++ b/include/crypto/sha512_base.h
> @@ -12,6 +12,7 @@
>  #include <crypto/sha.h>
>  #include <linux/crypto.h>
>  #include <linux/module.h>
> +#include <linux/string.h>
>
>  #include <asm/unaligned.h>
>
> @@ -126,7 +127,7 @@ static inline int sha512_base_finish(struct shash_desc *desc, u8 *out)
>         for (i = 0; digest_size > 0; i++, digest_size -= sizeof(__be64))
>                 put_unaligned_be64(sctx->state[i], digest++);
>
> -       *sctx = (struct sha512_state){};
> +       memzero_explicit(sctx, sizeof(*sctx));
>         return 0;
>  }
>
> diff --git a/include/crypto/sm3_base.h b/include/crypto/sm3_base.h
> index 1cbf9aa1fe52..2f3a32ab97bb 100644
> --- a/include/crypto/sm3_base.h
> +++ b/include/crypto/sm3_base.h
> @@ -13,6 +13,7 @@
>  #include <crypto/sm3.h>
>  #include <linux/crypto.h>
>  #include <linux/module.h>
> +#include <linux/string.h>
>  #include <asm/unaligned.h>
>
>  typedef void (sm3_block_fn)(struct sm3_state *sst, u8 const *src, int blocks);
> @@ -104,7 +105,7 @@ static inline int sm3_base_finish(struct shash_desc *desc, u8 *out)
>         for (i = 0; i < SM3_DIGEST_SIZE / sizeof(__be32); i++)
>                 put_unaligned_be32(sctx->state[i], digest++);
>
> -       *sctx = (struct sm3_state){};
> +       memzero_explicit(sctx, sizeof(*sctx));
>         return 0;
>  }
>
> --
> 2.26.2
>


* Re: [PATCH v4 1/6] crypto: lib/sha256 - Use memzero_explicit() for clearing state
  2020-10-25 14:31 ` [PATCH v4 1/6] crypto: lib/sha256 - Use memzero_explicit() for clearing state Arvind Sankar
@ 2020-10-26  7:59   ` Ard Biesheuvel
  0 siblings, 0 replies; 19+ messages in thread
From: Ard Biesheuvel @ 2020-10-26  7:59 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	David Laight, Linux Kernel Mailing List, Eric Biggers

On Sun, 25 Oct 2020 at 15:31, Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> Without the barrier_data() inside memzero_explicit(), the compiler may
> optimize away the state-clearing if it can tell that the state is not
> used afterwards. At least in lib/crypto/sha256.c:__sha256_final(), the
> function can get inlined into sha256(), in which case the memset is
> optimized away.
>
> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
> Reviewed-by: Eric Biggers <ebiggers@google.com>

Acked-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  lib/crypto/sha256.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
> index 2321f6cb322f..d43bc39ab05e 100644
> --- a/lib/crypto/sha256.c
> +++ b/lib/crypto/sha256.c
> @@ -265,7 +265,7 @@ static void __sha256_final(struct sha256_state *sctx, u8 *out, int digest_words)
>                 put_unaligned_be32(sctx->state[i], &dst[i]);
>
>         /* Zeroize sensitive information. */
> -       memset(sctx, 0, sizeof(*sctx));
> +       memzero_explicit(sctx, sizeof(*sctx));
>  }
>
>  void sha256_final(struct sha256_state *sctx, u8 *out)
> --
> 2.26.2
>


* Re: [PATCH v4 3/6] crypto: lib/sha256 - Don't clear temporary variables
  2020-10-25 14:31 ` [PATCH v4 3/6] crypto: lib/sha256 - Don't clear temporary variables Arvind Sankar
@ 2020-10-26  7:59   ` Ard Biesheuvel
  0 siblings, 0 replies; 19+ messages in thread
From: Ard Biesheuvel @ 2020-10-26  7:59 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	David Laight, Linux Kernel Mailing List, Eric Biggers

On Sun, 25 Oct 2020 at 15:31, Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> The assignments to clear a through h and t1/t2 are optimized out by the
> compiler because they are unused after the assignments.
>
> Clearing individual scalar variables is unlikely to be useful, as they
> may have been assigned to registers, and even if stack spilling was
> required, there may be compiler-generated temporaries that are
> impossible to clear in any case.
>
> So drop the clearing of a through h and t1/t2.
>
> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
> Reviewed-by: Eric Biggers <ebiggers@google.com>

Acked-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  lib/crypto/sha256.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
> index d43bc39ab05e..099cd11f83c1 100644
> --- a/lib/crypto/sha256.c
> +++ b/lib/crypto/sha256.c
> @@ -202,7 +202,6 @@ static void sha256_transform(u32 *state, const u8 *input)
>         state[4] += e; state[5] += f; state[6] += g; state[7] += h;
>
>         /* clear any sensitive info... */
> -       a = b = c = d = e = f = g = h = t1 = t2 = 0;
>         memzero_explicit(W, 64 * sizeof(u32));
>  }
>
> --
> 2.26.2
>


* Re: [PATCH v4 4/6] crypto: lib/sha256 - Clear W[] in sha256_update() instead of sha256_transform()
  2020-10-25 14:31 ` [PATCH v4 4/6] crypto: lib/sha256 - Clear W[] in sha256_update() instead of sha256_transform() Arvind Sankar
@ 2020-10-26  8:00   ` Ard Biesheuvel
  0 siblings, 0 replies; 19+ messages in thread
From: Ard Biesheuvel @ 2020-10-26  8:00 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	David Laight, Linux Kernel Mailing List, Eric Biggers

On Sun, 25 Oct 2020 at 15:31, Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> The temporary W[] array is currently zeroed out once every call to
> sha256_transform(), i.e. once every 64 bytes of input data. Moving it to
> sha256_update() instead so that it is cleared only once per update can
> save about 2-3% of the total time taken to compute the digest, with a
> reasonable memset() implementation, and considerably more (~20%) with a
> bad one (eg the x86 purgatory currently uses a memset() coded in C).
>
> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
> Reviewed-by: Eric Biggers <ebiggers@google.com>

Acked-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  lib/crypto/sha256.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
> index 099cd11f83c1..c6bfeacc5b81 100644
> --- a/lib/crypto/sha256.c
> +++ b/lib/crypto/sha256.c
> @@ -43,10 +43,9 @@ static inline void BLEND_OP(int I, u32 *W)
>         W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
>  }
>
> -static void sha256_transform(u32 *state, const u8 *input)
> +static void sha256_transform(u32 *state, const u8 *input, u32 *W)
>  {
>         u32 a, b, c, d, e, f, g, h, t1, t2;
> -       u32 W[64];
>         int i;
>
>         /* load the input */
> @@ -200,15 +199,13 @@ static void sha256_transform(u32 *state, const u8 *input)
>
>         state[0] += a; state[1] += b; state[2] += c; state[3] += d;
>         state[4] += e; state[5] += f; state[6] += g; state[7] += h;
> -
> -       /* clear any sensitive info... */
> -       memzero_explicit(W, 64 * sizeof(u32));
>  }
>
>  void sha256_update(struct sha256_state *sctx, const u8 *data, unsigned int len)
>  {
>         unsigned int partial, done;
>         const u8 *src;
> +       u32 W[64];
>
>         partial = sctx->count & 0x3f;
>         sctx->count += len;
> @@ -223,11 +220,13 @@ void sha256_update(struct sha256_state *sctx, const u8 *data, unsigned int len)
>                 }
>
>                 do {
> -                       sha256_transform(sctx->state, src);
> +                       sha256_transform(sctx->state, src, W);
>                         done += 64;
>                         src = data + done;
>                 } while (done + 63 < len);
>
> +               memzero_explicit(W, sizeof(W));
> +
>                 partial = 0;
>         }
>         memcpy(sctx->buf + partial, src, len - done);
> --
> 2.26.2
>


* Re: [PATCH v4 5/6] crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64
  2020-10-25 14:31 ` [PATCH v4 5/6] crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64 Arvind Sankar
@ 2020-10-26  8:00   ` Ard Biesheuvel
  0 siblings, 0 replies; 19+ messages in thread
From: Ard Biesheuvel @ 2020-10-26  8:00 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	David Laight, Linux Kernel Mailing List, Eric Biggers

On Sun, 25 Oct 2020 at 15:31, Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> This reduces code size substantially (on x86_64 with gcc-10 the size of
> sha256_update() goes from 7593 bytes to 1952 bytes including the new
> SHA256_K array), and on x86 is slightly faster than the full unroll
> (tested on Broadwell Xeon).
>
> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
> Reviewed-by: Eric Biggers <ebiggers@google.com>

Acked-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  lib/crypto/sha256.c | 174 ++++++++++----------------------------------
>  1 file changed, 38 insertions(+), 136 deletions(-)
>
> diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
> index c6bfeacc5b81..e2e29d9b0ccd 100644
> --- a/lib/crypto/sha256.c
> +++ b/lib/crypto/sha256.c
> @@ -18,6 +18,25 @@
>  #include <crypto/sha.h>
>  #include <asm/unaligned.h>
>
> +static const u32 SHA256_K[] = {
> +       0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5,
> +       0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
> +       0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
> +       0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
> +       0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc,
> +       0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
> +       0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7,
> +       0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
> +       0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
> +       0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
> +       0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3,
> +       0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
> +       0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5,
> +       0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
> +       0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
> +       0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
> +};
> +
>  static inline u32 Ch(u32 x, u32 y, u32 z)
>  {
>         return z ^ (x & (y ^ z));
> @@ -43,9 +62,17 @@ static inline void BLEND_OP(int I, u32 *W)
>         W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
>  }
>
> +#define SHA256_ROUND(i, a, b, c, d, e, f, g, h) do {           \
> +       u32 t1, t2;                                             \
> +       t1 = h + e1(e) + Ch(e, f, g) + SHA256_K[i] + W[i];      \
> +       t2 = e0(a) + Maj(a, b, c);                              \
> +       d += t1;                                                \
> +       h = t1 + t2;                                            \
> +} while (0)
> +
>  static void sha256_transform(u32 *state, const u8 *input, u32 *W)
>  {
> -       u32 a, b, c, d, e, f, g, h, t1, t2;
> +       u32 a, b, c, d, e, f, g, h;
>         int i;
>
>         /* load the input */
> @@ -61,141 +88,16 @@ static void sha256_transform(u32 *state, const u8 *input, u32 *W)
>         e = state[4];  f = state[5];  g = state[6];  h = state[7];
>
>         /* now iterate */
> -       t1 = h + e1(e) + Ch(e, f, g) + 0x428a2f98 + W[0];
> -       t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1 + t2;
> -       t1 = g + e1(d) + Ch(d, e, f) + 0x71374491 + W[1];
> -       t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1 + t2;
> -       t1 = f + e1(c) + Ch(c, d, e) + 0xb5c0fbcf + W[2];
> -       t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1 + t2;
> -       t1 = e + e1(b) + Ch(b, c, d) + 0xe9b5dba5 + W[3];
> -       t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1 + t2;
> -       t1 = d + e1(a) + Ch(a, b, c) + 0x3956c25b + W[4];
> -       t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1 + t2;
> -       t1 = c + e1(h) + Ch(h, a, b) + 0x59f111f1 + W[5];
> -       t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1 + t2;
> -       t1 = b + e1(g) + Ch(g, h, a) + 0x923f82a4 + W[6];
> -       t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1 + t2;
> -       t1 = a + e1(f) + Ch(f, g, h) + 0xab1c5ed5 + W[7];
> -       t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1 + t2;
> -
> -       t1 = h + e1(e) + Ch(e, f, g) + 0xd807aa98 + W[8];
> -       t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1 + t2;
> -       t1 = g + e1(d) + Ch(d, e, f) + 0x12835b01 + W[9];
> -       t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1 + t2;
> -       t1 = f + e1(c) + Ch(c, d, e) + 0x243185be + W[10];
> -       t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1 + t2;
> -       t1 = e + e1(b) + Ch(b, c, d) + 0x550c7dc3 + W[11];
> -       t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1 + t2;
> -       t1 = d + e1(a) + Ch(a, b, c) + 0x72be5d74 + W[12];
> -       t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1 + t2;
> -       t1 = c + e1(h) + Ch(h, a, b) + 0x80deb1fe + W[13];
> -       t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1 + t2;
> -       t1 = b + e1(g) + Ch(g, h, a) + 0x9bdc06a7 + W[14];
> -       t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1 + t2;
> -       t1 = a + e1(f) + Ch(f, g, h) + 0xc19bf174 + W[15];
> -       t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1 + t2;
> -
> -       t1 = h + e1(e) + Ch(e, f, g) + 0xe49b69c1 + W[16];
> -       t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1 + t2;
> -       t1 = g + e1(d) + Ch(d, e, f) + 0xefbe4786 + W[17];
> -       t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1 + t2;
> -       t1 = f + e1(c) + Ch(c, d, e) + 0x0fc19dc6 + W[18];
> -       t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1 + t2;
> -       t1 = e + e1(b) + Ch(b, c, d) + 0x240ca1cc + W[19];
> -       t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1 + t2;
> -       t1 = d + e1(a) + Ch(a, b, c) + 0x2de92c6f + W[20];
> -       t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1 + t2;
> -       t1 = c + e1(h) + Ch(h, a, b) + 0x4a7484aa + W[21];
> -       t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1 + t2;
> -       t1 = b + e1(g) + Ch(g, h, a) + 0x5cb0a9dc + W[22];
> -       t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1 + t2;
> -       t1 = a + e1(f) + Ch(f, g, h) + 0x76f988da + W[23];
> -       t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1 + t2;
> -
> -       t1 = h + e1(e) + Ch(e, f, g) + 0x983e5152 + W[24];
> -       t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1 + t2;
> -       t1 = g + e1(d) + Ch(d, e, f) + 0xa831c66d + W[25];
> -       t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1 + t2;
> -       t1 = f + e1(c) + Ch(c, d, e) + 0xb00327c8 + W[26];
> -       t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1 + t2;
> -       t1 = e + e1(b) + Ch(b, c, d) + 0xbf597fc7 + W[27];
> -       t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1 + t2;
> -       t1 = d + e1(a) + Ch(a, b, c) + 0xc6e00bf3 + W[28];
> -       t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1 + t2;
> -       t1 = c + e1(h) + Ch(h, a, b) + 0xd5a79147 + W[29];
> -       t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1 + t2;
> -       t1 = b + e1(g) + Ch(g, h, a) + 0x06ca6351 + W[30];
> -       t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1 + t2;
> -       t1 = a + e1(f) + Ch(f, g, h) + 0x14292967 + W[31];
> -       t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1 + t2;
> -
> -       t1 = h + e1(e) + Ch(e, f, g) + 0x27b70a85 + W[32];
> -       t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1 + t2;
> -       t1 = g + e1(d) + Ch(d, e, f) + 0x2e1b2138 + W[33];
> -       t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1 + t2;
> -       t1 = f + e1(c) + Ch(c, d, e) + 0x4d2c6dfc + W[34];
> -       t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1 + t2;
> -       t1 = e + e1(b) + Ch(b, c, d) + 0x53380d13 + W[35];
> -       t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1 + t2;
> -       t1 = d + e1(a) + Ch(a, b, c) + 0x650a7354 + W[36];
> -       t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1 + t2;
> -       t1 = c + e1(h) + Ch(h, a, b) + 0x766a0abb + W[37];
> -       t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1 + t2;
> -       t1 = b + e1(g) + Ch(g, h, a) + 0x81c2c92e + W[38];
> -       t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1 + t2;
> -       t1 = a + e1(f) + Ch(f, g, h) + 0x92722c85 + W[39];
> -       t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1 + t2;
> -
> -       t1 = h + e1(e) + Ch(e, f, g) + 0xa2bfe8a1 + W[40];
> -       t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1 + t2;
> -       t1 = g + e1(d) + Ch(d, e, f) + 0xa81a664b + W[41];
> -       t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1 + t2;
> -       t1 = f + e1(c) + Ch(c, d, e) + 0xc24b8b70 + W[42];
> -       t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1 + t2;
> -       t1 = e + e1(b) + Ch(b, c, d) + 0xc76c51a3 + W[43];
> -       t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1 + t2;
> -       t1 = d + e1(a) + Ch(a, b, c) + 0xd192e819 + W[44];
> -       t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1 + t2;
> -       t1 = c + e1(h) + Ch(h, a, b) + 0xd6990624 + W[45];
> -       t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1 + t2;
> -       t1 = b + e1(g) + Ch(g, h, a) + 0xf40e3585 + W[46];
> -       t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1 + t2;
> -       t1 = a + e1(f) + Ch(f, g, h) + 0x106aa070 + W[47];
> -       t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1 + t2;
> -
> -       t1 = h + e1(e) + Ch(e, f, g) + 0x19a4c116 + W[48];
> -       t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1 + t2;
> -       t1 = g + e1(d) + Ch(d, e, f) + 0x1e376c08 + W[49];
> -       t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1 + t2;
> -       t1 = f + e1(c) + Ch(c, d, e) + 0x2748774c + W[50];
> -       t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1 + t2;
> -       t1 = e + e1(b) + Ch(b, c, d) + 0x34b0bcb5 + W[51];
> -       t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1 + t2;
> -       t1 = d + e1(a) + Ch(a, b, c) + 0x391c0cb3 + W[52];
> -       t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1 + t2;
> -       t1 = c + e1(h) + Ch(h, a, b) + 0x4ed8aa4a + W[53];
> -       t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1 + t2;
> -       t1 = b + e1(g) + Ch(g, h, a) + 0x5b9cca4f + W[54];
> -       t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1 + t2;
> -       t1 = a + e1(f) + Ch(f, g, h) + 0x682e6ff3 + W[55];
> -       t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1 + t2;
> -
> -       t1 = h + e1(e) + Ch(e, f, g) + 0x748f82ee + W[56];
> -       t2 = e0(a) + Maj(a, b, c);    d += t1;    h = t1 + t2;
> -       t1 = g + e1(d) + Ch(d, e, f) + 0x78a5636f + W[57];
> -       t2 = e0(h) + Maj(h, a, b);    c += t1;    g = t1 + t2;
> -       t1 = f + e1(c) + Ch(c, d, e) + 0x84c87814 + W[58];
> -       t2 = e0(g) + Maj(g, h, a);    b += t1;    f = t1 + t2;
> -       t1 = e + e1(b) + Ch(b, c, d) + 0x8cc70208 + W[59];
> -       t2 = e0(f) + Maj(f, g, h);    a += t1;    e = t1 + t2;
> -       t1 = d + e1(a) + Ch(a, b, c) + 0x90befffa + W[60];
> -       t2 = e0(e) + Maj(e, f, g);    h += t1;    d = t1 + t2;
> -       t1 = c + e1(h) + Ch(h, a, b) + 0xa4506ceb + W[61];
> -       t2 = e0(d) + Maj(d, e, f);    g += t1;    c = t1 + t2;
> -       t1 = b + e1(g) + Ch(g, h, a) + 0xbef9a3f7 + W[62];
> -       t2 = e0(c) + Maj(c, d, e);    f += t1;    b = t1 + t2;
> -       t1 = a + e1(f) + Ch(f, g, h) + 0xc67178f2 + W[63];
> -       t2 = e0(b) + Maj(b, c, d);    e += t1;    a = t1 + t2;
> +       for (i = 0; i < 64; i += 8) {
> +               SHA256_ROUND(i + 0, a, b, c, d, e, f, g, h);
> +               SHA256_ROUND(i + 1, h, a, b, c, d, e, f, g);
> +               SHA256_ROUND(i + 2, g, h, a, b, c, d, e, f);
> +               SHA256_ROUND(i + 3, f, g, h, a, b, c, d, e);
> +               SHA256_ROUND(i + 4, e, f, g, h, a, b, c, d);
> +               SHA256_ROUND(i + 5, d, e, f, g, h, a, b, c);
> +               SHA256_ROUND(i + 6, c, d, e, f, g, h, a, b);
> +               SHA256_ROUND(i + 7, b, c, d, e, f, g, h, a);
> +       }
>
>         state[0] += a; state[1] += b; state[2] += c; state[3] += d;
>         state[4] += e; state[5] += f; state[6] += g; state[7] += h;
> --
> 2.26.2
>
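
The SHA256_ROUND() macro used above is defined earlier in the patch and
is not visible in this hunk. A minimal sketch consistent with the
unrolled rounds being removed, assuming the round constants live in a
SHA256_K[] array (the cover letter's "Reformat SHA256_K array"), would be:

/* Sketch only: assumes SHA256_K[] holds the constants spelled out inline above. */
#define SHA256_ROUND(i, a, b, c, d, e, f, g, h) do {		\
	u32 t1, t2;						\
	t1 = h + e1(e) + Ch(e, f, g) + SHA256_K[i] + W[i];	\
	t2 = e0(a) + Maj(a, b, c);				\
	d += t1;						\
	h = t1 + t2;						\
} while (0)

Note how the rotated argument lists in the loop reproduce the register
rotation that the fully unrolled version spelled out by hand.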

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 6/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops
  2020-10-25 14:31 ` [PATCH v4 6/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops Arvind Sankar
  2020-10-25 18:51   ` David Laight
@ 2020-10-26  8:02   ` Ard Biesheuvel
  1 sibling, 0 replies; 19+ messages in thread
From: Ard Biesheuvel @ 2020-10-26  8:02 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	David Laight, Linux Kernel Mailing List, Eric Biggers

On Sun, 25 Oct 2020 at 15:31, Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
> (tested on Broadwell Xeon) while not increasing code size too much.
>
> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
> Reviewed-by: Eric Biggers <ebiggers@google.com>

Acked-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  lib/crypto/sha256.c | 24 ++++++++++++++++++++----
>  1 file changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
> index e2e29d9b0ccd..cdef37c05972 100644
> --- a/lib/crypto/sha256.c
> +++ b/lib/crypto/sha256.c
> @@ -76,12 +76,28 @@ static void sha256_transform(u32 *state, const u8 *input, u32 *W)
>         int i;
>
>         /* load the input */
> -       for (i = 0; i < 16; i++)
> -               LOAD_OP(i, W, input);
> +       for (i = 0; i < 16; i += 8) {
> +               LOAD_OP(i + 0, W, input);
> +               LOAD_OP(i + 1, W, input);
> +               LOAD_OP(i + 2, W, input);
> +               LOAD_OP(i + 3, W, input);
> +               LOAD_OP(i + 4, W, input);
> +               LOAD_OP(i + 5, W, input);
> +               LOAD_OP(i + 6, W, input);
> +               LOAD_OP(i + 7, W, input);
> +       }
>
>         /* now blend */
> -       for (i = 16; i < 64; i++)
> -               BLEND_OP(i, W);
> +       for (i = 16; i < 64; i += 8) {
> +               BLEND_OP(i + 0, W);
> +               BLEND_OP(i + 1, W);
> +               BLEND_OP(i + 2, W);
> +               BLEND_OP(i + 3, W);
> +               BLEND_OP(i + 4, W);
> +               BLEND_OP(i + 5, W);
> +               BLEND_OP(i + 6, W);
> +               BLEND_OP(i + 7, W);
> +       }
>
>         /* load the state into our registers */
>         a = state[0];  b = state[1];  c = state[2];  d = state[3];
> --
> 2.26.2
>
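
LOAD_OP() and BLEND_OP() themselves sit outside this hunk. In
lib/crypto/sha256.c they amount to the big-endian message load and the
message-schedule expansion; roughly (a sketch, not taken from this diff):

/* Sketch of the helpers assumed by the loops above (not part of this hunk). */
#define LOAD_OP(I, W, input) \
	(W[I] = get_unaligned_be32((__u32 *)(input) + (I)))

#define BLEND_OP(I, W) \
	(W[I] = s1(W[(I) - 2]) + W[(I) - 7] + s0(W[(I) - 15]) + W[(I) - 16])

Each step is small, so unrolling eight of them per iteration keeps the
code-size growth modest, in line with the commit message above.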

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH v4 6/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops
  2020-10-25 23:53         ` Arvind Sankar
@ 2020-10-26 10:06           ` David Laight
  0 siblings, 0 replies; 19+ messages in thread
From: David Laight @ 2020-10-26 10:06 UTC (permalink / raw)
  To: 'Arvind Sankar'
  Cc: Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	linux-kernel, Eric Biggers

From: Arvind Sankar
> Sent: 25 October 2020 23:54
...
> > That's odd, the BLEND loop is about 20 instructions.
> > I wouldn't expect unrolling to help - unless you manage
> > to use 16 registers for the active W[] values.
> >
> 
> I am not sure about what's going on inside the hardware, but even with
> a straightforward assembly version that just reads out of memory the way
> the calculation is specified, unrolling the BLEND loop 8x speeds up the
> performance by 7-8%.
> 
> The compiler is actually pretty bad here: just translating everything
> into assembler with no attempt to optimize anything gets a 10-12%
> speedup over the C version.

I'm not seeing anything particularly stupid.
The loop body (excluding loop control) is 23 instructions.
It doubles to 46 if I unroll once.
Unrolling 4 times does save a couple of instructions per iteration.

The only horrid part of the code is the long dependency
chain at the end when the values get xor'ed together.
gcc is very bad at that: it converts (a + b) + (c + d)
to (((a + b) + c) + d), which takes an extra clock.
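
As a minimal sketch of that associativity point (hypothetical helpers,
not from the patch), using the kernel's u32:

/* Hypothetical illustration only. Unsigned addition is associative,
 * so the compiler is free to rewrite the paired form into the serial
 * one even though the pairing has the shorter dependency chain.
 */
static inline u32 sum_serial(u32 a, u32 b, u32 c, u32 d)
{
	return ((a + b) + c) + d;	/* three dependent additions */
}

static inline u32 sum_paired(u32 a, u32 b, u32 c, u32 d)
{
	return (a + b) + (c + d);	/* a + b and c + d can issue in parallel */
}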

Unrolling 4 times gives almost all the gain.
But it really shouldn't be needed at all.

	David


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 0/6] crypto: lib/sha256 - cleanup/optimization
  2020-10-25 14:31 [PATCH v4 0/6] crypto: lib/sha256 - cleanup/optimization Arvind Sankar
                   ` (5 preceding siblings ...)
  2020-10-25 14:31 ` [PATCH v4 6/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops Arvind Sankar
@ 2020-10-30  6:53 ` Herbert Xu
  6 siblings, 0 replies; 19+ messages in thread
From: Herbert Xu @ 2020-10-30  6:53 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: David S. Miller, linux-crypto, Eric Biggers, David Laight, linux-kernel

On Sun, Oct 25, 2020 at 10:31:13AM -0400, Arvind Sankar wrote:
> Patch 1/2 -- Use memzero_explicit() instead of structure assignment/plain
> memset() to clear sensitive state.
> 
> Patch 3 -- Currently the temporary variables used in the generic sha256
> implementation are cleared, but the clearing is optimized away due to
> lack of compiler barriers. Drop the clearing.
> 
> The last three patches are optimizations for generic sha256.
> 
> v4:
> - Split the first patch into two, the first one just does
>   lib/crypto/sha256.c, so that the second one can be applied or dropped
>   depending on the outcome of the discussion between Herbert/Eric.
> 
> v3:
> - Add some more files to patch 1
> - Reword commit message for patch 2
> - Reformat SHA256_K array
> - Drop v2 patch combining K and W arrays
> 
> v2:
> - Add patch to combine K and W arrays, suggested by David
> - Reformat SHA256_ROUND() macro a little
> 
> Arvind Sankar (6):
>   crypto: lib/sha256 - Use memzero_explicit() for clearing state
>   crypto: Use memzero_explicit() for clearing state
>   crypto: lib/sha256 - Don't clear temporary variables
>   crypto: lib/sha256 - Clear W[] in sha256_update() instead of
>     sha256_transform()
>   crypto: lib/sha256 - Unroll SHA256 loop 8 times intead of 64
>   crypto: lib/sha256 - Unroll LOAD and BLEND loops
> 
>  arch/arm64/crypto/ghash-ce-glue.c |   2 +-
>  arch/arm64/crypto/poly1305-glue.c |   2 +-
>  arch/arm64/crypto/sha3-ce-glue.c  |   2 +-
>  arch/x86/crypto/poly1305_glue.c   |   2 +-
>  include/crypto/sha1_base.h        |   3 +-
>  include/crypto/sha256_base.h      |   3 +-
>  include/crypto/sha512_base.h      |   3 +-
>  include/crypto/sm3_base.h         |   3 +-
>  lib/crypto/sha256.c               | 212 +++++++++---------------------
>  9 files changed, 76 insertions(+), 156 deletions(-)

All applied.  Thanks.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread

Thread overview: 19+ messages
2020-10-25 14:31 [PATCH v4 0/6] crypto: lib/sha256 - cleanup/optimization Arvind Sankar
2020-10-25 14:31 ` [PATCH v4 1/6] crypto: lib/sha256 - Use memzero_explicit() for clearing state Arvind Sankar
2020-10-26  7:59   ` Ard Biesheuvel
2020-10-25 14:31 ` [PATCH v4 2/6] crypto: " Arvind Sankar
2020-10-26  7:58   ` Ard Biesheuvel
2020-10-25 14:31 ` [PATCH v4 3/6] crypto: lib/sha256 - Don't clear temporary variables Arvind Sankar
2020-10-26  7:59   ` Ard Biesheuvel
2020-10-25 14:31 ` [PATCH v4 4/6] crypto: lib/sha256 - Clear W[] in sha256_update() instead of sha256_transform() Arvind Sankar
2020-10-26  8:00   ` Ard Biesheuvel
2020-10-25 14:31 ` [PATCH v4 5/6] crypto: lib/sha256 - Unroll SHA256 loop 8 times intead of 64 Arvind Sankar
2020-10-26  8:00   ` Ard Biesheuvel
2020-10-25 14:31 ` [PATCH v4 6/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops Arvind Sankar
2020-10-25 18:51   ` David Laight
2020-10-25 20:18     ` Arvind Sankar
2020-10-25 23:23       ` David Laight
2020-10-25 23:53         ` Arvind Sankar
2020-10-26 10:06           ` David Laight
2020-10-26  8:02   ` Ard Biesheuvel
2020-10-30  6:53 ` [PATCH v4 0/6] crypto: lib/sha256 - cleanup/optimization Herbert Xu
