All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] crypto: gf128mul - define gf128mul_x_ble in gf128mul.h
@ 2017-03-30 19:25 Ondrej Mosnacek
  2017-03-30 19:55 ` Eric Biggers
  0 siblings, 1 reply; 5+ messages in thread
From: Ondrej Mosnacek @ 2017-03-30 19:25 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David S. Miller, linux-crypto, Milan Broz, Ondrej Mosnacek

The gf128mul_x_ble function is currently defined in gf128mul.c, because
it depends on the gf128mul_table_be multiplication table.

However, since the function is very small and only uses two values from
the table, it is better for it to be defined as inline function in
gf128mul.h. That way, the function can be inlined by the compiler for
better performance.

After this change, the speed of the generic 'xts(aes)' implementation
increased from ~225 MiB/s to ~235 MiB/s (measured using 'cryptsetup
benchmark' on an Intel system with CRYPTO_AES_X86_64 and
CRYPTO_AES_NI_INTEL disabled).

Signed-off-by: Ondrej Mosnacek <omosnacek@gmail.com>
---
 crypto/gf128mul.c         | 11 -----------
 include/crypto/gf128mul.h | 15 +++++++++++++--
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/crypto/gf128mul.c b/crypto/gf128mul.c
index 04facc0..2eab1a1 100644
--- a/crypto/gf128mul.c
+++ b/crypto/gf128mul.c
@@ -156,17 +156,6 @@ static void gf128mul_x_bbe(be128 *r, const be128 *x)
 	r->b = cpu_to_be64((b << 1) ^ _tt);
 }
 
-void gf128mul_x_ble(be128 *r, const be128 *x)
-{
-	u64 a = le64_to_cpu(x->a);
-	u64 b = le64_to_cpu(x->b);
-	u64 _tt = gf128mul_table_be[b >> 63];
-
-	r->a = cpu_to_le64((a << 1) ^ _tt);
-	r->b = cpu_to_le64((b << 1) | (a >> 63));
-}
-EXPORT_SYMBOL(gf128mul_x_ble);
-
 static void gf128mul_x8_lle(be128 *x)
 {
 	u64 a = be64_to_cpu(x->a);
diff --git a/include/crypto/gf128mul.h b/include/crypto/gf128mul.h
index 0bc9b5f..46a01a2 100644
--- a/include/crypto/gf128mul.h
+++ b/include/crypto/gf128mul.h
@@ -49,6 +49,7 @@
 #ifndef _CRYPTO_GF128MUL_H
 #define _CRYPTO_GF128MUL_H
 
+#include <asm/byteorder.h>
 #include <crypto/b128ops.h>
 #include <linux/slab.h>
 
@@ -163,8 +164,18 @@ void gf128mul_lle(be128 *a, const be128 *b);
 
 void gf128mul_bbe(be128 *a, const be128 *b);
 
-/* multiply by x in ble format, needed by XTS */
-void gf128mul_x_ble(be128 *a, const be128 *b);
+/* Multiply by x in ble format, needed by XTS.
+ * Defined here for performance. */
+static inline void gf128mul_x_ble(be128 *r, const be128 *x)
+{
+	u64 a = le64_to_cpu(x->a);
+	u64 b = le64_to_cpu(x->b);
+	/* equivalent to gf128mul_table_be[b >> 63] (see crypto/gf128mul.c): */
+	u64 _tt = (b & ((u64)1 << 63)) ? 0x87 : 0x00;
+
+	r->a = cpu_to_le64((a << 1) ^ _tt);
+	r->b = cpu_to_le64((b << 1) | (a >> 63));
+}
 
 /* 4k table optimization */
 
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] crypto: gf128mul - define gf128mul_x_ble in gf128mul.h
  2017-03-30 19:25 [PATCH] crypto: gf128mul - define gf128mul_x_ble in gf128mul.h Ondrej Mosnacek
@ 2017-03-30 19:55 ` Eric Biggers
  2017-03-30 21:32   ` Ondrej Mosnáček
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Biggers @ 2017-03-30 19:55 UTC (permalink / raw)
  To: Ondrej Mosnacek; +Cc: Herbert Xu, David S. Miller, linux-crypto, Milan Broz

Hi Ondrej,

On Thu, Mar 30, 2017 at 09:25:35PM +0200, Ondrej Mosnacek wrote:
> The gf128mul_x_ble function is currently defined in gf128mul.c, because
> it depends on the gf128mul_table_be multiplication table.
> 
> However, since the function is very small and only uses two values from
> the table, it is better for it to be defined as inline function in
> gf128mul.h. That way, the function can be inlined by the compiler for
> better performance.
> 
> After this change, the speed of the generic 'xts(aes)' implementation
> increased from ~225 MiB/s to ~235 MiB/s (measured using 'cryptsetup
> benchmark' on an Intel system with CRYPTO_AES_X86_64 and
> CRYPTO_AES_NI_INTEL disabled).
> 
> Signed-off-by: Ondrej Mosnacek <omosnacek@gmail.com>
...
>  
> -/* multiply by x in ble format, needed by XTS */
> -void gf128mul_x_ble(be128 *a, const be128 *b);
> +/* Multiply by x in ble format, needed by XTS.
> + * Defined here for performance. */
> +static inline void gf128mul_x_ble(be128 *r, const be128 *x)
> +{
> +	u64 a = le64_to_cpu(x->a);
> +	u64 b = le64_to_cpu(x->b);
> +	/* equivalent to gf128mul_table_be[b >> 63] (see crypto/gf128mul.c): */
> +	u64 _tt = (b & ((u64)1 << 63)) ? 0x87 : 0x00;
> +
> +	r->a = cpu_to_le64((a << 1) ^ _tt);
> +	r->b = cpu_to_le64((b << 1) | (a >> 63));
> +}
>  
>  /* 4k table optimization */
>  
> -- 
> 2.9.3
> 

This is an improvement; I'm just thinking that maybe this should be done for all
the gf128mul_x_*() functions, if only so that they use a consistent style and
are all defined next to each other.

Also note that '(b & ((u64)1 << 63)) ? 0x87 : 0x00;' is actually getting
compiled as '((s64)b >> 63) & 0x87', which is branchless and therefore makes the
new version more efficient than one might expect:

	sar    $0x3f,%rax
	and    $0x87,%eax

It could even be written the branchless way explicitly, but it shouldn't matter.

- Eric

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] crypto: gf128mul - define gf128mul_x_ble in gf128mul.h
  2017-03-30 19:55 ` Eric Biggers
@ 2017-03-30 21:32   ` Ondrej Mosnáček
  2017-03-31  6:05     ` Jeffrey Walton
  0 siblings, 1 reply; 5+ messages in thread
From: Ondrej Mosnáček @ 2017-03-30 21:32 UTC (permalink / raw)
  To: Eric Biggers; +Cc: Herbert Xu, David S. Miller, linux-crypto, Milan Broz

Hi Eric,

2017-03-30 21:55 GMT+02:00 Eric Biggers <ebiggers3@gmail.com>:
> This is an improvement; I'm just thinking that maybe this should be done for all
> the gf128mul_x_*() functions, if only so that they use a consistent style and
> are all defined next to each other.

Right, that doesn't seem to be a bad idea... I was confused for a
while by the '& 0xff' in the _lle one, but now I see it also uses just
two values of the table, so it can be re-written in a similar way. In
fact, the OCB mode from RFC 7253 (that I'm currently trying to port to
kernel crypto API) uses gf128mul_x_bbe, so it would be useful to have
that one accessible, too.

I will move them all in v2, then.

> Also note that '(b & ((u64)1 << 63)) ? 0x87 : 0x00;' is actually getting
> compiled as '((s64)b >> 63) & 0x87', which is branchless and therefore makes the
> new version more efficient than one might expect:
>
>         sar    $0x3f,%rax
>         and    $0x87,%eax
>
> It could even be written the branchless way explicitly, but it shouldn't matter.

I think the definition using unsigned operations is more intuitive...
Let's just leave the clever tricks up to the compiler :)

Thanks,
O.M.

>
> - Eric

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] crypto: gf128mul - define gf128mul_x_ble in gf128mul.h
  2017-03-30 21:32   ` Ondrej Mosnáček
@ 2017-03-31  6:05     ` Jeffrey Walton
  2017-03-31  9:17       ` Ondrej Mosnáček
  0 siblings, 1 reply; 5+ messages in thread
From: Jeffrey Walton @ 2017-03-31  6:05 UTC (permalink / raw)
  To: Ondrej Mosnáček; +Cc: linux-crypto

>> Also note that '(b & ((u64)1 << 63)) ? 0x87 : 0x00;' is actually getting
>> compiled as '((s64)b >> 63) & 0x87', which is branchless and therefore makes the
>> new version more efficient than one might expect:
>>
>>         sar    $0x3f,%rax
>>         and    $0x87,%eax
>>
>> It could even be written the branchless way explicitly, but it shouldn't matter.
>
> I think the definition using unsigned operations is more intuitive...
> Let's just leave the clever tricks up to the compiler :)

It may be a good idea to use the one that provides constant time-ness
to help avoid leaking information.

Jeff

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] crypto: gf128mul - define gf128mul_x_ble in gf128mul.h
  2017-03-31  6:05     ` Jeffrey Walton
@ 2017-03-31  9:17       ` Ondrej Mosnáček
  0 siblings, 0 replies; 5+ messages in thread
From: Ondrej Mosnáček @ 2017-03-31  9:17 UTC (permalink / raw)
  To: noloader; +Cc: linux-crypto

Hi Jeff,

2017-03-31 8:05 GMT+02:00 Jeffrey Walton <noloader@gmail.com>:
>>> Also note that '(b & ((u64)1 << 63)) ? 0x87 : 0x00;' is actually getting
>>> compiled as '((s64)b >> 63) & 0x87', which is branchless and therefore makes the
>>> new version more efficient than one might expect:
>>>
>>>         sar    $0x3f,%rax
>>>         and    $0x87,%eax
>>>
>>> It could even be written the branchless way explicitly, but it shouldn't matter.
>>
>> I think the definition using unsigned operations is more intuitive...
>> Let's just leave the clever tricks up to the compiler :)
>
> It may be a good idea to use the one that provides constant time-ness
> to help avoid leaking information.

That's a good point... I played around with various ways to write the
expression in Compiler Explorer [1] and indeed GCC fails to produce
constant-time code from my version on some architectures (e.g. the
32-bit ARM). The version with an explicit arithmetic right shift seems
to produce the most efficient code across platforms, so I'll rewrite
it like that for v3.

Thanks,
O.M.

[1] https://gcc.godbolt.org/

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2017-03-31  9:17 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-30 19:25 [PATCH] crypto: gf128mul - define gf128mul_x_ble in gf128mul.h Ondrej Mosnacek
2017-03-30 19:55 ` Eric Biggers
2017-03-30 21:32   ` Ondrej Mosnáček
2017-03-31  6:05     ` Jeffrey Walton
2017-03-31  9:17       ` Ondrej Mosnáček

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.