From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SIGNED_OFF_BY,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id D1EEFC67863 for ; Sat, 20 Oct 2018 15:00:08 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 959E921566 for ; Sat, 20 Oct 2018 15:00:08 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=linaro.org header.i=@linaro.org header.b="ff/qLj9p" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 959E921566 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linaro.org Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727518AbeJTXKx (ORCPT ); Sat, 20 Oct 2018 19:10:53 -0400 Received: from mail-it1-f195.google.com ([209.85.166.195]:51403 "EHLO mail-it1-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727413AbeJTXKx (ORCPT ); Sat, 20 Oct 2018 19:10:53 -0400 Received: by mail-it1-f195.google.com with SMTP id 74-v6so7467309itw.1 for ; Sat, 20 Oct 2018 08:00:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=DPV3GfcDQeGLJmx80Ehi4cTMV5YxhaOHhrX/OYXb34w=; b=ff/qLj9p0coDG4/4+SaWaGcTTgJviD/v5F668DAQXYGa1H1CKHfJnpR3EDgP/xQ3sc BoLrS+NPM4RQpmTSD0HSCkYrIYRkJP4piUQU0cW/Ul2JFcEufcnOdv1w3QQuInqOAr78 Bq0YA8VRZ+Pq8XRlooXXRp0P51KqEVn0obIy0= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=DPV3GfcDQeGLJmx80Ehi4cTMV5YxhaOHhrX/OYXb34w=; b=nbzNEOiRiqLzCwa+dyk1WFo7bYdleSirMGvssduWKyT60R/e2a1FLhWiVNWoqndOyg TKy3kQ/3vEEoksKK8rk2fmbDqhZbnawD96aIDvgHWSenAGM2vhEDupTqHuW9qdW96jdj Y3vgkejCE98JdYGh5n2XjBJKVv5z2KcqFncMmG58xwZbtqj/Jt4VVEkIGtcHB7gF9XT9 KV9EMByR+jspXWgvASPEO3ERRlU1zLhAGSkywwTa8FSS2qslJ9eftI4qhn1g3bjRoyAa Qc7N9WIXNYYD8ZvCOWBtRuVUWnxJ4seJ/qa7sr+OuALEuGBRoGK7ECdEmsM+kDzkkbSx hbYg== X-Gm-Message-State: ABuFfog+VM/Vii+FxlLA1w3oMSeuEfgZcPSzy79vWr4DZp8ft8u0dmql UvbuSj0hjkZ3ajfqutNsgkqfXC1fEcacm6pd6H9K5g== X-Google-Smtp-Source: ACcGV60yvaXnYw72C/1dz4jBaFK+qGXgw4rLE0vGuKPg/NkZfwza5Vs0bqvzgqsgLNF6CIgNs6PpHQi8xQ31MKZk3FA= X-Received: by 2002:a05:660c:383:: with SMTP id x3mr5235345itj.121.1540047605417; Sat, 20 Oct 2018 08:00:05 -0700 (PDT) MIME-Version: 1.0 Received: by 2002:a6b:5910:0:0:0:0:0 with HTTP; Sat, 20 Oct 2018 08:00:04 -0700 (PDT) In-Reply-To: <20181020055136.GD876@sol.localdomain> References: <20181015175424.97147-1-ebiggers@kernel.org> <20181015175424.97147-11-ebiggers@kernel.org> <20181020055136.GD876@sol.localdomain> From: Ard Biesheuvel Date: Sat, 20 Oct 2018 23:00:04 +0800 Message-ID: Subject: Re: [RFC PATCH v2 10/12] crypto: arm/nhpoly1305 - add NEON-accelerated NHPoly1305 To: Eric Biggers Cc: "open list:HARDWARE RANDOM NUMBER GENERATOR CORE" , linux-fscrypt@vger.kernel.org, linux-arm-kernel , Linux Kernel Mailing List , Herbert Xu , Paul Crowley , Greg Kaiser , Michael Halcrow , "Jason A . Donenfeld" , Samuel Neves , Tomer Ashur Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 20 October 2018 at 13:51, Eric Biggers wrote: > On Sat, Oct 20, 2018 at 12:12:56PM +0800, Ard Biesheuvel wrote: >> On 16 October 2018 at 01:54, Eric Biggers wrote: >> > From: Eric Biggers >> > >> > Add an ARM NEON implementation of NHPoly1305, an =CE=B5-almost-=E2=88= =86-universal >> > hash function used in the Adiantum encryption mode. For now, only the >> > NH portion is actually NEON-accelerated; the Poly1305 part is less >> > performance-critical so is just implemented in C. >> > >> > Signed-off-by: Eric Biggers >> > --- >> > arch/arm/crypto/Kconfig | 5 ++ >> > arch/arm/crypto/Makefile | 2 + >> > arch/arm/crypto/nh-neon-core.S | 116 ++++++++++++++++++++++++= + >> > arch/arm/crypto/nhpoly1305-neon-glue.c | 78 +++++++++++++++++ >> > 4 files changed, 201 insertions(+) >> > create mode 100644 arch/arm/crypto/nh-neon-core.S >> > create mode 100644 arch/arm/crypto/nhpoly1305-neon-glue.c >> > >> > diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig >> > index cc932d9bba561..458562a34aabe 100644 >> > --- a/arch/arm/crypto/Kconfig >> > +++ b/arch/arm/crypto/Kconfig >> > @@ -122,4 +122,9 @@ config CRYPTO_CHACHA20_NEON >> > select CRYPTO_BLKCIPHER >> > select CRYPTO_CHACHA20 >> > >> > +config CRYPTO_NHPOLY1305_NEON >> > + tristate "NEON accelerated NHPoly1305 hash function (for Adian= tum)" >> > + depends on KERNEL_MODE_NEON >> > + select CRYPTO_NHPOLY1305 >> > + >> > endif >> > diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile >> > index 005482ff95047..b65d6bfab8e6b 100644 >> > --- a/arch/arm/crypto/Makefile >> > +++ b/arch/arm/crypto/Makefile >> > @@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) +=3D sha1-arm-neo= n.o >> > obj-$(CONFIG_CRYPTO_SHA256_ARM) +=3D sha256-arm.o >> > obj-$(CONFIG_CRYPTO_SHA512_ARM) +=3D sha512-arm.o >> > obj-$(CONFIG_CRYPTO_CHACHA20_NEON) +=3D chacha-neon.o >> > +obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) +=3D nhpoly1305-neon.o >> > >> > ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) +=3D aes-arm-ce.o >> > ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) +=3D sha1-arm-ce.o >> > @@ -53,6 +54,7 @@ ghash-arm-ce-y :=3D ghash-ce-core.o ghash-ce-= glue.o >> > crct10dif-arm-ce-y :=3D crct10dif-ce-core.o crct10dif-ce-glue.o >> > crc32-arm-ce-y:=3D crc32-ce-core.o crc32-ce-glue.o >> > chacha-neon-y :=3D chacha-neon-core.o chacha-neon-glue.o >> > +nhpoly1305-neon-y :=3D nh-neon-core.o nhpoly1305-neon-glue.o >> > >> > ifdef REGENERATE_ARM_CRYPTO >> > quiet_cmd_perl =3D PERL $@ >> > diff --git a/arch/arm/crypto/nh-neon-core.S b/arch/arm/crypto/nh-neon-= core.S >> > new file mode 100644 >> > index 0000000000000..434d80ab531c2 >> > --- /dev/null >> > +++ b/arch/arm/crypto/nh-neon-core.S >> > @@ -0,0 +1,116 @@ >> > +/* SPDX-License-Identifier: GPL-2.0 */ >> > +/* >> > + * NH - =CE=B5-almost-universal hash function, NEON accelerated versi= on >> > + * >> > + * Copyright 2018 Google LLC >> > + * >> > + * Author: Eric Biggers >> > + */ >> > + >> > +#include >> > + >> > + .text >> > + .fpu neon >> > + >> > + KEY .req r0 >> > + MESSAGE .req r1 >> > + MESSAGE_LEN .req r2 >> > + HASH .req r3 >> > + >> > + PASS0_SUMS .req q0 >> > + PASS0_SUM_A .req d0 >> > + PASS0_SUM_B .req d1 >> > + PASS1_SUMS .req q1 >> > + PASS1_SUM_A .req d2 >> > + PASS1_SUM_B .req d3 >> > + PASS2_SUMS .req q2 >> > + PASS2_SUM_A .req d4 >> > + PASS2_SUM_B .req d5 >> > + PASS3_SUMS .req q3 >> > + PASS3_SUM_A .req d6 >> > + PASS3_SUM_B .req d7 >> > + K0 .req q4 >> > + K1 .req q5 >> > + K2 .req q6 >> > + K3 .req q7 >> > + T0 .req q8 >> > + T0_L .req d16 >> > + T0_H .req d17 >> > + T1 .req q9 >> > + T1_L .req d18 >> > + T1_H .req d19 >> > + T2 .req q10 >> > + T2_L .req d20 >> > + T2_H .req d21 >> > + T3 .req q11 >> > + T3_L .req d22 >> > + T3_H .req d23 >> > + >> > +.macro _nh_stride k0, k1, k2, k3 >> > + >> > + // Load next message stride >> > + vld1.8 {T3}, [MESSAGE]! >> > + >> > + // Load next key stride >> > + vld1.32 {\k3}, [KEY]! >> > + >> > + // Add message words to key words >> > + vadd.u32 T0, T3, \k0 >> > + vadd.u32 T1, T3, \k1 >> > + vadd.u32 T2, T3, \k2 >> > + vadd.u32 T3, T3, \k3 >> > + >> > + // Multiply 32x32 =3D> 64 and accumulate >> > + vmlal.u32 PASS0_SUMS, T0_L, T0_H >> > + vmlal.u32 PASS1_SUMS, T1_L, T1_H >> > + vmlal.u32 PASS2_SUMS, T2_L, T2_H >> > + vmlal.u32 PASS3_SUMS, T3_L, T3_H >> > +.endm >> > + >> >> Since we seem to have some spare NEON registers: would it help to have >> a double round version of this macro? >> > > It helps a little bit, but not much. The loads are the only part that ca= n be > optimized further. I think I'd rather have the shorter + simpler version= , > unless the loads can be optimized significantly more on other processors. > > Also, originally I had it loading the key and message for the next stride= while > doing the current one, but that didn't seem worthwhile either. > Fair enough. >> > +/* >> > + * void nh_neon(const u32 *key, const u8 *message, size_t message_len= , >> > + * u8 hash[NH_HASH_BYTES]) >> > + * >> > + * It's guaranteed that message_len % 16 =3D=3D 0. >> > + */ >> > +ENTRY(nh_neon) >> > + >> > + vld1.32 {K0,K1}, [KEY]! >> > + vmov.u64 PASS0_SUMS, #0 >> > + vmov.u64 PASS1_SUMS, #0 >> > + vld1.32 {K2}, [KEY]! >> > + vmov.u64 PASS2_SUMS, #0 >> > + vmov.u64 PASS3_SUMS, #0 >> > + >> > + subs MESSAGE_LEN, MESSAGE_LEN, #64 >> > + blt .Lloop4_done >> > +.Lloop4: >> > + _nh_stride K0, K1, K2, K3 >> > + _nh_stride K1, K2, K3, K0 >> > + _nh_stride K2, K3, K0, K1 >> > + _nh_stride K3, K0, K1, K2 >> > + subs MESSAGE_LEN, MESSAGE_LEN, #64 >> > + bge .Lloop4 >> > + >> > +.Lloop4_done: >> > + ands MESSAGE_LEN, MESSAGE_LEN, #63 >> > + beq .Ldone >> > + _nh_stride K0, K1, K2, K3 >> > + >> > + subs MESSAGE_LEN, MESSAGE_LEN, #16 >> > + beq .Ldone >> > + _nh_stride K1, K2, K3, K0 >> > + >> > + subs MESSAGE_LEN, MESSAGE_LEN, #16 >> > + beq .Ldone >> > + _nh_stride K2, K3, K0, K1 >> > + >> > +.Ldone: >> > + // Sum the accumulators for each pass, then store the sums to = 'hash' >> > + vadd.u64 T0_L, PASS0_SUM_A, PASS0_SUM_B >> > + vadd.u64 T0_H, PASS1_SUM_A, PASS1_SUM_B >> > + vadd.u64 T1_L, PASS2_SUM_A, PASS2_SUM_B >> > + vadd.u64 T1_H, PASS3_SUM_A, PASS3_SUM_B >> > + vst1.8 {T0-T1}, [HASH] >> > + bx lr >> > +ENDPROC(nh_neon) >> > diff --git a/arch/arm/crypto/nhpoly1305-neon-glue.c b/arch/arm/crypto/= nhpoly1305-neon-glue.c >> > new file mode 100644 >> > index 0000000000000..df48a00f4c50f >> > --- /dev/null >> > +++ b/arch/arm/crypto/nhpoly1305-neon-glue.c >> > @@ -0,0 +1,78 @@ >> > +// SPDX-License-Identifier: GPL-2.0 >> > +/* >> > + * NHPoly1305 - =CE=B5-almost-=E2=88=86-universal hash function for A= diantum >> > + * (NEON accelerated version) >> > + * >> > + * Copyright 2018 Google LLC >> > + */ >> > + >> > +#include >> > +#include >> > +#include >> > +#include >> > +#include >> > + >> > +asmlinkage void nh_neon(const u32 *key, const u8 *message, size_t mes= sage_len, >> > + u8 hash[NH_HASH_BYTES]); >> > + >> > +static void _nh_neon(const u32 *key, const u8 *message, size_t messag= e_len, >> > + __le64 hash[NH_NUM_PASSES]) >> > +{ >> > + nh_neon(key, message, message_len, (u8 *)hash); >> > +} >> > + >> >> Why do we need this function? >> > > For now it's not needed so I should probably just remove it, but it seems= likely > that indirect calls to assembly functions in the kernel will be going awa= y in > order to add support for CFI (control flow integrity). The android-4.9 a= nd > android-4.14 kernels support CFI on arm64, so you might notice that some = of the > arm64 crypto code had to be patched for this reason. > I didn't actually look at those kernel trees so I hadn't noticed yet. In any case, I'd suggest that we just keep this wrapper then, but please add a comment describing why it's there.