Re: [PATCH v6 8/9] crypto: arm64/polyval: Add PMULL accelerated implementation of POLYVAL

From: Eric Biggers <ebiggers@kernel.org>
To: Nathan Huckleberry <nhuck@google.com>
Cc: linux-crypto@vger.kernel.org, linux-fscrypt@vger.kernel.org,
	Herbert Xu <herbert@gondor.apana.org.au>,
	"David S. Miller" <davem@davemloft.net>,
	linux-arm-kernel@lists.infradead.org,
	Paul Crowley <paulcrowley@google.com>,
	Sami Tolvanen <samitolvanen@google.com>,
	Ard Biesheuvel <ardb@kernel.org>
Subject: Re: [PATCH v6 8/9] crypto: arm64/polyval: Add PMULL accelerated implementation of POLYVAL
Date: Wed, 4 May 2022 22:56:49 -0700	[thread overview]
Message-ID: <YnNnIV0P9bFgTkQt@sol.localdomain> (raw)
In-Reply-To: <20220504001823.2483834-9-nhuck@google.com>

On Wed, May 04, 2022 at 12:18:22AM +0000, Nathan Huckleberry wrote:
> + * X = [X_1 : X_0]
> + * Y = [Y_1 : Y_0]
> + *
> + * The multiplication produces four parts:
> + *   LOW: The polynomial given by performing carryless multiplication of X_0 and
> + *   Y_0
> + *   MID: The polynomial given by performing carryless multiplication of (X_0 +
> + *   X_1) and (Y_0 + Y_1)
> + *   HIGH: The polynomial given by performing carryless multiplication of X_1
> + *   and Y_1
> + *
> + * We compute:
> + *  LO += LOW
> + *  MI += MID
> + *  HI += HIGH

Three parts, not four.  But why not write this as the much more concise:

 * Given:
 *	X = [X_1 : X_0]
 *	Y = [Y_1 : Y_0]
 *
 * We compute:
 *	LO += X_0 * Y_0
 *	MI += (X_0 + X_1) * (Y_0 + Y_1)
 *	HI += X_1 * Y_1

> + * So our final computation is: T = T_1 : T_0 = g*(x) * P_0 V = V_1 : V_0 =      
> + * g*(x) * (P_1 + T_0) p(x) / x^{128} mod g(x) = P_3 + P_1 + T_0 + V_1 : P_2 +   
> + * P_0 + T_1 + V_0                                                               

As on the x86 version, this part is now unreadable.  It was fine in v5.

> + *   [HI_1 : HI_0 + HI_1 + MI_1 + LO_1 : LO_1 + HI_0 + MI_0 + LO_0 : LO_0]
[...]
> + *   [HI_1 : HI_1 + HI_0 + MI_1 + LO_1 : HI_0 + MI_0 + LO_1 + LO_0 : LO_0]
[...]
> +	// TMP_V = T_1 : T_0 = P_0 * g*(x)
> +	pmull	TMP_V.1q, PL.1d, GSTAR.1d
[...]
> +	// TMP_V = V_1 : V_0 = (P_1 + T_0) * g*(x)
> +	pmull2	TMP_V.1q, GSTAR.2d, TMP_V.2d
> +	eor	DEST.16b, PH.16b, TMP_V.16b
[...]
> +	pmull	TMP_V.1q, GSTAR.1d, PL.1d
[...]
> +	pmull2	TMP_V.1q, GSTAR.2d, TMP_V.2d
[...]
> +	eor	SUM.16b, TMP_V.16b, PH.16b

It looks like you didn't fully address my comments on v5 about putting operands
in a consistent order.  Not a big deal, but assembly code is always hard to
read, and anything to make it easier would be greatly appreciated.

> +/*
> + * Handle any extra blocks afer full_stride loop.
> + */

Typo above.

> diff --git a/arch/arm64/crypto/polyval-ce-glue.c b/arch/arm64/crypto/polyval-ce-glue.c
[...]
> +struct polyval_tfm_ctx {
> +	u8 key_powers[NUM_KEY_POWERS][POLYVAL_BLOCK_SIZE];
> +};

This is missing the comment about the order of the key powers that I had
suggested for readability.  It made it into the x86 version but not here.  This
file is very similar to arch/x86/crypto/polyval-clmulni_glue.c, so if you could
diff them and eliminate any unintended differences, that would be helpful.

Other than the above readability suggestions this patch looks good, nice job.

- Eric