* [PATCH v2 0/3] raid5, raid6: Accelerate RAID math with LoongArch SIMD
@ 2023-08-03 17:08 WANG Xuerui
  2023-08-03 17:08 ` [PATCH v2 1/3] LoongArch: Add SIMD-optimized XOR routines WANG Xuerui
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: WANG Xuerui @ 2023-08-03 17:08 UTC (permalink / raw)
  To: Song Liu; +Cc: Huacai Chen, linux-raid, loongarch, linux-kernel, WANG Xuerui

From: WANG Xuerui <git@xen0n.name>

Hi,

Seeing that the LoongArch port has recently (finally!) gained the ability
to use the vector units, I've ported the RAID5/6 math to LSX and LASX
(LoongArch's 128-bit and 256-bit SIMD extensions, respectively), with
nice speedups observed. They are reasonably straightforward conversions
of existing code, and I hope the comments I put in there are helpful
enough for anyone not familiar with LoongArch assembly to get a rough
picture of how things work here. Performance numbers are included in
each commit's message.

This series needs [1] ("LoongArch: Allow usage of LSX/LASX in the
kernel") as a prerequisite; otherwise the vector context would likely get
corrupted by the vector-unaware kernel_fpu_{begin,end} calls. I tested
the changes on top of next-20230731 with the raid6test build fixes [2]
applied, but the series should apply cleanly to v6.5-rc4 (or maybe any
other tag) too; it does not depend on the raid6test fixes.
The base-commit and prerequisite-patch-id info below allows minimally
recreating a working configuration for both the kernel and the
raid6test tool.

[1]: https://lore.kernel.org/loongarch/20230722072201.2677516-1-chenhuacai@loongson.cn/
[2]: https://lore.kernel.org/linux-raid/20230731104911.411964-1-kernel@xen0n.name/

Changes in v2 (no functional change):

- minor commit message tweaks
- changed comment style from // to /* */ throughout

WANG Xuerui (3):
  LoongArch: Add SIMD-optimized XOR routines
  raid6: Add LoongArch SIMD syndrome calculation
  raid6: Add LoongArch SIMD recovery implementation

 arch/loongarch/include/asm/xor.h      |  68 ++++
 arch/loongarch/include/asm/xor_simd.h |  42 +++
 arch/loongarch/lib/Makefile           |   3 +
 arch/loongarch/lib/xor_simd.c         |  93 +++++
 arch/loongarch/lib/xor_simd.h         |  46 +++
 arch/loongarch/lib/xor_simd_glue.c    |  71 ++++
 arch/loongarch/lib/xor_template.c     | 110 ++++++
 include/linux/raid/pq.h               |   4 +
 lib/raid6/Makefile                    |   1 +
 lib/raid6/algos.c                     |  16 +
 lib/raid6/loongarch.h                 |  38 ++
 lib/raid6/loongarch_simd.c            | 422 +++++++++++++++++++++
 lib/raid6/recov_loongarch_simd.c      | 515 ++++++++++++++++++++++++++
 lib/raid6/test/Makefile               |  12 +
 14 files changed, 1441 insertions(+)
 create mode 100644 arch/loongarch/include/asm/xor.h
 create mode 100644 arch/loongarch/include/asm/xor_simd.h
 create mode 100644 arch/loongarch/lib/xor_simd.c
 create mode 100644 arch/loongarch/lib/xor_simd.h
 create mode 100644 arch/loongarch/lib/xor_simd_glue.c
 create mode 100644 arch/loongarch/lib/xor_template.c
 create mode 100644 lib/raid6/loongarch.h
 create mode 100644 lib/raid6/loongarch_simd.c
 create mode 100644 lib/raid6/recov_loongarch_simd.c


base-commit: 5d0c230f1de8c7515b6567d9afba1f196fb4e2f4
prerequisite-patch-id: 85d08a9828893250ae78dbca9d6e6f8dac755f61
prerequisite-patch-id: fe0bba41e0bbc676454365ed16fb13fc0aac6ee0
prerequisite-patch-id: 84ef8212b74e696ce019255bbfd9679d7516f7f7
prerequisite-patch-id: b1f8fc4e4acdaff7f821a9fcbd063475178e037b
prerequisite-patch-id: 82aacbf27f249fdefe40dd6bcc712e5795256926
prerequisite-patch-id: ae4e026e18f92ffcc93f6b135a3bd48fbdded39a
-- 
2.40.0



* [PATCH v2 1/3] LoongArch: Add SIMD-optimized XOR routines
  2023-08-03 17:08 [PATCH v2 0/3] raid5, raid6: Accelerate RAID math with LoongArch SIMD WANG Xuerui
@ 2023-08-03 17:08 ` WANG Xuerui
  2023-08-03 17:08 ` [PATCH v2 2/3] raid6: Add LoongArch SIMD syndrome calculation WANG Xuerui
  2023-08-03 17:08 ` [PATCH v2 3/3] raid6: Add LoongArch SIMD recovery implementation WANG Xuerui
  2 siblings, 0 replies; 6+ messages in thread
From: WANG Xuerui @ 2023-08-03 17:08 UTC (permalink / raw)
  To: Song Liu; +Cc: Huacai Chen, linux-raid, loongarch, linux-kernel, WANG Xuerui

From: WANG Xuerui <git@xen0n.name>

Add LSX and LASX implementations of xor operations, operating on 64
bytes (one L1 cache line) at a time, for a balance between memory
utilization and instruction mix. Huacai confirmed that all future
LoongArch implementations by Loongson (that we care about) will likely also
feature 64-byte cache lines, and experiments show no throughput
improvement with further unrolling.
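
As a rough illustration (not part of the patch), each xor_lsx_N() /
xor_lasx_N() call is equivalent to the following scalar loop, just with
each 64-byte line kept entirely in vector registers; the function name
here is made up for exposition:

	/* Illustrative scalar equivalent of xor_lsx_2()/xor_lasx_2(). */
	static void xor_scalar_2(unsigned long bytes,
				 unsigned long *p1, const unsigned long *p2)
	{
		unsigned long i, n = bytes / sizeof(unsigned long);

		for (i = 0; i < n; i++)
			p1[i] ^= p2[i];	/* 8 iterations per 64-byte line */
	}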

Performance numbers measured during system boot on a 3A5000 @ 2.5GHz:

> 8regs           : 12702 MB/sec
> 8regs_prefetch  : 10920 MB/sec
> 32regs          : 12686 MB/sec
> 32regs_prefetch : 10918 MB/sec
> lsx             : 17589 MB/sec
> lasx            : 26116 MB/sec

Signed-off-by: WANG Xuerui <git@xen0n.name>
---
 arch/loongarch/include/asm/xor.h      |  68 ++++++++++++++++
 arch/loongarch/include/asm/xor_simd.h |  42 ++++++++++
 arch/loongarch/lib/Makefile           |   3 +
 arch/loongarch/lib/xor_simd.c         |  93 ++++++++++++++++++++++
 arch/loongarch/lib/xor_simd.h         |  46 +++++++++++
 arch/loongarch/lib/xor_simd_glue.c    |  71 +++++++++++++++++
 arch/loongarch/lib/xor_template.c     | 110 ++++++++++++++++++++++++++
 7 files changed, 433 insertions(+)
 create mode 100644 arch/loongarch/include/asm/xor.h
 create mode 100644 arch/loongarch/include/asm/xor_simd.h
 create mode 100644 arch/loongarch/lib/xor_simd.c
 create mode 100644 arch/loongarch/lib/xor_simd.h
 create mode 100644 arch/loongarch/lib/xor_simd_glue.c
 create mode 100644 arch/loongarch/lib/xor_template.c

diff --git a/arch/loongarch/include/asm/xor.h b/arch/loongarch/include/asm/xor.h
new file mode 100644
index 0000000000000..12467fffee468
--- /dev/null
+++ b/arch/loongarch/include/asm/xor.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2023 WANG Xuerui <git@xen0n.name>
+ */
+#ifndef _ASM_LOONGARCH_XOR_H
+#define _ASM_LOONGARCH_XOR_H
+
+#include <asm/cpu-features.h>
+#include <asm/xor_simd.h>
+
+#ifdef CONFIG_CPU_HAS_LSX
+static struct xor_block_template xor_block_lsx = {
+	.name = "lsx",
+	.do_2 = xor_lsx_2,
+	.do_3 = xor_lsx_3,
+	.do_4 = xor_lsx_4,
+	.do_5 = xor_lsx_5,
+};
+
+#define XOR_SPEED_LSX()					\
+	do {						\
+		if (cpu_has_lsx)			\
+			xor_speed(&xor_block_lsx);	\
+	} while (0)
+#else /* CONFIG_CPU_HAS_LSX */
+#define XOR_SPEED_LSX()
+#endif /* CONFIG_CPU_HAS_LSX */
+
+#ifdef CONFIG_CPU_HAS_LASX
+static struct xor_block_template xor_block_lasx = {
+	.name = "lasx",
+	.do_2 = xor_lasx_2,
+	.do_3 = xor_lasx_3,
+	.do_4 = xor_lasx_4,
+	.do_5 = xor_lasx_5,
+};
+
+#define XOR_SPEED_LASX()					\
+	do {							\
+		if (cpu_has_lasx)				\
+			xor_speed(&xor_block_lasx);		\
+	} while (0)
+#else /* CONFIG_CPU_HAS_LASX */
+#define XOR_SPEED_LASX()
+#endif /* CONFIG_CPU_HAS_LASX */
+
+/*
+ * For grins, also test the generic routines.
+ *
+ * More importantly: it cannot be ruled out at this point that some
+ * future (maybe reduced) models could run the vector algorithms slower than
+ * the scalar ones, maybe for errata or micro-op reasons. It may be
+ * appropriate to revisit this after one or two more uarch generations.
+ */
+#include <asm-generic/xor.h>
+
+#undef XOR_TRY_TEMPLATES
+#define XOR_TRY_TEMPLATES				\
+do {							\
+	xor_speed(&xor_block_8regs);			\
+	xor_speed(&xor_block_8regs_p);			\
+	xor_speed(&xor_block_32regs);			\
+	xor_speed(&xor_block_32regs_p);			\
+	XOR_SPEED_LSX();				\
+	XOR_SPEED_LASX();				\
+} while (0)
+
+#endif /* _ASM_LOONGARCH_XOR_H */
diff --git a/arch/loongarch/include/asm/xor_simd.h b/arch/loongarch/include/asm/xor_simd.h
new file mode 100644
index 0000000000000..a087e4ded5303
--- /dev/null
+++ b/arch/loongarch/include/asm/xor_simd.h
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2023 WANG Xuerui <git@xen0n.name>
+ */
+#ifndef _ASM_LOONGARCH_XOR_SIMD_H
+#define _ASM_LOONGARCH_XOR_SIMD_H
+
+#ifdef CONFIG_CPU_HAS_LSX
+void xor_lsx_2(unsigned long bytes, unsigned long * __restrict p1,
+	       const unsigned long * __restrict p2);
+void xor_lsx_3(unsigned long bytes, unsigned long * __restrict p1,
+	       const unsigned long * __restrict p2,
+	       const unsigned long * __restrict p3);
+void xor_lsx_4(unsigned long bytes, unsigned long * __restrict p1,
+	       const unsigned long * __restrict p2,
+	       const unsigned long * __restrict p3,
+	       const unsigned long * __restrict p4);
+void xor_lsx_5(unsigned long bytes, unsigned long * __restrict p1,
+	       const unsigned long * __restrict p2,
+	       const unsigned long * __restrict p3,
+	       const unsigned long * __restrict p4,
+	       const unsigned long * __restrict p5);
+#endif /* CONFIG_CPU_HAS_LSX */
+
+#ifdef CONFIG_CPU_HAS_LASX
+void xor_lasx_2(unsigned long bytes, unsigned long * __restrict p1,
+	        const unsigned long * __restrict p2);
+void xor_lasx_3(unsigned long bytes, unsigned long * __restrict p1,
+	        const unsigned long * __restrict p2,
+	        const unsigned long * __restrict p3);
+void xor_lasx_4(unsigned long bytes, unsigned long * __restrict p1,
+	        const unsigned long * __restrict p2,
+	        const unsigned long * __restrict p3,
+	        const unsigned long * __restrict p4);
+void xor_lasx_5(unsigned long bytes, unsigned long * __restrict p1,
+	        const unsigned long * __restrict p2,
+	        const unsigned long * __restrict p3,
+	        const unsigned long * __restrict p4,
+	        const unsigned long * __restrict p5);
+#endif /* CONFIG_CPU_HAS_LASX */
+
+#endif /* _ASM_LOONGARCH_XOR_SIMD_H */
diff --git a/arch/loongarch/lib/Makefile b/arch/loongarch/lib/Makefile
index d60d4e096cfa9..045a3ed6ecd70 100644
--- a/arch/loongarch/lib/Makefile
+++ b/arch/loongarch/lib/Makefile
@@ -6,4 +6,7 @@
 lib-y	+= delay.o memset.o memcpy.o memmove.o \
 	   clear_user.o copy_user.o csum.o dump_tlb.o unaligned.o
 
+obj-$(CONFIG_CPU_HAS_LSX) += xor_simd.o xor_simd_glue.o
+
 obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
+
diff --git a/arch/loongarch/lib/xor_simd.c b/arch/loongarch/lib/xor_simd.c
new file mode 100644
index 0000000000000..84cd24b728c47
--- /dev/null
+++ b/arch/loongarch/lib/xor_simd.c
@@ -0,0 +1,93 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * LoongArch SIMD XOR operations
+ *
+ * Copyright (C) 2023 WANG Xuerui <git@xen0n.name>
+ */
+
+#include "xor_simd.h"
+
+/*
+ * Process one cache line (64 bytes) per loop. This is assuming all future
+ * popular LoongArch cores are similar performance-characteristics-wise to the
+ * current models.
+ */
+#define LINE_WIDTH 64
+
+#ifdef CONFIG_CPU_HAS_LSX
+
+#define LD(reg, base, offset)	\
+	"vld $vr" #reg ", %[" #base "], " #offset "\n\t"
+#define ST(reg, base, offset)	\
+	"vst $vr" #reg ", %[" #base "], " #offset "\n\t"
+#define XOR(dj, k)	"vxor.v $vr" #dj ", $vr" #dj ", $vr" #k "\n\t"
+
+#define LD_INOUT_LINE(base)	\
+	LD(0, base, 0)		\
+	LD(1, base, 16)		\
+	LD(2, base, 32)		\
+	LD(3, base, 48)
+
+#define LD_AND_XOR_LINE(base)	\
+	LD(4, base, 0)		\
+	LD(5, base, 16)		\
+	LD(6, base, 32)		\
+	LD(7, base, 48)		\
+	XOR(0, 4)		\
+	XOR(1, 5)		\
+	XOR(2, 6)		\
+	XOR(3, 7)
+
+#define ST_LINE(base)		\
+	ST(0, base, 0)		\
+	ST(1, base, 16)		\
+	ST(2, base, 32)		\
+	ST(3, base, 48)
+
+#define XOR_FUNC_NAME(nr) __xor_lsx_##nr
+#include "xor_template.c"
+
+#undef LD
+#undef ST
+#undef XOR
+#undef LD_INOUT_LINE
+#undef LD_AND_XOR_LINE
+#undef ST_LINE
+#undef XOR_FUNC_NAME
+
+#endif /* CONFIG_CPU_HAS_LSX */
+
+#ifdef CONFIG_CPU_HAS_LASX
+
+#define LD(reg, base, offset)	\
+	"xvld $xr" #reg ", %[" #base "], " #offset "\n\t"
+#define ST(reg, base, offset)	\
+	"xvst $xr" #reg ", %[" #base "], " #offset "\n\t"
+#define XOR(dj, k)	"xvxor.v $xr" #dj ", $xr" #dj ", $xr" #k "\n\t"
+
+#define LD_INOUT_LINE(base)	\
+	LD(0, base, 0)		\
+	LD(1, base, 32)
+
+#define LD_AND_XOR_LINE(base)	\
+	LD(2, base, 0)		\
+	LD(3, base, 32)		\
+	XOR(0, 2)		\
+	XOR(1, 3)
+
+#define ST_LINE(base)		\
+	ST(0, base, 0)		\
+	ST(1, base, 32)
+
+#define XOR_FUNC_NAME(nr) __xor_lasx_##nr
+#include "xor_template.c"
+
+#undef LD
+#undef ST
+#undef XOR
+#undef LD_INOUT_LINE
+#undef LD_AND_XOR_LINE
+#undef ST_LINE
+#undef XOR_FUNC_NAME
+
+#endif /* CONFIG_CPU_HAS_LASX */
diff --git a/arch/loongarch/lib/xor_simd.h b/arch/loongarch/lib/xor_simd.h
new file mode 100644
index 0000000000000..161b8d3875c7a
--- /dev/null
+++ b/arch/loongarch/lib/xor_simd.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Simple interface to link xor_simd.c and xor_simd_glue.c
+ *
+ * Separating these files ensures that no SIMD instructions are run outside of
+ * the kfpu critical section.
+ */
+
+#ifndef __LOONGARCH_LIB_XOR_SIMD_H
+#define __LOONGARCH_LIB_XOR_SIMD_H
+
+#ifdef CONFIG_CPU_HAS_LSX
+void __xor_lsx_2(unsigned long bytes, unsigned long * __restrict p1,
+		 const unsigned long * __restrict p2);
+void __xor_lsx_3(unsigned long bytes, unsigned long * __restrict p1,
+		 const unsigned long * __restrict p2,
+		 const unsigned long * __restrict p3);
+void __xor_lsx_4(unsigned long bytes, unsigned long * __restrict p1,
+		 const unsigned long * __restrict p2,
+		 const unsigned long * __restrict p3,
+		 const unsigned long * __restrict p4);
+void __xor_lsx_5(unsigned long bytes, unsigned long * __restrict p1,
+		 const unsigned long * __restrict p2,
+		 const unsigned long * __restrict p3,
+		 const unsigned long * __restrict p4,
+		 const unsigned long * __restrict p5);
+#endif /* CONFIG_CPU_HAS_LSX */
+
+#ifdef CONFIG_CPU_HAS_LASX
+void __xor_lasx_2(unsigned long bytes, unsigned long * __restrict p1,
+		  const unsigned long * __restrict p2);
+void __xor_lasx_3(unsigned long bytes, unsigned long * __restrict p1,
+		  const unsigned long * __restrict p2,
+		  const unsigned long * __restrict p3);
+void __xor_lasx_4(unsigned long bytes, unsigned long * __restrict p1,
+		  const unsigned long * __restrict p2,
+		  const unsigned long * __restrict p3,
+		  const unsigned long * __restrict p4);
+void __xor_lasx_5(unsigned long bytes, unsigned long * __restrict p1,
+		  const unsigned long * __restrict p2,
+		  const unsigned long * __restrict p3,
+		  const unsigned long * __restrict p4,
+		  const unsigned long * __restrict p5);
+#endif /* CONFIG_CPU_HAS_LASX */
+
+#endif /* __LOONGARCH_LIB_XOR_SIMD_H */
diff --git a/arch/loongarch/lib/xor_simd_glue.c b/arch/loongarch/lib/xor_simd_glue.c
new file mode 100644
index 0000000000000..6f1832862cbbd
--- /dev/null
+++ b/arch/loongarch/lib/xor_simd_glue.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+//
+// LoongArch SIMD XOR operations
+//
+// Copyright 2023 WANG Xuerui <git@xen0n.name>
+
+#include <linux/export.h>
+#include <linux/sched.h>
+#include <asm/fpu.h>
+#include <asm/xor_simd.h>
+#include "xor_simd.h"
+
+#define MAKE_XOR_GLUE_2(flavor)							\
+void xor_##flavor##_2(unsigned long bytes, unsigned long * __restrict p1,	\
+		      const unsigned long * __restrict p2)			\
+{										\
+	kernel_fpu_begin();							\
+	__xor_##flavor##_2(bytes, p1, p2);					\
+	kernel_fpu_end();							\
+}										\
+EXPORT_SYMBOL_GPL(xor_##flavor##_2)
+
+#define MAKE_XOR_GLUE_3(flavor)							\
+void xor_##flavor##_3(unsigned long bytes, unsigned long * __restrict p1,	\
+		      const unsigned long * __restrict p2,			\
+		      const unsigned long * __restrict p3)			\
+{										\
+	kernel_fpu_begin();							\
+	__xor_##flavor##_3(bytes, p1, p2, p3);					\
+	kernel_fpu_end();							\
+}										\
+EXPORT_SYMBOL_GPL(xor_##flavor##_3)
+
+#define MAKE_XOR_GLUE_4(flavor)							\
+void xor_##flavor##_4(unsigned long bytes, unsigned long * __restrict p1,	\
+		      const unsigned long * __restrict p2,			\
+		      const unsigned long * __restrict p3,			\
+		      const unsigned long * __restrict p4)			\
+{										\
+	kernel_fpu_begin();							\
+	__xor_##flavor##_4(bytes, p1, p2, p3, p4);				\
+	kernel_fpu_end();							\
+}										\
+EXPORT_SYMBOL_GPL(xor_##flavor##_4)
+
+#define MAKE_XOR_GLUE_5(flavor)							\
+void xor_##flavor##_5(unsigned long bytes, unsigned long * __restrict p1,	\
+		      const unsigned long * __restrict p2,			\
+		      const unsigned long * __restrict p3,			\
+		      const unsigned long * __restrict p4,			\
+		      const unsigned long * __restrict p5)			\
+{										\
+	kernel_fpu_begin();							\
+	__xor_##flavor##_5(bytes, p1, p2, p3, p4, p5);				\
+	kernel_fpu_end();							\
+}										\
+EXPORT_SYMBOL_GPL(xor_##flavor##_5)
+
+#define MAKE_XOR_GLUES(flavor)		\
+	MAKE_XOR_GLUE_2(flavor);	\
+	MAKE_XOR_GLUE_3(flavor);	\
+	MAKE_XOR_GLUE_4(flavor);	\
+	MAKE_XOR_GLUE_5(flavor)
+
+#ifdef CONFIG_CPU_HAS_LSX
+MAKE_XOR_GLUES(lsx);
+#endif
+
+#ifdef CONFIG_CPU_HAS_LASX
+MAKE_XOR_GLUES(lasx);
+#endif
diff --git a/arch/loongarch/lib/xor_template.c b/arch/loongarch/lib/xor_template.c
new file mode 100644
index 0000000000000..0358ced7fe333
--- /dev/null
+++ b/arch/loongarch/lib/xor_template.c
@@ -0,0 +1,110 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2023 WANG Xuerui <git@xen0n.name>
+ *
+ * Template for XOR operations, instantiated in xor_simd.c.
+ *
+ * Expected preprocessor definitions:
+ *
+ * - LINE_WIDTH
+ * - XOR_FUNC_NAME(nr)
+ * - LD_INOUT_LINE(buf)
+ * - LD_AND_XOR_LINE(buf)
+ * - ST_LINE(buf)
+ */
+
+void XOR_FUNC_NAME(2)(unsigned long bytes,
+		      unsigned long * __restrict v1,
+		      const unsigned long * __restrict v2)
+{
+	unsigned long lines = bytes / LINE_WIDTH;
+
+	do {
+		__asm__ __volatile__ (
+			LD_INOUT_LINE(v1)
+			LD_AND_XOR_LINE(v2)
+			ST_LINE(v1)
+		: : [v1] "r"(v1), [v2] "r"(v2) : "memory"
+		);
+
+		v1 += LINE_WIDTH / sizeof(unsigned long);
+		v2 += LINE_WIDTH / sizeof(unsigned long);
+	} while (--lines > 0);
+}
+
+void XOR_FUNC_NAME(3)(unsigned long bytes,
+		      unsigned long * __restrict v1,
+		      const unsigned long * __restrict v2,
+		      const unsigned long * __restrict v3)
+{
+	unsigned long lines = bytes / LINE_WIDTH;
+
+	do {
+		__asm__ __volatile__ (
+			LD_INOUT_LINE(v1)
+			LD_AND_XOR_LINE(v2)
+			LD_AND_XOR_LINE(v3)
+			ST_LINE(v1)
+		: : [v1] "r"(v1), [v2] "r"(v2), [v3] "r"(v3) : "memory"
+		);
+
+		v1 += LINE_WIDTH / sizeof(unsigned long);
+		v2 += LINE_WIDTH / sizeof(unsigned long);
+		v3 += LINE_WIDTH / sizeof(unsigned long);
+	} while (--lines > 0);
+}
+
+void XOR_FUNC_NAME(4)(unsigned long bytes,
+		      unsigned long * __restrict v1,
+		      const unsigned long * __restrict v2,
+		      const unsigned long * __restrict v3,
+		      const unsigned long * __restrict v4)
+{
+	unsigned long lines = bytes / LINE_WIDTH;
+
+	do {
+		__asm__ __volatile__ (
+			LD_INOUT_LINE(v1)
+			LD_AND_XOR_LINE(v2)
+			LD_AND_XOR_LINE(v3)
+			LD_AND_XOR_LINE(v4)
+			ST_LINE(v1)
+		: : [v1] "r"(v1), [v2] "r"(v2), [v3] "r"(v3), [v4] "r"(v4)
+		: "memory"
+		);
+
+		v1 += LINE_WIDTH / sizeof(unsigned long);
+		v2 += LINE_WIDTH / sizeof(unsigned long);
+		v3 += LINE_WIDTH / sizeof(unsigned long);
+		v4 += LINE_WIDTH / sizeof(unsigned long);
+	} while (--lines > 0);
+}
+
+void XOR_FUNC_NAME(5)(unsigned long bytes,
+		      unsigned long * __restrict v1,
+		      const unsigned long * __restrict v2,
+		      const unsigned long * __restrict v3,
+		      const unsigned long * __restrict v4,
+		      const unsigned long * __restrict v5)
+{
+	unsigned long lines = bytes / LINE_WIDTH;
+
+	do {
+		__asm__ __volatile__ (
+			LD_INOUT_LINE(v1)
+			LD_AND_XOR_LINE(v2)
+			LD_AND_XOR_LINE(v3)
+			LD_AND_XOR_LINE(v4)
+			LD_AND_XOR_LINE(v5)
+			ST_LINE(v1)
+		: : [v1] "r"(v1), [v2] "r"(v2), [v3] "r"(v3), [v4] "r"(v4),
+		    [v5] "r"(v5) : "memory"
+		);
+
+		v1 += LINE_WIDTH / sizeof(unsigned long);
+		v2 += LINE_WIDTH / sizeof(unsigned long);
+		v3 += LINE_WIDTH / sizeof(unsigned long);
+		v4 += LINE_WIDTH / sizeof(unsigned long);
+		v5 += LINE_WIDTH / sizeof(unsigned long);
+	} while (--lines > 0);
+}
-- 
2.40.0



* [PATCH v2 2/3] raid6: Add LoongArch SIMD syndrome calculation
  2023-08-03 17:08 [PATCH v2 0/3] raid5, raid6: Accelerate RAID math with LoongArch SIMD WANG Xuerui
  2023-08-03 17:08 ` [PATCH v2 1/3] LoongArch: Add SIMD-optimized XOR routines WANG Xuerui
@ 2023-08-03 17:08 ` WANG Xuerui
  2023-08-03 17:08 ` [PATCH v2 3/3] raid6: Add LoongArch SIMD recovery implementation WANG Xuerui
  2 siblings, 0 replies; 6+ messages in thread
From: WANG Xuerui @ 2023-08-03 17:08 UTC (permalink / raw)
  To: Song Liu; +Cc: Huacai Chen, linux-raid, loongarch, linux-kernel, WANG Xuerui

From: WANG Xuerui <git@xen0n.name>

The algorithms work on 64 bytes at a time, which is the L1 cache line
size of all current and future LoongArch cores (that we care about), as
confirmed by Huacai. The code is based on the generic int.uc algorithm,
unrolled 4 times for LSX and 2 times for LASX. Further unrolling does
not meaningfully improve performance, according to experiments.
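
For reference (not part of the patch), the per-byte GF(2^8) step from
int.uc that the vslti.b/vslli.b/vandi.b/vxor.v sequences below implement
is roughly the following (helper name made up for exposition):

	/* Illustrative: multiply a Q syndrome byte by 2 in GF(2^8), poly 0x11d. */
	static inline u8 gf_mul2(u8 wq)
	{
		u8 w2 = (wq & 0x80) ? 0x1d : 0;	/* w2 = MASK(wq) & NBYTES(0x1d) */
		u8 w1 = (u8)(wq << 1);		/* w1 = SHLBYTE(wq) */

		return w1 ^ w2;
	}

so each inner-loop iteration amounts to "wp ^= wd; wq = gf_mul2(wq) ^ wd;"
per byte, with the vector code handling 64 bytes per iteration.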

Performance numbers measured during system boot on a 3A5000 @ 2.5GHz:

> raid6: lasx     gen() 12726 MB/s
> raid6: lsx      gen() 10001 MB/s
> raid6: int64x8  gen()  2876 MB/s
> raid6: int64x4  gen()  3867 MB/s
> raid6: int64x2  gen()  2531 MB/s
> raid6: int64x1  gen()  1945 MB/s

Comparison of xor() speeds (from different boots but meaningful anyway):

> lasx:    11226 MB/s
> lsx:     6395 MB/s
> int64x4: 2147 MB/s

Performance as measured by raid6test:

> raid6: lasx     gen() 25109 MB/s
> raid6: lsx      gen() 13233 MB/s
> raid6: int64x8  gen()  4164 MB/s
> raid6: int64x4  gen()  6005 MB/s
> raid6: int64x2  gen()  5781 MB/s
> raid6: int64x1  gen()  4119 MB/s
> raid6: using algorithm lasx gen() 25109 MB/s
> raid6: .... xor() 14439 MB/s, rmw enabled

Signed-off-by: WANG Xuerui <git@xen0n.name>
---
 include/linux/raid/pq.h    |   2 +
 lib/raid6/Makefile         |   1 +
 lib/raid6/algos.c          |   8 +
 lib/raid6/loongarch.h      |  38 ++++
 lib/raid6/loongarch_simd.c | 422 +++++++++++++++++++++++++++++++++++++
 lib/raid6/test/Makefile    |  12 ++
 6 files changed, 483 insertions(+)
 create mode 100644 lib/raid6/loongarch.h
 create mode 100644 lib/raid6/loongarch_simd.c

diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
index f29aaaf2eb21e..8744474858487 100644
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -108,6 +108,8 @@ extern const struct raid6_calls raid6_vpermxor1;
 extern const struct raid6_calls raid6_vpermxor2;
 extern const struct raid6_calls raid6_vpermxor4;
 extern const struct raid6_calls raid6_vpermxor8;
+extern const struct raid6_calls raid6_lsx;
+extern const struct raid6_calls raid6_lasx;
 
 struct raid6_recov_calls {
 	void (*data2)(int, size_t, int, int, void **);
diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
index 45e17619422b4..2b9ebe1054806 100644
--- a/lib/raid6/Makefile
+++ b/lib/raid6/Makefile
@@ -9,6 +9,7 @@ raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o \
                               vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
 raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o recov_neon.o recov_neon_inner.o
 raid6_pq-$(CONFIG_S390) += s390vx8.o recov_s390xc.o
+raid6_pq-$(CONFIG_LOONGARCH) += loongarch_simd.o
 
 hostprogs	+= mktables
 
diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
index a22a05c9af8a3..739c7ebcae1a2 100644
--- a/lib/raid6/algos.c
+++ b/lib/raid6/algos.c
@@ -73,6 +73,14 @@ const struct raid6_calls * const raid6_algos[] = {
 	&raid6_neonx2,
 	&raid6_neonx1,
 #endif
+#ifdef CONFIG_LOONGARCH
+#ifdef CONFIG_CPU_HAS_LASX
+	&raid6_lasx,
+#endif
+#ifdef CONFIG_CPU_HAS_LSX
+	&raid6_lsx,
+#endif
+#endif
 #if defined(__ia64__)
 	&raid6_intx32,
 	&raid6_intx16,
diff --git a/lib/raid6/loongarch.h b/lib/raid6/loongarch.h
new file mode 100644
index 0000000000000..acfc33ce70562
--- /dev/null
+++ b/lib/raid6/loongarch.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2023 WANG Xuerui <git@xen0n.name>
+ *
+ * raid6/loongarch.h
+ *
+ * Definitions common to LoongArch RAID-6 code only
+ */
+
+#ifndef _LIB_RAID6_LOONGARCH_H
+#define _LIB_RAID6_LOONGARCH_H
+
+#ifdef __KERNEL__
+
+#include <asm/cpu-features.h>
+#include <asm/fpu.h>
+
+#else /* for user-space testing */
+
+#include <sys/auxv.h>
+
+/* have to supply these defines for glibc 2.37- and musl */
+#ifndef HWCAP_LOONGARCH_LSX
+#define HWCAP_LOONGARCH_LSX	(1 << 4)
+#endif
+#ifndef HWCAP_LOONGARCH_LASX
+#define HWCAP_LOONGARCH_LASX	(1 << 5)
+#endif
+
+#define kernel_fpu_begin()
+#define kernel_fpu_end()
+
+#define cpu_has_lsx	(getauxval(AT_HWCAP) & HWCAP_LOONGARCH_LSX)
+#define cpu_has_lasx	(getauxval(AT_HWCAP) & HWCAP_LOONGARCH_LASX)
+
+#endif /* __KERNEL__ */
+
+#endif /* _LIB_RAID6_LOONGARCH_H */
diff --git a/lib/raid6/loongarch_simd.c b/lib/raid6/loongarch_simd.c
new file mode 100644
index 0000000000000..ebb33e5c8f060
--- /dev/null
+++ b/lib/raid6/loongarch_simd.c
@@ -0,0 +1,422 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * RAID6 syndrome calculations in LoongArch SIMD (LSX & LASX)
+ *
+ * Copyright 2023 WANG Xuerui <git@xen0n.name>
+ *
+ * Based on the generic RAID-6 code (int.uc):
+ *
+ * Copyright 2002-2004 H. Peter Anvin
+ */
+
+#include <linux/raid/pq.h>
+#include "loongarch.h"
+
+/*
+ * The vector algorithms are currently priority 0, which means the generic
+ * scalar algorithms are not being disabled if vector support is present.
+ * This is like the similar LoongArch RAID5 XOR code, with the main reason
+ * repeated here: it cannot be ruled out at this point that some
+ * future (maybe reduced) models could run the vector algorithms slower than
+ * the scalar ones, maybe for errata or micro-op reasons. It may be
+ * appropriate to revisit this after one or two more uarch generations.
+ */
+
+#ifdef CONFIG_CPU_HAS_LSX
+#define NSIZE 16
+
+static int raid6_has_lsx(void)
+{
+	return cpu_has_lsx;
+}
+
+static void raid6_lsx_gen_syndrome(int disks, size_t bytes, void **ptrs)
+{
+	u8 **dptr = (u8 **)ptrs;
+	u8 *p, *q;
+	int d, z, z0;
+
+	z0 = disks - 3;		/* Highest data disk */
+	p = dptr[z0+1];		/* XOR parity */
+	q = dptr[z0+2];		/* RS syndrome */
+
+	kernel_fpu_begin();
+
+	/*
+	 * $vr0, $vr1, $vr2, $vr3: wp
+	 * $vr4, $vr5, $vr6, $vr7: wq
+	 * $vr8, $vr9, $vr10, $vr11: wd
+	 * $vr12, $vr13, $vr14, $vr15: w2
+	 * $vr16, $vr17, $vr18, $vr19: w1
+	 */
+	for ( d = 0 ; d < bytes ; d += NSIZE*4 ) {
+		/* wq$$ = wp$$ = *(unative_t *)&dptr[z0][d+$$*NSIZE]; */
+		asm volatile("vld $vr0, %0" : : "m"(dptr[z0][d+0*NSIZE]));
+		asm volatile("vld $vr1, %0" : : "m"(dptr[z0][d+1*NSIZE]));
+		asm volatile("vld $vr2, %0" : : "m"(dptr[z0][d+2*NSIZE]));
+		asm volatile("vld $vr3, %0" : : "m"(dptr[z0][d+3*NSIZE]));
+		asm volatile("vori.b $vr4, $vr0, 0");
+		asm volatile("vori.b $vr5, $vr1, 0");
+		asm volatile("vori.b $vr6, $vr2, 0");
+		asm volatile("vori.b $vr7, $vr3, 0");
+		for ( z = z0-1 ; z >= 0 ; z-- ) {
+			/* wd$$ = *(unative_t *)&dptr[z][d+$$*NSIZE]; */
+			asm volatile("vld $vr8, %0" : : "m"(dptr[z][d+0*NSIZE]));
+			asm volatile("vld $vr9, %0" : : "m"(dptr[z][d+1*NSIZE]));
+			asm volatile("vld $vr10, %0" : : "m"(dptr[z][d+2*NSIZE]));
+			asm volatile("vld $vr11, %0" : : "m"(dptr[z][d+3*NSIZE]));
+			/* wp$$ ^= wd$$; */
+			asm volatile("vxor.v $vr0, $vr0, $vr8");
+			asm volatile("vxor.v $vr1, $vr1, $vr9");
+			asm volatile("vxor.v $vr2, $vr2, $vr10");
+			asm volatile("vxor.v $vr3, $vr3, $vr11");
+			/* w2$$ = MASK(wq$$); */
+			asm volatile("vslti.b $vr12, $vr4, 0");
+			asm volatile("vslti.b $vr13, $vr5, 0");
+			asm volatile("vslti.b $vr14, $vr6, 0");
+			asm volatile("vslti.b $vr15, $vr7, 0");
+			/* w1$$ = SHLBYTE(wq$$); */
+			asm volatile("vslli.b $vr16, $vr4, 1");
+			asm volatile("vslli.b $vr17, $vr5, 1");
+			asm volatile("vslli.b $vr18, $vr6, 1");
+			asm volatile("vslli.b $vr19, $vr7, 1");
+			/* w2$$ &= NBYTES(0x1d); */
+			asm volatile("vandi.b $vr12, $vr12, 0x1d");
+			asm volatile("vandi.b $vr13, $vr13, 0x1d");
+			asm volatile("vandi.b $vr14, $vr14, 0x1d");
+			asm volatile("vandi.b $vr15, $vr15, 0x1d");
+			/* w1$$ ^= w2$$; */
+			asm volatile("vxor.v $vr16, $vr16, $vr12");
+			asm volatile("vxor.v $vr17, $vr17, $vr13");
+			asm volatile("vxor.v $vr18, $vr18, $vr14");
+			asm volatile("vxor.v $vr19, $vr19, $vr15");
+			/* wq$$ = w1$$ ^ wd$$; */
+			asm volatile("vxor.v $vr4, $vr16, $vr8");
+			asm volatile("vxor.v $vr5, $vr17, $vr9");
+			asm volatile("vxor.v $vr6, $vr18, $vr10");
+			asm volatile("vxor.v $vr7, $vr19, $vr11");
+		}
+		/* *(unative_t *)&p[d+NSIZE*$$] = wp$$; */
+		asm volatile("vst $vr0, %0" : "=m"(p[d+NSIZE*0]));
+		asm volatile("vst $vr1, %0" : "=m"(p[d+NSIZE*1]));
+		asm volatile("vst $vr2, %0" : "=m"(p[d+NSIZE*2]));
+		asm volatile("vst $vr3, %0" : "=m"(p[d+NSIZE*3]));
+		/* *(unative_t *)&q[d+NSIZE*$$] = wq$$; */
+		asm volatile("vst $vr4, %0" : "=m"(q[d+NSIZE*0]));
+		asm volatile("vst $vr5, %0" : "=m"(q[d+NSIZE*1]));
+		asm volatile("vst $vr6, %0" : "=m"(q[d+NSIZE*2]));
+		asm volatile("vst $vr7, %0" : "=m"(q[d+NSIZE*3]));
+	}
+
+	kernel_fpu_end();
+}
+
+static void raid6_lsx_xor_syndrome(int disks, int start, int stop,
+				   size_t bytes, void **ptrs)
+{
+	u8 **dptr = (u8 **)ptrs;
+	u8 *p, *q;
+	int d, z, z0;
+
+	z0 = stop;		/* P/Q right side optimization */
+	p = dptr[disks-2];	/* XOR parity */
+	q = dptr[disks-1];	/* RS syndrome */
+
+	kernel_fpu_begin();
+
+	/*
+	 * $vr0, $vr1, $vr2, $vr3: wp
+	 * $vr4, $vr5, $vr6, $vr7: wq
+	 * $vr8, $vr9, $vr10, $vr11: wd
+	 * $vr12, $vr13, $vr14, $vr15: w2
+	 * $vr16, $vr17, $vr18, $vr19: w1
+	 */
+	for ( d = 0 ; d < bytes ; d += NSIZE*4 ) {
+		/* P/Q data pages */
+		/* wq$$ = wp$$ = *(unative_t *)&dptr[z0][d+$$*NSIZE]; */
+		asm volatile("vld $vr0, %0" : : "m"(dptr[z0][d+0*NSIZE]));
+		asm volatile("vld $vr1, %0" : : "m"(dptr[z0][d+1*NSIZE]));
+		asm volatile("vld $vr2, %0" : : "m"(dptr[z0][d+2*NSIZE]));
+		asm volatile("vld $vr3, %0" : : "m"(dptr[z0][d+3*NSIZE]));
+		asm volatile("vori.b $vr4, $vr0, 0");
+		asm volatile("vori.b $vr5, $vr1, 0");
+		asm volatile("vori.b $vr6, $vr2, 0");
+		asm volatile("vori.b $vr7, $vr3, 0");
+		for ( z = z0-1 ; z >= start ; z-- ) {
+			/* wd$$ = *(unative_t *)&dptr[z][d+$$*NSIZE]; */
+			asm volatile("vld $vr8, %0" : : "m"(dptr[z][d+0*NSIZE]));
+			asm volatile("vld $vr9, %0" : : "m"(dptr[z][d+1*NSIZE]));
+			asm volatile("vld $vr10, %0" : : "m"(dptr[z][d+2*NSIZE]));
+			asm volatile("vld $vr11, %0" : : "m"(dptr[z][d+3*NSIZE]));
+			/* wp$$ ^= wd$$; */
+			asm volatile("vxor.v $vr0, $vr0, $vr8");
+			asm volatile("vxor.v $vr1, $vr1, $vr9");
+			asm volatile("vxor.v $vr2, $vr2, $vr10");
+			asm volatile("vxor.v $vr3, $vr3, $vr11");
+			/* w2$$ = MASK(wq$$); */
+			asm volatile("vslti.b $vr12, $vr4, 0");
+			asm volatile("vslti.b $vr13, $vr5, 0");
+			asm volatile("vslti.b $vr14, $vr6, 0");
+			asm volatile("vslti.b $vr15, $vr7, 0");
+			/* w1$$ = SHLBYTE(wq$$); */
+			asm volatile("vslli.b $vr16, $vr4, 1");
+			asm volatile("vslli.b $vr17, $vr5, 1");
+			asm volatile("vslli.b $vr18, $vr6, 1");
+			asm volatile("vslli.b $vr19, $vr7, 1");
+			/* w2$$ &= NBYTES(0x1d); */
+			asm volatile("vandi.b $vr12, $vr12, 0x1d");
+			asm volatile("vandi.b $vr13, $vr13, 0x1d");
+			asm volatile("vandi.b $vr14, $vr14, 0x1d");
+			asm volatile("vandi.b $vr15, $vr15, 0x1d");
+			/* w1$$ ^= w2$$; */
+			asm volatile("vxor.v $vr16, $vr16, $vr12");
+			asm volatile("vxor.v $vr17, $vr17, $vr13");
+			asm volatile("vxor.v $vr18, $vr18, $vr14");
+			asm volatile("vxor.v $vr19, $vr19, $vr15");
+			/* wq$$ = w1$$ ^ wd$$; */
+			asm volatile("vxor.v $vr4, $vr16, $vr8");
+			asm volatile("vxor.v $vr5, $vr17, $vr9");
+			asm volatile("vxor.v $vr6, $vr18, $vr10");
+			asm volatile("vxor.v $vr7, $vr19, $vr11");
+		}
+
+		/* P/Q left side optimization */
+		for ( z = start-1 ; z >= 0 ; z-- ) {
+			/* w2$$ = MASK(wq$$); */
+			asm volatile("vslti.b $vr12, $vr4, 0");
+			asm volatile("vslti.b $vr13, $vr5, 0");
+			asm volatile("vslti.b $vr14, $vr6, 0");
+			asm volatile("vslti.b $vr15, $vr7, 0");
+			/* w1$$ = SHLBYTE(wq$$); */
+			asm volatile("vslli.b $vr16, $vr4, 1");
+			asm volatile("vslli.b $vr17, $vr5, 1");
+			asm volatile("vslli.b $vr18, $vr6, 1");
+			asm volatile("vslli.b $vr19, $vr7, 1");
+			/* w2$$ &= NBYTES(0x1d); */
+			asm volatile("vandi.b $vr12, $vr12, 0x1d");
+			asm volatile("vandi.b $vr13, $vr13, 0x1d");
+			asm volatile("vandi.b $vr14, $vr14, 0x1d");
+			asm volatile("vandi.b $vr15, $vr15, 0x1d");
+			/* wq$$ = w1$$ ^ w2$$; */
+			asm volatile("vxor.v $vr4, $vr16, $vr12");
+			asm volatile("vxor.v $vr5, $vr17, $vr13");
+			asm volatile("vxor.v $vr6, $vr18, $vr14");
+			asm volatile("vxor.v $vr7, $vr19, $vr15");
+		}
+		/*
+		 * *(unative_t *)&p[d+NSIZE*$$] ^= wp$$;
+		 * *(unative_t *)&q[d+NSIZE*$$] ^= wq$$;
+		 */
+		asm volatile(
+			"vld $vr20, %0\n\t"
+			"vld $vr21, %1\n\t"
+			"vld $vr22, %2\n\t"
+			"vld $vr23, %3\n\t"
+			"vld $vr24, %4\n\t"
+			"vld $vr25, %5\n\t"
+			"vld $vr26, %6\n\t"
+			"vld $vr27, %7\n\t"
+			"vxor.v $vr20, $vr20, $vr0\n\t"
+			"vxor.v $vr21, $vr21, $vr1\n\t"
+			"vxor.v $vr22, $vr22, $vr2\n\t"
+			"vxor.v $vr23, $vr23, $vr3\n\t"
+			"vxor.v $vr24, $vr24, $vr4\n\t"
+			"vxor.v $vr25, $vr25, $vr5\n\t"
+			"vxor.v $vr26, $vr26, $vr6\n\t"
+			"vxor.v $vr27, $vr27, $vr7\n\t"
+			"vst $vr20, %0\n\t"
+			"vst $vr21, %1\n\t"
+			"vst $vr22, %2\n\t"
+			"vst $vr23, %3\n\t"
+			"vst $vr24, %4\n\t"
+			"vst $vr25, %5\n\t"
+			"vst $vr26, %6\n\t"
+			"vst $vr27, %7\n\t"
+			: "+m"(p[d+NSIZE*0]), "+m"(p[d+NSIZE*1]),
+			  "+m"(p[d+NSIZE*2]), "+m"(p[d+NSIZE*3]),
+			  "+m"(q[d+NSIZE*0]), "+m"(q[d+NSIZE*1]),
+			  "+m"(q[d+NSIZE*2]), "+m"(q[d+NSIZE*3])
+		);
+	}
+
+	kernel_fpu_end();
+}
+
+const struct raid6_calls raid6_lsx = {
+	raid6_lsx_gen_syndrome,
+	raid6_lsx_xor_syndrome,
+	raid6_has_lsx,
+	"lsx",
+	.priority = 0 /* see the comment near the top of the file for reason */
+};
+
+#undef NSIZE
+#endif /* CONFIG_CPU_HAS_LSX */
+
+#ifdef CONFIG_CPU_HAS_LASX
+#define NSIZE 32
+
+static int raid6_has_lasx(void)
+{
+	return cpu_has_lasx;
+}
+
+static void raid6_lasx_gen_syndrome(int disks, size_t bytes, void **ptrs)
+{
+	u8 **dptr = (u8 **)ptrs;
+	u8 *p, *q;
+	int d, z, z0;
+
+	z0 = disks - 3;		/* Highest data disk */
+	p = dptr[z0+1];		/* XOR parity */
+	q = dptr[z0+2];		/* RS syndrome */
+
+	kernel_fpu_begin();
+
+	/*
+	 * $xr0, $xr1: wp
+	 * $xr2, $xr3: wq
+	 * $xr4, $xr5: wd
+	 * $xr6, $xr7: w2
+	 * $xr8, $xr9: w1
+	 */
+	for ( d = 0 ; d < bytes ; d += NSIZE*2 ) {
+		/* wq$$ = wp$$ = *(unative_t *)&dptr[z0][d+$$*NSIZE]; */
+		asm volatile("xvld $xr0, %0" : : "m"(dptr[z0][d+0*NSIZE]));
+		asm volatile("xvld $xr1, %0" : : "m"(dptr[z0][d+1*NSIZE]));
+		asm volatile("xvori.b $xr2, $xr0, 0");
+		asm volatile("xvori.b $xr3, $xr1, 0");
+		for ( z = z0-1 ; z >= 0 ; z-- ) {
+			/* wd$$ = *(unative_t *)&dptr[z][d+$$*NSIZE]; */
+			asm volatile("xvld $xr4, %0" : : "m"(dptr[z][d+0*NSIZE]));
+			asm volatile("xvld $xr5, %0" : : "m"(dptr[z][d+1*NSIZE]));
+			/* wp$$ ^= wd$$; */
+			asm volatile("xvxor.v $xr0, $xr0, $xr4");
+			asm volatile("xvxor.v $xr1, $xr1, $xr5");
+			/* w2$$ = MASK(wq$$); */
+			asm volatile("xvslti.b $xr6, $xr2, 0");
+			asm volatile("xvslti.b $xr7, $xr3, 0");
+			/* w1$$ = SHLBYTE(wq$$); */
+			asm volatile("xvslli.b $xr8, $xr2, 1");
+			asm volatile("xvslli.b $xr9, $xr3, 1");
+			/* w2$$ &= NBYTES(0x1d); */
+			asm volatile("xvandi.b $xr6, $xr6, 0x1d");
+			asm volatile("xvandi.b $xr7, $xr7, 0x1d");
+			/* w1$$ ^= w2$$; */
+			asm volatile("xvxor.v $xr8, $xr8, $xr6");
+			asm volatile("xvxor.v $xr9, $xr9, $xr7");
+			/* wq$$ = w1$$ ^ wd$$; */
+			asm volatile("xvxor.v $xr2, $xr8, $xr4");
+			asm volatile("xvxor.v $xr3, $xr9, $xr5");
+		}
+		/* *(unative_t *)&p[d+NSIZE*$$] = wp$$; */
+		asm volatile("xvst $xr0, %0" : "=m"(p[d+NSIZE*0]));
+		asm volatile("xvst $xr1, %0" : "=m"(p[d+NSIZE*1]));
+		/* *(unative_t *)&q[d+NSIZE*$$] = wq$$; */
+		asm volatile("xvst $xr2, %0" : "=m"(q[d+NSIZE*0]));
+		asm volatile("xvst $xr3, %0" : "=m"(q[d+NSIZE*1]));
+	}
+
+	kernel_fpu_end();
+}
+
+static void raid6_lasx_xor_syndrome(int disks, int start, int stop,
+				    size_t bytes, void **ptrs)
+{
+	u8 **dptr = (u8 **)ptrs;
+	u8 *p, *q;
+	int d, z, z0;
+
+	z0 = stop;		/* P/Q right side optimization */
+	p = dptr[disks-2];	/* XOR parity */
+	q = dptr[disks-1];	/* RS syndrome */
+
+	kernel_fpu_begin();
+
+	/*
+	 * $xr0, $xr1: wp
+	 * $xr2, $xr3: wq
+	 * $xr4, $xr5: wd
+	 * $xr6, $xr7: w2
+	 * $xr8, $xr9: w1
+	 */
+	for ( d = 0 ; d < bytes ; d += NSIZE*2 ) {
+		/* P/Q data pages */
+		/* wq$$ = wp$$ = *(unative_t *)&dptr[z0][d+$$*NSIZE]; */
+		asm volatile("xvld $xr0, %0" : : "m"(dptr[z0][d+0*NSIZE]));
+		asm volatile("xvld $xr1, %0" : : "m"(dptr[z0][d+1*NSIZE]));
+		asm volatile("xvori.b $xr2, $xr0, 0");
+		asm volatile("xvori.b $xr3, $xr1, 0");
+		for ( z = z0-1 ; z >= start ; z-- ) {
+			/* wd$$ = *(unative_t *)&dptr[z][d+$$*NSIZE]; */
+			asm volatile("xvld $xr4, %0" : : "m"(dptr[z][d+0*NSIZE]));
+			asm volatile("xvld $xr5, %0" : : "m"(dptr[z][d+1*NSIZE]));
+			/* wp$$ ^= wd$$; */
+			asm volatile("xvxor.v $xr0, $xr0, $xr4");
+			asm volatile("xvxor.v $xr1, $xr1, $xr5");
+			/* w2$$ = MASK(wq$$); */
+			asm volatile("xvslti.b $xr6, $xr2, 0");
+			asm volatile("xvslti.b $xr7, $xr3, 0");
+			/* w1$$ = SHLBYTE(wq$$); */
+			asm volatile("xvslli.b $xr8, $xr2, 1");
+			asm volatile("xvslli.b $xr9, $xr3, 1");
+			/* w2$$ &= NBYTES(0x1d); */
+			asm volatile("xvandi.b $xr6, $xr6, 0x1d");
+			asm volatile("xvandi.b $xr7, $xr7, 0x1d");
+			/* w1$$ ^= w2$$; */
+			asm volatile("xvxor.v $xr8, $xr8, $xr6");
+			asm volatile("xvxor.v $xr9, $xr9, $xr7");
+			/* wq$$ = w1$$ ^ wd$$; */
+			asm volatile("xvxor.v $xr2, $xr8, $xr4");
+			asm volatile("xvxor.v $xr3, $xr9, $xr5");
+		}
+
+		/* P/Q left side optimization */
+		for ( z = start-1 ; z >= 0 ; z-- ) {
+			/* w2$$ = MASK(wq$$); */
+			asm volatile("xvslti.b $xr6, $xr2, 0");
+			asm volatile("xvslti.b $xr7, $xr3, 0");
+			/* w1$$ = SHLBYTE(wq$$); */
+			asm volatile("xvslli.b $xr8, $xr2, 1");
+			asm volatile("xvslli.b $xr9, $xr3, 1");
+			/* w2$$ &= NBYTES(0x1d); */
+			asm volatile("xvandi.b $xr6, $xr6, 0x1d");
+			asm volatile("xvandi.b $xr7, $xr7, 0x1d");
+			/* wq$$ = w1$$ ^ w2$$; */
+			asm volatile("xvxor.v $xr2, $xr8, $xr6");
+			asm volatile("xvxor.v $xr3, $xr9, $xr7");
+		}
+		/*
+		 * *(unative_t *)&p[d+NSIZE*$$] ^= wp$$;
+		 * *(unative_t *)&q[d+NSIZE*$$] ^= wq$$;
+		 */
+		asm volatile(
+			"xvld $xr10, %0\n\t"
+			"xvld $xr11, %1\n\t"
+			"xvld $xr12, %2\n\t"
+			"xvld $xr13, %3\n\t"
+			"xvxor.v $xr10, $xr10, $xr0\n\t"
+			"xvxor.v $xr11, $xr11, $xr1\n\t"
+			"xvxor.v $xr12, $xr12, $xr2\n\t"
+			"xvxor.v $xr13, $xr13, $xr3\n\t"
+			"xvst $xr10, %0\n\t"
+			"xvst $xr11, %1\n\t"
+			"xvst $xr12, %2\n\t"
+			"xvst $xr13, %3\n\t"
+			: "+m"(p[d+NSIZE*0]), "+m"(p[d+NSIZE*1]),
+			  "+m"(q[d+NSIZE*0]), "+m"(q[d+NSIZE*1])
+		);
+	}
+
+	kernel_fpu_end();
+}
+
+const struct raid6_calls raid6_lasx = {
+	raid6_lasx_gen_syndrome,
+	raid6_lasx_xor_syndrome,
+	raid6_has_lasx,
+	"lasx",
+	.priority = 0 /* see the comment near the top of the file for reason */
+};
+#undef NSIZE
+#endif /* CONFIG_CPU_HAS_LASX */
diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile
index 1f693ea3b980c..7b244bce32b3d 100644
--- a/lib/raid6/test/Makefile
+++ b/lib/raid6/test/Makefile
@@ -41,6 +41,16 @@ ifeq ($(findstring ppc,$(ARCH)),ppc)
                          gcc -c -x c - >/dev/null && rm ./-.o && echo yes)
 endif
 
+ifeq ($(ARCH),loongarch64)
+        CFLAGS += -I../../../arch/loongarch/include -DCONFIG_LOONGARCH=1
+        CFLAGS += $(shell echo 'vld $$vr0, $$zero, 0' |         \
+                    gcc -c -x assembler - >/dev/null 2>&1 &&    \
+                    rm ./-.o && echo -DCONFIG_CPU_HAS_LSX=1)
+        CFLAGS += $(shell echo 'xvld $$xr0, $$zero, 0' |        \
+                    gcc -c -x assembler - >/dev/null 2>&1 &&    \
+                    rm ./-.o && echo -DCONFIG_CPU_HAS_LASX=1)
+endif
+
 ifeq ($(IS_X86),yes)
         OBJS   += mmx.o sse1.o sse2.o avx2.o recov_ssse3.o recov_avx2.o avx512.o recov_avx512.o
         CFLAGS += -DCONFIG_X86
@@ -54,6 +64,8 @@ else ifeq ($(HAS_ALTIVEC),yes)
         CFLAGS += -DCONFIG_ALTIVEC
         OBJS += altivec1.o altivec2.o altivec4.o altivec8.o \
                 vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
+else ifeq ($(ARCH),loongarch64)
+        OBJS += loongarch_simd.o
 endif
 
 .c.o:
-- 
2.40.0



* [PATCH v2 3/3] raid6: Add LoongArch SIMD recovery implementation
  2023-08-03 17:08 [PATCH v2 0/3] raid5, raid6: Accelerate RAID math with LoongArch SIMD WANG Xuerui
  2023-08-03 17:08 ` [PATCH v2 1/3] LoongArch: Add SIMD-optimized XOR routines WANG Xuerui
  2023-08-03 17:08 ` [PATCH v2 2/3] raid6: Add LoongArch SIMD syndrome calculation WANG Xuerui
@ 2023-08-03 17:08 ` WANG Xuerui
  2023-08-03 19:49   ` Paul Menzel
  2 siblings, 1 reply; 6+ messages in thread
From: WANG Xuerui @ 2023-08-03 17:08 UTC (permalink / raw)
  To: Song Liu; +Cc: Huacai Chen, linux-raid, loongarch, linux-kernel, WANG Xuerui

From: WANG Xuerui <git@xen0n.name>

Similar to the syndrome calculation, the recovery algorithms also work
on 64 bytes at a time to align with the L1 cache line size of current
and future LoongArch cores (that we care about), which means
unrolled-by-4 LSX and unrolled-by-2 LASX code.

The assembly is originally based on the x86 SSSE3/AVX2 ports, but
register allocation has been redone to take advantage of LSX/LASX's 32
vector registers, and the instruction sequence has been optimized to suit
(e.g. LoongArch can perform per-byte srl and andi on vectors, but x86
cannot).
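
For reference (not part of the patch), the per-byte form of the
nibble-split table lookup that the vsrli.b/vandi.b/vshuf.b sequences
implement is roughly the following (helper name made up for exposition):

	/* Illustrative: raid6_vgfmul[c] is 32 bytes, indexed by low/high nibble. */
	static inline u8 gf_mul_by_table(const u8 *tbl, u8 x)
	{
		return tbl[x & 0x0f] ^ tbl[16 + (x >> 4)];
	}

i.e. one multiply per data byte, which the vector code performs for a
whole 64-byte line at a time.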

Performance numbers measured by instrumenting the raid6test code:

> lasx  2data: 354.987 MiB/s
> lasx  datap: 350.430 MiB/s
> lsx   2data: 340.026 MiB/s
> lsx   datap: 337.318 MiB/s
> intx1 2data: 164.280 MiB/s
> intx1 datap: 187.966 MiB/s

Signed-off-by: WANG Xuerui <git@xen0n.name>
---
 include/linux/raid/pq.h          |   2 +
 lib/raid6/Makefile               |   2 +-
 lib/raid6/algos.c                |   8 +
 lib/raid6/recov_loongarch_simd.c | 515 +++++++++++++++++++++++++++++++
 lib/raid6/test/Makefile          |   2 +-
 5 files changed, 527 insertions(+), 2 deletions(-)
 create mode 100644 lib/raid6/recov_loongarch_simd.c

diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
index 8744474858487..006e18decfad0 100644
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -125,6 +125,8 @@ extern const struct raid6_recov_calls raid6_recov_avx2;
 extern const struct raid6_recov_calls raid6_recov_avx512;
 extern const struct raid6_recov_calls raid6_recov_s390xc;
 extern const struct raid6_recov_calls raid6_recov_neon;
+extern const struct raid6_recov_calls raid6_recov_lsx;
+extern const struct raid6_recov_calls raid6_recov_lasx;
 
 extern const struct raid6_calls raid6_neonx1;
 extern const struct raid6_calls raid6_neonx2;
diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
index 2b9ebe1054806..035b0a4db476a 100644
--- a/lib/raid6/Makefile
+++ b/lib/raid6/Makefile
@@ -9,7 +9,7 @@ raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o \
                               vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
 raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o recov_neon.o recov_neon_inner.o
 raid6_pq-$(CONFIG_S390) += s390vx8.o recov_s390xc.o
-raid6_pq-$(CONFIG_LOONGARCH) += loongarch_simd.o
+raid6_pq-$(CONFIG_LOONGARCH) += loongarch_simd.o recov_loongarch_simd.o
 
 hostprogs	+= mktables
 
diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
index 739c7ebcae1a2..0ec534faf019b 100644
--- a/lib/raid6/algos.c
+++ b/lib/raid6/algos.c
@@ -111,6 +111,14 @@ const struct raid6_recov_calls *const raid6_recov_algos[] = {
 #endif
 #if defined(CONFIG_KERNEL_MODE_NEON)
 	&raid6_recov_neon,
+#endif
+#ifdef CONFIG_LOONGARCH
+#ifdef CONFIG_CPU_HAS_LASX
+	&raid6_recov_lasx,
+#endif
+#ifdef CONFIG_CPU_HAS_LSX
+	&raid6_recov_lsx,
+#endif
 #endif
 	&raid6_recov_intx1,
 	NULL
diff --git a/lib/raid6/recov_loongarch_simd.c b/lib/raid6/recov_loongarch_simd.c
new file mode 100644
index 0000000000000..de9d53e81bd2d
--- /dev/null
+++ b/lib/raid6/recov_loongarch_simd.c
@@ -0,0 +1,515 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * RAID6 recovery algorithms in LoongArch SIMD (LSX & LASX)
+ *
+ * Copyright (C) 2023 WANG Xuerui <git@xen0n.name>
+ *
+ * Originally based on recov_avx2.c and recov_ssse3.c:
+ *
+ * Copyright (C) 2012 Intel Corporation
+ * Author: Jim Kukunas <james.t.kukunas@linux.intel.com>
+ */
+
+#include <linux/raid/pq.h>
+#include "loongarch.h"
+
+/*
+ * Unlike with the syndrome calculation algorithms, there's no boot-time
+ * selection of recovery algorithms by benchmarking, so we have to specify
+ * the priorities and hope the future cores will all have decent vector
+ * support (i.e. no LASX slower than LSX, or even scalar code).
+ */
+
+#ifdef CONFIG_CPU_HAS_LSX
+static int raid6_has_lsx(void)
+{
+	return cpu_has_lsx;
+}
+
+static void raid6_2data_recov_lsx(int disks, size_t bytes, int faila,
+				  int failb, void **ptrs)
+{
+	u8 *p, *q, *dp, *dq;
+	const u8 *pbmul;	/* P multiplier table for B data */
+	const u8 *qmul;		/* Q multiplier table (for both) */
+
+	p = (u8 *)ptrs[disks-2];
+	q = (u8 *)ptrs[disks-1];
+
+	/*
+	 * Compute syndrome with zero for the missing data pages
+	 * Use the dead data pages as temporary storage for
+	 * delta p and delta q
+	 */
+	dp = (u8 *)ptrs[faila];
+	ptrs[faila] = (void *)raid6_empty_zero_page;
+	ptrs[disks-2] = dp;
+	dq = (u8 *)ptrs[failb];
+	ptrs[failb] = (void *)raid6_empty_zero_page;
+	ptrs[disks-1] = dq;
+
+	raid6_call.gen_syndrome(disks, bytes, ptrs);
+
+	/* Restore pointer table */
+	ptrs[faila]   = dp;
+	ptrs[failb]   = dq;
+	ptrs[disks-2] = p;
+	ptrs[disks-1] = q;
+
+	/* Now, pick the proper data tables */
+	pbmul = raid6_vgfmul[raid6_gfexi[failb-faila]];
+	qmul  = raid6_vgfmul[raid6_gfinv[raid6_gfexp[faila] ^
+		raid6_gfexp[failb]]];
+
+	kernel_fpu_begin();
+
+	/*
+	 * vr20, vr21: qmul
+	 * vr22, vr23: pbmul
+	 */
+	asm volatile("vld $vr20, %0" : : "m" (qmul[0]));
+	asm volatile("vld $vr21, %0" : : "m" (qmul[16]));
+	asm volatile("vld $vr22, %0" : : "m" (pbmul[0]));
+	asm volatile("vld $vr23, %0" : : "m" (pbmul[16]));
+
+	while (bytes) {
+		/* vr4 - vr7: Q */
+		asm volatile("vld $vr4, %0" : : "m" (q[0]));
+		asm volatile("vld $vr5, %0" : : "m" (q[16]));
+		asm volatile("vld $vr6, %0" : : "m" (q[32]));
+		asm volatile("vld $vr7, %0" : : "m" (q[48]));
+		/*  vr4 - vr7: Q + Qxy */
+		asm volatile("vld $vr8, %0" : : "m" (dq[0]));
+		asm volatile("vld $vr9, %0" : : "m" (dq[16]));
+		asm volatile("vld $vr10, %0" : : "m" (dq[32]));
+		asm volatile("vld $vr11, %0" : : "m" (dq[48]));
+		asm volatile("vxor.v $vr4, $vr4, $vr8");
+		asm volatile("vxor.v $vr5, $vr5, $vr9");
+		asm volatile("vxor.v $vr6, $vr6, $vr10");
+		asm volatile("vxor.v $vr7, $vr7, $vr11");
+		/* vr0 - vr3: P */
+		asm volatile("vld $vr0, %0" : : "m" (p[0]));
+		asm volatile("vld $vr1, %0" : : "m" (p[16]));
+		asm volatile("vld $vr2, %0" : : "m" (p[32]));
+		asm volatile("vld $vr3, %0" : : "m" (p[48]));
+		/* vr0 - vr3: P + Pxy */
+		asm volatile("vld $vr8, %0" : : "m" (dp[0]));
+		asm volatile("vld $vr9, %0" : : "m" (dp[16]));
+		asm volatile("vld $vr10, %0" : : "m" (dp[32]));
+		asm volatile("vld $vr11, %0" : : "m" (dp[48]));
+		asm volatile("vxor.v $vr0, $vr0, $vr8");
+		asm volatile("vxor.v $vr1, $vr1, $vr9");
+		asm volatile("vxor.v $vr2, $vr2, $vr10");
+		asm volatile("vxor.v $vr3, $vr3, $vr11");
+
+		/* vr8 - vr11: higher 4 bits of each byte of (Q + Qxy) */
+		asm volatile("vsrli.b $vr8, $vr4, 4");
+		asm volatile("vsrli.b $vr9, $vr5, 4");
+		asm volatile("vsrli.b $vr10, $vr6, 4");
+		asm volatile("vsrli.b $vr11, $vr7, 4");
+		/* vr4 - vr7: lower 4 bits of each byte of (Q + Qxy) */
+		asm volatile("vandi.b $vr4, $vr4, 0x0f");
+		asm volatile("vandi.b $vr5, $vr5, 0x0f");
+		asm volatile("vandi.b $vr6, $vr6, 0x0f");
+		asm volatile("vandi.b $vr7, $vr7, 0x0f");
+		/* lookup from qmul[0] */
+		asm volatile("vshuf.b $vr4, $vr20, $vr20, $vr4");
+		asm volatile("vshuf.b $vr5, $vr20, $vr20, $vr5");
+		asm volatile("vshuf.b $vr6, $vr20, $vr20, $vr6");
+		asm volatile("vshuf.b $vr7, $vr20, $vr20, $vr7");
+		/* lookup from qmul[16] */
+		asm volatile("vshuf.b $vr8, $vr21, $vr21, $vr8");
+		asm volatile("vshuf.b $vr9, $vr21, $vr21, $vr9");
+		asm volatile("vshuf.b $vr10, $vr21, $vr21, $vr10");
+		asm volatile("vshuf.b $vr11, $vr21, $vr21, $vr11");
+		/* vr16 - vr19: B(Q + Qxy) */
+		asm volatile("vxor.v $vr16, $vr8, $vr4");
+		asm volatile("vxor.v $vr17, $vr9, $vr5");
+		asm volatile("vxor.v $vr18, $vr10, $vr6");
+		asm volatile("vxor.v $vr19, $vr11, $vr7");
+
+		/* vr4 - vr7: higher 4 bits of each byte of (P + Pxy) */
+		asm volatile("vsrli.b $vr4, $vr0, 4");
+		asm volatile("vsrli.b $vr5, $vr1, 4");
+		asm volatile("vsrli.b $vr6, $vr2, 4");
+		asm volatile("vsrli.b $vr7, $vr3, 4");
+		/* vr12 - vr15: lower 4 bits of each byte of (P + Pxy) */
+		asm volatile("vandi.b $vr12, $vr0, 0x0f");
+		asm volatile("vandi.b $vr13, $vr1, 0x0f");
+		asm volatile("vandi.b $vr14, $vr2, 0x0f");
+		asm volatile("vandi.b $vr15, $vr3, 0x0f");
+		/* lookup from pbmul[0] */
+		asm volatile("vshuf.b $vr12, $vr22, $vr22, $vr12");
+		asm volatile("vshuf.b $vr13, $vr22, $vr22, $vr13");
+		asm volatile("vshuf.b $vr14, $vr22, $vr22, $vr14");
+		asm volatile("vshuf.b $vr15, $vr22, $vr22, $vr15");
+		/* lookup from pbmul[16] */
+		asm volatile("vshuf.b $vr4, $vr23, $vr23, $vr4");
+		asm volatile("vshuf.b $vr5, $vr23, $vr23, $vr5");
+		asm volatile("vshuf.b $vr6, $vr23, $vr23, $vr6");
+		asm volatile("vshuf.b $vr7, $vr23, $vr23, $vr7");
+		/* vr4 - vr7: A(P + Pxy) */
+		asm volatile("vxor.v $vr4, $vr4, $vr12");
+		asm volatile("vxor.v $vr5, $vr5, $vr13");
+		asm volatile("vxor.v $vr6, $vr6, $vr14");
+		asm volatile("vxor.v $vr7, $vr7, $vr15");
+
+		/* vr4 - vr7: A(P + Pxy) + B(Q + Qxy) = Dx */
+		asm volatile("vxor.v $vr4, $vr4, $vr16");
+		asm volatile("vxor.v $vr5, $vr5, $vr17");
+		asm volatile("vxor.v $vr6, $vr6, $vr18");
+		asm volatile("vxor.v $vr7, $vr7, $vr19");
+		asm volatile("vst $vr4, %0" : "=m" (dq[0]));
+		asm volatile("vst $vr5, %0" : "=m" (dq[16]));
+		asm volatile("vst $vr6, %0" : "=m" (dq[32]));
+		asm volatile("vst $vr7, %0" : "=m" (dq[48]));
+
+		/* vr0 - vr3: P + Pxy + Dx = Dy */
+		asm volatile("vxor.v $vr0, $vr0, $vr4");
+		asm volatile("vxor.v $vr1, $vr1, $vr5");
+		asm volatile("vxor.v $vr2, $vr2, $vr6");
+		asm volatile("vxor.v $vr3, $vr3, $vr7");
+		asm volatile("vst $vr0, %0" : "=m" (dp[0]));
+		asm volatile("vst $vr1, %0" : "=m" (dp[16]));
+		asm volatile("vst $vr2, %0" : "=m" (dp[32]));
+		asm volatile("vst $vr3, %0" : "=m" (dp[48]));
+
+		bytes -= 64;
+		p += 64;
+		q += 64;
+		dp += 64;
+		dq += 64;
+	}
+
+	kernel_fpu_end();
+}
+
+static void raid6_datap_recov_lsx(int disks, size_t bytes, int faila,
+				  void **ptrs)
+{
+	u8 *p, *q, *dq;
+	const u8 *qmul;		/* Q multiplier table */
+
+	p = (u8 *)ptrs[disks-2];
+	q = (u8 *)ptrs[disks-1];
+
+	/*
+	 * Compute syndrome with zero for the missing data page
+	 * Use the dead data page as temporary storage for delta q
+	 */
+	dq = (u8 *)ptrs[faila];
+	ptrs[faila] = (void *)raid6_empty_zero_page;
+	ptrs[disks-1] = dq;
+
+	raid6_call.gen_syndrome(disks, bytes, ptrs);
+
+	/* Restore pointer table */
+	ptrs[faila]   = dq;
+	ptrs[disks-1] = q;
+
+	/* Now, pick the proper data tables */
+	qmul  = raid6_vgfmul[raid6_gfinv[raid6_gfexp[faila]]];
+
+	kernel_fpu_begin();
+
+	/* vr22, vr23: qmul */
+	asm volatile("vld $vr22, %0" : : "m" (qmul[0]));
+	asm volatile("vld $vr23, %0" : : "m" (qmul[16]));
+
+	while (bytes) {
+		/* vr0 - vr3: P + Dx */
+		asm volatile("vld $vr0, %0" : : "m" (p[0]));
+		asm volatile("vld $vr1, %0" : : "m" (p[16]));
+		asm volatile("vld $vr2, %0" : : "m" (p[32]));
+		asm volatile("vld $vr3, %0" : : "m" (p[48]));
+		/* vr4 - vr7: Qx */
+		asm volatile("vld $vr4, %0" : : "m" (dq[0]));
+		asm volatile("vld $vr5, %0" : : "m" (dq[16]));
+		asm volatile("vld $vr6, %0" : : "m" (dq[32]));
+		asm volatile("vld $vr7, %0" : : "m" (dq[48]));
+		/* vr4 - vr7: Q + Qx */
+		asm volatile("vld $vr8, %0" : : "m" (q[0]));
+		asm volatile("vld $vr9, %0" : : "m" (q[16]));
+		asm volatile("vld $vr10, %0" : : "m" (q[32]));
+		asm volatile("vld $vr11, %0" : : "m" (q[48]));
+		asm volatile("vxor.v $vr4, $vr4, $vr8");
+		asm volatile("vxor.v $vr5, $vr5, $vr9");
+		asm volatile("vxor.v $vr6, $vr6, $vr10");
+		asm volatile("vxor.v $vr7, $vr7, $vr11");
+
+		/* vr8 - vr11: higher 4 bits of each byte of (Q + Qx) */
+		asm volatile("vsrli.b $vr8, $vr4, 4");
+		asm volatile("vsrli.b $vr9, $vr5, 4");
+		asm volatile("vsrli.b $vr10, $vr6, 4");
+		asm volatile("vsrli.b $vr11, $vr7, 4");
+		/* vr4 - vr7: lower 4 bits of each byte of (Q + Qx) */
+		asm volatile("vandi.b $vr4, $vr4, 0x0f");
+		asm volatile("vandi.b $vr5, $vr5, 0x0f");
+		asm volatile("vandi.b $vr6, $vr6, 0x0f");
+		asm volatile("vandi.b $vr7, $vr7, 0x0f");
+		/* lookup from qmul[0] */
+		asm volatile("vshuf.b $vr4, $vr22, $vr22, $vr4");
+		asm volatile("vshuf.b $vr5, $vr22, $vr22, $vr5");
+		asm volatile("vshuf.b $vr6, $vr22, $vr22, $vr6");
+		asm volatile("vshuf.b $vr7, $vr22, $vr22, $vr7");
+		/* lookup from qmul[16] */
+		asm volatile("vshuf.b $vr8, $vr23, $vr23, $vr8");
+		asm volatile("vshuf.b $vr9, $vr23, $vr23, $vr9");
+		asm volatile("vshuf.b $vr10, $vr23, $vr23, $vr10");
+		asm volatile("vshuf.b $vr11, $vr23, $vr23, $vr11");
+		/* vr4 - vr7: qmul(Q + Qx) = Dx */
+		asm volatile("vxor.v $vr4, $vr4, $vr8");
+		asm volatile("vxor.v $vr5, $vr5, $vr9");
+		asm volatile("vxor.v $vr6, $vr6, $vr10");
+		asm volatile("vxor.v $vr7, $vr7, $vr11");
+		asm volatile("vst $vr4, %0" : "=m" (dq[0]));
+		asm volatile("vst $vr5, %0" : "=m" (dq[16]));
+		asm volatile("vst $vr6, %0" : "=m" (dq[32]));
+		asm volatile("vst $vr7, %0" : "=m" (dq[48]));
+
+		/* vr0 - vr3: P + Dx + Dx = P */
+		asm volatile("vxor.v $vr0, $vr0, $vr4");
+		asm volatile("vxor.v $vr1, $vr1, $vr5");
+		asm volatile("vxor.v $vr2, $vr2, $vr6");
+		asm volatile("vxor.v $vr3, $vr3, $vr7");
+		asm volatile("vst $vr0, %0" : "=m" (p[0]));
+		asm volatile("vst $vr1, %0" : "=m" (p[16]));
+		asm volatile("vst $vr2, %0" : "=m" (p[32]));
+		asm volatile("vst $vr3, %0" : "=m" (p[48]));
+
+		bytes -= 64;
+		p += 64;
+		q += 64;
+		dq += 64;
+	}
+
+	kernel_fpu_end();
+}
+
+const struct raid6_recov_calls raid6_recov_lsx = {
+	.data2 = raid6_2data_recov_lsx,
+	.datap = raid6_datap_recov_lsx,
+	.valid = raid6_has_lsx,
+	.name = "lsx",
+	.priority = 1,
+};
+#endif /* CONFIG_CPU_HAS_LSX */
+
+#ifdef CONFIG_CPU_HAS_LASX
+static int raid6_has_lasx(void)
+{
+	return cpu_has_lasx;
+}
+
+static void raid6_2data_recov_lasx(int disks, size_t bytes, int faila,
+				   int failb, void **ptrs)
+{
+	u8 *p, *q, *dp, *dq;
+	const u8 *pbmul;	/* P multiplier table for B data */
+	const u8 *qmul;		/* Q multiplier table (for both) */
+
+	p = (u8 *)ptrs[disks-2];
+	q = (u8 *)ptrs[disks-1];
+
+	/*
+	 * Compute syndrome with zero for the missing data pages
+	 * Use the dead data pages as temporary storage for
+	 * delta p and delta q
+	 */
+	dp = (u8 *)ptrs[faila];
+	ptrs[faila] = (void *)raid6_empty_zero_page;
+	ptrs[disks-2] = dp;
+	dq = (u8 *)ptrs[failb];
+	ptrs[failb] = (void *)raid6_empty_zero_page;
+	ptrs[disks-1] = dq;
+
+	raid6_call.gen_syndrome(disks, bytes, ptrs);
+
+	/* Restore pointer table */
+	ptrs[faila]   = dp;
+	ptrs[failb]   = dq;
+	ptrs[disks-2] = p;
+	ptrs[disks-1] = q;
+
+	/* Now, pick the proper data tables */
+	pbmul = raid6_vgfmul[raid6_gfexi[failb-faila]];
+	qmul  = raid6_vgfmul[raid6_gfinv[raid6_gfexp[faila] ^
+		raid6_gfexp[failb]]];
+
+	kernel_fpu_begin();
+
+	/*
+	 * xr20, xr21: qmul
+	 * xr22, xr23: pbmul
+	 */
+	asm volatile("vld $vr20, %0" : : "m" (qmul[0]));
+	asm volatile("vld $vr21, %0" : : "m" (qmul[16]));
+	asm volatile("vld $vr22, %0" : : "m" (pbmul[0]));
+	asm volatile("vld $vr23, %0" : : "m" (pbmul[16]));
+	asm volatile("xvreplve0.q $xr20, $xr20");
+	asm volatile("xvreplve0.q $xr21, $xr21");
+	asm volatile("xvreplve0.q $xr22, $xr22");
+	asm volatile("xvreplve0.q $xr23, $xr23");
+
+	while (bytes) {
+		/* xr0, xr1: Q */
+		asm volatile("xvld $xr0, %0" : : "m" (q[0]));
+		asm volatile("xvld $xr1, %0" : : "m" (q[32]));
+		/* xr0, xr1: Q + Qxy */
+		asm volatile("xvld $xr4, %0" : : "m" (dq[0]));
+		asm volatile("xvld $xr5, %0" : : "m" (dq[32]));
+		asm volatile("xvxor.v $xr0, $xr0, $xr4");
+		asm volatile("xvxor.v $xr1, $xr1, $xr5");
+		/* xr2, xr3: P */
+		asm volatile("xvld $xr2, %0" : : "m" (p[0]));
+		asm volatile("xvld $xr3, %0" : : "m" (p[32]));
+		/* xr2, xr3: P + Pxy */
+		asm volatile("xvld $xr4, %0" : : "m" (dp[0]));
+		asm volatile("xvld $xr5, %0" : : "m" (dp[32]));
+		asm volatile("xvxor.v $xr2, $xr2, $xr4");
+		asm volatile("xvxor.v $xr3, $xr3, $xr5");
+
+		/* xr4, xr5: higher 4 bits of each byte of (Q + Qxy) */
+		asm volatile("xvsrli.b $xr4, $xr0, 4");
+		asm volatile("xvsrli.b $xr5, $xr1, 4");
+		/* xr0, xr1: lower 4 bits of each byte of (Q + Qxy) */
+		asm volatile("xvandi.b $xr0, $xr0, 0x0f");
+		asm volatile("xvandi.b $xr1, $xr1, 0x0f");
+		/* lookup from qmul[0] */
+		asm volatile("xvshuf.b $xr0, $xr20, $xr20, $xr0");
+		asm volatile("xvshuf.b $xr1, $xr20, $xr20, $xr1");
+		/* lookup from qmul[16] */
+		asm volatile("xvshuf.b $xr4, $xr21, $xr21, $xr4");
+		asm volatile("xvshuf.b $xr5, $xr21, $xr21, $xr5");
+		/* xr6, xr7: B(Q + Qxy) */
+		asm volatile("xvxor.v $xr6, $xr4, $xr0");
+		asm volatile("xvxor.v $xr7, $xr5, $xr1");
+
+		/* xr4, xr5: higher 4 bits of each byte of (P + Pxy) */
+		asm volatile("xvsrli.b $xr4, $xr2, 4");
+		asm volatile("xvsrli.b $xr5, $xr3, 4");
+		/* xr0, xr1: lower 4 bits of each byte of (P + Pxy) */
+		asm volatile("xvandi.b $xr0, $xr2, 0x0f");
+		asm volatile("xvandi.b $xr1, $xr3, 0x0f");
+		/* lookup from pbmul[0] */
+		asm volatile("xvshuf.b $xr0, $xr22, $xr22, $xr0");
+		asm volatile("xvshuf.b $xr1, $xr22, $xr22, $xr1");
+		/* lookup from pbmul[16] */
+		asm volatile("xvshuf.b $xr4, $xr23, $xr23, $xr4");
+		asm volatile("xvshuf.b $xr5, $xr23, $xr23, $xr5");
+		/* xr0, xr1: A(P + Pxy) */
+		asm volatile("xvxor.v $xr0, $xr0, $xr4");
+		asm volatile("xvxor.v $xr1, $xr1, $xr5");
+
+		/* xr0, xr1: A(P + Pxy) + B(Q + Qxy) = Dx */
+		asm volatile("xvxor.v $xr0, $xr0, $xr6");
+		asm volatile("xvxor.v $xr1, $xr1, $xr7");
+
+		/* xr2, xr3: P + Pxy + Dx = Dy */
+		asm volatile("xvxor.v $xr2, $xr2, $xr0");
+		asm volatile("xvxor.v $xr3, $xr3, $xr1");
+
+		asm volatile("xvst $xr0, %0" : "=m" (dq[0]));
+		asm volatile("xvst $xr1, %0" : "=m" (dq[32]));
+		asm volatile("xvst $xr2, %0" : "=m" (dp[0]));
+		asm volatile("xvst $xr3, %0" : "=m" (dp[32]));
+
+		bytes -= 64;
+		p += 64;
+		q += 64;
+		dp += 64;
+		dq += 64;
+	}
+
+	kernel_fpu_end();
+}
+
+static void raid6_datap_recov_lasx(int disks, size_t bytes, int faila,
+				   void **ptrs)
+{
+	u8 *p, *q, *dq;
+	const u8 *qmul;		/* Q multiplier table */
+
+	p = (u8 *)ptrs[disks-2];
+	q = (u8 *)ptrs[disks-1];
+
+	/*
+	 * Compute syndrome with zero for the missing data page
+	 * Use the dead data page as temporary storage for delta q
+	 */
+	dq = (u8 *)ptrs[faila];
+	ptrs[faila] = (void *)raid6_empty_zero_page;
+	ptrs[disks-1] = dq;
+
+	raid6_call.gen_syndrome(disks, bytes, ptrs);
+
+	/* Restore pointer table */
+	ptrs[faila]   = dq;
+	ptrs[disks-1] = q;
+
+	/* Now, pick the proper data tables */
+	qmul  = raid6_vgfmul[raid6_gfinv[raid6_gfexp[faila]]];
+
+	kernel_fpu_begin();
+
+	/* xr22, xr23: qmul */
+	asm volatile("vld $vr22, %0" : : "m" (qmul[0]));
+	asm volatile("xvreplve0.q $xr22, $xr22");
+	asm volatile("vld $vr23, %0" : : "m" (qmul[16]));
+	asm volatile("xvreplve0.q $xr23, $xr23");
+
+	while (bytes) {
+		/* xr0, xr1: P + Dx */
+		asm volatile("xvld $xr0, %0" : : "m" (p[0]));
+		asm volatile("xvld $xr1, %0" : : "m" (p[32]));
+		/* xr2, xr3: Qx */
+		asm volatile("xvld $xr2, %0" : : "m" (dq[0]));
+		asm volatile("xvld $xr3, %0" : : "m" (dq[32]));
+		/* xr2, xr3: Q + Qx */
+		asm volatile("xvld $xr4, %0" : : "m" (q[0]));
+		asm volatile("xvld $xr5, %0" : : "m" (q[32]));
+		asm volatile("xvxor.v $xr2, $xr2, $xr4");
+		asm volatile("xvxor.v $xr3, $xr3, $xr5");
+
+		/* xr4, xr5: higher 4 bits of each byte of (Q + Qx) */
+		asm volatile("xvsrli.b $xr4, $xr2, 4");
+		asm volatile("xvsrli.b $xr5, $xr3, 4");
+		/* xr2, xr3: lower 4 bits of each byte of (Q + Qx) */
+		asm volatile("xvandi.b $xr2, $xr2, 0x0f");
+		asm volatile("xvandi.b $xr3, $xr3, 0x0f");
+		/* lookup from qmul[0] */
+		asm volatile("xvshuf.b $xr2, $xr22, $xr22, $xr2");
+		asm volatile("xvshuf.b $xr3, $xr22, $xr22, $xr3");
+		/* lookup from qmul[16] */
+		asm volatile("xvshuf.b $xr4, $xr23, $xr23, $xr4");
+		asm volatile("xvshuf.b $xr5, $xr23, $xr23, $xr5");
+		/* xr2, xr3: qmul(Q + Qx) = Dx */
+		asm volatile("xvxor.v $xr2, $xr2, $xr4");
+		asm volatile("xvxor.v $xr3, $xr3, $xr5");
+
+		/* xr0, xr1: P + Dx + Dx = P */
+		asm volatile("xvxor.v $xr0, $xr0, $xr2");
+		asm volatile("xvxor.v $xr1, $xr1, $xr3");
+
+		asm volatile("xvst $xr2, %0" : "=m" (dq[0]));
+		asm volatile("xvst $xr3, %0" : "=m" (dq[32]));
+		asm volatile("xvst $xr0, %0" : "=m" (p[0]));
+		asm volatile("xvst $xr1, %0" : "=m" (p[32]));
+
+		bytes -= 64;
+		p += 64;
+		q += 64;
+		dq += 64;
+	}
+
+	kernel_fpu_end();
+}
+
+const struct raid6_recov_calls raid6_recov_lasx = {
+	.data2 = raid6_2data_recov_lasx,
+	.datap = raid6_datap_recov_lasx,
+	.valid = raid6_has_lasx,
+	.name = "lasx",
+	.priority = 2,
+};
+#endif /* CONFIG_CPU_HAS_LASX */
diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile
index 7b244bce32b3d..2abe0076a636c 100644
--- a/lib/raid6/test/Makefile
+++ b/lib/raid6/test/Makefile
@@ -65,7 +65,7 @@ else ifeq ($(HAS_ALTIVEC),yes)
         OBJS += altivec1.o altivec2.o altivec4.o altivec8.o \
                 vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
 else ifeq ($(ARCH),loongarch64)
-        OBJS += loongarch_simd.o
+        OBJS += loongarch_simd.o recov_loongarch_simd.o
 endif
 
 .c.o:
-- 
2.40.0



* Re: [PATCH v2 3/3] raid6: Add LoongArch SIMD recovery implementation
  2023-08-03 17:08 ` [PATCH v2 3/3] raid6: Add LoongArch SIMD recovery implementation WANG Xuerui
@ 2023-08-03 19:49   ` Paul Menzel
  2023-08-04  6:24     ` WANG Xuerui
  0 siblings, 1 reply; 6+ messages in thread
From: Paul Menzel @ 2023-08-03 19:49 UTC (permalink / raw)
  To: WANG Xuerui
  Cc: Song Liu, Huacai Chen, linux-raid, loongarch, linux-kernel, WANG Xuerui

Dear Xuerui,


Thank you for your patches.


On 03.08.23 at 19:08, WANG Xuerui wrote:
> From: WANG Xuerui <git@xen0n.name>
> 
> Similar to the syndrome calculation, the recovery algorithms also work
> on 64 bytes at a time to align with the L1 cache line size of current
> and future LoongArch cores (that we care about). Which means
> unrolled-by-4 LSX and unrolled-by-2 LASX code.
> 
> The assembly is originally based on the x86 SSSE3/AVX2 ports, but
> register allocation has been redone to take advantage of LSX/LASX's 32
> vector registers, and instruction sequence has been optimized to suit
> (e.g. LoongArch can perform per-byte srl and andi on vectors, but x86
> cannot).
> 
> Performance numbers measured by instrumenting the raid6test code:

It’d be great if you also documented your test setup. That’s always 
good for benchmarking numbers.

>> lasx  2data: 354.987 MiB/s
>> lasx  datap: 350.430 MiB/s
>> lsx   2data: 340.026 MiB/s
>> lsx   datap: 337.318 MiB/s
>> intx1 2data: 164.280 MiB/s
>> intx1 datap: 187.966 MiB/s

So the speed is more than doubled. Nice job! The lasx implementation is 
always the fastest. Is it therefore the preferred one? Or does it come 
with higher power consumption?

> Signed-off-by: WANG Xuerui <git@xen0n.name>

Out of curiosity, what is your “first” name?

> ---
>   include/linux/raid/pq.h          |   2 +
>   lib/raid6/Makefile               |   2 +-
>   lib/raid6/algos.c                |   8 +
>   lib/raid6/recov_loongarch_simd.c | 515 +++++++++++++++++++++++++++++++
>   lib/raid6/test/Makefile          |   2 +-
>   5 files changed, 527 insertions(+), 2 deletions(-)
>   create mode 100644 lib/raid6/recov_loongarch_simd.c


Kind regards,

Paul


> diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
> index 8744474858487..006e18decfad0 100644
> --- a/include/linux/raid/pq.h
> +++ b/include/linux/raid/pq.h
> @@ -125,6 +125,8 @@ extern const struct raid6_recov_calls raid6_recov_avx2;
>   extern const struct raid6_recov_calls raid6_recov_avx512;
>   extern const struct raid6_recov_calls raid6_recov_s390xc;
>   extern const struct raid6_recov_calls raid6_recov_neon;
> +extern const struct raid6_recov_calls raid6_recov_lsx;
> +extern const struct raid6_recov_calls raid6_recov_lasx;
>   
>   extern const struct raid6_calls raid6_neonx1;
>   extern const struct raid6_calls raid6_neonx2;
> diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
> index 2b9ebe1054806..035b0a4db476a 100644
> --- a/lib/raid6/Makefile
> +++ b/lib/raid6/Makefile
> @@ -9,7 +9,7 @@ raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o \
>                                 vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
>   raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o recov_neon.o recov_neon_inner.o
>   raid6_pq-$(CONFIG_S390) += s390vx8.o recov_s390xc.o
> -raid6_pq-$(CONFIG_LOONGARCH) += loongarch_simd.o
> +raid6_pq-$(CONFIG_LOONGARCH) += loongarch_simd.o recov_loongarch_simd.o
>   
>   hostprogs	+= mktables
>   
> diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
> index 739c7ebcae1a2..0ec534faf019b 100644
> --- a/lib/raid6/algos.c
> +++ b/lib/raid6/algos.c
> @@ -111,6 +111,14 @@ const struct raid6_recov_calls *const raid6_recov_algos[] = {
>   #endif
>   #if defined(CONFIG_KERNEL_MODE_NEON)
>   	&raid6_recov_neon,
> +#endif
> +#ifdef CONFIG_LOONGARCH
> +#ifdef CONFIG_CPU_HAS_LASX
> +	&raid6_recov_lasx,
> +#endif
> +#ifdef CONFIG_CPU_HAS_LSX
> +	&raid6_recov_lsx,
> +#endif
>   #endif
>   	&raid6_recov_intx1,
>   	NULL
> diff --git a/lib/raid6/recov_loongarch_simd.c b/lib/raid6/recov_loongarch_simd.c
> new file mode 100644
> index 0000000000000..de9d53e81bd2d
> --- /dev/null
> +++ b/lib/raid6/recov_loongarch_simd.c
> @@ -0,0 +1,515 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * RAID6 recovery algorithms in LoongArch SIMD (LSX & LASX)
> + *
> + * Copyright (C) 2023 WANG Xuerui <git@xen0n.name>
> + *
> + * Originally based on recov_avx2.c and recov_ssse3.c:
> + *
> + * Copyright (C) 2012 Intel Corporation
> + * Author: Jim Kukunas <james.t.kukunas@linux.intel.com>
> + */
> +
> +#include <linux/raid/pq.h>
> +#include "loongarch.h"
> +
> +/*
> + * Unlike with the syndrome calculation algorithms, there's no boot-time
> + * selection of recovery algorithms by benchmarking, so we have to specify
> + * the priorities and hope the future cores will all have decent vector
> + * support (i.e. no LASX slower than LSX, or even scalar code).
> + */
> +
> +#ifdef CONFIG_CPU_HAS_LSX
> +static int raid6_has_lsx(void)
> +{
> +	return cpu_has_lsx;
> +}
> +
> +static void raid6_2data_recov_lsx(int disks, size_t bytes, int faila,
> +				  int failb, void **ptrs)
> +{
> +	u8 *p, *q, *dp, *dq;
> +	const u8 *pbmul;	/* P multiplier table for B data */
> +	const u8 *qmul;		/* Q multiplier table (for both) */
> +
> +	p = (u8 *)ptrs[disks-2];
> +	q = (u8 *)ptrs[disks-1];
> +
> +	/*
> +	 * Compute syndrome with zero for the missing data pages
> +	 * Use the dead data pages as temporary storage for
> +	 * delta p and delta q
> +	 */
> +	dp = (u8 *)ptrs[faila];
> +	ptrs[faila] = (void *)raid6_empty_zero_page;
> +	ptrs[disks-2] = dp;
> +	dq = (u8 *)ptrs[failb];
> +	ptrs[failb] = (void *)raid6_empty_zero_page;
> +	ptrs[disks-1] = dq;
> +
> +	raid6_call.gen_syndrome(disks, bytes, ptrs);
> +
> +	/* Restore pointer table */
> +	ptrs[faila]   = dp;
> +	ptrs[failb]   = dq;
> +	ptrs[disks-2] = p;
> +	ptrs[disks-1] = q;
> +
> +	/* Now, pick the proper data tables */
> +	pbmul = raid6_vgfmul[raid6_gfexi[failb-faila]];

Should spaces be put around the operator?

> +	qmul  = raid6_vgfmul[raid6_gfinv[raid6_gfexp[faila] ^
> +		raid6_gfexp[failb]]];
> +
> +	kernel_fpu_begin();
> +
> +	/*
> +	 * vr20, vr21: qmul
> +	 * vr22, vr23: pbmul
> +	 */
> +	asm volatile("vld $vr20, %0" : : "m" (qmul[0]));
> +	asm volatile("vld $vr21, %0" : : "m" (qmul[16]));
> +	asm volatile("vld $vr22, %0" : : "m" (pbmul[0]));
> +	asm volatile("vld $vr23, %0" : : "m" (pbmul[16]));
> +
> +	while (bytes) {
> +		/* vr4 - vr7: Q */
> +		asm volatile("vld $vr4, %0" : : "m" (q[0]));
> +		asm volatile("vld $vr5, %0" : : "m" (q[16]));
> +		asm volatile("vld $vr6, %0" : : "m" (q[32]));
> +		asm volatile("vld $vr7, %0" : : "m" (q[48]));
> +		/*  vr4 - vr7: Q + Qxy */
> +		asm volatile("vld $vr8, %0" : : "m" (dq[0]));
> +		asm volatile("vld $vr9, %0" : : "m" (dq[16]));
> +		asm volatile("vld $vr10, %0" : : "m" (dq[32]));
> +		asm volatile("vld $vr11, %0" : : "m" (dq[48]));
> +		asm volatile("vxor.v $vr4, $vr4, $vr8");
> +		asm volatile("vxor.v $vr5, $vr5, $vr9");
> +		asm volatile("vxor.v $vr6, $vr6, $vr10");
> +		asm volatile("vxor.v $vr7, $vr7, $vr11");
> +		/* vr0 - vr3: P */
> +		asm volatile("vld $vr0, %0" : : "m" (p[0]));
> +		asm volatile("vld $vr1, %0" : : "m" (p[16]));
> +		asm volatile("vld $vr2, %0" : : "m" (p[32]));
> +		asm volatile("vld $vr3, %0" : : "m" (p[48]));
> +		/* vr0 - vr3: P + Pxy */
> +		asm volatile("vld $vr8, %0" : : "m" (dp[0]));
> +		asm volatile("vld $vr9, %0" : : "m" (dp[16]));
> +		asm volatile("vld $vr10, %0" : : "m" (dp[32]));
> +		asm volatile("vld $vr11, %0" : : "m" (dp[48]));
> +		asm volatile("vxor.v $vr0, $vr0, $vr8");
> +		asm volatile("vxor.v $vr1, $vr1, $vr9");
> +		asm volatile("vxor.v $vr2, $vr2, $vr10");
> +		asm volatile("vxor.v $vr3, $vr3, $vr11");
> +
> +		/* vr8 - vr11: higher 4 bits of each byte of (Q + Qxy) */
> +		asm volatile("vsrli.b $vr8, $vr4, 4");
> +		asm volatile("vsrli.b $vr9, $vr5, 4");
> +		asm volatile("vsrli.b $vr10, $vr6, 4");
> +		asm volatile("vsrli.b $vr11, $vr7, 4");
> +		/* vr4 - vr7: lower 4 bits of each byte of (Q + Qxy) */
> +		asm volatile("vandi.b $vr4, $vr4, 0x0f");
> +		asm volatile("vandi.b $vr5, $vr5, 0x0f");
> +		asm volatile("vandi.b $vr6, $vr6, 0x0f");
> +		asm volatile("vandi.b $vr7, $vr7, 0x0f");
> +		/* lookup from qmul[0] */
> +		asm volatile("vshuf.b $vr4, $vr20, $vr20, $vr4");
> +		asm volatile("vshuf.b $vr5, $vr20, $vr20, $vr5");
> +		asm volatile("vshuf.b $vr6, $vr20, $vr20, $vr6");
> +		asm volatile("vshuf.b $vr7, $vr20, $vr20, $vr7");
> +		/* lookup from qmul[16] */
> +		asm volatile("vshuf.b $vr8, $vr21, $vr21, $vr8");
> +		asm volatile("vshuf.b $vr9, $vr21, $vr21, $vr9");
> +		asm volatile("vshuf.b $vr10, $vr21, $vr21, $vr10");
> +		asm volatile("vshuf.b $vr11, $vr21, $vr21, $vr11");
> +		/* vr16 - vr19: B(Q + Qxy) */
> +		asm volatile("vxor.v $vr16, $vr8, $vr4");
> +		asm volatile("vxor.v $vr17, $vr9, $vr5");
> +		asm volatile("vxor.v $vr18, $vr10, $vr6");
> +		asm volatile("vxor.v $vr19, $vr11, $vr7");
> +
> +		/* vr4 - vr7: higher 4 bits of each byte of (P + Pxy) */
> +		asm volatile("vsrli.b $vr4, $vr0, 4");
> +		asm volatile("vsrli.b $vr5, $vr1, 4");
> +		asm volatile("vsrli.b $vr6, $vr2, 4");
> +		asm volatile("vsrli.b $vr7, $vr3, 4");
> +		/* vr12 - vr15: lower 4 bits of each byte of (P + Pxy) */
> +		asm volatile("vandi.b $vr12, $vr0, 0x0f");
> +		asm volatile("vandi.b $vr13, $vr1, 0x0f");
> +		asm volatile("vandi.b $vr14, $vr2, 0x0f");
> +		asm volatile("vandi.b $vr15, $vr3, 0x0f");
> +		/* lookup from pbmul[0] */
> +		asm volatile("vshuf.b $vr12, $vr22, $vr22, $vr12");
> +		asm volatile("vshuf.b $vr13, $vr22, $vr22, $vr13");
> +		asm volatile("vshuf.b $vr14, $vr22, $vr22, $vr14");
> +		asm volatile("vshuf.b $vr15, $vr22, $vr22, $vr15");
> +		/* lookup from pbmul[16] */
> +		asm volatile("vshuf.b $vr4, $vr23, $vr23, $vr4");
> +		asm volatile("vshuf.b $vr5, $vr23, $vr23, $vr5");
> +		asm volatile("vshuf.b $vr6, $vr23, $vr23, $vr6");
> +		asm volatile("vshuf.b $vr7, $vr23, $vr23, $vr7");
> +		/* vr4 - vr7: A(P + Pxy) */
> +		asm volatile("vxor.v $vr4, $vr4, $vr12");
> +		asm volatile("vxor.v $vr5, $vr5, $vr13");
> +		asm volatile("vxor.v $vr6, $vr6, $vr14");
> +		asm volatile("vxor.v $vr7, $vr7, $vr15");
> +
> +		/* vr4 - vr7: A(P + Pxy) + B(Q + Qxy) = Dx */
> +		asm volatile("vxor.v $vr4, $vr4, $vr16");
> +		asm volatile("vxor.v $vr5, $vr5, $vr17");
> +		asm volatile("vxor.v $vr6, $vr6, $vr18");
> +		asm volatile("vxor.v $vr7, $vr7, $vr19");
> +		asm volatile("vst $vr4, %0" : "=m" (dq[0]));
> +		asm volatile("vst $vr5, %0" : "=m" (dq[16]));
> +		asm volatile("vst $vr6, %0" : "=m" (dq[32]));
> +		asm volatile("vst $vr7, %0" : "=m" (dq[48]));
> +
> +		/* vr0 - vr3: P + Pxy + Dx = Dy */
> +		asm volatile("vxor.v $vr0, $vr0, $vr4");
> +		asm volatile("vxor.v $vr1, $vr1, $vr5");
> +		asm volatile("vxor.v $vr2, $vr2, $vr6");
> +		asm volatile("vxor.v $vr3, $vr3, $vr7");
> +		asm volatile("vst $vr0, %0" : "=m" (dp[0]));
> +		asm volatile("vst $vr1, %0" : "=m" (dp[16]));
> +		asm volatile("vst $vr2, %0" : "=m" (dp[32]));
> +		asm volatile("vst $vr3, %0" : "=m" (dp[48]));
> +
> +		bytes -= 64;
> +		p += 64;
> +		q += 64;
> +		dp += 64;
> +		dq += 64;
> +	}
> +
> +	kernel_fpu_end();
> +}
> +
> +static void raid6_datap_recov_lsx(int disks, size_t bytes, int faila,
> +				  void **ptrs)
> +{
> +	u8 *p, *q, *dq;
> +	const u8 *qmul;		/* Q multiplier table */
> +
> +	p = (u8 *)ptrs[disks-2];
> +	q = (u8 *)ptrs[disks-1];
> +
> +	/*
> +	 * Compute syndrome with zero for the missing data page
> +	 * Use the dead data page as temporary storage for delta q
> +	 */
> +	dq = (u8 *)ptrs[faila];
> +	ptrs[faila] = (void *)raid6_empty_zero_page;
> +	ptrs[disks-1] = dq;
> +
> +	raid6_call.gen_syndrome(disks, bytes, ptrs);
> +
> +	/* Restore pointer table */
> +	ptrs[faila]   = dq;
> +	ptrs[disks-1] = q;
> +
> +	/* Now, pick the proper data tables */
> +	qmul  = raid6_vgfmul[raid6_gfinv[raid6_gfexp[faila]]];

Only one space after qmul?

> +
> +	kernel_fpu_begin();
> +
> +	/* vr22, vr23: qmul */
> +	asm volatile("vld $vr22, %0" : : "m" (qmul[0]));
> +	asm volatile("vld $vr23, %0" : : "m" (qmul[16]));
> +
> +	while (bytes) {
> +		/* vr0 - vr3: P + Dx */
> +		asm volatile("vld $vr0, %0" : : "m" (p[0]));
> +		asm volatile("vld $vr1, %0" : : "m" (p[16]));
> +		asm volatile("vld $vr2, %0" : : "m" (p[32]));
> +		asm volatile("vld $vr3, %0" : : "m" (p[48]));
> +		/* vr4 - vr7: Qx */
> +		asm volatile("vld $vr4, %0" : : "m" (dq[0]));
> +		asm volatile("vld $vr5, %0" : : "m" (dq[16]));
> +		asm volatile("vld $vr6, %0" : : "m" (dq[32]));
> +		asm volatile("vld $vr7, %0" : : "m" (dq[48]));
> +		/* vr4 - vr7: Q + Qx */
> +		asm volatile("vld $vr8, %0" : : "m" (q[0]));
> +		asm volatile("vld $vr9, %0" : : "m" (q[16]));
> +		asm volatile("vld $vr10, %0" : : "m" (q[32]));
> +		asm volatile("vld $vr11, %0" : : "m" (q[48]));
> +		asm volatile("vxor.v $vr4, $vr4, $vr8");
> +		asm volatile("vxor.v $vr5, $vr5, $vr9");
> +		asm volatile("vxor.v $vr6, $vr6, $vr10");
> +		asm volatile("vxor.v $vr7, $vr7, $vr11");
> +
> +		/* vr8 - vr11: higher 4 bits of each byte of (Q + Qx) */
> +		asm volatile("vsrli.b $vr8, $vr4, 4");
> +		asm volatile("vsrli.b $vr9, $vr5, 4");
> +		asm volatile("vsrli.b $vr10, $vr6, 4");
> +		asm volatile("vsrli.b $vr11, $vr7, 4");
> +		/* vr4 - vr7: lower 4 bits of each byte of (Q + Qx) */
> +		asm volatile("vandi.b $vr4, $vr4, 0x0f");
> +		asm volatile("vandi.b $vr5, $vr5, 0x0f");
> +		asm volatile("vandi.b $vr6, $vr6, 0x0f");
> +		asm volatile("vandi.b $vr7, $vr7, 0x0f");
> +		/* lookup from qmul[0] */
> +		asm volatile("vshuf.b $vr4, $vr22, $vr22, $vr4");
> +		asm volatile("vshuf.b $vr5, $vr22, $vr22, $vr5");
> +		asm volatile("vshuf.b $vr6, $vr22, $vr22, $vr6");
> +		asm volatile("vshuf.b $vr7, $vr22, $vr22, $vr7");
> +		/* lookup from qmul[16] */
> +		asm volatile("vshuf.b $vr8, $vr23, $vr23, $vr8");
> +		asm volatile("vshuf.b $vr9, $vr23, $vr23, $vr9");
> +		asm volatile("vshuf.b $vr10, $vr23, $vr23, $vr10");
> +		asm volatile("vshuf.b $vr11, $vr23, $vr23, $vr11");
> +		/* vr4 - vr7: qmul(Q + Qx) = Dx */
> +		asm volatile("vxor.v $vr4, $vr4, $vr8");
> +		asm volatile("vxor.v $vr5, $vr5, $vr9");
> +		asm volatile("vxor.v $vr6, $vr6, $vr10");
> +		asm volatile("vxor.v $vr7, $vr7, $vr11");
> +		asm volatile("vst $vr4, %0" : "=m" (dq[0]));
> +		asm volatile("vst $vr5, %0" : "=m" (dq[16]));
> +		asm volatile("vst $vr6, %0" : "=m" (dq[32]));
> +		asm volatile("vst $vr7, %0" : "=m" (dq[48]));
> +
> +		/* vr0 - vr3: P + Dx + Dx = P */
> +		asm volatile("vxor.v $vr0, $vr0, $vr4");
> +		asm volatile("vxor.v $vr1, $vr1, $vr5");
> +		asm volatile("vxor.v $vr2, $vr2, $vr6");
> +		asm volatile("vxor.v $vr3, $vr3, $vr7");
> +		asm volatile("vst $vr0, %0" : "=m" (p[0]));
> +		asm volatile("vst $vr1, %0" : "=m" (p[16]));
> +		asm volatile("vst $vr2, %0" : "=m" (p[32]));
> +		asm volatile("vst $vr3, %0" : "=m" (p[48]));
> +
> +		bytes -= 64;
> +		p += 64;
> +		q += 64;
> +		dq += 64;
> +	}
> +
> +	kernel_fpu_end();
> +}
> +
> +const struct raid6_recov_calls raid6_recov_lsx = {
> +	.data2 = raid6_2data_recov_lsx,
> +	.datap = raid6_datap_recov_lsx,
> +	.valid = raid6_has_lsx,
> +	.name = "lsx",
> +	.priority = 1,
> +};
> +#endif /* CONFIG_CPU_HAS_LSX */
> +
> +#ifdef CONFIG_CPU_HAS_LASX
> +static int raid6_has_lasx(void)
> +{
> +	return cpu_has_lasx;
> +}
> +
> +static void raid6_2data_recov_lasx(int disks, size_t bytes, int faila,
> +				   int failb, void **ptrs)
> +{
> +	u8 *p, *q, *dp, *dq;
> +	const u8 *pbmul;	/* P multiplier table for B data */
> +	const u8 *qmul;		/* Q multiplier table (for both) */
> +
> +	p = (u8 *)ptrs[disks-2];
> +	q = (u8 *)ptrs[disks-1];
> +
> +	/*
> +	 * Compute syndrome with zero for the missing data pages
> +	 * Use the dead data pages as temporary storage for
> +	 * delta p and delta q
> +	 */
> +	dp = (u8 *)ptrs[faila];
> +	ptrs[faila] = (void *)raid6_empty_zero_page;
> +	ptrs[disks-2] = dp;
> +	dq = (u8 *)ptrs[failb];
> +	ptrs[failb] = (void *)raid6_empty_zero_page;
> +	ptrs[disks-1] = dq;
> +
> +	raid6_call.gen_syndrome(disks, bytes, ptrs);
> +
> +	/* Restore pointer table */
> +	ptrs[faila]   = dp;
> +	ptrs[failb]   = dq;
> +	ptrs[disks-2] = p;
> +	ptrs[disks-1] = q;
> +
> +	/* Now, pick the proper data tables */
> +	pbmul = raid6_vgfmul[raid6_gfexi[failb-faila]];

Ditto.

> +	qmul  = raid6_vgfmul[raid6_gfinv[raid6_gfexp[faila] ^
> +		raid6_gfexp[failb]]];
> +
> +	kernel_fpu_begin();
> +
> +	/*
> +	 * xr20, xr21: qmul
> +	 * xr22, xr23: pbmul
> +	 */
> +	asm volatile("vld $vr20, %0" : : "m" (qmul[0]));
> +	asm volatile("vld $vr21, %0" : : "m" (qmul[16]));
> +	asm volatile("vld $vr22, %0" : : "m" (pbmul[0]));
> +	asm volatile("vld $vr23, %0" : : "m" (pbmul[16]));
> +	asm volatile("xvreplve0.q $xr20, $xr20");
> +	asm volatile("xvreplve0.q $xr21, $xr21");
> +	asm volatile("xvreplve0.q $xr22, $xr22");
> +	asm volatile("xvreplve0.q $xr23, $xr23");
> +
> +	while (bytes) {
> +		/* xr0, xr1: Q */
> +		asm volatile("xvld $xr0, %0" : : "m" (q[0]));
> +		asm volatile("xvld $xr1, %0" : : "m" (q[32]));
> +		/* xr0, xr1: Q + Qxy */
> +		asm volatile("xvld $xr4, %0" : : "m" (dq[0]));
> +		asm volatile("xvld $xr5, %0" : : "m" (dq[32]));
> +		asm volatile("xvxor.v $xr0, $xr0, $xr4");
> +		asm volatile("xvxor.v $xr1, $xr1, $xr5");
> +		/* xr2, xr3: P */
> +		asm volatile("xvld $xr2, %0" : : "m" (p[0]));
> +		asm volatile("xvld $xr3, %0" : : "m" (p[32]));
> +		/* xr2, xr3: P + Pxy */
> +		asm volatile("xvld $xr4, %0" : : "m" (dp[0]));
> +		asm volatile("xvld $xr5, %0" : : "m" (dp[32]));
> +		asm volatile("xvxor.v $xr2, $xr2, $xr4");
> +		asm volatile("xvxor.v $xr3, $xr3, $xr5");
> +
> +		/* xr4, xr5: higher 4 bits of each byte of (Q + Qxy) */
> +		asm volatile("xvsrli.b $xr4, $xr0, 4");
> +		asm volatile("xvsrli.b $xr5, $xr1, 4");
> +		/* xr0, xr1: lower 4 bits of each byte of (Q + Qxy) */
> +		asm volatile("xvandi.b $xr0, $xr0, 0x0f");
> +		asm volatile("xvandi.b $xr1, $xr1, 0x0f");
> +		/* lookup from qmul[0] */
> +		asm volatile("xvshuf.b $xr0, $xr20, $xr20, $xr0");
> +		asm volatile("xvshuf.b $xr1, $xr20, $xr20, $xr1");
> +		/* lookup from qmul[16] */
> +		asm volatile("xvshuf.b $xr4, $xr21, $xr21, $xr4");
> +		asm volatile("xvshuf.b $xr5, $xr21, $xr21, $xr5");
> +		/* xr6, xr7: B(Q + Qxy) */
> +		asm volatile("xvxor.v $xr6, $xr4, $xr0");
> +		asm volatile("xvxor.v $xr7, $xr5, $xr1");
> +
> +		/* xr4, xr5: higher 4 bits of each byte of (P + Pxy) */
> +		asm volatile("xvsrli.b $xr4, $xr2, 4");
> +		asm volatile("xvsrli.b $xr5, $xr3, 4");
> +		/* xr0, xr1: lower 4 bits of each byte of (P + Pxy) */
> +		asm volatile("xvandi.b $xr0, $xr2, 0x0f");
> +		asm volatile("xvandi.b $xr1, $xr3, 0x0f");
> +		/* lookup from pbmul[0] */
> +		asm volatile("xvshuf.b $xr0, $xr22, $xr22, $xr0");
> +		asm volatile("xvshuf.b $xr1, $xr22, $xr22, $xr1");
> +		/* lookup from pbmul[16] */
> +		asm volatile("xvshuf.b $xr4, $xr23, $xr23, $xr4");
> +		asm volatile("xvshuf.b $xr5, $xr23, $xr23, $xr5");
> +		/* xr0, xr1: A(P + Pxy) */
> +		asm volatile("xvxor.v $xr0, $xr0, $xr4");
> +		asm volatile("xvxor.v $xr1, $xr1, $xr5");
> +
> +		/* xr0, xr1: A(P + Pxy) + B(Q + Qxy) = Dx */
> +		asm volatile("xvxor.v $xr0, $xr0, $xr6");
> +		asm volatile("xvxor.v $xr1, $xr1, $xr7");
> +
> +		/* xr2, xr3: P + Pxy + Dx = Dy */
> +		asm volatile("xvxor.v $xr2, $xr2, $xr0");
> +		asm volatile("xvxor.v $xr3, $xr3, $xr1");
> +
> +		asm volatile("xvst $xr0, %0" : "=m" (dq[0]));
> +		asm volatile("xvst $xr1, %0" : "=m" (dq[32]));
> +		asm volatile("xvst $xr2, %0" : "=m" (dp[0]));
> +		asm volatile("xvst $xr3, %0" : "=m" (dp[32]));
> +
> +		bytes -= 64;
> +		p += 64;
> +		q += 64;
> +		dp += 64;
> +		dq += 64;
> +	}
> +
> +	kernel_fpu_end();
> +}
> +
> +static void raid6_datap_recov_lasx(int disks, size_t bytes, int faila,
> +				   void **ptrs)
> +{
> +	u8 *p, *q, *dq;
> +	const u8 *qmul;		/* Q multiplier table */
> +
> +	p = (u8 *)ptrs[disks-2];
> +	q = (u8 *)ptrs[disks-1];
> +
> +	/*
> +	 * Compute syndrome with zero for the missing data page
> +	 * Use the dead data page as temporary storage for delta q
> +	 */
> +	dq = (u8 *)ptrs[faila];
> +	ptrs[faila] = (void *)raid6_empty_zero_page;
> +	ptrs[disks-1] = dq;
> +
> +	raid6_call.gen_syndrome(disks, bytes, ptrs);
> +
> +	/* Restore pointer table */
> +	ptrs[faila]   = dq;
> +	ptrs[disks-1] = q;
> +
> +	/* Now, pick the proper data tables */
> +	qmul  = raid6_vgfmul[raid6_gfinv[raid6_gfexp[faila]]];

Ditto.

> +
> +	kernel_fpu_begin();
> +
> +	/* xr22, xr23: qmul */
> +	asm volatile("vld $vr22, %0" : : "m" (qmul[0]));
> +	asm volatile("xvreplve0.q $xr22, $xr22");
> +	asm volatile("vld $vr23, %0" : : "m" (qmul[16]));
> +	asm volatile("xvreplve0.q $xr23, $xr23");
> +
> +	while (bytes) {
> +		/* xr0, xr1: P + Dx */
> +		asm volatile("xvld $xr0, %0" : : "m" (p[0]));
> +		asm volatile("xvld $xr1, %0" : : "m" (p[32]));
> +		/* xr2, xr3: Qx */
> +		asm volatile("xvld $xr2, %0" : : "m" (dq[0]));
> +		asm volatile("xvld $xr3, %0" : : "m" (dq[32]));
> +		/* xr2, xr3: Q + Qx */
> +		asm volatile("xvld $xr4, %0" : : "m" (q[0]));
> +		asm volatile("xvld $xr5, %0" : : "m" (q[32]));
> +		asm volatile("xvxor.v $xr2, $xr2, $xr4");
> +		asm volatile("xvxor.v $xr3, $xr3, $xr5");
> +
> +		/* xr4, xr5: higher 4 bits of each byte of (Q + Qx) */
> +		asm volatile("xvsrli.b $xr4, $xr2, 4");
> +		asm volatile("xvsrli.b $xr5, $xr3, 4");
> +		/* xr2, xr3: lower 4 bits of each byte of (Q + Qx) */
> +		asm volatile("xvandi.b $xr2, $xr2, 0x0f");
> +		asm volatile("xvandi.b $xr3, $xr3, 0x0f");
> +		/* lookup from qmul[0] */
> +		asm volatile("xvshuf.b $xr2, $xr22, $xr22, $xr2");
> +		asm volatile("xvshuf.b $xr3, $xr22, $xr22, $xr3");
> +		/* lookup from qmul[16] */
> +		asm volatile("xvshuf.b $xr4, $xr23, $xr23, $xr4");
> +		asm volatile("xvshuf.b $xr5, $xr23, $xr23, $xr5");
> +		/* xr2, xr3: qmul(Q + Qx) = Dx */
> +		asm volatile("xvxor.v $xr2, $xr2, $xr4");
> +		asm volatile("xvxor.v $xr3, $xr3, $xr5");
> +
> +		/* xr0, xr1: P + Dx + Dx = P */
> +		asm volatile("xvxor.v $xr0, $xr0, $xr2");
> +		asm volatile("xvxor.v $xr1, $xr1, $xr3");
> +
> +		asm volatile("xvst $xr2, %0" : "=m" (dq[0]));
> +		asm volatile("xvst $xr3, %0" : "=m" (dq[32]));
> +		asm volatile("xvst $xr0, %0" : "=m" (p[0]));
> +		asm volatile("xvst $xr1, %0" : "=m" (p[32]));
> +
> +		bytes -= 64;
> +		p += 64;
> +		q += 64;
> +		dq += 64;
> +	}
> +
> +	kernel_fpu_end();
> +}
> +
> +const struct raid6_recov_calls raid6_recov_lasx = {
> +	.data2 = raid6_2data_recov_lasx,
> +	.datap = raid6_datap_recov_lasx,
> +	.valid = raid6_has_lasx,
> +	.name = "lasx",
> +	.priority = 2,
> +};
> +#endif /* CONFIG_CPU_HAS_LASX */
> diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile
> index 7b244bce32b3d..2abe0076a636c 100644
> --- a/lib/raid6/test/Makefile
> +++ b/lib/raid6/test/Makefile
> @@ -65,7 +65,7 @@ else ifeq ($(HAS_ALTIVEC),yes)
>           OBJS += altivec1.o altivec2.o altivec4.o altivec8.o \
>                   vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
>   else ifeq ($(ARCH),loongarch64)
> -        OBJS += loongarch_simd.o
> +        OBJS += loongarch_simd.o recov_loongarch_simd.o
>   endif
>   
>   .c.o:


Kind regards,

Paul


PS: I brought up the raid speed tests in the past, and Borislav called 
them a random number generator [1]. ;-)


[1]: https://lore.kernel.org/all/20210406124126.GM17806@zn.tnic/


* Re: [PATCH v2 3/3] raid6: Add LoongArch SIMD recovery implementation
  2023-08-03 19:49   ` Paul Menzel
@ 2023-08-04  6:24     ` WANG Xuerui
  0 siblings, 0 replies; 6+ messages in thread
From: WANG Xuerui @ 2023-08-04  6:24 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Song Liu, Huacai Chen, linux-raid, loongarch, linux-kernel, WANG Xuerui

Hi,

On 2023/8/4 03:49, Paul Menzel wrote:
> Dear Xuerui,
> 
> 
> Thank you for your patches.
> 
> 
> On 03.08.23 at 19:08, WANG Xuerui wrote:
>> From: WANG Xuerui <git@xen0n.name>
>>
>> Similar to the syndrome calculation, the recovery algorithms also work
>> on 64 bytes at a time to align with the L1 cache line size of current
>> and future LoongArch cores (that we care about). Which means
>> unrolled-by-4 LSX and unrolled-by-2 LASX code.
>>
>> The assembly is originally based on the x86 SSSE3/AVX2 ports, but
>> register allocation has been redone to take advantage of LSX/LASX's 32
>> vector registers, and instruction sequence has been optimized to suit
>> (e.g. LoongArch can perform per-byte srl and andi on vectors, but x86
>> cannot).
>>
>> Performance numbers measured by instrumenting the raid6test code:
> 
> It’d be great if you also documented your test setup. That’s always 
> good for benchmarking numbers.
> 

Ah, they're the same as described in the previous patches: a Loongson 
3A5000 + 7A1000 board (the Lemote A2101 board, to be precise), with the 
3A5000 clocked at 2.5 GHz. I'll amend the description in v3.

>>> lasx  2data: 354.987 MiB/s
>>> lasx  datap: 350.430 MiB/s
>>> lsx   2data: 340.026 MiB/s
>>> lsx   datap: 337.318 MiB/s
>>> intx1 2data: 164.280 MiB/s
>>> intx1 datap: 187.966 MiB/s
> 
> So the speed is more than doubled. Nice job! The lasx implementation is 
> always the fastest. Is it therefore the preferred one? Or does it come 
> with higher power consumption?

According to my experiments and other public info regarding the LA464 
micro-architecture, such as [1], LASX should indeed be the preferred 
one on LA464, with no power-consumption worries.

(Both LASX and LSX are handled by the same vector unit on LA464, so 
whenever LASX is available it should be preferred; it's LSX that would 
be purely wasteful in this case, since a full 256-bit result gets 
computed regardless.)

[1]: 
https://chipsandcheese.com/2023/04/09/loongsons-3a5000-chinas-best-shot/
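
In case the selection mechanism isn't obvious from the patch alone: 
IIRC the generic code just scans raid6_recov_algos[] and keeps the 
highest-priority entry whose ->valid() hook reports support, which is 
why lasx (priority 2) beats lsx (priority 1) whenever both are usable. 
A minimal user-space sketch of that idea (simplified for illustration, 
not a verbatim copy of lib/raid6/algos.c) would be:

#include <stdio.h>
#include <stddef.h>

struct recov_calls {
	int (*valid)(void);	/* NULL means "always usable" */
	const char *name;
	int priority;		/* higher wins */
};

/* Stand-ins for the real cpu_has_lsx / cpu_has_lasx feature probes. */
static int has_lsx(void)  { return 1; }
static int has_lasx(void) { return 1; }

static const struct recov_calls lasx  = { has_lasx, "lasx",  2 };
static const struct recov_calls lsx   = { has_lsx,  "lsx",   1 };
static const struct recov_calls intx1 = { NULL,     "intx1", 0 };

/* Highest-priority entry whose valid() hook (if any) says yes wins. */
static const struct recov_calls *pick(const struct recov_calls *const *algos)
{
	const struct recov_calls *best = NULL;

	for (; *algos; algos++)
		if (!best || (*algos)->priority > best->priority)
			if (!(*algos)->valid || (*algos)->valid())
				best = *algos;
	return best;
}

int main(void)
{
	const struct recov_calls *const table[] = { &lasx, &lsx, &intx1, NULL };

	printf("using %s recovery algorithm\n", pick(table)->name);
	return 0;
}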

> 
>> Signed-off-by: WANG Xuerui <git@xen0n.name>
> 
> Out of curiosity, what is your “first” name?

My first name / given name is "Xuerui"; I usually prefer having my 
romanized name in "native-endian" whenever it's convenient ;-)

> 
>> ---
>>   include/linux/raid/pq.h          |   2 +
>>   lib/raid6/Makefile               |   2 +-
>>   lib/raid6/algos.c                |   8 +
>>   lib/raid6/recov_loongarch_simd.c | 515 +++++++++++++++++++++++++++++++
>>   lib/raid6/test/Makefile          |   2 +-
>>   5 files changed, 527 insertions(+), 2 deletions(-)
>>   create mode 100644 lib/raid6/recov_loongarch_simd.c
> 
> 
> Kind regards,
> 
> Paul
> 
> 
> [snip]
>> +
>> +    /* Now, pick the proper data tables */
>> +    pbmul = raid6_vgfmul[raid6_gfexi[failb-faila]];
> 
> Should spaces be put around the operator?

Hmm, AFAICS almost all raid6 files "inherit" the scalar reference 
implementation's coding style, which happens to differ from the 
documented one. But given that I already adjusted some of the comments, 
I think it would be best to touch these too. Thanks for spotting this 
(and the others below).
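
Concretely, I'd expect the touched spots to end up something like this 
in v3 (illustrative only, not the final patch):

	pbmul = raid6_vgfmul[raid6_gfexi[failb - faila]];
	qmul = raid6_vgfmul[raid6_gfinv[raid6_gfexp[faila] ^ raid6_gfexp[failb]]];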

> 
>> +    qmul  = raid6_vgfmul[raid6_gfinv[raid6_gfexp[faila] ^
>> +        raid6_gfexp[failb]]];
>> +
>> [snip]
>> +
>> +    /* Now, pick the proper data tables */
>> +    qmul  = raid6_vgfmul[raid6_gfinv[raid6_gfexp[faila]]];
> 
> Only one space after qmul?

As explained above; I'll fix this one and others in v3.

> [snip]
>> +    /* Now, pick the proper data tables */
>> +    pbmul = raid6_vgfmul[raid6_gfexi[failb-faila]];
> 
> Ditto.
> 
> [snip]
>> +    /* Now, pick the proper data tables */
>> +    qmul  = raid6_vgfmul[raid6_gfinv[raid6_gfexp[faila]]];
> 
> Ditto.
> 
> [snip]
>> diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile
>> index 7b244bce32b3d..2abe0076a636c 100644
>> --- a/lib/raid6/test/Makefile
>> +++ b/lib/raid6/test/Makefile
>> @@ -65,7 +65,7 @@ else ifeq ($(HAS_ALTIVEC),yes)
>>           OBJS += altivec1.o altivec2.o altivec4.o altivec8.o \
>>                   vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
>>   else ifeq ($(ARCH),loongarch64)
>> -        OBJS += loongarch_simd.o
>> +        OBJS += loongarch_simd.o recov_loongarch_simd.o
>>   endif
>>   .c.o:
> 
> 
> Kind regards,
> 
> Paul
> 
> 
> PS: I brought up the raid speed tests in the past, and Borislav called 
> them a random number generator [1]. ;-)
> 
> 
> [1]: https://lore.kernel.org/all/20210406124126.GM17806@zn.tnic/

Interesting; so the reason I've yet to observe such wild fluctuations 
may simply be that I haven't rebooted that rig as many times. :D

And thanks for the review! I'll send v3 some time later (and stress the 
code more meanwhile).

-- 
WANG "xen0n" Xuerui

Linux/LoongArch mailing list: https://lore.kernel.org/loongarch/



Thread overview: 6+ messages
2023-08-03 17:08 [PATCH v2 0/3] raid5, raid6: Accelerate RAID math with LoongArch SIMD WANG Xuerui
2023-08-03 17:08 ` [PATCH v2 1/3] LoongArch: Add SIMD-optimized XOR routines WANG Xuerui
2023-08-03 17:08 ` [PATCH v2 2/3] raid6: Add LoongArch SIMD syndrome calculation WANG Xuerui
2023-08-03 17:08 ` [PATCH v2 3/3] raid6: Add LoongArch SIMD recovery implementation WANG Xuerui
2023-08-03 19:49   ` Paul Menzel
2023-08-04  6:24     ` WANG Xuerui
