From: Yann Sionneau <ysionneau@kalray.eu>
To: "Arnd Bergmann" <arnd@arndb.de>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Marc Zyngier" <maz@kernel.org>,
	"Rob Herring" <robh+dt@kernel.org>,
	"Krzysztof Kozlowski" <krzysztof.kozlowski+dt@linaro.org>,
	"Will Deacon" <will@kernel.org>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Boqun Feng" <boqun.feng@gmail.com>,
	"Mark Rutland" <mark.rutland@arm.com>,
	"Eric Biederman" <ebiederm@xmission.com>,
	"Kees Cook" <keescook@chromium.org>,
	"Oleg Nesterov" <oleg@redhat.com>,
	"Ingo Molnar" <mingo@redhat.com>,
	"Waiman Long" <longman@redhat.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Nick Piggin" <npiggin@gmail.com>,
	"Paul Moore" <paul@paul-moore.com>,
	"Eric Paris" <eparis@redhat.com>,
	"Christian Brauner" <brauner@kernel.org>,
	"Paul Walmsley" <paul.walmsley@sifive.com>,
	"Palmer Dabbelt" <palmer@dabbelt.com>,
	"Albert Ou" <aou@eecs.berkeley.edu>,
	"Jules Maselbas" <jmaselbas@kalray.eu>,
	"Yann Sionneau" <ysionneau@kalray.eu>,
	"Guillaume Thouvenin" <gthouvenin@kalray.eu>,
	"Clement Leger" <clement@clement-leger.fr>,
	"Vincent Chardon" <vincent.chardon@elsys-design.com>,
	"Marc Poulhiès" <dkm@kataplop.net>,
	"Julian Vetter" <jvetter@kalray.eu>,
	"Samuel Jones" <sjones@kalray.eu>,
	"Ashley Lesdalons" <alesdalons@kalray.eu>,
	"Thomas Costis" <tcostis@kalray.eu>,
	"Marius Gligor" <mgligor@kalray.eu>,
	"Jonathan Borne" <jborne@kalray.eu>,
	"Julien Villette" <jvillette@kalray.eu>,
	"Luc Michel" <lmichel@kalray.eu>,
	"Louis Morhet" <lmorhet@kalray.eu>,
	"Julien Hascoet" <jhascoet@kalray.eu>,
	"Jean-Christophe Pince" <jcpince@gmail.com>,
	"Guillaume Missonnier" <gmissonnier@kalray.eu>,
	"Alex Michon" <amichon@kalray.eu>,
	"Huacai Chen" <chenhuacai@kernel.org>,
	"WANG Xuerui" <git@xen0n.name>,
	"Shaokun Zhang" <zhangshaokun@hisilicon.com>,
	"John Garry" <john.garry@huawei.com>,
	"Guangbin Huang" <huangguangbin2@huawei.com>,
	"Bharat Bhushan" <bbhushan2@marvell.com>,
	"Bibo Mao" <maobibo@loongson.cn>,
	"Atish Patra" <atishp@atishpatra.org>,
	"Jason A. Donenfeld" <Jason@zx2c4.com>,
	"Qi Liu" <liuqi115@huawei.com>,
	"Jiaxun Yang" <jiaxun.yang@flygoat.com>,
	"Catalin Marinas" <catalin.marinas@arm.com>,
	"Mark Brown" <broonie@kernel.org>,
	"Janosch Frank" <frankja@linux.ibm.com>,
	"Alexey Dobriyan" <adobriyan@gmail.com>
Cc: Benjamin Mugnier <mugnier.benjamin@gmail.com>,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	devicetree@vger.kernel.org, linux-mm@kvack.org,
	linux-arch@vger.kernel.org, linux-audit@redhat.com,
	linux-riscv@lists.infradead.org, bpf@vger.kernel.org
Subject: [RFC PATCH v2 25/31] kvx: Add some library functions
Date: Fri, 20 Jan 2023 15:09:56 +0100	[thread overview]
Message-ID: <20230120141002.2442-26-ysionneau@kalray.eu> (raw)
In-Reply-To: <20230120141002.2442-1-ysionneau@kalray.eu>

Add some library functions for kvx, including: delay, memset,
memcpy, strlen, clear_page, copy_page, raw_copy_from/to_user,
asm_clear_user.

Co-developed-by: Clement Leger <clement@clement-leger.fr>
Signed-off-by: Clement Leger <clement@clement-leger.fr>
Co-developed-by: Jules Maselbas <jmaselbas@kalray.eu>
Signed-off-by: Jules Maselbas <jmaselbas@kalray.eu>
Co-developed-by: Julian Vetter <jvetter@kalray.eu>
Signed-off-by: Julian Vetter <jvetter@kalray.eu>
Co-developed-by: Marius Gligor <mgligor@kalray.eu>
Signed-off-by: Marius Gligor <mgligor@kalray.eu>
Co-developed-by: Yann Sionneau <ysionneau@kalray.eu>
Signed-off-by: Yann Sionneau <ysionneau@kalray.eu>
---

Notes:
    V1 -> V2: no changes

 arch/kvx/include/asm/string.h |  20 ++
 arch/kvx/kernel/kvx_ksyms.c   |   5 +
 arch/kvx/lib/clear_page.S     |  40 ++++
 arch/kvx/lib/copy_page.S      |  90 +++++++++
 arch/kvx/lib/delay.c          |  39 ++++
 arch/kvx/lib/memcpy.c         |  70 +++++++
 arch/kvx/lib/memset.S         | 351 ++++++++++++++++++++++++++++++++++
 arch/kvx/lib/strlen.S         | 122 ++++++++++++
 arch/kvx/lib/usercopy.S       |  90 +++++++++
 9 files changed, 827 insertions(+)
 create mode 100644 arch/kvx/include/asm/string.h
 create mode 100644 arch/kvx/lib/clear_page.S
 create mode 100644 arch/kvx/lib/copy_page.S
 create mode 100644 arch/kvx/lib/delay.c
 create mode 100644 arch/kvx/lib/memcpy.c
 create mode 100644 arch/kvx/lib/memset.S
 create mode 100644 arch/kvx/lib/strlen.S
 create mode 100644 arch/kvx/lib/usercopy.S

diff --git a/arch/kvx/include/asm/string.h b/arch/kvx/include/asm/string.h
new file mode 100644
index 000000000000..677c1393a5cd
--- /dev/null
+++ b/arch/kvx/include/asm/string.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Clement Leger
+ *            Jules Maselbas
+ */
+
+#ifndef _ASM_KVX_STRING_H
+#define _ASM_KVX_STRING_H
+
+#define __HAVE_ARCH_MEMSET
+extern void *memset(void *s, int c, size_t n);
+
+#define __HAVE_ARCH_MEMCPY
+extern void *memcpy(void *dest, const void *src, size_t n);
+
+#define __HAVE_ARCH_STRLEN
+extern size_t strlen(const char *s);
+
+#endif	/* _ASM_KVX_STRING_H */
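
For reference, the __HAVE_ARCH_* defines above keep the generic C fallbacks in lib/string.c out of the build, so the kvx implementations added later in this patch are the ones that get linked and exported. The generic side is guarded roughly as follows (a paraphrased sketch, not the exact upstream source):

#include <linux/export.h>
#include <linux/types.h>

#ifndef __HAVE_ARCH_MEMSET
/* Generic byte-at-a-time fallback, built only when the architecture
 * does not provide its own memset. */
void *memset(void *s, int c, size_t count)
{
	char *xs = s;

	while (count--)
		*xs++ = c;
	return s;
}
EXPORT_SYMBOL(memset);
#endif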
diff --git a/arch/kvx/kernel/kvx_ksyms.c b/arch/kvx/kernel/kvx_ksyms.c
index 18990aaf259f..678f81716dea 100644
--- a/arch/kvx/kernel/kvx_ksyms.c
+++ b/arch/kvx/kernel/kvx_ksyms.c
@@ -22,3 +22,8 @@ DECLARE_EXPORT(__umoddi3);
 DECLARE_EXPORT(__divdi3);
 DECLARE_EXPORT(__udivdi3);
 DECLARE_EXPORT(__multi3);
+
+DECLARE_EXPORT(clear_page);
+DECLARE_EXPORT(copy_page);
+DECLARE_EXPORT(memset);
+DECLARE_EXPORT(asm_clear_user);
diff --git a/arch/kvx/lib/clear_page.S b/arch/kvx/lib/clear_page.S
new file mode 100644
index 000000000000..364fe0663ca2
--- /dev/null
+++ b/arch/kvx/lib/clear_page.S
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Marius Gligor
+ *            Clement Leger
+ */
+
+#include <linux/linkage.h>
+#include <linux/export.h>
+#include <linux/const.h>
+
+#include <asm/cache.h>
+#include <asm/page.h>
+
+#define CLEAR_PAGE_LOOP_COUNT	(PAGE_SIZE / 32)
+
+/*
+ * Clear page @dest.
+ *
+ * Parameters:
+ *	r0 - dest page
+ */
+ENTRY(clear_page)
+	make $r1 = CLEAR_PAGE_LOOP_COUNT
+	;;
+	make $r4 = 0
+	make $r5 = 0
+	make $r6 = 0
+	make $r7 = 0
+	;;
+
+	loopdo $r1, clear_page_done
+		;;
+		so 0[$r0] = $r4r5r6r7
+		addd $r0 = $r0, 32
+		;;
+	clear_page_done:
+	ret
+	;;
+ENDPROC(clear_page)
diff --git a/arch/kvx/lib/copy_page.S b/arch/kvx/lib/copy_page.S
new file mode 100644
index 000000000000..4bb82d1c964c
--- /dev/null
+++ b/arch/kvx/lib/copy_page.S
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Clement Leger
+ */
+
+#include <linux/linkage.h>
+#include <linux/const.h>
+
+#include <asm/page.h>
+
+/* We do 8 load/store octuples (32 bytes each) per hardware loop iteration */
+#define COPY_SIZE_PER_LOOP	(32 * 8)
+#define COPY_PAGE_LOOP_COUNT	(PAGE_SIZE / COPY_SIZE_PER_LOOP)
+
+/*
+ * Copy a page from src to dest (both are page aligned).
+ * In order to hide smem latency, unroll the loop to keep multiple loads
+ * in flight and avoid stalling too long waiting for them to return.
+ * We use 8 * 32-byte loads, even though we could use more (up to 10 loads),
+ * to simplify the handling with a single hardware loop.
+ *
+ * Parameters:
+ *	r0 - dest
+ *	r1 - src
+ */
+ENTRY(copy_page)
+	make $r2 = COPY_PAGE_LOOP_COUNT
+	make $r3 = 0
+	;;
+	loopdo $r2, copy_page_done
+		;;
+		/*
+		 * Load 8 * 32 bytes using uncached access to avoid hitting
+		 * the cache
+		 */
+		lo.xs $r32r33r34r35 = $r3[$r1]
+		/* Copy current copy index for store */
+		copyd $r2 = $r3
+		addd $r3 = $r3, 1
+		;;
+		lo.xs $r36r37r38r39 = $r3[$r1]
+		addd $r3 = $r3, 1
+		;;
+		lo.xs $r40r41r42r43 = $r3[$r1]
+		addd $r3 = $r3, 1
+		;;
+		lo.xs $r44r45r46r47 = $r3[$r1]
+		addd $r3 = $r3, 1
+		;;
+		lo.xs $r48r49r50r51 = $r3[$r1]
+		addd $r3 = $r3, 1
+		;;
+		lo.xs $r52r53r54r55 = $r3[$r1]
+		addd $r3 = $r3, 1
+		;;
+		lo.xs $r56r57r58r59 = $r3[$r1]
+		addd $r3 = $r3, 1
+		;;
+		lo.xs $r60r61r62r63 = $r3[$r1]
+		addd $r3 = $r3, 1
+		;;
+		/* And then store all of them */
+		so.xs $r2[$r0] = $r32r33r34r35
+		addd $r2 = $r2, 1
+		;;
+		so.xs $r2[$r0] = $r36r37r38r39
+		addd $r2 = $r2, 1
+		;;
+		so.xs $r2[$r0] = $r40r41r42r43
+		addd $r2 = $r2, 1
+		;;
+		so.xs $r2[$r0] = $r44r45r46r47
+		addd $r2 = $r2, 1
+		;;
+		so.xs $r2[$r0] = $r48r49r50r51
+		addd $r2 = $r2, 1
+		;;
+		so.xs $r2[$r0] = $r52r53r54r55
+		addd $r2 = $r2, 1
+		;;
+		so.xs $r2[$r0] = $r56r57r58r59
+		addd $r2 = $r2, 1
+		;;
+		so.xs $r2[$r0] = $r60r61r62r63
+		;;
+	copy_page_done:
+	ret
+	;;
+ENDPROC(copy_page)
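
As an illustration, one iteration of the hardware loop above behaves roughly like the following C sketch (illustrative only; the chunk size and unroll factor come from the constants above, and the real code keeps the eight loads in flight before issuing the stores):

#include <stddef.h>
#include <string.h>

#define CHUNK		32	/* one octuple load/store */
#define CHUNKS_PER_ITER	8	/* loads kept in flight per iteration */

static void copy_page_sketch(unsigned char *dest, const unsigned char *src,
			     size_t page_size)
{
	unsigned char tmp[CHUNKS_PER_ITER][CHUNK];
	size_t off, j;

	for (off = 0; off < page_size; off += CHUNK * CHUNKS_PER_ITER) {
		/* mirrors the eight lo.xs loads */
		for (j = 0; j < CHUNKS_PER_ITER; j++)
			memcpy(tmp[j], src + off + j * CHUNK, CHUNK);
		/* mirrors the eight so.xs stores */
		for (j = 0; j < CHUNKS_PER_ITER; j++)
			memcpy(dest + off + j * CHUNK, tmp[j], CHUNK);
	}
}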
diff --git a/arch/kvx/lib/delay.c b/arch/kvx/lib/delay.c
new file mode 100644
index 000000000000..11295eedc3f5
--- /dev/null
+++ b/arch/kvx/lib/delay.c
@@ -0,0 +1,39 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Clement Leger
+ */
+
+#include <linux/export.h>
+#include <linux/delay.h>
+
+#include <asm/param.h>
+#include <asm/timex.h>
+
+void __delay(unsigned long loops)
+{
+	cycles_t target_cycle = get_cycles() + loops;
+
+	while (get_cycles() < target_cycle);
+}
+EXPORT_SYMBOL(__delay);
+
+inline void __const_udelay(unsigned long xloops)
+{
+	u64 loops = (u64)xloops * (u64)loops_per_jiffy * HZ;
+
+	__delay(loops >> 32);
+}
+EXPORT_SYMBOL(__const_udelay);
+
+void __udelay(unsigned long usecs)
+{
+	__const_udelay(usecs * 0x10C7UL); /* 2**32 / 1000000 (rounded up) */
+}
+EXPORT_SYMBOL(__udelay);
+
+void __ndelay(unsigned long nsecs)
+{
+	__const_udelay(nsecs * 0x5UL); /* 2**32 / 1000000000 (rounded up) */
+}
+EXPORT_SYMBOL(__ndelay);
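
For reference, the scaling constants above can be sanity-checked with a small userspace sketch (illustrative only, with made-up loops_per_jiffy and HZ values): 0x10C7 is ceil(2^32 / 10^6), so after the >> 32 in __const_udelay() the result is usecs * loops_per_jiffy * HZ / 10^6, i.e. microseconds converted into get_cycles() ticks, since loops_per_jiffy * HZ is the number of ticks per second on this port.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const uint64_t lpj = 1000000;	/* hypothetical loops_per_jiffy */
	const uint64_t hz = 100;	/* hypothetical CONFIG_HZ */
	const uint64_t usecs = 250;

	uint64_t xloops = usecs * 0x10C7ULL;		/* usecs * ceil(2^32 / 10^6) */
	uint64_t cycles = (xloops * lpj * hz) >> 32;	/* value passed to __delay() */

	/* prints roughly 25000 cycles for 250 us at 10^8 cycles/s */
	printf("%llu cycles for %llu us\n",
	       (unsigned long long)cycles, (unsigned long long)usecs);
	return 0;
}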
diff --git a/arch/kvx/lib/memcpy.c b/arch/kvx/lib/memcpy.c
new file mode 100644
index 000000000000..b81f746a80ee
--- /dev/null
+++ b/arch/kvx/lib/memcpy.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Clement Leger
+ *            Yann Sionneau
+ */
+
+#include <linux/export.h>
+#include <linux/types.h>
+
+void *memcpy(void *dest, const void *src, size_t n)
+{
+	__uint128_t *tmp128_d = dest;
+	const __uint128_t *tmp128_s = src;
+	uint64_t *tmp64_d;
+	const uint64_t *tmp64_s;
+	uint32_t *tmp32_d;
+	const uint32_t *tmp32_s;
+	uint16_t *tmp16_d;
+	const uint16_t *tmp16_s;
+	uint8_t *tmp8_d;
+	const uint8_t *tmp8_s;
+
+	while (n >= 16) {
+		*tmp128_d = *tmp128_s;
+		tmp128_d++;
+		tmp128_s++;
+		n -= 16;
+	}
+
+	tmp64_d = (uint64_t *) tmp128_d;
+	tmp64_s = (uint64_t *) tmp128_s;
+	while (n >= 8) {
+		*tmp64_d = *tmp64_s;
+		tmp64_d++;
+		tmp64_s++;
+		n -= 8;
+	}
+
+	tmp32_d = (uint32_t *) tmp64_d;
+	tmp32_s = (uint32_t *) tmp64_s;
+	while (n >= 4) {
+		*tmp32_d = *tmp32_s;
+		tmp32_d++;
+		tmp32_s++;
+		n -= 4;
+	}
+
+	tmp16_d = (uint16_t *) tmp32_d;
+	tmp16_s = (uint16_t *) tmp32_s;
+	while (n >= 2) {
+		*tmp16_d = *tmp16_s;
+		tmp16_d++;
+		tmp16_s++;
+		n -= 2;
+	}
+
+	tmp8_d = (uint8_t *) tmp16_d;
+	tmp8_s = (uint8_t *) tmp16_s;
+	while (n >= 1) {
+		*tmp8_d = *tmp8_s;
+		tmp8_d++;
+		tmp8_s++;
+		n--;
+	}
+
+	return dest;
+}
+EXPORT_SYMBOL(memcpy);
+
diff --git a/arch/kvx/lib/memset.S b/arch/kvx/lib/memset.S
new file mode 100644
index 000000000000..9eebc28da2be
--- /dev/null
+++ b/arch/kvx/lib/memset.S
@@ -0,0 +1,351 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Clement Leger
+ *            Marius Gligor
+ */
+
+#include <linux/linkage.h>
+
+#include <asm/cache.h>
+
+#define REPLICATE_BYTE_MASK	0x0101010101010101
+#define MIN_SIZE_FOR_ALIGN	128
+
+/*
+ * Optimized memset for kvx architecture
+ *
+ * In order to optimize memset on kvx, we can use various things:
+ * - conditional stores, which avoid branch penalties
+ * - half/word/double/quad/octuple stores, to write up to 32 bytes at a time
+ * - dzerol to zero a cache line when the pattern is '0' (often the case)
+ * - hardware loops for the steady-state part.
+ *
+ * First, we assume that memset is mainly used for zeroing areas, so we treat
+ * that case as the fast path of the algorithm.
+ * In both cases (zero and non-zero pattern), we start by checking whether the
+ * size is below a minimum threshold. If so, we skip the alignment part: the
+ * kvx supports misaligned accesses, and the penalty for letting it perform
+ * them is lower than the cost of realigning first. So for small sizes we do
+ * not even bother to realign. The only difference for the zero pattern is
+ * that, for small sizes, we branch to the code after the dzerol loop, since
+ * dzerol must be cache-line aligned (misalignment is not an option there).
+ * Regarding the non-zero pattern, we use sbmm to replicate the pattern byte
+ * across all bytes of a register in a single instruction.
+ * Once alignment has been reached, we run the hardware loop for both cases
+ * (store octuple / dzerol) in order to maximize throughput. Care must be
+ * taken to align hardware loops on at least 8 bytes for performance.
+ * Once the main loop is done, we finish by checking the remaining length and
+ * issuing the stores needed for the remaining bytes.
+ *
+ * Pseudo code (applies for non 0 pattern):
+ *
+ * void *memset(void *dest, char pattern, long length)
+ * {
+ * 	long dest_align = -((long) dest);
+ * 	long copy;
+ * 	void *orig_dest = dest;
+ *
+ * 	uint64_t pattern64 = sbmm8(pattern, 0x0101010101010101);
+ * 	uint128_t pattern128 = pattern64 << 64 | pattern64;
+ * 	uint256_t pattern256 = pattern128 << 128 | pattern128;
+ *
+ * 	// Keep only low bits
+ * 	dest_align &= 0x1F;
+ * 	length -= dest_align;
+ *
+ * 	// Byte align
+ * 	copy = dest_align & (1 << 0);
+ * 	if (copy)
+ * 		*((u8 *) dest) = pattern64;
+ * 	dest += copy;
+ * 	// Half align
+ * 	copy = dest_align & (1 << 1);
+ * 	if (copy)
+ * 		*((u16 *) dest) = pattern64;
+ * 	dest += copy;
+ * 	// Word align
+ * 	copy = dest_align & (1 << 2);
+ * 	if (copy)
+ * 		*((u32 *) dest) = pattern64;
+ * 	dest += copy;
+ * 	// Double align
+ * 	copy = dest_align & (1 << 3);
+ * 	if (copy)
+ * 		*((u64 *) dest) = pattern64;
+ * 	dest += copy;
+ * 	// Quad align
+ * 	copy = dest_align & (1 << 4);
+ * 	if (copy)
+ * 		*((u128 *) dest) = pattern128;
+ * 	dest += copy;
+ *
+ * 	// We are now aligned on 256 bits
+ * 	loop_octuple_count = length >> 5;
+ * 	for (i = 0; i < loop_octuple_count; i++) {
+ * 		*((u256 *) dest) = pattern256;
+ * 		dest += 32;
+ * 	}
+ *
+ * 	if (length == 0)
+ * 		return orig_dest;
+ *
+ * 	// Copy remaining part
+ * 	remain = length & (1 << 4);
+ * 	if (remain)
+ * 		*((u128 *) dest) = pattern128;
+ * 	dest += remain;
+ * 	remain = length & (1 << 3);
+ * 	if (remain)
+ * 		*((u64 *) dest) = pattern64;
+ * 	dest += remain;
+ * 	remain = length & (1 << 2);
+ * 	if (remain)
+ * 		*((u32 *) dest) = pattern64;
+ * 	dest += remain;
+ * 	remain = length & (1 << 1);
+ * 	if (remain)
+ * 		*((u16 *) dest) = pattern64;
+ * 	dest += remain;
+ * 	remain = length & (1 << 0);
+ * 	if (remain)
+ * 		*((u8 *) dest) = pattern64;
+ * 	dest += remain;
+ *
+ * 	return orig_dest;
+ * }
+ */
+
+.text
+.align 16
+ENTRY(memset)
+	make $r32 = 0
+	make $r33 = 0
+	/* Check if length < KVX_DCACHE_LINE_SIZE */
+	compd.ltu $r7 = $r2, KVX_DCACHE_LINE_SIZE
+	/* Jump to generic memset if pattern is != 0 */
+	cb.dnez $r1? memset_non_0_pattern
+	;;
+	/* Preserve return value */
+	copyd $r3 = $r0
+	/* Invert address to compute size to copy to be aligned on 32 bytes */
+	negd $r5 = $r0
+	/* Remaining bytes for 16 bytes store (for alignment on 64 bytes) */
+	andd $r8 = $r2, (1 << 5)
+	copyq $r34r35 = $r32, $r33
+	/* Skip loopdo with dzerol if length < KVX_DCACHE_LINE_SIZE */
+	cb.dnez $r7? .Ldzerol_done
+	;;
+	/* Compute the size that will be copied to align on 64 bytes boundary */
+	andw $r6 = $r5, 0x3F
+	/* Check if address is aligned on 64 bytes */
+	andw $r9 = $r0, 0x3F
+	/* Alignment */
+	nop
+	;;
+	/* If address already aligned on 64 bytes, jump to dzerol loop */
+	cb.deqz $r9? .Laligned_64
+	/* Remove unaligned part from length */
+	sbfd $r2 = $r6, $r2
+	/* Check if we need to copy 1 byte */
+	andw $r4 = $r5, (1 << 0)
+	;;
+	/* If we are not aligned, store byte */
+	sb.dnez $r4? [$r0] = $r32
+	/* Check if we need to copy 2 bytes */
+	andw $r4 = $r5, (1 << 1)
+	/* Add potentially copied part for next store offset */
+	addd $r0 = $r0, $r4
+	;;
+	sh.dnez $r4? [$r0] = $r32
+	/* Check if we need to copy 4 bytes */
+	andw $r4 = $r5, (1 << 2)
+	addd $r0 = $r0, $r4
+	;;
+	sw.dnez $r4? [$r0] = $r32
+	/* Check if we need to copy 8 bytes */
+	andw $r4 = $r5, (1 << 3)
+	addd $r0 = $r0, $r4
+	;;
+	sd.dnez $r4? [$r0] = $r32
+	/* Check if we need to copy 16 bytes */
+	andw $r4 = $r5, (1 << 4)
+	addd $r0 = $r0, $r4
+	;;
+	sq.dnez $r4? [$r0] = $r32r33
+	/* Check if we need to copy 32 bytes */
+	andw $r4 = $r5, (1 << 5)
+	addd $r0 = $r0, $r4
+	;;
+	so.dnez $r4? [$r0] = $r32r33r34r35
+	addd $r0 = $r0, $r4
+	;;
+.Laligned_64:
+	/* Prepare amount of data for dzerol */
+	srld $r10 = $r2, 6
+	/* Size to be handled in loopdo */
+	andd $r4 = $r2, ~0x3F
+	make $r11 = 64
+	cb.deqz $r2? .Lmemset_done
+	;;
+	/* Remaining bytes for 16 bytes store */
+	andw $r8 = $r2, (1 << 5)
+	/* Skip dzerol if there is not enough data for a 64-byte store */
+	cb.deqz $r10? .Ldzerol_done
+	/* Update length to copy */
+	sbfd $r2 = $r4, $r2
+	;;
+	loopdo $r10, .Ldzerol_done
+		;;
+		so 0[$r0] = $r32r33r34r35
+		;;
+		so 32[$r0] = $r32r33r34r35
+		addd $r0 = $r0, $r11
+		;;
+	.Ldzerol_done:
+	/*
+	 * Now that we have handled all the aligned bytes using 'dzerol', we can
+	 * handle the remainder of the length with stores of decreasing size.
+	 * We also exploit the fact that we are aligned to simply check the
+	 * remaining size */
+	so.dnez $r8? [$r0] = $r32r33r34r35
+	addd $r0 = $r0, $r8
+	/* Remaining bytes for 16 bytes store */
+	andw $r8 = $r2, (1 << 4)
+	cb.deqz $r2? .Lmemset_done
+	;;
+	sq.dnez $r8? [$r0] = $r32r33
+	addd $r0 = $r0, $r8
+	/* Remaining bytes for 8 bytes store */
+	andw $r8 = $r2, (1 << 3)
+	;;
+	sd.dnez $r8? [$r0] = $r32
+	addd $r0 = $r0, $r8
+	/* Remaining bytes for 4 bytes store */
+	andw $r8 = $r2, (1 << 2)
+	;;
+	sw.dnez $r8? [$r0] = $r32
+	addd $r0 = $r0, $r8
+	/* Remaining bytes for 2 bytes store */
+	andw $r8 = $r2, (1 << 1)
+	;;
+	sh.dnez $r8? [$r0] = $r32
+	addd $r0 = $r0, $r8
+	;;
+	sb.odd $r2? [$r0] = $r32
+	/* Restore original value */
+	copyd $r0 = $r3
+	ret
+	;;
+
+.align 16
+memset_non_0_pattern:
+	/* Preserve return value */
+	copyd $r3 = $r0
+	/* Replicate the first pattern byte on all bytes */
+	sbmm8 $r32 = $r1, REPLICATE_BYTE_MASK
+	/* Check if length < MIN_SIZE_FOR_ALIGN */
+	compd.geu $r7 = $r2, MIN_SIZE_FOR_ALIGN
+	/* Invert address to compute size to copy to be aligned on 32 bytes */
+	negd $r5 = $r0
+	;;
+	/* Check if we are aligned on 32 bytes */
+	andw $r9 = $r0, 0x1F
+	/* Compute the size that will be copied to align on 32 bytes boundary */
+	andw $r6 = $r5, 0x1F
+	/*
+	 * If size < MIN_SIZE_FOR_ALIGN bytes, go directly to so; stores will be
+	 * unaligned, but that is still better than what we can do with sb
+	 */
+	cb.deqz $r7? .Laligned_32
+	;;
+	/* Remove unaligned part from length */
+	sbfd $r2 = $r6, $r2
+	/* If we are already aligned on 32 bytes, jump to main "so" loop */
+	cb.deqz $r9? .Laligned_32
+	/* Check if we need to copy 1 byte */
+	andw $r4 = $r5, (1 << 0)
+	;;
+	/* If we are not aligned, store byte */
+	sb.dnez $r4? [$r0] = $r32
+	/* Check if we need to copy 2 bytes */
+	andw $r4 = $r5, (1 << 1)
+	/* Add potentially copied part for next store offset */
+	addd $r0 = $r0, $r4
+	;;
+	sh.dnez $r4? [$r0] = $r32
+	/* Check if we need to copy 4 bytes */
+	andw $r4 = $r5, (1 << 2)
+	addd $r0 = $r0, $r4
+	;;
+	sw.dnez $r4? [$r0] = $r32
+	/* Check if we need to copy 8 bytes */
+	andw $r4 = $r5, (1 << 3)
+	addd $r0 = $r0, $r4
+	/* Copy second part of pattern for sq */
+	copyd $r33 = $r32
+	;;
+	sd.dnez $r4? [$r0] = $r32
+	/* Check if we need to copy 16 bytes */
+	andw $r4 = $r5, (1 << 4)
+	addd $r0 = $r0, $r4
+	;;
+	sq.dnez $r4? [$r0] = $r32r33
+	addd $r0 = $r0, $r4
+	;;
+.Laligned_32:
+	/* Copy second part of pattern for sq */
+	copyd $r33 = $r32
+	/* Prepare amount of data for 32 bytes store */
+	srld $r10 = $r2, 5
+	nop
+	nop
+	;;
+	copyq $r34r35 = $r32, $r33
+	/* Remaining bytes for 16 bytes store */
+	andw $r8 = $r2, (1 << 4)
+	make $r11 = 32
+	/* Check if there are enough data for 32 bytes store */
+	cb.deqz $r10? .Laligned_32_done
+	;;
+	loopdo $r10, .Laligned_32_done
+		;;
+		so 0[$r0] = $r32r33r34r35
+		addd $r0 = $r0, $r11
+		;;
+	.Laligned_32_done:
+	/*
+	 * Now that we have handled all the aligned bytes using 'so', we can
+	 * handle the remainder of the length with stores of decreasing size.
+	 * We also exploit the fact that we are aligned to simply check the
+	 * remaining size */
+	sq.dnez $r8? [$r0] = $r32r33
+	addd $r0 = $r0, $r8
+	/* Remaining bytes for 8 bytes store */
+	andw $r8 = $r2, (1 << 3)
+	cb.deqz $r2? .Lmemset_done
+	;;
+	sd.dnez $r8? [$r0] = $r32
+	addd $r0 = $r0, $r8
+	/* Remaining bytes for 4 bytes store */
+	andw $r8 = $r2, (1 << 2)
+	;;
+	sw.dnez $r8? [$r0] = $r32
+	addd $r0 = $r0, $r8
+	/* Remaining bytes for 2 bytes store */
+	andw $r8 = $r2, (1 << 1)
+	;;
+	sh.dnez $r8? [$r0] = $r32
+	addd $r0 = $r0, $r8
+	;;
+	sb.odd $r2? [$r0] = $r32
+	/* Restore original value */
+	copyd $r0 = $r3
+	ret
+	;;
+.Lmemset_done:
+	/* Restore original value */
+	copyd $r0 = $r3
+	ret
+	;;
+ENDPROC(memset)
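
As an illustration, two building blocks used above can be written in plain C (a sketch under the comments' assumptions, not a drop-in replacement): the sbmm8/REPLICATE_BYTE_MASK byte replication is equivalent to a multiplication by 0x0101010101010101, and the negd/andw sequence computes how many bytes are needed to reach the next aligned boundary.

#include <stdint.h>

/* What sbmm8 $rX = $r1, REPLICATE_BYTE_MASK achieves: copy the low byte
 * into every byte of a 64-bit register. */
static inline uint64_t replicate_byte(uint8_t c)
{
	return (uint64_t)c * 0x0101010101010101ULL;
}

/* What the negd/andw pair computes: number of bytes from p up to the
 * next 'align'-byte boundary (align = 32 for the non-zero path above). */
static inline uint64_t bytes_to_align(const void *p, uint64_t align)
{
	return (0 - (uintptr_t)p) & (align - 1);
}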
diff --git a/arch/kvx/lib/strlen.S b/arch/kvx/lib/strlen.S
new file mode 100644
index 000000000000..8298402a7898
--- /dev/null
+++ b/arch/kvx/lib/strlen.S
@@ -0,0 +1,122 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Jules Maselbas
+ */
+#include <linux/linkage.h>
+#include <asm/export.h>
+
+/*
+ *	kvx optimized strlen
+ *
+ *	This implementation of strlen only performs aligned memory accesses.
+ *	Since the total length is unknown, the idea is to load one double
+ *	word at a time and stop at the first null byte found. It is always
+ *	safe to over-read up to the next 8-byte boundary.
+ *
+ *	This implementation of strlen uses a trick to detect if a double
+ *	word contains a null byte [1]:
+ *
+ *	> #define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
+ *	> The sub-expression (v - 0x01010101UL), evaluates to a high bit set
+ *	> in any byte whenever the corresponding byte in v is zero or greater
+ *	> than 0x80. The sub-expression ~v & 0x80808080UL evaluates to high
+ *	> bits set in bytes where the byte of v doesn't have its high bit set
+ *	> (so the byte was less than 0x80). Finally, by ANDing these two sub-
+ *	> expressions the result is the high bits set where the bytes in v
+ *	> were zero, since the high bits set due to a value greater than 0x80
+ *	> in the first sub-expression are masked off by the second.
+ *
+ *	[1] http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
+ *
+ *	A second trick is used to get the exact number of characters before
+ *	the first null byte in a double word:
+ *
+ *		clz(sbmmt(zero, 0x0102040810204080))
+ *
+ *	This trick uses the haszero result, which maps null bytes to 0x80 and
+ *	all other values to 0x00. The idea is to count the number of
+ *	consecutive non-null bytes in the double word, starting from the
+ *	least significant byte. To do so, a bit matrix transpose is used to
+ *	"pack" all the high bits (0x80) into the most significant byte (MSB).
+ *	It is not possible to count the trailing zeros in this MSB; however,
+ *	if a byte swap is done before the bit matrix transpose, we still have
+ *	all the information in the MSB but can now count the leading zeros
+ *	instead. The instruction sbmmt with the matrix 0x0102040810204080
+ *	does exactly what we need: a byte swap followed by a bit transpose.
+ *
+ *	A final trick is used to handle the misalignment of the first double
+ *	word. This is done by masking off the N lower bytes (the excess
+ *	read), with N between 0 and 7. The mask is applied to the haszero
+ *	result and forces the N lower bytes to be considered non-null.
+ *
+ *	This is a C implementation of the algorithm described above:
+ *
+ *	size_t strlen(char *s) {
+ *		uint64_t *p    = (uint64_t *)((uintptr_t)s & ~0x7);
+ *		uint64_t rem   = ((uintptr_t)s) % 8;
+ *		uint64_t low   = -0x0101010101010101;
+ *		uint64_t high  =  0x8080808080808080;
+ *		uint64_t dword, zero;
+ *		uint64_t msk, len;
+ *
+ *		dword = *p++;
+ *		zero  = (dword + low) & ~dword & high;
+ *		msk   = 0xffffffffffffffff << (rem * 8);
+ *		zero &= msk;
+ *
+ *		while (!zero) {
+ *			dword = *p++;
+ *			zero  = (dword + low) & ~dword & high;
+ *		}
+ *
+ *		zero = __builtin_kvx_sbmmt8(zero, 0x0102040810204080);
+ *		len = ((void *)p - (void *)s) - 8;
+ *		len += __builtin_kvx_clzd(zero);
+ *
+ *		return len;
+ *	}
+ */
+
+.text
+.align 16
+ENTRY(strlen)
+	andd  $r1 = $r0, ~0x7
+	andd  $r2 = $r0,  0x7
+	make $r10 = -0x0101010101010101
+	make $r11 =  0x8080808080808080
+	;;
+	ld $r4 = 0[$r1]
+	sllw $r2 = $r2, 3
+	make $r3 = 0xffffffffffffffff
+	;;
+	slld $r2 = $r3, $r2
+	addd $r5 = $r4, $r10
+	andnd $r6 = $r4, $r11
+	;;
+	andd $r6 = $r6, $r2
+	make $r3 = 0
+	;;
+.loop:
+	andd $r4 = $r5, $r6
+	addd $r1 = $r1, 0x8
+	;;
+	cb.dnez $r4? .end
+	ld.deqz $r4? $r4 = [$r1]
+	;;
+	addd $r5 = $r4, $r10
+	andnd $r6 = $r4, $r11
+	goto .loop
+	;;
+.end:
+	addd $r1 = $r1, -0x8
+	sbmmt8 $r4 = $r4, 0x0102040810204080
+	;;
+	clzd $r4 = $r4
+	sbfd $r1 = $r0, $r1
+	;;
+	addd $r0 = $r4, $r1
+	ret
+	;;
+ENDPROC(strlen)
+EXPORT_SYMBOL(strlen)
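
As a worked example of the zero-byte detection described above (illustrative userspace code, little-endian byte order as on kvx):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	/* 8-byte chunk with null bytes at (little-endian) positions 3 and 7 */
	const char chunk[8] = { 'a', 'b', 'c', 0, 'x', 'y', 'z', 0 };
	uint64_t v, zero;

	memcpy(&v, chunk, 8);
	zero = (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;

	/* 0x80 survives only in the null-byte positions:
	 * prints zero mask = 0x8000000080000000 */
	printf("zero mask = 0x%016llx\n", (unsigned long long)zero);
	return 0;
}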
diff --git a/arch/kvx/lib/usercopy.S b/arch/kvx/lib/usercopy.S
new file mode 100644
index 000000000000..bc7e1a45e1c7
--- /dev/null
+++ b/arch/kvx/lib/usercopy.S
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Clement Leger
+ */
+#include <linux/linkage.h>
+
+/**
+ * Copy from/to a user buffer
+ * r0 = to buffer
+ * r1 = from buffer
+ * r2 = size to copy
+ * This function can trap when hitting a non-mapped page.
+ * This triggers a NOMAPPING trap, and the trap handler will interpret
+ * it and check whether the trapping instruction pointer has an entry
+ * in __ex_table. The next steps are described below.
+ */
+.text
+ENTRY(raw_copy_from_user)
+ENTRY(raw_copy_to_user)
+	/**
+	 * Naive byte-by-byte implementation
+	 */
+	make $r33 = 0x0;
+	/* If size == 0, exit directly */
+	cb.deqz $r2? copy_exit
+	;;
+	loopdo $r2, copy_exit
+		;;
+0:		lbz $r34 = $r33[$r1]
+		;;
+1:		sb $r33[$r0] = $r34
+		addd $r33 = $r33, 1 /* Ptr increment */
+		addd $r2 = $r2, -1 /* Remaining bytes to copy */
+		;;
+	copy_exit:
+	copyd $r0 = $r2
+	ret
+	;;
+ENDPROC(raw_copy_to_user)
+ENDPROC(raw_copy_from_user)
+
+/**
+ * Exception table
+ * Each entry corresponds to the following layout:
+ * .dword trapping_addr, restore_addr
+ *
+ * On a trap, the handler checks whether $spc matches a trapping
+ * address in the exception table. If so, the restore address is
+ * written into the return address of the trap handler, allowing the
+ * copy to finish cleanly and return the number of bytes left to copy/clear
+ */
+.pushsection __ex_table,"a"
+.balign 8
+.dword 0b, copy_exit
+.dword 1b, copy_exit
+.popsection
+
+/**
+ * Clear a user buffer
+ * r0 = buffer to clear
+ * r1 = size to clear
+ */
+.text
+ENTRY(asm_clear_user)
+	/**
+	 * Naive byte-by-byte implementation
+	 */
+	make $r33 = 0x0;
+	make $r34 = 0x0;
+	/* If size == 0, exit directly */
+	cb.deqz $r1? clear_exit
+	;;
+	loopdo $r1, clear_exit
+		;;
+40:		sb $r33[$r0] = $r34
+		addd $r33 = $r33, 1 /* Ptr increment */
+		addd $r1 = $r1, -1 /* Remaining bytes to copy */
+		;;
+	clear_exit:
+	copyd $r0 = $r1
+	ret
+	;;
+ENDPROC(asm_clear_user)
+
+.pushsection __ex_table,"a"
+.balign 8
+.dword 40b, clear_exit
+.popsection
+
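
For reference, the fixup flow described in the comments above looks roughly like this on the trap-handler side (a hedged sketch with illustrative names, not the actual kvx handler code):

/* Each __ex_table entry is two doublewords, as emitted above. */
struct exception_table_entry {
	unsigned long insn;	/* trapping address (0b, 1b, 40b labels) */
	unsigned long fixup;	/* restore address (copy_exit / clear_exit) */
};

extern struct exception_table_entry __start___ex_table[];
extern struct exception_table_entry __stop___ex_table[];

/* Called by the NOMAPPING trap handler with the saved $spc; returns the
 * fixup address to resume at, or 0 if the fault is a genuine kernel bug. */
static unsigned long ex_table_fixup(unsigned long spc)
{
	struct exception_table_entry *e;

	for (e = __start___ex_table; e < __stop___ex_table; e++)
		if (e->insn == spc)
			return e->fixup;
	return 0;
}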
-- 
2.37.2






_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

WARNING: multiple messages have this Message-ID (diff)
From: Yann Sionneau <ysionneau@kalray.eu>
To: "Arnd Bergmann" <arnd@arndb.de>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Marc Zyngier" <maz@kernel.org>,
	"Rob Herring" <robh+dt@kernel.org>,
	"Krzysztof Kozlowski" <krzysztof.kozlowski+dt@linaro.org>,
	"Will Deacon" <will@kernel.org>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Boqun Feng" <boqun.feng@gmail.com>,
	"Mark Rutland" <mark.rutland@arm.com>,
	"Eric Biederman" <ebiederm@xmission.com>,
	"Kees Cook" <keescook@chromium.org>,
	"Oleg Nesterov" <oleg@redhat.com>,
	"Ingo Molnar" <mingo@redhat.com>,
	"Waiman Long" <longman@redhat.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Nick Piggin" <npiggin@gmail.com>,
	"Paul Moore" <paul@paul-moore.com>,
	"Eric Paris" <eparis@redhat.com>,
	"Christian Brauner" <brauner@kernel.org>,
	"Paul Walmsley" <paul.walmsley@sifive.com>,
	"Palmer Dabbelt" <palmer@dabbelt.com>,
	"Albert Ou" <aou@eecs.berkeley.edu>,
	"Jules Maselbas" <jmaselbas@kalray.eu>,
	"Yann Sionneau" <ysionneau@kalray.eu>,
	"Guillaume Thouvenin" <gthouvenin@kalray.eu>,
	"Clement Leger" <clement@clement-leger.fr>,
	"Vincent Chardon" <vincent.chardon@elsys-design.com>,
	"Marc Poulhiès" <dkm@kataplop.net>,
	"Julian Vetter" <jvetter@kalray.eu>,
	"Samuel Jones" <sjones@kalray.eu>,
	"Ashley Lesdalons" <alesdalons@kalray.eu>,
	"Thomas Costis" <tcostis@kalray.eu>,
	"Marius Gligor" <mgligor@kalray.eu>,
	"Jonathan Borne" <jborne@kalray.eu>,
	"Julien Villette" <jvillette@kalray.eu>,
	"Luc Michel" <lmichel@kalray.eu>,
	"Louis Morhet" <lmorhet@kalray.eu>,
	"Julien Hascoet" <jhascoet@kalray.eu>,
	"Jean-Christophe Pince" <jcpince@gmail.com>,
	"Guillaume Missonnier" <gmissonnier@kalray.eu>,
	"Alex Michon" <amichon@kalray.eu>,
	"Huacai Chen" <chenhuacai@kernel.org>,
	"WANG Xuerui" <git@xen0n.name>,
	"Shaokun Zhang" <zhangshaokun@hisilicon.com>,
	"John Garry" <john.garry@huawei.com>,
	"Guangbin Huang" <huangguangbin2@huawei.com>,
	"Bharat Bhushan" <bbhushan2@marvell.com>,
	"Bibo Mao" <maobibo@loongson.cn>,
	"Atish Patra" <atishp@atishpatra.org>,
	"Jason A. Donenfeld" <Jason@zx2c4.com>,
	"Qi Liu" <liuqi115@huawei.com>,
	"Jiaxun Yang" <jiaxun.yang@flygoat.com>,
	"Catalin Marinas" <catalin.marinas@arm.com>,
	"Mark Brown" <broonie@kernel.org>,
	"Janosch Frank" <frankja@linux.ibm.com>,
	"Alexey Dobriyan" <adobriyan@gmail.com>
Cc: Benjamin Mugnier <mugnier.benjamin@gmail.com>,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	devicetree@vger.kernel.org, linux-mm@kvack.org,
	linux-arch@vger.kernel.org, linux-audit@redhat.com,
	linux-riscv@lists.infradead.org, bpf@vger.kernel.org
Subject: [RFC PATCH v2 25/31] kvx: Add some library functions
Date: Fri, 20 Jan 2023 15:09:56 +0100	[thread overview]
Message-ID: <20230120141002.2442-26-ysionneau@kalray.eu> (raw)
In-Reply-To: <20230120141002.2442-1-ysionneau@kalray.eu>

Add some library functions for kvx, including: delay, memset,
memcpy, strlen, clear_page, copy_page, raw_copy_from/to_user,
asm_clear_user.

Co-developed-by: Clement Leger <clement@clement-leger.fr>
Signed-off-by: Clement Leger <clement@clement-leger.fr>
Co-developed-by: Jules Maselbas <jmaselbas@kalray.eu>
Signed-off-by: Jules Maselbas <jmaselbas@kalray.eu>
Co-developed-by: Julian Vetter <jvetter@kalray.eu>
Signed-off-by: Julian Vetter <jvetter@kalray.eu>
Co-developed-by: Marius Gligor <mgligor@kalray.eu>
Signed-off-by: Marius Gligor <mgligor@kalray.eu>
Co-developed-by: Yann Sionneau <ysionneau@kalray.eu>
Signed-off-by: Yann Sionneau <ysionneau@kalray.eu>
---

Notes:
    V1 -> V2: no changes

 arch/kvx/include/asm/string.h |  20 ++
 arch/kvx/kernel/kvx_ksyms.c   |   5 +
 arch/kvx/lib/clear_page.S     |  40 ++++
 arch/kvx/lib/copy_page.S      |  90 +++++++++
 arch/kvx/lib/delay.c          |  39 ++++
 arch/kvx/lib/memcpy.c         |  70 +++++++
 arch/kvx/lib/memset.S         | 351 ++++++++++++++++++++++++++++++++++
 arch/kvx/lib/strlen.S         | 122 ++++++++++++
 arch/kvx/lib/usercopy.S       |  90 +++++++++
 9 files changed, 827 insertions(+)
 create mode 100644 arch/kvx/include/asm/string.h
 create mode 100644 arch/kvx/lib/clear_page.S
 create mode 100644 arch/kvx/lib/copy_page.S
 create mode 100644 arch/kvx/lib/delay.c
 create mode 100644 arch/kvx/lib/memcpy.c
 create mode 100644 arch/kvx/lib/memset.S
 create mode 100644 arch/kvx/lib/strlen.S
 create mode 100644 arch/kvx/lib/usercopy.S

diff --git a/arch/kvx/include/asm/string.h b/arch/kvx/include/asm/string.h
new file mode 100644
index 000000000000..677c1393a5cd
--- /dev/null
+++ b/arch/kvx/include/asm/string.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Clement Leger
+ *            Jules Maselbas
+ */
+
+#ifndef _ASM_KVX_STRING_H
+#define _ASM_KVX_STRING_H
+
+#define __HAVE_ARCH_MEMSET
+extern void *memset(void *s, int c, size_t n);
+
+#define __HAVE_ARCH_MEMCPY
+extern void *memcpy(void *dest, const void *src, size_t n);
+
+#define __HAVE_ARCH_STRLEN
+extern size_t strlen(const char *s);
+
+#endif	/* _ASM_KVX_STRING_H */
diff --git a/arch/kvx/kernel/kvx_ksyms.c b/arch/kvx/kernel/kvx_ksyms.c
index 18990aaf259f..678f81716dea 100644
--- a/arch/kvx/kernel/kvx_ksyms.c
+++ b/arch/kvx/kernel/kvx_ksyms.c
@@ -22,3 +22,8 @@ DECLARE_EXPORT(__umoddi3);
 DECLARE_EXPORT(__divdi3);
 DECLARE_EXPORT(__udivdi3);
 DECLARE_EXPORT(__multi3);
+
+DECLARE_EXPORT(clear_page);
+DECLARE_EXPORT(copy_page);
+DECLARE_EXPORT(memset);
+DECLARE_EXPORT(asm_clear_user);
diff --git a/arch/kvx/lib/clear_page.S b/arch/kvx/lib/clear_page.S
new file mode 100644
index 000000000000..364fe0663ca2
--- /dev/null
+++ b/arch/kvx/lib/clear_page.S
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Marius Gligor
+ *            Clement Leger
+ */
+
+#include <linux/linkage.h>
+#include <linux/export.h>
+#include <linux/const.h>
+
+#include <asm/cache.h>
+#include <asm/page.h>
+
+#define CLEAR_PAGE_LOOP_COUNT	(PAGE_SIZE / 32)
+
+/*
+ * Clear page @dest.
+ *
+ * Parameters:
+ *	r0 - dest page
+ */
+ENTRY(clear_page)
+	make $r1 = CLEAR_PAGE_LOOP_COUNT
+	;;
+	make $r4 = 0
+	make $r5 = 0
+	make $r6 = 0
+	make $r7 = 0
+	;;
+
+	loopdo $r1, clear_page_done
+		;;
+		so 0[$r0] = $r4r5r6r7
+		addd $r0 = $r0, 32
+		;;
+	clear_page_done:
+	ret
+	;;
+ENDPROC(clear_page)
diff --git a/arch/kvx/lib/copy_page.S b/arch/kvx/lib/copy_page.S
new file mode 100644
index 000000000000..4bb82d1c964c
--- /dev/null
+++ b/arch/kvx/lib/copy_page.S
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Clement Leger
+ */
+
+#include <linux/linkage.h>
+#include <linux/const.h>
+
+#include <asm/page.h>
+
+/* We have 8 load/store octuple (32 bytes) per hardware loop */
+#define COPY_SIZE_PER_LOOP	(32 * 8)
+#define COPY_PAGE_LOOP_COUNT	(PAGE_SIZE / COPY_SIZE_PER_LOOP)
+
+/*
+ * Copy a page from src to dest (both are page aligned)
+ * In order to recover from smem latency, unroll the loop to trigger multiple
+ * onfly loads and avoid waiting too much for them to return.
+ * We use 8 * 32 load even though we could use more (up to 10 loads) to simplify
+ * the handling using a single hardware loop
+ *
+ * Parameters:
+ *	r0 - dest
+ *	r1 - src
+ */
+ENTRY(copy_page)
+	make $r2 = COPY_PAGE_LOOP_COUNT
+	make $r3 = 0
+	;;
+	loopdo $r2, copy_page_done
+		;;
+		/*
+		 * Load 8 * 32 bytes using uncached access to avoid hitting
+		 * the cache
+		 */
+		lo.xs $r32r33r34r35 = $r3[$r1]
+		/* Copy current copy index for store */
+		copyd $r2 = $r3
+		addd $r3 = $r3, 1
+		;;
+		lo.xs $r36r37r38r39 = $r3[$r1]
+		addd $r3 = $r3, 1
+		;;
+		lo.xs $r40r41r42r43 = $r3[$r1]
+		addd $r3 = $r3, 1
+		;;
+		lo.xs $r44r45r46r47 = $r3[$r1]
+		addd $r3 = $r3, 1
+		;;
+		lo.xs $r48r49r50r51 = $r3[$r1]
+		addd $r3 = $r3, 1
+		;;
+		lo.xs $r52r53r54r55 = $r3[$r1]
+		addd $r3 = $r3, 1
+		;;
+		lo.xs $r56r57r58r59 = $r3[$r1]
+		addd $r3 = $r3, 1
+		;;
+		lo.xs $r60r61r62r63 = $r3[$r1]
+		addd $r3 = $r3, 1
+		;;
+		/* And then store all of them */
+		so.xs $r2[$r0] = $r32r33r34r35
+		addd $r2 = $r2, 1
+		;;
+		so.xs $r2[$r0] = $r36r37r38r39
+		addd $r2 = $r2, 1
+		;;
+		so.xs $r2[$r0] = $r40r41r42r43
+		addd $r2 = $r2, 1
+		;;
+		so.xs $r2[$r0] = $r44r45r46r47
+		addd $r2 = $r2, 1
+		;;
+		so.xs $r2[$r0] = $r48r49r50r51
+		addd $r2 = $r2, 1
+		;;
+		so.xs $r2[$r0] = $r52r53r54r55
+		addd $r2 = $r2, 1
+		;;
+		so.xs $r2[$r0] = $r56r57r58r59
+		addd $r2 = $r2, 1
+		;;
+		so.xs $r2[$r0] = $r60r61r62r63
+		;;
+	copy_page_done:
+	ret
+	;;
+ENDPROC(copy_page)
diff --git a/arch/kvx/lib/delay.c b/arch/kvx/lib/delay.c
new file mode 100644
index 000000000000..11295eedc3f5
--- /dev/null
+++ b/arch/kvx/lib/delay.c
@@ -0,0 +1,39 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Clement Leger
+ */
+
+#include <linux/export.h>
+#include <linux/delay.h>
+
+#include <asm/param.h>
+#include <asm/timex.h>
+
+void __delay(unsigned long loops)
+{
+	cycles_t target_cycle = get_cycles() + loops;
+
+	while (get_cycles() < target_cycle);
+}
+EXPORT_SYMBOL(__delay);
+
+inline void __const_udelay(unsigned long xloops)
+{
+	u64 loops = (u64)xloops * (u64)loops_per_jiffy * HZ;
+
+	__delay(loops >> 32);
+}
+EXPORT_SYMBOL(__const_udelay);
+
+void __udelay(unsigned long usecs)
+{
+	__const_udelay(usecs * 0x10C7UL); /* 2**32 / 1000000 (rounded up) */
+}
+EXPORT_SYMBOL(__udelay);
+
+void __ndelay(unsigned long nsecs)
+{
+	__const_udelay(nsecs * 0x5UL); /* 2**32 / 1000000000 (rounded up) */
+}
+EXPORT_SYMBOL(__ndelay);
diff --git a/arch/kvx/lib/memcpy.c b/arch/kvx/lib/memcpy.c
new file mode 100644
index 000000000000..b81f746a80ee
--- /dev/null
+++ b/arch/kvx/lib/memcpy.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Clement Leger
+ *            Yann Sionneau
+ */
+
+#include <linux/export.h>
+#include <linux/types.h>
+
+void *memcpy(void *dest, const void *src, size_t n)
+{
+	__uint128_t *tmp128_d = dest;
+	const __uint128_t *tmp128_s = src;
+	uint64_t *tmp64_d;
+	const uint64_t *tmp64_s;
+	uint32_t *tmp32_d;
+	const uint32_t *tmp32_s;
+	uint16_t *tmp16_d;
+	const uint16_t *tmp16_s;
+	uint8_t *tmp8_d;
+	const uint8_t *tmp8_s;
+
+	while (n >= 16) {
+		*tmp128_d = *tmp128_s;
+		tmp128_d++;
+		tmp128_s++;
+		n -= 16;
+	}
+
+	tmp64_d = (uint64_t *) tmp128_d;
+	tmp64_s = (uint64_t *) tmp128_s;
+	while (n >= 8) {
+		*tmp64_d = *tmp64_s;
+		tmp64_d++;
+		tmp64_s++;
+		n -= 8;
+	}
+
+	tmp32_d = (uint32_t *) tmp64_d;
+	tmp32_s = (uint32_t *) tmp64_s;
+	while (n >= 4) {
+		*tmp32_d = *tmp32_s;
+		tmp32_d++;
+		tmp32_s++;
+		n -= 4;
+	}
+
+	tmp16_d = (uint16_t *) tmp32_d;
+	tmp16_s = (uint16_t *) tmp32_s;
+	while (n >= 2) {
+		*tmp16_d = *tmp16_s;
+		tmp16_d++;
+		tmp16_s++;
+		n -= 2;
+	}
+
+	tmp8_d = (uint8_t *) tmp16_d;
+	tmp8_s = (uint8_t *) tmp16_s;
+	while (n >= 1) {
+		*tmp8_d = *tmp8_s;
+		tmp8_d++;
+		tmp8_s++;
+		n--;
+	}
+
+	return dest;
+}
+EXPORT_SYMBOL(memcpy);
+
diff --git a/arch/kvx/lib/memset.S b/arch/kvx/lib/memset.S
new file mode 100644
index 000000000000..9eebc28da2be
--- /dev/null
+++ b/arch/kvx/lib/memset.S
@@ -0,0 +1,351 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Clement Leger
+ *            Marius Gligor
+ */
+
+#include <linux/linkage.h>
+
+#include <asm/cache.h>
+
+#define REPLICATE_BYTE_MASK	0x0101010101010101
+#define MIN_SIZE_FOR_ALIGN	128
+
+/*
+ * Optimized memset for kvx architecture
+ *
+ * In order to optimize memset on kvx, we can use various things:
+ * - conditionnal store which avoid branch penalty
+ * - store half/word/double/quad/octuple to store up to 16 bytes at a time
+ * - dzerol to zero a cacheline when the pattern is '0' (often the case)
+ * - hardware loop for steady cases.
+ *
+ * First, we assume that memset is mainly used for zeroing areas. In order
+ * to optimize this case, we consider it to be the fast path of the algorithm.
+ * In both cases (0 and non 0 pattern), we start by checking if the size is
+ * below a minimum size. If so, we skip the alignment part. Indeed, the kvx
+ * supports misalignment and the penalty for letting it do unaligned accesses is
+ * lower than trying to realigning us. So for small sizes, we don't even bother
+ * to realign. Minor difference is that in the memset with 0, we skip after the
+ * dzerol loop since dzerol must be cache-line aligned (no misalignment of
+ * course).
+ * Regarding the non 0 pattern memset, we use sbmm to replicate the pattern on
+ * all bits on a register in one call.
+ * Once alignment has been reached, we can do the hardware loop for both cases(
+ * store octuple/dzerol) in order to optimize throughput. Care must be taken to
+ * align hardware loops on at least 8 bytes for performances.
+ * Once the main loop has been done, we finish the copy by checking length to do
+ * the necessary calls to store remaining bytes.
+ *
+ * Pseudo code (applies for non 0 pattern):
+ *
+ * int memset(void *dest, char pattern, long length)
+ * {
+ * 	long dest_align = -((long) dest);
+ * 	long copy;
+ * 	long orig_dest = dest;
+ *
+ * 	uint64_t pattern = sbmm8(pattern, 0x0101010101010101);
+ * 	uint128_t pattern128 = pattern << 64 | pattern;
+ * 	uint256_t pattern128 = pattern128 << 128 | pattern128;
+ *
+ * 	// Keep only low bits
+ * 	dest_align &= 0x1F;
+ * 	length -= dest_align;
+ *
+ * 	// Byte align
+ * 	copy = align & (1 << 0);
+ * 	if (copy)
+ * 		*((u8 *) dest) = pattern;
+ * 	dest += copy;
+ * 	// Half align
+ * 	copy = align & (1 << 1);
+ * 	if (copy)
+ * 		*((u16 *) dest) = pattern;
+ * 	dest += copy;
+ * 	// Word align
+ * 	copy = align & (1 << 2);
+ * 	if (copy)
+ * 		*((u32 *) dest) = pattern;
+ * 	dest += copy;
+ * 	// Double align
+ * 	copy = align & (1 << 3);
+ * 	if (copy)
+ * 		*((u64 *) dest) = pattern;
+ * 	dest += copy;
+ * 	// Quad align
+ * 	copy = align & (1 << 4);
+ * 	if (copy)
+ * 		*((u128 *) dest) = pattern128;
+ * 	dest += copy;
+ *
+ * 	// We are now aligned on 256 bits
+ * 	loop_octuple_count = size >> 5;
+ * 	for (i = 0; i < loop_octuple_count; i++) {
+ * 		*((u256 *) dest) = pattern256;
+ * 		dest += 32;
+ * 	}
+ *
+ * 	if (length == 0)
+ * 		return orig_dest;
+ *
+ * 	// Copy remaining part
+ * 	remain = length & (1 << 4);
+ * 	if (copy)
+ * 		*((u128 *) dest) = pattern128;
+ * 	dest += remain;
+ * 	remain = length & (1 << 3);
+ * 	if (copy)
+ * 		*((u64 *) dest) = pattern;
+ * 	dest += remain;
+ * 	remain = length & (1 << 2);
+ * 	if (copy)
+ * 		*((u32 *) dest) = pattern;
+ * 	dest += remain;
+ * 	remain = length & (1 << 1);
+ * 	if (copy)
+ * 		*((u16 *) dest) = pattern;
+ * 	dest += remain;
+ * 	remain = length & (1 << 0);
+ * 	if (copy)
+ * 		*((u8 *) dest) = pattern;
+ * 	dest += remain;
+ *
+ * 	return orig_dest;
+ * }
+ */
+
+.text
+.align 16
+ENTRY(memset):
+	make $r32 = 0
+	make $r33 = 0
+	/* Check if length < KVX_DCACHE_LINE_SIZE */
+	compd.ltu $r7 = $r2, KVX_DCACHE_LINE_SIZE
+	/* Jump to generic memset if pattern is != 0 */
+	cb.dnez $r1? memset_non_0_pattern
+	;;
+	/* Preserve return value */
+	copyd $r3 = $r0
+	/* Invert address to compute size to copy to be aligned on 32 bytes */
+	negd $r5 = $r0
+	/* Remaining bytes for 16 bytes store (for alignment on 64 bytes) */
+	andd $r8 = $r2, (1 << 5)
+	copyq $r34r35 = $r32, $r33
+	/* Skip loopdo with dzerol if length < KVX_DCACHE_LINE_SIZE */
+	cb.dnez $r7? .Ldzerol_done
+	;;
+	/* Compute the size that will be copied to align on 64 bytes boundary */
+	andw $r6 = $r5, 0x3F
+	/* Check if address is aligned on 64 bytes */
+	andw $r9 = $r0, 0x3F
+	/* Alignment */
+	nop
+	;;
+	/* If address already aligned on 64 bytes, jump to dzerol loop */
+	cb.deqz $r9? .Laligned_64
+	/* Remove unaligned part from length */
+	sbfd $r2 = $r6, $r2
+	/* Check if we need to copy 1 byte */
+	andw $r4 = $r5, (1 << 0)
+	;;
+	/* If we are not aligned, store byte */
+	sb.dnez $r4? [$r0] = $r32
+	/* Check if we need to copy 2 bytes */
+	andw $r4 = $r5, (1 << 1)
+	/* Add potentially copied part for next store offset */
+	addd $r0 = $r0, $r4
+	;;
+	sh.dnez $r4? [$r0] = $r32
+	/* Check if we need to copy 4 bytes */
+	andw $r4 = $r5, (1 << 2)
+	addd $r0 = $r0, $r4
+	;;
+	sw.dnez $r4? [$r0] = $r32
+	/* Check if we need to copy 8 bytes */
+	andw $r4 = $r5, (1 << 3)
+	addd $r0 = $r0, $r4
+	;;
+	sd.dnez $r4? [$r0] = $r32
+	/* Check if we need to copy 16 bytes */
+	andw $r4 = $r5, (1 << 4)
+	addd $r0 = $r0, $r4
+	;;
+	sq.dnez $r4? [$r0] = $r32r33
+	/* Check if we need to copy 32 bytes */
+	andw $r4 = $r5, (1 << 5)
+	addd $r0 = $r0, $r4
+	;;
+	so.dnez $r4? [$r0] = $r32r33r34r35
+	addd $r0 = $r0, $r4
+	;;
+.Laligned_64:
+	/* Prepare amount of data for dzerol */
+	srld $r10 = $r2, 6
+	/* Size to be handled in loopdo */
+	andd $r4 = $r2, ~0x3F
+	make $r11 = 64
+	cb.deqz $r2? .Lmemset_done
+	;;
+	/* Remaining bytes for 16 bytes store */
+	andw $r8 = $r2, (1 << 5)
+	/* Skip dzerol if there are not enough data for 64 bytes store */
+	cb.deqz $r10? .Ldzerol_done
+	/* Update length to copy */
+	sbfd $r2 = $r4, $r2
+	;;
+	loopdo $r10, .Ldzerol_done
+		;;
+		so 0[$r0], $r32r33r34r35
+		;;
+		so 32[$r0], $r32r33r34r35
+		addd $r0 = $r0, $r11
+		;;
+	.Ldzerol_done:
+	/*
+	 * Now that we have handled every aligned bytes using 'dzerol', we can
+	 * handled the remainder of length using store by decrementing size
+	 * We also exploit the fact we are aligned to simply check remaining
+	 * size */
+	so.dnez $r8? [$r0] = $r32r33r34r35
+	addd $r0 = $r0, $r8
+	/* Remaining bytes for 16 bytes store */
+	andw $r8 = $r2, (1 << 4)
+	cb.deqz $r2? .Lmemset_done
+	;;
+	sq.dnez $r8? [$r0] = $r32r33
+	addd $r0 = $r0, $r8
+	/* Remaining bytes for 8 bytes store */
+	andw $r8 = $r2, (1 << 3)
+	;;
+	sd.dnez $r8? [$r0] = $r32
+	addd $r0 = $r0, $r8
+	/* Remaining bytes for 4 bytes store */
+	andw $r8 = $r2, (1 << 2)
+	;;
+	sw.dnez $r8? [$r0] = $r32
+	addd $r0 = $r0, $r8
+	/* Remaining bytes for 2 bytes store */
+	andw $r8 = $r2, (1 << 1)
+	;;
+	sh.dnez $r8? [$r0] = $r32
+	addd $r0 = $r0, $r8
+	;;
+	sb.odd $r2? [$r0] = $r32
+	/* Restore original value */
+	copyd $r0 = $r3
+	ret
+	;;
+
+.align 16
+memset_non_0_pattern:
+	/* Preserve return value */
+	copyd $r3 = $r0
+	/* Replicate the first pattern byte on all bytes */
+	sbmm8 $r32 = $r1, REPLICATE_BYTE_MASK
+	/* Check if length < MIN_SIZE_FOR_ALIGN */
+	compd.geu $r7 = $r2, MIN_SIZE_FOR_ALIGN
+	/* Invert address to compute size to copy to be aligned on 32 bytes */
+	negd $r5 = $r0
+	;;
+	/* Check if we are aligned on 32 bytes */
+	andw $r9 = $r0, 0x1F
+	/* Compute the size that will be copied to align on 32 bytes boundary */
+	andw $r6 = $r5, 0x1F
+	/*
+	 * If size < MIN_SIZE_FOR_ALIGN bits, directly go to so, it will be done
+	 * unaligned but that is still better that what we can do with sb
+	 */
+	cb.deqz $r7? .Laligned_32
+	;;
+	/* Remove unaligned part from length */
+	sbfd $r2 = $r6, $r2
+	/* If we are already aligned on 32 bytes, jump to main "so" loop */
+	cb.deqz $r9? .Laligned_32
+	/* Check if we need to copy 1 byte */
+	andw $r4 = $r5, (1 << 0)
+	;;
+	/* If we are not aligned, store byte */
+	sb.dnez $r4? [$r0] = $r32
+	/* Check if we need to copy 2 bytes */
+	andw $r4 = $r5, (1 << 1)
+	/* Add potentially copied part for next store offset */
+	addd $r0 = $r0, $r4
+	;;
+	sh.dnez $r4? [$r0] = $r32
+	/* Check if we need to copy 4 bytes */
+	andw $r4 = $r5, (1 << 2)
+	addd $r0 = $r0, $r4
+	;;
+	sw.dnez $r4? [$r0] = $r32
+	/* Check if we need to copy 8 bytes */
+	andw $r4 = $r5, (1 << 3)
+	addd $r0 = $r0, $r4
+	/* Copy second part of pattern for sq */
+	copyd $r33 = $r32
+	;;
+	sd.dnez $r4? [$r0] = $r32
+	/* Check if we need to copy 16 bytes */
+	andw $r4 = $r5, (1 << 4)
+	addd $r0 = $r0, $r4
+	;;
+	sq.dnez $r4? [$r0] = $r32r33
+	addd $r0 = $r0, $r4
+	;;
+.Laligned_32:
+	/* Copy second part of pattern for sq */
+	copyd $r33 = $r32
+	/* Prepare amount of data for 32 bytes store */
+	srld $r10 = $r2, 5
+	nop
+	nop
+	;;
+	copyq $r34r35 = $r32, $r33
+	/* Remaining bytes for 16 bytes store */
+	andw $r8 = $r2, (1 << 4)
+	make $r11 = 32
+	/* Check if there are enough data for 32 bytes store */
+	cb.deqz $r10? .Laligned_32_done
+	;;
+	loopdo $r10, .Laligned_32_done
+		;;
+		so 0[$r0] = $r32r33r34r35
+		addd $r0 = $r0, $r11
+		;;
+	.Laligned_32_done:
+	/*
+	 * Now that we have handled every aligned bytes using 'so', we can
+	 * handled the remainder of length using store by decrementing size
+	 * We also exploit the fact we are aligned to simply check remaining
+	 * size */
+	sq.dnez $r8? [$r0] = $r32r33
+	addd $r0 = $r0, $r8
+	/* Remaining bytes for 8 bytes store */
+	andw $r8 = $r2, (1 << 3)
+	cb.deqz $r2? .Lmemset_done
+	;;
+	sd.dnez $r8? [$r0] = $r32
+	addd $r0 = $r0, $r8
+	/* Remaining bytes for 4 bytes store */
+	andw $r8 = $r2, (1 << 2)
+	;;
+	sw.dnez $r8? [$r0] = $r32
+	addd $r0 = $r0, $r8
+	/* Remaining bytes for 2 bytes store */
+	andw $r8 = $r2, (1 << 1)
+	;;
+	sh.dnez $r8? [$r0] = $r32
+	addd $r0 = $r0, $r8
+	;;
+	sb.odd $r2? [$r0] = $r32
+	/* Restore original value */
+	copyd $r0 = $r3
+	ret
+	;;
+.Lmemset_done:
+	/* Restore original value */
+	copyd $r0 = $r3
+	ret
+	;;
+ENDPROC(memset)
diff --git a/arch/kvx/lib/strlen.S b/arch/kvx/lib/strlen.S
new file mode 100644
index 000000000000..8298402a7898
--- /dev/null
+++ b/arch/kvx/lib/strlen.S
@@ -0,0 +1,122 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Jules Maselbas
+ */
+#include <linux/linkage.h>
+#include <asm/export.h>
+
+/*
+ *	kvx optimized strlen
+ *
+ *	This implementation of strlen only does aligned memory accesses.
+ *	Since we don't know the total length the idea is to do double word
+ *	load and stop on the first null byte found. As it's always safe to
+ *	read more up to a lower 8-bytes boundary.
+ *
+ *	This implementation of strlen uses a trick to detect if a double
+ *	word contains a null byte [1]:
+ *
+ *	> #define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
+ *	> The sub-expression (v - 0x01010101UL), evaluates to a high bit set
+ *	> in any byte whenever the corresponding byte in v is zero or greater
+ *	> than 0x80. The sub-expression ~v & 0x80808080UL evaluates to high
+ *	> bits set in bytes where the byte of v doesn't have its high bit set
+ *	> (so the byte was less than 0x80). Finally, by ANDing these two sub-
+ *	> expressions the result is the high bits set where the bytes in v
+ *	> were zero, since the high bits set due to a value greater than 0x80
+ *	> in the first sub-expression are masked off by the second.
+ *
+ *	[1] http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
+ *
+ *	A second trick is used to get the exact number of characters before
+ *	the first null byte in a double word:
+ *
+ *		clz(sbmmt(zero, 0x0102040810204080))
+ *
+ *	This trick uses the haszero result which maps null byte to 0x80 and
+ *	others value to 0x00. The idea is to count the number of consecutive
+ *	null byte in the double word (counting from less significant byte
+ *	to most significant byte). To do so, using the bit matrix transpose
+ *	will "pack" all high bit (0x80) to the most significant byte (MSB).
+ *	It is not possible to count the trailing zeros in this MSB, however
+ *	if a byte swap is done before the bit matrix transpose we still have
+ *	all the information in the MSB but now we can count the leading zeros.
+ *	The instruction sbmmt with the matrix 0x0102040810204080 does exactly
+ *	what we need a byte swap followed by a bit transpose.
+ *
+ *	A last trick is used to handle the first double word misalignment.
+ *	This is done by masking off the N lower bytes (excess read) with N
+ *	between 0 and 7. The mask is applied on haszero results and will
+ *	force the N lower bytes to be considered not null.
+ *
+ *	This is a C implementation of the algorithm described above:
+ *
+ *	size_t strlen(char *s) {
+ *		uint64_t *p    = (uint64_t *)((uintptr_t)s) & ~0x7;
+ *		uint64_t rem   = ((uintptr_t)s) % 8;
+ *		uint64_t low   = -0x0101010101010101;
+ *		uint64_t high  =  0x8080808080808080;
+ *		uint64_t dword, zero;
+ *		uint64_t msk, len;
+ *
+ *		dword = *p++;
+ *		zero  = (dword + low) & ~dword & high;
+ *		msk   = 0xffffffffffffffff << (rem * 8);
+ *		zero &= msk;
+ *
+ *		while (!zero) {
+ *			dword = *p++;
+ *			zero  = (dword + low) & ~dword & high;
+ *		}
+ *
+ *		zero = __builtin_kvx_sbmmt8(zero, 0x0102040810204080);
+ *		len = ((void *)p - (void *)s) - 8;
+ *		len += __builtin_kvx_clzd(zero);
+ *
+ *		return len;
+ *	}
+ */
+
+.text
+.align 16
+ENTRY(strlen)
+	andd  $r1 = $r0, ~0x7
+	andd  $r2 = $r0,  0x7
+	make $r10 = -0x0101010101010101
+	make $r11 =  0x8080808080808080
+	;;
+	ld $r4 = 0[$r1]
+	sllw $r2 = $r2, 3
+	make $r3 = 0xffffffffffffffff
+	;;
+	slld $r2 = $r3, $r2
+	addd $r5 = $r4, $r10
+	andnd $r6 = $r4, $r11
+	;;
+	andd $r6 = $r6, $r2
+	make $r3 = 0
+	;;
+.loop:
+	andd $r4 = $r5, $r6
+	addd $r1 = $r1, 0x8
+	;;
+	cb.dnez $r4? .end
+	ld.deqz $r4? $r4 = [$r1]
+	;;
+	addd $r5 = $r4, $r10
+	andnd $r6 = $r4, $r11
+	goto .loop
+	;;
+.end:
+	addd $r1 = $r1, -0x8
+	sbmmt8 $r4 = $r4, 0x0102040810204080
+	;;
+	clzd $r4 = $r4
+	sbfd $r1 = $r0, $r1
+	;;
+	addd $r0 = $r4, $r1
+	ret
+	;;
+ENDPROC(strlen)
+EXPORT_SYMBOL(strlen)
diff --git a/arch/kvx/lib/usercopy.S b/arch/kvx/lib/usercopy.S
new file mode 100644
index 000000000000..bc7e1a45e1c7
--- /dev/null
+++ b/arch/kvx/lib/usercopy.S
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Clement Leger
+ */
+#include <linux/linkage.h>
+
+/**
+ * Copy from/to a user buffer
+ * r0 = to buffer
+ * r1 = from buffer
+ * r2 = size to copy
+ * This function can trapped when hitting a non-mapped page.
+ * It will trigger a trap NOMAPPING and the trap handler will interpret
+ * it and check if instruction pointer is inside __ex_table.
+ * The next step are described later !
+ */
+.text
+ENTRY(raw_copy_from_user)
+ENTRY(raw_copy_to_user)
+	/**
+	 * naive implementation byte per byte
+	 */
+	make $r33 = 0x0;
+	/* If size == 0, exit directly */
+	cb.deqz $r2? copy_exit
+	;;
+	loopdo $r2, copy_exit
+		;;
+0:		lbz $r34 = $r33[$r1]
+		;;
+1:		sb $r33[$r0] = $r34
+		addd $r33 = $r33, 1 /* Ptr increment */
+		addd $r2 = $r2, -1 /* Remaining bytes to copy */
+		;;
+	copy_exit:
+	copyd $r0 = $r2
+	ret
+	;;
+ENDPROC(raw_copy_to_user)
+ENDPROC(raw_copy_from_user)
+
+/**
+ * Exception table
+ * each entry correspond to the following:
+ * .dword trapping_addr, restore_addr
+ *
+ * On trap, the handler will try to locate if $spc is matching a
+ * trapping address in the exception table. If so, the restore addr
+ * will  be put in the return address of the trap handler, allowing
+ * to properly finish the copy and return only the bytes copied/cleared
+ */
+.pushsection __ex_table,"a"
+.balign 8
+.dword 0b, copy_exit
+.dword 1b, copy_exit
+.popsection
+
+/**
+ * Clear a user buffer
+ * r0 = buffer to clear
+ * r1 = size to clear
+ */
+.text
+ENTRY(asm_clear_user)
+	/**
+	 * naive implementation byte per byte
+	 */
+	make $r33 = 0x0;
+	make $r34 = 0x0;
+	/* If size == 0, exit directly */
+	cb.deqz $r1? clear_exit
+	;;
+	loopdo $r1, clear_exit
+		;;
+40:		sb $r33[$r0] = $r34
+		addd $r33 = $r33, 1 /* Ptr increment */
+		addd $r1 = $r1, -1 /* Remaining bytes to copy */
+		;;
+	clear_exit:
+	copyd $r0 = $r1
+	ret
+	;;
+ENDPROC(asm_clear_user)
+
+.pushsection __ex_table,"a"
+.balign 8
+.dword 40b, clear_exit
+.popsection
+
-- 
2.37.2












