linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/2] RISC-V: Probe for misaligned access speed
@ 2023-06-23 22:20 Evan Green
  2023-06-23 22:20 ` [PATCH 1/2] RISC-V: Probe for unaligned " Evan Green
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Evan Green @ 2023-06-23 22:20 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: Simon Hosie, Evan Green, Albert Ou, Alexandre Ghiti,
	Andrew Jones, Andy Chiu, Anup Patel, Conor Dooley, Greentime Hu,
	Guo Ren, Heiko Stuebner, Heiko Stuebner, Jisheng Zhang,
	Jonathan Corbet, Ley Foon Tan, Li Zhengyu, Masahiro Yamada,
	Palmer Dabbelt, Paul Walmsley, Randy Dunlap, Samuel Holland,
	Sia Jee Heng, Sunil V L, Xianting Tian, Yangyu Chen, linux-doc,
	linux-kernel, linux-riscv


The current setting for the hwprobe bit indicating misaligned access
speed is controlled by a vendor-specific feature probe function. This is
essentially a per-SoC table we have to maintain on behalf of each vendor
going forward. Let's convert that instead to something we detect at
runtime.

We have two assembly routines at the heart of our probe: one that
does a bunch of word-sized accesses (without aligning its input buffer),
and the other that does byte accesses. If we can move a larger number of
bytes using misaligned word accesses than we can in the same amount of
time using byte accesses, then we declare misaligned accesses to be
"fast".

The tradeoff of reducing this maintenance burden is boot time. We spend
4-6 jiffies per core doing this measurement (0-2 waiting for a jiffy
edge, and 4 on the measurement itself). The timing loop was based on
raid6_choose_gen(), which uses (16+1)*N jiffies (where N is the number
of algorithms). On my THead C906, measurements were stable across
several reboots and looked like this:

[    0.047582] cpu0: Unaligned word copy 1728 MB/s, byte copy 402 MB/s, misaligned accesses are fast

I don't have a machine where misaligned accesses are slow, but I'd be
interested to see the results of booting this series if someone did.
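
For completeness, this is roughly how userspace would consume the value
this series reports (a sketch against the hwprobe interface described in
Documentation/riscv/hwprobe.rst, not part of this series; there is no
libc wrapper at the time of writing, so the raw syscall is used):

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/hwprobe.h>
#include <asm/unistd.h>

int main(void)
{
	struct riscv_hwprobe pair = { .key = RISCV_HWPROBE_KEY_CPUPERF_0 };

	/* cpu_count == 0 and cpus == NULL: ask about all online harts. */
	if (syscall(__NR_riscv_hwprobe, &pair, 1, 0, NULL, 0))
		return 1;

	if ((pair.value & RISCV_HWPROBE_MISALIGNED_MASK) ==
	    RISCV_HWPROBE_MISALIGNED_FAST)
		printf("misaligned accesses are fast, use the unaligned path\n");
	else
		printf("stick to aligned accesses\n");

	return 0;
}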



Evan Green (2):
  RISC-V: Probe for unaligned access speed
  RISC-V: alternative: Remove feature_probe_func

 Documentation/riscv/hwprobe.rst      |  8 +--
 arch/riscv/errata/thead/errata.c     |  8 ---
 arch/riscv/include/asm/alternative.h |  5 --
 arch/riscv/include/asm/cpufeature.h  |  2 +
 arch/riscv/kernel/Makefile           |  1 +
 arch/riscv/kernel/alternative.c      | 19 -------
 arch/riscv/kernel/copy-noalign.S     | 71 +++++++++++++++++++++++++
 arch/riscv/kernel/copy-noalign.h     | 13 +++++
 arch/riscv/kernel/cpufeature.c       | 78 ++++++++++++++++++++++++++++
 arch/riscv/kernel/smpboot.c          |  3 +-
 10 files changed, 171 insertions(+), 37 deletions(-)
 create mode 100644 arch/riscv/kernel/copy-noalign.S
 create mode 100644 arch/riscv/kernel/copy-noalign.h

-- 
2.34.1



* [PATCH 1/2] RISC-V: Probe for unaligned access speed
  2023-06-23 22:20 [PATCH 0/2] RISC-V: Probe for misaligned access speed Evan Green
@ 2023-06-23 22:20 ` Evan Green
  2023-06-25 21:42   ` David Laight
                     ` (2 more replies)
  2023-06-23 22:20 ` [PATCH 2/2] RISC-V: alternative: Remove feature_probe_func Evan Green
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 21+ messages in thread
From: Evan Green @ 2023-06-23 22:20 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: Simon Hosie, Evan Green, Albert Ou, Alexandre Ghiti,
	Andrew Jones, Andy Chiu, Anup Patel, Conor Dooley, Greentime Hu,
	Guo Ren, Heiko Stuebner, Jisheng Zhang, Jonathan Corbet,
	Ley Foon Tan, Li Zhengyu, Masahiro Yamada, Palmer Dabbelt,
	Paul Walmsley, Sia Jee Heng, Sunil V L, Xianting Tian,
	Yangyu Chen, linux-doc, linux-kernel, linux-riscv

Rather than deferring misaligned access speed determinations to a vendor
function, let's probe them and find out how fast they are. If we
determine that a misaligned word access is faster than N byte accesses,
mark the hardware's misaligned access as "fast".

Fix the documentation as well to reflect this bar. Previously the only
SoC that returned "fast" was the THead C906. The change to the
documentation is more a clarification, since the C906 is fast in the
sense of the corrected documentation.

Signed-off-by: Evan Green <evan@rivosinc.com>
---

 Documentation/riscv/hwprobe.rst     |  8 +--
 arch/riscv/include/asm/cpufeature.h |  2 +
 arch/riscv/kernel/Makefile          |  1 +
 arch/riscv/kernel/copy-noalign.S    | 71 ++++++++++++++++++++++++++
 arch/riscv/kernel/copy-noalign.h    | 13 +++++
 arch/riscv/kernel/cpufeature.c      | 78 +++++++++++++++++++++++++++++
 arch/riscv/kernel/smpboot.c         |  2 +
 7 files changed, 171 insertions(+), 4 deletions(-)
 create mode 100644 arch/riscv/kernel/copy-noalign.S
 create mode 100644 arch/riscv/kernel/copy-noalign.h

diff --git a/Documentation/riscv/hwprobe.rst b/Documentation/riscv/hwprobe.rst
index 19165ebd82ba..710325751766 100644
--- a/Documentation/riscv/hwprobe.rst
+++ b/Documentation/riscv/hwprobe.rst
@@ -88,12 +88,12 @@ The following keys are defined:
     always extremely slow.
 
   * :c:macro:`RISCV_HWPROBE_MISALIGNED_SLOW`: Misaligned accesses are supported
-    in hardware, but are slower than the cooresponding aligned accesses
-    sequences.
+    in hardware, but are slower than N byte accesses, where N is the native
+    word size.
 
   * :c:macro:`RISCV_HWPROBE_MISALIGNED_FAST`: Misaligned accesses are supported
-    in hardware and are faster than the cooresponding aligned accesses
-    sequences.
+    in hardware and are faster than N byte accesses, where N is the native
+    word size.
 
   * :c:macro:`RISCV_HWPROBE_MISALIGNED_UNSUPPORTED`: Misaligned accesses are
     not supported at all and will generate a misaligned address fault.
diff --git a/arch/riscv/include/asm/cpufeature.h b/arch/riscv/include/asm/cpufeature.h
index 23fed53b8815..b8e917176616 100644
--- a/arch/riscv/include/asm/cpufeature.h
+++ b/arch/riscv/include/asm/cpufeature.h
@@ -30,4 +30,6 @@ DECLARE_PER_CPU(long, misaligned_access_speed);
 /* Per-cpu ISA extensions. */
 extern struct riscv_isainfo hart_isa[NR_CPUS];
 
+void check_misaligned_access(int cpu);
+
 #endif
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index a42951911067..f934d7ab7840 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -34,6 +34,7 @@ extra-y += vmlinux.lds
 obj-y	+= head.o
 obj-y	+= soc.o
 obj-$(CONFIG_RISCV_ALTERNATIVE) += alternative.o
+obj-y	+= copy-noalign.o
 obj-y	+= cpu.o
 obj-y	+= cpufeature.o
 obj-y	+= entry.o
diff --git a/arch/riscv/kernel/copy-noalign.S b/arch/riscv/kernel/copy-noalign.S
new file mode 100644
index 000000000000..3807fc2324b2
--- /dev/null
+++ b/arch/riscv/kernel/copy-noalign.S
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2023 Rivos Inc. */
+
+#include <linux/linkage.h>
+#include <asm/asm.h>
+
+	.text
+
+/* void __copy_words_unaligned(void *, const void *, size_t) */
+/* Performs a memcpy without aligning buffers, using word loads and stores. */
+/* Note: The size is truncated to a multiple of 8 * SZREG */
+ENTRY(__copy_words_unaligned)
+	andi a4, a2, ~((8*SZREG)-1)
+	beqz a4, 2f
+	add a3, a1, a4
+1:
+	REG_L a4,       0(a1)
+	REG_L a5,   SZREG(a1)
+	REG_L a6, 2*SZREG(a1)
+	REG_L a7, 3*SZREG(a1)
+	REG_L t0, 4*SZREG(a1)
+	REG_L t1, 5*SZREG(a1)
+	REG_L t2, 6*SZREG(a1)
+	REG_L t3, 7*SZREG(a1)
+	REG_S a4,       0(a0)
+	REG_S a5,   SZREG(a0)
+	REG_S a6, 2*SZREG(a0)
+	REG_S a7, 3*SZREG(a0)
+	REG_S t0, 4*SZREG(a0)
+	REG_S t1, 5*SZREG(a0)
+	REG_S t2, 6*SZREG(a0)
+	REG_S t3, 7*SZREG(a0)
+	addi a0, a0, 8*SZREG
+	addi a1, a1, 8*SZREG
+	bltu a1, a3, 1b
+
+2:
+	ret
+END(__copy_words_unaligned)
+
+/* void __copy_bytes_unaligned(void *, const void *, size_t) */
+/* Performs a memcpy without aligning buffers, using only byte accesses. */
+/* Note: The size is truncated to a multiple of 8 */
+ENTRY(__copy_bytes_unaligned)
+	andi a4, a2, ~(8-1)
+	beqz a4, 2f
+	add a3, a1, a4
+1:
+	lb a4, 0(a1)
+	lb a5, 1(a1)
+	lb a6, 2(a1)
+	lb a7, 3(a1)
+	lb t0, 4(a1)
+	lb t1, 5(a1)
+	lb t2, 6(a1)
+	lb t3, 7(a1)
+	sb a4, 0(a0)
+	sb a5, 1(a0)
+	sb a6, 2(a0)
+	sb a7, 3(a0)
+	sb t0, 4(a0)
+	sb t1, 5(a0)
+	sb t2, 6(a0)
+	sb t3, 7(a0)
+	addi a0, a0, 8
+	addi a1, a1, 8
+	bltu a1, a3, 1b
+
+2:
+	ret
+END(__copy_bytes_unaligned)
diff --git a/arch/riscv/kernel/copy-noalign.h b/arch/riscv/kernel/copy-noalign.h
new file mode 100644
index 000000000000..99fbb9c763e0
--- /dev/null
+++ b/arch/riscv/kernel/copy-noalign.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2023 Rivos, Inc.
+ */
+#ifndef __RISCV_KERNEL_COPY_NOALIGN_H
+#define __RISCV_KERNEL_COPY_NOALIGN_H
+
+#include <linux/types.h>
+
+void __copy_words_unaligned(void *dst, const void *src, size_t size);
+void __copy_bytes_unaligned(void *dst, const void *src, size_t size);
+
+#endif /* __RISCV_KERNEL_COPY_NOALIGN_H */
diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
index bdcf460ea53d..3f7200dcc00c 100644
--- a/arch/riscv/kernel/cpufeature.c
+++ b/arch/riscv/kernel/cpufeature.c
@@ -19,11 +19,21 @@
 #include <asm/cacheflush.h>
 #include <asm/cpufeature.h>
 #include <asm/hwcap.h>
+#include <asm/hwprobe.h>
 #include <asm/patch.h>
 #include <asm/processor.h>
 #include <asm/vector.h>
 
+#include "copy-noalign.h"
+
 #define NUM_ALPHA_EXTS ('z' - 'a' + 1)
+#define MISALIGNED_ACCESS_JIFFIES_LG2 1
+#define MISALIGNED_BUFFER_SIZE 0x4000
+#define MISALIGNED_COPY_SIZE ((MISALIGNED_BUFFER_SIZE / 2) - 0x80)
+
+#define MISALIGNED_COPY_MBS(_count) \
+	((HZ * (_count) * MISALIGNED_COPY_SIZE) >> \
+	 (20 + MISALIGNED_ACCESS_JIFFIES_LG2))
 
 unsigned long elf_hwcap __read_mostly;
 
@@ -396,6 +406,74 @@ unsigned long riscv_get_elf_hwcap(void)
 	return hwcap;
 }
 
+void check_misaligned_access(int cpu)
+{
+	unsigned long j0, j1;
+	struct page *page;
+	void *dst;
+	void *src;
+	long word_copies = 0;
+	long byte_copies = 0;
+	long speed = RISCV_HWPROBE_MISALIGNED_SLOW;
+
+	page = alloc_pages(GFP_NOWAIT, get_order(MISALIGNED_BUFFER_SIZE));
+	if (!page) {
+		pr_warn("Can't alloc pages to measure memcpy performance");
+		return;
+	}
+
+	/* Make a misaligned destination buffer. */
+	dst = (void *)((unsigned long)page_address(page) | 0x1);
+	/* Misalign src as well, but differently (off by 1 + 2 = 3). */
+	src = dst + (MISALIGNED_BUFFER_SIZE / 2);
+	src += 2;
+	/* Do a warmup. */
+	__copy_words_unaligned(dst, src, MISALIGNED_COPY_SIZE);
+	preempt_disable();
+	j0 = jiffies;
+	while ((j1 = jiffies) == j0)
+		cpu_relax();
+
+	while (time_before(jiffies,
+			   j1 + (1 << MISALIGNED_ACCESS_JIFFIES_LG2))) {
+
+		__copy_words_unaligned(dst, src, MISALIGNED_COPY_SIZE);
+		word_copies += 1;
+	}
+
+	__copy_bytes_unaligned(dst, src, MISALIGNED_COPY_SIZE);
+	j0 = jiffies;
+	while ((j1 = jiffies) == j0)
+		cpu_relax();
+
+	while (time_before(jiffies,
+			   j1 + (1 << MISALIGNED_ACCESS_JIFFIES_LG2))) {
+		__copy_bytes_unaligned(dst, src, MISALIGNED_COPY_SIZE);
+		byte_copies += 1;
+	}
+
+	preempt_enable();
+	if (word_copies >= byte_copies)
+		speed = RISCV_HWPROBE_MISALIGNED_FAST;
+
+	pr_info("cpu%d: Unaligned word copy %ld MB/s, byte copy %ld MB/s, misaligned accesses are %s\n",
+		cpu,
+		MISALIGNED_COPY_MBS(word_copies),
+		MISALIGNED_COPY_MBS(byte_copies),
+		(speed == RISCV_HWPROBE_MISALIGNED_FAST) ? "fast" : "slow");
+
+	per_cpu(misaligned_access_speed, cpu) = speed;
+	__free_pages(page, get_order(MISALIGNED_BUFFER_SIZE));
+}
+
+static int check_misaligned_access0(void)
+{
+	check_misaligned_access(0);
+	return 0;
+}
+
+arch_initcall(check_misaligned_access0);
+
 #ifdef CONFIG_RISCV_ALTERNATIVE
 /*
  * Alternative patch sites consider 48 bits when determining when to patch
diff --git a/arch/riscv/kernel/smpboot.c b/arch/riscv/kernel/smpboot.c
index bb0b76e1a6d4..e34a71b4786b 100644
--- a/arch/riscv/kernel/smpboot.c
+++ b/arch/riscv/kernel/smpboot.c
@@ -26,6 +26,7 @@
 #include <linux/sched/task_stack.h>
 #include <linux/sched/mm.h>
 #include <asm/cpu_ops.h>
+#include <asm/cpufeature.h>
 #include <asm/irq.h>
 #include <asm/mmu_context.h>
 #include <asm/numa.h>
@@ -244,6 +245,7 @@ asmlinkage __visible void smp_callin(void)
 	notify_cpu_starting(curr_cpuid);
 	numa_add_cpu(curr_cpuid);
 	set_cpu_online(curr_cpuid, 1);
+	check_misaligned_access(curr_cpuid);
 	probe_vendor_features(curr_cpuid);
 
 	if (has_vector()) {
-- 
2.34.1



* [PATCH 2/2] RISC-V: alternative: Remove feature_probe_func
  2023-06-23 22:20 [PATCH 0/2] RISC-V: Probe for misaligned access speed Evan Green
  2023-06-23 22:20 ` [PATCH 1/2] RISC-V: Probe for unaligned " Evan Green
@ 2023-06-23 22:20 ` Evan Green
  2023-06-26 13:07   ` Conor Dooley
  2023-06-24 10:08 ` [PATCH 0/2] RISC-V: Probe for misaligned access speed Conor Dooley
  2023-06-24 10:22 ` Yangyu Chen
  3 siblings, 1 reply; 21+ messages in thread
From: Evan Green @ 2023-06-23 22:20 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: Simon Hosie, Evan Green, Albert Ou, Andrew Jones, Anup Patel,
	Conor Dooley, Greentime Hu, Guo Ren, Heiko Stuebner,
	Jisheng Zhang, Ley Foon Tan, Palmer Dabbelt, Paul Walmsley,
	Randy Dunlap, Samuel Holland, Sunil V L, linux-kernel,
	linux-riscv

Now that we're testing unaligned memory copy and making that
determination generically, there are no more users of the vendor
feature_probe_func(). While I think it's probably going to need to come
back, there are no users right now, so let's remove it until it's
needed.

Signed-off-by: Evan Green <evan@rivosinc.com>

---

 arch/riscv/errata/thead/errata.c     |  8 --------
 arch/riscv/include/asm/alternative.h |  5 -----
 arch/riscv/kernel/alternative.c      | 19 -------------------
 arch/riscv/kernel/smpboot.c          |  1 -
 4 files changed, 33 deletions(-)

diff --git a/arch/riscv/errata/thead/errata.c b/arch/riscv/errata/thead/errata.c
index c259dc925ec1..bf42857c977f 100644
--- a/arch/riscv/errata/thead/errata.c
+++ b/arch/riscv/errata/thead/errata.c
@@ -117,11 +117,3 @@ void thead_errata_patch_func(struct alt_entry *begin, struct alt_entry *end,
 	if (stage == RISCV_ALTERNATIVES_EARLY_BOOT)
 		local_flush_icache_all();
 }
-
-void thead_feature_probe_func(unsigned int cpu,
-			      unsigned long archid,
-			      unsigned long impid)
-{
-	if ((archid == 0) && (impid == 0))
-		per_cpu(misaligned_access_speed, cpu) = RISCV_HWPROBE_MISALIGNED_FAST;
-}
diff --git a/arch/riscv/include/asm/alternative.h b/arch/riscv/include/asm/alternative.h
index 6a41537826a7..58ccd2f8cab7 100644
--- a/arch/riscv/include/asm/alternative.h
+++ b/arch/riscv/include/asm/alternative.h
@@ -30,7 +30,6 @@
 #define ALT_OLD_PTR(a)			__ALT_PTR(a, old_offset)
 #define ALT_ALT_PTR(a)			__ALT_PTR(a, alt_offset)
 
-void probe_vendor_features(unsigned int cpu);
 void __init apply_boot_alternatives(void);
 void __init apply_early_boot_alternatives(void);
 void apply_module_alternatives(void *start, size_t length);
@@ -53,15 +52,11 @@ void thead_errata_patch_func(struct alt_entry *begin, struct alt_entry *end,
 			     unsigned long archid, unsigned long impid,
 			     unsigned int stage);
 
-void thead_feature_probe_func(unsigned int cpu, unsigned long archid,
-			      unsigned long impid);
-
 void riscv_cpufeature_patch_func(struct alt_entry *begin, struct alt_entry *end,
 				 unsigned int stage);
 
 #else /* CONFIG_RISCV_ALTERNATIVE */
 
-static inline void probe_vendor_features(unsigned int cpu) { }
 static inline void apply_boot_alternatives(void) { }
 static inline void apply_early_boot_alternatives(void) { }
 static inline void apply_module_alternatives(void *start, size_t length) { }
diff --git a/arch/riscv/kernel/alternative.c b/arch/riscv/kernel/alternative.c
index 6b75788c18e6..85056153fa23 100644
--- a/arch/riscv/kernel/alternative.c
+++ b/arch/riscv/kernel/alternative.c
@@ -27,8 +27,6 @@ struct cpu_manufacturer_info_t {
 	void (*patch_func)(struct alt_entry *begin, struct alt_entry *end,
 				  unsigned long archid, unsigned long impid,
 				  unsigned int stage);
-	void (*feature_probe_func)(unsigned int cpu, unsigned long archid,
-				   unsigned long impid);
 };
 
 static void riscv_fill_cpu_mfr_info(struct cpu_manufacturer_info_t *cpu_mfr_info)
@@ -43,7 +41,6 @@ static void riscv_fill_cpu_mfr_info(struct cpu_manufacturer_info_t *cpu_mfr_info
 	cpu_mfr_info->imp_id = sbi_get_mimpid();
 #endif
 
-	cpu_mfr_info->feature_probe_func = NULL;
 	switch (cpu_mfr_info->vendor_id) {
 #ifdef CONFIG_ERRATA_SIFIVE
 	case SIFIVE_VENDOR_ID:
@@ -53,7 +50,6 @@ static void riscv_fill_cpu_mfr_info(struct cpu_manufacturer_info_t *cpu_mfr_info
 #ifdef CONFIG_ERRATA_THEAD
 	case THEAD_VENDOR_ID:
 		cpu_mfr_info->patch_func = thead_errata_patch_func;
-		cpu_mfr_info->feature_probe_func = thead_feature_probe_func;
 		break;
 #endif
 	default:
@@ -143,20 +139,6 @@ void riscv_alternative_fix_offsets(void *alt_ptr, unsigned int len,
 	}
 }
 
-/* Called on each CPU as it starts */
-void probe_vendor_features(unsigned int cpu)
-{
-	struct cpu_manufacturer_info_t cpu_mfr_info;
-
-	riscv_fill_cpu_mfr_info(&cpu_mfr_info);
-	if (!cpu_mfr_info.feature_probe_func)
-		return;
-
-	cpu_mfr_info.feature_probe_func(cpu,
-					cpu_mfr_info.arch_id,
-					cpu_mfr_info.imp_id);
-}
-
 /*
  * This is called very early in the boot process (directly after we run
  * a feature detect on the boot CPU). No need to worry about other CPUs
@@ -211,7 +193,6 @@ void __init apply_boot_alternatives(void)
 	/* If called on non-boot cpu things could go wrong */
 	WARN_ON(smp_processor_id() != 0);
 
-	probe_vendor_features(0);
 	_apply_alternatives((struct alt_entry *)__alt_start,
 			    (struct alt_entry *)__alt_end,
 			    RISCV_ALTERNATIVES_BOOT);
diff --git a/arch/riscv/kernel/smpboot.c b/arch/riscv/kernel/smpboot.c
index e34a71b4786b..054f2d4474d0 100644
--- a/arch/riscv/kernel/smpboot.c
+++ b/arch/riscv/kernel/smpboot.c
@@ -246,7 +246,6 @@ asmlinkage __visible void smp_callin(void)
 	numa_add_cpu(curr_cpuid);
 	set_cpu_online(curr_cpuid, 1);
 	check_misaligned_access(curr_cpuid);
-	probe_vendor_features(curr_cpuid);
 
 	if (has_vector()) {
 		if (riscv_v_setup_vsize())
-- 
2.34.1



* Re: [PATCH 0/2] RISC-V: Probe for misaligned access speed
  2023-06-23 22:20 [PATCH 0/2] RISC-V: Probe for misaligned access speed Evan Green
  2023-06-23 22:20 ` [PATCH 1/2] RISC-V: Probe for unaligned " Evan Green
  2023-06-23 22:20 ` [PATCH 2/2] RISC-V: alternative: Remove feature_probe_func Evan Green
@ 2023-06-24 10:08 ` Conor Dooley
  2023-06-26 19:48   ` Evan Green
  2023-06-24 10:22 ` Yangyu Chen
  3 siblings, 1 reply; 21+ messages in thread
From: Conor Dooley @ 2023-06-24 10:08 UTC (permalink / raw)
  To: Evan Green
  Cc: Palmer Dabbelt, Simon Hosie, Albert Ou, Alexandre Ghiti,
	Andrew Jones, Andy Chiu, Anup Patel, Conor Dooley, Greentime Hu,
	Guo Ren, Heiko Stuebner, Heiko Stuebner, Jisheng Zhang,
	Jonathan Corbet, Ley Foon Tan, Li Zhengyu, Masahiro Yamada,
	Palmer Dabbelt, Paul Walmsley, Randy Dunlap, Samuel Holland,
	Sia Jee Heng, Sunil V L, Xianting Tian, Yangyu Chen, linux-doc,
	linux-kernel, linux-riscv


On Fri, Jun 23, 2023 at 03:20:14PM -0700, Evan Green wrote:
> 
> The current setting for the hwprobe bit indicating misaligned access
> speed is controlled by a vendor-specific feature probe function. This is
> essentially a per-SoC table we have to maintain on behalf of each vendor
> going forward. Let's convert that instead to something we detect at
> runtime.
> 
> We have two assembly routines at the heart of our probe: one that
> does a bunch of word-sized accesses (without aligning its input buffer),
> and the other that does byte accesses. If we can move a larger number of
> bytes using misaligned word accesses than we can with the same amount of
> time doing byte accesses, then we can declare misaligned accesses as
> "fast".
> 
> The tradeoff of reducing this maintenance burden is boot time. We spend
> 4-6 jiffies per core doing this measurement (0-2 on jiffie edge
> alignment, and 4 on measurement). The timing loop was based on
> raid6_choose_gen(), which uses (16+1)*N jiffies (where N is the number
> of algorithms). On my THead C906, I found measurements to be stable
> across several reboots, and looked like this:
> 
> [    0.047582] cpu0: Unaligned word copy 1728 MB/s, byte copy 402 MB/s, misaligned accesses are fast
> 
> I don't have a machine where misaligned accesses are slow, but I'd be
> interested to see the results of booting this series if someone did.

Can you elaborate on "results" please? Otherwise,

[    0.333110] smp: Bringing up secondary CPUs ...
[    0.370794] cpu1: Unaligned word copy 2 MB/s, byte copy 231 MB/s, misaligned accesses are slow
[    0.411368] cpu2: Unaligned word copy 2 MB/s, byte copy 231 MB/s, misaligned accesses are slow
[    0.451947] cpu3: Unaligned word copy 2 MB/s, byte copy 231 MB/s, misaligned accesses are slow
[    0.462628] smp: Brought up 1 node, 4 CPUs

[    0.631464] cpu0: Unaligned word copy 2 MB/s, byte copy 229 MB/s, misaligned accesses are slow

btw, why the mixed usage of "unaligned" and "misaligned"?

Cheers,
Conor.


* Re: [PATCH 0/2] RISC-V: Probe for misaligned access speed
  2023-06-23 22:20 [PATCH 0/2] RISC-V: Probe for misaligned access speed Evan Green
                   ` (2 preceding siblings ...)
  2023-06-24 10:08 ` [PATCH 0/2] RISC-V: Probe for misaligned access speed Conor Dooley
@ 2023-06-24 10:22 ` Yangyu Chen
  2023-06-26  9:24   ` David Laight
  2023-06-26 19:48   ` Evan Green
  3 siblings, 2 replies; 21+ messages in thread
From: Yangyu Chen @ 2023-06-24 10:22 UTC (permalink / raw)
  To: evan
  Cc: ajones, alexghiti, andy.chiu, aou, apatel, conor.dooley, corbet,
	cyy, greentime.hu, guoren, heiko.stuebner, heiko, jeeheng.sia,
	jszhang, leyfoon.tan, linux-doc, linux-kernel, linux-riscv,
	lizhengyu3, masahiroy, palmer, palmer, paul.walmsley, rdunlap,
	samuel, shosie, sunilvl, xianting.tian

Hi,

Thanks for doing this. 

On 6/24/23 6:20 AM, Evan Green wrote:
> I don't have a machine where misaligned accesses are slow, but I'd be
> interested to see the results of booting this series if someone did.

I have tested your patches on a 100MHz BigCore rocket-chip with OpenSBI, running on an FPGA with 72-bit (64-bit + ECC) DDR3 1600MHz memory. As the rocket-chip does not support misaligned memory access, every misaligned memory access traps and is emulated by the SBI.

Here is the result:

~ # cat /proc/cpuinfo
processor       : 0
hart            : 0
isa             : rv64imafdc
mmu             : sv39
uarch           : sifive,rocket0
mvendorid       : 0x0
marchid         : 0x1
mimpid          : 0x20181004

processor       : 1
hart            : 1
isa             : rv64imafdc
mmu             : sv39
uarch           : sifive,rocket0
mvendorid       : 0x0
marchid         : 0x1
mimpid          : 0x20181004

~ # dmesg | grep Unaligned
[    0.210140] cpu1: Unaligned word copy 0 MB/s, byte copy 38 MB/s, misaligned accesses are slow
[    0.410715] cpu0: Unaligned word copy 0 MB/s, byte copy 35 MB/s, misaligned accesses are slow

Thanks,
Yangyu Chen



* RE: [PATCH 1/2] RISC-V: Probe for unaligned access speed
  2023-06-23 22:20 ` [PATCH 1/2] RISC-V: Probe for unaligned " Evan Green
@ 2023-06-25 21:42   ` David Laight
  2023-06-26 14:14   ` Conor Dooley
  2023-06-26 21:42   ` Jessica Clarke
  2 siblings, 0 replies; 21+ messages in thread
From: David Laight @ 2023-06-25 21:42 UTC (permalink / raw)
  To: 'Evan Green', Palmer Dabbelt
  Cc: Simon Hosie, Albert Ou, Alexandre Ghiti, Andrew Jones, Andy Chiu,
	Anup Patel, Conor Dooley, Greentime Hu, Guo Ren, Heiko Stuebner,
	Jisheng Zhang, Jonathan Corbet, Ley Foon Tan, Li Zhengyu,
	Masahiro Yamada, Palmer Dabbelt, Paul Walmsley, Sia Jee Heng,
	Sunil V L, Xianting Tian, Yangyu Chen, linux-doc, linux-kernel,
	linux-riscv

From: Evan Green
> Sent: 23 June 2023 23:20
> 
> Rather than deferring misaligned access speed determinations to a vendor
> function, let's probe them and find out how fast they are. If we
> determine that a misaligned word access is faster than N byte accesses,
> mark the hardware's misaligned access as "fast".
> 
> Fix the documentation as well to reflect this bar. Previously the only
> SoC that returned "fast" was the THead C906. The change to the
> documentation is more a clarification, since the C906 is fast in the
> sense of the corrected documentation.
> 
> Signed-off-by: Evan Green <evan@rivosinc.com>
> ---
...
> diff --git a/Documentation/riscv/hwprobe.rst b/Documentation/riscv/hwprobe.rst
> index 19165ebd82ba..710325751766 100644
> --- a/Documentation/riscv/hwprobe.rst
> +++ b/Documentation/riscv/hwprobe.rst
> @@ -88,12 +88,12 @@ The following keys are defined:
>      always extremely slow.
> 
>    * :c:macro:`RISCV_HWPROBE_MISALIGNED_SLOW`: Misaligned accesses are supported
> -    in hardware, but are slower than the cooresponding aligned accesses
> -    sequences.
> +    in hardware, but are slower than N byte accesses, where N is the native
> +    word size.
> 
>    * :c:macro:`RISCV_HWPROBE_MISALIGNED_FAST`: Misaligned accesses are supported
> -    in hardware and are faster than the cooresponding aligned accesses
> -    sequences.
> +    in hardware and are faster than N byte accesses, where N is the native
> +    word size.

I think I'd just say 'faster/slower than using byte accesses' (i.e. no N).

There are two obvious FAST cases:
1) the misaligned access takes an extra clock - worth aligning copies.
2) the misaligned access is pretty much as fast as an aligned one.

Even if you find it hard to distinguish them you should probably
allow for them both.

x86 (on Intel non-Atom CPUs) is definitely in the latter camp.
Misaligned copies are measurably slower - but not enough to
ever worry about.
I think that misaligned transfers get split into 8-byte accesses
(pretty irrelevant in the kernel) and then accesses that cross
cache-line boundaries are split on the boundary.
With pipelined writes and two reads/clock it doesn't often
make a measurable difference.
That is definitely what I see for uncached accesses to PCIe space.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)



* RE: [PATCH 0/2] RISC-V: Probe for misaligned access speed
  2023-06-24 10:22 ` Yangyu Chen
@ 2023-06-26  9:24   ` David Laight
  2023-06-26 19:49     ` Evan Green
  2023-06-26 19:48   ` Evan Green
  1 sibling, 1 reply; 21+ messages in thread
From: David Laight @ 2023-06-26  9:24 UTC (permalink / raw)
  To: 'Yangyu Chen', evan
  Cc: ajones, alexghiti, andy.chiu, aou, apatel, conor.dooley, corbet,
	greentime.hu, guoren, heiko.stuebner, heiko, jeeheng.sia,
	jszhang, leyfoon.tan, linux-doc, linux-kernel, linux-riscv,
	lizhengyu3, masahiroy, palmer, palmer, paul.walmsley, rdunlap,
	samuel, shosie, sunilvl, xianting.tian

From: Yangyu Chen
> Sent: 24 June 2023 11:22
> 
> Hi,
> 
> Thanks for doing this.
> 
> On 6/24/23 6:20 AM, Evan Green wrote:
> > I don't have a machine where misaligned accesses are slow, but I'd be
> > interested to see the results of booting this series if someone did.
> 
> I have tested your patches on a 100MHz BigCore rocket-chip with opensbi running on FPGA with
> 72bit(64bit+ECC) DDR3 1600MHz memory. As the rocket-chip did not support misaligned memory access,
> every misaligned memory access will trap and emulated by SBI.
> 
> Here is the result:
...
> ~ # dmesg | grep Unaligned
> [    0.210140] cpu1: Unaligned word copy 0 MB/s, byte copy 38 MB/s, misaligned accesses are slow
> [    0.410715] cpu0: Unaligned word copy 0 MB/s, byte copy 35 MB/s, misaligned accesses are slow

How many misaligned cycles are in the test loop?
If emulated ones are that slow you pretty much only need to test one.

Also it is pretty clear that you really don't want to be emulating them.
If the emulation is hidden from the kernel that really doesn't help at all.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)



* Re: [PATCH 2/2] RISC-V: alternative: Remove feature_probe_func
  2023-06-23 22:20 ` [PATCH 2/2] RISC-V: alternative: Remove feature_probe_func Evan Green
@ 2023-06-26 13:07   ` Conor Dooley
  2023-06-30 21:57     ` Evan Green
  0 siblings, 1 reply; 21+ messages in thread
From: Conor Dooley @ 2023-06-26 13:07 UTC (permalink / raw)
  To: Evan Green
  Cc: Palmer Dabbelt, Simon Hosie, Albert Ou, Andrew Jones, Anup Patel,
	Greentime Hu, Guo Ren, Heiko Stuebner, Jisheng Zhang,
	Ley Foon Tan, Palmer Dabbelt, Paul Walmsley, Randy Dunlap,
	Samuel Holland, Sunil V L, linux-kernel, linux-riscv


Hey Evan,

On Fri, Jun 23, 2023 at 03:20:16PM -0700, Evan Green wrote:
> Now that we're testing unaligned memory copy and making that
> determination generically, there are no more users of the vendor
> feature_probe_func(). While I think it's probably going to need to come
> back, there are no users right now, so let's remove it until it's
> needed.

How come this is done as a separate patch, rather than deleting the dead
code as part of the probe addition? Ease of review?

Change itself seems fine to me though, so
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>

Cheers,
Conor.

> 
> Signed-off-by: Evan Green <evan@rivosinc.com>
> 
> ---
> 
>  arch/riscv/errata/thead/errata.c     |  8 --------
>  arch/riscv/include/asm/alternative.h |  5 -----
>  arch/riscv/kernel/alternative.c      | 19 -------------------
>  arch/riscv/kernel/smpboot.c          |  1 -
>  4 files changed, 33 deletions(-)
> 
> diff --git a/arch/riscv/errata/thead/errata.c b/arch/riscv/errata/thead/errata.c
> index c259dc925ec1..bf42857c977f 100644
> --- a/arch/riscv/errata/thead/errata.c
> +++ b/arch/riscv/errata/thead/errata.c
> @@ -117,11 +117,3 @@ void thead_errata_patch_func(struct alt_entry *begin, struct alt_entry *end,
>  	if (stage == RISCV_ALTERNATIVES_EARLY_BOOT)
>  		local_flush_icache_all();
>  }
> -
> -void thead_feature_probe_func(unsigned int cpu,
> -			      unsigned long archid,
> -			      unsigned long impid)
> -{
> -	if ((archid == 0) && (impid == 0))
> -		per_cpu(misaligned_access_speed, cpu) = RISCV_HWPROBE_MISALIGNED_FAST;
> -}
> diff --git a/arch/riscv/include/asm/alternative.h b/arch/riscv/include/asm/alternative.h
> index 6a41537826a7..58ccd2f8cab7 100644
> --- a/arch/riscv/include/asm/alternative.h
> +++ b/arch/riscv/include/asm/alternative.h
> @@ -30,7 +30,6 @@
>  #define ALT_OLD_PTR(a)			__ALT_PTR(a, old_offset)
>  #define ALT_ALT_PTR(a)			__ALT_PTR(a, alt_offset)
>  
> -void probe_vendor_features(unsigned int cpu);
>  void __init apply_boot_alternatives(void);
>  void __init apply_early_boot_alternatives(void);
>  void apply_module_alternatives(void *start, size_t length);
> @@ -53,15 +52,11 @@ void thead_errata_patch_func(struct alt_entry *begin, struct alt_entry *end,
>  			     unsigned long archid, unsigned long impid,
>  			     unsigned int stage);
>  
> -void thead_feature_probe_func(unsigned int cpu, unsigned long archid,
> -			      unsigned long impid);
> -
>  void riscv_cpufeature_patch_func(struct alt_entry *begin, struct alt_entry *end,
>  				 unsigned int stage);
>  
>  #else /* CONFIG_RISCV_ALTERNATIVE */
>  
> -static inline void probe_vendor_features(unsigned int cpu) { }
>  static inline void apply_boot_alternatives(void) { }
>  static inline void apply_early_boot_alternatives(void) { }
>  static inline void apply_module_alternatives(void *start, size_t length) { }
> diff --git a/arch/riscv/kernel/alternative.c b/arch/riscv/kernel/alternative.c
> index 6b75788c18e6..85056153fa23 100644
> --- a/arch/riscv/kernel/alternative.c
> +++ b/arch/riscv/kernel/alternative.c
> @@ -27,8 +27,6 @@ struct cpu_manufacturer_info_t {
>  	void (*patch_func)(struct alt_entry *begin, struct alt_entry *end,
>  				  unsigned long archid, unsigned long impid,
>  				  unsigned int stage);
> -	void (*feature_probe_func)(unsigned int cpu, unsigned long archid,
> -				   unsigned long impid);
>  };
>  
>  static void riscv_fill_cpu_mfr_info(struct cpu_manufacturer_info_t *cpu_mfr_info)
> @@ -43,7 +41,6 @@ static void riscv_fill_cpu_mfr_info(struct cpu_manufacturer_info_t *cpu_mfr_info
>  	cpu_mfr_info->imp_id = sbi_get_mimpid();
>  #endif
>  
> -	cpu_mfr_info->feature_probe_func = NULL;
>  	switch (cpu_mfr_info->vendor_id) {
>  #ifdef CONFIG_ERRATA_SIFIVE
>  	case SIFIVE_VENDOR_ID:
> @@ -53,7 +50,6 @@ static void riscv_fill_cpu_mfr_info(struct cpu_manufacturer_info_t *cpu_mfr_info
>  #ifdef CONFIG_ERRATA_THEAD
>  	case THEAD_VENDOR_ID:
>  		cpu_mfr_info->patch_func = thead_errata_patch_func;
> -		cpu_mfr_info->feature_probe_func = thead_feature_probe_func;
>  		break;
>  #endif
>  	default:
> @@ -143,20 +139,6 @@ void riscv_alternative_fix_offsets(void *alt_ptr, unsigned int len,
>  	}
>  }
>  
> -/* Called on each CPU as it starts */
> -void probe_vendor_features(unsigned int cpu)
> -{
> -	struct cpu_manufacturer_info_t cpu_mfr_info;
> -
> -	riscv_fill_cpu_mfr_info(&cpu_mfr_info);
> -	if (!cpu_mfr_info.feature_probe_func)
> -		return;
> -
> -	cpu_mfr_info.feature_probe_func(cpu,
> -					cpu_mfr_info.arch_id,
> -					cpu_mfr_info.imp_id);
> -}
> -
>  /*
>   * This is called very early in the boot process (directly after we run
>   * a feature detect on the boot CPU). No need to worry about other CPUs
> @@ -211,7 +193,6 @@ void __init apply_boot_alternatives(void)
>  	/* If called on non-boot cpu things could go wrong */
>  	WARN_ON(smp_processor_id() != 0);
>  
> -	probe_vendor_features(0);
>  	_apply_alternatives((struct alt_entry *)__alt_start,
>  			    (struct alt_entry *)__alt_end,
>  			    RISCV_ALTERNATIVES_BOOT);
> diff --git a/arch/riscv/kernel/smpboot.c b/arch/riscv/kernel/smpboot.c
> index e34a71b4786b..054f2d4474d0 100644
> --- a/arch/riscv/kernel/smpboot.c
> +++ b/arch/riscv/kernel/smpboot.c
> @@ -246,7 +246,6 @@ asmlinkage __visible void smp_callin(void)
>  	numa_add_cpu(curr_cpuid);
>  	set_cpu_online(curr_cpuid, 1);
>  	check_misaligned_access(curr_cpuid);
> -	probe_vendor_features(curr_cpuid);
>  
>  	if (has_vector()) {
>  		if (riscv_v_setup_vsize())
> -- 
> 2.34.1
> 


* Re: [PATCH 1/2] RISC-V: Probe for unaligned access speed
  2023-06-23 22:20 ` [PATCH 1/2] RISC-V: Probe for unaligned " Evan Green
  2023-06-25 21:42   ` David Laight
@ 2023-06-26 14:14   ` Conor Dooley
  2023-06-27 19:11     ` Evan Green
  2023-06-26 21:42   ` Jessica Clarke
  2 siblings, 1 reply; 21+ messages in thread
From: Conor Dooley @ 2023-06-26 14:14 UTC (permalink / raw)
  To: Evan Green
  Cc: Palmer Dabbelt, Simon Hosie, Albert Ou, Alexandre Ghiti,
	Andrew Jones, Andy Chiu, Anup Patel, Greentime Hu, Guo Ren,
	Heiko Stuebner, Jisheng Zhang, Jonathan Corbet, Ley Foon Tan,
	Li Zhengyu, Masahiro Yamada, Palmer Dabbelt, Paul Walmsley,
	Sia Jee Heng, Sunil V L, Xianting Tian, Yangyu Chen, linux-doc,
	linux-kernel, linux-riscv


Hey Evan,

Some minor nitpickery comments & one actual one.

On Fri, Jun 23, 2023 at 03:20:15PM -0700, Evan Green wrote:
> Rather than deferring misaligned access speed determinations to a vendor

Could you pick a consistent word to use? You've got "unaligned",
"misaligned" and "noalign" sprinkled throughout the series.

> function, let's probe them and find out how fast they are. If we
> determine that a misaligned word access is faster than N byte accesses,
> mark the hardware's misaligned access as "fast".
> 
> Fix the documentation as well to reflect this bar. Previously the only
> SoC that returned "fast" was the THead C906. The change to the
> documentation is more a clarification, since the C906 is fast in the
> sense of the corrected documentation.
> 
> Signed-off-by: Evan Green <evan@rivosinc.com>
> ---
> 
>  Documentation/riscv/hwprobe.rst     |  8 +--
>  arch/riscv/include/asm/cpufeature.h |  2 +
>  arch/riscv/kernel/Makefile          |  1 +
>  arch/riscv/kernel/copy-noalign.S    | 71 ++++++++++++++++++++++++++
>  arch/riscv/kernel/copy-noalign.h    | 13 +++++
>  arch/riscv/kernel/cpufeature.c      | 78 +++++++++++++++++++++++++++++
>  arch/riscv/kernel/smpboot.c         |  2 +
>  7 files changed, 171 insertions(+), 4 deletions(-)
>  create mode 100644 arch/riscv/kernel/copy-noalign.S
>  create mode 100644 arch/riscv/kernel/copy-noalign.h
> 
> diff --git a/Documentation/riscv/hwprobe.rst b/Documentation/riscv/hwprobe.rst
> index 19165ebd82ba..710325751766 100644
> --- a/Documentation/riscv/hwprobe.rst
> +++ b/Documentation/riscv/hwprobe.rst
> @@ -88,12 +88,12 @@ The following keys are defined:
>      always extremely slow.
>  
>    * :c:macro:`RISCV_HWPROBE_MISALIGNED_SLOW`: Misaligned accesses are supported
> -    in hardware, but are slower than the cooresponding aligned accesses
> -    sequences.

Nice, fixed the typo by removing the offender ;)

> +    in hardware, but are slower than N byte accesses, where N is the native
> +    word size.
>  
>    * :c:macro:`RISCV_HWPROBE_MISALIGNED_FAST`: Misaligned accesses are supported
> -    in hardware and are faster than the cooresponding aligned accesses
> -    sequences.
> +    in hardware and are faster than N byte accesses, where N is the native
> +    word size.
>  
>    * :c:macro:`RISCV_HWPROBE_MISALIGNED_UNSUPPORTED`: Misaligned accesses are
>      not supported at all and will generate a misaligned address fault.
> diff --git a/arch/riscv/include/asm/cpufeature.h b/arch/riscv/include/asm/cpufeature.h
> index 23fed53b8815..b8e917176616 100644
> --- a/arch/riscv/include/asm/cpufeature.h
> +++ b/arch/riscv/include/asm/cpufeature.h
> @@ -30,4 +30,6 @@ DECLARE_PER_CPU(long, misaligned_access_speed);
>  /* Per-cpu ISA extensions. */
>  extern struct riscv_isainfo hart_isa[NR_CPUS];
>  
> +void check_misaligned_access(int cpu);
> +
>  #endif
> diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> index a42951911067..f934d7ab7840 100644
> --- a/arch/riscv/kernel/Makefile
> +++ b/arch/riscv/kernel/Makefile
> @@ -34,6 +34,7 @@ extra-y += vmlinux.lds
>  obj-y	+= head.o
>  obj-y	+= soc.o
>  obj-$(CONFIG_RISCV_ALTERNATIVE) += alternative.o
> +obj-y	+= copy-noalign.o
>  obj-y	+= cpu.o
>  obj-y	+= cpufeature.o
>  obj-y	+= entry.o
> diff --git a/arch/riscv/kernel/copy-noalign.S b/arch/riscv/kernel/copy-noalign.S
> new file mode 100644
> index 000000000000..3807fc2324b2
> --- /dev/null
> +++ b/arch/riscv/kernel/copy-noalign.S
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (C) 2023 Rivos Inc. */
> +
> +#include <linux/linkage.h>
> +#include <asm/asm.h>
> +
> +	.text
> +
> +/* void __copy_words_unaligned(void *, const void *, size_t) */
> +/* Performs a memcpy without aligning buffers, using word loads and stores. */
> +/* Note: The size is truncated to a multiple of 8 * SZREG */
> +ENTRY(__copy_words_unaligned)
> +	andi a4, a2, ~((8*SZREG)-1)
> +	beqz a4, 2f
> +	add a3, a1, a4
> +1:
> +	REG_L a4,       0(a1)
> +	REG_L a5,   SZREG(a1)
> +	REG_L a6, 2*SZREG(a1)
> +	REG_L a7, 3*SZREG(a1)
> +	REG_L t0, 4*SZREG(a1)
> +	REG_L t1, 5*SZREG(a1)
> +	REG_L t2, 6*SZREG(a1)
> +	REG_L t3, 7*SZREG(a1)
> +	REG_S a4,       0(a0)
> +	REG_S a5,   SZREG(a0)
> +	REG_S a6, 2*SZREG(a0)
> +	REG_S a7, 3*SZREG(a0)
> +	REG_S t0, 4*SZREG(a0)
> +	REG_S t1, 5*SZREG(a0)
> +	REG_S t2, 6*SZREG(a0)
> +	REG_S t3, 7*SZREG(a0)
> +	addi a0, a0, 8*SZREG
> +	addi a1, a1, 8*SZREG
> +	bltu a1, a3, 1b
> +
> +2:
> +	ret
> +END(__copy_words_unaligned)
> +
> +/* void __copy_bytes_unaligned(void *, const void *, size_t) */
> +/* Performs a memcpy without aligning buffers, using only byte accesses. */
> +/* Note: The size is truncated to a multiple of 8 */
> +ENTRY(__copy_bytes_unaligned)
> +	andi a4, a2, ~(8-1)
> +	beqz a4, 2f
> +	add a3, a1, a4

Could you align operands for ASM please, to make this a little easier to
read?

> +1:
> +	lb a4, 0(a1)
> +	lb a5, 1(a1)
> +	lb a6, 2(a1)
> +	lb a7, 3(a1)
> +	lb t0, 4(a1)
> +	lb t1, 5(a1)
> +	lb t2, 6(a1)
> +	lb t3, 7(a1)
> +	sb a4, 0(a0)
> +	sb a5, 1(a0)
> +	sb a6, 2(a0)
> +	sb a7, 3(a0)
> +	sb t0, 4(a0)
> +	sb t1, 5(a0)
> +	sb t2, 6(a0)
> +	sb t3, 7(a0)
> +	addi a0, a0, 8
> +	addi a1, a1, 8
> +	bltu a1, a3, 1b
> +
> +2:
> +	ret
> +END(__copy_bytes_unaligned)
> diff --git a/arch/riscv/kernel/copy-noalign.h b/arch/riscv/kernel/copy-noalign.h
> new file mode 100644
> index 000000000000..99fbb9c763e0
> --- /dev/null
> +++ b/arch/riscv/kernel/copy-noalign.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2023 Rivos, Inc.
> + */
> +#ifndef __RISCV_KERNEL_COPY_NOALIGN_H
> +#define __RISCV_KERNEL_COPY_NOALIGN_H
> +
> +#include <linux/types.h>
> +
> +void __copy_words_unaligned(void *dst, const void *src, size_t size);
> +void __copy_bytes_unaligned(void *dst, const void *src, size_t size);
> +
> +#endif /* __RISCV_KERNEL_COPY_NOALIGN_H */
> diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
> index bdcf460ea53d..3f7200dcc00c 100644
> --- a/arch/riscv/kernel/cpufeature.c
> +++ b/arch/riscv/kernel/cpufeature.c
> @@ -19,11 +19,21 @@
>  #include <asm/cacheflush.h>
>  #include <asm/cpufeature.h>
>  #include <asm/hwcap.h>
> +#include <asm/hwprobe.h>
>  #include <asm/patch.h>
>  #include <asm/processor.h>
>  #include <asm/vector.h>
>  
> +#include "copy-noalign.h"
> +
>  #define NUM_ALPHA_EXTS ('z' - 'a' + 1)
> +#define MISALIGNED_ACCESS_JIFFIES_LG2 1
> +#define MISALIGNED_BUFFER_SIZE 0x4000
> +#define MISALIGNED_COPY_SIZE ((MISALIGNED_BUFFER_SIZE / 2) - 0x80)
> +

I think this blank line is misplaced; it should go after NUM_ALPHA_EXTS
instead of (or as well as) here.

> +#define MISALIGNED_COPY_MBS(_count) \
> +	((HZ * (_count) * MISALIGNED_COPY_SIZE) >> \
> +	 (20 + MISALIGNED_ACCESS_JIFFIES_LG2))
>  
>  unsigned long elf_hwcap __read_mostly;
>  
> @@ -396,6 +406,74 @@ unsigned long riscv_get_elf_hwcap(void)
>  	return hwcap;
>  }
>  
> +void check_misaligned_access(int cpu)
> +{
> +	unsigned long j0, j1;
> +	struct page *page;
> +	void *dst;
> +	void *src;
> +	long word_copies = 0;
> +	long byte_copies = 0;
> +	long speed = RISCV_HWPROBE_MISALIGNED_SLOW;

Is this not a change from current behaviour, one that may actually lead
to incorrect reporting? Presently, only T-Head stuff sets a speed, so
hwprobe falls back to UNKNOWN for everything else. With this, we will
get "slow" set for anything failing the test.
Slow is defined as "Misaligned accesses are supported in hardware, but
are slower than the cooresponding aligned accesses sequences (sic)", but
you have no way of knowing, based on the test you are performing, whether
the hardware supports it or if it is emulated by firmware.
Perhaps that is not relevant to userspace, but wanted to know your
thoughts.

> +
> +	page = alloc_pages(GFP_NOWAIT, get_order(MISALIGNED_BUFFER_SIZE));
> +	if (!page) {
> +		pr_warn("Can't alloc pages to measure memcpy performance");
> +		return;
> +	}
> +
> +	/* Make a misaligned destination buffer. */
> +	dst = (void *)((unsigned long)page_address(page) | 0x1);
> +	/* Misalign src as well, but differently (off by 1 + 2 = 3). */
> +	src = dst + (MISALIGNED_BUFFER_SIZE / 2);
> +	src += 2;
> +	/* Do a warmup. */
> +	__copy_words_unaligned(dst, src, MISALIGNED_COPY_SIZE);
> +	preempt_disable();
> +	j0 = jiffies;
> +	while ((j1 = jiffies) == j0)
> +		cpu_relax();
> +
> +	while (time_before(jiffies,
> +			   j1 + (1 << MISALIGNED_ACCESS_JIFFIES_LG2))) {

Does this not fit in 100 chars?

> +
> +		__copy_words_unaligned(dst, src, MISALIGNED_COPY_SIZE);
> +		word_copies += 1;
> +	}
> +
> +	__copy_bytes_unaligned(dst, src, MISALIGNED_COPY_SIZE);
> +	j0 = jiffies;
> +	while ((j1 = jiffies) == j0)
> +		cpu_relax();
> +
> +	while (time_before(jiffies,
> +			   j1 + (1 << MISALIGNED_ACCESS_JIFFIES_LG2))) {

Ditto here, no?

Cheers,
Conor.


* Re: [PATCH 0/2] RISC-V: Probe for misaligned access speed
  2023-06-24 10:08 ` [PATCH 0/2] RISC-V: Probe for misaligned access speed Conor Dooley
@ 2023-06-26 19:48   ` Evan Green
  0 siblings, 0 replies; 21+ messages in thread
From: Evan Green @ 2023-06-26 19:48 UTC (permalink / raw)
  To: Conor Dooley
  Cc: Palmer Dabbelt, Simon Hosie, Albert Ou, Alexandre Ghiti,
	Andrew Jones, Andy Chiu, Anup Patel, Conor Dooley, Greentime Hu,
	Guo Ren, Heiko Stuebner, Heiko Stuebner, Jisheng Zhang,
	Jonathan Corbet, Ley Foon Tan, Li Zhengyu, Masahiro Yamada,
	Palmer Dabbelt, Paul Walmsley, Randy Dunlap, Samuel Holland,
	Sia Jee Heng, Sunil V L, Xianting Tian, Yangyu Chen, linux-doc,
	linux-kernel, linux-riscv

On Sat, Jun 24, 2023 at 3:08 AM Conor Dooley <conor@kernel.org> wrote:
>
> On Fri, Jun 23, 2023 at 03:20:14PM -0700, Evan Green wrote:
> >
> > The current setting for the hwprobe bit indicating misaligned access
> > speed is controlled by a vendor-specific feature probe function. This is
> > essentially a per-SoC table we have to maintain on behalf of each vendor
> > going forward. Let's convert that instead to something we detect at
> > runtime.
> >
> > We have two assembly routines at the heart of our probe: one that
> > does a bunch of word-sized accesses (without aligning its input buffer),
> > and the other that does byte accesses. If we can move a larger number of
> > bytes using misaligned word accesses than we can with the same amount of
> > time doing byte accesses, then we can declare misaligned accesses as
> > "fast".
> >
> > The tradeoff of reducing this maintenance burden is boot time. We spend
> > 4-6 jiffies per core doing this measurement (0-2 on jiffie edge
> > alignment, and 4 on measurement). The timing loop was based on
> > raid6_choose_gen(), which uses (16+1)*N jiffies (where N is the number
> > of algorithms). On my THead C906, I found measurements to be stable
> > across several reboots, and looked like this:
> >
> > [    0.047582] cpu0: Unaligned word copy 1728 MB/s, byte copy 402 MB/s, misaligned accesses are fast
> >
> > I don't have a machine where misaligned accesses are slow, but I'd be
> > interested to see the results of booting this series if someone did.
>
> Can you elaborate on "results" please? Otherwise,
>
> [    0.333110] smp: Bringing up secondary CPUs ...
> [    0.370794] cpu1: Unaligned word copy 2 MB/s, byte copy 231 MB/s, misaligned accesses are slow
> [    0.411368] cpu2: Unaligned word copy 2 MB/s, byte copy 231 MB/s, misaligned accesses are slow
> [    0.451947] cpu3: Unaligned word copy 2 MB/s, byte copy 231 MB/s, misaligned accesses are slow
> [    0.462628] smp: Brought up 1 node, 4 CPUs
>
> [    0.631464] cpu0: Unaligned word copy 2 MB/s, byte copy 229 MB/s, misaligned accesses are slow
>
> btw, why the mixed usage of "unaligned" and misaligned"?

Yes, this is exactly what I was hoping for in terms of results, thank
you. I'll clean up the diction and choose one word. I think my brain
attached subtle differences to "unaligned" (as in, without regard
for alignment; the behavior of the copies) versus "misaligned" (i.e.
deliberately out of alignment; the type of access we're testing), but
I'm not sure I'm even fully consistent about those, so I'll fix it.

-Evan

>
> Cheers,
> Conor.


* Re: [PATCH 0/2] RISC-V: Probe for misaligned access speed
  2023-06-24 10:22 ` Yangyu Chen
  2023-06-26  9:24   ` David Laight
@ 2023-06-26 19:48   ` Evan Green
  1 sibling, 0 replies; 21+ messages in thread
From: Evan Green @ 2023-06-26 19:48 UTC (permalink / raw)
  To: Yangyu Chen
  Cc: ajones, alexghiti, andy.chiu, aou, apatel, conor.dooley, corbet,
	greentime.hu, guoren, heiko.stuebner, heiko, jeeheng.sia,
	jszhang, leyfoon.tan, linux-doc, linux-kernel, linux-riscv,
	lizhengyu3, masahiroy, palmer, palmer, paul.walmsley, rdunlap,
	samuel, shosie, sunilvl, xianting.tian

On Sat, Jun 24, 2023 at 3:22 AM Yangyu Chen <cyy@cyyself.name> wrote:
>
> Hi,
>
> Thanks for doing this.
>
> On 6/24/23 6:20 AM, Evan Green wrote:
> > I don't have a machine where misaligned accesses are slow, but I'd be
> > interested to see the results of booting this series if someone did.
>
> I have tested your patches on a 100MHz BigCore rocket-chip with opensbi running on FPGA with 72bit(64bit+ECC) DDR3 1600MHz memory. As the rocket-chip did not support misaligned memory access, every misaligned memory access will trap and emulated by SBI.
>
> Here is the result:
>
> ~ # cat /proc/cpuinfo
> processor       : 0
> hart            : 0
> isa             : rv64imafdc
> mmu             : sv39
> uarch           : sifive,rocket0
> mvendorid       : 0x0
> marchid         : 0x1
> mimpid          : 0x20181004
>
> processor       : 1
> hart            : 1
> isa             : rv64imafdc
> mmu             : sv39
> uarch           : sifive,rocket0
> mvendorid       : 0x0
> marchid         : 0x1
> mimpid          : 0x20181004
>
> ~ # dmesg | grep Unaligned
> [    0.210140] cpu1: Unaligned word copy 0 MB/s, byte copy 38 MB/s, misaligned accesses are slow
> [    0.410715] cpu0: Unaligned word copy 0 MB/s, byte copy 35 MB/s, misaligned accesses are slow

Thank you, Yangyu! Oof, the firmware traps are quite slow!
-Evan


* Re: [PATCH 0/2] RISC-V: Probe for misaligned access speed
  2023-06-26  9:24   ` David Laight
@ 2023-06-26 19:49     ` Evan Green
  0 siblings, 0 replies; 21+ messages in thread
From: Evan Green @ 2023-06-26 19:49 UTC (permalink / raw)
  To: David Laight
  Cc: Yangyu Chen, ajones, alexghiti, andy.chiu, aou, apatel,
	conor.dooley, corbet, greentime.hu, guoren, heiko.stuebner,
	heiko, jeeheng.sia, jszhang, leyfoon.tan, linux-doc,
	linux-kernel, linux-riscv, lizhengyu3, masahiroy, palmer, palmer,
	paul.walmsley, rdunlap, samuel, shosie, sunilvl, xianting.tian

On Mon, Jun 26, 2023 at 2:24 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Yangyu Chen
> > Sent: 24 June 2023 11:22
> >
> > Hi,
> >
> > Thanks for doing this.
> >
> > On 6/24/23 6:20 AM, Evan Green wrote:
> > > I don't have a machine where misaligned accesses are slow, but I'd be
> > > interested to see the results of booting this series if someone did.
> >
> > I have tested your patches on a 100MHz BigCore rocket-chip with opensbi running on FPGA with
> > 72bit(64bit+ECC) DDR3 1600MHz memory. As the rocket-chip did not support misaligned memory access,
> > every misaligned memory access will trap and emulated by SBI.
> >
> > Here is the result:
> ...
> > ~ # dmesg | grep Unaligned
> > [    0.210140] cpu1: Unaligned word copy 0 MB/s, byte copy 38 MB/s, misaligned accesses are slow
> > [    0.410715] cpu0: Unaligned word copy 0 MB/s, byte copy 35 MB/s, misaligned accesses are slow
>
> How many misaligned cycles are in the test loop?
> If emulated ones are that slow you pretty much only need to test one.

The code does as many cycles as it can in a fixed number of jiffies.
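
For reference, each measurement window looks roughly like this (a
condensed sketch of the loop in patch 1; the helper name here is mine
and not part of the series). The word and byte variants run back to
back with the same window length, so the two counts compare directly:

static long count_copies_in_window(void (*copy)(void *, const void *, size_t),
				   void *dst, const void *src)
{
	unsigned long j0, j1;
	long copies = 0;

	/* Sync to a fresh jiffy edge so every window is the same length. */
	j0 = jiffies;
	while ((j1 = jiffies) == j0)
		cpu_relax();

	/* Count whole buffer copies for a 2^LG2-jiffy window. */
	while (time_before(jiffies,
			   j1 + (1 << MISALIGNED_ACCESS_JIFFIES_LG2))) {
		copy(dst, src, MISALIGNED_COPY_SIZE);
		copies++;
	}

	return copies;
}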

>
> Also it is pretty clear that you really don't want to be emulating them.
> If the emulation is hidden from the kernel that really doesn't help at all.

From what I understand there's work being done to give the kernel some
awareness of, and even control over, the misaligned access
trapping/emulation. It won't help today's systems though, and either
way you're right: emulating is very, very slow.
-Evan


* Re: [PATCH 1/2] RISC-V: Probe for unaligned access speed
  2023-06-23 22:20 ` [PATCH 1/2] RISC-V: Probe for unaligned " Evan Green
  2023-06-25 21:42   ` David Laight
  2023-06-26 14:14   ` Conor Dooley
@ 2023-06-26 21:42   ` Jessica Clarke
  2023-06-27 19:11     ` Evan Green
  2 siblings, 1 reply; 21+ messages in thread
From: Jessica Clarke @ 2023-06-26 21:42 UTC (permalink / raw)
  To: Evan Green
  Cc: Palmer Dabbelt, linux-doc, Yangyu Chen, Conor Dooley, Guo Ren,
	Jisheng Zhang, linux-riscv, Jonathan Corbet, Xianting Tian,
	Masahiro Yamada, Greentime Hu, Simon Hosie, Li Zhengyu,
	Andrew Jones, Albert Ou, Alexandre Ghiti, Ley Foon Tan,
	Paul Walmsley, Heiko Stuebner, Anup Patel, linux-kernel,
	Sia Jee Heng, Palmer Dabbelt, Andy Chiu

On 23 Jun 2023, at 23:20, Evan Green <evan@rivosinc.com> wrote:
> 
> Rather than deferring misaligned access speed determinations to a vendor
> function, let's probe them and find out how fast they are. If we
> determine that a misaligned word access is faster than N byte accesses,
> mark the hardware's misaligned access as "fast".

How sure are you that your measurements can be extrapolated and aren’t
an artefact of the testing process? For example, off the top of my head:

* The first run will potentially be penalised by data cache misses,
  untrained prefetchers, TLB misses, branch predictors, etc. compared
  with later runs. You have one warmup, but who knows how many
  iterations it will take to converge?

* The code being benchmarked isn’t the code being run, so differences
  in access patterns, loop unrolling, loop alignment, etc. may cause the
  real code to behave differently (and perhaps change which is better).

The non-determinism that could in theory result from this also doesn't
seem like a great idea to have.

Jess

> Fix the documentation as well to reflect this bar. Previously the only
> SoC that returned "fast" was the THead C906. The change to the
> documentation is more a clarification, since the C906 is fast in the
> sense of the corrected documentation.
> 
> Signed-off-by: Evan Green <evan@rivosinc.com>
> ---
> 
> Documentation/riscv/hwprobe.rst     |  8 +--
> arch/riscv/include/asm/cpufeature.h |  2 +
> arch/riscv/kernel/Makefile          |  1 +
> arch/riscv/kernel/copy-noalign.S    | 71 ++++++++++++++++++++++++++
> arch/riscv/kernel/copy-noalign.h    | 13 +++++
> arch/riscv/kernel/cpufeature.c      | 78 +++++++++++++++++++++++++++++
> arch/riscv/kernel/smpboot.c         |  2 +
> 7 files changed, 171 insertions(+), 4 deletions(-)
> create mode 100644 arch/riscv/kernel/copy-noalign.S
> create mode 100644 arch/riscv/kernel/copy-noalign.h
> 
> diff --git a/Documentation/riscv/hwprobe.rst b/Documentation/riscv/hwprobe.rst
> index 19165ebd82ba..710325751766 100644
> --- a/Documentation/riscv/hwprobe.rst
> +++ b/Documentation/riscv/hwprobe.rst
> @@ -88,12 +88,12 @@ The following keys are defined:
>     always extremely slow.
> 
>   * :c:macro:`RISCV_HWPROBE_MISALIGNED_SLOW`: Misaligned accesses are supported
> -    in hardware, but are slower than the cooresponding aligned accesses
> -    sequences.
> +    in hardware, but are slower than N byte accesses, where N is the native
> +    word size.
> 
>   * :c:macro:`RISCV_HWPROBE_MISALIGNED_FAST`: Misaligned accesses are supported
> -    in hardware and are faster than the cooresponding aligned accesses
> -    sequences.
> +    in hardware and are faster than N byte accesses, where N is the native
> +    word size.
> 
>   * :c:macro:`RISCV_HWPROBE_MISALIGNED_UNSUPPORTED`: Misaligned accesses are
>     not supported at all and will generate a misaligned address fault.
> diff --git a/arch/riscv/include/asm/cpufeature.h b/arch/riscv/include/asm/cpufeature.h
> index 23fed53b8815..b8e917176616 100644
> --- a/arch/riscv/include/asm/cpufeature.h
> +++ b/arch/riscv/include/asm/cpufeature.h
> @@ -30,4 +30,6 @@ DECLARE_PER_CPU(long, misaligned_access_speed);
> /* Per-cpu ISA extensions. */
> extern struct riscv_isainfo hart_isa[NR_CPUS];
> 
> +void check_misaligned_access(int cpu);
> +
> #endif
> diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> index a42951911067..f934d7ab7840 100644
> --- a/arch/riscv/kernel/Makefile
> +++ b/arch/riscv/kernel/Makefile
> @@ -34,6 +34,7 @@ extra-y += vmlinux.lds
> obj-y += head.o
> obj-y += soc.o
> obj-$(CONFIG_RISCV_ALTERNATIVE) += alternative.o
> +obj-y += copy-noalign.o
> obj-y += cpu.o
> obj-y += cpufeature.o
> obj-y += entry.o
> diff --git a/arch/riscv/kernel/copy-noalign.S b/arch/riscv/kernel/copy-noalign.S
> new file mode 100644
> index 000000000000..3807fc2324b2
> --- /dev/null
> +++ b/arch/riscv/kernel/copy-noalign.S
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (C) 2023 Rivos Inc. */
> +
> +#include <linux/linkage.h>
> +#include <asm/asm.h>
> +
> + .text
> +
> +/* void __copy_words_unaligned(void *, const void *, size_t) */
> +/* Performs a memcpy without aligning buffers, using word loads and stores. */
> +/* Note: The size is truncated to a multiple of 8 * SZREG */
> +ENTRY(__copy_words_unaligned)
> + andi a4, a2, ~((8*SZREG)-1)
> + beqz a4, 2f
> + add a3, a1, a4
> +1:
> + REG_L a4,       0(a1)
> + REG_L a5,   SZREG(a1)
> + REG_L a6, 2*SZREG(a1)
> + REG_L a7, 3*SZREG(a1)
> + REG_L t0, 4*SZREG(a1)
> + REG_L t1, 5*SZREG(a1)
> + REG_L t2, 6*SZREG(a1)
> + REG_L t3, 7*SZREG(a1)
> + REG_S a4,       0(a0)
> + REG_S a5,   SZREG(a0)
> + REG_S a6, 2*SZREG(a0)
> + REG_S a7, 3*SZREG(a0)
> + REG_S t0, 4*SZREG(a0)
> + REG_S t1, 5*SZREG(a0)
> + REG_S t2, 6*SZREG(a0)
> + REG_S t3, 7*SZREG(a0)
> + addi a0, a0, 8*SZREG
> + addi a1, a1, 8*SZREG
> + bltu a1, a3, 1b
> +
> +2:
> + ret
> +END(__copy_words_unaligned)
> +
> +/* void __copy_bytes_unaligned(void *, const void *, size_t) */
> +/* Performs a memcpy without aligning buffers, using only byte accesses. */
> +/* Note: The size is truncated to a multiple of 8 */
> +ENTRY(__copy_bytes_unaligned)
> + andi a4, a2, ~(8-1)
> + beqz a4, 2f
> + add a3, a1, a4
> +1:
> + lb a4, 0(a1)
> + lb a5, 1(a1)
> + lb a6, 2(a1)
> + lb a7, 3(a1)
> + lb t0, 4(a1)
> + lb t1, 5(a1)
> + lb t2, 6(a1)
> + lb t3, 7(a1)
> + sb a4, 0(a0)
> + sb a5, 1(a0)
> + sb a6, 2(a0)
> + sb a7, 3(a0)
> + sb t0, 4(a0)
> + sb t1, 5(a0)
> + sb t2, 6(a0)
> + sb t3, 7(a0)
> + addi a0, a0, 8
> + addi a1, a1, 8
> + bltu a1, a3, 1b
> +
> +2:
> + ret
> +END(__copy_bytes_unaligned)
> diff --git a/arch/riscv/kernel/copy-noalign.h b/arch/riscv/kernel/copy-noalign.h
> new file mode 100644
> index 000000000000..99fbb9c763e0
> --- /dev/null
> +++ b/arch/riscv/kernel/copy-noalign.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2023 Rivos, Inc.
> + */
> +#ifndef __RISCV_KERNEL_COPY_NOALIGN_H
> +#define __RISCV_KERNEL_COPY_NOALIGN_H
> +
> +#include <linux/types.h>
> +
> +void __copy_words_unaligned(void *dst, const void *src, size_t size);
> +void __copy_bytes_unaligned(void *dst, const void *src, size_t size);
> +
> +#endif /* __RISCV_KERNEL_COPY_NOALIGN_H */
> diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
> index bdcf460ea53d..3f7200dcc00c 100644
> --- a/arch/riscv/kernel/cpufeature.c
> +++ b/arch/riscv/kernel/cpufeature.c
> @@ -19,11 +19,21 @@
> #include <asm/cacheflush.h>
> #include <asm/cpufeature.h>
> #include <asm/hwcap.h>
> +#include <asm/hwprobe.h>
> #include <asm/patch.h>
> #include <asm/processor.h>
> #include <asm/vector.h>
> 
> +#include "copy-noalign.h"
> +
> #define NUM_ALPHA_EXTS ('z' - 'a' + 1)
> +#define MISALIGNED_ACCESS_JIFFIES_LG2 1
> +#define MISALIGNED_BUFFER_SIZE 0x4000
> +#define MISALIGNED_COPY_SIZE ((MISALIGNED_BUFFER_SIZE / 2) - 0x80)
> +
> +#define MISALIGNED_COPY_MBS(_count) \
> + ((HZ * (_count) * MISALIGNED_COPY_SIZE) >> \
> + (20 + MISALIGNED_ACCESS_JIFFIES_LG2))
> 
> unsigned long elf_hwcap __read_mostly;
> 
> @@ -396,6 +406,74 @@ unsigned long riscv_get_elf_hwcap(void)
> return hwcap;
> }
> 
> +void check_misaligned_access(int cpu)
> +{
> + unsigned long j0, j1;
> + struct page *page;
> + void *dst;
> + void *src;
> + long word_copies = 0;
> + long byte_copies = 0;
> + long speed = RISCV_HWPROBE_MISALIGNED_SLOW;
> +
> + page = alloc_pages(GFP_NOWAIT, get_order(MISALIGNED_BUFFER_SIZE));
> + if (!page) {
> + pr_warn("Can't alloc pages to measure memcpy performance");
> + return;
> + }
> +
> + /* Make a misaligned destination buffer. */
> + dst = (void *)((unsigned long)page_address(page) | 0x1);
> + /* Misalign src as well, but differently (off by 1 + 2 = 3). */
> + src = dst + (MISALIGNED_BUFFER_SIZE / 2);
> + src += 2;
> + /* Do a warmup. */
> + __copy_words_unaligned(dst, src, MISALIGNED_COPY_SIZE);
> + preempt_disable();
> + j0 = jiffies;
> + while ((j1 = jiffies) == j0)
> + cpu_relax();
> +
> + while (time_before(jiffies,
> +   j1 + (1 << MISALIGNED_ACCESS_JIFFIES_LG2))) {
> +
> + __copy_words_unaligned(dst, src, MISALIGNED_COPY_SIZE);
> + word_copies += 1;
> + }
> +
> + __copy_bytes_unaligned(dst, src, MISALIGNED_COPY_SIZE);
> + j0 = jiffies;
> + while ((j1 = jiffies) == j0)
> + cpu_relax();
> +
> + while (time_before(jiffies,
> +   j1 + (1 << MISALIGNED_ACCESS_JIFFIES_LG2))) {
> + __copy_bytes_unaligned(dst, src, MISALIGNED_COPY_SIZE);
> + byte_copies += 1;
> + }
> +
> + preempt_enable();
> + if (word_copies >= byte_copies)
> + speed = RISCV_HWPROBE_MISALIGNED_FAST;
> +
> + pr_info("cpu%d: Unaligned word copy %ld MB/s, byte copy %ld MB/s, misaligned accesses are %s\n",
> + cpu,
> + MISALIGNED_COPY_MBS(word_copies),
> + MISALIGNED_COPY_MBS(byte_copies),
> + (speed == RISCV_HWPROBE_MISALIGNED_FAST) ? "fast" : "slow");
> +
> + per_cpu(misaligned_access_speed, cpu) = speed;
> + __free_pages(page, get_order(MISALIGNED_BUFFER_SIZE));
> +}
> +
> +static int check_misaligned_access0(void)
> +{
> + check_misaligned_access(0);
> + return 0;
> +}
> +
> +arch_initcall(check_misaligned_access0);
> +
> #ifdef CONFIG_RISCV_ALTERNATIVE
> /*
>  * Alternative patch sites consider 48 bits when determining when to patch
> diff --git a/arch/riscv/kernel/smpboot.c b/arch/riscv/kernel/smpboot.c
> index bb0b76e1a6d4..e34a71b4786b 100644
> --- a/arch/riscv/kernel/smpboot.c
> +++ b/arch/riscv/kernel/smpboot.c
> @@ -26,6 +26,7 @@
> #include <linux/sched/task_stack.h>
> #include <linux/sched/mm.h>
> #include <asm/cpu_ops.h>
> +#include <asm/cpufeature.h>
> #include <asm/irq.h>
> #include <asm/mmu_context.h>
> #include <asm/numa.h>
> @@ -244,6 +245,7 @@ asmlinkage __visible void smp_callin(void)
> notify_cpu_starting(curr_cpuid);
> numa_add_cpu(curr_cpuid);
> set_cpu_online(curr_cpuid, 1);
> + check_misaligned_access(curr_cpuid);
> probe_vendor_features(curr_cpuid);
> 
> if (has_vector()) {
> -- 
> 2.34.1
> 
> 
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] RISC-V: Probe for unaligned access speed
  2023-06-26 14:14   ` Conor Dooley
@ 2023-06-27 19:11     ` Evan Green
  2023-06-29 14:03       ` Conor Dooley
  0 siblings, 1 reply; 21+ messages in thread
From: Evan Green @ 2023-06-27 19:11 UTC (permalink / raw)
  To: Conor Dooley
  Cc: Palmer Dabbelt, Simon Hosie, Albert Ou, Alexandre Ghiti,
	Andrew Jones, Andy Chiu, Anup Patel, Greentime Hu, Guo Ren,
	Heiko Stuebner, Jisheng Zhang, Jonathan Corbet, Ley Foon Tan,
	Li Zhengyu, Masahiro Yamada, Palmer Dabbelt, Paul Walmsley,
	Sia Jee Heng, Sunil V L, Xianting Tian, Yangyu Chen, linux-doc,
	linux-kernel, linux-riscv

On Mon, Jun 26, 2023 at 7:15 AM Conor Dooley <conor.dooley@microchip.com> wrote:
>
> Hey Evan,
>
> Some minor nitpickery comments & one actual one.
>
> On Fri, Jun 23, 2023 at 03:20:15PM -0700, Evan Green wrote:
> > Rather than deferring misaligned access speed determinations to a vendor
>
> Could you pick a consistent word to use? You've got "unaligned",
> "misaligned" and "noalign" sprinkled through out the series.
>
> > function, let's probe them and find out how fast they are. If we
> > determine that a misaligned word access is faster than N byte accesses,
> > mark the hardware's misaligned access as "fast".
> >
> > Fix the documentation as well to reflect this bar. Previously the only
> > SoC that returned "fast" was the THead C906. The change to the
> > documentation is more a clarification, since the C906 is fast in the
> > sense of the corrected documentation.
> >
> > Signed-off-by: Evan Green <evan@rivosinc.com>
> > ---
> >
> >  Documentation/riscv/hwprobe.rst     |  8 +--
> >  arch/riscv/include/asm/cpufeature.h |  2 +
> >  arch/riscv/kernel/Makefile          |  1 +
> >  arch/riscv/kernel/copy-noalign.S    | 71 ++++++++++++++++++++++++++
> >  arch/riscv/kernel/copy-noalign.h    | 13 +++++
> >  arch/riscv/kernel/cpufeature.c      | 78 +++++++++++++++++++++++++++++
> >  arch/riscv/kernel/smpboot.c         |  2 +
> >  7 files changed, 171 insertions(+), 4 deletions(-)
> >  create mode 100644 arch/riscv/kernel/copy-noalign.S
> >  create mode 100644 arch/riscv/kernel/copy-noalign.h
> >
> > diff --git a/Documentation/riscv/hwprobe.rst b/Documentation/riscv/hwprobe.rst
> > index 19165ebd82ba..710325751766 100644
> > --- a/Documentation/riscv/hwprobe.rst
> > +++ b/Documentation/riscv/hwprobe.rst
> > @@ -88,12 +88,12 @@ The following keys are defined:
> >      always extremely slow.
> >
> >    * :c:macro:`RISCV_HWPROBE_MISALIGNED_SLOW`: Misaligned accesses are supported
> > -    in hardware, but are slower than the cooresponding aligned accesses
> > -    sequences.
>
> Nice, fixed the typo by removing the offender ;)
>
> > +    in hardware, but are slower than N byte accesses, where N is the native
> > +    word size.
> >
> >    * :c:macro:`RISCV_HWPROBE_MISALIGNED_FAST`: Misaligned accesses are supported
> > -    in hardware and are faster than the cooresponding aligned accesses
> > -    sequences.
> > +    in hardware and are faster than N byte accesses, where N is the native
> > +    word size.
> >
> >    * :c:macro:`RISCV_HWPROBE_MISALIGNED_UNSUPPORTED`: Misaligned accesses are
> >      not supported at all and will generate a misaligned address fault.
> > diff --git a/arch/riscv/include/asm/cpufeature.h b/arch/riscv/include/asm/cpufeature.h
> > index 23fed53b8815..b8e917176616 100644
> > --- a/arch/riscv/include/asm/cpufeature.h
> > +++ b/arch/riscv/include/asm/cpufeature.h
> > @@ -30,4 +30,6 @@ DECLARE_PER_CPU(long, misaligned_access_speed);
> >  /* Per-cpu ISA extensions. */
> >  extern struct riscv_isainfo hart_isa[NR_CPUS];
> >
> > +void check_misaligned_access(int cpu);
> > +
> >  #endif
> > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > index a42951911067..f934d7ab7840 100644
> > --- a/arch/riscv/kernel/Makefile
> > +++ b/arch/riscv/kernel/Makefile
> > @@ -34,6 +34,7 @@ extra-y += vmlinux.lds
> >  obj-y        += head.o
> >  obj-y        += soc.o
> >  obj-$(CONFIG_RISCV_ALTERNATIVE) += alternative.o
> > +obj-y        += copy-noalign.o
> >  obj-y        += cpu.o
> >  obj-y        += cpufeature.o
> >  obj-y        += entry.o
> > diff --git a/arch/riscv/kernel/copy-noalign.S b/arch/riscv/kernel/copy-noalign.S
> > new file mode 100644
> > index 000000000000..3807fc2324b2
> > --- /dev/null
> > +++ b/arch/riscv/kernel/copy-noalign.S
> > @@ -0,0 +1,71 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/* Copyright (C) 2023 Rivos Inc. */
> > +
> > +#include <linux/linkage.h>
> > +#include <asm/asm.h>
> > +
> > +     .text
> > +
> > +/* void __copy_words_unaligned(void *, const void *, size_t) */
> > +/* Performs a memcpy without aligning buffers, using word loads and stores. */
> > +/* Note: The size is truncated to a multiple of 8 * SZREG */
> > +ENTRY(__copy_words_unaligned)
> > +     andi a4, a2, ~((8*SZREG)-1)
> > +     beqz a4, 2f
> > +     add a3, a1, a4
> > +1:
> > +     REG_L a4,       0(a1)
> > +     REG_L a5,   SZREG(a1)
> > +     REG_L a6, 2*SZREG(a1)
> > +     REG_L a7, 3*SZREG(a1)
> > +     REG_L t0, 4*SZREG(a1)
> > +     REG_L t1, 5*SZREG(a1)
> > +     REG_L t2, 6*SZREG(a1)
> > +     REG_L t3, 7*SZREG(a1)
> > +     REG_S a4,       0(a0)
> > +     REG_S a5,   SZREG(a0)
> > +     REG_S a6, 2*SZREG(a0)
> > +     REG_S a7, 3*SZREG(a0)
> > +     REG_S t0, 4*SZREG(a0)
> > +     REG_S t1, 5*SZREG(a0)
> > +     REG_S t2, 6*SZREG(a0)
> > +     REG_S t3, 7*SZREG(a0)
> > +     addi a0, a0, 8*SZREG
> > +     addi a1, a1, 8*SZREG
> > +     bltu a1, a3, 1b
> > +
> > +2:
> > +     ret
> > +END(__copy_words_unaligned)
> > +
> > +/* void __copy_bytes_unaligned(void *, const void *, size_t) */
> > +/* Performs a memcpy without aligning buffers, using only byte accesses. */
> > +/* Note: The size is truncated to a multiple of 8 */
> > +ENTRY(__copy_bytes_unaligned)
> > +     andi a4, a2, ~(8-1)
> > +     beqz a4, 2f
> > +     add a3, a1, a4
>
> Could you align operands for ASM please, to make this a little easier to
> read?
>
> > +1:
> > +     lb a4, 0(a1)
> > +     lb a5, 1(a1)
> > +     lb a6, 2(a1)
> > +     lb a7, 3(a1)
> > +     lb t0, 4(a1)
> > +     lb t1, 5(a1)
> > +     lb t2, 6(a1)
> > +     lb t3, 7(a1)
> > +     sb a4, 0(a0)
> > +     sb a5, 1(a0)
> > +     sb a6, 2(a0)
> > +     sb a7, 3(a0)
> > +     sb t0, 4(a0)
> > +     sb t1, 5(a0)
> > +     sb t2, 6(a0)
> > +     sb t3, 7(a0)
> > +     addi a0, a0, 8
> > +     addi a1, a1, 8
> > +     bltu a1, a3, 1b
> > +
> > +2:
> > +     ret
> > +END(__copy_bytes_unaligned)
> > diff --git a/arch/riscv/kernel/copy-noalign.h b/arch/riscv/kernel/copy-noalign.h
> > new file mode 100644
> > index 000000000000..99fbb9c763e0
> > --- /dev/null
> > +++ b/arch/riscv/kernel/copy-noalign.h
> > @@ -0,0 +1,13 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Copyright (C) 2023 Rivos, Inc.
> > + */
> > +#ifndef __RISCV_KERNEL_COPY_NOALIGN_H
> > +#define __RISCV_KERNEL_COPY_NOALIGN_H
> > +
> > +#include <linux/types.h>
> > +
> > +void __copy_words_unaligned(void *dst, const void *src, size_t size);
> > +void __copy_bytes_unaligned(void *dst, const void *src, size_t size);
> > +
> > +#endif /* __RISCV_KERNEL_COPY_NOALIGN_H */
> > diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
> > index bdcf460ea53d..3f7200dcc00c 100644
> > --- a/arch/riscv/kernel/cpufeature.c
> > +++ b/arch/riscv/kernel/cpufeature.c
> > @@ -19,11 +19,21 @@
> >  #include <asm/cacheflush.h>
> >  #include <asm/cpufeature.h>
> >  #include <asm/hwcap.h>
> > +#include <asm/hwprobe.h>
> >  #include <asm/patch.h>
> >  #include <asm/processor.h>
> >  #include <asm/vector.h>
> >
> > +#include "copy-noalign.h"
> > +
> >  #define NUM_ALPHA_EXTS ('z' - 'a' + 1)
> > +#define MISALIGNED_ACCESS_JIFFIES_LG2 1
> > +#define MISALIGNED_BUFFER_SIZE 0x4000
> > +#define MISALIGNED_COPY_SIZE ((MISALIGNED_BUFFER_SIZE / 2) - 0x80)
> > +
>
> I think this blank line is misplaced, it should go after NUM_ALPHA_EXTS
> instead of (or as well as) here.
>
> > +#define MISALIGNED_COPY_MBS(_count) \
> > +     ((HZ * (_count) * MISALIGNED_COPY_SIZE) >> \
> > +      (20 + MISALIGNED_ACCESS_JIFFIES_LG2))
> >
> >  unsigned long elf_hwcap __read_mostly;
> >
> > @@ -396,6 +406,74 @@ unsigned long riscv_get_elf_hwcap(void)
> >       return hwcap;
> >  }
> >
> > +void check_misaligned_access(int cpu)
> > +{
> > +     unsigned long j0, j1;
> > +     struct page *page;
> > +     void *dst;
> > +     void *src;
> > +     long word_copies = 0;
> > +     long byte_copies = 0;
> > +     long speed = RISCV_HWPROBE_MISALIGNED_SLOW;
>
> Is this not a change from current behaviour, that may actually lead to
> incorrect reporting. Presently, only T-Head stuff sets a speed, so
> hwprobe falls back to UNKNOWN for everything else. With this, we will
> get slow set, for anything failing the test.
> Slow is defined as "Misaligned accesses are supported in hardware, but
> are slower than the cooresponding aligned accesses sequences (sic)", but
> you have no way of knowing, based on the test you are performing, whether
> the hardware supports it or if it is emulated by firmware.
> Perhaps that is not relevant to userspace, but wanted to know your
> thoughts.
>

Hm, that's true. EMULATED was an easy one when we were planning to get
this info from the DT. It also might be an easy one in the future, if
we get an SBI call that allows the kernel to take over misaligned trap
handling. We'd then be able to do a misaligned access and see if our
trap handler got called.
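
Roughly what I'm picturing, as a sketch only (the per-cpu flag, the
helper name, and the idea that our trap handler sets the flag are all
hypothetical; nothing like this exists today):

/* Set to true by the (hypothetical) kernel misaligned trap handler. */
static DEFINE_PER_CPU(bool, misaligned_trap_taken);

static bool misaligned_access_emulated(void)
{
	unsigned long buf[2] = { 0, 0 };

	this_cpu_write(misaligned_trap_taken, false);
	/* One deliberate misaligned word load; traps if hw can't do it. */
	READ_ONCE(*(unsigned long *)((u8 *)buf + 1));
	return this_cpu_read(misaligned_trap_taken);
}

If that returned true we could report EMULATED directly instead of
trying to infer it from throughput.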

One option is to leave the value alone if we fail the FAST test
(rather than changing it from UNKNOWN to SLOW). This isn't great
though, as it effectively makes UNKNOWN synonymous with SLOW, but in a
way where usermode can't tell the difference between "I truly don't
know" and "I tried the fast test and it failed".

The alternative, as it is now, may mislabel some emulated systems as
slow until the new SBI call shows up. I'm not sure how bad this is in
practice. We could add a subsequent performance bar below which we
guess "emulated". This probably matches what usermode will use that
value for anyway (a synonym for "very slow"), but it's basically the
same problem with reversed polarity (we mislabel some slow systems as
emulated). I'm open to suggestions!

-Evan

> > +
> > +     page = alloc_pages(GFP_NOWAIT, get_order(MISALIGNED_BUFFER_SIZE));
> > +     if (!page) {
> > +             pr_warn("Can't alloc pages to measure memcpy performance");
> > +             return;
> > +     }
> > +
> > +     /* Make a misaligned destination buffer. */
> > +     dst = (void *)((unsigned long)page_address(page) | 0x1);
> > +     /* Misalign src as well, but differently (off by 1 + 2 = 3). */
> > +     src = dst + (MISALIGNED_BUFFER_SIZE / 2);
> > +     src += 2;
> > +     /* Do a warmup. */
> > +     __copy_words_unaligned(dst, src, MISALIGNED_COPY_SIZE);
> > +     preempt_disable();
> > +     j0 = jiffies;
> > +     while ((j1 = jiffies) == j0)
> > +             cpu_relax();
> > +
> > +     while (time_before(jiffies,
> > +                        j1 + (1 << MISALIGNED_ACCESS_JIFFIES_LG2))) {
>
> Does this not fit in 100 chars?
>
> > +
> > +             __copy_words_unaligned(dst, src, MISALIGNED_COPY_SIZE);
> > +             word_copies += 1;
> > +     }
> > +
> > +     __copy_bytes_unaligned(dst, src, MISALIGNED_COPY_SIZE);
> > +     j0 = jiffies;
> > +     while ((j1 = jiffies) == j0)
> > +             cpu_relax();
> > +
> > +     while (time_before(jiffies,
> > +                        j1 + (1 << MISALIGNED_ACCESS_JIFFIES_LG2))) {
>
> Ditto here, no?
>
> Cheers,
> Conor.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] RISC-V: Probe for unaligned access speed
  2023-06-26 21:42   ` Jessica Clarke
@ 2023-06-27 19:11     ` Evan Green
  2023-06-29 12:05       ` David Laight
  0 siblings, 1 reply; 21+ messages in thread
From: Evan Green @ 2023-06-27 19:11 UTC (permalink / raw)
  To: Jessica Clarke
  Cc: Palmer Dabbelt, linux-doc, Yangyu Chen, Conor Dooley, Guo Ren,
	Jisheng Zhang, linux-riscv, Jonathan Corbet, Xianting Tian,
	Masahiro Yamada, Greentime Hu, Simon Hosie, Li Zhengyu,
	Andrew Jones, Albert Ou, Alexandre Ghiti, Ley Foon Tan,
	Paul Walmsley, Heiko Stuebner, Anup Patel, linux-kernel,
	Sia Jee Heng, Palmer Dabbelt, Andy Chiu

On Mon, Jun 26, 2023 at 2:42 PM Jessica Clarke <jrtc27@jrtc27.com> wrote:
>
> On 23 Jun 2023, at 23:20, Evan Green <evan@rivosinc.com> wrote:
> >
> > Rather than deferring misaligned access speed determinations to a vendor
> > function, let's probe them and find out how fast they are. If we
> > determine that a misaligned word access is faster than N byte accesses,
> > mark the hardware's misaligned access as "fast".
>
> How sure are you that your measurements can be extrapolated and aren’t
> an artefact of the testing process? For example, off the top of my head:
>
> * The first run will potentially be penalised by data cache misses,
>   untrained prefetchers, TLB misses, branch predictors, etc. compared
>   with later runs. You have one warmup, but who knows how many
>   iterations it will take to converge?

I'd expect the cache penalties to be reasonably covered by a single
warmup. You're right about branch prediction, which is why I tried to
use a large-ish buffer size, minimize the ratio of conditionals to
loads/stores, and do the test for a decent number of iterations (on my
THead, about 1800 and 400 for words and bytes).
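
(For scale: with the patch's constants, MISALIGNED_COPY_SIZE is
0x2000 - 0x80 = 8064 bytes and each timed window is 2 jiffies, so
assuming HZ=250 those counts correspond to (250 * 1800 * 8064) >> 21,
roughly 1730 MB/s, for words and (250 * 400 * 8064) >> 21, roughly
384 MB/s, for bytes; that's exactly the arithmetic behind the
MISALIGNED_COPY_MBS() values in the boot message.)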

When I ran the test a handful of times, I did see variation on the
order of ~5%. But the comparison of the two numbers doesn't seem to be
anywhere near that margin (THead C906 was ~4x faster doing misaligned
word accesses, others with slow misaligned accesses also reporting
numbers not anywhere close to each other).

>
> * The code being benchmarked isn’t the code being run, so differences
>   in access patterns, loop unrolling, loop alignment, etc. may cause the
>   real code to behave differently (and perhaps change which is better).

I'm not trying to make statements about memcpy specifically, but
(only) about misaligned accesses, which is why I tried to write loops
that isolated that element as much as possible.

>
> The non-determinism that could in theory result from this also seems
> undesirable.

This is fair; if we have machines where this waffles from boot to boot,
that's not great. In theory if misaligned word accesses come out to
being almost exactly equal to N byte accesses, then it doesn't matter
which you choose, though of course it could still make a difference in
practice. The alternative though of providing no info just pushes the
same problem out into userspace, which seems worse.
-Evan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH 1/2] RISC-V: Probe for unaligned access speed
  2023-06-27 19:11     ` Evan Green
@ 2023-06-29 12:05       ` David Laight
  2023-06-29 23:09         ` Evan Green
  0 siblings, 1 reply; 21+ messages in thread
From: David Laight @ 2023-06-29 12:05 UTC (permalink / raw)
  To: 'Evan Green', Jessica Clarke
  Cc: Palmer Dabbelt, linux-doc, Yangyu Chen, Conor Dooley, Guo Ren,
	Jisheng Zhang, linux-riscv, Jonathan Corbet, Xianting Tian,
	Masahiro Yamada, Greentime Hu, Simon Hosie, Li Zhengyu,
	Andrew Jones, Albert Ou, Alexandre Ghiti, Ley Foon Tan,
	Paul Walmsley, Heiko Stuebner, Anup Patel, linux-kernel,
	Sia Jee Heng, Palmer Dabbelt, Andy Chiu

From: Evan Green
> Sent: 27 June 2023 20:12
> 
> On Mon, Jun 26, 2023 at 2:42 PM Jessica Clarke <jrtc27@jrtc27.com> wrote:
> >
> > On 23 Jun 2023, at 23:20, Evan Green <evan@rivosinc.com> wrote:
> > >
> > > Rather than deferring misaligned access speed determinations to a vendor
> > > function, let's probe them and find out how fast they are. If we
> > > determine that a misaligned word access is faster than N byte accesses,
> > > mark the hardware's misaligned access as "fast".
> >
> > How sure are you that your measurements can be extrapolated and aren’t
> > an artefact of the testing process? For example, off the top of my head:
> >
> > * The first run will potentially be penalised by data cache misses,
> >   untrained prefetchers, TLB misses, branch predictors, etc. compared
> >   with later runs. You have one warmup, but who knows how many
> >   iterations it will take to converge?
> 
> I'd expect the cache penalties to be reasonably covered by a single
> warmup. You're right about branch prediction, which is why I tried to
> use a large-ish buffer size, minimize the ratio of conditionals to
> loads/stores, and do the test for a decent number of iterations (on my
> THead, about 1800 and 400 for words and bytes).
> 
> When I ran the test a handful of times, I did see variation on the
> order of ~5%. But the comparison of the two numbers doesn't seem to be
> anywhere near that margin (THead C906 was ~4x faster doing misaligned
> word accesses, others with slow misaligned accesses also reporting
> numbers not anywhere close to each other).

Isn't the EMULATED case so much slower than anything else that
it is even pretty obvious from a single access?
(Possibly the 2nd access to avoid 'cold cache'.)

One of the things that can perturb measurements is hardware
interrupts. That can be mitigated by counting clocks for a few
(10 is plenty) iterations of a short request and taking the
fastest value.
For short hot-cache code sequences you can actually compare the
actual clock counts with theoretical minimum values.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] RISC-V: Probe for unaligned access speed
  2023-06-27 19:11     ` Evan Green
@ 2023-06-29 14:03       ` Conor Dooley
  2023-06-29 23:10         ` Evan Green
  0 siblings, 1 reply; 21+ messages in thread
From: Conor Dooley @ 2023-06-29 14:03 UTC (permalink / raw)
  To: Evan Green
  Cc: Palmer Dabbelt, Simon Hosie, Albert Ou, Alexandre Ghiti,
	Andrew Jones, Andy Chiu, Anup Patel, Greentime Hu, Guo Ren,
	Heiko Stuebner, Jisheng Zhang, Jonathan Corbet, Ley Foon Tan,
	Li Zhengyu, Masahiro Yamada, Palmer Dabbelt, Paul Walmsley,
	Sia Jee Heng, Sunil V L, Xianting Tian, Yangyu Chen, linux-doc,
	linux-kernel, linux-riscv

On Tue, Jun 27, 2023 at 12:11:25PM -0700, Evan Green wrote:
> On Mon, Jun 26, 2023 at 7:15 AM Conor Dooley <conor.dooley@microchip.com> wrote:
> > > +void check_misaligned_access(int cpu)
> > > +{
> > > +     unsigned long j0, j1;
> > > +     struct page *page;
> > > +     void *dst;
> > > +     void *src;
> > > +     long word_copies = 0;
> > > +     long byte_copies = 0;
> > > +     long speed = RISCV_HWPROBE_MISALIGNED_SLOW;
> >
> > Is this not a change from current behaviour, that may actually lead to
> > incorrect reporting. Presently, only T-Head stuff sets a speed, so
> > hwprobe falls back to UNKNOWN for everything else. With this, we will
> > get slow set, for anything failing the test.
> > Slow is defined as "Misaligned accesses are supported in hardware, but
> > are slower than the cooresponding aligned accesses sequences (sic)", but
> > you have no way of knowing, based on the test you are performing, whether
> > the hardware supports it or if it is emulated by firmware.
> > Perhaps that is not relevant to userspace, but wanted to know your
> > thoughts.
> >
> 
> Hm, that's true. EMULATED was an easy one when we were planning to get
> this info from the DT. It also might be an easy one in the future, if
> we get an SBI call that allows the kernel to take over misaligned trap
> handling. We'd then be able to do a misaligned access and see if our
> trap handler got called.
> 
> One option is to leave the value alone if we fail the FAST test
> (rather than changing it from UNKNOWN to SLOW). This isn't great
> though, as it effectively makes UNKNOWN synonymous with SLOW, but in a
> way where usermode can't tell the difference between "I truly don't
> know" and "I tried the fast test and it failed".
> 
> The alternative, as it is now, may mislabel some emulated systems as
> slow until the new SBI call shows up.

Make that "mislabel some emulated systems forever", existing systems
don't magically grow support for new extensions unfortunately.

Realistically though, does it matter to userspace if it is slow because
the hardware is slow, or if the emulation is slow, since there's not
really a way for userspace to tell from the syscall by how much it is
slower.
It can probably guess that emulation is worse, given how crap the
speed I see on mpfs is.

I'd rather we did say slow, rather than people start to interpret
UNKNOWN as slow.

> I'm not sure how bad this is in
> practice. We could add a subsequent performance bar below which we
> guess "emulated".

Nah, I don't really think that that is required.

> This probably matches what usermode will use that
> value for anyway (a synonym for "very slow"), but it's basically the
> same problem with reversed polarity (we mislabel some slow systems as
> emulated). I'm open to suggestions!

I think I just agreed with you, give or take. If it is fast, say fast.
If it is slow, we say it is slow. If we know it is emulated, then we can
report it being emulated. Is it too late to remove the "hardware" from
the syscall documentation, IOW s/supported in hardware/supported/?

Please actually describe the assumptions/subtleties in the commit
message though, so that the rationale for stuff is in the history :)

Cheers,
Conor.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] RISC-V: Probe for unaligned access speed
  2023-06-29 12:05       ` David Laight
@ 2023-06-29 23:09         ` Evan Green
  2023-06-30  8:29           ` David Laight
  0 siblings, 1 reply; 21+ messages in thread
From: Evan Green @ 2023-06-29 23:09 UTC (permalink / raw)
  To: David Laight
  Cc: Jessica Clarke, Palmer Dabbelt, linux-doc, Yangyu Chen,
	Conor Dooley, Guo Ren, Jisheng Zhang, linux-riscv,
	Jonathan Corbet, Xianting Tian, Masahiro Yamada, Greentime Hu,
	Simon Hosie, Li Zhengyu, Andrew Jones, Albert Ou,
	Alexandre Ghiti, Ley Foon Tan, Paul Walmsley, Heiko Stuebner,
	Anup Patel, linux-kernel, Sia Jee Heng, Palmer Dabbelt,
	Andy Chiu

On Thu, Jun 29, 2023 at 5:05 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Evan Green
> > Sent: 27 June 2023 20:12
> >
> > On Mon, Jun 26, 2023 at 2:42 PM Jessica Clarke <jrtc27@jrtc27.com> wrote:
> > >
> > > On 23 Jun 2023, at 23:20, Evan Green <evan@rivosinc.com> wrote:
> > > >
> > > > Rather than deferring misaligned access speed determinations to a vendor
> > > > function, let's probe them and find out how fast they are. If we
> > > > determine that a misaligned word access is faster than N byte accesses,
> > > > mark the hardware's misaligned access as "fast".
> > >
> > > How sure are you that your measurements can be extrapolated and aren’t
> > > an artefact of the testing process? For example, off the top of my head:
> > >
> > > * The first run will potentially be penalised by data cache misses,
> > >   untrained prefetchers, TLB misses, branch predictors, etc. compared
> > >   with later runs. You have one warmup, but who knows how many
> > >   iterations it will take to converge?
> >
> > I'd expect the cache penalties to be reasonably covered by a single
> > warmup. You're right about branch prediction, which is why I tried to
> > use a large-ish buffer size, minimize the ratio of conditionals to
> > loads/stores, and do the test for a decent number of iterations (on my
> > THead, about 1800 and 400 for words and bytes).
> >
> > When I ran the test a handful of times, I did see variation on the
> > order of ~5%. But the comparison of the two numbers doesn't seem to be
> > anywhere near that margin (THead C906 was ~4x faster doing misaligned
> > word accesses, others with slow misaligned accesses also reporting
> > numbers not anywhere close to each other).
>
> Isn't the EMULATED case so much slower than anything else that
> it is even pretty obvious from a single access?
> (Possibly the 2nd access to avoid 'cold cache'.)
>
> One of the things that can perturb measurements is hardware
> interrupts. That can be mitigated by counting clocks for a few
> (10 is plenty) iterations of a short request and taking the
> fastest value.
> For short hot-cache code sequences you can actually compare the
> actual clock counts with theoretical minimum values.

Yeah, one thing I could do is disable interrupts, measure the cycle
count of doing an individual iteration, do this N times, and take the
minimum value as the time to compare. In the end I'll then have two
numbers to compare, like I do in this patch. In theory the variance on
that should be really tight. N will have to depend on the overall
amount of time I'm taking so as not to shut interrupts off for very
long. Let me experiment with this and see how the results look.
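
Something like this is what I have in mind; sketch only, the helper
name and the trial count are placeholders:

#define MISALIGNED_TRIALS 10

static u64 time_unaligned_copy(void (*copy)(void *, const void *, size_t),
			       void *dst, const void *src, size_t size)
{
	u64 best = U64_MAX;
	int i;

	for (i = 0; i < MISALIGNED_TRIALS; i++) {
		u64 start, end;

		local_irq_disable();
		start = get_cycles();
		copy(dst, src, size);
		end = get_cycles();
		local_irq_enable();

		/* Keep the fastest run; interrupts only ever inflate a run. */
		best = min(best, end - start);
	}

	return best;
}

The word/byte comparison then just becomes
time_unaligned_copy(__copy_words_unaligned, ...) versus
time_unaligned_copy(__copy_bytes_unaligned, ...), with a much smaller
copy size so interrupts stay off only briefly.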
-Evan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] RISC-V: Probe for unaligned access speed
  2023-06-29 14:03       ` Conor Dooley
@ 2023-06-29 23:10         ` Evan Green
  0 siblings, 0 replies; 21+ messages in thread
From: Evan Green @ 2023-06-29 23:10 UTC (permalink / raw)
  To: Conor Dooley
  Cc: Palmer Dabbelt, Simon Hosie, Albert Ou, Alexandre Ghiti,
	Andrew Jones, Andy Chiu, Anup Patel, Greentime Hu, Guo Ren,
	Heiko Stuebner, Jisheng Zhang, Jonathan Corbet, Ley Foon Tan,
	Li Zhengyu, Masahiro Yamada, Palmer Dabbelt, Paul Walmsley,
	Sia Jee Heng, Sunil V L, Xianting Tian, Yangyu Chen, linux-doc,
	linux-kernel, linux-riscv

On Thu, Jun 29, 2023 at 7:03 AM Conor Dooley <conor.dooley@microchip.com> wrote:
>
> On Tue, Jun 27, 2023 at 12:11:25PM -0700, Evan Green wrote:
> > On Mon, Jun 26, 2023 at 7:15 AM Conor Dooley <conor.dooley@microchip.com> wrote:
> > > > +void check_misaligned_access(int cpu)
> > > > +{
> > > > +     unsigned long j0, j1;
> > > > +     struct page *page;
> > > > +     void *dst;
> > > > +     void *src;
> > > > +     long word_copies = 0;
> > > > +     long byte_copies = 0;
> > > > +     long speed = RISCV_HWPROBE_MISALIGNED_SLOW;
> > >
> > > Is this not a change from current behaviour, that may actually lead to
> > > incorrect reporting. Presently, only T-Head stuff sets a speed, so
> > > hwprobe falls back to UNKNOWN for everything else. With this, we will
> > > get slow set, for anything failing the test.
> > > Slow is defined as "Misaligned accesses are supported in hardware, but
> > > are slower than the cooresponding aligned accesses sequences (sic)", but
> > > you have no way of knowing, based on the test you are performing, whether
> > > the hardware supports it or if it is emulated by firmware.
> > > Perhaps that is not relevant to userspace, but wanted to know your
> > > thoughts.
> > >
> >
> > Hm, that's true. EMULATED was an easy one when we were planning to get
> > this info from the DT. It also might be an easy one in the future, if
> > we get an SBI call that allows the kernel to take over misaligned trap
> > handling. We'd then be able to do a misaligned access and see if our
> > trap handler got called.
> >
> > One option is to leave the value alone if we fail the FAST test
> > (rather than changing it from UNKNOWN to SLOW). This isn't great
> > though, as it effectively makes UNKNOWN synonymous with SLOW, but in a
> > way where usermode can't tell the difference between "I truly don't
> > know" and "I tried the fast test and it failed".
> >
> > The alternative, as it is now, may mislabel some emulated systems as
> > slow until the new SBI call shows up.
>
> Make that "mislabel some emulated systems forever", existing systems
> don't magically grow support for new extensions unfortunately.

Right.

>
> Realistically though, does it matter to userspace if it is slow because
> the hardware is slow, or if the emulation is slow, since there's not
> really a way for userspace to tell from the syscall by how much it is
> slower.
> It can probably guess that emulation is worse, given how crap the
> speed I see on mpfs is.
>
> I'd rather we did say slow, rather than people start to interpret
> UNKNOWN as slow.

I think I agree.

>
> > I'm not sure how bad this is in
> > practice. We could add a subsequent performance bar below which we
> > guess "emulated".
>
> Nah, I don't really think that that is required.
>
> > This probably matches what usermode will use that
> > value for anyway (a synonym for "very slow"), but it's basically the
> > same problem with reversed polarity (we mislabel some slow systems as
> > emulated). I'm open to suggestions!
>
> I think I just agreed with you, give or take. If it is fast, say fast.
> If it is slow, we say it is slow. If we know it is emulated, then we can
> report it being emulated. Is it too late to remove the "hardware" from
> the syscall documentation, IOW s/supported in hardware/supported/?
>
> Please actually describe the assumptions/subtleties in the commit
> message though, so that the rationale for stuff is in the history :)

Will do. I pondered an alternative of creating a "gray zone" where if
misaligned words and bytes come out close to each other (which I don't
expect them to), we leave the setting of UNKNOWN alone. But I'm not
sure this really solves anything, it just moves the "waffle point"
around, so I couldn't convince myself it was valuable.
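
For concreteness, the rejected variant would have looked roughly like
this (the 10% threshold is purely illustrative):

	if (word_copies > (byte_copies * 110) / 100)
		speed = RISCV_HWPROBE_MISALIGNED_FAST;
	else if (byte_copies > (word_copies * 110) / 100)
		speed = RISCV_HWPROBE_MISALIGNED_SLOW;
	/* else: too close to call, leave the value at UNKNOWN */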
-Evan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH 1/2] RISC-V: Probe for unaligned access speed
  2023-06-29 23:09         ` Evan Green
@ 2023-06-30  8:29           ` David Laight
  0 siblings, 0 replies; 21+ messages in thread
From: David Laight @ 2023-06-30  8:29 UTC (permalink / raw)
  To: 'Evan Green'
  Cc: Jessica Clarke, Palmer Dabbelt, linux-doc, Yangyu Chen,
	Conor Dooley, Guo Ren, Jisheng Zhang, linux-riscv,
	Jonathan Corbet, Xianting Tian, Masahiro Yamada, Greentime Hu,
	Simon Hosie, Li Zhengyu, Andrew Jones, Albert Ou,
	Alexandre Ghiti, Ley Foon Tan, Paul Walmsley, Heiko Stuebner,
	Anup Patel, linux-kernel, Sia Jee Heng, Palmer Dabbelt,
	Andy Chiu

...
> Yeah, one thing I could do is disable interrupts, measure the cycle
> count of doing an individual iteration, do this N times, and take the
> minimum value as the time to compare. In the end I'll then have two
> numbers to compare, like I do in this patch. In theory the variance on
> that should be really tight. N will have to depend on the overall
> amount of time I'm taking so as not to shut interrupts off for very
> long. Let me experiment with this and see how the results look.
> -Evan

I doubt you'll need many iterations or a long test.

You can do tests in userspace without disabling pre-emption
or interrupts - the large/silly values they generate are
easily ignored.

I suspect you'll get enough info from something like:
	unsigned long x[2];
	volatile unsigned long *p = (void *)((unsigned char *)x + 1);
	full_cpu_barrier();
	start = rdtsc();
	full_cpu_barrier();
	*p; *p; *p; *p; *p; *p; *p; *p;
	*p; *p; *p; *p; *p; *p; *p; *p;
	full_cpu_barrier();
	elapsed = rdtsc() - start;
Once the i-cache is loaded it should be pretty constant.
For aligned addresses I'd expect each extra '*p' to be
one more clock.
With hardware support for misaligned transfers at most
2 clocks (test on x86 and it will be 1 clock).
The emulated version will be 100s or 1000s.
	
I'm not sure how much of a cpu barrier you need.
Definitely needs to wait for all memory accesses
and the rdtsc().
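
For RISC-V the counter read would be rdcycle rather than rdtsc; a
userspace sketch of the same idea (assuming the cycle CSR has been
made readable from U-mode, and using fence as an approximation of a
full barrier):

	static inline unsigned long rdcycle(void)
	{
		unsigned long c;

		__asm__ __volatile__ ("rdcycle %0" : "=r" (c));
		return c;
	}

	unsigned long x[2];
	volatile unsigned long *p = (void *)((unsigned char *)x + 1);
	unsigned long start, elapsed;

	__asm__ __volatile__ ("fence rw,rw" ::: "memory");
	start = rdcycle();
	*p; *p; *p; *p; *p; *p; *p; *p;
	*p; *p; *p; *p; *p; *p; *p; *p;
	__asm__ __volatile__ ("fence rw,rw" ::: "memory");
	elapsed = rdcycle() - start;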

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] RISC-V: alternative: Remove feature_probe_func
  2023-06-26 13:07   ` Conor Dooley
@ 2023-06-30 21:57     ` Evan Green
  0 siblings, 0 replies; 21+ messages in thread
From: Evan Green @ 2023-06-30 21:57 UTC (permalink / raw)
  To: Conor Dooley
  Cc: Palmer Dabbelt, Simon Hosie, Albert Ou, Andrew Jones, Anup Patel,
	Greentime Hu, Guo Ren, Heiko Stuebner, Jisheng Zhang,
	Ley Foon Tan, Palmer Dabbelt, Paul Walmsley, Randy Dunlap,
	Samuel Holland, Sunil V L, linux-kernel, linux-riscv

On Mon, Jun 26, 2023 at 6:07 AM Conor Dooley <conor.dooley@microchip.com> wrote:
>
> Hey Evan,
>
> On Fri, Jun 23, 2023 at 03:20:16PM -0700, Evan Green wrote:
> > Now that we're testing unaligned memory copy and making that
> > determination generically, there are no more users of the vendor
> > feature_probe_func(). While I think it's probably going to need to come
> > back, there are no users right now, so let's remove it until it's
> > needed.
>
> How come this is done as a separate patch, rather than delete the dead
> code as part of the probe addition? Ease of review?

Yes, it just seemed like a logically distinct change. Thanks for the review!
-Evan

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2023-06-30 21:58 UTC | newest]

Thread overview: 21+ messages
2023-06-23 22:20 [PATCH 0/2] RISC-V: Probe for misaligned access speed Evan Green
2023-06-23 22:20 ` [PATCH 1/2] RISC-V: Probe for unaligned " Evan Green
2023-06-25 21:42   ` David Laight
2023-06-26 14:14   ` Conor Dooley
2023-06-27 19:11     ` Evan Green
2023-06-29 14:03       ` Conor Dooley
2023-06-29 23:10         ` Evan Green
2023-06-26 21:42   ` Jessica Clarke
2023-06-27 19:11     ` Evan Green
2023-06-29 12:05       ` David Laight
2023-06-29 23:09         ` Evan Green
2023-06-30  8:29           ` David Laight
2023-06-23 22:20 ` [PATCH 2/2] RISC-V: alternative: Remove feature_probe_func Evan Green
2023-06-26 13:07   ` Conor Dooley
2023-06-30 21:57     ` Evan Green
2023-06-24 10:08 ` [PATCH 0/2] RISC-V: Probe for misaligned access speed Conor Dooley
2023-06-26 19:48   ` Evan Green
2023-06-24 10:22 ` Yangyu Chen
2023-06-26  9:24   ` David Laight
2023-06-26 19:49     ` Evan Green
2023-06-26 19:48   ` Evan Green
