[PATCH] x86: bring back rep movsq for user access on CPUs without ERMS

* [PATCH] x86: bring back rep movsq for user access on CPUs without ERMS
@ 2023-08-28 17:07 Mateusz Guzik
  2023-08-28 18:00 ` Linus Torvalds
  2023-08-29 19:30 ` Linus Torvalds
  0 siblings, 2 replies; 9+ messages in thread
From: Mateusz Guzik @ 2023-08-28 17:07 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel, linux-arch, bp, Mateusz Guzik

I spotted rep_movs_alternative on a flamegraph while building the kernel
on an AMD CPU and I was surprised to find it does not have ERMS.

The 64 byte loop there can be trivially replaced with movsq and I don't
think there are any reasons to NOT do it. If anything I'm surprised the
non-ERMS case was left hanging around after the 32KB regression reported
by Eric Dumazet.

Anyhow patch below remedies it.

The patch initially had:
 SYM_FUNC_START(rep_movs_alternative)
        cmpq $64,%rcx
-       jae .Llarge
+       ALTERNATIVE "jae .Lrep_movsq", "jae .Lrep_movsb", X86_FEATURE_ERMS


But then I found the weird nops and after reading the thread which
resulted in bringing back ERMS I see why
https://lore.kernel.org/lkml/CANn89iKUbyrJ=r2+_kK+sb2ZSSHifFZ7QkPLDpAtkJ8v4WUumA@mail.gmail.com/

That said, whacking the 64 byte loop in favor of rep movsq is definitely
the right to do and the patch below is one way to do it.

What I don't get is what's up with ERMS availability on AMD CPUs. I was
told newer uarchs support it, but the bit may be disabled in bios(?).

Anyhow, I temporarily got an instance on Amazon EC2 with EPYC 7R13 and
the bit is not there, whether this is a config problem or not.

I also note there is quite a mess concerning other string ops, which I'm
going to tackle in another thread later(tm).

================ cut here ================

Intel CPUs ship with ERMS for over a decade, but this is not true for
AMD. 

Hand-rolled mov loops executing in this case are quite pessimal compared
to rep movsq for bigger sizes. While the upper limit depends on uarch,
everyone is well south of 1KB AFAICS and sizes bigger than that are
common. The problem can be easily remedied so do it.

Sample result from read1_processes from will-it-scale on EPYC 7R13
(4KB reads/s):
before:	1507021
after:	1721828 (+14%)

Note that the cutoff point for rep usage is set to 64 bytes, which is
way too conservative but I'm sticking to what was done in 47ee3f1dd93b
("x86: re-introduce support for ERMS copies for user space accesses").
That is to say *some* copies will now go slower, which is fixable but
beyond the scope of this patch.

While here make sure labels are unique.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 arch/x86/lib/copy_user_64.S | 61 +++++++++++--------------------------
 1 file changed, 17 insertions(+), 44 deletions(-)

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index 01c5de4c279b..2fe61de81a22 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -68,55 +68,28 @@ SYM_FUNC_START(rep_movs_alternative)
 	_ASM_EXTABLE_UA( 3b, .Lcopy_user_tail)
 
 .Llarge:
-0:	ALTERNATIVE "jmp .Lunrolled", "rep movsb", X86_FEATURE_ERMS
-1:	RET
+4:	ALTERNATIVE "jmp .Llarge_movsq", "rep movsb", X86_FEATURE_ERMS
+5:	RET
 
-        _ASM_EXTABLE_UA( 0b, 1b)
+	_ASM_EXTABLE_UA( 4b, 5b)
 
-	.p2align 4
-.Lunrolled:
-10:	movq (%rsi),%r8
-11:	movq 8(%rsi),%r9
-12:	movq 16(%rsi),%r10
-13:	movq 24(%rsi),%r11
-14:	movq %r8,(%rdi)
-15:	movq %r9,8(%rdi)
-16:	movq %r10,16(%rdi)
-17:	movq %r11,24(%rdi)
-20:	movq 32(%rsi),%r8
-21:	movq 40(%rsi),%r9
-22:	movq 48(%rsi),%r10
-23:	movq 56(%rsi),%r11
-24:	movq %r8,32(%rdi)
-25:	movq %r9,40(%rdi)
-26:	movq %r10,48(%rdi)
-27:	movq %r11,56(%rdi)
-	addq $64,%rsi
-	addq $64,%rdi
-	subq $64,%rcx
-	cmpq $64,%rcx
-	jae .Lunrolled
-	cmpl $8,%ecx
-	jae .Lword
+.Llarge_movsq:
+	movq %rcx,%r8
+	movq %rcx,%rax
+	shrq $3,%rcx
+	andl $7,%eax
+6:	rep movsq
+	movl %eax,%ecx
 	testl %ecx,%ecx
 	jne .Lcopy_user_tail
 	RET
 
-	_ASM_EXTABLE_UA(10b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(11b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(12b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(13b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(14b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(15b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(16b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(17b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(20b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(21b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(22b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(23b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(24b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(25b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(26b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(27b, .Lcopy_user_tail)
+/*
+ * Recovery after failed rep movsq
+ */
+7:	movq %r8,%rcx
+	jmp .Lcopy_user_tail
+
+	_ASM_EXTABLE_UA( 6b, 7b)
 SYM_FUNC_END(rep_movs_alternative)
 EXPORT_SYMBOL(rep_movs_alternative)
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread