linux-kernel.vger.kernel.org archive mirror
* [PATCH v1] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S
@ 2021-11-01  4:49 Noah Goldstein
  2021-11-01 18:59 ` [PATCH v2] " Noah Goldstein
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Noah Goldstein @ 2021-11-01  4:49 UTC (permalink / raw)
  Cc: goldstein.w.n, tglx, mingo, bp, x86, hpa, luto, linux-kernel

Add check for "short distance movsb" for forwards FSRM usage and
entirely remove backwards 'rep movsq'. Both of these usages hit "slow
modes" that are an order of magnitude slower than usual.

'rep movsb' has some noticeable VERY slow modes that the current
implementation is either 1) not checking for or 2) intentionally
using.

All times are in cycles and measure the throughput of copying 1024
bytes.

1. For FSRM, when 'dst - src' is in (1, 63] or (4GB, 4GB + 63] it is
   an order of magnitude slower than normal and much slower than a 4x
   'movq' loop.

    FSRM forward (dst - src == 32)   -> 1113.156
    FSRM forward (dst - src == 64)   -> 120.669

    ERMS forward (dst - src == 32)   -> 209.326
    ERMS forward (dst - src == 64)   -> 118.22

2. For both FSRM and ERMS, backwards 'rep movsb' is always slow. Both
   of the times below are with dst % 256 == src % 256, which mirrors
   the usage of the previous implementation.

    FSRM backward                    -> 1196.039
    ERMS backward                    -> 1191.873

As a reference, this is how a 4x 'movq' loop performs:

    4x Forward (dst - src == 32)     -> 128.273
    4x Backward                      -> 130.183

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
 arch/x86/lib/memmove_64.S | 28 ++++++++++------------------
 1 file changed, 10 insertions(+), 18 deletions(-)

diff --git a/arch/x86/lib/memmove_64.S b/arch/x86/lib/memmove_64.S
index 64801010d312..9d5f3ec4db04 100644
--- a/arch/x86/lib/memmove_64.S
+++ b/arch/x86/lib/memmove_64.S
@@ -39,7 +39,16 @@ SYM_FUNC_START(__memmove)
 
 	/* FSRM implies ERMS => no length checks, do the copy directly */
 .Lmemmove_begin_forward:
-	ALTERNATIVE "cmp $0x20, %rdx; jb 1f", "", X86_FEATURE_FSRM
+	/*
+	 * Don't use FSRM 'rep movsb' if 'dst - src' in (0, 63] or (4GB, 4GB +
+	 * 63]. It hits a slow case which is an order of magnitude slower.
+	 */
+	ALTERNATIVE "cmp $0x20, %rdx;"
+				"jb 1f;"
+				"mov %edi, %ecx;"
+				"sub %esi, %ecx;"
+				"cmp $63, %ecx;"
+				"jb 3f;", "", X86_FEATURE_FSRM
 	ALTERNATIVE "", "movq %rdx, %rcx; rep movsb; retq", X86_FEATURE_ERMS
 
 	/*
@@ -89,23 +98,6 @@ SYM_FUNC_START(__memmove)
 	jmp 13f
 .Lmemmove_end_forward:
 
-	/*
-	 * Handle data backward by movsq.
-	 */
-	.p2align 4
-7:
-	movq %rdx, %rcx
-	movq (%rsi), %r11
-	movq %rdi, %r10
-	leaq -8(%rsi, %rdx), %rsi
-	leaq -8(%rdi, %rdx), %rdi
-	shrq $3, %rcx
-	std
-	rep movsq
-	cld
-	movq %r11, (%r10)
-	jmp 13f
-
 	/*
 	 * Start to prepare for backward copy.
 	 */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread
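
The cycle counts above are the author's own measurements; the harness used is
not shown in the thread. Numbers of the same shape could be reproduced with a
user-space sketch along these lines (illustrative only: copy_fn, the buffer
layout and the iteration count are all assumptions, and by default it measures
the C library's memmove() rather than the kernel routine):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <x86intrin.h>

    /*
     * Hypothetical harness: average cycles per 1024-byte copy at a chosen
     * dst - src offset. To test the kernel routine, memmove_64.S would have
     * to be built into the harness and copy_fn pointed at it instead.
     */
    static void *(*copy_fn)(void *, const void *, size_t) = memmove;

    int main(void)
    {
            static unsigned char buf[1024 + 4096];
            ptrdiff_t offset = 32;                  /* dst - src */
            unsigned char *src = buf;
            unsigned char *dst = buf + offset;
            uint64_t start, end;
            int i, iters = 100000;

            memset(buf, 1, sizeof(buf));
            start = __rdtsc();
            for (i = 0; i < iters; i++) {
                    copy_fn(dst, src, 1024);
                    asm volatile("" : : "r" (dst) : "memory");
            }
            end = __rdtsc();
            printf("dst - src == %td: %.3f cycles/copy\n",
                   offset, (double)(end - start) / iters);
            return 0;
    }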

* [PATCH v2] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S
  2021-11-01  4:49 [PATCH v1] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S Noah Goldstein
@ 2021-11-01 18:59 ` Noah Goldstein
  2021-11-02 23:15 ` [PATCH v3] " Noah Goldstein
  2021-11-17 21:02 ` [PATCH v4] " Noah Goldstein
  2 siblings, 0 replies; 13+ messages in thread
From: Noah Goldstein @ 2021-11-01 18:59 UTC (permalink / raw)
  Cc: goldstein.w.n, tglx, mingo, bp, x86, hpa, luto, linux-kernel

Add check for "short distance movsb" for forwards FSRM usage and
entirely remove backwards 'rep movsq'. Both of these usages hit "slow
modes" that are an order of magnitude slower than usual.

'rep movsb' has some noticeable VERY slow modes that the current
implementation is either 1) not checking for or 2) intentionally
using.

All times are in cycles and measure the throughput of copying 1024
bytes.

1. For FSRM, when 'dst - src' is in (1, 63] or (4GB, 4GB + 63] it is
   an order of magnitude slower than normal and much slower than a 4x
   'movq' loop.

    FSRM forward (dst - src == 32)   -> 1113.156
    FSRM forward (dst - src == 64)   -> 120.669

    ERMS forward (dst - src == 32)   -> 209.326
    ERMS forward (dst - src == 64)   -> 118.22

2. For both FSRM and ERMS, backwards 'rep movsb' is always slow. Both
   of the times below are with dst % 256 == src % 256, which mirrors
   the usage of the previous implementation.

    FSRM backward                    -> 1196.039
    ERMS backward                    -> 1191.873

As a reference, this is how a 4x 'movq' loop performs:

    4x Forward (dst - src == 32)     -> 128.273
    4x Backward                      -> 130.183

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
Mistake in v1: I had forgotten to remove the logic jumping to the
backwards 'rep movsq'.
 arch/x86/lib/memmove_64.S | 34 +++++++++++-----------------------
 1 file changed, 11 insertions(+), 23 deletions(-)

diff --git a/arch/x86/lib/memmove_64.S b/arch/x86/lib/memmove_64.S
index 64801010d312..90eb2487fde2 100644
--- a/arch/x86/lib/memmove_64.S
+++ b/arch/x86/lib/memmove_64.S
@@ -39,7 +39,16 @@ SYM_FUNC_START(__memmove)
 
 	/* FSRM implies ERMS => no length checks, do the copy directly */
 .Lmemmove_begin_forward:
-	ALTERNATIVE "cmp $0x20, %rdx; jb 1f", "", X86_FEATURE_FSRM
+	/*
+	 * Don't use FSRM 'rep movsb' if 'dst - src' in (0, 63] or (4GB, 4GB +
+	 * 63]. It hits a slow case which is an order of magnitude slower.
+	 */
+	ALTERNATIVE "cmp $0x20, %rdx;"
+				"jb 1f;"
+				"mov %edi, %ecx;"
+				"sub %esi, %ecx;"
+				"cmp $63, %ecx;"
+				"jb 3f;", "", X86_FEATURE_FSRM
 	ALTERNATIVE "", "movq %rdx, %rcx; rep movsb; retq", X86_FEATURE_ERMS
 
 	/*
@@ -89,23 +98,6 @@ SYM_FUNC_START(__memmove)
 	jmp 13f
 .Lmemmove_end_forward:
 
-	/*
-	 * Handle data backward by movsq.
-	 */
-	.p2align 4
-7:
-	movq %rdx, %rcx
-	movq (%rsi), %r11
-	movq %rdi, %r10
-	leaq -8(%rsi, %rdx), %rsi
-	leaq -8(%rdi, %rdx), %rdi
-	shrq $3, %rcx
-	std
-	rep movsq
-	cld
-	movq %r11, (%r10)
-	jmp 13f
-
 	/*
 	 * Start to prepare for backward copy.
 	 */
@@ -113,11 +105,7 @@ SYM_FUNC_START(__memmove)
 2:
 	cmp $0x20, %rdx
 	jb 1f
-	cmp $680, %rdx
-	jb 6f
-	cmp %dil, %sil
-	je 7b
-6:
+
 	/*
 	 * Calculate copy position to tail.
 	 */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S
  2021-11-01  4:49 [PATCH v1] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S Noah Goldstein
  2021-11-01 18:59 ` [PATCH v2] " Noah Goldstein
@ 2021-11-02 23:15 ` Noah Goldstein
  2021-11-17 21:02 ` [PATCH v4] " Noah Goldstein
  2 siblings, 0 replies; 13+ messages in thread
From: Noah Goldstein @ 2021-11-02 23:15 UTC (permalink / raw)
  Cc: goldstein.w.n, tglx, mingo, bp, x86, hpa, luto, linux-kernel

Add check for "short distance movsb" for forwards FSRM usage and
entirely remove backwards 'rep movsq'. Both of these usages hit "slow
modes" that are an order of magnitude slower than usual.

'rep movsb' has some noticeable VERY slow modes that the current
implementation is either 1) not checking for or 2) intentionally
using.

All times are in cycles and measure the throughput of copying 1024
bytes.

1. For FSRM, when 'dst - src' is in (1, 63] or (4GB, 4GB + 63] it is
   an order of magnitude slower than normal and much slower than a 4x
   'movq' loop.

    FSRM forward (dst - src == 32)   -> 1113.156
    FSRM forward (dst - src == 64)   -> 120.669

    ERMS forward (dst - src == 32)   -> 209.326
    ERMS forward (dst - src == 64)   -> 118.22

2. For both FSRM and ERMS, backwards 'rep movsb' is always slow. Both
   of the times below are with dst % 256 == src % 256, which mirrors
   the usage of the previous implementation.

    FSRM backward                    -> 1196.039
    ERMS backward                    -> 1191.873

As a reference, this is how a 4x 'movq' loop performs:

    4x Forward (dst - src == 32)     -> 128.273
    4x Backward                      -> 130.183

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
It appears the included assembly needs to be all on one line.
 arch/x86/lib/memmove_64.S | 29 ++++++-----------------------
 1 file changed, 6 insertions(+), 23 deletions(-)

diff --git a/arch/x86/lib/memmove_64.S b/arch/x86/lib/memmove_64.S
index 64801010d312..8fb16a7d0ea2 100644
--- a/arch/x86/lib/memmove_64.S
+++ b/arch/x86/lib/memmove_64.S
@@ -39,7 +39,11 @@ SYM_FUNC_START(__memmove)
 
 	/* FSRM implies ERMS => no length checks, do the copy directly */
 .Lmemmove_begin_forward:
-	ALTERNATIVE "cmp $0x20, %rdx; jb 1f", "", X86_FEATURE_FSRM
+	/*
+	 * Don't use FSRM 'rep movsb' if 'dst - src' in (0, 63] or (4GB, 4GB +
+	 * 63]. It hits a slow case which is an order of magnitude slower.
+	 */
+	ALTERNATIVE "cmp $0x20, %rdx; jb 1f; mov %edi, %ecx; sub %esi, %ecx; cmp $63, %ecx; jbe 3f;", "", X86_FEATURE_FSRM
 	ALTERNATIVE "", "movq %rdx, %rcx; rep movsb; retq", X86_FEATURE_ERMS
 
 	/*
@@ -89,23 +93,6 @@ SYM_FUNC_START(__memmove)
 	jmp 13f
 .Lmemmove_end_forward:
 
-	/*
-	 * Handle data backward by movsq.
-	 */
-	.p2align 4
-7:
-	movq %rdx, %rcx
-	movq (%rsi), %r11
-	movq %rdi, %r10
-	leaq -8(%rsi, %rdx), %rsi
-	leaq -8(%rdi, %rdx), %rdi
-	shrq $3, %rcx
-	std
-	rep movsq
-	cld
-	movq %r11, (%r10)
-	jmp 13f
-
 	/*
 	 * Start to prepare for backward copy.
 	 */
@@ -113,11 +100,7 @@ SYM_FUNC_START(__memmove)
 2:
 	cmp $0x20, %rdx
 	jb 1f
-	cmp $680, %rdx
-	jb 6f
-	cmp %dil, %sil
-	je 7b
-6:
+
 	/*
 	 * Calculate copy position to tail.
 	 */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread
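
The single-line ALTERNATIVE above boils down to one unsigned comparison on the
low 32 bits of the pointer difference, which is how a single check covers both
the (0, 63] and the (4GB, 4GB + 63] distance ranges. A rough C rendering of
that predicate (illustrative only, not part of the patch; the helper name is
invented):

    #include <stdint.h>

    /*
     * Sketch of the forward-copy check: fall back to the plain movq loop
     * when (dst - src) mod 2^32 is at most 63. A distance of 4GB + d with
     * d <= 63 wraps to d in 32 bits, so both slow ranges are caught.
     * delta == 0 (dst == src) also bails out, which is harmless.
     */
    static inline int fsrm_short_distance(const void *dst, const void *src)
    {
            uint32_t delta = (uint32_t)((uintptr_t)dst - (uintptr_t)src);

            return delta <= 63;
    }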

* [PATCH v4] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S
  2021-11-01  4:49 [PATCH v1] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S Noah Goldstein
  2021-11-01 18:59 ` [PATCH v2] " Noah Goldstein
  2021-11-02 23:15 ` [PATCH v3] " Noah Goldstein
@ 2021-11-17 21:02 ` Noah Goldstein
  2021-11-17 22:31   ` David Laight
  2022-03-16 16:23   ` Borislav Petkov
  2 siblings, 2 replies; 13+ messages in thread
From: Noah Goldstein @ 2021-11-17 21:02 UTC (permalink / raw)
  Cc: goldstein.w.n, tglx, mingo, bp, x86, hpa, luto, linux-kernel

Add check for "short distance movsb" for forwards FSRM usage and
entirely remove backwards 'rep movsq'. Both of these usages hit "slow
modes" that are an order of magnitude slower than usual.

'rep movsb' has some noticeable VERY slow modes that the current
implementation is either 1) not checking for or 2) intentionally
using.

All times are in cycles and measure the throughput of copying 1024
bytes.

1. For FSRM, when 'dst - src' is in (1, 63] or (4GB, 4GB + 63] it is
   an order of magnitude slower than normal and much slower than a 4x
   'movq' loop.

    FSRM forward (dst - src == 32)   -> 1113.156
    FSRM forward (dst - src == 64)   -> 120.669

    ERMS forward (dst - src == 32)   -> 209.326
    ERMS forward (dst - src == 64)   -> 118.22

2. For both FSRM and ERMS, backwards 'rep movsb' is always slow. Both
   of the times below are with dst % 256 == src % 256, which mirrors
   the usage of the previous implementation.

    FSRM backward                    -> 1196.039
    ERMS backward                    -> 1191.873

As a reference, this is how a 4x 'movq' loop performs:

    4x Forward (dst - src == 32)     -> 128.273
    4x Backward                      -> 130.183

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
 arch/x86/lib/memmove_64.S | 38 +++++++++++++-------------------------
 1 file changed, 13 insertions(+), 25 deletions(-)

diff --git a/arch/x86/lib/memmove_64.S b/arch/x86/lib/memmove_64.S
index 64801010d312..910b963388b1 100644
--- a/arch/x86/lib/memmove_64.S
+++ b/arch/x86/lib/memmove_64.S
@@ -28,6 +28,8 @@ SYM_FUNC_START_WEAK(memmove)
 SYM_FUNC_START(__memmove)
 
 	mov %rdi, %rax
+	cmp $0x20, %rdx
+	jb 1f
 
 	/* Decide forward/backward copy mode */
 	cmp %rdi, %rsi
@@ -39,7 +41,17 @@ SYM_FUNC_START(__memmove)
 
 	/* FSRM implies ERMS => no length checks, do the copy directly */
 .Lmemmove_begin_forward:
-	ALTERNATIVE "cmp $0x20, %rdx; jb 1f", "", X86_FEATURE_FSRM
+	/*
+	 * Don't use FSRM 'rep movsb' if 'dst - src' in (0, 63] or (4GB, 4GB +
+	 * 63]. It hits a slow case which is an order of magnitude slower.
+	 */
+	ALTERNATIVE "	\
+                mov %edi, %ecx;	\
+                sub %esi, %ecx;	\
+                cmp $63, %ecx;	\
+                jbe 3f	\
+                ", "", X86_FEATURE_FSRM
+
 	ALTERNATIVE "", "movq %rdx, %rcx; rep movsb; retq", X86_FEATURE_ERMS
 
 	/*
@@ -89,35 +101,11 @@ SYM_FUNC_START(__memmove)
 	jmp 13f
 .Lmemmove_end_forward:
 
-	/*
-	 * Handle data backward by movsq.
-	 */
-	.p2align 4
-7:
-	movq %rdx, %rcx
-	movq (%rsi), %r11
-	movq %rdi, %r10
-	leaq -8(%rsi, %rdx), %rsi
-	leaq -8(%rdi, %rdx), %rdi
-	shrq $3, %rcx
-	std
-	rep movsq
-	cld
-	movq %r11, (%r10)
-	jmp 13f
-
 	/*
 	 * Start to prepare for backward copy.
 	 */
 	.p2align 4
 2:
-	cmp $0x20, %rdx
-	jb 1f
-	cmp $680, %rdx
-	jb 6f
-	cmp %dil, %sil
-	je 7b
-6:
 	/*
 	 * Calculate copy position to tail.
 	 */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* RE: [PATCH v4] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S
  2021-11-17 21:02 ` [PATCH v4] " Noah Goldstein
@ 2021-11-17 22:31   ` David Laight
  2021-11-17 22:45     ` Noah Goldstein
  2022-03-16 16:23   ` Borislav Petkov
  1 sibling, 1 reply; 13+ messages in thread
From: David Laight @ 2021-11-17 22:31 UTC (permalink / raw)
  To: 'Noah Goldstein'; +Cc: tglx, mingo, bp, x86, hpa, luto, linux-kernel

From: Noah Goldstein
> Sent: 17 November 2021 21:03
> 
> Add check for "short distance movsb" for forwards FSRM usage and
> entirely remove backwards 'rep movsq'. Both of these usages hit "slow
> modes" that are an order of magnitude slower than usual.
> 
> 'rep movsb' has some noticeable VERY slow modes that the current
> implementation is either 1) not checking for or 2) intentionally
> using.

How does this relate to the decision that glibc made a few years
ago to use backwards 'rep movs' for non-overlapping copies?

Did they find a different corner case??

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S
  2021-11-17 22:31   ` David Laight
@ 2021-11-17 22:45     ` Noah Goldstein
  2021-11-19 22:31       ` David Laight
  0 siblings, 1 reply; 13+ messages in thread
From: Noah Goldstein @ 2021-11-17 22:45 UTC (permalink / raw)
  To: David Laight; +Cc: tglx, mingo, bp, x86, hpa, luto, linux-kernel

On Wed, Nov 17, 2021 at 4:31 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Noah Goldstein
> > Sent: 17 November 2021 21:03
> >
> > Add check for "short distance movsb" for forwards FSRM usage and
> > entirely remove backwards 'rep movsq'. Both of these usages hit "slow
> > modes" that are an order of magnitude slower than usual.
> >
> > 'rep movsb' has some noticeable VERY slow modes that the current
> > implementation is either 1) not checking for or 2) intentionally
> > using.
>
> How does this relate to the decision that glibc made a few years
> ago to use backwards 'rep movs' for non-overlapping copies?

GLIBC doesn't use backwards `rep movs`.  Since the regions are
non-overlapping it just uses a forward copy. Backwards `rep movs` comes
from setting the direction flag (`std`) and is a very slow byte
copy. For overlapping regions where a backwards copy is necessary, GLIBC
uses a 4x vec copy loop.

>
>
> Did they find a different corner case??
>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>

^ permalink raw reply	[flat|nested] 13+ messages in thread
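
As an aside on the `std` point above: a descending 'rep movs' needs the
direction flag set and both pointers aimed at the last element, which is
exactly the setup the removed kernel block performed for movsq. A GNU C
inline-asm sketch of the byte variant, purely to illustrate that setup (not
code from the thread):

    #include <stddef.h>

    /* Illustrative only: the slow descending byte copy discussed above. */
    static void rep_movsb_backward(void *dst, const void *src, size_t len)
    {
            unsigned char *d;
            const unsigned char *s;

            if (!len)
                    return;
            d = (unsigned char *)dst + len - 1;
            s = (const unsigned char *)src + len - 1;
            asm volatile("std\n\t"
                         "rep movsb\n\t"
                         "cld"
                         : "+D" (d), "+S" (s), "+c" (len)
                         : : "memory", "cc");
    }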

* RE: [PATCH v4] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S
  2021-11-17 22:45     ` Noah Goldstein
@ 2021-11-19 22:31       ` David Laight
  2021-11-20  0:05         ` Noah Goldstein
  0 siblings, 1 reply; 13+ messages in thread
From: David Laight @ 2021-11-19 22:31 UTC (permalink / raw)
  To: 'Noah Goldstein'; +Cc: tglx, mingo, bp, x86, hpa, luto, linux-kernel

From: Noah Goldstein
> Sent: 17 November 2021 22:45
> 
> On Wed, Nov 17, 2021 at 4:31 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Noah Goldstein
> > > Sent: 17 November 2021 21:03
> > >
> > > Add check for "short distance movsb" for forwards FSRM usage and
> > > entirely remove backwards 'rep movsq'. Both of these usages hit "slow
> > > modes" that are an order of magnitude slower than usual.
> > >
> > > 'rep movsb' has some noticeable VERY slow modes that the current
> > > implementation is either 1) not checking for or 2) intentionally
> > > using.
> >
> > How does this relate to the decision that glibc made a few years
> > ago to use backwards 'rep movs' for non-overlapping copies?
> 
> GLIBC doesn't use backwards `rep movs`.  Since the regions are
> non-overlapping it just uses forward copy. Backwards `rep movs` is
> from setting the direction flag (`std`) and is a very slow byte
> copy. For overlapping regions where backwards copy is necessary GLIBC
> uses 4x vec copy loop.

Try to find this commit 6fb8cbcb58a29fff73eb2101b34caa19a7f88eba

Or follow links from https://www.win.tue.nl/~aeb/linux/misc/gcc-semibug.html
But I can't find the actual patch.

The claims were a massive performance increase for the reverse copy.

The pdf from www.agner.org/optimize may well indicate why some
copies are unexpectedly slow due to cache access aliasing.

I'm pretty sure that Intel cpu (possibly from Ivy bridge onwards)
can be persuaded to copy 8 bytes/clock for in-cache data with
a fairly simple loop that contains 2 reads (maybe misaligned)
and two writes (so 16 bytes per iteration).
Extra unrolling just adds extra code top and bottom.

You might want a loop like:
	1:	mov	0(%rsi, %rcx),%rax
		mov	8(%rsi, %rcx),%rdx
		mov	%rax, 0(%rdi, %rcx)
		mov	%rdx, 8(%rdi, %rcx)
		add	$16, %rcx
		jnz	1b

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 13+ messages in thread
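
The 16-bytes-per-iteration loop above presumably assumes %rcx starts at minus
the copy length and that %rsi/%rdi are biased past the end of the buffers, so
the index counts up toward zero (the setup isn't shown in the message). A C
rendering of that pattern, for illustration only (assumes len is a non-zero
multiple of 16 and that a forward copy is safe):

    #include <stdint.h>
    #include <string.h>

    /*
     * Same shape as the loop above: two 8-byte loads then two 8-byte
     * stores, 16 bytes per iteration, negative index counting up to zero.
     */
    static void copy_fwd_16(void *dst, const void *src, size_t len)
    {
            const unsigned char *s = (const unsigned char *)src + len;
            unsigned char *d = (unsigned char *)dst + len;
            intptr_t i = -(intptr_t)len;
            uint64_t a, b;

            do {
                    memcpy(&a, s + i, 8);
                    memcpy(&b, s + i + 8, 8);
                    memcpy(d + i, &a, 8);
                    memcpy(d + i + 8, &b, 8);
                    i += 16;
            } while (i);
    }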

* Re: [PATCH v4] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S
  2021-11-19 22:31       ` David Laight
@ 2021-11-20  0:05         ` Noah Goldstein
  2021-12-10 18:35           ` Noah Goldstein
  0 siblings, 1 reply; 13+ messages in thread
From: Noah Goldstein @ 2021-11-20  0:05 UTC (permalink / raw)
  To: David Laight; +Cc: tglx, mingo, bp, x86, hpa, luto, linux-kernel

On Fri, Nov 19, 2021 at 4:31 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Noah Goldstein
> > Sent: 17 November 2021 22:45
> >
> > On Wed, Nov 17, 2021 at 4:31 PM David Laight <David.Laight@aculab.com> wrote:
> > >
> > > From: Noah Goldstein
> > > > Sent: 17 November 2021 21:03
> > > >
> > > > Add check for "short distance movsb" for forwards FSRM usage and
> > > > entirely remove backwards 'rep movsq'. Both of these usages hit "slow
> > > > modes" that are an order of magnitude slower than usual.
> > > >
> > > > 'rep movsb' has some noticeable VERY slow modes that the current
> > > > implementation is either 1) not checking for or 2) intentionally
> > > > using.
> > >
> > > How does this relate to the decision that glibc made a few years
> > > ago to use backwards 'rep movs' for non-overlapping copies?
> >
> > GLIBC doesn't use backwards `rep movs`.  Since the regions are
> > non-overlapping it just uses forward copy. Backwards `rep movs` is
> > from setting the direction flag (`std`) and is a very slow byte
> > copy. For overlapping regions where backwards copy is necessary GLIBC
> > uses 4x vec copy loop.
>
> Try to find this commit 6fb8cbcb58a29fff73eb2101b34caa19a7f88eba
>
> Or follow links from https://www.win.tue.nl/~aeb/linux/misc/gcc-semibug.html
> But I can't find the actual patch.
>
> The claims were a massive performance increase for the reverse copy.
>

I don't think that's referring to optimizations around `rep movs`. It
appears to be referring to fallout from this patch:
https://sourceware.org/git/?p=glibc.git;a=commit;h=6fb8cbcb58a29fff73eb2101b34caa19a7f88eba

which broke programs misusing `memcpy` with overlapping regions,
resulting in this fix:
https://sourceware.org/git/?p=glibc.git;a=commit;h=0354e355014b7bfda32622e0255399d859862fcd

AFAICT support for ERMS was only added around:
https://sourceware.org/git/?p=glibc.git;a=commit;h=13efa86ece61bf84daca50cab30db1b0902fe2db

Either way, GLIBC's memcpy/memmove at the moment most certainly does not
use backwards `rep movs`:
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S;hb=HEAD#l655

as it is very slow.

> The pdf from www.agner.org/optimize may well indicate why some
> copies are unexpectedly slow due to cache access aliasing.

Even in the `4k` aliasing case `rep movsb` seems to stay within a
factor of 2 of optimal whereas the `std` backwards `rep movs` loses
by a factor of 10.

Either way, `4k` aliasing detection is mostly a concern of `memcpy` as
the direction of copy for `memmove` is a correctness question, not
an optimization.


>
> I'm pretty sure that Intel cpu (possibly from Ivy bridge onwards)
> can be persuaded to copy 8 bytes/clock for in-cache data with
> a fairly simple loop that contains 2 reads (maybe misaligned)
> and two writes (so 16 bytes per iteration).
> Extra unrolling just adds extra code top and bottom.
>
> You might want a loop like:
>         1:      mov     0(%rsi, %rcx),%rax
>                 mov     8(%rsi, %rcx),%rdx
>                 mov     %rax, 0(%rdi, %rcx)
>                 mov     %rdx, 8(%rdi, %rcx)
>                 add     $16, %rcx
>                 jnz     1b
>
>         David

The backwards loop already has a 4x unrolled `movq` loop.

>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S
  2021-11-20  0:05         ` Noah Goldstein
@ 2021-12-10 18:35           ` Noah Goldstein
  2022-01-12  3:13             ` Noah Goldstein
  0 siblings, 1 reply; 13+ messages in thread
From: Noah Goldstein @ 2021-12-10 18:35 UTC (permalink / raw)
  To: David Laight; +Cc: tglx, mingo, bp, x86, hpa, luto, linux-kernel

On Fri, Nov 19, 2021 at 6:05 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Fri, Nov 19, 2021 at 4:31 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Noah Goldstein
> > > Sent: 17 November 2021 22:45
> > >
> > > On Wed, Nov 17, 2021 at 4:31 PM David Laight <David.Laight@aculab.com> wrote:
> > > >
> > > > From: Noah Goldstein
> > > > > Sent: 17 November 2021 21:03
> > > > >
> > > > > Add check for "short distance movsb" for forwards FSRM usage and
> > > > > entirely remove backwards 'rep movsq'. Both of these usages hit "slow
> > > > > modes" that are an order of magnitude slower than usual.
> > > > >
> > > > > 'rep movsb' has some noticeable VERY slow modes that the current
> > > > > implementation is either 1) not checking for or 2) intentionally
> > > > > using.
> > > >
> > > > How does this relate to the decision that glibc made a few years
> > > > ago to use backwards 'rep movs' for non-overlapping copies?
> > >
> > > GLIBC doesn't use backwards `rep movs`.  Since the regions are
> > > non-overlapping it just uses forward copy. Backwards `rep movs` is
> > > from setting the direction flag (`std`) and is a very slow byte
> > > copy. For overlapping regions where backwards copy is necessary GLIBC
> > > uses 4x vec copy loop.
> >
> > Try to find this commit 6fb8cbcb58a29fff73eb2101b34caa19a7f88eba
> >
> > Or follow links from https://www.win.tue.nl/~aeb/linux/misc/gcc-semibug.html
> > But I can't find the actual patch.
> >
> > The claims were a massive performance increase for the reverse copy.
> >
>
> I don't think that's referring to optimizations around `rep movs`. It
> appears to be referring to fallout from this patch:
> https://sourceware.org/git/?p=glibc.git;a=commit;h=6fb8cbcb58a29fff73eb2101b34caa19a7f88eba
>
> which broken programs misusing `memcpy` with overlapping regions
> resulting in this fix:
> https://sourceware.org/git/?p=glibc.git;a=commit;h=0354e355014b7bfda32622e0255399d859862fcd
>
> AFAICT support for ERMS was only added around:
> https://sourceware.org/git/?p=glibc.git;a=commit;h=13efa86ece61bf84daca50cab30db1b0902fe2db
>
> Either way GLIBC memcpy/memmove moment most certainly does not
> use backwards `rep movs`:
> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S;hb=HEAD#l655
>
> as it is very slow.
>
> > The pdf from www.agner.org/optimize may well indicate why some
> > copies are unexpectedly slow due to cache access aliasing.
>
> Even in the `4k` aliasing case `rep movsb` seems to stay within a
> factor of 2 of optimal whereas the `std` backwards `rep movs` loses
> by a factor of 10.
>
> Either way, `4k` aliasing detection is mostly a concern of `memcpy` as
> the direction of copy for `memmove` is a correctness question, not
> an optimization.
>
>
> >
> > I'm pretty sure that Intel cpu (possibly from Ivy bridge onwards)
> > can be persuaded to copy 8 bytes/clock for in-cache data with
> > a fairly simple loop that contains 2 reads (maybe misaligned)
> > and two writes (so 16 bytes per iteration).
> > Extra unrolling just adds extra code top and bottom.
> >
> > You might want a loop like:
> >         1:      mov     0(%rsi, %rcx),%rax
> >                 mov     8(%rsi, %rcx),%rdx
> >                 mov     %rax, 0(%rdi, %rcx)
> >                 mov     %rdx, 8(%rdi, %rcx)
> >                 add     $16, %rcx
> >                 jnz     1b
> >
> >         David
>
> The backwards loop already has 4x unrolled `movq` loop.
ping.
>
> >
> > -
> > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S
  2021-12-10 18:35           ` Noah Goldstein
@ 2022-01-12  3:13             ` Noah Goldstein
  2022-02-10  9:08               ` Noah Goldstein
  0 siblings, 1 reply; 13+ messages in thread
From: Noah Goldstein @ 2022-01-12  3:13 UTC (permalink / raw)
  To: David Laight; +Cc: tglx, mingo, bp, x86, hpa, luto, linux-kernel

On Fri, Dec 10, 2021 at 12:35 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Fri, Nov 19, 2021 at 6:05 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Fri, Nov 19, 2021 at 4:31 PM David Laight <David.Laight@aculab.com> wrote:
> > >
> > > From: Noah Goldstein
> > > > Sent: 17 November 2021 22:45
> > > >
> > > > On Wed, Nov 17, 2021 at 4:31 PM David Laight <David.Laight@aculab.com> wrote:
> > > > >
> > > > > From: Noah Goldstein
> > > > > > Sent: 17 November 2021 21:03
> > > > > >
> > > > > > Add check for "short distance movsb" for forwards FSRM usage and
> > > > > > entirely remove backwards 'rep movsq'. Both of these usages hit "slow
> > > > > > modes" that are an order of magnitude slower than usual.
> > > > > >
> > > > > > 'rep movsb' has some noticeable VERY slow modes that the current
> > > > > > implementation is either 1) not checking for or 2) intentionally
> > > > > > using.
> > > > >
> > > > > How does this relate to the decision that glibc made a few years
> > > > > ago to use backwards 'rep movs' for non-overlapping copies?
> > > >
> > > > GLIBC doesn't use backwards `rep movs`.  Since the regions are
> > > > non-overlapping it just uses forward copy. Backwards `rep movs` is
> > > > from setting the direction flag (`std`) and is a very slow byte
> > > > copy. For overlapping regions where backwards copy is necessary GLIBC
> > > > uses 4x vec copy loop.
> > >
> > > Try to find this commit 6fb8cbcb58a29fff73eb2101b34caa19a7f88eba
> > >
> > > Or follow links from https://www.win.tue.nl/~aeb/linux/misc/gcc-semibug.html
> > > But I can't find the actual patch.
> > >
> > > The claims were a massive performance increase for the reverse copy.
> > >
> >
> > I don't think that's referring to optimizations around `rep movs`. It
> > appears to be referring to fallout from this patch:
> > https://sourceware.org/git/?p=glibc.git;a=commit;h=6fb8cbcb58a29fff73eb2101b34caa19a7f88eba
> >
> > which broken programs misusing `memcpy` with overlapping regions
> > resulting in this fix:
> > https://sourceware.org/git/?p=glibc.git;a=commit;h=0354e355014b7bfda32622e0255399d859862fcd
> >
> > AFAICT support for ERMS was only added around:
> > https://sourceware.org/git/?p=glibc.git;a=commit;h=13efa86ece61bf84daca50cab30db1b0902fe2db
> >
> > Either way GLIBC memcpy/memmove moment most certainly does not
> > use backwards `rep movs`:
> > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S;hb=HEAD#l655
> >
> > as it is very slow.
> >
> > > The pdf from www.agner.org/optimize may well indicate why some
> > > copies are unexpectedly slow due to cache access aliasing.
> >
> > Even in the `4k` aliasing case `rep movsb` seems to stay within a
> > factor of 2 of optimal whereas the `std` backwards `rep movs` loses
> > by a factor of 10.
> >
> > Either way, `4k` aliasing detection is mostly a concern of `memcpy` as
> > the direction of copy for `memmove` is a correctness question, not
> > an optimization.
> >
> >
> > >
> > > I'm pretty sure that Intel cpu (possibly from Ivy bridge onwards)
> > > can be persuaded to copy 8 bytes/clock for in-cache data with
> > > a fairly simple loop that contains 2 reads (maybe misaligned)
> > > and two writes (so 16 bytes per iteration).
> > > Extra unrolling just adds extra code top and bottom.
> > >
> > > You might want a loop like:
> > >         1:      mov     0(%rsi, %rcx),%rax
> > >                 mov     8(%rsi, %rcx),%rdx
> > >                 mov     %rax, 0(%rdi, %rcx)
> > >                 mov     %rdx, 8(%rdi, %rcx)
> > >                 add     $16, %rcx
> > >                 jnz     1b
> > >
> > >         David
> >
> > The backwards loop already has 4x unrolled `movq` loop.
> ping.
ping.
> >
> > >
> > > -
> > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > > Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S
  2022-01-12  3:13             ` Noah Goldstein
@ 2022-02-10  9:08               ` Noah Goldstein
  2022-03-16 14:15                 ` Noah Goldstein
  0 siblings, 1 reply; 13+ messages in thread
From: Noah Goldstein @ 2022-02-10  9:08 UTC (permalink / raw)
  To: David Laight; +Cc: tglx, mingo, bp, x86, hpa, luto, linux-kernel

On Tue, Jan 11, 2022 at 9:13 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Fri, Dec 10, 2021 at 12:35 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Fri, Nov 19, 2021 at 6:05 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > On Fri, Nov 19, 2021 at 4:31 PM David Laight <David.Laight@aculab.com> wrote:
> > > >
> > > > From: Noah Goldstein
> > > > > Sent: 17 November 2021 22:45
> > > > >
> > > > > On Wed, Nov 17, 2021 at 4:31 PM David Laight <David.Laight@aculab.com> wrote:
> > > > > >
> > > > > > From: Noah Goldstein
> > > > > > > Sent: 17 November 2021 21:03
> > > > > > >
> > > > > > > Add check for "short distance movsb" for forwards FSRM usage and
> > > > > > > entirely remove backwards 'rep movsq'. Both of these usages hit "slow
> > > > > > > modes" that are an order of magnitude slower than usual.
> > > > > > >
> > > > > > > 'rep movsb' has some noticeable VERY slow modes that the current
> > > > > > > implementation is either 1) not checking for or 2) intentionally
> > > > > > > using.
> > > > > >
> > > > > > How does this relate to the decision that glibc made a few years
> > > > > > ago to use backwards 'rep movs' for non-overlapping copies?
> > > > >
> > > > > GLIBC doesn't use backwards `rep movs`.  Since the regions are
> > > > > non-overlapping it just uses forward copy. Backwards `rep movs` is
> > > > > from setting the direction flag (`std`) and is a very slow byte
> > > > > copy. For overlapping regions where backwards copy is necessary GLIBC
> > > > > uses 4x vec copy loop.
> > > >
> > > > Try to find this commit 6fb8cbcb58a29fff73eb2101b34caa19a7f88eba
> > > >
> > > > Or follow links from https://www.win.tue.nl/~aeb/linux/misc/gcc-semibug.html
> > > > But I can't find the actual patch.
> > > >
> > > > The claims were a massive performance increase for the reverse copy.
> > > >
> > >
> > > I don't think that's referring to optimizations around `rep movs`. It
> > > appears to be referring to fallout from this patch:
> > > https://sourceware.org/git/?p=glibc.git;a=commit;h=6fb8cbcb58a29fff73eb2101b34caa19a7f88eba
> > >
> > > which broken programs misusing `memcpy` with overlapping regions
> > > resulting in this fix:
> > > https://sourceware.org/git/?p=glibc.git;a=commit;h=0354e355014b7bfda32622e0255399d859862fcd
> > >
> > > AFAICT support for ERMS was only added around:
> > > https://sourceware.org/git/?p=glibc.git;a=commit;h=13efa86ece61bf84daca50cab30db1b0902fe2db
> > >
> > > Either way GLIBC memcpy/memmove moment most certainly does not
> > > use backwards `rep movs`:
> > > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S;hb=HEAD#l655
> > >
> > > as it is very slow.
> > >
> > > > The pdf from www.agner.org/optimize may well indicate why some
> > > > copies are unexpectedly slow due to cache access aliasing.
> > >
> > > Even in the `4k` aliasing case `rep movsb` seems to stay within a
> > > factor of 2 of optimal whereas the `std` backwards `rep movs` loses
> > > by a factor of 10.
> > >
> > > Either way, `4k` aliasing detection is mostly a concern of `memcpy` as
> > > the direction of copy for `memmove` is a correctness question, not
> > > an optimization.
> > >
> > >
> > > >
> > > > I'm pretty sure that Intel cpu (possibly from Ivy bridge onwards)
> > > > can be persuaded to copy 8 bytes/clock for in-cache data with
> > > > a fairly simple loop that contains 2 reads (maybe misaligned)
> > > > and two writes (so 16 bytes per iteration).
> > > > Extra unrolling just adds extra code top and bottom.
> > > >
> > > > You might want a loop like:
> > > >         1:      mov     0(%rsi, %rcx),%rax
> > > >                 mov     8(%rsi, %rcx),%rdx
> > > >                 mov     %rax, 0(%rdi, %rcx)
> > > >                 mov     %rdx, 8(%rdi, %rcx)
> > > >                 add     $16, %rcx
> > > >                 jnz     1b
> > > >
> > > >         David
> > >
> > > The backwards loop already has 4x unrolled `movq` loop.
> > ping.
> ping.
ping3.
> > >
> > > >
> > > > -
> > > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > > > Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S
  2022-02-10  9:08               ` Noah Goldstein
@ 2022-03-16 14:15                 ` Noah Goldstein
  0 siblings, 0 replies; 13+ messages in thread
From: Noah Goldstein @ 2022-03-16 14:15 UTC (permalink / raw)
  To: David Laight; +Cc: tglx, mingo, bp, x86, hpa, luto, linux-kernel

On Thu, Feb 10, 2022 at 3:08 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Tue, Jan 11, 2022 at 9:13 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Fri, Dec 10, 2021 at 12:35 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > On Fri, Nov 19, 2021 at 6:05 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > >
> > > > On Fri, Nov 19, 2021 at 4:31 PM David Laight <David.Laight@aculab.com> wrote:
> > > > >
> > > > > From: Noah Goldstein
> > > > > > Sent: 17 November 2021 22:45
> > > > > >
> > > > > > On Wed, Nov 17, 2021 at 4:31 PM David Laight <David.Laight@aculab.com> wrote:
> > > > > > >
> > > > > > > From: Noah Goldstein
> > > > > > > > Sent: 17 November 2021 21:03
> > > > > > > >
> > > > > > > > Add check for "short distance movsb" for forwards FSRM usage and
> > > > > > > > entirely remove backwards 'rep movsq'. Both of these usages hit "slow
> > > > > > > > modes" that are an order of magnitude slower than usual.
> > > > > > > >
> > > > > > > > 'rep movsb' has some noticeable VERY slow modes that the current
> > > > > > > > implementation is either 1) not checking for or 2) intentionally
> > > > > > > > using.
> > > > > > >
> > > > > > > How does this relate to the decision that glibc made a few years
> > > > > > > ago to use backwards 'rep movs' for non-overlapping copies?
> > > > > >
> > > > > > GLIBC doesn't use backwards `rep movs`.  Since the regions are
> > > > > > non-overlapping it just uses forward copy. Backwards `rep movs` is
> > > > > > from setting the direction flag (`std`) and is a very slow byte
> > > > > > copy. For overlapping regions where backwards copy is necessary GLIBC
> > > > > > uses 4x vec copy loop.
> > > > >
> > > > > Try to find this commit 6fb8cbcb58a29fff73eb2101b34caa19a7f88eba
> > > > >
> > > > > Or follow links from https://www.win.tue.nl/~aeb/linux/misc/gcc-semibug.html
> > > > > But I can't find the actual patch.
> > > > >
> > > > > The claims were a massive performance increase for the reverse copy.
> > > > >
> > > >
> > > > I don't think that's referring to optimizations around `rep movs`. It
> > > > appears to be referring to fallout from this patch:
> > > > https://sourceware.org/git/?p=glibc.git;a=commit;h=6fb8cbcb58a29fff73eb2101b34caa19a7f88eba
> > > >
> > > > which broken programs misusing `memcpy` with overlapping regions
> > > > resulting in this fix:
> > > > https://sourceware.org/git/?p=glibc.git;a=commit;h=0354e355014b7bfda32622e0255399d859862fcd
> > > >
> > > > AFAICT support for ERMS was only added around:
> > > > https://sourceware.org/git/?p=glibc.git;a=commit;h=13efa86ece61bf84daca50cab30db1b0902fe2db
> > > >
> > > > Either way GLIBC memcpy/memmove moment most certainly does not
> > > > use backwards `rep movs`:
> > > > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S;hb=HEAD#l655
> > > >
> > > > as it is very slow.
> > > >
> > > > > The pdf from www.agner.org/optimize may well indicate why some
> > > > > copies are unexpectedly slow due to cache access aliasing.
> > > >
> > > > Even in the `4k` aliasing case `rep movsb` seems to stay within a
> > > > factor of 2 of optimal whereas the `std` backwards `rep movs` loses
> > > > by a factor of 10.
> > > >
> > > > Either way, `4k` aliasing detection is mostly a concern of `memcpy` as
> > > > the direction of copy for `memmove` is a correctness question, not
> > > > an optimization.
> > > >
> > > >
> > > > >
> > > > > I'm pretty sure that Intel cpu (possibly from Ivy bridge onwards)
> > > > > can be persuaded to copy 8 bytes/clock for in-cache data with
> > > > > a fairly simple loop that contains 2 reads (maybe misaligned)
> > > > > and two writes (so 16 bytes per iteration).
> > > > > Extra unrolling just adds extra code top and bottom.
> > > > >
> > > > > You might want a loop like:
> > > > >         1:      mov     0(%rsi, %rcx),%rax
> > > > >                 mov     8(%rsi, %rcx),%rdx
> > > > >                 mov     %rax, 0(%rdi, %rcx)
> > > > >                 mov     %rdx, 8(%rdi, %rcx)
> > > > >                 add     $16, %rcx
> > > > >                 jnz     1b
> > > > >
> > > > >         David
> > > >
> > > > The backwards loop already has 4x unrolled `movq` loop.
> > > ping.
> > ping.
> ping3.

Hi,

Anything I'm missing to get this looked at?

> > > >
> > > > >
> > > > > -
> > > > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > > > > Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S
  2021-11-17 21:02 ` [PATCH v4] " Noah Goldstein
  2021-11-17 22:31   ` David Laight
@ 2022-03-16 16:23   ` Borislav Petkov
  1 sibling, 0 replies; 13+ messages in thread
From: Borislav Petkov @ 2022-03-16 16:23 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: tglx, mingo, x86, hpa, luto, linux-kernel

On Wed, Nov 17, 2021 at 03:02:45PM -0600, Noah Goldstein wrote:
> Add check for "short distance movsb" for forwards FSRM usage and
> entirely remove backwards 'rep movsq'. Both of these usages hit "slow
> modes" that are an order of magnitude slower than usual.
> 
> 'rep movsb' has some noticeable VERY slow modes that the current
> implementation is either 1) not checking for or 2) intentionally
> using.
> 
> All times are in cycles and measuring the throughput of copying 1024
> bytes.

All these claims need to be proven by

 - real benchmarks - not a microbenchmark - where it shows that
 modifications like that are not "in the noise". Others should be able
 to verify those results too.

 - on a bunch of CPUs from different vendors to verify that they don't
   cause performance regressions on any.

HTH.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-03-16 16:23 UTC | newest]

Thread overview: 13+ messages
2021-11-01  4:49 [PATCH v1] arch/x86: Improve 'rep movs{b|q}' usage in memmove_64.S Noah Goldstein
2021-11-01 18:59 ` [PATCH v2] " Noah Goldstein
2021-11-02 23:15 ` [PATCH v3] " Noah Goldstein
2021-11-17 21:02 ` [PATCH v4] " Noah Goldstein
2021-11-17 22:31   ` David Laight
2021-11-17 22:45     ` Noah Goldstein
2021-11-19 22:31       ` David Laight
2021-11-20  0:05         ` Noah Goldstein
2021-12-10 18:35           ` Noah Goldstein
2022-01-12  3:13             ` Noah Goldstein
2022-02-10  9:08               ` Noah Goldstein
2022-03-16 14:15                 ` Noah Goldstein
2022-03-16 16:23   ` Borislav Petkov
