* [PATCH] x86/asm/64: Align start of __clear_user() loop to 16-bytes
@ 2020-06-18 10:20 Matt Fleming
2020-06-18 10:48 ` David Laight
2020-06-19 16:40 ` [tip: x86/urgent] " tip-bot2 for Matt Fleming
0 siblings, 2 replies; 6+ messages in thread
From: Matt Fleming @ 2020-06-18 10:20 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Thomas Gleixner
Cc: Alexey Dobriyan, linux-kernel, Matt Fleming, Grimm, Jon, Kumar,
Venkataramanan, Jan Kara, stable
x86 CPUs can suffer severe performance drops if a tight loop, such as
the ones in __clear_user(), straddles a 16-byte instruction fetch
window, or worse, a 64-byte cacheline. This issue was discovered in the
SUSE kernel with the following commit,
1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")
which increased the code object size from 10 bytes to 15 bytes and
caused the 8-byte copy loop in __clear_user() to be split across a
64-byte cacheline.
Aligning the start of the loop to 16-bytes makes this fit neatly inside
a single instruction fetch window again and restores the performance of
__clear_user() which is used heavily when reading from /dev/zero.
Here are some numbers from running libmicro's read_z* and pread_z*
microbenchmarks which read from /dev/zero:
Zen 1 (Naples)
libmicro-file
5.7.0-rc6 5.7.0-rc6 5.7.0-rc6
revert-1153933703d9+ align16+
Time mean95-pread_z100k 9.9195 ( 0.00%) 5.9856 ( 39.66%) 5.9938 ( 39.58%)
Time mean95-pread_z10k 1.1378 ( 0.00%) 0.7450 ( 34.52%) 0.7467 ( 34.38%)
Time mean95-pread_z1k 0.2623 ( 0.00%) 0.2251 ( 14.18%) 0.2252 ( 14.15%)
Time mean95-pread_zw100k 9.9974 ( 0.00%) 6.0648 ( 39.34%) 6.0756 ( 39.23%)
Time mean95-read_z100k 9.8940 ( 0.00%) 5.9885 ( 39.47%) 5.9994 ( 39.36%)
Time mean95-read_z10k 1.1394 ( 0.00%) 0.7483 ( 34.33%) 0.7482 ( 34.33%)
Note that this doesn't affect Haswell or Broadwell microarchitectures
which seem to avoid the alignment issue by executing the loop straight
out of the Loop Stream Detector (verified using perf events).
Fixes: 1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")
Cc: "Grimm, Jon" <Jon.Grimm@amd.com>
Cc: "Kumar, Venkataramanan" <Venkataramanan.Kumar@amd.com>
CC: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org> # v4.19+
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
---
arch/x86/lib/usercopy_64.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index fff28c6f73a2..b0dfac3d3df7 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -24,6 +24,7 @@ unsigned long __clear_user(void __user *addr, unsigned long size)
asm volatile(
" testq %[size8],%[size8]\n"
" jz 4f\n"
+ " .align 16\n"
"0: movq $0,(%[dst])\n"
" addq $8,%[dst]\n"
" decl %%ecx ; jnz 0b\n"
--
2.17.1
* RE: [PATCH] x86/asm/64: Align start of __clear_user() loop to 16-bytes
2020-06-18 10:20 [PATCH] x86/asm/64: Align start of __clear_user() loop to 16-bytes Matt Fleming
@ 2020-06-18 10:48 ` David Laight
2020-06-18 13:16 ` Alexey Dobriyan
2020-06-19 16:40 ` [tip: x86/urgent] " tip-bot2 for Matt Fleming
1 sibling, 1 reply; 6+ messages in thread
From: David Laight @ 2020-06-18 10:48 UTC (permalink / raw)
To: 'Matt Fleming', Ingo Molnar, Peter Zijlstra, Thomas Gleixner
Cc: Alexey Dobriyan, linux-kernel, Grimm, Jon, Kumar, Venkataramanan,
Jan Kara, stable
From: Matt Fleming
> Sent: 18 June 2020 11:20
> x86 CPUs can suffer severe performance drops if a tight loop, such as
> the ones in __clear_user(), straddles a 16-byte instruction fetch
> window, or worse, a 64-byte cacheline. This issue was discovered in the
> SUSE kernel with the following commit,
>
> 1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")
>
> which increased the code object size from 10 bytes to 15 bytes and
> caused the 8-byte copy loop in __clear_user() to be split across a
> 64-byte cacheline.
>
> Aligning the start of the loop to 16-bytes makes this fit neatly inside
> a single instruction fetch window again and restores the performance of
> __clear_user() which is used heavily when reading from /dev/zero.
>
> Here are some numbers from running libmicro's read_z* and pread_z*
> microbenchmarks which read from /dev/zero:
>
> Zen 1 (Naples)
>
> libmicro-file
> 5.7.0-rc6 5.7.0-rc6 5.7.0-rc6
> revert-1153933703d9+ align16+
> Time mean95-pread_z100k 9.9195 ( 0.00%) 5.9856 ( 39.66%) 5.9938 ( 39.58%)
> Time mean95-pread_z10k 1.1378 ( 0.00%) 0.7450 ( 34.52%) 0.7467 ( 34.38%)
> Time mean95-pread_z1k 0.2623 ( 0.00%) 0.2251 ( 14.18%) 0.2252 ( 14.15%)
> Time mean95-pread_zw100k 9.9974 ( 0.00%) 6.0648 ( 39.34%) 6.0756 ( 39.23%)
> Time mean95-read_z100k 9.8940 ( 0.00%) 5.9885 ( 39.47%) 5.9994 ( 39.36%)
> Time mean95-read_z10k 1.1394 ( 0.00%) 0.7483 ( 34.33%) 0.7482 ( 34.33%)
>
> Note that this doesn't affect Haswell or Broadwell microarchitectures
> which seem to avoid the alignment issue by executing the loop straight
> out of the Loop Stream Detector (verified using perf events).
Which CPU was affected?
At least one source (www.agner.org/optimize) implies that both Ivy
Bridge and Sandy Bridge have uop caches which (if I've read it
correctly) mean the loop shouldn't be affected by the alignment.
> diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> index fff28c6f73a2..b0dfac3d3df7 100644
> --- a/arch/x86/lib/usercopy_64.c
> +++ b/arch/x86/lib/usercopy_64.c
> @@ -24,6 +24,7 @@ unsigned long __clear_user(void __user *addr, unsigned long size)
> asm volatile(
> " testq %[size8],%[size8]\n"
> " jz 4f\n"
> + " .align 16\n"
> "0: movq $0,(%[dst])\n"
> " addq $8,%[dst]\n"
> " decl %%ecx ; jnz 0b\n"
You can do better than that loop.
Change 'dst' to point to the end of the buffer, negate the count
and divide by 8 and you get:
"0: movq $0,(%[dst],%%ecx,8)\n"
" add $1,%%ecx\n"
" jnz 0b\n"
which might run at one iteration per clock, especially on CPUs that
pair the add and jnz into a single uop.
(You need to use add, not inc.)
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
* Re: [PATCH] x86/asm/64: Align start of __clear_user() loop to 16-bytes
2020-06-18 10:48 ` David Laight
@ 2020-06-18 13:16 ` Alexey Dobriyan
2020-06-18 16:39 ` David Laight
0 siblings, 1 reply; 6+ messages in thread
From: Alexey Dobriyan @ 2020-06-18 13:16 UTC (permalink / raw)
To: David Laight
Cc: 'Matt Fleming',
Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel,
Grimm, Jon, Kumar, Venkataramanan, Jan Kara, stable
On Thu, Jun 18, 2020 at 10:48:05AM +0000, David Laight wrote:
> From: Matt Fleming
> > Sent: 18 June 2020 11:20
> > x86 CPUs can suffer severe performance drops if a tight loop, such as
> > the ones in __clear_user(), straddles a 16-byte instruction fetch
> > window, or worse, a 64-byte cacheline. This issue was discovered in the
> > SUSE kernel with the following commit,
> >
> > 1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")
> >
> > which increased the code object size from 10 bytes to 15 bytes and
> > caused the 8-byte copy loop in __clear_user() to be split across a
> > 64-byte cacheline.
> >
> > Aligning the start of the loop to 16-bytes makes this fit neatly inside
> > a single instruction fetch window again and restores the performance of
> > __clear_user() which is used heavily when reading from /dev/zero.
> >
> > Here are some numbers from running libmicro's read_z* and pread_z*
> > microbenchmarks which read from /dev/zero:
> >
> > Zen 1 (Naples)
> >
> > libmicro-file
> > 5.7.0-rc6 5.7.0-rc6 5.7.0-rc6
> > revert-1153933703d9+ align16+
> > Time mean95-pread_z100k 9.9195 ( 0.00%) 5.9856 ( 39.66%) 5.9938 ( 39.58%)
> > Time mean95-pread_z10k 1.1378 ( 0.00%) 0.7450 ( 34.52%) 0.7467 ( 34.38%)
> > Time mean95-pread_z1k 0.2623 ( 0.00%) 0.2251 ( 14.18%) 0.2252 ( 14.15%)
> > Time mean95-pread_zw100k 9.9974 ( 0.00%) 6.0648 ( 39.34%) 6.0756 ( 39.23%)
> > Time mean95-read_z100k 9.8940 ( 0.00%) 5.9885 ( 39.47%) 5.9994 ( 39.36%)
> > Time mean95-read_z10k 1.1394 ( 0.00%) 0.7483 ( 34.33%) 0.7482 ( 34.33%)
> >
> > Note that this doesn't affect Haswell or Broadwell microarchitectures
> > which seem to avoid the alignment issue by executing the loop straight
> > out of the Loop Stream Detector (verified using perf events).
>
> Which CPU was affected?
> At least one source (www.agner.org/optimize) implies that both Ivy
> Bridge and Sandy Bridge have uop caches which (if I've read it
> correctly) mean the loop shouldn't be affected by the alignment.
>
> > diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> > index fff28c6f73a2..b0dfac3d3df7 100644
> > --- a/arch/x86/lib/usercopy_64.c
> > +++ b/arch/x86/lib/usercopy_64.c
> > @@ -24,6 +24,7 @@ unsigned long __clear_user(void __user *addr, unsigned long size)
> > asm volatile(
> > " testq %[size8],%[size8]\n"
> > " jz 4f\n"
> > + " .align 16\n"
> > "0: movq $0,(%[dst])\n"
> > " addq $8,%[dst]\n"
> > " decl %%ecx ; jnz 0b\n"
>
> You can do better than that loop.
> Change 'dst' to point to the end of the buffer, negate the count
> and divide by 8 and you get:
> "0: movq $0,(%[dst],%%ecx,8)\n"
> " add $1,%%ecx\n"
> " jnz 0b\n"
> which might run at one iteration per clock, especially on CPUs that
> pair the add and jnz into a single uop.
> (You need to use add, not inc.)
/dev/zero should probably use REP STOSB etc just like everything else.
* RE: [PATCH] x86/asm/64: Align start of __clear_user() loop to 16-bytes
2020-06-18 13:16 ` Alexey Dobriyan
@ 2020-06-18 16:39 ` David Laight
2020-06-18 21:01 ` Alexey Dobriyan
0 siblings, 1 reply; 6+ messages in thread
From: David Laight @ 2020-06-18 16:39 UTC (permalink / raw)
To: 'Alexey Dobriyan'
Cc: 'Matt Fleming',
Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel,
Grimm, Jon, Kumar, Venkataramanan, Jan Kara, stable
From: Alexey Dobriyan
> Sent: 18 June 2020 14:17
...
> > > diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> > > index fff28c6f73a2..b0dfac3d3df7 100644
> > > --- a/arch/x86/lib/usercopy_64.c
> > > +++ b/arch/x86/lib/usercopy_64.c
> > > @@ -24,6 +24,7 @@ unsigned long __clear_user(void __user *addr, unsigned long size)
> > > asm volatile(
> > > " testq %[size8],%[size8]\n"
> > > " jz 4f\n"
> > > + " .align 16\n"
> > > "0: movq $0,(%[dst])\n"
> > > " addq $8,%[dst]\n"
> > > " decl %%ecx ; jnz 0b\n"
> >
> > You can do better than that loop.
> > Change 'dst' to point to the end of the buffer, negate the count
> > and divide by 8 and you get:
> > "0: movq $0,(%[dst],%%ecx,8)\n"
> > " add $1,%%ecx\n"
> > " jnz 0b\n"
> > which might run at one iteration per clock, especially on CPUs that
> > pair the add and jnz into a single uop.
> > (You need to use add, not inc.)
>
> /dev/zero should probably use REP STOSB etc just like everything else.
Almost certainly it shouldn't, and neither should anything else.
Potentially it could use whatever memset() is patched to.
That MIGHT be 'rep stos' on some CPU variants, but in general
it is slow.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
* Re: [PATCH] x86/asm/64: Align start of __clear_user() loop to 16-bytes
2020-06-18 16:39 ` David Laight
@ 2020-06-18 21:01 ` Alexey Dobriyan
0 siblings, 0 replies; 6+ messages in thread
From: Alexey Dobriyan @ 2020-06-18 21:01 UTC (permalink / raw)
To: David Laight
Cc: 'Matt Fleming',
Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel,
Grimm, Jon, Kumar, Venkataramanan, Jan Kara, stable
On Thu, Jun 18, 2020 at 04:39:35PM +0000, David Laight wrote:
> From: Alexey Dobriyan
> > Sent: 18 June 2020 14:17
> ...
> > > > diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> > > > index fff28c6f73a2..b0dfac3d3df7 100644
> > > > --- a/arch/x86/lib/usercopy_64.c
> > > > +++ b/arch/x86/lib/usercopy_64.c
> > > > @@ -24,6 +24,7 @@ unsigned long __clear_user(void __user *addr, unsigned long size)
> > > > asm volatile(
> > > > " testq %[size8],%[size8]\n"
> > > > " jz 4f\n"
> > > > + " .align 16\n"
> > > > "0: movq $0,(%[dst])\n"
> > > > " addq $8,%[dst]\n"
> > > > " decl %%ecx ; jnz 0b\n"
> > >
> > > You can do better than that loop.
> > > Change 'dst' to point to the end of the buffer, negate the count
> > > and divide by 8 and you get:
> > > "0: movq $0,(%[dst],%%ecx,8)\n"
> > > " add $1,%%ecx\n"
> > > " jnz 0b\n"
> > > which might run at one iteration per clock, especially on CPUs that
> > > pair the add and jnz into a single uop.
> > > (You need to use add, not inc.)
> >
> > /dev/zero should probably use REP STOSB etc just like everything else.
>
> Almost certainly it shouldn't, and neither should anything else.
> Potentially it could use whatever memset() is patched to.
> That MIGHT be 'rep stos' on some cpu variants, but in general
> it is slow.
Yes, that's what I meant: alternatives choosing the REP variant.
memset loops are so 21st century.
* [tip: x86/urgent] x86/asm/64: Align start of __clear_user() loop to 16-bytes
2020-06-18 10:20 [PATCH] x86/asm/64: Align start of __clear_user() loop to 16-bytes Matt Fleming
2020-06-18 10:48 ` David Laight
@ 2020-06-19 16:40 ` tip-bot2 for Matt Fleming
1 sibling, 0 replies; 6+ messages in thread
From: tip-bot2 for Matt Fleming @ 2020-06-19 16:40 UTC (permalink / raw)
To: linux-tip-commits; +Cc: Matt Fleming, Borislav Petkov, stable, x86, LKML
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: bb5570ad3b54e7930997aec76ab68256d5236d94
Gitweb: https://git.kernel.org/tip/bb5570ad3b54e7930997aec76ab68256d5236d94
Author: Matt Fleming <matt@codeblueprint.co.uk>
AuthorDate: Thu, 18 Jun 2020 11:20:02 +01:00
Committer: Borislav Petkov <bp@suse.de>
CommitterDate: Fri, 19 Jun 2020 18:32:11 +02:00
x86/asm/64: Align start of __clear_user() loop to 16-bytes
x86 CPUs can suffer severe performance drops if a tight loop, such as
the ones in __clear_user(), straddles a 16-byte instruction fetch
window, or worse, a 64-byte cacheline. This issue was discovered in the
SUSE kernel with the following commit,
1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")
which increased the code object size from 10 bytes to 15 bytes and
caused the 8-byte copy loop in __clear_user() to be split across a
64-byte cacheline.
Aligning the start of the loop to 16-bytes makes this fit neatly inside
a single instruction fetch window again and restores the performance of
__clear_user() which is used heavily when reading from /dev/zero.
Here are some numbers from running libmicro's read_z* and pread_z*
microbenchmarks which read from /dev/zero:
Zen 1 (Naples)
libmicro-file
5.7.0-rc6 5.7.0-rc6 5.7.0-rc6
revert-1153933703d9+ align16+
Time mean95-pread_z100k 9.9195 ( 0.00%) 5.9856 ( 39.66%) 5.9938 ( 39.58%)
Time mean95-pread_z10k 1.1378 ( 0.00%) 0.7450 ( 34.52%) 0.7467 ( 34.38%)
Time mean95-pread_z1k 0.2623 ( 0.00%) 0.2251 ( 14.18%) 0.2252 ( 14.15%)
Time mean95-pread_zw100k 9.9974 ( 0.00%) 6.0648 ( 39.34%) 6.0756 ( 39.23%)
Time mean95-read_z100k 9.8940 ( 0.00%) 5.9885 ( 39.47%) 5.9994 ( 39.36%)
Time mean95-read_z10k 1.1394 ( 0.00%) 0.7483 ( 34.33%) 0.7482 ( 34.33%)
Note that this doesn't affect Haswell or Broadwell microarchitectures
which seem to avoid the alignment issue by executing the loop straight
out of the Loop Stream Detector (verified using perf events).
Fixes: 1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # v4.19+
Link: https://lkml.kernel.org/r/20200618102002.30034-1-matt@codeblueprint.co.uk
---
arch/x86/lib/usercopy_64.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index fff28c6..b0dfac3 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -24,6 +24,7 @@ unsigned long __clear_user(void __user *addr, unsigned long size)
asm volatile(
" testq %[size8],%[size8]\n"
" jz 4f\n"
+ " .align 16\n"
"0: movq $0,(%[dst])\n"
" addq $8,%[dst]\n"
" decl %%ecx ; jnz 0b\n"