* [PATCH 0/8] Use uncached writes while clearing gigantic pages
@ 2020-10-14 8:32 Ankur Arora
2020-10-14 8:32 ` [PATCH 1/8] x86/cpuid: add X86_FEATURE_NT_GOOD Ankur Arora
` (7 more replies)
0 siblings, 8 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-14 8:32 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kirill, mhocko, boris.ostrovsky, konrad.wilk, Ankur Arora
This series adds clear_page_nt(), a non-temporal MOV (MOVNTI) based
clear_page().
The immediate use case is to speed up creation of large (~2TB) guest
VMs. Memory for these guests is allocated via huge/gigantic pages which
are faulted in early.
The intent behind using non-temporal writes is to minimize allocation of
unnecessary cachelines. This helps in minimizing cache pollution, and
potentially also speeds up zeroing of large extents.
That said, uncached writes are not always a good idea, as can be seen
in these 'perf bench mem memset' numbers comparing clear_page_erms() and
clear_page_nt():
Intel Broadwellx:
x86-64-stosb (5 runs) x86-64-movnt (5 runs) speedup
----------------------- ----------------------- -------
size BW ( pstdev) BW ( pstdev)
16MB 17.35 GB/s ( +- 9.27%) 11.83 GB/s ( +- 0.19%) -31.81%
128MB 5.31 GB/s ( +- 0.13%) 11.72 GB/s ( +- 0.44%) +121.84%
AMD Rome:
x86-64-stosq (5 runs) x86-64-movnt (5 runs) speedup
----------------------- ----------------------- -------
size BW ( pstdev) BW ( pstdev)
16MB 15.39 GB/s ( +- 9.14%) 14.56 GB/s ( +-19.43%) -5.39%
128MB 11.04 GB/s ( +- 4.87%) 14.49 GB/s ( +-13.22%) +31.25%
Intel Skylakex:
x86-64-stosb (5 runs) x86-64-movnt (5 runs) speedup
----------------------- ----------------------- -------
size BW ( pstdev) BW ( pstdev)
16MB 20.38 GB/s ( +- 2.58%) 6.25 GB/s ( +- 0.41%) -69.28%
128MB 6.52 GB/s ( +- 0.14%) 6.31 GB/s ( +- 0.47%) -3.22%
(All of the machines in these tests had a minimum of 25MB L3 cache per
socket.)
There are two performance issues:
- uncached writes typically perform better only for regions around or
larger than ~LLC-size.
- MOVNTI does not perform well on all microarchitectures.
We handle the first issue by only using clear_page_nt() for GB pages.
That leaves out zeroing of 2MB pages -- a size large enough that
uncached writes might have meaningful cache benefits, but small enough
that they would typically end up being slower.
We can handle a subset of the 2MB case -- mmaps with MAP_POPULATE -- by
means of an uncached-or-cached hint decided based on a threshold size. This
would apply to maps backed by any page-size.
This case is not handled in this series -- I wanted to sanity check the
high level approach before attempting that.
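The threshold policy hinted at above could look roughly like the sketch
below. This is purely illustrative -- the threshold value, names
(want_uncached(), clear_extent()), and the stand-in clearing paths are
hypothetical, not part of this series:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for the cached and uncached clearing paths. */
static void clear_extent_cached(void *p, size_t len)   { memset(p, 0, len); }
static void clear_extent_uncached(void *p, size_t len) { memset(p, 0, len); }

/*
 * Hypothetical threshold: uncached writes only tend to win once the
 * extent is around LLC-size or larger (~25MB on the machines above).
 */
#define UNCACHED_THRESHOLD ((size_t)25 << 20)

static int want_uncached(size_t extent)
{
	return extent >= UNCACHED_THRESHOLD;
}

static void clear_extent(void *p, size_t extent)
{
	if (want_uncached(extent))
		clear_extent_uncached(p, extent);
	else
		clear_extent_cached(p, extent);
}
```

With such a hint, a 2MB-backed MAP_POPULATE region would pick the cached
path, while a multi-GB one would pick the uncached path.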
We handle the second issue by adding a synthetic CPU feature,
X86_FEATURE_NT_GOOD, which is only enabled on microarchitectures where
MOVNTI performs well.
(Relatedly, I thought I had independently decided to use ALTERNATIVES
to deal with this, but more likely I had just internalized it from this
discussion:
https://lore.kernel.org/linux-mm/20200316101856.GH11482@dhcp22.suse.cz/#t)
Accordingly this series enables X86_FEATURE_NT_GOOD for Intel Broadwellx
and AMD Rome. (In my testing, the performance was also good for some
pre-production models but this series leaves them out.)
Please review.
Thanks
Ankur
Ankur Arora (8):
x86/cpuid: add X86_FEATURE_NT_GOOD
x86/asm: add memset_movnti()
perf bench: add memset_movnti()
x86/asm: add clear_page_nt()
x86/clear_page: add clear_page_uncached()
mm, clear_huge_page: use clear_page_uncached() for gigantic pages
x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx
x86/cpu/amd: enable X86_FEATURE_NT_GOOD on AMD Zen
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/page.h | 6 +++
arch/x86/include/asm/page_32.h | 9 ++++
arch/x86/include/asm/page_64.h | 15 ++++++
arch/x86/kernel/cpu/amd.c | 3 ++
arch/x86/kernel/cpu/intel.c | 2 +
arch/x86/lib/clear_page_64.S | 26 +++++++++++
arch/x86/lib/memset_64.S | 68 ++++++++++++++++------------
include/asm-generic/page.h | 3 ++
include/linux/highmem.h | 10 ++++
mm/memory.c | 3 +-
tools/arch/x86/lib/memset_64.S | 68 ++++++++++++++++------------
tools/perf/bench/mem-memset-x86-64-asm-def.h | 6 ++-
13 files changed, 158 insertions(+), 62 deletions(-)
--
2.9.3
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 1/8] x86/cpuid: add X86_FEATURE_NT_GOOD
2020-10-14 8:32 [PATCH 0/8] Use uncached writes while clearing gigantic pages Ankur Arora
@ 2020-10-14 8:32 ` Ankur Arora
2020-10-14 8:32 ` [PATCH 2/8] x86/asm: add memset_movnti() Ankur Arora
` (6 subsequent siblings)
7 siblings, 0 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-14 8:32 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kirill, mhocko, boris.ostrovsky, konrad.wilk, Ankur Arora,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
H. Peter Anvin, Tony Luck, Pawan Gupta, Josh Poimboeuf,
Peter Zijlstra (Intel),
Mark Gross, Kim Phillips, Vineela Tummalapalli, Wei Huang
Add a synthetic CPU feature, to be enabled on microarchitectures with a
performant non-temporal MOV (MOVNTI) instruction.
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
arch/x86/include/asm/cpufeatures.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 7b0afd5e6c57..8bae38240346 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -289,6 +289,7 @@
#define X86_FEATURE_FENCE_SWAPGS_KERNEL (11*32+ 5) /* "" LFENCE in kernel entry SWAPGS path */
#define X86_FEATURE_SPLIT_LOCK_DETECT (11*32+ 6) /* #AC for split lock */
#define X86_FEATURE_PER_THREAD_MBA (11*32+ 7) /* "" Per-thread Memory Bandwidth Allocation */
+#define X86_FEATURE_NT_GOOD (11*32+ 8) /* Non-temporal instructions perform well */
/* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
#define X86_FEATURE_AVX512_BF16 (12*32+ 5) /* AVX512 BFLOAT16 instructions */
--
2.9.3
* [PATCH 2/8] x86/asm: add memset_movnti()
2020-10-14 8:32 [PATCH 0/8] Use uncached writes while clearing gigantic pages Ankur Arora
2020-10-14 8:32 ` [PATCH 1/8] x86/cpuid: add X86_FEATURE_NT_GOOD Ankur Arora
@ 2020-10-14 8:32 ` Ankur Arora
2020-10-14 8:32 ` [PATCH 3/8] perf bench: " Ankur Arora
` (5 subsequent siblings)
7 siblings, 0 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-14 8:32 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kirill, mhocko, boris.ostrovsky, konrad.wilk, Ankur Arora,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
H. Peter Anvin, Jiri Slaby, Juergen Gross
Add a MOVNTI based implementation of memset().
memset_orig() and memset_movnti() only differ in the opcode used in the
inner loop, so move the memset_orig() logic into a macro, which gets
expanded into memset_movq() and memset_movnti().
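The same trick, expressed as a C analogue: one body, expanded twice with
different store operations. This is only an illustration of the macro
idea -- the kernel does it with a GNU as .macro, and here a plain store
stands in for MOVNTI:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain store, and a stand-in for the non-temporal store. */
#define STORE_MOVQ(p, v)   (*(p) = (v))
#define STORE_MOVNTI(p, v) (*(p) = (v))	/* movnti in the real code */

/*
 * Analogue of the MEMSET_MOV assembler macro: the shared loop body is
 * written once and instantiated per store operation. (The NT variant
 * additionally ends with an sfence in the real code.)
 */
#define DEFINE_MEMSET(name, STORE)				\
static void name(uint64_t *dst, uint64_t val, size_t qwords)	\
{								\
	for (size_t i = 0; i < qwords; i++)			\
		STORE(dst + i, val);				\
}

DEFINE_MEMSET(memset_movq_sketch, STORE_MOVQ)
DEFINE_MEMSET(memset_movnti_sketch, STORE_MOVNTI)
```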
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
arch/x86/lib/memset_64.S | 68 +++++++++++++++++++++++++++---------------------
1 file changed, 38 insertions(+), 30 deletions(-)
diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
index 9ff15ee404a4..79703cc04b6a 100644
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -27,7 +27,7 @@ SYM_FUNC_START(__memset)
*
* Otherwise, use original memset function.
*/
- ALTERNATIVE_2 "jmp memset_orig", "", X86_FEATURE_REP_GOOD, \
+ ALTERNATIVE_2 "jmp memset_movq", "", X86_FEATURE_REP_GOOD, \
"jmp memset_erms", X86_FEATURE_ERMS
movq %rdi,%r9
@@ -68,7 +68,8 @@ SYM_FUNC_START_LOCAL(memset_erms)
ret
SYM_FUNC_END(memset_erms)
-SYM_FUNC_START_LOCAL(memset_orig)
+.macro MEMSET_MOV OP fence
+SYM_FUNC_START_LOCAL(memset_\OP)
movq %rdi,%r10
/* expand byte value */
@@ -79,64 +80,71 @@ SYM_FUNC_START_LOCAL(memset_orig)
/* align dst */
movl %edi,%r9d
andl $7,%r9d
- jnz .Lbad_alignment
-.Lafter_bad_alignment:
+ jnz .Lbad_alignment_\@
+.Lafter_bad_alignment_\@:
movq %rdx,%rcx
shrq $6,%rcx
- jz .Lhandle_tail
+ jz .Lhandle_tail_\@
.p2align 4
-.Lloop_64:
+.Lloop_64_\@:
decq %rcx
- movq %rax,(%rdi)
- movq %rax,8(%rdi)
- movq %rax,16(%rdi)
- movq %rax,24(%rdi)
- movq %rax,32(%rdi)
- movq %rax,40(%rdi)
- movq %rax,48(%rdi)
- movq %rax,56(%rdi)
+ \OP %rax,(%rdi)
+ \OP %rax,8(%rdi)
+ \OP %rax,16(%rdi)
+ \OP %rax,24(%rdi)
+ \OP %rax,32(%rdi)
+ \OP %rax,40(%rdi)
+ \OP %rax,48(%rdi)
+ \OP %rax,56(%rdi)
leaq 64(%rdi),%rdi
- jnz .Lloop_64
+ jnz .Lloop_64_\@
/* Handle tail in loops. The loops should be faster than hard
to predict jump tables. */
.p2align 4
-.Lhandle_tail:
+.Lhandle_tail_\@:
movl %edx,%ecx
andl $63&(~7),%ecx
- jz .Lhandle_7
+ jz .Lhandle_7_\@
shrl $3,%ecx
.p2align 4
-.Lloop_8:
+.Lloop_8_\@:
decl %ecx
- movq %rax,(%rdi)
+ \OP %rax,(%rdi)
leaq 8(%rdi),%rdi
- jnz .Lloop_8
+ jnz .Lloop_8_\@
-.Lhandle_7:
+.Lhandle_7_\@:
andl $7,%edx
- jz .Lende
+ jz .Lende_\@
.p2align 4
-.Lloop_1:
+.Lloop_1_\@:
decl %edx
movb %al,(%rdi)
leaq 1(%rdi),%rdi
- jnz .Lloop_1
+ jnz .Lloop_1_\@
-.Lende:
+.Lende_\@:
+ .if \fence
+ sfence
+ .endif
movq %r10,%rax
ret
-.Lbad_alignment:
+.Lbad_alignment_\@:
cmpq $7,%rdx
- jbe .Lhandle_7
+ jbe .Lhandle_7_\@
movq %rax,(%rdi) /* unaligned store */
movq $8,%r8
subq %r9,%r8
addq %r8,%rdi
subq %r8,%rdx
- jmp .Lafter_bad_alignment
-.Lfinal:
-SYM_FUNC_END(memset_orig)
+ jmp .Lafter_bad_alignment_\@
+.Lfinal_\@:
+SYM_FUNC_END(memset_\OP)
+.endm
+
+MEMSET_MOV OP=movq fence=0
+MEMSET_MOV OP=movnti fence=1
--
2.9.3
* [PATCH 3/8] perf bench: add memset_movnti()
2020-10-14 8:32 [PATCH 0/8] Use uncached writes while clearing gigantic pages Ankur Arora
2020-10-14 8:32 ` [PATCH 1/8] x86/cpuid: add X86_FEATURE_NT_GOOD Ankur Arora
2020-10-14 8:32 ` [PATCH 2/8] x86/asm: add memset_movnti() Ankur Arora
@ 2020-10-14 8:32 ` Ankur Arora
2020-10-14 8:32 ` [PATCH 4/8] x86/asm: add clear_page_nt() Ankur Arora
` (4 subsequent siblings)
7 siblings, 0 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-14 8:32 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kirill, mhocko, boris.ostrovsky, konrad.wilk, Ankur Arora,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim
Clone memset_movnti() from arch/x86/lib/memset_64.S.
perf bench mem memset -f x86-64-movnt numbers on Intel Broadwellx,
Skylakex and AMD Rome:
Intel Broadwellx:
$ for i in 2 8 32 128 512; do
perf bench mem memset -f x86-64-movnt -s ${i}MB
done
# Output pruned.
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 2MB bytes ...
11.837121 GB/sec
# Copying 8MB bytes ...
11.783560 GB/sec
# Copying 32MB bytes ...
11.868591 GB/sec
# Copying 128MB bytes ...
11.865211 GB/sec
# Copying 512MB bytes ...
11.864085 GB/sec
Intel Skylakex:
$ for i in 2 8 32 128 512; do
perf bench mem memset -f x86-64-movnt -s ${i}MB
done
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 2MB bytes ...
6.361971 GB/sec
# Copying 8MB bytes ...
6.300403 GB/sec
# Copying 32MB bytes ...
6.288992 GB/sec
# Copying 128MB bytes ...
6.328793 GB/sec
# Copying 512MB bytes ...
6.324471 GB/sec
AMD Rome:
$ for i in 2 8 32 128 512; do
perf bench mem memset -f x86-64-movnt -s ${i}MB
done
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 2MB bytes ...
10.993199 GB/sec
# Copying 8MB bytes ...
14.221784 GB/sec
# Copying 32MB bytes ...
14.293337 GB/sec
# Copying 128MB bytes ...
15.238947 GB/sec
# Copying 512MB bytes ...
16.476093 GB/sec
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
tools/arch/x86/lib/memset_64.S | 68 ++++++++++++++++------------
tools/perf/bench/mem-memset-x86-64-asm-def.h | 6 ++-
2 files changed, 43 insertions(+), 31 deletions(-)
diff --git a/tools/arch/x86/lib/memset_64.S b/tools/arch/x86/lib/memset_64.S
index fd5d25a474b7..bfbf6d06f81e 100644
--- a/tools/arch/x86/lib/memset_64.S
+++ b/tools/arch/x86/lib/memset_64.S
@@ -26,7 +26,7 @@ SYM_FUNC_START(__memset)
*
* Otherwise, use original memset function.
*/
- ALTERNATIVE_2 "jmp memset_orig", "", X86_FEATURE_REP_GOOD, \
+ ALTERNATIVE_2 "jmp memset_movq", "", X86_FEATURE_REP_GOOD, \
"jmp memset_erms", X86_FEATURE_ERMS
movq %rdi,%r9
@@ -65,7 +65,8 @@ SYM_FUNC_START(memset_erms)
ret
SYM_FUNC_END(memset_erms)
-SYM_FUNC_START(memset_orig)
+.macro MEMSET_MOV OP fence
+SYM_FUNC_START(memset_\OP)
movq %rdi,%r10
/* expand byte value */
@@ -76,64 +77,71 @@ SYM_FUNC_START(memset_orig)
/* align dst */
movl %edi,%r9d
andl $7,%r9d
- jnz .Lbad_alignment
-.Lafter_bad_alignment:
+ jnz .Lbad_alignment_\@
+.Lafter_bad_alignment_\@:
movq %rdx,%rcx
shrq $6,%rcx
- jz .Lhandle_tail
+ jz .Lhandle_tail_\@
.p2align 4
-.Lloop_64:
+.Lloop_64_\@:
decq %rcx
- movq %rax,(%rdi)
- movq %rax,8(%rdi)
- movq %rax,16(%rdi)
- movq %rax,24(%rdi)
- movq %rax,32(%rdi)
- movq %rax,40(%rdi)
- movq %rax,48(%rdi)
- movq %rax,56(%rdi)
+ \OP %rax,(%rdi)
+ \OP %rax,8(%rdi)
+ \OP %rax,16(%rdi)
+ \OP %rax,24(%rdi)
+ \OP %rax,32(%rdi)
+ \OP %rax,40(%rdi)
+ \OP %rax,48(%rdi)
+ \OP %rax,56(%rdi)
leaq 64(%rdi),%rdi
- jnz .Lloop_64
+ jnz .Lloop_64_\@
/* Handle tail in loops. The loops should be faster than hard
to predict jump tables. */
.p2align 4
-.Lhandle_tail:
+.Lhandle_tail_\@:
movl %edx,%ecx
andl $63&(~7),%ecx
- jz .Lhandle_7
+ jz .Lhandle_7_\@
shrl $3,%ecx
.p2align 4
-.Lloop_8:
+.Lloop_8_\@:
decl %ecx
- movq %rax,(%rdi)
+ \OP %rax,(%rdi)
leaq 8(%rdi),%rdi
- jnz .Lloop_8
+ jnz .Lloop_8_\@
-.Lhandle_7:
+.Lhandle_7_\@:
andl $7,%edx
- jz .Lende
+ jz .Lende_\@
.p2align 4
-.Lloop_1:
+.Lloop_1_\@:
decl %edx
movb %al,(%rdi)
leaq 1(%rdi),%rdi
- jnz .Lloop_1
+ jnz .Lloop_1_\@
-.Lende:
+.Lende_\@:
+ .if \fence
+ sfence
+ .endif
movq %r10,%rax
ret
-.Lbad_alignment:
+.Lbad_alignment_\@:
cmpq $7,%rdx
- jbe .Lhandle_7
+ jbe .Lhandle_7_\@
movq %rax,(%rdi) /* unaligned store */
movq $8,%r8
subq %r9,%r8
addq %r8,%rdi
subq %r8,%rdx
- jmp .Lafter_bad_alignment
-.Lfinal:
-SYM_FUNC_END(memset_orig)
+ jmp .Lafter_bad_alignment_\@
+.Lfinal_\@:
+SYM_FUNC_END(memset_\OP)
+.endm
+
+MEMSET_MOV OP=movq fence=0
+MEMSET_MOV OP=movnti fence=1
diff --git a/tools/perf/bench/mem-memset-x86-64-asm-def.h b/tools/perf/bench/mem-memset-x86-64-asm-def.h
index dac6d2b7c39b..53ead7f91313 100644
--- a/tools/perf/bench/mem-memset-x86-64-asm-def.h
+++ b/tools/perf/bench/mem-memset-x86-64-asm-def.h
@@ -1,6 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0 */
-MEMSET_FN(memset_orig,
+MEMSET_FN(memset_movq,
"x86-64-unrolled",
"unrolled memset() in arch/x86/lib/memset_64.S")
@@ -11,3 +11,7 @@ MEMSET_FN(__memset,
MEMSET_FN(memset_erms,
"x86-64-stosb",
"movsb-based memset() in arch/x86/lib/memset_64.S")
+
+MEMSET_FN(memset_movnti,
+ "x86-64-movnt",
+ "movnt-based memset() in arch/x86/lib/memset_64.S")
--
2.9.3
* [PATCH 4/8] x86/asm: add clear_page_nt()
2020-10-14 8:32 [PATCH 0/8] Use uncached writes while clearing gigantic pages Ankur Arora
` (2 preceding siblings ...)
2020-10-14 8:32 ` [PATCH 3/8] perf bench: " Ankur Arora
@ 2020-10-14 8:32 ` Ankur Arora
2020-10-14 19:56 ` Borislav Petkov
2020-10-14 8:32 ` [PATCH 5/8] x86/clear_page: add clear_page_uncached() Ankur Arora
` (3 subsequent siblings)
7 siblings, 1 reply; 29+ messages in thread
From: Ankur Arora @ 2020-10-14 8:32 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kirill, mhocko, boris.ostrovsky, konrad.wilk, Ankur Arora,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
H. Peter Anvin, Jiri Slaby, Herbert Xu, Rafael J. Wysocki
Add clear_page_nt() which is essentially an unrolled MOVNTI loop. The
unrolling keeps the inner loop similar to memset_movnti() which can be
exercised via perf bench mem memset.
The caller needs to execute an SFENCE when done.
MOVNTI, from the Intel SDM, Volume 2B, 4-101:
"The non-temporal hint is implemented by using a write combining (WC)
memory type protocol when writing the data to memory. Using this
protocol, the processor does not write the data into the cache hierarchy,
nor does it fetch the corresponding cache line from memory into the
cache hierarchy."
The AMD Arch Manual has something similar to say as well.
This can potentially improve page-clearing bandwidth (see below for
performance numbers on two microarchitectures where it helps and one
where it doesn't) and can help indirectly by consuming fewer cache
resources.
Any performance benefit is expected only for extents around LLC-size or
larger, i.e. when we are DRAM-BW constrained rather than cache-BW
constrained.
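For reference, the structure of clear_page_nt() in a runnable userspace
sketch -- plain 8-byte stores stand in for MOVNTI so this compiles
anywhere; it mirrors only the unrolled loop shape, not the non-temporal
semantics:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/*
 * Sketch of clear_page_nt(): eight 8-byte stores per iteration over a
 * 4K page. The kernel version uses MOVNTI for each store and leaves
 * the trailing SFENCE to the caller.
 */
static void clear_page_nt_sketch(void *page)
{
	uint64_t *p = page;
	size_t i;

	for (i = 0; i < PAGE_SIZE / sizeof(uint64_t); i += 8) {
		p[i + 0] = 0;
		p[i + 1] = 0;
		p[i + 2] = 0;
		p[i + 3] = 0;
		p[i + 4] = 0;
		p[i + 5] = 0;
		p[i + 6] = 0;
		p[i + 7] = 0;
	}
	/* Caller issues the store fence (sfence) when done. */
}
```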
# Intel Broadwellx
# Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
# (X86_FEATURE_ERMS) and x86-64-movnt:
System: Oracle X6-2
CPU: 2 nodes * 10 cores/node * 2 threads/core
Intel Xeon E5-2630 v4 (Broadwellx, 6:79:1)
Memory: 256G evenly split between nodes
Microcode: 0xb00002e
scaling_governor: performance
L3 size: 25MB
intel_pstate/no_turbo: 1
x86-64-stosb (5 runs) x86-64-movnt (5 runs) speedup
----------------------- ----------------------- -------
size BW ( pstdev) BW ( pstdev)
16MB 17.35 GB/s ( +- 9.27%) 11.83 GB/s ( +- 0.19%) -31.81%
128MB 5.31 GB/s ( +- 0.13%) 11.72 GB/s ( +- 0.44%) +121.84%
1024MB 5.42 GB/s ( +- 0.13%) 11.78 GB/s ( +- 0.03%) +117.34%
4096MB 5.41 GB/s ( +- 0.41%) 11.76 GB/s ( +- 0.07%) +117.37%
Comparing perf stats for size=4096MB:
$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb
# Running 'mem/memset' benchmark:
# function 'x86-64-stosb' (movsb-based memset() in arch/x86/lib/memset_64.S)
# Copying 4096MB bytes ...
5.405362 GB/sec
5.444229 GB/sec
5.397943 GB/sec
5.401012 GB/sec
5.439320 GB/sec
Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb' (5 runs):
2,064,476,092 cpu-cycles # 1.087 GHz ( +- 0.17% ) (22.19%)
8,578,591 instructions # 0.00 insn per cycle ( +- 12.01% ) (27.79%)
132,481,645 cache-references # 69.730 M/sec ( +- 0.20% ) (27.83%)
157,710 cache-misses # 0.119 % of all cache refs ( +- 5.80% ) (27.84%)
2,879,628 branch-instructions # 1.516 M/sec ( +- 0.21% ) (27.86%)
80,581 branch-misses # 2.80% of all branches ( +- 13.15% ) (27.84%)
94,401,869 bus-cycles # 49.687 M/sec ( +- 0.25% ) (22.21%)
133,947,283 L1-dcache-load-misses # 139717.91% of all L1-dcache accesses ( +- 0.26% ) (22.21%)
95,870 L1-dcache-loads # 0.050 M/sec ( +- 9.95% ) (22.21%)
1,700 LLC-loads # 0.895 K/sec ( +- 6.50% ) (22.21%)
1,410 LLC-load-misses # 82.95% of all LL-cache accesses ( +- 19.42% ) (22.21%)
132,526,771 LLC-stores # 69.754 M/sec ( +- 0.65% ) (11.10%)
101,145 LLC-store-misses # 0.053 M/sec ( +- 11.19% ) (11.10%)
1.90238 +- 0.00358 seconds time elapsed ( +- 0.19% )
$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 4096MB bytes ...
11.774264 GB/sec
11.758826 GB/sec
11.774368 GB/sec
11.758239 GB/sec
11.760348 GB/sec
Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt' (5 runs):
1,619,807,936 cpu-cycles # 0.971 GHz ( +- 0.24% ) (22.14%)
1,481,306,856 instructions # 0.91 insn per cycle ( +- 0.33% ) (27.75%)
163,086 cache-references # 0.098 M/sec ( +- 11.68% ) (27.79%)
39,913 cache-misses # 24.474 % of all cache refs ( +- 26.45% ) (27.84%)
135,741,931 branch-instructions # 81.353 M/sec ( +- 0.33% ) (27.89%)
82,647 branch-misses # 0.06% of all branches ( +- 6.29% ) (27.90%)
73,575,446 bus-cycles # 44.095 M/sec ( +- 0.28% ) (22.28%)
27,834 L1-dcache-load-misses # 68.42% of all L1-dcache accesses ( +- 65.93% ) (22.28%)
40,683 L1-dcache-loads # 0.024 M/sec ( +- 42.62% ) (22.27%)
2,598 LLC-loads # 0.002 M/sec ( +- 22.66% ) (22.25%)
1,523 LLC-load-misses # 58.60% of all LL-cache accesses ( +- 39.64% ) (22.22%)
2 LLC-stores # 0.001 K/sec ( +-100.00% ) (11.08%)
0 LLC-store-misses # 0.000 K/sec (11.07%)
1.67003 +- 0.00169 seconds time elapsed ( +- 0.10% )
The L1-dcache-load-miss (L1D.REPLACEMENT) counts are significantly down,
which does confirm that unlike "REP; STOSB", MOVNTI does not result in a
write-allocate.
# AMD Rome
# Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosq
# (X86_FEATURE_REP_GOOD) and x86-64-movnt:
System: Oracle E2-2c
CPU: 2 nodes * 64 cores/node * 2 threads/core
AMD EPYC 7742 (Rome, 23:49:0)
Memory: 2048 GB evenly split between nodes
Microcode: 0x8301038
scaling_governor: performance
L3 size: 16 * 16MB
cpufreq/boost: 0
x86-64-stosq (5 runs) x86-64-movnt (5 runs) speedup
----------------------- ----------------------- -------
size BW ( pstdev) BW ( pstdev)
16MB 15.39 GB/s ( +- 9.14%) 14.56 GB/s ( +-19.43%) -5.39%
128MB 11.04 GB/s ( +- 4.87%) 14.49 GB/s ( +-13.22%) +31.25%
1024MB 11.86 GB/s ( +- 0.83%) 16.54 GB/s ( +- 0.04%) +39.46%
4096MB 11.89 GB/s ( +- 0.61%) 16.49 GB/s ( +- 0.28%) +38.68%
Comparing perf stats for size=4096MB:
$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-stosq
# Running 'mem/memset' benchmark:
# function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
# Copying 4096MB bytes ...
11.785122 GB/sec
11.970851 GB/sec
11.916821 GB/sec
11.861517 GB/sec
11.941867 GB/sec
Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-stosq' (5 runs):
1,014,645,096 cpu-cycles # 1.264 GHz ( +- 0.18% ) (45.28%)
4,620,983 instructions # 0.00 insn per cycle ( +- 1.86% ) (45.37%)
262,988,622 cache-references # 327.723 M/sec ( +- 0.21% ) (45.51%)
6,312,740 cache-misses # 2.400 % of all cache refs ( +- 1.12% ) (45.56%)
1,792,517 branch-instructions # 2.234 M/sec ( +- 0.20% ) (45.60%)
54,095 branch-misses # 3.02% of all branches ( +- 2.99% ) (45.64%)
133,710,131 L1-dcache-load-misses # 363.51% of all L1-dcache accesses ( +- 0.12% ) (45.55%)
36,783,396 L1-dcache-loads # 45.838 M/sec ( +- 0.79% ) (45.46%)
53,411,709 L1-dcache-prefetches # 66.559 M/sec ( +- 0.28% ) (45.39%)
0.80303 +- 0.00117 seconds time elapsed ( +- 0.15% )
$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 4096MB bytes ...
16.533230 GB/sec
16.496138 GB/sec
16.480302 GB/sec
16.478333 GB/sec
16.474600 GB/sec
Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt' (5 runs):
1,091,352,779 cpu-cycles # 1.292 GHz ( +- 0.32% ) (45.25%)
1,483,248,390 instructions # 1.36 insn per cycle ( +- 0.14% ) (45.38%)
134,114,985 cache-references # 158.723 M/sec ( +- 0.17% ) (45.51%)
117,682 cache-misses # 0.088 % of all cache refs ( +- 0.99% ) (45.59%)
135,009,275 branch-instructions # 159.781 M/sec ( +- 0.18% ) (45.68%)
50,659 branch-misses # 0.04% of all branches ( +- 7.50% ) (45.66%)
58,569 L1-dcache-load-misses # 5.84% of all L1-dcache accesses ( +- 6.04% ) (45.57%)
1,002,657 L1-dcache-loads # 1.187 M/sec ( +- 15.40% ) (45.45%)
3,111 L1-dcache-prefetches # 0.004 M/sec ( +- 31.21% ) (45.38%)
0.84554 +- 0.00289 seconds time elapsed ( +- 0.34% )
Similar to Intel Broadwellx, the L1-dcache-load-misses (L2$ access from
DC Miss) counts are significantly lower. The L1 prefetcher is also
fairly quiet.
# Intel Skylakex
# Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
# (X86_FEATURE_ERMS) and x86-64-movnt:
System: Oracle X8-2
CPU: 2 nodes * 26 cores/node * 2 threads/core
Intel Xeon Platinum 8270CL (Skylakex, 6:85:7)
Memory: 3TB evenly split between nodes
Microcode: 0x5002f01
scaling_governor: performance
L3 size: 36MB
intel_pstate/no_turbo: 1
x86-64-stosb (5 runs) x86-64-movnt (5 runs) speedup
----------------------- ----------------------- -------
size BW ( pstdev) BW ( pstdev)
16MB 20.38 GB/s ( +- 2.58%) 6.25 GB/s ( +- 0.41%) -69.28%
128MB 6.52 GB/s ( +- 0.14%) 6.31 GB/s ( +- 0.47%) -3.22%
1024MB 6.48 GB/s ( +- 0.31%) 6.24 GB/s ( +- 0.00%) -3.70%
4096MB 6.51 GB/s ( +- 0.01%) 6.27 GB/s ( +- 0.42%) -3.68%
Comparing perf stats for size=4096MB:
$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb
# Running 'mem/memset' benchmark:
# function 'x86-64-stosb' (movsb-based memset() in arch/x86/lib/memset_64.S)
# Copying 4096MB bytes ...
6.516972 GB/sec
6.518756 GB/sec
6.517620 GB/sec
6.517598 GB/sec
6.518799 GB/sec
Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb' (5 runs):
3,357,373,317 cpu-cycles # 1.133 GHz ( +- 0.01% ) (29.38%)
165,063,710 instructions # 0.05 insn per cycle ( +- 1.54% ) (35.29%)
358,997 cache-references # 0.121 M/sec ( +- 0.89% ) (35.32%)
205,420 cache-misses # 57.221 % of all cache refs ( +- 3.61% ) (35.36%)
6,117,673 branch-instructions # 2.065 M/sec ( +- 1.48% ) (35.38%)
58,309 branch-misses # 0.95% of all branches ( +- 1.30% ) (35.39%)
31,329,466 bus-cycles # 10.575 M/sec ( +- 0.03% ) (23.56%)
68,543,766 L1-dcache-load-misses # 157.03% of all L1-dcache accesses ( +- 0.02% ) (23.53%)
43,648,909 L1-dcache-loads # 14.734 M/sec ( +- 0.50% ) (23.50%)
137,498 LLC-loads # 0.046 M/sec ( +- 0.21% ) (23.49%)
12,308 LLC-load-misses # 8.95% of all LL-cache accesses ( +- 2.52% ) (23.49%)
26,335 LLC-stores # 0.009 M/sec ( +- 5.65% ) (11.75%)
25,008 LLC-store-misses # 0.008 M/sec ( +- 3.42% ) (11.75%)
2.962842 +- 0.000162 seconds time elapsed ( +- 0.01% )
$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 4096MB bytes ...
6.283420 GB/sec
6.222843 GB/sec
6.282976 GB/sec
6.282828 GB/sec
6.283173 GB/sec
Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt' (5 runs):
4,462,272,094 cpu-cycles # 1.322 GHz ( +- 0.30% ) (29.38%)
1,633,675,881 instructions # 0.37 insn per cycle ( +- 0.21% ) (35.28%)
283,627 cache-references # 0.084 M/sec ( +- 0.58% ) (35.31%)
28,824 cache-misses # 10.163 % of all cache refs ( +- 20.67% ) (35.34%)
139,719,697 branch-instructions # 41.407 M/sec ( +- 0.16% ) (35.35%)
58,062 branch-misses # 0.04% of all branches ( +- 1.49% ) (35.36%)
41,760,350 bus-cycles # 12.376 M/sec ( +- 0.05% ) (23.55%)
303,300 L1-dcache-load-misses # 0.69% of all L1-dcache accesses ( +- 2.08% ) (23.53%)
43,769,498 L1-dcache-loads # 12.972 M/sec ( +- 0.54% ) (23.52%)
99,570 LLC-loads # 0.030 M/sec ( +- 1.06% ) (23.52%)
1,966 LLC-load-misses # 1.97% of all LL-cache accesses ( +- 6.17% ) (23.52%)
129 LLC-stores # 0.038 K/sec ( +- 27.85% ) (11.75%)
7 LLC-store-misses # 0.002 K/sec ( +- 47.82% ) (11.75%)
3.37465 +- 0.00474 seconds time elapsed ( +- 0.14% )
The L1-dcache-load-misses (L1D.REPLACEMENT) count is much lower just
like the previous two cases. No performance improvement for Skylakex
though.
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
arch/x86/include/asm/page_64.h | 1 +
arch/x86/lib/clear_page_64.S | 26 ++++++++++++++++++++++++++
2 files changed, 27 insertions(+)
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 939b1cff4a7b..bde3c2785ec4 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -43,6 +43,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
void clear_page_orig(void *page);
void clear_page_rep(void *page);
void clear_page_erms(void *page);
+void clear_page_nt(void *page);
static inline void clear_page(void *page)
{
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index c4c7dd115953..f16bb753b236 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -50,3 +50,29 @@ SYM_FUNC_START(clear_page_erms)
ret
SYM_FUNC_END(clear_page_erms)
EXPORT_SYMBOL_GPL(clear_page_erms)
+
+/*
+ * Zero a page.
+ * %rdi - page
+ *
+ * Caller needs to issue a fence at the end.
+ */
+SYM_FUNC_START(clear_page_nt)
+ xorl %eax,%eax
+ movl $4096,%ecx
+
+ .p2align 4
+.Lstart:
+ movnti %rax, 0x00(%rdi)
+ movnti %rax, 0x08(%rdi)
+ movnti %rax, 0x10(%rdi)
+ movnti %rax, 0x18(%rdi)
+ movnti %rax, 0x20(%rdi)
+ movnti %rax, 0x28(%rdi)
+ movnti %rax, 0x30(%rdi)
+ movnti %rax, 0x38(%rdi)
+ addq $0x40, %rdi
+ subl $0x40, %ecx
+ ja .Lstart
+ ret
+SYM_FUNC_END(clear_page_nt)
--
2.9.3
* [PATCH 5/8] x86/clear_page: add clear_page_uncached()
2020-10-14 8:32 [PATCH 0/8] Use uncached writes while clearing gigantic pages Ankur Arora
` (3 preceding siblings ...)
2020-10-14 8:32 ` [PATCH 4/8] x86/asm: add clear_page_nt() Ankur Arora
@ 2020-10-14 8:32 ` Ankur Arora
2020-10-14 11:10 ` kernel test robot
` (2 more replies)
2020-10-14 8:32 ` [PATCH 6/8] mm, clear_huge_page: use clear_page_uncached() for gigantic pages Ankur Arora
` (2 subsequent siblings)
7 siblings, 3 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-14 8:32 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kirill, mhocko, boris.ostrovsky, konrad.wilk, Ankur Arora,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
H. Peter Anvin, Arnd Bergmann, Andrew Morton, Ira Weiny,
linux-arch
Define clear_page_uncached() as an alternative_call() which resolves to
clear_page_nt() if the CPU has X86_FEATURE_NT_GOOD, and falls back to
clear_page() if it doesn't.
Similarly define clear_page_uncached_flush(), which issues an SFENCE if
the CPU has X86_FEATURE_NT_GOOD.
Also, add the glue interface clear_user_highpage_uncached().
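The ALTERNATIVE-based dispatch can be mimicked in plain C with a
feature-flag check (a userspace analogue only; the kernel patches the
call site once at boot rather than branching per call, and the names
below are stand-ins):

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 4096

static int cpu_has_nt_good;	/* stand-in for X86_FEATURE_NT_GOOD */
static int used_nt;		/* records which path ran, for illustration */

static void clear_page(void *page)    { memset(page, 0, PAGE_SIZE); used_nt = 0; }
static void clear_page_nt(void *page) { memset(page, 0, PAGE_SIZE); used_nt = 1; }

/*
 * Analogue of alternative_call(): pick clear_page_nt() when the
 * feature bit is set, clear_page() otherwise.
 */
static void clear_page_uncached(void *page)
{
	if (cpu_has_nt_good)
		clear_page_nt(page);
	else
		clear_page(page);
}

static void clear_page_uncached_flush(void)
{
	/* Kernel: alternative("", "sfence", X86_FEATURE_NT_GOOD). No-op here. */
}
```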
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
arch/x86/include/asm/page.h | 6 ++++++
arch/x86/include/asm/page_32.h | 9 +++++++++
arch/x86/include/asm/page_64.h | 14 ++++++++++++++
include/asm-generic/page.h | 3 +++
include/linux/highmem.h | 10 ++++++++++
5 files changed, 42 insertions(+)
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 7555b48803a8..ca0aa379ac7f 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -28,6 +28,12 @@ static inline void clear_user_page(void *page, unsigned long vaddr,
clear_page(page);
}
+static inline void clear_user_page_uncached(void *page, unsigned long vaddr,
+ struct page *pg)
+{
+ clear_page_uncached(page);
+}
+
static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
struct page *topage)
{
diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
index 94dbd51df58f..7a03a274a9a4 100644
--- a/arch/x86/include/asm/page_32.h
+++ b/arch/x86/include/asm/page_32.h
@@ -39,6 +39,15 @@ static inline void clear_page(void *page)
memset(page, 0, PAGE_SIZE);
}
+static inline void clear_page_uncached(void *page)
+{
+ clear_page(page);
+}
+
+static inline void clear_page_uncached_flush(void)
+{
+}
+
static inline void copy_page(void *to, void *from)
{
memcpy(to, from, PAGE_SIZE);
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index bde3c2785ec4..5897075e77dd 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -55,6 +55,20 @@ static inline void clear_page(void *page)
: "cc", "memory", "rax", "rcx");
}
+static inline void clear_page_uncached(void *page)
+{
+ alternative_call(clear_page,
+ clear_page_nt, X86_FEATURE_NT_GOOD,
+ "=D" (page),
+ "0" (page)
+ : "cc", "memory", "rax", "rcx");
+}
+
+static inline void clear_page_uncached_flush(void)
+{
+ alternative("", "sfence", X86_FEATURE_NT_GOOD);
+}
+
void copy_page(void *to, void *from);
#endif /* !__ASSEMBLY__ */
diff --git a/include/asm-generic/page.h b/include/asm-generic/page.h
index fe801f01625e..60235a0cf24a 100644
--- a/include/asm-generic/page.h
+++ b/include/asm-generic/page.h
@@ -26,6 +26,9 @@
#ifndef __ASSEMBLY__
#define clear_page(page) memset((page), 0, PAGE_SIZE)
+#define clear_page_uncached(page) clear_page(page)
+#define clear_page_uncached_flush() do { } while (0)
+
#define copy_page(to,from) memcpy((to), (from), PAGE_SIZE)
#define clear_user_page(page, vaddr, pg) clear_page(page)
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 14e6202ce47f..f842593e2474 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -232,6 +232,16 @@ static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
}
#endif
+#ifndef clear_user_highpage_uncached
+static inline void clear_user_highpage_uncached(struct page *page, unsigned long vaddr)
+{
+ void *addr = kmap_atomic(page);
+
+ clear_user_page_uncached(addr, vaddr, page);
+ kunmap_atomic(addr);
+}
+#endif
+
#ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
/**
* __alloc_zeroed_user_highpage - Allocate a zeroed HIGHMEM page for a VMA with caller-specified movable GFP flags
--
2.9.3
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH 6/8] mm, clear_huge_page: use clear_page_uncached() for gigantic pages
2020-10-14 8:32 [PATCH 0/8] Use uncached writes while clearing gigantic pages Ankur Arora
` (4 preceding siblings ...)
2020-10-14 8:32 ` [PATCH 5/8] x86/clear_page: add clear_page_uncached() Ankur Arora
@ 2020-10-14 8:32 ` Ankur Arora
2020-10-14 15:28 ` Ingo Molnar
2020-10-14 8:32 ` [PATCH 7/8] x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx Ankur Arora
2020-10-14 8:32 ` [PATCH 8/8] x86/cpu/amd: enable X86_FEATURE_NT_GOOD on AMD Zen Ankur Arora
7 siblings, 1 reply; 29+ messages in thread
From: Ankur Arora @ 2020-10-14 8:32 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kirill, mhocko, boris.ostrovsky, konrad.wilk, Ankur Arora, Andrew Morton
Uncached writes are suitable for circumstances where the region written to
is not expected to be read again soon, or the region written to is large
enough that there's no expectation that we will find the writes in the
cache.
Accordingly switch to using clear_page_uncached() for gigantic pages.
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
mm/memory.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/memory.c b/mm/memory.c
index eeae590e526a..4d2c58f83ab1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5092,7 +5092,7 @@ static void clear_gigantic_page(struct page *page,
for (i = 0; i < pages_per_huge_page;
i++, p = mem_map_next(p, page, i)) {
cond_resched();
- clear_user_highpage(p, addr + i * PAGE_SIZE);
+ clear_user_highpage_uncached(p, addr + i * PAGE_SIZE);
}
}
@@ -5111,6 +5111,7 @@ void clear_huge_page(struct page *page,
if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
clear_gigantic_page(page, addr, pages_per_huge_page);
+ clear_page_uncached_flush();
return;
}
--
2.9.3
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH 7/8] x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx
2020-10-14 8:32 [PATCH 0/8] Use uncached writes while clearing gigantic pages Ankur Arora
` (5 preceding siblings ...)
2020-10-14 8:32 ` [PATCH 6/8] mm, clear_huge_page: use clear_page_uncached() for gigantic pages Ankur Arora
@ 2020-10-14 8:32 ` Ankur Arora
2020-10-14 15:31 ` Ingo Molnar
2020-10-14 8:32 ` [PATCH 8/8] x86/cpu/amd: enable X86_FEATURE_NT_GOOD on AMD Zen Ankur Arora
7 siblings, 1 reply; 29+ messages in thread
From: Ankur Arora @ 2020-10-14 8:32 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kirill, mhocko, boris.ostrovsky, konrad.wilk, Ankur Arora,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
H. Peter Anvin, Tony Luck, Sean Christopherson, Mike Rapoport,
Xiaoyao Li, Fenghua Yu, Peter Zijlstra (Intel),
Dave Hansen
System: Oracle X6-2
CPU: 2 nodes * 10 cores/node * 2 threads/core
Intel Xeon E5-2630 v4 (Broadwellx, 6:79:1)
Memory: 256 GB evenly split between nodes
Microcode: 0xb00002e
scaling_governor: performance
L3 size: 25MB
intel_pstate/no_turbo: 1
Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
(X86_FEATURE_ERMS) and x86-64-movnt (X86_FEATURE_NT_GOOD):
x86-64-stosb (5 runs) x86-64-movnt (5 runs) speedup
----------------------- ----------------------- -------
size BW ( pstdev) BW ( pstdev)
16MB 17.35 GB/s ( +- 9.27%) 11.83 GB/s ( +- 0.19%) -31.81%
128MB 5.31 GB/s ( +- 0.13%) 11.72 GB/s ( +- 0.44%) +121.84%
1024MB 5.42 GB/s ( +- 0.13%) 11.78 GB/s ( +- 0.03%) +117.34%
4096MB 5.41 GB/s ( +- 0.41%) 11.76 GB/s ( +- 0.07%) +117.37%
The next workload exercises the page-clearing path directly by faulting over
an anonymous mmap region backed by 1GB pages. This workload is similar to the
creation phase of pinned guests in QEMU.
$ cat pf-test.c
#include <stdlib.h>
#include <sys/mman.h>
#include <linux/mman.h>
#define HPAGE_BITS 30
int main(int argc, char **argv) {
	unsigned long i;
	unsigned long len = atoi(argv[1]); /* In GB */
	unsigned long offset = 0;
	unsigned long numpages;
	char *base;

	len *= 1UL << 30;
	numpages = len >> HPAGE_BITS;

	base = mmap(NULL, len, PROT_READ|PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS |
		    MAP_HUGETLB | MAP_HUGE_1GB, 0, 0);
	if (base == MAP_FAILED)
		return 1;

	for (i = 0; i < numpages; i++) {
		*((volatile char *)base + offset) = *(base + offset);
		offset += 1UL << HPAGE_BITS;
	}
	return 0;
}
The specific test is for a 128GB region but this is a single-threaded
O(n) workload so the exact region size is not material.
Page-clearing throughput for clear_page_erms(): 3.72 GBps
$ perf stat -r 5 --all-kernel -e ... bin/pf-test 128
Performance counter stats for 'bin/pf-test 128' (5 runs):
74,799,496,556 cpu-cycles # 2.176 GHz ( +- 2.22% ) (29.41%)
1,474,615,023 instructions # 0.02 insn per cycle ( +- 0.23% ) (35.29%)
2,148,580,131 cache-references # 62.502 M/sec ( +- 0.02% ) (35.29%)
71,736,985 cache-misses # 3.339 % of all cache refs ( +- 0.94% ) (35.29%)
433,713,165 branch-instructions # 12.617 M/sec ( +- 0.15% ) (35.30%)
1,008,251 branch-misses # 0.23% of all branches ( +- 1.88% ) (35.30%)
3,406,821,966 bus-cycles # 99.104 M/sec ( +- 2.22% ) (23.53%)
2,156,059,110 L1-dcache-load-misses # 445.35% of all L1-dcache accesses ( +- 0.01% ) (23.53%)
484,128,243 L1-dcache-loads # 14.083 M/sec ( +- 0.22% ) (23.53%)
944,216 LLC-loads # 0.027 M/sec ( +- 7.41% ) (23.53%)
537,989 LLC-load-misses # 56.98% of all LL-cache accesses ( +- 13.64% ) (23.53%)
2,150,138,476 LLC-stores # 62.547 M/sec ( +- 0.01% ) (11.76%)
69,598,760 LLC-store-misses # 2.025 M/sec ( +- 0.47% ) (11.76%)
483,923,875 dTLB-loads # 14.077 M/sec ( +- 0.21% ) (17.64%)
1,892 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 30.63% ) (23.53%)
4,799,154,980 dTLB-stores # 139.606 M/sec ( +- 0.03% ) (23.53%)
90 dTLB-store-misses # 0.003 K/sec ( +- 35.92% ) (23.53%)
34.377 +- 0.760 seconds time elapsed ( +- 2.21% )
Page-clearing throughput with clear_page_nt(): 11.78GBps
$ perf stat -r 5 --all-kernel -e ... bin/pf-test 128
Performance counter stats for 'bin/pf-test 128' (5 runs):
23,699,446,603 cpu-cycles # 2.182 GHz ( +- 0.01% ) (23.53%)
24,794,548,512 instructions # 1.05 insn per cycle ( +- 0.00% ) (29.41%)
432,775 cache-references # 0.040 M/sec ( +- 3.96% ) (29.41%)
75,580 cache-misses # 17.464 % of all cache refs ( +- 51.42% ) (29.41%)
2,492,858,290 branch-instructions # 229.475 M/sec ( +- 0.00% ) (29.42%)
34,016,826 branch-misses # 1.36% of all branches ( +- 0.04% ) (29.42%)
1,078,468,643 bus-cycles # 99.276 M/sec ( +- 0.01% ) (23.53%)
717,228 L1-dcache-load-misses # 0.20% of all L1-dcache accesses ( +- 3.77% ) (23.53%)
351,999,535 L1-dcache-loads # 32.403 M/sec ( +- 0.04% ) (23.53%)
75,988 LLC-loads # 0.007 M/sec ( +- 4.20% ) (23.53%)
24,503 LLC-load-misses # 32.25% of all LL-cache accesses ( +- 53.30% ) (23.53%)
57,283 LLC-stores # 0.005 M/sec ( +- 2.15% ) (11.76%)
19,738 LLC-store-misses # 0.002 M/sec ( +- 46.55% ) (11.76%)
351,836,498 dTLB-loads # 32.388 M/sec ( +- 0.04% ) (17.65%)
1,171 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 42.68% ) (23.53%)
17,385,579,725 dTLB-stores # 1600.392 M/sec ( +- 0.00% ) (23.53%)
200 dTLB-store-misses # 0.018 K/sec ( +- 10.63% ) (23.53%)
10.863678 +- 0.000804 seconds time elapsed ( +- 0.01% )
L1-dcache-load-misses (L1D.REPLACEMENT) is substantially lower, which
suggests that, as expected, we aren't doing write-allocate or RFO.
Note that the IPC and instruction counts etc. are quite different, but
that's just an artifact of switching from a single 'REP; STOSB' per
PAGE_SIZE region to a MOVNTI loop.
The page-clearing BW is substantially higher (~100% or more), so enable
X86_FEATURE_NT_GOOD for Intel Broadwellx.
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
arch/x86/kernel/cpu/intel.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 59a1e3ce3f14..161028c1dee0 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -662,6 +662,8 @@ static void init_intel(struct cpuinfo_x86 *c)
c->x86_cache_alignment = c->x86_clflush_size * 2;
if (c->x86 == 6)
set_cpu_cap(c, X86_FEATURE_REP_GOOD);
+ if (c->x86 == 6 && c->x86_model == INTEL_FAM6_BROADWELL_X)
+ set_cpu_cap(c, X86_FEATURE_NT_GOOD);
#else
/*
* Names for the Pentium II/Celeron processors
--
2.9.3
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH 8/8] x86/cpu/amd: enable X86_FEATURE_NT_GOOD on AMD Zen
2020-10-14 8:32 [PATCH 0/8] Use uncached writes while clearing gigantic pages Ankur Arora
` (6 preceding siblings ...)
2020-10-14 8:32 ` [PATCH 7/8] x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx Ankur Arora
@ 2020-10-14 8:32 ` Ankur Arora
7 siblings, 0 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-14 8:32 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kirill, mhocko, boris.ostrovsky, konrad.wilk, Ankur Arora,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
H. Peter Anvin, Kim Phillips, Reinette Chatre, Tony Luck,
Tom Lendacky, Wei Huang
System: Oracle E2-2C
CPU: 2 nodes * 64 cores/node * 2 threads/core
AMD EPYC 7742 (Rome, 23:49:0)
Memory: 2048 GB evenly split between nodes
Microcode: 0x8301038
scaling_governor: performance
L3 size: 16 * 16MB
cpufreq/boost: 0
Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosq
(X86_FEATURE_REP_GOOD) and x86-64-movnt (X86_FEATURE_NT_GOOD):
x86-64-stosq (5 runs) x86-64-movnt (5 runs) speedup
----------------------- ----------------------- -------
size BW ( pstdev) BW ( pstdev)
16MB 15.39 GB/s ( +- 9.14%) 14.56 GB/s ( +-19.43%) -5.39%
128MB 11.04 GB/s ( +- 4.87%) 14.49 GB/s ( +-13.22%) +31.25%
1024MB 11.86 GB/s ( +- 0.83%) 16.54 GB/s ( +- 0.04%) +39.46%
4096MB 11.89 GB/s ( +- 0.61%) 16.49 GB/s ( +- 0.28%) +38.68%
The next workload exercises the page-clearing path directly by faulting over
an anonymous mmap region backed by 1GB pages. This workload is similar to the
creation phase of pinned guests in QEMU.
$ cat pf-test.c
#include <stdlib.h>
#include <sys/mman.h>
#include <linux/mman.h>
#define HPAGE_BITS 30
int main(int argc, char **argv) {
	unsigned long i;
	unsigned long len = atoi(argv[1]); /* In GB */
	unsigned long offset = 0;
	unsigned long numpages;
	char *base;

	len *= 1UL << 30;
	numpages = len >> HPAGE_BITS;

	base = mmap(NULL, len, PROT_READ|PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS |
		    MAP_HUGETLB | MAP_HUGE_1GB, 0, 0);
	if (base == MAP_FAILED)
		return 1;

	for (i = 0; i < numpages; i++) {
		*((volatile char *)base + offset) = *(base + offset);
		offset += 1UL << HPAGE_BITS;
	}
	return 0;
}
The specific test is for a 128GB region but this is a single-threaded
O(n) workload so the exact region size is not material.
Page-clearing throughput for clear_page_rep(): 11.33 GBps
$ perf stat -r 5 --all-kernel -e ... bin/pf-test 128
Performance counter stats for 'bin/pf-test 128' (5 runs):
25,130,082,910 cpu-cycles # 2.226 GHz ( +- 0.44% ) (54.54%)
1,368,762,311 instructions # 0.05 insn per cycle ( +- 0.02% ) (54.54%)
4,265,726,534 cache-references # 377.794 M/sec ( +- 0.02% ) (54.54%)
119,021,793 cache-misses # 2.790 % of all cache refs ( +- 3.90% ) (54.55%)
413,825,787 branch-instructions # 36.650 M/sec ( +- 0.01% ) (54.55%)
236,847 branch-misses # 0.06% of all branches ( +- 18.80% ) (54.56%)
2,152,320,887 L1-dcache-load-misses # 40.40% of all L1-dcache accesses ( +- 0.01% ) (54.55%)
5,326,873,560 L1-dcache-loads # 471.775 M/sec ( +- 0.20% ) (54.55%)
828,943,234 L1-dcache-prefetches # 73.415 M/sec ( +- 0.55% ) (54.54%)
18,914 dTLB-loads # 0.002 M/sec ( +- 47.23% ) (54.54%)
4,423 dTLB-load-misses # 23.38% of all dTLB cache accesses ( +- 27.75% ) (54.54%)
11.2917 +- 0.0499 seconds time elapsed ( +- 0.44% )
Page-clearing throughput for clear_page_nt(): 16.29 GBps
$ perf stat -r 5 --all-kernel -e ... bin/pf-test 128
Performance counter stats for 'bin/pf-test 128' (5 runs):
17,523,166,924 cpu-cycles # 2.230 GHz ( +- 0.03% ) (45.43%)
24,801,270,826 instructions # 1.42 insn per cycle ( +- 0.01% ) (45.45%)
2,151,391,033 cache-references # 273.845 M/sec ( +- 0.01% ) (45.46%)
168,555 cache-misses # 0.008 % of all cache refs ( +- 4.87% ) (45.47%)
2,490,226,446 branch-instructions # 316.974 M/sec ( +- 0.01% ) (45.48%)
117,604 branch-misses # 0.00% of all branches ( +- 1.56% ) (45.48%)
273,492 L1-dcache-load-misses # 0.06% of all L1-dcache accesses ( +- 2.14% ) (45.47%)
490,340,458 L1-dcache-loads # 62.414 M/sec ( +- 0.02% ) (45.45%)
20,517 L1-dcache-prefetches # 0.003 M/sec ( +- 9.61% ) (45.44%)
7,413 dTLB-loads # 0.944 K/sec ( +- 8.37% ) (45.44%)
2,031 dTLB-load-misses # 27.40% of all dTLB cache accesses ( +- 8.30% ) (45.43%)
7.85674 +- 0.00270 seconds time elapsed ( +- 0.03% )
The L1-dcache-load-misses (L2$ access from DC Miss) count is
substantially lower, which suggests we aren't doing write-allocate or
RFO. The L1-dcache-prefetches are also substantially lower.
Note that the IPC and instruction counts etc. are quite different, but
that's just an artifact of switching from a single 'REP; STOSQ' per
PAGE_SIZE region to a MOVNTI loop.
The page-clearing BW shows a ~40% improvement. Additionally, a quick
'perf bench memset' comparison on AMD Naples (AMD EPYC 7551) shows
similar performance gains. So, enable X86_FEATURE_NT_GOOD for
AMD Zen.
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
arch/x86/kernel/cpu/amd.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index dcc3d943c68f..c57eb6c28aa1 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -918,6 +918,9 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
{
set_cpu_cap(c, X86_FEATURE_ZEN);
+ if (c->x86 == 0x17)
+ set_cpu_cap(c, X86_FEATURE_NT_GOOD);
+
#ifdef CONFIG_NUMA
node_reclaim_distance = 32;
#endif
--
2.9.3
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
2020-10-14 8:32 ` [PATCH 5/8] x86/clear_page: add clear_page_uncached() Ankur Arora
@ 2020-10-14 11:10 ` kernel test robot
2020-10-14 13:04 ` kernel test robot
2020-10-14 15:45 ` Andy Lutomirski
2 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2020-10-14 11:10 UTC (permalink / raw)
To: Ankur Arora, linux-kernel, linux-mm
Cc: kbuild-all, kirill, mhocko, boris.ostrovsky, konrad.wilk,
Ankur Arora, Thomas Gleixner, Ingo Molnar, Borislav Petkov
[-- Attachment #1: Type: text/plain, Size: 10502 bytes --]
Hi Ankur,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on tip/master]
[also build test ERROR on linus/master next-20201013]
[cannot apply to tip/x86/core linux/master v5.9]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Ankur-Arora/Use-uncached-writes-while-clearing-gigantic-pages/20201014-163720
base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 80f92ca9b86c71450f003d39956fca4327cc5586
config: riscv-randconfig-r006-20201014 (attached as .config)
compiler: riscv32-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/6a1ec80588fc845c7ce6bd0e0e3635bf07d9110d
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Ankur-Arora/Use-uncached-writes-while-clearing-gigantic-pages/20201014-163720
git checkout 6a1ec80588fc845c7ce6bd0e0e3635bf07d9110d
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=riscv
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
In file included from net/socket.c:74:
include/linux/highmem.h: In function 'clear_user_highpage_uncached':
>> include/linux/highmem.h:240:2: error: implicit declaration of function 'clear_user_page_uncached'; did you mean 'clear_user_highpage_uncached'? [-Werror=implicit-function-declaration]
240 | clear_user_page_uncached(addr, vaddr, page);
| ^~~~~~~~~~~~~~~~~~~~~~~~
| clear_user_highpage_uncached
net/socket.c: In function '__sys_getsockopt':
net/socket.c:2155:6: warning: variable 'max_optlen' set but not used [-Wunused-but-set-variable]
2155 | int max_optlen;
| ^~~~~~~~~~
cc1: some warnings being treated as errors
--
In file included from include/linux/pagemap.h:11,
from include/linux/blkdev.h:13,
from include/linux/blk-cgroup.h:23,
from include/linux/writeback.h:14,
from include/linux/memcontrol.h:22,
from include/net/sock.h:53,
from net/sysctl_net.c:20:
include/linux/highmem.h: In function 'clear_user_highpage_uncached':
>> include/linux/highmem.h:240:2: error: implicit declaration of function 'clear_user_page_uncached'; did you mean 'clear_user_highpage_uncached'? [-Werror=implicit-function-declaration]
240 | clear_user_page_uncached(addr, vaddr, page);
| ^~~~~~~~~~~~~~~~~~~~~~~~
| clear_user_highpage_uncached
cc1: some warnings being treated as errors
--
In file included from include/linux/pagemap.h:11,
from include/linux/blkdev.h:13,
from include/linux/blk-cgroup.h:23,
from include/linux/writeback.h:14,
from include/linux/memcontrol.h:22,
from include/net/sock.h:53,
from include/linux/mroute_base.h:8,
from include/linux/mroute.h:10,
from net/ipv4/route.c:82:
include/linux/highmem.h: In function 'clear_user_highpage_uncached':
>> include/linux/highmem.h:240:2: error: implicit declaration of function 'clear_user_page_uncached'; did you mean 'clear_user_highpage_uncached'? [-Werror=implicit-function-declaration]
240 | clear_user_page_uncached(addr, vaddr, page);
| ^~~~~~~~~~~~~~~~~~~~~~~~
| clear_user_highpage_uncached
net/ipv4/route.c: In function 'ip_rt_send_redirect':
net/ipv4/route.c:878:6: warning: variable 'log_martians' set but not used [-Wunused-but-set-variable]
878 | int log_martians;
| ^~~~~~~~~~~~
cc1: some warnings being treated as errors
--
In file included from include/linux/pagemap.h:11,
from include/linux/blkdev.h:13,
from include/linux/blk-cgroup.h:23,
from include/linux/writeback.h:14,
from include/linux/memcontrol.h:22,
from include/net/sock.h:53,
from include/net/inet_sock.h:22,
from include/net/ip.h:28,
from net/ipv6/ip6_fib.c:28:
include/linux/highmem.h: In function 'clear_user_highpage_uncached':
>> include/linux/highmem.h:240:2: error: implicit declaration of function 'clear_user_page_uncached'; did you mean 'clear_user_highpage_uncached'? [-Werror=implicit-function-declaration]
240 | clear_user_page_uncached(addr, vaddr, page);
| ^~~~~~~~~~~~~~~~~~~~~~~~
| clear_user_highpage_uncached
net/ipv6/ip6_fib.c: In function 'fib6_add':
net/ipv6/ip6_fib.c:1373:25: warning: variable 'pn' set but not used [-Wunused-but-set-variable]
1373 | struct fib6_node *fn, *pn = NULL;
| ^~
cc1: some warnings being treated as errors
--
In file included from include/linux/pagemap.h:11,
from include/linux/blkdev.h:13,
from include/linux/blk-cgroup.h:23,
from include/linux/writeback.h:14,
from include/linux/memcontrol.h:22,
from include/net/sock.h:53,
from include/linux/tcp.h:19,
from include/linux/ipv6.h:88,
from include/linux/netfilter/ipset/ip_set.h:11,
from net/netfilter/ipset/ip_set_core.c:23:
include/linux/highmem.h: In function 'clear_user_highpage_uncached':
>> include/linux/highmem.h:240:2: error: implicit declaration of function 'clear_user_page_uncached'; did you mean 'clear_user_highpage_uncached'? [-Werror=implicit-function-declaration]
240 | clear_user_page_uncached(addr, vaddr, page);
| ^~~~~~~~~~~~~~~~~~~~~~~~
| clear_user_highpage_uncached
net/netfilter/ipset/ip_set_core.c: In function 'ip_set_rename':
net/netfilter/ipset/ip_set_core.c:1363:2: warning: 'strncpy' specified bound 32 equals destination size [-Wstringop-truncation]
1363 | strncpy(set->name, name2, IPSET_MAXNAMELEN);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
--
In file included from include/linux/pagemap.h:11,
from include/linux/blkdev.h:13,
from include/linux/blk-cgroup.h:23,
from include/linux/writeback.h:14,
from include/linux/memcontrol.h:22,
from include/net/sock.h:53,
from net/nfc/nci/../nfc.h:14,
from net/nfc/nci/hci.c:13:
include/linux/highmem.h: In function 'clear_user_highpage_uncached':
>> include/linux/highmem.h:240:2: error: implicit declaration of function 'clear_user_page_uncached'; did you mean 'clear_user_highpage_uncached'? [-Werror=implicit-function-declaration]
240 | clear_user_page_uncached(addr, vaddr, page);
| ^~~~~~~~~~~~~~~~~~~~~~~~
| clear_user_highpage_uncached
net/nfc/nci/hci.c: In function 'nci_hci_resp_received':
net/nfc/nci/hci.c:369:5: warning: variable 'status' set but not used [-Wunused-but-set-variable]
369 | u8 status = result;
| ^~~~~~
cc1: some warnings being treated as errors
--
In file included from include/linux/pagemap.h:11,
from include/linux/blkdev.h:13,
from include/linux/blk-cgroup.h:23,
from include/linux/writeback.h:14,
from include/linux/memcontrol.h:22,
from include/net/sock.h:53,
from include/linux/tcp.h:19,
from include/linux/ipv6.h:88,
from include/net/ipv6.h:12,
from net/ipv6/netfilter/nf_reject_ipv6.c:7:
include/linux/highmem.h: In function 'clear_user_highpage_uncached':
>> include/linux/highmem.h:240:2: error: implicit declaration of function 'clear_user_page_uncached'; did you mean 'clear_user_highpage_uncached'? [-Werror=implicit-function-declaration]
240 | clear_user_page_uncached(addr, vaddr, page);
| ^~~~~~~~~~~~~~~~~~~~~~~~
| clear_user_highpage_uncached
net/ipv6/netfilter/nf_reject_ipv6.c: In function 'nf_send_reset6':
net/ipv6/netfilter/nf_reject_ipv6.c:152:18: warning: variable 'ip6h' set but not used [-Wunused-but-set-variable]
152 | struct ipv6hdr *ip6h;
| ^~~~
cc1: some warnings being treated as errors
--
In file included from include/linux/pagemap.h:11,
from include/linux/blkdev.h:13,
from include/linux/blk-cgroup.h:23,
from include/linux/writeback.h:14,
from include/linux/memcontrol.h:22,
from include/net/sock.h:53,
from include/linux/tcp.h:19,
from net/netfilter/ipvs/ip_vs_core.c:28:
include/linux/highmem.h: In function 'clear_user_highpage_uncached':
>> include/linux/highmem.h:240:2: error: implicit declaration of function 'clear_user_page_uncached'; did you mean 'clear_user_highpage_uncached'? [-Werror=implicit-function-declaration]
240 | clear_user_page_uncached(addr, vaddr, page);
| ^~~~~~~~~~~~~~~~~~~~~~~~
| clear_user_highpage_uncached
net/netfilter/ipvs/ip_vs_core.c: In function 'ip_vs_in_icmp':
net/netfilter/ipvs/ip_vs_core.c:1660:8: warning: variable 'outer_proto' set but not used [-Wunused-but-set-variable]
1660 | char *outer_proto = "IPIP";
| ^~~~~~~~~~~
cc1: some warnings being treated as errors
vim +240 include/linux/highmem.h
234
235 #ifndef clear_user_highpage_uncached
236 static inline void clear_user_highpage_uncached(struct page *page, unsigned long vaddr)
237 {
238 void *addr = kmap_atomic(page);
239
> 240 clear_user_page_uncached(addr, vaddr, page);
241 kunmap_atomic(addr);
242 }
243 #endif
244
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 34217 bytes --]
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
2020-10-14 8:32 ` [PATCH 5/8] x86/clear_page: add clear_page_uncached() Ankur Arora
2020-10-14 11:10 ` kernel test robot
@ 2020-10-14 13:04 ` kernel test robot
2020-10-14 15:45 ` Andy Lutomirski
2 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2020-10-14 13:04 UTC (permalink / raw)
To: Ankur Arora, linux-kernel, linux-mm
Cc: kbuild-all, clang-built-linux, kirill, mhocko, boris.ostrovsky,
konrad.wilk, Ankur Arora, Thomas Gleixner, Ingo Molnar,
Borislav Petkov
[-- Attachment #1: Type: text/plain, Size: 3517 bytes --]
Hi Ankur,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on tip/master]
[also build test ERROR on linus/master next-20201013]
[cannot apply to tip/x86/core linux/master v5.9]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Ankur-Arora/Use-uncached-writes-while-clearing-gigantic-pages/20201014-163720
base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 80f92ca9b86c71450f003d39956fca4327cc5586
config: arm64-randconfig-r001-20201014 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project e7fe3c6dfede8d5781bd000741c3dea7088307a4)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install arm64 cross compiling tool for clang build
# apt-get install binutils-aarch64-linux-gnu
# https://github.com/0day-ci/linux/commit/6a1ec80588fc845c7ce6bd0e0e3635bf07d9110d
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Ankur-Arora/Use-uncached-writes-while-clearing-gigantic-pages/20201014-163720
git checkout 6a1ec80588fc845c7ce6bd0e0e3635bf07d9110d
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=arm64
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
In file included from arch/arm64/kernel/asm-offsets.c:16:
In file included from include/linux/suspend.h:5:
In file included from include/linux/swap.h:9:
In file included from include/linux/memcontrol.h:22:
In file included from include/linux/writeback.h:14:
In file included from include/linux/blk-cgroup.h:23:
In file included from include/linux/blkdev.h:13:
In file included from include/linux/pagemap.h:11:
>> include/linux/highmem.h:240:2: error: implicit declaration of function 'clear_user_page_uncached' [-Werror,-Wimplicit-function-declaration]
clear_user_page_uncached(addr, vaddr, page);
^
include/linux/highmem.h:240:2: note: did you mean 'clear_user_highpage_uncached'?
include/linux/highmem.h:236:20: note: 'clear_user_highpage_uncached' declared here
static inline void clear_user_highpage_uncached(struct page *page, unsigned long vaddr)
^
1 error generated.
make[2]: *** [scripts/Makefile.build:117: arch/arm64/kernel/asm-offsets.s] Error 1
make[2]: Target '__build' not remade because of errors.
make[1]: *** [Makefile:1198: prepare0] Error 2
make[1]: Target 'prepare' not remade because of errors.
make: *** [Makefile:185: __sub-make] Error 2
make: Target 'prepare' not remade because of errors.
vim +/clear_user_page_uncached +240 include/linux/highmem.h
234
235 #ifndef clear_user_highpage_uncached
236 static inline void clear_user_highpage_uncached(struct page *page, unsigned long vaddr)
237 {
238 void *addr = kmap_atomic(page);
239
> 240 clear_user_page_uncached(addr, vaddr, page);
241 kunmap_atomic(addr);
242 }
243 #endif
244
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 38942 bytes --]
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 6/8] mm, clear_huge_page: use clear_page_uncached() for gigantic pages
2020-10-14 8:32 ` [PATCH 6/8] mm, clear_huge_page: use clear_page_uncached() for gigantic pages Ankur Arora
@ 2020-10-14 15:28 ` Ingo Molnar
2020-10-14 19:15 ` Ankur Arora
0 siblings, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2020-10-14 15:28 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, linux-mm, kirill, mhocko, boris.ostrovsky,
konrad.wilk, Andrew Morton
* Ankur Arora <ankur.a.arora@oracle.com> wrote:
> Uncached writes are suitable for circumstances where the region written to
> is not expected to be read again soon, or the region written to is large
> enough that there's no expectation that we will find the writes in the
> cache.
>
> Accordingly switch to using clear_page_uncached() for gigantic pages.
>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
> mm/memory.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index eeae590e526a..4d2c58f83ab1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5092,7 +5092,7 @@ static void clear_gigantic_page(struct page *page,
> for (i = 0; i < pages_per_huge_page;
> i++, p = mem_map_next(p, page, i)) {
> cond_resched();
> - clear_user_highpage(p, addr + i * PAGE_SIZE);
> + clear_user_highpage_uncached(p, addr + i * PAGE_SIZE);
> }
> }
So this does the clearing in 4K chunks, and your measurements suggest that
short memory clearing is not as efficient, right?
I'm wondering whether it would make sense to do 2MB chunked clearing on
64-bit CPUs, instead of 512x 4k clearing? Both 2MB and GB pages are
contiguous in memory, so accessible to these instructions in a single
narrow loop.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/8] x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx
2020-10-14 8:32 ` [PATCH 7/8] x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx Ankur Arora
@ 2020-10-14 15:31 ` Ingo Molnar
2020-10-14 19:23 ` Ankur Arora
0 siblings, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2020-10-14 15:31 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, linux-mm, kirill, mhocko, boris.ostrovsky,
konrad.wilk, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
H. Peter Anvin, Tony Luck, Sean Christopherson, Mike Rapoport,
Xiaoyao Li, Fenghua Yu, Peter Zijlstra (Intel),
Dave Hansen
* Ankur Arora <ankur.a.arora@oracle.com> wrote:
> System: Oracle X6-2
> CPU: 2 nodes * 10 cores/node * 2 threads/core
> Intel Xeon E5-2630 v4 (Broadwellx, 6:79:1)
> Memory: 256 GB evenly split between nodes
> Microcode: 0xb00002e
> scaling_governor: performance
> L3 size: 25MB
> intel_pstate/no_turbo: 1
>
> Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
> (X86_FEATURE_ERMS) and x86-64-movnt (X86_FEATURE_NT_GOOD):
>
> x86-64-stosb (5 runs) x86-64-movnt (5 runs) speedup
> ----------------------- ----------------------- -------
> size BW ( pstdev) BW ( pstdev)
>
> 16MB 17.35 GB/s ( +- 9.27%) 11.83 GB/s ( +- 0.19%) -31.81%
> 128MB 5.31 GB/s ( +- 0.13%) 11.72 GB/s ( +- 0.44%) +121.84%
> 1024MB 5.42 GB/s ( +- 0.13%) 11.78 GB/s ( +- 0.03%) +117.34%
> 4096MB 5.41 GB/s ( +- 0.41%) 11.76 GB/s ( +- 0.07%) +117.37%
> + if (c->x86 == 6 && c->x86_model == INTEL_FAM6_BROADWELL_X)
> + set_cpu_cap(c, X86_FEATURE_NT_GOOD);
So while I agree with how you've done careful measurements to isolate bad
microarchitectures where non-temporal stores are slow, I do think this
opt-in approach doesn't scale and is hard to maintain.
Instead I'd suggest enabling this by default everywhere, and creating a
X86_FEATURE_NT_BAD quirk table for the bad microarchitectures.
This means that with new microarchitectures we'd get automatic enablement,
and hopefully chip testing would identify cases where performance isn't as
good.
I.e. the 'trust but verify' method.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
2020-10-14 8:32 ` [PATCH 5/8] x86/clear_page: add clear_page_uncached() Ankur Arora
2020-10-14 11:10 ` kernel test robot
2020-10-14 13:04 ` kernel test robot
@ 2020-10-14 15:45 ` Andy Lutomirski
2020-10-14 19:58 ` Borislav Petkov
2020-10-14 20:54 ` Ankur Arora
2 siblings, 2 replies; 29+ messages in thread
From: Andy Lutomirski @ 2020-10-14 15:45 UTC (permalink / raw)
To: Ankur Arora
Cc: LKML, Linux-MM, Kirill A. Shutemov, Michal Hocko,
Boris Ostrovsky, Konrad Rzeszutek Wilk, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, X86 ML, H. Peter Anvin,
Arnd Bergmann, Andrew Morton, Ira Weiny, linux-arch
On Wed, Oct 14, 2020 at 1:33 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> Define clear_page_uncached() as an alternative_call() to clear_page_nt()
> if the CPU sets X86_FEATURE_NT_GOOD and fall back to clear_page() if it
> doesn't.
>
> Similarly define clear_page_uncached_flush() which provides an SFENCE
> if the CPU sets X86_FEATURE_NT_GOOD.
As long as you keep "NT" or "MOVNTI" in the names and keep functions
in arch/x86, I think it's reasonable to expect that callers understand
that MOVNTI has bizarre memory ordering rules. But once you give
something a generic name like "clear_page_uncached" and stick it in
generic code, I think the semantics should be more obvious.
How about:
clear_page_uncached_unordered() or clear_page_uncached_incoherent()
and
flush_after_clear_page_uncached()
After all, a naive reader might expect "uncached" to imply "caches are
off and this is coherent with everything". And the results of getting
this wrong will be subtle and possibly hard-to-reproduce corruption.
--Andy
* Re: [PATCH 6/8] mm, clear_huge_page: use clear_page_uncached() for gigantic pages
2020-10-14 15:28 ` Ingo Molnar
@ 2020-10-14 19:15 ` Ankur Arora
0 siblings, 0 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-14 19:15 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, linux-mm, kirill, mhocko, boris.ostrovsky,
konrad.wilk, Andrew Morton
On 2020-10-14 8:28 a.m., Ingo Molnar wrote:
>
> * Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Uncached writes are suitable for circumstances where the region written to
>> is not expected to be read again soon, or the region written to is large
>> enough that there's no expectation that we will find the writes in the
>> cache.
>>
>> Accordingly switch to using clear_page_uncached() for gigantic pages.
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>> mm/memory.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index eeae590e526a..4d2c58f83ab1 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -5092,7 +5092,7 @@ static void clear_gigantic_page(struct page *page,
>> for (i = 0; i < pages_per_huge_page;
>> i++, p = mem_map_next(p, page, i)) {
>> cond_resched();
>> - clear_user_highpage(p, addr + i * PAGE_SIZE);
>> + clear_user_highpage_uncached(p, addr + i * PAGE_SIZE);
>> }
>> }
>
> So this does the clearing in 4K chunks, and your measurements suggest that
> short memory clearing is not as efficient, right?
I did not measure that separately (though I should), but the performance numbers
around that were somewhat puzzling.
For MOVNTI, the performance via perf bench (a single call to memset_movnti())
is pretty close (within margin of error) to what we see with the page-fault
workload (4K chunks in clear_page_nt()).
With 'REP;STOS' though, there's a degradation (~30% Broadwell, ~5% Rome)
between perf bench (a single call to memset_erms()) and the page-fault
workload (4K chunks in clear_page_erms()).
In the page-fault case we execute many more 'REP;STOS' invocations, while the
total instruction count is pretty much the same, so maybe that's what accounts
for it. But I checked, and we are not frontend bound. Maybe 'REP;STOS' has
high setup costs on Broadwell? It does advertise X86_FEATURE_ERMS though...
>
> I'm wondering whether it would make sense to do 2MB chunked clearing on
> 64-bit CPUs, instead of 512x 4k clearing? Both 2MB and GB pages are
> continuous in memory, so accessible to these instructions in a single
> narrow loop.
Yeah, I think it makes sense and should be quite straightforward as well.
I'll try that out. I suspect it might help the X86_FEATURE_NT_BAD models
more, but there's no reason for it to hurt anywhere.
Ankur
>
> Thanks,
>
> Ingo
>
* Re: [PATCH 7/8] x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx
2020-10-14 15:31 ` Ingo Molnar
@ 2020-10-14 19:23 ` Ankur Arora
0 siblings, 0 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-14 19:23 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, linux-mm, kirill, mhocko, boris.ostrovsky,
konrad.wilk, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
H. Peter Anvin, Tony Luck, Sean Christopherson, Mike Rapoport,
Xiaoyao Li, Fenghua Yu, Peter Zijlstra (Intel),
Dave Hansen
On 2020-10-14 8:31 a.m., Ingo Molnar wrote:
>
> * Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> System: Oracle X6-2
>> CPU: 2 nodes * 10 cores/node * 2 threads/core
>> Intel Xeon E5-2630 v4 (Broadwellx, 6:79:1)
>> Memory: 256 GB evenly split between nodes
>> Microcode: 0xb00002e
>> scaling_governor: performance
>> L3 size: 25MB
>> intel_pstate/no_turbo: 1
>>
>> Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
>> (X86_FEATURE_ERMS) and x86-64-movnt (X86_FEATURE_NT_GOOD):
>>
>> x86-64-stosb (5 runs) x86-64-movnt (5 runs) speedup
>> ----------------------- ----------------------- -------
>> size BW ( pstdev) BW ( pstdev)
>>
>> 16MB 17.35 GB/s ( +- 9.27%) 11.83 GB/s ( +- 0.19%) -31.81%
>> 128MB 5.31 GB/s ( +- 0.13%) 11.72 GB/s ( +- 0.44%) +121.84%
>> 1024MB 5.42 GB/s ( +- 0.13%) 11.78 GB/s ( +- 0.03%) +117.34%
>> 4096MB 5.41 GB/s ( +- 0.41%) 11.76 GB/s ( +- 0.07%) +117.37%
>
>> + if (c->x86 == 6 && c->x86_model == INTEL_FAM6_BROADWELL_X)
>> + set_cpu_cap(c, X86_FEATURE_NT_GOOD);
>
> So while I agree with how you've done careful measurements to isolate bad
> microarchitectures where non-temporal stores are slow, I do think this
> opt-in approach doesn't scale and is hard to maintain.
>
> Instead I'd suggest enabling this by default everywhere, and creating a
> X86_FEATURE_NT_BAD quirk table for the bad microarchitectures.
Okay, some kind of quirk table is a great idea. It also means that there's a
single place for keeping this rather than it being scattered all over the
code.
That also simplifies my handling of features like X86_FEATURE_CLZERO. I was
concerned that, if you squint a bit, it looks like an alias of
X86_FEATURE_NT_GOOD, and that seemed ugly.
>
> This means that with new microarchitectures we'd get automatic enablement,
> and hopefully chip testing would identify cases where performance isn't as
> good.
Makes sense to me. A first class citizen, as it were...
Thanks for reviewing btw.
Ankur
>
> I.e. the 'trust but verify' method.
>
> Thanks,
>
> Ingo
>
* Re: [PATCH 4/8] x86/asm: add clear_page_nt()
2020-10-14 8:32 ` [PATCH 4/8] x86/asm: add clear_page_nt() Ankur Arora
@ 2020-10-14 19:56 ` Borislav Petkov
2020-10-14 21:11 ` Ankur Arora
0 siblings, 1 reply; 29+ messages in thread
From: Borislav Petkov @ 2020-10-14 19:56 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, linux-mm, kirill, mhocko, boris.ostrovsky,
konrad.wilk, Thomas Gleixner, Ingo Molnar, x86, H. Peter Anvin,
Jiri Slaby, Herbert Xu, Rafael J. Wysocki
On Wed, Oct 14, 2020 at 01:32:55AM -0700, Ankur Arora wrote:
> This can potentially improve page-clearing bandwidth (see below for
> performance numbers for two microarchitectures where it helps and one
> where it doesn't) and can help indirectly by consuming less cache
> resources.
>
> Any performance benefits are expected for extents larger than LLC-sized
> or more -- when we are DRAM-BW constrained rather than cache-BW
> constrained.
"potentially", "expected", I don't like those formulations. Do you have
some actual benchmark data where this shows any improvement and not
microbenchmarks only, to warrant the additional complexity?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
2020-10-14 15:45 ` Andy Lutomirski
@ 2020-10-14 19:58 ` Borislav Petkov
2020-10-14 21:07 ` Andy Lutomirski
2020-10-14 20:54 ` Ankur Arora
1 sibling, 1 reply; 29+ messages in thread
From: Borislav Petkov @ 2020-10-14 19:58 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Ankur Arora, LKML, Linux-MM, Kirill A. Shutemov, Michal Hocko,
Boris Ostrovsky, Konrad Rzeszutek Wilk, Thomas Gleixner,
Ingo Molnar, X86 ML, H. Peter Anvin, Arnd Bergmann,
Andrew Morton, Ira Weiny, linux-arch
On Wed, Oct 14, 2020 at 08:45:37AM -0700, Andy Lutomirski wrote:
> On Wed, Oct 14, 2020 at 1:33 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >
> > Define clear_page_uncached() as an alternative_call() to clear_page_nt()
> > if the CPU sets X86_FEATURE_NT_GOOD and fall back to clear_page() if it
> > doesn't.
> >
> > Similarly define clear_page_uncached_flush() which provides an SFENCE
> > if the CPU sets X86_FEATURE_NT_GOOD.
>
> As long as you keep "NT" or "MOVNTI" in the names and keep functions
> in arch/x86, I think it's reasonable to expect that callers understand
> that MOVNTI has bizarre memory ordering rules. But once you give
> something a generic name like "clear_page_uncached" and stick it in
> generic code, I think the semantics should be more obvious.
Why does it have to be a separate call? Why isn't it behind the
clear_page() alternative machinery so that the proper function is
selected at boot? IOW, why does a user of clear_page functionality need
to know at all about an "uncached" variant?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
2020-10-14 15:45 ` Andy Lutomirski
2020-10-14 19:58 ` Borislav Petkov
@ 2020-10-14 20:54 ` Ankur Arora
1 sibling, 0 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-14 20:54 UTC (permalink / raw)
To: Andy Lutomirski
Cc: LKML, Linux-MM, Kirill A. Shutemov, Michal Hocko,
Boris Ostrovsky, Konrad Rzeszutek Wilk, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, X86 ML, H. Peter Anvin,
Arnd Bergmann, Andrew Morton, Ira Weiny, linux-arch
On 2020-10-14 8:45 a.m., Andy Lutomirski wrote:
> On Wed, Oct 14, 2020 at 1:33 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> Define clear_page_uncached() as an alternative_call() to clear_page_nt()
>> if the CPU sets X86_FEATURE_NT_GOOD and fall back to clear_page() if it
>> doesn't.
>>
>> Similarly define clear_page_uncached_flush() which provides an SFENCE
>> if the CPU sets X86_FEATURE_NT_GOOD.
>
> As long as you keep "NT" or "MOVNTI" in the names and keep functions
> in arch/x86, I think it's reasonable to expect that callers understand
> that MOVNTI has bizarre memory ordering rules. But once you give
> something a generic name like "clear_page_uncached" and stick it in
> generic code, I think the semantics should be more obvious.
>
> How about:
>
> clear_page_uncached_unordered() or clear_page_uncached_incoherent()
>
> and
>
> flush_after_clear_page_uncached()
>
> After all, a naive reader might expect "uncached" to imply "caches are
> off and this is coherent with everything". And the results of getting
> this wrong will be subtle and possibly hard-to-reproduce corruption.
Yeah, these are a lot more obvious. Thanks. Will fix.
Ankur
>
> --Andy
>
* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
2020-10-14 19:58 ` Borislav Petkov
@ 2020-10-14 21:07 ` Andy Lutomirski
2020-10-14 21:12 ` Borislav Petkov
2020-10-15 3:21 ` Ankur Arora
0 siblings, 2 replies; 29+ messages in thread
From: Andy Lutomirski @ 2020-10-14 21:07 UTC (permalink / raw)
To: Borislav Petkov
Cc: Andy Lutomirski, Ankur Arora, LKML, Linux-MM, Kirill A. Shutemov,
Michal Hocko, Boris Ostrovsky, Konrad Rzeszutek Wilk,
Thomas Gleixner, Ingo Molnar, X86 ML, H. Peter Anvin,
Arnd Bergmann, Andrew Morton, Ira Weiny, linux-arch
> On Oct 14, 2020, at 12:58 PM, Borislav Petkov <bp@alien8.de> wrote:
>
> On Wed, Oct 14, 2020 at 08:45:37AM -0700, Andy Lutomirski wrote:
>>> On Wed, Oct 14, 2020 at 1:33 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>>
>>> Define clear_page_uncached() as an alternative_call() to clear_page_nt()
>>> if the CPU sets X86_FEATURE_NT_GOOD and fall back to clear_page() if it
>>> doesn't.
>>>
>>> Similarly define clear_page_uncached_flush() which provides an SFENCE
>>> if the CPU sets X86_FEATURE_NT_GOOD.
>>
>> As long as you keep "NT" or "MOVNTI" in the names and keep functions
>> in arch/x86, I think it's reasonable to expect that callers understand
>> that MOVNTI has bizarre memory ordering rules. But once you give
>> something a generic name like "clear_page_uncached" and stick it in
>> generic code, I think the semantics should be more obvious.
>
> Why does it have to be a separate call? Why isn't it behind the
> clear_page() alternative machinery so that the proper function is
> selected at boot? IOW, why does a user of clear_page functionality need
> to know at all about an "uncached" variant?
>
>
I assume it’s for a little optimization of clearing more than one page per SFENCE.
In any event, based on the benchmark data upthread, we only want to do NT clears when they’re rather large, so this shouldn’t be just an alternative. I assume this is because a page or two will fit in cache and, for most uses that allocate zeroed pages, we prefer cache-hot pages. When clearing 1G, on the other hand, cache-hot is impossible and we prefer the improved bandwidth and less cache thrashing of NT clears.
Perhaps SFENCE is so fast that this is a silly optimization, though, and we don’t lose anything measurable by SFENCEing once per page.
* Re: [PATCH 4/8] x86/asm: add clear_page_nt()
2020-10-14 19:56 ` Borislav Petkov
@ 2020-10-14 21:11 ` Ankur Arora
0 siblings, 0 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-14 21:11 UTC (permalink / raw)
To: Borislav Petkov
Cc: linux-kernel, linux-mm, kirill, mhocko, boris.ostrovsky,
konrad.wilk, Thomas Gleixner, Ingo Molnar, x86, H. Peter Anvin,
Jiri Slaby, Herbert Xu, Rafael J. Wysocki
On 2020-10-14 12:56 p.m., Borislav Petkov wrote:
> On Wed, Oct 14, 2020 at 01:32:55AM -0700, Ankur Arora wrote:
>> This can potentially improve page-clearing bandwidth (see below for
>> performance numbers for two microarchitectures where it helps and one
>> where it doesn't) and can help indirectly by consuming less cache
>> resources.
>>
>> Any performance benefits are expected for extents larger than LLC-sized
>> or more -- when we are DRAM-BW constrained rather than cache-BW
>> constrained.
>
> "potentially", "expected", I don't like those formulations.
That's fair. The reason for those weasel words is mostly that the behaviour
is microarchitecture specific.
For example, on Intel, where I did compare across generations, I see good
performance on Broadwellx, not good on Skylakex, and then good again on
some pre-production CPUs.
> Do you have
> some actual benchmark data where this shows any improvement and not
> microbenchmarks only, to warrant the additional complexity?
Yes, guest creation under QEMU (pinned guests) shows similar improvements.
I've posted performance numbers in patches 7, 8 with a simple page-fault
test derived from that.
I can add numbers from QEMU as well.
Thanks,
Ankur
>
* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
2020-10-14 21:07 ` Andy Lutomirski
@ 2020-10-14 21:12 ` Borislav Petkov
2020-10-15 3:37 ` Ankur Arora
2020-10-15 3:21 ` Ankur Arora
1 sibling, 1 reply; 29+ messages in thread
From: Borislav Petkov @ 2020-10-14 21:12 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Andy Lutomirski, Ankur Arora, LKML, Linux-MM, Kirill A. Shutemov,
Michal Hocko, Boris Ostrovsky, Konrad Rzeszutek Wilk,
Thomas Gleixner, Ingo Molnar, X86 ML, H. Peter Anvin,
Arnd Bergmann, Andrew Morton, Ira Weiny, linux-arch
On Wed, Oct 14, 2020 at 02:07:30PM -0700, Andy Lutomirski wrote:
> I assume it’s for a little optimization of clearing more than one
> page per SFENCE.
>
> In any event, based on the benchmark data upthread, we only want to do
> NT clears when they’re rather large, so this shouldn’t be just an
> alternative. I assume this is because a page or two will fit in cache
> and, for most uses that allocate zeroed pages, we prefer cache-hot
> pages. When clearing 1G, on the other hand, cache-hot is impossible
> and we prefer the improved bandwidth and less cache thrashing of NT
> clears.
Yeah, the use case makes sense, but people won't know what to use. At the
time I was experimenting with this crap, I remember Linus saying the
selection should be made based on the size of the area cleared, so
users should not have to know the difference.
Which perhaps is the only sane use case I see for this.
> Perhaps SFENCE is so fast that this is a silly optimization, though,
> and we don’t lose anything measurable by SFENCEing once per page.
Yes, I'd like to see real use cases showing improvement from this, not
just microbenchmarks.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
2020-10-14 21:07 ` Andy Lutomirski
2020-10-14 21:12 ` Borislav Petkov
@ 2020-10-15 3:21 ` Ankur Arora
2020-10-15 10:40 ` Borislav Petkov
1 sibling, 1 reply; 29+ messages in thread
From: Ankur Arora @ 2020-10-15 3:21 UTC (permalink / raw)
To: Andy Lutomirski, Borislav Petkov
Cc: Andy Lutomirski, LKML, Linux-MM, Kirill A. Shutemov,
Michal Hocko, Boris Ostrovsky, Konrad Rzeszutek Wilk,
Thomas Gleixner, Ingo Molnar, X86 ML, H. Peter Anvin,
Arnd Bergmann, Andrew Morton, Ira Weiny, linux-arch
On 2020-10-14 2:07 p.m., Andy Lutomirski wrote:
>
>
>
>> On Oct 14, 2020, at 12:58 PM, Borislav Petkov <bp@alien8.de> wrote:
>>
>> On Wed, Oct 14, 2020 at 08:45:37AM -0700, Andy Lutomirski wrote:
>>>> On Wed, Oct 14, 2020 at 1:33 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>>>
>>>> Define clear_page_uncached() as an alternative_call() to clear_page_nt()
>>>> if the CPU sets X86_FEATURE_NT_GOOD and fall back to clear_page() if it
>>>> doesn't.
>>>>
>>>> Similarly define clear_page_uncached_flush() which provides an SFENCE
>>>> if the CPU sets X86_FEATURE_NT_GOOD.
>>>
>>> As long as you keep "NT" or "MOVNTI" in the names and keep functions
>>> in arch/x86, I think it's reasonable to expect that callers understand
>>> that MOVNTI has bizarre memory ordering rules. But once you give
>>> something a generic name like "clear_page_uncached" and stick it in
>>> generic code, I think the semantics should be more obvious.
>>
>> Why does it have to be a separate call? Why isn't it behind the
>> clear_page() alternative machinery so that the proper function is
>> selected at boot? IOW, why does a user of clear_page functionality need
>> to know at all about an "uncached" variant?
>
> I assume it’s for a little optimization of clearing more than one page
> per SFENCE.
>
> In any event, based on the benchmark data upthread, we only want to do
> NT clears when they’re rather large, so this shouldn’t be just an
> alternative. I assume this is because a page or two will fit in cache
> and, for most uses that allocate zeroed pages, we prefer cache-hot
> pages. When clearing 1G, on the other hand, cache-hot is impossible
> and we prefer the improved bandwidth and less cache thrashing of NT
> clears.
Also, if we did extend clear_page() to take the page size as a parameter,
we still might not have enough information (e.g. a 4K or a 2MB page that
clear_page() sees could be part of a GUP of a much larger extent) to
decide whether to go uncached or not.
> Perhaps SFENCE is so fast that this is a silly optimization, though,
> and we don’t lose anything measurable by SFENCEing once per page.
Alas, no. I tried that and dropped about 15% performance on Rome.
Thanks
Ankur
* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
2020-10-14 21:12 ` Borislav Petkov
@ 2020-10-15 3:37 ` Ankur Arora
2020-10-15 10:35 ` Borislav Petkov
0 siblings, 1 reply; 29+ messages in thread
From: Ankur Arora @ 2020-10-15 3:37 UTC (permalink / raw)
To: Borislav Petkov, Andy Lutomirski
Cc: Andy Lutomirski, LKML, Linux-MM, Kirill A. Shutemov,
Michal Hocko, Boris Ostrovsky, Konrad Rzeszutek Wilk,
Thomas Gleixner, Ingo Molnar, X86 ML, H. Peter Anvin,
Arnd Bergmann, Andrew Morton, Ira Weiny, linux-arch
On 2020-10-14 2:12 p.m., Borislav Petkov wrote:
> On Wed, Oct 14, 2020 at 02:07:30PM -0700, Andy Lutomirski wrote:
>> I assume it’s for a little optimization of clearing more than one
>> page per SFENCE.
>>
>> In any event, based on the benchmark data upthread, we only want to do
>> NT clears when they’re rather large, so this shouldn’t be just an
>> alternative. I assume this is because a page or two will fit in cache
>> and, for most uses that allocate zeroed pages, we prefer cache-hot
>> pages. When clearing 1G, on the other hand, cache-hot is impossible
>> and we prefer the improved bandwidth and less cache thrashing of NT
>> clears.
>
> Yeah, the use case makes sense, but people won't know what to use. At the
> time I was experimenting with this crap, I remember Linus saying the
> selection should be made based on the size of the area cleared, so
> users should not have to know the difference.
I don't disagree, but I think the cached/uncached selection should be made
where we have enough context available to make that choice.
This could be for example, done in mm_populate() or gup where if say the
extent is larger than LLC-size, it takes the uncached path.
>
> Which perhaps is the only sane use case I see for this.
>
>> Perhaps SFENCE is so fast that this is a silly optimization, though,
>> and we don’t lose anything measurable by SFENCEing once per page.
>
> Yes, I'd like to see real use cases showing improvement from this, not
> just microbenchmarks.
Sure will add.
Thanks
Ankur
>
* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
2020-10-15 3:37 ` Ankur Arora
@ 2020-10-15 10:35 ` Borislav Petkov
2020-10-15 21:20 ` Ankur Arora
0 siblings, 1 reply; 29+ messages in thread
From: Borislav Petkov @ 2020-10-15 10:35 UTC (permalink / raw)
To: Ankur Arora
Cc: Andy Lutomirski, Andy Lutomirski, LKML, Linux-MM,
Kirill A. Shutemov, Michal Hocko, Boris Ostrovsky,
Konrad Rzeszutek Wilk, Thomas Gleixner, Ingo Molnar, X86 ML,
H. Peter Anvin, Arnd Bergmann, Andrew Morton, Ira Weiny,
linux-arch
On Wed, Oct 14, 2020 at 08:37:44PM -0700, Ankur Arora wrote:
> I don't disagree, but I think the cached/uncached selection should be made
> where we have enough context available to make that choice.
>
> This could be for example, done in mm_populate() or gup where if say the
> extent is larger than LLC-size, it takes the uncached path.
Are there examples where we don't know the size?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
2020-10-15 3:21 ` Ankur Arora
@ 2020-10-15 10:40 ` Borislav Petkov
2020-10-15 21:40 ` Ankur Arora
0 siblings, 1 reply; 29+ messages in thread
From: Borislav Petkov @ 2020-10-15 10:40 UTC (permalink / raw)
To: Ankur Arora
Cc: Andy Lutomirski, Andy Lutomirski, LKML, Linux-MM,
Kirill A. Shutemov, Michal Hocko, Boris Ostrovsky,
Konrad Rzeszutek Wilk, Thomas Gleixner, Ingo Molnar, X86 ML,
H. Peter Anvin, Arnd Bergmann, Andrew Morton, Ira Weiny,
linux-arch
On Wed, Oct 14, 2020 at 08:21:57PM -0700, Ankur Arora wrote:
> Also, if we did extend clear_page() to take the page size as a parameter,
> we still might not have enough information (e.g. a 4K or a 2MB page that
> clear_page() sees could be part of a GUP of a much larger extent) to
> decide whether to go uncached or not.
clear_page* assumes 4K. All of the lowlevel asm variants do. So adding
the size there won't bring you a whole lot.
So you'd need to devise this whole thing differently. Perhaps have a
clear_pages() helper which decides based on size what to do: uncached
clearing or the clear_page() as is now in a loop.
Looking at the callsites would give you a better idea I'd say.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
2020-10-15 10:35 ` Borislav Petkov
@ 2020-10-15 21:20 ` Ankur Arora
2020-10-16 18:21 ` Borislav Petkov
0 siblings, 1 reply; 29+ messages in thread
From: Ankur Arora @ 2020-10-15 21:20 UTC (permalink / raw)
To: Borislav Petkov
Cc: Andy Lutomirski, Andy Lutomirski, LKML, Linux-MM,
Kirill A. Shutemov, Michal Hocko, Boris Ostrovsky,
Konrad Rzeszutek Wilk, Thomas Gleixner, Ingo Molnar, X86 ML,
H. Peter Anvin, Arnd Bergmann, Andrew Morton, Ira Weiny,
linux-arch
On 2020-10-15 3:35 a.m., Borislav Petkov wrote:
> On Wed, Oct 14, 2020 at 08:37:44PM -0700, Ankur Arora wrote:
>> I don't disagree, but I think the cached/uncached selection should be made
>> where we have enough context available to make that choice.
>>
>> This could be for example, done in mm_populate() or gup where if say the
>> extent is larger than LLC-size, it takes the uncached path.
>
> Are there examples where we don't know the size?
The case I was thinking of was that clear_huge_page() or faultin_page() would
know the size to a page unit, while the higher level function would know the
whole extent and could optimize differently based on that.
Thanks
Ankur
>
* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
2020-10-15 10:40 ` Borislav Petkov
@ 2020-10-15 21:40 ` Ankur Arora
0 siblings, 0 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-15 21:40 UTC (permalink / raw)
To: Borislav Petkov
Cc: Andy Lutomirski, Andy Lutomirski, LKML, Linux-MM,
Kirill A. Shutemov, Michal Hocko, Boris Ostrovsky,
Konrad Rzeszutek Wilk, Thomas Gleixner, Ingo Molnar, X86 ML,
H. Peter Anvin, Arnd Bergmann, Andrew Morton, Ira Weiny,
linux-arch
On 2020-10-15 3:40 a.m., Borislav Petkov wrote:
> On Wed, Oct 14, 2020 at 08:21:57PM -0700, Ankur Arora wrote:
>> Also, if we did extend clear_page() to take the page size as a parameter,
>> we still might not have enough information (e.g. a 4K or a 2MB page that
>> clear_page() sees could be part of a GUP of a much larger extent) to
>> decide whether to go uncached or not.
>
> clear_page* assumes 4K. All of the lowlevel asm variants do. So adding
> the size there won't bring you a whole lot.
>
> So you'd need to devise this whole thing differently. Perhaps have a
> clear_pages() helper which decides based on size what to do: uncached
> clearing or the clear_page() as is now in a loop.
I think that'll work well for GB pages, where the clear_pages() helper
has enough information to make a decision.
But, unless I'm missing something, I'm not sure how that would work for,
say, a 1GB mm_populate() using 4K pages. The clear_page() (or clear_pages())
in that case would only see the 4K size.
But let me think about this more (and look at the callsites as you suggest.)
>
> Looking at the callsites would give you a better idea I'd say.
Thanks, yeah that's a good idea. Let me go do that.
Ankur
>
> Thx.
>
* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
2020-10-15 21:20 ` Ankur Arora
@ 2020-10-16 18:21 ` Borislav Petkov
0 siblings, 0 replies; 29+ messages in thread
From: Borislav Petkov @ 2020-10-16 18:21 UTC (permalink / raw)
To: Ankur Arora
Cc: Andy Lutomirski, Andy Lutomirski, LKML, Linux-MM,
Kirill A. Shutemov, Michal Hocko, Boris Ostrovsky,
Konrad Rzeszutek Wilk, Thomas Gleixner, Ingo Molnar, X86 ML,
H. Peter Anvin, Arnd Bergmann, Andrew Morton, Ira Weiny,
linux-arch
On Thu, Oct 15, 2020 at 02:20:36PM -0700, Ankur Arora wrote:
> The case I was thinking of was that clear_huge_page()
That loop in clear_gigantic_page() could be optimized not to iterate
over the pages but to do the NTA moves in one go, provided they're
contiguous.
> or faultin_page() would
faultin_page() goes into the bowels of mm fault handling; you'd have to
be more precise about what exactly you mean with that one.
> know the size to a page unit, while the higher level function would know the
> whole extent and could optimize differently based on that.
Just don't forget that this "optimization" of yours comes at the price
of added code complexity, and you're putting the onus on people to
know which function to call. So it is not for free and needs to be
carefully weighed.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
end of thread
Thread overview: 29+ messages
2020-10-14 8:32 [PATCH 0/8] Use uncached writes while clearing gigantic pages Ankur Arora
2020-10-14 8:32 ` [PATCH 1/8] x86/cpuid: add X86_FEATURE_NT_GOOD Ankur Arora
2020-10-14 8:32 ` [PATCH 2/8] x86/asm: add memset_movnti() Ankur Arora
2020-10-14 8:32 ` [PATCH 3/8] perf bench: " Ankur Arora
2020-10-14 8:32 ` [PATCH 4/8] x86/asm: add clear_page_nt() Ankur Arora
2020-10-14 19:56 ` Borislav Petkov
2020-10-14 21:11 ` Ankur Arora
2020-10-14 8:32 ` [PATCH 5/8] x86/clear_page: add clear_page_uncached() Ankur Arora
2020-10-14 11:10 ` kernel test robot
2020-10-14 13:04 ` kernel test robot
2020-10-14 15:45 ` Andy Lutomirski
2020-10-14 19:58 ` Borislav Petkov
2020-10-14 21:07 ` Andy Lutomirski
2020-10-14 21:12 ` Borislav Petkov
2020-10-15 3:37 ` Ankur Arora
2020-10-15 10:35 ` Borislav Petkov
2020-10-15 21:20 ` Ankur Arora
2020-10-16 18:21 ` Borislav Petkov
2020-10-15 3:21 ` Ankur Arora
2020-10-15 10:40 ` Borislav Petkov
2020-10-15 21:40 ` Ankur Arora
2020-10-14 20:54 ` Ankur Arora
2020-10-14 8:32 ` [PATCH 6/8] mm, clear_huge_page: use clear_page_uncached() for gigantic pages Ankur Arora
2020-10-14 15:28 ` Ingo Molnar
2020-10-14 19:15 ` Ankur Arora
2020-10-14 8:32 ` [PATCH 7/8] x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx Ankur Arora
2020-10-14 15:31 ` Ingo Molnar
2020-10-14 19:23 ` Ankur Arora
2020-10-14 8:32 ` [PATCH 8/8] x86/cpu/amd: enable X86_FEATURE_NT_GOOD on AMD Zen Ankur Arora