* [PATCH 0/8] Use uncached writes while clearing gigantic pages
@ 2020-10-14  8:32 Ankur Arora
  2020-10-14  8:32 ` [PATCH 1/8] x86/cpuid: add X86_FEATURE_NT_GOOD Ankur Arora
                   ` (7 more replies)
  0 siblings, 8 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-14  8:32 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill, mhocko, boris.ostrovsky, konrad.wilk, Ankur Arora

This series adds clear_page_nt(), a non-temporal MOV (MOVNTI) based
clear_page().

The immediate use case is to speed up the creation of large (~2TB)
guest VMs. Memory for these guests is allocated via huge/gigantic
pages, which are faulted in early.

The intent behind using non-temporal writes is to avoid allocating
cachelines for data that will not be read back soon. This minimizes
cache pollution and can potentially also speed up zeroing of large
extents.
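
As a quick illustration of the mechanism (my sketch, not code from
this series), the userspace equivalent using the SSE2 intrinsic that
compiles to MOVNTI looks like:

  #include <stddef.h>
  #include <emmintrin.h>	/* _mm_stream_si64(), x86-64 SSE2 */

  /* Zero an 8-byte-aligned buffer (len a multiple of 8) with
   * non-temporal stores, then fence. */
  static void clear_buf_nt(void *buf, size_t len)
  {
  	long long *p = buf;
  	size_t i;

  	for (i = 0; i < len / 8; i++)
  		_mm_stream_si64(&p[i], 0);	/* emits MOVNTI */
  	_mm_sfence();	/* drain WC buffers; NT stores are weakly ordered */
  }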

That said, uncached writes are not always a win, as can be seen in
these 'perf bench mem memset' numbers comparing clear_page_erms() and
clear_page_nt():

Intel Broadwellx:
              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)       speedup
              -----------------------   -----------------------     -------
     size            BW   (   pstdev)          BW   (   pstdev)
     16MB      17.35 GB/s ( +- 9.27%)    11.83 GB/s ( +- 0.19%)     -31.81%
    128MB       5.31 GB/s ( +- 0.13%)    11.72 GB/s ( +- 0.44%)    +121.84%

AMD Rome:
              x86-64-stosq (5 runs)     x86-64-movnt (5 runs)      speedup
              -----------------------   -----------------------     -------
     size            BW   (   pstdev)          BW   (   pstdev)
     16MB      15.39 GB/s ( +- 9.14%)    14.56 GB/s ( +-19.43%)     -5.39%
    128MB      11.04 GB/s ( +- 4.87%)    14.49 GB/s ( +-13.22%)    +31.25%

Intel Skylakex:
              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)      speedup
              -----------------------   -----------------------    -------
     size            BW   (   pstdev)          BW   (   pstdev)
     16MB      20.38 GB/s ( +- 2.58%)     6.25 GB/s ( +- 0.41%)   -69.28%
    128MB       6.52 GB/s ( +- 0.14%)     6.31 GB/s ( +- 0.47%)    -3.22%

(All of the machines in these tests had a minimum of 25MB L3 cache per
socket.)

There are two performance issues:
 - uncached writes typically perform better only for extents around
   LLC-size or larger.
 - MOVNTI does not perform well on all microarchitectures.

We handle the first issue by using clear_page_nt() only for GB pages.

That leaves out zeroing of 2MB pages: a size large enough that
uncached writes might have meaningful cache benefits, yet small enough
that uncached writes could end up being slower.

We can handle a subset of the 2MB case -- mmaps with MAP_POPULATE --
by means of an uncached-or-cached hint chosen based on a threshold
size. This would apply to maps backed by any page size.
That case is not handled in this series -- I wanted to sanity-check
the high-level approach before attempting that (a rough sketch of the
idea follows).
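
Something along these lines (the helper name, llc_size(), and the
threshold are hypothetical, purely illustrative):

  /*
   * Hypothetical sketch: prefer uncached clearing only when the
   * extent being populated is around LLC-size or larger. llc_size()
   * is a made-up placeholder, not an existing kernel interface.
   */
  static bool fault_hint_uncached(unsigned long nr_pages)
  {
  	return (nr_pages << PAGE_SHIFT) >= llc_size();
  }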

We handle the second issue by adding a synthetic CPU feature,
X86_FEATURE_NT_GOOD, which is enabled only on microarchitectures where
MOVNTI performs well.
(Relatedly, I thought I had independently decided to use ALTERNATIVES
to deal with this, but more likely I had just internalized it from this 
discussion:
https://lore.kernel.org/linux-mm/20200316101856.GH11482@dhcp22.suse.cz/#t)

Accordingly, this series enables X86_FEATURE_NT_GOOD for Intel
Broadwellx and AMD Zen. (In my testing, the performance was also good
on some pre-production models, but this series leaves them out.)

Please review.

Thanks
Ankur

Ankur Arora (8):
  x86/cpuid: add X86_FEATURE_NT_GOOD
  x86/asm: add memset_movnti()
  perf bench: add memset_movnti()
  x86/asm: add clear_page_nt()
  x86/clear_page: add clear_page_uncached()
  mm, clear_huge_page: use clear_page_uncached() for gigantic pages
  x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx
  x86/cpu/amd: enable X86_FEATURE_NT_GOOD on AMD Zen

 arch/x86/include/asm/cpufeatures.h           |  1 +
 arch/x86/include/asm/page.h                  |  6 +++
 arch/x86/include/asm/page_32.h               |  9 ++++
 arch/x86/include/asm/page_64.h               | 15 ++++++
 arch/x86/kernel/cpu/amd.c                    |  3 ++
 arch/x86/kernel/cpu/intel.c                  |  2 +
 arch/x86/lib/clear_page_64.S                 | 26 +++++++++++
 arch/x86/lib/memset_64.S                     | 68 ++++++++++++++++------------
 include/asm-generic/page.h                   |  3 ++
 include/linux/highmem.h                      | 10 ++++
 mm/memory.c                                  |  3 +-
 tools/arch/x86/lib/memset_64.S               | 68 ++++++++++++++++------------
 tools/perf/bench/mem-memset-x86-64-asm-def.h |  6 ++-
 13 files changed, 158 insertions(+), 62 deletions(-)

-- 
2.9.3



* [PATCH 1/8] x86/cpuid: add X86_FEATURE_NT_GOOD
  2020-10-14  8:32 [PATCH 0/8] Use uncached writes while clearing gigantic pages Ankur Arora
@ 2020-10-14  8:32 ` Ankur Arora
  2020-10-14  8:32 ` [PATCH 2/8] x86/asm: add memset_movnti() Ankur Arora
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-14  8:32 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill, mhocko, boris.ostrovsky, konrad.wilk, Ankur Arora,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Tony Luck, Pawan Gupta, Josh Poimboeuf,
	Peter Zijlstra (Intel),
	Mark Gross, Kim Phillips, Vineela Tummalapalli, Wei Huang

Add a synthetic CPU feature to be enabled on microarchitectures where
the non-temporal MOV (MOVNTI) instruction performs well.
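
The flag is consumed later in the series via the alternatives
mechanism; for context, from patch 5:

  static inline void clear_page_uncached(void *page)
  {
  	alternative_call(clear_page,
  			 clear_page_nt, X86_FEATURE_NT_GOOD,
  			 "=D" (page),
  			 "0" (page)
  			 : "cc", "memory", "rax", "rcx");
  }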

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 7b0afd5e6c57..8bae38240346 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -289,6 +289,7 @@
 #define X86_FEATURE_FENCE_SWAPGS_KERNEL	(11*32+ 5) /* "" LFENCE in kernel entry SWAPGS path */
 #define X86_FEATURE_SPLIT_LOCK_DETECT	(11*32+ 6) /* #AC for split lock */
 #define X86_FEATURE_PER_THREAD_MBA	(11*32+ 7) /* "" Per-thread Memory Bandwidth Allocation */
+#define X86_FEATURE_NT_GOOD		(11*32+ 8) /* Non-temporal instructions perform well */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
 #define X86_FEATURE_AVX512_BF16		(12*32+ 5) /* AVX512 BFLOAT16 instructions */
-- 
2.9.3



* [PATCH 2/8] x86/asm: add memset_movnti()
  2020-10-14  8:32 [PATCH 0/8] Use uncached writes while clearing gigantic pages Ankur Arora
  2020-10-14  8:32 ` [PATCH 1/8] x86/cpuid: add X86_FEATURE_NT_GOOD Ankur Arora
@ 2020-10-14  8:32 ` Ankur Arora
  2020-10-14  8:32 ` [PATCH 3/8] perf bench: " Ankur Arora
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-14  8:32 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill, mhocko, boris.ostrovsky, konrad.wilk, Ankur Arora,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Jiri Slaby, Juergen Gross

Add a MOVNTI-based implementation of memset().

memset_orig() and memset_movnti() differ only in the opcode used in
the inner loop, so move the memset_orig() logic into a macro, which
gets expanded into memset_movq() and memset_movnti().
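
The .Lfoo -> .Lfoo_\@ label rewrite in the diff below is what makes
the double expansion work: '\@' is the assembler's macro-expansion
counter, so each MEMSET_MOV invocation gets its own local labels.
Schematically:

  .macro MEMSET_MOV OP fence
  .Lloop_64_\@:			# .Lloop_64_0 in the first expansion,
  	...			# .Lloop_64_1 in the second
  .endm

  MEMSET_MOV OP=movq fence=0	# plain stores, no fence
  MEMSET_MOV OP=movnti fence=1	# NT stores, trailing SFENCE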

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/lib/memset_64.S | 68 +++++++++++++++++++++++++++---------------------
 1 file changed, 38 insertions(+), 30 deletions(-)

diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
index 9ff15ee404a4..79703cc04b6a 100644
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -27,7 +27,7 @@ SYM_FUNC_START(__memset)
 	 *
 	 * Otherwise, use original memset function.
 	 */
-	ALTERNATIVE_2 "jmp memset_orig", "", X86_FEATURE_REP_GOOD, \
+	ALTERNATIVE_2 "jmp memset_movq", "", X86_FEATURE_REP_GOOD, \
 		      "jmp memset_erms", X86_FEATURE_ERMS
 
 	movq %rdi,%r9
@@ -68,7 +68,8 @@ SYM_FUNC_START_LOCAL(memset_erms)
 	ret
 SYM_FUNC_END(memset_erms)
 
-SYM_FUNC_START_LOCAL(memset_orig)
+.macro MEMSET_MOV OP fence
+SYM_FUNC_START_LOCAL(memset_\OP)
 	movq %rdi,%r10
 
 	/* expand byte value  */
@@ -79,64 +80,71 @@ SYM_FUNC_START_LOCAL(memset_orig)
 	/* align dst */
 	movl  %edi,%r9d
 	andl  $7,%r9d
-	jnz  .Lbad_alignment
-.Lafter_bad_alignment:
+	jnz  .Lbad_alignment_\@
+.Lafter_bad_alignment_\@:
 
 	movq  %rdx,%rcx
 	shrq  $6,%rcx
-	jz	 .Lhandle_tail
+	jz	 .Lhandle_tail_\@
 
 	.p2align 4
-.Lloop_64:
+.Lloop_64_\@:
 	decq  %rcx
-	movq  %rax,(%rdi)
-	movq  %rax,8(%rdi)
-	movq  %rax,16(%rdi)
-	movq  %rax,24(%rdi)
-	movq  %rax,32(%rdi)
-	movq  %rax,40(%rdi)
-	movq  %rax,48(%rdi)
-	movq  %rax,56(%rdi)
+	\OP  %rax,(%rdi)
+	\OP  %rax,8(%rdi)
+	\OP  %rax,16(%rdi)
+	\OP  %rax,24(%rdi)
+	\OP  %rax,32(%rdi)
+	\OP  %rax,40(%rdi)
+	\OP  %rax,48(%rdi)
+	\OP  %rax,56(%rdi)
 	leaq  64(%rdi),%rdi
-	jnz    .Lloop_64
+	jnz    .Lloop_64_\@
 
 	/* Handle tail in loops. The loops should be faster than hard
 	   to predict jump tables. */
 	.p2align 4
-.Lhandle_tail:
+.Lhandle_tail_\@:
 	movl	%edx,%ecx
 	andl    $63&(~7),%ecx
-	jz 		.Lhandle_7
+	jz 		.Lhandle_7_\@
 	shrl	$3,%ecx
 	.p2align 4
-.Lloop_8:
+.Lloop_8_\@:
 	decl   %ecx
-	movq  %rax,(%rdi)
+	\OP  %rax,(%rdi)
 	leaq  8(%rdi),%rdi
-	jnz    .Lloop_8
+	jnz    .Lloop_8_\@
 
-.Lhandle_7:
+.Lhandle_7_\@:
 	andl	$7,%edx
-	jz      .Lende
+	jz      .Lende_\@
 	.p2align 4
-.Lloop_1:
+.Lloop_1_\@:
 	decl    %edx
 	movb 	%al,(%rdi)
 	leaq	1(%rdi),%rdi
-	jnz     .Lloop_1
+	jnz     .Lloop_1_\@
 
-.Lende:
+.Lende_\@:
+	.if \fence
+	sfence
+	.endif
 	movq	%r10,%rax
 	ret
 
-.Lbad_alignment:
+.Lbad_alignment_\@:
 	cmpq $7,%rdx
-	jbe	.Lhandle_7
+	jbe	.Lhandle_7_\@
 	movq %rax,(%rdi)	/* unaligned store */
 	movq $8,%r8
 	subq %r9,%r8
 	addq %r8,%rdi
 	subq %r8,%rdx
-	jmp .Lafter_bad_alignment
-.Lfinal:
-SYM_FUNC_END(memset_orig)
+	jmp .Lafter_bad_alignment_\@
+.Lfinal_\@:
+SYM_FUNC_END(memset_\OP)
+.endm
+
+MEMSET_MOV OP=movq fence=0
+MEMSET_MOV OP=movnti fence=1
-- 
2.9.3



* [PATCH 3/8] perf bench: add memset_movnti()
  2020-10-14  8:32 [PATCH 0/8] Use uncached writes while clearing gigantic pages Ankur Arora
  2020-10-14  8:32 ` [PATCH 1/8] x86/cpuid: add X86_FEATURE_NT_GOOD Ankur Arora
  2020-10-14  8:32 ` [PATCH 2/8] x86/asm: add memset_movnti() Ankur Arora
@ 2020-10-14  8:32 ` Ankur Arora
  2020-10-14  8:32 ` [PATCH 4/8] x86/asm: add clear_page_nt() Ankur Arora
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-14  8:32 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill, mhocko, boris.ostrovsky, konrad.wilk, Ankur Arora,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim

Clone memset_movnti() from arch/x86/lib/memset_64.S.

'perf bench mem memset -f x86-64-movnt' on Intel Broadwellx, Skylakex
and AMD Rome:

Intel Broadwellx:
$ for i in 2 8 32 128 512; do
	perf bench mem memset -f x86-64-movnt -s ${i}MB
  done

  # Output pruned.
  # Running 'mem/memset' benchmark:
  # function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
  # Copying 2MB bytes ...
        11.837121 GB/sec
  # Copying 8MB bytes ...
        11.783560 GB/sec
  # Copying 32MB bytes ...
        11.868591 GB/sec
  # Copying 128MB bytes ...
        11.865211 GB/sec
  # Copying 512MB bytes ...
        11.864085 GB/sec

Intel Skylakex:
$ for i in 2 8 32 128 512; do
	perf bench mem memset -f x86-64-movnt -s ${i}MB
  done
  # Running 'mem/memset' benchmark:
  # function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
  # Copying 2MB bytes ...
         6.361971 GB/sec
  # Copying 8MB bytes ...
         6.300403 GB/sec
  # Copying 32MB bytes ...
         6.288992 GB/sec
  # Copying 128MB bytes ...
         6.328793 GB/sec
  # Copying 512MB bytes ...
         6.324471 GB/sec

AMD Rome:
$ for i in 2 8 32 128 512; do
	perf bench mem memset -f x86-64-movnt -s ${i}MB
  done
  # Running 'mem/memset' benchmark:
  # function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
  # Copying 2MB bytes ...
        10.993199 GB/sec
  # Copying 8MB bytes ...
        14.221784 GB/sec
  # Copying 32MB bytes ...
        14.293337 GB/sec
  # Copying 128MB bytes ...
        15.238947 GB/sec
  # Copying 512MB bytes ...
        16.476093 GB/sec

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/arch/x86/lib/memset_64.S               | 68 ++++++++++++++++------------
 tools/perf/bench/mem-memset-x86-64-asm-def.h |  6 ++-
 2 files changed, 43 insertions(+), 31 deletions(-)

diff --git a/tools/arch/x86/lib/memset_64.S b/tools/arch/x86/lib/memset_64.S
index fd5d25a474b7..bfbf6d06f81e 100644
--- a/tools/arch/x86/lib/memset_64.S
+++ b/tools/arch/x86/lib/memset_64.S
@@ -26,7 +26,7 @@ SYM_FUNC_START(__memset)
 	 *
 	 * Otherwise, use original memset function.
 	 */
-	ALTERNATIVE_2 "jmp memset_orig", "", X86_FEATURE_REP_GOOD, \
+	ALTERNATIVE_2 "jmp memset_movq", "", X86_FEATURE_REP_GOOD, \
 		      "jmp memset_erms", X86_FEATURE_ERMS
 
 	movq %rdi,%r9
@@ -65,7 +65,8 @@ SYM_FUNC_START(memset_erms)
 	ret
 SYM_FUNC_END(memset_erms)
 
-SYM_FUNC_START(memset_orig)
+.macro MEMSET_MOV OP fence
+SYM_FUNC_START(memset_\OP)
 	movq %rdi,%r10
 
 	/* expand byte value  */
@@ -76,64 +77,71 @@ SYM_FUNC_START(memset_orig)
 	/* align dst */
 	movl  %edi,%r9d
 	andl  $7,%r9d
-	jnz  .Lbad_alignment
-.Lafter_bad_alignment:
+	jnz  .Lbad_alignment_\@
+.Lafter_bad_alignment_\@:
 
 	movq  %rdx,%rcx
 	shrq  $6,%rcx
-	jz	 .Lhandle_tail
+	jz	 .Lhandle_tail_\@
 
 	.p2align 4
-.Lloop_64:
+.Lloop_64_\@:
 	decq  %rcx
-	movq  %rax,(%rdi)
-	movq  %rax,8(%rdi)
-	movq  %rax,16(%rdi)
-	movq  %rax,24(%rdi)
-	movq  %rax,32(%rdi)
-	movq  %rax,40(%rdi)
-	movq  %rax,48(%rdi)
-	movq  %rax,56(%rdi)
+	\OP  %rax,(%rdi)
+	\OP  %rax,8(%rdi)
+	\OP  %rax,16(%rdi)
+	\OP  %rax,24(%rdi)
+	\OP  %rax,32(%rdi)
+	\OP  %rax,40(%rdi)
+	\OP  %rax,48(%rdi)
+	\OP  %rax,56(%rdi)
 	leaq  64(%rdi),%rdi
-	jnz    .Lloop_64
+	jnz    .Lloop_64_\@
 
 	/* Handle tail in loops. The loops should be faster than hard
 	   to predict jump tables. */
 	.p2align 4
-.Lhandle_tail:
+.Lhandle_tail_\@:
 	movl	%edx,%ecx
 	andl    $63&(~7),%ecx
-	jz 		.Lhandle_7
+	jz 		.Lhandle_7_\@
 	shrl	$3,%ecx
 	.p2align 4
-.Lloop_8:
+.Lloop_8_\@:
 	decl   %ecx
-	movq  %rax,(%rdi)
+	\OP  %rax,(%rdi)
 	leaq  8(%rdi),%rdi
-	jnz    .Lloop_8
+	jnz    .Lloop_8_\@
 
-.Lhandle_7:
+.Lhandle_7_\@:
 	andl	$7,%edx
-	jz      .Lende
+	jz      .Lende_\@
 	.p2align 4
-.Lloop_1:
+.Lloop_1_\@:
 	decl    %edx
 	movb 	%al,(%rdi)
 	leaq	1(%rdi),%rdi
-	jnz     .Lloop_1
+	jnz     .Lloop_1_\@
 
-.Lende:
+.Lende_\@:
+	.if \fence
+	sfence
+	.endif
 	movq	%r10,%rax
 	ret
 
-.Lbad_alignment:
+.Lbad_alignment_\@:
 	cmpq $7,%rdx
-	jbe	.Lhandle_7
+	jbe	.Lhandle_7_\@
 	movq %rax,(%rdi)	/* unaligned store */
 	movq $8,%r8
 	subq %r9,%r8
 	addq %r8,%rdi
 	subq %r8,%rdx
-	jmp .Lafter_bad_alignment
-.Lfinal:
-SYM_FUNC_END(memset_orig)
+	jmp .Lafter_bad_alignment_\@
+.Lfinal_\@:
+SYM_FUNC_END(memset_\OP)
+.endm
+
+MEMSET_MOV OP=movq fence=0
+MEMSET_MOV OP=movnti fence=1
diff --git a/tools/perf/bench/mem-memset-x86-64-asm-def.h b/tools/perf/bench/mem-memset-x86-64-asm-def.h
index dac6d2b7c39b..53ead7f91313 100644
--- a/tools/perf/bench/mem-memset-x86-64-asm-def.h
+++ b/tools/perf/bench/mem-memset-x86-64-asm-def.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
-MEMSET_FN(memset_orig,
+MEMSET_FN(memset_movq,
 	"x86-64-unrolled",
 	"unrolled memset() in arch/x86/lib/memset_64.S")
 
@@ -11,3 +11,7 @@ MEMSET_FN(__memset,
 MEMSET_FN(memset_erms,
 	"x86-64-stosb",
 	"movsb-based memset() in arch/x86/lib/memset_64.S")
+
+MEMSET_FN(memset_movnti,
+	"x86-64-movnt",
+	"movnt-based memset() in arch/x86/lib/memset_64.S")
-- 
2.9.3



* [PATCH 4/8] x86/asm: add clear_page_nt()
  2020-10-14  8:32 [PATCH 0/8] Use uncached writes while clearing gigantic pages Ankur Arora
                   ` (2 preceding siblings ...)
  2020-10-14  8:32 ` [PATCH 3/8] perf bench: " Ankur Arora
@ 2020-10-14  8:32 ` Ankur Arora
  2020-10-14 19:56   ` Borislav Petkov
  2020-10-14  8:32 ` [PATCH 5/8] x86/clear_page: add clear_page_uncached() Ankur Arora
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 29+ messages in thread
From: Ankur Arora @ 2020-10-14  8:32 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill, mhocko, boris.ostrovsky, konrad.wilk, Ankur Arora,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Jiri Slaby, Herbert Xu, Rafael J. Wysocki

Add clear_page_nt(), which is essentially an unrolled MOVNTI loop.
The unrolling keeps the inner loop similar to memset_movnti(), which
can be exercised via 'perf bench mem memset'.

The caller needs to execute an SFENCE when done.
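
Schematically, the intended call pattern looks like this (illustration
only -- the name clear_extent_nt() is made up; patches 5 and 6 add the
real plumbing):

  static void clear_extent_nt(void *addr, unsigned long nr_pages)
  {
  	unsigned long i;

  	for (i = 0; i < nr_pages; i++)
  		clear_page_nt(addr + i * PAGE_SIZE);

  	/*
  	 * MOVNTI stores are weakly ordered wrt other stores; SFENCE
  	 * before the zeroed memory is exposed.
  	 */
  	clear_page_uncached_flush();	/* SFENCE; added in patch 5 */
  }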

MOVNTI, from the Intel SDM, Volume 2B, 4-101:
 "The non-temporal hint is implemented by using a write combining (WC)
  memory type protocol when writing the data to memory. Using this
  protocol, the processor does not write the data into the cache hierarchy,
  nor does it fetch the corresponding cache line from memory into the
  cache hierarchy."

The AMD Arch Manual has something similar to say as well.

This can potentially improve page-clearing bandwidth (see below for
performance numbers for two microarchitectures where it helps and one
where it doesn't) and can help indirectly by consuming less cache
resources.

Any performance benefit is expected only for extents around LLC-size
or larger -- when we are DRAM-BW constrained rather than cache-BW
constrained.

 # Intel Broadwellx
 # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
 # (X86_FEATURE_ERMS) and x86-64-movnt:

 System:           Oracle X6-2
 CPU:              2 nodes * 10 cores/node * 2 threads/core
                  Intel Xeon E5-2630 v4 (Broadwellx, 6:79:1)
 Memory:           256G evenly split between nodes
 Microcode:        0xb00002e
 scaling_governor: performance
 L3 size:           25MB
 intel_pstate/no_turbo: 1

              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)       speedup
              -----------------------   -----------------------     -------
     size            BW   (   pstdev)          BW   (   pstdev)

     16MB      17.35 GB/s ( +- 9.27%)    11.83 GB/s ( +- 0.19%)     -31.81%
    128MB       5.31 GB/s ( +- 0.13%)    11.72 GB/s ( +- 0.44%)    +121.84%
   1024MB       5.42 GB/s ( +- 0.13%)    11.78 GB/s ( +- 0.03%)    +117.34%
   4096MB       5.41 GB/s ( +- 0.41%)    11.76 GB/s ( +- 0.07%)    +117.37%

Comparing perf stats for size=4096MB:

$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb
 # Running 'mem/memset' benchmark:
 # function 'x86-64-stosb' (movsb-based memset() in arch/x86/lib/memset_64.S)
 # Copying 4096MB bytes ...

       5.405362 GB/sec
       5.444229 GB/sec
       5.397943 GB/sec
       5.401012 GB/sec
       5.439320 GB/sec

  Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb' (5 runs):

     2,064,476,092      cpu-cycles                #    1.087 GHz                      ( +-  0.17% )  (22.19%)
         8,578,591      instructions              #    0.00  insn per cycle           ( +- 12.01% )  (27.79%)
       132,481,645      cache-references          #   69.730 M/sec                    ( +-  0.20% )  (27.83%)
           157,710      cache-misses              #    0.119 % of all cache refs      ( +-  5.80% )  (27.84%)
         2,879,628      branch-instructions       #    1.516 M/sec                    ( +-  0.21% )  (27.86%)
            80,581      branch-misses             #    2.80% of all branches          ( +- 13.15% )  (27.84%)
        94,401,869      bus-cycles                #   49.687 M/sec                    ( +-  0.25% )  (22.21%)
       133,947,283      L1-dcache-load-misses     # 139717.91% of all L1-dcache accesses  ( +-  0.26% )  (22.21%)
            95,870      L1-dcache-loads           #    0.050 M/sec                    ( +-  9.95% )  (22.21%)
             1,700      LLC-loads                 #    0.895 K/sec                    ( +-  6.50% )  (22.21%)
             1,410      LLC-load-misses           #   82.95% of all LL-cache accesses  ( +- 19.42% )  (22.21%)
       132,526,771      LLC-stores                #   69.754 M/sec                    ( +-  0.65% )  (11.10%)
           101,145      LLC-store-misses          #    0.053 M/sec                    ( +- 11.19% )  (11.10%)

           1.90238 +- 0.00358 seconds time elapsed  ( +-  0.19% )

$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt
 # Running 'mem/memset' benchmark:
 # function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
 # Copying 4096MB bytes ...

      11.774264 GB/sec
      11.758826 GB/sec
      11.774368 GB/sec
      11.758239 GB/sec
      11.760348 GB/sec

  Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt' (5 runs):

     1,619,807,936      cpu-cycles                #    0.971 GHz                      ( +-  0.24% )  (22.14%)
     1,481,306,856      instructions              #    0.91  insn per cycle           ( +-  0.33% )  (27.75%)
           163,086      cache-references          #    0.098 M/sec                    ( +- 11.68% )  (27.79%)
            39,913      cache-misses              #   24.474 % of all cache refs      ( +- 26.45% )  (27.84%)
       135,741,931      branch-instructions       #   81.353 M/sec                    ( +-  0.33% )  (27.89%)
            82,647      branch-misses             #    0.06% of all branches          ( +-  6.29% )  (27.90%)
        73,575,446      bus-cycles                #   44.095 M/sec                    ( +-  0.28% )  (22.28%)
            27,834      L1-dcache-load-misses     #   68.42% of all L1-dcache accesses  ( +- 65.93% )  (22.28%)
            40,683      L1-dcache-loads           #    0.024 M/sec                    ( +- 42.62% )  (22.27%)
             2,598      LLC-loads                 #    0.002 M/sec                    ( +- 22.66% )  (22.25%)
             1,523      LLC-load-misses           #   58.60% of all LL-cache accesses  ( +- 39.64% )  (22.22%)
                 2      LLC-stores                #    0.001 K/sec                    ( +-100.00% )  (11.08%)
                 0      LLC-store-misses          #    0.000 K/sec                    (11.07%)

           1.67003 +- 0.00169 seconds time elapsed  ( +-  0.10% )

The L1-dcache-load-miss (L1D.REPLACEMENT) counts are significantly down,
which does confirm that unlike "REP; STOSB", MOVNTI does not result in a
write-allocate.

 # AMD Rome
 # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosq
 # (X86_FEATURE_REP_GOOD) and x86-64-movnt:

 System:           Oracle E2-2c
 CPU:              2 nodes * 64 cores/node * 2 threads/core
                   AMD EPYC 7742 (Rome, 23:49:0)
 Memory:           2048 GB evenly split between nodes
 Microcode:        0x8301038
 scaling_governor: performance
 L3 size:          16 * 16MB
 cpufreq/boost:    0

              x86-64-stosq (5 runs)     x86-64-movnt (5 runs)      speedup
              -----------------------   -----------------------    -------
     size       BW        (   pstdev)          BW   (   pstdev)

     16MB      15.39 GB/s ( +- 9.14%)    14.56 GB/s ( +-19.43%)     -5.39%
    128MB      11.04 GB/s ( +- 4.87%)    14.49 GB/s ( +-13.22%)    +31.25%
   1024MB      11.86 GB/s ( +- 0.83%)    16.54 GB/s ( +- 0.04%)    +39.46%
   4096MB      11.89 GB/s ( +- 0.61%)    16.49 GB/s ( +- 0.28%)    +38.68%

Comparing perf stats for size=4096MB:

$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-stosq
 # Running 'mem/memset' benchmark:
 # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
 # Copying 4096MB bytes ...

      11.785122 GB/sec
      11.970851 GB/sec
      11.916821 GB/sec
      11.861517 GB/sec
      11.941867 GB/sec

  Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-stosq' (5 runs):

     1,014,645,096      cpu-cycles                #    1.264 GHz                      ( +-  0.18% )  (45.28%)
         4,620,983      instructions              #    0.00  insn per cycle           ( +-  1.86% )  (45.37%)
       262,988,622      cache-references          #  327.723 M/sec                    ( +-  0.21% )  (45.51%)
         6,312,740      cache-misses              #    2.400 % of all cache refs      ( +-  1.12% )  (45.56%)
         1,792,517      branch-instructions       #    2.234 M/sec                    ( +-  0.20% )  (45.60%)
            54,095      branch-misses             #    3.02% of all branches          ( +-  2.99% )  (45.64%)
       133,710,131      L1-dcache-load-misses     #  363.51% of all L1-dcache accesses  ( +-  0.12% )  (45.55%)
        36,783,396      L1-dcache-loads           #   45.838 M/sec                    ( +-  0.79% )  (45.46%)
        53,411,709      L1-dcache-prefetches      #   66.559 M/sec                    ( +-  0.28% )  (45.39%)

           0.80303 +- 0.00117 seconds time elapsed  ( +-  0.15% )

$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt
 # Running 'mem/memset' benchmark:
 # function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
 # Copying 4096MB bytes ...

      16.533230 GB/sec
      16.496138 GB/sec
      16.480302 GB/sec
      16.478333 GB/sec
      16.474600 GB/sec

  Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt' (5 runs):

     1,091,352,779      cpu-cycles                #    1.292 GHz                      ( +-  0.32% )  (45.25%)
     1,483,248,390      instructions              #    1.36  insn per cycle           ( +-  0.14% )  (45.38%)
       134,114,985      cache-references          #  158.723 M/sec                    ( +-  0.17% )  (45.51%)
           117,682      cache-misses              #    0.088 % of all cache refs      ( +-  0.99% )  (45.59%)
       135,009,275      branch-instructions       #  159.781 M/sec                    ( +-  0.18% )  (45.68%)
            50,659      branch-misses             #    0.04% of all branches          ( +-  7.50% )  (45.66%)
            58,569      L1-dcache-load-misses     #    5.84% of all L1-dcache accesses  ( +-  6.04% )  (45.57%)
         1,002,657      L1-dcache-loads           #    1.187 M/sec                    ( +- 15.40% )  (45.45%)
             3,111      L1-dcache-prefetches      #    0.004 M/sec                    ( +- 31.21% )  (45.38%)

           0.84554 +- 0.00289 seconds time elapsed  ( +-  0.34% )

Similar to Intel Broadwellx, the L1-dcache-load-misses (L2$ access from
DC Miss) counts are significantly lower. The L1 prefetcher is also
fairly quiet.

 # Intel Skylakex
 # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
 # (X86_FEATURE_ERMS) and x86-64-movnt:

 System:           Oracle X8-2
 CPU:              2 nodes * 26 cores/node * 2 threads/core
                  Intel Xeon Platinum 8270CL (Skylakex, 6:85:7)
 Memory:           3TB evenly split between nodes
 Microcode:        0x5002f01
 scaling_governor: performance
 L3 size:          36MB
 intel_pstate/no_turbo: 1

              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)      speedup
              -----------------------   -----------------------    -------
     size            BW   (   pstdev)          BW   (   pstdev)

     16MB      20.38 GB/s ( +- 2.58%)     6.25 GB/s ( +- 0.41%)   -69.28%
    128MB       6.52 GB/s ( +- 0.14%)     6.31 GB/s ( +- 0.47%)    -3.22%
   1024MB       6.48 GB/s ( +- 0.31%)     6.24 GB/s ( +- 0.00%)    -3.70%
   4096MB       6.51 GB/s ( +- 0.01%)     6.27 GB/s ( +- 0.42%)    -3.68%

Comparing perf stats for size=4096MB:

$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb
 # Running 'mem/memset' benchmark:
 # function 'x86-64-stosb' (movsb-based memset() in arch/x86/lib/memset_64.S)
 # Copying 4096MB bytes ...
       6.516972 GB/sec
       6.518756 GB/sec
       6.517620 GB/sec
       6.517598 GB/sec
       6.518799 GB/sec

 Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb' (5 runs):

     3,357,373,317      cpu-cycles                #    1.133 GHz                      ( +-  0.01% )  (29.38%)
       165,063,710      instructions              #    0.05  insn per cycle           ( +-  1.54% )  (35.29%)
           358,997      cache-references          #    0.121 M/sec                    ( +-  0.89% )  (35.32%)
           205,420      cache-misses              #   57.221 % of all cache refs      ( +-  3.61% )  (35.36%)
         6,117,673      branch-instructions       #    2.065 M/sec                    ( +-  1.48% )  (35.38%)
            58,309      branch-misses             #    0.95% of all branches          ( +-  1.30% )  (35.39%)
        31,329,466      bus-cycles                #   10.575 M/sec                    ( +-  0.03% )  (23.56%)
        68,543,766      L1-dcache-load-misses     #  157.03% of all L1-dcache accesses  ( +-  0.02% )  (23.53%)
        43,648,909      L1-dcache-loads           #   14.734 M/sec                    ( +-  0.50% )  (23.50%)
           137,498      LLC-loads                 #    0.046 M/sec                    ( +-  0.21% )  (23.49%)
            12,308      LLC-load-misses           #    8.95% of all LL-cache accesses  ( +-  2.52% )  (23.49%)
            26,335      LLC-stores                #    0.009 M/sec                    ( +-  5.65% )  (11.75%)
            25,008      LLC-store-misses          #    0.008 M/sec                    ( +-  3.42% )  (11.75%)

          2.962842 +- 0.000162 seconds time elapsed  ( +-  0.01% )

$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt
 # Running 'mem/memset' benchmark:
 # function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
 # Copying 4096MB bytes ...
       6.283420 GB/sec
       6.222843 GB/sec
       6.282976 GB/sec
       6.282828 GB/sec
       6.283173 GB/sec

 Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt' (5 runs):

     4,462,272,094      cpu-cycles                #    1.322 GHz                      ( +-  0.30% )  (29.38%)
     1,633,675,881      instructions              #    0.37  insn per cycle           ( +-  0.21% )  (35.28%)
           283,627      cache-references          #    0.084 M/sec                    ( +-  0.58% )  (35.31%)
            28,824      cache-misses              #   10.163 % of all cache refs      ( +- 20.67% )  (35.34%)
       139,719,697      branch-instructions       #   41.407 M/sec                    ( +-  0.16% )  (35.35%)
            58,062      branch-misses             #    0.04% of all branches          ( +-  1.49% )  (35.36%)
        41,760,350      bus-cycles                #   12.376 M/sec                    ( +-  0.05% )  (23.55%)
           303,300      L1-dcache-load-misses     #    0.69% of all L1-dcache accesses  ( +-  2.08% )  (23.53%)
        43,769,498      L1-dcache-loads           #   12.972 M/sec                    ( +-  0.54% )  (23.52%)
            99,570      LLC-loads                 #    0.030 M/sec                    ( +-  1.06% )  (23.52%)
             1,966      LLC-load-misses           #    1.97% of all LL-cache accesses  ( +-  6.17% )  (23.52%)
               129      LLC-stores                #    0.038 K/sec                    ( +- 27.85% )  (11.75%)
                 7      LLC-store-misses          #    0.002 K/sec                    ( +- 47.82% )  (11.75%)

           3.37465 +- 0.00474 seconds time elapsed  ( +-  0.14% )

The L1-dcache-load-misses (L1D.REPLACEMENT) count is much lower, just
like in the previous two cases. There is no performance improvement
for Skylakex though.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/page_64.h |  1 +
 arch/x86/lib/clear_page_64.S   | 26 ++++++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 939b1cff4a7b..bde3c2785ec4 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -43,6 +43,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
 void clear_page_orig(void *page);
 void clear_page_rep(void *page);
 void clear_page_erms(void *page);
+void clear_page_nt(void *page);
 
 static inline void clear_page(void *page)
 {
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index c4c7dd115953..f16bb753b236 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -50,3 +50,29 @@ SYM_FUNC_START(clear_page_erms)
 	ret
 SYM_FUNC_END(clear_page_erms)
 EXPORT_SYMBOL_GPL(clear_page_erms)
+
+/*
+ * Zero a page.
+ * %rdi - page
+ *
+ * Caller needs to issue a fence at the end.
+ */
+SYM_FUNC_START(clear_page_nt)
+	xorl	%eax,%eax
+	movl	$4096,%ecx
+
+	.p2align 4
+.Lstart:
+        movnti  %rax, 0x00(%rdi)
+        movnti  %rax, 0x08(%rdi)
+        movnti  %rax, 0x10(%rdi)
+        movnti  %rax, 0x18(%rdi)
+        movnti  %rax, 0x20(%rdi)
+        movnti  %rax, 0x28(%rdi)
+        movnti  %rax, 0x30(%rdi)
+        movnti  %rax, 0x38(%rdi)
+        addq    $0x40, %rdi
+        subl    $0x40, %ecx
+        ja      .Lstart
+	ret
+SYM_FUNC_END(clear_page_nt)
-- 
2.9.3



* [PATCH 5/8] x86/clear_page: add clear_page_uncached()
  2020-10-14  8:32 [PATCH 0/8] Use uncached writes while clearing gigantic pages Ankur Arora
                   ` (3 preceding siblings ...)
  2020-10-14  8:32 ` [PATCH 4/8] x86/asm: add clear_page_nt() Ankur Arora
@ 2020-10-14  8:32 ` Ankur Arora
  2020-10-14 11:10   ` kernel test robot
                     ` (2 more replies)
  2020-10-14  8:32 ` [PATCH 6/8] mm, clear_huge_page: use clear_page_uncached() for gigantic pages Ankur Arora
                   ` (2 subsequent siblings)
  7 siblings, 3 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-14  8:32 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill, mhocko, boris.ostrovsky, konrad.wilk, Ankur Arora,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Arnd Bergmann, Andrew Morton, Ira Weiny,
	linux-arch

Define clear_page_uncached() as an alternative_call() that dispatches
to clear_page_nt() if the CPU has X86_FEATURE_NT_GOOD set, and falls
back to clear_page() if it doesn't.

Similarly, define clear_page_uncached_flush(), which issues an SFENCE
if the CPU has X86_FEATURE_NT_GOOD set.

Also, add the glue interface clear_user_highpage_uncached().
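
After boot-time patching this behaves as if it were written as below
(illustration only; the real dispatch is a call target patched by the
alternatives machinery, not a conditional branch):

  static inline void clear_page_uncached(void *page)
  {
  	if (boot_cpu_has(X86_FEATURE_NT_GOOD))
  		clear_page_nt(page);	/* caller must SFENCE when done */
  	else
  		clear_page(page);
  }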

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/page.h    |  6 ++++++
 arch/x86/include/asm/page_32.h |  9 +++++++++
 arch/x86/include/asm/page_64.h | 14 ++++++++++++++
 include/asm-generic/page.h     |  3 +++
 include/linux/highmem.h        | 10 ++++++++++
 5 files changed, 42 insertions(+)

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 7555b48803a8..ca0aa379ac7f 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -28,6 +28,12 @@ static inline void clear_user_page(void *page, unsigned long vaddr,
 	clear_page(page);
 }
 
+static inline void clear_user_page_uncached(void *page, unsigned long vaddr,
+					    struct page *pg)
+{
+	clear_page_uncached(page);
+}
+
 static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
 				  struct page *topage)
 {
diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
index 94dbd51df58f..7a03a274a9a4 100644
--- a/arch/x86/include/asm/page_32.h
+++ b/arch/x86/include/asm/page_32.h
@@ -39,6 +39,15 @@ static inline void clear_page(void *page)
 	memset(page, 0, PAGE_SIZE);
 }
 
+static inline void clear_page_uncached(void *page)
+{
+	clear_page(page);
+}
+
+static inline void clear_page_uncached_flush(void)
+{
+}
+
 static inline void copy_page(void *to, void *from)
 {
 	memcpy(to, from, PAGE_SIZE);
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index bde3c2785ec4..5897075e77dd 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -55,6 +55,20 @@ static inline void clear_page(void *page)
 			   : "cc", "memory", "rax", "rcx");
 }
 
+static inline void clear_page_uncached(void *page)
+{
+	alternative_call(clear_page,
+			 clear_page_nt, X86_FEATURE_NT_GOOD,
+			 "=D" (page),
+			 "0" (page)
+			 : "cc", "memory", "rax", "rcx");
+}
+
+static inline void clear_page_uncached_flush(void)
+{
+	alternative("", "sfence", X86_FEATURE_NT_GOOD);
+}
+
 void copy_page(void *to, void *from);
 
 #endif	/* !__ASSEMBLY__ */
diff --git a/include/asm-generic/page.h b/include/asm-generic/page.h
index fe801f01625e..60235a0cf24a 100644
--- a/include/asm-generic/page.h
+++ b/include/asm-generic/page.h
@@ -26,6 +26,9 @@
 #ifndef __ASSEMBLY__
 
 #define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page_uncached(page)	clear_page(page)
+#define clear_page_uncached_flush()	do { } while (0)
+
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)
 
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 14e6202ce47f..f842593e2474 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -232,6 +232,16 @@ static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
 }
 #endif
 
+#ifndef clear_user_highpage_uncached
+static inline void clear_user_highpage_uncached(struct page *page, unsigned long vaddr)
+{
+	void *addr = kmap_atomic(page);
+
+	clear_user_page_uncached(addr, vaddr, page);
+	kunmap_atomic(addr);
+}
+#endif
+
 #ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 /**
  * __alloc_zeroed_user_highpage - Allocate a zeroed HIGHMEM page for a VMA with caller-specified movable GFP flags
-- 
2.9.3



* [PATCH 6/8] mm, clear_huge_page: use clear_page_uncached() for gigantic pages
  2020-10-14  8:32 [PATCH 0/8] Use uncached writes while clearing gigantic pages Ankur Arora
                   ` (4 preceding siblings ...)
  2020-10-14  8:32 ` [PATCH 5/8] x86/clear_page: add clear_page_uncached() Ankur Arora
@ 2020-10-14  8:32 ` Ankur Arora
  2020-10-14 15:28   ` Ingo Molnar
  2020-10-14  8:32 ` [PATCH 7/8] x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx Ankur Arora
  2020-10-14  8:32 ` [PATCH 8/8] x86/cpu/amd: enable X86_FEATURE_NT_GOOD on AMD Zen Ankur Arora
  7 siblings, 1 reply; 29+ messages in thread
From: Ankur Arora @ 2020-10-14  8:32 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill, mhocko, boris.ostrovsky, konrad.wilk, Ankur Arora, Andrew Morton

Uncached writes are suitable when the region being written is not
expected to be read again soon, or is large enough that there is no
expectation of finding the writes in the cache.
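
For scale: with a 25MB L3 (the Broadwellx box used in this series), a
single 1GB gigantic page is ~40 LLCs' worth of data (1024MB / 25MB),
so clearing it with cached writes would cycle the entire LLC dozens of
times.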

Accordingly switch to using clear_page_uncached() for gigantic pages.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 mm/memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index eeae590e526a..4d2c58f83ab1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5092,7 +5092,7 @@ static void clear_gigantic_page(struct page *page,
 	for (i = 0; i < pages_per_huge_page;
 	     i++, p = mem_map_next(p, page, i)) {
 		cond_resched();
-		clear_user_highpage(p, addr + i * PAGE_SIZE);
+		clear_user_highpage_uncached(p, addr + i * PAGE_SIZE);
 	}
 }
 
@@ -5111,6 +5111,7 @@ void clear_huge_page(struct page *page,
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
 		clear_gigantic_page(page, addr, pages_per_huge_page);
+		clear_page_uncached_flush();
 		return;
 	}
 
-- 
2.9.3



* [PATCH 7/8] x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx
  2020-10-14  8:32 [PATCH 0/8] Use uncached writes while clearing gigantic pages Ankur Arora
                   ` (5 preceding siblings ...)
  2020-10-14  8:32 ` [PATCH 6/8] mm, clear_huge_page: use clear_page_uncached() for gigantic pages Ankur Arora
@ 2020-10-14  8:32 ` Ankur Arora
  2020-10-14 15:31   ` Ingo Molnar
  2020-10-14  8:32 ` [PATCH 8/8] x86/cpu/amd: enable X86_FEATURE_NT_GOOD on AMD Zen Ankur Arora
  7 siblings, 1 reply; 29+ messages in thread
From: Ankur Arora @ 2020-10-14  8:32 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill, mhocko, boris.ostrovsky, konrad.wilk, Ankur Arora,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Tony Luck, Sean Christopherson, Mike Rapoport,
	Xiaoyao Li, Fenghua Yu, Peter Zijlstra (Intel),
	Dave Hansen

System:           Oracle X6-2
CPU:              2 nodes * 10 cores/node * 2 threads/core
		  Intel Xeon E5-2630 v4 (Broadwellx, 6:79:1)
Memory:           256 GB evenly split between nodes
Microcode:        0xb00002e
scaling_governor: performance
L3 size:          25MB
intel_pstate/no_turbo: 1

Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
(X86_FEATURE_ERMS) and x86-64-movnt (X86_FEATURE_NT_GOOD):

              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)       speedup
              -----------------------   -----------------------     -------
     size       BW        (   pstdev)          BW   (   pstdev)

     16MB      17.35 GB/s ( +- 9.27%)    11.83 GB/s ( +- 0.19%)     -31.81%
    128MB       5.31 GB/s ( +- 0.13%)    11.72 GB/s ( +- 0.44%)    +121.84%
   1024MB       5.42 GB/s ( +- 0.13%)    11.78 GB/s ( +- 0.03%)    +117.34%
   4096MB       5.41 GB/s ( +- 0.41%)    11.76 GB/s ( +- 0.07%)    +117.37%

The next workload exercises the page-clearing path directly by faulting over
an anonymous mmap region backed by 1GB pages. This workload is similar to the
creation phase of pinned guests in QEMU.

$ cat pf-test.c
 #include <stdlib.h>
 #include <sys/mman.h>
 #include <linux/mman.h>

 #define HPAGE_BITS 30
 int main(int argc, char **argv) {
	int i;
	unsigned long len = atoi(argv[1]); /* In GB */
	unsigned long offset = 0;
	unsigned long numpages;
	char *base;

	len *= 1UL << 30;
	numpages = len >> HPAGE_BITS;

	base = mmap(NULL, len, PROT_READ|PROT_WRITE,
	            MAP_PRIVATE | MAP_ANONYMOUS |
		    MAP_HUGETLB | MAP_HUGE_1GB, 0, 0);
	if (base == MAP_FAILED)	/* needs 1GB hugepages reserved */
		return 1;

	for (i = 0; i < numpages; i++) {
	        *((volatile char *)base + offset) = *(base + offset);
	        offset += 1UL << HPAGE_BITS;
	}

	return 0;
 }
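
The test assumes 1GB hugepages have been reserved up front, e.g. by
booting with 'hugepagesz=1G hugepages=128' (my setup; adjust as
needed):

  $ gcc -O2 pf-test.c -o pf-test
  $ ./pf-test 128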

The specific test is for a 128GB region but this is a single-threaded
O(n) workload so the exact region size is not material.

Page-clearing throughput for clear_page_erms(): 3.72 GBps
$ perf stat -r 5 --all-kernel -e ... bin/pf-test 128

 Performance counter stats for 'bin/pf-test 128' (5 runs):

    74,799,496,556      cpu-cycles                #    2.176 GHz                      ( +-  2.22% )  (29.41%)
     1,474,615,023      instructions              #    0.02  insn per cycle           ( +-  0.23% )  (35.29%)
     2,148,580,131      cache-references          #   62.502 M/sec                    ( +-  0.02% )  (35.29%)
        71,736,985      cache-misses              #    3.339 % of all cache refs      ( +-  0.94% )  (35.29%)
       433,713,165      branch-instructions       #   12.617 M/sec                    ( +-  0.15% )  (35.30%)
         1,008,251      branch-misses             #    0.23% of all branches          ( +-  1.88% )  (35.30%)
     3,406,821,966      bus-cycles                #   99.104 M/sec                    ( +-  2.22% )  (23.53%)
     2,156,059,110      L1-dcache-load-misses     #  445.35% of all L1-dcache accesses  ( +-  0.01% )  (23.53%)
       484,128,243      L1-dcache-loads           #   14.083 M/sec                    ( +-  0.22% )  (23.53%)
           944,216      LLC-loads                 #    0.027 M/sec                    ( +-  7.41% )  (23.53%)
           537,989      LLC-load-misses           #   56.98% of all LL-cache accesses  ( +- 13.64% )  (23.53%)
     2,150,138,476      LLC-stores                #   62.547 M/sec                    ( +-  0.01% )  (11.76%)
        69,598,760      LLC-store-misses          #    2.025 M/sec                    ( +-  0.47% )  (11.76%)
       483,923,875      dTLB-loads                #   14.077 M/sec                    ( +-  0.21% )  (17.64%)
             1,892      dTLB-load-misses          #    0.00% of all dTLB cache accesses  ( +- 30.63% )  (23.53%)
     4,799,154,980      dTLB-stores               #  139.606 M/sec                    ( +-  0.03% )  (23.53%)
                90      dTLB-store-misses         #    0.003 K/sec                    ( +- 35.92% )  (23.53%)

            34.377 +- 0.760 seconds time elapsed  ( +-  2.21% )

Page-clearing throughput with clear_page_nt(): 11.78GBps
$ perf stat -r 5 --all-kernel -e ... bin/pf-test 128

 Performance counter stats for 'bin/pf-test 128' (5 runs):

    23,699,446,603      cpu-cycles                #    2.182 GHz                      ( +-  0.01% )  (23.53%)
    24,794,548,512      instructions              #    1.05  insn per cycle           ( +-  0.00% )  (29.41%)
           432,775      cache-references          #    0.040 M/sec                    ( +-  3.96% )  (29.41%)
            75,580      cache-misses              #   17.464 % of all cache refs      ( +- 51.42% )  (29.41%)
     2,492,858,290      branch-instructions       #  229.475 M/sec                    ( +-  0.00% )  (29.42%)
        34,016,826      branch-misses             #    1.36% of all branches          ( +-  0.04% )  (29.42%)
     1,078,468,643      bus-cycles                #   99.276 M/sec                    ( +-  0.01% )  (23.53%)
           717,228      L1-dcache-load-misses     #    0.20% of all L1-dcache accesses  ( +-  3.77% )  (23.53%)
       351,999,535      L1-dcache-loads           #   32.403 M/sec                    ( +-  0.04% )  (23.53%)
            75,988      LLC-loads                 #    0.007 M/sec                    ( +-  4.20% )  (23.53%)
            24,503      LLC-load-misses           #   32.25% of all LL-cache accesses  ( +- 53.30% )  (23.53%)
            57,283      LLC-stores                #    0.005 M/sec                    ( +-  2.15% )  (11.76%)
            19,738      LLC-store-misses          #    0.002 M/sec                    ( +- 46.55% )  (11.76%)
       351,836,498      dTLB-loads                #   32.388 M/sec                    ( +-  0.04% )  (17.65%)
             1,171      dTLB-load-misses          #    0.00% of all dTLB cache accesses  ( +- 42.68% )  (23.53%)
    17,385,579,725      dTLB-stores               # 1600.392 M/sec                    ( +-  0.00% )  (23.53%)
               200      dTLB-store-misses         #    0.018 K/sec                    ( +- 10.63% )  (23.53%)

         10.863678 +- 0.000804 seconds time elapsed  ( +-  0.01% )

The L1-dcache-load-misses (L1D.REPLACEMENT) count is substantially
lower, which suggests that, as expected, we aren't doing
write-allocate or RFO.

Note that the IPC and instruction counts etc. are quite different,
but that's just an artifact of switching from a single 'REP; STOSB'
per PAGE_SIZE region to a MOVNTI loop.
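
(Sanity-checking the instruction count: the MOVNTI loop is 8 stores
plus 3 loop-control instructions per 64 bytes, i.e. ~704 instructions
per 4K page; over 128GB that predicts ~23.6 billion instructions,
consistent with the ~24.8 billion measured once the per-page call and
loop overhead in clear_huge_page() is added.)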

The page-clearing BW is substantially higher (~100% or more), so enable
X86_FEATURE_NT_GOOD for Intel Broadwellx.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/kernel/cpu/intel.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 59a1e3ce3f14..161028c1dee0 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -662,6 +662,8 @@ static void init_intel(struct cpuinfo_x86 *c)
 		c->x86_cache_alignment = c->x86_clflush_size * 2;
 	if (c->x86 == 6)
 		set_cpu_cap(c, X86_FEATURE_REP_GOOD);
+	if (c->x86 == 6 && c->x86_model == INTEL_FAM6_BROADWELL_X)
+		set_cpu_cap(c, X86_FEATURE_NT_GOOD);
 #else
 	/*
 	 * Names for the Pentium II/Celeron processors
-- 
2.9.3



* [PATCH 8/8] x86/cpu/amd: enable X86_FEATURE_NT_GOOD on AMD Zen
  2020-10-14  8:32 [PATCH 0/8] Use uncached writes while clearing gigantic pages Ankur Arora
                   ` (6 preceding siblings ...)
  2020-10-14  8:32 ` [PATCH 7/8] x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx Ankur Arora
@ 2020-10-14  8:32 ` Ankur Arora
  7 siblings, 0 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-14  8:32 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: kirill, mhocko, boris.ostrovsky, konrad.wilk, Ankur Arora,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Kim Phillips, Reinette Chatre, Tony Luck,
	Tom Lendacky, Wei Huang

System:           Oracle E2-2C
CPU:              2 nodes * 64 cores/node * 2 threads/core
                  AMD EPYC 7742 (Rome, 23:49:0)
Memory:           2048 GB evenly split between nodes
Microcode:        0x8301038
scaling_governor: performance
L3 size:          16 * 16MB
cpufreq/boost:    0

Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosq
(X86_FEATURE_REP_GOOD) and x86-64-movnt (X86_FEATURE_NT_GOOD):

              x86-64-stosq (5 runs)     x86-64-movnt (5 runs)      speedup
              -----------------------   -----------------------    -------
     size       BW        (   pstdev)          BW   (   pstdev)

     16MB      15.39 GB/s ( +- 9.14%)    14.56 GB/s ( +-19.43%)     -5.39%
    128MB      11.04 GB/s ( +- 4.87%)    14.49 GB/s ( +-13.22%)    +31.25%
   1024MB      11.86 GB/s ( +- 0.83%)    16.54 GB/s ( +- 0.04%)    +39.46%
   4096MB      11.89 GB/s ( +- 0.61%)    16.49 GB/s ( +- 0.28%)    +38.68%

The next workload exercises the page-clearing path directly by faulting over
an anonymous mmap region backed by 1GB pages. This workload is similar to the
creation phase of pinned guests in QEMU.

$ cat pf-test.c
 #include <stdlib.h>
 #include <sys/mman.h>
 #include <linux/mman.h>

 #define HPAGE_BITS 30

 int main(int argc, char **argv) {
	int i;
	unsigned long len = atoi(argv[1]); /* In GB */
	unsigned long offset = 0;
	unsigned long numpages;
	char *base;

	len *= 1UL << 30;
	numpages = len >> HPAGE_BITS;

	base = mmap(NULL, len, PROT_READ|PROT_WRITE,
	            MAP_PRIVATE | MAP_ANONYMOUS |
		    MAP_HUGETLB | MAP_HUGE_1GB, 0, 0);
	if (base == MAP_FAILED)	/* needs 1GB hugepages reserved */
		return 1;

	for (i = 0; i < numpages; i++) {
	        *((volatile char *)base + offset) = *(base + offset);
	        offset += 1UL << HPAGE_BITS;
	}

	return 0;
 }

The specific test is for a 128GB region but this is a single-threaded
O(n) workload so the exact region size is not material.

Page-clearing throughput for clear_page_rep(): 11.33 GBps
$ perf stat -r 5 --all-kernel -e ... bin/pf-test 128

 Performance counter stats for 'bin/pf-test 128' (5 runs):

    25,130,082,910      cpu-cycles                #    2.226 GHz                      ( +-  0.44% )  (54.54%)
     1,368,762,311      instructions              #    0.05  insn per cycle           ( +-  0.02% )  (54.54%)
     4,265,726,534      cache-references          #  377.794 M/sec                    ( +-  0.02% )  (54.54%)
       119,021,793      cache-misses              #    2.790 % of all cache refs      ( +-  3.90% )  (54.55%)
       413,825,787      branch-instructions       #   36.650 M/sec                    ( +-  0.01% )  (54.55%)
           236,847      branch-misses             #    0.06% of all branches          ( +- 18.80% )  (54.56%)
     2,152,320,887      L1-dcache-load-misses     #   40.40% of all L1-dcache accesses  ( +-  0.01% )  (54.55%)
     5,326,873,560      L1-dcache-loads           #  471.775 M/sec                    ( +-  0.20% )  (54.55%)
       828,943,234      L1-dcache-prefetches      #   73.415 M/sec                    ( +-  0.55% )  (54.54%)
            18,914      dTLB-loads                #    0.002 M/sec                    ( +- 47.23% )  (54.54%)
             4,423      dTLB-load-misses          #   23.38% of all dTLB cache accesses  ( +- 27.75% )  (54.54%)

           11.2917 +- 0.0499 seconds time elapsed  ( +-  0.44% )

Page-clearing throughput for clear_page_nt(): 16.29 GBps
$ perf stat -r 5 --all-kernel -e ... bin/pf-test 128

 Performance counter stats for 'bin/pf-test 128' (5 runs):

    17,523,166,924      cpu-cycles                #    2.230 GHz                      ( +-  0.03% )  (45.43%)
    24,801,270,826      instructions              #    1.42  insn per cycle           ( +-  0.01% )  (45.45%)
     2,151,391,033      cache-references          #  273.845 M/sec                    ( +-  0.01% )  (45.46%)
           168,555      cache-misses              #    0.008 % of all cache refs      ( +-  4.87% )  (45.47%)
     2,490,226,446      branch-instructions       #  316.974 M/sec                    ( +-  0.01% )  (45.48%)
           117,604      branch-misses             #    0.00% of all branches          ( +-  1.56% )  (45.48%)
           273,492      L1-dcache-load-misses     #    0.06% of all L1-dcache accesses  ( +-  2.14% )  (45.47%)
       490,340,458      L1-dcache-loads           #   62.414 M/sec                    ( +-  0.02% )  (45.45%)
            20,517      L1-dcache-prefetches      #    0.003 M/sec                    ( +-  9.61% )  (45.44%)
             7,413      dTLB-loads                #    0.944 K/sec                    ( +-  8.37% )  (45.44%)
             2,031      dTLB-load-misses          #   27.40% of all dTLB cache accesses  ( +-  8.30% )  (45.43%)

           7.85674 +- 0.00270 seconds time elapsed  ( +-  0.03% )

The L1-dcache-load-misses (L2$ access from DC Miss) count is
substantially lower, which suggests we aren't doing write-allocate or
RFO. The L1-dcache-prefetches are also substantially lower.

Note that the IPC and instruction counts etc. are quite different,
but that's just an artifact of switching from a single 'REP; STOSQ'
per PAGE_SIZE region to a MOVNTI loop.

The page-clearing BW shows a ~40% improvement. Additionally, a quick
'perf bench memset' comparison on AMD Naples (AMD EPYC 7551) shows
similar performance gains. So, enable X86_FEATURE_NT_GOOD for
AMD Zen.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/kernel/cpu/amd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index dcc3d943c68f..c57eb6c28aa1 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -918,6 +918,9 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
 {
 	set_cpu_cap(c, X86_FEATURE_ZEN);
 
+	if (c->x86 == 0x17)
+		set_cpu_cap(c, X86_FEATURE_NT_GOOD);
+
 #ifdef CONFIG_NUMA
 	node_reclaim_distance = 32;
 #endif
-- 
2.9.3



* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
  2020-10-14  8:32 ` [PATCH 5/8] x86/clear_page: add clear_page_uncached() Ankur Arora
@ 2020-10-14 11:10   ` kernel test robot
  2020-10-14 13:04   ` kernel test robot
  2020-10-14 15:45   ` Andy Lutomirski
  2 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2020-10-14 11:10 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm
  Cc: kbuild-all, kirill, mhocko, boris.ostrovsky, konrad.wilk,
	Ankur Arora, Thomas Gleixner, Ingo Molnar, Borislav Petkov


Hi Ankur,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/master]
[also build test ERROR on linus/master next-20201013]
[cannot apply to tip/x86/core linux/master v5.9]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Ankur-Arora/Use-uncached-writes-while-clearing-gigantic-pages/20201014-163720
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 80f92ca9b86c71450f003d39956fca4327cc5586
config: riscv-randconfig-r006-20201014 (attached as .config)
compiler: riscv32-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/6a1ec80588fc845c7ce6bd0e0e3635bf07d9110d
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Ankur-Arora/Use-uncached-writes-while-clearing-gigantic-pages/20201014-163720
        git checkout 6a1ec80588fc845c7ce6bd0e0e3635bf07d9110d
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=riscv 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from net/socket.c:74:
   include/linux/highmem.h: In function 'clear_user_highpage_uncached':
>> include/linux/highmem.h:240:2: error: implicit declaration of function 'clear_user_page_uncached'; did you mean 'clear_user_highpage_uncached'? [-Werror=implicit-function-declaration]
     240 |  clear_user_page_uncached(addr, vaddr, page);
         |  ^~~~~~~~~~~~~~~~~~~~~~~~
         |  clear_user_highpage_uncached
   net/socket.c: In function '__sys_getsockopt':
   net/socket.c:2155:6: warning: variable 'max_optlen' set but not used [-Wunused-but-set-variable]
    2155 |  int max_optlen;
         |      ^~~~~~~~~~
   cc1: some warnings being treated as errors
--
   In file included from include/linux/pagemap.h:11,
                    from include/linux/blkdev.h:13,
                    from include/linux/blk-cgroup.h:23,
                    from include/linux/writeback.h:14,
                    from include/linux/memcontrol.h:22,
                    from include/net/sock.h:53,
                    from net/sysctl_net.c:20:
   include/linux/highmem.h: In function 'clear_user_highpage_uncached':
>> include/linux/highmem.h:240:2: error: implicit declaration of function 'clear_user_page_uncached'; did you mean 'clear_user_highpage_uncached'? [-Werror=implicit-function-declaration]
     240 |  clear_user_page_uncached(addr, vaddr, page);
         |  ^~~~~~~~~~~~~~~~~~~~~~~~
         |  clear_user_highpage_uncached
   cc1: some warnings being treated as errors
--
   In file included from include/linux/pagemap.h:11,
                    from include/linux/blkdev.h:13,
                    from include/linux/blk-cgroup.h:23,
                    from include/linux/writeback.h:14,
                    from include/linux/memcontrol.h:22,
                    from include/net/sock.h:53,
                    from include/linux/mroute_base.h:8,
                    from include/linux/mroute.h:10,
                    from net/ipv4/route.c:82:
   include/linux/highmem.h: In function 'clear_user_highpage_uncached':
>> include/linux/highmem.h:240:2: error: implicit declaration of function 'clear_user_page_uncached'; did you mean 'clear_user_highpage_uncached'? [-Werror=implicit-function-declaration]
     240 |  clear_user_page_uncached(addr, vaddr, page);
         |  ^~~~~~~~~~~~~~~~~~~~~~~~
         |  clear_user_highpage_uncached
   net/ipv4/route.c: In function 'ip_rt_send_redirect':
   net/ipv4/route.c:878:6: warning: variable 'log_martians' set but not used [-Wunused-but-set-variable]
     878 |  int log_martians;
         |      ^~~~~~~~~~~~
   cc1: some warnings being treated as errors
--
   In file included from include/linux/pagemap.h:11,
                    from include/linux/blkdev.h:13,
                    from include/linux/blk-cgroup.h:23,
                    from include/linux/writeback.h:14,
                    from include/linux/memcontrol.h:22,
                    from include/net/sock.h:53,
                    from include/net/inet_sock.h:22,
                    from include/net/ip.h:28,
                    from net/ipv6/ip6_fib.c:28:
   include/linux/highmem.h: In function 'clear_user_highpage_uncached':
>> include/linux/highmem.h:240:2: error: implicit declaration of function 'clear_user_page_uncached'; did you mean 'clear_user_highpage_uncached'? [-Werror=implicit-function-declaration]
     240 |  clear_user_page_uncached(addr, vaddr, page);
         |  ^~~~~~~~~~~~~~~~~~~~~~~~
         |  clear_user_highpage_uncached
   net/ipv6/ip6_fib.c: In function 'fib6_add':
   net/ipv6/ip6_fib.c:1373:25: warning: variable 'pn' set but not used [-Wunused-but-set-variable]
    1373 |  struct fib6_node *fn, *pn = NULL;
         |                         ^~
   cc1: some warnings being treated as errors
--
   In file included from include/linux/pagemap.h:11,
                    from include/linux/blkdev.h:13,
                    from include/linux/blk-cgroup.h:23,
                    from include/linux/writeback.h:14,
                    from include/linux/memcontrol.h:22,
                    from include/net/sock.h:53,
                    from include/linux/tcp.h:19,
                    from include/linux/ipv6.h:88,
                    from include/linux/netfilter/ipset/ip_set.h:11,
                    from net/netfilter/ipset/ip_set_core.c:23:
   include/linux/highmem.h: In function 'clear_user_highpage_uncached':
>> include/linux/highmem.h:240:2: error: implicit declaration of function 'clear_user_page_uncached'; did you mean 'clear_user_highpage_uncached'? [-Werror=implicit-function-declaration]
     240 |  clear_user_page_uncached(addr, vaddr, page);
         |  ^~~~~~~~~~~~~~~~~~~~~~~~
         |  clear_user_highpage_uncached
   net/netfilter/ipset/ip_set_core.c: In function 'ip_set_rename':
   net/netfilter/ipset/ip_set_core.c:1363:2: warning: 'strncpy' specified bound 32 equals destination size [-Wstringop-truncation]
    1363 |  strncpy(set->name, name2, IPSET_MAXNAMELEN);
         |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors
--
   In file included from include/linux/pagemap.h:11,
                    from include/linux/blkdev.h:13,
                    from include/linux/blk-cgroup.h:23,
                    from include/linux/writeback.h:14,
                    from include/linux/memcontrol.h:22,
                    from include/net/sock.h:53,
                    from net/nfc/nci/../nfc.h:14,
                    from net/nfc/nci/hci.c:13:
   include/linux/highmem.h: In function 'clear_user_highpage_uncached':
>> include/linux/highmem.h:240:2: error: implicit declaration of function 'clear_user_page_uncached'; did you mean 'clear_user_highpage_uncached'? [-Werror=implicit-function-declaration]
     240 |  clear_user_page_uncached(addr, vaddr, page);
         |  ^~~~~~~~~~~~~~~~~~~~~~~~
         |  clear_user_highpage_uncached
   net/nfc/nci/hci.c: In function 'nci_hci_resp_received':
   net/nfc/nci/hci.c:369:5: warning: variable 'status' set but not used [-Wunused-but-set-variable]
     369 |  u8 status = result;
         |     ^~~~~~
   cc1: some warnings being treated as errors
--
   In file included from include/linux/pagemap.h:11,
                    from include/linux/blkdev.h:13,
                    from include/linux/blk-cgroup.h:23,
                    from include/linux/writeback.h:14,
                    from include/linux/memcontrol.h:22,
                    from include/net/sock.h:53,
                    from include/linux/tcp.h:19,
                    from include/linux/ipv6.h:88,
                    from include/net/ipv6.h:12,
                    from net/ipv6/netfilter/nf_reject_ipv6.c:7:
   include/linux/highmem.h: In function 'clear_user_highpage_uncached':
>> include/linux/highmem.h:240:2: error: implicit declaration of function 'clear_user_page_uncached'; did you mean 'clear_user_highpage_uncached'? [-Werror=implicit-function-declaration]
     240 |  clear_user_page_uncached(addr, vaddr, page);
         |  ^~~~~~~~~~~~~~~~~~~~~~~~
         |  clear_user_highpage_uncached
   net/ipv6/netfilter/nf_reject_ipv6.c: In function 'nf_send_reset6':
   net/ipv6/netfilter/nf_reject_ipv6.c:152:18: warning: variable 'ip6h' set but not used [-Wunused-but-set-variable]
     152 |  struct ipv6hdr *ip6h;
         |                  ^~~~
   cc1: some warnings being treated as errors
--
   In file included from include/linux/pagemap.h:11,
                    from include/linux/blkdev.h:13,
                    from include/linux/blk-cgroup.h:23,
                    from include/linux/writeback.h:14,
                    from include/linux/memcontrol.h:22,
                    from include/net/sock.h:53,
                    from include/linux/tcp.h:19,
                    from net/netfilter/ipvs/ip_vs_core.c:28:
   include/linux/highmem.h: In function 'clear_user_highpage_uncached':
>> include/linux/highmem.h:240:2: error: implicit declaration of function 'clear_user_page_uncached'; did you mean 'clear_user_highpage_uncached'? [-Werror=implicit-function-declaration]
     240 |  clear_user_page_uncached(addr, vaddr, page);
         |  ^~~~~~~~~~~~~~~~~~~~~~~~
         |  clear_user_highpage_uncached
   net/netfilter/ipvs/ip_vs_core.c: In function 'ip_vs_in_icmp':
   net/netfilter/ipvs/ip_vs_core.c:1660:8: warning: variable 'outer_proto' set but not used [-Wunused-but-set-variable]
    1660 |  char *outer_proto = "IPIP";
         |        ^~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +240 include/linux/highmem.h

   234	
   235	#ifndef clear_user_highpage_uncached
   236	static inline void clear_user_highpage_uncached(struct page *page, unsigned long vaddr)
   237	{
   238		void *addr = kmap_atomic(page);
   239	
 > 240		clear_user_page_uncached(addr, vaddr, page);
   241		kunmap_atomic(addr);
   242	}
   243	#endif
   244	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 34217 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
  2020-10-14  8:32 ` [PATCH 5/8] x86/clear_page: add clear_page_uncached() Ankur Arora
  2020-10-14 11:10   ` kernel test robot
@ 2020-10-14 13:04   ` kernel test robot
  2020-10-14 15:45   ` Andy Lutomirski
  2 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2020-10-14 13:04 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm
  Cc: kbuild-all, clang-built-linux, kirill, mhocko, boris.ostrovsky,
	konrad.wilk, Ankur Arora, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov

[-- Attachment #1: Type: text/plain, Size: 3517 bytes --]

Hi Ankur,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/master]
[also build test ERROR on linus/master next-20201013]
[cannot apply to tip/x86/core linux/master v5.9]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Ankur-Arora/Use-uncached-writes-while-clearing-gigantic-pages/20201014-163720
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 80f92ca9b86c71450f003d39956fca4327cc5586
config: arm64-randconfig-r001-20201014 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project e7fe3c6dfede8d5781bd000741c3dea7088307a4)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm64 cross compiling tool for clang build
        # apt-get install binutils-aarch64-linux-gnu
        # https://github.com/0day-ci/linux/commit/6a1ec80588fc845c7ce6bd0e0e3635bf07d9110d
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Ankur-Arora/Use-uncached-writes-while-clearing-gigantic-pages/20201014-163720
        git checkout 6a1ec80588fc845c7ce6bd0e0e3635bf07d9110d
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=arm64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from arch/arm64/kernel/asm-offsets.c:16:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:22:
   In file included from include/linux/writeback.h:14:
   In file included from include/linux/blk-cgroup.h:23:
   In file included from include/linux/blkdev.h:13:
   In file included from include/linux/pagemap.h:11:
>> include/linux/highmem.h:240:2: error: implicit declaration of function 'clear_user_page_uncached' [-Werror,-Wimplicit-function-declaration]
           clear_user_page_uncached(addr, vaddr, page);
           ^
   include/linux/highmem.h:240:2: note: did you mean 'clear_user_highpage_uncached'?
   include/linux/highmem.h:236:20: note: 'clear_user_highpage_uncached' declared here
   static inline void clear_user_highpage_uncached(struct page *page, unsigned long vaddr)
                      ^
   1 error generated.
   make[2]: *** [scripts/Makefile.build:117: arch/arm64/kernel/asm-offsets.s] Error 1
   make[2]: Target '__build' not remade because of errors.
   make[1]: *** [Makefile:1198: prepare0] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:185: __sub-make] Error 2
   make: Target 'prepare' not remade because of errors.

vim +/clear_user_page_uncached +240 include/linux/highmem.h

   234	
   235	#ifndef clear_user_highpage_uncached
   236	static inline void clear_user_highpage_uncached(struct page *page, unsigned long vaddr)
   237	{
   238		void *addr = kmap_atomic(page);
   239	
 > 240		clear_user_page_uncached(addr, vaddr, page);
   241		kunmap_atomic(addr);
   242	}
   243	#endif
   244	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 38942 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/8] mm, clear_huge_page: use clear_page_uncached() for gigantic pages
  2020-10-14  8:32 ` [PATCH 6/8] mm, clear_huge_page: use clear_page_uncached() for gigantic pages Ankur Arora
@ 2020-10-14 15:28   ` Ingo Molnar
  2020-10-14 19:15     ` Ankur Arora
  0 siblings, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2020-10-14 15:28 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, kirill, mhocko, boris.ostrovsky,
	konrad.wilk, Andrew Morton


* Ankur Arora <ankur.a.arora@oracle.com> wrote:

> Uncached writes are suitable for circumstances where the region written to
> is not expected to be read again soon, or the region written to is large
> enough that there's no expectation that we will find the writes in the
> cache.
> 
> Accordingly switch to using clear_page_uncached() for gigantic pages.
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  mm/memory.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index eeae590e526a..4d2c58f83ab1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5092,7 +5092,7 @@ static void clear_gigantic_page(struct page *page,
>  	for (i = 0; i < pages_per_huge_page;
>  	     i++, p = mem_map_next(p, page, i)) {
>  		cond_resched();
> -		clear_user_highpage(p, addr + i * PAGE_SIZE);
> +		clear_user_highpage_uncached(p, addr + i * PAGE_SIZE);
>  	}
>  }

So this does the clearing in 4K chunks, and your measurements suggest that 
short memory clearing is not as efficient, right?

I'm wondering whether it would make sense to do 2MB chunked clearing on 
64-bit CPUs, instead of 512x 4k clearing? Both 2MB and GB pages are 
contiguous in memory, so accessible to these instructions in a single
narrow loop.
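
Roughly, something like (a sketch only; clear_page_nt_region() is a
made-up stand-in for an NT-clearing primitive over an arbitrary extent):

  #define CLEAR_CHUNK_SIZE      (2UL << 20)

  /* Clear a gigantic page in 2MB chunks rather than 512x 4K units,
   * still rescheduling between chunks. */
  static void clear_gigantic_page_chunked(void *kaddr, unsigned long size)
  {
          unsigned long off;

          for (off = 0; off < size; off += CLEAR_CHUNK_SIZE) {
                  clear_page_nt_region(kaddr + off, CLEAR_CHUNK_SIZE);
                  cond_resched();
          }
  }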

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/8] x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx
  2020-10-14  8:32 ` [PATCH 7/8] x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx Ankur Arora
@ 2020-10-14 15:31   ` Ingo Molnar
  2020-10-14 19:23     ` Ankur Arora
  0 siblings, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2020-10-14 15:31 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, kirill, mhocko, boris.ostrovsky,
	konrad.wilk, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Tony Luck, Sean Christopherson, Mike Rapoport,
	Xiaoyao Li, Fenghua Yu, Peter Zijlstra (Intel),
	Dave Hansen


* Ankur Arora <ankur.a.arora@oracle.com> wrote:

> System:           Oracle X6-2
> CPU:              2 nodes * 10 cores/node * 2 threads/core
> 		  Intel Xeon E5-2630 v4 (Broadwellx, 6:79:1)
> Memory:           256 GB evenly split between nodes
> Microcode:        0xb00002e
> scaling_governor: performance
> L3 size:          25MB
> intel_pstate/no_turbo: 1
> 
> Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
> (X86_FEATURE_ERMS) and x86-64-movnt (X86_FEATURE_NT_GOOD):
> 
>               x86-64-stosb (5 runs)     x86-64-movnt (5 runs)       speedup
>               -----------------------   -----------------------     -------
>      size       BW        (   pstdev)          BW   (   pstdev)
> 
>      16MB      17.35 GB/s ( +- 9.27%)    11.83 GB/s ( +- 0.19%)     -31.81%
>     128MB       5.31 GB/s ( +- 0.13%)    11.72 GB/s ( +- 0.44%)    +121.84%
>    1024MB       5.42 GB/s ( +- 0.13%)    11.78 GB/s ( +- 0.03%)    +117.34%
>    4096MB       5.41 GB/s ( +- 0.41%)    11.76 GB/s ( +- 0.07%)    +117.37%

> +	if (c->x86 == 6 && c->x86_model == INTEL_FAM6_BROADWELL_X)
> +		set_cpu_cap(c, X86_FEATURE_NT_GOOD);

So while I agree with how you've done careful measurements to isolate bad 
microarchitectures where non-temporal stores are slow, I do think this 
approach of opt-in doesn't scale and is hard to maintain.

Instead I'd suggest enabling this by default everywhere, and creating a 
X86_FEATURE_NT_BAD quirk table for the bad microarchitectures.

This means that with new microarchitectures we'd get automatic enablement, 
and hopefully chip testing would identify cases where performance isn't as 
good.

I.e. the 'trust but verify' method.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
  2020-10-14  8:32 ` [PATCH 5/8] x86/clear_page: add clear_page_uncached() Ankur Arora
  2020-10-14 11:10   ` kernel test robot
  2020-10-14 13:04   ` kernel test robot
@ 2020-10-14 15:45   ` Andy Lutomirski
  2020-10-14 19:58     ` Borislav Petkov
  2020-10-14 20:54     ` Ankur Arora
  2 siblings, 2 replies; 29+ messages in thread
From: Andy Lutomirski @ 2020-10-14 15:45 UTC (permalink / raw)
  To: Ankur Arora
  Cc: LKML, Linux-MM, Kirill A. Shutemov, Michal Hocko,
	Boris Ostrovsky, Konrad Rzeszutek Wilk, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, X86 ML, H. Peter Anvin,
	Arnd Bergmann, Andrew Morton, Ira Weiny, linux-arch

On Wed, Oct 14, 2020 at 1:33 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> Define clear_page_uncached() as an alternative_call() to clear_page_nt()
> if the CPU sets X86_FEATURE_NT_GOOD and fallback to clear_page() if it
> doesn't.
>
> Similarly define clear_page_uncached_flush() which provides an SFENCE
> if the CPU sets X86_FEATURE_NT_GOOD.

As long as you keep "NT" or "MOVNTI" in the names and keep functions
in arch/x86, I think it's reasonable to expect that callers understand
that MOVNTI has bizarre memory ordering rules.  But once you give
something a generic name like "clear_page_uncached" and stick it in
generic code, I think the semantics should be more obvious.

How about:

clear_page_uncached_unordered() or clear_page_uncached_incoherent()

and

flush_after_clear_page_uncached()

After all, a naive reader might expect "uncached" to imply "caches are
off and this is coherent with everything".  And the results of getting
this wrong will be subtle and possibly hard-to-reproduce corruption.
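
Concretely, a caller under the proposed names would look something like
this (a sketch; none of these names exist in-tree):

  /* Weakly-ordered NT clears over a whole range, then one explicit
   * fence before anyone else can observe the zeroed pages. */
  static void clear_pages_incoherent_sketch(struct page *page,
                                            unsigned int nr_pages)
  {
          unsigned int i;

          for (i = 0; i < nr_pages; i++)
                  clear_page_uncached_incoherent(page_address(page + i));
          flush_after_clear_page_uncached();      /* SFENCE on x86 */
  }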

--Andy

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/8] mm, clear_huge_page: use clear_page_uncached() for gigantic pages
  2020-10-14 15:28   ` Ingo Molnar
@ 2020-10-14 19:15     ` Ankur Arora
  0 siblings, 0 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-14 19:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, kirill, mhocko, boris.ostrovsky,
	konrad.wilk, Andrew Morton

On 2020-10-14 8:28 a.m., Ingo Molnar wrote:
> 
> * Ankur Arora <ankur.a.arora@oracle.com> wrote:
> 
>> Uncached writes are suitable for circumstances where the region written to
>> is not expected to be read again soon, or the region written to is large
>> enough that there's no expectation that we will find the writes in the
>> cache.
>>
>> Accordingly switch to using clear_page_uncached() for gigantic pages.
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>   mm/memory.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index eeae590e526a..4d2c58f83ab1 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -5092,7 +5092,7 @@ static void clear_gigantic_page(struct page *page,
>>   	for (i = 0; i < pages_per_huge_page;
>>   	     i++, p = mem_map_next(p, page, i)) {
>>   		cond_resched();
>> -		clear_user_highpage(p, addr + i * PAGE_SIZE);
>> +		clear_user_highpage_uncached(p, addr + i * PAGE_SIZE);
>>   	}
>>   }
> 
> So this does the clearing in 4K chunks, and your measurements suggest that
> short memory clearing is not as efficient, right?
I did not measure that separately (though I should), but the performance numbers
around that were somewhat puzzling.

For MOVNTI, the performance via perf bench (a single call to memset_movnti())
is pretty close (within the margin of error) to what we see with the
page-fault workload (4K chunks in clear_page_nt()).

With 'REP;STOS' though, there's a degradation (~30% on Broadwell, ~5% on Rome)
going from perf bench (a single call to memset_erms()) to the page-fault
workload (4K chunks in clear_page_erms()).

In the second case we execute many more 'REP;STOS' invocations, even though
the total instruction count is pretty much the same in both cases, so maybe
that's what accounts for it. But I checked, and we are not frontend bound.

Maybe there are high setup costs for 'REP;STOS' on Broadwell? It does advertise
X86_FEATURE_ERMS though...

> 
> I'm wondering whether it would make sense to do 2MB chunked clearing on
> 64-bit CPUs, instead of 512x 4k clearing? Both 2MB and GB pages are
> contiguous in memory, so accessible to these instructions in a single
> narrow loop.
Yeah, I think it makes sense to do and should be quite straightforward
as well. I'll try that out. I suspect it might help the X86_FEATURE_NT_BAD
models more, but there's no reason for it to hurt anywhere.


Ankur

> 
> Thanks,
> 

> 	Ingo
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/8] x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx
  2020-10-14 15:31   ` Ingo Molnar
@ 2020-10-14 19:23     ` Ankur Arora
  0 siblings, 0 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-14 19:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, kirill, mhocko, boris.ostrovsky,
	konrad.wilk, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Tony Luck, Sean Christopherson, Mike Rapoport,
	Xiaoyao Li, Fenghua Yu, Peter Zijlstra (Intel),
	Dave Hansen

On 2020-10-14 8:31 a.m., Ingo Molnar wrote:
> 
> * Ankur Arora <ankur.a.arora@oracle.com> wrote:
> 
>> System:           Oracle X6-2
>> CPU:              2 nodes * 10 cores/node * 2 threads/core
>> 		  Intel Xeon E5-2630 v4 (Broadwellx, 6:79:1)
>> Memory:           256 GB evenly split between nodes
>> Microcode:        0xb00002e
>> scaling_governor: performance
>> L3 size:          25MB
>> intel_pstate/no_turbo: 1
>>
>> Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
>> (X86_FEATURE_ERMS) and x86-64-movnt (X86_FEATURE_NT_GOOD):
>>
>>                x86-64-stosb (5 runs)     x86-64-movnt (5 runs)       speedup
>>                -----------------------   -----------------------     -------
>>       size       BW        (   pstdev)          BW   (   pstdev)
>>
>>       16MB      17.35 GB/s ( +- 9.27%)    11.83 GB/s ( +- 0.19%)     -31.81%
>>      128MB       5.31 GB/s ( +- 0.13%)    11.72 GB/s ( +- 0.44%)    +121.84%
>>     1024MB       5.42 GB/s ( +- 0.13%)    11.78 GB/s ( +- 0.03%)    +117.34%
>>     4096MB       5.41 GB/s ( +- 0.41%)    11.76 GB/s ( +- 0.07%)    +117.37%
> 
>> +	if (c->x86 == 6 && c->x86_model == INTEL_FAM6_BROADWELL_X)
>> +		set_cpu_cap(c, X86_FEATURE_NT_GOOD);
> 
> So while I agree with how you've done careful measurements to isolate bad
> microarchitectures where non-temporal stores are slow, I do think this
> approach of opt-in doesn't scale and is hard to maintain.
> 
> Instead I'd suggest enabling this by default everywhere, and creating a
> X86_FEATURE_NT_BAD quirk table for the bad microarchitectures.
Okay, some kind of quirk table is a great idea. It also means that there's
a single place for keeping this, rather than it being scattered all over
the code.

That also simplifies my handling of features like X86_FEATURE_CLZERO.
I was concerned that, if you squint a bit, it looks like an alias for
X86_FEATURE_NT_GOOD, and that seemed ugly.

> 
> This means that with new microarchitectures we'd get automatic enablement,
> and hopefully chip testing would identify cases where performance isn't as
> good.
Makes sense to me. A first class citizen, as it were...

Thanks for reviewing btw.

Ankur

> 
> I.e. the 'trust but verify' method.


> 
> Thanks,
> 
> 	Ingo
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/8] x86/asm: add clear_page_nt()
  2020-10-14  8:32 ` [PATCH 4/8] x86/asm: add clear_page_nt() Ankur Arora
@ 2020-10-14 19:56   ` Borislav Petkov
  2020-10-14 21:11     ` Ankur Arora
  0 siblings, 1 reply; 29+ messages in thread
From: Borislav Petkov @ 2020-10-14 19:56 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, kirill, mhocko, boris.ostrovsky,
	konrad.wilk, Thomas Gleixner, Ingo Molnar, x86, H. Peter Anvin,
	Jiri Slaby, Herbert Xu, Rafael J. Wysocki

On Wed, Oct 14, 2020 at 01:32:55AM -0700, Ankur Arora wrote:
> This can potentially improve page-clearing bandwidth (see below for
> performance numbers for two microarchitectures where it helps and one
> where it doesn't) and can help indirectly by consuming less cache
> resources.
> 
> Any performance benefits are expected for extents larger than LLC-sized
> or more -- when we are DRAM-BW constrained rather than cache-BW
> constrained.

"potentially", "expected", I don't like those formulations. Do you have
some actual benchmark data where this shows any improvement and not
microbenchmarks only, to warrant the additional complexity?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
  2020-10-14 15:45   ` Andy Lutomirski
@ 2020-10-14 19:58     ` Borislav Petkov
  2020-10-14 21:07       ` Andy Lutomirski
  2020-10-14 20:54     ` Ankur Arora
  1 sibling, 1 reply; 29+ messages in thread
From: Borislav Petkov @ 2020-10-14 19:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ankur Arora, LKML, Linux-MM, Kirill A. Shutemov, Michal Hocko,
	Boris Ostrovsky, Konrad Rzeszutek Wilk, Thomas Gleixner,
	Ingo Molnar, X86 ML, H. Peter Anvin, Arnd Bergmann,
	Andrew Morton, Ira Weiny, linux-arch

On Wed, Oct 14, 2020 at 08:45:37AM -0700, Andy Lutomirski wrote:
> On Wed, Oct 14, 2020 at 1:33 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >
> > Define clear_page_uncached() as an alternative_call() to clear_page_nt()
> > if the CPU sets X86_FEATURE_NT_GOOD and fallback to clear_page() if it
> > doesn't.
> >
> > Similarly define clear_page_uncached_flush() which provides an SFENCE
> > if the CPU sets X86_FEATURE_NT_GOOD.
> 
> As long as you keep "NT" or "MOVNTI" in the names and keep functions
> in arch/x86, I think it's reasonable to expect that callers understand
> that MOVNTI has bizarre memory ordering rules.  But once you give
> something a generic name like "clear_page_uncached" and stick it in
> generic code, I think the semantics should be more obvious.

Why does it have to be a separate call? Why isn't it behind the
clear_page() alternative machinery so that the proper function is
selected at boot? IOW, why does a user of clear_page functionality need
to know at all about an "uncached" variant?
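
For the record, folding it in would look roughly like the current
clear_page() definition in arch/x86/include/asm/page_64.h (a sketch; it
glosses over the fact that clear_page() already uses both
alternative_call_2() slots, for X86_FEATURE_REP_GOOD and
X86_FEATURE_ERMS):

  static inline void clear_page(void *page)
  {
          alternative_call_2(clear_page_orig,
                             clear_page_rep, X86_FEATURE_REP_GOOD,
                             clear_page_nt, X86_FEATURE_NT_GOOD,
                             "=D" (page), "0" (page)
                             : "cc", "memory", "rax", "rcx");
  }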

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
  2020-10-14 15:45   ` Andy Lutomirski
  2020-10-14 19:58     ` Borislav Petkov
@ 2020-10-14 20:54     ` Ankur Arora
  1 sibling, 0 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-14 20:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, Linux-MM, Kirill A. Shutemov, Michal Hocko,
	Boris Ostrovsky, Konrad Rzeszutek Wilk, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, X86 ML, H. Peter Anvin,
	Arnd Bergmann, Andrew Morton, Ira Weiny, linux-arch

On 2020-10-14 8:45 a.m., Andy Lutomirski wrote:
> On Wed, Oct 14, 2020 at 1:33 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> Define clear_page_uncached() as an alternative_call() to clear_page_nt()
>> if the CPU sets X86_FEATURE_NT_GOOD and fallback to clear_page() if it
>> doesn't.
>>
>> Similarly define clear_page_uncached_flush() which provides an SFENCE
>> if the CPU sets X86_FEATURE_NT_GOOD.
> 
> As long as you keep "NT" or "MOVNTI" in the names and keep functions
> in arch/x86, I think it's reasonable to expect that callers understand
> that MOVNTI has bizarre memory ordering rules.  But once you give
> something a generic name like "clear_page_uncached" and stick it in
> generic code, I think the semantics should be more obvious.
> 
> How about:
> 
> clear_page_uncached_unordered() or clear_page_uncached_incoherent()
> 
> and
> 
> flush_after_clear_page_uncached()
> 
> After all, a naive reader might expect "uncached" to imply "caches are
> off and this is coherent with everything".  And the results of getting
> this wrong will be subtle and possibly hard-to-reproduce corruption.
Yeah, these are a lot more obvious. Thanks. Will fix.

Ankur

>
> --Andy
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
  2020-10-14 19:58     ` Borislav Petkov
@ 2020-10-14 21:07       ` Andy Lutomirski
  2020-10-14 21:12         ` Borislav Petkov
  2020-10-15  3:21         ` Ankur Arora
  0 siblings, 2 replies; 29+ messages in thread
From: Andy Lutomirski @ 2020-10-14 21:07 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Ankur Arora, LKML, Linux-MM, Kirill A. Shutemov,
	Michal Hocko, Boris Ostrovsky, Konrad Rzeszutek Wilk,
	Thomas Gleixner, Ingo Molnar, X86 ML, H. Peter Anvin,
	Arnd Bergmann, Andrew Morton, Ira Weiny, linux-arch




> On Oct 14, 2020, at 12:58 PM, Borislav Petkov <bp@alien8.de> wrote:
> 
> On Wed, Oct 14, 2020 at 08:45:37AM -0700, Andy Lutomirski wrote:
>>> On Wed, Oct 14, 2020 at 1:33 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>> 
>>> Define clear_page_uncached() as an alternative_call() to clear_page_nt()
>>> if the CPU sets X86_FEATURE_NT_GOOD and fallback to clear_page() if it
>>> doesn't.
>>> 
>>> Similarly define clear_page_uncached_flush() which provides an SFENCE
>>> if the CPU sets X86_FEATURE_NT_GOOD.
>> 
>> As long as you keep "NT" or "MOVNTI" in the names and keep functions
>> in arch/x86, I think it's reasonable to expect that callers understand
>> that MOVNTI has bizarre memory ordering rules.  But once you give
>> something a generic name like "clear_page_uncached" and stick it in
>> generic code, I think the semantics should be more obvious.
> 
> Why does it have to be a separate call? Why isn't it behind the
> clear_page() alternative machinery so that the proper function is
> selected at boot? IOW, why does a user of clear_page functionality need
> to know at all about an "uncached" variant?
> 
> 

I assume it’s for a little optimization of clearing more than one page
per SFENCE.

In any event, based on the benchmark data upthread, we only want to do
NT clears when they’re rather large, so this shouldn’t be just an
alternative. I assume this is because a page or two will fit in cache
and, for most uses that allocate zeroed pages, we prefer cache-hot
pages. When clearing 1G, on the other hand, cache-hot is impossible and
we prefer the improved bandwidth and less cache trashing of NT clears.

Perhaps SFENCE is so fast that this is a silly optimization, though,
and we don’t lose anything measurable by SFENCEing once per page.
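
To spell out the two options (a sketch; clear_page_nt() is from this
series, and wmb() compiles to SFENCE on x86-64):

  /* Fence once per page: simpler, possibly fast enough. */
  static void clear_pages_fence_per_page(struct page *page, unsigned int n)
  {
          unsigned int i;

          for (i = 0; i < n; i++) {
                  clear_page_nt(page_address(page + i));
                  wmb();
          }
  }

  /* Fence once per batch: what a split clear/flush API allows. */
  static void clear_pages_fence_per_batch(struct page *page, unsigned int n)
  {
          unsigned int i;

          for (i = 0; i < n; i++)
                  clear_page_nt(page_address(page + i));
          wmb();
  }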

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/8] x86/asm: add clear_page_nt()
  2020-10-14 19:56   ` Borislav Petkov
@ 2020-10-14 21:11     ` Ankur Arora
  0 siblings, 0 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-14 21:11 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, linux-mm, kirill, mhocko, boris.ostrovsky,
	konrad.wilk, Thomas Gleixner, Ingo Molnar, x86, H. Peter Anvin,
	Jiri Slaby, Herbert Xu, Rafael J. Wysocki

On 2020-10-14 12:56 p.m., Borislav Petkov wrote:
> On Wed, Oct 14, 2020 at 01:32:55AM -0700, Ankur Arora wrote:
>> This can potentially improve page-clearing bandwidth (see below for
>> performance numbers for two microarchitectures where it helps and one
>> where it doesn't) and can help indirectly by consuming less cache
>> resources.
>>
>> Any performance benefits are expected for extents larger than LLC-sized
>> or more -- when we are DRAM-BW constrained rather than cache-BW
>> constrained.
> 
> "potentially", "expected", I don't like those formulations.
That's fair. The reason for those weasel words is mostly that this is
microarchitecture specific.
For example, on Intel, where I did compare across generations, I see good
performance on Broadwellx, not good on Skylakex, and then good again on
some pre-production CPUs.

> Do you have
> some actual benchmark data where this shows any improvement and not
> microbenchmarks only, to warrant the additional complexity?
Yes, guest creation under QEMU (pinned guests) shows similar improvements.
I've posted performance numbers in patches 7, 8 with a simple page-fault
test derived from that.

I can add numbers from QEMU as well.

Thanks,
Ankur

> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
  2020-10-14 21:07       ` Andy Lutomirski
@ 2020-10-14 21:12         ` Borislav Petkov
  2020-10-15  3:37           ` Ankur Arora
  2020-10-15  3:21         ` Ankur Arora
  1 sibling, 1 reply; 29+ messages in thread
From: Borislav Petkov @ 2020-10-14 21:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Ankur Arora, LKML, Linux-MM, Kirill A. Shutemov,
	Michal Hocko, Boris Ostrovsky, Konrad Rzeszutek Wilk,
	Thomas Gleixner, Ingo Molnar, X86 ML, H. Peter Anvin,
	Arnd Bergmann, Andrew Morton, Ira Weiny, linux-arch

On Wed, Oct 14, 2020 at 02:07:30PM -0700, Andy Lutomirski wrote:
> I assume it’s for a little optimization of clearing more than one
> page per SFENCE.
>
> In any event, based on the benchmark data upthread, we only want to do
> NT clears when they’re rather large, so this shouldn’t be just an
> alternative. I assume this is because a page or two will fit in cache
> and, for most uses that allocate zeroed pages, we prefer cache-hot
> pages. When clearing 1G, on the other hand, cache-hot is impossible
> and we prefer the improved bandwidth and less cache trashing of NT
> clears.

Yeah, use case makes sense but people won't know what to use. At the
time I was experimenting with this crap, I remember Linus saying that
that selection should be made based on the size of the area cleared, so
users should not have to know the difference.

Which perhaps is the only sane use case I see for this.

> Perhaps SFENCE is so fast that this is a silly optimization, though,
> and we don’t lose anything measurable by SFENCEing once per page.

Yes, I'd like to see real use cases showing improvement from this, not
just microbenchmarks.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
  2020-10-14 21:07       ` Andy Lutomirski
  2020-10-14 21:12         ` Borislav Petkov
@ 2020-10-15  3:21         ` Ankur Arora
  2020-10-15 10:40           ` Borislav Petkov
  1 sibling, 1 reply; 29+ messages in thread
From: Ankur Arora @ 2020-10-15  3:21 UTC (permalink / raw)
  To: Andy Lutomirski, Borislav Petkov
  Cc: Andy Lutomirski, LKML, Linux-MM, Kirill A. Shutemov,
	Michal Hocko, Boris Ostrovsky, Konrad Rzeszutek Wilk,
	Thomas Gleixner, Ingo Molnar, X86 ML, H. Peter Anvin,
	Arnd Bergmann, Andrew Morton, Ira Weiny, linux-arch

On 2020-10-14 2:07 p.m., Andy Lutomirski wrote:
> 
> 
> 
>> On Oct 14, 2020, at 12:58 PM, Borislav Petkov <bp@alien8.de> wrote:
>>
>> On Wed, Oct 14, 2020 at 08:45:37AM -0700, Andy Lutomirski wrote:
>>>> On Wed, Oct 14, 2020 at 1:33 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>>>
>>>> Define clear_page_uncached() as an alternative_call() to clear_page_nt()
>>>> if the CPU sets X86_FEATURE_NT_GOOD and fallback to clear_page() if it
>>>> doesn't.
>>>>
>>>> Similarly define clear_page_uncached_flush() which provides an SFENCE
>>>> if the CPU sets X86_FEATURE_NT_GOOD.
>>>
>>> As long as you keep "NT" or "MOVNTI" in the names and keep functions
>>> in arch/x86, I think it's reasonable to expect that callers understand
>>> that MOVNTI has bizarre memory ordering rules.  But once you give
>>> something a generic name like "clear_page_uncached" and stick it in
>>> generic code, I think the semantics should be more obvious.
>>
>> Why does it have to be a separate call? Why isn't it behind the
>> clear_page() alternative machinery so that the proper function is
>> selected at boot? IOW, why does a user of clear_page functionality need
>> to know at all about an "uncached" variant?
>
> I assume it’s for a little optimization of clearing more than one page
> per SFENCE.
>
> In any event, based on the benchmark data upthread, we only want to do
> NT clears when they’re rather large, so this shouldn’t be just an
> alternative. I assume this is because a page or two will fit in cache
> and, for most uses that allocate zeroed pages, we prefer cache-hot
> pages. When clearing 1G, on the other hand, cache-hot is impossible
> and we prefer the improved bandwidth and less cache trashing of NT
> clears.

Also, if we did extend clear_page() to take the page size as a
parameter, we still might not have enough information (e.g. a 4K or a
2MB page that clear_page() sees could be part of a GUP of a much larger
extent) to decide whether to go uncached or not.

> Perhaps SFENCE is so fast that this is a silly optimization, though,
> and we don’t lose anything measurable by SFENCEing once per page.
Alas, no. I tried that and dropped about 15% performance on Rome.

Thanks
Ankur

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
  2020-10-14 21:12         ` Borislav Petkov
@ 2020-10-15  3:37           ` Ankur Arora
  2020-10-15 10:35             ` Borislav Petkov
  0 siblings, 1 reply; 29+ messages in thread
From: Ankur Arora @ 2020-10-15  3:37 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski
  Cc: Andy Lutomirski, LKML, Linux-MM, Kirill A. Shutemov,
	Michal Hocko, Boris Ostrovsky, Konrad Rzeszutek Wilk,
	Thomas Gleixner, Ingo Molnar, X86 ML, H. Peter Anvin,
	Arnd Bergmann, Andrew Morton, Ira Weiny, linux-arch

On 2020-10-14 2:12 p.m., Borislav Petkov wrote:
> On Wed, Oct 14, 2020 at 02:07:30PM -0700, Andy Lutomirski wrote:
>> I assume it’s for a little optimization of clearing more than one
>> page per SFENCE.
>>
>> In any event, based on the benchmark data upthread, we only want to do
>> NT clears when they’re rather large, so this shouldn’t be just an
>> alternative. I assume this is because a page or two will fit in cache
>> and, for most uses that allocate zeroed pages, we prefer cache-hot
>> pages. When clearing 1G, on the other hand, cache-hot is impossible
>> and we prefer the improved bandwidth and less cache trashing of NT
>> clears.
> 
> Yeah, use case makes sense but people won't know what to use. At the
> time I was experimenting with this crap, I remember Linus saying that
> that selection should be made based on the size of the area cleared, so
> users should not have to know the difference.
I don't disagree, but I think the selection of the cached/uncached route
should be made where we have enough context available to make that
choice.

This could, for example, be done in mm_populate() or gup, where, if the
extent is larger than LLC-size, it takes the uncached path.

> 
> Which perhaps is the only sane use case I see for this.
> 
>> Perhaps SFENCE is so fast that this is a silly optimization, though,
>> and we don’t lose anything measurable by SFENCEing once per page.
> 
> Yes, I'd like to see real use cases showing improvement from this, not
> just microbenchmarks.
Sure will add.

Thanks
Ankur

> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
  2020-10-15  3:37           ` Ankur Arora
@ 2020-10-15 10:35             ` Borislav Petkov
  2020-10-15 21:20               ` Ankur Arora
  0 siblings, 1 reply; 29+ messages in thread
From: Borislav Petkov @ 2020-10-15 10:35 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Andy Lutomirski, Andy Lutomirski, LKML, Linux-MM,
	Kirill A. Shutemov, Michal Hocko, Boris Ostrovsky,
	Konrad Rzeszutek Wilk, Thomas Gleixner, Ingo Molnar, X86 ML,
	H. Peter Anvin, Arnd Bergmann, Andrew Morton, Ira Weiny,
	linux-arch

On Wed, Oct 14, 2020 at 08:37:44PM -0700, Ankur Arora wrote:
> I don't disagree, but I think the selection of the cached/uncached route
> should be made where we have enough context available to make that
> choice.
>
> This could, for example, be done in mm_populate() or gup, where, if the
> extent is larger than LLC-size, it takes the uncached path.

Are there examples where we don't know the size?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
  2020-10-15  3:21         ` Ankur Arora
@ 2020-10-15 10:40           ` Borislav Petkov
  2020-10-15 21:40             ` Ankur Arora
  0 siblings, 1 reply; 29+ messages in thread
From: Borislav Petkov @ 2020-10-15 10:40 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Andy Lutomirski, Andy Lutomirski, LKML, Linux-MM,
	Kirill A. Shutemov, Michal Hocko, Boris Ostrovsky,
	Konrad Rzeszutek Wilk, Thomas Gleixner, Ingo Molnar, X86 ML,
	H. Peter Anvin, Arnd Bergmann, Andrew Morton, Ira Weiny,
	linux-arch

On Wed, Oct 14, 2020 at 08:21:57PM -0700, Ankur Arora wrote:
> Also, if we did extend clear_page() to take the page size as a
> parameter, we still might not have enough information (e.g. a 4K or a
> 2MB page that clear_page() sees could be part of a GUP of a much larger
> extent) to decide whether to go uncached or not.

clear_page* assumes 4K. All of the lowlevel asm variants do. So adding
the size there won't bring you a whole lot.

So you'd need to devise this whole thing differently. Perhaps have a
clear_pages() helper which decides, based on size, what to do: uncached
clearing, or clear_page() as it is now, in a loop.
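
Roughly (a sketch; the threshold is made up and would presumably be
derived from the LLC size):

  static void clear_pages(void *addr, unsigned int npages)
  {
          unsigned long bytes = (unsigned long)npages << PAGE_SHIFT;
          unsigned int i;

          if (boot_cpu_has(X86_FEATURE_NT_GOOD) &&
              bytes >= clear_pages_nt_threshold) {
                  for (i = 0; i < npages; i++)
                          clear_page_nt(addr + i * PAGE_SIZE);
                  wmb();  /* SFENCE: order the weakly-ordered NT stores */
          } else {
                  for (i = 0; i < npages; i++)
                          clear_page(addr + i * PAGE_SIZE);
          }
  }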

Looking at the callsites would give you a better idea I'd say.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
  2020-10-15 10:35             ` Borislav Petkov
@ 2020-10-15 21:20               ` Ankur Arora
  2020-10-16 18:21                 ` Borislav Petkov
  0 siblings, 1 reply; 29+ messages in thread
From: Ankur Arora @ 2020-10-15 21:20 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Andy Lutomirski, LKML, Linux-MM,
	Kirill A. Shutemov, Michal Hocko, Boris Ostrovsky,
	Konrad Rzeszutek Wilk, Thomas Gleixner, Ingo Molnar, X86 ML,
	H. Peter Anvin, Arnd Bergmann, Andrew Morton, Ira Weiny,
	linux-arch

On 2020-10-15 3:35 a.m., Borislav Petkov wrote:
> On Wed, Oct 14, 2020 at 08:37:44PM -0700, Ankur Arora wrote:
>> I don't disagree, but I think the selection of the cached/uncached route
>> should be made where we have enough context available to make that
>> choice.
>>
>> This could, for example, be done in mm_populate() or gup, where, if the
>> extent is larger than LLC-size, it takes the uncached path.
> 
> Are there examples where we don't know the size?

The case I was thinking of was that clear_huge_page() or faultin_page() would
know the size to a page unit, while the higher level function would know the
whole extent and could optimize differently based on that.

Thanks
Ankur

> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
  2020-10-15 10:40           ` Borislav Petkov
@ 2020-10-15 21:40             ` Ankur Arora
  0 siblings, 0 replies; 29+ messages in thread
From: Ankur Arora @ 2020-10-15 21:40 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Andy Lutomirski, LKML, Linux-MM,
	Kirill A. Shutemov, Michal Hocko, Boris Ostrovsky,
	Konrad Rzeszutek Wilk, Thomas Gleixner, Ingo Molnar, X86 ML,
	H. Peter Anvin, Arnd Bergmann, Andrew Morton, Ira Weiny,
	linux-arch

On 2020-10-15 3:40 a.m., Borislav Petkov wrote:
> On Wed, Oct 14, 2020 at 08:21:57PM -0700, Ankur Arora wrote:
>> Also, if we did extend clear_page() to take the page size as a
>> parameter, we still might not have enough information (e.g. a 4K or a
>> 2MB page that clear_page() sees could be part of a GUP of a much larger
>> extent) to decide whether to go uncached or not.
> 
> clear_page* assumes 4K. All of the lowlevel asm variants do. So adding
> the size there won't bring you a whole lot.
> 
> So you'd need to devise this whole thing differently. Perhaps have a
> clear_pages() helper which decides based on size what to do: uncached
> clearing or the clear_page() as is now in a loop.

I think that'll work well for GB pages, where the clear_pages() helper
has enough information to make a decision.

But, unless I'm missing something, I'm not sure how that would work for,
say, a 1GB mm_populate() using 4K pages. The clear_page() (or clear_pages())
in that case would only see the 4K size.

But let me think about this more (and look at the callsites, as you suggest).

> 
> Looking at the callsites would give you a better idea I'd say.
Thanks, yeah that's a good idea. Let me go do that.

Ankur

> 
> Thx.
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/8] x86/clear_page: add clear_page_uncached()
  2020-10-15 21:20               ` Ankur Arora
@ 2020-10-16 18:21                 ` Borislav Petkov
  0 siblings, 0 replies; 29+ messages in thread
From: Borislav Petkov @ 2020-10-16 18:21 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Andy Lutomirski, Andy Lutomirski, LKML, Linux-MM,
	Kirill A. Shutemov, Michal Hocko, Boris Ostrovsky,
	Konrad Rzeszutek Wilk, Thomas Gleixner, Ingo Molnar, X86 ML,
	H. Peter Anvin, Arnd Bergmann, Andrew Morton, Ira Weiny,
	linux-arch

On Thu, Oct 15, 2020 at 02:20:36PM -0700, Ankur Arora wrote:
> The case I was thinking of was that clear_huge_page()

That loop in clear_gigantic_page() could be optimized not to iterate
over the pages but to do NTA moves in one go, provided they're
contiguous.
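
I.e., something like this (a sketch, assuming !HIGHMEM so the whole page
is mapped, and assuming memset_movnti() from patch 2 has a memset-like
signature):

  static void clear_gigantic_page_nt(struct page *page, unsigned int order)
  {
          void *kaddr = page_address(page);

          memset_movnti(kaddr, 0, PAGE_SIZE << order);
          wmb();          /* SFENCE */
  }

though you'd still want cond_resched() at some granularity for
preemption latency.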

> or faultin_page() would

faultin_page() goes into the bowels of mm fault handling; you'd have to
be more precise about what exactly you mean with that one.

> know the size to a page unit, while the higher level function would know the
> whole extent and could optimize differently based on that.

Just don't forget that this "optimization" of yours comes at the price
of added code complexity, and you're putting the onus on people to
know which function to call. So it is not free and needs to be
weighed carefully.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread

Thread overview: 29+ messages
2020-10-14  8:32 [PATCH 0/8] Use uncached writes while clearing gigantic pages Ankur Arora
2020-10-14  8:32 ` [PATCH 1/8] x86/cpuid: add X86_FEATURE_NT_GOOD Ankur Arora
2020-10-14  8:32 ` [PATCH 2/8] x86/asm: add memset_movnti() Ankur Arora
2020-10-14  8:32 ` [PATCH 3/8] perf bench: " Ankur Arora
2020-10-14  8:32 ` [PATCH 4/8] x86/asm: add clear_page_nt() Ankur Arora
2020-10-14 19:56   ` Borislav Petkov
2020-10-14 21:11     ` Ankur Arora
2020-10-14  8:32 ` [PATCH 5/8] x86/clear_page: add clear_page_uncached() Ankur Arora
2020-10-14 11:10   ` kernel test robot
2020-10-14 13:04   ` kernel test robot
2020-10-14 15:45   ` Andy Lutomirski
2020-10-14 19:58     ` Borislav Petkov
2020-10-14 21:07       ` Andy Lutomirski
2020-10-14 21:12         ` Borislav Petkov
2020-10-15  3:37           ` Ankur Arora
2020-10-15 10:35             ` Borislav Petkov
2020-10-15 21:20               ` Ankur Arora
2020-10-16 18:21                 ` Borislav Petkov
2020-10-15  3:21         ` Ankur Arora
2020-10-15 10:40           ` Borislav Petkov
2020-10-15 21:40             ` Ankur Arora
2020-10-14 20:54     ` Ankur Arora
2020-10-14  8:32 ` [PATCH 6/8] mm, clear_huge_page: use clear_page_uncached() for gigantic pages Ankur Arora
2020-10-14 15:28   ` Ingo Molnar
2020-10-14 19:15     ` Ankur Arora
2020-10-14  8:32 ` [PATCH 7/8] x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx Ankur Arora
2020-10-14 15:31   ` Ingo Molnar
2020-10-14 19:23     ` Ankur Arora
2020-10-14  8:32 ` [PATCH 8/8] x86/cpu/amd: enable X86_FEATURE_NT_GOOD on AMD Zen Ankur Arora
