* [PATCH v2 00/14] Use uncached stores while clearing huge pages
@ 2021-10-20 17:02 Ankur Arora
  2021-10-20 17:02 ` [PATCH v2 01/14] x86/asm: add memset_movnti() Ankur Arora
                   ` (13 more replies)
  0 siblings, 14 replies; 19+ messages in thread
From: Ankur Arora @ 2021-10-20 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: mingo, bp, luto, akpm, mike.kravetz, jon.grimm, kvm, konrad.wilk,
	boris.ostrovsky, Ankur Arora

This series adds support for uncached page clearing for huge and
gigantic pages. The motivation is to speed up creation of large
prealloc'd VMs backed by huge/gigantic pages.

Uncached page clearing helps in two ways:
 - Faster than the cached path for sizes > O(LLC-size)
 - Avoids replacing potentially useful cache-lines with useless
   zeroes

Performance: with this series, VM creation (for VMs with prealloc'd
2MB backing pages) sees significant runtime improvements:


 AMD Milan, sz=1550 GB, runs=3   BW           stdev     diff
                                 ----------   ------    --------
 baseline   (clear_page_erms)     8.05 GBps     0.08
 CLZERO   (clear_page_clzero)    29.94 GBps     0.31    +271.92%

 (Creation time for this 1550 GB VM goes from 192.6s to 51.7s.)


 Intel Icelake, sz=200 GB, runs=3  BW       stdev     diff
                                   ----------   ------   ---------
 baseline   (clear_page_erms)      8.25 GBps     0.05
 MOVNT     (clear_page_movnt)     21.55 GBps     0.31    +161.21%

 (Creation time for this 200 GB VM goes from 25.2s to 9.3s.)

Additionally, on the AMD Milan system, a kernel-build test with a
background job doing page-clearing sees a ~5% improvement in runtime
with uncached clearing vs cached.
A similar test on the Intel Icelake system shows improvement in
cache-miss rates but no overall improvement in runtime.


With the motivation out of the way, the following describes how v2
addresses review comments from v1 (and other sticking points that
series of this nature have run into over the years):

1. Uncached stores (via MOVNT, CLZERO on x86) are weakly ordered with
   respect to the cache hierarchy and, unless combined with an
   appropriate fence, are unsafe to use.

   Patch 6, "sparse: add address_space __incoherent" adds a new sparse
   address_space: __incoherent.

   Patch 7, "x86/clear_page: add clear_page_uncached()" defines:
     void clear_page_uncached(__incoherent void *)
   and the corresponding flush is exposed as:
     void clear_page_uncached_make_coherent(void)

   This ensures that an incorrect or missing address_space results in
   a warning from sparse (and KTP).
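
   As an illustration, a minimal caller would look something like the
   sketch below (this mirrors what patch 10 ends up doing; treat it as
   a sketch, not the exact code):

     __incoherent void *kaddr = (__incoherent void *)kmap_atomic(page);

     clear_page_uncached(kaddr);
     kunmap_atomic((__force void *)kaddr);

     /* ... clear any remaining pages ... */

     /* Flush the weakly ordered stores before the zeroes are relied on. */
     clear_page_uncached_make_coherent();

   Passing kaddr to anything expecting a plain void * without a __force
   cast trips the sparse check.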

2. Page clearing needs to be ordered before any PTE writes related to
   the cleared extent (before SetPageUptodate().) For the uncached
   path, this means that we need a store fence before the PTE write.

   The cost of the fence is microarchitecture dependent, but from my
   measurements it remains noticeable unless amortized over extents of
   around 32KB or larger. This limits us to huge/gigantic pages on x86.

   The logic handling this is in patch 10, "clear_huge_page: use
   uncached path".

3. Uncached stores are generally slower than cached for extents smaller
   than LLC-size, and faster for larger ones.
   
   This means that choosing the uncached path for too small an extent
   would cause performance regressions, while keeping the threshold too
   high means giving up some of the possible speedup.

   Patches 8 and 9, "mm/clear_page: add clear_page_uncached_threshold()",
   "x86/clear_page: add arch_clear_page_uncached_threshold()" setup an
   arch specific threshold. For architectures that don't specify one, a
   default value of 8MB is used.
  
   However, a single call to clear_huge_page() or get_/pin_user_pages()
   only sees a small portion of the extent being cleared in each
   iteration. To make sure we choose uncached stores when working with
   large extents, patch 11, "gup: add FOLL_HINT_BULK,
   FAULT_FLAG_UNCACHED", adds a new flag that gup users can use for
   this purpose. Patch 13, "vfio_iommu_type1: specify FOLL_HINT_BULK
   to pin_user_pages()", uses it when pinning process memory while
   attaching passthrough PCIe devices.
  
   The get_user_pages() logic to handle these flags is in patch 12,
   "gup: use uncached path when clearing large regions".

4. Point (3) above (uncached stores being faster for extents larger
   than LLC-size) is generally true, with a side of Brownian motion
   thrown in. For instance, MOVNTI (for > LLC-size extents) performs
   well on Broadwell and Ice Lake, but not on Skylake, which sits
   between the two.

   To deal with this, we use Ingo's "trust but verify" suggestion
   (https://lore.kernel.org/lkml/20201014153127.GB1424414@gmail.com/):
   enable MOVNT by default and disable it only on microarchitectures
   where it turns out to be slow.
   If the uncached path ends up being part of the kernel, hopefully
   such regressions would show up early enough in chip testing.

   Patch 5, "x86/cpuid: add X86_FEATURE_MOVNT_SLOW" adds this logic
   and patch 14, "set X86_FEATURE_MOVNT_SLOW for Skylake" disables
   the uncached path for Skylake.
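
   For instance, the quirk could end up looking something like the
   sketch below (patch 14 has the real change; the specific model
   checks here are just an example, not the actual list):

     void check_movnt_quirks(struct cpuinfo_x86 *c)
     {
             if (c->x86_vendor == X86_VENDOR_INTEL && c->x86 == 6 &&
                 (c->x86_model == INTEL_FAM6_SKYLAKE ||
                  c->x86_model == INTEL_FAM6_SKYLAKE_X))
                     set_cpu_cap(c, X86_FEATURE_MOVNT_SLOW);
     }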

Performance numbers are in patch 12, "gup: use uncached path when
clearing large regions."

Also at:
  github.com/terminus/linux clear-page-uncached.upstream-v2

Please review.

Changelog:

v1: (https://lore.kernel.org/lkml/20201014083300.19077-1-ankur.a.arora@oracle.com/)
 - Make the unsafe nature of clear_page_uncached() more obvious.
 - Invert X86_FEATURE_NT_GOOD to X86_FEATURE_MOVNT_SLOW, so we don't have
   to explicitly enable it for every new model.
 - Add GUP path (and appropriate threshold) to allow the uncached path
   to be used for huge pages.
 - Make the code more generic so it's tied to fewer x86-specific
   assumptions.

Thanks
Ankur

Ankur Arora (14):
  x86/asm: add memset_movnti()
  perf bench: add memset_movnti()
  x86/asm: add uncached page clearing
  x86/asm: add clzero based page clearing
  x86/cpuid: add X86_FEATURE_MOVNT_SLOW
  sparse: add address_space __incoherent
  x86/clear_page: add clear_page_uncached()
  mm/clear_page: add clear_page_uncached_threshold()
  x86/clear_page: add arch_clear_page_uncached_threshold()
  clear_huge_page: use uncached path
  gup: add FOLL_HINT_BULK, FAULT_FLAG_UNCACHED
  gup: use uncached path when clearing large regions
  vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages()
  x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake

 arch/x86/include/asm/cacheinfo.h             |  1 +
 arch/x86/include/asm/cpufeatures.h           |  1 +
 arch/x86/include/asm/page.h                  | 10 +++
 arch/x86/include/asm/page_32.h               |  9 +++
 arch/x86/include/asm/page_64.h               | 34 +++++++++
 arch/x86/kernel/cpu/amd.c                    |  2 +
 arch/x86/kernel/cpu/bugs.c                   | 30 ++++++++
 arch/x86/kernel/cpu/cacheinfo.c              | 13 ++++
 arch/x86/kernel/cpu/cpu.h                    |  2 +
 arch/x86/kernel/cpu/intel.c                  |  1 +
 arch/x86/kernel/setup.c                      |  6 ++
 arch/x86/lib/clear_page_64.S                 | 45 ++++++++++++
 arch/x86/lib/memset_64.S                     | 68 ++++++++++--------
 drivers/vfio/vfio_iommu_type1.c              |  3 +
 fs/hugetlbfs/inode.c                         |  7 +-
 include/linux/compiler_types.h               |  2 +
 include/linux/mm.h                           | 38 +++++++++-
 mm/gup.c                                     | 20 ++++++
 mm/huge_memory.c                             |  3 +-
 mm/hugetlb.c                                 | 10 ++-
 mm/memory.c                                  | 76 ++++++++++++++++++--
 tools/arch/x86/lib/memset_64.S               | 68 ++++++++++--------
 tools/perf/bench/mem-memset-x86-64-asm-def.h |  6 +-
 23 files changed, 386 insertions(+), 69 deletions(-)

-- 
2.29.2




* [PATCH v2 01/14] x86/asm: add memset_movnti()
  2021-10-20 17:02 [PATCH v2 00/14] Use uncached stores while clearing huge pages Ankur Arora
@ 2021-10-20 17:02 ` Ankur Arora
  2021-10-20 17:02 ` [PATCH v2 02/14] perf bench: " Ankur Arora
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Ankur Arora @ 2021-10-20 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: mingo, bp, luto, akpm, mike.kravetz, jon.grimm, kvm, konrad.wilk,
	boris.ostrovsky, Ankur Arora

Add an uncached implementation of memset(), based on MOVNTI.

memset_movnti() only needs to differ from memset_orig() in the opcode
used in the inner loop, so move the memset_orig() logic into a macro,
and use that to generate both memset_movq() and memset_movnti().

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/lib/memset_64.S | 68 ++++++++++++++++++++++------------------
 1 file changed, 38 insertions(+), 30 deletions(-)

diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
index 9827ae267f96..ef2a091563d9 100644
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -25,7 +25,7 @@ SYM_FUNC_START(__memset)
 	 *
 	 * Otherwise, use original memset function.
 	 */
-	ALTERNATIVE_2 "jmp memset_orig", "", X86_FEATURE_REP_GOOD, \
+	ALTERNATIVE_2 "jmp memset_movq", "", X86_FEATURE_REP_GOOD, \
 		      "jmp memset_erms", X86_FEATURE_ERMS
 
 	movq %rdi,%r9
@@ -66,7 +66,8 @@ SYM_FUNC_START_LOCAL(memset_erms)
 	ret
 SYM_FUNC_END(memset_erms)
 
-SYM_FUNC_START_LOCAL(memset_orig)
+.macro MEMSET_MOV OP fence
+SYM_FUNC_START_LOCAL(memset_\OP)
 	movq %rdi,%r10
 
 	/* expand byte value  */
@@ -77,64 +78,71 @@ SYM_FUNC_START_LOCAL(memset_orig)
 	/* align dst */
 	movl  %edi,%r9d
 	andl  $7,%r9d
-	jnz  .Lbad_alignment
-.Lafter_bad_alignment:
+	jnz  .Lbad_alignment_\@
+.Lafter_bad_alignment_\@:
 
 	movq  %rdx,%rcx
 	shrq  $6,%rcx
-	jz	 .Lhandle_tail
+	jz	 .Lhandle_tail_\@
 
 	.p2align 4
-.Lloop_64:
+.Lloop_64_\@:
 	decq  %rcx
-	movq  %rax,(%rdi)
-	movq  %rax,8(%rdi)
-	movq  %rax,16(%rdi)
-	movq  %rax,24(%rdi)
-	movq  %rax,32(%rdi)
-	movq  %rax,40(%rdi)
-	movq  %rax,48(%rdi)
-	movq  %rax,56(%rdi)
+	\OP  %rax,(%rdi)
+	\OP  %rax,8(%rdi)
+	\OP  %rax,16(%rdi)
+	\OP  %rax,24(%rdi)
+	\OP  %rax,32(%rdi)
+	\OP  %rax,40(%rdi)
+	\OP  %rax,48(%rdi)
+	\OP  %rax,56(%rdi)
 	leaq  64(%rdi),%rdi
-	jnz    .Lloop_64
+	jnz    .Lloop_64_\@
 
 	/* Handle tail in loops. The loops should be faster than hard
 	   to predict jump tables. */
 	.p2align 4
-.Lhandle_tail:
+.Lhandle_tail_\@:
 	movl	%edx,%ecx
 	andl    $63&(~7),%ecx
-	jz 		.Lhandle_7
+	jz 		.Lhandle_7_\@
 	shrl	$3,%ecx
 	.p2align 4
-.Lloop_8:
+.Lloop_8_\@:
 	decl   %ecx
-	movq  %rax,(%rdi)
+	\OP  %rax,(%rdi)
 	leaq  8(%rdi),%rdi
-	jnz    .Lloop_8
+	jnz    .Lloop_8_\@
 
-.Lhandle_7:
+.Lhandle_7_\@:
 	andl	$7,%edx
-	jz      .Lende
+	jz      .Lende_\@
 	.p2align 4
-.Lloop_1:
+.Lloop_1_\@:
 	decl    %edx
 	movb 	%al,(%rdi)
 	leaq	1(%rdi),%rdi
-	jnz     .Lloop_1
+	jnz     .Lloop_1_\@
 
-.Lende:
+.Lende_\@:
+	.if \fence
+	sfence
+	.endif
 	movq	%r10,%rax
 	ret
 
-.Lbad_alignment:
+.Lbad_alignment_\@:
 	cmpq $7,%rdx
-	jbe	.Lhandle_7
+	jbe	.Lhandle_7_\@
 	movq %rax,(%rdi)	/* unaligned store */
 	movq $8,%r8
 	subq %r9,%r8
 	addq %r8,%rdi
 	subq %r8,%rdx
-	jmp .Lafter_bad_alignment
-.Lfinal:
-SYM_FUNC_END(memset_orig)
+	jmp .Lafter_bad_alignment_\@
+.Lfinal_\@:
+SYM_FUNC_END(memset_\OP)
+.endm
+
+MEMSET_MOV OP=movq fence=0
+MEMSET_MOV OP=movnti fence=1
-- 
2.29.2




* [PATCH v2 02/14] perf bench: add memset_movnti()
  2021-10-20 17:02 [PATCH v2 00/14] Use uncached stores while clearing huge pages Ankur Arora
  2021-10-20 17:02 ` [PATCH v2 01/14] x86/asm: add memset_movnti() Ankur Arora
@ 2021-10-20 17:02 ` Ankur Arora
  2021-10-20 17:02 ` [PATCH v2 03/14] x86/asm: add uncached page clearing Ankur Arora
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Ankur Arora @ 2021-10-20 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: mingo, bp, luto, akpm, mike.kravetz, jon.grimm, kvm, konrad.wilk,
	boris.ostrovsky, Ankur Arora

Clone memset_movnti() from arch/x86/lib/memset_64.S.

perf bench mem memset -f x86-64-movnt on Intel Icelake-X, AMD Milan:

  # Intel Icelake-X

  $ for i in 8 32 128 512; do
         perf bench mem memset -f x86-64-movnt -s ${i}MB -l 5
     done

  # Output pruned.
  # Running 'mem/memset' benchmark:
  # function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
  # Copying 8MB bytes ...
      12.896170 GB/sec
  # Copying 32MB bytes ...
      15.879065 GB/sec
  # Copying 128MB bytes ...
      20.813214 GB/sec
  # Copying 512MB bytes ...
      24.190817 GB/sec

  # AMD Milan

  $ for i in 8 32 128 512; do
         perf bench mem memset -f x86-64-movnt -s ${i}MB -l 5
     done

  # Output pruned.
  # Running 'mem/memset' benchmark:
  # function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
  # Copying 8MB bytes ...
        22.372566 GB/sec
  # Copying 32MB bytes ...
        22.507923 GB/sec
  # Copying 128MB bytes ...
        22.492532 GB/sec
  # Copying 512MB bytes ...
        22.434603 GB/sec

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/arch/x86/lib/memset_64.S               | 68 +++++++++++---------
 tools/perf/bench/mem-memset-x86-64-asm-def.h |  6 +-
 2 files changed, 43 insertions(+), 31 deletions(-)

diff --git a/tools/arch/x86/lib/memset_64.S b/tools/arch/x86/lib/memset_64.S
index 9827ae267f96..ef2a091563d9 100644
--- a/tools/arch/x86/lib/memset_64.S
+++ b/tools/arch/x86/lib/memset_64.S
@@ -25,7 +25,7 @@ SYM_FUNC_START(__memset)
 	 *
 	 * Otherwise, use original memset function.
 	 */
-	ALTERNATIVE_2 "jmp memset_orig", "", X86_FEATURE_REP_GOOD, \
+	ALTERNATIVE_2 "jmp memset_movq", "", X86_FEATURE_REP_GOOD, \
 		      "jmp memset_erms", X86_FEATURE_ERMS
 
 	movq %rdi,%r9
@@ -66,7 +66,8 @@ SYM_FUNC_START_LOCAL(memset_erms)
 	ret
 SYM_FUNC_END(memset_erms)
 
-SYM_FUNC_START_LOCAL(memset_orig)
+.macro MEMSET_MOV OP fence
+SYM_FUNC_START_LOCAL(memset_\OP)
 	movq %rdi,%r10
 
 	/* expand byte value  */
@@ -77,64 +78,71 @@ SYM_FUNC_START_LOCAL(memset_orig)
 	/* align dst */
 	movl  %edi,%r9d
 	andl  $7,%r9d
-	jnz  .Lbad_alignment
-.Lafter_bad_alignment:
+	jnz  .Lbad_alignment_\@
+.Lafter_bad_alignment_\@:
 
 	movq  %rdx,%rcx
 	shrq  $6,%rcx
-	jz	 .Lhandle_tail
+	jz	 .Lhandle_tail_\@
 
 	.p2align 4
-.Lloop_64:
+.Lloop_64_\@:
 	decq  %rcx
-	movq  %rax,(%rdi)
-	movq  %rax,8(%rdi)
-	movq  %rax,16(%rdi)
-	movq  %rax,24(%rdi)
-	movq  %rax,32(%rdi)
-	movq  %rax,40(%rdi)
-	movq  %rax,48(%rdi)
-	movq  %rax,56(%rdi)
+	\OP  %rax,(%rdi)
+	\OP  %rax,8(%rdi)
+	\OP  %rax,16(%rdi)
+	\OP  %rax,24(%rdi)
+	\OP  %rax,32(%rdi)
+	\OP  %rax,40(%rdi)
+	\OP  %rax,48(%rdi)
+	\OP  %rax,56(%rdi)
 	leaq  64(%rdi),%rdi
-	jnz    .Lloop_64
+	jnz    .Lloop_64_\@
 
 	/* Handle tail in loops. The loops should be faster than hard
 	   to predict jump tables. */
 	.p2align 4
-.Lhandle_tail:
+.Lhandle_tail_\@:
 	movl	%edx,%ecx
 	andl    $63&(~7),%ecx
-	jz 		.Lhandle_7
+	jz 		.Lhandle_7_\@
 	shrl	$3,%ecx
 	.p2align 4
-.Lloop_8:
+.Lloop_8_\@:
 	decl   %ecx
-	movq  %rax,(%rdi)
+	\OP  %rax,(%rdi)
 	leaq  8(%rdi),%rdi
-	jnz    .Lloop_8
+	jnz    .Lloop_8_\@
 
-.Lhandle_7:
+.Lhandle_7_\@:
 	andl	$7,%edx
-	jz      .Lende
+	jz      .Lende_\@
 	.p2align 4
-.Lloop_1:
+.Lloop_1_\@:
 	decl    %edx
 	movb 	%al,(%rdi)
 	leaq	1(%rdi),%rdi
-	jnz     .Lloop_1
+	jnz     .Lloop_1_\@
 
-.Lende:
+.Lende_\@:
+	.if \fence
+	sfence
+	.endif
 	movq	%r10,%rax
 	ret
 
-.Lbad_alignment:
+.Lbad_alignment_\@:
 	cmpq $7,%rdx
-	jbe	.Lhandle_7
+	jbe	.Lhandle_7_\@
 	movq %rax,(%rdi)	/* unaligned store */
 	movq $8,%r8
 	subq %r9,%r8
 	addq %r8,%rdi
 	subq %r8,%rdx
-	jmp .Lafter_bad_alignment
-.Lfinal:
-SYM_FUNC_END(memset_orig)
+	jmp .Lafter_bad_alignment_\@
+.Lfinal_\@:
+SYM_FUNC_END(memset_\OP)
+.endm
+
+MEMSET_MOV OP=movq fence=0
+MEMSET_MOV OP=movnti fence=1
diff --git a/tools/perf/bench/mem-memset-x86-64-asm-def.h b/tools/perf/bench/mem-memset-x86-64-asm-def.h
index dac6d2b7c39b..53ead7f91313 100644
--- a/tools/perf/bench/mem-memset-x86-64-asm-def.h
+++ b/tools/perf/bench/mem-memset-x86-64-asm-def.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
-MEMSET_FN(memset_orig,
+MEMSET_FN(memset_movq,
 	"x86-64-unrolled",
 	"unrolled memset() in arch/x86/lib/memset_64.S")
 
@@ -11,3 +11,7 @@ MEMSET_FN(__memset,
 MEMSET_FN(memset_erms,
 	"x86-64-stosb",
 	"movsb-based memset() in arch/x86/lib/memset_64.S")
+
+MEMSET_FN(memset_movnti,
+	"x86-64-movnt",
+	"movnt-based memset() in arch/x86/lib/memset_64.S")
-- 
2.29.2




* [PATCH v2 03/14] x86/asm: add uncached page clearing
  2021-10-20 17:02 [PATCH v2 00/14] Use uncached stores while clearing huge pages Ankur Arora
  2021-10-20 17:02 ` [PATCH v2 01/14] x86/asm: add memset_movnti() Ankur Arora
  2021-10-20 17:02 ` [PATCH v2 02/14] perf bench: " Ankur Arora
@ 2021-10-20 17:02 ` Ankur Arora
  2021-10-20 17:02 ` [PATCH v2 04/14] x86/asm: add clzero based " Ankur Arora
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Ankur Arora @ 2021-10-20 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: mingo, bp, luto, akpm, mike.kravetz, jon.grimm, kvm, konrad.wilk,
	boris.ostrovsky, Ankur Arora

Add clear_page_movnt(), which uses MOVNTI as the underlying primitive.
MOVNTI skips the memory hierarchy, so this provides a non-cache-polluting
implementation of clear_page().

MOVNTI, from the Intel SDM, Volume 2B, 4-101:
 "The non-temporal hint is implemented by using a write combining (WC)
  memory type protocol when writing the data to memory. Using this
  protocol, the processor does not write the data into the cache
  hierarchy, nor does it fetch the corresponding cache line from memory
  into the cache hierarchy."

The AMD Arch Manual has something similar to say as well.

One use case is zeroing large extents, where this helps by not
needlessly bringing in cache lines that would never be accessed.
Also, clear_page_movnt() based clearing is often faster once extent
sizes are O(LLC-size).

As the excerpt notes, MOVNTI is weakly ordered with respect to other
instructions operating on the memory hierarchy. This needs to be
handled by the caller by executing an SFENCE when done.
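
As an example, a caller clearing a run of pages might do (sketch only;
wmb() expands to sfence on x86-64):

  for (i = 0; i < nr_pages; i++)
          clear_page_movnt(page_address(page + i));
  wmb();  /* order the movnti stores before any dependent writes */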

The implementation is fairly straightforward. We unroll the inner loop
to keep it similar to memset_movnti(), so we can use that to gauge
clear_page_movnt() performance via perf bench mem memset.

 # Intel Icelake-X
 # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
 # (X86_FEATURE_ERMS) and x86-64-movnt:

 System:      Oracle X9-2 (2 nodes * 32 cores * 2 threads)
 Processor:   Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelake-X)
 Memory:      512 GB evenly split between nodes
 LLC-size:    48MB for each node (32-cores * 2-threads)
 no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance

              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)      diff
              ----------------------    ---------------------      -------
     size            BW   (   stdev)          BW   (   stdev)

      2MB      14.37 GB/s ( +- 1.55)     12.59 GB/s ( +- 1.20)     -12.38%
     16MB      16.93 GB/s ( +- 2.61)     15.91 GB/s ( +- 2.74)      -6.02%
    128MB      12.12 GB/s ( +- 1.06)     22.33 GB/s ( +- 1.84)     +84.24%
   1024MB      12.12 GB/s ( +- 0.02)     23.92 GB/s ( +- 0.14)     +97.35%
   4096MB      12.08 GB/s ( +- 0.02)     23.98 GB/s ( +- 0.18)     +98.50%

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/page_64.h |  1 +
 arch/x86/lib/clear_page_64.S   | 26 ++++++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 4bde0dc66100..cfb95069cf9e 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -43,6 +43,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
 void clear_page_orig(void *page);
 void clear_page_rep(void *page);
 void clear_page_erms(void *page);
+void clear_page_movnt(void *page);
 
 static inline void clear_page(void *page)
 {
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index c4c7dd115953..578f40db0716 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -50,3 +50,29 @@ SYM_FUNC_START(clear_page_erms)
 	ret
 SYM_FUNC_END(clear_page_erms)
 EXPORT_SYMBOL_GPL(clear_page_erms)
+
+/*
+ * Zero a page.
+ * %rdi - page
+ *
+ * Caller needs to issue a sfence at the end.
+ */
+SYM_FUNC_START(clear_page_movnt)
+	xorl	%eax,%eax
+	movl	$4096,%ecx
+
+	.p2align 4
+.Lstart:
+        movnti  %rax, 0x00(%rdi)
+        movnti  %rax, 0x08(%rdi)
+        movnti  %rax, 0x10(%rdi)
+        movnti  %rax, 0x18(%rdi)
+        movnti  %rax, 0x20(%rdi)
+        movnti  %rax, 0x28(%rdi)
+        movnti  %rax, 0x30(%rdi)
+        movnti  %rax, 0x38(%rdi)
+        addq    $0x40, %rdi
+        subl    $0x40, %ecx
+        ja      .Lstart
+	ret
+SYM_FUNC_END(clear_page_movnt)
-- 
2.29.2




* [PATCH v2 04/14] x86/asm: add clzero based page clearing
  2021-10-20 17:02 [PATCH v2 00/14] Use uncached stores while clearing huge pages Ankur Arora
                   ` (2 preceding siblings ...)
  2021-10-20 17:02 ` [PATCH v2 03/14] x86/asm: add uncached page clearing Ankur Arora
@ 2021-10-20 17:02 ` Ankur Arora
  2021-10-20 17:02 ` [PATCH v2 05/14] x86/cpuid: add X86_FEATURE_MOVNT_SLOW Ankur Arora
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Ankur Arora @ 2021-10-20 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: mingo, bp, luto, akpm, mike.kravetz, jon.grimm, kvm, konrad.wilk,
	boris.ostrovsky, Ankur Arora

Add clear_page_clzero(), which uses CLZERO as the underlying primitive.
CLZERO skips the memory hierarchy, so this provides a non-polluting
implementation of clear_page(). Available if X86_FEATURE_CLZERO is set.

CLZERO, from the AMD architecture guide (Vol 3, Rev 3.30):
 "Clears the cache line specified by the logical address in rAX by
  writing a zero to every byte in the line. The instruction uses an
  implied non temporal memory type, similar to a streaming store, and
  uses the write combining protocol to minimize cache pollution.

  CLZERO is weakly-ordered with respect to other instructions that
  operate on memory. Software should use an SFENCE or stronger to
  enforce memory ordering of CLZERO with respect to other store
  instructions.

  The CLZERO instruction executes at any privilege level. CLZERO
  performs all the segmentation and paging checks that a store of
  the specified cache line would perform."

The use-case is similar to clear_page_movnt(), except that
clear_page_clzero() is expected to be more performant.

Cc: jon.grimm@amd.com
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/page_64.h |  1 +
 arch/x86/lib/clear_page_64.S   | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index cfb95069cf9e..3c53f8ef8818 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -44,6 +44,7 @@ void clear_page_orig(void *page);
 void clear_page_rep(void *page);
 void clear_page_erms(void *page);
 void clear_page_movnt(void *page);
+void clear_page_clzero(void *page);
 
 static inline void clear_page(void *page)
 {
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 578f40db0716..1cb29a4454e1 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -76,3 +76,22 @@ SYM_FUNC_START(clear_page_movnt)
         ja      .Lstart
 	ret
 SYM_FUNC_END(clear_page_movnt)
+
+/*
+ * Zero a page using clzero (on AMD.)
+ * %rdi - page
+ *
+ * Caller needs to issue a sfence at the end.
+ */
+SYM_FUNC_START(clear_page_clzero)
+	movl	$4096,%ecx
+	movq	%rdi,%rax
+
+	.p2align 4
+.Liter:
+        clzero
+        addq    $0x40, %rax
+        sub     $0x40, %ecx
+        ja      .Liter
+	ret
+SYM_FUNC_END(clear_page_clzero)
-- 
2.29.2




* [PATCH v2 05/14] x86/cpuid: add X86_FEATURE_MOVNT_SLOW
  2021-10-20 17:02 [PATCH v2 00/14] Use uncached stores while clearing huge pages Ankur Arora
                   ` (3 preceding siblings ...)
  2021-10-20 17:02 ` [PATCH v2 04/14] x86/asm: add clzero based " Ankur Arora
@ 2021-10-20 17:02 ` Ankur Arora
  2021-10-20 17:02 ` [PATCH v2 06/14] sparse: add address_space __incoherent Ankur Arora
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Ankur Arora @ 2021-10-20 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: mingo, bp, luto, akpm, mike.kravetz, jon.grimm, kvm, konrad.wilk,
	boris.ostrovsky, Ankur Arora

Add X86_FEATURE_MOVNT_SLOW, to be enabled on microarchitectures where
MOVNT is slower for bulk page clearing than the standard cached
clear_page() idiom.

Also add check_movnt_quirks(), which is where this feature would be
set.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/kernel/cpu/amd.c          |  2 ++
 arch/x86/kernel/cpu/bugs.c         | 15 +++++++++++++++
 arch/x86/kernel/cpu/cpu.h          |  2 ++
 arch/x86/kernel/cpu/intel.c        |  1 +
 5 files changed, 21 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d0ce5cfd3ac1..69191f175c2c 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -294,6 +294,7 @@
 #define X86_FEATURE_PER_THREAD_MBA	(11*32+ 7) /* "" Per-thread Memory Bandwidth Allocation */
 #define X86_FEATURE_SGX1		(11*32+ 8) /* "" Basic SGX */
 #define X86_FEATURE_SGX2		(11*32+ 9) /* "" SGX Enclave Dynamic Memory Management (EDMM) */
+#define X86_FEATURE_MOVNT_SLOW		(11*32+10) /* MOVNT is slow. (see check_movnt_quirks()) */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
 #define X86_FEATURE_AVX_VNNI		(12*32+ 4) /* AVX VNNI instructions */
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 2131af9f2fa2..5de83c6fe526 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -915,6 +915,8 @@ static void init_amd(struct cpuinfo_x86 *c)
 	if (c->x86 >= 0x10)
 		set_cpu_cap(c, X86_FEATURE_REP_GOOD);
 
+	check_movnt_quirks(c);
+
 	/* get apicid instead of initial apic id from cpuid */
 	c->apicid = hard_smp_processor_id();
 
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index ecfca3bbcd96..4e1558d22a5f 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -84,6 +84,21 @@ EXPORT_SYMBOL_GPL(mds_idle_clear);
  */
 DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);
 
+void check_movnt_quirks(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_X86_64
+	/*
+	 * Check if MOVNT is slower than the model specific clear_page()
+	 * idiom (movq/rep-stosb/rep-stosq etc) for bulk page clearing.
+	 * (Bulk is defined here as LLC-sized or larger.)
+	 *
+	 * Condition this check on CONFIG_X86_64 so we don't have
+	 * to worry about any CONFIG_X86_32 families that don't
+	 * support SSE2/MOVNT.
+	 */
+#endif /* CONFIG_X86_64*/
+}
+
 void __init check_bugs(void)
 {
 	identify_boot_cpu();
diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h
index 95521302630d..72e3715d63ea 100644
--- a/arch/x86/kernel/cpu/cpu.h
+++ b/arch/x86/kernel/cpu/cpu.h
@@ -83,4 +83,6 @@ extern void update_srbds_msr(void);
 
 extern u64 x86_read_arch_cap_msr(void);
 
+void check_movnt_quirks(struct cpuinfo_x86 *c);
+
 #endif /* ARCH_X86_CPU_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 8321c43554a1..36a2f8e88b74 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -666,6 +666,7 @@ static void init_intel(struct cpuinfo_x86 *c)
 		c->x86_cache_alignment = c->x86_clflush_size * 2;
 	if (c->x86 == 6)
 		set_cpu_cap(c, X86_FEATURE_REP_GOOD);
+	check_movnt_quirks(c);
 #else
 	/*
 	 * Names for the Pentium II/Celeron processors
-- 
2.29.2




* [PATCH v2 06/14] sparse: add address_space __incoherent
  2021-10-20 17:02 [PATCH v2 00/14] Use uncached stores while clearing huge pages Ankur Arora
                   ` (4 preceding siblings ...)
  2021-10-20 17:02 ` [PATCH v2 05/14] x86/cpuid: add X86_FEATURE_MOVNT_SLOW Ankur Arora
@ 2021-10-20 17:02 ` Ankur Arora
  2021-10-20 17:02 ` [PATCH v2 07/14] x86/clear_page: add clear_page_uncached() Ankur Arora
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Ankur Arora @ 2021-10-20 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: mingo, bp, luto, akpm, mike.kravetz, jon.grimm, kvm, konrad.wilk,
	boris.ostrovsky, Ankur Arora

Some CPU architectures provide store instructions that are weakly
ordered with respect to other instructions that operate on the memory
hierarchy.

Add a new sparse address_space, __incoherent, to denote pointers used
to operate on such regions.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/compiler_types.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index b6ff83a714ca..f7f68d7bc494 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -11,6 +11,7 @@
 # define __iomem	__attribute__((noderef, address_space(__iomem)))
 # define __percpu	__attribute__((noderef, address_space(__percpu)))
 # define __rcu		__attribute__((noderef, address_space(__rcu)))
+# define __incoherent __attribute__((noderef, address_space(__incoherent)))
 static inline void __chk_user_ptr(const volatile void __user *ptr) { }
 static inline void __chk_io_ptr(const volatile void __iomem *ptr) { }
 /* context/locking */
@@ -37,6 +38,7 @@ static inline void __chk_io_ptr(const volatile void __iomem *ptr) { }
 # define __iomem
 # define __percpu
 # define __rcu
+# define __incoherent
 # define __chk_user_ptr(x)	(void)0
 # define __chk_io_ptr(x)	(void)0
 /* context/locking */
-- 
2.29.2




* [PATCH v2 07/14] x86/clear_page: add clear_page_uncached()
  2021-10-20 17:02 [PATCH v2 00/14] Use uncached stores while clearing huge pages Ankur Arora
                   ` (5 preceding siblings ...)
  2021-10-20 17:02 ` [PATCH v2 06/14] sparse: add address_space __incoherent Ankur Arora
@ 2021-10-20 17:02 ` Ankur Arora
  2021-12-08  8:58   ` Yu Xu
  2021-10-20 17:02 ` [PATCH v2 08/14] mm/clear_page: add clear_page_uncached_threshold() Ankur Arora
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 19+ messages in thread
From: Ankur Arora @ 2021-10-20 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: mingo, bp, luto, akpm, mike.kravetz, jon.grimm, kvm, konrad.wilk,
	boris.ostrovsky, Ankur Arora

Expose the low-level uncached primitives (clear_page_movnt(),
clear_page_clzero()) as alternatives via clear_page_uncached().
Also fall back to clear_page() if X86_FEATURE_MOVNT_SLOW is set
and the CPU does not have X86_FEATURE_CLZERO.

Both uncached primitives use stores that are weakly ordered with
respect to other instructions accessing the memory hierarchy.
To ensure that callers don't mix accesses to different types of
address_spaces, annotate clear_user_page_uncached() and
clear_page_uncached() as taking __incoherent pointers as arguments.

Also add clear_page_uncached_make_coherent(), which provides the
necessary store fence to flush out the uncached stores.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---

Notes:
    This patch adds the fallback definitions of clear_user_page_uncached()
    etc in include/linux/mm.h which is likely not the right place for it.

    I'm guessing these should be moved to include/asm-generic/page.h
    (or maybe a new include/asm-generic/page_uncached.h) and for
    architectures that do have arch/$arch/include/asm/page.h (which
    seems like all of them), also replicate there?

    Anyway, wanted to first check if that's the way to do it, before 
    doing that.

 arch/x86/include/asm/page.h    | 10 ++++++++++
 arch/x86/include/asm/page_32.h |  9 +++++++++
 arch/x86/include/asm/page_64.h | 32 ++++++++++++++++++++++++++++++++
 include/linux/mm.h             | 14 ++++++++++++++
 4 files changed, 65 insertions(+)

diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
index 94dbd51df58f..163be03ac422 100644
--- a/arch/x86/include/asm/page_32.h
+++ b/arch/x86/include/asm/page_32.h
@@ -39,6 +39,15 @@ static inline void clear_page(void *page)
 	memset(page, 0, PAGE_SIZE);
 }
 
+static inline void clear_page_uncached(__incoherent void *page)
+{
+	clear_page((__force void *) page);
+}
+
+static inline void clear_page_uncached_make_coherent(void)
+{
+}
+
 static inline void copy_page(void *to, void *from)
 {
 	memcpy(to, from, PAGE_SIZE);
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 3c53f8ef8818..d7946047c70f 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -56,6 +56,38 @@ static inline void clear_page(void *page)
 			   : "cc", "memory", "rax", "rcx");
 }
 
+/*
+ * clear_page_uncached: only allowed on __incoherent memory regions.
+ */
+static inline void clear_page_uncached(__incoherent void *page)
+{
+	alternative_call_2(clear_page_movnt,
+			   clear_page, X86_FEATURE_MOVNT_SLOW,
+			   clear_page_clzero, X86_FEATURE_CLZERO,
+			   "=D" (page),
+			   "0" (page)
+			   : "cc", "memory", "rax", "rcx");
+}
+
+/*
+ * clear_page_uncached_make_coherent: executes the necessary store
+ * fence after which __incoherent regions can be safely accessed.
+ */
+static inline void clear_page_uncached_make_coherent(void)
+{
+	/*
+	 * Keep the sfence for oldinstr and clzero separate to guard against
+	 * the possibility that a cpu-model both has X86_FEATURE_MOVNT_SLOW
+	 * and X86_FEATURE_CLZERO.
+	 *
+	 * The alternatives need to be in the same order as the ones
+	 * in clear_page_uncached().
+	 */
+	alternative_2("sfence",
+		      "", X86_FEATURE_MOVNT_SLOW,
+		      "sfence", X86_FEATURE_CLZERO);
+}
+
 void copy_page(void *to, void *from);
 
 #ifdef CONFIG_X86_5LEVEL
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 73a52aba448f..b88069d1116c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3192,6 +3192,20 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
 
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
 
+#ifndef clear_user_page_uncached
+/*
+ * clear_user_page_uncached: fallback to the standard clear_user_page().
+ */
+static inline void clear_user_page_uncached(__incoherent void *page,
+					unsigned long vaddr, struct page *pg)
+{
+	clear_user_page((__force void *)page, vaddr, pg);
+}
+
+static inline void clear_page_uncached_make_coherent(void) { }
+#endif
+
+
 #ifdef CONFIG_DEBUG_PAGEALLOC
 extern unsigned int _debug_guardpage_minorder;
 DECLARE_STATIC_KEY_FALSE(_debug_guardpage_enabled);
-- 
2.29.2




* [PATCH v2 08/14] mm/clear_page: add clear_page_uncached_threshold()
  2021-10-20 17:02 [PATCH v2 00/14] Use uncached stores while clearing huge pages Ankur Arora
                   ` (6 preceding siblings ...)
  2021-10-20 17:02 ` [PATCH v2 07/14] x86/clear_page: add clear_page_uncached() Ankur Arora
@ 2021-10-20 17:02 ` Ankur Arora
  2021-10-20 17:03 ` [PATCH v2 09/14] x86/clear_page: add arch_clear_page_uncached_threshold() Ankur Arora
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Ankur Arora @ 2021-10-20 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: mingo, bp, luto, akpm, mike.kravetz, jon.grimm, kvm, konrad.wilk,
	boris.ostrovsky, Ankur Arora

Introduce clear_page_uncached_threshold, which provides the threshold
above which clear_page_uncached() is used.

The ideal threshold value depends on the CPU architecture and where
the performance curves for cached and uncached stores intersect.
Typically this would depend on microarchitectural details and the LLC
size. Here, we choose 8MB (CLEAR_PAGE_UNCACHED_THRESHOLD), which
corresponds to a reasonably sized LLC.

Also define clear_page_prefer_uncached(), which provides the user
interface to query this threshold.
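
As an illustration, the intended use looks like this (patch 10 ends up
doing essentially this inside clear_huge_page()):

  if (uncached ||
      clear_page_prefer_uncached(pages_per_huge_page * PAGE_SIZE)) {
          process_huge_page(addr_hint, pages_per_huge_page,
                            clear_subpage_uncached, page);
          clear_page_uncached_make_coherent();
  } else {
          process_huge_page(addr_hint, pages_per_huge_page,
                            clear_subpage, page);
  }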

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/mm.h | 18 ++++++++++++++++++
 mm/memory.c        | 30 ++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b88069d1116c..49a97f817eb2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3190,6 +3190,24 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
 				   (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)));
 }
 
+/*
+ * Default size beyond which huge page clearing uses the uncached
+ * path. We size it for a reasonably sized LLC.
+ */
+#define CLEAR_PAGE_UNCACHED_THRESHOLD	(8 << 20)
+
+/*
+ * Arch specific code can define arch_clear_page_uncached_threshold()
+ * to override CLEAR_PAGE_UNCACHED_THRESHOLD with a machine specific value.
+ */
+extern unsigned long __init arch_clear_page_uncached_threshold(void);
+
+extern bool clear_page_prefer_uncached(unsigned long extent);
+#else
+static inline bool clear_page_prefer_uncached(unsigned long extent)
+{
+	return false;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
 
 #ifndef clear_user_page_uncached
diff --git a/mm/memory.c b/mm/memory.c
index adf9b9ef8277..9f6059520985 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5266,6 +5266,36 @@ EXPORT_SYMBOL(__might_fault);
 #endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+
+static unsigned long __read_mostly clear_page_uncached_threshold =
+					CLEAR_PAGE_UNCACHED_THRESHOLD;
+
+/* Arch code can override for a machine specific value. */
+unsigned long __weak __init arch_clear_page_uncached_threshold(void)
+{
+	return CLEAR_PAGE_UNCACHED_THRESHOLD;
+}
+
+static int __init setup_clear_page_uncached_threshold(void)
+{
+	clear_page_uncached_threshold =
+		arch_clear_page_uncached_threshold() / PAGE_SIZE;
+	return 0;
+}
+
+/*
+ * cacheinfo is setup via device_initcall and we want to get set after
+ * that. Use the default value until then.
+ */
+late_initcall(setup_clear_page_uncached_threshold);
+
+bool clear_page_prefer_uncached(unsigned long extent)
+{
+	unsigned long pages = extent / PAGE_SIZE;
+
+	return pages >= clear_page_uncached_threshold;
+}
+
 /*
  * Process all subpages of the specified huge page with the specified
  * operation.  The target subpage will be processed last to keep its
-- 
2.29.2




* [PATCH v2 09/14] x86/clear_page: add arch_clear_page_uncached_threshold()
  2021-10-20 17:02 [PATCH v2 00/14] Use uncached stores while clearing huge pages Ankur Arora
                   ` (7 preceding siblings ...)
  2021-10-20 17:02 ` [PATCH v2 08/14] mm/clear_page: add clear_page_uncached_threshold() Ankur Arora
@ 2021-10-20 17:03 ` Ankur Arora
  2021-10-20 17:03 ` [PATCH v2 10/14] clear_huge_page: use uncached path Ankur Arora
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Ankur Arora @ 2021-10-20 17:03 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: mingo, bp, luto, akpm, mike.kravetz, jon.grimm, kvm, konrad.wilk,
	boris.ostrovsky, Ankur Arora

Add arch_clear_page_uncached_threshold(), which provides a machine
specific value above which clear_page_uncached() would be used.

The ideal threshold value depends on the CPU model and where the
performance curves for cached and uncached stores intersect.
A safe value is the LLC size, so we use that of the boot CPU.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/cacheinfo.h |  1 +
 arch/x86/kernel/cpu/cacheinfo.c  | 13 +++++++++++++
 arch/x86/kernel/setup.c          |  6 ++++++
 3 files changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/cacheinfo.h b/arch/x86/include/asm/cacheinfo.h
index 86b2e0dcc4bf..5c6045699e94 100644
--- a/arch/x86/include/asm/cacheinfo.h
+++ b/arch/x86/include/asm/cacheinfo.h
@@ -4,5 +4,6 @@
 
 void cacheinfo_amd_init_llc_id(struct cpuinfo_x86 *c, int cpu);
 void cacheinfo_hygon_init_llc_id(struct cpuinfo_x86 *c, int cpu);
+int cacheinfo_lookup_max_size(int cpu);
 
 #endif /* _ASM_X86_CACHEINFO_H */
diff --git a/arch/x86/kernel/cpu/cacheinfo.c b/arch/x86/kernel/cpu/cacheinfo.c
index b5e36bd0425b..6c34fc22d9ae 100644
--- a/arch/x86/kernel/cpu/cacheinfo.c
+++ b/arch/x86/kernel/cpu/cacheinfo.c
@@ -1033,3 +1033,16 @@ int populate_cache_leaves(unsigned int cpu)
 
 	return 0;
 }
+
+int cacheinfo_lookup_max_size(int cpu)
+{
+	struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
+	struct cacheinfo *this_leaf = this_cpu_ci->info_list;
+	struct cacheinfo *max_leaf;
+
+	/*
+	 * Assume that cache sizes always increase with level.
+	 */
+	max_leaf = this_leaf + this_cpu_ci->num_leaves - 1;
+	return max_leaf->size;
+}
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 40ed44ead063..1b3e2c40f832 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -49,6 +49,7 @@
 #include <asm/thermal.h>
 #include <asm/unwind.h>
 #include <asm/vsyscall.h>
+#include <asm/cacheinfo.h>
 #include <linux/vmalloc.h>
 
 /*
@@ -1250,3 +1251,8 @@ static int __init register_kernel_offset_dumper(void)
 	return 0;
 }
 __initcall(register_kernel_offset_dumper);
+
+unsigned long __init arch_clear_page_uncached_threshold(void)
+{
+	return cacheinfo_lookup_max_size(0);
+}
-- 
2.29.2




* [PATCH v2 10/14] clear_huge_page: use uncached path
  2021-10-20 17:02 [PATCH v2 00/14] Use uncached stores while clearing huge pages Ankur Arora
                   ` (8 preceding siblings ...)
  2021-10-20 17:03 ` [PATCH v2 09/14] x86/clear_page: add arch_clear_page_uncached_threshold() Ankur Arora
@ 2021-10-20 17:03 ` Ankur Arora
  2021-10-20 17:03 ` [PATCH v2 11/14] gup: add FOLL_HINT_BULK, FAULT_FLAG_UNCACHED Ankur Arora
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Ankur Arora @ 2021-10-20 17:03 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: mingo, bp, luto, akpm, mike.kravetz, jon.grimm, kvm, konrad.wilk,
	boris.ostrovsky, Ankur Arora

Uncached stores are suitable for circumstances where the region written
to is not expected to be read again soon, or where the region written
to is large enough that there's no expectation that we will find the
data in the cache.

Add a new helper clear_subpage_uncached(), which handles the uncached
clearing path for huge and gigantic pages.

This path is always invoked for gigantic pages; for huge pages it is
used only if pages_per_huge_page is larger than the architectural
threshold, or if the user gives an explicit hint (say, for a bulk
transfer).

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/mm.h |  3 ++-
 mm/huge_memory.c   |  3 ++-
 mm/hugetlb.c       |  3 ++-
 mm/memory.c        | 46 ++++++++++++++++++++++++++++++++++++++++++----
 4 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 49a97f817eb2..3e8ddec2aba2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3164,7 +3164,8 @@ enum mf_action_page_type {
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 extern void clear_huge_page(struct page *page,
 			    unsigned long addr_hint,
-			    unsigned int pages_per_huge_page);
+			    unsigned int pages_per_huge_page,
+			    bool hint_uncached);
 extern void copy_user_huge_page(struct page *dst, struct page *src,
 				unsigned long addr_hint,
 				struct vm_area_struct *vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5e9ef0fc261e..ffd4b07285ba 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -600,6 +600,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 	pgtable_t pgtable;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	vm_fault_t ret = 0;
+	bool uncached = false;
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
@@ -617,7 +618,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		goto release;
 	}
 
-	clear_huge_page(page, vmf->address, HPAGE_PMD_NR);
+	clear_huge_page(page, vmf->address, HPAGE_PMD_NR, uncached);
 	/*
 	 * The memory barrier inside __SetPageUptodate makes sure that
 	 * clear_huge_page writes become visible before the set_pmd_at()
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 95dc7b83381f..a920b1133cdb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4874,6 +4874,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
 	bool new_page, new_pagecache_page = false;
+	bool uncached = false;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -4928,7 +4929,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			spin_unlock(ptl);
 			goto out;
 		}
-		clear_huge_page(page, address, pages_per_huge_page(h));
+		clear_huge_page(page, address, pages_per_huge_page(h), uncached);
 		__SetPageUptodate(page);
 		new_page = true;
 
diff --git a/mm/memory.c b/mm/memory.c
index 9f6059520985..ef365948f595 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5347,6 +5347,27 @@ static inline void process_huge_page(
 	}
 }
 
+/*
+ * clear_subpage_uncached: clear the page in an uncached fashion.
+ */
+static void clear_subpage_uncached(unsigned long addr, int idx, void *arg)
+{
+	__incoherent void *kaddr;
+	struct page *page = arg;
+
+	page = page + idx;
+
+	/*
+	 * Do the kmap explicitly here since clear_user_page_uncached()
+	 * only handles __incoherent addresses.
+	 *
+	 * Caller is responsible for making the region coherent again.
+	 */
+	kaddr = (__incoherent void *)kmap_atomic(page);
+	clear_user_page_uncached(kaddr, addr + idx * PAGE_SIZE, page);
+	kunmap_atomic((__force void *)kaddr);
+}
+
 static void clear_gigantic_page(struct page *page,
 				unsigned long addr,
 				unsigned int pages_per_huge_page)
@@ -5358,7 +5379,8 @@ static void clear_gigantic_page(struct page *page,
 	for (i = 0; i < pages_per_huge_page;
 	     i++, p = mem_map_next(p, page, i)) {
 		cond_resched();
-		clear_user_highpage(p, addr + i * PAGE_SIZE);
+
+		clear_subpage_uncached(addr + i * PAGE_SIZE, 0, p);
 	}
 }
 
@@ -5369,18 +5391,34 @@ static void clear_subpage(unsigned long addr, int idx, void *arg)
 	clear_user_highpage(page + idx, addr);
 }
 
-void clear_huge_page(struct page *page,
-		     unsigned long addr_hint, unsigned int pages_per_huge_page)
+void clear_huge_page(struct page *page, unsigned long addr_hint,
+			unsigned int pages_per_huge_page, bool uncached)
 {
 	unsigned long addr = addr_hint &
 		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
 		clear_gigantic_page(page, addr, pages_per_huge_page);
+
+		/* Gigantic page clearing always uses __incoherent. */
+		clear_page_uncached_make_coherent();
 		return;
 	}
 
-	process_huge_page(addr_hint, pages_per_huge_page, clear_subpage, page);
+	/*
+	 * The uncached path is typically slower for small extents so take it
+	 * only if the user provides an explicit hint or if the extent is large
+	 * enough that there are no cache expectations.
+	 */
+	if (uncached ||
+		clear_page_prefer_uncached(pages_per_huge_page * PAGE_SIZE)) {
+		process_huge_page(addr_hint, pages_per_huge_page,
+					clear_subpage_uncached, page);
+
+		clear_page_uncached_make_coherent();
+	} else {
+		process_huge_page(addr_hint, pages_per_huge_page, clear_subpage, page);
+	}
 }
 
 static void copy_user_gigantic_page(struct page *dst, struct page *src,
-- 
2.29.2




* [PATCH v2 11/14] gup: add FOLL_HINT_BULK, FAULT_FLAG_UNCACHED
  2021-10-20 17:02 [PATCH v2 00/14] Use uncached stores while clearing huge pages Ankur Arora
                   ` (9 preceding siblings ...)
  2021-10-20 17:03 ` [PATCH v2 10/14] clear_huge_page: use uncached path Ankur Arora
@ 2021-10-20 17:03 ` Ankur Arora
  2021-10-20 17:03 ` [PATCH v2 12/14] gup: use uncached path when clearing large regions Ankur Arora
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Ankur Arora @ 2021-10-20 17:03 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: mingo, bp, luto, akpm, mike.kravetz, jon.grimm, kvm, konrad.wilk,
	boris.ostrovsky, Ankur Arora

Add FOLL_HINT_BULK, which users of get_user_pages() and
pin_user_pages() can use to signal that this call is one of many,
allowing get_user_pages() to optimize accordingly.

Also add FAULT_FLAG_UNCACHED, which translates to the
same idea for the fault path.

As the second flag suggests, the intent for both is to hint the
use of an uncached path if one is available.
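
The intended plumbing is simple; roughly (a sketch -- the real change
is in patch 12), the gup path translates the caller's hint into the
fault flag:

  if (gup_flags & FOLL_HINT_BULK)
          fault_flags |= FAULT_FLAG_UNCACHED;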

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/mm.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3e8ddec2aba2..cf1c711fe5ba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -446,6 +446,7 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_UNCACHED: Fault handling to choose the uncached path.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -477,6 +478,7 @@ enum fault_flag {
 	FAULT_FLAG_REMOTE =		1 << 7,
 	FAULT_FLAG_INSTRUCTION =	1 << 8,
 	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
+	FAULT_FLAG_UNCACHED =		1 << 10,
 };
 
 /*
@@ -2864,6 +2866,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
 #define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
+#define FOLL_HINT_BULK	0x100000 /* part of a larger extent being gup'd */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
-- 
2.29.2




* [PATCH v2 12/14] gup: use uncached path when clearing large regions
  2021-10-20 17:02 [PATCH v2 00/14] Use uncached stores while clearing huge pages Ankur Arora
                   ` (10 preceding siblings ...)
  2021-10-20 17:03 ` [PATCH v2 11/14] gup: add FOLL_HINT_BULK, FAULT_FLAG_UNCACHED Ankur Arora
@ 2021-10-20 17:03 ` Ankur Arora
  2021-10-20 18:52 ` [PATCH v2 13/14] vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages() Ankur Arora
  2021-10-20 18:52 ` [PATCH v2 14/14] x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake Ankur Arora
  13 siblings, 0 replies; 19+ messages in thread
From: Ankur Arora @ 2021-10-20 17:03 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: mingo, bp, luto, akpm, mike.kravetz, jon.grimm, kvm, konrad.wilk,
	boris.ostrovsky, Ankur Arora

When clearing a large region, or when the user explicitly specifies
via FOLL_HINT_BULK that a call to get_user_pages() is part of a larger
region, take the uncached path.

One notable limitation is that this is only done when the underlying
pages are huge or gigantic, even if a large region composed of PAGE_SIZE
pages is being cleared. This is because uncached stores are generally
weakly ordered and need some kind of store fence -- which would need
to be done at PTE write granularity to avoid data leakage. This would be
expensive enough that it would negate any performance advantage.

Performance
====

System:    Oracle E4-2C (2 nodes * 64 cores * 2 threads) (Milan)
Processor: AMD EPYC 7J13 64-Core
Memory:    2048 GB evenly split between nodes
LLC-size:  32MB for each CCX (8-core * 2-threads)
boost: 0, Microcode: 0xa001137, scaling-governor: performance

System:    Oracle X9-2 (2 nodes * 32 cores * 2 threads) (Icelake)
Processor: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
Memory:    512 GB evenly split between nodes
LLC-size:  48MB for each node (32-cores * 2-threads)
no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance

Workload: qemu-VM-create
==

Create a large VM, backed by preallocated 2MB pages.
(This test needs a minor change in qemu so it mmap's with
MAP_POPULATE instead of demand faulting each page.)

 Milan,     sz=1550 GB, runs=3   BW           stdev     diff
                                 ----------   ------    --------
 baseline   (clear_page_erms)     8.05 GBps     0.08
 CLZERO   (clear_page_clzero)    29.94 GBps     0.31    +271.92%

 (VM creation time decreases from 192.6s to 51.7s.)

 Icelake, sz=200 GB, runs=3      BW           stdev     diff
                                 ----------   ------    ---------
 baseline   (clear_page_erms)     8.25 GBps     0.05
 MOVNT     (clear_page_movnt)    21.55 GBps     0.31    +161.21%

 (VM creation time decreases from 25.2s to 9.3s.)

As the diff shows, for both these micro-architectures there's a
significant speedup with the CLZERO and MOVNT based interfaces.

Workload: Kernel build with background clear_huge_page()
==

Probe the cache-pollution aspect of this commit with a kernel
build (make -j 15 bzImage) alongside a background clear_huge_page()
load which does mmap(length=64GB, flags=MAP_POPULATE|MAP_HUGE_2MB)
in a loop.

The expectation -- assuming the kernel build performance is partly
cache limited -- is that a background clear_page_erms() load should
slow the build down more than a clear_page_movnt() or
clear_page_clzero() load would.
The build itself does not use THP or similar, so any performance changes
are due to the background load.

  # Milan, compile.sh internally tasksets to a CCX
  # perf stat -r 5  -e task-clock -e cycles -e stalled-cycles-frontend \
    -e stalled-cycles-backend -e instructions -e branches              \
    -e branch-misses -e L1-dcache-loads -e L1-dcache-load-misses       \
    -e cache-references -e cache-misses -e all_data_cache_accesses     \
    -e l1_data_cache_fills_all  -e l1_data_cache_fills_from_memory     \
    ./compile.sh

      Milan               kernel-build[1]          kernel-build[2]            kernel-build[3]
                          (bg: nothing)            (bg:clear_page_erms())     (bg:clear_page_clzero())
  -----------------     ---------------------      ----------------------     ------------------------
  run time              280.12s     (+- 0.59%)     322.21s     (+- 0.26%)     307.02s     (+- 1.35%)
  IPC                     1.16                       1.05                       1.14
  backend-idle            3.78%     (+- 0.06%)       4.62%     (+- 0.11%)       3.87%     (+- 0.10%)
  cache-misses           20.08%     (+- 0.14%)      20.88%     (+- 0.13%)      20.09%     (+- 0.11%)
  (% of cache-refs)
  l1_data_cache_fills-   2.77M/sec  (+- 0.20%)       3.11M/sec (+- 0.32%)       2.73M/sec (+- 0.12%)
   _from_memory

From the backend-idle stats in [1], the kernel build is only mildly
memory subsystem bound. However, there's a small but clear effect where
the background load of clear_page_clzero() does not leave much of an
imprint on the kernel-build in [3] -- both [1] and [3] have largely
similar IPC, memory and cache behaviour. OTOH, the clear_page_erms()
workload in [2] constrains the kernel-build more.

(Fuller perf stat output, at [1], [2], [3].)

  # Icelake, compile.sh internally tasksets to a socket
  # perf stat -r 5 -e task-clock -e cycles -e stalled-cycles-frontend \
    -e stalled-cycles-backend -e instructions -e branches             \
    -e branch-misses -e L1-dcache-loads -e L1-dcache-load-misses      \
    -e cache-references -e cache-misses -e LLC-loads                  \
    -e LLC-load-misses ./compile.sh

  Icelake               kernel-build[4]         kernel-build[5]            kernel-build[6]
                        (bg: nothing)           (bg:clear_page_erms())     (bg:clear_page_movnt())
  -----------------     -----------------       ----------------------     -----------------------

  run time              135.47s  (+- 0.25%)     136.75s  (+- 0.23%)        135.65s  (+- 0.15%)
  IPC                     1.81                    1.80                       1.80
  cache-misses           21.68%  (+- 0.42%)      22.88%  (+- 0.87%)         21.19%  (+- 0.51%)
  (% of cache-refs)
  LLC-load-misses        35.56%  (+- 0.83%)      37.44%  (+- 0.99%)         33.54%  (+- 1.17%)

From the LLC-load-miss and the cache-miss numbers, clear_page_erms()
seems to cause some additional cache contention for the kernel-build in
[5], compared to [4] and [6]. However, from the IPC and the run time
numbers, it looks like the CPU pipeline compensates for the extra misses
quite well.
(Increasing the number of make jobs to 60 did not change the overall
picture appreciably.)
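
Computed from the run times above: relative to [4], the build in [5] is
~0.9% slower ((136.75 - 135.47) / 135.47) and the build in [6] ~0.1%
slower ((135.65 - 135.47) / 135.47).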

(Fuller perf stat output, at [4], [5], [6].)

[1] Milan, kernel-build

  Performance counter stats for './compile.sh' (5 runs):

      2,525,721.45 msec task-clock                #    9.016 CPUs utilized            ( +-  0.06% )
 4,642,144,895,632      cycles                    #    1.838 GHz                      ( +-  0.01% )  (47.38%)
    54,430,239,074      stalled-cycles-frontend   #    1.17% frontend cycles idle     ( +-  0.16% )  (47.35%)
   175,620,521,760      stalled-cycles-backend    #    3.78% backend cycles idle      ( +-  0.06% )  (47.34%)
 5,392,053,273,328      instructions              #    1.16  insn per cycle
                                                  #    0.03  stalled cycles per insn  ( +-  0.02% )  (47.34%)
 1,181,224,298,651      branches                  #  467.572 M/sec                    ( +-  0.01% )  (47.33%)
    27,668,103,863      branch-misses             #    2.34% of all branches          ( +-  0.04% )  (47.33%)
 2,141,384,087,286      L1-dcache-loads           #  847.639 M/sec                    ( +-  0.01% )  (47.32%)
    86,216,717,118      L1-dcache-load-misses     #    4.03% of all L1-dcache accesses  ( +-  0.08% )  (47.35%)
   264,844,001,975      cache-references          #  104.835 M/sec                    ( +-  0.03% )  (47.36%)
    53,225,109,745      cache-misses              #   20.086 % of all cache refs      ( +-  0.14% )  (47.37%)
 2,610,041,169,859      all_data_cache_accesses   #    1.033 G/sec                    ( +-  0.01% )  (47.37%)
    96,419,361,379      l1_data_cache_fills_all   #   38.166 M/sec                    ( +-  0.06% )  (47.37%)
     7,005,118,698      l1_data_cache_fills_from_memory #    2.773 M/sec                    ( +-  0.20% )  (47.38%)

            280.12 +- 1.65 seconds time elapsed  ( +-  0.59% )

[2] Milan, kernel-build (bg: clear_page_erms() workload)

 Performance counter stats for './compile.sh' (5 runs):

      2,852,168.93 msec task-clock                #    8.852 CPUs utilized            ( +-  0.14% )
 5,166,249,772,084      cycles                    #    1.821 GHz                      ( +-  0.05% )  (47.27%)
    62,039,291,151      stalled-cycles-frontend   #    1.20% frontend cycles idle     ( +-  0.04% )  (47.29%)
   238,472,446,709      stalled-cycles-backend    #    4.62% backend cycles idle      ( +-  0.11% )  (47.30%)
 5,419,530,293,688      instructions              #    1.05  insn per cycle
                                                  #    0.04  stalled cycles per insn  ( +-  0.01% )  (47.31%)
 1,186,958,893,481      branches                  #  418.404 M/sec                    ( +-  0.01% )  (47.31%)
    28,106,023,654      branch-misses             #    2.37% of all branches          ( +-  0.03% )  (47.29%)
 2,160,377,315,024      L1-dcache-loads           #  761.534 M/sec                    ( +-  0.03% )  (47.26%)
    89,101,836,173      L1-dcache-load-misses     #    4.13% of all L1-dcache accesses  ( +-  0.06% )  (47.25%)
   276,859,144,248      cache-references          #   97.593 M/sec                    ( +-  0.04% )  (47.22%)
    57,774,174,239      cache-misses              #   20.889 % of all cache refs      ( +-  0.13% )  (47.24%)
 2,641,613,011,234      all_data_cache_accesses   #  931.170 M/sec                    ( +-  0.01% )  (47.22%)
    99,595,968,133      l1_data_cache_fills_all   #   35.108 M/sec                    ( +-  0.06% )  (47.24%)
     8,831,873,628      l1_data_cache_fills_from_memory #    3.113 M/sec                    ( +-  0.32% )  (47.23%)

           322.211 +- 0.837 seconds time elapsed  ( +-  0.26% )

[3] Milan, kernel-build + (bg: clear_page_clzero() workload)

 Performance counter stats for './compile.sh' (5 runs):

      2,607,387.17 msec task-clock                #    8.493 CPUs utilized            ( +-  0.14% )
 4,749,807,054,468      cycles                    #    1.824 GHz                      ( +-  0.09% )  (47.28%)
    56,579,908,946      stalled-cycles-frontend   #    1.19% frontend cycles idle     ( +-  0.19% )  (47.28%)
   183,367,955,020      stalled-cycles-backend    #    3.87% backend cycles idle      ( +-  0.10% )  (47.28%)
 5,395,577,260,957      instructions              #    1.14  insn per cycle
                                                  #    0.03  stalled cycles per insn  ( +-  0.02% )  (47.29%)
 1,181,904,525,139      branches                  #  453.753 M/sec                    ( +-  0.01% )  (47.30%)
    27,702,316,890      branch-misses             #    2.34% of all branches          ( +-  0.02% )  (47.31%)
 2,137,616,885,978      L1-dcache-loads           #  820.667 M/sec                    ( +-  0.01% )  (47.32%)
    85,841,996,509      L1-dcache-load-misses     #    4.02% of all L1-dcache accesses  ( +-  0.03% )  (47.32%)
   262,784,890,310      cache-references          #  100.888 M/sec                    ( +-  0.04% )  (47.32%)
    52,812,245,646      cache-misses              #   20.094 % of all cache refs      ( +-  0.11% )  (47.32%)
 2,605,653,350,299      all_data_cache_accesses   #    1.000 G/sec                    ( +-  0.01% )  (47.32%)
    95,770,076,665      l1_data_cache_fills_all   #   36.768 M/sec                    ( +-  0.03% )  (47.30%)
     7,134,690,513      l1_data_cache_fills_from_memory #    2.739 M/sec                    ( +-  0.12% )  (47.29%)

            307.02 +- 4.15 seconds time elapsed  ( +-  1.35% )

[4] Icelake, kernel-build

 Performance counter stats for './compile.sh' (5 runs):

           421,633      cs                        #  358.780 /sec                     ( +-  0.04% )
      1,173,522.36 msec task-clock                #    8.662 CPUs utilized            ( +-  0.14% )
 2,991,427,421,282      cycles                    #    2.545 GHz                      ( +-  0.15% )  (82.42%)
 5,410,090,251,681      instructions              #    1.81  insn per cycle           ( +-  0.02% )  (91.13%)
 1,189,406,048,438      branches                  #    1.012 G/sec                    ( +-  0.02% )  (91.05%)
    21,291,454,717      branch-misses             #    1.79% of all branches          ( +-  0.02% )  (91.06%)
 1,462,419,736,675      L1-dcache-loads           #    1.244 G/sec                    ( +-  0.02% )  (91.06%)
    47,084,269,809      L1-dcache-load-misses     #    3.22% of all L1-dcache accesses  ( +-  0.01% )  (91.05%)
    23,527,140,332      cache-references          #   20.020 M/sec                    ( +-  0.13% )  (91.04%)
     5,093,132,060      cache-misses              #   21.682 % of all cache refs      ( +-  0.42% )  (91.03%)
     4,220,672,439      LLC-loads                 #    3.591 M/sec                    ( +-  0.14% )  (91.04%)
     1,501,704,609      LLC-load-misses           #   35.56% of all LL-cache accesses  ( +-  0.83% )  (73.10%)

           135.478 +- 0.335 seconds time elapsed  ( +-  0.25% )

[5] Icelake, kernel-build + (bg: clear_page_erms() workload)

 Performance counter stats for './compile.sh' (5 runs):

           410,611      cs                        #  347.771 /sec                     ( +-  0.02% )
      1,184,382.84 msec task-clock                #    8.661 CPUs utilized            ( +-  0.08% )
 3,018,535,155,772      cycles                    #    2.557 GHz                      ( +-  0.08% )  (82.42%)
 5,408,788,104,113      instructions              #    1.80  insn per cycle           ( +-  0.00% )  (91.13%)
 1,189,173,209,515      branches                  #    1.007 G/sec                    ( +-  0.00% )  (91.05%)
    21,279,087,578      branch-misses             #    1.79% of all branches          ( +-  0.01% )  (91.06%)
 1,462,243,374,967      L1-dcache-loads           #    1.238 G/sec                    ( +-  0.00% )  (91.05%)
    47,210,704,159      L1-dcache-load-misses     #    3.23% of all L1-dcache accesses  ( +-  0.02% )  (91.04%)
    23,378,470,958      cache-references          #   19.801 M/sec                    ( +-  0.03% )  (91.05%)
     5,339,921,426      cache-misses              #   22.814 % of all cache refs      ( +-  0.87% )  (91.03%)
     4,241,388,134      LLC-loads                 #    3.592 M/sec                    ( +-  0.02% )  (91.05%)
     1,588,055,137      LLC-load-misses           #   37.44% of all LL-cache accesses  ( +-  0.99% )  (73.09%)

           136.750 +- 0.315 seconds time elapsed  ( +-  0.23% )

[6] Icelake, kernel-build + (bg: clear_page_movnt() workload)

 Performance counter stats for './compile.sh' (5 runs):

           409,978      cs                        #  347.850 /sec                     ( +-  0.06% )
      1,174,090.99 msec task-clock                #    8.655 CPUs utilized            ( +-  0.10% )
 2,992,914,428,930      cycles                    #    2.539 GHz                      ( +-  0.10% )  (82.40%)
 5,408,632,560,457      instructions              #    1.80  insn per cycle           ( +-  0.00% )  (91.12%)
 1,189,083,425,674      branches                  #    1.009 G/sec                    ( +-  0.00% )  (91.05%)
    21,273,992,588      branch-misses             #    1.79% of all branches          ( +-  0.02% )  (91.05%)
 1,462,081,591,012      L1-dcache-loads           #    1.241 G/sec                    ( +-  0.00% )  (91.05%)
    47,071,136,770      L1-dcache-load-misses     #    3.22% of all L1-dcache accesses  ( +-  0.03% )  (91.04%)
    23,331,268,072      cache-references          #   19.796 M/sec                    ( +-  0.05% )  (91.04%)
     4,953,198,057      cache-misses              #   21.190 % of all cache refs      ( +-  0.51% )  (91.04%)
     4,194,721,070      LLC-loads                 #    3.559 M/sec                    ( +-  0.10% )  (91.06%)
     1,412,414,538      LLC-load-misses           #   33.54% of all LL-cache accesses  ( +-  1.17% )  (73.09%)

           135.654 +- 0.203 seconds time elapsed  ( +-  0.15% )

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 fs/hugetlbfs/inode.c |  7 ++++++-
 mm/gup.c             | 20 ++++++++++++++++++++
 mm/huge_memory.c     |  2 +-
 mm/hugetlb.c         |  9 ++++++++-
 4 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index cdfb1ae78a3f..44cee9d30035 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -636,6 +636,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 	loff_t hpage_size = huge_page_size(h);
 	unsigned long hpage_shift = huge_page_shift(h);
 	pgoff_t start, index, end;
+	bool hint_uncached;
 	int error;
 	u32 hash;
 
@@ -653,6 +654,9 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 	start = offset >> hpage_shift;
 	end = (offset + len + hpage_size - 1) >> hpage_shift;
 
+	/* Don't pollute the cache if we are fallocate'ing a large region. */
+	hint_uncached = clear_page_prefer_uncached((end - start) << hpage_shift);
+
 	inode_lock(inode);
 
 	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
@@ -731,7 +735,8 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 			error = PTR_ERR(page);
 			goto out;
 		}
-		clear_huge_page(page, addr, pages_per_huge_page(h));
+		clear_huge_page(page, addr, pages_per_huge_page(h),
+				hint_uncached);
 		__SetPageUptodate(page);
 		error = huge_add_to_page_cache(page, mapping, index);
 		if (unlikely(error)) {
diff --git a/mm/gup.c b/mm/gup.c
index 886d6148d3d0..930944e0c6eb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -933,6 +933,13 @@ static int faultin_page(struct vm_area_struct *vma,
 		 */
 		fault_flags |= FAULT_FLAG_TRIED;
 	}
+	if (*flags & FOLL_HINT_BULK) {
+		/*
+		 * From the user hint, we might be faulting-in a large region
+		 * so minimize cache-pollution.
+		 */
+		fault_flags |= FAULT_FLAG_UNCACHED;
+	}
 
 	ret = handle_mm_fault(vma, address, fault_flags, NULL);
 	if (ret & VM_FAULT_ERROR) {
@@ -1100,6 +1107,19 @@ static long __get_user_pages(struct mm_struct *mm,
 	if (!(gup_flags & FOLL_FORCE))
 		gup_flags |= FOLL_NUMA;
 
+	/*
+	 * Uncached page clearing is generally faster when clearing regions
+	 * sized ~LLC/2 or thereabouts. So hint the uncached path based
+	 * on clear_page_prefer_uncached().
+	 *
+	 * Note, however, that this get_user_pages() call might end up
+	 * needing to clear an extent smaller than nr_pages when we have
+	 * taken the (potentially slower) uncached path based on the
+	 * expectation of a larger nr_pages value.
+	 */
+	if (clear_page_prefer_uncached(nr_pages * PAGE_SIZE))
+		gup_flags |= FOLL_HINT_BULK;
+
 	do {
 		struct page *page;
 		unsigned int foll_flags = gup_flags;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ffd4b07285ba..2d239967a8a1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -600,7 +600,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 	pgtable_t pgtable;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	vm_fault_t ret = 0;
-	bool uncached = false;
+	bool uncached = vmf->flags & FAULT_FLAG_UNCACHED;
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a920b1133cdb..35b643df5854 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4874,7 +4874,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
 	bool new_page, new_pagecache_page = false;
-	bool uncached = false;
+	bool uncached = flags & FAULT_FLAG_UNCACHED;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -5503,6 +5503,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				 */
 				fault_flags |= FAULT_FLAG_TRIED;
 			}
+			if (flags & FOLL_HINT_BULK) {
+				/*
+				 * From the user hint, we might be faulting-in a large
+				 * region so minimize cache-pollution.
+				 */
+				fault_flags |= FAULT_FLAG_UNCACHED;
+			}
 			ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
 			if (ret & VM_FAULT_ERROR) {
 				err = vm_fault_to_errno(ret, flags);
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 13/14] vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages()
  2021-10-20 17:02 [PATCH v2 00/14] Use uncached stores while clearing huge pages Ankur Arora
                   ` (11 preceding siblings ...)
  2021-10-20 17:03 ` [PATCH v2 12/14] gup: use uncached path when clearing large regions Ankur Arora
@ 2021-10-20 18:52 ` Ankur Arora
  2021-10-20 18:52 ` [PATCH v2 14/14] x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake Ankur Arora
  13 siblings, 0 replies; 19+ messages in thread
From: Ankur Arora @ 2021-10-20 18:52 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: mingo, bp, luto, akpm, mike.kravetz, jon.grimm, kvm, konrad.wilk,
	boris.ostrovsky, Ankur Arora, alex.williamson

Specify FOLL_HINT_BULK to pin_user_pages() so it is aware that
this pin is part of a larger region being pinned, so it can
optimize based on that expectation.

Cc: alex.williamson@redhat.com
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0e9217687f5c..0d45b0c6464d 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -557,6 +557,9 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
 	if (prot & IOMMU_WRITE)
 		flags |= FOLL_WRITE;
 
+	/* Tell gup that this iteration is part of a larger set of pins. */
+	flags |= FOLL_HINT_BULK;
+
 	mmap_read_lock(mm);
 	ret = pin_user_pages_remote(mm, vaddr, npages, flags | FOLL_LONGTERM,
 				    pages, NULL, NULL);
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 14/14] x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake
  2021-10-20 17:02 [PATCH v2 00/14] Use uncached stores while clearing huge pages Ankur Arora
                   ` (12 preceding siblings ...)
  2021-10-20 18:52 ` [PATCH v2 13/14] vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages() Ankur Arora
@ 2021-10-20 18:52 ` Ankur Arora
  13 siblings, 0 replies; 19+ messages in thread
From: Ankur Arora @ 2021-10-20 18:52 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: mingo, bp, luto, akpm, mike.kravetz, jon.grimm, kvm, konrad.wilk,
	boris.ostrovsky, Ankur Arora

System:           Oracle X8-2
CPU:              2 nodes * 26 cores/node * 2 threads/core
                  Intel Xeon Platinum 8270CL (Skylakex, 6:85:7)
Memory:           3TB evenly split between nodes
Microcode:        0x5002f01
scaling_governor: performance
L3 size:          36MB
intel_pstate/no_turbo: 1

  $ for i in 2 8 32 128 512; do
	perf bench mem memset -f x86-64-movnt -s ${i}MB
  done
  # Running 'mem/memset' benchmark:
  # function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
  # Copying 2MB bytes ...
         6.361971 GB/sec
  # Copying 8MB bytes ...
         6.300403 GB/sec
  # Copying 32MB bytes ...
         6.288992 GB/sec
  # Copying 128MB bytes ...
         6.328793 GB/sec
  # Copying 512MB bytes ...
         6.324471 GB/sec

 # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
 # (X86_FEATURE_ERMS) and x86-64-movnt:

              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)      speedup
              -----------------------   -----------------------    -------
     size            BW   (   pstdev)          BW   (   pstdev)

     16MB      20.38 GB/s ( +- 2.58%)     6.25 GB/s ( +- 0.41%)   -69.28%
    128MB       6.52 GB/s ( +- 0.14%)     6.31 GB/s ( +- 0.47%)    -3.22%
   1024MB       6.48 GB/s ( +- 0.31%)     6.24 GB/s ( +- 0.00%)    -3.70%
   4096MB       6.51 GB/s ( +- 0.01%)     6.27 GB/s ( +- 0.42%)    -3.68%

Comparing perf stats for size=4096MB:

$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb
 # Running 'mem/memset' benchmark:
 # function 'x86-64-stosb' (movsb-based memset() in arch/x86/lib/memset_64.S)
 # Copying 4096MB bytes ...
       6.516972 GB/sec       (+- 0.01%)

 Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb' (5 runs):

     3,357,373,317      cpu-cycles                #    1.133 GHz                      ( +-  0.01% )  (29.38%)
       165,063,710      instructions              #    0.05  insn per cycle           ( +-  1.54% )  (35.29%)
           358,997      cache-references          #    0.121 M/sec                    ( +-  0.89% )  (35.32%)
           205,420      cache-misses              #   57.221 % of all cache refs      ( +-  3.61% )  (35.36%)
         6,117,673      branch-instructions       #    2.065 M/sec                    ( +-  1.48% )  (35.38%)
            58,309      branch-misses             #    0.95% of all branches          ( +-  1.30% )  (35.39%)
        31,329,466      bus-cycles                #   10.575 M/sec                    ( +-  0.03% )  (23.56%)
        68,543,766      L1-dcache-load-misses     #  157.03% of all L1-dcache accesses  ( +-  0.02% )  (23.53%)
        43,648,909      L1-dcache-loads           #   14.734 M/sec                    ( +-  0.50% )  (23.50%)
           137,498      LLC-loads                 #    0.046 M/sec                    ( +-  0.21% )  (23.49%)
            12,308      LLC-load-misses           #    8.95% of all LL-cache accesses  ( +-  2.52% )  (23.49%)
            26,335      LLC-stores                #    0.009 M/sec                    ( +-  5.65% )  (11.75%)
            25,008      LLC-store-misses          #    0.008 M/sec                    ( +-  3.42% )  (11.75%)

          2.962842 +- 0.000162 seconds time elapsed  ( +-  0.01% )

$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt
 # Running 'mem/memset' benchmark:
 # function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
 # Copying 4096MB bytes ...
       6.283420 GB/sec      (+- 0.01%)

  Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt' (5 runs):

     4,462,272,094      cpu-cycles                #    1.322 GHz                      ( +-  0.30% )  (29.38%)
     1,633,675,881      instructions              #    0.37  insn per cycle           ( +-  0.21% )  (35.28%)
           283,627      cache-references          #    0.084 M/sec                    ( +-  0.58% )  (35.31%)
            28,824      cache-misses              #   10.163 % of all cache refs      ( +- 20.67% )  (35.34%)
       139,719,697      branch-instructions       #   41.407 M/sec                    ( +-  0.16% )  (35.35%)
            58,062      branch-misses             #    0.04% of all branches          ( +-  1.49% )  (35.36%)
        41,760,350      bus-cycles                #   12.376 M/sec                    ( +-  0.05% )  (23.55%)
           303,300      L1-dcache-load-misses     #    0.69% of all L1-dcache accesses  ( +-  2.08% )  (23.53%)
        43,769,498      L1-dcache-loads           #   12.972 M/sec                    ( +-  0.54% )  (23.52%)
            99,570      LLC-loads                 #    0.030 M/sec                    ( +-  1.06% )  (23.52%)
             1,966      LLC-load-misses           #    1.97% of all LL-cache accesses  ( +-  6.17% )  (23.52%)
               129      LLC-stores                #    0.038 K/sec                    ( +- 27.85% )  (11.75%)
                 7      LLC-store-misses          #    0.002 K/sec                    ( +- 47.82% )  (11.75%)

           3.37465 +- 0.00474 seconds time elapsed  ( +-  0.14% )

It's unclear whether using MOVNT is a net negative on Skylake. For bulk
stores MOVNT is slightly slower than REP;STOSB, but the
L1-dcache-load-misses stats (L1D.REPLACEMENT) show that it elides the
write-allocate and thus helps cache efficiency.
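
(For reference, a user-space analogue of a movnt-based clear -- a sketch
using SSE2 intrinsics, not the kernel's memset_movnti() -- illustrating
the cache-bypassing stores and the sfence they require:)

    #include <emmintrin.h>	/* _mm_stream_si64(), _mm_sfence() */
    #include <stddef.h>

    /*
     * Zero a region with non-temporal stores. These bypass the cache,
     * avoiding the write-allocate (L1D.REPLACEMENT traffic) that
     * REP;STOSB incurs, but they are weakly ordered and need an sfence
     * before the zeroed memory can be safely published.
     */
    static void clear_region_movnt(void *dst, size_t len)
    {
    	long long *p = dst;
    	size_t i;

    	for (i = 0; i < len / sizeof(*p); i++)
    		_mm_stream_si64(&p[i], 0);

    	_mm_sfence();
    }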

However, we err on the side of caution and mark Skylake
X86_FEATURE_MOVNT_SLOW.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/kernel/cpu/bugs.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 4e1558d22a5f..222d6f095da1 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -96,6 +96,21 @@ void check_movnt_quirks(struct cpuinfo_x86 *c)
 	 * to worry about any CONFIG_X86_32 families that don't
 	 * support SSE2/MOVNT.
 	 */
+	if (c->x86_vendor == X86_VENDOR_INTEL) {
+		if (c->x86 == 6) {
+			switch (c->x86_model) {
+			case INTEL_FAM6_SKYLAKE_L:
+				fallthrough;
+			case INTEL_FAM6_SKYLAKE:
+				fallthrough;
+			case INTEL_FAM6_SKYLAKE_X:
+				set_cpu_cap(c, X86_FEATURE_MOVNT_SLOW);
+				break;
+			default:
+				break;
+			}
+		}
+	}
 #endif /* CONFIG_X86_64*/
 }
 
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 07/14] x86/clear_page: add clear_page_uncached()
  2021-10-20 17:02 ` [PATCH v2 07/14] x86/clear_page: add clear_page_uncached() Ankur Arora
@ 2021-12-08  8:58   ` Yu Xu
  2021-12-10  4:37     ` Ankur Arora
  0 siblings, 1 reply; 19+ messages in thread
From: Yu Xu @ 2021-12-08  8:58 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm, x86
  Cc: mingo, bp, luto, akpm, mike.kravetz, jon.grimm, kvm, konrad.wilk,
	boris.ostrovsky

On 10/21/21 1:02 AM, Ankur Arora wrote:
> Expose the low-level uncached primitives (clear_page_movnt(),
> clear_page_clzero()) as alternatives via clear_page_uncached().
> Also fallback to clear_page(), if X86_FEATURE_MOVNT_SLOW is set
> and the CPU does not have X86_FEATURE_CLZERO.
> 
> Both the uncached primitives use stores which are weakly ordered
> with respect to other instructions accessing the memory hierarchy.
> To ensure that callers don't mix accesses to different types of
> address_spaces, annotate clear_user_page_uncached(), and
> clear_page_uncached() as taking __incoherent pointers as arguments.
> 
> Also add clear_page_uncached_make_coherent() which provides the
> necessary store fence to flush out the uncached regions.
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
> 
> Notes:
>      This patch adds the fallback definitions of clear_user_page_uncached()
>      etc in include/linux/mm.h which is likely not the right place for it.
> 
>      I'm guessing these should be moved to include/asm-generic/page.h
>      (or maybe a new include/asm-generic/page_uncached.h) and for
>      architectures that do have arch/$arch/include/asm/page.h (which
>      seems like all of them), also replicate there?
> 
>      Anyway, wanted to first check if that's the way to do it, before
>      doing that.
> 
>   arch/x86/include/asm/page.h    | 10 ++++++++++
>   arch/x86/include/asm/page_32.h |  9 +++++++++
>   arch/x86/include/asm/page_64.h | 32 ++++++++++++++++++++++++++++++++
>   include/linux/mm.h             | 14 ++++++++++++++
>   4 files changed, 65 insertions(+)
> 
> diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
> index 94dbd51df58f..163be03ac422 100644
> --- a/arch/x86/include/asm/page_32.h
> +++ b/arch/x86/include/asm/page_32.h
> @@ -39,6 +39,15 @@ static inline void clear_page(void *page)
>   	memset(page, 0, PAGE_SIZE);
>   }
>   
> +static inline void clear_page_uncached(__incoherent void *page)
> +{
> +	clear_page((__force void *) page);
> +}
> +
> +static inline void clear_page_uncached_make_coherent(void)
> +{
> +}
> +
>   static inline void copy_page(void *to, void *from)
>   {
>   	memcpy(to, from, PAGE_SIZE);
> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> index 3c53f8ef8818..d7946047c70f 100644
> --- a/arch/x86/include/asm/page_64.h
> +++ b/arch/x86/include/asm/page_64.h
> @@ -56,6 +56,38 @@ static inline void clear_page(void *page)
>   			   : "cc", "memory", "rax", "rcx");
>   }
>   
> +/*
> + * clear_page_uncached: only allowed on __incoherent memory regions.
> + */
> +static inline void clear_page_uncached(__incoherent void *page)
> +{
> +	alternative_call_2(clear_page_movnt,
> +			   clear_page, X86_FEATURE_MOVNT_SLOW,
> +			   clear_page_clzero, X86_FEATURE_CLZERO,
> +			   "=D" (page),
> +			   "0" (page)
> +			   : "cc", "memory", "rax", "rcx");
> +}
> +
> +/*
> + * clear_page_uncached_make_coherent: executes the necessary store
> + * fence after which __incoherent regions can be safely accessed.
> + */
> +static inline void clear_page_uncached_make_coherent(void)
> +{
> +	/*
> +	 * Keep the sfence for oldinstr and clzero separate to guard against
> +	 * the possibility that a cpu-model both has X86_FEATURE_MOVNT_SLOW
> +	 * and X86_FEATURE_CLZERO.
> +	 *
> +	 * The alternatives need to be in the same order as the ones
> +	 * in clear_page_uncached().
> +	 */
> +	alternative_2("sfence",
> +		      "", X86_FEATURE_MOVNT_SLOW,
> +		      "sfence", X86_FEATURE_CLZERO);
> +}
> +
>   void copy_page(void *to, void *from);
>   
>   #ifdef CONFIG_X86_5LEVEL
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 73a52aba448f..b88069d1116c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3192,6 +3192,20 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
>   
>   #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
>   
> +#ifndef clear_user_page_uncached

Hi Ankur Arora,

I've been looking for where clear_user_page_uncached is defined in this
patchset, but failed.

There should be something like the following in arch/x86, right?

static inline void clear_user_page_uncached(__incoherent void *page,
                                unsigned long vaddr, struct page *pg)
{
         clear_page_uncached(page);
}


Did I miss something?

> +/*
> + * clear_user_page_uncached: fallback to the standard clear_user_page().
> + */
> +static inline void clear_user_page_uncached(__incoherent void *page,
> +					unsigned long vaddr, struct page *pg)
> +{
> +	clear_user_page((__force void *)page, vaddr, pg);
> +}
> +
> +static inline void clear_page_uncached_make_coherent(void) { }
> +#endif
> +
> +
>   #ifdef CONFIG_DEBUG_PAGEALLOC
>   extern unsigned int _debug_guardpage_minorder;
>   DECLARE_STATIC_KEY_FALSE(_debug_guardpage_enabled);
> 

-- 
Thanks,
Yu


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 07/14] x86/clear_page: add clear_page_uncached()
  2021-12-08  8:58   ` Yu Xu
@ 2021-12-10  4:37     ` Ankur Arora
  2021-12-10  4:48       ` Yu Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Ankur Arora @ 2021-12-10  4:37 UTC (permalink / raw)
  To: Yu Xu
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, mingo, bp, luto, akpm,
	mike.kravetz, jon.grimm, kvm, konrad.wilk, boris.ostrovsky


Yu Xu <xuyu@linux.alibaba.com> writes:

> On 10/21/21 1:02 AM, Ankur Arora wrote:
>> Expose the low-level uncached primitives (clear_page_movnt(),
>> clear_page_clzero()) as alternatives via clear_page_uncached().
>> Also fallback to clear_page(), if X86_FEATURE_MOVNT_SLOW is set
>> and the CPU does not have X86_FEATURE_CLZERO.
>> Both the uncached primitives use stores which are weakly ordered
>> with respect to other instructions accessing the memory hierarchy.
>> To ensure that callers don't mix accesses to different types of
>> address_spaces, annotate clear_user_page_uncached(), and
>> clear_page_uncached() as taking __incoherent pointers as arguments.
>> Also add clear_page_uncached_make_coherent() which provides the
>> necessary store fence to flush out the uncached regions.
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>> Notes:
>>      This patch adds the fallback definitions of clear_user_page_uncached()
>>      etc in include/linux/mm.h which is likely not the right place for it.
>>      I'm guessing these should be moved to include/asm-generic/page.h
>>      (or maybe a new include/asm-generic/page_uncached.h) and for
>>      architectures that do have arch/$arch/include/asm/page.h (which
>>      seems like all of them), also replicate there?
>>      Anyway, wanted to first check if that's the way to do it, before
>>      doing that.
>>   arch/x86/include/asm/page.h    | 10 ++++++++++
>>   arch/x86/include/asm/page_32.h |  9 +++++++++
>>   arch/x86/include/asm/page_64.h | 32 ++++++++++++++++++++++++++++++++
>>   include/linux/mm.h             | 14 ++++++++++++++
>>   4 files changed, 65 insertions(+)
>> diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
>> index 94dbd51df58f..163be03ac422 100644
>> --- a/arch/x86/include/asm/page_32.h
>> +++ b/arch/x86/include/asm/page_32.h
>> @@ -39,6 +39,15 @@ static inline void clear_page(void *page)
>>   	memset(page, 0, PAGE_SIZE);
>>   }
>>   +static inline void clear_page_uncached(__incoherent void *page)
>> +{
>> +	clear_page((__force void *) page);
>> +}
>> +
>> +static inline void clear_page_uncached_make_coherent(void)
>> +{
>> +}
>> +
>>   static inline void copy_page(void *to, void *from)
>>   {
>>   	memcpy(to, from, PAGE_SIZE);
>> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
>> index 3c53f8ef8818..d7946047c70f 100644
>> --- a/arch/x86/include/asm/page_64.h
>> +++ b/arch/x86/include/asm/page_64.h
>> @@ -56,6 +56,38 @@ static inline void clear_page(void *page)
>>   			   : "cc", "memory", "rax", "rcx");
>>   }
>>   +/*
>> + * clear_page_uncached: only allowed on __incoherent memory regions.
>> + */
>> +static inline void clear_page_uncached(__incoherent void *page)
>> +{
>> +	alternative_call_2(clear_page_movnt,
>> +			   clear_page, X86_FEATURE_MOVNT_SLOW,
>> +			   clear_page_clzero, X86_FEATURE_CLZERO,
>> +			   "=D" (page),
>> +			   "0" (page)
>> +			   : "cc", "memory", "rax", "rcx");
>> +}
>> +
>> +/*
>> + * clear_page_uncached_make_coherent: executes the necessary store
>> + * fence after which __incoherent regions can be safely accessed.
>> + */
>> +static inline void clear_page_uncached_make_coherent(void)
>> +{
>> +	/*
>> +	 * Keep the sfence for oldinstr and clzero separate to guard against
>> +	 * the possibility that a cpu-model both has X86_FEATURE_MOVNT_SLOW
>> +	 * and X86_FEATURE_CLZERO.
>> +	 *
>> +	 * The alternatives need to be in the same order as the ones
>> +	 * in clear_page_uncached().
>> +	 */
>> +	alternative_2("sfence",
>> +		      "", X86_FEATURE_MOVNT_SLOW,
>> +		      "sfence", X86_FEATURE_CLZERO);
>> +}
>> +
>>   void copy_page(void *to, void *from);
>>     #ifdef CONFIG_X86_5LEVEL
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 73a52aba448f..b88069d1116c 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3192,6 +3192,20 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
>>     #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
>>   +#ifndef clear_user_page_uncached
>
> Hi Ankur Arora,
>
> I've been looking for where clear_user_page_uncached is defined in this
> patchset, but failed.
>
> There should be something like follows in arch/x86, right?
>
> static inline void clear_user_page_uncached(__incoherent void *page,
>                                unsigned long vaddr, struct page *pg)
> {
>         clear_page_uncached(page);
> }
>
>
> Did I miss something?
>
Hi Yu Xu,

Defined in include/linux/mm.h. Just below :).

>> +/*
>> + * clear_user_page_uncached: fallback to the standard clear_user_page().
>> + */
>> +static inline void clear_user_page_uncached(__incoherent void *page,
>> +					unsigned long vaddr, struct page *pg)
>> +{
>> +	clear_user_page((__force void *)page, vaddr, pg);
>> +}

That said, as this note in the patch mentions, this isn't really a great
place for this definition. As you also mention, the right place for this
would be somewhere in the arch/.../include and include/asm-generic hierarchy.

>>      This patch adds the fallback definitions of clear_user_page_uncached()
>>      etc in include/linux/mm.h which is likely not the right place for it.
>>      I'm guessing these should be moved to include/asm-generic/page.h
>>      (or maybe a new include/asm-generic/page_uncached.h) and for
>>      architectures that do have arch/$arch/include/asm/page.h (which
>>      seems like all of them), also replicate there?
>>      Anyway, wanted to first check if that's the way to do it, before
>>      doing that.

Recommendations on how to handle this, welcome.

Thanks

--
ankur


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 07/14] x86/clear_page: add clear_page_uncached()
  2021-12-10  4:37     ` Ankur Arora
@ 2021-12-10  4:48       ` Yu Xu
  2021-12-10 14:26         ` Ankur Arora
  0 siblings, 1 reply; 19+ messages in thread
From: Yu Xu @ 2021-12-10  4:48 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, mingo, bp, luto, akpm, mike.kravetz,
	jon.grimm, kvm, konrad.wilk, boris.ostrovsky

On 12/10/21 12:37 PM, Ankur Arora wrote:
> 
> Yu Xu <xuyu@linux.alibaba.com> writes:
> 
>> On 10/21/21 1:02 AM, Ankur Arora wrote:
>>> Expose the low-level uncached primitives (clear_page_movnt(),
>>> clear_page_clzero()) as alternatives via clear_page_uncached().
>>> Also fallback to clear_page(), if X86_FEATURE_MOVNT_SLOW is set
>>> and the CPU does not have X86_FEATURE_CLZERO.
>>> Both the uncached primitives use stores which are weakly ordered
>>> with respect to other instructions accessing the memory hierarchy.
>>> To ensure that callers don't mix accesses to different types of
>>> address_spaces, annotate clear_user_page_uncached(), and
>>> clear_page_uncached() as taking __incoherent pointers as arguments.
>>> Also add clear_page_uncached_make_coherent() which provides the
>>> necessary store fence to flush out the uncached regions.
>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>> ---
>>> Notes:
>>>       This patch adds the fallback definitions of clear_user_page_uncached()
>>>       etc in include/linux/mm.h which is likely not the right place for it.
>>>       I'm guessing these should be moved to include/asm-generic/page.h
>>>       (or maybe a new include/asm-generic/page_uncached.h) and for
>>>       architectures that do have arch/$arch/include/asm/page.h (which
>>>       seems like all of them), also replicate there?
>>>       Anyway, wanted to first check if that's the way to do it, before
>>>       doing that.
>>>    arch/x86/include/asm/page.h    | 10 ++++++++++
>>>    arch/x86/include/asm/page_32.h |  9 +++++++++
>>>    arch/x86/include/asm/page_64.h | 32 ++++++++++++++++++++++++++++++++
>>>    include/linux/mm.h             | 14 ++++++++++++++
>>>    4 files changed, 65 insertions(+)
>>> diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
>>> index 94dbd51df58f..163be03ac422 100644
>>> --- a/arch/x86/include/asm/page_32.h
>>> +++ b/arch/x86/include/asm/page_32.h
>>> @@ -39,6 +39,15 @@ static inline void clear_page(void *page)
>>>    	memset(page, 0, PAGE_SIZE);
>>>    }
>>>    +static inline void clear_page_uncached(__incoherent void *page)
>>> +{
>>> +	clear_page((__force void *) page);
>>> +}
>>> +
>>> +static inline void clear_page_uncached_make_coherent(void)
>>> +{
>>> +}
>>> +
>>>    static inline void copy_page(void *to, void *from)
>>>    {
>>>    	memcpy(to, from, PAGE_SIZE);
>>> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
>>> index 3c53f8ef8818..d7946047c70f 100644
>>> --- a/arch/x86/include/asm/page_64.h
>>> +++ b/arch/x86/include/asm/page_64.h
>>> @@ -56,6 +56,38 @@ static inline void clear_page(void *page)
>>>    			   : "cc", "memory", "rax", "rcx");
>>>    }
>>>    +/*
>>> + * clear_page_uncached: only allowed on __incoherent memory regions.
>>> + */
>>> +static inline void clear_page_uncached(__incoherent void *page)
>>> +{
>>> +	alternative_call_2(clear_page_movnt,
>>> +			   clear_page, X86_FEATURE_MOVNT_SLOW,
>>> +			   clear_page_clzero, X86_FEATURE_CLZERO,
>>> +			   "=D" (page),
>>> +			   "0" (page)
>>> +			   : "cc", "memory", "rax", "rcx");
>>> +}
>>> +
>>> +/*
>>> + * clear_page_uncached_make_coherent: executes the necessary store
>>> + * fence after which __incoherent regions can be safely accessed.
>>> + */
>>> +static inline void clear_page_uncached_make_coherent(void)
>>> +{
>>> +	/*
>>> +	 * Keep the sfence for oldinstr and clzero separate to guard against
>>> +	 * the possibility that a cpu-model both has X86_FEATURE_MOVNT_SLOW
>>> +	 * and X86_FEATURE_CLZERO.
>>> +	 *
>>> +	 * The alternatives need to be in the same order as the ones
>>> +	 * in clear_page_uncached().
>>> +	 */
>>> +	alternative_2("sfence",
>>> +		      "", X86_FEATURE_MOVNT_SLOW,
>>> +		      "sfence", X86_FEATURE_CLZERO);
>>> +}
>>> +
>>>    void copy_page(void *to, void *from);
>>>      #ifdef CONFIG_X86_5LEVEL
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 73a52aba448f..b88069d1116c 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -3192,6 +3192,20 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
>>>      #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
>>>    +#ifndef clear_user_page_uncached
>>
>> Hi Ankur Arora,
>>
>> I've been looking for where clear_user_page_uncached is defined in this
>> patchset, but failed.
>>
>> There should be something like follows in arch/x86, right?
>>
>> static inline void clear_user_page_uncached(__incoherent void *page,
>>                                 unsigned long vaddr, struct page *pg)
>> {
>>          clear_page_uncached(page);
>> }
>>
>>
>> Did I miss something?
>>
> Hi Yu Xu,
> 
> Defined in include/linux/mm.h. Just below :).

Thanks for your reply :)

This is the fallback version under #ifndef clear_user_page_uncached,
i.e., it falls back to the standard clear_user_page().

But where is the uncached version of clear_user_page()? That is what I
am looking for.

> 
>>> +/*
>>> + * clear_user_page_uncached: fallback to the standard clear_user_page().
>>> + */
>>> +static inline void clear_user_page_uncached(__incoherent void *page,
>>> +					unsigned long vaddr, struct page *pg)
>>> +{
>>> +	clear_user_page((__force void *)page, vaddr, pg);
>>> +}
> 
> That said, as this note in the patch mentions, this isn't really a great
> place for this definition. As you also mention, the right place for this
> would be somewhere in the arch/.../include and include/asm-generic hierarchy.
> 
>>>       This patch adds the fallback definitions of clear_user_page_uncached()
>>>       etc in include/linux/mm.h which is likely not the right place for it.
>>>       I'm guessing these should be moved to include/asm-generic/page.h
>>>       (or maybe a new include/asm-generic/page_uncached.h) and for
>>>       architectures that do have arch/$arch/include/asm/page.h (which
>>>       seems like all of them), also replicate there?
>>>       Anyway, wanted to first check if that's the way to do it, before
>>>       doing that.
> 
> Recommendations on how to handle this, welcome.
> 
> Thanks
> 
> --
> ankur
> 

-- 
Thanks,
Yu


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 07/14] x86/clear_page: add clear_page_uncached()
  2021-12-10  4:48       ` Yu Xu
@ 2021-12-10 14:26         ` Ankur Arora
  0 siblings, 0 replies; 19+ messages in thread
From: Ankur Arora @ 2021-12-10 14:26 UTC (permalink / raw)
  To: Yu Xu
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, mingo, bp, luto, akpm,
	mike.kravetz, jon.grimm, kvm, konrad.wilk, boris.ostrovsky


Yu Xu <xuyu@linux.alibaba.com> writes:

> On 12/10/21 12:37 PM, Ankur Arora wrote:
>> Yu Xu <xuyu@linux.alibaba.com> writes:
>>
>>> On 10/21/21 1:02 AM, Ankur Arora wrote:
>>>> Expose the low-level uncached primitives (clear_page_movnt(),
>>>> clear_page_clzero()) as alternatives via clear_page_uncached().
>>>> Also fallback to clear_page(), if X86_FEATURE_MOVNT_SLOW is set
>>>> and the CPU does not have X86_FEATURE_CLZERO.
>>>> Both the uncached primitives use stores which are weakly ordered
>>>> with respect to other instructions accessing the memory hierarchy.
>>>> To ensure that callers don't mix accesses to different types of
>>>> address_spaces, annotate clear_user_page_uncached(), and
>>>> clear_page_uncached() as taking __incoherent pointers as arguments.
>>>> Also add clear_page_uncached_make_coherent() which provides the
>>>> necessary store fence to flush out the uncached regions.
>>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>>> ---
>>>> Notes:
>>>>       This patch adds the fallback definitions of clear_user_page_uncached()
>>>>       etc in include/linux/mm.h which is likely not the right place for it.
>>>>       I'm guessing these should be moved to include/asm-generic/page.h
>>>>       (or maybe a new include/asm-generic/page_uncached.h) and for
>>>>       architectures that do have arch/$arch/include/asm/page.h (which
>>>>       seems like all of them), also replicate there?
>>>>       Anyway, wanted to first check if that's the way to do it, before
>>>>       doing that.
>>>>    arch/x86/include/asm/page.h    | 10 ++++++++++
>>>>    arch/x86/include/asm/page_32.h |  9 +++++++++
>>>>    arch/x86/include/asm/page_64.h | 32 ++++++++++++++++++++++++++++++++
>>>>    include/linux/mm.h             | 14 ++++++++++++++
>>>>    4 files changed, 65 insertions(+)
>>>> diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
>>>> index 94dbd51df58f..163be03ac422 100644
>>>> --- a/arch/x86/include/asm/page_32.h
>>>> +++ b/arch/x86/include/asm/page_32.h
>>>> @@ -39,6 +39,15 @@ static inline void clear_page(void *page)
>>>>    	memset(page, 0, PAGE_SIZE);
>>>>    }
>>>>    +static inline void clear_page_uncached(__incoherent void *page)
>>>> +{
>>>> +	clear_page((__force void *) page);
>>>> +}
>>>> +
>>>> +static inline void clear_page_uncached_make_coherent(void)
>>>> +{
>>>> +}
>>>> +
>>>>    static inline void copy_page(void *to, void *from)
>>>>    {
>>>>    	memcpy(to, from, PAGE_SIZE);
>>>> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
>>>> index 3c53f8ef8818..d7946047c70f 100644
>>>> --- a/arch/x86/include/asm/page_64.h
>>>> +++ b/arch/x86/include/asm/page_64.h
>>>> @@ -56,6 +56,38 @@ static inline void clear_page(void *page)
>>>>    			   : "cc", "memory", "rax", "rcx");
>>>>    }
>>>>    +/*
>>>> + * clear_page_uncached: only allowed on __incoherent memory regions.
>>>> + */
>>>> +static inline void clear_page_uncached(__incoherent void *page)
>>>> +{
>>>> +	alternative_call_2(clear_page_movnt,
>>>> +			   clear_page, X86_FEATURE_MOVNT_SLOW,
>>>> +			   clear_page_clzero, X86_FEATURE_CLZERO,
>>>> +			   "=D" (page),
>>>> +			   "0" (page)
>>>> +			   : "cc", "memory", "rax", "rcx");
>>>> +}
>>>> +
>>>> +/*
>>>> + * clear_page_uncached_make_coherent: executes the necessary store
>>>> + * fence after which __incoherent regions can be safely accessed.
>>>> + */
>>>> +static inline void clear_page_uncached_make_coherent(void)
>>>> +{
>>>> +	/*
>>>> +	 * Keep the sfence for oldinstr and clzero separate to guard against
>>>> +	 * the possibility that a cpu-model both has X86_FEATURE_MOVNT_SLOW
>>>> +	 * and X86_FEATURE_CLZERO.
>>>> +	 *
>>>> +	 * The alternatives need to be in the same order as the ones
>>>> +	 * in clear_page_uncached().
>>>> +	 */
>>>> +	alternative_2("sfence",
>>>> +		      "", X86_FEATURE_MOVNT_SLOW,
>>>> +		      "sfence", X86_FEATURE_CLZERO);
>>>> +}
>>>> +
>>>>    void copy_page(void *to, void *from);
>>>>      #ifdef CONFIG_X86_5LEVEL
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index 73a52aba448f..b88069d1116c 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -3192,6 +3192,20 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
>>>>      #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
>>>>    +#ifndef clear_user_page_uncached
>>>
>>> Hi Ankur Arora,
>>>
>>> I've been looking for where clear_user_page_uncached is defined in this
>>> patchset, but failed.
>>>
>>> There should be something like follows in arch/x86, right?
>>>
>>> static inline void clear_user_page_uncached(__incoherent void *page,
>>>                                 unsigned long vaddr, struct page *pg)
>>> {
>>>          clear_page_uncached(page);
>>> }
>>>
>>>
>>> Did I miss something?
>>>
>> Hi Yu Xu,
>> Defined in include/linux/mm.h. Just below :).
>
> Thanks for your reply :)
>
> This is the version when #ifndef clear_user_page_uncached, i.e., fall
> back to standard clear_user_page.
>
> But where is the uncached version of clear_user_page? I am looking for
> this.

Sorry, my bad. There is a hunk missing. Not sure how, but this hunk
(from patch 7) got dropped in the version that I sent out:

    diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
    index 4d5810c8fab7..89229f2db34a 100644
    --- a/arch/x86/include/asm/page.h
    +++ b/arch/x86/include/asm/page.h
    @@ -28,6 +28,16 @@ static inline void clear_user_page(void *page, unsigned long vaddr,
            clear_page(page);
    }

    +#define clear_user_page_uncached clear_user_page_uncached
    +/*
    + * clear_user_page_uncached: only allowed on __incoherent memory regions.
    + */
    +static inline void clear_user_page_uncached(__incoherent void *page,
    +                                       unsigned long vaddr, struct page *pg)
    +{
    +       clear_page_uncached(page);
    +}
    +
    static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
                                    struct page *topage)
    {

It is identical to the version that you surmised was missing. Thanks for
the close reading.

Ankur

>>
>>>> +/*
>>>> + * clear_user_page_uncached: fallback to the standard clear_user_page().
>>>> + */
>>>> +static inline void clear_user_page_uncached(__incoherent void *page,
>>>> +					unsigned long vaddr, struct page *pg)
>>>> +{
>>>> +	clear_user_page((__force void *)page, vaddr, pg);
>>>> +}
>> That said, as this note in the patch mentions, this isn't really a great
>> place for this definition. As you also mention, the right place for this
>> would be somewhere in the arch/.../include and include/asm-generic hierarchy.
>>
>>>>       This patch adds the fallback definitions of clear_user_page_uncached()
>>>>       etc in include/linux/mm.h which is likely not the right place for it.
>>>>       I'm guessing these should be moved to include/asm-generic/page.h
>>>>       (or maybe a new include/asm-generic/page_uncached.h) and for
>>>>       architectures that do have arch/$arch/include/asm/page.h (which
>>>>       seems like all of them), also replicate there?
>>>>       Anyway, wanted to first check if that's the way to do it, before
>>>>       doing that.
>> Recommendations on how to handle this, welcome.
>> Thanks
>> --
>> ankur
>>


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-12-10 14:27 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-20 17:02 [PATCH v2 00/14] Use uncached stores while clearing huge pages Ankur Arora
2021-10-20 17:02 ` [PATCH v2 01/14] x86/asm: add memset_movnti() Ankur Arora
2021-10-20 17:02 ` [PATCH v2 02/14] perf bench: " Ankur Arora
2021-10-20 17:02 ` [PATCH v2 03/14] x86/asm: add uncached page clearing Ankur Arora
2021-10-20 17:02 ` [PATCH v2 04/14] x86/asm: add clzero based " Ankur Arora
2021-10-20 17:02 ` [PATCH v2 05/14] x86/cpuid: add X86_FEATURE_MOVNT_SLOW Ankur Arora
2021-10-20 17:02 ` [PATCH v2 06/14] sparse: add address_space __incoherent Ankur Arora
2021-10-20 17:02 ` [PATCH v2 07/14] x86/clear_page: add clear_page_uncached() Ankur Arora
2021-12-08  8:58   ` Yu Xu
2021-12-10  4:37     ` Ankur Arora
2021-12-10  4:48       ` Yu Xu
2021-12-10 14:26         ` Ankur Arora
2021-10-20 17:02 ` [PATCH v2 08/14] mm/clear_page: add clear_page_uncached_threshold() Ankur Arora
2021-10-20 17:03 ` [PATCH v2 09/14] x86/clear_page: add arch_clear_page_uncached_threshold() Ankur Arora
2021-10-20 17:03 ` [PATCH v2 10/14] clear_huge_page: use uncached path Ankur Arora
2021-10-20 17:03 ` [PATCH v2 11/14] gup: add FOLL_HINT_BULK, FAULT_FLAG_UNCACHED Ankur Arora
2021-10-20 17:03 ` [PATCH v2 12/14] gup: use uncached path when clearing large regions Ankur Arora
2021-10-20 18:52 ` [PATCH v2 13/14] vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages() Ankur Arora
2021-10-20 18:52 ` [PATCH v2 14/14] x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake Ankur Arora

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).