* [PATCH v3 00/21] huge page clearing optimizations
From: Ankur Arora @ 2022-06-06 20:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk, joao.martins,
	ankur.a.arora

This series introduces two optimizations in the huge page clearing path:

 1. extend the clear_page() machinery to also handle extents larger
    than a single page.
 2. add support for non-cached page clearing for huge and gigantic pages.

The first optimization is useful for hugepage fault handling, the
second for prefaulting, or for gigantic pages.

The immediate motivation is to speedup creation of large VMs backed
by huge pages.

Performance
==

VM creation (192GB VM with prealloc'd 2MB backing pages) sees significant
run-time improvements:

 Icelakex:
                          Time (s)        Delta (%)
 clear_page_erms()     22.37 ( +- 0.14s )            #  9.21 bytes/ns
 clear_pages_erms()    16.49 ( +- 0.06s )  -26.28%   # 12.50 bytes/ns
 clear_pages_movnt()    9.42 ( +- 0.20s )  -42.87%   # 21.88 bytes/ns

 Milan:
                          Time (s)        Delta (%)
 clear_page_erms()     16.49 ( +- 0.06s )            # 12.50 bytes/ns
 clear_pages_erms()    11.82 ( +- 0.06s )  -28.32%   # 17.44 bytes/ns
 clear_pages_clzero()   4.91 ( +- 0.27s )  -58.49%   # 41.98 bytes/ns

As a side-effect, non-polluting clearing, by eliding the zero-filling of
caches, also shows better LLC miss rates. For a kbuild plus background
page-clearing job, this shows up as a small improvement (~2%) in
runtime.

Discussion
==


With the motivation out of the way, the following note describes
v3's handling of past review comments (and other sticking points for
series of this nature -- especially the non-cached part -- over the
years):

1. Non-cached clearing is unnecessary on x86: x86 already uses 'REP;STOS'
   which, unlike a MOVNT loop, has semantically richer information available
   that can be used by current (and/or future) processors to make the
   same cache-elision optimization.

   All true, except that a) current-gen uarchs often don't, and b) even when
   they do, the kernel, by clearing at 4K granularity, doesn't expose
   the extent information in a way that processors could easily
   optimize for.

   For a), I tested a bunch of REP-STOSB/MOVNTI/CLZERO loops with different
   chunk sizes (in user-space, over a VA extent of 4GB, page-size=4K).
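
   The loops were roughly of the following shape (an illustrative
   user-space sketch, not the exact test code; chunk is assumed to divide
   the extent evenly, and the CLZERO variant, which zeroes a cacheline at
   a time with the address in %rax, is analogous):

	#include <stdint.h>

	/* Clear len bytes with a single REP STOSB. */
	static void clear_stosb(void *dst, uint64_t len)
	{
		asm volatile("rep stosb"
			     : "+D" (dst), "+c" (len)
			     : "a" (0)
			     : "memory");
	}

	/* Clear len bytes with a MOVNTI loop, followed by an SFENCE. */
	static void clear_movnti(void *dst, uint64_t len)
	{
		uint64_t *p = dst;

		for (uint64_t i = 0; i < len / sizeof(*p); i++)
			asm volatile("movnti %1, %0"
				     : "=m" (p[i])
				     : "r" (0UL));
		asm volatile("sfence" ::: "memory");
	}

	/* Walk the 4GB extent in chunk-size units; time each pass for MBps. */
	static void clear_extent(char *buf, uint64_t total, uint64_t chunk,
				 void (*clear)(void *, uint64_t))
	{
		for (uint64_t off = 0; off < total; off += chunk)
			clear(buf + off, chunk);
	}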

   Intel Icelake (LLC=48MB, no_turbo=1):

     chunk-size    REP-STOSB       MOVNTI
                        MBps         MBps

             4K         9444        24510
            64K        11931        24508
             2M        12355        24524
             8M        12369        24525
            32M        12368        24523
           128M        12374        24522
            1GB        12372        24561

   Which is pretty flat across chunk-sizes.


   AMD Milan (LLC=32MB, boost=0):

     chunk-size    REP-STOSB       MOVNTI        CLZERO 
                        MBps         MBps          MBps 
                                                        
             4K        13034        17815         45579 
            64K        15196        18549         46038 
             2M        14821        18581         39064 
             8M        13964        18557         46045 
            32M        22525        18560         45969 
           128M        29311        18581         38924 
            1GB        35807        18574         45981 

    The scaling on Milan starts right around chunk=LLC-size. It
    asymptotically does seem to get close to CLZERO performance, but the
    scaling is linear and not a step function.

    For b), as mentioned above, the kernel, by zeroing at 4K granularity,
    doesn't send the right signal to the uarch (though the largest
    extent we can use for huge pages is 2MB (and lower for preemptible
    kernels), which from these numbers is not large enough).
    Still, using clear_page_extent() with larger extents would send the
    uarch a hint that it could capitalize on in the future.

    This is addressed in patches 1-6:
	"mm, huge-page: reorder arguments to process_huge_page()"
	"mm, huge-page: refactor process_subpage()"
	"clear_page: add generic clear_user_pages()"
	"mm, clear_huge_page: support clear_user_pages()"
	"mm/huge_page: generalize process_huge_page()"
	"x86/clear_page: add clear_pages()"

     with patch 5, "mm/huge_page: generalize process_huge_page()"
     containing the core logic.

2. Non-caching stores (via MOVNTI, CLZERO on x86) are weakly ordered with
   respect to the cache hierarchy and unless they are combined with an
   appropriate fence, are unsafe to use.

   This is true and is a problem. Patch 12, "sparse: add address_space
   __incoherent" adds a new sparse address_space which is used in
   the architectural interfaces to make sure that any user is cognizant
   of its use:

	void clear_user_pages_incoherent(__incoherent void *page, ...)
	void clear_pages_incoherent(__incoherent void *page, ...)

   One other place it is needed (and is missing) is in highmem:
       void clear_user_highpages_incoherent(struct page *page, ...).

   Given the natural highmem interface, I couldn't think of a good
   way to add the annotation here.
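
   As a rough sketch, assuming the annotation is defined the same way as
   the existing __user/__iomem address spaces (and assuming an npages
   argument; the "..." above is left unspecified in this note):

	/* include/linux/compiler_types.h (sketch, mirroring __user/__iomem) */
	#ifdef __CHECKER__
	# define __incoherent	__attribute__((noderef, address_space(__incoherent)))
	#else
	# define __incoherent
	#endif

	/*
	 * Callers must cast explicitly, which makes the weaker ordering
	 * visible at the call site; the non-temporal stores still need to
	 * be fenced before the cleared page is exposed.
	 */
	clear_pages_incoherent((__incoherent void *)page_address(page), npages);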

3. Non-caching stores are generally slower than cached for extents
   smaller than LLC-size, and faster for larger ones.

   This means that if you choose the non-caching path for too small an
   extent, you would see performance regressions. There is of course
   benefit in not filling the cache with zeroes, but that is a somewhat
   nebulous advantage and AFAICT there are no representative tests that
   probe for it.
   (Note that this slowness isn't a consequence of the extra fence --
   that is expensive but stops being noticeable for chunk-size >=
   ~32K-128K depending on uarch.)

   This is handled by adding an arch-specific threshold (with a
   default CLEAR_PAGE_NON_CACHING_THRESHOLD=8MB) in patches 15 and 16,
   "mm/clear_page: add clear_page_non_caching_threshold()",
   "x86/clear_page: add arch_clear_page_non_caching_threshold()".

   Further, a single call to clear_huge_page() or get_/pin_user_pages()
   might only see a small portion of an extent being cleared in each
   iteration. To make sure we choose non-caching stores when working with
   large extents, patch 18, "gup: add FOLL_HINT_BULK,
   FAULT_FLAG_NON_CACHING", adds a new flag that gup users can use for
   this purpose (a sketch of the combined decision follows below). This
   is used in patch 20, "vfio_iommu_type1: specify FOLL_HINT_BULK to
   pin_user_pages()" when pinning process memory while attaching
   passthrough PCIe devices.
  
   The get_user_pages() logic to handle these flags is in patch 19,
   "gup: hint non-caching if clearing large regions".

4. The subpoint of 3) above (non-caching stores are faster for extents
   larger than LLC-size) is generally true, with a side of Brownian
   motion thrown in. For instance, MOVNTI (for > LLC-size) performs well
   on Broadwell and Ice Lake, but on Skylake/Cascade Lake -- sandwiched
   in between the two -- it does not.

   To deal with this, we use Ingo's suggestion of "trust but verify"
   (https://lore.kernel.org/lkml/20201014153127.GB1424414@gmail.com/):
   enable MOVNT by default and only disable it on slow
   uarchs.
   If the non-caching path ends up being a part of the kernel, uarchs
   that regress would hopefully show up early enough in chip testing.

   Patch 11, "x86/cpuid: add X86_FEATURE_MOVNT_SLOW" adds this logic
   and patch 21, "x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for
   Skylake" disables the non-caching path for Skylake.

Performance numbers are in patches 6 and 19, "x86/clear_page: add
clear_pages()", "gup: hint non-caching if clearing large regions".

Also at:
  github.com/terminus/linux clear-page-non-caching.upstream-v3

Comments appreciated!

Changelog
==

v2: https://lore.kernel.org/lkml/20211020170305.376118-1-ankur.a.arora@oracle.com/
  - Add multi-page clearing: this addresses comments from Ingo
    (from v1), and from an offlist discussion with Linus.
  - Rename clear_pages_uncached() to make the lack of safety
    more obvious: this addresses comments from Andy Lutomirski.
  - Simplify the clear_huge_page() changes.
  - Usual cleanups etc.
  - Rebased to v5.18.


v1: https://lore.kernel.org/lkml/20201014083300.19077-1-ankur.a.arora@oracle.com/
  - Make the unsafe nature of clear_page_uncached() more obvious.
  - Invert X86_FEATURE_NT_GOOD to X86_FEATURE_MOVNT_SLOW, so we don't
    have to explicitly enable it for every new model: suggestion from
    Ingo Molnar.
  - Add GUP path (and appropriate threshold) to allow the uncached path
    to be used for huge pages.
  - Make the code more generic so it's tied to fewer x86 specific assumptions.

Thanks
Ankur

Ankur Arora (21):
  mm, huge-page: reorder arguments to process_huge_page()
  mm, huge-page: refactor process_subpage()
  clear_page: add generic clear_user_pages()
  mm, clear_huge_page: support clear_user_pages()
  mm/huge_page: generalize process_huge_page()
  x86/clear_page: add clear_pages()
  x86/asm: add memset_movnti()
  perf bench: add memset_movnti()
  x86/asm: add clear_pages_movnt()
  x86/asm: add clear_pages_clzero()
  x86/cpuid: add X86_FEATURE_MOVNT_SLOW
  sparse: add address_space __incoherent
  clear_page: add generic clear_user_pages_incoherent()
  x86/clear_page: add clear_pages_incoherent()
  mm/clear_page: add clear_page_non_caching_threshold()
  x86/clear_page: add arch_clear_page_non_caching_threshold()
  clear_huge_page: use non-cached clearing
  gup: add FOLL_HINT_BULK, FAULT_FLAG_NON_CACHING
  gup: hint non-caching if clearing large regions
  vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages()
  x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake

 arch/alpha/include/asm/page.h                |   1 +
 arch/arc/include/asm/page.h                  |   1 +
 arch/arm/include/asm/page.h                  |   1 +
 arch/arm64/include/asm/page.h                |   1 +
 arch/csky/include/asm/page.h                 |   1 +
 arch/hexagon/include/asm/page.h              |   1 +
 arch/ia64/include/asm/page.h                 |   1 +
 arch/m68k/include/asm/page.h                 |   1 +
 arch/microblaze/include/asm/page.h           |   1 +
 arch/mips/include/asm/page.h                 |   1 +
 arch/nios2/include/asm/page.h                |   2 +
 arch/openrisc/include/asm/page.h             |   1 +
 arch/parisc/include/asm/page.h               |   1 +
 arch/powerpc/include/asm/page.h              |   1 +
 arch/riscv/include/asm/page.h                |   1 +
 arch/s390/include/asm/page.h                 |   1 +
 arch/sh/include/asm/page.h                   |   1 +
 arch/sparc/include/asm/page_32.h             |   1 +
 arch/sparc/include/asm/page_64.h             |   1 +
 arch/um/include/asm/page.h                   |   1 +
 arch/x86/include/asm/cacheinfo.h             |   1 +
 arch/x86/include/asm/cpufeatures.h           |   1 +
 arch/x86/include/asm/page.h                  |  26 ++
 arch/x86/include/asm/page_64.h               |  64 ++++-
 arch/x86/kernel/cpu/amd.c                    |   2 +
 arch/x86/kernel/cpu/bugs.c                   |  30 +++
 arch/x86/kernel/cpu/cacheinfo.c              |  13 +
 arch/x86/kernel/cpu/cpu.h                    |   2 +
 arch/x86/kernel/cpu/intel.c                  |   2 +
 arch/x86/kernel/setup.c                      |   6 +
 arch/x86/lib/clear_page_64.S                 |  78 ++++--
 arch/x86/lib/memset_64.S                     |  68 ++---
 arch/xtensa/include/asm/page.h               |   1 +
 drivers/vfio/vfio_iommu_type1.c              |   3 +
 fs/hugetlbfs/inode.c                         |   7 +-
 include/asm-generic/clear_page.h             |  69 +++++
 include/asm-generic/page.h                   |   1 +
 include/linux/compiler_types.h               |   2 +
 include/linux/highmem.h                      |  46 ++++
 include/linux/mm.h                           |  10 +-
 include/linux/mm_types.h                     |   2 +
 mm/gup.c                                     |  18 ++
 mm/huge_memory.c                             |   3 +-
 mm/hugetlb.c                                 |  10 +-
 mm/memory.c                                  | 264 +++++++++++++++----
 tools/arch/x86/lib/memset_64.S               |  68 ++---
 tools/perf/bench/mem-memset-x86-64-asm-def.h |   6 +-
 47 files changed, 680 insertions(+), 144 deletions(-)
 create mode 100644 include/asm-generic/clear_page.h

-- 
2.31.1


* [PATCH v3 01/21] mm, huge-page: reorder arguments to process_huge_page()
From: Ankur Arora @ 2022-06-06 20:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk, joao.martins,
	ankur.a.arora

Mechanical change to process_huge_page() to pass subpage clear/copy
args via struct subpage_arg * instead of passing an opaque pointer
around.

No change in generated code.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 mm/memory.c | 47 ++++++++++++++++++++++++++---------------------
 1 file changed, 26 insertions(+), 21 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 21dadf03f089..c33aacdaaf11 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5562,15 +5562,22 @@ EXPORT_SYMBOL(__might_fault);
 #endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+
+struct subpage_arg {
+	struct page *dst;
+	struct page *src;
+	struct vm_area_struct *vma;
+};
+
 /*
  * Process all subpages of the specified huge page with the specified
  * operation.  The target subpage will be processed last to keep its
  * cache lines hot.
  */
-static inline void process_huge_page(
+static inline void process_huge_page(struct subpage_arg *sa,
 	unsigned long addr_hint, unsigned int pages_per_huge_page,
-	void (*process_subpage)(unsigned long addr, int idx, void *arg),
-	void *arg)
+	void (*process_subpage)(struct subpage_arg *sa,
+				unsigned long addr, int idx))
 {
 	int i, n, base, l;
 	unsigned long addr = addr_hint &
@@ -5586,7 +5593,7 @@ static inline void process_huge_page(
 		/* Process subpages at the end of huge page */
 		for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
 			cond_resched();
-			process_subpage(addr + i * PAGE_SIZE, i, arg);
+			process_subpage(sa, addr + i * PAGE_SIZE, i);
 		}
 	} else {
 		/* If target subpage in second half of huge page */
@@ -5595,7 +5602,7 @@ static inline void process_huge_page(
 		/* Process subpages at the begin of huge page */
 		for (i = 0; i < base; i++) {
 			cond_resched();
-			process_subpage(addr + i * PAGE_SIZE, i, arg);
+			process_subpage(sa, addr + i * PAGE_SIZE, i);
 		}
 	}
 	/*
@@ -5607,9 +5614,9 @@ static inline void process_huge_page(
 		int right_idx = base + 2 * l - 1 - i;
 
 		cond_resched();
-		process_subpage(addr + left_idx * PAGE_SIZE, left_idx, arg);
+		process_subpage(sa, addr + left_idx * PAGE_SIZE, left_idx);
 		cond_resched();
-		process_subpage(addr + right_idx * PAGE_SIZE, right_idx, arg);
+		process_subpage(sa, addr + right_idx * PAGE_SIZE, right_idx);
 	}
 }
 
@@ -5628,9 +5635,9 @@ static void clear_gigantic_page(struct page *page,
 	}
 }
 
-static void clear_subpage(unsigned long addr, int idx, void *arg)
+static void clear_subpage(struct subpage_arg *sa, unsigned long addr, int idx)
 {
-	struct page *page = arg;
+	struct page *page = sa->dst;
 
 	clear_user_highpage(page + idx, addr);
 }
@@ -5640,13 +5647,18 @@ void clear_huge_page(struct page *page,
 {
 	unsigned long addr = addr_hint &
 		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
+	struct subpage_arg sa = {
+		.dst = page,
+		.src = NULL,
+		.vma = NULL,
+	};
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
 		clear_gigantic_page(page, addr, pages_per_huge_page);
 		return;
 	}
 
-	process_huge_page(addr_hint, pages_per_huge_page, clear_subpage, page);
+	process_huge_page(&sa, addr_hint, pages_per_huge_page, clear_subpage);
 }
 
 static void copy_user_gigantic_page(struct page *dst, struct page *src,
@@ -5668,16 +5680,9 @@ static void copy_user_gigantic_page(struct page *dst, struct page *src,
 	}
 }
 
-struct copy_subpage_arg {
-	struct page *dst;
-	struct page *src;
-	struct vm_area_struct *vma;
-};
-
-static void copy_subpage(unsigned long addr, int idx, void *arg)
+static void copy_subpage(struct subpage_arg *copy_arg,
+			 unsigned long addr, int idx)
 {
-	struct copy_subpage_arg *copy_arg = arg;
-
 	copy_user_highpage(copy_arg->dst + idx, copy_arg->src + idx,
 			   addr, copy_arg->vma);
 }
@@ -5688,7 +5693,7 @@ void copy_user_huge_page(struct page *dst, struct page *src,
 {
 	unsigned long addr = addr_hint &
 		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
-	struct copy_subpage_arg arg = {
+	struct subpage_arg sa = {
 		.dst = dst,
 		.src = src,
 		.vma = vma,
@@ -5700,7 +5705,7 @@ void copy_user_huge_page(struct page *dst, struct page *src,
 		return;
 	}
 
-	process_huge_page(addr_hint, pages_per_huge_page, copy_subpage, &arg);
+	process_huge_page(&sa, addr_hint, pages_per_huge_page, copy_subpage);
 }
 
 long copy_huge_page_from_user(struct page *dst_page,
-- 
2.31.1


* [PATCH v3 02/21] mm, huge-page: refactor process_subpage()
From: Ankur Arora @ 2022-06-06 20:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk, joao.martins,
	ankur.a.arora

process_subpage() takes an absolute address and an index, both
referencing the same subpage.

Change this so process_huge_page() deals only with the huge-page
region, offloading the indexing to process_subpage().

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 mm/memory.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c33aacdaaf11..2c86d79c9d98 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5577,7 +5577,7 @@ struct subpage_arg {
 static inline void process_huge_page(struct subpage_arg *sa,
 	unsigned long addr_hint, unsigned int pages_per_huge_page,
 	void (*process_subpage)(struct subpage_arg *sa,
-				unsigned long addr, int idx))
+				unsigned long base_addr, int idx))
 {
 	int i, n, base, l;
 	unsigned long addr = addr_hint &
@@ -5593,7 +5593,7 @@ static inline void process_huge_page(struct subpage_arg *sa,
 		/* Process subpages at the end of huge page */
 		for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
 			cond_resched();
-			process_subpage(sa, addr + i * PAGE_SIZE, i);
+			process_subpage(sa, addr, i);
 		}
 	} else {
 		/* If target subpage in second half of huge page */
@@ -5602,7 +5602,7 @@ static inline void process_huge_page(struct subpage_arg *sa,
 		/* Process subpages at the begin of huge page */
 		for (i = 0; i < base; i++) {
 			cond_resched();
-			process_subpage(sa, addr + i * PAGE_SIZE, i);
+			process_subpage(sa, addr, i);
 		}
 	}
 	/*
@@ -5614,9 +5614,9 @@ static inline void process_huge_page(struct subpage_arg *sa,
 		int right_idx = base + 2 * l - 1 - i;
 
 		cond_resched();
-		process_subpage(sa, addr + left_idx * PAGE_SIZE, left_idx);
+		process_subpage(sa, addr, left_idx);
 		cond_resched();
-		process_subpage(sa, addr + right_idx * PAGE_SIZE, right_idx);
+		process_subpage(sa, addr, right_idx);
 	}
 }
 
@@ -5635,11 +5635,12 @@ static void clear_gigantic_page(struct page *page,
 	}
 }
 
-static void clear_subpage(struct subpage_arg *sa, unsigned long addr, int idx)
+static void clear_subpage(struct subpage_arg *sa,
+			  unsigned long base_addr, int idx)
 {
 	struct page *page = sa->dst;
 
-	clear_user_highpage(page + idx, addr);
+	clear_user_highpage(page + idx, base_addr + idx * PAGE_SIZE);
 }
 
 void clear_huge_page(struct page *page,
@@ -5681,10 +5682,10 @@ static void copy_user_gigantic_page(struct page *dst, struct page *src,
 }
 
 static void copy_subpage(struct subpage_arg *copy_arg,
-			 unsigned long addr, int idx)
+			 unsigned long base_addr, int idx)
 {
 	copy_user_highpage(copy_arg->dst + idx, copy_arg->src + idx,
-			   addr, copy_arg->vma);
+			   base_addr + idx * PAGE_SIZE, copy_arg->vma);
 }
 
 void copy_user_huge_page(struct page *dst, struct page *src,
-- 
2.31.1


* [PATCH v3 03/21] clear_page: add generic clear_user_pages()
From: Ankur Arora @ 2022-06-06 20:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk, joao.martins,
	ankur.a.arora

Add a generic clear_user_pages() which operates on contiguous
PAGE_SIZE'd chunks via an arch-defined primitive.

The generic version defines:
  #define ARCH_MAX_CLEAR_PAGES_ORDER	0
so clear_user_pages() would fallback to clear_user_page().

An arch can expose this by defining __HAVE_ARCH_CLEAR_USER_PAGES.

Also add clear_user_highpages() which either funnels through
to clear_user_pages() or does the clearing page-at-a-time.
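
For reference, an arch opt-in is expected to look roughly like this
(illustrative only; the x86 version shows up later in the series with a
larger ARCH_MAX_CLEAR_PAGES_ORDER):

	/* arch/<arch>/include/asm/page.h (sketch) */
	#define __HAVE_ARCH_CLEAR_USER_PAGES
	#define ARCH_MAX_CLEAR_PAGES_ORDER	3	/* up to 8 contiguous pages */

	void clear_user_pages(void *page, unsigned long vaddr,
			      struct page *start_page, unsigned int npages);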

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---

Notes:
    1. I'm not sure that a new header asm-generic/clear_page.h is ideal.
    
    The logical place for this is asm-generic/page.h itself. However, only
    H8300 includes that and so this (and the next few patches) would need
    a stub everywhere else.
    (Just rechecked and looks like arch/h8300 is no more.)
    
    If adding a new header looks reasonable to the community, I'm happy
    to move clear_user_page(), copy_user_page() stubs out to this file.
    (Note that patches further on add non-caching clear_user_pages()
     as well.)
    
    Or, if asm-generic/page.h is the best place, then add stubs
    everywhere else.
    
    2. Shoehorning a multi-page operation into CONFIG_HIGHMEM seems
    ugly but seemed like the best choice from a bad set of options.
    Is there a better way of doing this?

 arch/alpha/include/asm/page.h      |  1 +
 arch/arc/include/asm/page.h        |  1 +
 arch/arm/include/asm/page.h        |  1 +
 arch/arm64/include/asm/page.h      |  1 +
 arch/csky/include/asm/page.h       |  1 +
 arch/hexagon/include/asm/page.h    |  1 +
 arch/ia64/include/asm/page.h       |  1 +
 arch/m68k/include/asm/page.h       |  1 +
 arch/microblaze/include/asm/page.h |  1 +
 arch/mips/include/asm/page.h       |  1 +
 arch/nios2/include/asm/page.h      |  2 ++
 arch/openrisc/include/asm/page.h   |  1 +
 arch/parisc/include/asm/page.h     |  1 +
 arch/powerpc/include/asm/page.h    |  1 +
 arch/riscv/include/asm/page.h      |  1 +
 arch/s390/include/asm/page.h       |  1 +
 arch/sh/include/asm/page.h         |  1 +
 arch/sparc/include/asm/page_32.h   |  1 +
 arch/sparc/include/asm/page_64.h   |  1 +
 arch/um/include/asm/page.h         |  1 +
 arch/x86/include/asm/page.h        |  1 +
 arch/xtensa/include/asm/page.h     |  1 +
 include/asm-generic/clear_page.h   | 44 ++++++++++++++++++++++++++++++
 include/asm-generic/page.h         |  1 +
 include/linux/highmem.h            | 23 ++++++++++++++++
 25 files changed, 91 insertions(+)
 create mode 100644 include/asm-generic/clear_page.h

diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
index 8f3f5eecba28..2d3b099e165c 100644
--- a/arch/alpha/include/asm/page.h
+++ b/arch/alpha/include/asm/page.h
@@ -93,5 +93,6 @@ typedef struct page *pgtable_t;
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* _ALPHA_PAGE_H */
diff --git a/arch/arc/include/asm/page.h b/arch/arc/include/asm/page.h
index 9a62e1d87967..abdbef6897bf 100644
--- a/arch/arc/include/asm/page.h
+++ b/arch/arc/include/asm/page.h
@@ -133,6 +133,7 @@ extern int pfn_valid(unsigned long pfn);
 
 #include <asm-generic/memory_model.h>   /* page_to_pfn, pfn_to_page */
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* !__ASSEMBLY__ */
 
diff --git a/arch/arm/include/asm/page.h b/arch/arm/include/asm/page.h
index 5fcc8a600e36..ba244baca1fa 100644
--- a/arch/arm/include/asm/page.h
+++ b/arch/arm/include/asm/page.h
@@ -167,5 +167,6 @@ extern int pfn_valid(unsigned long);
 #define VM_DATA_DEFAULT_FLAGS	VM_DATA_FLAGS_TSK_EXEC
 
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 993a27ea6f54..8407ac2b5d68 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -50,5 +50,6 @@ int pfn_is_map_memory(unsigned long pfn);
 #define VM_DATA_DEFAULT_FLAGS	(VM_DATA_FLAGS_TSK_EXEC | VM_MTE_ALLOWED)
 
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif
diff --git a/arch/csky/include/asm/page.h b/arch/csky/include/asm/page.h
index ed7451478b1b..47cc27d4ede1 100644
--- a/arch/csky/include/asm/page.h
+++ b/arch/csky/include/asm/page.h
@@ -89,6 +89,7 @@ extern unsigned long va_pa_offset;
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* !__ASSEMBLY__ */
 #endif /* __ASM_CSKY_PAGE_H */
diff --git a/arch/hexagon/include/asm/page.h b/arch/hexagon/include/asm/page.h
index 7cbf719c578e..e7a8edd6903a 100644
--- a/arch/hexagon/include/asm/page.h
+++ b/arch/hexagon/include/asm/page.h
@@ -142,6 +142,7 @@ static inline void clear_page(void *page)
 #include <asm-generic/memory_model.h>
 /* XXX Todo: implement assembly-optimized version of getorder. */
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* ifdef __ASSEMBLY__ */
 #endif /* ifdef __KERNEL__ */
diff --git a/arch/ia64/include/asm/page.h b/arch/ia64/include/asm/page.h
index 1b990466d540..1feae333e250 100644
--- a/arch/ia64/include/asm/page.h
+++ b/arch/ia64/include/asm/page.h
@@ -96,6 +96,7 @@ do {						\
 #define virt_addr_valid(kaddr)	pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
 
 #include <asm-generic/memory_model.h>
+#include <asm-generic/clear_page.h>
 
 #ifdef CONFIG_FLATMEM
 # define pfn_valid(pfn)		((pfn) < max_mapnr)
diff --git a/arch/m68k/include/asm/page.h b/arch/m68k/include/asm/page.h
index 2f1c54e4725d..1aeaae820670 100644
--- a/arch/m68k/include/asm/page.h
+++ b/arch/m68k/include/asm/page.h
@@ -68,5 +68,6 @@ extern unsigned long _ramend;
 #endif
 
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* _M68K_PAGE_H */
diff --git a/arch/microblaze/include/asm/page.h b/arch/microblaze/include/asm/page.h
index 4b8b2fa78fc5..baa03569477a 100644
--- a/arch/microblaze/include/asm/page.h
+++ b/arch/microblaze/include/asm/page.h
@@ -137,5 +137,6 @@ extern int page_is_ram(unsigned long pfn);
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* _ASM_MICROBLAZE_PAGE_H */
diff --git a/arch/mips/include/asm/page.h b/arch/mips/include/asm/page.h
index 96bc798c1ec1..3dde03bf99f3 100644
--- a/arch/mips/include/asm/page.h
+++ b/arch/mips/include/asm/page.h
@@ -269,5 +269,6 @@ static inline unsigned long kaslr_offset(void)
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* _ASM_PAGE_H */
diff --git a/arch/nios2/include/asm/page.h b/arch/nios2/include/asm/page.h
index 6a989819a7c1..9763048bd3ed 100644
--- a/arch/nios2/include/asm/page.h
+++ b/arch/nios2/include/asm/page.h
@@ -104,6 +104,8 @@ static inline bool pfn_valid(unsigned long pfn)
 
 #include <asm-generic/getorder.h>
 
+#include <asm-generic/clear_page.h>
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_NIOS2_PAGE_H */
diff --git a/arch/openrisc/include/asm/page.h b/arch/openrisc/include/asm/page.h
index aab6e64d6db4..879419c00cd4 100644
--- a/arch/openrisc/include/asm/page.h
+++ b/arch/openrisc/include/asm/page.h
@@ -88,5 +88,6 @@ typedef struct page *pgtable_t;
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* __ASM_OPENRISC_PAGE_H */
diff --git a/arch/parisc/include/asm/page.h b/arch/parisc/include/asm/page.h
index 6faaaa3ebe9b..961f88d6ff63 100644
--- a/arch/parisc/include/asm/page.h
+++ b/arch/parisc/include/asm/page.h
@@ -184,6 +184,7 @@ extern int npmem_ranges;
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 #include <asm/pdc.h>
 
 #define PAGE0   ((struct zeropage *)absolute_pointer(__PAGE_OFFSET))
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index e5f75c70eda8..4742b1f99a3e 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -335,6 +335,7 @@ static inline unsigned long kaslr_offset(void)
 }
 
 #include <asm-generic/memory_model.h>
+#include <asm-generic/clear_page.h>
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_PAGE_H */
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 1526e410e802..ce9005ffccb0 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -188,5 +188,6 @@ extern phys_addr_t __phys_addr_symbol(unsigned long x);
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* _ASM_RISCV_PAGE_H */
diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
index 61dea67bb9c7..7a598f86ae39 100644
--- a/arch/s390/include/asm/page.h
+++ b/arch/s390/include/asm/page.h
@@ -207,5 +207,6 @@ int arch_make_page_accessible(struct page *page);
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* _S390_PAGE_H */
diff --git a/arch/sh/include/asm/page.h b/arch/sh/include/asm/page.h
index eca5daa43b93..5e49bb342c2c 100644
--- a/arch/sh/include/asm/page.h
+++ b/arch/sh/include/asm/page.h
@@ -176,6 +176,7 @@ typedef struct page *pgtable_t;
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 /*
  * Some drivers need to perform DMA into kmalloc'ed buffers
diff --git a/arch/sparc/include/asm/page_32.h b/arch/sparc/include/asm/page_32.h
index fff8861df107..2f061d9a5a30 100644
--- a/arch/sparc/include/asm/page_32.h
+++ b/arch/sparc/include/asm/page_32.h
@@ -135,5 +135,6 @@ extern unsigned long pfn_base;
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* _SPARC_PAGE_H */
diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index 254dffd85fb1..2026bf92e3e7 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -159,5 +159,6 @@ extern unsigned long PAGE_OFFSET;
 #endif /* !(__ASSEMBLY__) */
 
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* _SPARC64_PAGE_H */
diff --git a/arch/um/include/asm/page.h b/arch/um/include/asm/page.h
index 95af12e82a32..79768ad6069c 100644
--- a/arch/um/include/asm/page.h
+++ b/arch/um/include/asm/page.h
@@ -113,6 +113,7 @@ extern unsigned long uml_physmem;
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif	/* __ASSEMBLY__ */
 
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 9cc82f305f4b..5a246a2a66aa 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -85,6 +85,7 @@ static __always_inline u64 __is_canonical_address(u64 vaddr, u8 vaddr_bits)
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 
diff --git a/arch/xtensa/include/asm/page.h b/arch/xtensa/include/asm/page.h
index 493eb7083b1a..2812f2bea844 100644
--- a/arch/xtensa/include/asm/page.h
+++ b/arch/xtensa/include/asm/page.h
@@ -200,4 +200,5 @@ static inline unsigned long ___pa(unsigned long va)
 #endif /* __ASSEMBLY__ */
 
 #include <asm-generic/memory_model.h>
+#include <asm-generic/clear_page.h>
 #endif /* _XTENSA_PAGE_H */
diff --git a/include/asm-generic/clear_page.h b/include/asm-generic/clear_page.h
new file mode 100644
index 000000000000..f827d661519c
--- /dev/null
+++ b/include/asm-generic/clear_page.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_GENERIC_CLEAR_PAGE_H
+#define __ASM_GENERIC_CLEAR_PAGE_H
+
+/*
+ * clear_user_pages() operates on contiguous pages and does the clearing
+ * operation in a single arch defined primitive.
+ *
+ * To do this, arch code defines clear_user_pages() and the max granularity
+ * it can handle via ARCH_MAX_CLEAR_PAGES_ORDER.
+ *
+ * Note that given the need for contiguity, __HAVE_ARCH_CLEAR_USER_PAGES
+ * and CONFIG_HIGHMEM are mutually exclusive.
+ */
+
+#if defined(CONFIG_HIGHMEM) && defined(__HAVE_ARCH_CLEAR_USER_PAGES)
+#error CONFIG_HIGHMEM is incompatible with __HAVE_ARCH_CLEAR_USER_PAGES
+#endif
+
+#ifndef __HAVE_ARCH_CLEAR_USER_PAGES
+
+/*
+ * For architectures that do not expose __HAVE_ARCH_CLEAR_USER_PAGES, set
+ * the granularity to be identical to clear_user_page().
+ */
+#define ARCH_MAX_CLEAR_PAGES_ORDER	0
+
+#ifndef __ASSEMBLY__
+
+/*
+ * With ARCH_MAX_CLEAR_PAGES_ORDER == 0, all callers should be specifying
+ * npages == 1 and so we just fallback to clear_user_page().
+ */
+static inline void clear_user_pages(void *page, unsigned long vaddr,
+			       struct page *start_page, unsigned int npages)
+{
+	clear_user_page(page, vaddr, start_page);
+}
+#endif /* __ASSEMBLY__ */
+#endif /* __HAVE_ARCH_CLEAR_USER_PAGES */
+
+#define ARCH_MAX_CLEAR_PAGES	(1 << ARCH_MAX_CLEAR_PAGES_ORDER)
+
+#endif /* __ASM_GENERIC_CLEAR_PAGE_H */
diff --git a/include/asm-generic/page.h b/include/asm-generic/page.h
index 6fc47561814c..060094e7f964 100644
--- a/include/asm-generic/page.h
+++ b/include/asm-generic/page.h
@@ -93,5 +93,6 @@ extern unsigned long memory_end;
 
 #include <asm-generic/memory_model.h>
 #include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
 
 #endif /* __ASM_GENERIC_PAGE_H */
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 3af34de54330..08781d7693e7 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -208,6 +208,29 @@ static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
 }
 #endif
 
+#ifdef __HAVE_ARCH_CLEAR_USER_PAGES
+static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
+					unsigned int npages)
+{
+	void *addr = page_address(page);
+
+	clear_user_pages(addr, vaddr, page, npages);
+}
+#else
+static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
+					unsigned int npages)
+{
+	void *addr;
+	unsigned int i;
+
+	for (i = 0; i < npages; i++, page++, vaddr += PAGE_SIZE) {
+		addr = kmap_local_page(page);
+		clear_user_page(addr, vaddr, page);
+		kunmap_local(addr);
+	}
+}
+#endif /* __HAVE_ARCH_CLEAR_USER_PAGES */
+
 #ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
 /**
  * alloc_zeroed_user_highpage_movable - Allocate a zeroed HIGHMEM page for a VMA that the caller knows can move
-- 
2.31.1


* [PATCH v3 04/21] mm, clear_huge_page: support clear_user_pages()
From: Ankur Arora @ 2022-06-06 20:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk, joao.martins,
	ankur.a.arora

process_huge_page() now handles page extents, with process_subpages()
handling the individual page-level operation.

process_subpages() workers, clear_subpages() and copy_subpages()
chunk the clearing in units of clear_page_unit, or continue to copy
using a single page operation.

Relatedly, define clear_user_extent() which uses clear_user_highpages()
to either funnel through to clear_user_pages() or fall back to
page-at-a-time clearing via clear_user_highpage().

clear_page_unit, the clearing unit size, is defined to be:
   1 << min(MAX_ORDER - 1, ARCH_MAX_CLEAR_PAGES_ORDER).

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 mm/memory.c | 95 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 69 insertions(+), 26 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2c86d79c9d98..fbc7bc70dc3d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5563,6 +5563,31 @@ EXPORT_SYMBOL(__might_fault);
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 
+static unsigned int __ro_after_init clear_page_unit = 1;
+static int __init setup_clear_page_params(void)
+{
+	clear_page_unit = 1 << min(MAX_ORDER - 1, ARCH_MAX_CLEAR_PAGES_ORDER);
+	return 0;
+}
+
+/*
+ * cacheinfo is setup via device_initcall and we want to get set after
+ * that. Use the default value until then.
+ */
+late_initcall(setup_clear_page_params);
+
+/*
+ * Clear a page extent.
+ *
+ * With ARCH_MAX_CLEAR_PAGES == 1, clear_user_highpages() drops down
+ * to page-at-a-time mode. Or, funnels through to clear_user_pages().
+ */
+static void clear_user_extent(struct page *start_page, unsigned long vaddr,
+			      unsigned int npages)
+{
+	clear_user_highpages(start_page, vaddr, npages);
+}
+
 struct subpage_arg {
 	struct page *dst;
 	struct page *src;
@@ -5576,34 +5601,29 @@ struct subpage_arg {
  */
 static inline void process_huge_page(struct subpage_arg *sa,
 	unsigned long addr_hint, unsigned int pages_per_huge_page,
-	void (*process_subpage)(struct subpage_arg *sa,
-				unsigned long base_addr, int idx))
+	void (*process_subpages)(struct subpage_arg *sa,
+				 unsigned long base_addr, int lidx, int ridx))
 {
 	int i, n, base, l;
 	unsigned long addr = addr_hint &
 		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
 
 	/* Process target subpage last to keep its cache lines hot */
-	might_sleep();
 	n = (addr_hint - addr) / PAGE_SIZE;
+
 	if (2 * n <= pages_per_huge_page) {
 		/* If target subpage in first half of huge page */
 		base = 0;
 		l = n;
 		/* Process subpages at the end of huge page */
-		for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
-			cond_resched();
-			process_subpage(sa, addr, i);
-		}
+		process_subpages(sa, addr, 2*n, pages_per_huge_page-1);
 	} else {
 		/* If target subpage in second half of huge page */
 		base = pages_per_huge_page - 2 * (pages_per_huge_page - n);
 		l = pages_per_huge_page - n;
+
 		/* Process subpages at the begin of huge page */
-		for (i = 0; i < base; i++) {
-			cond_resched();
-			process_subpage(sa, addr, i);
-		}
+		process_subpages(sa, addr, 0, base);
 	}
 	/*
 	 * Process remaining subpages in left-right-left-right pattern
@@ -5613,15 +5633,13 @@ static inline void process_huge_page(struct subpage_arg *sa,
 		int left_idx = base + i;
 		int right_idx = base + 2 * l - 1 - i;
 
-		cond_resched();
-		process_subpage(sa, addr, left_idx);
-		cond_resched();
-		process_subpage(sa, addr, right_idx);
+		process_subpages(sa, addr, left_idx, left_idx);
+		process_subpages(sa, addr, right_idx, right_idx);
 	}
 }
 
 static void clear_gigantic_page(struct page *page,
-				unsigned long addr,
+				unsigned long base_addr,
 				unsigned int pages_per_huge_page)
 {
 	int i;
@@ -5629,18 +5647,35 @@ static void clear_gigantic_page(struct page *page,
 
 	might_sleep();
 	for (i = 0; i < pages_per_huge_page;
-	     i++, p = mem_map_next(p, page, i)) {
+	     i += clear_page_unit, p = mem_map_offset(page, i)) {
+		/*
+		 * clear_page_unit is a factor of 1<<MAX_ORDER which
+		 * guarantees that p[0] and p[clear_page_unit-1]
+		 * never straddle a mem_map discontiguity.
+		 */
+		clear_user_extent(p, base_addr + i * PAGE_SIZE, clear_page_unit);
 		cond_resched();
-		clear_user_highpage(p, addr + i * PAGE_SIZE);
 	}
 }
 
-static void clear_subpage(struct subpage_arg *sa,
-			  unsigned long base_addr, int idx)
+static void clear_subpages(struct subpage_arg *sa,
+			   unsigned long base_addr, int lidx, int ridx)
 {
 	struct page *page = sa->dst;
+	int i, n;
 
-	clear_user_highpage(page + idx, base_addr + idx * PAGE_SIZE);
+	might_sleep();
+
+	for (i = lidx; i <= ridx; ) {
+		unsigned int remaining = (unsigned int) ridx - i + 1;
+
+		n = min(clear_page_unit, remaining);
+
+		clear_user_extent(page + i, base_addr + i * PAGE_SIZE, n);
+		i += n;
+
+		cond_resched();
+	}
 }
 
 void clear_huge_page(struct page *page,
@@ -5659,7 +5694,7 @@ void clear_huge_page(struct page *page,
 		return;
 	}
 
-	process_huge_page(&sa, addr_hint, pages_per_huge_page, clear_subpage);
+	process_huge_page(&sa, addr_hint, pages_per_huge_page, clear_subpages);
 }
 
 static void copy_user_gigantic_page(struct page *dst, struct page *src,
@@ -5681,11 +5716,19 @@ static void copy_user_gigantic_page(struct page *dst, struct page *src,
 	}
 }
 
-static void copy_subpage(struct subpage_arg *copy_arg,
-			 unsigned long base_addr, int idx)
+static void copy_subpages(struct subpage_arg *copy_arg,
+			  unsigned long base_addr, int lidx, int ridx)
 {
-	copy_user_highpage(copy_arg->dst + idx, copy_arg->src + idx,
+	int idx;
+
+	might_sleep();
+
+	for (idx = lidx; idx <= ridx; idx++) {
+		copy_user_highpage(copy_arg->dst + idx, copy_arg->src + idx,
 			   base_addr + idx * PAGE_SIZE, copy_arg->vma);
+
+		cond_resched();
+	}
 }
 
 void copy_user_huge_page(struct page *dst, struct page *src,
@@ -5706,7 +5749,7 @@ void copy_user_huge_page(struct page *dst, struct page *src,
 		return;
 	}
 
-	process_huge_page(&sa, addr_hint, pages_per_huge_page, copy_subpage);
+	process_huge_page(&sa, addr_hint, pages_per_huge_page, copy_subpages);
 }
 
 long copy_huge_page_from_user(struct page *dst_page,
-- 
2.31.1


* [PATCH v3 05/21] mm/huge_page: generalize process_huge_page()
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

process_huge_page() processes subpages left-right, narrowing towards
the direction of the faulting subpage to keep spatially close
cachelines hot.

This is done, however, page-at-a-time. Retain the left-right
narrowing logic while using larger chunks for page regions
farther away from the target, and smaller chunks approaching
the target.

Clearing in large chunks allows for uarch specific optimizations.
Do this, however, only for far away subpages because we don't
care about keeping those cachelines hot.

In addition, while narrowing towards the target, access both the
left and right chunks in the forward direction instead of the
reverse -- x86 string instructions perform better that way.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 mm/memory.c | 86 +++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 64 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index fbc7bc70dc3d..04c6bb5d75f6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5592,8 +5592,10 @@ struct subpage_arg {
 	struct page *dst;
 	struct page *src;
 	struct vm_area_struct *vma;
+	int page_unit;
 };
 
+#define NWIDTH 4
 /*
  * Process all subpages of the specified huge page with the specified
  * operation.  The target subpage will be processed last to keep its
@@ -5604,37 +5606,75 @@ static inline void process_huge_page(struct subpage_arg *sa,
 	void (*process_subpages)(struct subpage_arg *sa,
 				 unsigned long base_addr, int lidx, int ridx))
 {
-	int i, n, base, l;
+	int n, lbound, rbound;
+	int remaining, unit = sa->page_unit;
 	unsigned long addr = addr_hint &
 		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
 
+	lbound = 0;
+	rbound = pages_per_huge_page - 1;
+	remaining = pages_per_huge_page;
+
 	/* Process target subpage last to keep its cache lines hot */
 	n = (addr_hint - addr) / PAGE_SIZE;
 
-	if (2 * n <= pages_per_huge_page) {
-		/* If target subpage in first half of huge page */
-		base = 0;
-		l = n;
-		/* Process subpages at the end of huge page */
-		process_subpages(sa, addr, 2*n, pages_per_huge_page-1);
-	} else {
-		/* If target subpage in second half of huge page */
-		base = pages_per_huge_page - 2 * (pages_per_huge_page - n);
-		l = pages_per_huge_page - n;
-
-		/* Process subpages at the begin of huge page */
-		process_subpages(sa, addr, 0, base);
-	}
 	/*
-	 * Process remaining subpages in left-right-left-right pattern
-	 * towards the target subpage
+	 * Process subpages in a left-right-left-right pattern towards the
+	 * faulting subpage to keep spatially close cachelines hot.
+	 *
+	 * If the architecture advertises multi-page clearing/copying, use
+	 * the largest extent available, process it in the forward direction,
+	 * while iteratively narrowing as the target gets closer.
+	 *
+	 * Clearing in large chunks allows for uarch specific optimizations.
+	 * Do this, however, only for far away subpages because we don't
+	 * care about keeping those cachelines hot.
+	 *
+	 * In addition, while narrowing towards the target, access both the
+	 * left and right chunks in the forward direction instead of the
+	 * reverse -- x86 string instructions perform better that way.
 	 */
-	for (i = 0; i < l; i++) {
-		int left_idx = base + i;
-		int right_idx = base + 2 * l - 1 - i;
+	while (remaining) {
+		int left_gap = n - lbound;
+		int right_gap = rbound - n;
+		int neighbourhood;
 
-		process_subpages(sa, addr, left_idx, left_idx);
-		process_subpages(sa, addr, right_idx, right_idx);
+		/*
+		 * We want to defer processing of the immediate neighbourhood of
+		 * the target until rest of the huge-page is exhausted.
+		 */
+		neighbourhood = NWIDTH * (left_gap > NWIDTH ||
+					  right_gap > NWIDTH);
+
+		/*
+		 * Width of the remaining region on the left: n - lbound + 1.
+		 * In addition hold an additional neighbourhood region, which is
+		 * non-zero until the left, right gaps have been cleared.
+		 *
+		 * [ddddd....xxxxN
+		 *       ^   |   `---- target
+		 *       `---|-- lbound
+		 *           `------------ left neighbourhood edge
+		 */
+		if ((n - lbound + 1) >= unit + neighbourhood) {
+			process_subpages(sa, addr, lbound, lbound + unit - 1);
+			lbound += unit;
+			remaining -= unit;
+		}
+
+		/*
+		 * Similarly the right:
+		 *               Nxxxx....ddd]
+		 */
+		if ((rbound - n) >= (unit + neighbourhood)) {
+			process_subpages(sa, addr, rbound - unit + 1, rbound);
+			rbound -= unit;
+			remaining -= unit;
+		}
+
+		unit = min(sa->page_unit, unit >> 1);
+		if (unit == 0)
+			unit = 1;
 	}
 }
 
@@ -5687,6 +5727,7 @@ void clear_huge_page(struct page *page,
 		.dst = page,
 		.src = NULL,
 		.vma = NULL,
+		.page_unit = clear_page_unit,
 	};
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
@@ -5741,6 +5782,7 @@ void copy_user_huge_page(struct page *dst, struct page *src,
 		.dst = dst,
 		.src = src,
 		.vma = vma,
+		.page_unit = 1,
 	};
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-- 
2.31.1


* [PATCH v3 06/21] x86/clear_page: add clear_pages()
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Add clear_pages(), with ARCH_MAX_CLEAR_PAGES_ORDER=8, so we can clear
in chunks of up to 1024KB.

The case for doing this is to expose huge or gigantic page clearing
as a few long strings of zeroes instead of many PAGE_SIZE'd operations.
Processors could take advantage of this hint by foregoing cacheline
allocation.
Unfortunately, current generation CPUs generally do not do this
optimization: among the CPUs tested, Intel Skylake and Icelakex don't
at all; AMD Milan does for extents > ~LLC-size.
(Note, however, numbers below do show a ~25% increase in clearing
BW -- just that they aren't due to foregoing cacheline allocation.)

One hope for this change is that it might provide enough of a
hint that future uarchs could optimize for.

A minor negative with this change is that calls to clear_page()
(which now calls clear_pages()) clobber an additional register.
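
The resulting interface is shaped roughly like this (a sketch, not the
exact patch; the real implementation is expected to keep selecting
between the orig/rep/erms variants via alternatives, as clear_page()
does today):

	/* arch/x86/include/asm/page_64.h (sketch) */
	#define ARCH_MAX_CLEAR_PAGES_ORDER	8	/* 256 pages == 1024KB */

	void clear_pages_erms(void *page, unsigned int npages);

	static inline void clear_pages(void *page, unsigned int npages)
	{
		/* a single REP;STOSB over npages * PAGE_SIZE bytes */
		clear_pages_erms(page, npages);
	}

	static inline void clear_page(void *page)
	{
		clear_pages(page, 1);
	}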

Performance
===

System:    Oracle X9-2c (2 nodes * 32 cores * 2 threads)
Processor: Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
Memory:    1024 GB evenly split between nodes
LLC-size:  48MB for each node (32-cores * 2-threads)
no_turbo: 1, Microcode: 0xd0002c1, scaling-governor: performance

System:    Oracle E4-2c (2 nodes * 8 CCXs * 8 cores * 2 threads)
Processor: AMD EPYC 7J13 64-Core Processor (Milan, 25:1:1)
Memory:    512 GB evenly split between nodes
LLC-size:  32MB for each CCX (8-cores * 2-threads)
boost: 1, Microcode: 0xa00115d, scaling-governor: performance

Workload: create a 192GB qemu-VM (backed by preallocated 2MB
pages on the local node)
==

Icelakex
--
                          Time (s)        Delta (%)
 clear_page_erms()     22.37 ( +- 0.14s )           #  9.21 bytes/ns
 clear_pages_erms()    16.49 ( +- 0.06s )  -26.28%  # 12.50 bytes/ns

Looking at the perf stats [1] [2], it's not obvious where the
improvement is coming from. For clear_pages_erms(), we do execute
fewer instructions and branches (multiple pages per call to
clear_pages_erms(), and fewer cond_resched() calls) but since this
code isn't frontend bound (though there is a marginal improvement in
topdown-fe-bound), it's not clear that that's the cause of the ~25%
improvement.
The topdown-be-bound numbers are significantly better but they are
in a similar proportion to the total slots in both cases.

Milan
--
                          Time (s)        Delta (%)
 clear_page_erms()     16.49 ( +- 0.06s )           # 12.50 bytes/ns
 clear_pages_erms()    11.82 ( +- 0.06s )  -28.32%  # 17.44 bytes/ns

Similar to the Icelakex case above, from the perf stats [3], [4] it's
unclear where the improvement is coming from. We do somewhat better
for L1-dcache-loads and marginally better for stalled-cycles-backend
but nothing obvious stands out.

Workload: vm-scalability hugetlb tests (on Icelakex)
==

For case-anon-w-seq-hugetlb, there is a ~19.49% improvement in
cpu-cycles expended. As above, from perf stats there isn't a clear
reason why. No significant differences in user/kernel cache misses.

case-anon-w-seq-hugetlb:
  -   2,632,688,342,385      cpu-cycles                #    2.301 GHz                      ( +-  6.76% )  (33.29%)
  +   2,119,058,504,338      cpu-cycles                #    1.654 GHz                      ( +-  4.63% )  (33.37%)

Other hugetlb tests are flat.

case-anon-w-rand-hugetlb:
  -  14,423,774,217,911      cpu-cycles                #    2.452 GHz                      ( +-  0.55% )  (33.30%)
  +  14,009,785,056,082      cpu-cycles                #    2.428 GHz                      ( +-  3.11% )  (33.32%)

case-anon-cow-seq-hugetlb:
  -   2,689,994,027,601      cpu-cycles                #    2.220 GHz                      ( +-  1.91% )  (33.27%)
  +   2,735,414,889,894      cpu-cycles                #    2.262 GHz                      ( +-  1.82% )  (27.73%)

case-anon-cow-rand-hugetlb:
  -  16,130,147,328,192      cpu-cycles                #    2.482 GHz                      ( +-  1.07% )  (33.30%)
  +  15,815,163,909,204      cpu-cycles                #    2.432 GHz                      ( +-  0.64% )  (33.32%)

cache-references, cache-misses are within margin of error across all
the tests.

[1] Icelakex, create 192GB qemu-VM, clear_page_erms()
 # perf stat -r 5 --all-kernel -ddd ./qemu.sh

 Performance counter stats for './qemu.sh' (5 runs):

         22,378.31 msec task-clock                #    1.000 CPUs utilized            ( +-  0.67% )
               153      context-switches          #    6.844 /sec                     ( +-  0.57% )
                 8      cpu-migrations            #    0.358 /sec                     ( +- 16.49% )
               116      page-faults               #    5.189 /sec                     ( +-  0.17% )
    57,290,131,280      cycles                    #    2.563 GHz                      ( +-  0.66% )  (38.46%)
     3,077,416,348      instructions              #    0.05  insn per cycle           ( +-  0.30% )  (46.14%)
       631,473,780      branches                  #   28.246 M/sec                    ( +-  0.18% )  (53.83%)
         1,167,792      branch-misses             #    0.19% of all branches          ( +-  0.79% )  (61.52%)
   286,600,215,705      slots                     #   12.820 G/sec                    ( +-  0.66% )  (69.20%)
    11,435,999,662      topdown-retiring          #      3.9% retiring                ( +-  1.56% )  (69.20%)
    19,428,489,213      topdown-bad-spec          #      6.2% bad speculation         ( +-  3.23% )  (69.20%)
     3,504,763,769      topdown-fe-bound          #      1.2% frontend bound          ( +-  0.67% )  (69.20%)
   258,517,960,428      topdown-be-bound          #     88.7% backend bound           ( +-  0.58% )  (69.20%)
       749,211,322      L1-dcache-loads           #   33.513 M/sec                    ( +-  0.13% )  (69.18%)
     3,244,380,956      L1-dcache-load-misses     #  433.32% of all L1-dcache accesses  ( +-  0.00% )  (69.20%)
        11,441,841      LLC-loads                 #  511.805 K/sec                    ( +-  0.30% )  (69.23%)
           839,878      LLC-load-misses           #    7.32% of all LL-cache accesses  ( +-  1.28% )  (69.24%)
   <not supported>      L1-icache-loads
        23,091,397      L1-icache-load-misses                                         ( +-  0.72% )  (30.82%)
       772,619,434      dTLB-loads                #   34.560 M/sec                    ( +-  0.31% )  (30.82%)
            49,750      dTLB-load-misses          #    0.01% of all dTLB cache accesses  ( +-  3.21% )  (30.80%)
   <not supported>      iTLB-loads
           503,570      iTLB-load-misses                                              ( +-  0.44% )  (30.78%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

            22.374 +- 0.149 seconds time elapsed  ( +-  0.66% )

[2] Icelakex, create 192GB qemu-VM, clear_pages_erms()
 # perf stat -r 5 --all-kernel -ddd ./qemu.sh

 Performance counter stats for './qemu.sh' (5 runs):

         16,329.41 msec task-clock                #    0.990 CPUs utilized            ( +-  0.42% )
               143      context-switches          #    8.681 /sec                     ( +-  0.93% )
                 1      cpu-migrations            #    0.061 /sec                     ( +- 63.25% )
               118      page-faults               #    7.164 /sec                     ( +-  0.27% )
    41,735,523,673      cycles                    #    2.534 GHz                      ( +-  0.42% )  (38.46%)
     1,454,116,543      instructions              #    0.03  insn per cycle           ( +-  0.49% )  (46.16%)
       266,749,920      branches                  #   16.194 M/sec                    ( +-  0.41% )  (53.86%)
           928,726      branch-misses             #    0.35% of all branches          ( +-  0.38% )  (61.54%)
   208,805,754,709      slots                     #   12.676 G/sec                    ( +-  0.41% )  (69.23%)
     5,355,889,366      topdown-retiring          #      2.5% retiring                ( +-  0.50% )  (69.23%)
    12,720,749,784      topdown-bad-spec          #      6.1% bad speculation         ( +-  1.38% )  (69.23%)
       998,710,552      topdown-fe-bound          #      0.5% frontend bound          ( +-  0.85% )  (69.23%)
   192,653,197,875      topdown-be-bound          #     90.9% backend bound           ( +-  0.38% )  (69.23%)
       407,619,058      L1-dcache-loads           #   24.746 M/sec                    ( +-  0.17% )  (69.20%)
     3,245,399,461      L1-dcache-load-misses     #  801.49% of all L1-dcache accesses  ( +-  0.01% )  (69.22%)
        10,805,747      LLC-loads                 #  656.009 K/sec                    ( +-  0.37% )  (69.25%)
           804,475      LLC-load-misses           #    7.44% of all LL-cache accesses  ( +-  2.73% )  (69.26%)
   <not supported>      L1-icache-loads
        18,134,527      L1-icache-load-misses                                         ( +-  1.24% )  (30.80%)
       435,474,462      dTLB-loads                #   26.437 M/sec                    ( +-  0.28% )  (30.80%)
            41,187      dTLB-load-misses          #    0.01% of all dTLB cache accesses  ( +-  4.06% )  (30.79%)
   <not supported>      iTLB-loads
           440,135      iTLB-load-misses                                              ( +-  1.07% )  (30.78%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

           16.4906 +- 0.0676 seconds time elapsed  ( +-  0.41% )

[3] Milan, create 192GB qemu-VM, clear_page_erms()
 # perf stat -r 5 --all-kernel -ddd ./qemu.sh

 Performance counter stats for './qemu.sh' (5 runs):

         16,321.98 msec task-clock                #    0.989 CPUs utilized            ( +-  0.42% )
               104      context-switches          #    6.312 /sec                     ( +-  0.47% )
                 0      cpu-migrations            #    0.000 /sec
               109      page-faults               #    6.616 /sec                     ( +-  0.41% )
    39,430,057,963      cycles                    #    2.393 GHz                      ( +-  0.42% )  (33.33%)
       252,874,009      stalled-cycles-frontend   #    0.64% frontend cycles idle     ( +- 17.81% )  (33.34%)
         7,240,041      stalled-cycles-backend    #    0.02% backend cycles idle      ( +-245.73% )  (33.34%)
     3,031,754,124      instructions              #    0.08  insn per cycle
                                                  #    0.11  stalled cycles per insn  ( +-  0.41% )  (33.35%)
       711,675,976      branches                  #   43.197 M/sec                    ( +-  0.15% )  (33.34%)
        52,470,018      branch-misses             #    7.38% of all branches          ( +-  0.21% )  (33.36%)
     7,744,057,748      L1-dcache-loads           #  470.041 M/sec                    ( +-  0.05% )  (33.36%)
     3,241,880,079      L1-dcache-load-misses     #   41.92% of all L1-dcache accesses  ( +-  0.01% )  (33.35%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
       155,312,115      L1-icache-loads           #    9.427 M/sec                    ( +-  0.23% )  (33.34%)
         1,573,793      L1-icache-load-misses     #    1.01% of all L1-icache accesses  ( +-  3.74% )  (33.36%)
         3,521,392      dTLB-loads                #  213.738 K/sec                    ( +-  4.97% )  (33.35%)
           346,337      dTLB-load-misses          #    9.31% of all dTLB cache accesses  ( +-  5.54% )  (33.35%)
               725      iTLB-loads                #   44.005 /sec                     ( +-  8.75% )  (33.34%)
           115,723      iTLB-load-misses          # 19261.48% of all iTLB cache accesses  ( +-  1.20% )  (33.34%)
       139,229,403      L1-dcache-prefetches      #    8.451 M/sec                    ( +- 10.97% )  (33.34%)
   <not supported>      L1-dcache-prefetch-misses

           16.4962 +- 0.0665 seconds time elapsed  ( +-  0.40% )

[4] Milan, create 192GB qemu-VM, clear_pages_erms()
 # perf stat -r 5 --all-kernel -ddd ./qemu.sh

 Performance counter stats for './qemu.sh' (5 runs):

         11,676.79 msec task-clock                #    0.987 CPUs utilized            ( +-  0.68% )
                96      context-switches          #    8.131 /sec                     ( +-  0.78% )
                 2      cpu-migrations            #    0.169 /sec                     ( +- 18.71% )
               106      page-faults               #    8.978 /sec                     ( +-  0.23% )
    28,161,726,414      cycles                    #    2.385 GHz                      ( +-  0.69% )  (33.33%)
       141,032,827      stalled-cycles-frontend   #    0.50% frontend cycles idle     ( +- 52.44% )  (33.35%)
       796,792,139      stalled-cycles-backend    #    2.80% backend cycles idle      ( +- 23.73% )  (33.35%)
     1,140,172,646      instructions              #    0.04  insn per cycle
                                                  #    0.50  stalled cycles per insn  ( +-  0.89% )  (33.35%)
       219,864,061      branches                  #   18.622 M/sec                    ( +-  1.06% )  (33.36%)
         1,407,446      branch-misses             #    0.63% of all branches          ( +- 10.66% )  (33.40%)
     6,882,968,897      L1-dcache-loads           #  582.960 M/sec                    ( +-  0.03% )  (33.38%)
     3,267,546,914      L1-dcache-load-misses     #   47.45% of all L1-dcache accesses  ( +-  0.02% )  (33.37%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
       146,901,513      L1-icache-loads           #   12.442 M/sec                    ( +-  0.78% )  (33.36%)
         1,462,155      L1-icache-load-misses     #    0.99% of all L1-icache accesses  ( +-  0.83% )  (33.34%)
         2,055,805      dTLB-loads                #  174.118 K/sec                    ( +- 22.56% )  (33.33%)
           136,260      dTLB-load-misses          #    4.69% of all dTLB cache accesses  ( +- 23.13% )  (33.35%)
               941      iTLB-loads                #   79.699 /sec                     ( +-  5.54% )  (33.35%)
           115,444      iTLB-load-misses          # 14051.12% of all iTLB cache accesses  ( +- 21.17% )  (33.34%)
        95,438,373      L1-dcache-prefetches      #    8.083 M/sec                    ( +- 19.99% )  (33.34%)
   <not supported>      L1-dcache-prefetch-misses

           11.8296 +- 0.0805 seconds time elapsed  ( +-  0.68% )

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/page.h    | 12 +++++++++++
 arch/x86/include/asm/page_64.h | 28 ++++++++++++++++++-------
 arch/x86/lib/clear_page_64.S   | 38 ++++++++++++++++++++--------------
 3 files changed, 55 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 5a246a2a66aa..045eaab08f43 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -22,6 +22,18 @@ struct page;
 extern struct range pfn_mapped[];
 extern int nr_pfn_mapped;
 
+#ifdef __HAVE_ARCH_CLEAR_USER_PAGES	/* x86_64 */
+
+#define clear_page(page) clear_pages(page, 1)
+
+static inline void clear_user_pages(void *page, unsigned long vaddr,
+				    struct page *pg, unsigned int npages)
+{
+	clear_pages(page, npages);
+}
+
+#endif /* __HAVE_ARCH_CLEAR_USER_PAGES */
+
 static inline void clear_user_page(void *page, unsigned long vaddr,
 				   struct page *pg)
 {
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index baa70451b8df..a88a3508888a 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -41,16 +41,28 @@ extern unsigned long __phys_addr_symbol(unsigned long);
 #define pfn_valid(pfn)          ((pfn) < max_pfn)
 #endif
 
-void clear_page_orig(void *page);
-void clear_page_rep(void *page);
-void clear_page_erms(void *page);
+/*
+ * Clear in chunks of 256 pages/1024KB.
+ *
+ * Assuming a clearing BW of 3b/cyc (recent generation processors have
+ * more), this amounts to around 400K cycles for each chunk.
+ *
+ * With a cpufreq of ~2.5GHz, this amounts to ~160us for each chunk
+ * (which would also be the interval between calls to cond_resched().)
+ */
+#define ARCH_MAX_CLEAR_PAGES_ORDER	8
 
-static inline void clear_page(void *page)
+void clear_pages_orig(void *page, unsigned long npages);
+void clear_pages_rep(void *page, unsigned long npages);
+void clear_pages_erms(void *page, unsigned long npages);
+
+#define __HAVE_ARCH_CLEAR_USER_PAGES
+static inline void clear_pages(void *page, unsigned int npages)
 {
-	alternative_call_2(clear_page_orig,
-			   clear_page_rep, X86_FEATURE_REP_GOOD,
-			   clear_page_erms, X86_FEATURE_ERMS,
-			   "=D" (page),
+	alternative_call_2(clear_pages_orig,
+			   clear_pages_rep, X86_FEATURE_REP_GOOD,
+			   clear_pages_erms, X86_FEATURE_ERMS,
+			   "=D" (page), "S" ((unsigned long) npages),
 			   "0" (page)
 			   : "cc", "memory", "rax", "rcx");
 }
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index fe59b8ac4fcc..2cc3b681734a 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
 #include <linux/linkage.h>
 #include <asm/export.h>
+#include <asm/page_types.h>
 
 /*
  * Most CPUs support enhanced REP MOVSB/STOSB instructions. It is
@@ -10,23 +11,29 @@
  */
 
 /*
- * Zero a page.
- * %rdi	- page
+ * Zero pages.
+ * %rdi	- base page
+ * %rsi	- number of pages
+ *
+ * Note: clear_pages_*() have differing alignments restrictions
+ * but callers are always expected to page align.
  */
-SYM_FUNC_START(clear_page_rep)
-	movl $4096/8,%ecx
+SYM_FUNC_START(clear_pages_rep)
+	movq %rsi,%rcx
+	shlq $(PAGE_SHIFT - 3),%rcx
 	xorl %eax,%eax
 	rep stosq
 	RET
-SYM_FUNC_END(clear_page_rep)
-EXPORT_SYMBOL_GPL(clear_page_rep)
+SYM_FUNC_END(clear_pages_rep)
+EXPORT_SYMBOL_GPL(clear_pages_rep)
 
-SYM_FUNC_START(clear_page_orig)
+SYM_FUNC_START(clear_pages_orig)
 	xorl   %eax,%eax
-	movl   $4096/64,%ecx
+	movq   %rsi,%rcx
+	shlq   $(PAGE_SHIFT - 6),%rcx
 	.p2align 4
 .Lloop:
-	decl	%ecx
+	decq	%rcx
 #define PUT(x) movq %rax,x*8(%rdi)
 	movq %rax,(%rdi)
 	PUT(1)
@@ -40,13 +47,14 @@ SYM_FUNC_START(clear_page_orig)
 	jnz	.Lloop
 	nop
 	RET
-SYM_FUNC_END(clear_page_orig)
-EXPORT_SYMBOL_GPL(clear_page_orig)
+SYM_FUNC_END(clear_pages_orig)
+EXPORT_SYMBOL_GPL(clear_pages_orig)
 
-SYM_FUNC_START(clear_page_erms)
-	movl $4096,%ecx
+SYM_FUNC_START(clear_pages_erms)
+	movq %rsi,%rcx
+	shlq $PAGE_SHIFT, %rcx
 	xorl %eax,%eax
 	rep stosb
 	RET
-SYM_FUNC_END(clear_page_erms)
-EXPORT_SYMBOL_GPL(clear_page_erms)
+SYM_FUNC_END(clear_pages_erms)
+EXPORT_SYMBOL_GPL(clear_pages_erms)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 07/21] x86/asm: add memset_movnti()
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (5 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 06/21] x86/clear_page: add clear_pages() Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 08/21] perf bench: " Ankur Arora
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Add a MOVNTI based non-caching implementation of memset().

memset_movnti() only needs to differ from memset_orig() in the opcode
used in the inner loop, so move the memset_orig() logic into a macro,
and use that to generate both memset_movq() (the renamed memset_orig())
and memset_movnti().

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/lib/memset_64.S | 68 ++++++++++++++++++++++------------------
 1 file changed, 38 insertions(+), 30 deletions(-)

diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
index fc9ffd3ff3b2..307b753ca03a 100644
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -24,7 +24,7 @@ SYM_FUNC_START(__memset)
 	 *
 	 * Otherwise, use original memset function.
 	 */
-	ALTERNATIVE_2 "jmp memset_orig", "", X86_FEATURE_REP_GOOD, \
+	ALTERNATIVE_2 "jmp memset_movq", "", X86_FEATURE_REP_GOOD, \
 		      "jmp memset_erms", X86_FEATURE_ERMS
 
 	movq %rdi,%r9
@@ -66,7 +66,8 @@ SYM_FUNC_START_LOCAL(memset_erms)
 	RET
 SYM_FUNC_END(memset_erms)
 
-SYM_FUNC_START_LOCAL(memset_orig)
+.macro MEMSET_MOV OP fence
+SYM_FUNC_START_LOCAL(memset_\OP)
 	movq %rdi,%r10
 
 	/* expand byte value  */
@@ -77,64 +78,71 @@ SYM_FUNC_START_LOCAL(memset_orig)
 	/* align dst */
 	movl  %edi,%r9d
 	andl  $7,%r9d
-	jnz  .Lbad_alignment
-.Lafter_bad_alignment:
+	jnz  .Lbad_alignment_\@
+.Lafter_bad_alignment_\@:
 
 	movq  %rdx,%rcx
 	shrq  $6,%rcx
-	jz	 .Lhandle_tail
+	jz	 .Lhandle_tail_\@
 
 	.p2align 4
-.Lloop_64:
+.Lloop_64_\@:
 	decq  %rcx
-	movq  %rax,(%rdi)
-	movq  %rax,8(%rdi)
-	movq  %rax,16(%rdi)
-	movq  %rax,24(%rdi)
-	movq  %rax,32(%rdi)
-	movq  %rax,40(%rdi)
-	movq  %rax,48(%rdi)
-	movq  %rax,56(%rdi)
+	\OP  %rax,(%rdi)
+	\OP  %rax,8(%rdi)
+	\OP  %rax,16(%rdi)
+	\OP  %rax,24(%rdi)
+	\OP  %rax,32(%rdi)
+	\OP  %rax,40(%rdi)
+	\OP  %rax,48(%rdi)
+	\OP  %rax,56(%rdi)
 	leaq  64(%rdi),%rdi
-	jnz    .Lloop_64
+	jnz    .Lloop_64_\@
 
 	/* Handle tail in loops. The loops should be faster than hard
 	   to predict jump tables. */
 	.p2align 4
-.Lhandle_tail:
+.Lhandle_tail_\@:
 	movl	%edx,%ecx
 	andl    $63&(~7),%ecx
-	jz 		.Lhandle_7
+	jz 	.Lhandle_7_\@
 	shrl	$3,%ecx
 	.p2align 4
-.Lloop_8:
+.Lloop_8_\@:
 	decl   %ecx
-	movq  %rax,(%rdi)
+	\OP  %rax,(%rdi)
 	leaq  8(%rdi),%rdi
-	jnz    .Lloop_8
+	jnz    .Lloop_8_\@
 
-.Lhandle_7:
+.Lhandle_7_\@:
 	andl	$7,%edx
-	jz      .Lende
+	jz      .Lende_\@
 	.p2align 4
-.Lloop_1:
+.Lloop_1_\@:
 	decl    %edx
 	movb 	%al,(%rdi)
 	leaq	1(%rdi),%rdi
-	jnz     .Lloop_1
+	jnz     .Lloop_1_\@
 
-.Lende:
+.Lende_\@:
+	.if \fence
+	sfence
+	.endif
 	movq	%r10,%rax
 	RET
 
-.Lbad_alignment:
+.Lbad_alignment_\@:
 	cmpq $7,%rdx
-	jbe	.Lhandle_7
+	jbe	.Lhandle_7_\@
 	movq %rax,(%rdi)	/* unaligned store */
 	movq $8,%r8
 	subq %r9,%r8
 	addq %r8,%rdi
 	subq %r8,%rdx
-	jmp .Lafter_bad_alignment
-.Lfinal:
-SYM_FUNC_END(memset_orig)
+	jmp .Lafter_bad_alignment_\@
+.Lfinal_\@:
+SYM_FUNC_END(memset_\OP)
+.endm
+
+MEMSET_MOV OP=movq fence=0
+MEMSET_MOV OP=movnti fence=1
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 08/21] perf bench: add memset_movnti()
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (6 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 07/21] x86/asm: add memset_movnti() Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 09/21] x86/asm: add clear_pages_movnt() Ankur Arora
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Clone memset_movnti() from arch/x86/lib/memset_64.S.

perf bench mem memset -f x86-64-movnt on Intel Icelakex, AMD Milan:

  # Intel Icelakex

  $ for i in 8 32 128 512; do
         perf bench mem memset -f x86-64-movnt -s ${i}MB -l 5
     done

  # Output pruned.
  # Running 'mem/memset' benchmark:
  # function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
  # Copying 8MB bytes ...
      12.896170 GB/sec
  # Copying 32MB bytes ...
      15.879065 GB/sec
  # Copying 128MB bytes ...
      20.813214 GB/sec
  # Copying 512MB bytes ...
      24.190817 GB/sec

  # AMD Milan

  $ for i in 8 32 128 512; do
         perf bench mem memset -f x86-64-movnt -s ${i}MB -l 5
     done

  # Output pruned.
  # Running 'mem/memset' benchmark:
  # function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
  # Copying 8MB bytes ...
        22.372566 GB/sec
  # Copying 32MB bytes ...
        22.507923 GB/sec
  # Copying 128MB bytes ...
        22.492532 GB/sec
  # Copying 512MB bytes ...
        22.434603 GB/sec

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/arch/x86/lib/memset_64.S               | 68 +++++++++++---------
 tools/perf/bench/mem-memset-x86-64-asm-def.h |  6 +-
 2 files changed, 43 insertions(+), 31 deletions(-)

diff --git a/tools/arch/x86/lib/memset_64.S b/tools/arch/x86/lib/memset_64.S
index fc9ffd3ff3b2..307b753ca03a 100644
--- a/tools/arch/x86/lib/memset_64.S
+++ b/tools/arch/x86/lib/memset_64.S
@@ -24,7 +24,7 @@ SYM_FUNC_START(__memset)
 	 *
 	 * Otherwise, use original memset function.
 	 */
-	ALTERNATIVE_2 "jmp memset_orig", "", X86_FEATURE_REP_GOOD, \
+	ALTERNATIVE_2 "jmp memset_movq", "", X86_FEATURE_REP_GOOD, \
 		      "jmp memset_erms", X86_FEATURE_ERMS
 
 	movq %rdi,%r9
@@ -66,7 +66,8 @@ SYM_FUNC_START_LOCAL(memset_erms)
 	RET
 SYM_FUNC_END(memset_erms)
 
-SYM_FUNC_START_LOCAL(memset_orig)
+.macro MEMSET_MOV OP fence
+SYM_FUNC_START_LOCAL(memset_\OP)
 	movq %rdi,%r10
 
 	/* expand byte value  */
@@ -77,64 +78,71 @@ SYM_FUNC_START_LOCAL(memset_orig)
 	/* align dst */
 	movl  %edi,%r9d
 	andl  $7,%r9d
-	jnz  .Lbad_alignment
-.Lafter_bad_alignment:
+	jnz  .Lbad_alignment_\@
+.Lafter_bad_alignment_\@:
 
 	movq  %rdx,%rcx
 	shrq  $6,%rcx
-	jz	 .Lhandle_tail
+	jz	 .Lhandle_tail_\@
 
 	.p2align 4
-.Lloop_64:
+.Lloop_64_\@:
 	decq  %rcx
-	movq  %rax,(%rdi)
-	movq  %rax,8(%rdi)
-	movq  %rax,16(%rdi)
-	movq  %rax,24(%rdi)
-	movq  %rax,32(%rdi)
-	movq  %rax,40(%rdi)
-	movq  %rax,48(%rdi)
-	movq  %rax,56(%rdi)
+	\OP  %rax,(%rdi)
+	\OP  %rax,8(%rdi)
+	\OP  %rax,16(%rdi)
+	\OP  %rax,24(%rdi)
+	\OP  %rax,32(%rdi)
+	\OP  %rax,40(%rdi)
+	\OP  %rax,48(%rdi)
+	\OP  %rax,56(%rdi)
 	leaq  64(%rdi),%rdi
-	jnz    .Lloop_64
+	jnz    .Lloop_64_\@
 
 	/* Handle tail in loops. The loops should be faster than hard
 	   to predict jump tables. */
 	.p2align 4
-.Lhandle_tail:
+.Lhandle_tail_\@:
 	movl	%edx,%ecx
 	andl    $63&(~7),%ecx
-	jz 		.Lhandle_7
+	jz 	.Lhandle_7_\@
 	shrl	$3,%ecx
 	.p2align 4
-.Lloop_8:
+.Lloop_8_\@:
 	decl   %ecx
-	movq  %rax,(%rdi)
+	\OP  %rax,(%rdi)
 	leaq  8(%rdi),%rdi
-	jnz    .Lloop_8
+	jnz    .Lloop_8_\@
 
-.Lhandle_7:
+.Lhandle_7_\@:
 	andl	$7,%edx
-	jz      .Lende
+	jz      .Lende_\@
 	.p2align 4
-.Lloop_1:
+.Lloop_1_\@:
 	decl    %edx
 	movb 	%al,(%rdi)
 	leaq	1(%rdi),%rdi
-	jnz     .Lloop_1
+	jnz     .Lloop_1_\@
 
-.Lende:
+.Lende_\@:
+	.if \fence
+	sfence
+	.endif
 	movq	%r10,%rax
 	RET
 
-.Lbad_alignment:
+.Lbad_alignment_\@:
 	cmpq $7,%rdx
-	jbe	.Lhandle_7
+	jbe	.Lhandle_7_\@
 	movq %rax,(%rdi)	/* unaligned store */
 	movq $8,%r8
 	subq %r9,%r8
 	addq %r8,%rdi
 	subq %r8,%rdx
-	jmp .Lafter_bad_alignment
-.Lfinal:
-SYM_FUNC_END(memset_orig)
+	jmp .Lafter_bad_alignment_\@
+.Lfinal_\@:
+SYM_FUNC_END(memset_\OP)
+.endm
+
+MEMSET_MOV OP=movq fence=0
+MEMSET_MOV OP=movnti fence=1
diff --git a/tools/perf/bench/mem-memset-x86-64-asm-def.h b/tools/perf/bench/mem-memset-x86-64-asm-def.h
index dac6d2b7c39b..53ead7f91313 100644
--- a/tools/perf/bench/mem-memset-x86-64-asm-def.h
+++ b/tools/perf/bench/mem-memset-x86-64-asm-def.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
-MEMSET_FN(memset_orig,
+MEMSET_FN(memset_movq,
 	"x86-64-unrolled",
 	"unrolled memset() in arch/x86/lib/memset_64.S")
 
@@ -11,3 +11,7 @@ MEMSET_FN(__memset,
 MEMSET_FN(memset_erms,
 	"x86-64-stosb",
 	"movsb-based memset() in arch/x86/lib/memset_64.S")
+
+MEMSET_FN(memset_movnti,
+	"x86-64-movnt",
+	"movnt-based memset() in arch/x86/lib/memset_64.S")
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 09/21] x86/asm: add clear_pages_movnt()
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (7 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 08/21] perf bench: " Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-10 22:11   ` Noah Goldstein
  2022-06-06 20:37 ` [PATCH v3 10/21] x86/asm: add clear_pages_clzero() Ankur Arora
                   ` (12 subsequent siblings)
  21 siblings, 1 reply; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Add clear_pages_movnt(), which uses MOVNTI as the underlying primitive.
With this, page-clearing can skip the memory hierarchy, thus providing
a non cache-polluting implementation of clear_pages().

MOVNTI, from the Intel SDM, Volume 2B, 4-101:
 "The non-temporal hint is implemented by using a write combining (WC)
  memory type protocol when writing the data to memory. Using this
  protocol, the processor does not write the data into the cache
  hierarchy, nor does it fetch the corresponding cache line from memory
  into the cache hierarchy."

The AMD Arch Manual has something similar to say as well.

One use-case is to zero large extents without bringing in never-to-be-
accessed cachelines. Also, clear_pages_movnt() based clearing is often
faster once extent sizes are O(LLC-size).

As the excerpt notes, MOVNTI is weakly ordered with respect to other
instructions operating on the memory hierarchy. This needs to be
handled by the caller by executing an SFENCE when done.
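
As a minimal user-space sketch of the same pattern (an illustration,
not part of this patch; the function and buffer are made up), the
weakly-ordered MOVNTI stores are followed by an SFENCE before the
buffer can safely be consumed:

  #include <immintrin.h>
  #include <stddef.h>

  /* Zero an 8-byte aligned buffer whose size is a multiple of 8. */
  static void zero_movnt(void *buf, size_t len)
  {
          long long *p = buf;
          size_t i;

          for (i = 0; i < len / sizeof(*p); i++)
                  _mm_stream_si64(&p[i], 0);      /* compiles to MOVNTI */

          _mm_sfence();   /* order the non-temporal stores with later accesses */
  }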

The implementation is straightforward: unroll the inner loop to keep
the code similar to memset_movnti(), so that we can gauge
clear_pages_movnt() performance via perf bench mem memset.

 # Intel Icelakex
 # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
 # (X86_FEATURE_ERMS) and x86-64-movnt:

 System:      Oracle X9-2 (2 nodes * 32 cores * 2 threads)
 Processor:   Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
 Memory:      512 GB evenly split between nodes
 LLC-size:    48MB for each node (32-cores * 2-threads)
 no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance

              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)    Delta(%)
              ----------------------    ---------------------    --------
     size            BW   (   stdev)          BW    (   stdev)

      2MB      14.37 GB/s ( +- 1.55)     12.59 GB/s ( +- 1.20)   -12.38%
     16MB      16.93 GB/s ( +- 2.61)     15.91 GB/s ( +- 2.74)    -6.02%
    128MB      12.12 GB/s ( +- 1.06)     22.33 GB/s ( +- 1.84)   +84.24%
   1024MB      12.12 GB/s ( +- 0.02)     23.92 GB/s ( +- 0.14)   +97.35%
   4096MB      12.08 GB/s ( +- 0.02)     23.98 GB/s ( +- 0.18)   +98.50%

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/page_64.h |  1 +
 arch/x86/lib/clear_page_64.S   | 21 +++++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index a88a3508888a..3affc4ecb8da 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -55,6 +55,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
 void clear_pages_orig(void *page, unsigned long npages);
 void clear_pages_rep(void *page, unsigned long npages);
 void clear_pages_erms(void *page, unsigned long npages);
+void clear_pages_movnt(void *page, unsigned long npages);
 
 #define __HAVE_ARCH_CLEAR_USER_PAGES
 static inline void clear_pages(void *page, unsigned int npages)
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 2cc3b681734a..83d14f1c9f57 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -58,3 +58,24 @@ SYM_FUNC_START(clear_pages_erms)
 	RET
 SYM_FUNC_END(clear_pages_erms)
 EXPORT_SYMBOL_GPL(clear_pages_erms)
+
+SYM_FUNC_START(clear_pages_movnt)
+	xorl	%eax,%eax
+	movq	%rsi,%rcx
+	shlq    $PAGE_SHIFT, %rcx
+
+	.p2align 4
+.Lstart:
+	movnti  %rax, 0x00(%rdi)
+	movnti  %rax, 0x08(%rdi)
+	movnti  %rax, 0x10(%rdi)
+	movnti  %rax, 0x18(%rdi)
+	movnti  %rax, 0x20(%rdi)
+	movnti  %rax, 0x28(%rdi)
+	movnti  %rax, 0x30(%rdi)
+	movnti  %rax, 0x38(%rdi)
+	addq    $0x40, %rdi
+	subl    $0x40, %ecx
+	ja      .Lstart
+	RET
+SYM_FUNC_END(clear_pages_movnt)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 10/21] x86/asm: add clear_pages_clzero()
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (8 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 09/21] x86/asm: add clear_pages_movnt() Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 11/21] x86/cpuid: add X86_FEATURE_MOVNT_SLOW Ankur Arora
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Add clear_pages_clzero(), which uses CLZERO as the clearing primitive.
CLZERO skips the memory hierarchy, so this provides a non-polluting
implementation of clear_page(). Available if X86_FEATURE_CLZERO is set.

CLZERO, from the AMD architecture guide (Vol 3, Rev 3.30):
 "Clears the cache line specified by the logical address in rAX by
  writing a zero to every byte in the line. The instruction uses an
  implied non temporal memory type, similar to a streaming store, and
  uses the write combining protocol to minimize cache pollution.

  CLZERO is weakly-ordered with respect to other instructions that
  operate on memory. Software should use an SFENCE or stronger to
  enforce memory ordering of CLZERO with respect to other store
  instructions.

  The CLZERO instruction executes at any privilege level. CLZERO
  performs all the segmentation and paging checks that a store of
  the specified cache line would perform."

The use-case is similar to clear_pages_movnt(), except that
clear_pages_clzero() is expected to be more performant.
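
For reference, a minimal user-space sketch of the primitive (an
illustration, not part of this patch): CLZERO takes the target address
implicitly in rAX and clears one 64-byte cache line per invocation, so
a clearing loop steps by the cache-line size and fences at the end:

  #include <stddef.h>

  /* Assumes a cache-line (64-byte) aligned buffer, len a multiple of 64. */
  static void zero_clzero(void *buf, size_t len)
  {
          char *p = buf;
          size_t off;

          for (off = 0; off < len; off += 64)
                  asm volatile("clzero" : : "a" (p + off) : "memory");

          asm volatile("sfence" ::: "memory");    /* CLZERO stores are weakly ordered */
  }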

Cc: jon.grimm@amd.com
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/page_64.h |  1 +
 arch/x86/lib/clear_page_64.S   | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 3affc4ecb8da..e8d4698fda65 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -56,6 +56,7 @@ void clear_pages_orig(void *page, unsigned long npages);
 void clear_pages_rep(void *page, unsigned long npages);
 void clear_pages_erms(void *page, unsigned long npages);
 void clear_pages_movnt(void *page, unsigned long npages);
+void clear_pages_clzero(void *page, unsigned long npages);
 
 #define __HAVE_ARCH_CLEAR_USER_PAGES
 static inline void clear_pages(void *page, unsigned int npages)
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 83d14f1c9f57..00203103cf77 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -79,3 +79,22 @@ SYM_FUNC_START(clear_pages_movnt)
 	ja      .Lstart
 	RET
 SYM_FUNC_END(clear_pages_movnt)
+
+/*
+ * Zero a page using clzero (On AMD, with CPU_FEATURE_CLZERO.)
+ *
+ * Caller needs to issue a sfence at the end.
+ */
+SYM_FUNC_START(clear_pages_clzero)
+	movq	%rdi,%rax
+	movq	%rsi,%rcx
+	shlq    $PAGE_SHIFT, %rcx
+
+	.p2align 4
+.Liter:
+	clzero
+	addq    $0x40, %rax
+	subl    $0x40, %ecx
+	ja      .Liter
+	RET
+SYM_FUNC_END(clear_pages_clzero)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 11/21] x86/cpuid: add X86_FEATURE_MOVNT_SLOW
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (9 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 10/21] x86/asm: add clear_pages_clzero() Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 12/21] sparse: add address_space __incoherent Ankur Arora
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

X86_FEATURE_MOVNT_SLOW denotes that clear_pages_movnt() is slower for
bulk page clearing (defined as LLC-sized or larger) than the standard
cached clear_page() idiom.

Microarchs where this is true would set this via check_movnt_quirks().
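
For illustration, a microarch opting in would do something like the
sketch below (the family check is a made-up placeholder; this patch
only adds the empty hook):

  void check_movnt_quirks(struct cpuinfo_x86 *c)
  {
  #ifdef CONFIG_X86_64
          /* Hypothetical: mark family 0x17 parts as slow for bulk MOVNT. */
          if (c->x86_vendor == X86_VENDOR_AMD && c->x86 == 0x17)
                  set_cpu_cap(c, X86_FEATURE_MOVNT_SLOW);
  #endif
  }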

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/kernel/cpu/amd.c          |  2 ++
 arch/x86/kernel/cpu/bugs.c         | 16 ++++++++++++++++
 arch/x86/kernel/cpu/cpu.h          |  2 ++
 arch/x86/kernel/cpu/intel.c        |  2 ++
 5 files changed, 23 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 393f2bbb5e3a..824bdb1d0da1 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -296,6 +296,7 @@
 #define X86_FEATURE_PER_THREAD_MBA	(11*32+ 7) /* "" Per-thread Memory Bandwidth Allocation */
 #define X86_FEATURE_SGX1		(11*32+ 8) /* "" Basic SGX */
 #define X86_FEATURE_SGX2		(11*32+ 9) /* "" SGX Enclave Dynamic Memory Management (EDMM) */
+#define X86_FEATURE_MOVNT_SLOW		(11*32+10) /* MOVNT is slow. (see check_movnt_quirks()) */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
 #define X86_FEATURE_AVX_VNNI		(12*32+ 4) /* AVX VNNI instructions */
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 0c0b09796ced..a5fe1420388d 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -891,6 +891,8 @@ static void init_amd(struct cpuinfo_x86 *c)
 	if (c->x86 >= 0x10)
 		set_cpu_cap(c, X86_FEATURE_REP_GOOD);
 
+	check_movnt_quirks(c);
+
 	/* get apicid instead of initial apic id from cpuid */
 	c->apicid = hard_smp_processor_id();
 
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index d879a6c93609..16e293654d34 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -85,6 +85,22 @@ EXPORT_SYMBOL_GPL(mds_idle_clear);
  */
 DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);
 
+/*
+ * check_movnt_quirks() sets X86_FEATURE_MOVNT_SLOW for uarchs where
+ * clear_pages_movnti() is slower for bulk page clearing than the standard
+ * cached clear_page() idiom (typically rep-stosb/rep-stosq.)
+ *
+ * (Bulk clearing defined as LLC-sized or larger.)
+ *
+ * x86_64 only since clear_pages_movnti() is only defined there.
+ */
+void check_movnt_quirks(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_X86_64
+
+#endif
+}
+
 void __init check_bugs(void)
 {
 	identify_boot_cpu();
diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h
index 2a8e584fc991..f53f07bf706f 100644
--- a/arch/x86/kernel/cpu/cpu.h
+++ b/arch/x86/kernel/cpu/cpu.h
@@ -83,4 +83,6 @@ extern void update_srbds_msr(void);
 
 extern u64 x86_read_arch_cap_msr(void);
 
+void check_movnt_quirks(struct cpuinfo_x86 *c);
+
 #endif /* ARCH_X86_CPU_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index fd5dead8371c..f0dc9b97dc8f 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -701,6 +701,8 @@ static void init_intel(struct cpuinfo_x86 *c)
 		c->x86_cache_alignment = c->x86_clflush_size * 2;
 	if (c->x86 == 6)
 		set_cpu_cap(c, X86_FEATURE_REP_GOOD);
+
+	check_movnt_quirks(c);
 #else
 	/*
 	 * Names for the Pentium II/Celeron processors
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 12/21] sparse: add address_space __incoherent
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (10 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 11/21] x86/cpuid: add X86_FEATURE_MOVNT_SLOW Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 13/21] clear_page: add generic clear_user_pages_incoherent() Ankur Arora
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Some CPU architectures provide store instructions that are weakly
ordered with respect to the rest of the local instruction stream.

Add a sparse address_space, __incoherent, to denote regions accessed
with such stores.
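
As an illustrative example (not part of this patch) of what the
annotation buys, given a struct page *pg, sparse now warns when an
__incoherent pointer silently escapes into a plain one:

  void *p;
  __incoherent void *ip = (__incoherent void *)page_address(pg);

  p = ip;                       /* sparse: different address spaces */
  p = (__force void *)ip;       /* OK: annotation dropped explicitly */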

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/compiler_types.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index d08dfcb0ac68..8e3e736fc82f 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -19,6 +19,7 @@
 # define __iomem	__attribute__((noderef, address_space(__iomem)))
 # define __percpu	__attribute__((noderef, address_space(__percpu)))
 # define __rcu		__attribute__((noderef, address_space(__rcu)))
+# define __incoherent	__attribute__((noderef, address_space(__incoherent)))
 static inline void __chk_user_ptr(const volatile void __user *ptr) { }
 static inline void __chk_io_ptr(const volatile void __iomem *ptr) { }
 /* context/locking */
@@ -45,6 +46,7 @@ static inline void __chk_io_ptr(const volatile void __iomem *ptr) { }
 # define __iomem
 # define __percpu	BTF_TYPE_TAG(percpu)
 # define __rcu
+# define __incoherent
 # define __chk_user_ptr(x)	(void)0
 # define __chk_io_ptr(x)	(void)0
 /* context/locking */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 13/21] clear_page: add generic clear_user_pages_incoherent()
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (11 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 12/21] sparse: add address_space __incoherent Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-08  0:01   ` Luc Van Oostenryck
  2022-06-06 20:37 ` [PATCH v3 14/21] x86/clear_page: add clear_pages_incoherent() Ankur Arora
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Add generic primitives for clear_user_pages_incoherent() and
clear_page_make_coherent().

To ensure that callers don't mix accesses to different types
of address_spaces, annotate clear_user_pages_incoherent()
as taking an __incoherent pointer as argument.

Also add clear_user_highpages_incoherent(), which either calls
clear_user_pages_incoherent() or falls back to clear_user_highpages().
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---

Notes:
    clear_user_highpages_incoherent() operates on an __incoherent region
    and expects the caller to call clear_page_make_coherent().
    
    It should, however, be taking an __incoherent * as argument -- it
    does not, because I couldn't see a clean way of doing that with
    highmem. Suggestions?

 include/asm-generic/clear_page.h | 21 +++++++++++++++++++++
 include/linux/highmem.h          | 23 +++++++++++++++++++++++
 2 files changed, 44 insertions(+)

diff --git a/include/asm-generic/clear_page.h b/include/asm-generic/clear_page.h
index f827d661519c..0ebff70a60a9 100644
--- a/include/asm-generic/clear_page.h
+++ b/include/asm-generic/clear_page.h
@@ -16,6 +16,9 @@
 #if defined(CONFIG_HIGHMEM) && defined(__HAVE_ARCH_CLEAR_USER_PAGES)
 #error CONFIG_HIGHMEM is incompatible with __HAVE_ARCH_CLEAR_USER_PAGES
 #endif
+#if defined(CONFIG_HIGHMEM) && defined(__HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT)
+#error CONFIG_HIGHMEM is incompatible with __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT
+#endif
 
 #ifndef __HAVE_ARCH_CLEAR_USER_PAGES
 
@@ -41,4 +44,22 @@ static inline void clear_user_pages(void *page, unsigned long vaddr,
 
 #define ARCH_MAX_CLEAR_PAGES	(1 << ARCH_MAX_CLEAR_PAGES_ORDER)
 
+#ifndef __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT
+#ifndef __ASSEMBLY__
+/*
+ * Fallback path (via clear_user_pages()) if the architecture does not
+ * support incoherent clearing.
+ */
+static inline void clear_user_pages_incoherent(__incoherent void *page,
+					       unsigned long vaddr,
+					       struct page *pg,
+					       unsigned int npages)
+{
+	clear_user_pages((__force void *)page, vaddr, pg, npages);
+}
+
+static inline void clear_page_make_coherent(void) { }
+#endif /* __ASSEMBLY__ */
+#endif /* __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT */
+
 #endif /* __ASM_GENERIC_CLEAR_PAGE_H */
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 08781d7693e7..90179f623c3b 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -231,6 +231,29 @@ static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
 }
 #endif /* __HAVE_ARCH_CLEAR_USER_PAGES */
 
+#ifdef __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT
+static inline void clear_user_highpages_incoherent(struct page *page,
+						   unsigned long vaddr,
+						   unsigned int npages)
+{
+	__incoherent void *addr = (__incoherent void *) page_address(page);
+
+	clear_user_pages_incoherent(addr, vaddr, page, npages);
+}
+#else
+static inline void clear_user_highpages_incoherent(struct page *page,
+						   unsigned long vaddr,
+						   unsigned int npages)
+{
+	/*
+	 * We fallback to clear_user_highpages() for the CONFIG_HIGHMEM
+	 * configs.
+	 * For !CONFIG_HIGHMEM, this will get translated to clear_user_pages().
+	 */
+	clear_user_highpages(page, vaddr, npages);
+}
+#endif /* __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT */
+
 #ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
 /**
  * alloc_zeroed_user_highpage_movable - Allocate a zeroed HIGHMEM page for a VMA that the caller knows can move
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 14/21] x86/clear_page: add clear_pages_incoherent()
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (12 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 13/21] clear_page: add generic clear_user_pages_incoherent() Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 15/21] mm/clear_page: add clear_page_non_caching_threshold() Ankur Arora
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Expose incoherent clearing primitives (clear_pages_movnt(),
clear_pages_clzero()) as alternatives via clear_pages_incoherent().

Fall back to clear_pages() if X86_FEATURE_MOVNT_SLOW is set and
the CPU does not have X86_FEATURE_CLZERO.

Both these primitives use weakly-ordered stores. To ensure that
callers don't mix accesses to different types of address_spaces,
annotate clear_user_pages_incoherent() and clear_pages_incoherent()
as taking __incoherent pointers as arguments.

Also add clear_page_make_coherent() which provides the necessary
store fence to make access to these __incoherent regions safe.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/page.h    | 13 +++++++++++++
 arch/x86/include/asm/page_64.h | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 045eaab08f43..8fc6cc6759b9 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -40,6 +40,19 @@ static inline void clear_user_page(void *page, unsigned long vaddr,
 	clear_page(page);
 }
 
+#ifdef __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT /* x86_64 */
+/*
+ * clear_pages_incoherent: valid on only __incoherent memory regions.
+ */
+static inline void clear_user_pages_incoherent(__incoherent void *page,
+					       unsigned long vaddr,
+					       struct page *pg,
+					       unsigned int npages)
+{
+	clear_pages_incoherent(page, npages);
+}
+#endif /* __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT */
+
 static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
 				  struct page *topage)
 {
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index e8d4698fda65..78417f63f522 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -69,6 +69,40 @@ static inline void clear_pages(void *page, unsigned int npages)
 			   : "cc", "memory", "rax", "rcx");
 }
 
+#define __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT
+/*
+ * clear_pages_incoherent: only allowed on __incoherent memory regions.
+ */
+static inline void clear_pages_incoherent(__incoherent void *page,
+					  unsigned int npages)
+{
+	alternative_call_2(clear_pages_movnt,
+			   clear_pages, X86_FEATURE_MOVNT_SLOW,
+			   clear_pages_clzero, X86_FEATURE_CLZERO,
+			   "=D" (page), "S" ((unsigned long) npages),
+			   "0" (page)
+			   : "cc", "memory", "rax", "rcx");
+}
+
+/*
+ * clear_page_make_coherent: execute the necessary store fence
+ * after which __incoherent regions can be safely accessed.
+ */
+static inline void clear_page_make_coherent(void)
+{
+	/*
+	 * Keep the sfence for oldinstr and clzero separate to guard against
+	 * the possibility that a CPU has both X86_FEATURE_MOVNT_SLOW and
+	 * X86_FEATURE_CLZERO.
+	 *
+	 * The alternatives need to be in the same order as the ones
+	 * in clear_pages_incoherent().
+	 */
+	alternative_2("sfence",
+		      "", X86_FEATURE_MOVNT_SLOW,
+		      "sfence", X86_FEATURE_CLZERO);
+}
+
 void copy_page(void *to, void *from);
 
 #ifdef CONFIG_X86_5LEVEL
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 15/21] mm/clear_page: add clear_page_non_caching_threshold()
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (13 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 14/21] x86/clear_page: add clear_pages_incoherent() Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 16/21] x86/clear_page: add arch_clear_page_non_caching_threshold() Ankur Arora
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Introduce clear_page_non_caching_threshold_pages, which specifies the
threshold above which clear_pages_incoherent() is used.

The ideal threshold value depends on the CPU uarch and where the
performance curves for cached and non-cached stores intersect.

Typically this would depend on microarchitectural details and
the LLC-size. Here, we arbitrarily choose a default value of
8MB (CLEAR_PAGE_NON_CACHING_THRESHOLD), a reasonably large LLC.

Also define clear_page_prefer_non_caching(), which provides the
interface for querying this.
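
For instance (a sketch, not part of the patch), with the default 8MB
threshold a 2MB huge page stays on the cached path while a 1GB
gigantic page takes the non-caching one:

  clear_page_prefer_non_caching(HPAGE_PMD_NR * PAGE_SIZE);  /* 2MB:  false */
  clear_page_prefer_non_caching(1UL << 30);                 /* 1GB:  true  */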

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/asm-generic/clear_page.h |  4 ++++
 include/linux/mm.h               |  6 ++++++
 mm/memory.c                      | 25 +++++++++++++++++++++++++
 3 files changed, 35 insertions(+)

diff --git a/include/asm-generic/clear_page.h b/include/asm-generic/clear_page.h
index 0ebff70a60a9..b790000661ce 100644
--- a/include/asm-generic/clear_page.h
+++ b/include/asm-generic/clear_page.h
@@ -62,4 +62,8 @@ static inline void clear_page_make_coherent(void) { }
 #endif /* __ASSEMBLY__ */
 #endif /* __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT */
 
+#ifndef __ASSEMBLY__
+extern unsigned long __init arch_clear_page_non_caching_threshold(void);
+#endif
+
 #endif /* __ASM_GENERIC_CLEAR_PAGE_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bc8f326be0ce..5084571b2fb6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3328,6 +3328,12 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
 				   (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)));
 }
 
+extern bool clear_page_prefer_non_caching(unsigned long extent);
+#else /* !(CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS) */
+static inline bool clear_page_prefer_non_caching(unsigned long extent)
+{
+	return false;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
diff --git a/mm/memory.c b/mm/memory.c
index 04c6bb5d75f6..b78b32a3e915 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5563,10 +5563,28 @@ EXPORT_SYMBOL(__might_fault);
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 
+/*
+ * Default size beyond which huge page clearing uses the non-caching
+ * path. Size it for a reasonable sized LLC.
+ */
+#define CLEAR_PAGE_NON_CACHING_THRESHOLD	(8 << 20)
 static unsigned int __ro_after_init clear_page_unit = 1;
+
+static unsigned long __read_mostly clear_page_non_caching_threshold_pages =
+				CLEAR_PAGE_NON_CACHING_THRESHOLD / PAGE_SIZE;
+
+/* Arch code can override for a machine specific value. */
+unsigned long __weak __init arch_clear_page_non_caching_threshold(void)
+{
+	return CLEAR_PAGE_NON_CACHING_THRESHOLD;
+}
+
 static int __init setup_clear_page_params(void)
 {
 	clear_page_unit = 1 << min(MAX_ORDER - 1, ARCH_MAX_CLEAR_PAGES_ORDER);
+
+	clear_page_non_caching_threshold_pages =
+		arch_clear_page_non_caching_threshold() / PAGE_SIZE;
 	return 0;
 }
 
@@ -5576,6 +5594,13 @@ static int __init setup_clear_page_params(void)
  */
 late_initcall(setup_clear_page_params);
 
+bool clear_page_prefer_non_caching(unsigned long extent)
+{
+	unsigned long pages = extent / PAGE_SIZE;
+
+	return pages >= clear_page_non_caching_threshold_pages;
+}
+
 /*
  * Clear a page extent.
  *
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 16/21] x86/clear_page: add arch_clear_page_non_caching_threshold()
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (14 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 15/21] mm/clear_page: add clear_page_non_caching_threshold() Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 17/21] clear_huge_page: use non-cached clearing Ankur Arora
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Add arch_clear_page_non_caching_threshold() for a machine-specific value
above which clear_pages_incoherent() would be used.

The ideal threshold value depends on the CPU model and where the
performance curves for caching and non-caching stores intersect.
A safe value is LLC-size, so we use that of the boot_cpu.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/cacheinfo.h |  1 +
 arch/x86/kernel/cpu/cacheinfo.c  | 13 +++++++++++++
 arch/x86/kernel/setup.c          |  6 ++++++
 3 files changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/cacheinfo.h b/arch/x86/include/asm/cacheinfo.h
index 86b2e0dcc4bf..5c6045699e94 100644
--- a/arch/x86/include/asm/cacheinfo.h
+++ b/arch/x86/include/asm/cacheinfo.h
@@ -4,5 +4,6 @@
 
 void cacheinfo_amd_init_llc_id(struct cpuinfo_x86 *c, int cpu);
 void cacheinfo_hygon_init_llc_id(struct cpuinfo_x86 *c, int cpu);
+int cacheinfo_lookup_max_size(int cpu);
 
 #endif /* _ASM_X86_CACHEINFO_H */
diff --git a/arch/x86/kernel/cpu/cacheinfo.c b/arch/x86/kernel/cpu/cacheinfo.c
index fe98a1465be6..6fb0cb868099 100644
--- a/arch/x86/kernel/cpu/cacheinfo.c
+++ b/arch/x86/kernel/cpu/cacheinfo.c
@@ -1034,3 +1034,16 @@ int populate_cache_leaves(unsigned int cpu)
 
 	return 0;
 }
+
+int cacheinfo_lookup_max_size(int cpu)
+{
+	struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
+	struct cacheinfo *this_leaf = this_cpu_ci->info_list;
+	struct cacheinfo *max_leaf;
+
+	/*
+	 * Assume that cache sizes always increase with level.
+	 */
+	max_leaf = this_leaf + this_cpu_ci->num_leaves - 1;
+	return max_leaf->size;
+}
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 249981bf3d8a..701825a22863 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -50,6 +50,7 @@
 #include <asm/thermal.h>
 #include <asm/unwind.h>
 #include <asm/vsyscall.h>
+#include <asm/cacheinfo.h>
 #include <linux/vmalloc.h>
 
 /*
@@ -1293,3 +1294,8 @@ static int __init register_kernel_offset_dumper(void)
 	return 0;
 }
 __initcall(register_kernel_offset_dumper);
+
+unsigned long __init arch_clear_page_non_caching_threshold(void)
+{
+	return cacheinfo_lookup_max_size(0);
+}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 17/21] clear_huge_page: use non-cached clearing
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (15 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 16/21] x86/clear_page: add arch_clear_page_non_caching_threshold() Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 18/21] gup: add FOLL_HINT_BULK, FAULT_FLAG_NON_CACHING Ankur Arora
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Non-caching stores are suitable for circumstances where the destination
region is unlikely to be read again soon, or is large enough that
there's no expectation that we will find the data in the cache.

Add a new parameter to clear_user_extent(), which handles the
non-caching clearing path for huge and gigantic pages. This needs a
final clear_page_make_coherent() operation since non-cached clearing
typically involves weakly ordered stores that are incoherent wrt other
operations in the memory hierarchy.

This path is always invoked for gigantic pages; for huge pages it is
used only if pages_per_huge_page exceeds an architectural threshold, or
if the user gives an explicit hint (if, for instance, this call is part
of a larger clearing operation).

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/mm.h |  3 ++-
 mm/huge_memory.c   |  3 ++-
 mm/hugetlb.c       |  3 ++-
 mm/memory.c        | 50 +++++++++++++++++++++++++++++++++++++++-------
 4 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5084571b2fb6..a9b0c1889348 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3302,7 +3302,8 @@ enum mf_action_page_type {
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 extern void clear_huge_page(struct page *page,
 			    unsigned long addr_hint,
-			    unsigned int pages_per_huge_page);
+			    unsigned int pages_per_huge_page,
+			    bool non_cached);
 extern void copy_user_huge_page(struct page *dst, struct page *src,
 				unsigned long addr_hint,
 				struct vm_area_struct *vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a77c78a2b6b5..73654db77a1c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -594,6 +594,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 	pgtable_t pgtable;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	vm_fault_t ret = 0;
+	bool non_cached = false;
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
@@ -611,7 +612,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		goto release;
 	}
 
-	clear_huge_page(page, vmf->address, HPAGE_PMD_NR);
+	clear_huge_page(page, vmf->address, HPAGE_PMD_NR, non_cached);
 	/*
 	 * The memory barrier inside __SetPageUptodate makes sure that
 	 * clear_huge_page writes become visible before the set_pmd_at()
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7c468ac1d069..0c4a31b5c1e9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5481,6 +5481,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
 	bool new_page, new_pagecache_page = false;
+	bool non_cached = false;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -5536,7 +5537,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			spin_unlock(ptl);
 			goto out;
 		}
-		clear_huge_page(page, address, pages_per_huge_page(h));
+		clear_huge_page(page, address, pages_per_huge_page(h), non_cached);
 		__SetPageUptodate(page);
 		new_page = true;
 
diff --git a/mm/memory.c b/mm/memory.c
index b78b32a3e915..0638dc56828f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5606,11 +5606,18 @@ bool clear_page_prefer_non_caching(unsigned long extent)
  *
  * With ARCH_MAX_CLEAR_PAGES == 1, clear_user_highpages() drops down
  * to page-at-a-time mode. Or, funnels through to clear_user_pages().
+ *
+ * With coherent == false, we use incoherent stores and the caller is
+ * responsible for making the region coherent again by calling
+ * clear_page_make_coherent().
  */
 static void clear_user_extent(struct page *start_page, unsigned long vaddr,
-			      unsigned int npages)
+			      unsigned int npages, bool coherent)
 {
-	clear_user_highpages(start_page, vaddr, npages);
+	if (coherent)
+		clear_user_highpages(start_page, vaddr, npages);
+	else
+		clear_user_highpages_incoherent(start_page, vaddr, npages);
 }
 
 struct subpage_arg {
@@ -5709,6 +5716,13 @@ static void clear_gigantic_page(struct page *page,
 {
 	int i;
 	struct page *p = page;
+	bool coherent;
+
+	/*
+	 * Gigantic pages are large enough, that there are no cache
+	 * expectations. Use the incoherent path.
+	 */
+	coherent = false;
 
 	might_sleep();
 	for (i = 0; i < pages_per_huge_page;
@@ -5718,9 +5732,16 @@ static void clear_gigantic_page(struct page *page,
 		 * guarantees that p[0] and p[clear_page_unit-1]
 		 * never straddle a mem_map discontiguity.
 		 */
-		clear_user_extent(p, base_addr + i * PAGE_SIZE, clear_page_unit);
+		clear_user_extent(p, base_addr + i * PAGE_SIZE,
+				  clear_page_unit, coherent);
 		cond_resched();
 	}
+
+	/*
+	 * We need to make sure that writes above are ordered before
+	 * updating the PTE and marking SetPageUptodate().
+	 */
+	clear_page_make_coherent();
 }
 
 static void clear_subpages(struct subpage_arg *sa,
@@ -5736,15 +5757,16 @@ static void clear_subpages(struct subpage_arg *sa,
 
 		n = min(clear_page_unit, remaining);
 
-		clear_user_extent(page + i, base_addr + i * PAGE_SIZE, n);
+		clear_user_extent(page + i, base_addr + i * PAGE_SIZE,
+				  n, true);
 		i += n;
 
 		cond_resched();
 	}
 }
 
-void clear_huge_page(struct page *page,
-		     unsigned long addr_hint, unsigned int pages_per_huge_page)
+void clear_huge_page(struct page *page, unsigned long addr_hint,
+		     unsigned int pages_per_huge_page, bool non_cached)
 {
 	unsigned long addr = addr_hint &
 		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
@@ -5755,7 +5777,21 @@ void clear_huge_page(struct page *page,
 		.page_unit = clear_page_unit,
 	};
 
-	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
+	/*
+	 * The non-caching path is typically slower for small extents so use
+	 * it only if the caller explicitly hints it or if the extent is
+	 * large enough that there are no cache expectations.
+	 *
+	 * We let the gigantic page path handle the details.
+	 */
+	non_cached |=
+		clear_page_prefer_non_caching(pages_per_huge_page * PAGE_SIZE);
+
+	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES || non_cached)) {
+		/*
+		 * Gigantic page clearing always uses incoherent clearing
+		 * internally.
+		 */
 		clear_gigantic_page(page, addr, pages_per_huge_page);
 		return;
 	}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 18/21] gup: add FOLL_HINT_BULK, FAULT_FLAG_NON_CACHING
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (16 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 17/21] clear_huge_page: use non-cached clearing Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 19/21] gup: hint non-caching if clearing large regions Ankur Arora
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

Add FOLL_HINT_BULK, which callers of get_user_pages() and pin_user_pages()
can use to signal that this call is one of many, allowing
get_user_pages() to optimize accordingly.

Additionally, add FAULT_FLAG_NON_CACHING, which in the fault handling
path signals that the underlying logic can use non-caching primitives.
This is a possible optimization for FOLL_HINT_BULK calls.
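
As an illustration, a pinning path that knows it is iterating over a
much larger region would pass the hint like this (hypothetical caller,
shown only to make the intended use concrete):

    long pinned;

    /*
     * Hypothetical caller, for illustration only: pin one chunk of a
     * much larger region, and tell GUP so via FOLL_HINT_BULK.
     */
    pinned = pin_user_pages(start, nr_pages,
                            FOLL_WRITE | FOLL_LONGTERM | FOLL_HINT_BULK,
                            pages, NULL);
    if (pinned < 0)
            return pinned;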

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/mm.h       | 1 +
 include/linux/mm_types.h | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a9b0c1889348..dbd8b7344dfc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2941,6 +2941,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
 #define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
+#define FOLL_HINT_BULK	0x100000 /* part of a larger extent being gup'd */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b34ff2cdbc4f..287b3018c14d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -824,6 +824,7 @@ typedef struct {
  *                      mapped R/O.
  * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
  *                        We should only access orig_pte if this flag set.
+ * @FAULT_FLAG_NON_CACHING: Avoid polluting the cache if possible.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -861,6 +862,7 @@ enum fault_flag {
 	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
 	FAULT_FLAG_UNSHARE =		1 << 10,
 	FAULT_FLAG_ORIG_PTE_VALID =	1 << 11,
+	FAULT_FLAG_NON_CACHING =	1 << 12,
 };
 
 typedef unsigned int __bitwise zap_flags_t;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 19/21] gup: hint non-caching if clearing large regions
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (17 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 18/21] gup: add FOLL_HINT_BULK, FAULT_FLAG_NON_CACHING Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 20/21] vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages() Ankur Arora
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

When clearing a large region, or when the user explicitly hints
via FOLL_HINT_BULK that a call to get_user_pages() is part of a larger
region being gup'd, take the non-caching path.

One notable limitation is that this is only done when the underlying
pages are huge or gigantic, even if a large region composed of PAGE_SIZE
pages is being cleared. This is because non-caching stores are generally
weakly ordered and need some kind of store fence -- at PTE write
granularity -- to avoid data leakage. This is expensive enough to
negate any performance advantage.
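
Concretely, a non-cached clear has to be sequenced roughly like this
(sketch only, using the helpers introduced earlier in the series;
locking and error handling elided):

    /*
     * Sketch of the ordering requirement: incoherent (non-temporal)
     * stores must be fenced before the page is published via the PTE.
     * clear_page_make_coherent() supplies the store fence (an sfence
     * on x86).
     */
    clear_user_highpages_incoherent(page, vaddr, npages);
    clear_page_make_coherent();
    __SetPageUptodate(page);
    set_pte_at(vma->vm_mm, vaddr, ptep, entry);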

Performance
==

System:    Oracle X9-2c (2 nodes * 32 cores * 2 threads)
Processor: Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
Memory:    1024 GB evenly split between nodes
LLC-size:  48MB for each node (32-cores * 2-threads)
no_turbo: 1, Microcode: 0xd0002c1, scaling-governor: performance

System:    Oracle E4-2c (2 nodes * 8 CCXes * 8 cores * 2 threads)
Processor: AMD EPYC 7J13 64-Core Processor (Milan, 25:1:1)
Memory:    512 GB evenly split between nodes
LLC-size:  32MB for each CCX (8-cores * 2-threads)
boost: 1, Microcode: 0xa00115d, scaling-governor: performance

Two workloads: qemu VM creation, where it is the exclusive load, and --
to probe how these changes affect the caches of unrelated processes --
a kbuild running alongside a background page-clearing workload.

Workload: create a 192GB qemu-VM (backed by preallocated 2MB
pages on the local node)
==

Icelakex
--
                          Time (s)        Delta (%)
 clear_pages_erms()    16.49 ( +- 0.06s )            # 12.50 bytes/ns
 clear_pages_movnt()    9.42 ( +- 0.20s )  -42.87%   # 21.88 bytes/ns

It is easy enough to see where the improvement is coming from -- with
the non-caching stores, the CPU does not need to do any RFOs, ending up
with far fewer L1-dcache-load-misses:

-      407,619,058      L1-dcache-loads           #   24.746 M/sec                    ( +-  0.17% )  (69.20%)
-    3,245,399,461      L1-dcache-load-misses     #  801.49% of all L1-dcache accesses  ( +-  0.01% )  (69.22%)
+      393,160,148      L1-dcache-loads           #   41.786 M/sec                    ( +-  0.80% )  (69.22%)
+        5,790,543      L1-dcache-load-misses     #    1.50% of all L1-dcache accesses  ( +-  1.55% )  (69.26%)

(Fuller perf stat output, at [1], [2].)

Milan
--
                          Time (s)          Delta
 clear_pages_erms()    11.83 ( +- 0.08s )            # 17.42 bytes/ns
 clear_pages_clzero()   4.91 ( +- 0.27s )  -58.49%   # 41.98 bytes/ns

Milan does significantly fewer RFOs as well.

-    6,882,968,897      L1-dcache-loads           #  582.960 M/sec                    ( +-  0.03% )  (33.38%)
-    3,267,546,914      L1-dcache-load-misses     #   47.45% of all L1-dcache accesses  ( +-  0.02% )  (33.37%)
+      418,489,450      L1-dcache-loads           #   85.611 M/sec                    ( +-  1.19% )  (33.46%)
+        5,406,557      L1-dcache-load-misses     #    1.35% of all L1-dcache accesses  ( +-  1.07% )  (33.45%)

(Fuller perf stat output, at [3], [4].)

Workload: Kbuild with background clear_huge_page()
==

Probe the cache-pollution aspect of this commit with a kbuild
(make -j 32 bzImage) alongside a background process doing
clear_huge_page() via mmap(length=64GB, flags=MAP_POPULATE|MAP_HUGE_2MB)
in a loop.
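
The background load is essentially the following (a minimal
reconstruction; the actual test program is not part of this posting):

    /* Fault in 64GB of 2MB huge pages, unmap, repeat. */
    #include <sys/mman.h>

    #ifndef MAP_HUGE_2MB
    #define MAP_HUGE_2MB (21 << 26)         /* log2(2MB) << MAP_HUGE_SHIFT */
    #endif

    int main(void)
    {
            const size_t len = 64UL << 30;  /* 64GB */

            for (;;) {
                    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
                                   MAP_HUGE_2MB | MAP_POPULATE, -1, 0);

                    if (p == MAP_FAILED)
                            return 1;
                    munmap(p, len);
            }
    }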

The expectation -- assuming kbuild performance is partly cache
limited -- is that kbuild slows down more with the
clear_huge_page() -> clear_pages_erms() background load than with
clear_huge_page() -> clear_pages_movnt(). The kbuild itself does not
use THP or similar, so any performance changes are due to the
background load.

Icelakex
--

 # kbuild: 16 cores, 32 threads
 # clear_huge_page() load: single thread bound to the same CPUset
 # taskset -c 16-31,80-95 perf stat -r 5 -ddd	\
	make -C .. -j 32 O=b2 clean bzImage

-  8,226,884,900,694      instructions              #    1.09  insn per cycle           ( +-  0.02% )  (47.27%)
+  8,223,413,950,371      instructions              #    1.12  insn per cycle           ( +-  0.03% )  (47.31%)

- 20,016,410,480,886      slots                     #    6.565 G/sec                    ( +-  0.01% )  (69.84%)
-  1,310,070,777,023      topdown-be-bound          #      6.1% backend bound           ( +-  0.28% )  (69.84%)
+ 19,328,950,611,944      slots                     #    6.494 G/sec                    ( +-  0.02% )  (69.87%)
+  1,043,408,291,623      topdown-be-bound          #      5.0% backend bound           ( +-  0.33% )  (69.87%)

-     10,747,834,729      LLC-loads                 #    3.525 M/sec                    ( +-  0.05% )  (69.68%)
-      4,841,355,743      LLC-load-misses           #   45.02% of all LL-cache accesses  ( +-  0.06% )  (69.70%)
+     10,466,865,056      LLC-loads                 #    3.517 M/sec                    ( +-  0.08% )  (69.68%)
+      4,206,944,783      LLC-load-misses           #   40.21% of all LL-cache accesses  ( +-  0.06% )  (69.71%)

The LLC-load-misses show a significant improvement (-13.11%), which is
borne out by the (-20.35%) reduction in topdown-be-bound and a (2.7%)
improvement in IPC.

- 7,521,157,276,899      cycles                    #    2.467 GHz                      ( +-  0.02% )  (39.65%)
+ 7,348,971,235,549      cycles                    #    2.469 GHz                      ( +-  0.04% )  (39.68%)

This ends up with an overall improvement in cycles of (-2.28%).

(Fuller perf stat output, at [5], [6].)

Milan
--

 # kbuild: 2 CCxes, 16 cores, 32 threads
 # clear_huge_page() load: single thread bound to the same CPUset
 # taskset -c 16-31,144-159 perf stat -r 5 -ddd	\
	make -C .. -j 32 O=b2 clean bzImage

-   302,739,130,717      stalled-cycles-backend    #    3.82% backend cycles idle      ( +-  0.10% )  (41.11%)
+   287,703,667,307      stalled-cycles-backend    #    3.74% backend cycles idle      ( +-  0.04% )  (41.11%)

- 8,981,403,534,446      instructions              #    1.13  insn per cycle
+ 8,969,062,192,998      instructions              #    1.16  insn per cycle

Milan sees a (-4.96%) improvement in stalled-cycles-backend and
a (2.65%) improvement in IPC.

- 7,930,842,057,103      cycles                    #    2.338 GHz                      ( +-  0.04% )  (41.09%)
+ 7,705,812,395,365      cycles                    #    2.339 GHz                      ( +-  0.01% )  (41.11%)

This ends up with an overall improvement in cycles of (-2.83%).

(Fuller perf stat output, at [7], [8].)

[1] Icelakex, clear_pages_erms()
 # perf stat -r 5 --all-kernel -ddd ./qemu.sh

 Performance counter stats for './qemu.sh' (5 runs):

         16,329.41 msec task-clock                #    0.990 CPUs utilized            ( +-  0.42% )
               143      context-switches          #    8.681 /sec                     ( +-  0.93% )
                 1      cpu-migrations            #    0.061 /sec                     ( +- 63.25% )
               118      page-faults               #    7.164 /sec                     ( +-  0.27% )
    41,735,523,673      cycles                    #    2.534 GHz                      ( +-  0.42% )  (38.46%)
     1,454,116,543      instructions              #    0.03  insn per cycle           ( +-  0.49% )  (46.16%)
       266,749,920      branches                  #   16.194 M/sec                    ( +-  0.41% )  (53.86%)
           928,726      branch-misses             #    0.35% of all branches          ( +-  0.38% )  (61.54%)
   208,805,754,709      slots                     #   12.676 G/sec                    ( +-  0.41% )  (69.23%)
     5,355,889,366      topdown-retiring          #      2.5% retiring                ( +-  0.50% )  (69.23%)
    12,720,749,784      topdown-bad-spec          #      6.1% bad speculation         ( +-  1.38% )  (69.23%)
       998,710,552      topdown-fe-bound          #      0.5% frontend bound          ( +-  0.85% )  (69.23%)
   192,653,197,875      topdown-be-bound          #     90.9% backend bound           ( +-  0.38% )  (69.23%)
       407,619,058      L1-dcache-loads           #   24.746 M/sec                    ( +-  0.17% )  (69.20%)
     3,245,399,461      L1-dcache-load-misses     #  801.49% of all L1-dcache accesses  ( +-  0.01% )  (69.22%)
        10,805,747      LLC-loads                 #  656.009 K/sec                    ( +-  0.37% )  (69.25%)
           804,475      LLC-load-misses           #    7.44% of all LL-cache accesses  ( +-  2.73% )  (69.26%)
   <not supported>      L1-icache-loads
        18,134,527      L1-icache-load-misses                                         ( +-  1.24% )  (30.80%)
       435,474,462      dTLB-loads                #   26.437 M/sec                    ( +-  0.28% )  (30.80%)
            41,187      dTLB-load-misses          #    0.01% of all dTLB cache accesses  ( +-  4.06% )  (30.79%)
   <not supported>      iTLB-loads
           440,135      iTLB-load-misses                                              ( +-  1.07% )  (30.78%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

           16.4906 +- 0.0676 seconds time elapsed  ( +-  0.41% )

[2] Icelakex, clear_pages_movnt()
 # perf stat -r 5 --all-kernel -ddd ./qemu.sh

 Performance counter stats for './qemu.sh' (5 runs):

          9,896.77 msec task-clock                #    1.050 CPUs utilized            ( +-  2.08% )
               135      context-switches          #   14.348 /sec                     ( +-  0.74% )
                 0      cpu-migrations            #    0.000 /sec
               116      page-faults               #   12.329 /sec                     ( +-  0.50% )
    25,239,642,558      cycles                    #    2.683 GHz                      ( +-  2.11% )  (38.43%)
    36,791,658,500      instructions              #    1.54  insn per cycle           ( +-  0.06% )  (46.12%)
     3,475,279,229      branches                  #  369.361 M/sec                    ( +-  0.09% )  (53.82%)
         1,987,098      branch-misses             #    0.06% of all branches          ( +-  0.71% )  (61.51%)
   126,256,220,768      slots                     #   13.419 G/sec                    ( +-  2.10% )  (69.21%)
    57,705,186,453      topdown-retiring          #     47.8% retiring                ( +-  0.28% )  (69.21%)
     5,934,729,245      topdown-bad-spec          #      4.3% bad speculation         ( +-  5.91% )  (69.21%)
     4,089,990,217      topdown-fe-bound          #      3.1% frontend bound          ( +-  2.11% )  (69.21%)
    60,298,426,167      topdown-be-bound          #     44.8% backend bound           ( +-  4.21% )  (69.21%)
       393,160,148      L1-dcache-loads           #   41.786 M/sec                    ( +-  0.80% )  (69.22%)
         5,790,543      L1-dcache-load-misses     #    1.50% of all L1-dcache accesses  ( +-  1.55% )  (69.26%)
         1,069,049      LLC-loads                 #  113.621 K/sec                    ( +-  1.25% )  (69.27%)
           728,260      LLC-load-misses           #   70.65% of all LL-cache accesses  ( +-  2.63% )  (69.30%)
   <not supported>      L1-icache-loads
        14,620,549      L1-icache-load-misses                                         ( +-  1.27% )  (30.80%)
       404,962,421      dTLB-loads                #   43.040 M/sec                    ( +-  1.13% )  (30.80%)
            31,916      dTLB-load-misses          #    0.01% of all dTLB cache accesses  ( +-  4.61% )  (30.77%)
   <not supported>      iTLB-loads
           396,984      iTLB-load-misses                                              ( +-  2.23% )  (30.74%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

             9.428 +- 0.206 seconds time elapsed  ( +-  2.18% )

[3] Milan, clear_pages_erms()
 # perf stat -r 5 --all-kernel -ddd ./qemu.sh

 Performance counter stats for './qemu.sh' (5 runs):

         11,676.79 msec task-clock                #    0.987 CPUs utilized            ( +-  0.68% )
                96      context-switches          #    8.131 /sec                     ( +-  0.78% )
                 2      cpu-migrations            #    0.169 /sec                     ( +- 18.71% )
               106      page-faults               #    8.978 /sec                     ( +-  0.23% )
    28,161,726,414      cycles                    #    2.385 GHz                      ( +-  0.69% )  (33.33%)
       141,032,827      stalled-cycles-frontend   #    0.50% frontend cycles idle     ( +- 52.44% )  (33.35%)
       796,792,139      stalled-cycles-backend    #    2.80% backend cycles idle      ( +- 23.73% )  (33.35%)
     1,140,172,646      instructions              #    0.04  insn per cycle
                                                  #    0.50  stalled cycles per insn  ( +-  0.89% )  (33.35%)
       219,864,061      branches                  #   18.622 M/sec                    ( +-  1.06% )  (33.36%)
         1,407,446      branch-misses             #    0.63% of all branches          ( +- 10.66% )  (33.40%)
     6,882,968,897      L1-dcache-loads           #  582.960 M/sec                    ( +-  0.03% )  (33.38%)
     3,267,546,914      L1-dcache-load-misses     #   47.45% of all L1-dcache accesses  ( +-  0.02% )  (33.37%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
       146,901,513      L1-icache-loads           #   12.442 M/sec                    ( +-  0.78% )  (33.36%)
         1,462,155      L1-icache-load-misses     #    0.99% of all L1-icache accesses  ( +-  0.83% )  (33.34%)
         2,055,805      dTLB-loads                #  174.118 K/sec                    ( +- 22.56% )  (33.33%)
           136,260      dTLB-load-misses          #    4.69% of all dTLB cache accesses  ( +- 23.13% )  (33.35%)
               941      iTLB-loads                #   79.699 /sec                     ( +-  5.54% )  (33.35%)
           115,444      iTLB-load-misses          # 14051.12% of all iTLB cache accesses  ( +- 21.17% )  (33.34%)
        95,438,373      L1-dcache-prefetches      #    8.083 M/sec                    ( +- 19.99% )  (33.34%)
   <not supported>      L1-dcache-prefetch-misses

           11.8296 +- 0.0805 seconds time elapsed  ( +-  0.68% )

[4] Milan, clear_pages_clzero()
 # perf stat -r 5 --all-kernel -ddd ./qemu.sh

 Performance counter stats for './qemu.sh' (5 runs):

          4,599.00 msec task-clock                #    0.937 CPUs utilized            ( +-  5.93% )
                91      context-switches          #   18.616 /sec                     ( +-  0.92% )
                 0      cpu-migrations            #    0.000 /sec
               107      page-faults               #   21.889 /sec                     ( +-  0.19% )
    10,975,453,059      cycles                    #    2.245 GHz                      ( +-  6.02% )  (33.28%)
        14,193,355      stalled-cycles-frontend   #    0.12% frontend cycles idle     ( +-  1.90% )  (33.35%)
        38,969,144      stalled-cycles-backend    #    0.33% backend cycles idle      ( +- 23.92% )  (33.34%)
    13,951,880,530      instructions              #    1.20  insn per cycle
                                                  #    0.00  stalled cycles per insn  ( +-  0.11% )  (33.33%)
     3,426,708,418      branches                  #  701.003 M/sec                    ( +-  0.06% )  (33.36%)
         2,350,619      branch-misses             #    0.07% of all branches          ( +-  0.61% )  (33.45%)
       418,489,450      L1-dcache-loads           #   85.611 M/sec                    ( +-  1.19% )  (33.46%)
         5,406,557      L1-dcache-load-misses     #    1.35% of all L1-dcache accesses  ( +-  1.07% )  (33.45%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
        90,088,059      L1-icache-loads           #   18.429 M/sec                    ( +-  0.36% )  (33.44%)
         1,081,035      L1-icache-load-misses     #    1.20% of all L1-icache accesses  ( +-  3.67% )  (33.42%)
         4,017,464      dTLB-loads                #  821.854 K/sec                    ( +-  1.02% )  (33.40%)
           204,096      dTLB-load-misses          #    5.22% of all dTLB cache accesses  ( +-  9.77% )  (33.36%)
               770      iTLB-loads                #  157.519 /sec                     ( +-  5.12% )  (33.36%)
           209,834      iTLB-load-misses          # 29479.35% of all iTLB cache accesses  ( +-  0.17% )  (33.34%)
         1,596,265      L1-dcache-prefetches      #  326.548 K/sec                    ( +-  1.55% )  (33.31%)
   <not supported>      L1-dcache-prefetch-misses

             4.908 +- 0.272 seconds time elapsed  ( +-  5.54% )

[5] Icelakex, kbuild + bg:clear_pages_erms() load.
 # taskset -c 16-31,80-95 perf stat -r 5 -ddd	\
	make -C .. -j 32 O=b2 clean bzImage

 Performance counter stats for 'make -C .. -j 32 O=b2 clean bzImage' (5 runs):

      3,047,329.07 msec task-clock                #   19.520 CPUs utilized            ( +-  0.02% )
         1,675,061      context-switches          #  549.415 /sec                     ( +-  0.43% )
            89,232      cpu-migrations            #   29.268 /sec                     ( +-  2.34% )
        85,752,972      page-faults               #   28.127 K/sec                    ( +-  0.00% )
 7,521,157,276,899      cycles                    #    2.467 GHz                      ( +-  0.02% )  (39.65%)
 8,226,884,900,694      instructions              #    1.09  insn per cycle           ( +-  0.02% )  (47.27%)
 1,744,557,848,503      branches                  #  572.209 M/sec                    ( +-  0.02% )  (54.83%)
    36,252,079,075      branch-misses             #    2.08% of all branches          ( +-  0.02% )  (62.35%)
20,016,410,480,886      slots                     #    6.565 G/sec                    ( +-  0.01% )  (69.84%)
 6,518,990,385,998      topdown-retiring          #     30.5% retiring                ( +-  0.02% )  (69.84%)
 7,821,817,193,732      topdown-bad-spec          #     36.7% bad speculation         ( +-  0.29% )  (69.84%)
 5,714,082,318,274      topdown-fe-bound          #     26.7% frontend bound          ( +-  0.10% )  (69.84%)
 1,310,070,777,023      topdown-be-bound          #      6.1% backend bound           ( +-  0.28% )  (69.84%)
 2,270,017,283,501      L1-dcache-loads           #  744.558 M/sec                    ( +-  0.02% )  (69.60%)
   103,295,556,544      L1-dcache-load-misses     #    4.55% of all L1-dcache accesses  ( +-  0.02% )  (69.64%)
    10,747,834,729      LLC-loads                 #    3.525 M/sec                    ( +-  0.05% )  (69.68%)
     4,841,355,743      LLC-load-misses           #   45.02% of all LL-cache accesses  ( +-  0.06% )  (69.70%)
   <not supported>      L1-icache-loads
   180,672,238,145      L1-icache-load-misses                                         ( +-  0.03% )  (31.18%)
 2,216,149,664,522      dTLB-loads                #  726.890 M/sec                    ( +-  0.03% )  (31.83%)
     2,000,781,326      dTLB-load-misses          #    0.09% of all dTLB cache accesses  ( +-  0.08% )  (31.79%)
   <not supported>      iTLB-loads
     1,938,124,234      iTLB-load-misses                                              ( +-  0.04% )  (31.76%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

          156.1136 +- 0.0785 seconds time elapsed  ( +-  0.05% )

[6] Icelakex, kbuild + bg:clear_pages_movnt() load.
 # taskset -c 16-31,80-95 perf stat -r 5 -ddd	\
	make -C .. -j 32 O=b2 clean bzImage

 Performance counter stats for 'make -C .. -j 32 O=b2 clean bzImage' (5 runs):

      2,978,535.47 msec task-clock                #   19.471 CPUs utilized            ( +-  0.05% )
         1,637,295      context-switches          #  550.105 /sec                     ( +-  0.89% )
            91,635      cpu-migrations            #   30.788 /sec                     ( +-  1.88% )
        85,754,138      page-faults               #   28.812 K/sec                    ( +-  0.00% )
 7,348,971,235,549      cycles                    #    2.469 GHz                      ( +-  0.04% )  (39.68%)
 8,223,413,950,371      instructions              #    1.12  insn per cycle           ( +-  0.03% )  (47.31%)
 1,743,914,970,674      branches                  #  585.928 M/sec                    ( +-  0.01% )  (54.87%)
    36,188,623,655      branch-misses             #    2.07% of all branches          ( +-  0.05% )  (62.39%)
19,328,950,611,944      slots                     #    6.494 G/sec                    ( +-  0.02% )  (69.87%)
 6,508,801,041,075      topdown-retiring          #     31.7% retiring                ( +-  0.35% )  (69.87%)
 7,581,383,615,462      topdown-bad-spec          #     36.4% bad speculation         ( +-  0.43% )  (69.87%)
 5,521,686,808,149      topdown-fe-bound          #     26.8% frontend bound          ( +-  0.14% )  (69.87%)
 1,043,408,291,623      topdown-be-bound          #      5.0% backend bound           ( +-  0.33% )  (69.87%)
 2,269,475,492,575      L1-dcache-loads           #  762.507 M/sec                    ( +-  0.03% )  (69.63%)
   101,544,979,642      L1-dcache-load-misses     #    4.47% of all L1-dcache accesses  ( +-  0.05% )  (69.66%)
    10,466,865,056      LLC-loads                 #    3.517 M/sec                    ( +-  0.08% )  (69.68%)
     4,206,944,783      LLC-load-misses           #   40.21% of all LL-cache accesses  ( +-  0.06% )  (69.71%)
   <not supported>      L1-icache-loads
   180,267,126,923      L1-icache-load-misses                                         ( +-  0.07% )  (31.17%)
 2,216,010,317,050      dTLB-loads                #  744.544 M/sec                    ( +-  0.03% )  (31.82%)
     1,979,801,744      dTLB-load-misses          #    0.09% of all dTLB cache accesses  ( +-  0.10% )  (31.79%)
   <not supported>      iTLB-loads
     1,925,390,304      iTLB-load-misses                                              ( +-  0.08% )  (31.77%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

           152.972 +- 0.309 seconds time elapsed  ( +-  0.20% )

[7] Milan, clear_pages_erms()
 # taskset -c 16-31,144-159 perf stat -r 5 -ddd  \
	make -C .. -j 32 O=b2 clean bzImage

 Performance counter stats for 'make -C .. -j 32 O=b2 clean bzImage' (5 runs):

      3,390,130.53 msec task-clock                #   18.241 CPUs utilized            ( +-  0.04% )
         1,720,283      context-switches          #  507.160 /sec                     ( +-  0.27% )
            96,694      cpu-migrations            #   28.507 /sec                     ( +-  1.41% )
        75,872,994      page-faults               #   22.368 K/sec                    ( +-  0.00% )
 7,930,842,057,103      cycles                    #    2.338 GHz                      ( +-  0.04% )  (41.09%)
    39,974,518,172      stalled-cycles-frontend   #    0.50% frontend cycles idle     ( +-  0.05% )  (41.10%)
   302,739,130,717      stalled-cycles-backend    #    3.82% backend cycles idle      ( +-  0.10% )  (41.11%)
 8,981,403,534,446      instructions              #    1.13  insn per cycle
                                                  #    0.03  stalled cycles per insn  ( +-  0.03% )  (41.10%)
 1,909,303,327,220      branches                  #  562.886 M/sec                    ( +-  0.02% )  (41.10%)
    50,324,935,298      branch-misses             #    2.64% of all branches          ( +-  0.02% )  (41.09%)
 3,563,297,595,796      L1-dcache-loads           #    1.051 G/sec                    ( +-  0.03% )  (41.08%)
   129,901,339,258      L1-dcache-load-misses     #    3.65% of all L1-dcache accesses  ( +-  0.10% )  (41.07%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
   809,770,606,566      L1-icache-loads           #  238.730 M/sec                    ( +-  0.03% )  (41.07%)
    12,403,758,671      L1-icache-load-misses     #    1.53% of all L1-icache accesses  ( +-  0.08% )  (41.07%)
    60,010,026,089      dTLB-loads                #   17.692 M/sec                    ( +-  0.04% )  (41.07%)
     3,254,066,681      dTLB-load-misses          #    5.42% of all dTLB cache accesses  ( +-  0.09% )  (41.07%)
     5,195,070,952      iTLB-loads                #    1.532 M/sec                    ( +-  0.03% )  (41.08%)
       489,196,395      iTLB-load-misses          #    9.42% of all iTLB cache accesses  ( +-  0.10% )  (41.09%)
    39,920,161,716      L1-dcache-prefetches      #   11.769 M/sec                    ( +-  0.03% )  (41.09%)
   <not supported>      L1-dcache-prefetch-misses

           185.852 +- 0.501 seconds time elapsed  ( +-  0.27% )

[8] Milan, clear_pages_clzero()
 # taskset -c 16-31,144-159 perf stat -r 5 -ddd  \
	make -C .. -j 32 O=b2 clean bzImage

 Performance counter stats for 'make -C .. -j 32 O=b2 clean bzImage' (5 runs):

      3,296,677.12 msec task-clock                #   18.051 CPUs utilized            ( +-  0.02% )
         1,713,645      context-switches          #  520.062 /sec                     ( +-  0.26% )
            91,883      cpu-migrations            #   27.885 /sec                     ( +-  0.83% )
        75,877,740      page-faults               #   23.028 K/sec                    ( +-  0.00% )
 7,705,812,395,365      cycles                    #    2.339 GHz                      ( +-  0.01% )  (41.11%)
    38,866,265,031      stalled-cycles-frontend   #    0.50% frontend cycles idle     ( +-  0.09% )  (41.10%)
   287,703,667,307      stalled-cycles-backend    #    3.74% backend cycles idle      ( +-  0.04% )  (41.11%)
 8,969,062,192,998      instructions              #    1.16  insn per cycle
                                                  #    0.03  stalled cycles per insn  ( +-  0.01% )  (41.11%)
 1,906,857,866,689      branches                  #  578.699 M/sec                    ( +-  0.01% )  (41.10%)
    50,155,411,444      branch-misses             #    2.63% of all branches          ( +-  0.03% )  (41.11%)
 3,552,652,190,906      L1-dcache-loads           #    1.078 G/sec                    ( +-  0.01% )  (41.13%)
   127,238,478,917      L1-dcache-load-misses     #    3.58% of all L1-dcache accesses  ( +-  0.04% )  (41.13%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
   808,024,730,682      L1-icache-loads           #  245.222 M/sec                    ( +-  0.03% )  (41.13%)
     7,773,178,107      L1-icache-load-misses     #    0.96% of all L1-icache accesses  ( +-  0.11% )  (41.13%)
    59,684,355,294      dTLB-loads                #   18.113 M/sec                    ( +-  0.04% )  (41.12%)
     3,247,521,154      dTLB-load-misses          #    5.44% of all dTLB cache accesses  ( +-  0.04% )  (41.12%)
     5,064,547,530      iTLB-loads                #    1.537 M/sec                    ( +-  0.09% )  (41.12%)
       462,977,175      iTLB-load-misses          #    9.13% of all iTLB cache accesses  ( +-  0.07% )  (41.12%)
    39,307,810,241      L1-dcache-prefetches      #   11.929 M/sec                    ( +-  0.06% )  (41.11%)
   <not supported>      L1-dcache-prefetch-misses

           182.630 +- 0.365 seconds time elapsed  ( +-  0.20% )

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---

Notes:
    Not sure if this wall of perf-stats (or indeed the whole kbuild test) is
    warranted here.
    
    To my eyes, there's no non-obvious information in the performance results
    (reducing cache usage should and does lead to other processes getting a small
    bump in performance), so is there any value in keeping this in the commit
    message?

 fs/hugetlbfs/inode.c |  7 ++++++-
 mm/gup.c             | 18 ++++++++++++++++++
 mm/huge_memory.c     |  2 +-
 mm/hugetlb.c         |  9 ++++++++-
 4 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 62408047e8d7..993bb7227a2f 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -650,6 +650,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 	loff_t hpage_size = huge_page_size(h);
 	unsigned long hpage_shift = huge_page_shift(h);
 	pgoff_t start, index, end;
+	bool hint_non_caching;
 	int error;
 	u32 hash;
 
@@ -667,6 +668,9 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 	start = offset >> hpage_shift;
 	end = (offset + len + hpage_size - 1) >> hpage_shift;
 
+	/* Don't pollute the cache if we are fallocate'ing a large region. */
+	hint_non_caching = clear_page_prefer_non_caching((end - start) << hpage_shift);
+
 	inode_lock(inode);
 
 	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
@@ -745,7 +749,8 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 			error = PTR_ERR(page);
 			goto out;
 		}
-		clear_huge_page(page, addr, pages_per_huge_page(h));
+		clear_huge_page(page, addr, pages_per_huge_page(h),
+				hint_non_caching);
 		__SetPageUptodate(page);
 		error = huge_add_to_page_cache(page, mapping, index);
 		if (unlikely(error)) {
diff --git a/mm/gup.c b/mm/gup.c
index 551264407624..bceb6ff64687 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -944,6 +944,13 @@ static int faultin_page(struct vm_area_struct *vma,
 		 */
 		fault_flags |= FAULT_FLAG_TRIED;
 	}
+	if (*flags & FOLL_HINT_BULK) {
+		/*
+		 * This page is part of a large region being faulted-in
+		 * so attempt to minimize cache-pollution.
+		 */
+		fault_flags |= FAULT_FLAG_NON_CACHING;
+	}
 	if (unshare) {
 		fault_flags |= FAULT_FLAG_UNSHARE;
 		/* FAULT_FLAG_WRITE and FAULT_FLAG_UNSHARE are incompatible */
@@ -1116,6 +1123,17 @@ static long __get_user_pages(struct mm_struct *mm,
 	if (!(gup_flags & FOLL_FORCE))
 		gup_flags |= FOLL_NUMA;
 
+	/*
+	 * Non-cached page clearing is generally faster when clearing regions
+	 * larger than O(LLC-size). So hint the non-caching path based on
+	 * clear_page_prefer_non_caching().
+	 *
+	 * Note, however, that this check is optimistic -- nr_pages is the upper
+	 * limit and we might be clearing less than that.
+	 */
+	if (clear_page_prefer_non_caching(nr_pages * PAGE_SIZE))
+		gup_flags |= FOLL_HINT_BULK;
+
 	do {
 		struct page *page;
 		unsigned int foll_flags = gup_flags;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 73654db77a1c..c7294cffc384 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -594,7 +594,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 	pgtable_t pgtable;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	vm_fault_t ret = 0;
-	bool non_cached = false;
+	bool non_cached = vmf->flags & FAULT_FLAG_NON_CACHING;
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0c4a31b5c1e9..d906c6558b15 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5481,7 +5481,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
 	bool new_page, new_pagecache_page = false;
-	bool non_cached = false;
+	bool non_cached = flags & FAULT_FLAG_NON_CACHING;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -6182,6 +6182,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				 */
 				fault_flags |= FAULT_FLAG_TRIED;
 			}
+			if (flags & FOLL_HINT_BULK) {
+				/*
+				 * From the user hint, we might be faulting-in
+				 * a large region so minimize cache-pollution.
+				 */
+				fault_flags |= FAULT_FLAG_NON_CACHING;
+			}
 			ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
 			if (ret & VM_FAULT_ERROR) {
 				err = vm_fault_to_errno(ret, flags);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 20/21] vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages()
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (18 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 19/21] gup: hint non-caching if clearing large regions Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 20:37 ` [PATCH v3 21/21] x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake Ankur Arora
  2022-06-06 21:53 ` [PATCH v3 00/21] huge page clearing optimizations Linus Torvalds
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora, alex.williamson

Specify FOLL_HINT_BULK to pin_user_pages_remote() so it is aware
that this pin is part of a larger region being pinned, and can
optimize based on that expectation.

Cc: alex.williamson@redhat.com
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 9394aa9444c1..138b23769793 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -553,6 +553,9 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
 	if (prot & IOMMU_WRITE)
 		flags |= FOLL_WRITE;
 
+	/* Tell gup that this pin iteration is part of a larger set of pins. */
+	flags |= FOLL_HINT_BULK;
+
 	mmap_read_lock(mm);
 	ret = pin_user_pages_remote(mm, vaddr, npages, flags | FOLL_LONGTERM,
 				    pages, NULL, NULL);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 21/21] x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (19 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 20/21] vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages() Ankur Arora
@ 2022-06-06 20:37 ` Ankur Arora
  2022-06-06 21:53 ` [PATCH v3 00/21] huge page clearing optimizations Linus Torvalds
  21 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-06 20:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, mike.kravetz, mingo, luto, tglx, bp, peterz, ak,
	arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins, ankur.a.arora

System:           Oracle X8-2 (2 nodes * 26 cores/node * 2 threads/core)
Processor:        Intel Xeon Platinum 8270CL (Skylakex, 6:85:7)
Memory:           3TB evenly split between nodes
Microcode:        0x5002f01
scaling_governor: performance
LLC size:         36MB for each node
intel_pstate/no_turbo: 1

$ for i in 2 8 32 128 512; do
	perf bench mem memset -f x86-64-movnt -s ${i}MB
  done
  # Running 'mem/memset' benchmark:
  # function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
  # Copying 2MB bytes ...
         6.361971 GB/sec
  # Copying 8MB bytes ...
         6.300403 GB/sec
  # Copying 32MB bytes ...
         6.288992 GB/sec
  # Copying 128MB bytes ...
         6.328793 GB/sec
  # Copying 512MB bytes ...
         6.324471 GB/sec

 # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
 # (X86_FEATURE_ERMS) and x86-64-movnt:

              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)      speedup
              -----------------------   -----------------------    -------
     size            BW   (   pstdev)          BW   (   pstdev)

     16MB      20.38 GB/s ( +- 2.58%)     6.25 GB/s ( +- 0.41%)   -69.28%
    128MB       6.52 GB/s ( +- 0.14%)     6.31 GB/s ( +- 0.47%)    -3.22%
   1024MB       6.48 GB/s ( +- 0.31%)     6.24 GB/s ( +- 0.00%)    -3.70%
   4096MB       6.51 GB/s ( +- 0.01%)     6.27 GB/s ( +- 0.42%)    -3.68%

Comparing perf stats for size=4096MB:

$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb
 # Running 'mem/memset' benchmark:
 # function 'x86-64-stosb' (movsb-based memset() in arch/x86/lib/memset_64.S)
 # Copying 4096MB bytes ...
       6.516972 GB/sec       (+- 0.01%)

 Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb' (5 runs):

     3,357,373,317      cpu-cycles                #    1.133 GHz                      ( +-  0.01% )  (29.38%)
       165,063,710      instructions              #    0.05  insn per cycle           ( +-  1.54% )  (35.29%)
           358,997      cache-references          #    0.121 M/sec                    ( +-  0.89% )  (35.32%)
           205,420      cache-misses              #   57.221 % of all cache refs      ( +-  3.61% )  (35.36%)
         6,117,673      branch-instructions       #    2.065 M/sec                    ( +-  1.48% )  (35.38%)
            58,309      branch-misses             #    0.95% of all branches          ( +-  1.30% )  (35.39%)
        31,329,466      bus-cycles                #   10.575 M/sec                    ( +-  0.03% )  (23.56%)
        68,543,766      L1-dcache-load-misses     #  157.03% of all L1-dcache accesses  ( +-  0.02% )  (23.53%)
        43,648,909      L1-dcache-loads           #   14.734 M/sec                    ( +-  0.50% )  (23.50%)
           137,498      LLC-loads                 #    0.046 M/sec                    ( +-  0.21% )  (23.49%)
            12,308      LLC-load-misses           #    8.95% of all LL-cache accesses  ( +-  2.52% )  (23.49%)
            26,335      LLC-stores                #    0.009 M/sec                    ( +-  5.65% )  (11.75%)
            25,008      LLC-store-misses          #    0.008 M/sec                    ( +-  3.42% )  (11.75%)

          2.962842 +- 0.000162 seconds time elapsed  ( +-  0.01% )

$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt
 # Running 'mem/memset' benchmark:
 # function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
 # Copying 4096MB bytes ...
       6.283420 GB/sec      (+- 0.01%)

  Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt' (5 runs):

     4,462,272,094      cpu-cycles                #    1.322 GHz                      ( +-  0.30% )  (29.38%)
     1,633,675,881      instructions              #    0.37  insn per cycle           ( +-  0.21% )  (35.28%)
           283,627      cache-references          #    0.084 M/sec                    ( +-  0.58% )  (35.31%)
            28,824      cache-misses              #   10.163 % of all cache refs      ( +- 20.67% )  (35.34%)
       139,719,697      branch-instructions       #   41.407 M/sec                    ( +-  0.16% )  (35.35%)
            58,062      branch-misses             #    0.04% of all branches          ( +-  1.49% )  (35.36%)
        41,760,350      bus-cycles                #   12.376 M/sec                    ( +-  0.05% )  (23.55%)
           303,300      L1-dcache-load-misses     #    0.69% of all L1-dcache accesses  ( +-  2.08% )  (23.53%)
        43,769,498      L1-dcache-loads           #   12.972 M/sec                    ( +-  0.54% )  (23.52%)
            99,570      LLC-loads                 #    0.030 M/sec                    ( +-  1.06% )  (23.52%)
             1,966      LLC-load-misses           #    1.97% of all LL-cache accesses  ( +-  6.17% )  (23.52%)
               129      LLC-stores                #    0.038 K/sec                    ( +- 27.85% )  (11.75%)
                 7      LLC-store-misses          #    0.002 K/sec                    ( +- 47.82% )  (11.75%)

           3.37465 +- 0.00474 seconds time elapsed  ( +-  0.14% )

It's unclear if using MOVNT is a net negative on Skylake. For bulk stores
MOVNT is slightly slower than REP;STOSB, but from the L1-dcache-load-misses
stats (L1D.REPLACEMENT), it does elide the write-allocate and thus helps
with cache efficiency.

However, we err on the side of caution and set X86_FEATURE_MOVNT_SLOW
on Skylake.
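
For reference, the consumer side of the flag is just a feature check;
the selection done earlier in the series boils down to something like
this (sketch only; argument lists are schematic):

    /*
     * Sketch: prefer the REP;STOSB path on CPUs where MOVNT is known
     * to be slow.
     */
    if (static_cpu_has(X86_FEATURE_MOVNT_SLOW))
            clear_pages_erms(addr, npages);
    else
            clear_pages_movnt(addr, npages);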

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/kernel/cpu/bugs.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 16e293654d34..ee7206f03d15 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -97,7 +97,21 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);
 void check_movnt_quirks(struct cpuinfo_x86 *c)
 {
 #ifdef CONFIG_X86_64
-
+	if (c->x86_vendor == X86_VENDOR_INTEL) {
+		if (c->x86 == 6) {
+			switch (c->x86_model) {
+			case INTEL_FAM6_SKYLAKE_L:
+				fallthrough;
+			case INTEL_FAM6_SKYLAKE:
+				fallthrough;
+			case INTEL_FAM6_SKYLAKE_X:
+				set_cpu_cap(c, X86_FEATURE_MOVNT_SLOW);
+				break;
+			default:
+				break;
+			}
+		}
+	}
 #endif
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 00/21] huge page clearing optimizations
  2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
                   ` (20 preceding siblings ...)
  2022-06-06 20:37 ` [PATCH v3 21/21] x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake Ankur Arora
@ 2022-06-06 21:53 ` Linus Torvalds
  2022-06-07 15:08   ` Ankur Arora
  21 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2022-06-06 21:53 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Linux Kernel Mailing List, Linux-MM, the arch/x86 maintainers,
	Andrew Morton, Mike Kravetz, Ingo Molnar, Andrew Lutomirski,
	Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Andi Kleen,
	Arnd Bergmann, Jason Gunthorpe, jon.grimm, Boris Ostrovsky,
	Konrad Rzeszutek Wilk, joao.martins

On Mon, Jun 6, 2022 at 1:22 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> This series introduces two optimizations in the huge page clearing path:
>
>  1. extends the clear_page() machinery to also handle extents larger
>     than a single page.
>  2. support non-cached page clearing for huge and gigantic pages.
>
> The first optimization is useful for hugepage fault handling, the
> second for prefaulting, or for gigantic pages.

Please just split these two issues up into entirely different patch series.

That said, I have a few complaints about the individual patches even
in this form, to the point where I think the whole series is nasty:

 - get rid of 3/21 entirely. It's wrong in every possible way:

    (a) That shouldn't be an inline function in a header file at all.
If you're clearing several pages of data, that just shouldn't be an
inline function.

    (b) Get rid of __HAVE_ARCH_CLEAR_USER_PAGES. I hate how people
make up those idiotic pointless names.

         If you have to use a #ifdef, just use the name of the
function that the architecture overrides, not some other new name.

   But you don't need it at all, because

    (c) Just make a __weak function called clear_user_highpages() in
mm/highmem.c, and allow architectures to just create their own
non-weak ones.

 - patch 4/21 and 5/21: can we instead just get rid of that silly
"process_huge_page()" thing entirely. It's disgusting, and it's a big
part of why 'rep movs/stos' cannot work efficiently. It also makes NO
SENSE if you then use non-temporal accesses.

   So instead of doubling down on the craziness of that function, just
get rid of it entirely.

   There are two users, and they want to clear a hugepage and copy it
respectively. Don't make it harder than it is.

    *Maybe* the code wants to do a "prefetch" afterwards. Who knows.
But I really think you should do the crapectomy first, make the code
simpler and more straightforward, and just allow architectures to
override the *simple* "copy or clear a large page" rather than keep
feeding this butt-ugly monstrosity.

 - 13/21: see 3/21.

 - 14-17/21: see 4/21 and 5/21. Once you do the crapectomy and get rid
of the crazy process_huge_page() abstraction, and just let
architectures do their own clear/copy huge pages, *all* this craziness
goes away. Those "when to use which type of clear/copy" becomes a
*local* question, no silly arch_clear_page_non_caching_threshold()
garbage.

So I really don't like this series. A *lot* of it comes from that
horrible process_huge_page() model, and the whole model is just wrong
and pointless. You're literally trying to fix the mess that that
function is, but you're keeping the fundamental problem around.

The whole *point* of your patch-set is to use non-temporal stores,
which makes all the process_huge_page() things entirely pointless, and
only complicates things.

And even if we don't use non-temporal stores, that process_huge_page()
thing makes for trouble for any "rep stos/movs" implementation that
might actually do a better job if it was just chunked up in bigger
chunks.

Yes, yes, you probably still want to chunk that up somewhat due to
latency reasons, but even then architectures might as well just make
their own decisions, rather than have the core mm code make one
clearly bad decision for them. Maybe chunking it up in bigger chunks
than one page.

Maybe an architecture could do even more radical things like "let's
just 'rep stos' for the whole area, but set a special thread flag that
causes the interrupt return to break it up on return to kernel space".
IOW, the "latency fix" might not even be about chunking it up, it
might look more like our exception handling thing.

So I really think that crapectomy should be the first thing you do,
and that should be that first part of "extends the clear_page()
machinery to also handle extents larger than a single page"

                Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 00/21] huge page clearing optimizations
  2022-06-06 21:53 ` [PATCH v3 00/21] huge page clearing optimizations Linus Torvalds
@ 2022-06-07 15:08   ` Ankur Arora
  2022-06-07 17:56     ` Linus Torvalds
  0 siblings, 1 reply; 35+ messages in thread
From: Ankur Arora @ 2022-06-07 15:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ankur Arora, Linux Kernel Mailing List, Linux-MM,
	the arch/x86 maintainers, Andrew Morton, Mike Kravetz,
	Ingo Molnar, Andrew Lutomirski, Thomas Gleixner, Borislav Petkov,
	Peter Zijlstra, Andi Kleen, Arnd Bergmann, Jason Gunthorpe,
	jon.grimm, Boris Ostrovsky, Konrad Rzeszutek Wilk,
	joao.m.martins

[ Fixed email for Joao Martins. ]

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Jun 6, 2022 at 1:22 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
[snip]

> So I really don't like this series. A *lot* of it comes from that
> horrible process_huge_page() model, and the whole model is just wrong
> and pointless. You're literally trying to fix the mess that that
> function is, but you're keeping the fundamental problem around.
>
> The whole *point* of your patch-set is to use non-temporal stores,
> which makes all the process_huge_page() things entirely pointless, and
> only complicates things.
>
> And even if we don't use non-temporal stores, that process_huge_page()
> thing makes for trouble for any "rep stos/movs" implementation that
> might actualyl do a better job if it was just chunked up in bigger
> chunks.

This makes sense to me. There is a lot of unnecessary machinery
around process_huge_page() and this series adds more of it.

For highmem and page-at-a-time archs we would need to keep some
of the same optimizations (via the common clear/copy_user_highpages().)

Still, that rids the arch code of pointless constraints, as you
say below.

> Yes, yes, you probably still want to chunk that up somewhat due to
> latency reasons, but even then architectures might as well just make
> their own decisions, rather than have the core mm code make one
> clearly bad decision for them. Maybe chunking it up in bigger chunks
> than one page.

Right. Or doing the whole contiguous area in one or a few chunks,
and then touching the faulting cachelines towards the end.

> Maybe an architecture could do even more radical things like "let's
> just 'rep stos' for the whole area, but set a special thread flag that
> causes the interrupt return to break it up on return to kernel space".
> IOW, the "latency fix" might not even be about chunking it up, it
> might look more like our exception handling thing.

When I was thinking about this earlier, I had a vague inkling of
setting a thread flag and deferring writes to the last few cachelines
until just before returning to user-space.
Can you elaborate a little on what you are describing above?

> So I really think that crapectomy should be the first thing you do,
> and that should be that first part of "extends the clear_page()
> machinery to also handle extents larger than a single page"

Ack that. And, thanks for the detailed comments.

--
ankur

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 00/21] huge page clearing optimizations
  2022-06-07 15:08   ` Ankur Arora
@ 2022-06-07 17:56     ` Linus Torvalds
  2022-06-08 19:24       ` Ankur Arora
  2022-06-08 19:49       ` Matthew Wilcox
  0 siblings, 2 replies; 35+ messages in thread
From: Linus Torvalds @ 2022-06-07 17:56 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Linux Kernel Mailing List, Linux-MM, the arch/x86 maintainers,
	Andrew Morton, Mike Kravetz, Ingo Molnar, Andrew Lutomirski,
	Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Andi Kleen,
	Arnd Bergmann, Jason Gunthorpe, jon.grimm, Boris Ostrovsky,
	Konrad Rzeszutek Wilk, Joao Martins

On Tue, Jun 7, 2022 at 8:10 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> For highmem and page-at-a-time archs we would need to keep some
> of the same optimizations (via the common clear/copy_user_highpages().)

Yeah, I guess that we could keep the code for legacy use, just make
the existing code be marked __weak so that it can be ignored for any
further work.

IOW, the first patch might be to just add that __weak to
'clear_huge_page()' and 'copy_user_huge_page()'.

At that point, any architecture can just say "I will implement my own
versions of these two".

In fact, you can start with just one or the other, which is probably
nicer to keep the patch series smaller (ie do the simpler
"clear_huge_page()" first).

I worry a bit about the insanity of the "gigantic" pages, and the
mem_map_next() games it plays, but that code is from 2008 and I really
doubt it makes any sense to keep around at least for x86. The source
of that abomination is powerpc, and I do not think that whole issue
with MAX_ORDER_NR_PAGES makes any difference on x86, at least.

It most definitely makes no sense when there are no highmem issues, and
all those 'struct page' games should just be deleted (or at least
relegated entirely to that "legacy __weak function" case so that sane
situations don't need to care).

For that same HIGHMEM reason it's probably a good idea to limit the
new case just to x86-64, and leave 32-bit x86 behind.

> Right. Or doing the whole contiguous area in one or a few chunks
> chunks, and then touching the faulting cachelines towards the end.

Yeah, just add a prefetch for the 'addr_hint' part at the end.

> > Maybe an architecture could do even more radical things like "let's
> > just 'rep stos' for the whole area, but set a special thread flag that
> > causes the interrupt return to break it up on return to kernel space".
> > IOW, the "latency fix" might not even be about chunking it up, it
> > might look more like our exception handling thing.
>
> When I was thinking about this earlier, I had a vague inkling of
> setting a thread flag and defer writes to the last few cachelines
> for just before returning to user-space.
> Can you elaborate a little about what you are describing above?

So 'process_huge_page()' (and the gigantic page case) does three very
different things:

 (a) that page chunking for highmem accesses

 (b) the page access _ordering_ for the cache hinting reasons

 (c) the chunking for _latency_ reasons

and I think all of them are basically "bad legacy" reasons, in that

 (a) HIGHMEM doesn't exist on sane architectures that we care about these days

 (b) the cache hinting ordering makes no sense if you do non-temporal
accesses (and might then be replaced by a possible "prefetch" at the
end)

 (c) the latency reasons still *do* exist, but only with PREEMPT_NONE

So what I was alluding to with those "more radical approaches" was
that PREEMPT_NONE case: we would probably still want to chunk things
up for latency reasons and do that "cond_resched()" in between
chunks.

Now, there are alternatives here:

 (a) only override that existing disgusting (but tested) function when
both CONFIG_HIGHMEM and CONFIG_PREEMPT_NONE are false

 (b) do something like this:

    void clear_huge_page(struct page *page,
        unsigned long addr_hint,
        unsigned int pages_per_huge_page)
    {
        void *addr = page_address(page);
    #ifdef CONFIG_PREEMPT_NONE
        for (int i = 0; i < pages_per_huge_page; i++) {
            clear_page(addr + i * PAGE_SIZE);
            cond_preempt();
        }
    #else
        nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
        prefetch(addr_hint);
    #endif
    }

 or (c), do that "more radical approach", where you do something like this:

    void clear_huge_page(struct page *page,
        unsigned long addr_hint,
        unsigned int pages_per_huge_page)
    {
        void *addr = page_address(page);

        set_thread_flag(TIF_PREEMPT_ME);
        nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
        clear_thread_flag(TIF_PREEMPT_ME);
        prefetch(addr_hint);
    }

and then you make the "return to kernel mode" check the TIF_PREEMPT_ME
case and actually force preemption even on a non-preempt kernel.
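
Handwaving wildly, that return-to-kernel check might be little more than
(TIF_PREEMPT_ME being a made-up name, obviously):

    /* in the irqentry return-to-kernel path, roughly: */
    if (test_thread_flag(TIF_PREEMPT_ME) && need_resched())
        preempt_schedule_irq();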

It's _probably_ the case that CONFIG_PREEMPT_NONE is so rare that it's
not even worth doing. I dunno.

And all of the above pseudo-code may _look_ like real code, but is
entirely untested and entirely handwavy "something like this".

Hmm?

               Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 13/21] clear_page: add generic clear_user_pages_incoherent()
  2022-06-06 20:37 ` [PATCH v3 13/21] clear_page: add generic clear_user_pages_incoherent() Ankur Arora
@ 2022-06-08  0:01   ` Luc Van Oostenryck
  2022-06-12 11:19     ` Ankur Arora
  0 siblings, 1 reply; 35+ messages in thread
From: Luc Van Oostenryck @ 2022-06-08  0:01 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, mike.kravetz, mingo,
	luto, tglx, bp, peterz, ak, arnd, jgg, jon.grimm,
	boris.ostrovsky, konrad.wilk, joao.m.martins

On Mon, Jun 06, 2022 at 08:37:17PM +0000, Ankur Arora wrote:
> +static inline void clear_user_pages_incoherent(__incoherent void *page,
> +					       unsigned long vaddr,
> +					       struct page *pg,
> +					       unsigned int npages)
> +{
> +	clear_user_pages((__force void *)page, vaddr, pg, npages);
> +}

Hi,

Please use 'void __incoherent *' and 'void __force *', as it's done
elsewhere for __force and address spaces.

-- Luc

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 00/21] huge page clearing optimizations
  2022-06-07 17:56     ` Linus Torvalds
@ 2022-06-08 19:24       ` Ankur Arora
  2022-06-08 19:39         ` Linus Torvalds
  2022-06-08 19:49       ` Matthew Wilcox
  1 sibling, 1 reply; 35+ messages in thread
From: Ankur Arora @ 2022-06-08 19:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ankur Arora, Linux Kernel Mailing List, Linux-MM,
	the arch/x86 maintainers, Andrew Morton, Mike Kravetz,
	Ingo Molnar, Andrew Lutomirski, Thomas Gleixner, Borislav Petkov,
	Peter Zijlstra, Andi Kleen, Arnd Bergmann, Jason Gunthorpe,
	jon.grimm, Boris Ostrovsky, Konrad Rzeszutek Wilk, Joao Martins


Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Jun 7, 2022 at 8:10 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> For highmem and page-at-a-time archs we would need to keep some
>> of the same optimizations (via the common clear/copy_user_highpages().)
>
> Yeah, I guess that we could keep the code for legacy use, just make
> the existing code be marked __weak so that it can be ignored for any
> further work.
>
> IOW, the first patch might be to just add that __weak to
> 'clear_huge_page()' and 'copy_user_huge_page()'.
>
> At that point, any architecture can just say "I will implement my own
> versions of these two".
>
> In fact, you can start with just one or the other, which is probably
> nicer to keep the patch series smaller (ie do the simpler
> "clear_huge_page()" first).

Agreed. Best way to iron out all the kinks too.

> I worry a bit about the insanity of the "gigantic" pages, and the
> mem_map_next() games it plays, but that code is from 2008 and I really
> doubt it makes any sense to keep around at least for x86. The source
> of that abomination is powerpc, and I do not think that whole issue
> with MAX_ORDER_NR_PAGES makes any difference on x86, at least.

Looking at it now, it seems to be caused by the wide range of
MAX_ZONEORDER values on powerpc? It made my head hurt so I didn't try
to figure it out in detail.

But, even on x86, AFAICT gigantic pages could straddle MAX_SECTION_BITS?
An arch-specific clear_huge_page() could, however, handle 1GB pages
via some kind of static loop around (30 - MAX_SECTION_BITS).

I'm a little fuzzy on the CONFIG_SPARSEMEM_EXTREME and !SPARSEMEM_VMEMMAP
configs. But I think we should be able to avoid the pfn_to_page(),
page_to_pfn() lookups entirely, or at least keep them out of the inner loop.

> It most definitely makes no sense when there are no highmem issues, and
> all those 'struct page' games should just be deleted (or at least
> relegated entirely to that "legacy __weak function" case so that sane
> situations don't need to care).

Yeah, I'm hoping to do exactly that.

> For that same HIGHMEM reason it's probably a good idea to limit the
> new case just to x86-64, and leave 32-bit x86 behind.

Ack that.

>> Right. Or doing the whole contiguous area in one or a few chunks,
>> and then touching the faulting cachelines towards the end.
>
> Yeah, just add a prefetch for the 'addr_hint' part at the end.
>
>> > Maybe an architecture could do even more radical things like "let's
>> > just 'rep stos' for the whole area, but set a special thread flag that
>> > causes the interrupt return to break it up on return to kernel space".
>> > IOW, the "latency fix" might not even be about chunking it up, it
>> > might look more like our exception handling thing.
>>
>> When I was thinking about this earlier, I had a vague inkling of
>> setting a thread flag and defer writes to the last few cachelines
>> for just before returning to user-space.
>> Can you elaborate a little about what you are describing above?
>
> So 'process_huge_page()' (and the gigantic page case) does three very
> different things:
>
>  (a) that page chunking for highmem accesses
>
>  (b) the page access _ordering_ for the cache hinting reasons
>
>  (c) the chunking for _latency_ reasons
>
> and I think all of them are basically "bad legacy" reasons, in that
>
>  (a) HIGHMEM doesn't exist on sane architectures that we care about these days
>
>  (b) the cache hinting ordering makes no sense if you do non-temporal
> accesses (and might then be replaced by a possible "prefetch" at the
> end)
>
>  (c) the latency reasons still *do* exist, but only with PREEMPT_NONE
>
> So what I was alluding to with those "more radical approaches" was
> that PREEMPT_NONE case: we would probably still want to chunk things
> up for latency reasons and do that "cond_resched()" in  between
> chunks.

Thanks for the detail. That helps.

> Now, there are alternatives here:
>
>  (a) only override that existing disgusting (but tested) function when
> both CONFIG_HIGHMEM and CONFIG_PREEMPT_NONE are false
>
>  (b) do something like this:
>
>     void clear_huge_page(struct page *page,
>         unsigned long addr_hint,
>         unsigned int pages_per_huge_page)
>     {
>         void *addr = page_address(page);
>     #ifdef CONFIG_PREEMPT_NONE
>         for (int i = 0; i < pages_per_huge_page; i++) {
>             clear_page(addr + i * PAGE_SIZE);
>             cond_preempt();
>         }
>     #else
>         nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
>         prefetch(addr_hint);
>     #endif
>     }

We'll need a preemption point there for CONFIG_PREEMPT_VOLUNTARY
as well, right? Either way, as you said earlier, we could chunk
things up in bigger units than a single page.
(In the numbers I had posted earlier, chunking in units of up to 1MB
gave ~25% higher clearing BW. I don't think the microcode setup costs
are that high, but I don't have a good explanation for why.)
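
Roughly what I have in mind for that chunked path, with addr being
page_address(page) as in your sketch (chunk size illustrative, and
clear_pages() being the primitive this series adds):

    for (unsigned int i = 0; i < pages_per_huge_page; i += SZ_1M / PAGE_SIZE) {
        unsigned int n = min_t(unsigned int, SZ_1M / PAGE_SIZE,
                               pages_per_huge_page - i);

        clear_pages(addr + i * PAGE_SIZE, n);
        cond_resched();
    }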

>  or (c), do that "more radical approach", where you do something like this:
>
>     void clear_huge_page(struct page *page,
>         unsigned long addr_hint,
>         unsigned int pages_per_huge_page)
>     {
>         void *addr = page_address(page);
>
>         set_thread_flag(TIF_PREEMPT_ME);
>         nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
>         clear_thread_flag(TIF_PREEMPT_ME);
>         prefetch(addr_hint);
>     }
>
> and then you make the "return to kernel mode" check the TIF_PREEMPT_ME
> case and actually force preemption even on a non-preempt kernel.

I like this one. I'll try out (b) and (c) and see how the code shakes
out.

Just one minor point -- seems to me that the choice of nontemporal or
temporal might have to be based on a hint to clear_huge_page().

Basically the nontemporal path is only faster for
(pages_per_huge_page * PAGE_SIZE > LLC-size).

So in the page-fault path it might make sense to use the temporal
path (except for gigantic pages.) In the prefault path, nontemporal
might be better.
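
Hand-waving the hint plumbing, the selection would look roughly like
the below (bulk_hint and llc_size are stand-ins; the clear_user_pages*()
helpers are the ones from patches 03 and 13):

    bool non_cached = bulk_hint ||                    /* prefault/pin path */
                      npages * PAGE_SIZE > llc_size;  /* e.g. gigantic pages */

    if (non_cached)
        clear_user_pages_incoherent((void __incoherent *)addr,
                                    vaddr, pg, npages);
    else
        clear_user_pages(addr, vaddr, pg, npages);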

Thanks

--
ankur

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 00/21] huge page clearing optimizations
  2022-06-08 19:24       ` Ankur Arora
@ 2022-06-08 19:39         ` Linus Torvalds
  2022-06-08 20:21           ` Ankur Arora
  0 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2022-06-08 19:39 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Linux Kernel Mailing List, Linux-MM, the arch/x86 maintainers,
	Andrew Morton, Mike Kravetz, Ingo Molnar, Andrew Lutomirski,
	Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Andi Kleen,
	Arnd Bergmann, Jason Gunthorpe, jon.grimm, Boris Ostrovsky,
	Konrad Rzeszutek Wilk, Joao Martins

On Wed, Jun 8, 2022 at 12:25 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> But, even on x86, AFAICT gigantic pages could straddle MAX_SECTION_BITS?
> An arch-specific clear_huge_page() could, however, handle 1GB pages
> via some kind of static loop around (30 - MAX_SECTION_BITS).

Even if gigantic pages straddle that area, it simply shouldn't matter.

The only reason that MAX_SECTION_BITS matters is for the 'struct page *' lookup.

And the only reason for *that* is because of HIGHMEM.

So it's all entirely silly and pointless on any sane architecture, I think.

> We'll need a preemption point there for CONFIG_PREEMPT_VOLUNTARY
> as well, right?

Ahh, yes.  I should have looked at the code, and not just gone by my
"PREEMPT_NONE vs PREEMPT" thing that entirely forgot about how we
split that up.

> Just one minor point -- seems to me that the choice of nontemporal or
> temporal might have to be based on a hint to clear_huge_page().

Quite possibly. But I'd prefer that  as a separate "look, this
improves numbers by X%" thing from the whole "let's make the
clear_huge_page() interface at least sane".

                 Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 00/21] huge page clearing optimizations
  2022-06-07 17:56     ` Linus Torvalds
  2022-06-08 19:24       ` Ankur Arora
@ 2022-06-08 19:49       ` Matthew Wilcox
  2022-06-08 19:51         ` Matthew Wilcox
  1 sibling, 1 reply; 35+ messages in thread
From: Matthew Wilcox @ 2022-06-08 19:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ankur Arora, Linux Kernel Mailing List, Linux-MM,
	the arch/x86 maintainers, Andrew Morton, Mike Kravetz,
	Ingo Molnar, Andrew Lutomirski, Thomas Gleixner, Borislav Petkov,
	Peter Zijlstra, Andi Kleen, Arnd Bergmann, Jason Gunthorpe,
	jon.grimm, Boris Ostrovsky, Konrad Rzeszutek Wilk, Joao Martins

On Tue, Jun 07, 2022 at 10:56:01AM -0700, Linus Torvalds wrote:
> I worry a bit about the insanity of the "gigantic" pages, and the
> mem_map_next() games it plays, but that code is from 2008 and I really
> doubt it makes any sense to keep around at least for x86. The source
> of that abomination is powerpc, and I do not think that whole issue
> with MAX_ORDER_NR_PAGES makes any difference on x86, at least.

Oh, argh, I meant to delete mem_map_next(), and forgot.

If you need to use struct page (a later message hints you don't), just
use nth_page() directly.  I optimised it so it's not painful except on
SPARSEMEM && !SPARSEMEM_VMEMMAP back in December in commit 659508f9c936.
And nobody cares about performance on SPARSEMEM && !SPARSEMEM_VMEMMAP
systems.
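
(i.e. the mem_map_next() walk in clear_gigantic_page() collapses to a
plain loop; rough sketch, with addr being the user virtual address:)

    for (i = 0; i < nr_pages; i++) {
        cond_resched();
        clear_user_highpage(nth_page(page, i), addr + i * PAGE_SIZE);
    }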

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 00/21] huge page clearing optimizations
  2022-06-08 19:49       ` Matthew Wilcox
@ 2022-06-08 19:51         ` Matthew Wilcox
  0 siblings, 0 replies; 35+ messages in thread
From: Matthew Wilcox @ 2022-06-08 19:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ankur Arora, Linux Kernel Mailing List, Linux-MM,
	the arch/x86 maintainers, Andrew Morton, Mike Kravetz,
	Ingo Molnar, Andrew Lutomirski, Thomas Gleixner, Borislav Petkov,
	Peter Zijlstra, Andi Kleen, Arnd Bergmann, Jason Gunthorpe,
	jon.grimm, Boris Ostrovsky, Konrad Rzeszutek Wilk, Joao Martins

On Wed, Jun 08, 2022 at 08:49:57PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 07, 2022 at 10:56:01AM -0700, Linus Torvalds wrote:
> > I worry a bit about the insanity of the "gigantic" pages, and the
> > mem_map_next() games it plays, but that code is from 2008 and I really
> > doubt it makes any sense to keep around at least for x86. The source
> > of that abomination is powerpc, and I do not think that whole issue
> > with MAX_ORDER_NR_PAGES makes any difference on x86, at least.
> 
> Oh, argh, I meant to delete mem_map_next(), and forgot.
> 
> If you need to use struct page (a later message hints you don't), just
> use nth_page() directly.  I optimised it so it's not painful except on
> SPARSEMEM && !SPARSEMEM_VMEMMAP back in December in commit 659508f9c936.
> And nobody cares about performance on SPARSEMEM && !SPARSEMEM_VMEMMAP
> systems.

Oops, wrong commit.  I meant 1cfcee728391 from June 2021.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 00/21] huge page clearing optimizations
  2022-06-08 19:39         ` Linus Torvalds
@ 2022-06-08 20:21           ` Ankur Arora
  0 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-08 20:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ankur Arora, Linux Kernel Mailing List, Linux-MM,
	the arch/x86 maintainers, Andrew Morton, Mike Kravetz,
	Ingo Molnar, Andrew Lutomirski, Thomas Gleixner, Borislav Petkov,
	Peter Zijlstra, Andi Kleen, Arnd Bergmann, Jason Gunthorpe,
	jon.grimm, Boris Ostrovsky, Konrad Rzeszutek Wilk, Joao Martins


Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Jun 8, 2022 at 12:25 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> But, even on x86, AFAICT gigantic pages could straddle MAX_SECTION_BITS?
>> An arch-specific clear_huge_page() could, however, handle 1GB pages
>> via some kind of static loop around (30 - MAX_SECTION_BITS).
>
> Even if gigantic pages straddle that area, it simply shouldn't matter.
>
> The only reason that MAX_SECTION_BITS matters is for the 'struct page *' lookup.
>
> And the only reason for *that* is because of HIGHMEM.
>
> So it's all entirely silly and pointless on any sane architecture, I think.
>
>> We'll need a preemption point there for CONFIG_PREEMPT_VOLUNTARY
>> as well, right?
>
> Ahh, yes.  I should have looked at the code, and not just gone by my
> "PREEMPT_NONE vs PREEMPT" thing that entirely forgot about how we
> split that up.
>
>> Just one minor point -- seems to me that the choice of nontemporal or
>> temporal might have to be based on a hint to clear_huge_page().
>
> Quite possibly. But I'd prefer that  as a separate "look, this
> improves numbers by X%" thing from the whole "let's make the
> clear_huge_page() interface at least sane".

Makes sense to me.

--
ankur

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 09/21] x86/asm: add clear_pages_movnt()
  2022-06-06 20:37 ` [PATCH v3 09/21] x86/asm: add clear_pages_movnt() Ankur Arora
@ 2022-06-10 22:11   ` Noah Goldstein
  2022-06-10 22:15     ` Noah Goldstein
  0 siblings, 1 reply; 35+ messages in thread
From: Noah Goldstein @ 2022-06-10 22:11 UTC (permalink / raw)
  To: Ankur Arora
  Cc: open list, linux-mm, X86 ML, torvalds, akpm, mike.kravetz, mingo,
	Andy Lutomirski, tglx, Borislav Petkov, peterz, ak, arnd, jgg,
	jon.grimm, boris.ostrovsky, konrad.wilk, joao.m.martins

On Mon, Jun 6, 2022 at 11:39 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> Add clear_pages_movnt(), which uses MOVNTI as the underlying primitive.
> With this, page-clearing can skip the memory hierarchy, thus providing
> a non cache-polluting implementation of clear_pages().
>
> MOVNTI, from the Intel SDM, Volume 2B, 4-101:
>  "The non-temporal hint is implemented by using a write combining (WC)
>   memory type protocol when writing the data to memory. Using this
>   protocol, the processor does not write the data into the cache
>   hierarchy, nor does it fetch the corresponding cache line from memory
>   into the cache hierarchy."
>
> The AMD Arch Manual has something similar to say as well.
>
> One use-case is to zero large extents without bringing in never-to-be-
> accessed cachelines. Also, often clear_pages_movnt() based clearing is
> faster once extent sizes are O(LLC-size).
>
> As the excerpt notes, MOVNTI is weakly ordered with respect to other
> instructions operating on the memory hierarchy. This needs to be
> handled by the caller by executing an SFENCE when done.
>
> The implementation is straight-forward: unroll the inner loop to keep
> the code similar to memset_movnti(), so that we can gauge
> clear_pages_movnt() performance via perf bench mem memset.
>
>  # Intel Icelakex
>  # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
>  # (X86_FEATURE_ERMS) and x86-64-movnt:
>
>  System:      Oracle X9-2 (2 nodes * 32 cores * 2 threads)
>  Processor:   Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
>  Memory:      512 GB evenly split between nodes
>  LLC-size:    48MB for each node (32-cores * 2-threads)
>  no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance
>
>               x86-64-stosb (5 runs)     x86-64-movnt (5 runs)    Delta(%)
>               ----------------------    ---------------------    --------
>      size            BW   (   stdev)          BW    (   stdev)
>
>       2MB      14.37 GB/s ( +- 1.55)     12.59 GB/s ( +- 1.20)   -12.38%
>      16MB      16.93 GB/s ( +- 2.61)     15.91 GB/s ( +- 2.74)    -6.02%
>     128MB      12.12 GB/s ( +- 1.06)     22.33 GB/s ( +- 1.84)   +84.24%
>    1024MB      12.12 GB/s ( +- 0.02)     23.92 GB/s ( +- 0.14)   +97.35%
>    4096MB      12.08 GB/s ( +- 0.02)     23.98 GB/s ( +- 0.18)   +98.50%

For these sizes it may be worth it to save/rstor an xmm register to do
the memset:

Just on my Tigerlake laptop:
model name : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz

                  movntdq xmm (5 runs)       movnti GPR (5 runs)     Delta(%)
                  -----------------------    -----------------------
           size      BW GB/s ( +-  stdev)       BW GB/s ( +-  stdev)        %
           2 MB   35.71 GB/s ( +-   1.02)    34.62 GB/s ( +-   0.77)   -3.15%
          16 MB   36.43 GB/s ( +-   0.35)     31.3 GB/s ( +-    0.1)  -16.39%
         128 MB    35.6 GB/s ( +-   0.83)    30.82 GB/s ( +-   0.08)   -15.5%
        1024 MB   36.85 GB/s ( +-   0.26)    30.71 GB/s ( +-    0.2)   -20.0%
>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  arch/x86/include/asm/page_64.h |  1 +
>  arch/x86/lib/clear_page_64.S   | 21 +++++++++++++++++++++
>  2 files changed, 22 insertions(+)
>
> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> index a88a3508888a..3affc4ecb8da 100644
> --- a/arch/x86/include/asm/page_64.h
> +++ b/arch/x86/include/asm/page_64.h
> @@ -55,6 +55,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
>  void clear_pages_orig(void *page, unsigned long npages);
>  void clear_pages_rep(void *page, unsigned long npages);
>  void clear_pages_erms(void *page, unsigned long npages);
> +void clear_pages_movnt(void *page, unsigned long npages);
>
>  #define __HAVE_ARCH_CLEAR_USER_PAGES
>  static inline void clear_pages(void *page, unsigned int npages)
> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> index 2cc3b681734a..83d14f1c9f57 100644
> --- a/arch/x86/lib/clear_page_64.S
> +++ b/arch/x86/lib/clear_page_64.S
> @@ -58,3 +58,24 @@ SYM_FUNC_START(clear_pages_erms)
>         RET
>  SYM_FUNC_END(clear_pages_erms)
>  EXPORT_SYMBOL_GPL(clear_pages_erms)
> +
> +SYM_FUNC_START(clear_pages_movnt)
> +       xorl    %eax,%eax
> +       movq    %rsi,%rcx
> +       shlq    $PAGE_SHIFT, %rcx
> +
> +       .p2align 4
> +.Lstart:
> +       movnti  %rax, 0x00(%rdi)
> +       movnti  %rax, 0x08(%rdi)
> +       movnti  %rax, 0x10(%rdi)
> +       movnti  %rax, 0x18(%rdi)
> +       movnti  %rax, 0x20(%rdi)
> +       movnti  %rax, 0x28(%rdi)
> +       movnti  %rax, 0x30(%rdi)
> +       movnti  %rax, 0x38(%rdi)
> +       addq    $0x40, %rdi
> +       subl    $0x40, %ecx
> +       ja      .Lstart
> +       RET
> +SYM_FUNC_END(clear_pages_movnt)
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 09/21] x86/asm: add clear_pages_movnt()
  2022-06-10 22:11   ` Noah Goldstein
@ 2022-06-10 22:15     ` Noah Goldstein
  2022-06-12 11:18       ` Ankur Arora
  0 siblings, 1 reply; 35+ messages in thread
From: Noah Goldstein @ 2022-06-10 22:15 UTC (permalink / raw)
  To: Ankur Arora
  Cc: open list, linux-mm, X86 ML, torvalds, akpm, mike.kravetz, mingo,
	Andy Lutomirski, tglx, Borislav Petkov, peterz, ak, arnd, jgg,
	jon.grimm, boris.ostrovsky, konrad.wilk, joao.m.martins

On Fri, Jun 10, 2022 at 3:11 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, Jun 6, 2022 at 11:39 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >
> > Add clear_pages_movnt(), which uses MOVNTI as the underlying primitive.
> > With this, page-clearing can skip the memory hierarchy, thus providing
> > a non cache-polluting implementation of clear_pages().
> >
> > MOVNTI, from the Intel SDM, Volume 2B, 4-101:
> >  "The non-temporal hint is implemented by using a write combining (WC)
> >   memory type protocol when writing the data to memory. Using this
> >   protocol, the processor does not write the data into the cache
> >   hierarchy, nor does it fetch the corresponding cache line from memory
> >   into the cache hierarchy."
> >
> > The AMD Arch Manual has something similar to say as well.
> >
> > One use-case is to zero large extents without bringing in never-to-be-
> > accessed cachelines. Also, often clear_pages_movnt() based clearing is
> > faster once extent sizes are O(LLC-size).
> >
> > As the excerpt notes, MOVNTI is weakly ordered with respect to other
> > instructions operating on the memory hierarchy. This needs to be
> > handled by the caller by executing an SFENCE when done.
> >
> > The implementation is straight-forward: unroll the inner loop to keep
> > the code similar to memset_movnti(), so that we can gauge
> > clear_pages_movnt() performance via perf bench mem memset.
> >
> >  # Intel Icelakex
> >  # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
> >  # (X86_FEATURE_ERMS) and x86-64-movnt:
> >
> >  System:      Oracle X9-2 (2 nodes * 32 cores * 2 threads)
> >  Processor:   Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
> >  Memory:      512 GB evenly split between nodes
> >  LLC-size:    48MB for each node (32-cores * 2-threads)
> >  no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance
> >
> >               x86-64-stosb (5 runs)     x86-64-movnt (5 runs)    Delta(%)
> >               ----------------------    ---------------------    --------
> >      size            BW   (   stdev)          BW    (   stdev)
> >
> >       2MB      14.37 GB/s ( +- 1.55)     12.59 GB/s ( +- 1.20)   -12.38%
> >      16MB      16.93 GB/s ( +- 2.61)     15.91 GB/s ( +- 2.74)    -6.02%
> >     128MB      12.12 GB/s ( +- 1.06)     22.33 GB/s ( +- 1.84)   +84.24%
> >    1024MB      12.12 GB/s ( +- 0.02)     23.92 GB/s ( +- 0.14)   +97.35%
> >    4096MB      12.08 GB/s ( +- 0.02)     23.98 GB/s ( +- 0.18)   +98.50%
>
> For these sizes it may be worth it to save/rstor an xmm register to do
> the memset:
>
> Just on my Tigerlake laptop:
> model name : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
>
>                   movntdq xmm (5 runs)       movnti GPR (5 runs)     Delta(%)
>                   -----------------------    -----------------------
>            size      BW GB/s ( +-  stdev)       BW GB/s ( +-  stdev)        %
>            2 MB   35.71 GB/s ( +-   1.02)    34.62 GB/s ( +-   0.77)   -3.15%
>           16 MB   36.43 GB/s ( +-   0.35)     31.3 GB/s ( +-    0.1)  -16.39%
>          128 MB    35.6 GB/s ( +-   0.83)    30.82 GB/s ( +-   0.08)   -15.5%
>         1024 MB   36.85 GB/s ( +-   0.26)    30.71 GB/s ( +-    0.2)   -20.0%


Also (again just from Tigerlake laptop) I found the trend favors
`rep stosb` more (as opposed to non-cacheable writes) when
there are multiple threads competing for BW:

https://docs.google.com/spreadsheets/d/1f6N9EVqHg71cDIR-RALLR76F_ovW5gzwIWr26yLCmS0/edit?usp=sharing
> >
> > Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> > ---
> >  arch/x86/include/asm/page_64.h |  1 +
> >  arch/x86/lib/clear_page_64.S   | 21 +++++++++++++++++++++
> >  2 files changed, 22 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> > index a88a3508888a..3affc4ecb8da 100644
> > --- a/arch/x86/include/asm/page_64.h
> > +++ b/arch/x86/include/asm/page_64.h
> > @@ -55,6 +55,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
> >  void clear_pages_orig(void *page, unsigned long npages);
> >  void clear_pages_rep(void *page, unsigned long npages);
> >  void clear_pages_erms(void *page, unsigned long npages);
> > +void clear_pages_movnt(void *page, unsigned long npages);
> >
> >  #define __HAVE_ARCH_CLEAR_USER_PAGES
> >  static inline void clear_pages(void *page, unsigned int npages)
> > diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> > index 2cc3b681734a..83d14f1c9f57 100644
> > --- a/arch/x86/lib/clear_page_64.S
> > +++ b/arch/x86/lib/clear_page_64.S
> > @@ -58,3 +58,24 @@ SYM_FUNC_START(clear_pages_erms)
> >         RET
> >  SYM_FUNC_END(clear_pages_erms)
> >  EXPORT_SYMBOL_GPL(clear_pages_erms)
> > +
> > +SYM_FUNC_START(clear_pages_movnt)
> > +       xorl    %eax,%eax
> > +       movq    %rsi,%rcx
> > +       shlq    $PAGE_SHIFT, %rcx
> > +
> > +       .p2align 4
> > +.Lstart:
> > +       movnti  %rax, 0x00(%rdi)
> > +       movnti  %rax, 0x08(%rdi)
> > +       movnti  %rax, 0x10(%rdi)
> > +       movnti  %rax, 0x18(%rdi)
> > +       movnti  %rax, 0x20(%rdi)
> > +       movnti  %rax, 0x28(%rdi)
> > +       movnti  %rax, 0x30(%rdi)
> > +       movnti  %rax, 0x38(%rdi)
> > +       addq    $0x40, %rdi
> > +       subl    $0x40, %ecx
> > +       ja      .Lstart
> > +       RET
> > +SYM_FUNC_END(clear_pages_movnt)
> > --
> > 2.31.1
> >

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 09/21] x86/asm: add clear_pages_movnt()
  2022-06-10 22:15     ` Noah Goldstein
@ 2022-06-12 11:18       ` Ankur Arora
  0 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-12 11:18 UTC (permalink / raw)
  To: Noah Goldstein
  Cc: Ankur Arora, open list, linux-mm, X86 ML, torvalds, akpm,
	mike.kravetz, mingo, Andy Lutomirski, tglx, Borislav Petkov,
	peterz, ak, arnd, jgg, jon.grimm, boris.ostrovsky, konrad.wilk,
	joao.m.martins


Noah Goldstein <goldstein.w.n@gmail.com> writes:

> On Fri, Jun 10, 2022 at 3:11 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>
>> On Mon, Jun 6, 2022 at 11:39 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> >
>> > Add clear_pages_movnt(), which uses MOVNTI as the underlying primitive.
>> > With this, page-clearing can skip the memory hierarchy, thus providing
>> > a non cache-polluting implementation of clear_pages().
>> >
>> > MOVNTI, from the Intel SDM, Volume 2B, 4-101:
>> >  "The non-temporal hint is implemented by using a write combining (WC)
>> >   memory type protocol when writing the data to memory. Using this
>> >   protocol, the processor does not write the data into the cache
>> >   hierarchy, nor does it fetch the corresponding cache line from memory
>> >   into the cache hierarchy."
>> >
>> > The AMD Arch Manual has something similar to say as well.
>> >
>> > One use-case is to zero large extents without bringing in never-to-be-
>> > accessed cachelines. Also, often clear_pages_movnt() based clearing is
>> > faster once extent sizes are O(LLC-size).
>> >
>> > As the excerpt notes, MOVNTI is weakly ordered with respect to other
>> > instructions operating on the memory hierarchy. This needs to be
>> > handled by the caller by executing an SFENCE when done.
>> >
>> > The implementation is straight-forward: unroll the inner loop to keep
>> > the code similar to memset_movnti(), so that we can gauge
>> > clear_pages_movnt() performance via perf bench mem memset.
>> >
>> >  # Intel Icelakex
>> >  # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
>> >  # (X86_FEATURE_ERMS) and x86-64-movnt:
>> >
>> >  System:      Oracle X9-2 (2 nodes * 32 cores * 2 threads)
>> >  Processor:   Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
>> >  Memory:      512 GB evenly split between nodes
>> >  LLC-size:    48MB for each node (32-cores * 2-threads)
>> >  no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance
>> >
>> >               x86-64-stosb (5 runs)     x86-64-movnt (5 runs)    Delta(%)
>> >               ----------------------    ---------------------    --------
>> >      size            BW   (   stdev)          BW    (   stdev)
>> >
>> >       2MB      14.37 GB/s ( +- 1.55)     12.59 GB/s ( +- 1.20)   -12.38%
>> >      16MB      16.93 GB/s ( +- 2.61)     15.91 GB/s ( +- 2.74)    -6.02%
>> >     128MB      12.12 GB/s ( +- 1.06)     22.33 GB/s ( +- 1.84)   +84.24%
>> >    1024MB      12.12 GB/s ( +- 0.02)     23.92 GB/s ( +- 0.14)   +97.35%
>> >    4096MB      12.08 GB/s ( +- 0.02)     23.98 GB/s ( +- 0.18)   +98.50%
>>
>> For these sizes it may be worth it to save/rstor an xmm register to do
>> the memset:
>>
>> Just on my Tigerlake laptop:
>> model name : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
>>
>>                   movntdq xmm (5 runs)       movnti GPR (5 runs)     Delta(%)
>>                   -----------------------    -----------------------
>>            size      BW GB/s ( +-  stdev)       BW GB/s ( +-  stdev)        %
>>            2 MB   35.71 GB/s ( +-   1.02)    34.62 GB/s ( +-   0.77)   -3.15%
>>           16 MB   36.43 GB/s ( +-   0.35)     31.3 GB/s ( +-    0.1)  -16.39%
>>          128 MB    35.6 GB/s ( +-   0.83)    30.82 GB/s ( +-   0.08)   -15.5%
>>         1024 MB   36.85 GB/s ( +-   0.26)    30.71 GB/s ( +-    0.2)   -20.0%

Thanks, this looks interesting. Any thoughts on what causes the drop-off
for the movnti loop as the region size increases?

I can see the usual two problems with using the XMM registers:

 - the kernel_fpu_begin()/_end() overhead
 - kernel_fpu regions need preemption disabled, which limits the
   extent that can be cleared in a single operation

And given how close movntdq and movnti are for size=2MB, I'm not
sure movntdq would even come out ahead if we include the XMM
save/restore overhead?
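
(Roughly what the XMM variant would have to wrap -- clear_pages_movntdq()
here is hypothetical:)

    kernel_fpu_begin();                  /* save FPU state; preemption off */
    clear_pages_movntdq(addr, npages);   /* hypothetical MOVNTDQ loop */
    kernel_fpu_end();                    /* restore FPU state */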

> Also (again just from Tigerlake laptop) I found the trend favors
> `rep stosb` more (as opposed to non-cacheable writes) when
> there are multiple threads competing for BW:

I notice in your spreadsheet that you ran the tests only up to
~32MB. How does the performance on Tigerlake change as you
go up to, say, 512MB? Also, it's a little unexpected that the
cacheable SIMD variant pretty much always performs the worst.

In general, I wouldn't expect NT writes to perform better for O(LLC-size).
That's why this series avoids using NT writes for sizes smaller than
that (see patch-19.)

The argument is: the larger the region being cleared, the less the
caller cares about the contents and thus we can avoid using the cache.
The other part, of course, is that NT doesn't perform as well for small
sizes, and so using it would regress performance for some users.


Ankur

> https://docs.google.com/spreadsheets/d/1f6N9EVqHg71cDIR-RALLR76F_ovW5gzwIWr26yLCmS0/edit?usp=sharing

>> >
>> > Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> > ---
>> >  arch/x86/include/asm/page_64.h |  1 +
>> >  arch/x86/lib/clear_page_64.S   | 21 +++++++++++++++++++++
>> >  2 files changed, 22 insertions(+)
>> >
>> > diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
>> > index a88a3508888a..3affc4ecb8da 100644
>> > --- a/arch/x86/include/asm/page_64.h
>> > +++ b/arch/x86/include/asm/page_64.h
>> > @@ -55,6 +55,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
>> >  void clear_pages_orig(void *page, unsigned long npages);
>> >  void clear_pages_rep(void *page, unsigned long npages);
>> >  void clear_pages_erms(void *page, unsigned long npages);
>> > +void clear_pages_movnt(void *page, unsigned long npages);
>> >
>> >  #define __HAVE_ARCH_CLEAR_USER_PAGES
>> >  static inline void clear_pages(void *page, unsigned int npages)
>> > diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
>> > index 2cc3b681734a..83d14f1c9f57 100644
>> > --- a/arch/x86/lib/clear_page_64.S
>> > +++ b/arch/x86/lib/clear_page_64.S
>> > @@ -58,3 +58,24 @@ SYM_FUNC_START(clear_pages_erms)
>> >         RET
>> >  SYM_FUNC_END(clear_pages_erms)
>> >  EXPORT_SYMBOL_GPL(clear_pages_erms)
>> > +
>> > +SYM_FUNC_START(clear_pages_movnt)
>> > +       xorl    %eax,%eax
>> > +       movq    %rsi,%rcx
>> > +       shlq    $PAGE_SHIFT, %rcx
>> > +
>> > +       .p2align 4
>> > +.Lstart:
>> > +       movnti  %rax, 0x00(%rdi)
>> > +       movnti  %rax, 0x08(%rdi)
>> > +       movnti  %rax, 0x10(%rdi)
>> > +       movnti  %rax, 0x18(%rdi)
>> > +       movnti  %rax, 0x20(%rdi)
>> > +       movnti  %rax, 0x28(%rdi)
>> > +       movnti  %rax, 0x30(%rdi)
>> > +       movnti  %rax, 0x38(%rdi)
>> > +       addq    $0x40, %rdi
>> > +       subl    $0x40, %ecx
>> > +       ja      .Lstart
>> > +       RET
>> > +SYM_FUNC_END(clear_pages_movnt)
>> > --
>> > 2.31.1
>> >


--
ankur

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 13/21] clear_page: add generic clear_user_pages_incoherent()
  2022-06-08  0:01   ` Luc Van Oostenryck
@ 2022-06-12 11:19     ` Ankur Arora
  0 siblings, 0 replies; 35+ messages in thread
From: Ankur Arora @ 2022-06-12 11:19 UTC (permalink / raw)
  To: Luc Van Oostenryck
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm,
	mike.kravetz, mingo, luto, tglx, bp, peterz, ak, arnd, jgg,
	jon.grimm, boris.ostrovsky, konrad.wilk, joao.m.martins


Luc Van Oostenryck <luc.vanoostenryck@gmail.com> writes:

> On Mon, Jun 06, 2022 at 08:37:17PM +0000, Ankur Arora wrote:
>> +static inline void clear_user_pages_incoherent(__incoherent void *page,
>> +					       unsigned long vaddr,
>> +					       struct page *pg,
>> +					       unsigned int npages)
>> +{
>> +	clear_user_pages((__force void *)page, vaddr, pg, npages);
>> +}
>
> Hi,
>
> Please use 'void __incoherent *' and 'void __force *', as it's done
> elsewhere for __force and address spaces.

Thanks Luc. Will fix.

--
ankur

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2022-06-12 11:20 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-06 20:20 [PATCH v3 00/21] huge page clearing optimizations Ankur Arora
2022-06-06 20:20 ` [PATCH v3 01/21] mm, huge-page: reorder arguments to process_huge_page() Ankur Arora
2022-06-06 20:20 ` [PATCH v3 02/21] mm, huge-page: refactor process_subpage() Ankur Arora
2022-06-06 20:20 ` [PATCH v3 03/21] clear_page: add generic clear_user_pages() Ankur Arora
2022-06-06 20:20 ` [PATCH v3 04/21] mm, clear_huge_page: support clear_user_pages() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 05/21] mm/huge_page: generalize process_huge_page() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 06/21] x86/clear_page: add clear_pages() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 07/21] x86/asm: add memset_movnti() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 08/21] perf bench: " Ankur Arora
2022-06-06 20:37 ` [PATCH v3 09/21] x86/asm: add clear_pages_movnt() Ankur Arora
2022-06-10 22:11   ` Noah Goldstein
2022-06-10 22:15     ` Noah Goldstein
2022-06-12 11:18       ` Ankur Arora
2022-06-06 20:37 ` [PATCH v3 10/21] x86/asm: add clear_pages_clzero() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 11/21] x86/cpuid: add X86_FEATURE_MOVNT_SLOW Ankur Arora
2022-06-06 20:37 ` [PATCH v3 12/21] sparse: add address_space __incoherent Ankur Arora
2022-06-06 20:37 ` [PATCH v3 13/21] clear_page: add generic clear_user_pages_incoherent() Ankur Arora
2022-06-08  0:01   ` Luc Van Oostenryck
2022-06-12 11:19     ` Ankur Arora
2022-06-06 20:37 ` [PATCH v3 14/21] x86/clear_page: add clear_pages_incoherent() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 15/21] mm/clear_page: add clear_page_non_caching_threshold() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 16/21] x86/clear_page: add arch_clear_page_non_caching_threshold() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 17/21] clear_huge_page: use non-cached clearing Ankur Arora
2022-06-06 20:37 ` [PATCH v3 18/21] gup: add FOLL_HINT_BULK, FAULT_FLAG_NON_CACHING Ankur Arora
2022-06-06 20:37 ` [PATCH v3 19/21] gup: hint non-caching if clearing large regions Ankur Arora
2022-06-06 20:37 ` [PATCH v3 20/21] vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages() Ankur Arora
2022-06-06 20:37 ` [PATCH v3 21/21] x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake Ankur Arora
2022-06-06 21:53 ` [PATCH v3 00/21] huge page clearing optimizations Linus Torvalds
2022-06-07 15:08   ` Ankur Arora
2022-06-07 17:56     ` Linus Torvalds
2022-06-08 19:24       ` Ankur Arora
2022-06-08 19:39         ` Linus Torvalds
2022-06-08 20:21           ` Ankur Arora
2022-06-08 19:49       ` Matthew Wilcox
2022-06-08 19:51         ` Matthew Wilcox
