* [PATCH 0/9] x86/clear_huge_page: multi-page clearing
@ 2023-04-03  5:22 Ankur Arora
  2023-04-03  5:22 ` [PATCH 1/9] huge_pages: get rid of process_huge_page() Ankur Arora
                   ` (9 more replies)
  0 siblings, 10 replies; 30+ messages in thread
From: Ankur Arora @ 2023-04-03  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
	jon.grimm, bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora

This series introduces multi-page clearing for hugepages. 

This is a follow up of some of the ideas discussed at:
  https://lore.kernel.org/lkml/CAHk-=wj9En-BC4t7J9xFZOws5ShwaR9yor7FxHZr8CTVyEP_+Q@mail.gmail.com/

On x86, page clearing is typically done via string instructions. These,
unlike a MOV loop, allow us to explicitly advertise the region-size to
the processor, which could serve as a hint to current (and/or
future) uarchs to elide cacheline allocation.

Of current generation processors, Milan (and presumably other Zen
variants) uses the hint to elide cacheline allocation (for
region-size > LLC-size).

An additional reason for doing this is that string instructions are typically
microcoded, and clearing in bigger chunks than the current page-at-a-
time logic amortizes some of the cost.
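
To make that concrete, here is a userspace-flavoured sketch (illustrative
only; the function names are made up) of the difference between
advertising the full extent in a single string operation and clearing a
page at a time:

  #include <stddef.h>

  /*
   * Clear buf[0, len) with a single REP STOSB: the whole extent is
   * advertised to the CPU via the count register (%rcx).
   */
  static void clear_region_string(void *buf, size_t len)
  {
          asm volatile("rep stosb"
                       : "+D" (buf), "+c" (len)
                       : "a" (0)
                       : "memory");
  }

  /*
   * Page-at-a-time clearing: the CPU only ever sees 4KB extents, so it
   * has nothing to base a "don't allocate cachelines" decision on.
   */
  static void clear_region_paged(void *buf, size_t len)
  {
          for (size_t off = 0; off < len; off += 4096)
                  clear_region_string((char *)buf + off, 4096);
  }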

All uarchs tested (Milan, Icelakex, Skylakex) showed improved performance.

There are, however, some problems:

1. extended zeroing periods mean increased latency due to the
   now-missing preemption points.

   That's handled in patches 7, 8, 9:
     "sched: define TIF_ALLOW_RESCHED"
     "irqentry: define irqentry_exit_allow_resched()"
     "x86/clear_huge_page: make clear_contig_region() preemptible"
   by the context marking itself reschedulable and, if needed, being
   rescheduled in irqexit context (for PREEMPTION_NONE/_VOLUNTARY);
   see the sketch after this list.

2. the current page-at-a-time clearing logic does left-right narrowing
   towards the faulting page, which maintains cache locality for
   workloads with a sequential access pattern. Clearing in large chunks
   loses that.

   Some (but not all) of that could be ameliorated by something like
   this patch:
   https://lore.kernel.org/lkml/20220606203725.1313715-1-ankur.a.arora@oracle.com/

   But before doing that, I'd like some comments on whether it is worth
   doing for this specific use case.
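
The sketch referred to in 1. -- roughly what patch 9 does, using the
clear_contig_region() helper from patch 6 and the markers from patch 7
(shape only, not the exact diff):

  static void clear_contig_region(struct page *page, unsigned long vaddr,
                                  unsigned int npages)
  {
          /*
           * Mark this context reschedulable: if an interrupt arrives
           * during the long REP STOS, the irqexit path (patch 8) can
           * now reschedule us even under PREEMPTION_NONE/_VOLUNTARY.
           */
          allow_resched();
          clear_user_pages(page_address(page), vaddr, page, npages);
          disallow_resched();
  }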

Rest of the series:
  Patches 1, 2, 3:
    "huge_pages: get rid of process_huge_page()"
    "huge_page: get rid of {clear,copy}_subpage()"
    "huge_page: allow arch override for clear/copy_huge_page()"
  are mechanical and they simplify some of the current clear_huge_page()
  logic.

  Patches 4, 5:
    "x86/clear_page: parameterize clear_page*() to specify length"
    "x86/clear_pages: add clear_pages()"
  add clear_pages() and helpers.

  Patch 6: "mm/clear_huge_page: use multi-page clearing" adds the
  chunked x86 clear_huge_page() implementation.


Performance
==

Demand fault performance gets a decent boost:

  *Icelakex*  mm/clear_huge_page   x86/clear_huge_page   change   
                          (GB/s)                (GB/s)            
                                                                  
  pg-sz=2MB                 8.76                 11.82   +34.93%  
  pg-sz=1GB                 8.99                 12.18   +35.48%  


  *Milan*     mm/clear_huge_page   x86/clear_huge_page   change    
                          (GB/s)                (GB/s)             
                                                                   
  pg-sz=2MB                12.24                 17.54    +43.30%  
  pg-sz=1GB                17.98                 37.24   +107.11%  


vm-scalability/case-anon-w-seq-hugetlb shows gains in stime but performs
worse when user space touches those pages:

  *Icelakex*                  mm/clear_huge_page   x86/clear_huge_page   change
  (mem=4GB/task, tasks=128)

  stime                           293.02 +- .49%        239.39 +- .83%   -18.30%
  utime                           440.11 +- .28%        508.74 +- .60%   +15.59%
  wall-clock                        5.96 +- .33%          6.27 +-2.23%   + 5.20%


  *Milan*                     mm/clear_huge_page   x86/clear_huge_page   change
  (mem=1GB/task, tasks=512)

  stime                          490.95 +- 3.55%       466.90 +- 4.79%   - 4.89%
  utime                          276.43 +- 2.85%       311.97 +- 5.15%   +12.85%
  wall-clock                       3.74 +- 6.41%         3.58 +- 7.82%   - 4.27%

Also at:
  github.com/terminus/linux clear-pages.v1

Comments appreciated!

Ankur Arora (9):
  huge_pages: get rid of process_huge_page()
  huge_page: get rid of {clear,copy}_subpage()
  huge_page: allow arch override for clear/copy_huge_page()
  x86/clear_page: parameterize clear_page*() to specify length
  x86/clear_pages: add clear_pages()
  mm/clear_huge_page: use multi-page clearing
  sched: define TIF_ALLOW_RESCHED
  irqentry: define irqentry_exit_allow_resched()
  x86/clear_huge_page: make clear_contig_region() preemptible

 arch/x86/include/asm/page.h        |   6 +
 arch/x86/include/asm/page_32.h     |   6 +
 arch/x86/include/asm/page_64.h     |  25 +++--
 arch/x86/include/asm/thread_info.h |   2 +
 arch/x86/lib/clear_page_64.S       |  45 ++++++--
 arch/x86/mm/hugetlbpage.c          |  59 ++++++++++
 include/linux/sched.h              |  29 +++++
 kernel/entry/common.c              |   8 ++
 kernel/sched/core.c                |  36 +++---
 mm/memory.c                        | 174 +++++++++++++++--------------
 10 files changed, 270 insertions(+), 120 deletions(-)

-- 
2.31.1



* [PATCH 1/9] huge_pages: get rid of process_huge_page()
  2023-04-03  5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
@ 2023-04-03  5:22 ` Ankur Arora
  2023-04-03  5:22 ` [PATCH 2/9] huge_page: get rid of {clear,copy}_subpage() Ankur Arora
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2023-04-03  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
	jon.grimm, bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora

process_huge_page() (and the subsequent call to process_subpage())
translate between base-addr[index] and struct page *: first when calling
clear/copy_subpage(), which then translates them into the parameters
needed by clear/copy_user_highpage().

There's no runtime cost in doing this, but it's unnecessary complexity
for something that only has two users.

Accordingly, fold process_huge_page() back into its callers.

Link: https://lore.kernel.org/lkml/20220606202109.1306034-1-ankur.a.arora@oracle.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 mm/memory.c | 126 ++++++++++++++++++++++++++++++----------------------
 1 file changed, 73 insertions(+), 53 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f456f3b5049c..d54bc27a35ca 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5629,63 +5629,13 @@ EXPORT_SYMBOL(__might_fault);
 #endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
-/*
- * Process all subpages of the specified huge page with the specified
- * operation.  The target subpage will be processed last to keep its
- * cache lines hot.
- */
-static inline void process_huge_page(
-	unsigned long addr_hint, unsigned int pages_per_huge_page,
-	void (*process_subpage)(unsigned long addr, int idx, void *arg),
-	void *arg)
-{
-	int i, n, base, l;
-	unsigned long addr = addr_hint &
-		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
-
-	/* Process target subpage last to keep its cache lines hot */
-	might_sleep();
-	n = (addr_hint - addr) / PAGE_SIZE;
-	if (2 * n <= pages_per_huge_page) {
-		/* If target subpage in first half of huge page */
-		base = 0;
-		l = n;
-		/* Process subpages at the end of huge page */
-		for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
-			cond_resched();
-			process_subpage(addr + i * PAGE_SIZE, i, arg);
-		}
-	} else {
-		/* If target subpage in second half of huge page */
-		base = pages_per_huge_page - 2 * (pages_per_huge_page - n);
-		l = pages_per_huge_page - n;
-		/* Process subpages at the begin of huge page */
-		for (i = 0; i < base; i++) {
-			cond_resched();
-			process_subpage(addr + i * PAGE_SIZE, i, arg);
-		}
-	}
-	/*
-	 * Process remaining subpages in left-right-left-right pattern
-	 * towards the target subpage
-	 */
-	for (i = 0; i < l; i++) {
-		int left_idx = base + i;
-		int right_idx = base + 2 * l - 1 - i;
-
-		cond_resched();
-		process_subpage(addr + left_idx * PAGE_SIZE, left_idx, arg);
-		cond_resched();
-		process_subpage(addr + right_idx * PAGE_SIZE, right_idx, arg);
-	}
-}
 
 static void clear_gigantic_page(struct page *page,
 				unsigned long addr,
 				unsigned int pages_per_huge_page)
 {
 	int i;
-	struct page *p;
+	struct page *p = page;
 
 	might_sleep();
 	for (i = 0; i < pages_per_huge_page; i++) {
@@ -5707,13 +5657,48 @@ void clear_huge_page(struct page *page,
 {
 	unsigned long addr = addr_hint &
 		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
+	int i, n, base, l;
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
 		clear_gigantic_page(page, addr, pages_per_huge_page);
 		return;
 	}
 
-	process_huge_page(addr_hint, pages_per_huge_page, clear_subpage, page);
+	/* Process target subpage last to keep its cache lines hot */
+	might_sleep();
+	n = (addr_hint - addr) / PAGE_SIZE;
+	if (2 * n <= pages_per_huge_page) {
+		/* If target subpage in first half of huge page */
+		base = 0;
+		l = n;
+		/* Process subpages at the end of huge page */
+		for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
+			cond_resched();
+			clear_subpage(addr + i * PAGE_SIZE, i, (void *)page);
+		}
+	} else {
+		/* If target subpage in second half of huge page */
+		base = pages_per_huge_page - 2 * (pages_per_huge_page - n);
+		l = pages_per_huge_page - n;
+		/* Process subpages at the begin of huge page */
+		for (i = 0; i < base; i++) {
+			cond_resched();
+			clear_subpage(addr + i * PAGE_SIZE, i, (void *)page);
+		}
+	}
+	/*
+	 * Process remaining subpages in left-right-left-right pattern
+	 * towards the target subpage
+	 */
+	for (i = 0; i < l; i++) {
+		int left_idx = base + i;
+		int right_idx = base + 2 * l - 1 - i;
+
+		cond_resched();
+		clear_subpage(addr + left_idx * PAGE_SIZE, left_idx, (void *)page);
+		cond_resched();
+		clear_subpage(addr + right_idx * PAGE_SIZE, right_idx, (void *)page);
+	}
 }
 
 static void copy_user_gigantic_page(struct page *dst, struct page *src,
@@ -5759,6 +5744,7 @@ void copy_user_huge_page(struct page *dst, struct page *src,
 		.src = src,
 		.vma = vma,
 	};
+	int i, n, base, l;
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
 		copy_user_gigantic_page(dst, src, addr, vma,
@@ -5766,7 +5752,41 @@ void copy_user_huge_page(struct page *dst, struct page *src,
 		return;
 	}
 
-	process_huge_page(addr_hint, pages_per_huge_page, copy_subpage, &arg);
+	/* Process target subpage last to keep its cache lines hot */
+	might_sleep();
+	n = (addr_hint - addr) / PAGE_SIZE;
+	if (2 * n <= pages_per_huge_page) {
+		/* If target subpage in first half of huge page */
+		base = 0;
+		l = n;
+		/* Process subpages at the end of huge page */
+		for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
+			cond_resched();
+			copy_subpage(addr + i * PAGE_SIZE, i, &arg);
+		}
+	} else {
+		/* If target subpage in second half of huge page */
+		base = pages_per_huge_page - 2 * (pages_per_huge_page - n);
+		l = pages_per_huge_page - n;
+		/* Process subpages at the begin of huge page */
+		for (i = 0; i < base; i++) {
+			cond_resched();
+			copy_subpage(addr + i * PAGE_SIZE, i, &arg);
+		}
+	}
+	/*
+	 * Process remaining subpages in left-right-left-right pattern
+	 * towards the target subpage
+	 */
+	for (i = 0; i < l; i++) {
+		int left_idx = base + i;
+		int right_idx = base + 2 * l - 1 - i;
+
+		cond_resched();
+		copy_subpage(addr + left_idx * PAGE_SIZE, left_idx, &arg);
+		cond_resched();
+		copy_subpage(addr + right_idx * PAGE_SIZE, right_idx, &arg);
+	}
 }
 
 long copy_huge_page_from_user(struct page *dst_page,
-- 
2.31.1



* [PATCH 2/9] huge_page: get rid of {clear,copy}_subpage()
  2023-04-03  5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
  2023-04-03  5:22 ` [PATCH 1/9] huge_pages: get rid of process_huge_page() Ankur Arora
@ 2023-04-03  5:22 ` Ankur Arora
  2023-04-03  5:22 ` [PATCH 3/9] huge_page: allow arch override for clear/copy_huge_page() Ankur Arora
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2023-04-03  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
	jon.grimm, bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora

clear/copy_subpage():
 static void clear_subpage(unsigned long addr, int idx, void *arg)
 static void copy_subpage(unsigned long addr, int idx, void *arg)

take as parameters an index, a post-indexing virtual address, and a
base struct page *, which is then resolved using the index.

Instead just use clear/copy_user_highpage() directly.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 mm/memory.c | 46 ++++++++++++----------------------------------
 1 file changed, 12 insertions(+), 34 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index d54bc27a35ca..6da97e6c7d21 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5645,13 +5645,6 @@ static void clear_gigantic_page(struct page *page,
 	}
 }
 
-static void clear_subpage(unsigned long addr, int idx, void *arg)
-{
-	struct page *page = arg;
-
-	clear_user_highpage(page + idx, addr);
-}
-
 void clear_huge_page(struct page *page,
 		     unsigned long addr_hint, unsigned int pages_per_huge_page)
 {
@@ -5674,7 +5667,7 @@ void clear_huge_page(struct page *page,
 		/* Process subpages at the end of huge page */
 		for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
 			cond_resched();
-			clear_subpage(addr + i * PAGE_SIZE, i, (void *)page);
+			clear_user_highpage(page + i, addr + i * PAGE_SIZE);
 		}
 	} else {
 		/* If target subpage in second half of huge page */
@@ -5683,7 +5676,7 @@ void clear_huge_page(struct page *page,
 		/* Process subpages at the begin of huge page */
 		for (i = 0; i < base; i++) {
 			cond_resched();
-			clear_subpage(addr + i * PAGE_SIZE, i, (void *)page);
+			clear_user_highpage(page + i, addr + i * PAGE_SIZE);
 		}
 	}
 	/*
@@ -5695,9 +5688,9 @@ void clear_huge_page(struct page *page,
 		int right_idx = base + 2 * l - 1 - i;
 
 		cond_resched();
-		clear_subpage(addr + left_idx * PAGE_SIZE, left_idx, (void *)page);
+		clear_user_highpage(page + left_idx, addr + left_idx * PAGE_SIZE);
 		cond_resched();
-		clear_subpage(addr + right_idx * PAGE_SIZE, right_idx, (void *)page);
+		clear_user_highpage(page + right_idx, addr + right_idx * PAGE_SIZE);
 	}
 }
 
@@ -5719,31 +5712,12 @@ static void copy_user_gigantic_page(struct page *dst, struct page *src,
 	}
 }
 
-struct copy_subpage_arg {
-	struct page *dst;
-	struct page *src;
-	struct vm_area_struct *vma;
-};
-
-static void copy_subpage(unsigned long addr, int idx, void *arg)
-{
-	struct copy_subpage_arg *copy_arg = arg;
-
-	copy_user_highpage(copy_arg->dst + idx, copy_arg->src + idx,
-			   addr, copy_arg->vma);
-}
-
 void copy_user_huge_page(struct page *dst, struct page *src,
 			 unsigned long addr_hint, struct vm_area_struct *vma,
 			 unsigned int pages_per_huge_page)
 {
 	unsigned long addr = addr_hint &
 		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
-	struct copy_subpage_arg arg = {
-		.dst = dst,
-		.src = src,
-		.vma = vma,
-	};
 	int i, n, base, l;
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
@@ -5762,7 +5736,8 @@ void copy_user_huge_page(struct page *dst, struct page *src,
 		/* Process subpages at the end of huge page */
 		for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
 			cond_resched();
-			copy_subpage(addr + i * PAGE_SIZE, i, &arg);
+			copy_user_highpage(dst + i, src + i,
+					   addr + i*PAGE_SIZE, vma);
 		}
 	} else {
 		/* If target subpage in second half of huge page */
@@ -5771,7 +5746,8 @@ void copy_user_huge_page(struct page *dst, struct page *src,
 		/* Process subpages at the begin of huge page */
 		for (i = 0; i < base; i++) {
 			cond_resched();
-			copy_subpage(addr + i * PAGE_SIZE, i, &arg);
+			copy_user_highpage(dst + i, src + i,
+					   addr + i*PAGE_SIZE, vma);
 		}
 	}
 	/*
@@ -5783,9 +5759,11 @@ void copy_user_huge_page(struct page *dst, struct page *src,
 		int right_idx = base + 2 * l - 1 - i;
 
 		cond_resched();
-		copy_subpage(addr + left_idx * PAGE_SIZE, left_idx, &arg);
+		copy_user_highpage(dst + left_idx, src + left_idx,
+					   addr + left_idx*PAGE_SIZE, vma);
 		cond_resched();
-		copy_subpage(addr + right_idx * PAGE_SIZE, right_idx, &arg);
+		copy_user_highpage(dst + right_idx, src + right_idx,
+					   addr + right_idx*PAGE_SIZE, vma);
 	}
 }
 
-- 
2.31.1



* [PATCH 3/9] huge_page: allow arch override for clear/copy_huge_page()
  2023-04-03  5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
  2023-04-03  5:22 ` [PATCH 1/9] huge_pages: get rid of process_huge_page() Ankur Arora
  2023-04-03  5:22 ` [PATCH 2/9] huge_page: get rid of {clear,copy}_subpage() Ankur Arora
@ 2023-04-03  5:22 ` Ankur Arora
  2023-04-03  5:22 ` [PATCH 4/9] x86/clear_page: parameterize clear_page*() to specify length Ankur Arora
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2023-04-03  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
	jon.grimm, bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora

clear/copy_huge_page() are constrained to operate a page at a time
because they also need to handle the CONFIG_HIGHMEM case.

Mark both __weak to allow for arch specific optimizations.
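
For reference, the override relies on the usual weak/strong symbol
resolution; in outline (signatures as in this series, bodies elided):

  /* mm/memory.c: generic, page-at-a-time implementation. */
  __weak void clear_huge_page(struct page *page, unsigned long addr_hint,
                              unsigned int pages_per_huge_page)
  {
          /* ... CONFIG_HIGHMEM-safe page-at-a-time clearing ... */
  }

  /* arch/x86/mm/hugetlbpage.c (patch 6): a regular (strong) definition
   * with the same signature overrides the __weak one at link time. */
  void clear_huge_page(struct page *page, unsigned long addr_hint,
                       unsigned int pages_per_huge_page)
  {
          /* ... multi-page clearing ... */
  }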

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 mm/memory.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6da97e6c7d21..9a6bce384616 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5645,8 +5645,12 @@ static void clear_gigantic_page(struct page *page,
 	}
 }
 
-void clear_huge_page(struct page *page,
-		     unsigned long addr_hint, unsigned int pages_per_huge_page)
+/*
+ * clear_huge_page(): does page-at-a-time clearing to handle the
+ * CONFIG_HIGHMEM case.
+ */
+__weak void clear_huge_page(struct page *page,
+			    unsigned long addr_hint, unsigned int pages_per_huge_page)
 {
 	unsigned long addr = addr_hint &
 		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
@@ -5712,9 +5716,13 @@ static void copy_user_gigantic_page(struct page *dst, struct page *src,
 	}
 }
 
-void copy_user_huge_page(struct page *dst, struct page *src,
-			 unsigned long addr_hint, struct vm_area_struct *vma,
-			 unsigned int pages_per_huge_page)
+/*
+ * copy_user_huge_page(): does page-at-a-time copying to handle
+ * the CONFIG_HIGHMEM case.
+ */
+__weak void copy_user_huge_page(struct page *dst, struct page *src,
+				unsigned long addr_hint, struct vm_area_struct *vma,
+				unsigned int pages_per_huge_page)
 {
 	unsigned long addr = addr_hint &
 		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
-- 
2.31.1



* [PATCH 4/9] x86/clear_page: parameterize clear_page*() to specify length
  2023-04-03  5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
                   ` (2 preceding siblings ...)
  2023-04-03  5:22 ` [PATCH 3/9] huge_page: allow arch override for clear/copy_huge_page() Ankur Arora
@ 2023-04-03  5:22 ` Ankur Arora
  2023-04-06  8:19   ` Peter Zijlstra
  2023-04-03  5:22 ` [PATCH 5/9] x86/clear_pages: add clear_pages() Ankur Arora
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 30+ messages in thread
From: Ankur Arora @ 2023-04-03  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
	jon.grimm, bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora

Change clear_page*() to take a length parameter.
Rename to clear_pages_*().

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---

clear_page() is now defined in terms of clear_pages(). This means
that all clear_page() callsites -- which are more numerous -- would
use an additional register now.

Opinions on whether that's worth optimizing?

---
 arch/x86/include/asm/page_64.h | 18 ++++++++------
 arch/x86/lib/clear_page_64.S   | 45 +++++++++++++++++++++++++---------
 2 files changed, 43 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index cc6b8e087192..7ca3bd2448c1 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -39,22 +39,24 @@ extern unsigned long __phys_addr_symbol(unsigned long);
 
 #define __phys_reloc_hide(x)	(x)
 
-void clear_page_orig(void *page);
-void clear_page_rep(void *page);
-void clear_page_erms(void *page);
+void clear_pages_orig(void *page, unsigned int length);
+void clear_pages_rep(void *page, unsigned int length);
+void clear_pages_erms(void *page, unsigned int length);
 
 static inline void clear_page(void *page)
 {
+	unsigned long length = PAGE_SIZE;
 	/*
 	 * Clean up KMSAN metadata for the page being cleared. The assembly call
 	 * below clobbers @page, so we perform unpoisoning before it.
 	 */
-	kmsan_unpoison_memory(page, PAGE_SIZE);
-	alternative_call_2(clear_page_orig,
-			   clear_page_rep, X86_FEATURE_REP_GOOD,
-			   clear_page_erms, X86_FEATURE_ERMS,
+	kmsan_unpoison_memory(page, length);
+
+	alternative_call_2(clear_pages_orig,
+			   clear_pages_rep, X86_FEATURE_REP_GOOD,
+			   clear_pages_erms, X86_FEATURE_ERMS,
 			   "=D" (page),
-			   "0" (page)
+			   "0" (page), "S" (length)
 			   : "cc", "memory", "rax", "rcx");
 }
 
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index ecbfb4dd3b01..6069acf6072f 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -2,6 +2,8 @@
 #include <linux/linkage.h>
 #include <asm/asm.h>
 #include <asm/export.h>
+#include <asm/cpufeatures.h>
+#include <asm/alternative.h>
 
 /*
  * Most CPUs support enhanced REP MOVSB/STOSB instructions. It is
@@ -13,18 +15,30 @@
 /*
  * Zero a page.
  * %rdi	- page
+ * %esi	- page-length
+ *
+ * Clobbers: %rax, %rcx
  */
-SYM_FUNC_START(clear_page_rep)
-	movl $4096/8,%ecx
+SYM_FUNC_START(clear_pages_rep)
+	movl %esi, %ecx
 	xorl %eax,%eax
+	shrl $3,%ecx
 	rep stosq
 	RET
-SYM_FUNC_END(clear_page_rep)
-EXPORT_SYMBOL_GPL(clear_page_rep)
+SYM_FUNC_END(clear_pages_rep)
+EXPORT_SYMBOL_GPL(clear_pages_rep)
 
-SYM_FUNC_START(clear_page_orig)
+/*
+ * Original page zeroing loop.
+ * %rdi	- page
+ * %esi	- page-length
+ *
+ * Clobbers: %rax, %rcx, %rflags
+ */
+SYM_FUNC_START(clear_pages_orig)
+	movl   %esi, %ecx
 	xorl   %eax,%eax
-	movl   $4096/64,%ecx
+	shrl   $6,%ecx
 	.p2align 4
 .Lloop:
 	decl	%ecx
@@ -41,16 +55,23 @@ SYM_FUNC_START(clear_page_orig)
 	jnz	.Lloop
 	nop
 	RET
-SYM_FUNC_END(clear_page_orig)
-EXPORT_SYMBOL_GPL(clear_page_orig)
+SYM_FUNC_END(clear_pages_orig)
+EXPORT_SYMBOL_GPL(clear_pages_orig)
 
-SYM_FUNC_START(clear_page_erms)
-	movl $4096,%ecx
+/*
+ * Zero a page.
+ * %rdi	- page
+ * %esi	- page-length
+ *
+ * Clobbers: %rax, %rcx
+ */
+SYM_FUNC_START(clear_pages_erms)
+	movl %esi, %ecx
 	xorl %eax,%eax
 	rep stosb
 	RET
-SYM_FUNC_END(clear_page_erms)
-EXPORT_SYMBOL_GPL(clear_page_erms)
+SYM_FUNC_END(clear_pages_erms)
+EXPORT_SYMBOL_GPL(clear_pages_erms)
 
 /*
  * Default clear user-space.
-- 
2.31.1



* [PATCH 5/9] x86/clear_pages: add clear_pages()
  2023-04-03  5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
                   ` (3 preceding siblings ...)
  2023-04-03  5:22 ` [PATCH 4/9] x86/clear_page: parameterize clear_page*() to specify length Ankur Arora
@ 2023-04-03  5:22 ` Ankur Arora
  2023-04-06  8:23   ` Peter Zijlstra
  2023-04-03  5:22 ` [PATCH 6/9] mm/clear_huge_page: use multi-page clearing Ankur Arora
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 30+ messages in thread
From: Ankur Arora @ 2023-04-03  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
	jon.grimm, bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora

Add clear_pages() and define the ancillary clear_user_pages().

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/page.h    | 6 ++++++
 arch/x86/include/asm/page_32.h | 6 ++++++
 arch/x86/include/asm/page_64.h | 9 +++++++--
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index d18e5c332cb9..03e3c69fc427 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -28,6 +28,12 @@ static inline void clear_user_page(void *page, unsigned long vaddr,
 	clear_page(page);
 }
 
+static inline void clear_user_pages(void *page, unsigned long vaddr,
+				    struct page *pg, unsigned int nsubpages)
+{
+	clear_pages(page, nsubpages);
+}
+
 static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
 				  struct page *topage)
 {
diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
index 580d71aca65a..3523d1150cfc 100644
--- a/arch/x86/include/asm/page_32.h
+++ b/arch/x86/include/asm/page_32.h
@@ -22,6 +22,12 @@ static inline void clear_page(void *page)
 	memset(page, 0, PAGE_SIZE);
 }
 
+static inline void clear_pages(void *page, unsigned int nsubpages)
+{
+	for (int i = 0; i < nsubpages; i++)
+		clear_page(page + i * PAGE_SIZE);
+}
+
 static inline void copy_page(void *to, void *from)
 {
 	memcpy(to, from, PAGE_SIZE);
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 7ca3bd2448c1..42f6c45206c1 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -43,9 +43,9 @@ void clear_pages_orig(void *page, unsigned int length);
 void clear_pages_rep(void *page, unsigned int length);
 void clear_pages_erms(void *page, unsigned int length);
 
-static inline void clear_page(void *page)
+static inline void clear_pages(void *page, unsigned int nsubpages)
 {
-	unsigned long length = PAGE_SIZE;
+	unsigned int length = nsubpages * PAGE_SIZE;
 	/*
 	 * Clean up KMSAN metadata for the page being cleared. The assembly call
 	 * below clobbers @page, so we perform unpoisoning before it.
@@ -60,6 +60,11 @@ static inline void clear_page(void *page)
 			   : "cc", "memory", "rax", "rcx");
 }
 
+static inline void clear_page(void *page)
+{
+	clear_pages(page, 1);
+}
+
 void copy_page(void *to, void *from);
 
 #ifdef CONFIG_X86_5LEVEL
-- 
2.31.1



* [PATCH 6/9] mm/clear_huge_page: use multi-page clearing
  2023-04-03  5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
                   ` (4 preceding siblings ...)
  2023-04-03  5:22 ` [PATCH 5/9] x86/clear_pages: add clear_pages() Ankur Arora
@ 2023-04-03  5:22 ` Ankur Arora
  2023-04-03  5:22 ` [PATCH 7/9] sched: define TIF_ALLOW_RESCHED Ankur Arora
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2023-04-03  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
	jon.grimm, bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora

clear_pages_rep(), clear_pages_erms() use string instructions
internally. These, unlike a MOV loop, allow us to explicitly advertise
the region-size to the processor. Thus, clearing in multi-page chunks
means we can specify the real region size (or close to it), which is
good for two reasons:

 - region-size can serve as a hint to current (some AMD Zen models) and
   possibly future uarchs, which can use it to avoid polluting one or
   more levels of the dcache.

 - string instructions are typically microcoded, and are cheaper when
   amortized across larger regions. We also execute fewer loop
   iterations (e.g. the per-page cond_resched() check, though those
   instructions are likely free.)

clear_huge_page() now clears in three sections: the local neighbourhood
of the faulting address (the faulting page and up to four surrounding
pages), and the regions to its left and right.

The local neighbourhood is cleared last to keep its cachelines hot.
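
For illustration, with pg-sz=2MB (pages_per_huge_page = 512) and the
fault at subpage index 260 (indices assumed for the example), the three
clear_contig_region() calls below cover:

  left region:    subpages [  0, 257]   (cleared first)
  right region:   subpages [263, 511]   (cleared second)
  neighbourhood:  subpages [258, 262]   (cleared last: faulting page +/- 2)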

Performance
==

Use mmap(MAP_HUGETLB) to demand fault a 64GB region (on the local
NUMA node):

Icelakex (Platinum 8358, ucode=0xd0002c1, no_turbo=1):

              mm/clear_huge_page   x86/clear_huge_page   change   
                          (GB/s)                (GB/s)            
                                                                  
  pg-sz=2MB                 8.76                 11.82   +34.93%  
  pg-sz=1GB                 8.99                 12.18   +35.48%  

On Icelakex we continue to allocate cachelines:

pg-sz=2MB:
    -   701,951,397      L1-dcache-loads           #   47.985 M/sec                       ( +- 19.22% )  (69.23%)
    - 3,239,403,770      L1-dcache-load-misses     #  691.17% of all L1-dcache accesses   ( +- 19.25% )  (69.24%)
    +   194,318,641      L1-dcache-loads           #   17.905 M/sec                       ( +- 19.07% )  (69.25%)
    + 3,238,878,229      L1-dcache-load-misses     # 2480.93% of all L1-dcache accesses   ( +- 19.25% )  (69.26%)

pg-sz=1GB:
    -   532,232,051      L1-dcache-loads           #   37.378 M/sec                       ( +- 19.25% )  (69.23%)
    - 3,224,574,249      L1-dcache-load-misses     #  909.02% of all L1-dcache accesses   ( +- 19.25% )  (69.24%)
    +    22,587,703      L1-dcache-loads           #    2.150 M/sec                       ( +- 19.38% )  (69.25%)
    + 3,223,143,697      L1-dcache-load-misses     # 21478.37% of all L1-dcache accesses  ( +- 19.25% )  (69.25%)


Milan (EPYC 7J13, ucode=0xa0011a9, boost=0):

              mm/clear_huge_page   x86/clear_huge_page   change    
                          (GB/s)                (GB/s)             
                                                                   
  pg-sz=2MB                12.24                 17.54    +43.30%  
  pg-sz=1GB                17.98                 37.24   +107.11%  

Milan uses a threshold of ~32MB for eliding cacheline allocation, so we
see a drop-off in cacheline allocations for pg-sz=1GB:

pg-sz=2MB:
    - 2,495,566,569      L1-dcache-loads           #  476.417 M/sec                      ( +-  0.04% )  (33.38%)
    - 1,079,711,798      L1-dcache-load-misses     #   43.28% of all L1-dcache accesses  ( +-  0.01% )  (33.37%)
    + 2,235,310,058      L1-dcache-loads           #  610.770 M/sec                      ( +-  0.02% )  (33.37%)
    + 1,089,602,355      L1-dcache-load-misses     #   48.73% of all L1-dcache accesses  ( +-  0.01% )  (33.37%)

pg-sz=1GB:
    - 2,417,846,489      L1-dcache-loads           #  679.753 M/sec                      ( +-  0.01% )  (33.38%)
    - 1,075,531,869      L1-dcache-load-misses     #   44.49% of all L1-dcache accesses  ( +-  0.01% )  (33.35%)
    +    31,159,378      L1-dcache-loads           #   18.119 M/sec                      ( +-  3.27% )  (33.46%)
    +    14,692,358      L1-dcache-load-misses     #   48.21% of all L1-dcache accesses  ( +-  3.12% )  (33.46%)

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---

Fuller perf stats for context:

# Icelakex, baseline (mm/clear_huge_page), region-sz=64g, pg-sz=2mb

 Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 1' (3 runs):

         21,945.59 msec task-clock                       #    2.999 CPUs utilized               ( +- 19.25% )
                34      context-switches                 #    2.324 /sec                        ( +- 20.38% )
                 3      cpu-migrations                   #    0.205 /sec                        ( +- 19.25% )
           198,152      page-faults                      #   13.546 K/sec                       ( +- 19.29% )
    56,513,364,885      cycles                           #    3.863 GHz                         ( +- 19.25% )  (38.44%)
     2,583,719,806      instructions                     #    0.07  insn per cycle              ( +- 19.24% )  (46.14%)
       585,212,952      branches                         #   40.005 M/sec                       ( +- 19.23% )  (53.83%)
           562,164      branch-misses                    #    0.14% of all branches             ( +- 19.23% )  (61.53%)
   282,621,312,162      slots                            #   19.320 G/sec                       ( +- 19.25% )  (69.22%)
    11,048,627,225      topdown-retiring                 #      3.8% Retiring                   ( +- 19.22% )  (69.22%)
    34,358,400,894      topdown-bad-spec                 #     11.5% Bad Speculation            ( +- 19.57% )  (69.22%)
     2,231,092,499      topdown-fe-bound                 #      0.8% Frontend Bound             ( +- 19.25% )  (69.22%)
   246,679,210,776      topdown-be-bound                 #     84.0% Backend Bound              ( +- 19.21% )  (69.22%)
       701,951,397      L1-dcache-loads                  #   47.985 M/sec                       ( +- 19.22% )  (69.23%)
     3,239,403,770      L1-dcache-load-misses            #  691.17% of all L1-dcache accesses   ( +- 19.25% )  (69.24%)
        11,475,685      LLC-loads                        #  784.475 K/sec                       ( +- 19.23% )  (69.25%)
           793,272      LLC-load-misses                  #   10.36% of all LL-cache accesses    ( +- 19.23% )  (69.25%)
        17,821,045      L1-icache-load-misses            #    0.00% of all L1-icache accesses   ( +- 19.51% )  (30.77%)
       693,339,354      dTLB-loads                       #   47.397 M/sec                       ( +- 19.33% )  (30.76%)
           637,811      dTLB-load-misses                 #    0.14% of all dTLB cache accesses  ( +- 19.09% )  (30.75%)
           131,922      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +- 19.59% )  (30.75%)

           7.31681 +- 0.00177 seconds time elapsed  ( +-  0.02% )


# Icelakex, multi-page (x86/clear_huge_page), region-sz=64g, pg-sz=2mb

 Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 1' (3 runs):

         16,276.28 msec task-clock                       #    2.999 CPUs utilized               ( +- 19.24% )
                27      context-switches                 #    2.488 /sec                        ( +- 19.25% )
                 3      cpu-migrations                   #    0.276 /sec                        ( +- 19.25% )
           196,935      page-faults                      #   18.146 K/sec                       ( +- 19.25% )
    41,906,597,608      cycles                           #    3.861 GHz                         ( +- 19.24% )  (38.44%)
       729,479,932      instructions                     #    0.03  insn per cycle              ( +- 19.38% )  (46.14%)
       133,969,095      branches                         #   12.344 M/sec                       ( +- 19.35% )  (53.84%)
           412,818      branch-misses                    #    0.46% of all branches             ( +- 18.97% )  (61.54%)
   209,574,316,961      slots                            #   19.311 G/sec                       ( +- 19.24% )  (69.24%)
     4,933,512,982      topdown-retiring                 #      2.3% Retiring                   ( +- 19.24% )  (69.24%)
    20,272,641,267      topdown-bad-spec                 #      9.4% Bad Speculation            ( +- 19.51% )  (69.24%)
       837,421,487      topdown-fe-bound                 #      0.4% Frontend Bound             ( +- 19.24% )  (69.24%)
   190,089,232,476      topdown-be-bound                 #     88.0% Backend Bound              ( +- 19.19% )  (69.24%)
       194,318,641      L1-dcache-loads                  #   17.905 M/sec                       ( +- 19.07% )  (69.25%)
     3,238,878,229      L1-dcache-load-misses            # 2480.93% of all L1-dcache accesses   ( +- 19.25% )  (69.26%)
        10,560,508      LLC-loads                        #  973.081 K/sec                       ( +- 19.23% )  (69.26%)
           724,884      LLC-load-misses                  #   10.28% of all LL-cache accesses    ( +- 17.15% )  (69.26%)
        14,378,070      L1-icache-load-misses            #    0.00% of all L1-icache accesses   ( +- 19.13% )  (30.75%)
       185,562,230      dTLB-loads                       #   17.098 M/sec                       ( +- 19.74% )  (30.74%)
           617,978      dTLB-load-misses                 #    0.51% of all dTLB cache accesses  ( +- 18.72% )  (30.74%)
           112,509      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +- 19.76% )  (30.74%)

           5.42697 +- 0.00152 seconds time elapsed  ( +-  0.03% )


# Icelakex, baseline (mm/clear_huge_page), region-sz=64g, pg-sz=1gb

 Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64 --huge 2' (3 runs):

         21,361.22 msec task-clock                       #    2.999 CPUs utilized               ( +- 19.25% )
                23      context-switches                 #    1.615 /sec                        ( +- 18.95% )
                 3      cpu-migrations                   #    0.211 /sec                        ( +- 19.25% )
               701      page-faults                      #   49.230 /sec                        ( +- 19.27% )
    54,981,958,487      cycles                           #    3.861 GHz                         ( +- 19.25% )  (38.44%)
     2,012,625,953      instructions                     #    0.05  insn per cycle              ( +- 19.25% )  (46.14%)
       470,264,509      branches                         #   33.026 M/sec                       ( +- 19.25% )  (53.83%)
           194,801      branch-misses                    #    0.06% of all branches             ( +- 18.88% )  (61.53%)
   274,966,507,627      slots                            #   19.311 G/sec                       ( +- 19.25% )  (69.22%)
    10,555,137,650      topdown-retiring                 #      3.8% Retiring                   ( +- 19.04% )  (69.22%)
    21,206,785,918      topdown-bad-spec                 #      7.8% Bad Speculation            ( +- 18.13% )  (69.22%)
     1,094,597,329      topdown-fe-bound                 #      0.4% Frontend Bound             ( +- 19.25% )  (69.22%)
   244,462,123,545      topdown-be-bound                 #     88.0% Backend Bound              ( +- 19.33% )  (69.22%)
       532,232,051      L1-dcache-loads                  #   37.378 M/sec                       ( +- 19.25% )  (69.23%)
     3,224,574,249      L1-dcache-load-misses            #  909.02% of all L1-dcache accesses   ( +- 19.25% )  (69.24%)
         2,318,195      LLC-loads                        #  162.804 K/sec                       ( +- 19.35% )  (69.25%)
           206,737      LLC-load-misses                  #   13.44% of all LL-cache accesses    ( +- 18.30% )  (69.25%)
         4,950,866      L1-icache-load-misses            #    0.00% of all L1-icache accesses   ( +- 19.26% )  (30.77%)
       531,299,560      dTLB-loads                       #   37.313 M/sec                       ( +- 19.24% )  (30.76%)
             2,811      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  ( +- 17.25% )  (30.75%)
            26,355      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +- 19.58% )  (30.75%)

           7.12187 +- 0.00190 seconds time elapsed  ( +-  0.03% )


# Icelakex, multi-page (x86/clear_huge_page), region-sz=64g, pg-sz=1gb

 Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64 --huge 2' (3 runs):

         15,764.52 msec task-clock                       #    2.999 CPUs utilized               ( +- 19.25% )
                17      context-switches                 #    1.618 /sec                        ( +- 20.47% )
                 3      cpu-migrations                   #    0.285 /sec                        ( +- 19.25% )
               700      page-faults                      #   66.614 /sec                        ( +- 19.22% )
    40,560,984,582      cycles                           #    3.860 GHz                         ( +- 19.25% )  (38.45%)
        79,578,792      instructions                     #    0.00  insn per cycle              ( +- 19.24% )  (46.15%)
        13,872,134      branches                         #    1.320 M/sec                       ( +- 19.23% )  (53.85%)
           119,492      branch-misses                    #    1.29% of all branches             ( +- 18.80% )  (61.55%)
   202,854,573,160      slots                            #   19.304 G/sec                       ( +- 19.25% )  (69.25%)
     3,982,417,725      topdown-retiring                 #      2.0% Retiring                   ( +- 19.25% )  (69.25%)
    13,523,424,635      topdown-bad-spec                 #      6.8% Bad Speculation            ( +- 18.69% )  (69.25%)
        18,661,431      topdown-fe-bound                 #      0.0% Frontend Bound             ( +- 19.28% )  (69.25%)
   185,884,147,789      topdown-be-bound                 #     91.3% Backend Bound              ( +- 19.28% )  (69.25%)
        22,587,703      L1-dcache-loads                  #    2.150 M/sec                       ( +- 19.38% )  (69.25%)
     3,223,143,697      L1-dcache-load-misses            # 21478.37% of all L1-dcache accesses  ( +- 19.25% )  (69.25%)
         1,777,675      LLC-loads                        #  169.169 K/sec                       ( +- 19.60% )  (69.25%)
           126,583      LLC-load-misses                  #   10.77% of all LL-cache accesses    ( +- 19.82% )  (69.25%)
         3,333,729      L1-icache-load-misses            #    0.00% of all L1-icache accesses   ( +- 19.49% )  (30.75%)
        19,999,517      dTLB-loads                       #    1.903 M/sec                       ( +- 19.38% )  (30.75%)
             1,833      dTLB-load-misses                 #    0.01% of all dTLB cache accesses  ( +- 17.72% )  (30.75%)
            34,066      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +- 19.09% )  (30.75%)

           5.25624 +- 0.00176 seconds time elapsed  ( +-  0.03% )


# Milan, baseline (mm/clear_huge_page), region-sz=64g, pg-sz=2mb

 Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 1' (3 runs):

          5,241.76 msec task-clock                #    1.000 CPUs utilized            ( +-  0.08% )
                10      context-switches          #    1.909 /sec                     ( +-  8.82% )
                 1      cpu-migrations            #    0.191 /sec                   
            65,636      page-faults               #   12.530 K/sec                    ( +-  0.00% )
    12,730,694,768      cycles                    #    2.430 GHz                      ( +-  0.08% )  (33.31%)
        36,709,243      stalled-cycles-frontend   #    0.29% frontend cycles idle     ( +- 24.07% )  (33.32%)
        37,520,225      stalled-cycles-backend    #    0.29% backend cycles idle      ( +-  9.87% )  (33.34%)
       874,896,010      instructions              #    0.07  insn per cycle         
                                                  #    0.05  stalled cycles per insn  ( +-  1.23% )  (33.36%)
       199,308,386      branches                  #   38.049 M/sec                    ( +-  0.84% )  (33.38%)
           441,428      branch-misses             #    0.22% of all branches          ( +-  4.68% )  (33.38%)
     2,495,566,569      L1-dcache-loads           #  476.417 M/sec                    ( +-  0.04% )  (33.38%)
     1,079,711,798      L1-dcache-load-misses     #   43.28% of all L1-dcache accesses  ( +-  0.01% )  (33.37%)
        50,936,391      L1-icache-loads           #    9.724 M/sec                    ( +-  1.29% )  (33.35%)
           284,407      L1-icache-load-misses     #    0.56% of all L1-icache accesses  ( +-  4.60% )  (33.33%)
           546,596      dTLB-loads                #  104.348 K/sec                    ( +-  6.14% )  (33.31%)
           231,897      dTLB-load-misses          #   42.08% of all dTLB cache accesses  ( +-  4.27% )  (33.29%)
                 6      iTLB-loads                #    1.145 /sec                     ( +- 72.65% )  (33.29%)
            34,065      iTLB-load-misses          # 262038.46% of all iTLB cache accesses  ( +- 44.88% )  (33.29%)
        18,237,487      L1-dcache-prefetches      #    3.482 M/sec                    ( +- 12.84% )  (33.29%)

           5.23915 +- 0.00421 seconds time elapsed  ( +-  0.08% )

# Milan, multi-page (x86/clear_huge_page), region-sz=64g, pg-sz=2mb

 Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 1' (3 runs):

          3,655.71 msec task-clock                #    0.999 CPUs utilized            ( +-  0.13% )
                 7      context-switches          #    1.913 /sec                     ( +-  8.25% )
                 1      cpu-migrations            #    0.273 /sec                   
            65,636      page-faults               #   17.934 K/sec                    ( +-  0.00% )
     8,879,727,514      cycles                    #    2.426 GHz                      ( +-  0.13% )  (33.26%)
         5,733,380      stalled-cycles-frontend   #    0.06% frontend cycles idle     ( +-170.04% )  (33.28%)
        42,012,302      stalled-cycles-backend    #    0.47% backend cycles idle      ( +- 24.51% )  (33.31%)
       214,672,610      instructions              #    0.02  insn per cycle         
                                                  #    0.28  stalled cycles per insn  ( +-  1.71% )  (33.34%)
        42,298,268      branches                  #   11.557 M/sec                    ( +-  1.28% )  (33.36%)
           267,936      branch-misses             #    0.62% of all branches          ( +-  7.80% )  (33.37%)
     2,235,310,058      L1-dcache-loads           #  610.770 M/sec                    ( +-  0.02% )  (33.37%)
     1,089,602,355      L1-dcache-load-misses     #   48.73% of all L1-dcache accesses  ( +-  0.01% )  (33.37%)
        48,725,812      L1-icache-loads           #   13.314 M/sec                    ( +-  0.25% )  (33.37%)
           231,227      L1-icache-load-misses     #    0.47% of all L1-icache accesses  ( +- 13.20% )  (33.37%)
           280,655      dTLB-loads                #   76.685 K/sec                    ( +- 13.33% )  (33.37%)
           151,028      dTLB-load-misses          #   44.02% of all dTLB cache accesses  ( +-  6.64% )  (33.35%)
                15      iTLB-loads                #    4.099 /sec                     ( +-  6.67% )  (33.32%)
           121,208      iTLB-load-misses          # 865771.43% of all iTLB cache accesses  ( +-  2.74% )  (33.29%)
        18,702,209      L1-dcache-prefetches      #    5.110 M/sec                    ( +- 12.51% )  (33.27%)

           3.66065 +- 0.00461 seconds time elapsed  ( +-  0.13% )


# Milan, baseline (mm/clear_huge_page), region-sz=64g, pg-sz=1gb

 Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 2' (3 runs):

          3,544.20 msec task-clock                #    0.996 CPUs utilized            ( +-  0.21% )
                 5      context-switches          #    1.406 /sec                     ( +-  6.67% )
                 1      cpu-migrations            #    0.281 /sec                   
               227      page-faults               #   63.819 /sec                     ( +-  0.15% )
     8,609,810,964      cycles                    #    2.421 GHz                      ( +-  0.21% )  (33.30%)
        77,420,424      stalled-cycles-frontend   #    0.90% frontend cycles idle     ( +- 20.55% )  (33.33%)
        25,197,541      stalled-cycles-backend    #    0.29% backend cycles idle      ( +-  1.09% )  (33.35%)
       658,146,061      instructions              #    0.08  insn per cycle         
                                                  #    0.16  stalled cycles per insn  ( +-  0.04% )  (33.38%)
       154,867,131      branches                  #   43.539 M/sec                    ( +-  0.04% )  (33.41%)
           167,531      branch-misses             #    0.11% of all branches          ( +-  5.19% )  (33.41%)
     2,417,846,489      L1-dcache-loads           #  679.753 M/sec                    ( +-  0.01% )  (33.38%)
     1,075,531,869      L1-dcache-load-misses     #   44.49% of all L1-dcache accesses  ( +-  0.01% )  (33.35%)
        12,835,321      L1-icache-loads           #    3.609 M/sec                    ( +-  0.41% )  (33.33%)
            55,282      L1-icache-load-misses     #    0.43% of all L1-icache accesses  ( +-  1.98% )  (33.30%)
            23,287      dTLB-loads                #    6.547 K/sec                    ( +- 15.61% )  (33.29%)
             1,333      dTLB-load-misses          #    4.48% of all dTLB cache accesses  ( +-  1.26% )  (33.29%)
                 3      iTLB-loads                #    0.843 /sec                     ( +- 33.33% )  (33.29%)
               231      iTLB-load-misses          # 11550.00% of all iTLB cache accesses  ( +-  6.14% )  (33.29%)
       170,608,062      L1-dcache-prefetches      #   47.965 M/sec                    ( +-  0.84% )  (33.29%)

           3.55776 +- 0.00738 seconds time elapsed  ( +-  0.21% )


# Milan, multi-page (x86/clear_huge_page), region-sz=64g, pg-sz=1gb

 Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 2' (3 runs):

          1,718.27 msec task-clock                #    0.999 CPUs utilized            ( +-  0.08% )
                 6      context-switches          #    3.489 /sec                     ( +- 14.70% )
                 1      cpu-migrations            #    0.581 /sec                   
               227      page-faults               #  132.000 /sec                     ( +-  0.15% )
     4,176,107,493      cycles                    #    2.428 GHz                      ( +-  0.08% )  (33.19%)
         2,675,797      stalled-cycles-frontend   #    0.06% frontend cycles idle     ( +-  0.34% )  (33.25%)
       147,394,527      stalled-cycles-backend    #    3.53% backend cycles idle      ( +-  8.80% )  (33.31%)
        12,779,784      instructions              #    0.00  insn per cycle         
                                                  #   13.14  stalled cycles per insn  ( +-  0.09% )  (33.37%)
         2,428,829      branches                  #    1.412 M/sec                    ( +-  0.08% )  (33.42%)
            63,460      branch-misses             #    2.61% of all branches          ( +-  3.48% )  (33.46%)
        31,159,378      L1-dcache-loads           #   18.119 M/sec                    ( +-  3.27% )  (33.46%)
        14,692,358      L1-dcache-load-misses     #   48.21% of all L1-dcache accesses  ( +-  3.12% )  (33.46%)
         2,556,688      L1-icache-loads           #    1.487 M/sec                    ( +-  0.89% )  (33.46%)
            21,148      L1-icache-load-misses     #    0.84% of all L1-icache accesses  ( +-  0.25% )  (33.41%)
             6,114      dTLB-loads                #    3.555 K/sec                    ( +- 12.76% )  (33.35%)
             1,742      dTLB-load-misses          #   33.73% of all dTLB cache accesses  ( +- 21.79% )  (33.29%)
                45      iTLB-loads                #   26.167 /sec                     ( +-  7.52% )  (33.23%)
                90      iTLB-load-misses          #  210.94% of all iTLB cache accesses  ( +- 21.20% )  (33.17%)
           257,942      L1-dcache-prefetches      #  149.993 K/sec                    ( +-  9.84% )  (33.17%)

           1.72042 +- 0.00139 seconds time elapsed  ( +-  0.08% )

---
 arch/x86/mm/hugetlbpage.c | 49 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 5804bbae4f01..4294b77c4f18 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -148,6 +148,55 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 		return hugetlb_get_unmapped_area_topdown(file, addr, len,
 				pgoff, flags);
 }
+
+/*
+ * This is used on all !CONFIG_HIGHMEM configurations.
+ *
+ * CONFIG_HIGHMEM, falls back to the __weak version.
+ */
+#ifndef CONFIG_HIGHMEM
+static void clear_contig_region(struct page *page, unsigned long vaddr,
+				unsigned int npages)
+{
+	clear_user_pages(page_address(page), vaddr, page, npages);
+}
+
+void clear_huge_page(struct page *page,
+		     unsigned long addr_hint, unsigned int pages_per_huge_page)
+{
+	unsigned long addr = addr_hint &
+		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
+	const long pgidx = (addr_hint - addr) / PAGE_SIZE;
+	const int first_pg = 0, last_pg = pages_per_huge_page - 1;
+	const int width = 2; /* pages cleared last on either side */
+	int sidx[3], eidx[3];
+	int i, n;
+
+	if (pages_per_huge_page > MAX_ORDER_NR_PAGES)
+		return clear_contig_region(page, addr, pages_per_huge_page);
+
+	/*
+	 * Neighbourhood of the fault. Cleared at the end to ensure
+	 * it sticks around in the cache.
+	 */
+	n = 2;
+	sidx[n] = (pgidx - width) < first_pg ? first_pg : (pgidx - width);
+	eidx[n] = (pgidx + width) > last_pg  ? last_pg  : (pgidx + width);
+
+	sidx[0] = first_pg;	/* Region to the left of the fault */
+	eidx[0] = sidx[n] - 1;
+
+	sidx[1] = eidx[n] + 1;	/* Region to the right of the fault */
+	eidx[1] = last_pg;
+
+	for (i = 0; i <= 2; i++) {
+		if (eidx[i] >= sidx[i])
+			clear_contig_region(page + sidx[i],
+					    addr + sidx[i] * PAGE_SIZE,
+					    eidx[i] - sidx[i] + 1);
+	}
+}
+#endif /* CONFIG_HIGHMEM */
 #endif /* CONFIG_HUGETLB_PAGE */
 
 #ifdef CONFIG_X86_64
-- 
2.31.1



* [PATCH 7/9] sched: define TIF_ALLOW_RESCHED
  2023-04-03  5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
                   ` (5 preceding siblings ...)
  2023-04-03  5:22 ` [PATCH 6/9] mm/clear_huge_page: use multi-page clearing Ankur Arora
@ 2023-04-03  5:22 ` Ankur Arora
  2023-04-05 20:07   ` Peter Zijlstra
  2023-04-03  5:22 ` [PATCH 8/9] irqentry: define irqentry_exit_allow_resched() Ankur Arora
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 30+ messages in thread
From: Ankur Arora @ 2023-04-03  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
	jon.grimm, bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora

Define TIF_ALLOW_RESCHED to allow threads to mark themselves as allowing
rescheduling in the irqexit path with PREEMPTION_NONE/_VOLUNTARY.

This is meant to be used for long running tasks where it is
not convenient to periodically call cond_resched().

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/thread_info.h |  2 ++
 include/linux/sched.h              | 29 +++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index f1cccba52eb9..8c18b9eaeec4 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -100,6 +100,7 @@ struct thread_info {
 #define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */
 #define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
 #define TIF_ADDR32		29	/* 32-bit address space on 64 bits */
+#define TIF_RESCHED_ALLOW	30	/* can reschedule if needed */
 
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
@@ -122,6 +123,7 @@ struct thread_info {
 #define _TIF_BLOCKSTEP		(1 << TIF_BLOCKSTEP)
 #define _TIF_LAZY_MMU_UPDATES	(1 << TIF_LAZY_MMU_UPDATES)
 #define _TIF_ADDR32		(1 << TIF_ADDR32)
+#define _TIF_RESCHED_ALLOW	(1 << TIF_RESCHED_ALLOW)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW_BASE					\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63d242164b1a..1e7536e6d9ce 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2229,6 +2229,35 @@ static __always_inline bool need_resched(void)
 	return unlikely(tif_need_resched());
 }
 
+/*
+ * Define this in common code to avoid include hell.
+ */
+static __always_inline bool resched_allowed(void)
+{
+#ifndef TIF_RESCHED_ALLOW
+	return false;
+#else
+	return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
+#endif
+}
+
+static inline void allow_resched(void)
+{
+	/*
+	 * allow_resched() allows preemption via the irqexit context.
+	 * To ensure that we stick around on the current runqueue,
+	 * disallow migration.
+	 */
+	migrate_disable();
+	set_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
+}
+
+static inline void disallow_resched(void)
+{
+	clear_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
+	migrate_enable();
+}
+
 /*
  * Wrappers for p->thread_info->cpu access. No-op on UP.
  */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 8/9] irqentry: define irqentry_exit_allow_resched()
  2023-04-03  5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
                   ` (6 preceding siblings ...)
  2023-04-03  5:22 ` [PATCH 7/9] sched: define TIF_ALLOW_RESCHED Ankur Arora
@ 2023-04-03  5:22 ` Ankur Arora
  2023-04-04  9:38   ` Thomas Gleixner
  2023-04-05 20:22   ` Peter Zijlstra
  2023-04-03  5:22 ` [PATCH 9/9] x86/clear_huge_page: make clear_contig_region() preemptible Ankur Arora
  2023-04-05 19:48 ` [PATCH 0/9] x86/clear_huge_page: multi-page clearing Raghavendra K T
  9 siblings, 2 replies; 30+ messages in thread
From: Ankur Arora @ 2023-04-03  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
	jon.grimm, bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora

Allow threads marked TIF_ALLOW_RESCHED to be rescheduled in irqexit.

This is only necessary under !preempt_model_preemptible(), for which
we reuse the same logic as irqentry_exit_cond_resched().

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/entry/common.c |  8 ++++++++
 kernel/sched/core.c   | 36 +++++++++++++++++++++---------------
 2 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index be61332c66b5..f1005595ebe7 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -390,6 +390,9 @@ void raw_irqentry_exit_cond_resched(void)
 			preempt_schedule_irq();
 	}
 }
+
+void irqentry_exit_allow_resched(void) __alias(raw_irqentry_exit_cond_resched);
+
 #ifdef CONFIG_PREEMPT_DYNAMIC
 #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
 DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
@@ -431,6 +434,11 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 		instrumentation_begin();
 		if (IS_ENABLED(CONFIG_PREEMPTION))
 			irqentry_exit_cond_resched();
+		/*
+		 * We care about this clause only in the dynamic !preemptible case.
+		 */
+		if (unlikely(!preempt_model_preemptible() && resched_allowed()))
+			irqentry_exit_allow_resched();
 
 		/* Covers both tracing and lockdep */
 		trace_hardirqs_on();
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0d18c3969f90..11845a91b691 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6500,6 +6500,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
  *          - explicit schedule() call
  *          - return from syscall or exception to user-space
  *          - return from interrupt-handler to user-space
+ *          - return from interrupt-handler with the task having set
+ *            TIF_RESCHED_ALLOW
  *
  * WARNING: must be called with preemption disabled!
  */
@@ -8597,28 +8599,32 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
  * SC:preempt_schedule
  * SC:preempt_schedule_notrace
  * SC:irqentry_exit_cond_resched
+ * SC:irqentry_exit_allow_resched
  *
  *
  * NONE:
- *   cond_resched               <- __cond_resched
- *   might_resched              <- RET0
- *   preempt_schedule           <- NOP
- *   preempt_schedule_notrace   <- NOP
- *   irqentry_exit_cond_resched <- NOP
+ *   cond_resched                <- __cond_resched
+ *   might_resched               <- RET0
+ *   preempt_schedule            <- NOP
+ *   preempt_schedule_notrace    <- NOP
+ *   irqentry_exit_cond_resched  <- NOP
+ *   irqentry_exit_allow_resched <- irqentry_exit_allow_resched
  *
  * VOLUNTARY:
- *   cond_resched               <- __cond_resched
- *   might_resched              <- __cond_resched
- *   preempt_schedule           <- NOP
- *   preempt_schedule_notrace   <- NOP
- *   irqentry_exit_cond_resched <- NOP
+ *   cond_resched                <- __cond_resched
+ *   might_resched               <- __cond_resched
+ *   preempt_schedule            <- NOP
+ *   preempt_schedule_notrace    <- NOP
+ *   irqentry_exit_cond_resched  <- NOP
+ *   irqentry_exit_allow_resched <- irqentry_exit_allow_resched
  *
  * FULL:
- *   cond_resched               <- RET0
- *   might_resched              <- RET0
- *   preempt_schedule           <- preempt_schedule
- *   preempt_schedule_notrace   <- preempt_schedule_notrace
- *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
+ *   cond_resched                <- RET0
+ *   might_resched               <- RET0
+ *   preempt_schedule            <- preempt_schedule
+ *   preempt_schedule_notrace    <- preempt_schedule_notrace
+ *   irqentry_exit_cond_resched  <- irqentry_exit_cond_resched
+ *   irqentry_exit_allow_resched <- NOP
  */
 
 enum {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 9/9] x86/clear_huge_page: make clear_contig_region() preemptible
  2023-04-03  5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
                   ` (7 preceding siblings ...)
  2023-04-03  5:22 ` [PATCH 8/9] irqentry: define irqentry_exit_allow_resched() Ankur Arora
@ 2023-04-03  5:22 ` Ankur Arora
  2023-04-05 20:27   ` Peter Zijlstra
  2023-04-05 19:48 ` [PATCH 0/9] x86/clear_huge_page: multi-page clearing Raghavendra K T
  9 siblings, 1 reply; 30+ messages in thread
From: Ankur Arora @ 2023-04-03  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
	jon.grimm, bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora

clear_contig_region() can be used to clear up to a huge-page (2MB/1GB)
chunk. Allow preemption in the irqentry_exit path to make sure we don't
hold on to the CPU for an arbitrarily long period.

Performance: vm-scalability/case-anon-w-seq-hugetlb mmaps an anonymous
hugetlb-2mb region, and then writes sequentially to the region, demand
faulting pages on the way.

This test, with a CONFIG_PREEMPT_VOLUNTARY config, shows the effects of
this change: stime drops (~18% on Icelakex, ~5% on Milan), while the
utime goes up (~15% on Icelakex, ~13% on Milan.)

  *Icelakex*                  mm/clear_huge_page   x86/clear_huge_page   change
  (mem=4GB/task, tasks=128)

  stime                           293.02 +- .49%        239.39 +- .83%   -18.30%
  utime                           440.11 +- .28%        508.74 +- .60%   +15.59%
  wall-clock                        5.96 +- .33%          6.27 +-2.23%   + 5.20%



  *Milan*                     mm/clear_huge_page   x86/clear_huge_page   change
  (mem=1GB/task, tasks=512)

  stime                          490.95 +- 3.55%       466.90 +- 4.79%   - 4.89%
  utime                          276.43 +- 2.85%       311.97 +- 5.15%   +12.85%
  wall-clock                       3.74 +- 6.41%         3.58 +- 7.82%   - 4.27%

The drop in stime is due to REP; STOS being more efficient for bigger
extents.  The increase in utime is due to cache effects of that change:
mm/clear_huge_page() clears page-at-a-time, while narrowing towards the
faulting page; while x86/clear_huge_page only optimizes for cache
locality in the local neighbourhood of the faulting address.

This effect on utime is visible via the increased L1-dcache-load-misses
and LLC-load* and an increased backend boundedness for perf user-stat
--all-user on Icelakex. The effect is slight but given the heavy cache
pressure generated by the test, shows up in the drop in user IPC:

    -  9,455,243,414,829      instructions                     #    2.75  insn per cycle              ( +- 14.14% )  (46.17%)
    -  2,367,920,864,112      L1-dcache-loads                  #    1.054 G/sec                       ( +- 14.14% )  (69.24%)
    -     42,075,182,813      L1-dcache-load-misses            #    2.96% of all L1-dcache accesses   ( +- 14.14% )  (69.24%)
    -         20,365,688      LLC-loads                        #    9.064 K/sec                       ( +- 13.98% )  (69.24%)
    -            890,382      LLC-load-misses                  #    7.18% of all LL-cache accesses    ( +- 14.91% )  (69.24%)

    +  9,467,796,660,698      instructions                     #    2.37  insn per cycle              ( +- 14.14% )  (46.16%)
    +  2,369,973,307,561      L1-dcache-loads                  #    1.027 G/sec                       ( +- 14.14% )  (69.24%)
    +     42,155,621,201      L1-dcache-load-misses            #    2.96% of all L1-dcache accesses   ( +- 14.14% )  (69.24%)
    +         22,116,300      LLC-loads                        #    9.588 K/sec                       ( +- 14.20% )  (69.24%)
    +          1,355,607      LLC-load-misses                  #   10.29% of all LL-cache accesses    ( +- 15.49% )  (69.25%)

Given that the stime improves for all loads using this path, while the
utime regression is load dependent, add this change.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/mm/hugetlbpage.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 4294b77c4f18..c8564b0552e5 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -158,7 +158,17 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 static void clear_contig_region(struct page *page, unsigned long vaddr,
 				unsigned int npages)
 {
+	might_sleep();
+
+	/*
+	 * We might be clearing a large region.
+	 * Allow rescheduling.
+	 */
+	allow_resched();
 	clear_user_pages(page_address(page), vaddr, page, npages);
+	disallow_resched();
+
+	cond_resched();
 }
 
 void clear_huge_page(struct page *page,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 8/9] irqentry: define irqentry_exit_allow_resched()
  2023-04-03  5:22 ` [PATCH 8/9] irqentry: define irqentry_exit_allow_resched() Ankur Arora
@ 2023-04-04  9:38   ` Thomas Gleixner
  2023-04-05  5:29     ` Ankur Arora
  2023-04-05 20:22   ` Peter Zijlstra
  1 sibling, 1 reply; 30+ messages in thread
From: Thomas Gleixner @ 2023-04-04  9:38 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	willy, mgorman, peterz, rostedt, vincent.guittot, jon.grimm,
	bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora

On Sun, Apr 02 2023 at 22:22, Ankur Arora wrote:
> Allow threads marked TIF_ALLOW_RESCHED to be rescheduled in irqexit.
>
> This is only necessary under !preempt_model_preemptible() for which
> we reuse the same logic as irqentry_exit_code_resched().

This tells what this patch is doing but completely fails to explain why
this is necessary and useful.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 8/9] irqentry: define irqentry_exit_allow_resched()
  2023-04-04  9:38   ` Thomas Gleixner
@ 2023-04-05  5:29     ` Ankur Arora
  0 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2023-04-05  5:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, peterz,
	rostedt, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
	konrad.wilk, ankur.a.arora


Thomas Gleixner <tglx@linutronix.de> writes:

> On Sun, Apr 02 2023 at 22:22, Ankur Arora wrote:
>> Allow threads marked TIF_ALLOW_RESCHED to be rescheduled in irqexit.
>>
>> This is only necessary under !preempt_model_preemptible() for which
>> we reuse the same logic as irqentry_exit_code_resched().
>
> This tells what this patch is doing but completely fails to explain why
> this is necessary and useful.

Thanks. Yeah, it does seem to miss that completely.

Needs some massaging, but does something like this clarify its purpose?

On kernels with PREEMPTION_NONE/_VOLUNTARY, rescheduling of kernel
tasks happens when they allow it -- for instance by synchronously
calling cond_resched() in a long running task.

There are cases where it is not convenient to periodically call
cond_resched() -- for instance when executing a potentially
long running instruction (such as REP STOSB on x86).

To handle kernel code sections which can be safely preempted, but
cannot explicitly call cond_resched(), allow them to mark themselves
TIF_ALLOW_RESCHED.

Contexts marked this way (via allow_resched()) can be rescheduled in the
irqexit path. This is, of course, only needed with
!preempt_model_preemptible(), and the rescheduling logic is functionally
the same as irqentry_exit_cond_resched().
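
To make the intended usage pattern concrete -- a minimal sketch only,
with zero_large_region() as an illustrative, made-up caller -- the user
would bracket the long-running operation like so:

	/*
	 * Hypothetical caller: a long-running clear that cannot call
	 * cond_resched() from inside the operation itself (e.g. a
	 * single REP STOSB covering many MBs).
	 */
	static void zero_large_region(void *addr, unsigned long len)
	{
		allow_resched();	/* irqexit may now reschedule us */
		memset(addr, 0, len);
		disallow_resched();

		cond_resched();		/* catch a need_resched set right at the end */
	}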

--
ankur

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/9] x86/clear_huge_page: multi-page clearing
  2023-04-03  5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
                   ` (8 preceding siblings ...)
  2023-04-03  5:22 ` [PATCH 9/9] x86/clear_huge_page: make clear_contig_region() preemptible Ankur Arora
@ 2023-04-05 19:48 ` Raghavendra K T
  2023-04-08 22:46   ` Ankur Arora
  9 siblings, 1 reply; 30+ messages in thread
From: Raghavendra K T @ 2023-04-05 19:48 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
	willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
	jon.grimm, bharata, boris.ostrovsky, konrad.wilk

On 4/3/2023 10:52 AM, Ankur Arora wrote:
> This series introduces multi-page clearing for hugepages.
> 
> This is a follow up of some of the ideas discussed at:
>    https://lore.kernel.org/lkml/CAHk-=wj9En-BC4t7J9xFZOws5ShwaR9yor7FxHZr8CTVyEP_+Q@mail.gmail.com/
> 
> On x86 page clearing is typically done via string intructions. These,
> unlike a MOV loop, allow us to explicitly advertise the region-size to
> the processor, which could serve as a hint to current (and/or
> future) uarchs to elide cacheline allocation.
> 
> In current generation processors, Milan (and presumably other Zen
> variants) use the hint to elide cacheline allocation (for
> region-size > LLC-size.)
> 
> An additional reason for doing this is that string instructions are typically
> microcoded, and clearing in bigger chunks than the current page-at-a-
> time logic amortizes some of the cost.
> 
> All uarchs tested (Milan, Icelakex, Skylakex) showed improved performance.
> 
> There are, however, some problems:
> 
> 1. extended zeroing periods means there's an increased latency due to
>     the now missing preemption points.
> 
>     That's handled in patches 7, 8, 9:
>       "sched: define TIF_ALLOW_RESCHED"
>       "irqentry: define irqentry_exit_allow_resched()"
>       "x86/clear_huge_page: make clear_contig_region() preemptible"
>     by the context marking itself reschedulable, and rescheduling in
>     irqexit context if needed (for PREEMPTION_NONE/_VOLUNTARY.)
> 
> 2. the current page-at-a-time clearing logic does left-right narrowing
>     towards the faulting page which benefits workloads by maintaining
>     cache locality for workloads which have a sequential pattern. Clearing
>     in large chunks loses that.
> 
>     Some (but not all) of that could be ameliorated by something like
>     this patch:
>     https://lore.kernel.org/lkml/20220606203725.1313715-1-ankur.a.arora@oracle.com/
> 
>     But, before doing that I'd like some comments on whether that is
>     worth doing for this specific use case?
> 
> Rest of the series:
>    Patches 1, 2, 3:
>      "huge_pages: get rid of process_huge_page()"
>      "huge_page: get rid of {clear,copy}_subpage()"
>      "huge_page: allow arch override for clear/copy_huge_page()"
>    are mechanical and they simplify some of the current clear_huge_page()
>    logic.
> 
>    Patches 4, 5:
>    "x86/clear_page: parameterize clear_page*() to specify length"
>    "x86/clear_pages: add clear_pages()"
> 
>    add clear_pages() and helpers.
> 
>    Patch 6: "mm/clear_huge_page: use multi-page clearing" adds the
>    chunked x86 clear_huge_page() implementation.
> 
> 
> Performance
> ==
> 
> Demand fault performance gets a decent boost:
> 
>    *Icelakex*  mm/clear_huge_page   x86/clear_huge_page   change
>                            (GB/s)                (GB/s)
>                                                                    
>    pg-sz=2MB                 8.76                 11.82   +34.93%
>    pg-sz=1GB                 8.99                 12.18   +35.48%
> 
> 
>    *Milan*     mm/clear_huge_page   x86/clear_huge_page   change
>                            (GB/s)                (GB/s)
>                                                                     
>    pg-sz=2MB                12.24                 17.54    +43.30%
>    pg-sz=1GB                17.98                 37.24   +107.11%
> 
> 
> vm-scalability/case-anon-w-seq-hugetlb, gains in stime but performs
> worse when user space tries to touch those pages:
> 
>    *Icelakex*                  mm/clear_huge_page   x86/clear_huge_page   change
>    (mem=4GB/task, tasks=128)
> 
>    stime                           293.02 +- .49%        239.39 +- .83%   -18.30%
>    utime                           440.11 +- .28%        508.74 +- .60%   +15.59%
>    wall-clock                        5.96 +- .33%          6.27 +-2.23%   + 5.20%
> 
> 
>    *Milan*                     mm/clear_huge_page   x86/clear_huge_page   change
>    (mem=1GB/task, tasks=512)
> 
>    stime                          490.95 +- 3.55%       466.90 +- 4.79%   - 4.89%
>    utime                          276.43 +- 2.85%       311.97 +- 5.15%   +12.85%
>    wall-clock                       3.74 +- 6.41%         3.58 +- 7.82%   - 4.27%
> 
> Also at:
>    github.com/terminus/linux clear-pages.v1
> 
> Comments appreciated!
> 

Hello Ankur,

I was able to test your patches. To summarize, I am seeing a 2x-3x perf
improvement for the 2M and 1GB base hugepage sizes.

SUT: Genoa AMD EPYC
    Thread(s) per core:  2
    Core(s) per socket:  128
    Socket(s):           2

NUMA:
   NUMA node(s):          2
   NUMA node0 CPU(s):     0-127,256-383
   NUMA node1 CPU(s):     128-255,384-511

Test: Use mmap(MAP_HUGETLB) to demand-fault a 64GB region (NUMA node0),
for both base-hugepage-size=2M and 1GB

perf stat -r 10 -d -d  numactl -m 0 -N 0 <test>
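
Roughly, <test> just mmaps a MAP_HUGETLB region and writes to it so
every huge page is demand-faulted; a simplified sketch along these
lines (exact flags/step size may differ):

	#include <sys/mman.h>

	#define SZ	(64UL << 30)	/* 64GB region */

	int main(void)
	{
		/* add MAP_HUGE_2MB or MAP_HUGE_1GB to pick the page size */
		char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
			       -1, 0);

		if (p == MAP_FAILED)
			return 1;

		/* one write per 4K base page faults in every huge page */
		for (unsigned long off = 0; off < SZ; off += 4096)
			p[off] = 1;

		return 0;
	}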

time in seconds elapsed (average of 10 runs) (lower = better)

Result:
page-size  mm/clear_huge_page   x86/clear_huge_page     change %
2M              5.4567          2.6774                  -50.93
1G              2.64452         1.011281                -61.76

Full perfstat info

  page size = 2M mm/clear_huge_page

  Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_2M' (10 runs):

           5,434.71 msec task-clock                #    0.996 CPUs utilized              ( +-  0.55% )
                  8      context-switches          #    1.466 /sec                       ( +-  4.66% )
                  0      cpu-migrations            #    0.000 /sec
             32,918      page-faults               #    6.034 K/sec                      ( +-  0.00% )
     16,977,242,482      cycles                    #    3.112 GHz                        ( +-  0.04% )  (35.70%)
          1,961,724      stalled-cycles-frontend   #    0.01% frontend cycles idle       ( +-  1.09% )  (35.72%)
         35,685,674      stalled-cycles-backend    #    0.21% backend cycles idle        ( +-  3.48% )  (35.74%)
      1,038,327,182      instructions              #    0.06  insn per cycle
                                                   #    0.04  stalled cycles per insn    ( +-  0.38% )  (35.75%)
        221,409,216      branches                  #   40.584 M/sec                      ( +-  0.36% )  (35.75%)
            350,730      branch-misses             #    0.16% of all branches            ( +-  1.18% )  (35.75%)
      2,520,888,779      L1-dcache-loads           #  462.077 M/sec                      ( +-  0.03% )  (35.73%)
      1,094,178,209      L1-dcache-load-misses     #   43.46% of all L1-dcache accesses  ( +-  0.02% )  (35.71%)
         67,751,730      L1-icache-loads           #   12.419 M/sec                      ( +-  0.11% )  (35.70%)
            271,118      L1-icache-load-misses     #    0.40% of all L1-icache accesses  ( +-  2.55% )  (35.70%)
            506,635      dTLB-loads                #   92.866 K/sec                      ( +-  3.31% )  (35.70%)
            237,385      dTLB-load-misses          #   43.64% of all dTLB cache accesses ( +-  7.00% )  (35.69%)
                268      iTLB-load-misses          # 6700.00% of all iTLB cache accesses ( +- 13.86% )  (35.70%)

             5.4567 +- 0.0300 seconds time elapsed  ( +-  0.55% )

  page size = 2M x86/clear_huge_page

  Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_2M' (10 runs):

           2,780.69 msec task-clock                #    1.039 CPUs utilized              ( +-  1.03% )
                  3      context-switches          #    1.121 /sec                       ( +- 21.34% )
                  0      cpu-migrations            #    0.000 /sec
             32,918      page-faults               #   12.301 K/sec                      ( +-  0.00% )
      8,143,619,771      cycles                    #    3.043 GHz                        ( +-  0.25% )  (35.62%)
          2,024,872      stalled-cycles-frontend   #    0.02% frontend cycles idle       ( +-320.93% )  (35.66%)
        717,198,728      stalled-cycles-backend    #    8.82% backend cycles idle        ( +-  8.26% )  (35.69%)
        606,549,334      instructions              #    0.07  insn per cycle
                                                   #    1.39  stalled cycles per insn    ( +-  0.23% )  (35.73%)
        108,856,550      branches                  #   40.677 M/sec                      ( +-  0.24% )  (35.76%)
            202,490      branch-misses             #    0.18% of all branches            ( +-  3.58% )  (35.78%)
      2,348,818,806      L1-dcache-loads           #  877.701 M/sec                      ( +-  0.03% )  (35.78%)
      1,081,562,988      L1-dcache-load-misses     #   46.04% of all L1-dcache accesses  ( +-  0.01% )  (35.78%)
    <not supported>      LLC-loads
    <not supported>      LLC-load-misses
         43,411,167      L1-icache-loads           #   16.222 M/sec                      ( +-  0.19% )  (35.77%)
            273,042      L1-icache-load-misses     #    0.64% of all L1-icache accesses  ( +-  4.94% )  (35.76%)
            834,482      dTLB-loads                #  311.827 K/sec                      ( +-  9.73% )  (35.72%)
            437,343      dTLB-load-misses          #   65.86% of all dTLB cache accesses ( +-  8.56% )  (35.68%)
                  0      iTLB-loads                #    0.000 /sec                       (35.65%)
                160      iTLB-load-misses          # 1777.78% of all iTLB cache accesses ( +- 15.82% )  (35.62%)

             2.6774 +- 0.0287 seconds time elapsed  ( +-  1.07% )

  page size = 1G mm/clear_huge_page

  Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):

           2,625.24 msec task-clock                #    0.993 CPUs utilized              ( +-  0.23% )
                  4      context-switches          #    1.513 /sec                       ( +-  4.49% )
                  1      cpu-migrations            #    0.378 /sec
                214      page-faults               #   80.965 /sec                      ( +-  0.13% )
      8,178,624,349      cycles                    #    3.094 GHz                        ( +-  0.23% )  (35.65%)
          2,942,576      stalled-cycles-frontend   #    0.04% frontend cycles idle       ( +- 75.22% )  (35.69%)
          7,117,425      stalled-cycles-backend    #    0.09% backend cycles idle        ( +-  3.79% )  (35.73%)
        454,521,647      instructions              #    0.06  insn per cycle
                                                   #    0.02  stalled cycles per insn    ( +-  0.10% )  (35.77%)
        113,223,853      branches                  #   42.837 M/sec                      ( +-  0.08% )  (35.80%)
             84,766      branch-misses             #    0.07% of all branches            ( +-  5.37% )  (35.80%)
      2,294,528,890      L1-dcache-loads           #  868.111 M/sec                      ( +-  0.02% )  (35.81%)
      1,075,907,551      L1-dcache-load-misses     #   46.88% of all L1-dcache accesses  ( +-  0.02% )  (35.78%)
         26,167,323      L1-icache-loads           #    9.900 M/sec                      ( +-  0.24% )  (35.74%)
            139,675      L1-icache-load-misses     #    0.54% of all L1-icache accesses  ( +-  0.37% )  (35.70%)
              3,459      dTLB-loads                #    1.309 K/sec                      ( +- 12.75% )  (35.67%)
                732      dTLB-load-misses          #   19.71% of all dTLB cache accesses ( +- 26.61% )  (35.62%)
                 11      iTLB-load-misses          #  192.98% of all iTLB cache accesses ( +-238.28% )  (35.62%)

            2.64452 +- 0.00600 seconds time elapsed  ( +-  0.23% )


  page size = 1G x86/clear_huge_page

  Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):

           1,009.09 msec task-clock                #    0.998 CPUs utilized              ( +-  0.06% )
                  2      context-switches          #    1.980 /sec                       ( +- 23.63% )
                  1      cpu-migrations            #    0.990 /sec
                214      page-faults               #  211.887 /sec                      ( +-  0.16% )
      3,154,980,463      cycles                    #    3.124 GHz                        ( +-  0.06% )  (35.77%)
            145,051      stalled-cycles-frontend   #    0.00% frontend cycles idle       ( +-  6.26% )  (35.78%)
        730,087,143      stalled-cycles-backend    #   23.12% backend cycles idle        ( +-  9.75% )  (35.78%)
         45,813,391      instructions              #    0.01  insn per cycle
                                                   #   18.51  stalled cycles per insn    ( +-  1.00% )  (35.78%)
          8,498,282      branches                  #    8.414 M/sec                      ( +-  1.54% )  (35.78%)
             63,351      branch-misses             #    0.74% of all branches            ( +-  6.70% )  (35.69%)
         29,135,863      L1-dcache-loads           #   28.848 M/sec                      ( +-  5.67% )  (35.68%)
          8,537,280      L1-dcache-load-misses     #   28.66% of all L1-dcache accesses  ( +- 10.15% )  (35.68%)
          1,040,087      L1-icache-loads           #    1.030 M/sec                      ( +-  1.60% )  (35.68%)
              9,147      L1-icache-load-misses     #    0.85% of all L1-icache accesses  ( +-  6.50% )  (35.67%)
              1,084      dTLB-loads                #    1.073 K/sec                      ( +- 12.05% )  (35.68%)
                431      dTLB-load-misses          #   40.28% of all dTLB cache accesses ( +- 43.46% )  (35.68%)
                 16      iTLB-load-misses          #    0.00% of all iTLB cache accesses ( +- 40.54% )  (35.68%)

           1.011281 +- 0.000624 seconds time elapsed  ( +-  0.06% )

Please feel free to add

Tested-by: Raghavendra K T <raghavendra.kt@amd.com>

Will come back with further observations on the patches/performance, if any.

Thanks and Regards

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/9] sched: define TIF_ALLOW_RESCHED
  2023-04-03  5:22 ` [PATCH 7/9] sched: define TIF_ALLOW_RESCHED Ankur Arora
@ 2023-04-05 20:07   ` Peter Zijlstra
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2023-04-05 20:07 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
	tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
	konrad.wilk

On Sun, Apr 02, 2023 at 10:22:31PM -0700, Ankur Arora wrote:
> Define TIF_ALLOW_RESCHED to allow threads to mark themselves as allowing
> rescheduling in the irqexit path with PREEMPTION_NONE/_VOLUNTARY.
> 
> This is meant to be used for long running tasks where it is
> not convenient to periodically call cond_resched().
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  arch/x86/include/asm/thread_info.h |  2 ++
>  include/linux/sched.h              | 29 +++++++++++++++++++++++++++++
>  2 files changed, 31 insertions(+)
> 
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index f1cccba52eb9..8c18b9eaeec4 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -100,6 +100,7 @@ struct thread_info {
>  #define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */
>  #define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
>  #define TIF_ADDR32		29	/* 32-bit address space on 64 bits */
> +#define TIF_RESCHED_ALLOW	30	/* can reschedule if needed */
>  
>  #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
>  #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
> @@ -122,6 +123,7 @@ struct thread_info {
>  #define _TIF_BLOCKSTEP		(1 << TIF_BLOCKSTEP)
>  #define _TIF_LAZY_MMU_UPDATES	(1 << TIF_LAZY_MMU_UPDATES)
>  #define _TIF_ADDR32		(1 << TIF_ADDR32)
> +#define _TIF_RESCHED_ALLOW	(1 << TIF_RESCHED_ALLOW)
>  
>  /* flags to check in __switch_to() */
>  #define _TIF_WORK_CTXSW_BASE					\
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 63d242164b1a..1e7536e6d9ce 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2229,6 +2229,35 @@ static __always_inline bool need_resched(void)
>  	return unlikely(tif_need_resched());
>  }
>  
> +/*
> + * Define this in common code to avoid include hell.
> + */
> +static __always_inline bool resched_allowed(void)
> +{
> +#ifndef TIF_RESCHED_ALLOW
> +	return false;
> +#else
> +	return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
> +#endif
> +}
> +
> +static inline void allow_resched(void)
> +{
> +	/*
> +	 * allow_resched() allows preemption via the irqexit context.
> +	 * To ensure that we stick around on the current runqueue,
> +	 * disallow migration.
> +	 */
> +	migrate_disable();
> +	set_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
> +}
> +
> +static inline void disallow_resched(void)
> +{
> +	clear_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
> +	migrate_enable();
> +}

Why the migrate_disable(), that comment doesn't help much.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 8/9] irqentry: define irqentry_exit_allow_resched()
  2023-04-03  5:22 ` [PATCH 8/9] irqentry: define irqentry_exit_allow_resched() Ankur Arora
  2023-04-04  9:38   ` Thomas Gleixner
@ 2023-04-05 20:22   ` Peter Zijlstra
  2023-04-06 16:56     ` Ankur Arora
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-04-05 20:22 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
	tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
	konrad.wilk

On Sun, Apr 02, 2023 at 10:22:32PM -0700, Ankur Arora wrote:
> Allow threads marked TIF_ALLOW_RESCHED to be rescheduled in irqexit.
> 
> This is only necessary under !preempt_model_preemptible() for which
> we reuse the same logic as irqentry_exit_code_resched().
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  kernel/entry/common.c |  8 ++++++++
>  kernel/sched/core.c   | 36 +++++++++++++++++++++---------------
>  2 files changed, 29 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index be61332c66b5..f1005595ebe7 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -390,6 +390,9 @@ void raw_irqentry_exit_cond_resched(void)
>  			preempt_schedule_irq();
>  	}
>  }
> +
> +void irqentry_exit_allow_resched(void) __alias(raw_irqentry_exit_cond_resched);

Because typing raw_irqentry_exit_cond_resched() was too much effort?

> +
>  #ifdef CONFIG_PREEMPT_DYNAMIC
>  #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
>  DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
> @@ -431,6 +434,11 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>  		instrumentation_begin();
>  		if (IS_ENABLED(CONFIG_PREEMPTION))
>  			irqentry_exit_cond_resched();
> +		/*
> +		 * We care about this clause only in the dynamic !preemptible case.
> +		 */
> +		if (unlikely(!preempt_model_preemptible() && resched_allowed()))
> +			irqentry_exit_allow_resched();

This is wrong, if we have dynamic preemption then we have
CONFIG_PREEMPTION and we'll have that irqentry_exit_cond_resched() call.

Basically what you've written above is something like:

	static_call(foo); // raw_foo() when A, NOP if !A
	if (!A)
		raw_foo();

And yeah, you've got the extra resched_allowed() thing in there, but
that doesn't make it any better -- this is all quite horrible.

What you really care about is the CONFIG_PREEMPTION=n case, but that you
don't actually handle.

>  		/* Covers both tracing and lockdep */
>  		trace_hardirqs_on();
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0d18c3969f90..11845a91b691 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c

> @@ -8597,28 +8599,32 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
>   * SC:preempt_schedule
>   * SC:preempt_schedule_notrace
>   * SC:irqentry_exit_cond_resched
> + * SC:irqentry_exit_allow_resched
>   *
>   *
>   * NONE:
> - *   cond_resched               <- __cond_resched
> - *   might_resched              <- RET0
> - *   preempt_schedule           <- NOP
> - *   preempt_schedule_notrace   <- NOP
> - *   irqentry_exit_cond_resched <- NOP
> + *   cond_resched                <- __cond_resched
> + *   might_resched               <- RET0
> + *   preempt_schedule            <- NOP
> + *   preempt_schedule_notrace    <- NOP
> + *   irqentry_exit_cond_resched  <- NOP
> + *   irqentry_exit_allow_resched <- irqentry_exit_allow_resched
>   *
>   * VOLUNTARY:
> - *   cond_resched               <- __cond_resched
> - *   might_resched              <- __cond_resched
> - *   preempt_schedule           <- NOP
> - *   preempt_schedule_notrace   <- NOP
> - *   irqentry_exit_cond_resched <- NOP
> + *   cond_resched                <- __cond_resched
> + *   might_resched               <- __cond_resched
> + *   preempt_schedule            <- NOP
> + *   preempt_schedule_notrace    <- NOP
> + *   irqentry_exit_cond_resched  <- NOP
> + *   irqentry_exit_allow_resched <- irqentry_exit_allow_resched
>   *
>   * FULL:
> - *   cond_resched               <- RET0
> - *   might_resched              <- RET0
> - *   preempt_schedule           <- preempt_schedule
> - *   preempt_schedule_notrace   <- preempt_schedule_notrace
> - *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
> + *   cond_resched                <- RET0
> + *   might_resched               <- RET0
> + *   preempt_schedule            <- preempt_schedule
> + *   preempt_schedule_notrace    <- preempt_schedule_notrace
> + *   irqentry_exit_cond_resched  <- irqentry_exit_cond_resched
> + *   irqentry_exit_allow_resched <- NOP
>   */

This ^ is all complete nonsense.. irqentry_exit_allow_resched() is not
a static call. It's an alias of raw_irqentry_exit_cond_resched which
circumvents the static call entirely.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 9/9] x86/clear_huge_page: make clear_contig_region() preemptible
  2023-04-03  5:22 ` [PATCH 9/9] x86/clear_huge_page: make clear_contig_region() preemptible Ankur Arora
@ 2023-04-05 20:27   ` Peter Zijlstra
  2023-04-06 17:00     ` Ankur Arora
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-04-05 20:27 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
	tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
	konrad.wilk

On Sun, Apr 02, 2023 at 10:22:33PM -0700, Ankur Arora wrote:
> clear_contig_region() can be used to clear up to a huge-page (2MB/1GB)
> chunk Allow preemption in the irqentry_exit path to make sure we don't
> hold on to the CPU for an arbitrarily long period.
> 
> Performance: vm-scalability/case-anon-w-seq-hugetlb mmaps an anonymous
> hugetlb-2mb region, and then writes sequentially to the region, demand
> faulting pages on the way.
> 
> This test, with a CONFIG_VOLUNTARY config shows the effects of this
> change: stime drops (~18% on Icelakex, ~5% on Milan), while the utime
> goes up (~15% on Icelakex, ~13% on Milan.)
> 
>   *Icelakex*                  mm/clear_huge_page   x86/clear_huge_page   change
>   (mem=4GB/task, tasks=128)
> 
>   stime                           293.02 +- .49%        239.39 +- .83%   -18.30%
>   utime                           440.11 +- .28%        508.74 +- .60%   +15.59%
>   wall-clock                        5.96 +- .33%          6.27 +-2.23%   + 5.20%
> 
> 
> 
>   *Milan*                     mm/clear_huge_page   x86/clear_huge_page   change
>   (mem=1GB/task, tasks=512)
> 
>   stime                          490.95 +- 3.55%       466.90 +- 4.79%   - 4.89%
>   utime                          276.43 +- 2.85%       311.97 +- 5.15%   +12.85%
>   wall-clock                       3.74 +- 6.41%         3.58 +- 7.82%   - 4.27%
> 
> The drop in stime is due to REP; STOS being more efficient for bigger
> extents.  The increase in utime is due to cache effects of that change:
> mm/clear_huge_page() clears page-at-a-time, while narrowing towards the
> faulting page; while x86/clear_huge_page only optimizes for cache
> locality in the local neighbourhood of the faulting address.
> 
> This effect on utime is visible via the increased L1-dcache-load-misses
> and LLC-load* and an increased backend boundedness for perf user-stat
> --all-user on Icelakex. The effect is slight but given the heavy cache
> pressure generated by the test, shows up in the drop in user IPC:
> 
>     -  9,455,243,414,829      instructions                     #    2.75  insn per cycle              ( +- 14.14% )  (46.17%)
>     -  2,367,920,864,112      L1-dcache-loads                  #    1.054 G/sec                       ( +- 14.14% )  (69.24%)
>     -     42,075,182,813      L1-dcache-load-misses            #    2.96% of all L1-dcache accesses   ( +- 14.14% )  (69.24%)
>     -         20,365,688      LLC-loads                        #    9.064 K/sec                       ( +- 13.98% )  (69.24%)
>     -            890,382      LLC-load-misses                  #    7.18% of all LL-cache accesses    ( +- 14.91% )  (69.24%)
> 
>     +  9,467,796,660,698      instructions                     #    2.37  insn per cycle              ( +- 14.14% )  (46.16%)
>     +  2,369,973,307,561      L1-dcache-loads                  #    1.027 G/sec                       ( +- 14.14% )  (69.24%)
>     +     42,155,621,201      L1-dcache-load-misses            #    2.96% of all L1-dcache accesses   ( +- 14.14% )  (69.24%)
>     +         22,116,300      LLC-loads                        #    9.588 K/sec                       ( +- 14.20% )  (69.24%)
>     +          1,355,607      LLC-load-misses                  #   10.29% of all LL-cache accesses    ( +- 15.49% )  (69.25%)
> 
> Given the fact that the stime improves for all loads using this path,
> while the utime drop is load dependent add this change.

Either I really need sleep, or *NONE* of the above is actually relevant
to what the patch below actually does!

The above talks about the glories of using large clears, while the patch
allows reschedules which are about latency.

> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  arch/x86/mm/hugetlbpage.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 4294b77c4f18..c8564b0552e5 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -158,7 +158,17 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>  static void clear_contig_region(struct page *page, unsigned long vaddr,
>  				unsigned int npages)
>  {
> +	might_sleep();
> +
> +	/*
> +	 * We might be clearing a large region.
> +	 * Allow rescheduling.
> +	 */
> +	allow_resched();
>  	clear_user_pages(page_address(page), vaddr, page, npages);
> +	disallow_resched();
> +
> +	cond_resched();
>  }
>  
>  void clear_huge_page(struct page *page,
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/9] x86/clear_page: parameterize clear_page*() to specify length
  2023-04-03  5:22 ` [PATCH 4/9] x86/clear_page: parameterize clear_page*() to specify length Ankur Arora
@ 2023-04-06  8:19   ` Peter Zijlstra
  2023-04-07  3:03     ` Ankur Arora
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-04-06  8:19 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
	tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
	konrad.wilk

On Sun, Apr 02, 2023 at 10:22:28PM -0700, Ankur Arora wrote:
> Change clear_page*() to take a length parameter.
> Rename to clear_pages_*().
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> index ecbfb4dd3b01..6069acf6072f 100644
> --- a/arch/x86/lib/clear_page_64.S
> +++ b/arch/x86/lib/clear_page_64.S
> @@ -2,6 +2,8 @@
>  #include <linux/linkage.h>
>  #include <asm/asm.h>
>  #include <asm/export.h>
> +#include <asm/cpufeatures.h>
> +#include <asm/alternative.h>
>  
>  /*
>   * Most CPUs support enhanced REP MOVSB/STOSB instructions. It is
> @@ -13,18 +15,30 @@
>  /*
>   * Zero a page.
>   * %rdi	- page
> + * %esi	- page-length
> + *
> + * Clobbers: %rax, %rcx
>   */
> -SYM_FUNC_START(clear_page_rep)
> -	movl $4096/8,%ecx
> +SYM_FUNC_START(clear_pages_rep)
> +	movl %esi, %ecx
>  	xorl %eax,%eax
> +	shrl $3,%ecx
>  	rep stosq
>  	RET
> -SYM_FUNC_END(clear_page_rep)
> -EXPORT_SYMBOL_GPL(clear_page_rep)
> +SYM_FUNC_END(clear_pages_rep)
> +EXPORT_SYMBOL_GPL(clear_pages_rep)
>  
> -SYM_FUNC_START(clear_page_orig)
> +/*
> + * Original page zeroing loop.
> + * %rdi	- page
> + * %esi	- page-length
> + *
> + * Clobbers: %rax, %rcx, %rflags
> + */
> +SYM_FUNC_START(clear_pages_orig)
> +	movl   %esi, %ecx
>  	xorl   %eax,%eax
> -	movl   $4096/64,%ecx
> +	shrl   $6,%ecx
>  	.p2align 4
>  .Lloop:
>  	decl	%ecx
> @@ -41,16 +55,23 @@ SYM_FUNC_START(clear_page_orig)
>  	jnz	.Lloop
>  	nop
>  	RET
> -SYM_FUNC_END(clear_page_orig)
> -EXPORT_SYMBOL_GPL(clear_page_orig)
> +SYM_FUNC_END(clear_pages_orig)
> +EXPORT_SYMBOL_GPL(clear_pages_orig)
>  
> -SYM_FUNC_START(clear_page_erms)
> -	movl $4096,%ecx
> +/*
> + * Zero a page.
> + * %rdi	- page
> + * %esi	- page-length
> + *
> + * Clobbers: %rax, %rcx
> + */
> +SYM_FUNC_START(clear_pages_erms)
> +	movl %esi, %ecx
>  	xorl %eax,%eax
>  	rep stosb
>  	RET
> -SYM_FUNC_END(clear_page_erms)
> -EXPORT_SYMBOL_GPL(clear_page_erms)
> +SYM_FUNC_END(clear_pages_erms)
> +EXPORT_SYMBOL_GPL(clear_pages_erms)

So it seems to me that clear_user_*() and clear_page_*() are now very
similar; is there really no way to de-duplicate all that?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/9] x86/clear_pages: add clear_pages()
  2023-04-03  5:22 ` [PATCH 5/9] x86/clear_pages: add clear_pages() Ankur Arora
@ 2023-04-06  8:23   ` Peter Zijlstra
  2023-04-07  0:50     ` Ankur Arora
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-04-06  8:23 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
	tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
	konrad.wilk

On Sun, Apr 02, 2023 at 10:22:29PM -0700, Ankur Arora wrote:
> Add clear_pages() and define the ancillary clear_user_pages().
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  arch/x86/include/asm/page.h    | 6 ++++++
>  arch/x86/include/asm/page_32.h | 6 ++++++
>  arch/x86/include/asm/page_64.h | 9 +++++++--
>  3 files changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> index d18e5c332cb9..03e3c69fc427 100644
> --- a/arch/x86/include/asm/page.h
> +++ b/arch/x86/include/asm/page.h
> @@ -28,6 +28,12 @@ static inline void clear_user_page(void *page, unsigned long vaddr,
>  	clear_page(page);
>  }
>  
> +static inline void clear_user_pages(void *page, unsigned long vaddr,
> +				    struct page *pg, unsigned int nsubpages)
> +{
> +	clear_pages(page, nsubpages);
> +}

This seems dodgy, clear_user* has slightly different semantics. It needs
the access_ok() and stac/clac thing on, at the very least.

> +
>  static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
>  				  struct page *topage)
>  {
> diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
> index 580d71aca65a..3523d1150cfc 100644
> --- a/arch/x86/include/asm/page_32.h
> +++ b/arch/x86/include/asm/page_32.h
> @@ -22,6 +22,12 @@ static inline void clear_page(void *page)
>  	memset(page, 0, PAGE_SIZE);
>  }
>  
> +static inline void clear_pages(void *page, unsigned int nsubpages)
> +{
> +	for (int i = 0; i < nsubpages; i++)
> +		clear_page(page + i * PAGE_SIZE);

cond_resched() ?

> +}
> +
>  static inline void copy_page(void *to, void *from)
>  {
>  	memcpy(to, from, PAGE_SIZE);

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 8/9] irqentry: define irqentry_exit_allow_resched()
  2023-04-05 20:22   ` Peter Zijlstra
@ 2023-04-06 16:56     ` Ankur Arora
  2023-04-06 20:13       ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Ankur Arora @ 2023-04-06 16:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
	tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
	konrad.wilk


Peter Zijlstra <peterz@infradead.org> writes:

> On Sun, Apr 02, 2023 at 10:22:32PM -0700, Ankur Arora wrote:
>> Allow threads marked TIF_ALLOW_RESCHED to be rescheduled in irqexit.
>>
>> This is only necessary under !preempt_model_preemptible() for which
>> we reuse the same logic as irqentry_exit_code_resched().
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  kernel/entry/common.c |  8 ++++++++
>>  kernel/sched/core.c   | 36 +++++++++++++++++++++---------------
>>  2 files changed, 29 insertions(+), 15 deletions(-)
>>
>> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
>> index be61332c66b5..f1005595ebe7 100644
>> --- a/kernel/entry/common.c
>> +++ b/kernel/entry/common.c
>> @@ -390,6 +390,9 @@ void raw_irqentry_exit_cond_resched(void)
>>  			preempt_schedule_irq();
>>  	}
>>  }
>> +
>> +void irqentry_exit_allow_resched(void) __alias(raw_irqentry_exit_cond_resched);
>
> Because typing raw_irqentry_exit_cond_resched() was too much effort?

Probably unnecessary, but I wanted to underscore that irqentry_exit_allow_resched()
handles a separate user path.

>> +
>>  #ifdef CONFIG_PREEMPT_DYNAMIC
>>  #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
>>  DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
>> @@ -431,6 +434,11 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>>  		instrumentation_begin();
>>  		if (IS_ENABLED(CONFIG_PREEMPTION))
>>  			irqentry_exit_cond_resched();
>> +		/*
>> +		 * We care about this clause only in the dynamic !preemptible case.
>> +		 */
>> +		if (unlikely(!preempt_model_preemptible() && resched_allowed()))
>> +			irqentry_exit_allow_resched();
>
> This is wrong, if we have dynamic preemption then we have
> CONFIG_PREEMPTION and we'll have that irqentry_exit_cond_resched() call.
>
> Basically what you've written above is something like:
>
> 	static_call(foo); // raw_foo() when A, NOP if !A

Right, so:

CONFIG_PREEMPTION: raw_foo()
CONFIG_PREEMPTION && (preempt_dynamic_mode == preempt_dynamic_full): raw_foo()

This is NOP if CONFIG_PREEMPTION && preempt_dynamic_mode != preempt_dynamic_full.

> 	if (!A)
> 		raw_foo();

So I would call raw_foo() for the CONFIG_PREEMPTION=n case.

> What you really care about is the CONFIG_PREEMPTION=n case, but that you
> don't actually handle.

I don't see that. The CONFIG_PREEMPTION=n (or its dynamic version)
is being handled here.

Ignoring the RT case, preempt_model_preemptible() is either:
   IS_ENABLED(CONFIG_PREEMPT)
or:
   preempt_dynamic_mode == preempt_dynamic_full

So, I'm handling both the static and the dynamic case here.

if (!IS_ENABLED(CONFIG_PREEMPTION))
   raw_foo()
or
if (preempt_dynamic_mode != preempt_dynamic_full)
   raw_foo()
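
Or, with the real names substituted back in (just restating the above,
in case I'm misreading you):

	/* in irqentry_exit(), for the !preemptible models */
	if (unlikely(!preempt_model_preemptible() && resched_allowed()))
		irqentry_exit_allow_resched();	/* aliases raw_irqentry_exit_cond_resched() */

i.e. the branch is never taken for CONFIG_PREEMPT or
preempt_dynamic_full, and for NONE/VOLUNTARY it reduces to
raw_irqentry_exit_cond_resched() only when the task has marked itself
with TIF_RESCHED_ALLOW.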

> And yeah, you've got the extra resched_allowed() thing in there, but
> that doesn't make it any better -- this is all quite horrible.

I don't disagree. There are quite a lot of static/dynamic config options
here, and adding this clause didn't improve things.

I think just going with a static call here for the allow-resched case
might have been cleaner. Alternatively, I'll see if it can be cleanly
folded in with the irqentry_exit_cond_resched() logic.

>>  		/* Covers both tracing and lockdep */
>>  		trace_hardirqs_on();
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 0d18c3969f90..11845a91b691 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>
>> @@ -8597,28 +8599,32 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
>>   * SC:preempt_schedule
>>   * SC:preempt_schedule_notrace
>>   * SC:irqentry_exit_cond_resched
>> + * SC:irqentry_exit_allow_resched
>>   *
>>   *
>>   * NONE:
>> - *   cond_resched               <- __cond_resched
>> - *   might_resched              <- RET0
>> - *   preempt_schedule           <- NOP
>> - *   preempt_schedule_notrace   <- NOP
>> - *   irqentry_exit_cond_resched <- NOP
>> + *   cond_resched                <- __cond_resched
>> + *   might_resched               <- RET0
>> + *   preempt_schedule            <- NOP
>> + *   preempt_schedule_notrace    <- NOP
>> + *   irqentry_exit_cond_resched  <- NOP
>> + *   irqentry_exit_allow_resched <- irqentry_exit_allow_resched
>>   *
>>   * VOLUNTARY:
>> - *   cond_resched               <- __cond_resched
>> - *   might_resched              <- __cond_resched
>> - *   preempt_schedule           <- NOP
>> - *   preempt_schedule_notrace   <- NOP
>> - *   irqentry_exit_cond_resched <- NOP
>> + *   cond_resched                <- __cond_resched
>> + *   might_resched               <- __cond_resched
>> + *   preempt_schedule            <- NOP
>> + *   preempt_schedule_notrace    <- NOP
>> + *   irqentry_exit_cond_resched  <- NOP
>> + *   irqentry_exit_allow_resched <- irqentry_exit_allow_resched
>>   *
>>   * FULL:
>> - *   cond_resched               <- RET0
>> - *   might_resched              <- RET0
>> - *   preempt_schedule           <- preempt_schedule
>> - *   preempt_schedule_notrace   <- preempt_schedule_notrace
>> - *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
>> + *   cond_resched                <- RET0
>> + *   might_resched               <- RET0
>> + *   preempt_schedule            <- preempt_schedule
>> + *   preempt_schedule_notrace    <- preempt_schedule_notrace
>> + *   irqentry_exit_cond_resched  <- irqentry_exit_cond_resched
>> + *   irqentry_exit_allow_resched <- NOP
>>   */
>
> This ^ is all complete nonsense.. irqentry_exit_allow_resched() is not
> a static call. It's an alias of raw_irqentry_exit_cond_resched which
> circumvents the static call entirely.

Yeah you are right. I added the SC:irqentry_exit_allow_resched
prematurely. Will fix.

Thanks for the detailed comments.

--
ankur

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 9/9] x86/clear_huge_page: make clear_contig_region() preemptible
  2023-04-05 20:27   ` Peter Zijlstra
@ 2023-04-06 17:00     ` Ankur Arora
  0 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2023-04-06 17:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
	tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
	konrad.wilk


Peter Zijlstra <peterz@infradead.org> writes:

> On Sun, Apr 02, 2023 at 10:22:33PM -0700, Ankur Arora wrote:
>> clear_contig_region() can be used to clear up to a huge-page (2MB/1GB)
>> chunk Allow preemption in the irqentry_exit path to make sure we don't
>> hold on to the CPU for an arbitrarily long period.
>>
>> Performance: vm-scalability/case-anon-w-seq-hugetlb mmaps an anonymous
>> hugetlb-2mb region, and then writes sequentially to the region, demand
>> faulting pages on the way.
>>
>> This test, with a CONFIG_VOLUNTARY config shows the effects of this
>> change: stime drops (~18% on Icelakex, ~5% on Milan), while the utime
>> goes up (~15% on Icelakex, ~13% on Milan.)
>>
>>   *Icelakex*                  mm/clear_huge_page   x86/clear_huge_page   change
>>   (mem=4GB/task, tasks=128)
>>
>>   stime                           293.02 +- .49%        239.39 +- .83%   -18.30%
>>   utime                           440.11 +- .28%        508.74 +- .60%   +15.59%
>>   wall-clock                        5.96 +- .33%          6.27 +-2.23%   + 5.20%
>>
>>
>>
>>   *Milan*                     mm/clear_huge_page   x86/clear_huge_page   change
>>   (mem=1GB/task, tasks=512)
>>
>>   stime                          490.95 +- 3.55%       466.90 +- 4.79%   - 4.89%
>>   utime                          276.43 +- 2.85%       311.97 +- 5.15%   +12.85%
>>   wall-clock                       3.74 +- 6.41%         3.58 +- 7.82%   - 4.27%
>>
>> The drop in stime is due to REP; STOS being more efficient for bigger
>> extents.  The increase in utime is due to cache effects of that change:
>> mm/clear_huge_page() clears page-at-a-time, while narrowing towards the
>> faulting page; while x86/clear_huge_page only optimizes for cache
>> locality in the local neighbourhood of the faulting address.
>>
>> This effect on utime is visible via the increased L1-dcache-load-misses
>> and LLC-load* and an increased backend boundedness for perf user-stat
>> --all-user on Icelakex. The effect is slight but given the heavy cache
>> pressure generated by the test, shows up in the drop in user IPC:
>>
>>     -  9,455,243,414,829      instructions                     #    2.75  insn per cycle              ( +- 14.14% )  (46.17%)
>>     -  2,367,920,864,112      L1-dcache-loads                  #    1.054 G/sec                       ( +- 14.14% )  (69.24%)
>>     -     42,075,182,813      L1-dcache-load-misses            #    2.96% of all L1-dcache accesses   ( +- 14.14% )  (69.24%)
>>     -         20,365,688      LLC-loads                        #    9.064 K/sec                       ( +- 13.98% )  (69.24%)
>>     -            890,382      LLC-load-misses                  #    7.18% of all LL-cache accesses    ( +- 14.91% )  (69.24%)
>>
>>     +  9,467,796,660,698      instructions                     #    2.37  insn per cycle              ( +- 14.14% )  (46.16%)
>>     +  2,369,973,307,561      L1-dcache-loads                  #    1.027 G/sec                       ( +- 14.14% )  (69.24%)
>>     +     42,155,621,201      L1-dcache-load-misses            #    2.96% of all L1-dcache accesses   ( +- 14.14% )  (69.24%)
>>     +         22,116,300      LLC-loads                        #    9.588 K/sec                       ( +- 14.20% )  (69.24%)
>>     +          1,355,607      LLC-load-misses                  #   10.29% of all LL-cache accesses    ( +- 15.49% )  (69.25%)
>>
>> Given the fact that the stime improves for all loads using this path,
>> while the utime regression is load dependent, add this change.
>
> Either I really need sleep, or *NONE* of the above is actually relevant
> to what the patch below actually does!

Yeah, you are right about the relevance.

I wanted to provide two sets of stats: the raw memory BW stats and the
relevant vm-scalability workload. The vm-scalability workload needs a
more reasonable scheduling model than what's present before this patch,
so it seemed to make sense to put it here for that reason.
But yeah, it doesn't really fit here.

> The above talks about the glories of using large clears, while the patch
> allows reschedules which are about latency.

Yeah, let me find a more reasonable way to present these.

Ankur

>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  arch/x86/mm/hugetlbpage.c | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
>> index 4294b77c4f18..c8564b0552e5 100644
>> --- a/arch/x86/mm/hugetlbpage.c
>> +++ b/arch/x86/mm/hugetlbpage.c
>> @@ -158,7 +158,17 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>>  static void clear_contig_region(struct page *page, unsigned long vaddr,
>>  				unsigned int npages)
>>  {
>> +	might_sleep();
>> +
>> +	/*
>> +	 * We might be clearing a large region.
>> +	 * Allow rescheduling.
>> +	 */
>> +	allow_resched();
>>  	clear_user_pages(page_address(page), vaddr, page, npages);
>> +	disallow_resched();
>> +
>> +	cond_resched();
>>  }
>>
>>  void clear_huge_page(struct page *page,
>> --
>> 2.31.1
>>


--
ankur
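
The patch above relies on allow_resched()/disallow_resched(), which the
series introduces elsewhere and which are not shown in this excerpt. A
minimal sketch of what such thread-flag helpers could look like, assuming a
TIF_RESCHED_ALLOW flag along the lines of the entry-code discussion further
down in the thread; an illustration only, not the series' actual
definitions:

/*
 * Sketch only: allow_resched()/disallow_resched() as thread-info flag
 * helpers. Assumes a TIF_RESCHED_ALLOW flag; the real series may differ.
 */
static inline void allow_resched(void)
{
	might_sleep();
	set_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
}

static inline void disallow_resched(void)
{
	clear_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
}

/* Checked from the irq-exit path to decide whether preemption is safe. */
static inline bool resched_allowed(void)
{
	return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
}

With something of that shape, clear_contig_region() brackets the long
REP;STOS with allow_resched()/disallow_resched() so an interrupt arriving
mid-clear can reschedule on its way out.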

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 8/9] irqentry: define irqentry_exit_allow_resched()
  2023-04-06 16:56     ` Ankur Arora
@ 2023-04-06 20:13       ` Peter Zijlstra
  2023-04-06 20:16         ` Peter Zijlstra
  2023-04-07  2:29         ` Ankur Arora
  0 siblings, 2 replies; 30+ messages in thread
From: Peter Zijlstra @ 2023-04-06 20:13 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
	tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
	konrad.wilk

On Thu, Apr 06, 2023 at 09:56:07AM -0700, Ankur Arora wrote:

> 
> Right, so:
> 
> CONFIG_PREEMPTION: raw_foo()
> CONFIG_PREEMPTION && (preempt_dynamic_mode == preempt_dynamic_full): raw_foo()
> 
> This is NOP if CONFIG_PREEMPTION && preempt_dynamic_mode != preempt_dynamic_full.
> 
> > 	if (!A)
> > 		raw_foo();
> 
> So I would call raw_foo() for the CONFIG_PREEMPTION=n case.
> 
> > What you really care about is the CONFIG_PREEMPTION=n case, but that you
> > don't actually handle.
> 
> I don't see that. The CONFIG_PREEMPTION=n (or its dynamic version)
> is being handled here.

Bah, I overlooked that we have multiple definitions of the
preempt_model_foo() things. For some reason I thought all that was only
for PREEMPT_DYNAMIC.

> > And yeah, you've got the extra resched_allowed() thing in there, but
> > that doesn't make it any better -- this is all quite horrible.
> 
> I don't disagree. There are quite a lot of static/dynamic config options
> here and adding this clause didn't improve things.
> 
> I think just going with a static call here for the allow-resched case
> might have been cleaner. Alternately I'll see if it can be cleanly
> folded in with the irqentry_exit_cond_resched() logic.

Something like the below perhaps?

---
 include/linux/entry-common.h |  6 ++++++
 kernel/entry/common.c        | 23 +++++++++++++++++++++--
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index d95ab85f96ba..0c365dc1f1c2 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -415,10 +415,16 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
  * Conditional reschedule with additional sanity checks.
  */
 void raw_irqentry_exit_cond_resched(void);
+void irqentry_exit_cond_resched_tif(void);
+
 #ifdef CONFIG_PREEMPT_DYNAMIC
 #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
 #define irqentry_exit_cond_resched_dynamic_enabled	raw_irqentry_exit_cond_resched
+#ifdef TIF_RESCHED_ALLOW
+#define irqentry_exit_cond_resched_dynamic_disabled	irqentry_exit_cond_resched_tif
+#else
 #define irqentry_exit_cond_resched_dynamic_disabled	NULL
+#endif
 DECLARE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
 #define irqentry_exit_cond_resched()	static_call(irqentry_exit_cond_resched)()
 #elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index be61332c66b5..211d118aa672 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -390,6 +390,21 @@ void raw_irqentry_exit_cond_resched(void)
 			preempt_schedule_irq();
 	}
 }
+
+void irqentry_exit_cond_resched_tif(void)
+{
+#ifdef TIF_RESCHED_ALLOW
+	if (resched_allowed()) {
+		/* Sanity check RCU and thread stack */
+		rcu_irq_exit_check_preempt();
+		if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
+			WARN_ON_ONCE(!on_thread_stack());
+		if (need_resched())
+			preempt_schedule_irq();
+	}
+#endif
+}
+
 #ifdef CONFIG_PREEMPT_DYNAMIC
 #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
 DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
@@ -397,8 +412,10 @@ DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
 DEFINE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
 void dynamic_irqentry_exit_cond_resched(void)
 {
-	if (!static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
-		return;
+	if (!static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched)) {
+		irqentry_exit_cond_resched_tif();
+		return;
+	}
 	raw_irqentry_exit_cond_resched();
 }
 #endif
@@ -431,6 +448,8 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 		instrumentation_begin();
 		if (IS_ENABLED(CONFIG_PREEMPTION))
 			irqentry_exit_cond_resched();
+		else
+			irqentry_exit_cond_resched_tif();
 
 		/* Covers both tracing and lockdep */
 		trace_hardirqs_on();
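
For reference, the *_dynamic_enabled/*_dynamic_disabled targets above are
what the preempt-model switching code plugs into the static call. Roughly,
as a simplified sketch (the real logic lives in kernel/sched/core.c and
covers more call sites than shown here):

/*
 * Simplified sketch of how the preempt-model switch consumes the
 * *_dynamic_enabled/*_dynamic_disabled definitions; illustration only.
 */
#define preempt_dynamic_enable(f)	static_call_update(f, f##_dynamic_enabled)
#define preempt_dynamic_disable(f)	static_call_update(f, f##_dynamic_disabled)

static void sched_dynamic_update_sketch(int mode)
{
	switch (mode) {
	case preempt_dynamic_full:
		/* Full preemption: irq exit reschedules as usual. */
		preempt_dynamic_enable(irqentry_exit_cond_resched);
		break;
	case preempt_dynamic_none:
	case preempt_dynamic_voluntary:
		/*
		 * With the hunk above, "disabled" is no longer NULL when
		 * TIF_RESCHED_ALLOW exists: it points at
		 * irqentry_exit_cond_resched_tif(), so only regions that
		 * opted in via allow_resched() get preempted from irq exit.
		 */
		preempt_dynamic_disable(irqentry_exit_cond_resched);
		break;
	}
}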

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 8/9] irqentry: define irqentry_exit_allow_resched()
  2023-04-06 20:13       ` Peter Zijlstra
@ 2023-04-06 20:16         ` Peter Zijlstra
  2023-04-07  2:29         ` Ankur Arora
  1 sibling, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2023-04-06 20:16 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
	tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
	konrad.wilk

On Thu, Apr 06, 2023 at 10:13:59PM +0200, Peter Zijlstra wrote:
> +void irqentry_exit_cond_resched_tif(void)
> +{
> +#ifdef TIF_RESCHED_ALLOW
> +	if (resched_allowed()) {
> +		/* Sanity check RCU and thread stack */
> +		rcu_irq_exit_check_preempt();
> +		if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
> +			WARN_ON_ONCE(!on_thread_stack());
> +		if (need_resched())
> +			preempt_schedule_irq();

arguably this can simply call raw_irqentry_exit_cond_resched()..
probably better than duplicating all that again.

> +	}
> +#endif
> +}

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/9] x86/clear_pages: add clear_pages()
  2023-04-06  8:23   ` Peter Zijlstra
@ 2023-04-07  0:50     ` Ankur Arora
  2023-04-07 10:34       ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Ankur Arora @ 2023-04-07  0:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
	tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
	konrad.wilk


Peter Zijlstra <peterz@infradead.org> writes:

> On Sun, Apr 02, 2023 at 10:22:29PM -0700, Ankur Arora wrote:
>> Add clear_pages() and define the ancillary clear_user_pages().
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  arch/x86/include/asm/page.h    | 6 ++++++
>>  arch/x86/include/asm/page_32.h | 6 ++++++
>>  arch/x86/include/asm/page_64.h | 9 +++++++--
>>  3 files changed, 19 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
>> index d18e5c332cb9..03e3c69fc427 100644
>> --- a/arch/x86/include/asm/page.h
>> +++ b/arch/x86/include/asm/page.h
>> @@ -28,6 +28,12 @@ static inline void clear_user_page(void *page, unsigned long vaddr,
>>  	clear_page(page);
>>  }
>>
>> +static inline void clear_user_pages(void *page, unsigned long vaddr,
>> +				    struct page *pg, unsigned int nsubpages)
>> +{
>> +	clear_pages(page, nsubpages);
>> +}
>
> This seems dodgy, clear_user* has slightly different semantics. It needs
> the access_ok() and stac/clac thing on at the very least.

That can't be right. On x86, clear_user_page(), copy_user_page() (and
now the multi-page versions) only write to kernel maps of user pages.
That's why they can skip the access_ok(), stac/clac or uaccess
exception handling.

From core-api/cachetlb.rst:

  ``void copy_user_page(void *to, void *from, unsigned long addr, struct page *page)``
  ``void clear_user_page(void *to, unsigned long addr, struct page *page)``

        These two routines store data in user anonymous or COW
        pages.  It allows a port to efficiently avoid D-cache alias
        issues between userspace and the kernel.

        For example, a port may temporarily map 'from' and 'to' to
        kernel virtual addresses during the copy.  The virtual address
        for these two pages is chosen in such a way that the kernel
        load/store instructions happen to virtual addresses which are
        of the same "color" as the user mapping of the page.  Sparc64
        for example, uses this technique.

        The 'addr' parameter tells the virtual address where the
        user will ultimately have this page mapped, and the 'page'
        parameter gives a pointer to the struct page of the target.

The naming OTOH does seem dodgy, especially because, as you say, it
suggests semantics similar to clear_user() etc.

On x86, I think it is definitely a mistake for clear_huge_page() to be
calling clear_user_page*() (especially given that it is getting the
kernel map.) Will fix that.

Even for non-x86, I see just two users in common code:
  highmem.h: copy_user_highpage(), clear_user_highpage()
  fs/dax.c: copy_cow_page_dax()

All of them do a kmap_atomic() so there's really no "may" as documented
above:
        For example, a port may temporarily map 'from' and 'to' to
        kernel virtual addresses during the copy.  The virtual address

Maybe a name change is warranted, if nothing else?
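
For reference, the common-code caller pattern being described is roughly
the following (a from-memory sketch of the include/linux/highmem.h user,
not a verbatim copy):

static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
{
	/*
	 * The page is kmapped here, so clear_user_page() always gets a
	 * kernel virtual address -- hence no access_ok()/stac/clac needed.
	 */
	void *addr = kmap_atomic(page);

	clear_user_page(addr, vaddr, page);
	kunmap_atomic(addr);
}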

>> +
>>  static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
>>  				  struct page *topage)
>>  {
>> diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
>> index 580d71aca65a..3523d1150cfc 100644
>> --- a/arch/x86/include/asm/page_32.h
>> +++ b/arch/x86/include/asm/page_32.h
>> @@ -22,6 +22,12 @@ static inline void clear_page(void *page)
>>  	memset(page, 0, PAGE_SIZE);
>>  }
>>
>> +static inline void clear_pages(void *page, unsigned int nsubpages)
>> +{
>> +	for (int i = 0; i < nsubpages; i++)
>> +		clear_page(page + i * PAGE_SIZE);
>
> cond_resched() ?

Missed that. Thanks. Will fix.
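
A minimal sketch of the fix being acknowledged here -- adding a resched
point per page to the 32-bit fallback loop (illustrative only; the actual
respin may look different):

static inline void clear_pages(void *page, unsigned int nsubpages)
{
	for (int i = 0; i < nsubpages; i++) {
		clear_page(page + i * PAGE_SIZE);
		/* Long clears on !PREEMPT configs need an explicit resched point. */
		cond_resched();
	}
}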

--
ankur

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 8/9] irqentry: define irqentry_exit_allow_resched()
  2023-04-06 20:13       ` Peter Zijlstra
  2023-04-06 20:16         ` Peter Zijlstra
@ 2023-04-07  2:29         ` Ankur Arora
  2023-04-07 10:23           ` Peter Zijlstra
  1 sibling, 1 reply; 30+ messages in thread
From: Ankur Arora @ 2023-04-07  2:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
	tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
	konrad.wilk


Peter Zijlstra <peterz@infradead.org> writes:

> On Thu, Apr 06, 2023 at 09:56:07AM -0700, Ankur Arora wrote:
>
>>
>> Right, so:
>>
>> CONFIG_PREEMPTION: raw_foo()
>> CONFIG_PREEMPTION && (preempt_dynamic_mode == preempt_dynamic_full): raw_foo()
>>
>> This is NOP if CONFIG_PREEMPTION && preempt_dynamic_mode != preempt_dynamic_full.
>>
>> > 	if (!A)
>> > 		raw_foo();
>>
>> So I would call raw_foo() for the CONFIG_PREEMPTION=n case.
>>
>> > What you really care about is the CONFIG_PREEMPTION=n case, but that you
>> > don't actually handle.
>>
>> I don't see that. The CONFIG_PREEMPTION=n (or its dynamic version)
>> is being handled here.
>
> Bah, I overlooked that we have multiple definitions of the
> preempt_model_foo() things. For some reason I thought all that was only
> for PREEMPT_DYNAMIC.
>
>> > And yeah, you've got the extra resched_allowed() thing in there, but
>> > that doesn't make it any better -- this is all quite horrible.
>>
>> I don't disagree. There are quite a lot of static/dynamic config options
>> here and adding this clause didn't improve things.
>>
>> I think just going with a static call here for the allow-resched case
>> might have been cleaner. Alternately I'll see if it can be cleanly
>> folded in with the irqentry_exit_cond_resched() logic.
>
> Something like the below perhaps?
>
> ---
>  include/linux/entry-common.h |  6 ++++++
>  kernel/entry/common.c        | 23 +++++++++++++++++++++--
>  2 files changed, 27 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index d95ab85f96ba..0c365dc1f1c2 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -415,10 +415,16 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
>   * Conditional reschedule with additional sanity checks.
>   */
>  void raw_irqentry_exit_cond_resched(void);
> +void irqentry_exit_cond_resched_tif(void);
> +
>  #ifdef CONFIG_PREEMPT_DYNAMIC
>  #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
>  #define irqentry_exit_cond_resched_dynamic_enabled	raw_irqentry_exit_cond_resched
> +#ifdef TIF_RESCHED_ALLOW
> +#define irqentry_exit_cond_resched_dynamic_disabled	irqentry_exit_cond_resched_tif
> +#else
>  #define irqentry_exit_cond_resched_dynamic_disabled	NULL
> +#endif

So this is clever. Essentially this would toggle between the two kinds
for the preempt_model_preemptible()/!preempt_model_preemptible() dynamic
case. Do I have that right?

>  DECLARE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
>  #define irqentry_exit_cond_resched()	static_call(irqentry_exit_cond_resched)()
>  #elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index be61332c66b5..211d118aa672 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -390,6 +390,21 @@ void raw_irqentry_exit_cond_resched(void)
>  			preempt_schedule_irq();
>  	}
>  }
> +
> +void irqentry_exit_cond_resched_tif(void)
> +{
> +#ifdef TIF_RESCHED_ALLOW
> +	if (resched_allowed()) {
> +		/* Sanity check RCU and thread stack */
> +		rcu_irq_exit_check_preempt();
> +		if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
> +			WARN_ON_ONCE(!on_thread_stack());
> +		if (need_resched())
> +			preempt_schedule_irq();
> +	}
> +#endif
> +}
> +
>  #ifdef CONFIG_PREEMPT_DYNAMIC
>  #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
>  DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
> @@ -397,8 +412,10 @@ DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
>  DEFINE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
>  void dynamic_irqentry_exit_cond_resched(void)
>  {
> -	if (!static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
> -		return;
> +	if (!static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched)) {
> +		irqentry_exit_cond_resched_tif();
> +		return
> +	}
>  	raw_irqentry_exit_cond_resched();
>  }
>  #endif
> @@ -431,6 +448,8 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>  		instrumentation_begin();
>  		if (IS_ENABLED(CONFIG_PREEMPTION))
>  			irqentry_exit_cond_resched();
> +		else
> +			irqentry_exit_cond_resched_tif();

And, if we choose between the two resched modes at compile time then this
would work.

Might massage the names a little but this should work as is. Okay if I
use your Co-developed-by/Suggested-by on this patch?

--
ankur

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/9] x86/clear_page: parameterize clear_page*() to specify length
  2023-04-06  8:19   ` Peter Zijlstra
@ 2023-04-07  3:03     ` Ankur Arora
  0 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2023-04-07  3:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
	tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
	konrad.wilk


Peter Zijlstra <peterz@infradead.org> writes:

> On Sun, Apr 02, 2023 at 10:22:28PM -0700, Ankur Arora wrote:
>> Change clear_page*() to take a length parameter.
>> Rename to clear_pages_*().
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>
>> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
>> index ecbfb4dd3b01..6069acf6072f 100644
>> --- a/arch/x86/lib/clear_page_64.S
>> +++ b/arch/x86/lib/clear_page_64.S

> So it seems to me that clear_user_*() and clear_page_*() are now very
> similar; is there really no way to de-duplicate all that?

I'm not sure I see a good way to do that. The clear_page_*() variants
are way simpler because they don't do alignment or uaccess exception
handling.

The innermost loop in them could be unified -- but both clear_pages_rep()
and clear_pages_erms() are pretty trivial.
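
(To make "pretty trivial" concrete, a sketch of what the ERMS variant boils
down to -- not the series' exact assembly:)

/*
 * Illustration only: an ERMS-style multi-page clear is essentially a
 * single REP STOSB over npages * PAGE_SIZE bytes.
 */
static inline void clear_pages_erms_sketch(void *page, unsigned int npages)
{
	unsigned long cnt = (unsigned long)npages << PAGE_SHIFT;

	asm volatile("rep stosb"
		     : "+D" (page), "+c" (cnt)
		     : "a" (0)
		     : "memory");
}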

One place where it might be worth doing is clear_user_original() and
clear_pages_orig(). These two have also diverged some (clear_pages_orig
unrolls its loop and uses a register mov):

inner loop in clear_pages_orig:
.Lloop:
        decl    %ecx
#define PUT(x) movq %rax,x*8(%rdi)
        movq %rax,(%rdi)
        PUT(1)
        PUT(2)
        PUT(3)
        PUT(4)
        PUT(5)
        PUT(6)
        PUT(7)
        leaq    64(%rdi),%rdi
        jnz     .Lloop
        nop

inner loop in clear_user_original:
.Lqwords:
        movq $0,(%rdi)
        lea 8(%rdi),%rdi
        dec %rcx
        jnz .Lqwords

Maybe something like this?

---
 arch/x86/lib/clear_page_64.S | 35 +++++------------------------------
 1 file changed, 5 insertions(+), 30 deletions(-)

diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 6069acf6072f..795a82214d99 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -28,36 +28,6 @@ SYM_FUNC_START(clear_pages_rep)
 SYM_FUNC_END(clear_pages_rep)
 EXPORT_SYMBOL_GPL(clear_pages_rep)

-/*
- * Original page zeroing loop.
- * %rdi	- page
- * %esi	- page-length
- *
- * Clobbers: %rax, %rcx, %rflags
- */
-SYM_FUNC_START(clear_pages_orig)
-	movl   %esi, %ecx
-	xorl   %eax,%eax
-	shrl   $6,%ecx
-	.p2align 4
-.Lloop:
-	decl	%ecx
-#define PUT(x) movq %rax,x*8(%rdi)
-	movq %rax,(%rdi)
-	PUT(1)
-	PUT(2)
-	PUT(3)
-	PUT(4)
-	PUT(5)
-	PUT(6)
-	PUT(7)
-	leaq	64(%rdi),%rdi
-	jnz	.Lloop
-	nop
-	RET
-SYM_FUNC_END(clear_pages_orig)
-EXPORT_SYMBOL_GPL(clear_pages_orig)
-
 /*
  * Zero a page.
  * %rdi	- page
@@ -92,6 +62,9 @@ SYM_FUNC_START(clear_user_original)
 	jz .Lrest_bytes

 	# do the qwords first
+SYM_INNER_LABEL(clear_pages_orig, SYM_L_GLOBAL)
+	movl   %esi, %ecx
+	shr    $3,%rcx
 	.p2align 4
 .Lqwords:
 	movq $0,(%rdi)
@@ -135,6 +108,8 @@ SYM_FUNC_START(clear_user_original)
 SYM_FUNC_END(clear_user_original)
 EXPORT_SYMBOL(clear_user_original)

+EXPORT_SYMBOL_GPL(clear_pages_orig)
+
 /*
  * Alternative clear user-space when CPU feature X86_FEATURE_REP_GOOD is
  * present.

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 8/9] irqentry: define irqentry_exit_allow_resched()
  2023-04-07  2:29         ` Ankur Arora
@ 2023-04-07 10:23           ` Peter Zijlstra
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2023-04-07 10:23 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
	tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
	konrad.wilk

On Thu, Apr 06, 2023 at 07:29:59PM -0700, Ankur Arora wrote:
> Peter Zijlstra <peterz@infradead.org> writes:

> > Something like the below perhaps?
> >
> > ---
> >  include/linux/entry-common.h |  6 ++++++
> >  kernel/entry/common.c        | 23 +++++++++++++++++++++--
> >  2 files changed, 27 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> > index d95ab85f96ba..0c365dc1f1c2 100644
> > --- a/include/linux/entry-common.h
> > +++ b/include/linux/entry-common.h
> > @@ -415,10 +415,16 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
> >   * Conditional reschedule with additional sanity checks.
> >   */
> >  void raw_irqentry_exit_cond_resched(void);
> > +void irqentry_exit_cond_resched_tif(void);
> > +
> >  #ifdef CONFIG_PREEMPT_DYNAMIC
> >  #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
> >  #define irqentry_exit_cond_resched_dynamic_enabled	raw_irqentry_exit_cond_resched
> > +#ifdef TIF_RESCHED_ALLOW
> > +#define irqentry_exit_cond_resched_dynamic_disabled	irqentry_exit_cond_resched_tif
> > +#else
> >  #define irqentry_exit_cond_resched_dynamic_disabled	NULL
> > +#endif
> 
> So this is clever. Essentially this would toggle between the two kinds
> for the preempt_model_preemptible()/!preempt_model_preemptible() dynamic
> case. Do I have that right?

Exactly!

> >  DECLARE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
> >  #define irqentry_exit_cond_resched()	static_call(irqentry_exit_cond_resched)()
> >  #elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
> > diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> > index be61332c66b5..211d118aa672 100644
> > --- a/kernel/entry/common.c
> > +++ b/kernel/entry/common.c
> > @@ -390,6 +390,21 @@ void raw_irqentry_exit_cond_resched(void)
> >  			preempt_schedule_irq();
> >  	}
> >  }
> > +
> > +void irqentry_exit_cond_resched_tif(void)
> > +{
> > +#ifdef TIF_RESCHED_ALLOW
> > +	if (resched_allowed()) {
> > +		/* Sanity check RCU and thread stack */
> > +		rcu_irq_exit_check_preempt();
> > +		if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
> > +			WARN_ON_ONCE(!on_thread_stack());
> > +		if (need_resched())
> > +			preempt_schedule_irq();
> > +	}
> > +#endif
> > +}
> > +
> >  #ifdef CONFIG_PREEMPT_DYNAMIC
> >  #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
> >  DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
> > @@ -397,8 +412,10 @@ DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
> >  DEFINE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
> >  void dynamic_irqentry_exit_cond_resched(void)
> >  {
> > -	if (!static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
> > -		return;
> > +	if (!static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched)) {
> > +		irqentry_exit_cond_resched_tif();
> > +		return;
> > +	}
> >  	raw_irqentry_exit_cond_resched();
> >  }
> >  #endif
> > @@ -431,6 +448,8 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
> >  		instrumentation_begin();
> >  		if (IS_ENABLED(CONFIG_PREEMPTION))
> >  			irqentry_exit_cond_resched();
> > +		else
> > +			irqentry_exit_cond_resched_tif();
> 
> And, if we choose between the two resched modes at compile time then this
> would work.

Just so.

> Might massage the names a little but this should work as is.

Yeah, I'm not liking the naming either, perhaps your
irqentry_exit_allow_resched() would've been better, dunno, see what
works.

> Okay if I use your Co-developed-by/Suggested-by on this patch?

Yep.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/9] x86/clear_pages: add clear_pages()
  2023-04-07  0:50     ` Ankur Arora
@ 2023-04-07 10:34       ` Peter Zijlstra
  2023-04-09 13:26         ` Matthew Wilcox
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-04-07 10:34 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
	tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
	konrad.wilk

On Thu, Apr 06, 2023 at 05:50:18PM -0700, Ankur Arora wrote:
> 
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Sun, Apr 02, 2023 at 10:22:29PM -0700, Ankur Arora wrote:
> >> Add clear_pages() and define the ancillary clear_user_pages().
> >>
> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> >> ---
> >>  arch/x86/include/asm/page.h    | 6 ++++++
> >>  arch/x86/include/asm/page_32.h | 6 ++++++
> >>  arch/x86/include/asm/page_64.h | 9 +++++++--
> >>  3 files changed, 19 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> >> index d18e5c332cb9..03e3c69fc427 100644
> >> --- a/arch/x86/include/asm/page.h
> >> +++ b/arch/x86/include/asm/page.h
> >> @@ -28,6 +28,12 @@ static inline void clear_user_page(void *page, unsigned long vaddr,
> >>  	clear_page(page);
> >>  }
> >>
> >> +static inline void clear_user_pages(void *page, unsigned long vaddr,
> >> +				    struct page *pg, unsigned int nsubpages)
> >> +{
> >> +	clear_pages(page, nsubpages);
> >> +}
> >
> > This seems dodgy, clear_user* has slightly different semantics. It needs
> > the access_ok() and stac/clac thing on at the very least.
> 
> That can't be right. On x86, clear_user_page(), copy_user_page() (and
> now the multi-page versions) only write to kernel maps of user pages.
> > That's why they can skip the access_ok(), stac/clac or uaccess
> exception handling.

Bah, that namespace is a mess :/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/9] x86/clear_huge_page: multi-page clearing
  2023-04-05 19:48 ` [PATCH 0/9] x86/clear_huge_page: multi-page clearing Raghavendra K T
@ 2023-04-08 22:46   ` Ankur Arora
  2023-04-10  6:26     ` Raghavendra K T
  0 siblings, 1 reply; 30+ messages in thread
From: Ankur Arora @ 2023-04-08 22:46 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, peterz,
	rostedt, tglx, vincent.guittot, jon.grimm, bharata,
	boris.ostrovsky, konrad.wilk


Raghavendra K T <raghavendra.kt@amd.com> writes:

> On 4/3/2023 10:52 AM, Ankur Arora wrote:
>> This series introduces multi-page clearing for hugepages.

>    *Milan*     mm/clear_huge_page   x86/clear_huge_page   change
>                            (GB/s)           (GB/s)
>   pg-sz=2MB                 12.24            17.54    +43.30%
>    pg-sz=1GB                17.98            37.24   +107.11%
>
>
> Hello Ankur,
>
> Was able to test your patches. To summarize, am seeing 2x-3x perf
> improvement for 2M, 1GB base hugepage sizes.

Great. Thanks Raghavendra.

> SUT: Genoa AMD EPYC
>    Thread(s) per core:  2
>    Core(s) per socket:  128
>    Socket(s):           2
>
> NUMA:
>   NUMA node(s):          2
>   NUMA node0 CPU(s):     0-127,256-383
>   NUMA node1 CPU(s):     128-255,384-511
>
> Test:  Use mmap(MAP_HUGETLB) to demand a fault on 64GB region (NUMA node0), for
> both base-hugepage-size=2M and 1GB
>
> perf stat -r 10 -d -d  numactl -m 0 -N 0 <test>
>
> time in seconds elapsed (average of 10 runs) (lower = better)
>
> Result:
> page-size  mm/clear_huge_page   x86/clear_huge_page
> 2M              5.4567          2.6774
> 1G              2.64452         1.011281

So translating into BW (the 64GB region / elapsed time), for Genoa we have:

page-size  mm/clear_huge_page   x86/clear_huge_page
 2M              11.74              23.97
 1G              24.24              63.36

That's a pretty good bump over Milan:

>    *Milan*     mm/clear_huge_page   x86/clear_huge_page
>                            (GB/s)           (GB/s)
>   pg-sz=2MB                12.24            17.54
>   pg-sz=1GB                17.98            37.24

Btw, are these numbers with boost=1?

> Full perfstat info
>
>  page size = 2M mm/clear_huge_page
>
>  Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_2M' (10 runs):
>
>           5,434.71 msec task-clock                #    0.996 CPUs utilized            ( +-  0.55% )
>                  8      context-switches          #    1.466 /sec                     ( +-  4.66% )
>                  0      cpu-migrations            #    0.000 /sec
>             32,918      page-faults               #    6.034 K/sec                    ( +-  0.00% )
>     16,977,242,482      cycles                    #    3.112 GHz                      ( +-  0.04% )  (35.70%)
>          1,961,724      stalled-cycles-frontend   #    0.01% frontend cycles idle     ( +-  1.09% )  (35.72%)
>         35,685,674      stalled-cycles-backend    #    0.21% backend cycles idle      ( +-  3.48% )  (35.74%)
>      1,038,327,182      instructions              #    0.06  insn per cycle
>                                                   #    0.04  stalled cycles per insn  ( +-  0.38% )  (35.75%)
>        221,409,216      branches                  #   40.584 M/sec                    ( +-  0.36% )  (35.75%)
>            350,730      branch-misses             #    0.16% of all branches          ( +-  1.18% )  (35.75%)
>      2,520,888,779      L1-dcache-loads           #  462.077 M/sec                    ( +-  0.03% )  (35.73%)
>      1,094,178,209      L1-dcache-load-misses     #   43.46% of all L1-dcache accesses  ( +-  0.02% )  (35.71%)
>         67,751,730      L1-icache-loads           #   12.419 M/sec                    ( +-  0.11% )  (35.70%)
>            271,118      L1-icache-load-misses     #    0.40% of all L1-icache accesses  ( +-  2.55% )  (35.70%)
>            506,635      dTLB-loads                #   92.866 K/sec                    ( +-  3.31% )  (35.70%)
>            237,385      dTLB-load-misses          #   43.64% of all dTLB cache accesses  ( +-  7.00% )  (35.69%)
>                268      iTLB-load-misses          # 6700.00% of all iTLB cache accesses  ( +- 13.86% )  (35.70%)
>
>             5.4567 +- 0.0300 seconds time elapsed  ( +-  0.55% )
>
>  page size = 2M x86/clear_huge_page
>  Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_2M' (10 runs):
>
>           2,780.69 msec task-clock                #    1.039 CPUs utilized            ( +-  1.03% )
>                  3      context-switches          #    1.121 /sec                     ( +- 21.34% )
>                  0      cpu-migrations            #    0.000 /sec
>             32,918      page-faults               #   12.301 K/sec                    ( +-  0.00% )
>      8,143,619,771      cycles                    #    3.043 GHz                      ( +-  0.25% )  (35.62%)
>          2,024,872      stalled-cycles-frontend   #    0.02% frontend cycles idle     ( +-320.93% )  (35.66%)
>        717,198,728      stalled-cycles-backend    #    8.82% backend cycles idle      ( +-  8.26% )  (35.69%)
>        606,549,334      instructions              #    0.07  insn per cycle
>                                                   #    1.39  stalled cycles per insn  ( +-  0.23% )  (35.73%)
>        108,856,550      branches                  #   40.677 M/sec                    ( +-  0.24% )  (35.76%)
>            202,490      branch-misses             #    0.18% of all branches          ( +-  3.58% )  (35.78%)
>      2,348,818,806      L1-dcache-loads           #  877.701 M/sec                    ( +-  0.03% )  (35.78%)
>      1,081,562,988      L1-dcache-load-misses     #   46.04% of all L1-dcache accesses  ( +-  0.01% )  (35.78%)
>    <not supported>      LLC-loads
>    <not supported>      LLC-load-misses
>         43,411,167      L1-icache-loads           #   16.222 M/sec                    ( +-  0.19% )  (35.77%)
>            273,042      L1-icache-load-misses     #    0.64% of all L1-icache accesses  ( +-  4.94% )  (35.76%)
>            834,482      dTLB-loads                #  311.827 K/sec                    ( +-  9.73% )  (35.72%)
>            437,343      dTLB-load-misses          #   65.86% of all dTLB cache accesses  ( +-  8.56% )  (35.68%)
>                  0      iTLB-loads                #    0.000 /sec                     (35.65%)
>                160      iTLB-load-misses          # 1777.78% of all iTLB cache accesses  ( +- 15.82% )  (35.62%)
>
>             2.6774 +- 0.0287 seconds time elapsed  ( +-  1.07% )
>
>  page size = 1G mm/clear_huge_page
>  Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):
>
>           2,625.24 msec task-clock                #    0.993 CPUs utilized            ( +-  0.23% )
>                  4      context-switches          #    1.513 /sec                     ( +-  4.49% )
>                  1      cpu-migrations            #    0.378 /sec
>                214      page-faults               #   80.965 /sec                     ( +-  0.13% )
>      8,178,624,349      cycles                    #    3.094 GHz                      ( +-  0.23% )  (35.65%)
>          2,942,576      stalled-cycles-frontend   #    0.04% frontend cycles idle     ( +- 75.22% )  (35.69%)
>          7,117,425      stalled-cycles-backend    #    0.09% backend cycles idle      ( +-  3.79% )  (35.73%)
>        454,521,647      instructions              #    0.06  insn per cycle
>                                                   #    0.02  stalled cycles per insn  ( +-  0.10% )  (35.77%)
>        113,223,853      branches                  #   42.837 M/sec                    ( +-  0.08% )  (35.80%)
>             84,766      branch-misses             #    0.07% of all branches          ( +-  5.37% )  (35.80%)
>      2,294,528,890      L1-dcache-loads           #  868.111 M/sec                    ( +-  0.02% )  (35.81%)
>      1,075,907,551      L1-dcache-load-misses     #   46.88% of all L1-dcache accesses  ( +-  0.02% )  (35.78%)
>         26,167,323      L1-icache-loads           #    9.900 M/sec                    ( +-  0.24% )  (35.74%)
>            139,675      L1-icache-load-misses     #    0.54% of all L1-icache accesses  ( +-  0.37% )  (35.70%)
>              3,459      dTLB-loads                #    1.309 K/sec                    ( +- 12.75% )  (35.67%)
>                732      dTLB-load-misses          #   19.71% of all dTLB cache accesses  ( +- 26.61% )  (35.62%)
>                 11      iTLB-load-misses          #  192.98% of all iTLB cache accesses  ( +-238.28% )  (35.62%)
>
>            2.64452 +- 0.00600 seconds time elapsed  ( +-  0.23% )
>
>
>  page size = 1G x86/clear_huge_page
>  Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):
>
>           1,009.09 msec task-clock                #    0.998 CPUs utilized            ( +-  0.06% )
>                  2      context-switches          #    1.980 /sec                     ( +- 23.63% )
>                  1      cpu-migrations            #    0.990 /sec
>                214      page-faults               #  211.887 /sec                     ( +-  0.16% )
>      3,154,980,463      cycles                    #    3.124 GHz                      ( +-  0.06% )  (35.77%)
>            145,051      stalled-cycles-frontend   #    0.00% frontend cycles idle     ( +-  6.26% )  (35.78%)
>        730,087,143      stalled-cycles-backend    #   23.12% backend cycles idle      ( +-  9.75% )  (35.78%)
>         45,813,391      instructions              #    0.01  insn per cycle
>                                                   #   18.51  stalled cycles per insn  ( +-  1.00% )  (35.78%)
>          8,498,282      branches                  #    8.414 M/sec                    ( +-  1.54% )  (35.78%)
>             63,351      branch-misses             #    0.74% of all branches          ( +-  6.70% )  (35.69%)
>         29,135,863      L1-dcache-loads           #   28.848 M/sec                    ( +-  5.67% )  (35.68%)
>          8,537,280      L1-dcache-load-misses     #   28.66% of all L1-dcache accesses  ( +- 10.15% )  (35.68%)
>          1,040,087      L1-icache-loads           #    1.030 M/sec                    ( +-  1.60% )  (35.68%)
>              9,147      L1-icache-load-misses     #    0.85% of all L1-icache accesses  ( +-  6.50% )  (35.67%)
>              1,084      dTLB-loads                #    1.073 K/sec                    ( +- 12.05% )  (35.68%)
>                431      dTLB-load-misses          #   40.28% of all dTLB cache accesses  ( +- 43.46% )  (35.68%)
>                 16      iTLB-load-misses          #    0.00% of all iTLB cache accesses  ( +- 40.54% )  (35.68%)
>
>           1.011281 +- 0.000624 seconds time elapsed  ( +-  0.06% )
>
> Please feel free to add
>
> Tested-by: Raghavendra K T <raghavendra.kt@amd.com>

Thanks

Ankur

> Will come back with further observations on patch/performance if any
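
For anyone who wants to reproduce the demand-fault measurement described in
this exchange, a rough sketch of such a test program (an assumption about
the shape of the test, not Raghavendra's actual source; it needs hugepages
reserved beforehand, e.g. via /proc/sys/vm/nr_hugepages):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define MAP_SIZE	(64UL << 30)	/* 64GB region, as in the test above */
#define HPAGE_SIZE	(2UL << 20)	/* 2MB hugepages; 1GB needs MAP_HUGE_1GB */

int main(void)
{
	char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	/* Write one byte per hugepage: each write demand-faults (and clears) it. */
	for (unsigned long off = 0; off < MAP_SIZE; off += HPAGE_SIZE)
		p[off] = 1;

	munmap(p, MAP_SIZE);
	return EXIT_SUCCESS;
}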

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/9] x86/clear_pages: add clear_pages()
  2023-04-07 10:34       ` Peter Zijlstra
@ 2023-04-09 13:26         ` Matthew Wilcox
  0 siblings, 0 replies; 30+ messages in thread
From: Matthew Wilcox @ 2023-04-09 13:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, luto,
	bp, dave.hansen, hpa, mingo, juri.lelli, mgorman, rostedt, tglx,
	vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
	konrad.wilk

On Fri, Apr 07, 2023 at 12:34:44PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 06, 2023 at 05:50:18PM -0700, Ankur Arora wrote:
> > 
> > Peter Zijlstra <peterz@infradead.org> writes:
> > 
> > > On Sun, Apr 02, 2023 at 10:22:29PM -0700, Ankur Arora wrote:
> > >> Add clear_pages() and define the ancillary clear_user_pages().
> > >>
> > >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> > >> ---
> > >>  arch/x86/include/asm/page.h    | 6 ++++++
> > >>  arch/x86/include/asm/page_32.h | 6 ++++++
> > >>  arch/x86/include/asm/page_64.h | 9 +++++++--
> > >>  3 files changed, 19 insertions(+), 2 deletions(-)
> > >>
> > >> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> > >> index d18e5c332cb9..03e3c69fc427 100644
> > >> --- a/arch/x86/include/asm/page.h
> > >> +++ b/arch/x86/include/asm/page.h
> > >> @@ -28,6 +28,12 @@ static inline void clear_user_page(void *page, unsigned long vaddr,
> > >>  	clear_page(page);
> > >>  }
> > >>
> > >> +static inline void clear_user_pages(void *page, unsigned long vaddr,
> > >> +				    struct page *pg, unsigned int nsubpages)
> > >> +{
> > >> +	clear_pages(page, nsubpages);
> > >> +}
> > >
> > > This seems dodgy, clear_user* has slightly different semantics. It needs
> > > the access_ok() and stac/clac thing on at the very least.
> > 
> > That can't be right. On x86, clear_user_page(), copy_user_page() (and
> > now the multi-page versions) only write to kernel maps of user pages.
> > That's why they can skip the access_ok(), stac/clac or uaccess
> > exception handling.
> 
> Bah, that namespace is a mess :/

What (I think) it's suppsoed to be is that clear_page() works on kernel
pages that are never seen by userspace while clear_user_page() works
on kernel mappings of pages the user can definitely see.  This makes
no difference to x86, but some architectures can skip a lot of cache
flushing.
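
To illustrate the distinction on such an architecture, a hypothetical
clear_user_page() (not taken from any particular port) might look roughly
like:

/*
 * Hypothetical port with a virtually-indexed, aliasing D-cache: the user
 * mapping must observe the zeroing done through the kernel mapping.
 */
static inline void clear_user_page(void *to, unsigned long vaddr, struct page *page)
{
	clear_page(to);			/* zero via the kernel mapping */
	flush_dcache_page(page);	/* make it visible to the user alias */
}

On x86 both variants are plain memory writes, which is why its
clear_user_page() can simply call clear_page().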

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/9] x86/clear_huge_page: multi-page clearing
  2023-04-08 22:46   ` Ankur Arora
@ 2023-04-10  6:26     ` Raghavendra K T
  0 siblings, 0 replies; 30+ messages in thread
From: Raghavendra K T @ 2023-04-10  6:26 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
	dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, peterz,
	rostedt, tglx, vincent.guittot, jon.grimm, bharata,
	boris.ostrovsky, konrad.wilk

On 4/9/2023 4:16 AM, Ankur Arora wrote:
> 
> Raghavendra K T <raghavendra.kt@amd.com> writes:
> 
>> On 4/3/2023 10:52 AM, Ankur Arora wrote:
>>> This series introduces multi-page clearing for hugepages.
> 
>>     *Milan*     mm/clear_huge_page   x86/clear_huge_page   change
>>                             (GB/s)           (GB/s)
>>    pg-sz=2MB                 12.24            17.54    +43.30%
>>     pg-sz=1GB                17.98            37.24   +107.11%
>>
>>
>> Hello Ankur,
>>
>> Was able to test your patches. To summarize, am seeing 2x-3x perf
>> improvement for 2M, 1GB base hugepage sizes.
> 
> Great. Thanks Raghavendra.
> 
>> SUT: Genoa AMD EPYC
>>     Thread(s) per core:  2
>>     Core(s) per socket:  128
>>     Socket(s):           2
>>
>> NUMA:
>>    NUMA node(s):          2
>>    NUMA node0 CPU(s):     0-127,256-383
>>    NUMA node1 CPU(s):     128-255,384-511
>>
>> Test:  Use mmap(MAP_HUGETLB) to demand a fault on 64GB region (NUMA node0), for
>> both base-hugepage-size=2M and 1GB
>>
>> perf stat -r 10 -d -d  numactl -m 0 -N 0 <test>
>>
>> time in seconds elapsed (average of 10 runs) (lower = better)
>>
>> Result:
>> page-size  mm/clear_huge_page   x86/clear_huge_page
>> 2M              5.4567          2.6774
>> 1G              2.64452         1.011281
> 
> So translating into BW (the 64GB region / elapsed time), for Genoa we have:
> 
> page-size  mm/clear_huge_page   x86/clear_huge_page
>   2M              11.74              23.97
>   1G              24.24              63.36
> 
> That's a pretty good bump over Milan:
> 
>>     *Milan*     mm/clear_huge_page   x86/clear_huge_page
>>                             (GB/s)           (GB/s)
>>    pg-sz=2MB                12.24            17.54
>>    pg-sz=1GB                17.98            37.24
> 
> Btw, are these numbers with boost=1?
> 

Yes, it is. Also a note about the config: I had not enabled the
GCOV/LOCKSTAT related options because I faced some issues.


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2023-04-10  6:26 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-03  5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
2023-04-03  5:22 ` [PATCH 1/9] huge_pages: get rid of process_huge_page() Ankur Arora
2023-04-03  5:22 ` [PATCH 2/9] huge_page: get rid of {clear,copy}_subpage() Ankur Arora
2023-04-03  5:22 ` [PATCH 3/9] huge_page: allow arch override for clear/copy_huge_page() Ankur Arora
2023-04-03  5:22 ` [PATCH 4/9] x86/clear_page: parameterize clear_page*() to specify length Ankur Arora
2023-04-06  8:19   ` Peter Zijlstra
2023-04-07  3:03     ` Ankur Arora
2023-04-03  5:22 ` [PATCH 5/9] x86/clear_pages: add clear_pages() Ankur Arora
2023-04-06  8:23   ` Peter Zijlstra
2023-04-07  0:50     ` Ankur Arora
2023-04-07 10:34       ` Peter Zijlstra
2023-04-09 13:26         ` Matthew Wilcox
2023-04-03  5:22 ` [PATCH 6/9] mm/clear_huge_page: use multi-page clearing Ankur Arora
2023-04-03  5:22 ` [PATCH 7/9] sched: define TIF_ALLOW_RESCHED Ankur Arora
2023-04-05 20:07   ` Peter Zijlstra
2023-04-03  5:22 ` [PATCH 8/9] irqentry: define irqentry_exit_allow_resched() Ankur Arora
2023-04-04  9:38   ` Thomas Gleixner
2023-04-05  5:29     ` Ankur Arora
2023-04-05 20:22   ` Peter Zijlstra
2023-04-06 16:56     ` Ankur Arora
2023-04-06 20:13       ` Peter Zijlstra
2023-04-06 20:16         ` Peter Zijlstra
2023-04-07  2:29         ` Ankur Arora
2023-04-07 10:23           ` Peter Zijlstra
2023-04-03  5:22 ` [PATCH 9/9] x86/clear_huge_page: make clear_contig_region() preemptible Ankur Arora
2023-04-05 20:27   ` Peter Zijlstra
2023-04-06 17:00     ` Ankur Arora
2023-04-05 19:48 ` [PATCH 0/9] x86/clear_huge_page: multi-page clearing Raghavendra K T
2023-04-08 22:46   ` Ankur Arora
2023-04-10  6:26     ` Raghavendra K T
