* [PATCH 0/9] x86/clear_huge_page: multi-page clearing
@ 2023-04-03 5:22 Ankur Arora
2023-04-03 5:22 ` [PATCH 1/9] huge_pages: get rid of process_huge_page() Ankur Arora
` (9 more replies)
0 siblings, 10 replies; 30+ messages in thread
From: Ankur Arora @ 2023-04-03 5:22 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86
Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
jon.grimm, bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora
This series introduces multi-page clearing for hugepages.
This is a follow up of some of the ideas discussed at:
https://lore.kernel.org/lkml/CAHk-=wj9En-BC4t7J9xFZOws5ShwaR9yor7FxHZr8CTVyEP_+Q@mail.gmail.com/
On x86, page clearing is typically done via string instructions. These,
unlike a MOV loop, allow us to explicitly advertise the region-size to
the processor, which could serve as a hint to current (and/or
future) uarchs to elide cacheline allocation.
Among current generation processors, Milan (and presumably other Zen
variants) uses this hint to elide cacheline allocation (for
region-size > LLC-size.)
An additional reason for doing this is that string instructions are
typically microcoded; clearing in bigger chunks than the current
page-at-a-time logic amortizes some of that cost.
All uarchs tested (Milan, Icelakex, Skylakex) showed improved performance.
There are, however, some problems:
1. extended zeroing periods mean there's an increased latency due to
the now missing preemption points.
That's handled in patches 7, 8, 9:
"sched: define TIF_ALLOW_RESCHED"
"irqentry: define irqentry_exit_allow_resched()"
"x86/clear_huge_page: make clear_contig_region() preemptible"
by the context marking itself reschedulable, and rescheduling in
irqexit context if needed (for PREEMPTION_NONE/_VOLUNTARY.)
2. the current page-at-a-time clearing logic does left-right narrowing
towards the faulting page, which maintains cache locality for
workloads with a sequential access pattern. Clearing in large chunks
loses that.
Some (but not all) of that could be ameliorated by something like
this patch:
https://lore.kernel.org/lkml/20220606203725.1313715-1-ankur.a.arora@oracle.com/
But, before doing that, I'd like some comments on whether it is
worth doing for this specific use case.
Rest of the series:
Patches 1, 2, 3:
"huge_pages: get rid of process_huge_page()"
"huge_page: get rid of {clear,copy}_subpage()"
"huge_page: allow arch override for clear/copy_huge_page()"
are mechanical and they simplify some of the current clear_huge_page()
logic.
Patches 4, 5:
"x86/clear_page: parameterize clear_page*() to specify length"
"x86/clear_pages: add clear_pages()"
add clear_pages() and helpers.
Patch 6: "mm/clear_huge_page: use multi-page clearing" adds the
chunked x86 clear_huge_page() implementation.
Performance
==
Demand fault performance gets a decent boost:
*Icelakex* mm/clear_huge_page x86/clear_huge_page change
(GB/s) (GB/s)
pg-sz=2MB 8.76 11.82 +34.93%
pg-sz=1GB 8.99 12.18 +35.48%
*Milan* mm/clear_huge_page x86/clear_huge_page change
(GB/s) (GB/s)
pg-sz=2MB 12.24 17.54 +43.30%
pg-sz=1GB 17.98 37.24 +107.11%
vm-scalability/case-anon-w-seq-hugetlb gains in stime, but performs
worse when user space touches those pages:
*Icelakex* mm/clear_huge_page x86/clear_huge_page change
(mem=4GB/task, tasks=128)
stime 293.02 +- .49% 239.39 +- .83% -18.30%
utime 440.11 +- .28% 508.74 +- .60% +15.59%
wall-clock 5.96 +- .33% 6.27 +-2.23% + 5.20%
*Milan* mm/clear_huge_page x86/clear_huge_page change
(mem=1GB/task, tasks=512)
stime 490.95 +- 3.55% 466.90 +- 4.79% - 4.89%
utime 276.43 +- 2.85% 311.97 +- 5.15% +12.85%
wall-clock 3.74 +- 6.41% 3.58 +- 7.82% - 4.27%
Also at:
github.com/terminus/linux clear-pages.v1
Comments appreciated!
Ankur Arora (9):
huge_pages: get rid of process_huge_page()
huge_page: get rid of {clear,copy}_subpage()
huge_page: allow arch override for clear/copy_huge_page()
x86/clear_page: parameterize clear_page*() to specify length
x86/clear_pages: add clear_pages()
mm/clear_huge_page: use multi-page clearing
sched: define TIF_ALLOW_RESCHED
irqentry: define irqentry_exit_allow_resched()
x86/clear_huge_page: make clear_contig_region() preemptible
arch/x86/include/asm/page.h | 6 +
arch/x86/include/asm/page_32.h | 6 +
arch/x86/include/asm/page_64.h | 25 +++--
arch/x86/include/asm/thread_info.h | 2 +
arch/x86/lib/clear_page_64.S | 45 ++++++--
arch/x86/mm/hugetlbpage.c | 59 ++++++++++
include/linux/sched.h | 29 +++++
kernel/entry/common.c | 8 ++
kernel/sched/core.c | 36 +++---
mm/memory.c | 174 +++++++++++++++--------------
10 files changed, 270 insertions(+), 120 deletions(-)
--
2.31.1
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH 1/9] huge_pages: get rid of process_huge_page()
2023-04-03 5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
@ 2023-04-03 5:22 ` Ankur Arora
2023-04-03 5:22 ` [PATCH 2/9] huge_page: get rid of {clear,copy}_subpage() Ankur Arora
` (8 subsequent siblings)
9 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2023-04-03 5:22 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86
Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
jon.grimm, bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora
process_huge_page() (and the subsequent call to process_subpage())
translates from base-addr[index] to struct page * when calling
clear/copy_subpage(), which then translates back to the parameters
needed by clear/copy_user_highpage().
There's no runtime cost in doing this, but it's unnecessary complexity
for something that only has two users.
Accordingly, fold process_huge_page() back into its callers.
Link: https://lore.kernel.org/lkml/20220606202109.1306034-1-ankur.a.arora@oracle.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
mm/memory.c | 126 ++++++++++++++++++++++++++++++----------------------
1 file changed, 73 insertions(+), 53 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index f456f3b5049c..d54bc27a35ca 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5629,63 +5629,13 @@ EXPORT_SYMBOL(__might_fault);
#endif
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
-/*
- * Process all subpages of the specified huge page with the specified
- * operation. The target subpage will be processed last to keep its
- * cache lines hot.
- */
-static inline void process_huge_page(
- unsigned long addr_hint, unsigned int pages_per_huge_page,
- void (*process_subpage)(unsigned long addr, int idx, void *arg),
- void *arg)
-{
- int i, n, base, l;
- unsigned long addr = addr_hint &
- ~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
-
- /* Process target subpage last to keep its cache lines hot */
- might_sleep();
- n = (addr_hint - addr) / PAGE_SIZE;
- if (2 * n <= pages_per_huge_page) {
- /* If target subpage in first half of huge page */
- base = 0;
- l = n;
- /* Process subpages at the end of huge page */
- for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
- cond_resched();
- process_subpage(addr + i * PAGE_SIZE, i, arg);
- }
- } else {
- /* If target subpage in second half of huge page */
- base = pages_per_huge_page - 2 * (pages_per_huge_page - n);
- l = pages_per_huge_page - n;
- /* Process subpages at the begin of huge page */
- for (i = 0; i < base; i++) {
- cond_resched();
- process_subpage(addr + i * PAGE_SIZE, i, arg);
- }
- }
- /*
- * Process remaining subpages in left-right-left-right pattern
- * towards the target subpage
- */
- for (i = 0; i < l; i++) {
- int left_idx = base + i;
- int right_idx = base + 2 * l - 1 - i;
-
- cond_resched();
- process_subpage(addr + left_idx * PAGE_SIZE, left_idx, arg);
- cond_resched();
- process_subpage(addr + right_idx * PAGE_SIZE, right_idx, arg);
- }
-}
static void clear_gigantic_page(struct page *page,
unsigned long addr,
unsigned int pages_per_huge_page)
{
int i;
- struct page *p;
+ struct page *p = page;
might_sleep();
for (i = 0; i < pages_per_huge_page; i++) {
@@ -5707,13 +5657,48 @@ void clear_huge_page(struct page *page,
{
unsigned long addr = addr_hint &
~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
+ int i, n, base, l;
if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
clear_gigantic_page(page, addr, pages_per_huge_page);
return;
}
- process_huge_page(addr_hint, pages_per_huge_page, clear_subpage, page);
+ /* Process target subpage last to keep its cache lines hot */
+ might_sleep();
+ n = (addr_hint - addr) / PAGE_SIZE;
+ if (2 * n <= pages_per_huge_page) {
+ /* If target subpage in first half of huge page */
+ base = 0;
+ l = n;
+ /* Process subpages at the end of huge page */
+ for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
+ cond_resched();
+ clear_subpage(addr + i * PAGE_SIZE, i, (void *)page);
+ }
+ } else {
+ /* If target subpage in second half of huge page */
+ base = pages_per_huge_page - 2 * (pages_per_huge_page - n);
+ l = pages_per_huge_page - n;
+ /* Process subpages at the begin of huge page */
+ for (i = 0; i < base; i++) {
+ cond_resched();
+ clear_subpage(addr + i * PAGE_SIZE, i, (void *)page);
+ }
+ }
+ /*
+ * Process remaining subpages in left-right-left-right pattern
+ * towards the target subpage
+ */
+ for (i = 0; i < l; i++) {
+ int left_idx = base + i;
+ int right_idx = base + 2 * l - 1 - i;
+
+ cond_resched();
+ clear_subpage(addr + left_idx * PAGE_SIZE, left_idx, (void *)page);
+ cond_resched();
+ clear_subpage(addr + right_idx * PAGE_SIZE, right_idx, (void *)page);
+ }
}
static void copy_user_gigantic_page(struct page *dst, struct page *src,
@@ -5759,6 +5744,7 @@ void copy_user_huge_page(struct page *dst, struct page *src,
.src = src,
.vma = vma,
};
+ int i, n, base, l;
if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
copy_user_gigantic_page(dst, src, addr, vma,
@@ -5766,7 +5752,41 @@ void copy_user_huge_page(struct page *dst, struct page *src,
return;
}
- process_huge_page(addr_hint, pages_per_huge_page, copy_subpage, &arg);
+ /* Process target subpage last to keep its cache lines hot */
+ might_sleep();
+ n = (addr_hint - addr) / PAGE_SIZE;
+ if (2 * n <= pages_per_huge_page) {
+ /* If target subpage in first half of huge page */
+ base = 0;
+ l = n;
+ /* Process subpages at the end of huge page */
+ for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
+ cond_resched();
+ copy_subpage(addr + i * PAGE_SIZE, i, &arg);
+ }
+ } else {
+ /* If target subpage in second half of huge page */
+ base = pages_per_huge_page - 2 * (pages_per_huge_page - n);
+ l = pages_per_huge_page - n;
+ /* Process subpages at the begin of huge page */
+ for (i = 0; i < base; i++) {
+ cond_resched();
+ copy_subpage(addr + i * PAGE_SIZE, i, &arg);
+ }
+ }
+ /*
+ * Process remaining subpages in left-right-left-right pattern
+ * towards the target subpage
+ */
+ for (i = 0; i < l; i++) {
+ int left_idx = base + i;
+ int right_idx = base + 2 * l - 1 - i;
+
+ cond_resched();
+ copy_subpage(addr + left_idx * PAGE_SIZE, left_idx, &arg);
+ cond_resched();
+ copy_subpage(addr + right_idx * PAGE_SIZE, right_idx, &arg);
+ }
}
long copy_huge_page_from_user(struct page *dst_page,
--
2.31.1
* [PATCH 2/9] huge_page: get rid of {clear,copy}_subpage()
2023-04-03 5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
2023-04-03 5:22 ` [PATCH 1/9] huge_pages: get rid of process_huge_page() Ankur Arora
@ 2023-04-03 5:22 ` Ankur Arora
2023-04-03 5:22 ` [PATCH 3/9] huge_page: allow arch override for clear/copy_huge_page() Ankur Arora
` (7 subsequent siblings)
9 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2023-04-03 5:22 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86
Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
jon.grimm, bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora
clear/copy_subpage():
static void clear_subpage(unsigned long addr, int idx, void *arg)
static void copy_subpage(unsigned long addr, int idx, void *arg)
take as parameters: an index, the post-indexing virtual address,
and a base struct page * which is then resolved using the index.
Instead, just use clear/copy_user_highpage() directly.
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
mm/memory.c | 46 ++++++++++++----------------------------------
1 file changed, 12 insertions(+), 34 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index d54bc27a35ca..6da97e6c7d21 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5645,13 +5645,6 @@ static void clear_gigantic_page(struct page *page,
}
}
-static void clear_subpage(unsigned long addr, int idx, void *arg)
-{
- struct page *page = arg;
-
- clear_user_highpage(page + idx, addr);
-}
-
void clear_huge_page(struct page *page,
unsigned long addr_hint, unsigned int pages_per_huge_page)
{
@@ -5674,7 +5667,7 @@ void clear_huge_page(struct page *page,
/* Process subpages at the end of huge page */
for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
cond_resched();
- clear_subpage(addr + i * PAGE_SIZE, i, (void *)page);
+ clear_user_highpage(page + i, addr + i * PAGE_SIZE);
}
} else {
/* If target subpage in second half of huge page */
@@ -5683,7 +5676,7 @@ void clear_huge_page(struct page *page,
/* Process subpages at the begin of huge page */
for (i = 0; i < base; i++) {
cond_resched();
- clear_subpage(addr + i * PAGE_SIZE, i, (void *)page);
+ clear_user_highpage(page + i, addr + i * PAGE_SIZE);
}
}
/*
@@ -5695,9 +5688,9 @@ void clear_huge_page(struct page *page,
int right_idx = base + 2 * l - 1 - i;
cond_resched();
- clear_subpage(addr + left_idx * PAGE_SIZE, left_idx, (void *)page);
+ clear_user_highpage(page + left_idx, addr + left_idx * PAGE_SIZE);
cond_resched();
- clear_subpage(addr + right_idx * PAGE_SIZE, right_idx, (void *)page);
+ clear_user_highpage(page + right_idx, addr + right_idx * PAGE_SIZE);
}
}
@@ -5719,31 +5712,12 @@ static void copy_user_gigantic_page(struct page *dst, struct page *src,
}
}
-struct copy_subpage_arg {
- struct page *dst;
- struct page *src;
- struct vm_area_struct *vma;
-};
-
-static void copy_subpage(unsigned long addr, int idx, void *arg)
-{
- struct copy_subpage_arg *copy_arg = arg;
-
- copy_user_highpage(copy_arg->dst + idx, copy_arg->src + idx,
- addr, copy_arg->vma);
-}
-
void copy_user_huge_page(struct page *dst, struct page *src,
unsigned long addr_hint, struct vm_area_struct *vma,
unsigned int pages_per_huge_page)
{
unsigned long addr = addr_hint &
~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
- struct copy_subpage_arg arg = {
- .dst = dst,
- .src = src,
- .vma = vma,
- };
int i, n, base, l;
if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
@@ -5762,7 +5736,8 @@ void copy_user_huge_page(struct page *dst, struct page *src,
/* Process subpages at the end of huge page */
for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
cond_resched();
- copy_subpage(addr + i * PAGE_SIZE, i, &arg);
+ copy_user_highpage(dst + i, src + i,
+ addr + i*PAGE_SIZE, vma);
}
} else {
/* If target subpage in second half of huge page */
@@ -5771,7 +5746,8 @@ void copy_user_huge_page(struct page *dst, struct page *src,
/* Process subpages at the begin of huge page */
for (i = 0; i < base; i++) {
cond_resched();
- copy_subpage(addr + i * PAGE_SIZE, i, &arg);
+ copy_user_highpage(dst + i, src + i,
+ addr + i*PAGE_SIZE, vma);
}
}
/*
@@ -5783,9 +5759,11 @@ void copy_user_huge_page(struct page *dst, struct page *src,
int right_idx = base + 2 * l - 1 - i;
cond_resched();
- copy_subpage(addr + left_idx * PAGE_SIZE, left_idx, &arg);
+ copy_user_highpage(dst + left_idx, src + left_idx,
+ addr + left_idx*PAGE_SIZE, vma);
cond_resched();
- copy_subpage(addr + right_idx * PAGE_SIZE, right_idx, &arg);
+ copy_user_highpage(dst + right_idx, src + right_idx,
+ addr + right_idx*PAGE_SIZE, vma);
}
}
--
2.31.1
* [PATCH 3/9] huge_page: allow arch override for clear/copy_huge_page()
2023-04-03 5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
2023-04-03 5:22 ` [PATCH 1/9] huge_pages: get rid of process_huge_page() Ankur Arora
2023-04-03 5:22 ` [PATCH 2/9] huge_page: get rid of {clear,copy}_subpage() Ankur Arora
@ 2023-04-03 5:22 ` Ankur Arora
2023-04-03 5:22 ` [PATCH 4/9] x86/clear_page: parameterize clear_page*() to specify length Ankur Arora
` (6 subsequent siblings)
9 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2023-04-03 5:22 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86
Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
jon.grimm, bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora
clear/copy_huge_page() are constrained to operate a page at a time
because they also handle the CONFIG_HIGHMEM case.
Mark both __weak to allow for arch-specific optimizations.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
mm/memory.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 6da97e6c7d21..9a6bce384616 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5645,8 +5645,12 @@ static void clear_gigantic_page(struct page *page,
}
}
-void clear_huge_page(struct page *page,
- unsigned long addr_hint, unsigned int pages_per_huge_page)
+/*
+ * clear_huge_page(): does page-at-a-time clearing to handle the
+ * CONFIG_HIGHMEM case.
+ */
+__weak void clear_huge_page(struct page *page,
+ unsigned long addr_hint, unsigned int pages_per_huge_page)
{
unsigned long addr = addr_hint &
~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
@@ -5712,9 +5716,13 @@ static void copy_user_gigantic_page(struct page *dst, struct page *src,
}
}
-void copy_user_huge_page(struct page *dst, struct page *src,
- unsigned long addr_hint, struct vm_area_struct *vma,
- unsigned int pages_per_huge_page)
+/*
+ * copy_user_huge_page(): does page-at-a-time copying to handle
+ * the CONFIG_HIGHMEM case.
+ */
+__weak void copy_user_huge_page(struct page *dst, struct page *src,
+ unsigned long addr_hint, struct vm_area_struct *vma,
+ unsigned int pages_per_huge_page)
{
unsigned long addr = addr_hint &
~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
--
2.31.1
* [PATCH 4/9] x86/clear_page: parameterize clear_page*() to specify length
2023-04-03 5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
` (2 preceding siblings ...)
2023-04-03 5:22 ` [PATCH 3/9] huge_page: allow arch override for clear/copy_huge_page() Ankur Arora
@ 2023-04-03 5:22 ` Ankur Arora
2023-04-06 8:19 ` Peter Zijlstra
2023-04-03 5:22 ` [PATCH 5/9] x86/clear_pages: add clear_pages() Ankur Arora
` (5 subsequent siblings)
9 siblings, 1 reply; 30+ messages in thread
From: Ankur Arora @ 2023-04-03 5:22 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86
Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
jon.grimm, bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora
Change clear_page*() to take a length parameter.
Rename to clear_pages_*().
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
clear_page() is now defined in terms of clear_pages(). This means
that all clear_page() callsites -- which are more numerous -- would
now use an additional register.
Opinions on whether that's worth optimizing?
---
arch/x86/include/asm/page_64.h | 18 ++++++++------
arch/x86/lib/clear_page_64.S | 45 +++++++++++++++++++++++++---------
2 files changed, 43 insertions(+), 20 deletions(-)
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index cc6b8e087192..7ca3bd2448c1 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -39,22 +39,24 @@ extern unsigned long __phys_addr_symbol(unsigned long);
#define __phys_reloc_hide(x) (x)
-void clear_page_orig(void *page);
-void clear_page_rep(void *page);
-void clear_page_erms(void *page);
+void clear_pages_orig(void *page, unsigned int length);
+void clear_pages_rep(void *page, unsigned int length);
+void clear_pages_erms(void *page, unsigned int length);
static inline void clear_page(void *page)
{
+ unsigned long length = PAGE_SIZE;
/*
* Clean up KMSAN metadata for the page being cleared. The assembly call
* below clobbers @page, so we perform unpoisoning before it.
*/
- kmsan_unpoison_memory(page, PAGE_SIZE);
- alternative_call_2(clear_page_orig,
- clear_page_rep, X86_FEATURE_REP_GOOD,
- clear_page_erms, X86_FEATURE_ERMS,
+ kmsan_unpoison_memory(page, length);
+
+ alternative_call_2(clear_pages_orig,
+ clear_pages_rep, X86_FEATURE_REP_GOOD,
+ clear_pages_erms, X86_FEATURE_ERMS,
"=D" (page),
- "0" (page)
+ "0" (page), "S" (length)
: "cc", "memory", "rax", "rcx");
}
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index ecbfb4dd3b01..6069acf6072f 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -2,6 +2,8 @@
#include <linux/linkage.h>
#include <asm/asm.h>
#include <asm/export.h>
+#include <asm/cpufeatures.h>
+#include <asm/alternative.h>
/*
* Most CPUs support enhanced REP MOVSB/STOSB instructions. It is
@@ -13,18 +15,30 @@
/*
* Zero a page.
* %rdi - page
+ * %esi - page-length
+ *
+ * Clobbers: %rax, %rcx
*/
-SYM_FUNC_START(clear_page_rep)
- movl $4096/8,%ecx
+SYM_FUNC_START(clear_pages_rep)
+ movl %esi, %ecx
xorl %eax,%eax
+ shrl $3,%ecx
rep stosq
RET
-SYM_FUNC_END(clear_page_rep)
-EXPORT_SYMBOL_GPL(clear_page_rep)
+SYM_FUNC_END(clear_pages_rep)
+EXPORT_SYMBOL_GPL(clear_pages_rep)
-SYM_FUNC_START(clear_page_orig)
+/*
+ * Original page zeroing loop.
+ * %rdi - page
+ * %esi - page-length
+ *
+ * Clobbers: %rax, %rcx, %rflags
+ */
+SYM_FUNC_START(clear_pages_orig)
+ movl %esi, %ecx
xorl %eax,%eax
- movl $4096/64,%ecx
+ shrl $6,%ecx
.p2align 4
.Lloop:
decl %ecx
@@ -41,16 +55,23 @@ SYM_FUNC_START(clear_page_orig)
jnz .Lloop
nop
RET
-SYM_FUNC_END(clear_page_orig)
-EXPORT_SYMBOL_GPL(clear_page_orig)
+SYM_FUNC_END(clear_pages_orig)
+EXPORT_SYMBOL_GPL(clear_pages_orig)
-SYM_FUNC_START(clear_page_erms)
- movl $4096,%ecx
+/*
+ * Zero a page.
+ * %rdi - page
+ * %esi - page-length
+ *
+ * Clobbers: %rax, %rcx
+ */
+SYM_FUNC_START(clear_pages_erms)
+ movl %esi, %ecx
xorl %eax,%eax
rep stosb
RET
-SYM_FUNC_END(clear_page_erms)
-EXPORT_SYMBOL_GPL(clear_page_erms)
+SYM_FUNC_END(clear_pages_erms)
+EXPORT_SYMBOL_GPL(clear_pages_erms)
/*
* Default clear user-space.
--
2.31.1
* [PATCH 5/9] x86/clear_pages: add clear_pages()
2023-04-03 5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
` (3 preceding siblings ...)
2023-04-03 5:22 ` [PATCH 4/9] x86/clear_page: parameterize clear_page*() to specify length Ankur Arora
@ 2023-04-03 5:22 ` Ankur Arora
2023-04-06 8:23 ` Peter Zijlstra
2023-04-03 5:22 ` [PATCH 6/9] mm/clear_huge_page: use multi-page clearing Ankur Arora
` (4 subsequent siblings)
9 siblings, 1 reply; 30+ messages in thread
From: Ankur Arora @ 2023-04-03 5:22 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86
Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
jon.grimm, bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora
Add clear_pages() and define the ancillary clear_user_pages().
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
arch/x86/include/asm/page.h | 6 ++++++
arch/x86/include/asm/page_32.h | 6 ++++++
arch/x86/include/asm/page_64.h | 9 +++++++--
3 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index d18e5c332cb9..03e3c69fc427 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -28,6 +28,12 @@ static inline void clear_user_page(void *page, unsigned long vaddr,
clear_page(page);
}
+static inline void clear_user_pages(void *page, unsigned long vaddr,
+ struct page *pg, unsigned int nsubpages)
+{
+ clear_pages(page, nsubpages);
+}
+
static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
struct page *topage)
{
diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
index 580d71aca65a..3523d1150cfc 100644
--- a/arch/x86/include/asm/page_32.h
+++ b/arch/x86/include/asm/page_32.h
@@ -22,6 +22,12 @@ static inline void clear_page(void *page)
memset(page, 0, PAGE_SIZE);
}
+static inline void clear_pages(void *page, unsigned int nsubpages)
+{
+ for (int i = 0; i < nsubpages; i++)
+ clear_page(page + i * PAGE_SIZE);
+}
+
static inline void copy_page(void *to, void *from)
{
memcpy(to, from, PAGE_SIZE);
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 7ca3bd2448c1..42f6c45206c1 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -43,9 +43,9 @@ void clear_pages_orig(void *page, unsigned int length);
void clear_pages_rep(void *page, unsigned int length);
void clear_pages_erms(void *page, unsigned int length);
-static inline void clear_page(void *page)
+static inline void clear_pages(void *page, unsigned int nsubpages)
{
- unsigned long length = PAGE_SIZE;
+ unsigned int length = nsubpages * PAGE_SIZE;
/*
* Clean up KMSAN metadata for the page being cleared. The assembly call
* below clobbers @page, so we perform unpoisoning before it.
@@ -60,6 +60,11 @@ static inline void clear_page(void *page)
: "cc", "memory", "rax", "rcx");
}
+static inline void clear_page(void *page)
+{
+ clear_pages(page, 1);
+}
+
void copy_page(void *to, void *from);
#ifdef CONFIG_X86_5LEVEL
--
2.31.1
* [PATCH 6/9] mm/clear_huge_page: use multi-page clearing
2023-04-03 5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
` (4 preceding siblings ...)
2023-04-03 5:22 ` [PATCH 5/9] x86/clear_pages: add clear_pages() Ankur Arora
@ 2023-04-03 5:22 ` Ankur Arora
2023-04-03 5:22 ` [PATCH 7/9] sched: define TIF_ALLOW_RESCHED Ankur Arora
` (3 subsequent siblings)
9 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2023-04-03 5:22 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86
Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
jon.grimm, bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora
clear_pages_rep() and clear_pages_erms() use string instructions
internally. These, unlike a MOV loop, allow us to explicitly advertise
the region-size to the processor. Thus, clearing in multi-page chunks
means we can specify the real region sizes (or close to them), which
is good for two reasons:
- the region-size can serve as a hint to current (some AMD Zen models)
and possibly future uarchs, which can use it to avoid polluting one
or more levels of the dcache.
- string instructions are typically microcoded, and are cheaper when
amortized across larger regions. We also execute fewer loop
iterations (e.g. a cond_resched() check for each page, though those
instructions are likely free.)
clear_huge_page() now clears in three sections: the local neighbourhood
of the faulting address (the faulting page and four surrounding pages),
and its left and right regions.
The local neighbourhood is cleared last to keep its cachelines hot.
Performance
==
Use mmap(MAP_HUGETLB) to demand fault a 64GB region (on the local
NUMA node):
Icelakex (Platinum 8358, ucode=0xd0002c1, no_turbo=1):
mm/clear_huge_page x86/clear_huge_page change
(GB/s) (GB/s)
pg-sz=2MB 8.76 11.82 +34.93%
pg-sz=1GB 8.99 12.18 +35.48%
On Icelakex we continue to allocate cachelines:
pg-sz=2MB:
- 701,951,397 L1-dcache-loads # 47.985 M/sec ( +- 19.22% ) (69.23%)
- 3,239,403,770 L1-dcache-load-misses # 691.17% of all L1-dcache accesses ( +- 19.25% ) (69.24%)
+ 194,318,641 L1-dcache-loads # 17.905 M/sec ( +- 19.07% ) (69.25%)
+ 3,238,878,229 L1-dcache-load-misses # 2480.93% of all L1-dcache accesses ( +- 19.25% ) (69.26%)
pg-sz=1GB:
- 532,232,051 L1-dcache-loads # 37.378 M/sec ( +- 19.25% ) (69.23%)
- 3,224,574,249 L1-dcache-load-misses # 909.02% of all L1-dcache accesses ( +- 19.25% ) (69.24%)
+ 22,587,703 L1-dcache-loads # 2.150 M/sec ( +- 19.38% ) (69.25%)
+ 3,223,143,697 L1-dcache-load-misses # 21478.37% of all L1-dcache accesses ( +- 19.25% ) (69.25%)
Milan (EPYC 7J13, ucode=0xa0011a9, boost=0):
mm/clear_huge_page x86/clear_huge_page change
(GB/s) (GB/s)
pg-sz=2MB 12.24 17.54 +43.30%
pg-sz=1GB 17.98 37.24 +107.11%
Milan uses a threshold of ~32MB for eliding cacheline allocation, so
we see a dropoff in cacheline allocations for pg-sz=1GB:
pg-sz=2MB:
- 2,495,566,569 L1-dcache-loads # 476.417 M/sec ( +- 0.04% ) (33.38%)
- 1,079,711,798 L1-dcache-load-misses # 43.28% of all L1-dcache accesses ( +- 0.01% ) (33.37%)
+ 2,235,310,058 L1-dcache-loads # 610.770 M/sec ( +- 0.02% ) (33.37%)
+ 1,089,602,355 L1-dcache-load-misses # 48.73% of all L1-dcache accesses ( +- 0.01% ) (33.37%)
pg-sz=1GB:
- 2,417,846,489 L1-dcache-loads # 679.753 M/sec ( +- 0.01% ) (33.38%)
- 1,075,531,869 L1-dcache-load-misses # 44.49% of all L1-dcache accesses ( +- 0.01% ) (33.35%)
+ 31,159,378 L1-dcache-loads # 18.119 M/sec ( +- 3.27% ) (33.46%)
+ 14,692,358 L1-dcache-load-misses # 48.21% of all L1-dcache accesses ( +- 3.12% ) (33.46%)
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
Fuller perf stats for context:
# Icelakex, baseline (mm/clear_huge_page), region-sz=64g, pg-sz=2mb
Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 1' (3 runs):
21,945.59 msec task-clock # 2.999 CPUs utilized ( +- 19.25% )
34 context-switches # 2.324 /sec ( +- 20.38% )
3 cpu-migrations # 0.205 /sec ( +- 19.25% )
198,152 page-faults # 13.546 K/sec ( +- 19.29% )
56,513,364,885 cycles # 3.863 GHz ( +- 19.25% ) (38.44%)
2,583,719,806 instructions # 0.07 insn per cycle ( +- 19.24% ) (46.14%)
585,212,952 branches # 40.005 M/sec ( +- 19.23% ) (53.83%)
562,164 branch-misses # 0.14% of all branches ( +- 19.23% ) (61.53%)
282,621,312,162 slots # 19.320 G/sec ( +- 19.25% ) (69.22%)
11,048,627,225 topdown-retiring # 3.8% Retiring ( +- 19.22% ) (69.22%)
34,358,400,894 topdown-bad-spec # 11.5% Bad Speculation ( +- 19.57% ) (69.22%)
2,231,092,499 topdown-fe-bound # 0.8% Frontend Bound ( +- 19.25% ) (69.22%)
246,679,210,776 topdown-be-bound # 84.0% Backend Bound ( +- 19.21% ) (69.22%)
701,951,397 L1-dcache-loads # 47.985 M/sec ( +- 19.22% ) (69.23%)
3,239,403,770 L1-dcache-load-misses # 691.17% of all L1-dcache accesses ( +- 19.25% ) (69.24%)
11,475,685 LLC-loads # 784.475 K/sec ( +- 19.23% ) (69.25%)
793,272 LLC-load-misses # 10.36% of all LL-cache accesses ( +- 19.23% ) (69.25%)
17,821,045 L1-icache-load-misses # 0.00% of all L1-icache accesses ( +- 19.51% ) (30.77%)
693,339,354 dTLB-loads # 47.397 M/sec ( +- 19.33% ) (30.76%)
637,811 dTLB-load-misses # 0.14% of all dTLB cache accesses ( +- 19.09% ) (30.75%)
131,922 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 19.59% ) (30.75%)
7.31681 +- 0.00177 seconds time elapsed ( +- 0.02% )
# Icelakex, multi-page (x86/clear_huge_page), region-sz=64g, pg-sz=2mb
Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 1' (3 runs):
16,276.28 msec task-clock # 2.999 CPUs utilized ( +- 19.24% )
27 context-switches # 2.488 /sec ( +- 19.25% )
3 cpu-migrations # 0.276 /sec ( +- 19.25% )
196,935 page-faults # 18.146 K/sec ( +- 19.25% )
41,906,597,608 cycles # 3.861 GHz ( +- 19.24% ) (38.44%)
729,479,932 instructions # 0.03 insn per cycle ( +- 19.38% ) (46.14%)
133,969,095 branches # 12.344 M/sec ( +- 19.35% ) (53.84%)
412,818 branch-misses # 0.46% of all branches ( +- 18.97% ) (61.54%)
209,574,316,961 slots # 19.311 G/sec ( +- 19.24% ) (69.24%)
4,933,512,982 topdown-retiring # 2.3% Retiring ( +- 19.24% ) (69.24%)
20,272,641,267 topdown-bad-spec # 9.4% Bad Speculation ( +- 19.51% ) (69.24%)
837,421,487 topdown-fe-bound # 0.4% Frontend Bound ( +- 19.24% ) (69.24%)
190,089,232,476 topdown-be-bound # 88.0% Backend Bound ( +- 19.19% ) (69.24%)
194,318,641 L1-dcache-loads # 17.905 M/sec ( +- 19.07% ) (69.25%)
3,238,878,229 L1-dcache-load-misses # 2480.93% of all L1-dcache accesses ( +- 19.25% ) (69.26%)
10,560,508 LLC-loads # 973.081 K/sec ( +- 19.23% ) (69.26%)
724,884 LLC-load-misses # 10.28% of all LL-cache accesses ( +- 17.15% ) (69.26%)
14,378,070 L1-icache-load-misses # 0.00% of all L1-icache accesses ( +- 19.13% ) (30.75%)
185,562,230 dTLB-loads # 17.098 M/sec ( +- 19.74% ) (30.74%)
617,978 dTLB-load-misses # 0.51% of all dTLB cache accesses ( +- 18.72% ) (30.74%)
112,509 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 19.76% ) (30.74%)
5.42697 +- 0.00152 seconds time elapsed ( +- 0.03% )
# Icelakex, baseline (mm/clear_huge_page), region-sz=64g, pg-sz=1gb
Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64 --huge 2' (3 runs):
21,361.22 msec task-clock # 2.999 CPUs utilized ( +- 19.25% )
23 context-switches # 1.615 /sec ( +- 18.95% )
3 cpu-migrations # 0.211 /sec ( +- 19.25% )
701 page-faults # 49.230 /sec ( +- 19.27% )
54,981,958,487 cycles # 3.861 GHz ( +- 19.25% ) (38.44%)
2,012,625,953 instructions # 0.05 insn per cycle ( +- 19.25% ) (46.14%)
470,264,509 branches # 33.026 M/sec ( +- 19.25% ) (53.83%)
194,801 branch-misses # 0.06% of all branches ( +- 18.88% ) (61.53%)
274,966,507,627 slots # 19.311 G/sec ( +- 19.25% ) (69.22%)
10,555,137,650 topdown-retiring # 3.8% Retiring ( +- 19.04% ) (69.22%)
21,206,785,918 topdown-bad-spec # 7.8% Bad Speculation ( +- 18.13% ) (69.22%)
1,094,597,329 topdown-fe-bound # 0.4% Frontend Bound ( +- 19.25% ) (69.22%)
244,462,123,545 topdown-be-bound # 88.0% Backend Bound ( +- 19.33% ) (69.22%)
532,232,051 L1-dcache-loads # 37.378 M/sec ( +- 19.25% ) (69.23%)
3,224,574,249 L1-dcache-load-misses # 909.02% of all L1-dcache accesses ( +- 19.25% ) (69.24%)
2,318,195 LLC-loads # 162.804 K/sec ( +- 19.35% ) (69.25%)
206,737 LLC-load-misses # 13.44% of all LL-cache accesses ( +- 18.30% ) (69.25%)
4,950,866 L1-icache-load-misses # 0.00% of all L1-icache accesses ( +- 19.26% ) (30.77%)
531,299,560 dTLB-loads # 37.313 M/sec ( +- 19.24% ) (30.76%)
2,811 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 17.25% ) (30.75%)
26,355 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 19.58% ) (30.75%)
7.12187 +- 0.00190 seconds time elapsed ( +- 0.03% )
# Icelakex, multi-page (x86/clear_huge_page), region-sz=64g, pg-sz=1gb
Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64 --huge 2' (3 runs):
15,764.52 msec task-clock # 2.999 CPUs utilized ( +- 19.25% )
17 context-switches # 1.618 /sec ( +- 20.47% )
3 cpu-migrations # 0.285 /sec ( +- 19.25% )
700 page-faults # 66.614 /sec ( +- 19.22% )
40,560,984,582 cycles # 3.860 GHz ( +- 19.25% ) (38.45%)
79,578,792 instructions # 0.00 insn per cycle ( +- 19.24% ) (46.15%)
13,872,134 branches # 1.320 M/sec ( +- 19.23% ) (53.85%)
119,492 branch-misses # 1.29% of all branches ( +- 18.80% ) (61.55%)
202,854,573,160 slots # 19.304 G/sec ( +- 19.25% ) (69.25%)
3,982,417,725 topdown-retiring # 2.0% Retiring ( +- 19.25% ) (69.25%)
13,523,424,635 topdown-bad-spec # 6.8% Bad Speculation ( +- 18.69% ) (69.25%)
18,661,431 topdown-fe-bound # 0.0% Frontend Bound ( +- 19.28% ) (69.25%)
185,884,147,789 topdown-be-bound # 91.3% Backend Bound ( +- 19.28% ) (69.25%)
22,587,703 L1-dcache-loads # 2.150 M/sec ( +- 19.38% ) (69.25%)
3,223,143,697 L1-dcache-load-misses # 21478.37% of all L1-dcache accesses ( +- 19.25% ) (69.25%)
1,777,675 LLC-loads # 169.169 K/sec ( +- 19.60% ) (69.25%)
126,583 LLC-load-misses # 10.77% of all LL-cache accesses ( +- 19.82% ) (69.25%)
3,333,729 L1-icache-load-misses # 0.00% of all L1-icache accesses ( +- 19.49% ) (30.75%)
19,999,517 dTLB-loads # 1.903 M/sec ( +- 19.38% ) (30.75%)
1,833 dTLB-load-misses # 0.01% of all dTLB cache accesses ( +- 17.72% ) (30.75%)
34,066 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 19.09% ) (30.75%)
5.25624 +- 0.00176 seconds time elapsed ( +- 0.03% )
# Milan, baseline (mm/clear_huge_page), region-sz=64g, pg-sz=2mb
Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 1' (3 runs):
5,241.76 msec task-clock # 1.000 CPUs utilized ( +- 0.08% )
10 context-switches # 1.909 /sec ( +- 8.82% )
1 cpu-migrations # 0.191 /sec
65,636 page-faults # 12.530 K/sec ( +- 0.00% )
12,730,694,768 cycles # 2.430 GHz ( +- 0.08% ) (33.31%)
36,709,243 stalled-cycles-frontend # 0.29% frontend cycles idle ( +- 24.07% ) (33.32%)
37,520,225 stalled-cycles-backend # 0.29% backend cycles idle ( +- 9.87% ) (33.34%)
874,896,010 instructions # 0.07 insn per cycle
# 0.05 stalled cycles per insn ( +- 1.23% ) (33.36%)
199,308,386 branches # 38.049 M/sec ( +- 0.84% ) (33.38%)
441,428 branch-misses # 0.22% of all branches ( +- 4.68% ) (33.38%)
2,495,566,569 L1-dcache-loads # 476.417 M/sec ( +- 0.04% ) (33.38%)
1,079,711,798 L1-dcache-load-misses # 43.28% of all L1-dcache accesses ( +- 0.01% ) (33.37%)
50,936,391 L1-icache-loads # 9.724 M/sec ( +- 1.29% ) (33.35%)
284,407 L1-icache-load-misses # 0.56% of all L1-icache accesses ( +- 4.60% ) (33.33%)
546,596 dTLB-loads # 104.348 K/sec ( +- 6.14% ) (33.31%)
231,897 dTLB-load-misses # 42.08% of all dTLB cache accesses ( +- 4.27% ) (33.29%)
6 iTLB-loads # 1.145 /sec ( +- 72.65% ) (33.29%)
34,065 iTLB-load-misses # 262038.46% of all iTLB cache accesses ( +- 44.88% ) (33.29%)
18,237,487 L1-dcache-prefetches # 3.482 M/sec ( +- 12.84% ) (33.29%)
5.23915 +- 0.00421 seconds time elapsed ( +- 0.08% )
# Milan, multi-page (x86/clear_huge_page), region-sz=64g, pg-sz=2mb
Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 1' (3 runs):
3,655.71 msec task-clock # 0.999 CPUs utilized ( +- 0.13% )
7 context-switches # 1.913 /sec ( +- 8.25% )
1 cpu-migrations # 0.273 /sec
65,636 page-faults # 17.934 K/sec ( +- 0.00% )
8,879,727,514 cycles # 2.426 GHz ( +- 0.13% ) (33.26%)
5,733,380 stalled-cycles-frontend # 0.06% frontend cycles idle ( +-170.04% ) (33.28%)
42,012,302 stalled-cycles-backend # 0.47% backend cycles idle ( +- 24.51% ) (33.31%)
214,672,610 instructions # 0.02 insn per cycle
# 0.28 stalled cycles per insn ( +- 1.71% ) (33.34%)
42,298,268 branches # 11.557 M/sec ( +- 1.28% ) (33.36%)
267,936 branch-misses # 0.62% of all branches ( +- 7.80% ) (33.37%)
2,235,310,058 L1-dcache-loads # 610.770 M/sec ( +- 0.02% ) (33.37%)
1,089,602,355 L1-dcache-load-misses # 48.73% of all L1-dcache accesses ( +- 0.01% ) (33.37%)
48,725,812 L1-icache-loads # 13.314 M/sec ( +- 0.25% ) (33.37%)
231,227 L1-icache-load-misses # 0.47% of all L1-icache accesses ( +- 13.20% ) (33.37%)
280,655 dTLB-loads # 76.685 K/sec ( +- 13.33% ) (33.37%)
151,028 dTLB-load-misses # 44.02% of all dTLB cache accesses ( +- 6.64% ) (33.35%)
15 iTLB-loads # 4.099 /sec ( +- 6.67% ) (33.32%)
121,208 iTLB-load-misses # 865771.43% of all iTLB cache accesses ( +- 2.74% ) (33.29%)
18,702,209 L1-dcache-prefetches # 5.110 M/sec ( +- 12.51% ) (33.27%)
3.66065 +- 0.00461 seconds time elapsed ( +- 0.13% )
# Milan, baseline (mm/clear_huge_page), region-sz=64g, pg-sz=1gb
Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 2' (3 runs):
3,544.20 msec task-clock # 0.996 CPUs utilized ( +- 0.21% )
5 context-switches # 1.406 /sec ( +- 6.67% )
1 cpu-migrations # 0.281 /sec
227 page-faults # 63.819 /sec ( +- 0.15% )
8,609,810,964 cycles # 2.421 GHz ( +- 0.21% ) (33.30%)
77,420,424 stalled-cycles-frontend # 0.90% frontend cycles idle ( +- 20.55% ) (33.33%)
25,197,541 stalled-cycles-backend # 0.29% backend cycles idle ( +- 1.09% ) (33.35%)
658,146,061 instructions # 0.08 insn per cycle
# 0.16 stalled cycles per insn ( +- 0.04% ) (33.38%)
154,867,131 branches # 43.539 M/sec ( +- 0.04% ) (33.41%)
167,531 branch-misses # 0.11% of all branches ( +- 5.19% ) (33.41%)
2,417,846,489 L1-dcache-loads # 679.753 M/sec ( +- 0.01% ) (33.38%)
1,075,531,869 L1-dcache-load-misses # 44.49% of all L1-dcache accesses ( +- 0.01% ) (33.35%)
12,835,321 L1-icache-loads # 3.609 M/sec ( +- 0.41% ) (33.33%)
55,282 L1-icache-load-misses # 0.43% of all L1-icache accesses ( +- 1.98% ) (33.30%)
23,287 dTLB-loads # 6.547 K/sec ( +- 15.61% ) (33.29%)
1,333 dTLB-load-misses # 4.48% of all dTLB cache accesses ( +- 1.26% ) (33.29%)
3 iTLB-loads # 0.843 /sec ( +- 33.33% ) (33.29%)
231 iTLB-load-misses # 11550.00% of all iTLB cache accesses ( +- 6.14% ) (33.29%)
170,608,062 L1-dcache-prefetches # 47.965 M/sec ( +- 0.84% ) (33.29%)
3.55776 +- 0.00738 seconds time elapsed ( +- 0.21% )
# Milan, multi-page (x86/clear_huge_page), region-sz=64g, pg-sz=1gb
Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 2' (3 runs):
1,718.27 msec task-clock # 0.999 CPUs utilized ( +- 0.08% )
6 context-switches # 3.489 /sec ( +- 14.70% )
1 cpu-migrations # 0.581 /sec
227 page-faults # 132.000 /sec ( +- 0.15% )
4,176,107,493 cycles # 2.428 GHz ( +- 0.08% ) (33.19%)
2,675,797 stalled-cycles-frontend # 0.06% frontend cycles idle ( +- 0.34% ) (33.25%)
147,394,527 stalled-cycles-backend # 3.53% backend cycles idle ( +- 8.80% ) (33.31%)
12,779,784 instructions # 0.00 insn per cycle
# 13.14 stalled cycles per insn ( +- 0.09% ) (33.37%)
2,428,829 branches # 1.412 M/sec ( +- 0.08% ) (33.42%)
63,460 branch-misses # 2.61% of all branches ( +- 3.48% ) (33.46%)
31,159,378 L1-dcache-loads # 18.119 M/sec ( +- 3.27% ) (33.46%)
14,692,358 L1-dcache-load-misses # 48.21% of all L1-dcache accesses ( +- 3.12% ) (33.46%)
2,556,688 L1-icache-loads # 1.487 M/sec ( +- 0.89% ) (33.46%)
21,148 L1-icache-load-misses # 0.84% of all L1-icache accesses ( +- 0.25% ) (33.41%)
6,114 dTLB-loads # 3.555 K/sec ( +- 12.76% ) (33.35%)
1,742 dTLB-load-misses # 33.73% of all dTLB cache accesses ( +- 21.79% ) (33.29%)
45 iTLB-loads # 26.167 /sec ( +- 7.52% ) (33.23%)
90 iTLB-load-misses # 210.94% of all iTLB cache accesses ( +- 21.20% ) (33.17%)
257,942 L1-dcache-prefetches # 149.993 K/sec ( +- 9.84% ) (33.17%)
1.72042 +- 0.00139 seconds time elapsed ( +- 0.08% )
---
arch/x86/mm/hugetlbpage.c | 49 +++++++++++++++++++++++++++++++++++++++
1 file changed, 49 insertions(+)
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 5804bbae4f01..4294b77c4f18 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -148,6 +148,55 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
return hugetlb_get_unmapped_area_topdown(file, addr, len,
pgoff, flags);
}
+
+/*
+ * This is used on all !CONFIG_HIGHMEM configurations.
+ *
+ * With CONFIG_HIGHMEM, we fall back to the __weak version.
+ */
+#ifndef CONFIG_HIGHMEM
+static void clear_contig_region(struct page *page, unsigned long vaddr,
+ unsigned int npages)
+{
+ clear_user_pages(page_address(page), vaddr, page, npages);
+}
+
+void clear_huge_page(struct page *page,
+ unsigned long addr_hint, unsigned int pages_per_huge_page)
+{
+ unsigned long addr = addr_hint &
+ ~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
+ const long pgidx = (addr_hint - addr) / PAGE_SIZE;
+ const int first_pg = 0, last_pg = pages_per_huge_page - 1;
+ const int width = 2; /* pages cleared last on either side */
+ int sidx[3], eidx[3];
+ int i, n;
+
+ if (pages_per_huge_page > MAX_ORDER_NR_PAGES)
+ return clear_contig_region(page, addr, pages_per_huge_page);
+
+ /*
+ * Neighbourhood of the fault. Cleared at the end to ensure
+ * it sticks around in the cache.
+ */
+ n = 2;
+ sidx[n] = (pgidx - width) < first_pg ? first_pg : (pgidx - width);
+ eidx[n] = (pgidx + width) > last_pg ? last_pg : (pgidx + width);
+
+ sidx[0] = first_pg; /* Region to the left of the fault */
+ eidx[0] = sidx[n] - 1;
+
+ sidx[1] = eidx[n] + 1; /* Region to the right of the fault */
+ eidx[1] = last_pg;
+
+ for (i = 0; i <= 2; i++) {
+ if (eidx[i] >= sidx[i])
+ clear_contig_region(page + sidx[i],
+ addr + sidx[i] * PAGE_SIZE,
+ eidx[i] - sidx[i] + 1);
+ }
+}
+#endif /* CONFIG_HIGHMEM */
#endif /* CONFIG_HUGETLB_PAGE */
#ifdef CONFIG_X86_64
--
2.31.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH 7/9] sched: define TIF_ALLOW_RESCHED
2023-04-03 5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
` (5 preceding siblings ...)
2023-04-03 5:22 ` [PATCH 6/9] mm/clear_huge_page: use multi-page clearing Ankur Arora
@ 2023-04-03 5:22 ` Ankur Arora
2023-04-05 20:07 ` Peter Zijlstra
2023-04-03 5:22 ` [PATCH 8/9] irqentry: define irqentry_exit_allow_resched() Ankur Arora
` (2 subsequent siblings)
9 siblings, 1 reply; 30+ messages in thread
From: Ankur Arora @ 2023-04-03 5:22 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86
Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
jon.grimm, bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora
Define TIF_ALLOW_RESCHED to allow threads to mark themselves as allowing
rescheduling in the irqexit path with PREEMPTION_NONE/_VOLUNTARY.
This is meant to be used for long running tasks where it is
not convenient to periodically call cond_resched().
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
arch/x86/include/asm/thread_info.h | 2 ++
include/linux/sched.h | 29 +++++++++++++++++++++++++++++
2 files changed, 31 insertions(+)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index f1cccba52eb9..8c18b9eaeec4 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -100,6 +100,7 @@ struct thread_info {
#define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */
#define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
#define TIF_ADDR32 29 /* 32-bit address space on 64 bits */
+#define TIF_RESCHED_ALLOW 30 /* can reschedule if needed */
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
@@ -122,6 +123,7 @@ struct thread_info {
#define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
#define _TIF_LAZY_MMU_UPDATES (1 << TIF_LAZY_MMU_UPDATES)
#define _TIF_ADDR32 (1 << TIF_ADDR32)
+#define _TIF_RESCHED_ALLOW (1 << TIF_RESCHED_ALLOW)
/* flags to check in __switch_to() */
#define _TIF_WORK_CTXSW_BASE \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63d242164b1a..1e7536e6d9ce 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2229,6 +2229,35 @@ static __always_inline bool need_resched(void)
return unlikely(tif_need_resched());
}
+/*
+ * Define this in common code to avoid include hell.
+ */
+static __always_inline bool resched_allowed(void)
+{
+#ifndef TIF_RESCHED_ALLOW
+ return false;
+#else
+ return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
+#endif
+}
+
+static inline void allow_resched(void)
+{
+ /*
+ * allow_resched() allows preemption via the irqexit context.
+ * To ensure that we stick around on the current runqueue,
+ * disallow migration.
+ */
+ migrate_disable();
+ set_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
+}
+
+static inline void disallow_resched(void)
+{
+ clear_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
+ migrate_enable();
+}
+
/*
* Wrappers for p->thread_info->cpu access. No-op on UP.
*/
--
2.31.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH 8/9] irqentry: define irqentry_exit_allow_resched()
2023-04-03 5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
` (6 preceding siblings ...)
2023-04-03 5:22 ` [PATCH 7/9] sched: define TIF_ALLOW_RESCHED Ankur Arora
@ 2023-04-03 5:22 ` Ankur Arora
2023-04-04 9:38 ` Thomas Gleixner
2023-04-05 20:22 ` Peter Zijlstra
2023-04-03 5:22 ` [PATCH 9/9] x86/clear_huge_page: make clear_contig_region() preemptible Ankur Arora
2023-04-05 19:48 ` [PATCH 0/9] x86/clear_huge_page: multi-page clearing Raghavendra K T
9 siblings, 2 replies; 30+ messages in thread
From: Ankur Arora @ 2023-04-03 5:22 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86
Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
jon.grimm, bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora
Allow threads marked TIF_ALLOW_RESCHED to be rescheduled in irqexit.
This is only necessary under !preempt_model_preemptible() for which
we reuse the same logic as irqentry_exit_cond_resched().
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
kernel/entry/common.c | 8 ++++++++
kernel/sched/core.c | 36 +++++++++++++++++++++---------------
2 files changed, 29 insertions(+), 15 deletions(-)
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index be61332c66b5..f1005595ebe7 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -390,6 +390,9 @@ void raw_irqentry_exit_cond_resched(void)
preempt_schedule_irq();
}
}
+
+void irqentry_exit_allow_resched(void) __alias(raw_irqentry_exit_cond_resched);
+
#ifdef CONFIG_PREEMPT_DYNAMIC
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
@@ -431,6 +434,11 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
instrumentation_begin();
if (IS_ENABLED(CONFIG_PREEMPTION))
irqentry_exit_cond_resched();
+ /*
+ * We care about this clause only in the dynamic !preemptible case.
+ */
+ if (unlikely(!preempt_model_preemptible() && resched_allowed()))
+ irqentry_exit_allow_resched();
/* Covers both tracing and lockdep */
trace_hardirqs_on();
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0d18c3969f90..11845a91b691 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6500,6 +6500,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
* - explicit schedule() call
* - return from syscall or exception to user-space
* - return from interrupt-handler to user-space
+ * - return from interrupt-handler with the task having set
+ * TIF_RESCHED_ALLOW
*
* WARNING: must be called with preemption disabled!
*/
@@ -8597,28 +8599,32 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
* SC:preempt_schedule
* SC:preempt_schedule_notrace
* SC:irqentry_exit_cond_resched
+ * SC:irqentry_exit_allow_resched
*
*
* NONE:
- * cond_resched <- __cond_resched
- * might_resched <- RET0
- * preempt_schedule <- NOP
- * preempt_schedule_notrace <- NOP
- * irqentry_exit_cond_resched <- NOP
+ * cond_resched <- __cond_resched
+ * might_resched <- RET0
+ * preempt_schedule <- NOP
+ * preempt_schedule_notrace <- NOP
+ * irqentry_exit_cond_resched <- NOP
+ * irqentry_exit_allow_resched <- irqentry_exit_allow_resched
*
* VOLUNTARY:
- * cond_resched <- __cond_resched
- * might_resched <- __cond_resched
- * preempt_schedule <- NOP
- * preempt_schedule_notrace <- NOP
- * irqentry_exit_cond_resched <- NOP
+ * cond_resched <- __cond_resched
+ * might_resched <- __cond_resched
+ * preempt_schedule <- NOP
+ * preempt_schedule_notrace <- NOP
+ * irqentry_exit_cond_resched <- NOP
+ * irqentry_exit_allow_resched <- irqentry_exit_allow_resched
*
* FULL:
- * cond_resched <- RET0
- * might_resched <- RET0
- * preempt_schedule <- preempt_schedule
- * preempt_schedule_notrace <- preempt_schedule_notrace
- * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
+ * cond_resched <- RET0
+ * might_resched <- RET0
+ * preempt_schedule <- preempt_schedule
+ * preempt_schedule_notrace <- preempt_schedule_notrace
+ * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
+ * irqentry_exit_allow_resched <- NOP
*/
enum {
--
2.31.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH 9/9] x86/clear_huge_page: make clear_contig_region() preemptible
2023-04-03 5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
` (7 preceding siblings ...)
2023-04-03 5:22 ` [PATCH 8/9] irqentry: define irqentry_exit_allow_resched() Ankur Arora
@ 2023-04-03 5:22 ` Ankur Arora
2023-04-05 20:27 ` Peter Zijlstra
2023-04-05 19:48 ` [PATCH 0/9] x86/clear_huge_page: multi-page clearing Raghavendra K T
9 siblings, 1 reply; 30+ messages in thread
From: Ankur Arora @ 2023-04-03 5:22 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86
Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
jon.grimm, bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora
clear_contig_region() can be used to clear up to a huge-page (2MB/1GB)
chunk. Allow preemption in the irqentry_exit path to make sure we don't
hold on to the CPU for an arbitrarily long period.
Performance: vm-scalability/case-anon-w-seq-hugetlb mmaps an anonymous
hugetlb-2mb region, and then writes sequentially to the region, demand
faulting pages on the way.
This test, with CONFIG_PREEMPT_VOLUNTARY, shows the effects of this
change: stime drops (~18% on Icelakex, ~5% on Milan), while the utime
goes up (~15% on Icelakex, ~13% on Milan.)
*Icelakex* mm/clear_huge_page x86/clear_huge_page change
(mem=4GB/task, tasks=128)
stime 293.02 +- .49% 239.39 +- .83% -18.30%
utime 440.11 +- .28% 508.74 +- .60% +15.59%
wall-clock 5.96 +- .33% 6.27 +-2.23% + 5.20%
*Milan* mm/clear_huge_page x86/clear_huge_page change
(mem=1GB/task, tasks=512)
stime 490.95 +- 3.55% 466.90 +- 4.79% - 4.89%
utime 276.43 +- 2.85% 311.97 +- 5.15% +12.85%
wall-clock 3.74 +- 6.41% 3.58 +- 7.82% - 4.27%
The drop in stime is due to REP; STOS being more efficient for bigger
extents. The increase in utime is due to cache effects of that change:
mm/clear_huge_page() clears a page at a time, narrowing towards the
faulting page, whereas x86/clear_huge_page only optimizes for cache
locality in the local neighbourhood of the faulting address.
This effect on utime is visible via the increased L1-dcache-load-misses
and LLC-load* and an increased backend boundedness for perf user-stat
--all-user on Icelakex. The effect is slight but given the heavy cache
pressure generated by the test, shows up in the drop in user IPC:
- 9,455,243,414,829 instructions # 2.75 insn per cycle ( +- 14.14% ) (46.17%)
- 2,367,920,864,112 L1-dcache-loads # 1.054 G/sec ( +- 14.14% ) (69.24%)
- 42,075,182,813 L1-dcache-load-misses # 2.96% of all L1-dcache accesses ( +- 14.14% ) (69.24%)
- 20,365,688 LLC-loads # 9.064 K/sec ( +- 13.98% ) (69.24%)
- 890,382 LLC-load-misses # 7.18% of all LL-cache accesses ( +- 14.91% ) (69.24%)
+ 9,467,796,660,698 instructions # 2.37 insn per cycle ( +- 14.14% ) (46.16%)
+ 2,369,973,307,561 L1-dcache-loads # 1.027 G/sec ( +- 14.14% ) (69.24%)
+ 42,155,621,201 L1-dcache-load-misses # 2.96% of all L1-dcache accesses ( +- 14.14% ) (69.24%)
+ 22,116,300 LLC-loads # 9.588 K/sec ( +- 14.20% ) (69.24%)
+ 1,355,607 LLC-load-misses # 10.29% of all LL-cache accesses ( +- 15.49% ) (69.25%)
Given that the stime improves for all loads using this path, while the
utime regression is load dependent, add this change.
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
arch/x86/mm/hugetlbpage.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 4294b77c4f18..c8564b0552e5 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -158,7 +158,17 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
static void clear_contig_region(struct page *page, unsigned long vaddr,
unsigned int npages)
{
+ might_sleep();
+
+ /*
+ * We might be clearing a large region.
+ * Allow rescheduling.
+ */
+ allow_resched();
clear_user_pages(page_address(page), vaddr, page, npages);
+ disallow_resched();
+
+ cond_resched();
}
void clear_huge_page(struct page *page,
--
2.31.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 8/9] irqentry: define irqentry_exit_allow_resched()
2023-04-03 5:22 ` [PATCH 8/9] irqentry: define irqentry_exit_allow_resched() Ankur Arora
@ 2023-04-04 9:38 ` Thomas Gleixner
2023-04-05 5:29 ` Ankur Arora
2023-04-05 20:22 ` Peter Zijlstra
1 sibling, 1 reply; 30+ messages in thread
From: Thomas Gleixner @ 2023-04-04 9:38 UTC (permalink / raw)
To: Ankur Arora, linux-kernel, linux-mm, x86
Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
willy, mgorman, peterz, rostedt, vincent.guittot, jon.grimm,
bharata, boris.ostrovsky, konrad.wilk, ankur.a.arora
On Sun, Apr 02 2023 at 22:22, Ankur Arora wrote:
> Allow threads marked TIF_ALLOW_RESCHED to be rescheduled in irqexit.
>
> This is only necessary under !preempt_model_preemptible() for which
> we reuse the same logic as irqentry_exit_cond_resched().
This tells what this patch is doing but completely fails to explain why
this is necessary and useful.
Thanks,
tglx
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 8/9] irqentry: define irqentry_exit_allow_resched()
2023-04-04 9:38 ` Thomas Gleixner
@ 2023-04-05 5:29 ` Ankur Arora
0 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2023-04-05 5:29 UTC (permalink / raw)
To: Thomas Gleixner
Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, peterz,
rostedt, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
konrad.wilk, ankur.a.arora
Thomas Gleixner <tglx@linutronix.de> writes:
> On Sun, Apr 02 2023 at 22:22, Ankur Arora wrote:
>> Allow threads marked TIF_ALLOW_RESCHED to be rescheduled in irqexit.
>>
>> This is only necessary under !preempt_model_preemptible() for which
>> we reuse the same logic as irqentry_exit_cond_resched().
>
> This tells what this patch is doing but completely fails to explain why
> this is necessary and useful.
Thanks. Yeah, it does seem to miss that completely.
Needs some massaging, but does something like this clarify its purpose?
On kernels with PREEMPTION_NONE/_VOLUNTARY, rescheduling of kernel
tasks happens when they allow it -- for instance by synchronously
calling cond_resched() in a long running task.
There are cases where it is not convenient to periodically call
cond_resched() -- for instance when executing a potentially
long running instruction (such as REP STOSB on x86).
To handle kernel code sections which can be safely preempted, but
cannot explicitly call cond_resched(), allow them to mark themselves
TIF_ALLOW_RESCHED.
Contexts marked such (via allow_resched()) can be rescheduled in the
irqexit path. This is, of course, only needed with
!preempt_model_preemptible(), and the rescheduling logic is functionally
the same as irqentry_exit_cond_resched().
--
ankur
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/9] x86/clear_huge_page: multi-page clearing
2023-04-03 5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
` (8 preceding siblings ...)
2023-04-03 5:22 ` [PATCH 9/9] x86/clear_huge_page: make clear_contig_region() preemptible Ankur Arora
@ 2023-04-05 19:48 ` Raghavendra K T
2023-04-08 22:46 ` Ankur Arora
9 siblings, 1 reply; 30+ messages in thread
From: Raghavendra K T @ 2023-04-05 19:48 UTC (permalink / raw)
To: Ankur Arora, linux-kernel, linux-mm, x86
Cc: torvalds, akpm, luto, bp, dave.hansen, hpa, mingo, juri.lelli,
willy, mgorman, peterz, rostedt, tglx, vincent.guittot,
jon.grimm, bharata, boris.ostrovsky, konrad.wilk
On 4/3/2023 10:52 AM, Ankur Arora wrote:
> This series introduces multi-page clearing for hugepages.
>
> This is a follow up of some of the ideas discussed at:
> https://lore.kernel.org/lkml/CAHk-=wj9En-BC4t7J9xFZOws5ShwaR9yor7FxHZr8CTVyEP_+Q@mail.gmail.com/
>
> On x86 page clearing is typically done via string instructions. These,
> unlike a MOV loop, allow us to explicitly advertise the region-size to
> the processor, which could serve as a hint to current (and/or
> future) uarchs to elide cacheline allocation.
>
> In current generation processors, Milan (and presumably other Zen
> variants) use the hint to elide cacheline allocation (for
> region-size > LLC-size.)
>
> An additional reason for doing this is that string instructions are typically
> microcoded, and clearing in bigger chunks than the current page-at-a-
> time logic amortizes some of the cost.
>
> All uarchs tested (Milan, Icelakex, Skylakex) showed improved performance.
>
> There are, however, some problems:
>
> 1. extended zeroing periods means there's an increased latency due to
> the now missing preemption points.
>
> That's handled in patches 7, 8, 9:
> "sched: define TIF_ALLOW_RESCHED"
> "irqentry: define irqentry_exit_allow_resched()"
> "x86/clear_huge_page: make clear_contig_region() preemptible"
> by the context marking itself reschedulable, and rescheduling in
> irqexit context if needed (for PREEMPTION_NONE/_VOLUNTARY.)
>
> 2. the current page-at-a-time clearing logic does left-right narrowing
> towards the faulting page which benefits workloads by maintaining
> cache locality for workloads which have a sequential pattern. Clearing
> in large chunks loses that.
>
> Some (but not all) of that could be ameliorated by something like
> this patch:
> https://lore.kernel.org/lkml/20220606203725.1313715-1-ankur.a.arora@oracle.com/
>
> But, before doing that I'd like some comments on whether that is
> worth doing for this specific use case?
>
> Rest of the series:
> Patches 1, 2, 3:
> "huge_pages: get rid of process_huge_page()"
> "huge_page: get rid of {clear,copy}_subpage()"
> "huge_page: allow arch override for clear/copy_huge_page()"
> are mechanical and they simplify some of the current clear_huge_page()
> logic.
>
> Patches 4, 5:
> "x86/clear_page: parameterize clear_page*() to specify length"
> "x86/clear_pages: add clear_pages()"
>
> add clear_pages() and helpers.
>
> Patch 6: "mm/clear_huge_page: use multi-page clearing" adds the
> chunked x86 clear_huge_page() implementation.
>
>
> Performance
> ==
>
> Demand fault performance gets a decent boost:
>
> *Icelakex* mm/clear_huge_page x86/clear_huge_page change
> (GB/s) (GB/s)
>
> pg-sz=2MB 8.76 11.82 +34.93%
> pg-sz=1GB 8.99 12.18 +35.48%
>
>
> *Milan* mm/clear_huge_page x86/clear_huge_page change
> (GB/s) (GB/s)
>
> pg-sz=2MB 12.24 17.54 +43.30%
> pg-sz=1GB 17.98 37.24 +107.11%
>
>
> vm-scalability/case-anon-w-seq-hugetlb, gains in stime but performs
> worse when user space tries to touch those pages:
>
> *Icelakex* mm/clear_huge_page x86/clear_huge_page change
> (mem=4GB/task, tasks=128)
>
> stime 293.02 +- .49% 239.39 +- .83% -18.30%
> utime 440.11 +- .28% 508.74 +- .60% +15.59%
> wall-clock 5.96 +- .33% 6.27 +-2.23% + 5.20%
>
>
> *Milan* mm/clear_huge_page x86/clear_huge_page change
> (mem=1GB/task, tasks=512)
>
> stime 490.95 +- 3.55% 466.90 +- 4.79% - 4.89%
> utime 276.43 +- 2.85% 311.97 +- 5.15% +12.85%
> wall-clock 3.74 +- 6.41% 3.58 +- 7.82% - 4.27%
>
> Also at:
> github.com/terminus/linux clear-pages.v1
>
> Comments appreciated!
>
Hello Ankur,
Was able to test your patches. To summarize, am seeing 2x-3x perf
improvement for 2M, 1GB base hugepage sizes.
SUT: Genoa AMD EPYC
Thread(s) per core: 2
Core(s) per socket: 128
Socket(s): 2
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-127,256-383
NUMA node1 CPU(s): 128-255,384-511
Test: Use mmap(MAP_HUGETLB) to demand a fault on 64GB region (NUMA
node0), for both base-hugepage-size=2M and 1GB
perf stat -r 10 -d -d numactl -m 0 -N 0 <test>
time in seconds elapsed (average of 10 runs) (lower = better)
Result:
page-size mm/clear_huge_page x86/clear_huge_page change %
2M 5.4567 2.6774 -50.93
1G 2.64452 1.011281 -61.76
Full perfstat info
page size = 2M mm/clear_huge_page
Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_2M' (10 runs):
5,434.71 msec task-clock # 0.996 CPUs utilized ( +- 0.55% )
8 context-switches # 1.466 /sec ( +- 4.66% )
0 cpu-migrations # 0.000 /sec
32,918 page-faults # 6.034 K/sec ( +- 0.00% )
16,977,242,482 cycles # 3.112 GHz ( +- 0.04% ) (35.70%)
1,961,724 stalled-cycles-frontend # 0.01% frontend cycles idle ( +- 1.09% ) (35.72%)
35,685,674 stalled-cycles-backend # 0.21% backend cycles idle ( +- 3.48% ) (35.74%)
1,038,327,182 instructions # 0.06 insn per cycle
# 0.04 stalled cycles per insn ( +- 0.38% ) (35.75%)
221,409,216 branches # 40.584 M/sec ( +- 0.36% ) (35.75%)
350,730 branch-misses # 0.16% of all branches ( +- 1.18% ) (35.75%)
2,520,888,779 L1-dcache-loads # 462.077 M/sec ( +- 0.03% ) (35.73%)
1,094,178,209 L1-dcache-load-misses # 43.46% of all L1-dcache accesses ( +- 0.02% ) (35.71%)
67,751,730 L1-icache-loads # 12.419 M/sec ( +- 0.11% ) (35.70%)
271,118 L1-icache-load-misses # 0.40% of all L1-icache accesses ( +- 2.55% ) (35.70%)
506,635 dTLB-loads # 92.866 K/sec ( +- 3.31% ) (35.70%)
237,385 dTLB-load-misses # 43.64% of all dTLB cache accesses ( +- 7.00% ) (35.69%)
268 iTLB-load-misses # 6700.00% of all iTLB cache accesses ( +- 13.86% ) (35.70%)
5.4567 +- 0.0300 seconds time elapsed ( +- 0.55% )
page size = 2M x86/clear_huge_page
Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_2M' (10 runs):

2,780.69 msec task-clock # 1.039 CPUs utilized ( +- 1.03% )
3 context-switches # 1.121 /sec ( +- 21.34% )
0 cpu-migrations # 0.000 /sec
32,918 page-faults # 12.301 K/sec ( +- 0.00% )
8,143,619,771 cycles # 3.043 GHz ( +- 0.25% ) (35.62%)
2,024,872 stalled-cycles-frontend # 0.02% frontend cycles idle ( +-320.93% ) (35.66%)
717,198,728 stalled-cycles-backend # 8.82% backend cycles idle ( +- 8.26% ) (35.69%)
606,549,334 instructions # 0.07 insn per cycle # 1.39 stalled cycles per insn ( +- 0.23% ) (35.73%)
108,856,550 branches # 40.677 M/sec ( +- 0.24% ) (35.76%)
202,490 branch-misses # 0.18% of all branches ( +- 3.58% ) (35.78%)
2,348,818,806 L1-dcache-loads # 877.701 M/sec ( +- 0.03% ) (35.78%)
1,081,562,988 L1-dcache-load-misses # 46.04% of all L1-dcache accesses ( +- 0.01% ) (35.78%)
<not supported> LLC-loads
<not supported> LLC-load-misses
43,411,167 L1-icache-loads # 16.222 M/sec ( +- 0.19% ) (35.77%)
273,042 L1-icache-load-misses # 0.64% of all L1-icache accesses ( +- 4.94% ) (35.76%)
834,482 dTLB-loads # 311.827 K/sec ( +- 9.73% ) (35.72%)
437,343 dTLB-load-misses # 65.86% of all dTLB cache accesses ( +- 8.56% ) (35.68%)
0 iTLB-loads # 0.000 /sec (35.65%)
160 iTLB-load-misses # 1777.78% of all iTLB cache accesses ( +- 15.82% ) (35.62%)

2.6774 +- 0.0287 seconds time elapsed ( +- 1.07% )
page size = 1G mm/clear_huge_page
Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):

2,625.24 msec task-clock # 0.993 CPUs utilized ( +- 0.23% )
4 context-switches # 1.513 /sec ( +- 4.49% )
1 cpu-migrations # 0.378 /sec
214 page-faults # 80.965 /sec ( +- 0.13% )
8,178,624,349 cycles # 3.094 GHz ( +- 0.23% ) (35.65%)
2,942,576 stalled-cycles-frontend # 0.04% frontend cycles idle ( +- 75.22% ) (35.69%)
7,117,425 stalled-cycles-backend # 0.09% backend cycles idle ( +- 3.79% ) (35.73%)
454,521,647 instructions # 0.06 insn per cycle # 0.02 stalled cycles per insn ( +- 0.10% ) (35.77%)
113,223,853 branches # 42.837 M/sec ( +- 0.08% ) (35.80%)
84,766 branch-misses # 0.07% of all branches ( +- 5.37% ) (35.80%)
2,294,528,890 L1-dcache-loads # 868.111 M/sec ( +- 0.02% ) (35.81%)
1,075,907,551 L1-dcache-load-misses # 46.88% of all L1-dcache accesses ( +- 0.02% ) (35.78%)
26,167,323 L1-icache-loads # 9.900 M/sec ( +- 0.24% ) (35.74%)
139,675 L1-icache-load-misses # 0.54% of all L1-icache accesses ( +- 0.37% ) (35.70%)
3,459 dTLB-loads # 1.309 K/sec ( +- 12.75% ) (35.67%)
732 dTLB-load-misses # 19.71% of all dTLB cache accesses ( +- 26.61% ) (35.62%)
11 iTLB-load-misses # 192.98% of all iTLB cache accesses ( +-238.28% ) (35.62%)

2.64452 +- 0.00600 seconds time elapsed ( +- 0.23% )
page size = 1G x86/clear_huge_page
Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):

1,009.09 msec task-clock # 0.998 CPUs utilized ( +- 0.06% )
2 context-switches # 1.980 /sec ( +- 23.63% )
1 cpu-migrations # 0.990 /sec
214 page-faults # 211.887 /sec ( +- 0.16% )
3,154,980,463 cycles # 3.124 GHz ( +- 0.06% ) (35.77%)
145,051 stalled-cycles-frontend # 0.00% frontend cycles idle ( +- 6.26% ) (35.78%)
730,087,143 stalled-cycles-backend # 23.12% backend cycles idle ( +- 9.75% ) (35.78%)
45,813,391 instructions # 0.01 insn per cycle # 18.51 stalled cycles per insn ( +- 1.00% ) (35.78%)
8,498,282 branches # 8.414 M/sec ( +- 1.54% ) (35.78%)
63,351 branch-misses # 0.74% of all branches ( +- 6.70% ) (35.69%)
29,135,863 L1-dcache-loads # 28.848 M/sec ( +- 5.67% ) (35.68%)
8,537,280 L1-dcache-load-misses # 28.66% of all L1-dcache accesses ( +- 10.15% ) (35.68%)
1,040,087 L1-icache-loads # 1.030 M/sec ( +- 1.60% ) (35.68%)
9,147 L1-icache-load-misses # 0.85% of all L1-icache accesses ( +- 6.50% ) (35.67%)
1,084 dTLB-loads # 1.073 K/sec ( +- 12.05% ) (35.68%)
431 dTLB-load-misses # 40.28% of all dTLB cache accesses ( +- 43.46% ) (35.68%)
16 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 40.54% ) (35.68%)

1.011281 +- 0.000624 seconds time elapsed ( +- 0.06% )
Please feel free to add
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
I will come back with further observations on the patches/performance, if any.
Thanks and Regards
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 7/9] sched: define TIF_ALLOW_RESCHED
2023-04-03 5:22 ` [PATCH 7/9] sched: define TIF_ALLOW_RESCHED Ankur Arora
@ 2023-04-05 20:07 ` Peter Zijlstra
0 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2023-04-05 20:07 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
konrad.wilk
On Sun, Apr 02, 2023 at 10:22:31PM -0700, Ankur Arora wrote:
> Define TIF_ALLOW_RESCHED to allow threads to mark themselves as allowing
> rescheduling in the irqexit path with PREEMPTION_NONE/_VOLUNTARY.
>
> This is meant to be used for long running tasks where it is
> not convenient to periodically call cond_resched().
>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
> arch/x86/include/asm/thread_info.h | 2 ++
> include/linux/sched.h | 29 +++++++++++++++++++++++++++++
> 2 files changed, 31 insertions(+)
>
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index f1cccba52eb9..8c18b9eaeec4 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -100,6 +100,7 @@ struct thread_info {
> #define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */
> #define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
> #define TIF_ADDR32 29 /* 32-bit address space on 64 bits */
> +#define TIF_RESCHED_ALLOW 30 /* can reschedule if needed */
>
> #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
> @@ -122,6 +123,7 @@ struct thread_info {
> #define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
> #define _TIF_LAZY_MMU_UPDATES (1 << TIF_LAZY_MMU_UPDATES)
> #define _TIF_ADDR32 (1 << TIF_ADDR32)
> +#define _TIF_RESCHED_ALLOW (1 << TIF_RESCHED_ALLOW)
>
> /* flags to check in __switch_to() */
> #define _TIF_WORK_CTXSW_BASE \
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 63d242164b1a..1e7536e6d9ce 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2229,6 +2229,35 @@ static __always_inline bool need_resched(void)
> return unlikely(tif_need_resched());
> }
>
> +/*
> + * Define this in common code to avoid include hell.
> + */
> +static __always_inline bool resched_allowed(void)
> +{
> +#ifndef TIF_RESCHED_ALLOW
> + return false;
> +#else
> + return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
> +#endif
> +}
> +
> +static inline void allow_resched(void)
> +{
> + /*
> + * allow_resched() allows preemption via the irqexit context.
> + * To ensure that we stick around on the current runqueue,
> + * disallow migration.
> + */
> + migrate_disable();
> + set_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
> +}
> +
> +static inline void disallow_resched(void)
> +{
> + clear_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
> + migrate_enable();
> +}
Why the migrate_disable(), that comment doesn't help much.
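For readers following along, the intended contract of the helpers in the patch quoted above can be modeled in userspace. Everything below is an illustrative stand-in: migrate_disable()/migrate_enable() and the TIF flag are stubbed with plain variables purely to show the pairing (the flag is only ever set while migration is pinned), not to reproduce kernel behavior.

```c
#include <assert.h>
#include <stdbool.h>

/* Stubs standing in for the kernel's migration pinning and the
 * per-task TIF_RESCHED_ALLOW flag. */
static int migrate_disable_count;
static bool tif_resched_allow;

static void migrate_disable(void) { migrate_disable_count++; }
static void migrate_enable(void)  { migrate_disable_count--; }

/* Mirror of allow_resched(): pin the task to its runqueue first,
 * then advertise that irqexit may reschedule it. */
static void allow_resched(void)
{
	migrate_disable();
	tif_resched_allow = true;
}

/* Mirror of disallow_resched(): clear the flag before unpinning. */
static void disallow_resched(void)
{
	tif_resched_allow = false;
	migrate_enable();
}

static bool resched_allowed(void)
{
	return tif_resched_allow;
}
```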
* Re: [PATCH 8/9] irqentry: define irqentry_exit_allow_resched()
2023-04-03 5:22 ` [PATCH 8/9] irqentry: define irqentry_exit_allow_resched() Ankur Arora
2023-04-04 9:38 ` Thomas Gleixner
@ 2023-04-05 20:22 ` Peter Zijlstra
2023-04-06 16:56 ` Ankur Arora
1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-04-05 20:22 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
konrad.wilk
On Sun, Apr 02, 2023 at 10:22:32PM -0700, Ankur Arora wrote:
> Allow threads marked TIF_ALLOW_RESCHED to be rescheduled in irqexit.
>
> This is only necessary under !preempt_model_preemptible() for which
> we reuse the same logic as irqentry_exit_cond_resched().
>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
> kernel/entry/common.c | 8 ++++++++
> kernel/sched/core.c | 36 +++++++++++++++++++++---------------
> 2 files changed, 29 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index be61332c66b5..f1005595ebe7 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -390,6 +390,9 @@ void raw_irqentry_exit_cond_resched(void)
> preempt_schedule_irq();
> }
> }
> +
> +void irqentry_exit_allow_resched(void) __alias(raw_irqentry_exit_cond_resched);
Because typing raw_irqentry_exit_cond_resched() was too much effort?
> +
> #ifdef CONFIG_PREEMPT_DYNAMIC
> #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
> DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
> @@ -431,6 +434,11 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
> instrumentation_begin();
> if (IS_ENABLED(CONFIG_PREEMPTION))
> irqentry_exit_cond_resched();
> + /*
> + * We care about this clause only in the dynamic !preemptible case.
> + */
> + if (unlikely(!preempt_model_preemptible() && resched_allowed()))
> + irqentry_exit_allow_resched();
This is wrong, if we have dynamic preemption then we have
CONFIG_PREEMPTION and we'll have that irqentry_exit_cond_resched() call.
Basically what you've written above is something like:
	static_call(foo);	// raw_foo() when A, NOP if !A

	if (!A)
		raw_foo();
And yeah, you've got the extra resched_allowed() thing in there, but
that doesn't make it any better -- this is all quite horrible.
What you really care about is the CONFIG_PREEMPTION=n case, but that you
don't actually handle.
> /* Covers both tracing and lockdep */
> trace_hardirqs_on();
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0d18c3969f90..11845a91b691 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8597,28 +8599,32 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
> * SC:preempt_schedule
> * SC:preempt_schedule_notrace
> * SC:irqentry_exit_cond_resched
> + * SC:irqentry_exit_allow_resched
> *
> *
> * NONE:
> - * cond_resched <- __cond_resched
> - * might_resched <- RET0
> - * preempt_schedule <- NOP
> - * preempt_schedule_notrace <- NOP
> - * irqentry_exit_cond_resched <- NOP
> + * cond_resched <- __cond_resched
> + * might_resched <- RET0
> + * preempt_schedule <- NOP
> + * preempt_schedule_notrace <- NOP
> + * irqentry_exit_cond_resched <- NOP
> + * irqentry_exit_allow_resched <- irqentry_exit_allow_resched
> *
> * VOLUNTARY:
> - * cond_resched <- __cond_resched
> - * might_resched <- __cond_resched
> - * preempt_schedule <- NOP
> - * preempt_schedule_notrace <- NOP
> - * irqentry_exit_cond_resched <- NOP
> + * cond_resched <- __cond_resched
> + * might_resched <- __cond_resched
> + * preempt_schedule <- NOP
> + * preempt_schedule_notrace <- NOP
> + * irqentry_exit_cond_resched <- NOP
> + * irqentry_exit_allow_resched <- irqentry_exit_allow_resched
> *
> * FULL:
> - * cond_resched <- RET0
> - * might_resched <- RET0
> - * preempt_schedule <- preempt_schedule
> - * preempt_schedule_notrace <- preempt_schedule_notrace
> - * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
> + * cond_resched <- RET0
> + * might_resched <- RET0
> + * preempt_schedule <- preempt_schedule
> + * preempt_schedule_notrace <- preempt_schedule_notrace
> + * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
> + * irqentry_exit_allow_resched <- NOP
> */
This ^ is all complete nonsense.. irqentry_exit_allow_resched() is not
a static call. It's an alias of raw_irqentry_exit_cond_resched which
circumvents the static call entirely.
* Re: [PATCH 9/9] x86/clear_huge_page: make clear_contig_region() preemptible
2023-04-03 5:22 ` [PATCH 9/9] x86/clear_huge_page: make clear_contig_region() preemptible Ankur Arora
@ 2023-04-05 20:27 ` Peter Zijlstra
2023-04-06 17:00 ` Ankur Arora
0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-04-05 20:27 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
konrad.wilk
On Sun, Apr 02, 2023 at 10:22:33PM -0700, Ankur Arora wrote:
> clear_contig_region() can be used to clear up to a huge-page (2MB/1GB)
> chunk. Allow preemption in the irqentry_exit path to make sure we don't
> hold on to the CPU for an arbitrarily long period.
>
> Performance: vm-scalability/case-anon-w-seq-hugetlb mmaps an anonymous
> hugetlb-2mb region, and then writes sequentially to the region, demand
> faulting pages on the way.
>
> This test, with a CONFIG_VOLUNTARY config shows the effects of this
> change: stime drops (~18% on Icelakex, ~5% on Milan), while the utime
> goes up (~15% on Icelakex, ~13% on Milan.)
>
> *Icelakex* mm/clear_huge_page x86/clear_huge_page change
> (mem=4GB/task, tasks=128)
>
> stime 293.02 +- .49% 239.39 +- .83% -18.30%
> utime 440.11 +- .28% 508.74 +- .60% +15.59%
> wall-clock 5.96 +- .33% 6.27 +-2.23% + 5.20%
>
>
>
> *Milan* mm/clear_huge_page x86/clear_huge_page change
> (mem=1GB/task, tasks=512)
>
> stime 490.95 +- 3.55% 466.90 +- 4.79% - 4.89%
> utime 276.43 +- 2.85% 311.97 +- 5.15% +12.85%
> wall-clock 3.74 +- 6.41% 3.58 +- 7.82% - 4.27%
>
> The drop in stime is due to REP; STOS being more efficient for bigger
> extents. The increase in utime is due to cache effects of that change:
> mm/clear_huge_page() clears page-at-a-time, while narrowing towards the
> faulting page; while x86/clear_huge_page only optimizes for cache
> locality in the local neighbourhood of the faulting address.
>
> This effect on utime is visible via the increased L1-dcache-load-misses
> and LLC-load* and an increased backend boundedness for perf user-stat
> --all-user on Icelakex. The effect is slight but given the heavy cache
> pressure generated by the test, shows up in the drop in user IPC:
>
> - 9,455,243,414,829 instructions # 2.75 insn per cycle ( +- 14.14% ) (46.17%)
> - 2,367,920,864,112 L1-dcache-loads # 1.054 G/sec ( +- 14.14% ) (69.24%)
> - 42,075,182,813 L1-dcache-load-misses # 2.96% of all L1-dcache accesses ( +- 14.14% ) (69.24%)
> - 20,365,688 LLC-loads # 9.064 K/sec ( +- 13.98% ) (69.24%)
> - 890,382 LLC-load-misses # 7.18% of all LL-cache accesses ( +- 14.91% ) (69.24%)
>
> + 9,467,796,660,698 instructions # 2.37 insn per cycle ( +- 14.14% ) (46.16%)
> + 2,369,973,307,561 L1-dcache-loads # 1.027 G/sec ( +- 14.14% ) (69.24%)
> + 42,155,621,201 L1-dcache-load-misses # 2.96% of all L1-dcache accesses ( +- 14.14% ) (69.24%)
> + 22,116,300 LLC-loads # 9.588 K/sec ( +- 14.20% ) (69.24%)
> + 1,355,607 LLC-load-misses # 10.29% of all LL-cache accesses ( +- 15.49% ) (69.25%)
>
> Given that the stime improves for all loads using this path, while the
> utime drop is load-dependent, add this change.
Either I really need sleep, or *NONE* of the above is actually relevant
to what the patch below actually does!
The above talks about the glories of using large clears, while the patch
allows reschedules which are about latency.
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
> arch/x86/mm/hugetlbpage.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 4294b77c4f18..c8564b0552e5 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -158,7 +158,17 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
> static void clear_contig_region(struct page *page, unsigned long vaddr,
> unsigned int npages)
> {
> + might_sleep();
> +
> + /*
> + * We might be clearing a large region.
> + * Allow rescheduling.
> + */
> + allow_resched();
> clear_user_pages(page_address(page), vaddr, page, npages);
> + disallow_resched();
> +
> + cond_resched();
> }
>
> void clear_huge_page(struct page *page,
> --
> 2.31.1
>
* Re: [PATCH 4/9] x86/clear_page: parameterize clear_page*() to specify length
2023-04-03 5:22 ` [PATCH 4/9] x86/clear_page: parameterize clear_page*() to specify length Ankur Arora
@ 2023-04-06 8:19 ` Peter Zijlstra
2023-04-07 3:03 ` Ankur Arora
0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-04-06 8:19 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
konrad.wilk
On Sun, Apr 02, 2023 at 10:22:28PM -0700, Ankur Arora wrote:
> Change clear_page*() to take a length parameter.
> Rename to clear_pages_*().
>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> index ecbfb4dd3b01..6069acf6072f 100644
> --- a/arch/x86/lib/clear_page_64.S
> +++ b/arch/x86/lib/clear_page_64.S
> @@ -2,6 +2,8 @@
> #include <linux/linkage.h>
> #include <asm/asm.h>
> #include <asm/export.h>
> +#include <asm/cpufeatures.h>
> +#include <asm/alternative.h>
>
> /*
> * Most CPUs support enhanced REP MOVSB/STOSB instructions. It is
> @@ -13,18 +15,30 @@
> /*
> * Zero a page.
> * %rdi - page
> + * %esi - page-length
> + *
> + * Clobbers: %rax, %rcx
> */
> -SYM_FUNC_START(clear_page_rep)
> - movl $4096/8,%ecx
> +SYM_FUNC_START(clear_pages_rep)
> + movl %esi, %ecx
> xorl %eax,%eax
> + shrl $3,%ecx
> rep stosq
> RET
> -SYM_FUNC_END(clear_page_rep)
> -EXPORT_SYMBOL_GPL(clear_page_rep)
> +SYM_FUNC_END(clear_pages_rep)
> +EXPORT_SYMBOL_GPL(clear_pages_rep)
>
> -SYM_FUNC_START(clear_page_orig)
> +/*
> + * Original page zeroing loop.
> + * %rdi - page
> + * %esi - page-length
> + *
> + * Clobbers: %rax, %rcx, %rflags
> + */
> +SYM_FUNC_START(clear_pages_orig)
> + movl %esi, %ecx
> xorl %eax,%eax
> - movl $4096/64,%ecx
> + shrl $6,%ecx
> .p2align 4
> .Lloop:
> decl %ecx
> @@ -41,16 +55,23 @@ SYM_FUNC_START(clear_page_orig)
> jnz .Lloop
> nop
> RET
> -SYM_FUNC_END(clear_page_orig)
> -EXPORT_SYMBOL_GPL(clear_page_orig)
> +SYM_FUNC_END(clear_pages_orig)
> +EXPORT_SYMBOL_GPL(clear_pages_orig)
>
> -SYM_FUNC_START(clear_page_erms)
> - movl $4096,%ecx
> +/*
> + * Zero a page.
> + * %rdi - page
> + * %esi - page-length
> + *
> + * Clobbers: %rax, %rcx
> + */
> +SYM_FUNC_START(clear_pages_erms)
> + movl %esi, %ecx
> xorl %eax,%eax
> rep stosb
> RET
> -SYM_FUNC_END(clear_page_erms)
> -EXPORT_SYMBOL_GPL(clear_page_erms)
> +SYM_FUNC_END(clear_pages_erms)
> +EXPORT_SYMBOL_GPL(clear_pages_erms)
So it seems to me that clear_user_*() and clear_page_*() are now very
similar; is there really no way to de-duplicate all that?
* Re: [PATCH 5/9] x86/clear_pages: add clear_pages()
2023-04-03 5:22 ` [PATCH 5/9] x86/clear_pages: add clear_pages() Ankur Arora
@ 2023-04-06 8:23 ` Peter Zijlstra
2023-04-07 0:50 ` Ankur Arora
0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-04-06 8:23 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
konrad.wilk
On Sun, Apr 02, 2023 at 10:22:29PM -0700, Ankur Arora wrote:
> Add clear_pages() and define the ancillary clear_user_pages().
>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
> arch/x86/include/asm/page.h | 6 ++++++
> arch/x86/include/asm/page_32.h | 6 ++++++
> arch/x86/include/asm/page_64.h | 9 +++++++--
> 3 files changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> index d18e5c332cb9..03e3c69fc427 100644
> --- a/arch/x86/include/asm/page.h
> +++ b/arch/x86/include/asm/page.h
> @@ -28,6 +28,12 @@ static inline void clear_user_page(void *page, unsigned long vaddr,
> clear_page(page);
> }
>
> +static inline void clear_user_pages(void *page, unsigned long vaddr,
> + struct page *pg, unsigned int nsubpages)
> +{
> + clear_pages(page, nsubpages);
> +}
This seems dodgy, clear_user* has slightly different semantics. It needs
the access_ok() and stac/clac thing on at the very least.
> +
> static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
> struct page *topage)
> {
> diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
> index 580d71aca65a..3523d1150cfc 100644
> --- a/arch/x86/include/asm/page_32.h
> +++ b/arch/x86/include/asm/page_32.h
> @@ -22,6 +22,12 @@ static inline void clear_page(void *page)
> memset(page, 0, PAGE_SIZE);
> }
>
> +static inline void clear_pages(void *page, unsigned int nsubpages)
> +{
> + for (int i = 0; i < nsubpages; i++)
> + clear_page(page + i * PAGE_SIZE);
cond_resched() ?
> +}
> +
> static inline void copy_page(void *to, void *from)
> {
> memcpy(to, from, PAGE_SIZE);
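With Peter's suggested cond_resched() folded in, the 32-bit fallback loop would look roughly as below. clear_page(), cond_resched() and PAGE_SIZE are stubbed so the shape can be seen outside the kernel; this is a sketch of the suggestion, not the actual patch.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 4096

static int cond_resched_calls;

static void clear_page(void *page) { memset(page, 0, PAGE_SIZE); }
static void cond_resched(void) { cond_resched_calls++; }	/* stub: count preemption points */

/* 32-bit fallback: clear page-at-a-time with a preemption point
 * after each page, as suggested in the review. */
static void clear_pages(void *page, unsigned int nsubpages)
{
	for (unsigned int i = 0; i < nsubpages; i++) {
		clear_page((char *)page + i * PAGE_SIZE);
		cond_resched();
	}
}
```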
* Re: [PATCH 8/9] irqentry: define irqentry_exit_allow_resched()
2023-04-05 20:22 ` Peter Zijlstra
@ 2023-04-06 16:56 ` Ankur Arora
2023-04-06 20:13 ` Peter Zijlstra
0 siblings, 1 reply; 30+ messages in thread
From: Ankur Arora @ 2023-04-06 16:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, luto,
bp, dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
konrad.wilk
Peter Zijlstra <peterz@infradead.org> writes:
> On Sun, Apr 02, 2023 at 10:22:32PM -0700, Ankur Arora wrote:
>> Allow threads marked TIF_ALLOW_RESCHED to be rescheduled in irqexit.
>>
>> This is only necessary under !preempt_model_preemptible() for which
>> we reuse the same logic as irqentry_exit_cond_resched().
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>> kernel/entry/common.c | 8 ++++++++
>> kernel/sched/core.c | 36 +++++++++++++++++++++---------------
>> 2 files changed, 29 insertions(+), 15 deletions(-)
>>
>> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
>> index be61332c66b5..f1005595ebe7 100644
>> --- a/kernel/entry/common.c
>> +++ b/kernel/entry/common.c
>> @@ -390,6 +390,9 @@ void raw_irqentry_exit_cond_resched(void)
>> preempt_schedule_irq();
>> }
>> }
>> +
>> +void irqentry_exit_allow_resched(void) __alias(raw_irqentry_exit_cond_resched);
>
> Because typing raw_irqentry_exit_cond_resched() was too much effort?
Probably unnecessary, but I wanted to underscore that irqentry_exit_allow_resched()
handles a separate user path.
>> +
>> #ifdef CONFIG_PREEMPT_DYNAMIC
>> #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
>> DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
>> @@ -431,6 +434,11 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>> instrumentation_begin();
>> if (IS_ENABLED(CONFIG_PREEMPTION))
>> irqentry_exit_cond_resched();
>> + /*
>> + * We care about this clause only in the dynamic !preemptible case.
>> + */
>> + if (unlikely(!preempt_model_preemptible() && resched_allowed()))
>> + irqentry_exit_allow_resched();
>
> This is wrong, if we have dynamic preemption then we have
> CONFIG_PREEMPTION and we'll have that irqentry_exit_cond_resched() call.
>
> Basically what you've written above is something like:
>
> static_call(foo); // raw_foo() when A, NOP if !A
Right, so:
CONFIG_PREEMPTION: raw_foo()
CONFIG_PREEMPTION && (preempt_dynamic_mode == preempt_dynamic_full): raw_foo()
This is NOP if CONFIG_PREEMPTION && preempt_dynamic_mode != preempt_dynamic_full.
> if (!A)
> raw_foo();
So I would call raw_foo() for the CONFIG_PREEMPTION=n case.
> What you really care about is the CONFIG_PREEMPTION=n case, but that you
> don't actually handle.
I don't see that. The CONFIG_PREEMPTION=n (or its dynamic version)
is being handled here.
Ignoring the RT case, preempt_model_preemptible() is either:
IS_ENABLED(CONFIG_PREEMPT)
or:
preempt_dynamic_mode == preempt_dynamic_full
So, I'm handling both the static and the dynamic case here.
	if (!IS_ENABLED(CONFIG_PREEMPTION))
		raw_foo()
or
	if (preempt_dynamic_mode != preempt_dynamic_full)
		raw_foo()
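The case analysis above can be captured in a small model. Everything here is a userspace stand-in for the kernel's preemption machinery, written only to make the condition explicit:

```c
#include <assert.h>
#include <stdbool.h>

enum preempt_mode { PREEMPT_NONE, PREEMPT_VOLUNTARY, PREEMPT_FULL };

/* Stand-in for preempt_dynamic_mode under CONFIG_PREEMPT_DYNAMIC. */
static enum preempt_mode preempt_dynamic_mode = PREEMPT_NONE;

/* Stand-in for preempt_model_preemptible() in the dynamic case. */
static bool preempt_model_preemptible(void)
{
	return preempt_dynamic_mode == PREEMPT_FULL;
}

/* Should irqentry exit take the allow-resched path for a task with
 * TIF_RESCHED_ALLOW set? Only when the model is not preemptible:
 * under full preemption irqentry_exit_cond_resched() already runs. */
static bool take_allow_resched_path(bool resched_allowed)
{
	return !preempt_model_preemptible() && resched_allowed;
}
```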
> And yeah, you've got the extra resched_allowed() thing in there, but
> that doesn't make it any better -- this is all quite horrible.
I don't disagree. There's quite a lot of static/dynamic config options
here, and adding this clause didn't improve things.
I think just going with a static call here for the allow-resched case
might have been cleaner. Alternately I'll see if it can be cleanly
folded in with the irqentry_exit_cond_resched() logic.
>> /* Covers both tracing and lockdep */
>> trace_hardirqs_on();
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 0d18c3969f90..11845a91b691 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>
>> @@ -8597,28 +8599,32 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
>> * SC:preempt_schedule
>> * SC:preempt_schedule_notrace
>> * SC:irqentry_exit_cond_resched
>> + * SC:irqentry_exit_allow_resched
>> *
>> *
>> * NONE:
>> - * cond_resched <- __cond_resched
>> - * might_resched <- RET0
>> - * preempt_schedule <- NOP
>> - * preempt_schedule_notrace <- NOP
>> - * irqentry_exit_cond_resched <- NOP
>> + * cond_resched <- __cond_resched
>> + * might_resched <- RET0
>> + * preempt_schedule <- NOP
>> + * preempt_schedule_notrace <- NOP
>> + * irqentry_exit_cond_resched <- NOP
>> + * irqentry_exit_allow_resched <- irqentry_exit_allow_resched
>> *
>> * VOLUNTARY:
>> - * cond_resched <- __cond_resched
>> - * might_resched <- __cond_resched
>> - * preempt_schedule <- NOP
>> - * preempt_schedule_notrace <- NOP
>> - * irqentry_exit_cond_resched <- NOP
>> + * cond_resched <- __cond_resched
>> + * might_resched <- __cond_resched
>> + * preempt_schedule <- NOP
>> + * preempt_schedule_notrace <- NOP
>> + * irqentry_exit_cond_resched <- NOP
>> + * irqentry_exit_allow_resched <- irqentry_exit_allow_resched
>> *
>> * FULL:
>> - * cond_resched <- RET0
>> - * might_resched <- RET0
>> - * preempt_schedule <- preempt_schedule
>> - * preempt_schedule_notrace <- preempt_schedule_notrace
>> - * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
>> + * cond_resched <- RET0
>> + * might_resched <- RET0
>> + * preempt_schedule <- preempt_schedule
>> + * preempt_schedule_notrace <- preempt_schedule_notrace
>> + * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
>> + * irqentry_exit_allow_resched <- NOP
>> */
>
> This ^ is all complete nonsense.. irqentry_exit_allow_resched() is not
> a static call. It's an alias of raw_irqentry_exit_cond_resched which
> circumvents the static call entirely.
Yeah you are right. I added the SC:irqentry_exit_allow_resched
prematurely. Will fix.
Thanks for the detailed comments.
--
ankur
* Re: [PATCH 9/9] x86/clear_huge_page: make clear_contig_region() preemptible
2023-04-05 20:27 ` Peter Zijlstra
@ 2023-04-06 17:00 ` Ankur Arora
0 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2023-04-06 17:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, luto,
bp, dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
konrad.wilk
Peter Zijlstra <peterz@infradead.org> writes:
> On Sun, Apr 02, 2023 at 10:22:33PM -0700, Ankur Arora wrote:
>> clear_contig_region() can be used to clear up to a huge-page (2MB/1GB)
>> chunk. Allow preemption in the irqentry_exit path to make sure we don't
>> hold on to the CPU for an arbitrarily long period.
>>
>> Performance: vm-scalability/case-anon-w-seq-hugetlb mmaps an anonymous
>> hugetlb-2mb region, and then writes sequentially to the region, demand
>> faulting pages on the way.
>>
>> This test, with a CONFIG_VOLUNTARY config shows the effects of this
>> change: stime drops (~18% on Icelakex, ~5% on Milan), while the utime
>> goes up (~15% on Icelakex, ~13% on Milan.)
>>
>> *Icelakex* mm/clear_huge_page x86/clear_huge_page change
>> (mem=4GB/task, tasks=128)
>>
>> stime 293.02 +- .49% 239.39 +- .83% -18.30%
>> utime 440.11 +- .28% 508.74 +- .60% +15.59%
>> wall-clock 5.96 +- .33% 6.27 +-2.23% + 5.20%
>>
>>
>>
>> *Milan* mm/clear_huge_page x86/clear_huge_page change
>> (mem=1GB/task, tasks=512)
>>
>> stime 490.95 +- 3.55% 466.90 +- 4.79% - 4.89%
>> utime 276.43 +- 2.85% 311.97 +- 5.15% +12.85%
>> wall-clock 3.74 +- 6.41% 3.58 +- 7.82% - 4.27%
>>
>> The drop in stime is due to REP; STOS being more efficient for bigger
>> extents. The increase in utime is due to cache effects of that change:
>> mm/clear_huge_page() clears page-at-a-time, while narrowing towards the
>> faulting page; while x86/clear_huge_page only optimizes for cache
>> locality in the local neighbourhood of the faulting address.
>>
>> This effect on utime is visible via the increased L1-dcache-load-misses
>> and LLC-load* and an increased backend boundedness for perf user-stat
>> --all-user on Icelakex. The effect is slight but given the heavy cache
>> pressure generated by the test, shows up in the drop in user IPC:
>>
>> - 9,455,243,414,829 instructions # 2.75 insn per cycle ( +- 14.14% ) (46.17%)
>> - 2,367,920,864,112 L1-dcache-loads # 1.054 G/sec ( +- 14.14% ) (69.24%)
>> - 42,075,182,813 L1-dcache-load-misses # 2.96% of all L1-dcache accesses ( +- 14.14% ) (69.24%)
>> - 20,365,688 LLC-loads # 9.064 K/sec ( +- 13.98% ) (69.24%)
>> - 890,382 LLC-load-misses # 7.18% of all LL-cache accesses ( +- 14.91% ) (69.24%)
>>
>> + 9,467,796,660,698 instructions # 2.37 insn per cycle ( +- 14.14% ) (46.16%)
>> + 2,369,973,307,561 L1-dcache-loads # 1.027 G/sec ( +- 14.14% ) (69.24%)
>> + 42,155,621,201 L1-dcache-load-misses # 2.96% of all L1-dcache accesses ( +- 14.14% ) (69.24%)
>> + 22,116,300 LLC-loads # 9.588 K/sec ( +- 14.20% ) (69.24%)
>> + 1,355,607 LLC-load-misses # 10.29% of all LL-cache accesses ( +- 15.49% ) (69.25%)
>>
>> Given that the stime improves for all loads using this path, while the
>> utime drop is load-dependent, add this change.
>
> Either I really need sleep, or *NONE* of the above is actually relevant
> to what the patch below actually does!
Yeah, you are right about the relevance.
I wanted to provide two sets of stats: the raw memory BW stats and the
relevant vm-scalability workload. The vm-scalability workload needs a
more reasonable scheduling model than what's present until this patch
and so it seemed to make sense to put here for that reason.
But yeah it doesn't really fit here.
> The above talks about the glories of using large clears, while the patch
> allows reschedules which are about latency.
Yeah, let me find a more reasonable way to present these.
Ankur
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>> arch/x86/mm/hugetlbpage.c | 10 ++++++++++
>> 1 file changed, 10 insertions(+)
>>
>> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
>> index 4294b77c4f18..c8564b0552e5 100644
>> --- a/arch/x86/mm/hugetlbpage.c
>> +++ b/arch/x86/mm/hugetlbpage.c
>> @@ -158,7 +158,17 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>> static void clear_contig_region(struct page *page, unsigned long vaddr,
>> unsigned int npages)
>> {
>> + might_sleep();
>> +
>> + /*
>> + * We might be clearing a large region.
>> + * Allow rescheduling.
>> + */
>> + allow_resched();
>> clear_user_pages(page_address(page), vaddr, page, npages);
>> + disallow_resched();
>> +
>> + cond_resched();
>> }
>>
>> void clear_huge_page(struct page *page,
>> --
>> 2.31.1
>>
--
ankur
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 8/9] irqentry: define irqentry_exit_allow_resched()
2023-04-06 16:56 ` Ankur Arora
@ 2023-04-06 20:13 ` Peter Zijlstra
2023-04-06 20:16 ` Peter Zijlstra
2023-04-07 2:29 ` Ankur Arora
0 siblings, 2 replies; 30+ messages in thread
From: Peter Zijlstra @ 2023-04-06 20:13 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
konrad.wilk
On Thu, Apr 06, 2023 at 09:56:07AM -0700, Ankur Arora wrote:
>
> Right, so:
>
> CONFIG_PREEMPTION: raw_foo()
> CONFIG_PREEMPTION && (preempt_dynamic_mode == preempt_dynamic_full): raw_foo()
>
> This is NOP if CONFIG_PREEMPTION && preempt_dynamic_mode != preempt_dynamic_full.
>
> > if (!A)
> > raw_foo();
>
> So I would call raw_foo() for the CONFIG_PREEMPTION=n case.
>
> > What you really care about is the CONFIG_PREEMPTION=n case, but that you
> > don't actually handle.
>
> I don't see that. The CONFIG_PREEMPTION=n (or its dynamic version)
> is being handled here.
Bah, I overlooked we have multiple definitions of the
preempt_model_foo() things. For some reason I thought all that was only
for DYNAMIC_PREEMPT.
> > And yeah, you've got the extra resched_allowed() thing in there, but
> > that doesn't make it any better -- this is all quite horrible.
>
> I don't disagree. There are quite a lot of static/dynamic config options
> here and adding this clause didn't improve things.
>
> I think just going with a static call here for the allow-resched case
> might have been cleaner. Alternately I'll see if it can be cleanly
> folded in with the irqentry_exit_cond_resched() logic.
Something like the below perhaps?
---
include/linux/entry-common.h | 6 ++++++
kernel/entry/common.c | 23 +++++++++++++++++++++--
2 files changed, 27 insertions(+), 2 deletions(-)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index d95ab85f96ba..0c365dc1f1c2 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -415,10 +415,16 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
* Conditional reschedule with additional sanity checks.
*/
void raw_irqentry_exit_cond_resched(void);
+void irqentry_exit_cond_resched_tif(void);
+
#ifdef CONFIG_PREEMPT_DYNAMIC
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
#define irqentry_exit_cond_resched_dynamic_enabled raw_irqentry_exit_cond_resched
+#ifdef TIF_RESCHED_ALLOW
+#define irqentry_exit_cond_resched_dynamic_disabled irqentry_exit_cond_resched_tif
+#else
#define irqentry_exit_cond_resched_dynamic_disabled NULL
+#endif
DECLARE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
#define irqentry_exit_cond_resched() static_call(irqentry_exit_cond_resched)()
#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index be61332c66b5..211d118aa672 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -390,6 +390,21 @@ void raw_irqentry_exit_cond_resched(void)
preempt_schedule_irq();
}
}
+
+void irqentry_exit_cond_resched_tif(void)
+{
+#ifdef TIF_RESCHED_ALLOW
+ if (resched_allowed()) {
+ /* Sanity check RCU and thread stack */
+ rcu_irq_exit_check_preempt();
+ if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
+ WARN_ON_ONCE(!on_thread_stack());
+ if (need_resched())
+ preempt_schedule_irq();
+ }
+#endif
+}
+
#ifdef CONFIG_PREEMPT_DYNAMIC
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
@@ -397,8 +412,10 @@ DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
DEFINE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
void dynamic_irqentry_exit_cond_resched(void)
{
- if (!static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
- return;
+ if (!static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched)) {
+ irqentry_exit_cond_resched_tif();
+ return;
+ }
raw_irqentry_exit_cond_resched();
}
#endif
@@ -431,6 +448,8 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
instrumentation_begin();
if (IS_ENABLED(CONFIG_PREEMPTION))
irqentry_exit_cond_resched();
+ else
+ irqentry_exit_cond_resched_tif();
/* Covers both tracing and lockdep */
trace_hardirqs_on();
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 8/9] irqentry: define irqentry_exit_allow_resched()
2023-04-06 20:13 ` Peter Zijlstra
@ 2023-04-06 20:16 ` Peter Zijlstra
2023-04-07 2:29 ` Ankur Arora
1 sibling, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2023-04-06 20:16 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
konrad.wilk
On Thu, Apr 06, 2023 at 10:13:59PM +0200, Peter Zijlstra wrote:
> +void irqentry_exit_cond_resched_tif(void)
> +{
> +#ifdef TIF_RESCHED_ALLOW
> + if (resched_allowed()) {
> + /* Sanity check RCU and thread stack */
> + rcu_irq_exit_check_preempt();
> + if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
> + WARN_ON_ONCE(!on_thread_stack());
> + if (need_resched())
> + preempt_schedule_irq();
arguably this can simply call raw_irqentry_exit_cond_resched(),
probably better than duplicating all that again.
> + }
> +#endif
> +}
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/9] x86/clear_pages: add clear_pages()
2023-04-06 8:23 ` Peter Zijlstra
@ 2023-04-07 0:50 ` Ankur Arora
2023-04-07 10:34 ` Peter Zijlstra
0 siblings, 1 reply; 30+ messages in thread
From: Ankur Arora @ 2023-04-07 0:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, luto,
bp, dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
konrad.wilk
Peter Zijlstra <peterz@infradead.org> writes:
> On Sun, Apr 02, 2023 at 10:22:29PM -0700, Ankur Arora wrote:
>> Add clear_pages() and define the ancillary clear_user_pages().
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>> arch/x86/include/asm/page.h | 6 ++++++
>> arch/x86/include/asm/page_32.h | 6 ++++++
>> arch/x86/include/asm/page_64.h | 9 +++++++--
>> 3 files changed, 19 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
>> index d18e5c332cb9..03e3c69fc427 100644
>> --- a/arch/x86/include/asm/page.h
>> +++ b/arch/x86/include/asm/page.h
>> @@ -28,6 +28,12 @@ static inline void clear_user_page(void *page, unsigned long vaddr,
>> clear_page(page);
>> }
>>
>> +static inline void clear_user_pages(void *page, unsigned long vaddr,
>> + struct page *pg, unsigned int nsubpages)
>> +{
>> + clear_pages(page, nsubpages);
>> +}
>
> This seems dodgy, clear_user* has slightly different semantics. It needs
> the access_ok() and stac/clac thing on at the very least.
That can't be right. On x86, clear_user_page(), copy_user_page() (and
now the multi-page versions) only write to kernel maps of user pages.
That's why they can skip the access_ok(), stac/clac or uaccess
exception handling.
From core-api/cachetlb.rst:
``void copy_user_page(void *to, void *from, unsigned long addr, struct page *page)``
``void clear_user_page(void *to, unsigned long addr, struct page *page)``
These two routines store data in user anonymous or COW
pages. It allows a port to efficiently avoid D-cache alias
issues between userspace and the kernel.
For example, a port may temporarily map 'from' and 'to' to
kernel virtual addresses during the copy. The virtual address
for these two pages is chosen in such a way that the kernel
load/store instructions happen to virtual addresses which are
of the same "color" as the user mapping of the page. Sparc64
for example, uses this technique.
The 'addr' parameter tells the virtual address where the
user will ultimately have this page mapped, and the 'page'
parameter gives a pointer to the struct page of the target.
The naming OTOH does seem dodgy, especially because, as you say, it
suggests semantics similar to clear_user() etc.
On x86, I think it is definitely a mistake for clear_huge_page() to be
calling clear_user_page*() (especially given that it is getting the
kernel map.) Will fix that.
Even for non-x86, I see just two users in common code:
highmem.h: copy_user_highpage(), clear_user_highpage()
fs/dax.c: copy_cow_page_dax()
All of them do a kmap_atomic() so there's really no "may" as documented
above:
For example, a port may temporarily map 'from' and 'to' to
kernel virtual addresses during the copy. The virtual address
Maybe a name change is warranted, if nothing else?
>> +
>> static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
>> struct page *topage)
>> {
>> diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
>> index 580d71aca65a..3523d1150cfc 100644
>> --- a/arch/x86/include/asm/page_32.h
>> +++ b/arch/x86/include/asm/page_32.h
>> @@ -22,6 +22,12 @@ static inline void clear_page(void *page)
>> memset(page, 0, PAGE_SIZE);
>> }
>>
>> +static inline void clear_pages(void *page, unsigned int nsubpages)
>> +{
>> + for (int i = 0; i < nsubpages; i++)
>> + clear_page(page + i * PAGE_SIZE);
>
> cond_resched() ?
Missed that. Thanks. Will fix.
--
ankur
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 8/9] irqentry: define irqentry_exit_allow_resched()
2023-04-06 20:13 ` Peter Zijlstra
2023-04-06 20:16 ` Peter Zijlstra
@ 2023-04-07 2:29 ` Ankur Arora
2023-04-07 10:23 ` Peter Zijlstra
1 sibling, 1 reply; 30+ messages in thread
From: Ankur Arora @ 2023-04-07 2:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, luto,
bp, dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
konrad.wilk
Peter Zijlstra <peterz@infradead.org> writes:
> On Thu, Apr 06, 2023 at 09:56:07AM -0700, Ankur Arora wrote:
>
>>
>> Right, so:
>>
>> CONFIG_PREEMPTION: raw_foo()
>> CONFIG_PREEMPTION && (preempt_dynamic_mode == preempt_dynamic_full): raw_foo()
>>
>> This is NOP if CONFIG_PREEMPTION && preempt_dynamic_mode != preempt_dynamic_full.
>>
>> > if (!A)
>> > raw_foo();
>>
>> So I would call raw_foo() for the CONFIG_PREEMPTION=n case.
>>
>> > What you really care about is the CONFIG_PREEMPTION=n case, but that you
>> > don't actually handle.
>>
>> I don't see that. The CONFIG_PREEMPTION=n (or its dynamic version)
>> is being handled here.
>
> Bah, I overlooked we have multiple definitions of the
> preempt_model_foo() things. For some reason I thought all that was only
> for DYNAMIC_PREEMPT.
>
>> > And yeah, you've got the extra resched_allowed() thing in there, but
>> > that doesn't make it any better -- this is all quite horrible.
>>
>> I don't disagree. There are quite a lot of static/dynamic config options
>> here and adding this clause didn't improve things.
>>
>> I think just going with a static call here for the allow-resched case
>> might have been cleaner. Alternately I'll see if it can be cleanly
>> folded in with the irqentry_exit_cond_resched() logic.
>
> Something like the below perhaps?
>
> ---
> include/linux/entry-common.h | 6 ++++++
> kernel/entry/common.c | 23 +++++++++++++++++++++--
> 2 files changed, 27 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index d95ab85f96ba..0c365dc1f1c2 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -415,10 +415,16 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
> * Conditional reschedule with additional sanity checks.
> */
> void raw_irqentry_exit_cond_resched(void);
> +void irqentry_exit_cond_resched_tif(void);
> +
> #ifdef CONFIG_PREEMPT_DYNAMIC
> #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
> #define irqentry_exit_cond_resched_dynamic_enabled raw_irqentry_exit_cond_resched
> +#ifdef TIF_RESCHED_ALLOW
> +#define irqentry_exit_cond_resched_dynamic_disabled irqentry_exit_cond_resched_tif
> +#else
> #define irqentry_exit_cond_resched_dynamic_disabled NULL
> +#endif
So this is clever. Essentially this would toggle between the two kinds
for the preempt_model_preemptible()/!preempt_model_preemptible() dynamic
case. Do I have that right?
> DECLARE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
> #define irqentry_exit_cond_resched() static_call(irqentry_exit_cond_resched)()
> #elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index be61332c66b5..211d118aa672 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -390,6 +390,21 @@ void raw_irqentry_exit_cond_resched(void)
> preempt_schedule_irq();
> }
> }
> +
> +void irqentry_exit_cond_resched_tif(void)
> +{
> +#ifdef TIF_RESCHED_ALLOW
> + if (resched_allowed()) {
> + /* Sanity check RCU and thread stack */
> + rcu_irq_exit_check_preempt();
> + if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
> + WARN_ON_ONCE(!on_thread_stack());
> + if (need_resched())
> + preempt_schedule_irq();
> + }
> +#endif
> +}
> +
> #ifdef CONFIG_PREEMPT_DYNAMIC
> #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
> DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
> @@ -397,8 +412,10 @@ DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
> DEFINE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
> void dynamic_irqentry_exit_cond_resched(void)
> {
> - if (!static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
> - return;
> + if (!static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched)) {
> + irqentry_exit_cond_resched_tif();
> > + return;
> + }
> raw_irqentry_exit_cond_resched();
> }
> #endif
> @@ -431,6 +448,8 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
> instrumentation_begin();
> if (IS_ENABLED(CONFIG_PREEMPTION))
> irqentry_exit_cond_resched();
> + else
> + irqentry_exit_cond_resched_tif();
And, if we choose between the two resched modes at compile time then this
would work.
Might massage the names a little but this should work as is. Okay if I
use your Co-developed-by/Suggested-by on this patch?
--
ankur
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 4/9] x86/clear_page: parameterize clear_page*() to specify length
2023-04-06 8:19 ` Peter Zijlstra
@ 2023-04-07 3:03 ` Ankur Arora
0 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2023-04-07 3:03 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, luto,
bp, dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
konrad.wilk
Peter Zijlstra <peterz@infradead.org> writes:
> On Sun, Apr 02, 2023 at 10:22:28PM -0700, Ankur Arora wrote:
>> Change clear_page*() to take a length parameter.
>> Rename to clear_pages_*().
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>
>> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
>> index ecbfb4dd3b01..6069acf6072f 100644
>> --- a/arch/x86/lib/clear_page_64.S
>> +++ b/arch/x86/lib/clear_page_64.S
> So it seems to me that clear_user_*() and clear_page_*() are now very
> similar; is there really no way to de-duplicate all that?
I'm not sure I see a good way to do that. The clear_page_*() variants
are way simpler because they don't do alignment or uacess exception
handling.
The innermost loop in them could be unified -- but both clear_pages_rep()
and clear_pages_erms() are pretty trivial.
One place where it might be worth doing is clear_user_original() and
clear_pages_orig(). These two have also diverged some (clear_pages_orig
unrolls its loop and uses a register mov):
inner loop in clear_pages_orig:
.Lloop:
decl %ecx
#define PUT(x) movq %rax,x*8(%rdi)
movq %rax,(%rdi)
PUT(1)
PUT(2)
PUT(3)
PUT(4)
PUT(5)
PUT(6)
PUT(7)
leaq 64(%rdi),%rdi
jnz .Lloop
nop
inner loop in clear_user_original:
.Lqwords:
movq $0,(%rdi)
lea 8(%rdi),%rdi
dec %rcx
jnz .Lqwords
Maybe something like this?
---
arch/x86/lib/clear_page_64.S | 35 +++++------------------------------
1 file changed, 5 insertions(+), 30 deletions(-)
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 6069acf6072f..795a82214d99 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -28,36 +28,6 @@ SYM_FUNC_START(clear_pages_rep)
SYM_FUNC_END(clear_pages_rep)
EXPORT_SYMBOL_GPL(clear_pages_rep)
-/*
- * Original page zeroing loop.
- * %rdi - page
- * %esi - page-length
- *
- * Clobbers: %rax, %rcx, %rflags
- */
-SYM_FUNC_START(clear_pages_orig)
- movl %esi, %ecx
- xorl %eax,%eax
- shrl $6,%ecx
- .p2align 4
-.Lloop:
- decl %ecx
-#define PUT(x) movq %rax,x*8(%rdi)
- movq %rax,(%rdi)
- PUT(1)
- PUT(2)
- PUT(3)
- PUT(4)
- PUT(5)
- PUT(6)
- PUT(7)
- leaq 64(%rdi),%rdi
- jnz .Lloop
- nop
- RET
-SYM_FUNC_END(clear_pages_orig)
-EXPORT_SYMBOL_GPL(clear_pages_orig)
-
/*
* Zero a page.
* %rdi - page
@@ -92,6 +62,9 @@ SYM_FUNC_START(clear_user_original)
jz .Lrest_bytes
# do the qwords first
+SYM_INNER_LABEL(clear_pages_orig, SYM_L_GLOBAL)
+ movl %esi, %ecx
+ shr $3,%rcx
.p2align 4
.Lqwords:
movq $0,(%rdi)
@@ -135,6 +108,8 @@ SYM_FUNC_START(clear_user_original)
SYM_FUNC_END(clear_user_original)
EXPORT_SYMBOL(clear_user_original)
+EXPORT_SYMBOL_GPL(clear_pages_orig)
+
/*
* Alternative clear user-space when CPU feature X86_FEATURE_REP_GOOD is
* present.
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 8/9] irqentry: define irqentry_exit_allow_resched()
2023-04-07 2:29 ` Ankur Arora
@ 2023-04-07 10:23 ` Peter Zijlstra
0 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2023-04-07 10:23 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
konrad.wilk
On Thu, Apr 06, 2023 at 07:29:59PM -0700, Ankur Arora wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> > Something like the below perhaps?
> >
> > ---
> > include/linux/entry-common.h | 6 ++++++
> > kernel/entry/common.c | 23 +++++++++++++++++++++--
> > 2 files changed, 27 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> > index d95ab85f96ba..0c365dc1f1c2 100644
> > --- a/include/linux/entry-common.h
> > +++ b/include/linux/entry-common.h
> > @@ -415,10 +415,16 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
> > * Conditional reschedule with additional sanity checks.
> > */
> > void raw_irqentry_exit_cond_resched(void);
> > +void irqentry_exit_cond_resched_tif(void);
> > +
> > #ifdef CONFIG_PREEMPT_DYNAMIC
> > #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
> > #define irqentry_exit_cond_resched_dynamic_enabled raw_irqentry_exit_cond_resched
> > +#ifdef TIF_RESCHED_ALLOW
> > +#define irqentry_exit_cond_resched_dynamic_disabled irqentry_exit_cond_resched_tif
> > +#else
> > #define irqentry_exit_cond_resched_dynamic_disabled NULL
> > +#endif
>
> So this is clever. Essentially this would toggle between the two kinds
> for the preempt_model_preemptible()/!preempt_model_preemptible() dynamic
> case. Do I have that right?
Exactly!
> > DECLARE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
> > #define irqentry_exit_cond_resched() static_call(irqentry_exit_cond_resched)()
> > #elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
> > diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> > index be61332c66b5..211d118aa672 100644
> > --- a/kernel/entry/common.c
> > +++ b/kernel/entry/common.c
> > @@ -390,6 +390,21 @@ void raw_irqentry_exit_cond_resched(void)
> > preempt_schedule_irq();
> > }
> > }
> > +
> > +void irqentry_exit_cond_resched_tif(void)
> > +{
> > +#ifdef TIF_RESCHED_ALLOW
> > + if (resched_allowed()) {
> > + /* Sanity check RCU and thread stack */
> > + rcu_irq_exit_check_preempt();
> > + if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
> > + WARN_ON_ONCE(!on_thread_stack());
> > + if (need_resched())
> > + preempt_schedule_irq();
> > + }
> > +#endif
> > +}
> > +
> > #ifdef CONFIG_PREEMPT_DYNAMIC
> > #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
> > DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
> > @@ -397,8 +412,10 @@ DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
> > DEFINE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
> > void dynamic_irqentry_exit_cond_resched(void)
> > {
> > - if (!static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
> > - return;
> > + if (!static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched)) {
> > + irqentry_exit_cond_resched_tif();
> > + return;
> > + }
> > raw_irqentry_exit_cond_resched();
> > }
> > #endif
> > @@ -431,6 +448,8 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
> > instrumentation_begin();
> > if (IS_ENABLED(CONFIG_PREEMPTION))
> > irqentry_exit_cond_resched();
> > + else
> > + irqentry_exit_cond_resched_tif();
>
> And, if we choose between the two resched modes at compile time then this
> would work.
Just so.
> Might massage the names a little but this should work as is.
Yeah, I'm not liking the naming either, perhaps your
irqentry_exit_allow_resched() would've been better, dunno, see what
works.
> Okay if I use your Co-developed-by/Suggested-by on this patch?
Yep.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/9] x86/clear_pages: add clear_pages()
2023-04-07 0:50 ` Ankur Arora
@ 2023-04-07 10:34 ` Peter Zijlstra
2023-04-09 13:26 ` Matthew Wilcox
0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-04-07 10:34 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, rostedt,
tglx, vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
konrad.wilk
On Thu, Apr 06, 2023 at 05:50:18PM -0700, Ankur Arora wrote:
>
> Peter Zijlstra <peterz@infradead.org> writes:
>
> > On Sun, Apr 02, 2023 at 10:22:29PM -0700, Ankur Arora wrote:
> >> Add clear_pages() and define the ancillary clear_user_pages().
> >>
> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> >> ---
> >> arch/x86/include/asm/page.h | 6 ++++++
> >> arch/x86/include/asm/page_32.h | 6 ++++++
> >> arch/x86/include/asm/page_64.h | 9 +++++++--
> >> 3 files changed, 19 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> >> index d18e5c332cb9..03e3c69fc427 100644
> >> --- a/arch/x86/include/asm/page.h
> >> +++ b/arch/x86/include/asm/page.h
> >> @@ -28,6 +28,12 @@ static inline void clear_user_page(void *page, unsigned long vaddr,
> >> clear_page(page);
> >> }
> >>
> >> +static inline void clear_user_pages(void *page, unsigned long vaddr,
> >> + struct page *pg, unsigned int nsubpages)
> >> +{
> >> + clear_pages(page, nsubpages);
> >> +}
> >
> > This seems dodgy, clear_user* has slightly different semantics. It needs
> > the access_ok() and stac/clac thing on at the very least.
>
> That can't be right. On x86, clear_user_page(), copy_user_page() (and
> now the multi-page versions) only write to kernel maps of user pages.
> That's why they can skip the access_ok(), stac/clac or uaccess
> exception handling.
Bah, that namespace is a mess :/
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/9] x86/clear_huge_page: multi-page clearing
2023-04-05 19:48 ` [PATCH 0/9] x86/clear_huge_page: multi-page clearing Raghavendra K T
@ 2023-04-08 22:46 ` Ankur Arora
2023-04-10 6:26 ` Raghavendra K T
0 siblings, 1 reply; 30+ messages in thread
From: Ankur Arora @ 2023-04-08 22:46 UTC (permalink / raw)
To: Raghavendra K T
Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, luto,
bp, dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, peterz,
rostedt, tglx, vincent.guittot, jon.grimm, bharata,
boris.ostrovsky, konrad.wilk
Raghavendra K T <raghavendra.kt@amd.com> writes:
> On 4/3/2023 10:52 AM, Ankur Arora wrote:
>> This series introduces multi-page clearing for hugepages.
> *Milan* mm/clear_huge_page x86/clear_huge_page change
> (GB/s) (GB/s)
> pg-sz=2MB 12.24 17.54 +43.30%
> pg-sz=1GB 17.98 37.24 +107.11%
>
>
> Hello Ankur,
>
> Was able to test your patches. To summarize, am seeing 2x-3x perf
> improvement for 2M, 1GB base hugepage sizes.
Great. Thanks Raghavendra.
> SUT: Genoa AMD EPYC
> Thread(s) per core: 2
> Core(s) per socket: 128
> Socket(s): 2
>
> NUMA:
> NUMA node(s): 2
> NUMA node0 CPU(s): 0-127,256-383
> NUMA node1 CPU(s): 128-255,384-511
>
> Test: Use mmap(MAP_HUGETLB) to demand a fault on 64GB region (NUMA node0), for
> both base-hugepage-size=2M and 1GB
>
> perf stat -r 10 -d -d numactl -m 0 -N 0 <test>
>
> time in seconds elapsed (average of 10 runs) (lower = better)
>
> Result:
> page-size mm/clear_huge_page x86/clear_huge_page
> 2M 5.4567 2.6774
> 1G 2.64452 1.011281
So translating into BW, for Genoa we have:
page-size mm/clear_huge_page x86/clear_huge_page
2M 11.74 23.97
1G 24.24 63.36
That's a pretty good bump over Milan:
> *Milan* mm/clear_huge_page x86/clear_huge_page
> (GB/s) (GB/s)
> pg-sz=2MB 12.24 17.54
> pg-sz=1GB 17.98 37.24
Btw, are these numbers with boost=1?
> Full perfstat info
>
> page size = 2M mm/clear_huge_page
>
> Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_2M' (10 runs):
>
>          5,434.71 msec task-clock              #    0.996 CPUs utilized              ( +-  0.55% )
>                 8      context-switches        #    1.466 /sec                       ( +-  4.66% )
>                 0      cpu-migrations          #    0.000 /sec
>            32,918      page-faults             #    6.034 K/sec                      ( +-  0.00% )
>    16,977,242,482      cycles                  #    3.112 GHz                        ( +-  0.04% )  (35.70%)
>         1,961,724      stalled-cycles-frontend #    0.01% frontend cycles idle       ( +-  1.09% )  (35.72%)
>        35,685,674      stalled-cycles-backend  #    0.21% backend cycles idle        ( +-  3.48% )  (35.74%)
>     1,038,327,182      instructions            #    0.06 insn per cycle
>                                                #    0.04 stalled cycles per insn     ( +-  0.38% )  (35.75%)
>       221,409,216      branches                #   40.584 M/sec                      ( +-  0.36% )  (35.75%)
>           350,730      branch-misses           #    0.16% of all branches            ( +-  1.18% )  (35.75%)
>     2,520,888,779      L1-dcache-loads         #  462.077 M/sec                      ( +-  0.03% )  (35.73%)
>     1,094,178,209      L1-dcache-load-misses   #   43.46% of all L1-dcache accesses  ( +-  0.02% )  (35.71%)
>        67,751,730      L1-icache-loads         #   12.419 M/sec                      ( +-  0.11% )  (35.70%)
>           271,118      L1-icache-load-misses   #    0.40% of all L1-icache accesses  ( +-  2.55% )  (35.70%)
>           506,635      dTLB-loads              #   92.866 K/sec                      ( +-  3.31% )  (35.70%)
>           237,385      dTLB-load-misses        #   43.64% of all dTLB cache accesses ( +-  7.00% )  (35.69%)
>               268      iTLB-load-misses        # 6700.00% of all iTLB cache accesses ( +- 13.86% )  (35.70%)
>
> 5.4567 +- 0.0300 seconds time elapsed ( +- 0.55% )
>
> page size = 2M x86/clear_huge_page
> Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_2M' (10 runs):
>
>          2,780.69 msec task-clock              #    1.039 CPUs utilized              ( +-  1.03% )
>                 3      context-switches        #    1.121 /sec                       ( +- 21.34% )
>                 0      cpu-migrations          #    0.000 /sec
>            32,918      page-faults             #   12.301 K/sec                      ( +-  0.00% )
>     8,143,619,771      cycles                  #    3.043 GHz                        ( +-  0.25% )  (35.62%)
>         2,024,872      stalled-cycles-frontend #    0.02% frontend cycles idle       ( +-320.93% )  (35.66%)
>       717,198,728      stalled-cycles-backend  #    8.82% backend cycles idle        ( +-  8.26% )  (35.69%)
>       606,549,334      instructions            #    0.07 insn per cycle
>                                                #    1.39 stalled cycles per insn     ( +-  0.23% )  (35.73%)
>       108,856,550      branches                #   40.677 M/sec                      ( +-  0.24% )  (35.76%)
>           202,490      branch-misses           #    0.18% of all branches            ( +-  3.58% )  (35.78%)
>     2,348,818,806      L1-dcache-loads         #  877.701 M/sec                      ( +-  0.03% )  (35.78%)
>     1,081,562,988      L1-dcache-load-misses   #   46.04% of all L1-dcache accesses  ( +-  0.01% )  (35.78%)
>   <not supported>      LLC-loads
>   <not supported>      LLC-load-misses
>        43,411,167      L1-icache-loads         #   16.222 M/sec                      ( +-  0.19% )  (35.77%)
>           273,042      L1-icache-load-misses   #    0.64% of all L1-icache accesses  ( +-  4.94% )  (35.76%)
>           834,482      dTLB-loads              #  311.827 K/sec                      ( +-  9.73% )  (35.72%)
>           437,343      dTLB-load-misses        #   65.86% of all dTLB cache accesses ( +-  8.56% )  (35.68%)
>                 0      iTLB-loads              #    0.000 /sec                       (35.65%)
>               160      iTLB-load-misses        # 1777.78% of all iTLB cache accesses ( +- 15.82% )  (35.62%)
>
> 2.6774 +- 0.0287 seconds time elapsed ( +- 1.07% )
>
> page size = 1G mm/clear_huge_page
> Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):
>
>          2,625.24 msec task-clock              #    0.993 CPUs utilized              ( +-  0.23% )
>                 4      context-switches        #    1.513 /sec                       ( +-  4.49% )
>                 1      cpu-migrations          #    0.378 /sec
>               214      page-faults             #   80.965 /sec                       ( +-  0.13% )
>     8,178,624,349      cycles                  #    3.094 GHz                        ( +-  0.23% )  (35.65%)
>         2,942,576      stalled-cycles-frontend #    0.04% frontend cycles idle       ( +- 75.22% )  (35.69%)
>         7,117,425      stalled-cycles-backend  #    0.09% backend cycles idle        ( +-  3.79% )  (35.73%)
>       454,521,647      instructions            #    0.06 insn per cycle
>                                                #    0.02 stalled cycles per insn     ( +-  0.10% )  (35.77%)
>       113,223,853      branches                #   42.837 M/sec                      ( +-  0.08% )  (35.80%)
>            84,766      branch-misses           #    0.07% of all branches            ( +-  5.37% )  (35.80%)
>     2,294,528,890      L1-dcache-loads         #  868.111 M/sec                      ( +-  0.02% )  (35.81%)
>     1,075,907,551      L1-dcache-load-misses   #   46.88% of all L1-dcache accesses  ( +-  0.02% )  (35.78%)
>        26,167,323      L1-icache-loads         #    9.900 M/sec                      ( +-  0.24% )  (35.74%)
>           139,675      L1-icache-load-misses   #    0.54% of all L1-icache accesses  ( +-  0.37% )  (35.70%)
>             3,459      dTLB-loads              #    1.309 K/sec                      ( +- 12.75% )  (35.67%)
>               732      dTLB-load-misses        #   19.71% of all dTLB cache accesses ( +- 26.61% )  (35.62%)
>                11      iTLB-load-misses        #  192.98% of all iTLB cache accesses ( +-238.28% )  (35.62%)
>
> 2.64452 +- 0.00600 seconds time elapsed ( +- 0.23% )
>
>
> page size = 1G x86/clear_huge_page
> Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):
>
>          1,009.09 msec task-clock              #    0.998 CPUs utilized              ( +-  0.06% )
>                 2      context-switches        #    1.980 /sec                       ( +- 23.63% )
>                 1      cpu-migrations          #    0.990 /sec
>               214      page-faults             #  211.887 /sec                       ( +-  0.16% )
>     3,154,980,463      cycles                  #    3.124 GHz                        ( +-  0.06% )  (35.77%)
>           145,051      stalled-cycles-frontend #    0.00% frontend cycles idle       ( +-  6.26% )  (35.78%)
>       730,087,143      stalled-cycles-backend  #   23.12% backend cycles idle        ( +-  9.75% )  (35.78%)
>        45,813,391      instructions            #    0.01 insn per cycle
>                                                #   18.51 stalled cycles per insn     ( +-  1.00% )  (35.78%)
>         8,498,282      branches                #    8.414 M/sec                      ( +-  1.54% )  (35.78%)
>            63,351      branch-misses           #    0.74% of all branches            ( +-  6.70% )  (35.69%)
>        29,135,863      L1-dcache-loads         #   28.848 M/sec                      ( +-  5.67% )  (35.68%)
>         8,537,280      L1-dcache-load-misses   #   28.66% of all L1-dcache accesses  ( +- 10.15% )  (35.68%)
>         1,040,087      L1-icache-loads         #    1.030 M/sec                      ( +-  1.60% )  (35.68%)
>             9,147      L1-icache-load-misses   #    0.85% of all L1-icache accesses  ( +-  6.50% )  (35.67%)
>             1,084      dTLB-loads              #    1.073 K/sec                      ( +- 12.05% )  (35.68%)
>               431      dTLB-load-misses        #   40.28% of all dTLB cache accesses ( +- 43.46% )  (35.68%)
>                16      iTLB-load-misses        #    0.00% of all iTLB cache accesses ( +- 40.54% )  (35.68%)
>
> 1.011281 +- 0.000624 seconds time elapsed ( +- 0.06% )
>
> Please feel free to add
>
> Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Thanks
Ankur
> Will come back with further observations on patch/performance if any
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/9] x86/clear_pages: add clear_pages()
2023-04-07 10:34 ` Peter Zijlstra
@ 2023-04-09 13:26 ` Matthew Wilcox
0 siblings, 0 replies; 30+ messages in thread
From: Matthew Wilcox @ 2023-04-09 13:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, luto,
bp, dave.hansen, hpa, mingo, juri.lelli, mgorman, rostedt, tglx,
vincent.guittot, jon.grimm, bharata, boris.ostrovsky,
konrad.wilk
On Fri, Apr 07, 2023 at 12:34:44PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 06, 2023 at 05:50:18PM -0700, Ankur Arora wrote:
> >
> > Peter Zijlstra <peterz@infradead.org> writes:
> >
> > > On Sun, Apr 02, 2023 at 10:22:29PM -0700, Ankur Arora wrote:
> > >> Add clear_pages() and define the ancillary clear_user_pages().
> > >>
> > >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> > >> ---
> > >> arch/x86/include/asm/page.h | 6 ++++++
> > >> arch/x86/include/asm/page_32.h | 6 ++++++
> > >> arch/x86/include/asm/page_64.h | 9 +++++++--
> > >> 3 files changed, 19 insertions(+), 2 deletions(-)
> > >>
> > >> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> > >> index d18e5c332cb9..03e3c69fc427 100644
> > >> --- a/arch/x86/include/asm/page.h
> > >> +++ b/arch/x86/include/asm/page.h
> > >> @@ -28,6 +28,12 @@ static inline void clear_user_page(void *page, unsigned long vaddr,
> > >> clear_page(page);
> > >> }
> > >>
> > >> +static inline void clear_user_pages(void *page, unsigned long vaddr,
> > >> + struct page *pg, unsigned int nsubpages)
> > >> +{
> > >> + clear_pages(page, nsubpages);
> > >> +}
> > >
> > > This seems dodgy, clear_user* has slightly different semantics. It needs
> > > the access_ok() and stac/clac thing on at the very least.
> >
> > That can't be right. On x86, clear_user_page(), copy_user_page() (and
> > now the multi-page versions) only write to kernel maps of user pages.
> > That's why they can skip the access_ok(), stac/clac or uacess
> > exception handling.
>
> Bah, that namespace is a mess :/
What (I think) it's supposed to be is that clear_page() works on kernel
pages that are never seen by userspace, while clear_user_page() works
on kernel mappings of pages the user can definitely see. This makes
no difference on x86, but some architectures can skip a lot of cache
flushing.
* Re: [PATCH 0/9] x86/clear_huge_page: multi-page clearing
2023-04-08 22:46 ` Ankur Arora
@ 2023-04-10 6:26 ` Raghavendra K T
0 siblings, 0 replies; 30+ messages in thread
From: Raghavendra K T @ 2023-04-10 6:26 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, linux-mm, x86, torvalds, akpm, luto, bp,
dave.hansen, hpa, mingo, juri.lelli, willy, mgorman, peterz,
rostedt, tglx, vincent.guittot, jon.grimm, bharata,
boris.ostrovsky, konrad.wilk
On 4/9/2023 4:16 AM, Ankur Arora wrote:
>
> Raghavendra K T <raghavendra.kt@amd.com> writes:
>
>> On 4/3/2023 10:52 AM, Ankur Arora wrote:
>>> This series introduces multi-page clearing for hugepages.
>
>> *Milan* mm/clear_huge_page x86/clear_huge_page change
>> (GB/s) (GB/s)
>> pg-sz=2MB 12.24 17.54 +43.30%
>> pg-sz=1GB 17.98 37.24 +107.11%
>>
>>
>> Hello Ankur,
>>
>> Was able to test your patches. To summarize, am seeing 2x-3x perf
>> improvement for 2M, 1GB base hugepage sizes.
>
> Great. Thanks Raghavendra.
>
>> SUT: Genoa AMD EPYC
>> Thread(s) per core: 2
>> Core(s) per socket: 128
>> Socket(s): 2
>>
>> NUMA:
>> NUMA node(s): 2
>> NUMA node0 CPU(s): 0-127,256-383
>> NUMA node1 CPU(s): 128-255,384-511
>>
>> Test: Use mmap(MAP_HUGETLB) to demand a fault on 64GB region (NUMA node0), for
>> both base-hugepage-size=2M and 1GB
>>
>> perf stat -r 10 -d -d numactl -m 0 -N 0 <test>
>>
>> time in seconds elapsed (average of 10 runs) (lower = better)
>>
>> Result:
>> page-size mm/clear_huge_page x86/clear_huge_page
>> 2M 5.4567 2.6774
>> 1G 2.64452 1.011281
>
> So translating into BW, for Genoa we have:
>
> page-size mm/clear_huge_page x86/clear_huge_page
> 2M 11.74 23.97
> 1G 24.24 63.36
>
> That's a pretty good bump over Milan:
>
>> *Milan* mm/clear_huge_page x86/clear_huge_page
>> (GB/s) (GB/s)
>> pg-sz=2MB 12.24 17.54
>> pg-sz=1GB 17.98 37.24
>
> Btw, are these numbers with boost=1?
>
Yes, it is. Also, a note about the config: I had not enabled the
GCOV/LOCKSTAT-related config options because I faced some issues with them.
end of thread, other threads:[~2023-04-10 6:26 UTC | newest]
Thread overview: 30+ messages
-- links below jump to the message on this page --
2023-04-03 5:22 [PATCH 0/9] x86/clear_huge_page: multi-page clearing Ankur Arora
2023-04-03 5:22 ` [PATCH 1/9] huge_pages: get rid of process_huge_page() Ankur Arora
2023-04-03 5:22 ` [PATCH 2/9] huge_page: get rid of {clear,copy}_subpage() Ankur Arora
2023-04-03 5:22 ` [PATCH 3/9] huge_page: allow arch override for clear/copy_huge_page() Ankur Arora
2023-04-03 5:22 ` [PATCH 4/9] x86/clear_page: parameterize clear_page*() to specify length Ankur Arora
2023-04-06 8:19 ` Peter Zijlstra
2023-04-07 3:03 ` Ankur Arora
2023-04-03 5:22 ` [PATCH 5/9] x86/clear_pages: add clear_pages() Ankur Arora
2023-04-06 8:23 ` Peter Zijlstra
2023-04-07 0:50 ` Ankur Arora
2023-04-07 10:34 ` Peter Zijlstra
2023-04-09 13:26 ` Matthew Wilcox
2023-04-03 5:22 ` [PATCH 6/9] mm/clear_huge_page: use multi-page clearing Ankur Arora
2023-04-03 5:22 ` [PATCH 7/9] sched: define TIF_ALLOW_RESCHED Ankur Arora
2023-04-05 20:07 ` Peter Zijlstra
2023-04-03 5:22 ` [PATCH 8/9] irqentry: define irqentry_exit_allow_resched() Ankur Arora
2023-04-04 9:38 ` Thomas Gleixner
2023-04-05 5:29 ` Ankur Arora
2023-04-05 20:22 ` Peter Zijlstra
2023-04-06 16:56 ` Ankur Arora
2023-04-06 20:13 ` Peter Zijlstra
2023-04-06 20:16 ` Peter Zijlstra
2023-04-07 2:29 ` Ankur Arora
2023-04-07 10:23 ` Peter Zijlstra
2023-04-03 5:22 ` [PATCH 9/9] x86/clear_huge_page: make clear_contig_region() preemptible Ankur Arora
2023-04-05 20:27 ` Peter Zijlstra
2023-04-06 17:00 ` Ankur Arora
2023-04-05 19:48 ` [PATCH 0/9] x86/clear_huge_page: multi-page clearing Raghavendra K T
2023-04-08 22:46 ` Ankur Arora
2023-04-10 6:26 ` Raghavendra K T