* [RFC PATCH 1/4] mm: Add speculative numa fault support
2021-12-12 11:31 [RFC PATCH 0/4] Add speculative numa fault support Baolin Wang
@ 2021-12-12 11:31 ` Baolin Wang
2021-12-12 11:31 ` [RFC PATCH 2/4] mm: Add a debug interface to control the range of speculative numa fault Baolin Wang
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: Baolin Wang @ 2021-12-12 11:31 UTC (permalink / raw)
To: akpm, ying.huang, dave.hansen
Cc: ziy, shy828301, baolin.wang, zhongjiang-ali, xlpang, linux-mm,
linux-kernel
Some workloads access sets of data with strong data locality, also known as
locality of reference: after some data is accessed, nearby data is likely to
be accessed soon.
On systems with multiple memory types, NUMA balancing is relied on to promote
hot pages from slow memory to fast memory to improve performance.
For such workloads we can therefore exploit the data locality and promote
several sequential pages on slow memory in advance, to improve performance.
Thus this patch adds a speculative numa fault mechanism that helps migrate
suitable pages in advance. The basic concept is: a new member is added to
each VMA to record the numa fault window, which holds the last numa fault
address and the number of pages to be migrated to the target node. When a
numa fault occurs, we check the last numa fault window of the current VMA to
see whether this is a sequential access stream. If so, we expand the numa
fault window; if not, we shrink the numa fault window or close it entirely
to avoid unnecessary migration.
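The per-VMA window bookkeeping described above can be sketched in plain C: the last fault address (page aligned) and the window size in pages are packed into a single unsigned long, as the patch does with numafault_ahead_info. The snippet below is a userspace illustration only, assuming a 4K page size; the ex_ names are invented for the example.

```c
#include <assert.h>

/*
 * Illustration only: pack the next expected fault address (page aligned)
 * and the window size in pages into one unsigned long, as the patch does
 * with vma->numafault_ahead_info. Assumes a 4K page size.
 */
#define EX_PAGE_SHIFT		12
#define EX_PAGE_SIZE		(1UL << EX_PAGE_SHIFT)
#define EX_PAGE_MASK		(~(EX_PAGE_SIZE - 1))
#define EX_WINDOW_SIZE_MASK	(EX_PAGE_SIZE - 1)

static inline unsigned long ex_fault_info(unsigned long addr,
					  unsigned long win_pages)
{
	return (addr & EX_PAGE_MASK) | (win_pages & EX_WINDOW_SIZE_MASK);
}

static inline unsigned long ex_window_start(unsigned long info)
{
	return info & EX_PAGE_MASK;
}

static inline unsigned long ex_window_pages(unsigned long info)
{
	return info & EX_WINDOW_SIZE_MASK;
}
```

Because the start address is page aligned, its low 12 bits are always zero, so a window size of up to 4095 pages fits in the same word with no extra storage.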
Testing with MySQL shows about a 6% performance improvement, as below.
Machine: 16 CPUs, 64G DRAM, 256G AEP
sysbench /usr/share/sysbench/tests/include/oltp_legacy/oltp.lua
--mysql-user=root --mysql-password=root --oltp-test-mode=complex
--oltp-tables-count=80 --oltp-table-size=5000000 --threads=20 --time=600
--report-interval=10 prepare/run
No speculative numa fault:
queries performed:
read: 33039860
write: 9439960
other: 4719980
total: 47199800
transactions: 2359990 (3933.28 per sec.)
queries: 47199800 (78665.50 per sec.)
Speculative numa fault:
queries performed:
read: 34896862
write: 9970532
other: 4985266
total: 49852660
transactions: 2492633 (4154.35 per sec.)
queries: 49852660 (83086.94 per sec.)
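As a sanity check on the quoted figure, the transaction rates above (3933.28 vs. 4154.35 per second) work out to roughly a 5.6% improvement, which the text rounds to about 6%. A trivial helper makes the arithmetic explicit:

```c
#include <assert.h>

/* Percent improvement of a new rate over a baseline rate. */
static double percent_improvement(double baseline, double new_rate)
{
	return (new_rate - baseline) / baseline * 100.0;
}
```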
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
include/linux/mm_types.h | 3 +
mm/memory.c | 165 ++++++++++++++++++++++++++++++++++++---
2 files changed, 159 insertions(+), 9 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 449b6eafc695..8d8381e9aec9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -474,6 +474,9 @@ struct vm_area_struct {
#endif
#ifdef CONFIG_NUMA
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+ atomic_long_t numafault_ahead_info;
#endif
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
} __randomize_layout;
diff --git a/mm/memory.c b/mm/memory.c
index 2291417783bc..2c9ed63e4e23 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -74,6 +74,8 @@
#include <linux/ptrace.h>
#include <linux/vmalloc.h>
#include <linux/sched/sysctl.h>
+#include <linux/pagewalk.h>
+#include <linux/page_idle.h>
#include <trace/events/kmem.h>
@@ -4315,16 +4317,156 @@ int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
return mpol_misplaced(page, vma, addr);
}
+static bool try_next_numa_page(struct vm_fault *vmf, unsigned int win_pages,
+ unsigned long *fault_addr)
+{
+ unsigned long next_fault_addr = *fault_addr + PAGE_SIZE;
+ unsigned long numa_fault_end = vmf->address + (win_pages + 1) * PAGE_SIZE;
+
+ if (next_fault_addr > numa_fault_end)
+ return false;
+
+ *fault_addr = next_fault_addr;
+ vmf->pte = pte_offset_map(vmf->pmd, next_fault_addr);
+ vmf->orig_pte = *vmf->pte;
+ if (pte_protnone(vmf->orig_pte))
+ return true;
+
+ return false;
+}
+
+#define NUMA_FAULT_AHEAD_DEFAULT 2
+#define NUMA_FAULT_EXPAND_STEP 1
+#define NUMA_FAULT_REDUCE_STEP 2
+#define GET_NUMA_FAULT_INFO(vma) \
+ (atomic_long_read(&(vma)->numafault_ahead_info))
+#define NUMA_FAULT_WINDOW_START(v) ((v) & PAGE_MASK)
+#define NUMA_FAULT_WINDOW_SIZE_MASK ((1UL << PAGE_SHIFT) - 1)
+#define NUMA_FAULT_WINDOW_SIZE(v) ((v) & NUMA_FAULT_WINDOW_SIZE_MASK)
+#define NUMA_FAULT_INFO(addr, win) \
+ (((addr) & PAGE_MASK) | \
+ ((win) & NUMA_FAULT_WINDOW_SIZE_MASK))
+
+static inline unsigned int numa_fault_max_pages(struct vm_area_struct *vma,
+ unsigned long fault_address)
+{
+ unsigned long pmd_end_addr = (fault_address & PMD_MASK) + PMD_SIZE;
+ unsigned long max_fault_addr = min_t(unsigned long, pmd_end_addr,
+ vma->vm_end);
+
+ return (max_fault_addr - fault_address - 1) >> PAGE_SHIFT;
+}
+
+static unsigned int adjust_numa_fault_window(struct vm_area_struct *vma,
+ unsigned long fault_address)
+{
+ unsigned long numafault_ahead = GET_NUMA_FAULT_INFO(vma);
+ unsigned long prev_start = NUMA_FAULT_WINDOW_START(numafault_ahead);
+ unsigned int prev_pages = NUMA_FAULT_WINDOW_SIZE(numafault_ahead);
+ unsigned long win_start;
+ unsigned int win_pages, max_fault_pages;
+
+ win_start = fault_address + PAGE_SIZE;
+
+ /*
+ * On the first fault in this VMA, just open a small window to try.
+ */
+ if (!numafault_ahead) {
+ win_pages = NUMA_FAULT_AHEAD_DEFAULT;
+ goto out;
+ }
+
+ /*
+ * If the last numa fault window was closed, check whether the
+ * current fault address is contiguous with the previous fault
+ * address before opening a new numa fault window.
+ */
+ if (!prev_pages) {
+ if (fault_address == prev_start ||
+ fault_address == prev_start + PAGE_SIZE)
+ win_pages = NUMA_FAULT_AHEAD_DEFAULT;
+ else
+ win_pages = 0;
+
+ goto out;
+ }
+
+ /*
+ * TODO: also handle a fault address that occurs before the last
+ * numa fault window.
+ */
+ if (fault_address >= prev_start) {
+ unsigned long prev_end = prev_start + prev_pages * PAGE_SIZE;
+
+ /*
+ * The fault continues right where the previous numa fault
+ * window ended, so assume sequential access and expand the
+ * numa fault window.
+ */
+ if (fault_address == prev_end ||
+ fault_address == prev_end + PAGE_SIZE) {
+ win_pages = prev_pages + NUMA_FAULT_EXPAND_STEP;
+ goto validate_out;
+ } else if (fault_address < prev_end) {
+ /*
+ * The current fault address falls inside the last numa
+ * fault window, which means not all pages in that window
+ * were migrated successfully. Keep the current window
+ * size and try again, since the speculation may still be
+ * on the right track.
+ */
+ win_pages = prev_pages;
+ goto validate_out;
+ }
+ }
+
+ /*
+ * Otherwise assume random access and shrink the numa fault
+ * window by one step.
+ */
+ if (prev_pages <= NUMA_FAULT_REDUCE_STEP) {
+ win_pages = 0;
+ goto out;
+ } else {
+ win_pages = prev_pages - NUMA_FAULT_REDUCE_STEP;
+ }
+
+validate_out:
+ /*
+ * Make sure the speculative fault range does not extend beyond
+ * the current VMA or PMD.
+ */
+ max_fault_pages = numa_fault_max_pages(vma, fault_address);
+ if (win_pages > max_fault_pages)
+ win_pages = max_fault_pages;
+
+out:
+ atomic_long_set(&vma->numafault_ahead_info,
+ NUMA_FAULT_INFO(win_start, win_pages));
+ return win_pages;
+}
+
static vm_fault_t do_numa_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
struct page *page = NULL;
- int page_nid = NUMA_NO_NODE;
+ int page_nid;
int last_cpupid;
int target_nid;
pte_t pte, old_pte;
- bool was_writable = pte_savedwrite(vmf->orig_pte);
- int flags = 0;
+ bool was_writable;
+ int flags;
+ unsigned long fault_address = vmf->address;
+ unsigned int win_pages;
+
+ /* Try to speculate the numa fault window for current VMA. */
+ win_pages = adjust_numa_fault_window(vma, fault_address);
+
+try_next:
+ was_writable = pte_savedwrite(vmf->orig_pte);
+ flags = 0;
+ page_nid = NUMA_NO_NODE;
/*
* The "pte" at this point cannot be used safely without
@@ -4342,7 +4484,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
old_pte = ptep_get(vmf->pte);
pte = pte_modify(old_pte, vma->vm_page_prot);
- page = vm_normal_page(vma, vmf->address, pte);
+ page = vm_normal_page(vma, fault_address, pte);
if (!page)
goto out_map;
@@ -4378,7 +4520,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
last_cpupid = (-1 & LAST_CPUPID_MASK);
else
last_cpupid = page_cpupid_last(page);
- target_nid = numa_migrate_prep(page, vma, vmf->address, page_nid,
+ target_nid = numa_migrate_prep(page, vma, fault_address, page_nid,
&flags);
if (target_nid == NUMA_NO_NODE) {
put_page(page);
@@ -4392,7 +4534,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
flags |= TNF_MIGRATED;
} else {
flags |= TNF_MIGRATE_FAIL;
- vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+ vmf->pte = pte_offset_map(vmf->pmd, fault_address);
spin_lock(vmf->ptl);
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -4404,19 +4546,24 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
out:
if (page_nid != NUMA_NO_NODE)
task_numa_fault(last_cpupid, page_nid, 1, flags);
+
+ if ((flags & TNF_MIGRATED) && (win_pages > 0) &&
+ try_next_numa_page(vmf, win_pages, &fault_address))
+ goto try_next;
+
return 0;
out_map:
/*
* Make it present again, depending on how arch implements
* non-accessible ptes, some can allow access by kernel mode.
*/
- old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
+ old_pte = ptep_modify_prot_start(vma, fault_address, vmf->pte);
pte = pte_modify(old_pte, vma->vm_page_prot);
pte = pte_mkyoung(pte);
if (was_writable)
pte = pte_mkwrite(pte);
- ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
- update_mmu_cache(vma, vmf->address, vmf->pte);
+ ptep_modify_prot_commit(vma, fault_address, vmf->pte, old_pte, pte);
+ update_mmu_cache(vma, fault_address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);
goto out;
}
--
2.27.0
^ permalink raw reply related [flat|nested] 5+ messages in thread
* [RFC PATCH 2/4] mm: Add a debug interface to control the range of speculative numa fault
From: Baolin Wang @ 2021-12-12 11:31 UTC (permalink / raw)
Add a debug interface to control the range of speculative numa fault, which
can be used to tune performance, or even to close the speculative numa fault
window entirely for some workloads.
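For illustration, the value written to the new debugfs file is clamped the same way fault_around_bytes is. The userspace sketch below mirrors the numa_around_bytes_set() logic, assuming 4K pages and 512 PTEs per page table (x86-64 defaults); the ex_ helpers are invented for the sketch.

```c
#include <assert.h>
#include <errno.h>

/*
 * Mirror of numa_around_bytes_set() for illustration: assumes 4K pages
 * and 512 PTEs per page table (x86-64 defaults).
 */
#define EX_PAGE_SIZE	4096UL
#define EX_PTRS_PER_PTE	512UL

/* Round v down to a power of two (v must be > 0). */
static unsigned long ex_rounddown_pow_of_two(unsigned long v)
{
	unsigned long r = 1;

	while (r <= v / 2)
		r <<= 1;
	return r;
}

/* Returns 0 and stores the clamped value, or -EINVAL if out of range. */
static int ex_set_numa_around_bytes(unsigned long val, unsigned long *out)
{
	if (val / EX_PAGE_SIZE > EX_PTRS_PER_PTE)
		return -EINVAL;
	if (val > EX_PAGE_SIZE)
		*out = ex_rounddown_pow_of_two(val);
	else
		*out = 0;	/* rounddown_pow_of_two(0) is undefined */
	return 0;
}
```

So a write larger than PTRS_PER_PTE pages is rejected, a value at or below one page disables speculation, and anything in between is rounded down to a power of two.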
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/memory.c | 46 +++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 43 insertions(+), 3 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 2c9ed63e4e23..a0f4a2a008cc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4052,7 +4052,29 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
static unsigned long fault_around_bytes __read_mostly =
rounddown_pow_of_two(65536);
+static unsigned long numa_around_bytes __read_mostly;
+
#ifdef CONFIG_DEBUG_FS
+static int numa_around_bytes_get(void *data, u64 *val)
+{
+ *val = numa_around_bytes;
+ return 0;
+}
+
+static int numa_around_bytes_set(void *data, u64 val)
+{
+ if (val / PAGE_SIZE > PTRS_PER_PTE)
+ return -EINVAL;
+ if (val > PAGE_SIZE)
+ numa_around_bytes = rounddown_pow_of_two(val);
+ else
+ numa_around_bytes = 0; /* rounddown_pow_of_two(0) is undefined */
+ return 0;
+}
+DEFINE_DEBUGFS_ATTRIBUTE(numa_around_bytes_fops,
+ numa_around_bytes_get,
+ numa_around_bytes_set, "%llu\n");
+
static int fault_around_bytes_get(void *data, u64 *val)
{
*val = fault_around_bytes;
@@ -4080,6 +4102,8 @@ static int __init fault_around_debugfs(void)
{
debugfs_create_file_unsafe("fault_around_bytes", 0644, NULL, NULL,
&fault_around_bytes_fops);
+ debugfs_create_file_unsafe("numa_around_bytes", 0644, NULL, NULL,
+ &numa_around_bytes_fops);
return 0;
}
late_initcall(fault_around_debugfs);
@@ -4348,10 +4372,13 @@ static bool try_next_numa_page(struct vm_fault *vmf, unsigned int win_pages,
((win) & NUMA_FAULT_WINDOW_SIZE_MASK))
static inline unsigned int numa_fault_max_pages(struct vm_area_struct *vma,
- unsigned long fault_address)
+ unsigned long fault_address,
+ unsigned long numa_around_size)
{
+ unsigned long numa_around_addr =
+ (fault_address + numa_around_size) & PAGE_MASK;
unsigned long pmd_end_addr = (fault_address & PMD_MASK) + PMD_SIZE;
- unsigned long max_fault_addr = min_t(unsigned long, pmd_end_addr,
+ unsigned long max_fault_addr = min3(numa_around_addr, pmd_end_addr,
vma->vm_end);
return (max_fault_addr - fault_address - 1) >> PAGE_SHIFT;
@@ -4360,12 +4387,24 @@ static inline unsigned int numa_fault_max_pages(struct vm_area_struct *vma,
static unsigned int adjust_numa_fault_window(struct vm_area_struct *vma,
unsigned long fault_address)
{
+ unsigned long numa_around_size = READ_ONCE(numa_around_bytes);
unsigned long numafault_ahead = GET_NUMA_FAULT_INFO(vma);
unsigned long prev_start = NUMA_FAULT_WINDOW_START(numafault_ahead);
unsigned int prev_pages = NUMA_FAULT_WINDOW_SIZE(numafault_ahead);
unsigned long win_start;
unsigned int win_pages, max_fault_pages;
+ /*
+ * Shut down speculative numa faults if numa_around_bytes
+ * is set to 0.
+ */
+ if (!numa_around_size) {
+ if (numafault_ahead)
+ atomic_long_set(&vma->numafault_ahead_info,
+ NUMA_FAULT_INFO(0, 0));
+ return 0;
+ }
+
win_start = fault_address + PAGE_SIZE;
/*
@@ -4437,7 +4476,8 @@ static unsigned int adjust_numa_fault_window(struct vm_area_struct *vma,
* Make sure the speculative fault range does not extend beyond
* the current VMA or PMD.
*/
- max_fault_pages = numa_fault_max_pages(vma, fault_address);
+ max_fault_pages = numa_fault_max_pages(vma, fault_address,
+ numa_around_size);
if (win_pages > max_fault_pages)
win_pages = max_fault_pages;
--
2.27.0
* [RFC PATCH 3/4] mm: Add speculative numa fault stats
From: Baolin Wang @ 2021-12-12 11:31 UTC (permalink / raw)
Add a new statistic to help tune the speculative numa fault window.
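Once exported, the counter shows up as a "pgmigrate_speculation" line in /proc/vmstat. A tuning tool could read it back with a small parser like the sketch below; the field name comes from this patch, while the helper itself is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Find "<name> <value>" in a /proc/vmstat-style buffer; returns the
 * value, or -1 if the field is absent. Illustration only.
 */
static long ex_vmstat_value(const char *buf, const char *name)
{
	size_t len = strlen(name);
	const char *p = buf;

	while (p && *p) {
		/* Field names are matched only at the start of a line. */
		if (!strncmp(p, name, len) && p[len] == ' ')
			return strtol(p + len + 1, NULL, 10);
		p = strchr(p, '\n');
		if (p)
			p++;
	}
	return -1;
}
```

Sampling this value before and after a run, alongside pgmigrate_success, shows what fraction of promotions came from speculation rather than direct faults.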
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
include/linux/vm_event_item.h | 1 +
mm/memory.c | 2 ++
mm/vmstat.c | 1 +
3 files changed, 4 insertions(+)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index a185cc75ff52..97cdc661b7da 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -62,6 +62,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_MIGRATION_SUCCESS,
THP_MIGRATION_FAIL,
THP_MIGRATION_SPLIT,
+ PGMIGRATE_SPECULATION,
#endif
#ifdef CONFIG_COMPACTION
COMPACTMIGRATE_SCANNED, COMPACTFREE_SCANNED,
diff --git a/mm/memory.c b/mm/memory.c
index a0f4a2a008cc..91122beb6e53 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4572,6 +4572,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
if (migrate_misplaced_page(page, vma, target_nid)) {
page_nid = target_nid;
flags |= TNF_MIGRATED;
+ if (vmf->address != fault_address)
+ count_vm_events(PGMIGRATE_SPECULATION, 1);
} else {
flags |= TNF_MIGRATE_FAIL;
vmf->pte = pte_offset_map(vmf->pmd, fault_address);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 787a012de3e2..c64700994786 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1314,6 +1314,7 @@ const char * const vmstat_text[] = {
"thp_migration_success",
"thp_migration_fail",
"thp_migration_split",
+ "pgmigrate_speculation",
#endif
#ifdef CONFIG_COMPACTION
"compact_migrate_scanned",
--
2.27.0
* [RFC PATCH 4/4] mm: Update the speculative pages' accessing time
From: Baolin Wang @ 2021-12-12 11:32 UTC (permalink / raw)
On systems with multiple memory types, including fast memory (DRAM) and slow
memory (persistent memory), NUMA balancing is relied on to promote slow, hot
memory to fast memory to improve performance.
Now that speculative numa faults are supported, we can also update the access
time of the speculatively faulted pages, to help promote them to a fast
memory node more easily and improve performance.
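The intent of the xchg_page_access_time() call can be shown with a toy model: access-time based promotion judges hotness from how recently a page was last stamped, so stamping the speculative pages at fault time makes them look recently accessed and eligible for promotion sooner. The struct, helpers, and threshold below are invented for the illustration, not the kernel's actual policy.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of access-time based promotion (names and threshold are
 * illustrative, not the kernel's actual policy).
 */
struct ex_page {
	unsigned int last_access_ms;	/* last recorded access time */
};

/* Record an access, returning the previous stamp (xchg-like). */
static unsigned int ex_stamp_access(struct ex_page *p, unsigned int now_ms)
{
	unsigned int old = p->last_access_ms;

	p->last_access_ms = now_ms;
	return old;
}

/* A page looks "hot" if it is touched again within the threshold. */
static bool ex_is_hot(const struct ex_page *p, unsigned int now_ms,
		      unsigned int threshold_ms)
{
	return now_ms - p->last_access_ms < threshold_ms;
}
```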
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/memory.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 91122beb6e53..e19b10299913 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4556,10 +4556,21 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
* to record page access time. So use default value.
*/
if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
- !node_is_toptier(page_nid))
+ !node_is_toptier(page_nid)) {
last_cpupid = (-1 & LAST_CPUPID_MASK);
- else
+ /*
+ * According to the data locality for some workloads, the
+ * probability of accessing some data soon after some nearby
+ * data has been accessed. So for tiered memory systems, we
+ * can update the sequential page's age located on slow memory
+ * type, to try to promote it to fast memory in advance to
+ * improve the performance.
+ */
+ if (vmf->address != fault_address)
+ xchg_page_access_time(page, jiffies_to_msecs(jiffies));
+ } else {
last_cpupid = page_cpupid_last(page);
+ }
target_nid = numa_migrate_prep(page, vma, fault_address, page_nid,
&flags);
if (target_nid == NUMA_NO_NODE) {
--
2.27.0