* [RFC PATCH 0/2] Hot page promotion optimization for large address space
@ 2024-03-27 16:02 Bharata B Rao
  2024-03-27 16:02 ` [RFC PATCH 1/2] sched/numa: Fault count based NUMA hint fault latency Bharata B Rao
  ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Bharata B Rao @ 2024-03-27 16:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, akpm, mingo, peterz, mgorman, raghavendra.kt,
	ying.huang, dave.hansen, hannes, Bharata B Rao

In order to check how efficiently the existing NUMA balancing based
hot page promotion mechanism can detect hot regions and promote pages
for workloads with large memory footprints, I wrote and tested a
program that allocates a huge amount of memory but routinely touches
only small parts of it.

This microbenchmark provisions memory on both the DRAM node and the
CXL node. It then divides the entire allocated memory into smaller
chunks and randomly chooses a chunk for generating memory accesses.
Each chunk is then accessed for a fixed number of iterations to create
the notion of hotness. Within each chunk, the individual pages at 4K
granularity are again accessed in random fashion.

When a chunk is taken up for access in this manner, its pages can be
residing either on DRAM or on CXL. In the latter case, the NUMA
balancing driven hot page promotion logic is expected to detect and
promote the hot pages that reside on CXL.

The experiment was conducted on a 2P AMD Bergamo system that has CXL
as the 3rd node.

$ numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0-127,256-383
node 0 size: 128054 MB
node 1 cpus: 128-255,384-511
node 1 size: 128880 MB
node 2 cpus:
node 2 size: 129024 MB
node distances:
node   0   1   2
  0:  10  32  60
  1:  32  10  50
  2: 255 255  10

It is seen that the number of pages that get promoted is really low,
and the reason turns out to be that the NUMA hint fault latency is
much higher than the hot threshold most of the time.
Here are a few latency and threshold sample values captured from the
should_numa_migrate_memory() routine when the benchmark was run:

latency		threshold (in ms)
20620		1125
56185		1125
98710		1250
148871		1375
182891		1625
369415		1875
630745		2000

The NUMA hint fault latency metric, which is based on the absolute
time difference between the scan time and the fault time, may not be
suitable for applications that have large amounts of memory. When the
time difference between the scan time PTE update and the subsequent
access (hint fault) is large, the existing logic in
should_numa_migrate_memory() that determines if a page needs to be
migrated will exclude far more pages than it selects for promotion.

To address this problem, this RFC converts the absolute time based
hint fault latency into a relative metric: the number of hint faults
that have occurred between the scan time and the page's fault time is
used as the latency and compared against a fault count threshold.

This is still experimental work and a few things remain to be taken
care of. While more testing needs to be conducted with different
benchmarks, I am posting the patchset here to get early feedback.
Microbenchmark
==============
Total allocation is 192G, which initially occupies all of Node 1 (DRAM)
and half of Node 2 (CXL). Chunk size is 1G.

				Default		Patched
Benchmark score (us)		637,787,351	571,350,410 (-10.41%)
(Lower is better)
numa_pte_updates		29,834,747	29,275,489
numa_hint_faults		12,512,736	12,080,772
numa_hint_faults_local		0		0
numa_pages_migrated		1,804,583	6,709,580
pgpromote_success		1,804,500	6,709,526
pgpromote_candidate		1,916,720	7,523,345
pgdemote_kswapd			5,358,119	9,438,006
pgdemote_direct			0		0

				Default		Patched
Number of times
should_numa_migrate_memory()
was invoked:			12,512,736	12,080,772

Number of times the migration
request was rejected due to
hint fault latency being
higher than threshold:		10,595,933	4,557,401

Redis-memtier
=============
memtier_benchmark -t 512 -n 25000 --ratio 1:1 -c 20 -x 1 --key-pattern R:R --hide-histogram --distinct-client-seed -d 20000 --pipeline=1000

			Default		Patched
Ops/sec			51,921.16	52,694.55
Hits/sec		21,908.72	22,235.03
Misses/sec		4051.86		4112.24
Avg. Latency		867.51710	591.27561 (-31.84%)
p50 Latency		876.54300	708.60700 (-19.15%)
p99 Latency		1044.47900	1044.47900
p99.9 Latency		1048.57500	1048.57500
KB/sec			937,330.19	951,291.76
numa_pte_updates	66,628,064	72,125,512
numa_hint_faults	57,093,369	63,369,538
numa_hint_faults_local	0		0
numa_pages_migrated	799,128		3,634,114
pgpromote_success	798,974		3,633,672
pgpromote_candidate	33,884,196	23,143,552
pgdemote_kswapd		13,321,784	11,948,894
pgdemote_direct		257		57,147

Bharata B Rao (2):
  sched/numa: Fault count based NUMA hint fault latency
  mm: Update hint fault count for pages that are skipped during scanning

 include/linux/mm.h       | 23 ++++---------
 include/linux/mm_types.h |  3 ++
 kernel/sched/debug.c     |  2 +-
 kernel/sched/fair.c      | 73 +++++++++++-----------------------------
 kernel/sched/sched.h     |  1 +
 mm/huge_memory.c         | 10 +++---
 mm/memory.c              |  2 ++
 mm/mprotect.c            | 14 ++++----
 8 files changed, 46 insertions(+), 82 deletions(-)

-- 
2.25.1
* [RFC PATCH 1/2] sched/numa: Fault count based NUMA hint fault latency
  2024-03-27 16:02 [RFC PATCH 0/2] Hot page promotion optimization for large address space Bharata B Rao
@ 2024-03-27 16:02 ` Bharata B Rao
  2024-03-28  1:56   ` Huang, Ying
  ` (2 more replies)
  2024-03-27 16:02 ` [RFC PATCH 2/2] mm: Update hint fault count for pages that are skipped during scanning Bharata B Rao
  2024-03-28  5:35 ` [RFC PATCH 0/2] Hot page promotion optimization for large address space Huang, Ying
  2 siblings, 3 replies; 21+ messages in thread
From: Bharata B Rao @ 2024-03-27 16:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, akpm, mingo, peterz, mgorman, raghavendra.kt,
	ying.huang, dave.hansen, hannes, Bharata B Rao

For memory tiering mode, the hint fault latency determines if a page
is considered hot enough to be promoted. This latency value depends on
absolute time and is the difference between the scanning time and the
fault time. The default value of the threshold used to categorize a
page as hot is 1s.

When the address space is huge, pages may not be accessed right after
they are scanned. Hence the hint fault latency is, most of the time,
found to be way beyond the current threshold, resulting in a very low
number of page promotions.

To address this problem, convert the absolute time based hint fault
latency into a relative metric. Use the number of hint faults that
have occurred between the scan time and the page's fault time as the
latency, and compare it against a fault count based threshold.

TODO: The existing threshold adjustment logic, which is based on the
time based hint fault latency, has been removed for now.
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 include/linux/mm.h       | 23 ++++---------
 include/linux/mm_types.h |  3 ++
 kernel/sched/debug.c     |  2 +-
 kernel/sched/fair.c      | 73 +++++++++++-----------------------------
 kernel/sched/sched.h     |  1 +
 mm/huge_memory.c         |  3 +-
 mm/memory.c              |  2 ++
 mm/mprotect.c            |  5 +--
 8 files changed, 37 insertions(+), 75 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f5a97dec5169..cb1e79f2920b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1656,17 +1656,7 @@ static inline int folio_nid(const struct folio *folio)
 }
 
 #ifdef CONFIG_NUMA_BALANCING
-/* page access time bits needs to hold at least 4 seconds */
-#define PAGE_ACCESS_TIME_MIN_BITS	12
-#if LAST_CPUPID_SHIFT < PAGE_ACCESS_TIME_MIN_BITS
-#define PAGE_ACCESS_TIME_BUCKETS \
-	(PAGE_ACCESS_TIME_MIN_BITS - LAST_CPUPID_SHIFT)
-#else
-#define PAGE_ACCESS_TIME_BUCKETS	0
-#endif
-
-#define PAGE_ACCESS_TIME_MASK \
-	(LAST_CPUPID_MASK << PAGE_ACCESS_TIME_BUCKETS)
+#define PAGE_FAULT_COUNT_BUCKETS	16
 
 static inline int cpu_pid_to_cpupid(int cpu, int pid)
 {
@@ -1732,13 +1722,12 @@ static inline void page_cpupid_reset_last(struct page *page)
 }
 #endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
 
-static inline int folio_xchg_access_time(struct folio *folio, int time)
+static inline int folio_xchg_fault_count(struct folio *folio, int count)
 {
-	int last_time;
+	int last_count;
 
-	last_time = folio_xchg_last_cpupid(folio,
-					   time >> PAGE_ACCESS_TIME_BUCKETS);
-	return last_time << PAGE_ACCESS_TIME_BUCKETS;
+	last_count = folio_xchg_last_cpupid(folio, count >> PAGE_FAULT_COUNT_BUCKETS);
+	return last_count << PAGE_FAULT_COUNT_BUCKETS;
 }
 
 static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
@@ -1756,7 +1745,7 @@ static inline int folio_xchg_last_cpupid(struct folio *folio, int cpupid)
 	return folio_nid(folio); /* XXX */
 }
 
-static inline int folio_xchg_access_time(struct folio *folio, int time)
+static inline int folio_xchg_fault_count(struct folio *folio, int time)
 {
 	return 0;
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8b611e13153e..280043a08f25 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -922,6 +922,9 @@ struct mm_struct {
 
 		/* numa_scan_seq prevents two threads remapping PTEs. */
 		int numa_scan_seq;
+
+		/* Accumulated number of hint faults */
+		atomic_t hint_faults;
 #endif
 		/*
 		 * An operation with batched TLB flushing is going on. Anything
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 8d5d98a5834d..cd6367cef6cb 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -369,7 +369,7 @@ static __init int sched_init_debug(void)
 	debugfs_create_u32("scan_period_min_ms", 0644, numa, &sysctl_numa_balancing_scan_period_min);
 	debugfs_create_u32("scan_period_max_ms", 0644, numa, &sysctl_numa_balancing_scan_period_max);
 	debugfs_create_u32("scan_size_mb", 0644, numa, &sysctl_numa_balancing_scan_size);
-	debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
+	debugfs_create_u32("fault_count_threshold", 0644, numa, &sysctl_numa_balancing_fault_count_threshold);
 #endif
 
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a16129f9a5c..977584683f5f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1364,8 +1364,11 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
-/* The page with hint page fault latency < threshold in ms is considered hot */
-unsigned int sysctl_numa_balancing_hot_threshold = MSEC_PER_SEC;
+/*
+ * Page is considered hot if the number of hint faults between scan time and
+ * the page's fault time is less than this threshold.
+ */
+unsigned int sysctl_numa_balancing_fault_count_threshold = 1000000;
 
 struct numa_group {
 	refcount_t refcount;
@@ -1750,25 +1753,20 @@ static bool pgdat_free_space_enough(struct pglist_data *pgdat)
 }
 
 /*
- * For memory tiering mode, when page tables are scanned, the scan
- * time will be recorded in struct page in addition to make page
- * PROT_NONE for slow memory page. So when the page is accessed, in
- * hint page fault handler, the hint page fault latency is calculated
- * via,
+ * For memory tiering mode, when page tables are scanned, the current
+ * hint fault count will be recorded in struct page in addition to
+ * make page PROT_NONE for slow memory page. So when the page is
+ * accessed, in hint page fault handler, the hint page fault latency is
+ * calculated via,
  *
- *	hint page fault latency = hint page fault time - scan time
+ *	hint page fault latency = current hint fault count - fault count at scan time
  *
  * The smaller the hint page fault latency, the higher the possibility
  * for the page to be hot.
  */
-static int numa_hint_fault_latency(struct folio *folio)
+static inline int numa_hint_fault_latency(struct folio *folio, int count)
 {
-	int last_time, time;
-
-	time = jiffies_to_msecs(jiffies);
-	last_time = folio_xchg_access_time(folio, time);
-
-	return (time - last_time) & PAGE_ACCESS_TIME_MASK;
+	return count - folio_xchg_fault_count(folio, count);
 }
 
 /*
@@ -1794,35 +1792,6 @@ static bool numa_promotion_rate_limit(struct pglist_data *pgdat,
 	return false;
 }
 
-#define NUMA_MIGRATION_ADJUST_STEPS	16
-
-static void numa_promotion_adjust_threshold(struct pglist_data *pgdat,
-					    unsigned long rate_limit,
-					    unsigned int ref_th)
-{
-	unsigned int now, start, th_period, unit_th, th;
-	unsigned long nr_cand, ref_cand, diff_cand;
-
-	now = jiffies_to_msecs(jiffies);
-	th_period = sysctl_numa_balancing_scan_period_max;
-	start = pgdat->nbp_th_start;
-	if (now - start > th_period &&
-	    cmpxchg(&pgdat->nbp_th_start, start, now) == start) {
-		ref_cand = rate_limit *
-			sysctl_numa_balancing_scan_period_max / MSEC_PER_SEC;
-		nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
-		diff_cand = nr_cand - pgdat->nbp_th_nr_cand;
-		unit_th = ref_th * 2 / NUMA_MIGRATION_ADJUST_STEPS;
-		th = pgdat->nbp_threshold ? : ref_th;
-		if (diff_cand > ref_cand * 11 / 10)
-			th = max(th - unit_th, unit_th);
-		else if (diff_cand < ref_cand * 9 / 10)
-			th = min(th + unit_th, ref_th * 2);
-		pgdat->nbp_th_nr_cand = nr_cand;
-		pgdat->nbp_threshold = th;
-	}
-}
-
 bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 				int src_nid, int dst_cpu)
 {
@@ -1838,7 +1807,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 	    !node_is_toptier(src_nid)) {
 		struct pglist_data *pgdat;
 		unsigned long rate_limit;
-		unsigned int latency, th, def_th;
+		unsigned int latency;
 
 		pgdat = NODE_DATA(dst_nid);
 		if (pgdat_free_space_enough(pgdat)) {
@@ -1847,16 +1816,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 			return true;
 		}
 
-		def_th = sysctl_numa_balancing_hot_threshold;
-		rate_limit = sysctl_numa_balancing_promote_rate_limit << \
-			(20 - PAGE_SHIFT);
-		numa_promotion_adjust_threshold(pgdat, rate_limit, def_th);
-
-		th = pgdat->nbp_threshold ? : def_th;
-		latency = numa_hint_fault_latency(folio);
-		if (latency >= th)
+		latency = numa_hint_fault_latency(folio,
+				atomic_read(&p->mm->hint_faults));
+		if (latency >= sysctl_numa_balancing_fault_count_threshold)
 			return false;
 
+		rate_limit = sysctl_numa_balancing_promote_rate_limit << \
+			(20 - PAGE_SHIFT);
 		return !numa_promotion_rate_limit(pgdat, rate_limit,
 						  folio_nr_pages(folio));
 	}
@@ -3444,6 +3410,7 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 			mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
 			mm->numa_scan_seq = 0;
 		}
+		atomic_set(&mm->hint_faults, 0);
 	}
 
 	p->node_stamp			= 0;
 	p->numa_scan_seq		= mm ? mm->numa_scan_seq : 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d2242679239e..f975d643fa6a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2564,6 +2564,7 @@ extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_numa_balancing_hot_threshold;
+extern unsigned int sysctl_numa_balancing_fault_count_threshold;
 #endif
 
 #ifdef CONFIG_SCHED_HRTICK
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 94c958f7ebb5..7e62c3c2bbcb 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2101,8 +2101,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
 		    !toptier)
-			folio_xchg_access_time(folio,
-					       jiffies_to_msecs(jiffies));
+			folio_xchg_fault_count(folio, atomic_read(&mm->hint_faults));
 	}
 	/*
 	 * In case prot_numa, we are under mmap_read_lock(mm). It's critical
diff --git a/mm/memory.c b/mm/memory.c
index 0bfc8b007c01..43a6358e6d31 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4927,6 +4927,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	pte_t pte, old_pte;
 	int flags = 0;
 
+	atomic_inc(&vma->vm_mm->hint_faults);
+
 	/*
 	 * The "pte" at this point cannot be used safely without
 	 * validation through pte_unmap_same(). It's of NUMA type but
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 81991102f785..30118fd492f4 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -159,8 +159,9 @@ static long change_pte_range(struct mmu_gather *tlb,
 				continue;
 			if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
 			    !toptier)
-				folio_xchg_access_time(folio,
-						       jiffies_to_msecs(jiffies));
+				folio_xchg_fault_count(folio,
+					atomic_read(&vma->vm_mm->hint_faults));
+
 			}
 
 			oldpte = ptep_modify_prot_start(vma, addr, pte);
-- 
2.25.1
* Re: [RFC PATCH 1/2] sched/numa: Fault count based NUMA hint fault latency
  2024-03-27 16:02 ` [RFC PATCH 1/2] sched/numa: Fault count based NUMA hint fault latency Bharata B Rao
@ 2024-03-28  1:56   ` Huang, Ying
  2024-03-28  4:39     ` Bharata B Rao
  2024-03-30  1:11   ` kernel test robot
  2024-03-30  3:47   ` kernel test robot
  2 siblings, 1 reply; 21+ messages in thread
From: Huang, Ying @ 2024-03-28  1:56 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-mm, linux-kernel, akpm, mingo, peterz, mgorman,
	raghavendra.kt, dave.hansen, hannes

Bharata B Rao <bharata@amd.com> writes:

[snip]

> @@ -1750,25 +1753,20 @@ static bool pgdat_free_space_enough(struct pglist_data *pgdat)
>  }
>  
>  /*
> - * For memory tiering mode, when page tables are scanned, the scan
> - * time will be recorded in struct page in addition to make page
> - * PROT_NONE for slow memory page. So when the page is accessed, in
> - * hint page fault handler, the hint page fault latency is calculated
> - * via,
> + * For memory tiering mode, when page tables are scanned, the current
> + * hint fault count will be recorded in struct page in addition to
> + * make page PROT_NONE for slow memory page. So when the page is
> + * accessed, in hint page fault handler, the hint page fault latency is
> + * calculated via,
>   *
> - *	hint page fault latency = hint page fault time - scan time
> + *	hint page fault latency = current hint fault count - fault count at scan time
>   *
>   * The smaller the hint page fault latency, the higher the possibility
>   * for the page to be hot.
>   */
> -static int numa_hint_fault_latency(struct folio *folio)
> +static inline int numa_hint_fault_latency(struct folio *folio, int count)
>  {
> -	int last_time, time;
> -
> -	time = jiffies_to_msecs(jiffies);
> -	last_time = folio_xchg_access_time(folio, time);
> -
> -	return (time - last_time) & PAGE_ACCESS_TIME_MASK;
> +	return count - folio_xchg_fault_count(folio, count);
>  }

I found count is task->mm->hint_faults. That is a process wide
counting. How do you connect the hotness of a folio with the count of
hint page faults in the process? How do you compare the hotness of
folios among different processes?

>  /*
> @@ -1794,35 +1792,6 @@ static bool numa_promotion_rate_limit(struct pglist_data *pgdat,
>  	return false;
>  }

--
Best Regards,
Huang, Ying
* Re: [RFC PATCH 1/2] sched/numa: Fault count based NUMA hint fault latency
  2024-03-28  1:56   ` Huang, Ying
@ 2024-03-28  4:39     ` Bharata B Rao
  2024-03-28  5:21       ` Huang, Ying
  0 siblings, 1 reply; 21+ messages in thread
From: Bharata B Rao @ 2024-03-28  4:39 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, linux-kernel, akpm, mingo, peterz, mgorman,
	raghavendra.kt, dave.hansen, hannes

On 28-Mar-24 7:26 AM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:
> 
> [snip]
> 
>> @@ -1750,25 +1753,20 @@ static bool pgdat_free_space_enough(struct pglist_data *pgdat)
>>  }
>>  
>>  /*
>> - * For memory tiering mode, when page tables are scanned, the scan
>> - * time will be recorded in struct page in addition to make page
>> - * PROT_NONE for slow memory page. So when the page is accessed, in
>> - * hint page fault handler, the hint page fault latency is calculated
>> - * via,
>> + * For memory tiering mode, when page tables are scanned, the current
>> + * hint fault count will be recorded in struct page in addition to
>> + * make page PROT_NONE for slow memory page. So when the page is
>> + * accessed, in hint page fault handler, the hint page fault latency is
>> + * calculated via,
>>   *
>> - *	hint page fault latency = hint page fault time - scan time
>> + *	hint page fault latency = current hint fault count - fault count at scan time
>>   *
>>   * The smaller the hint page fault latency, the higher the possibility
>>   * for the page to be hot.
>>   */
>> -static int numa_hint_fault_latency(struct folio *folio)
>> +static inline int numa_hint_fault_latency(struct folio *folio, int count)
>>  {
>> -	int last_time, time;
>> -
>> -	time = jiffies_to_msecs(jiffies);
>> -	last_time = folio_xchg_access_time(folio, time);
>> -
>> -	return (time - last_time) & PAGE_ACCESS_TIME_MASK;
>> +	return count - folio_xchg_fault_count(folio, count);
>>  }
> 
> I found count is task->mm->hint_faults. That is a process wide
> counting. How do you connect the hotness of a folio with the count of
> hint page fault in the process? How do you compare the hotness of
> folios among different processes?

The global hint fault count that we already maintain could be used
instead of the per-task fault count. That should take care of the
concern you mention, right?

Regards,
Bharata.
* Re: [RFC PATCH 1/2] sched/numa: Fault count based NUMA hint fault latency
  2024-03-28  4:39     ` Bharata B Rao
@ 2024-03-28  5:21       ` Huang, Ying
  0 siblings, 0 replies; 21+ messages in thread
From: Huang, Ying @ 2024-03-28  5:21 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-mm, linux-kernel, akpm, mingo, peterz, mgorman,
	raghavendra.kt, dave.hansen, hannes

Bharata B Rao <bharata@amd.com> writes:

> On 28-Mar-24 7:26 AM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>>
>> [snip]
>>
>>> @@ -1750,25 +1753,20 @@ static bool pgdat_free_space_enough(struct pglist_data *pgdat)
>>>  }
>>>  
>>>  /*
>>> - * For memory tiering mode, when page tables are scanned, the scan
>>> - * time will be recorded in struct page in addition to make page
>>> - * PROT_NONE for slow memory page. So when the page is accessed, in
>>> - * hint page fault handler, the hint page fault latency is calculated
>>> - * via,
>>> + * For memory tiering mode, when page tables are scanned, the current
>>> + * hint fault count will be recorded in struct page in addition to
>>> + * make page PROT_NONE for slow memory page. So when the page is
>>> + * accessed, in hint page fault handler, the hint page fault latency is
>>> + * calculated via,
>>>   *
>>> - *	hint page fault latency = hint page fault time - scan time
>>> + *	hint page fault latency = current hint fault count - fault count at scan time
>>>   *
>>>   * The smaller the hint page fault latency, the higher the possibility
>>>   * for the page to be hot.
>>>   */
>>> -static int numa_hint_fault_latency(struct folio *folio)
>>> +static inline int numa_hint_fault_latency(struct folio *folio, int count)
>>>  {
>>> -	int last_time, time;
>>> -
>>> -	time = jiffies_to_msecs(jiffies);
>>> -	last_time = folio_xchg_access_time(folio, time);
>>> -
>>> -	return (time - last_time) & PAGE_ACCESS_TIME_MASK;
>>> +	return count - folio_xchg_fault_count(folio, count);
>>>  }
>>
>> I found count is task->mm->hint_faults. That is a process wide
>> counting. How do you connect the hotness of a folio with the count of
>> hint page fault in the process? How do you compare the hotness of
>> folios among different processes?
>
> The global hint fault count that we already maintain could
> be used instead of per-task fault. That should take care
> of the concern you mention right?

I have plotted the total number of hint faults per second before, and
it changes a lot over time. So I don't think it is a good measurement.

--
Best Regards,
Huang, Ying
* Re: [RFC PATCH 1/2] sched/numa: Fault count based NUMA hint fault latency 2024-03-27 16:02 ` [RFC PATCH 1/2] sched/numa: Fault count based NUMA hint fault latency Bharata B Rao 2024-03-28 1:56 ` Huang, Ying @ 2024-03-30 1:11 ` kernel test robot 2024-03-30 3:47 ` kernel test robot 2 siblings, 0 replies; 21+ messages in thread From: kernel test robot @ 2024-03-30 1:11 UTC (permalink / raw) To: Bharata B Rao; +Cc: oe-kbuild-all Hi Bharata, [This is a private test report for your RFC patch.] kernel test robot noticed the following build errors: [auto build test ERROR on peterz-queue/sched/core] [also build test ERROR on linus/master v6.9-rc1 next-20240328] [cannot apply to akpm-mm/mm-everything tip/sched/core] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Bharata-B-Rao/sched-numa-Fault-count-based-NUMA-hint-fault-latency/20240328-000607 base: https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core patch link: https://lore.kernel.org/r/20240327160237.2355-2-bharata%40amd.com patch subject: [RFC PATCH 1/2] sched/numa: Fault count based NUMA hint fault latency config: m68k-allmodconfig (https://download.01.org/0day-ci/archive/20240330/202403300858.1nHEhscI-lkp@intel.com/config) compiler: m68k-linux-gcc (GCC) 13.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240330/202403300858.1nHEhscI-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. 
not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202403300858.1nHEhscI-lkp@intel.com/ All errors (new ones prefixed by >>): mm/memory.c: In function 'do_numa_page': >> mm/memory.c:4804:31: error: 'struct mm_struct' has no member named 'hint_faults' 4804 | atomic_inc(&vma->vm_mm->hint_faults); | ^~ -- mm/mprotect.c: In function 'change_pte_range': >> mm/mprotect.c:163:80: error: 'struct mm_struct' has no member named 'hint_faults' 163 | atomic_read(&vma->vm_mm->hint_faults)); | ^~ vim +4804 mm/memory.c 4792 4793 static vm_fault_t do_numa_page(struct vm_fault *vmf) 4794 { 4795 struct vm_area_struct *vma = vmf->vma; 4796 struct folio *folio = NULL; 4797 int nid = NUMA_NO_NODE; 4798 bool writable = false; 4799 int last_cpupid; 4800 int target_nid; 4801 pte_t pte, old_pte; 4802 int flags = 0; 4803 > 4804 atomic_inc(&vma->vm_mm->hint_faults); 4805 4806 /* 4807 * The "pte" at this point cannot be used safely without 4808 * validation through pte_unmap_same(). It's of NUMA type but 4809 * the pfn may be screwed if the read is non atomic. 4810 */ 4811 spin_lock(vmf->ptl); 4812 if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) { 4813 pte_unmap_unlock(vmf->pte, vmf->ptl); 4814 goto out; 4815 } 4816 4817 /* Get the normal PTE */ 4818 old_pte = ptep_get(vmf->pte); 4819 pte = pte_modify(old_pte, vma->vm_page_prot); 4820 4821 /* 4822 * Detect now whether the PTE could be writable; this information 4823 * is only valid while holding the PT lock. 
4824 */ 4825 writable = pte_write(pte); 4826 if (!writable && vma_wants_manual_pte_write_upgrade(vma) && 4827 can_change_pte_writable(vma, vmf->address, pte)) 4828 writable = true; 4829 4830 folio = vm_normal_folio(vma, vmf->address, pte); 4831 if (!folio || folio_is_zone_device(folio)) 4832 goto out_map; 4833 4834 /* TODO: handle PTE-mapped THP */ 4835 if (folio_test_large(folio)) 4836 goto out_map; 4837 4838 /* 4839 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as 4840 * much anyway since they can be in shared cache state. This misses 4841 * the case where a mapping is writable but the process never writes 4842 * to it but pte_write gets cleared during protection updates and 4843 * pte_dirty has unpredictable behaviour between PTE scan updates, 4844 * background writeback, dirty balancing and application behaviour. 4845 */ 4846 if (!writable) 4847 flags |= TNF_NO_GROUP; 4848 4849 /* 4850 * Flag if the folio is shared between multiple address spaces. This 4851 * is later used when determining whether to group tasks together 4852 */ 4853 if (folio_estimated_sharers(folio) > 1 && (vma->vm_flags & VM_SHARED)) 4854 flags |= TNF_SHARED; 4855 4856 nid = folio_nid(folio); 4857 /* 4858 * For memory tiering mode, cpupid of slow memory page is used 4859 * to record page access time. So use default value. 
4860 */ 4861 if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && 4862 !node_is_toptier(nid)) 4863 last_cpupid = (-1 & LAST_CPUPID_MASK); 4864 else 4865 last_cpupid = folio_last_cpupid(folio); 4866 target_nid = numa_migrate_prep(folio, vma, vmf->address, nid, &flags); 4867 if (target_nid == NUMA_NO_NODE) { 4868 folio_put(folio); 4869 goto out_map; 4870 } 4871 pte_unmap_unlock(vmf->pte, vmf->ptl); 4872 writable = false; 4873 4874 /* Migrate to the requested node */ 4875 if (migrate_misplaced_folio(folio, vma, target_nid)) { 4876 nid = target_nid; 4877 flags |= TNF_MIGRATED; 4878 } else { 4879 flags |= TNF_MIGRATE_FAIL; 4880 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, 4881 vmf->address, &vmf->ptl); 4882 if (unlikely(!vmf->pte)) 4883 goto out; 4884 if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) { 4885 pte_unmap_unlock(vmf->pte, vmf->ptl); 4886 goto out; 4887 } 4888 goto out_map; 4889 } 4890 4891 out: 4892 if (nid != NUMA_NO_NODE) 4893 task_numa_fault(last_cpupid, nid, 1, flags); 4894 return 0; 4895 out_map: 4896 /* 4897 * Make it present again, depending on how arch implements 4898 * non-accessible ptes, some can allow access by kernel mode. 4899 */ 4900 old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte); 4901 pte = pte_modify(old_pte, vma->vm_page_prot); 4902 pte = pte_mkyoung(pte); 4903 if (writable) 4904 pte = pte_mkwrite(pte, vma); 4905 ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte); 4906 update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1); 4907 pte_unmap_unlock(vmf->pte, vmf->ptl); 4908 goto out; 4909 } 4910 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 1/2] sched/numa: Fault count based NUMA hint fault latency
  2024-03-27 16:02 ` [RFC PATCH 1/2] sched/numa: Fault count based NUMA hint fault latency Bharata B Rao
  2024-03-28  1:56   ` Huang, Ying
  2024-03-30  1:11   ` kernel test robot
@ 2024-03-30  3:47   ` kernel test robot
  2 siblings, 0 replies; 21+ messages in thread
From: kernel test robot @ 2024-03-30  3:47 UTC (permalink / raw)
  To: Bharata B Rao; +Cc: llvm, oe-kbuild-all

Hi Bharata,

[This is a private test report for your RFC patch.]

kernel test robot noticed the following build errors:

[auto build test ERROR on peterz-queue/sched/core]
[also build test ERROR on linus/master v6.9-rc1 next-20240328]
[cannot apply to akpm-mm/mm-everything tip/sched/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Bharata-B-Rao/sched-numa-Fault-count-based-NUMA-hint-fault-latency/20240328-000607
base:   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
patch link:    https://lore.kernel.org/r/20240327160237.2355-2-bharata%40amd.com
patch subject: [RFC PATCH 1/2] sched/numa: Fault count based NUMA hint fault latency
config: arm-defconfig (https://download.01.org/0day-ci/archive/20240330/202403301111.fNcnRPWz-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project.git f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240330/202403301111.fNcnRPWz-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202403301111.fNcnRPWz-lkp@intel.com/

All errors (new ones prefixed by >>):

>> mm/memory.c:4804:26: error: no member named 'hint_faults' in 'struct mm_struct'
           atomic_inc(&vma->vm_mm->hint_faults);
                       ~~~~~~~~~~  ^
   1 error generated.
--
>> mm/mprotect.c:163:33: error: no member named 'hint_faults' in 'struct mm_struct'
                           atomic_read(&vma->vm_mm->hint_faults));
                                       ~~~~~~~~~~  ^
   1 error generated.


vim +4804 mm/memory.c

  4792
  4793  static vm_fault_t do_numa_page(struct vm_fault *vmf)
  4794  {
  4795          struct vm_area_struct *vma = vmf->vma;
  4796          struct folio *folio = NULL;
  4797          int nid = NUMA_NO_NODE;
  4798          bool writable = false;
  4799          int last_cpupid;
  4800          int target_nid;
  4801          pte_t pte, old_pte;
  4802          int flags = 0;
  4803
> 4804          atomic_inc(&vma->vm_mm->hint_faults);
  4805
  4806          /*
  4807           * The "pte" at this point cannot be used safely without
  4808           * validation through pte_unmap_same(). It's of NUMA type but
  4809           * the pfn may be screwed if the read is non atomic.
  4810           */
  4811          spin_lock(vmf->ptl);
  4812          if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
  4813                  pte_unmap_unlock(vmf->pte, vmf->ptl);
  4814                  goto out;
  4815          }
  4816
  4817          /* Get the normal PTE */
  4818          old_pte = ptep_get(vmf->pte);
  4819          pte = pte_modify(old_pte, vma->vm_page_prot);
  4820
  4821          /*
  4822           * Detect now whether the PTE could be writable; this information
  4823           * is only valid while holding the PT lock.
  4824           */
  4825          writable = pte_write(pte);
  4826          if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
  4827              can_change_pte_writable(vma, vmf->address, pte))
  4828                  writable = true;
  4829
  4830          folio = vm_normal_folio(vma, vmf->address, pte);
  4831          if (!folio || folio_is_zone_device(folio))
  4832                  goto out_map;
  4833
  4834          /* TODO: handle PTE-mapped THP */
  4835          if (folio_test_large(folio))
  4836                  goto out_map;
  4837
  4838          /*
  4839           * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
  4840           * much anyway since they can be in shared cache state. This misses
  4841           * the case where a mapping is writable but the process never writes
  4842           * to it but pte_write gets cleared during protection updates and
  4843           * pte_dirty has unpredictable behaviour between PTE scan updates,
  4844           * background writeback, dirty balancing and application behaviour.
  4845           */
  4846          if (!writable)
  4847                  flags |= TNF_NO_GROUP;
  4848
  4849          /*
  4850           * Flag if the folio is shared between multiple address spaces. This
  4851           * is later used when determining whether to group tasks together
  4852           */
  4853          if (folio_estimated_sharers(folio) > 1 && (vma->vm_flags & VM_SHARED))
  4854                  flags |= TNF_SHARED;
  4855
  4856          nid = folio_nid(folio);
  4857          /*
  4858           * For memory tiering mode, cpupid of slow memory page is used
  4859           * to record page access time. So use default value.
  4860           */
  4861          if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
  4862              !node_is_toptier(nid))
  4863                  last_cpupid = (-1 & LAST_CPUPID_MASK);
  4864          else
  4865                  last_cpupid = folio_last_cpupid(folio);
  4866          target_nid = numa_migrate_prep(folio, vma, vmf->address, nid, &flags);
  4867          if (target_nid == NUMA_NO_NODE) {
  4868                  folio_put(folio);
  4869                  goto out_map;
  4870          }
  4871          pte_unmap_unlock(vmf->pte, vmf->ptl);
  4872          writable = false;
  4873
  4874          /* Migrate to the requested node */
  4875          if (migrate_misplaced_folio(folio, vma, target_nid)) {
  4876                  nid = target_nid;
  4877                  flags |= TNF_MIGRATED;
  4878          } else {
  4879                  flags |= TNF_MIGRATE_FAIL;
  4880                  vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
  4881                                                 vmf->address, &vmf->ptl);
  4882                  if (unlikely(!vmf->pte))
  4883                          goto out;
  4884                  if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
  4885                          pte_unmap_unlock(vmf->pte, vmf->ptl);
  4886                          goto out;
  4887                  }
  4888                  goto out_map;
  4889          }
  4890
  4891  out:
  4892          if (nid != NUMA_NO_NODE)
  4893                  task_numa_fault(last_cpupid, nid, 1, flags);
  4894          return 0;
  4895  out_map:
  4896          /*
  4897           * Make it present again, depending on how arch implements
  4898           * non-accessible ptes, some can allow access by kernel mode.
  4899           */
  4900          old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
  4901          pte = pte_modify(old_pte, vma->vm_page_prot);
  4902          pte = pte_mkyoung(pte);
  4903          if (writable)
  4904                  pte = pte_mkwrite(pte, vma);
  4905          ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
  4906          update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
  4907          pte_unmap_unlock(vmf->pte, vmf->ptl);
  4908          goto out;
  4909  }
  4910

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 21+ messages in thread
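[Editorial note] The build failure above comes from referencing the new `hint_faults` member on a configuration (arm-defconfig) where it evidently is not defined. A common pattern for this class of error is to guard both the member and every user behind the same Kconfig option. The userspace sketch below only illustrates that pattern; the struct name, helper names, and the use of `CONFIG_NUMA_BALANCING` as the guard are assumptions, not the actual patch.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumption: stand-in for the Kconfig option that gates the feature. */
#define CONFIG_NUMA_BALANCING 1

/* Illustrative stand-in for mm_struct; not the kernel definition. */
struct mm_sketch {
#ifdef CONFIG_NUMA_BALANCING
	atomic_int hint_faults;	/* member exists only when the feature does */
#endif
};

/* Callers compile away on configs without the feature, which is what
 * avoids the "no member named 'hint_faults'" error from the report. */
static inline void mm_count_hint_fault(struct mm_sketch *mm)
{
#ifdef CONFIG_NUMA_BALANCING
	atomic_fetch_add_explicit(&mm->hint_faults, 1, memory_order_relaxed);
#endif
}

static inline int mm_hint_faults(struct mm_sketch *mm)
{
#ifdef CONFIG_NUMA_BALANCING
	return atomic_load_explicit(&mm->hint_faults, memory_order_relaxed);
#else
	return 0;
#endif
}
```

With the guard removed (`#define` deleted), both helpers still compile, degrading to no-ops, which mirrors how a config-gated counter would behave on arm-defconfig.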
* [RFC PATCH 2/2] mm: Update hint fault count for pages that are skipped during scanning
  2024-03-27 16:02 [RFC PATCH 0/2] Hot page promotion optimization for large address space Bharata B Rao
  2024-03-27 16:02 ` [RFC PATCH 1/2] sched/numa: Fault count based NUMA hint fault latency Bharata B Rao
@ 2024-03-27 16:02 ` Bharata B Rao
  2024-03-28  5:35 ` [RFC PATCH 0/2] Hot page promotion optimization for large address space Huang, Ying
  2 siblings, 0 replies; 21+ messages in thread
From: Bharata B Rao @ 2024-03-27 16:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, akpm, mingo, peterz, mgorman, raghavendra.kt,
	ying.huang, dave.hansen, hannes, Bharata B Rao

During scanning, PTE updates are skipped for those pages which are
already marked as PROT_NONE. This is required, but also update the
scan-time fault count so that the count used to calculate the latency
is kept up to date with the most recent scanning iteration.

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 mm/huge_memory.c | 7 ++++---
 mm/mprotect.c    | 9 +++++----
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7e62c3c2bbcb..24a4f976323e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2086,9 +2086,6 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	if (is_huge_zero_pmd(*pmd))
 		goto unlock;
 
-	if (pmd_protnone(*pmd))
-		goto unlock;
-
 	folio = page_folio(pmd_page(*pmd));
 	toptier = node_is_toptier(folio_nid(folio));
 	/*
@@ -2102,6 +2099,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
 		    !toptier)
 			folio_xchg_fault_count(folio, atomic_read(&mm->hint_faults));
+
+		if (pmd_protnone(*pmd))
+			goto unlock;
+
 	}
 	/*
 	 * In case prot_numa, we are under mmap_read_lock(mm). It's critical
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 30118fd492f4..cfd3812302be 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -118,10 +118,6 @@ static long change_pte_range(struct mmu_gather *tlb,
 			int nid;
 			bool toptier;
 
-			/* Avoid TLB flush if possible */
-			if (pte_protnone(oldpte))
-				continue;
-
 			folio = vm_normal_folio(vma, addr, oldpte);
 			if (!folio || folio_is_zone_device(folio) ||
 			    folio_test_ksm(folio))
@@ -162,6 +158,11 @@ static long change_pte_range(struct mmu_gather *tlb,
 				folio_xchg_fault_count(folio,
 					atomic_read(&vma->vm_mm->hint_faults));
 
+				/* Avoid TLB flush if possible */
+				if (pte_protnone(oldpte))
+					continue;
+
+
 			}
 
 			oldpte = ptep_modify_prot_start(vma, addr, pte);
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/2] Hot page promotion optimization for large address space
  2024-03-27 16:02 [RFC PATCH 0/2] Hot page promotion optimization for large address space Bharata B Rao
  2024-03-27 16:02 ` [RFC PATCH 1/2] sched/numa: Fault count based NUMA hint fault latency Bharata B Rao
  2024-03-27 16:02 ` [RFC PATCH 2/2] mm: Update hint fault count for pages that are skipped during scanning Bharata B Rao
@ 2024-03-28  5:35 ` Huang, Ying
  2024-03-28  5:49   ` Bharata B Rao
  2 siblings, 1 reply; 21+ messages in thread
From: Huang, Ying @ 2024-03-28  5:35 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-mm, linux-kernel, akpm, mingo, peterz, mgorman,
	raghavendra.kt, dave.hansen, hannes

Bharata B Rao <bharata@amd.com> writes:

> In order to check how efficiently the existing NUMA balancing
> based hot page promotion mechanism can detect hot regions and
> promote pages for workloads with large memory footprints, I
> wrote and tested a program that allocates huge amount of
> memory but routinely touches only small parts of it.
>
> This microbenchmark provisions memory both on DRAM node and CXL node.
> It then divides the entire allocated memory into chunks of smaller
> size and randomly choses a chunk for generating memory accesses.
> Each chunk is then accessed for a fixed number of iterations to
> create the notion of hotness. Within each chunk, the individual
> pages at 4K granularity are again accessed in random fashion.
>
> When a chunk is taken up for access in this manner, its pages
> can either be residing on DRAM or CXL. In the latter case, the NUMA
> balancing driven hot page promotion logic is expected to detect and
> promote the hot pages that reside on CXL.
>
> The experiment was conducted on a 2P AMD Bergamo system that has
> CXL as the 3rd node.
>
> $ numactl -H
> available: 3 nodes (0-2)
> node 0 cpus: 0-127,256-383
> node 0 size: 128054 MB
> node 1 cpus: 128-255,384-511
> node 1 size: 128880 MB
> node 2 cpus:
> node 2 size: 129024 MB
> node distances:
> node   0    1    2
>   0:  10   32   60
>   1:  32   10   50
>   2:  255  255  10
>
> It is seen that number of pages that get promoted is really low and
> the reason for it happens to be that the NUMA hint fault latency turns
> out to be much higher than the hot threshold most of the times. Here
> are a few latency and threshold sample values captured from
> should_numa_migrate_memory() routine when the benchmark was run:
>
> latency		threshold (in ms)
> 20620		1125
> 56185		1125
> 98710		1250
> 148871		1375
> 182891		1625
> 369415		1875
> 630745		2000

The access latency of your workload is 20s to 630s, which appears too
long. Can you try to increase the range of threshold to deal with that?
For example,

echo 100000 > /sys/kernel/debug/sched/numa_balancing/hot_threshold_ms

[snip]

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/2] Hot page promotion optimization for large address space 2024-03-28 5:35 ` [RFC PATCH 0/2] Hot page promotion optimization for large address space Huang, Ying @ 2024-03-28 5:49 ` Bharata B Rao 2024-03-28 6:03 ` Huang, Ying 0 siblings, 1 reply; 21+ messages in thread From: Bharata B Rao @ 2024-03-28 5:49 UTC (permalink / raw) To: Huang, Ying Cc: linux-mm, linux-kernel, akpm, mingo, peterz, mgorman, raghavendra.kt, dave.hansen, hannes On 28-Mar-24 11:05 AM, Huang, Ying wrote: > Bharata B Rao <bharata@amd.com> writes: > >> In order to check how efficiently the existing NUMA balancing >> based hot page promotion mechanism can detect hot regions and >> promote pages for workloads with large memory footprints, I >> wrote and tested a program that allocates huge amount of >> memory but routinely touches only small parts of it. >> >> This microbenchmark provisions memory both on DRAM node and CXL node. >> It then divides the entire allocated memory into chunks of smaller >> size and randomly choses a chunk for generating memory accesses. >> Each chunk is then accessed for a fixed number of iterations to >> create the notion of hotness. Within each chunk, the individual >> pages at 4K granularity are again accessed in random fashion. >> >> When a chunk is taken up for access in this manner, its pages >> can either be residing on DRAM or CXL. In the latter case, the NUMA >> balancing driven hot page promotion logic is expected to detect and >> promote the hot pages that reside on CXL. >> >> The experiment was conducted on a 2P AMD Bergamo system that has >> CXL as the 3rd node. 
>> >> $ numactl -H >> available: 3 nodes (0-2) >> node 0 cpus: 0-127,256-383 >> node 0 size: 128054 MB >> node 1 cpus: 128-255,384-511 >> node 1 size: 128880 MB >> node 2 cpus: >> node 2 size: 129024 MB >> node distances: >> node 0 1 2 >> 0: 10 32 60 >> 1: 32 10 50 >> 2: 255 255 10 >> >> It is seen that number of pages that get promoted is really low and >> the reason for it happens to be that the NUMA hint fault latency turns >> out to be much higher than the hot threshold most of the times. Here >> are a few latency and threshold sample values captured from >> should_numa_migrate_memory() routine when the benchmark was run: >> >> latency threshold (in ms) >> 20620 1125 >> 56185 1125 >> 98710 1250 >> 148871 1375 >> 182891 1625 >> 369415 1875 >> 630745 2000 > > The access latency of your workload is 20s to 630s, which appears too > long. Can you try to increase the range of threshold to deal with that? > For example, > > echo 100000 > /sys/kernel/debug/sched/numa_balancing/hot_threshold_ms That of course should help. But I was exploring alternatives where the notion of hotness can be de-linked from the absolute scanning time to the extent possible. For large memory workloads where only parts of memory get accessed at once, the scanning time can lag from the actual access time significantly as the data above shows. Wondering if such cases can be addressed without having to be workload-specific. Regards, Bharata. ^ permalink raw reply [flat|nested] 21+ messages in thread
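[Editorial note] The alternative being explored above can be sketched as follows (my reading of the RFC; the names, the snapshot mechanism, and the threshold value are invented for illustration): instead of comparing `fault_time - scan_time` against a time threshold, snapshot the per-mm hint-fault counter into the folio at scan time, and at fault time compare how many hint faults have elapsed since that snapshot.

```c
#include <assert.h>

/* Hypothetical illustration of a fault-count-based hotness test.
 * Not the RFC's actual code. */
struct folio_sketch {
	unsigned int fault_count_at_scan; /* snapshot taken when the PTE was made PROT_NONE */
};

/* Relative metric: hint faults elapsed between scan and this fault.
 * Unsigned subtraction stays correct across counter wraparound. */
static unsigned int fault_count_latency(const struct folio_sketch *folio,
					unsigned int mm_hint_faults_now)
{
	return mm_hint_faults_now - folio->fault_count_at_scan;
}

/* Hot if the fault arrived within 'threshold' hint faults of the scan,
 * regardless of how much wall-clock time passed in between. */
static int is_hot(const struct folio_sketch *folio,
		  unsigned int mm_hint_faults_now,
		  unsigned int threshold)
{
	return fault_count_latency(folio, mm_hint_faults_now) <= threshold;
}
```

The point of the relative metric is visible here: a page scanned long ago but faulted on while the process is generating few other hint faults still counts as hot, whereas under the absolute-time scheme the long idle gap alone disqualifies it.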
* Re: [RFC PATCH 0/2] Hot page promotion optimization for large address space 2024-03-28 5:49 ` Bharata B Rao @ 2024-03-28 6:03 ` Huang, Ying 2024-03-28 6:29 ` Bharata B Rao 0 siblings, 1 reply; 21+ messages in thread From: Huang, Ying @ 2024-03-28 6:03 UTC (permalink / raw) To: Bharata B Rao Cc: linux-mm, linux-kernel, akpm, mingo, peterz, mgorman, raghavendra.kt, dave.hansen, hannes Bharata B Rao <bharata@amd.com> writes: > On 28-Mar-24 11:05 AM, Huang, Ying wrote: >> Bharata B Rao <bharata@amd.com> writes: >> >>> In order to check how efficiently the existing NUMA balancing >>> based hot page promotion mechanism can detect hot regions and >>> promote pages for workloads with large memory footprints, I >>> wrote and tested a program that allocates huge amount of >>> memory but routinely touches only small parts of it. >>> >>> This microbenchmark provisions memory both on DRAM node and CXL node. >>> It then divides the entire allocated memory into chunks of smaller >>> size and randomly choses a chunk for generating memory accesses. >>> Each chunk is then accessed for a fixed number of iterations to >>> create the notion of hotness. Within each chunk, the individual >>> pages at 4K granularity are again accessed in random fashion. >>> >>> When a chunk is taken up for access in this manner, its pages >>> can either be residing on DRAM or CXL. In the latter case, the NUMA >>> balancing driven hot page promotion logic is expected to detect and >>> promote the hot pages that reside on CXL. >>> >>> The experiment was conducted on a 2P AMD Bergamo system that has >>> CXL as the 3rd node. 
>>> >>> $ numactl -H >>> available: 3 nodes (0-2) >>> node 0 cpus: 0-127,256-383 >>> node 0 size: 128054 MB >>> node 1 cpus: 128-255,384-511 >>> node 1 size: 128880 MB >>> node 2 cpus: >>> node 2 size: 129024 MB >>> node distances: >>> node 0 1 2 >>> 0: 10 32 60 >>> 1: 32 10 50 >>> 2: 255 255 10 >>> >>> It is seen that number of pages that get promoted is really low and >>> the reason for it happens to be that the NUMA hint fault latency turns >>> out to be much higher than the hot threshold most of the times. Here >>> are a few latency and threshold sample values captured from >>> should_numa_migrate_memory() routine when the benchmark was run: >>> >>> latency threshold (in ms) >>> 20620 1125 >>> 56185 1125 >>> 98710 1250 >>> 148871 1375 >>> 182891 1625 >>> 369415 1875 >>> 630745 2000 >> >> The access latency of your workload is 20s to 630s, which appears too >> long. Can you try to increase the range of threshold to deal with that? >> For example, >> >> echo 100000 > /sys/kernel/debug/sched/numa_balancing/hot_threshold_ms > > That of course should help. But I was exploring alternatives where the > notion of hotness can be de-linked from the absolute scanning time to In fact, only relative time from scan to hint fault is recorded and calculated, we have only limited bits. > the extent possible. For large memory workloads where only parts of memory > get accessed at once, the scanning time can lag from the actual access > time significantly as the data above shows. Wondering if such cases can > be addressed without having to be workload-specific. Does it really matter to promote the quite cold pages (accessed every more than 20s)? And if so, how can we adjust the current algorithm to cover that? I think that may be possible via extending the threshold range. And I think that we can find some way to extending the range by default if necessary. -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/2] Hot page promotion optimization for large address space 2024-03-28 6:03 ` Huang, Ying @ 2024-03-28 6:29 ` Bharata B Rao 2024-03-29 1:14 ` Huang, Ying 0 siblings, 1 reply; 21+ messages in thread From: Bharata B Rao @ 2024-03-28 6:29 UTC (permalink / raw) To: Huang, Ying Cc: linux-mm, linux-kernel, akpm, mingo, peterz, mgorman, raghavendra.kt, dave.hansen, hannes On 28-Mar-24 11:33 AM, Huang, Ying wrote: > Bharata B Rao <bharata@amd.com> writes: > >> On 28-Mar-24 11:05 AM, Huang, Ying wrote: >>> Bharata B Rao <bharata@amd.com> writes: >>> >>>> In order to check how efficiently the existing NUMA balancing >>>> based hot page promotion mechanism can detect hot regions and >>>> promote pages for workloads with large memory footprints, I >>>> wrote and tested a program that allocates huge amount of >>>> memory but routinely touches only small parts of it. >>>> >>>> This microbenchmark provisions memory both on DRAM node and CXL node. >>>> It then divides the entire allocated memory into chunks of smaller >>>> size and randomly choses a chunk for generating memory accesses. >>>> Each chunk is then accessed for a fixed number of iterations to >>>> create the notion of hotness. Within each chunk, the individual >>>> pages at 4K granularity are again accessed in random fashion. >>>> >>>> When a chunk is taken up for access in this manner, its pages >>>> can either be residing on DRAM or CXL. In the latter case, the NUMA >>>> balancing driven hot page promotion logic is expected to detect and >>>> promote the hot pages that reside on CXL. >>>> >>>> The experiment was conducted on a 2P AMD Bergamo system that has >>>> CXL as the 3rd node. 
>>>> >>>> $ numactl -H >>>> available: 3 nodes (0-2) >>>> node 0 cpus: 0-127,256-383 >>>> node 0 size: 128054 MB >>>> node 1 cpus: 128-255,384-511 >>>> node 1 size: 128880 MB >>>> node 2 cpus: >>>> node 2 size: 129024 MB >>>> node distances: >>>> node 0 1 2 >>>> 0: 10 32 60 >>>> 1: 32 10 50 >>>> 2: 255 255 10 >>>> >>>> It is seen that number of pages that get promoted is really low and >>>> the reason for it happens to be that the NUMA hint fault latency turns >>>> out to be much higher than the hot threshold most of the times. Here >>>> are a few latency and threshold sample values captured from >>>> should_numa_migrate_memory() routine when the benchmark was run: >>>> >>>> latency threshold (in ms) >>>> 20620 1125 >>>> 56185 1125 >>>> 98710 1250 >>>> 148871 1375 >>>> 182891 1625 >>>> 369415 1875 >>>> 630745 2000 >>> >>> The access latency of your workload is 20s to 630s, which appears too >>> long. Can you try to increase the range of threshold to deal with that? >>> For example, >>> >>> echo 100000 > /sys/kernel/debug/sched/numa_balancing/hot_threshold_ms >> >> That of course should help. But I was exploring alternatives where the >> notion of hotness can be de-linked from the absolute scanning time to > > In fact, only relative time from scan to hint fault is recorded and > calculated, we have only limited bits. > >> the extent possible. For large memory workloads where only parts of memory >> get accessed at once, the scanning time can lag from the actual access >> time significantly as the data above shows. Wondering if such cases can >> be addressed without having to be workload-specific. > > Does it really matter to promote the quite cold pages (accessed every > more than 20s)? And if so, how can we adjust the current algorithm to > cover that? I think that may be possible via extending the threshold > range. And I think that we can find some way to extending the range by > default if necessary. 
I don't think the pages are cold but rather the existing mechanism fails
to categorize them as hot. This is because the pages were scanned way
before the accesses start happening. When repeated accesses are made to
a chunk of memory that has been scanned a while back, none of those
accesses get classified as hot because the scan time is way behind
the current access time. That's the reason we are seeing the value
of latency ranging from 20s to 630s as shown above.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 21+ messages in thread
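[Editorial note] The mismatch described above can be made concrete with a small walk-through using the sample values from the thread (this is simplified arithmetic, not kernel code): the existing scheme measures the wall-clock gap between the PTE scan and the hint fault, so a page scanned long before its burst of accesses can never come in under the threshold.

```c
#include <assert.h>

/* Existing scheme, simplified: hint fault latency is the fault time
 * minus the scan time, and a page is treated as hot only if that gap
 * fits within the hot threshold. */
static long hint_fault_latency_ms(long scan_time_ms, long fault_time_ms)
{
	return fault_time_ms - scan_time_ms;
}

static int within_threshold(long latency_ms, long threshold_ms)
{
	return latency_ms <= threshold_ms;
}
```

Plugging in the first sample from the thread: a page scanned at t=0 and first accessed 20620 ms later has latency 20620 ms against a threshold of 1125 ms, so it is rejected; even the worst sample (630745 ms) stays above the suggested enlarged threshold of 100000 ms.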
* Re: [RFC PATCH 0/2] Hot page promotion optimization for large address space 2024-03-28 6:29 ` Bharata B Rao @ 2024-03-29 1:14 ` Huang, Ying 2024-04-01 12:20 ` Bharata B Rao 0 siblings, 1 reply; 21+ messages in thread From: Huang, Ying @ 2024-03-29 1:14 UTC (permalink / raw) To: Bharata B Rao Cc: linux-mm, linux-kernel, akpm, mingo, peterz, mgorman, raghavendra.kt, dave.hansen, hannes Bharata B Rao <bharata@amd.com> writes: > On 28-Mar-24 11:33 AM, Huang, Ying wrote: >> Bharata B Rao <bharata@amd.com> writes: >> >>> On 28-Mar-24 11:05 AM, Huang, Ying wrote: >>>> Bharata B Rao <bharata@amd.com> writes: >>>> >>>>> In order to check how efficiently the existing NUMA balancing >>>>> based hot page promotion mechanism can detect hot regions and >>>>> promote pages for workloads with large memory footprints, I >>>>> wrote and tested a program that allocates huge amount of >>>>> memory but routinely touches only small parts of it. >>>>> >>>>> This microbenchmark provisions memory both on DRAM node and CXL node. >>>>> It then divides the entire allocated memory into chunks of smaller >>>>> size and randomly choses a chunk for generating memory accesses. >>>>> Each chunk is then accessed for a fixed number of iterations to >>>>> create the notion of hotness. Within each chunk, the individual >>>>> pages at 4K granularity are again accessed in random fashion. >>>>> >>>>> When a chunk is taken up for access in this manner, its pages >>>>> can either be residing on DRAM or CXL. In the latter case, the NUMA >>>>> balancing driven hot page promotion logic is expected to detect and >>>>> promote the hot pages that reside on CXL. >>>>> >>>>> The experiment was conducted on a 2P AMD Bergamo system that has >>>>> CXL as the 3rd node. 
>>>>> >>>>> $ numactl -H >>>>> available: 3 nodes (0-2) >>>>> node 0 cpus: 0-127,256-383 >>>>> node 0 size: 128054 MB >>>>> node 1 cpus: 128-255,384-511 >>>>> node 1 size: 128880 MB >>>>> node 2 cpus: >>>>> node 2 size: 129024 MB >>>>> node distances: >>>>> node 0 1 2 >>>>> 0: 10 32 60 >>>>> 1: 32 10 50 >>>>> 2: 255 255 10 >>>>> >>>>> It is seen that number of pages that get promoted is really low and >>>>> the reason for it happens to be that the NUMA hint fault latency turns >>>>> out to be much higher than the hot threshold most of the times. Here >>>>> are a few latency and threshold sample values captured from >>>>> should_numa_migrate_memory() routine when the benchmark was run: >>>>> >>>>> latency threshold (in ms) >>>>> 20620 1125 >>>>> 56185 1125 >>>>> 98710 1250 >>>>> 148871 1375 >>>>> 182891 1625 >>>>> 369415 1875 >>>>> 630745 2000 >>>> >>>> The access latency of your workload is 20s to 630s, which appears too >>>> long. Can you try to increase the range of threshold to deal with that? >>>> For example, >>>> >>>> echo 100000 > /sys/kernel/debug/sched/numa_balancing/hot_threshold_ms >>> >>> That of course should help. But I was exploring alternatives where the >>> notion of hotness can be de-linked from the absolute scanning time to >> >> In fact, only relative time from scan to hint fault is recorded and >> calculated, we have only limited bits. >> >>> the extent possible. For large memory workloads where only parts of memory >>> get accessed at once, the scanning time can lag from the actual access >>> time significantly as the data above shows. Wondering if such cases can >>> be addressed without having to be workload-specific. >> >> Does it really matter to promote the quite cold pages (accessed every >> more than 20s)? And if so, how can we adjust the current algorithm to >> cover that? I think that may be possible via extending the threshold >> range. And I think that we can find some way to extending the range by >> default if necessary. 
> I don't think the pages are cold but rather the existing mechanism fails
> to categorize them as hot. This is because the pages were scanned way
> before the accesses start happening. When repeated accesses are made to
> a chunk of memory that has been scanned a while back, none of those
> accesses get classified as hot because the scan time is way behind
> the current access time. That's the reason we are seeing the value
> of latency ranging from 20s to 630s as shown above.

If repeated accesses continue, the page will be identified as hot when
it is scanned next time even if we don't expand the threshold range. If
the repeated accesses only last a very short time, it makes little sense
to identify the pages as hot. Right?

The bits to record the scan time or the hint page fault count are
limited, so it's possible for either to overflow anyway. We can scale
the time stamp if necessary (for example, from 1ms to 10ms). But it's
hard to scale a fault counter. And nobody can guarantee the frequency
of hint page faults must be less than 1/ms; if it's 10/ms, the counter
can only record an even shorter interval.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/2] Hot page promotion optimization for large address space 2024-03-29 1:14 ` Huang, Ying @ 2024-04-01 12:20 ` Bharata B Rao 2024-04-02 2:03 ` Huang, Ying 0 siblings, 1 reply; 21+ messages in thread From: Bharata B Rao @ 2024-04-01 12:20 UTC (permalink / raw) To: Huang, Ying Cc: linux-mm, linux-kernel, akpm, mingo, peterz, mgorman, raghavendra.kt, dave.hansen, hannes On 29-Mar-24 6:44 AM, Huang, Ying wrote: > Bharata B Rao <bharata@amd.com> writes: <snip> >> I don't think the pages are cold but rather the existing mechanism fails >> to categorize them as hot. This is because the pages were scanned way >> before the accesses start happening. When repeated accesses are made to >> a chunk of memory that has been scanned a while back, none of those >> accesses get classified as hot because the scan time is way behind >> the current access time. That's the reason we are seeing the value >> of latency ranging from 20s to 630s as shown above. > > If repeated accesses continue, the page will be identified as hot when > it is scanned next time even if we don't expand the threshold range. If > the repeated accesses only last very short time, it makes little sense > to identify the pages as hot. Right? The total allocated memory here is 192G and the chunk size is 1G. Each time one such 1G chunk is taken up randomly for generating memory accesses. Within that 1G, 262144 random accesses are performed and 262144 such accesses are repeated for 512 times. I thought that should be enough to classify that chunk of memory as hot. But as we see, often times the scan time is lagging the access time by a large value. Let me instrument the code further to learn more insights (if possible) about the scanning/fault time behaviors here. Leaving the fault count based threshold apart, do you think there is value in updating the scan time for skipped pages/PTEs during every scan so that the scan time remains current for all the pages? 
> > The bits to record scan time or hint page fault is limited, so it's > possible for it to overflow anyway. We scan scale time stamp if > necessary (for example, from 1ms to 10ms). But it's hard to scale fault > counter. And nobody can guarantee the frequency of hint page fault must > be less 1/ms, if it's 10/ms, it can record even short interval. Yes, with the approach I have taken, the time factor is out of the equation and the notion of hotness is purely a factor of the number of faults (or accesses) Regards, Bharata. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/2] Hot page promotion optimization for large address space 2024-04-01 12:20 ` Bharata B Rao @ 2024-04-02 2:03 ` Huang, Ying 2024-04-02 9:26 ` Bharata B Rao 0 siblings, 1 reply; 21+ messages in thread From: Huang, Ying @ 2024-04-02 2:03 UTC (permalink / raw) To: Bharata B Rao Cc: linux-mm, linux-kernel, akpm, mingo, peterz, mgorman, raghavendra.kt, dave.hansen, hannes Bharata B Rao <bharata@amd.com> writes: > On 29-Mar-24 6:44 AM, Huang, Ying wrote: >> Bharata B Rao <bharata@amd.com> writes: > <snip> >>> I don't think the pages are cold but rather the existing mechanism fails >>> to categorize them as hot. This is because the pages were scanned way >>> before the accesses start happening. When repeated accesses are made to >>> a chunk of memory that has been scanned a while back, none of those >>> accesses get classified as hot because the scan time is way behind >>> the current access time. That's the reason we are seeing the value >>> of latency ranging from 20s to 630s as shown above. >> >> If repeated accesses continue, the page will be identified as hot when >> it is scanned next time even if we don't expand the threshold range. If >> the repeated accesses only last very short time, it makes little sense >> to identify the pages as hot. Right? > > The total allocated memory here is 192G and the chunk size is 1G. Each > time one such 1G chunk is taken up randomly for generating memory accesses. > Within that 1G, 262144 random accesses are performed and 262144 such > accesses are repeated for 512 times. I thought that should be enough > to classify that chunk of memory as hot. IIUC, some pages are accessed in very short time (maybe within 1ms). This isn't repeated access in a long period. I think that pages accessed repeatedly in a long period are good candidates for promoting. But pages accessed frequently in only very short time aren't. > But as we see, often times > the scan time is lagging the access time by a large value. 
> > Let me instrument the code further to learn more insights (if possible) > about the scanning/fault time behaviors here. > > Leaving the fault count based threshold apart, do you think there is > value in updating the scan time for skipped pages/PTEs during every > scan so that the scan time remains current for all the pages? No, I don't think so. That makes hint page fault latency more inaccurate. >> >> The bits to record scan time or hint page fault is limited, so it's >> possible for it to overflow anyway. We scan scale time stamp if >> necessary (for example, from 1ms to 10ms). But it's hard to scale fault >> counter. And nobody can guarantee the frequency of hint page fault must >> be less 1/ms, if it's 10/ms, it can record even short interval. > > Yes, with the approach I have taken, the time factor is out of the > equation and the notion of hotness is purely a factor of the number of > faults (or accesses) Sorry, I don't get your idea here. I think that the fault count may be worse than time in quite some cases. -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/2] Hot page promotion optimization for large address space 2024-04-02 2:03 ` Huang, Ying @ 2024-04-02 9:26 ` Bharata B Rao 2024-04-03 8:40 ` Huang, Ying 0 siblings, 1 reply; 21+ messages in thread From: Bharata B Rao @ 2024-04-02 9:26 UTC (permalink / raw) To: Huang, Ying Cc: linux-mm, linux-kernel, akpm, mingo, peterz, mgorman, raghavendra.kt, dave.hansen, hannes On 02-Apr-24 7:33 AM, Huang, Ying wrote: > Bharata B Rao <bharata@amd.com> writes: > >> On 29-Mar-24 6:44 AM, Huang, Ying wrote: >>> Bharata B Rao <bharata@amd.com> writes: >> <snip> >>>> I don't think the pages are cold but rather the existing mechanism fails >>>> to categorize them as hot. This is because the pages were scanned way >>>> before the accesses start happening. When repeated accesses are made to >>>> a chunk of memory that has been scanned a while back, none of those >>>> accesses get classified as hot because the scan time is way behind >>>> the current access time. That's the reason we are seeing the value >>>> of latency ranging from 20s to 630s as shown above. >>> >>> If repeated accesses continue, the page will be identified as hot when >>> it is scanned next time even if we don't expand the threshold range. If >>> the repeated accesses only last very short time, it makes little sense >>> to identify the pages as hot. Right? >> >> The total allocated memory here is 192G and the chunk size is 1G. Each >> time one such 1G chunk is taken up randomly for generating memory accesses. >> Within that 1G, 262144 random accesses are performed and 262144 such >> accesses are repeated for 512 times. I thought that should be enough >> to classify that chunk of memory as hot. > > IIUC, some pages are accessed in very short time (maybe within 1ms). > This isn't repeated access in a long period. I think that pages > accessed repeatedly in a long period are good candidates for promoting. > But pages accessed frequently in only very short time aren't. 
Here are the numbers for the 192nd chunk:

Each iteration of 262144 random accesses takes around ~10ms.
512 such iterations take ~5s.
numa_scan_seq is 16 when this chunk is accessed.

And no page promotions were done from this chunk. Every time,
should_numa_migrate_memory() found the NUMA hint fault latency to be
higher than the threshold.

Are these time periods considered too short for the pages
to be detected as hot and promoted?

>
>> But as we see, often times
>> the scan time is lagging the access time by a large value.
>>
>> Let me instrument the code further to learn more insights (if possible)
>> about the scanning/fault time behaviors here.
>>
>> Leaving the fault count based threshold apart, do you think there is
>> value in updating the scan time for skipped pages/PTEs during every
>> scan so that the scan time remains current for all the pages?
>
> No, I don't think so. That makes hint page fault latency more
> inaccurate.

For the case that I have shown, depending on an old value of the scan
time doesn't work well when pages get accessed a long time after being
scanned. At least with the scheme I show in patch 2/2, the probability
of detecting pages as hot increases.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 21+ messages in thread
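[Editorial note] For reference, the access pattern under discussion looks roughly like the sketch below (a reconstruction from the thread's description of the unpublished microbenchmark; the function and its parameters are illustrative assumptions). A 1G chunk holds 262144 4K pages; each of the 512 iterations touches 262144 randomly chosen pages, and the whole burst completes in about 5s.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SZ 4096UL

/* Touch 'accesses' randomly chosen 4K pages within the chunk, and
 * repeat that 'iters' times; returns the total number of touches.
 * The thread's parameters would be chunk_pages = 262144 (a 1G chunk),
 * accesses = 262144, iters = 512. */
static unsigned long touch_chunk(unsigned char *chunk,
				 unsigned long chunk_pages,
				 unsigned long accesses,
				 unsigned long iters)
{
	unsigned long total = 0;

	for (unsigned long i = 0; i < iters; i++) {
		for (unsigned long a = 0; a < accesses; a++) {
			unsigned long page = (unsigned long)rand() % chunk_pages;

			chunk[page * PAGE_SZ]++;	/* one touch of the chosen page */
			total++;
		}
	}
	return total;
}
```

Since one iteration roughly covers the chunk's page count, each page is touched on the order of 512 times within ~5s, which is why the thread debates whether this counts as sustained repeated access or a short burst.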
* Re: [RFC PATCH 0/2] Hot page promotion optimization for large address space
  2024-04-02  9:26     ` Bharata B Rao
@ 2024-04-03  8:40       ` Huang, Ying
  2024-04-12  4:00         ` Bharata B Rao
  0 siblings, 1 reply; 21+ messages in thread
From: Huang, Ying @ 2024-04-03  8:40 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-mm, linux-kernel, akpm, mingo, peterz, mgorman,
	raghavendra.kt, dave.hansen, hannes

Bharata B Rao <bharata@amd.com> writes:

> On 02-Apr-24 7:33 AM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>>
>>> On 29-Mar-24 6:44 AM, Huang, Ying wrote:
>>>> Bharata B Rao <bharata@amd.com> writes:
>>> <snip>
>>>>> I don't think the pages are cold but rather the existing mechanism fails
>>>>> to categorize them as hot. This is because the pages were scanned way
>>>>> before the accesses start happening. When repeated accesses are made to
>>>>> a chunk of memory that has been scanned a while back, none of those
>>>>> accesses get classified as hot because the scan time is way behind
>>>>> the current access time. That's the reason we are seeing the value
>>>>> of latency ranging from 20s to 630s as shown above.
>>>>
>>>> If repeated accesses continue, the page will be identified as hot when
>>>> it is scanned next time even if we don't expand the threshold range. If
>>>> the repeated accesses only last very short time, it makes little sense
>>>> to identify the pages as hot. Right?
>>>
>>> The total allocated memory here is 192G and the chunk size is 1G. Each
>>> time one such 1G chunk is taken up randomly for generating memory accesses.
>>> Within that 1G, 262144 random accesses are performed and 262144 such
>>> accesses are repeated for 512 times. I thought that should be enough
>>> to classify that chunk of memory as hot.
>>
>> IIUC, some pages are accessed in very short time (maybe within 1ms).
>> This isn't repeated access in a long period. I think that pages
>> accessed repeatedly in a long period are good candidates for promoting.
>> But pages accessed frequently in only very short time aren't.
>
> Here are the numbers for the 192nd chunk:
>
> Each iteration of 262144 random accesses takes around ~10ms
> 512 such iterations are taking ~5s
> numa_scan_seq is 16 when this chunk is accessed.
> And no page promotions were done from this chunk. All the
> time should_numa_migrate_memory() found the NUMA hint fault
> latency to be higher than threshold.
>
> Are these time periods considered too short for the pages
> to be detected as hot and promoted?

Yes. I think so. This is burst accessing, not repeated accessing.
IIUC, NUMA balancing based promotion only works for repeated accessing
for a long time, for example, >100s.

>>
>>> But as we see, often times
>>> the scan time is lagging the access time by a large value.
>>>
>>> Let me instrument the code further to learn more insights (if possible)
>>> about the scanning/fault time behaviors here.
>>>
>>> Leaving the fault count based threshold apart, do you think there is
>>> value in updating the scan time for skipped pages/PTEs during every
>>> scan so that the scan time remains current for all the pages?
>>
>> No, I don't think so. That makes hint page fault latency more
>> inaccurate.
>
> For the case that I have shown, depending on a old value of scan
> time doesn't work well when pages get accessed after a long time
> since scanning. At least with the scheme I show in patch 2/2,
> probability of detecting pages as hot increases.

Yes. This may help your cases, but it will hurt other cases with
incorrect hint page fault latency. To resolve your issue, we can
increase the max value of the hot threshold automatically. We can
work on that if you can find a real workload.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/2] Hot page promotion optimization for large address space
  2024-04-03  8:40       ` Huang, Ying
@ 2024-04-12  4:00         ` Bharata B Rao
  2024-04-12  7:28           ` Huang, Ying
  0 siblings, 1 reply; 21+ messages in thread
From: Bharata B Rao @ 2024-04-12  4:00 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, linux-kernel, akpm, mingo, peterz, mgorman,
	raghavendra.kt, dave.hansen, hannes

On 03-Apr-24 2:10 PM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:
>
>> On 02-Apr-24 7:33 AM, Huang, Ying wrote:
>>> Bharata B Rao <bharata@amd.com> writes:
>>>
>>>> On 29-Mar-24 6:44 AM, Huang, Ying wrote:
>>>>> Bharata B Rao <bharata@amd.com> writes:
>>>> <snip>
>>>>>> I don't think the pages are cold but rather the existing mechanism fails
>>>>>> to categorize them as hot. This is because the pages were scanned way
>>>>>> before the accesses start happening. When repeated accesses are made to
>>>>>> a chunk of memory that has been scanned a while back, none of those
>>>>>> accesses get classified as hot because the scan time is way behind
>>>>>> the current access time. That's the reason we are seeing the value
>>>>>> of latency ranging from 20s to 630s as shown above.
>>>>>
>>>>> If repeated accesses continue, the page will be identified as hot when
>>>>> it is scanned next time even if we don't expand the threshold range. If
>>>>> the repeated accesses only last very short time, it makes little sense
>>>>> to identify the pages as hot. Right?
>>>>
>>>> The total allocated memory here is 192G and the chunk size is 1G. Each
>>>> time one such 1G chunk is taken up randomly for generating memory accesses.
>>>> Within that 1G, 262144 random accesses are performed and 262144 such
>>>> accesses are repeated for 512 times. I thought that should be enough
>>>> to classify that chunk of memory as hot.
>>>
>>> IIUC, some pages are accessed in very short time (maybe within 1ms).
>>> This isn't repeated access in a long period.
>>> I think that pages
>>> accessed repeatedly in a long period are good candidates for promoting.
>>> But pages accessed frequently in only very short time aren't.
>>
>> Here are the numbers for the 192nd chunk:
>>
>> Each iteration of 262144 random accesses takes around ~10ms
>> 512 such iterations are taking ~5s
>> numa_scan_seq is 16 when this chunk is accessed.
>> And no page promotions were done from this chunk. All the
>> time should_numa_migrate_memory() found the NUMA hint fault
>> latency to be higher than threshold.
>>
>> Are these time periods considered too short for the pages
>> to be detected as hot and promoted?
>
> Yes. I think so. This is burst accessing, not repeated accessing.
> IIUC, NUMA balancing based promotion only works for repeated accessing
> for long time, for example, >100s.

Hmm... a page is accessed 512 times over a period of 5s and is still
not detected as hot. This would be understandable if fresh scanning
couldn't be done because the accesses were bursty and hence couldn't
be captured via NUMA hint faults. But here the access captured via a
hint fault is being rejected as not hot merely because the scanning was
done a while back. That said, I do see the challenge here, since we
depend on the scan time to obtain the frequency-of-access metric.

BTW, for the same scenario above with numa_balancing_mode=1, the remote
accesses do get detected and migration to the source node is tried. It
is a different matter that the pages eventually can't be migrated in
this specific scenario, as the src node is already full.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/2] Hot page promotion optimization for large address space
  2024-04-12  4:00         ` Bharata B Rao
@ 2024-04-12  7:28           ` Huang, Ying
  2024-04-12  8:16             ` Bharata B Rao
  0 siblings, 1 reply; 21+ messages in thread
From: Huang, Ying @ 2024-04-12  7:28 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-mm, linux-kernel, akpm, mingo, peterz, mgorman,
	raghavendra.kt, dave.hansen, hannes

Bharata B Rao <bharata@amd.com> writes:

> On 03-Apr-24 2:10 PM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>>
>>> On 02-Apr-24 7:33 AM, Huang, Ying wrote:
>>>> Bharata B Rao <bharata@amd.com> writes:
>>>>
>>>>> On 29-Mar-24 6:44 AM, Huang, Ying wrote:
>>>>>> Bharata B Rao <bharata@amd.com> writes:
>>>>> <snip>
>>>>>>> I don't think the pages are cold but rather the existing mechanism fails
>>>>>>> to categorize them as hot. This is because the pages were scanned way
>>>>>>> before the accesses start happening. When repeated accesses are made to
>>>>>>> a chunk of memory that has been scanned a while back, none of those
>>>>>>> accesses get classified as hot because the scan time is way behind
>>>>>>> the current access time. That's the reason we are seeing the value
>>>>>>> of latency ranging from 20s to 630s as shown above.
>>>>>>
>>>>>> If repeated accesses continue, the page will be identified as hot when
>>>>>> it is scanned next time even if we don't expand the threshold range. If
>>>>>> the repeated accesses only last very short time, it makes little sense
>>>>>> to identify the pages as hot. Right?
>>>>>
>>>>> The total allocated memory here is 192G and the chunk size is 1G. Each
>>>>> time one such 1G chunk is taken up randomly for generating memory accesses.
>>>>> Within that 1G, 262144 random accesses are performed and 262144 such
>>>>> accesses are repeated for 512 times. I thought that should be enough
>>>>> to classify that chunk of memory as hot.
>>>>
>>>> IIUC, some pages are accessed in very short time (maybe within 1ms).
>>>> This isn't repeated access in a long period. I think that pages
>>>> accessed repeatedly in a long period are good candidates for promoting.
>>>> But pages accessed frequently in only very short time aren't.
>>>
>>> Here are the numbers for the 192nd chunk:
>>>
>>> Each iteration of 262144 random accesses takes around ~10ms
>>> 512 such iterations are taking ~5s
>>> numa_scan_seq is 16 when this chunk is accessed.
>>> And no page promotions were done from this chunk. All the
>>> time should_numa_migrate_memory() found the NUMA hint fault
>>> latency to be higher than threshold.
>>>
>>> Are these time periods considered too short for the pages
>>> to be detected as hot and promoted?
>>
>> Yes. I think so. This is burst accessing, not repeated accessing.
>> IIUC, NUMA balancing based promotion only works for repeated accessing
>> for long time, for example, >100s.
>
> Hmm... When a page is accessed 512 times over a period of 5s and it is
> still not detected as hot. This is understandable if fresh scanning couldn't
> be done as the accesses were bursty and hence they couldn't be captured via
> NUMA hint faults. But here the access captured via hint fault is being rejected
> as not hot because the scanning was done a while back. But I do see the challenge
> here since we depend on scanning time to obtain the frequency-of-access metric.

Consider some pages that will be accessed once every 1 hour: should we
consider them hot or not? Will your proposed method deal with that
correctly?

> BTW, for the above same scenario with numa_balancing_mode=1, the remote
> accesses get detected and migration to source node is tried. It is a different
> matter that eventually pages can't be migrated in this specific scenario as
> the src node is already full.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/2] Hot page promotion optimization for large address space
  2024-04-12  7:28           ` Huang, Ying
@ 2024-04-12  8:16             ` Bharata B Rao
  2024-04-12  8:48               ` Huang, Ying
  0 siblings, 1 reply; 21+ messages in thread
From: Bharata B Rao @ 2024-04-12  8:16 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, linux-kernel, akpm, mingo, peterz, mgorman,
	raghavendra.kt, dave.hansen, hannes

On 12-Apr-24 12:58 PM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:
>
>> On 03-Apr-24 2:10 PM, Huang, Ying wrote:
>>>> Here are the numbers for the 192nd chunk:
>>>>
>>>> Each iteration of 262144 random accesses takes around ~10ms
>>>> 512 such iterations are taking ~5s
>>>> numa_scan_seq is 16 when this chunk is accessed.
>>>> And no page promotions were done from this chunk. All the
>>>> time should_numa_migrate_memory() found the NUMA hint fault
>>>> latency to be higher than threshold.
>>>>
>>>> Are these time periods considered too short for the pages
>>>> to be detected as hot and promoted?
>>>
>>> Yes. I think so. This is burst accessing, not repeated accessing.
>>> IIUC, NUMA balancing based promotion only works for repeated accessing
>>> for long time, for example, >100s.
>>
>> Hmm... When a page is accessed 512 times over a period of 5s and it is
>> still not detected as hot. This is understandable if fresh scanning couldn't
>> be done as the accesses were bursty and hence they couldn't be captured via
>> NUMA hint faults. But here the access captured via hint fault is being rejected
>> as not hot because the scanning was done a while back. But I do see the challenge
>> here since we depend on scanning time to obtain the frequency-of-access metric.
>
> Consider some pages that will be accessed once every 1 hour, should we
> consider it hot or not? Will your proposed method deal with that
> correctly?
The proposed method removes absolute time as a factor in the decision
and instead relies on the number of hint faults that have occurred
since the page was last scanned. As long as enough hint faults happen
in that 1 hour (which means many other accesses have been captured in
that 1 hour), that page shouldn't be considered hot. You did mention
earlier that the hint fault rate varies a lot, and one thing I haven't
tried yet is to vary the fault threshold based on the current or
historical fault rate.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [RFC PATCH 0/2] Hot page promotion optimization for large address space
  2024-04-12  8:16             ` Bharata B Rao
@ 2024-04-12  8:48               ` Huang, Ying
  0 siblings, 0 replies; 21+ messages in thread
From: Huang, Ying @ 2024-04-12  8:48 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-mm, linux-kernel, akpm, mingo, peterz, mgorman,
	raghavendra.kt, dave.hansen, hannes

Bharata B Rao <bharata@amd.com> writes:

> On 12-Apr-24 12:58 PM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>>
>>> On 03-Apr-24 2:10 PM, Huang, Ying wrote:
>>>>> Here are the numbers for the 192nd chunk:
>>>>>
>>>>> Each iteration of 262144 random accesses takes around ~10ms
>>>>> 512 such iterations are taking ~5s
>>>>> numa_scan_seq is 16 when this chunk is accessed.
>>>>> And no page promotions were done from this chunk. All the
>>>>> time should_numa_migrate_memory() found the NUMA hint fault
>>>>> latency to be higher than threshold.
>>>>>
>>>>> Are these time periods considered too short for the pages
>>>>> to be detected as hot and promoted?
>>>>
>>>> Yes. I think so. This is burst accessing, not repeated accessing.
>>>> IIUC, NUMA balancing based promotion only works for repeated accessing
>>>> for long time, for example, >100s.
>>>
>>> Hmm... When a page is accessed 512 times over a period of 5s and it is
>>> still not detected as hot. This is understandable if fresh scanning couldn't
>>> be done as the accesses were bursty and hence they couldn't be captured via
>>> NUMA hint faults. But here the access captured via hint fault is being rejected
>>> as not hot because the scanning was done a while back. But I do see the challenge
>>> here since we depend on scanning time to obtain the frequency-of-access metric.
>>
>> Consider some pages that will be accessed once every 1 hour, should we
>> consider it hot or not? Will your proposed method deal with that
>> correctly?
>
> The proposed method removes the absolute time as a factor for the decision and instead
> relies on the number of hint faults that have occurred since that page was scanned last.
> As long as there are enough hint faults happening in that 1 hour (which means a lot many
> other accesses have been captured in that 1 hour), that page shouldn't be considered as
> hot. You did mention earlier about hint fault rate varying a lot and one thing I haven't
> tried yet is to vary the fault threshold based on current or historical fault rate.

In your original example, if many other accesses occur between the
NUMA balancing page table scan and the 512 page accesses, you cannot
identify the page as hot either, right? If the NUMA balancing page
table scanning period is much longer than 5s, it's highly possible
that we cannot distinguish between 1 and 512 page accesses within 5s,
with either your method or the original one. Better to discuss the
behavior with a more detailed example: for instance, when the page is
scanned, how many pages are accessed, how long between accesses, etc.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 21+ messages in thread
end of thread, other threads:[~2024-04-12  8:50 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-27 16:02 [RFC PATCH 0/2] Hot page promotion optimization for large address space Bharata B Rao
2024-03-27 16:02 ` [RFC PATCH 1/2] sched/numa: Fault count based NUMA hint fault latency Bharata B Rao
2024-03-28  1:56   ` Huang, Ying
2024-03-28  4:39     ` Bharata B Rao
2024-03-28  5:21       ` Huang, Ying
2024-03-30  1:11   ` kernel test robot
2024-03-30  3:47   ` kernel test robot
2024-03-27 16:02 ` [RFC PATCH 2/2] mm: Update hint fault count for pages that are skipped during scanning Bharata B Rao
2024-03-28  5:35 ` [RFC PATCH 0/2] Hot page promotion optimization for large address space Huang, Ying
2024-03-28  5:49   ` Bharata B Rao
2024-03-28  6:03     ` Huang, Ying
2024-03-28  6:29       ` Bharata B Rao
2024-03-29  1:14         ` Huang, Ying
2024-04-01 12:20           ` Bharata B Rao
2024-04-02  2:03             ` Huang, Ying
2024-04-02  9:26               ` Bharata B Rao
2024-04-03  8:40                 ` Huang, Ying
2024-04-12  4:00                   ` Bharata B Rao
2024-04-12  7:28                     ` Huang, Ying
2024-04-12  8:16                       ` Bharata B Rao
2024-04-12  8:48                         ` Huang, Ying