From: Raghavendra K T <raghavendra.kt@amd.com>
To: Mel Gorman <mgorman@suse.de>, Bharata B Rao <bharata@amd.com>
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com,
	peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, bristot@redhat.com,
	dishaa.talreja@amd.com, Wei Huang <wei.huang2@amd.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Linux-MM <linux-mm@kvack.org>
Subject: Re: [RFC PATCH v0 1/3] sched/numa: Process based autonuma scan period framework
Date: Wed, 21 Jun 2023 11:20:24 +0530
Message-ID: <bc4b5606-2409-85e6-cdf2-1d7857c8a446@amd.com>
In-Reply-To: <20220201141520.GB3301@suse.de>

+linux-mm
On 2/1/2022 7:45 PM, Mel Gorman wrote:
> On Tue, Feb 01, 2022 at 05:52:55PM +0530, Bharata B Rao wrote:
>> On 1/31/2022 5:47 PM, Mel Gorman wrote:
>>> On Fri, Jan 28, 2022 at 10:58:49AM +0530, Bharata B Rao wrote:
>>>> From: Disha Talreja <dishaa.talreja@amd.com>
>>>>
>>>> Add a new framework that calculates autonuma scan period
>>>> based on per-process NUMA fault stats.
>>>>
>>>> NUMA faults can be classified into different categories, such
>>>> as local vs. remote, or private vs. shared. It is also important
>>>> to understand such behavior from the perspective of a process.
>>>> The per-process fault stats added here will be used for
>>>> calculating the scan period in the adaptive NUMA algorithm.
>>>>
>>>
>>> Be more specific on how the local vs remote, private vs shared stats
>>> are reflections of per-task activity of the same.
>>
>> Sure, will document the algorithm better. However the overall thinking
>> here is that the address-space scanning is a per-process activity and
>> hence the scan period value derived from the accumulated per-process
>> faults is more appropriate than calculating per-task (per-thread) scan
>> periods. Participating threads may have their own local/remote and private/shared
>> behaviors, but when aggregated at the process level, it gives a better
>> input for eventual scan period variation. The understanding is that individual
>> thread fault rates will start altering the overall process metrics in
>> such a manner that we respond by changing the scan rate to do more aggressive
>> or less aggressive scanning.
>>
> 
> I don't have anything to add on your other responses as it would mostly
> be an acknowledgment of your response.
> 
> However, the major concern I have is that address-space wide decisions
> on scan rates have no sensible means of adapting to thread-specific
> requirements. I completely agree that it will result in more stable scan
> rates, particularly the adjustments. It also side-steps a problem where
> new threads may start with a scan rate that is completely inappropriate.
> 
> However, I worry that it would be limited overall because each thread
> potentially has unique behaviour which is not obvious in a workload like
> NAS where threads are all executing similar instructions on different
> data. For other applications, threads may operate on thread-local areas
> only (low scan rate), others could operate on shared-only regions (high
> scan rate until back-off and interleave), threads can have phase behaviour
> (manager thread collecting data from worker threads) and threads can have
> different lifetimes and phase behaviour. Each thread would have a different
> optimal scan rate to decide if memory needs to be migrated to a local node
> or not. I don't see how address-space wide statistics could ever be mapped
> back to threads to adapt scan rates based on thread-specific behaviour.
> 
> Thread scanning on the other hand can be improved in multiple ways. If
> nothing else, threads currently do redundant scanning of regions that are
> not relevant to a task, which gets increasingly problematic as VSZ
> increases. The obvious problems are
> 
> 1. Scan based on page table updates, not address ranges to mitigate
>     problems with THP vs base page updates
> 

Hello Mel,
Sorry for digging up a very old email to seek direction on NUMA scanning.

From the list, we have handled (2) and (3) below, and are looking forward
to continuing with (1) above.

My understanding is that when the 256MB limit was introduced, it was mainly
to limit the total number of PTEs we scan (= 64k PTEs of 4KB pages).

Considering that we can cover much more memory per entry with THP or
hugepages, do we want to cover more huge PTEs here?

I mean, can we say we want to scan 64k worth of 2MB page table entries
(or the corresponding hugepage entries)?
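
Roughly, the arithmetic I have in mind is the following (an illustrative
sketch only; pte_budget() and scan_size_mb are made-up names):

	/* Today: 256MB of 4KB pages == (256 << 20) >> 12 == 64k base PTEs. */
	static inline unsigned long pte_budget(unsigned long scan_size_mb)
	{
		return (scan_size_mb << 20) >> PAGE_SHIFT;
	}

	/*
	 * If we instead charge one unit per page table entry actually
	 * scanned, the same 64k budget spent on 2MB entries would cover
	 * 64k << 21 == 128GB of address space rather than 256MB.
	 */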

I started with a simple patch that just handles the 4K/hugetlb cases, but
it does not handle the THP case properly yet, as it is not trivial to track
how many page table entries' worth we have handled in a VMA that has THP.
(The patch may have whitespace errors because of copying.)

The idea is to scan 64k worth of PTEs instead of a fixed 256MB of address
space.

Secondly, unrelated to this, I was also wondering whether information about
how recently a VMA was accessed could be helpful.
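
Something like the following is the shape I am imagining (purely
illustrative; the numab field, its allocation, and the window are all
made up):

	/* Hypothetical per-VMA state, allocated on the first hinting fault. */
	struct vma_numab {
		unsigned long last_fault_jiffies;	/* updated on each NUMA fault */
	};

	/* In the task_numa_work() scan loop: skip VMAs with no recent faults. */
	if (vma->numab &&
	    time_after(jiffies, vma->numab->last_fault_jiffies +
			msecs_to_jiffies(VMA_ACTIVE_WINDOW_MS)))
		continue;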

Please let me know your suggestions/comments on the direction/approach.

> 2. Move scan delay to be a per-vma structure that is kmalloced if
>     necessary instead of being address space wide.
> 
> 3. Track what threads access a VMA. The suggestion was to use an unsigned
>     long pid_mask and use the lower bits to tag approximately what
>     threads access a VMA. Skip VMAs that did not trap a fault. This would
>     be approximate because of PID collisions but would reduce scanning
>     of areas the thread is not interested in
> 
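For reference, the tagging in (3) boils down to roughly the following
(pid_mask and vma_pid_bit() are illustrative names, not the actual code;
PID collisions are tolerated by design):

	/* Hash the PID into one of BITS_PER_LONG bits; approximate on purpose. */
	static inline unsigned long vma_pid_bit(struct task_struct *p)
	{
		return BIT(hash_32(p->pid, ilog2(BITS_PER_LONG)));
	}

	/* On a NUMA hinting fault: */
	vma->pid_mask |= vma_pid_bit(current);

	/* In task_numa_work(): skip VMAs this thread never faulted on. */
	if (!(vma->pid_mask & vma_pid_bit(current)))
		continue;
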
> 4. Track active regions within VMAs. Very coarse tracking, use an unsigned
>     long to trap what ranges are active
> 
> In different ways, this would reduce the amount of scanning work threads
> do and focuses them on regions of relevance to reduce overhead overall
> without losing thread-specific details.
> 
> Unfortunately, I have not had the time yet to prototype anything.
> 
Comments about the patch:
- may need to scale the virtpages check as well
- needs checking of exactly how many THP PTEs are covered in a scan
- does not touch task_scan_min() etc., which influence scan_period (do we
  need to?)

---8<---

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6d041aa9f0fe..066e9bee1187 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -260,7 +260,8 @@ int pud_huge(pud_t pud);
  long hugetlb_change_protection(struct vm_area_struct *vma,
                 unsigned long address, unsigned long end, pgprot_t newprot,
                 unsigned long cp_flags);
-
+long hugetlb_effective_scanned_ptes(struct vm_area_struct *vma, unsigned long start,
+               unsigned long end);
  bool is_hugetlb_entry_migration(pte_t pte);
  void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..e64430863f9e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2441,6 +2441,8 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
  extern long change_protection(struct mmu_gather *tlb,
                               struct vm_area_struct *vma, unsigned long start,
                               unsigned long end, unsigned long cp_flags);
+extern long effective_scanned_ptes(struct vm_area_struct *vma,
+                               unsigned long start, unsigned long end);
  extern int mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
           struct vm_area_struct *vma, struct vm_area_struct **pprev,
           unsigned long start, unsigned long end, unsigned long newflags);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f55884..a8280f589cbf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2959,7 +2959,7 @@ static void task_numa_work(struct callback_head *work)
         struct vm_area_struct *vma;
         unsigned long start, end;
         unsigned long nr_pte_updates = 0;
-       long pages, virtpages;
+       long pages, virtpages, ptes_to_scan;
         struct vma_iterator vmi;

         SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
@@ -3006,6 +3006,8 @@ static void task_numa_work(struct callback_head *work)
         start = mm->numa_scan_offset;
         pages = sysctl_numa_balancing_scan_size;
         pages <<= 20 - PAGE_SHIFT; /* MB in pages */
+       /* Consider total number of PTEs to scan rather than sticking to 256MB */
+       ptes_to_scan = pages;
         virtpages = pages * 8;     /* Scan up to this much virtual space */
         if (!pages)
                 return;
@@ -3099,11 +3101,11 @@ static void task_numa_work(struct callback_head *work)
                          * areas faster.
                          */
                         if (nr_pte_updates)
-                               pages -= (end - start) >> PAGE_SHIFT;
-                       virtpages -= (end - start) >> PAGE_SHIFT;
+                               ptes_to_scan -= effective_scanned_ptes(vma, start, end);
+                       virtpages -= effective_scanned_ptes(vma, start, end);
                         start = end;
-                       if (pages <= 0 || virtpages <= 0)
+                       if (ptes_to_scan <= 0 || virtpages <= 0)
                                 goto out;

                         cond_resched();
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f154019e6b84..9935b462c479 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6841,6 +6841,15 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
         return pages > 0 ? (pages << h->order) : pages;
  }

+long hugetlb_effective_scanned_ptes(struct vm_area_struct *vma, unsigned long start,
+                      unsigned long end)
+{
+       struct hstate *h = hstate_vma(vma);
+
+       return (end - start) >> (PAGE_SHIFT + h->order);
+}
+
+
  /* Return true if reservation was successful, false otherwise.  */
  bool hugetlb_reserve_pages(struct inode *inode,
                                         long from, long to,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 92d3d3ca390a..8022cb09b47b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -586,6 +586,16 @@ long change_protection(struct mmu_gather *tlb,
         return pages;
  }

+long effective_scanned_ptes(struct vm_area_struct *vma, unsigned long start,
+                      unsigned long end)
+{
+       if (is_vm_hugetlb_page(vma))
+               return hugetlb_effective_scanned_ptes(vma, start, end);
+
+       return (end - start) >> PAGE_SHIFT;
+}
+
+
  static int prot_none_pte_entry(pte_t *pte, unsigned long addr,
                                unsigned long next, struct mm_walk *walk)
  {
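
To illustrate the intended effect of effective_scanned_ptes(): for a
hugetlb VMA backed by 2MB pages (h->order == 9), a 1GB range now charges
(1GB >> (PAGE_SHIFT + 9)) == 512 entries against the budget, where the old
code charged 1GB >> PAGE_SHIFT == 262144 base pages. THP-backed anonymous
VMAs still fall through to the base-page count, which is the case flagged
above as not yet handled.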
