Re: [RFC PATCH v0 1/3] sched/numa: Process based autonuma scan period framework

From: Mel Gorman <mgorman@suse.de>
To: Bharata B Rao <bharata@amd.com>
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com,
	peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, bristot@redhat.com,
	dishaa.talreja@amd.com, Wei Huang <wei.huang2@amd.com>
Subject: Re: [RFC PATCH v0 1/3] sched/numa: Process based autonuma scan period framework
Date: Tue, 1 Feb 2022 14:15:20 +0000
Message-ID: <20220201141520.GB3301@suse.de>
In-Reply-To: <9f95a85f-5396-b8bd-50cf-c4eeeac2a013@amd.com>

On Tue, Feb 01, 2022 at 05:52:55PM +0530, Bharata B Rao wrote:
> On 1/31/2022 5:47 PM, Mel Gorman wrote:
> > On Fri, Jan 28, 2022 at 10:58:49AM +0530, Bharata B Rao wrote:
> >> From: Disha Talreja <dishaa.talreja@amd.com>
> >>
> >> Add a new framework that calculates autonuma scan period
> >> based on per-process NUMA fault stats.
> >>
> >> NUMA faults can be classified into different categories, such
> >> as local vs. remote, or private vs. shared. It is also important
> >> to understand such behavior from the perspective of a process.
> >> The per-process fault stats added here will be used for
> >> calculating the scan period in the adaptive NUMA algorithm.
> >>
> > 
> > Be more specific on how the local vs remote, private vs shared states
> > are reflections of per-task activity of the same.
> 
> Sure, will document the algorithm better. However the overall thinking
> here is that the address-space scanning is a per-process activity and
> hence the scan period value derived from the accumulated per-process
> faults is more appropriate than calculating per-task (per-thread) scan
> periods. Participating threads may have their local/remote and private/shared
> behaviors, but when aggregated at the process level, it gives a better
> input for eventual scan period variation. The understanding is that individual
> thread fault rates will start altering the overall process metrics in
> such a manner that we respond by changing the scan rate to do more aggressive
> or less aggressive scanning.  
> 

I don't have anything to add on your other responses as it would mostly
be an acknowledgment of your response.

However, the major concern I have is that address-space wide decisions
on scan rates have no sensible means of adapting to thread-specific
requirements. I completely agree that it will result in more stable scan
rates, particularly the adjustments. It also side-steps a problem where
new threads may start with a scan rate that is completely inappropriate.

However, I worry that it would be limited overall because each thread
potentially has unique behaviour which is not obvious in a workload like
NAS where threads are all executing similar instructions on different
data. For other applications, threads may operate on thread-local areas
only (low scan rate), others could operate on shared-only regions (high
scan rate until back off and interleave), threads can have phase behaviour
(manager thread collecting data from worker threads) and threads can have
different lifetimes and phase behaviour. Each thread would have a different
optimal scan rate to decide if memory needs to be migrated to a local node
or not. I don't see how address-space wide statistics could ever be mapped
back to threads to adapt scan rates based on thread-specific behaviour.
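
To make the concern concrete, here is a minimal userspace sketch, not
kernel code. The struct and function names are hypothetical and only
assume that each thread keeps simple local/remote fault counters. It
shows how summing the counters across the address space can produce a
figure that describes none of the threads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-thread NUMA hinting fault counters (illustrative only). */
struct fault_stats {
	unsigned long local;	/* faults on the node the thread was running on */
	unsigned long remote;	/* faults on any other node */
};

/* Percentage of recorded faults that were remote; 0 if nothing was recorded. */
static unsigned int remote_pct(const struct fault_stats *s)
{
	unsigned long total = s->local + s->remote;

	return total ? (unsigned int)(s->remote * 100 / total) : 0;
}

/* Sum per-thread counters into a single address-space wide view. */
static struct fault_stats aggregate(const struct fault_stats *t, size_t n)
{
	struct fault_stats sum = { 0, 0 };
	size_t i;

	assert(t != NULL || n == 0);
	for (i = 0; i < n; i++) {
		sum.local += t[i].local;
		sum.remote += t[i].remote;
	}
	return sum;
}
```

With one thread at { .local = 1000, .remote = 0 } and another at
{ .local = 0, .remote = 1000 }, the per-thread ratios are 0% and 100%
remote, while the aggregate reports 50%, which matches neither thread.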

Thread scanning on the other hand can be improved in multiple ways. If
nothing else, threads can do redundant scanning of regions that are
not relevant to a task, which gets increasingly problematic when VSZ
increases. The obvious improvements are

1. Scan based on page table updates, not address ranges to mitigate
   problems with THP vs base page updates

2. Move scan delay to be a per-vma structure that is kmalloced if
   necessary instead of being address space wide.

3. Track what threads access a VMA. The suggestion was to use an unsigned
   long pid_mask and use the lower bits to tag approximately what
   threads access a VMA. Skip VMAs that did not trap a fault. This would
   be approximate because of PID collisions but would reduce scanning
   of areas the thread is not interested in.

4. Track active regions within VMAs. Very coarse tracking, use unsigned
   long to track what ranges are active
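
As a rough illustration of item 3, a userspace sketch of the pid_mask
idea could look like the following. Everything here is an assumption
made for illustration: struct vma_numa_state and the helper names do not
exist in the kernel, and the sketch ignores locking, resetting the mask
over time and how a VMA gets its first scan.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define PID_MASK_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical per-VMA state; not an existing kernel structure. */
struct vma_numa_state {
	unsigned long pid_mask;	/* approximate set of threads that faulted here */
};

/* Map a PID to one bit using its lower bits; distinct PIDs may collide. */
static unsigned long pid_bit(int pid)
{
	assert(pid >= 0);
	return 1UL << ((unsigned long)pid % PID_MASK_BITS);
}

/* Called when a thread traps a NUMA hinting fault in this VMA. */
static void vma_record_fault(struct vma_numa_state *s, int pid)
{
	s->pid_mask |= pid_bit(pid);
}

/* The scanner skips the VMA unless this thread (or a colliding PID) faulted. */
static bool vma_should_scan(const struct vma_numa_state *s, int pid)
{
	return (s->pid_mask & pid_bit(pid)) != 0;
}
```

A collision only causes a thread to scan a VMA it has no interest in, so
the approximation costs some extra scanning but never hides a VMA from
a thread that did fault in it.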

In different ways, this would reduce the amount of scanning work threads
do and focus them on regions of relevance to reduce overhead overall
without losing thread-specific details.
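
The coarse range tracking in item 4 can be sketched in the same way.
This is only an illustration under stated assumptions: the VMA is split
into one equal slice per bit of an unsigned long, and struct vma_regions
and its helpers are made-up names rather than kernel interfaces.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define REGION_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical coarse activity map for one VMA. */
struct vma_regions {
	unsigned long start;	/* first address of the VMA (inclusive) */
	unsigned long end;	/* end of the VMA (exclusive) */
	unsigned long active;	/* one bit per slice that trapped a fault */
};

/* Index of the slice containing addr; slice size is rounded up. */
static unsigned int region_index(const struct vma_regions *v, unsigned long addr)
{
	unsigned long len, slice;

	assert(v->end > v->start);
	assert(addr >= v->start && addr < v->end);
	len = v->end - v->start;
	slice = len / REGION_BITS + (len % REGION_BITS != 0);
	return (unsigned int)((addr - v->start) / slice);
}

/* Record that a hinting fault was trapped at addr. */
static void region_mark_active(struct vma_regions *v, unsigned long addr)
{
	v->active |= 1UL << region_index(v, addr);
}

/* The scanner would only revisit slices that have seen a fault. */
static bool region_is_active(const struct vma_regions *v, unsigned long addr)
{
	return (v->active >> region_index(v, addr)) & 1UL;
}
```

For large VMAs each bit covers a large slice, so this is very coarse,
but it is cheap enough to consult on every scan pass.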

Unfortunately, I have not had the time yet to prototype anything.

-- 
Mel Gorman
SUSE Labs