* [RFC] mm: activate access-more-than-once page via NUMA balancing
@ 2021-03-24  8:32 Huang Ying
  2021-03-24 10:31 ` Mel Gorman
  0 siblings, 1 reply; 7+ messages in thread
From: Huang Ying @ 2021-03-24  8:32 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Huang Ying, Yu Zhao, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi

One idea behind the LRU page reclaiming algorithm is to put the
access-once pages in the inactive list and access-more-than-once pages
in the active list.  This is true for the file pages that are accessed
via syscall (read()/write(), etc.), but not for the pages accessed via
the page tables.  We can only activate them via page reclaim scanning
now.  This may cause some problems.  For example, even if there are
only hot file pages accessed via the page tables in the inactive list,
we will enable the cache trim mode incorrectly to scan only the hot
file pages instead of cold anon pages.

This can be improved via NUMA balancing, where the page tables of
all processes are scanned gradually to trap page accesses.  With
that, we can identify whether a page in the inactive list has been
accessed at least twice.  If so, we can activate the page, leaving
only the access-once pages in the inactive list.  This patch
implements that.
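
In pseudo-C, the check added by the diff below, taken in
do_numa_page() when there is no migration target, is roughly (an
annotated restatement of the hunk, not additional code):

        /*
         * Only consider pages that page reclaim can do something with:
         * not already active, evictable, and, for anon pages, only
         * when there is swap space.
         */
        if (!PageActive(page) && page_evictable(page) &&
            (!PageSwapBacked(page) || total_swap_pages)) {
                /*
                 * The hint fault itself is evidence of one access; a
                 * set Accessed bit records another, so remember it in
                 * PG_referenced.
                 */
                if (pte_young(old_pte) && !PageReferenced(page))
                        SetPageReferenced(page);
                /*
                 * A referenced page has been seen more than once: let
                 * mark_page_accessed() move it to the active list.
                 */
                if (PageReferenced(page))
                        mark_page_accessed(page);
        }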

It may sound like overkill to enable NUMA balancing only to activate
some pages.  But firstly, if you use NUMA balancing already, the
added overhead is negligible.  Secondly, this patch is only the first
step in taking advantage of NUMA balancing to optimize page
reclaiming.  We may improve page reclaim further with its help.  For
example, we have implemented a way to measure page hotness via NUMA
balancing in

https://lore.kernel.org/linux-mm/20210311081821.138467-5-ying.huang@intel.com/

That may help to improve the LRU algorithm.  For example, instead of
migrating from PMEM to DRAM, the hot pages can be put at the head of
the active list (or a separate hot page list) to make it easier to
reclaim the cold pages at the tail of the LRU.

This patch is inspired by the work done by Yu Zhao in the
Multigenerational LRU patchset as follows,

https://lore.kernel.org/linux-mm/20210313075747.3781593-1-yuzhao@google.com/

It may be possible to combine some ideas from the multigenerational
LRU patchset with the NUMA balancing page table scanning to improve
the LRU page reclaiming algorithm.  Compared with the page table
scanning method used in the multigenerational LRU patchset, the page
tables can be scanned much more slowly via NUMA balancing, because
page faults instead of the Accessed bit are used to trap the page
accesses.  This can reduce the peak overhead of scanning.

To show the effect of the patch, we designed a test as follows,

On a system with 128 GB DRAM and 2 NVMe disks as swap,

  * Run workload A with about 60 GB of hot anon pages.

  * After 100 seconds, run workload B with about 58 GB of cold anon
    pages (access-once).

  * After another 200 seconds, run workload C with about 57 GB of hot
    anon pages.

It's desirable that the 58 GB of cold pages from workload B be swapped
out to accommodate the 57 GB of memory for workload C.
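
For illustration, a minimal sketch of such a workload generator (the
program below is hypothetical; the actual workloads A/B/C are not
shown here):

        /* Usage: ./memhog <GiB> <hot|cold> */
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/mman.h>

        int main(int argc, char **argv)
        {
                size_t i, size;
                int hot;
                char *buf;

                if (argc != 3)
                        return 1;
                size = strtoull(argv[1], NULL, 0) << 30;
                hot = !strcmp(argv[2], "hot");
                buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (buf == MAP_FAILED)
                        return 1;
                do {
                        /* hot: touch every page repeatedly;
                         * cold: touch every page exactly once */
                        for (i = 0; i < size; i += 4096)
                                buf[i]++;
                } while (hot);
                pause();        /* keep the cold pages mapped */
                return 0;
        }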

The test results are as follows,

                                  base      patched
Pages swapped in (GB)              2.3          0.0
Pages swapped out (GB)            59.0         55.9
Pages scanned (GB)               296.7        172.5
Avg active list length (GB)       18.1         58.4
Avg inactive list length (GB)     89.1         48.4

Because the size of the cold workload B (58 GB) is larger than the
size of workload C, it's desirable that the access-once pages of
workload B be reclaimed to accommodate workload C, so that no pages
need to be swapped in.  But in the base kernel, because the pages of
workload A are scanned before those of workload B, some hot pages
(~2.3 GB) from workload A are swapped out wrongly.  In the patched
kernel, the pages of workload A are activated to the active list
beforehand, so the pages swapped in are reduced greatly (to
~14.2 MB).  Because the inactive list is much shorter in the patched
kernel, far fewer pages need to be scanned to reclaim memory for
workload C (172.5 GB vs. 296.7 GB).

As always, the VM subsystem is complex, and any change may cause
regressions.  We have observed some for this patch too.  The
fundamental effect of the patch is to reduce the size of the inactive
list, which reduces the scanning overhead and improves scanning
correctness.  But in some situations, the long inactive list of the
base (unpatched) kernel can help performance, because it takes longer
to scan a (not so) hot page twice, making it easier to distinguish
hot pages from cold ones.  Generally, though, I don't think it is a
good idea to improve performance purely by increasing system
overhead.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Inspired-by: Yu Zhao <yuzhao@google.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Yang Shi <shy828301@gmail.com>
---
 mm/memory.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 5efa07fb6cdc..b44b6fd577a8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4165,6 +4165,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 			&flags);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	if (target_nid == NUMA_NO_NODE) {
+		if (!PageActive(page) && page_evictable(page) &&
+		    (!PageSwapBacked(page) || total_swap_pages)) {
+			if (pte_young(old_pte) && !PageReferenced(page))
+				SetPageReferenced(page);
+			if (PageReferenced(page))
+				mark_page_accessed(page);
+		}
 		put_page(page);
 		goto out;
 	}
-- 
2.30.2




* Re: [RFC] mm: activate access-more-than-once page via NUMA balancing
  2021-03-24  8:32 [RFC] mm: activate access-more-than-once page via NUMA balancing Huang Ying
@ 2021-03-24 10:31 ` Mel Gorman
  2021-03-25  4:33   ` Huang, Ying
  0 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2021-03-24 10:31 UTC (permalink / raw)
  To: Huang Ying
  Cc: linux-mm, Andrew Morton, linux-kernel, Yu Zhao, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Michal Hocko,
	Roman Gushchin, Vlastimil Babka, Wei Yang, Yang Shi

On Wed, Mar 24, 2021 at 04:32:09PM +0800, Huang Ying wrote:
> One idea behind the LRU page reclaiming algorithm is to put the
> access-once pages in the inactive list and access-more-than-once pages
> in the active list.  This is true for the file pages that are accessed
> via syscall (read()/write(), etc.), but not for the pages accessed via
> the page tables.  We can only activate them via page reclaim scanning
> now.  This may cause some problems.  For example, even if there are
> only hot file pages accessed via the page tables in the inactive list,
> we will enable the cache trim mode incorrectly to scan only the hot
> file pages instead of cold anon pages.
> 

I caution against this patch.

It's non-deterministic for a number of reasons. As it requires NUMA
balancing to be enabled, the pageout behaviour of a system changes when
NUMA balancing is active. If this led to pages being artificially and
inappropriately preserved, NUMA balancing could be disabled for the
wrong reasons.  It only applies to pages that have no target node so
memory policies affect which pages are activated differently. Similarly,
NUMA balancing does not scan all VMAs and some pages may never trap a
NUMA fault as a result. The timing of when an address space gets scanned
is driven by the locality of pages and so the timing of page activation
potentially becomes linked to whether pages are local or need to migrate
(although not right now for this patch as it only affects pages with a
target nid of NUMA_NO_NODE). In other words, changes in NUMA balancing
that affect migration potentially affect the aging rate.  Similarly,
a single-threaded process and a multi-threaded process potentially
have different activation rates.

Finally, the NUMA balancing scan algorithm is sub-optimal. It potentially
scans the entire address space even though only a small number of pages
are scanned. This is particularly problematic when a process has a lot
of threads because threads are redundantly scanning the same regions. If
NUMA balancing ever introduced range tracking of faulted pages to limit
how much scanning it has to do, it would inadvertently cause a change in
page activation rate.

NUMA balancing is about page locality, it should not get conflated with
page aging.

-- 
Mel Gorman
SUSE Labs



* Re: [RFC] mm: activate access-more-than-once page via NUMA balancing
  2021-03-24 10:31 ` Mel Gorman
@ 2021-03-25  4:33   ` Huang, Ying
  2021-03-25 11:57     ` Mel Gorman
  0 siblings, 1 reply; 7+ messages in thread
From: Huang, Ying @ 2021-03-25  4:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Andrew Morton, linux-kernel, Yu Zhao, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Michal Hocko,
	Roman Gushchin, Vlastimil Babka, Wei Yang, Yang Shi

Hi, Mel,

Thanks for the comments!

Mel Gorman <mgorman@suse.de> writes:

> On Wed, Mar 24, 2021 at 04:32:09PM +0800, Huang Ying wrote:
>> One idea behind the LRU page reclaiming algorithm is to put the
>> access-once pages in the inactive list and access-more-than-once pages
>> in the active list.  This is true for the file pages that are accessed
>> via syscall (read()/write(), etc.), but not for the pages accessed via
>> the page tables.  We can only activate them via page reclaim scanning
>> now.  This may cause some problems.  For example, even if there are
>> only hot file pages accessed via the page tables in the inactive list,
>> we will enable the cache trim mode incorrectly to scan only the hot
>> file pages instead of cold anon pages.
>> 
>
> I caution against this patch.
>
> It's non-deterministic for a number of reasons. As it requires NUMA
> balancing to be enabled, the pageout behaviour of a system changes when
> NUMA balancing is active. If this led to pages being artificially and
> inappropriately preserved, NUMA balancing could be disabled for the
> wrong reasons.  It only applies to pages that have no target node so
> memory policies affect which pages are activated differently. Similarly,
> NUMA balancing does not scan all VMAs and some pages may never trap a
> NUMA fault as a result. The timing of when an address space gets scanned
> is driven by the locality of pages and so the timing of page activation
> potentially becomes linked to whether pages are local or need to migrate
> (although not right now for this patch as it only affects pages with a
> target nid of NUMA_NO_NODE). In other words, changes in NUMA balancing
> that affect migration potentially affect the aging rate.  Similarly,
> a single-threaded process and a multi-threaded process potentially
> have different activation rates.
>
> Finally, the NUMA balancing scan algorithm is sub-optimal. It potentially
> scans the entire address space even though only a small number of pages
> are scanned. This is particularly problematic when a process has a lot
> of threads because threads are redundantly scanning the same regions. If
> NUMA balancing ever introduced range tracking of faulted pages to limit
> how much scanning it has to do, it would inadvertently cause a change in
> page activation rate.
>
> NUMA balancing is about page locality, it should not get conflated with
> page aging.

I understand your concerns about binding NUMA balancing and page
reclaiming together.  The requirements of page locality and page aging
are different, so the policies need to be different.  This is the
wrong part of the patch.

From another point of view, it's still possible to share some
underlying mechanisms (and code) between them: scanning the page
tables to make pages inaccessible and capturing page accesses via
page faults.  Now this page access information is used for page
locality.  Do you think it's a good idea to use this information for
page aging too (but with a different policy, as you pointed out)?

From yet another point of view :-), in the current NUMA balancing
implementation, it's assumed that the node-private pages can fit in
the accessing node.  But this may not always be true.  Is it a valid
optimization to migrate the hot private pages first?

Best Regards,
Huang, Ying



* Re: [RFC] mm: activate access-more-than-once page via NUMA balancing
  2021-03-25  4:33   ` Huang, Ying
@ 2021-03-25 11:57     ` Mel Gorman
  2021-03-26  6:20       ` Huang, Ying
  0 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2021-03-25 11:57 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, linux-kernel, Yu Zhao, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Michal Hocko,
	Roman Gushchin, Vlastimil Babka, Wei Yang, Yang Shi

On Thu, Mar 25, 2021 at 12:33:45PM +0800, Huang, Ying wrote:
> > I caution against this patch.
> >
> > It's non-deterministic for a number of reasons. As it requires NUMA
> > balancing to be enabled, the pageout behaviour of a system changes when
> > NUMA balancing is active. If this led to pages being artificially and
> > inappropriately preserved, NUMA balancing could be disabled for the
> > wrong reasons.  It only applies to pages that have no target node so
> > memory policies affect which pages are activated differently. Similarly,
> > NUMA balancing does not scan all VMAs and some pages may never trap a
> > NUMA fault as a result. The timing of when an address space gets scanned
> > is driven by the locality of pages and so the timing of page activation
> > potentially becomes linked to whether pages are local or need to migrate
> > (although not right now for this patch as it only affects pages with a
> > target nid of NUMA_NO_NODE). In other words, changes in NUMA balancing
> > that affect migration potentially affect the aging rate.  Similarly,
> > a single-threaded process and a multi-threaded process potentially
> > have different activation rates.
> >
> > Finally, the NUMA balancing scan algorithm is sub-optimal. It potentially
> > scans the entire address space even though only a small number of pages
> > are scanned. This is particularly problematic when a process has a lot
> > of threads because threads are redundantly scanning the same regions. If
> > NUMA balancing ever introduced range tracking of faulted pages to limit
> > how much scanning it has to do, it would inadvertently cause a change in
> > page activation rate.
> >
> > NUMA balancing is about page locality, it should not get conflated with
> > page aging.
> 
> I understand your concerns about binding NUMA balancing and page
> reclaiming together.  The requirements of page locality and page aging
> are different, so the policies need to be different.  This is the
> wrong part of the patch.
> 
> From another point of view, it's still possible to share some
> underlying mechanisms (and code) between them: scanning the page
> tables to make pages inaccessible and capturing page accesses via
> page faults.

Potentially yes but not necessarily recommended for page aging. NUMA
balancing has to be careful about the rate it scans pages to avoid
excessive overhead so it's driven by locality. The scanning happens
within a task's context so during that time, the task is not executing
its normal work and it incurs the overhead for faults. Generally, this
is not too much overhead because pages get migrated locally, the scan
rate drops and so does the overhead.
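
As a much simplified sketch of that feedback loop (the names and
constants here are illustrative, not the actual task_numa_work()
machinery):

        #define SCAN_PERIOD_MIN          1000   /* ms, illustrative only */
        #define SCAN_PERIOD_MAX         60000   /* ms, illustrative only */

        /*
         * Mostly-local faults back the scanner off; mostly-remote
         * faults speed it up, within fixed bounds.
         */
        static unsigned int adapt_scan_period(unsigned int period,
                                              unsigned long local,
                                              unsigned long remote)
        {
                if (local >= remote)
                        period *= 2;    /* good locality: scan less */
                else
                        period /= 2;    /* poor locality: scan more */
                if (period > SCAN_PERIOD_MAX)
                        period = SCAN_PERIOD_MAX;
                if (period < SCAN_PERIOD_MIN)
                        period = SCAN_PERIOD_MIN;
                return period;
        }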

However, if you want to drive page aging, that is constant so the rate
could not be easily adapted in a way that would be deterministic.

> Now this page access information is used for page locality.  Do you
> think it's a good idea to use this information for page aging too
> (but with a different policy, as you pointed out)?
> 

I'm not completely opposed to it but I think the overhead it would
introduce could be severe. Worse, if a workload fits in memory and there
is limited to no memory pressure, it's all overhead for no gain. Early
generations of NUMA balancing had to find a balance to ensure the gains
from locality exceeded the cost of measuring locality, and doing the
same for page aging is in some ways even more challenging.

> From yet another point of view :-), in the current NUMA balancing
> implementation, it's assumed that the node-private pages can fit in
> the accessing node.  But this may not always be true.  Is it a valid
> optimization to migrate the hot private pages first?
> 

I'm not sure how the hotness of pages could be ranked. At the time of a
hinting fault, the page is by definition active now because it has been
accessed. Prioritising what pages to migrate based on the number of faults
that have been trapped would have to be stored somewhere.

-- 
Mel Gorman
SUSE Labs



* Re: [RFC] mm: activate access-more-than-once page via NUMA balancing
  2021-03-25 11:57     ` Mel Gorman
@ 2021-03-26  6:20       ` Huang, Ying
  2021-04-10 22:25         ` Yu Zhao
  0 siblings, 1 reply; 7+ messages in thread
From: Huang, Ying @ 2021-03-26  6:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Andrew Morton, linux-kernel, Yu Zhao, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Michal Hocko,
	Roman Gushchin, Vlastimil Babka, Wei Yang, Yang Shi

Mel Gorman <mgorman@suse.de> writes:

> On Thu, Mar 25, 2021 at 12:33:45PM +0800, Huang, Ying wrote:
>> > I caution against this patch.
>> >
>> > It's non-deterministic for a number of reasons. As it requires NUMA
>> > balancing to be enabled, the pageout behaviour of a system changes when
>> > NUMA balancing is active. If this led to pages being artificially and
>> > inappropriately preserved, NUMA balancing could be disabled for the
>> > wrong reasons.  It only applies to pages that have no target node so
>> > memory policies affect which pages are activated differently. Similarly,
>> > NUMA balancing does not scan all VMAs and some pages may never trap a
>> > NUMA fault as a result. The timing of when an address space gets scanned
>> > is driven by the locality of pages and so the timing of page activation
>> > potentially becomes linked to whether pages are local or need to migrate
>> > (although not right now for this patch as it only affects pages with a
>> > target nid of NUMA_NO_NODE). In other words, changes in NUMA balancing
>> > that affect migration potentially affect the aging rate.  Similarly,
>> > a single-threaded process and a multi-threaded process potentially
>> > have different activation rates.
>> >
>> > Finally, the NUMA balancing scan algorithm is sub-optimal. It potentially
>> > scans the entire address space even though only a small number of pages
>> > are scanned. This is particularly problematic when a process has a lot
>> > of threads because threads are redundantly scanning the same regions. If
>> > NUMA balancing ever introduced range tracking of faulted pages to limit
>> > how much scanning it has to do, it would inadvertently cause a change in
>> > page activation rate.
>> >
>> > NUMA balancing is about page locality, it should not get conflated with
>> > page aging.
>> 
>> I understand your concerns about binding NUMA balancing and page
>> reclaiming together.  The requirements of page locality and page aging
>> are different, so the policies need to be different.  This is the
>> wrong part of the patch.
>> 
>> From another point of view, it's still possible to share some
>> underlying mechanisms (and code) between them: scanning the page
>> tables to make pages inaccessible and capturing page accesses via
>> page faults.
>
> Potentially yes but not necessarily recommended for page aging. NUMA
> balancing has to be careful about the rate it scans pages to avoid
> excessive overhead so it's driven by locality. The scanning happens
> within a task's context so during that time, the task is not executing
> its normal work and it incurs the overhead for faults. Generally, this
> is not too much overhead because pages get migrated locally, the scan
> rate drops and so does the overhead.
>
> However, if you want to drive page aging, that is constant so the rate
> could not be easily adapted in a way that would be deterministic.
>
>> Now this page access information is used for page locality.  Do you
>> think it's a good idea to use this information for page aging too
>> (but with a different policy, as you pointed out)?
>> 
>
> I'm not completely opposed to it but I think the overhead it would
> introduce could be severe. Worse, if a workload fits in memory and there
> is limited to no memory pressure, it's all overhead for no gain. Early
> generations of NUMA balancing had to find a balance to ensure the gains
> from locality exceeded the cost of measuring locality, and doing the
> same for page aging is in some ways even more challenging.

Yes.  I will think more about it from the overhead vs. gain point of
view.  Thanks a lot for sharing your thoughts on that.

>> From yet another point of view :-), in the current NUMA balancing
>> implementation, it's assumed that the node-private pages can fit in
>> the accessing node.  But this may not always be true.  Is it a valid
>> optimization to migrate the hot private pages first?
>> 
>
> I'm not sure how the hotness of pages could be ranked. At the time of a
> hinting fault, the page is by definition active now because it has been
> accessed. Prioritising what pages to migrate based on the number of faults
> that have been trapped would have to be stored somewhere.

Yes.  We need to store some information for that.  In an old version
of the patchset, which uses NUMA balancing to promote hot pages from
PMEM to DRAM, we designed a method to measure the hotness of pages.
The basic idea is as follows,

- When the page tables of a process are scanned, the latest N scanned
  address ranges and their scan times are recorded in a ring buffer in
  the mm_struct.

- In the hint page fault handler, the ring buffer is searched with
  the fault address to get the scan time.

Then the hint page fault latency of the page is defined as,

  hint page fault latency = fault time - scan time

The shorter the hint page fault latency, the hotter the page.

Then we need a way to determine the hot/cold threshold.  We used a
rate-limit-based threshold adjustment method.  If the number of pages
that pass the threshold is much higher than the rate limit, we lower
the threshold (make it stricter), and vice versa.
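
A minimal sketch of the mechanism (names, sizes, and locking are
simplified compared with the real patchset):

        #define NR_SCAN_RECORDS 8       /* illustrative only */

        struct scan_record {
                unsigned long start, end;       /* scanned address range */
                unsigned long scan_time;        /* when the range was scanned */
        };

        struct scan_ring {              /* lives in mm_struct */
                struct scan_record rec[NR_SCAN_RECORDS];
                int head;               /* next slot to overwrite */
        };

        /* Called by the NUMA balancing scanner after each range. */
        static void record_scan(struct scan_ring *ring, unsigned long start,
                                unsigned long end, unsigned long now)
        {
                struct scan_record *r = &ring->rec[ring->head];

                r->start = start;
                r->end = end;
                r->scan_time = now;
                ring->head = (ring->head + 1) % NR_SCAN_RECORDS;
        }

        /*
         * Called from the hint page fault handler:
         * latency = fault time - scan time; shorter means hotter.
         */
        static long hint_fault_latency(struct scan_ring *ring,
                                       unsigned long addr, unsigned long now)
        {
                int i;

                for (i = 0; i < NR_SCAN_RECORDS; i++) {
                        struct scan_record *r = &ring->rec[i];

                        if (addr >= r->start && addr < r->end)
                                return now - r->scan_time;
                }
                return -1;      /* scanned too long ago: treat as cold */
        }

        /*
         * Rate-limit based threshold adjustment: if many more pages
         * than the rate limit pass, tighten the threshold (require a
         * lower latency); if far fewer pass, relax it.
         */
        static void adjust_threshold(unsigned long *threshold,
                                     unsigned long nr_passed,
                                     unsigned long rate_limit)
        {
                if (nr_passed > 2 * rate_limit)
                        *threshold /= 2;
                else if (nr_passed < rate_limit / 2)
                        *threshold *= 2;
        }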

Best Regards,
Huang, Ying




* Re: [RFC] mm: activate access-more-than-once page via NUMA balancing
  2021-03-26  6:20       ` Huang, Ying
@ 2021-04-10 22:25         ` Yu Zhao
  2021-04-13  5:42           ` Huang, Ying
  0 siblings, 1 reply; 7+ messages in thread
From: Yu Zhao @ 2021-04-10 22:25 UTC (permalink / raw)
  To: Huang, Ying, Mel Gorman
  Cc: Linux-MM, Andrew Morton, linux-kernel, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Michal Hocko,
	Roman Gushchin, Vlastimil Babka, Wei Yang, Yang Shi

On Fri, Mar 26, 2021 at 12:21 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Mel Gorman <mgorman@suse.de> writes:
>
> > On Thu, Mar 25, 2021 at 12:33:45PM +0800, Huang, Ying wrote:
> >> > I caution against this patch.
> >> >
> >> > It's non-deterministic for a number of reasons. As it requires NUMA
> >> > balancing to be enabled, the pageout behaviour of a system changes when
> >> > NUMA balancing is active. If this led to pages being artificially and
> >> > inappropriately preserved, NUMA balancing could be disabled for the
> >> > wrong reasons.  It only applies to pages that have no target node so
> >> > memory policies affect which pages are activated differently. Similarly,
> >> > NUMA balancing does not scan all VMAs and some pages may never trap a
> >> > NUMA fault as a result. The timing of when an address space gets scanned
> >> > is driven by the locality of pages and so the timing of page activation
> >> > potentially becomes linked to whether pages are local or need to migrate
> >> > (although not right now for this patch as it only affects pages with a
> >> > target nid of NUMA_NO_NODE). In other words, changes in NUMA balancing
> >> > that affect migration potentially affect the aging rate.  Similarly,
> >> > a single-threaded process and a multi-threaded process potentially
> >> > have different activation rates.
> >> >
> >> > Finally, the NUMA balancing scan algorithm is sub-optimal. It potentially
> >> > scans the entire address space even though only a small number of pages
> >> > are scanned. This is particularly problematic when a process has a lot
> >> > of threads because threads are redundantly scanning the same regions. If
> >> > NUMA balancing ever introduced range tracking of faulted pages to limit
> >> > how much scanning it has to do, it would inadvertently cause a change in
> >> > page activation rate.
> >> >
> >> > NUMA balancing is about page locality, it should not get conflated with
> >> > page aging.
> >>
> >> I understand your concerns about binding NUMA balancing and page
> >> reclaiming together.  The requirements of page locality and page aging
> >> are different, so the policies need to be different.  This is the
> >> wrong part of the patch.
> >>
> >> From another point of view, it's still possible to share some
> >> underlying mechanisms (and code) between them: scanning the page
> >> tables to make pages inaccessible and capturing page accesses via
> >> page faults.
> >
> > Potentially yes but not necessarily recommended for page aging. NUMA
> > balancing has to be careful about the rate it scans pages to avoid
> > excessive overhead so it's driven by locality. The scanning happens
> > within a task's context so during that time, the task is not executing
> > its normal work and it incurs the overhead for faults. Generally, this
> > is not too much overhead because pages get migrated locally, the scan
> > rate drops and so does the overhead.
> >
> > However, if you want to drive page aging, that is constant so the rate
> > could not be easily adapted in a way that would be deterministic.
> >
> >> Now this page access information is used for page locality.  Do you
> >> think it's a good idea to use this information for page aging too
> >> (but with a different policy, as you pointed out)?
> >>
> >
> > I'm not completely opposed to it but I think the overhead it would
> > introduce could be severe. Worse, if a workload fits in memory and there
> > is limited to no memory pressure, it's all overhead for no gain. Early
> > generations of NUMA balancing had to find a balance to ensure the gains
> > from locality exceeded the cost of measuring locality, and doing the
> > same for page aging is in some ways even more challenging.
>
> Yes.  I will think more about it from the overhead vs. gain point of
> view.  Thanks a lot for sharing your thoughts on that.
>
> >> From yet another point of view :-), in the current NUMA balancing
> >> implementation, it's assumed that the node-private pages can fit in
> >> the accessing node.  But this may not always be true.  Is it a valid
> >> optimization to migrate the hot private pages first?
> >>
> >
> > I'm not sure how the hotness of pages could be ranked. At the time of a
> > hinting fault, the page is by definition active now because it has been
> > accessed. Prioritising what pages to migrate based on the number of faults
> > that have been trapped would have to be stored somewhere.
>
> Yes.  We need to store some information for that.  In an old version
> of the patchset, which uses NUMA balancing to promote hot pages from
> PMEM to DRAM, we designed a method to measure the hotness of pages.
> The basic idea is as follows,
>
> - When the page tables of a process are scanned, the latest N scanned
>   address ranges and their scan times are recorded in a ring buffer in
>   the mm_struct.
>
> - In the hint page fault handler, the ring buffer is searched with
>   the fault address to get the scan time.
>
> Then the hint page fault latency of the page is defined as,
>
>   hint page fault latency = fault time - scan time
>
> The shorter the hint page fault latency, the hotter the page.
>
> Then we need a way to determine the hot/cold threshold.  We used a
> rate-limit-based threshold adjustment method.  If the number of pages
> that pass the threshold is much higher than the rate limit, we lower
> the threshold (make it stricter), and vice versa.

Sorry for the late reply. I do see where you are coming from and I
agree in principle. The aging and the NUMA balancing should be talking
to each other, and IMO, it is easier for the aging to help the NUMA
balancing because it has to do the legwork anyway.

My idea is to make the page table scanning in the multigenerational
LRU NUMA policy aware -- I don't have any concrete plan yet. But in
general, it can range from mildly skewing the aging of pages from
wrong nodes so they become preferable during eviction to aggressively
working against those pages like queue_pages_pte_range() does.



* Re: [RFC] mm: activate access-more-than-once page via NUMA balancing
  2021-04-10 22:25         ` Yu Zhao
@ 2021-04-13  5:42           ` Huang, Ying
  0 siblings, 0 replies; 7+ messages in thread
From: Huang, Ying @ 2021-04-13  5:42 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Mel Gorman, Linux-MM, Andrew Morton, linux-kernel, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Michal Hocko,
	Roman Gushchin, Vlastimil Babka, Wei Yang, Yang Shi

Yu Zhao <yuzhao@google.com> writes:

> On Fri, Mar 26, 2021 at 12:21 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Mel Gorman <mgorman@suse.de> writes:
>>
>> > On Thu, Mar 25, 2021 at 12:33:45PM +0800, Huang, Ying wrote:
>> >> > I caution against this patch.
>> >> >
>> >> > It's non-deterministic for a number of reasons. As it requires NUMA
>> >> > balancing to be enabled, the pageout behaviour of a system changes when
>> >> > NUMA balancing is active. If this led to pages being artificially and
>> >> > inappropriately preserved, NUMA balancing could be disabled for the
>> >> > wrong reasons.  It only applies to pages that have no target node so
>> >> > memory policies affect which pages are activated differently. Similarly,
>> >> > NUMA balancing does not scan all VMAs and some pages may never trap a
>> >> > NUMA fault as a result. The timing of when an address space gets scanned
>> >> > is driven by the locality of pages and so the timing of page activation
>> >> > potentially becomes linked to whether pages are local or need to migrate
>> >> > (although not right now for this patch as it only affects pages with a
>> >> > target nid of NUMA_NO_NODE). In other words, changes in NUMA balancing
>> >> > that affect migration potentially affect the aging rate.  Similarly,
>> >> > a single-threaded process and a multi-threaded process potentially
>> >> > have different activation rates.
>> >> >
>> >> > Finally, the NUMA balancing scan algorithm is sub-optimal. It potentially
>> >> > scans the entire address space even though only a small number of pages
>> >> > are scanned. This is particularly problematic when a process has a lot
>> >> > of threads because threads are redundantly scanning the same regions. If
>> >> > NUMA balancing ever introduced range tracking of faulted pages to limit
>> >> > how much scanning it has to do, it would inadvertently cause a change in
>> >> > page activation rate.
>> >> >
>> >> > NUMA balancing is about page locality, it should not get conflated with
>> >> > page aging.
>> >>
>> >> I understand your concerns about binding NUMA balancing and page
>> >> reclaiming together.  The requirements of page locality and page aging
>> >> are different, so the policies need to be different.  This is the
>> >> wrong part of the patch.
>> >>
>> >> From another point of view, it's still possible to share some
>> >> underlying mechanisms (and code) between them: scanning the page
>> >> tables to make pages inaccessible and capturing page accesses via
>> >> page faults.
>> >
>> > Potentially yes but not necessarily recommended for page aging. NUMA
>> > balancing has to be careful about the rate it scans pages to avoid
>> > excessive overhead so it's driven by locality. The scanning happens
>> > within a task's context so during that time, the task is not executing
>> > its normal work and it incurs the overhead for faults. Generally, this
>> > is not too much overhead because pages get migrated locally, the scan
>> > rate drops and so does the overhead.
>> >
>> > However, if you want to drive page aging, that is constant so the rate
>> > could not be easily adapted in a way that would be deterministic.
>> >
>> >> Now this page access information is used for page locality.  Do you
>> >> think it's a good idea to use this information for page aging too
>> >> (but with a different policy, as you pointed out)?
>> >>
>> >
>> > I'm not completely opposed to it but I think the overhead it would
>> > introduce could be severe. Worse, if a workload fits in memory and there
>> > is limited to no memory pressure, it's all overhead for no gain. Early
>> > generations of NUMA balancing had to find a balance to ensure the gains
>> > from locality exceeded the cost of measuring locality, and doing the
>> > same for page aging is in some ways even more challenging.
>>
>> Yes.  I will think more about it from the overhead vs. gain point of
>> view.  Thanks a lot for sharing your thoughts on that.
>>
>> >> From yet another point of view :-), in the current NUMA balancing
>> >> implementation, it's assumed that the node-private pages can fit in
>> >> the accessing node.  But this may not always be true.  Is it a valid
>> >> optimization to migrate the hot private pages first?
>> >>
>> >
>> > I'm not sure how the hotness of pages could be ranked. At the time of a
>> > hinting fault, the page is by definition active now because it has been
>> > accessed. Prioritising what pages to migrate based on the number of faults
>> > that have been trapped would have to be stored somewhere.
>>
>> Yes.  We need to store some information for that.  In an old version
>> of the patchset, which uses NUMA balancing to promote hot pages from
>> PMEM to DRAM, we designed a method to measure the hotness of pages.
>> The basic idea is as follows,
>>
>> - When the page tables of a process are scanned, the latest N scanned
>>   address ranges and their scan times are recorded in a ring buffer in
>>   the mm_struct.
>>
>> - In the hint page fault handler, the ring buffer is searched with
>>   the fault address to get the scan time.
>>
>> Then the hint page fault latency of the page is defined as,
>>
>>   hint page fault latency = fault time - scan time
>>
>> The shorter the hint page fault latency, the hotter the page.
>>
>> Then we need a way to determine the hot/cold threshold.  We used a
>> rate-limit-based threshold adjustment method.  If the number of pages
>> that pass the threshold is much higher than the rate limit, we lower
>> the threshold (make it stricter), and vice versa.
>
> Sorry for the late reply. I do see where you are coming from and I
> agree in principle. The aging and the NUMA balancing should be talking
> to each other, and IMO, it is easier for the aging to help the NUMA
> balancing because it has to do the legwork anyway.
>
> My idea is to make the page table scanning in the multigenerational
> LRU NUMA policy aware -- I don't have any concrete plan yet. But in
> general, it can range from mildly skewing the aging of pages from
> wrong nodes so they become preferable during eviction to aggressively
> working against those pages like queue_pages_pte_range() does.

As Mel has pointed out, the policies for page aging and locality are
different, so it's not easy to simply combine them.  And it appears
that we can already get some page hotness estimation from the NUMA
balancing hint page fault latency.

Best Regards,
Huang, Ying


