From: Yu Zhao <yuzhao@google.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Andi Kleen" <ak@linux.intel.com>,
	"Aneesh Kumar" <aneesh.kumar@linux.ibm.com>,
	"Catalin Marinas" <catalin.marinas@arm.com>,
	"Dave Hansen" <dave.hansen@linux.intel.com>,
	"Hillf Danton" <hdanton@sina.com>, "Jens Axboe" <axboe@kernel.dk>,
	"Jesse Barnes" <jsbarnes@google.com>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Mel Gorman" <mgorman@suse.de>,
	"Michael Larabel" <Michael@michaellarabel.com>,
	"Michal Hocko" <mhocko@kernel.org>,
	"Mike Rapoport" <rppt@kernel.org>,
	"Rik van Riel" <riel@surriel.com>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Will Deacon" <will@kernel.org>,
	"Linux ARM" <linux-arm-kernel@lists.infradead.org>,
	"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	"Kernel Page Reclaim v2" <page-reclaim@google.com>,
	"the arch/x86 maintainers" <x86@kernel.org>,
	"Brian Geffon" <bgeffon@google.com>,
	"Jan Alexander Steffens" <heftig@archlinux.org>,
	"Oleksandr Natalenko" <oleksandr@natalenko.name>,
	"Steven Barrett" <steven@liquorix.net>,
	"Suleiman Souhlal" <suleiman@google.com>,
	"Daniel Byrne" <djbyrne@mtu.edu>,
	"Donald Carr" <d@chaos-reins.com>,
	"Holger Hoffstätte" <holger@applied-asynchrony.com>,
	"Konstantin Kharlamov" <Hi-Angel@yandex.ru>,
	"Shuang Zhai" <szhai2@cs.rochester.edu>,
	"Sofia Trinh" <sofia.trinh@edi.works>,
	"Vaibhav Jain" <vaibhav@linux.ibm.com>
Subject: Re: [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation
Date: Wed, 16 Mar 2022 01:54:41 -0600
Message-ID: <CAOUHufYBPSx8W5oP=Rf2Sa9QoMhUbEyiF-heR9SuQhcVp+42Rw@mail.gmail.com>
In-Reply-To: <87wnguwif3.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Tue, Mar 15, 2022 at 11:55 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Yu,
>
> Yu Zhao <yuzhao@google.com> writes:
>
> [snip]
>
> >
> > +static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
> > +{
> > +     struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > +     struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > +
> > +     if (!can_demote(pgdat->node_id, sc) &&
> > +         mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
> > +             return 0;
> > +
> > +     return mem_cgroup_swappiness(memcg);
> > +}
> > +
>
> We have tested v9 on a memory tiering system; demotion now works even
> without swap devices configured.  Thanks!

Admittedly I didn't test it :) So thanks for testing -- I'm glad to
hear it didn't fall apart.
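
(For the archives: the key is the can_demote() check in get_swappiness()
above -- when demotion is possible, swappiness stays nonzero even with
no swap device, so anon pages remain eligible for reclaim, i.e.,
demotion.)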

> And we found that the demotion (page reclaiming on DRAM nodes) speed is
> lower than with the original implementation.

This sounds like an improvement to me, assuming the initial hot/cold
memory placements were similar for both the baseline and MGLRU.

Correct me if I'm wrong: since demotion is driven by promotion, a lower
demotion speed means hot and cold pages were sorted out between DRAM
and AEP faster, hence an improvement.

# promotion path:
numa_hint_faults    498301236
numa_pages_migrated 152650705

numa_hint_faults    494583387
numa_pages_migrated 34165992

# demotion path:
pgsteal_anon 153798203
pgsteal_file 33

pgsteal_anon 32701576
pgsteal_file 33

The hint faults are similar, but MGLRU migrated far fewer pages -- my
guess is that it demoted far fewer hot/warm pages and therefore left
less work for the promotion path.
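
To put numbers on it (a back-of-the-envelope check against the counters
above), the migration ratio numa_pages_migrated / numa_hint_faults
drops from 152650705 / 498301236 ~= 30.6% (MGLRU disabled) to
34165992 / 494583387 ~= 6.9% (MGLRU enabled).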

>  The workload itself is just a
> memory-accessing micro-benchmark with a Gaussian access distribution.
> It is run on a system with DRAM and PMEM.  Initially, quite a few hot
> pages are placed in PMEM and quite a few cold pages are placed in
> DRAM.  Then the page placement optimizing mechanism based on NUMA
> balancing tries to promote some hot pages from the PMEM node to the
> DRAM node.

So my understanding above seems correct?

>  If the DRAM node
> is nearly full (reaches the high watermark), kswapd of the DRAM node
> will be woken up to demote (reclaim) some cold DRAM pages to PMEM.
> Because quite a few pages on DRAM are very cold (not accessed for at
> least several seconds), the benchmark performs better if the demotion
> speed is faster.

I'm confused. It seems to me demotion speed is irrelevant. The time to
reach the equilibrium is what we want to measure.

> Some data from /proc/vmstat and perf-profile follows.
>
> From /proc/vmstat, it seems that far fewer pages are scanned and
> demoted with MGLRU enabled.  The pgdemote_kswapd / pgscan_kswapd ratio
> is 5.22 times higher with MGLRU enabled than with MGLRU disabled.  I
> think this shows the value of direct page table scanning.

Can't disagree :)
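
Checking the arithmetic against the vmstat dumps below:

  MGLRU disabled: pgdemote_kswapd / pgscan_kswapd = 153796237 / 2055504891 ~= 0.075
  MGLRU enabled:  pgdemote_kswapd / pgscan_kswapd =  32701609 /   83582770 ~= 0.391

0.391 / 0.075 is indeed ~5.2x, i.e., MGLRU demoted roughly five times
more of what it scanned.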

> From perf-profile, the CPU cycles for kswapd are the same, but fewer
> pages are demoted (reclaimed) with MGLRU.  And it appears that the
> total page table scanning time of MGLRU is longer, if we compare
> walk_page_range (1.97%, MGLRU enabled) with page_referenced (0.54%,
> MGLRU disabled)?

It's possible if the address space is very large and sparse. But once
MGLRU warms up, it should detect the sparsity and fall back to
page_referenced().
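
To illustrate the idea only (this is a made-up sketch, not the actual
v9 code -- the names and the ratio are invented), the heuristic amounts
to something like:

    #include <stdbool.h>

    struct walk_stats {
            unsigned long ptes_visited;  /* PTEs looked at in the last walk */
            unsigned long ptes_young;    /* of those, how many were young */
    };

    /*
     * Fall back to rmap-based aging (the page_referenced() path) when
     * fewer than 1 in SPARSE_RATIO visited PTEs was young, i.e., the
     * address space is large and sparse and full walks waste cycles.
     */
    #define SPARSE_RATIO 64UL

    static bool prefer_rmap_aging(const struct walk_stats *s)
    {
            return s->ptes_young * SPARSE_RATIO < s->ptes_visited;
    }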

> Because we only demote (reclaim) from DRAM nodes, but do not demote
> (reclaim) from PMEM nodes, and the bloom filter doesn't work well
> enough?

The bloom filters are per lruvec. So this shouldn't affect them.

> One thing that may be not friendly for bloom filter is that some virtual
> pages may change their resident nodes because of demotion/promotion.

Yes, it's possible.

> Can you teach me how to interpret these data for MGLRU?  Or can you
> point me to other/better data for MGLRU?

You are the expert :)

My current understanding is that this is an improvement. IOW, with
MGLRU, DRAM (hot) <-> AEP (cold) reached equilibrium a lot faster.
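
The vmstat dumps below point the same way: with MGLRU, kswapd ran much
less often (pageoutrun 5895 vs 21583) and hit the high watermark
quickly far less often (kswapd_high_wmark_hit_quickly 5262 vs 17732).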


> MGLRU disabled via: echo -n 0 > /sys/kernel/mm/lru_gen/enabled
> --------------------------------------------------------------
>
> /proc/vmstat:
>
> pgactivate 1767172340
> pgdeactivate 1740111896
> pglazyfree 0
> pgfault 583875828
> pgmajfault 0
> pglazyfreed 0
> pgrefill 1740111896
> pgreuse 22626572
> pgsteal_kswapd 153796237
> pgsteal_direct 1999
> pgdemote_kswapd 153796237
> pgdemote_direct 1999
> pgscan_kswapd 2055504891
> pgscan_direct 1999
> pgscan_direct_throttle 0
> pgscan_anon 2055356614
> pgscan_file 150276
> pgsteal_anon 153798203
> pgsteal_file 33
> zone_reclaim_failed 0
> pginodesteal 0
> slabs_scanned 82761
> kswapd_inodesteal 0
> kswapd_low_wmark_hit_quickly 2960
> kswapd_high_wmark_hit_quickly 17732
> pageoutrun 21583
> pgrotated 0
> drop_pagecache 0
> drop_slab 0
> oom_kill 0
> numa_pte_updates 515994024
> numa_huge_pte_updates 154
> numa_hint_faults 498301236
> numa_hint_faults_local 121109067
> numa_pages_migrated 152650705
> pgmigrate_success 307213704
> pgmigrate_fail 39
> thp_migration_success 93
> thp_migration_fail 0
> thp_migration_split 0
>
> perf-profile:
>
> kswapd.kthread.ret_from_fork: 2.86
> balance_pgdat.kswapd.kthread.ret_from_fork: 2.86
> shrink_node.balance_pgdat.kswapd.kthread.ret_from_fork: 2.85
> shrink_lruvec.shrink_node.balance_pgdat.kswapd.kthread: 2.76
> shrink_inactive_list.shrink_lruvec.shrink_node.balance_pgdat.kswapd: 1.9
> shrink_page_list.shrink_inactive_list.shrink_lruvec.shrink_node.balance_pgdat: 1.52
> shrink_active_list.shrink_lruvec.shrink_node.balance_pgdat.kswapd: 0.85
> migrate_pages.shrink_page_list.shrink_inactive_list.shrink_lruvec.shrink_node: 0.79
> page_referenced.shrink_page_list.shrink_inactive_list.shrink_lruvec.shrink_node: 0.54
>
>
> MGLRU enabled via: echo -n 7 > /sys/kernel/mm/lru_gen/enabled
> -------------------------------------------------------------
>
> /proc/vmstat:
>
> pgactivate 47212585
> pgdeactivate 0
> pglazyfree 0
> pgfault 580056521
> pgmajfault 0
> pglazyfreed 0
> pgrefill 6911868880
> pgreuse 25108929
> pgsteal_kswapd 32701609
> pgsteal_direct 0
> pgdemote_kswapd 32701609
> pgdemote_direct 0
> pgscan_kswapd 83582770
> pgscan_direct 0
> pgscan_direct_throttle 0
> pgscan_anon 83549777
> pgscan_file 32993
> pgsteal_anon 32701576
> pgsteal_file 33
> zone_reclaim_failed 0
> pginodesteal 0
> slabs_scanned 84829
> kswapd_inodesteal 0
> kswapd_low_wmark_hit_quickly 313
> kswapd_high_wmark_hit_quickly 5262
> pageoutrun 5895
> pgrotated 0
> drop_pagecache 0
> drop_slab 0
> oom_kill 0
> numa_pte_updates 512084786
> numa_huge_pte_updates 198
> numa_hint_faults 494583387
> numa_hint_faults_local 129411334
> numa_pages_migrated 34165992
> pgmigrate_success 67833977
> pgmigrate_fail 7
> thp_migration_success 135
> thp_migration_fail 0
> thp_migration_split 0
>
> perf-profile:
>
> kswapd.kthread.ret_from_fork: 2.86
> balance_pgdat.kswapd.kthread.ret_from_fork: 2.86
> lru_gen_age_node.balance_pgdat.kswapd.kthread.ret_from_fork: 1.97
> walk_page_range.try_to_inc_max_seq.lru_gen_age_node.balance_pgdat.kswapd: 1.97
> shrink_node.balance_pgdat.kswapd.kthread.ret_from_fork: 0.89
> evict_folios.lru_gen_shrink_lruvec.shrink_lruvec.shrink_node.balance_pgdat: 0.89
> scan_folios.evict_folios.lru_gen_shrink_lruvec.shrink_lruvec.shrink_node: 0.66
>
> Best Regards,
> Huang, Ying
>
> [snip]
>

