Re: [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation

From: Yu Zhao <yuzhao@google.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Mel Gorman" <mgorman@suse.de>,
	"Michal Hocko" <mhocko@kernel.org>,
	"Andi Kleen" <ak@linux.intel.com>,
	"Aneesh Kumar" <aneesh.kumar@linux.ibm.com>,
	"Barry Song" <21cnbao@gmail.com>,
	"Catalin Marinas" <catalin.marinas@arm.com>,
	"Dave Hansen" <dave.hansen@linux.intel.com>,
	"Hillf Danton" <hdanton@sina.com>, "Jens Axboe" <axboe@kernel.dk>,
	"Jesse Barnes" <jsbarnes@google.com>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Michael Larabel" <Michael@michaellarabel.com>,
	"Mike Rapoport" <rppt@kernel.org>,
	"Rik van Riel" <riel@surriel.com>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Will Deacon" <will@kernel.org>,
	"Linux ARM" <linux-arm-kernel@lists.infradead.org>,
	"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	"Kernel Page Reclaim v2" <page-reclaim@google.com>,
	"the arch/x86 maintainers" <x86@kernel.org>,
	"Brian Geffon" <bgeffon@google.com>,
	"Jan Alexander Steffens" <heftig@archlinux.org>,
	"Oleksandr Natalenko" <oleksandr@natalenko.name>,
	"Steven Barrett" <steven@liquorix.net>,
	"Suleiman Souhlal" <suleiman@google.com>,
	"Daniel Byrne" <djbyrne@mtu.edu>,
	"Donald Carr" <d@chaos-reins.com>,
	"Holger Hoffstätte" <holger@applied-asynchrony.com>,
	"Konstantin Kharlamov" <Hi-Angel@yandex.ru>,
	"Shuang Zhai" <szhai2@cs.rochester.edu>,
	"Sofia Trinh" <sofia.trinh@edi.works>
Subject: Re: [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation
Date: Wed, 23 Feb 2022 21:09:56 -0700	[thread overview]
Message-ID: <CAOUHufYgNr-82AsfGFu6DVOsVUdmVsOo2Jav3nHDXiuu6iDC9A@mail.gmail.com> (raw)
In-Reply-To: <87h78p3pp2.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Wed, Feb 23, 2022 at 8:32 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > On Wed, Feb 23, 2022 at 5:59 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yu Zhao <yuzhao@google.com> writes:
> >>
> >> > On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Hi, Yu,
> >> >>
> >> >> Yu Zhao <yuzhao@google.com> writes:
> >> >>
> >> >> > To avoid confusions, the terms "promotion" and "demotion" will be
> >> >> > applied to the multigenerational LRU, as a new convention; the terms
> >> >> > "activation" and "deactivation" will be applied to the active/inactive
> >> >> > LRU, as usual.
> >> >>
> >> >> In the memory tiering related commits and patchset, for example as follows,
> >> >>
> >> >> commit 668e4147d8850df32ca41e28f52c146025ca45c6
> >> >> Author: Yang Shi <yang.shi@linux.alibaba.com>
> >> >> Date:   Thu Sep 2 14:59:19 2021 -0700
> >> >>
> >> >>     mm/vmscan: add page demotion counter
> >> >>
> >> >> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
> >> >>
> >> >> "demote" and "promote" is used for migrating pages between different
> >> >> types of memory.  Is it better for us to avoid overloading these words
> >> >> too much to avoid the possible confusion?
> >> >
> >> > Given that LRU and migration are usually different contexts, I think
> >> > we'd be fine, unless we want a third pair of terms.
> >>
> >> This is true before memory tiering is introduced.  In systems with
> >> multiple types memory (called memory tiering), LRU is used to identify
> >> pages to be migrated to the slow memory node.  Please take a look at
> >> can_demote(), which is called in shrink_page_list().
> >
> > This sounds clearly two contexts to me. Promotion/demotion (move
> > between generations) while pages are on LRU; or promotion/demotion
> > (migration between nodes) after pages are taken off LRU.
> >
> > Note that promotion/demotion are not used in function names. They are
> > used to describe how MGLRU works, in comparison with the
> > active/inactive LRU. Memory tiering is not within this context.
>
> Because we have used pgdemote_* in /proc/vmstat, "demotion_enabled" in
> /sys/kernel/mm/numa, and will use pgpromote_* in /proc/vmstat.  It seems
> better to avoid to use promote/demote directly for MGLRU in ABI.  A
> possible solution is to use "mglru" and "promote/demote" together (such
> as "mglru_promote_*" when it is needed?

*If* it is needed. Currently there are no such plans.

> >> >> > +static int get_swappiness(struct mem_cgroup *memcg)
> >> >> > +{
> >> >> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
> >> >> > +            mem_cgroup_swappiness(memcg) : 0;
> >> >> > +}
> >> >>
> >> >> After we introduced demotion support in Linux kernel.  The anonymous
> >> >> pages in the fast memory node could be demoted to the slow memory node
> >> >> via the page reclaiming mechanism as in the following commit.  Can you
> >> >> consider that too?
> >> >
> >> > Sure. How do I check whether there is still space on the slow node?
> >>
> >> You can always check the watermark of the slow node.  But now, we
> >> actually don't check that (as in demote_page_list()), instead we will
> >> wake up kswapd of the slow node.  The intended behavior is something
> >> like,
> >>
> >>   DRAM -> PMEM -> disk
> >
> > I'll look into this later -- for now, it's a low priority because
> > there isn't much demand. I'll bump it up if anybody is interested in
> > giving it a try. Meanwhile, please feel free to cook up something if
> > you are interested.
>
> When we introduce a new feature, we shouldn't break an existing one.
> That is, not introducing regression.  I think that it is a rule?
>
> If my understanding were correct, MGLRU will ignore to scan anonymous
> page list even if there's demotion target for the node.  This breaks the
> demotion feature in the upstream kernel.  Right?

I'm not saying this shouldn't be fixed. I'm saying it's a low priority
until somebody is interested in using/testing it (or making it work).

Regarding regressions, I'm sure MGLRU *will* regress many workloads.
Its goal is to improve the majority of use cases, i.e., total net
gain. Trying to improve everything is methodically wrong because the
problem space is near infinite but the resource is limited. So we have
to prioritize major use cases over minor ones. The bottom line is
users have a choice not to use MGLRU.

> It's a new feature to check whether there is still space on the slow
> node.  We can look at that later.

SGTM.