Re: [PATCH v10 13/14] mm: multi-gen LRU: admin guide

From: Yu Zhao <yuzhao@google.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: "Stephen Rothwell" <sfr@rothwell.id.au>,
	Linux-MM <linux-mm@kvack.org>, "Andi Kleen" <ak@linux.intel.com>,
	"Aneesh Kumar" <aneesh.kumar@linux.ibm.com>,
	"Barry Song" <21cnbao@gmail.com>,
	"Catalin Marinas" <catalin.marinas@arm.com>,
	"Dave Hansen" <dave.hansen@linux.intel.com>,
	"Hillf Danton" <hdanton@sina.com>, "Jens Axboe" <axboe@kernel.dk>,
	"Jesse Barnes" <jsbarnes@google.com>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Mel Gorman" <mgorman@suse.de>,
	"Michael Larabel" <Michael@michaellarabel.com>,
	"Michal Hocko" <mhocko@kernel.org>,
	"Mike Rapoport" <rppt@kernel.org>,
	"Rik van Riel" <riel@surriel.com>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Will Deacon" <will@kernel.org>,
	"Ying Huang" <ying.huang@intel.com>,
	"Linux ARM" <linux-arm-kernel@lists.infradead.org>,
	"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	"Kernel Page Reclaim v2" <page-reclaim@google.com>,
	"the arch/x86 maintainers" <x86@kernel.org>,
	"Brian Geffon" <bgeffon@google.com>,
	"Jan Alexander Steffens" <heftig@archlinux.org>,
	"Oleksandr Natalenko" <oleksandr@natalenko.name>,
	"Steven Barrett" <steven@liquorix.net>,
	"Suleiman Souhlal" <suleiman@google.com>,
	"Daniel Byrne" <djbyrne@mtu.edu>,
	"Donald Carr" <d@chaos-reins.com>,
	"Holger Hoffstätte" <holger@applied-asynchrony.com>,
	"Konstantin Kharlamov" <Hi-Angel@yandex.ru>,
	"Shuang Zhai" <szhai2@cs.rochester.edu>,
	"Sofia Trinh" <sofia.trinh@edi.works>,
	"Vaibhav Jain" <vaibhav@linux.ibm.com>
Subject: Re: [PATCH v10 13/14] mm: multi-gen LRU: admin guide
Date: Fri, 15 Apr 2022 20:22:42 -0600
Message-ID: <CAOUHufacnY6zMzkMvgHD9_DAwDcnpq7a9YdYT3SKUV8dAi=Fmw@mail.gmail.com>
In-Reply-To: <20220411191639.52c62959489a6c27cb7d251e@linux-foundation.org>

On Mon, Apr 11, 2022 at 8:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed,  6 Apr 2022 21:15:25 -0600 Yu Zhao <yuzhao@google.com> wrote:
>
> > +Kill switch
> > +-----------
> > +``enable`` accepts different values to enable or disable the following
>
> It's actually called "enabled".

Good catch. Thanks!

> And I suggest that the file name be
> included right there in the title.  ie.
>
> "enabled": Kill Switch
> ======================

Will do.

> > +Experimental features
> > +=====================
> > +``/sys/kernel/debug/lru_gen`` accepts commands described in the
> > +following subsections. Multiple command lines are supported, so is
> > +concatenation with delimiters ``,`` and ``;``.
> > +
> > +``/sys/kernel/debug/lru_gen_full`` provides additional stats for
> > +debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
> > +evicted generations in this file.
> > +
> > +Working set estimation
> > +----------------------
> > +Working set estimation measures how much memory an application
> > +requires in a given time interval, and it is usually done with little
> > +impact on the performance of the application. E.g., data centers want
> > +to optimize job scheduling (bin packing) to improve memory
> > +utilizations. When a new job comes in, the job scheduler needs to find
> > +out whether each server it manages can allocate a certain amount of
> > +memory for this new job before it can pick a candidate. To do so, this
> > +job scheduler needs to estimate the working sets of the existing jobs.
>
> These various sysfs interfaces are a big deal.  Because they are so
> hard to change once released.

Debugfs, not sysfs. The title is "Experimental features" :)

> btw, what is this "job scheduler" of which you speak?

Basically it's part of cluster management software. Many jobs
(programs + data) can run concurrently in the same cluster and the job
scheduler of this cluster does the bin packing. To improve resource
utilization, the job scheduler needs to know the (memory) size of each
job it packs, hence the working set estimation (how much memory a job
uses within a given time interval). The job scheduler also takes
memory from some jobs so that those jobs can better fit into a single
machine (proactive reclaim).
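
To make the bin packing step concrete, here is a minimal sketch, not
taken from the patch or from any real scheduler. It assumes the
scheduler already holds a per-job working set estimate in MiB; the
function name `rank_servers` and its arguments are hypothetical.

```python
from typing import Dict, List


def rank_servers(capacity_mb: Dict[str, int],
                 working_sets_mb: Dict[str, List[int]],
                 new_job_mb: int) -> List[str]:
    """Return the servers that can fit the new job, most headroom first.

    Headroom is the server capacity minus the sum of the estimated
    working sets of the jobs already running there (an assumption made
    for illustration; a real scheduler would track more than memory).
    """
    headroom = {
        name: capacity_mb[name] - sum(working_sets_mb.get(name, []))
        for name in capacity_mb
    }
    candidates = [n for n, free in headroom.items() if free >= new_job_mb]
    # Sort by descending headroom; break ties by name for determinism.
    return sorted(candidates, key=lambda n: (-headroom[n], n))


if __name__ == "__main__":
    capacity = {"a": 1000, "b": 1000, "c": 500}
    estimates = {"a": [600, 300], "b": [200], "c": [100]}
    print(rank_servers(capacity, estimates, new_job_mb=300))
```

The better the working set estimates, the less headroom each server
has to keep in reserve, which is where the utilization gain would come
from.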

> Is there an open
> source implementation upon which we hope the world will converge?

There are many [1], e.g., Kubernetes (k8s). Personally, I don't think
they'll ever converge.

At the moment, all open source implementations I know of rely on users
manually specifying the size of each job (job spec), e.g., [2]. Users
overprovision memory to avoid OOM kills, so the average memory
utilization is generally surprisingly low. What we can hope for is
that eventually some of the open source implementations will use the
working set estimation and proactive reclaim features provided here.

[1] https://en.wikipedia.org/wiki/List_of_cluster_management_software
[2] https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

> > +Proactive reclaim
> > +-----------------
> > +Proactive reclaim induces memory reclaim when there is no memory
> > +pressure and usually targets cold memory only. E.g., when a new job
> > +comes in, the job scheduler wants to proactively reclaim memory on the
> > +server it has selected to improve the chance of successfully landing
> > +this new job.
> > +
> > +Users can write ``- memcg_id node_id min_gen_nr [swappiness
> > +[nr_to_reclaim]]`` to ``lru_gen`` to evict generations less than or
> > +equal to ``min_gen_nr``. Note that ``min_gen_nr`` should be less than
> > +``max_gen_nr-1`` as ``max_gen_nr`` and ``max_gen_nr-1`` are not fully
> > +aged and therefore cannot be evicted. ``swappiness`` overrides the
> > +default value in ``/proc/sys/vm/swappiness``. ``nr_to_reclaim`` limits
> > +the number of pages to evict.
> > +
> > +A typical use case is that a job scheduler writes to ``lru_gen``
> > +before it tries to land a new job on a server, and if it fails to
> > +materialize the cold memory without impacting the existing jobs on
> > +this server, it retries on the next server according to the ranking
> > +result obtained from the working set estimation step described
> > +earlier.
>
> It sounds to me that these interfaces were developed in response to
> ongoing development and use of a particular job scheduler.

I did borrow from my previous experience with Google's data centers.
But I'm a Chrome OS developer now, so I designed them to be job
scheduler agnostic :)
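
For illustration only, any userspace caller could build the proactive
reclaim command quoted above along these lines. This is a sketch under
assumptions: the helper names `reclaim_command` and `batch` are
hypothetical, and `max_gen_nr` is assumed to have been read from
``lru_gen`` beforehand.

```python
from typing import Iterable, Optional


def reclaim_command(memcg_id: int, node_id: int, min_gen_nr: int,
                    max_gen_nr: int,
                    swappiness: Optional[int] = None,
                    nr_to_reclaim: Optional[int] = None) -> str:
    """Format '- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]'."""
    # The guide says min_gen_nr should be less than max_gen_nr-1, because
    # the two youngest generations are not fully aged.
    if min_gen_nr >= max_gen_nr - 1:
        raise ValueError("min_gen_nr must be less than max_gen_nr - 1")
    # nr_to_reclaim is nested inside the optional swappiness field.
    if nr_to_reclaim is not None and swappiness is None:
        raise ValueError("nr_to_reclaim requires swappiness")
    fields = ["-", memcg_id, node_id, min_gen_nr]
    if swappiness is not None:
        fields.append(swappiness)
    if nr_to_reclaim is not None:
        fields.append(nr_to_reclaim)
    return " ".join(str(f) for f in fields)


def batch(commands: Iterable[str]) -> str:
    """Concatenate commands with ';', one of the documented delimiters."""
    return ";".join(commands)


if __name__ == "__main__":
    cmd = batch([
        reclaim_command(1, 0, 2, max_gen_nr=5),
        reclaim_command(1, 1, 2, max_gen_nr=5, swappiness=0,
                        nr_to_reclaim=1000),
    ])
    print(cmd)
```

The resulting string would then be written to
``/sys/kernel/debug/lru_gen``, which needs debugfs mounted and
sufficient privileges; the sketch stops short of doing the write.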

> This is a very good thing, but has thought been given to the potential
> needs of other job schedulers?

Yes, basically I'm trying to help everybody replicate the success
stories at Google and Meta [3][4].

[3] https://dl.acm.org/doi/10.1145/3297858.3304053
[4] https://dl.acm.org/doi/10.1145/3503222.3507731