From: Yu Zhao <yuzhao@google.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: "Stephen Rothwell" <sfr@rothwell.id.au>,
	Linux-MM <linux-mm@kvack.org>, "Andi Kleen" <ak@linux.intel.com>,
	"Aneesh Kumar" <aneesh.kumar@linux.ibm.com>,
	"Barry Song" <21cnbao@gmail.com>,
	"Catalin Marinas" <catalin.marinas@arm.com>,
	"Dave Hansen" <dave.hansen@linux.intel.com>,
	"Hillf Danton" <hdanton@sina.com>, "Jens Axboe" <axboe@kernel.dk>,
	"Jesse Barnes" <jsbarnes@google.com>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Mel Gorman" <mgorman@suse.de>,
	"Michael Larabel" <Michael@michaellarabel.com>,
	"Michal Hocko" <mhocko@kernel.org>,
	"Mike Rapoport" <rppt@kernel.org>,
	"Rik van Riel" <riel@surriel.com>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Will Deacon" <will@kernel.org>,
	"Ying Huang" <ying.huang@intel.com>,
	"Linux ARM" <linux-arm-kernel@lists.infradead.org>,
	"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	"Kernel Page Reclaim v2" <page-reclaim@google.com>,
	"the arch/x86 maintainers" <x86@kernel.org>,
	"Brian Geffon" <bgeffon@google.com>,
	"Jan Alexander Steffens" <heftig@archlinux.org>,
	"Oleksandr Natalenko" <oleksandr@natalenko.name>,
	"Steven Barrett" <steven@liquorix.net>,
	"Suleiman Souhlal" <suleiman@google.com>,
	"Daniel Byrne" <djbyrne@mtu.edu>,
	"Donald Carr" <d@chaos-reins.com>,
	"Holger Hoffstätte" <holger@applied-asynchrony.com>,
	"Konstantin Kharlamov" <Hi-Angel@yandex.ru>,
	"Shuang Zhai" <szhai2@cs.rochester.edu>,
	"Sofia Trinh" <sofia.trinh@edi.works>,
	"Vaibhav Jain" <vaibhav@linux.ibm.com>
Subject: Re: [PATCH v10 13/14] mm: multi-gen LRU: admin guide
Date: Fri, 15 Apr 2022 20:22:42 -0600
Message-ID: <CAOUHufacnY6zMzkMvgHD9_DAwDcnpq7a9YdYT3SKUV8dAi=Fmw@mail.gmail.com>
In-Reply-To: <20220411191639.52c62959489a6c27cb7d251e@linux-foundation.org>

On Mon, Apr 11, 2022 at 8:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed,  6 Apr 2022 21:15:25 -0600 Yu Zhao <yuzhao@google.com> wrote:
>
> > +Kill switch
> > +-----------
> > +``enable`` accepts different values to enable or disable the following
>
> It's actually called "enabled".

Good catch. Thanks!

> And I suggest that the file name be
> included right there in the title.  ie.
>
> "enabled": Kill Switch
> ======================

Will do.

> > +Experimental features
> > +=====================
> > +``/sys/kernel/debug/lru_gen`` accepts commands described in the
> > +following subsections. Multiple command lines are supported, so does
> > +concatenation with delimiters ``,`` and ``;``.
> > +
> > +``/sys/kernel/debug/lru_gen_full`` provides additional stats for
> > +debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
> > +evicted generations in this file.
> > +
> > +Working set estimation
> > +----------------------
> > +Working set estimation measures how much memory an application
> > +requires in a given time interval, and it is usually done with little
> > +impact on the performance of the application. E.g., data centers want
> > +to optimize job scheduling (bin packing) to improve memory
> > +utilizations. When a new job comes in, the job scheduler needs to find
> > +out whether each server it manages can allocate a certain amount of
> > +memory for this new job before it can pick a candidate. To do so, this
> > +job scheduler needs to estimate the working sets of the existing jobs.
>
> These various sysfs interfaces are a big deal.  Because they are so
> hard to change once released.

Debugfs, not sysfs. The title is "Experimental features" :)

> btw, what is this "job scheduler" of which you speak?

Basically it's part of cluster management software. Many jobs
(programs + data) can run concurrently in the same cluster and the job
scheduler of this cluster does the bin packing. To improve resource
utilization, the job scheduler needs to know the (memory) size of each
job it packs, hence the working set estimation (how much memory a job
uses within a given time interval). The job scheduler also takes
memory back from some jobs so that they fit better on a single
machine (proactive reclaim).
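
To make the bin packing part concrete, here is a minimal sketch of a
first-fit placement driven by working set estimates. It is an
illustration only and not taken from any real scheduler: the Server
class, place_job() and the MB figures are hypothetical names and
numbers, and a real scheduler would rank candidates rather than take
the first fit.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical view of one machine as seen by a job scheduler."""
    name: str
    capacity_mb: int
    # Estimated working sets (MB) of the jobs already placed here.
    working_sets_mb: list = field(default_factory=list)

    def free_mb(self) -> int:
        # Capacity minus the estimated working sets, not the job specs.
        return self.capacity_mb - sum(self.working_sets_mb)


def place_job(servers, job_working_set_mb):
    """First-fit: place the job on the first server whose estimated free
    memory covers it. Returns that server, or None if none can take it."""
    for server in servers:
        if server.free_mb() >= job_working_set_mb:
            server.working_sets_mb.append(job_working_set_mb)
            return server
    return None
```

With servers "a" (1000 MB, 900 MB in use) and "b" (1000 MB, 200 MB in
use), a 300 MB job skips "a" and lands on "b". The better the working
set estimates are, the less each server has to be overprovisioned.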

> Is there an open
> source implementation upon which we hope the world will converge?

There are many [1], e.g., Kubernetes (k8s). Personally, I don't think
they'll ever converge.

At the moment, all open source implementations I know of rely on users
manually specifying the size of each job (job spec), e.g., [2]. Users
overprovision memory to avoid OOM kills, so average memory
utilization is generally surprisingly low. What we can hope for is
that eventually some of the open source implementations will use the
working set estimation and proactive reclaim features provided here.

[1] https://en.wikipedia.org/wiki/List_of_cluster_management_software
[2] https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

> > +Proactive reclaim
> > +-----------------
> > +Proactive reclaim induces memory reclaim when there is no memory
> > +pressure and usually targets cold memory only. E.g., when a new job
> > +comes in, the job scheduler wants to proactively reclaim memory on the
> > +server it has selected to improve the chance of successfully landing
> > +this new job.
> > +
> > +Users can write ``- memcg_id node_id min_gen_nr [swappiness
> > +[nr_to_reclaim]]`` to ``lru_gen`` to evict generations less than or
> > +equal to ``min_gen_nr``. Note that ``min_gen_nr`` should be less than
> > +``max_gen_nr-1`` as ``max_gen_nr`` and ``max_gen_nr-1`` are not fully
> > +aged and therefore cannot be evicted. ``swappiness`` overrides the
> > +default value in ``/proc/sys/vm/swappiness``. ``nr_to_reclaim`` limits
> > +the number of pages to evict.
> > +
> > +A typical use case is that a job scheduler writes to ``lru_gen``
> > +before it tries to land a new job on a server, and if it fails to
> > +materialize the cold memory without impacting the existing jobs on
> > +this server, it retries on the next server according to the ranking
> > +result obtained from the working set estimation step described
> > +earlier.
>
> It sounds to me that these interfaces were developed in response to
> ongoing development and use of a particular job scheduler.

I did borrow some of my previous experience with Google's data
centers. But I'm a Chrome OS developer now, so I designed them to be
job scheduler agnostic :)
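
As an illustration of how little the interface assumes about the
caller, the eviction command quoted above could be assembled as in the
sketch below. The command format and the debugfs path come from the
quoted text; build_evict_command() and request_eviction() are
hypothetical helper names, and the caller is assumed to already know
max_gen_nr for the memcg and node in question.

```python
LRU_GEN_PATH = "/sys/kernel/debug/lru_gen"  # debugfs file from the admin guide


def build_evict_command(memcg_id, node_id, min_gen_nr, max_gen_nr,
                        swappiness=None, nr_to_reclaim=None):
    """Build '- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]'.

    max_gen_nr is not part of the command; it is only used to reject
    generations that are not fully aged yet.
    """
    # max_gen_nr and max_gen_nr-1 cannot be evicted.
    if min_gen_nr >= max_gen_nr - 1:
        raise ValueError("min_gen_nr must be less than max_gen_nr-1")
    # nr_to_reclaim is nested inside the optional swappiness field.
    if nr_to_reclaim is not None and swappiness is None:
        raise ValueError("nr_to_reclaim requires swappiness")

    fields = ["-", str(memcg_id), str(node_id), str(min_gen_nr)]
    if swappiness is not None:
        fields.append(str(swappiness))
        if nr_to_reclaim is not None:
            fields.append(str(nr_to_reclaim))
    return " ".join(fields)


def request_eviction(command, path=LRU_GEN_PATH):
    """Write one command to lru_gen (assumes root and a mounted debugfs)."""
    with open(path, "w") as f:
        f.write(command)
```

For example, build_evict_command(1, 0, 2, 5, swappiness=50,
nr_to_reclaim=1000) yields "- 1 0 2 50 1000". Nothing in it is
specific to a particular scheduler.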

> This is a very good thing, but has thought been given to the potential
> needs of other job schedulers?

Yes, basically I'm trying to help everybody replicate the success
stories at Google and Meta [3][4].

[3] https://dl.acm.org/doi/10.1145/3297858.3304053
[4] https://dl.acm.org/doi/10.1145/3503222.3507731

