linux-kernel.vger.kernel.org archive mirror
From: Yu Zhao <yuzhao@google.com>
To: linux-mm@kvack.org
Cc: Alex Shi <alexs@kernel.org>, Andi Kleen <ak@linux.intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Benjamin Manes <ben.manes@gmail.com>,
	Dave Chinner <david@fromorbit.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Hillf Danton <hdanton@sina.com>, Jens Axboe <axboe@kernel.dk>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Jonathan Corbet <corbet@lwn.net>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Matthew Wilcox <willy@infradead.org>,
	Mel Gorman <mgorman@suse.de>, Miaohe Lin <linmiaohe@huawei.com>,
	Michael Larabel <michael@michaellarabel.com>,
	Michal Hocko <mhocko@suse.com>,
	Michel Lespinasse <michel@lespinasse.org>,
	Rik van Riel <riel@surriel.com>, Roman Gushchin <guro@fb.com>,
	Rong Chen <rong.a.chen@intel.com>,
	SeongJae Park <sjpark@amazon.de>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Vlastimil Babka <vbabka@suse.cz>, Yang Shi <shy828301@gmail.com>,
	Ying Huang <ying.huang@intel.com>, Zi Yan <ziy@nvidia.com>,
	linux-kernel@vger.kernel.org, lkp@lists.01.org,
	page-reclaim@google.com, Yu Zhao <yuzhao@google.com>
Subject: [PATCH v2 16/16] mm: multigenerational lru: documentation
Date: Tue, 13 Apr 2021 00:56:33 -0600
Message-ID: <20210413065633.2782273-17-yuzhao@google.com>
In-Reply-To: <20210413065633.2782273-1-yuzhao@google.com>

Add Documentation/vm/multigen_lru.rst.

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
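Reading aid for the Framework section of the document added by this
patch: a minimal userspace sketch of how a generation number could be
truncated and how a tier could be derived from an access count. It is
not code from this series. order_base_2_sketch(), gen_bits(),
gen_index() and tier_of() are hypothetical names, and the sliding
window that keeps truncated numbers from overlapping is not modeled.

```c
#include <assert.h>

/*
 * Userspace stand-in for the kernel's order_base_2():
 * ceil(log2(n)), and 0 for n <= 1.
 */
static unsigned int order_base_2_sketch(unsigned long n)
{
	unsigned int order = 0;

	/* The first test keeps the shift below the width of unsigned long. */
	while (order < 8 * sizeof(n) - 1 && (1UL << order) < n)
		order++;
	return order;
}

/*
 * Bits needed for a truncated generation number, following the
 * order_base_2(CONFIG_NR_LRU_GENS+1) expression in the document.
 */
static unsigned int gen_bits(unsigned long nr_gens)
{
	return order_base_2_sketch(nr_gens + 1);
}

/* Index of the per-type, per-zone lists that a sequence number maps to. */
static unsigned long gen_index(unsigned long seq, unsigned long nr_gens)
{
	assert(nr_gens > 0);
	return seq % nr_gens;
}

/* Tier of a page accessed n times via file descriptors. */
static unsigned int tier_of(unsigned long n)
{
	return order_base_2_sketch(n);
}
```

With nr_gens set to 4, gen_bits() yields 3, and sequence numbers 8 and
12 both map to index 0, which is why the window between the oldest and
youngest generations has to stay narrower than nr_gens.
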
 Documentation/vm/index.rst        |   1 +
 Documentation/vm/multigen_lru.rst | 192 ++++++++++++++++++++++++++++++
 2 files changed, 193 insertions(+)
 create mode 100644 Documentation/vm/multigen_lru.rst
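
Reading aid for the Recipes section of the document added by this
patch: a minimal sketch of how the two debugfs command strings could be
assembled and sanity-checked before being written to
/sys/kernel/debug/lru_gen. The helper names, the integer types, and the
convention of passing a negative swappiness (or a zero nr_to_reclaim)
to omit an optional field are assumptions of this sketch, not part of
the interface.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return len if snprintf() succeeded without truncation, else -1. */
static int check_len(int len, size_t size)
{
	return (len < 0 || (size_t)len >= size) ? -1 : len;
}

/*
 * Working set estimation: "+ memcg_id node_id gen [swappiness]".
 * The document says gen should be equal to max_gen.
 */
static int format_aging_cmd(char *buf, size_t size, unsigned int memcg_id,
			    unsigned int node_id, unsigned long gen,
			    unsigned long max_gen, int swappiness)
{
	if (gen != max_gen)
		return -1;
	if (swappiness < 0)
		return check_len(snprintf(buf, size, "+ %u %u %lu",
					  memcg_id, node_id, gen), size);
	return check_len(snprintf(buf, size, "+ %u %u %lu %d",
				  memcg_id, node_id, gen, swappiness), size);
}

/*
 * Proactive reclaim:
 * "- memcg_id node_id gen [swappiness] [nr_to_reclaim]".
 * The document says gen should be less than max_gen-1.
 */
static int format_evict_cmd(char *buf, size_t size, unsigned int memcg_id,
			    unsigned int node_id, unsigned long gen,
			    unsigned long max_gen, int swappiness,
			    unsigned long nr_to_reclaim)
{
	if (gen + 1 >= max_gen)
		return -1;
	if (swappiness < 0)
		return check_len(snprintf(buf, size, "- %u %u %lu",
					  memcg_id, node_id, gen), size);
	if (nr_to_reclaim == 0)
		return check_len(snprintf(buf, size, "- %u %u %lu %d",
					  memcg_id, node_id, gen, swappiness),
				 size);
	return check_len(snprintf(buf, size, "- %u %u %lu %d %lu",
				  memcg_id, node_id, gen, swappiness,
				  nr_to_reclaim), size);
}
```

Both helpers return the string length on success and -1 when the
generation check fails or the buffer is too small. The caller is still
responsible for writing the resulting string to the debugfs file.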

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index eff5fbd492d0..c353b3f55924 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -17,6 +17,7 @@ various features of the Linux memory management
 
    swap_numa
    zswap
+   multigen_lru
 
 Kernel developers MM documentation
 ==================================
diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
new file mode 100644
index 000000000000..cf772aeca317
--- /dev/null
+++ b/Documentation/vm/multigen_lru.rst
@@ -0,0 +1,192 @@
+=====================
+Multigenerational LRU
+=====================
+
+Quick Start
+===========
+Build Options
+-------------
+:Required: Set ``CONFIG_LRU_GEN=y``.
+
+:Optional: Change ``CONFIG_NR_LRU_GENS`` to a number ``X`` to support
+ a maximum of ``X`` generations.
+
+:Optional: Change ``CONFIG_TIERS_PER_GEN`` to a number ``Y`` to support
+ a maximum of ``Y`` tiers per generation.
+
+:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to turn the feature on by
+ default.
+
+Runtime Options
+---------------
+:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the
+ feature was not turned on by default.
+
+:Optional: Change ``/sys/kernel/mm/lru_gen/spread`` to a number ``N``
+ to spread pages out across ``N+1`` generations. ``N`` should be less
+ than ``X``. Larger values make the background aging more aggressive.
+
+:Optional: Read ``/sys/kernel/debug/lru_gen`` to verify the feature.
+ This file has the following output:
+
+::
+
+  memcg  memcg_id  memcg_path
+    node  node_id
+      min_gen  birth_time  anon_size  file_size
+      ...
+      max_gen  birth_time  anon_size  file_size
+
+Given a memcg and a node, ``min_gen`` is the oldest generation
+(number) and ``max_gen`` is the youngest. Birth time is in
+milliseconds. The sizes of anon and file types are in pages.
+
+Recipes
+-------
+:Android on ARMv8.1+: ``X=4``, ``N=0``
+
+:Android on pre-ARMv8.1 CPUs: Not recommended due to the lack of
+ ``ARM64_HW_AFDBM``
+
+:Laptops running Chrome on x86_64: ``X=7``, ``N=2``
+
+:Working set estimation: Write ``+ memcg_id node_id gen [swappiness]``
+ to ``/sys/kernel/debug/lru_gen`` to account referenced pages to
+ generation ``max_gen`` and create the next generation ``max_gen+1``.
+ ``gen`` should be equal to ``max_gen``. A swap file and a non-zero
+ ``swappiness`` are required to scan the anon type. If swapping is not
+ desired, set ``vm.swappiness`` to ``0``.
+
+:Proactive reclaim: Write ``- memcg_id node_id gen [swappiness]
+ [nr_to_reclaim]`` to ``/sys/kernel/debug/lru_gen`` to evict
+ generations less than or equal to ``gen``. ``gen`` should be less
+ than ``max_gen-1`` as ``max_gen`` and ``max_gen-1`` are active
+ generations and therefore protected from eviction. Use
+ ``nr_to_reclaim`` to limit the number of pages to be evicted.
+ Multiple command lines are supported, as is concatenation with
+ the delimiters ``,`` and ``;``.
+
+Framework
+=========
+For each ``lruvec``, evictable pages are divided into multiple
+generations. The youngest generation number is stored in ``max_seq``
+for both anon and file types as they are aged on an equal footing. The
+oldest generation numbers are stored in ``min_seq[2]`` separately for
+anon and file types as clean file pages can be evicted regardless of
+swap and write-back constraints. Generation numbers are truncated into
+``order_base_2(CONFIG_NR_LRU_GENS+1)`` bits in order to fit into
+``page->flags``. The sliding window technique is used to prevent
+truncated generation numbers from overlapping. Each truncated
+generation number is an index into an array of per-type and per-zone
+lists. Evictable pages are added to the per-zone lists indexed by
+``max_seq`` or ``min_seq[2]`` (modulo ``CONFIG_NR_LRU_GENS``),
+depending on whether they are being faulted in.
+
+Each generation is then divided into multiple tiers. Tiers represent
+levels of usage from file descriptors only. Pages accessed ``N`` times
+via file descriptors belong to tier ``order_base_2(N)``. In contrast to
+moving across generations, which requires the LRU lock, moving across
+tiers only involves an atomic operation on ``page->flags`` and
+therefore has a negligible cost.
+
+The workflow comprises two conceptually independent functions: the
+aging and the eviction.
+
+Aging
+-----
+The aging produces young generations. Given an ``lruvec``, the aging
+scans page tables for referenced pages of this ``lruvec``. Upon
+finding one, the aging updates its generation number to ``max_seq``.
+After each round of scanning, the aging increments ``max_seq``.
+
+The aging maintains either a system-wide ``mm_struct`` list or
+per-memcg ``mm_struct`` lists, and it only scans page tables of
+processes that have been scheduled since the last scan. Since scans
+are differential with respect to referenced pages, the cost is roughly
+proportional to their number.
+
+The aging is due when both ``min_seq[2]`` values reach ``max_seq-1``,
+assuming both anon and file types are reclaimable.
+
+Eviction
+--------
+The eviction consumes old generations. Given an ``lruvec``, the
+eviction scans the pages on the per-zone lists indexed by either of
+``min_seq[2]``. It first tries to select a type based on the values of
+``min_seq[2]``. When anon and file types are both available from the
+same generation, it selects the one that has a lower refault rate.
+
+During a scan, the eviction sorts pages according to their generation
+numbers if the aging has found them referenced. It also moves pages
+from the tiers that have higher refault rates than tier 0 to the next
+generation.
+
+When it finds all the per-zone lists of a selected type are empty, the
+eviction increments ``min_seq[2]`` indexed by this selected type.
+
+Rationale
+=========
+Limitations of Current Implementation
+-------------------------------------
+Notion of Active/Inactive
+~~~~~~~~~~~~~~~~~~~~~~~~~
+For servers equipped with hundreds of gigabytes of memory, the
+active/inactive granularity is too coarse to be useful for job
+scheduling. False active/inactive rates are relatively high, and thus
+the assumed savings may not materialize.
+
+For phones and laptops, executable pages are frequently evicted
+despite the fact that there are many less recently used anon pages.
+Major faults on executable pages cause ``janks`` (slow UI renderings)
+and negatively impact user experience.
+
+For ``lruvec``\s from different memcgs or nodes, comparisons are
+impossible due to the lack of a common frame of reference.
+
+Incremental Scans via ``rmap``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Each incremental scan picks up where the last scan left off and
+stops after it has found a handful of unreferenced pages. For
+workloads using a large amount of anon memory, incremental scans lose
+their advantage under sustained memory pressure due to high ratios of
+the number of scanned pages to the number of reclaimed pages. On top
+of that, the ``rmap`` has poor memory locality due to its complex data
+structures. The combined effects typically result in high CPU usage
+in the reclaim path.
+
+Benefits of Multigenerational LRU
+---------------------------------
+Notion of Generation Numbers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The notion of generation numbers introduces a quantitative approach to
+memory overcommit. A larger number of pages can be spread out across
+configurable generations, and thus they have relatively low false
+active/inactive rates. Each generation includes all pages that have
+been referenced since the last generation.
+
+Given an ``lruvec``, scans and the selections between anon and file
+types are all based on generation numbers, which are simple and yet
+effective. For different ``lruvec``\s, comparisons are still possible
+based on birth times of generations.
+
+Differential Scans via Page Tables
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Each differential scan discovers all pages that have been referenced
+since the last scan. Specifically, it walks the ``mm_struct`` list
+associated with an ``lruvec`` to scan page tables of processes that
+have been scheduled since the last scan. The cost of each differential
+scan is roughly proportional to the number of referenced pages it
+discovers. Unless address spaces are extremely sparse, page tables
+usually have better memory locality than the ``rmap``. The end result
+is generally a significant reduction in CPU usage for workloads
+using a large amount of anon memory.
+
+To-do List
+==========
+KVM Optimization
+----------------
+Support shadow page table scanning.
+
+NUMA Optimization
+-----------------
+Support NUMA policies and per-node RSS counters.
-- 
2.31.1.295.g9ea45b61b8-goog
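
Reading aid for the Aging and Eviction sections of the document added
by this patch: a minimal sketch of the "aging is due" condition and of
the choice between anon and file types. It is not code from this
series. struct lru_gen_sketch and its refault_rate field are
hypothetical, and preferring the type with the smaller min_seq is an
assumption, since the document only says the choice is based on the
values of min_seq[2].

```c
#include <assert.h>
#include <stdbool.h>

enum { TYPE_ANON = 0, TYPE_FILE = 1 };

/* Hypothetical per-lruvec state; only the fields the sketch needs. */
struct lru_gen_sketch {
	unsigned long max_seq;     /* youngest generation, shared by types */
	unsigned long min_seq[2];  /* oldest generation per type */
	double refault_rate[2];    /* hypothetical per-type refault rate */
};

/*
 * The aging is due when both min_seq[] values have reached max_seq-1,
 * assuming both types are reclaimable. Using >= rather than == is a
 * defensive choice of this sketch.
 */
static bool aging_is_due(const struct lru_gen_sketch *g)
{
	return g->min_seq[TYPE_ANON] + 1 >= g->max_seq &&
	       g->min_seq[TYPE_FILE] + 1 >= g->max_seq;
}

/*
 * Pick the type to evict from. Assumption: the type whose oldest
 * generation is older goes first. When both types are available from
 * the same generation, the one with the lower refault rate is chosen,
 * as the document states.
 */
static int select_type(const struct lru_gen_sketch *g)
{
	if (g->min_seq[TYPE_ANON] < g->min_seq[TYPE_FILE])
		return TYPE_ANON;
	if (g->min_seq[TYPE_ANON] > g->min_seq[TYPE_FILE])
		return TYPE_FILE;
	return g->refault_rate[TYPE_ANON] < g->refault_rate[TYPE_FILE] ?
	       TYPE_ANON : TYPE_FILE;
}

/* Advance min_seq for a type once all of its per-zone lists are empty. */
static void inc_min_seq(struct lru_gen_sketch *g, int type)
{
	assert(type == TYPE_ANON || type == TYPE_FILE);
	if (g->min_seq[type] < g->max_seq)
		g->min_seq[type]++;
}
```

Capping min_seq at max_seq in inc_min_seq() is also a choice of this
sketch; the document does not describe what happens at that boundary.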

