archive mirror
 help / color / mirror / Atom feed
From: Yu Zhao <>
Cc:, Hillf Danton <>,, Yu Zhao <>,
	Konstantin Kharlamov <>
Subject: [PATCH v4 11/11] mm: multigenerational lru: documentation
Date: Wed, 18 Aug 2021 00:31:07 -0600	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

Add Documentation/vm/multigen_lru.rst.

Signed-off-by: Yu Zhao <>
Tested-by: Konstantin Kharlamov <>
 Documentation/vm/index.rst        |   1 +
 Documentation/vm/multigen_lru.rst | 134 ++++++++++++++++++++++++++++++
 2 files changed, 135 insertions(+)
 create mode 100644 Documentation/vm/multigen_lru.rst

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index eff5fbd492d0..c353b3f55924 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -17,6 +17,7 @@ various features of the Linux memory management
+   multigen_lru
 Kernel developers MM documentation
diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
new file mode 100644
index 000000000000..adedff5319d9
--- /dev/null
+++ b/Documentation/vm/multigen_lru.rst
@@ -0,0 +1,134 @@
+.. SPDX-License-Identifier: GPL-2.0
+Multigenerational LRU
+Quick Start
+Build Configurations
+:Required: Set ``CONFIG_LRU_GEN=y``.
+:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to turn the feature on by
+ default.
+Runtime Configurations
+:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the
+ feature was not turned on by default.
+:Optional: Write ``N`` to ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to
+ protect the working set of ``N`` milliseconds. The OOM killer is
+ invoked if this working set cannot be kept in memory.
+:Optional: Read ``/sys/kernel/debug/lru_gen`` to confirm the feature
+ is turned on. This file has the following output:
+  memcg  memcg_id  memcg_path
+    node  node_id
+      min_gen  birth_time  anon_size  file_size
+      ...
+      max_gen  birth_time  anon_size  file_size
+``min_gen`` is the oldest generation number and ``max_gen`` is the
+youngest generation number. ``birth_time`` is in milliseconds.
+``anon_size`` and ``file_size`` are in pages.
+No additional configurations required.
+Servers/Data Centers
+:To support more generations: Change ``CONFIG_NR_LRU_GENS`` to a
+ larger number.
+:To support more tiers: Change ``CONFIG_TIERS_PER_GEN`` to a larger
+ number.
+:To support full stats: Set ``CONFIG_LRU_GEN_STATS=y``.
+:Working set estimation: Write ``+ memcg_id node_id max_gen
+ [swappiness]`` to ``/sys/kernel/debug/lru_gen`` to invoke the aging,
+ which scans PTEs for accessed pages and then creates the next
+ generation ``max_gen+1``. A swap file and a non-zero ``swappiness``,
+ which overrides ``vm.swappiness``, are required to scan PTEs mapping
+ anon pages.
+:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness]
+ [nr_to_reclaim]`` to ``/sys/kernel/debug/lru_gen`` to invoke the
+ eviction, which evicts generations less than or equal to ``min_gen``.
+ ``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and
+ ``max_gen-1`` are not fully aged and therefore cannot be evicted.
+ ``nr_to_reclaim`` can be used to limit the number of pages to evict.
+ Multiple command lines are supported, so does concatenation with
+ delimiters ``,`` and ``;``.
+For each ``lruvec``, evictable pages are divided into multiple
+generations. The youngest generation number is stored in
+``lrugen->max_seq`` for both anon and file types as they are aged on
+an equal footing. The oldest generation numbers are stored in
+``lrugen->min_seq[2]`` separately for anon and file types as clean
+file pages can be evicted regardless of swap and writeback
+constraints. These three variables are monotonically increasing.
+Generation numbers are truncated into
+``order_base_2(CONFIG_NR_LRU_GENS+1)`` bits in order to fit into
+``page->flags``. The sliding window technique is used to prevent
+truncated generation numbers from overlapping. Each truncated
+generation number is an index to an array of per-type and per-zone
+lists ``lrugen->lists``.
+Each generation is then divided into multiple tiers. Tiers represent
+levels of usage from file descriptors only. Pages accessed ``N`` times
+via file descriptors belong to tier ``order_base_2(N)``. Each
+generation contains at most ``CONFIG_TIERS_PER_GEN`` tiers, and they
+require additional ``CONFIG_TIERS_PER_GEN-2`` bits in ``page->flags``.
+In contrast to moving across generations which requires list
+operations, moving across tiers only involves operations on
+``page->flags`` and therefore has a negligible cost. A feedback loop
+modeled after the PID controller monitors refault rates of all tiers
+and decides when to protect pages from which tiers.
+The framework comprises two conceptually independent components: the
+aging and the eviction, which can be invoked separately from user
+space for the purpose of working set estimation and proactive reclaim.
+The aging produces young generations. Given an ``lruvec``, the aging
+traverses ``lruvec_memcg()->mm_list`` and calls ``walk_page_range()``
+to scan PTEs for accessed pages (a ``mm_struct`` list is maintained
+for each ``memcg``). Upon finding one, the aging updates its
+generation number to ``max_seq`` (modulo ``CONFIG_NR_LRU_GENS``).
+After each round of traversal, the aging increments ``max_seq``. The
+aging is due when both ``min_seq[2]`` have caught up with
+The eviction consumes old generations. Given an ``lruvec``, the
+eviction scans pages on the per-zone lists indexed by anon and file
+``min_seq[2]`` (modulo ``CONFIG_NR_LRU_GENS``). It first tries to
+select a type based on the values of ``min_seq[2]``. If they are
+equal, it selects the type that has a lower refault rate. The eviction
+sorts a page according to its updated generation number if the aging
+has found this page accessed. It also moves a page to the next
+generation if this page is from an upper tier that has a higher
+refault rate than the base tier. The eviction increments
+``min_seq[2]`` of a selected type when it finds all the per-zone lists
+indexed by ``min_seq[2]`` of this selected type are empty.
+To-do List
+KVM Optimization
+Support shadow page table walk.
+NUMA Optimization
+Optimize page table walk for NUMA.

  parent reply	other threads:[~2021-08-18  6:32 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-18  6:30 [PATCH v4 00/11] Multigenerational LRU Framework Yu Zhao
2021-08-18  6:30 ` [PATCH v4 01/11] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao
2021-08-19  9:19   ` Will Deacon
2021-08-19 21:23     ` Yu Zhao
2021-08-18  6:30 ` [PATCH v4 02/11] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Yu Zhao
2021-08-18  6:30 ` [PATCH v4 03/11] mm/vmscan.c: refactor shrink_node() Yu Zhao
2021-08-18  6:31 ` [PATCH v4 04/11] mm: multigenerational lru: groundwork Yu Zhao
2021-08-18  6:31 ` [PATCH v4 05/11] mm: multigenerational lru: protection Yu Zhao
2021-08-18  6:31 ` [PATCH v4 06/11] mm: multigenerational lru: mm_struct list Yu Zhao
2021-08-18  6:31 ` [PATCH v4 07/11] mm: multigenerational lru: aging Yu Zhao
2021-08-18  6:31 ` [PATCH v4 08/11] mm: multigenerational lru: eviction Yu Zhao
2021-08-18  6:31 ` [PATCH v4 09/11] mm: multigenerational lru: user interface Yu Zhao
2021-08-18  6:31 ` [PATCH v4 10/11] mm: multigenerational lru: Kconfig Yu Zhao
2021-08-18  6:31 ` Yu Zhao [this message]
2021-10-09  5:43 ` [PATCH v4 00/11] Multigenerational LRU Framework bot
2021-10-21 19:41 ` bot
2021-11-02  0:20 ` bot
2021-11-09  2:13 ` bot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \
    --subject='Re: [PATCH v4 11/11] mm: multigenerational lru: documentation' \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).