From: Yu Zhao <>
Cc: Alex Shi <>,
	Andrew Morton <>,
	Dave Hansen <>,
	Hillf Danton <>,
	Johannes Weiner <>,
	Joonsoo Kim <>,
	Matthew Wilcox <>,
	Mel Gorman <>, Michal Hocko <>,
	Roman Gushchin <>, Vlastimil Babka <>,
	Wei Yang <>,
	Yang Shi <>, Ying Huang <>,
	Yu Zhao <>
Subject: [PATCH v1 00/14] Multigenerational LRU
Date: Sat, 13 Mar 2021 00:57:33 -0700
Message-ID: <>

The current page reclaim is too expensive in terms of CPU usage and
often makes poor choices about what to evict. We would like to offer
a performant, versatile and straightforward augment.

git fetch refs/changes/01/1101/1


DRAM is a major factor in total cost of ownership, and improving
memory overcommit brings a high return on investment. Over the past
decade of research and experimentation in memory overcommit, we
observed a distinct trend across millions of servers and clients: the
size of page cache has been decreasing because of the growing
popularity of cloud storage. Nowadays anon pages account for more than
90% of our memory consumption and page cache contains mostly
executable pages.

Notion of the active/inactive
For servers equipped with hundreds of gigabytes of memory, the
granularity of the active/inactive is too coarse to be useful for job
scheduling. And false active/inactive rates are relatively high. In
addition, scans of largely varying numbers of pages are unpredictable
because inactive_is_low() is based on magic numbers.

For phones and laptops, the eviction is biased toward file pages
because the selection has to resort to heuristics as direct
comparisons between anon and file types are infeasible. On Android and
Chrome OS, executable pages are frequently evicted despite the fact
that there are many less recently used anon pages. This causes "janks"
(slow UI rendering) and negatively impacts user experience.

For systems with multiple nodes and/or memcgs, it is impossible to
compare lruvecs based on the notion of the active/inactive.

Incremental scans via the rmap
Each incremental scan picks up where the last scan left off and
stops after it has found a handful of unreferenced pages. For most of
the systems running cloud workloads, incremental scans lose the
advantage under sustained memory pressure due to high ratios of the
number of scanned pages to the number of reclaimed pages. In our case,
the average ratio of pgscan to pgsteal is about 7.

On top of that, the rmap has poor memory locality due to its complex
data structures. The combined effects typically result in a high
amount of CPU usage in the reclaim path. For example, with zram, a
typical kswapd profile on v5.11 looks like:
  31.03%  page_vma_mapped_walk
  25.59%  lzo1x_1_do_compress
   4.63%  do_raw_spin_lock
   3.89%  vma_interval_tree_iter_next
   3.33%  vma_interval_tree_subtree_search

And with real swap, it looks like:
  45.16%  page_vma_mapped_walk
   7.61%  do_raw_spin_lock
   5.69%  vma_interval_tree_iter_next
   4.91%  vma_interval_tree_subtree_search
   3.71%  page_referenced_one

Notion of generation numbers
The notion of generation numbers introduces a quantitative approach to
memory overcommit. A larger number of pages can be spread out across
configurable generations, and thus they have relatively low false
active/inactive rates. Each generation includes all pages that have
been referenced since the last generation.

Given an lruvec, scans and the selections between anon and file types
are all based on generation numbers, which are simple and yet
effective. For different lruvecs, comparisons are still possible based
on birth times of generations.

Differential scans via page tables
Each differential scan discovers all pages that have been referenced
since the last scan. Specifically, it walks the mm_struct list
associated with an lruvec to scan page tables of processes that have
been scheduled since the last scan. The cost of each differential scan
is roughly proportional to the number of referenced pages it
discovers. Unless address spaces are extremely sparse, page tables
usually have better memory locality than the rmap. The end result is
generally a significant reduction in CPU usage, for most of the
systems running cloud workloads.
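
To make the idea concrete, here is a minimal userspace sketch of a
differential scan. It is not code from this series: every toy_* name is
hypothetical, the flat PTE array stands in for a real multi-level page
table, and clearing the accessed bit after reading it is an assumption
made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are kernel types. */
struct toy_pte {
	bool present;
	bool accessed;	/* set by "hardware" when the page is touched */
};

struct toy_mm {
	bool used_since_last_scan;	/* set when the owning process runs */
	struct toy_pte *ptes;
	size_t nr_ptes;
};

/* Scan one address space: count referenced pages and clear their bits. */
static size_t toy_scan_mm(struct toy_mm *mm)
{
	size_t i, nr_referenced = 0;

	assert(mm != NULL);

	for (i = 0; i < mm->nr_ptes; i++) {
		struct toy_pte *pte = &mm->ptes[i];

		if (pte->present && pte->accessed) {
			pte->accessed = false;	/* assumption: test and clear */
			nr_referenced++;
		}
	}
	return nr_referenced;
}

/* Differential scan: skip address spaces that have not run since last time. */
static size_t toy_differential_scan(struct toy_mm *mms, size_t nr_mms)
{
	size_t i, total = 0;

	for (i = 0; i < nr_mms; i++) {
		if (!mms[i].used_since_last_scan)
			continue;
		total += toy_scan_mm(&mms[i]);
		mms[i].used_since_last_scan = false;
	}
	return total;
}
```

Address spaces whose owners have not been scheduled are never walked,
which is what keeps the work tied to recent activity rather than to the
total amount of memory.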

On Chrome OS, our real-world benchmark that browses popular websites
in multiple tabs demonstrates 51% less CPU usage from kswapd and 52%
(full) less PSI on v5.11. And the kswapd profile looks like:
  49.36%  lzo1x_1_do_compress
   4.54%  page_vma_mapped_walk
   4.45%  memset_erms
   3.47%  walk_pte_range
   2.88%  zram_bvec_rw

In addition, direct reclaim latency is reduced by 22% at 99th
percentile and the number of refaults is reduced by 7%. These metrics
are important to phones and laptops as they are correlated to user
experience.

Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lruvec->evictable.max_seq
for both anon and file types as they are aged on an equal footing. The
oldest generation numbers are stored in lruvec->evictable.min_seq[2]
separately for anon and file types as clean file pages can be evicted
regardless of may_swap or may_writepage. Generation numbers are
truncated into ilog2(MAX_NR_GENS)+1 bits in order to fit into
page->flags. The sliding window technique is used to prevent truncated
generation numbers from overlapping. Each truncated generation number
is an index to
Evictable pages are added to the per-zone lists indexed by max_seq or
min_seq[2] (modulo MAX_NR_GENS), depending on whether they are being
faulted in or read ahead. The workflow comprises two conceptually
independent functions: the aging and the eviction.
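
As a rough illustration of the modulo indexing and the sliding window
described above, consider the following sketch. The helper names and the
value chosen for TOY_MAX_NR_GENS are hypothetical and only serve the
example; they are not taken from the patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical value, picked only for this illustration. */
#define TOY_MAX_NR_GENS 4UL

/*
 * Sliding window: while fewer than TOY_MAX_NR_GENS generations are live,
 * no two of them can share a truncated generation number.
 */
static bool toy_window_is_valid(unsigned long min_seq, unsigned long max_seq)
{
	return min_seq <= max_seq && max_seq - min_seq < TOY_MAX_NR_GENS;
}

/* Map an ever-increasing generation number to a list index. */
static unsigned long toy_gen_from_seq(unsigned long seq)
{
	return seq % TOY_MAX_NR_GENS;
}

/* Index of the list a page would join, for a given window. */
static unsigned long toy_list_index(unsigned long min_seq,
				    unsigned long max_seq, bool youngest)
{
	assert(toy_window_is_valid(min_seq, max_seq));
	return toy_gen_from_seq(youngest ? max_seq : min_seq);
}
```

With four slots, the window [3, 6] maps to indices 3, 0, 1 and 2, all
distinct. A window of [3, 7] would make generations 3 and 7 collide on
index 3, which is the overlap the sliding window rules out.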

The aging produces young generations. Given an lruvec, the aging scans
page tables for referenced pages of this lruvec. Upon finding one, the
aging updates its generation number to max_seq. After each round of
scan, the aging increments max_seq. The aging maintains either a
system-wide mm_struct list or per-memcg mm_struct lists and tracks
whether an mm_struct is being used on any CPUs or has been used since
the last scan. Multiple threads can concurrently work on the same
mm_struct list, and each of them will be given a different mm_struct
belonging to a process that has been scheduled since the last scan.
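
One round of the aging can be sketched as follows. This is a simplified
userspace model rather than the implementation: the toy_* types are
hypothetical, the referenced flag stands in for what the page table scan
would find, and locking and the mm_struct list are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are kernel types. */
struct toy_page {
	unsigned long seq;	/* generation the page currently belongs to */
	bool referenced;	/* what the page table scan would report */
};

struct toy_lruvec {
	unsigned long max_seq;		/* youngest generation, shared */
	unsigned long min_seq[2];	/* oldest generations: anon, file */
};

/*
 * One round of aging: move every referenced page to the youngest
 * generation, then open a new youngest generation.
 * Returns the number of pages that were moved.
 */
static size_t toy_age(struct toy_lruvec *lruvec, struct toy_page *pages,
		      size_t nr_pages)
{
	size_t i, nr_moved = 0;

	assert(lruvec != NULL);

	for (i = 0; i < nr_pages; i++) {
		if (!pages[i].referenced)
			continue;
		pages[i].referenced = false;
		pages[i].seq = lruvec->max_seq;
		nr_moved++;
	}

	lruvec->max_seq++;
	return nr_moved;
}
```

Pages that were not referenced keep their old generation number, so
after enough rounds they end up in the oldest generations, where the
eviction looks for them.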

The eviction consumes old generations. Given an lruvec, the eviction
scans the pages on the per-zone lists indexed by either of min_seq[2].
It selects a type according to the values of min_seq[2] and
swappiness. During a scan, the eviction either sorts or isolates a
page, depending on whether the aging has updated its generation
number. When it finds all the per-zone lists are empty, the eviction
increments min_seq[2] indexed by this selected type. The eviction
triggers the aging when both of min_seq[2] reach max_seq-1, assuming
both anon and file types are reclaimable.
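
The bookkeeping on the eviction side might look like the sketch below.
The exact rule for choosing a type from min_seq[2] and swappiness is not
spelled out in this letter, so the one used here (file only when
swappiness is zero, otherwise the type with the older generation, ties
going to file) is an assumption, as are all toy_* names and the cap that
keeps min_seq from passing max_seq-1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { TOY_ANON, TOY_FILE, TOY_NR_TYPES };

/* Toy stand-in; not a kernel type. */
struct toy_evictable {
	unsigned long max_seq;
	unsigned long min_seq[TOY_NR_TYPES];
};

/* Assumed policy: prefer the type whose oldest generation is older. */
static int toy_select_type(const struct toy_evictable *e, int swappiness)
{
	if (swappiness == 0)
		return TOY_FILE;	/* anon is not reclaimable */
	if (e->min_seq[TOY_ANON] < e->min_seq[TOY_FILE])
		return TOY_ANON;
	return TOY_FILE;
}

/* The aging is needed once both types have reached max_seq-1. */
static bool toy_should_age(const struct toy_evictable *e)
{
	return e->min_seq[TOY_ANON] + 1 >= e->max_seq &&
	       e->min_seq[TOY_FILE] + 1 >= e->max_seq;
}

/*
 * Retire the oldest generation of a type once its lists are empty.
 * Returns true if min_seq was advanced.
 */
static bool toy_try_inc_min_seq(struct toy_evictable *e, int type,
				size_t nr_pages_left)
{
	assert(type == TOY_ANON || type == TOY_FILE);

	if (nr_pages_left || e->min_seq[type] + 1 >= e->max_seq)
		return false;

	e->min_seq[type]++;
	return true;
}
```

A caller would pick a type, scan the lists indexed by that type's
min_seq, advance min_seq once they are drained, and request a round of
aging when toy_should_age() becomes true.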

Use cases
On Android, our most advanced simulation that generates memory
pressure from realistic user behavior shows 18% fewer low-memory
kills, which in turn reduces cold starts by 16%.

On Borg, a similar approach enables us to identify jobs that
underutilize their memory and downsize them considerably without
compromising any of our service level indicators.

On Chrome OS, our field telemetry reports 96% fewer low-memory tab
discards and 59% fewer OOM kills from fully-utilized devices and no UX
regressions from underutilized devices.

For other use cases, including working set estimation, proactive
reclaim, far memory tiering and NUMA-aware job scheduling, please refer
to the documentation included in this series and the following
references.

1. Long-term SLOs for reclaimed cloud computing resources
2. Profiling a warehouse-scale computer
3. Evaluation of NUMA-Aware Scheduling in Warehouse-Scale Clusters
4. Software-defined far memory in warehouse-scale computers
5. Borg: the Next Generation

Yu Zhao (14):
  include/linux/memcontrol.h: do not warn in page_memcg_rcu() if
    !CONFIG_MEMCG
  include/linux/nodemask.h: define next_memory_node() if !CONFIG_NUMA
  include/linux/huge_mm.h: define is_huge_zero_pmd() if
    !CONFIG_TRANSPARENT_HUGEPAGE
  include/linux/cgroup.h: export cgroup_mutex
  mm/swap.c: export activate_page()
  mm, x86: support the access bit on non-leaf PMD entries
  mm/pagewalk.c: add pud_entry_post() for post-order traversals
  mm/vmscan.c: refactor shrink_node()
  mm: multigenerational lru: mm_struct list
  mm: multigenerational lru: core
  mm: multigenerational lru: page activation
  mm: multigenerational lru: user space interface
  mm: multigenerational lru: Kconfig
  mm: multigenerational lru: documentation

 Documentation/vm/index.rst        |    1 +
 Documentation/vm/multigen_lru.rst |  210 +++
 arch/Kconfig                      |    8 +
 arch/x86/Kconfig                  |    1 +
 arch/x86/include/asm/pgtable.h    |    2 +-
 arch/x86/mm/pgtable.c             |    5 +-
 fs/exec.c                         |    2 +
 fs/proc/task_mmu.c                |    3 +-
 include/linux/cgroup.h            |   15 +-
 include/linux/huge_mm.h           |    5 +
 include/linux/memcontrol.h        |    5 +-
 include/linux/mm.h                |    1 +
 include/linux/mm_inline.h         |  246 ++++
 include/linux/mm_types.h          |  135 ++
 include/linux/mmzone.h            |   62 +-
 include/linux/nodemask.h          |    1 +
 include/linux/page-flags-layout.h |   20 +-
 include/linux/pagewalk.h          |    4 +
 include/linux/pgtable.h           |    4 +-
 include/linux/swap.h              |    5 +-
 kernel/events/uprobes.c           |    2 +-
 kernel/exit.c                     |    1 +
 kernel/fork.c                     |   10 +
 kernel/kthread.c                  |    1 +
 kernel/sched/core.c               |    2 +
 mm/Kconfig                        |   29 +
 mm/huge_memory.c                  |    5 +-
 mm/khugepaged.c                   |    2 +-
 mm/memcontrol.c                   |   28 +
 mm/memory.c                       |   14 +-
 mm/migrate.c                      |    2 +-
 mm/mm_init.c                      |   13 +-
 mm/mmzone.c                       |    2 +
 mm/pagewalk.c                     |    5 +
 mm/rmap.c                         |    6 +
 mm/swap.c                         |   58 +-
 mm/swapfile.c                     |    6 +-
 mm/userfaultfd.c                  |    2 +-
 mm/vmscan.c                       | 2091 +++++++++++++++++++++++++++--
 39 files changed, 2870 insertions(+), 144 deletions(-)
 create mode 100644 Documentation/vm/multigen_lru.rst

