* [PATCH v9 00/14] Multi-Gen LRU Framework
From: Yu Zhao @ 2022-03-09  2:12 UTC
  To: Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
	Jonathan Corbet, Matthew Wilcox, Mel Gorman, Michael Larabel,
	Michal Hocko, Mike Rapoport, Rik van Riel, Vlastimil Babka,
	Will Deacon, Ying Huang, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, page-reclaim, x86, Yu Zhao

What's new
==========
Removed CONFIG_NR_LRU_GENS and CONFIG_TIERS_PER_GEN.

TLDR
====
The current page reclaim is too expensive in terms of CPU usage and it
often makes poor choices about what to evict. This patchset offers an
alternative solution that is performant, versatile and
straightforward.

Patchset overview
=================
The design and implementation overview is in patch 14:
https://lore.kernel.org/lkml/20220309021230.721028-15-yuzhao@google.com/

01. mm: x86, arm64: add arch_has_hw_pte_young()
02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
Take advantage of hardware features when trying to clear the accessed
bit in many PTEs.

03. mm/vmscan.c: refactor shrink_node()
04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
    its sole caller"
Minor refactors to improve readability for the following patches.

05. mm: multi-gen LRU: groundwork
Adds the basic data structure and the functions that insert pages into
and remove pages from the multi-gen LRU (MGLRU) lists.

06. mm: multi-gen LRU: minimal implementation
A minimal implementation without any optimizations.

07. mm: multi-gen LRU: exploit locality in rmap
Exploits spatial locality to improve efficiency when using the rmap.

08. mm: multi-gen LRU: support page table walks
Further exploits spatial locality by optionally scanning page tables.

09. mm: multi-gen LRU: optimize multiple memcgs
Optimizes the overall performance for multiple memcgs running mixed
types of workloads.

10. mm: multi-gen LRU: kill switch
Adds a kill switch to enable or disable MGLRU at runtime.

11. mm: multi-gen LRU: thrashing prevention
12. mm: multi-gen LRU: debugfs interface
Provide userspace with features like thrashing prevention, working set
estimation and proactive reclaim.

13. mm: multi-gen LRU: admin guide
14. mm: multi-gen LRU: design doc
Add an admin guide and a design doc.
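
To make the groundwork in patch 05 more concrete, here is a toy
userspace model of generation lists with insert, remove, aging and
eviction. It is only a sketch under stated assumptions: every name
(toy_lru, toy_page, TOY_NR_GENS and the toy_lru_*() helpers) is
hypothetical, the cap of four generations is an illustrative choice,
and the real structures in the patches are considerably richer
(per-type and per-zone lists, tiers, locking).

```c
#include <assert.h>
#include <stddef.h>

/* Toy userspace model; all names are illustrative, not the kernel's. */
#define TOY_NR_GENS 4

struct toy_page {
	struct toy_page *prev, *next;
	unsigned long seq;	/* generation the page was inserted into */
};

struct toy_lru {
	unsigned long min_seq;	/* oldest generation */
	unsigned long max_seq;	/* youngest generation */
	struct toy_page heads[TOY_NR_GENS];	/* circular list heads */
	size_t nr_pages[TOY_NR_GENS];
};

/* A generation number maps to a list slot by wrapping around. */
static size_t toy_gen(unsigned long seq)
{
	return seq % TOY_NR_GENS;
}

static void toy_lru_init(struct toy_lru *lru)
{
	lru->min_seq = 0;
	lru->max_seq = 0;
	for (size_t i = 0; i < TOY_NR_GENS; i++) {
		lru->heads[i].prev = lru->heads[i].next = &lru->heads[i];
		lru->nr_pages[i] = 0;
	}
}

/* Insert a page at the head of the youngest generation. */
static void toy_lru_add(struct toy_lru *lru, struct toy_page *page)
{
	size_t gen = toy_gen(lru->max_seq);
	struct toy_page *head = &lru->heads[gen];

	page->seq = lru->max_seq;
	page->next = head->next;
	page->prev = head;
	head->next->prev = page;
	head->next = page;
	lru->nr_pages[gen]++;
}

/* Unlink a page from whichever generation it is on. */
static void toy_lru_del(struct toy_lru *lru, struct toy_page *page)
{
	size_t gen = toy_gen(page->seq);

	assert(lru->nr_pages[gen] > 0);
	page->prev->next = page->next;
	page->next->prev = page->prev;
	page->prev = page->next = NULL;
	lru->nr_pages[gen]--;
}

/* Aging: open a new youngest generation if a slot is free. */
static int toy_lru_age(struct toy_lru *lru)
{
	if (lru->max_seq - lru->min_seq + 1 >= TOY_NR_GENS)
		return 0;
	lru->max_seq++;
	return 1;
}

/* Eviction: take the tail of the oldest non-empty generation. */
static struct toy_page *toy_lru_evict(struct toy_lru *lru)
{
	struct toy_page *page;

	while (lru->nr_pages[toy_gen(lru->min_seq)] == 0) {
		if (lru->min_seq == lru->max_seq)
			return NULL;
		lru->min_seq++;
	}
	page = lru->heads[toy_gen(lru->min_seq)].prev;
	toy_lru_del(lru, page);
	return page;
}
```

In this model a slot is reused only after min_seq has moved past it,
which can happen only once that generation is empty, so a page's
stored seq always identifies the list it is on.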

Benchmark results
=================
Independent lab results
-----------------------
Based on the popularity of searches [01] and the memory usage in
Google's public cloud, the most popular open-source memory-hungry
applications, in alphabetical order, are:
      Apache Cassandra      Memcached
      Apache Hadoop         MongoDB
      Apache Spark          PostgreSQL
      MariaDB (MySQL)       Redis

An independent lab evaluated MGLRU with the most widely used benchmark
suites for the above applications. They posted 960 data points along
with kernel metrics and perf profiles collected over more than 500
hours of total benchmark time. Their final reports show that, with 95%
confidence intervals (CIs), the above applications all performed
significantly better for at least part of their benchmark matrices.

On 5.14:
1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
   less wall time to sort three billion random integers, respectively,
   under the medium- and the high-concurrency conditions, when
   overcommitting memory. There were no statistically significant
   changes in wall time for the rest of the benchmark matrix.
2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
   more transactions per minute (TPM), respectively, under the medium-
   and the high-concurrency conditions, when overcommitting memory.
   There were no statistically significant changes in TPM for the rest
   of the benchmark matrix.
3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
   and [21.59, 30.02]% more operations per second (OPS), respectively,
   for sequential access, random access and Gaussian (distribution)
   access, when THP=always; 95% CIs [13.85, 15.97]% and
   [23.94, 29.92]% more OPS, respectively, for random access and
   Gaussian access, when THP=never. There were no statistically
   significant changes in OPS for the rest of the benchmark matrix.
4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
   [2.16, 3.55]% more operations per second (OPS), respectively, for
   exponential (distribution) access, random access and Zipfian
   (distribution) access, when underutilizing memory; 95% CIs
   [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
   respectively, for exponential access, random access and Zipfian
   access, when overcommitting memory.

On 5.15:
5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
   and [4.11, 7.50]% more operations per second (OPS), respectively,
   for exponential (distribution) access, random access and Zipfian
   (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
   [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
   exponential access, random access and Zipfian access, when swap was
   on.
6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
   less average wall time to finish twelve parallel TeraSort jobs,
   respectively, under the medium- and the high-concurrency
   conditions, when swap was on. There were no statistically
   significant changes in average wall time for the rest of the
   benchmark matrix.
7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
   minute (TPM) under the high-concurrency condition, when swap was
   off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
   respectively, under the medium- and the high-concurrency
   conditions, when swap was on. There were no statistically
   significant changes in TPM for the rest of the benchmark matrix.
8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
   [11.47, 19.36]% more total operations per second (OPS),
   respectively, for sequential access, random access and Gaussian
   (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
   [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
   for sequential access, random access and Gaussian access, when
   THP=never.
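
As a reading aid for the intervals above, the following sketch shows
the arithmetic one might apply to a reported 95% CI: whether it
excludes zero, and its midpoint and half-width. The struct and helper
names are hypothetical, and treating the midpoint as the point
estimate assumes a symmetric interval, which the reports may or may
not guarantee.

```c
#include <assert.h>

/* A reported interval, in percent improvement (hypothetical type). */
struct ci95 {
	double lo, hi;
};

/* A change is significant at this level if the interval excludes zero. */
static int ci_excludes_zero(struct ci95 ci)
{
	return ci.lo > 0.0 || ci.hi < 0.0;
}

/* Centre of the interval; the point estimate only if it is symmetric. */
static double ci_midpoint(struct ci95 ci)
{
	return (ci.lo + ci.hi) / 2.0;
}

/* Half the interval length, i.e. the "plus or minus" margin. */
static double ci_half_width(struct ci95 ci)
{
	assert(ci.hi >= ci.lo);
	return (ci.hi - ci.lo) / 2.0;
}
```

For example, the Apache Spark medium-concurrency interval
[9.28, 11.19]% excludes zero and, read this way, is about 10.2% with a
margin of roughly 0.96 percentage points.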

Our lab results
---------------
To supplement the above results, we ran the following benchmark suites
on 5.16-rc7 and found no regressions [10]. (These synthetic benchmarks
are popular among MM developers, but we prefer large-scale A/B
experiments to validate improvements.)
      fs_fio_bench_hdd_mq      pft
      fs_lmbench               pgsql-hammerdb
      fs_parallelio            redis
      fs_postmark              stream
      hackbench                sysbenchthread
      kernbench                tpcc_spark
      memcached                unixbench
      multichase               vm-scalability
      mutilate                 will-it-scale
      nginx

[01] https://trends.google.com
[02] https://lore.kernel.org/lkml/20211102002002.92051-1-bot@edi.works/
[03] https://lore.kernel.org/lkml/20211009054315.47073-1-bot@edi.works/
[04] https://lore.kernel.org/lkml/20211021194103.65648-1-bot@edi.works/
[05] https://lore.kernel.org/lkml/20211109021346.50266-1-bot@edi.works/
[06] https://lore.kernel.org/lkml/20211202062806.80365-1-bot@edi.works/
[07] https://lore.kernel.org/lkml/20211209072416.33606-1-bot@edi.works/
[08] https://lore.kernel.org/lkml/20211218071041.24077-1-bot@edi.works/
[09] https://lore.kernel.org/lkml/20211122053248.57311-1-bot@edi.works/
[10] https://lore.kernel.org/lkml/20220104202247.2903702-1-yuzhao@google.com/

Real-world applications
=======================
Third-party testimonials
------------------------
Konstantin reported [11]:
   I have Archlinux with 8G RAM + zswap + swap. While developing, I
   have lots of apps opened such as multiple LSP-servers for different
   langs, chats, two browsers, etc... Usually, my system gets quickly
   to a point of SWAP-storms, where I have to kill LSP-servers,
   restart browsers to free memory, etc, otherwise the system lags
   heavily and is barely usable.
   
   1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
   patchset, and I started up by opening lots of apps to create memory
   pressure, and worked for a day like this. Till now I had not a
   single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
   getting to the point of 3G in SWAP before without a single
   SWAP-storm.

Vaibhav from IBM reported [12]:
   In a synthetic MongoDB Benchmark, seeing an average of ~19%
   throughput improvement on POWER10(Radix MMU + 64K Page Size) with
   MGLRU patches on top of v5.16 kernel for MongoDB + YCSB across
   three different request distributions, namely, Exponential, Uniform
   and Zipfian.

Shuang from U of Rochester reported [13]:
   With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
   and [9.26, 10.36]% higher throughput, respectively, for random
   access, Zipfian (distribution) access and Gaussian (distribution)
   access, when the average number of jobs per CPU is 1; 95% CIs
   [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
   throughput, respectively, for random access, Zipfian access and
   Gaussian access, when the average number of jobs per CPU is 2.

Daniel from Michigan Tech reported [14]:
   With Memcached allocating ~100GB of byte-addressable Optane,
   performance improvement in terms of throughput (measured as queries
   per second) was about 10% for a series of workloads.

Large-scale deployments
-----------------------
The downstream kernels that have been using MGLRU include:
1. Android ARCVM [15]
2. Arch Linux Zen [16]
3. Chrome OS [17]
4. Liquorix [18]
5. post-factum [19]
6. XanMod [20]

We've rolled out MGLRU to tens of millions of Chrome OS users and
about a million Android users. Google's fleetwide profiling [21] shows
an overall 40% decrease in kswapd CPU usage, in addition to
improvements in other UX metrics, e.g., an 85% decrease in the number
of low-memory kills at the 75th percentile and an 18% decrease in
rendering latency at the 50th percentile.

[11] https://lore.kernel.org/lkml/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
[12] https://lore.kernel.org/lkml/87czj3mux0.fsf@vajain21.in.ibm.com/
[13] https://lore.kernel.org/lkml/20220105024423.26409-1-szhai2@cs.rochester.edu/
[14] https://lore.kernel.org/linux-mm/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
[15] https://chromium.googlesource.com/chromiumos/third_party/kernel
[16] https://archlinux.org
[17] https://chromium.org
[18] https://liquorix.net
[19] https://gitlab.com/post-factum/pf-kernel
[20] https://xanmod.org
[21] https://research.google/pubs/pub44271/

Summary
=======
The facts are:
1. The independent lab results and the real-world applications
   indicate substantial improvements; there are no known regressions.
2. Thrashing prevention, working set estimation and proactive reclaim
   work out of the box; there are no equivalent solutions.
3. There is a lot of new code; no one has demonstrated smaller changes
   with similar effects.

Our opinions, accordingly, are:
1. Given the amount of evidence, the reported improvements will likely
   materialize for a wide range of workloads.
2. Gauging the interest from the past discussions [22][23][24], the
   new features will likely be put to use for both personal computers
   and data centers.
3. Based on Google's track record, the new code will likely be well
   maintained in the long term. It'd be more difficult if not
   impossible to achieve similar effects on top of the existing
   design.

[22] https://lore.kernel.org/lkml/20201005081313.732745-1-andrea.righi@canonical.com/
[23] https://lore.kernel.org/lkml/20210716081449.22187-1-sj38.park@gmail.com/
[24] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/

Yu Zhao (14):
  mm: x86, arm64: add arch_has_hw_pte_young()
  mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
  mm/vmscan.c: refactor shrink_node()
  Revert "include/linux/mm_inline.h: fold __update_lru_size() into its
    sole caller"
  mm: multi-gen LRU: groundwork
  mm: multi-gen LRU: minimal implementation
  mm: multi-gen LRU: exploit locality in rmap
  mm: multi-gen LRU: support page table walks
  mm: multi-gen LRU: optimize multiple memcgs
  mm: multi-gen LRU: kill switch
  mm: multi-gen LRU: thrashing prevention
  mm: multi-gen LRU: debugfs interface
  mm: multi-gen LRU: admin guide
  mm: multi-gen LRU: design doc

 Documentation/admin-guide/mm/index.rst        |    1 +
 Documentation/admin-guide/mm/multigen_lru.rst |  146 +
 Documentation/vm/index.rst                    |    1 +
 Documentation/vm/multigen_lru.rst             |  156 +
 arch/Kconfig                                  |    9 +
 arch/arm64/include/asm/pgtable.h              |   14 +-
 arch/x86/Kconfig                              |    1 +
 arch/x86/include/asm/pgtable.h                |    9 +-
 arch/x86/mm/pgtable.c                         |    5 +-
 fs/exec.c                                     |    2 +
 fs/fuse/dev.c                                 |    3 +-
 include/linux/cgroup.h                        |   15 +-
 include/linux/memcontrol.h                    |   36 +
 include/linux/mm.h                            |    8 +
 include/linux/mm_inline.h                     |  217 +-
 include/linux/mm_types.h                      |   78 +
 include/linux/mmzone.h                        |  211 ++
 include/linux/nodemask.h                      |    1 +
 include/linux/page-flags-layout.h             |   11 +-
 include/linux/page-flags.h                    |    4 +-
 include/linux/pgtable.h                       |   17 +-
 include/linux/sched.h                         |    4 +
 include/linux/swap.h                          |    5 +
 kernel/bounds.c                               |    7 +
 kernel/cgroup/cgroup-internal.h               |    1 -
 kernel/exit.c                                 |    1 +
 kernel/fork.c                                 |    9 +
 kernel/sched/core.c                           |    1 +
 mm/Kconfig                                    |   26 +
 mm/huge_memory.c                              |    3 +-
 mm/memcontrol.c                               |   27 +
 mm/memory.c                                   |   39 +-
 mm/mm_init.c                                  |    6 +-
 mm/mmzone.c                                   |    2 +
 mm/rmap.c                                     |    7 +
 mm/swap.c                                     |   55 +-
 mm/vmscan.c                                   | 2824 ++++++++++++++++-
 mm/workingset.c                               |  119 +-
 38 files changed, 3934 insertions(+), 147 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
 create mode 100644 Documentation/vm/multigen_lru.rst

-- 
2.35.1.616.g0bdcbb4464-goog


* [PATCH v9 01/14] mm: x86, arm64: add arch_has_hw_pte_young()
From: Yu Zhao @ 2022-03-09  2:12 UTC
  To: Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
	Jonathan Corbet, Matthew Wilcox, Mel Gorman, Michael Larabel,
	Michal Hocko, Mike Rapoport, Rik van Riel, Vlastimil Babka,
	Will Deacon, Ying Huang, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, page-reclaim, x86, Yu Zhao, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

Some architectures automatically set the accessed bit in PTEs, e.g.,
x86 and arm64 v8.2. On architectures that do not have this capability,
clearing the accessed bit in a PTE usually triggers a page fault
following the TLB miss of this PTE (to emulate the accessed bit).

Being aware of this capability can help make better decisions, e.g.,
whether to spread the work out over a period of time to reduce bursty
page faults when trying to clear the accessed bit in many PTEs.

Note that theoretically this capability can be unreliable, e.g.,
hotplugged CPUs might be different from builtin ones. Therefore it
should not be used in architecture-independent code that involves
correctness, e.g., to determine whether TLB flushes are required (in
combination with the accessed bit).
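
The kind of decision described above could be sketched in userspace as
follows. This is not code from the patch: toy_clear_batch(),
toy_nr_steps() and the soft_batch parameter are hypothetical, and
has_hw_young merely stands in for the result of
arch_has_hw_pte_young(). It is a performance hint only, in keeping
with the note that the capability must not be relied on for
correctness.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical policy helper (not kernel code): how many PTEs to
 * clear per step when clearing the accessed bit in nr_ptes PTEs.
 */
static size_t toy_clear_batch(size_t nr_ptes, bool has_hw_young,
			      size_t soft_batch)
{
	assert(soft_batch > 0);

	/* Hardware sets the bit again without a fault: do it all at once. */
	if (has_hw_young)
		return nr_ptes;

	/* Each cleared PTE may cost a later fault: cap the work per step. */
	return nr_ptes < soft_batch ? nr_ptes : soft_batch;
}

/* Number of steps the work is spread over under the policy above. */
static size_t toy_nr_steps(size_t nr_ptes, bool has_hw_young,
			   size_t soft_batch)
{
	size_t batch = toy_clear_batch(nr_ptes, has_hw_young, soft_batch);

	return batch ? (nr_ptes + batch - 1) / batch : 0;
}
```

With 1000 PTEs and a per-step cap of 64, this model clears everything
in one step when the hardware sets the accessed bit and spreads the
work over 16 steps when it does not.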

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 arch/arm64/include/asm/pgtable.h | 14 ++------------
 arch/x86/include/asm/pgtable.h   |  6 +++---
 include/linux/pgtable.h          | 13 +++++++++++++
 mm/memory.c                      | 14 +-------------
 4 files changed, 19 insertions(+), 28 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index c4ba047a82d2..990358eca359 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -999,23 +999,13 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
  * page after fork() + CoW for pfn mappings. We don't always have a
  * hardware-managed access flag on arm64.
  */
-static inline bool arch_faults_on_old_pte(void)
-{
-	WARN_ON(preemptible());
-
-	return !cpu_has_hw_af();
-}
-#define arch_faults_on_old_pte		arch_faults_on_old_pte
+#define arch_has_hw_pte_young		cpu_has_hw_af
 
 /*
  * Experimentally, it's cheap to set the access flag in hardware and we
  * benefit from prefaulting mappings as 'old' to start with.
  */
-static inline bool arch_wants_old_prefaulted_pte(void)
-{
-	return !arch_faults_on_old_pte();
-}
-#define arch_wants_old_prefaulted_pte	arch_wants_old_prefaulted_pte
+#define arch_wants_old_prefaulted_pte	cpu_has_hw_af
 
 static inline pgprot_t arch_filter_pgprot(pgprot_t prot)
 {
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 8a9432fb3802..60b6ce45c2e3 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1423,10 +1423,10 @@ static inline bool arch_has_pfn_modify_check(void)
 	return boot_cpu_has_bug(X86_BUG_L1TF);
 }
 
-#define arch_faults_on_old_pte arch_faults_on_old_pte
-static inline bool arch_faults_on_old_pte(void)
+#define arch_has_hw_pte_young arch_has_hw_pte_young
+static inline bool arch_has_hw_pte_young(void)
 {
-	return false;
+	return true;
 }
 
 #endif	/* __ASSEMBLY__ */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f4f4077b97aa..79f64dcff07d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -259,6 +259,19 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
+#ifndef arch_has_hw_pte_young
+/*
+ * Return whether the accessed bit is supported on the local CPU.
+ *
+ * This stub assumes accessing through an old PTE triggers a page fault.
+ * Architectures that automatically set the access bit should overwrite it.
+ */
+static inline bool arch_has_hw_pte_young(void)
+{
+	return false;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_CLEAR
 static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
 			      pte_t *ptep)
diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..a7379196a47e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -122,18 +122,6 @@ int randomize_va_space __read_mostly =
 					2;
 #endif
 
-#ifndef arch_faults_on_old_pte
-static inline bool arch_faults_on_old_pte(void)
-{
-	/*
-	 * Those arches which don't have hw access flag feature need to
-	 * implement their own helper. By default, "true" means pagefault
-	 * will be hit on old pte.
-	 */
-	return true;
-}
-#endif
-
 #ifndef arch_wants_old_prefaulted_pte
 static inline bool arch_wants_old_prefaulted_pte(void)
 {
@@ -2778,7 +2766,7 @@ static inline bool cow_user_page(struct page *dst, struct page *src,
 	 * On architectures with software "accessed" bits, we would
 	 * take a double page fault, so mark it accessed here.
 	 */
-	if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
+	if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) {
 		pte_t entry;
 
 		vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
-- 
2.35.1.616.g0bdcbb4464-goog
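
The override pattern used by this patch can be modelled in userspace. The
following is a minimal sketch, not kernel code: the "architecture header"
part is a stand-in for arch/x86/include/asm/pgtable.h, and old_condition(),
new_condition() and conditions_agree() are hypothetical helpers that exist
only to show that the rewritten test in cow_user_page() is the old test with
the predicate inverted.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for an architecture header (x86-like): define the macro
 * with the same name as the function so generic code can detect it. */
#define arch_has_hw_pte_young arch_has_hw_pte_young
static inline bool arch_has_hw_pte_young(void)
{
	return true;	/* hardware sets the accessed bit on its own */
}

/* Generic fallback, compiled only when no architecture override exists. */
#ifndef arch_has_hw_pte_young
static inline bool arch_has_hw_pte_young(void)
{
	return false;	/* an access through an old PTE takes a fault */
}
#endif

/* Hypothetical helper: the condition as written before the patch. */
static inline bool old_condition(bool faults_on_old_pte, bool pte_is_young)
{
	return faults_on_old_pte && !pte_is_young;
}

/* Hypothetical helper: the condition as written after the patch. */
static inline bool new_condition(bool hw_pte_young, bool pte_is_young)
{
	return !hw_pte_young && !pte_is_young;
}

/* Compare both forms over all four input combinations. */
static bool conditions_agree(void)
{
	for (int hw = 0; hw <= 1; hw++)
		for (int young = 0; young <= 1; young++)
			if (old_condition(!hw, young) != new_condition(hw, young))
				return false;
	return true;
}

/* Convenience wrapper for callers that prefer a hard failure. */
static inline void self_check(void)
{
	assert(conditions_agree());
}
```

Because the macro is defined in the stand-in header, the #ifndef stub is
skipped and arch_has_hw_pte_young() returns true; removing the first
definition would make the generic stub take effect instead.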



* [PATCH v9 02/14] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
  2022-03-09  2:12 ` Yu Zhao
@ 2022-03-09  2:12   ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-09  2:12 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
	Jonathan Corbet, Matthew Wilcox, Mel Gorman, Michael Larabel,
	Michal Hocko, Mike Rapoport, Rik van Riel, Vlastimil Babka,
	Will Deacon, Ying Huang, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, page-reclaim, x86, Yu Zhao, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

Some architectures support the accessed bit in non-leaf PMD entries,
e.g., x86 sets the accessed bit in a non-leaf PMD entry when using it
as part of linear address translation [1]. Page table walkers that
clear the accessed bit may use this capability to reduce their search
space.

Note that:
1. Although an inline function is preferable, this capability is added
   as a configuration option for consistency with the existing macros.
2. Because there is little interest in other x86 varieties, this
   capability was only tested on Intel and AMD CPUs.

[1]: Intel 64 and IA-32 Architectures Software Developer's Manual
     Volume 3 (June 2021), section 4.8

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 arch/Kconfig                   | 9 +++++++++
 arch/x86/Kconfig               | 1 +
 arch/x86/include/asm/pgtable.h | 3 ++-
 arch/x86/mm/pgtable.c          | 5 ++++-
 include/linux/pgtable.h        | 4 ++--
 5 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 678a80713b21..f9c59ecadbbb 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1322,6 +1322,15 @@ config DYNAMIC_SIGFRAME
 config HAVE_ARCH_NODE_DEV_GROUP
 	bool
 
+config ARCH_HAS_NONLEAF_PMD_YOUNG
+	bool
+	depends on PGTABLE_LEVELS > 2
+	help
+	  Architectures that select this option are capable of setting the
+	  accessed bit in non-leaf PMD entries when using them as part of linear
+	  address translations. Page table walkers that clear the accessed bit
+	  may use this capability to reduce their search space.
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9f5bd41bf660..e787b7fc75be 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -85,6 +85,7 @@ config X86
 	select ARCH_HAS_PMEM_API		if X86_64
 	select ARCH_HAS_PTE_DEVMAP		if X86_64
 	select ARCH_HAS_PTE_SPECIAL
+	select ARCH_HAS_NONLEAF_PMD_YOUNG
 	select ARCH_HAS_UACCESS_FLUSHCACHE	if X86_64
 	select ARCH_HAS_COPY_MC			if X86_64
 	select ARCH_HAS_SET_MEMORY
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 60b6ce45c2e3..f973788f6b21 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -819,7 +819,8 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
 
 static inline int pmd_bad(pmd_t pmd)
 {
-	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
+	return (pmd_flags(pmd) & ~(_PAGE_USER | _PAGE_ACCESSED)) !=
+	       (_KERNPG_TABLE & ~_PAGE_ACCESSED);
 }
 
 static inline unsigned long pages_to_mb(unsigned long npg)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 3481b35cb4ec..a224193d84bf 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -550,7 +550,7 @@ int ptep_test_and_clear_young(struct vm_area_struct *vma,
 	return ret;
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
 int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 			      unsigned long addr, pmd_t *pmdp)
 {
@@ -562,6 +562,9 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 
 	return ret;
 }
+#endif
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int pudp_test_and_clear_young(struct vm_area_struct *vma,
 			      unsigned long addr, pud_t *pudp)
 {
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 79f64dcff07d..743e7fc4afda 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -212,7 +212,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 #endif
 
 #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
 static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 					    unsigned long address,
 					    pmd_t *pmdp)
@@ -233,7 +233,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 	BUILD_BUG();
 	return 0;
 }
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
-- 
2.35.1.616.g0bdcbb4464-goog
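
The pmd_bad() change matters once a page table walker clears the accessed
bit in a non-leaf PMD entry. The sketch below models both versions of the
check in userspace. The flag values are assumptions that mirror the usual
x86 bit layout without a memory-encryption bit, and pmd_bad_old() and
pmd_bad_new() are hypothetical names for the check before and after this
patch; they take raw flag bits rather than a pmd_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86 page flag values, for illustration only. */
#define PAGE_PRESENT	0x001ULL
#define PAGE_RW		0x002ULL
#define PAGE_USER	0x004ULL
#define PAGE_ACCESSED	0x020ULL
#define PAGE_DIRTY	0x040ULL
#define KERNPG_TABLE	(PAGE_PRESENT | PAGE_RW | PAGE_ACCESSED | PAGE_DIRTY)

/* Before the patch: only the user bit is ignored, so a table entry
 * whose accessed bit has been cleared is reported as bad. */
static bool pmd_bad_old(uint64_t flags)
{
	return (flags & ~PAGE_USER) != KERNPG_TABLE;
}

/* After the patch: the accessed bit is ignored on both sides of the
 * comparison, so young and old table entries are treated alike. */
static bool pmd_bad_new(uint64_t flags)
{
	return (flags & ~(PAGE_USER | PAGE_ACCESSED)) !=
	       (KERNPG_TABLE & ~PAGE_ACCESSED);
}
```

With these values, a user page table entry with the accessed bit cleared
(0x47) is rejected by pmd_bad_old() but accepted by pmd_bad_new(), while an
entry that lacks the other expected bits is still rejected by both.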



* [PATCH v9 03/14] mm/vmscan.c: refactor shrink_node()
  2022-03-09  2:12 ` Yu Zhao
@ 2022-03-09  2:12   ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-09  2:12 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
	Jonathan Corbet, Matthew Wilcox, Mel Gorman, Michael Larabel,
	Michal Hocko, Mike Rapoport, Rik van Riel, Vlastimil Babka,
	Will Deacon, Ying Huang, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, page-reclaim, x86, Yu Zhao, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

This patch refactors shrink_node() to improve readability for the
upcoming changes to mm/vmscan.c.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 mm/vmscan.c | 198 +++++++++++++++++++++++++++-------------------------
 1 file changed, 104 insertions(+), 94 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 59b14e0d696c..8e744cdf802f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2718,6 +2718,109 @@ enum scan_balance {
 	SCAN_FILE,
 };
 
+static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
+{
+	unsigned long file;
+	struct lruvec *target_lruvec;
+
+	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+
+	/*
+	 * Flush the memory cgroup stats, so that we read accurate per-memcg
+	 * lruvec stats for heuristics.
+	 */
+	mem_cgroup_flush_stats();
+
+	/*
+	 * Determine the scan balance between anon and file LRUs.
+	 */
+	spin_lock_irq(&target_lruvec->lru_lock);
+	sc->anon_cost = target_lruvec->anon_cost;
+	sc->file_cost = target_lruvec->file_cost;
+	spin_unlock_irq(&target_lruvec->lru_lock);
+
+	/*
+	 * Target desirable inactive:active list ratios for the anon
+	 * and file LRU lists.
+	 */
+	if (!sc->force_deactivate) {
+		unsigned long refaults;
+
+		refaults = lruvec_page_state(target_lruvec,
+				WORKINGSET_ACTIVATE_ANON);
+		if (refaults != target_lruvec->refaults[0] ||
+			inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
+			sc->may_deactivate |= DEACTIVATE_ANON;
+		else
+			sc->may_deactivate &= ~DEACTIVATE_ANON;
+
+		/*
+		 * When refaults are being observed, it means a new
+		 * workingset is being established. Deactivate to get
+		 * rid of any stale active pages quickly.
+		 */
+		refaults = lruvec_page_state(target_lruvec,
+				WORKINGSET_ACTIVATE_FILE);
+		if (refaults != target_lruvec->refaults[1] ||
+		    inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
+			sc->may_deactivate |= DEACTIVATE_FILE;
+		else
+			sc->may_deactivate &= ~DEACTIVATE_FILE;
+	} else
+		sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
+
+	/*
+	 * If we have plenty of inactive file pages that aren't
+	 * thrashing, try to reclaim those first before touching
+	 * anonymous pages.
+	 */
+	file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
+	if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
+		sc->cache_trim_mode = 1;
+	else
+		sc->cache_trim_mode = 0;
+
+	/*
+	 * Prevent the reclaimer from falling into the cache trap: as
+	 * cache pages start out inactive, every cache fault will tip
+	 * the scan balance towards the file LRU.  And as the file LRU
+	 * shrinks, so does the window for rotation from references.
+	 * This means we have a runaway feedback loop where a tiny
+	 * thrashing file LRU becomes infinitely more attractive than
+	 * anon pages.  Try to detect this based on file LRU size.
+	 */
+	if (!cgroup_reclaim(sc)) {
+		unsigned long total_high_wmark = 0;
+		unsigned long free, anon;
+		int z;
+
+		free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
+		file = node_page_state(pgdat, NR_ACTIVE_FILE) +
+			   node_page_state(pgdat, NR_INACTIVE_FILE);
+
+		for (z = 0; z < MAX_NR_ZONES; z++) {
+			struct zone *zone = &pgdat->node_zones[z];
+
+			if (!managed_zone(zone))
+				continue;
+
+			total_high_wmark += high_wmark_pages(zone);
+		}
+
+		/*
+		 * Consider anon: if that's low too, this isn't a
+		 * runaway file reclaim problem, but rather just
+		 * extreme pressure. Reclaim as per usual then.
+		 */
+		anon = node_page_state(pgdat, NR_INACTIVE_ANON);
+
+		sc->file_is_tiny =
+			file + free <= total_high_wmark &&
+			!(sc->may_deactivate & DEACTIVATE_ANON) &&
+			anon >> sc->priority;
+	}
+}
+
 /*
  * Determine how aggressively the anon and file LRU lists should be
  * scanned.  The relative value of each set of LRU lists is determined
@@ -3188,109 +3291,16 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	unsigned long nr_reclaimed, nr_scanned;
 	struct lruvec *target_lruvec;
 	bool reclaimable = false;
-	unsigned long file;
 
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
 
 again:
-	/*
-	 * Flush the memory cgroup stats, so that we read accurate per-memcg
-	 * lruvec stats for heuristics.
-	 */
-	mem_cgroup_flush_stats();
-
 	memset(&sc->nr, 0, sizeof(sc->nr));
 
 	nr_reclaimed = sc->nr_reclaimed;
 	nr_scanned = sc->nr_scanned;
 
-	/*
-	 * Determine the scan balance between anon and file LRUs.
-	 */
-	spin_lock_irq(&target_lruvec->lru_lock);
-	sc->anon_cost = target_lruvec->anon_cost;
-	sc->file_cost = target_lruvec->file_cost;
-	spin_unlock_irq(&target_lruvec->lru_lock);
-
-	/*
-	 * Target desirable inactive:active list ratios for the anon
-	 * and file LRU lists.
-	 */
-	if (!sc->force_deactivate) {
-		unsigned long refaults;
-
-		refaults = lruvec_page_state(target_lruvec,
-				WORKINGSET_ACTIVATE_ANON);
-		if (refaults != target_lruvec->refaults[0] ||
-			inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
-			sc->may_deactivate |= DEACTIVATE_ANON;
-		else
-			sc->may_deactivate &= ~DEACTIVATE_ANON;
-
-		/*
-		 * When refaults are being observed, it means a new
-		 * workingset is being established. Deactivate to get
-		 * rid of any stale active pages quickly.
-		 */
-		refaults = lruvec_page_state(target_lruvec,
-				WORKINGSET_ACTIVATE_FILE);
-		if (refaults != target_lruvec->refaults[1] ||
-		    inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
-			sc->may_deactivate |= DEACTIVATE_FILE;
-		else
-			sc->may_deactivate &= ~DEACTIVATE_FILE;
-	} else
-		sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
-
-	/*
-	 * If we have plenty of inactive file pages that aren't
-	 * thrashing, try to reclaim those first before touching
-	 * anonymous pages.
-	 */
-	file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
-	if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
-		sc->cache_trim_mode = 1;
-	else
-		sc->cache_trim_mode = 0;
-
-	/*
-	 * Prevent the reclaimer from falling into the cache trap: as
-	 * cache pages start out inactive, every cache fault will tip
-	 * the scan balance towards the file LRU.  And as the file LRU
-	 * shrinks, so does the window for rotation from references.
-	 * This means we have a runaway feedback loop where a tiny
-	 * thrashing file LRU becomes infinitely more attractive than
-	 * anon pages.  Try to detect this based on file LRU size.
-	 */
-	if (!cgroup_reclaim(sc)) {
-		unsigned long total_high_wmark = 0;
-		unsigned long free, anon;
-		int z;
-
-		free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
-		file = node_page_state(pgdat, NR_ACTIVE_FILE) +
-			   node_page_state(pgdat, NR_INACTIVE_FILE);
-
-		for (z = 0; z < MAX_NR_ZONES; z++) {
-			struct zone *zone = &pgdat->node_zones[z];
-			if (!managed_zone(zone))
-				continue;
-
-			total_high_wmark += high_wmark_pages(zone);
-		}
-
-		/*
-		 * Consider anon: if that's low too, this isn't a
-		 * runaway file reclaim problem, but rather just
-		 * extreme pressure. Reclaim as per usual then.
-		 */
-		anon = node_page_state(pgdat, NR_INACTIVE_ANON);
-
-		sc->file_is_tiny =
-			file + free <= total_high_wmark &&
-			!(sc->may_deactivate & DEACTIVATE_ANON) &&
-			anon >> sc->priority;
-	}
+	prepare_scan_count(pgdat, sc);
 
 	shrink_node_memcgs(pgdat, sc);
 
-- 
2.35.1.616.g0bdcbb4464-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v9 04/14] Revert "include/linux/mm_inline.h: fold __update_lru_size() into its sole caller"
  2022-03-09  2:12 ` Yu Zhao
@ 2022-03-09  2:12   ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-09  2:12 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
	Jonathan Corbet, Matthew Wilcox, Mel Gorman, Michael Larabel,
	Michal Hocko, Mike Rapoport, Rik van Riel, Vlastimil Babka,
	Will Deacon, Ying Huang, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, page-reclaim, x86, Yu Zhao, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

This patch undoes the following refactor:
commit 289ccba18af4 ("include/linux/mm_inline.h: fold __update_lru_size() into its sole caller")

The upcoming changes to include/linux/mm_inline.h will reuse
__update_lru_size().

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 include/linux/mm_inline.h | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index cf90b1fa2c60..2c24f5ac3e2a 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -32,7 +32,7 @@ static inline int page_is_file_lru(struct page *page)
 	return folio_is_file_lru(page_folio(page));
 }
 
-static __always_inline void update_lru_size(struct lruvec *lruvec,
+static __always_inline void __update_lru_size(struct lruvec *lruvec,
 				enum lru_list lru, enum zone_type zid,
 				long nr_pages)
 {
@@ -41,6 +41,13 @@ static __always_inline void update_lru_size(struct lruvec *lruvec,
 	__mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
 	__mod_zone_page_state(&pgdat->node_zones[zid],
 				NR_ZONE_LRU_BASE + lru, nr_pages);
+}
+
+static __always_inline void update_lru_size(struct lruvec *lruvec,
+				enum lru_list lru, enum zone_type zid,
+				int nr_pages)
+{
+	__update_lru_size(lruvec, lru, zid, nr_pages);
 #ifdef CONFIG_MEMCG
 	mem_cgroup_update_lru_size(lruvec, lru, zid, nr_pages);
 #endif
-- 
2.35.1.616.g0bdcbb4464-goog



* [PATCH v9 05/14] mm: multi-gen LRU: groundwork
  2022-03-09  2:12 ` Yu Zhao
@ 2022-03-09  2:12   ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-09  2:12 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
	Jonathan Corbet, Matthew Wilcox, Mel Gorman, Michael Larabel,
	Michal Hocko, Mike Rapoport, Rik van Riel, Vlastimil Babka,
	Will Deacon, Ying Huang, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, page-reclaim, x86, Yu Zhao, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.

Generation numbers are truncated modulo MAX_NR_GENS, and each truncated
generation number is an index into lrugen->lists[]. The truncated
number plus one is stored in the gen counter in folio->flags, which is
order_base_2(MAX_NR_GENS+1) bits wide. The sliding window technique is
used to track at least MIN_NR_GENS and at most MAX_NR_GENS
generations. The gen counter stores a value within [1, MAX_NR_GENS]
while a page is on one of lrugen->lists[]. Otherwise it stores 0.
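
The encoding above can be sketched in plain userspace C. This is an
illustration only, not code from this patch: the helper names
order_base_2_sketch(), gen_from_seq(), gen_counter_encode() and
gen_counter_decode() are hypothetical. It shows a sequence number being
truncated to a list index, and the index being stored as gen+1 so that
0 can mean "not on any list".

```c
#include <assert.h>

#define MIN_NR_GENS 2U
#define MAX_NR_GENS 4U

/* Smallest n with (1 << n) >= x; a stand-in for the kernel's order_base_2(). */
static inline unsigned int order_base_2_sketch(unsigned int x)
{
	unsigned int bits = 0;

	while ((1U << bits) < x)
		bits++;
	return bits;
}

/* Truncate a monotonically increasing sequence number into a list index. */
static inline unsigned int gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}

/* Value kept in the gen counter: gen + 1, so that gen == -1 maps to 0. */
static inline unsigned int gen_counter_encode(int gen)
{
	assert(gen >= -1 && gen < (int)MAX_NR_GENS);
	return (unsigned int)(gen + 1);
}

/* Inverse of the above: 0 decodes to -1, i.e., "not on any list". */
static inline int gen_counter_decode(unsigned int counter)
{
	assert(counter <= MAX_NR_GENS);
	return (int)counter - 1;
}
```

With MAX_NR_GENS set to 4, the counter has to hold five distinct values
(0 through 4), which is why order_base_2(MAX_NR_GENS+1) evaluates to 3
bits.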

There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations. They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of
working set estimation and proactive reclaim. These features are
required to optimize job scheduling (bin packing) in data centers. The
variable size of the sliding window is designed for such use cases
[1][2].
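
A rough userspace model of the sliding window may help here. It is a
sketch under stated assumptions, not the implementation: struct
gen_window and the window_*() helpers are hypothetical, only one page
type is modeled, and refusing to move at the window bounds is an
assumed policy, since this patch does not yet contain the aging or the
eviction.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_NR_GENS 2UL
#define MAX_NR_GENS 4UL

/* Hypothetical, simplified stand-in for one type's window in lrugen. */
struct gen_window {
	unsigned long max_seq;	/* youngest generation, advanced by the aging */
	unsigned long min_seq;	/* oldest generation, advanced by the eviction */
};

/* Number of generations currently tracked by the window. */
static inline unsigned long window_nr_gens(const struct gen_window *w)
{
	return w->max_seq - w->min_seq + 1;
}

/* The aging: produce a new youngest generation if the window has room. */
static inline bool window_age(struct gen_window *w)
{
	if (window_nr_gens(w) >= MAX_NR_GENS)
		return false;
	w->max_seq++;
	return true;
}

/* The eviction: retire the oldest generation, keeping MIN_NR_GENS. */
static inline bool window_evict(struct gen_window *w)
{
	if (window_nr_gens(w) <= MIN_NR_GENS)
		return false;
	w->min_seq++;
	return true;
}
```

Both sequence numbers only ever increase, so the window slides forward
and its size stays within [MIN_NR_GENS, MAX_NR_GENS].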

To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive"
will be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of pages mapped
   in page tables is higher due to the approximation of the accessed
   bit.
2. The cost of evicting such pages is higher due to the TLB flushes
   required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting such pages is higher because
   applications usually do not prepare themselves for major page
   faults like they do for blocked I/O. E.g., GUI applications
   commonly use dedicated I/O threads to avoid blocking the rendering
   threads.

There are also two access patterns: one with temporal locality and the
other without. For the reasons listed above, accesses through page
tables are assumed to have temporal locality unless VM_SEQ_READ or
VM_RAND_READ is present; accesses through file descriptors are assumed
to lack it unless outlying refaults have been observed.

The next patch will address the "outlying refaults". A few macros,
i.e., LRU_REFS_*, used later are added in this patch to keep the
diffs of the following patches small.

A page is added to the youngest generation on faulting. The aging
needs to check the accessed bit at least twice before handing this
page over to the eviction. The first check takes care of the accessed
bit set on the initial fault; the second check makes sure this page
has not been used since then. This protocol, AKA second chance,
requires a minimum of two generations, hence MIN_NR_GENS.
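
The second chance protocol can be modeled in a few lines of userspace
C. This is a simplified sketch, not the aging code of this patchset:
struct page_sketch and its helpers are hypothetical, and treating a
page as ready for the eviction as soon as it is no longer in the
youngest generation is an assumption made to keep the model small.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a page: an accessed bit plus its generation. */
struct page_sketch {
	bool accessed;		/* models the accessed bit in the PTE */
	unsigned long seq;	/* generation the page currently belongs to */
};

/* On faulting: the access sets the bit; the page joins the youngest gen. */
static inline void page_fault_in(struct page_sketch *p, unsigned long max_seq)
{
	p->accessed = true;
	p->seq = max_seq;
}

/*
 * One aging check, run after a new youngest generation (new_max_seq) has
 * been created. A page found with the accessed bit set has the bit
 * cleared and is moved to the new generation; an untouched page stays
 * where it is and therefore grows older relative to max_seq.
 */
static inline void age_page(struct page_sketch *p, unsigned long new_max_seq)
{
	if (p->accessed) {
		p->accessed = false;
		p->seq = new_max_seq;
	}
}

/* Assumed rule: a page outside the youngest generation may be evicted. */
static inline bool page_evictable(const struct page_sketch *p,
				  unsigned long max_seq)
{
	return p->seq < max_seq;
}
```

In this model a freshly faulted page survives the first check because
of the bit set by the fault itself, and it becomes a candidate for the
eviction only if the second check finds the bit still clear. The page
and the newer generation it fell behind are the two generations the
protocol needs.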

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 fs/fuse/dev.c                     |   3 +-
 include/linux/mm.h                |   2 +
 include/linux/mm_inline.h         | 176 ++++++++++++++++++++++++++++++
 include/linux/mmzone.h            |  94 ++++++++++++++++
 include/linux/page-flags-layout.h |  11 +-
 include/linux/page-flags.h        |   4 +-
 include/linux/sched.h             |   4 +
 kernel/bounds.c                   |   7 ++
 mm/Kconfig                        |  10 ++
 mm/huge_memory.c                  |   3 +-
 mm/memcontrol.c                   |   2 +
 mm/memory.c                       |  25 +++++
 mm/mm_init.c                      |   6 +-
 mm/mmzone.c                       |   2 +
 mm/swap.c                         |   9 +-
 mm/vmscan.c                       |  73 +++++++++++++
 16 files changed, 418 insertions(+), 13 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 592730fd6e42..e7c0aa6d61ce 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -785,7 +785,8 @@ static int fuse_check_page(struct page *page)
 	       1 << PG_active |
 	       1 << PG_workingset |
 	       1 << PG_reclaim |
-	       1 << PG_waiters))) {
+	       1 << PG_waiters |
+	       LRU_GEN_MASK | LRU_REFS_MASK))) {
 		dump_page(page, "fuse: trying to steal weird page");
 		return 1;
 	}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5744a3fc4716..c1162659d824 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1032,6 +1032,8 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
 #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
 #define LAST_CPUPID_PGOFF	(ZONES_PGOFF - LAST_CPUPID_WIDTH)
 #define KASAN_TAG_PGOFF		(LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH)
+#define LRU_GEN_PGOFF		(KASAN_TAG_PGOFF - LRU_GEN_WIDTH)
+#define LRU_REFS_PGOFF		(LRU_GEN_PGOFF - LRU_REFS_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 2c24f5ac3e2a..e3594171b421 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -38,6 +38,9 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
 {
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
+	lockdep_assert_held(&lruvec->lru_lock);
+	WARN_ON_ONCE(nr_pages != (int)nr_pages);
+
 	__mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
 	__mod_zone_page_state(&pgdat->node_zones[zid],
 				NR_ZONE_LRU_BASE + lru, nr_pages);
@@ -99,11 +102,178 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio)
 	return lru;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+static inline bool lru_gen_enabled(void)
+{
+	return true;
+}
+
+static inline bool lru_gen_in_fault(void)
+{
+	return current->in_lru_fault;
+}
+
+static inline int lru_gen_from_seq(unsigned long seq)
+{
+	return seq % MAX_NR_GENS;
+}
+
+static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
+{
+	unsigned long max_seq = lruvec->lrugen.max_seq;
+
+	VM_BUG_ON(gen >= MAX_NR_GENS);
+
+	/* see the comment on MIN_NR_GENS */
+	return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
+}
+
+static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *folio,
+				       int old_gen, int new_gen)
+{
+	int type = folio_is_file_lru(folio);
+	int zone = folio_zonenum(folio);
+	int delta = folio_nr_pages(folio);
+	enum lru_list lru = type * LRU_INACTIVE_FILE;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON(old_gen != -1 && old_gen >= MAX_NR_GENS);
+	VM_BUG_ON(new_gen != -1 && new_gen >= MAX_NR_GENS);
+	VM_BUG_ON(old_gen == -1 && new_gen == -1);
+
+	if (old_gen >= 0)
+		WRITE_ONCE(lrugen->nr_pages[old_gen][type][zone],
+			   lrugen->nr_pages[old_gen][type][zone] - delta);
+	if (new_gen >= 0)
+		WRITE_ONCE(lrugen->nr_pages[new_gen][type][zone],
+			   lrugen->nr_pages[new_gen][type][zone] + delta);
+
+	/* addition */
+	if (old_gen < 0) {
+		if (lru_gen_is_active(lruvec, new_gen))
+			lru += LRU_ACTIVE;
+		__update_lru_size(lruvec, lru, zone, delta);
+		return;
+	}
+
+	/* deletion */
+	if (new_gen < 0) {
+		if (lru_gen_is_active(lruvec, old_gen))
+			lru += LRU_ACTIVE;
+		__update_lru_size(lruvec, lru, zone, -delta);
+		return;
+	}
+}
+
+static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+	int gen;
+	unsigned long old_flags, new_flags;
+	int type = folio_is_file_lru(folio);
+	int zone = folio_zonenum(folio);
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	if (folio_test_unevictable(folio))
+		return false;
+	/*
+	 * There are three common cases for this page:
+	 * 1. If it's hot, e.g., freshly faulted in or previously hot and
+	 *    migrated, add it to the youngest generation.
+	 * 2. If it's cold but can't be evicted immediately, i.e., an anon page
+	 *    not in swapcache or a dirty page pending writeback, add it to the
+	 *    second oldest generation.
+	 * 3. Everything else (clean, cold) is added to the oldest generation.
+	 */
+	if (folio_test_active(folio))
+		gen = lru_gen_from_seq(lrugen->max_seq);
+	else if ((type == LRU_GEN_ANON && !folio_test_swapcache(folio)) ||
+		 (folio_test_reclaim(folio) &&
+		  (folio_test_dirty(folio) || folio_test_writeback(folio))))
+		gen = lru_gen_from_seq(lrugen->min_seq[type] + 1);
+	else
+		gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+		VM_BUG_ON_FOLIO(new_flags & LRU_GEN_MASK, folio);
+
+		/* see the comment on MIN_NR_GENS */
+		new_flags &= ~(LRU_GEN_MASK | BIT(PG_active));
+		new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
+	} while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+
+	lru_gen_update_size(lruvec, folio, -1, gen);
+	/* for folio_rotate_reclaimable() */
+	if (reclaiming)
+		list_add_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
+	else
+		list_add(&folio->lru, &lrugen->lists[gen][type][zone]);
+
+	return true;
+}
+
+static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+	int gen;
+	unsigned long old_flags, new_flags;
+
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+		if (!(new_flags & LRU_GEN_MASK))
+			return false;
+
+		VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
+		VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
+
+		gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+
+		new_flags &= ~LRU_GEN_MASK;
+		/* for shrink_page_list() */
+		if (reclaiming)
+			new_flags &= ~(BIT(PG_referenced) | BIT(PG_reclaim));
+		else if (lru_gen_is_active(lruvec, gen))
+			new_flags |= BIT(PG_active);
+	} while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+
+	lru_gen_update_size(lruvec, folio, gen, -1);
+	list_del(&folio->lru);
+
+	return true;
+}
+
+#else
+
+static inline bool lru_gen_enabled(void)
+{
+	return false;
+}
+
+static inline bool lru_gen_in_fault(void)
+{
+	return false;
+}
+
+static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+	return false;
+}
+
+static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+	return false;
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 static __always_inline
 void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
 {
 	enum lru_list lru = folio_lru_list(folio);
 
+	if (lru_gen_add_folio(lruvec, folio, false))
+		return;
+
 	update_lru_size(lruvec, lru, folio_zonenum(folio),
 			folio_nr_pages(folio));
 	list_add(&folio->lru, &lruvec->lists[lru]);
@@ -120,6 +290,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
 {
 	enum lru_list lru = folio_lru_list(folio);
 
+	if (lru_gen_add_folio(lruvec, folio, true))
+		return;
+
 	update_lru_size(lruvec, lru, folio_zonenum(folio),
 			folio_nr_pages(folio));
 	list_add_tail(&folio->lru, &lruvec->lists[lru]);
@@ -134,6 +307,9 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
 static __always_inline
 void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
 {
+	if (lru_gen_del_folio(lruvec, folio, false))
+		return;
+
 	list_del(&folio->lru);
 	update_lru_size(lruvec, folio_lru_list(folio), folio_zonenum(folio),
 			-folio_nr_pages(folio));
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index aed44e9b5d89..a88e27d85693 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -303,6 +303,96 @@ enum lruvec_flags {
 					 */
 };
 
+#endif /* !__GENERATING_BOUNDS_H */
+
+/*
+ * Evictable pages are divided into multiple generations. The youngest and the
+ * oldest generation numbers, max_seq and min_seq, are monotonically increasing.
+ * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An
+ * offset within MAX_NR_GENS, gen, indexes the LRU list of the corresponding
+ * generation. The gen counter in folio->flags stores gen+1 while a page is on
+ * one of lrugen->lists[]. Otherwise it stores 0.
+ *
+ * A page is added to the youngest generation on faulting. The aging needs to
+ * check the accessed bit at least twice before handing this page over to the
+ * eviction. The first check takes care of the accessed bit set on the initial
+ * fault; the second check makes sure this page hasn't been used since then.
+ * This process, AKA second chance, requires a minimum of two generations,
+ * hence MIN_NR_GENS. And to maintain ABI compatibility with the active/inactive
+ * LRU, these two generations are considered active; the rest of generations, if
+ * they exist, are considered inactive. See lru_gen_is_active(). PG_active is
+ * always cleared while a page is on one of lrugen->lists[] so that the aging
+ * need not worry about it. And it's set again when a page considered active
+ * is isolated for non-reclaiming purposes, e.g., migration. See
+ * lru_gen_add_folio() and lru_gen_del_folio().
+ *
+ * MAX_NR_GENS is set to 4 so that the multi-gen LRU has twice the number of
+ * categories of the active/inactive LRU.
+ *
+ */
+#define MIN_NR_GENS		2U
+#define MAX_NR_GENS		4U
+
+#ifndef __GENERATING_BOUNDS_H
+
+struct lruvec;
+
+#define LRU_GEN_MASK		((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
+#define LRU_REFS_MASK		((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
+
+#ifdef CONFIG_LRU_GEN
+
+enum {
+	LRU_GEN_ANON,
+	LRU_GEN_FILE,
+};
+
+/*
+ * The youngest generation number is stored in max_seq for both anon and file
+ * types as they are aged on an equal footing. The oldest generation numbers are
+ * stored in min_seq[] separately for anon and file types as clean file pages
+ * can be evicted regardless of swap constraints.
+ *
+ * Normally anon and file min_seq are in sync. But if swapping is constrained,
+ * e.g., out of swap space, file min_seq is allowed to advance and leave anon
+ * min_seq behind.
+ */
+struct lru_gen_struct {
+	/* the aging increments the youngest generation number */
+	unsigned long max_seq;
+	/* the eviction increments the oldest generation numbers */
+	unsigned long min_seq[ANON_AND_FILE];
+	/* the multi-gen LRU lists */
+	struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	/* the sizes of the above lists */
+	unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+};
+
+void lru_gen_init_lruvec(struct lruvec *lruvec);
+
+#ifdef CONFIG_MEMCG
+void lru_gen_init_memcg(struct mem_cgroup *memcg);
+void lru_gen_exit_memcg(struct mem_cgroup *memcg);
+#endif
+
+#else /* !CONFIG_LRU_GEN */
+
+static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
+{
+}
+
+#ifdef CONFIG_MEMCG
+static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
+{
+}
+
+static inline void lru_gen_exit_memcg(struct mem_cgroup *memcg)
+{
+}
+#endif
+
+#endif /* CONFIG_LRU_GEN */
+
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
 	/* per lruvec lru_lock for memcg */
@@ -320,6 +410,10 @@ struct lruvec {
 	unsigned long			refaults[ANON_AND_FILE];
 	/* Various lruvec state flags (enum lruvec_flags) */
 	unsigned long			flags;
+#ifdef CONFIG_LRU_GEN
+	/* evictable pages divided into generations */
+	struct lru_gen_struct		lrugen;
+#endif
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
 #endif
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index ef1e3e736e14..c1946cdb845f 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -55,7 +55,8 @@
 #define SECTIONS_WIDTH		0
 #endif
 
-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_REFS_WIDTH + SECTIONS_WIDTH + NODES_SHIFT \
+	<= BITS_PER_LONG - NR_PAGEFLAGS
 #define NODES_WIDTH		NODES_SHIFT
 #elif defined(CONFIG_SPARSEMEM_VMEMMAP)
 #error "Vmemmap: No space for nodes field in page flags"
@@ -89,8 +90,8 @@
 #define LAST_CPUPID_SHIFT 0
 #endif
 
-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT \
-	<= BITS_PER_LONG - NR_PAGEFLAGS
+#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_REFS_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
+	KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
 #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
 #else
 #define LAST_CPUPID_WIDTH 0
@@ -100,8 +101,8 @@
 #define LAST_CPUPID_NOT_IN_PAGE_FLAGS
 #endif
 
-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH \
-	> BITS_PER_LONG - NR_PAGEFLAGS
+#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_REFS_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
+	KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
 #error "Not enough bits in page flags"
 #endif
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1c3b6e5c8bfd..a95518ca98eb 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -935,7 +935,7 @@ __PAGEFLAG(Isolated, isolated, PF_ANY);
 	 1UL << PG_private	| 1UL << PG_private_2	|	\
 	 1UL << PG_writeback	| 1UL << PG_reserved	|	\
 	 1UL << PG_slab		| 1UL << PG_active 	|	\
-	 1UL << PG_unevictable	| __PG_MLOCKED)
+	 1UL << PG_unevictable	| __PG_MLOCKED | LRU_GEN_MASK)
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.
@@ -946,7 +946,7 @@ __PAGEFLAG(Isolated, isolated, PF_ANY);
  * alloc-free cycle to prevent from reusing the page.
  */
 #define PAGE_FLAGS_CHECK_AT_PREP	\
-	(PAGEFLAGS_MASK & ~__PG_HWPOISON)
+	((PAGEFLAGS_MASK & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK)
 
 #define PAGE_FLAGS_PRIVATE				\
 	(1UL << PG_private | 1UL << PG_private_2)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 75ba8aa60248..e7fe784b11aa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -914,6 +914,10 @@ struct task_struct {
 #ifdef CONFIG_MEMCG
 	unsigned			in_user_fault:1;
 #endif
+#ifdef CONFIG_LRU_GEN
+	/* whether the LRU algorithm may apply to this access */
+	unsigned			in_lru_fault:1;
+#endif
 #ifdef CONFIG_COMPAT_BRK
 	unsigned			brk_randomized:1;
 #endif
diff --git a/kernel/bounds.c b/kernel/bounds.c
index 9795d75b09b2..e08fb89f87f4 100644
--- a/kernel/bounds.c
+++ b/kernel/bounds.c
@@ -22,6 +22,13 @@ int main(void)
 	DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
 #endif
 	DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t));
+#ifdef CONFIG_LRU_GEN
+	DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1));
+	DEFINE(LRU_REFS_WIDTH, 0);
+#else
+	DEFINE(LRU_GEN_WIDTH, 0);
+	DEFINE(LRU_REFS_WIDTH, 0);
+#endif
 	/* End of constants */
 
 	return 0;
diff --git a/mm/Kconfig b/mm/Kconfig
index 3326ee3903f3..747ab1690bcf 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -892,6 +892,16 @@ config ANON_VMA_NAME
 	  area from being merged with adjacent virtual memory areas due to the
 	  difference in their name.
 
+# the multi-gen LRU {
+config LRU_GEN
+	bool "Multi-Gen LRU"
+	depends on MMU
+	# the following options can use up the spare bits in page flags
+	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
+	help
+	  A high performance LRU implementation for memory overcommit.
+# }
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 406a3c28c026..3df389fd307f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2364,7 +2364,8 @@ static void __split_huge_page_tail(struct page *head, int tail,
 #ifdef CONFIG_64BIT
 			 (1L << PG_arch_2) |
 #endif
-			 (1L << PG_dirty)));
+			 (1L << PG_dirty) |
+			 LRU_GEN_MASK | LRU_REFS_MASK));
 
 	/* ->mapping in first tail page is compound_mapcount */
 	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 36e9f38c919d..3fcbfeda259b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5121,6 +5121,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 
 static void mem_cgroup_free(struct mem_cgroup *memcg)
 {
+	lru_gen_exit_memcg(memcg);
 	memcg_wb_domain_exit(memcg);
 	__mem_cgroup_free(memcg);
 }
@@ -5180,6 +5181,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	memcg->deferred_split_queue.split_queue_len = 0;
 #endif
 	idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
+	lru_gen_init_memcg(memcg);
 	return memcg;
 fail:
 	mem_cgroup_id_remove(memcg);
diff --git a/mm/memory.c b/mm/memory.c
index a7379196a47e..d27e5f1a2533 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4754,6 +4754,27 @@ static inline void mm_account_fault(struct pt_regs *regs,
 		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
 }
 
+#ifdef CONFIG_LRU_GEN
+static void lru_gen_enter_fault(struct vm_area_struct *vma)
+{
+	/* the LRU algorithm doesn't apply to sequential or random reads */
+	current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ));
+}
+
+static void lru_gen_exit_fault(void)
+{
+	current->in_lru_fault = false;
+}
+#else
+static void lru_gen_enter_fault(struct vm_area_struct *vma)
+{
+}
+
+static void lru_gen_exit_fault(void)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
 /*
  * By the time we get here, we already hold the mm semaphore
  *
@@ -4785,11 +4806,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	if (flags & FAULT_FLAG_USER)
 		mem_cgroup_enter_user_fault();
 
+	lru_gen_enter_fault(vma);
+
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
 	else
 		ret = __handle_mm_fault(vma, address, flags);
 
+	lru_gen_exit_fault();
+
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_exit_user_fault();
 		/*
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 9ddaf0e1b0ab..0d7b2bd2454a 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -65,14 +65,16 @@ void __init mminit_verify_pageflags_layout(void)
 
 	shift = 8 * sizeof(unsigned long);
 	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH
-		- LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH;
+		- LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH - LRU_GEN_WIDTH - LRU_REFS_WIDTH;
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
-		"Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Flags %d\n",
+		"Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Gen %d Tier %d Flags %d\n",
 		SECTIONS_WIDTH,
 		NODES_WIDTH,
 		ZONES_WIDTH,
 		LAST_CPUPID_WIDTH,
 		KASAN_TAG_WIDTH,
+		LRU_GEN_WIDTH,
+		LRU_REFS_WIDTH,
 		NR_PAGEFLAGS);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
 		"Section %d Node %d Zone %d Lastcpupid %d Kasantag %d\n",
diff --git a/mm/mmzone.c b/mm/mmzone.c
index eb89d6e018e2..2ec0d7793424 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -81,6 +81,8 @@ void lruvec_init(struct lruvec *lruvec)
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
+
+	lru_gen_init_lruvec(lruvec);
 }
 
 #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)
diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56..e5f2ab3dab4a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -462,6 +462,11 @@ void folio_add_lru(struct folio *folio)
 	VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 
+	/* see the comment in lru_gen_add_folio() */
+	if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
+	    lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
+		folio_set_active(folio);
+
 	folio_get(folio);
 	local_lock(&lru_pvecs.lock);
 	pvec = this_cpu_ptr(&lru_pvecs.lru_add);
@@ -563,7 +568,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 
 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageActive(page) && !PageUnevictable(page)) {
+	if (!PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
 		int nr_pages = thp_nr_pages(page);
 
 		del_page_from_lru_list(page, lruvec);
@@ -677,7 +682,7 @@ void deactivate_file_page(struct page *page)
  */
 void deactivate_page(struct page *page)
 {
-	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+	if (PageLRU(page) && !PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
 		struct pagevec *pvec;
 
 		local_lock(&lru_pvecs.lock);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8e744cdf802f..65eb668abf2d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3042,6 +3042,79 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
 	return can_demote(pgdat->node_id, sc);
 }
 
+#ifdef CONFIG_LRU_GEN
+
+/******************************************************************************
+ *                          shorthand helpers
+ ******************************************************************************/
+
+#define for_each_gen_type_zone(gen, type, zone)				\
+	for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)			\
+		for ((type) = 0; (type) < ANON_AND_FILE; (type)++)	\
+			for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
+
+static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
+{
+	struct pglist_data *pgdat = NODE_DATA(nid);
+
+#ifdef CONFIG_MEMCG
+	if (memcg) {
+		struct lruvec *lruvec = &memcg->nodeinfo[nid]->lruvec;
+
+		/* for hotadd_new_pgdat() */
+		if (!lruvec->pgdat)
+			lruvec->pgdat = pgdat;
+
+		return lruvec;
+	}
+#endif
+	return pgdat ? &pgdat->__lruvec : NULL;
+}
+
+/******************************************************************************
+ *                          initialization
+ ******************************************************************************/
+
+void lru_gen_init_lruvec(struct lruvec *lruvec)
+{
+	int gen, type, zone;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	lrugen->max_seq = MIN_NR_GENS + 1;
+
+	for_each_gen_type_zone(gen, type, zone)
+		INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
+}
+
+#ifdef CONFIG_MEMCG
+void lru_gen_init_memcg(struct mem_cgroup *memcg)
+{
+}
+
+void lru_gen_exit_memcg(struct mem_cgroup *memcg)
+{
+	int nid;
+
+	for_each_node(nid) {
+		struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+		VM_BUG_ON(memchr_inv(lruvec->lrugen.nr_pages, 0,
+				     sizeof(lruvec->lrugen.nr_pages)));
+	}
+}
+#endif
+
+static int __init init_lru_gen(void)
+{
+	BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS);
+	BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
+
+	return 0;
+}
+late_initcall(init_lru_gen);
+
+#endif /* CONFIG_LRU_GEN */
+
 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
 	unsigned long nr[NR_LRU_LISTS];
-- 
2.35.1.616.g0bdcbb4464-goog


* [PATCH v9 05/14] mm: multi-gen LRU: groundwork
@ 2022-03-09  2:12   ` Yu Zhao
From: Yu Zhao @ 2022-03-09  2:12 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
	Jonathan Corbet, Matthew Wilcox, Mel Gorman, Michael Larabel,
	Michal Hocko, Mike Rapoport, Rik van Riel, Vlastimil Babka,
	Will Deacon, Ying Huang, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, page-reclaim, x86, Yu Zhao, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in folio->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.
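
As a rough illustration of the encoding described above, the
following standalone C sketch maps a sequence number to a list index
and packs it into a flags word as gen+1, so that 0 can mean "not on
any list". GEN_PGOFF is an arbitrary offset chosen for this example
and the helper names are hypothetical; in the patch the real width and
offset come from kernel/bounds.c and include/linux/mm.h.

```c
#include <assert.h>

#define MAX_NR_GENS	4U
/* order_base_2(MAX_NR_GENS + 1) = order_base_2(5) = 3 bits */
#define GEN_WIDTH	3
/* assumption: arbitrary bit offset, for illustration only */
#define GEN_PGOFF	8
#define GEN_MASK	(((1UL << GEN_WIDTH) - 1) << GEN_PGOFF)

/* truncate a monotonically increasing sequence number to a list index */
static int gen_from_seq(unsigned long seq)
{
	return (int)(seq % MAX_NR_GENS);
}

/* store gen+1 in the counter field; other bits are left untouched */
static unsigned long flags_set_gen(unsigned long flags, int gen)
{
	flags &= ~GEN_MASK;
	return flags | ((gen + 1UL) << GEN_PGOFF);
}

/* return the list index, or -1 if the counter field holds 0 */
static int flags_get_gen(unsigned long flags)
{
	return (int)((flags & GEN_MASK) >> GEN_PGOFF) - 1;
}
```

Storing gen+1 rather than gen is what makes three bits necessary for
four generations: the field has to represent five distinct values.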

There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations. They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of
working set estimation and proactive reclaim. These features are
required to optimize job scheduling (bin packing) in data centers. The
variable size of the sliding window is designed for such use cases
[1][2].
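
The interplay between the two procedures can be pictured with a toy
model. This is only a sketch under stated assumptions: the struct
mirrors the max_seq and min_seq[] counters from this patch, but
toy_age(), toy_evict() and their refuse-when-out-of-bounds policy are
hypothetical and are not the implementation added by this series.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_NR_GENS	2U
#define MAX_NR_GENS	4U
#define ANON_AND_FILE	2

struct toy_lrugen {
	unsigned long max_seq;			/* youngest generation */
	unsigned long min_seq[ANON_AND_FILE];	/* oldest, per type */
};

/* size of the sliding window for one type */
static unsigned long toy_nr_gens(const struct toy_lrugen *g, int type)
{
	return g->max_seq - g->min_seq[type] + 1;
}

/* "the aging": produce a new youngest generation if the window has room */
static bool toy_age(struct toy_lrugen *g)
{
	int type;

	for (type = 0; type < ANON_AND_FILE; type++)
		if (toy_nr_gens(g, type) >= MAX_NR_GENS)
			return false;
	g->max_seq++;
	return true;
}

/* "the eviction": consume the oldest generation of one type */
static bool toy_evict(struct toy_lrugen *g, int type)
{
	if (toy_nr_gens(g, type) <= MIN_NR_GENS)
		return false;
	g->min_seq[type]++;
	return true;
}
```

Starting from max_seq = MIN_NR_GENS + 1 and min_seq[] = 0, as
lru_gen_init_lruvec() does, the window is already MAX_NR_GENS wide, so
in this model the eviction has to run before the aging can advance.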

To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive"
will be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
   channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
   flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
   applications usually do not prepare themselves for major page
   faults like they do for blocked I/O. E.g., GUI applications
   commonly use dedicated I/O threads to avoid blocking the rendering
   threads.
There are also two access patterns: one with temporal locality and the
other without. For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or
VM_RAND_READ is present; the latter channel is assumed to follow the
latter pattern unless outlying refaults have been observed.
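
The rule for the page table channel boils down to a single flag test,
which is what lru_gen_enter_fault() in the diff below records in
current->in_lru_fault. A minimal standalone sketch follows; the helper
name is hypothetical and the flag values are meant to mirror
include/linux/mm.h but should be treated as illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* illustrative values; the kernel defines these in include/linux/mm.h */
#define VM_SEQ_READ	0x00008000UL
#define VM_RAND_READ	0x00010000UL

/*
 * A fault through page tables is assumed to have temporal locality
 * unless the mapping was hinted as sequential or random.
 */
static bool fault_has_temporal_locality(unsigned long vm_flags)
{
	return !(vm_flags & (VM_SEQ_READ | VM_RAND_READ));
}
```

Userspace usually sets these hints with madvise(MADV_SEQUENTIAL) or
madvise(MADV_RANDOM), so a mapping without either hint takes the
default, protected path.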

The next patch will address the "outlying refaults". A few macros,
i.e., LRU_REFS_*, that are only used later are added in this patch to
keep the diffs of the following patches smaller.

A page is added to the youngest generation on faulting. The aging
needs to check the accessed bit at least twice before handing this
page over to the eviction. The first check takes care of the accessed
bit set on the initial fault; the second check makes sure this page
has not been used since then. This protocol, AKA second chance,
requires a minimum of two generations, hence MIN_NR_GENS.
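
The two generations required by second chance are the ones that
lru_gen_is_active() in the diff below reports as active. The following
standalone sketch reproduces that check outside the kernel; the
function names are hypothetical and it takes max_seq directly instead
of a struct lruvec.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NR_GENS	4U

/* truncate a sequence number to a list index */
static int gen_from_seq(unsigned long seq)
{
	return (int)(seq % MAX_NR_GENS);
}

/* the youngest two generations are the ones still under second chance */
static bool gen_is_active(unsigned long max_seq, int gen)
{
	return gen == gen_from_seq(max_seq) ||
	       gen == gen_from_seq(max_seq - 1);
}
```

Because the index wraps around, the two active generations are not
always adjacent list indices: with max_seq = 4 they are indices 0 and
3.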

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 fs/fuse/dev.c                     |   3 +-
 include/linux/mm.h                |   2 +
 include/linux/mm_inline.h         | 176 ++++++++++++++++++++++++++++++
 include/linux/mmzone.h            |  94 ++++++++++++++++
 include/linux/page-flags-layout.h |  11 +-
 include/linux/page-flags.h        |   4 +-
 include/linux/sched.h             |   4 +
 kernel/bounds.c                   |   7 ++
 mm/Kconfig                        |  10 ++
 mm/huge_memory.c                  |   3 +-
 mm/memcontrol.c                   |   2 +
 mm/memory.c                       |  25 +++++
 mm/mm_init.c                      |   6 +-
 mm/mmzone.c                       |   2 +
 mm/swap.c                         |   9 +-
 mm/vmscan.c                       |  73 +++++++++++++
 16 files changed, 418 insertions(+), 13 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 592730fd6e42..e7c0aa6d61ce 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -785,7 +785,8 @@ static int fuse_check_page(struct page *page)
 	       1 << PG_active |
 	       1 << PG_workingset |
 	       1 << PG_reclaim |
-	       1 << PG_waiters))) {
+	       1 << PG_waiters |
+	       LRU_GEN_MASK | LRU_REFS_MASK))) {
 		dump_page(page, "fuse: trying to steal weird page");
 		return 1;
 	}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5744a3fc4716..c1162659d824 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1032,6 +1032,8 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
 #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
 #define LAST_CPUPID_PGOFF	(ZONES_PGOFF - LAST_CPUPID_WIDTH)
 #define KASAN_TAG_PGOFF		(LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH)
+#define LRU_GEN_PGOFF		(KASAN_TAG_PGOFF - LRU_GEN_WIDTH)
+#define LRU_REFS_PGOFF		(LRU_GEN_PGOFF - LRU_REFS_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 2c24f5ac3e2a..e3594171b421 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -38,6 +38,9 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
 {
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
+	lockdep_assert_held(&lruvec->lru_lock);
+	WARN_ON_ONCE(nr_pages != (int)nr_pages);
+
 	__mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
 	__mod_zone_page_state(&pgdat->node_zones[zid],
 				NR_ZONE_LRU_BASE + lru, nr_pages);
@@ -99,11 +102,178 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio)
 	return lru;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+static inline bool lru_gen_enabled(void)
+{
+	return true;
+}
+
+static inline bool lru_gen_in_fault(void)
+{
+	return current->in_lru_fault;
+}
+
+static inline int lru_gen_from_seq(unsigned long seq)
+{
+	return seq % MAX_NR_GENS;
+}
+
+static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
+{
+	unsigned long max_seq = lruvec->lrugen.max_seq;
+
+	VM_BUG_ON(gen >= MAX_NR_GENS);
+
+	/* see the comment on MIN_NR_GENS */
+	return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
+}
+
+static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *folio,
+				       int old_gen, int new_gen)
+{
+	int type = folio_is_file_lru(folio);
+	int zone = folio_zonenum(folio);
+	int delta = folio_nr_pages(folio);
+	enum lru_list lru = type * LRU_INACTIVE_FILE;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON(old_gen != -1 && old_gen >= MAX_NR_GENS);
+	VM_BUG_ON(new_gen != -1 && new_gen >= MAX_NR_GENS);
+	VM_BUG_ON(old_gen == -1 && new_gen == -1);
+
+	if (old_gen >= 0)
+		WRITE_ONCE(lrugen->nr_pages[old_gen][type][zone],
+			   lrugen->nr_pages[old_gen][type][zone] - delta);
+	if (new_gen >= 0)
+		WRITE_ONCE(lrugen->nr_pages[new_gen][type][zone],
+			   lrugen->nr_pages[new_gen][type][zone] + delta);
+
+	/* addition */
+	if (old_gen < 0) {
+		if (lru_gen_is_active(lruvec, new_gen))
+			lru += LRU_ACTIVE;
+		__update_lru_size(lruvec, lru, zone, delta);
+		return;
+	}
+
+	/* deletion */
+	if (new_gen < 0) {
+		if (lru_gen_is_active(lruvec, old_gen))
+			lru += LRU_ACTIVE;
+		__update_lru_size(lruvec, lru, zone, -delta);
+		return;
+	}
+}
+
+static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+	int gen;
+	unsigned long old_flags, new_flags;
+	int type = folio_is_file_lru(folio);
+	int zone = folio_zonenum(folio);
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	if (folio_test_unevictable(folio))
+		return false;
+	/*
+	 * There are three common cases for this page:
+	 * 1. If it's hot, e.g., freshly faulted in or previously hot and
+	 *    migrated, add it to the youngest generation.
+	 * 2. If it's cold but can't be evicted immediately, i.e., an anon page
+	 *    not in swapcache or a dirty page pending writeback, add it to the
+	 *    second oldest generation.
+	 * 3. Everything else (clean, cold) is added to the oldest generation.
+	 */
+	if (folio_test_active(folio))
+		gen = lru_gen_from_seq(lrugen->max_seq);
+	else if ((type == LRU_GEN_ANON && !folio_test_swapcache(folio)) ||
+		 (folio_test_reclaim(folio) &&
+		  (folio_test_dirty(folio) || folio_test_writeback(folio))))
+		gen = lru_gen_from_seq(lrugen->min_seq[type] + 1);
+	else
+		gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+		VM_BUG_ON_FOLIO(new_flags & LRU_GEN_MASK, folio);
+
+		/* see the comment on MIN_NR_GENS */
+		new_flags &= ~(LRU_GEN_MASK | BIT(PG_active));
+		new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
+	} while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+
+	lru_gen_update_size(lruvec, folio, -1, gen);
+	/* for folio_rotate_reclaimable() */
+	if (reclaiming)
+		list_add_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
+	else
+		list_add(&folio->lru, &lrugen->lists[gen][type][zone]);
+
+	return true;
+}
+
+static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+	int gen;
+	unsigned long old_flags, new_flags;
+
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+		if (!(new_flags & LRU_GEN_MASK))
+			return false;
+
+		VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
+		VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
+
+		gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+
+		new_flags &= ~LRU_GEN_MASK;
+		/* for shrink_page_list() */
+		if (reclaiming)
+			new_flags &= ~(BIT(PG_referenced) | BIT(PG_reclaim));
+		else if (lru_gen_is_active(lruvec, gen))
+			new_flags |= BIT(PG_active);
+	} while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+
+	lru_gen_update_size(lruvec, folio, gen, -1);
+	list_del(&folio->lru);
+
+	return true;
+}
+
+#else
+
+static inline bool lru_gen_enabled(void)
+{
+	return false;
+}
+
+static inline bool lru_gen_in_fault(void)
+{
+	return false;
+}
+
+static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+	return false;
+}
+
+static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+	return false;
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 static __always_inline
 void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
 {
 	enum lru_list lru = folio_lru_list(folio);
 
+	if (lru_gen_add_folio(lruvec, folio, false))
+		return;
+
 	update_lru_size(lruvec, lru, folio_zonenum(folio),
 			folio_nr_pages(folio));
 	list_add(&folio->lru, &lruvec->lists[lru]);
@@ -120,6 +290,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
 {
 	enum lru_list lru = folio_lru_list(folio);
 
+	if (lru_gen_add_folio(lruvec, folio, true))
+		return;
+
 	update_lru_size(lruvec, lru, folio_zonenum(folio),
 			folio_nr_pages(folio));
 	list_add_tail(&folio->lru, &lruvec->lists[lru]);
@@ -134,6 +307,9 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
 static __always_inline
 void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
 {
+	if (lru_gen_del_folio(lruvec, folio, false))
+		return;
+
 	list_del(&folio->lru);
 	update_lru_size(lruvec, folio_lru_list(folio), folio_zonenum(folio),
 			-folio_nr_pages(folio));
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index aed44e9b5d89..a88e27d85693 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -303,6 +303,96 @@ enum lruvec_flags {
 					 */
 };
 
+#endif /* !__GENERATING_BOUNDS_H */
+
+/*
+ * Evictable pages are divided into multiple generations. The youngest and the
+ * oldest generation numbers, max_seq and min_seq, are monotonically increasing.
+ * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An
+ * offset within MAX_NR_GENS, gen, indexes the LRU list of the corresponding
+ * generation. The gen counter in folio->flags stores gen+1 while a page is on
+ * one of lrugen->lists[]. Otherwise it stores 0.
+ *
+ * A page is added to the youngest generation on faulting. The aging needs to
+ * check the accessed bit at least twice before handing this page over to the
+ * eviction. The first check takes care of the accessed bit set on the initial
+ * fault; the second check makes sure this page hasn't been used since then.
+ * This process, AKA second chance, requires a minimum of two generations,
+ * hence MIN_NR_GENS. And to maintain ABI compatibility with the active/inactive
+ * LRU, these two generations are considered active; the rest of generations, if
+ * they exist, are considered inactive. See lru_gen_is_active(). PG_active is
+ * always cleared while a page is on one of lrugen->lists[] so that the aging
+ * need not worry about it. And it's set again when a page considered active
+ * is isolated for non-reclaiming purposes, e.g., migration. See
+ * lru_gen_add_folio() and lru_gen_del_folio().
+ *
+ * MAX_NR_GENS is set to 4 so that the multi-gen LRU has twice the categories
+ * of the active/inactive LRU.
+ *
+ */
+#define MIN_NR_GENS		2U
+#define MAX_NR_GENS		4U
+
+#ifndef __GENERATING_BOUNDS_H
+
+struct lruvec;
+
+#define LRU_GEN_MASK		((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
+#define LRU_REFS_MASK		((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
+
+#ifdef CONFIG_LRU_GEN
+
+enum {
+	LRU_GEN_ANON,
+	LRU_GEN_FILE,
+};
+
+/*
+ * The youngest generation number is stored in max_seq for both anon and file
+ * types as they are aged on an equal footing. The oldest generation numbers are
+ * stored in min_seq[] separately for anon and file types as clean file pages
+ * can be evicted regardless of swap constraints.
+ *
+ * Normally anon and file min_seq are in sync. But if swapping is constrained,
+ * e.g., out of swap space, file min_seq is allowed to advance and leave anon
+ * min_seq behind.
+ */
+struct lru_gen_struct {
+	/* the aging increments the youngest generation number */
+	unsigned long max_seq;
+	/* the eviction increments the oldest generation numbers */
+	unsigned long min_seq[ANON_AND_FILE];
+	/* the multi-gen LRU lists */
+	struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	/* the sizes of the above lists */
+	unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+};
+
+void lru_gen_init_lruvec(struct lruvec *lruvec);
+
+#ifdef CONFIG_MEMCG
+void lru_gen_init_memcg(struct mem_cgroup *memcg);
+void lru_gen_exit_memcg(struct mem_cgroup *memcg);
+#endif
+
+#else /* !CONFIG_LRU_GEN */
+
+static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
+{
+}
+
+#ifdef CONFIG_MEMCG
+static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
+{
+}
+
+static inline void lru_gen_exit_memcg(struct mem_cgroup *memcg)
+{
+}
+#endif
+
+#endif /* CONFIG_LRU_GEN */
+
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
 	/* per lruvec lru_lock for memcg */
@@ -320,6 +410,10 @@ struct lruvec {
 	unsigned long			refaults[ANON_AND_FILE];
 	/* Various lruvec state flags (enum lruvec_flags) */
 	unsigned long			flags;
+#ifdef CONFIG_LRU_GEN
+	/* evictable pages divided into generations */
+	struct lru_gen_struct		lrugen;
+#endif
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
 #endif
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index ef1e3e736e14..c1946cdb845f 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -55,7 +55,8 @@
 #define SECTIONS_WIDTH		0
 #endif
 
-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_REFS_WIDTH + SECTIONS_WIDTH + NODES_SHIFT \
+	<= BITS_PER_LONG - NR_PAGEFLAGS
 #define NODES_WIDTH		NODES_SHIFT
 #elif defined(CONFIG_SPARSEMEM_VMEMMAP)
 #error "Vmemmap: No space for nodes field in page flags"
@@ -89,8 +90,8 @@
 #define LAST_CPUPID_SHIFT 0
 #endif
 
-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT \
-	<= BITS_PER_LONG - NR_PAGEFLAGS
+#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_REFS_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
+	KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
 #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
 #else
 #define LAST_CPUPID_WIDTH 0
@@ -100,8 +101,8 @@
 #define LAST_CPUPID_NOT_IN_PAGE_FLAGS
 #endif
 
-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH \
-	> BITS_PER_LONG - NR_PAGEFLAGS
+#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_REFS_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
+	KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
 #error "Not enough bits in page flags"
 #endif
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1c3b6e5c8bfd..a95518ca98eb 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -935,7 +935,7 @@ __PAGEFLAG(Isolated, isolated, PF_ANY);
 	 1UL << PG_private	| 1UL << PG_private_2	|	\
 	 1UL << PG_writeback	| 1UL << PG_reserved	|	\
 	 1UL << PG_slab		| 1UL << PG_active 	|	\
-	 1UL << PG_unevictable	| __PG_MLOCKED)
+	 1UL << PG_unevictable	| __PG_MLOCKED | LRU_GEN_MASK)
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.
@@ -946,7 +946,7 @@ __PAGEFLAG(Isolated, isolated, PF_ANY);
  * alloc-free cycle to prevent from reusing the page.
  */
 #define PAGE_FLAGS_CHECK_AT_PREP	\
-	(PAGEFLAGS_MASK & ~__PG_HWPOISON)
+	((PAGEFLAGS_MASK & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK)
 
 #define PAGE_FLAGS_PRIVATE				\
 	(1UL << PG_private | 1UL << PG_private_2)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 75ba8aa60248..e7fe784b11aa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -914,6 +914,10 @@ struct task_struct {
 #ifdef CONFIG_MEMCG
 	unsigned			in_user_fault:1;
 #endif
+#ifdef CONFIG_LRU_GEN
+	/* whether the LRU algorithm may apply to this access */
+	unsigned			in_lru_fault:1;
+#endif
 #ifdef CONFIG_COMPAT_BRK
 	unsigned			brk_randomized:1;
 #endif
diff --git a/kernel/bounds.c b/kernel/bounds.c
index 9795d75b09b2..e08fb89f87f4 100644
--- a/kernel/bounds.c
+++ b/kernel/bounds.c
@@ -22,6 +22,13 @@ int main(void)
 	DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
 #endif
 	DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t));
+#ifdef CONFIG_LRU_GEN
+	DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1));
+	DEFINE(LRU_REFS_WIDTH, 0);
+#else
+	DEFINE(LRU_GEN_WIDTH, 0);
+	DEFINE(LRU_REFS_WIDTH, 0);
+#endif
 	/* End of constants */
 
 	return 0;
diff --git a/mm/Kconfig b/mm/Kconfig
index 3326ee3903f3..747ab1690bcf 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -892,6 +892,16 @@ config ANON_VMA_NAME
 	  area from being merged with adjacent virtual memory areas due to the
 	  difference in their name.
 
+# the multi-gen LRU {
+config LRU_GEN
+	bool "Multi-Gen LRU"
+	depends on MMU
+	# the following options can use up the spare bits in page flags
+	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
+	help
+	  A high performance LRU implementation for memory overcommit.
+# }
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 406a3c28c026..3df389fd307f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2364,7 +2364,8 @@ static void __split_huge_page_tail(struct page *head, int tail,
 #ifdef CONFIG_64BIT
 			 (1L << PG_arch_2) |
 #endif
-			 (1L << PG_dirty)));
+			 (1L << PG_dirty) |
+			 LRU_GEN_MASK | LRU_REFS_MASK));
 
 	/* ->mapping in first tail page is compound_mapcount */
 	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 36e9f38c919d..3fcbfeda259b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5121,6 +5121,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 
 static void mem_cgroup_free(struct mem_cgroup *memcg)
 {
+	lru_gen_exit_memcg(memcg);
 	memcg_wb_domain_exit(memcg);
 	__mem_cgroup_free(memcg);
 }
@@ -5180,6 +5181,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	memcg->deferred_split_queue.split_queue_len = 0;
 #endif
 	idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
+	lru_gen_init_memcg(memcg);
 	return memcg;
 fail:
 	mem_cgroup_id_remove(memcg);
diff --git a/mm/memory.c b/mm/memory.c
index a7379196a47e..d27e5f1a2533 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4754,6 +4754,27 @@ static inline void mm_account_fault(struct pt_regs *regs,
 		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
 }
 
+#ifdef CONFIG_LRU_GEN
+static void lru_gen_enter_fault(struct vm_area_struct *vma)
+{
+	/* the LRU algorithm doesn't apply to sequential or random reads */
+	current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ));
+}
+
+static void lru_gen_exit_fault(void)
+{
+	current->in_lru_fault = false;
+}
+#else
+static void lru_gen_enter_fault(struct vm_area_struct *vma)
+{
+}
+
+static void lru_gen_exit_fault(void)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
 /*
  * By the time we get here, we already hold the mm semaphore
  *
@@ -4785,11 +4806,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	if (flags & FAULT_FLAG_USER)
 		mem_cgroup_enter_user_fault();
 
+	lru_gen_enter_fault(vma);
+
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
 	else
 		ret = __handle_mm_fault(vma, address, flags);
 
+	lru_gen_exit_fault();
+
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_exit_user_fault();
 		/*
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 9ddaf0e1b0ab..0d7b2bd2454a 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -65,14 +65,16 @@ void __init mminit_verify_pageflags_layout(void)
 
 	shift = 8 * sizeof(unsigned long);
 	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH
-		- LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH;
+		- LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH - LRU_GEN_WIDTH - LRU_REFS_WIDTH;
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
-		"Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Flags %d\n",
+		"Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Gen %d Tier %d Flags %d\n",
 		SECTIONS_WIDTH,
 		NODES_WIDTH,
 		ZONES_WIDTH,
 		LAST_CPUPID_WIDTH,
 		KASAN_TAG_WIDTH,
+		LRU_GEN_WIDTH,
+		LRU_REFS_WIDTH,
 		NR_PAGEFLAGS);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
 		"Section %d Node %d Zone %d Lastcpupid %d Kasantag %d\n",
diff --git a/mm/mmzone.c b/mm/mmzone.c
index eb89d6e018e2..2ec0d7793424 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -81,6 +81,8 @@ void lruvec_init(struct lruvec *lruvec)
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
+
+	lru_gen_init_lruvec(lruvec);
 }
 
 #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)
diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56..e5f2ab3dab4a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -462,6 +462,11 @@ void folio_add_lru(struct folio *folio)
 	VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 
+	/* see the comment in lru_gen_add_folio() */
+	if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
+	    lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
+		folio_set_active(folio);
+
 	folio_get(folio);
 	local_lock(&lru_pvecs.lock);
 	pvec = this_cpu_ptr(&lru_pvecs.lru_add);
@@ -563,7 +568,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 
 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageActive(page) && !PageUnevictable(page)) {
+	if (!PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
 		int nr_pages = thp_nr_pages(page);
 
 		del_page_from_lru_list(page, lruvec);
@@ -677,7 +682,7 @@ void deactivate_file_page(struct page *page)
  */
 void deactivate_page(struct page *page)
 {
-	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+	if (PageLRU(page) && !PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
 		struct pagevec *pvec;
 
 		local_lock(&lru_pvecs.lock);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8e744cdf802f..65eb668abf2d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3042,6 +3042,79 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
 	return can_demote(pgdat->node_id, sc);
 }
 
+#ifdef CONFIG_LRU_GEN
+
+/******************************************************************************
+ *                          shorthand helpers
+ ******************************************************************************/
+
+#define for_each_gen_type_zone(gen, type, zone)				\
+	for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)			\
+		for ((type) = 0; (type) < ANON_AND_FILE; (type)++)	\
+			for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
+
+static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
+{
+	struct pglist_data *pgdat = NODE_DATA(nid);
+
+#ifdef CONFIG_MEMCG
+	if (memcg) {
+		struct lruvec *lruvec = &memcg->nodeinfo[nid]->lruvec;
+
+		/* for hotadd_new_pgdat() */
+		if (!lruvec->pgdat)
+			lruvec->pgdat = pgdat;
+
+		return lruvec;
+	}
+#endif
+	return pgdat ? &pgdat->__lruvec : NULL;
+}
+
+/******************************************************************************
+ *                          initialization
+ ******************************************************************************/
+
+void lru_gen_init_lruvec(struct lruvec *lruvec)
+{
+	int gen, type, zone;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	lrugen->max_seq = MIN_NR_GENS + 1;
+
+	for_each_gen_type_zone(gen, type, zone)
+		INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
+}
+
+#ifdef CONFIG_MEMCG
+void lru_gen_init_memcg(struct mem_cgroup *memcg)
+{
+}
+
+void lru_gen_exit_memcg(struct mem_cgroup *memcg)
+{
+	int nid;
+
+	for_each_node(nid) {
+		struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+		VM_BUG_ON(memchr_inv(lruvec->lrugen.nr_pages, 0,
+				     sizeof(lruvec->lrugen.nr_pages)));
+	}
+}
+#endif
+
+static int __init init_lru_gen(void)
+{
+	BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS);
+	BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
+
+	return 0;
+}
+late_initcall(init_lru_gen);
+
+#endif /* CONFIG_LRU_GEN */
+
 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
 	unsigned long nr[NR_LRU_LISTS];
-- 
2.35.1.616.g0bdcbb4464-goog



* [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation
  2022-03-09  2:12 ` Yu Zhao
@ 2022-03-09  2:12   ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-09  2:12 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
	Jonathan Corbet, Matthew Wilcox, Mel Gorman, Michael Larabel,
	Michal Hocko, Mike Rapoport, Rik van Riel, Vlastimil Babka,
	Will Deacon, Ying Huang, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, page-reclaim, x86, Yu Zhao, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multi-gen LRU, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
LRU, as usual.

The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
promotes hot pages to the youngest generation when it finds them
accessed through page tables; the demotion of cold pages happens
consequently when it increments max_seq. The aging has the complexity
O(nr_hot_pages), since it is only interested in hot pages. Promotion
in the aging path does not require any LRU list operations, only the
updates of the gen counter and lrugen->nr_pages[]; demotion, unless as
the result of the increment of max_seq, requires LRU list operations,
e.g., lru_deactivate_fn().
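
As a rough illustration of the sequence-number bookkeeping described
above, the following userspace sketch models a single page type. It is
not kernel code: struct gen_window, nr_gens() and age() are
hypothetical names, and the locking, the page lists and the per-type
min_seq[] of the real implementation are left out. gen_from_seq()
follows lru_gen_from_seq().

```c
#include <assert.h>

#define MIN_NR_GENS 2U
#define MAX_NR_GENS 4U

/* Hypothetical, simplified stand-in for struct lru_gen_struct (one type only). */
struct gen_window {
	unsigned long max_seq;	/* youngest generation */
	unsigned long min_seq;	/* oldest generation */
};

/* Map a sequence number to a slot in a ring of MAX_NR_GENS lists. */
static unsigned int gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}

/* Number of generations currently covered by the window. */
static unsigned int nr_gens(const struct gen_window *w)
{
	assert(w->max_seq >= w->min_seq);
	return w->max_seq - w->min_seq + 1;
}

/*
 * One aging step: if the ring is full, retire the oldest generation
 * first, then open a new youngest generation.
 */
static void age(struct gen_window *w)
{
	if (nr_gens(w) == MAX_NR_GENS)
		w->min_seq++;
	w->max_seq++;
}
```

Starting from the initial state set by lru_gen_init_lruvec()
(max_seq = MIN_NR_GENS + 1, min_seq = 0), one call to age() leaves
min_seq = 1 and max_seq = 4, and the new youngest generation reuses
slot 0 of the ring.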

The eviction consumes old generations. Given an lruvec, it increments
min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types
are available from the same generation.
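
The comparison at the heart of the feedback loop can be tried out in
userspace. The sketch below copies the arithmetic of
positive_ctrl_err() from this patch. MIN_LRU_BATCH is hard-coded to 64
on the assumption of a 64-bit build (the patch defines it as
BITS_PER_LONG), and the assert on gain is a guard added for the sketch
only.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_LRU_BATCH 64UL	/* assumption: BITS_PER_LONG on a 64-bit build */

struct ctrl_pos {
	unsigned long refaulted;
	unsigned long total;	/* evicted + protected */
	int gain;
};

/*
 * Return true if the PV has a limited number of refaults or a lower
 * refaulted/total than the SP. The two ratios are compared by
 * cross-multiplication, so no division is needed.
 */
static bool positive_ctrl_err(const struct ctrl_pos *sp, const struct ctrl_pos *pv)
{
	/* sketch-only guard: a negative gain would wrap when converted */
	assert(sp->gain >= 0 && pv->gain >= 0);

	return pv->refaulted < MIN_LRU_BATCH ||
	       pv->refaulted * (sp->total + MIN_LRU_BATCH) * sp->gain <=
	       (sp->refaulted + 1) * pv->total * pv->gain;
}
```

With sp = {100, 1000, 1}, a PV of {200, 1000, 1} refaults about twice
as often as the SP and the function returns false; a PV of
{100, 2000, 1} refaults about half as often and it returns true.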

Each generation is divided into multiple tiers. Tiers represent
different ranges of numbers of accesses through file descriptors. A
page accessed N times through file descriptors is in tier
order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in folio->flags. In contrast to moving across generations, which
requires the LRU lock, moving across tiers only involves operations on
folio->flags. The feedback loop also monitors refaults over all tiers
and decides when to protect pages in which tiers (N>1), using the
first tier (N=0,1) as a baseline. The first tier contains single-use
unmapped clean pages, which are most likely the best choices to evict. The
eviction moves a page to the next generation, i.e., min_seq+1, if the
feedback loop decides so. This approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
   inferring whether pages accessed multiple times through file
   descriptors are statistically hot and thus worth protecting in the
   eviction path.
2. It takes pages accessed through page tables into account and avoids
   overprotecting pages accessed multiple times through file
   descriptors. (Pages accessed through page tables are in the first
   tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
   twice through file descriptors, when under heavy buffered I/O
   workloads.
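
The mapping from access counts to tiers can be sketched in userspace
as follows. order_base_2_sketch() is a stand-in for the kernel's
order_base_2(), and tier_from_accesses() is a hypothetical helper, not
a function from this patch. The clamp models the saturation of the
counter kept in folio->flags, so every count above the top of the
range lands in the last tier.

```c
#include <assert.h>

#define MAX_NR_TIERS 4U

/* ceil(log2(n)), with n = 0 and n = 1 both mapping to 0 */
static unsigned int order_base_2_sketch(unsigned long n)
{
	unsigned int order = 0;

	while ((1UL << order) < n)
		order++;
	return order;
}

/* Tier of a page accessed n times through file descriptors. */
static unsigned int tier_from_accesses(unsigned long n)
{
	/* largest count that still needs distinguishing: 2^(MAX_NR_TIERS-1) */
	const unsigned long cap = 1UL << (MAX_NR_TIERS - 1);
	unsigned int tier = order_base_2_sketch(n < cap ? n : cap);

	assert(tier < MAX_NR_TIERS);
	return tier;
}
```

Under these assumptions N=0,1 map to tier 0, N=2 to tier 1, N=3,4 to
tier 2, and N>=5 to tier 3.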

Server benchmark results:
  Single workload:
    fio (buffered I/O): +[47, 49]%
                IOPS         BW
      5.17-rc2: 2242k        8759MiB/s
      patch1-5: 3321k        12.7GiB/s

  Single workload:
    memcached (anon): +[101, 105]%
                Ops/sec      KB/sec
      5.17-rc2: 476771.79    18544.31
      patch1-5: 972526.07    37826.95

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was only used as a ram disk to reduce the variance in the
    results.

    patch drivers/block/brd.c <<EOF
    99,100c99,100
    < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
    < 	page = alloc_page(gfp_flags);
    ---
    > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
    > 	page = alloc_pages_node(1, gfp_flags, 0);
    EOF

    cat >>/etc/systemd/system.conf <<EOF
    CPUAffinity=numa
    NUMAPolicy=bind
    NUMAMask=0
    EOF

    cat >>/etc/memcached.conf <<EOF
    -m 184320
    -s /var/run/memcached/memcached.sock
    -a 0766
    -t 36
    -B binary
    EOF

    cat fio.sh
    modprobe brd rd_nr=1 rd_size=113246208
    mkfs.ext4 /dev/ram0
    mount -t ext4 /dev/ram0 /mnt

    mkdir /sys/fs/cgroup/user.slice/test
    echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
    echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
    fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=5m --group_reporting

    cat memcached.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.17-rc2
      38.05%  page_vma_mapped_walk
      20.86%  lzo1x_1_do_compress (real work)
       6.16%  do_raw_spin_lock
       4.61%  _raw_spin_unlock_irq
       2.20%  vma_interval_tree_iter_next
       2.19%  vma_interval_tree_subtree_search
       2.15%  page_referenced_one
       1.93%  anon_vma_interval_tree_iter_first
       1.65%  ptep_clear_flush
       1.00%  __zram_bvec_write

    patch1-5
      39.73%  lzo1x_1_do_compress (real work)
      14.96%  page_vma_mapped_walk
       6.97%  _raw_spin_unlock_irq
       3.07%  do_raw_spin_lock
       2.53%  anon_vma_interval_tree_iter_first
       2.04%  ptep_clear_flush
       1.82%  __zram_bvec_write
       1.76%  __anon_vma_interval_tree_subtree_search
       1.57%  memmove
       1.45%  free_unref_page_list

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    Chrome OS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 include/linux/mm.h        |   1 +
 include/linux/mm_inline.h |  24 ++
 include/linux/mmzone.h    |  42 ++
 kernel/bounds.c           |   2 +-
 mm/Kconfig                |   9 +
 mm/swap.c                 |  42 ++
 mm/vmscan.c               | 786 +++++++++++++++++++++++++++++++++++++-
 mm/workingset.c           | 119 +++++-
 8 files changed, 1021 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c1162659d824..1e3e6dd90c0f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -227,6 +227,7 @@ int overcommit_policy_handler(struct ctl_table *, int, void *, size_t *,
 #define PAGE_ALIGNED(addr)	IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)
 
 #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
+#define lru_to_folio(head) (list_entry((head)->prev, struct folio, lru))
 
 void setup_initial_init_mm(void *start_code, void *end_code,
 			   void *end_data, void *brk);
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index e3594171b421..15a04a9b5560 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -119,6 +119,19 @@ static inline int lru_gen_from_seq(unsigned long seq)
 	return seq % MAX_NR_GENS;
 }
 
+static inline int lru_hist_from_seq(unsigned long seq)
+{
+	return seq % NR_HIST_GENS;
+}
+
+static inline int lru_tier_from_refs(int refs)
+{
+	VM_BUG_ON(refs > BIT(LRU_REFS_WIDTH));
+
+	/* see the comment on MAX_NR_TIERS */
+	return order_base_2(refs + 1);
+}
+
 static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
 {
 	unsigned long max_seq = lruvec->lrugen.max_seq;
@@ -164,6 +177,15 @@ static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *foli
 		__update_lru_size(lruvec, lru, zone, -delta);
 		return;
 	}
+
+	/* promotion */
+	if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) {
+		__update_lru_size(lruvec, lru, zone, -delta);
+		__update_lru_size(lruvec, lru + LRU_ACTIVE, zone, delta);
+	}
+
+	/* demotion requires isolation, e.g., lru_deactivate_fn() */
+	VM_BUG_ON(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen));
 }
 
 static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
@@ -229,6 +251,8 @@ static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio,
 		gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
 
 		new_flags &= ~LRU_GEN_MASK;
+		if ((new_flags & LRU_REFS_FLAGS) != LRU_REFS_FLAGS)
+			new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
 		/* for shrink_page_list() */
 		if (reclaiming)
 			new_flags &= ~(BIT(PG_referenced) | BIT(PG_reclaim));
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a88e27d85693..307c5c24c7ac 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -333,6 +333,29 @@ enum lruvec_flags {
 #define MIN_NR_GENS		2U
 #define MAX_NR_GENS		4U
 
+/*
+ * Each generation is divided into multiple tiers. Tiers represent different
+ * ranges of numbers of accesses through file descriptors. A page accessed N
+ * times through file descriptors is in tier order_base_2(N). A page in the
+ * first tier (N=0,1) is marked by PG_referenced unless it was faulted in
+ * through page tables or read ahead. A page in any other tier (N>1) is marked
+ * by PG_referenced and PG_workingset. Two additional bits in folio->flags are
+ * required to support four tiers.
+ *
+ * In contrast to moving across generations which requires the LRU lock, moving
+ * across tiers only requires operations on folio->flags and therefore has a
+ * negligible cost in the buffered access path. In the eviction path,
+ * comparisons of refaulted/(evicted+protected) from the first tier and the
+ * rest infer whether pages accessed multiple times through file descriptors
+ * are statistically hot and thus worth protecting.
+ *
+ * MAX_NR_TIERS is set to 4 so that the multi-gen LRU can support twice the
+ * categories of the active/inactive LRU when tracking accesses through file
+ * descriptors.
+ */
+#define MAX_NR_TIERS		4U
+#define LRU_REFS_FLAGS		(BIT(PG_referenced) | BIT(PG_workingset))
+
 #ifndef __GENERATING_BOUNDS_H
 
 struct lruvec;
@@ -347,6 +370,16 @@ enum {
 	LRU_GEN_FILE,
 };
 
+#define MIN_LRU_BATCH		BITS_PER_LONG
+#define MAX_LRU_BATCH		(MIN_LRU_BATCH * 128)
+
+/* whether to keep historical stats from evicted generations */
+#ifdef CONFIG_LRU_GEN_STATS
+#define NR_HIST_GENS		MAX_NR_GENS
+#else
+#define NR_HIST_GENS		1U
+#endif
+
 /*
  * The youngest generation number is stored in max_seq for both anon and file
  * types as they are aged on an equal footing. The oldest generation numbers are
@@ -366,6 +399,15 @@ struct lru_gen_struct {
 	struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
 	/* the sizes of the above lists */
 	unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	/* the exponential moving average of refaulted */
+	unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
+	/* the exponential moving average of evicted+protected */
+	unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS];
+	/* the first tier doesn't need protection, hence the minus one */
+	unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1];
+	/* can be modified without holding the LRU lock */
+	atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
+	atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
 };
 
 void lru_gen_init_lruvec(struct lruvec *lruvec);
diff --git a/kernel/bounds.c b/kernel/bounds.c
index e08fb89f87f4..10dd9e6b03e5 100644
--- a/kernel/bounds.c
+++ b/kernel/bounds.c
@@ -24,7 +24,7 @@ int main(void)
 	DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t));
 #ifdef CONFIG_LRU_GEN
 	DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1));
-	DEFINE(LRU_REFS_WIDTH, 0);
+	DEFINE(LRU_REFS_WIDTH, MAX_NR_TIERS - 2);
 #else
 	DEFINE(LRU_GEN_WIDTH, 0);
 	DEFINE(LRU_REFS_WIDTH, 0);
diff --git a/mm/Kconfig b/mm/Kconfig
index 747ab1690bcf..804c2bca8205 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -900,6 +900,15 @@ config LRU_GEN
 	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
 	help
 	  A high performance LRU implementation for memory overcommit.
+
+config LRU_GEN_STATS
+	bool "Full stats for debugging"
+	depends on LRU_GEN
+	help
+	  Do not enable this option unless you plan to look at historical stats
+	  from evicted generations for debugging purposes.
+
+	  This option has a per-memcg and per-node memory overhead.
 # }
 
 source "mm/damon/Kconfig"
diff --git a/mm/swap.c b/mm/swap.c
index e5f2ab3dab4a..f5c0bcac8dcd 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -407,6 +407,43 @@ static void __lru_cache_activate_folio(struct folio *folio)
 	local_unlock(&lru_pvecs.lock);
 }
 
+#ifdef CONFIG_LRU_GEN
+static void folio_inc_refs(struct folio *folio)
+{
+	unsigned long refs;
+	unsigned long old_flags, new_flags;
+
+	if (folio_test_unevictable(folio))
+		return;
+
+	/* see the comment on MAX_NR_TIERS */
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+
+		if (!(new_flags & BIT(PG_referenced))) {
+			new_flags |= BIT(PG_referenced);
+			continue;
+		}
+
+		if (!(new_flags & BIT(PG_workingset))) {
+			new_flags |= BIT(PG_workingset);
+			continue;
+		}
+
+		refs = new_flags & LRU_REFS_MASK;
+		refs = min(refs + BIT(LRU_REFS_PGOFF), LRU_REFS_MASK);
+
+		new_flags &= ~LRU_REFS_MASK;
+		new_flags |= refs;
+	} while (new_flags != old_flags &&
+		 cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+}
+#else
+static void folio_inc_refs(struct folio *folio)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
 /*
  * Mark a page as having seen activity.
  *
@@ -419,6 +456,11 @@ static void __lru_cache_activate_folio(struct folio *folio)
  */
 void folio_mark_accessed(struct folio *folio)
 {
+	if (lru_gen_enabled()) {
+		folio_inc_refs(folio);
+		return;
+	}
+
 	if (!folio_test_referenced(folio)) {
 		folio_set_referenced(folio);
 	} else if (folio_test_unevictable(folio)) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 65eb668abf2d..91a827ff665d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1287,9 +1287,11 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
-		mem_cgroup_swapout(page, swap);
+
+		/* get a shadow entry before mem_cgroup_swapout() clears folio_memcg() */
 		if (reclaimed && !mapping_exiting(mapping))
 			shadow = workingset_eviction(page, target_memcg);
+		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page, swap, shadow);
 		xa_unlock_irq(&mapping->i_pages);
 		put_swap_page(page, swap);
@@ -2723,6 +2725,9 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
 	unsigned long file;
 	struct lruvec *target_lruvec;
 
+	if (lru_gen_enabled())
+		return;
+
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
 
 	/*
@@ -3048,11 +3053,38 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
  *                          shorthand helpers
  ******************************************************************************/
 
+#define DEFINE_MAX_SEQ(lruvec)						\
+	unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq)
+
+#define DEFINE_MIN_SEQ(lruvec)						\
+	unsigned long min_seq[ANON_AND_FILE] = {			\
+		READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_ANON]),	\
+		READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_FILE]),	\
+	}
+
 #define for_each_gen_type_zone(gen, type, zone)				\
 	for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)			\
 		for ((type) = 0; (type) < ANON_AND_FILE; (type)++)	\
 			for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
 
+static int folio_lru_gen(struct folio *folio)
+{
+	unsigned long flags = READ_ONCE(folio->flags);
+
+	return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+}
+
+static int folio_lru_tier(struct folio *folio)
+{
+	int refs;
+	unsigned long flags = READ_ONCE(folio->flags);
+
+	refs = (flags & LRU_REFS_FLAGS) == LRU_REFS_FLAGS ?
+	       ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + 1 : 0;
+
+	return lru_tier_from_refs(refs);
+}
+
 static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
 {
 	struct pglist_data *pgdat = NODE_DATA(nid);
@@ -3071,6 +3103,735 @@ static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
 	return pgdat ? &pgdat->__lruvec : NULL;
 }
 
+static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
+{
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+	if (!can_demote(pgdat->node_id, sc) &&
+	    mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
+		return 0;
+
+	return mem_cgroup_swappiness(memcg);
+}
+
+static int get_nr_gens(struct lruvec *lruvec, int type)
+{
+	return lruvec->lrugen.max_seq - lruvec->lrugen.min_seq[type] + 1;
+}
+
+static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
+{
+	/* see the comment on lru_gen_struct */
+	return get_nr_gens(lruvec, LRU_GEN_FILE) >= MIN_NR_GENS &&
+	       get_nr_gens(lruvec, LRU_GEN_FILE) <= get_nr_gens(lruvec, LRU_GEN_ANON) &&
+	       get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS;
+}
+
+/******************************************************************************
+ *                          refault feedback loop
+ ******************************************************************************/
+
+/*
+ * A feedback loop based on a Proportional-Integral-Derivative (PID) controller.
+ *
+ * The P term is refaulted/(evicted+protected) from a tier in the generation
+ * currently being evicted; the I term is the exponential moving average of the
+ * P term over the generations previously evicted, using the smoothing factor
+ * 1/2; the D term isn't supported.
+ *
+ * The setpoint (SP) is always the first tier of one type; the process variable
+ * (PV) is either any tier of the other type or any other tier of the same
+ * type.
+ *
+ * The error is the difference between the SP and the PV; the correction is to
+ * turn off protection when SP>PV or turn on protection when SP<PV.
+ *
+ * For future optimizations:
+ * 1. The D term may discount the other two terms over time so that long-lived
+ *    generations can resist stale information.
+ */
+struct ctrl_pos {
+	unsigned long refaulted;
+	unsigned long total;
+	int gain;
+};
+
+static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain,
+			  struct ctrl_pos *pos)
+{
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	int hist = lru_hist_from_seq(lrugen->min_seq[type]);
+
+	pos->refaulted = lrugen->avg_refaulted[type][tier] +
+			 atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+	pos->total = lrugen->avg_total[type][tier] +
+		     atomic_long_read(&lrugen->evicted[hist][type][tier]);
+	if (tier)
+		pos->total += lrugen->protected[hist][type][tier - 1];
+	pos->gain = gain;
+}
+
+static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover)
+{
+	int hist, tier;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	bool clear = carryover ? NR_HIST_GENS == 1 : NR_HIST_GENS > 1;
+	unsigned long seq = carryover ? lrugen->min_seq[type] : lrugen->max_seq + 1;
+
+	lockdep_assert_held(&lruvec->lru_lock);
+
+	if (!carryover && !clear)
+		return;
+
+	hist = lru_hist_from_seq(seq);
+
+	for (tier = 0; tier < MAX_NR_TIERS; tier++) {
+		if (carryover) {
+			unsigned long sum;
+
+			sum = lrugen->avg_refaulted[type][tier] +
+			      atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+			WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2);
+
+			sum = lrugen->avg_total[type][tier] +
+			      atomic_long_read(&lrugen->evicted[hist][type][tier]);
+			if (tier)
+				sum += lrugen->protected[hist][type][tier - 1];
+			WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2);
+		}
+
+		if (clear) {
+			atomic_long_set(&lrugen->refaulted[hist][type][tier], 0);
+			atomic_long_set(&lrugen->evicted[hist][type][tier], 0);
+			if (tier)
+				WRITE_ONCE(lrugen->protected[hist][type][tier - 1], 0);
+		}
+	}
+}
+
+static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
+{
+	/*
+	 * Return true if the PV has a limited number of refaults or a lower
+	 * refaulted/total than the SP.
+	 */
+	return pv->refaulted < MIN_LRU_BATCH ||
+	       pv->refaulted * (sp->total + MIN_LRU_BATCH) * sp->gain <=
+	       (sp->refaulted + 1) * pv->total * pv->gain;
+}
+
+/******************************************************************************
+ *                          the aging
+ ******************************************************************************/
+
+static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+	unsigned long old_flags, new_flags;
+	int type = folio_is_file_lru(folio);
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+		VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
+
+		new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+		new_gen = (old_gen + 1) % MAX_NR_GENS;
+
+		new_flags &= ~LRU_GEN_MASK;
+		new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
+		new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
+		/* for folio_end_writeback() */
+		if (reclaiming)
+			new_flags |= BIT(PG_reclaim);
+	} while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+
+	lru_gen_update_size(lruvec, folio, old_gen, new_gen);
+
+	return new_gen;
+}
+
+static void inc_min_seq(struct lruvec *lruvec)
+{
+	int type;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
+			continue;
+
+		reset_ctrl_pos(lruvec, type, true);
+		WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1);
+	}
+}
+
+static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
+{
+	int gen, type, zone;
+	bool success = false;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	DEFINE_MIN_SEQ(lruvec);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		while (min_seq[type] + MIN_NR_GENS <= lrugen->max_seq) {
+			gen = lru_gen_from_seq(min_seq[type]);
+
+			for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+				if (!list_empty(&lrugen->lists[gen][type][zone]))
+					goto next;
+			}
+
+			min_seq[type]++;
+		}
+next:
+		;
+	}
+
+	/* see the comment on lru_gen_struct */
+	if (can_swap) {
+		min_seq[LRU_GEN_ANON] = min(min_seq[LRU_GEN_ANON], min_seq[LRU_GEN_FILE]);
+		min_seq[LRU_GEN_FILE] = max(min_seq[LRU_GEN_ANON], lrugen->min_seq[LRU_GEN_FILE]);
+	}
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		if (min_seq[type] == lrugen->min_seq[type])
+			continue;
+
+		reset_ctrl_pos(lruvec, type, true);
+		WRITE_ONCE(lrugen->min_seq[type], min_seq[type]);
+		success = true;
+	}
+
+	return success;
+}
+
+static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
+{
+	int prev, next;
+	int type, zone;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	if (max_seq != lrugen->max_seq)
+		goto unlock;
+
+	inc_min_seq(lruvec);
+
+	/* update the active/inactive LRU sizes for compatibility */
+	prev = lru_gen_from_seq(lrugen->max_seq - 1);
+	next = lru_gen_from_seq(lrugen->max_seq + 1);
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+			enum lru_list lru = type * LRU_INACTIVE_FILE;
+			long delta = lrugen->nr_pages[prev][type][zone] -
+				     lrugen->nr_pages[next][type][zone];
+
+			if (!delta)
+				continue;
+
+			__update_lru_size(lruvec, lru, zone, delta);
+			__update_lru_size(lruvec, lru + LRU_ACTIVE, zone, -delta);
+		}
+	}
+
+	for (type = 0; type < ANON_AND_FILE; type++)
+		reset_ctrl_pos(lruvec, type, false);
+
+	/* make sure preceding modifications appear */
+	smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
+unlock:
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
+			     unsigned long *min_seq, bool can_swap, bool *need_aging)
+{
+	int gen, type, zone;
+	long old = 0;
+	long young = 0;
+	long total = 0;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		unsigned long seq;
+
+		for (seq = min_seq[type]; seq <= max_seq; seq++) {
+			long size = 0;
+
+			gen = lru_gen_from_seq(seq);
+
+			for (zone = 0; zone < MAX_NR_ZONES; zone++)
+				size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
+
+			total += size;
+			if (seq == max_seq)
+				young += size;
+			if (seq + MIN_NR_GENS == max_seq)
+				old += size;
+		}
+	}
+
+	/* try to spread pages out across MIN_NR_GENS+1 generations */
+	if (min_seq[LRU_GEN_FILE] + MIN_NR_GENS > max_seq)
+		*need_aging = true;
+	else if (min_seq[LRU_GEN_FILE] + MIN_NR_GENS < max_seq)
+		*need_aging = false;
+	else if (young * MIN_NR_GENS > total)
+		*need_aging = true;
+	else if (old * (MIN_NR_GENS + 2) < total)
+		*need_aging = true;
+	else
+		*need_aging = false;
+
+	return total > 0 ? total : 0;
+}
+
+static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+	bool need_aging;
+	long nr_to_scan;
+	int swappiness = get_swappiness(lruvec, sc);
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	mem_cgroup_calculate_protection(NULL, memcg);
+
+	if (mem_cgroup_below_min(memcg))
+		return;
+
+	nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, swappiness, &need_aging);
+	if (!nr_to_scan)
+		return;
+
+	nr_to_scan >>= sc->priority;
+
+	if (!mem_cgroup_online(memcg))
+		nr_to_scan++;
+
+	if (nr_to_scan && need_aging && (!mem_cgroup_below_low(memcg) || sc->memcg_low_reclaim))
+		inc_max_seq(lruvec, max_seq);
+}
+
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
+{
+	struct mem_cgroup *memcg;
+
+	VM_BUG_ON(!current_is_kswapd());
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+		age_lruvec(lruvec, sc);
+
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+}
+
+/******************************************************************************
+ *                          the eviction
+ ******************************************************************************/
+
+static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
+{
+	bool success;
+	int gen = folio_lru_gen(folio);
+	int type = folio_is_file_lru(folio);
+	int zone = folio_zonenum(folio);
+	int tier = folio_lru_tier(folio);
+	int delta = folio_nr_pages(folio);
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON_FOLIO(gen >= MAX_NR_GENS, folio);
+
+	if (!folio_evictable(folio)) {
+		success = lru_gen_del_folio(lruvec, folio, true);
+		VM_BUG_ON_FOLIO(!success, folio);
+		folio_set_unevictable(folio);
+		lruvec_add_folio(lruvec, folio);
+		__count_vm_events(UNEVICTABLE_PGCULLED, delta);
+		return true;
+	}
+
+	if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) {
+		success = lru_gen_del_folio(lruvec, folio, true);
+		VM_BUG_ON_FOLIO(!success, folio);
+		folio_set_swapbacked(folio);
+		lruvec_add_folio_tail(lruvec, folio);
+		return true;
+	}
+
+	if (tier > tier_idx) {
+		int hist = lru_hist_from_seq(lrugen->min_seq[type]);
+
+		gen = folio_inc_gen(lruvec, folio, false);
+		list_move_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
+
+		WRITE_ONCE(lrugen->protected[hist][type][tier - 1],
+			   lrugen->protected[hist][type][tier - 1] + delta);
+		__mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
+		return true;
+	}
+
+	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
+	    (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
+		gen = folio_inc_gen(lruvec, folio, true);
+		list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
+		return true;
+	}
+
+	return false;
+}
+
+static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc)
+{
+	bool success;
+
+	if (!sc->may_unmap && folio_mapped(folio))
+		return false;
+
+	if (!(sc->may_writepage && (sc->gfp_mask & __GFP_IO)) &&
+	    (folio_test_dirty(folio) ||
+	     (folio_test_anon(folio) && !folio_test_swapcache(folio))))
+		return false;
+
+	if (!folio_try_get(folio))
+		return false;
+
+	if (!folio_test_clear_lru(folio)) {
+		folio_put(folio);
+		return false;
+	}
+
+	success = lru_gen_del_folio(lruvec, folio, true);
+	VM_BUG_ON_FOLIO(!success, folio);
+
+	return true;
+}
+
+static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
+		       int type, int tier, struct list_head *list)
+{
+	int gen, zone;
+	enum vm_event_item item;
+	int sorted = 0;
+	int scanned = 0;
+	int isolated = 0;
+	int remaining = MAX_LRU_BATCH;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	VM_BUG_ON(!list_empty(list));
+
+	if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
+		return 0;
+
+	gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+	for (zone = sc->reclaim_idx; zone >= 0; zone--) {
+		LIST_HEAD(moved);
+		int skipped = 0;
+		struct list_head *head = &lrugen->lists[gen][type][zone];
+
+		while (!list_empty(head)) {
+			struct folio *folio = lru_to_folio(head);
+			int delta = folio_nr_pages(folio);
+
+			VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
+			VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
+			VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
+			VM_BUG_ON_FOLIO(folio_zonenum(folio) != zone, folio);
+
+			scanned += delta;
+
+			if (sort_folio(lruvec, folio, tier))
+				sorted += delta;
+			else if (isolate_folio(lruvec, folio, sc)) {
+				list_add(&folio->lru, list);
+				isolated += delta;
+			} else {
+				list_move(&folio->lru, &moved);
+				skipped += delta;
+			}
+
+			if (!--remaining || max(isolated, skipped) >= MIN_LRU_BATCH)
+				break;
+		}
+
+		if (skipped) {
+			list_splice(&moved, head);
+			__count_zid_vm_events(PGSCAN_SKIP, zone, skipped);
+		}
+
+		if (!remaining || isolated >= MIN_LRU_BATCH)
+			break;
+	}
+
+	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
+	if (!cgroup_reclaim(sc)) {
+		__count_vm_events(item, isolated);
+		__count_vm_events(PGREFILL, sorted);
+	}
+	__count_memcg_events(memcg, item, isolated);
+	__count_memcg_events(memcg, PGREFILL, sorted);
+	__count_vm_events(PGSCAN_ANON + type, isolated);
+
+	/*
+	 * There might not be eligible pages due to reclaim_idx, may_unmap and
+	 * may_writepage. Check the remaining to prevent livelock if there is no
+	 * progress.
+	 */
+	return isolated || !remaining ? scanned : 0;
+}
+
+static int get_tier_idx(struct lruvec *lruvec, int type)
+{
+	int tier;
+	struct ctrl_pos sp, pv;
+
+	/*
+	 * To leave a margin for fluctuations, use a larger gain factor (1:2).
+	 * This value is chosen because any other tier would have at least twice
+	 * as many refaults as the first tier.
+	 */
+	read_ctrl_pos(lruvec, type, 0, 1, &sp);
+	for (tier = 1; tier < MAX_NR_TIERS; tier++) {
+		read_ctrl_pos(lruvec, type, tier, 2, &pv);
+		if (!positive_ctrl_err(&sp, &pv))
+			break;
+	}
+
+	return tier - 1;
+}
+
+static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx)
+{
+	int type, tier;
+	struct ctrl_pos sp, pv;
+	int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness };
+
+	/*
+	 * Compare the first tier of anon with that of file to determine which
+	 * type to scan. Also need to compare other tiers of the selected type
+	 * with the first tier of the other type to determine the last tier (of
+	 * the selected type) to evict.
+	 */
+	read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp);
+	read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv);
+	type = positive_ctrl_err(&sp, &pv);
+
+	read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp);
+	for (tier = 1; tier < MAX_NR_TIERS; tier++) {
+		read_ctrl_pos(lruvec, type, tier, gain[type], &pv);
+		if (!positive_ctrl_err(&sp, &pv))
+			break;
+	}
+
+	*tier_idx = tier - 1;
+
+	return type;
+}
+
+static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
+			  int *type_scanned, struct list_head *list)
+{
+	int i;
+	int type;
+	int scanned;
+	int tier = -1;
+	DEFINE_MIN_SEQ(lruvec);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	/*
+	 * Try to make the obvious choice first. When anon and file are both
+	 * available from the same generation, interpret swappiness 1 as file
+	 * first and 200 as anon first.
+	 */
+	if (!swappiness)
+		type = LRU_GEN_FILE;
+	else if (min_seq[LRU_GEN_ANON] < min_seq[LRU_GEN_FILE])
+		type = LRU_GEN_ANON;
+	else if (swappiness == 1)
+		type = LRU_GEN_FILE;
+	else if (swappiness == 200)
+		type = LRU_GEN_ANON;
+	else
+		type = get_type_to_scan(lruvec, swappiness, &tier);
+
+	for (i = !swappiness; i < ANON_AND_FILE; i++) {
+		if (tier < 0)
+			tier = get_tier_idx(lruvec, type);
+
+		scanned = scan_folios(lruvec, sc, type, tier, list);
+		if (scanned)
+			break;
+
+		type = !type;
+		tier = -1;
+	}
+
+	*type_scanned = type;
+
+	return scanned;
+}
+
+static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
+{
+	int type;
+	int scanned;
+	int reclaimed;
+	LIST_HEAD(list);
+	struct folio *folio;
+	enum vm_event_item item;
+	struct reclaim_stat stat;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	scanned = isolate_folios(lruvec, sc, swappiness, &type, &list);
+
+	if (try_to_inc_min_seq(lruvec, swappiness))
+		scanned++;
+
+	if (get_nr_gens(lruvec, LRU_GEN_FILE) == MIN_NR_GENS)
+		scanned = 0;
+
+	spin_unlock_irq(&lruvec->lru_lock);
+
+	if (list_empty(&list))
+		return scanned;
+
+	reclaimed = shrink_page_list(&list, pgdat, sc, &stat, false);
+
+	/*
+	 * To avoid livelock, don't add rejected pages back to the same lists
+	 * they were isolated from. See lru_gen_add_folio().
+	 */
+	list_for_each_entry(folio, &list, lru) {
+		if (folio_test_reclaim(folio) &&
+		    (folio_test_dirty(folio) || folio_test_writeback(folio)))
+			folio_clear_active(folio);
+		else if (folio_is_file_lru(folio) || folio_test_swapcache(folio))
+			folio_set_active(folio);
+
+		folio_clear_referenced(folio);
+		folio_clear_workingset(folio);
+	}
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	move_pages_to_lru(lruvec, &list);
+
+	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
+	if (!cgroup_reclaim(sc))
+		__count_vm_events(item, reclaimed);
+	__count_memcg_events(memcg, item, reclaimed);
+	__count_vm_events(PGSTEAL_ANON + type, reclaimed);
+
+	spin_unlock_irq(&lruvec->lru_lock);
+
+	mem_cgroup_uncharge_list(&list);
+	free_unref_page_list(&list);
+
+	sc->nr_reclaimed += reclaimed;
+
+	return scanned;
+}
+
+static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool can_swap)
+{
+	bool need_aging;
+	long nr_to_scan;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	if (mem_cgroup_below_min(memcg) ||
+	    (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim))
+		return 0;
+
+	nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, can_swap, &need_aging);
+	if (!nr_to_scan)
+		return 0;
+
+	/* reset the priority if the target has been met */
+	nr_to_scan >>= sc->nr_reclaimed < sc->nr_to_reclaim ? sc->priority : DEF_PRIORITY;
+
+	if (!mem_cgroup_online(memcg))
+		nr_to_scan++;
+
+	if (!nr_to_scan)
+		return 0;
+
+	if (!need_aging)
+		return nr_to_scan;
+
+	/* leave the work to lru_gen_age_node() */
+	if (current_is_kswapd())
+		return 0;
+
+	/* try other memcgs before going to the aging path */
+	if (!cgroup_reclaim(sc) && !sc->force_deactivate) {
+		sc->skipped_deactivate = true;
+		return 0;
+	}
+
+	inc_max_seq(lruvec, max_seq);
+
+	return nr_to_scan;
+}
+
+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+	struct blk_plug plug;
+	long scanned = 0;
+
+	lru_add_drain();
+
+	blk_start_plug(&plug);
+
+	while (true) {
+		int delta;
+		int swappiness;
+		long nr_to_scan;
+
+		if (sc->may_swap)
+			swappiness = get_swappiness(lruvec, sc);
+		else if (!cgroup_reclaim(sc) && get_swappiness(lruvec, sc))
+			swappiness = 1;
+		else
+			swappiness = 0;
+
+		nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
+		if (!nr_to_scan)
+			break;
+
+		delta = evict_folios(lruvec, sc, swappiness);
+		if (!delta)
+			break;
+
+		scanned += delta;
+		if (scanned >= nr_to_scan)
+			break;
+
+		cond_resched();
+	}
+
+	blk_finish_plug(&plug);
+}
+
 /******************************************************************************
  *                          initialization
  ******************************************************************************/
@@ -3113,6 +3874,16 @@ static int __init init_lru_gen(void)
 };
 late_initcall(init_lru_gen);
 
+#else
+
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
+{
+}
+
+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+}
+
 #endif /* CONFIG_LRU_GEN */
 
 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
@@ -3126,6 +3897,11 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	struct blk_plug plug;
 	bool scan_adjusted;
 
+	if (lru_gen_enabled()) {
+		lru_gen_shrink_lruvec(lruvec, sc);
+		return;
+	}
+
 	get_scan_count(lruvec, sc, nr);
 
 	/* Record the original scan target for proportional adjustments later */
@@ -3630,6 +4406,9 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
 	struct lruvec *target_lruvec;
 	unsigned long refaults;
 
+	if (lru_gen_enabled())
+		return;
+
 	target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON);
 	target_lruvec->refaults[0] = refaults;
@@ -4000,6 +4779,11 @@ static void age_active_anon(struct pglist_data *pgdat,
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
+	if (lru_gen_enabled()) {
+		lru_gen_age_node(pgdat, sc);
+		return;
+	}
+
 	if (!can_age_anon_pages(pgdat, sc))
 		return;
 
diff --git a/mm/workingset.c b/mm/workingset.c
index 8c03afe1d67c..93ee00c7e4d1 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -187,7 +187,6 @@ static unsigned int bucket_order __read_mostly;
 static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
 			 bool workingset)
 {
-	eviction >>= bucket_order;
 	eviction &= EVICTION_MASK;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
@@ -212,10 +211,116 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 
 	*memcgidp = memcgid;
 	*pgdat = NODE_DATA(nid);
-	*evictionp = entry << bucket_order;
+	*evictionp = entry;
 	*workingsetp = workingset;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+static int folio_lru_refs(struct folio *folio)
+{
+	unsigned long flags = READ_ONCE(folio->flags);
+
+	BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
+
+	/* see the comment on MAX_NR_TIERS */
+	return flags & BIT(PG_workingset) ? (flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF : 0;
+}
+
+static void *lru_gen_eviction(struct folio *folio)
+{
+	int hist, tier;
+	unsigned long token;
+	unsigned long min_seq;
+	struct lruvec *lruvec;
+	struct lru_gen_struct *lrugen;
+	int type = folio_is_file_lru(folio);
+	int refs = folio_lru_refs(folio);
+	int delta = folio_nr_pages(folio);
+	bool workingset = folio_test_workingset(folio);
+	struct mem_cgroup *memcg = folio_memcg(folio);
+	struct pglist_data *pgdat = folio_pgdat(folio);
+
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	lrugen = &lruvec->lrugen;
+	min_seq = READ_ONCE(lrugen->min_seq[type]);
+	token = (min_seq << LRU_REFS_WIDTH) | refs;
+
+	hist = lru_hist_from_seq(min_seq);
+	tier = lru_tier_from_refs(refs + workingset);
+	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
+
+	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset);
+}
+
+static void lru_gen_refault(struct folio *folio, void *shadow)
+{
+	int hist, tier, refs;
+	int memcg_id;
+	bool workingset;
+	unsigned long token;
+	unsigned long min_seq;
+	struct lruvec *lruvec;
+	struct lru_gen_struct *lrugen;
+	struct mem_cgroup *memcg;
+	struct pglist_data *pgdat;
+	int type = folio_is_file_lru(folio);
+	int delta = folio_nr_pages(folio);
+
+	unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset);
+
+	refs = token & (BIT(LRU_REFS_WIDTH) - 1);
+	if (refs && !workingset)
+		return;
+
+	if (folio_pgdat(folio) != pgdat)
+		return;
+
+	rcu_read_lock();
+	memcg = folio_memcg_rcu(folio);
+	if (mem_cgroup_id(memcg) != memcg_id)
+		goto unlock;
+
+	token >>= LRU_REFS_WIDTH;
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	lrugen = &lruvec->lrugen;
+	min_seq = READ_ONCE(lrugen->min_seq[type]);
+	if (token != (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH)))
+		goto unlock;
+
+	hist = lru_hist_from_seq(min_seq);
+	tier = lru_tier_from_refs(refs + workingset);
+	atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
+	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
+
+	/*
+	 * Count the following two cases as stalls:
+	 * 1. For pages accessed through page tables, hotter pages pushed out
+	 *    hot pages which refaulted immediately.
+	 * 2. For pages accessed through file descriptors, numbers of accesses
+	 *    might have been beyond the limit.
+	 */
+	if (lru_gen_in_fault() || refs + workingset == BIT(LRU_REFS_WIDTH)) {
+		folio_set_workingset(folio);
+		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
+	}
+unlock:
+	rcu_read_unlock();
+}
+
+#else
+
+static void *lru_gen_eviction(struct folio *folio)
+{
+	return NULL;
+}
+
+static void lru_gen_refault(struct folio *folio, void *shadow)
+{
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 /**
  * workingset_age_nonresident - age non-resident entries as LRU ages
  * @lruvec: the lruvec that was aged
@@ -264,10 +369,14 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
+	if (lru_gen_enabled())
+		return lru_gen_eviction(page_folio(page));
+
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
 	eviction = atomic_long_read(&lruvec->nonresident_age);
+	eviction >>= bucket_order;
 	workingset_age_nonresident(lruvec, thp_nr_pages(page));
 	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
@@ -297,7 +406,13 @@ void workingset_refault(struct folio *folio, void *shadow)
 	int memcgid;
 	long nr;
 
+	if (lru_gen_enabled()) {
+		lru_gen_refault(folio, shadow);
+		return;
+	}
+
 	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
+	eviction <<= bucket_order;
 
 	rcu_read_lock();
 	/*
-- 
2.35.1.616.g0bdcbb4464-goog



* [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation
@ 2022-03-09  2:12   ` Yu Zhao
From: Yu Zhao @ 2022-03-09  2:12 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
	Jonathan Corbet, Matthew Wilcox, Mel Gorman, Michael Larabel,
	Michal Hocko, Mike Rapoport, Rik van Riel, Vlastimil Babka,
	Will Deacon, Ying Huang, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, page-reclaim, x86, Yu Zhao, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multi-gen LRU, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
LRU, as usual.

The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
promotes hot pages to the youngest generation when it finds them
accessed through page tables; the demotion of cold pages happens
consequently when it increments max_seq. The aging has the complexity
O(nr_hot_pages), since it is only interested in hot pages. Promotion
in the aging path does not require any LRU list operations, only the
updates of the gen counter and lrugen->nr_pages[]; demotion, unless it
results from incrementing max_seq, requires LRU list operations,
e.g., lru_deactivate_fn().
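
As an illustration of when the aging runs, the decision made at the end
of get_nr_evictable() can be restated as a standalone userspace sketch.
The toy_ name and the plain-integer MIN_NR_GENS are assumptions made
for this example only; the kernel function additionally sums
lrugen->nr_pages[] to obtain young, old and total.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_NR_GENS	2	/* same value as in the patch */

/*
 * Toy restatement of the need_aging decision (hypothetical helper).
 * young: pages in the youngest generation (seq == max_seq)
 * old:   pages in the generation MIN_NR_GENS behind max_seq
 * total: pages in all generations that were counted
 */
static bool toy_need_aging(unsigned long min_seq, unsigned long max_seq,
			   long young, long old, long total)
{
	assert(min_seq <= max_seq);

	/* fewer than MIN_NR_GENS+1 generations: always age */
	if (min_seq + MIN_NR_GENS > max_seq)
		return true;
	/* more than MIN_NR_GENS+1 generations: no aging needed yet */
	if (min_seq + MIN_NR_GENS < max_seq)
		return false;
	/* exactly MIN_NR_GENS+1: age if the youngest holds too much */
	if (young * MIN_NR_GENS > total)
		return true;
	/* or if the oldest of the MIN_NR_GENS+1 holds too little */
	if (old * (MIN_NR_GENS + 2) < total)
		return true;
	return false;
}
```

With three generations (min_seq=0, max_seq=2) and 100 pages in total,
the sketch asks for aging when the youngest generation holds 60 pages,
and does not when the youngest and the oldest hold 30 pages each.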

The eviction consumes old generations. Given an lruvec, it increments
min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types
are available from the same generation.
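
The min_seq side can be sketched in the same way. This is a simplified
userspace model, not the kernel's try_to_inc_min_seq(): struct
toy_lrugen and toy_inc_min_seq() are hypothetical, track a single type
and collapse all zones into one counter per generation. It only shows
the two rules stated above and in scan_folios(): an empty oldest
generation is retired, and at least MIN_NR_GENS generations are kept.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_NR_GENS	2	/* same values as in the patch */
#define MAX_NR_GENS	4

/* hypothetical, single-type, single-zone view of lru_gen_struct */
struct toy_lrugen {
	unsigned long min_seq;
	unsigned long max_seq;
	unsigned long nr_pages[MAX_NR_GENS];	/* indexed by seq % MAX_NR_GENS */
};

/* Retire empty oldest generations; return true if min_seq moved. */
static bool toy_inc_min_seq(struct toy_lrugen *g)
{
	bool advanced = false;

	assert(g->min_seq <= g->max_seq);

	/* min_seq + MIN_NR_GENS <= max_seq means more than MIN_NR_GENS gens */
	while (g->min_seq + MIN_NR_GENS <= g->max_seq &&
	       g->nr_pages[g->min_seq % MAX_NR_GENS] == 0) {
		g->min_seq++;
		advanced = true;
	}

	return advanced;
}
```

Starting from min_seq=0 and max_seq=3 with page counts {0, 0, 5, 7},
the sketch stops at min_seq=2, the first generation that still holds
pages.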

Each generation is divided into multiple tiers. Tiers represent
different ranges of numbers of accesses through file descriptors. A
page accessed N times through file descriptors is in tier
order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in folio->flags. In contrast to moving across generations, which
requires the LRU lock, moving across tiers only involves operations on
folio->flags. The feedback loop also monitors refaults over all tiers
and decides which tiers (N>1) to protect and when, using the first
tier (N=0,1) as a baseline. The first tier contains single-use
unmapped clean pages, which are most likely the best choices for
eviction. The eviction moves a page to the next generation, i.e.,
min_seq+1, if the feedback loop decides to protect it. This approach
has the following advantages:
1. It removes the cost of activation in the buffered access path by
   inferring whether pages accessed multiple times through file
   descriptors are statistically hot and thus worth protecting in the
   eviction path.
2. It takes pages accessed through page tables into account and avoids
   overprotecting pages accessed multiple times through file
   descriptors. (Pages accessed through page tables are in the first
   tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
   twice through file descriptors, when under heavy buffered I/O
   workloads.
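
The mapping from access counts to tiers can be checked with a small
userspace sketch. toy_order_base_2() is a hypothetical stand-in for the
kernel's order_base_2(), and toy_tier_from_accesses() is an assumption
made for this example; the patch itself derives the tier from
PG_referenced, PG_workingset and the refs bits in folio->flags, which
saturate, hence the clamp to the last tier.

```c
#include <assert.h>
#include <limits.h>

#define MAX_NR_TIERS	4	/* same value as in the patch */

/* ceil(log2(n)) for n > 1, and 0 for n = 0 or 1 */
static int toy_order_base_2(unsigned long n)
{
	int order = 0;
	int max_order = (int)(sizeof(n) * CHAR_BIT) - 1;

	/* stop before the shift could overflow */
	while (order < max_order && (1UL << order) < n)
		order++;

	return order;
}

/* tier of a page accessed n times through file descriptors */
static int toy_tier_from_accesses(unsigned long n)
{
	int tier = toy_order_base_2(n);

	/* the counter in folio->flags saturates, so does the tier */
	if (tier > MAX_NR_TIERS - 1)
		tier = MAX_NR_TIERS - 1;

	assert(tier >= 0 && tier < MAX_NR_TIERS);

	return tier;
}
```

With these definitions, N=0,1 maps to the first tier (0), N=2 to tier
1, N=3,4 to tier 2, and N>=5 to tier 3.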

Server benchmark results:
  Single workload:
    fio (buffered I/O): +[47, 49]%
                IOPS         BW
      5.17-rc2: 2242k        8759MiB/s
      patch1-5: 3321k        12.7GiB/s

  Single workload:
    memcached (anon): +[101, 105]%
                Ops/sec      KB/sec
      5.17-rc2: 476771.79    18544.31
      patch1-5: 972526.07    37826.95

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was only used as a ram disk to reduce the variance in the
    results.

    patch drivers/block/brd.c <<EOF
    99,100c99,100
    < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
    < 	page = alloc_page(gfp_flags);
    ---
    > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
    > 	page = alloc_pages_node(1, gfp_flags, 0);
    EOF

    cat >>/etc/systemd/system.conf <<EOF
    CPUAffinity=numa
    NUMAPolicy=bind
    NUMAMask=0
    EOF

    cat >>/etc/memcached.conf <<EOF
    -m 184320
    -s /var/run/memcached/memcached.sock
    -a 0766
    -t 36
    -B binary
    EOF

    cat fio.sh
    modprobe brd rd_nr=1 rd_size=113246208
    mkfs.ext4 /dev/ram0
    mount -t ext4 /dev/ram0 /mnt

    mkdir /sys/fs/cgroup/user.slice/test
    echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
    echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
    fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=5m --group_reporting

    cat memcached.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.17-rc2
      38.05%  page_vma_mapped_walk
      20.86%  lzo1x_1_do_compress (real work)
       6.16%  do_raw_spin_lock
       4.61%  _raw_spin_unlock_irq
       2.20%  vma_interval_tree_iter_next
       2.19%  vma_interval_tree_subtree_search
       2.15%  page_referenced_one
       1.93%  anon_vma_interval_tree_iter_first
       1.65%  ptep_clear_flush
       1.00%  __zram_bvec_write

    patch1-5
      39.73%  lzo1x_1_do_compress (real work)
      14.96%  page_vma_mapped_walk
       6.97%  _raw_spin_unlock_irq
       3.07%  do_raw_spin_lock
       2.53%  anon_vma_interval_tree_iter_first
       2.04%  ptep_clear_flush
       1.82%  __zram_bvec_write
       1.76%  __anon_vma_interval_tree_subtree_search
       1.57%  memmove
       1.45%  free_unref_page_list

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    Chrome OS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 include/linux/mm.h        |   1 +
 include/linux/mm_inline.h |  24 ++
 include/linux/mmzone.h    |  42 ++
 kernel/bounds.c           |   2 +-
 mm/Kconfig                |   9 +
 mm/swap.c                 |  42 ++
 mm/vmscan.c               | 786 +++++++++++++++++++++++++++++++++++++-
 mm/workingset.c           | 119 +++++-
 8 files changed, 1021 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c1162659d824..1e3e6dd90c0f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -227,6 +227,7 @@ int overcommit_policy_handler(struct ctl_table *, int, void *, size_t *,
 #define PAGE_ALIGNED(addr)	IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)
 
 #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
+#define lru_to_folio(head) (list_entry((head)->prev, struct folio, lru))
 
 void setup_initial_init_mm(void *start_code, void *end_code,
 			   void *end_data, void *brk);
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index e3594171b421..15a04a9b5560 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -119,6 +119,19 @@ static inline int lru_gen_from_seq(unsigned long seq)
 	return seq % MAX_NR_GENS;
 }
 
+static inline int lru_hist_from_seq(unsigned long seq)
+{
+	return seq % NR_HIST_GENS;
+}
+
+static inline int lru_tier_from_refs(int refs)
+{
+	VM_BUG_ON(refs > BIT(LRU_REFS_WIDTH));
+
+	/* see the comment on MAX_NR_TIERS */
+	return order_base_2(refs + 1);
+}
+
 static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
 {
 	unsigned long max_seq = lruvec->lrugen.max_seq;
@@ -164,6 +177,15 @@ static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *foli
 		__update_lru_size(lruvec, lru, zone, -delta);
 		return;
 	}
+
+	/* promotion */
+	if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) {
+		__update_lru_size(lruvec, lru, zone, -delta);
+		__update_lru_size(lruvec, lru + LRU_ACTIVE, zone, delta);
+	}
+
+	/* demotion requires isolation, e.g., lru_deactivate_fn() */
+	VM_BUG_ON(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen));
 }
 
 static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
@@ -229,6 +251,8 @@ static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio,
 		gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
 
 		new_flags &= ~LRU_GEN_MASK;
+		if ((new_flags & LRU_REFS_FLAGS) != LRU_REFS_FLAGS)
+			new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
 		/* for shrink_page_list() */
 		if (reclaiming)
 			new_flags &= ~(BIT(PG_referenced) | BIT(PG_reclaim));
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a88e27d85693..307c5c24c7ac 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -333,6 +333,29 @@ enum lruvec_flags {
 #define MIN_NR_GENS		2U
 #define MAX_NR_GENS		4U
 
+/*
+ * Each generation is divided into multiple tiers. Tiers represent different
+ * ranges of numbers of accesses through file descriptors. A page accessed N
+ * times through file descriptors is in tier order_base_2(N). A page in the
+ * first tier (N=0,1) is marked by PG_referenced unless it was faulted in
+ * through page tables or read ahead. A page in any other tier (N>1) is marked
+ * by PG_referenced and PG_workingset. Two additional bits in folio->flags are
+ * required to support four tiers.
+ *
+ * In contrast to moving across generations which requires the LRU lock, moving
+ * across tiers only requires operations on folio->flags and therefore has a
+ * negligible cost in the buffered access path. In the eviction path,
+ * comparisons of refaulted/(evicted+protected) from the first tier and the
+ * rest infer whether pages accessed multiple times through file descriptors
+ * are statistically hot and thus worth protecting.
+ *
+ * MAX_NR_TIERS is set to 4 so that the multi-gen LRU has twice the number of
+ * categories of the active/inactive LRU when tracking accesses through file
+ * descriptors.
+ */
+#define MAX_NR_TIERS		4U
+#define LRU_REFS_FLAGS		(BIT(PG_referenced) | BIT(PG_workingset))
+
 #ifndef __GENERATING_BOUNDS_H
 
 struct lruvec;
@@ -347,6 +370,16 @@ enum {
 	LRU_GEN_FILE,
 };
 
+#define MIN_LRU_BATCH		BITS_PER_LONG
+#define MAX_LRU_BATCH		(MIN_LRU_BATCH * 128)
+
+/* whether to keep historical stats from evicted generations */
+#ifdef CONFIG_LRU_GEN_STATS
+#define NR_HIST_GENS		MAX_NR_GENS
+#else
+#define NR_HIST_GENS		1U
+#endif
+
 /*
  * The youngest generation number is stored in max_seq for both anon and file
  * types as they are aged on an equal footing. The oldest generation numbers are
@@ -366,6 +399,15 @@ struct lru_gen_struct {
 	struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
 	/* the sizes of the above lists */
 	unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	/* the exponential moving average of refaulted */
+	unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
+	/* the exponential moving average of evicted+protected */
+	unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS];
+	/* the first tier doesn't need protection, hence the minus one */
+	unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1];
+	/* can be modified without holding the LRU lock */
+	atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
+	atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
 };
 
 void lru_gen_init_lruvec(struct lruvec *lruvec);
diff --git a/kernel/bounds.c b/kernel/bounds.c
index e08fb89f87f4..10dd9e6b03e5 100644
--- a/kernel/bounds.c
+++ b/kernel/bounds.c
@@ -24,7 +24,7 @@ int main(void)
 	DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t));
 #ifdef CONFIG_LRU_GEN
 	DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1));
-	DEFINE(LRU_REFS_WIDTH, 0);
+	DEFINE(LRU_REFS_WIDTH, MAX_NR_TIERS - 2);
 #else
 	DEFINE(LRU_GEN_WIDTH, 0);
 	DEFINE(LRU_REFS_WIDTH, 0);
diff --git a/mm/Kconfig b/mm/Kconfig
index 747ab1690bcf..804c2bca8205 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -900,6 +900,15 @@ config LRU_GEN
 	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
 	help
 	  A high performance LRU implementation for memory overcommit.
+
+config LRU_GEN_STATS
+	bool "Full stats for debugging"
+	depends on LRU_GEN
+	help
+	  Do not enable this option unless you plan to look at historical stats
+	  from evicted generations for debugging purposes.
+
+	  This option has a per-memcg and per-node memory overhead.
 # }
 
 source "mm/damon/Kconfig"
diff --git a/mm/swap.c b/mm/swap.c
index e5f2ab3dab4a..f5c0bcac8dcd 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -407,6 +407,43 @@ static void __lru_cache_activate_folio(struct folio *folio)
 	local_unlock(&lru_pvecs.lock);
 }
 
+#ifdef CONFIG_LRU_GEN
+static void folio_inc_refs(struct folio *folio)
+{
+	unsigned long refs;
+	unsigned long old_flags, new_flags;
+
+	if (folio_test_unevictable(folio))
+		return;
+
+	/* see the comment on MAX_NR_TIERS */
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+
+		if (!(new_flags & BIT(PG_referenced))) {
+			new_flags |= BIT(PG_referenced);
+			continue;
+		}
+
+		if (!(new_flags & BIT(PG_workingset))) {
+			new_flags |= BIT(PG_workingset);
+			continue;
+		}
+
+		refs = new_flags & LRU_REFS_MASK;
+		refs = min(refs + BIT(LRU_REFS_PGOFF), LRU_REFS_MASK);
+
+		new_flags &= ~LRU_REFS_MASK;
+		new_flags |= refs;
+	} while (new_flags != old_flags &&
+		 cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+}
+#else
+static void folio_inc_refs(struct folio *folio)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
 /*
  * Mark a page as having seen activity.
  *
@@ -419,6 +456,11 @@ static void __lru_cache_activate_folio(struct folio *folio)
  */
 void folio_mark_accessed(struct folio *folio)
 {
+	if (lru_gen_enabled()) {
+		folio_inc_refs(folio);
+		return;
+	}
+
 	if (!folio_test_referenced(folio)) {
 		folio_set_referenced(folio);
 	} else if (folio_test_unevictable(folio)) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 65eb668abf2d..91a827ff665d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1287,9 +1287,11 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
-		mem_cgroup_swapout(page, swap);
+
+		/* get a shadow entry before mem_cgroup_swapout() clears folio_memcg() */
 		if (reclaimed && !mapping_exiting(mapping))
 			shadow = workingset_eviction(page, target_memcg);
+		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page, swap, shadow);
 		xa_unlock_irq(&mapping->i_pages);
 		put_swap_page(page, swap);
@@ -2723,6 +2725,9 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
 	unsigned long file;
 	struct lruvec *target_lruvec;
 
+	if (lru_gen_enabled())
+		return;
+
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
 
 	/*
@@ -3048,11 +3053,38 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
  *                          shorthand helpers
  ******************************************************************************/
 
+#define DEFINE_MAX_SEQ(lruvec)						\
+	unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq)
+
+#define DEFINE_MIN_SEQ(lruvec)						\
+	unsigned long min_seq[ANON_AND_FILE] = {			\
+		READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_ANON]),	\
+		READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_FILE]),	\
+	}
+
 #define for_each_gen_type_zone(gen, type, zone)				\
 	for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)			\
 		for ((type) = 0; (type) < ANON_AND_FILE; (type)++)	\
 			for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
 
+static int folio_lru_gen(struct folio *folio)
+{
+	unsigned long flags = READ_ONCE(folio->flags);
+
+	return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+}
+
+static int folio_lru_tier(struct folio *folio)
+{
+	int refs;
+	unsigned long flags = READ_ONCE(folio->flags);
+
+	refs = (flags & LRU_REFS_FLAGS) == LRU_REFS_FLAGS ?
+	       ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + 1 : 0;
+
+	return lru_tier_from_refs(refs);
+}
+
 static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
 {
 	struct pglist_data *pgdat = NODE_DATA(nid);
@@ -3071,6 +3103,735 @@ static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
 	return pgdat ? &pgdat->__lruvec : NULL;
 }
 
+static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
+{
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+	if (!can_demote(pgdat->node_id, sc) &&
+	    mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
+		return 0;
+
+	return mem_cgroup_swappiness(memcg);
+}
+
+static int get_nr_gens(struct lruvec *lruvec, int type)
+{
+	return lruvec->lrugen.max_seq - lruvec->lrugen.min_seq[type] + 1;
+}
+
+static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
+{
+	/* see the comment on lru_gen_struct */
+	return get_nr_gens(lruvec, LRU_GEN_FILE) >= MIN_NR_GENS &&
+	       get_nr_gens(lruvec, LRU_GEN_FILE) <= get_nr_gens(lruvec, LRU_GEN_ANON) &&
+	       get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS;
+}
+
+/******************************************************************************
+ *                          refault feedback loop
+ ******************************************************************************/
+
+/*
+ * A feedback loop based on a Proportional-Integral-Derivative (PID) controller.
+ *
+ * The P term is refaulted/(evicted+protected) from a tier in the generation
+ * currently being evicted; the I term is the exponential moving average of the
+ * P term over the generations previously evicted, using the smoothing factor
+ * 1/2; the D term isn't supported.
+ *
+ * The setpoint (SP) is always the first tier of one type; the process variable
+ * (PV) is either any tier of the other type or any other tier of the same
+ * type.
+ *
+ * The error is the difference between the SP and the PV; the correction is
+ * to turn off protection when SP>PV or to turn on protection when SP<PV.
+ *
+ * For future optimizations:
+ * 1. The D term may discount the other two terms over time so that long-lived
+ *    generations can resist stale information.
+ */
+struct ctrl_pos {
+	unsigned long refaulted;
+	unsigned long total;
+	int gain;
+};
+
+static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain,
+			  struct ctrl_pos *pos)
+{
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	int hist = lru_hist_from_seq(lrugen->min_seq[type]);
+
+	pos->refaulted = lrugen->avg_refaulted[type][tier] +
+			 atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+	pos->total = lrugen->avg_total[type][tier] +
+		     atomic_long_read(&lrugen->evicted[hist][type][tier]);
+	if (tier)
+		pos->total += lrugen->protected[hist][type][tier - 1];
+	pos->gain = gain;
+}
+
+static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover)
+{
+	int hist, tier;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	bool clear = carryover ? NR_HIST_GENS == 1 : NR_HIST_GENS > 1;
+	unsigned long seq = carryover ? lrugen->min_seq[type] : lrugen->max_seq + 1;
+
+	lockdep_assert_held(&lruvec->lru_lock);
+
+	if (!carryover && !clear)
+		return;
+
+	hist = lru_hist_from_seq(seq);
+
+	for (tier = 0; tier < MAX_NR_TIERS; tier++) {
+		if (carryover) {
+			unsigned long sum;
+
+			sum = lrugen->avg_refaulted[type][tier] +
+			      atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+			WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2);
+
+			sum = lrugen->avg_total[type][tier] +
+			      atomic_long_read(&lrugen->evicted[hist][type][tier]);
+			if (tier)
+				sum += lrugen->protected[hist][type][tier - 1];
+			WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2);
+		}
+
+		if (clear) {
+			atomic_long_set(&lrugen->refaulted[hist][type][tier], 0);
+			atomic_long_set(&lrugen->evicted[hist][type][tier], 0);
+			if (tier)
+				WRITE_ONCE(lrugen->protected[hist][type][tier - 1], 0);
+		}
+	}
+}
+
+static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
+{
+	/*
+	 * Return true if the PV has a limited number of refaults or a lower
+	 * refaulted/total than the SP.
+	 */
+	return pv->refaulted < MIN_LRU_BATCH ||
+	       pv->refaulted * (sp->total + MIN_LRU_BATCH) * sp->gain <=
+	       (sp->refaulted + 1) * pv->total * pv->gain;
+}
+
+/******************************************************************************
+ *                          the aging
+ ******************************************************************************/
+
+static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+	unsigned long old_flags, new_flags;
+	int type = folio_is_file_lru(folio);
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+		VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
+
+		new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+		new_gen = (old_gen + 1) % MAX_NR_GENS;
+
+		new_flags &= ~LRU_GEN_MASK;
+		new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
+		new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
+		/* for folio_end_writeback() */
+		if (reclaiming)
+			new_flags |= BIT(PG_reclaim);
+	} while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+
+	lru_gen_update_size(lruvec, folio, old_gen, new_gen);
+
+	return new_gen;
+}
+
+static void inc_min_seq(struct lruvec *lruvec)
+{
+	int type;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
+			continue;
+
+		reset_ctrl_pos(lruvec, type, true);
+		WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1);
+	}
+}
+
+static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
+{
+	int gen, type, zone;
+	bool success = false;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	DEFINE_MIN_SEQ(lruvec);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		while (min_seq[type] + MIN_NR_GENS <= lrugen->max_seq) {
+			gen = lru_gen_from_seq(min_seq[type]);
+
+			for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+				if (!list_empty(&lrugen->lists[gen][type][zone]))
+					goto next;
+			}
+
+			min_seq[type]++;
+		}
+next:
+		;
+	}
+
+	/* see the comment on lru_gen_struct */
+	if (can_swap) {
+		min_seq[LRU_GEN_ANON] = min(min_seq[LRU_GEN_ANON], min_seq[LRU_GEN_FILE]);
+		min_seq[LRU_GEN_FILE] = max(min_seq[LRU_GEN_ANON], lrugen->min_seq[LRU_GEN_FILE]);
+	}
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		if (min_seq[type] == lrugen->min_seq[type])
+			continue;
+
+		reset_ctrl_pos(lruvec, type, true);
+		WRITE_ONCE(lrugen->min_seq[type], min_seq[type]);
+		success = true;
+	}
+
+	return success;
+}
+
+static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
+{
+	int prev, next;
+	int type, zone;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	if (max_seq != lrugen->max_seq)
+		goto unlock;
+
+	inc_min_seq(lruvec);
+
+	/* update the active/inactive LRU sizes for compatibility */
+	prev = lru_gen_from_seq(lrugen->max_seq - 1);
+	next = lru_gen_from_seq(lrugen->max_seq + 1);
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+			enum lru_list lru = type * LRU_INACTIVE_FILE;
+			long delta = lrugen->nr_pages[prev][type][zone] -
+				     lrugen->nr_pages[next][type][zone];
+
+			if (!delta)
+				continue;
+
+			__update_lru_size(lruvec, lru, zone, delta);
+			__update_lru_size(lruvec, lru + LRU_ACTIVE, zone, -delta);
+		}
+	}
+
+	for (type = 0; type < ANON_AND_FILE; type++)
+		reset_ctrl_pos(lruvec, type, false);
+
+	/* make sure preceding modifications appear */
+	smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
+unlock:
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
+			     unsigned long *min_seq, bool can_swap, bool *need_aging)
+{
+	int gen, type, zone;
+	long old = 0;
+	long young = 0;
+	long total = 0;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		unsigned long seq;
+
+		for (seq = min_seq[type]; seq <= max_seq; seq++) {
+			long size = 0;
+
+			gen = lru_gen_from_seq(seq);
+
+			for (zone = 0; zone < MAX_NR_ZONES; zone++)
+				size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
+
+			total += size;
+			if (seq == max_seq)
+				young += size;
+			if (seq + MIN_NR_GENS == max_seq)
+				old += size;
+		}
+	}
+
+	/* try to spread pages out across MIN_NR_GENS+1 generations */
+	if (min_seq[LRU_GEN_FILE] + MIN_NR_GENS > max_seq)
+		*need_aging = true;
+	else if (min_seq[LRU_GEN_FILE] + MIN_NR_GENS < max_seq)
+		*need_aging = false;
+	else if (young * MIN_NR_GENS > total)
+		*need_aging = true;
+	else if (old * (MIN_NR_GENS + 2) < total)
+		*need_aging = true;
+	else
+		*need_aging = false;
+
+	return total > 0 ? total : 0;
+}
+
+static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+	bool need_aging;
+	long nr_to_scan;
+	int swappiness = get_swappiness(lruvec, sc);
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	mem_cgroup_calculate_protection(NULL, memcg);
+
+	if (mem_cgroup_below_min(memcg))
+		return;
+
+	nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, swappiness, &need_aging);
+	if (!nr_to_scan)
+		return;
+
+	nr_to_scan >>= sc->priority;
+
+	if (!mem_cgroup_online(memcg))
+		nr_to_scan++;
+
+	if (nr_to_scan && need_aging && (!mem_cgroup_below_low(memcg) || sc->memcg_low_reclaim))
+		inc_max_seq(lruvec, max_seq);
+}
+
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
+{
+	struct mem_cgroup *memcg;
+
+	VM_BUG_ON(!current_is_kswapd());
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+		age_lruvec(lruvec, sc);
+
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+}
+
+/******************************************************************************
+ *                          the eviction
+ ******************************************************************************/
+
+static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
+{
+	bool success;
+	int gen = folio_lru_gen(folio);
+	int type = folio_is_file_lru(folio);
+	int zone = folio_zonenum(folio);
+	int tier = folio_lru_tier(folio);
+	int delta = folio_nr_pages(folio);
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON_FOLIO(gen >= MAX_NR_GENS, folio);
+
+	if (!folio_evictable(folio)) {
+		success = lru_gen_del_folio(lruvec, folio, true);
+		VM_BUG_ON_FOLIO(!success, folio);
+		folio_set_unevictable(folio);
+		lruvec_add_folio(lruvec, folio);
+		__count_vm_events(UNEVICTABLE_PGCULLED, delta);
+		return true;
+	}
+
+	if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) {
+		success = lru_gen_del_folio(lruvec, folio, true);
+		VM_BUG_ON_FOLIO(!success, folio);
+		folio_set_swapbacked(folio);
+		lruvec_add_folio_tail(lruvec, folio);
+		return true;
+	}
+
+	if (tier > tier_idx) {
+		int hist = lru_hist_from_seq(lrugen->min_seq[type]);
+
+		gen = folio_inc_gen(lruvec, folio, false);
+		list_move_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
+
+		WRITE_ONCE(lrugen->protected[hist][type][tier - 1],
+			   lrugen->protected[hist][type][tier - 1] + delta);
+		__mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
+		return true;
+	}
+
+	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
+	    (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
+		gen = folio_inc_gen(lruvec, folio, true);
+		list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
+		return true;
+	}
+
+	return false;
+}
+
+static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc)
+{
+	bool success;
+
+	if (!sc->may_unmap && folio_mapped(folio))
+		return false;
+
+	if (!(sc->may_writepage && (sc->gfp_mask & __GFP_IO)) &&
+	    (folio_test_dirty(folio) ||
+	     (folio_test_anon(folio) && !folio_test_swapcache(folio))))
+		return false;
+
+	if (!folio_try_get(folio))
+		return false;
+
+	if (!folio_test_clear_lru(folio)) {
+		folio_put(folio);
+		return false;
+	}
+
+	success = lru_gen_del_folio(lruvec, folio, true);
+	VM_BUG_ON_FOLIO(!success, folio);
+
+	return true;
+}
+
+static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
+		       int type, int tier, struct list_head *list)
+{
+	int gen, zone;
+	enum vm_event_item item;
+	int sorted = 0;
+	int scanned = 0;
+	int isolated = 0;
+	int remaining = MAX_LRU_BATCH;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	VM_BUG_ON(!list_empty(list));
+
+	if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
+		return 0;
+
+	gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+	for (zone = sc->reclaim_idx; zone >= 0; zone--) {
+		LIST_HEAD(moved);
+		int skipped = 0;
+		struct list_head *head = &lrugen->lists[gen][type][zone];
+
+		while (!list_empty(head)) {
+			struct folio *folio = lru_to_folio(head);
+			int delta = folio_nr_pages(folio);
+
+			VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
+			VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
+			VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
+			VM_BUG_ON_FOLIO(folio_zonenum(folio) != zone, folio);
+
+			scanned += delta;
+
+			if (sort_folio(lruvec, folio, tier))
+				sorted += delta;
+			else if (isolate_folio(lruvec, folio, sc)) {
+				list_add(&folio->lru, list);
+				isolated += delta;
+			} else {
+				list_move(&folio->lru, &moved);
+				skipped += delta;
+			}
+
+			if (!--remaining || max(isolated, skipped) >= MIN_LRU_BATCH)
+				break;
+		}
+
+		if (skipped) {
+			list_splice(&moved, head);
+			__count_zid_vm_events(PGSCAN_SKIP, zone, skipped);
+		}
+
+		if (!remaining || isolated >= MIN_LRU_BATCH)
+			break;
+	}
+
+	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
+	if (!cgroup_reclaim(sc)) {
+		__count_vm_events(item, isolated);
+		__count_vm_events(PGREFILL, sorted);
+	}
+	__count_memcg_events(memcg, item, isolated);
+	__count_memcg_events(memcg, PGREFILL, sorted);
+	__count_vm_events(PGSCAN_ANON + type, isolated);
+
+	/*
+	 * There might not be eligible pages due to reclaim_idx, may_unmap and
+	 * may_writepage. Check the remaining to prevent livelock if there is no
+	 * progress.
+	 */
+	return isolated || !remaining ? scanned : 0;
+}
+
+static int get_tier_idx(struct lruvec *lruvec, int type)
+{
+	int tier;
+	struct ctrl_pos sp, pv;
+
+	/*
+	 * To leave a margin for fluctuations, use a larger gain factor (1:2).
+	 * This value is chosen because any other tier would have at least twice
+	 * as many refaults as the first tier.
+	 */
+	read_ctrl_pos(lruvec, type, 0, 1, &sp);
+	for (tier = 1; tier < MAX_NR_TIERS; tier++) {
+		read_ctrl_pos(lruvec, type, tier, 2, &pv);
+		if (!positive_ctrl_err(&sp, &pv))
+			break;
+	}
+
+	return tier - 1;
+}
+
+static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx)
+{
+	int type, tier;
+	struct ctrl_pos sp, pv;
+	int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness };
+
+	/*
+	 * Compare the first tier of anon with that of file to determine which
+	 * type to scan. Also need to compare other tiers of the selected type
+	 * with the first tier of the other type to determine the last tier (of
+	 * the selected type) to evict.
+	 */
+	read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp);
+	read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv);
+	type = positive_ctrl_err(&sp, &pv);
+
+	read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp);
+	for (tier = 1; tier < MAX_NR_TIERS; tier++) {
+		read_ctrl_pos(lruvec, type, tier, gain[type], &pv);
+		if (!positive_ctrl_err(&sp, &pv))
+			break;
+	}
+
+	*tier_idx = tier - 1;
+
+	return type;
+}
+
+static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
+			  int *type_scanned, struct list_head *list)
+{
+	int i;
+	int type;
+	int scanned;
+	int tier = -1;
+	DEFINE_MIN_SEQ(lruvec);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	/*
+	 * Try to make the obvious choice first. When anon and file are both
+	 * available from the same generation, interpret swappiness 1 as file
+	 * first and 200 as anon first.
+	 */
+	if (!swappiness)
+		type = LRU_GEN_FILE;
+	else if (min_seq[LRU_GEN_ANON] < min_seq[LRU_GEN_FILE])
+		type = LRU_GEN_ANON;
+	else if (swappiness == 1)
+		type = LRU_GEN_FILE;
+	else if (swappiness == 200)
+		type = LRU_GEN_ANON;
+	else
+		type = get_type_to_scan(lruvec, swappiness, &tier);
+
+	for (i = !swappiness; i < ANON_AND_FILE; i++) {
+		if (tier < 0)
+			tier = get_tier_idx(lruvec, type);
+
+		scanned = scan_folios(lruvec, sc, type, tier, list);
+		if (scanned)
+			break;
+
+		type = !type;
+		tier = -1;
+	}
+
+	*type_scanned = type;
+
+	return scanned;
+}
+
+static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
+{
+	int type;
+	int scanned;
+	int reclaimed;
+	LIST_HEAD(list);
+	struct folio *folio;
+	enum vm_event_item item;
+	struct reclaim_stat stat;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	scanned = isolate_folios(lruvec, sc, swappiness, &type, &list);
+
+	if (try_to_inc_min_seq(lruvec, swappiness))
+		scanned++;
+
+	if (get_nr_gens(lruvec, LRU_GEN_FILE) == MIN_NR_GENS)
+		scanned = 0;
+
+	spin_unlock_irq(&lruvec->lru_lock);
+
+	if (list_empty(&list))
+		return scanned;
+
+	reclaimed = shrink_page_list(&list, pgdat, sc, &stat, false);
+
+	/*
+	 * To avoid livelock, don't add rejected pages back to the same lists
+	 * they were isolated from. See lru_gen_add_folio().
+	 */
+	list_for_each_entry(folio, &list, lru) {
+		if (folio_test_reclaim(folio) &&
+		    (folio_test_dirty(folio) || folio_test_writeback(folio)))
+			folio_clear_active(folio);
+		else if (folio_is_file_lru(folio) || folio_test_swapcache(folio))
+			folio_set_active(folio);
+
+		folio_clear_referenced(folio);
+		folio_clear_workingset(folio);
+	}
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	move_pages_to_lru(lruvec, &list);
+
+	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
+	if (!cgroup_reclaim(sc))
+		__count_vm_events(item, reclaimed);
+	__count_memcg_events(memcg, item, reclaimed);
+	__count_vm_events(PGSTEAL_ANON + type, reclaimed);
+
+	spin_unlock_irq(&lruvec->lru_lock);
+
+	mem_cgroup_uncharge_list(&list);
+	free_unref_page_list(&list);
+
+	sc->nr_reclaimed += reclaimed;
+
+	return scanned;
+}
+
+static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool can_swap)
+{
+	bool need_aging;
+	long nr_to_scan;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	if (mem_cgroup_below_min(memcg) ||
+	    (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim))
+		return 0;
+
+	nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, can_swap, &need_aging);
+	if (!nr_to_scan)
+		return 0;
+
+	/* reset the priority if the target has been met */
+	nr_to_scan >>= sc->nr_reclaimed < sc->nr_to_reclaim ? sc->priority : DEF_PRIORITY;
+
+	if (!mem_cgroup_online(memcg))
+		nr_to_scan++;
+
+	if (!nr_to_scan)
+		return 0;
+
+	if (!need_aging)
+		return nr_to_scan;
+
+	/* leave the work to lru_gen_age_node() */
+	if (current_is_kswapd())
+		return 0;
+
+	/* try other memcgs before going to the aging path */
+	if (!cgroup_reclaim(sc) && !sc->force_deactivate) {
+		sc->skipped_deactivate = true;
+		return 0;
+	}
+
+	inc_max_seq(lruvec, max_seq);
+
+	return nr_to_scan;
+}
+
+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+	struct blk_plug plug;
+	long scanned = 0;
+
+	lru_add_drain();
+
+	blk_start_plug(&plug);
+
+	while (true) {
+		int delta;
+		int swappiness;
+		long nr_to_scan;
+
+		if (sc->may_swap)
+			swappiness = get_swappiness(lruvec, sc);
+		else if (!cgroup_reclaim(sc) && get_swappiness(lruvec, sc))
+			swappiness = 1;
+		else
+			swappiness = 0;
+
+		nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
+		if (!nr_to_scan)
+			break;
+
+		delta = evict_folios(lruvec, sc, swappiness);
+		if (!delta)
+			break;
+
+		scanned += delta;
+		if (scanned >= nr_to_scan)
+			break;
+
+		cond_resched();
+	}
+
+	blk_finish_plug(&plug);
+}
+
 /******************************************************************************
  *                          initialization
  ******************************************************************************/
@@ -3113,6 +3874,16 @@ static int __init init_lru_gen(void)
 };
 late_initcall(init_lru_gen);
 
+#else
+
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
+{
+}
+
+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+}
+
 #endif /* CONFIG_LRU_GEN */
 
 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
@@ -3126,6 +3897,11 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	struct blk_plug plug;
 	bool scan_adjusted;
 
+	if (lru_gen_enabled()) {
+		lru_gen_shrink_lruvec(lruvec, sc);
+		return;
+	}
+
 	get_scan_count(lruvec, sc, nr);
 
 	/* Record the original scan target for proportional adjustments later */
@@ -3630,6 +4406,9 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
 	struct lruvec *target_lruvec;
 	unsigned long refaults;
 
+	if (lru_gen_enabled())
+		return;
+
 	target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON);
 	target_lruvec->refaults[0] = refaults;
@@ -4000,6 +4779,11 @@ static void age_active_anon(struct pglist_data *pgdat,
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
+	if (lru_gen_enabled()) {
+		lru_gen_age_node(pgdat, sc);
+		return;
+	}
+
 	if (!can_age_anon_pages(pgdat, sc))
 		return;
 
diff --git a/mm/workingset.c b/mm/workingset.c
index 8c03afe1d67c..93ee00c7e4d1 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -187,7 +187,6 @@ static unsigned int bucket_order __read_mostly;
 static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
 			 bool workingset)
 {
-	eviction >>= bucket_order;
 	eviction &= EVICTION_MASK;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
@@ -212,10 +211,116 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 
 	*memcgidp = memcgid;
 	*pgdat = NODE_DATA(nid);
-	*evictionp = entry << bucket_order;
+	*evictionp = entry;
 	*workingsetp = workingset;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+static int folio_lru_refs(struct folio *folio)
+{
+	unsigned long flags = READ_ONCE(folio->flags);
+
+	BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
+
+	/* see the comment on MAX_NR_TIERS */
+	return flags & BIT(PG_workingset) ? (flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF : 0;
+}
+
+static void *lru_gen_eviction(struct folio *folio)
+{
+	int hist, tier;
+	unsigned long token;
+	unsigned long min_seq;
+	struct lruvec *lruvec;
+	struct lru_gen_struct *lrugen;
+	int type = folio_is_file_lru(folio);
+	int refs = folio_lru_refs(folio);
+	int delta = folio_nr_pages(folio);
+	bool workingset = folio_test_workingset(folio);
+	struct mem_cgroup *memcg = folio_memcg(folio);
+	struct pglist_data *pgdat = folio_pgdat(folio);
+
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	lrugen = &lruvec->lrugen;
+	min_seq = READ_ONCE(lrugen->min_seq[type]);
+	token = (min_seq << LRU_REFS_WIDTH) | refs;
+
+	hist = lru_hist_from_seq(min_seq);
+	tier = lru_tier_from_refs(refs + workingset);
+	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
+
+	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset);
+}
+
+static void lru_gen_refault(struct folio *folio, void *shadow)
+{
+	int hist, tier, refs;
+	int memcg_id;
+	bool workingset;
+	unsigned long token;
+	unsigned long min_seq;
+	struct lruvec *lruvec;
+	struct lru_gen_struct *lrugen;
+	struct mem_cgroup *memcg;
+	struct pglist_data *pgdat;
+	int type = folio_is_file_lru(folio);
+	int delta = folio_nr_pages(folio);
+
+	unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset);
+
+	refs = token & (BIT(LRU_REFS_WIDTH) - 1);
+	if (refs && !workingset)
+		return;
+
+	if (folio_pgdat(folio) != pgdat)
+		return;
+
+	rcu_read_lock();
+	memcg = folio_memcg_rcu(folio);
+	if (mem_cgroup_id(memcg) != memcg_id)
+		goto unlock;
+
+	token >>= LRU_REFS_WIDTH;
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	lrugen = &lruvec->lrugen;
+	min_seq = READ_ONCE(lrugen->min_seq[type]);
+	if (token != (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH)))
+		goto unlock;
+
+	hist = lru_hist_from_seq(min_seq);
+	tier = lru_tier_from_refs(refs + workingset);
+	atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
+	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
+
+	/*
+	 * Count the following two cases as stalls:
+	 * 1. For pages accessed through page tables, hotter pages pushed out
+	 *    hot pages which refaulted immediately.
+	 * 2. For pages accessed through file descriptors, numbers of accesses
+	 *    might have been beyond the limit.
+	 */
+	if (lru_gen_in_fault() || refs + workingset == BIT(LRU_REFS_WIDTH)) {
+		folio_set_workingset(folio);
+		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
+	}
+unlock:
+	rcu_read_unlock();
+}
+
+#else
+
+static void *lru_gen_eviction(struct folio *folio)
+{
+	return NULL;
+}
+
+static void lru_gen_refault(struct folio *folio, void *shadow)
+{
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 /**
  * workingset_age_nonresident - age non-resident entries as LRU ages
  * @lruvec: the lruvec that was aged
@@ -264,10 +369,14 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
+	if (lru_gen_enabled())
+		return lru_gen_eviction(page_folio(page));
+
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
 	eviction = atomic_long_read(&lruvec->nonresident_age);
+	eviction >>= bucket_order;
 	workingset_age_nonresident(lruvec, thp_nr_pages(page));
 	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
@@ -297,7 +406,13 @@ void workingset_refault(struct folio *folio, void *shadow)
 	int memcgid;
 	long nr;
 
+	if (lru_gen_enabled()) {
+		lru_gen_refault(folio, shadow);
+		return;
+	}
+
 	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
+	eviction <<= bucket_order;
 
 	rcu_read_lock();
 	/*
-- 
2.35.1.616.g0bdcbb4464-goog




* [PATCH v9 07/14] mm: multi-gen LRU: exploit locality in rmap
  2022-03-09  2:12 ` Yu Zhao
@ 2022-03-09  2:12   ` Yu Zhao
From: Yu Zhao @ 2022-03-09  2:12 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
	Jonathan Corbet, Matthew Wilcox, Mel Gorman, Michael Larabel,
	Michal Hocko, Mike Rapoport, Rik van Riel, Vlastimil Babka,
	Will Deacon, Ying Huang, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, page-reclaim, x86, Yu Zhao, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

Searching the rmap for PTEs mapping each page on an LRU list (to test
and clear the accessed bit) can be expensive because pages from
different VMAs (PA space) are not cache-friendly to the rmap (VA
space). For workloads mostly using mapped pages, the rmap has a high
CPU cost in the reclaim path.

This patch exploits spatial locality to reduce the number of trips
into the rmap. When shrink_page_list() walks the rmap and finds a
young PTE, a new function, lru_gen_look_around(), scans at most
BITS_PER_LONG-1 adjacent PTEs. For each additional young PTE it finds,
it clears the accessed bit and updates the gen counter of the page
mapped by that PTE to (max_seq%MAX_NR_GENS)+1.
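
The look-around step above can be modelled in userspace as follows. This
is only a sketch, not the kernel implementation: struct toy_pte and
look_around() are hypothetical names, the MAX_NR_GENS value of 4 is an
assumption, and centring the window on the hit PTE is an illustrative
choice.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_LONG ((size_t)(sizeof(long) * CHAR_BIT))
#define MAX_NR_GENS 4UL /* assumed value for this sketch */

/* hypothetical stand-in for a PTE plus the gen counter of the page it maps */
struct toy_pte {
	bool present;
	bool young;	/* models the accessed bit */
	int gen;	/* 0 means the page is not on a multi-gen LRU list */
};

/*
 * Model of the look-around step: @idx is the young PTE the rmap walk just
 * found. Scan a window of at most BITS_PER_LONG entries around it, skipping
 * @idx itself, so at most BITS_PER_LONG-1 neighbours are visited. Returns how
 * many neighbours were young and therefore updated.
 */
static int look_around(struct toy_pte *ptes, size_t nr, size_t idx,
		       unsigned long max_seq)
{
	size_t half = BITS_PER_LONG / 2;
	size_t start = idx > half ? idx - half : 0;
	size_t end = start + BITS_PER_LONG < nr ? start + BITS_PER_LONG : nr;
	int updated = 0;

	assert(idx < nr);

	for (size_t i = start; i < end; i++) {
		if (i == idx || !ptes[i].present || !ptes[i].young)
			continue;

		/* clear the accessed bit and move the page to the youngest gen */
		ptes[i].young = false;
		ptes[i].gen = (int)(max_seq % MAX_NR_GENS) + 1;
		updated++;
	}

	return updated;
}
```

For example, with max_seq = 5 a young neighbour inside the window ends up
with gen counter (5%4)+1 = 2, while a young PTE outside the window is left
untouched and still costs its own trip into the rmap later.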

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[3.5, 5.5]%
                Ops/sec      KB/sec
      patch1-5: 972526.07    37826.95
      patch1-6: 1015292.83   39490.38

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-5
      39.73%  lzo1x_1_do_compress (real work)
      14.96%  page_vma_mapped_walk
       6.97%  _raw_spin_unlock_irq
       3.07%  do_raw_spin_lock
       2.53%  anon_vma_interval_tree_iter_first
       2.04%  ptep_clear_flush
       1.82%  __zram_bvec_write
       1.76%  __anon_vma_interval_tree_subtree_search
       1.57%  memmove
       1.45%  free_unref_page_list

    patch1-6
      45.49%  lzo1x_1_do_compress (real work)
       7.38%  page_vma_mapped_walk
       7.24%  _raw_spin_unlock_irq
       2.64%  ptep_clear_flush
       2.31%  __zram_bvec_write
       2.13%  do_raw_spin_lock
       2.09%  lru_gen_look_around
       1.89%  free_unref_page_list
       1.85%  memmove
       1.74%  obj_malloc

  Configurations:
    no change

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 include/linux/memcontrol.h |  31 ++++++++
 include/linux/mm.h         |   5 ++
 include/linux/mmzone.h     |   6 ++
 include/linux/swap.h       |   1 +
 mm/memcontrol.c            |   1 +
 mm/rmap.c                  |   7 ++
 mm/swap.c                  |   4 +-
 mm/vmscan.c                | 155 +++++++++++++++++++++++++++++++++++++
 8 files changed, 208 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0abbd685703b..c8ce74577290 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -437,6 +437,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
  * - LRU isolation
  * - lock_page_memcg()
  * - exclusive reference
+ * - mem_cgroup_trylock_pages()
  *
  * For a kmem folio a caller should hold an rcu read lock to protect memcg
  * associated with a kmem folio from being released.
@@ -498,6 +499,7 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
  * - LRU isolation
  * - lock_page_memcg()
  * - exclusive reference
+ * - mem_cgroup_trylock_pages()
  *
  * For a kmem page a caller should hold an rcu read lock to protect memcg
  * associated with a kmem page from being released.
@@ -935,6 +937,23 @@ void unlock_page_memcg(struct page *page);
 
 void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
 
+/* try to stabilize folio_memcg() for all the pages in a memcg */
+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
+{
+	rcu_read_lock();
+
+	if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account))
+		return true;
+
+	rcu_read_unlock();
+	return false;
+}
+
+static inline void mem_cgroup_unlock_pages(void)
+{
+	rcu_read_unlock();
+}
+
 /* idx can be of type enum memcg_stat_item or node_stat_item */
 static inline void mod_memcg_state(struct mem_cgroup *memcg,
 				   int idx, int val)
@@ -1372,6 +1391,18 @@ static inline void folio_memcg_unlock(struct folio *folio)
 {
 }
 
+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
+{
+	/* to match folio_memcg_rcu() */
+	rcu_read_lock();
+	return true;
+}
+
+static inline void mem_cgroup_unlock_pages(void)
+{
+	rcu_read_unlock();
+}
+
 static inline void mem_cgroup_handle_over_high(void)
 {
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1e3e6dd90c0f..1f3695e95942 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1588,6 +1588,11 @@ static inline unsigned long folio_pfn(struct folio *folio)
 	return page_to_pfn(&folio->page);
 }
 
+static inline struct folio *pfn_folio(unsigned long pfn)
+{
+	return page_folio(pfn_to_page(pfn));
+}
+
 /* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */
 #ifdef CONFIG_MIGRATION
 static inline bool is_pinnable_page(struct page *page)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 307c5c24c7ac..cd64c64a952d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -359,6 +359,7 @@ enum lruvec_flags {
 #ifndef __GENERATING_BOUNDS_H
 
 struct lruvec;
+struct page_vma_mapped_walk;
 
 #define LRU_GEN_MASK		((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
 #define LRU_REFS_MASK		((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
@@ -411,6 +412,7 @@ struct lru_gen_struct {
 };
 
 void lru_gen_init_lruvec(struct lruvec *lruvec);
+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
 
 #ifdef CONFIG_MEMCG
 void lru_gen_init_memcg(struct mem_cgroup *memcg);
@@ -423,6 +425,10 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
 {
 }
 
+static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+{
+}
+
 #ifdef CONFIG_MEMCG
 static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
 {
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1d38d9475c4d..b37520d3ff1d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -372,6 +372,7 @@ extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_cpu_zone(struct zone *zone);
 extern void lru_add_drain_all(void);
+extern void folio_activate(struct folio *folio);
 extern void deactivate_file_page(struct page *page);
 extern void deactivate_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3fcbfeda259b..e4c30950aa3c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2744,6 +2744,7 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
 	 * - LRU isolation
 	 * - lock_page_memcg()
 	 * - exclusive reference
+	 * - mem_cgroup_trylock_pages()
 	 */
 	folio->memcg_data = (unsigned long)memcg;
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 6a1e8c7f6213..112e77dc62f4 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -73,6 +73,7 @@
 #include <linux/page_idle.h>
 #include <linux/memremap.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/mm_inline.h>
 
 #include <asm/tlbflush.h>
 
@@ -819,6 +820,12 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		}
 
 		if (pvmw.pte) {
+			if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
+			    !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
+				lru_gen_look_around(&pvmw);
+				referenced++;
+			}
+
 			if (ptep_clear_flush_young_notify(vma, address,
 						pvmw.pte)) {
 				/*
diff --git a/mm/swap.c b/mm/swap.c
index f5c0bcac8dcd..e65e7520bebf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -344,7 +344,7 @@ static bool need_activate_page_drain(int cpu)
 	return pagevec_count(&per_cpu(lru_pvecs.activate_page, cpu)) != 0;
 }
 
-static void folio_activate(struct folio *folio)
+void folio_activate(struct folio *folio)
 {
 	if (folio_test_lru(folio) && !folio_test_active(folio) &&
 	    !folio_test_unevictable(folio)) {
@@ -364,7 +364,7 @@ static inline void activate_page_drain(int cpu)
 {
 }
 
-static void folio_activate(struct folio *folio)
+void folio_activate(struct folio *folio)
 {
 	struct lruvec *lruvec;
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 91a827ff665d..2b685aa0379c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1558,6 +1558,11 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 		if (!sc->may_unmap && page_mapped(page))
 			goto keep_locked;
 
+		/* folio_update_gen() tried to promote this page? */
+		if (lru_gen_enabled() && !ignore_references &&
+		    page_mapped(page) && PageReferenced(page))
+			goto keep_locked;
+
 		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
 			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
 
@@ -3225,6 +3230,31 @@ static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
  *                          the aging
  ******************************************************************************/
 
+static int folio_update_gen(struct folio *folio, int gen)
+{
+	unsigned long old_flags, new_flags;
+
+	VM_BUG_ON(gen >= MAX_NR_GENS);
+	VM_BUG_ON(!rcu_read_lock_held());
+
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+
+		/* for shrink_page_list() */
+		if (!(new_flags & LRU_GEN_MASK)) {
+			new_flags |= BIT(PG_referenced);
+			continue;
+		}
+
+		new_flags &= ~LRU_GEN_MASK;
+		new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
+		new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
+	} while (new_flags != old_flags &&
+		 cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+
+	return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+}
+
 static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
 {
 	unsigned long old_flags, new_flags;
@@ -3237,6 +3267,10 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
 		VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
 
 		new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+		/* folio_update_gen() has promoted this page? */
+		if (new_gen >= 0 && new_gen != old_gen)
+			return new_gen;
+
 		new_gen = (old_gen + 1) % MAX_NR_GENS;
 
 		new_flags &= ~LRU_GEN_MASK;
@@ -3438,6 +3472,122 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
 }
 
+/*
+ * This function exploits spatial locality when shrink_page_list() walks the
+ * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages.
+ */
+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+{
+	int i;
+	pte_t *pte;
+	unsigned long start;
+	unsigned long end;
+	unsigned long addr;
+	struct folio *folio;
+	unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {};
+	struct mem_cgroup *memcg = page_memcg(pvmw->page);
+	struct pglist_data *pgdat = page_pgdat(pvmw->page);
+	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	DEFINE_MAX_SEQ(lruvec);
+	int old_gen, new_gen = lru_gen_from_seq(max_seq);
+
+	lockdep_assert_held(pvmw->ptl);
+	VM_BUG_ON_PAGE(PageLRU(pvmw->page), pvmw->page);
+
+	start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start);
+	end = pmd_addr_end(pvmw->address, pvmw->vma->vm_end);
+
+	if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
+		if (pvmw->address - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
+			end = start + MIN_LRU_BATCH * PAGE_SIZE;
+		else if (end - pvmw->address < MIN_LRU_BATCH * PAGE_SIZE / 2)
+			start = end - MIN_LRU_BATCH * PAGE_SIZE;
+		else {
+			start = pvmw->address - MIN_LRU_BATCH * PAGE_SIZE / 2;
+			end = pvmw->address + MIN_LRU_BATCH * PAGE_SIZE / 2;
+		}
+	}
+
+	pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE;
+
+	rcu_read_lock();
+	arch_enter_lazy_mmu_mode();
+
+	for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
+		unsigned long pfn = pte_pfn(pte[i]);
+
+		VM_BUG_ON(addr < pvmw->vma->vm_start || addr >= pvmw->vma->vm_end);
+
+		if (!pte_present(pte[i]) || is_zero_pfn(pfn))
+			continue;
+
+		if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i])))
+			continue;
+
+		if (!pte_young(pte[i]))
+			continue;
+
+		VM_BUG_ON(!pfn_valid(pfn));
+		if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+			continue;
+
+		folio = pfn_folio(pfn);
+		if (folio_nid(folio) != pgdat->node_id)
+			continue;
+
+		if (folio_memcg_rcu(folio) != memcg)
+			continue;
+
+		if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
+			continue;
+
+		if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
+		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+		      !folio_test_swapcache(folio)))
+			folio_mark_dirty(folio);
+
+		old_gen = folio_lru_gen(folio);
+		if (old_gen < 0)
+			folio_set_referenced(folio);
+		else if (old_gen != new_gen)
+			__set_bit(i, bitmap);
+	}
+
+	arch_leave_lazy_mmu_mode();
+	rcu_read_unlock();
+
+	if (bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
+		for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
+			folio = page_folio(pte_page(pte[i]));
+			folio_activate(folio);
+		}
+		return;
+	}
+
+	/* folio_update_gen() requires stable folio_memcg() */
+	if (!mem_cgroup_trylock_pages(memcg))
+		return;
+
+	spin_lock_irq(&lruvec->lru_lock);
+	new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
+
+	for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
+		folio = page_folio(pte_page(pte[i]));
+		if (folio_memcg_rcu(folio) != memcg)
+			continue;
+
+		old_gen = folio_update_gen(folio, new_gen);
+		if (old_gen < 0 || old_gen == new_gen)
+			continue;
+
+		lru_gen_update_size(lruvec, folio, old_gen, new_gen);
+	}
+
+	spin_unlock_irq(&lruvec->lru_lock);
+
+	mem_cgroup_unlock_pages();
+}
+
 /******************************************************************************
  *                          the eviction
  ******************************************************************************/
@@ -3471,6 +3621,11 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
 		return true;
 	}
 
+	if (gen != lru_gen_from_seq(lrugen->min_seq[type])) {
+		list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
+		return true;
+	}
+
 	if (tier > tier_idx) {
 		int hist = lru_hist_from_seq(lrugen->min_seq[type]);
 
-- 
2.35.1.616.g0bdcbb4464-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v9 08/14] mm: multi-gen LRU: support page table walks
  2022-03-09  2:12 ` Yu Zhao
@ 2022-03-09  2:12   ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-09  2:12 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
	Jonathan Corbet, Matthew Wilcox, Mel Gorman, Michael Larabel,
	Michal Hocko, Mike Rapoport, Rik van Riel, Vlastimil Babka,
	Will Deacon, Ying Huang, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, page-reclaim, x86, Yu Zhao, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

To further exploit spatial locality, the aging prefers to walk page
tables to search for young PTEs and promote hot pages. A kill switch
will be added in the next patch to disable this behavior. When
disabled, the aging relies on the rmap only.

NB: this behavior has nothing in common with the page table scanning in
the 2.4 kernel [1], which searches page tables for old PTEs, adds cold
pages to swapcache and unmaps them.

To avoid confusion, the term "iteration" specifically means the
traversal of an entire mm_struct list; the term "walk" will be applied
to page tables and the rmap, as usual.

An mm_struct list is maintained for each memcg, and an mm_struct
follows its owner task to the new memcg when this task is migrated.
Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot
pages before it increments max_seq.

When multiple page table walkers iterate the same list, each of them
gets a unique mm_struct; therefore they can run concurrently. Page
table walkers ignore any misplaced pages, e.g., if an mm_struct was
migrated, pages it left in the previous memcg will not be promoted
when its current memcg is under reclaim. Similarly, page table walkers
will not promote pages from nodes other than the one under reclaim.
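
As a rough illustration of why handing each walker a unique mm_struct
lets walkers run concurrently, the userspace sketch below has callers
claim entries from a shared array through an atomic cursor. This is
not the mechanism used by this patch, which keeps a per-memcg list;
struct sk_mm_list and sk_claim_next() are hypothetical names chosen
for the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-memcg list of address spaces. */
struct sk_mm_list {
	void **items;		/* entries to be walked */
	size_t nr;		/* number of entries */
	atomic_size_t next;	/* index of the next unclaimed entry */
};

/*
 * Claim the next unclaimed entry. Two callers never receive the same entry
 * because the cursor is advanced atomically. Returns NULL once every entry
 * has been handed out, i.e., when the iteration is complete.
 */
static void *sk_claim_next(struct sk_mm_list *list)
{
	size_t i = atomic_fetch_add(&list->next, 1);

	assert(list->items != NULL || list->nr == 0);

	return i < list->nr ? list->items[i] : NULL;
}
```

A walker in this sketch would loop on sk_claim_next() until it returns
NULL and walk the page tables of each entry it receives.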

This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
   page table walkers can skip processes that have been sleeping since
   the last iteration.
2. It uses generational Bloom filters to record populated branches so
   that page table walkers can reduce their search space based on the
   query results, e.g., to skip page tables containing mostly holes or
   misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
   CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
   spanning multiple VMAs. IOW, it finishes all the VMAs within the
   range of the same PMD table before it returns to a PGD table. This
   improves the cache performance for workloads that have large
   numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[5.5, 7.5]%
                Ops/sec      KB/sec
      patch1-6: 1015292.83   39490.38
      patch1-7: 1080856.82   42040.53

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-6
      45.49%  lzo1x_1_do_compress (real work)
       7.38%  page_vma_mapped_walk
       7.24%  _raw_spin_unlock_irq
       2.64%  ptep_clear_flush
       2.31%  __zram_bvec_write
       2.13%  do_raw_spin_lock
       2.09%  lru_gen_look_around
       1.89%  free_unref_page_list
       1.85%  memmove
       1.74%  obj_malloc

    patch1-7
      47.73%  lzo1x_1_do_compress (real work)
       6.84%  page_vma_mapped_walk
       6.14%  _raw_spin_unlock_irq
       2.86%  walk_pte_range
       2.79%  ptep_clear_flush
       2.24%  __zram_bvec_write
       2.10%  do_raw_spin_lock
       1.94%  free_unref_page_list
       1.80%  memmove
       1.75%  obj_malloc

  Configurations:
    no change

[1] https://lwn.net/Articles/23732/
[2] https://source.android.com/devices/tech/debug/scudo

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 fs/exec.c                  |   2 +
 include/linux/memcontrol.h |   5 +
 include/linux/mm_types.h   |  78 +++
 include/linux/mmzone.h     |  58 +++
 include/linux/swap.h       |   4 +
 kernel/exit.c              |   1 +
 kernel/fork.c              |   9 +
 kernel/sched/core.c        |   1 +
 mm/memcontrol.c            |  24 +
 mm/vmscan.c                | 960 ++++++++++++++++++++++++++++++++++++-
 10 files changed, 1129 insertions(+), 13 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 79f2c9483302..6696fbbecbf3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1006,6 +1006,7 @@ static int exec_mmap(struct mm_struct *mm)
 	active_mm = tsk->active_mm;
 	tsk->active_mm = mm;
 	tsk->mm = mm;
+	lru_gen_add_mm(mm);
 	/*
 	 * This prevents preemption while active_mm is being loaded and
 	 * it and mm are being updated, which could cause problems for
@@ -1018,6 +1019,7 @@ static int exec_mmap(struct mm_struct *mm)
 	activate_mm(active_mm, mm);
 	if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
 		local_irq_enable();
+	lru_gen_use_mm(mm);
 	tsk->mm->vmacache_seqnum = 0;
 	vmacache_flush(tsk);
 	task_unlock(tsk);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c8ce74577290..b8e5718665b8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -343,6 +343,11 @@ struct mem_cgroup {
 	struct deferred_split deferred_split_queue;
 #endif
 
+#ifdef CONFIG_LRU_GEN
+	/* per-memcg mm_struct list */
+	struct lru_gen_mm_list mm_list;
+#endif
+
 	struct mem_cgroup_per_node *nodeinfo[];
 };
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0f549870da6a..cbc7fa381ac6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -3,6 +3,7 @@
 #define _LINUX_MM_TYPES_H
 
 #include <linux/mm_types_task.h>
+#include <linux/sched.h>
 
 #include <linux/auxvec.h>
 #include <linux/kref.h>
@@ -17,6 +18,8 @@
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
+#include <linux/nodemask.h>
+#include <linux/mmdebug.h>
 
 #include <asm/mmu.h>
 
@@ -637,6 +640,22 @@ struct mm_struct {
 #ifdef CONFIG_IOMMU_SUPPORT
 		u32 pasid;
 #endif
+#ifdef CONFIG_LRU_GEN
+		struct {
+			/* this mm_struct is on lru_gen_mm_list */
+			struct list_head list;
+#ifdef CONFIG_MEMCG
+			/* points to the memcg of "owner" above */
+			struct mem_cgroup *memcg;
+#endif
+			/*
+			 * Set when switching to this mm_struct, as a hint of
+			 * whether it has been used since the last time per-node
+			 * page table walkers cleared the corresponding bits.
+			 */
+			nodemask_t nodes;
+		} lru_gen;
+#endif /* CONFIG_LRU_GEN */
 	} __randomize_layout;
 
 	/*
@@ -663,6 +682,65 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 	return (struct cpumask *)&mm->cpu_bitmap;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+struct lru_gen_mm_list {
+	/* mm_struct list for page table walkers */
+	struct list_head fifo;
+	/* protects the list above */
+	spinlock_t lock;
+};
+
+void lru_gen_add_mm(struct mm_struct *mm);
+void lru_gen_del_mm(struct mm_struct *mm);
+#ifdef CONFIG_MEMCG
+void lru_gen_migrate_mm(struct mm_struct *mm);
+#endif
+
+static inline void lru_gen_init_mm(struct mm_struct *mm)
+{
+	INIT_LIST_HEAD(&mm->lru_gen.list);
+#ifdef CONFIG_MEMCG
+	mm->lru_gen.memcg = NULL;
+#endif
+	nodes_clear(mm->lru_gen.nodes);
+}
+
+static inline void lru_gen_use_mm(struct mm_struct *mm)
+{
+	/* unlikely but not a bug when racing with lru_gen_migrate_mm() */
+	VM_WARN_ON(list_empty(&mm->lru_gen.list));
+
+	if (!(current->flags & PF_KTHREAD) && !nodes_full(mm->lru_gen.nodes))
+		nodes_setall(mm->lru_gen.nodes);
+}
+
+#else /* !CONFIG_LRU_GEN */
+
+static inline void lru_gen_add_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_del_mm(struct mm_struct *mm)
+{
+}
+
+#ifdef CONFIG_MEMCG
+static inline void lru_gen_migrate_mm(struct mm_struct *mm)
+{
+}
+#endif
+
+static inline void lru_gen_init_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_use_mm(struct mm_struct *mm)
+{
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cd64c64a952d..a2d53025a321 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -411,6 +411,58 @@ struct lru_gen_struct {
 	atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
 };
 
+enum {
+	MM_PTE_TOTAL,	/* total leaf entries */
+	MM_PTE_OLD,	/* old leaf entries */
+	MM_PTE_YOUNG,	/* young leaf entries */
+	MM_PMD_TOTAL,	/* total non-leaf entries */
+	MM_PMD_FOUND,	/* non-leaf entries found in Bloom filters */
+	MM_PMD_ADDED,	/* non-leaf entries added to Bloom filters */
+	NR_MM_STATS
+};
+
+/* mnemonic codes for the mm stats above */
+#define MM_STAT_CODES		"toydfa"
+
+/* double-buffering Bloom filters */
+#define NR_BLOOM_FILTERS	2
+
+struct lru_gen_mm_state {
+	/* set to max_seq after each iteration */
+	unsigned long seq;
+	/* where the current iteration starts (inclusive) */
+	struct list_head *head;
+	/* where the last iteration ends (exclusive) */
+	struct list_head *tail;
+	/* to wait for the last page table walker to finish */
+	struct wait_queue_head wait;
+	/* Bloom filters flip after each iteration */
+	unsigned long *filters[NR_BLOOM_FILTERS];
+	/* the mm stats for debugging */
+	unsigned long stats[NR_HIST_GENS][NR_MM_STATS];
+	/* the number of concurrent page table walkers */
+	int nr_walkers;
+};
+
+struct lru_gen_mm_walk {
+	/* the lruvec under reclaim */
+	struct lruvec *lruvec;
+	/* unstable max_seq from lru_gen_struct */
+	unsigned long max_seq;
+	/* the next address within an mm to scan */
+	unsigned long next_addr;
+	/* to batch page table entries */
+	unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)];
+	/* to batch promoted pages */
+	int nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	/* to batch the mm stats */
+	int mm_stats[NR_MM_STATS];
+	/* total batched items */
+	int batched;
+	bool can_swap;
+	bool full_scan;
+};
+
 void lru_gen_init_lruvec(struct lruvec *lruvec);
 void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
 
@@ -461,6 +513,8 @@ struct lruvec {
 #ifdef CONFIG_LRU_GEN
 	/* evictable pages divided into generations */
 	struct lru_gen_struct		lrugen;
+	/* to concurrently iterate lru_gen_mm_list */
+	struct lru_gen_mm_state		mm_state;
 #endif
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
@@ -1053,6 +1107,10 @@ typedef struct pglist_data {
 
 	unsigned long		flags;
 
+#ifdef CONFIG_LRU_GEN
+	/* kswapd mm walk data */
+	struct lru_gen_mm_walk	mm_walk;
+#endif
 	ZONE_PADDING(_pad2_)
 
 	/* Per-node vmstats */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b37520d3ff1d..04d84ac6d1ac 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -137,6 +137,10 @@ union swap_header {
  */
 struct reclaim_state {
 	unsigned long reclaimed_slab;
+#ifdef CONFIG_LRU_GEN
+	/* per-thread mm walk data */
+	struct lru_gen_mm_walk *mm_walk;
+#endif
 };
 
 #ifdef __KERNEL__
diff --git a/kernel/exit.c b/kernel/exit.c
index b00a25bb4ab9..54d2ce4b93d1 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -463,6 +463,7 @@ void mm_update_next_owner(struct mm_struct *mm)
 		goto retry;
 	}
 	WRITE_ONCE(mm->owner, c);
+	lru_gen_migrate_mm(mm);
 	task_unlock(c);
 	put_task_struct(c);
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index f1e89007f228..9bc303eacca1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1079,6 +1079,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 		goto fail_nocontext;
 
 	mm->user_ns = get_user_ns(user_ns);
+	lru_gen_init_mm(mm);
 	return mm;
 
 fail_nocontext:
@@ -1121,6 +1122,7 @@ static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+	lru_gen_del_mm(mm);
 	mmdrop(mm);
 }
 
@@ -2586,6 +2588,13 @@ pid_t kernel_clone(struct kernel_clone_args *args)
 		get_task_struct(p);
 	}
 
+	if (IS_ENABLED(CONFIG_LRU_GEN) && !(clone_flags & CLONE_VM)) {
+		/* lock the task to synchronize with memcg migration */
+		task_lock(p);
+		lru_gen_add_mm(p->mm);
+		task_unlock(p);
+	}
+
 	wake_up_new_task(p);
 
 	/* forking complete and child started to run, tell ptracer */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9745613d531c..ecf0cdce8603 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4979,6 +4979,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 		 * finish_task_switch()'s mmdrop().
 		 */
 		switch_mm_irqs_off(prev->active_mm, next->mm, next);
+		lru_gen_use_mm(next->mm);
 
 		if (!prev->mm) {                        // from kernel
 			/* will mmdrop() in finish_task_switch(). */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e4c30950aa3c..d5993490b32f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6155,6 +6155,29 @@ static void mem_cgroup_move_task(void)
 }
 #endif
 
+#ifdef CONFIG_LRU_GEN
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+	struct cgroup_subsys_state *css;
+	struct task_struct *task = NULL;
+
+	cgroup_taskset_for_each_leader(task, css, tset)
+		break;
+
+	if (!task)
+		return;
+
+	task_lock(task);
+	if (task->mm && task->mm->owner == task)
+		lru_gen_migrate_mm(task->mm);
+	task_unlock(task);
+}
+#else
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
 static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value)
 {
 	if (value == PAGE_COUNTER_MAX)
@@ -6500,6 +6523,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.css_reset = mem_cgroup_css_reset,
 	.css_rstat_flush = mem_cgroup_css_rstat_flush,
 	.can_attach = mem_cgroup_can_attach,
+	.attach = mem_cgroup_attach,
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.post_attach = mem_cgroup_move_task,
 	.dfl_cftypes = memory_files,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2b685aa0379c..67dc4190e790 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -50,6 +50,8 @@
 #include <linux/printk.h>
 #include <linux/dax.h>
 #include <linux/psi.h>
+#include <linux/pagewalk.h>
+#include <linux/shmem_fs.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -3133,6 +3135,372 @@ static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
 	       get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS;
 }
 
+/******************************************************************************
+ *                          mm_struct list
+ ******************************************************************************/
+
+static struct lru_gen_mm_list *get_mm_list(struct mem_cgroup *memcg)
+{
+	static struct lru_gen_mm_list mm_list = {
+		.fifo = LIST_HEAD_INIT(mm_list.fifo),
+		.lock = __SPIN_LOCK_UNLOCKED(mm_list.lock),
+	};
+
+#ifdef CONFIG_MEMCG
+	if (memcg)
+		return &memcg->mm_list;
+#endif
+	return &mm_list;
+}
+
+void lru_gen_add_mm(struct mm_struct *mm)
+{
+	int nid;
+	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+
+	VM_BUG_ON_MM(!list_empty(&mm->lru_gen.list), mm);
+#ifdef CONFIG_MEMCG
+	VM_BUG_ON_MM(mm->lru_gen.memcg, mm);
+	mm->lru_gen.memcg = memcg;
+#endif
+	spin_lock(&mm_list->lock);
+
+	for_each_node_state(nid, N_MEMORY) {
+		struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+		if (!lruvec)
+			continue;
+
+		if (lruvec->mm_state.tail == &mm_list->fifo)
+			lruvec->mm_state.tail = &mm->lru_gen.list;
+	}
+
+	list_add_tail(&mm->lru_gen.list, &mm_list->fifo);
+
+	spin_unlock(&mm_list->lock);
+}
+
+void lru_gen_del_mm(struct mm_struct *mm)
+{
+	int nid;
+	struct lru_gen_mm_list *mm_list;
+	struct mem_cgroup *memcg = NULL;
+
+	if (list_empty(&mm->lru_gen.list))
+		return;
+
+#ifdef CONFIG_MEMCG
+	memcg = mm->lru_gen.memcg;
+#endif
+	mm_list = get_mm_list(memcg);
+
+	spin_lock(&mm_list->lock);
+
+	for_each_node(nid) {
+		struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+		if (!lruvec)
+			continue;
+
+		if (lruvec->mm_state.tail == &mm->lru_gen.list)
+			lruvec->mm_state.tail = lruvec->mm_state.tail->next;
+
+		if (lruvec->mm_state.head != &mm->lru_gen.list)
+			continue;
+
+		lruvec->mm_state.head = lruvec->mm_state.head->next;
+		if (lruvec->mm_state.head == &mm_list->fifo)
+			WRITE_ONCE(lruvec->mm_state.seq, lruvec->mm_state.seq + 1);
+	}
+
+	list_del_init(&mm->lru_gen.list);
+
+	spin_unlock(&mm_list->lock);
+
+#ifdef CONFIG_MEMCG
+	mem_cgroup_put(mm->lru_gen.memcg);
+	mm->lru_gen.memcg = NULL;
+#endif
+}
+
+#ifdef CONFIG_MEMCG
+void lru_gen_migrate_mm(struct mm_struct *mm)
+{
+	struct mem_cgroup *memcg;
+
+	lockdep_assert_held(&mm->owner->alloc_lock);
+
+	/* for mm_update_next_owner() */
+	if (mem_cgroup_disabled())
+		return;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(mm->owner);
+	rcu_read_unlock();
+	if (memcg == mm->lru_gen.memcg)
+		return;
+
+	VM_BUG_ON_MM(!mm->lru_gen.memcg, mm);
+	VM_BUG_ON_MM(list_empty(&mm->lru_gen.list), mm);
+
+	lru_gen_del_mm(mm);
+	lru_gen_add_mm(mm);
+}
+#endif
+
+/*
+ * Bloom filters with m=1<<15, k=2 and the false positive rates of ~1/5 when
+ * n=10,000 and ~1/2 when n=20,000, where, conventionally, m is the number of
+ * bits in a bitmap, k is the number of hash functions and n is the number of
+ * inserted items.
+ *
+ * Page table walkers use one of the two filters to reduce their search space.
+ * To get rid of non-leaf entries that no longer have enough leaf entries, the
+ * aging uses the double-buffering technique to flip to the other filter each
+ * time it produces a new generation. For non-leaf entries that have enough
+ * leaf entries, the aging carries them over to the next generation in
+ * walk_pmd_range(); the eviction also reports them when walking the rmap
+ * in lru_gen_look_around().
+ *
+ * For future optimizations:
+ * 1. It's not necessary to keep both filters all the time. The spare one can be
+ *    freed after the RCU grace period and reallocated if needed again.
+ * 2. And when reallocating, it's worth scaling its size according to the number
+ *    of inserted entries in the other filter, to reduce the memory overhead on
+ *    small systems and false positives on large systems.
+ * 3. Jenkins' hash function is an alternative to Knuth's.
+ */
+#define BLOOM_FILTER_SHIFT	15
+
+static inline int filter_gen_from_seq(unsigned long seq)
+{
+	return seq % NR_BLOOM_FILTERS;
+}
+
+static void get_item_key(void *item, int *key)
+{
+	u32 hash = hash_ptr(item, BLOOM_FILTER_SHIFT * 2);
+
+	BUILD_BUG_ON(BLOOM_FILTER_SHIFT * 2 > BITS_PER_TYPE(u32));
+
+	key[0] = hash & (BIT(BLOOM_FILTER_SHIFT) - 1);
+	key[1] = hash >> BLOOM_FILTER_SHIFT;
+}
+
+static void reset_bloom_filter(struct lruvec *lruvec, unsigned long seq)
+{
+	unsigned long *filter;
+	int gen = filter_gen_from_seq(seq);
+
+	lockdep_assert_held(&get_mm_list(lruvec_memcg(lruvec))->lock);
+
+	filter = lruvec->mm_state.filters[gen];
+	if (filter) {
+		bitmap_clear(filter, 0, BIT(BLOOM_FILTER_SHIFT));
+		return;
+	}
+
+	filter = bitmap_zalloc(BIT(BLOOM_FILTER_SHIFT), GFP_ATOMIC);
+	WRITE_ONCE(lruvec->mm_state.filters[gen], filter);
+}
+
+static void update_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item)
+{
+	int key[2];
+	unsigned long *filter;
+	int gen = filter_gen_from_seq(seq);
+
+	filter = READ_ONCE(lruvec->mm_state.filters[gen]);
+	if (!filter)
+		return;
+
+	get_item_key(item, key);
+
+	if (!test_bit(key[0], filter))
+		set_bit(key[0], filter);
+	if (!test_bit(key[1], filter))
+		set_bit(key[1], filter);
+}
+
+static bool test_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item)
+{
+	int key[2];
+	unsigned long *filter;
+	int gen = filter_gen_from_seq(seq);
+
+	filter = READ_ONCE(lruvec->mm_state.filters[gen]);
+	if (!filter)
+		return true;
+
+	get_item_key(item, key);
+
+	return test_bit(key[0], filter) && test_bit(key[1], filter);
+}
+
+static void reset_mm_stats(struct lruvec *lruvec, struct lru_gen_mm_walk *walk, bool last)
+{
+	int i;
+	int hist;
+
+	lockdep_assert_held(&get_mm_list(lruvec_memcg(lruvec))->lock);
+
+	if (walk) {
+		hist = lru_hist_from_seq(walk->max_seq);
+
+		for (i = 0; i < NR_MM_STATS; i++) {
+			WRITE_ONCE(lruvec->mm_state.stats[hist][i],
+				   lruvec->mm_state.stats[hist][i] + walk->mm_stats[i]);
+			walk->mm_stats[i] = 0;
+		}
+	}
+
+	if (NR_HIST_GENS > 1 && last) {
+		hist = lru_hist_from_seq(lruvec->mm_state.seq + 1);
+
+		for (i = 0; i < NR_MM_STATS; i++)
+			WRITE_ONCE(lruvec->mm_state.stats[hist][i], 0);
+	}
+}
+
+static bool should_skip_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+	int type;
+	unsigned long size = 0;
+	struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
+
+	if (!walk->full_scan && cpumask_empty(mm_cpumask(mm)) &&
+	    !node_isset(pgdat->node_id, mm->lru_gen.nodes))
+		return true;
+
+	node_clear(pgdat->node_id, mm->lru_gen.nodes);
+
+	for (type = !walk->can_swap; type < ANON_AND_FILE; type++) {
+		size += type ? get_mm_counter(mm, MM_FILEPAGES) :
+			       get_mm_counter(mm, MM_ANONPAGES) +
+			       get_mm_counter(mm, MM_SHMEMPAGES);
+	}
+
+	if (size < MIN_LRU_BATCH)
+		return true;
+
+	if (mm_is_oom_victim(mm))
+		return true;
+
+	return !mmget_not_zero(mm);
+}
+
+static bool iterate_mm_list(struct lruvec *lruvec, struct lru_gen_mm_walk *walk,
+			    struct mm_struct **iter)
+{
+	bool first = false;
+	bool last = true;
+	struct mm_struct *mm = NULL;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+	struct lru_gen_mm_state *mm_state = &lruvec->mm_state;
+
+	/*
+	 * There are four interesting cases for this page table walker:
+	 * 1. It tries to start a new iteration of mm_list with a stale max_seq;
+	 *    there is nothing to be done.
+	 * 2. It's the first of the current generation, and it needs to reset
+	 *    the Bloom filter for the next generation.
+	 * 3. It reaches the end of mm_list, and it needs to increment
+	 *    mm_state->seq; the iteration is done.
+	 * 4. It's the last of the current generation, and it needs to reset the
+	 *    mm stats counters for the next generation.
+	 */
+	if (*iter)
+		mmput_async(*iter);
+	else if (walk->max_seq <= READ_ONCE(mm_state->seq))
+		return false;
+
+	spin_lock(&mm_list->lock);
+
+	VM_BUG_ON(mm_state->seq + 1 < walk->max_seq);
+	VM_BUG_ON(*iter && mm_state->seq > walk->max_seq);
+	VM_BUG_ON(*iter && !mm_state->nr_walkers);
+
+	if (walk->max_seq <= mm_state->seq) {
+		if (!*iter)
+			last = false;
+		goto done;
+	}
+
+	if (!mm_state->nr_walkers) {
+		VM_BUG_ON(mm_state->head && mm_state->head != &mm_list->fifo);
+
+		mm_state->head = mm_list->fifo.next;
+		first = true;
+	}
+
+	while (!mm && mm_state->head != &mm_list->fifo) {
+		mm = list_entry(mm_state->head, struct mm_struct, lru_gen.list);
+
+		mm_state->head = mm_state->head->next;
+
+		/* full scan for those added after the last iteration */
+		if (!mm_state->tail || mm_state->tail == &mm->lru_gen.list) {
+			mm_state->tail = mm_state->head;
+			walk->full_scan = true;
+		}
+
+		if (should_skip_mm(mm, walk))
+			mm = NULL;
+	}
+
+	if (mm_state->head == &mm_list->fifo)
+		WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+done:
+	if (*iter && !mm)
+		mm_state->nr_walkers--;
+	if (!*iter && mm)
+		mm_state->nr_walkers++;
+
+	if (mm_state->nr_walkers)
+		last = false;
+
+	if (mm && first)
+		reset_bloom_filter(lruvec, walk->max_seq + 1);
+
+	if (*iter || last)
+		reset_mm_stats(lruvec, walk, last);
+
+	spin_unlock(&mm_list->lock);
+
+	*iter = mm;
+
+	return last;
+}
+
+static bool iterate_mm_list_nowalk(struct lruvec *lruvec, unsigned long max_seq)
+{
+	bool success = false;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+	struct lru_gen_mm_state *mm_state = &lruvec->mm_state;
+
+	if (max_seq <= READ_ONCE(mm_state->seq))
+		return false;
+
+	spin_lock(&mm_list->lock);
+
+	VM_BUG_ON(mm_state->seq + 1 < max_seq);
+
+	if (max_seq > mm_state->seq && !mm_state->nr_walkers) {
+		VM_BUG_ON(mm_state->head && mm_state->head != &mm_list->fifo);
+
+		WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+		reset_mm_stats(lruvec, NULL, true);
+		success = true;
+	}
+
+	spin_unlock(&mm_list->lock);
+
+	return success;
+}
+
 /******************************************************************************
  *                          refault feedback loop
  ******************************************************************************/
@@ -3286,6 +3654,465 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
 	return new_gen;
 }
 
+static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio,
+			      int old_gen, int new_gen)
+{
+	int type = folio_is_file_lru(folio);
+	int zone = folio_zonenum(folio);
+	int delta = folio_nr_pages(folio);
+
+	VM_BUG_ON(old_gen >= MAX_NR_GENS);
+	VM_BUG_ON(new_gen >= MAX_NR_GENS);
+
+	walk->batched++;
+
+	walk->nr_pages[old_gen][type][zone] -= delta;
+	walk->nr_pages[new_gen][type][zone] += delta;
+}
+
+static void reset_batch_size(struct lruvec *lruvec, struct lru_gen_mm_walk *walk)
+{
+	int gen, type, zone;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	walk->batched = 0;
+
+	for_each_gen_type_zone(gen, type, zone) {
+		enum lru_list lru = type * LRU_INACTIVE_FILE;
+		int delta = walk->nr_pages[gen][type][zone];
+
+		if (!delta)
+			continue;
+
+		walk->nr_pages[gen][type][zone] = 0;
+		WRITE_ONCE(lrugen->nr_pages[gen][type][zone],
+			   lrugen->nr_pages[gen][type][zone] + delta);
+
+		if (lru_gen_is_active(lruvec, gen))
+			lru += LRU_ACTIVE;
+		__update_lru_size(lruvec, lru, zone, delta);
+	}
+}
+
+static int should_skip_vma(unsigned long start, unsigned long end, struct mm_walk *walk)
+{
+	struct address_space *mapping;
+	struct vm_area_struct *vma = walk->vma;
+	struct lru_gen_mm_walk *priv = walk->private;
+
+	if (!vma_is_accessible(vma) || is_vm_hugetlb_page(vma) ||
+	    (vma->vm_flags & (VM_LOCKED | VM_SPECIAL | VM_SEQ_READ | VM_RAND_READ)) ||
+	    vma == get_gate_vma(vma->vm_mm))
+		return true;
+
+	if (vma_is_anonymous(vma))
+		return !priv->can_swap;
+
+	if (WARN_ON_ONCE(!vma->vm_file || !vma->vm_file->f_mapping))
+		return true;
+
+	mapping = vma->vm_file->f_mapping;
+	if (mapping_unevictable(mapping))
+		return true;
+
+	/* check readpage to exclude special mappings like dax, etc. */
+	return shmem_mapping(mapping) ? !priv->can_swap : !mapping->a_ops->readpage;
+}
+
+/*
+ * Some userspace memory allocators map many single-page VMAs. Instead of
+ * returning to the PGD table for each such VMA, finish an entire PMD
+ * table to reduce zigzags and improve cache performance.
+ */
+static bool get_next_vma(struct mm_walk *walk, unsigned long mask, unsigned long size,
+			 unsigned long *start, unsigned long *end)
+{
+	unsigned long next = round_up(*end, size);
+
+	VM_BUG_ON(mask & size);
+	VM_BUG_ON(*start >= *end);
+	VM_BUG_ON((next & mask) != (*start & mask));
+
+	while (walk->vma) {
+		if (next >= walk->vma->vm_end) {
+			walk->vma = walk->vma->vm_next;
+			continue;
+		}
+
+		if ((next & mask) != (walk->vma->vm_start & mask))
+			return false;
+
+		if (should_skip_vma(walk->vma->vm_start, walk->vma->vm_end, walk)) {
+			walk->vma = walk->vma->vm_next;
+			continue;
+		}
+
+		*start = max(next, walk->vma->vm_start);
+		next = (next | ~mask) + 1;
+		/* rounded-up boundaries can wrap to 0 */
+		*end = next && next < walk->vma->vm_end ? next : walk->vma->vm_end;
+
+		return true;
+	}
+
+	return false;
+}
+
+static bool suitable_to_scan(int total, int young)
+{
+	int n = clamp_t(int, cache_line_size() / sizeof(pte_t), 2, 8);
+
+	/* suitable if the average number of young PTEs per cacheline is >=1 */
+	return young * n >= total;
+}
+
+static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
+			   struct mm_walk *walk)
+{
+	int i;
+	pte_t *pte;
+	spinlock_t *ptl;
+	unsigned long addr;
+	int total = 0;
+	int young = 0;
+	struct lru_gen_mm_walk *priv = walk->private;
+	struct mem_cgroup *memcg = lruvec_memcg(priv->lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec);
+	int old_gen, new_gen = lru_gen_from_seq(priv->max_seq);
+
+	VM_BUG_ON(pmd_leaf(*pmd));
+
+	pte = pte_offset_map_lock(walk->mm, pmd, start & PMD_MASK, &ptl);
+	arch_enter_lazy_mmu_mode();
+restart:
+	for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
+		struct folio *folio;
+		unsigned long pfn = pte_pfn(pte[i]);
+
+		VM_BUG_ON(addr < walk->vma->vm_start || addr >= walk->vma->vm_end);
+
+		total++;
+		priv->mm_stats[MM_PTE_TOTAL]++;
+
+		if (!pte_present(pte[i]) || is_zero_pfn(pfn))
+			continue;
+
+		if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i])))
+			continue;
+
+		if (!pte_young(pte[i])) {
+			priv->mm_stats[MM_PTE_OLD]++;
+			continue;
+		}
+
+		VM_BUG_ON(!pfn_valid(pfn));
+		if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+			continue;
+
+		folio = pfn_folio(pfn);
+		if (folio_nid(folio) != pgdat->node_id)
+			continue;
+
+		if (folio_memcg_rcu(folio) != memcg)
+			continue;
+
+		if (!ptep_test_and_clear_young(walk->vma, addr, pte + i))
+			continue;
+
+		young++;
+		priv->mm_stats[MM_PTE_YOUNG]++;
+
+		if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
+		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+		      !folio_test_swapcache(folio)))
+			folio_mark_dirty(folio);
+
+		old_gen = folio_update_gen(folio, new_gen);
+		if (old_gen >= 0 && old_gen != new_gen)
+			update_batch_size(priv, folio, old_gen, new_gen);
+	}
+
+	if (i < PTRS_PER_PTE && get_next_vma(walk, PMD_MASK, PAGE_SIZE, &start, &end))
+		goto restart;
+
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(pte, ptl);
+
+	return suitable_to_scan(total, young);
+}
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
+static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma,
+				  struct mm_walk *walk, unsigned long *start)
+{
+	int i;
+	pmd_t *pmd;
+	spinlock_t *ptl;
+	struct lru_gen_mm_walk *priv = walk->private;
+	struct mem_cgroup *memcg = lruvec_memcg(priv->lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec);
+	int old_gen, new_gen = lru_gen_from_seq(priv->max_seq);
+
+	VM_BUG_ON(pud_leaf(*pud));
+
+	/* try to batch at most 1+MIN_LRU_BATCH+1 entries */
+	if (*start == -1) {
+		*start = next;
+		return;
+	}
+
+	i = next == -1 ? 0 : pmd_index(next) - pmd_index(*start);
+	if (i && i <= MIN_LRU_BATCH) {
+		__set_bit(i - 1, priv->bitmap);
+		return;
+	}
+
+	pmd = pmd_offset(pud, *start);
+	ptl = pmd_lock(walk->mm, pmd);
+	arch_enter_lazy_mmu_mode();
+
+	do {
+		struct folio *folio;
+		unsigned long pfn = pmd_pfn(pmd[i]);
+		unsigned long addr = i ? (*start & PMD_MASK) + i * PMD_SIZE : *start;
+
+		VM_BUG_ON(addr < vma->vm_start || addr >= vma->vm_end);
+
+		if (!pmd_present(pmd[i]) || is_huge_zero_pmd(pmd[i]))
+			goto next;
+
+		if (WARN_ON_ONCE(pmd_devmap(pmd[i])))
+			goto next;
+
+		if (!pmd_trans_huge(pmd[i])) {
+			if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG))
+				pmdp_test_and_clear_young(vma, addr, pmd + i);
+			goto next;
+		}
+
+		VM_BUG_ON(!pfn_valid(pfn));
+		if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+			goto next;
+
+		folio = pfn_folio(pfn);
+		if (folio_nid(folio) != pgdat->node_id)
+			goto next;
+
+		if (folio_memcg_rcu(folio) != memcg)
+			goto next;
+
+		if (!pmdp_test_and_clear_young(vma, addr, pmd + i))
+			goto next;
+
+		priv->mm_stats[MM_PTE_YOUNG]++;
+
+		if (pmd_dirty(pmd[i]) && !folio_test_dirty(folio) &&
+		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+		      !folio_test_swapcache(folio)))
+			folio_mark_dirty(folio);
+
+		old_gen = folio_update_gen(folio, new_gen);
+		if (old_gen >= 0 && old_gen != new_gen)
+			update_batch_size(priv, folio, old_gen, new_gen);
+next:
+		i = i > MIN_LRU_BATCH ? 0 :
+		    find_next_bit(priv->bitmap, MIN_LRU_BATCH, i) + 1;
+	} while (i <= MIN_LRU_BATCH);
+
+	arch_leave_lazy_mmu_mode();
+	spin_unlock(ptl);
+
+	*start = -1;
+	bitmap_zero(priv->bitmap, MIN_LRU_BATCH);
+}
+#else
+static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma,
+				  struct mm_walk *walk, unsigned long *start)
+{
+}
+#endif
+
+static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
+			   struct mm_walk *walk)
+{
+	int i;
+	pmd_t *pmd;
+	unsigned long next;
+	unsigned long addr;
+	struct vm_area_struct *vma;
+	unsigned long pos = -1;
+	struct lru_gen_mm_walk *priv = walk->private;
+
+	VM_BUG_ON(pud_leaf(*pud));
+
+	/*
+	 * Finish an entire PMD in two passes: the first only reaches PTE
+	 * tables to avoid taking the PMD lock; the second, if necessary, takes
+	 * the PMD lock to clear the accessed bit in PMD entries.
+	 */
+	pmd = pmd_offset(pud, start & PUD_MASK);
+restart:
+	/* walk_pte_range() may call get_next_vma() */
+	vma = walk->vma;
+	for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
+		pmd_t val = pmd_read_atomic(pmd + i);
+
+		/* for pmd_read_atomic() */
+		barrier();
+
+		next = pmd_addr_end(addr, end);
+
+		if (!pmd_present(val)) {
+			priv->mm_stats[MM_PTE_TOTAL]++;
+			continue;
+		}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		if (pmd_trans_huge(val)) {
+			unsigned long pfn = pmd_pfn(val);
+			struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec);
+
+			priv->mm_stats[MM_PTE_TOTAL]++;
+
+			if (is_huge_zero_pmd(val))
+				continue;
+
+			if (!pmd_young(val)) {
+				priv->mm_stats[MM_PTE_OLD]++;
+				continue;
+			}
+
+			if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+				continue;
+
+			walk_pmd_range_locked(pud, addr, vma, walk, &pos);
+			continue;
+		}
+#endif
+		priv->mm_stats[MM_PMD_TOTAL]++;
+
+#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
+		if (!pmd_young(val))
+			continue;
+
+		walk_pmd_range_locked(pud, addr, vma, walk, &pos);
+#endif
+		if (!priv->full_scan && !test_bloom_filter(priv->lruvec, priv->max_seq, pmd + i))
+			continue;
+
+		priv->mm_stats[MM_PMD_FOUND]++;
+
+		if (!walk_pte_range(&val, addr, next, walk))
+			continue;
+
+		priv->mm_stats[MM_PMD_ADDED]++;
+
+		/* carry over to the next generation */
+		update_bloom_filter(priv->lruvec, priv->max_seq + 1, pmd + i);
+	}
+
+	walk_pmd_range_locked(pud, -1, vma, walk, &pos);
+
+	if (i < PTRS_PER_PMD && get_next_vma(walk, PUD_MASK, PMD_SIZE, &start, &end))
+		goto restart;
+}
+
+static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
+			  struct mm_walk *walk)
+{
+	int i;
+	pud_t *pud;
+	unsigned long addr;
+	unsigned long next;
+	struct lru_gen_mm_walk *priv = walk->private;
+
+	VM_BUG_ON(p4d_leaf(*p4d));
+
+	pud = pud_offset(p4d, start & P4D_MASK);
+restart:
+	for (i = pud_index(start), addr = start; addr != end; i++, addr = next) {
+		pud_t val = READ_ONCE(pud[i]);
+
+		next = pud_addr_end(addr, end);
+
+		if (!pud_present(val) || WARN_ON_ONCE(pud_leaf(val)))
+			continue;
+
+		walk_pmd_range(&val, addr, next, walk);
+
+		if (priv->batched >= MAX_LRU_BATCH) {
+			end = (addr | ~PUD_MASK) + 1;
+			goto done;
+		}
+	}
+
+	if (i < PTRS_PER_PUD && get_next_vma(walk, P4D_MASK, PUD_SIZE, &start, &end))
+		goto restart;
+
+	end = round_up(end, P4D_SIZE);
+done:
+	/* rounded-up boundaries can wrap to 0 */
+	priv->next_addr = end && walk->vma ? max(end, walk->vma->vm_start) : 0;
+
+	return -EAGAIN;
+}
+
+static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+	static const struct mm_walk_ops mm_walk_ops = {
+		.test_walk = should_skip_vma,
+		.p4d_entry = walk_pud_range,
+	};
+
+	int err;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	walk->next_addr = FIRST_USER_ADDRESS;
+
+	do {
+		err = -EBUSY;
+
+		/* folio_update_gen() requires stable folio_memcg() */
+		if (!mem_cgroup_trylock_pages(memcg))
+			break;
+
+		/* the caller might be holding the lock for write */
+		if (mmap_read_trylock(mm)) {
+			unsigned long start = walk->next_addr;
+			unsigned long end = mm->highest_vm_end;
+
+			err = walk_page_range(mm, start, end, &mm_walk_ops, walk);
+
+			mmap_read_unlock(mm);
+
+			if (walk->batched) {
+				spin_lock_irq(&lruvec->lru_lock);
+				reset_batch_size(lruvec, walk);
+				spin_unlock_irq(&lruvec->lru_lock);
+			}
+		}
+
+		mem_cgroup_unlock_pages();
+
+		cond_resched();
+	} while (err == -EAGAIN && walk->next_addr && !mm_is_oom_victim(mm));
+}
+
+static struct lru_gen_mm_walk *alloc_mm_walk(void)
+{
+	if (current->reclaim_state && current->reclaim_state->mm_walk)
+		return current->reclaim_state->mm_walk;
+
+	return kzalloc(sizeof(struct lru_gen_mm_walk),
+		       __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
+}
+
+static void free_mm_walk(struct lru_gen_mm_walk *walk)
+{
+	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
+		kfree(walk);
+}
+
 static void inc_min_seq(struct lruvec *lruvec)
 {
 	int type;
@@ -3344,7 +4171,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
 	return success;
 }
 
-static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
+static void inc_max_seq(struct lruvec *lruvec)
 {
 	int prev, next;
 	int type, zone;
@@ -3354,9 +4181,6 @@ static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
 
 	VM_BUG_ON(!seq_is_valid(lruvec));
 
-	if (max_seq != lrugen->max_seq)
-		goto unlock;
-
 	inc_min_seq(lruvec);
 
 	/* update the active/inactive LRU sizes for compatibility */
@@ -3382,10 +4206,72 @@ static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
 
 	/* make sure preceding modifications appear */
 	smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
-unlock:
+
 	spin_unlock_irq(&lruvec->lru_lock);
 }
 
+static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
+			       struct scan_control *sc, bool can_swap, bool full_scan)
+{
+	bool success;
+	struct lru_gen_mm_walk *walk;
+	struct mm_struct *mm = NULL;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON(max_seq > READ_ONCE(lrugen->max_seq));
+
+	/*
+	 * If the hardware doesn't automatically set the accessed bit, fall back
+	 * to lru_gen_look_around(), which only clears the accessed bit in a
+	 * handful of PTEs. Spreading the work out over a period of time usually
+	 * is less efficient, but it avoids bursty page faults.
+	 */
+	if (!full_scan && !arch_has_hw_pte_young()) {
+		success = iterate_mm_list_nowalk(lruvec, max_seq);
+		goto done;
+	}
+
+	walk = alloc_mm_walk();
+	if (!walk) {
+		success = iterate_mm_list_nowalk(lruvec, max_seq);
+		goto done;
+	}
+
+	walk->lruvec = lruvec;
+	walk->max_seq = max_seq;
+	walk->can_swap = can_swap;
+	walk->full_scan = full_scan;
+
+	do {
+		success = iterate_mm_list(lruvec, walk, &mm);
+		if (mm)
+			walk_mm(lruvec, mm, walk);
+
+		cond_resched();
+	} while (mm);
+
+	free_mm_walk(walk);
+done:
+	if (!success) {
+		if (!current_is_kswapd() && !sc->priority)
+			wait_event_killable(lruvec->mm_state.wait,
+					    max_seq < READ_ONCE(lrugen->max_seq));
+
+		return max_seq < READ_ONCE(lrugen->max_seq);
+	}
+
+	VM_BUG_ON(max_seq != READ_ONCE(lrugen->max_seq));
+
+	inc_max_seq(lruvec);
+	/* either this sees any waiters or they will see updated max_seq */
+	if (wq_has_sleeper(&lruvec->mm_state.wait))
+		wake_up_all(&lruvec->mm_state.wait);
+
+	wakeup_flusher_threads(WB_REASON_VMSCAN);
+
+	return true;
+}
+
 static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
 			     unsigned long *min_seq, bool can_swap, bool *need_aging)
 {
@@ -3453,7 +4339,7 @@ static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		nr_to_scan++;
 
 	if (nr_to_scan && need_aging && (!mem_cgroup_below_low(memcg) || sc->memcg_low_reclaim))
-		inc_max_seq(lruvec, max_seq);
+		try_to_inc_max_seq(lruvec, max_seq, sc, swappiness, false);
 }
 
 static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
@@ -3462,6 +4348,8 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 
 	VM_BUG_ON(!current_is_kswapd());
 
+	current->reclaim_state->mm_walk = &pgdat->mm_walk;
+
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
 		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
@@ -3470,11 +4358,16 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 
 		cond_resched();
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+	current->reclaim_state->mm_walk = NULL;
 }
 
 /*
  * This function exploits spatial locality when shrink_page_list() walks the
  * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages.
+ * If the scan was cacheline efficient, it adds the PMD entry pointing
+ * to the PTE table to the Bloom filter. This process is a feedback loop from
+ * the eviction to the aging.
  */
 void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 {
@@ -3484,6 +4377,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	unsigned long end;
 	unsigned long addr;
 	struct folio *folio;
+	struct lru_gen_mm_walk *walk;
+	int young = 0;
 	unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {};
 	struct mem_cgroup *memcg = page_memcg(pvmw->page);
 	struct pglist_data *pgdat = page_pgdat(pvmw->page);
@@ -3541,6 +4436,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 		if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
 			continue;
 
+		young++;
+
 		if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
 		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
 		      !folio_test_swapcache(folio)))
@@ -3556,7 +4453,13 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	arch_leave_lazy_mmu_mode();
 	rcu_read_unlock();
 
-	if (bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
+	/* feedback from rmap walkers to page table walkers */
+	if (suitable_to_scan(i, young))
+		update_bloom_filter(lruvec, max_seq, pvmw->pmd);
+
+	walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
+
+	if (!walk && bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
 		for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
 			folio = page_folio(pte_page(pte[i]));
 			folio_activate(folio);
@@ -3568,8 +4471,10 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	if (!mem_cgroup_trylock_pages(memcg))
 		return;
 
-	spin_lock_irq(&lruvec->lru_lock);
-	new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
+	if (!walk) {
+		spin_lock_irq(&lruvec->lru_lock);
+		new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
+	}
 
 	for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
 		folio = page_folio(pte_page(pte[i]));
@@ -3580,10 +4485,14 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 		if (old_gen < 0 || old_gen == new_gen)
 			continue;
 
-		lru_gen_update_size(lruvec, folio, old_gen, new_gen);
+		if (walk)
+			update_batch_size(walk, folio, old_gen, new_gen);
+		else
+			lru_gen_update_size(lruvec, folio, old_gen, new_gen);
 	}
 
-	spin_unlock_irq(&lruvec->lru_lock);
+	if (!walk)
+		spin_unlock_irq(&lruvec->lru_lock);
 
 	mem_cgroup_unlock_pages();
 }
@@ -3850,6 +4759,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 	struct folio *folio;
 	enum vm_event_item item;
 	struct reclaim_stat stat;
+	struct lru_gen_mm_walk *walk;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
@@ -3889,6 +4799,10 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 
 	move_pages_to_lru(lruvec, &list);
 
+	walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
+	if (walk && walk->batched)
+		reset_batch_size(lruvec, walk);
+
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, reclaimed);
@@ -3943,20 +4857,25 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
 		return 0;
 	}
 
-	inc_max_seq(lruvec, max_seq);
+	if (try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false))
+		return nr_to_scan;
 
-	return nr_to_scan;
+	return min_seq[LRU_GEN_FILE] + MIN_NR_GENS <= max_seq ? nr_to_scan : 0;
 }
 
 static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
 	struct blk_plug plug;
 	long scanned = 0;
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
 	lru_add_drain();
 
 	blk_start_plug(&plug);
 
+	if (current_is_kswapd())
+		current->reclaim_state->mm_walk = &pgdat->mm_walk;
+
 	while (true) {
 		int delta;
 		int swappiness;
@@ -3984,6 +4903,9 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 		cond_resched();
 	}
 
+	if (current_is_kswapd())
+		current->reclaim_state->mm_walk = NULL;
+
 	blk_finish_plug(&plug);
 }
 
@@ -4000,15 +4922,21 @@ void lru_gen_init_lruvec(struct lruvec *lruvec)
 
 	for_each_gen_type_zone(gen, type, zone)
 		INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
+
+	lruvec->mm_state.seq = MIN_NR_GENS;
+	init_waitqueue_head(&lruvec->mm_state.wait);
 }
 
 #ifdef CONFIG_MEMCG
 void lru_gen_init_memcg(struct mem_cgroup *memcg)
 {
+	INIT_LIST_HEAD(&memcg->mm_list.fifo);
+	spin_lock_init(&memcg->mm_list.lock);
 }
 
 void lru_gen_exit_memcg(struct mem_cgroup *memcg)
 {
+	int i;
 	int nid;
 
 	for_each_node(nid) {
@@ -4016,6 +4944,11 @@ void lru_gen_exit_memcg(struct mem_cgroup *memcg)
 
 		VM_BUG_ON(memchr_inv(lruvec->lrugen.nr_pages, 0,
 				     sizeof(lruvec->lrugen.nr_pages)));
+
+		for (i = 0; i < NR_BLOOM_FILTERS; i++) {
+			bitmap_free(lruvec->mm_state.filters[i]);
+			lruvec->mm_state.filters[i] = NULL;
+		}
 	}
 }
 #endif
@@ -4024,6 +4957,7 @@ static int __init init_lru_gen(void)
 {
 	BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS);
 	BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
+	BUILD_BUG_ON(sizeof(MM_STAT_CODES) != NR_MM_STATS + 1);
 
 	return 0;
 };
-- 
2.35.1.616.g0bdcbb4464-goog



* [PATCH v9 08/14] mm: multi-gen LRU: support page table walks
@ 2022-03-09  2:12   ` Yu Zhao
From: Yu Zhao @ 2022-03-09  2:12 UTC
  To: Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
	Jonathan Corbet, Matthew Wilcox, Mel Gorman, Michael Larabel,
	Michal Hocko, Mike Rapoport, Rik van Riel, Vlastimil Babka,
	Will Deacon, Ying Huang, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, page-reclaim, x86, Yu Zhao, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

To further exploit spatial locality, the aging prefers to walk page
tables to search for young PTEs and promote hot pages. A kill switch
will be added in a later patch to disable this behavior. When
disabled, the aging relies on the rmap only.

NB: this behavior has nothing in common with the page table scanning in
the 2.4 kernel [1], which searches page tables for old PTEs, adds cold
pages to swapcache and unmaps them.

To avoid confusion, the term "iteration" specifically means the
traversal of an entire mm_struct list; the term "walk" will be applied
to page tables and the rmap, as usual.

An mm_struct list is maintained for each memcg, and an mm_struct
follows its owner task to the new memcg when this task is migrated.
Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot
pages before it increments max_seq.

When multiple page table walkers iterate the same list, each of them
gets a unique mm_struct; therefore they can run concurrently. Page
table walkers ignore any misplaced pages, e.g., if an mm_struct was
migrated, pages it left in the previous memcg will not be promoted
when its current memcg is under reclaim. Similarly, page table walkers
will not promote pages from nodes other than the one under reclaim.
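
The hand-out of unique mm_structs to concurrent walkers can be pictured
with the userspace model below. It is only a sketch under stated
assumptions, not the code in this patch: the list is reduced to slot
indices, locking is omitted, and the names iter_state and next_slot()
are hypothetical. The fields seq, head and nr_walkers loosely mirror
those of struct lru_gen_mm_state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the per-lruvec iteration state (hypothetical). */
struct iter_state {
	unsigned long seq;	/* number of completed iterations */
	size_t head;		/* next list slot to hand out */
	int nr_walkers;		/* walkers currently holding a slot */
};

/*
 * Hand the caller the next unclaimed slot in *slot, or -1 if none is left.
 * Kernel code would hold the list lock across this function; this model is
 * single-threaded. Returns true only for the walker that finishes last.
 */
static bool next_slot(struct iter_state *s, size_t nr_slots,
		      unsigned long max_seq, int *slot)
{
	bool last = true;
	int next = -1;

	if (max_seq <= s->seq) {
		/* the iteration for max_seq is over; a newcomer is stale */
		if (*slot < 0)
			last = false;
		goto done;
	}

	if (!s->nr_walkers)
		s->head = 0;		/* first walker starts a new iteration */

	if (s->head < nr_slots)
		next = (int)s->head++;

	if (s->head == nr_slots)
		s->seq++;		/* every slot has been handed out */
done:
	if (*slot >= 0 && next < 0)
		s->nr_walkers--;	/* caller is leaving */
	if (*slot < 0 && next >= 0)
		s->nr_walkers++;	/* caller is joining */
	assert(s->nr_walkers >= 0);

	if (s->nr_walkers)
		last = false;		/* someone is still walking */

	*slot = next;
	return last;
}
```

In this model, two walkers calling next_slot() with the same max_seq
never receive the same slot, and only the walker that leaves last gets
a true return value, which is the point at which max_seq could be
incremented.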

This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_structs between context switches so that
   page table walkers can skip processes that have been sleeping since
   the last iteration.
2. It uses generational Bloom filters to record populated branches so
   that page table walkers can reduce their search space based on the
   query results, e.g., to skip page tables containing mostly holes or
   misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
   CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
   spanning multiple VMAs. IOW, it finishes all the VMAs within the
   range of the same PMD table before it returns to a PGD table. This
   improves the cache performance for workloads that have large
   numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
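
The generational Bloom filters in item 2 can be sketched in userspace
as follows. This is a minimal illustration, assuming m = 1 << 15 bits
and k = 2 keys as in the comment in mm/vmscan.c; the hash function, the
struct name gen_bloom and the bloom_*() helpers are hypothetical, and
the kernel's handling of a failed filter allocation is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FILTER_SHIFT	15			/* m = 1 << 15 bits */
#define FILTER_BITS	(1u << FILTER_SHIFT)
#define NR_FILTERS	2			/* double buffering */

struct gen_bloom {
	uint8_t bits[NR_FILTERS][FILTER_BITS / 8];
};

/* Split one 30-bit hash into k = 2 keys of 15 bits each. */
static void get_keys(uint64_t item, uint32_t key[2])
{
	/* assumed hash: 64-bit multiplicative hash, top 30 bits kept */
	uint32_t hash = (uint32_t)((item * 0x9E3779B97F4A7C15ull) >> 34);

	key[0] = hash & (FILTER_BITS - 1);
	key[1] = hash >> FILTER_SHIFT;
	assert(key[0] < FILTER_BITS && key[1] < FILTER_BITS);
}

/* The sequence number selects which of the two filters is used. */
static uint8_t *filter_of(struct gen_bloom *b, unsigned long seq)
{
	return b->bits[seq % NR_FILTERS];
}

/* Clear the filter that belongs to generation seq. */
static void bloom_reset(struct gen_bloom *b, unsigned long seq)
{
	memset(filter_of(b, seq), 0, FILTER_BITS / 8);
}

/* Record item in the filter of generation seq. */
static void bloom_add(struct gen_bloom *b, unsigned long seq, uint64_t item)
{
	uint32_t key[2];
	uint8_t *f = filter_of(b, seq);

	get_keys(item, key);
	f[key[0] / 8] |= (uint8_t)(1u << (key[0] % 8));
	f[key[1] / 8] |= (uint8_t)(1u << (key[1] % 8));
}

/* False means "definitely absent"; true may be a false positive. */
static bool bloom_test(struct gen_bloom *b, unsigned long seq, uint64_t item)
{
	uint32_t key[2];
	uint8_t *f = filter_of(b, seq);

	get_keys(item, key);
	return ((f[key[0] / 8] >> (key[0] % 8)) & 1) &&
	       ((f[key[1] / 8] >> (key[1] % 8)) & 1);
}
```

An entry that still qualifies is re-added under seq + 1 so that it is
carried over; anything not re-added disappears when that buffer is
reset two generations later. With m = 1 << 15 and k = 2, the usual
estimate (1 - e^(-kn/m))^k gives roughly 0.21 for n = 10,000 and
roughly 0.50 for n = 20,000, consistent with the ~1/5 and ~1/2 rates
quoted in the patch.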

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[5.5, 7.5]%
                Ops/sec      KB/sec
      patch1-6: 1015292.83   39490.38
      patch1-7: 1080856.82   42040.53

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-6
      45.49%  lzo1x_1_do_compress (real work)
       7.38%  page_vma_mapped_walk
       7.24%  _raw_spin_unlock_irq
       2.64%  ptep_clear_flush
       2.31%  __zram_bvec_write
       2.13%  do_raw_spin_lock
       2.09%  lru_gen_look_around
       1.89%  free_unref_page_list
       1.85%  memmove
       1.74%  obj_malloc

    patch1-7
      47.73%  lzo1x_1_do_compress (real work)
       6.84%  page_vma_mapped_walk
       6.14%  _raw_spin_unlock_irq
       2.86%  walk_pte_range
       2.79%  ptep_clear_flush
       2.24%  __zram_bvec_write
       2.10%  do_raw_spin_lock
       1.94%  free_unref_page_list
       1.80%  memmove
       1.75%  obj_malloc

  Configurations:
    no change

[1] https://lwn.net/Articles/23732/
[2] https://source.android.com/devices/tech/debug/scudo

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 fs/exec.c                  |   2 +
 include/linux/memcontrol.h |   5 +
 include/linux/mm_types.h   |  78 +++
 include/linux/mmzone.h     |  58 +++
 include/linux/swap.h       |   4 +
 kernel/exit.c              |   1 +
 kernel/fork.c              |   9 +
 kernel/sched/core.c        |   1 +
 mm/memcontrol.c            |  24 +
 mm/vmscan.c                | 960 ++++++++++++++++++++++++++++++++++++-
 10 files changed, 1129 insertions(+), 13 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 79f2c9483302..6696fbbecbf3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1006,6 +1006,7 @@ static int exec_mmap(struct mm_struct *mm)
 	active_mm = tsk->active_mm;
 	tsk->active_mm = mm;
 	tsk->mm = mm;
+	lru_gen_add_mm(mm);
 	/*
 	 * This prevents preemption while active_mm is being loaded and
 	 * it and mm are being updated, which could cause problems for
@@ -1018,6 +1019,7 @@ static int exec_mmap(struct mm_struct *mm)
 	activate_mm(active_mm, mm);
 	if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
 		local_irq_enable();
+	lru_gen_use_mm(mm);
 	tsk->mm->vmacache_seqnum = 0;
 	vmacache_flush(tsk);
 	task_unlock(tsk);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c8ce74577290..b8e5718665b8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -343,6 +343,11 @@ struct mem_cgroup {
 	struct deferred_split deferred_split_queue;
 #endif
 
+#ifdef CONFIG_LRU_GEN
+	/* per-memcg mm_struct list */
+	struct lru_gen_mm_list mm_list;
+#endif
+
 	struct mem_cgroup_per_node *nodeinfo[];
 };
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0f549870da6a..cbc7fa381ac6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -3,6 +3,7 @@
 #define _LINUX_MM_TYPES_H
 
 #include <linux/mm_types_task.h>
+#include <linux/sched.h>
 
 #include <linux/auxvec.h>
 #include <linux/kref.h>
@@ -17,6 +18,8 @@
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
+#include <linux/nodemask.h>
+#include <linux/mmdebug.h>
 
 #include <asm/mmu.h>
 
@@ -637,6 +640,22 @@ struct mm_struct {
 #ifdef CONFIG_IOMMU_SUPPORT
 		u32 pasid;
 #endif
+#ifdef CONFIG_LRU_GEN
+		struct {
+			/* this mm_struct is on lru_gen_mm_list */
+			struct list_head list;
+#ifdef CONFIG_MEMCG
+			/* points to the memcg of "owner" above */
+			struct mem_cgroup *memcg;
+#endif
+			/*
+			 * Set when switching to this mm_struct, as a hint of
+			 * whether it has been used since the last time per-node
+			 * page table walkers cleared the corresponding bits.
+			 */
+			nodemask_t nodes;
+		} lru_gen;
+#endif /* CONFIG_LRU_GEN */
 	} __randomize_layout;
 
 	/*
@@ -663,6 +682,65 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 	return (struct cpumask *)&mm->cpu_bitmap;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+struct lru_gen_mm_list {
+	/* mm_struct list for page table walkers */
+	struct list_head fifo;
+	/* protects the list above */
+	spinlock_t lock;
+};
+
+void lru_gen_add_mm(struct mm_struct *mm);
+void lru_gen_del_mm(struct mm_struct *mm);
+#ifdef CONFIG_MEMCG
+void lru_gen_migrate_mm(struct mm_struct *mm);
+#endif
+
+static inline void lru_gen_init_mm(struct mm_struct *mm)
+{
+	INIT_LIST_HEAD(&mm->lru_gen.list);
+#ifdef CONFIG_MEMCG
+	mm->lru_gen.memcg = NULL;
+#endif
+	nodes_clear(mm->lru_gen.nodes);
+}
+
+static inline void lru_gen_use_mm(struct mm_struct *mm)
+{
+	/* unlikely but not a bug when racing with lru_gen_migrate_mm() */
+	VM_WARN_ON(list_empty(&mm->lru_gen.list));
+
+	if (!(current->flags & PF_KTHREAD) && !nodes_full(mm->lru_gen.nodes))
+		nodes_setall(mm->lru_gen.nodes);
+}
+
+#else /* !CONFIG_LRU_GEN */
+
+static inline void lru_gen_add_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_del_mm(struct mm_struct *mm)
+{
+}
+
+#ifdef CONFIG_MEMCG
+static inline void lru_gen_migrate_mm(struct mm_struct *mm)
+{
+}
+#endif
+
+static inline void lru_gen_init_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_use_mm(struct mm_struct *mm)
+{
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cd64c64a952d..a2d53025a321 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -411,6 +411,58 @@ struct lru_gen_struct {
 	atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
 };
 
+enum {
+	MM_PTE_TOTAL,	/* total leaf entries */
+	MM_PTE_OLD,	/* old leaf entries */
+	MM_PTE_YOUNG,	/* young leaf entries */
+	MM_PMD_TOTAL,	/* total non-leaf entries */
+	MM_PMD_FOUND,	/* non-leaf entries found in Bloom filters */
+	MM_PMD_ADDED,	/* non-leaf entries added to Bloom filters */
+	NR_MM_STATS
+};
+
+/* mnemonic codes for the mm stats above */
+#define MM_STAT_CODES		"toydfa"
+
+/* double-buffering Bloom filters */
+#define NR_BLOOM_FILTERS	2
+
+struct lru_gen_mm_state {
+	/* set to max_seq after each iteration */
+	unsigned long seq;
+	/* where the current iteration starts (inclusive) */
+	struct list_head *head;
+	/* where the last iteration ends (exclusive) */
+	struct list_head *tail;
+	/* to wait for the last page table walker to finish */
+	struct wait_queue_head wait;
+	/* Bloom filters flip after each iteration */
+	unsigned long *filters[NR_BLOOM_FILTERS];
+	/* the mm stats for debugging */
+	unsigned long stats[NR_HIST_GENS][NR_MM_STATS];
+	/* the number of concurrent page table walkers */
+	int nr_walkers;
+};
+
+struct lru_gen_mm_walk {
+	/* the lruvec under reclaim */
+	struct lruvec *lruvec;
+	/* unstable max_seq from lru_gen_struct */
+	unsigned long max_seq;
+	/* the next address within an mm to scan */
+	unsigned long next_addr;
+	/* to batch page table entries */
+	unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)];
+	/* to batch promoted pages */
+	int nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	/* to batch the mm stats */
+	int mm_stats[NR_MM_STATS];
+	/* total batched items */
+	int batched;
+	bool can_swap;
+	bool full_scan;
+};
+
 void lru_gen_init_lruvec(struct lruvec *lruvec);
 void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
 
@@ -461,6 +513,8 @@ struct lruvec {
 #ifdef CONFIG_LRU_GEN
 	/* evictable pages divided into generations */
 	struct lru_gen_struct		lrugen;
+	/* to concurrently iterate lru_gen_mm_list */
+	struct lru_gen_mm_state		mm_state;
 #endif
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
@@ -1053,6 +1107,10 @@ typedef struct pglist_data {
 
 	unsigned long		flags;
 
+#ifdef CONFIG_LRU_GEN
+	/* kswapd mm walk data */
+	struct lru_gen_mm_walk	mm_walk;
+#endif
 	ZONE_PADDING(_pad2_)
 
 	/* Per-node vmstats */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b37520d3ff1d..04d84ac6d1ac 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -137,6 +137,10 @@ union swap_header {
  */
 struct reclaim_state {
 	unsigned long reclaimed_slab;
+#ifdef CONFIG_LRU_GEN
+	/* per-thread mm walk data */
+	struct lru_gen_mm_walk *mm_walk;
+#endif
 };
 
 #ifdef __KERNEL__
diff --git a/kernel/exit.c b/kernel/exit.c
index b00a25bb4ab9..54d2ce4b93d1 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -463,6 +463,7 @@ void mm_update_next_owner(struct mm_struct *mm)
 		goto retry;
 	}
 	WRITE_ONCE(mm->owner, c);
+	lru_gen_migrate_mm(mm);
 	task_unlock(c);
 	put_task_struct(c);
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index f1e89007f228..9bc303eacca1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1079,6 +1079,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 		goto fail_nocontext;
 
 	mm->user_ns = get_user_ns(user_ns);
+	lru_gen_init_mm(mm);
 	return mm;
 
 fail_nocontext:
@@ -1121,6 +1122,7 @@ static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+	lru_gen_del_mm(mm);
 	mmdrop(mm);
 }
 
@@ -2586,6 +2588,13 @@ pid_t kernel_clone(struct kernel_clone_args *args)
 		get_task_struct(p);
 	}
 
+	if (IS_ENABLED(CONFIG_LRU_GEN) && !(clone_flags & CLONE_VM)) {
+		/* lock the task to synchronize with memcg migration */
+		task_lock(p);
+		lru_gen_add_mm(p->mm);
+		task_unlock(p);
+	}
+
 	wake_up_new_task(p);
 
 	/* forking complete and child started to run, tell ptracer */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9745613d531c..ecf0cdce8603 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4979,6 +4979,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 		 * finish_task_switch()'s mmdrop().
 		 */
 		switch_mm_irqs_off(prev->active_mm, next->mm, next);
+		lru_gen_use_mm(next->mm);
 
 		if (!prev->mm) {                        // from kernel
 			/* will mmdrop() in finish_task_switch(). */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e4c30950aa3c..d5993490b32f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6155,6 +6155,29 @@ static void mem_cgroup_move_task(void)
 }
 #endif
 
+#ifdef CONFIG_LRU_GEN
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+	struct cgroup_subsys_state *css;
+	struct task_struct *task = NULL;
+
+	cgroup_taskset_for_each_leader(task, css, tset)
+		break;
+
+	if (!task)
+		return;
+
+	task_lock(task);
+	if (task->mm && task->mm->owner == task)
+		lru_gen_migrate_mm(task->mm);
+	task_unlock(task);
+}
+#else
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
 static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value)
 {
 	if (value == PAGE_COUNTER_MAX)
@@ -6500,6 +6523,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.css_reset = mem_cgroup_css_reset,
 	.css_rstat_flush = mem_cgroup_css_rstat_flush,
 	.can_attach = mem_cgroup_can_attach,
+	.attach = mem_cgroup_attach,
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.post_attach = mem_cgroup_move_task,
 	.dfl_cftypes = memory_files,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2b685aa0379c..67dc4190e790 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -50,6 +50,8 @@
 #include <linux/printk.h>
 #include <linux/dax.h>
 #include <linux/psi.h>
+#include <linux/pagewalk.h>
+#include <linux/shmem_fs.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -3133,6 +3135,372 @@ static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
 	       get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS;
 }
 
+/******************************************************************************
+ *                          mm_struct list
+ ******************************************************************************/
+
+static struct lru_gen_mm_list *get_mm_list(struct mem_cgroup *memcg)
+{
+	static struct lru_gen_mm_list mm_list = {
+		.fifo = LIST_HEAD_INIT(mm_list.fifo),
+		.lock = __SPIN_LOCK_UNLOCKED(mm_list.lock),
+	};
+
+#ifdef CONFIG_MEMCG
+	if (memcg)
+		return &memcg->mm_list;
+#endif
+	return &mm_list;
+}
+
+void lru_gen_add_mm(struct mm_struct *mm)
+{
+	int nid;
+	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+
+	VM_BUG_ON_MM(!list_empty(&mm->lru_gen.list), mm);
+#ifdef CONFIG_MEMCG
+	VM_BUG_ON_MM(mm->lru_gen.memcg, mm);
+	mm->lru_gen.memcg = memcg;
+#endif
+	spin_lock(&mm_list->lock);
+
+	for_each_node_state(nid, N_MEMORY) {
+		struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+		if (!lruvec)
+			continue;
+
+		if (lruvec->mm_state.tail == &mm_list->fifo)
+			lruvec->mm_state.tail = &mm->lru_gen.list;
+	}
+
+	list_add_tail(&mm->lru_gen.list, &mm_list->fifo);
+
+	spin_unlock(&mm_list->lock);
+}
+
+void lru_gen_del_mm(struct mm_struct *mm)
+{
+	int nid;
+	struct lru_gen_mm_list *mm_list;
+	struct mem_cgroup *memcg = NULL;
+
+	if (list_empty(&mm->lru_gen.list))
+		return;
+
+#ifdef CONFIG_MEMCG
+	memcg = mm->lru_gen.memcg;
+#endif
+	mm_list = get_mm_list(memcg);
+
+	spin_lock(&mm_list->lock);
+
+	for_each_node(nid) {
+		struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+		if (!lruvec)
+			continue;
+
+		if (lruvec->mm_state.tail == &mm->lru_gen.list)
+			lruvec->mm_state.tail = lruvec->mm_state.tail->next;
+
+		if (lruvec->mm_state.head != &mm->lru_gen.list)
+			continue;
+
+		lruvec->mm_state.head = lruvec->mm_state.head->next;
+		if (lruvec->mm_state.head == &mm_list->fifo)
+			WRITE_ONCE(lruvec->mm_state.seq, lruvec->mm_state.seq + 1);
+	}
+
+	list_del_init(&mm->lru_gen.list);
+
+	spin_unlock(&mm_list->lock);
+
+#ifdef CONFIG_MEMCG
+	mem_cgroup_put(mm->lru_gen.memcg);
+	mm->lru_gen.memcg = NULL;
+#endif
+}
+
+#ifdef CONFIG_MEMCG
+void lru_gen_migrate_mm(struct mm_struct *mm)
+{
+	struct mem_cgroup *memcg;
+
+	lockdep_assert_held(&mm->owner->alloc_lock);
+
+	/* for mm_update_next_owner() */
+	if (mem_cgroup_disabled())
+		return;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(mm->owner);
+	rcu_read_unlock();
+	if (memcg == mm->lru_gen.memcg)
+		return;
+
+	VM_BUG_ON_MM(!mm->lru_gen.memcg, mm);
+	VM_BUG_ON_MM(list_empty(&mm->lru_gen.list), mm);
+
+	lru_gen_del_mm(mm);
+	lru_gen_add_mm(mm);
+}
+#endif
+
+/*
+ * Bloom filters with m=1<<15, k=2 and the false positive rates of ~1/5 when
+ * n=10,000 and ~1/2 when n=20,000, where, conventionally, m is the number of
+ * bits in a bitmap, k is the number of hash functions and n is the number of
+ * inserted items.
+ *
+ * Page table walkers use one of the two filters to reduce their search space.
+ * To get rid of non-leaf entries that no longer have enough leaf entries, the
+ * aging uses the double-buffering technique to flip to the other filter each
+ * time it produces a new generation. For non-leaf entries that have enough
+ * leaf entries, the aging carries them over to the next generation in
+ * walk_pmd_range(); the eviction also reports them when walking the rmap
+ * in lru_gen_look_around().
+ *
+ * For future optimizations:
+ * 1. It's not necessary to keep both filters all the time. The spare one can be
+ *    freed after the RCU grace period and reallocated if needed again.
+ * 2. And when reallocating, it's worth scaling its size according to the number
+ *    of inserted entries in the other filter, to reduce the memory overhead on
+ *    small systems and false positives on large systems.
+ * 3. Jenkins' hash function is an alternative to Knuth's.
+ */
+#define BLOOM_FILTER_SHIFT	15
+
+static inline int filter_gen_from_seq(unsigned long seq)
+{
+	return seq % NR_BLOOM_FILTERS;
+}
+
+static void get_item_key(void *item, int *key)
+{
+	u32 hash = hash_ptr(item, BLOOM_FILTER_SHIFT * 2);
+
+	BUILD_BUG_ON(BLOOM_FILTER_SHIFT * 2 > BITS_PER_TYPE(u32));
+
+	key[0] = hash & (BIT(BLOOM_FILTER_SHIFT) - 1);
+	key[1] = hash >> BLOOM_FILTER_SHIFT;
+}
+
+static void reset_bloom_filter(struct lruvec *lruvec, unsigned long seq)
+{
+	unsigned long *filter;
+	int gen = filter_gen_from_seq(seq);
+
+	lockdep_assert_held(&get_mm_list(lruvec_memcg(lruvec))->lock);
+
+	filter = lruvec->mm_state.filters[gen];
+	if (filter) {
+		bitmap_clear(filter, 0, BIT(BLOOM_FILTER_SHIFT));
+		return;
+	}
+
+	filter = bitmap_zalloc(BIT(BLOOM_FILTER_SHIFT), GFP_ATOMIC);
+	WRITE_ONCE(lruvec->mm_state.filters[gen], filter);
+}
+
+static void update_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item)
+{
+	int key[2];
+	unsigned long *filter;
+	int gen = filter_gen_from_seq(seq);
+
+	filter = READ_ONCE(lruvec->mm_state.filters[gen]);
+	if (!filter)
+		return;
+
+	get_item_key(item, key);
+
+	if (!test_bit(key[0], filter))
+		set_bit(key[0], filter);
+	if (!test_bit(key[1], filter))
+		set_bit(key[1], filter);
+}
+
+static bool test_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item)
+{
+	int key[2];
+	unsigned long *filter;
+	int gen = filter_gen_from_seq(seq);
+
+	filter = READ_ONCE(lruvec->mm_state.filters[gen]);
+	if (!filter)
+		return true;
+
+	get_item_key(item, key);
+
+	return test_bit(key[0], filter) && test_bit(key[1], filter);
+}
+
+static void reset_mm_stats(struct lruvec *lruvec, struct lru_gen_mm_walk *walk, bool last)
+{
+	int i;
+	int hist;
+
+	lockdep_assert_held(&get_mm_list(lruvec_memcg(lruvec))->lock);
+
+	if (walk) {
+		hist = lru_hist_from_seq(walk->max_seq);
+
+		for (i = 0; i < NR_MM_STATS; i++) {
+			WRITE_ONCE(lruvec->mm_state.stats[hist][i],
+				   lruvec->mm_state.stats[hist][i] + walk->mm_stats[i]);
+			walk->mm_stats[i] = 0;
+		}
+	}
+
+	if (NR_HIST_GENS > 1 && last) {
+		hist = lru_hist_from_seq(lruvec->mm_state.seq + 1);
+
+		for (i = 0; i < NR_MM_STATS; i++)
+			WRITE_ONCE(lruvec->mm_state.stats[hist][i], 0);
+	}
+}
+
+static bool should_skip_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+	int type;
+	unsigned long size = 0;
+	struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
+
+	if (!walk->full_scan && cpumask_empty(mm_cpumask(mm)) &&
+	    !node_isset(pgdat->node_id, mm->lru_gen.nodes))
+		return true;
+
+	node_clear(pgdat->node_id, mm->lru_gen.nodes);
+
+	for (type = !walk->can_swap; type < ANON_AND_FILE; type++) {
+		size += type ? get_mm_counter(mm, MM_FILEPAGES) :
+			       get_mm_counter(mm, MM_ANONPAGES) +
+			       get_mm_counter(mm, MM_SHMEMPAGES);
+	}
+
+	if (size < MIN_LRU_BATCH)
+		return true;
+
+	if (mm_is_oom_victim(mm))
+		return true;
+
+	return !mmget_not_zero(mm);
+}
+
+static bool iterate_mm_list(struct lruvec *lruvec, struct lru_gen_mm_walk *walk,
+			    struct mm_struct **iter)
+{
+	bool first = false;
+	bool last = true;
+	struct mm_struct *mm = NULL;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+	struct lru_gen_mm_state *mm_state = &lruvec->mm_state;
+
+	/*
+	 * There are four interesting cases for this page table walker:
+	 * 1. It tries to start a new iteration of mm_list with a stale max_seq;
+	 *    there is nothing to be done.
+	 * 2. It's the first of the current generation, and it needs to reset
+	 *    the Bloom filter for the next generation.
+	 * 3. It reaches the end of mm_list, and it needs to increment
+	 *    mm_state->seq; the iteration is done.
+	 * 4. It's the last of the current generation, and it needs to reset the
+	 *    mm stats counters for the next generation.
+	 */
+	if (*iter)
+		mmput_async(*iter);
+	else if (walk->max_seq <= READ_ONCE(mm_state->seq))
+		return false;
+
+	spin_lock(&mm_list->lock);
+
+	VM_BUG_ON(mm_state->seq + 1 < walk->max_seq);
+	VM_BUG_ON(*iter && mm_state->seq > walk->max_seq);
+	VM_BUG_ON(*iter && !mm_state->nr_walkers);
+
+	if (walk->max_seq <= mm_state->seq) {
+		if (!*iter)
+			last = false;
+		goto done;
+	}
+
+	if (!mm_state->nr_walkers) {
+		VM_BUG_ON(mm_state->head && mm_state->head != &mm_list->fifo);
+
+		mm_state->head = mm_list->fifo.next;
+		first = true;
+	}
+
+	while (!mm && mm_state->head != &mm_list->fifo) {
+		mm = list_entry(mm_state->head, struct mm_struct, lru_gen.list);
+
+		mm_state->head = mm_state->head->next;
+
+		/* full scan for those added after the last iteration */
+		if (!mm_state->tail || mm_state->tail == &mm->lru_gen.list) {
+			mm_state->tail = mm_state->head;
+			walk->full_scan = true;
+		}
+
+		if (should_skip_mm(mm, walk))
+			mm = NULL;
+	}
+
+	if (mm_state->head == &mm_list->fifo)
+		WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+done:
+	if (*iter && !mm)
+		mm_state->nr_walkers--;
+	if (!*iter && mm)
+		mm_state->nr_walkers++;
+
+	if (mm_state->nr_walkers)
+		last = false;
+
+	if (mm && first)
+		reset_bloom_filter(lruvec, walk->max_seq + 1);
+
+	if (*iter || last)
+		reset_mm_stats(lruvec, walk, last);
+
+	spin_unlock(&mm_list->lock);
+
+	*iter = mm;
+
+	return last;
+}
+
+static bool iterate_mm_list_nowalk(struct lruvec *lruvec, unsigned long max_seq)
+{
+	bool success = false;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+	struct lru_gen_mm_state *mm_state = &lruvec->mm_state;
+
+	if (max_seq <= READ_ONCE(mm_state->seq))
+		return false;
+
+	spin_lock(&mm_list->lock);
+
+	VM_BUG_ON(mm_state->seq + 1 < max_seq);
+
+	if (max_seq > mm_state->seq && !mm_state->nr_walkers) {
+		VM_BUG_ON(mm_state->head && mm_state->head != &mm_list->fifo);
+
+		WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+		reset_mm_stats(lruvec, NULL, true);
+		success = true;
+	}
+
+	spin_unlock(&mm_list->lock);
+
+	return success;
+}
+
 /******************************************************************************
  *                          refault feedback loop
  ******************************************************************************/
@@ -3286,6 +3654,465 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
 	return new_gen;
 }
 
+static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio,
+			      int old_gen, int new_gen)
+{
+	int type = folio_is_file_lru(folio);
+	int zone = folio_zonenum(folio);
+	int delta = folio_nr_pages(folio);
+
+	VM_BUG_ON(old_gen >= MAX_NR_GENS);
+	VM_BUG_ON(new_gen >= MAX_NR_GENS);
+
+	walk->batched++;
+
+	walk->nr_pages[old_gen][type][zone] -= delta;
+	walk->nr_pages[new_gen][type][zone] += delta;
+}
+
+static void reset_batch_size(struct lruvec *lruvec, struct lru_gen_mm_walk *walk)
+{
+	int gen, type, zone;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	walk->batched = 0;
+
+	for_each_gen_type_zone(gen, type, zone) {
+		enum lru_list lru = type * LRU_INACTIVE_FILE;
+		int delta = walk->nr_pages[gen][type][zone];
+
+		if (!delta)
+			continue;
+
+		walk->nr_pages[gen][type][zone] = 0;
+		WRITE_ONCE(lrugen->nr_pages[gen][type][zone],
+			   lrugen->nr_pages[gen][type][zone] + delta);
+
+		if (lru_gen_is_active(lruvec, gen))
+			lru += LRU_ACTIVE;
+		__update_lru_size(lruvec, lru, zone, delta);
+	}
+}
+
+static int should_skip_vma(unsigned long start, unsigned long end, struct mm_walk *walk)
+{
+	struct address_space *mapping;
+	struct vm_area_struct *vma = walk->vma;
+	struct lru_gen_mm_walk *priv = walk->private;
+
+	if (!vma_is_accessible(vma) || is_vm_hugetlb_page(vma) ||
+	    (vma->vm_flags & (VM_LOCKED | VM_SPECIAL | VM_SEQ_READ | VM_RAND_READ)) ||
+	    vma == get_gate_vma(vma->vm_mm))
+		return true;
+
+	if (vma_is_anonymous(vma))
+		return !priv->can_swap;
+
+	if (WARN_ON_ONCE(!vma->vm_file || !vma->vm_file->f_mapping))
+		return true;
+
+	mapping = vma->vm_file->f_mapping;
+	if (mapping_unevictable(mapping))
+		return true;
+
+	/* check readpage to exclude special mappings like dax, etc. */
+	return shmem_mapping(mapping) ? !priv->can_swap : !mapping->a_ops->readpage;
+}
+
+/*
+ * Some userspace memory allocators map many single-page VMAs. Instead of
+ * returning to the PGD table for each such VMA, finish an entire PMD table to
+ * reduce zigzags and improve cache performance.
+ */
+static bool get_next_vma(struct mm_walk *walk, unsigned long mask, unsigned long size,
+			 unsigned long *start, unsigned long *end)
+{
+	unsigned long next = round_up(*end, size);
+
+	VM_BUG_ON(mask & size);
+	VM_BUG_ON(*start >= *end);
+	VM_BUG_ON((next & mask) != (*start & mask));
+
+	while (walk->vma) {
+		if (next >= walk->vma->vm_end) {
+			walk->vma = walk->vma->vm_next;
+			continue;
+		}
+
+		if ((next & mask) != (walk->vma->vm_start & mask))
+			return false;
+
+		if (should_skip_vma(walk->vma->vm_start, walk->vma->vm_end, walk)) {
+			walk->vma = walk->vma->vm_next;
+			continue;
+		}
+
+		*start = max(next, walk->vma->vm_start);
+		next = (next | ~mask) + 1;
+		/* rounded-up boundaries can wrap to 0 */
+		*end = next && next < walk->vma->vm_end ? next : walk->vma->vm_end;
+
+		return true;
+	}
+
+	return false;
+}
+
+static bool suitable_to_scan(int total, int young)
+{
+	int n = clamp_t(int, cache_line_size() / sizeof(pte_t), 2, 8);
+
+	/* suitable if the average number of young PTEs per cacheline is >=1 */
+	return young * n >= total;
+}
+
+static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
+			   struct mm_walk *walk)
+{
+	int i;
+	pte_t *pte;
+	spinlock_t *ptl;
+	unsigned long addr;
+	int total = 0;
+	int young = 0;
+	struct lru_gen_mm_walk *priv = walk->private;
+	struct mem_cgroup *memcg = lruvec_memcg(priv->lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec);
+	int old_gen, new_gen = lru_gen_from_seq(priv->max_seq);
+
+	VM_BUG_ON(pmd_leaf(*pmd));
+
+	pte = pte_offset_map_lock(walk->mm, pmd, start & PMD_MASK, &ptl);
+	arch_enter_lazy_mmu_mode();
+restart:
+	for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
+		struct folio *folio;
+		unsigned long pfn = pte_pfn(pte[i]);
+
+		VM_BUG_ON(addr < walk->vma->vm_start || addr >= walk->vma->vm_end);
+
+		total++;
+		priv->mm_stats[MM_PTE_TOTAL]++;
+
+		if (!pte_present(pte[i]) || is_zero_pfn(pfn))
+			continue;
+
+		if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i])))
+			continue;
+
+		if (!pte_young(pte[i])) {
+			priv->mm_stats[MM_PTE_OLD]++;
+			continue;
+		}
+
+		VM_BUG_ON(!pfn_valid(pfn));
+		if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+			continue;
+
+		folio = pfn_folio(pfn);
+		if (folio_nid(folio) != pgdat->node_id)
+			continue;
+
+		if (folio_memcg_rcu(folio) != memcg)
+			continue;
+
+		if (!ptep_test_and_clear_young(walk->vma, addr, pte + i))
+			continue;
+
+		young++;
+		priv->mm_stats[MM_PTE_YOUNG]++;
+
+		if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
+		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+		      !folio_test_swapcache(folio)))
+			folio_mark_dirty(folio);
+
+		old_gen = folio_update_gen(folio, new_gen);
+		if (old_gen >= 0 && old_gen != new_gen)
+			update_batch_size(priv, folio, old_gen, new_gen);
+	}
+
+	if (i < PTRS_PER_PTE && get_next_vma(walk, PMD_MASK, PAGE_SIZE, &start, &end))
+		goto restart;
+
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(pte, ptl);
+
+	return suitable_to_scan(total, young);
+}
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
+static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma,
+				  struct mm_walk *walk, unsigned long *start)
+{
+	int i;
+	pmd_t *pmd;
+	spinlock_t *ptl;
+	struct lru_gen_mm_walk *priv = walk->private;
+	struct mem_cgroup *memcg = lruvec_memcg(priv->lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec);
+	int old_gen, new_gen = lru_gen_from_seq(priv->max_seq);
+
+	VM_BUG_ON(pud_leaf(*pud));
+
+	/* try to batch at most 1+MIN_LRU_BATCH+1 entries */
+	if (*start == -1) {
+		*start = next;
+		return;
+	}
+
+	i = next == -1 ? 0 : pmd_index(next) - pmd_index(*start);
+	if (i && i <= MIN_LRU_BATCH) {
+		__set_bit(i - 1, priv->bitmap);
+		return;
+	}
+
+	pmd = pmd_offset(pud, *start);
+	ptl = pmd_lock(walk->mm, pmd);
+	arch_enter_lazy_mmu_mode();
+
+	do {
+		struct folio *folio;
+		unsigned long pfn = pmd_pfn(pmd[i]);
+		unsigned long addr = i ? (*start & PMD_MASK) + i * PMD_SIZE : *start;
+
+		VM_BUG_ON(addr < vma->vm_start || addr >= vma->vm_end);
+
+		if (!pmd_present(pmd[i]) || is_huge_zero_pmd(pmd[i]))
+			goto next;
+
+		if (WARN_ON_ONCE(pmd_devmap(pmd[i])))
+			goto next;
+
+		if (!pmd_trans_huge(pmd[i])) {
+			if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG))
+				pmdp_test_and_clear_young(vma, addr, pmd + i);
+			goto next;
+		}
+
+		VM_BUG_ON(!pfn_valid(pfn));
+		if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+			goto next;
+
+		folio = pfn_folio(pfn);
+		if (folio_nid(folio) != pgdat->node_id)
+			goto next;
+
+		if (folio_memcg_rcu(folio) != memcg)
+			goto next;
+
+		if (!pmdp_test_and_clear_young(vma, addr, pmd + i))
+			goto next;
+
+		priv->mm_stats[MM_PTE_YOUNG]++;
+
+		if (pmd_dirty(pmd[i]) && !folio_test_dirty(folio) &&
+		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+		      !folio_test_swapcache(folio)))
+			folio_mark_dirty(folio);
+
+		old_gen = folio_update_gen(folio, new_gen);
+		if (old_gen >= 0 && old_gen != new_gen)
+			update_batch_size(priv, folio, old_gen, new_gen);
+next:
+		i = i > MIN_LRU_BATCH ? 0 :
+		    find_next_bit(priv->bitmap, MIN_LRU_BATCH, i) + 1;
+	} while (i <= MIN_LRU_BATCH);
+
+	arch_leave_lazy_mmu_mode();
+	spin_unlock(ptl);
+
+	*start = -1;
+	bitmap_zero(priv->bitmap, MIN_LRU_BATCH);
+}
+#else
+static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma,
+				  struct mm_walk *walk, unsigned long *start)
+{
+}
+#endif
+
+static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
+			   struct mm_walk *walk)
+{
+	int i;
+	pmd_t *pmd;
+	unsigned long next;
+	unsigned long addr;
+	struct vm_area_struct *vma;
+	unsigned long pos = -1;
+	struct lru_gen_mm_walk *priv = walk->private;
+
+	VM_BUG_ON(pud_leaf(*pud));
+
+	/*
+	 * Finish an entire PMD in two passes: the first only reaches to PTE
+	 * tables to avoid taking the PMD lock; the second, if necessary, takes
+	 * the PMD lock to clear the accessed bit in PMD entries.
+	 */
+	pmd = pmd_offset(pud, start & PUD_MASK);
+restart:
+	/* walk_pte_range() may call get_next_vma() */
+	vma = walk->vma;
+	for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
+		pmd_t val = pmd_read_atomic(pmd + i);
+
+		/* for pmd_read_atomic() */
+		barrier();
+
+		next = pmd_addr_end(addr, end);
+
+		if (!pmd_present(val)) {
+			priv->mm_stats[MM_PTE_TOTAL]++;
+			continue;
+		}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		if (pmd_trans_huge(val)) {
+			unsigned long pfn = pmd_pfn(val);
+			struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec);
+
+			priv->mm_stats[MM_PTE_TOTAL]++;
+
+			if (is_huge_zero_pmd(val))
+				continue;
+
+			if (!pmd_young(val)) {
+				priv->mm_stats[MM_PTE_OLD]++;
+				continue;
+			}
+
+			if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+				continue;
+
+			walk_pmd_range_locked(pud, addr, vma, walk, &pos);
+			continue;
+		}
+#endif
+		priv->mm_stats[MM_PMD_TOTAL]++;
+
+#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
+		if (!pmd_young(val))
+			continue;
+
+		walk_pmd_range_locked(pud, addr, vma, walk, &pos);
+#endif
+		if (!priv->full_scan && !test_bloom_filter(priv->lruvec, priv->max_seq, pmd + i))
+			continue;
+
+		priv->mm_stats[MM_PMD_FOUND]++;
+
+		if (!walk_pte_range(&val, addr, next, walk))
+			continue;
+
+		priv->mm_stats[MM_PMD_ADDED]++;
+
+		/* carry over to the next generation */
+		update_bloom_filter(priv->lruvec, priv->max_seq + 1, pmd + i);
+	}
+
+	walk_pmd_range_locked(pud, -1, vma, walk, &pos);
+
+	if (i < PTRS_PER_PMD && get_next_vma(walk, PUD_MASK, PMD_SIZE, &start, &end))
+		goto restart;
+}
+
+static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
+			  struct mm_walk *walk)
+{
+	int i;
+	pud_t *pud;
+	unsigned long addr;
+	unsigned long next;
+	struct lru_gen_mm_walk *priv = walk->private;
+
+	VM_BUG_ON(p4d_leaf(*p4d));
+
+	pud = pud_offset(p4d, start & P4D_MASK);
+restart:
+	for (i = pud_index(start), addr = start; addr != end; i++, addr = next) {
+		pud_t val = READ_ONCE(pud[i]);
+
+		next = pud_addr_end(addr, end);
+
+		if (!pud_present(val) || WARN_ON_ONCE(pud_leaf(val)))
+			continue;
+
+		walk_pmd_range(&val, addr, next, walk);
+
+		if (priv->batched >= MAX_LRU_BATCH) {
+			end = (addr | ~PUD_MASK) + 1;
+			goto done;
+		}
+	}
+
+	if (i < PTRS_PER_PUD && get_next_vma(walk, P4D_MASK, PUD_SIZE, &start, &end))
+		goto restart;
+
+	end = round_up(end, P4D_SIZE);
+done:
+	/* rounded-up boundaries can wrap to 0 */
+	priv->next_addr = end && walk->vma ? max(end, walk->vma->vm_start) : 0;
+
+	return -EAGAIN;
+}
+
+static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+	static const struct mm_walk_ops mm_walk_ops = {
+		.test_walk = should_skip_vma,
+		.p4d_entry = walk_pud_range,
+	};
+
+	int err;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	walk->next_addr = FIRST_USER_ADDRESS;
+
+	do {
+		err = -EBUSY;
+
+		/* folio_update_gen() requires stable folio_memcg() */
+		if (!mem_cgroup_trylock_pages(memcg))
+			break;
+
+		/* the caller might be holding the lock for write */
+		if (mmap_read_trylock(mm)) {
+			unsigned long start = walk->next_addr;
+			unsigned long end = mm->highest_vm_end;
+
+			err = walk_page_range(mm, start, end, &mm_walk_ops, walk);
+
+			mmap_read_unlock(mm);
+
+			if (walk->batched) {
+				spin_lock_irq(&lruvec->lru_lock);
+				reset_batch_size(lruvec, walk);
+				spin_unlock_irq(&lruvec->lru_lock);
+			}
+		}
+
+		mem_cgroup_unlock_pages();
+
+		cond_resched();
+	} while (err == -EAGAIN && walk->next_addr && !mm_is_oom_victim(mm));
+}
+
+static struct lru_gen_mm_walk *alloc_mm_walk(void)
+{
+	if (current->reclaim_state && current->reclaim_state->mm_walk)
+		return current->reclaim_state->mm_walk;
+
+	return kzalloc(sizeof(struct lru_gen_mm_walk),
+		       __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
+}
+
+static void free_mm_walk(struct lru_gen_mm_walk *walk)
+{
+	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
+		kfree(walk);
+}
+
 static void inc_min_seq(struct lruvec *lruvec)
 {
 	int type;
@@ -3344,7 +4171,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
 	return success;
 }
 
-static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
+static void inc_max_seq(struct lruvec *lruvec)
 {
 	int prev, next;
 	int type, zone;
@@ -3354,9 +4181,6 @@ static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
 
 	VM_BUG_ON(!seq_is_valid(lruvec));
 
-	if (max_seq != lrugen->max_seq)
-		goto unlock;
-
 	inc_min_seq(lruvec);
 
 	/* update the active/inactive LRU sizes for compatibility */
@@ -3382,10 +4206,72 @@ static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
 
 	/* make sure preceding modifications appear */
 	smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
-unlock:
+
 	spin_unlock_irq(&lruvec->lru_lock);
 }
 
+static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
+			       struct scan_control *sc, bool can_swap, bool full_scan)
+{
+	bool success;
+	struct lru_gen_mm_walk *walk;
+	struct mm_struct *mm = NULL;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON(max_seq > READ_ONCE(lrugen->max_seq));
+
+	/*
+	 * If the hardware doesn't automatically set the accessed bit, fall
+	 * back to lru_gen_look_around(), which only clears the accessed bit in
+	 * a handful of PTEs. Spreading the work out over a period of time
+	 * usually is less efficient, but it avoids bursty page faults.
+	 */
+	if (!full_scan && !arch_has_hw_pte_young()) {
+		success = iterate_mm_list_nowalk(lruvec, max_seq);
+		goto done;
+	}
+
+	walk = alloc_mm_walk();
+	if (!walk) {
+		success = iterate_mm_list_nowalk(lruvec, max_seq);
+		goto done;
+	}
+
+	walk->lruvec = lruvec;
+	walk->max_seq = max_seq;
+	walk->can_swap = can_swap;
+	walk->full_scan = full_scan;
+
+	do {
+		success = iterate_mm_list(lruvec, walk, &mm);
+		if (mm)
+			walk_mm(lruvec, mm, walk);
+
+		cond_resched();
+	} while (mm);
+
+	free_mm_walk(walk);
+done:
+	if (!success) {
+		if (!current_is_kswapd() && !sc->priority)
+			wait_event_killable(lruvec->mm_state.wait,
+					    max_seq < READ_ONCE(lrugen->max_seq));
+
+		return max_seq < READ_ONCE(lrugen->max_seq);
+	}
+
+	VM_BUG_ON(max_seq != READ_ONCE(lrugen->max_seq));
+
+	inc_max_seq(lruvec);
+	/* either this sees any waiters or they will see updated max_seq */
+	if (wq_has_sleeper(&lruvec->mm_state.wait))
+		wake_up_all(&lruvec->mm_state.wait);
+
+	wakeup_flusher_threads(WB_REASON_VMSCAN);
+
+	return true;
+}
+
 static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
 			     unsigned long *min_seq, bool can_swap, bool *need_aging)
 {
@@ -3453,7 +4339,7 @@ static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		nr_to_scan++;
 
 	if (nr_to_scan && need_aging && (!mem_cgroup_below_low(memcg) || sc->memcg_low_reclaim))
-		inc_max_seq(lruvec, max_seq);
+		try_to_inc_max_seq(lruvec, max_seq, sc, swappiness, false);
 }
 
 static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
@@ -3462,6 +4348,8 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 
 	VM_BUG_ON(!current_is_kswapd());
 
+	current->reclaim_state->mm_walk = &pgdat->mm_walk;
+
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
 		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
@@ -3470,11 +4358,16 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 
 		cond_resched();
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+	current->reclaim_state->mm_walk = NULL;
 }
 
 /*
  * This function exploits spatial locality when shrink_page_list() walks the
  * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages.
+ * If the scan was cacheline efficient, it adds the PMD entry pointing to the
+ * PTE table to the Bloom filter. This process is a feedback loop from the
+ * eviction to the aging.
  */
 void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 {
@@ -3484,6 +4377,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	unsigned long end;
 	unsigned long addr;
 	struct folio *folio;
+	struct lru_gen_mm_walk *walk;
+	int young = 0;
 	unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {};
 	struct mem_cgroup *memcg = page_memcg(pvmw->page);
 	struct pglist_data *pgdat = page_pgdat(pvmw->page);
@@ -3541,6 +4436,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 		if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
 			continue;
 
+		young++;
+
 		if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
 		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
 		      !folio_test_swapcache(folio)))
@@ -3556,7 +4453,13 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	arch_leave_lazy_mmu_mode();
 	rcu_read_unlock();
 
-	if (bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
+	/* feedback from rmap walkers to page table walkers */
+	if (suitable_to_scan(i, young))
+		update_bloom_filter(lruvec, max_seq, pvmw->pmd);
+
+	walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
+
+	if (!walk && bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
 		for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
 			folio = page_folio(pte_page(pte[i]));
 			folio_activate(folio);
@@ -3568,8 +4471,10 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	if (!mem_cgroup_trylock_pages(memcg))
 		return;
 
-	spin_lock_irq(&lruvec->lru_lock);
-	new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
+	if (!walk) {
+		spin_lock_irq(&lruvec->lru_lock);
+		new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
+	}
 
 	for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
 		folio = page_folio(pte_page(pte[i]));
@@ -3580,10 +4485,14 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 		if (old_gen < 0 || old_gen == new_gen)
 			continue;
 
-		lru_gen_update_size(lruvec, folio, old_gen, new_gen);
+		if (walk)
+			update_batch_size(walk, folio, old_gen, new_gen);
+		else
+			lru_gen_update_size(lruvec, folio, old_gen, new_gen);
 	}
 
-	spin_unlock_irq(&lruvec->lru_lock);
+	if (!walk)
+		spin_unlock_irq(&lruvec->lru_lock);
 
 	mem_cgroup_unlock_pages();
 }
@@ -3850,6 +4759,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 	struct folio *folio;
 	enum vm_event_item item;
 	struct reclaim_stat stat;
+	struct lru_gen_mm_walk *walk;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
@@ -3889,6 +4799,10 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 
 	move_pages_to_lru(lruvec, &list);
 
+	walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
+	if (walk && walk->batched)
+		reset_batch_size(lruvec, walk);
+
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, reclaimed);
@@ -3943,20 +4857,25 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
 		return 0;
 	}
 
-	inc_max_seq(lruvec, max_seq);
+	if (try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false))
+		return nr_to_scan;
 
-	return nr_to_scan;
+	return min_seq[LRU_GEN_FILE] + MIN_NR_GENS <= max_seq ? nr_to_scan : 0;
 }
 
 static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
 	struct blk_plug plug;
 	long scanned = 0;
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
 	lru_add_drain();
 
 	blk_start_plug(&plug);
 
+	if (current_is_kswapd())
+		current->reclaim_state->mm_walk = &pgdat->mm_walk;
+
 	while (true) {
 		int delta;
 		int swappiness;
@@ -3984,6 +4903,9 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 		cond_resched();
 	}
 
+	if (current_is_kswapd())
+		current->reclaim_state->mm_walk = NULL;
+
 	blk_finish_plug(&plug);
 }
 
@@ -4000,15 +4922,21 @@ void lru_gen_init_lruvec(struct lruvec *lruvec)
 
 	for_each_gen_type_zone(gen, type, zone)
 		INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
+
+	lruvec->mm_state.seq = MIN_NR_GENS;
+	init_waitqueue_head(&lruvec->mm_state.wait);
 }
 
 #ifdef CONFIG_MEMCG
 void lru_gen_init_memcg(struct mem_cgroup *memcg)
 {
+	INIT_LIST_HEAD(&memcg->mm_list.fifo);
+	spin_lock_init(&memcg->mm_list.lock);
 }
 
 void lru_gen_exit_memcg(struct mem_cgroup *memcg)
 {
+	int i;
 	int nid;
 
 	for_each_node(nid) {
@@ -4016,6 +4944,11 @@ void lru_gen_exit_memcg(struct mem_cgroup *memcg)
 
 		VM_BUG_ON(memchr_inv(lruvec->lrugen.nr_pages, 0,
 				     sizeof(lruvec->lrugen.nr_pages)));
+
+		for (i = 0; i < NR_BLOOM_FILTERS; i++) {
+			bitmap_free(lruvec->mm_state.filters[i]);
+			lruvec->mm_state.filters[i] = NULL;
+		}
 	}
 }
 #endif
@@ -4024,6 +4957,7 @@ static int __init init_lru_gen(void)
 {
 	BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS);
 	BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
+	BUILD_BUG_ON(sizeof(MM_STAT_CODES) != NR_MM_STATS + 1);
 
 	return 0;
 };
-- 
2.35.1.616.g0bdcbb4464-goog
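
For readers who want to experiment with the Bloom filter helpers from the
patch above outside the kernel, here is a minimal userspace sketch. It is not
the kernel code. It mirrors the shape of get_item_key(),
update_bloom_filter() and test_bloom_filter(): one hash is split into two
15-bit keys, and a missing filter is treated as "may contain". The names
hash30(), item_keys(), filter_alloc(), filter_add() and filter_test() are
hypothetical, and the multiplicative hash is only an assumed stand-in for
hash_ptr().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define FILTER_SHIFT 15				/* mirrors BLOOM_FILTER_SHIFT */
#define FILTER_BITS  (1u << FILTER_SHIFT)	/* m = 2^15 bits per filter */
#define WORD_BITS    (8 * sizeof(unsigned long))

/* Assumed stand-in for hash_ptr(item, 2 * FILTER_SHIFT): keep the top 30 bits. */
static uint32_t hash30(const void *item)
{
	uint64_t v = (uint64_t)(uintptr_t)item * 0x61C8864680B583EBull;

	return (uint32_t)(v >> (64 - 2 * FILTER_SHIFT));
}

/* Split one 30-bit hash into two 15-bit bit positions (k = 2). */
static void item_keys(const void *item, uint32_t key[2])
{
	uint32_t hash = hash30(item);

	key[0] = hash & (FILTER_BITS - 1);
	key[1] = hash >> FILTER_SHIFT;
}

/* Returns a zeroed bitmap, or NULL if the allocation fails. */
static unsigned long *filter_alloc(void)
{
	return calloc(FILTER_BITS / WORD_BITS, sizeof(unsigned long));
}

static void filter_set(unsigned long *f, uint32_t bit)
{
	assert(bit < FILTER_BITS);
	f[bit / WORD_BITS] |= 1ul << (bit % WORD_BITS);
}

static bool filter_get(const unsigned long *f, uint32_t bit)
{
	assert(bit < FILTER_BITS);
	return (f[bit / WORD_BITS] >> (bit % WORD_BITS)) & 1ul;
}

/* Like update_bloom_filter(): silently do nothing if there is no filter. */
static void filter_add(unsigned long *f, const void *item)
{
	uint32_t key[2];

	if (!f)
		return;

	item_keys(item, key);
	filter_set(f, key[0]);
	filter_set(f, key[1]);
}

/* Like test_bloom_filter(): no filter means "assume present". */
static bool filter_test(const unsigned long *f, const void *item)
{
	uint32_t key[2];

	if (!f)
		return true;

	item_keys(item, key);
	return filter_get(f, key[0]) && filter_get(f, key[1]);
}
```

Returning true when the filter is missing keeps the failure mode safe: if the
GFP_ATOMIC allocation in reset_bloom_filter() fails, the walker scans more
PTE tables than necessary rather than skipping ones it should have visited.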


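The suitable_to_scan() heuristic in the patch above can also be tried in
isolation. The sketch below is a userspace approximation under stated
assumptions: the cache line size and the PTE size are passed in as parameters
(the kernel reads them from cache_line_size() and sizeof(pte_t)), and
clamp_int() and suitable_to_scan_sketch() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper standing in for the kernel's clamp_t(int, ...). */
static int clamp_int(int v, int lo, int hi)
{
	return v < lo ? lo : v > hi ? hi : v;
}

/*
 * A PTE table is worth rescanning if, on average, at least one PTE per
 * cache line was young. n is the number of PTEs per cache line, clamped
 * to [2, 8] as in the patch.
 */
static bool suitable_to_scan_sketch(int total, int young,
				    size_t cache_line, size_t pte_size)
{
	int n;

	assert(pte_size > 0);
	n = clamp_int((int)(cache_line / pte_size), 2, 8);

	return young * n >= total;
}
```

With 64-byte cache lines and 8-byte PTEs, n is 8, so a full 512-entry PTE
table qualifies once 64 of its entries were found young.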

* [PATCH v9 09/14] mm: multi-gen LRU: optimize multiple memcgs
  2022-03-09  2:12 ` Yu Zhao
@ 2022-03-09  2:12   ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-09  2:12 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
	Jonathan Corbet, Matthew Wilcox, Mel Gorman, Michael Larabel,
	Michal Hocko, Mike Rapoport, Rik van Riel, Vlastimil Babka,
	Will Deacon, Ying Huang, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, page-reclaim, x86, Yu Zhao, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

When multiple memcgs are available, it is possible to make better
choices based on generations and tiers and therefore improve the
overall performance under global memory pressure. This patch adds a
rudimentary optimization to select memcgs that can drop single-use
unmapped clean pages first. Doing so reduces the chance of going into
the aging path or swapping. These two operations can be costly.

A typical example that benefits from this optimization is a server
running mixed types of workloads, e.g., heavy anon workload in one
memcg and heavy buffered I/O workload in the other.

Though this optimization can be applied to both kswapd and direct
reclaim, it is only added to kswapd to keep the patchset manageable.
Later improvements will cover the direct reclaim path.
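
The handshake between the aging pass and the eviction pass can be modeled in
userspace as follows. This is a simplified sketch, not the kernel code: struct
sc_flags, age_node_prologue(), evict_epilogue() and MIN_BATCH are hypothetical
names, MIN_BATCH is an assumed placeholder for MIN_LRU_BATCH, and the real
eviction path only clears memcgs_need_swapping once it has scanned its full
quota.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MIN_BATCH 64UL	/* assumed placeholder for MIN_LRU_BATCH */

struct sc_flags {
	bool need_aging;	/* models sc->memcgs_need_aging */
	bool need_swapping;	/* models sc->memcgs_need_swapping */
	bool avoid_swapping;	/* models sc->memcgs_avoid_swapping */
};

/*
 * Models the check at the top of lru_gen_age_node().
 * Returns true if the aging pass should run in this round.
 */
static bool age_node_prologue(struct sc_flags *sc)
{
	assert(sc != NULL);

	if (!sc->need_aging) {
		/* Some memcg could still evict without aging: skip this round. */
		sc->need_aging = true;
		sc->avoid_swapping = !sc->need_swapping;
		sc->need_swapping = true;
		return false;
	}

	sc->need_swapping = true;
	sc->avoid_swapping = true;
	return true;
}

/*
 * Models what the eviction path reports back for one lruvec: whether it
 * needed aging, whether it swapped, and how many pages it reclaimed.
 */
static void evict_epilogue(struct sc_flags *sc, bool lruvec_needs_aging,
			   bool swapped, unsigned long reclaimed)
{
	assert(sc != NULL);

	if (!lruvec_needs_aging)
		sc->need_aging = false;

	if (!swapped && reclaimed >= MIN_BATCH)
		sc->need_swapping = false;
}
```

Both need_* flags are set optimistically by the aging side and cleared by the
eviction side, so a single memcg that can make progress cheaply is enough to
defer the costly paths for every memcg in that round.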

Server benchmark results:
  Mixed workloads:
    fio (buffered I/O): -[28, 30]%
                IOPS         BW
      patch1-7: 3117k        11.9GiB/s
      patch1-8: 2217k        8661MiB/s

    memcached (anon): +[247, 251]%
                Ops/sec      KB/sec
      patch1-7: 563772.35    21900.01
      patch1-8: 1968343.76   76461.24

  Mixed workloads:
    fio (buffered I/O): -[4, 6]%
                IOPS         BW
      5.17-rc2: 2338k        9133MiB/s
      patch1-8: 2217k        8661MiB/s

    memcached (anon): +[524, 530]%
                Ops/sec      KB/sec
      5.17-rc2: 313821.65    12190.55
      patch1-8: 1968343.76   76461.24

  Configurations:
    (changes since patch 5)

    cat mixed.sh
    modprobe brd rd_nr=2 rd_size=56623104

    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    mkfs.ext4 /dev/ram1
    mount -t ext4 /dev/ram1 /mnt

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=90m --group_reporting &
    pid=$!

    sleep 200

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

    kill -INT $pid
    wait

Client benchmark results:
  no change (CONFIG_MEMCG=n)

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 mm/vmscan.c | 45 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 41 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 67dc4190e790..7375c9dae08f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -127,6 +127,13 @@ struct scan_control {
 	/* Always discard instead of demoting to lower tier memory */
 	unsigned int no_demotion:1;
 
+#ifdef CONFIG_LRU_GEN
+	/* help make better choices when multiple memcgs are available */
+	unsigned int memcgs_need_aging:1;
+	unsigned int memcgs_need_swapping:1;
+	unsigned int memcgs_avoid_swapping:1;
+#endif
+
 	/* Allocation order */
 	s8 order;
 
@@ -4348,6 +4355,22 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 
 	VM_BUG_ON(!current_is_kswapd());
 
+	/*
+	 * To reduce the chance of going into the aging path or swapping, which
+	 * can be costly, optimistically skip them unless their corresponding
+	 * flags were cleared in the eviction path. This improves the overall
+	 * performance when multiple memcgs are available.
+	 */
+	if (!sc->memcgs_need_aging) {
+		sc->memcgs_need_aging = true;
+		sc->memcgs_avoid_swapping = !sc->memcgs_need_swapping;
+		sc->memcgs_need_swapping = true;
+		return;
+	}
+
+	sc->memcgs_need_swapping = true;
+	sc->memcgs_avoid_swapping = true;
+
 	current->reclaim_state->mm_walk = &pgdat->mm_walk;
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
@@ -4750,7 +4773,8 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
 	return scanned;
 }
 
-static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
+static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
+			bool *swapped)
 {
 	int type;
 	int scanned;
@@ -4816,6 +4840,9 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 
 	sc->nr_reclaimed += reclaimed;
 
+	if (type == LRU_GEN_ANON && swapped)
+		*swapped = true;
+
 	return scanned;
 }
 
@@ -4844,8 +4871,10 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
 	if (!nr_to_scan)
 		return 0;
 
-	if (!need_aging)
+	if (!need_aging) {
+		sc->memcgs_need_aging = false;
 		return nr_to_scan;
+	}
 
 	/* leave the work to lru_gen_age_node() */
 	if (current_is_kswapd())
@@ -4867,6 +4896,8 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 {
 	struct blk_plug plug;
 	long scanned = 0;
+	bool swapped = false;
+	unsigned long reclaimed = sc->nr_reclaimed;
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
 	lru_add_drain();
@@ -4892,13 +4923,19 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 		if (!nr_to_scan)
 			break;
 
-		delta = evict_folios(lruvec, sc, swappiness);
+		delta = evict_folios(lruvec, sc, swappiness, &swapped);
 		if (!delta)
 			break;
 
+		if (sc->memcgs_avoid_swapping && swappiness < 200 && swapped)
+			break;
+
 		scanned += delta;
-		if (scanned >= nr_to_scan)
+		if (scanned >= nr_to_scan) {
+			if (!swapped && sc->nr_reclaimed - reclaimed >= MIN_LRU_BATCH)
+				sc->memcgs_need_swapping = false;
 			break;
+		}
 
 		cond_resched();
 	}
-- 
2.35.1.616.g0bdcbb4464-goog



* [PATCH v9 09/14] mm: multi-gen LRU: optimize multiple memcgs
@ 2022-03-09  2:12   ` Yu Zhao
  0 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-09  2:12 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
	Jonathan Corbet, Matthew Wilcox, Mel Gorman, Michael Larabel,
	Michal Hocko, Mike Rapoport, Rik van Riel, Vlastimil Babka,
	Will Deacon, Ying Huang, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, page-reclaim, x86, Yu Zhao, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

When multiple memcgs are available, it is possible to make better
choices based on generations and tiers and therefore improve the
overall performance under global memory pressure. This patch adds a
rudimentary optimization to select memcgs that can drop single-use
unmapped clean pages first. Doing so reduces the chance of going into
the aging path or swapping, both of which can be costly.
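
The handshake between the aging path and the eviction path can be
modeled in user space. The sketch below is a simplified illustration,
not kernel code: struct memcg_hints, age_node(), note_eviction() and
should_stop() are hypothetical names that mirror the three
scan_control bits and the updates this patch makes in
lru_gen_age_node(), get_nr_to_scan() and lru_gen_shrink_lruvec(). It
assumes the flags start out zeroed and that MIN_LRU_BATCH is 64
(BITS_PER_LONG on a 64-bit build).

```c
#include <stdbool.h>

#define MIN_BATCH 64UL	/* assumption: MIN_LRU_BATCH on a 64-bit build */

/* hypothetical stand-in for the three bits added to struct scan_control */
struct memcg_hints {
	bool need_aging;
	bool need_swapping;
	bool avoid_swapping;
};

/*
 * Models the check added to lru_gen_age_node(): returns true when the
 * aging walk would run, false when it is optimistically skipped.
 */
static bool age_node(struct memcg_hints *h)
{
	if (!h->need_aging) {
		/* eviction found a memcg that did not need aging */
		h->need_aging = true;
		h->avoid_swapping = !h->need_swapping;
		h->need_swapping = true;
		return false;
	}

	h->need_swapping = true;
	h->avoid_swapping = true;
	return true;
}

/*
 * Collapses the updates made in the eviction path into one call. In the
 * patch they happen in get_nr_to_scan() and, once enough has been
 * scanned, in lru_gen_shrink_lruvec().
 */
static void note_eviction(struct memcg_hints *h, bool memcg_needs_aging,
			  bool swapped, unsigned long reclaimed)
{
	if (!memcg_needs_aging)
		h->need_aging = false;

	if (!swapped && reclaimed >= MIN_BATCH)
		h->need_swapping = false;
}

/* Models the early break added to the eviction loop. */
static bool should_stop(const struct memcg_hints *h, int swappiness,
			bool swapped)
{
	return h->avoid_swapping && swappiness < 200 && swapped;
}
```

Starting from zeroed flags, the first age_node() call skips aging and
sets avoid_swapping. Aging runs on a later call only if no eviction
pass cleared need_aging in between.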

A typical example that benefits from this optimization is a server
running mixed types of workloads, e.g., a heavy anon workload in one
memcg and a heavy buffered I/O workload in another.

Though this optimization can be applied to both kswapd and direct
reclaim, it is only added to kswapd to keep the patchset manageable.
Later improvements will cover the direct reclaim path.

Server benchmark results:
  Mixed workloads:
    fio (buffered I/O): -[28, 30]%
                IOPS         BW
      patch1-7: 3117k        11.9GiB/s
      patch1-8: 2217k        8661MiB/s

    memcached (anon): +[247, 251]%
                Ops/sec      KB/sec
      patch1-7: 563772.35    21900.01
      patch1-8: 1968343.76   76461.24

  Mixed workloads:
    fio (buffered I/O): -[4, 6]%
                IOPS         BW
      5.17-rc2: 2338k        9133MiB/s
      patch1-8: 2217k        8661MiB/s

    memcached (anon): +[524, 530]%
                Ops/sec      KB/sec
      5.17-rc2: 313821.65    12190.55
      patch1-8: 1968343.76   76461.24

  Configurations:
    (changes since patch 5)

    cat mixed.sh
    modprobe brd rd_nr=2 rd_size=56623104

    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    mkfs.ext4 /dev/ram1
    mount -t ext4 /dev/ram1 /mnt

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=90m --group_reporting &
    pid=$!

    sleep 200

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

    kill -INT $pid
    wait

Client benchmark results:
  no change (CONFIG_MEMCG=n)

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 mm/vmscan.c | 45 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 41 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 67dc4190e790..7375c9dae08f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -127,6 +127,13 @@ struct scan_control {
 	/* Always discard instead of demoting to lower tier memory */
 	unsigned int no_demotion:1;
 
+#ifdef CONFIG_LRU_GEN
+	/* help make better choices when multiple memcgs are available */
+	unsigned int memcgs_need_aging:1;
+	unsigned int memcgs_need_swapping:1;
+	unsigned int memcgs_avoid_swapping:1;
+#endif
+
 	/* Allocation order */
 	s8 order;
 
@@ -4348,6 +4355,22 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 
 	VM_BUG_ON(!current_is_kswapd());
 
+	/*
+	 * To reduce the chance of going into the aging path or swapping, which
+	 * can be costly, optimistically skip them unless their corresponding
+	 * flags were cleared in the eviction path. This improves the overall
+	 * performance when multiple memcgs are available.
+	 */
+	if (!sc->memcgs_need_aging) {
+		sc->memcgs_need_aging = true;
+		sc->memcgs_avoid_swapping = !sc->memcgs_need_swapping;
+		sc->memcgs_need_swapping = true;
+		return;
+	}
+
+	sc->memcgs_need_swapping = true;
+	sc->memcgs_avoid_swapping = true;
+
 	current->reclaim_state->mm_walk = &pgdat->mm_walk;
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
@@ -4750,7 +4773,8 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
 	return scanned;
 }
 
-static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
+static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
+			bool *swapped)
 {
 	int type;
 	int scanned;
@@ -4816,6 +4840,9 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 
 	sc->nr_reclaimed += reclaimed;
 
+	if (type == LRU_GEN_ANON && swapped)
+		*swapped = true;
+
 	return scanned;
 }
 
@@ -4844,8 +4871,10 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
 	if (!nr_to_scan)
 		return 0;
 
-	if (!need_aging)
+	if (!need_aging) {
+		sc->memcgs_need_aging = false;
 		return nr_to_scan;
+	}
 
 	/* leave the work to lru_gen_age_node() */
 	if (current_is_kswapd())
@@ -4867,6 +4896,8 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 {
 	struct blk_plug plug;
 	long scanned = 0;
+	bool swapped = false;
+	unsigned long reclaimed = sc->nr_reclaimed;
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
 	lru_add_drain();
@@ -4892,13 +4923,19 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 		if (!nr_to_scan)
 			break;
 
-		delta = evict_folios(lruvec, sc, swappiness);
+		delta = evict_folios(lruvec, sc, swappiness, &swapped);
 		if (!delta)
 			break;
 
+		if (sc->memcgs_avoid_swapping && swappiness < 200 && swapped)
+			break;
+
 		scanned += delta;
-		if (scanned >= nr_to_scan)
+		if (scanned >= nr_to_scan) {
+			if (!swapped && sc->nr_reclaimed - reclaimed >= MIN_LRU_BATCH)
+				sc->memcgs_need_swapping = false;
 			break;
+		}
 
 		cond_resched();
 	}
-- 
2.35.1.616.g0bdcbb4464-goog



* [PATCH v9 10/14] mm: multi-gen LRU: kill switch
  2022-03-09  2:12 ` Yu Zhao
@ 2022-03-09  2:12   ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-09  2:12 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
	Jonathan Corbet, Matthew Wilcox, Mel Gorman, Michael Larabel,
	Michal Hocko, Mike Rapoport, Rik van Riel, Vlastimil Babka,
	Will Deacon, Ying Huang, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, page-reclaim, x86, Yu Zhao, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
can be disabled include:
  0x0001: the multi-gen LRU core
  0x0002: walking page tables, when arch_has_hw_pte_young() returns
          true
  0x0004: clearing the accessed bit in non-leaf PMD entries, when
          CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
  [yYnN]: apply to all the components above
E.g.,
  echo y >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0007
  echo 5 >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0005
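
The mapping from a written value to component bits can be sketched in
user space. This is an illustration under stated assumptions, not the
kernel implementation: parse_enabled() is a hypothetical stand-in for
store_enable() in this patch, it uses strtoul() where the kernel uses
kstrtouint(), and it masks the result to the three documented bits,
whereas the kernel reports a bit only when the hardware or config
supports the component.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* same order as the enum this patch adds to include/linux/mmzone.h */
enum {
	LRU_GEN_CORE,
	LRU_GEN_MM_WALK,
	LRU_GEN_NONLEAF_YOUNG,
	NR_LRU_GEN_CAPS
};

#define BIT(n)		(1U << (n))
#define ALL_CAPS	(BIT(NR_LRU_GEN_CAPS) - 1)

/*
 * Hypothetical helper: turns the text written to the sysfs file into a
 * capability mask. Returns 0 on success and -1 on invalid input.
 */
static int parse_enabled(const char *buf, unsigned int *caps)
{
	char *end;
	unsigned long val;
	int first = tolower((unsigned char)*buf);

	if (first == 'n') {
		*caps = 0;
		return 0;
	}

	if (first == 'y') {
		*caps = ALL_CAPS;
		return 0;
	}

	/* accept only unsigned numbers; base 0 allows decimal, hex, octal */
	if (!isdigit((unsigned char)*buf))
		return -1;

	errno = 0;
	val = strtoul(buf, &end, 0);
	if (errno || (*end != '\0' && *end != '\n'))
		return -1;

	*caps = (unsigned int)val & ALL_CAPS;
	return 0;
}
```

With this model, "y" yields 0x7 and "5" yields 0x5, i.e. the core plus
clearing the accessed bit in non-leaf PMD entries, with page table
walks disabled.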

NB: page table walks happen on the scale of seconds under heavy memory
pressure, in which case mmap_lock contention is a lesser concern than
LRU lock contention and I/O congestion. So far the only well-known
case of mmap_lock contention is on Android, due to Scudo [1], which
allocates several thousand VMAs for merely a few hundred MBs. The SPF
(speculative page faults) and Maple Tree patchsets have also provided
their own assessments [2][3]. However, if walking page tables does
worsen mmap_lock contention, the kill switch can be used to disable
it. In this case the multi-gen LRU will suffer a minor performance
degradation, as shown previously.

Clearing the accessed bit in non-leaf PMD entries can also be
disabled, since this behavior was not tested on x86 varieties other
than Intel and AMD.

[1] https://source.android.com/devices/tech/debug/scudo
[2] https://lore.kernel.org/lkml/20220128131006.67712-1-michel@lespinasse.org/
[3] https://lore.kernel.org/lkml/20220202024137.2516438-1-Liam.Howlett@oracle.com/

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 include/linux/cgroup.h          |  15 +-
 include/linux/mm_inline.h       |  12 +-
 include/linux/mmzone.h          |   9 ++
 kernel/cgroup/cgroup-internal.h |   1 -
 mm/Kconfig                      |   6 +
 mm/vmscan.c                     | 237 +++++++++++++++++++++++++++++++-
 6 files changed, 271 insertions(+), 9 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 75c151413fda..b145025f3eac 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -432,6 +432,18 @@ static inline void cgroup_put(struct cgroup *cgrp)
 	css_put(&cgrp->self);
 }
 
+extern struct mutex cgroup_mutex;
+
+static inline void cgroup_lock(void)
+{
+	mutex_lock(&cgroup_mutex);
+}
+
+static inline void cgroup_unlock(void)
+{
+	mutex_unlock(&cgroup_mutex);
+}
+
 /**
  * task_css_set_check - obtain a task's css_set with extra access conditions
  * @task: the task to obtain css_set for
@@ -446,7 +458,6 @@ static inline void cgroup_put(struct cgroup *cgrp)
  * as locks used during the cgroup_subsys::attach() methods.
  */
 #ifdef CONFIG_PROVE_RCU
-extern struct mutex cgroup_mutex;
 extern spinlock_t css_set_lock;
 #define task_css_set_check(task, __c)					\
 	rcu_dereference_check((task)->cgroups,				\
@@ -707,6 +718,8 @@ struct cgroup;
 static inline u64 cgroup_id(const struct cgroup *cgrp) { return 1; }
 static inline void css_get(struct cgroup_subsys_state *css) {}
 static inline void css_put(struct cgroup_subsys_state *css) {}
+static inline void cgroup_lock(void) {}
+static inline void cgroup_unlock(void) {}
 static inline int cgroup_attach_task_all(struct task_struct *from,
 					 struct task_struct *t) { return 0; }
 static inline int cgroupstats_build(struct cgroupstats *stats,
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 15a04a9b5560..1c8d617e73a9 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -106,7 +106,15 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio)
 
 static inline bool lru_gen_enabled(void)
 {
-	return true;
+#ifdef CONFIG_LRU_GEN_ENABLED
+	DECLARE_STATIC_KEY_TRUE(lru_gen_caps[NR_LRU_GEN_CAPS]);
+
+	return static_branch_likely(&lru_gen_caps[LRU_GEN_CORE]);
+#else
+	DECLARE_STATIC_KEY_FALSE(lru_gen_caps[NR_LRU_GEN_CAPS]);
+
+	return static_branch_unlikely(&lru_gen_caps[LRU_GEN_CORE]);
+#endif
 }
 
 static inline bool lru_gen_in_fault(void)
@@ -196,7 +204,7 @@ static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio,
 	int zone = folio_zonenum(folio);
 	struct lru_gen_struct *lrugen = &lruvec->lrugen;
 
-	if (folio_test_unevictable(folio))
+	if (folio_test_unevictable(folio) || !lrugen->enabled)
 		return false;
 	/*
 	 * There are three common cases for this page:
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a2d53025a321..116c9237e401 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -371,6 +371,13 @@ enum {
 	LRU_GEN_FILE,
 };
 
+enum {
+	LRU_GEN_CORE,
+	LRU_GEN_MM_WALK,
+	LRU_GEN_NONLEAF_YOUNG,
+	NR_LRU_GEN_CAPS
+};
+
 #define MIN_LRU_BATCH		BITS_PER_LONG
 #define MAX_LRU_BATCH		(MIN_LRU_BATCH * 128)
 
@@ -409,6 +416,8 @@ struct lru_gen_struct {
 	/* can be modified without holding the LRU lock */
 	atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
 	atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
+	/* whether the multi-gen LRU is enabled */
+	bool enabled;
 };
 
 enum {
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 6e36e854b512..929ed3bf1a7c 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -165,7 +165,6 @@ struct cgroup_mgctx {
 #define DEFINE_CGROUP_MGCTX(name)						\
 	struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name)
 
-extern struct mutex cgroup_mutex;
 extern spinlock_t css_set_lock;
 extern struct cgroup_subsys *cgroup_subsys[];
 extern struct list_head cgroup_roots;
diff --git a/mm/Kconfig b/mm/Kconfig
index 804c2bca8205..050de1eae2d6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -901,6 +901,12 @@ config LRU_GEN
 	help
 	  A high performance LRU implementation for memory overcommit.
 
+config LRU_GEN_ENABLED
+	bool "Enable by default"
+	depends on LRU_GEN
+	help
+	  This option enables the multi-gen LRU by default.
+
 config LRU_GEN_STATS
 	bool "Full stats for debugging"
 	depends on LRU_GEN
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7375c9dae08f..55cc7d6b018b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3063,6 +3063,12 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
 
 #ifdef CONFIG_LRU_GEN
 
+#ifdef CONFIG_LRU_GEN_ENABLED
+DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS);
+#else
+DEFINE_STATIC_KEY_ARRAY_FALSE(lru_gen_caps, NR_LRU_GEN_CAPS);
+#endif
+
 /******************************************************************************
  *                          shorthand helpers
  ******************************************************************************/
@@ -3099,6 +3105,15 @@ static int folio_lru_tier(struct folio *folio)
 	return lru_tier_from_refs(refs);
 }
 
+static bool get_cap(int cap)
+{
+#ifdef CONFIG_LRU_GEN_ENABLED
+	return static_branch_likely(&lru_gen_caps[cap]);
+#else
+	return static_branch_unlikely(&lru_gen_caps[cap]);
+#endif
+}
+
 static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
 {
 	struct pglist_data *pgdat = NODE_DATA(nid);
@@ -3892,7 +3907,8 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area
 			goto next;
 
 		if (!pmd_trans_huge(pmd[i])) {
-			if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG))
+			if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) &&
+			    get_cap(LRU_GEN_NONLEAF_YOUNG))
 				pmdp_test_and_clear_young(vma, addr, pmd + i);
 			goto next;
 		}
@@ -3999,10 +4015,12 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 		priv->mm_stats[MM_PMD_TOTAL]++;
 
 #ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
-		if (!pmd_young(val))
-			continue;
+		if (get_cap(LRU_GEN_NONLEAF_YOUNG)) {
+			if (!pmd_young(val))
+				continue;
 
-		walk_pmd_range_locked(pud, addr, vma, walk, &pos);
+			walk_pmd_range_locked(pud, addr, vma, walk, &pos);
+		}
 #endif
 		if (!priv->full_scan && !test_bloom_filter(priv->lruvec, priv->max_seq, pmd + i))
 			continue;
@@ -4233,7 +4251,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
 	 * handful of PTEs. Spreading the work out over a period of time usually
 	 * is less efficient, but it avoids bursty page faults.
 	 */
-	if (!full_scan && !arch_has_hw_pte_young()) {
+	if (!full_scan && (!arch_has_hw_pte_young() || !get_cap(LRU_GEN_MM_WALK))) {
 		success = iterate_mm_list_nowalk(lruvec, max_seq);
 		goto done;
 	}
@@ -4946,6 +4964,211 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 	blk_finish_plug(&plug);
 }
 
+/******************************************************************************
+ *                          state change
+ ******************************************************************************/
+
+static bool __maybe_unused state_is_valid(struct lruvec *lruvec)
+{
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	if (lrugen->enabled) {
+		enum lru_list lru;
+
+		for_each_evictable_lru(lru) {
+			if (!list_empty(&lruvec->lists[lru]))
+				return false;
+		}
+	} else {
+		int gen, type, zone;
+
+		for_each_gen_type_zone(gen, type, zone) {
+			if (!list_empty(&lrugen->lists[gen][type][zone]))
+				return false;
+
+			/* unlikely but not a bug when reset_batch_size() is pending */
+			VM_WARN_ON(lrugen->nr_pages[gen][type][zone]);
+		}
+	}
+
+	return true;
+}
+
+static bool fill_evictable(struct lruvec *lruvec)
+{
+	enum lru_list lru;
+	int remaining = MAX_LRU_BATCH;
+
+	for_each_evictable_lru(lru) {
+		int type = is_file_lru(lru);
+		bool active = is_active_lru(lru);
+		struct list_head *head = &lruvec->lists[lru];
+
+		while (!list_empty(head)) {
+			bool success;
+			struct folio *folio = lru_to_folio(head);
+
+			VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
+			VM_BUG_ON_FOLIO(folio_test_active(folio) != active, folio);
+			VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
+			VM_BUG_ON_FOLIO(folio_lru_gen(folio) < MAX_NR_GENS, folio);
+
+			lruvec_del_folio(lruvec, folio);
+			success = lru_gen_add_folio(lruvec, folio, false);
+			VM_BUG_ON(!success);
+
+			if (!--remaining)
+				return false;
+		}
+	}
+
+	return true;
+}
+
+static bool drain_evictable(struct lruvec *lruvec)
+{
+	int gen, type, zone;
+	int remaining = MAX_LRU_BATCH;
+
+	for_each_gen_type_zone(gen, type, zone) {
+		struct list_head *head = &lruvec->lrugen.lists[gen][type][zone];
+
+		while (!list_empty(head)) {
+			bool success;
+			struct folio *folio = lru_to_folio(head);
+
+			VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
+			VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
+			VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
+			VM_BUG_ON_FOLIO(folio_zonenum(folio) != zone, folio);
+
+			success = lru_gen_del_folio(lruvec, folio, false);
+			VM_BUG_ON(!success);
+			lruvec_add_folio(lruvec, folio);
+
+			if (!--remaining)
+				return false;
+		}
+	}
+
+	return true;
+}
+
+static void lru_gen_change_state(bool enable)
+{
+	static DEFINE_MUTEX(state_mutex);
+
+	struct mem_cgroup *memcg;
+
+	cgroup_lock();
+	cpus_read_lock();
+	get_online_mems();
+	mutex_lock(&state_mutex);
+
+	if (enable == lru_gen_enabled())
+		goto unlock;
+
+	if (enable)
+		static_branch_enable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
+	else
+		static_branch_disable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		int nid;
+
+		for_each_node(nid) {
+			struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+			if (!lruvec)
+				continue;
+
+			spin_lock_irq(&lruvec->lru_lock);
+
+			VM_BUG_ON(!seq_is_valid(lruvec));
+			VM_BUG_ON(!state_is_valid(lruvec));
+
+			lruvec->lrugen.enabled = enable;
+
+			while (!(enable ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
+				spin_unlock_irq(&lruvec->lru_lock);
+				cond_resched();
+				spin_lock_irq(&lruvec->lru_lock);
+			}
+
+			spin_unlock_irq(&lruvec->lru_lock);
+		}
+
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+unlock:
+	mutex_unlock(&state_mutex);
+	put_online_mems();
+	cpus_read_unlock();
+	cgroup_unlock();
+}
+
+/******************************************************************************
+ *                          sysfs interface
+ ******************************************************************************/
+
+static ssize_t show_enable(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	unsigned int caps = 0;
+
+	if (get_cap(LRU_GEN_CORE))
+		caps |= BIT(LRU_GEN_CORE);
+
+	if (arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))
+		caps |= BIT(LRU_GEN_MM_WALK);
+
+	if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) && get_cap(LRU_GEN_NONLEAF_YOUNG))
+		caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
+
+	return snprintf(buf, PAGE_SIZE, "0x%04x\n", caps);
+}
+
+static ssize_t store_enable(struct kobject *kobj, struct kobj_attribute *attr,
+			    const char *buf, size_t len)
+{
+	int i;
+	unsigned int caps;
+
+	if (tolower(*buf) == 'n')
+		caps = 0;
+	else if (tolower(*buf) == 'y')
+		caps = -1;
+	else if (kstrtouint(buf, 0, &caps))
+		return -EINVAL;
+
+	for (i = 0; i < NR_LRU_GEN_CAPS; i++) {
+		bool enable = caps & BIT(i);
+
+		if (i == LRU_GEN_CORE)
+			lru_gen_change_state(enable);
+		else if (enable)
+			static_branch_enable(&lru_gen_caps[i]);
+		else
+			static_branch_disable(&lru_gen_caps[i]);
+	}
+
+	return len;
+}
+
+static struct kobj_attribute lru_gen_enabled_attr = __ATTR(
+	enabled, 0644, show_enable, store_enable
+);
+
+static struct attribute *lru_gen_attrs[] = {
+	&lru_gen_enabled_attr.attr,
+	NULL
+};
+
+static struct attribute_group lru_gen_attr_group = {
+	.name = "lru_gen",
+	.attrs = lru_gen_attrs,
+};
+
 /******************************************************************************
  *                          initialization
  ******************************************************************************/
@@ -4956,6 +5179,7 @@ void lru_gen_init_lruvec(struct lruvec *lruvec)
 	struct lru_gen_struct *lrugen = &lruvec->lrugen;
 
 	lrugen->max_seq = MIN_NR_GENS + 1;
+	lrugen->enabled = lru_gen_enabled();
 
 	for_each_gen_type_zone(gen, type, zone)
 		INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
@@ -4996,6 +5220,9 @@ static int __init init_lru_gen(void)
 	BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
 	BUILD_BUG_ON(sizeof(MM_STAT_CODES) != NR_MM_STATS + 1);
 
+	if (sysfs_create_group(mm_kobj, &lru_gen_attr_group))
+		pr_err("lru_gen: failed to create sysfs group\n");
+
 	return 0;
 };
 late_initcall(init_lru_gen);
-- 
2.35.1.616.g0bdcbb4464-goog



* [PATCH v9 10/14] mm: multi-gen LRU: kill switch
@ 2022-03-09  2:12   ` Yu Zhao
  0 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-09  2:12 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
	Jonathan Corbet, Matthew Wilcox, Mel Gorman, Michael Larabel,
	Michal Hocko, Mike Rapoport, Rik van Riel, Vlastimil Babka,
	Will Deacon, Ying Huang, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, page-reclaim, x86, Yu Zhao, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
can be disabled include:
  0x0001: the multi-gen LRU core
  0x0002: walking page tables, when arch_has_hw_pte_young() returns
          true
  0x0004: clearing the accessed bit in non-leaf PMD entries, when
          CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
  [yYnN]: apply to all the components above
E.g.,
  echo y >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0007
  echo 5 >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0005

NB: page table walks happen on the scale of seconds under heavy memory
pressure, in which case mmap_lock contention is a lesser concern than
LRU lock contention and I/O congestion. So far the only well-known
case of mmap_lock contention is on Android, due to Scudo [1], which
allocates several thousand VMAs for merely a few hundred MBs. The SPF
(speculative page faults) and Maple Tree patchsets have also provided
their own assessments [2][3]. However, if walking page tables does
worsen mmap_lock contention, the kill switch can be used to disable
it. In this case the multi-gen LRU will suffer a minor performance
degradation, as shown previously.

Clearing the accessed bit in non-leaf PMD entries can also be
disabled, since this behavior was not tested on x86 varieties other
than Intel and AMD.

[1] https://source.android.com/devices/tech/debug/scudo
[2] https://lore.kernel.org/lkml/20220128131006.67712-1-michel@lespinasse.org/
[3] https://lore.kernel.org/lkml/20220202024137.2516438-1-Liam.Howlett@oracle.com/

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 include/linux/cgroup.h          |  15 +-
 include/linux/mm_inline.h       |  12 +-
 include/linux/mmzone.h          |   9 ++
 kernel/cgroup/cgroup-internal.h |   1 -
 mm/Kconfig                      |   6 +
 mm/vmscan.c                     | 237 +++++++++++++++++++++++++++++++-
 6 files changed, 271 insertions(+), 9 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 75c151413fda..b145025f3eac 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -432,6 +432,18 @@ static inline void cgroup_put(struct cgroup *cgrp)
 	css_put(&cgrp->self);
 }
 
+extern struct mutex cgroup_mutex;
+
+static inline void cgroup_lock(void)
+{
+	mutex_lock(&cgroup_mutex);
+}
+
+static inline void cgroup_unlock(void)
+{
+	mutex_unlock(&cgroup_mutex);
+}
+
 /**
  * task_css_set_check - obtain a task's css_set with extra access conditions
  * @task: the task to obtain css_set for
@@ -446,7 +458,6 @@ static inline void cgroup_put(struct cgroup *cgrp)
  * as locks used during the cgroup_subsys::attach() methods.
  */
 #ifdef CONFIG_PROVE_RCU
-extern struct mutex cgroup_mutex;
 extern spinlock_t css_set_lock;
 #define task_css_set_check(task, __c)					\
 	rcu_dereference_check((task)->cgroups,				\
@@ -707,6 +718,8 @@ struct cgroup;
 static inline u64 cgroup_id(const struct cgroup *cgrp) { return 1; }
 static inline void css_get(struct cgroup_subsys_state *css) {}
 static inline void css_put(struct cgroup_subsys_state *css) {}
+static inline void cgroup_lock(void) {}
+static inline void cgroup_unlock(void) {}
 static inline int cgroup_attach_task_all(struct task_struct *from,
 					 struct task_struct *t) { return 0; }
 static inline int cgroupstats_build(struct cgroupstats *stats,
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 15a04a9b5560..1c8d617e73a9 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -106,7 +106,15 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio)
 
 static inline bool lru_gen_enabled(void)
 {
-	return true;
+#ifdef CONFIG_LRU_GEN_ENABLED
+	DECLARE_STATIC_KEY_TRUE(lru_gen_caps[NR_LRU_GEN_CAPS]);
+
+	return static_branch_likely(&lru_gen_caps[LRU_GEN_CORE]);
+#else
+	DECLARE_STATIC_KEY_FALSE(lru_gen_caps[NR_LRU_GEN_CAPS]);
+
+	return static_branch_unlikely(&lru_gen_caps[LRU_GEN_CORE]);
+#endif
 }
 
 static inline bool lru_gen_in_fault(void)
@@ -196,7 +204,7 @@ static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio,
 	int zone = folio_zonenum(folio);
 	struct lru_gen_struct *lrugen = &lruvec->lrugen;
 
-	if (folio_test_unevictable(folio))
+	if (folio_test_unevictable(folio) || !lrugen->enabled)
 		return false;
 	/*
 	 * There are three common cases for this page:
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a2d53025a321..116c9237e401 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -371,6 +371,13 @@ enum {
 	LRU_GEN_FILE,
 };
 
+enum {
+	LRU_GEN_CORE,
+	LRU_GEN_MM_WALK,
+	LRU_GEN_NONLEAF_YOUNG,
+	NR_LRU_GEN_CAPS
+};
+
 #define MIN_LRU_BATCH		BITS_PER_LONG
 #define MAX_LRU_BATCH		(MIN_LRU_BATCH * 128)
 
@@ -409,6 +416,8 @@ struct lru_gen_struct {
 	/* can be modified without holding the LRU lock */
 	atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
 	atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
+	/* whether the multi-gen LRU is enabled */
+	bool enabled;
 };
 
 enum {
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 6e36e854b512..929ed3bf1a7c 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -165,7 +165,6 @@ struct cgroup_mgctx {
 #define DEFINE_CGROUP_MGCTX(name)						\
 	struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name)
 
-extern struct mutex cgroup_mutex;
 extern spinlock_t css_set_lock;
 extern struct cgroup_subsys *cgroup_subsys[];
 extern struct list_head cgroup_roots;
diff --git a/mm/Kconfig b/mm/Kconfig
index 804c2bca8205..050de1eae2d6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -901,6 +901,12 @@ config LRU_GEN
 	help
 	  A high performance LRU implementation for memory overcommit.
 
+config LRU_GEN_ENABLED
+	bool "Enable by default"
+	depends on LRU_GEN
+	help
+	  This option enables the multi-gen LRU by default.
+
 config LRU_GEN_STATS
 	bool "Full stats for debugging"
 	depends on LRU_GEN
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7375c9dae08f..55cc7d6b018b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3063,6 +3063,12 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
 
 #ifdef CONFIG_LRU_GEN
 
+#ifdef CONFIG_LRU_GEN_ENABLED
+DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS);
+#else
+DEFINE_STATIC_KEY_ARRAY_FALSE(lru_gen_caps, NR_LRU_GEN_CAPS);
+#endif
+
 /******************************************************************************
  *                          shorthand helpers
  ******************************************************************************/
@@ -3099,6 +3105,15 @@ static int folio_lru_tier(struct folio *folio)
 	return lru_tier_from_refs(refs);
 }
 
+static bool get_cap(int cap)
+{
+#ifdef CONFIG_LRU_GEN_ENABLED
+	return static_branch_likely(&lru_gen_caps[cap]);
+#else
+	return static_branch_unlikely(&lru_gen_caps[cap]);
+#endif
+}
+
 static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
 {
 	struct pglist_data *pgdat = NODE_DATA(nid);
@@ -3892,7 +3907,8 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area
 			goto next;
 
 		if (!pmd_trans_huge(pmd[i])) {
-			if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG))
+			if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) &&
+			    get_cap(LRU_GEN_NONLEAF_YOUNG))
 				pmdp_test_and_clear_young(vma, addr, pmd + i);
 			goto next;
 		}
@@ -3999,10 +4015,12 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 		priv->mm_stats[MM_PMD_TOTAL]++;
 
 #ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
-		if (!pmd_young(val))
-			continue;
+		if (get_cap(LRU_GEN_NONLEAF_YOUNG)) {
+			if (!pmd_young(val))
+				continue;
 
-		walk_pmd_range_locked(pud, addr, vma, walk, &pos);
+			walk_pmd_range_locked(pud, addr, vma, walk, &pos);
+		}
 #endif
 		if (!priv->full_scan && !test_bloom_filter(priv->lruvec, priv->max_seq, pmd + i))
 			continue;
@@ -4233,7 +4251,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
 	 * handful of PTEs. Spreading the work out over a period of time usually
 	 * is less efficient, but it avoids bursty page faults.
 	 */
-	if (!full_scan && !arch_has_hw_pte_young()) {
+	if (!full_scan && (!arch_has_hw_pte_young() || !get_cap(LRU_GEN_MM_WALK))) {
 		success = iterate_mm_list_nowalk(lruvec, max_seq);
 		goto done;
 	}
@@ -4946,6 +4964,211 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 	blk_finish_plug(&plug);
 }
 
+/******************************************************************************
+ *                          state change
+ ******************************************************************************/
+
+static bool __maybe_unused state_is_valid(struct lruvec *lruvec)
+{
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	if (lrugen->enabled) {
+		enum lru_list lru;
+
+		for_each_evictable_lru(lru) {
+			if (!list_empty(&lruvec->lists[lru]))
+				return false;
+		}
+	} else {
+		int gen, type, zone;
+
+		for_each_gen_type_zone(gen, type, zone) {
+			if (!list_empty(&lrugen->lists[gen][type][zone]))
+				return false;
+
+			/* unlikely but not a bug when reset_batch_size() is pending */
+			VM_WARN_ON(lrugen->nr_pages[gen][type][zone]);
+		}
+	}
+
+	return true;
+}
+
+static bool fill_evictable(struct lruvec *lruvec)
+{
+	enum lru_list lru;
+	int remaining = MAX_LRU_BATCH;
+
+	for_each_evictable_lru(lru) {
+		int type = is_file_lru(lru);
+		bool active = is_active_lru(lru);
+		struct list_head *head = &lruvec->lists[lru];
+
+		while (!list_empty(head)) {
+			bool success;
+			struct folio *folio = lru_to_folio(head);
+
+			VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
+			VM_BUG_ON_FOLIO(folio_test_active(folio) != active, folio);
+			VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
+			VM_BUG_ON_FOLIO(folio_lru_gen(folio) < MAX_NR_GENS, folio);
+
+			lruvec_del_folio(lruvec, folio);
+			success = lru_gen_add_folio(lruvec, folio, false);
+			VM_BUG_ON(!success);
+
+			if (!--remaining)
+				return false;
+		}
+	}
+
+	return true;
+}
+
+static bool drain_evictable(struct lruvec *lruvec)
+{
+	int gen, type, zone;
+	int remaining = MAX_LRU_BATCH;
+
+	for_each_gen_type_zone(gen, type, zone) {
+		struct list_head *head = &lruvec->lrugen.lists[gen][type][zone];
+
+		while (!list_empty(head)) {
+			bool success;
+			struct folio *folio = lru_to_folio(head);
+
+			VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
+			VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
+			VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
+			VM_BUG_ON_FOLIO(folio_zonenum(folio) != zone, folio);
+
+			success = lru_gen_del_folio(lruvec, folio, false);
+			VM_BUG_ON(!success);
+			lruvec_add_folio(lruvec, folio);
+
+			if (!--remaining)
+				return false;
+		}
+	}
+
+	return true;
+}
+
+static void lru_gen_change_state(bool enable)
+{
+	static DEFINE_MUTEX(state_mutex);
+
+	struct mem_cgroup *memcg;
+
+	cgroup_lock();
+	cpus_read_lock();
+	get_online_mems();
+	mutex_lock(&state_mutex);
+
+	if (enable == lru_gen_enabled())
+		goto unlock;
+
+	if (enable)
+		static_branch_enable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
+	else
+		static_branch_disable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		int nid;
+
+		for_each_node(nid) {
+			struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+			if (!lruvec)
+				continue;
+
+			spin_lock_irq(&lruvec->lru_lock);
+
+			VM_BUG_ON(!seq_is_valid(lruvec));
+			VM_BUG_ON(!state_is_valid(lruvec));
+
+			lruvec->lrugen.enabled = enable;
+
+			while (!(enable ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
+				spin_unlock_irq(&lruvec->lru_lock);
+				cond_resched();
+				spin_lock_irq(&lruvec->lru_lock);
+			}
+
+			spin_unlock_irq(&lruvec->lru_lock);
+		}
+
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+unlock:
+	mutex_unlock(&state_mutex);
+	put_online_mems();
+	cpus_read_unlock();
+	cgroup_unlock();
+}
+
+/******************************************************************************
+ *                          sysfs interface
+ ******************************************************************************/
+
+static ssize_t show_enable(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	unsigned int caps = 0;
+
+	if (get_cap(LRU_GEN_CORE))
+		caps |= BIT(LRU_GEN_CORE);
+
+	if (arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))
+		caps |= BIT(LRU_GEN_MM_WALK);
+
+	if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) && get_cap(LRU_GEN_NONLEAF_YOUNG))
+		caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
+
+	return snprintf(buf, PAGE_SIZE, "0x%04x\n", caps);
+}
+
+static ssize_t store_enable(struct kobject *kobj, struct kobj_attribute *attr,
+			    const char *buf, size_t len)
+{
+	int i;
+	unsigned int caps;
+
+	if (tolower(*buf) == 'n')
+		caps = 0;
+	else if (tolower(*buf) == 'y')
+		caps = -1;
+	else if (kstrtouint(buf, 0, &caps))
+		return -EINVAL;
+
+	for (i = 0; i < NR_LRU_GEN_CAPS; i++) {
+		bool enable = caps & BIT(i);
+
+		if (i == LRU_GEN_CORE)
+			lru_gen_change_state(enable);
+		else if (enable)
+			static_branch_enable(&lru_gen_caps[i]);
+		else
+			static_branch_disable(&lru_gen_caps[i]);
+	}
+
+	return len;
+}
+
+static struct kobj_attribute lru_gen_enabled_attr = __ATTR(
+	enabled, 0644, show_enable, store_enable
+);
+
+static struct attribute *lru_gen_attrs[] = {
+	&lru_gen_enabled_attr.attr,
+	NULL
+};
+
+static struct attribute_group lru_gen_attr_group = {
+	.name = "lru_gen",
+	.attrs = lru_gen_attrs,
+};
+
 /******************************************************************************
  *                          initialization
  ******************************************************************************/
@@ -4956,6 +5179,7 @@ void lru_gen_init_lruvec(struct lruvec *lruvec)
 	struct lru_gen_struct *lrugen = &lruvec->lrugen;
 
 	lrugen->max_seq = MIN_NR_GENS + 1;
+	lrugen->enabled = lru_gen_enabled();
 
 	for_each_gen_type_zone(gen, type, zone)
 		INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
@@ -4996,6 +5220,9 @@ static int __init init_lru_gen(void)
 	BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
 	BUILD_BUG_ON(sizeof(MM_STAT_CODES) != NR_MM_STATS + 1);
 
+	if (sysfs_create_group(mm_kobj, &lru_gen_attr_group))
+		pr_err("lru_gen: failed to create sysfs group\n");
+
 	return 0;
 };
 late_initcall(init_lru_gen);
-- 
2.35.1.616.g0bdcbb4464-goog
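
The following user-space sketch illustrates how the `enabled` attribute
added above maps its input to capability bits. It mirrors the logic of
store_enable() and show_enable() but is not kernel code: parse_caps()
and format_caps() are hypothetical names, strtoul() stands in for
kstrtouint(), and the sketch assumes every hardware feature is
available, whereas show_enable() also checks arch_has_hw_pte_young()
and CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG.

```c
#include <ctype.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Bit positions copied from the enum this patch adds to mmzone.h. */
enum {
	LRU_GEN_CORE,
	LRU_GEN_MM_WALK,
	LRU_GEN_NONLEAF_YOUNG,
	NR_LRU_GEN_CAPS
};

/* Hypothetical helper: accept 'y', 'n' or a number, like store_enable(). */
static int parse_caps(const char *buf, unsigned int *caps)
{
	char *end;
	unsigned long val;

	if (tolower((unsigned char)*buf) == 'n') {
		*caps = 0;		/* disable everything */
		return 0;
	}
	if (tolower((unsigned char)*buf) == 'y') {
		*caps = ~0U;		/* enable everything */
		return 0;
	}

	errno = 0;
	val = strtoul(buf, &end, 0);	/* base 0: decimal, octal or hex */
	if (errno || end == buf || val > 0xffffffffUL)
		return -EINVAL;
	if (*end == '\n')		/* tolerate one trailing newline */
		end++;
	if (*end)
		return -EINVAL;

	*caps = (unsigned int)val;
	return 0;
}

/* Hypothetical helper: keep the defined bits, print like show_enable(). */
static int format_caps(unsigned int caps, char *out, size_t size)
{
	caps &= (1U << NR_LRU_GEN_CAPS) - 1;
	return snprintf(out, size, "0x%04x\n", caps);
}
```

Under these assumptions, writing "y" and reading the value back yields
0x0007 (all three capabilities), and writing 0x5 requests LRU_GEN_CORE
and LRU_GEN_NONLEAF_YOUNG without LRU_GEN_MM_WALK.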


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v9 11/14] mm: multi-gen LRU: thrashing prevention
  2022-03-09  2:12 ` Yu Zhao
@ 2022-03-09  2:12   ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-09  2:12 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
	Jonathan Corbet, Matthew Wilcox, Mel Gorman, Michael Larabel,
	Michal Hocko, Mike Rapoport, Rik van Riel, Vlastimil Babka,
	Will Deacon, Ying Huang, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, page-reclaim, x86, Yu Zhao, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as
requested by many desktop users [1].

When set to a value N, it prevents the working set of the last N
milliseconds from being evicted. The OOM killer is triggered if this
working set cannot be kept in memory. The default value of 0 disables
the feature. Based on the average human-detectable lag (~100ms),
N=1000 usually eliminates intolerable lags due to thrashing. Larger
values such as N=3000 make lags less noticeable at the risk of
premature OOM kills.
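
As a rough user-space model of the check this patch adds to
age_lruvec(): a lruvec is skipped while its oldest file generation was
born less than min_ttl ticks ago, and lru_gen_age_node() calls the OOM
killer when no lruvec could be aged. The names gen_is_protected() and
all_protected() are hypothetical, and the sketch ignores the other
reasons age_lruvec() can return false (memcgs below min or empty).

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * True if a generation born at `birth` is still younger than `min_ttl`
 * ticks at time `now`. The signed difference keeps the comparison
 * correct across counter wraparound, in the same way the kernel's
 * time_is_after_jiffies() does. A min_ttl of 0 disables protection.
 */
static bool gen_is_protected(unsigned long now, unsigned long birth,
			     unsigned long min_ttl)
{
	if (!min_ttl)
		return false;

	return (long)(now - (birth + min_ttl)) < 0;
}

/*
 * True if the oldest generation of every lruvec is protected, which is
 * the situation where the patch falls back to an OOM kill.
 * births[i] is the birth time of the oldest generation of lruvec i.
 */
static bool all_protected(unsigned long now, const unsigned long *births,
			  size_t n, unsigned long min_ttl)
{
	for (size_t i = 0; i < n; i++) {
		if (!gen_is_protected(now, births[i], min_ttl))
			return false;
	}

	return true;
}
```

For example, with now=1000 and min_ttl=500, a generation born at tick
900 is protected while one born at tick 100 is not, so aging can still
make progress and no OOM kill is attempted.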

Compared with the size-based approach, e.g., [2], this time-based
approach has the following advantages:
1. It is easier to configure because it is agnostic to applications
   and memory sizes.
2. It is more reliable because it is directly wired to the OOM killer.

[1] https://lore.kernel.org/lkml/Ydza%2FzXKY9ATRoh6@google.com/
[2] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 include/linux/mmzone.h |  2 ++
 mm/vmscan.c            | 69 +++++++++++++++++++++++++++++++++++++++---
 2 files changed, 67 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 116c9237e401..f98f9ce50e67 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -403,6 +403,8 @@ struct lru_gen_struct {
 	unsigned long max_seq;
 	/* the eviction increments the oldest generation numbers */
 	unsigned long min_seq[ANON_AND_FILE];
+	/* the birth time of each generation in jiffies */
+	unsigned long timestamps[MAX_NR_GENS];
 	/* the multi-gen LRU lists */
 	struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
 	/* the sizes of the above lists */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 55cc7d6b018b..6aa083b8bb26 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4229,6 +4229,7 @@ static void inc_max_seq(struct lruvec *lruvec)
 	for (type = 0; type < ANON_AND_FILE; type++)
 		reset_ctrl_pos(lruvec, type, false);
 
+	WRITE_ONCE(lrugen->timestamps[next], jiffies);
 	/* make sure preceding modifications appear */
 	smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
 
@@ -4340,7 +4341,8 @@ static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
 	return total > 0 ? total : 0;
 }
 
-static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc,
+		       unsigned long min_ttl)
 {
 	bool need_aging;
 	long nr_to_scan;
@@ -4349,14 +4351,22 @@ static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	DEFINE_MAX_SEQ(lruvec);
 	DEFINE_MIN_SEQ(lruvec);
 
+	if (min_ttl) {
+		int gen = lru_gen_from_seq(min_seq[LRU_GEN_FILE]);
+		unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
+
+		if (time_is_after_jiffies(birth + min_ttl))
+			return false;
+	}
+
 	mem_cgroup_calculate_protection(NULL, memcg);
 
 	if (mem_cgroup_below_min(memcg))
-		return;
+		return false;
 
 	nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, swappiness, &need_aging);
 	if (!nr_to_scan)
-		return;
+		return false;
 
 	nr_to_scan >>= sc->priority;
 
@@ -4365,11 +4375,18 @@ static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 
 	if (nr_to_scan && need_aging && (!mem_cgroup_below_low(memcg) || sc->memcg_low_reclaim))
 		try_to_inc_max_seq(lruvec, max_seq, sc, swappiness, false);
+
+	return true;
 }
 
+/* to protect the working set of the last N jiffies */
+static unsigned long lru_gen_min_ttl __read_mostly;
+
 static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 {
 	struct mem_cgroup *memcg;
+	bool success = false;
+	unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
 
 	VM_BUG_ON(!current_is_kswapd());
 
@@ -4395,12 +4412,29 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 	do {
 		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
-		age_lruvec(lruvec, sc);
+		if (age_lruvec(lruvec, sc, min_ttl))
+			success = true;
 
 		cond_resched();
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
 
 	current->reclaim_state->mm_walk = NULL;
+
+	/*
+	 * The main goal is to OOM kill if every generation from all memcgs is
+	 * younger than min_ttl. However, another theoretical possibility is all
+	 * memcgs are either below min or empty.
+	 */
+	if (!success && mutex_trylock(&oom_lock)) {
+		struct oom_control oc = {
+			.gfp_mask = sc->gfp_mask,
+			.order = sc->order,
+		};
+
+		out_of_memory(&oc);
+
+		mutex_unlock(&oom_lock);
+	}
 }
 
 /*
@@ -5112,6 +5146,28 @@ static void lru_gen_change_state(bool enable)
  *                          sysfs interface
  ******************************************************************************/
 
+static ssize_t show_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl)));
+}
+
+static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr,
+			     const char *buf, size_t len)
+{
+	unsigned int msecs;
+
+	if (kstrtouint(buf, 0, &msecs))
+		return -EINVAL;
+
+	WRITE_ONCE(lru_gen_min_ttl, msecs_to_jiffies(msecs));
+
+	return len;
+}
+
+static struct kobj_attribute lru_gen_min_ttl_attr = __ATTR(
+	min_ttl_ms, 0644, show_min_ttl, store_min_ttl
+);
+
 static ssize_t show_enable(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
 {
 	unsigned int caps = 0;
@@ -5160,6 +5216,7 @@ static struct kobj_attribute lru_gen_enabled_attr = __ATTR(
 );
 
 static struct attribute *lru_gen_attrs[] = {
+	&lru_gen_min_ttl_attr.attr,
 	&lru_gen_enabled_attr.attr,
 	NULL
 };
@@ -5175,12 +5232,16 @@ static struct attribute_group lru_gen_attr_group = {
 
 void lru_gen_init_lruvec(struct lruvec *lruvec)
 {
+	int i;
 	int gen, type, zone;
 	struct lru_gen_struct *lrugen = &lruvec->lrugen;
 
 	lrugen->max_seq = MIN_NR_GENS + 1;
 	lrugen->enabled = lru_gen_enabled();
 
+	for (i = 0; i <= MIN_NR_GENS + 1; i++)
+		lrugen->timestamps[i] = jiffies;
+
 	for_each_gen_type_zone(gen, type, zone)
 		INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
 
-- 
2.35.1.616.g0bdcbb4464-goog


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v9 12/14] mm: multi-gen LRU: debugfs interface
  2022-03-09  2:12 ` Yu Zhao
@ 2022-03-09  2:12   ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-09  2:12 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
	Jonathan Corbet, Matthew Wilcox, Mel Gorman, Michael Larabel,
	Michal Hocko, Mike Rapoport, Rik van Riel, Vlastimil Babka,
	Will Deacon, Ying Huang, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, page-reclaim, x86, Yu Zhao, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

Add /sys/kernel/debug/lru_gen for working set estimation and proactive
reclaim. These features are required to optimize job scheduling (bin
packing) in data centers [1][2].

Compared with the page table-based approach and the PFN-based
approach, e.g., mm/damon/[vp]addr.c, this lruvec-based approach has
the following advantages:
1. It offers better choices because it is aware of memcgs, NUMA nodes,
   shared mappings and unmapped page cache.
2. It is more scalable because it is O(nr_hot_pages), whereas the
   PFN-based approach is O(nr_total_pages).

Add /sys/kernel/debug/lru_gen_full for debugging.
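
A user-space tool could read one generation line of
/sys/kernel/debug/lru_gen along the lines of the sketch below. It
assumes the layout produced by the seq_printf() calls in
lru_gen_seq_show() in this patch: sequence number, age in
milliseconds, then the anon and file sizes, with "-0" printed for a
type whose generation has already been evicted. struct gen_line and
parse_gen_line() are hypothetical names, and the extra lines emitted
by lru_gen_full are not handled.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct gen_line {
	unsigned long seq;		/* generation sequence number */
	unsigned int age_ms;		/* time since the generation was born */
	unsigned long nr_pages[2];	/* [0] anon, [1] file */
	bool evicted[2];		/* true where the kernel printed "-0" */
};

/* Returns false for "memcg"/"node" header lines and malformed input. */
static bool parse_gen_line(const char *line, struct gen_line *out)
{
	char tok[2][32];

	if (sscanf(line, "%lu %u %31s %31s",
		   &out->seq, &out->age_ms, tok[0], tok[1]) != 4)
		return false;

	for (int i = 0; i < 2; i++) {
		char *end = tok[i];

		out->evicted[i] = !strcmp(tok[i], "-0");
		out->nr_pages[i] = 0;
		if (out->evicted[i])
			continue;

		out->nr_pages[i] = strtoul(tok[i], &end, 10);
		if (end == tok[i] || *end)
			return false;	/* not a plain decimal number */
	}

	return true;
}
```

Summing nr_pages over the youngest generations of a memcg gives a
simple working set estimate, which is the use case the commit message
describes.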

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 include/linux/nodemask.h |   1 +
 mm/vmscan.c              | 353 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 354 insertions(+)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 567c3ddba2c4..90840c459abc 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -486,6 +486,7 @@ static inline int num_node_state(enum node_states state)
 #define first_online_node	0
 #define first_memory_node	0
 #define next_online_node(nid)	(MAX_NUMNODES)
+#define next_memory_node(nid)	(MAX_NUMNODES)
 #define nr_node_ids		1U
 #define nr_online_nodes		1U
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6aa083b8bb26..8f8f9ac2cd2c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -52,6 +52,8 @@
 #include <linux/psi.h>
 #include <linux/pagewalk.h>
 #include <linux/shmem_fs.h>
+#include <linux/ctype.h>
+#include <linux/debugfs.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -5226,6 +5228,354 @@ static struct attribute_group lru_gen_attr_group = {
 	.attrs = lru_gen_attrs,
 };
 
+/******************************************************************************
+ *                          debugfs interface
+ ******************************************************************************/
+
+static void *lru_gen_seq_start(struct seq_file *m, loff_t *pos)
+{
+	struct mem_cgroup *memcg;
+	loff_t nr_to_skip = *pos;
+
+	m->private = kvmalloc(PATH_MAX, GFP_KERNEL);
+	if (!m->private)
+		return ERR_PTR(-ENOMEM);
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		int nid;
+
+		for_each_node_state(nid, N_MEMORY) {
+			if (!nr_to_skip--)
+				return get_lruvec(memcg, nid);
+		}
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+	return NULL;
+}
+
+static void lru_gen_seq_stop(struct seq_file *m, void *v)
+{
+	if (!IS_ERR_OR_NULL(v))
+		mem_cgroup_iter_break(NULL, lruvec_memcg(v));
+
+	kvfree(m->private);
+	m->private = NULL;
+}
+
+static void *lru_gen_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	int nid = lruvec_pgdat(v)->node_id;
+	struct mem_cgroup *memcg = lruvec_memcg(v);
+
+	++*pos;
+
+	nid = next_memory_node(nid);
+	if (nid == MAX_NUMNODES) {
+		memcg = mem_cgroup_iter(NULL, memcg, NULL);
+		if (!memcg)
+			return NULL;
+
+		nid = first_memory_node;
+	}
+
+	return get_lruvec(memcg, nid);
+}
+
+static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec,
+				  unsigned long max_seq, unsigned long *min_seq,
+				  unsigned long seq)
+{
+	int i;
+	int type, tier;
+	int hist = lru_hist_from_seq(seq);
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	for (tier = 0; tier < MAX_NR_TIERS; tier++) {
+		seq_printf(m, "            %10d", tier);
+		for (type = 0; type < ANON_AND_FILE; type++) {
+			unsigned long n[3] = {};
+
+			if (seq == max_seq) {
+				n[0] = READ_ONCE(lrugen->avg_refaulted[type][tier]);
+				n[1] = READ_ONCE(lrugen->avg_total[type][tier]);
+
+				seq_printf(m, " %10luR %10luT %10lu ", n[0], n[1], n[2]);
+			} else if (seq == min_seq[type] || NR_HIST_GENS > 1) {
+				n[0] = atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+				n[1] = atomic_long_read(&lrugen->evicted[hist][type][tier]);
+				if (tier)
+					n[2] = READ_ONCE(lrugen->protected[hist][type][tier - 1]);
+
+				seq_printf(m, " %10lur %10lue %10lup", n[0], n[1], n[2]);
+			} else
+				seq_puts(m, "          0           0           0 ");
+		}
+		seq_putc(m, '\n');
+	}
+
+	seq_puts(m, "                      ");
+	for (i = 0; i < NR_MM_STATS; i++) {
+		if (seq == max_seq && NR_HIST_GENS == 1)
+			seq_printf(m, " %10lu%c", READ_ONCE(lruvec->mm_state.stats[hist][i]),
+				   toupper(MM_STAT_CODES[i]));
+		else if (seq != max_seq && NR_HIST_GENS > 1)
+			seq_printf(m, " %10lu%c", READ_ONCE(lruvec->mm_state.stats[hist][i]),
+				   MM_STAT_CODES[i]);
+		else
+			seq_puts(m, "          0 ");
+	}
+	seq_putc(m, '\n');
+}
+
+static int lru_gen_seq_show(struct seq_file *m, void *v)
+{
+	unsigned long seq;
+	bool full = !debugfs_real_fops(m->file)->write;
+	struct lruvec *lruvec = v;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	int nid = lruvec_pgdat(lruvec)->node_id;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	if (nid == first_memory_node) {
+		const char *path = memcg ? m->private : "";
+
+#ifdef CONFIG_MEMCG
+		if (memcg)
+			cgroup_path(memcg->css.cgroup, m->private, PATH_MAX);
+#endif
+		seq_printf(m, "memcg %5hu %s\n", mem_cgroup_id(memcg), path);
+	}
+
+	seq_printf(m, " node %5d\n", nid);
+
+	if (!full)
+		seq = min_seq[LRU_GEN_ANON];
+	else if (max_seq >= MAX_NR_GENS)
+		seq = max_seq - MAX_NR_GENS + 1;
+	else
+		seq = 0;
+
+	for (; seq <= max_seq; seq++) {
+		int gen, type, zone;
+		unsigned int msecs;
+
+		gen = lru_gen_from_seq(seq);
+		msecs = jiffies_to_msecs(jiffies - READ_ONCE(lrugen->timestamps[gen]));
+
+		seq_printf(m, " %10lu %10u", seq, msecs);
+
+		for (type = 0; type < ANON_AND_FILE; type++) {
+			long size = 0;
+
+			if (seq < min_seq[type]) {
+				seq_puts(m, "         -0 ");
+				continue;
+			}
+
+			for (zone = 0; zone < MAX_NR_ZONES; zone++)
+				size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
+
+			seq_printf(m, " %10lu ", max(size, 0L));
+		}
+
+		seq_putc(m, '\n');
+
+		if (full)
+			lru_gen_seq_show_full(m, lruvec, max_seq, min_seq, seq);
+	}
+
+	return 0;
+}
+
+static const struct seq_operations lru_gen_seq_ops = {
+	.start = lru_gen_seq_start,
+	.stop = lru_gen_seq_stop,
+	.next = lru_gen_seq_next,
+	.show = lru_gen_seq_show,
+};
+
+static int run_aging(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
+		     bool can_swap, bool full_scan)
+{
+	DEFINE_MAX_SEQ(lruvec);
+
+	if (seq == max_seq)
+		try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, full_scan);
+
+	return seq > max_seq ? -EINVAL : 0;
+}
+
+static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
+			int swappiness, unsigned long nr_to_reclaim)
+{
+	struct blk_plug plug;
+	int err = -EINTR;
+	DEFINE_MAX_SEQ(lruvec);
+
+	if (seq + MIN_NR_GENS > max_seq)
+		return -EINVAL;
+
+	sc->nr_reclaimed = 0;
+
+	blk_start_plug(&plug);
+
+	while (!signal_pending(current)) {
+		DEFINE_MIN_SEQ(lruvec);
+
+		if (seq < min_seq[!swappiness] || sc->nr_reclaimed >= nr_to_reclaim ||
+		    !evict_folios(lruvec, sc, swappiness, NULL)) {
+			err = 0;
+			break;
+		}
+
+		cond_resched();
+	}
+
+	blk_finish_plug(&plug);
+
+	return err;
+}
+
+static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq,
+		   struct scan_control *sc, int swappiness, unsigned long opt)
+{
+	struct lruvec *lruvec;
+	int err = -EINVAL;
+	struct mem_cgroup *memcg = NULL;
+
+	if (!mem_cgroup_disabled()) {
+		rcu_read_lock();
+		memcg = mem_cgroup_from_id(memcg_id);
+#ifdef CONFIG_MEMCG
+		if (memcg && !css_tryget(&memcg->css))
+			memcg = NULL;
+#endif
+		rcu_read_unlock();
+
+		if (!memcg)
+			goto done;
+	}
+	if (memcg_id != mem_cgroup_id(memcg))
+		goto done;
+
+	if (nid < 0 || nid >= MAX_NUMNODES || !node_state(nid, N_MEMORY))
+		goto done;
+
+	lruvec = get_lruvec(memcg, nid);
+
+	if (swappiness < 0)
+		swappiness = get_swappiness(lruvec, sc);
+	else if (swappiness > 200)
+		goto done;
+
+	switch (cmd) {
+	case '+':
+		err = run_aging(lruvec, seq, sc, swappiness, opt);
+		break;
+	case '-':
+		err = run_eviction(lruvec, seq, sc, swappiness, opt);
+		break;
+	}
+done:
+	mem_cgroup_put(memcg);
+
+	return err;
+}
+
+static ssize_t lru_gen_seq_write(struct file *file, const char __user *src,
+				 size_t len, loff_t *pos)
+{
+	void *buf;
+	char *cur, *next;
+	unsigned int flags;
+	int err = 0;
+	struct scan_control sc = {
+		.may_writepage = true,
+		.may_unmap = true,
+		.may_swap = true,
+		.reclaim_idx = MAX_NR_ZONES - 1,
+		.gfp_mask = GFP_KERNEL,
+	};
+
+	buf = kvmalloc(len + 1, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	if (copy_from_user(buf, src, len)) {
+		kvfree(buf);
+		return -EFAULT;
+	}
+
+	next = buf;
+	next[len] = '\0';
+
+	sc.reclaim_state.mm_walk = alloc_mm_walk();
+	if (!sc.reclaim_state.mm_walk) {
+		kvfree(buf);
+		return -ENOMEM;
+	}
+
+	set_task_reclaim_state(current, &sc.reclaim_state);
+	flags = memalloc_noreclaim_save();
+
+	while ((cur = strsep(&next, ",;\n"))) {
+		int n;
+		int end;
+		char cmd;
+		unsigned int memcg_id;
+		unsigned int nid;
+		unsigned long seq;
+		unsigned int swappiness = -1;
+		unsigned long opt = -1;
+
+		cur = skip_spaces(cur);
+		if (!*cur)
+			continue;
+
+		n = sscanf(cur, "%c %u %u %lu %n %u %n %lu %n", &cmd, &memcg_id, &nid,
+			   &seq, &end, &swappiness, &end, &opt, &end);
+		if (n < 4 || cur[end]) {
+			err = -EINVAL;
+			break;
+		}
+
+		err = run_cmd(cmd, memcg_id, nid, seq, &sc, swappiness, opt);
+		if (err)
+			break;
+	}
+
+	memalloc_noreclaim_restore(flags);
+	set_task_reclaim_state(current, NULL);
+
+	free_mm_walk(sc.reclaim_state.mm_walk);
+	kvfree(buf);
+
+	return err ? : len;
+}
+
+static int lru_gen_seq_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &lru_gen_seq_ops);
+}
+
+static const struct file_operations lru_gen_rw_fops = {
+	.open = lru_gen_seq_open,
+	.read = seq_read,
+	.write = lru_gen_seq_write,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
+static const struct file_operations lru_gen_ro_fops = {
+	.open = lru_gen_seq_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
 /******************************************************************************
  *                          initialization
  ******************************************************************************/
@@ -5284,6 +5634,9 @@ static int __init init_lru_gen(void)
 	if (sysfs_create_group(mm_kobj, &lru_gen_attr_group))
 		pr_err("lru_gen: failed to create sysfs group\n");
 
+	debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_rw_fops);
+	debugfs_create_file("lru_gen_full", 0444, NULL, NULL, &lru_gen_ro_fops);
+
 	return 0;
 };
 late_initcall(init_lru_gen);
-- 
2.35.1.616.g0bdcbb4464-goog
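
Not part of the patch: for readers who want to consume this file from
userspace, the text produced by lru_gen_seq_show() above could be parsed
roughly as follows. This is a minimal sketch under stated assumptions. It
assumes the non-full lru_gen layout (a "memcg" line, a "node" line, then one
line per generation holding seq, age in ms, anon pages and file pages). The
names Generation, parse_lru_gen and cold_pages, as well as the sample text,
are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical sample in the layout printed by lru_gen_seq_show().
SAMPLE = """\
memcg     1 /user.slice
 node     0
          4      31400       1200         -0
          5      20100        800       5000
          6       9800        300       2500
          7        120         50        400
"""


@dataclass
class Generation:
    seq: int              # generation number
    age_ms: int           # milliseconds since the generation was created
    anon: Optional[int]   # anon pages, or None when printed as "-0"
    file: Optional[int]   # file pages, or None when printed as "-0"


def _pages(token: str) -> Optional[int]:
    # "-0" is printed when seq is below min_seq of that page type.
    return None if token == "-0" else int(token)


def parse_lru_gen(text: str) -> Dict[Tuple[int, int], List[Generation]]:
    """Map (memcg_id, node_id) to its generations, oldest first."""
    result: Dict[Tuple[int, int], List[Generation]] = {}
    memcg_id = node_id = -1
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "memcg":
            memcg_id = int(fields[1])   # the rest of the line is the path
        elif fields[0] == "node":
            node_id = int(fields[1])
            result[(memcg_id, node_id)] = []
        else:
            seq, age_ms, anon, file = fields
            result[(memcg_id, node_id)].append(
                Generation(int(seq), int(age_ms), _pages(anon), _pages(file)))
    return result


def cold_pages(gens: List[Generation], min_age_ms: int) -> int:
    """Sum the pages in generations at least min_age_ms old."""
    return sum((g.anon or 0) + (g.file or 0)
               for g in gens if g.age_ms >= min_age_ms)
```

A job scheduler could call cold_pages() on the generations of one memcg and
node to rank servers by cold memory; the sketch does no error handling for
lines that do not match the assumed layout.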


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v9 13/14] mm: multi-gen LRU: admin guide
@ 2022-03-09  2:12   ` Yu Zhao
  0 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-09  2:12 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
	Jonathan Corbet, Matthew Wilcox, Mel Gorman, Michael Larabel,
	Michal Hocko, Mike Rapoport, Rik van Riel, Vlastimil Babka,
	Will Deacon, Ying Huang, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, page-reclaim, x86, Yu Zhao, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

Add an admin guide.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
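Not part of the commit: the values accepted by the kill switch in the guide
below combine as a bitmask, so the 0x0005 in its example means the main
switch plus the non-leaf component. A small sketch of that decoding, assuming
only the three bit values from the guide's table; the class and member names
are labels chosen here, not identifiers taken from the patch.

```python
from enum import IntFlag


class LruGenComponent(IntFlag):
    MAIN_SWITCH = 0x0001   # the multi-gen LRU itself
    LEAF_BATCH = 0x0002    # batched accessed-bit clearing in leaf entries
    NONLEAF = 0x0004       # accessed-bit clearing in non-leaf entries too


def decode_enabled(text: str) -> LruGenComponent:
    """Decode the hex string read back from the kill switch, e.g. '0x0005'."""
    return LruGenComponent(int(text.strip(), 16))
```

For example, decode_enabled("0x0005") contains MAIN_SWITCH and NONLEAF but
not LEAF_BATCH.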
 Documentation/admin-guide/mm/index.rst        |   1 +
 Documentation/admin-guide/mm/multigen_lru.rst | 146 ++++++++++++++++++
 mm/Kconfig                                    |   3 +-
 3 files changed, 149 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
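
Not part of the commit: the two debugfs commands described in the guide below
are positional strings, which makes them easy to get subtly wrong. The
following is a minimal sketch of helpers that build them, assuming the syntax
given in the guide and the swappiness upper bound of 200 checked by run_cmd()
in patch 12/14. The function names are hypothetical.

```python
from typing import Optional


def aging_cmd(memcg_id: int, node_id: int, max_gen_nr: int,
              can_swap: Optional[bool] = None,
              full_scan: Optional[bool] = None) -> str:
    """Build '+ memcg_id node_id max_gen_nr [can_swap [full_scan]]'."""
    if full_scan is not None and can_swap is None:
        # The fields are positional, so full_scan cannot be given alone.
        raise ValueError("full_scan requires can_swap")
    fields = ["+", memcg_id, node_id, max_gen_nr]
    fields += [int(v) for v in (can_swap, full_scan) if v is not None]
    return " ".join(map(str, fields))


def eviction_cmd(memcg_id: int, node_id: int, min_gen_nr: int,
                 max_gen_nr: int, swappiness: Optional[int] = None,
                 nr_to_reclaim: Optional[int] = None) -> str:
    """Build '- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]'.

    max_gen_nr is only used for the sanity check; it is not sent.
    """
    if min_gen_nr >= max_gen_nr - 1:
        raise ValueError("the two youngest generations cannot be evicted")
    if nr_to_reclaim is not None and swappiness is None:
        raise ValueError("nr_to_reclaim requires swappiness")
    if swappiness is not None and not 0 <= swappiness <= 200:
        raise ValueError("swappiness must be within 0..200")
    fields = ["-", memcg_id, node_id, min_gen_nr]
    fields += [v for v in (swappiness, nr_to_reclaim) if v is not None]
    return " ".join(map(str, fields))
```

The resulting strings can be joined with ";" or newlines and written to
/sys/kernel/debug/lru_gen by a privileged process. The checks here only
mirror the documented constraints; the kernel remains the authority and may
still return -EINVAL.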

diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index c21b5823f126..2cf5bae62036 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -32,6 +32,7 @@ the Linux memory management.
    idle_page_tracking
    ksm
    memory-hotplug
+   multigen_lru
    nommu-mmap
    numa_memory_policy
    numaperf
diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
new file mode 100644
index 000000000000..4ea6a801dc56
--- /dev/null
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -0,0 +1,146 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Multi-Gen LRU
+=============
+Quick start
+===========
+Build the kernel with the following configurations.
+
+* ``CONFIG_LRU_GEN=y``
+* ``CONFIG_LRU_GEN_ENABLED=y``
+
+All set!
+
+Runtime options
+===============
+``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
+following subsections.
+
+Kill switch
+-----------
+``enabled`` accepts different values to enable or disable the
+following components. The default value of this file depends on
+``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
+unless some of them have unforeseen side effects. Writing to
+``enabled`` has no effect when a component is not supported by the
+hardware, and valid values will be accepted even when the main switch
+is off.
+
+====== ===============================================================
+Values Components
+====== ===============================================================
+0x0001 The main switch for the multi-gen LRU.
+0x0002 Clearing the accessed bit in leaf page table entries in large
+       batches, when MMU sets it (e.g., on x86). This behavior can
+       theoretically worsen lock contention (mmap_lock). If it is
+       disabled, the multi-gen LRU will suffer a minor performance
+       degradation.
+0x0004 Clearing the accessed bit in non-leaf page table entries as
+       well, when MMU sets it (e.g., on x86). This behavior was not
+       verified on x86 varieties other than Intel and AMD. If it is
+       disabled, the multi-gen LRU will suffer a negligible
+       performance degradation.
+[yYnN] Apply to all the components above.
+====== ===============================================================
+
+E.g.,
+::
+
+    echo y >/sys/kernel/mm/lru_gen/enabled
+    cat /sys/kernel/mm/lru_gen/enabled
+    0x0007
+    echo 5 >/sys/kernel/mm/lru_gen/enabled
+    cat /sys/kernel/mm/lru_gen/enabled
+    0x0005
+
+Thrashing prevention
+--------------------
+Personal computers are more sensitive to thrashing because it can
+cause janks (lags when rendering UI) and negatively impact user
+experience. The multi-gen LRU offers thrashing prevention to the
+majority of laptop and desktop users who do not have ``oomd``.
+
+Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
+``N`` milliseconds from getting evicted. The OOM killer is triggered
+if this working set cannot be kept in memory. In other words, this
+option works as an adjustable pressure relief valve, and when open, it
+terminates applications that are hopefully not being used.
+
+Based on the average human detectable lag (~100ms), ``N=1000`` usually
+eliminates intolerable janks due to thrashing. Larger values like
+``N=3000`` make janks less noticeable at the risk of premature OOM
+kills.
+
+Experimental features
+=====================
+``/sys/kernel/debug/lru_gen`` accepts commands described in the
+following subsections. Multiple command lines are supported, as is
+concatenation with the delimiters ``,`` and ``;``.
+
+``/sys/kernel/debug/lru_gen_full`` provides additional stats for
+debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
+evicted generations in this file.
+
+Working set estimation
+----------------------
+Working set estimation measures how much memory an application
+requires in a given time interval, and it is usually done with little
+impact on the performance of the application. E.g., data centers want
+to optimize job scheduling (bin packing) to improve memory
+utilization. When a new job comes in, the job scheduler needs to find
+out whether each server it manages can allocate a certain amount of
+memory for this new job before it can pick a candidate. To do so, this
+job scheduler needs to estimate the working sets of the existing jobs.
+
+When it is read, ``lru_gen`` returns a histogram of numbers of pages
+accessed over different time intervals for each memcg and node.
+``MAX_NR_GENS`` decides the number of bins for each histogram.
+::
+
+    memcg  memcg_id  memcg_path
+       node  node_id
+           min_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
+           ...
+           max_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
+
+Each generation contains an estimated number of pages that have been
+accessed within ``age_in_ms`` non-cumulatively. E.g., ``min_gen_nr``
+contains the coldest pages and ``max_gen_nr`` contains the hottest
+pages, since ``age_in_ms`` of the former is the largest and that of
+the latter is the smallest.
+
+Users can write ``+ memcg_id node_id max_gen_nr
+[can_swap [full_scan]]`` to ``lru_gen`` to create a new generation
+``max_gen_nr+1``. ``can_swap`` defaults to the swap setting and, if it
+is set to ``1``, it forces the scan of anon pages when swap is off.
+``full_scan`` defaults to ``1`` and, if it is set to ``0``, it reduces
+the overhead as well as the coverage when scanning page tables.
+
+A typical use case is that a job scheduler writes to ``lru_gen`` at a
+certain time interval to create new generations, and it ranks the
+servers it manages based on the sizes of their cold memory defined by
+this time interval.
+
+Proactive reclaim
+-----------------
+Proactive reclaim induces memory reclaim when there is no memory
+pressure and usually targets cold memory only. E.g., when a new job
+comes in, the job scheduler wants to proactively reclaim memory on the
+server it has selected to improve the chance of successfully landing
+this new job.
+
+Users can write ``- memcg_id node_id min_gen_nr [swappiness
+[nr_to_reclaim]]`` to ``lru_gen`` to evict generations less than or
+equal to ``min_gen_nr``. Note that ``min_gen_nr`` should be less than
+``max_gen_nr-1`` as ``max_gen_nr`` and ``max_gen_nr-1`` are not fully
+aged and therefore cannot be evicted. ``swappiness`` overrides the
+default value in ``/proc/sys/vm/swappiness``. ``nr_to_reclaim`` limits
+the number of pages to evict.
+
+A typical use case is that a job scheduler writes to ``lru_gen``
+before it tries to land a new job on a server, and if it fails to
+materialize the cold memory without impacting the existing jobs on
+this server, it retries on the next server according to the ranking
+result obtained from the working set estimation step described
+earlier.
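
The ``+`` and ``-`` command formats described in the admin guide above
can be assembled by a small userspace helper. The following is a
minimal sketch and not part of this patch: the function names are
hypothetical, the optional ``can_swap`` and ``full_scan`` fields are
left out so that their documented defaults apply, and the caller is
assumed to have read ``max_gen_nr`` from ``lru_gen`` beforehand. The
evict helper rejects any ``min_gen_nr`` that is not less than
``max_gen_nr-1``, and commands are joined with ``;``.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format "+ memcg_id node_id max_gen_nr". */
int lru_gen_age_cmd(char *buf, size_t size, unsigned int memcg_id,
		    unsigned int node_id, unsigned long max_gen_nr)
{
	int n;

	assert(buf && size);
	n = snprintf(buf, size, "+ %u %u %lu", memcg_id, node_id, max_gen_nr);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Hypothetical helper: format "- memcg_id node_id min_gen_nr swappiness nr_to_reclaim". */
int lru_gen_evict_cmd(char *buf, size_t size, unsigned int memcg_id,
		      unsigned int node_id, unsigned long min_gen_nr,
		      unsigned long max_gen_nr, unsigned int swappiness,
		      unsigned long nr_to_reclaim)
{
	int n;

	assert(buf && size);
	/* max_gen_nr and max_gen_nr-1 are not fully aged: require min_gen_nr < max_gen_nr-1. */
	if (min_gen_nr + 1 >= max_gen_nr)
		return -1;
	n = snprintf(buf, size, "- %u %u %lu %u %lu", memcg_id, node_id,
		     min_gen_nr, swappiness, nr_to_reclaim);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Append cmd to a NUL-terminated batch, using ';' as the delimiter. */
int lru_gen_append_cmd(char *batch, size_t size, const char *cmd)
{
	size_t used;
	int n;

	assert(batch && cmd && size);
	used = strlen(batch);
	if (used >= size)
		return -1;
	n = snprintf(batch + used, size - used, "%s%s", used ? ";" : "", cmd);
	if (n < 0 || (size_t)n >= size - used) {
		batch[used] = '\0';	/* undo a truncated append */
		return -1;
	}
	return 0;
}
```

The resulting string would then be written to
``/sys/kernel/debug/lru_gen`` in a single ``write()``; error handling
for that step is left to the caller.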
diff --git a/mm/Kconfig b/mm/Kconfig
index 050de1eae2d6..7fd84e0384dc 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -899,7 +899,8 @@ config LRU_GEN
 	# the following options can use up the spare bits in page flags
 	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
 	help
-	  A high performance LRU implementation for memory overcommit.
+	  A high performance LRU implementation for memory overcommit. See
+	  Documentation/admin-guide/mm/multigen_lru.rst for details.
 
 config LRU_GEN_ENABLED
 	bool "Enable by default"
-- 
2.35.1.616.g0bdcbb4464-goog
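
As a rough illustration of the working set estimation use case in the
admin guide above, the per-generation lines read from ``lru_gen`` can
be folded into a cumulative figure. This is a sketch under stated
assumptions, not code from the patch: ``struct lru_gen_bin`` and both
function names are hypothetical, and the parser only handles the
four-column generation lines (``memcg`` and ``node`` lines are
rejected). Because each generation is reported non-cumulatively, the
pages known to have been accessed within a given interval are summed
over every generation whose ``age_in_ms`` does not exceed that
interval.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical in-memory form of one histogram bin. */
struct lru_gen_bin {
	unsigned long gen_nr;
	unsigned long age_in_ms;
	unsigned long nr_anon_pages;
	unsigned long nr_file_pages;
};

/* Parse one "gen_nr age_in_ms nr_anon_pages nr_file_pages" line; 0 on success. */
int lru_gen_parse_bin(const char *line, struct lru_gen_bin *bin)
{
	assert(line && bin);
	return sscanf(line, "%lu %lu %lu %lu", &bin->gen_nr, &bin->age_in_ms,
		      &bin->nr_anon_pages, &bin->nr_file_pages) == 4 ? 0 : -1;
}

/* Pages accessed within the last interval_ms, summed over young enough bins. */
unsigned long lru_gen_working_set(const struct lru_gen_bin *bins, size_t nr_bins,
				  unsigned long interval_ms)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < nr_bins; i++)
		if (bins[i].age_in_ms <= interval_ms)
			total += bins[i].nr_anon_pages + bins[i].nr_file_pages;
	return total;
}
```

The estimate is conservative: a generation that only partly falls
inside the interval is excluded as a whole, so finer estimates require
creating generations at shorter intervals.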


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v9 14/14] mm: multi-gen LRU: design doc
  2022-03-09  2:12 ` Yu Zhao
@ 2022-03-09  2:12   ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-09  2:12 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
	Jonathan Corbet, Matthew Wilcox, Mel Gorman, Michael Larabel,
	Michal Hocko, Mike Rapoport, Rik van Riel, Vlastimil Babka,
	Will Deacon, Ying Huang, linux-arm-kernel, linux-doc,
	linux-kernel, linux-mm, page-reclaim, x86, Yu Zhao, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

Add a design doc.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 Documentation/vm/index.rst        |   1 +
 Documentation/vm/multigen_lru.rst | 156 ++++++++++++++++++++++++++++++
 2 files changed, 157 insertions(+)
 create mode 100644 Documentation/vm/multigen_lru.rst

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index 44365c4574a3..b48434300226 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -25,6 +25,7 @@ algorithms.  If you are looking for advice on simply allocating memory, see the
    ksm
    memory-model
    mmu_notifier
+   multigen_lru
    numa
    overcommit-accounting
    page_migration
diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
new file mode 100644
index 000000000000..cde60de16621
--- /dev/null
+++ b/Documentation/vm/multigen_lru.rst
@@ -0,0 +1,156 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Multi-Gen LRU
+=============
+
+Design overview
+===============
+Objectives
+----------
+The design objectives are:
+
+* Good representation of access recency
+* Try to profit from spatial locality
+* Fast paths to make obvious choices
+* Simple self-correcting heuristics
+
+The representation of access recency is at the core of all LRU
+implementations. In the multi-gen LRU, each generation represents a
+group of pages with similar access recency. Generations establish a
+common frame of reference and therefore help make better choices,
+e.g., between different memcgs on a computer or different computers in
+a data center (for job scheduling).
+
+Exploiting spatial locality improves efficiency when gathering the
+accessed bit. An rmap walk targets a single page and does not try to
+profit from discovering a young PTE. A page table walk can sweep all
+the young PTEs in an address space, but the address space can be too
+large to make a profit. The key is to optimize both methods and use
+them in combination.
+
+Fast paths reduce code complexity and runtime overhead. Unmapped pages
+do not require TLB flushes; clean pages do not require writeback.
+These facts are only helpful when other conditions, e.g., access
+recency, are similar. With generations as a common frame of reference,
+additional factors stand out. But obvious choices might not be good
+choices; thus self-correction is required.
+
+The benefits of simple self-correcting heuristics are self-evident.
+Again, with generations as a common frame of reference, this becomes
+attainable. Specifically, pages in the same generation can be
+categorized based on additional factors, and a feedback loop can
+statistically compare the refault percentages across those categories
+and infer which of them are better choices.
+
+Assumptions
+-----------
+The protection of hot pages and the selection of cold pages are based
+on page access channels and patterns. There are two access channels:
+
+* Accesses through page tables
+* Accesses through file descriptors
+
+The protection of the former channel is by design stronger because:
+
+1. The uncertainty in determining the access patterns of the former
+   channel is higher due to the approximation of the accessed bit.
+2. The cost of evicting the former channel is higher due to the TLB
+   flushes required and the likelihood of encountering the dirty bit.
+3. The penalty of underprotecting the former channel is higher because
+   applications usually do not prepare themselves for major page
+   faults like they do for blocked I/O. E.g., GUI applications
+   commonly use dedicated I/O threads to avoid blocking the rendering
+   threads.
+
+There are also two access patterns:
+
+* Accesses exhibiting temporal locality
+* Accesses not exhibiting temporal locality
+
+For the reasons listed above, the former channel is assumed to follow
+the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is
+present, and the latter channel is assumed to follow the latter
+pattern unless outlying refaults have been observed.
+
+Workflow overview
+=================
+Evictable pages are divided into multiple generations for each
+``lruvec``. The youngest generation number is stored in
+``lrugen->max_seq`` for both anon and file types as they are aged on
+an equal footing. The oldest generation numbers are stored in
+``lrugen->min_seq[]`` separately for anon and file types as clean file
+pages can be evicted regardless of swap constraints. These three
+variables are monotonically increasing.
+
+Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)``
+bits in order to fit into the gen counter in ``folio->flags``. Each
+truncated generation number is an index to ``lrugen->lists[]``. The
+sliding window technique is used to track at least ``MIN_NR_GENS`` and
+at most ``MAX_NR_GENS`` generations. The gen counter stores a value
+within ``[1, MAX_NR_GENS]`` while a page is on one of
+``lrugen->lists[]``; otherwise it stores zero.
+
+Each generation is divided into multiple tiers. Tiers represent
+different ranges of numbers of accesses through file descriptors. A
+page accessed ``N`` times through file descriptors is in tier
+``order_base_2(N)``. In contrast to moving across generations, which
+requires the LRU lock, moving across tiers only requires operations on
+``folio->flags`` and therefore has a negligible cost. A feedback loop
+modeled after the PID controller monitors refaults over all the tiers
+from anon and file types and decides which tiers from which types to
+evict or protect.
+
+There are two conceptually independent procedures: the aging and the
+eviction. They form a closed-loop system, i.e., the page reclaim.
+
+Aging
+-----
+The aging produces young generations. Given an ``lruvec``, it
+increments ``max_seq`` when ``max_seq-min_seq+1`` approaches
+``MIN_NR_GENS``. The aging promotes hot pages to the youngest
+generation when it finds them accessed through page tables; the
+demotion of cold pages happens consequently when it increments
+``max_seq``. The aging uses page table walks and rmap walks to find
+young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list``
+and calls ``walk_page_range()`` with each ``mm_struct`` on this list
+to scan PTEs. On finding a young PTE, it clears the accessed bit and
+updates the gen counter of the page mapped by this PTE to
+``(max_seq%MAX_NR_GENS)+1``. After each iteration of this list, it
+increments ``max_seq``. For the latter, when the eviction walks the
+rmap and finds a young PTE, the aging scans the adjacent PTEs and
+follows the same steps just described.
+
+Eviction
+--------
+The eviction consumes old generations. Given an ``lruvec``, it
+increments ``min_seq`` when ``lrugen->lists[]`` indexed by
+``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to
+evict from, it first compares ``min_seq[]`` to select the older type.
+If both types are equally old, it selects the one whose first tier has
+a lower refault percentage. The first tier contains single-use
+unmapped clean pages, which are the best bet. The eviction sorts a
+page according to the gen counter if the aging has found this page
+accessed through page tables and updated the gen counter. It also
+moves a page to the next generation, i.e., ``min_seq+1``, if this page
+was accessed multiple times through file descriptors and the feedback
+loop has detected outlying refaults from the tier this page is in. To
+do this, the feedback loop uses the first tier as the baseline, for
+the reason stated earlier.
+
+Summary
+-------
+The multi-gen LRU can be disassembled into the following parts:
+
+* Generations
+* Page table walks
+* Rmap walks
+* Bloom filters
+* The PID controller
+
+The aging and the eviction form a producer-consumer model; specifically,
+the latter drives the former by the sliding window over generations.
+Within the aging, rmap walks drive page table walks by inserting hot
+densely populated page tables into the Bloom filters. Within the
+eviction, the PID controller uses refaults as the feedback to select
+types to evict and tiers to protect.
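
The generation and tier arithmetic described under "Workflow overview"
in the design doc above can be checked with a few lines of userspace C.
This is only a sketch: the values ``MIN_NR_GENS=2`` and
``MAX_NR_GENS=4`` are assumptions (the design doc does not state
them), ``order_base_2_sketch()`` is a stand-in for the kernel's
``order_base_2()``, and the other helper names are hypothetical.

```c
#include <assert.h>

#define MIN_NR_GENS	2U	/* assumed value */
#define MAX_NR_GENS	4U	/* assumed value */

/* Stand-in for order_base_2(): ceil(log2(n)), with 0 for n <= 1. */
unsigned int order_base_2_sketch(unsigned long n)
{
	unsigned int order = 0;

	/* The result is the bit length of n-1. */
	for (n = n ? n - 1 : 0; n; n >>= 1)
		order++;
	return order;
}

/* Truncated generation number, i.e., the index into lrugen->lists[]. */
unsigned int gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}

/* Value kept in the gen counter while a page is on one of the lists. */
unsigned int gen_counter_from_seq(unsigned long seq)
{
	unsigned int counter = gen_from_seq(seq) + 1;

	/* Zero is reserved for pages that are not on any list. */
	assert(counter >= 1 && counter <= MAX_NR_GENS);
	return counter;
}

/* Number of bits the gen counter needs in folio->flags. */
unsigned int gen_counter_width(void)
{
	return order_base_2_sketch(MAX_NR_GENS + 1);
}

/* Tier of a page accessed nr_accesses times through file descriptors. */
unsigned int tier_from_accesses(unsigned long nr_accesses)
{
	return order_base_2_sketch(nr_accesses);
}
```

With the assumed constants, the gen counter needs three bits, and
pages accessed once, twice, or three to four times through file
descriptors land in tiers 0, 1 and 2 respectively.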
-- 
2.35.1.616.g0bdcbb4464-goog
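
The producer-consumer relationship between the aging and the eviction
summarized in the design doc above can be modeled as a sliding window
over sequence numbers. The sketch below is a simplified model, not the
implementation in this series: it tracks a single ``min_seq`` instead
of one per type, replaces the page lists with per-list counters, reads
"approaches ``MIN_NR_GENS``" as "is no more than ``MIN_NR_GENS``", and
assumes ``MIN_NR_GENS=2`` and ``MAX_NR_GENS=4``. All names are
hypothetical.

```c
#include <assert.h>

#define MIN_NR_GENS	2U	/* assumed value */
#define MAX_NR_GENS	4U	/* assumed value */

/* Simplified stand-in for the per-lruvec generation state. */
struct lruvec_sketch {
	unsigned long max_seq;			/* youngest generation */
	unsigned long min_seq;			/* oldest generation */
	unsigned long nr_pages[MAX_NR_GENS];	/* pages on each list */
};

/* Number of generations currently inside the window. */
unsigned long nr_gens(const struct lruvec_sketch *v)
{
	assert(v->max_seq >= v->min_seq);
	return v->max_seq - v->min_seq + 1;
}

/* The eviction asks for aging once the window has shrunk to MIN_NR_GENS. */
int aging_needed(const struct lruvec_sketch *v)
{
	return nr_gens(v) <= MIN_NR_GENS;
}

/* Aging: produce a new youngest generation if the window has room. */
int try_inc_max_seq(struct lruvec_sketch *v)
{
	if (nr_gens(v) >= MAX_NR_GENS)
		return 0;
	v->max_seq++;
	return 1;
}

/* Eviction: retire the oldest generation once its list is empty. */
int try_inc_min_seq(struct lruvec_sketch *v)
{
	if (nr_gens(v) <= MIN_NR_GENS)
		return 0;
	if (v->nr_pages[v->min_seq % MAX_NR_GENS])
		return 0;
	v->min_seq++;
	return 1;
}
```

Both sequence numbers only ever increase, and the window never holds
fewer than ``MIN_NR_GENS`` or more than ``MAX_NR_GENS`` generations,
which is what lets the truncated generation number index a fixed-size
array of lists.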


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 13/14] mm: multi-gen LRU: admin guide
  2022-03-09  2:12   ` Yu Zhao
@ 2022-03-10 12:29     ` Mike Rapoport
  -1 siblings, 0 replies; 120+ messages in thread
From: Mike Rapoport @ 2022-03-10 12:29 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Rik van Riel,
	Vlastimil Babka, Will Deacon, Ying Huang, linux-arm-kernel,
	linux-doc, linux-kernel, linux-mm, page-reclaim, x86,
	Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

Hi,

On Tue, Mar 08, 2022 at 07:12:30PM -0700, Yu Zhao wrote:
> Add an admin guide.
> 
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
>  Documentation/admin-guide/mm/index.rst        |   1 +
>  Documentation/admin-guide/mm/multigen_lru.rst | 146 ++++++++++++++++++
>  mm/Kconfig                                    |   3 +-
>  3 files changed, 149 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
> 
> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
> index c21b5823f126..2cf5bae62036 100644
> --- a/Documentation/admin-guide/mm/index.rst
> +++ b/Documentation/admin-guide/mm/index.rst
> @@ -32,6 +32,7 @@ the Linux memory management.
>     idle_page_tracking
>     ksm
>     memory-hotplug
> +   multigen_lru
>     nommu-mmap
>     numa_memory_policy
>     numaperf
> diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
> new file mode 100644
> index 000000000000..4ea6a801dc56
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/multigen_lru.rst
> @@ -0,0 +1,146 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=============
> +Multi-Gen LRU
> +=============

I'm still missing an opening paragraph that explains what Multi-gen LRU is
and why users would want it.

Something like 

  Multi-gen LRU is an efficient mechanism for page reclamation.

More details are of course welcome :)


> +Quick start
> +===========
> +Build the kernel with the following configurations.
> +
> +* ``CONFIG_LRU_GEN=y``
> +* ``CONFIG_LRU_GEN_ENABLED=y``
> +
> +All set!
> +
> +Runtime options
> +===============
> +``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
> +following subsections.
> +
> +Kill switch
> +-----------
> +``enable`` accepts different values to enable or disabled the

                                                   ^ disable

> +following components. The default value of this file depends on
> +``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
> +unless some of them have unforeseen side effects. Writing to
> +``enable`` has no effect when a component is not supported by the
> +hardware, and valid values will be accepted even when the main switch
> +is off.
> +
> +====== ===============================================================
> +Values Components
> +====== ===============================================================
> +0x0001 The main switch for the multi-gen LRU.
> +0x0002 Clearing the accessed bit in leaf page table entries in large
> +       batches, when MMU sets it (e.g., on x86). This behavior can
> +       theoretically worsen lock contention (mmap_lock). If it is
> +       disabled, the multi-gen LRU will suffer a minor performance
> +       degradation.
> +0x0004 Clearing the accessed bit in non-leaf page table entries as
> +       well, when MMU sets it (e.g., on x86). This behavior was not
> +       verified on x86 varieties other than Intel and AMD. If it is
> +       disabled, the multi-gen LRU will suffer a negligible
> +       performance degradation.
> +[yYnN] Apply to all the components above.
> +====== ===============================================================
> +
> +E.g.,
> +::
> +
> +    echo y >/sys/kernel/mm/lru_gen/enabled
> +    cat /sys/kernel/mm/lru_gen/enabled
> +    0x0007
> +    echo 5 >/sys/kernel/mm/lru_gen/enabled
> +    cat /sys/kernel/mm/lru_gen/enabled
> +    0x0005
> +
> +Thrashing prevention
> +--------------------
> +Personal computers are more sensitive to thrashing because it can
> +cause janks (lags when rendering UI) and negatively impact user
> +experience. The multi-gen LRU offers thrashing prevention to the
> +majority of laptop and desktop users who do not have ``oomd``.
> +
> +Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
> +``N`` milliseconds from getting evicted. The OOM killer is triggered
> +if this working set cannot be kept in memory. In other words, this
> +option works as an adjustable pressure relief valve, and when open, it
> +terminates applications that are hopefully not being used.
> +
> +Based on the average human detectable lag (~100ms), ``N=1000`` usually
> +eliminates intolerable janks due to thrashing. Larger values like
> +``N=3000`` make janks less noticeable at the risk of premature OOM
> +kills.

What is the default value of min_ttl_ms?

> +
> +Experimental features
> +=====================

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 13/14] mm: multi-gen LRU: admin guide
@ 2022-03-10 12:29     ` Mike Rapoport
  0 siblings, 0 replies; 120+ messages in thread
From: Mike Rapoport @ 2022-03-10 12:29 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Rik van Riel,
	Vlastimil Babka, Will Deacon, Ying Huang, linux-arm-kernel,
	linux-doc, linux-kernel, linux-mm, page-reclaim, x86,
	Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

Hi,

On Tue, Mar 08, 2022 at 07:12:30PM -0700, Yu Zhao wrote:
> Add an admin guide.
> 
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
>  Documentation/admin-guide/mm/index.rst        |   1 +
>  Documentation/admin-guide/mm/multigen_lru.rst | 146 ++++++++++++++++++
>  mm/Kconfig                                    |   3 +-
>  3 files changed, 149 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
> 
> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
> index c21b5823f126..2cf5bae62036 100644
> --- a/Documentation/admin-guide/mm/index.rst
> +++ b/Documentation/admin-guide/mm/index.rst
> @@ -32,6 +32,7 @@ the Linux memory management.
>     idle_page_tracking
>     ksm
>     memory-hotplug
> +   multigen_lru
>     nommu-mmap
>     numa_memory_policy
>     numaperf
> diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
> new file mode 100644
> index 000000000000..4ea6a801dc56
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/multigen_lru.rst
> @@ -0,0 +1,146 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=============
> +Multi-Gen LRU
> +=============

I'm still missing an opening paragraph the explains what is Multi-gen LRU
and why users would want it.

Something like 

  Multi-gen LRU is an efficient mechanism for page reclamation.

More details are of course welcome :)


> +Quick start
> +===========
> +Build the kernel with the following configurations.
> +
> +* ``CONFIG_LRU_GEN=y``
> +* ``CONFIG_LRU_GEN_ENABLED=y``
> +
> +All set!
> +
> +Runtime options
> +===============
> +``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
> +following subsections.
> +
> +Kill switch
> +-----------
> +``enable`` accepts different values to enable or disabled the

                                                   ^ disable

> +following components. The default value of this file depends on
> +``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
> +unless some of them have unforeseen side effects. Writing to
> +``enable`` has no effect when a component is not supported by the
> +hardware, and valid values will be accepted even when the main switch
> +is off.
> +
> +====== ===============================================================
> +Values Components
> +====== ===============================================================
> +0x0001 The main switch for the multi-gen LRU.
> +0x0002 Clearing the accessed bit in leaf page table entries in large
> +       batches, when MMU sets it (e.g., on x86). This behavior can
> +       theoretically worsen lock contention (mmap_lock). If it is
> +       disabled, the multi-gen LRU will suffer a minor performance
> +       degradation.
> +0x0004 Clearing the accessed bit in non-leaf page table entries as
> +       well, when MMU sets it (e.g., on x86). This behavior was not
> +       verified on x86 varieties other than Intel and AMD. If it is
> +       disabled, the multi-gen LRU will suffer a negligible
> +       performance degradation.
> +[yYnN] Apply to all the components above.
> +====== ===============================================================
> +
> +E.g.,
> +::
> +
> +    echo y >/sys/kernel/mm/lru_gen/enabled
> +    cat /sys/kernel/mm/lru_gen/enabled
> +    0x0007
> +    echo 5 >/sys/kernel/mm/lru_gen/enabled
> +    cat /sys/kernel/mm/lru_gen/enabled
> +    0x0005
> +
> +Thrashing prevention
> +--------------------
> +Personal computers are more sensitive to thrashing because it can
> +cause janks (lags when rendering UI) and negatively impact user
> +experience. The multi-gen LRU offers thrashing prevention to the
> +majority of laptop and desktop users who do not have ``oomd``.
> +
> +Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
> +``N`` milliseconds from getting evicted. The OOM killer is triggered
> +if this working set cannot be kept in memory. In other words, this
> +option works as an adjustable pressure relief valve, and when open, it
> +terminates applications that are hopefully not being used.
> +
> +Based on the average human detectable lag (~100ms), ``N=1000`` usually
> +eliminates intolerable janks due to thrashing. Larger values like
> +``N=3000`` make janks less noticeable at the risk of premature OOM
> +kills.

What is the default value of min_ttl_ms?

> +
> +Experimental features
> +=====================

-- 
Sincerely yours,
Mike.

* Re: [PATCH v9 13/14] mm: multi-gen LRU: admin guide
  2022-03-10 12:29     ` Mike Rapoport
@ 2022-03-11  0:37       ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-11  0:37 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Rik van Riel,
	Vlastimil Babka, Will Deacon, Ying Huang, Linux ARM,
	open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Thu, Mar 10, 2022 at 5:30 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> Hi,
>
> On Tue, Mar 08, 2022 at 07:12:30PM -0700, Yu Zhao wrote:
> > Add an admin guide.
> >
> > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > Acked-by: Brian Geffon <bgeffon@google.com>
> > Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> > Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> > Acked-by: Steven Barrett <steven@liquorix.net>
> > Acked-by: Suleiman Souhlal <suleiman@google.com>
> > Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> > Tested-by: Donald Carr <d@chaos-reins.com>
> > Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> > Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> > Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> > Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> > Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> > ---
> >  Documentation/admin-guide/mm/index.rst        |   1 +
> >  Documentation/admin-guide/mm/multigen_lru.rst | 146 ++++++++++++++++++
> >  mm/Kconfig                                    |   3 +-
> >  3 files changed, 149 insertions(+), 1 deletion(-)
> >  create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
> >
> > diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
> > index c21b5823f126..2cf5bae62036 100644
> > --- a/Documentation/admin-guide/mm/index.rst
> > +++ b/Documentation/admin-guide/mm/index.rst
> > @@ -32,6 +32,7 @@ the Linux memory management.
> >     idle_page_tracking
> >     ksm
> >     memory-hotplug
> > +   multigen_lru
> >     nommu-mmap
> >     numa_memory_policy
> >     numaperf
> > diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
> > new file mode 100644
> > index 000000000000..4ea6a801dc56
> > --- /dev/null
> > +++ b/Documentation/admin-guide/mm/multigen_lru.rst
> > @@ -0,0 +1,146 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +=============
> > +Multi-Gen LRU
> > +=============
>
> I'm still missing an opening paragraph that explains what Multi-gen LRU is
> and why users would want it.
>
> Something like
>
>   Multi-gen LRU is an efficient mechanism for page reclamation.
>
> More details are of course welcome :)

I've added the following for the next spin:

+Page reclaim decides the kernel's caching policy and ability to
+overcommit memory. It directly impacts the kswapd CPU usage and RAM
+efficiency. Multi-gen LRU aims to optimize page reclaim and improve
+performance under memory pressure.

> > +Quick start
> > +===========
> > +Build the kernel with the following configurations.
> > +
> > +* ``CONFIG_LRU_GEN=y``
> > +* ``CONFIG_LRU_GEN_ENABLED=y``
> > +
> > +All set!
> > +
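
As a side note, the quick start above can be checked against a kernel
config file. This is only a minimal sketch: the helper name
lru_gen_configured is hypothetical, and the /boot/config-$(uname -r)
location is an assumption that holds on many distributions but not all.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patchset): succeed only if both
# options from the quick start are set to "y" in the given config file.
lru_gen_configured() {
    cfg=$1
    grep -qx 'CONFIG_LRU_GEN=y' "$cfg" &&
        grep -qx 'CONFIG_LRU_GEN_ENABLED=y' "$cfg"
}

# Assumption: the running kernel's config is installed under /boot.
cfg=/boot/config-$(uname -r)
if [ -r "$cfg" ] && lru_gen_configured "$cfg"; then
    echo "multi-gen LRU is built in and enabled by default"
else
    echo "multi-gen LRU options not found in $cfg"
fi
```

The helper takes the file as an argument so it can also be pointed at a
.config in a build tree.
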
> > +Runtime options
> > +===============
> > +``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
> > +following subsections.
> > +
> > +Kill switch
> > +-----------
> > +``enable`` accepts different values to enable or disabled the
>
>                                                    ^ disable

Good catch. Will fix it up.

> > +following components. The default value of this file depends on
> > +``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
> > +unless some of them have unforeseen side effects. Writing to
> > +``enable`` has no effect when a component is not supported by the
> > +hardware, and valid values will be accepted even when the main switch
> > +is off.
> > +
> > +====== ===============================================================
> > +Values Components
> > +====== ===============================================================
> > +0x0001 The main switch for the multi-gen LRU.
> > +0x0002 Clearing the accessed bit in leaf page table entries in large
> > +       batches, when MMU sets it (e.g., on x86). This behavior can
> > +       theoretically worsen lock contention (mmap_lock). If it is
> > +       disabled, the multi-gen LRU will suffer a minor performance
> > +       degradation.
> > +0x0004 Clearing the accessed bit in non-leaf page table entries as
> > +       well, when MMU sets it (e.g., on x86). This behavior was not
> > +       verified on x86 varieties other than Intel and AMD. If it is
> > +       disabled, the multi-gen LRU will suffer a negligible
> > +       performance degradation.
> > +[yYnN] Apply to all the components above.
> > +====== ===============================================================
> > +
> > +E.g.,
> > +::
> > +
> > +    echo y >/sys/kernel/mm/lru_gen/enabled
> > +    cat /sys/kernel/mm/lru_gen/enabled
> > +    0x0007
> > +    echo 5 >/sys/kernel/mm/lru_gen/enabled
> > +    cat /sys/kernel/mm/lru_gen/enabled
> > +    0x0005
> > +
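
For readers following along, the value read back from ``enabled`` is a
bitmask of the components in the table above. A minimal sketch of
decoding it follows. The function name lru_gen_decode and the labels
"main", "leaf" and "nonleaf" are hypothetical, and the function only
interprets a value passed as an argument rather than reading sysfs.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patchset): decode an lru_gen
# "enabled" value such as 0x0007 into the components from the table.
lru_gen_decode() {
    val=$(( $1 ))    # shell arithmetic accepts both 5 and 0x0005
    out=""
    [ $(( val & 0x1 )) -ne 0 ] && out="$out main"      # 0x0001: main switch
    [ $(( val & 0x2 )) -ne 0 ] && out="$out leaf"      # 0x0002: leaf entries
    [ $(( val & 0x4 )) -ne 0 ] && out="$out nonleaf"   # 0x0004: non-leaf entries
    echo "${out# }"  # drop the leading space
}

lru_gen_decode 0x0007
lru_gen_decode 0x0005
```

With the two values from the example above, this prints
"main leaf nonleaf" and "main nonleaf" respectively.
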
> > +Thrashing prevention
> > +--------------------
> > +Personal computers are more sensitive to thrashing because it can
> > +cause janks (lags when rendering UI) and negatively impact user
> > +experience. The multi-gen LRU offers thrashing prevention to the
> > +majority of laptop and desktop users who do not have ``oomd``.
> > +
> > +Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
> > +``N`` milliseconds from getting evicted. The OOM killer is triggered
> > +if this working set cannot be kept in memory. In other words, this
> > +option works as an adjustable pressure relief valve, and when open, it
> > +terminates applications that are hopefully not being used.
> > +
> > +Based on the average human detectable lag (~100ms), ``N=1000`` usually
> > +eliminates intolerable janks due to thrashing. Larger values like
> > +``N=3000`` make janks less noticeable at the risk of premature OOM
> > +kills.
>
> What is the default value of min_ttl_ms?

Right. I've added the following for the next spin:

+The default value ``0`` means disabled.
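
To illustrate the interface, here is a minimal sketch of setting
``min_ttl_ms`` and reading it back. The helper name set_min_ttl_ms and
its optional path argument are hypothetical; writing to the real sysfs
file is assumed to require root, so the example does a dry run against
a temporary file.

```shell
#!/bin/sh
# Hypothetical helper: write N (milliseconds) to min_ttl_ms and read it
# back. The second argument overrides the sysfs path for dry runs.
set_min_ttl_ms() {
    n=$1
    file=${2:-/sys/kernel/mm/lru_gen/min_ttl_ms}
    case $n in
        ''|*[!0-9]*)
            echo "min_ttl_ms must be a non-negative integer" >&2
            return 1 ;;
    esac
    echo "$n" >"$file" || return 1
    cat "$file"
}

# Dry run against a scratch file instead of the real sysfs node.
tmp=$(mktemp)
set_min_ttl_ms 1000 "$tmp"
rm -f "$tmp"
```

On a kernel with this patchset, calling it with a single argument
(e.g., 1000) as root would target the real file; writing 0 restores the
default, i.e., disabled.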

* Re: [PATCH v9 14/14] mm: multi-gen LRU: design doc
  2022-03-09  2:12   ` Yu Zhao
@ 2022-03-11  8:22     ` Mike Rapoport
  -1 siblings, 0 replies; 120+ messages in thread
From: Mike Rapoport @ 2022-03-11  8:22 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Rik van Riel,
	Vlastimil Babka, Will Deacon, Ying Huang, linux-arm-kernel,
	linux-doc, linux-kernel, linux-mm, page-reclaim, x86,
	Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Tue, Mar 08, 2022 at 07:12:31PM -0700, Yu Zhao wrote:
> Add a design doc.
> 
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
>  Documentation/vm/index.rst        |   1 +
>  Documentation/vm/multigen_lru.rst | 156 ++++++++++++++++++++++++++++++
>  2 files changed, 157 insertions(+)
>  create mode 100644 Documentation/vm/multigen_lru.rst
> 
> diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
> index 44365c4574a3..b48434300226 100644
> --- a/Documentation/vm/index.rst
> +++ b/Documentation/vm/index.rst
> @@ -25,6 +25,7 @@ algorithms.  If you are looking for advice on simply allocating memory, see the
>     ksm
>     memory-model
>     mmu_notifier
> +   multigen_lru
>     numa
>     overcommit-accounting
>     page_migration
> diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
> new file mode 100644
> index 000000000000..cde60de16621
> --- /dev/null
> +++ b/Documentation/vm/multigen_lru.rst
> @@ -0,0 +1,156 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=============
> +Multi-Gen LRU
> +=============

Here I also miss an introductory paragraph about what Multi-Gen LRU is.

All the rest looks good to me.
> +
> +Design overview
> +===============
> +Objectives
> +----------

-- 
Sincerely yours,
Mike.

* Re: [PATCH v9 14/14] mm: multi-gen LRU: design doc
  2022-03-11  8:22     ` Mike Rapoport
@ 2022-03-11  9:38       ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-11  9:38 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Rik van Riel,
	Vlastimil Babka, Will Deacon, Ying Huang, Linux ARM,
	open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Fri, Mar 11, 2022 at 1:23 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Tue, Mar 08, 2022 at 07:12:31PM -0700, Yu Zhao wrote:
> > Add a design doc.
> >
> > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > Acked-by: Brian Geffon <bgeffon@google.com>
> > Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> > Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> > Acked-by: Steven Barrett <steven@liquorix.net>
> > Acked-by: Suleiman Souhlal <suleiman@google.com>
> > Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> > Tested-by: Donald Carr <d@chaos-reins.com>
> > Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> > Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> > Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> > Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> > Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> > ---
> >  Documentation/vm/index.rst        |   1 +
> >  Documentation/vm/multigen_lru.rst | 156 ++++++++++++++++++++++++++++++
> >  2 files changed, 157 insertions(+)
> >  create mode 100644 Documentation/vm/multigen_lru.rst
> >
> > diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
> > index 44365c4574a3..b48434300226 100644
> > --- a/Documentation/vm/index.rst
> > +++ b/Documentation/vm/index.rst
> > @@ -25,6 +25,7 @@ algorithms.  If you are looking for advice on simply allocating memory, see the
> >     ksm
> >     memory-model
> >     mmu_notifier
> > +   multigen_lru
> >     numa
> >     overcommit-accounting
> >     page_migration
> > diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
> > new file mode 100644
> > index 000000000000..cde60de16621
> > --- /dev/null
> > +++ b/Documentation/vm/multigen_lru.rst
> > @@ -0,0 +1,156 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +=============
> > +Multi-Gen LRU
> > +=============
>
> Here I also miss an introductory paragraph about what Multi-Gen LRU is.
>
> All the rest looks good to me.

Will add one in the next spin. Thanks.

* Re: [PATCH v9 01/14] mm: x86, arm64: add arch_has_hw_pte_young()
  2022-03-09  2:12   ` Yu Zhao
@ 2022-03-11 10:55     ` Barry Song
  -1 siblings, 0 replies; 120+ messages in thread
From: Barry Song @ 2022-03-11 10:55 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, page-reclaim, x86,
	Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Wed, Mar 9, 2022 at 3:47 PM Yu Zhao <yuzhao@google.com> wrote:
>
> Some architectures automatically set the accessed bit in PTEs, e.g.,
> x86 and arm64 v8.2. On architectures that do not have this capability,
> clearing the accessed bit in a PTE usually triggers a page fault
> following the TLB miss of this PTE (to emulate the accessed bit).
>
> Being aware of this capability can help make better decisions, e.g.,
> whether to spread the work out over a period of time to reduce bursty
> page faults when trying to clear the accessed bit in many PTEs.
>
> Note that theoretically this capability can be unreliable, e.g.,
> hotplugged CPUs might be different from builtin ones. Therefore it
> should not be used in architecture-independent code that involves
> correctness, e.g., to determine whether TLB flushes are required (in
> combination with the accessed bit).
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Acked-by: Will Deacon <will@kernel.org>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---

Reviewed-by: Barry Song <baohua@kernel.org>

I guess arch_has_hw_pte_young() isn't called that often in either
mm/memory.c or mm/vmscan.c.
Otherwise, moving to a static key might help. Would it?


>  arch/arm64/include/asm/pgtable.h | 14 ++------------
>  arch/x86/include/asm/pgtable.h   |  6 +++---
>  include/linux/pgtable.h          | 13 +++++++++++++
>  mm/memory.c                      | 14 +-------------
>  4 files changed, 19 insertions(+), 28 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index c4ba047a82d2..990358eca359 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -999,23 +999,13 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
>   * page after fork() + CoW for pfn mappings. We don't always have a
>   * hardware-managed access flag on arm64.
>   */
> -static inline bool arch_faults_on_old_pte(void)
> -{
> -       WARN_ON(preemptible());
> -
> -       return !cpu_has_hw_af();
> -}
> -#define arch_faults_on_old_pte         arch_faults_on_old_pte
> +#define arch_has_hw_pte_young          cpu_has_hw_af
>
>  /*
>   * Experimentally, it's cheap to set the access flag in hardware and we
>   * benefit from prefaulting mappings as 'old' to start with.
>   */
> -static inline bool arch_wants_old_prefaulted_pte(void)
> -{
> -       return !arch_faults_on_old_pte();
> -}
> -#define arch_wants_old_prefaulted_pte  arch_wants_old_prefaulted_pte
> +#define arch_wants_old_prefaulted_pte  cpu_has_hw_af
>
>  static inline pgprot_t arch_filter_pgprot(pgprot_t prot)
>  {
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 8a9432fb3802..60b6ce45c2e3 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1423,10 +1423,10 @@ static inline bool arch_has_pfn_modify_check(void)
>         return boot_cpu_has_bug(X86_BUG_L1TF);
>  }
>
> -#define arch_faults_on_old_pte arch_faults_on_old_pte
> -static inline bool arch_faults_on_old_pte(void)
> +#define arch_has_hw_pte_young arch_has_hw_pte_young
> +static inline bool arch_has_hw_pte_young(void)
>  {
> -       return false;
> +       return true;
>  }
>
>  #endif /* __ASSEMBLY__ */
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index f4f4077b97aa..79f64dcff07d 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -259,6 +259,19 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma,
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  #endif
>
> +#ifndef arch_has_hw_pte_young
> +/*
> + * Return whether the accessed bit is supported on the local CPU.
> + *
> + * This stub assumes accessing through an old PTE triggers a page fault.
> + * Architectures that automatically set the access bit should overwrite it.
> + */
> +static inline bool arch_has_hw_pte_young(void)
> +{
> +       return false;
> +}
> +#endif
> +
>  #ifndef __HAVE_ARCH_PTEP_CLEAR
>  static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
>                               pte_t *ptep)
> diff --git a/mm/memory.c b/mm/memory.c
> index c125c4969913..a7379196a47e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -122,18 +122,6 @@ int randomize_va_space __read_mostly =
>                                         2;
>  #endif
>
> -#ifndef arch_faults_on_old_pte
> -static inline bool arch_faults_on_old_pte(void)
> -{
> -       /*
> -        * Those arches which don't have hw access flag feature need to
> -        * implement their own helper. By default, "true" means pagefault
> -        * will be hit on old pte.
> -        */
> -       return true;
> -}
> -#endif
> -
>  #ifndef arch_wants_old_prefaulted_pte
>  static inline bool arch_wants_old_prefaulted_pte(void)
>  {
> @@ -2778,7 +2766,7 @@ static inline bool cow_user_page(struct page *dst, struct page *src,
>          * On architectures with software "accessed" bits, we would
>          * take a double page fault, so mark it accessed here.
>          */
> -       if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
> +       if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) {
>                 pte_t entry;
>
>                 vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
> --
> 2.35.1.616.g0bdcbb4464-goog
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 01/14] mm: x86, arm64: add arch_has_hw_pte_young()
  2022-03-11 10:55     ` Barry Song
@ 2022-03-11 22:57       ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-11 22:57 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Fri, Mar 11, 2022 at 3:55 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Mar 9, 2022 at 3:47 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > Some architectures automatically set the accessed bit in PTEs, e.g.,
> > x86 and arm64 v8.2. On architectures that do not have this capability,
> > clearing the accessed bit in a PTE usually triggers a page fault
> > following the TLB miss of this PTE (to emulate the accessed bit).
> >
> > Being aware of this capability can help make better decisions, e.g.,
> > whether to spread the work out over a period of time to reduce bursty
> > page faults when trying to clear the accessed bit in many PTEs.
> >
> > Note that theoretically this capability can be unreliable, e.g.,
> > hotplugged CPUs might be different from builtin ones. Therefore it
> > should not be used in architecture-independent code that involves
> > correctness, e.g., to determine whether TLB flushes are required (in
> > combination with the accessed bit).
> >
> > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > Acked-by: Brian Geffon <bgeffon@google.com>
> > Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> > Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> > Acked-by: Steven Barrett <steven@liquorix.net>
> > Acked-by: Suleiman Souhlal <suleiman@google.com>
> > Acked-by: Will Deacon <will@kernel.org>
> > Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> > Tested-by: Donald Carr <d@chaos-reins.com>
> > Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> > Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> > Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> > Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> > Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> > ---
>
> Reviewed-by: Barry Song <baohua@kernel.org>

Thanks.

> i guess arch_has_hw_pte_young() isn't called that often in either
> mm/memory.c or mm/vmscan.c.
> Otherwise, moving to a static key might help. Is it?

MRS shouldn't be slower than either branch of a static key. With a
static key, we can only optimize one of the two cases.

There is a *theoretical* problem with MRS: ARM specs don't prohibit a
physical CPU from supporting both cases (on different logical CPUs).

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 05/14] mm: multi-gen LRU: groundwork
  2022-03-09  2:12   ` Yu Zhao
@ 2022-03-14  8:08     ` Huang, Ying
  -1 siblings, 0 replies; 120+ messages in thread
From: Huang, Ying @ 2022-03-14  8:08 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, linux-arm-kernel,
	linux-doc, linux-kernel, linux-mm, page-reclaim, x86,
	Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

Hi, Yu,

Yu Zhao <yuzhao@google.com> writes:
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 3326ee3903f3..747ab1690bcf 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -892,6 +892,16 @@ config ANON_VMA_NAME
>  	  area from being merged with adjacent virtual memory areas due to the
>  	  difference in their name.
>  
> +# the multi-gen LRU {
> +config LRU_GEN
> +	bool "Multi-Gen LRU"
> +	depends on MMU
> +	# the following options can use up the spare bits in page flags
> +	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)

LRU_GEN depends on !MAXSMP.  So what is the maximum NR_CPUS supported
by LRU_GEN?

> +	help
> +	  A high performance LRU implementation for memory overcommit.
> +# }
> +
>  source "mm/damon/Kconfig"
>  
>  endmenu

Best Regards,
Huang, Ying

[snip]

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 05/14] mm: multi-gen LRU: groundwork
  2022-03-14  8:08     ` Huang, Ying
@ 2022-03-14  9:30       ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-14  9:30 UTC (permalink / raw)
  To: Huang, Ying, kernel, kernel-team
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Linux ARM,
	open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Mon, Mar 14, 2022 at 2:09 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Yu,
>
> Yu Zhao <yuzhao@google.com> writes:
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 3326ee3903f3..747ab1690bcf 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -892,6 +892,16 @@ config ANON_VMA_NAME
> >         area from being merged with adjacent virtual memory areas due to the
> >         difference in their name.
> >
> > +# the multi-gen LRU {
> > +config LRU_GEN
> > +     bool "Multi-Gen LRU"
> > +     depends on MMU
> > +     # the following options can use up the spare bits in page flags
> > +     depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
>
> LRU_GEN depends on !MAXSMP.  So, What is the maximum NR_CPUS supported
> by LRU_GEN?

LRU_GEN doesn't really care about NR_CPUS. IOW, it doesn't impose a
max number. The dependency is on NODES_SHIFT, whose default MAXSMP sets:
    default "10" if MAXSMP
Combined with LAST_CPUPID_SHIFT, this can exhaust the spare bits in page
flags.

MAXSMP is meant for kernel developers to test their code, and it
should not be used in production [1]. But some distros unfortunately
ship kernels built with this option, e.g., Fedora and Ubuntu. And
their users reported build errors to me after they applied MGLRU on
those kernels ("Not enough bits in page flags"). Let me add Fedora and
Ubuntu to this thread.

Fedora and Ubuntu,

Could you please clarify if there is a reason to ship kernels built
with MAXSMP? Otherwise, please consider disabling this option. Thanks.

As per above, MAXSMP enables ridiculously large numbers of CPUs and
NUMA nodes for testing purposes. It is detrimental to performance,
e.g., CPUMASK_OFFSTACK.

[1] https://lore.kernel.org/lkml/20131106055634.GA24044@gmail.com/

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 05/14] mm: multi-gen LRU: groundwork
  2022-03-14  9:30       ` Yu Zhao
@ 2022-03-15  0:34         ` Huang, Ying
  -1 siblings, 0 replies; 120+ messages in thread
From: Huang, Ying @ 2022-03-15  0:34 UTC (permalink / raw)
  To: Yu Zhao
  Cc: kernel, kernel-team, Andrew Morton, Linus Torvalds, Andi Kleen,
	Aneesh Kumar, Catalin Marinas, Dave Hansen, Hillf Danton,
	Jens Axboe, Jesse Barnes, Johannes Weiner, Jonathan Corbet,
	Matthew Wilcox, Mel Gorman, Michael Larabel, Michal Hocko,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Linux ARM, open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

Yu Zhao <yuzhao@google.com> writes:

> On Mon, Mar 14, 2022 at 2:09 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Hi, Yu,
>>
>> Yu Zhao <yuzhao@google.com> writes:
>> > diff --git a/mm/Kconfig b/mm/Kconfig
>> > index 3326ee3903f3..747ab1690bcf 100644
>> > --- a/mm/Kconfig
>> > +++ b/mm/Kconfig
>> > @@ -892,6 +892,16 @@ config ANON_VMA_NAME
>> >         area from being merged with adjacent virtual memory areas due to the
>> >         difference in their name.
>> >
>> > +# the multi-gen LRU {
>> > +config LRU_GEN
>> > +     bool "Multi-Gen LRU"
>> > +     depends on MMU
>> > +     # the following options can use up the spare bits in page flags
>> > +     depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
>>
>> LRU_GEN depends on !MAXSMP.  So, What is the maximum NR_CPUS supported
>> by LRU_GEN?
>
> LRU_GEN doesn't really care about NR_CPUS. IOW, it doesn't impose a
> max number. The dependency is with NODES_SHIFT selected by MAXSMP:
>     default "10" if MAXSMP
> This combined with LAST_CPUPID_SHIFT can exhaust the spare bits in page flags.

According to the following snippet from page-flags-layout.h,
LAST_CPUPID_SHIFT depends on NR_CPUS rather than NODES_SHIFT.

#define LAST__PID_SHIFT 8
#define LAST__PID_MASK  ((1 << LAST__PID_SHIFT)-1)

#define LAST__CPU_SHIFT NR_CPUS_BITS
#define LAST__CPU_MASK  ((1 << LAST__CPU_SHIFT)-1)

#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT+LAST__CPU_SHIFT)

Best Regards,
Huang, Ying

[snip]

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 05/14] mm: multi-gen LRU: groundwork
  2022-03-15  0:34         ` Huang, Ying
@ 2022-03-15  0:50           ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-15  0:50 UTC (permalink / raw)
  To: Huang, Ying
  Cc: kernel, kernel-team, Andrew Morton, Linus Torvalds, Andi Kleen,
	Aneesh Kumar, Catalin Marinas, Dave Hansen, Hillf Danton,
	Jens Axboe, Jesse Barnes, Johannes Weiner, Jonathan Corbet,
	Matthew Wilcox, Mel Gorman, Michael Larabel, Michal Hocko,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Linux ARM, open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Mon, Mar 14, 2022 at 6:34 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > On Mon, Mar 14, 2022 at 2:09 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Hi, Yu,
> >>
> >> Yu Zhao <yuzhao@google.com> writes:
> >> > diff --git a/mm/Kconfig b/mm/Kconfig
> >> > index 3326ee3903f3..747ab1690bcf 100644
> >> > --- a/mm/Kconfig
> >> > +++ b/mm/Kconfig
> >> > @@ -892,6 +892,16 @@ config ANON_VMA_NAME
> >> >         area from being merged with adjacent virtual memory areas due to the
> >> >         difference in their name.
> >> >
> >> > +# the multi-gen LRU {
> >> > +config LRU_GEN
> >> > +     bool "Multi-Gen LRU"
> >> > +     depends on MMU
> >> > +     # the following options can use up the spare bits in page flags
> >> > +     depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
> >>
> >> LRU_GEN depends on !MAXSMP.  So, What is the maximum NR_CPUS supported
> >> by LRU_GEN?
> >
> > LRU_GEN doesn't really care about NR_CPUS. IOW, it doesn't impose a
> > max number. The dependency is with NODES_SHIFT selected by MAXSMP:
> >     default "10" if MAXSMP
> > This combined with LAST_CPUPID_SHIFT can exhaust the spare bits in page flags.
>
> From the following code snippets from page-flags-layout.h,
> LAST_CPUPID_SHIFT is related to NR_CPUS instead of NODES_SHIFT.

It is. But LAST_CPUPID_NOT_IN_PAGE_FLAGS should always work, whereas
NODE_NOT_IN_PAGE_FLAGS doesn't.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation
  2022-03-09  2:12   ` Yu Zhao
@ 2022-03-16  5:55     ` Huang, Ying
  -1 siblings, 0 replies; 120+ messages in thread
From: Huang, Ying @ 2022-03-16  5:55 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, linux-arm-kernel,
	linux-doc, linux-kernel, linux-mm, page-reclaim, x86,
	Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

Hi, Yu,

Yu Zhao <yuzhao@google.com> writes:

[snip]

>  
> +static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
> +{
> +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +
> +	if (!can_demote(pgdat->node_id, sc) &&
> +	    mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
> +		return 0;
> +
> +	return mem_cgroup_swappiness(memcg);
> +}
> +

We have tested v9 on a memory tiering system; demotion now works even
without swap devices configured.  Thanks!

We also found that the demotion (page reclaim on DRAM nodes) speed is
lower than with the original implementation.  The workload is just a
memory-access micro-benchmark with a Gaussian distribution.  It runs on
a system with DRAM and PMEM.  Initially, many hot pages are placed
in PMEM and many cold pages are placed in DRAM.  Then the page
placement optimization mechanism based on NUMA balancing tries to
promote some hot pages from the PMEM node to the DRAM node.  If the DRAM
node is nearly full (reaches the high watermark), kswapd of the DRAM node
is woken up to demote (reclaim) some cold DRAM pages to PMEM.  Because
many pages in DRAM are very cold (not accessed for at least several
seconds), the benchmark performs better if demotion is faster.

Some data from /proc/vmstat and perf-profile follows.

From /proc/vmstat, it seems that far fewer pages are scanned and demoted
with MGLRU enabled.  The pgdemote_kswapd / pgscan_kswapd ratio is
5.22 times higher with MGLRU enabled than with MGLRU disabled.  I
think this shows the value of direct page table scanning.

From perf-profile, the CPU cycles spent in kswapd are the same, but fewer
pages are demoted (reclaimed) with MGLRU.  And it appears that the total
page table scanning time of MGLRU is longer if we compare walk_page_range
(1.97%, MGLRU enabled) and page_referenced (0.54%, MGLRU disabled)?
Is that because we only demote (reclaim) from DRAM nodes, not from PMEM
nodes, and the bloom filter doesn't work well enough?
One thing that may be unfriendly to the bloom filter is that some virtual
pages may change their resident nodes because of demotion/promotion.

Can you teach me how to interpret these data for MGLRU?  Or can you
point me to other/better data for MGLRU?

MGLRU disabled via: echo -n 0 > /sys/kernel/mm/lru_gen/enabled
--------------------------------------------------------------

/proc/vmstat:

pgactivate 1767172340
pgdeactivate 1740111896
pglazyfree 0
pgfault 583875828
pgmajfault 0
pglazyfreed 0
pgrefill 1740111896
pgreuse 22626572
pgsteal_kswapd 153796237
pgsteal_direct 1999
pgdemote_kswapd 153796237
pgdemote_direct 1999
pgscan_kswapd 2055504891
pgscan_direct 1999
pgscan_direct_throttle 0
pgscan_anon 2055356614
pgscan_file 150276
pgsteal_anon 153798203
pgsteal_file 33
zone_reclaim_failed 0
pginodesteal 0
slabs_scanned 82761
kswapd_inodesteal 0
kswapd_low_wmark_hit_quickly 2960
kswapd_high_wmark_hit_quickly 17732
pageoutrun 21583
pgrotated 0
drop_pagecache 0
drop_slab 0
oom_kill 0
numa_pte_updates 515994024
numa_huge_pte_updates 154
numa_hint_faults 498301236
numa_hint_faults_local 121109067
numa_pages_migrated 152650705
pgmigrate_success 307213704
pgmigrate_fail 39
thp_migration_success 93
thp_migration_fail 0
thp_migration_split 0

perf-profile:

kswapd.kthread.ret_from_fork: 2.86
balance_pgdat.kswapd.kthread.ret_from_fork: 2.86
shrink_node.balance_pgdat.kswapd.kthread.ret_from_fork: 2.85
shrink_lruvec.shrink_node.balance_pgdat.kswapd.kthread: 2.76
shrink_inactive_list.shrink_lruvec.shrink_node.balance_pgdat.kswapd: 1.9
shrink_page_list.shrink_inactive_list.shrink_lruvec.shrink_node.balance_pgdat: 1.52
shrink_active_list.shrink_lruvec.shrink_node.balance_pgdat.kswapd: 0.85
migrate_pages.shrink_page_list.shrink_inactive_list.shrink_lruvec.shrink_node: 0.79
page_referenced.shrink_page_list.shrink_inactive_list.shrink_lruvec.shrink_node: 0.54


MGLRU enabled via: echo -n 7 > /sys/kernel/mm/lru_gen/enabled
-------------------------------------------------------------

/proc/vmstat:

pgactivate 47212585
pgdeactivate 0
pglazyfree 0
pgfault 580056521
pgmajfault 0
pglazyfreed 0
pgrefill 6911868880
pgreuse 25108929
pgsteal_kswapd 32701609
pgsteal_direct 0
pgdemote_kswapd 32701609
pgdemote_direct 0
pgscan_kswapd 83582770
pgscan_direct 0
pgscan_direct_throttle 0
pgscan_anon 83549777
pgscan_file 32993
pgsteal_anon 32701576
pgsteal_file 33
zone_reclaim_failed 0
pginodesteal 0
slabs_scanned 84829
kswapd_inodesteal 0
kswapd_low_wmark_hit_quickly 313
kswapd_high_wmark_hit_quickly 5262
pageoutrun 5895
pgrotated 0
drop_pagecache 0
drop_slab 0
oom_kill 0
numa_pte_updates 512084786
numa_huge_pte_updates 198
numa_hint_faults 494583387
numa_hint_faults_local 129411334
numa_pages_migrated 34165992
pgmigrate_success 67833977
pgmigrate_fail 7
thp_migration_success 135
thp_migration_fail 0
thp_migration_split 0

perf-profile:

kswapd.kthread.ret_from_fork: 2.86
balance_pgdat.kswapd.kthread.ret_from_fork: 2.86
lru_gen_age_node.balance_pgdat.kswapd.kthread.ret_from_fork: 1.97
walk_page_range.try_to_inc_max_seq.lru_gen_age_node.balance_pgdat.kswapd: 1.97
shrink_node.balance_pgdat.kswapd.kthread.ret_from_fork: 0.89
evict_folios.lru_gen_shrink_lruvec.shrink_lruvec.shrink_node.balance_pgdat: 0.89
scan_folios.evict_folios.lru_gen_shrink_lruvec.shrink_lruvec.shrink_node: 0.66

Best Regards,
Huang, Ying

[snip]

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation
  2022-03-16  5:55     ` Huang, Ying
@ 2022-03-16  7:54       ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-16  7:54 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Linux ARM,
	open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Tue, Mar 15, 2022 at 11:55 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Yu,
>
> Yu Zhao <yuzhao@google.com> writes:
>
> [snip]
>
> >
> > +static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
> > +{
> > +     struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > +     struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > +
> > +     if (!can_demote(pgdat->node_id, sc) &&
> > +         mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
> > +             return 0;
> > +
> > +     return mem_cgroup_swappiness(memcg);
> > +}
> > +
>
> We have tested v9 on a memory tiering system; demotion now works even
> without swap devices configured.  Thanks!

Admittedly I didn't test it :) So thanks for testing -- I'm glad to
hear it didn't fall apart.

> We also found that the demotion (page reclaim on DRAM nodes) speed is
> lower than with the original implementation.

This sounds like an improvement to me, assuming the initial hot/cold
memory placements were similar for both the baseline and MGLRU.

Correct me if I'm wrong: since demotion is driven by promotion, lower
demotion speed means hot and cold pages were sorted out to DRAM and
AEP at a faster speed, hence an improvement.

# promotion path (MGLRU disabled, then enabled):
numa_hint_faults    498301236
numa_pages_migrated 152650705

numa_hint_faults    494583387
numa_pages_migrated 34165992

# demotion path (MGLRU disabled, then enabled):
pgsteal_anon 153798203
pgsteal_file 33

pgsteal_anon 32701576
pgsteal_file 33

The hint faults are similar, but MGLRU migrated far fewer pages -- my
guess is that it demoted far fewer hot/warm pages and therefore caused
less work on the promotion path.

>  The workload is just a
> memory-access micro-benchmark with a Gaussian distribution.  It runs on
> a system with DRAM and PMEM.  Initially, many hot pages are placed
> in PMEM and many cold pages are placed in DRAM.  Then the page
> placement optimization mechanism based on NUMA balancing tries to
> promote some hot pages from the PMEM node to the DRAM node.

My understanding seems to be correct?

>  If the DRAM
> node is nearly full (reaches the high watermark), kswapd of the DRAM node
> is woken up to demote (reclaim) some cold DRAM pages to PMEM.  Because
> many pages in DRAM are very cold (not accessed for at least several
> seconds), the benchmark performs better if demotion is faster.

I'm confused. It seems to me demotion speed is irrelevant. The time to
reach the equilibrium is what we want to measure.

> Some data from /proc/vmstat and perf-profile follows.
>
> From /proc/vmstat, it seems that far fewer pages are scanned and demoted
> with MGLRU enabled.  The pgdemote_kswapd / pgscan_kswapd ratio is
> 5.22 times higher with MGLRU enabled than with MGLRU disabled.  I
> think this shows the value of direct page table scanning.

Can't disagree :)

> From perf-profile, the CPU cycles spent in kswapd are the same, but fewer
> pages are demoted (reclaimed) with MGLRU.  And it appears that the total
> page table scanning time of MGLRU is longer if we compare walk_page_range
> (1.97%, MGLRU enabled) and page_referenced (0.54%, MGLRU disabled)?

It's possible if the address space is very large and sparse. But once
MGLRU warms up, it should detect it and fall back to
page_referenced().

> Is that because we only demote (reclaim) from DRAM nodes, not from PMEM
> nodes, and the bloom filter doesn't work well enough?

The bloom filters are per lruvec. So this should affect them.

> One thing that may be unfriendly to the bloom filter is that some virtual
> pages may change their resident nodes because of demotion/promotion.

Yes, it's possible.

> Can you teach me how to interpret these data for MGLRU?  Or can you
> point me to other/better data for MGLRU?

You are the expert :)

My current understanding is that this is an improvement. IOW, with
MGLRU, DRAM (hot) <-> AEP (cold) reached equilibrium a lot faster.


> MGLRU disabled via: echo -n 0 > /sys/kernel/mm/lru_gen/enabled
> --------------------------------------------------------------
>
> /proc/vmstat:
>
> pgactivate 1767172340
> pgdeactivate 1740111896
> pglazyfree 0
> pgfault 583875828
> pgmajfault 0
> pglazyfreed 0
> pgrefill 1740111896
> pgreuse 22626572
> pgsteal_kswapd 153796237
> pgsteal_direct 1999
> pgdemote_kswapd 153796237
> pgdemote_direct 1999
> pgscan_kswapd 2055504891
> pgscan_direct 1999
> pgscan_direct_throttle 0
> pgscan_anon 2055356614
> pgscan_file 150276
> pgsteal_anon 153798203
> pgsteal_file 33
> zone_reclaim_failed 0
> pginodesteal 0
> slabs_scanned 82761
> kswapd_inodesteal 0
> kswapd_low_wmark_hit_quickly 2960
> kswapd_high_wmark_hit_quickly 17732
> pageoutrun 21583
> pgrotated 0
> drop_pagecache 0
> drop_slab 0
> oom_kill 0
> numa_pte_updates 515994024
> numa_huge_pte_updates 154
> numa_hint_faults 498301236
> numa_hint_faults_local 121109067
> numa_pages_migrated 152650705
> pgmigrate_success 307213704
> pgmigrate_fail 39
> thp_migration_success 93
> thp_migration_fail 0
> thp_migration_split 0
>
> perf-profile:
>
> kswapd.kthread.ret_from_fork: 2.86
> balance_pgdat.kswapd.kthread.ret_from_fork: 2.86
> shrink_node.balance_pgdat.kswapd.kthread.ret_from_fork: 2.85
> shrink_lruvec.shrink_node.balance_pgdat.kswapd.kthread: 2.76
> shrink_inactive_list.shrink_lruvec.shrink_node.balance_pgdat.kswapd: 1.9
> shrink_page_list.shrink_inactive_list.shrink_lruvec.shrink_node.balance_pgdat: 1.52
> shrink_active_list.shrink_lruvec.shrink_node.balance_pgdat.kswapd: 0.85
> migrate_pages.shrink_page_list.shrink_inactive_list.shrink_lruvec.shrink_node: 0.79
> page_referenced.shrink_page_list.shrink_inactive_list.shrink_lruvec.shrink_node: 0.54
>
>
> MGLRU enabled via: echo -n 7 > /sys/kernel/mm/lru_gen/enabled
> -------------------------------------------------------------
>
> /proc/vmstat:
>
> pgactivate 47212585
> pgdeactivate 0
> pglazyfree 0
> pgfault 580056521
> pgmajfault 0
> pglazyfreed 0
> pgrefill 6911868880
> pgreuse 25108929
> pgsteal_kswapd 32701609
> pgsteal_direct 0
> pgdemote_kswapd 32701609
> pgdemote_direct 0
> pgscan_kswapd 83582770
> pgscan_direct 0
> pgscan_direct_throttle 0
> pgscan_anon 83549777
> pgscan_file 32993
> pgsteal_anon 32701576
> pgsteal_file 33
> zone_reclaim_failed 0
> pginodesteal 0
> slabs_scanned 84829
> kswapd_inodesteal 0
> kswapd_low_wmark_hit_quickly 313
> kswapd_high_wmark_hit_quickly 5262
> pageoutrun 5895
> pgrotated 0
> drop_pagecache 0
> drop_slab 0
> oom_kill 0
> numa_pte_updates 512084786
> numa_huge_pte_updates 198
> numa_hint_faults 494583387
> numa_hint_faults_local 129411334
> numa_pages_migrated 34165992
> pgmigrate_success 67833977
> pgmigrate_fail 7
> thp_migration_success 135
> thp_migration_fail 0
> thp_migration_split 0
>
> perf-profile:
>
> kswapd.kthread.ret_from_fork: 2.86
> balance_pgdat.kswapd.kthread.ret_from_fork: 2.86
> lru_gen_age_node.balance_pgdat.kswapd.kthread.ret_from_fork: 1.97
> walk_page_range.try_to_inc_max_seq.lru_gen_age_node.balance_pgdat.kswapd: 1.97
> shrink_node.balance_pgdat.kswapd.kthread.ret_from_fork: 0.89
> evict_folios.lru_gen_shrink_lruvec.shrink_lruvec.shrink_node.balance_pgdat: 0.89
> scan_folios.evict_folios.lru_gen_shrink_lruvec.shrink_lruvec.shrink_node: 0.66
>
> Best Regards,
> Huang, Ying
>
> [snip]
>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 02/14] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
  2022-03-09  2:12   ` Yu Zhao
@ 2022-03-16 22:15     ` Barry Song
  -1 siblings, 0 replies; 120+ messages in thread
From: Barry Song @ 2022-03-16 22:15 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Wed, Mar 9, 2022 at 3:47 PM Yu Zhao <yuzhao@google.com> wrote:
>
> Some architectures support the accessed bit in non-leaf PMD entries,
> e.g., x86 sets the accessed bit in a non-leaf PMD entry when using it
> as part of linear address translation [1]. Page table walkers that
> clear the accessed bit may use this capability to reduce their search
> space.
>
> Note that:
> 1. Although an inline function is preferable, this capability is added
>    as a configuration option for consistency with the existing macros.
> 2. Due to the little interest in other varieties, this capability was
>    only tested on Intel and AMD CPUs.
>
> [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
>      Volume 3 (June 2021), section 4.8
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---

Reviewed-by: Barry Song <baohua@kernel.org>

This patch is hard to read by itself, but it becomes quite clear after
reading the change in walk_pmd_range():
walk_pmd_range()
{
...
#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
                if (get_cap(LRU_GEN_NONLEAF_YOUNG)) {
                        if (!pmd_young(val))
                                continue;

                        walk_pmd_range_locked(pud, addr, vma, walk, &pos);
                }
#endif
...
}
This gives us the chance to skip scanning all the PTEs within a PMD
whose accessed bit is clear.
So I am not sure this needs to be a separate patch; it could instead be
folded into the change to walk_pmd_range() so that readers understand
its purpose.
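
To make the skip concrete, here is a minimal userspace sketch of the
idea. It is a toy model, not the kernel code: struct toy_pmd, toy_walk()
and PTES_PER_PMD are hypothetical names, and 512 PTEs per PMD is assumed
(x86-64 with 4 KiB pages). The walker clears accessed bits and, when
told to use the non-leaf bit, never touches the PTEs under a PMD that
was not accessed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PTES_PER_PMD 512 /* assumption of this toy model */

/* Hypothetical stand-in for a non-leaf PMD entry and the PTE table below it. */
struct toy_pmd {
	bool young;                   /* accessed bit of the non-leaf entry */
	bool pte_young[PTES_PER_PMD]; /* accessed bits of the PTEs below it */
};

/*
 * Count young PTEs and clear their accessed bits. When use_nonleaf_young
 * is set, a PMD whose accessed bit is clear is skipped without examining
 * any of its PTEs. *ptes_scanned reports how many PTEs were examined.
 */
static size_t toy_walk(struct toy_pmd *pmds, size_t nr_pmds,
		       bool use_nonleaf_young, size_t *ptes_scanned)
{
	size_t young = 0;

	assert(pmds && ptes_scanned);
	*ptes_scanned = 0;

	for (size_t i = 0; i < nr_pmds; i++) {
		if (use_nonleaf_young) {
			if (!pmds[i].young)
				continue; /* nothing below was accessed */
			pmds[i].young = false;
		}
		for (size_t j = 0; j < PTES_PER_PMD; j++) {
			(*ptes_scanned)++;
			if (pmds[i].pte_young[j]) {
				pmds[i].pte_young[j] = false;
				young++;
			}
		}
	}
	return young;
}
```

With four PMDs of which one is young, the walk that honors the non-leaf
bit examines 512 PTEs instead of 2048.
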


>  arch/Kconfig                   | 9 +++++++++
>  arch/x86/Kconfig               | 1 +
>  arch/x86/include/asm/pgtable.h | 3 ++-
>  arch/x86/mm/pgtable.c          | 5 ++++-
>  include/linux/pgtable.h        | 4 ++--
>  5 files changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 678a80713b21..f9c59ecadbbb 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -1322,6 +1322,15 @@ config DYNAMIC_SIGFRAME
>  config HAVE_ARCH_NODE_DEV_GROUP
>         bool
>
> +config ARCH_HAS_NONLEAF_PMD_YOUNG
> +       bool
> +       depends on PGTABLE_LEVELS > 2
> +       help
> +         Architectures that select this option are capable of setting the
> +         accessed bit in non-leaf PMD entries when using them as part of linear
> +         address translations. Page table walkers that clear the accessed bit
> +         may use this capability to reduce their search space.
> +
>  source "kernel/gcov/Kconfig"
>
>  source "scripts/gcc-plugins/Kconfig"
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 9f5bd41bf660..e787b7fc75be 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -85,6 +85,7 @@ config X86
>         select ARCH_HAS_PMEM_API                if X86_64
>         select ARCH_HAS_PTE_DEVMAP              if X86_64
>         select ARCH_HAS_PTE_SPECIAL
> +       select ARCH_HAS_NONLEAF_PMD_YOUNG
>         select ARCH_HAS_UACCESS_FLUSHCACHE      if X86_64
>         select ARCH_HAS_COPY_MC                 if X86_64
>         select ARCH_HAS_SET_MEMORY
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 60b6ce45c2e3..f973788f6b21 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -819,7 +819,8 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
>
>  static inline int pmd_bad(pmd_t pmd)
>  {
> -       return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
> +       return (pmd_flags(pmd) & ~(_PAGE_USER | _PAGE_ACCESSED)) !=
> +              (_KERNPG_TABLE & ~_PAGE_ACCESSED);
>  }
>
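
As far as I can tell, the pmd_bad() change is needed because a walker
may now clear the accessed bit in a non-leaf PMD entry, and the old
comparison would then flag that entry as bad. A minimal sketch of the
two checks follows. The TOY_* names are hypothetical; the bit values
follow the x86 flag layout as I understand it, and the composition of
the table flags ignores the memory-encryption bit, which is an
assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low x86 page-table flag bits (assumed values for this sketch). */
#define TOY_PAGE_PRESENT  0x001u
#define TOY_PAGE_RW       0x002u
#define TOY_PAGE_USER     0x004u
#define TOY_PAGE_ACCESSED 0x020u
#define TOY_PAGE_DIRTY    0x040u

/* Assumed composition of the kernel page-table flags, without encryption. */
#define TOY_KERNPG_TABLE \
	(TOY_PAGE_PRESENT | TOY_PAGE_RW | TOY_PAGE_ACCESSED | TOY_PAGE_DIRTY)

static_assert(TOY_KERNPG_TABLE == 0x063u, "unexpected flag composition");

/* Check before the patch: the accessed bit has to be set. */
static bool toy_pmd_bad_old(uint32_t flags)
{
	return (flags & ~TOY_PAGE_USER) != TOY_KERNPG_TABLE;
}

/* Check after the patch: the accessed bit is masked out on both sides. */
static bool toy_pmd_bad_new(uint32_t flags)
{
	return (flags & ~(TOY_PAGE_USER | TOY_PAGE_ACCESSED)) !=
	       (TOY_KERNPG_TABLE & ~TOY_PAGE_ACCESSED);
}
```

An entry with the accessed bit cleared is rejected by the old check and
accepted by the new one; an entry missing other required bits is still
rejected by both.
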
>  static inline unsigned long pages_to_mb(unsigned long npg)
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 3481b35cb4ec..a224193d84bf 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -550,7 +550,7 @@ int ptep_test_and_clear_young(struct vm_area_struct *vma,
>         return ret;
>  }
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
>  int pmdp_test_and_clear_young(struct vm_area_struct *vma,
>                               unsigned long addr, pmd_t *pmdp)
>  {
> @@ -562,6 +562,9 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma,
>
>         return ret;
>  }
> +#endif
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  int pudp_test_and_clear_young(struct vm_area_struct *vma,
>                               unsigned long addr, pud_t *pudp)
>  {
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 79f64dcff07d..743e7fc4afda 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -212,7 +212,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  #endif
>
>  #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
>  static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
>                                             unsigned long address,
>                                             pmd_t *pmdp)
> @@ -233,7 +233,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
>         BUILD_BUG();
>         return 0;
>  }
> -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */
>  #endif
>
>  #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
> --
> 2.35.1.616.g0bdcbb4464-goog
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 05/14] mm: multi-gen LRU: groundwork
  2022-03-09  2:12   ` Yu Zhao
@ 2022-03-16 23:25     ` Barry Song
  -1 siblings, 0 replies; 120+ messages in thread
From: Barry Song @ 2022-03-16 23:25 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Wed, Mar 9, 2022 at 3:48 PM Yu Zhao <yuzhao@google.com> wrote:
>
> Evictable pages are divided into multiple generations for each lruvec.
> The youngest generation number is stored in lrugen->max_seq for both
> anon and file types as they are aged on an equal footing. The oldest
> generation numbers are stored in lrugen->min_seq[] separately for anon
> and file types as clean file pages can be evicted regardless of swap
> constraints. These three variables are monotonically increasing.
>
> Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
> in order to fit into the gen counter in folio->flags. Each truncated
> generation number is an index to lrugen->lists[]. The sliding window
> technique is used to track at least MIN_NR_GENS and at most
> MAX_NR_GENS generations. The gen counter stores a value within [1,
> MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
> stores 0.
>
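
For readers new to the encoding described above, here is a minimal
userspace sketch. The bit offset (8) and the TOY_* and toy_* names are
hypothetical; the 3-bit width follows from order_base_2(MAX_NR_GENS+1)
with MAX_NR_GENS = 4, and the gen+1 convention is taken from the
description above. It is not the kernel implementation.

```c
#include <assert.h>

#define MAX_NR_GENS 4U

/* Hypothetical layout: a 3-bit gen counter at bit offset 8 of the flags word. */
#define TOY_GEN_WIDTH 3
#define TOY_GEN_PGOFF 8
#define TOY_GEN_MASK  (((1UL << TOY_GEN_WIDTH) - 1) << TOY_GEN_PGOFF)

/* Truncate a monotonically increasing sequence number into a list index. */
static int toy_gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}

/* Store gen+1 in the counter so that 0 can mean "not on any list". */
static unsigned long toy_set_gen(unsigned long flags, int gen)
{
	assert(gen >= 0 && gen < (int)MAX_NR_GENS);
	return (flags & ~TOY_GEN_MASK) | ((gen + 1UL) << TOY_GEN_PGOFF);
}

/* Return the stored list index, or -1 if the counter is 0. */
static int toy_get_gen(unsigned long flags)
{
	return (int)((flags & TOY_GEN_MASK) >> TOY_GEN_PGOFF) - 1;
}

/* Clear the counter, as done when a page leaves the lists. */
static unsigned long toy_clear_gen(unsigned long flags)
{
	return flags & ~TOY_GEN_MASK;
}

/* Number of generations currently inside the sliding window. */
static unsigned long toy_nr_gens(unsigned long max_seq, unsigned long min_seq)
{
	return max_seq - min_seq + 1;
}
```

Because the index is seq modulo MAX_NR_GENS, the lists are reused as the
window slides: sequence numbers 4 and 8 both map to index 0.
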
> There are two conceptually independent procedures: "the aging", which
> produces young generations, and "the eviction", which consumes old
> generations. They form a closed-loop system, i.e., "the page reclaim".
> Both procedures can be invoked from userspace for the purposes of
> working set estimation and proactive reclaim. These features are
> required to optimize job scheduling (bin packing) in data centers. The
> variable size of the sliding window is designed for such use cases
> [1][2].
>
> To avoid confusion, the terms "hot" and "cold" will be applied to the
> multi-gen LRU, as a new convention; the terms "active" and "inactive"
> will be applied to the active/inactive LRU, as usual.
>
> The protection of hot pages and the selection of cold pages are based
> on page access channels and patterns. There are two access channels:
> one through page tables and the other through file descriptors. The
> protection of the former channel is by design stronger because:
> 1. The uncertainty in determining the access patterns of the former
>    channel is higher due to the approximation of the accessed bit.
> 2. The cost of evicting the former channel is higher due to the TLB
>    flushes required and the likelihood of encountering the dirty bit.
> 3. The penalty of underprotecting the former channel is higher because
>    applications usually do not prepare themselves for major page
>    faults like they do for blocked I/O. E.g., GUI applications
>    commonly use dedicated I/O threads to avoid blocking the rendering
>    threads.
> There are also two access patterns: one with temporal locality and the
> other without. For the reasons listed above, the former channel is
> assumed to follow the former pattern unless VM_SEQ_READ or
> VM_RAND_READ is present; the latter channel is assumed to follow the
> latter pattern unless outlying refaults have been observed.
>
> The next patch will address the "outlying refaults". A few macros,
> i.e., LRU_REFS_*, used later are added in this patch to make the
> patchset less diffy.
>
> A page is added to the youngest generation on faulting. The aging
> needs to check the accessed bit at least twice before handing this
> page over to the eviction. The first check takes care of the accessed
> bit set on the initial fault; the second check makes sure this page
> has not been used since then. This protocol, AKA second chance,
> requires a minimum of two generations, hence MIN_NR_GENS.
>
> [1] https://dl.acm.org/doi/10.1145/3297858.3304053
> [2] https://dl.acm.org/doi/10.1145/3503222.3507731
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
>  fs/fuse/dev.c                     |   3 +-
>  include/linux/mm.h                |   2 +
>  include/linux/mm_inline.h         | 176 ++++++++++++++++++++++++++++++
>  include/linux/mmzone.h            |  94 ++++++++++++++++
>  include/linux/page-flags-layout.h |  11 +-
>  include/linux/page-flags.h        |   4 +-
>  include/linux/sched.h             |   4 +
>  kernel/bounds.c                   |   7 ++
>  mm/Kconfig                        |  10 ++
>  mm/huge_memory.c                  |   3 +-
>  mm/memcontrol.c                   |   2 +
>  mm/memory.c                       |  25 +++++
>  mm/mm_init.c                      |   6 +-
>  mm/mmzone.c                       |   2 +
>  mm/swap.c                         |   9 +-
>  mm/vmscan.c                       |  73 +++++++++++++
>  16 files changed, 418 insertions(+), 13 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 592730fd6e42..e7c0aa6d61ce 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -785,7 +785,8 @@ static int fuse_check_page(struct page *page)
>                1 << PG_active |
>                1 << PG_workingset |
>                1 << PG_reclaim |
> -              1 << PG_waiters))) {
> +              1 << PG_waiters |
> +              LRU_GEN_MASK | LRU_REFS_MASK))) {
>                 dump_page(page, "fuse: trying to steal weird page");
>                 return 1;
>         }
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5744a3fc4716..c1162659d824 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1032,6 +1032,8 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
>  #define ZONES_PGOFF            (NODES_PGOFF - ZONES_WIDTH)
>  #define LAST_CPUPID_PGOFF      (ZONES_PGOFF - LAST_CPUPID_WIDTH)
>  #define KASAN_TAG_PGOFF                (LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH)
> +#define LRU_GEN_PGOFF          (KASAN_TAG_PGOFF - LRU_GEN_WIDTH)
> +#define LRU_REFS_PGOFF         (LRU_GEN_PGOFF - LRU_REFS_WIDTH)
>
>  /*
>   * Define the bit shifts to access each section.  For non-existent
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 2c24f5ac3e2a..e3594171b421 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -38,6 +38,9 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
>  {
>         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>
> +       lockdep_assert_held(&lruvec->lru_lock);
> +       WARN_ON_ONCE(nr_pages != (int)nr_pages);
> +
>         __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
>         __mod_zone_page_state(&pgdat->node_zones[zid],
>                                 NR_ZONE_LRU_BASE + lru, nr_pages);
> @@ -99,11 +102,178 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio)
>         return lru;
>  }
>
> +#ifdef CONFIG_LRU_GEN
> +
> +static inline bool lru_gen_enabled(void)
> +{
> +       return true;
> +}
> +
> +static inline bool lru_gen_in_fault(void)
> +{
> +       return current->in_lru_fault;
> +}
> +
> +static inline int lru_gen_from_seq(unsigned long seq)
> +{
> +       return seq % MAX_NR_GENS;
> +}
> +
> +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
> +{
> +       unsigned long max_seq = lruvec->lrugen.max_seq;
> +
> +       VM_BUG_ON(gen >= MAX_NR_GENS);
> +
> +       /* see the comment on MIN_NR_GENS */
> +       return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
> +}
> +
> +static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *folio,
> +                                      int old_gen, int new_gen)
> +{
> +       int type = folio_is_file_lru(folio);
> +       int zone = folio_zonenum(folio);
> +       int delta = folio_nr_pages(folio);
> +       enum lru_list lru = type * LRU_INACTIVE_FILE;
> +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +       VM_BUG_ON(old_gen != -1 && old_gen >= MAX_NR_GENS);
> +       VM_BUG_ON(new_gen != -1 && new_gen >= MAX_NR_GENS);
> +       VM_BUG_ON(old_gen == -1 && new_gen == -1);
> +
> +       if (old_gen >= 0)
> +               WRITE_ONCE(lrugen->nr_pages[old_gen][type][zone],
> +                          lrugen->nr_pages[old_gen][type][zone] - delta);
> +       if (new_gen >= 0)
> +               WRITE_ONCE(lrugen->nr_pages[new_gen][type][zone],
> +                          lrugen->nr_pages[new_gen][type][zone] + delta);
> +
> +       /* addition */
> +       if (old_gen < 0) {
> +               if (lru_gen_is_active(lruvec, new_gen))
> +                       lru += LRU_ACTIVE;
> +               __update_lru_size(lruvec, lru, zone, delta);
> +               return;
> +       }
> +
> +       /* deletion */
> +       if (new_gen < 0) {
> +               if (lru_gen_is_active(lruvec, old_gen))
> +                       lru += LRU_ACTIVE;
> +               __update_lru_size(lruvec, lru, zone, -delta);
> +               return;
> +       }
> +}
> +
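
A small userspace sketch of how a generation maps onto the legacy
active/inactive counters, mirroring the logic quoted above. The toy_*
and TOY_* names are hypothetical, and the enum ordering (inactive anon,
active anon, inactive file, active file) reflects my understanding of
enum lru_list rather than anything stated in this patch.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NR_GENS 4U

/* Assumed ordering of the legacy LRU lists. */
enum toy_lru_list {
	TOY_INACTIVE_ANON,
	TOY_ACTIVE_ANON,
	TOY_INACTIVE_FILE,
	TOY_ACTIVE_FILE,
};
#define TOY_LRU_ACTIVE 1

static int toy_gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}

/* The two youngest generations count as active, the rest as inactive. */
static bool toy_gen_is_active(unsigned long max_seq, int gen)
{
	assert(gen >= 0 && gen < (int)MAX_NR_GENS);
	return gen == toy_gen_from_seq(max_seq) ||
	       gen == toy_gen_from_seq(max_seq - 1);
}

/* type is 0 for anon and 1 for file, as in the quoted code. */
static enum toy_lru_list toy_lru_for_gen(unsigned long max_seq, int gen, int type)
{
	int lru = type * TOY_INACTIVE_FILE;

	if (toy_gen_is_active(max_seq, gen))
		lru += TOY_LRU_ACTIVE;
	return (enum toy_lru_list)lru;
}
```

With max_seq = 4, indexes 0 (seq 4) and 3 (seq 3) are accounted as
active, while indexes 1 and 2 are accounted as inactive.
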
> +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> +{
> +       int gen;
> +       unsigned long old_flags, new_flags;
> +       int type = folio_is_file_lru(folio);
> +       int zone = folio_zonenum(folio);
> +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +       if (folio_test_unevictable(folio))
> +               return false;
> +       /*
> +        * There are three common cases for this page:
> +        * 1. If it's hot, e.g., freshly faulted in or previously hot and
> +        *    migrated, add it to the youngest generation.

Usually, a page is not active when it is first faulted in. It can become
active only once a second access has been detected.


> +        * 2. If it's cold but can't be evicted immediately, i.e., an anon page
> +        *    not in swapcache or a dirty page pending writeback, add it to the
> +        *    second oldest generation.
> +        * 3. Everything else (clean, cold) is added to the oldest generation.
> +        */
> +       if (folio_test_active(folio))
> +               gen = lru_gen_from_seq(lrugen->max_seq);
> +       else if ((type == LRU_GEN_ANON && !folio_test_swapcache(folio)) ||
> +                (folio_test_reclaim(folio) &&
> +                 (folio_test_dirty(folio) || folio_test_writeback(folio))))
> +               gen = lru_gen_from_seq(lrugen->min_seq[type] + 1);
> +       else
> +               gen = lru_gen_from_seq(lrugen->min_seq[type]);
> +
> +       do {
> +               new_flags = old_flags = READ_ONCE(folio->flags);
> +               VM_BUG_ON_FOLIO(new_flags & LRU_GEN_MASK, folio);
> +
> +               /* see the comment on MIN_NR_GENS */
> +               new_flags &= ~(LRU_GEN_MASK | BIT(PG_active));
> +               new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
> +       } while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
> +
> +       lru_gen_update_size(lruvec, folio, -1, gen);
> +       /* for folio_rotate_reclaimable() */
> +       if (reclaiming)
> +               list_add_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
> +       else
> +               list_add(&folio->lru, &lrugen->lists[gen][type][zone]);
> +
> +       return true;
> +}
> +
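
The three placement cases in lru_gen_add_folio() can be sketched in
userspace as below. struct toy_folio and toy_pick_seq() are hypothetical
stand-ins for the folio flag tests in the quoted code; the sketch
returns the sequence number rather than the truncated index to keep the
cases easy to read.

```c
#include <assert.h>
#include <stdbool.h>

enum { TOY_ANON, TOY_FILE };

/* Hypothetical summary of the folio state the quoted code looks at. */
struct toy_folio {
	int type; /* TOY_ANON or TOY_FILE */
	bool active;
	bool swapcache;
	bool reclaim;
	bool dirty;
	bool writeback;
};

/*
 * Pick the generation for a folio being added:
 * 1. active folios go to the youngest generation;
 * 2. cold folios that cannot be evicted right away go to the second
 *    oldest generation;
 * 3. everything else goes to the oldest generation.
 */
static unsigned long toy_pick_seq(const struct toy_folio *f,
				  unsigned long max_seq,
				  const unsigned long min_seq[2])
{
	assert(f->type == TOY_ANON || f->type == TOY_FILE);

	if (f->active)
		return max_seq;
	if ((f->type == TOY_ANON && !f->swapcache) ||
	    (f->reclaim && (f->dirty || f->writeback)))
		return min_seq[f->type] + 1;
	return min_seq[f->type];
}
```

Note that the first case keys off the active flag, so whether a freshly
faulted page lands in the youngest generation depends on whether that
flag was set before the page was added.
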
> +static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> +{
> +       int gen;
> +       unsigned long old_flags, new_flags;
> +
> +       do {
> +               new_flags = old_flags = READ_ONCE(folio->flags);
> +               if (!(new_flags & LRU_GEN_MASK))
> +                       return false;
> +
> +               VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
> +               VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
> +
> +               gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> +
> +               new_flags &= ~LRU_GEN_MASK;
> +               /* for shrink_page_list() */
> +               if (reclaiming)
> +                       new_flags &= ~(BIT(PG_referenced) | BIT(PG_reclaim));
> +               else if (lru_gen_is_active(lruvec, gen))
> +                       new_flags |= BIT(PG_active);
> +       } while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
> +
> +       lru_gen_update_size(lruvec, folio, gen, -1);
> +       list_del(&folio->lru);
> +
> +       return true;
> +}
> +
> +#else
> +
> +static inline bool lru_gen_enabled(void)
> +{
> +       return false;
> +}
> +
> +static inline bool lru_gen_in_fault(void)
> +{
> +       return false;
> +}
> +
> +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> +{
> +       return false;
> +}
> +
> +static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> +{
> +       return false;
> +}
> +
> +#endif /* CONFIG_LRU_GEN */
> +
>  static __always_inline
>  void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
>  {
>         enum lru_list lru = folio_lru_list(folio);
>
> +       if (lru_gen_add_folio(lruvec, folio, false))
> +               return;
> +
>         update_lru_size(lruvec, lru, folio_zonenum(folio),
>                         folio_nr_pages(folio));
>         list_add(&folio->lru, &lruvec->lists[lru]);
> @@ -120,6 +290,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
>  {
>         enum lru_list lru = folio_lru_list(folio);
>
> +       if (lru_gen_add_folio(lruvec, folio, true))
> +               return;
> +
>         update_lru_size(lruvec, lru, folio_zonenum(folio),
>                         folio_nr_pages(folio));
>         list_add_tail(&folio->lru, &lruvec->lists[lru]);
> @@ -134,6 +307,9 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
>  static __always_inline
>  void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
>  {
> +       if (lru_gen_del_folio(lruvec, folio, false))
> +               return;
> +
>         list_del(&folio->lru);
>         update_lru_size(lruvec, folio_lru_list(folio), folio_zonenum(folio),
>                         -folio_nr_pages(folio));
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index aed44e9b5d89..a88e27d85693 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -303,6 +303,96 @@ enum lruvec_flags {
>                                          */
>  };
>
> +#endif /* !__GENERATING_BOUNDS_H */
> +
> +/*
> + * Evictable pages are divided into multiple generations. The youngest and the
> + * oldest generation numbers, max_seq and min_seq, are monotonically increasing.
> + * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An
> + * offset within MAX_NR_GENS, gen, indexes the LRU list of the corresponding
> + * generation. The gen counter in folio->flags stores gen+1 while a page is on
> + * one of lrugen->lists[]. Otherwise it stores 0.
> + *
> + * A page is added to the youngest generation on faulting. The aging needs to
> + * check the accessed bit at least twice before handing this page over to the
> + * eviction. The first check takes care of the accessed bit set on the initial
> + * fault; the second check makes sure this page hasn't been used since then.
> + * This process, AKA second chance, requires a minimum of two generations,
> + * hence MIN_NR_GENS. And to maintain ABI compatibility with the active/inactive
> + * LRU, these two generations are considered active; the rest of generations, if
> + * they exist, are considered inactive. See lru_gen_is_active(). PG_active is
> + * always cleared while a page is on one of lrugen->lists[] so that the aging
> + * needs not to worry about it. And it's set again when a page considered active
> + * is isolated for non-reclaiming purposes, e.g., migration. See
> + * lru_gen_add_folio() and lru_gen_del_folio().
> + *
> + * MAX_NR_GENS is set to 4 so that the multi-gen LRU has twice of the categories
> + * of the active/inactive LRU.
> + *
> + */
> +#define MIN_NR_GENS            2U
> +#define MAX_NR_GENS            4U
> +
> +#ifndef __GENERATING_BOUNDS_H
> +
> +struct lruvec;
> +
> +#define LRU_GEN_MASK           ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
> +#define LRU_REFS_MASK          ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)

The commit log says nothing about the REFS flags and tiers, but the
code is here. Either the commit log is missing something, or this code
belongs in the next patch?

> +
> +#ifdef CONFIG_LRU_GEN
> +
> +enum {
> +       LRU_GEN_ANON,
> +       LRU_GEN_FILE,
> +};
> +
> +/*
> + * The youngest generation number is stored in max_seq for both anon and file
> + * types as they are aged on an equal footing. The oldest generation numbers are
> + * stored in min_seq[] separately for anon and file types as clean file pages
> + * can be evicted regardless of swap constraints.
> + *
> + * Normally anon and file min_seq are in sync. But if swapping is constrained,
> + * e.g., out of swap space, file min_seq is allowed to advance and leave anon
> + * min_seq behind.
> + */
> +struct lru_gen_struct {
> +       /* the aging increments the youngest generation number */
> +       unsigned long max_seq;
> +       /* the eviction increments the oldest generation numbers */
> +       unsigned long min_seq[ANON_AND_FILE];
> +       /* the multi-gen LRU lists */
> +       struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
> +       /* the sizes of the above lists */
> +       unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
> +};
> +
> +void lru_gen_init_lruvec(struct lruvec *lruvec);
> +
> +#ifdef CONFIG_MEMCG
> +void lru_gen_init_memcg(struct mem_cgroup *memcg);
> +void lru_gen_exit_memcg(struct mem_cgroup *memcg);
> +#endif
> +
> +#else /* !CONFIG_LRU_GEN */
> +
> +static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
> +{
> +}
> +
> +#ifdef CONFIG_MEMCG
> +static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
> +{
> +}
> +
> +static inline void lru_gen_exit_memcg(struct mem_cgroup *memcg)
> +{
> +}
> +#endif
> +
> +#endif /* CONFIG_LRU_GEN */
> +
>  struct lruvec {
>         struct list_head                lists[NR_LRU_LISTS];
>         /* per lruvec lru_lock for memcg */
> @@ -320,6 +410,10 @@ struct lruvec {
>         unsigned long                   refaults[ANON_AND_FILE];
>         /* Various lruvec state flags (enum lruvec_flags) */
>         unsigned long                   flags;
> +#ifdef CONFIG_LRU_GEN
> +       /* evictable pages divided into generations */
> +       struct lru_gen_struct           lrugen;
> +#endif
>  #ifdef CONFIG_MEMCG
>         struct pglist_data *pgdat;
>  #endif
> diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
> index ef1e3e736e14..c1946cdb845f 100644
> --- a/include/linux/page-flags-layout.h
> +++ b/include/linux/page-flags-layout.h
> @@ -55,7 +55,8 @@
>  #define SECTIONS_WIDTH         0
>  #endif
>
> -#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
> +#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_REFS_WIDTH + SECTIONS_WIDTH + NODES_SHIFT \
> +       <= BITS_PER_LONG - NR_PAGEFLAGS
>  #define NODES_WIDTH            NODES_SHIFT
>  #elif defined(CONFIG_SPARSEMEM_VMEMMAP)
>  #error "Vmemmap: No space for nodes field in page flags"
> @@ -89,8 +90,8 @@
>  #define LAST_CPUPID_SHIFT 0
>  #endif
>
> -#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT \
> -       <= BITS_PER_LONG - NR_PAGEFLAGS
> +#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_REFS_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
> +       KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
>  #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
>  #else
>  #define LAST_CPUPID_WIDTH 0
> @@ -100,8 +101,8 @@
>  #define LAST_CPUPID_NOT_IN_PAGE_FLAGS
>  #endif
>
> -#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH \
> -       > BITS_PER_LONG - NR_PAGEFLAGS
> +#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_REFS_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
> +       KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
>  #error "Not enough bits in page flags"
>  #endif
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 1c3b6e5c8bfd..a95518ca98eb 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -935,7 +935,7 @@ __PAGEFLAG(Isolated, isolated, PF_ANY);
>          1UL << PG_private      | 1UL << PG_private_2   |       \
>          1UL << PG_writeback    | 1UL << PG_reserved    |       \
>          1UL << PG_slab         | 1UL << PG_active      |       \
> -        1UL << PG_unevictable  | __PG_MLOCKED)
> +        1UL << PG_unevictable  | __PG_MLOCKED | LRU_GEN_MASK)
>
>  /*
>   * Flags checked when a page is prepped for return by the page allocator.
> @@ -946,7 +946,7 @@ __PAGEFLAG(Isolated, isolated, PF_ANY);
>   * alloc-free cycle to prevent from reusing the page.
>   */
>  #define PAGE_FLAGS_CHECK_AT_PREP       \
> -       (PAGEFLAGS_MASK & ~__PG_HWPOISON)
> +       ((PAGEFLAGS_MASK & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK)
>
>  #define PAGE_FLAGS_PRIVATE                             \
>         (1UL << PG_private | 1UL << PG_private_2)
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 75ba8aa60248..e7fe784b11aa 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -914,6 +914,10 @@ struct task_struct {
>  #ifdef CONFIG_MEMCG
>         unsigned                        in_user_fault:1;
>  #endif
> +#ifdef CONFIG_LRU_GEN
> +       /* whether the LRU algorithm may apply to this access */
> +       unsigned                        in_lru_fault:1;
> +#endif
>  #ifdef CONFIG_COMPAT_BRK
>         unsigned                        brk_randomized:1;
>  #endif
> diff --git a/kernel/bounds.c b/kernel/bounds.c
> index 9795d75b09b2..e08fb89f87f4 100644
> --- a/kernel/bounds.c
> +++ b/kernel/bounds.c
> @@ -22,6 +22,13 @@ int main(void)
>         DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
>  #endif
>         DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t));
> +#ifdef CONFIG_LRU_GEN
> +       DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1));
> +       DEFINE(LRU_REFS_WIDTH, 0);
> +#else
> +       DEFINE(LRU_GEN_WIDTH, 0);
> +       DEFINE(LRU_REFS_WIDTH, 0);
> +#endif
>         /* End of constants */
>
>         return 0;
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 3326ee3903f3..747ab1690bcf 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -892,6 +892,16 @@ config ANON_VMA_NAME
>           area from being merged with adjacent virtual memory areas due to the
>           difference in their name.
>
> +# the multi-gen LRU {
> +config LRU_GEN
> +       bool "Multi-Gen LRU"
> +       depends on MMU
> +       # the following options can use up the spare bits in page flags
> +       depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
> +       help
> +         A high performance LRU implementation for memory overcommit.
> +# }
> +
>  source "mm/damon/Kconfig"
>
>  endmenu
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 406a3c28c026..3df389fd307f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2364,7 +2364,8 @@ static void __split_huge_page_tail(struct page *head, int tail,
>  #ifdef CONFIG_64BIT
>                          (1L << PG_arch_2) |
>  #endif
> -                        (1L << PG_dirty)));
> +                        (1L << PG_dirty) |
> +                        LRU_GEN_MASK | LRU_REFS_MASK));
>
>         /* ->mapping in first tail page is compound_mapcount */
>         VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 36e9f38c919d..3fcbfeda259b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5121,6 +5121,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
>
>  static void mem_cgroup_free(struct mem_cgroup *memcg)
>  {
> +       lru_gen_exit_memcg(memcg);
>         memcg_wb_domain_exit(memcg);
>         __mem_cgroup_free(memcg);
>  }
> @@ -5180,6 +5181,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>         memcg->deferred_split_queue.split_queue_len = 0;
>  #endif
>         idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
> +       lru_gen_init_memcg(memcg);
>         return memcg;
>  fail:
>         mem_cgroup_id_remove(memcg);
> diff --git a/mm/memory.c b/mm/memory.c
> index a7379196a47e..d27e5f1a2533 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4754,6 +4754,27 @@ static inline void mm_account_fault(struct pt_regs *regs,
>                 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
>  }
>
> +#ifdef CONFIG_LRU_GEN
> +static void lru_gen_enter_fault(struct vm_area_struct *vma)
> +{
> +       /* the LRU algorithm doesn't apply to sequential or random reads */
> +       current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ));
> +}
> +
> +static void lru_gen_exit_fault(void)
> +{
> +       current->in_lru_fault = false;
> +}
> +#else
> +static void lru_gen_enter_fault(struct vm_area_struct *vma)
> +{
> +}
> +
> +static void lru_gen_exit_fault(void)
> +{
> +}
> +#endif /* CONFIG_LRU_GEN */
> +
>  /*
>   * By the time we get here, we already hold the mm semaphore
>   *
> @@ -4785,11 +4806,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>         if (flags & FAULT_FLAG_USER)
>                 mem_cgroup_enter_user_fault();
>
> +       lru_gen_enter_fault(vma);
> +
>         if (unlikely(is_vm_hugetlb_page(vma)))
>                 ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
>         else
>                 ret = __handle_mm_fault(vma, address, flags);
>
> +       lru_gen_exit_fault();
> +
>         if (flags & FAULT_FLAG_USER) {
>                 mem_cgroup_exit_user_fault();
>                 /*
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 9ddaf0e1b0ab..0d7b2bd2454a 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -65,14 +65,16 @@ void __init mminit_verify_pageflags_layout(void)
>
>         shift = 8 * sizeof(unsigned long);
>         width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH
> -               - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH;
> +               - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH - LRU_GEN_WIDTH - LRU_REFS_WIDTH;
>         mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
> -               "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Flags %d\n",
> +               "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Gen %d Tier %d Flags %d\n",
>                 SECTIONS_WIDTH,
>                 NODES_WIDTH,
>                 ZONES_WIDTH,
>                 LAST_CPUPID_WIDTH,
>                 KASAN_TAG_WIDTH,
> +               LRU_GEN_WIDTH,
> +               LRU_REFS_WIDTH,
>                 NR_PAGEFLAGS);
>         mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
>                 "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d\n",
> diff --git a/mm/mmzone.c b/mm/mmzone.c
> index eb89d6e018e2..2ec0d7793424 100644
> --- a/mm/mmzone.c
> +++ b/mm/mmzone.c
> @@ -81,6 +81,8 @@ void lruvec_init(struct lruvec *lruvec)
>
>         for_each_lru(lru)
>                 INIT_LIST_HEAD(&lruvec->lists[lru]);
> +
> +       lru_gen_init_lruvec(lruvec);
>  }
>
>  #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)
> diff --git a/mm/swap.c b/mm/swap.c
> index bcf3ac288b56..e5f2ab3dab4a 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -462,6 +462,11 @@ void folio_add_lru(struct folio *folio)
>         VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio);
>         VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
>
> +       /* see the comment in lru_gen_add_folio() */
> +       if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> +           lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
> +               folio_set_active(folio);

So is this the magic that makes a folio active as soon as it is
faulted in? I really don't think the comment below is good.
Could we state the purpose directly and explicitly?

 /* see the comment in lru_gen_add_folio() */
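
To make the condition explicit, here is a small userspace model (not
kernel code) of the check quoted above, combined with
lru_gen_enter_fault() from the mm/memory.c hunk. The struct, its field
names and the flag values are assumptions made for this sketch, and it
assumes lru_gen_enabled() returns true.

```c
#include <stdbool.h>

/* Illustrative flag values; only their distinctness matters here. */
#define VM_SEQ_READ	0x1UL
#define VM_RAND_READ	0x2UL

/* Hypothetical stand-in for the state the quoted check consults. */
struct add_ctx {
	bool handling_fault;	/* handle_mm_fault() is on the stack */
	unsigned long vm_flags;	/* flags of the faulting VMA */
	bool unevictable;	/* models folio_test_unevictable() */
	bool memalloc;		/* models current->flags & PF_MEMALLOC */
};

/* Models current->in_lru_fault as set by lru_gen_enter_fault(). */
static bool in_lru_fault(const struct add_ctx *c)
{
	return c->handling_fault &&
	       !(c->vm_flags & (VM_SEQ_READ | VM_RAND_READ));
}

/* Models the condition under which folio_add_lru() sets PG_active. */
static bool starts_active(const struct add_ctx *c)
{
	return !c->unevictable && in_lru_fault(c) && !c->memalloc;
}
```

Under this model, a folio added during an ordinary fault starts
active, while one added outside a fault, on a VMA marked with
VM_SEQ_READ or VM_RAND_READ, or from a PF_MEMALLOC context does not.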

> +
>         folio_get(folio);
>         local_lock(&lru_pvecs.lock);
>         pvec = this_cpu_ptr(&lru_pvecs.lru_add);
> @@ -563,7 +568,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
>
>  static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
>  {
> -       if (PageActive(page) && !PageUnevictable(page)) {
> +       if (!PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
>                 int nr_pages = thp_nr_pages(page);
>
>                 del_page_from_lru_list(page, lruvec);
> @@ -677,7 +682,7 @@ void deactivate_file_page(struct page *page)
>   */
>  void deactivate_page(struct page *page)
>  {
> -       if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> +       if (PageLRU(page) && !PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
>                 struct pagevec *pvec;
>
>                 local_lock(&lru_pvecs.lock);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8e744cdf802f..65eb668abf2d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3042,6 +3042,79 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
>         return can_demote(pgdat->node_id, sc);
>  }
>
> +#ifdef CONFIG_LRU_GEN
> +
> +/******************************************************************************
> + *                          shorthand helpers
> + ******************************************************************************/
> +
> +#define for_each_gen_type_zone(gen, type, zone)                                \
> +       for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)                   \
> +               for ((type) = 0; (type) < ANON_AND_FILE; (type)++)      \
> +                       for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
> +
> +static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
> +{
> +       struct pglist_data *pgdat = NODE_DATA(nid);
> +
> +#ifdef CONFIG_MEMCG
> +       if (memcg) {
> +               struct lruvec *lruvec = &memcg->nodeinfo[nid]->lruvec;
> +
> +               /* for hotadd_new_pgdat() */
> +               if (!lruvec->pgdat)
> +                       lruvec->pgdat = pgdat;
> +
> +               return lruvec;
> +       }
> +#endif
> +       return pgdat ? &pgdat->__lruvec : NULL;
> +}
> +
> +/******************************************************************************
> + *                          initialization
> + ******************************************************************************/
> +
> +void lru_gen_init_lruvec(struct lruvec *lruvec)
> +{
> +       int gen, type, zone;
> +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +       lrugen->max_seq = MIN_NR_GENS + 1;
> +
> +       for_each_gen_type_zone(gen, type, zone)
> +               INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
> +}
> +
> +#ifdef CONFIG_MEMCG
> +void lru_gen_init_memcg(struct mem_cgroup *memcg)
> +{
> +}
> +
> +void lru_gen_exit_memcg(struct mem_cgroup *memcg)
> +{
> +       int nid;
> +
> +       for_each_node(nid) {
> +               struct lruvec *lruvec = get_lruvec(memcg, nid);
> +
> +               VM_BUG_ON(memchr_inv(lruvec->lrugen.nr_pages, 0,
> +                                    sizeof(lruvec->lrugen.nr_pages)));
> +       }
> +}
> +#endif
> +
> +static int __init init_lru_gen(void)
> +{
> +       BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS);
> +       BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
> +
> +       return 0;
> +};
> +late_initcall(init_lru_gen);
> +
> +#endif /* CONFIG_LRU_GEN */
> +
>  static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  {
>         unsigned long nr[NR_LRU_LISTS];
> --
> 2.35.1.616.g0bdcbb4464-goog
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 05/14] mm: multi-gen LRU: groundwork
@ 2022-03-16 23:25     ` Barry Song
  0 siblings, 0 replies; 120+ messages in thread
From: Barry Song @ 2022-03-16 23:25 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Wed, Mar 9, 2022 at 3:48 PM Yu Zhao <yuzhao@google.com> wrote:
>
> Evictable pages are divided into multiple generations for each lruvec.
> The youngest generation number is stored in lrugen->max_seq for both
> anon and file types as they are aged on an equal footing. The oldest
> generation numbers are stored in lrugen->min_seq[] separately for anon
> and file types as clean file pages can be evicted regardless of swap
> constraints. These three variables are monotonically increasing.
>
> Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
> in order to fit into the gen counter in folio->flags. Each truncated
> generation number is an index to lrugen->lists[]. The sliding window
> technique is used to track at least MIN_NR_GENS and at most
> MAX_NR_GENS generations. The gen counter stores a value within [1,
> MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
> stores 0.
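
The truncation and the gen+1 encoding described above can be sketched
in userspace as follows. This is an illustration rather than the
kernel code: EXAMPLE_LRU_GEN_PGOFF is a hypothetical offset, and the
helper names are made up for the sketch.

```c
#define MAX_NR_GENS		4U
#define LRU_GEN_WIDTH		3U	/* order_base_2(MAX_NR_GENS + 1) */
#define EXAMPLE_LRU_GEN_PGOFF	37U	/* hypothetical; config-dependent */
#define LRU_GEN_MASK \
	(((1ULL << LRU_GEN_WIDTH) - 1) << EXAMPLE_LRU_GEN_PGOFF)

/* Truncate a monotonically increasing sequence number to a list index. */
static int gen_from_seq(unsigned long seq)
{
	return (int)(seq % MAX_NR_GENS);
}

/* Store gen + 1 in the counter, so that 0 can mean "not on a list". */
static unsigned long long flags_set_gen(unsigned long long flags, int gen)
{
	flags &= ~LRU_GEN_MASK;
	return flags | ((gen + 1ULL) << EXAMPLE_LRU_GEN_PGOFF);
}

/* Return the list index, or -1 if the counter holds 0. */
static int flags_get_gen(unsigned long long flags)
{
	return (int)((flags & LRU_GEN_MASK) >> EXAMPLE_LRU_GEN_PGOFF) - 1;
}
```

The -1 result corresponds to the value lru_gen_update_size() takes for
"no generation" in old_gen or new_gen.
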
>
> There are two conceptually independent procedures: "the aging", which
> produces young generations, and "the eviction", which consumes old
> generations. They form a closed-loop system, i.e., "the page reclaim".
> Both procedures can be invoked from userspace for the purposes of
> working set estimation and proactive reclaim. These features are
> required to optimize job scheduling (bin packing) in data centers. The
> variable size of the sliding window is designed for such use cases
> [1][2].
>
> To avoid confusion, the terms "hot" and "cold" will be applied to the
> multi-gen LRU, as a new convention; the terms "active" and "inactive"
> will be applied to the active/inactive LRU, as usual.
>
> The protection of hot pages and the selection of cold pages are based
> on page access channels and patterns. There are two access channels:
> one through page tables and the other through file descriptors. The
> protection of the former channel is by design stronger because:
> 1. The uncertainty in determining the access patterns of the former
>    channel is higher due to the approximation of the accessed bit.
> 2. The cost of evicting the former channel is higher due to the TLB
>    flushes required and the likelihood of encountering the dirty bit.
> 3. The penalty of underprotecting the former channel is higher because
>    applications usually do not prepare themselves for major page
>    faults like they do for blocked I/O. E.g., GUI applications
>    commonly use dedicated I/O threads to avoid blocking the rendering
>    threads.
> There are also two access patterns: one with temporal locality and the
> other without. For the reasons listed above, the former channel is
> assumed to follow the former pattern unless VM_SEQ_READ or
> VM_RAND_READ is present; the latter channel is assumed to follow the
> latter pattern unless outlying refaults have been observed.
>
> The next patch will address the "outlying refaults". A few macros,
> i.e., LRU_REFS_*, used later are added in this patch to make the
> patchset less diffy.
>
> A page is added to the youngest generation on faulting. The aging
> needs to check the accessed bit at least twice before handing this
> page over to the eviction. The first check takes care of the accessed
> bit set on the initial fault; the second check makes sure this page
> has not been used since then. This protocol, AKA second chance,
> requires a minimum of two generations, hence MIN_NR_GENS.
>
> [1] https://dl.acm.org/doi/10.1145/3297858.3304053
> [2] https://dl.acm.org/doi/10.1145/3503222.3507731
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
>  fs/fuse/dev.c                     |   3 +-
>  include/linux/mm.h                |   2 +
>  include/linux/mm_inline.h         | 176 ++++++++++++++++++++++++++++++
>  include/linux/mmzone.h            |  94 ++++++++++++++++
>  include/linux/page-flags-layout.h |  11 +-
>  include/linux/page-flags.h        |   4 +-
>  include/linux/sched.h             |   4 +
>  kernel/bounds.c                   |   7 ++
>  mm/Kconfig                        |  10 ++
>  mm/huge_memory.c                  |   3 +-
>  mm/memcontrol.c                   |   2 +
>  mm/memory.c                       |  25 +++++
>  mm/mm_init.c                      |   6 +-
>  mm/mmzone.c                       |   2 +
>  mm/swap.c                         |   9 +-
>  mm/vmscan.c                       |  73 +++++++++++++
>  16 files changed, 418 insertions(+), 13 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 592730fd6e42..e7c0aa6d61ce 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -785,7 +785,8 @@ static int fuse_check_page(struct page *page)
>                1 << PG_active |
>                1 << PG_workingset |
>                1 << PG_reclaim |
> -              1 << PG_waiters))) {
> +              1 << PG_waiters |
> +              LRU_GEN_MASK | LRU_REFS_MASK))) {
>                 dump_page(page, "fuse: trying to steal weird page");
>                 return 1;
>         }
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5744a3fc4716..c1162659d824 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1032,6 +1032,8 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
>  #define ZONES_PGOFF            (NODES_PGOFF - ZONES_WIDTH)
>  #define LAST_CPUPID_PGOFF      (ZONES_PGOFF - LAST_CPUPID_WIDTH)
>  #define KASAN_TAG_PGOFF                (LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH)
> +#define LRU_GEN_PGOFF          (KASAN_TAG_PGOFF - LRU_GEN_WIDTH)
> +#define LRU_REFS_PGOFF         (LRU_GEN_PGOFF - LRU_REFS_WIDTH)
>
>  /*
>   * Define the bit shifts to access each section.  For non-existent
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 2c24f5ac3e2a..e3594171b421 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -38,6 +38,9 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
>  {
>         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>
> +       lockdep_assert_held(&lruvec->lru_lock);
> +       WARN_ON_ONCE(nr_pages != (int)nr_pages);
> +
>         __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
>         __mod_zone_page_state(&pgdat->node_zones[zid],
>                                 NR_ZONE_LRU_BASE + lru, nr_pages);
> @@ -99,11 +102,178 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio)
>         return lru;
>  }
>
> +#ifdef CONFIG_LRU_GEN
> +
> +static inline bool lru_gen_enabled(void)
> +{
> +       return true;
> +}
> +
> +static inline bool lru_gen_in_fault(void)
> +{
> +       return current->in_lru_fault;
> +}
> +
> +static inline int lru_gen_from_seq(unsigned long seq)
> +{
> +       return seq % MAX_NR_GENS;
> +}
> +
> +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
> +{
> +       unsigned long max_seq = lruvec->lrugen.max_seq;
> +
> +       VM_BUG_ON(gen >= MAX_NR_GENS);
> +
> +       /* see the comment on MIN_NR_GENS */
> +       return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
> +}
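
As a quick sanity check of which list indexes count as active, here is
a userspace sketch of the same test (illustrative only; it takes
max_seq as a plain argument instead of reading it from a lruvec, and
the function names are made up).

```c
#include <stdbool.h>

#define MAX_NR_GENS	4U

static int gen_from_seq(unsigned long seq)
{
	return (int)(seq % MAX_NR_GENS);
}

/* The two youngest generations, max_seq and max_seq - 1, are active. */
static bool gen_is_active(unsigned long max_seq, int gen)
{
	return gen == gen_from_seq(max_seq) ||
	       gen == gen_from_seq(max_seq - 1);
}
```

With the initial max_seq of MIN_NR_GENS + 1 = 3, indexes 3 and 2 are
active and 0 and 1 are not. After one more aging step (max_seq = 4)
the active indexes become 0 and 3, since the index wraps around.
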
> +
> +static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *folio,
> +                                      int old_gen, int new_gen)
> +{
> +       int type = folio_is_file_lru(folio);
> +       int zone = folio_zonenum(folio);
> +       int delta = folio_nr_pages(folio);
> +       enum lru_list lru = type * LRU_INACTIVE_FILE;
> +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +       VM_BUG_ON(old_gen != -1 && old_gen >= MAX_NR_GENS);
> +       VM_BUG_ON(new_gen != -1 && new_gen >= MAX_NR_GENS);
> +       VM_BUG_ON(old_gen == -1 && new_gen == -1);
> +
> +       if (old_gen >= 0)
> +               WRITE_ONCE(lrugen->nr_pages[old_gen][type][zone],
> +                          lrugen->nr_pages[old_gen][type][zone] - delta);
> +       if (new_gen >= 0)
> +               WRITE_ONCE(lrugen->nr_pages[new_gen][type][zone],
> +                          lrugen->nr_pages[new_gen][type][zone] + delta);
> +
> +       /* addition */
> +       if (old_gen < 0) {
> +               if (lru_gen_is_active(lruvec, new_gen))
> +                       lru += LRU_ACTIVE;
> +               __update_lru_size(lruvec, lru, zone, delta);
> +               return;
> +       }
> +
> +       /* deletion */
> +       if (new_gen < 0) {
> +               if (lru_gen_is_active(lruvec, old_gen))
> +                       lru += LRU_ACTIVE;
> +               __update_lru_size(lruvec, lru, zone, -delta);
> +               return;
> +       }
> +}
> +
> +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> +{
> +       int gen;
> +       unsigned long old_flags, new_flags;
> +       int type = folio_is_file_lru(folio);
> +       int zone = folio_zonenum(folio);
> +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +       if (folio_test_unevictable(folio))
> +               return false;
> +       /*
> +        * There are three common cases for this page:
> +        * 1. If it's hot, e.g., freshly faulted in or previously hot and
> +        *    migrated, add it to the youngest generation.

Usually, a page is not active when it is faulted in. It can become
active only once its second access is detected.


> +        * 2. If it's cold but can't be evicted immediately, i.e., an anon page
> +        *    not in swapcache or a dirty page pending writeback, add it to the
> +        *    second oldest generation.
> +        * 3. Everything else (clean, cold) is added to the oldest generation.
> +        */
> +       if (folio_test_active(folio))
> +               gen = lru_gen_from_seq(lrugen->max_seq);
> +       else if ((type == LRU_GEN_ANON && !folio_test_swapcache(folio)) ||
> +                (folio_test_reclaim(folio) &&
> +                 (folio_test_dirty(folio) || folio_test_writeback(folio))))
> +               gen = lru_gen_from_seq(lrugen->min_seq[type] + 1);
> +       else
> +               gen = lru_gen_from_seq(lrugen->min_seq[type]);
> +
> +       do {
> +               new_flags = old_flags = READ_ONCE(folio->flags);
> +               VM_BUG_ON_FOLIO(new_flags & LRU_GEN_MASK, folio);
> +
> +               /* see the comment on MIN_NR_GENS */
> +               new_flags &= ~(LRU_GEN_MASK | BIT(PG_active));
> +               new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
> +       } while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
> +
> +       lru_gen_update_size(lruvec, folio, -1, gen);
> +       /* for folio_rotate_reclaimable() */
> +       if (reclaiming)
> +               list_add_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
> +       else
> +               list_add(&folio->lru, &lrugen->lists[gen][type][zone]);
> +
> +       return true;
> +}
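
The three cases in the comment above can be modeled in userspace like
this. It is a sketch under stated assumptions: struct folio_state and
pick_gen() are hypothetical names, and the booleans stand in for the
folio_test_*() helpers used in the quoted code.

```c
#include <stdbool.h>

#define MAX_NR_GENS	4U

enum { LRU_GEN_ANON, LRU_GEN_FILE };

/* Hypothetical snapshot of the folio state the function looks at. */
struct folio_state {
	int type;		/* LRU_GEN_ANON or LRU_GEN_FILE */
	bool active;
	bool swapcache;
	bool reclaim;
	bool dirty;
	bool writeback;
};

static int pick_gen(const struct folio_state *f,
		    unsigned long max_seq, unsigned long min_seq)
{
	/* case 1: hot, goes to the youngest generation */
	if (f->active)
		return (int)(max_seq % MAX_NR_GENS);
	/* case 2: cold but not immediately evictable, second oldest */
	if ((f->type == LRU_GEN_ANON && !f->swapcache) ||
	    (f->reclaim && (f->dirty || f->writeback)))
		return (int)((min_seq + 1) % MAX_NR_GENS);
	/* case 3: everything else, oldest generation */
	return (int)(min_seq % MAX_NR_GENS);
}
```

Note that in this model, as in the quoted code, a dirty file folio
lands in the second oldest generation only if PG_reclaim is also set.
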
> +
> +static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> +{
> +       int gen;
> +       unsigned long old_flags, new_flags;
> +
> +       do {
> +               new_flags = old_flags = READ_ONCE(folio->flags);
> +               if (!(new_flags & LRU_GEN_MASK))
> +                       return false;
> +
> +               VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
> +               VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
> +
> +               gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> +
> +               new_flags &= ~LRU_GEN_MASK;
> +               /* for shrink_page_list() */
> +               if (reclaiming)
> +                       new_flags &= ~(BIT(PG_referenced) | BIT(PG_reclaim));
> +               else if (lru_gen_is_active(lruvec, gen))
> +                       new_flags |= BIT(PG_active);
> +       } while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
> +
> +       lru_gen_update_size(lruvec, folio, gen, -1);
> +       list_del(&folio->lru);
> +
> +       return true;
> +}
> +
> +#else
> +
> +static inline bool lru_gen_enabled(void)
> +{
> +       return false;
> +}
> +
> +static inline bool lru_gen_in_fault(void)
> +{
> +       return false;
> +}
> +
> +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> +{
> +       return false;
> +}
> +
> +static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> +{
> +       return false;
> +}
> +
> +#endif /* CONFIG_LRU_GEN */
> +
>  static __always_inline
>  void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
>  {
>         enum lru_list lru = folio_lru_list(folio);
>
> +       if (lru_gen_add_folio(lruvec, folio, false))
> +               return;
> +
>         update_lru_size(lruvec, lru, folio_zonenum(folio),
>                         folio_nr_pages(folio));
>         list_add(&folio->lru, &lruvec->lists[lru]);
> @@ -120,6 +290,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
>  {
>         enum lru_list lru = folio_lru_list(folio);
>
> +       if (lru_gen_add_folio(lruvec, folio, true))
> +               return;
> +
>         update_lru_size(lruvec, lru, folio_zonenum(folio),
>                         folio_nr_pages(folio));
>         list_add_tail(&folio->lru, &lruvec->lists[lru]);
> @@ -134,6 +307,9 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
>  static __always_inline
>  void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
>  {
> +       if (lru_gen_del_folio(lruvec, folio, false))
> +               return;
> +
>         list_del(&folio->lru);
>         update_lru_size(lruvec, folio_lru_list(folio), folio_zonenum(folio),
>                         -folio_nr_pages(folio));
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index aed44e9b5d89..a88e27d85693 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -303,6 +303,96 @@ enum lruvec_flags {
>                                          */
>  };
>
> +#endif /* !__GENERATING_BOUNDS_H */
> +
> +/*
> + * Evictable pages are divided into multiple generations. The youngest and the
> + * oldest generation numbers, max_seq and min_seq, are monotonically increasing.
> + * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An
> + * offset within MAX_NR_GENS, gen, indexes the LRU list of the corresponding
> + * generation. The gen counter in folio->flags stores gen+1 while a page is on
> + * one of lrugen->lists[]. Otherwise it stores 0.
> + *
> + * A page is added to the youngest generation on faulting. The aging needs to
> + * check the accessed bit at least twice before handing this page over to the
> + * eviction. The first check takes care of the accessed bit set on the initial
> + * fault; the second check makes sure this page hasn't been used since then.
> + * This process, AKA second chance, requires a minimum of two generations,
> + * hence MIN_NR_GENS. And to maintain ABI compatibility with the active/inactive
> + * LRU, these two generations are considered active; the rest of generations, if
> + * they exist, are considered inactive. See lru_gen_is_active(). PG_active is
> + * always cleared while a page is on one of lrugen->lists[] so that the aging
> + * needs not to worry about it. And it's set again when a page considered active
> + * is isolated for non-reclaiming purposes, e.g., migration. See
> + * lru_gen_add_folio() and lru_gen_del_folio().
> + *
> + * MAX_NR_GENS is set to 4 so that the multi-gen LRU has twice of the categories
> + * of the active/inactive LRU.
> + *
> + */
> +#define MIN_NR_GENS            2U
> +#define MAX_NR_GENS            4U
> +
> +#ifndef __GENERATING_BOUNDS_H
> +
> +struct lruvec;
> +
> +#define LRU_GEN_MASK           ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
> +#define LRU_REFS_MASK          ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)

The commit log says nothing about the REFS flags and tiers,
but the code is here. Either the commit log is missing something,
or the code belongs in the next patch?

> +
> +#ifdef CONFIG_LRU_GEN
> +
> +enum {
> +       LRU_GEN_ANON,
> +       LRU_GEN_FILE,
> +};
> +
> +/*
> + * The youngest generation number is stored in max_seq for both anon and file
> + * types as they are aged on an equal footing. The oldest generation numbers are
> + * stored in min_seq[] separately for anon and file types as clean file pages
> + * can be evicted regardless of swap constraints.
> + *
> + * Normally anon and file min_seq are in sync. But if swapping is constrained,
> + * e.g., out of swap space, file min_seq is allowed to advance and leave anon
> + * min_seq behind.
> + */
> +struct lru_gen_struct {
> +       /* the aging increments the youngest generation number */
> +       unsigned long max_seq;
> +       /* the eviction increments the oldest generation numbers */
> +       unsigned long min_seq[ANON_AND_FILE];
> +       /* the multi-gen LRU lists */
> +       struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
> +       /* the sizes of the above lists */
> +       unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
> +};
> +
> +void lru_gen_init_lruvec(struct lruvec *lruvec);
> +
> +#ifdef CONFIG_MEMCG
> +void lru_gen_init_memcg(struct mem_cgroup *memcg);
> +void lru_gen_exit_memcg(struct mem_cgroup *memcg);
> +#endif
> +
> +#else /* !CONFIG_LRU_GEN */
> +
> +static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
> +{
> +}
> +
> +#ifdef CONFIG_MEMCG
> +static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
> +{
> +}
> +
> +static inline void lru_gen_exit_memcg(struct mem_cgroup *memcg)
> +{
> +}
> +#endif
> +
> +#endif /* CONFIG_LRU_GEN */
> +
>  struct lruvec {
>         struct list_head                lists[NR_LRU_LISTS];
>         /* per lruvec lru_lock for memcg */
> @@ -320,6 +410,10 @@ struct lruvec {
>         unsigned long                   refaults[ANON_AND_FILE];
>         /* Various lruvec state flags (enum lruvec_flags) */
>         unsigned long                   flags;
> +#ifdef CONFIG_LRU_GEN
> +       /* evictable pages divided into generations */
> +       struct lru_gen_struct           lrugen;
> +#endif
>  #ifdef CONFIG_MEMCG
>         struct pglist_data *pgdat;
>  #endif
> diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
> index ef1e3e736e14..c1946cdb845f 100644
> --- a/include/linux/page-flags-layout.h
> +++ b/include/linux/page-flags-layout.h
> @@ -55,7 +55,8 @@
>  #define SECTIONS_WIDTH         0
>  #endif
>
> -#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
> +#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_REFS_WIDTH + SECTIONS_WIDTH + NODES_SHIFT \
> +       <= BITS_PER_LONG - NR_PAGEFLAGS
>  #define NODES_WIDTH            NODES_SHIFT
>  #elif defined(CONFIG_SPARSEMEM_VMEMMAP)
>  #error "Vmemmap: No space for nodes field in page flags"
> @@ -89,8 +90,8 @@
>  #define LAST_CPUPID_SHIFT 0
>  #endif
>
> -#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT \
> -       <= BITS_PER_LONG - NR_PAGEFLAGS
> +#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_REFS_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
> +       KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
>  #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
>  #else
>  #define LAST_CPUPID_WIDTH 0
> @@ -100,8 +101,8 @@
>  #define LAST_CPUPID_NOT_IN_PAGE_FLAGS
>  #endif
>
> -#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH \
> -       > BITS_PER_LONG - NR_PAGEFLAGS
> +#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_REFS_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
> +       KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
>  #error "Not enough bits in page flags"
>  #endif
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 1c3b6e5c8bfd..a95518ca98eb 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -935,7 +935,7 @@ __PAGEFLAG(Isolated, isolated, PF_ANY);
>          1UL << PG_private      | 1UL << PG_private_2   |       \
>          1UL << PG_writeback    | 1UL << PG_reserved    |       \
>          1UL << PG_slab         | 1UL << PG_active      |       \
> -        1UL << PG_unevictable  | __PG_MLOCKED)
> +        1UL << PG_unevictable  | __PG_MLOCKED | LRU_GEN_MASK)
>
>  /*
>   * Flags checked when a page is prepped for return by the page allocator.
> @@ -946,7 +946,7 @@ __PAGEFLAG(Isolated, isolated, PF_ANY);
>   * alloc-free cycle to prevent from reusing the page.
>   */
>  #define PAGE_FLAGS_CHECK_AT_PREP       \
> -       (PAGEFLAGS_MASK & ~__PG_HWPOISON)
> +       ((PAGEFLAGS_MASK & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK)
>
>  #define PAGE_FLAGS_PRIVATE                             \
>         (1UL << PG_private | 1UL << PG_private_2)
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 75ba8aa60248..e7fe784b11aa 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -914,6 +914,10 @@ struct task_struct {
>  #ifdef CONFIG_MEMCG
>         unsigned                        in_user_fault:1;
>  #endif
> +#ifdef CONFIG_LRU_GEN
> +       /* whether the LRU algorithm may apply to this access */
> +       unsigned                        in_lru_fault:1;
> +#endif
>  #ifdef CONFIG_COMPAT_BRK
>         unsigned                        brk_randomized:1;
>  #endif
> diff --git a/kernel/bounds.c b/kernel/bounds.c
> index 9795d75b09b2..e08fb89f87f4 100644
> --- a/kernel/bounds.c
> +++ b/kernel/bounds.c
> @@ -22,6 +22,13 @@ int main(void)
>         DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
>  #endif
>         DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t));
> +#ifdef CONFIG_LRU_GEN
> +       DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1));
> +       DEFINE(LRU_REFS_WIDTH, 0);
> +#else
> +       DEFINE(LRU_GEN_WIDTH, 0);
> +       DEFINE(LRU_REFS_WIDTH, 0);
> +#endif
>         /* End of constants */
>
>         return 0;
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 3326ee3903f3..747ab1690bcf 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -892,6 +892,16 @@ config ANON_VMA_NAME
>           area from being merged with adjacent virtual memory areas due to the
>           difference in their name.
>
> +# the multi-gen LRU {
> +config LRU_GEN
> +       bool "Multi-Gen LRU"
> +       depends on MMU
> +       # the following options can use up the spare bits in page flags
> +       depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
> +       help
> +         A high performance LRU implementation for memory overcommit.
> +# }
> +
>  source "mm/damon/Kconfig"
>
>  endmenu
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 406a3c28c026..3df389fd307f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2364,7 +2364,8 @@ static void __split_huge_page_tail(struct page *head, int tail,
>  #ifdef CONFIG_64BIT
>                          (1L << PG_arch_2) |
>  #endif
> -                        (1L << PG_dirty)));
> +                        (1L << PG_dirty) |
> +                        LRU_GEN_MASK | LRU_REFS_MASK));
>
>         /* ->mapping in first tail page is compound_mapcount */
>         VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 36e9f38c919d..3fcbfeda259b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5121,6 +5121,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
>
>  static void mem_cgroup_free(struct mem_cgroup *memcg)
>  {
> +       lru_gen_exit_memcg(memcg);
>         memcg_wb_domain_exit(memcg);
>         __mem_cgroup_free(memcg);
>  }
> @@ -5180,6 +5181,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>         memcg->deferred_split_queue.split_queue_len = 0;
>  #endif
>         idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
> +       lru_gen_init_memcg(memcg);
>         return memcg;
>  fail:
>         mem_cgroup_id_remove(memcg);
> diff --git a/mm/memory.c b/mm/memory.c
> index a7379196a47e..d27e5f1a2533 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4754,6 +4754,27 @@ static inline void mm_account_fault(struct pt_regs *regs,
>                 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
>  }
>
> +#ifdef CONFIG_LRU_GEN
> +static void lru_gen_enter_fault(struct vm_area_struct *vma)
> +{
> +       /* the LRU algorithm doesn't apply to sequential or random reads */
> +       current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ));
> +}
> +
> +static void lru_gen_exit_fault(void)
> +{
> +       current->in_lru_fault = false;
> +}
> +#else
> +static void lru_gen_enter_fault(struct vm_area_struct *vma)
> +{
> +}
> +
> +static void lru_gen_exit_fault(void)
> +{
> +}
> +#endif /* CONFIG_LRU_GEN */
> +
>  /*
>   * By the time we get here, we already hold the mm semaphore
>   *
> @@ -4785,11 +4806,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>         if (flags & FAULT_FLAG_USER)
>                 mem_cgroup_enter_user_fault();
>
> +       lru_gen_enter_fault(vma);
> +
>         if (unlikely(is_vm_hugetlb_page(vma)))
>                 ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
>         else
>                 ret = __handle_mm_fault(vma, address, flags);
>
> +       lru_gen_exit_fault();
> +
>         if (flags & FAULT_FLAG_USER) {
>                 mem_cgroup_exit_user_fault();
>                 /*
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 9ddaf0e1b0ab..0d7b2bd2454a 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -65,14 +65,16 @@ void __init mminit_verify_pageflags_layout(void)
>
>         shift = 8 * sizeof(unsigned long);
>         width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH
> -               - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH;
> +               - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH - LRU_GEN_WIDTH - LRU_REFS_WIDTH;
>         mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
> -               "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Flags %d\n",
> +               "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Gen %d Tier %d Flags %d\n",
>                 SECTIONS_WIDTH,
>                 NODES_WIDTH,
>                 ZONES_WIDTH,
>                 LAST_CPUPID_WIDTH,
>                 KASAN_TAG_WIDTH,
> +               LRU_GEN_WIDTH,
> +               LRU_REFS_WIDTH,
>                 NR_PAGEFLAGS);
>         mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
>                 "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d\n",
> diff --git a/mm/mmzone.c b/mm/mmzone.c
> index eb89d6e018e2..2ec0d7793424 100644
> --- a/mm/mmzone.c
> +++ b/mm/mmzone.c
> @@ -81,6 +81,8 @@ void lruvec_init(struct lruvec *lruvec)
>
>         for_each_lru(lru)
>                 INIT_LIST_HEAD(&lruvec->lists[lru]);
> +
> +       lru_gen_init_lruvec(lruvec);
>  }
>
>  #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)
> diff --git a/mm/swap.c b/mm/swap.c
> index bcf3ac288b56..e5f2ab3dab4a 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -462,6 +462,11 @@ void folio_add_lru(struct folio *folio)
>         VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio);
>         VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
>
> +       /* see the comment in lru_gen_add_folio() */
> +       if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> +           lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
> +               folio_set_active(folio);

So this is the magic that makes a folio active whenever it is
faulted in? I really don't think the comment below is good.
Could we state the purpose directly and explicitly?

 /* see the comment in lru_gen_add_folio() */
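
For reference, my reading of the condition can be modelled in userspace
roughly as follows. This is only a sketch: struct task_model and the
flag values are invented for the example, and I am assuming that
lru_gen_in_fault() simply reports current->in_lru_fault, which is not
shown in this hunk.

```c
#include <stdbool.h>

/* Hypothetical flag values for this model; the kernel's differ. */
#define VM_SEQ_READ   0x1UL
#define VM_RAND_READ  0x2UL
#define PF_MEMALLOC   0x4UL

/* Invented stand-in for the bits of task_struct used here. */
struct task_model {
	unsigned long flags;	/* models current->flags */
	bool in_lru_fault;	/* models current->in_lru_fault */
};

/* Mirrors lru_gen_enter_fault(): hinted seq/random mappings opt out. */
static void enter_fault(struct task_model *t, unsigned long vm_flags)
{
	t->in_lru_fault = !(vm_flags & (VM_SEQ_READ | VM_RAND_READ));
}

/* Mirrors lru_gen_exit_fault(). */
static void exit_fault(struct task_model *t)
{
	t->in_lru_fault = false;
}

/* Mirrors the condition added to folio_add_lru(). */
static bool should_set_active(const struct task_model *t,
			      bool gen_enabled, bool unevictable)
{
	return gen_enabled && !unevictable &&
	       t->in_lru_fault && !(t->flags & PF_MEMALLOC);
}
```

So a folio is marked active only when it is added during a page fault
on a mapping without a sequential or random read hint, and not from a
PF_MEMALLOC context. A comment saying that in place would be easier to
follow than a cross-reference.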

> +
>         folio_get(folio);
>         local_lock(&lru_pvecs.lock);
>         pvec = this_cpu_ptr(&lru_pvecs.lru_add);
> @@ -563,7 +568,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
>
>  static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
>  {
> -       if (PageActive(page) && !PageUnevictable(page)) {
> +       if (!PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
>                 int nr_pages = thp_nr_pages(page);
>
>                 del_page_from_lru_list(page, lruvec);
> @@ -677,7 +682,7 @@ void deactivate_file_page(struct page *page)
>   */
>  void deactivate_page(struct page *page)
>  {
> -       if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> +       if (PageLRU(page) && !PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
>                 struct pagevec *pvec;
>
>                 local_lock(&lru_pvecs.lock);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8e744cdf802f..65eb668abf2d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3042,6 +3042,79 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
>         return can_demote(pgdat->node_id, sc);
>  }
>
> +#ifdef CONFIG_LRU_GEN
> +
> +/******************************************************************************
> + *                          shorthand helpers
> + ******************************************************************************/
> +
> +#define for_each_gen_type_zone(gen, type, zone)                                \
> +       for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)                   \
> +               for ((type) = 0; (type) < ANON_AND_FILE; (type)++)      \
> +                       for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
> +
> +static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
> +{
> +       struct pglist_data *pgdat = NODE_DATA(nid);
> +
> +#ifdef CONFIG_MEMCG
> +       if (memcg) {
> +               struct lruvec *lruvec = &memcg->nodeinfo[nid]->lruvec;
> +
> +               /* for hotadd_new_pgdat() */
> +               if (!lruvec->pgdat)
> +                       lruvec->pgdat = pgdat;
> +
> +               return lruvec;
> +       }
> +#endif
> +       return pgdat ? &pgdat->__lruvec : NULL;
> +}
> +
> +/******************************************************************************
> + *                          initialization
> + ******************************************************************************/
> +
> +void lru_gen_init_lruvec(struct lruvec *lruvec)
> +{
> +       int gen, type, zone;
> +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +       lrugen->max_seq = MIN_NR_GENS + 1;
> +
> +       for_each_gen_type_zone(gen, type, zone)
> +               INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
> +}
> +
> +#ifdef CONFIG_MEMCG
> +void lru_gen_init_memcg(struct mem_cgroup *memcg)
> +{
> +}
> +
> +void lru_gen_exit_memcg(struct mem_cgroup *memcg)
> +{
> +       int nid;
> +
> +       for_each_node(nid) {
> +               struct lruvec *lruvec = get_lruvec(memcg, nid);
> +
> +               VM_BUG_ON(memchr_inv(lruvec->lrugen.nr_pages, 0,
> +                                    sizeof(lruvec->lrugen.nr_pages)));
> +       }
> +}
> +#endif
> +
> +static int __init init_lru_gen(void)
> +{
> +       BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS);
> +       BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
> +
> +       return 0;
> +};
> +late_initcall(init_lru_gen);
> +
> +#endif /* CONFIG_LRU_GEN */
> +
>  static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  {
>         unsigned long nr[NR_LRU_LISTS];
> --
> 2.35.1.616.g0bdcbb4464-goog
>

Thanks
Barry

* Re: [PATCH v9 03/14] mm/vmscan.c: refactor shrink_node()
  2022-03-09  2:12   ` Yu Zhao
@ 2022-03-18  1:15     ` Barry Song
  -1 siblings, 0 replies; 120+ messages in thread
From: Barry Song @ 2022-03-18  1:15 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Wed, Mar 9, 2022 at 3:47 PM Yu Zhao <yuzhao@google.com> wrote:
>
> This patch refactors shrink_node() to improve readability for the
> upcoming changes to mm/vmscan.c.
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>

Reviewed-by: Barry Song <baohua@kernel.org>

This seems like a nice refactoring, since we are going to skip the
whole function for lru_gen later:
static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
{
        unsigned long file;
        struct lruvec *target_lruvec;

        if (lru_gen_enabled())
                return;
       ...
}
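
As a reminder of what gets skipped, the two heuristics at the end of
prepare_scan_count() boil down to the arithmetic below. This is a
userspace sketch, not the kernel code: struct scan_model is invented
for the example and the DEACTIVATE_* values are placeholders chosen for
the model.

```c
#include <stdbool.h>

/* Placeholder bit values for this model. */
#define DEACTIVATE_ANON 1U
#define DEACTIVATE_FILE 2U

/* Invented stand-in for the scan_control fields used here. */
struct scan_model {
	int priority;			/* models sc->priority */
	unsigned int may_deactivate;	/* models sc->may_deactivate */
};

/* Plenty of inactive file pages and no need to deactivate file pages. */
static bool cache_trim_mode(const struct scan_model *sc,
			    unsigned long inactive_file)
{
	return (inactive_file >> sc->priority) &&
	       !(sc->may_deactivate & DEACTIVATE_FILE);
}

/* File LRU too small to matter while anon still has inactive pages. */
static bool file_is_tiny(const struct scan_model *sc, unsigned long file,
			 unsigned long free_pages,
			 unsigned long total_high_wmark,
			 unsigned long inactive_anon)
{
	return file + free_pages <= total_high_wmark &&
	       !(sc->may_deactivate & DEACTIVATE_ANON) &&
	       (inactive_anon >> sc->priority);
}
```

Both checks shift a page count by the scan priority, so a lower
priority value (higher reclaim pressure) makes them easier to satisfy.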

> ---
>  mm/vmscan.c | 198 +++++++++++++++++++++++++++-------------------------
>  1 file changed, 104 insertions(+), 94 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 59b14e0d696c..8e744cdf802f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2718,6 +2718,109 @@ enum scan_balance {
>         SCAN_FILE,
>  };
>
> +static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
> +{
> +       unsigned long file;
> +       struct lruvec *target_lruvec;
> +
> +       target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
> +
> +       /*
> +        * Flush the memory cgroup stats, so that we read accurate per-memcg
> +        * lruvec stats for heuristics.
> +        */
> +       mem_cgroup_flush_stats();
> +
> +       /*
> +        * Determine the scan balance between anon and file LRUs.
> +        */
> +       spin_lock_irq(&target_lruvec->lru_lock);
> +       sc->anon_cost = target_lruvec->anon_cost;
> +       sc->file_cost = target_lruvec->file_cost;
> +       spin_unlock_irq(&target_lruvec->lru_lock);
> +
> +       /*
> +        * Target desirable inactive:active list ratios for the anon
> +        * and file LRU lists.
> +        */
> +       if (!sc->force_deactivate) {
> +               unsigned long refaults;
> +
> +               refaults = lruvec_page_state(target_lruvec,
> +                               WORKINGSET_ACTIVATE_ANON);
> +               if (refaults != target_lruvec->refaults[0] ||
> +                       inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
> +                       sc->may_deactivate |= DEACTIVATE_ANON;
> +               else
> +                       sc->may_deactivate &= ~DEACTIVATE_ANON;
> +
> +               /*
> +                * When refaults are being observed, it means a new
> +                * workingset is being established. Deactivate to get
> +                * rid of any stale active pages quickly.
> +                */
> +               refaults = lruvec_page_state(target_lruvec,
> +                               WORKINGSET_ACTIVATE_FILE);
> +               if (refaults != target_lruvec->refaults[1] ||
> +                   inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
> +                       sc->may_deactivate |= DEACTIVATE_FILE;
> +               else
> +                       sc->may_deactivate &= ~DEACTIVATE_FILE;
> +       } else
> +               sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
> +
> +       /*
> +        * If we have plenty of inactive file pages that aren't
> +        * thrashing, try to reclaim those first before touching
> +        * anonymous pages.
> +        */
> +       file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
> +       if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
> +               sc->cache_trim_mode = 1;
> +       else
> +               sc->cache_trim_mode = 0;
> +
> +       /*
> +        * Prevent the reclaimer from falling into the cache trap: as
> +        * cache pages start out inactive, every cache fault will tip
> +        * the scan balance towards the file LRU.  And as the file LRU
> +        * shrinks, so does the window for rotation from references.
> +        * This means we have a runaway feedback loop where a tiny
> +        * thrashing file LRU becomes infinitely more attractive than
> +        * anon pages.  Try to detect this based on file LRU size.
> +        */
> +       if (!cgroup_reclaim(sc)) {
> +               unsigned long total_high_wmark = 0;
> +               unsigned long free, anon;
> +               int z;
> +
> +               free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
> +               file = node_page_state(pgdat, NR_ACTIVE_FILE) +
> +                          node_page_state(pgdat, NR_INACTIVE_FILE);
> +
> +               for (z = 0; z < MAX_NR_ZONES; z++) {
> +                       struct zone *zone = &pgdat->node_zones[z];
> +
> +                       if (!managed_zone(zone))
> +                               continue;
> +
> +                       total_high_wmark += high_wmark_pages(zone);
> +               }
> +
> +               /*
> +                * Consider anon: if that's low too, this isn't a
> +                * runaway file reclaim problem, but rather just
> +                * extreme pressure. Reclaim as per usual then.
> +                */
> +               anon = node_page_state(pgdat, NR_INACTIVE_ANON);
> +
> +               sc->file_is_tiny =
> +                       file + free <= total_high_wmark &&
> +                       !(sc->may_deactivate & DEACTIVATE_ANON) &&
> +                       anon >> sc->priority;
> +       }
> +}
> +
>  /*
>   * Determine how aggressively the anon and file LRU lists should be
>   * scanned.  The relative value of each set of LRU lists is determined
> @@ -3188,109 +3291,16 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>         unsigned long nr_reclaimed, nr_scanned;
>         struct lruvec *target_lruvec;
>         bool reclaimable = false;
> -       unsigned long file;
>
>         target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
>
>  again:
> -       /*
> -        * Flush the memory cgroup stats, so that we read accurate per-memcg
> -        * lruvec stats for heuristics.
> -        */
> -       mem_cgroup_flush_stats();
> -
>         memset(&sc->nr, 0, sizeof(sc->nr));
>
>         nr_reclaimed = sc->nr_reclaimed;
>         nr_scanned = sc->nr_scanned;
>
> -       /*
> -        * Determine the scan balance between anon and file LRUs.
> -        */
> -       spin_lock_irq(&target_lruvec->lru_lock);
> -       sc->anon_cost = target_lruvec->anon_cost;
> -       sc->file_cost = target_lruvec->file_cost;
> -       spin_unlock_irq(&target_lruvec->lru_lock);
> -
> -       /*
> -        * Target desirable inactive:active list ratios for the anon
> -        * and file LRU lists.
> -        */
> -       if (!sc->force_deactivate) {
> -               unsigned long refaults;
> -
> -               refaults = lruvec_page_state(target_lruvec,
> -                               WORKINGSET_ACTIVATE_ANON);
> -               if (refaults != target_lruvec->refaults[0] ||
> -                       inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
> -                       sc->may_deactivate |= DEACTIVATE_ANON;
> -               else
> -                       sc->may_deactivate &= ~DEACTIVATE_ANON;
> -
> -               /*
> -                * When refaults are being observed, it means a new
> -                * workingset is being established. Deactivate to get
> -                * rid of any stale active pages quickly.
> -                */
> -               refaults = lruvec_page_state(target_lruvec,
> -                               WORKINGSET_ACTIVATE_FILE);
> -               if (refaults != target_lruvec->refaults[1] ||
> -                   inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
> -                       sc->may_deactivate |= DEACTIVATE_FILE;
> -               else
> -                       sc->may_deactivate &= ~DEACTIVATE_FILE;
> -       } else
> -               sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
> -
> -       /*
> -        * If we have plenty of inactive file pages that aren't
> -        * thrashing, try to reclaim those first before touching
> -        * anonymous pages.
> -        */
> -       file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
> -       if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
> -               sc->cache_trim_mode = 1;
> -       else
> -               sc->cache_trim_mode = 0;
> -
> -       /*
> -        * Prevent the reclaimer from falling into the cache trap: as
> -        * cache pages start out inactive, every cache fault will tip
> -        * the scan balance towards the file LRU.  And as the file LRU
> -        * shrinks, so does the window for rotation from references.
> -        * This means we have a runaway feedback loop where a tiny
> -        * thrashing file LRU becomes infinitely more attractive than
> -        * anon pages.  Try to detect this based on file LRU size.
> -        */
> -       if (!cgroup_reclaim(sc)) {
> -               unsigned long total_high_wmark = 0;
> -               unsigned long free, anon;
> -               int z;
> -
> -               free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
> -               file = node_page_state(pgdat, NR_ACTIVE_FILE) +
> -                          node_page_state(pgdat, NR_INACTIVE_FILE);
> -
> -               for (z = 0; z < MAX_NR_ZONES; z++) {
> -                       struct zone *zone = &pgdat->node_zones[z];
> -                       if (!managed_zone(zone))
> -                               continue;
> -
> -                       total_high_wmark += high_wmark_pages(zone);
> -               }
> -
> -               /*
> -                * Consider anon: if that's low too, this isn't a
> -                * runaway file reclaim problem, but rather just
> -                * extreme pressure. Reclaim as per usual then.
> -                */
> -               anon = node_page_state(pgdat, NR_INACTIVE_ANON);
> -
> -               sc->file_is_tiny =
> -                       file + free <= total_high_wmark &&
> -                       !(sc->may_deactivate & DEACTIVATE_ANON) &&
> -                       anon >> sc->priority;
> -       }
> +       prepare_scan_count(pgdat, sc);
>
>         shrink_node_memcgs(pgdat, sc);
>
> --
> 2.35.1.616.g0bdcbb4464-goog
>

Thanks
Barry


> -
>         memset(&sc->nr, 0, sizeof(sc->nr));
>
>         nr_reclaimed = sc->nr_reclaimed;
>         nr_scanned = sc->nr_scanned;
>
> -       /*
> -        * Determine the scan balance between anon and file LRUs.
> -        */
> -       spin_lock_irq(&target_lruvec->lru_lock);
> -       sc->anon_cost = target_lruvec->anon_cost;
> -       sc->file_cost = target_lruvec->file_cost;
> -       spin_unlock_irq(&target_lruvec->lru_lock);
> -
> -       /*
> -        * Target desirable inactive:active list ratios for the anon
> -        * and file LRU lists.
> -        */
> -       if (!sc->force_deactivate) {
> -               unsigned long refaults;
> -
> -               refaults = lruvec_page_state(target_lruvec,
> -                               WORKINGSET_ACTIVATE_ANON);
> -               if (refaults != target_lruvec->refaults[0] ||
> -                       inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
> -                       sc->may_deactivate |= DEACTIVATE_ANON;
> -               else
> -                       sc->may_deactivate &= ~DEACTIVATE_ANON;
> -
> -               /*
> -                * When refaults are being observed, it means a new
> -                * workingset is being established. Deactivate to get
> -                * rid of any stale active pages quickly.
> -                */
> -               refaults = lruvec_page_state(target_lruvec,
> -                               WORKINGSET_ACTIVATE_FILE);
> -               if (refaults != target_lruvec->refaults[1] ||
> -                   inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
> -                       sc->may_deactivate |= DEACTIVATE_FILE;
> -               else
> -                       sc->may_deactivate &= ~DEACTIVATE_FILE;
> -       } else
> -               sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
> -
> -       /*
> -        * If we have plenty of inactive file pages that aren't
> -        * thrashing, try to reclaim those first before touching
> -        * anonymous pages.
> -        */
> -       file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
> -       if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
> -               sc->cache_trim_mode = 1;
> -       else
> -               sc->cache_trim_mode = 0;
> -
> -       /*
> -        * Prevent the reclaimer from falling into the cache trap: as
> -        * cache pages start out inactive, every cache fault will tip
> -        * the scan balance towards the file LRU.  And as the file LRU
> -        * shrinks, so does the window for rotation from references.
> -        * This means we have a runaway feedback loop where a tiny
> -        * thrashing file LRU becomes infinitely more attractive than
> -        * anon pages.  Try to detect this based on file LRU size.
> -        */
> -       if (!cgroup_reclaim(sc)) {
> -               unsigned long total_high_wmark = 0;
> -               unsigned long free, anon;
> -               int z;
> -
> -               free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
> -               file = node_page_state(pgdat, NR_ACTIVE_FILE) +
> -                          node_page_state(pgdat, NR_INACTIVE_FILE);
> -
> -               for (z = 0; z < MAX_NR_ZONES; z++) {
> -                       struct zone *zone = &pgdat->node_zones[z];
> -                       if (!managed_zone(zone))
> -                               continue;
> -
> -                       total_high_wmark += high_wmark_pages(zone);
> -               }
> -
> -               /*
> -                * Consider anon: if that's low too, this isn't a
> -                * runaway file reclaim problem, but rather just
> -                * extreme pressure. Reclaim as per usual then.
> -                */
> -               anon = node_page_state(pgdat, NR_INACTIVE_ANON);
> -
> -               sc->file_is_tiny =
> -                       file + free <= total_high_wmark &&
> -                       !(sc->may_deactivate & DEACTIVATE_ANON) &&
> -                       anon >> sc->priority;
> -       }
> +       prepare_scan_count(pgdat, sc);
>
>         shrink_node_memcgs(pgdat, sc);
>
> --
> 2.35.1.616.g0bdcbb4464-goog
>

Thanks
Barry

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation
  2022-03-09  2:12   ` Yu Zhao
@ 2022-03-19  3:01     ` Barry Song
  -1 siblings, 0 replies; 120+ messages in thread
From: Barry Song @ 2022-03-19  3:01 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

> +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> +{
> +       unsigned long old_flags, new_flags;
> +       int type = folio_is_file_lru(folio);
> +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +       int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
> +
> +       do {
> +               new_flags = old_flags = READ_ONCE(folio->flags);
> +               VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
> +
> +               new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> +               new_gen = (old_gen + 1) % MAX_NR_GENS;

new_gen is assigned twice. I assume you meant:
               old_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
               new_gen = (old_gen + 1) % MAX_NR_GENS;

Or do you always want new_gen = lru_gen_from_seq(min_seq) + 1?

> +
> +               new_flags &= ~LRU_GEN_MASK;
> +               new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
> +               new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
> +               /* for folio_end_writeback() */
> +               if (reclaiming)
> +                       new_flags |= BIT(PG_reclaim);
> +       } while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
> +
> +       lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> +
> +       return new_gen;
> +}
> +

Thanks
Barry

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation
  2022-03-19  3:01     ` Barry Song
@ 2022-03-19  3:11       ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-19  3:11 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Fri, Mar 18, 2022 at 9:01 PM Barry Song <21cnbao@gmail.com> wrote:
>
> > +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > +{
> > +       unsigned long old_flags, new_flags;
> > +       int type = folio_is_file_lru(folio);
> > +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > +       int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
> > +
> > +       do {
> > +               new_flags = old_flags = READ_ONCE(folio->flags);
> > +               VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
> > +
> > +               new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> > +               new_gen = (old_gen + 1) % MAX_NR_GENS;
>
> new_gen is assigned twice. I assume you meant:
>                old_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
>                new_gen = (old_gen + 1) % MAX_NR_GENS;
>
> Or do you always want new_gen = lru_gen_from_seq(min_seq) + 1?

Thanks a lot for your attention to detail!

The first line should have been in the next patch, but I overlooked it
during the last refactoring:

  new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+ /* folio_update_gen() has promoted this page? */
+ if (new_gen >= 0 && new_gen != old_gen)
+         return new_gen;
+
  new_gen = (old_gen + 1) % MAX_NR_GENS;
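
For readers following along, here is a minimal single-threaded userspace
sketch of the corrected step. It is not the kernel code: the field width
and offset are hypothetical values picked for illustration, the helper
names flags_to_gen(), flags_set_gen() and inc_gen() are made up, and the
cmpxchg() retry loop is left out. It only shows the "gen + 1" encoding
in the flags word and the early return for an already promoted folio.

```c
#include <assert.h>

/* Hypothetical layout for this sketch; the kernel derives the real
 * width and offset from its page-flags layout. */
#define MAX_NR_GENS     4U
#define LRU_GEN_WIDTH   3U
#define LRU_GEN_PGOFF   8U
#define LRU_GEN_MASK    (((1UL << LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)

/* Decode the generation stored in flags; -1 means "not on a list". */
static int flags_to_gen(unsigned long flags)
{
	return (int)((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
}

/* Store a generation as gen + 1, so an all-zero field stays "none". */
static unsigned long flags_set_gen(unsigned long flags, int gen)
{
	assert(gen >= 0 && (unsigned int)gen < MAX_NR_GENS);
	flags &= ~LRU_GEN_MASK;
	return flags | (((unsigned long)gen + 1UL) << LRU_GEN_PGOFF);
}

/*
 * Model of the corrected step. old_gen is the oldest generation (from
 * min_seq). If the folio already carries a different generation,
 * something else promoted it and it is left alone; otherwise it moves
 * one generation up, wrapping around the ring.
 */
static int inc_gen(unsigned long *flags, int old_gen)
{
	int new_gen = flags_to_gen(*flags);

	if (new_gen >= 0 && new_gen != old_gen)
		return new_gen;

	new_gen = (int)(((unsigned int)old_gen + 1U) % MAX_NR_GENS);
	*flags = flags_set_gen(*flags, new_gen);
	return new_gen;
}
```

Calling inc_gen() twice with the same old_gen moves the folio only once:
the second call sees a generation different from old_gen and returns it
unchanged, which is the case the added check is meant to catch.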

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation
  2022-03-09  2:12   ` Yu Zhao
@ 2022-03-19 10:14     ` Barry Song
  -1 siblings, 0 replies; 120+ messages in thread
From: Barry Song @ 2022-03-19 10:14 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

> +static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
> +{
> +       int prev, next;
> +       int type, zone;
> +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +       spin_lock_irq(&lruvec->lru_lock);
> +
> +       VM_BUG_ON(!seq_is_valid(lruvec));
> +
> +       if (max_seq != lrugen->max_seq)
> +               goto unlock;
> +
> +       inc_min_seq(lruvec);
> +
> +       /* update the active/inactive LRU sizes for compatibility */
> +       prev = lru_gen_from_seq(lrugen->max_seq - 1);
> +       next = lru_gen_from_seq(lrugen->max_seq + 1);
> +
> +       for (type = 0; type < ANON_AND_FILE; type++) {
> +               for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> +                       enum lru_list lru = type * LRU_INACTIVE_FILE;
> +                       long delta = lrugen->nr_pages[prev][type][zone] -
> +                                    lrugen->nr_pages[next][type][zone];

This is confusing to me. Can lrugen->nr_pages[next][type][zone] be
non-zero even before max_seq is increased? Can some pages already be in
the next generation before that generation is born?

Isn't it a bug if lrugen->nr_pages[next][type][zone] > 0? Shouldn't this be:

delta = lrugen->nr_pages[prev][type][zone];

> +
> +                       if (!delta)
> +                               continue;
> +
> +                       __update_lru_size(lruvec, lru, zone, delta);
> +                       __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, -delta);
> +               }
> +       }
> +
> +       for (type = 0; type < ANON_AND_FILE; type++)
> +               reset_ctrl_pos(lruvec, type, false);
> +
> +       /* make sure preceding modifications appear */
> +       smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
> +unlock:
> +       spin_unlock_irq(&lruvec->lru_lock);
> +}

Thanks
Barry
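
As background for the question above, here is a small sketch of the
index arithmetic only. It assumes that lru_gen_from_seq() maps a
sequence number to a slot with seq % MAX_NR_GENS, which is not shown in
the quoted hunk, and it uses the MAX_NR_GENS value of 4 mentioned
elsewhere in this thread. The helper names are hypothetical. Under that
assumption, the slot for max_seq + 1 coincides with the slot of min_seq
whenever all MAX_NR_GENS generations are in use. Whether inc_min_seq(),
which the quoted code calls first, always frees that slot is not visible
in this excerpt and is for the author to confirm.

```c
#include <assert.h>

#define MAX_NR_GENS 4UL  /* value quoted elsewhere in this thread */

/* Assumed definition: sequence numbers map onto a ring of slots. */
static unsigned long gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}

/* Number of generations spanned by [min_seq, max_seq]. */
static unsigned long nr_gens(unsigned long min_seq, unsigned long max_seq)
{
	assert(min_seq <= max_seq);
	return max_seq - min_seq + 1;
}

/*
 * Returns 1 when the slot that max_seq + 1 would use still belongs to a
 * live generation in [min_seq, max_seq], i.e. when the ring is full.
 */
static int next_slot_in_use(unsigned long min_seq, unsigned long max_seq)
{
	unsigned long seq, next = gen_from_seq(max_seq + 1);

	for (seq = min_seq; seq <= max_seq; seq++)
		if (gen_from_seq(seq) == next)
			return 1;
	return 0;
}
```

With min_seq = 5 and max_seq = 8 the ring is full, and slot 1 is both
the oldest generation and the slot the next generation would take. With
min_seq = 6 the slot is free.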

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation
  2022-03-09  2:12   ` Yu Zhao
@ 2022-03-19 11:15     ` Barry Song
  -1 siblings, 0 replies; 120+ messages in thread
From: Barry Song @ 2022-03-19 11:15 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

> +                            unsigned long *min_seq, bool can_swap, bool *need_aging)
> +{
> +       int gen, type, zone;
> +       long old = 0;
> +       long young = 0;
> +       long total = 0;
> +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +       for (type = !can_swap; type < ANON_AND_FILE; type++) {
> +               unsigned long seq;
> +
> +               for (seq = min_seq[type]; seq <= max_seq; seq++) {
> +                       long size = 0;
> +
> +                       gen = lru_gen_from_seq(seq);
> +
> +                       for (zone = 0; zone < MAX_NR_ZONES; zone++)
> +                               size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
> +
> +                       total += size;
> +                       if (seq == max_seq)
> +                               young += size;
> +                       if (seq + MIN_NR_GENS == max_seq)
> +                               old += size;
> +               }
> +       }
> +
> +       /* try to spread pages out across MIN_NR_GENS+1 generations */
> +       if (min_seq[LRU_GEN_FILE] + MIN_NR_GENS > max_seq)
> +               *need_aging = true;
> +       else if (min_seq[LRU_GEN_FILE] + MIN_NR_GENS < max_seq)
> +               *need_aging = false;
> +       else if (young * MIN_NR_GENS > total)
> +               *need_aging = true;

Could we have some documentation here? Given MIN_NR_GENS=2 and MAX_NR_GENS=4,
it seems you mean that if we have three generations and the youngest pages
make up more than 1/2 of the total pages, we need aging?


> +       else if (old * (MIN_NR_GENS + 2) < total)
> +               *need_aging = true;

It seems you mean that if the oldest pages make up less than 1/4 of the
total pages, we need aging? Could we have a comment here explaining why?

Your commit message only says "The aging produces young generations.
Given an lruvec, it increments max_seq when max_seq-min_seq+1
approaches MIN_NR_GENS." That doesn't explain what the code is doing
here.


> +       else
> +               *need_aging = false;
> +
> +       return total > 0 ? total : 0;
> +}

Thanks
Barry
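
To make the two readings above concrete, here is the quoted decision
chain restated as a pure function, with MIN_NR_GENS fixed at 2 as stated
in the review. The function name and signature are hypothetical; in the
patch the first argument is min_seq[LRU_GEN_FILE] and the counts come
from lrugen->nr_pages. This sketch only restates what the branches
compute. It does not explain why these thresholds were chosen.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_NR_GENS 2UL  /* value quoted in the review above */

/*
 * young: pages in the generation at max_seq
 * old:   pages in the generation at max_seq - MIN_NR_GENS
 * total: pages in all generations that were counted
 */
static bool need_aging(unsigned long min_seq, unsigned long max_seq,
		       long young, long old, long total)
{
	/* fewer than MIN_NR_GENS + 1 generations: always age */
	if (min_seq + MIN_NR_GENS > max_seq)
		return true;
	/* more than MIN_NR_GENS + 1 generations: never age */
	if (min_seq + MIN_NR_GENS < max_seq)
		return false;
	/* exactly three generations: youngest holds more than 1/2 */
	if (young * (long)MIN_NR_GENS > total)
		return true;
	/* exactly three generations: oldest holds less than 1/4 */
	if (old * (long)(MIN_NR_GENS + 2) < total)
		return true;
	return false;
}
```

For example, with three generations and 100 pages in total, a split of
young = 60 triggers aging through the third branch, and young = 40 with
old = 20 triggers it through the fourth. A split of young = 40 and
old = 30 does not trigger aging.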

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 05/14] mm: multi-gen LRU: groundwork
  2022-03-16 23:25     ` Barry Song
@ 2022-03-21  9:04       ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-21  9:04 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Wed, Mar 16, 2022 at 5:25 PM Barry Song <21cnbao@gmail.com> wrote:
>
...
> > +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > +{
> > +       int gen;
> > +       unsigned long old_flags, new_flags;
> > +       int type = folio_is_file_lru(folio);
> > +       int zone = folio_zonenum(folio);
> > +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > +
> > +       if (folio_test_unevictable(folio))
> > +               return false;
> > +       /*
> > +        * There are three common cases for this page:
> > +        * 1. If it's hot, e.g., freshly faulted in or previously hot and
> > +        *    migrated, add it to the youngest generation.
>
> Usually a page is not active when it is first faulted in; it only becomes
> active once its second access is detected.

The active/inactive LRU *assumes* this; MGLRU *assumes* the opposite,
and there is no "active" in MGLRU -- we call it hot to avoid confusion
:)

> > +        * 2. If it's cold but can't be evicted immediately, i.e., an anon page
> > +        *    not in swapcache or a dirty page pending writeback, add it to the
> > +        *    second oldest generation.
> > +        * 3. Everything else (clean, cold) is added to the oldest generation.
> > +        */
...
> > +#define LRU_GEN_MASK           ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
> > +#define LRU_REFS_MASK          ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
>
> The commit log said nothing about REFS flags and tiers.
> but the code is here. either the commit log lacks something
> or the code should belong to the next patch?

It did:
  A few macros, i.e., LRU_REFS_*, used later are added in this patch
to make the patchset less diffy.

> > @@ -462,6 +462,11 @@ void folio_add_lru(struct folio *folio)
> >         VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio);
> >         VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> >
> > +       /* see the comment in lru_gen_add_folio() */
> > +       if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> > +           lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
> > +               folio_set_active(folio);
>
> So this is the magic that makes a folio active as soon as it is
> faulted in? I really don't think the comment below is good;
> could we state the purpose directly and explicitly?
>
>  /* see the comment in lru_gen_add_folio() */

I generally keep comments in a few major locations and reference them
from many other minor locations so that it's easier to manage in the
long run. It is a hassle for reviews but once in the tree you can jump
to lru_gen_add_folio() with ctags/cscope or find all places that
reference it by grepping. Assuming we state the purpose, which is to
make lru_gen_add_folio() add the page to the youngest generation, you
still want to go to lru_gen_add_folio() to check if this is really the
case. So why bother :)
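
For reference, the three cases listed in the quoted lru_gen_add_folio()
comment can be sketched as a small selection function. This is an
illustration only: struct folio_state, its fields and pick_seq() are
hypothetical stand-ins, not kernel types, and the sketch assumes at
least two generations exist so that "second oldest" is well defined.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the folio state the comment refers to. */
struct folio_state {
	bool hot;            /* freshly faulted in, or hot and migrated */
	bool evictable_now;  /* false: anon not in swapcache, or dirty
	                      * and pending writeback */
};

/*
 * Pick a sequence number following the three cases in the comment.
 * Assumes min_seq + 1 <= max_seq.
 */
static unsigned long pick_seq(const struct folio_state *f,
			      unsigned long min_seq, unsigned long max_seq)
{
	assert(min_seq + 1 <= max_seq);

	if (f->hot)
		return max_seq;      /* case 1: youngest generation */
	if (!f->evictable_now)
		return min_seq + 1;  /* case 2: second oldest generation */
	return min_seq;              /* case 3: oldest generation */
}
```

This is where the folio_set_active() call in folio_add_lru() fits in,
according to the reply above: it marks a freshly faulted-in folio so
that it takes the first branch and lands in the youngest generation.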

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 05/14] mm: multi-gen LRU: groundwork
  2022-03-21  9:04       ` Yu Zhao
@ 2022-03-21 11:47         ` Barry Song
  -1 siblings, 0 replies; 120+ messages in thread
From: Barry Song @ 2022-03-21 11:47 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Mon, Mar 21, 2022 at 10:04 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Wed, Mar 16, 2022 at 5:25 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> ...
> > > +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > > +{
> > > +       int gen;
> > > +       unsigned long old_flags, new_flags;
> > > +       int type = folio_is_file_lru(folio);
> > > +       int zone = folio_zonenum(folio);
> > > +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > > +
> > > +       if (folio_test_unevictable(folio))
> > > +               return false;
> > > +       /*
> > > +        * There are three common cases for this page:
> > > +        * 1. If it's hot, e.g., freshly faulted in or previously hot and
> > > +        *    migrated, add it to the youngest generation.
> >
> > usually, one page is not active when it is faulted in. till its second
> > access is detected, it can be active.
>
> The active/inactive LRU *assumes* this; MGLRU *assumes* the opposite,
> and there is no "active" in MGLRU -- we call it hot to avoid confusion
> :)

yep.

>
> > > +        * 2. If it's cold but can't be evicted immediately, i.e., an anon page
> > > +        *    not in swapcache or a dirty page pending writeback, add it to the
> > > +        *    second oldest generation.
> > > +        * 3. Everything else (clean, cold) is added to the oldest generation.
> > > +        */
> ...
> > > +#define LRU_GEN_MASK           ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
> > > +#define LRU_REFS_MASK          ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
> >
> > The commit log said nothing about REFS flags and tiers.
> > but the code is here. either the commit log lacks something
> > or the code should belong to the next patch?
>
> It did:
>   A few macros, i.e., LRU_REFS_*, used later are added in this patch
> to make the patchset less diffy.

sorry for missing that.

>
> > > @@ -462,6 +462,11 @@ void folio_add_lru(struct folio *folio)
> > >         VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio);
> > >         VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> > >
> > > +       /* see the comment in lru_gen_add_folio() */
> > > +       if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> > > +           lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
> > > +               folio_set_active(folio);
> >
> > So here is our magic to make folio active as long as it is
> > faulted in? i really don't think the below comment is good,
> > could we say our purpose directly and explicitly?
> >
> >  /* see the comment in lru_gen_add_folio() */
>
> I generally keep comments in a few major locations and reference them
> from many other minor locations so that it's easier to manage in the
> long run. It is a hassle for reviews but once in the tree you can jump
> to lru_gen_add_folio() with ctags/cscope or find all places that
> reference it by grepping. Assuming we state the purpose, which is to
> make lru_gen_add_folio() add the page to the youngest generation, you
> still want to go to lru_gen_add_folio() to check if this is really the
> case. So why bother :)

Well understood, though my pain was that I needed to email you to
confirm that this is really the case.

Thanks
Barry
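
As an aside on the LRU_GEN_MASK and LRU_REFS_MASK macros quoted in this message: both follow the usual "width and offset" bitfield pattern for packing small counters into one flags word. The sketch below shows that pattern in user-space C. The width and offset values are assumptions chosen only for illustration (the kernel derives the real ones from its page-flags layout), and set_field()/get_field() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>

#define BIT(n)          (1UL << (n))

/* Illustrative values only; not the kernel's actual layout. */
#define LRU_GEN_WIDTH   3
#define LRU_GEN_PGOFF   8
#define LRU_REFS_WIDTH  2
#define LRU_REFS_PGOFF  (LRU_GEN_PGOFF + LRU_GEN_WIDTH)

/* Same shape as the macros in the patch: WIDTH ones shifted up to PGOFF. */
#define LRU_GEN_MASK    ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
#define LRU_REFS_MASK   ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)

/* Store val in the field described by mask/off, leaving other bits alone. */
static unsigned long set_field(unsigned long flags, unsigned long mask,
			       int off, unsigned long val)
{
	return (flags & ~mask) | ((val << off) & mask);
}

/* Read back the field described by mask/off. */
static unsigned long get_field(unsigned long flags, unsigned long mask, int off)
{
	return (flags & mask) >> off;
}
```

Because the two masks do not overlap, updating one field leaves the other untouched.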

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation
  2022-03-09  2:12   ` Yu Zhao
@ 2022-03-21 12:51     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 120+ messages in thread
From: Aneesh Kumar K.V @ 2022-03-21 12:51 UTC (permalink / raw)
  To: Yu Zhao, Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Catalin Marinas, Dave Hansen, Hillf Danton,
	Jens Axboe, Jesse Barnes, Johannes Weiner, Jonathan Corbet,
	Matthew Wilcox, Mel Gorman, Michael Larabel, Michal Hocko,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Yu Zhao, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh, Vaibhav Jain

 +
> +static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
> +			     unsigned long *min_seq, bool can_swap, bool *need_aging)
> +{
> +	int gen, type, zone;
> +	long old = 0;
> +	long young = 0;
> +	long total = 0;
> +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +	for (type = !can_swap; type < ANON_AND_FILE; type++) {
> +		unsigned long seq;
> +
> +		for (seq = min_seq[type]; seq <= max_seq; seq++) {
> +			long size = 0;
> +
> +			gen = lru_gen_from_seq(seq);
> +
> +			for (zone = 0; zone < MAX_NR_ZONES; zone++)
> +				size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
> +
> +			total += size;
> +			if (seq == max_seq)
> +				young += size;
> +			if (seq + MIN_NR_GENS == max_seq)
> +				old += size;
> +		}
> +	}
> +
> +	/* try to spread pages out across MIN_NR_GENS+1 generations */
> +	if (min_seq[LRU_GEN_FILE] + MIN_NR_GENS > max_seq)
> +		*need_aging = true;
> +	else if (min_seq[LRU_GEN_FILE] + MIN_NR_GENS < max_seq)
> +		*need_aging = false;

Can you explain/document the reason for considering the conditions
below for ageing?

> +	else if (young * MIN_NR_GENS > total)
> +		*need_aging = true;

Are we trying to consider the case of more than half the total pages
being young as needing ageing? If so, should this be a literal 2 instead
of the MIN_NR_GENS #define? Or

> +	else if (old * (MIN_NR_GENS + 2) < total)
> +		*need_aging = true;

What is the significance of '+ 2'?

> +	else
> +		*need_aging = false;
> +
> +	return total > 0 ? total : 0;
> +}
> +
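
To make the questions above concrete, the need_aging decision from the quoted hunk can be lifted into a standalone function and evaluated by hand. This is a sketch of the arithmetic only, not an explanation of the author's intent. It assumes MIN_NR_GENS is 2, which the patch excerpt does not show; need_aging_sketch() is a hypothetical name, and its min_seq parameter stands for min_seq[LRU_GEN_FILE].

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_NR_GENS 2 /* assumed value, not shown in the quoted hunk */

/*
 * Same branch order as the tail of get_nr_evictable():
 * young = pages in the generation at max_seq,
 * old   = pages in the generation at max_seq - MIN_NR_GENS,
 * total = pages in all generations that were counted.
 */
static bool need_aging_sketch(unsigned long min_seq, unsigned long max_seq,
			      long young, long old, long total)
{
	if (min_seq + MIN_NR_GENS > max_seq)
		return true;  /* fewer than MIN_NR_GENS+1 generations exist */
	if (min_seq + MIN_NR_GENS < max_seq)
		return false; /* more than MIN_NR_GENS+1 generations exist */
	if (young * MIN_NR_GENS > total)
		return true;  /* with 2: youngest holds more than half */
	if (old * (MIN_NR_GENS + 2) < total)
		return true;  /* with 2: oldest holds less than a quarter */
	return false;
}
```

Under that assumption, with exactly three generations and 100 pages in total, the function returns true for young = 60 (more than half), true for young = 40 and old = 20 (less than a quarter), and false for young = 40 and old = 30.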

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation
  2022-03-09  2:12   ` Yu Zhao
@ 2022-03-21 13:01     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 120+ messages in thread
From: Aneesh Kumar K.V @ 2022-03-21 13:01 UTC (permalink / raw)
  To: Yu Zhao, Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Catalin Marinas, Dave Hansen, Hillf Danton,
	Jens Axboe, Jesse Barnes, Johannes Weiner, Jonathan Corbet,
	Matthew Wilcox, Mel Gorman, Michael Larabel, Michal Hocko,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Yu Zhao, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh, Vaibhav Jain

Yu Zhao <yuzhao@google.com> writes:

> To avoid confusion, the terms "promotion" and "demotion" will be
> applied to the multi-gen LRU, as a new convention; the terms
> "activation" and "deactivation" will be applied to the active/inactive
> LRU, as usual.
>
> The aging produces young generations. Given an lruvec, it increments
> max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
> promotes hot pages to the youngest generation when it finds them
> accessed through page tables; the demotion of cold pages happens
> consequently when it increments max_seq. The aging has the complexity
> O(nr_hot_pages), since it is only interested in hot pages. Promotion
> in the aging path does not require any LRU list operations, only the
> updates of the gen counter and lrugen->nr_pages[]; demotion, unless as
> the result of the increment of max_seq, requires LRU list operations,
> e.g., lru_deactivate_fn().
>
> The eviction consumes old generations. Given an lruvec, it increments
> min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
> feedback loop modeled after the PID controller monitors refaults over
> anon and file types and decides which type to evict when both types
> are available from the same generation.
>
> Each generation is divided into multiple tiers. Tiers represent
> different ranges of numbers of accesses through file descriptors. A
> page accessed N times through file descriptors is in tier
> order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
> bits in folio->flags. In contrast to moving across generations, which
> requires the LRU lock, moving across tiers only involves operations on
> folio->flags. The feedback loop also monitors refaults over all tiers
> and decides when to protect pages in which tiers (N>1), using the
> first tier (N=0,1) as a baseline. The first tier contains single-use
> unmapped clean pages, which are most likely the best choices. The
> eviction moves a page to the next generation, i.e., min_seq+1, if the
> feedback loop decides so. This approach has the following advantages:
> 1. It removes the cost of activation in the buffered access path by
>    inferring whether pages accessed multiple times through file
>    descriptors are statistically hot and thus worth protecting in the
>    eviction path.
> 2. It takes pages accessed through page tables into account and avoids
>    overprotecting pages accessed multiple times through file
>    descriptors. (Pages accessed through page tables are in the first
>    tier, since N=0.)
> 3. More tiers provide better protection for pages accessed more than
>    twice through file descriptors, when under heavy buffered I/O
>    workloads.
>
> Server benchmark results:
>   Single workload:
>     fio (buffered I/O): +[47, 49]%
>                 IOPS         BW
>       5.17-rc2: 2242k        8759MiB/s
>       patch1-5: 3321k        12.7GiB/s
>
>   Single workload:
>     memcached (anon): +[101, 105]%
>                 Ops/sec      KB/sec
>       5.17-rc2: 476771.79    18544.31
>       patch1-5: 972526.07    37826.95
>
>   Configurations:
>     CPU: two Xeon 6154
>     Mem: total 256G
>
>     Node 1 was only used as a ram disk to reduce the variance in the
>     results.
>
>     patch drivers/block/brd.c <<EOF
>     99,100c99,100
>     < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
>     < 	page = alloc_page(gfp_flags);
>     ---
>     > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
>     > 	page = alloc_pages_node(1, gfp_flags, 0);
>     EOF
>
>     cat >>/etc/systemd/system.conf <<EOF
>     CPUAffinity=numa
>     NUMAPolicy=bind
>     NUMAMask=0
>     EOF
>
>     cat >>/etc/memcached.conf <<EOF
>     -m 184320
>     -s /var/run/memcached/memcached.sock
>     -a 0766
>     -t 36
>     -B binary
>     EOF
>
>     cat fio.sh
>     modprobe brd rd_nr=1 rd_size=113246208
>     mkfs.ext4 /dev/ram0
>     mount -t ext4 /dev/ram0 /mnt
>
>     mkdir /sys/fs/cgroup/user.slice/test
>     echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
>     echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
>     fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
>       --buffered=1 --ioengine=io_uring --iodepth=128 \
>       --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
>       --rw=randread --random_distribution=random --norandommap \
>       --time_based --ramp_time=10m --runtime=5m --group_reporting
>
>     cat memcached.sh
>     modprobe brd rd_nr=1 rd_size=113246208
>     swapoff -a
>     mkswap /dev/ram0
>     swapon /dev/ram0
>
>     memtier_benchmark -S /var/run/memcached/memcached.sock \
>       -P memcache_binary -n allkeys --key-minimum=1 \
>       --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
>       --ratio 1:0 --pipeline 8 -d 2000
>
>     memtier_benchmark -S /var/run/memcached/memcached.sock \
>       -P memcache_binary -n allkeys --key-minimum=1 \
>       --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
>       --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
>
> Client benchmark results:
>   kswapd profiles:
>     5.17-rc2
>       38.05%  page_vma_mapped_walk
>       20.86%  lzo1x_1_do_compress (real work)
>        6.16%  do_raw_spin_lock
>        4.61%  _raw_spin_unlock_irq
>        2.20%  vma_interval_tree_iter_next
>        2.19%  vma_interval_tree_subtree_search
>        2.15%  page_referenced_one
>        1.93%  anon_vma_interval_tree_iter_first
>        1.65%  ptep_clear_flush
>        1.00%  __zram_bvec_write
>
>     patch1-5
>       39.73%  lzo1x_1_do_compress (real work)
>       14.96%  page_vma_mapped_walk
>        6.97%  _raw_spin_unlock_irq
>        3.07%  do_raw_spin_lock
>        2.53%  anon_vma_interval_tree_iter_first
>        2.04%  ptep_clear_flush
>        1.82%  __zram_bvec_write
>        1.76%  __anon_vma_interval_tree_subtree_search
>        1.57%  memmove
>        1.45%  free_unref_page_list
>
>   Configurations:
>     CPU: single Snapdragon 7c
>     Mem: total 4G
>
>     Chrome OS MemoryPressure [1]
>
> [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
>
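
The tier rule in the quoted commit message (a page accessed N times through file descriptors is in tier order_base_2(N)) can be checked with a small user-space sketch. order_base_2_sketch() below is a hypothetical stand-in that computes the ceiling of log2(n) and maps 0 and 1 to 0, which matches the commit message's statement that the first tier covers N=0,1. It ignores any cap on the number of tiers.

```c
#include <assert.h>

/* ceil(log2(n)) for n >= 2; 0 for n == 0 and n == 1. */
static unsigned int order_base_2_sketch(unsigned long n)
{
	const unsigned int max_order = sizeof(n) * 8 - 1;
	unsigned int order = 0;

	/* The bound on order keeps the shift below the width of the type. */
	while (order < max_order && (1UL << order) < n)
		order++;
	return order;
}

/* Tier of a page accessed n times through file descriptors. */
static unsigned int tier_from_accesses(unsigned long n)
{
	return order_base_2_sketch(n);
}
```

So N=0 and N=1 fall in tier 0, N=2 in tier 1, N=3..4 in tier 2, and N=5..8 in tier 3. Each additional tier covers twice as many access counts as the previous one.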

In shrink_active_list() we give preferential treatment to VM_EXEC pages.
Do we do a similar thing with MGLRU? If not, why is that not needed?

	if (page_referenced(page, 0, sc->target_mem_cgroup,
			    &vm_flags)) {
		/*
		 * Identify referenced, file-backed active pages and
		 * give them one more trip around the active list. So
		 * that executable code get better chances to stay in
		 * memory under moderate memory pressure.  Anon pages
		 * are not likely to be evicted by use-once streaming
		 * IO, plus JVM can create lots of anon VM_EXEC pages,
		 * so we ignore them here.
		 */
		if ((vm_flags & VM_EXEC) && page_is_file_lru(page)) {
			nr_rotated += thp_nr_pages(page);
			list_add(&page->lru, &l_active);
			continue;
		}
	}


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 05/14] mm: multi-gen LRU: groundwork
  2022-03-14  9:30       ` Yu Zhao
@ 2022-03-21 18:58         ` Justin Forbes
  -1 siblings, 0 replies; 120+ messages in thread
From: Justin Forbes @ 2022-03-21 18:58 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Huang, Ying, kernel, kernel-team, Andrew Morton, Linus Torvalds,
	Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Johannes Weiner,
	Jonathan Corbet, Matthew Wilcox, Mel Gorman, Michael Larabel,
	Michal Hocko, Mike Rapoport, Rik van Riel, Vlastimil Babka,
	Will Deacon, Linux ARM, open list:DOCUMENTATION, linux-kernel,
	Linux-MM, Kernel Page Reclaim v2, the arch/x86 maintainers,
	Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Mon, Mar 14, 2022 at 4:30 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Mon, Mar 14, 2022 at 2:09 AM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Hi, Yu,
> >
> > Yu Zhao <yuzhao@google.com> writes:
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index 3326ee3903f3..747ab1690bcf 100644
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -892,6 +892,16 @@ config ANON_VMA_NAME
> > >         area from being merged with adjacent virtual memory areas due to the
> > >         difference in their name.
> > >
> > > +# the multi-gen LRU {
> > > +config LRU_GEN
> > > +     bool "Multi-Gen LRU"
> > > +     depends on MMU
> > > +     # the following options can use up the spare bits in page flags
> > > +     depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
> >
> > LRU_GEN depends on !MAXSMP.  So, What is the maximum NR_CPUS supported
> > by LRU_GEN?
>
> LRU_GEN doesn't really care about NR_CPUS. IOW, it doesn't impose a
> max number. The dependency is with NODES_SHIFT selected by MAXSMP:
>     default "10" if MAXSMP
> This combined with LAST_CPUPID_SHIFT can exhaust the spare bits in page flags.
>
> MAXSMP is meant for kernel developers to test their code, and it
> should not be used in production [1]. But some distros unfortunately
> ship kernels built with this option, e.g., Fedora and Ubuntu. And
> their users reported build errors to me after they applied MGLRU on
> those kernels ("Not enough bits in page flags"). Let me add Fedora and
> Ubuntu to this thread.
>
> Fedora and Ubuntu,
>
> Could you please clarify if there is a reason to ship kernels built
> with MAXSMP? Otherwise, please consider disabling this option. Thanks.
>
> As per above, MAXSMP enables ridiculously large numbers of CPUs and
> NUMA nodes for testing purposes. It is detrimental to performance,
> e.g., CPUMASK_OFFSTACK.

It was enabled for Fedora and RHEL because we did need more than 512
CPUs. It was originally enabled only in RHEL, until SGI (years ago)
complained that they were testing very large machines with Fedora.
The testing done on RHEL showed that the performance impact was
minimal. For a very long time we had MAXSMP off and carried a patch
which allowed us to turn on CPUMASK_OFFSTACK without debugging,
because there was supposed to be "something else" coming. In 2019 we
gave up, dropped that patch and just turned on MAXSMP.

I do not have any metrics for how often someone runs Fedora on a
ridiculously large machine these days, but I would guess that number
is not 0.

Justin

> [1] https://lore.kernel.org/lkml/20131106055634.GA24044@gmail.com/
>
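
For readers following the quoted NODES_SHIFT point: the build error is a
bit-budget problem, since several fields share one page-flags word. A
minimal sketch of that arithmetic follows. The struct, its field names
and every width other than NODES_SHIFT=10 are hypothetical placeholders,
not values taken from any kernel config.

```c
#include <assert.h>

/* Hypothetical widths (in bits) of the fields packed into page->flags. */
struct flags_layout {
	int nr_pageflags;      /* individual PG_* flag bits */
	int sections_width;    /* 0 when the section is not stored in flags */
	int nodes_shift;       /* NUMA node id; "10" under MAXSMP per the thread */
	int zones_shift;       /* zone id */
	int last_cpupid_shift; /* last CPU/PID hint used by NUMA balancing */
	int lru_gen_width;     /* extra bits MGLRU would like to claim */
};

/* Bits left over in one word; a negative result means the layout does not fit. */
static int spare_bits(const struct flags_layout *l, int bits_per_long)
{
	int used = l->nr_pageflags + l->sections_width + l->nodes_shift +
		   l->zones_shift + l->last_cpupid_shift + l->lru_gen_width;

	assert(bits_per_long == 32 || bits_per_long == 64);
	return bits_per_long - used;
}
```

With made-up widths of 26 flag bits, 3 zone bits, 21 last-cpupid bits
and 5 MGLRU bits, nodes_shift=10 overshoots a 64-bit word by one bit,
while nodes_shift=6 leaves 3 bits spare.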

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 05/14] mm: multi-gen LRU: groundwork
  2022-03-21 18:58         ` Justin Forbes
@ 2022-03-21 19:17           ` Prarit Bhargava
  -1 siblings, 0 replies; 120+ messages in thread
From: Prarit Bhargava @ 2022-03-21 19:17 UTC (permalink / raw)
  To: Justin Forbes, Yu Zhao
  Cc: Andi Kleen, kernel-team, Vaibhav Jain, Rik van Riel, Mel Gorman,
	Catalin Marinas, Johannes Weiner, Aneesh Kumar, Brian Geffon,
	open list:DOCUMENTATION, Jesse Barnes, Sofia Trinh, Huang, Ying,
	linux-kernel, Steven Barrett, Shuang Zhai, Donald Carr,
	Oleksandr Natalenko, Holger Hoffstätte, Will Deacon,
	Dave Hansen, Jonathan Corbet, Mike Rapoport, Andrew Morton,
	Jens Axboe, Hillf Danton, Michal Hocko, kernel, Suleiman Souhlal,
	Daniel Byrne, the arch/x86 maintainers, Konstantin Kharlamov,
	Matthew Wilcox, Linus Torvalds, Michael Larabel, Linux-MM,
	Kernel Page Reclaim v2, Jan Alexander Steffens, Linux ARM

On 3/21/22 14:58, Justin Forbes wrote:
> On Mon, Mar 14, 2022 at 4:30 AM Yu Zhao <yuzhao@google.com> wrote:
>>
>> On Mon, Mar 14, 2022 at 2:09 AM Huang, Ying <ying.huang@intel.com> wrote:
>>>
>>> Hi, Yu,
>>>
>>> Yu Zhao <yuzhao@google.com> writes:
>>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>>> index 3326ee3903f3..747ab1690bcf 100644
>>>> --- a/mm/Kconfig
>>>> +++ b/mm/Kconfig
>>>> @@ -892,6 +892,16 @@ config ANON_VMA_NAME
>>>>          area from being merged with adjacent virtual memory areas due to the
>>>>          difference in their name.
>>>>
>>>> +# the multi-gen LRU {
>>>> +config LRU_GEN
>>>> +     bool "Multi-Gen LRU"
>>>> +     depends on MMU
>>>> +     # the following options can use up the spare bits in page flags
>>>> +     depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
>>>
>>> LRU_GEN depends on !MAXSMP.  So, What is the maximum NR_CPUS supported
>>> by LRU_GEN?
>>
>> LRU_GEN doesn't really care about NR_CPUS. IOW, it doesn't impose a
>> max number. The dependency is with NODES_SHIFT selected by MAXSMP:
>>      default "10" if MAXSMP
>> This combined with LAST_CPUPID_SHIFT can exhaust the spare bits in page flags.
>>
>> MAXSMP is meant for kernel developers to test their code, and it
>> should not be used in production [1]. But some distros unfortunately
>> ship kernels built with this option, e.g., Fedora and Ubuntu. And
>> their users reported build errors to me after they applied MGLRU on
>> those kernels ("Not enough bits in page flags"). Let me add Fedora and
>> Ubuntu to this thread.
>>
>> Fedora and Ubuntu,
>>
>> Could you please clarify if there is a reason to ship kernels built
>> with MAXSMP? Otherwise, please consider disabling this option. Thanks.
>>
>> As per above, MAXSMP enables ridiculously large numbers of CPUs and
>> NUMA nodes for testing purposes. It is detrimental to performance,
>> e.g., CPUMASK_OFFSTACK.
> 
> It was enabled for Fedora, and RHEL because we did need more than 512
> CPUs, originally only in RHEL until SGI (years ago) complained that
> they were testing very large machines with Fedora.  The testing done
> on RHEL showed that the performance impact was minimal.   For a very
> long time we had MAXSMP off and carried a patch which allowed us to
> turn on CPUMASK_OFFSTACK without debugging because there was supposed
> to be "something else" coming.  In 2019 we gave up, dropped that patch
> and just turned on MAXSMP.
> 
> I do not have any metrics for how often someone runs Fedora on a
> ridiculously large machine these days, but I would guess that number
> is not 0.

It is not 0.  I've seen data from large systems (1000+ logical threads)
that are running Fedora, albeit with a modified Fedora kernel.

Additionally, the max limit for CPUs in RHEL is 1792; however, we have
recently had a request to *double* that to 3584.  You should just assume
that number will continue to increase.

P.


> 
> Justin
> 
>> [1] https://lore.kernel.org/lkml/20131106055634.GA24044@gmail.com/
>>


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation
  2022-03-19 10:14     ` Barry Song
@ 2022-03-21 23:51       ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-21 23:51 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Sat, Mar 19, 2022 at 4:14 AM Barry Song <21cnbao@gmail.com> wrote:
>
> > +static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
> > +{
> > +       int prev, next;
> > +       int type, zone;
> > +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > +
> > +       spin_lock_irq(&lruvec->lru_lock);
> > +
> > +       VM_BUG_ON(!seq_is_valid(lruvec));
> > +
> > +       if (max_seq != lrugen->max_seq)
> > +               goto unlock;
> > +
> > +       inc_min_seq(lruvec);
> > +
> > +       /* update the active/inactive LRU sizes for compatibility */
> > +       prev = lru_gen_from_seq(lrugen->max_seq - 1);
> > +       next = lru_gen_from_seq(lrugen->max_seq + 1);
> > +
> > +       for (type = 0; type < ANON_AND_FILE; type++) {
> > +               for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> > +                       enum lru_list lru = type * LRU_INACTIVE_FILE;
> > +                       long delta = lrugen->nr_pages[prev][type][zone] -
> > +                                    lrugen->nr_pages[next][type][zone];
>
> This is confusing to me. Does lrugen->nr_pages[next][type][zone] have a
> chance to be non-zero even before max_seq is increased? Can some pages
> be in the next generation before the generation is born?

Yes.

> Isn't it a bug if (lrugen->nr_pages[next][type][zone] > 0)? Shouldn't it be:
>
> delta = lrugen->nr_pages[prev][type][zone];

No. The gen counter in page flags can be updated locklessly (i.e.,
without holding lru_lock). Later, a batched update of nr_pages[] will
account for the change made. If the gen counter is updated to a stale
max_seq, and this stale max_seq is less than min_seq, then this page
will be in a generation yet to be born. Extremely unlikely, but still
possible.

This is not a bug because pages might be misplaced but they won't be
lost. IOW, nr_pages[] is always balanced across all *possible*
generations. For the same reason, reset_batch_size() and
drain_evictable() use for_each_gen_type_zone() to go through all
possible generations rather than only those in [min_seq, max_seq].

I'll add a comment here. Sounds good?
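
To make the "yet to be born" case concrete, here is a small sketch of
how a stale sequence number can alias the slot of max_seq+1. It assumes
that lru_gen_from_seq() reduces a sequence number modulo MAX_NR_GENS (4,
as mentioned elsewhere in this thread). The helper names below are
illustrative and are not taken from the patch.

```c
#include <assert.h>

#define MAX_NR_GENS 4U /* value mentioned in this thread */

/* Assumed mapping: a sequence number indexes a ring of MAX_NR_GENS slots. */
static int gen_from_seq(unsigned long seq)
{
	return (int)(seq % MAX_NR_GENS);
}

/* Pages in the generations that currently exist, i.e. [min_seq, max_seq]. */
static long live_pages(const long *nr_pages, unsigned long min_seq,
		       unsigned long max_seq)
{
	unsigned long seq;
	long sum = 0;

	assert(min_seq <= max_seq && max_seq - min_seq < MAX_NR_GENS);
	for (seq = min_seq; seq <= max_seq; seq++)
		sum += nr_pages[gen_from_seq(seq)];
	return sum;
}

/* Pages in every possible slot, including ones not yet (re)born. */
static long all_pages(const long *nr_pages)
{
	unsigned int gen;
	long sum = 0;

	for (gen = 0; gen < MAX_NR_GENS; gen++)
		sum += nr_pages[gen];
	return sum;
}
```

With min_seq=5 and max_seq=7, a page tagged with a stale max_seq of 4
lands in slot 0, which is also the slot of seq 8 (max_seq+1).
live_pages() misses that page while all_pages() still counts it, which
is the reason for walking all possible slots.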

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation
  2022-03-19 11:15     ` Barry Song
@ 2022-03-22  0:30       ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-22  0:30 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Sat, Mar 19, 2022 at 5:15 AM Barry Song <21cnbao@gmail.com> wrote:
>
> > +                            unsigned long *min_seq, bool can_swap, bool *need_aging)
> > +{
> > +       int gen, type, zone;
> > +       long old = 0;
> > +       long young = 0;
> > +       long total = 0;
> > +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > +
> > +       for (type = !can_swap; type < ANON_AND_FILE; type++) {
> > +               unsigned long seq;
> > +
> > +               for (seq = min_seq[type]; seq <= max_seq; seq++) {
> > +                       long size = 0;
> > +
> > +                       gen = lru_gen_from_seq(seq);
> > +
> > +                       for (zone = 0; zone < MAX_NR_ZONES; zone++)
> > +                               size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
> > +
> > +                       total += size;
> > +                       if (seq == max_seq)
> > +                               young += size;
> > +                       if (seq + MIN_NR_GENS == max_seq)
> > +                               old += size;
> > +               }
> > +       }
> > +
> > +       /* try to spread pages out across MIN_NR_GENS+1 generations */
> > +       if (min_seq[LRU_GEN_FILE] + MIN_NR_GENS > max_seq)
> > +               *need_aging = true;
> > +       else if (min_seq[LRU_GEN_FILE] + MIN_NR_GENS < max_seq)
> > +               *need_aging = false;
> > +       else if (young * MIN_NR_GENS > total)
> > +               *need_aging = true;
>
> Could we have some doc here?

Will do.

> Given MIN_NR_GENS=2 and MAX_NR_GENS=4,
> it seems you mean if we have three generations and the youngest pages are more
> than 1/2 of the total pages, we need aging?

Yes.

> > +       else if (old * (MIN_NR_GENS + 2) < total)
> > +               *need_aging = true;
>
> it seems you mean if the oldest pages are less than 1/4 of the total pages,
> we need aging?

Yes.

> Can we have comments to explain why here?
>
> your commit message only says " The aging produces young generations.
> Given an lruvec, it increments max_seq when max_seq-min_seq+1
> approaches MIN_NR_GENS." it can't explain what the code is doing
> here.

Fair enough. Approaching MIN_NR_GENS=2 means getting close to it. From
the consumer's POV, if the number of generations *reaches* 2, the
eviction will have to stall, because the two youngest generations are
not yet fully aged, i.e., the second chance policy similar to the
active/inactive LRU. From the producer's POV, the aging tries to be
lazy to reduce the overhead. So ideally, we want 3 generations, which
gives a reasonable range [2, 4], hence the first two if's.

In addition, we want pages to spread out evenly over these 3
generations, meaning an average 1/3 of total pages for each
generation, which gives another reasonable range [1/4, 1/2]. Since the
eviction reduces the number of old pages, we only need to check the
oldest generation against the lower bound, i.e., 1/4. On the other
hand, page (re)faults increase the number of young pages, so in this
case, we need to check the youngest generation against the upper
bound, i.e., 1/2.

I'll include these details in the next spin.
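
As a rough illustration of the four checks discussed above, the decision
can be written as a standalone function. This is only a sketch: the
function name and its flat parameters are made up for the example,
min_seq stands for min_seq[LRU_GEN_FILE], and the final "return false"
is an assumption because the quoted hunk ends before any trailing else.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_NR_GENS 2

/*
 * young: pages in the generation at max_seq
 * old:   pages in the generation at max_seq - MIN_NR_GENS
 * total: pages in all generations in [min_seq, max_seq]
 */
static bool need_aging(unsigned long min_seq, unsigned long max_seq,
		       long young, long old, long total)
{
	assert(min_seq <= max_seq);

	/* fewer than MIN_NR_GENS+1 generations: always age */
	if (min_seq + MIN_NR_GENS > max_seq)
		return true;
	/* more than MIN_NR_GENS+1 generations: be lazy */
	if (min_seq + MIN_NR_GENS < max_seq)
		return false;
	/* exactly 3 generations: youngest holds more than 1/2 of the pages */
	if (young * MIN_NR_GENS > total)
		return true;
	/* exactly 3 generations: oldest holds less than 1/4 of the pages */
	if (old * (MIN_NR_GENS + 2) < total)
		return true;
	return false; /* assumed default, not shown in the quoted hunk */
}
```

For example, with exactly three generations and 100 pages in total, 60
young pages trigger aging (60 * 2 > 100), as do 20 old pages
(20 * 4 < 100), while 35 young and 30 old pages do not.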

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation
  2022-03-21 12:51     ` Aneesh Kumar K.V
@ 2022-03-22  4:02       ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-22  4:02 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Johannes Weiner, Jonathan Corbet, Matthew Wilcox, Mel Gorman,
	Michael Larabel, Michal Hocko, Mike Rapoport, Rik van Riel,
	Vlastimil Babka, Will Deacon, Ying Huang, Linux ARM,
	open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Mon, Mar 21, 2022 at 6:52 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
>  +
> > +static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
> > +                          unsigned long *min_seq, bool can_swap, bool *need_aging)
> > +{
> > +     int gen, type, zone;
> > +     long old = 0;
> > +     long young = 0;
> > +     long total = 0;
> > +     struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > +
> > +     for (type = !can_swap; type < ANON_AND_FILE; type++) {
> > +             unsigned long seq;
> > +
> > +             for (seq = min_seq[type]; seq <= max_seq; seq++) {
> > +                     long size = 0;
> > +
> > +                     gen = lru_gen_from_seq(seq);
> > +
> > +                     for (zone = 0; zone < MAX_NR_ZONES; zone++)
> > +                             size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
> > +
> > +                     total += size;
> > +                     if (seq == max_seq)
> > +                             young += size;
> > +                     if (seq + MIN_NR_GENS == max_seq)
> > +                             old += size;
> > +             }
> > +     }
> > +
> > +     /* try to spread pages out across MIN_NR_GENS+1 generations */
> > +     if (min_seq[LRU_GEN_FILE] + MIN_NR_GENS > max_seq)
> > +             *need_aging = true;
> > +     else if (min_seq[LRU_GEN_FILE] + MIN_NR_GENS < max_seq)
> > +             *need_aging = false;
>
> Can you explain/document the reason for considering the below
> conditions for ageing?
>
> > +     else if (young * MIN_NR_GENS > total)
> > +             *need_aging = true;
>
> Are we trying to consider the case of more than half the total pages
> young as needing ageing? If so, should MIN_NR_GENS be 2 instead of using
> that #define? Or
>
> > +     else if (old * (MIN_NR_GENS + 2) < total)
> > +             *need_aging = true;
>
> What is the significance of '+ 2' ?

Will improve the comment according to my previous reply here [1].

[1] https://lore.kernel.org/linux-mm/CAOUHufYmUPZY0gCC+wYk6Vr1L8KEx+tJeEAhjpBfUnLJsAHq5A@mail.gmail.com/

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation
  2022-03-21 13:01     ` Aneesh Kumar K.V
@ 2022-03-22  4:39       ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-22  4:39 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Johannes Weiner, Jonathan Corbet, Matthew Wilcox, Mel Gorman,
	Michael Larabel, Michal Hocko, Mike Rapoport, Rik van Riel,
	Vlastimil Babka, Will Deacon, Ying Huang, Linux ARM,
	open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Mon, Mar 21, 2022 at 7:01 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > To avoid confusion, the terms "promotion" and "demotion" will be
> > applied to the multi-gen LRU, as a new convention; the terms
> > "activation" and "deactivation" will be applied to the active/inactive
> > LRU, as usual.
> >
> > The aging produces young generations. Given an lruvec, it increments
> > max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
> > promotes hot pages to the youngest generation when it finds them
> > accessed through page tables; the demotion of cold pages happens
> > consequently when it increments max_seq. The aging has the complexity
> > O(nr_hot_pages), since it is only interested in hot pages. Promotion
> > in the aging path does not require any LRU list operations, only the
> > updates of the gen counter and lrugen->nr_pages[]; demotion, unless as
> > the result of the increment of max_seq, requires LRU list operations,
> > e.g., lru_deactivate_fn().
> >
> > The eviction consumes old generations. Given an lruvec, it increments
> > min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
> > feedback loop modeled after the PID controller monitors refaults over
> > anon and file types and decides which type to evict when both types
> > are available from the same generation.
> >
> > Each generation is divided into multiple tiers. Tiers represent
> > different ranges of numbers of accesses through file descriptors. A
> > page accessed N times through file descriptors is in tier
> > order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
> > bits in folio->flags. In contrast to moving across generations, which
> > requires the LRU lock, moving across tiers only involves operations on
> > folio->flags. The feedback loop also monitors refaults over all tiers
> > and decides when to protect pages in which tiers (N>1), using the
> > first tier (N=0,1) as a baseline. The first tier contains single-use
> > unmapped clean pages, which are most likely the best choices. The
> > eviction moves a page to the next generation, i.e., min_seq+1, if the
> > feedback loop decides so. This approach has the following advantages:
> > 1. It removes the cost of activation in the buffered access path by
> >    inferring whether pages accessed multiple times through file
> >    descriptors are statistically hot and thus worth protecting in the
> >    eviction path.
> > 2. It takes pages accessed through page tables into account and avoids
> >    overprotecting pages accessed multiple times through file
> >    descriptors. (Pages accessed through page tables are in the first
> >    tier, since N=0.)
> > 3. More tiers provide better protection for pages accessed more than
> >    twice through file descriptors, when under heavy buffered I/O
> >    workloads.
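
The mapping from access counts to tiers described above can be
sketched in userspace C. This is an illustration under assumptions:
order_base_2_sketch() is a stand-in for the kernel's order_base_2()
(base-2 logarithm rounded up, with 0 and 1 both mapping to 0),
tier_from_accesses() is a hypothetical name, and the sketch ignores
any upper limit on the number of tiers.

```c
/*
 * Userspace stand-in for the kernel's order_base_2(): the base-2
 * logarithm of n rounded up, with 0 and 1 both mapping to 0.
 */
unsigned int order_base_2_sketch(unsigned long n)
{
	unsigned int order = 0;

	if (n <= 1)
		return 0;

	/* count the bits needed to represent n - 1 */
	for (n--; n; n >>= 1)
		order++;

	return order;
}

/* Hypothetical helper: tier of a page accessed n times through fds. */
unsigned int tier_from_accesses(unsigned long n)
{
	return order_base_2_sketch(n);
}
```

Pages with N=0 or N=1 land in the first tier (tier 0), N=2 maps to
tier 1, N=3 and N=4 map to tier 2, and N=5 maps to tier 3, which
matches the "first tier (N=0,1)" baseline in the text.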
> >
> > Server benchmark results:
> >   Single workload:
> >     fio (buffered I/O): +[47, 49]%
> >                 IOPS         BW
> >       5.17-rc2: 2242k        8759MiB/s
> >       patch1-5: 3321k        12.7GiB/s
> >
> >   Single workload:
> >     memcached (anon): +[101, 105]%
> >                 Ops/sec      KB/sec
> >       5.17-rc2: 476771.79    18544.31
> >       patch1-5: 972526.07    37826.95
> >
> >   Configurations:
> >     CPU: two Xeon 6154
> >     Mem: total 256G
> >
> >     Node 1 was only used as a ram disk to reduce the variance in the
> >     results.
> >
> >     patch drivers/block/brd.c <<EOF
> >     99,100c99,100
> >     <         gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
> >     <         page = alloc_page(gfp_flags);
> >     ---
> >     >         gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
> >     >         page = alloc_pages_node(1, gfp_flags, 0);
> >     EOF
> >
> >     cat >>/etc/systemd/system.conf <<EOF
> >     CPUAffinity=numa
> >     NUMAPolicy=bind
> >     NUMAMask=0
> >     EOF
> >
> >     cat >>/etc/memcached.conf <<EOF
> >     -m 184320
> >     -s /var/run/memcached/memcached.sock
> >     -a 0766
> >     -t 36
> >     -B binary
> >     EOF
> >
> >     cat fio.sh
> >     modprobe brd rd_nr=1 rd_size=113246208
> >     mkfs.ext4 /dev/ram0
> >     mount -t ext4 /dev/ram0 /mnt
> >
> >     mkdir /sys/fs/cgroup/user.slice/test
> >     echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
> >     echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
> >     fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
> >       --buffered=1 --ioengine=io_uring --iodepth=128 \
> >       --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> >       --rw=randread --random_distribution=random --norandommap \
> >       --time_based --ramp_time=10m --runtime=5m --group_reporting
> >
> >     cat memcached.sh
> >     modprobe brd rd_nr=1 rd_size=113246208
> >     swapoff -a
> >     mkswap /dev/ram0
> >     swapon /dev/ram0
> >
> >     memtier_benchmark -S /var/run/memcached/memcached.sock \
> >       -P memcache_binary -n allkeys --key-minimum=1 \
> >       --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
> >       --ratio 1:0 --pipeline 8 -d 2000
> >
> >     memtier_benchmark -S /var/run/memcached/memcached.sock \
> >       -P memcache_binary -n allkeys --key-minimum=1 \
> >       --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
> >       --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
> >
> > Client benchmark results:
> >   kswapd profiles:
> >     5.17-rc2
> >       38.05%  page_vma_mapped_walk
> >       20.86%  lzo1x_1_do_compress (real work)
> >        6.16%  do_raw_spin_lock
> >        4.61%  _raw_spin_unlock_irq
> >        2.20%  vma_interval_tree_iter_next
> >        2.19%  vma_interval_tree_subtree_search
> >        2.15%  page_referenced_one
> >        1.93%  anon_vma_interval_tree_iter_first
> >        1.65%  ptep_clear_flush
> >        1.00%  __zram_bvec_write
> >
> >     patch1-5
> >       39.73%  lzo1x_1_do_compress (real work)
> >       14.96%  page_vma_mapped_walk
> >        6.97%  _raw_spin_unlock_irq
> >        3.07%  do_raw_spin_lock
> >        2.53%  anon_vma_interval_tree_iter_first
> >        2.04%  ptep_clear_flush
> >        1.82%  __zram_bvec_write
> >        1.76%  __anon_vma_interval_tree_subtree_search
> >        1.57%  memmove
> >        1.45%  free_unref_page_list
> >
> >   Configurations:
> >     CPU: single Snapdragon 7c
> >     Mem: total 4G
> >
> >     Chrome OS MemoryPressure [1]
> >
> > [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
> >
>
> In shrink_active_list() we give preferential treatment to VM_EXEC pages.
> Do we do a similar thing with MGLRU? If not, why is that not needed?

No, because MGLRU has a different set of assumptions than the
active/inactive LRU does [1]. It provides mmapped pages with equal
opportunities, and the tradeoff was discussed here [2].

Note that even with this preferential treatment of executable pages,
plus other heuristics added since then, executable pages are still
underprotected for at least desktop workloads [3]. And I can confirm
the problem reported is genuine -- we recently accidentally removed
our private patch that had worked around the problem for the last 12
years, and observed immediate consequences on a small portion of
devices not using MGLRU [4].

[1] https://lore.kernel.org/linux-mm/20220309021230.721028-15-yuzhao@google.com/
[2] https://lore.kernel.org/linux-mm/20220208081902.3550911-5-yuzhao@google.com/
[3] https://lore.kernel.org/linux-mm/2dc51fc8-f14e-17ed-a8c6-0ec70423bf54@valdikss.org.ru/
[4] https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/3429559

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 05/14] mm: multi-gen LRU: groundwork
  2022-03-21 19:17           ` Prarit Bhargava
@ 2022-03-22  4:52             ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-22  4:52 UTC (permalink / raw)
  To: Prarit Bhargava, Justin Forbes
  Cc: Andi Kleen, kernel-team, Vaibhav Jain, Rik van Riel, Mel Gorman,
	Catalin Marinas, Johannes Weiner, Aneesh Kumar, Brian Geffon,
	open list:DOCUMENTATION, Jesse Barnes, Sofia Trinh, Huang, Ying,
	linux-kernel, Steven Barrett, Shuang Zhai, Donald Carr,
	Oleksandr Natalenko, Holger Hoffstätte, Will Deacon,
	Dave Hansen, Jonathan Corbet, Mike Rapoport, Andrew Morton,
	Jens Axboe, Hillf Danton, Michal Hocko, kernel, Suleiman Souhlal,
	Daniel Byrne, the arch/x86 maintainers, Konstantin Kharlamov,
	Matthew Wilcox, Linus Torvalds, Michael Larabel, Linux-MM,
	Kernel Page Reclaim v2, Jan Alexander Steffens, Linux ARM

On Mon, Mar 21, 2022 at 1:18 PM Prarit Bhargava <prarit@redhat.com> wrote:
>
> On 3/21/22 14:58, Justin Forbes wrote:
> > On Mon, Mar 14, 2022 at 4:30 AM Yu Zhao <yuzhao@google.com> wrote:
> >>
> >> On Mon, Mar 14, 2022 at 2:09 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>>
> >>> Hi, Yu,
> >>>
> >>> Yu Zhao <yuzhao@google.com> writes:
> >>>> diff --git a/mm/Kconfig b/mm/Kconfig
> >>>> index 3326ee3903f3..747ab1690bcf 100644
> >>>> --- a/mm/Kconfig
> >>>> +++ b/mm/Kconfig
> >>>> @@ -892,6 +892,16 @@ config ANON_VMA_NAME
> >>>>          area from being merged with adjacent virtual memory areas due to the
> >>>>          difference in their name.
> >>>>
> >>>> +# the multi-gen LRU {
> >>>> +config LRU_GEN
> >>>> +     bool "Multi-Gen LRU"
> >>>> +     depends on MMU
> >>>> +     # the following options can use up the spare bits in page flags
> >>>> +     depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
> >>>
> >>> LRU_GEN depends on !MAXSMP.  So, What is the maximum NR_CPUS supported
> >>> by LRU_GEN?
> >>
> >> LRU_GEN doesn't really care about NR_CPUS. IOW, it doesn't impose a
> >> max number. The dependency is with NODES_SHIFT selected by MAXSMP:
> >>      default "10" if MAXSMP
> >> This combined with LAST_CPUPID_SHIFT can exhaust the spare bits in page flags.
> >>
> >> MAXSMP is meant for kernel developers to test their code, and it
> >> should not be used in production [1]. But some distros unfortunately
> >> ship kernels built with this option, e.g., Fedora and Ubuntu. And
> >> their users reported build errors to me after they applied MGLRU on
> >> those kernels ("Not enough bits in page flags"). Let me add Fedora and
> >> Ubuntu to this thread.
> >>
> >> Fedora and Ubuntu,
> >>
> >> Could you please clarify if there is a reason to ship kernels built
> >> with MAXSMP? Otherwise, please consider disabling this option. Thanks.
> >>
> >> As per above, MAXSMP enables ridiculously large numbers of CPUs and
> >> NUMA nodes for testing purposes. It is detrimental to performance,
> >> e.g., CPUMASK_OFFSTACK.
> >
> > It was enabled for Fedora and RHEL because we did need more than 512
> > CPUs, originally only in RHEL until SGI (years ago) complained that
> > they were testing very large machines with Fedora.  The testing done
> > on RHEL showed that the performance impact was minimal.  For a very
> > long time we had MAXSMP off and carried a patch which allowed us to
> > turn on CPUMASK_OFFSTACK without debugging because there was supposed
> > to be "something else" coming.  In 2019 we gave up, dropped that patch
> > and just turned on MAXSMP.
> >
> > I do not have any metrics for how often someone runs Fedora on a
> > ridiculously large machine these days, but I would guess that number
> > is not 0.
>
> It is not 0.  I've seen data from large systems (1000+ logical threads)
> that are running Fedora albeit with a modified Fedora kernel.
>
> Additionally, the max limit for CPUs in RHEL is 1792; however, we have
> recently had a request to *double* that to 3584.  You should just assume
> that number will continue to increase.

Good to know. Thanks.

From the standpoint of overhead, I'd consider NR_CPUS=4096 and
NODES_SHIFT=7 as the next step, before going with MAXSMP.
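
The arithmetic behind this suggestion can be sketched in userspace C.
The helpers bits_for() and max_nodes() are hypothetical names used
only for illustration, and the sketch assumes that an ID space of n
entries needs ceil(log2(n)) bits; it says nothing about how many
spare bits a given kernel configuration actually has in page flags.

```c
/* Number of bits needed to index n distinct IDs, i.e., ceil(log2(n)). */
unsigned int bits_for(unsigned long n)
{
	unsigned int bits = 0;

	if (n <= 1)
		return 0;

	/* count the bits needed to represent n - 1 */
	for (n--; n; n >>= 1)
		bits++;

	return bits;
}

/* Number of NUMA nodes representable with a given NODES_SHIFT. */
unsigned long max_nodes(unsigned int nodes_shift)
{
	return 1UL << nodes_shift;
}
```

NODES_SHIFT=7 still allows 128 nodes while using 3 fewer bits than the
MAXSMP default of 10 (1024 nodes), and NR_CPUS=4096 needs 12 bits,
which also covers the 3584 CPUs requested for RHEL.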

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation
  2022-03-09  2:12   ` Yu Zhao
@ 2022-03-22  5:26     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 120+ messages in thread
From: Aneesh Kumar K.V @ 2022-03-22  5:26 UTC (permalink / raw)
  To: Yu Zhao, Andrew Morton, Linus Torvalds
  Cc: Andi Kleen, Catalin Marinas, Dave Hansen, Hillf Danton,
	Jens Axboe, Jesse Barnes, Johannes Weiner, Jonathan Corbet,
	Matthew Wilcox, Mel Gorman, Michael Larabel, Michal Hocko,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Yu Zhao, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh, Vaibhav Jain

Yu Zhao <yuzhao@google.com> writes:

 +
> +static void inc_min_seq(struct lruvec *lruvec)
> +{
> +	int type;
> +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +	VM_BUG_ON(!seq_is_valid(lruvec));
> +
> +	for (type = 0; type < ANON_AND_FILE; type++) {
> +		if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
> +			continue;
> +
> +		reset_ctrl_pos(lruvec, type, true);
> +		WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1);
> +	}
> +}
> +
> +static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
> +{
> +	int gen, type, zone;
> +	bool success = false;
> +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +	DEFINE_MIN_SEQ(lruvec);
> +
> +	VM_BUG_ON(!seq_is_valid(lruvec));
> +
> +	for (type = !can_swap; type < ANON_AND_FILE; type++) {
> +		while (min_seq[type] + MIN_NR_GENS <= lrugen->max_seq) {
> +			gen = lru_gen_from_seq(min_seq[type]);
> +
> +			for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> +				if (!list_empty(&lrugen->lists[gen][type][zone]))
> +					goto next;
> +			}
> +
> +			min_seq[type]++;
> +		}
> +next:
> +		;
> +	}
> +
> +	/* see the comment on lru_gen_struct */
> +	if (can_swap) {
> +		min_seq[LRU_GEN_ANON] = min(min_seq[LRU_GEN_ANON], min_seq[LRU_GEN_FILE]);
> +		min_seq[LRU_GEN_FILE] = max(min_seq[LRU_GEN_ANON], lrugen->min_seq[LRU_GEN_FILE]);
> +	}
> +
> +	for (type = !can_swap; type < ANON_AND_FILE; type++) {
> +		if (min_seq[type] == lrugen->min_seq[type])
> +			continue;
> +
> +		reset_ctrl_pos(lruvec, type, true);
> +		WRITE_ONCE(lrugen->min_seq[type], min_seq[type]);
> +		success = true;
> +	}
> +
> +	return success;
> +}
> +
> +static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
> +{
> +	int prev, next;
> +	int type, zone;
> +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +	spin_lock_irq(&lruvec->lru_lock);
> +
> +	VM_BUG_ON(!seq_is_valid(lruvec));
> +
> +	if (max_seq != lrugen->max_seq)
> +		goto unlock;
> +
> +	inc_min_seq(lruvec);

Can this min_seq update result in pages that were considered the oldest
becoming the youngest? I.e., if we had seq values 0-3 and we need aging,
the new min_seq and max_seq values become 1-4. What happens to pages in
generation 0, which was the oldest generation earlier and is the youngest
now?


> +
> +	/* update the active/inactive LRU sizes for compatibility */
> +	prev = lru_gen_from_seq(lrugen->max_seq - 1);
> +	next = lru_gen_from_seq(lrugen->max_seq + 1);
> +
> +	for (type = 0; type < ANON_AND_FILE; type++) {
> +		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> +			enum lru_list lru = type * LRU_INACTIVE_FILE;
> +			long delta = lrugen->nr_pages[prev][type][zone] -
> +				     lrugen->nr_pages[next][type][zone];
> +
> +			if (!delta)
> +				continue;
> +
> +			__update_lru_size(lruvec, lru, zone, delta);
> +			__update_lru_size(lruvec, lru + LRU_ACTIVE, zone, -delta);
> +		}
> +	}
> +
> +	for (type = 0; type < ANON_AND_FILE; type++)
> +		reset_ctrl_pos(lruvec, type, false);
> +
> +	/* make sure preceding modifications appear */
> +	smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
> +unlock:
> +	spin_unlock_irq(&lruvec->lru_lock);
> +}
> +

....

> +
> +static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
> +{
> +	int type;
> +	int scanned;
> +	int reclaimed;
> +	LIST_HEAD(list);
> +	struct folio *folio;
> +	enum vm_event_item item;
> +	struct reclaim_stat stat;
> +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +
> +	spin_lock_irq(&lruvec->lru_lock);
> +
> +	scanned = isolate_folios(lruvec, sc, swappiness, &type, &list);
> +
> +	if (try_to_inc_min_seq(lruvec, swappiness))
> +		scanned++;

We are doing this before we shrink the page list. Is there a reason to do it first?

> +
> +	if (get_nr_gens(lruvec, LRU_GEN_FILE) == MIN_NR_GENS)
> +		scanned = 0;

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation
  2022-03-22  5:26     ` Aneesh Kumar K.V
@ 2022-03-22  5:55       ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-22  5:55 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Johannes Weiner, Jonathan Corbet, Matthew Wilcox, Mel Gorman,
	Michael Larabel, Michal Hocko, Mike Rapoport, Rik van Riel,
	Vlastimil Babka, Will Deacon, Ying Huang, Linux ARM,
	open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Mon, Mar 21, 2022 at 11:27 PM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > +static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
> > +{
> > +     int prev, next;
> > +     int type, zone;
> > +     struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > +
> > +     spin_lock_irq(&lruvec->lru_lock);
> > +
> > +     VM_BUG_ON(!seq_is_valid(lruvec));
> > +
> > +     if (max_seq != lrugen->max_seq)
> > +             goto unlock;
> > +
> > +     inc_min_seq(lruvec);
>
> Can this min_seq update result in pages that were considered the oldest
> becoming the youngest? I.e., if we had seq values 0-3 and we need aging,
> the new min_seq and max_seq values become 1-4. What happens to pages in
> generation 0, which was the oldest generation earlier and is the youngest
> now?

If anon pages are not reclaimable, e.g., there is no swapfile, they won't
be scanned at all. So their coldness/hotness doesn't matter -- they don't
need to be on lrugen->lists[] at all.

If there is a swapfile but it's full, then yes, the inversion will
happen. This can be handled by moving pages from the oldest generation
to the tail of the second oldest generation, which maintains the LRU
order.

In fact, both were handled in the previous versions [1] [2]. They were
removed in v6 for simplicity.

[1] https://lore.kernel.org/linux-mm/20211111041510.402534-5-yuzhao@google.com/
[2] https://lore.kernel.org/linux-mm/20211111041510.402534-7-yuzhao@google.com/

> > +static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
> > +{
> > +     int type;
> > +     int scanned;
> > +     int reclaimed;
> > +     LIST_HEAD(list);
> > +     struct folio *folio;
> > +     enum vm_event_item item;
> > +     struct reclaim_stat stat;
> > +     struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > +     struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > +
> > +     spin_lock_irq(&lruvec->lru_lock);
> > +
> > +     scanned = isolate_folios(lruvec, sc, swappiness, &type, &list);
> > +
> > +     if (try_to_inc_min_seq(lruvec, swappiness))
> > +             scanned++;
>
> We are doing this before we shrink the page list. Is there a reason to do it first?

We have isolated pages from lrugen->lists[], and we might have
exhausted all pages in the oldest generations, i.e.,
lrugen->lists[min_seq] is now empty. Incrementing min_seq after
shrink_page_list() would not be wrong. However, it's better to do it as
soon as possible so that concurrent reclaimers are less likely to see a
stale min_seq and come here under the false impression that they'd make
some progress. (Instead, they will go to the aging path and call
inc_max_seq() first before coming here.)

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 11/14] mm: multi-gen LRU: thrashing prevention
  2022-03-09  2:12   ` Yu Zhao
@ 2022-03-22  7:22     ` Barry Song
  -1 siblings, 0 replies; 120+ messages in thread
From: Barry Song @ 2022-03-22  7:22 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Wed, Mar 9, 2022 at 3:48 PM Yu Zhao <yuzhao@google.com> wrote:
>
> Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as
> requested by many desktop users [1].
>
> When set to value N, it prevents the working set of N milliseconds
> from getting evicted. The OOM killer is triggered if this working set
> cannot be kept in memory. Based on the average human detectable lag
> (~100ms), N=1000 usually eliminates intolerable lags due to thrashing.
> Larger values like N=3000 make lags less noticeable at the risk of
> premature OOM kills.
>
> Compared with the size-based approach, e.g., [2], this time-based
> approach has the following advantages:
> 1. It is easier to configure because it is agnostic to applications
>    and memory sizes.
> 2. It is more reliable because it is directly wired to the OOM killer.
>

How are userspace OOM daemons like Android lmkd and systemd-oomd supposed
to work with this time-based OOM killer?
Should only one of min_ttl_ms and a userspace daemon be enabled, or should
both be enabled at the same time?

> [1] https://lore.kernel.org/lkml/Ydza%2FzXKY9ATRoh6@google.com/
> [2] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
>  include/linux/mmzone.h |  2 ++
>  mm/vmscan.c            | 69 +++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 67 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 116c9237e401..f98f9ce50e67 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -403,6 +403,8 @@ struct lru_gen_struct {
>         unsigned long max_seq;
>         /* the eviction increments the oldest generation numbers */
>         unsigned long min_seq[ANON_AND_FILE];
> +       /* the birth time of each generation in jiffies */
> +       unsigned long timestamps[MAX_NR_GENS];
>         /* the multi-gen LRU lists */
>         struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
>         /* the sizes of the above lists */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 55cc7d6b018b..6aa083b8bb26 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4229,6 +4229,7 @@ static void inc_max_seq(struct lruvec *lruvec)
>         for (type = 0; type < ANON_AND_FILE; type++)
>                 reset_ctrl_pos(lruvec, type, false);
>
> +       WRITE_ONCE(lrugen->timestamps[next], jiffies);
>         /* make sure preceding modifications appear */
>         smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
>
> @@ -4340,7 +4341,8 @@ static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
>         return total > 0 ? total : 0;
>  }
>
> -static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> +static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc,
> +                      unsigned long min_ttl)
>  {
>         bool need_aging;
>         long nr_to_scan;
> @@ -4349,14 +4351,22 @@ static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>         DEFINE_MAX_SEQ(lruvec);
>         DEFINE_MIN_SEQ(lruvec);
>
> +       if (min_ttl) {
> +               int gen = lru_gen_from_seq(min_seq[LRU_GEN_FILE]);
> +               unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
> +
> +               if (time_is_after_jiffies(birth + min_ttl))
> +                       return false;
> +       }
> +
>         mem_cgroup_calculate_protection(NULL, memcg);
>
>         if (mem_cgroup_below_min(memcg))
> -               return;
> +               return false;
>
>         nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, swappiness, &need_aging);
>         if (!nr_to_scan)
> -               return;
> +               return false;
>
>         nr_to_scan >>= sc->priority;
>
> @@ -4365,11 +4375,18 @@ static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>
>         if (nr_to_scan && need_aging && (!mem_cgroup_below_low(memcg) || sc->memcg_low_reclaim))
>                 try_to_inc_max_seq(lruvec, max_seq, sc, swappiness, false);
> +
> +       return true;
>  }
>
> +/* to protect the working set of the last N jiffies */
> +static unsigned long lru_gen_min_ttl __read_mostly;
> +
>  static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
>  {
>         struct mem_cgroup *memcg;
> +       bool success = false;
> +       unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
>
>         VM_BUG_ON(!current_is_kswapd());
>
> @@ -4395,12 +4412,29 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
>         do {
>                 struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>
> -               age_lruvec(lruvec, sc);
> +               if (age_lruvec(lruvec, sc, min_ttl))
> +                       success = true;
>
>                 cond_resched();
>         } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
>
>         current->reclaim_state->mm_walk = NULL;
> +
> +       /*
> +        * The main goal is to OOM kill if every generation from all memcgs is
> +        * younger than min_ttl. However, another theoretical possibility is all
> +        * memcgs are either below min or empty.
> +        */
> +       if (!success && mutex_trylock(&oom_lock)) {
> +               struct oom_control oc = {
> +                       .gfp_mask = sc->gfp_mask,
> +                       .order = sc->order,
> +               };
> +
> +               out_of_memory(&oc);
> +
> +               mutex_unlock(&oom_lock);
> +       }
>  }
>
>  /*
> @@ -5112,6 +5146,28 @@ static void lru_gen_change_state(bool enable)
>   *                          sysfs interface
>   ******************************************************************************/
>
> +static ssize_t show_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
> +{
> +       return sprintf(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl)));
> +}
> +
> +static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr,
> +                            const char *buf, size_t len)
> +{
> +       unsigned int msecs;
> +
> +       if (kstrtouint(buf, 0, &msecs))
> +               return -EINVAL;
> +
> +       WRITE_ONCE(lru_gen_min_ttl, msecs_to_jiffies(msecs));
> +
> +       return len;
> +}
> +
> +static struct kobj_attribute lru_gen_min_ttl_attr = __ATTR(
> +       min_ttl_ms, 0644, show_min_ttl, store_min_ttl
> +);
> +
>  static ssize_t show_enable(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
>  {
>         unsigned int caps = 0;
> @@ -5160,6 +5216,7 @@ static struct kobj_attribute lru_gen_enabled_attr = __ATTR(
>  );
>
>  static struct attribute *lru_gen_attrs[] = {
> +       &lru_gen_min_ttl_attr.attr,
>         &lru_gen_enabled_attr.attr,
>         NULL
>  };
> @@ -5175,12 +5232,16 @@ static struct attribute_group lru_gen_attr_group = {
>
>  void lru_gen_init_lruvec(struct lruvec *lruvec)
>  {
> +       int i;
>         int gen, type, zone;
>         struct lru_gen_struct *lrugen = &lruvec->lrugen;
>
>         lrugen->max_seq = MIN_NR_GENS + 1;
>         lrugen->enabled = lru_gen_enabled();
>
> +       for (i = 0; i <= MIN_NR_GENS + 1; i++)
> +               lrugen->timestamps[i] = jiffies;
> +
>         for_each_gen_type_zone(gen, type, zone)
>                 INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
>
> --
> 2.35.1.616.g0bdcbb4464-goog
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 11/14] mm: multi-gen LRU: thrashing prevention
@ 2022-03-22  7:22     ` Barry Song
  0 siblings, 0 replies; 120+ messages in thread
From: Barry Song @ 2022-03-22  7:22 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Wed, Mar 9, 2022 at 3:48 PM Yu Zhao <yuzhao@google.com> wrote:
>
> Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as
> requested by many desktop users [1].
>
> When set to value N, it prevents the working set of N milliseconds
> from getting evicted. The OOM killer is triggered if this working set
> cannot be kept in memory. Based on the average human detectable lag
> (~100ms), N=1000 usually eliminates intolerable lags due to thrashing.
> Larger values like N=3000 make lags less noticeable at the risk of
> premature OOM kills.
>
> Compared with the size-based approach, e.g., [2], this time-based
> approach has the following advantages:
> 1. It is easier to configure because it is agnostic to applications
>    and memory sizes.
> 2. It is more reliable because it is directly wired to the OOM killer.
>

how are userspace oom daemons like android lmkd, systemd-oomd supposed
to work with this time-based oom killer?
only one of min_ttl_ms and userspace daemon should be enabled? or both
should be enabled at the same time?

> [1] https://lore.kernel.org/lkml/Ydza%2FzXKY9ATRoh6@google.com/
> [2] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
>  include/linux/mmzone.h |  2 ++
>  mm/vmscan.c            | 69 +++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 67 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 116c9237e401..f98f9ce50e67 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -403,6 +403,8 @@ struct lru_gen_struct {
>         unsigned long max_seq;
>         /* the eviction increments the oldest generation numbers */
>         unsigned long min_seq[ANON_AND_FILE];
> +       /* the birth time of each generation in jiffies */
> +       unsigned long timestamps[MAX_NR_GENS];
>         /* the multi-gen LRU lists */
>         struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
>         /* the sizes of the above lists */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 55cc7d6b018b..6aa083b8bb26 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4229,6 +4229,7 @@ static void inc_max_seq(struct lruvec *lruvec)
>         for (type = 0; type < ANON_AND_FILE; type++)
>                 reset_ctrl_pos(lruvec, type, false);
>
> +       WRITE_ONCE(lrugen->timestamps[next], jiffies);
>         /* make sure preceding modifications appear */
>         smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
>
> @@ -4340,7 +4341,8 @@ static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
>         return total > 0 ? total : 0;
>  }
>
> -static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> +static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc,
> +                      unsigned long min_ttl)
>  {
>         bool need_aging;
>         long nr_to_scan;
> @@ -4349,14 +4351,22 @@ static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>         DEFINE_MAX_SEQ(lruvec);
>         DEFINE_MIN_SEQ(lruvec);
>
> +       if (min_ttl) {
> +               int gen = lru_gen_from_seq(min_seq[LRU_GEN_FILE]);
> +               unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
> +
> +               if (time_is_after_jiffies(birth + min_ttl))
> +                       return false;
> +       }
> +
>         mem_cgroup_calculate_protection(NULL, memcg);
>
>         if (mem_cgroup_below_min(memcg))
> -               return;
> +               return false;
>
>         nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, swappiness, &need_aging);
>         if (!nr_to_scan)
> -               return;
> +               return false;
>
>         nr_to_scan >>= sc->priority;
>
> @@ -4365,11 +4375,18 @@ static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>
>         if (nr_to_scan && need_aging && (!mem_cgroup_below_low(memcg) || sc->memcg_low_reclaim))
>                 try_to_inc_max_seq(lruvec, max_seq, sc, swappiness, false);
> +
> +       return true;
>  }
>
> +/* to protect the working set of the last N jiffies */
> +static unsigned long lru_gen_min_ttl __read_mostly;
> +
>  static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
>  {
>         struct mem_cgroup *memcg;
> +       bool success = false;
> +       unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
>
>         VM_BUG_ON(!current_is_kswapd());
>
> @@ -4395,12 +4412,29 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
>         do {
>                 struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>
> -               age_lruvec(lruvec, sc);
> +               if (age_lruvec(lruvec, sc, min_ttl))
> +                       success = true;
>
>                 cond_resched();
>         } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
>
>         current->reclaim_state->mm_walk = NULL;
> +
> +       /*
> +        * The main goal is to OOM kill if every generation from all memcgs is
> +        * younger than min_ttl. However, another theoretical possibility is all
> +        * memcgs are either below min or empty.
> +        */
> +       if (!success && mutex_trylock(&oom_lock)) {
> +               struct oom_control oc = {
> +                       .gfp_mask = sc->gfp_mask,
> +                       .order = sc->order,
> +               };
> +
> +               out_of_memory(&oc);
> +
> +               mutex_unlock(&oom_lock);
> +       }
>  }
>
>  /*
> @@ -5112,6 +5146,28 @@ static void lru_gen_change_state(bool enable)
>   *                          sysfs interface
>   ******************************************************************************/
>
> +static ssize_t show_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
> +{
> +       return sprintf(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl)));
> +}
> +
> +static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr,
> +                            const char *buf, size_t len)
> +{
> +       unsigned int msecs;
> +
> +       if (kstrtouint(buf, 0, &msecs))
> +               return -EINVAL;
> +
> +       WRITE_ONCE(lru_gen_min_ttl, msecs_to_jiffies(msecs));
> +
> +       return len;
> +}
> +
> +static struct kobj_attribute lru_gen_min_ttl_attr = __ATTR(
> +       min_ttl_ms, 0644, show_min_ttl, store_min_ttl
> +);
> +
>  static ssize_t show_enable(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
>  {
>         unsigned int caps = 0;
> @@ -5160,6 +5216,7 @@ static struct kobj_attribute lru_gen_enabled_attr = __ATTR(
>  );
>
>  static struct attribute *lru_gen_attrs[] = {
> +       &lru_gen_min_ttl_attr.attr,
>         &lru_gen_enabled_attr.attr,
>         NULL
>  };
> @@ -5175,12 +5232,16 @@ static struct attribute_group lru_gen_attr_group = {
>
>  void lru_gen_init_lruvec(struct lruvec *lruvec)
>  {
> +       int i;
>         int gen, type, zone;
>         struct lru_gen_struct *lrugen = &lruvec->lrugen;
>
>         lrugen->max_seq = MIN_NR_GENS + 1;
>         lrugen->enabled = lru_gen_enabled();
>
> +       for (i = 0; i <= MIN_NR_GENS + 1; i++)
> +               lrugen->timestamps[i] = jiffies;
> +
>         for_each_gen_type_zone(gen, type, zone)
>                 INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
>
> --
> 2.35.1.616.g0bdcbb4464-goog
>
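
The comment in lru_gen_age_node() says the goal is to OOM kill when every
generation from all memcgs is younger than min_ttl. A minimal userspace
sketch of that decision, under stated assumptions: younger_than_ttl() and
should_oom_kill() are hypothetical names that do not appear in the patch,
the array of birth times stands in for the oldest generation's entry in
lrugen->timestamps[] of each memcg, and treating min_ttl == 0 as "protection
disabled" is my reading rather than something the quoted hunk states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * True if birth + min_ttl is still in the future relative to now.
 * The signed difference keeps the comparison correct across counter
 * wraparound, the same idea as the kernel's time_before().
 */
static bool younger_than_ttl(unsigned long now, unsigned long birth,
                             unsigned long min_ttl)
{
    return (long)(now - (birth + min_ttl)) < 0;
}

/*
 * Hypothetical model of the decision: kill only if no memcg has an
 * oldest generation that has lived for at least min_ttl ticks.
 */
static bool should_oom_kill(const unsigned long *oldest_birth, size_t nr,
                            unsigned long now, unsigned long min_ttl)
{
    size_t i;

    assert(oldest_birth || nr == 0);

    if (!min_ttl)
        return false;   /* assumption: 0 means the protection is off */

    for (i = 0; i < nr; i++) {
        if (!younger_than_ttl(now, oldest_birth[i], min_ttl))
            return false;   /* something old enough to evict exists */
    }

    return true;
}
```

With births {100, 200} and now = 1000, a min_ttl of 500 leaves evictable
generations, while a min_ttl of 1000 protects everything and the sketch
reports that an OOM kill would be the only way forward.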

Thanks
Barry

* Re: [PATCH v9 10/14] mm: multi-gen LRU: kill switch
  2022-03-09  2:12   ` Yu Zhao
@ 2022-03-22  7:47     ` Barry Song
  -1 siblings, 0 replies; 120+ messages in thread
From: Barry Song @ 2022-03-22  7:47 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Wed, Mar 9, 2022 at 3:48 PM Yu Zhao <yuzhao@google.com> wrote:
>
> Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
> can be disabled include:
>   0x0001: the multi-gen LRU core
>   0x0002: walking page table, when arch_has_hw_pte_young() returns
>           true
>   0x0004: clearing the accessed bit in non-leaf PMD entries, when
>           CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
>   [yYnN]: apply to all the components above
> E.g.,
>   echo y >/sys/kernel/mm/lru_gen/enabled
>   cat /sys/kernel/mm/lru_gen/enabled
>   0x0007
>   echo 5 >/sys/kernel/mm/lru_gen/enabled
>   cat /sys/kernel/mm/lru_gen/enabled
>   0x0005
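
To check my reading of the interface above, here is a small userspace sketch
of how the written value maps to the capability mask. parse_caps() and
format_caps() are hypothetical helpers, not functions from the patch; the
enum mirrors the one the patch adds to mmzone.h. Unlike show_enable(), the
sketch does not consult arch_has_hw_pte_young() or
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG, so it assumes all three components are
supported.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Bit positions mirror the enum added to include/linux/mmzone.h. */
enum { LRU_GEN_CORE, LRU_GEN_MM_WALK, LRU_GEN_NONLEAF_YOUNG, NR_LRU_GEN_CAPS };

/*
 * Hypothetical helper: turn a string written to the "enabled" file into a
 * capability mask limited to the known bits. Returns 0 or -EINVAL.
 */
static int parse_caps(const char *buf, unsigned int *caps)
{
    const unsigned int known = (1u << NR_LRU_GEN_CAPS) - 1;
    unsigned long val;
    char *end;

    assert(buf && caps);

    /* only the first character is checked for the y/n shorthand */
    if (tolower((unsigned char)*buf) == 'n') {
        *caps = 0;
        return 0;
    }
    if (tolower((unsigned char)*buf) == 'y') {
        *caps = known;
        return 0;
    }

    errno = 0;
    val = strtoul(buf, &end, 0);    /* base 0 accepts decimal, 0x.., 0.. */
    if (errno || end == buf || val > UINT_MAX)
        return -EINVAL;
    if (*end == '\n')               /* echo appends a newline */
        end++;
    if (*end != '\0')
        return -EINVAL;

    *caps = (unsigned int)val & known;
    return 0;
}

/* Format the mask the way the commit message shows it, e.g. 0x0007. */
static void format_caps(unsigned int caps, char *out, size_t len)
{
    snprintf(out, len, "0x%04x", caps);
}
```

Writing "y" yields 0x0007 and writing "5" yields 0x0005, i.e. the core plus
the non-leaf PMD young bit with page table walks turned off, matching the
example in the commit message.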
>
> NB: the page table walks happen on the scale of seconds under heavy
> memory pressure, in which case the mmap_lock contention is a lesser
> concern, compared with the LRU lock contention and the I/O congestion.
> So far the only well-known case of the mmap_lock contention happens on
> Android, due to Scudo [1] which allocates several thousand VMAs for
> merely a few hundred MBs. The SPF and the Maple Tree also have
> provided their own assessments [2][3]. However, if walking page tables
> does worsen the mmap_lock contention, the kill switch can be used to
> disable it. In this case the multi-gen LRU will suffer a minor
> performance degradation, as shown previously.
>
> Clearing the accessed bit in non-leaf PMD entries can also be
> disabled, since this behavior was not tested on x86 varieties other
> than Intel and AMD.
>
> [1] https://source.android.com/devices/tech/debug/scudo
> [2] https://lore.kernel.org/lkml/20220128131006.67712-1-michel@lespinasse.org/
> [3] https://lore.kernel.org/lkml/20220202024137.2516438-1-Liam.Howlett@oracle.com/
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
>  include/linux/cgroup.h          |  15 +-
>  include/linux/mm_inline.h       |  12 +-
>  include/linux/mmzone.h          |   9 ++
>  kernel/cgroup/cgroup-internal.h |   1 -
>  mm/Kconfig                      |   6 +
>  mm/vmscan.c                     | 237 +++++++++++++++++++++++++++++++-
>  6 files changed, 271 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 75c151413fda..b145025f3eac 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -432,6 +432,18 @@ static inline void cgroup_put(struct cgroup *cgrp)
>         css_put(&cgrp->self);
>  }
>
> +extern struct mutex cgroup_mutex;
> +
> +static inline void cgroup_lock(void)
> +{
> +       mutex_lock(&cgroup_mutex);
> +}
> +
> +static inline void cgroup_unlock(void)
> +{
> +       mutex_unlock(&cgroup_mutex);
> +}
> +
>  /**
>   * task_css_set_check - obtain a task's css_set with extra access conditions
>   * @task: the task to obtain css_set for
> @@ -446,7 +458,6 @@ static inline void cgroup_put(struct cgroup *cgrp)
>   * as locks used during the cgroup_subsys::attach() methods.
>   */
>  #ifdef CONFIG_PROVE_RCU
> -extern struct mutex cgroup_mutex;
>  extern spinlock_t css_set_lock;
>  #define task_css_set_check(task, __c)                                  \
>         rcu_dereference_check((task)->cgroups,                          \
> @@ -707,6 +718,8 @@ struct cgroup;
>  static inline u64 cgroup_id(const struct cgroup *cgrp) { return 1; }
>  static inline void css_get(struct cgroup_subsys_state *css) {}
>  static inline void css_put(struct cgroup_subsys_state *css) {}
> +static inline void cgroup_lock(void) {}
> +static inline void cgroup_unlock(void) {}
>  static inline int cgroup_attach_task_all(struct task_struct *from,
>                                          struct task_struct *t) { return 0; }
>  static inline int cgroupstats_build(struct cgroupstats *stats,
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 15a04a9b5560..1c8d617e73a9 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -106,7 +106,15 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio)
>
>  static inline bool lru_gen_enabled(void)
>  {
> -       return true;
> +#ifdef CONFIG_LRU_GEN_ENABLED
> +       DECLARE_STATIC_KEY_TRUE(lru_gen_caps[NR_LRU_GEN_CAPS]);
> +
> +       return static_branch_likely(&lru_gen_caps[LRU_GEN_CORE]);
> +#else
> +       DECLARE_STATIC_KEY_FALSE(lru_gen_caps[NR_LRU_GEN_CAPS]);
> +
> +       return static_branch_unlikely(&lru_gen_caps[LRU_GEN_CORE]);
> +#endif
>  }
>
>  static inline bool lru_gen_in_fault(void)
> @@ -196,7 +204,7 @@ static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio,
>         int zone = folio_zonenum(folio);
>         struct lru_gen_struct *lrugen = &lruvec->lrugen;
>
> -       if (folio_test_unevictable(folio))
> +       if (folio_test_unevictable(folio) || !lrugen->enabled)
>                 return false;
>         /*
>          * There are three common cases for this page:
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index a2d53025a321..116c9237e401 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -371,6 +371,13 @@ enum {
>         LRU_GEN_FILE,
>  };
>
> +enum {
> +       LRU_GEN_CORE,
> +       LRU_GEN_MM_WALK,
> +       LRU_GEN_NONLEAF_YOUNG,
> +       NR_LRU_GEN_CAPS
> +};
> +
>  #define MIN_LRU_BATCH          BITS_PER_LONG
>  #define MAX_LRU_BATCH          (MIN_LRU_BATCH * 128)
>
> @@ -409,6 +416,8 @@ struct lru_gen_struct {
>         /* can be modified without holding the LRU lock */
>         atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
>         atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> +       /* whether the multi-gen LRU is enabled */
> +       bool enabled;
>  };
>
>  enum {
> diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
> index 6e36e854b512..929ed3bf1a7c 100644
> --- a/kernel/cgroup/cgroup-internal.h
> +++ b/kernel/cgroup/cgroup-internal.h
> @@ -165,7 +165,6 @@ struct cgroup_mgctx {
>  #define DEFINE_CGROUP_MGCTX(name)                                              \
>         struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name)
>
> -extern struct mutex cgroup_mutex;
>  extern spinlock_t css_set_lock;
>  extern struct cgroup_subsys *cgroup_subsys[];
>  extern struct list_head cgroup_roots;
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 804c2bca8205..050de1eae2d6 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -901,6 +901,12 @@ config LRU_GEN
>         help
>           A high performance LRU implementation for memory overcommit.
>
> +config LRU_GEN_ENABLED
> +       bool "Enable by default"
> +       depends on LRU_GEN
> +       help
> +         This option enables the multi-gen LRU by default.
> +
>  config LRU_GEN_STATS
>         bool "Full stats for debugging"
>         depends on LRU_GEN
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7375c9dae08f..55cc7d6b018b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3063,6 +3063,12 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
>
>  #ifdef CONFIG_LRU_GEN
>
> +#ifdef CONFIG_LRU_GEN_ENABLED
> +DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS);
> +#else
> +DEFINE_STATIC_KEY_ARRAY_FALSE(lru_gen_caps, NR_LRU_GEN_CAPS);
> +#endif
> +
>  /******************************************************************************
>   *                          shorthand helpers
>   ******************************************************************************/
> @@ -3099,6 +3105,15 @@ static int folio_lru_tier(struct folio *folio)
>         return lru_tier_from_refs(refs);
>  }
>
> +static bool get_cap(int cap)
> +{
> +#ifdef CONFIG_LRU_GEN_ENABLED
> +       return static_branch_likely(&lru_gen_caps[cap]);
> +#else
> +       return static_branch_unlikely(&lru_gen_caps[cap]);
> +#endif
> +}
> +
>  static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
>  {
>         struct pglist_data *pgdat = NODE_DATA(nid);
> @@ -3892,7 +3907,8 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area
>                         goto next;
>
>                 if (!pmd_trans_huge(pmd[i])) {
> -                       if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG))
> +                       if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) &&
> +                           get_cap(LRU_GEN_NONLEAF_YOUNG))
>                                 pmdp_test_and_clear_young(vma, addr, pmd + i);
>                         goto next;
>                 }
> @@ -3999,10 +4015,12 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
>                 priv->mm_stats[MM_PMD_TOTAL]++;
>
>  #ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
> -               if (!pmd_young(val))
> -                       continue;
> +               if (get_cap(LRU_GEN_NONLEAF_YOUNG)) {
> +                       if (!pmd_young(val))
> +                               continue;
>
> -               walk_pmd_range_locked(pud, addr, vma, walk, &pos);
> +                       walk_pmd_range_locked(pud, addr, vma, walk, &pos);
> +               }
>  #endif
>                 if (!priv->full_scan && !test_bloom_filter(priv->lruvec, priv->max_seq, pmd + i))
>                         continue;
> @@ -4233,7 +4251,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
>          * handful of PTEs. Spreading the work out over a period of time usually
>          * is less efficient, but it avoids bursty page faults.
>          */
> -       if (!full_scan && !arch_has_hw_pte_young()) {
> +       if (!full_scan && (!arch_has_hw_pte_young() || !get_cap(LRU_GEN_MM_WALK))) {
>                 success = iterate_mm_list_nowalk(lruvec, max_seq);
>                 goto done;
>         }
> @@ -4946,6 +4964,211 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
>         blk_finish_plug(&plug);
>  }
>
> +/******************************************************************************
> + *                          state change
> + ******************************************************************************/
> +
> +static bool __maybe_unused state_is_valid(struct lruvec *lruvec)
> +{
> +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +       if (lrugen->enabled) {
> +               enum lru_list lru;
> +
> +               for_each_evictable_lru(lru) {
> +                       if (!list_empty(&lruvec->lists[lru]))
> +                               return false;
> +               }
> +       } else {
> +               int gen, type, zone;
> +
> +               for_each_gen_type_zone(gen, type, zone) {
> +                       if (!list_empty(&lrugen->lists[gen][type][zone]))
> +                               return false;
> +
> +                       /* unlikely but not a bug when reset_batch_size() is pending */
> +                       VM_WARN_ON(lrugen->nr_pages[gen][type][zone]);
> +               }
> +       }
> +
> +       return true;
> +}
> +
> +static bool fill_evictable(struct lruvec *lruvec)
> +{
> +       enum lru_list lru;
> +       int remaining = MAX_LRU_BATCH;
> +
> +       for_each_evictable_lru(lru) {
> +               int type = is_file_lru(lru);
> +               bool active = is_active_lru(lru);
> +               struct list_head *head = &lruvec->lists[lru];
> +
> +               while (!list_empty(head)) {
> +                       bool success;
> +                       struct folio *folio = lru_to_folio(head);
> +
> +                       VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
> +                       VM_BUG_ON_FOLIO(folio_test_active(folio) != active, folio);
> +                       VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
> +                       VM_BUG_ON_FOLIO(folio_lru_gen(folio) < MAX_NR_GENS, folio);
> +
> +                       lruvec_del_folio(lruvec, folio);
> +                       success = lru_gen_add_folio(lruvec, folio, false);
> +                       VM_BUG_ON(!success);
> +
> +                       if (!--remaining)
> +                               return false;
> +               }
> +       }
> +
> +       return true;
> +}
> +
> +static bool drain_evictable(struct lruvec *lruvec)
> +{
> +       int gen, type, zone;
> +       int remaining = MAX_LRU_BATCH;
> +
> +       for_each_gen_type_zone(gen, type, zone) {
> +               struct list_head *head = &lruvec->lrugen.lists[gen][type][zone];
> +
> +               while (!list_empty(head)) {
> +                       bool success;
> +                       struct folio *folio = lru_to_folio(head);
> +
> +                       VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
> +                       VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
> +                       VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
> +                       VM_BUG_ON_FOLIO(folio_zonenum(folio) != zone, folio);
> +
> +                       success = lru_gen_del_folio(lruvec, folio, false);
> +                       VM_BUG_ON(!success);
> +                       lruvec_add_folio(lruvec, folio);

For example, with max_seq=4 (gen=0) and max_seq-1=3, we are supposed to put
max_seq at the head of the active list. But your code seems to add max_seq-1
after max_seq, so max_seq ends up at the tail of the active list and is more
likely to be evicted afterwards.

Anyway, it might not be so important. I can't imagine we will frequently
switch between MGLRU and the LRU dynamically. Will we?
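
To make the ordering concern concrete, here is a toy userspace model. It
rests on several assumptions of mine: MAX_NR_GENS is 4, the generation index
is seq % MAX_NR_GENS, the drain loop visits generation indexes in ascending
order, lruvec_add_folio() inserts at the list head, and each generation
holds a single folio. The toy_* names and both drain functions are
hypothetical; drain_by_seq() is just one possible alternative order, not
something the patch contains.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NR_GENS 4   /* assumed value for this sketch */
#define MAX_FOLIOS  16

/* Toy LRU list: index 0 is the head, the last used index is the tail. */
struct toy_list {
    unsigned long seq[MAX_FOLIOS];
    size_t len;
};

/* Insert at the head, shifting existing entries toward the tail. */
static void toy_add_head(struct toy_list *l, unsigned long seq)
{
    size_t i;

    assert(l->len < MAX_FOLIOS);
    for (i = l->len; i > 0; i--)
        l->seq[i] = l->seq[i - 1];
    l->seq[0] = seq;
    l->len++;
}

/* Drain in ascending generation-index order (0, 1, 2, 3). */
static void drain_by_gen_index(struct toy_list *out, unsigned long min_seq,
                               unsigned long max_seq)
{
    unsigned long gen, seq;

    for (gen = 0; gen < MAX_NR_GENS; gen++) {
        for (seq = min_seq; seq <= max_seq; seq++) {
            if (seq % MAX_NR_GENS == gen)
                toy_add_head(out, seq);
        }
    }
}

/* Hypothetical alternative: drain oldest first so the youngest ends at the head. */
static void drain_by_seq(struct toy_list *out, unsigned long min_seq,
                         unsigned long max_seq)
{
    unsigned long seq;

    for (seq = min_seq; seq <= max_seq; seq++)
        toy_add_head(out, seq);
}
```

For the two youngest generations (seq 3 and 4), draining by generation index
leaves seq 3 at the head and seq 4 at the tail, which is the inversion I
described; draining by sequence number leaves seq 4 at the head.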

> +
> +                       if (!--remaining)
> +                               return false;
> +               }
> +       }
> +
> +       return true;
> +}
> +
> +static void lru_gen_change_state(bool enable)
> +{
> +       static DEFINE_MUTEX(state_mutex);
> +
> +       struct mem_cgroup *memcg;
> +
> +       cgroup_lock();
> +       cpus_read_lock();
> +       get_online_mems();
> +       mutex_lock(&state_mutex);
> +
> +       if (enable == lru_gen_enabled())
> +               goto unlock;
> +
> +       if (enable)
> +               static_branch_enable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
> +       else
> +               static_branch_disable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
> +
> +       memcg = mem_cgroup_iter(NULL, NULL, NULL);
> +       do {
> +               int nid;
> +
> +               for_each_node(nid) {
> +                       struct lruvec *lruvec = get_lruvec(memcg, nid);
> +
> +                       if (!lruvec)
> +                               continue;
> +
> +                       spin_lock_irq(&lruvec->lru_lock);
> +
> +                       VM_BUG_ON(!seq_is_valid(lruvec));
> +                       VM_BUG_ON(!state_is_valid(lruvec));
> +
> +                       lruvec->lrugen.enabled = enable;
> +
> +                       while (!(enable ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
> +                               spin_unlock_irq(&lruvec->lru_lock);
> +                               cond_resched();
> +                               spin_lock_irq(&lruvec->lru_lock);
> +                       }
> +
> +                       spin_unlock_irq(&lruvec->lru_lock);
> +               }
> +
> +               cond_resched();
> +       } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
> +unlock:
> +       mutex_unlock(&state_mutex);
> +       put_online_mems();
> +       cpus_read_unlock();
> +       cgroup_unlock();
> +}
> +
> +/******************************************************************************
> + *                          sysfs interface
> + ******************************************************************************/
> +
> +static ssize_t show_enable(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
> +{
> +       unsigned int caps = 0;
> +
> +       if (get_cap(LRU_GEN_CORE))
> +               caps |= BIT(LRU_GEN_CORE);
> +
> +       if (arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))
> +               caps |= BIT(LRU_GEN_MM_WALK);
> +
> +       if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) && get_cap(LRU_GEN_NONLEAF_YOUNG))
> +               caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
> +
> +       return snprintf(buf, PAGE_SIZE, "0x%04x\n", caps);
> +}
> +
> +static ssize_t store_enable(struct kobject *kobj, struct kobj_attribute *attr,
> +                           const char *buf, size_t len)
> +{
> +       int i;
> +       unsigned int caps;
> +
> +       if (tolower(*buf) == 'n')
> +               caps = 0;
> +       else if (tolower(*buf) == 'y')
> +               caps = -1;
> +       else if (kstrtouint(buf, 0, &caps))
> +               return -EINVAL;
> +
> +       for (i = 0; i < NR_LRU_GEN_CAPS; i++) {
> +               bool enable = caps & BIT(i);
> +
> +               if (i == LRU_GEN_CORE)
> +                       lru_gen_change_state(enable);
> +               else if (enable)
> +                       static_branch_enable(&lru_gen_caps[i]);
> +               else
> +                       static_branch_disable(&lru_gen_caps[i]);
> +       }
> +
> +       return len;
> +}
> +
> +static struct kobj_attribute lru_gen_enabled_attr = __ATTR(
> +       enabled, 0644, show_enable, store_enable
> +);
> +
> +static struct attribute *lru_gen_attrs[] = {
> +       &lru_gen_enabled_attr.attr,
> +       NULL
> +};
> +
> +static struct attribute_group lru_gen_attr_group = {
> +       .name = "lru_gen",
> +       .attrs = lru_gen_attrs,
> +};
> +
>  /******************************************************************************
>   *                          initialization
>   ******************************************************************************/
> @@ -4956,6 +5179,7 @@ void lru_gen_init_lruvec(struct lruvec *lruvec)
>         struct lru_gen_struct *lrugen = &lruvec->lrugen;
>
>         lrugen->max_seq = MIN_NR_GENS + 1;
> +       lrugen->enabled = lru_gen_enabled();
>
>         for_each_gen_type_zone(gen, type, zone)
>                 INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
> @@ -4996,6 +5220,9 @@ static int __init init_lru_gen(void)
>         BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
>         BUILD_BUG_ON(sizeof(MM_STAT_CODES) != NR_MM_STATS + 1);
>
> +       if (sysfs_create_group(mm_kobj, &lru_gen_attr_group))
> +               pr_err("lru_gen: failed to create sysfs group\n");
> +
>         return 0;
>  };
>  late_initcall(init_lru_gen);
> --
> 2.35.1.616.g0bdcbb4464-goog
>
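
The way lru_gen_change_state() bounds lock hold time can be sketched in
userspace as follows. This is only a model under my own assumptions:
toy_lruvec, move_batch() and change_state() are hypothetical names, BATCH
stands in for MAX_LRU_BATCH (BITS_PER_LONG * 128 in the patch), and the
lock_drops counter replaces the real spin_unlock_irq() / cond_resched() /
spin_lock_irq() sequence.

```c
#include <assert.h>
#include <stdbool.h>

#define BATCH 4 /* stand-in for MAX_LRU_BATCH */

struct toy_lruvec {
    int pending;    /* folios still on the source lists */
    int moved;      /* folios already moved to the destination lists */
    int lock_drops; /* how many times the lock would have been dropped */
};

/*
 * Move at most BATCH folios. Returns true once the source is empty and
 * false when the batch budget ran out, like fill_evictable() and
 * drain_evictable().
 */
static bool move_batch(struct toy_lruvec *v)
{
    int remaining = BATCH;

    assert(v->pending >= 0);

    while (v->pending > 0) {
        v->pending--;
        v->moved++;

        if (!--remaining)
            return false;
    }

    return true;
}

/* Keep moving batches, "dropping the lock" between them. */
static void change_state(struct toy_lruvec *v)
{
    /* the LRU lock would be taken here */
    while (!move_batch(v)) {
        /* unlock, give the scheduler a chance, relock */
        v->lock_drops++;
    }
    /* and released here */
}
```

With 10 pending folios and a batch of 4, the lock is dropped twice. Note
that a list holding exactly one batch still costs one drop, because the
budget check happens before the emptiness check.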

Thanks
Barry

* Re: [PATCH v9 10/14] mm: multi-gen LRU: kill switch
@ 2022-03-22  7:47     ` Barry Song
  0 siblings, 0 replies; 120+ messages in thread
From: Barry Song @ 2022-03-22  7:47 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Wed, Mar 9, 2022 at 3:48 PM Yu Zhao <yuzhao@google.com> wrote:
>
> Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
> can be disabled include:
>   0x0001: the multi-gen LRU core
>   0x0002: walking page table, when arch_has_hw_pte_young() returns
>           true
>   0x0004: clearing the accessed bit in non-leaf PMD entries, when
>           CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
>   [yYnN]: apply to all the components above
> E.g.,
>   echo y >/sys/kernel/mm/lru_gen/enabled
>   cat /sys/kernel/mm/lru_gen/enabled
>   0x0007
>   echo 5 >/sys/kernel/mm/lru_gen/enabled
>   cat /sys/kernel/mm/lru_gen/enabled
>   0x0005
>
> NB: the page table walks happen on the scale of seconds under heavy
> memory pressure, in which case the mmap_lock contention is a lesser
> concern, compared with the LRU lock contention and the I/O congestion.
> So far the only well-known case of the mmap_lock contention happens on
> Android, due to Scudo [1] which allocates several thousand VMAs for
> merely a few hundred MBs. The SPF and the Maple Tree also have
> provided their own assessments [2][3]. However, if walking page tables
> does worsen the mmap_lock contention, the kill switch can be used to
> disable it. In this case the multi-gen LRU will suffer a minor
> performance degradation, as shown previously.
>
> Clearing the accessed bit in non-leaf PMD entries can also be
> disabled, since this behavior was not tested on x86 varieties other
> than Intel and AMD.
>
> [1] https://source.android.com/devices/tech/debug/scudo
> [2] https://lore.kernel.org/lkml/20220128131006.67712-1-michel@lespinasse.org/
> [3] https://lore.kernel.org/lkml/20220202024137.2516438-1-Liam.Howlett@oracle.com/
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
>  include/linux/cgroup.h          |  15 +-
>  include/linux/mm_inline.h       |  12 +-
>  include/linux/mmzone.h          |   9 ++
>  kernel/cgroup/cgroup-internal.h |   1 -
>  mm/Kconfig                      |   6 +
>  mm/vmscan.c                     | 237 +++++++++++++++++++++++++++++++-
>  6 files changed, 271 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 75c151413fda..b145025f3eac 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -432,6 +432,18 @@ static inline void cgroup_put(struct cgroup *cgrp)
>         css_put(&cgrp->self);
>  }
>
> +extern struct mutex cgroup_mutex;
> +
> +static inline void cgroup_lock(void)
> +{
> +       mutex_lock(&cgroup_mutex);
> +}
> +
> +static inline void cgroup_unlock(void)
> +{
> +       mutex_unlock(&cgroup_mutex);
> +}
> +
>  /**
>   * task_css_set_check - obtain a task's css_set with extra access conditions
>   * @task: the task to obtain css_set for
> @@ -446,7 +458,6 @@ static inline void cgroup_put(struct cgroup *cgrp)
>   * as locks used during the cgroup_subsys::attach() methods.
>   */
>  #ifdef CONFIG_PROVE_RCU
> -extern struct mutex cgroup_mutex;
>  extern spinlock_t css_set_lock;
>  #define task_css_set_check(task, __c)                                  \
>         rcu_dereference_check((task)->cgroups,                          \
> @@ -707,6 +718,8 @@ struct cgroup;
>  static inline u64 cgroup_id(const struct cgroup *cgrp) { return 1; }
>  static inline void css_get(struct cgroup_subsys_state *css) {}
>  static inline void css_put(struct cgroup_subsys_state *css) {}
> +static inline void cgroup_lock(void) {}
> +static inline void cgroup_unlock(void) {}
>  static inline int cgroup_attach_task_all(struct task_struct *from,
>                                          struct task_struct *t) { return 0; }
>  static inline int cgroupstats_build(struct cgroupstats *stats,
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 15a04a9b5560..1c8d617e73a9 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -106,7 +106,15 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio)
>
>  static inline bool lru_gen_enabled(void)
>  {
> -       return true;
> +#ifdef CONFIG_LRU_GEN_ENABLED
> +       DECLARE_STATIC_KEY_TRUE(lru_gen_caps[NR_LRU_GEN_CAPS]);
> +
> +       return static_branch_likely(&lru_gen_caps[LRU_GEN_CORE]);
> +#else
> +       DECLARE_STATIC_KEY_FALSE(lru_gen_caps[NR_LRU_GEN_CAPS]);
> +
> +       return static_branch_unlikely(&lru_gen_caps[LRU_GEN_CORE]);
> +#endif
>  }
>
>  static inline bool lru_gen_in_fault(void)
> @@ -196,7 +204,7 @@ static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio,
>         int zone = folio_zonenum(folio);
>         struct lru_gen_struct *lrugen = &lruvec->lrugen;
>
> -       if (folio_test_unevictable(folio))
> +       if (folio_test_unevictable(folio) || !lrugen->enabled)
>                 return false;
>         /*
>          * There are three common cases for this page:
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index a2d53025a321..116c9237e401 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -371,6 +371,13 @@ enum {
>         LRU_GEN_FILE,
>  };
>
> +enum {
> +       LRU_GEN_CORE,
> +       LRU_GEN_MM_WALK,
> +       LRU_GEN_NONLEAF_YOUNG,
> +       NR_LRU_GEN_CAPS
> +};
> +
>  #define MIN_LRU_BATCH          BITS_PER_LONG
>  #define MAX_LRU_BATCH          (MIN_LRU_BATCH * 128)
>
> @@ -409,6 +416,8 @@ struct lru_gen_struct {
>         /* can be modified without holding the LRU lock */
>         atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
>         atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> +       /* whether the multi-gen LRU is enabled */
> +       bool enabled;
>  };
>
>  enum {
> diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
> index 6e36e854b512..929ed3bf1a7c 100644
> --- a/kernel/cgroup/cgroup-internal.h
> +++ b/kernel/cgroup/cgroup-internal.h
> @@ -165,7 +165,6 @@ struct cgroup_mgctx {
>  #define DEFINE_CGROUP_MGCTX(name)                                              \
>         struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name)
>
> -extern struct mutex cgroup_mutex;
>  extern spinlock_t css_set_lock;
>  extern struct cgroup_subsys *cgroup_subsys[];
>  extern struct list_head cgroup_roots;
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 804c2bca8205..050de1eae2d6 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -901,6 +901,12 @@ config LRU_GEN
>         help
>           A high performance LRU implementation for memory overcommit.
>
> +config LRU_GEN_ENABLED
> +       bool "Enable by default"
> +       depends on LRU_GEN
> +       help
> +         This option enables the multi-gen LRU by default.
> +
>  config LRU_GEN_STATS
>         bool "Full stats for debugging"
>         depends on LRU_GEN
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7375c9dae08f..55cc7d6b018b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3063,6 +3063,12 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
>
>  #ifdef CONFIG_LRU_GEN
>
> +#ifdef CONFIG_LRU_GEN_ENABLED
> +DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS);
> +#else
> +DEFINE_STATIC_KEY_ARRAY_FALSE(lru_gen_caps, NR_LRU_GEN_CAPS);
> +#endif
> +
>  /******************************************************************************
>   *                          shorthand helpers
>   ******************************************************************************/
> @@ -3099,6 +3105,15 @@ static int folio_lru_tier(struct folio *folio)
>         return lru_tier_from_refs(refs);
>  }
>
> +static bool get_cap(int cap)
> +{
> +#ifdef CONFIG_LRU_GEN_ENABLED
> +       return static_branch_likely(&lru_gen_caps[cap]);
> +#else
> +       return static_branch_unlikely(&lru_gen_caps[cap]);
> +#endif
> +}
> +
>  static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
>  {
>         struct pglist_data *pgdat = NODE_DATA(nid);
> @@ -3892,7 +3907,8 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area
>                         goto next;
>
>                 if (!pmd_trans_huge(pmd[i])) {
> -                       if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG))
> +                       if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) &&
> +                           get_cap(LRU_GEN_NONLEAF_YOUNG))
>                                 pmdp_test_and_clear_young(vma, addr, pmd + i);
>                         goto next;
>                 }
> @@ -3999,10 +4015,12 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
>                 priv->mm_stats[MM_PMD_TOTAL]++;
>
>  #ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
> -               if (!pmd_young(val))
> -                       continue;
> +               if (get_cap(LRU_GEN_NONLEAF_YOUNG)) {
> +                       if (!pmd_young(val))
> +                               continue;
>
> -               walk_pmd_range_locked(pud, addr, vma, walk, &pos);
> +                       walk_pmd_range_locked(pud, addr, vma, walk, &pos);
> +               }
>  #endif
>                 if (!priv->full_scan && !test_bloom_filter(priv->lruvec, priv->max_seq, pmd + i))
>                         continue;
> @@ -4233,7 +4251,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
>          * handful of PTEs. Spreading the work out over a period of time usually
>          * is less efficient, but it avoids bursty page faults.
>          */
> -       if (!full_scan && !arch_has_hw_pte_young()) {
> +       if (!full_scan && (!arch_has_hw_pte_young() || !get_cap(LRU_GEN_MM_WALK))) {
>                 success = iterate_mm_list_nowalk(lruvec, max_seq);
>                 goto done;
>         }
> @@ -4946,6 +4964,211 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
>         blk_finish_plug(&plug);
>  }
>
> +/******************************************************************************
> + *                          state change
> + ******************************************************************************/
> +
> +static bool __maybe_unused state_is_valid(struct lruvec *lruvec)
> +{
> +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +       if (lrugen->enabled) {
> +               enum lru_list lru;
> +
> +               for_each_evictable_lru(lru) {
> +                       if (!list_empty(&lruvec->lists[lru]))
> +                               return false;
> +               }
> +       } else {
> +               int gen, type, zone;
> +
> +               for_each_gen_type_zone(gen, type, zone) {
> +                       if (!list_empty(&lrugen->lists[gen][type][zone]))
> +                               return false;
> +
> +                       /* unlikely but not a bug when reset_batch_size() is pending */
> +                       VM_WARN_ON(lrugen->nr_pages[gen][type][zone]);
> +               }
> +       }
> +
> +       return true;
> +}
> +
> +static bool fill_evictable(struct lruvec *lruvec)
> +{
> +       enum lru_list lru;
> +       int remaining = MAX_LRU_BATCH;
> +
> +       for_each_evictable_lru(lru) {
> +               int type = is_file_lru(lru);
> +               bool active = is_active_lru(lru);
> +               struct list_head *head = &lruvec->lists[lru];
> +
> +               while (!list_empty(head)) {
> +                       bool success;
> +                       struct folio *folio = lru_to_folio(head);
> +
> +                       VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
> +                       VM_BUG_ON_FOLIO(folio_test_active(folio) != active, folio);
> +                       VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
> +                       VM_BUG_ON_FOLIO(folio_lru_gen(folio) < MAX_NR_GENS, folio);
> +
> +                       lruvec_del_folio(lruvec, folio);
> +                       success = lru_gen_add_folio(lruvec, folio, false);
> +                       VM_BUG_ON(!success);
> +
> +                       if (!--remaining)
> +                               return false;
> +               }
> +       }
> +
> +       return true;
> +}
> +
> +static bool drain_evictable(struct lruvec *lruvec)
> +{
> +       int gen, type, zone;
> +       int remaining = MAX_LRU_BATCH;
> +
> +       for_each_gen_type_zone(gen, type, zone) {
> +               struct list_head *head = &lruvec->lrugen.lists[gen][type][zone];
> +
> +               while (!list_empty(head)) {
> +                       bool success;
> +                       struct folio *folio = lru_to_folio(head);
> +
> +                       VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
> +                       VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
> +                       VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
> +                       VM_BUG_ON_FOLIO(folio_zonenum(folio) != zone, folio);
> +
> +                       success = lru_gen_del_folio(lruvec, folio, false);
> +                       VM_BUG_ON(!success);
> +                       lruvec_add_folio(lruvec, folio);

For example, if max_seq=4 (GEN=0) and max_seq-1=3, we are supposed to put
max_seq at the head of the active list. But your code seems to add max_seq-1
after max_seq, so max_seq is more likely to be evicted afterwards because it
ends up at the tail of the active list.

Anyway, it might not be that important. I can't imagine we will frequently
switch between MGLRU and the LRU dynamically. Will we?

> +
> +                       if (!--remaining)
> +                               return false;
> +               }
> +       }
> +
> +       return true;
> +}
> +
> +static void lru_gen_change_state(bool enable)
> +{
> +       static DEFINE_MUTEX(state_mutex);
> +
> +       struct mem_cgroup *memcg;
> +
> +       cgroup_lock();
> +       cpus_read_lock();
> +       get_online_mems();
> +       mutex_lock(&state_mutex);
> +
> +       if (enable == lru_gen_enabled())
> +               goto unlock;
> +
> +       if (enable)
> +               static_branch_enable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
> +       else
> +               static_branch_disable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
> +
> +       memcg = mem_cgroup_iter(NULL, NULL, NULL);
> +       do {
> +               int nid;
> +
> +               for_each_node(nid) {
> +                       struct lruvec *lruvec = get_lruvec(memcg, nid);
> +
> +                       if (!lruvec)
> +                               continue;
> +
> +                       spin_lock_irq(&lruvec->lru_lock);
> +
> +                       VM_BUG_ON(!seq_is_valid(lruvec));
> +                       VM_BUG_ON(!state_is_valid(lruvec));
> +
> +                       lruvec->lrugen.enabled = enable;
> +
> +                       while (!(enable ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
> +                               spin_unlock_irq(&lruvec->lru_lock);
> +                               cond_resched();
> +                               spin_lock_irq(&lruvec->lru_lock);
> +                       }
> +
> +                       spin_unlock_irq(&lruvec->lru_lock);
> +               }
> +
> +               cond_resched();
> +       } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
> +unlock:
> +       mutex_unlock(&state_mutex);
> +       put_online_mems();
> +       cpus_read_unlock();
> +       cgroup_unlock();
> +}
> +
> +/******************************************************************************
> + *                          sysfs interface
> + ******************************************************************************/
> +
> +static ssize_t show_enable(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
> +{
> +       unsigned int caps = 0;
> +
> +       if (get_cap(LRU_GEN_CORE))
> +               caps |= BIT(LRU_GEN_CORE);
> +
> +       if (arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))
> +               caps |= BIT(LRU_GEN_MM_WALK);
> +
> +       if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) && get_cap(LRU_GEN_NONLEAF_YOUNG))
> +               caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
> +
> +       return snprintf(buf, PAGE_SIZE, "0x%04x\n", caps);
> +}
> +
> +static ssize_t store_enable(struct kobject *kobj, struct kobj_attribute *attr,
> +                           const char *buf, size_t len)
> +{
> +       int i;
> +       unsigned int caps;
> +
> +       if (tolower(*buf) == 'n')
> +               caps = 0;
> +       else if (tolower(*buf) == 'y')
> +               caps = -1;
> +       else if (kstrtouint(buf, 0, &caps))
> +               return -EINVAL;
> +
> +       for (i = 0; i < NR_LRU_GEN_CAPS; i++) {
> +               bool enable = caps & BIT(i);
> +
> +               if (i == LRU_GEN_CORE)
> +                       lru_gen_change_state(enable);
> +               else if (enable)
> +                       static_branch_enable(&lru_gen_caps[i]);
> +               else
> +                       static_branch_disable(&lru_gen_caps[i]);
> +       }
> +
> +       return len;
> +}
> +
> +static struct kobj_attribute lru_gen_enabled_attr = __ATTR(
> +       enabled, 0644, show_enable, store_enable
> +);
> +
> +static struct attribute *lru_gen_attrs[] = {
> +       &lru_gen_enabled_attr.attr,
> +       NULL
> +};
> +
> +static struct attribute_group lru_gen_attr_group = {
> +       .name = "lru_gen",
> +       .attrs = lru_gen_attrs,
> +};
> +
>  /******************************************************************************
>   *                          initialization
>   ******************************************************************************/
> @@ -4956,6 +5179,7 @@ void lru_gen_init_lruvec(struct lruvec *lruvec)
>         struct lru_gen_struct *lrugen = &lruvec->lrugen;
>
>         lrugen->max_seq = MIN_NR_GENS + 1;
> +       lrugen->enabled = lru_gen_enabled();
>
>         for_each_gen_type_zone(gen, type, zone)
>                 INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
> @@ -4996,6 +5220,9 @@ static int __init init_lru_gen(void)
>         BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
>         BUILD_BUG_ON(sizeof(MM_STAT_CODES) != NR_MM_STATS + 1);
>
> +       if (sysfs_create_group(mm_kobj, &lru_gen_attr_group))
> +               pr_err("lru_gen: failed to create sysfs group\n");
> +
>         return 0;
>  };
>  late_initcall(init_lru_gen);
> --
> 2.35.1.616.g0bdcbb4464-goog
>
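
For readers following the sysfs part of the quoted patch: show_enable()
prints the capabilities as a hexadecimal bitmask, and a bit is only reported
when the matching arch/config check also passes, so the value read back can
differ from the value written. A minimal userspace sketch of decoding that
string follows. The enum order is an assumption (the enum itself is not in
the hunk quoted above), and parse_caps() and cap_is_set() are made-up helper
names.

```c
#include <stdlib.h>

/* Assumed to mirror the order of the capability enum in the patch. */
enum {
	LRU_GEN_CORE,
	LRU_GEN_MM_WALK,
	LRU_GEN_NONLEAF_YOUNG,
	NR_LRU_GEN_CAPS
};

/*
 * Hypothetical helper: parse a string such as "0x0007\n", the format
 * produced by show_enable(). Returns the mask, or -1 if no number was found.
 */
long parse_caps(const char *s)
{
	char *end;
	unsigned long caps = strtoul(s, &end, 0);	/* base 0 accepts "0x" */

	if (end == s)
		return -1;

	return (long)caps;
}

/* Hypothetical helper: returns 1 if capability bit 'cap' is set, else 0. */
int cap_is_set(long caps, int cap)
{
	if (caps < 0 || cap < 0 || cap >= NR_LRU_GEN_CAPS)
		return 0;

	return (int)(((unsigned long)caps >> cap) & 1);
}
```

Under these assumptions, a string of "0x0001" would mean that only the core
switch is on, without the page table walk or non-leaf PMD optimizations.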

Thanks
Barry

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 11/14] mm: multi-gen LRU: thrashing prevention
  2022-03-22  7:22     ` Barry Song
@ 2022-03-22  8:14       ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-22  8:14 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Tue, Mar 22, 2022 at 1:23 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Mar 9, 2022 at 3:48 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as
> > requested by many desktop users [1].
> >
> > When set to value N, it prevents the working set of N milliseconds
> > from getting evicted. The OOM killer is triggered if this working set
> > cannot be kept in memory. Based on the average human detectable lag
> > (~100ms), N=1000 usually eliminates intolerable lags due to thrashing.
> > Larger values like N=3000 make lags less noticeable at the risk of
> > premature OOM kills.
> >
> > Compared with the size-based approach, e.g., [2], this time-based
> > approach has the following advantages:
> > 1. It is easier to configure because it is agnostic to applications
> >    and memory sizes.
> > 2. It is more reliable because it is directly wired to the OOM killer.
> >
>
> how are userspace oom daemons like android lmkd, systemd-oomd supposed
> to work with this time-based oom killer?
> only one of min_ttl_ms and userspace daemon should be enabled? or both
> should be enabled at the same time?

Generally we just need one. lmkd and oomd are more flexible, but 1)
they need customizations, 2) not all distros have them, and 3) they
might get stuck in direct reclaim as well.

The last point is not just a theoretical problem:
a) We had many servers under such heavy (global) memory pressure that
200+ direct reclaimers on each CPU competed for resources and
userspace livelocked for 2 hours. Eventually hardware watchdogs kicked
in.
b) On Chromebooks we have something similar to lmkd, and we still
frequently observe crashes due to heavy memory pressure, meaning some
Chrome tabs were stuck in direct reclaim for 120 seconds
(hung_task_timeout_secs=120).
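
For anyone who wants to try the knob from a program rather than a shell, a
rough sketch is below. It assumes a kernel with this patch applied and
enough privileges to write to sysfs. Only the path and the N=1000 value come
from the commit message quoted above; write_uint_to_file() and
set_min_ttl_ms() are made-up names.

```c
#include <stdio.h>

/*
 * Hypothetical helper: write an unsigned decimal value and a newline to a
 * file. Returns 0 on success and -1 on failure.
 */
int write_uint_to_file(const char *path, unsigned int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;

	ok = fprintf(f, "%u\n", value) > 0;

	/* fclose() flushes the buffer, so a rejected write is reported here */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}

/* Path as given in the commit message; e.g. ms=1000 as suggested there. */
int set_min_ttl_ms(unsigned int ms)
{
	return write_uint_to_file("/sys/kernel/mm/lru_gen/min_ttl_ms", ms);
}
```

If a userspace OOM daemon is already in use, this would normally be left at
its default, in line with "we just need one" above.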

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 10/14] mm: multi-gen LRU: kill switch
  2022-03-22  7:47     ` Barry Song
@ 2022-03-22  8:20       ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-22  8:20 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Tue, Mar 22, 2022 at 1:47 AM Barry Song <21cnbao@gmail.com> wrote:
>
...
> > +static bool drain_evictable(struct lruvec *lruvec)
> > +{
> > +       int gen, type, zone;
> > +       int remaining = MAX_LRU_BATCH;
> > +
> > +       for_each_gen_type_zone(gen, type, zone) {
> > +               struct list_head *head = &lruvec->lrugen.lists[gen][type][zone];
> > +
> > +               while (!list_empty(head)) {
> > +                       bool success;
> > +                       struct folio *folio = lru_to_folio(head);
> > +
> > +                       VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
> > +                       VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
> > +                       VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
> > +                       VM_BUG_ON_FOLIO(folio_zonenum(folio) != zone, folio);
> > +
> > +                       success = lru_gen_del_folio(lruvec, folio, false);
> > +                       VM_BUG_ON(!success);
> > +                       lruvec_add_folio(lruvec, folio);
>
> for example, max_seq=4(GEN=0) and max_seq-1=3, then we are supposed to put
> max_seq in the head of active list. but your code seems to be putting max_seq-1
> after putting max_seq, then max_seq is more likely to be evicted
> afterwards as it
> is in the tail of the active list.

This is correct.

> anyway, it might not be so important. I can't imagine we will
> frequently switch mglru
> with lru dynamically. will we?

I certainly hope not :)
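
To make the ordering in the example concrete, here is a small userspace
sketch, not kernel code. It assumes MAX_NR_GENS is 4 and that a sequence
number maps to its list slot by a modulo, which matches the "max_seq=4
(GEN=0)" example above. gen_from_seq(), seq_in_slot() and
print_visit_order() are made-up names.

```c
#include <stdio.h>

#define MAX_NR_GENS 4	/* assumed value, for illustration only */

/* Assumed mapping from a sequence number to its slot in the lists array. */
int gen_from_seq(unsigned long seq)
{
	return (int)(seq % MAX_NR_GENS);
}

/*
 * Return the sequence number in [min_seq, max_seq] that occupies slot 'gen',
 * or -1 if no live sequence number maps to that slot.
 */
long seq_in_slot(int gen, unsigned long min_seq, unsigned long max_seq)
{
	unsigned long seq;

	for (seq = min_seq; seq <= max_seq; seq++) {
		if (gen_from_seq(seq) == gen)
			return (long)seq;
	}

	return -1;
}

/* Print the slots in the order a gen = 0..MAX_NR_GENS-1 loop visits them. */
void print_visit_order(unsigned long min_seq, unsigned long max_seq)
{
	int gen;

	for (gen = 0; gen < MAX_NR_GENS; gen++)
		printf("gen %d holds seq %ld\n", gen,
		       seq_in_slot(gen, min_seq, max_seq));
}
```

With min_seq=1 and max_seq=4, the first slot visited is gen 0, which holds
seq 4, the youngest generation. Its folios are therefore moved before those
of seq 3, which is the ordering described in the example.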

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 10/14] mm: multi-gen LRU: kill switch
  2022-03-22  8:20       ` Yu Zhao
@ 2022-03-22  8:45         ` Barry Song
  -1 siblings, 0 replies; 120+ messages in thread
From: Barry Song @ 2022-03-22  8:45 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Tue, Mar 22, 2022 at 9:20 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Tue, Mar 22, 2022 at 1:47 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> ...
> > > +static bool drain_evictable(struct lruvec *lruvec)
> > > +{
> > > +       int gen, type, zone;
> > > +       int remaining = MAX_LRU_BATCH;
> > > +
> > > +       for_each_gen_type_zone(gen, type, zone) {
> > > +               struct list_head *head = &lruvec->lrugen.lists[gen][type][zone];
> > > +
> > > +               while (!list_empty(head)) {
> > > +                       bool success;
> > > +                       struct folio *folio = lru_to_folio(head);
> > > +
> > > +                       VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
> > > +                       VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
> > > +                       VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
> > > +                       VM_BUG_ON_FOLIO(folio_zonenum(folio) != zone, folio);
> > > +
> > > +                       success = lru_gen_del_folio(lruvec, folio, false);
> > > +                       VM_BUG_ON(!success);
> > > +                       lruvec_add_folio(lruvec, folio);
> >
> > for example, max_seq=4(GEN=0) and max_seq-1=3, then we are supposed to put
> > max_seq in the head of active list. but your code seems to be putting max_seq-1
> > after putting max_seq, then max_seq is more likely to be evicted
> > afterwards as it
> > is in the tail of the active list.
>
> This is correct.

Maybe something like the below can fix it:

 #define for_each_gen_type_zone(gen, type, zone)                                \
-       for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)                   \
+       for (int seq = min_seq[type], (gen)=(seq_to_gen(seq)); seq <= max_seq ; seq++) \
                for ((type) = 0; (type) < ANON_AND_FILE; (type)++)      \
                        for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)

But I am not sure it is worth it if we don't switch between MGLRU and the
LRU that often. So it is up to you: either fix it, or add a comment saying
that we are not trying to build an active list with exactly the same
temperature (hot/cold) ordering that the pages had in the MGLRU lists.

>
> > anyway, it might not be so important. I can't imagine we will
> > frequently switch mglru
> > with lru dynamically. will we?
>
> I certainly hope not :)

me too.

Thanks
Barry

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 10/14] mm: multi-gen LRU: kill switch
  2022-03-22  8:45         ` Barry Song
@ 2022-03-22  9:00           ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-22  9:00 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Tue, Mar 22, 2022 at 2:45 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Mar 22, 2022 at 9:20 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Tue, Mar 22, 2022 at 1:47 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > ...
> > > > +static bool drain_evictable(struct lruvec *lruvec)
> > > > +{
> > > > +       int gen, type, zone;
> > > > +       int remaining = MAX_LRU_BATCH;
> > > > +
> > > > +       for_each_gen_type_zone(gen, type, zone) {
> > > > +               struct list_head *head = &lruvec->lrugen.lists[gen][type][zone];
> > > > +
> > > > +               while (!list_empty(head)) {
> > > > +                       bool success;
> > > > +                       struct folio *folio = lru_to_folio(head);
> > > > +
> > > > +                       VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
> > > > +                       VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
> > > > +                       VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
> > > > +                       VM_BUG_ON_FOLIO(folio_zonenum(folio) != zone, folio);
> > > > +
> > > > +                       success = lru_gen_del_folio(lruvec, folio, false);
> > > > +                       VM_BUG_ON(!success);
> > > > +                       lruvec_add_folio(lruvec, folio);
> > >
> > > for example, max_seq=4(GEN=0) and max_seq-1=3, then we are supposed to put
> > > max_seq in the head of active list. but your code seems to be putting max_seq-1
> > > after putting max_seq, then max_seq is more likely to be evicted
> > > afterwards as it
> > > is in the tail of the active list.
> >
> > This is correct.
>
> maybe something like below can fix it:
>  #define for_each_gen_type_zone(gen, type, zone)
>          \
> -       for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)                   \
> +       for (int seq = min_seq[type], (gen)=(seq_to_gen(seq)); seq <=
> max_seq ; seq++)                       \
>                 for ((type) = 0; (type) < ANON_AND_FILE; (type)++)      \
>                         for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)

I explained this in another email that you might not have had time to go
over yet [1].

This has to cover all *possible* generations, not just [min_seq, max_seq].

[1] https://lore.kernel.org/linux-mm/CAOUHufa50Mj6wusKvFX2cCAk58oTwCLDC8im+_B6OS_dP6=TJQ@mail.gmail.com/
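
The difference between the two ranges can be shown with small numbers. The
sketch below is userspace code under the same assumptions as the example in
this thread (MAX_NR_GENS of 4, slot chosen by a modulo); slots_covered() and
all_slots() are made-up names. It only shows which slots a walk over
[min_seq, max_seq] reaches. Why the remaining slots matter during a state
change is covered in [1], not here.

```c
#define MAX_NR_GENS 4	/* assumed value, for illustration only */

/* Assumed mapping from a sequence number to its slot in the lists array. */
int gen_from_seq(unsigned long seq)
{
	return (int)(seq % MAX_NR_GENS);
}

/* Bitmask of the slots reached by walking seq over [min_seq, max_seq]. */
unsigned int slots_covered(unsigned long min_seq, unsigned long max_seq)
{
	unsigned int mask = 0;
	unsigned long seq;

	for (seq = min_seq; seq <= max_seq; seq++)
		mask |= 1u << gen_from_seq(seq);

	return mask;
}

/* Bitmask of every possible slot, i.e. what a 0..MAX_NR_GENS-1 loop reaches. */
unsigned int all_slots(void)
{
	return (1u << MAX_NR_GENS) - 1;
}
```

For instance, with min_seq=3 and max_seq=4 the walk reaches only slots 3 and
0 (mask 0x9), whereas the existing loop reaches all four (mask 0xf).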

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 10/14] mm: multi-gen LRU: kill switch
@ 2022-03-22  9:00           ` Yu Zhao
  0 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-22  9:00 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Tue, Mar 22, 2022 at 2:45 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Mar 22, 2022 at 9:20 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Tue, Mar 22, 2022 at 1:47 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > ...
> > > > +static bool drain_evictable(struct lruvec *lruvec)
> > > > +{
> > > > +       int gen, type, zone;
> > > > +       int remaining = MAX_LRU_BATCH;
> > > > +
> > > > +       for_each_gen_type_zone(gen, type, zone) {
> > > > +               struct list_head *head = &lruvec->lrugen.lists[gen][type][zone];
> > > > +
> > > > +               while (!list_empty(head)) {
> > > > +                       bool success;
> > > > +                       struct folio *folio = lru_to_folio(head);
> > > > +
> > > > +                       VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
> > > > +                       VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
> > > > +                       VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
> > > > +                       VM_BUG_ON_FOLIO(folio_zonenum(folio) != zone, folio);
> > > > +
> > > > +                       success = lru_gen_del_folio(lruvec, folio, false);
> > > > +                       VM_BUG_ON(!success);
> > > > +                       lruvec_add_folio(lruvec, folio);
> > >
> > > For example, with max_seq=4 (gen=0) and max_seq-1=3, we are supposed to put
> > > max_seq at the head of the active list. But your code seems to add max_seq-1
> > > after max_seq, so max_seq is more likely to be evicted afterwards because it
> > > ends up at the tail of the active list.
> >
> > This is correct.
>
> maybe something like below can fix it:
>  #define for_each_gen_type_zone(gen, type, zone)                         \
> -       for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)                   \
> +       for (int seq = min_seq[type], (gen)=(seq_to_gen(seq)); seq <= max_seq ; seq++) \
>                 for ((type) = 0; (type) < ANON_AND_FILE; (type)++)      \
>                         for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)

I explained this in another email that you might not have had time to
read yet [1].

This has to be all *possible* generations, not just [min_seq, max_seq].

[1] https://lore.kernel.org/linux-mm/CAOUHufa50Mj6wusKvFX2cCAk58oTwCLDC8im+_B6OS_dP6=TJQ@mail.gmail.com/
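
To make the difference between the two loops concrete, here is a minimal
user-space sketch. It assumes a generation index is simply seq % MAX_NR_GENS
(consistent with max_seq=4 mapping to gen 0 in the example above) and that
MAX_NR_GENS is 4; gen_from_seq(), gen_in_window() and slots_outside_window()
are hypothetical helper names, not functions from the patchset. The sketch
does not explain why folios may sit outside [min_seq, max_seq] (that is
covered in [1]); it only shows that a walk restricted to the window does not
visit every slot, while a walk over 0..MAX_NR_GENS-1 does.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value for illustration only. */
#define MAX_NR_GENS 4

/* Map a monotonically increasing sequence number to a slot in the ring. */
static int gen_from_seq(unsigned long seq)
{
	return (int)(seq % MAX_NR_GENS);
}

/* True if slot `gen` is reachable from some seq in [min_seq, max_seq]. */
static bool gen_in_window(int gen, unsigned long min_seq, unsigned long max_seq)
{
	for (unsigned long seq = min_seq; seq <= max_seq; seq++)
		if (gen_from_seq(seq) == gen)
			return true;
	return false;
}

/* Count slots that a windowed walk skips but a full 0..MAX_NR_GENS-1 walk visits. */
static int slots_outside_window(unsigned long min_seq, unsigned long max_seq)
{
	int skipped = 0;

	for (int gen = 0; gen < MAX_NR_GENS; gen++)
		if (!gen_in_window(gen, min_seq, max_seq))
			skipped++;
	return skipped;
}
```

With min_seq=3 and max_seq=4, the windowed walk only reaches slots 3 and 0,
so slots_outside_window(3, 4) reports two slots that would never be drained.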

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation
  2022-03-19  3:11       ` Yu Zhao
@ 2022-03-23  7:47         ` Barry Song
  -1 siblings, 0 replies; 120+ messages in thread
From: Barry Song @ 2022-03-23  7:47 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Sat, Mar 19, 2022 at 4:11 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Fri, Mar 18, 2022 at 9:01 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > > +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > > +{
> > > +       unsigned long old_flags, new_flags;
> > > +       int type = folio_is_file_lru(folio);
> > > +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > > +       int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
> > > +
> > > +       do {
> > > +               new_flags = old_flags = READ_ONCE(folio->flags);
> > > +               VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
> > > +
> > > +               new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> > > +               new_gen = (old_gen + 1) % MAX_NR_GENS;
> >
> > new_gen is assigned twice, i assume you mean
> >                old_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> >                new_gen = (old_gen + 1) % MAX_NR_GENS;
> >
> > or do you always mean new_gen =  lru_gen_from_seq(min_seq) + 1?
>
> Thanks a lot for your attention to details!
>
> The first line should be in the next patch but I overlooked during the
> last refactoring:

Thanks for the clarification. So an unmapped file-backed page that is
accessed only through system calls will always be in either min_seq or
min_seq + 1? It has no chance of being in max_seq like a faulted-in
mapped file page?

>
>   new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> + /* folio_update_gen() has promoted this page? */
> + if (new_gen >= 0 && new_gen != old_gen)
> + return new_gen;
> +
>   new_gen = (old_gen + 1) % MAX_NR_GENS;
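
For reference, the corrected decision can be modeled in user space as below.
This is only a sketch under stated assumptions: the field width and bit
offset are made up, MAX_NR_GENS is assumed to be 4, and flags_set_gen(),
flags_get_gen() and next_gen() are hypothetical names. It relies on the
convention visible in the quoted code that the flags field stores gen + 1,
so the decoded value is -1 when the field is 0. The atomic update loop
around folio->flags is not modeled.

```c
#include <assert.h>

#define MAX_NR_GENS   4   /* assumed for illustration */
#define LRU_GEN_WIDTH 3   /* assumed: wide enough to hold gen + 1 */
#define LRU_GEN_PGOFF 8   /* assumed bit offset, not the kernel's */
#define LRU_GEN_MASK  (((1UL << LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)

/* Store gen + 1 in the flags word, leaving all other bits untouched. */
static unsigned long flags_set_gen(unsigned long flags, int gen)
{
	return (flags & ~LRU_GEN_MASK) | ((gen + 1UL) << LRU_GEN_PGOFF);
}

/* Decode the generation; yields -1 when the field is 0. */
static int flags_get_gen(unsigned long flags)
{
	return (int)((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
}

/*
 * Model of the fixed logic: old_gen is the oldest generation (derived
 * from min_seq). If the folio already carries a different generation,
 * keep it; otherwise move the folio one slot forward in the ring.
 */
static int next_gen(unsigned long flags, int old_gen)
{
	int new_gen = flags_get_gen(flags);

	/* Already promoted elsewhere: keep that generation. */
	if (new_gen >= 0 && new_gen != old_gen)
		return new_gen;

	/* Still in the oldest generation: advance by one, wrapping around. */
	return (old_gen + 1) % MAX_NR_GENS;
}
```

For a folio still in the oldest generation, next_gen() returns the next slot;
for a folio that was moved to another generation in the meantime, it returns
that generation unchanged.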

Thanks
Barry

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation
  2022-03-23  7:47         ` Barry Song
@ 2022-03-24  6:24           ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-03-24  6:24 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Wed, Mar 23, 2022 at 1:47 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, Mar 19, 2022 at 4:11 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Fri, Mar 18, 2022 at 9:01 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > > +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > > > +{
> > > > +       unsigned long old_flags, new_flags;
> > > > +       int type = folio_is_file_lru(folio);
> > > > +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > > > +       int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
> > > > +
> > > > +       do {
> > > > +               new_flags = old_flags = READ_ONCE(folio->flags);
> > > > +               VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
> > > > +
> > > > +               new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> > > > +               new_gen = (old_gen + 1) % MAX_NR_GENS;
> > >
> > > new_gen is assigned twice, i assume you mean
> > >                old_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> > >                new_gen = (old_gen + 1) % MAX_NR_GENS;
> > >
> > > or do you always mean new_gen =  lru_gen_from_seq(min_seq) + 1?
> >
> > Thanks a lot for your attention to details!
> >
> > The first line should be in the next patch but I overlooked during the
> > last refactoring:
>
> Thanks for the clarification. So an unmapped file-backed page which is
> accessed only by system call will always be in either min_seq or
> min_seq + 1? it has no chance to be in max_seq like a faulted-in
> mapped file page?

That's right. The rationale is documented here under the `Assumptions`
section [1]. This is also related to Aneesh's question about why MGLRU
doesn't need additional heuristics for VM_EXEC pages [2]. Unmapped
file pages weaken the protection of executable pages under heavy
buffered IO workloads like Java NIO.

[1] https://lore.kernel.org/linux-mm/20220309021230.721028-15-yuzhao@google.com/
[2] https://lore.kernel.org/linux-mm/CAOUHufYfpiGdLSdffvzDqaD5oYFG99oDJ2xgQd2Ph77OFR5NAA@mail.gmail.com/

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation
  2022-03-24  6:24           ` Yu Zhao
@ 2022-03-24  8:13             ` Barry Song
  -1 siblings, 0 replies; 120+ messages in thread
From: Barry Song @ 2022-03-24  8:13 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Thu, Mar 24, 2022 at 7:24 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Wed, Mar 23, 2022 at 1:47 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, Mar 19, 2022 at 4:11 PM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > On Fri, Mar 18, 2022 at 9:01 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > > +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > > > > +{
> > > > > +       unsigned long old_flags, new_flags;
> > > > > +       int type = folio_is_file_lru(folio);
> > > > > +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > > > > +       int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
> > > > > +
> > > > > +       do {
> > > > > +               new_flags = old_flags = READ_ONCE(folio->flags);
> > > > > +               VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
> > > > > +
> > > > > +               new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> > > > > +               new_gen = (old_gen + 1) % MAX_NR_GENS;
> > > >
> > > > new_gen is assigned twice, i assume you mean
> > > >                old_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> > > >                new_gen = (old_gen + 1) % MAX_NR_GENS;
> > > >
> > > > or do you always mean new_gen =  lru_gen_from_seq(min_seq) + 1?
> > >
> > > Thanks a lot for your attention to details!
> > >
> > > The first line should be in the next patch but I overlooked during the
> > > last refactoring:
> >
> > Thanks for the clarification. So an unmapped file-backed page which is
> > accessed only by system call will always be in either min_seq or
> > min_seq + 1? it has no chance to be in max_seq like a faulted-in
> > mapped file page?
>
> That's right. The rationale is documented here under the `Assumptions`
> section [1]. This is also related to Aneesh's question about why MGLRU
> doesn't need additional heuristics for VM_EXEC pages [2]. Unmapped
> file pages weaken the protection of executable pages under heavy
> buffered IO workloads like Java NIO.

OK, this is probably right.
I will also run a test that maltreats unmapped pages in the vanilla LRU. The
PoC code looks like this (not tested yet):

Subject: [PATCH 1/1] mm: vmscan: maltreat unmapped file-backed pages

[This patch has not been tested yet.]

A lesson we learned from MGLRU is that mapped file-backed pages
are much more important than unmapped ones.
So this patch doesn't move unmapped pages to the active list on
their second access; instead, it keeps them on the inactive list.
We abuse PG_workingset to let memory reclaim know that this is a
relatively hot file-backed page, so reclaim should keep the page
on the inactive list.

---
 mm/swap.c   | 34 ++++++++++++++++++++++------------
 mm/vmscan.c |  6 ++++--
 2 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index e65e7520bebf..cb0c6e704f2e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -470,18 +470,28 @@ void folio_mark_accessed(struct folio *folio)
                 * evictable page accessed has no effect.
                 */
        } else if (!folio_test_active(folio)) {
-               /*
-                * If the page is on the LRU, queue it for activation via
-                * lru_pvecs.activate_page. Otherwise, assume the page is on a
-                * pagevec, mark it active and it'll be moved to the active
-                * LRU on the next drain.
-                */
-               if (folio_test_lru(folio))
-                       folio_activate(folio);
-               else
-                       __lru_cache_activate_folio(folio);
-               folio_clear_referenced(folio);
-               workingset_activation(folio);
+               if (folio_mapped(folio)) {
+                       /*
+                        * If the mapped page is on the LRU, queue it for activation via
+                        * lru_pvecs.activate_page. Otherwise, assume the page is on a
+                        * pagevec, mark it active and it'll be moved to the active
+                        * LRU on the next drain.
+                        */
+                       if (folio_test_lru(folio))
+                               folio_activate(folio);
+                       else
+                               __lru_cache_activate_folio(folio);
+                       folio_clear_referenced(folio);
+                       workingset_activation(folio);
+               } else {
+                       /*
+                        * We maltreat unmapped file-backed pages and abuse the
+                        * PG_workingset flag to let eviction know that this is a
+                        * relatively hot file page, so eviction can move it back to
+                        * the head of the inactive list.
+                        */
+                       folio_set_workingset(folio);
+               }
        }
        if (folio_test_idle(folio))
                folio_clear_idle(folio);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d6f3c9812f97..56a66eb4a3f7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1393,12 +1393,14 @@ enum page_references {
 static enum page_references page_check_references(struct page *page,
                                                  struct scan_control *sc)
 {
-       int referenced_ptes, referenced_page;
+       int referenced_ptes, referenced_page, workingset;
        unsigned long vm_flags;

        referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
                                          &vm_flags);
        referenced_page = TestClearPageReferenced(page);
+       workingset = page_is_file_lru(page) && !page_mapped(page) &&
+                    TestClearPageWorkingset(page);

        /*
         * Mlock lost the isolation race with us.  Let try_to_unmap()
@@ -1438,7 +1440,7 @@ static enum page_references page_check_references(struct page *page,

        /* Reclaim if clean, defer dirty pages to writeback */
        if (referenced_page && !PageSwapBacked(page))
-               return PAGEREF_RECLAIM_CLEAN;
+               return workingset ? PAGEREF_KEEP : PAGEREF_RECLAIM_CLEAN;

        return PAGEREF_RECLAIM;
 }
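
The intended effect of the PoC on an unmapped file page can be traced with a
small user-space model. This is a sketch, not kernel code: struct page_state,
mark_accessed() and check_references_tail() are hypothetical names, only the
inactive-list path is modeled, and only the tail of page_check_references()
shown in the diff (the case with no referenced PTEs) is reproduced.

```c
#include <assert.h>
#include <stdbool.h>

enum page_references { PAGEREF_RECLAIM, PAGEREF_RECLAIM_CLEAN, PAGEREF_KEEP };

/* Hypothetical stand-in for the page flags the PoC looks at. */
struct page_state {
	bool file_lru;     /* page_is_file_lru() */
	bool mapped;       /* page_mapped() */
	bool swap_backed;  /* PageSwapBacked() */
	bool referenced;   /* PG_referenced */
	bool workingset;   /* PG_workingset */
};

/* Access to an inactive page: first access sets PG_referenced. */
static void mark_accessed(struct page_state *p)
{
	if (!p->referenced)
		p->referenced = true;
	else if (!p->mapped)
		p->workingset = true;   /* PoC: mark instead of activating */
	/* Mapped pages would be activated here; not modeled. */
}

/* Tail of page_check_references() after the PoC, with no referenced PTEs. */
static enum page_references check_references_tail(struct page_state *p)
{
	bool referenced_page = p->referenced;
	bool workingset;

	p->referenced = false;          /* TestClearPageReferenced() */

	/* && short-circuits, so the flag is only cleared for unmapped file pages. */
	workingset = p->file_lru && !p->mapped && p->workingset;
	if (workingset)
		p->workingset = false;  /* TestClearPageWorkingset() */

	if (referenced_page && !p->swap_backed)
		return workingset ? PAGEREF_KEEP : PAGEREF_RECLAIM_CLEAN;

	return PAGEREF_RECLAIM;
}
```

In this model, a page accessed twice survives one reclaim pass with
PAGEREF_KEEP and, if not accessed again, is reclaimed on the next pass.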

>
> [1] https://lore.kernel.org/linux-mm/20220309021230.721028-15-yuzhao@google.com/
> [2] https://lore.kernel.org/linux-mm/CAOUHufYfpiGdLSdffvzDqaD5oYFG99oDJ2xgQd2Ph77OFR5NAA@mail.gmail.com/

Thanks
Barry

^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 07/14] mm: multi-gen LRU: exploit locality in rmap
  2022-03-09  2:12   ` Yu Zhao
@ 2022-04-07  2:29     ` Barry Song
  -1 siblings, 0 replies; 120+ messages in thread
From: Barry Song @ 2022-04-07  2:29 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Wed, Mar 9, 2022 at 3:48 PM Yu Zhao <yuzhao@google.com> wrote:
>
> Searching the rmap for PTEs mapping each page on an LRU list (to test
> and clear the accessed bit) can be expensive because pages from
> different VMAs (PA space) are not cache friendly to the rmap (VA
> space). For workloads mostly using mapped pages, the rmap has a high
> CPU cost in the reclaim path.
>
> This patch exploits spatial locality to reduce the trips into the
> rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
> new function lru_gen_look_around() scans at most BITS_PER_LONG-1
> adjacent PTEs. On finding another young PTE, it clears the accessed
> bit and updates the gen counter of the page mapped by this PTE to
> (max_seq%MAX_NR_GENS)+1.

Hi Yu,
This looks like an interesting way to reduce the cost of the rmap, but could
it cause cold pages to be misjudged as hot?
Suppose a page is mapped by 20 processes and has been accessed by 5 of them.
When we look around in one of those 5 processes, the page is found young and
the accessed bit in that PTE is cleared, but the other 4 young PTEs are left
untouched. If the page is then not accessed for a long time, those 4 PTEs
will still make it look "hot": we will find it young either when looking
around in one of the other 4 processes or when walking the rmap for the page
later. Is that right?
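
The scenario can be written down as a tiny user-space model. It is only a
sketch of the question, not of how the patch behaves: struct shared_page,
touch(), look_around_one() and rmap_walk_all() are hypothetical names, and
each array slot stands for the accessed bit of one process's PTE.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_MAPPINGS 20

/* Accessed bit of each PTE that maps the same page. */
struct shared_page {
	bool young[NR_MAPPINGS];
};

/* Process i accesses the page: hardware sets the accessed bit in its PTE. */
static void touch(struct shared_page *pg, int i)
{
	pg->young[i] = true;
}

/* Look-around through one mapping: test and clear only that PTE. */
static bool look_around_one(struct shared_page *pg, int i)
{
	bool was_young = pg->young[i];

	pg->young[i] = false;
	return was_young;
}

/* rmap walk: test and clear every PTE; returns how many were young. */
static int rmap_walk_all(struct shared_page *pg)
{
	int n = 0;

	for (int i = 0; i < NR_MAPPINGS; i++) {
		if (pg->young[i])
			n++;
		pg->young[i] = false;
	}
	return n;
}
```

After five touch() calls and one look_around_one(), a later rmap_walk_all()
still finds four young PTEs even though the page has not been accessed since.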

>
> Server benchmark results:
>   Single workload:
>     fio (buffered I/O): no change
>
>   Single workload:
>     memcached (anon): +[3.5, 5.5]%
>                 Ops/sec      KB/sec
>       patch1-5: 972526.07    37826.95
>       patch1-6: 1015292.83   39490.38
>
>   Configurations:
>     no change
>
> Client benchmark results:
>   kswapd profiles:
>     patch1-5
>       39.73%  lzo1x_1_do_compress (real work)
>       14.96%  page_vma_mapped_walk
>        6.97%  _raw_spin_unlock_irq
>        3.07%  do_raw_spin_lock
>        2.53%  anon_vma_interval_tree_iter_first
>        2.04%  ptep_clear_flush
>        1.82%  __zram_bvec_write
>        1.76%  __anon_vma_interval_tree_subtree_search
>        1.57%  memmove
>        1.45%  free_unref_page_list
>
>     patch1-6
>       45.49%  lzo1x_1_do_compress (real work)
>        7.38%  page_vma_mapped_walk
>        7.24%  _raw_spin_unlock_irq
>        2.64%  ptep_clear_flush
>        2.31%  __zram_bvec_write
>        2.13%  do_raw_spin_lock
>        2.09%  lru_gen_look_around
>        1.89%  free_unref_page_list
>        1.85%  memmove
>        1.74%  obj_malloc
>
>   Configurations:
>     no change
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
>  include/linux/memcontrol.h |  31 ++++++++
>  include/linux/mm.h         |   5 ++
>  include/linux/mmzone.h     |   6 ++
>  include/linux/swap.h       |   1 +
>  mm/memcontrol.c            |   1 +
>  mm/rmap.c                  |   7 ++
>  mm/swap.c                  |   4 +-
>  mm/vmscan.c                | 155 +++++++++++++++++++++++++++++++++++++
>  8 files changed, 208 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0abbd685703b..c8ce74577290 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -437,6 +437,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
>   * - LRU isolation
>   * - lock_page_memcg()
>   * - exclusive reference
> + * - mem_cgroup_trylock_pages()
>   *
>   * For a kmem folio a caller should hold an rcu read lock to protect memcg
>   * associated with a kmem folio from being released.
> @@ -498,6 +499,7 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
>   * - LRU isolation
>   * - lock_page_memcg()
>   * - exclusive reference
> + * - mem_cgroup_trylock_pages()
>   *
>   * For a kmem page a caller should hold an rcu read lock to protect memcg
>   * associated with a kmem page from being released.
> @@ -935,6 +937,23 @@ void unlock_page_memcg(struct page *page);
>
>  void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
>
> +/* try to stabilize folio_memcg() for all the pages in a memcg */
> +static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
> +{
> +       rcu_read_lock();
> +
> +       if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account))
> +               return true;
> +
> +       rcu_read_unlock();
> +       return false;
> +}
> +
> +static inline void mem_cgroup_unlock_pages(void)
> +{
> +       rcu_read_unlock();
> +}
> +
>  /* idx can be of type enum memcg_stat_item or node_stat_item */
>  static inline void mod_memcg_state(struct mem_cgroup *memcg,
>                                    int idx, int val)
> @@ -1372,6 +1391,18 @@ static inline void folio_memcg_unlock(struct folio *folio)
>  {
>  }
>
> +static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
> +{
> +       /* to match folio_memcg_rcu() */
> +       rcu_read_lock();
> +       return true;
> +}
> +
> +static inline void mem_cgroup_unlock_pages(void)
> +{
> +       rcu_read_unlock();
> +}
> +
>  static inline void mem_cgroup_handle_over_high(void)
>  {
>  }
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1e3e6dd90c0f..1f3695e95942 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1588,6 +1588,11 @@ static inline unsigned long folio_pfn(struct folio *folio)
>         return page_to_pfn(&folio->page);
>  }
>
> +static inline struct folio *pfn_folio(unsigned long pfn)
> +{
> +       return page_folio(pfn_to_page(pfn));
> +}
> +
>  /* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */
>  #ifdef CONFIG_MIGRATION
>  static inline bool is_pinnable_page(struct page *page)
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 307c5c24c7ac..cd64c64a952d 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -359,6 +359,7 @@ enum lruvec_flags {
>  #ifndef __GENERATING_BOUNDS_H
>
>  struct lruvec;
> +struct page_vma_mapped_walk;
>
>  #define LRU_GEN_MASK           ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
>  #define LRU_REFS_MASK          ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
> @@ -411,6 +412,7 @@ struct lru_gen_struct {
>  };
>
>  void lru_gen_init_lruvec(struct lruvec *lruvec);
> +void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
>
>  #ifdef CONFIG_MEMCG
>  void lru_gen_init_memcg(struct mem_cgroup *memcg);
> @@ -423,6 +425,10 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
>  {
>  }
>
> +static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
> +{
> +}
> +
>  #ifdef CONFIG_MEMCG
>  static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
>  {
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 1d38d9475c4d..b37520d3ff1d 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -372,6 +372,7 @@ extern void lru_add_drain(void);
>  extern void lru_add_drain_cpu(int cpu);
>  extern void lru_add_drain_cpu_zone(struct zone *zone);
>  extern void lru_add_drain_all(void);
> +extern void folio_activate(struct folio *folio);
>  extern void deactivate_file_page(struct page *page);
>  extern void deactivate_page(struct page *page);
>  extern void mark_page_lazyfree(struct page *page);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3fcbfeda259b..e4c30950aa3c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2744,6 +2744,7 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
>          * - LRU isolation
>          * - lock_page_memcg()
>          * - exclusive reference
> +        * - mem_cgroup_trylock_pages()
>          */
>         folio->memcg_data = (unsigned long)memcg;
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 6a1e8c7f6213..112e77dc62f4 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -73,6 +73,7 @@
>  #include <linux/page_idle.h>
>  #include <linux/memremap.h>
>  #include <linux/userfaultfd_k.h>
> +#include <linux/mm_inline.h>
>
>  #include <asm/tlbflush.h>
>
> @@ -819,6 +820,12 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,
>                 }
>
>                 if (pvmw.pte) {
> +                       if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
> +                           !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
> +                               lru_gen_look_around(&pvmw);
> +                               referenced++;
> +                       }
> +
>                         if (ptep_clear_flush_young_notify(vma, address,
>                                                 pvmw.pte)) {
>                                 /*
> diff --git a/mm/swap.c b/mm/swap.c
> index f5c0bcac8dcd..e65e7520bebf 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -344,7 +344,7 @@ static bool need_activate_page_drain(int cpu)
>         return pagevec_count(&per_cpu(lru_pvecs.activate_page, cpu)) != 0;
>  }
>
> -static void folio_activate(struct folio *folio)
> +void folio_activate(struct folio *folio)
>  {
>         if (folio_test_lru(folio) && !folio_test_active(folio) &&
>             !folio_test_unevictable(folio)) {
> @@ -364,7 +364,7 @@ static inline void activate_page_drain(int cpu)
>  {
>  }
>
> -static void folio_activate(struct folio *folio)
> +void folio_activate(struct folio *folio)
>  {
>         struct lruvec *lruvec;
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 91a827ff665d..2b685aa0379c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1558,6 +1558,11 @@ static unsigned int shrink_page_list(struct list_head *page_list,
>                 if (!sc->may_unmap && page_mapped(page))
>                         goto keep_locked;
>
> +               /* folio_update_gen() tried to promote this page? */
> +               if (lru_gen_enabled() && !ignore_references &&
> +                   page_mapped(page) && PageReferenced(page))
> +                       goto keep_locked;
> +
>                 may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
>                         (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
>
> @@ -3225,6 +3230,31 @@ static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
>   *                          the aging
>   ******************************************************************************/
>
> +static int folio_update_gen(struct folio *folio, int gen)
> +{
> +       unsigned long old_flags, new_flags;
> +
> +       VM_BUG_ON(gen >= MAX_NR_GENS);
> +       VM_BUG_ON(!rcu_read_lock_held());
> +
> +       do {
> +               new_flags = old_flags = READ_ONCE(folio->flags);
> +
> +               /* for shrink_page_list() */
> +               if (!(new_flags & LRU_GEN_MASK)) {
> +                       new_flags |= BIT(PG_referenced);
> +                       continue;
> +               }
> +
> +               new_flags &= ~LRU_GEN_MASK;
> +               new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
> +               new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
> +       } while (new_flags != old_flags &&
> +                cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
> +
> +       return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> +}
> +
>  static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
>  {
>         unsigned long old_flags, new_flags;
> @@ -3237,6 +3267,10 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
>                 VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
>
>                 new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> +               /* folio_update_gen() has promoted this page? */
> +               if (new_gen >= 0 && new_gen != old_gen)
> +                       return new_gen;
> +
>                 new_gen = (old_gen + 1) % MAX_NR_GENS;
>
>                 new_flags &= ~LRU_GEN_MASK;
> @@ -3438,6 +3472,122 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
>         } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
>  }
>
> +/*
> + * This function exploits spatial locality when shrink_page_list() walks the
> + * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages.
> + */
> +void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
> +{
> +       int i;
> +       pte_t *pte;
> +       unsigned long start;
> +       unsigned long end;
> +       unsigned long addr;
> +       struct folio *folio;
> +       unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {};
> +       struct mem_cgroup *memcg = page_memcg(pvmw->page);
> +       struct pglist_data *pgdat = page_pgdat(pvmw->page);
> +       struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +       DEFINE_MAX_SEQ(lruvec);
> +       int old_gen, new_gen = lru_gen_from_seq(max_seq);
> +
> +       lockdep_assert_held(pvmw->ptl);
> +       VM_BUG_ON_PAGE(PageLRU(pvmw->page), pvmw->page);
> +
> +       start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start);
> +       end = pmd_addr_end(pvmw->address, pvmw->vma->vm_end);
> +
> +       if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
> +               if (pvmw->address - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
> +                       end = start + MIN_LRU_BATCH * PAGE_SIZE;
> +               else if (end - pvmw->address < MIN_LRU_BATCH * PAGE_SIZE / 2)
> +                       start = end - MIN_LRU_BATCH * PAGE_SIZE;
> +               else {
> +                       start = pvmw->address - MIN_LRU_BATCH * PAGE_SIZE / 2;
> +                       end = pvmw->address + MIN_LRU_BATCH * PAGE_SIZE / 2;
> +               }
> +       }
> +
> +       pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE;
> +
> +       rcu_read_lock();
> +       arch_enter_lazy_mmu_mode();
> +
> +       for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
> +               unsigned long pfn = pte_pfn(pte[i]);
> +
> +               VM_BUG_ON(addr < pvmw->vma->vm_start || addr >= pvmw->vma->vm_end);
> +
> +               if (!pte_present(pte[i]) || is_zero_pfn(pfn))
> +                       continue;
> +
> +               if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i])))
> +                       continue;
> +
> +               if (!pte_young(pte[i]))
> +                       continue;
> +
> +               VM_BUG_ON(!pfn_valid(pfn));
> +               if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
> +                       continue;
> +
> +               folio = pfn_folio(pfn);
> +               if (folio_nid(folio) != pgdat->node_id)
> +                       continue;
> +
> +               if (folio_memcg_rcu(folio) != memcg)
> +                       continue;
> +
> +               if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
> +                       continue;
> +
> +               if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
> +                   !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
> +                     !folio_test_swapcache(folio)))
> +                       folio_mark_dirty(folio);
> +
> +               old_gen = folio_lru_gen(folio);
> +               if (old_gen < 0)
> +                       folio_set_referenced(folio);
> +               else if (old_gen != new_gen)
> +                       __set_bit(i, bitmap);
> +       }
> +
> +       arch_leave_lazy_mmu_mode();
> +       rcu_read_unlock();
> +
> +       if (bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
> +               for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
> +                       folio = page_folio(pte_page(pte[i]));
> +                       folio_activate(folio);
> +               }
> +               return;
> +       }
> +
> +       /* folio_update_gen() requires stable folio_memcg() */
> +       if (!mem_cgroup_trylock_pages(memcg))
> +               return;
> +
> +       spin_lock_irq(&lruvec->lru_lock);
> +       new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
> +
> +       for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
> +               folio = page_folio(pte_page(pte[i]));
> +               if (folio_memcg_rcu(folio) != memcg)
> +                       continue;
> +
> +               old_gen = folio_update_gen(folio, new_gen);
> +               if (old_gen < 0 || old_gen == new_gen)
> +                       continue;
> +
> +               lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> +       }
> +
> +       spin_unlock_irq(&lruvec->lru_lock);
> +
> +       mem_cgroup_unlock_pages();
> +}
> +
>  /******************************************************************************
>   *                          the eviction
>   ******************************************************************************/
> @@ -3471,6 +3621,11 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
>                 return true;
>         }
>
> +       if (gen != lru_gen_from_seq(lrugen->min_seq[type])) {
> +               list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
> +               return true;
> +       }
> +
>         if (tier > tier_idx) {
>                 int hist = lru_hist_from_seq(lrugen->min_seq[type]);
>
> --
> 2.35.1.616.g0bdcbb4464-goog
>
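
For readers following the lru_gen_look_around() hunk quoted above, the
window selection can be tried out in user space. This is only a sketch
under stated assumptions: PAGE_SIZE, PMD_SIZE and MIN_LRU_BATCH are
assumed values (4 KiB pages, 2 MiB PMDs, a batch of 64), and
struct window and look_around_window() are hypothetical names, not
kernel API. Address wraparound is ignored.

```c
#include <assert.h>

/* Assumed values for illustration; the kernel derives these per arch. */
#define PAGE_SIZE	4096UL
#define PMD_SIZE	(512UL * PAGE_SIZE)
#define PMD_MASK	(~(PMD_SIZE - 1))
#define MIN_LRU_BATCH	64UL

/* Hypothetical helper type: the [start, end) range to scan. */
struct window {
	unsigned long start;
	unsigned long end;
};

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Mirror of the quoted start/end computation: clamp to the PMD and the
 * VMA first, then to at most one batch of PTEs around addr.
 */
static struct window look_around_window(unsigned long addr,
					unsigned long vm_start,
					unsigned long vm_end)
{
	const unsigned long span = MIN_LRU_BATCH * PAGE_SIZE;
	struct window w;

	assert(vm_start <= addr && addr < vm_end);

	w.start = max_ul(addr & PMD_MASK, vm_start);
	w.end = min_ul((addr & PMD_MASK) + PMD_SIZE, vm_end);

	if (w.end - w.start > span) {
		if (addr - w.start < span / 2) {
			/* close to the lower bound: scan upwards from it */
			w.end = w.start + span;
		} else if (w.end - addr < span / 2) {
			/* close to the upper bound: scan downwards from it */
			w.start = w.end - span;
		} else {
			/* otherwise centre the window on addr */
			w.start = addr - span / 2;
			w.end = addr + span / 2;
		}
	}

	return w;
}
```

With these assumed values, look_around_window(0x264000, 0x200000,
0x400000) gives [0x244000, 0x284000), i.e. 64 PTEs centred on the young
PTE, while a VMA of only 8 pages is scanned whole.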

Thanks
Barry

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v9 07/14] mm: multi-gen LRU: exploit locality in rmap
  2022-04-07  2:29     ` Barry Song
@ 2022-04-07  3:04       ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-04-07  3:04 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Wed, Apr 6, 2022 at 8:29 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Mar 9, 2022 at 3:48 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > Searching the rmap for PTEs mapping each page on an LRU list (to test
> > and clear the accessed bit) can be expensive because pages from
> > different VMAs (PA space) are not cache friendly to the rmap (VA
> > space). For workloads mostly using mapped pages, the rmap has a high
> > CPU cost in the reclaim path.
> >
> > This patch exploits spatial locality to reduce the trips into the
> > rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
> > new function lru_gen_look_around() scans at most BITS_PER_LONG-1
> > adjacent PTEs. On finding another young PTE, it clears the accessed
> > bit and updates the gen counter of the page mapped by this PTE to
> > (max_seq%MAX_NR_GENS)+1.
>
> Hi Yu,
> It seems an interesting feature to save the cost of rmap. but will it lead to
> possible judging of cold pages as hot pages?
> In case a page is mapped by 20 processes,  and it has been accessed
> by 5 of them, when we look around one of the 5 processes, the page
> will be young and this pte is cleared. but we still have 4 ptes which are not
> cleared. then we don't access the page for a long time, but the 4 uncleared
> PTEs will still make the page "hot" since they are not cleared, we will find
> the page is hot either due to look-arounding the 4 processes or rmapping
> the page later?

Why are the remaining 4 accessed PTEs skipped? The rmap should check
all the 20 PTEs.

Even if they were skipped, it wouldn't matter. The same argument could
be made for the remaining 1 million minus 1 pages that have been
scanned in a timely manner on a 4GB laptop. The fundamental principle
(assumption) of MGLRU was never about making the best choices. Nothing
can, because it's impossible to predict the future that well given the
complexity of today's workloads: not on a phone, and definitely not on
a server that runs mixed types of workloads. The primary goal is to
avoid the worst choices at a minimum (scanning) cost. The secondary
goal is to pick good ones at an acceptable cost, which are probably
about half of all possible choices.
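
To make the scenario concrete, here is a minimal user-space sketch in
C. It is not kernel code. It assumes a toy model in which every process
mapping a page owns one accessed bit for it; toy_page, toy_look_around(),
toy_rmap_walk() and NR_MAPPERS are hypothetical names made up for this
illustration. It shows that after look-around clears the bit in one
process, a later rmap walk still tests and clears the bits in all the
remaining mappings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_MAPPERS 5	/* processes that accessed the page (hypothetical) */

/* Toy model: one accessed ("young") bit per process mapping the page. */
struct toy_page {
	bool young[NR_MAPPERS];
};

/* Look-around in one process clears only that process's accessed bit. */
static bool toy_look_around(struct toy_page *page, size_t proc)
{
	assert(proc < NR_MAPPERS);
	bool was_young = page->young[proc];

	page->young[proc] = false;
	return was_young;
}

/*
 * An rmap walk visits every mapping of the page, so it tests and
 * clears the accessed bit in all of them. Returns how many were young.
 */
static int toy_rmap_walk(struct toy_page *page)
{
	int referenced = 0;

	for (size_t i = 0; i < NR_MAPPERS; i++) {
		if (page->young[i]) {
			page->young[i] = false;
			referenced++;
		}
	}
	return referenced;
}
```

In this model, with five young mappings, a look-around in process 0
followed by toy_rmap_walk() finds 4 young mappings, and a second walk
finds none if nobody touches the page in between.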

* Re: [PATCH v9 07/14] mm: multi-gen LRU: exploit locality in rmap
  2022-04-07  3:04       ` Yu Zhao
@ 2022-04-07  3:46         ` Barry Song
  -1 siblings, 0 replies; 120+ messages in thread
From: Barry Song @ 2022-04-07  3:46 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Thu, Apr 7, 2022 at 3:04 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Wed, Apr 6, 2022 at 8:29 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Wed, Mar 9, 2022 at 3:48 PM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > Searching the rmap for PTEs mapping each page on an LRU list (to test
> > > and clear the accessed bit) can be expensive because pages from
> > > different VMAs (PA space) are not cache friendly to the rmap (VA
> > > space). For workloads mostly using mapped pages, the rmap has a high
> > > CPU cost in the reclaim path.
> > >
> > > This patch exploits spatial locality to reduce the trips into the
> > > rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
> > > new function lru_gen_look_around() scans at most BITS_PER_LONG-1
> > > adjacent PTEs. On finding another young PTE, it clears the accessed
> > > bit and updates the gen counter of the page mapped by this PTE to
> > > (max_seq%MAX_NR_GENS)+1.
> >
> > Hi Yu,
> > It seems an interesting feature to save the cost of rmap. but will it lead to
> > possible judging of cold pages as hot pages?
> > In case a page is mapped by 20 processes,  and it has been accessed
> > by 5 of them, when we look around one of the 5 processes, the page
> > will be young and this pte is cleared. but we still have 4 ptes which are not
> > cleared. then we don't access the page for a long time, but the 4 uncleared
> > PTEs will still make the page "hot" since they are not cleared, we will find
> > the page is hot either due to look-arounding the 4 processes or rmapping
> > the page later?
>
> Why are the remaining 4 accessed PTEs skipped? The rmap should check
> all the 20 PTEs.

For example, page A is the neighbour of page B in process 1. When we do
rmap for B, we look around and clear A's PTE in process 1, but A's PTEs
are still set in processes 2, 3, 4 and 5.

>
> Even if they were skipped, it doesn't matter. The same argument could
> be made for the rest of 1 millions minus 1 pages that have been timely
> scanned, on a 4GB laptop. The fundamental principle (assumption) of
> MGLRU is never about making the best choices. Nothing can because it's
> impossible to predict the future that well, given the complexity of
> today's workloads, not on a phone, definitely not on a server that
> runs mixed types of workloads. The primary goal is to avoid the worst
> choices at a minimum (scanning) cost. The second goal is to pick good
> ones at an acceptable cost, which probably are a half of all possible
> choices.

thanks
barry

* Re: [PATCH v9 07/14] mm: multi-gen LRU: exploit locality in rmap
  2022-04-07  3:46         ` Barry Song
@ 2022-04-07 23:51           ` Yu Zhao
  -1 siblings, 0 replies; 120+ messages in thread
From: Yu Zhao @ 2022-04-07 23:51 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
	Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe,
	Jesse Barnes, Johannes Weiner, Jonathan Corbet, Matthew Wilcox,
	Mel Gorman, Michael Larabel, Michal Hocko, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh, Vaibhav Jain

On Wed, Apr 6, 2022 at 9:46 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Apr 7, 2022 at 3:04 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Wed, Apr 6, 2022 at 8:29 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Wed, Mar 9, 2022 at 3:48 PM Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > Searching the rmap for PTEs mapping each page on an LRU list (to test
> > > > and clear the accessed bit) can be expensive because pages from
> > > > different VMAs (PA space) are not cache friendly to the rmap (VA
> > > > space). For workloads mostly using mapped pages, the rmap has a high
> > > > CPU cost in the reclaim path.
> > > >
> > > > This patch exploits spatial locality to reduce the trips into the
> > > > rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
> > > > new function lru_gen_look_around() scans at most BITS_PER_LONG-1
> > > > adjacent PTEs. On finding another young PTE, it clears the accessed
> > > > bit and updates the gen counter of the page mapped by this PTE to
> > > > (max_seq%MAX_NR_GENS)+1.
> > >
> > > Hi Yu,
> > > It seems an interesting feature to save the cost of rmap. but will it lead to
> > > possible judging of cold pages as hot pages?
> > > In case a page is mapped by 20 processes,  and it has been accessed
> > > by 5 of them, when we look around one of the 5 processes, the page
> > > will be young and this pte is cleared. but we still have 4 ptes which are not
> > > cleared. then we don't access the page for a long time, but the 4 uncleared
> > > PTEs will still make the page "hot" since they are not cleared, we will find
> > > the page is hot either due to look-arounding the 4 processes or rmapping
> > > the page later?
> >
> > Why are the remaining 4 accessed PTEs skipped? The rmap should check
> > all the 20 PTEs.
>
> for example page A is the neighbour of page B in process 1, when we do rmap
> for B, we look-around and clear A's pte in process 1. but A's ptes are
> still set in
> process 2,3,4,5.

It makes no difference because the effect is too insignificant. The
goal is not to give several million pages unique timestamps and sort
them; it's to partition pages at a granularity on the order of a tenth
of a second to a few seconds and quickly find some reasonable
candidates. Temporal locality weakens exponentially over time. Even on
small systems, the difference is not measurable if several thousand
pages used in the last few seconds are chosen over another several
thousand pages used in the last minute.
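
The coarse partitioning can be sketched as below. This is a user-space
illustration, not the kernel implementation: the toy_* helper names are
hypothetical, and MAX_NR_GENS is assumed to be 4 for this sketch. The
gen counter formula is the one quoted from the patch description,
(max_seq%MAX_NR_GENS)+1.

```c
#include <assert.h>

#define MAX_NR_GENS 4U	/* assumed value for this sketch */

/* Map a sequence number onto a ring of MAX_NR_GENS generation slots. */
static unsigned int toy_gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}

/* Gen counter as quoted in the patch description. */
static unsigned int toy_gen_counter(unsigned long max_seq)
{
	return toy_gen_from_seq(max_seq) + 1;
}

/*
 * Age of a page in whole generations: 0 means the youngest generation.
 * Pages in the same generation are not ordered against each other.
 */
static unsigned long toy_age(unsigned long max_seq, unsigned long page_seq)
{
	assert(page_seq <= max_seq);
	return max_seq - page_seq;
}
```

Two pages whose sequence numbers are equal compare as equally old here,
however far apart their last accesses were within that generation; only
the generation they fall into is tracked, not a per-page timestamp.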

end of thread, other threads:[~2022-04-07 23:53 UTC | newest]

Thread overview: 120+ messages
2022-03-09  2:12 [PATCH v9 00/14] Multi-Gen LRU Framework Yu Zhao
2022-03-09  2:12 ` Yu Zhao
2022-03-09  2:12 ` [PATCH v9 01/14] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao
2022-03-09  2:12   ` Yu Zhao
2022-03-11 10:55   ` Barry Song
2022-03-11 10:55     ` Barry Song
2022-03-11 22:57     ` Yu Zhao
2022-03-11 22:57       ` Yu Zhao
2022-03-09  2:12 ` [PATCH v9 02/14] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Yu Zhao
2022-03-09  2:12   ` Yu Zhao
2022-03-16 22:15   ` Barry Song
2022-03-16 22:15     ` Barry Song
2022-03-09  2:12 ` [PATCH v9 03/14] mm/vmscan.c: refactor shrink_node() Yu Zhao
2022-03-09  2:12   ` Yu Zhao
2022-03-18  1:15   ` Barry Song
2022-03-18  1:15     ` Barry Song
2022-03-09  2:12 ` [PATCH v9 04/14] Revert "include/linux/mm_inline.h: fold __update_lru_size() into its sole caller" Yu Zhao
2022-03-09  2:12   ` Yu Zhao
2022-03-09  2:12 ` [PATCH v9 05/14] mm: multi-gen LRU: groundwork Yu Zhao
2022-03-09  2:12   ` Yu Zhao
2022-03-14  8:08   ` Huang, Ying
2022-03-14  8:08     ` Huang, Ying
2022-03-14  9:30     ` Yu Zhao
2022-03-14  9:30       ` Yu Zhao
2022-03-15  0:34       ` Huang, Ying
2022-03-15  0:34         ` Huang, Ying
2022-03-15  0:50         ` Yu Zhao
2022-03-15  0:50           ` Yu Zhao
2022-03-21 18:58       ` Justin Forbes
2022-03-21 18:58         ` Justin Forbes
2022-03-21 19:17         ` Prarit Bhargava
2022-03-21 19:17           ` Prarit Bhargava
2022-03-22  4:52           ` Yu Zhao
2022-03-22  4:52             ` Yu Zhao
2022-03-16 23:25   ` Barry Song
2022-03-16 23:25     ` Barry Song
2022-03-21  9:04     ` Yu Zhao
2022-03-21  9:04       ` Yu Zhao
2022-03-21 11:47       ` Barry Song
2022-03-21 11:47         ` Barry Song
2022-03-09  2:12 ` [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation Yu Zhao
2022-03-09  2:12   ` Yu Zhao
2022-03-16  5:55   ` Huang, Ying
2022-03-16  5:55     ` Huang, Ying
2022-03-16  7:54     ` Yu Zhao
2022-03-16  7:54       ` Yu Zhao
2022-03-19  3:01   ` Barry Song
2022-03-19  3:01     ` Barry Song
2022-03-19  3:11     ` Yu Zhao
2022-03-19  3:11       ` Yu Zhao
2022-03-23  7:47       ` Barry Song
2022-03-23  7:47         ` Barry Song
2022-03-24  6:24         ` Yu Zhao
2022-03-24  6:24           ` Yu Zhao
2022-03-24  8:13           ` Barry Song
2022-03-24  8:13             ` Barry Song
2022-03-19 10:14   ` Barry Song
2022-03-19 10:14     ` Barry Song
2022-03-21 23:51     ` Yu Zhao
2022-03-21 23:51       ` Yu Zhao
2022-03-19 11:15   ` Barry Song
2022-03-19 11:15     ` Barry Song
2022-03-22  0:30     ` Yu Zhao
2022-03-22  0:30       ` Yu Zhao
2022-03-21 12:51   ` Aneesh Kumar K.V
2022-03-21 12:51     ` Aneesh Kumar K.V
2022-03-22  4:02     ` Yu Zhao
2022-03-22  4:02       ` Yu Zhao
2022-03-21 13:01   ` Aneesh Kumar K.V
2022-03-21 13:01     ` Aneesh Kumar K.V
2022-03-22  4:39     ` Yu Zhao
2022-03-22  4:39       ` Yu Zhao
2022-03-22  5:26   ` Aneesh Kumar K.V
2022-03-22  5:26     ` Aneesh Kumar K.V
2022-03-22  5:55     ` Yu Zhao
2022-03-22  5:55       ` Yu Zhao
2022-03-09  2:12 ` [PATCH v9 07/14] mm: multi-gen LRU: exploit locality in rmap Yu Zhao
2022-03-09  2:12   ` Yu Zhao
2022-04-07  2:29   ` Barry Song
2022-04-07  2:29     ` Barry Song
2022-04-07  3:04     ` Yu Zhao
2022-04-07  3:04       ` Yu Zhao
2022-04-07  3:46       ` Barry Song
2022-04-07  3:46         ` Barry Song
2022-04-07 23:51         ` Yu Zhao
2022-04-07 23:51           ` Yu Zhao
2022-03-09  2:12 ` [PATCH v9 08/14] mm: multi-gen LRU: support page table walks Yu Zhao
2022-03-09  2:12   ` Yu Zhao
2022-03-09  2:12 ` [PATCH v9 09/14] mm: multi-gen LRU: optimize multiple memcgs Yu Zhao
2022-03-09  2:12   ` Yu Zhao
2022-03-09  2:12 ` [PATCH v9 10/14] mm: multi-gen LRU: kill switch Yu Zhao
2022-03-09  2:12   ` Yu Zhao
2022-03-22  7:47   ` Barry Song
2022-03-22  7:47     ` Barry Song
2022-03-22  8:20     ` Yu Zhao
2022-03-22  8:20       ` Yu Zhao
2022-03-22  8:45       ` Barry Song
2022-03-22  8:45         ` Barry Song
2022-03-22  9:00         ` Yu Zhao
2022-03-22  9:00           ` Yu Zhao
2022-03-09  2:12 ` [PATCH v9 11/14] mm: multi-gen LRU: thrashing prevention Yu Zhao
2022-03-09  2:12   ` Yu Zhao
2022-03-22  7:22   ` Barry Song
2022-03-22  7:22     ` Barry Song
2022-03-22  8:14     ` Yu Zhao
2022-03-22  8:14       ` Yu Zhao
2022-03-09  2:12 ` [PATCH v9 12/14] mm: multi-gen LRU: debugfs interface Yu Zhao
2022-03-09  2:12   ` Yu Zhao
2022-03-09  2:12 ` [PATCH v9 13/14] mm: multi-gen LRU: admin guide Yu Zhao
2022-03-09  2:12   ` Yu Zhao
2022-03-10 12:29   ` Mike Rapoport
2022-03-10 12:29     ` Mike Rapoport
2022-03-11  0:37     ` Yu Zhao
2022-03-11  0:37       ` Yu Zhao
2022-03-09  2:12 ` [PATCH v9 14/14] mm: multi-gen LRU: design doc Yu Zhao
2022-03-09  2:12   ` Yu Zhao
2022-03-11  8:22   ` Mike Rapoport
2022-03-11  8:22     ` Mike Rapoport
2022-03-11  9:38     ` Yu Zhao
2022-03-11  9:38       ` Yu Zhao
