* [PATCH v7 00/12] Multigenerational LRU Framework
@ 2022-02-08  8:18 Yu Zhao
  2022-02-08  8:18 ` [PATCH v7 01/12] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao
                   ` (14 more replies)
  0 siblings, 15 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Yu Zhao

What's new
==========
1) Addressed all the comments received on the mailing list and in the
   meeting with the stakeholders (noted on the individual patches).
2) Measured the performance improvements for each of patches 5-8
   (reported in their commit messages).

TLDR
====
The current page reclaim is too expensive in terms of CPU usage and it
often makes poor choices about what to evict. This patchset offers an
alternative solution that is performant, versatile and straightforward.

Patchset overview
=================
The design and implementation overview was moved to patch 12 so that
people can finish reading this cover letter.

1. mm: x86, arm64: add arch_has_hw_pte_young()
2. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
Using hardware optimizations when trying to clear the accessed bit in
many PTEs.

3. mm/vmscan.c: refactor shrink_node()
A minor refactor.

4. mm: multigenerational LRU: groundwork
Adding the basic data structure and the functions that insert/remove
pages to/from the multigenerational LRU (MGLRU) lists.

5. mm: multigenerational LRU: minimal implementation
A minimal (functional) implementation without any optimizations.

6. mm: multigenerational LRU: exploit locality in rmap
Improving the efficiency when using the rmap.

7. mm: multigenerational LRU: support page table walks
Adding the (optional) page table scanning.

8. mm: multigenerational LRU: optimize multiple memcgs
Optimizing the overall performance for multiple memcgs running mixed
types of workloads.

9. mm: multigenerational LRU: runtime switch
Adding a runtime switch to enable or disable MGLRU.

10. mm: multigenerational LRU: thrashing prevention
11. mm: multigenerational LRU: debugfs interface
Providing userspace with additional features like thrashing prevention,
working set estimation and proactive reclaim.

12. mm: multigenerational LRU: documentation
Adding a design doc and an admin guide.

Benchmark results
=================
Independent lab results
-----------------------
Based on the popularity of searches [01] and the memory usage in
Google's public cloud, the most popular open-source memory-hungry
applications, in alphabetical order, are:
      Apache Cassandra      Memcached
      Apache Hadoop         MongoDB
      Apache Spark          PostgreSQL
      MariaDB (MySQL)       Redis

An independent lab evaluated MGLRU with the most widely used benchmark
suites for the above applications. They posted 960 data points along
with kernel metrics and perf profiles collected over more than 500
hours of total benchmark time. Their final reports show that, with 95%
confidence intervals (CIs), the above applications all performed
significantly better for at least part of their benchmark matrices.

On 5.14:
1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
   less wall time to sort three billion random integers, respectively,
   under the medium- and the high-concurrency conditions, when
   overcommitting memory. There were no statistically significant
   changes in wall time for the rest of the benchmark matrix.
2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
   more transactions per minute (TPM), respectively, under the medium-
   and the high-concurrency conditions, when overcommitting memory.
   There were no statistically significant changes in TPM for the rest
   of the benchmark matrix.
3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
   and [21.59, 30.02]% more operations per second (OPS), respectively,
   for sequential access, random access and Gaussian (distribution)
   access, when THP=always; 95% CIs [13.85, 15.97]% and
   [23.94, 29.92]% more OPS, respectively, for random access and
   Gaussian access, when THP=never. There were no statistically
   significant changes in OPS for the rest of the benchmark matrix.
4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
   [2.16, 3.55]% more operations per second (OPS), respectively, for
   exponential (distribution) access, random access and Zipfian
   (distribution) access, when underutilizing memory; 95% CIs
   [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
   respectively, for exponential access, random access and Zipfian
   access, when overcommitting memory.

On 5.15:
5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
   and [4.11, 7.50]% more operations per second (OPS), respectively,
   for exponential (distribution) access, random access and Zipfian
   (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
   [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
   exponential access, random access and Zipfian access, when swap was
   on.
6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
   less average wall time to finish twelve parallel TeraSort jobs,
   respectively, under the medium- and the high-concurrency
   conditions, when swap was on. There were no statistically
   significant changes in average wall time for the rest of the
   benchmark matrix.
7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
   minute (TPM) under the high-concurrency condition, when swap was
   off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
   respectively, under the medium- and the high-concurrency
   conditions, when swap was on. There were no statistically
   significant changes in TPM for the rest of the benchmark matrix.
8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
   [11.47, 19.36]% more total operations per second (OPS),
   respectively, for sequential access, random access and Gaussian
   (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
   [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
   for sequential access, random access and Gaussian access, when
   THP=never.

Our lab results
---------------
To supplement the above results, we ran the following benchmark suites
on 5.16-rc7 and found no regressions [10]. (These synthetic benchmarks
are popular among MM developers, but we prefer large-scale A/B
experiments to validate improvements.)
      fs_fio_bench_hdd_mq      pft
      fs_lmbench               pgsql-hammerdb
      fs_parallelio            redis
      fs_postmark              stream
      hackbench                sysbenchthread
      kernbench                tpcc_spark
      memcached                unixbench
      multichase               vm-scalability
      mutilate                 will-it-scale
      nginx

[01] https://trends.google.com
[02] https://lore.kernel.org/lkml/20211102002002.92051-1-bot@edi.works/
[03] https://lore.kernel.org/lkml/20211009054315.47073-1-bot@edi.works/
[04] https://lore.kernel.org/lkml/20211021194103.65648-1-bot@edi.works/
[05] https://lore.kernel.org/lkml/20211109021346.50266-1-bot@edi.works/
[06] https://lore.kernel.org/lkml/20211202062806.80365-1-bot@edi.works/
[07] https://lore.kernel.org/lkml/20211209072416.33606-1-bot@edi.works/
[08] https://lore.kernel.org/lkml/20211218071041.24077-1-bot@edi.works/
[09] https://lore.kernel.org/lkml/20211122053248.57311-1-bot@edi.works/
[10] https://lore.kernel.org/lkml/20220104202247.2903702-1-yuzhao@google.com/

Real-world applications
=======================
Third-party testimonials
------------------------
Konstantin wrote [11]:
   I have Archlinux with 8G RAM + zswap + swap. While developing, I
   have lots of apps opened such as multiple LSP-servers for different
   langs, chats, two browsers, etc... Usually, my system gets quickly
   to a point of SWAP-storms, where I have to kill LSP-servers,
   restart browsers to free memory, etc, otherwise the system lags
   heavily and is barely usable.
   
   1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
   patchset, and I started up by opening lots of apps to create memory
   pressure, and worked for a day like this. Till now I had *not a
   single SWAP-storm*, and mind you I got 3.4G in SWAP. I was never
   getting to the point of 3G in SWAP before without a single
   SWAP-storm.

An anonymous user wrote [12]:
   Using that v5 for some time and confirm that difference under heavy
   load and memory pressure is significant.

Shuang wrote [13]:
   With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
   and [9.26, 10.36]% higher throughput, respectively, for random
   access, Zipfian (distribution) access and Gaussian (distribution)
   access, when the average number of jobs per CPU is 1; 95% CIs
   [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher throughput,
   respectively, for random access, Zipfian access and Gaussian access,
   when the average number of jobs per CPU is 2.

Daniel wrote [14]:
   With memcached allocating ~100GB of byte-addressable Optane,
   performance improvement in terms of throughput (measured as queries
   per second) was about 10% for a series of workloads.

Large-scale deployments
-----------------------
The downstream kernels that have been using MGLRU include:
1. Android ARCVM [15]
2. Arch Linux Zen [16]
3. Chrome OS [17]
4. Liquorix [18]
5. post-factum [19]
6. XanMod [20]

We've rolled out MGLRU to tens of millions of Chrome OS users and
about a million Android users. Google's fleetwide profiling [21] shows
an overall 40% decrease in kswapd CPU usage, in addition to
improvements in other UX metrics, e.g., an 85% decrease in the number
of low-memory kills at the 75th percentile and an 18% decrease in
rendering latency at the 50th percentile.

[11] https://lore.kernel.org/lkml/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
[12] https://phoronix.com/forums/forum/software/general-linux-open-source/1301258-mglru-is-a-very-enticing-enhancement-for-linux-in-2022?p=1301275#post1301275
[13] https://lore.kernel.org/lkml/20220105024423.26409-1-szhai2@cs.rochester.edu/
[14] https://lore.kernel.org/linux-mm/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
[15] https://chromium.googlesource.com/chromiumos/third_party/kernel
[16] https://archlinux.org
[17] https://chromium.org
[18] https://liquorix.net
[19] https://gitlab.com/post-factum/pf-kernel
[20] https://xanmod.org
[21] https://research.google/pubs/pub44271/

Summary
=======
The facts are:
1. The independent lab results and the real-world applications
   indicate substantial improvements; there are no known regressions.
2. Thrashing prevention, working set estimation and proactive reclaim
   work out of the box; there are no equivalent solutions.
3. There is a lot of new code; nobody has demonstrated smaller changes
   with similar effects.

Our conclusions, accordingly, are:
1. Given the amount of evidence, the reported improvements will likely
   materialize for a wide range of workloads.
2. Gauging the interest from the past discussions [22][23][24], the
   new features will likely be put to use for both personal computers
   and data centers.
3. Based on Google's track record, the new code will likely be well
   maintained in the long term. It'd be more difficult, if not
   impossible, to achieve similar effects on top of the existing
   design.

[22] https://lore.kernel.org/lkml/20201005081313.732745-1-andrea.righi@canonical.com/
[23] https://lore.kernel.org/lkml/20210716081449.22187-1-sj38.park@gmail.com/
[24] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/

Yu Zhao (12):
  mm: x86, arm64: add arch_has_hw_pte_young()
  mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
  mm/vmscan.c: refactor shrink_node()
  mm: multigenerational LRU: groundwork
  mm: multigenerational LRU: minimal implementation
  mm: multigenerational LRU: exploit locality in rmap
  mm: multigenerational LRU: support page table walks
  mm: multigenerational LRU: optimize multiple memcgs
  mm: multigenerational LRU: runtime switch
  mm: multigenerational LRU: thrashing prevention
  mm: multigenerational LRU: debugfs interface
  mm: multigenerational LRU: documentation

 Documentation/admin-guide/mm/index.rst        |    1 +
 Documentation/admin-guide/mm/multigen_lru.rst |  121 +
 Documentation/vm/index.rst                    |    1 +
 Documentation/vm/multigen_lru.rst             |  152 +
 arch/Kconfig                                  |    9 +
 arch/arm64/include/asm/pgtable.h              |   14 +-
 arch/x86/Kconfig                              |    1 +
 arch/x86/include/asm/pgtable.h                |    9 +-
 arch/x86/mm/pgtable.c                         |    5 +-
 fs/exec.c                                     |    2 +
 fs/fuse/dev.c                                 |    3 +-
 include/linux/cgroup.h                        |   15 +-
 include/linux/memcontrol.h                    |   36 +
 include/linux/mm.h                            |    8 +
 include/linux/mm_inline.h                     |  214 ++
 include/linux/mm_types.h                      |   78 +
 include/linux/mmzone.h                        |  182 ++
 include/linux/nodemask.h                      |    1 +
 include/linux/page-flags-layout.h             |   19 +-
 include/linux/page-flags.h                    |    4 +-
 include/linux/pgtable.h                       |   17 +-
 include/linux/sched.h                         |    4 +
 include/linux/swap.h                          |    5 +
 kernel/bounds.c                               |    3 +
 kernel/cgroup/cgroup-internal.h               |    1 -
 kernel/exit.c                                 |    1 +
 kernel/fork.c                                 |    9 +
 kernel/sched/core.c                           |    1 +
 mm/Kconfig                                    |   50 +
 mm/huge_memory.c                              |    3 +-
 mm/memcontrol.c                               |   27 +
 mm/memory.c                                   |   39 +-
 mm/mm_init.c                                  |    6 +-
 mm/page_alloc.c                               |    1 +
 mm/rmap.c                                     |    7 +
 mm/swap.c                                     |   55 +-
 mm/vmscan.c                                   | 2831 ++++++++++++++++-
 mm/workingset.c                               |  119 +-
 38 files changed, 3908 insertions(+), 146 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
 create mode 100644 Documentation/vm/multigen_lru.rst

-- 
2.35.0.263.gb82422642f-goog




* [PATCH v7 01/12] mm: x86, arm64: add arch_has_hw_pte_young()
  2022-02-08  8:18 [PATCH v7 00/12] Multigenerational LRU Framework Yu Zhao
@ 2022-02-08  8:18 ` Yu Zhao
  2022-02-08  8:24   ` Yu Zhao
  2022-02-08 10:33   ` Will Deacon
  2022-02-08  8:18 ` [PATCH v7 02/12] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Yu Zhao
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Yu Zhao, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

Some architectures automatically set the accessed bit in PTEs, e.g.,
x86 and arm64 v8.2. On architectures that don't have this capability,
clearing the accessed bit in a PTE usually triggers a page fault
following the TLB miss of this PTE (to emulate the accessed bit).

Being aware of this capability can help make better decisions, e.g.,
whether to spread the work out over a period of time to reduce bursty
page faults when trying to clear the accessed bit in many PTEs.

Note that theoretically this capability can be unreliable, e.g.,
hotplugged CPUs might differ from built-in ones. Therefore, it
shouldn't be used in architecture-independent code that involves
correctness, e.g., to determine whether TLB flushes are required (in
combination with the accessed bit).
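
For illustration, a minimal sketch of how a caller might use this
capability (the helper and its batch sizes are hypothetical, not part
of this patch):

/*
 * Hypothetical sketch: a walker that clears the accessed bit in many
 * PTEs could work in smaller batches when the CPU can't set the bit in
 * hardware, since each cleared PTE then costs an extra minor fault on
 * the next access. The batch sizes are made up for illustration.
 */
static unsigned long pte_clear_young_batch(void)
{
	return arch_has_hw_pte_young() ? 512 : 64;
}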

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
---
 arch/arm64/include/asm/pgtable.h | 14 ++------------
 arch/x86/include/asm/pgtable.h   |  6 +++---
 include/linux/pgtable.h          | 13 +++++++++++++
 mm/memory.c                      | 14 +-------------
 4 files changed, 19 insertions(+), 28 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index c4ba047a82d2..990358eca359 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -999,23 +999,13 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
  * page after fork() + CoW for pfn mappings. We don't always have a
  * hardware-managed access flag on arm64.
  */
-static inline bool arch_faults_on_old_pte(void)
-{
-	WARN_ON(preemptible());
-
-	return !cpu_has_hw_af();
-}
-#define arch_faults_on_old_pte		arch_faults_on_old_pte
+#define arch_has_hw_pte_young		cpu_has_hw_af
 
 /*
  * Experimentally, it's cheap to set the access flag in hardware and we
  * benefit from prefaulting mappings as 'old' to start with.
  */
-static inline bool arch_wants_old_prefaulted_pte(void)
-{
-	return !arch_faults_on_old_pte();
-}
-#define arch_wants_old_prefaulted_pte	arch_wants_old_prefaulted_pte
+#define arch_wants_old_prefaulted_pte	cpu_has_hw_af
 
 static inline pgprot_t arch_filter_pgprot(pgprot_t prot)
 {
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 8a9432fb3802..60b6ce45c2e3 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1423,10 +1423,10 @@ static inline bool arch_has_pfn_modify_check(void)
 	return boot_cpu_has_bug(X86_BUG_L1TF);
 }
 
-#define arch_faults_on_old_pte arch_faults_on_old_pte
-static inline bool arch_faults_on_old_pte(void)
+#define arch_has_hw_pte_young arch_has_hw_pte_young
+static inline bool arch_has_hw_pte_young(void)
 {
-	return false;
+	return true;
 }
 
 #endif	/* __ASSEMBLY__ */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f4f4077b97aa..c799635f4d79 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -259,6 +259,19 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
+#ifndef arch_has_hw_pte_young
+/*
+ * Return whether the accessed bit is supported by the local CPU.
+ *
+ * This stub assumes accessing through an old PTE triggers a page fault.
+ * Architectures that automatically set the access bit should overwrite it.
+ */
+static inline bool arch_has_hw_pte_young(void)
+{
+	return false;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_CLEAR
 static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
 			      pte_t *ptep)
diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..a7379196a47e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -122,18 +122,6 @@ int randomize_va_space __read_mostly =
 					2;
 #endif
 
-#ifndef arch_faults_on_old_pte
-static inline bool arch_faults_on_old_pte(void)
-{
-	/*
-	 * Those arches which don't have hw access flag feature need to
-	 * implement their own helper. By default, "true" means pagefault
-	 * will be hit on old pte.
-	 */
-	return true;
-}
-#endif
-
 #ifndef arch_wants_old_prefaulted_pte
 static inline bool arch_wants_old_prefaulted_pte(void)
 {
@@ -2778,7 +2766,7 @@ static inline bool cow_user_page(struct page *dst, struct page *src,
 	 * On architectures with software "accessed" bits, we would
 	 * take a double page fault, so mark it accessed here.
 	 */
-	if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
+	if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) {
 		pte_t entry;
 
 		vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
-- 
2.35.0.263.gb82422642f-goog




* [PATCH v7 02/12] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
  2022-02-08  8:18 [PATCH v7 00/12] Multigenerational LRU Framework Yu Zhao
  2022-02-08  8:18 ` [PATCH v7 01/12] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao
@ 2022-02-08  8:18 ` Yu Zhao
  2022-02-08  8:27   ` Yu Zhao
  2022-02-08  8:18 ` [PATCH v7 03/12] mm/vmscan.c: refactor shrink_node() Yu Zhao
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Yu Zhao, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

Some architectures support the accessed bit in non-leaf PMD entries,
e.g., x86 sets the accessed bit in a non-leaf PMD entry when using it
as part of linear address translation [1]. Page table walkers that
clear the accessed bit may use this capability to reduce their search
space.
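
For illustration, a sketch of how such a walker might use this hint (a
hypothetical function, not part of this patch):

/*
 * Hypothetical sketch: skip the PTEs under a non-leaf PMD whose
 * accessed bit is clear; clearing the bit when it is set rearms the
 * hint for the next walk.
 */
static bool pmd_range_worth_scanning(struct vm_area_struct *vma,
				     unsigned long addr, pmd_t *pmd)
{
	if (!IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG))
		return true;	/* no hardware hint; scan unconditionally */

	return pmdp_test_and_clear_young(vma, addr, pmd);
}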

Note that:
1. Although an inline function is preferable, this capability is added
   as a configuration option for consistency with the existing macros.
2. Due to the limited interest in other varieties, this capability was
   only tested on Intel and AMD CPUs.

[1]: Intel 64 and IA-32 Architectures Software Developer's Manual
     Volume 3 (June 2021), section 4.8

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
---
 arch/Kconfig                   | 9 +++++++++
 arch/x86/Kconfig               | 1 +
 arch/x86/include/asm/pgtable.h | 3 ++-
 arch/x86/mm/pgtable.c          | 5 ++++-
 include/linux/pgtable.h        | 4 ++--
 5 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 678a80713b21..f9c59ecadbbb 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1322,6 +1322,15 @@ config DYNAMIC_SIGFRAME
 config HAVE_ARCH_NODE_DEV_GROUP
 	bool
 
+config ARCH_HAS_NONLEAF_PMD_YOUNG
+	bool
+	depends on PGTABLE_LEVELS > 2
+	help
+	  Architectures that select this option are capable of setting the
+	  accessed bit in non-leaf PMD entries when using them as part of linear
+	  address translations. Page table walkers that clear the accessed bit
+	  may use this capability to reduce their search space.
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9f5bd41bf660..e787b7fc75be 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -85,6 +85,7 @@ config X86
 	select ARCH_HAS_PMEM_API		if X86_64
 	select ARCH_HAS_PTE_DEVMAP		if X86_64
 	select ARCH_HAS_PTE_SPECIAL
+	select ARCH_HAS_NONLEAF_PMD_YOUNG
 	select ARCH_HAS_UACCESS_FLUSHCACHE	if X86_64
 	select ARCH_HAS_COPY_MC			if X86_64
 	select ARCH_HAS_SET_MEMORY
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 60b6ce45c2e3..f973788f6b21 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -819,7 +819,8 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
 
 static inline int pmd_bad(pmd_t pmd)
 {
-	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
+	return (pmd_flags(pmd) & ~(_PAGE_USER | _PAGE_ACCESSED)) !=
+	       (_KERNPG_TABLE & ~_PAGE_ACCESSED);
 }
 
 static inline unsigned long pages_to_mb(unsigned long npg)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 3481b35cb4ec..a224193d84bf 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -550,7 +550,7 @@ int ptep_test_and_clear_young(struct vm_area_struct *vma,
 	return ret;
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
 int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 			      unsigned long addr, pmd_t *pmdp)
 {
@@ -562,6 +562,9 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 
 	return ret;
 }
+#endif
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int pudp_test_and_clear_young(struct vm_area_struct *vma,
 			      unsigned long addr, pud_t *pudp)
 {
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index c799635f4d79..30cf0d19cbdb 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -212,7 +212,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 #endif
 
 #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
 static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 					    unsigned long address,
 					    pmd_t *pmdp)
@@ -233,7 +233,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 	BUILD_BUG();
 	return 0;
 }
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
-- 
2.35.0.263.gb82422642f-goog




* [PATCH v7 03/12] mm/vmscan.c: refactor shrink_node()
  2022-02-08  8:18 [PATCH v7 00/12] Multigenerational LRU Framework Yu Zhao
  2022-02-08  8:18 ` [PATCH v7 01/12] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao
  2022-02-08  8:18 ` [PATCH v7 02/12] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Yu Zhao
@ 2022-02-08  8:18 ` Yu Zhao
  2022-02-08  8:18 ` [PATCH v7 04/12] mm: multigenerational LRU: groundwork Yu Zhao
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Yu Zhao, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

This patch refactors shrink_node() to improve readability for the
upcoming changes to mm/vmscan.c.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
---
 mm/vmscan.c | 198 +++++++++++++++++++++++++++-------------------------
 1 file changed, 104 insertions(+), 94 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 090bfb605ecf..b7228b73e1b3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2716,6 +2716,109 @@ enum scan_balance {
 	SCAN_FILE,
 };
 
+static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
+{
+	unsigned long file;
+	struct lruvec *target_lruvec;
+
+	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+
+	/*
+	 * Flush the memory cgroup stats, so that we read accurate per-memcg
+	 * lruvec stats for heuristics.
+	 */
+	mem_cgroup_flush_stats();
+
+	/*
+	 * Determine the scan balance between anon and file LRUs.
+	 */
+	spin_lock_irq(&target_lruvec->lru_lock);
+	sc->anon_cost = target_lruvec->anon_cost;
+	sc->file_cost = target_lruvec->file_cost;
+	spin_unlock_irq(&target_lruvec->lru_lock);
+
+	/*
+	 * Target desirable inactive:active list ratios for the anon
+	 * and file LRU lists.
+	 */
+	if (!sc->force_deactivate) {
+		unsigned long refaults;
+
+		refaults = lruvec_page_state(target_lruvec,
+				WORKINGSET_ACTIVATE_ANON);
+		if (refaults != target_lruvec->refaults[0] ||
+			inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
+			sc->may_deactivate |= DEACTIVATE_ANON;
+		else
+			sc->may_deactivate &= ~DEACTIVATE_ANON;
+
+		/*
+		 * When refaults are being observed, it means a new
+		 * workingset is being established. Deactivate to get
+		 * rid of any stale active pages quickly.
+		 */
+		refaults = lruvec_page_state(target_lruvec,
+				WORKINGSET_ACTIVATE_FILE);
+		if (refaults != target_lruvec->refaults[1] ||
+		    inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
+			sc->may_deactivate |= DEACTIVATE_FILE;
+		else
+			sc->may_deactivate &= ~DEACTIVATE_FILE;
+	} else
+		sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
+
+	/*
+	 * If we have plenty of inactive file pages that aren't
+	 * thrashing, try to reclaim those first before touching
+	 * anonymous pages.
+	 */
+	file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
+	if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
+		sc->cache_trim_mode = 1;
+	else
+		sc->cache_trim_mode = 0;
+
+	/*
+	 * Prevent the reclaimer from falling into the cache trap: as
+	 * cache pages start out inactive, every cache fault will tip
+	 * the scan balance towards the file LRU.  And as the file LRU
+	 * shrinks, so does the window for rotation from references.
+	 * This means we have a runaway feedback loop where a tiny
+	 * thrashing file LRU becomes infinitely more attractive than
+	 * anon pages.  Try to detect this based on file LRU size.
+	 */
+	if (!cgroup_reclaim(sc)) {
+		unsigned long total_high_wmark = 0;
+		unsigned long free, anon;
+		int z;
+
+		free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
+		file = node_page_state(pgdat, NR_ACTIVE_FILE) +
+			   node_page_state(pgdat, NR_INACTIVE_FILE);
+
+		for (z = 0; z < MAX_NR_ZONES; z++) {
+			struct zone *zone = &pgdat->node_zones[z];
+
+			if (!managed_zone(zone))
+				continue;
+
+			total_high_wmark += high_wmark_pages(zone);
+		}
+
+		/*
+		 * Consider anon: if that's low too, this isn't a
+		 * runaway file reclaim problem, but rather just
+		 * extreme pressure. Reclaim as per usual then.
+		 */
+		anon = node_page_state(pgdat, NR_INACTIVE_ANON);
+
+		sc->file_is_tiny =
+			file + free <= total_high_wmark &&
+			!(sc->may_deactivate & DEACTIVATE_ANON) &&
+			anon >> sc->priority;
+	}
+}
+
 /*
  * Determine how aggressively the anon and file LRU lists should be
  * scanned.  The relative value of each set of LRU lists is determined
@@ -3186,109 +3289,16 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	unsigned long nr_reclaimed, nr_scanned;
 	struct lruvec *target_lruvec;
 	bool reclaimable = false;
-	unsigned long file;
 
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
 
 again:
-	/*
-	 * Flush the memory cgroup stats, so that we read accurate per-memcg
-	 * lruvec stats for heuristics.
-	 */
-	mem_cgroup_flush_stats();
-
 	memset(&sc->nr, 0, sizeof(sc->nr));
 
 	nr_reclaimed = sc->nr_reclaimed;
 	nr_scanned = sc->nr_scanned;
 
-	/*
-	 * Determine the scan balance between anon and file LRUs.
-	 */
-	spin_lock_irq(&target_lruvec->lru_lock);
-	sc->anon_cost = target_lruvec->anon_cost;
-	sc->file_cost = target_lruvec->file_cost;
-	spin_unlock_irq(&target_lruvec->lru_lock);
-
-	/*
-	 * Target desirable inactive:active list ratios for the anon
-	 * and file LRU lists.
-	 */
-	if (!sc->force_deactivate) {
-		unsigned long refaults;
-
-		refaults = lruvec_page_state(target_lruvec,
-				WORKINGSET_ACTIVATE_ANON);
-		if (refaults != target_lruvec->refaults[0] ||
-			inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
-			sc->may_deactivate |= DEACTIVATE_ANON;
-		else
-			sc->may_deactivate &= ~DEACTIVATE_ANON;
-
-		/*
-		 * When refaults are being observed, it means a new
-		 * workingset is being established. Deactivate to get
-		 * rid of any stale active pages quickly.
-		 */
-		refaults = lruvec_page_state(target_lruvec,
-				WORKINGSET_ACTIVATE_FILE);
-		if (refaults != target_lruvec->refaults[1] ||
-		    inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
-			sc->may_deactivate |= DEACTIVATE_FILE;
-		else
-			sc->may_deactivate &= ~DEACTIVATE_FILE;
-	} else
-		sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
-
-	/*
-	 * If we have plenty of inactive file pages that aren't
-	 * thrashing, try to reclaim those first before touching
-	 * anonymous pages.
-	 */
-	file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
-	if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
-		sc->cache_trim_mode = 1;
-	else
-		sc->cache_trim_mode = 0;
-
-	/*
-	 * Prevent the reclaimer from falling into the cache trap: as
-	 * cache pages start out inactive, every cache fault will tip
-	 * the scan balance towards the file LRU.  And as the file LRU
-	 * shrinks, so does the window for rotation from references.
-	 * This means we have a runaway feedback loop where a tiny
-	 * thrashing file LRU becomes infinitely more attractive than
-	 * anon pages.  Try to detect this based on file LRU size.
-	 */
-	if (!cgroup_reclaim(sc)) {
-		unsigned long total_high_wmark = 0;
-		unsigned long free, anon;
-		int z;
-
-		free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
-		file = node_page_state(pgdat, NR_ACTIVE_FILE) +
-			   node_page_state(pgdat, NR_INACTIVE_FILE);
-
-		for (z = 0; z < MAX_NR_ZONES; z++) {
-			struct zone *zone = &pgdat->node_zones[z];
-			if (!managed_zone(zone))
-				continue;
-
-			total_high_wmark += high_wmark_pages(zone);
-		}
-
-		/*
-		 * Consider anon: if that's low too, this isn't a
-		 * runaway file reclaim problem, but rather just
-		 * extreme pressure. Reclaim as per usual then.
-		 */
-		anon = node_page_state(pgdat, NR_INACTIVE_ANON);
-
-		sc->file_is_tiny =
-			file + free <= total_high_wmark &&
-			!(sc->may_deactivate & DEACTIVATE_ANON) &&
-			anon >> sc->priority;
-	}
+	prepare_scan_count(pgdat, sc);
 
 	shrink_node_memcgs(pgdat, sc);
 
-- 
2.35.0.263.gb82422642f-goog




* [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-02-08  8:18 [PATCH v7 00/12] Multigenerational LRU Framework Yu Zhao
                   ` (2 preceding siblings ...)
  2022-02-08  8:18 ` [PATCH v7 03/12] mm/vmscan.c: refactor shrink_node() Yu Zhao
@ 2022-02-08  8:18 ` Yu Zhao
  2022-02-08  8:28   ` Yu Zhao
                     ` (2 more replies)
  2022-02-08  8:18 ` [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation Yu Zhao
                   ` (10 subsequent siblings)
  14 siblings, 3 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Yu Zhao, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they're aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in folio->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores (seq%MAX_NR_GENS)+1
while a page is on one of lrugen->lists[]. Otherwise it stores 0.
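
For illustration, a sketch of decoding this counter (the helper below
is hypothetical; LRU_GEN_MASK and LRU_GEN_PGOFF come from this patch,
and the example assumes MAX_NR_GENS == 4):

/*
 * Hypothetical decoding sketch. With, e.g., MAX_NR_GENS == 4:
 *   seq == 7 -> gen == 3, the gen counter in folio->flags stores 4;
 *   seq == 5 -> gen == 1, the gen counter in folio->flags stores 2;
 *   a gen counter of 0 means the page is not on lrugen->lists[].
 */
static inline int folio_lru_gen(struct folio *folio)
{
	unsigned long flags = READ_ONCE(folio->flags);

	/* -1 means the page is not on any lrugen->lists[] */
	return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
}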

There are two conceptually independent processes (as in the
manufacturing process): "the aging", which produces young generations,
and "the eviction", which consumes old generations. They form a
closed-loop system, i.e., "the page reclaim". Both processes can be
invoked from userspace for the purposes of working set estimation and
proactive reclaim. These features are required to optimize job
scheduling (bin packing) in data centers. The variable size of the
sliding window is designed for such use cases [1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the
multigenerational LRU, as a new convention; the terms "active" and
"inactive" will be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1) The uncertainty in determining the access patterns of the former
   channel is higher due to the approximation of the accessed bit.
2) The cost of evicting the former channel is higher due to the TLB
   flushes required and the likelihood of encountering the dirty bit.
3) The penalty of underprotecting the former channel is higher because
   applications usually don't prepare themselves for major page faults
   like they do for blocked I/O. E.g., GUI applications commonly use
   dedicated I/O threads to avoid blocking the rendering threads.
There are also two access patterns: one with temporal locality and the
other without. For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or
VM_RAND_READ is present, and the latter channel is assumed to follow
the latter pattern unless outlying refaults have been observed.

The next patch will address the "outlying refaults". A few macros
used there, i.e., LRU_REFS_*, are added in this patch to keep the
later diffs smaller.

A page is added to the youngest generation on faulting. The aging
needs to check the accessed bit at least twice before handing this
page over to the eviction. The first check takes care of the accessed
bit set on the initial fault; the second check makes sure this page
hasn't been used since then. This process, AKA second chance, requires
a minimum of two generations, hence MIN_NR_GENS.

[1] https://research.google/pubs/pub48551/
[2] https://www.cs.cmu.edu/~dskarlat/publications/tmo_asplos22.pdf

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
---
 fs/fuse/dev.c                     |   3 +-
 include/linux/mm.h                |   2 +
 include/linux/mm_inline.h         | 191 ++++++++++++++++++++++++++++++
 include/linux/mmzone.h            |  76 ++++++++++++
 include/linux/page-flags-layout.h |  19 ++-
 include/linux/page-flags.h        |   4 +-
 include/linux/sched.h             |   4 +
 kernel/bounds.c                   |   3 +
 mm/huge_memory.c                  |   3 +-
 mm/memcontrol.c                   |   2 +
 mm/memory.c                       |  25 ++++
 mm/mm_init.c                      |   6 +-
 mm/page_alloc.c                   |   1 +
 mm/swap.c                         |   5 +
 mm/vmscan.c                       |  85 +++++++++++++
 15 files changed, 418 insertions(+), 11 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index cd54a529460d..769139a8be86 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -785,7 +785,8 @@ static int fuse_check_page(struct page *page)
 	       1 << PG_active |
 	       1 << PG_workingset |
 	       1 << PG_reclaim |
-	       1 << PG_waiters))) {
+	       1 << PG_waiters |
+	       LRU_GEN_MASK | LRU_REFS_MASK))) {
 		dump_page(page, "fuse: trying to steal weird page");
 		return 1;
 	}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 213cc569b192..05dd33265740 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1032,6 +1032,8 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
 #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
 #define LAST_CPUPID_PGOFF	(ZONES_PGOFF - LAST_CPUPID_WIDTH)
 #define KASAN_TAG_PGOFF		(LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH)
+#define LRU_GEN_PGOFF		(KASAN_TAG_PGOFF - LRU_GEN_WIDTH)
+#define LRU_REFS_PGOFF		(LRU_GEN_PGOFF - LRU_REFS_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index b725839dfe71..46f4fde0299f 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -92,11 +92,196 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio)
 	return lru;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+static inline bool lru_gen_enabled(void)
+{
+	return true;
+}
+
+static inline bool lru_gen_in_fault(void)
+{
+	return current->in_lru_fault;
+}
+
+static inline int lru_gen_from_seq(unsigned long seq)
+{
+	return seq % MAX_NR_GENS;
+}
+
+static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
+{
+	unsigned long max_seq = lruvec->lrugen.max_seq;
+
+	VM_BUG_ON(gen >= MAX_NR_GENS);
+
+	/* see the comment on MIN_NR_GENS */
+	return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
+}
+
+static inline void lru_gen_update_size(struct lruvec *lruvec, enum lru_list lru,
+				       int zone, long delta)
+{
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+	lockdep_assert_held(&lruvec->lru_lock);
+	WARN_ON_ONCE(delta != (int)delta);
+
+	__mod_lruvec_state(lruvec, NR_LRU_BASE + lru, delta);
+	__mod_zone_page_state(&pgdat->node_zones[zone], NR_ZONE_LRU_BASE + lru, delta);
+}
+
+static inline void lru_gen_balance_size(struct lruvec *lruvec, struct folio *folio,
+					int old_gen, int new_gen)
+{
+	int type = folio_is_file_lru(folio);
+	int zone = folio_zonenum(folio);
+	int delta = folio_nr_pages(folio);
+	enum lru_list lru = type * LRU_INACTIVE_FILE;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON(old_gen != -1 && old_gen >= MAX_NR_GENS);
+	VM_BUG_ON(new_gen != -1 && new_gen >= MAX_NR_GENS);
+	VM_BUG_ON(old_gen == -1 && new_gen == -1);
+
+	if (old_gen >= 0)
+		WRITE_ONCE(lrugen->nr_pages[old_gen][type][zone],
+			   lrugen->nr_pages[old_gen][type][zone] - delta);
+	if (new_gen >= 0)
+		WRITE_ONCE(lrugen->nr_pages[new_gen][type][zone],
+			   lrugen->nr_pages[new_gen][type][zone] + delta);
+
+	if (old_gen < 0) {
+		if (lru_gen_is_active(lruvec, new_gen))
+			lru += LRU_ACTIVE;
+		lru_gen_update_size(lruvec, lru, zone, delta);
+		return;
+	}
+
+	if (new_gen < 0) {
+		if (lru_gen_is_active(lruvec, old_gen))
+			lru += LRU_ACTIVE;
+		lru_gen_update_size(lruvec, lru, zone, -delta);
+		return;
+	}
+
+	if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) {
+		lru_gen_update_size(lruvec, lru, zone, -delta);
+		lru_gen_update_size(lruvec, lru + LRU_ACTIVE, zone, delta);
+	}
+
+	/* Promotion is legit while a page is on an LRU list, but demotion isn't. */
+	VM_BUG_ON(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen));
+}
+
+static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+	int gen;
+	unsigned long old_flags, new_flags;
+	int type = folio_is_file_lru(folio);
+	int zone = folio_zonenum(folio);
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	if (folio_test_unevictable(folio) || !lrugen->enabled)
+		return false;
+	/*
+	 * There are three common cases for this page:
+	 * 1) If it shouldn't be evicted, e.g., it was just faulted in, add it
+	 *    to the youngest generation.
+	 * 2) If it can't be evicted immediately, i.e., it's an anon page and
+	 *    not in swapcache, or a dirty page pending writeback, add it to the
+	 *    second oldest generation.
+	 * 3) If it may be evicted immediately, e.g., it's a clean page, add it
+	 *    to the oldest generation.
+	 */
+	if (folio_test_active(folio))
+		gen = lru_gen_from_seq(lrugen->max_seq);
+	else if ((!type && !folio_test_swapcache(folio)) ||
+		 (folio_test_reclaim(folio) &&
+		  (folio_test_dirty(folio) || folio_test_writeback(folio))))
+		gen = lru_gen_from_seq(lrugen->min_seq[type] + 1);
+	else
+		gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+		VM_BUG_ON_FOLIO(new_flags & LRU_GEN_MASK, folio);
+
+		new_flags &= ~(LRU_GEN_MASK | BIT(PG_active));
+		new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
+	} while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+
+	lru_gen_balance_size(lruvec, folio, -1, gen);
+	/* for folio_rotate_reclaimable() */
+	if (reclaiming)
+		list_add_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
+	else
+		list_add(&folio->lru, &lrugen->lists[gen][type][zone]);
+
+	return true;
+}
+
+static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+	int gen;
+	unsigned long old_flags, new_flags;
+
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+		if (!(new_flags & LRU_GEN_MASK))
+			return false;
+
+		VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
+		VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
+
+		gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+
+		new_flags &= ~LRU_GEN_MASK;
+		/* for shrink_page_list() */
+		if (reclaiming)
+			new_flags &= ~(BIT(PG_referenced) | BIT(PG_reclaim));
+		else if (lru_gen_is_active(lruvec, gen))
+			new_flags |= BIT(PG_active);
+	} while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+
+	lru_gen_balance_size(lruvec, folio, gen, -1);
+	list_del(&folio->lru);
+
+	return true;
+}
+
+#else
+
+static inline bool lru_gen_enabled(void)
+{
+	return false;
+}
+
+static inline bool lru_gen_in_fault(void)
+{
+	return false;
+}
+
+static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+	return false;
+}
+
+static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+	return false;
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 static __always_inline
 void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
 {
 	enum lru_list lru = folio_lru_list(folio);
 
+	if (lru_gen_add_folio(lruvec, folio, false))
+		return;
+
 	update_lru_size(lruvec, lru, folio_zonenum(folio),
 			folio_nr_pages(folio));
 	list_add(&folio->lru, &lruvec->lists[lru]);
@@ -113,6 +298,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
 {
 	enum lru_list lru = folio_lru_list(folio);
 
+	if (lru_gen_add_folio(lruvec, folio, true))
+		return;
+
 	update_lru_size(lruvec, lru, folio_zonenum(folio),
 			folio_nr_pages(folio));
 	list_add_tail(&folio->lru, &lruvec->lists[lru]);
@@ -127,6 +315,9 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
 static __always_inline
 void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
 {
+	if (lru_gen_del_folio(lruvec, folio, false))
+		return;
+
 	list_del(&folio->lru);
 	update_lru_size(lruvec, folio_lru_list(folio), folio_zonenum(folio),
 			-folio_nr_pages(folio));
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index aed44e9b5d89..0f5e8a995781 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -303,6 +303,78 @@ enum lruvec_flags {
 					 */
 };
 
+struct lruvec;
+
+#define LRU_GEN_MASK		((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
+#define LRU_REFS_MASK		((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
+
+#ifdef CONFIG_LRU_GEN
+
+#define MIN_LRU_BATCH		BITS_PER_LONG
+#define MAX_LRU_BATCH		(MIN_LRU_BATCH * 128)
+
+/*
+ * Evictable pages are divided into multiple generations. The youngest and the
+ * oldest generation numbers, max_seq and min_seq, are monotonically increasing.
+ * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An
+ * offset within MAX_NR_GENS, gen, indexes the LRU list of the corresponding
+ * generation. The gen counter in folio->flags stores gen+1 while a page is on
+ * one of lrugen->lists[]. Otherwise it stores 0.
+ *
+ * A page is added to the youngest generation on faulting. The aging needs to
+ * check the accessed bit at least twice before handing this page over to the
+ * eviction. The first check takes care of the accessed bit set on the initial
+ * fault; the second check makes sure this page hasn't been used since then.
+ * This process, AKA second chance, requires a minimum of two generations,
+ * hence MIN_NR_GENS. And to be compatible with the active/inactive LRU, these
+ * two generations are mapped to the active; the rest of generations, if they
+ * exist, are mapped to the inactive. PG_active is always cleared while a page
+ * is on one of lrugen->lists[] so that demotion, which happens consequently
+ * when the aging produces a new generation, needs not to worry about it.
+ */
+#define MIN_NR_GENS		2U
+#define MAX_NR_GENS		((unsigned int)CONFIG_NR_LRU_GENS)
+
+struct lru_gen_struct {
+	/* the aging increments the youngest generation number */
+	unsigned long max_seq;
+	/* the eviction increments the oldest generation numbers */
+	unsigned long min_seq[ANON_AND_FILE];
+	/* the birth time of each generation in jiffies */
+	unsigned long timestamps[MAX_NR_GENS];
+	/* the multigenerational LRU lists */
+	struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	/* the sizes of the above lists */
+	unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	/* whether the multigenerational LRU is enabled */
+	bool enabled;
+};
+
+void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec);
+
+#ifdef CONFIG_MEMCG
+void lru_gen_init_memcg(struct mem_cgroup *memcg);
+void lru_gen_free_memcg(struct mem_cgroup *memcg);
+#endif
+
+#else /* !CONFIG_LRU_GEN */
+
+static inline void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec)
+{
+}
+
+#ifdef CONFIG_MEMCG
+static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
+{
+}
+
+static inline void lru_gen_free_memcg(struct mem_cgroup *memcg)
+{
+}
+#endif
+
+#endif /* CONFIG_LRU_GEN */
+
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
 	/* per lruvec lru_lock for memcg */
@@ -320,6 +392,10 @@ struct lruvec {
 	unsigned long			refaults[ANON_AND_FILE];
 	/* Various lruvec state flags (enum lruvec_flags) */
 	unsigned long			flags;
+#ifdef CONFIG_LRU_GEN
+	/* evictable pages divided into generations */
+	struct lru_gen_struct		lrugen;
+#endif
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
 #endif
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index ef1e3e736e14..8cdbbdccb5ad 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -26,6 +26,14 @@
 
 #define ZONES_WIDTH		ZONES_SHIFT
 
+#ifdef CONFIG_LRU_GEN
+/* LRU_GEN_WIDTH is generated from order_base_2(CONFIG_NR_LRU_GENS + 1). */
+#define LRU_REFS_WIDTH		(CONFIG_TIERS_PER_GEN - 2)
+#else
+#define LRU_GEN_WIDTH		0
+#define LRU_REFS_WIDTH		0
+#endif /* CONFIG_LRU_GEN */
+
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
 #define SECTIONS_SHIFT	(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
@@ -55,7 +63,8 @@
 #define SECTIONS_WIDTH		0
 #endif
 
-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_REFS_WIDTH + SECTIONS_WIDTH + NODES_SHIFT \
+	<= BITS_PER_LONG - NR_PAGEFLAGS
 #define NODES_WIDTH		NODES_SHIFT
 #elif defined(CONFIG_SPARSEMEM_VMEMMAP)
 #error "Vmemmap: No space for nodes field in page flags"
@@ -89,8 +98,8 @@
 #define LAST_CPUPID_SHIFT 0
 #endif
 
-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT \
-	<= BITS_PER_LONG - NR_PAGEFLAGS
+#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_REFS_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
+	KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
 #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
 #else
 #define LAST_CPUPID_WIDTH 0
@@ -100,8 +109,8 @@
 #define LAST_CPUPID_NOT_IN_PAGE_FLAGS
 #endif
 
-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH \
-	> BITS_PER_LONG - NR_PAGEFLAGS
+#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_REFS_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
+	KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
 #error "Not enough bits in page flags"
 #endif
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1c3b6e5c8bfd..a95518ca98eb 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -935,7 +935,7 @@ __PAGEFLAG(Isolated, isolated, PF_ANY);
 	 1UL << PG_private	| 1UL << PG_private_2	|	\
 	 1UL << PG_writeback	| 1UL << PG_reserved	|	\
 	 1UL << PG_slab		| 1UL << PG_active 	|	\
-	 1UL << PG_unevictable	| __PG_MLOCKED)
+	 1UL << PG_unevictable	| __PG_MLOCKED | LRU_GEN_MASK)
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.
@@ -946,7 +946,7 @@ __PAGEFLAG(Isolated, isolated, PF_ANY);
  * alloc-free cycle to prevent from reusing the page.
  */
 #define PAGE_FLAGS_CHECK_AT_PREP	\
-	(PAGEFLAGS_MASK & ~__PG_HWPOISON)
+	((PAGEFLAGS_MASK & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK)
 
 #define PAGE_FLAGS_PRIVATE				\
 	(1UL << PG_private | 1UL << PG_private_2)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 75ba8aa60248..e7fe784b11aa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -914,6 +914,10 @@ struct task_struct {
 #ifdef CONFIG_MEMCG
 	unsigned			in_user_fault:1;
 #endif
+#ifdef CONFIG_LRU_GEN
+	/* whether the LRU algorithm may apply to this access */
+	unsigned			in_lru_fault:1;
+#endif
 #ifdef CONFIG_COMPAT_BRK
 	unsigned			brk_randomized:1;
 #endif
diff --git a/kernel/bounds.c b/kernel/bounds.c
index 9795d75b09b2..aba13aa7336c 100644
--- a/kernel/bounds.c
+++ b/kernel/bounds.c
@@ -22,6 +22,9 @@ int main(void)
 	DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
 #endif
 	DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t));
+#ifdef CONFIG_LRU_GEN
+	DEFINE(LRU_GEN_WIDTH, order_base_2(CONFIG_NR_LRU_GENS + 1));
+#endif
 	/* End of constants */
 
 	return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 406a3c28c026..3df389fd307f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2364,7 +2364,8 @@ static void __split_huge_page_tail(struct page *head, int tail,
 #ifdef CONFIG_64BIT
 			 (1L << PG_arch_2) |
 #endif
-			 (1L << PG_dirty)));
+			 (1L << PG_dirty) |
+			 LRU_GEN_MASK | LRU_REFS_MASK));
 
 	/* ->mapping in first tail page is compound_mapcount */
 	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 09d342c7cbd0..cabb5085531b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5121,6 +5121,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 
 static void mem_cgroup_free(struct mem_cgroup *memcg)
 {
+	lru_gen_free_memcg(memcg);
 	memcg_wb_domain_exit(memcg);
 	__mem_cgroup_free(memcg);
 }
@@ -5180,6 +5181,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	memcg->deferred_split_queue.split_queue_len = 0;
 #endif
 	idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
+	lru_gen_init_memcg(memcg);
 	return memcg;
 fail:
 	mem_cgroup_id_remove(memcg);
diff --git a/mm/memory.c b/mm/memory.c
index a7379196a47e..d27e5f1a2533 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4754,6 +4754,27 @@ static inline void mm_account_fault(struct pt_regs *regs,
 		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
 }
 
+#ifdef CONFIG_LRU_GEN
+static void lru_gen_enter_fault(struct vm_area_struct *vma)
+{
+	/* the LRU algorithm doesn't apply to sequential or random reads */
+	current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ));
+}
+
+static void lru_gen_exit_fault(void)
+{
+	current->in_lru_fault = false;
+}
+#else
+static void lru_gen_enter_fault(struct vm_area_struct *vma)
+{
+}
+
+static void lru_gen_exit_fault(void)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
 /*
  * By the time we get here, we already hold the mm semaphore
  *
@@ -4785,11 +4806,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	if (flags & FAULT_FLAG_USER)
 		mem_cgroup_enter_user_fault();
 
+	lru_gen_enter_fault(vma);
+
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
 	else
 		ret = __handle_mm_fault(vma, address, flags);
 
+	lru_gen_exit_fault();
+
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_exit_user_fault();
 		/*
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 9ddaf0e1b0ab..0d7b2bd2454a 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -65,14 +65,16 @@ void __init mminit_verify_pageflags_layout(void)
 
 	shift = 8 * sizeof(unsigned long);
 	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH
-		- LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH;
+		- LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH - LRU_GEN_WIDTH - LRU_REFS_WIDTH;
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
-		"Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Flags %d\n",
+		"Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Gen %d Tier %d Flags %d\n",
 		SECTIONS_WIDTH,
 		NODES_WIDTH,
 		ZONES_WIDTH,
 		LAST_CPUPID_WIDTH,
 		KASAN_TAG_WIDTH,
+		LRU_GEN_WIDTH,
+		LRU_REFS_WIDTH,
 		NR_PAGEFLAGS);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
 		"Section %d Node %d Zone %d Lastcpupid %d Kasantag %d\n",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6d31..a3faa8c02c07 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7480,6 +7480,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 
 	pgdat_page_ext_init(pgdat);
 	lruvec_init(&pgdat->__lruvec);
+	lru_gen_init_state(NULL, &pgdat->__lruvec);
 }
 
 static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56..e2ef2acccc74 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -462,6 +462,11 @@ void folio_add_lru(struct folio *folio)
 	VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 
+	/* see the comment in lru_gen_add_folio() */
+	if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
+	    lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
+		folio_set_active(folio);
+
 	folio_get(folio);
 	local_lock(&lru_pvecs.lock);
 	pvec = this_cpu_ptr(&lru_pvecs.lru_add);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b7228b73e1b3..d75a5738d1dc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3040,6 +3040,91 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
 	return can_demote(pgdat->node_id, sc);
 }
 
+#ifdef CONFIG_LRU_GEN
+
+/******************************************************************************
+ *                          shorthand helpers
+ ******************************************************************************/
+
+#define for_each_gen_type_zone(gen, type, zone)				\
+	for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)			\
+		for ((type) = 0; (type) < ANON_AND_FILE; (type)++)	\
+			for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
+
+static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
+{
+	struct pglist_data *pgdat = NODE_DATA(nid);
+
+#ifdef CONFIG_MEMCG
+	if (memcg) {
+		struct lruvec *lruvec = &memcg->nodeinfo[nid]->lruvec;
+
+		/* for hotadd_new_pgdat() */
+		if (!lruvec->pgdat)
+			lruvec->pgdat = pgdat;
+
+		return lruvec;
+	}
+#endif
+	return pgdat ? &pgdat->__lruvec : NULL;
+}
+
+/******************************************************************************
+ *                          initialization
+ ******************************************************************************/
+
+void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec)
+{
+	int i;
+	int gen, type, zone;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	lrugen->max_seq = MIN_NR_GENS + 1;
+	lrugen->enabled = lru_gen_enabled();
+
+	for (i = 0; i <= MIN_NR_GENS + 1; i++)
+		lrugen->timestamps[i] = jiffies;
+
+	for_each_gen_type_zone(gen, type, zone)
+		INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
+}
+
+#ifdef CONFIG_MEMCG
+void lru_gen_init_memcg(struct mem_cgroup *memcg)
+{
+	int nid;
+
+	for_each_node(nid) {
+		struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+		lru_gen_init_state(memcg, lruvec);
+	}
+}
+
+void lru_gen_free_memcg(struct mem_cgroup *memcg)
+{
+	int nid;
+
+	for_each_node(nid) {
+		struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+		VM_BUG_ON(memchr_inv(lruvec->lrugen.nr_pages, 0,
+				     sizeof(lruvec->lrugen.nr_pages)));
+	}
+}
+#endif
+
+static int __init init_lru_gen(void)
+{
+	BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS);
+	BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
+
+	return 0;
+};
+late_initcall(init_lru_gen);
+
+#endif /* CONFIG_LRU_GEN */
+
 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
 	unsigned long nr[NR_LRU_LISTS];
-- 
2.35.0.263.gb82422642f-goog




* [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation
  2022-02-08  8:18 [PATCH v7 00/12] Multigenerational LRU Framework Yu Zhao
                   ` (3 preceding siblings ...)
  2022-02-08  8:18 ` [PATCH v7 04/12] mm: multigenerational LRU: groundwork Yu Zhao
@ 2022-02-08  8:18 ` Yu Zhao
  2022-02-08  8:33   ` Yu Zhao
                     ` (3 more replies)
  2022-02-08  8:18 ` [PATCH v7 06/12] mm: multigenerational LRU: exploit locality in rmap Yu Zhao
                   ` (9 subsequent siblings)
  14 siblings, 4 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Yu Zhao, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multigenerational LRU, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
LRU, as usual.

The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
promotes hot pages to the youngest generation when it finds them
accessed through page tables; cold pages are demoted implicitly when
max_seq is incremented, because every existing generation then becomes
relatively older. Since the aging is only interested in hot pages, its
complexity is O(nr_hot_pages). Promotion in the aging path doesn't
require any LRU list operations, only updates of the gen counter and
lrugen->nr_pages[]; demotion, unless it is the result of incrementing
max_seq, requires LRU list operations, e.g., lru_deactivate_fn().
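
As an illustration (not part of the patch) of how sequence numbers map
onto the fixed number of generation slots tracked in folio->flags, the
standalone sketch below mirrors lru_gen_from_seq() from this series,
assuming the default CONFIG_NR_LRU_GENS=4 and hypothetical counters:

  #include <stdio.h>

  #define MAX_NR_GENS 4  /* default CONFIG_NR_LRU_GENS */

  /* same folding as lru_gen_from_seq() in this series */
  static int lru_gen_from_seq(unsigned long seq)
  {
          return seq % MAX_NR_GENS;
  }

  int main(void)
  {
          unsigned long min_seq = 5, max_seq = 7;  /* hypothetical counters */
          unsigned long seq;

          for (seq = min_seq; seq <= max_seq; seq++)
                  printf("seq %lu -> gen slot %d%s\n", seq,
                         lru_gen_from_seq(seq),
                         seq == max_seq ? " (youngest)" : "");
          return 0;
  }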

The eviction consumes old generations. Given an lruvec, it increments
min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
feedback loop modeled after a PID controller monitors refaults over
the anon and file types and decides which type to evict when both
types are available from the same generation.
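
To make that decision concrete, the comparison below is the one added
by this patch as positive_ctrl_err(); the standalone program and its
sample counts are illustrative only, and MIN_LRU_BATCH, defined
elsewhere in the series, is assumed to be 64 here:

  #include <stdbool.h>
  #include <stdio.h>

  #define MIN_LRU_BATCH 64  /* assumption for this sketch */

  struct ctrl_pos {
          unsigned long refaulted;
          unsigned long total;
          int gain;
  };

  /* true if the PV has few refaults or a lower gain-weighted rate than the SP */
  static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
  {
          return pv->refaulted < MIN_LRU_BATCH ||
                 pv->refaulted * (sp->total + MIN_LRU_BATCH) * sp->gain <=
                 (sp->refaulted + 1) * pv->total * pv->gain;
  }

  int main(void)
  {
          /* hypothetical counts: anon refaults at 10%, file at 2% */
          struct ctrl_pos anon = { .refaulted = 1000, .total = 10000, .gain = 100 };
          struct ctrl_pos file = { .refaulted = 200, .total = 10000, .gain = 100 };

          /* with equal gains (swappiness 100), file is the cheaper type */
          printf("evict %s first\n",
                 positive_ctrl_err(&anon, &file) ? "file" : "anon");
          return 0;
  }

In the patch, the gains are swappiness for anon and 200-swappiness for
file, and the same comparison against the remaining tiers of the chosen
type then sets the last tier to evict.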

Each generation is divided into multiple tiers. Tiers represent
different ranges of numbers of accesses through file descriptors. A
page accessed N times through file descriptors is in tier
order_base_2(N). Tiers don't have dedicated lrugen->lists[], only bits
in folio->flags. In contrast to moving across generations, which
requires the LRU lock, moving between tiers only involves operations
on folio->flags. The feedback loop also monitors refaults over all
tiers and decides which tiers (N>1) to promote and when, using the
first tier (N=0,1) as a baseline. The first tier contains single-use
unmapped clean pages, which are most likely the best choices for
eviction. The eviction promotes a page to the next generation, i.e.,
min_seq+1 rather than max_seq, if the feedback loop decides to do so.
This approach has the following advantages:
1) It removes the cost of activation in the buffered access path by
   inferring whether pages accessed multiple times through file
   descriptors are statistically hot and thus worth promoting in the
   eviction path.
2) It takes pages accessed through page tables into account and avoids
   overprotecting pages accessed multiple times through file
   descriptors. (Pages accessed through page tables are in the first
   tier since N=0.)
3) More tiers provide better protection for pages accessed more than
   twice through file descriptors under heavy buffered I/O workloads.
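
For illustration, with the default CONFIG_TIERS_PER_GEN=4, the
order_base_2(N) rule above maps N accesses through file descriptors to
tiers as follows: N=0,1 -> tier 0 (PG_referenced only), N=2 -> tier 1
(PG_referenced plus PG_workingset), N=3,4 -> tier 2 and N>=5 -> tier 3,
where the per-page counter saturates.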

Server benchmark results:
  Single workload:
    fio (buffered I/O): +[47, 49]%
                IOPS         BW
      5.17-rc2: 2242k        8759MiB/s
      patch1-5: 3321k        12.7GiB/s

  Single workload:
    memcached (anon): +[101, 105]%
                Ops/sec      KB/sec
      5.17-rc2: 476771.79    18544.31
      patch1-5: 972526.07    37826.95

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was used as a ram disk only to reduce the variance in the
    results.

    patch drivers/block/brd.c <<EOF
    99,100c99,100
    < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
    < 	page = alloc_page(gfp_flags);
    ---
    > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
    > 	page = alloc_pages_node(1, gfp_flags, 0);
    EOF

    cat >>/etc/systemd/system.conf <<OEF
    CPUAffinity=numa
    NUMAPolicy=bind
    NUMAMask=0
    OEF

    cat >>/etc/memcached.conf <<OEF
    -m 184320
    -s /var/run/memcached/memcached.sock
    -a 0766
    -t 36
    -B binary
    OEF

    cat fio.sh
    modprobe brd rd_nr=1 rd_size=113246208
    mkfs.ext4 /dev/ram0
    mount -t ext4 /dev/ram0 /mnt

    mkdir /sys/fs/cgroup/user.slice/test
    echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
    echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
    fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=5m --group_reporting

    cat memcached.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.17-rc2
      38.05%  page_vma_mapped_walk
      20.86%  lzo1x_1_do_compress (real work)
       6.16%  do_raw_spin_lock
       4.61%  _raw_spin_unlock_irq
       2.20%  vma_interval_tree_iter_next
       2.19%  vma_interval_tree_subtree_search
       2.15%  page_referenced_one
       1.93%  anon_vma_interval_tree_iter_first
       1.65%  ptep_clear_flush
       1.00%  __zram_bvec_write

    patch1-5
      39.73%  lzo1x_1_do_compress (real work)
      14.96%  page_vma_mapped_walk
       6.97%  _raw_spin_unlock_irq
       3.07%  do_raw_spin_lock
       2.53%  anon_vma_interval_tree_iter_first
       2.04%  ptep_clear_flush
       1.82%  __zram_bvec_write
       1.76%  __anon_vma_interval_tree_subtree_search
       1.57%  memmove
       1.45%  free_unref_page_list

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    Chrome OS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
---
 include/linux/mm.h        |   1 +
 include/linux/mm_inline.h |  15 +
 include/linux/mmzone.h    |  35 ++
 mm/Kconfig                |  44 +++
 mm/swap.c                 |  46 ++-
 mm/vmscan.c               | 784 +++++++++++++++++++++++++++++++++++++-
 mm/workingset.c           | 119 +++++-
 7 files changed, 1039 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 05dd33265740..b4b9886ba277 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -227,6 +227,7 @@ int overcommit_policy_handler(struct ctl_table *, int, void *, size_t *,
 #define PAGE_ALIGNED(addr)	IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)
 
 #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
+#define lru_to_folio(head) (list_entry((head)->prev, struct folio, lru))
 
 void setup_initial_init_mm(void *start_code, void *end_code,
 			   void *end_data, void *brk);
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 46f4fde0299f..37c8a0ede4ff 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -109,6 +109,19 @@ static inline int lru_gen_from_seq(unsigned long seq)
 	return seq % MAX_NR_GENS;
 }
 
+static inline int lru_hist_from_seq(unsigned long seq)
+{
+	return seq % NR_HIST_GENS;
+}
+
+static inline int lru_tier_from_refs(int refs)
+{
+	VM_BUG_ON(refs > BIT(LRU_REFS_WIDTH));
+
+	/* see the comment on MAX_NR_TIERS */
+	return order_base_2(refs + 1);
+}
+
 static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
 {
 	unsigned long max_seq = lruvec->lrugen.max_seq;
@@ -237,6 +250,8 @@ static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio,
 		gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
 
 		new_flags &= ~LRU_GEN_MASK;
+		if ((new_flags & LRU_REFS_FLAGS) != LRU_REFS_FLAGS)
+			new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
 		/* for shrink_page_list() */
 		if (reclaiming)
 			new_flags &= ~(BIT(PG_referenced) | BIT(PG_reclaim));
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0f5e8a995781..3870dd9246a2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -335,6 +335,32 @@ struct lruvec;
 #define MIN_NR_GENS		2U
 #define MAX_NR_GENS		((unsigned int)CONFIG_NR_LRU_GENS)
 
+/*
+ * Each generation is divided into multiple tiers. Tiers represent different
+ * ranges of numbers of accesses through file descriptors. A page accessed N
+ * times through file descriptors is in tier order_base_2(N). A page in the
+ * first tier (N=0,1) is marked by PG_referenced unless it was faulted in
+ * through page tables or read ahead. A page in any other tier (N>1) is marked
+ * by PG_referenced and PG_workingset. Additional bits in folio->flags are
+ * required to support more than two tiers.
+ *
+ * In contrast to moving across generations which requires the LRU lock, moving
+ * across tiers only requires operations on folio->flags and therefore has a
+ * negligible cost in the buffered access path. In the eviction path,
+ * comparisons of refaulted/(evicted+promoted) from the first tier and the rest
+ * infer whether pages accessed multiple times through file descriptors are
+ * statistically hot and thus worth promoting.
+ */
+#define MAX_NR_TIERS		((unsigned int)CONFIG_TIERS_PER_GEN)
+#define LRU_REFS_FLAGS		(BIT(PG_referenced) | BIT(PG_workingset))
+
+/* whether to keep historical stats from evicted generations */
+#ifdef CONFIG_LRU_GEN_STATS
+#define NR_HIST_GENS		((unsigned int)CONFIG_NR_LRU_GENS)
+#else
+#define NR_HIST_GENS		1U
+#endif
+
 struct lru_gen_struct {
 	/* the aging increments the youngest generation number */
 	unsigned long max_seq;
@@ -346,6 +372,15 @@ struct lru_gen_struct {
 	struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
 	/* the sizes of the above lists */
 	unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	/* the exponential moving average of refaulted */
+	unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
+	/* the exponential moving average of evicted+promoted */
+	unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS];
+	/* the first tier doesn't need promotion, hence the minus one */
+	unsigned long promoted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1];
+	/* can be modified without holding the LRU lock */
+	atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
+	atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
 	/* whether the multigenerational LRU is enabled */
 	bool enabled;
 };
diff --git a/mm/Kconfig b/mm/Kconfig
index 3326ee3903f3..e899623d5df0 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -892,6 +892,50 @@ config ANON_VMA_NAME
 	  area from being merged with adjacent virtual memory areas due to the
 	  difference in their name.
 
+# multigenerational LRU {
+config LRU_GEN
+	bool "Multigenerational LRU"
+	depends on MMU
+	# the following options can use up the spare bits in page flags
+	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
+	help
+	  A high performance LRU implementation for memory overcommit. See
+	  Documentation/admin-guide/mm/multigen_lru.rst and
+	  Documentation/vm/multigen_lru.rst for details.
+
+config NR_LRU_GENS
+	int "Max number of generations"
+	depends on LRU_GEN
+	range 4 31
+	default 4
+	help
+	  Do not increase this value unless you plan to use working set
+	  estimation and proactive reclaim to optimize job scheduling in data
+	  centers.
+
+	  This option uses order_base_2(N+1) bits in page flags.
+
+config TIERS_PER_GEN
+	int "Number of tiers per generation"
+	depends on LRU_GEN
+	range 2 4
+	default 4
+	help
+	  Do not decrease this value unless you run out of spare bits in page
+	  flags, i.e., you see the "Not enough bits in page flags" build error.
+
+	  This option uses N-2 bits in page flags.
+
+config LRU_GEN_STATS
+	bool "Full stats for debugging"
+	depends on LRU_GEN
+	help
+	  Do not enable this option unless you plan to look at historical stats
+	  from evicted generations for debugging purposes.
+
+	  This option has a per-memcg and per-node memory overhead.
+# }
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/swap.c b/mm/swap.c
index e2ef2acccc74..f5c0bcac8dcd 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -407,6 +407,43 @@ static void __lru_cache_activate_folio(struct folio *folio)
 	local_unlock(&lru_pvecs.lock);
 }
 
+#ifdef CONFIG_LRU_GEN
+static void folio_inc_refs(struct folio *folio)
+{
+	unsigned long refs;
+	unsigned long old_flags, new_flags;
+
+	if (folio_test_unevictable(folio))
+		return;
+
+	/* see the comment on MAX_NR_TIERS */
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+
+		if (!(new_flags & BIT(PG_referenced))) {
+			new_flags |= BIT(PG_referenced);
+			continue;
+		}
+
+		if (!(new_flags & BIT(PG_workingset))) {
+			new_flags |= BIT(PG_workingset);
+			continue;
+		}
+
+		refs = new_flags & LRU_REFS_MASK;
+		refs = min(refs + BIT(LRU_REFS_PGOFF), LRU_REFS_MASK);
+
+		new_flags &= ~LRU_REFS_MASK;
+		new_flags |= refs;
+	} while (new_flags != old_flags &&
+		 cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+}
+#else
+static void folio_inc_refs(struct folio *folio)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
 /*
  * Mark a page as having seen activity.
  *
@@ -419,6 +456,11 @@ static void __lru_cache_activate_folio(struct folio *folio)
  */
 void folio_mark_accessed(struct folio *folio)
 {
+	if (lru_gen_enabled()) {
+		folio_inc_refs(folio);
+		return;
+	}
+
 	if (!folio_test_referenced(folio)) {
 		folio_set_referenced(folio);
 	} else if (folio_test_unevictable(folio)) {
@@ -568,7 +610,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 
 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageActive(page) && !PageUnevictable(page)) {
+	if (!PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
 		int nr_pages = thp_nr_pages(page);
 
 		del_page_from_lru_list(page, lruvec);
@@ -682,7 +724,7 @@ void deactivate_file_page(struct page *page)
  */
 void deactivate_page(struct page *page)
 {
-	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+	if (PageLRU(page) && !PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
 		struct pagevec *pvec;
 
 		local_lock(&lru_pvecs.lock);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d75a5738d1dc..5f0d92838712 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1285,9 +1285,11 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
-		mem_cgroup_swapout(page, swap);
+
+		/* get a shadow entry before mem_cgroup_swapout() clears folio_memcg() */
 		if (reclaimed && !mapping_exiting(mapping))
 			shadow = workingset_eviction(page, target_memcg);
+		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page, swap, shadow);
 		xa_unlock_irq(&mapping->i_pages);
 		put_swap_page(page, swap);
@@ -2721,6 +2723,9 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
 	unsigned long file;
 	struct lruvec *target_lruvec;
 
+	if (lru_gen_enabled())
+		return;
+
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
 
 	/*
@@ -3042,15 +3047,47 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
 
 #ifdef CONFIG_LRU_GEN
 
+enum {
+	TYPE_ANON,
+	TYPE_FILE,
+};
+
 /******************************************************************************
  *                          shorthand helpers
  ******************************************************************************/
 
+#define DEFINE_MAX_SEQ(lruvec)						\
+	unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq)
+
+#define DEFINE_MIN_SEQ(lruvec)						\
+	unsigned long min_seq[ANON_AND_FILE] = {			\
+		READ_ONCE((lruvec)->lrugen.min_seq[TYPE_ANON]),		\
+		READ_ONCE((lruvec)->lrugen.min_seq[TYPE_FILE]),		\
+	}
+
 #define for_each_gen_type_zone(gen, type, zone)				\
 	for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)			\
 		for ((type) = 0; (type) < ANON_AND_FILE; (type)++)	\
 			for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
 
+static int folio_lru_gen(struct folio *folio)
+{
+	unsigned long flags = READ_ONCE(folio->flags);
+
+	return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+}
+
+static int folio_lru_tier(struct folio *folio)
+{
+	int refs;
+	unsigned long flags = READ_ONCE(folio->flags);
+
+	refs = (flags & LRU_REFS_FLAGS) == LRU_REFS_FLAGS ?
+	       ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + 1 : 0;
+
+	return lru_tier_from_refs(refs);
+}
+
 static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
 {
 	struct pglist_data *pgdat = NODE_DATA(nid);
@@ -3069,6 +3106,728 @@ static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
 	return pgdat ? &pgdat->__lruvec : NULL;
 }
 
+static int get_swappiness(struct mem_cgroup *memcg)
+{
+	return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
+	       mem_cgroup_swappiness(memcg) : 0;
+}
+
+static int get_nr_gens(struct lruvec *lruvec, int type)
+{
+	return lruvec->lrugen.max_seq - lruvec->lrugen.min_seq[type] + 1;
+}
+
+static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
+{
+	/*
+	 * Ideally anon and file min_seq should be in sync. But swapping isn't
+	 * as reliable as dropping clean file pages, e.g., out of swap space. So
+	 * allow file min_seq to advance and leave anon min_seq behind, but not
+	 * the other way around.
+	 */
+	return get_nr_gens(lruvec, TYPE_FILE) >= MIN_NR_GENS &&
+	       get_nr_gens(lruvec, TYPE_FILE) <= get_nr_gens(lruvec, TYPE_ANON) &&
+	       get_nr_gens(lruvec, TYPE_ANON) <= MAX_NR_GENS;
+}
+
+/******************************************************************************
+ *                          refault feedback loop
+ ******************************************************************************/
+
+/*
+ * A feedback loop based on Proportional-Integral-Derivative (PID) controller.
+ *
+ * The P term is refaulted/(evicted+promoted) from a tier in the generation
+ * currently being evicted; the I term is the exponential moving average of the
+ * P term over the generations previously evicted, using the smoothing factor
+ * 1/2; the D term isn't supported.
+ *
+ * The setpoint (SP) is always the first tier of one type; the process variable
+ * (PV) is either any tier of the other type or any other tier of the same
+ * type.
+ *
+ * The error is the difference between the SP and the PV; the correction is
+ * to turn off promotion when SP>PV or to turn on promotion when SP<PV.
+ *
+ * For future optimizations:
+ * 1) The D term may discount the other two terms over time so that long-lived
+ *    generations can resist stale information.
+ */
+struct ctrl_pos {
+	unsigned long refaulted;
+	unsigned long total;
+	int gain;
+};
+
+static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain,
+			  struct ctrl_pos *pos)
+{
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	int hist = lru_hist_from_seq(lrugen->min_seq[type]);
+
+	pos->refaulted = lrugen->avg_refaulted[type][tier] +
+			 atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+	pos->total = lrugen->avg_total[type][tier] +
+		     atomic_long_read(&lrugen->evicted[hist][type][tier]);
+	if (tier)
+		pos->total += lrugen->promoted[hist][type][tier - 1];
+	pos->gain = gain;
+}
+
+static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover)
+{
+	int hist, tier;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	bool clear = carryover ? NR_HIST_GENS == 1 : NR_HIST_GENS > 1;
+	unsigned long seq = carryover ? lrugen->min_seq[type] : lrugen->max_seq + 1;
+
+	lockdep_assert_held(&lruvec->lru_lock);
+
+	if (!carryover && !clear)
+		return;
+
+	hist = lru_hist_from_seq(seq);
+
+	for (tier = 0; tier < MAX_NR_TIERS; tier++) {
+		if (carryover) {
+			unsigned long sum;
+
+			sum = lrugen->avg_refaulted[type][tier] +
+			      atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+			WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2);
+
+			sum = lrugen->avg_total[type][tier] +
+			      atomic_long_read(&lrugen->evicted[hist][type][tier]);
+			if (tier)
+				sum += lrugen->promoted[hist][type][tier - 1];
+			WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2);
+		}
+
+		if (clear) {
+			atomic_long_set(&lrugen->refaulted[hist][type][tier], 0);
+			atomic_long_set(&lrugen->evicted[hist][type][tier], 0);
+			if (tier)
+				WRITE_ONCE(lrugen->promoted[hist][type][tier - 1], 0);
+		}
+	}
+}
+
+static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
+{
+	/*
+	 * Return true if the PV has a limited number of refaults or a lower
+	 * refaulted/total than the SP.
+	 */
+	return pv->refaulted < MIN_LRU_BATCH ||
+	       pv->refaulted * (sp->total + MIN_LRU_BATCH) * sp->gain <=
+	       (sp->refaulted + 1) * pv->total * pv->gain;
+}
+
+/******************************************************************************
+ *                          the aging
+ ******************************************************************************/
+
+static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+	unsigned long old_flags, new_flags;
+	int type = folio_is_file_lru(folio);
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+		VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
+
+		new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+		new_gen = (old_gen + 1) % MAX_NR_GENS;
+
+		new_flags &= ~LRU_GEN_MASK;
+		new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
+		new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
+		/* for folio_end_writeback() */
+		if (reclaiming)
+			new_flags |= BIT(PG_reclaim);
+	} while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+
+	lru_gen_balance_size(lruvec, folio, old_gen, new_gen);
+
+	return new_gen;
+}
+
+static void inc_min_seq(struct lruvec *lruvec)
+{
+	int type;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
+			continue;
+
+		reset_ctrl_pos(lruvec, type, true);
+		WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1);
+	}
+}
+
+static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
+{
+	int gen, type, zone;
+	bool success = false;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	DEFINE_MIN_SEQ(lruvec);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		while (lrugen->max_seq >= min_seq[type] + MIN_NR_GENS) {
+			gen = lru_gen_from_seq(min_seq[type]);
+
+			for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+				if (!list_empty(&lrugen->lists[gen][type][zone]))
+					goto next;
+			}
+
+			min_seq[type]++;
+		}
+next:
+		;
+	}
+
+	/* see the comment in seq_is_valid() */
+	if (can_swap) {
+		min_seq[TYPE_ANON] = min(min_seq[TYPE_ANON], min_seq[TYPE_FILE]);
+		min_seq[TYPE_FILE] = max(min_seq[TYPE_ANON], lrugen->min_seq[TYPE_FILE]);
+	}
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		if (min_seq[type] == lrugen->min_seq[type])
+			continue;
+
+		reset_ctrl_pos(lruvec, type, true);
+		WRITE_ONCE(lrugen->min_seq[type], min_seq[type]);
+		success = true;
+	}
+
+	return success;
+}
+
+static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
+{
+	int prev, next;
+	int type, zone;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	if (max_seq != lrugen->max_seq)
+		goto unlock;
+
+	inc_min_seq(lruvec);
+
+	/* update the active/inactive LRU sizes for compatibility */
+	prev = lru_gen_from_seq(lrugen->max_seq - 1);
+	next = lru_gen_from_seq(lrugen->max_seq + 1);
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+			enum lru_list lru = type * LRU_INACTIVE_FILE;
+			long delta = lrugen->nr_pages[prev][type][zone] -
+				     lrugen->nr_pages[next][type][zone];
+
+			if (!delta)
+				continue;
+
+			lru_gen_update_size(lruvec, lru, zone, delta);
+			lru_gen_update_size(lruvec, lru + LRU_ACTIVE, zone, -delta);
+		}
+	}
+
+	for (type = 0; type < ANON_AND_FILE; type++)
+		reset_ctrl_pos(lruvec, type, false);
+
+	WRITE_ONCE(lrugen->timestamps[next], jiffies);
+	/* make sure preceding modifications appear */
+	smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
+unlock:
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
+			     unsigned long *min_seq, bool can_swap, bool *need_aging)
+{
+	int gen, type, zone;
+	long total = 0;
+	long young = 0;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	for (type = !can_swap; type < ANON_AND_FILE; type++) {
+		unsigned long seq;
+
+		for (seq = min_seq[type]; seq <= max_seq; seq++) {
+			long size = 0;
+
+			gen = lru_gen_from_seq(seq);
+
+			for (zone = 0; zone < MAX_NR_ZONES; zone++)
+				size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
+
+			total += size;
+			if (seq == max_seq)
+				young += size;
+		}
+	}
+
+	/* try to spread pages out across MIN_NR_GENS+1 generations */
+	if (max_seq < min_seq[TYPE_FILE] + MIN_NR_GENS)
+		*need_aging = true;
+	else if (max_seq > min_seq[TYPE_FILE] + MIN_NR_GENS)
+		*need_aging = false;
+	else
+		*need_aging = young * MIN_NR_GENS > total;
+
+	return total > 0 ? total : 0;
+}
+
+static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+	bool need_aging;
+	long nr_to_scan;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	int swappiness = get_swappiness(memcg);
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	mem_cgroup_calculate_protection(NULL, memcg);
+
+	if (mem_cgroup_below_min(memcg))
+		return;
+
+	nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, swappiness, &need_aging);
+	if (!nr_to_scan)
+		return;
+
+	nr_to_scan >>= sc->priority;
+
+	if (!mem_cgroup_online(memcg))
+		nr_to_scan++;
+
+	if (nr_to_scan && need_aging && (!mem_cgroup_below_low(memcg) || sc->memcg_low_reclaim))
+		inc_max_seq(lruvec, max_seq);
+}
+
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
+{
+	struct mem_cgroup *memcg;
+
+	VM_BUG_ON(!current_is_kswapd());
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+		age_lruvec(lruvec, sc);
+
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+}
+
+/******************************************************************************
+ *                          the eviction
+ ******************************************************************************/
+
+static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
+{
+	bool success;
+	int gen = folio_lru_gen(folio);
+	int type = folio_is_file_lru(folio);
+	int zone = folio_zonenum(folio);
+	int tier = folio_lru_tier(folio);
+	int delta = folio_nr_pages(folio);
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON_FOLIO(gen >= MAX_NR_GENS, folio);
+
+	if (!folio_evictable(folio)) {
+		success = lru_gen_del_folio(lruvec, folio, true);
+		VM_BUG_ON_FOLIO(!success, folio);
+		folio_set_unevictable(folio);
+		lruvec_add_folio(lruvec, folio);
+		__count_vm_events(UNEVICTABLE_PGCULLED, delta);
+		return true;
+	}
+
+	if (type && folio_test_anon(folio) && folio_test_dirty(folio)) {
+		success = lru_gen_del_folio(lruvec, folio, true);
+		VM_BUG_ON_FOLIO(!success, folio);
+		folio_set_swapbacked(folio);
+		lruvec_add_folio_tail(lruvec, folio);
+		return true;
+	}
+
+	if (tier > tier_idx) {
+		int hist = lru_hist_from_seq(lrugen->min_seq[type]);
+
+		gen = folio_inc_gen(lruvec, folio, false);
+		list_move_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
+
+		WRITE_ONCE(lrugen->promoted[hist][type][tier - 1],
+			   lrugen->promoted[hist][type][tier - 1] + delta);
+		__mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
+		return true;
+	}
+
+	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
+	    (type && folio_test_dirty(folio))) {
+		gen = folio_inc_gen(lruvec, folio, true);
+		list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
+		return true;
+	}
+
+	return false;
+}
+
+static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc)
+{
+	bool success;
+
+	if (!sc->may_unmap && folio_mapped(folio))
+		return false;
+
+	if (!(sc->may_writepage && (sc->gfp_mask & __GFP_IO)) &&
+	    (folio_test_dirty(folio) ||
+	     (folio_test_anon(folio) && !folio_test_swapcache(folio))))
+		return false;
+
+	if (!folio_try_get(folio))
+		return false;
+
+	if (!folio_test_clear_lru(folio)) {
+		folio_put(folio);
+		return false;
+	}
+
+	success = lru_gen_del_folio(lruvec, folio, true);
+	VM_BUG_ON_FOLIO(!success, folio);
+
+	return true;
+}
+
+static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
+		       int type, int tier, struct list_head *list)
+{
+	int gen, zone;
+	enum vm_event_item item;
+	int sorted = 0;
+	int scanned = 0;
+	int isolated = 0;
+	int remaining = MAX_LRU_BATCH;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	VM_BUG_ON(!list_empty(list));
+
+	if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
+		return 0;
+
+	gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+	for (zone = sc->reclaim_idx; zone >= 0; zone--) {
+		LIST_HEAD(moved);
+		int skipped = 0;
+		struct list_head *head = &lrugen->lists[gen][type][zone];
+
+		while (!list_empty(head)) {
+			struct folio *folio = lru_to_folio(head);
+			int delta = folio_nr_pages(folio);
+
+			VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
+			VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
+			VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
+			VM_BUG_ON_FOLIO(folio_zonenum(folio) != zone, folio);
+
+			scanned += delta;
+
+			if (sort_folio(lruvec, folio, tier))
+				sorted += delta;
+			else if (isolate_folio(lruvec, folio, sc)) {
+				list_add(&folio->lru, list);
+				isolated += delta;
+			} else {
+				list_move(&folio->lru, &moved);
+				skipped += delta;
+			}
+
+			if (!--remaining || max(isolated, skipped) >= MIN_LRU_BATCH)
+				break;
+		}
+
+		if (skipped) {
+			list_splice(&moved, head);
+			__count_zid_vm_events(PGSCAN_SKIP, zone, skipped);
+		}
+
+		if (!remaining || isolated >= MIN_LRU_BATCH)
+			break;
+	}
+
+	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
+	if (!cgroup_reclaim(sc)) {
+		__count_vm_events(item, isolated);
+		__count_vm_events(PGREFILL, sorted);
+	}
+	__count_memcg_events(memcg, item, isolated);
+	__count_memcg_events(memcg, PGREFILL, sorted);
+	__count_vm_events(PGSCAN_ANON + type, isolated);
+
+	/*
+	 * There might not be eligible pages due to reclaim_idx, may_unmap and
+	 * may_writepage. Check the remaining to prevent livelock if there is no
+	 * progress.
+	 */
+	return isolated || !remaining ? scanned : 0;
+}
+
+static int get_tier_idx(struct lruvec *lruvec, int type)
+{
+	int tier;
+	struct ctrl_pos sp, pv;
+
+	/*
+	 * To leave a margin for fluctuations, use a larger gain factor (1:2).
+	 * This value is chosen because any other tier would have at least twice
+	 * as many refaults as the first tier.
+	 */
+	read_ctrl_pos(lruvec, type, 0, 1, &sp);
+	for (tier = 1; tier < MAX_NR_TIERS; tier++) {
+		read_ctrl_pos(lruvec, type, tier, 2, &pv);
+		if (!positive_ctrl_err(&sp, &pv))
+			break;
+	}
+
+	return tier - 1;
+}
+
+static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx)
+{
+	int type, tier;
+	struct ctrl_pos sp, pv;
+	int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness };
+
+	/*
+	 * Compare the first tier of anon with that of file to determine which
+	 * type to scan. Also need to compare other tiers of the selected type
+	 * with the first tier of the other type to determine the last tier (of
+	 * the selected type) to evict.
+	 */
+	read_ctrl_pos(lruvec, TYPE_ANON, 0, gain[TYPE_ANON], &sp);
+	read_ctrl_pos(lruvec, TYPE_FILE, 0, gain[TYPE_FILE], &pv);
+	type = positive_ctrl_err(&sp, &pv);
+
+	read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp);
+	for (tier = 1; tier < MAX_NR_TIERS; tier++) {
+		read_ctrl_pos(lruvec, type, tier, gain[type], &pv);
+		if (!positive_ctrl_err(&sp, &pv))
+			break;
+	}
+
+	*tier_idx = tier - 1;
+
+	return type;
+}
+
+static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
+			  int *type_scanned, struct list_head *list)
+{
+	int i;
+	int type;
+	int scanned;
+	int tier = -1;
+	DEFINE_MIN_SEQ(lruvec);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	/*
+	 * Try to make the obvious choice first. When anon and file are both
+	 * available from the same generation, interpret swappiness 1 as file
+	 * first and 200 as anon first.
+	 */
+	if (!swappiness)
+		type = TYPE_FILE;
+	else if (min_seq[TYPE_ANON] < min_seq[TYPE_FILE])
+		type = TYPE_ANON;
+	else if (swappiness == 1)
+		type = TYPE_FILE;
+	else if (swappiness == 200)
+		type = TYPE_ANON;
+	else
+		type = get_type_to_scan(lruvec, swappiness, &tier);
+
+	for (i = !swappiness; i < ANON_AND_FILE; i++) {
+		if (tier < 0)
+			tier = get_tier_idx(lruvec, type);
+
+		scanned = scan_folios(lruvec, sc, type, tier, list);
+		if (scanned)
+			break;
+
+		type = !type;
+		tier = -1;
+	}
+
+	*type_scanned = type;
+
+	return scanned;
+}
+
+static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
+{
+	int type;
+	int scanned;
+	int reclaimed;
+	LIST_HEAD(list);
+	struct folio *folio;
+	enum vm_event_item item;
+	struct reclaim_stat stat;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	scanned = isolate_folios(lruvec, sc, swappiness, &type, &list);
+
+	if (try_to_inc_min_seq(lruvec, swappiness))
+		scanned++;
+
+	if (get_nr_gens(lruvec, TYPE_FILE) == MIN_NR_GENS)
+		scanned = 0;
+
+	spin_unlock_irq(&lruvec->lru_lock);
+
+	if (list_empty(&list))
+		return scanned;
+
+	reclaimed = shrink_page_list(&list, pgdat, sc, &stat, false);
+
+	/*
+	 * To avoid livelock, don't add rejected pages back to the same lists
+	 * they were isolated from.
+	 */
+	list_for_each_entry(folio, &list, lru) {
+		if ((folio_is_file_lru(folio) || folio_test_swapcache(folio)) &&
+		    (!folio_test_reclaim(folio) ||
+		     !(folio_test_dirty(folio) || folio_test_writeback(folio))))
+			folio_set_active(folio);
+
+		folio_clear_referenced(folio);
+		folio_clear_workingset(folio);
+	}
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	move_pages_to_lru(lruvec, &list);
+
+	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
+	if (!cgroup_reclaim(sc))
+		__count_vm_events(item, reclaimed);
+	__count_memcg_events(memcg, item, reclaimed);
+	__count_vm_events(PGSTEAL_ANON + type, reclaimed);
+
+	spin_unlock_irq(&lruvec->lru_lock);
+
+	mem_cgroup_uncharge_list(&list);
+	free_unref_page_list(&list);
+
+	sc->nr_reclaimed += reclaimed;
+
+	return scanned;
+}
+
+static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool can_swap)
+{
+	bool need_aging;
+	long nr_to_scan;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	if (mem_cgroup_below_min(memcg) ||
+	    (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim))
+		return 0;
+
+	nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, can_swap, &need_aging);
+	if (!nr_to_scan)
+		return 0;
+
+	/* reset the priority if the target has been met */
+	nr_to_scan >>= sc->nr_reclaimed < sc->nr_to_reclaim ? sc->priority : DEF_PRIORITY;
+
+	if (!mem_cgroup_online(memcg))
+		nr_to_scan++;
+
+	if (!nr_to_scan)
+		return 0;
+
+	if (!need_aging)
+		return nr_to_scan;
+
+	/* leave the work to lru_gen_age_node() */
+	if (current_is_kswapd())
+		return 0;
+
+	/* try other memcgs before going to the aging path */
+	if (!cgroup_reclaim(sc) && !sc->force_deactivate) {
+		sc->skipped_deactivate = true;
+		return 0;
+	}
+
+	inc_max_seq(lruvec, max_seq);
+
+	return nr_to_scan;
+}
+
+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+	struct blk_plug plug;
+	long scanned = 0;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	lru_add_drain();
+
+	blk_start_plug(&plug);
+
+	while (true) {
+		int delta;
+		int swappiness;
+		long nr_to_scan;
+
+		if (sc->may_swap)
+			swappiness = get_swappiness(memcg);
+		else if (!cgroup_reclaim(sc) && get_swappiness(memcg))
+			swappiness = 1;
+		else
+			swappiness = 0;
+
+		nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
+		if (!nr_to_scan)
+			break;
+
+		delta = evict_folios(lruvec, sc, swappiness);
+		if (!delta)
+			break;
+
+		scanned += delta;
+		if (scanned >= nr_to_scan)
+			break;
+
+		cond_resched();
+	}
+
+	blk_finish_plug(&plug);
+}
+
 /******************************************************************************
  *                          initialization
  ******************************************************************************/
@@ -3123,6 +3882,16 @@ static int __init init_lru_gen(void)
 };
 late_initcall(init_lru_gen);
 
+#else
+
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
+{
+}
+
+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+}
+
 #endif /* CONFIG_LRU_GEN */
 
 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
@@ -3136,6 +3905,11 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	struct blk_plug plug;
 	bool scan_adjusted;
 
+	if (lru_gen_enabled()) {
+		lru_gen_shrink_lruvec(lruvec, sc);
+		return;
+	}
+
 	get_scan_count(lruvec, sc, nr);
 
 	/* Record the original scan target for proportional adjustments later */
@@ -3640,6 +4414,9 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
 	struct lruvec *target_lruvec;
 	unsigned long refaults;
 
+	if (lru_gen_enabled())
+		return;
+
 	target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON);
 	target_lruvec->refaults[0] = refaults;
@@ -4010,6 +4787,11 @@ static void age_active_anon(struct pglist_data *pgdat,
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
+	if (lru_gen_enabled()) {
+		lru_gen_age_node(pgdat, sc);
+		return;
+	}
+
 	if (!can_age_anon_pages(pgdat, sc))
 		return;
 
diff --git a/mm/workingset.c b/mm/workingset.c
index 8c03afe1d67c..443343a3f3e3 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -187,7 +187,6 @@ static unsigned int bucket_order __read_mostly;
 static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
 			 bool workingset)
 {
-	eviction >>= bucket_order;
 	eviction &= EVICTION_MASK;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
@@ -212,10 +211,116 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 
 	*memcgidp = memcgid;
 	*pgdat = NODE_DATA(nid);
-	*evictionp = entry << bucket_order;
+	*evictionp = entry;
 	*workingsetp = workingset;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+static int folio_lru_refs(struct folio *folio)
+{
+	unsigned long flags = READ_ONCE(folio->flags);
+
+	BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
+
+	/* see the comment on MAX_NR_TIERS */
+	return flags & BIT(PG_workingset) ? (flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF : 0;
+}
+
+static void *lru_gen_eviction(struct folio *folio)
+{
+	int hist, tier;
+	unsigned long token;
+	unsigned long min_seq;
+	struct lruvec *lruvec;
+	struct lru_gen_struct *lrugen;
+	int type = folio_is_file_lru(folio);
+	int refs = folio_lru_refs(folio);
+	int delta = folio_nr_pages(folio);
+	bool workingset = folio_test_workingset(folio);
+	struct mem_cgroup *memcg = folio_memcg(folio);
+	struct pglist_data *pgdat = folio_pgdat(folio);
+
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	lrugen = &lruvec->lrugen;
+	min_seq = READ_ONCE(lrugen->min_seq[type]);
+	token = (min_seq << LRU_REFS_WIDTH) | refs;
+
+	hist = lru_hist_from_seq(min_seq);
+	tier = lru_tier_from_refs(refs + workingset);
+	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
+
+	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset);
+}
+
+static void lru_gen_refault(struct folio *folio, void *shadow)
+{
+	int hist, tier, refs;
+	int memcg_id;
+	bool workingset;
+	unsigned long token;
+	unsigned long min_seq;
+	struct lruvec *lruvec;
+	struct lru_gen_struct *lrugen;
+	struct mem_cgroup *memcg;
+	struct pglist_data *pgdat;
+	int type = folio_is_file_lru(folio);
+	int delta = folio_nr_pages(folio);
+
+	unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset);
+
+	refs = token & (BIT(LRU_REFS_WIDTH) - 1);
+	if (refs && !workingset)
+		return;
+
+	if (folio_pgdat(folio) != pgdat)
+		return;
+
+	rcu_read_lock();
+	memcg = folio_memcg_rcu(folio);
+	if (mem_cgroup_id(memcg) != memcg_id)
+		goto unlock;
+
+	token >>= LRU_REFS_WIDTH;
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	lrugen = &lruvec->lrugen;
+	min_seq = READ_ONCE(lrugen->min_seq[type]);
+	if (token != (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH)))
+		goto unlock;
+
+	hist = lru_hist_from_seq(min_seq);
+	tier = lru_tier_from_refs(refs + workingset);
+	atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
+	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
+
+	/*
+	 * Count the following two cases as stalls:
+	 * 1) For pages accessed through page tables, hotter pages pushed out
+	 *    hot pages which refaulted immediately.
+	 * 2) For pages accessed through file descriptors, the number of
+	 *    accesses might have exceeded the limit.
+	 */
+	if (lru_gen_in_fault() || refs + workingset == BIT(LRU_REFS_WIDTH)) {
+		folio_set_workingset(folio);
+		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
+	}
+unlock:
+	rcu_read_unlock();
+}
+
+#else
+
+static void *lru_gen_eviction(struct folio *folio)
+{
+	return NULL;
+}
+
+static void lru_gen_refault(struct folio *folio, void *shadow)
+{
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 /**
  * workingset_age_nonresident - age non-resident entries as LRU ages
  * @lruvec: the lruvec that was aged
@@ -264,10 +369,14 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
+	if (lru_gen_enabled())
+		return lru_gen_eviction(page_folio(page));
+
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
 	eviction = atomic_long_read(&lruvec->nonresident_age);
+	eviction >>= bucket_order;
 	workingset_age_nonresident(lruvec, thp_nr_pages(page));
 	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
@@ -297,7 +406,13 @@ void workingset_refault(struct folio *folio, void *shadow)
 	int memcgid;
 	long nr;
 
+	if (lru_gen_enabled()) {
+		lru_gen_refault(folio, shadow);
+		return;
+	}
+
 	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
+	eviction <<= bucket_order;
 
 	rcu_read_lock();
 	/*
-- 
2.35.0.263.gb82422642f-goog




* [PATCH v7 06/12] mm: multigenerational LRU: exploit locality in rmap
  2022-02-08  8:18 [PATCH v7 00/12] Multigenerational LRU Framework Yu Zhao
                   ` (4 preceding siblings ...)
  2022-02-08  8:18 ` [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation Yu Zhao
@ 2022-02-08  8:18 ` Yu Zhao
  2022-02-08  8:40   ` Yu Zhao
  2022-02-08  8:18 ` [PATCH v7 07/12] mm: multigenerational LRU: support page table walks Yu Zhao
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Yu Zhao, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

Searching the rmap for PTEs mapping each page on an LRU list (to test
and clear the accessed bit) can be expensive because pages from
different VMAs are scattered across the rmap, which makes the walks
cache unfriendly. For workloads mostly using mapped pages, the rmap
has a high CPU cost in the reclaim path.

This patch exploits spatial locality to reduce the trips into the
rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
new function lru_gen_look_around() scans at most BITS_PER_LONG-1
adjacent PTEs. On finding another young PTE, it clears the accessed
bit and updates the gen counter of the page mapped by this PTE to
(max_seq%MAX_NR_GENS)+1.
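
The walker itself is added to mm/vmscan.c further down. As a rough
sketch of the idea only, the userspace simulation below treats a page
table as an array of entries with an accessed bit and harvests the
bits from at most BITS_PER_LONG-1 neighbors of the entry already found
young; all names and sizes here are illustrative, not the kernel API:

  #include <stdbool.h>
  #include <stdio.h>

  #define PTRS_PER_PTE  512  /* entries per page table on x86_64 */
  #define BITS_PER_LONG 64

  struct fake_pte { bool young; };

  /* Clear the accessed bits around a young entry; return them as a bitmap. */
  static unsigned long look_around(struct fake_pte *ptes, int idx)
  {
          unsigned long bitmap = 0;
          int start = idx - BITS_PER_LONG / 2;
          int end = start + BITS_PER_LONG;
          int i;

          if (start < 0)
                  start = 0;
          if (end > PTRS_PER_PTE)
                  end = PTRS_PER_PTE;

          for (i = start; i < end; i++) {
                  if (i == idx || !ptes[i].young)
                          continue;
                  ptes[i].young = false;           /* clear the accessed bit */
                  bitmap |= 1UL << (i - start);    /* remember it for promotion */
          }
          return bitmap;
  }

  int main(void)
  {
          static struct fake_pte ptes[PTRS_PER_PTE];

          ptes[100].young = ptes[102].young = ptes[130].young = true;
          printf("young neighbors of entry 100: %#lx\n", look_around(ptes, 100));
          return 0;
  }

In the patch proper, each page mapped by such a neighbor then has its
gen counter set to (max_seq%MAX_NR_GENS)+1, as described above, so a
single trip into the rmap can promote many pages at once.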

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[3.5, 5.5]%
                Ops/sec      KB/sec
      patch1-5: 972526.07    37826.95
      patch1-6: 1015292.83   39490.38

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-5
      39.73%  lzo1x_1_do_compress (real work)
      14.96%  page_vma_mapped_walk
       6.97%  _raw_spin_unlock_irq
       3.07%  do_raw_spin_lock
       2.53%  anon_vma_interval_tree_iter_first
       2.04%  ptep_clear_flush
       1.82%  __zram_bvec_write
       1.76%  __anon_vma_interval_tree_subtree_search
       1.57%  memmove
       1.45%  free_unref_page_list

    patch1-6
      45.49%  lzo1x_1_do_compress (real work)
       7.38%  page_vma_mapped_walk
       7.24%  _raw_spin_unlock_irq
       2.64%  ptep_clear_flush
       2.31%  __zram_bvec_write
       2.13%  do_raw_spin_lock
       2.09%  lru_gen_look_around
       1.89%  free_unref_page_list
       1.85%  memmove
       1.74%  obj_malloc

  Configurations:
    no change

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
---
 include/linux/memcontrol.h |  31 ++++++++
 include/linux/mm.h         |   5 ++
 include/linux/mmzone.h     |   6 ++
 include/linux/swap.h       |   1 +
 mm/memcontrol.c            |   1 +
 mm/rmap.c                  |   7 ++
 mm/swap.c                  |   4 +-
 mm/vmscan.c                | 155 +++++++++++++++++++++++++++++++++++++
 8 files changed, 208 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b72d75141e12..51c9bc8e965d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -436,6 +436,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
  * - LRU isolation
  * - lock_page_memcg()
  * - exclusive reference
+ * - mem_cgroup_trylock_pages()
  *
  * For a kmem folio a caller should hold an rcu read lock to protect memcg
  * associated with a kmem folio from being released.
@@ -497,6 +498,7 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
  * - LRU isolation
  * - lock_page_memcg()
  * - exclusive reference
+ * - mem_cgroup_trylock_pages()
  *
  * For a kmem page a caller should hold an rcu read lock to protect memcg
  * associated with a kmem page from being released.
@@ -934,6 +936,23 @@ void unlock_page_memcg(struct page *page);
 
 void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
 
+/* try to stabilize folio_memcg() for all the pages in a memcg */
+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
+{
+	rcu_read_lock();
+
+	if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account))
+		return true;
+
+	rcu_read_unlock();
+	return false;
+}
+
+static inline void mem_cgroup_unlock_pages(void)
+{
+	rcu_read_unlock();
+}
+
 /* idx can be of type enum memcg_stat_item or node_stat_item */
 static inline void mod_memcg_state(struct mem_cgroup *memcg,
 				   int idx, int val)
@@ -1371,6 +1390,18 @@ static inline void folio_memcg_unlock(struct folio *folio)
 {
 }
 
+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
+{
+	/* to match folio_memcg_rcu() */
+	rcu_read_lock();
+	return true;
+}
+
+static inline void mem_cgroup_unlock_pages(void)
+{
+	rcu_read_unlock();
+}
+
 static inline void mem_cgroup_handle_over_high(void)
 {
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b4b9886ba277..7d70b42b67e1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1588,6 +1588,11 @@ static inline unsigned long folio_pfn(struct folio *folio)
 	return page_to_pfn(&folio->page);
 }
 
+static inline struct folio *pfn_folio(unsigned long pfn)
+{
+	return page_folio(pfn_to_page(pfn));
+}
+
 /* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */
 #ifdef CONFIG_MIGRATION
 static inline bool is_pinnable_page(struct page *page)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3870dd9246a2..3d6ea30a2bdb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -304,6 +304,7 @@ enum lruvec_flags {
 };
 
 struct lruvec;
+struct page_vma_mapped_walk;
 
 #define LRU_GEN_MASK		((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
 #define LRU_REFS_MASK		((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
@@ -386,6 +387,7 @@ struct lru_gen_struct {
 };
 
 void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec);
+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
 
 #ifdef CONFIG_MEMCG
 void lru_gen_init_memcg(struct mem_cgroup *memcg);
@@ -398,6 +400,10 @@ static inline void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *l
 {
 }
 
+static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+{
+}
+
 #ifdef CONFIG_MEMCG
 static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
 {
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1d38d9475c4d..b37520d3ff1d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -372,6 +372,7 @@ extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_cpu_zone(struct zone *zone);
 extern void lru_add_drain_all(void);
+extern void folio_activate(struct folio *folio);
 extern void deactivate_file_page(struct page *page);
 extern void deactivate_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cabb5085531b..74373df19d84 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2744,6 +2744,7 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
 	 * - LRU isolation
 	 * - lock_page_memcg()
 	 * - exclusive reference
+	 * - mem_cgroup_trylock_pages()
 	 */
 	folio->memcg_data = (unsigned long)memcg;
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 6a1e8c7f6213..112e77dc62f4 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -73,6 +73,7 @@
 #include <linux/page_idle.h>
 #include <linux/memremap.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/mm_inline.h>
 
 #include <asm/tlbflush.h>
 
@@ -819,6 +820,12 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		}
 
 		if (pvmw.pte) {
+			if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
+			    !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
+				lru_gen_look_around(&pvmw);
+				referenced++;
+			}
+
 			if (ptep_clear_flush_young_notify(vma, address,
 						pvmw.pte)) {
 				/*
diff --git a/mm/swap.c b/mm/swap.c
index f5c0bcac8dcd..e65e7520bebf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -344,7 +344,7 @@ static bool need_activate_page_drain(int cpu)
 	return pagevec_count(&per_cpu(lru_pvecs.activate_page, cpu)) != 0;
 }
 
-static void folio_activate(struct folio *folio)
+void folio_activate(struct folio *folio)
 {
 	if (folio_test_lru(folio) && !folio_test_active(folio) &&
 	    !folio_test_unevictable(folio)) {
@@ -364,7 +364,7 @@ static inline void activate_page_drain(int cpu)
 {
 }
 
-static void folio_activate(struct folio *folio)
+void folio_activate(struct folio *folio)
 {
 	struct lruvec *lruvec;
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5f0d92838712..933d46ae2f68 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1556,6 +1556,11 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 		if (!sc->may_unmap && page_mapped(page))
 			goto keep_locked;
 
+		/* folio_update_gen() tried to promote this page? */
+		if (lru_gen_enabled() && !ignore_references &&
+		    page_mapped(page) && PageReferenced(page))
+			goto keep_locked;
+
 		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
 			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
 
@@ -3227,6 +3232,31 @@ static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
  *                          the aging
  ******************************************************************************/
 
+static int folio_update_gen(struct folio *folio, int gen)
+{
+	unsigned long old_flags, new_flags;
+
+	VM_BUG_ON(gen >= MAX_NR_GENS);
+	VM_BUG_ON(!rcu_read_lock_held());
+
+	do {
+		new_flags = old_flags = READ_ONCE(folio->flags);
+
+		/* for shrink_page_list() */
+		if (!(new_flags & LRU_GEN_MASK)) {
+			new_flags |= BIT(PG_referenced);
+			continue;
+		}
+
+		new_flags &= ~LRU_GEN_MASK;
+		new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
+		new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
+	} while (new_flags != old_flags &&
+		 cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
+
+	return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+}
+
 static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
 {
 	unsigned long old_flags, new_flags;
@@ -3239,6 +3269,10 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
 		VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
 
 		new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+		/* folio_update_gen() has promoted this page? */
+		if (new_gen >= 0 && new_gen != old_gen)
+			return new_gen;
+
 		new_gen = (old_gen + 1) % MAX_NR_GENS;
 
 		new_flags &= ~LRU_GEN_MASK;
@@ -3434,6 +3468,122 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
 }
 
+/*
+ * This function exploits spatial locality when shrink_page_list() walks the
+ * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages.
+ */
+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+{
+	int i;
+	pte_t *pte;
+	unsigned long start;
+	unsigned long end;
+	unsigned long addr;
+	struct folio *folio;
+	unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {};
+	struct mem_cgroup *memcg = page_memcg(pvmw->page);
+	struct pglist_data *pgdat = page_pgdat(pvmw->page);
+	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	DEFINE_MAX_SEQ(lruvec);
+	int old_gen, new_gen = lru_gen_from_seq(max_seq);
+
+	lockdep_assert_held(pvmw->ptl);
+	VM_BUG_ON_PAGE(PageLRU(pvmw->page), pvmw->page);
+
+	start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start);
+	end = pmd_addr_end(pvmw->address, pvmw->vma->vm_end);
+
+	if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
+		if (pvmw->address - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
+			end = start + MIN_LRU_BATCH * PAGE_SIZE;
+		else if (end - pvmw->address < MIN_LRU_BATCH * PAGE_SIZE / 2)
+			start = end - MIN_LRU_BATCH * PAGE_SIZE;
+		else {
+			start = pvmw->address - MIN_LRU_BATCH * PAGE_SIZE / 2;
+			end = pvmw->address + MIN_LRU_BATCH * PAGE_SIZE / 2;
+		}
+	}
+
+	pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE;
+
+	rcu_read_lock();
+	arch_enter_lazy_mmu_mode();
+
+	for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
+		unsigned long pfn = pte_pfn(pte[i]);
+
+		VM_BUG_ON(addr < pvmw->vma->vm_start || addr >= pvmw->vma->vm_end);
+
+		if (!pte_present(pte[i]) || is_zero_pfn(pfn))
+			continue;
+
+		if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i])))
+			continue;
+
+		if (!pte_young(pte[i]))
+			continue;
+
+		VM_BUG_ON(!pfn_valid(pfn));
+		if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+			continue;
+
+		folio = pfn_folio(pfn);
+		if (folio_nid(folio) != pgdat->node_id)
+			continue;
+
+		if (folio_memcg_rcu(folio) != memcg)
+			continue;
+
+		if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
+			continue;
+
+		if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
+		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+		      !folio_test_swapcache(folio)))
+			folio_mark_dirty(folio);
+
+		old_gen = folio_lru_gen(folio);
+		if (old_gen < 0)
+			folio_set_referenced(folio);
+		else if (old_gen != new_gen)
+			__set_bit(i, bitmap);
+	}
+
+	arch_leave_lazy_mmu_mode();
+	rcu_read_unlock();
+
+	if (bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
+		for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
+			folio = page_folio(pte_page(pte[i]));
+			folio_activate(folio);
+		}
+		return;
+	}
+
+	/* folio_update_gen() requires stable folio_memcg() */
+	if (!mem_cgroup_trylock_pages(memcg))
+		return;
+
+	spin_lock_irq(&lruvec->lru_lock);
+	new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
+
+	for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
+		folio = page_folio(pte_page(pte[i]));
+		if (folio_memcg_rcu(folio) != memcg)
+			continue;
+
+		old_gen = folio_update_gen(folio, new_gen);
+		if (old_gen < 0 || old_gen == new_gen)
+			continue;
+
+		lru_gen_balance_size(lruvec, folio, old_gen, new_gen);
+	}
+
+	spin_unlock_irq(&lruvec->lru_lock);
+
+	mem_cgroup_unlock_pages();
+}
+
 /******************************************************************************
  *                          the eviction
  ******************************************************************************/
@@ -3467,6 +3617,11 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
 		return true;
 	}
 
+	if (gen != lru_gen_from_seq(lrugen->min_seq[type])) {
+		list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
+		return true;
+	}
+
 	if (tier > tier_idx) {
 		int hist = lru_hist_from_seq(lrugen->min_seq[type]);
 
-- 
2.35.0.263.gb82422642f-goog



^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v7 07/12] mm: multigenerational LRU: support page table walks
  2022-02-08  8:18 [PATCH v7 00/12] Multigenerational LRU Framework Yu Zhao
                   ` (5 preceding siblings ...)
  2022-02-08  8:18 ` [PATCH v7 06/12] mm: multigenerational LRU: exploit locality in rmap Yu Zhao
@ 2022-02-08  8:18 ` Yu Zhao
  2022-02-08  8:39   ` Yu Zhao
  2022-02-08  8:18 ` [PATCH v7 08/12] mm: multigenerational LRU: optimize multiple memcgs Yu Zhao
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Yu Zhao, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

To avoid confusion, the term "iteration" specifically means the
traversal of an entire mm_struct list; the term "walk" will be applied
to page tables and the rmap, as usual.

To further exploit spatial locality, the aging prefers to walk page
tables to search for young PTEs and promote hot pages. A runtime
switch will be added in the next patch to enable or disable this
feature. Without it, the aging relies on the rmap only.

NB: this feature has nothing in common with the page table scanning in
the 2.4 kernel [1], which searches page tables for old PTEs, adds cold
pages to the swapcache and unmaps them.

An mm_struct list is maintained for each memcg, and an mm_struct
follows its owner task to the new memcg when this task is migrated.
Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot
pages before it increments max_seq.

When multiple page table walkers (threads) iterate the same list, each
of them gets a unique mm_struct; therefore they can run concurrently.
Page table walkers ignore any misplaced pages, e.g., if an mm_struct
was migrated, pages it left in the previous memcg won't be promoted
when its current memcg is under reclaim. Similarly, page table walkers
won't promote pages from nodes other than the one under reclaim.
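
The iteration can be summarized by the following simplified sketch
(hypothetical helpers, no locking or batching; the real implementation
is in the diff below):

  /*
   * Sketch only: each walker takes the next unused mm_struct off the
   * per-memcg list and walks its page tables; once the end of the list
   * is reached and all walkers have finished, max_seq is incremented.
   */
  static void age_lruvec_sketch(struct lruvec *lruvec, unsigned long max_seq)
  {
          struct mm_struct *mm;

          /* next_mm_on_list() is hypothetical; each caller gets a unique mm */
          while ((mm = next_mm_on_list(lruvec, max_seq))) {
                  walk_page_range(mm, FIRST_USER_ADDRESS, mm->highest_vm_end,
                                  &mm_walk_ops, NULL);
                  mmput(mm);
          }

          /* last_walker_done() and increment_max_seq() are hypothetical */
          if (last_walker_done(lruvec, max_seq))
                  increment_max_seq(lruvec);
  }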

This patch uses the following optimizations when walking page tables:
1) It tracks the usage of mm_struct's between context switches so that
   page table walkers can skip processes that have been sleeping since
   the last iteration.
2) It uses generational Bloom filters to record populated branches so
   that page table walkers can reduce their search space based on the
   query results, e.g., to skip page tables containing mostly holes or
   misplaced pages (see the sketch after this list).
3) It takes advantage of the accessed bit in non-leaf PMD entries when
   CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4) It doesn't zigzag between a PGD table and the same PMD table
   spanning multiple VMAs. IOW, it finishes all the VMAs within the
   range of the same PMD table before it returns to a PGD table. This
   improves the cache performance for workloads that have large
   numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
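
As a rough illustration of item 2, a page table walker consults the
current filter and feeds the next one; test_bloom_filter() and
update_bloom_filter() are added by this patch, the rest is a sketch:

  /*
   * Sketch only: filter (max_seq % 2) is read during the current
   * iteration; filter ((max_seq + 1) % 2) was cleared when this
   * iteration started and collects the PMD entries worth revisiting in
   * the next iteration.
   */
  static void probe_pmd_entry_sketch(struct lruvec *lruvec, unsigned long max_seq,
                                     pmd_t *pmd, unsigned long addr,
                                     unsigned long next, struct mm_walk *walk)
  {
          if (!test_bloom_filter(lruvec, max_seq, pmd))
                  return;         /* likely mostly holes or misplaced pages */

          /* walk_pte_range() (added below) returns true if the PTE table
             was worth scanning, i.e., it had enough young entries */
          if (walk_pte_range(pmd, addr, next, walk))
                  update_bloom_filter(lruvec, max_seq + 1, pmd);  /* carry over */
  }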

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[5.5, 7.5]%
                Ops/sec      KB/sec
      patch1-6: 1015292.83   39490.38
      patch1-7: 1080856.82   42040.53

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-6
      45.49%  lzo1x_1_do_compress (real work)
       7.38%  page_vma_mapped_walk
       7.24%  _raw_spin_unlock_irq
       2.64%  ptep_clear_flush
       2.31%  __zram_bvec_write
       2.13%  do_raw_spin_lock
       2.09%  lru_gen_look_around
       1.89%  free_unref_page_list
       1.85%  memmove
       1.74%  obj_malloc

    patch1-7
      47.73%  lzo1x_1_do_compress (real work)
       6.84%  page_vma_mapped_walk
       6.14%  _raw_spin_unlock_irq
       2.86%  walk_pte_range
       2.79%  ptep_clear_flush
       2.24%  __zram_bvec_write
       2.10%  do_raw_spin_lock
       1.94%  free_unref_page_list
       1.80%  memmove
       1.75%  obj_malloc

  Configurations:
    no change

[1] https://lwn.net/Articles/23732/
[2] https://source.android.com/devices/tech/debug/scudo

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
---
 fs/exec.c                  |   2 +
 include/linux/memcontrol.h |   5 +
 include/linux/mm_types.h   |  78 +++
 include/linux/mmzone.h     |  58 +++
 include/linux/swap.h       |   4 +
 kernel/exit.c              |   1 +
 kernel/fork.c              |   9 +
 kernel/sched/core.c        |   1 +
 mm/memcontrol.c            |  24 +
 mm/vmscan.c                | 963 ++++++++++++++++++++++++++++++++++++-
 10 files changed, 1132 insertions(+), 13 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 79f2c9483302..7a69046e9fd8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1006,6 +1006,7 @@ static int exec_mmap(struct mm_struct *mm)
 	active_mm = tsk->active_mm;
 	tsk->active_mm = mm;
 	tsk->mm = mm;
+	lru_gen_add_mm(mm);
 	/*
 	 * This prevents preemption while active_mm is being loaded and
 	 * it and mm are being updated, which could cause problems for
@@ -1016,6 +1017,7 @@ static int exec_mmap(struct mm_struct *mm)
 	if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
 		local_irq_enable();
 	activate_mm(active_mm, mm);
+	lru_gen_use_mm(mm);
 	if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
 		local_irq_enable();
 	tsk->mm->vmacache_seqnum = 0;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 51c9bc8e965d..2f0d8e912cfe 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -342,6 +342,11 @@ struct mem_cgroup {
 	struct deferred_split deferred_split_queue;
 #endif
 
+#ifdef CONFIG_LRU_GEN
+	/* per-memcg mm_struct list */
+	struct lru_gen_mm_list mm_list;
+#endif
+
 	struct mem_cgroup_per_node *nodeinfo[];
 };
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5140e5feb486..8d2cdbbdd467 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -3,6 +3,7 @@
 #define _LINUX_MM_TYPES_H
 
 #include <linux/mm_types_task.h>
+#include <linux/sched.h>
 
 #include <linux/auxvec.h>
 #include <linux/kref.h>
@@ -17,6 +18,8 @@
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
+#include <linux/nodemask.h>
+#include <linux/mmdebug.h>
 
 #include <asm/mmu.h>
 
@@ -634,6 +637,22 @@ struct mm_struct {
 #ifdef CONFIG_IOMMU_SUPPORT
 		u32 pasid;
 #endif
+#ifdef CONFIG_LRU_GEN
+		struct {
+			/* this mm_struct is on lru_gen_mm_list */
+			struct list_head list;
+#ifdef CONFIG_MEMCG
+			/* points to the memcg of "owner" above */
+			struct mem_cgroup *memcg;
+#endif
+			/*
+			 * Set when switching to this mm_struct, as a hint of
+			 * whether it has been used since the last time per-node
+			 * page table walkers cleared the corresponding bits.
+			 */
+			nodemask_t nodes;
+		} lru_gen;
+#endif /* CONFIG_LRU_GEN */
 	} __randomize_layout;
 
 	/*
@@ -660,6 +679,65 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 	return (struct cpumask *)&mm->cpu_bitmap;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+struct lru_gen_mm_list {
+	/* mm_struct list for page table walkers */
+	struct list_head fifo;
+	/* protects the list above */
+	spinlock_t lock;
+};
+
+void lru_gen_add_mm(struct mm_struct *mm);
+void lru_gen_del_mm(struct mm_struct *mm);
+#ifdef CONFIG_MEMCG
+void lru_gen_migrate_mm(struct mm_struct *mm);
+#endif
+
+static inline void lru_gen_init_mm(struct mm_struct *mm)
+{
+	INIT_LIST_HEAD(&mm->lru_gen.list);
+#ifdef CONFIG_MEMCG
+	mm->lru_gen.memcg = NULL;
+#endif
+	nodes_clear(mm->lru_gen.nodes);
+}
+
+static inline void lru_gen_use_mm(struct mm_struct *mm)
+{
+	/* unlikely but not a bug when racing with lru_gen_migrate_mm() */
+	VM_WARN_ON(list_empty(&mm->lru_gen.list));
+
+	if (!(current->flags & PF_KTHREAD) && !nodes_full(mm->lru_gen.nodes))
+		nodes_setall(mm->lru_gen.nodes);
+}
+
+#else /* !CONFIG_LRU_GEN */
+
+static inline void lru_gen_add_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_del_mm(struct mm_struct *mm)
+{
+}
+
+#ifdef CONFIG_MEMCG
+static inline void lru_gen_migrate_mm(struct mm_struct *mm)
+{
+}
+#endif
+
+static inline void lru_gen_init_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_use_mm(struct mm_struct *mm)
+{
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3d6ea30a2bdb..fa0a7a84ee58 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -386,6 +386,58 @@ struct lru_gen_struct {
 	bool enabled;
 };
 
+enum {
+	MM_PTE_TOTAL,	/* total leaf entries */
+	MM_PTE_OLD,	/* old leaf entries */
+	MM_PTE_YOUNG,	/* young leaf entries */
+	MM_PMD_TOTAL,	/* total non-leaf entries */
+	MM_PMD_FOUND,	/* non-leaf entries found in Bloom filters */
+	MM_PMD_ADDED,	/* non-leaf entries added to Bloom filters */
+	NR_MM_STATS
+};
+
+/* mnemonic codes for the mm stats above */
+#define MM_STAT_CODES		"toydfa"
+
+/* double-buffering Bloom filters */
+#define NR_BLOOM_FILTERS	2
+
+struct lru_gen_mm_state {
+	/* set to max_seq after each iteration */
+	unsigned long seq;
+	/* where the current iteration starts (inclusive) */
+	struct list_head *head;
+	/* where the last iteration ends (exclusive) */
+	struct list_head *tail;
+	/* to wait for the last page table walker to finish */
+	struct wait_queue_head wait;
+	/* Bloom filters flip after each iteration */
+	unsigned long *filters[NR_BLOOM_FILTERS];
+	/* the mm stats for debugging */
+	unsigned long stats[NR_HIST_GENS][NR_MM_STATS];
+	/* the number of concurrent page table walkers */
+	int nr_walkers;
+};
+
+struct lru_gen_mm_walk {
+	/* the lruvec under reclaim */
+	struct lruvec *lruvec;
+	/* unstable max_seq from lru_gen_struct */
+	unsigned long max_seq;
+	/* the next address within an mm to scan */
+	unsigned long next_addr;
+	/* to batch page table entries */
+	unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)];
+	/* to batch promoted pages */
+	int nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	/* to batch the mm stats */
+	int mm_stats[NR_MM_STATS];
+	/* total batched items */
+	int batched;
+	bool can_swap;
+	bool full_scan;
+};
+
 void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec);
 void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
 
@@ -436,6 +488,8 @@ struct lruvec {
 #ifdef CONFIG_LRU_GEN
 	/* evictable pages divided into generations */
 	struct lru_gen_struct		lrugen;
+	/* to concurrently iterate lru_gen_mm_list */
+	struct lru_gen_mm_state		mm_state;
 #endif
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
@@ -1028,6 +1082,10 @@ typedef struct pglist_data {
 
 	unsigned long		flags;
 
+#ifdef CONFIG_LRU_GEN
+	/* kswapd mm walk data */
+	struct lru_gen_mm_walk	mm_walk;
+#endif
 	ZONE_PADDING(_pad2_)
 
 	/* Per-node vmstats */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b37520d3ff1d..04d84ac6d1ac 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -137,6 +137,10 @@ union swap_header {
  */
 struct reclaim_state {
 	unsigned long reclaimed_slab;
+#ifdef CONFIG_LRU_GEN
+	/* per-thread mm walk data */
+	struct lru_gen_mm_walk *mm_walk;
+#endif
 };
 
 #ifdef __KERNEL__
diff --git a/kernel/exit.c b/kernel/exit.c
index b00a25bb4ab9..54d2ce4b93d1 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -463,6 +463,7 @@ void mm_update_next_owner(struct mm_struct *mm)
 		goto retry;
 	}
 	WRITE_ONCE(mm->owner, c);
+	lru_gen_migrate_mm(mm);
 	task_unlock(c);
 	put_task_struct(c);
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index d75a528f7b21..8dcf6c37b918 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1079,6 +1079,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 		goto fail_nocontext;
 
 	mm->user_ns = get_user_ns(user_ns);
+	lru_gen_init_mm(mm);
 	return mm;
 
 fail_nocontext:
@@ -1121,6 +1122,7 @@ static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+	lru_gen_del_mm(mm);
 	mmdrop(mm);
 }
 
@@ -2576,6 +2578,13 @@ pid_t kernel_clone(struct kernel_clone_args *args)
 		get_task_struct(p);
 	}
 
+	if (IS_ENABLED(CONFIG_LRU_GEN) && !(clone_flags & CLONE_VM)) {
+		/* lock the task to synchronize with memcg migration */
+		task_lock(p);
+		lru_gen_add_mm(p->mm);
+		task_unlock(p);
+	}
+
 	wake_up_new_task(p);
 
 	/* forking complete and child started to run, tell ptracer */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 848eaa0efe0e..e5fcfd4557ad 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4970,6 +4970,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 		 * finish_task_switch()'s mmdrop().
 		 */
 		switch_mm_irqs_off(prev->active_mm, next->mm, next);
+		lru_gen_use_mm(next->mm);
 
 		if (!prev->mm) {                        // from kernel
 			/* will mmdrop() in finish_task_switch(). */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 74373df19d84..662e652f85ba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6155,6 +6155,29 @@ static void mem_cgroup_move_task(void)
 }
 #endif
 
+#ifdef CONFIG_LRU_GEN
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+	struct cgroup_subsys_state *css;
+	struct task_struct *task = NULL;
+
+	cgroup_taskset_for_each_leader(task, css, tset)
+		break;
+
+	if (!task)
+		return;
+
+	task_lock(task);
+	if (task->mm && task->mm->owner == task)
+		lru_gen_migrate_mm(task->mm);
+	task_unlock(task);
+}
+#else
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
 static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value)
 {
 	if (value == PAGE_COUNTER_MAX)
@@ -6500,6 +6523,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.css_reset = mem_cgroup_css_reset,
 	.css_rstat_flush = mem_cgroup_css_rstat_flush,
 	.can_attach = mem_cgroup_can_attach,
+	.attach = mem_cgroup_attach,
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.post_attach = mem_cgroup_move_task,
 	.dfl_cftypes = memory_files,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 933d46ae2f68..5ab6cd332fcc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -50,6 +50,8 @@
 #include <linux/printk.h>
 #include <linux/dax.h>
 #include <linux/psi.h>
+#include <linux/pagewalk.h>
+#include <linux/shmem_fs.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -3135,6 +3137,371 @@ static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
 	       get_nr_gens(lruvec, TYPE_ANON) <= MAX_NR_GENS;
 }
 
+/******************************************************************************
+ *                          mm_struct list
+ ******************************************************************************/
+
+static struct lru_gen_mm_list *get_mm_list(struct mem_cgroup *memcg)
+{
+	static struct lru_gen_mm_list mm_list = {
+		.fifo = LIST_HEAD_INIT(mm_list.fifo),
+		.lock = __SPIN_LOCK_UNLOCKED(mm_list.lock),
+	};
+
+#ifdef CONFIG_MEMCG
+	if (memcg)
+		return &memcg->mm_list;
+#endif
+	return &mm_list;
+}
+
+void lru_gen_add_mm(struct mm_struct *mm)
+{
+	int nid;
+	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+
+	VM_BUG_ON_MM(!list_empty(&mm->lru_gen.list), mm);
+#ifdef CONFIG_MEMCG
+	VM_BUG_ON_MM(mm->lru_gen.memcg, mm);
+	mm->lru_gen.memcg = memcg;
+#endif
+	spin_lock(&mm_list->lock);
+
+	list_add_tail(&mm->lru_gen.list, &mm_list->fifo);
+
+	for_each_node(nid) {
+		struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+		if (!lruvec)
+			continue;
+
+		if (lruvec->mm_state.tail == &mm_list->fifo)
+			lruvec->mm_state.tail = lruvec->mm_state.tail->prev;
+	}
+
+	spin_unlock(&mm_list->lock);
+}
+
+void lru_gen_del_mm(struct mm_struct *mm)
+{
+	int nid;
+	struct lru_gen_mm_list *mm_list;
+	struct mem_cgroup *memcg = NULL;
+
+	if (list_empty(&mm->lru_gen.list))
+		return;
+
+#ifdef CONFIG_MEMCG
+	memcg = mm->lru_gen.memcg;
+#endif
+	mm_list = get_mm_list(memcg);
+
+	spin_lock(&mm_list->lock);
+
+	for_each_node(nid) {
+		struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+		if (!lruvec)
+			continue;
+
+		if (lruvec->mm_state.tail == &mm->lru_gen.list)
+			lruvec->mm_state.tail = lruvec->mm_state.tail->next;
+
+		if (lruvec->mm_state.head != &mm->lru_gen.list)
+			continue;
+
+		lruvec->mm_state.head = lruvec->mm_state.head->next;
+		if (lruvec->mm_state.head == &mm_list->fifo)
+			WRITE_ONCE(lruvec->mm_state.seq, lruvec->mm_state.seq + 1);
+	}
+
+	list_del_init(&mm->lru_gen.list);
+
+	spin_unlock(&mm_list->lock);
+
+#ifdef CONFIG_MEMCG
+	mem_cgroup_put(mm->lru_gen.memcg);
+	mm->lru_gen.memcg = NULL;
+#endif
+}
+
+#ifdef CONFIG_MEMCG
+void lru_gen_migrate_mm(struct mm_struct *mm)
+{
+	struct mem_cgroup *memcg;
+
+	lockdep_assert_held(&mm->owner->alloc_lock);
+
+	/* for mm_update_next_owner() */
+	if (mem_cgroup_disabled())
+		return;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(mm->owner);
+	rcu_read_unlock();
+	if (memcg == mm->lru_gen.memcg)
+		return;
+
+	VM_BUG_ON_MM(!mm->lru_gen.memcg, mm);
+	VM_BUG_ON_MM(list_empty(&mm->lru_gen.list), mm);
+
+	lru_gen_del_mm(mm);
+	lru_gen_add_mm(mm);
+}
+#endif
+
+/*
+ * Bloom filters with m=1<<15, k=2 and the false positive rates of ~1/5 when
+ * n=10,000 and ~1/2 when n=20,000, where, conventionally, m is the number of
+ * bits in a bitmap, k is the number of hash functions and n is the number of
+ * inserted items.
+ *
+ * Page table walkers use one of the two filters to reduce their search space.
+ * To get rid of non-leaf entries that no longer have enough leaf entries, the
+ * aging uses the double-buffering technique to flip to the other filter each
+ * time it produces a new generation. For non-leaf entries that have enough
+ * leaf entries, the aging carries them over to the next generation in
+ * walk_pmd_range(); the eviction also reports them when walking the rmap
+ * in lru_gen_look_around().
+ *
+ * For future optimizations:
+ * 1) It's not necessary to keep both filters all the time. The spare one can be
+ *    freed after the RCU grace period and reallocated if needed again.
+ * 2) And when reallocating, it's worth scaling its size according to the number
+ *    of inserted entries in the other filter, to reduce the memory overhead on
+ *    small systems and false positives on large systems.
+ * 3) Jenkins' hash function is an alternative to Knuth's.
+ */
+#define BLOOM_FILTER_SHIFT	15
+
+static inline int filter_gen_from_seq(unsigned long seq)
+{
+	return seq % NR_BLOOM_FILTERS;
+}
+
+static void get_item_key(void *item, int *key)
+{
+	u32 hash = hash_ptr(item, BLOOM_FILTER_SHIFT * 2);
+
+	BUILD_BUG_ON(BLOOM_FILTER_SHIFT * 2 > BITS_PER_TYPE(u32));
+
+	key[0] = hash & (BIT(BLOOM_FILTER_SHIFT) - 1);
+	key[1] = hash >> BLOOM_FILTER_SHIFT;
+}
+
+static void reset_bloom_filter(struct lruvec *lruvec, unsigned long seq)
+{
+	unsigned long *filter;
+	int gen = filter_gen_from_seq(seq);
+
+	lockdep_assert_held(&get_mm_list(lruvec_memcg(lruvec))->lock);
+
+	filter = lruvec->mm_state.filters[gen];
+	if (filter) {
+		bitmap_clear(filter, 0, BIT(BLOOM_FILTER_SHIFT));
+		return;
+	}
+
+	filter = bitmap_zalloc(BIT(BLOOM_FILTER_SHIFT), GFP_ATOMIC);
+	WRITE_ONCE(lruvec->mm_state.filters[gen], filter);
+}
+
+static void update_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item)
+{
+	int key[2];
+	unsigned long *filter;
+	int gen = filter_gen_from_seq(seq);
+
+	filter = READ_ONCE(lruvec->mm_state.filters[gen]);
+	if (!filter)
+		return;
+
+	get_item_key(item, key);
+
+	if (!test_bit(key[0], filter))
+		set_bit(key[0], filter);
+	if (!test_bit(key[1], filter))
+		set_bit(key[1], filter);
+}
+
+static bool test_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item)
+{
+	int key[2];
+	unsigned long *filter;
+	int gen = filter_gen_from_seq(seq);
+
+	filter = READ_ONCE(lruvec->mm_state.filters[gen]);
+	if (!filter)
+		return true;
+
+	get_item_key(item, key);
+
+	return test_bit(key[0], filter) && test_bit(key[1], filter);
+}
+
+static void reset_mm_stats(struct lruvec *lruvec, struct lru_gen_mm_walk *walk, bool last)
+{
+	int i;
+	int hist;
+
+	lockdep_assert_held(&get_mm_list(lruvec_memcg(lruvec))->lock);
+
+	if (walk) {
+		hist = lru_hist_from_seq(walk->max_seq);
+
+		for (i = 0; i < NR_MM_STATS; i++) {
+			WRITE_ONCE(lruvec->mm_state.stats[hist][i],
+				   lruvec->mm_state.stats[hist][i] + walk->mm_stats[i]);
+			walk->mm_stats[i] = 0;
+		}
+	}
+
+	if (NR_HIST_GENS > 1 && last) {
+		hist = lru_hist_from_seq(lruvec->mm_state.seq + 1);
+
+		for (i = 0; i < NR_MM_STATS; i++)
+			WRITE_ONCE(lruvec->mm_state.stats[hist][i], 0);
+	}
+}
+
+static bool should_skip_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+	int type;
+	unsigned long size = 0;
+	struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
+
+	if (!walk->full_scan && cpumask_empty(mm_cpumask(mm)) &&
+	    !node_isset(pgdat->node_id, mm->lru_gen.nodes))
+		return true;
+
+	node_clear(pgdat->node_id, mm->lru_gen.nodes);
+
+	for (type = !walk->can_swap; type < ANON_AND_FILE; type++) {
+		size += type ? get_mm_counter(mm, MM_FILEPAGES) :
+			       get_mm_counter(mm, MM_ANONPAGES) +
+			       get_mm_counter(mm, MM_SHMEMPAGES);
+	}
+
+	if (size < MIN_LRU_BATCH)
+		return true;
+
+	if (mm_is_oom_victim(mm))
+		return true;
+
+	return !mmget_not_zero(mm);
+}
+
+static bool iterate_mm_list(struct lruvec *lruvec, struct lru_gen_mm_walk *walk,
+			    struct mm_struct **iter)
+{
+	bool first = false;
+	bool last = true;
+	struct mm_struct *mm = NULL;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+	struct lru_gen_mm_state *mm_state = &lruvec->mm_state;
+
+	/*
+	 * There are four interesting cases for this page table walker:
+	 * 1) It tries to start a new iteration of mm_list with a stale max_seq;
+	 *    there is nothing to be done.
+	 * 2) It's the first of the current generation, and it needs to reset
+	 *    the Bloom filter for the next generation.
+	 * 3) It reaches the end of mm_list, and it needs to increment
+	 *    mm_state->seq; the iteration is done.
+	 * 4) It's the last of the current generation, and it needs to reset the
+	 *    mm stats counters for the next generation.
+	 */
+	if (*iter)
+		mmput_async(*iter);
+	else if (walk->max_seq <= READ_ONCE(mm_state->seq))
+		return false;
+
+	spin_lock(&mm_list->lock);
+
+	VM_BUG_ON(walk->max_seq > mm_state->seq + 1);
+	VM_BUG_ON(*iter && walk->max_seq < mm_state->seq);
+	VM_BUG_ON(*iter && !mm_state->nr_walkers);
+
+	if (walk->max_seq <= mm_state->seq) {
+		if (!*iter)
+			last = false;
+		goto done;
+	}
+
+	if (mm_state->head == &mm_list->fifo) {
+		VM_BUG_ON(mm_state->nr_walkers);
+		mm_state->head = mm_state->head->next;
+		first = true;
+	}
+
+	while (!mm && mm_state->head != &mm_list->fifo) {
+		mm = list_entry(mm_state->head, struct mm_struct, lru_gen.list);
+
+		mm_state->head = mm_state->head->next;
+
+		/* full scan for those added after the last iteration */
+		if (mm_state->tail == &mm->lru_gen.list) {
+			mm_state->tail = mm_state->tail->next;
+			walk->full_scan = true;
+		}
+
+		if (should_skip_mm(mm, walk))
+			mm = NULL;
+	}
+
+	if (mm_state->head == &mm_list->fifo)
+		WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+done:
+	if (*iter && !mm)
+		mm_state->nr_walkers--;
+	if (!*iter && mm)
+		mm_state->nr_walkers++;
+
+	if (mm_state->nr_walkers)
+		last = false;
+
+	if (mm && first)
+		reset_bloom_filter(lruvec, walk->max_seq + 1);
+
+	if (*iter || last)
+		reset_mm_stats(lruvec, walk, last);
+
+	spin_unlock(&mm_list->lock);
+
+	*iter = mm;
+
+	return last;
+}
+
+static bool iterate_mm_list_nowalk(struct lruvec *lruvec, unsigned long max_seq)
+{
+	bool success = false;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+	struct lru_gen_mm_state *mm_state = &lruvec->mm_state;
+
+	if (max_seq <= READ_ONCE(mm_state->seq))
+		return false;
+
+	spin_lock(&mm_list->lock);
+
+	VM_BUG_ON(max_seq > mm_state->seq + 1);
+
+	if (max_seq > mm_state->seq && !mm_state->nr_walkers) {
+		VM_BUG_ON(mm_state->head != &mm_list->fifo);
+
+		WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+		reset_mm_stats(lruvec, NULL, true);
+		success = true;
+	}
+
+	spin_unlock(&mm_list->lock);
+
+	return success;
+}
+
 /******************************************************************************
  *                          refault feedback loop
  ******************************************************************************/
@@ -3288,6 +3655,465 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
 	return new_gen;
 }
 
+static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio,
+			      int old_gen, int new_gen)
+{
+	int type = folio_is_file_lru(folio);
+	int zone = folio_zonenum(folio);
+	int delta = folio_nr_pages(folio);
+
+	VM_BUG_ON(old_gen >= MAX_NR_GENS);
+	VM_BUG_ON(new_gen >= MAX_NR_GENS);
+
+	walk->batched++;
+
+	walk->nr_pages[old_gen][type][zone] -= delta;
+	walk->nr_pages[new_gen][type][zone] += delta;
+}
+
+static void reset_batch_size(struct lruvec *lruvec, struct lru_gen_mm_walk *walk)
+{
+	int gen, type, zone;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	walk->batched = 0;
+
+	for_each_gen_type_zone(gen, type, zone) {
+		enum lru_list lru = type * LRU_INACTIVE_FILE;
+		int delta = walk->nr_pages[gen][type][zone];
+
+		if (!delta)
+			continue;
+
+		walk->nr_pages[gen][type][zone] = 0;
+		WRITE_ONCE(lrugen->nr_pages[gen][type][zone],
+			   lrugen->nr_pages[gen][type][zone] + delta);
+
+		if (lru_gen_is_active(lruvec, gen))
+			lru += LRU_ACTIVE;
+		lru_gen_update_size(lruvec, lru, zone, delta);
+	}
+}
+
+static int should_skip_vma(unsigned long start, unsigned long end, struct mm_walk *walk)
+{
+	struct address_space *mapping;
+	struct vm_area_struct *vma = walk->vma;
+	struct lru_gen_mm_walk *priv = walk->private;
+
+	if (!vma_is_accessible(vma) || is_vm_hugetlb_page(vma) ||
+	    (vma->vm_flags & (VM_LOCKED | VM_SPECIAL | VM_SEQ_READ | VM_RAND_READ)) ||
+	    vma == get_gate_vma(vma->vm_mm))
+		return true;
+
+	if (vma_is_anonymous(vma))
+		return !priv->can_swap;
+
+	if (WARN_ON_ONCE(!vma->vm_file || !vma->vm_file->f_mapping))
+		return true;
+
+	mapping = vma->vm_file->f_mapping;
+	if (mapping_unevictable(mapping))
+		return true;
+
+	/* check readpage to exclude special mappings like dax, etc. */
+	return shmem_mapping(mapping) ? !priv->can_swap : !mapping->a_ops->readpage;
+}
+
+/*
+ * Some userspace memory allocators map many single-page VMAs. Instead of
+ * returning back to the PGD table for each of such VMAs, finish an entire PMD
+ * table to reduce zigzags and improve cache performance.
+ */
+static bool get_next_vma(struct mm_walk *walk, unsigned long mask, unsigned long size,
+			 unsigned long *start, unsigned long *end)
+{
+	unsigned long next = round_up(*end, size);
+
+	VM_BUG_ON(mask & size);
+	VM_BUG_ON(*start >= *end);
+	VM_BUG_ON((next & mask) != (*start & mask));
+
+	while (walk->vma) {
+		if (next >= walk->vma->vm_end) {
+			walk->vma = walk->vma->vm_next;
+			continue;
+		}
+
+		if ((next & mask) != (walk->vma->vm_start & mask))
+			return false;
+
+		if (should_skip_vma(walk->vma->vm_start, walk->vma->vm_end, walk)) {
+			walk->vma = walk->vma->vm_next;
+			continue;
+		}
+
+		*start = max(next, walk->vma->vm_start);
+		next = (next | ~mask) + 1;
+		/* rounded-up boundaries can wrap to 0 */
+		*end = next && next < walk->vma->vm_end ? next : walk->vma->vm_end;
+
+		return true;
+	}
+
+	return false;
+}
+
+static bool suitable_to_scan(int total, int young)
+{
+	int n = clamp_t(int, cache_line_size() / sizeof(pte_t), 2, 8);
+
+	/* suitable if the average number of young PTEs per cacheline is >=1 */
+	return young * n >= total;
+}
+
+static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
+			   struct mm_walk *walk)
+{
+	int i;
+	pte_t *pte;
+	spinlock_t *ptl;
+	unsigned long addr;
+	int total = 0;
+	int young = 0;
+	struct lru_gen_mm_walk *priv = walk->private;
+	struct mem_cgroup *memcg = lruvec_memcg(priv->lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec);
+	int old_gen, new_gen = lru_gen_from_seq(priv->max_seq);
+
+	VM_BUG_ON(pmd_leaf(*pmd));
+
+	pte = pte_offset_map_lock(walk->mm, pmd, start & PMD_MASK, &ptl);
+	arch_enter_lazy_mmu_mode();
+restart:
+	for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
+		struct folio *folio;
+		unsigned long pfn = pte_pfn(pte[i]);
+
+		VM_BUG_ON(addr < walk->vma->vm_start || addr >= walk->vma->vm_end);
+
+		total++;
+		priv->mm_stats[MM_PTE_TOTAL]++;
+
+		if (!pte_present(pte[i]) || is_zero_pfn(pfn))
+			continue;
+
+		if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i])))
+			continue;
+
+		if (!pte_young(pte[i])) {
+			priv->mm_stats[MM_PTE_OLD]++;
+			continue;
+		}
+
+		VM_BUG_ON(!pfn_valid(pfn));
+		if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+			continue;
+
+		folio = pfn_folio(pfn);
+		if (folio_nid(folio) != pgdat->node_id)
+			continue;
+
+		if (folio_memcg_rcu(folio) != memcg)
+			continue;
+
+		if (!ptep_test_and_clear_young(walk->vma, addr, pte + i))
+			continue;
+
+		young++;
+		priv->mm_stats[MM_PTE_YOUNG]++;
+
+		if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
+		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+		      !folio_test_swapcache(folio)))
+			folio_mark_dirty(folio);
+
+		old_gen = folio_update_gen(folio, new_gen);
+		if (old_gen >= 0 && old_gen != new_gen)
+			update_batch_size(priv, folio, old_gen, new_gen);
+	}
+
+	if (i < PTRS_PER_PTE && get_next_vma(walk, PMD_MASK, PAGE_SIZE, &start, &end))
+		goto restart;
+
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(pte, ptl);
+
+	return suitable_to_scan(total, young);
+}
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
+static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma,
+				  struct mm_walk *walk, unsigned long *start)
+{
+	int i;
+	pmd_t *pmd;
+	spinlock_t *ptl;
+	struct lru_gen_mm_walk *priv = walk->private;
+	struct mem_cgroup *memcg = lruvec_memcg(priv->lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec);
+	int old_gen, new_gen = lru_gen_from_seq(priv->max_seq);
+
+	VM_BUG_ON(pud_leaf(*pud));
+
+	/* try to batch at most 1+MIN_LRU_BATCH+1 entries */
+	if (*start == -1) {
+		*start = next;
+		return;
+	}
+
+	i = next == -1 ? 0 : pmd_index(next) - pmd_index(*start);
+	if (i && i <= MIN_LRU_BATCH) {
+		__set_bit(i - 1, priv->bitmap);
+		return;
+	}
+
+	pmd = pmd_offset(pud, *start);
+	ptl = pmd_lock(walk->mm, pmd);
+	arch_enter_lazy_mmu_mode();
+
+	do {
+		struct folio *folio;
+		unsigned long pfn = pmd_pfn(pmd[i]);
+		unsigned long addr = i ? (*start & PMD_MASK) + i * PMD_SIZE : *start;
+
+		VM_BUG_ON(addr < vma->vm_start || addr >= vma->vm_end);
+
+		if (!pmd_present(pmd[i]) || is_huge_zero_pmd(pmd[i]))
+			goto next;
+
+		if (WARN_ON_ONCE(pmd_devmap(pmd[i])))
+			goto next;
+
+		if (!pmd_trans_huge(pmd[i])) {
+			if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG))
+				pmdp_test_and_clear_young(vma, addr, pmd + i);
+			goto next;
+		}
+
+		VM_BUG_ON(!pfn_valid(pfn));
+		if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+			goto next;
+
+		folio = pfn_folio(pfn);
+		if (folio_nid(folio) != pgdat->node_id)
+			goto next;
+
+		if (folio_memcg_rcu(folio) != memcg)
+			goto next;
+
+		if (!pmdp_test_and_clear_young(vma, addr, pmd + i))
+			goto next;
+
+		priv->mm_stats[MM_PTE_YOUNG]++;
+
+		if (pmd_dirty(pmd[i]) && !folio_test_dirty(folio) &&
+		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+		      !folio_test_swapcache(folio)))
+			folio_mark_dirty(folio);
+
+		old_gen = folio_update_gen(folio, new_gen);
+		if (old_gen >= 0 && old_gen != new_gen)
+			update_batch_size(priv, folio, old_gen, new_gen);
+next:
+		i = i > MIN_LRU_BATCH ? 0 :
+		    find_next_bit(priv->bitmap, MIN_LRU_BATCH, i) + 1;
+	} while (i <= MIN_LRU_BATCH);
+
+	arch_leave_lazy_mmu_mode();
+	spin_unlock(ptl);
+
+	*start = -1;
+	bitmap_zero(priv->bitmap, MIN_LRU_BATCH);
+}
+#else
+static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma,
+				  struct mm_walk *walk, unsigned long *start)
+{
+}
+#endif
+
+static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
+			   struct mm_walk *walk)
+{
+	int i;
+	pmd_t *pmd;
+	unsigned long next;
+	unsigned long addr;
+	struct vm_area_struct *vma;
+	unsigned long pos = -1;
+	struct lru_gen_mm_walk *priv = walk->private;
+
+	VM_BUG_ON(pud_leaf(*pud));
+
+	/*
+	 * Finish an entire PMD in two passes: the first only reaches to PTE
+	 * tables to avoid taking the PMD lock; the second, if necessary, takes
+	 * the PMD lock to clear the accessed bit in PMD entries.
+	 */
+	pmd = pmd_offset(pud, start & PUD_MASK);
+restart:
+	/* walk_pte_range() may call get_next_vma() */
+	vma = walk->vma;
+	for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
+		pmd_t val = pmd_read_atomic(pmd + i);
+
+		/* for pmd_read_atomic() */
+		barrier();
+
+		next = pmd_addr_end(addr, end);
+
+		if (!pmd_present(val)) {
+			priv->mm_stats[MM_PTE_TOTAL]++;
+			continue;
+		}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		if (pmd_trans_huge(val)) {
+			unsigned long pfn = pmd_pfn(val);
+			struct pglist_data *pgdat = lruvec_pgdat(priv->lruvec);
+
+			priv->mm_stats[MM_PTE_TOTAL]++;
+
+			if (is_huge_zero_pmd(val))
+				continue;
+
+			if (!pmd_young(val)) {
+				priv->mm_stats[MM_PTE_OLD]++;
+				continue;
+			}
+
+			if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+				continue;
+
+			walk_pmd_range_locked(pud, addr, vma, walk, &pos);
+			continue;
+		}
+#endif
+		priv->mm_stats[MM_PMD_TOTAL]++;
+
+#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
+		if (!pmd_young(val))
+			continue;
+
+		walk_pmd_range_locked(pud, addr, vma, walk, &pos);
+#endif
+		if (!priv->full_scan && !test_bloom_filter(priv->lruvec, priv->max_seq, pmd + i))
+			continue;
+
+		priv->mm_stats[MM_PMD_FOUND]++;
+
+		if (!walk_pte_range(&val, addr, next, walk))
+			continue;
+
+		priv->mm_stats[MM_PMD_ADDED]++;
+
+		/* carry over to the next generation */
+		update_bloom_filter(priv->lruvec, priv->max_seq + 1, pmd + i);
+	}
+
+	walk_pmd_range_locked(pud, -1, vma, walk, &pos);
+
+	if (i < PTRS_PER_PMD && get_next_vma(walk, PUD_MASK, PMD_SIZE, &start, &end))
+		goto restart;
+}
+
+static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
+			  struct mm_walk *walk)
+{
+	int i;
+	pud_t *pud;
+	unsigned long addr;
+	unsigned long next;
+	struct lru_gen_mm_walk *priv = walk->private;
+
+	VM_BUG_ON(p4d_leaf(*p4d));
+
+	pud = pud_offset(p4d, start & P4D_MASK);
+restart:
+	for (i = pud_index(start), addr = start; addr != end; i++, addr = next) {
+		pud_t val = READ_ONCE(pud[i]);
+
+		next = pud_addr_end(addr, end);
+
+		if (!pud_present(val) || WARN_ON_ONCE(pud_leaf(val)))
+			continue;
+
+		walk_pmd_range(&val, addr, next, walk);
+
+		if (priv->batched >= MAX_LRU_BATCH) {
+			end = (addr | ~PUD_MASK) + 1;
+			goto done;
+		}
+	}
+
+	if (i < PTRS_PER_PUD && get_next_vma(walk, P4D_MASK, PUD_SIZE, &start, &end))
+		goto restart;
+
+	end = round_up(end, P4D_SIZE);
+done:
+	/* rounded-up boundaries can wrap to 0 */
+	priv->next_addr = end && walk->vma ? max(end, walk->vma->vm_start) : 0;
+
+	return -EAGAIN;
+}
+
+static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+	static const struct mm_walk_ops mm_walk_ops = {
+		.test_walk = should_skip_vma,
+		.p4d_entry = walk_pud_range,
+	};
+
+	int err;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	walk->next_addr = FIRST_USER_ADDRESS;
+
+	do {
+		err = -EBUSY;
+
+		/* folio_update_gen() requires stable folio_memcg() */
+		if (!mem_cgroup_trylock_pages(memcg))
+			break;
+
+		/* the caller might be holding the lock for write */
+		if (mmap_read_trylock(mm)) {
+			unsigned long start = walk->next_addr;
+			unsigned long end = mm->highest_vm_end;
+
+			err = walk_page_range(mm, start, end, &mm_walk_ops, walk);
+
+			mmap_read_unlock(mm);
+
+			if (walk->batched) {
+				spin_lock_irq(&lruvec->lru_lock);
+				reset_batch_size(lruvec, walk);
+				spin_unlock_irq(&lruvec->lru_lock);
+			}
+		}
+
+		mem_cgroup_unlock_pages();
+
+		cond_resched();
+	} while (err == -EAGAIN && walk->next_addr && !mm_is_oom_victim(mm));
+}
+
+static struct lru_gen_mm_walk *alloc_mm_walk(void)
+{
+	if (current->reclaim_state && current->reclaim_state->mm_walk)
+		return current->reclaim_state->mm_walk;
+
+	return kzalloc(sizeof(struct lru_gen_mm_walk),
+		       __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
+}
+
+static void free_mm_walk(struct lru_gen_mm_walk *walk)
+{
+	if (!current->reclaim_state || !current->reclaim_state->mm_walk)
+		kfree(walk);
+}
+
 static void inc_min_seq(struct lruvec *lruvec)
 {
 	int type;
@@ -3346,7 +4172,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
 	return success;
 }
 
-static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
+static void inc_max_seq(struct lruvec *lruvec)
 {
 	int prev, next;
 	int type, zone;
@@ -3356,9 +4182,6 @@ static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
 
 	VM_BUG_ON(!seq_is_valid(lruvec));
 
-	if (max_seq != lrugen->max_seq)
-		goto unlock;
-
 	inc_min_seq(lruvec);
 
 	/* update the active/inactive LRU sizes for compatibility */
@@ -3385,10 +4208,72 @@ static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
 	WRITE_ONCE(lrugen->timestamps[next], jiffies);
 	/* make sure preceding modifications appear */
 	smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
-unlock:
+
 	spin_unlock_irq(&lruvec->lru_lock);
 }
 
+static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
+			       struct scan_control *sc, bool can_swap, bool full_scan)
+{
+	bool success;
+	struct lru_gen_mm_walk *walk;
+	struct mm_struct *mm = NULL;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_BUG_ON(max_seq > READ_ONCE(lrugen->max_seq));
+
+	/*
+	 * If the hardware doesn't automatically set the accessed bit, fallback
+	 * to lru_gen_look_around(), which only clears the accessed bit in a
+	 * handful of PTEs. Spreading the work out over a period of time usually
+	 * is less efficient, but it avoids bursty page faults.
+	 */
+	if (!full_scan && !arch_has_hw_pte_young()) {
+		success = iterate_mm_list_nowalk(lruvec, max_seq);
+		goto done;
+	}
+
+	walk = alloc_mm_walk();
+	if (!walk) {
+		success = iterate_mm_list_nowalk(lruvec, max_seq);
+		goto done;
+	}
+
+	walk->lruvec = lruvec;
+	walk->max_seq = max_seq;
+	walk->can_swap = can_swap;
+	walk->full_scan = full_scan;
+
+	do {
+		success = iterate_mm_list(lruvec, walk, &mm);
+		if (mm)
+			walk_mm(lruvec, mm, walk);
+
+		cond_resched();
+	} while (mm);
+
+	free_mm_walk(walk);
+done:
+	if (!success) {
+		if (!current_is_kswapd() && !sc->priority)
+			wait_event_killable(lruvec->mm_state.wait,
+					    max_seq < READ_ONCE(lrugen->max_seq));
+
+		return max_seq < READ_ONCE(lrugen->max_seq);
+	}
+
+	VM_BUG_ON(max_seq != READ_ONCE(lrugen->max_seq));
+
+	inc_max_seq(lruvec);
+	/* either this sees any waiters or they will see updated max_seq */
+	if (wq_has_sleeper(&lruvec->mm_state.wait))
+		wake_up_all(&lruvec->mm_state.wait);
+
+	wakeup_flusher_threads(WB_REASON_VMSCAN);
+
+	return true;
+}
+
 static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
 			     unsigned long *min_seq, bool can_swap, bool *need_aging)
 {
@@ -3449,7 +4334,7 @@ static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		nr_to_scan++;
 
 	if (nr_to_scan && need_aging && (!mem_cgroup_below_low(memcg) || sc->memcg_low_reclaim))
-		inc_max_seq(lruvec, max_seq);
+		try_to_inc_max_seq(lruvec, max_seq, sc, swappiness, false);
 }
 
 static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
@@ -3458,6 +4343,8 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 
 	VM_BUG_ON(!current_is_kswapd());
 
+	current->reclaim_state->mm_walk = &pgdat->mm_walk;
+
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
 		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
@@ -3466,11 +4353,16 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 
 		cond_resched();
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+	current->reclaim_state->mm_walk = NULL;
 }
 
 /*
  * This function exploits spatial locality when shrink_page_list() walks the
  * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages.
+ * If the scan was done cacheline efficiently, it adds the PMD entry pointing
+ * to the PTE table to the Bloom filter. This process is a feedback loop from
+ * the eviction to the aging.
  */
 void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 {
@@ -3480,6 +4372,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	unsigned long end;
 	unsigned long addr;
 	struct folio *folio;
+	struct lru_gen_mm_walk *walk;
+	int young = 0;
 	unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {};
 	struct mem_cgroup *memcg = page_memcg(pvmw->page);
 	struct pglist_data *pgdat = page_pgdat(pvmw->page);
@@ -3537,6 +4431,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 		if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
 			continue;
 
+		young++;
+
 		if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
 		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
 		      !folio_test_swapcache(folio)))
@@ -3552,7 +4448,13 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	arch_leave_lazy_mmu_mode();
 	rcu_read_unlock();
 
-	if (bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
+	/* feedback from rmap walkers to page table walkers */
+	if (suitable_to_scan(i, young))
+		update_bloom_filter(lruvec, max_seq, pvmw->pmd);
+
+	walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
+
+	if (!walk && bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
 		for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
 			folio = page_folio(pte_page(pte[i]));
 			folio_activate(folio);
@@ -3564,8 +4466,10 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	if (!mem_cgroup_trylock_pages(memcg))
 		return;
 
-	spin_lock_irq(&lruvec->lru_lock);
-	new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
+	if (!walk) {
+		spin_lock_irq(&lruvec->lru_lock);
+		new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
+	}
 
 	for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
 		folio = page_folio(pte_page(pte[i]));
@@ -3576,10 +4480,14 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 		if (old_gen < 0 || old_gen == new_gen)
 			continue;
 
-		lru_gen_balance_size(lruvec, folio, old_gen, new_gen);
+		if (walk)
+			update_batch_size(walk, folio, old_gen, new_gen);
+		else
+			lru_gen_balance_size(lruvec, folio, old_gen, new_gen);
 	}
 
-	spin_unlock_irq(&lruvec->lru_lock);
+	if (!walk)
+		spin_unlock_irq(&lruvec->lru_lock);
 
 	mem_cgroup_unlock_pages();
 }
@@ -3846,6 +4754,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 	struct folio *folio;
 	enum vm_event_item item;
 	struct reclaim_stat stat;
+	struct lru_gen_mm_walk *walk;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
@@ -3884,6 +4793,10 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 
 	move_pages_to_lru(lruvec, &list);
 
+	walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
+	if (walk && walk->batched)
+		reset_batch_size(lruvec, walk);
+
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, reclaimed);
@@ -3938,9 +4851,10 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
 		return 0;
 	}
 
-	inc_max_seq(lruvec, max_seq);
+	if (try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false))
+		return nr_to_scan;
 
-	return nr_to_scan;
+	return max_seq >= min_seq[TYPE_FILE] + MIN_NR_GENS ? nr_to_scan : 0;
 }
 
 static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
@@ -3948,9 +4862,13 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 	struct blk_plug plug;
 	long scanned = 0;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
 	lru_add_drain();
 
+	if (current_is_kswapd())
+		current->reclaim_state->mm_walk = &pgdat->mm_walk;
+
 	blk_start_plug(&plug);
 
 	while (true) {
@@ -3981,6 +4899,9 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 	}
 
 	blk_finish_plug(&plug);
+
+	if (current_is_kswapd())
+		current->reclaim_state->mm_walk = NULL;
 }
 
 /******************************************************************************
@@ -3992,6 +4913,7 @@ void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec)
 	int i;
 	int gen, type, zone;
 	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
 
 	lrugen->max_seq = MIN_NR_GENS + 1;
 	lrugen->enabled = lru_gen_enabled();
@@ -4001,6 +4923,11 @@ void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec)
 
 	for_each_gen_type_zone(gen, type, zone)
 		INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
+
+	lruvec->mm_state.seq = MIN_NR_GENS;
+	lruvec->mm_state.head = &mm_list->fifo;
+	lruvec->mm_state.tail = &mm_list->fifo;
+	init_waitqueue_head(&lruvec->mm_state.wait);
 }
 
 #ifdef CONFIG_MEMCG
@@ -4008,6 +4935,9 @@ void lru_gen_init_memcg(struct mem_cgroup *memcg)
 {
 	int nid;
 
+	INIT_LIST_HEAD(&memcg->mm_list.fifo);
+	spin_lock_init(&memcg->mm_list.lock);
+
 	for_each_node(nid) {
 		struct lruvec *lruvec = get_lruvec(memcg, nid);
 
@@ -4020,10 +4950,16 @@ void lru_gen_free_memcg(struct mem_cgroup *memcg)
 	int nid;
 
 	for_each_node(nid) {
+		int i;
 		struct lruvec *lruvec = get_lruvec(memcg, nid);
 
 		VM_BUG_ON(memchr_inv(lruvec->lrugen.nr_pages, 0,
 				     sizeof(lruvec->lrugen.nr_pages)));
+
+		for (i = 0; i < NR_BLOOM_FILTERS; i++) {
+			bitmap_free(lruvec->mm_state.filters[i]);
+			lruvec->mm_state.filters[i] = NULL;
+		}
 	}
 }
 #endif
@@ -4032,6 +4968,7 @@ static int __init init_lru_gen(void)
 {
 	BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS);
 	BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
+	BUILD_BUG_ON(sizeof(MM_STAT_CODES) != NR_MM_STATS + 1);
 
 	return 0;
 };
-- 
2.35.0.263.gb82422642f-goog



^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v7 08/12] mm: multigenerational LRU: optimize multiple memcgs
  2022-02-08  8:18 [PATCH v7 00/12] Multigenerational LRU Framework Yu Zhao
                   ` (6 preceding siblings ...)
  2022-02-08  8:18 ` [PATCH v7 07/12] mm: multigenerational LRU: support page table walks Yu Zhao
@ 2022-02-08  8:18 ` Yu Zhao
  2022-02-08  8:18 ` [PATCH v7 09/12] mm: multigenerational LRU: runtime switch Yu Zhao
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Yu Zhao, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

When multiple memcgs are available, it's possible to improve the
overall performance under global memory pressure by making better
choices based on generations and tiers. This patch adds a rudimentary
optimization that selects memcgs which can drop single-use unmapped
clean pages first, thereby reducing the chance of going into the aging
path or swapping, both of which can be costly. The goal is to improve
the overall performance when running mixed types of workloads, e.g., a
heavy anon workload in one memcg and a heavy buffered I/O workload in
another.

Though this optimization can be applied to both kswapd and direct
reclaim, it's only added to kswapd to keep the patchset manageable.
Later improvements will cover the direct reclaim path.
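
For reference, the two workloads above could be split into separate
memcgs along these lines (an illustrative sketch only, assuming
cgroup v2 mounted at /sys/fs/cgroup; the group names and PID variables
are made up and not part of the benchmark configuration below):

  echo +memory >/sys/fs/cgroup/cgroup.subtree_control
  mkdir /sys/fs/cgroup/anon /sys/fs/cgroup/file
  # memcached (heavy anon) in one memcg, fio (heavy buffered I/O) in
  # the other
  echo $MEMCACHED_PID >/sys/fs/cgroup/anon/cgroup.procs
  echo $FIO_PID >/sys/fs/cgroup/file/cgroup.procs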

Server benchmark results:
  Mixed workloads:
    fio (buffered I/O): -[28, 30]%
                IOPS         BW
      patch1-7: 3117k        11.9GiB/s
      patch1-8: 2217k        8661MiB/s

    memcached (anon): +[247, 251]%
                Ops/sec      KB/sec
      patch1-7: 563772.35    21900.01
      patch1-8: 1968343.76   76461.24

  Mixed workloads:
    fio (buffered I/O): -[4, 6]%
                IOPS         BW
      5.17-rc2: 2338k        9133MiB/s
      patch1-8: 2217k        8661MiB/s

    memcached (anon): +[524, 530]%
                Ops/sec      KB/sec
      5.17-rc2: 313821.65    12190.55
      patch1-8: 1968343.76   76461.24

  Configurations:
    (changes since patch 5)

    cat combined.sh
    modprobe brd rd_nr=2 rd_size=56623104

    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    mkfs.ext4 /dev/ram1
    mount -t ext4 /dev/ram1 /mnt

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=90m --group_reporting &
    pid=$!

    sleep 200

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

    kill -INT $pid
    wait

Client benchmark results:
  no change (CONFIG_MEMCG=n)

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
---
 mm/vmscan.c | 45 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 41 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5ab6cd332fcc..fc09b6c10624 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -127,6 +127,13 @@ struct scan_control {
 	/* Always discard instead of demoting to lower tier memory */
 	unsigned int no_demotion:1;
 
+#ifdef CONFIG_LRU_GEN
+	/* help make better choices when multiple memcgs are available */
+	unsigned int memcgs_need_aging:1;
+	unsigned int memcgs_need_swapping:1;
+	unsigned int memcgs_avoid_swapping:1;
+#endif
+
 	/* Allocation order */
 	s8 order;
 
@@ -4343,6 +4350,22 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 
 	VM_BUG_ON(!current_is_kswapd());
 
+	/*
+	 * To reduce the chance of going into the aging path or swapping, which
+	 * can be costly, optimistically skip them unless their corresponding
+	 * flags were cleared in the eviction path. This improves the overall
+	 * performance when multiple memcgs are available.
+	 */
+	if (!sc->memcgs_need_aging) {
+		sc->memcgs_need_aging = true;
+		sc->memcgs_avoid_swapping = !sc->memcgs_need_swapping;
+		sc->memcgs_need_swapping = true;
+		return;
+	}
+
+	sc->memcgs_need_swapping = true;
+	sc->memcgs_avoid_swapping = true;
+
 	current->reclaim_state->mm_walk = &pgdat->mm_walk;
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
@@ -4745,7 +4768,8 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
 	return scanned;
 }
 
-static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
+static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
+			bool *swapped)
 {
 	int type;
 	int scanned;
@@ -4810,6 +4834,9 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 
 	sc->nr_reclaimed += reclaimed;
 
+	if (!type && swapped)
+		*swapped = true;
+
 	return scanned;
 }
 
@@ -4838,8 +4865,10 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
 	if (!nr_to_scan)
 		return 0;
 
-	if (!need_aging)
+	if (!need_aging) {
+		sc->memcgs_need_aging = false;
 		return nr_to_scan;
+	}
 
 	/* leave the work to lru_gen_age_node() */
 	if (current_is_kswapd())
@@ -4861,6 +4890,8 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 {
 	struct blk_plug plug;
 	long scanned = 0;
+	bool swapped = false;
+	unsigned long reclaimed = sc->nr_reclaimed;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
@@ -4887,13 +4918,19 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 		if (!nr_to_scan)
 			break;
 
-		delta = evict_folios(lruvec, sc, swappiness);
+		delta = evict_folios(lruvec, sc, swappiness, &swapped);
 		if (!delta)
 			break;
 
+		if (sc->memcgs_avoid_swapping && swappiness < 200 && swapped)
+			break;
+
 		scanned += delta;
-		if (scanned >= nr_to_scan)
+		if (scanned >= nr_to_scan) {
+			if (!swapped && sc->nr_reclaimed - reclaimed >= MIN_LRU_BATCH)
+				sc->memcgs_need_swapping = false;
 			break;
+		}
 
 		cond_resched();
 	}
-- 
2.35.0.263.gb82422642f-goog



^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v7 09/12] mm: multigenerational LRU: runtime switch
  2022-02-08  8:18 [PATCH v7 00/12] Multigenerational LRU Framework Yu Zhao
                   ` (7 preceding siblings ...)
  2022-02-08  8:18 ` [PATCH v7 08/12] mm: multigenerational LRU: optimize multiple memcgs Yu Zhao
@ 2022-02-08  8:18 ` Yu Zhao
  2022-02-08  8:42   ` Yu Zhao
  2022-02-08  8:19 ` [PATCH v7 10/12] mm: multigenerational LRU: thrashing prevention Yu Zhao
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:18 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Yu Zhao, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

Add /sys/kernel/mm/lru_gen/enabled as a runtime switch. Features that
can be enabled or disabled include:
  0x0001: the multigenerational LRU
  0x0002: the page table walks, when arch_has_hw_pte_young() returns
          true
  0x0004: the use of the accessed bit in non-leaf PMD entries, when
          CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
  [yYnN]: apply to all the features above
E.g.,
  echo y >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0007
  echo 5 >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0005

NB: under heavy memory pressure, the page table walks happen on the
scale of seconds, i.e., they are relatively infrequent. Under such a
condition, the mmap_lock contention is a lesser concern compared with
the LRU lock contention and the I/O congestion. So far the only
well-known case of mmap_lock contention is Android, due to Scudo [1],
which allocates several thousand VMAs for merely a few hundred MBs.
The SPF and the Maple Tree patchsets have also provided their own
assessments [2][3]. However, if the page table walks do worsen the
mmap_lock contention, the runtime switch can be used to disable this
feature. In that case the multigenerational LRU will suffer a minor
performance degradation, as shown previously.

The use of the accessed bit in non-leaf PMD entries can also be
disabled, since this feature wasn't tested on x86 varieties other than
Intel and AMD.
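
For example, writing the remaining bits turns off either optimization
on its own (the values follow the list above; a usage sketch, assuming
all three features are currently enabled, i.e., 0x0007):

  # disable only the page table walks (0x0002)
  echo 5 >/sys/kernel/mm/lru_gen/enabled
  # disable only the accessed bit in non-leaf PMD entries (0x0004)
  echo 3 >/sys/kernel/mm/lru_gen/enabled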

[1] https://source.android.com/devices/tech/debug/scudo
[2] https://lore.kernel.org/lkml/20220128131006.67712-1-michel@lespinasse.org/
[3] https://lore.kernel.org/lkml/20220202024137.2516438-1-Liam.Howlett@oracle.com/

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
---
 include/linux/cgroup.h          |  15 +-
 include/linux/mm_inline.h       |  10 +-
 include/linux/mmzone.h          |   7 +
 kernel/cgroup/cgroup-internal.h |   1 -
 mm/Kconfig                      |   6 +
 mm/vmscan.c                     | 236 +++++++++++++++++++++++++++++++-
 6 files changed, 267 insertions(+), 8 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 75c151413fda..b145025f3eac 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -432,6 +432,18 @@ static inline void cgroup_put(struct cgroup *cgrp)
 	css_put(&cgrp->self);
 }
 
+extern struct mutex cgroup_mutex;
+
+static inline void cgroup_lock(void)
+{
+	mutex_lock(&cgroup_mutex);
+}
+
+static inline void cgroup_unlock(void)
+{
+	mutex_unlock(&cgroup_mutex);
+}
+
 /**
  * task_css_set_check - obtain a task's css_set with extra access conditions
  * @task: the task to obtain css_set for
@@ -446,7 +458,6 @@ static inline void cgroup_put(struct cgroup *cgrp)
  * as locks used during the cgroup_subsys::attach() methods.
  */
 #ifdef CONFIG_PROVE_RCU
-extern struct mutex cgroup_mutex;
 extern spinlock_t css_set_lock;
 #define task_css_set_check(task, __c)					\
 	rcu_dereference_check((task)->cgroups,				\
@@ -707,6 +718,8 @@ struct cgroup;
 static inline u64 cgroup_id(const struct cgroup *cgrp) { return 1; }
 static inline void css_get(struct cgroup_subsys_state *css) {}
 static inline void css_put(struct cgroup_subsys_state *css) {}
+static inline void cgroup_lock(void) {}
+static inline void cgroup_unlock(void) {}
 static inline int cgroup_attach_task_all(struct task_struct *from,
 					 struct task_struct *t) { return 0; }
 static inline int cgroupstats_build(struct cgroupstats *stats,
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 37c8a0ede4ff..130d62751e05 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -96,7 +96,15 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio)
 
 static inline bool lru_gen_enabled(void)
 {
-	return true;
+#ifdef CONFIG_LRU_GEN_ENABLED
+	DECLARE_STATIC_KEY_TRUE(lru_gen_caps[NR_LRU_GEN_CAPS]);
+
+	return static_branch_likely(&lru_gen_caps[LRU_GEN_CORE]);
+#else
+	DECLARE_STATIC_KEY_FALSE(lru_gen_caps[NR_LRU_GEN_CAPS]);
+
+	return static_branch_unlikely(&lru_gen_caps[LRU_GEN_CORE]);
+#endif
 }
 
 static inline bool lru_gen_in_fault(void)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fa0a7a84ee58..4ecec9152761 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -311,6 +311,13 @@ struct page_vma_mapped_walk;
 
 #ifdef CONFIG_LRU_GEN
 
+enum {
+	LRU_GEN_CORE,
+	LRU_GEN_MM_WALK,
+	LRU_GEN_NONLEAF_YOUNG,
+	NR_LRU_GEN_CAPS
+};
+
 #define MIN_LRU_BATCH		BITS_PER_LONG
 #define MAX_LRU_BATCH		(MIN_LRU_BATCH * 128)
 
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 6e36e854b512..929ed3bf1a7c 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -165,7 +165,6 @@ struct cgroup_mgctx {
 #define DEFINE_CGROUP_MGCTX(name)						\
 	struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name)
 
-extern struct mutex cgroup_mutex;
 extern spinlock_t css_set_lock;
 extern struct cgroup_subsys *cgroup_subsys[];
 extern struct list_head cgroup_roots;
diff --git a/mm/Kconfig b/mm/Kconfig
index e899623d5df0..aae72b740d8a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -903,6 +903,12 @@ config LRU_GEN
 	  Documentation/admin-guide/mm/multigen_lru.rst and
 	  Documentation/vm/multigen_lru.rst for details.
 
+config LRU_GEN_ENABLED
+	bool "Enable by default"
+	depends on LRU_GEN
+	help
+	  This option enables the multigenerational LRU by default.
+
 config NR_LRU_GENS
 	int "Max number of generations"
 	depends on LRU_GEN
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fc09b6c10624..700c35f2a030 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3066,6 +3066,12 @@ enum {
 	TYPE_FILE,
 };
 
+#ifdef CONFIG_LRU_GEN_ENABLED
+DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS);
+#else
+DEFINE_STATIC_KEY_ARRAY_FALSE(lru_gen_caps, NR_LRU_GEN_CAPS);
+#endif
+
 /******************************************************************************
  *                          shorthand helpers
  ******************************************************************************/
@@ -3102,6 +3108,15 @@ static int folio_lru_tier(struct folio *folio)
 	return lru_tier_from_refs(refs);
 }
 
+static bool get_cap(int cap)
+{
+#ifdef CONFIG_LRU_GEN_ENABLED
+	return static_branch_likely(&lru_gen_caps[cap]);
+#else
+	return static_branch_unlikely(&lru_gen_caps[cap]);
+#endif
+}
+
 static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
 {
 	struct pglist_data *pgdat = NODE_DATA(nid);
@@ -3893,7 +3908,8 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area
 			goto next;
 
 		if (!pmd_trans_huge(pmd[i])) {
-			if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG))
+			if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) &&
+			    get_cap(LRU_GEN_NONLEAF_YOUNG))
 				pmdp_test_and_clear_young(vma, addr, pmd + i);
 			goto next;
 		}
@@ -4000,10 +4016,12 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 		priv->mm_stats[MM_PMD_TOTAL]++;
 
 #ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
-		if (!pmd_young(val))
-			continue;
+		if (get_cap(LRU_GEN_NONLEAF_YOUNG)) {
+			if (!pmd_young(val))
+				continue;
 
-		walk_pmd_range_locked(pud, addr, vma, walk, &pos);
+			walk_pmd_range_locked(pud, addr, vma, walk, &pos);
+		}
 #endif
 		if (!priv->full_scan && !test_bloom_filter(priv->lruvec, priv->max_seq, pmd + i))
 			continue;
@@ -4235,7 +4253,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
 	 * handful of PTEs. Spreading the work out over a period of time usually
 	 * is less efficient, but it avoids bursty page faults.
 	 */
-	if (!full_scan && !arch_has_hw_pte_young()) {
+	if (!full_scan && (!arch_has_hw_pte_young() || !get_cap(LRU_GEN_MM_WALK))) {
 		success = iterate_mm_list_nowalk(lruvec, max_seq);
 		goto done;
 	}
@@ -4941,6 +4959,211 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 		current->reclaim_state->mm_walk = NULL;
 }
 
+/******************************************************************************
+ *                          state change
+ ******************************************************************************/
+
+static bool __maybe_unused state_is_valid(struct lruvec *lruvec)
+{
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	if (lrugen->enabled) {
+		enum lru_list lru;
+
+		for_each_evictable_lru(lru) {
+			if (!list_empty(&lruvec->lists[lru]))
+				return false;
+		}
+	} else {
+		int gen, type, zone;
+
+		for_each_gen_type_zone(gen, type, zone) {
+			if (!list_empty(&lrugen->lists[gen][type][zone]))
+				return false;
+
+			/* unlikely but not a bug when reset_batch_size() is pending */
+			VM_WARN_ON(lrugen->nr_pages[gen][type][zone]);
+		}
+	}
+
+	return true;
+}
+
+static bool fill_evictable(struct lruvec *lruvec)
+{
+	enum lru_list lru;
+	int remaining = MAX_LRU_BATCH;
+
+	for_each_evictable_lru(lru) {
+		int type = is_file_lru(lru);
+		bool active = is_active_lru(lru);
+		struct list_head *head = &lruvec->lists[lru];
+
+		while (!list_empty(head)) {
+			bool success;
+			struct folio *folio = lru_to_folio(head);
+
+			VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
+			VM_BUG_ON_FOLIO(folio_test_active(folio) != active, folio);
+			VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
+			VM_BUG_ON_FOLIO(folio_lru_gen(folio) < MAX_NR_GENS, folio);
+
+			lruvec_del_folio(lruvec, folio);
+			success = lru_gen_add_folio(lruvec, folio, false);
+			VM_BUG_ON(!success);
+
+			if (!--remaining)
+				return false;
+		}
+	}
+
+	return true;
+}
+
+static bool drain_evictable(struct lruvec *lruvec)
+{
+	int gen, type, zone;
+	int remaining = MAX_LRU_BATCH;
+
+	for_each_gen_type_zone(gen, type, zone) {
+		struct list_head *head = &lruvec->lrugen.lists[gen][type][zone];
+
+		while (!list_empty(head)) {
+			bool success;
+			struct folio *folio = lru_to_folio(head);
+
+			VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
+			VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
+			VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
+			VM_BUG_ON_FOLIO(folio_zonenum(folio) != zone, folio);
+
+			success = lru_gen_del_folio(lruvec, folio, false);
+			VM_BUG_ON(!success);
+			lruvec_add_folio(lruvec, folio);
+
+			if (!--remaining)
+				return false;
+		}
+	}
+
+	return true;
+}
+
+static void lru_gen_change_state(bool enable)
+{
+	static DEFINE_MUTEX(state_mutex);
+
+	struct mem_cgroup *memcg;
+
+	cgroup_lock();
+	cpus_read_lock();
+	get_online_mems();
+	mutex_lock(&state_mutex);
+
+	if (enable == lru_gen_enabled())
+		goto unlock;
+
+	if (enable)
+		static_branch_enable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
+	else
+		static_branch_disable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		int nid;
+
+		for_each_node(nid) {
+			struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+			if (!lruvec)
+				continue;
+
+			spin_lock_irq(&lruvec->lru_lock);
+
+			VM_BUG_ON(!seq_is_valid(lruvec));
+			VM_BUG_ON(!state_is_valid(lruvec));
+
+			lruvec->lrugen.enabled = enable;
+
+			while (!(enable ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
+				spin_unlock_irq(&lruvec->lru_lock);
+				cond_resched();
+				spin_lock_irq(&lruvec->lru_lock);
+			}
+
+			spin_unlock_irq(&lruvec->lru_lock);
+		}
+
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+unlock:
+	mutex_unlock(&state_mutex);
+	put_online_mems();
+	cpus_read_unlock();
+	cgroup_unlock();
+}
+
+/******************************************************************************
+ *                          sysfs interface
+ ******************************************************************************/
+
+static ssize_t show_enable(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	unsigned int caps = 0;
+
+	if (get_cap(LRU_GEN_CORE))
+		caps |= BIT(LRU_GEN_CORE);
+
+	if (arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK))
+		caps |= BIT(LRU_GEN_MM_WALK);
+
+	if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) && get_cap(LRU_GEN_NONLEAF_YOUNG))
+		caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
+
+	return snprintf(buf, PAGE_SIZE, "0x%04x\n", caps);
+}
+
+static ssize_t store_enable(struct kobject *kobj, struct kobj_attribute *attr,
+			    const char *buf, size_t len)
+{
+	int i;
+	unsigned int caps;
+
+	if (tolower(*buf) == 'n')
+		caps = 0;
+	else if (tolower(*buf) == 'y')
+		caps = -1;
+	else if (kstrtouint(buf, 0, &caps))
+		return -EINVAL;
+
+	for (i = 0; i < NR_LRU_GEN_CAPS; i++) {
+		bool enable = caps & BIT(i);
+
+		if (i == LRU_GEN_CORE)
+			lru_gen_change_state(enable);
+		else if (enable)
+			static_branch_enable(&lru_gen_caps[i]);
+		else
+			static_branch_disable(&lru_gen_caps[i]);
+	}
+
+	return len;
+}
+
+static struct kobj_attribute lru_gen_enabled_attr = __ATTR(
+	enabled, 0644, show_enable, store_enable
+);
+
+static struct attribute *lru_gen_attrs[] = {
+	&lru_gen_enabled_attr.attr,
+	NULL
+};
+
+static struct attribute_group lru_gen_attr_group = {
+	.name = "lru_gen",
+	.attrs = lru_gen_attrs,
+};
+
 /******************************************************************************
  *                          initialization
  ******************************************************************************/
@@ -5007,6 +5230,9 @@ static int __init init_lru_gen(void)
 	BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
 	BUILD_BUG_ON(sizeof(MM_STAT_CODES) != NR_MM_STATS + 1);
 
+	if (sysfs_create_group(mm_kobj, &lru_gen_attr_group))
+		pr_err("lru_gen: failed to create sysfs group\n");
+
 	return 0;
 };
 late_initcall(init_lru_gen);
-- 
2.35.0.263.gb82422642f-goog



^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v7 10/12] mm: multigenerational LRU: thrashing prevention
  2022-02-08  8:18 [PATCH v7 00/12] Multigenerational LRU Framework Yu Zhao
                   ` (8 preceding siblings ...)
  2022-02-08  8:18 ` [PATCH v7 09/12] mm: multigenerational LRU: runtime switch Yu Zhao
@ 2022-02-08  8:19 ` Yu Zhao
  2022-02-08  8:43   ` Yu Zhao
  2022-02-08  8:19 ` [PATCH v7 11/12] mm: multigenerational LRU: debugfs interface Yu Zhao
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:19 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Yu Zhao, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as
requested by many desktop users [1].

When set to value N, it prevents the working set of the last N
milliseconds from getting evicted. The OOM killer is triggered if this
working set can't be kept in memory. Based on the average human
detectable lag (~100ms), N=1000 usually eliminates intolerable lags
due to thrashing. Larger values like N=3000 make lags less noticeable
at the risk of premature OOM kills.
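
E.g., to protect the working set of the last second, using the value
suggested above (a usage sketch only; the default is 0, i.e., no
thrashing prevention):

  echo 1000 >/sys/kernel/mm/lru_gen/min_ttl_ms
  cat /sys/kernel/mm/lru_gen/min_ttl_ms
  1000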

Compared with the size-based approach, e.g., [2], this time-based
approach has the following advantages:
1) It's easier to configure because it's agnostic to applications and
   memory sizes.
2) It's more reliable because it's directly wired to the OOM killer.

[1] https://lore.kernel.org/lkml/Ydza%2FzXKY9ATRoh6@google.com/
[2] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
---
 mm/vmscan.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 60 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 700c35f2a030..4d37d63668b5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4335,7 +4335,8 @@ static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
 	return total > 0 ? total : 0;
 }
 
-static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc,
+		       unsigned long min_ttl)
 {
 	bool need_aging;
 	long nr_to_scan;
@@ -4344,14 +4345,22 @@ static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	DEFINE_MAX_SEQ(lruvec);
 	DEFINE_MIN_SEQ(lruvec);
 
+	if (min_ttl) {
+		int gen = lru_gen_from_seq(min_seq[TYPE_FILE]);
+		unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
+
+		if (time_is_after_jiffies(birth + min_ttl))
+			return false;
+	}
+
 	mem_cgroup_calculate_protection(NULL, memcg);
 
 	if (mem_cgroup_below_min(memcg))
-		return;
+		return false;
 
 	nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, swappiness, &need_aging);
 	if (!nr_to_scan)
-		return;
+		return false;
 
 	nr_to_scan >>= sc->priority;
 
@@ -4360,11 +4369,18 @@ static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 
 	if (nr_to_scan && need_aging && (!mem_cgroup_below_low(memcg) || sc->memcg_low_reclaim))
 		try_to_inc_max_seq(lruvec, max_seq, sc, swappiness, false);
+
+	return true;
 }
 
+/* to protect the working set of the last N jiffies */
+static unsigned long lru_gen_min_ttl __read_mostly;
+
 static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 {
 	struct mem_cgroup *memcg;
+	bool success = false;
+	unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
 
 	VM_BUG_ON(!current_is_kswapd());
 
@@ -4390,11 +4406,28 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 	do {
 		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
-		age_lruvec(lruvec, sc);
+		if (age_lruvec(lruvec, sc, min_ttl))
+			success = true;
 
 		cond_resched();
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
 
+	/*
+	 * The main goal is to OOM kill if every generation from all memcgs is
+	 * younger than min_ttl. However, another theoretical possibility is all
+	 * memcgs are either below min or empty.
+	 */
+	if (!success && mutex_trylock(&oom_lock)) {
+		struct oom_control oc = {
+			.gfp_mask = sc->gfp_mask,
+			.order = sc->order,
+		};
+
+		out_of_memory(&oc);
+
+		mutex_unlock(&oom_lock);
+	}
+
 	current->reclaim_state->mm_walk = NULL;
 }
 
@@ -5107,6 +5140,28 @@ static void lru_gen_change_state(bool enable)
  *                          sysfs interface
  ******************************************************************************/
 
+static ssize_t show_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl)));
+}
+
+static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr,
+			     const char *buf, size_t len)
+{
+	unsigned int msecs;
+
+	if (kstrtouint(buf, 0, &msecs))
+		return -EINVAL;
+
+	WRITE_ONCE(lru_gen_min_ttl, msecs_to_jiffies(msecs));
+
+	return len;
+}
+
+static struct kobj_attribute lru_gen_min_ttl_attr = __ATTR(
+	min_ttl_ms, 0644, show_min_ttl, store_min_ttl
+);
+
 static ssize_t show_enable(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
 {
 	unsigned int caps = 0;
@@ -5155,6 +5210,7 @@ static struct kobj_attribute lru_gen_enabled_attr = __ATTR(
 );
 
 static struct attribute *lru_gen_attrs[] = {
+	&lru_gen_min_ttl_attr.attr,
 	&lru_gen_enabled_attr.attr,
 	NULL
 };
-- 
2.35.0.263.gb82422642f-goog



^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v7 11/12] mm: multigenerational LRU: debugfs interface
  2022-02-08  8:18 [PATCH v7 00/12] Multigenerational LRU Framework Yu Zhao
                   ` (9 preceding siblings ...)
  2022-02-08  8:19 ` [PATCH v7 10/12] mm: multigenerational LRU: thrashing prevention Yu Zhao
@ 2022-02-08  8:19 ` Yu Zhao
  2022-02-18 18:56   ` [page-reclaim] " David Rientjes
  2022-02-08  8:19 ` [PATCH v7 12/12] mm: multigenerational LRU: documentation Yu Zhao
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:19 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Yu Zhao, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

Add /sys/kernel/debug/lru_gen for working set estimation and proactive
reclaim. These features are required to optimize job scheduling (bin
packing) in data centers [1][2].

Compared with the page table-based approach and the PFN-based
approach, e.g., mm/damon/[vp]addr.c, this lruvec-based approach has
the following advantages:
1) It offers better choices because it's aware of memcgs, NUMA nodes,
   shared mappings and unmapped page cache.
2) It's more scalable because it's O(nr_hot_pages), whereas the
   PFN-based approach is O(nr_total_pages).

Add /sys/kernel/debug/lru_gen_full for debugging.
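
For example, a job scheduler could drive both steps from userspace
along these lines (a sketch only: the memcg ID, node ID and generation
numbers are made up, and the full command format is documented in the
admin guide added by the last patch in this series):

  cat /sys/kernel/debug/lru_gen
  # working set estimation: age memcg 36 on node 0, i.e., create a new
  # generation on top of its current max_gen 678
  echo '+ 36 0 678' >/sys/kernel/debug/lru_gen
  # proactive reclaim: evict generations <=676 from memcg 36 on node 0,
  # with swappiness 10 and at most 4096 pages
  echo '- 36 0 676 10 4096' >/sys/kernel/debug/lru_gen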

[1] https://research.google/pubs/pub48551/
[2] https://www.cs.cmu.edu/~dskarlat/publications/tmo_asplos22.pdf

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
---
 include/linux/nodemask.h |   1 +
 mm/vmscan.c              | 353 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 354 insertions(+)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 567c3ddba2c4..90840c459abc 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -486,6 +486,7 @@ static inline int num_node_state(enum node_states state)
 #define first_online_node	0
 #define first_memory_node	0
 #define next_online_node(nid)	(MAX_NUMNODES)
+#define next_memory_node(nid)	(MAX_NUMNODES)
 #define nr_node_ids		1U
 #define nr_online_nodes		1U
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4d37d63668b5..3dfa938a4c4a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -52,6 +52,8 @@
 #include <linux/psi.h>
 #include <linux/pagewalk.h>
 #include <linux/shmem_fs.h>
+#include <linux/ctype.h>
+#include <linux/debugfs.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -5220,6 +5222,354 @@ static struct attribute_group lru_gen_attr_group = {
 	.attrs = lru_gen_attrs,
 };
 
+/******************************************************************************
+ *                          debugfs interface
+ ******************************************************************************/
+
+static void *lru_gen_seq_start(struct seq_file *m, loff_t *pos)
+{
+	struct mem_cgroup *memcg;
+	loff_t nr_to_skip = *pos;
+
+	m->private = kvmalloc(PATH_MAX, GFP_KERNEL);
+	if (!m->private)
+		return ERR_PTR(-ENOMEM);
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		int nid;
+
+		for_each_node_state(nid, N_MEMORY) {
+			if (!nr_to_skip--)
+				return get_lruvec(memcg, nid);
+		}
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+	return NULL;
+}
+
+static void lru_gen_seq_stop(struct seq_file *m, void *v)
+{
+	if (!IS_ERR_OR_NULL(v))
+		mem_cgroup_iter_break(NULL, lruvec_memcg(v));
+
+	kvfree(m->private);
+	m->private = NULL;
+}
+
+static void *lru_gen_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	int nid = lruvec_pgdat(v)->node_id;
+	struct mem_cgroup *memcg = lruvec_memcg(v);
+
+	++*pos;
+
+	nid = next_memory_node(nid);
+	if (nid == MAX_NUMNODES) {
+		memcg = mem_cgroup_iter(NULL, memcg, NULL);
+		if (!memcg)
+			return NULL;
+
+		nid = first_memory_node;
+	}
+
+	return get_lruvec(memcg, nid);
+}
+
+static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec,
+				  unsigned long max_seq, unsigned long *min_seq,
+				  unsigned long seq)
+{
+	int i;
+	int type, tier;
+	int hist = lru_hist_from_seq(seq);
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	for (tier = 0; tier < MAX_NR_TIERS; tier++) {
+		seq_printf(m, "            %10d", tier);
+		for (type = 0; type < ANON_AND_FILE; type++) {
+			unsigned long n[3] = {};
+
+			if (seq == max_seq) {
+				n[0] = READ_ONCE(lrugen->avg_refaulted[type][tier]);
+				n[1] = READ_ONCE(lrugen->avg_total[type][tier]);
+
+				seq_printf(m, " %10luR %10luT %10lu ", n[0], n[1], n[2]);
+			} else if (seq == min_seq[type] || NR_HIST_GENS > 1) {
+				n[0] = atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+				n[1] = atomic_long_read(&lrugen->evicted[hist][type][tier]);
+				if (tier)
+					n[2] = READ_ONCE(lrugen->promoted[hist][type][tier - 1]);
+
+				seq_printf(m, " %10lur %10lue %10lup", n[0], n[1], n[2]);
+			} else
+				seq_puts(m, "          0           0           0 ");
+		}
+		seq_putc(m, '\n');
+	}
+
+	seq_puts(m, "                      ");
+	for (i = 0; i < NR_MM_STATS; i++) {
+		if (seq == max_seq && NR_HIST_GENS == 1)
+			seq_printf(m, " %10lu%c", READ_ONCE(lruvec->mm_state.stats[hist][i]),
+				   toupper(MM_STAT_CODES[i]));
+		else if (seq != max_seq && NR_HIST_GENS > 1)
+			seq_printf(m, " %10lu%c", READ_ONCE(lruvec->mm_state.stats[hist][i]),
+				   MM_STAT_CODES[i]);
+		else
+			seq_puts(m, "          0 ");
+	}
+	seq_putc(m, '\n');
+}
+
+static int lru_gen_seq_show(struct seq_file *m, void *v)
+{
+	unsigned long seq;
+	bool full = !debugfs_real_fops(m->file)->write;
+	struct lruvec *lruvec = v;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	int nid = lruvec_pgdat(lruvec)->node_id;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	if (nid == first_memory_node) {
+		const char *path = memcg ? m->private : "";
+
+#ifdef CONFIG_MEMCG
+		if (memcg)
+			cgroup_path(memcg->css.cgroup, m->private, PATH_MAX);
+#endif
+		seq_printf(m, "memcg %5hu %s\n", mem_cgroup_id(memcg), path);
+	}
+
+	seq_printf(m, " node %5d\n", nid);
+
+	if (!full)
+		seq = min_seq[TYPE_ANON];
+	else if (max_seq >= MAX_NR_GENS)
+		seq = max_seq - MAX_NR_GENS + 1;
+	else
+		seq = 0;
+
+	for (; seq <= max_seq; seq++) {
+		int gen, type, zone;
+		unsigned int msecs;
+
+		gen = lru_gen_from_seq(seq);
+		msecs = jiffies_to_msecs(jiffies - READ_ONCE(lrugen->timestamps[gen]));
+
+		seq_printf(m, " %10lu %10u", seq, msecs);
+
+		for (type = 0; type < ANON_AND_FILE; type++) {
+			long size = 0;
+
+			if (seq < min_seq[type]) {
+				seq_puts(m, "         -0 ");
+				continue;
+			}
+
+			for (zone = 0; zone < MAX_NR_ZONES; zone++)
+				size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
+
+			seq_printf(m, " %10lu ", max(size, 0L));
+		}
+
+		seq_putc(m, '\n');
+
+		if (full)
+			lru_gen_seq_show_full(m, lruvec, max_seq, min_seq, seq);
+	}
+
+	return 0;
+}
+
+static const struct seq_operations lru_gen_seq_ops = {
+	.start = lru_gen_seq_start,
+	.stop = lru_gen_seq_stop,
+	.next = lru_gen_seq_next,
+	.show = lru_gen_seq_show,
+};
+
+static int run_aging(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
+		     bool can_swap, bool full_scan)
+{
+	DEFINE_MAX_SEQ(lruvec);
+
+	if (seq == max_seq)
+		try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, full_scan);
+
+	return seq > max_seq ? -EINVAL : 0;
+}
+
+static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
+			int swappiness, unsigned long nr_to_reclaim)
+{
+	struct blk_plug plug;
+	int err = -EINTR;
+	DEFINE_MAX_SEQ(lruvec);
+
+	if (max_seq < seq + MIN_NR_GENS)
+		return -EINVAL;
+
+	sc->nr_reclaimed = 0;
+
+	blk_start_plug(&plug);
+
+	while (!signal_pending(current)) {
+		DEFINE_MIN_SEQ(lruvec);
+
+		if (seq < min_seq[!swappiness] || sc->nr_reclaimed >= nr_to_reclaim ||
+		    !evict_folios(lruvec, sc, swappiness, NULL)) {
+			err = 0;
+			break;
+		}
+
+		cond_resched();
+	}
+
+	blk_finish_plug(&plug);
+
+	return err;
+}
+
+static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq,
+		   struct scan_control *sc, int swappiness, unsigned long opt)
+{
+	struct lruvec *lruvec;
+	int err = -EINVAL;
+	struct mem_cgroup *memcg = NULL;
+
+	if (!mem_cgroup_disabled()) {
+		rcu_read_lock();
+		memcg = mem_cgroup_from_id(memcg_id);
+#ifdef CONFIG_MEMCG
+		if (memcg && !css_tryget(&memcg->css))
+			memcg = NULL;
+#endif
+		rcu_read_unlock();
+
+		if (!memcg)
+			goto done;
+	}
+	if (memcg_id != mem_cgroup_id(memcg))
+		goto done;
+
+	if (nid < 0 || nid >= MAX_NUMNODES || !node_state(nid, N_MEMORY))
+		goto done;
+
+	lruvec = get_lruvec(memcg, nid);
+
+	if (swappiness < 0)
+		swappiness = get_swappiness(memcg);
+	else if (swappiness > 200)
+		goto done;
+
+	switch (cmd) {
+	case '+':
+		err = run_aging(lruvec, seq, sc, swappiness, opt);
+		break;
+	case '-':
+		err = run_eviction(lruvec, seq, sc, swappiness, opt);
+		break;
+	}
+done:
+	mem_cgroup_put(memcg);
+
+	return err;
+}
+
+static ssize_t lru_gen_seq_write(struct file *file, const char __user *src,
+				 size_t len, loff_t *pos)
+{
+	void *buf;
+	char *cur, *next;
+	unsigned int flags;
+	int err = 0;
+	struct scan_control sc = {
+		.may_writepage = true,
+		.may_unmap = true,
+		.may_swap = true,
+		.reclaim_idx = MAX_NR_ZONES - 1,
+		.gfp_mask = GFP_KERNEL,
+	};
+
+	buf = kvmalloc(len + 1, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	if (copy_from_user(buf, src, len)) {
+		kvfree(buf);
+		return -EFAULT;
+	}
+
+	next = buf;
+	next[len] = '\0';
+
+	sc.reclaim_state.mm_walk = alloc_mm_walk();
+	if (!sc.reclaim_state.mm_walk) {
+		kvfree(buf);
+		return -ENOMEM;
+	}
+
+	flags = memalloc_noreclaim_save();
+	set_task_reclaim_state(current, &sc.reclaim_state);
+
+	while ((cur = strsep(&next, ",;\n"))) {
+		int n;
+		int end;
+		char cmd;
+		unsigned int memcg_id;
+		unsigned int nid;
+		unsigned long seq;
+		unsigned int swappiness = -1;
+		unsigned long opt = -1;
+
+		cur = skip_spaces(cur);
+		if (!*cur)
+			continue;
+
+		n = sscanf(cur, "%c %u %u %lu %n %u %n %lu %n", &cmd, &memcg_id, &nid,
+			   &seq, &end, &swappiness, &end, &opt, &end);
+		if (n < 4 || cur[end]) {
+			err = -EINVAL;
+			break;
+		}
+
+		err = run_cmd(cmd, memcg_id, nid, seq, &sc, swappiness, opt);
+		if (err)
+			break;
+	}
+
+	set_task_reclaim_state(current, NULL);
+	memalloc_noreclaim_restore(flags);
+
+	free_mm_walk(sc.reclaim_state.mm_walk);
+	kvfree(buf);
+
+	return err ? : len;
+}
+
+static int lru_gen_seq_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &lru_gen_seq_ops);
+}
+
+static const struct file_operations lru_gen_rw_fops = {
+	.open = lru_gen_seq_open,
+	.read = seq_read,
+	.write = lru_gen_seq_write,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
+static const struct file_operations lru_gen_ro_fops = {
+	.open = lru_gen_seq_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
 /******************************************************************************
  *                          initialization
  ******************************************************************************/
@@ -5289,6 +5639,9 @@ static int __init init_lru_gen(void)
 	if (sysfs_create_group(mm_kobj, &lru_gen_attr_group))
 		pr_err("lru_gen: failed to create sysfs group\n");
 
+	debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_rw_fops);
+	debugfs_create_file("lru_gen_full", 0444, NULL, NULL, &lru_gen_ro_fops);
+
 	return 0;
 };
 late_initcall(init_lru_gen);
-- 
2.35.0.263.gb82422642f-goog



^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v7 12/12] mm: multigenerational LRU: documentation
  2022-02-08  8:18 [PATCH v7 00/12] Multigenerational LRU Framework Yu Zhao
                   ` (10 preceding siblings ...)
  2022-02-08  8:19 ` [PATCH v7 11/12] mm: multigenerational LRU: debugfs interface Yu Zhao
@ 2022-02-08  8:19 ` Yu Zhao
  2022-02-08  8:44   ` Yu Zhao
  2022-02-14 10:28   ` Mike Rapoport
  2022-02-08 10:11 ` [PATCH v7 00/12] Multigenerational LRU Framework Oleksandr Natalenko
                   ` (2 subsequent siblings)
  14 siblings, 2 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:19 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Yu Zhao, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

Add a design doc and an admin guide.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
---
 Documentation/admin-guide/mm/index.rst        |   1 +
 Documentation/admin-guide/mm/multigen_lru.rst | 121 ++++++++++++++
 Documentation/vm/index.rst                    |   1 +
 Documentation/vm/multigen_lru.rst             | 152 ++++++++++++++++++
 4 files changed, 275 insertions(+)
 create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
 create mode 100644 Documentation/vm/multigen_lru.rst

diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index c21b5823f126..2cf5bae62036 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -32,6 +32,7 @@ the Linux memory management.
    idle_page_tracking
    ksm
    memory-hotplug
+   multigen_lru
    nommu-mmap
    numa_memory_policy
    numaperf
diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
new file mode 100644
index 000000000000..16a543c8b886
--- /dev/null
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -0,0 +1,121 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multigenerational LRU
+=====================
+
+Quick start
+===========
+Build configurations
+--------------------
+:Required: Set ``CONFIG_LRU_GEN=y``.
+
+:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to enable the
+ multigenerational LRU by default.
+
+Runtime configurations
+----------------------
+:Required: Write ``y`` to ``/sys/kernel/mm/lru_gen/enabled`` if
+ ``CONFIG_LRU_GEN_ENABLED=n``.
+
+This file accepts different values to enable or disable the
+following features:
+
+====== ========
+Values Features
+====== ========
+0x0001 the multigenerational LRU
+0x0002 clear the accessed bit in leaf page table entries **in large
+       batches**, when MMU sets it (e.g., on x86)
+0x0004 clear the accessed bit in non-leaf page table entries **as
+       well**, when MMU sets it (e.g., on x86)
+[yYnN] apply to all the features above
+====== ========
+
+E.g.,
+::
+
+    echo y >/sys/kernel/mm/lru_gen/enabled
+    cat /sys/kernel/mm/lru_gen/enabled
+    0x0007
+    echo 5 >/sys/kernel/mm/lru_gen/enabled
+    cat /sys/kernel/mm/lru_gen/enabled
+    0x0005
+
+Most users should enable or disable all the features unless some of
+them have unforeseen side effects.
+
+Recipes
+=======
+Personal computers
+------------------
+Personal computers are more sensitive to thrashing because it can
+cause janks (lags when rendering UI) and negatively impact user
+experience. The multigenerational LRU offers thrashing prevention to
+the majority of laptop and desktop users who don't have oomd.
+
+:Thrashing prevention: Write ``N`` to
+ ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
+ ``N`` milliseconds from getting evicted. The OOM killer is triggered
+ if this working set can't be kept in memory. Based on the average
+ human detectable lag (~100ms), ``N=1000`` usually eliminates
+ intolerable janks due to thrashing. Larger values like ``N=3000``
+ make janks less noticeable at the risk of premature OOM kills.
+
+Data centers
+------------
+Data centers want to optimize job scheduling (bin packing) to improve
+memory utilization. Job schedulers need to estimate whether a server
+can allocate a certain amount of memory for a new job; this step is
+known as working set estimation and doesn't impact the jobs already
+running on the server. They also want to try freeing some cold memory
+from the existing jobs; this step is known as proactive reclaim and
+improves the chance of successfully landing a new job.
+
+:Optional: Increase ``CONFIG_NR_LRU_GENS`` to support more generations
+ for working set estimation and proactive reclaim.
+
+:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
+ format:
+ ::
+
+   memcg  memcg_id  memcg_path
+     node  node_id
+       min_gen  birth_time  anon_size  file_size
+       ...
+       max_gen  birth_time  anon_size  file_size
+
+ ``min_gen`` is the oldest generation number and ``max_gen`` is the
+ youngest generation number. ``birth_time`` is in milliseconds.
+ ``anon_size`` and ``file_size`` are in pages. The youngest generation
+ represents the group of the MRU pages and the oldest generation
+ represents the group of the LRU pages. For working set estimation, a
+ job scheduler writes to this file at a certain time interval to
+ create new generations, and it ranks available servers based on the
+ sizes of their cold memory defined by this time interval. For
+ proactive reclaim, a job scheduler writes to this file before it
+ tries to land a new job, and if it fails to materialize the cold
+ memory without impacting the existing jobs, it retries on the next
+ server according to the ranking result.
+
+ This file accepts commands described in the following subsections.
+ Multiple command lines are supported, as is concatenation with the
+ delimiters ``,`` and ``;``.
+
+ ``/sys/kernel/debug/lru_gen_full`` contains additional stats for
+ debugging.
+
+:Working set estimation: Write ``+ memcg_id node_id max_gen
+ [can_swap [full_scan]]`` to ``/sys/kernel/debug/lru_gen`` to invoke
+ the aging. It scans PTEs for hot pages and promotes them to the
+ youngest generation ``max_gen``. Then it creates a new generation
+ ``max_gen+1``. Set ``can_swap`` to ``1`` to scan for hot anon pages
+ when swap is off. Set ``full_scan`` to ``0`` to reduce the overhead
+ as well as the coverage when scanning PTEs.
+
+:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness
+ [nr_to_reclaim]]`` to ``/sys/kernel/debug/lru_gen`` to invoke the
+ eviction. It evicts generations less than or equal to ``min_gen``.
+ ``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and
+ ``max_gen-1`` aren't fully aged and therefore can't be evicted. Use
+ ``nr_to_reclaim`` to limit the number of pages to evict.
diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index 44365c4574a3..b48434300226 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -25,6 +25,7 @@ algorithms.  If you are looking for advice on simply allocating memory, see the
    ksm
    memory-model
    mmu_notifier
+   multigen_lru
    numa
    overcommit-accounting
    page_migration
diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
new file mode 100644
index 000000000000..42a277b4e74b
--- /dev/null
+++ b/Documentation/vm/multigen_lru.rst
@@ -0,0 +1,152 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multigenerational LRU
+=====================
+
+Design overview
+===============
+The design objectives are:
+
+* Good representation of access recency
+* Try to profit from spatial locality
+* Fast paths to make obvious choices
+* Simple self-correcting heuristics
+
+The representation of access recency is at the core of all LRU
+implementations. In the multigenerational LRU, each generation
+represents a group of pages with similar access recency (a timestamp).
+Generations establish a common frame of reference and therefore help
+make better choices, e.g., between different memcgs on a computer or
+different computers in a data center (for job scheduling).
+
+Exploiting spatial locality improves the efficiency of gathering the
+accessed bit. An rmap walk targets a single page and doesn't try to
+profit from discovering a young PTE. A page table walk can sweep all
+the young PTEs in an address space, but its search space can be too
+large to make a profit. The key is to optimize both methods and use
+them in combination.
+
+Fast paths reduce code complexity and runtime overhead. Unmapped pages
+don't require TLB flushes; clean pages don't require writeback. These
+facts are only helpful when other conditions, e.g., access recency,
+are similar. With generations as a common frame of reference,
+additional factors stand out. But obvious choices might not be good
+choices; thus self-correction is required.
+
+The benefits of simple self-correcting heuristics are self-evident.
+Again, with generations as a common frame of reference, this becomes
+attainable. Specifically, pages in the same generation are categorized
+based on additional factors, and a feedback loop statistically
+compares the refault percentages across those categories and infers
+which of them are better choices.
+
+The protection of hot pages and the selection of cold pages are based
+on page access channels and patterns. There are two access channels:
+
+* Accesses through page tables
+* Accesses through file descriptors
+
+The protection of the former channel is by design stronger because:
+
+1. The uncertainty in determining the access patterns of the former
+   channel is higher due to the approximation of the accessed bit.
+2. The cost of evicting the former channel is higher due to the TLB
+   flushes required and the likelihood of encountering the dirty bit.
+3. The penalty of underprotecting the former channel is higher because
+   applications usually don't prepare themselves for major page faults
+   like they do for blocked I/O. E.g., GUI applications commonly use
+   dedicated I/O threads to avoid blocking the rendering threads.
+
+There are also two access patterns:
+
+* Accesses exhibiting temporal locality
+* Accesses not exhibiting temporal locality
+
+For the reasons listed above, the former channel is assumed to follow
+the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is
+present, and the latter channel is assumed to follow the latter
+pattern unless outlying refaults have been observed.
+
+Workflow overview
+=================
+Evictable pages are divided into multiple generations for each
+``lruvec``. The youngest generation number is stored in
+``lrugen->max_seq`` for both anon and file types as they are aged on
+an equal footing. The oldest generation numbers are stored in
+``lrugen->min_seq[]`` separately for anon and file types as clean
+file pages can be evicted regardless of swap constraints. These three
+variables are monotonically increasing.
+
+Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)``
+bits in order to fit into the gen counter in ``folio->flags``. Each
+truncated generation number is an index to ``lrugen->lists[]``. The
+sliding window technique is used to track at least ``MIN_NR_GENS`` and
+at most ``MAX_NR_GENS`` generations. The gen counter stores
+``(seq%MAX_NR_GENS)+1`` while a page is on one of ``lrugen->lists[]``;
+otherwise it stores zero.
+
+Each generation is divided into multiple tiers. Tiers represent
+different ranges of numbers of accesses through file descriptors.
+A page accessed ``N`` times through file descriptors is in tier
+``order_base_2(N)``. In contrast to moving across generations which
+requires the LRU lock, moving across tiers only requires operations on
+``folio->flags`` and therefore has a negligible cost. A feedback loop
+modeled after the PID controller monitors refaults over all the tiers
+from anon and file types and decides which tiers from which types to
+evict or promote.
+
+There are two conceptually independent processes (as in the
+manufacturing process): the aging and the eviction. They form a
+closed-loop system, i.e., the page reclaim.
+
+Aging
+-----
+The aging produces young generations. Given an ``lruvec``, it
+increments ``max_seq`` when ``max_seq-min_seq+1`` approaches
+``MIN_NR_GENS``. The aging promotes hot pages to the youngest
+generation when it finds them accessed through page tables; the
+demotion of cold pages happens consequently when it increments
+``max_seq``. The aging uses page table walks and rmap walks to find
+young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list``
+and calls ``walk_page_range()`` with each ``mm_struct`` on this list
+to scan PTEs. On finding a young PTE, it clears the accessed bit and
+updates the gen counter of the page mapped by this PTE to
+``(max_seq%MAX_NR_GENS)+1``. After each iteration of this list, it
+increments ``max_seq``. For the latter, when the eviction walks the
+rmap and finds a young PTE, the aging scans the adjacent PTEs and
+follows the same steps.
+
+Eviction
+--------
+The eviction consumes old generations. Given an ``lruvec``, it
+increments ``min_seq`` when ``lrugen->lists[]`` indexed by
+``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to
+evict from, it first compares ``min_seq[]`` to select the older type.
+If they are equal, it selects the type whose first tier has a lower
+refault percentage. The first tier contains single-use unmapped clean
+pages, which are the best bet. The eviction sorts a page according to
+the gen counter if the aging has found this page accessed through page
+tables and updated the gen counter. It also promotes a page to the
+next generation, i.e., ``min_seq+1`` rather than ``max_seq``, if this
+page was accessed multiple times through file descriptors and the
+feedback loop has detected outlying refaults from the tier this page
+is in, using the first tier as a baseline.
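+
+A minimal user-space sketch of the type selection described above
+(the struct and the sample numbers are illustrative, not the kernel's
+bookkeeping)::
+
+   #include <stdio.h>
+
+   enum { TYPE_ANON, TYPE_FILE };
+
+   struct tier0_stats {
+           unsigned long refaulted;   /* refaults seen in the first tier */
+           unsigned long evicted;     /* pages evicted from the first tier */
+   };
+
+   static unsigned long refault_pct(const struct tier0_stats *t)
+   {
+           return t->evicted ? 100 * t->refaulted / t->evicted : 0;
+   }
+
+   /* pick the type to evict from */
+   static int select_type(const unsigned long min_seq[2],
+                          const struct tier0_stats tier0[2])
+   {
+           /* the older type has the smaller min_seq */
+           if (min_seq[TYPE_ANON] != min_seq[TYPE_FILE])
+                   return min_seq[TYPE_ANON] < min_seq[TYPE_FILE] ?
+                          TYPE_ANON : TYPE_FILE;
+
+           /* equally old: the type whose first tier refaults less */
+           return refault_pct(&tier0[TYPE_ANON]) <=
+                  refault_pct(&tier0[TYPE_FILE]) ? TYPE_ANON : TYPE_FILE;
+   }
+
+   int main(void)
+   {
+           unsigned long min_seq[2] = { 7, 7 };
+           struct tier0_stats tier0[2] = { { 30, 100 }, { 5, 100 } };
+
+           printf("evict from %s\n",
+                  select_type(min_seq, tier0) == TYPE_ANON ? "anon" : "file");
+           return 0;
+   }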
+
+Summary
+-------
+The multigenerational LRU can be disassembled into the following
+components:
+
+* Generations
+* Page table walks
+* Rmap walks
+* Bloom filters
+* PID controller
+
+Between the aging and the eviction (processes), the latter drives the
+former by the sliding window over generations. Within the aging, rmap
+walks drive page table walks by inserting hot, densely populated page
+tables into the Bloom filters. Within the eviction, the PID controller
+uses refaults as the feedback to turn on or off the eviction of
+certain types and tiers.
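+
+A self-contained user-space sketch of such a Bloom filter (the size
+and the hash are arbitrary; a false positive merely causes an extra
+page table walk, so correctness does not depend on the filter)::
+
+   #include <stdbool.h>
+   #include <stdint.h>
+   #include <stdio.h>
+
+   #define BLOOM_BITS 15                           /* a 2^15-bit filter */
+   #define BLOOM_MASK ((1u << BLOOM_BITS) - 1)
+
+   static uint64_t bloom[1u << (BLOOM_BITS - 6)];  /* backing bitmap */
+
+   /* two cheap hashes of a page table's address */
+   static void bloom_hashes(uintptr_t addr,
+                            unsigned int *h1, unsigned int *h2)
+   {
+           uint64_t x = (uint64_t)addr * 0x9e3779b97f4a7c15ull;
+
+           *h1 = x & BLOOM_MASK;
+           *h2 = (x >> 32) & BLOOM_MASK;
+   }
+
+   /* remember a page table under which young PTEs were found */
+   static void bloom_insert(uintptr_t addr)
+   {
+           unsigned int h1, h2;
+
+           bloom_hashes(addr, &h1, &h2);
+           bloom[h1 >> 6] |= 1ull << (h1 & 63);
+           bloom[h2 >> 6] |= 1ull << (h2 & 63);
+   }
+
+   /* should the next page table walk bother rescanning this table? */
+   static bool bloom_maybe_contains(uintptr_t addr)
+   {
+           unsigned int h1, h2;
+
+           bloom_hashes(addr, &h1, &h2);
+           return (bloom[h1 >> 6] & (1ull << (h1 & 63))) &&
+                  (bloom[h2 >> 6] & (1ull << (h2 & 63)));
+   }
+
+   int main(void)
+   {
+           uintptr_t hot = 0x1234000, cold = 0x9876000;
+
+           bloom_insert(hot);
+           printf("hot:  %d\n", bloom_maybe_contains(hot));    /* 1 */
+           printf("cold: %d\n", bloom_maybe_contains(cold));   /* almost surely 0 */
+           return 0;
+   }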
-- 
2.35.0.263.gb82422642f-goog



^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 01/12] mm: x86, arm64: add arch_has_hw_pte_young()
  2022-02-08  8:18 ` [PATCH v7 01/12] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao
@ 2022-02-08  8:24   ` Yu Zhao
  2022-02-08 10:33   ` Will Deacon
  1 sibling, 0 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:24 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

On Tue, Feb 08, 2022 at 01:18:51AM -0700, Yu Zhao wrote:

<snipped>

> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index c4ba047a82d2..990358eca359 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -999,23 +999,13 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
>   * page after fork() + CoW for pfn mappings. We don't always have a
>   * hardware-managed access flag on arm64.
>   */
> -static inline bool arch_faults_on_old_pte(void)
> -{
> -	WARN_ON(preemptible());
> -
> -	return !cpu_has_hw_af();
> -}
> -#define arch_faults_on_old_pte		arch_faults_on_old_pte
> +#define arch_has_hw_pte_young		cpu_has_hw_af

Reworked arch_has_hw_pte_young() for arm64 according to:
https://lore.kernel.org/linux-mm/20220111141901.GA10338@willie-the-truck/

<snipped>


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 02/12] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
  2022-02-08  8:18 ` [PATCH v7 02/12] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Yu Zhao
@ 2022-02-08  8:27   ` Yu Zhao
  0 siblings, 0 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:27 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

On Tue, Feb 08, 2022 at 01:18:52AM -0700, Yu Zhao wrote:
> Some architectures support the accessed bit in non-leaf PMD entries,
> e.g., x86 sets the accessed bit in a non-leaf PMD entry when using it
> as part of linear address translation [1]. Page table walkers that
> clear the accessed bit may use this capability to reduce their search
> space.
> 
> Note that:
> 1. Although an inline function is preferable, this capability is added
>    as a configuration option for the consistency with the existing
>    macros.
> 2. Due to the little interest in other varieties, this capability was
>    only tested on Intel and AMD CPUs.

Clarified ARCH_HAS_NONLEAF_PMD_YOUNG for x86 as requested here:
https://lore.kernel.org/linux-mm/CAHk-=wgvOqj6LUhNp8V5ddT8eZyYdFDzMZE73KgPggOnc28VWg@mail.gmail.com/

<snipped>

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 9f5bd41bf660..e787b7fc75be 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -85,6 +85,7 @@ config X86
>  	select ARCH_HAS_PMEM_API		if X86_64
>  	select ARCH_HAS_PTE_DEVMAP		if X86_64
>  	select ARCH_HAS_PTE_SPECIAL
> +	select ARCH_HAS_NONLEAF_PMD_YOUNG

And enabled it for both 32-bit and 64-bit.

<snipped>


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-02-08  8:18 ` [PATCH v7 04/12] mm: multigenerational LRU: groundwork Yu Zhao
@ 2022-02-08  8:28   ` Yu Zhao
  2022-02-10 20:41   ` Johannes Weiner
  2022-02-10 21:37   ` Matthew Wilcox
  2 siblings, 0 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:28 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

On Tue, Feb 08, 2022 at 01:18:54AM -0700, Yu Zhao wrote:

<snipped>

> diff --git a/mm/memory.c b/mm/memory.c
> index a7379196a47e..d27e5f1a2533 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4754,6 +4754,27 @@ static inline void mm_account_fault(struct pt_regs *regs,
>  		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
>  }
>  
> +#ifdef CONFIG_LRU_GEN
> +static void lru_gen_enter_fault(struct vm_area_struct *vma)
> +{
> +	/* the LRU algorithm doesn't apply to sequential or random reads */
> +	current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ));
> +}
> +
> +static void lru_gen_exit_fault(void)
> +{
> +	current->in_lru_fault = false;
> +}
> +#else
> +static void lru_gen_enter_fault(struct vm_area_struct *vma)
> +{
> +}
> +
> +static void lru_gen_exit_fault(void)
> +{
> +}
> +#endif /* CONFIG_LRU_GEN */

Moved task_enter_lru_fault() from mm.h to memory.c as requested here:
https://lore.kernel.org/linux-mm/CAHk-=wib5-tUrf2=zYL9hjCqqFykZmTr_-vMAvSo48boCA+-Wg@mail.gmail.com/

Also renamed it to lru_gen_enter_fault().

<snipped>


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation
  2022-02-08  8:18 ` [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation Yu Zhao
@ 2022-02-08  8:33   ` Yu Zhao
  2022-02-08 16:50   ` Johannes Weiner
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:33 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

On Tue, Feb 08, 2022 at 01:18:55AM -0700, Yu Zhao wrote:

<snipped>

> diff --git a/mm/Kconfig b/mm/Kconfig
> index 3326ee3903f3..e899623d5df0 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -892,6 +892,50 @@ config ANON_VMA_NAME
>  	  area from being merged with adjacent virtual memory areas due to the
>  	  difference in their name.
>  
> +# multigenerational LRU {
> +config LRU_GEN
> +	bool "Multigenerational LRU"
> +	depends on MMU
> +	# the following options can use up the spare bits in page flags
> +	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
> +	help
> +	  A high performance LRU implementation for memory overcommit. See
> +	  Documentation/admin-guide/mm/multigen_lru.rst and
> +	  Documentation/vm/multigen_lru.rst for details.
> +
> +config NR_LRU_GENS
> +	int "Max number of generations"
> +	depends on LRU_GEN
> +	range 4 31
> +	default 4
> +	help
> +	  Do not increase this value unless you plan to use working set
> +	  estimation and proactive reclaim to optimize job scheduling in data
> +	  centers.
> +
> +	  This option uses order_base_2(N+1) bits in page flags.
> +
> +config TIERS_PER_GEN
> +	int "Number of tiers per generation"
> +	depends on LRU_GEN
> +	range 2 4
> +	default 4
> +	help
> +	  Do not decrease this value unless you run out of spare bits in page
> +	  flags, i.e., you see the "Not enough bits in page flags" build error.
> +
> +	  This option uses N-2 bits in page flags.

Moved Kconfig to this patch as suggested by:
https://lore.kernel.org/linux-mm/Yd6uHYtjGfgqjDpw@dhcp22.suse.cz/

Added two new macros as requested here:
https://lore.kernel.org/linux-mm/87czkyzhfe.fsf@linux.ibm.com/

<snipped>

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d75a5738d1dc..5f0d92838712 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1285,9 +1285,11 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>  
>  	if (PageSwapCache(page)) {
>  		swp_entry_t swap = { .val = page_private(page) };
> -		mem_cgroup_swapout(page, swap);
> +
> +		/* get a shadow entry before mem_cgroup_swapout() clears folio_memcg() */
>  		if (reclaimed && !mapping_exiting(mapping))
>  			shadow = workingset_eviction(page, target_memcg);
> +		mem_cgroup_swapout(page, swap);
>  		__delete_from_swap_cache(page, swap, shadow);
>  		xa_unlock_irq(&mapping->i_pages);
>  		put_swap_page(page, swap);
> @@ -2721,6 +2723,9 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
>  	unsigned long file;
>  	struct lruvec *target_lruvec;
>  
> +	if (lru_gen_enabled())
> +		return;
> +
>  	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
>  
>  	/*
> @@ -3042,15 +3047,47 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
>  
>  #ifdef CONFIG_LRU_GEN
>  
> +enum {
> +	TYPE_ANON,
> +	TYPE_FILE,
> +};

Added two new macros as requested here:
https://lore.kernel.org/linux-mm/87czkyzhfe.fsf@linux.ibm.com/

<snipped>

> +static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> +{
> +	bool need_aging;
> +	long nr_to_scan;
> +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +	int swappiness = get_swappiness(memcg);
> +	DEFINE_MAX_SEQ(lruvec);
> +	DEFINE_MIN_SEQ(lruvec);
> +
> +	mem_cgroup_calculate_protection(NULL, memcg);
> +
> +	if (mem_cgroup_below_min(memcg))
> +		return;

Added mem_cgroup_calculate_protection() for readability as requested here:
https://lore.kernel.org/linux-mm/Ydf9RXPch5ddg%2FWC@dhcp22.suse.cz/

<snipped>


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 07/12] mm: multigenerational LRU: support page table walks
  2022-02-08  8:18 ` [PATCH v7 07/12] mm: multigenerational LRU: support page table walks Yu Zhao
@ 2022-02-08  8:39   ` Yu Zhao
  0 siblings, 0 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:39 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

On Tue, Feb 08, 2022 at 01:18:57AM -0700, Yu Zhao wrote:
> To avoid confusions, the term "iteration" specifically means the
> traversal of an entire mm_struct list; the term "walk" will be applied
> to page tables and the rmap, as usual.
> 
> To further exploit spatial locality, the aging prefers to walk page
> tables to search for young PTEs and promote hot pages. A runtime
> switch will be added in the next patch to enable or disable this
> feature. Without it, the aging relies on the rmap only.

Clarified that page table scanning is optional as requested here:
https://lore.kernel.org/linux-mm/YdxEqFPLDf+wI0xX@dhcp22.suse.cz/

> NB: this feature is not similar to the page table scanning in the
> 2.4 kernel [1], which searches page tables for old PTEs, adds cold
> pages to the swapcache and unmaps them.
> 
> An mm_struct list is maintained for each memcg, and an mm_struct
> follows its owner task to the new memcg when this task is migrated.
> Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls
> walk_page_range() with each mm_struct on this list to promote hot
> pages before it increments max_seq.
> 
> When multiple page table walkers (threads) iterate the same list, each
> of them gets a unique mm_struct; therefore they can run concurrently.
> Page table walkers ignore any misplaced pages, e.g., if an mm_struct
> was migrated, pages it left in the previous memcg won't be promoted
> when its current memcg is under reclaim. Similarly, page table walkers
> won't promote pages from nodes other than the one under reclaim.

Clarified the interaction between task migration and reclaim as requested here:
https://lore.kernel.org/linux-mm/YdxPEdsfl771Z7IX@dhcp22.suse.cz/

<snipped>

> Server benchmark results:
>   Single workload:
>     fio (buffered I/O): no change
> 
>   Single workload:
>     memcached (anon): +[5.5, 7.5]%
>                 Ops/sec      KB/sec
>       patch1-6: 1015292.83   39490.38
>       patch1-7: 1080856.82   42040.53
> 
>   Configurations:
>     no change
> 
> Client benchmark results:
>   kswapd profiles:
>     patch1-6
>       45.49%  lzo1x_1_do_compress (real work)
>        7.38%  page_vma_mapped_walk
>        7.24%  _raw_spin_unlock_irq
>        2.64%  ptep_clear_flush
>        2.31%  __zram_bvec_write
>        2.13%  do_raw_spin_lock
>        2.09%  lru_gen_look_around
>        1.89%  free_unref_page_list
>        1.85%  memmove
>        1.74%  obj_malloc
> 
>     patch1-7
>       47.73%  lzo1x_1_do_compress (real work)
>        6.84%  page_vma_mapped_walk
>        6.14%  _raw_spin_unlock_irq
>        2.86%  walk_pte_range
>        2.79%  ptep_clear_flush
>        2.24%  __zram_bvec_write
>        2.10%  do_raw_spin_lock
>        1.94%  free_unref_page_list
>        1.80%  memmove
>        1.75%  obj_malloc
> 
>   Configurations:
>     no change

Added benchmark results to show the difference between page table
scanning and no page table scanning, as requested here:
https://lore.kernel.org/linux-mm/Ye6xS6xUD1SORdHJ@dhcp22.suse.cz/

<snipped>

> +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> +{
> +	static const struct mm_walk_ops mm_walk_ops = {
> +		.test_walk = should_skip_vma,
> +		.p4d_entry = walk_pud_range,
> +	};
> +
> +	int err;
> +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +
> +	walk->next_addr = FIRST_USER_ADDRESS;
> +
> +	do {
> +		err = -EBUSY;
> +
> +		/* folio_update_gen() requires stable folio_memcg() */
> +		if (!mem_cgroup_trylock_pages(memcg))
> +			break;

Added a comment on the stable folio_memcg() requirement as requested
here:
https://lore.kernel.org/linux-mm/Yd6q0QdLVTS53vu4@dhcp22.suse.cz/

<snipped>

> +static struct lru_gen_mm_walk *alloc_mm_walk(void)
> +{
> +	if (current->reclaim_state && current->reclaim_state->mm_walk)
> +		return current->reclaim_state->mm_walk;
> +
> +	return kzalloc(sizeof(struct lru_gen_mm_walk),
> +		       __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +}

Replaced kvzalloc() with kzalloc() as requested here:
https://lore.kernel.org/linux-mm/Yd6tafG3CS7BoRYn@dhcp22.suse.cz/

Replaced GFP_KERNEL with __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN as
requested here:
https://lore.kernel.org/linux-mm/YefddYm8FAfJalNa@dhcp22.suse.cz/

<snipped>


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 06/12] mm: multigenerational LRU: exploit locality in rmap
  2022-02-08  8:18 ` [PATCH v7 06/12] mm: multigenerational LRU: exploit locality in rmap Yu Zhao
@ 2022-02-08  8:40   ` Yu Zhao
  0 siblings, 0 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:40 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

On Tue, Feb 08, 2022 at 01:18:56AM -0700, Yu Zhao wrote:

<snipped>

> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index b72d75141e12..51c9bc8e965d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -436,6 +436,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
>   * - LRU isolation
>   * - lock_page_memcg()
>   * - exclusive reference
> + * - mem_cgroup_trylock_pages()
>   *
>   * For a kmem folio a caller should hold an rcu read lock to protect memcg
>   * associated with a kmem folio from being released.
> @@ -497,6 +498,7 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
>   * - LRU isolation
>   * - lock_page_memcg()
>   * - exclusive reference
> + * - mem_cgroup_trylock_pages()
>   *
>   * For a kmem page a caller should hold an rcu read lock to protect memcg
>   * associated with a kmem page from being released.
> @@ -934,6 +936,23 @@ void unlock_page_memcg(struct page *page);
>  
>  void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
>  
> +/* try to stabilize folio_memcg() for all the pages in a memcg */
> +static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
> +{
> +	rcu_read_lock();
> +
> +	if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account))
> +		return true;
> +
> +	rcu_read_unlock();
> +	return false;
> +}
> +
> +static inline void mem_cgroup_unlock_pages(void)
> +{
> +	rcu_read_unlock();
> +}

Replaced the open-coded folio_memcg() lock with a new function
mem_cgroup_trylock_pages() as requested here:
https://lore.kernel.org/linux-mm/YeATr%2F%2FU6XD87fWF@dhcp22.suse.cz/

<snipped>


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 09/12] mm: multigenerational LRU: runtime switch
  2022-02-08  8:18 ` [PATCH v7 09/12] mm: multigenerational LRU: runtime switch Yu Zhao
@ 2022-02-08  8:42   ` Yu Zhao
  0 siblings, 0 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:42 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

On Tue, Feb 08, 2022 at 01:18:59AM -0700, Yu Zhao wrote:
> Add /sys/kernel/mm/lru_gen/enabled as a runtime switch. Features that
> can be enabled or disabled include:
>   0x0001: the multigenerational LRU
>   0x0002: the page table walks, when arch_has_hw_pte_young() returns
>           true
>   0x0004: the use of the accessed bit in non-leaf PMD entries, when
>           CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
>   [yYnN]: apply to all the features above
> E.g.,
>   echo y >/sys/kernel/mm/lru_gen/enabled
>   cat /sys/kernel/mm/lru_gen/enabled
>   0x0007
>   echo 5 >/sys/kernel/mm/lru_gen/enabled
>   cat /sys/kernel/mm/lru_gen/enabled
>   0x0005
> 
> NB: the page table walks happen on the scale of seconds under heavy
> memory pressure. Under such a condition, the mmap_lock contention is a
> lesser concern, compared with the LRU lock contention and the I/O
> congestion. So far the only well-known case of the mmap_lock
> contention is Android, due to Scudo [1] which allocates several
> thousand VMAs for merely a few hundred MBs. The SPF and the Maple Tree
> also have provided their own assessments [2][3]. However, if the page
> table walks do worsen the mmap_lock contention, the runtime switch can
> be used to disable this feature. In this case the multigenerational
> LRU will suffer a minor performance degradation, as shown previously.

Clarified the potential impact from the mmap_lock contention as
requested here:
https://lore.kernel.org/linux-mm/YdwQcl6D5Mbp9Z4h@dhcp22.suse.cz/

<snipped>

> +static void lru_gen_change_state(bool enable)
> +{
> +	static DEFINE_MUTEX(state_mutex);
> +
> +	struct mem_cgroup *memcg;
> +
> +	cgroup_lock();
> +	cpus_read_lock();
> +	get_online_mems();
> +	mutex_lock(&state_mutex);
> +
> +	if (enable == lru_gen_enabled())
> +		goto unlock;
> +
> +	if (enable)
> +		static_branch_enable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
> +	else
> +		static_branch_disable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);

Fixed the lockdep warning for memory hotplug:
https://lore.kernel.org/linux-mm/87a6g0nczg.fsf@linux.ibm.com/

<snipped>


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 10/12] mm: multigenerational LRU: thrashing prevention
  2022-02-08  8:19 ` [PATCH v7 10/12] mm: multigenerational LRU: thrashing prevention Yu Zhao
@ 2022-02-08  8:43   ` Yu Zhao
  0 siblings, 0 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:43 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

On Tue, Feb 08, 2022 at 01:19:00AM -0700, Yu Zhao wrote:
> Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as
> requested by many desktop users [1].
> 
> When set to value N, it prevents the working set of N milliseconds
> from getting evicted. The OOM killer is triggered if this working set
> can't be kept in memory. Based on the average human detectable lag
> (~100ms), N=1000 usually eliminates intolerable lags due to thrashing.
> Larger values like N=3000 make lags less noticeable at the risk of
> premature OOM kills.

Refactored min_ttl into a separate patch as requested here:
https://lore.kernel.org/linux-mm/YdxSUuDc3OC4pe+f@dhcp22.suse.cz/

<snipped>


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 12/12] mm: multigenerational LRU: documentation
  2022-02-08  8:19 ` [PATCH v7 12/12] mm: multigenerational LRU: documentation Yu Zhao
@ 2022-02-08  8:44   ` Yu Zhao
  2022-02-14 10:28   ` Mike Rapoport
  1 sibling, 0 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-08  8:44 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

On Tue, Feb 08, 2022 at 01:19:02AM -0700, Yu Zhao wrote:
> Add a design doc and an admin guide.
> 
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> ---
>  Documentation/admin-guide/mm/index.rst        |   1 +
>  Documentation/admin-guide/mm/multigen_lru.rst | 121 ++++++++++++++
>  Documentation/vm/index.rst                    |   1 +
>  Documentation/vm/multigen_lru.rst             | 152 ++++++++++++++++++
>  4 files changed, 275 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
>  create mode 100644 Documentation/vm/multigen_lru.rst

Refactored the doc into a separate patch as requested here:
https://lore.kernel.org/linux-mm/Yd73pDkMOMVHhXzu@kernel.org/

Reworked the doc as requested here:
https://lore.kernel.org/linux-mm/YdwKB3SfF7hkB9Xv@kernel.org/

<snipped>


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 00/12] Multigenerational LRU Framework
  2022-02-08  8:18 [PATCH v7 00/12] Multigenerational LRU Framework Yu Zhao
                   ` (11 preceding siblings ...)
  2022-02-08  8:19 ` [PATCH v7 12/12] mm: multigenerational LRU: documentation Yu Zhao
@ 2022-02-08 10:11 ` Oleksandr Natalenko
  2022-02-08 11:14   ` Michal Hocko
  2022-02-11 20:12 ` Alexey Avramov
  2022-03-03  6:06 ` Vaibhav Jain
  14 siblings, 1 reply; 74+ messages in thread
From: Oleksandr Natalenko @ 2022-02-08 10:11 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko, Yu Zhao
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Yu Zhao

Hello.

On Tuesday 8 February 2022 9:18:50 CET Yu Zhao wrote:

<snipped>

Thanks for the new spin.

Is the patch submission broken for everyone, or only for me? I see raw emails cluttered with some garbage like =2D, and hence I cannot apply them from either my email client or lore.

Perhaps you've got a git repo things can be pulled from, so that we do not depend on mailing systems and/or tools breaking plaintext?

Thanks.

-- 
Oleksandr Natalenko (post-factum)




^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 01/12] mm: x86, arm64: add arch_has_hw_pte_young()
  2022-02-08  8:18 ` [PATCH v7 01/12] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao
  2022-02-08  8:24   ` Yu Zhao
@ 2022-02-08 10:33   ` Will Deacon
  1 sibling, 0 replies; 74+ messages in thread
From: Will Deacon @ 2022-02-08 10:33 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Ying Huang,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

On Tue, Feb 08, 2022 at 01:18:51AM -0700, Yu Zhao wrote:
> Some architectures automatically set the accessed bit in PTEs, e.g.,
> x86 and arm64 v8.2. On architectures that don't have this capability,
> clearing the accessed bit in a PTE usually triggers a page fault
> following the TLB miss of this PTE (to emulate the accessed bit).
> 
> Being aware of this capability can help make better decisions, e.g.,
> whether to spread the work out over a period of time to reduce bursty
> page faults when trying to clear the accessed bit in many PTEs.
> 
> Note that theoretically this capability can be unreliable, e.g.,
> hotplugged CPUs might be different from builtin ones. Therefore it
> shouldn't be used in architecture-independent code that involves
> correctness, e.g., to determine whether TLB flushes are required (in
> combination with the accessed bit).
> 
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> ---
>  arch/arm64/include/asm/pgtable.h | 14 ++------------
>  arch/x86/include/asm/pgtable.h   |  6 +++---
>  include/linux/pgtable.h          | 13 +++++++++++++
>  mm/memory.c                      | 14 +-------------
>  4 files changed, 19 insertions(+), 28 deletions(-)

For the arm64 bit:

Acked-by: Will Deacon <will@kernel.org>

Will


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 00/12] Multigenerational LRU Framework
  2022-02-08 10:11 ` [PATCH v7 00/12] Multigenerational LRU Framework Oleksandr Natalenko
@ 2022-02-08 11:14   ` Michal Hocko
  2022-02-08 11:23     ` Oleksandr Natalenko
  0 siblings, 1 reply; 74+ messages in thread
From: Michal Hocko @ 2022-02-08 11:14 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Yu Zhao, Andi Kleen,
	Aneesh Kumar, Barry Song, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
	Linus Torvalds, Matthew Wilcox, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86

On Tue 08-02-22 11:11:02, Oleksandr Natalenko wrote:
[...]
> Is the patch submission broken for everyone, or for me only? I see raw
> emails cluttered with some garbage like =2D, and hence I cannot apply
> those neither from my email client nor from lore.

The patchset seems to be OK in my inbox; b4 [1] has downloaded the
full thread without any issues and I could apply all the patches
just fine.

[1] https://git.kernel.org/pub/scm/utils/b4/b4.git

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 00/12] Multigenerational LRU Framework
  2022-02-08 11:14   ` Michal Hocko
@ 2022-02-08 11:23     ` Oleksandr Natalenko
  0 siblings, 0 replies; 74+ messages in thread
From: Oleksandr Natalenko @ 2022-02-08 11:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Yu Zhao, Andi Kleen,
	Aneesh Kumar, Barry Song, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
	Linus Torvalds, Matthew Wilcox, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86

Hello.

On Tuesday 8 February 2022 12:14:00 CET Michal Hocko wrote:
> On Tue 08-02-22 11:11:02, Oleksandr Natalenko wrote:
> [...]
> > Is the patch submission broken for everyone, or for me only? I see raw
> > emails cluttered with some garbage like =2D, and hence I cannot apply
> > those neither from my email client nor from lore.
> 
> The patchset seems to be OK in my inbox; b4 [1] has downloaded the
> full thread without any issues and I could apply all the patches
> just fine.
> 
> [1] https://git.kernel.org/pub/scm/utils/b4/b4.git

Thanks, b4 worked for me as well.

-- 
Oleksandr Natalenko (post-factum)




^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation
  2022-02-08  8:18 ` [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation Yu Zhao
  2022-02-08  8:33   ` Yu Zhao
@ 2022-02-08 16:50   ` Johannes Weiner
  2022-02-10  2:53     ` Yu Zhao
  2022-02-13 10:04   ` Hillf Danton
  2022-02-23  8:27   ` Huang, Ying
  3 siblings, 1 reply; 74+ messages in thread
From: Johannes Weiner @ 2022-02-08 16:50 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Mel Gorman, Michal Hocko, Andi Kleen,
	Aneesh Kumar, Barry Song, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
	Linus Torvalds, Matthew Wilcox, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

Hi Yu,

Thanks for restructuring this from the last version. It's easier to
learn the new model when you start out with the bare bones, then let
optimizations and self-contained features follow later.

On Tue, Feb 08, 2022 at 01:18:55AM -0700, Yu Zhao wrote:
> To avoid confusions, the terms "promotion" and "demotion" will be
> applied to the multigenerational LRU, as a new convention; the terms
> "activation" and "deactivation" will be applied to the active/inactive
> LRU, as usual.
> 
> The aging produces young generations. Given an lruvec, it increments
> max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
> promotes hot pages to the youngest generation when it finds them
> accessed through page tables; the demotion of cold pages happens
> consequently when it increments max_seq. Since the aging is only
> interested in hot pages, its complexity is O(nr_hot_pages). Promotion
> in the aging path doesn't require any LRU list operations, only the
> updates of the gen counter and lrugen->nr_pages[]; demotion, unless
> as the result of the increment of max_seq, requires LRU list
> operations, e.g., lru_deactivate_fn().

I'm having trouble with this changelog. It opens with a footnote and
summarizes certain aspects of the implementation whose importance to
the reader isn't entirely clear at this time.

It would be better to start with a high-level overview of the problem
and how this algorithm solves it. How the reclaim algorithm needs to
find the page that is most suitable for eviction and to signal when
it's time to give up and OOM. Then explain how grouping pages into
multiple generations accomplishes that - in particular compared to the
current two use-once/use-many lists.

Explain the problem of MMU vs syscall references, and how tiering
addresses this.

Explain the significance of refaults and how the algorithm responds to
them. Not in terms of which running averages are updated, but in terms
of user-visible behavior ("will start swapping (more)" etc.)

Express *intent*, how it's supposed to behave wrt workloads and memory
pressure. The code itself will explain the how, its complexity etc.

Most reviewers will understand the fundamental challenges of page
reclaim. The difficulty is matching individual aspects of the problem
space to your individual components and design choices you have made.

Let us in on that thinking, please ;)

> @@ -892,6 +892,50 @@ config ANON_VMA_NAME
>  	  area from being merged with adjacent virtual memory areas due to the
>  	  difference in their name.
>  
> +# multigenerational LRU {
> +config LRU_GEN
> +	bool "Multigenerational LRU"
> +	depends on MMU
> +	# the following options can use up the spare bits in page flags
> +	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
> +	help
> +	  A high performance LRU implementation for memory overcommit. See
> +	  Documentation/admin-guide/mm/multigen_lru.rst and
> +	  Documentation/vm/multigen_lru.rst for details.

These files don't exist at this time, please introduce them before or
when referencing them. If they document things introduced later in the
patchset, please start with a minimal version of the file and update
it as you extend the algorithm and add optimizations etc.

It's really important to only reference previous patches, not later
ones. This allows reviewers to read the patches linearly.  Having to
search for missing pieces in patches you haven't looked at yet is bad.

> +config NR_LRU_GENS
> +	int "Max number of generations"
> +	depends on LRU_GEN
> +	range 4 31
> +	default 4
> +	help
> +	  Do not increase this value unless you plan to use working set
> +	  estimation and proactive reclaim to optimize job scheduling in data
> +	  centers.
> +
> +	  This option uses order_base_2(N+1) bits in page flags.
> +
> +config TIERS_PER_GEN
> +	int "Number of tiers per generation"
> +	depends on LRU_GEN
> +	range 2 4
> +	default 4
> +	help
> +	  Do not decrease this value unless you run out of spare bits in page
> +	  flags, i.e., you see the "Not enough bits in page flags" build error.
> +
> +	  This option uses N-2 bits in page flags.

Linus had pointed out that we shouldn't ask these questions of the
user. How do you pick numbers here? I'm familiar with workingset
estimation and proactive reclaim usecases but I wouldn't know.

Even if we removed the config option and hardcoded the number, this is
a question for kernel developers: What does "4" mean? How would
behavior differ if it were 3 or 5 instead? Presumably there is some
sort of behavior gradient. "As you increase the number of
generations/tiers, the user-visible behavior of the kernel will..."
This should really be documented.

I'd also reiterate Mel's point: Distribution kernels need to support
the full spectrum of applications and production environments. Unless
using non-defaults it's an extremely niche usecase (like compiling out
BUG() calls) compile-time options are not the right choice. If we do
need a tunable, it could make more sense to have a compile time upper
limit (to determine page flag space) combined with a runtime knob?

Thanks!


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation
  2022-02-08 16:50   ` Johannes Weiner
@ 2022-02-10  2:53     ` Yu Zhao
  0 siblings, 0 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-10  2:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Michal Hocko, Andi Kleen,
	Aneesh Kumar, Barry Song, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
	Linus Torvalds, Matthew Wilcox, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

On Tue, Feb 08, 2022 at 11:50:09AM -0500, Johannes Weiner wrote:

<snipped>

> On Tue, Feb 08, 2022 at 01:18:55AM -0700, Yu Zhao wrote:
> > To avoid confusions, the terms "promotion" and "demotion" will be
> > applied to the multigenerational LRU, as a new convention; the terms
> > "activation" and "deactivation" will be applied to the active/inactive
> > LRU, as usual.
> > 
> > The aging produces young generations. Given an lruvec, it increments
> > max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
> > promotes hot pages to the youngest generation when it finds them
> > accessed through page tables; the demotion of cold pages happens
> > consequently when it increments max_seq. Since the aging is only
> > interested in hot pages, its complexity is O(nr_hot_pages). Promotion
> > in the aging path doesn't require any LRU list operations, only the
> > updates of the gen counter and lrugen->nr_pages[]; demotion, unless
> > as the result of the increment of max_seq, requires LRU list
> > operations, e.g., lru_deactivate_fn().
> 
> I'm having trouble with this changelog. It opens with a footnote and
> summarizes certain aspects of the implementation whose importance to
> the reader aren't entirely clear at this time.
> 
> It would be better to start with a high-level overview of the problem
> and how this algorithm solves it. How the reclaim algorithm needs to
> find the page that is most suitable for eviction and to signal when
> it's time to give up and OOM. Then explain how grouping pages into
> multiple generations accomplishes that - in particular compared to the
> current two use-once/use-many lists.

Hi Johannes,

Thanks for reviewing!

I suspect the information you are looking for might have been in the
patchset but is scattered in a few places. Could you please glance at
the following pieces and let me know
  1. whether they cover some of the points you asked for
  2. and if so, whether there is a better order/place to present them?

The previous patch has a quick view on the architecture:
https://lore.kernel.org/linux-mm/20220208081902.3550911-5-yuzhao@google.com/

  Evictable pages are divided into multiple generations for each lruvec.
  The youngest generation number is stored in lrugen->max_seq for both
  anon and file types as they're aged on an equal footing. The oldest
  generation numbers are stored in lrugen->min_seq[] separately for anon
  and file types as clean file pages can be evicted regardless of swap
  constraints. These three variables are monotonically increasing.
  
  Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
  in order to fit into the gen counter in folio->flags. Each truncated
  generation number is an index to lrugen->lists[]. The sliding window
  technique is used to track at least MIN_NR_GENS and at most
  MAX_NR_GENS generations. The gen counter stores (seq%MAX_NR_GENS)+1
  while a page is on one of lrugen->lists[]. Otherwise it stores 0.
  
  There are two conceptually independent processes (as in the
  manufacturing process): "the aging", which produces young generations,
  and "the eviction", which consumes old generations. They form a
  closed-loop system, i.e., "the page reclaim". Both processes can be
  invoked from userspace for the purposes of working set estimation and
  proactive reclaim. These features are required to optimize job
  scheduling (bin packing) in data centers. The variable size of the
  sliding window is designed for such use cases...

And the design doc contains a bit more detail, and I'd be happy to
present it earlier, if you think doing so would help.
https://lore.kernel.org/linux-mm/20220208081902.3550911-13-yuzhao@google.com/
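
For completeness, the gen counter encoding described in the excerpt
above boils down to roughly the following (a sketch based on helpers
in the patchset; lru_gen_from_seq() is from the groundwork patch, and
folio_lru_gen() shows how the counter is read back):

  /* an index into lrugen->lists[] */
  static inline int lru_gen_from_seq(unsigned long seq)
  {
          return seq % MAX_NR_GENS;
  }

  /* the gen counter stores gen+1 so that 0 means "not on lrugen->lists[]" */
  static inline int folio_lru_gen(struct folio *folio)
  {
          unsigned long flags = READ_ONCE(folio->flags);

          return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
  }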

> Explain the problem of MMU vs syscall references, and how tiering
> addresses this.

The previous patch also touched on this point:
https://lore.kernel.org/linux-mm/20220208081902.3550911-5-yuzhao@google.com/

  The protection of hot pages and the selection of cold pages are based
  on page access channels and patterns. There are two access channels:
  one through page tables and the other through file descriptors. The
  protection of the former channel is by design stronger because:
  1) The uncertainty in determining the access patterns of the former
     channel is higher due to the approximation of the accessed bit.
  2) The cost of evicting the former channel is higher due to the TLB
     flushes required and the likelihood of encountering the dirty bit.
  3) The penalty of underprotecting the former channel is higher because
     applications usually don't prepare themselves for major page faults
     like they do for blocked I/O. E.g., GUI applications commonly use
     dedicated I/O threads to avoid blocking the rendering threads.
  There are also two access patterns: one with temporal locality and the
  other without. For the reasons listed above, the former channel is
  assumed to follow the former pattern unless VM_SEQ_READ or
  VM_RAND_READ is present, and the latter channel is assumed to follow
  the latter pattern unless outlying refaults have been observed.

> Explain the significance of refaults and how the algorithm responds to
> them. Not in terms of which running averages are updated, but in terms
> of user-visible behavior ("will start swapping (more)" etc.)

And this patch touched on how tiers would help:
  1) It removes the cost of activation in the buffered access path by
     inferring whether pages accessed multiple times through file
     descriptors are statistically hot and thus worth promoting in the
     eviction path.
  2) It takes pages accessed through page tables into account and avoids
     overprotecting pages accessed multiple times through file
     descriptors. (Pages accessed through page tables are in the first
     tier since N=0.)
  3) More tiers provide better protection for pages accessed more than
     twice through file descriptors, when under heavy buffered I/O
     workloads.
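
(To make the tier notion concrete: a page accessed N times through
file descriptors sits in tier order_base_2(N), so unaccessed pages and
single-use pages both land in the first tier. A sketch, not the exact
helper in the patch, which works on the refs counter in folio->flags:)

  static inline int tier_from_fd_accesses(unsigned int n)
  {
          /* order_base_2(0) == order_base_2(1) == 0, i.e., the first tier */
          return order_base_2(n);
  }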

And the design doc:
https://lore.kernel.org/linux-mm/20220208081902.3550911-13-yuzhao@google.com/

  To select a type and a tier to evict from, it first compares min_seq[]
  to select the older type. If they are equal, it selects the type whose
  first tier has a lower refault percentage. The first tier contains
  single-use unmapped clean pages, which are the best bet.
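
As a rough illustration of that selection (illustrative pseudo-C only;
first_tier_refault_pct() is a made-up helper, not a function in the
patch):

  /* type: 0 == anon, 1 == file */
  static int older_type_to_evict(struct lru_gen_struct *lrugen)
  {
          if (lrugen->min_seq[0] != lrugen->min_seq[1])
                  return lrugen->min_seq[0] < lrugen->min_seq[1] ? 0 : 1;

          /* equally old: pick the type whose first tier refaults less */
          return first_tier_refault_pct(lrugen, 0) <=
                 first_tier_refault_pct(lrugen, 1) ? 0 : 1;
  }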

> Express *intent*, how it's supposed to behave wrt workloads and memory
> pressure. The code itself will explain the how, its complexity etc.

Hmm... This part I'm not so sure about. It seems to me this is equivalent to
describing how it works.

> Most reviewers will understand the fundamental challenges of page
> reclaim. The difficulty is matching individual aspects of the problem
> space to your individual components and design choices you have made.
> 
> Let us in on that thinking, please ;)

Agreed. I'm sure I haven't covered everything. So I'm trying to figure
out what's important but missing/insufficient.

> > @@ -892,6 +892,50 @@ config ANON_VMA_NAME
> >  	  area from being merged with adjacent virtual memory areas due to the
> >  	  difference in their name.
> >  
> > +# multigenerational LRU {
> > +config LRU_GEN
> > +	bool "Multigenerational LRU"
> > +	depends on MMU
> > +	# the following options can use up the spare bits in page flags
> > +	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
> > +	help
> > +	  A high performance LRU implementation for memory overcommit. See
> > +	  Documentation/admin-guide/mm/multigen_lru.rst and
> > +	  Documentation/vm/multigen_lru.rst for details.
> 
> These files don't exist at this time, please introduce them before or
> when referencing them. If they document things introduced later in the
> patchset, please start with a minimal version of the file and update
> it as you extend the algorithm and add optimizations etc.
> 
> It's really important to only reference previous patches, not later
> ones. This allows reviewers to read the patches linearly.  Having to
> search for missing pieces in patches you haven't looked at yet is bad.

Okay, will remove this bit from this patch.

> > +config NR_LRU_GENS
> > +	int "Max number of generations"
> > +	depends on LRU_GEN
> > +	range 4 31
> > +	default 4
> > +	help
> > +	  Do not increase this value unless you plan to use working set
> > +	  estimation and proactive reclaim to optimize job scheduling in data
> > +	  centers.
> > +
> > +	  This option uses order_base_2(N+1) bits in page flags.
> > +
> > +config TIERS_PER_GEN
> > +	int "Number of tiers per generation"
> > +	depends on LRU_GEN
> > +	range 2 4
> > +	default 4
> > +	help
> > +	  Do not decrease this value unless you run out of spare bits in page
> > +	  flags, i.e., you see the "Not enough bits in page flags" build error.
> > +
> > +	  This option uses N-2 bits in page flags.
> 
> Linus had pointed out that we shouldn't ask these questions of the
> user. How do you pick numbers here? I'm familiar with workingset
> estimation and proactive reclaim usecases but I wouldn't know.
> 
> Even if we removed the config option and hardcoded the number, this is
> a question for kernel developers: What does "4" mean? How would
> behavior differ if it were 3 or 5 instead? Presumably there is some
> sort of behavior gradient. "As you increase the number of
> generations/tiers, the user-visible behavior of the kernel will..."
> This should really be documented.
> 
> I'd also reiterate Mel's point: Distribution kernels need to support
> the full spectrum of applications and production environments. Unless
> using non-defaults is an extremely niche usecase (like compiling out
> BUG() calls), compile-time options are not the right choice. If we do
> need a tunable, it could make more sense to have a compile time upper
> limit (to determine page flag space) combined with a runtime knob?

I agree, and I think only time can answer all these questions :)

This effort is not in its final stage but at its very beginning. More
experiments and wider adoption are required to see how it's going to
evolve or where it leads. For now, there is just no way to tell whether
those values make sense for the majority or whether we need the runtime knobs.

These are valid concerns, but TBH, I think they are minor ones because
most users need not worry about them -- this patchset has been used
in several downstream kernels and I haven't heard any complaints about
those options/values:
https://lore.kernel.org/linux-mm/20220208081902.3550911-1-yuzhao@google.com/

1. Android ARCVM
2. Arch Linux Zen
3. Chrome OS
4. Liquorix
5. post-factum
6. XanMod

Then why do we need these options? Because there are always exceptions,
as stated in the descriptions of those options. Sometimes we just can't
decide everything for users -- the answers lie in their use cases. The
bottom line is, if this starts bothering people or gets in somebody's
way, I'd be glad to revisit. Fair enough?

Thanks!


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-02-08  8:18 ` [PATCH v7 04/12] mm: multigenerational LRU: groundwork Yu Zhao
  2022-02-08  8:28   ` Yu Zhao
@ 2022-02-10 20:41   ` Johannes Weiner
  2022-02-15  9:43     ` Yu Zhao
  2022-02-10 21:37   ` Matthew Wilcox
  2 siblings, 1 reply; 74+ messages in thread
From: Johannes Weiner @ 2022-02-10 20:41 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Mel Gorman, Michal Hocko, Andi Kleen,
	Aneesh Kumar, Barry Song, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
	Linus Torvalds, Matthew Wilcox, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

Hello Yu,

On Tue, Feb 08, 2022 at 01:18:54AM -0700, Yu Zhao wrote:
> @@ -92,11 +92,196 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio)
>  	return lru;
>  }
>  
> +#ifdef CONFIG_LRU_GEN
> +
> +static inline bool lru_gen_enabled(void)
> +{
> +	return true;
> +}
> +
> +static inline bool lru_gen_in_fault(void)
> +{
> +	return current->in_lru_fault;
> +}
> +
> +static inline int lru_gen_from_seq(unsigned long seq)
> +{
> +	return seq % MAX_NR_GENS;
> +}
> +
> +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
> +{
> +	unsigned long max_seq = lruvec->lrugen.max_seq;
> +
> +	VM_BUG_ON(gen >= MAX_NR_GENS);
> +
> +	/* see the comment on MIN_NR_GENS */
> +	return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
> +}

I'm still reading the series, so correct me if I'm wrong: the "active"
set is split into two generations for the sole purpose of the
second-chance policy for fresh faults, right?

If so, it'd be better to have the comment here instead of down by
MIN_NR_GENS. This is the place that defines what "active" is, so this
is where the reader asks what it means and what it implies. The
definition of MIN_NR_GENS can be briefer: "need at least two for
second chance, see lru_gen_is_active() for details".

> +static inline void lru_gen_update_size(struct lruvec *lruvec, enum lru_list lru,
> +				       int zone, long delta)
> +{
> +	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +
> +	lockdep_assert_held(&lruvec->lru_lock);
> +	WARN_ON_ONCE(delta != (int)delta);
> +
> +	__mod_lruvec_state(lruvec, NR_LRU_BASE + lru, delta);
> +	__mod_zone_page_state(&pgdat->node_zones[zone], NR_ZONE_LRU_BASE + lru, delta);
> +}

This is a duplicate of update_lru_size(), please use that instead.

Yeah technically you don't need the mem_cgroup_update_lru_size() but
that's not worth sweating over, better to keep it simple.

> +static inline void lru_gen_balance_size(struct lruvec *lruvec, struct folio *folio,
> +					int old_gen, int new_gen)

lru_gen_update_lru_sizes() for this one would be more descriptive imo
and in line with update_lru_size() that it's built on.

> +{
> +	int type = folio_is_file_lru(folio);
> +	int zone = folio_zonenum(folio);
> +	int delta = folio_nr_pages(folio);
> +	enum lru_list lru = type * LRU_INACTIVE_FILE;
> +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +	VM_BUG_ON(old_gen != -1 && old_gen >= MAX_NR_GENS);
> +	VM_BUG_ON(new_gen != -1 && new_gen >= MAX_NR_GENS);
> +	VM_BUG_ON(old_gen == -1 && new_gen == -1);

Could be a bit easier to read quickly with high-level descriptions:

> +	if (old_gen >= 0)
> +		WRITE_ONCE(lrugen->nr_pages[old_gen][type][zone],
> +			   lrugen->nr_pages[old_gen][type][zone] - delta);
> +	if (new_gen >= 0)
> +		WRITE_ONCE(lrugen->nr_pages[new_gen][type][zone],
> +			   lrugen->nr_pages[new_gen][type][zone] + delta);
> +
	/* Addition */
> +	if (old_gen < 0) {
> +		if (lru_gen_is_active(lruvec, new_gen))
> +			lru += LRU_ACTIVE;
> +		lru_gen_update_size(lruvec, lru, zone, delta);
> +		return;
> +	}
> +
	/* Removal */
> +	if (new_gen < 0) {
> +		if (lru_gen_is_active(lruvec, old_gen))
> +			lru += LRU_ACTIVE;
> +		lru_gen_update_size(lruvec, lru, zone, -delta);
> +		return;
> +	}
> +
	/* Promotion */
> +	if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) {
> +		lru_gen_update_size(lruvec, lru, zone, -delta);
> +		lru_gen_update_size(lruvec, lru + LRU_ACTIVE, zone, delta);
> +	}
> +
> +	/* Promotion is legit while a page is on an LRU list, but demotion isn't. */

	/* Demotion happens during aging when pages are isolated, never on-LRU */
> +	VM_BUG_ON(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen));
> +}

On that note, please move introduction of the promotion and demotion
bits to the next patch. They aren't used here yet, and I spent some
time jumping around patches to verify the promotion callers and
> confirm the validity of the BUG_ON.

> +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> +{
> +	int gen;
> +	unsigned long old_flags, new_flags;
> +	int type = folio_is_file_lru(folio);
> +	int zone = folio_zonenum(folio);
> +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +	if (folio_test_unevictable(folio) || !lrugen->enabled)
> +		return false;

These two checks should be in the callsite and the function should
return void. Otherwise you can't understand the callsite without
drilling down into lrugen code, even if lrugen is disabled.

folio_add_lru() gets it right.

> +	/*
> +	 * There are three common cases for this page:
> +	 * 1) If it shouldn't be evicted, e.g., it was just faulted in, add it
> +	 *    to the youngest generation.

"shouldn't be evicted" makes it sound like mlock. But they should just
be evicted last, right? Maybe:

	/*
	 * Pages start in different generations depending on
	 * advance knowledge we have about their hotness and
	 * evictability:
	 * 
	 * 1. Already active pages start out youngest. This can be
	 *    fresh faults, or refaults of previously hot pages.
	 * 2. Cold pages that require writeback before becoming
	 *    evictable start on the second oldest generation.
	 * 3. Everything else (clean, cold) starts old.
	 */

On that note, I think #1 is reintroducing a problem we have fixed
before, which is trashing the workingset with a flood of use-once
mmapped pages. It's the classic scenario where LFU beats LRU.

Mapped streaming IO isn't very common, but it does happen. See these
commits:

dfc8d636cdb95f7b792d5ba8c9f3b295809c125d
31c0569c3b0b6cc8a867ac6665ca081553f7984c
645747462435d84c6c6a64269ed49cc3015f753d

From the changelog:

    The used-once mapped file page detection patchset.
    
    It is meant to help workloads with large amounts of shortly used file
    mappings, like rtorrent hashing a file or git when dealing with loose
    objects (git gc on a bigger site?).
    
    Right now, the VM activates referenced mapped file pages on first
    encounter on the inactive list and it takes a full memory cycle to
    reclaim them again.  When those pages dominate memory, the system
    no longer has a meaningful notion of 'working set' and is required
    to give up the active list to make reclaim progress.  Obviously,
    this results in rather bad scanning latencies and the wrong pages
    being reclaimed.
    
    This patch makes the VM be more careful about activating mapped file
    pages in the first place.  The minimum granted lifetime without
    another memory access becomes an inactive list cycle instead of the
    full memory cycle, which is more natural given the mentioned loads.

Translating this to multigen, it seems fresh faults should really
start on the second oldest rather than on the youngest generation, to
get a second chance but without jeopardizing the workingset if they
don't take it.

> +	 * 2) If it can't be evicted immediately, i.e., it's an anon page and
> +	 *    not in swapcache, or a dirty page pending writeback, add it to the
> +	 *    second oldest generation.
> +	 * 3) If it may be evicted immediately, e.g., it's a clean page, add it
> +	 *    to the oldest generation.
> +	 */
> +	if (folio_test_active(folio))
> +		gen = lru_gen_from_seq(lrugen->max_seq);
> +	else if ((!type && !folio_test_swapcache(folio)) ||
> +		 (folio_test_reclaim(folio) &&
> +		  (folio_test_dirty(folio) || folio_test_writeback(folio))))
> +		gen = lru_gen_from_seq(lrugen->min_seq[type] + 1);
> +	else
> +		gen = lru_gen_from_seq(lrugen->min_seq[type]);

Condition #2 is not quite clear to me, and the comment is incomplete:
The code does put dirty/writeback pages on the oldest gen as long as
they haven't been marked for immediate reclaim by the scanner
yet. HOWEVER, once the scanner does see those pages and sets
PG_reclaim, it will also activate them to move them out of the way
until writeback finishes (see shrink_page_list()) - at which point
we'll trigger #1. So that second part of #2 appears unreachable.

It could be a good exercise to describe how cache pages move through
the generations, similar to the comment on lru_deactivate_file_fn().
It's a good example of intent vs implementation.

On another note, "!type" meaning "anon" is a bit rough. Please follow
the "bool file" convention used elsewhere.

> @@ -113,6 +298,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
>  {
>  	enum lru_list lru = folio_lru_list(folio);
>  
> +	if (lru_gen_add_folio(lruvec, folio, true))
> +		return;
> +

bool parameters are notoriously hard to follow in the callsite. Can
you please add lru_gen_add_folio_tail() instead and have them use a
common helper?
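
Something along these lines, i.e. keeping the bool internal to a common
helper (sketch only; __lru_gen_add_folio() would be the current
lru_gen_add_folio() renamed):

  static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio)
  {
          return __lru_gen_add_folio(lruvec, folio, false);
  }

  static inline bool lru_gen_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
  {
          return __lru_gen_add_folio(lruvec, folio, true);
  }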

> @@ -127,6 +315,9 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
>  static __always_inline
>  void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
>  {
> +	if (lru_gen_del_folio(lruvec, folio, false))
> +		return;
> +
>  	list_del(&folio->lru);
>  	update_lru_size(lruvec, folio_lru_list(folio), folio_zonenum(folio),
>  			-folio_nr_pages(folio));
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index aed44e9b5d89..0f5e8a995781 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -303,6 +303,78 @@ enum lruvec_flags {
>  					 */
>  };
>  
> +struct lruvec;
> +
> +#define LRU_GEN_MASK		((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
> +#define LRU_REFS_MASK		((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
> +
> +#ifdef CONFIG_LRU_GEN
> +
> +#define MIN_LRU_BATCH		BITS_PER_LONG
> +#define MAX_LRU_BATCH		(MIN_LRU_BATCH * 128)

Those two aren't used in this patch, so it's hard to say whether they
are chosen correctly.

> + * Evictable pages are divided into multiple generations. The youngest and the
> + * oldest generation numbers, max_seq and min_seq, are monotonically increasing.
> + * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An
> + * offset within MAX_NR_GENS, gen, indexes the LRU list of the corresponding
> + * generation. The gen counter in folio->flags stores gen+1 while a page is on
> + * one of lrugen->lists[]. Otherwise it stores 0.
> + *
> + * A page is added to the youngest generation on faulting. The aging needs to
> + * check the accessed bit at least twice before handing this page over to the
> + * eviction. The first check takes care of the accessed bit set on the initial
> + * fault; the second check makes sure this page hasn't been used since then.
> + * This process, AKA second chance, requires a minimum of two generations,
> + * hence MIN_NR_GENS. And to be compatible with the active/inactive LRU, these
> + * two generations are mapped to the active; the rest of generations, if they
> + * exist, are mapped to the inactive. PG_active is always cleared while a page
> + * is on one of lrugen->lists[] so that demotion, which happens consequently
> + * when the aging produces a new generation, needs not to worry about it.
> + */
> +#define MIN_NR_GENS		2U
> +#define MAX_NR_GENS		((unsigned int)CONFIG_NR_LRU_GENS)
> +
> +struct lru_gen_struct {

struct lrugen?

In fact, "lrugen" for the general function and variable namespace
might be better, the _ doesn't seem to pull its weight.

CONFIG_LRUGEN
struct lrugen
lrugen_foo()
etc.

> +	/* the aging increments the youngest generation number */
> +	unsigned long max_seq;
> +	/* the eviction increments the oldest generation numbers */
> +	unsigned long min_seq[ANON_AND_FILE];

The singular max_seq vs the split min_seq raises questions. Please add
a comment that explains or points to an explanation.

> +	/* the birth time of each generation in jiffies */
> +	unsigned long timestamps[MAX_NR_GENS];

This isn't in use until the thrashing-based OOM killing patch.

> +	/* the multigenerational LRU lists */
> +	struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
> +	/* the sizes of the above lists */
> +	unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
> +	/* whether the multigenerational LRU is enabled */
> +	bool enabled;

Not (really) in use until the runtime switch. Best to keep everybody
checking the global flag for now, and have the runtime switch patch
introduce this flag and switch necessary callsites over.

> +void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec);

"state" is what we usually init :) How about lrugen_init_lruvec()?

You can drop the memcg parameter and use lruvec_memcg().

> +#ifdef CONFIG_MEMCG
> +void lru_gen_init_memcg(struct mem_cgroup *memcg);
> +void lru_gen_free_memcg(struct mem_cgroup *memcg);

This should be either init+exit, or alloc+free.

Thanks


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-02-08  8:18 ` [PATCH v7 04/12] mm: multigenerational LRU: groundwork Yu Zhao
  2022-02-08  8:28   ` Yu Zhao
  2022-02-10 20:41   ` Johannes Weiner
@ 2022-02-10 21:37   ` Matthew Wilcox
  2022-02-13 21:16     ` Yu Zhao
  2 siblings, 1 reply; 74+ messages in thread
From: Matthew Wilcox @ 2022-02-10 21:37 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

On Tue, Feb 08, 2022 at 01:18:54AM -0700, Yu Zhao wrote:
> Evictable pages are divided into multiple generations for each lruvec.
> The youngest generation number is stored in lrugen->max_seq for both
> anon and file types as they're aged on an equal footing. The oldest
> generation numbers are stored in lrugen->min_seq[] separately for anon
> and file types as clean file pages can be evicted regardless of swap
> constraints. These three variables are monotonically increasing.
> 
> Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
> in order to fit into the gen counter in folio->flags. Each truncated
> generation number is an index to lrugen->lists[]. The sliding window
> technique is used to track at least MIN_NR_GENS and at most
> MAX_NR_GENS generations. The gen counter stores (seq%MAX_NR_GENS)+1
> while a page is on one of lrugen->lists[]. Otherwise it stores 0.
> 
> There are two conceptually independent processes (as in the
> manufacturing process): "the aging", which produces young generations,
> and "the eviction", which consumes old generations. They form a
> closed-loop system, i.e., "the page reclaim". Both processes can be
> invoked from userspace for the purposes of working set estimation and
> proactive reclaim. These features are required to optimize job
> scheduling (bin packing) in data centers. The variable size of the
> sliding window is designed for such use cases [1][2].
> 
> To avoid confusions, the terms "hot" and "cold" will be applied to the
> multigenerational LRU, as a new convention; the terms "active" and
> "inactive" will be applied to the active/inactive LRU, as usual.

[...]

> +++ b/include/linux/page-flags-layout.h
> @@ -26,6 +26,14 @@
>  
>  #define ZONES_WIDTH		ZONES_SHIFT
>  
> +#ifdef CONFIG_LRU_GEN
> +/* LRU_GEN_WIDTH is generated from order_base_2(CONFIG_NR_LRU_GENS + 1). */
> +#define LRU_REFS_WIDTH		(CONFIG_TIERS_PER_GEN - 2)
> +#else
> +#define LRU_GEN_WIDTH		0
> +#define LRU_REFS_WIDTH		0
> +#endif /* CONFIG_LRU_GEN */

I'm concerned about the number of bits being used in page->flags.
It seems to me that we already have six bits in use to aid us in choosing
which pages to reclaim: referenced, lru, active, workingset, reclaim,
unevictable.

What I was hoping to see from this patch set was reuse of those bits.
That would give us 32 queues in total.  Some would be special (eg pages
cannot migrate out of the unevictable queue), but it seems to me that you
effectively have 4 queues for active and 4 queues for inactive at this
point (unless I misunderstood that).  I think we need special numbers
for: Not on the LRU and Unevictable, but that still leaves us with 30
generations to split between active & inactive.

But maybe we still need some of those bits?  Perhaps it's not OK to say
that queue id 0 is !LRU, queue 1 is unevictable, queue #2 is workingset,
queues 3-7 are active, queues 8-15 are various degrees of inactive.
I'm assuming that it's not sensible to have a page that's marked as both
"reclaim" and "workingset", but perhaps it is.

Anyway, I don't understand this area well enough.  I was just hoping
for some simplification.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 00/12] Multigenerational LRU Framework
  2022-02-08  8:18 [PATCH v7 00/12] Multigenerational LRU Framework Yu Zhao
                   ` (12 preceding siblings ...)
  2022-02-08 10:11 ` [PATCH v7 00/12] Multigenerational LRU Framework Oleksandr Natalenko
@ 2022-02-11 20:12 ` Alexey Avramov
  2022-02-12 21:01   ` Yu Zhao
  2022-03-03  6:06 ` Vaibhav Jain
  14 siblings, 1 reply; 74+ messages in thread
From: Alexey Avramov @ 2022-02-11 20:12 UTC (permalink / raw)
  To: yuzhao
  Cc: 21cnbao, Michael, ak, akpm, aneesh.kumar, axboe, catalin.marinas,
	corbet, dave.hansen, hannes, hdanton, jsbarnes, linux-arm-kernel,
	linux-doc, linux-kernel, linux-mm, mgorman, mhocko, page-reclaim,
	riel, rppt, torvalds, vbabka, will, willy, x86, ying.huang

Aggressive swapping even with vm.swappiness=1 with MGLRU
========================================================

Reading a large mmapped file leads to super aggressive swapping.
Reducing vm.swappiness even to 1 has no effect.

Demo: https://www.youtube.com/watch?v=J81kwJeuW58

Linux 5.17-rc3, Multigenerational LRU v7, 
vm.swappiness=1, MemTotal: 11.5 GiB.

$ cache-bench -r 35000 -m1 -b1 -p1 -f test20000
Reading mmapped file (file size: 20000 MiB)
cache-bench v0.2.0: https://github.com/hakavlad/cache-bench

Swapping started with MemAvailable=71%.
At the end 33 GiB was swapped out when MemAvailable=60%.

Is it OK?


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 00/12] Multigenerational LRU Framework
  2022-02-11 20:12 ` Alexey Avramov
@ 2022-02-12 21:01   ` Yu Zhao
  0 siblings, 0 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-12 21:01 UTC (permalink / raw)
  To: Alexey Avramov
  Cc: 21cnbao, Michael, ak, akpm, aneesh.kumar, axboe, catalin.marinas,
	corbet, dave.hansen, hannes, hdanton, jsbarnes, linux-arm-kernel,
	linux-doc, linux-kernel, linux-mm, mgorman, mhocko, page-reclaim,
	riel, rppt, torvalds, vbabka, will, willy, x86, ying.huang

On Sat, Feb 12, 2022 at 05:12:19AM +0900, Alexey Avramov wrote:
> Aggressive swapping even with vm.swappiness=1 with MGLRU
> ========================================================
> 
> Reading a large mmapped file leads to super aggressive swapping.
> Reducing vm.swappiness even to 1 has no effect.

Mind explaining why you think it's "super aggressive"? I assume you
expected a different behavior that would perform better. If so,
please spell it out.

> Demo: https://www.youtube.com/watch?v=J81kwJeuW58
> 
> Linux 5.17-rc3, Multigenerational LRU v7, 
> vm.swappiness=1, MemTotal: 11.5 GiB.
> 
> $ cache-bench -r 35000 -m1 -b1 -p1 -f test20000
> Reading mmapped file (file size: 20000 MiB)
> cache-bench v0.2.0: https://github.com/hakavlad/cache-bench

Writing your own benchmark is a good exercise but fio is the standard
benchmark in this case. Please use it with --ioengine=mmap.

> Swapping started with MemAvailable=71%.
> At the end 33 GiB was swapped out when MemAvailable=60%.
> 
> Is it OK?

MemAvailable is an estimate (free + page cache), and it doesn't imply
any reclaim preferences. In the worst case scenario, e.g., out of swap
space, MemAvailable *may* be reclaimed.

Here is my benchmark result with file mmap + *high* swap usage. Ram
disk was used to reduce the variance in the result (and SSD wear out
if you care). More details on additional configurations here:
https://lore.kernel.org/linux-mm/20220208081902.3550911-6-yuzhao@google.com/

  Mixed workloads:
    fio (buffered I/O): +13%
                IOPS         BW
      5.17-rc3: 275k         1075MiB/s
            v7: 313k         1222MiB/s

    memcached (anon): +12%
                Ops/sec      KB/sec
      5.17-rc3: 511282.72    19861.04
            v7: 572408.80    22235.49

  cat mmap.sh
  systemctl restart memcached
  swapoff -a
  umount /mnt
  rmmod brd
  
  modprobe brd rd_nr=2 rd_size=56623104
  
  mkswap /dev/ram0
  swapon /dev/ram0
  
  mkfs.ext4 /dev/ram1
  mount -t ext4 /dev/ram1 /mnt
  
  memtier_benchmark -S /var/run/memcached/memcached.sock \
  -P memcache_binary -n allkeys --key-minimum=1 \
  --key-maximum=50000000 --key-pattern=P:P -c 1 \
  -t 36 --ratio 1:0 --pipeline 8 -d 2000
  
  sysctl vm.overcommit_memory=1
  
  fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \
  --buffered=1 --ioengine=mmap --iodepth=128 --iodepth_batch_submit=32 \
  --iodepth_batch_complete=32 --rw=randread --random_distribution=random \
  --norandommap --time_based --ramp_time=10m --runtime=990m \
  --group_reporting &
  pid=$!
  
  sleep 200
  
  memtier_benchmark -S /var/run/memcached/memcached.sock \
  -P memcache_binary -n allkeys --key-minimum=1 \
  --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 --ratio 0:1 \
  --pipeline 8 --randomize --distinct-client-seed
  
  kill -INT $pid
  wait


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation
  2022-02-08  8:18 ` [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation Yu Zhao
  2022-02-08  8:33   ` Yu Zhao
  2022-02-08 16:50   ` Johannes Weiner
@ 2022-02-13 10:04   ` Hillf Danton
  2022-02-17  0:13     ` Yu Zhao
  2022-02-23  8:27   ` Huang, Ying
  3 siblings, 1 reply; 74+ messages in thread
From: Hillf Danton @ 2022-02-13 10:04 UTC (permalink / raw)
  To: Yu Zhao; +Cc: Johannes Weiner, linux-kernel, linux-mm

Hello Yu

On Tue,  8 Feb 2022 01:18:55 -0700 Yu Zhao wrote:
> +
> +/******************************************************************************
> + *                          the aging
> + ******************************************************************************/
> +
> +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> +{
> +	unsigned long old_flags, new_flags;
> +	int type = folio_is_file_lru(folio);
> +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
> +
> +	do {
> +		new_flags = old_flags = READ_ONCE(folio->flags);
> +		VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
> +
> +		new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;

Is the chance of a deadloop zero if new_gen != old_gen?

> +		new_gen = (old_gen + 1) % MAX_NR_GENS;
> +
> +		new_flags &= ~LRU_GEN_MASK;
> +		new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
> +		new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
> +		/* for folio_end_writeback() */

		/* for folio_end_writeback() and sort_folio() */ in terms of
reclaiming?

> +		if (reclaiming)
> +			new_flags |= BIT(PG_reclaim);
> +	} while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
> +
> +	lru_gen_balance_size(lruvec, folio, old_gen, new_gen);
> +
> +	return new_gen;
> +}

...

> +/******************************************************************************
> + *                          the eviction
> + ******************************************************************************/
> +
> +static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
> +{

Nit: the 80-column format is preferred.

> +	bool success;
> +	int gen = folio_lru_gen(folio);
> +	int type = folio_is_file_lru(folio);
> +	int zone = folio_zonenum(folio);
> +	int tier = folio_lru_tier(folio);
> +	int delta = folio_nr_pages(folio);
> +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +	VM_BUG_ON_FOLIO(gen >= MAX_NR_GENS, folio);
> +
> +	if (!folio_evictable(folio)) {
> +		success = lru_gen_del_folio(lruvec, folio, true);
> +		VM_BUG_ON_FOLIO(!success, folio);
> +		folio_set_unevictable(folio);
> +		lruvec_add_folio(lruvec, folio);
> +		__count_vm_events(UNEVICTABLE_PGCULLED, delta);
> +		return true;
> +	}
> +
> +	if (type && folio_test_anon(folio) && folio_test_dirty(folio)) {
> +		success = lru_gen_del_folio(lruvec, folio, true);
> +		VM_BUG_ON_FOLIO(!success, folio);
> +		folio_set_swapbacked(folio);
> +		lruvec_add_folio_tail(lruvec, folio);
> +		return true;
> +	}
> +
> +	if (tier > tier_idx) {
> +		int hist = lru_hist_from_seq(lrugen->min_seq[type]);
> +
> +		gen = folio_inc_gen(lruvec, folio, false);
> +		list_move_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
> +
> +		WRITE_ONCE(lrugen->promoted[hist][type][tier - 1],
> +			   lrugen->promoted[hist][type][tier - 1] + delta);
> +		__mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
> +		return true;
> +	}
> +
> +	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> +	    (type && folio_test_dirty(folio))) {
> +		gen = folio_inc_gen(lruvec, folio, true);
> +		list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
> +		return true;

This makes the cold dirty page cache younger instead of writing it out in the
background reclaimer context, and the question arising is whether the laundering
is deferred until the flusher threads are woken up in the following patches.

> +	}
> +
> +	return false;
> +}

Hillf


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-02-10 21:37   ` Matthew Wilcox
@ 2022-02-13 21:16     ` Yu Zhao
  0 siblings, 0 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-13 21:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

On Thu, Feb 10, 2022 at 09:37:25PM +0000, Matthew Wilcox wrote:
> On Tue, Feb 08, 2022 at 01:18:54AM -0700, Yu Zhao wrote:

[...]

> > +++ b/include/linux/page-flags-layout.h
> > @@ -26,6 +26,14 @@
> >  
> >  #define ZONES_WIDTH		ZONES_SHIFT
> >  
> > +#ifdef CONFIG_LRU_GEN
> > +/* LRU_GEN_WIDTH is generated from order_base_2(CONFIG_NR_LRU_GENS + 1). */
> > +#define LRU_REFS_WIDTH		(CONFIG_TIERS_PER_GEN - 2)
> > +#else
> > +#define LRU_GEN_WIDTH		0
> > +#define LRU_REFS_WIDTH		0
> > +#endif /* CONFIG_LRU_GEN */
> 
> I'm concerned about the number of bits being used in page->flags.
> It seems to me that we already have six bits in use to aid us in choosing
> which pages to reclaim: referenced, lru, active, workingset, reclaim,
> unevictable.
> 
> What I was hoping to see from this patch set was reuse of those bits.

Agreed. I have a plan to *reduce* some of those bits but it's a
relatively low priority item on my to-do list.
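
For reference, with the default Kconfig values (CONFIG_NR_LRU_GENS=4,
CONFIG_TIERS_PER_GEN=4) the two new fields cost:

  LRU_GEN_WIDTH  = order_base_2(4 + 1) = 3 bits  /* the gen counter  */
  LRU_REFS_WIDTH = 4 - 2               = 2 bits  /* the refs counter */

i.e., 5 bits in page->flags on top of the existing reclaim-related flags.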

> That would give us 32 queues in total.  Some would be special (eg pages
> cannot migrate out of the unevictable queue), but it seems to me that you
> effectively have 4 queues for active and 4 queues for inactive at this
> point (unless I misunderstood that).  I think we need special numbers
> for: Not on the LRU and Unevictable, but that still leaves us with 30
> generations to split between active & inactive.
> 
> But maybe we still need some of those bits?  Perhaps it's not OK to say
> that queue id 0 is !LRU, queue 1 is unevictable, queue #2 is workingset,
> queues 3-7 are active, queues 8-15 are various degrees of inactive.
> I'm assuming that it's not sensible to have a page that's marked as both
> "reclaim" and "workingset", but perhaps it is.
> 
> Anyway, I don't understand this area well enough.  I was just hoping
> for some simplification.

I plan to use the spare bits in folio->lru to indicate which lru list
a folio is on, i.e., active/inactive or generations or unevictable.

In addition, swapbacked could go to folio->mapping -- we wouldn't need
it if there were no MADV_FREE, i.e., it would be equivalent to
PageAnon() || shmem_mapping().

These two work items can be done separately and in parallel with
everything else that's been going on lately. It'd be awesome if somebody
volunteers, and I can find some resources from our side to test/review
the code so that we can have them done sooner.

The rest, in theory, can also be moved elsewhere but, IMO, it's not
really worth the effort given the situation isn't dire at the moment.
referenced and workingset are already reused between the active/inactive
lru and the multigenerational lru; reclaim is reused for readahead, but
readahead could be split out as an xa tag; lru is reused for isolation
synchronization.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 12/12] mm: multigenerational LRU: documentation
  2022-02-08  8:19 ` [PATCH v7 12/12] mm: multigenerational LRU: documentation Yu Zhao
  2022-02-08  8:44   ` Yu Zhao
@ 2022-02-14 10:28   ` Mike Rapoport
  2022-02-16  3:22     ` Yu Zhao
  1 sibling, 1 reply; 74+ messages in thread
From: Mike Rapoport @ 2022-02-14 10:28 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

Hi,

On Tue, Feb 08, 2022 at 01:19:02AM -0700, Yu Zhao wrote:
> Add a design doc and an admin guide.
> 
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> ---
>  Documentation/admin-guide/mm/index.rst        |   1 +
>  Documentation/admin-guide/mm/multigen_lru.rst | 121 ++++++++++++++
>  Documentation/vm/index.rst                    |   1 +
>  Documentation/vm/multigen_lru.rst             | 152 ++++++++++++++++++

Please consider splitting this patch into Documentation/admin-guide and
Documentation/vm parts.

For now I only had time to review the admin-guide part.

>  4 files changed, 275 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
>  create mode 100644 Documentation/vm/multigen_lru.rst
> 
> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
> index c21b5823f126..2cf5bae62036 100644
> --- a/Documentation/admin-guide/mm/index.rst
> +++ b/Documentation/admin-guide/mm/index.rst
> @@ -32,6 +32,7 @@ the Linux memory management.
>     idle_page_tracking
>     ksm
>     memory-hotplug
> +   multigen_lru
>     nommu-mmap
>     numa_memory_policy
>     numaperf
> diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
> new file mode 100644
> index 000000000000..16a543c8b886
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/multigen_lru.rst
> @@ -0,0 +1,121 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Multigenerational LRU
> +=====================
> +
> +Quick start
> +===========

There is no explanation why one would want to use multigenerational LRU
until the next section.

I think there should be an overview that explains why users would want to
enable multigenerational LRU. 

> +Build configurations
> +--------------------
> +:Required: Set ``CONFIG_LRU_GEN=y``.

Maybe 

	Set ``CONFIG_LRU_GEN=y`` to build the kernel with the multigenerational LRU

> +
> +:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to enable the
> + multigenerational LRU by default.
> +
> +Runtime configurations
> +----------------------
> +:Required: Write ``y`` to ``/sys/kernel/mm/lru_gen/enable`` if
> + ``CONFIG_LRU_GEN_ENABLED=n``.
> +
> +This file accepts different values to enabled or disabled the
> +following features:

Maybe

  After multigenerational LRU is enabled, this file accepts different
  values to enable or disable the following features:

> +====== ========
> +Values Features
> +====== ========
> +0x0001 the multigenerational LRU

The multigenerational LRU what?

What will happen if I write 0x2 to this file?
Please consider splitting "enable" and "features" attributes.

> +0x0002 clear the accessed bit in leaf page table entries **in large
> +       batches**, when MMU sets it (e.g., on x86)

Is extra markup really needed here...

> +0x0004 clear the accessed bit in non-leaf page table entries **as
> +       well**, when MMU sets it (e.g., on x86)

... and here?

As for the descriptions, what is the user-visible effect of these features?
How are different modes of clearing the accessed bit reflected in, say, GUI
responsiveness, database TPS, or probability of OOM?

> +[yYnN] apply to all the features above
> +====== ========
> +
> +E.g.,
> +::
> +
> +    echo y >/sys/kernel/mm/lru_gen/enabled
> +    cat /sys/kernel/mm/lru_gen/enabled
> +    0x0007
> +    echo 5 >/sys/kernel/mm/lru_gen/enabled
> +    cat /sys/kernel/mm/lru_gen/enabled
> +    0x0005
> +
> +Most users should enable or disable all the features unless some of
> +them have unforeseen side effects.
> +
> +Recipes
> +=======
> +Personal computers
> +------------------
> +Personal computers are more sensitive to thrashing because it can
> +cause janks (lags when rendering UI) and negatively impact user
> +experience. The multigenerational LRU offers thrashing prevention to
> +the majority of laptop and desktop users who don't have oomd.

I'd expect something like this paragraph in the overview.

> +
> +:Thrashing prevention: Write ``N`` to
> + ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
> + ``N`` milliseconds from getting evicted. The OOM killer is triggered
> + if this working set can't be kept in memory. Based on the average
> + human detectable lag (~100ms), ``N=1000`` usually eliminates
> + intolerable janks due to thrashing. Larger values like ``N=3000``
> + make janks less noticeable at the risk of premature OOM kills.

> +
> +Data centers
> +------------
> +Data centers want to optimize job scheduling (bin packing) to improve
> +memory utilizations. Job schedulers need to estimate whether a server
> +can allocate a certain amount of memory for a new job, and this step
> +is known as working set estimation, which doesn't impact the existing
> +jobs running on this server. They also want to attempt freeing some
> +cold memory from the existing jobs, and this step is known as proactive
> +reclaim, which improves the chance of landing a new job successfully.

This paragraph also fits the overview.

> +
> +:Optional: Increase ``CONFIG_NR_LRU_GENS`` to support more generations
> + for working set estimation and proactive reclaim.

Please add a note that this is a build-time option.

> +
> +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following

Is the debugfs interface relevant only for data centers?

> + format:
> + ::
> +
> +   memcg  memcg_id  memcg_path
> +     node  node_id
> +       min_gen  birth_time  anon_size  file_size
> +       ...
> +       max_gen  birth_time  anon_size  file_size
> +
> + ``min_gen`` is the oldest generation number and ``max_gen`` is the
> + youngest generation number. ``birth_time`` is in milliseconds.

It's unclear what the birth_time reference point is. Is it milliseconds from
system start, or is it measured some other way?

> + ``anon_size`` and ``file_size`` are in pages. The youngest generation
> + represents the group of the MRU pages and the oldest generation
> + represents the group of the LRU pages. For working set estimation, a

Please spell out MRU and LRU fully.

> + job scheduler writes to this file at a certain time interval to
> + create new generations, and it ranks available servers based on the
> + sizes of their cold memory defined by this time interval. For
> + proactive reclaim, a job scheduler writes to this file before it
> + tries to land a new job, and if it fails to materialize the cold
> + memory without impacting the existing jobs, it retries on the next
> + server according to the ranking result.

Is this knob only relevant for a job scheduler? Or can it be used in other
use-cases as well?

> +
> + This file accepts commands in the following subsections. Multiple

                              ^ described

> + command lines are supported, so does concatenation with delimiters
> + ``,`` and ``;``.
> +
> + ``/sys/kernel/debug/lru_gen_full`` contains additional stats for
> + debugging.
> +
> +:Working set estimation: Write ``+ memcg_id node_id max_gen
> + [can_swap [full_scan]]`` to ``/sys/kernel/debug/lru_gen`` to invoke
> + the aging. It scans PTEs for hot pages and promotes them to the
> + youngest generation ``max_gen``. Then it creates a new generation
> + ``max_gen+1``. Set ``can_swap`` to ``1`` to scan for hot anon pages
> + when swap is off. Set ``full_scan`` to ``0`` to reduce the overhead
> + as well as the coverage when scanning PTEs.
> +
> +:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness
> + [nr_to_reclaim]]`` to ``/sys/kernel/debug/lru_gen`` to invoke the
> + eviction. It evicts generations less than or equal to ``min_gen``.
> + ``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and
> + ``max_gen-1`` aren't fully aged and therefore can't be evicted. Use
> + ``nr_to_reclaim`` to limit the number of pages to evict.

I feel that /sys/kernel/debug/lru_gen is too overloaded.

> diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
> index 44365c4574a3..b48434300226 100644
> --- a/Documentation/vm/index.rst
> +++ b/Documentation/vm/index.rst
> @@ -25,6 +25,7 @@ algorithms.  If you are looking for advice on simply allocating memory, see the
>     ksm
>     memory-model
>     mmu_notifier
> +   multigen_lru
>     numa
>     overcommit-accounting
>     page_migration

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-02-10 20:41   ` Johannes Weiner
@ 2022-02-15  9:43     ` Yu Zhao
  2022-02-15 21:53       ` Johannes Weiner
  2022-03-11 10:16       ` Barry Song
  0 siblings, 2 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-15  9:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Michal Hocko, Andi Kleen,
	Aneesh Kumar, Barry Song, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
	Linus Torvalds, Matthew Wilcox, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

On Thu, Feb 10, 2022 at 03:41:57PM -0500, Johannes Weiner wrote:

Thanks for reviewing.

> > +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
> > +{
> > +	unsigned long max_seq = lruvec->lrugen.max_seq;
> > +
> > +	VM_BUG_ON(gen >= MAX_NR_GENS);
> > +
> > +	/* see the comment on MIN_NR_GENS */
> > +	return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
> > +}
> 
> I'm still reading the series, so correct me if I'm wrong: the "active"
> set is split into two generations for the sole purpose of the
> second-chance policy for fresh faults, right?

To be precise, the active/inactive notion on top of generations is
just for ABI compatibility, e.g., the counters in /proc/vmstat.
Otherwise, this function wouldn't be needed.

> If so, it'd be better to have the comment here instead of down by
> MIN_NR_GENS. This is the place that defines what "active" is, so this
> is where the reader asks what it means and what it implies. The
> definition of MIN_NR_GENS can be briefer: "need at least two for
> second chance, see lru_gen_is_active() for details".

It could be understood that way, but it'd be more appropriate to see
this function as an auxiliary and MIN_NR_GENS as something fundamental.
Therefore the former should refer to the latter. Specifically, the
"see the comment on MIN_NR_GENS" refers to this part:
  And to be compatible with the active/inactive LRU, these two
  generations are mapped to the active; the rest of generations, if
  they exist, are mapped to the inactive.

> > +static inline void lru_gen_update_size(struct lruvec *lruvec, enum lru_list lru,
> > +				       int zone, long delta)
> > +{
> > +	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > +
> > +	lockdep_assert_held(&lruvec->lru_lock);
> > +	WARN_ON_ONCE(delta != (int)delta);
> > +
> > +	__mod_lruvec_state(lruvec, NR_LRU_BASE + lru, delta);
> > +	__mod_zone_page_state(&pgdat->node_zones[zone], NR_ZONE_LRU_BASE + lru, delta);
> > +}
> 
> This is a duplicate of update_lru_size(), please use that instead.
> 
> Yeah technically you don't need the mem_cgroup_update_lru_size() but
> that's not worth sweating over, better to keep it simple.

I agree we don't need the mem_cgroup_update_lru_size() -- let me spell
out why:
  this function is not needed here because it updates the counters used
  only by the active/inactive lru code, i.e., get_scan_count().

However, we can't reuse update_lru_size() because MGLRU can trip the
WARN_ONCE() in mem_cgroup_update_lru_size().

Unlike lru_zone_size[], lrugen->nr_pages[] is eventually consistent.
To move a page to a different generation, the gen counter in page->flags
is updated first, which doesn't require the LRU lock. The second step,
i.e., the update of lrugen->nr_pages[], requires the LRU lock, and it
usually isn't done immediately due to batching. Meanwhile, if this page
is, for example, isolated, nr_pages[] becomes temporarily unbalanced.
And this trips the WARN_ONCE().

<snipped>

> 	/* Promotion */
> > +	if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) {
> > +		lru_gen_update_size(lruvec, lru, zone, -delta);
> > +		lru_gen_update_size(lruvec, lru + LRU_ACTIVE, zone, delta);
> > +	}
> > +
> > +	/* Promotion is legit while a page is on an LRU list, but demotion isn't. */
> 
> 	/* Demotion happens during aging when pages are isolated, never on-LRU */
> > +	VM_BUG_ON(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen));
> > +}
> 
> On that note, please move introduction of the promotion and demotion
> bits to the next patch. They aren't used here yet, and I spent some
> time jumping around patches to verify the promotion callers and
> confirm the validy of the BUG_ON.

Will do.

> > +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > +{
> > +	int gen;
> > +	unsigned long old_flags, new_flags;
> > +	int type = folio_is_file_lru(folio);
> > +	int zone = folio_zonenum(folio);
> > +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > +
> > +	if (folio_test_unevictable(folio) || !lrugen->enabled)
> > +		return false;
> 
> These two checks should be in the callsite and the function should
> return void. Otherwise you can't understand the callsite without
> drilling down into lrugen code, even if lrugen is disabled.

I agree it's a bit of a nuisance this way. The alternative is we'd need
ifdef or another helper at the call sites because lrugen->enabled is
specific to lrugen.

> > +	/*
> > +	 * There are three common cases for this page:
> > +	 * 1) If it shouldn't be evicted, e.g., it was just faulted in, add it
> > +	 *    to the youngest generation.
> 
> "shouldn't be evicted" makes it sound like mlock. But they should just
> be evicted last, right? Maybe:
> 
> 	/*
> 	 * Pages start in different generations depending on
> 	 * advance knowledge we have about their hotness and
> 	 * evictability:
> 	 * 
> 	 * 1. Already active pages start out youngest. This can be
> 	 *    fresh faults, or refaults of previously hot pages.
> 	 * 2. Cold pages that require writeback before becoming
> 	 *    evictable start on the second oldest generation.
> 	 * 3. Everything else (clean, cold) starts old.
> 	 */

Will do.

> On that note, I think #1 is reintroducing a problem we have fixed
> before, which is trashing the workingset with a flood of use-once
> mmapped pages. It's the classic scenario where LFU beats LRU.
> 
> Mapped streaming IO isn't very common, but it does happen. See these
> commits:
> 
> dfc8d636cdb95f7b792d5ba8c9f3b295809c125d
> 31c0569c3b0b6cc8a867ac6665ca081553f7984c
> 645747462435d84c6c6a64269ed49cc3015f753d
> 
> From the changelog:
> 
>     The used-once mapped file page detection patchset.
>     
>     It is meant to help workloads with large amounts of shortly used file
>     mappings, like rtorrent hashing a file or git when dealing with loose
>     objects (git gc on a bigger site?).
>     
>     Right now, the VM activates referenced mapped file pages on first
>     encounter on the inactive list and it takes a full memory cycle to
>     reclaim them again.  When those pages dominate memory, the system
>     no longer has a meaningful notion of 'working set' and is required
>     to give up the active list to make reclaim progress.  Obviously,
>     this results in rather bad scanning latencies and the wrong pages
>     being reclaimed.
>     
>     This patch makes the VM be more careful about activating mapped file
>     pages in the first place.  The minimum granted lifetime without
>     another memory access becomes an inactive list cycle instead of the
>     full memory cycle, which is more natural given the mentioned loads.
> 
> Translating this to multigen, it seems fresh faults should really
> start on the second oldest rather than on the youngest generation, to
> get a second chance but without jeopardizing the workingset if they
> don't take it.

This is a good point, and I had worked on a similar idea but failed
to measure its benefits. In addition to placing mmapped file pages in
older generations, I also tried placing refaulted anon pages in older
generations. My conclusion was that the initial LRU positions of NFU
pages are not a bottleneck for workloads I've tested. The efficiency
of testing/clearing the accessed bit is.

And some applications are smart enough to leverage MADV_SEQUENTIAL.
In this case, MGLRU does place mmapped file pages in the oldest
generation.
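
For reference, the application side of that hint is a single madvise()
call after mmap(). A minimal userspace sketch, not taken from any real
app and with error handling kept to a bare minimum:

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* stream through a large file once, front to back, and say so */
  int main(int argc, char **argv)
  {
  	struct stat st;
  	unsigned long sum = 0;
  	char *p;
  	int fd;

  	if (argc != 2)
  		return 1;

  	fd = open(argv[1], O_RDONLY);
  	if (fd < 0 || fstat(fd, &st) || !st.st_size)
  		return 1;

  	p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  	if (p == MAP_FAILED)
  		return 1;

  	/* the hint that lets MGLRU start these pages in the oldest generation */
  	madvise(p, st.st_size, MADV_SEQUENTIAL);

  	for (off_t i = 0; i < st.st_size; i++)
  		sum += (unsigned char)p[i];	/* stand-in for hashing */

  	printf("%lu\n", sum);
  	munmap(p, st.st_size);
  	close(fd);
  	return 0;
  }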

I have an oversimplified script that uses memcached to mimic a
non-streaming workload and fio a (mmapped) streaming workload:
  1. With MADV_SEQUENTIAL, the non-streaming workload is about 5 times
     faster when using MGLRU. Somehow the baseline (rc3) swapped a lot.
     (It shouldn't, and I haven't figured out why.)
  2. Without MADV_SEQUENTIAL, the non-streaming workload is about twice
     as fast when using MGLRU. Both MGLRU and the baseline swapped a lot.

           MADV_SEQUENTIAL    non-streaming ops/sec (memcached)
  rc3      yes                 292k
  rc3      no                  203k
  rc3+v7   yes                1967k
  rc3+v7   no                  436k

  cat mmap.sh
  modprobe brd rd_nr=2 rd_size=56623104
  
  mkswap /dev/ram0
  swapon /dev/ram0
  
  mkfs.ext4 /dev/ram1
  mount -t ext4 /dev/ram1 /mnt
  
  memtier_benchmark -S /var/run/memcached/memcached.sock -P memcache_binary \
    -n allkeys --key-minimum=1 --key-maximum=50000000 --key-pattern=P:P -c 1 \
    -t 36 --ratio 1:0 --pipeline 8 -d 2000
  
  # streaming workload: --fadvise_hint=0 disables MADV_SEQUENTIAL
  fio -name=mglru --numjobs=12 --directory=/mnt --size=4224m --buffered=1 \
    --ioengine=mmap --iodepth=128 --iodepth_batch_submit=32 \
    --iodepth_batch_complete=32 --rw=read --time_based --ramp_time=10m \
    --runtime=180m --group_reporting &
  pid=$!
  
  sleep 200
  
  # non-streaming workload
  memtier_benchmark -S /var/run/memcached/memcached.sock -P memcache_binary \
    -n allkeys --key-minimum=1 --key-maximum=50000000 --key-pattern=R:R \
    -c 1 -t 36 --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
  
  kill -INT $pid
  wait

> > +	 * 2) If it can't be evicted immediately, i.e., it's an anon page and
> > +	 *    not in swapcache, or a dirty page pending writeback, add it to the
> > +	 *    second oldest generation.
> > +	 * 3) If it may be evicted immediately, e.g., it's a clean page, add it
> > +	 *    to the oldest generation.
> > +	 */
> > +	if (folio_test_active(folio))
> > +		gen = lru_gen_from_seq(lrugen->max_seq);
> > +	else if ((!type && !folio_test_swapcache(folio)) ||
> > +		 (folio_test_reclaim(folio) &&
> > +		  (folio_test_dirty(folio) || folio_test_writeback(folio))))
> > +		gen = lru_gen_from_seq(lrugen->min_seq[type] + 1);
> > +	else
> > +		gen = lru_gen_from_seq(lrugen->min_seq[type]);
> 
> Condition #2 is not quite clear to me, and the comment is incomplete:
> The code does put dirty/writeback pages on the oldest gen as long as
> they haven't been marked for immediate reclaim by the scanner
> yet.

Right.

> HOWEVER, once the scanner does see those pages and sets
> PG_reclaim, it will also activate them to move them out of the way
> until writeback finishes (see shrink_page_list()) - at which point
> we'll trigger #1. So that second part of #2 appears unreachable.

Yes, dirty file pages go to #1; dirty pages in swapcache go to #2.
(Ideally we want dirty file pages to go to #2 too. IMO, the code would
 be cleaner that way.)

> It could be a good exercise to describe how cache pages move through
> the generations, similar to the comment on lru_deactivate_file_fn().
> It's a good example of intent vs implementation.

Will do.

> On another note, "!type" meaning "anon" is a bit rough. Please follow
> the "bool file" convention used elsewhere.

Originally I used "file", e.g., in v2:
https://lore.kernel.org/linux-mm/20210413065633.2782273-9-yuzhao@google.com/

But I was told to rename it since "file" usually means file. Let me
rename it back to "file", unless somebody still objects.

> > @@ -113,6 +298,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
> >  {
> >  	enum lru_list lru = folio_lru_list(folio);
> >  
> > +	if (lru_gen_add_folio(lruvec, folio, true))
> > +		return;
> > +
> 
> bool parameters are notoriously hard to follow in the callsite. Can
> you please add lru_gen_add_folio_tail() instead and have them use a
> common helper?

I'm not sure -- there are several places like this one. My question is
whether we want to do it throughout this patchset. We'd end up with
many helpers and duplicate code. E.g., in this file alone, we have two
functions taking bool parameters:
  lru_gen_add_folio(..., bool reclaiming)
  lru_gen_del_folio(..., bool reclaiming)

I can't say they are very readable; at least they are very compact
right now. My concern is that we might lose the latter without having
enough of the former.

Perhaps this is something that we could revisit after you've finished
reviewing the entire patchset?
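
For concreteness, the split would look roughly like this, with
__lru_gen_add_folio() as a placeholder name for the current body that
takes the bool, and similarly for the del side:

  static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio)
  {
  	return __lru_gen_add_folio(lruvec, folio, false);
  }

  static inline bool lru_gen_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
  {
  	return __lru_gen_add_folio(lruvec, folio, true);
  }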

> > @@ -127,6 +315,9 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
> >  static __always_inline
> >  void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
> >  {
> > +	if (lru_gen_del_folio(lruvec, folio, false))
> > +		return;
> > +
> >  	list_del(&folio->lru);
> >  	update_lru_size(lruvec, folio_lru_list(folio), folio_zonenum(folio),
> >  			-folio_nr_pages(folio));
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index aed44e9b5d89..0f5e8a995781 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -303,6 +303,78 @@ enum lruvec_flags {
> >  					 */
> >  };
> >  
> > +struct lruvec;
> > +
> > +#define LRU_GEN_MASK		((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
> > +#define LRU_REFS_MASK		((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
> > +
> > +#ifdef CONFIG_LRU_GEN
> > +
> > +#define MIN_LRU_BATCH		BITS_PER_LONG
> > +#define MAX_LRU_BATCH		(MIN_LRU_BATCH * 128)
> 
> Those two aren't used in this patch, so it's hard to say whether they
> are chosen correctly.

Right. They slipped during the v6/v7 refactoring. Will move them to
the next patch.

> > + * Evictable pages are divided into multiple generations. The youngest and the
> > + * oldest generation numbers, max_seq and min_seq, are monotonically increasing.
> > + * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An
> > + * offset within MAX_NR_GENS, gen, indexes the LRU list of the corresponding
> > + * generation. The gen counter in folio->flags stores gen+1 while a page is on
> > + * one of lrugen->lists[]. Otherwise it stores 0.
> > + *
> > + * A page is added to the youngest generation on faulting. The aging needs to
> > + * check the accessed bit at least twice before handing this page over to the
> > + * eviction. The first check takes care of the accessed bit set on the initial
> > + * fault; the second check makes sure this page hasn't been used since then.
> > + * This process, AKA second chance, requires a minimum of two generations,
> > + * hence MIN_NR_GENS. And to be compatible with the active/inactive LRU, these
> > + * two generations are mapped to the active; the rest of generations, if they
> > + * exist, are mapped to the inactive. PG_active is always cleared while a page
> > + * is on one of lrugen->lists[] so that demotion, which happens consequently
> > + * when the aging produces a new generation, needs not to worry about it.
> > + */
> > +#define MIN_NR_GENS		2U
> > +#define MAX_NR_GENS		((unsigned int)CONFIG_NR_LRU_GENS)
> > +
> > +struct lru_gen_struct {
> 
> struct lrugen?
> 
> In fact, "lrugen" for the general function and variable namespace
> might be better, the _ doesn't seem to pull its weight.
> 
> CONFIG_LRUGEN
> struct lrugen
> lrugen_foo()
> etc.

No strong opinion here. I usually add underscores to functions and
types so that grep doesn't end up with tons of local variables.

> > +	/* the aging increments the youngest generation number */
> > +	unsigned long max_seq;
> > +	/* the eviction increments the oldest generation numbers */
> > +	unsigned long min_seq[ANON_AND_FILE];
> 
> The singular max_seq vs the split min_seq raises questions. Please add
> a comment that explains or points to an explanation.

Will do.

> > +	/* the birth time of each generation in jiffies */
> > +	unsigned long timestamps[MAX_NR_GENS];
> 
> This isn't in use until the thrashing-based OOM killing patch.

Will move it there.

> > +	/* the multigenerational LRU lists */
> > +	struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
> > +	/* the sizes of the above lists */
> > +	unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
> > +	/* whether the multigenerational LRU is enabled */
> > +	bool enabled;
> 
> Not (really) in use until the runtime switch. Best to keep everybody
> checking the global flag for now, and have the runtime switch patch
> introduce this flag and switch necessary callsites over.

Will do.

> > +void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec);
> 
> "state" is what we usually init :) How about lrugen_init_lruvec()?

Same story as "file", lol -- this used to be lru_gen_init_lruvec():
https://lore.kernel.org/linux-mm/20210413065633.2782273-9-yuzhao@google.com/

Naming is hard. Hopefully we can finalize it this time.

> You can drop the memcg parameter and use lruvec_memcg().

lruvec_memcg() isn't available yet when pgdat_init_internals() calls
this function because mem_cgroup_disabled() is initialized afterward.

> > +#ifdef CONFIG_MEMCG
> > +void lru_gen_init_memcg(struct mem_cgroup *memcg);
> > +void lru_gen_free_memcg(struct mem_cgroup *memcg);
> 
> This should be either init+exit, or alloc+free.

Will do.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-02-15  9:43     ` Yu Zhao
@ 2022-02-15 21:53       ` Johannes Weiner
  2022-02-21  8:14         ` Yu Zhao
  2022-03-11 10:16       ` Barry Song
  1 sibling, 1 reply; 74+ messages in thread
From: Johannes Weiner @ 2022-02-15 21:53 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Mel Gorman, Michal Hocko, Andi Kleen,
	Aneesh Kumar, Barry Song, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
	Linus Torvalds, Matthew Wilcox, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

Hi Yu,

On Tue, Feb 15, 2022 at 02:43:05AM -0700, Yu Zhao wrote:
> On Thu, Feb 10, 2022 at 03:41:57PM -0500, Johannes Weiner wrote:
> > > +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
> > > +{
> > > +	unsigned long max_seq = lruvec->lrugen.max_seq;
> > > +
> > > +	VM_BUG_ON(gen >= MAX_NR_GENS);
> > > +
> > > +	/* see the comment on MIN_NR_GENS */
> > > +	return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
> > > +}
> > 
> > I'm still reading the series, so correct me if I'm wrong: the "active"
> > set is split into two generations for the sole purpose of the
> > second-chance policy for fresh faults, right?
> 
> To be precise, the active/inactive notion on top of generations is
> just for ABI compatibility, e.g., the counters in /proc/vmstat.
> Otherwise, this function wouldn't be needed.

Ah! would you mind adding this as a comment to the function?

But AFAICS there is the lru_gen_del_folio() callsite that maps it to
the PG_active flag - which in turn gets used by add_folio() to place
the thing back on the max_seq generation. So I suppose there is a
secondary purpose of the function for remembering the page's rough age
for non-reclaim isolation. It would be good to capture that as well in
a comment on the function.
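
Maybe something along these lines, as a rough draft - wording entirely
up to you:

/*
 * For ABI compatibility, the two youngest generations are mapped to the
 * active counters, e.g., in /proc/vmstat, and the rest to the inactive
 * ones.  lru_gen_del_folio() also translates this into PG_active, so
 * that non-reclaim isolation (e.g., migration) retains the rough age
 * and lru_gen_add_folio() can put the folio back on the max_seq
 * generation.
 */
static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
{
	unsigned long max_seq = lruvec->lrugen.max_seq;

	VM_BUG_ON(gen >= MAX_NR_GENS);

	/* see the comment on MIN_NR_GENS */
	return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
}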

> > If so, it'd be better to have the comment here instead of down by
> > MIN_NR_GENS. This is the place that defines what "active" is, so this
> > is where the reader asks what it means and what it implies. The
> > definition of MIN_NR_GENS can be briefer: "need at least two for
> > second chance, see lru_gen_is_active() for details".
> 
> This could be understood this way. It'd be more appropriate to see
> this function as an auxiliary and MIN_NR_GENS as something fundamental.
> Therefore the former should refer to the latter. Specifically, the
> "see the comment on MIN_NR_GENS" refers to this part:
>   And to be compatible with the active/inactive LRU, these two
>   generations are mapped to the active; the rest of generations, if
>   they exist, are mapped to the inactive.

I agree, thanks for enlightening me.

> > > +static inline void lru_gen_update_size(struct lruvec *lruvec, enum lru_list lru,
> > > +				       int zone, long delta)
> > > +{
> > > +	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > > +
> > > +	lockdep_assert_held(&lruvec->lru_lock);
> > > +	WARN_ON_ONCE(delta != (int)delta);
> > > +
> > > +	__mod_lruvec_state(lruvec, NR_LRU_BASE + lru, delta);
> > > +	__mod_zone_page_state(&pgdat->node_zones[zone], NR_ZONE_LRU_BASE + lru, delta);
> > > +}
> > 
> > This is a duplicate of update_lru_size(), please use that instead.
> > 
> > Yeah technically you don't need the mem_cgroup_update_lru_size() but
> > that's not worth sweating over, better to keep it simple.
> 
> I agree we don't need the mem_cgroup_update_lru_size() -- let me spell
> out why:
>   this function is not needed here because it updates the counters used
>   only by the active/inactive lru code, i.e., get_scan_count().
> 
> However, we can't reuse update_lru_size() because MGLRU can trip the
> WARN_ONCE() in mem_cgroup_update_lru_size().
> 
> Unlike lru_zone_size[], lrugen->nr_pages[] is eventually consistent.
> To move a page to a different generation, the gen counter in page->flags
> is updated first, which doesn't require the LRU lock. The second step,
> i.e., the update of lrugen->nr_pages[], requires the LRU lock, and it
> usually isn't done immediately due to batching. Meanwhile, if this page
> is, for example, isolated, nr_pages[] becomes temporarily unbalanced.
> And this trips the WARN_ONCE().

Good insight.

But in that case, I'd still think it's better to use update_lru_size()
and gate the memcg update on lrugen-enabled, with a short comment
saying that lrugen has its own per-cgroup counts already. It's just a
bit too error prone to duplicate the stat updates.

Even better would be:

static __always_inline
void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
{
	enum lru_list lru = folio_lru_list(folio);

	update_lru_size(lruvec, lru, folio_zonenum(folio),
			folio_nr_pages(folio));
	if (lrugen_enabled(lruvec))
		lrugen_add_folio(lruvec, folio);
	else
		list_add(&folio->lru, &lruvec->lists[lru]);
}

But it does mean you'd have to handle unevictable pages. I'm reviewing
from the position that mglru is going to supplant the existing reclaim
algorithm in the long term, though, so being more comprehensive and
eliminating special cases where possible is all-positive, IMO.

Up to you. I'd only insist on reusing update_lru_size() at least.

> > > +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > > +{
> > > +	int gen;
> > > +	unsigned long old_flags, new_flags;
> > > +	int type = folio_is_file_lru(folio);
> > > +	int zone = folio_zonenum(folio);
> > > +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > > +
> > > +	if (folio_test_unevictable(folio) || !lrugen->enabled)
> > > +		return false;
> > 
> > These two checks should be in the callsite and the function should
> > return void. Otherwise you can't understand the callsite without
> > drilling down into lrugen code, even if lrugen is disabled.
> 
> I agree it's a bit of a nuisance this way. The alternative is we'd need
> an ifdef or another helper at the call sites because lrugen->enabled is
> specific to lrugen.

Coming from memcg, my experience has been that when you have a compile
time-optional MM extension like this, you'll sooner or later need a
config-independent helper to gate callbacks in generic code. So I
think it's a good idea to add one now.

One of these?

lruvec_on_lrugen()
lruvec_using_lrugen()
lruvec_lrugen_enabled()

lruvec_has_generations() :-)
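
Whichever name wins, the gate itself would be trivial; a sketch, using
the lrugen->enabled field from this patch:

#ifdef CONFIG_LRU_GEN
static inline bool lruvec_on_lrugen(struct lruvec *lruvec)
{
	return lruvec->lrugen.enabled;
}
#else
static inline bool lruvec_on_lrugen(struct lruvec *lruvec)
{
	return false;
}
#endif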

> > On that note, I think #1 is reintroducing a problem we have fixed
> > before, which is trashing the workingset with a flood of use-once
> > mmapped pages. It's the classic scenario where LFU beats LRU.
> > 
> > Mapped streaming IO isn't very common, but it does happen. See these
> > commits:
> > 
> > dfc8d636cdb95f7b792d5ba8c9f3b295809c125d
> > 31c0569c3b0b6cc8a867ac6665ca081553f7984c
> > 645747462435d84c6c6a64269ed49cc3015f753d
> > 
> > From the changelog:
> > 
> >     The used-once mapped file page detection patchset.
> >     
> >     It is meant to help workloads with large amounts of shortly used file
> >     mappings, like rtorrent hashing a file or git when dealing with loose
> >     objects (git gc on a bigger site?).
> >     
> >     Right now, the VM activates referenced mapped file pages on first
> >     encounter on the inactive list and it takes a full memory cycle to
> >     reclaim them again.  When those pages dominate memory, the system
> >     no longer has a meaningful notion of 'working set' and is required
> >     to give up the active list to make reclaim progress.  Obviously,
> >     this results in rather bad scanning latencies and the wrong pages
> >     being reclaimed.
> >     
> >     This patch makes the VM be more careful about activating mapped file
> >     pages in the first place.  The minimum granted lifetime without
> >     another memory access becomes an inactive list cycle instead of the
> >     full memory cycle, which is more natural given the mentioned loads.
> > 
> > Translating this to multigen, it seems fresh faults should really
> > start on the second oldest rather than on the youngest generation, to
> > get a second chance but without jeopardizing the workingset if they
> > don't take it.
> 
> This is a good point, and I had worked on a similar idea but failed
> to measure its benefits. In addition to placing mmapped file pages in
> older generations, I also tried placing refaulted anon pages in older
> generations. My conclusion was that the initial LRU positions of NFU
> pages are not a bottleneck for workloads I've tested. The efficiency
> of testing/clearing the accessed bit is.

The concern isn't the scan overhead, but jankiness from the workingset
being flooded out by streaming IO.

The concrete usecase at the time was a torrent client hashing a
downloaded file and thereby kicking out the desktop environment, which
caused jankiness. The hashing didn't benefit from caching - the file
wouldn't have fit into RAM anyway - so this was pointless to boot.

Essentially, the tradeoff is this:

1) If you treat new pages as hot, you accelerate workingset
transitions, but on the flipside you risk unnecessary refaults in
running applications when those new pages are one-off.

2) If you take new pages with a grain of salt, you protect existing
applications better from one-off floods, but risk refaults in NEW
application while they're trying to start up.

There are two arguments for why 2) is preferable:

1) Users are tolerant of cache misses when applications first launch,
   much less so after they've been running for hours.

2) Workingset transitions (and associated jankiness) are bounded by
   the amount of RAM you need to repopulate. But streaming IO is
   bounded by storage, and datasets are routinely several times the
   amount of RAM. Uncacheable sets in excess of RAM can produce an
   infinite stream of "new" references; not protecting the workingset
   from that means longer or even sustained jankiness.

> And some applications are smart enough to leverage MADV_SEQUENTIAL.
> In this case, MGLRU does place mmapped file pages in the oldest
> generation.

Yes, it makes sense to optimize when MADV_SEQUENTIAL is requested. But
that hint isn't reliably there, so it matters that we don't do poorly
when it's missing.

> I have an oversimplified script that uses memcached to mimic a
> non-streaming workload and fio a (mmapped) streaming workload:

Looking at the parameters and observed behavior, let me say up front
that this looks like a useful benchmark, but it doesn't capture the
scenario I was talking about above.

For one, the presence of swapping in both kernels suggests that the
"streaming IO" component actually has repeat access that could benefit
from caching. Second, I would expect memcache is accessing its memory
frequently and consistently, and so could withstand workingset
challenges from streaming IO better than, say, a desktop environment.

More on that below.

>   1. With MADV_SEQUENTIAL, the non-streaming workload is about 5 times
>      faster when using MGLRU. Somehow the baseline (rc3) swapped a lot.
>      (It shouldn't, and I haven't figured out why.)

Baseline swaps when there are cache refaults. This is regardless of
the hint: you may say you're accessing these pages sequentially, but
the refaults say you're reusing them, with a frequency that suggests
they might be cacheable. So it tries to cache them.

I'd be curious if that results in fio being faster, or whether it's
all just pointless thrashing. Can you share the fio results too?

We could patch baseline to prioritize MADV_SEQUENTIAL more, but...

>   2. Without MADV_SEQUENTIAL, the non-streaming workload is about twice
>      as fast when using MGLRU. Both MGLRU and the baseline swapped a lot.

...in practice I think this scenario will matter to a lot more users.

I would again be interested in the fio results.

>            MADV_SEQUENTIAL    non-streaming ops/sec (memcached)
>   rc3      yes                 292k
>   rc3      no                  203k
>   rc3+v7   yes                1967k
>   rc3+v7   no                  436k
> 
>   cat mmap.sh
>   modprobe brd rd_nr=2 rd_size=56623104
>   
>   mkswap /dev/ram0
>   swapon /dev/ram0
>   
>   mkfs.ext4 /dev/ram1
>   mount -t ext4 /dev/ram1 /mnt
>   
>   memtier_benchmark -S /var/run/memcached/memcached.sock -P memcache_binary \
>     -n allkeys --key-minimum=1 --key-maximum=50000000 --key-pattern=P:P -c 1 \
>     -t 36 --ratio 1:0 --pipeline 8 -d 2000
>   
>   # streaming workload: --fadvise_hint=0 disables MADV_SEQUENTIAL
>   fio -name=mglru --numjobs=12 --directory=/mnt --size=4224m --buffered=1 \
>     --ioengine=mmap --iodepth=128 --iodepth_batch_submit=32 \
>     --iodepth_batch_complete=32 --rw=read --time_based --ramp_time=10m \
>     --runtime=180m --group_reporting &

As per above, I think this would be closer to a cacheable workingset
than a streaming IO pattern. It depends on total RAM of course, but
size=4G and time_based should loop around pretty quickly.

Would you mind rerunning with files larger than RAM, to avoid repeat
accesses (or at least only repeat with large distances)?

Depending on how hot memcache runs, it may or may not be able to hold
onto its workingset. Testing interactivity is notoriously hard, but
using a smaller, intermittent workload is probably more representative
of overall responsiveness. Let fio ramp until memory is full, then do
perf stat -r 10 /bin/sh -c 'git shortlog v5.0.. >/dev/null; sleep 1'

I'll try to reproduce this again too. Back then, that workload gave me
a very janky desktop experience, and the patch very obvious relief.

> > > @@ -113,6 +298,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
> > >  {
> > >  	enum lru_list lru = folio_lru_list(folio);
> > >  
> > > +	if (lru_gen_add_folio(lruvec, folio, true))
> > > +		return;
> > > +
> > 
> > bool parameters are notoriously hard to follow in the callsite. Can
> > you please add lru_gen_add_folio_tail() instead and have them use a
> > common helper?
> 
> I'm not sure -- there are several places like this one. My question is
> whether we want to do it throughout this patchset. We'd end up with
> many helpers and duplicate code. E.g., in this file alone, we have two
> functions taking bool parameters:
>   lru_gen_add_folio(..., bool reclaiming)
>   lru_gen_del_folio(..., bool reclaiming)
> 
> I can't say they are very readable; at least they are very compact
> right now. My concern is that we might lose the latter without having
> enough of the former.
> 
> Perhaps this is something that we could revisit after you've finished
> reviewing the entire patchset?

Sure, fair enough.

> > > +void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec);
> > 
> > "state" is what we usually init :) How about lrugen_init_lruvec()?
> 
> Same story as "file", lol -- this used to be lru_gen_init_lruvec():
> https://lore.kernel.org/linux-mm/20210413065633.2782273-9-yuzhao@google.com/
> 
> Naming is hard. Hopefully we can finalize it this time.

Was that internal feedback? The revisions show this function went
through several names, but I don't see reviews requesting those. If
they weren't public I'm gonna pretend they didn't happen ;-)

> > You can drop the memcg parameter and use lruvec_memcg().
> 
> lruvec_memcg() isn't available yet when pgdat_init_internals() calls
> this function because mem_cgroup_disabled() is initialized afterward.

Good catch. That'll container_of() into garbage. However, we have to
assume that somebody's going to try that simplification again, so we
should set up the code now to prevent issues.

cgroup_disable parsing is self-contained, so we can pull it ahead in
the init sequence. How about this?

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9d05c3ca2d5e..b544d768edc8 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6464,9 +6464,9 @@ static int __init cgroup_disable(char *str)
 			break;
 		}
 	}
-	return 1;
+	return 0;
 }
-__setup("cgroup_disable=", cgroup_disable);
+early_param("cgroup_disable", cgroup_disable);
 
 void __init __weak enable_debug_cgroup(void) { }
 
Thanks!


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 12/12] mm: multigenerational LRU: documentation
  2022-02-14 10:28   ` Mike Rapoport
@ 2022-02-16  3:22     ` Yu Zhao
  2022-02-21  9:01       ` Mike Rapoport
  0 siblings, 1 reply; 74+ messages in thread
From: Yu Zhao @ 2022-02-16  3:22 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

On Mon, Feb 14, 2022 at 12:28:56PM +0200, Mike Rapoport wrote:

Thanks for reviewing.

> >  Documentation/admin-guide/mm/index.rst        |   1 +
> >  Documentation/admin-guide/mm/multigen_lru.rst | 121 ++++++++++++++
> >  Documentation/vm/index.rst                    |   1 +
> >  Documentation/vm/multigen_lru.rst             | 152 ++++++++++++++++++
> 
> Please consider splitting this patch into Documentation/admin-guide and
> Documentation/vm parts.

Will do.

> > +=====================
> > +Multigenerational LRU
> > +=====================
> +
> > +Quick start
> > +===========
> 
> There is no explanation why one would want to use multigenerational LRU
> until the next section.
> 
> I think there should be an overview that explains why users would want to
> enable multigenerational LRU. 

Will do.

> > +Build configurations
> > +--------------------
> > +:Required: Set ``CONFIG_LRU_GEN=y``.
> 
> Maybe 
> 
> 	Set ``CONFIG_LRU_GEN=y`` to build kernel with multigenerational LRU

Will do.

> > +:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to enable the
> > + multigenerational LRU by default.
> > +
> > +Runtime configurations
> > +----------------------
> > +:Required: Write ``y`` to ``/sys/kernel/mm/lru_gen/enable`` if
> > + ``CONFIG_LRU_GEN_ENABLED=n``.
> > +
> > +This file accepts different values to enabled or disabled the
> > +following features:
> 
> Maybe
> 
>   After multigenerational LRU is enabled, this file accepts different
>   values to enable or disable the following features:

Will do.

> > +====== ========
> > +Values Features
> > +====== ========
> > +0x0001 the multigenerational LRU
> 
> The multigenerational LRU what?

Itself? This depends on the POV, and I'm trying to determine what would
be the natural way to present it.

MGLRU itself could be seen as an add-on atop the existing page reclaim
or an alternative in parallel. The latter would be similar to sl[aou]b,
and that's how I personally see it.

But here I presented it more like the former because I feel this way is
more natural to users: the settings are like switches on a single panel.

> What will happen if I write 0x2 to this file?

Just like turning on a branch breaker while leaving the main breaker
off in a circuit breaker box. This is how I see it, and I'm totally
fine with changing it to whatever you'd recommend.

> Please consider splitting "enable" and "features" attributes.

How about s/Features/Components/?

> > +0x0002 clear the accessed bit in leaf page table entries **in large
> > +       batches**, when MMU sets it (e.g., on x86)
> 
> Is extra markup really needed here...
> 
> > +0x0004 clear the accessed bit in non-leaf page table entries **as
> > +       well**, when MMU sets it (e.g., on x86)
> 
> ... and here?

Will do.

> As for the descriptions, what is the user-visible effect of these features?
> How different modes of clearing the access bit are reflected in, say, GUI
> responsiveness, database TPS, or probability of OOM?

These remain to be seen :) I just added these switches in v7, per Mel's
request from the meeting we had. These were never tested in the field.

> > +[yYnN] apply to all the features above
> > +====== ========
> > +
> > +E.g.,
> > +::
> > +
> > +    echo y >/sys/kernel/mm/lru_gen/enabled
> > +    cat /sys/kernel/mm/lru_gen/enabled
> > +    0x0007
> > +    echo 5 >/sys/kernel/mm/lru_gen/enabled
> > +    cat /sys/kernel/mm/lru_gen/enabled
> > +    0x0005
> > +
> > +Most users should enable or disable all the features unless some of
> > +them have unforeseen side effects.
> > +
> > +Recipes
> > +=======
> > +Personal computers
> > +------------------
> > +Personal computers are more sensitive to thrashing because it can
> > +cause janks (lags when rendering UI) and negatively impact user
> > +experience. The multigenerational LRU offers thrashing prevention to
> > +the majority of laptop and desktop users who don't have oomd.
> 
> I'd expect something like this paragraph in overview.
> 
> > +
> > +:Thrashing prevention: Write ``N`` to
> > + ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
> > + ``N`` milliseconds from getting evicted. The OOM killer is triggered
> > + if this working set can't be kept in memory. Based on the average
> > + human detectable lag (~100ms), ``N=1000`` usually eliminates
> > + intolerable janks due to thrashing. Larger values like ``N=3000``
> > + make janks less noticeable at the risk of premature OOM kills.
> 
> > +
> > +Data centers
> > +------------
> > +Data centers want to optimize job scheduling (bin packing) to improve
> > +memory utilizations. Job schedulers need to estimate whether a server
> > +can allocate a certain amount of memory for a new job, and this step
> > +is known as working set estimation, which doesn't impact the existing
> > +jobs running on this server. They also want to attempt freeing some
> > +cold memory from the existing jobs, and this step is known as proactive
> > +reclaim, which improves the chance of landing a new job successfully.
> 
> This paragraph also fits overview.

Will do.

> > +:Optional: Increase ``CONFIG_NR_LRU_GENS`` to support more generations
> > + for working set estimation and proactive reclaim.
> 
> Please add a note that this is build time option.

Will do.

> > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
> 
> Is the debugfs interface relevant only for data centers?

For the moment, yes.

> > + format:
> > + ::
> > +
> > +   memcg  memcg_id  memcg_path
> > +     node  node_id
> > +       min_gen  birth_time  anon_size  file_size
> > +       ...
> > +       max_gen  birth_time  anon_size  file_size
> > +
> > + ``min_gen`` is the oldest generation number and ``max_gen`` is the
> > + youngest generation number. ``birth_time`` is in milliseconds.
> 
> It's unclear what the birth_time reference point is. Is it milliseconds
> from system start, or is it measured some other way?

Good point. Will clarify.

> > + ``anon_size`` and ``file_size`` are in pages. The youngest generation
> > + represents the group of the MRU pages and the oldest generation
> > + represents the group of the LRU pages. For working set estimation, a
> 
> Please spell out MRU and LRU fully.

Will do.

> > + job scheduler writes to this file at a certain time interval to
> > + create new generations, and it ranks available servers based on the
> > + sizes of their cold memory defined by this time interval. For
> > + proactive reclaim, a job scheduler writes to this file before it
> > + tries to land a new job, and if it fails to materialize the cold
> > + memory without impacting the existing jobs, it retries on the next
> > + server according to the ranking result.
> 
> Is this knob only relevant for a job scheduler? Or can it be used in
> other use cases as well?

There are other concrete use cases but I'm not ready to discuss them
yet.

> > + This file accepts commands in the following subsections. Multiple
> 
>                               ^ described

Will do.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation
  2022-02-13 10:04   ` Hillf Danton
@ 2022-02-17  0:13     ` Yu Zhao
  0 siblings, 0 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-17  0:13 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Johannes Weiner, linux-kernel, linux-mm

On Sun, Feb 13, 2022 at 06:04:17PM +0800, Hillf Danton wrote:

Hi Hillf,

> On Tue,  8 Feb 2022 01:18:55 -0700 Yu Zhao wrote:
> > +
> > +/******************************************************************************
> > + *                          the aging
> > + ******************************************************************************/
> > +
> > +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > +{
> > +	unsigned long old_flags, new_flags;
> > +	int type = folio_is_file_lru(folio);
> > +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > +	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
> > +
> > +	do {
> > +		new_flags = old_flags = READ_ONCE(folio->flags);
> > +		VM_BUG_ON_FOLIO(!(new_flags & LRU_GEN_MASK), folio);
> > +
> > +		new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> 
> Is the chance of a deadloop zero if new_gen != old_gen?

No, because the counter is only cleared during isolation, and here
it's protected against isolation (under the LRU lock, which is asserted
in the lru_gen_balance_size() -> lru_gen_update_size() path).

> > +		new_gen = (old_gen + 1) % MAX_NR_GENS;
> > +
> > +		new_flags &= ~LRU_GEN_MASK;
> > +		new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
> > +		new_flags &= ~(LRU_REFS_MASK | LRU_REFS_FLAGS);
> > +		/* for folio_end_writeback() */
> 
> 		/* for folio_end_writeback() and sort_folio() */ in terms of
> reclaiming?

Right.

> > +		if (reclaiming)
> > +			new_flags |= BIT(PG_reclaim);
> > +	} while (cmpxchg(&folio->flags, old_flags, new_flags) != old_flags);
> > +
> > +	lru_gen_balance_size(lruvec, folio, old_gen, new_gen);
> > +
> > +	return new_gen;
> > +}
> 
> ...
> 
> > +/******************************************************************************
> > + *                          the eviction
> > + ******************************************************************************/
> > +
> > +static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
> > +{
> 
> Nit, the 80-column-char format is prefered.

Will do.

> > +	bool success;
> > +	int gen = folio_lru_gen(folio);
> > +	int type = folio_is_file_lru(folio);
> > +	int zone = folio_zonenum(folio);
> > +	int tier = folio_lru_tier(folio);
> > +	int delta = folio_nr_pages(folio);
> > +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > +
> > +	VM_BUG_ON_FOLIO(gen >= MAX_NR_GENS, folio);
> > +
> > +	if (!folio_evictable(folio)) {
> > +		success = lru_gen_del_folio(lruvec, folio, true);
> > +		VM_BUG_ON_FOLIO(!success, folio);
> > +		folio_set_unevictable(folio);
> > +		lruvec_add_folio(lruvec, folio);
> > +		__count_vm_events(UNEVICTABLE_PGCULLED, delta);
> > +		return true;
> > +	}
> > +
> > +	if (type && folio_test_anon(folio) && folio_test_dirty(folio)) {
> > +		success = lru_gen_del_folio(lruvec, folio, true);
> > +		VM_BUG_ON_FOLIO(!success, folio);
> > +		folio_set_swapbacked(folio);
> > +		lruvec_add_folio_tail(lruvec, folio);
> > +		return true;
> > +	}
> > +
> > +	if (tier > tier_idx) {
> > +		int hist = lru_hist_from_seq(lrugen->min_seq[type]);
> > +
> > +		gen = folio_inc_gen(lruvec, folio, false);
> > +		list_move_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
> > +
> > +		WRITE_ONCE(lrugen->promoted[hist][type][tier - 1],
> > +			   lrugen->promoted[hist][type][tier - 1] + delta);
> > +		__mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
> > +		return true;
> > +	}
> > +
> > +	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> > +	    (type && folio_test_dirty(folio))) {
> > +		gen = folio_inc_gen(lruvec, folio, true);
> > +		list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
> > +		return true;
> 
> Make the cold dirty page cache younger instead of writing it out in the
> background reclaimer context, and the question arising is whether laundering
> is deferred until the flusher threads are woken up in the following patches.

This is a good point. In contrast to the active/inactive LRU, MGLRU
doesn't write out dirty file pages (kswapd or direct reclaimers) --
this is writeback's job and it should be better at doing this. In
fact, commit 21b4ee7029 ("xfs: drop ->writepage completely") has
disabled dirty file page writeouts in the reclaim path completely.

Reclaim indirectly wakes up writeback after clean file pages drop
below a threshold (dirty ratio). However, dirty pages might be under
counted on a system that uses a large number of mmapped file pages.
MGLRU optimizes this by calling folio_mark_dirty() on pages mapped
by dirty PTEs when scanning page tables. (Why not since it's already
looking at the accessed bit.)
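
Roughly speaking, the walk does something like the below while it
inspects each PTE -- transfer_pte_dirty() is just a made-up name for
illustration, and the real walker (patch 7) has a few more checks:

  static void transfer_pte_dirty(pte_t pte, struct folio *folio)
  {
  	/* the walk is already inspecting this PTE for the accessed bit */
  	if (!pte_dirty(pte) || folio_test_dirty(folio))
  		return;

  	/* skip anon folios that aren't in the swapcache */
  	if (folio_test_anon(folio) && !folio_test_swapcache(folio))
  		return;

  	folio_mark_dirty(folio);
  }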

The commit above explained this design choice from the performance
aspect. From the implementation aspect, it also creates a boundary
between reclaim and writeback. This simplifies things, e.g., the
PageWriteback() check in shrink_page_list() is no longer relevant for
MGLRU, and neither is the top half of the PageDirty() check.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [page-reclaim] [PATCH v7 11/12] mm: multigenerational LRU: debugfs interface
  2022-02-08  8:19 ` [PATCH v7 11/12] mm: multigenerational LRU: debugfs interface Yu Zhao
@ 2022-02-18 18:56   ` David Rientjes
  0 siblings, 0 replies; 74+ messages in thread
From: David Rientjes @ 2022-02-18 18:56 UTC (permalink / raw)
  To: Yu Zhao, Johannes Weiner, Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

On Tue, 8 Feb 2022, Yu Zhao wrote:

> Add /sys/kernel/debug/lru_gen for working set estimation and proactive
> reclaim. These features are required to optimize job scheduling (bin
> packing) in data centers [1][2].
> 

Johannes, do you believe this interface is sufficient to induce memcg 
based reclaim or do you have plans to propose a memory.reclaim extension 
to memcg?

I assume that lru_gen could be used even if memcg is disabled whereas 
memory.reclaim would only be available on systems where the controller is 
mounted through cgroups.

Yet a question would probably arise about the stability of this interface 
since it lives in debugfs: can we plan on /sys/kernel/debug/lru_gen being 
long-term supported?


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-02-15 21:53       ` Johannes Weiner
@ 2022-02-21  8:14         ` Yu Zhao
  2022-02-23 21:18           ` Yu Zhao
  2022-03-03 15:29           ` Johannes Weiner
  0 siblings, 2 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-21  8:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Michal Hocko, Andi Kleen,
	Aneesh Kumar, Barry Song, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
	Linus Torvalds, Matthew Wilcox, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

On Tue, Feb 15, 2022 at 04:53:56PM -0500, Johannes Weiner wrote:
> Hi Yu,
> 
> On Tue, Feb 15, 2022 at 02:43:05AM -0700, Yu Zhao wrote:
> > On Thu, Feb 10, 2022 at 03:41:57PM -0500, Johannes Weiner wrote:
> > > > +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
> > > > +{
> > > > +	unsigned long max_seq = lruvec->lrugen.max_seq;
> > > > +
> > > > +	VM_BUG_ON(gen >= MAX_NR_GENS);
> > > > +
> > > > +	/* see the comment on MIN_NR_GENS */
> > > > +	return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
> > > > +}
> > > 
> > > I'm still reading the series, so correct me if I'm wrong: the "active"
> > > set is split into two generations for the sole purpose of the
> > > second-chance policy for fresh faults, right?
> > 
> > To be precise, the active/inactive notion on top of generations is
> > just for ABI compatibility, e.g., the counters in /proc/vmstat.
> > Otherwise, this function wouldn't be needed.
> 
> Ah! would you mind adding this as a comment to the function?

Will do.

> But AFAICS there is the lru_gen_del_folio() callsite that maps it to
> the PG_active flag - which in turn gets used by add_folio() to place
> the thing back on the max_seq generation. So I suppose there is a
> secondary purpose of the function for remembering the page's rough age
> for non-reclaim isolation.

Yes, e.g., migration.

> It would be good to capture that as well in a comment on the function.

Will do.

> > > > +static inline void lru_gen_update_size(struct lruvec *lruvec, enum lru_list lru,
> > > > +				       int zone, long delta)
> > > > +{
> > > > +	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > > > +
> > > > +	lockdep_assert_held(&lruvec->lru_lock);
> > > > +	WARN_ON_ONCE(delta != (int)delta);
> > > > +
> > > > +	__mod_lruvec_state(lruvec, NR_LRU_BASE + lru, delta);
> > > > +	__mod_zone_page_state(&pgdat->node_zones[zone], NR_ZONE_LRU_BASE + lru, delta);
> > > > +}
> > > 
> > > This is a duplicate of update_lru_size(), please use that instead.
> > > 
> > > Yeah technically you don't need the mem_cgroup_update_lru_size() but
> > > that's not worth sweating over, better to keep it simple.
> > 
> > I agree we don't need the mem_cgroup_update_lru_size() -- let me spell
> > out why:
> >   this function is not needed here because it updates the counters used
> >   only by the active/inactive lru code, i.e., get_scan_count().
> > 
> > However, we can't reuse update_lru_size() because MGLRU can trip the
> > WARN_ONCE() in mem_cgroup_update_lru_size().
> > 
> > Unlike lru_zone_size[], lrugen->nr_pages[] is eventually consistent.
> > To move a page to a different generation, the gen counter in page->flags
> > is updated first, which doesn't require the LRU lock. The second step,
> > i.e., the update of lrugen->nr_pages[], requires the LRU lock, and it
> > usually isn't done immediately due to batching. Meanwhile, if this page
> > is, for example, isolated, nr_pages[] becomes temporarily unbalanced.
> > And this trips the WARN_ONCE().
> 
> Good insight.
> 
> But in that case, I'd still think it's better to use update_lru_size()
> and gate the memcg update on lrugen-enabled, with a short comment
> saying that lrugen has its own per-cgroup counts already. It's just a
> bit too error prone to duplicate the stat updates.
> 
> Even better would be:
> 
> static __always_inline
> void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
> {
> 	enum lru_list lru = folio_lru_list(folio);
> 
> 	update_lru_size(lruvec, lru, folio_zonenum(folio),
> 			folio_nr_pages(folio));
> 	if (lrugen_enabled(lruvec))
> 		lrugen_add_folio(lruvec, folio);
> 	else
> 		list_add(&folio->lru, &lruvec->lists[lru]);
> }
> 
> But it does mean you'd have to handle unevictable pages. I'm reviewing
> from the position that mglru is going to supplant the existing reclaim
> algorithm in the long term, though, so being more comprehensive and
> eliminating special cases where possible is all-positive, IMO.
> 
> Up to you. I'd only insist on reusing update_lru_size() at least.

Will do.

> > > > +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > > > +{
> > > > +	int gen;
> > > > +	unsigned long old_flags, new_flags;
> > > > +	int type = folio_is_file_lru(folio);
> > > > +	int zone = folio_zonenum(folio);
> > > > +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > > > +
> > > > +	if (folio_test_unevictable(folio) || !lrugen->enabled)
> > > > +		return false;
> > > 
> > > These two checks should be in the callsite and the function should
> > > return void. Otherwise you can't understand the callsite without
> > > drilling down into lrugen code, even if lrugen is disabled.
> > 
> > I agree it's a bit of a nuisance this way. The alternative is we'd need
> > an ifdef or another helper at the call sites because lrugen->enabled is
> > specific to lrugen.
> 
> Coming from memcg, my experience has been that when you have a compile
> time-optional MM extension like this, you'll sooner or later need a
> config-independent helper to gate callbacks in generic code. So I
> think it's a good idea to add one now.
> 
> One of these?
> 
> lruvec_on_lrugen()

SGTM.

Personally I'd reuse lru_gen_enabled(), by passing NULL/lruvec. But
my guess is you wouldn't like it.

> lruvec_using_lrugen()
> lruvec_lrugen_enabled()
> 
> lruvec_has_generations() :-)
> 
> > > On that note, I think #1 is reintroducing a problem we have fixed
> > > before, which is trashing the workingset with a flood of use-once
> > > mmapped pages. It's the classic scenario where LFU beats LRU.
> > > 
> > > Mapped streaming IO isn't very common, but it does happen. See these
> > > commits:
> > > 
> > > dfc8d636cdb95f7b792d5ba8c9f3b295809c125d
> > > 31c0569c3b0b6cc8a867ac6665ca081553f7984c
> > > 645747462435d84c6c6a64269ed49cc3015f753d
> > > 
> > > From the changelog:
> > > 
> > >     The used-once mapped file page detection patchset.
> > >     
> > >     It is meant to help workloads with large amounts of shortly used file
> > >     mappings, like rtorrent hashing a file or git when dealing with loose
> > >     objects (git gc on a bigger site?).
> > >     
> > >     Right now, the VM activates referenced mapped file pages on first
> > >     encounter on the inactive list and it takes a full memory cycle to
> > >     reclaim them again.  When those pages dominate memory, the system
> > >     no longer has a meaningful notion of 'working set' and is required
> > >     to give up the active list to make reclaim progress.  Obviously,
> > >     this results in rather bad scanning latencies and the wrong pages
> > >     being reclaimed.
> > >     
> > >     This patch makes the VM be more careful about activating mapped file
> > >     pages in the first place.  The minimum granted lifetime without
> > >     another memory access becomes an inactive list cycle instead of the
> > >     full memory cycle, which is more natural given the mentioned loads.
> > > 
> > > Translating this to multigen, it seems fresh faults should really
> > > start on the second oldest rather than on the youngest generation, to
> > > get a second chance but without jeopardizing the workingset if they
> > > don't take it.
> > 
> > This is a good point, and I had worked on a similar idea but failed
> > to measure its benefits. In addition to placing mmapped file pages in
> > older generations, I also tried placing refaulted anon pages in older
> > generations. My conclusion was that the initial LRU positions of NFU
> > pages are not a bottleneck for workloads I've tested. The efficiency
> > of testing/clearing the accessed bit is.
> 
> The concern isn't the scan overhead, but jankiness from the workingset
> being flooded out by streaming IO.

Yes, MGLRU uses a different approach to solve this problem, and for
its approach, the scan overhead is the concern.

MGLRU detects (defines) the working set by scanning the entire memory
for each generation, and it counters the flooding by accelerating the
creation of generations. IOW, all mapped pages have an equal chance to
get scanned, no matter which generation they are in. This is a design
difference compared with the active/inactive LRU, which tries to scan
the active/inactive lists less/more frequently.

> The concrete usecase at the time was a torrent client hashing a
> downloaded file and thereby kicking out the desktop environment, which
> caused jankiness. The hashing didn't benefit from caching - the file
> wouldn't have fit into RAM anyway - so this was pointless to boot.
> 
> Essentially, the tradeoff is this:
> 
> 1) If you treat new pages as hot, you accelerate workingset
> transitions, but on the flipside you risk unnecessary refaults in
> running applications when those new pages are one-off.
> 
> 2) If you take new pages with a grain of salt, you protect existing
> applications better from one-off floods, but risk refaults in NEW
> application while they're trying to start up.

Agreed.

> There are two arguments for why 2) is preferable:
> 
> 1) Users are tolerant of cache misses when applications first launch,
>    much less so after they've been running for hours.

Our CUJs (Critical User Journeys) respectfully disagree :)

They are built on the observation that once users have moved onto
another tab/app, they are more likely to stay with the new tab/app
rather than go back to the old ones. Speaking for myself, this is
generally the case.

> 2) Workingset transitions (and associated jankiness) are bounded by
>    the amount of RAM you need to repopulate. But streaming IO is
>    bounded by storage, and datasets are routinely several times the
>    amount of RAM. Uncacheable sets in excess of RAM can produce an
>    infinite stream of "new" references; not protecting the workingset
>    from that means longer or even sustained jankiness.

I'd argue the opposite -- we shouldn't risk refaulting fresh hot pages
just to accommodate this concrete yet minor use case, especially
considering torrent has been given the means (MADV_SEQUENTIAL) to help
itself.

I appreciate all your points here. The bottom line is we agree this is
a trade-off. For what we disagree about, we could both be right -- it
comes down to what workloads we care about *more*.

To move forward, I propose we look at it from a non-technical POV:
would we want to offer users an alternative trade off so that they can
have greater flexibility?

> > And some applications are smart enough to leverage MADV_SEQUENTIAL.
> > In this case, MGLRU does place mmapped file pages in the oldest
> > generation.
> 
> Yes, it makes sense to optimize when MADV_SEQUENTIAL is requested. But
> that hint isn't reliably there, so it matters that we don't do poorly
> when it's missing.

Agreed.

> > I have an oversimplified script that uses memcached to mimic a
> > non-streaming workload and fio a (mmapped) streaming workload:
> 
> Looking at the parameters and observed behavior, let me say up front
> that this looks like a useful benchmark, but it doesn't capture the
> scenario I was talking about above.
> 
> For one, the presence of swapping in both kernels suggests that the
> "streaming IO" component actually has repeat access that could benefit
> from caching. Second, I would expect memcache is accessing its memory
> frequently and consistently, and so could withstand workingset
> challenges from streaming IO better than, say, a desktop environment.

The fio workload is a real streaming workload, but the memcached
workload might have been too large to be a typical desktop workload.

More below.

> More on that below.
> 
> >   1. With MADV_SEQUENTIAL, the non-streaming workload is about 5 times
> >      faster when using MGLRU. Somehow the baseline (rc3) swapped a lot.
> >      (It shouldn't, and I haven't figured out why.)
> 
> Baseline swaps when there are cache refaults. This is regardless of
> the hint: you may say you're accessing these pages sequentially, but
> the refaults say you're reusing them, with a frequency that suggests
> they might be cacheable. So it tries to cache them.
> 
> I'd be curious if that results in fio being faster, or whether it's
> all just pointless thrashing. Can you share the fio results too?

More below.

> We could patch baseline to prioritize MADV_SEQUENTIAL more, but...
> 
> >   2. Without MADV_SEQUENTIAL, the non-streaming workload is about twice
> >      as fast when using MGLRU. Both MGLRU and the baseline swapped a lot.
> 
> ...in practice I think this scenario will matter to a lot more users.

I strongly feel we should prioritize what's advertised on a man page
over an unspecified (performance) behavior.

> I would again be interested in the fio results.
> 
> >            MADV_SEQUENTIAL    non-streaming ops/sec (memcached)
> >   rc3      yes                 292k
> >   rc3      no                  203k
> >   rc3+v7   yes                1967k
> >   rc3+v7   no                  436k
> > 
> >   cat mmap.sh
> >   modprobe brd rd_nr=2 rd_size=56623104
> >   
> >   mkswap /dev/ram0
> >   swapon /dev/ram0
> >   
> >   mkfs.ext4 /dev/ram1
> >   mount -t ext4 /dev/ram1 /mnt
> >   
> >   memtier_benchmark -S /var/run/memcached/memcached.sock -P memcache_binary \
> >     -n allkeys --key-minimum=1 --key-maximum=50000000 --key-pattern=P:P -c 1 \
> >     -t 36 --ratio 1:0 --pipeline 8 -d 2000
> >   
> >   # streaming workload: --fadvise_hint=0 disables MADV_SEQUENTIAL
> >   fio -name=mglru --numjobs=12 --directory=/mnt --size=4224m --buffered=1 \
> >     --ioengine=mmap --iodepth=128 --iodepth_batch_submit=32 \
> >     --iodepth_batch_complete=32 --rw=read --time_based --ramp_time=10m \
> >     --runtime=180m --group_reporting &
> 
> As per above, I think this would be closer to a cacheable workingset
> than a streaming IO pattern. It depends on total RAM of course, but
> size=4G and time_based should loop around pretty quickly.

The file size here shouldn't matter since fio is smart enough to
invalidate page cache before it rewinds (for sequential access):

https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-invalidate
https://github.com/axboe/fio/blob/master/filesetup.c#L602

I think the problem might have been that the memory size for memcached
(100GB) was too large to be all hot (limited by memory bandwidth).

> Would you mind rerunning with files larger than RAM, to avoid repeat
> accesses (or at least only repeat with large distances)?

Retested with the same free memory (120GB) for 40GB memcached and 200GB
fio.

           MADV_SEQUENTIAL  FADV_DONTNEED  memcached  fio
  rc4      no               yes            4716k      232k
  rc4+v7   no               yes            4307k      265k
  delta                                    -9%        +14%

MGLRU lost with memcached but won with fio for the same reason: it
doesn't have any heuristics to detect the streaming characteristic
(and therefore lost with memcached) but relies on faster scanning
(and therefore won with fio) to keep the working set in memory.

The baseline didn't swap this time (MGLRU did slightly), but it lost
with fio because it had to walk the rmap for each page in the entire
200GB VMA, at least once, even for this streaming workload.

This reflects the design difference I mentioned earlier.
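
(For scale -- a back-of-the-envelope estimate assuming 4KB base pages:
200GB / 4KB is roughly 52 million pages, so "at least once" already
means tens of millions of rmap walks for this mapping.)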

  cat test.sh
  modprobe brd rd_nr=1 rd_size=268435456
  
  mkfs.ext4 /dev/ram0
  mount -t ext4 /dev/ram0 /mnt
  
  fallocate -l 40g /mnt/swapfile
  mkswap /mnt/swapfile
  swapon /mnt/swapfile
  
  fio -name=mglru --numjobs=1 --directory=/mnt --size=204800m \
    --buffered=1 --ioengine=mmap --fadvise_hint=0 --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=read --time_based --ramp_time=10m --runtime=180m \
    --group_reporting &
  pid=$!
  
  sleep 600
  
  # load objects
  memtier_benchmark -S /var/run/memcached/memcached.sock \
    -P memcache_binary -n allkeys --key-minimum=1 \
    --key-maximum=20000000 --key-pattern=P:P -c 1 -t 36 \
    --ratio 1:0 --pipeline 8 -d 2000
  # read objects
  memtier_benchmark -S /var/run/memcached/memcached.sock \
    -P memcache_binary -n allkeys --key-minimum=1 \
    --key-maximum=20000000 --key-pattern=R:R -c 1 -t 36 \
    --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

  kill -INT $pid

> Depending on how hot memcache runs, it may or may not be able to hold
> onto its workingset.

Agreed.

> Testing interactivity is notoriously hard, but
> using a smaller, intermittent workload is probably more representative
> of overall responsiveness. Let fio ramp until memory is full, then do
> perf stat -r 10 /bin/sh -c 'git shortlog v5.0.. >/dev/null; sleep 1'

I'll also check with the downstream maintainers to see if they have
heard any complaints about streaming workloads negatively impacting
user experience.

> I'll try to reproduce this again too. Back then, that workload gave me
> a very janky desktop experience, and the patch very obvious relief.

SGTM.

> > > > @@ -113,6 +298,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
> > > >  {
> > > >  	enum lru_list lru = folio_lru_list(folio);
> > > >  
> > > > +	if (lru_gen_add_folio(lruvec, folio, true))
> > > > +		return;
> > > > +
> > > 
> > > bool parameters are notoriously hard to follow in the callsite. Can
> > > you please add lru_gen_add_folio_tail() instead and have them use a
> > > common helper?
> > 
> > I'm not sure -- there are several places like this one. My question is
> > whether we want to do it throughout this patchset. We'd end up with
> > many helpers and duplicate code. E.g., in this file alone, we have two
> > functions taking bool parameters:
> >   lru_gen_add_folio(..., bool reclaiming)
> >   lru_gen_del_folio(..., bool reclaiming)
> > 
> > I can't say they are very readable; at least they are very compact
> > right now. My concern is that we might lose the latter without having
> > enough of the former.
> > 
> > Perhaps this is something that we could revisit after you've finished
> > reviewing the entire patchset?
> 
> Sure, fair enough.
> 
> > > > +void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec);
> > > 
> > > "state" is what we usually init :) How about lrugen_init_lruvec()?
> > 
> > Same story as "file", lol -- this used to be lru_gen_init_lruvec():
> > https://lore.kernel.org/linux-mm/20210413065633.2782273-9-yuzhao@google.com/
> > 
> > Naming is hard. Hopefully we can finalize it this time.
> 
> Was that internal feedback? The revisions show this function went
> through several names, but I don't see reviews requesting those. If
> they weren't public I'm gonna pretend they didn't happen ;-)

Indeed. I lost track.

> > > You can drop the memcg parameter and use lruvec_memcg().
> > 
> > lruvec_memcg() isn't available yet when pgdat_init_internals() calls
> > this function because mem_cgroup_disabled() is initialized afterward.
> 
> Good catch. That'll container_of() into garbage. However, we have to
> assume that somebody's going to try that simplification again, so we
> should set up the code now to prevent issues.
> 
> cgroup_disable parsing is self-contained, so we can pull it ahead in
> the init sequence. How about this?
> 
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 9d05c3ca2d5e..b544d768edc8 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -6464,9 +6464,9 @@ static int __init cgroup_disable(char *str)
>  			break;
>  		}
>  	}
> -	return 1;
> +	return 0;
>  }
> -__setup("cgroup_disable=", cgroup_disable);
> +early_param("cgroup_disable", cgroup_disable);

I think early_param() is still after pgdat_init_internals(), no?

Thanks!


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 12/12] mm: multigenerational LRU: documentation
  2022-02-16  3:22     ` Yu Zhao
@ 2022-02-21  9:01       ` Mike Rapoport
  2022-02-22  1:47         ` Yu Zhao
  0 siblings, 1 reply; 74+ messages in thread
From: Mike Rapoport @ 2022-02-21  9:01 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

On Tue, Feb 15, 2022 at 08:22:10PM -0700, Yu Zhao wrote:
> On Mon, Feb 14, 2022 at 12:28:56PM +0200, Mike Rapoport wrote:
> 
> > > +====== ========
> > > +Values Features
> > > +====== ========
> > > +0x0001 the multigenerational LRU
> > 
> > The multigenerational LRU what?
> 
> Itself? This depends on the POV, and I'm trying to determine what would
> be the natural way to present it.
> 
> MGLRU itself could be seen as an add-on atop the existing page reclaim
> or an alternative in parallel. The latter would be similar to sl[aou]b,
> and that's how I personally see it.
> 
> But here I presented it more like the former because I feel this way is
> more natural to users: they are like switches on a single panel.

Then I think it should be described as "enable multigenerational LRU" or
something like this.
 
> > What will happen if I write 0x2 to this file?
> 
> Just like turning on a branch breaker while leaving the main breaker
> off in a circuit breaker box. This is how I see it, and I'm totally
> fine with changing it to whatever you'd recommend.

That was my guess that when bit 0 is clear the rest do not matter :)
What's important, IMO, is that it is stated explicitly in the description.
 
> > Please consider splitting "enable" and "features" attributes.
> 
> How about s/Features/Components/?

I meant to use two attributes:

/sys/kernel/mm/lru_gen/enable for the main breaker, and
/sys/kernel/mm/lru_gen/features (or components) for the branch breakers
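
To sketch what the proposed split could look like from a shell (both
attribute names come from the suggestion above and the values are only
illustrative; none of this is part of the patchset as posted):

  # main breaker: turn the multigenerational LRU on or off as a whole
  echo 1 >/sys/kernel/mm/lru_gen/enable
  # branch breakers: e.g., 0x0002 (batched leaf clearing) plus
  # 0x0004 (non-leaf clearing)
  echo 0x0006 >/sys/kernel/mm/lru_gen/features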
 
> > > +0x0002 clear the accessed bit in leaf page table entries **in large
> > > +       batches**, when MMU sets it (e.g., on x86)
> > 
> > Is extra markup really needed here...
> > 
> > > +0x0004 clear the accessed bit in non-leaf page table entries **as
> > > +       well**, when MMU sets it (e.g., on x86)
> > 
> > ... and here?
> 
> Will do.
> 
> > As for the descriptions, what is the user-visible effect of these features?
> > How different modes of clearing the access bit are reflected in, say, GUI
> > responsiveness, database TPS, or probability of OOM?
> 
> These remain to be seen :) I just added these switches in v7, per Mel's
> request from the meeting we had. These were never tested in the field.

I see :)

It would be nice to have a description and/or examples of user-visible
effects once there is some insight into what these features do.

> > > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
> > 
> > Is debugfs interface relevant only for datacenters? 
> 
> For the moment, yes.

And what will happen if somebody uses these interfaces outside
datacenters? As soon as there is a sysfs interface, somebody will surely
play with it.

I think the job schedulers might be the most important user of that
interface, but the documentation should not presume it is the only user.
 
> > > + job scheduler writes to this file at a certain time interval to
> > > + create new generations, and it ranks available servers based on the
> > > + sizes of their cold memory defined by this time interval. For
> > > + proactive reclaim, a job scheduler writes to this file before it
> > > + tries to land a new job, and if it fails to materialize the cold
> > > + memory without impacting the existing jobs, it retries on the next
> > > + server according to the ranking result.
> > 
> > Is this knob only relevant for a job scheduler? Or can it be used in other
> > use-cases as well?
> 
> There are other concrete use cases but I'm not ready to discuss them
> yet.
 
Here as well, as soon as there is an interface it's not necessarily "job
scheduler" that will "write to this file", anybody can write to that file.
Please adjust the documentation to be more neutral regarding the use-cases.

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 12/12] mm: multigenerational LRU: documentation
  2022-02-21  9:01       ` Mike Rapoport
@ 2022-02-22  1:47         ` Yu Zhao
  2022-02-23 10:58           ` Mike Rapoport
  0 siblings, 1 reply; 74+ messages in thread
From: Yu Zhao @ 2022-02-22  1:47 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	Linux ARM, open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh

On Mon, Feb 21, 2022 at 2:02 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Tue, Feb 15, 2022 at 08:22:10PM -0700, Yu Zhao wrote:
> > On Mon, Feb 14, 2022 at 12:28:56PM +0200, Mike Rapoport wrote:
> >
> > > > +====== ========
> > > > +Values Features
> > > > +====== ========
> > > > +0x0001 the multigenerational LRU
> > >
> > > The multigenerational LRU what?
> >
> > Itself? This depends on the POV, and I'm trying to determine what would
> > be the natural way to present it.
> >
> > MGLRU itself could be seen as an add-on atop the existing page reclaim
> > or an alternative in parallel. The latter would be similar to sl[aou]b,
> > and that's how I personally see it.
> >
> > But here I presented it more like the former because I feel this way is
> > more natural to users: they are like switches on a single panel.
>
> Then I think it should be described as "enable multigenerational LRU" or
> something like this.

Will do.

> > > What will happen if I write 0x2 to this file?
> >
> > Just like turning on a branch breaker while leaving the main breaker
> > off in a circuit breaker box. This is how I see it, and I'm totally
> > fine with changing it to whatever you'd recommend.
>
> That was my guess that when bit 0 is clear the rest do not matter :)
> What's important, IMO, is that it is stated explicitly in the description.

Will do.

> > > Please consider splitting "enable" and "features" attributes.
> >
> > How about s/Features/Components/?
>
> I meant to use two attributes:
>
> /sys/kernel/mm/lru_gen/enable for the main breaker, and
> /sys/kernel/mm/lru_gen/features (or components) for the branch breakers

It's a bit superfluous for my taste. I generally consider multiple
items to fall into the same category if they can be expressed as an
array, and I usually pack an array into a single file.

From your last review, I gauged this would be too overloaded for your
taste. So I'd be happy to make the change if you think two files look
more intuitive from a user's perspective.

> > > > +0x0002 clear the accessed bit in leaf page table entries **in large
> > > > +       batches**, when MMU sets it (e.g., on x86)
> > >
> > > Is extra markup really needed here...
> > >
> > > > +0x0004 clear the accessed bit in non-leaf page table entries **as
> > > > +       well**, when MMU sets it (e.g., on x86)
> > >
> > > ... and here?
> >
> > Will do.
> >
> > > As for the descriptions, what is the user-visible effect of these features?
> > > How different modes of clearing the access bit are reflected in, say, GUI
> > > responsiveness, database TPS, or probability of OOM?
> >
> > These remain to be seen :) I just added these switches in v7, per Mel's
> > request from the meeting we had. These were never tested in the field.
>
> I see :)
>
> It would be nice to have a description or/and examples of user-visible
> effects when there will be some insight on what these features do.

How does the following sound?

Clearing the accessed bit in large batches can theoretically cause
lock contention (mmap_lock), and if it happens the 0x0002 switch can
disable this feature. In this case the multigenerational LRU suffers a
minor performance degradation.
Clearing the accessed bit in non-leaf page table entries was only
verified on Intel and AMD, and if it causes problems on other x86
varieties the 0x0004 switch can disable this feature. In this case the
multigenerational LRU suffers a negligible performance degradation.
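
For illustration, toggling these bits from a shell could look like the
following -- assuming the bitmask stays in a single sysfs attribute and
that the attribute is named /sys/kernel/mm/lru_gen/enabled (the exact
name is whatever the final ABI settles on):

  # suspected mmap_lock contention: keep 0x0001 and 0x0004, drop the
  # batched leaf clearing (0x0002)
  echo 0x0005 >/sys/kernel/mm/lru_gen/enabled

  # problems on other x86 varieties: keep 0x0001 and 0x0002, drop the
  # non-leaf clearing (0x0004)
  echo 0x0003 >/sys/kernel/mm/lru_gen/enabled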

> > > > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
> > >
> > > Is debugfs interface relevant only for datacenters?
> >
> > For the moment, yes.
>
> And what will happen if somebody uses these interfaces outside
> datacenters? As soon as there is a sysfs intefrace, somebody will surely
> play with it.
>
> I think the job schedulers might be the most important user of that
> interface, but the documentation should not presume it is the only user.

Other ideas are more like brainstorming than concrete use cases, e.g.,
for desktop users, these interfaces can in theory speed up hibernation
(suspend to disk); for VM users, they can again in theory support auto
ballooning. These niches are really minor and less explored compared
with the data center use cases which have been dominant.

I was hoping we could focus on the essential and take one step at a
time. Later on, if there is additional demand and resource, then we
expand to cover more use cases.

> > > > + job scheduler writes to this file at a certain time interval to
> > > > + create new generations, and it ranks available servers based on the
> > > > + sizes of their cold memory defined by this time interval. For
> > > > + proactive reclaim, a job scheduler writes to this file before it
> > > > + tries to land a new job, and if it fails to materialize the cold
> > > > + memory without impacting the existing jobs, it retries on the next
> > > > + server according to the ranking result.
> > >
> > > Is this knob only relevant for a job scheduler? Or can it be used in other
> > > use-cases as well?
> >
> > There are other concrete use cases but I'm not ready to discuss them
> > yet.
>
> Here as well, as soon as there is an interface it's not necessarily "job
> scheduler" that will "write to this file", anybody can write to that file.
> Please adjust the documentation to be more neutral regarding the use-cases.

Will do.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation
  2022-02-08  8:18 ` [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation Yu Zhao
                     ` (2 preceding siblings ...)
  2022-02-13 10:04   ` Hillf Danton
@ 2022-02-23  8:27   ` Huang, Ying
  2022-02-23  9:36     ` Yu Zhao
  3 siblings, 1 reply; 74+ messages in thread
From: Huang, Ying @ 2022-02-23  8:27 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

Hi, Yu,

Yu Zhao <yuzhao@google.com> writes:

> To avoid confusions, the terms "promotion" and "demotion" will be
> applied to the multigenerational LRU, as a new convention; the terms
> "activation" and "deactivation" will be applied to the active/inactive
> LRU, as usual.

In the memory tiering related commits and patchset, for example as follows,

commit 668e4147d8850df32ca41e28f52c146025ca45c6
Author: Yang Shi <yang.shi@linux.alibaba.com>
Date:   Thu Sep 2 14:59:19 2021 -0700

    mm/vmscan: add page demotion counter

https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/

"demote" and "promote" is used for migrating pages between different
types of memory.  Is it better for us to avoid overloading these words
too much to avoid the possible confusion?

> +static int get_swappiness(struct mem_cgroup *memcg)
> +{
> +	return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
> +	       mem_cgroup_swappiness(memcg) : 0;
> +}

After we introduced demotion support in the Linux kernel, the anonymous
pages in the fast memory node can be demoted to the slow memory node
via the page reclaiming mechanism, as in the following commit.  Can you
consider that too?

commit a2a36488a61cefe3129295c6e75b3987b9d7fd13
Author: Keith Busch <kbusch@kernel.org>
Date:   Thu Sep 2 14:59:26 2021 -0700

    mm/vmscan: Consider anonymous pages without swap
    
    Reclaim anonymous pages if a migration path is available now that demotion
    provides a non-swap recourse for reclaiming anon pages.
    
    Note that this check is subtly different from the can_age_anon_pages()
    checks.  This mechanism checks whether a specific page in a specific
    context can actually be reclaimed, given current swap space and cgroup
    limits.
    
    can_age_anon_pages() is a much simpler and more preliminary check which
    just says whether there is a possibility of future reclaim.


Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation
  2022-02-23  8:27   ` Huang, Ying
@ 2022-02-23  9:36     ` Yu Zhao
  2022-02-24  0:59       ` Huang, Ying
  0 siblings, 1 reply; 74+ messages in thread
From: Yu Zhao @ 2022-02-23  9:36 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Linux ARM, open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh

On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Yu,
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > To avoid confusions, the terms "promotion" and "demotion" will be
> > applied to the multigenerational LRU, as a new convention; the terms
> > "activation" and "deactivation" will be applied to the active/inactive
> > LRU, as usual.
>
> In the memory tiering related commits and patchset, for example as follows,
>
> commit 668e4147d8850df32ca41e28f52c146025ca45c6
> Author: Yang Shi <yang.shi@linux.alibaba.com>
> Date:   Thu Sep 2 14:59:19 2021 -0700
>
>     mm/vmscan: add page demotion counter
>
> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
>
> "demote" and "promote" is used for migrating pages between different
> types of memory.  Is it better for us to avoid overloading these words
> too much to avoid the possible confusion?

Given that LRU and migration are usually different contexts, I think
we'd be fine, unless we want a third pair of terms.

> > +static int get_swappiness(struct mem_cgroup *memcg)
> > +{
> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
> > +            mem_cgroup_swappiness(memcg) : 0;
> > +}
>
> After we introduced demotion support in Linux kernel.  The anonymous
> pages in the fast memory node could be demoted to the slow memory node
> via the page reclaiming mechanism as in the following commit.  Can you
> consider that too?

Sure. How do I check whether there is still space on the slow node?


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 12/12] mm: multigenerational LRU: documentation
  2022-02-22  1:47         ` Yu Zhao
@ 2022-02-23 10:58           ` Mike Rapoport
  2022-02-23 21:20             ` Yu Zhao
  0 siblings, 1 reply; 74+ messages in thread
From: Mike Rapoport @ 2022-02-23 10:58 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	Linux ARM, open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh

On Mon, Feb 21, 2022 at 06:47:25PM -0700, Yu Zhao wrote:
> On Mon, Feb 21, 2022 at 2:02 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Tue, Feb 15, 2022 at 08:22:10PM -0700, Yu Zhao wrote:
> > > > Please consider splitting "enable" and "features" attributes.
> > >
> > > How about s/Features/Components/?
> >
> > I meant to use two attributes:
> >
> > /sys/kernel/mm/lru_gen/enable for the main breaker, and
> > /sys/kernel/mm/lru_gen/features (or components) for the branch breakers
> 
> It's a bit superfluous for my taste. I generally consider multiple
> items to fall into the same category if they can be expressed by a
> type of array, and I usually pack an array into a single file.
> 
> From your last review, I gauged this would be too overloaded for your
> taste. So I'd be happy to make the change if you think two files look
> more intuitive from user's perspective.
 
I do think that two attributes are more user-friendly, but I don't feel
strongly about it.

> > > > As for the descriptions, what is the user-visible effect of these features?
> > > > How different modes of clearing the access bit are reflected in, say, GUI
> > > > responsiveness, database TPS, or probability of OOM?
> > >
> > > These remain to be seen :) I just added these switches in v7, per Mel's
> > > request from the meeting we had. These were never tested in the field.
> >
> > I see :)
> >
> > It would be nice to have a description or/and examples of user-visible
> > effects when there will be some insight on what these features do.
> 
> How does the following sound?
> 
> Clearing the accessed bit in large batches can theoretically cause
> lock contention (mmap_lock), and if it happens the 0x0002 switch can
> disable this feature. In this case the multigenerational LRU suffers a
> minor performance degradation.
> Clearing the accessed bit in non-leaf page table entries was only
> verified on Intel and AMD, and if it causes problems on other x86
> varieties the 0x0004 switch can disable this feature. In this case the
> multigenerational LRU suffers a negligible performance degradation.
 
LGTM

> > > > > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
> > > >
> > > > Is debugfs interface relevant only for datacenters?
> > >
> > > For the moment, yes.
> >
> > And what will happen if somebody uses these interfaces outside
> > datacenters? As soon as there is a sysfs interface, somebody will surely
> > play with it.
> >
> > I think the job schedulers might be the most important user of that
> > interface, but the documentation should not presume it is the only user.
> 
> Other ideas are more like brainstorming than concrete use cases, e.g.,
> for desktop users, these interfaces can in theory speed up hibernation
> (suspend to disk); for VM users, they can again in theory support auto
> ballooning. These niches are really minor and less explored compared
> with the data center use cases which have been dominant.
> 
> I was hoping we could focus on the essential and take one step at a
> time. Later on, if there is additional demand and resource, then we
> expand to cover more use cases.

Apparently I was not clear :)

I didn't mean that you should describe other use-cases; rather, I suggested
making the documentation more neutral, e.g. using "a user writes to this
file ..." instead of "job scheduler writes to a file ...". Or maybe add a
sentence in the beginning of the "Data centers" section, for instance:

Data centers
------------

+ A representative example of multigenerational LRU users is job
schedulers.

Data centers want to optimize job scheduling (bin packing) to improve
memory utilizations. Job schedulers need to estimate whether a server


-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-02-21  8:14         ` Yu Zhao
@ 2022-02-23 21:18           ` Yu Zhao
  2022-02-25 16:34             ` Minchan Kim
  2022-03-03 15:29           ` Johannes Weiner
  1 sibling, 1 reply; 74+ messages in thread
From: Yu Zhao @ 2022-02-23 21:18 UTC (permalink / raw)
  To: Johannes Weiner, Minchan Kim
  Cc: Andrew Morton, Mel Gorman, Michal Hocko, Andi Kleen,
	Aneesh Kumar, Barry Song, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
	Linus Torvalds, Matthew Wilcox, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	Linux ARM, open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh

On Mon, Feb 21, 2022 at 1:14 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Tue, Feb 15, 2022 at 04:53:56PM -0500, Johannes Weiner wrote:
> > Hi Yu,
> >
> > On Tue, Feb 15, 2022 at 02:43:05AM -0700, Yu Zhao wrote:
> > > On Thu, Feb 10, 2022 at 03:41:57PM -0500, Johannes Weiner wrote:
> > > > > +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
> > > > > +{
> > > > > +       unsigned long max_seq = lruvec->lrugen.max_seq;
> > > > > +
> > > > > +       VM_BUG_ON(gen >= MAX_NR_GENS);
> > > > > +
> > > > > +       /* see the comment on MIN_NR_GENS */
> > > > > +       return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
> > > > > +}
> > > >
> > > > I'm still reading the series, so correct me if I'm wrong: the "active"
> > > > set is split into two generations for the sole purpose of the
> > > > second-chance policy for fresh faults, right?
> > >
> > > To be precise, the active/inactive notion on top of generations is
> > > just for ABI compatibility, e.g., the counters in /proc/vmstat.
> > > Otherwise, this function wouldn't be needed.
> >
> > Ah! would you mind adding this as a comment to the function?
>
> Will do.
>
> > But AFAICS there is the lru_gen_del_folio() callsite that maps it to
> > the PG_active flag - which in turn gets used by add_folio() to place
> > the thing back on the max_seq generation. So I suppose there is a
> > secondary purpose of the function for remembering the page's rough age
> > for non-reclaim isolation.
>
> Yes, e.g., migration.
>
> > It would be good to capture that as well in a comment on the function.
>
> Will do.
>
> > > > > +static inline void lru_gen_update_size(struct lruvec *lruvec, enum lru_list lru,
> > > > > +                                      int zone, long delta)
> > > > > +{
> > > > > +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > > > > +
> > > > > +       lockdep_assert_held(&lruvec->lru_lock);
> > > > > +       WARN_ON_ONCE(delta != (int)delta);
> > > > > +
> > > > > +       __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, delta);
> > > > > +       __mod_zone_page_state(&pgdat->node_zones[zone], NR_ZONE_LRU_BASE + lru, delta);
> > > > > +}
> > > >
> > > > This is a duplicate of update_lru_size(), please use that instead.
> > > >
> > > > Yeah technically you don't need the mem_cgroup_update_lru_size() but
> > > > that's not worth sweating over, better to keep it simple.
> > >
> > > I agree we don't need the mem_cgroup_update_lru_size() -- let me spell
> > > out why:
> > >   this function is not needed here because it updates the counters used
> > >   only by the active/inactive lru code, i.e., get_scan_count().
> > >
> > > However, we can't reuse update_lru_size() because MGLRU can trip the
> > > WARN_ONCE() in mem_cgroup_update_lru_size().
> > >
> > > Unlike lru_zone_size[], lrugen->nr_pages[] is eventually consistent.
> > > To move a page to a different generation, the gen counter in page->flags
> > > is updated first, which doesn't require the LRU lock. The second step,
> > > i.e., the update of lrugen->nr_pages[], requires the LRU lock, and it
> > > usually isn't done immediately due to batching. Meanwhile, if this page
> > > is, for example, isolated, nr_pages[] becomes temporarily unbalanced.
> > > And this trips the WARN_ONCE().
> >
> > Good insight.
> >
> > But in that case, I'd still think it's better to use update_lru_size()
> > and gate the memcg update on lrugen-enabled, with a short comment
> > saying that lrugen has its own per-cgroup counts already. It's just a
> > bit too error prone to duplicate the stat updates.
> >
> > Even better would be:
> >
> > static __always_inline
> > void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
> > {
> >       enum lru_list lru = folio_lru_list(folio);
> >
> >       update_lru_size(lruvec, lru, folio_zonenum(folio),
> >                       folio_nr_pages(folio));
> >       if (lrugen_enabled(lruvec))
> >               lrugen_add_folio(lruvec, folio);
> >       else
> >               list_add(&folio->lru, &lruvec->lists[lru]);
> > }
> >
> > But it does mean you'd have to handle unevictable pages. I'm reviewing
> > from the position that mglru is going to supplant the existing reclaim
> > algorithm in the long term, though, so being more comprehensive and
> > eliminating special cases where possible is all-positive, IMO.
> >
> > Up to you. I'd only insist on reusing update_lru_size() at least.
>
> Will do.
>
> > > > > +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > > > > +{
> > > > > +       int gen;
> > > > > +       unsigned long old_flags, new_flags;
> > > > > +       int type = folio_is_file_lru(folio);
> > > > > +       int zone = folio_zonenum(folio);
> > > > > +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > > > > +
> > > > > +       if (folio_test_unevictable(folio) || !lrugen->enabled)
> > > > > +               return false;
> > > >
> > > > These two checks should be in the callsite and the function should
> > > > return void. Otherwise you can't understand the callsite without
> > > > drilling down into lrugen code, even if lrugen is disabled.
> > >
> > > I agree it's a bit of nuisance this way. The alternative is we'd need
> > > ifdef or another helper at the call sites because lrugen->enabled is
> > > specific to lrugen.
> >
> > Coming from memcg, my experience has been that when you have a compile
> > time-optional MM extension like this, you'll sooner or later need a
> > config-independent helper to gate callbacks in generic code. So I
> > think it's a good idea to add one now.
> >
> > One of these?
> >
> > lruvec_on_lrugen()
>
> SGTM.
>
> Personally I'd reuse lru_gen_enabled(), by passing NULL/lruvec. But
> my guess is you wouldn't like it.
>
> > lruvec_using_lrugen()
> > lruvec_lrugen_enabled()
> >
> > lruvec_has_generations() :-)
> >
> > > > On that note, I think #1 is reintroducing a problem we have fixed
> > > > before, which is trashing the workingset with a flood of use-once
> > > > mmapped pages. It's the classic scenario where LFU beats LRU.
> > > >
> > > > Mapped streaming IO isn't very common, but it does happen. See these
> > > > commits:
> > > >
> > > > dfc8d636cdb95f7b792d5ba8c9f3b295809c125d
> > > > 31c0569c3b0b6cc8a867ac6665ca081553f7984c
> > > > 645747462435d84c6c6a64269ed49cc3015f753d
> > > >
> > > > From the changelog:
> > > >
> > > >     The used-once mapped file page detection patchset.
> > > >
> > > >     It is meant to help workloads with large amounts of shortly used file
> > > >     mappings, like rtorrent hashing a file or git when dealing with loose
> > > >     objects (git gc on a bigger site?).
> > > >
> > > >     Right now, the VM activates referenced mapped file pages on first
> > > >     encounter on the inactive list and it takes a full memory cycle to
> > > >     reclaim them again.  When those pages dominate memory, the system
> > > >     no longer has a meaningful notion of 'working set' and is required
> > > >     to give up the active list to make reclaim progress.  Obviously,
> > > >     this results in rather bad scanning latencies and the wrong pages
> > > >     being reclaimed.
> > > >
> > > >     This patch makes the VM be more careful about activating mapped file
> > > >     pages in the first place.  The minimum granted lifetime without
> > > >     another memory access becomes an inactive list cycle instead of the
> > > >     full memory cycle, which is more natural given the mentioned loads.
> > > >
> > > > Translating this to multigen, it seems fresh faults should really
> > > > start on the second oldest rather than on the youngest generation, to
> > > > get a second chance but without jeopardizing the workingset if they
> > > > don't take it.
> > >
> > > This is a good point, and I had worked on a similar idea but failed
> > > to measure its benefits. In addition to placing mmapped file pages in
> > > older generations, I also tried placing refaulted anon pages in older
> > > generations. My conclusion was that the initial LRU positions of NFU
> > > pages are not a bottleneck for workloads I've tested. The efficiency
> > > of testing/clearing the accessed bit is.
> >
> > The concern isn't the scan overhead, but jankiness from the workingset
> > being flooded out by streaming IO.
>
> Yes, MGLRU uses a different approach to solve this problem, and for
> its approach, the scan overhead is the concern.
>
> MGLRU detects (defines) the working set by scanning the entire memory
> for each generation, and it counters the flooding by accelerating the
> creation of generations. IOW, all mapped pages have an equal chance to
> get scanned, no matter which generation they are in. This is a design
> difference compared with the active/inactive LRU, which tries to scan
> the active list less frequently and the inactive list more frequently.
>
> > The concrete usecase at the time was a torrent client hashing a
> > downloaded file and thereby kicking out the desktop environment, which
> > caused jankiness. The hashing didn't benefit from caching - the file
> > wouldn't have fit into RAM anyway - so this was pointless to boot.
> >
> > Essentially, the tradeoff is this:
> >
> > 1) If you treat new pages as hot, you accelerate workingset
> > transitions, but on the flipside you risk unnecessary refaults in
> > running applications when those new pages are one-off.
> >
> > 2) If you take new pages with a grain of salt, you protect existing
> > applications better from one-off floods, but risk refaults in NEW
> > application while they're trying to start up.
>
> Agreed.
>
> > There are two arguments for why 2) is preferable:
> >
> > 1) Users are tolerant of cache misses when applications first launch,
> >    much less so after they've been running for hours.
>
> Our CUJs (Critical User Journeys) respectfully disagree :)
>
> They are built on the observation that once users have moved onto
> another tab/app, they are more likely to stay with the new tab/app
> rather than go back to the old ones. Speaking for myself, this is
> generally the case.
>
> > 2) Workingset transitions (and associated jankiness) are bounded by
> >    the amount of RAM you need to repopulate. But streaming IO is
> >    bounded by storage, and datasets are routinely several times the
> >    amount of RAM. Uncacheable sets in excess of RAM can produce an
> >    infinite stream of "new" references; not protecting the workingset
> >    from that means longer or even sustained jankiness.
>
> I'd argue the opposite -- we shouldn't risk refaulting fresh hot pages
> just to accommodate this concrete yet minor use case, especially
> considering torrent has been given the means (MADV_SEQUENTIAL) to help
> itself.
>
> I appreciate all your points here. The bottom line is we agree this is
> a trade off. For what we disagree about, we could both be right -- it
> comes down to what workloads we care about *more*.
>
> To move forward, I propose we look at it from a non-technical POV:
> would we want to offer users an alternative trade off so that they can
> have greater flexibility?
>
> > > And some applications are smart enough to leverage MADV_SEQUENTIAL.
> > > In this case, MGLRU does place mmapped file pages in the oldest
> > > generation.
> >
> > Yes, it makes sense to optimize when MADV_SEQUENTIAL is requested. But
> > that hint isn't reliably there, so it matters that we don't do poorly
> > when it's missing.
>
> Agreed.
>
> > > I have an oversimplified script that uses memcached to mimic a
> > > non-streaming workload and fio a (mmapped) streaming workload:
> >
> > Looking at the parameters and observed behavior, let me say up front
> > that this looks like a useful benchmark, but doesn't capture the
> > scenario I was talking about above.
> >
> > For one, the presence of swapping in both kernels suggests that the
> > "streaming IO" component actually has repeat access that could benefit
> > from caching. Second, I would expect memcache is accessing its memory
> > frequently and consistently, and so could withstand workingset
> > challenges from streaming IO better than, say, a desktop environment.
>
> The fio workload is a real streaming workload, but the memcached
> workload might have been too large to be a typical desktop workload.
>
> More below.
>
> > More on that below.
> >
> > >   1. With MADV_SEQUENTIAL, the non-streaming workload is about 5 times
> > >      faster when using MGLRU. Somehow the baseline (rc3) swapped a lot.
> > >      (It shouldn't, and I haven't figured out why.)
> >
> > Baseline swaps when there are cache refaults. This is regardless of
> > the hint: you may say you're accessing these pages sequentially, but
> > the refaults say you're reusing them, with a frequency that suggests
> > they might be cacheable. So it tries to cache them.
> >
> > I'd be curious if that results in fio being faster, or whether it's
> > all just pointless thrashing. Can you share the fio results too?
>
> More below.
>
> > We could patch baseline to prioritize MADV_SEQUENTIAL more, but...
> >
> > >   2. Without MADV_SEQUENTIAL, the non-streaming workload is about
> > >      twice as fast when using MGLRU. Both MGLRU and the baseline
> > >      swapped a lot.
> >
> > ...in practice I think this scenario will matter to a lot more users.
>
> I strongly feel we should prioritize what's advertised on a man page
> over an unspecified (performance) behavior.
>
> > I would again be interested in the fio results.
> >
> > >            MADV_SEQUENTIAL    non-streaming ops/sec (memcached)
> > >   rc3      yes                 292k
> > >   rc3      no                  203k
> > >   rc3+v7   yes                1967k
> > >   rc3+v7   no                  436k

Appending a few notes on the baseline results:
1. Apparently FADV_DONTNEED rejects mmapped pages -- I found no reason
in the man page or the original commit why it should. I propose we
remove the page_mapped() check in lru_deactivate_file_fn(). Adding
Minchan to see what he thinks about this.
2. For MADV_SEQUENTIAL, I made workingset_refault() check (VM_SEQ_READ
| VM_RAND_READ) and return early if either is set, and the performance
got a lot better (~3x).
3. In page_referenced_one(), I think we should also exclude
VM_RAND_READ in addition to VM_SEQ_READ.
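
To make notes 1 and 3 above concrete, rough sketches of the two changes
(based on my reading of the current mm/swap.c and mm/rmap.c; the hunks
are illustrative, not a tested patch, and the context lines are
approximate):

--- a/mm/swap.c
+++ b/mm/swap.c
@@ lru_deactivate_file_fn()
 	if (PageUnevictable(page))
 		return;
 
-	/* Some processes are using the page */
-	if (page_mapped(page))
-		return;
-
 	del_page_from_lru_list(page, lruvec);

--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ page_referenced_one()
 		if (ptep_clear_flush_young_notify(vma, address, pvmw.pte)) {
-			if (likely(!(vma->vm_flags & VM_SEQ_READ)))
+			if (likely(!(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))))
 				referenced++;
 		}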

> > >   cat mmap.sh
> > >   modprobe brd rd_nr=2 rd_size=56623104
> > >
> > >   mkswap /dev/ram0
> > >   swapon /dev/ram0
> > >
> > >   mkfs.ext4 /dev/ram1
> > >   mount -t ext4 /dev/ram1 /mnt
> > >
> > >   memtier_benchmark -S /var/run/memcached/memcached.sock -P memcache_binary \
> > >     -n allkeys --key-minimum=1 --key-maximum=50000000 --key-pattern=P:P -c 1 \
> > >     -t 36 --ratio 1:0 --pipeline 8 -d 2000
> > >
> > >   # streaming workload: --fadvise_hint=0 disables MADV_SEQUENTIAL
> > >   fio -name=mglru --numjobs=12 --directory=/mnt --size=4224m --buffered=1 \
> > >     --ioengine=mmap --iodepth=128 --iodepth_batch_submit=32 \
> > >     --iodepth_batch_complete=32 --rw=read --time_based --ramp_time=10m \
> > >     --runtime=180m --group_reporting &
> >
> > As per above, I think this would be closer to a cacheable workingset
> > than a streaming IO pattern. It depends on total RAM of course, but
> > size=4G and time_based should loop around pretty quickly.
>
> The file size here shouldn't matter since fio is smart enough to
> invalidate page cache before it rewinds (for sequential access):
>
> https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-invalidate
> https://github.com/axboe/fio/blob/master/filesetup.c#L602
>
> I think the problem might have been that the memory size for memcached
> was too large (100GB) to be all hot (limited by memory bandwidth).
>
> > Would you mind rerunning with files larger than RAM, to avoid repeat
> > accesses (or at least only repeat with large distances)?
>
> Retested with the same free memory (120GB) for 40GB memcached and 200GB
> fio.
>
>            MADV_SEQUENTIAL  FADV_DONTNEED  memcached  fio
>   rc4      no               yes            4716k      232k
>   rc4+v7   no               yes            4307k      265k
>   delta                                    -9%        +14%
>
> MGLRU lost with memcached but won with fio for the same reason: it
> doesn't have any heuristics to detect the streaming characteristic
> (and therefore lost with memcached) but relies on faster scanning
> (and therefore won with fio) to keep the working set in memory.
>
> The baseline didn't swap this time (MGLRU did slightly), but it lost
> with fio because it had to walk the rmap for each page in the entire
> 200GB VMA, at least once, even for this streaming workload.
>
> This reflects the design difference I mentioned earlier.
>
>   cat test.sh
>   modprobe brd rd_nr=1 rd_size=268435456
>
>   mkfs.ext4 /dev/ram0
>   mount -t ext4 /dev/ram0 /mnt
>
>   fallocate -l 40g /mnt/swapfile
>   mkswap /mnt/swapfile
>   swapon /mnt/swapfile
>
>   fio -name=mglru --numjobs=1 --directory=/mnt --size=204800m \
>     --buffered=1 --ioengine=mmap --fadvise_hint=0 --iodepth=128 \
>     --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
>     --rw=read --time_based --ramp_time=10m --runtime=180m \
>     --group_reporting &
>   pid=$!
>
>   sleep 600
>
>   # load objects
>   memtier_benchmark -S /var/run/memcached/memcached.sock \
>     -P memcache_binary -n allkeys --key-minimum=1 \
>     --key-maximum=20000000 --key-pattern=P:P -c 1 -t 36 \
>     --ratio 1:0 --pipeline 8 -d 2000
>   # read objects
>   memtier_benchmark -S /var/run/memcached/memcached.sock \
>     -P memcache_binary -n allkeys --key-minimum=1 \
>     --key-maximum=20000000 --key-pattern=R:R -c 1 -t 36 \
>     --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
>
>   kill -INT $pid
>
> > Depending on how hot memcache runs, it may or may not be able to hold
> > onto its workingset.
>
> Agreed.
>
> > Testing interactivity is notoriously hard, but
> > using a smaller, intermittent workload is probably more representative
> > of overall responsiveness. Let fio ramp until memory is full, then do
> > perf stat -r 10 /bin/sh -c 'git shortlog v5.0.. >/dev/null; sleep 1'
>
> I'll also check with the downstream maintainers to see if they have
> heard any complaints about streaming workloads negatively impacting
> user experience.
>
> > I'll try to reproduce this again too. Back then, that workload gave me
> > a very janky desktop experience, and the patch very obvious relief.
>
> SGTM.
>
> > > > > @@ -113,6 +298,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
> > > > >  {
> > > > >         enum lru_list lru = folio_lru_list(folio);
> > > > >
> > > > > +       if (lru_gen_add_folio(lruvec, folio, true))
> > > > > +               return;
> > > > > +
> > > >
> > > > bool parameters are notoriously hard to follow in the callsite. Can
> > > > you please add lru_gen_add_folio_tail() instead and have them use a
> > > > common helper?
> > >
> > > I'm not sure -- there are several places like this one. My question is
> > > whether we want to do it throughout this patchset. We'd end up with
> > > many helpers and duplicate code. E.g., in this file alone, we have two
> > > functions taking bool parameters:
> > >   lru_gen_add_folio(..., bool reclaiming)
> > >   lru_gen_del_folio(..., bool reclaiming)
> > >
> > > I can't say they are very readable; at least they are very compact
> > > right now. My concern is that we might lose the latter without having
> > > enough of the former.
> > >
> > > Perhaps this is something that we could revisit after you've finished
> > > reviewing the entire patchset?
> >
> > Sure, fair enough.
> >
> > > > > +void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec);
> > > >
> > > > "state" is what we usually init :) How about lrugen_init_lruvec()?
> > >
> > > Same story as "file", lol -- this used to be lru_gen_init_lruvec():
> > > https://lore.kernel.org/linux-mm/20210413065633.2782273-9-yuzhao@google.com/
> > >
> > > Naming is hard. Hopefully we can finalize it this time.
> >
> > Was that internal feedback? The revisions show this function went
> > through several names, but I don't see reviews requesting those. If
> > they weren't public I'm gonna pretend they didn't happen ;-)
>
> Indeed. I lost track.
>
> > > > You can drop the memcg parameter and use lruvec_memcg().
> > >
> > > lruvec_memcg() isn't available yet when pgdat_init_internals() calls
> > > this function because mem_cgroup_disabled() is initialized afterward.
> >
> > Good catch. That'll container_of() into garbage. However, we have to
> > assume that somebody's going to try that simplification again, so we
> > should set up the code now to prevent issues.
> >
> > cgroup_disable parsing is self-contained, so we can pull it ahead in
> > the init sequence. How about this?
> >
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index 9d05c3ca2d5e..b544d768edc8 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -6464,9 +6464,9 @@ static int __init cgroup_disable(char *str)
> >                       break;
> >               }
> >       }
> > -     return 1;
> > +     return 0;
> >  }
> > -__setup("cgroup_disable=", cgroup_disable);
> > +early_param("cgroup_disable", cgroup_disable);
>
> I think early_param() is still after pgdat_init_internals(), no?
>
> Thanks!


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 12/12] mm: multigenerational LRU: documentation
  2022-02-23 10:58           ` Mike Rapoport
@ 2022-02-23 21:20             ` Yu Zhao
  0 siblings, 0 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-23 21:20 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	Linux ARM, open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh

On Wed, Feb 23, 2022 at 3:58 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Mon, Feb 21, 2022 at 06:47:25PM -0700, Yu Zhao wrote:
> > On Mon, Feb 21, 2022 at 2:02 AM Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > On Tue, Feb 15, 2022 at 08:22:10PM -0700, Yu Zhao wrote:
> > > > > Please consider splitting "enable" and "features" attributes.
> > > >
> > > > How about s/Features/Components/?
> > >
> > > I meant to use two attributes:
> > >
> > > /sys/kernel/mm/lru_gen/enable for the main breaker, and
> > > /sys/kernel/mm/lru_gen/features (or components) for the branch breakers
> >
> > It's a bit superfluous for my taste. I generally consider multiple
> > items to fall into the same category if they can be expressed by a
> > type of array, and I usually pack an array into a single file.
> >
> > From your last review, I gauged this would be too overloaded for your
> > taste. So I'd be happy to make the change if you think two files look
> > more intuitive from user's perspective.
>
> I do think that two attributes are more user-friendly, but I don't feel
> strongly about it.
>
> > > > > As for the descriptions, what is the user-visible effect of these features?
> > > > > How different modes of clearing the access bit are reflected in, say, GUI
> > > > > responsiveness, database TPS, or probability of OOM?
> > > >
> > > > These remain to be seen :) I just added these switches in v7, per Mel's
> > > > request from the meeting we had. These were never tested in the field.
> > >
> > > I see :)
> > >
> > > It would be nice to have a description or/and examples of user-visible
> > > effects when there will be some insight on what these features do.
> >
> > How does the following sound?
> >
> > Clearing the accessed bit in large batches can theoretically cause
> > lock contention (mmap_lock), and if it happens the 0x0002 switch can
> > disable this feature. In this case the multigenerational LRU suffers a
> > minor performance degradation.
> > Clearing the accessed bit in non-leaf page table entries was only
> > verified on Intel and AMD, and if it causes problems on other x86
> > varieties the 0x0004 switch can disable this feature. In this case the
> > multigenerational LRU suffers a negligible performance degradation.
>
> LGTM
>
> > > > > > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
> > > > >
> > > > > Is debugfs interface relevant only for datacenters?
> > > >
> > > > For the moment, yes.
> > >
> > > And what will happen if somebody uses these interfaces outside
> > > datacenters? As soon as there is a sysfs interface, somebody will surely
> > > play with it.
> > >
> > > I think the job schedulers might be the most important user of that
> > > interface, but the documentation should not presume it is the only user.
> >
> > Other ideas are more like brainstorming than concrete use cases, e.g.,
> > for desktop users, these interfaces can in theory speed up hibernation
> > (suspend to disk); for VM users, they can again in theory support auto
> > ballooning. These niches are really minor and less explored compared
> > with the data center use cases which have been dominant.
> >
> > I was hoping we could focus on the essential and take one step at a
> > time. Later on, if there is additional demand and resource, then we
> > expand to cover more use cases.
>
> Apparently I was not clear :)
>
> I didn't mean that you should describe other use-cases; rather, I suggested
> making the documentation more neutral, e.g. using "a user writes to this
> file ..." instead of "job scheduler writes to a file ...". Or maybe add a
> sentence in the beginning of the "Data centers" section, for instance:
>
> Data centers
> ------------
>
> + A representative example of multigenerational LRU users is job
> schedulers.
>
> Data centers want to optimize job scheduling (bin packing) to improve
> memory utilizations. Job schedulers need to estimate whether a server

Yes, that makes sense. Will do. Thanks.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation
  2022-02-23  9:36     ` Yu Zhao
@ 2022-02-24  0:59       ` Huang, Ying
  2022-02-24  1:34         ` Yu Zhao
  0 siblings, 1 reply; 74+ messages in thread
From: Huang, Ying @ 2022-02-24  0:59 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Linux ARM, open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh

Yu Zhao <yuzhao@google.com> writes:

> On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Hi, Yu,
>>
>> Yu Zhao <yuzhao@google.com> writes:
>>
>> > To avoid confusions, the terms "promotion" and "demotion" will be
>> > applied to the multigenerational LRU, as a new convention; the terms
>> > "activation" and "deactivation" will be applied to the active/inactive
>> > LRU, as usual.
>>
>> In the memory tiering related commits and patchset, for example as follows,
>>
>> commit 668e4147d8850df32ca41e28f52c146025ca45c6
>> Author: Yang Shi <yang.shi@linux.alibaba.com>
>> Date:   Thu Sep 2 14:59:19 2021 -0700
>>
>>     mm/vmscan: add page demotion counter
>>
>> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
>>
>> "demote" and "promote" is used for migrating pages between different
>> types of memory.  Is it better for us to avoid overloading these words
>> too much to avoid the possible confusion?
>
> Given that LRU and migration are usually different contexts, I think
> we'd be fine, unless we want a third pair of terms.

This was true before memory tiering was introduced.  In systems with
multiple types of memory (called memory tiering), the LRU is used to identify
pages to be migrated to the slow memory node.  Please take a look at
can_demote(), which is called in shrink_page_list().

>> > +static int get_swappiness(struct mem_cgroup *memcg)
>> > +{
>> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
>> > +            mem_cgroup_swappiness(memcg) : 0;
>> > +}
>>
>> After we introduced demotion support in Linux kernel.  The anonymous
>> pages in the fast memory node could be demoted to the slow memory node
>> via the page reclaiming mechanism as in the following commit.  Can you
>> consider that too?
>
> Sure. How do I check whether there is still space on the slow node?

You can always check the watermark of the slow node.  But right now we
don't actually check it (see demote_page_list()); instead we wake up
kswapd on the slow node.  The intended behavior is something like,

  DRAM -> PMEM -> disk
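
For illustration, a rough and untested sketch of the above -- checking
whether a node's demotion target still has headroom above its high
watermark before relying on the DRAM -> PMEM -> disk path.
next_demotion_node(), zone_watermark_ok(), managed_zone() and
high_wmark_pages() exist in this era's kernel; the helper itself and the
ZONE_MOVABLE choice are only assumptions made for this example:

/*
 * Illustrative sketch only, not a real patch: does @nid's demotion
 * target still have space above its high watermark?
 */
static bool demotion_target_has_space(int nid)
{
        int target = next_demotion_node(nid);
        struct zone *zone;

        if (target == NUMA_NO_NODE)
                return false;

        zone = &NODE_DATA(target)->node_zones[ZONE_MOVABLE];
        if (!managed_zone(zone))
                return false;

        return zone_watermark_ok(zone, 0, high_wmark_pages(zone),
                                 ZONE_MOVABLE, 0);
}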

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation
  2022-02-24  0:59       ` Huang, Ying
@ 2022-02-24  1:34         ` Yu Zhao
  2022-02-24  3:31           ` Huang, Ying
  0 siblings, 1 reply; 74+ messages in thread
From: Yu Zhao @ 2022-02-24  1:34 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Linux ARM, open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh

On Wed, Feb 23, 2022 at 5:59 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Hi, Yu,
> >>
> >> Yu Zhao <yuzhao@google.com> writes:
> >>
> >> > To avoid confusions, the terms "promotion" and "demotion" will be
> >> > applied to the multigenerational LRU, as a new convention; the terms
> >> > "activation" and "deactivation" will be applied to the active/inactive
> >> > LRU, as usual.
> >>
> >> In the memory tiering related commits and patchset, for example as follows,
> >>
> >> commit 668e4147d8850df32ca41e28f52c146025ca45c6
> >> Author: Yang Shi <yang.shi@linux.alibaba.com>
> >> Date:   Thu Sep 2 14:59:19 2021 -0700
> >>
> >>     mm/vmscan: add page demotion counter
> >>
> >> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
> >>
> >> "demote" and "promote" is used for migrating pages between different
> >> types of memory.  Is it better for us to avoid overloading these words
> >> too much to avoid the possible confusion?
> >
> > Given that LRU and migration are usually different contexts, I think
> > we'd be fine, unless we want a third pair of terms.
>
> This is true before memory tiering is introduced.  In systems with
> multiple types memory (called memory tiering), LRU is used to identify
> pages to be migrated to the slow memory node.  Please take a look at
> can_demote(), which is called in shrink_page_list().

These sound like two clearly separate contexts to me: promotion/demotion
(movement between generations) while pages are on the LRU, versus
promotion/demotion (migration between nodes) after pages are taken off
the LRU.

Note that promotion/demotion are not used in function names. They are
used to describe how MGLRU works, in comparison with the
active/inactive LRU. Memory tiering is not within this context.

> >> > +static int get_swappiness(struct mem_cgroup *memcg)
> >> > +{
> >> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
> >> > +            mem_cgroup_swappiness(memcg) : 0;
> >> > +}
> >>
> >> After we introduced demotion support in Linux kernel.  The anonymous
> >> pages in the fast memory node could be demoted to the slow memory node
> >> via the page reclaiming mechanism as in the following commit.  Can you
> >> consider that too?
> >
> > Sure. How do I check whether there is still space on the slow node?
>
> You can always check the watermark of the slow node.  But now, we
> actually don't check that (as in demote_page_list()), instead we will
> wake up kswapd of the slow node.  The intended behavior is something
> like,
>
>   DRAM -> PMEM -> disk

I'll look into this later -- for now, it's a low priority because
there isn't much demand. I'll bump it up if anybody is interested in
giving it a try. Meanwhile, please feel free to cook up something if
you are interested.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation
  2022-02-24  1:34         ` Yu Zhao
@ 2022-02-24  3:31           ` Huang, Ying
  2022-02-24  4:09             ` Yu Zhao
  0 siblings, 1 reply; 74+ messages in thread
From: Huang, Ying @ 2022-02-24  3:31 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Linux ARM, open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh

Yu Zhao <yuzhao@google.com> writes:

> On Wed, Feb 23, 2022 at 5:59 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yu Zhao <yuzhao@google.com> writes:
>>
>> > On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Hi, Yu,
>> >>
>> >> Yu Zhao <yuzhao@google.com> writes:
>> >>
>> >> > To avoid confusions, the terms "promotion" and "demotion" will be
>> >> > applied to the multigenerational LRU, as a new convention; the terms
>> >> > "activation" and "deactivation" will be applied to the active/inactive
>> >> > LRU, as usual.
>> >>
>> >> In the memory tiering related commits and patchset, for example as follows,
>> >>
>> >> commit 668e4147d8850df32ca41e28f52c146025ca45c6
>> >> Author: Yang Shi <yang.shi@linux.alibaba.com>
>> >> Date:   Thu Sep 2 14:59:19 2021 -0700
>> >>
>> >>     mm/vmscan: add page demotion counter
>> >>
>> >> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
>> >>
>> >> "demote" and "promote" is used for migrating pages between different
>> >> types of memory.  Is it better for us to avoid overloading these words
>> >> too much to avoid the possible confusion?
>> >
>> > Given that LRU and migration are usually different contexts, I think
>> > we'd be fine, unless we want a third pair of terms.
>>
>> This is true before memory tiering is introduced.  In systems with
>> multiple types memory (called memory tiering), LRU is used to identify
>> pages to be migrated to the slow memory node.  Please take a look at
>> can_demote(), which is called in shrink_page_list().
>
> This sounds clearly two contexts to me. Promotion/demotion (move
> between generations) while pages are on LRU; or promotion/demotion
> (migration between nodes) after pages are taken off LRU.
>
> Note that promotion/demotion are not used in function names. They are
> used to describe how MGLRU works, in comparison with the
> active/inactive LRU. Memory tiering is not within this context.

Because we already use pgdemote_* in /proc/vmstat and "demotion_enabled"
in /sys/kernel/mm/numa, and will use pgpromote_* in /proc/vmstat, it
seems better to avoid using promote/demote directly for MGLRU in the
ABI.  A possible solution is to use "mglru" and "promote/demote"
together (such as "mglru_promote_*") when it is needed?

>> >> > +static int get_swappiness(struct mem_cgroup *memcg)
>> >> > +{
>> >> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
>> >> > +            mem_cgroup_swappiness(memcg) : 0;
>> >> > +}
>> >>
>> >> After we introduced demotion support in Linux kernel.  The anonymous
>> >> pages in the fast memory node could be demoted to the slow memory node
>> >> via the page reclaiming mechanism as in the following commit.  Can you
>> >> consider that too?
>> >
>> > Sure. How do I check whether there is still space on the slow node?
>>
>> You can always check the watermark of the slow node.  But now, we
>> actually don't check that (as in demote_page_list()), instead we will
>> wake up kswapd of the slow node.  The intended behavior is something
>> like,
>>
>>   DRAM -> PMEM -> disk
>
> I'll look into this later -- for now, it's a low priority because
> there isn't much demand. I'll bump it up if anybody is interested in
> giving it a try. Meanwhile, please feel free to cook up something if
> you are interested.

When we introduce a new feature, we shouldn't break an existing one.
That is, we shouldn't introduce regressions.  I think that is a rule?

If my understanding is correct, MGLRU will skip scanning the anonymous
page list even if there's a demotion target for the node.  This breaks
the demotion feature in the upstream kernel.  Right?

It's a new feature to check whether there is still space on the slow
node.  We can look at that later.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation
  2022-02-24  3:31           ` Huang, Ying
@ 2022-02-24  4:09             ` Yu Zhao
  2022-02-24  5:27               ` Huang, Ying
  0 siblings, 1 reply; 74+ messages in thread
From: Yu Zhao @ 2022-02-24  4:09 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Linux ARM, open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh

On Wed, Feb 23, 2022 at 8:32 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > On Wed, Feb 23, 2022 at 5:59 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yu Zhao <yuzhao@google.com> writes:
> >>
> >> > On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Hi, Yu,
> >> >>
> >> >> Yu Zhao <yuzhao@google.com> writes:
> >> >>
> >> >> > To avoid confusions, the terms "promotion" and "demotion" will be
> >> >> > applied to the multigenerational LRU, as a new convention; the terms
> >> >> > "activation" and "deactivation" will be applied to the active/inactive
> >> >> > LRU, as usual.
> >> >>
> >> >> In the memory tiering related commits and patchset, for example as follows,
> >> >>
> >> >> commit 668e4147d8850df32ca41e28f52c146025ca45c6
> >> >> Author: Yang Shi <yang.shi@linux.alibaba.com>
> >> >> Date:   Thu Sep 2 14:59:19 2021 -0700
> >> >>
> >> >>     mm/vmscan: add page demotion counter
> >> >>
> >> >> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
> >> >>
> >> >> "demote" and "promote" is used for migrating pages between different
> >> >> types of memory.  Is it better for us to avoid overloading these words
> >> >> too much to avoid the possible confusion?
> >> >
> >> > Given that LRU and migration are usually different contexts, I think
> >> > we'd be fine, unless we want a third pair of terms.
> >>
> >> This is true before memory tiering is introduced.  In systems with
> >> multiple types memory (called memory tiering), LRU is used to identify
> >> pages to be migrated to the slow memory node.  Please take a look at
> >> can_demote(), which is called in shrink_page_list().
> >
> > This sounds clearly two contexts to me. Promotion/demotion (move
> > between generations) while pages are on LRU; or promotion/demotion
> > (migration between nodes) after pages are taken off LRU.
> >
> > Note that promotion/demotion are not used in function names. They are
> > used to describe how MGLRU works, in comparison with the
> > active/inactive LRU. Memory tiering is not within this context.
>
> Because we have used pgdemote_* in /proc/vmstat, "demotion_enabled" in
> /sys/kernel/mm/numa, and will use pgpromote_* in /proc/vmstat.  It seems
> better to avoid to use promote/demote directly for MGLRU in ABI.  A
> possible solution is to use "mglru" and "promote/demote" together (such
> as "mglru_promote_*" when it is needed?

*If* it is needed. Currently there are no such plans.

> >> >> > +static int get_swappiness(struct mem_cgroup *memcg)
> >> >> > +{
> >> >> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
> >> >> > +            mem_cgroup_swappiness(memcg) : 0;
> >> >> > +}
> >> >>
> >> >> After we introduced demotion support in Linux kernel.  The anonymous
> >> >> pages in the fast memory node could be demoted to the slow memory node
> >> >> via the page reclaiming mechanism as in the following commit.  Can you
> >> >> consider that too?
> >> >
> >> > Sure. How do I check whether there is still space on the slow node?
> >>
> >> You can always check the watermark of the slow node.  But now, we
> >> actually don't check that (as in demote_page_list()), instead we will
> >> wake up kswapd of the slow node.  The intended behavior is something
> >> like,
> >>
> >>   DRAM -> PMEM -> disk
> >
> > I'll look into this later -- for now, it's a low priority because
> > there isn't much demand. I'll bump it up if anybody is interested in
> > giving it a try. Meanwhile, please feel free to cook up something if
> > you are interested.
>
> When we introduce a new feature, we shouldn't break an existing one.
> That is, not introducing regression.  I think that it is a rule?
>
> If my understanding were correct, MGLRU will ignore to scan anonymous
> page list even if there's demotion target for the node.  This breaks the
> demotion feature in the upstream kernel.  Right?

I'm not saying this shouldn't be fixed. I'm saying it's a low priority
until somebody is interested in using/testing it (or making it work).

Regarding regressions, I'm sure MGLRU *will* regress many workloads.
Its goal is to improve the majority of use cases, i.e., a total net
gain. Trying to improve everything is methodologically wrong because
the problem space is near infinite but resources are limited. So we
have to prioritize major use cases over minor ones. The bottom line is
that users have a choice not to use MGLRU.

> It's a new feature to check whether there is still space on the slow
> node.  We can look at that later.

SGTM.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation
  2022-02-24  4:09             ` Yu Zhao
@ 2022-02-24  5:27               ` Huang, Ying
  2022-02-24  5:35                 ` Yu Zhao
  0 siblings, 1 reply; 74+ messages in thread
From: Huang, Ying @ 2022-02-24  5:27 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Linux ARM, open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh

Yu Zhao <yuzhao@google.com> writes:

> On Wed, Feb 23, 2022 at 8:32 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yu Zhao <yuzhao@google.com> writes:
>>
>> > On Wed, Feb 23, 2022 at 5:59 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yu Zhao <yuzhao@google.com> writes:
>> >>
>> >> > On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Hi, Yu,
>> >> >>
>> >> >> Yu Zhao <yuzhao@google.com> writes:
>> >> >>
>> >> >> > To avoid confusions, the terms "promotion" and "demotion" will be
>> >> >> > applied to the multigenerational LRU, as a new convention; the terms
>> >> >> > "activation" and "deactivation" will be applied to the active/inactive
>> >> >> > LRU, as usual.
>> >> >>
>> >> >> In the memory tiering related commits and patchset, for example as follows,
>> >> >>
>> >> >> commit 668e4147d8850df32ca41e28f52c146025ca45c6
>> >> >> Author: Yang Shi <yang.shi@linux.alibaba.com>
>> >> >> Date:   Thu Sep 2 14:59:19 2021 -0700
>> >> >>
>> >> >>     mm/vmscan: add page demotion counter
>> >> >>
>> >> >> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
>> >> >>
>> >> >> "demote" and "promote" is used for migrating pages between different
>> >> >> types of memory.  Is it better for us to avoid overloading these words
>> >> >> too much to avoid the possible confusion?
>> >> >
>> >> > Given that LRU and migration are usually different contexts, I think
>> >> > we'd be fine, unless we want a third pair of terms.
>> >>
>> >> This is true before memory tiering is introduced.  In systems with
>> >> multiple types memory (called memory tiering), LRU is used to identify
>> >> pages to be migrated to the slow memory node.  Please take a look at
>> >> can_demote(), which is called in shrink_page_list().
>> >
>> > This sounds clearly two contexts to me. Promotion/demotion (move
>> > between generations) while pages are on LRU; or promotion/demotion
>> > (migration between nodes) after pages are taken off LRU.
>> >
>> > Note that promotion/demotion are not used in function names. They are
>> > used to describe how MGLRU works, in comparison with the
>> > active/inactive LRU. Memory tiering is not within this context.
>>
>> Because we have used pgdemote_* in /proc/vmstat, "demotion_enabled" in
>> /sys/kernel/mm/numa, and will use pgpromote_* in /proc/vmstat.  It seems
>> better to avoid to use promote/demote directly for MGLRU in ABI.  A
>> possible solution is to use "mglru" and "promote/demote" together (such
>> as "mglru_promote_*" when it is needed?
>
> *If* it is needed. Currently there are no such plans.

OK.

>> >> >> > +static int get_swappiness(struct mem_cgroup *memcg)
>> >> >> > +{
>> >> >> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
>> >> >> > +            mem_cgroup_swappiness(memcg) : 0;
>> >> >> > +}
>> >> >>
>> >> >> After we introduced demotion support in Linux kernel.  The anonymous
>> >> >> pages in the fast memory node could be demoted to the slow memory node
>> >> >> via the page reclaiming mechanism as in the following commit.  Can you
>> >> >> consider that too?
>> >> >
>> >> > Sure. How do I check whether there is still space on the slow node?
>> >>
>> >> You can always check the watermark of the slow node.  But now, we
>> >> actually don't check that (as in demote_page_list()), instead we will
>> >> wake up kswapd of the slow node.  The intended behavior is something
>> >> like,
>> >>
>> >>   DRAM -> PMEM -> disk
>> >
>> > I'll look into this later -- for now, it's a low priority because
>> > there isn't much demand. I'll bump it up if anybody is interested in
>> > giving it a try. Meanwhile, please feel free to cook up something if
>> > you are interested.
>>
>> When we introduce a new feature, we shouldn't break an existing one.
>> That is, not introducing regression.  I think that it is a rule?
>>
>> If my understanding were correct, MGLRU will ignore to scan anonymous
>> page list even if there's demotion target for the node.  This breaks the
>> demotion feature in the upstream kernel.  Right?
>
> I'm not saying this shouldn't be fixed. I'm saying it's a low priority
> until somebody is interested in using/testing it (or making it work).

We are interested in this feature and can help to test it.

> Regarding regressions, I'm sure MGLRU *will* regress many workloads.
> Its goal is to improve the majority of use cases, i.e., total net
> gain. Trying to improve everything is methodically wrong because the
> problem space is near infinite but the resource is limited. So we have
> to prioritize major use cases over minor ones. The bottom line is
> users have a choice not to use MGLRU.

This is a functionality regression, not a performance regression.
Without demotion support, some workloads will go OOM when DRAM is used
up (while PMEM isn't) if PMEM is onlined in the movable zone (as
recommended).
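
As an illustration of the direction being discussed (not the posted
patch), get_swappiness() could treat a demotion target like available
swap, similar to what can_reclaim_anon_pages() already does for the
active/inactive LRU.  can_demote(nid, sc) exists in this era's
mm/vmscan.c; the changed signature and the NULL scan_control below are
assumptions made only for this sketch:

static int get_swappiness(struct lruvec *lruvec, struct mem_cgroup *memcg)
{
        int nid = lruvec_pgdat(lruvec)->node_id;

        /* anon is scannable if there is swap space or a demotion target */
        if (mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ||
            can_demote(nid, NULL))
                return mem_cgroup_swappiness(memcg);

        return 0;
}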

>> It's a new feature to check whether there is still space on the slow
>> node.  We can look at that later.
>
> SGTM.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation
  2022-02-24  5:27               ` Huang, Ying
@ 2022-02-24  5:35                 ` Yu Zhao
  0 siblings, 0 replies; 74+ messages in thread
From: Yu Zhao @ 2022-02-24  5:35 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Linux ARM, open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh

On Wed, Feb 23, 2022 at 10:27 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > On Wed, Feb 23, 2022 at 8:32 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yu Zhao <yuzhao@google.com> writes:
> >>
> >> > On Wed, Feb 23, 2022 at 5:59 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yu Zhao <yuzhao@google.com> writes:
> >> >>
> >> >> > On Wed, Feb 23, 2022 at 1:28 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Hi, Yu,
> >> >> >>
> >> >> >> Yu Zhao <yuzhao@google.com> writes:
> >> >> >>
> >> >> >> > To avoid confusions, the terms "promotion" and "demotion" will be
> >> >> >> > applied to the multigenerational LRU, as a new convention; the terms
> >> >> >> > "activation" and "deactivation" will be applied to the active/inactive
> >> >> >> > LRU, as usual.
> >> >> >>
> >> >> >> In the memory tiering related commits and patchset, for example as follows,
> >> >> >>
> >> >> >> commit 668e4147d8850df32ca41e28f52c146025ca45c6
> >> >> >> Author: Yang Shi <yang.shi@linux.alibaba.com>
> >> >> >> Date:   Thu Sep 2 14:59:19 2021 -0700
> >> >> >>
> >> >> >>     mm/vmscan: add page demotion counter
> >> >> >>
> >> >> >> https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/
> >> >> >>
> >> >> >> "demote" and "promote" is used for migrating pages between different
> >> >> >> types of memory.  Is it better for us to avoid overloading these words
> >> >> >> too much to avoid the possible confusion?
> >> >> >
> >> >> > Given that LRU and migration are usually different contexts, I think
> >> >> > we'd be fine, unless we want a third pair of terms.
> >> >>
> >> >> This is true before memory tiering is introduced.  In systems with
> >> >> multiple types memory (called memory tiering), LRU is used to identify
> >> >> pages to be migrated to the slow memory node.  Please take a look at
> >> >> can_demote(), which is called in shrink_page_list().
> >> >
> >> > This sounds clearly two contexts to me. Promotion/demotion (move
> >> > between generations) while pages are on LRU; or promotion/demotion
> >> > (migration between nodes) after pages are taken off LRU.
> >> >
> >> > Note that promotion/demotion are not used in function names. They are
> >> > used to describe how MGLRU works, in comparison with the
> >> > active/inactive LRU. Memory tiering is not within this context.
> >>
> >> Because we have used pgdemote_* in /proc/vmstat, "demotion_enabled" in
> >> /sys/kernel/mm/numa, and will use pgpromote_* in /proc/vmstat.  It seems
> >> better to avoid to use promote/demote directly for MGLRU in ABI.  A
> >> possible solution is to use "mglru" and "promote/demote" together (such
> >> as "mglru_promote_*" when it is needed?
> >
> > *If* it is needed. Currently there are no such plans.
>
> OK.
>
> >> >> >> > +static int get_swappiness(struct mem_cgroup *memcg)
> >> >> >> > +{
> >> >> >> > +     return mem_cgroup_get_nr_swap_pages(memcg) >= MIN_LRU_BATCH ?
> >> >> >> > +            mem_cgroup_swappiness(memcg) : 0;
> >> >> >> > +}
> >> >> >>
> >> >> >> After we introduced demotion support in Linux kernel.  The anonymous
> >> >> >> pages in the fast memory node could be demoted to the slow memory node
> >> >> >> via the page reclaiming mechanism as in the following commit.  Can you
> >> >> >> consider that too?
> >> >> >
> >> >> > Sure. How do I check whether there is still space on the slow node?
> >> >>
> >> >> You can always check the watermark of the slow node.  But now, we
> >> >> actually don't check that (as in demote_page_list()), instead we will
> >> >> wake up kswapd of the slow node.  The intended behavior is something
> >> >> like,
> >> >>
> >> >>   DRAM -> PMEM -> disk
> >> >
> >> > I'll look into this later -- for now, it's a low priority because
> >> > there isn't much demand. I'll bump it up if anybody is interested in
> >> > giving it a try. Meanwhile, please feel free to cook up something if
> >> > you are interested.
> >>
> >> When we introduce a new feature, we shouldn't break an existing one.
> >> That is, not introducing regression.  I think that it is a rule?
> >>
> >> If my understanding were correct, MGLRU will ignore to scan anonymous
> >> page list even if there's demotion target for the node.  This breaks the
> >> demotion feature in the upstream kernel.  Right?
> >
> > I'm not saying this shouldn't be fixed. I'm saying it's a low priority
> > until somebody is interested in using/testing it (or making it work).
>
> We are interested in this feature and can help to test it.

That's great. I'll make sure it works in the next version.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-02-23 21:18           ` Yu Zhao
@ 2022-02-25 16:34             ` Minchan Kim
  0 siblings, 0 replies; 74+ messages in thread
From: Minchan Kim @ 2022-02-25 16:34 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Johannes Weiner, Andrew Morton, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, Linux ARM, open list:DOCUMENTATION, linux-kernel,
	Linux-MM, Kernel Page Reclaim v2, the arch/x86 maintainers,
	Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh

On Wed, Feb 23, 2022 at 02:18:24PM -0700, Yu Zhao wrote:
> .
> On Mon, Feb 21, 2022 at 1:14 AM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Tue, Feb 15, 2022 at 04:53:56PM -0500, Johannes Weiner wrote:
> > > Hi Yu,
> > >
> > > On Tue, Feb 15, 2022 at 02:43:05AM -0700, Yu Zhao wrote:
> > > > On Thu, Feb 10, 2022 at 03:41:57PM -0500, Johannes Weiner wrote:
> > > > > > +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
> > > > > > +{
> > > > > > +       unsigned long max_seq = lruvec->lrugen.max_seq;
> > > > > > +
> > > > > > +       VM_BUG_ON(gen >= MAX_NR_GENS);
> > > > > > +
> > > > > > +       /* see the comment on MIN_NR_GENS */
> > > > > > +       return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
> > > > > > +}
> > > > >
> > > > > I'm still reading the series, so correct me if I'm wrong: the "active"
> > > > > set is split into two generations for the sole purpose of the
> > > > > second-chance policy for fresh faults, right?
> > > >
> > > > To be precise, the active/inactive notion on top of generations is
> > > > just for ABI compatibility, e.g., the counters in /proc/vmstat.
> > > > Otherwise, this function wouldn't be needed.
> > >
> > > Ah! would you mind adding this as a comment to the function?
> >
> > Will do.
> >
> > > But AFAICS there is the lru_gen_del_folio() callsite that maps it to
> > > the PG_active flag - which in turn gets used by add_folio() to place
> > > the thing back on the max_seq generation. So I suppose there is a
> > > secondary purpose of the function for remembering the page's rough age
> > > for non-reclaim isolation.>
> >
> > Yes, e.g., migration.
> >
> > > It would be good to capture that as well in a comment on the function.
> >
> > Will do.
> >
> > > > > > +static inline void lru_gen_update_size(struct lruvec *lruvec, enum lru_list lru,
> > > > > > +                                      int zone, long delta)
> > > > > > +{
> > > > > > +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > > > > > +
> > > > > > +       lockdep_assert_held(&lruvec->lru_lock);
> > > > > > +       WARN_ON_ONCE(delta != (int)delta);
> > > > > > +
> > > > > > +       __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, delta);
> > > > > > +       __mod_zone_page_state(&pgdat->node_zones[zone], NR_ZONE_LRU_BASE + lru, delta);
> > > > > > +}
> > > > >
> > > > > This is a duplicate of update_lru_size(), please use that instead.
> > > > >
> > > > > Yeah technically you don't need the mem_cgroup_update_lru_size() but
> > > > > that's not worth sweating over, better to keep it simple.
> > > >
> > > > I agree we don't need the mem_cgroup_update_lru_size() -- let me spell
> > > > out why:
> > > >   this function is not needed here because it updates the counters used
> > > >   only by the active/inactive lru code, i.e., get_scan_count().
> > > >
> > > > However, we can't reuse update_lru_size() because MGLRU can trip the
> > > > WARN_ONCE() in mem_cgroup_update_lru_size().
> > > >
> > > > Unlike lru_zone_size[], lrugen->nr_pages[] is eventually consistent.
> > > > To move a page to a different generation, the gen counter in page->flags
> > > > is updated first, which doesn't require the LRU lock. The second step,
> > > > i.e., the update of lrugen->nr_pages[], requires the LRU lock, and it
> > > > usually isn't done immediately due to batching. Meanwhile, if this page
> > > > is, for example, isolated, nr_pages[] becomes temporarily unbalanced.
> > > > And this trips the WARN_ONCE().
> > >
> > > Good insight.
> > >
> > > But in that case, I'd still think it's better to use update_lru_size()
> > > and gate the memcg update on lrugen-enabled, with a short comment
> > > saying that lrugen has its own per-cgroup counts already. It's just a
> > > bit too error prone to duplicate the stat updates.
> > >
> > > Even better would be:
> > >
> > > static __always_inline
> > > void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
> > > {
> > >       enum lru_list lru = folio_lru_list(folio);
> > >
> > >       update_lru_size(lruvec, lru, folio_zonenum(folio),
> > >                       folio_nr_pages(folio));
> > >       if (lrugen_enabled(lruvec))
> > >               lrugen_add_folio(lruvec, folio);
> > >       else
> > >               list_add(&folio->lru, &lruvec->lists[lru]);
> > > }
> > >
> > > But it does mean you'd have to handle unevictable pages. I'm reviewing
> > > from the position that mglru is going to supplant the existing reclaim
> > > algorithm in the long term, though, so being more comprehensive and
> > > eliminating special cases where possible is all-positive, IMO.
> > >
> > > Up to you. I'd only insist on reusing update_lru_size() at least.
> >
> > Will do.
> >
> > > > > > +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > > > > > +{
> > > > > > +       int gen;
> > > > > > +       unsigned long old_flags, new_flags;
> > > > > > +       int type = folio_is_file_lru(folio);
> > > > > > +       int zone = folio_zonenum(folio);
> > > > > > +       struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > > > > > +
> > > > > > +       if (folio_test_unevictable(folio) || !lrugen->enabled)
> > > > > > +               return false;
> > > > >
> > > > > These two checks should be in the callsite and the function should
> > > > > return void. Otherwise you can't understand the callsite without
> > > > > drilling down into lrugen code, even if lrugen is disabled.
> > > >
> > > > I agree it's a bit of nuisance this way. The alternative is we'd need
> > > > ifdef or another helper at the call sites because lrugen->enabled is
> > > > specific to lrugen.
> > >
> > > Coming from memcg, my experience has been that when you have a compile
> > > time-optional MM extension like this, you'll sooner or later need a
> > > config-independent helper to gate callbacks in generic code. So I
> > > think it's a good idea to add one now.
> > >
> > > One of these?
> > >
> > > lruvec_on_lrugen()
> >
> > SGTM.
> >
> > Personally I'd reuse lru_gen_enabled(), by passing NULL/lruvec. But
> > my guess is you wouldn't like it.
> >
> > > lruvec_using_lrugen()
> > > lruvec_lrugen_enabled()
> > >
> > > lruvec_has_generations() :-)
> > >
> > > > > On that note, I think #1 is reintroducing a problem we have fixed
> > > > > before, which is trashing the workingset with a flood of use-once
> > > > > mmapped pages. It's the classic scenario where LFU beats LRU.
> > > > >
> > > > > Mapped streaming IO isn't very common, but it does happen. See these
> > > > > commits:
> > > > >
> > > > > dfc8d636cdb95f7b792d5ba8c9f3b295809c125d
> > > > > 31c0569c3b0b6cc8a867ac6665ca081553f7984c
> > > > > 645747462435d84c6c6a64269ed49cc3015f753d
> > > > >
> > > > > From the changelog:
> > > > >
> > > > >     The used-once mapped file page detection patchset.
> > > > >
> > > > >     It is meant to help workloads with large amounts of shortly used file
> > > > >     mappings, like rtorrent hashing a file or git when dealing with loose
> > > > >     objects (git gc on a bigger site?).
> > > > >
> > > > >     Right now, the VM activates referenced mapped file pages on first
> > > > >     encounter on the inactive list and it takes a full memory cycle to
> > > > >     reclaim them again.  When those pages dominate memory, the system
> > > > >     no longer has a meaningful notion of 'working set' and is required
> > > > >     to give up the active list to make reclaim progress.  Obviously,
> > > > >     this results in rather bad scanning latencies and the wrong pages
> > > > >     being reclaimed.
> > > > >
> > > > >     This patch makes the VM be more careful about activating mapped file
> > > > >     pages in the first place.  The minimum granted lifetime without
> > > > >     another memory access becomes an inactive list cycle instead of the
> > > > >     full memory cycle, which is more natural given the mentioned loads.
> > > > >
> > > > > Translating this to multigen, it seems fresh faults should really
> > > > > start on the second oldest rather than on the youngest generation, to
> > > > > get a second chance but without jeopardizing the workingset if they
> > > > > don't take it.
> > > >
> > > > This is a good point, and I had worked on a similar idea but failed
> > > > to measure its benefits. In addition to placing mmapped file pages in
> > > > older generations, I also tried placing refaulted anon pages in older
> > > > generations. My conclusion was that the initial LRU positions of NFU
> > > > pages are not a bottleneck for workloads I've tested. The efficiency
> > > > of testing/clearing the accessed bit is.
> > >
> > > The concern isn't the scan overhead, but jankiness from the workingset
> > > being flooded out by streaming IO.
> >
> > Yes, MGLRU uses a different approach to solve this problem, and for
> > its approach, the scan overhead is the concern.
> >
> > MGLRU detects (defines) the working set by scanning the entire memory
> > for each generation, and it counters the flooding by accelerating the
> > creation of generations. IOW, all mapped pages have an equal chance to
> > get scanned, no matter which generation they are in. This is a design
> > difference compared with the active/inactive LRU, which tries to scans
> > the active/inactive lists less/more frequently.
> >
> > > The concrete usecase at the time was a torrent client hashing a
> > > downloaded file and thereby kicking out the desktop environment, which
> > > caused jankiness. The hashing didn't benefit from caching - the file
> > > wouldn't have fit into RAM anyway - so this was pointless to boot.
> > >
> > > Essentially, the tradeoff is this:
> > >
> > > 1) If you treat new pages as hot, you accelerate workingset
> > > transitions, but on the flipside you risk unnecessary refaults in
> > > running applications when those new pages are one-off.
> > >
> > > 2) If you take new pages with a grain of salt, you protect existing
> > > applications better from one-off floods, but risk refaults in NEW
> > > application while they're trying to start up.
> >
> > Agreed.
> >
> > > There are two arguments for why 2) is preferable:
> > >
> > > 1) Users are tolerant of cache misses when applications first launch,
> > >    much less so after they've been running for hours.
> >
> > Our CUJs (Critical User Journeys) respectfully disagree :)
> >
> > They are built on the observation that once users have moved onto
> > another tab/app, they are more likely to stay with the new tab/app
> > rather than go back to the old ones. Speaking for myself, this is
> > generally the case.
> >
> > > 2) Workingset transitions (and associated jankiness) are bounded by
> > >    the amount of RAM you need to repopulate. But streaming IO is
> > >    bounded by storage, and datasets are routinely several times the
> > >    amount of RAM. Uncacheable sets in excess of RAM can produce an
> > >    infinite stream of "new" references; not protecting the workingset
> > >    from that means longer or even sustained jankiness.
> >
> > I'd argue the opposite -- we shouldn't risk refaulting fresh hot pages
> > just to accommodate this concrete yet minor use case, especially
> > considering torrent has been given the means (MADV_SEQUENTIAL) to help
> > itself.
> >
> > I appreciate all your points here. The bottom line is we agree this is
> > a trade off. For what disagree about, we could be both right -- it
> > comes down to what workloads we care about *more*.
> >
> > To move forward, I propose we look at it from a non-technical POV:
> > would we want to offer users an alternative trade off so that they can
> > have greater flexibility?
> >
> > > > And some applications are smart enough to leverage MADV_SEQUENTIAL.
> > > > In this case, MGLRU does place mmapped file pages in the oldest
> > > > generation.
> > >
> > > Yes, it makes sense to optimize when MADV_SEQUENTIAL is requested. But
> > > that hint isn't reliably there, so it matters that we don't do poorly
> > > when it's missing.
> >
> > Agreed.
> >
> > > > I have an oversimplified script that uses memcached to mimic a
> > > > non-streaming workload and fio a (mmapped) streaming workload:
> > >
> > > Looking at the paramters and observed behavior, let me say up front
> > > that this looks like a useful benchmark, but doesn't capture the
> > > scenario I was talking about above.
> > >
> > > For one, the presence of swapping in both kernels suggests that the
> > > "streaming IO" component actually has repeat access that could benefit
> > > from caching. Second, I would expect memcache is accessing its memory
> > > frequently and consistently, and so could withstand workingset
> > > challenges from streaming IO better than, say, a desktop environment.
> >
> > The fio workload is a real streaming workload, but the memcached
> > workload might have been too large to be a typical desktop workload.
> >
> > More below.
> >
> > > More on that below.
> > >
> > > >   1. With MADV_SEQUENTIAL, the non-streaming workload is about 5 times
> > > >      faster when using MGLRU. Somehow the baseline (rc3) swapped a lot.
> > > >      (It shouldn't, and I haven't figured out why.)
> > >
> > > Baseline swaps when there are cache refaults. This is regardless of
> > > the hint: you may say you're accessing these pages sequentially, but
> > > the refaults say you're reusing them, with a frequency that suggests
> > > they might be cacheable. So it tries to cache them.
> > >
> > > I'd be curious if that results in fio being faster, or whether it's
> > > all just pointless thrashing. Can you share the fio results too?
> >
> > More below.
> >
> > > We could patch baseline to prioritize MADV_SEQUENTIAL more, but...
> > >
> > > >   2. Without MADV_SEQUENTIAL, the non-streaming workload is about 1
> > > >      time faster when using MGLRU. Both MGLRU and the baseline swapped
> > > >      a lot.
> > >
> > > ...in practice I think this scenario will matter to a lot more users.
> >
> > I strongly feel we should prioritize what's advertised on a man page
> > over an unspecified (performance) behavior.
> >
> > > I would again be interested in the fio results.
> > >
> > > >            MADV_SEQUENTIAL    non-streaming ops/sec (memcached)
> > > >   rc3      yes                 292k
> > > >   rc3      no                  203k
> > > >   rc3+v7   yes                1967k
> > > >   rc3+v7   no                  436k
> 
> Appending a few notes on the baseline results:
> 1. Apparently FADV_DONTNEED rejects mmaped pages -- I found no reasons
> from the man page or the original commit why it should. I propose we
> remove the page_mapped() check in lru_deactivate_file_fn(). Adding
> Minchan to see how he thinks about this.

Hi Yu,

It's quite old code and I have already forgotten all the details. Maybe
I wanted to minimize behavior changes, since invalidate_inode_page
already filters mapped pages out.

I don't have any strong reason not to move mapped pages for deactivation,
but if we do, it would be better to move them to the head of the inactive
list instead of the tail, to give them a promotion chance since other
processes are still mapping the page in their address spaces.
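
For reference, a rough, untested sketch of that direction against this
era's mm/swap.c, with the writeback flagging and the vmstat/cost
accounting of the real lru_deactivate_file_fn() elided:

static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
{
        if (PageUnevictable(page))
                return;

        del_page_from_lru_list(page, lruvec);
        ClearPageActive(page);
        ClearPageReferenced(page);

        if (page_mapped(page) || PageWriteback(page) || PageDirty(page))
                /* head of the inactive list: keeps a promotion chance */
                add_page_to_lru_list(page, lruvec);
        else
                /* clean and unmapped: tail, reclaimed first */
                add_page_to_lru_list_tail(page, lruvec);
}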

Thanks.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 00/12] Multigenerational LRU Framework
  2022-02-08  8:18 [PATCH v7 00/12] Multigenerational LRU Framework Yu Zhao
                   ` (13 preceding siblings ...)
  2022-02-11 20:12 ` Alexey Avramov
@ 2022-03-03  6:06 ` Vaibhav Jain
  2022-03-03  6:47   ` Yu Zhao
  14 siblings, 1 reply; 74+ messages in thread
From: Vaibhav Jain @ 2022-03-03  6:06 UTC (permalink / raw)
  To: Yu Zhao, Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko
  Cc: Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Yu Zhao


In a synthetic MongoDB benchmark (YCSB), I am seeing an average of ~19%
throughput improvement on POWER10 (Radix MMU + 64K page size) with the
MGLRU patches on top of the v5.16 kernel, for the MongoDB + YCSB bench
across three different request distributions, namely Exponential,
Uniform and Zipfian.

Test-Results
============

Average YCSB reported throughput (95% Confidence Interval):
|---------------------+---------------------+---------------------+---------------------|
| Kernel-Type         | Exponential         | Uniform             | Zipfian             |
|---------------------+---------------------+---------------------+---------------------|
| Base Kernel (v5.16) | 27324.701 ± 759.652 | 20671.590 ± 412.974 | 37713.761 ± 621.213 |
| v5.16 + MGLRU       | 32702.231 ± 287.957 | 24916.239 ± 217.977 | 44308.839 ± 701.829 |
|---------------------+---------------------+---------------------+---------------------|
| Speedup             | 19.68% ± 4.03%      | 20.11% ± 2.95%      | 17.49% ± 2.82%      |
|---------------------+---------------------+---------------------+---------------------|

n = 11 Samples x 3 (Distributions) x 2 (Kernels) = 66 Observations

Test Environment
================
Cpu: POWER10 (architected), altivec supported
platform: pSeries
CPUs: 32
MMU: Radix
Page-Size: 64K
Total-Memory: 64G

Distro
-------
# cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.4 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.4"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.4 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.4:GA"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.4
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.4"

System-config
-------------
# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

# cat /proc/swaps 
Filename                                Type            Size            Used            Priority
/dev/dm-5                               partition       10485696        940864          -2

# cat /proc/sys/vm/overcommit_memory
0

#cat /proc/cmdline
<existing parameters> systemd.unified_cgroup_hierarchy=1 transparent_hugepage=never

MongoDB data partition
----------------------
lsblk /dev/sdb
NAME MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sdb    8:16   0  128G  0 disk <home>/data/mongodb

mount | grep /dev/sdb
/dev/sdb on /root/vajain21/mglru/data/mongodb type ext4 (rw,relatime)

Testing Artifacts
==================

MongoDB-configuration
---------------------
MongoDB Community Server built from https://github.com/mongodb/mongo release v5.0.6

# mongod --version
db version v5.0.6
Build Info: {
      "version": "5.0.6",
      "gitVersion": "212a8dbb47f07427dae194a9c75baec1d81d9259",
      "openSSLVersion": "OpenSSL 1.1.1g FIPS  21 Apr 2020",
      "modules": [],
      "allocator": "tcmalloc",
      "environment": {
      "distarch": "ppc64le",
      "target_arch": "ppc64le"
      }
}

# cat /etc/mongod.conf 
storage:
  dbPath: <home-path>/data/mongodb
  journal:
    enabled: true
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      cacheSizeGB: 50
net:
  bindIp: 127.0.0.1
  unixDomainSocket:
    enabled: true
    pathPrefix: /run/mongodb
setParameter:
  enableLocalhostAuthBypass: true

YCSB (https://github.com/vaibhav92/YCSB/tree/mongodb-domain-sockets)
--------------------------------------------------------------------

YCSB forked from https://github.com/brianfrankcooper/YCSB.git. This fixes a
problem with YCSB when trying to connect to MongoDB on a unix domain socket. PR
raised to the project at https://github.com/brianfrankcooper/YCSB/pull/1587

Head Commit: fb2555a77005ae70c26e4adc46c945caf4daa2f9 ("[core] Generate
classpath from all dependencies rather than just compile scoped")

Kernel-Config
-------------

Base-Kernel: https://github.com/torvalds/linux/ v5.16
Base-Kernel-Config:
https://github.com/vaibhav92/mglru-benchmark/blob/auto_build/config-non-mglru

Test-Kernel: https://linux-mm.googlesource.com/page-reclaim refs/changes/49/1549/1
Test-Kernel-Config:
https://github.com/vaibhav92/mglru-benchmark/blob/auto_build/config-mglru

CONFIG_LRU_GEN=y
CONFIG_LRU_GEN_ENABLED=y
CONFIG_NR_LRU_GENS=4
CONFIG_TIERS_PER_GEN=4

YCSB:
recordcount=80000000
operationcount=80000000
readproportion=0.8
updateproportion=0.2
workload=site.ycsb.workloads.CoreWorkload
threads=64
requestdistributions={uniform, exponential, zipfian}

Test-Bench
===========
Source: https://github.com/vaibhav92/mglru-benchmark/tree/auto_build

Invoked via the following command, which will *destroy* the contents of
/dev/sdd and use it as the data disk for MongoDB:

$ export MONGODB_DISK=/dev/sdd; curl \
    https://raw.githubusercontent.com/vaibhav92/mglru-benchmark/auto_build/build.sh \
    | sudo bash -s

Test-Methodology
================

Setup
-----
1. Pull & build the testing artifacts: the v5.16 Base Kernel, the MGLRU Kernel,
MongoDB, YCSB & QEMU (for the qemu-img tools).
2. Format and mount provided MongoDB Data disk with ext4.
3. Generate Systemd service/slice files for MongoDB and place them into /etc/systemd/system/
4. Generate the MongoDB configuration pointing to the data disk mount.
5. Start the built MongoDB instance.
6. Ensure that MongoDB is running.

Load Test Data
---------------
1. Ensure that MongoDB instance is stopped.
2. Unmount the data disk and reformat it with ext4.
3. Restart MongoDB.
4. Spin off YCSB to load data into the Mongo instance.
5. Stop MongoDB + Unmount data Disk
6. Create a qcow2 image of the data disk and store it with test data.
7. Kexec into base kernel.

Test Phase (Happens at each boot)
---------------------------------
1. Select the distribution to be used for YCSB from
{"Uniform", "Exponential", "Zipfian"}
2. Restore the MongoDB qcow2 data disk Image to the disk
3. Mount the data disk and restart MongoDB daemon.
4. Start YCSB to generate the workload on MongoDB.
5. Once finished collect results.
6. Kexec into the next kernel, switching between the Base-Kernel & the
MGLRU-Kernel once all three distributions have been tested.

The Setup and Load Test Data stages can be accomplished with the following command:
# export MONGODB_DISK=/dev/sdd; \
    curl https://raw.githubusercontent.com/vaibhav92/mglru-benchmark/auto_build/build.sh | bash -s

Once completed successfully, it will kexec into the base kernel and start the
Test phase on boot via a systemd service named 'mglru-benchmark'.

Based on the above results,
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>

Yu Zhao <yuzhao@google.com> writes:

> What's new
> ==========
> 1) Addressed all the comments received on the mailing list and in the
>    meeting with the stakeholders (will note on individual patches).
> 2) Measured the performance improvements for each patch between 5-8
>    (reported in the commit messages).
>
> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and it
> often makes poor choices about what to evict. This patchset offers an
> alternative solution that is performant, versatile and straightforward.
>
> Patchset overview
> =================
> The design and implementation overview was moved to patch 12 so that
> people can finish reading this cover letter.
>
> 1. mm: x86, arm64: add arch_has_hw_pte_young()
> 2. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
> Using hardware optimizations when trying to clear the accessed bit in
> many PTEs.
>
> 3. mm/vmscan.c: refactor shrink_node()
> A minor refactor.
>
> 4. mm: multigenerational LRU: groundwork
> Adding the basic data structure and the functions that insert/remove
> pages to/from the multigenerational LRU (MGLRU) lists.
>
> 5. mm: multigenerational LRU: minimal implementation
> A minimal (functional) implementation without any optimizations.
>
> 6. mm: multigenerational LRU: exploit locality in rmap
> Improving the efficiency when using the rmap.
>
> 7. mm: multigenerational LRU: support page table walks
> Adding the (optional) page table scanning.
>
> 8. mm: multigenerational LRU: optimize multiple memcgs
> Optimizing the overall performance for multiple memcgs running mixed
> types of workloads.
>
> 9. mm: multigenerational LRU: runtime switch
> Adding a runtime switch to enable or disable MGLRU.
>
> 10. mm: multigenerational LRU: thrashing prevention
> 11. mm: multigenerational LRU: debugfs interface
> Providing userspace with additional features like thrashing prevention,
> working set estimation and proactive reclaim.
>
> 12. mm: multigenerational LRU: documentation
> Adding a design doc and an admin guide.
>
> Benchmark results
> =================
> Independent lab results
> -----------------------
> Based on the popularity of searches [01] and the memory usage in
> Google's public cloud, the most popular open-source memory-hungry
> applications, in alphabetical order, are:
>       Apache Cassandra      Memcached
>       Apache Hadoop         MongoDB
>       Apache Spark          PostgreSQL
>       MariaDB (MySQL)       Redis
>
> An independent lab evaluated MGLRU with the most widely used benchmark
> suites for the above applications. They posted 960 data points along
> with kernel metrics and perf profiles collected over more than 500
> hours of total benchmark time. Their final reports show that, with 95%
> confidence intervals (CIs), the above applications all performed
> significantly better for at least part of their benchmark matrices.
>
> On 5.14:
> 1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
>    less wall time to sort three billion random integers, respectively,
>    under the medium- and the high-concurrency conditions, when
>    overcommitting memory. There were no statistically significant
>    changes in wall time for the rest of the benchmark matrix.
> 2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
>    more transactions per minute (TPM), respectively, under the medium-
>    and the high-concurrency conditions, when overcommitting memory.
>    There were no statistically significant changes in TPM for the rest
>    of the benchmark matrix.
> 3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
>    and [21.59, 30.02]% more operations per second (OPS), respectively,
>    for sequential access, random access and Gaussian (distribution)
>    access, when THP=always; 95% CIs [13.85, 15.97]% and
>    [23.94, 29.92]% more OPS, respectively, for random access and
>    Gaussian access, when THP=never. There were no statistically
>    significant changes in OPS for the rest of the benchmark matrix.
> 4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
>    [2.16, 3.55]% more operations per second (OPS), respectively, for
>    exponential (distribution) access, random access and Zipfian
>    (distribution) access, when underutilizing memory; 95% CIs
>    [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
>    respectively, for exponential access, random access and Zipfian
>    access, when overcommitting memory.
>
> On 5.15:
> 5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
>    and [4.11, 7.50]% more operations per second (OPS), respectively,
>    for exponential (distribution) access, random access and Zipfian
>    (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
>    [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
>    exponential access, random access and Zipfian access, when swap was
>    on.
> 6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
>    less average wall time to finish twelve parallel TeraSort jobs,
>    respectively, under the medium- and the high-concurrency
>    conditions, when swap was on. There were no statistically
>    significant changes in average wall time for the rest of the
>    benchmark matrix.
> 7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
>    minute (TPM) under the high-concurrency condition, when swap was
>    off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
>    respectively, under the medium- and the high-concurrency
>    conditions, when swap was on. There were no statistically
>    significant changes in TPM for the rest of the benchmark matrix.
> 8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
>    [11.47, 19.36]% more total operations per second (OPS),
>    respectively, for sequential access, random access and Gaussian
>    (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
>    [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
>    for sequential access, random access and Gaussian access, when
>    THP=never.
>
> Our lab results
> ---------------
> To supplement the above results, we ran the following benchmark suites
> on 5.16-rc7 and found no regressions [10]. (These synthetic benchmarks
> are popular among MM developers, but we prefer large-scale A/B
> experiments to validate improvements.)
>       fs_fio_bench_hdd_mq      pft
>       fs_lmbench               pgsql-hammerdb
>       fs_parallelio            redis
>       fs_postmark              stream
>       hackbench                sysbenchthread
>       kernbench                tpcc_spark
>       memcached                unixbench
>       multichase               vm-scalability
>       mutilate                 will-it-scale
>       nginx
>
> [01] https://trends.google.com
> [02] https://lore.kernel.org/lkml/20211102002002.92051-1-bot@edi.works/
> [03] https://lore.kernel.org/lkml/20211009054315.47073-1-bot@edi.works/
> [04] https://lore.kernel.org/lkml/20211021194103.65648-1-bot@edi.works/
> [05] https://lore.kernel.org/lkml/20211109021346.50266-1-bot@edi.works/
> [06] https://lore.kernel.org/lkml/20211202062806.80365-1-bot@edi.works/
> [07] https://lore.kernel.org/lkml/20211209072416.33606-1-bot@edi.works/
> [08] https://lore.kernel.org/lkml/20211218071041.24077-1-bot@edi.works/
> [09] https://lore.kernel.org/lkml/20211122053248.57311-1-bot@edi.works/
> [10] https://lore.kernel.org/lkml/20220104202247.2903702-1-yuzhao@google.com/
>
> Real-world applications
> =======================
> Third-party testimonials
> ------------------------
> Konstantin wrote [11]:
>    I have Archlinux with 8G RAM + zswap + swap. While developing, I
>    have lots of apps opened such as multiple LSP-servers for different
>    langs, chats, two browsers, etc... Usually, my system gets quickly
>    to a point of SWAP-storms, where I have to kill LSP-servers,
>    restart browsers to free memory, etc, otherwise the system lags
>    heavily and is barely usable.
>    
>    1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
>    patchset, and I started up by opening lots of apps to create memory
>    pressure, and worked for a day like this. Till now I had *not a
>    single SWAP-storm*, and mind you I got 3.4G in SWAP. I was never
>    getting to the point of 3G in SWAP before without a single
>    SWAP-storm.
>
> An anonymous user wrote [12]:
>    Using that v5 for some time and confirm that difference under heavy
>    load and memory pressure is significant.
>
> Shuang wrote [13]:
>    With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
>    and [9.26, 10.36]% higher throughput, respectively, for random
>    access, Zipfian (distribution) access and Gaussian (distribution)
>    access, when the average number of jobs per CPU is 1; 95% CIs
>    [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher throughput,
>    respectively, for random access, Zipfian access and Gaussian access,
>    when the average number of jobs per CPU is 2.
>
> Daniel wrote [14]:
>    With memcached allocating ~100GB of byte-addressable Optane,
>    performance improvement in terms of throughput (measured as queries
>    per second) was about 10% for a series of workloads.
>
> Large-scale deployments
> -----------------------
> The downstream kernels that have been using MGLRU include:
> 1. Android ARCVM [15]
> 2. Arch Linux Zen [16]
> 3. Chrome OS [17]
> 4. Liquorix [18]
> 5. post-factum [19]
> 6. XanMod [20]
>
> We've rolled out MGLRU to tens of millions of Chrome OS users and
> about a million Android users. Google's fleetwide profiling [21] shows
> an overall 40% decrease in kswapd CPU usage, in addition to
> improvements in other UX metrics, e.g., an 85% decrease in the number
> of low-memory kills at the 75th percentile and an 18% decrease in
> rendering latency at the 50th percentile.
>
> [11] https://lore.kernel.org/lkml/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
> [12] https://phoronix.com/forums/forum/software/general-linux-open-source/1301258-mglru-is-a-very-enticing-enhancement-for-linux-in-2022?p=1301275#post1301275
> [13] https://lore.kernel.org/lkml/20220105024423.26409-1-szhai2@cs.rochester.edu/
> [14] https://lore.kernel.org/linux-mm/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
> [15] https://chromium.googlesource.com/chromiumos/third_party/kernel
> [16] https://archlinux.org
> [17] https://chromium.org
> [18] https://liquorix.net
> [19] https://gitlab.com/post-factum/pf-kernel
> [20] https://xanmod.org
> [21] https://research.google/pubs/pub44271/
>
> Summary
> =======
> The facts are:
> 1. The independent lab results and the real-world applications
>    indicate substantial improvements; there are no known regressions.
> 2. Thrashing prevention, working set estimation and proactive reclaim
>    work out of the box; there are no equivalent solutions.
> 3. There is a lot of new code; nobody has demonstrated smaller changes
>    with similar effects.
>
> Our conclusions, accordingly, are:
> 1. Given the amount of evidence, the reported improvements will likely
>    materialize for a wide range of workloads.
> 2. Gauging the interest from the past discussions [22][23][24], the
>    new features will likely be put to use for both personal computers
>    and data centers.
> 3. Based on Google's track record, the new code will likely be well
>    maintained in the long term. It'd be more difficult if not
>    impossible to achieve similar effects on top of the existing
>    design.
>
> [22] https://lore.kernel.org/lkml/20201005081313.732745-1-andrea.righi@canonical.com/
> [23] https://lore.kernel.org/lkml/20210716081449.22187-1-sj38.park@gmail.com/
> [24] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/
>
> Yu Zhao (12):
>   mm: x86, arm64: add arch_has_hw_pte_young()
>   mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
>   mm/vmscan.c: refactor shrink_node()
>   mm: multigenerational LRU: groundwork
>   mm: multigenerational LRU: minimal implementation
>   mm: multigenerational LRU: exploit locality in rmap
>   mm: multigenerational LRU: support page table walks
>   mm: multigenerational LRU: optimize multiple memcgs
>   mm: multigenerational LRU: runtime switch
>   mm: multigenerational LRU: thrashing prevention
>   mm: multigenerational LRU: debugfs interface
>   mm: multigenerational LRU: documentation
>
>  Documentation/admin-guide/mm/index.rst        |    1 +
>  Documentation/admin-guide/mm/multigen_lru.rst |  121 +
>  Documentation/vm/index.rst                    |    1 +
>  Documentation/vm/multigen_lru.rst             |  152 +
>  arch/Kconfig                                  |    9 +
>  arch/arm64/include/asm/pgtable.h              |   14 +-
>  arch/x86/Kconfig                              |    1 +
>  arch/x86/include/asm/pgtable.h                |    9 +-
>  arch/x86/mm/pgtable.c                         |    5 +-
>  fs/exec.c                                     |    2 +
>  fs/fuse/dev.c                                 |    3 +-
>  include/linux/cgroup.h                        |   15 +-
>  include/linux/memcontrol.h                    |   36 +
>  include/linux/mm.h                            |    8 +
>  include/linux/mm_inline.h                     |  214 ++
>  include/linux/mm_types.h                      |   78 +
>  include/linux/mmzone.h                        |  182 ++
>  include/linux/nodemask.h                      |    1 +
>  include/linux/page-flags-layout.h             |   19 +-
>  include/linux/page-flags.h                    |    4 +-
>  include/linux/pgtable.h                       |   17 +-
>  include/linux/sched.h                         |    4 +
>  include/linux/swap.h                          |    5 +
>  kernel/bounds.c                               |    3 +
>  kernel/cgroup/cgroup-internal.h               |    1 -
>  kernel/exit.c                                 |    1 +
>  kernel/fork.c                                 |    9 +
>  kernel/sched/core.c                           |    1 +
>  mm/Kconfig                                    |   50 +
>  mm/huge_memory.c                              |    3 +-
>  mm/memcontrol.c                               |   27 +
>  mm/memory.c                                   |   39 +-
>  mm/mm_init.c                                  |    6 +-
>  mm/page_alloc.c                               |    1 +
>  mm/rmap.c                                     |    7 +
>  mm/swap.c                                     |   55 +-
>  mm/vmscan.c                                   | 2831 ++++++++++++++++-
>  mm/workingset.c                               |  119 +-
>  38 files changed, 3908 insertions(+), 146 deletions(-)
>  create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
>  create mode 100644 Documentation/vm/multigen_lru.rst
>
> -- 
> 2.35.0.263.gb82422642f-goog
>
>

-- 
Cheers
~ Vaibhav


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 00/12] Multigenerational LRU Framework
  2022-03-03  6:06 ` Vaibhav Jain
@ 2022-03-03  6:47   ` Yu Zhao
  0 siblings, 0 replies; 74+ messages in thread
From: Yu Zhao @ 2022-03-03  6:47 UTC (permalink / raw)
  To: Vaibhav Jain
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Barry Song, Catalin Marinas,
	Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
	Jonathan Corbet, Linus Torvalds, Matthew Wilcox, Michael Larabel,
	Mike Rapoport, Rik van Riel, Vlastimil Babka, Will Deacon,
	Ying Huang, linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86

On Thu, Mar 03, 2022 at 11:36:51AM +0530, Vaibhav Jain wrote:
> 
> In a synthetic MongoDB benchmark (YCSB), I am seeing an average of ~19%
> throughput improvement on POWER10 (Radix MMU + 64K page size) with the
> MGLRU patches on top of the v5.16 kernel, across three different request
> distributions, namely Exponential, Uniform and Zipfian.

Thanks, Vaibhav. I'll post the next version in a few days and include
your tested-by tag.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-02-21  8:14         ` Yu Zhao
  2022-02-23 21:18           ` Yu Zhao
@ 2022-03-03 15:29           ` Johannes Weiner
  2022-03-03 19:26             ` Yu Zhao
  1 sibling, 1 reply; 74+ messages in thread
From: Johannes Weiner @ 2022-03-03 15:29 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Mel Gorman, Michal Hocko, Andi Kleen,
	Aneesh Kumar, Barry Song, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
	Linus Torvalds, Matthew Wilcox, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	linux-arm-kernel, linux-doc, linux-kernel, linux-mm,
	page-reclaim, x86, Brian Geffon, Jan Alexander Steffens,
	Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal,
	Daniel Byrne, Donald Carr, Holger Hoffstätte,
	Konstantin Kharlamov, Shuang Zhai, Sofia Trinh

Hi Yu,

On Mon, Feb 21, 2022 at 01:14:24AM -0700, Yu Zhao wrote:
> On Tue, Feb 15, 2022 at 04:53:56PM -0500, Johannes Weiner wrote:
> > On Tue, Feb 15, 2022 at 02:43:05AM -0700, Yu Zhao wrote:
> > > On Thu, Feb 10, 2022 at 03:41:57PM -0500, Johannes Weiner wrote:
> > > > > +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
> > > > > +{
> > > > > +	unsigned long max_seq = lruvec->lrugen.max_seq;
> > > > > +
> > > > > +	VM_BUG_ON(gen >= MAX_NR_GENS);
> > > > > +
> > > > > +	/* see the comment on MIN_NR_GENS */
> > > > > +	return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
> > > > > +}
> > > > 
> > > > I'm still reading the series, so correct me if I'm wrong: the "active"
> > > > set is split into two generations for the sole purpose of the
> > > > second-chance policy for fresh faults, right?
> > > 
> > > To be precise, the active/inactive notion on top of generations is
> > > just for ABI compatibility, e.g., the counters in /proc/vmstat.
> > > Otherwise, this function wouldn't be needed.
> > 
> > Ah! would you mind adding this as a comment to the function?
> 
> Will do.
> 
> > But AFAICS there is the lru_gen_del_folio() callsite that maps it to
> > the PG_active flag - which in turn gets used by add_folio() to place
> > the thing back on the max_seq generation. So I suppose there is a
> > secondary purpose of the function for remembering the page's rough age
> > for non-reclaim isolation.
> 
> Yes, e.g., migration.

Ok, thanks for clarifying. That should also be in the comment.

On scan resistance:

> > The concern isn't the scan overhead, but jankiness from the workingset
> > being flooded out by streaming IO.
> 
> Yes, MGLRU uses a different approach to solve this problem, and for
> its approach, the scan overhead is the concern.
> 
> MGLRU detects (defines) the working set by scanning the entire memory
> for each generation, and it counters the flooding by accelerating the
> creation of generations. IOW, all mapped pages have an equal chance to
> get scanned, no matter which generation they are in. This is a design
> > difference compared with the active/inactive LRU, which tries to scan
> the active/inactive lists less/more frequently.
>
> > The concrete usecase at the time was a torrent client hashing a
> > downloaded file and thereby kicking out the desktop environment, which
> > caused jankiness. The hashing didn't benefit from caching - the file
> > wouldn't have fit into RAM anyway - so this was pointless to boot.
> > 
> > Essentially, the tradeoff is this:
> > 
> > 1) If you treat new pages as hot, you accelerate workingset
> > transitions, but on the flipside you risk unnecessary refaults in
> > running applications when those new pages are one-off.
> > 
> > 2) If you take new pages with a grain of salt, you protect existing
> > applications better from one-off floods, but risk refaults in NEW
> > application while they're trying to start up.
> 
> Agreed.
> 
> > There are two arguments for why 2) is preferable:
> > 
> > 1) Users are tolerant of cache misses when applications first launch,
> >    much less so after they've been running for hours.
> 
> Our CUJs (Critical User Journeys) respectfully disagree :)
> 
> They are built on the observation that once users have moved onto
> another tab/app, they are more likely to stay with the new tab/app
> rather than go back to the old ones. Speaking for myself, this is
> generally the case.

That's in line with what I said. Where is the disagreement?

> > 2) Workingset transitions (and associated jankiness) are bounded by
> >    the amount of RAM you need to repopulate. But streaming IO is
> >    bounded by storage, and datasets are routinely several times the
> >    amount of RAM. Uncacheable sets in excess of RAM can produce an
> >    infinite stream of "new" references; not protecting the workingset
> >    from that means longer or even sustained jankiness.
> 
> I'd argue the opposite -- we shouldn't risk refaulting fresh hot pages
> just to accommodate this concrete yet minor use case, especially
> considering torrent has been given the means (MADV_SEQUENTIAL) to help
> itself.
>
> I appreciate all your points here. The bottom line is we agree this is
> a tradeoff. For what we disagree about, we could both be right -- it
> comes down to what workloads we care about *more*.

It's a straight-forward question: How does MGLRU avoid cache pollution
from scans?

Your answer above seems to be "it just does". Your answer here seems
to be "it doesn't, but it doesn't matter". Forgive me if I'm
misreading what you're saying.

But it's not a minor concern. Read the motivation behind any modern
cache algorithm - ARC, LIRS, Clock-Pro, LRU-K, 2Q - and scan
resistance is the reason why they all exist in the first place.


    "The LRU-K algorithm surpasses conventional buffering algorithms
     in discriminating between frequently and infrequently referenced
     pages."

        - The LRU-K page replacement algorithm for database disk
          buffering, O'Neil et al, 1993

    "Although LRU replacement policy has been commonly used in the
     buffer cache management, it is well known for its inability to
     cope with access patterns with weak locality."

        - LIRS: an efficient low inter-reference recency set
         replacement policy to improve buffer cache performance,
         Jiang, Zhang, 2002

    "The self-tuning, low-overhead, scan-resistant adaptive
     replacement cache algorithm outperforms the least-recently-used
     algorithm by dynamically responding to changing access patterns
     and continually balancing between workload recency and frequency
     features."

        - Outperforming LRU with an adaptive replacement cache
          algorithm, Megiddo, Modha, 2004

    "Over the last three decades, the inability of LRU as well as
     CLOCK to handle weak locality accesses has become increasingly
     serious, and an effective fix becomes increasingly desirable."

        - CLOCK-Pro: An Effective Improvement of the CLOCK
          Replacement, Jiang et al, 2005


We can't rely on MADV_SEQUENTIAL alone. Not all accesses know in
advance that they'll be one-off; it can be a group of uncoordinated
tasks causing the pattern etc.

This is a pretty fundamental issue. It would be good to get a more
satisfying answer on this.

> > > > You can drop the memcg parameter and use lruvec_memcg().
> > > 
> > > lruvec_memcg() isn't available yet when pgdat_init_internals() calls
> > > this function because mem_cgroup_disabled() is initialized afterward.
> > 
> > Good catch. That'll container_of() into garbage. However, we have to
> > assume that somebody's going to try that simplification again, so we
> > should set up the code now to prevent issues.
> > 
> > cgroup_disable parsing is self-contained, so we can pull it ahead in
> > the init sequence. How about this?
> > 
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index 9d05c3ca2d5e..b544d768edc8 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -6464,9 +6464,9 @@ static int __init cgroup_disable(char *str)
> >  			break;
> >  		}
> >  	}
> > -	return 1;
> > +	return 0;
> >  }
> > -__setup("cgroup_disable=", cgroup_disable);
> > +early_param("cgroup_disable", cgroup_disable);
> 
> I think early_param() is still after pgdat_init_internals(), no?

It's called twice for some reason, but AFAICS the first one is always
called before pgdat_init_internals():

start_kernel()
  setup_arch()
    parse_early_param()
    x86_init.paging.pagetable_init();
      paging_init()
        zone_sizes_init()
          free_area_init()
            free_area_init_node()
              free_area_init_core()
                pgdat_init_internals()
  parse_early_param()

It's the same/similar for arm, sparc and mips.
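
For completeness, since the diff also flips the return value: __setup()
handlers return 1 when they have consumed an option, while early_param()
handlers return 0 on success (a non-zero return makes the early parsing
code warn about a malformed option). So the end result would look roughly
like the sketch below -- this is an illustration, not the actual patch,
and the parsing loop stays exactly as it is in cgroup_disable() today:

  /* sketch only, not the actual patch; the loop body is unchanged */
  static int __init cgroup_disable(char *str)
  {
          /* ... existing loop that disables the listed controllers ... */
          return 0;       /* early_param() convention: 0 == handled */
  }
  early_param("cgroup_disable", cgroup_disable);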


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-03-03 15:29           ` Johannes Weiner
@ 2022-03-03 19:26             ` Yu Zhao
  2022-03-03 21:43               ` Johannes Weiner
  0 siblings, 1 reply; 74+ messages in thread
From: Yu Zhao @ 2022-03-03 19:26 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Michal Hocko, Andi Kleen,
	Aneesh Kumar, Barry Song, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
	Linus Torvalds, Matthew Wilcox, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	Linux ARM, open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh

On Thu, Mar 3, 2022 at 8:29 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Hi Yu,
>
> On Mon, Feb 21, 2022 at 01:14:24AM -0700, Yu Zhao wrote:
> > On Tue, Feb 15, 2022 at 04:53:56PM -0500, Johannes Weiner wrote:
> > > On Tue, Feb 15, 2022 at 02:43:05AM -0700, Yu Zhao wrote:
> > > > On Thu, Feb 10, 2022 at 03:41:57PM -0500, Johannes Weiner wrote:
> > > > > > +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
> > > > > > +{
> > > > > > +     unsigned long max_seq = lruvec->lrugen.max_seq;
> > > > > > +
> > > > > > +     VM_BUG_ON(gen >= MAX_NR_GENS);
> > > > > > +
> > > > > > +     /* see the comment on MIN_NR_GENS */
> > > > > > +     return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
> > > > > > +}
> > > > >
> > > > > I'm still reading the series, so correct me if I'm wrong: the "active"
> > > > > set is split into two generations for the sole purpose of the
> > > > > second-chance policy for fresh faults, right?
> > > >
> > > > To be precise, the active/inactive notion on top of generations is
> > > > just for ABI compatibility, e.g., the counters in /proc/vmstat.
> > > > Otherwise, this function wouldn't be needed.
> > >
> > > Ah! would you mind adding this as a comment to the function?
> >
> > Will do.
> >
> > > But AFAICS there is the lru_gen_del_folio() callsite that maps it to
> > > the PG_active flag - which in turn gets used by add_folio() to place
> > > the thing back on the max_seq generation. So I suppose there is a
> > > secondary purpose of the function for remembering the page's rough age
> > > for non-reclaim isolation.
> >
> > Yes, e.g., migration.
>
> Ok, thanks for clarifying. That should also be in the comment.

Thanks. Will do.

> On scan resistance:
>
> > > The concern isn't the scan overhead, but jankiness from the workingset
> > > being flooded out by streaming IO.
> >
> > Yes, MGLRU uses a different approach to solve this problem, and for
> > its approach, the scan overhead is the concern.
> >
> > MGLRU detects (defines) the working set by scanning the entire memory
> > for each generation, and it counters the flooding by accelerating the
> > creation of generations. IOW, all mapped pages have an equal chance to
> > get scanned, no matter which generation they are in. This is a design
> > difference compared with the active/inactive LRU, which tries to scan
> > the active/inactive lists less/more frequently.
> >
> > > The concrete usecase at the time was a torrent client hashing a
> > > downloaded file and thereby kicking out the desktop environment, which
> > > caused jankiness. The hashing didn't benefit from caching - the file
> > > wouldn't have fit into RAM anyway - so this was pointless to boot.
> > >
> > > Essentially, the tradeoff is this:
> > >
> > > 1) If you treat new pages as hot, you accelerate workingset
> > > transitions, but on the flipside you risk unnecessary refaults in
> > > running applications when those new pages are one-off.
> > >
> > > 2) If you take new pages with a grain of salt, you protect existing
> > > applications better from one-off floods, but risk refaults in NEW
> > > application while they're trying to start up.
> >
> > Agreed.
> >
> > > There are two arguments for why 2) is preferable:
> > >
> > > 1) Users are tolerant of cache misses when applications first launch,
> > >    much less so after they've been running for hours.
> >
> > Our CUJs (Critical User Journeys) respectfully disagree :)
> >
> > They are built on the observation that once users have moved onto
> > another tab/app, they are more likely to stay with the new tab/app
> > rather than go back to the old ones. Speaking for myself, this is
> > generally the case.
>
> That's in line with what I said. Where is the disagreement?

Probably I've misinterpreted what you meant. The reasoning behind 1)
sounds to me like:
Cache misses of existing apps are more detrimental to user experience,
and therefore we choose to sacrifice the performance of newly launched
apps to avoid flooding.

My argument is that (phone/laptop/desktop) users usually care more
about the performance of newly launched apps -- this sounds like a
contradiction of what 1) says -- and therefore sacrificing the
performance of newly launched apps is not generally a good idea.

> > > 2) Workingset transitions (and associated jankiness) are bounded by
> > >    the amount of RAM you need to repopulate. But streaming IO is
> > >    bounded by storage, and datasets are routinely several times the
> > >    amount of RAM. Uncacheable sets in excess of RAM can produce an
> > >    infinite stream of "new" references; not protecting the workingset
> > >    from that means longer or even sustained jankiness.
> >
> > I'd argue the opposite -- we shouldn't risk refaulting fresh hot pages
> > just to accommodate this concrete yet minor use case, especially
> > considering torrent has been given the means (MADV_SEQUENTIAL) to help
> > itself.
> >
> > I appreciate all your points here. The bottom line is we agree this is
> > a tradeoff. For what we disagree about, we could both be right -- it
> > comes down to what workloads we care about *more*.
>
> It's a straight-forward question: How does MGLRU avoid cache pollution
> from scans?
>
> Your answer above seems to be "it just does". Your answer here seems
> to be "it doesn't, but it doesn't matter". Forgive me if I'm
> misreading what you're saying.

This might have gotten lost. At the beginning, I explained:

> > Yes, MGLRU uses a different approach to solve this problem, and for
> > its approach, the scan overhead is the concern.
> >
> > MGLRU detects (defines) the working set by scanning the entire memory
> > for each generation, and it counters the flooding by accelerating the
> > creation of generations. IOW, all mapped pages have an equal chance to
> > get scanned, no matter which generation they are in. This is a design
> > difference compared with the active/inactive LRU, which tries to scan
> > the active/inactive lists less/more frequently.

To summarize: MGLRU counteracts flooding by shortening the time
interval that defines an entire working set. In contrast, the
active/inactive LRU uses a longer time interval for the active
(established working set) and a shorter one for the inactive (flood or
new working set).
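
To make this concrete, here is a toy model I wrote for this thread. It is
purely illustrative -- it is not code from the patchset and it ignores
page table walks, tiers, swappiness, etc. -- but it shows the mechanism:
pages found accessed are promoted to the youngest generation during aging,
and anything older than the two youngest generations has, by this
definition, fallen out of the working set. Under a flood, the aging runs
more often, so a one-off page ages out sooner instead of displacing
established pages for a full memory cycle.

  #include <stdio.h>

  #define MIN_NR_GENS 2   /* second chance needs at least two generations */

  struct toy_page { unsigned long gen; int accessed; };

  /* aging: open a new generation; pages found accessed move into it */
  static void age(struct toy_page *pages, int n, unsigned long *max_seq)
  {
          (*max_seq)++;
          for (int i = 0; i < n; i++) {
                  if (pages[i].accessed) {
                          pages[i].gen = *max_seq;
                          pages[i].accessed = 0;
                  }
          }
  }

  /* the two youngest generations are what defines the working set here */
  static int in_working_set(const struct toy_page *p, unsigned long max_seq)
  {
          return max_seq - p->gen < MIN_NR_GENS;
  }

  int main(void)
  {
          unsigned long max_seq = 1;
          struct toy_page pages[2] = {
                  { .gen = 1, .accessed = 1 },    /* reused page */
                  { .gen = 1, .accessed = 0 },    /* one-off flood page */
          };

          age(pages, 2, &max_seq);        /* a flood accelerates these calls */
          age(pages, 2, &max_seq);

          for (int i = 0; i < 2; i++)
                  printf("page %d: %s\n", i,
                         in_working_set(&pages[i], max_seq) ?
                         "still working set" : "aged out");
          return 0;
  }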

> But it's not a minor concern.

This depends on the POV :)

For the workloads I care about more, i.e., the majority of
phone/laptop/desktop users and top open-source memory hogs running on
servers, I've heard no complaints (yet).

> Read the motivation behind any modern
> cache algorithm - ARC, LIRS, Clock-Pro, LRU-K, 2Q - and scan
> resistance is the reason why they all exist in the first place.

I agree that flooding is a concern (major from your POV; minor from my
POV). I assume that you agree that the solution is always a tradeoff.
I'm merely suggesting both tradeoffs I summarized above have their
merits and we shouldn't put all our eggs in one basket.

>     "The LRU-K algorithm surpasses conventional buffering algorithms
>      in discriminating between frequently and infrequently referenced
>      pages."
>
>         - The LRU-K page replacement algorithm for database disk
>           buffering, O'Neil et al, 1993
>
>     "Although LRU replacement policy has been commonly used in the
>      buffer cache management, it is well known for its inability to
>      cope with access patterns with weak locality."
>
>         - LIRS: an efficient low inter-reference recency set
>          replacement policy to improve buffer cache performance,
>          Jiang, Zhang, 2002
>
>     "The self-tuning, low-overhead, scan-resistant adaptive
>      replacement cache algorithm outperforms the least-recently-used
>      algorithm by dynamically responding to changing access patterns
>      and continually balancing between workload recency and frequency
>      features."
>
>         - Outperforming LRU with an adaptive replacement cache
>           algorithm, Megiddo, Modha, 2004
>
>     "Over the last three decades, the inability of LRU as well as
>      CLOCK to handle weak locality accesses has become increasingly
>      serious, and an effective fix becomes increasingly desirable."
>
>         - CLOCK-Pro: An Effective Improvement of the CLOCK
>           Replacement, Jiang et al, 2005
>
>
> We can't rely on MADV_SEQUENTIAL alone.

Agreed.

> Not all accesses know in
> advance that they'll be one-off; it can be a group of uncoordinated
> tasks causing the pattern etc.

For the cases we do know about, MADV_SEQUENTIAL can be a solution. There
are things we don't know, hence the "agreed" above :)
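
For reference, the hint I mean is a one-liner on the user side. A minimal
userspace sketch (error handling trimmed; the function name is made up,
and MGLRU starts mmapped file pages hinted this way in the oldest
generation):

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  /* one-off sequential pass over a large file, e.g., hashing a download */
  static int hash_file_once(const char *path)
  {
          struct stat st;
          int fd = open(path, O_RDONLY);

          if (fd < 0 || fstat(fd, &st) < 0)
                  return -1;

          char *buf = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
          if (buf == MAP_FAILED) {
                  close(fd);
                  return -1;
          }

          /*
           * Tell the kernel the mapping is read once, front to back, so it
           * can read ahead aggressively and drop the pages soon after use
           * instead of letting them push out the existing working set.
           */
          madvise(buf, st.st_size, MADV_SEQUENTIAL);

          for (off_t i = 0; i < st.st_size; i++)
                  ;       /* hashing of buf[i] elided */

          munmap(buf, st.st_size);
          close(fd);
          return 0;
  }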

> This is a pretty fundamental issue.

Agreed.

My points are:
1. The practical value of this fundamental issue (minor or major).
2. The merits of different tradeoffs (better or worse).

IMO, both depend on the POV.

> It would be good to get a more
> satisfying answer on this.

Agreed.

> > > > > You can drop the memcg parameter and use lruvec_memcg().
> > > >
> > > > lruvec_memcg() isn't available yet when pgdat_init_internals() calls
> > > > this function because mem_cgroup_disabled() is initialized afterward.
> > >
> > > Good catch. That'll container_of() into garbage. However, we have to
> > > assume that somebody's going to try that simplification again, so we
> > > should set up the code now to prevent issues.
> > >
> > > cgroup_disable parsing is self-contained, so we can pull it ahead in
> > > the init sequence. How about this?
> > >
> > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > > index 9d05c3ca2d5e..b544d768edc8 100644
> > > --- a/kernel/cgroup/cgroup.c
> > > +++ b/kernel/cgroup/cgroup.c
> > > @@ -6464,9 +6464,9 @@ static int __init cgroup_disable(char *str)
> > >                     break;
> > >             }
> > >     }
> > > -   return 1;
> > > +   return 0;
> > >  }
> > > -__setup("cgroup_disable=", cgroup_disable);
> > > +early_param("cgroup_disable", cgroup_disable);
> >
> > I think early_param() is still after pgdat_init_internals(), no?
>
> It's called twice for some reason, but AFAICS the first one is always
> called before pgdat_init_internals():
>
> start_kernel()
>   setup_arch()
>     parse_early_param()
>     x86_init.paging.pagetable_init();
>       paging_init()
>         zone_sizes_init()
>           free_area_init()
>             free_area_init_node()
>               free_area_init_core()
>                 pgdat_init_internals()
>   parse_early_param()
>
> It's the same/similar for arm, sparc and mips.

Thanks for checking. But I'd rather live with an additional parameter
than risk breaking some archs.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-03-03 19:26             ` Yu Zhao
@ 2022-03-03 21:43               ` Johannes Weiner
  0 siblings, 0 replies; 74+ messages in thread
From: Johannes Weiner @ 2022-03-03 21:43 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Mel Gorman, Michal Hocko, Andi Kleen,
	Aneesh Kumar, Barry Song, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
	Linus Torvalds, Matthew Wilcox, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang,
	Linux ARM, open list:DOCUMENTATION, linux-kernel, Linux-MM,
	Kernel Page Reclaim v2, the arch/x86 maintainers, Brian Geffon,
	Jan Alexander Steffens, Oleksandr Natalenko, Steven Barrett,
	Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh

On Thu, Mar 03, 2022 at 12:26:45PM -0700, Yu Zhao wrote:
> On Thu, Mar 3, 2022 at 8:29 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Mon, Feb 21, 2022 at 01:14:24AM -0700, Yu Zhao wrote:
> > > On Tue, Feb 15, 2022 at 04:53:56PM -0500, Johannes Weiner wrote:
> > > > On Tue, Feb 15, 2022 at 02:43:05AM -0700, Yu Zhao wrote:
> > > > > On Thu, Feb 10, 2022 at 03:41:57PM -0500, Johannes Weiner wrote:
> > > > > > You can drop the memcg parameter and use lruvec_memcg().
> > > > >
> > > > > lruvec_memcg() isn't available yet when pgdat_init_internals() calls
> > > > > this function because mem_cgroup_disabled() is initialized afterward.
> > > >
> > > > Good catch. That'll container_of() into garbage. However, we have to
> > > > assume that somebody's going to try that simplification again, so we
> > > > should set up the code now to prevent issues.
> > > >
> > > > cgroup_disable parsing is self-contained, so we can pull it ahead in
> > > > the init sequence. How about this?
> > > >
> > > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > > > index 9d05c3ca2d5e..b544d768edc8 100644
> > > > --- a/kernel/cgroup/cgroup.c
> > > > +++ b/kernel/cgroup/cgroup.c
> > > > @@ -6464,9 +6464,9 @@ static int __init cgroup_disable(char *str)
> > > >                     break;
> > > >             }
> > > >     }
> > > > -   return 1;
> > > > +   return 0;
> > > >  }
> > > > -__setup("cgroup_disable=", cgroup_disable);
> > > > +early_param("cgroup_disable", cgroup_disable);
> > >
> > > I think early_param() is still after pgdat_init_internals(), no?
> >
> > It's called twice for some reason, but AFAICS the first one is always
> > called before pgdat_init_internals():
> >
> > start_kernel()
> >   setup_arch()
> >     parse_early_param()
> >     x86_init.paging.pagetable_init();
> >       paging_init()
> >         zone_sizes_init()
> >           free_area_init()
> >             free_area_init_node()
> >               free_area_init_core()
> >                 pgdat_init_internals()
> >   parse_early_param()
> >
> > It's the same/similar for arm, sparc and mips.
> 
> Thanks for checking. But I'd rather live with an additional parameter
> than risk breaking some archs.

As per above, somebody is going to try to make that simplification
again in the future. It doesn't make a lot of sense to have a reviewer
trip over it, have a discussion about just how subtle this dependency
is, and then still leave it in for others. parse_early_param() is
documented to be called by arch code early on; there isn't a good
reason to mistrust our own codebase like that. And special-casing this
situation just complicates maintainability and hackability.

Please just fix the ordering and use lruvec_memcg(), thanks.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-02-15  9:43     ` Yu Zhao
  2022-02-15 21:53       ` Johannes Weiner
@ 2022-03-11 10:16       ` Barry Song
  2022-03-11 23:45         ` Yu Zhao
  1 sibling, 1 reply; 74+ messages in thread
From: Barry Song @ 2022-03-11 10:16 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Johannes Weiner, Andrew Morton, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
	Linus Torvalds, Matthew Wilcox, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, page-reclaim, x86,
	Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh

On Tue, Feb 15, 2022 at 10:43 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Thu, Feb 10, 2022 at 03:41:57PM -0500, Johannes Weiner wrote:
>
> Thanks for reviewing.
>
> > > +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
> > > +{
> > > +   unsigned long max_seq = lruvec->lrugen.max_seq;
> > > +
> > > +   VM_BUG_ON(gen >= MAX_NR_GENS);
> > > +
> > > +   /* see the comment on MIN_NR_GENS */
> > > +   return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
> > > +}
> >
> > I'm still reading the series, so correct me if I'm wrong: the "active"
> > set is split into two generations for the sole purpose of the
> > second-chance policy for fresh faults, right?
>
> To be precise, the active/inactive notion on top of generations is
> just for ABI compatibility, e.g., the counters in /proc/vmstat.
> Otherwise, this function wouldn't be needed.

Hi Yu,
I am still quite confused as I am seeing both active/inactive and lru_gen.
E.g.:

root@ubuntu:~# cat /proc/vmstat | grep active
nr_zone_inactive_anon 22797
nr_zone_active_anon 578405
nr_zone_inactive_file 0
nr_zone_active_file 4156
nr_inactive_anon 22800
nr_active_anon 578574
nr_inactive_file 0
nr_active_file 4215

and:

root@ubuntu:~# cat /sys//kernel/debug/lru_gen

...
memcg    36 /user.slice/user-0.slice/user@0.service
 node     0
         20      18820         22           0
         21       7452          0           0
         22       7448          0           0
memcg    33 /user.slice/user-0.slice/user@0.service/app.slice
 node     0
          0    2171452          0           0
          1    2171452          0           0
          2    2171452          0           0
          3    2171452          0           0
memcg    37 /user.slice/user-0.slice/session-1.scope
 node     0
         42      51804     102127           0
         43      18840     275622           0
         44      16104     216805           1

Does it mean one page could be on one of the generation lists and on
one of the active/inactive lists at the same time?
Do we have some mapping relationship between the active/inactive lists
and the generations?


>
> > If so, it'd be better to have the comment here instead of down by
> > MIN_NR_GENS. This is the place that defines what "active" is, so this
> > is where the reader asks what it means and what it implies. The
> > definition of MIN_NR_GENS can be briefer: "need at least two for
> > second chance, see lru_gen_is_active() for details".
>
> This could be understood this way. It'd be more appropriate to see
> this function as an auxiliary and MIN_NR_GENS as something fundamental.
> Therefore the former should refer to the latter. Specifically, the
> "see the comment on MIN_NR_GENS" refers to this part:
>   And to be compatible with the active/inactive LRU, these two
>   generations are mapped to the active; the rest of generations, if
>   they exist, are mapped to the inactive.
>
> > > +static inline void lru_gen_update_size(struct lruvec *lruvec, enum lru_list lru,
> > > +                                  int zone, long delta)
> > > +{
> > > +   struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > > +
> > > +   lockdep_assert_held(&lruvec->lru_lock);
> > > +   WARN_ON_ONCE(delta != (int)delta);
> > > +
> > > +   __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, delta);
> > > +   __mod_zone_page_state(&pgdat->node_zones[zone], NR_ZONE_LRU_BASE + lru, delta);
> > > +}
> >
> > This is a duplicate of update_lru_size(), please use that instead.
> >
> > Yeah technically you don't need the mem_cgroup_update_lru_size() but
> > that's not worth sweating over, better to keep it simple.
>
> I agree we don't need the mem_cgroup_update_lru_size() -- let me spell
> out why:
>   this function is not needed here because it updates the counters used
>   only by the active/inactive lru code, i.e., get_scan_count().
>
> However, we can't reuse update_lru_size() because MGLRU can trip the
> WARN_ONCE() in mem_cgroup_update_lru_size().
>
> Unlike lru_zone_size[], lrugen->nr_pages[] is eventually consistent.
> To move a page to a different generation, the gen counter in page->flags
> is updated first, which doesn't require the LRU lock. The second step,
> i.e., the update of lrugen->nr_pages[], requires the LRU lock, and it
> usually isn't done immediately due to batching. Meanwhile, if this page
> is, for example, isolated, nr_pages[] becomes temporarily unbalanced.
> And this trips the WARN_ONCE().
>
> <snipped>
>
> >       /* Promotion */
> > > +   if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) {
> > > +           lru_gen_update_size(lruvec, lru, zone, -delta);
> > > +           lru_gen_update_size(lruvec, lru + LRU_ACTIVE, zone, delta);
> > > +   }
> > > +
> > > +   /* Promotion is legit while a page is on an LRU list, but demotion isn't. */
> >
> >       /* Demotion happens during aging when pages are isolated, never on-LRU */
> > > +   VM_BUG_ON(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen));
> > > +}
> >
> > On that note, please move introduction of the promotion and demotion
> > bits to the next patch. They aren't used here yet, and I spent some
> > time jumping around patches to verify the promotion callers and
> > confirm the validy of the BUG_ON.
>
> Will do.
>
> > > +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> > > +{
> > > +   int gen;
> > > +   unsigned long old_flags, new_flags;
> > > +   int type = folio_is_file_lru(folio);
> > > +   int zone = folio_zonenum(folio);
> > > +   struct lru_gen_struct *lrugen = &lruvec->lrugen;
> > > +
> > > +   if (folio_test_unevictable(folio) || !lrugen->enabled)
> > > +           return false;
> >
> > These two checks should be in the callsite and the function should
> > return void. Otherwise you can't understand the callsite without
> > drilling down into lrugen code, even if lrugen is disabled.
>
> I agree it's a bit of a nuisance this way. The alternative is we'd need
> an ifdef or another helper at the call sites because lrugen->enabled is
> specific to lrugen.
>
> > > +   /*
> > > +    * There are three common cases for this page:
> > > +    * 1) If it shouldn't be evicted, e.g., it was just faulted in, add it
> > > +    *    to the youngest generation.
> >
> > "shouldn't be evicted" makes it sound like mlock. But they should just
> > be evicted last, right? Maybe:
> >
> >       /*
> >        * Pages start in different generations depending on
> >        * advance knowledge we have about their hotness and
> >        * evictability:
> >        *
> >        * 1. Already active pages start out youngest. This can be
> >        *    fresh faults, or refaults of previously hot pages.
> >        * 2. Cold pages that require writeback before becoming
> >        *    evictable start on the second oldest generation.
> >        * 3. Everything else (clean, cold) starts old.
> >        */
>
> Will do.
>
> > On that note, I think #1 is reintroducing a problem we have fixed
> > before, which is trashing the workingset with a flood of use-once
> > mmapped pages. It's the classic scenario where LFU beats LRU.
> >
> > Mapped streaming IO isn't very common, but it does happen. See these
> > commits:
> >
> > dfc8d636cdb95f7b792d5ba8c9f3b295809c125d
> > 31c0569c3b0b6cc8a867ac6665ca081553f7984c
> > 645747462435d84c6c6a64269ed49cc3015f753d
> >
> > From the changelog:
> >
> >     The used-once mapped file page detection patchset.
> >
> >     It is meant to help workloads with large amounts of shortly used file
> >     mappings, like rtorrent hashing a file or git when dealing with loose
> >     objects (git gc on a bigger site?).
> >
> >     Right now, the VM activates referenced mapped file pages on first
> >     encounter on the inactive list and it takes a full memory cycle to
> >     reclaim them again.  When those pages dominate memory, the system
> >     no longer has a meaningful notion of 'working set' and is required
> >     to give up the active list to make reclaim progress.  Obviously,
> >     this results in rather bad scanning latencies and the wrong pages
> >     being reclaimed.
> >
> >     This patch makes the VM be more careful about activating mapped file
> >     pages in the first place.  The minimum granted lifetime without
> >     another memory access becomes an inactive list cycle instead of the
> >     full memory cycle, which is more natural given the mentioned loads.
> >
> > Translating this to multigen, it seems fresh faults should really
> > start on the second oldest rather than on the youngest generation, to
> > get a second chance but without jeopardizing the workingset if they
> > don't take it.
>
> This is a good point, and I had worked on a similar idea but failed
> to measure its benefits. In addition to placing mmapped file pages in
> older generations, I also tried placing refaulted anon pages in older
> generations. My conclusion was that the initial LRU positions of NFU
> pages are not a bottleneck for workloads I've tested. The efficiency
> of testing/clearing the accessed bit is.
>
> And some applications are smart enough to leverage MADV_SEQUENTIAL.
> In this case, MGLRU does place mmapped file pages in the oldest
> generation.
>
> I have an oversimplified script that uses memcached to mimic a
> non-streaming workload and fio a (mmapped) streaming workload:
>   1. With MADV_SEQUENTIAL, the non-streaming workload is about 5 times
>      faster when using MGLRU. Somehow the baseline (rc3) swapped a lot.
>      (It shouldn't, and I haven't figured out why.)
>   2. Without MADV_SEQUENTIAL, the non-streaming workload is about 1
>      time faster when using MGLRU. Both MGLRU and the baseline swapped
>      a lot.
>
>            MADV_SEQUENTIAL    non-streaming ops/sec (memcached)
>   rc3      yes                 292k
>   rc3      no                  203k
>   rc3+v7   yes                1967k
>   rc3+v7   no                  436k
>
>   cat mmap.sh
>   modprobe brd rd_nr=2 rd_size=56623104
>
>   mkswap /dev/ram0
>   swapon /dev/ram0
>
>   mkfs.ext4 /dev/ram1
>   mount -t ext4 /dev/ram1 /mnt
>
>   memtier_benchmark -S /var/run/memcached/memcached.sock -P memcache_binary \
>     -n allkeys --key-minimum=1 --key-maximum=50000000 --key-pattern=P:P -c 1 \
>     -t 36 --ratio 1:0 --pipeline 8 -d 2000
>
>   # streaming workload: --fadvise_hint=0 disables MADV_SEQUENTIAL
>   fio -name=mglru --numjobs=12 --directory=/mnt --size=4224m --buffered=1 \
>     --ioengine=mmap --iodepth=128 --iodepth_batch_submit=32 \
>     --iodepth_batch_complete=32 --rw=read --time_based --ramp_time=10m \
>     --runtime=180m --group_reporting &
>   pid=$!
>
>   sleep 200
>
>   # non-streaming workload
>   memtier_benchmark -S /var/run/memcached/memcached.sock -P memcache_binary \
>     -n allkeys --key-minimum=1 --key-maximum=50000000 --key-pattern=R:R \
>     -c 1 -t 36 --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
>
>   kill -INT $pid
>   wait

We used to put a faulted file page on the inactive list; if we access it
a second time, it can be promoted to the active list. Then, in recent
years, we have also applied this to anon pages while the kernel added
workingset protection for anon pages. So basically both anon and file
pages go onto the inactive list the first time; if we access them a
second time, they go to the active list. If we don't access them any
more, they are likely to be reclaimed since they are inactive. We do
have a special fast path for code sections: executable file pages are
kept on the active list as long as they are accessed.

So are all of the above concerns actually not quite correct?


>
> > > +    * 2) If it can't be evicted immediately, i.e., it's an anon page and
> > > +    *    not in swapcache, or a dirty page pending writeback, add it to the
> > > +    *    second oldest generation.
> > > +    * 3) If it may be evicted immediately, e.g., it's a clean page, add it
> > > +    *    to the oldest generation.
> > > +    */
> > > +   if (folio_test_active(folio))
> > > +           gen = lru_gen_from_seq(lrugen->max_seq);
> > > +   else if ((!type && !folio_test_swapcache(folio)) ||
> > > +            (folio_test_reclaim(folio) &&
> > > +             (folio_test_dirty(folio) || folio_test_writeback(folio))))
> > > +           gen = lru_gen_from_seq(lrugen->min_seq[type] + 1);
> > > +   else
> > > +           gen = lru_gen_from_seq(lrugen->min_seq[type]);
> >
> > Condition #2 is not quite clear to me, and the comment is incomplete:
> > The code does put dirty/writeback pages on the oldest gen as long as
> > they haven't been marked for immediate reclaim by the scanner
> > yet.
>
> Right.
>
> > HOWEVER, once the scanner does see those pages and sets
> > PG_reclaim, it will also activate them to move them out of the way
> > until writeback finishes (see shrink_page_list()) - at which point
> > we'll trigger #1. So that second part of #2 appears unreachable.
>
> Yes, dirty file pages go to #1; dirty pages in swapcache go to #2.
> (Ideally we want dirty file pages to go to #2 too. IMO, the code would
>  be cleaner that way.)
>
> > It could be a good exercise to describe how cache pages move through
> > the generations, similar to the comment on lru_deactivate_file_fn().
> > It's a good example of intent vs implementation.
>
> Will do.
>
> > On another note, "!type" meaning "anon" is a bit rough. Please follow
> > the "bool file" convention used elsewhere.
>
> Originally I used "file", e.g., in v2:
> https://lore.kernel.org/linux-mm/20210413065633.2782273-9-yuzhao@google.com/
>
> But I was told to rename it since "file" usually means file. Let me
> rename it back to "file", unless somebody still objects.
>
> > > @@ -113,6 +298,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
> > >  {
> > >     enum lru_list lru = folio_lru_list(folio);
> > >
> > > +   if (lru_gen_add_folio(lruvec, folio, true))
> > > +           return;
> > > +
> >
> > bool parameters are notoriously hard to follow in the callsite. Can
> > you please add lru_gen_add_folio_tail() instead and have them use a
> > common helper?
>
> I'm not sure -- there are several places like this one. My question is
> whether we want to do it throughout this patchset. We'd end up with
> many helpers and duplicate code. E.g., in this file alone, we have two
> functions taking bool parameters:
>   lru_gen_add_folio(..., bool reclaiming)
>   lru_gen_del_folio(..., bool reclaiming)
>
> I can't say they are very readable; at least they are very compact
> right now. My concern is that we might lose the latter without having
> enough of the former.
>
> Perhaps this is something that we could revisit after you've finished
> reviewing the entire patchset?
>
> > > @@ -127,6 +315,9 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
> > >  static __always_inline
> > >  void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
> > >  {
> > > +   if (lru_gen_del_folio(lruvec, folio, false))
> > > +           return;
> > > +
> > >     list_del(&folio->lru);
> > >     update_lru_size(lruvec, folio_lru_list(folio), folio_zonenum(folio),
> > >                     -folio_nr_pages(folio));
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index aed44e9b5d89..0f5e8a995781 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -303,6 +303,78 @@ enum lruvec_flags {
> > >                                      */
> > >  };
> > >
> > > +struct lruvec;
> > > +
> > > +#define LRU_GEN_MASK               ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
> > > +#define LRU_REFS_MASK              ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
> > > +
> > > +#ifdef CONFIG_LRU_GEN
> > > +
> > > +#define MIN_LRU_BATCH              BITS_PER_LONG
> > > +#define MAX_LRU_BATCH              (MIN_LRU_BATCH * 128)
> >
> > Those two aren't used in this patch, so it's hard to say whether they
> > are chosen correctly.
>
> Right. They slipped during the v6/v7 refactoring. Will move them to
> the next patch.
>
> > > + * Evictable pages are divided into multiple generations. The youngest and the
> > > + * oldest generation numbers, max_seq and min_seq, are monotonically increasing.
> > > + * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An
> > > + * offset within MAX_NR_GENS, gen, indexes the LRU list of the corresponding
> > > + * generation. The gen counter in folio->flags stores gen+1 while a page is on
> > > + * one of lrugen->lists[]. Otherwise it stores 0.
> > > + *
> > > + * A page is added to the youngest generation on faulting. The aging needs to
> > > + * check the accessed bit at least twice before handing this page over to the
> > > + * eviction. The first check takes care of the accessed bit set on the initial
> > > + * fault; the second check makes sure this page hasn't been used since then.
> > > + * This process, AKA second chance, requires a minimum of two generations,
> > > + * hence MIN_NR_GENS. And to be compatible with the active/inactive LRU, these
> > > + * two generations are mapped to the active; the rest of generations, if they
> > > + * exist, are mapped to the inactive. PG_active is always cleared while a page
> > > + * is on one of lrugen->lists[] so that demotion, which happens consequently
> > > + * when the aging produces a new generation, needs not to worry about it.
> > > + */
> > > +#define MIN_NR_GENS                2U
> > > +#define MAX_NR_GENS                ((unsigned int)CONFIG_NR_LRU_GENS)
> > > +
> > > +struct lru_gen_struct {
> >
> > struct lrugen?
> >
> > In fact, "lrugen" for the general function and variable namespace
> > might be better, the _ doesn't seem to pull its weight.
> >
> > CONFIG_LRUGEN
> > struct lrugen
> > lrugen_foo()
> > etc.
>
> No strong opinion here. I usually add underscores to functions and
> types so that grep doesn't end up with tons of local variables.
>
> > > +   /* the aging increments the youngest generation number */
> > > +   unsigned long max_seq;
> > > +   /* the eviction increments the oldest generation numbers */
> > > +   unsigned long min_seq[ANON_AND_FILE];
> >
> > The singular max_seq vs the split min_seq raises questions. Please add
> > a comment that explains or points to an explanation.
>
> Will do.
>
> > > +   /* the birth time of each generation in jiffies */
> > > +   unsigned long timestamps[MAX_NR_GENS];
> >
> > This isn't in use until the thrashing-based OOM killing patch.
>
> Will move it there.
>
> > > +   /* the multigenerational LRU lists */
> > > +   struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
> > > +   /* the sizes of the above lists */
> > > +   unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
> > > +   /* whether the multigenerational LRU is enabled */
> > > +   bool enabled;
> >
> > Not (really) in use until the runtime switch. Best to keep everybody
> > checking the global flag for now, and have the runtime switch patch
> > introduce this flag and switch necessary callsites over.
>
> Will do.
>
> > > +void lru_gen_init_state(struct mem_cgroup *memcg, struct lruvec *lruvec);
> >
> > "state" is what we usually init :) How about lrugen_init_lruvec()?
>
> Same story as "file", lol -- this used to be lru_gen_init_lruvec():
> https://lore.kernel.org/linux-mm/20210413065633.2782273-9-yuzhao@google.com/
>
> Naming is hard. Hopefully we can finalize it this time.
>
> > You can drop the memcg parameter and use lruvec_memcg().
>
> lruvec_memcg() isn't available yet when pgdat_init_internals() calls
> this function because mem_cgroup_disabled() is initialized afterward.
>
> > > +#ifdef CONFIG_MEMCG
> > > +void lru_gen_init_memcg(struct mem_cgroup *memcg);
> > > +void lru_gen_free_memcg(struct mem_cgroup *memcg);
> >
> > This should be either init+exit, or alloc+free.
>
> Will do.

Thanks
Barry


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-03-11 10:16       ` Barry Song
@ 2022-03-11 23:45         ` Yu Zhao
  2022-03-12 10:37           ` Barry Song
  0 siblings, 1 reply; 74+ messages in thread
From: Yu Zhao @ 2022-03-11 23:45 UTC (permalink / raw)
  To: Barry Song
  Cc: Johannes Weiner, Andrew Morton, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
	Linus Torvalds, Matthew Wilcox, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh

On Fri, Mar 11, 2022 at 3:16 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Feb 15, 2022 at 10:43 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Thu, Feb 10, 2022 at 03:41:57PM -0500, Johannes Weiner wrote:
> >
> > Thanks for reviewing.
> >
> > > > +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
> > > > +{
> > > > +   unsigned long max_seq = lruvec->lrugen.max_seq;
> > > > +
> > > > +   VM_BUG_ON(gen >= MAX_NR_GENS);
> > > > +
> > > > +   /* see the comment on MIN_NR_GENS */
> > > > +   return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
> > > > +}
> > >
> > > I'm still reading the series, so correct me if I'm wrong: the "active"
> > > set is split into two generations for the sole purpose of the
> > > second-chance policy for fresh faults, right?
> >
> > To be precise, the active/inactive notion on top of generations is
> > just for ABI compatibility, e.g., the counters in /proc/vmstat.
> > Otherwise, this function wouldn't be needed.
>
> Hi Yu,
> I am still quite confused as I am seeing both active/inactive and lru_gen.
> E.g.:
>
> root@ubuntu:~# cat /proc/vmstat | grep active
> nr_zone_inactive_anon 22797
> nr_zone_active_anon 578405
> nr_zone_inactive_file 0
> nr_zone_active_file 4156
> nr_inactive_anon 22800
> nr_active_anon 578574
> nr_inactive_file 0
> nr_active_file 4215

Yes, this is expected. We have to maintain the ABI, i.e., the
*_active/inactive_* counters.

> and:
>
> root@ubuntu:~# cat /sys//kernel/debug/lru_gen
>
> ...
> memcg    36 /user.slice/user-0.slice/user@0.service
>  node     0
>          20      18820         22           0
>          21       7452          0           0
>          22       7448          0           0
> memcg    33 /user.slice/user-0.slice/user@0.service/app.slice
>  node     0
>           0    2171452          0           0
>           1    2171452          0           0
>           2    2171452          0           0
>           3    2171452          0           0
> memcg    37 /user.slice/user-0.slice/session-1.scope
>  node     0
>          42      51804     102127           0
>          43      18840     275622           0
>          44      16104     216805           1
>
> Does it mean one page could be on one of the generation lists and on
> one of the active/inactive lists at the same time?

In terms of the data structure, evictable pages are either on
lruvec->lists or lrugen->lists.

> Do we have some mapping relationship between the active/inactive lists
> and the generations?

For the counters, yes -- pages in max_seq and max_seq-1 are counted as
active, and the rest are inactive.
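
For illustration, the mapping is just a restatement of the lru_gen_is_active()
helper quoted above; in terms of sequence numbers it boils down to the sketch
below, where seq_counted_as_active() is only a name made up for this example:

/*
 * Sketch only: a page whose generation corresponds to max_seq or
 * max_seq - 1 is accounted under the nr_*active* counters, and
 * everything older under the nr_*inactive* counters.
 */
static inline bool seq_counted_as_active(unsigned long seq, unsigned long max_seq)
{
	return seq == max_seq || seq + 1 == max_seq;
}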

> We used to put a faulted file page in inactive, if we access it a
> second time, it can be promoted
> to active. then in recent years, we have also applied this to anon
> pages while kernel adds
> workingset protection for anon pages. so basically both anon and file
> pages go into the inactive
> list for the 1st time, if we access it for the second time, they go to
> the active list. if we don't access
> it any more, they are likely to be reclaimed as they are inactive.
> we do have some special fastpath for code section, executable file
> pages are kept on active list
> as long as they are accessed.

Yes.

> so all of the above concerns are actually not that correct?

They are valid concerns but I don't know any popular workloads that
care about them.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-03-11 23:45         ` Yu Zhao
@ 2022-03-12 10:37           ` Barry Song
  2022-03-12 21:11             ` Yu Zhao
  0 siblings, 1 reply; 74+ messages in thread
From: Barry Song @ 2022-03-12 10:37 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Johannes Weiner, Andrew Morton, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
	Linus Torvalds, Matthew Wilcox, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh

On Sat, Mar 12, 2022 at 12:45 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Fri, Mar 11, 2022 at 3:16 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Tue, Feb 15, 2022 at 10:43 PM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > On Thu, Feb 10, 2022 at 03:41:57PM -0500, Johannes Weiner wrote:
> > >
> > > Thanks for reviewing.
> > >
> > > > > +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
> > > > > +{
> > > > > +   unsigned long max_seq = lruvec->lrugen.max_seq;
> > > > > +
> > > > > +   VM_BUG_ON(gen >= MAX_NR_GENS);
> > > > > +
> > > > > +   /* see the comment on MIN_NR_GENS */
> > > > > +   return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
> > > > > +}
> > > >
> > > > I'm still reading the series, so correct me if I'm wrong: the "active"
> > > > set is split into two generations for the sole purpose of the
> > > > second-chance policy for fresh faults, right?
> > >
> > > To be precise, the active/inactive notion on top of generations is
> > > just for ABI compatibility, e.g., the counters in /proc/vmstat.
> > > Otherwise, this function wouldn't be needed.
> >
> > Hi Yu,
> > I am still quite confused as i am seeing both active/inactive and lru_gen.
> > eg:
> >
> > root@ubuntu:~# cat /proc/vmstat | grep active
> > nr_zone_inactive_anon 22797
> > nr_zone_active_anon 578405
> > nr_zone_inactive_file 0
> > nr_zone_active_file 4156
> > nr_inactive_anon 22800
> > nr_active_anon 578574
> > nr_inactive_file 0
> > nr_active_file 4215
>
> Yes, this is expected. We have to maintain the ABI, i.e., the
> *_active/inactive_* counters.
>
> > and:
> >
> > root@ubuntu:~# cat /sys//kernel/debug/lru_gen
> >
> > ...
> > memcg    36 /user.slice/user-0.slice/user@0.service
> >  node     0
> >          20      18820         22           0
> >          21       7452          0           0
> >          22       7448          0           0
> > memcg    33 /user.slice/user-0.slice/user@0.service/app.slice
> >  node     0
> >           0    2171452          0           0
> >           1    2171452          0           0
> >           2    2171452          0           0
> >           3    2171452          0           0
> > memcg    37 /user.slice/user-0.slice/session-1.scope
> >  node     0
> >          42      51804     102127           0
> >          43      18840     275622           0
> >          44      16104     216805           1
> >
> > Does it mean one page could be in both one of the generations and one
> > of the active/inactive lists?
>
> In terms of the data structure, evictable pages are either on
> lruvec->lists or lrugen->lists.
>
> > Do we have some mapping relationship between active/inactive lists
> > with generations?
>
> For the counters, yes -- pages in max_seq and max_seq-1 are counted as
> active, and the rest are inactive.
>
> > We used to put a faulted file page in inactive, if we access it a
> > second time, it can be promoted
> > to active. then in recent years, we have also applied this to anon
> > pages while kernel adds
> > workingset protection for anon pages. so basically both anon and file
> > pages go into the inactive
> > list for the 1st time, if we access it for the second time, they go to
> > the active list. if we don't access
> > it any more, they are likely to be reclaimed as they are inactive.
> > we do have some special fastpath for code section, executable file
> > pages are kept on active list
> > as long as they are accessed.
>
> Yes.
>
> > so all of the above concerns are actually not that correct?
>
> They are valid concerns but I don't know any popular workloads that
> care about them.

Hi Yu,
Here is a workload from Kim's patchset, where he added workingset protection
for anon pages:
https://patchwork.kernel.org/project/linux-mm/cover/1581401993-20041-1-git-send-email-iamjoonsoo.kim@lge.com/
Anon pages used to go to the active list rather than the inactive list, but
Kim's patchset changed them to start on the inactive list. Only after an anon
page is accessed a second time can it move to the active list.

"In current implementation, newly created or swap-in anonymous page is

started on the active list. Growing the active list results in rebalancing
active/inactive list so old pages on the active list are demoted to the
inactive list. Hence, hot page on the active list isn't protected at all.

Following is an example of this situation.

Assume that 50 hot pages on active list and system can contain total
100 pages. Numbers denote the number of pages on active/inactive
list (active | inactive). (h) stands for hot pages and (uo) stands for
used-once pages.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)

3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)

As we can see, hot pages are swapped-out and it would cause swap-in later."

Is MGLRU able to avoid the swap-out of the 50 hot pages? Since MGLRU puts
faulted pages into the youngest generation directly, do we have the risk
mentioned in Kim's patchset?

Thanks
Barry


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-03-12 10:37           ` Barry Song
@ 2022-03-12 21:11             ` Yu Zhao
  2022-03-13  4:57               ` Barry Song
  0 siblings, 1 reply; 74+ messages in thread
From: Yu Zhao @ 2022-03-12 21:11 UTC (permalink / raw)
  To: Barry Song
  Cc: Johannes Weiner, Andrew Morton, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
	Linus Torvalds, Matthew Wilcox, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh

On Sat, Mar 12, 2022 at 3:37 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, Mar 12, 2022 at 12:45 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Fri, Mar 11, 2022 at 3:16 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Tue, Feb 15, 2022 at 10:43 PM Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > On Thu, Feb 10, 2022 at 03:41:57PM -0500, Johannes Weiner wrote:
> > > >
> > > > Thanks for reviewing.
> > > >
> > > > > > +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
> > > > > > +{
> > > > > > +   unsigned long max_seq = lruvec->lrugen.max_seq;
> > > > > > +
> > > > > > +   VM_BUG_ON(gen >= MAX_NR_GENS);
> > > > > > +
> > > > > > +   /* see the comment on MIN_NR_GENS */
> > > > > > +   return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
> > > > > > +}
> > > > >
> > > > > I'm still reading the series, so correct me if I'm wrong: the "active"
> > > > > set is split into two generations for the sole purpose of the
> > > > > second-chance policy for fresh faults, right?
> > > >
> > > > To be precise, the active/inactive notion on top of generations is
> > > > just for ABI compatibility, e.g., the counters in /proc/vmstat.
> > > > Otherwise, this function wouldn't be needed.
> > >
> > > Hi Yu,
> > > I am still quite confused as i am seeing both active/inactive and lru_gen.
> > > eg:
> > >
> > > root@ubuntu:~# cat /proc/vmstat | grep active
> > > nr_zone_inactive_anon 22797
> > > nr_zone_active_anon 578405
> > > nr_zone_inactive_file 0
> > > nr_zone_active_file 4156
> > > nr_inactive_anon 22800
> > > nr_active_anon 578574
> > > nr_inactive_file 0
> > > nr_active_file 4215
> >
> > Yes, this is expected. We have to maintain the ABI, i.e., the
> > *_active/inactive_* counters.
> >
> > > and:
> > >
> > > root@ubuntu:~# cat /sys//kernel/debug/lru_gen
> > >
> > > ...
> > > memcg    36 /user.slice/user-0.slice/user@0.service
> > >  node     0
> > >          20      18820         22           0
> > >          21       7452          0           0
> > >          22       7448          0           0
> > > memcg    33 /user.slice/user-0.slice/user@0.service/app.slice
> > >  node     0
> > >           0    2171452          0           0
> > >           1    2171452          0           0
> > >           2    2171452          0           0
> > >           3    2171452          0           0
> > > memcg    37 /user.slice/user-0.slice/session-1.scope
> > >  node     0
> > >          42      51804     102127           0
> > >          43      18840     275622           0
> > >          44      16104     216805           1
> > >
> > > Does it mean one page could be in both one of the generations and one
> > > of the active/inactive lists?
> >
> > In terms of the data structure, evictable pages are either on
> > lruvec->lists or lrugen->lists.
> >
> > > Do we have some mapping relationship between active/inactive lists
> > > with generations?
> >
> > For the counters, yes -- pages in max_seq and max_seq-1 are counted as
> > active, and the rest are inactive.
> >
> > > We used to put a faulted file page in inactive, if we access it a
> > > second time, it can be promoted
> > > to active. then in recent years, we have also applied this to anon
> > > pages while kernel adds
> > > workingset protection for anon pages. so basically both anon and file
> > > pages go into the inactive
> > > list for the 1st time, if we access it for the second time, they go to
> > > the active list. if we don't access
> > > it any more, they are likely to be reclaimed as they are inactive.
> > > we do have some special fastpath for code section, executable file
> > > pages are kept on active list
> > > as long as they are accessed.
> >
> > Yes.
> >
> > > so all of the above concerns are actually not that correct?
> >
> > They are valid concerns but I don't know any popular workloads that
> > care about them.
>
> Hi Yu,
> here we can get a workload in Kim's patchset while he added workingset
> protection
> for anon pages:
> https://patchwork.kernel.org/project/linux-mm/cover/1581401993-20041-1-git-send-email-iamjoonsoo.kim@lge.com/

Thanks. I wouldn't call that a workload because it's not a real
application. By popular workloads, I mean applications that the
majority of people actually run on phones, in cloud, etc.

> anon pages used to go to active rather than inactive, but kim's patchset
> moved to use inactive first. then only after the anon page is accessed
> second time, it can move to active.

Yes. To clarify, the A-bit doesn't really mean the first or second
access. It can be many accesses each time it's set.
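
To put it concretely, here is a rough sketch of the idea rather than the
actual code in this series; pte_was_referenced() is a made-up name, and the
point is only that the A-bit is test-and-cleared per scan, so any number of
accesses between two scans is observed as a single reference:

/*
 * Rough sketch, not the code in this series: between two scans the
 * hardware may set the A-bit once or a million times; the scan only
 * learns "this PTE was referenced since the last time it was cleared".
 */
static bool pte_was_referenced(struct vm_area_struct *vma,
			       unsigned long addr, pte_t *pte)
{
	return ptep_test_and_clear_young(vma, addr, pte);
}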

> "In current implementation, newly created or swap-in anonymous page is
>
> started on the active list. Growing the active list results in rebalancing
> active/inactive list so old pages on the active list are demoted to the
> inactive list. Hence, hot page on the active list isn't protected at all.
>
> Following is an example of this situation.
>
> Assume that 50 hot pages on active list and system can contain total
> 100 pages. Numbers denote the number of pages on active/inactive
> list (active | inactive). (h) stands for hot pages and (uo) stands for
> used-once pages.
>
> 1. 50 hot pages on active list
> 50(h) | 0
>
> 2. workload: 50 newly created (used-once) pages
> 50(uo) | 50(h)
>
> 3. workload: another 50 newly created (used-once) pages
> 50(uo) | 50(uo), swap-out 50(h)
>
> As we can see, hot pages are swapped-out and it would cause swap-in later."
>
> Is MGLRU able to avoid the swap-out of the 50 hot pages?

I think the real question is why the 50 hot pages can be moved to the
inactive list. If they are really hot, the A-bit should protect them.

> since MGLRU
> is putting faulted pages to the youngest generation directly, do we have the
> risk mentioned in Kim's patchset?

There are always risks :) I could imagine a thousand ways to make VM
suffer, but all of them could be irrelevant to how it actually does in
production. So a concrete use case of yours would be much appreciated
for this discussion.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-03-12 21:11             ` Yu Zhao
@ 2022-03-13  4:57               ` Barry Song
  2022-03-14 11:11                 ` Barry Song
  0 siblings, 1 reply; 74+ messages in thread
From: Barry Song @ 2022-03-13  4:57 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Johannes Weiner, Andrew Morton, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
	Linus Torvalds, Matthew Wilcox, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh

On Sun, Mar 13, 2022 at 10:12 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Sat, Mar 12, 2022 at 3:37 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, Mar 12, 2022 at 12:45 PM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > On Fri, Mar 11, 2022 at 3:16 AM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Tue, Feb 15, 2022 at 10:43 PM Yu Zhao <yuzhao@google.com> wrote:
> > > > >
> > > > > On Thu, Feb 10, 2022 at 03:41:57PM -0500, Johannes Weiner wrote:
> > > > >
> > > > > Thanks for reviewing.
> > > > >
> > > > > > > +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
> > > > > > > +{
> > > > > > > +   unsigned long max_seq = lruvec->lrugen.max_seq;
> > > > > > > +
> > > > > > > +   VM_BUG_ON(gen >= MAX_NR_GENS);
> > > > > > > +
> > > > > > > +   /* see the comment on MIN_NR_GENS */
> > > > > > > +   return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
> > > > > > > +}
> > > > > >
> > > > > > I'm still reading the series, so correct me if I'm wrong: the "active"
> > > > > > set is split into two generations for the sole purpose of the
> > > > > > second-chance policy for fresh faults, right?
> > > > >
> > > > > To be precise, the active/inactive notion on top of generations is
> > > > > just for ABI compatibility, e.g., the counters in /proc/vmstat.
> > > > > Otherwise, this function wouldn't be needed.
> > > >
> > > > Hi Yu,
> > > > I am still quite confused as i am seeing both active/inactive and lru_gen.
> > > > eg:
> > > >
> > > > root@ubuntu:~# cat /proc/vmstat | grep active
> > > > nr_zone_inactive_anon 22797
> > > > nr_zone_active_anon 578405
> > > > nr_zone_inactive_file 0
> > > > nr_zone_active_file 4156
> > > > nr_inactive_anon 22800
> > > > nr_active_anon 578574
> > > > nr_inactive_file 0
> > > > nr_active_file 4215
> > >
> > > Yes, this is expected. We have to maintain the ABI, i.e., the
> > > *_active/inactive_* counters.
> > >
> > > > and:
> > > >
> > > > root@ubuntu:~# cat /sys//kernel/debug/lru_gen
> > > >
> > > > ...
> > > > memcg    36 /user.slice/user-0.slice/user@0.service
> > > >  node     0
> > > >          20      18820         22           0
> > > >          21       7452          0           0
> > > >          22       7448          0           0
> > > > memcg    33 /user.slice/user-0.slice/user@0.service/app.slice
> > > >  node     0
> > > >           0    2171452          0           0
> > > >           1    2171452          0           0
> > > >           2    2171452          0           0
> > > >           3    2171452          0           0
> > > > memcg    37 /user.slice/user-0.slice/session-1.scope
> > > >  node     0
> > > >          42      51804     102127           0
> > > >          43      18840     275622           0
> > > >          44      16104     216805           1
> > > >
> > > > Does it mean one page could be in both one of the generations and one
> > > > of the active/inactive lists?
> > >
> > > In terms of the data structure, evictable pages are either on
> > > lruvec->lists or lrugen->lists.
> > >
> > > > Do we have some mapping relationship between active/inactive lists
> > > > with generations?
> > >
> > > For the counters, yes -- pages in max_seq and max_seq-1 are counted as
> > > active, and the rest are inactive.
> > >
> > > > We used to put a faulted file page in inactive, if we access it a
> > > > second time, it can be promoted
> > > > to active. then in recent years, we have also applied this to anon
> > > > pages while kernel adds
> > > > workingset protection for anon pages. so basically both anon and file
> > > > pages go into the inactive
> > > > list for the 1st time, if we access it for the second time, they go to
> > > > the active list. if we don't access
> > > > it any more, they are likely to be reclaimed as they are inactive.
> > > > we do have some special fastpath for code section, executable file
> > > > pages are kept on active list
> > > > as long as they are accessed.
> > >
> > > Yes.
> > >
> > > > so all of the above concerns are actually not that correct?
> > >
> > > They are valid concerns but I don't know any popular workloads that
> > > care about them.
> >
> > Hi Yu,
> > here we can get a workload in Kim's patchset while he added workingset
> > protection
> > for anon pages:
> > https://patchwork.kernel.org/project/linux-mm/cover/1581401993-20041-1-git-send-email-iamjoonsoo.kim@lge.com/
>
> Thanks. I wouldn't call that a workload because it's not a real
> application. By popular workloads, I mean applications that the
> majority of people actually run on phones, in cloud, etc.
>
> > anon pages used to go to active rather than inactive, but kim's patchset
> > moved to use inactive first. then only after the anon page is accessed
> > second time, it can move to active.
>
> Yes. To clarify, the A-bit doesn't really mean the first or second
> access. It can be many accesses each time it's set.
>
> > "In current implementation, newly created or swap-in anonymous page is
> >
> > started on the active list. Growing the active list results in rebalancing
> > active/inactive list so old pages on the active list are demoted to the
> > inactive list. Hence, hot page on the active list isn't protected at all.
> >
> > Following is an example of this situation.
> >
> > Assume that 50 hot pages on active list and system can contain total
> > 100 pages. Numbers denote the number of pages on active/inactive
> > list (active | inactive). (h) stands for hot pages and (uo) stands for
> > used-once pages.
> >
> > 1. 50 hot pages on active list
> > 50(h) | 0
> >
> > 2. workload: 50 newly created (used-once) pages
> > 50(uo) | 50(h)
> >
> > 3. workload: another 50 newly created (used-once) pages
> > 50(uo) | 50(uo), swap-out 50(h)
> >
> > As we can see, hot pages are swapped-out and it would cause swap-in later."
> >
> > Is MGLRU able to avoid the swap-out of the 50 hot pages?
>
> I think the real question is why the 50 hot pages can be moved to the
> inactive list. If they are really hot, the A-bit should protect them.

This is a good question.

I guess it is probably because the current LRU tries to maintain a balance
between the sizes of the active and inactive lists. Thus, it can shrink the
active list even though its pages might still be "hot", just not the most
recently accessed ones.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)

3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)

The old kernel, without anon workingset protection, put the pages from
workload 2 on the active list, which pushed the 50 hot pages from active to
inactive. Workload 3 then further contributed to evicting the 50 hot pages.
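
To spell out what I mean by "balance", the trigger is roughly the sketch
below. This is only a simplification of the inactive_is_low() logic in
mm/vmscan.c, not the actual code, and active_needs_shrinking() is a name made
up for the example:

/*
 * Simplified sketch, not the real inactive_is_low(): the active list is
 * shrunk whenever the inactive list looks too small relative to it,
 * regardless of how hot the active pages actually are.
 */
static bool active_needs_shrinking(unsigned long nr_active,
				   unsigned long nr_inactive,
				   unsigned long inactive_ratio)
{
	/*
	 * In the example above, with the old behaviour workload 2 lands on
	 * the active list, so nr_active = 100 and nr_inactive = 0; this
	 * returns true and the oldest active pages (the 50 hot ones) get
	 * demoted to the inactive list and evicted later.
	 */
	return nr_inactive * inactive_ratio < nr_active;
}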

It seems MGLRU doesn't demote pages from the youngest generation to an older
generation just to balance list sizes, so MGLRU is probably safe in these
cases.

I will run some tests mentioned in Kim's patchset and report the result to you
afterwards.

>
> > since MGLRU
> > is putting faulted pages to the youngest generation directly, do we have the
> > risk mentioned in Kim's patchset?
>
> There are always risks :) I could imagine a thousand ways to make VM
> suffer, but all of them could be irrelevant to how it actually does in
> production. So a concrete use case of yours would be much appreciated
> for this discussion.

Thanks
Barry


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-03-13  4:57               ` Barry Song
@ 2022-03-14 11:11                 ` Barry Song
  2022-03-14 16:45                   ` Yu Zhao
  0 siblings, 1 reply; 74+ messages in thread
From: Barry Song @ 2022-03-14 11:11 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Johannes Weiner, Andrew Morton, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
	Linus Torvalds, Matthew Wilcox, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh

> > > >
> > > > > We used to put a faulted file page in inactive, if we access it a
> > > > > second time, it can be promoted
> > > > > to active. then in recent years, we have also applied this to anon
> > > > > pages while kernel adds
> > > > > workingset protection for anon pages. so basically both anon and file
> > > > > pages go into the inactive
> > > > > list for the 1st time, if we access it for the second time, they go to
> > > > > the active list. if we don't access
> > > > > it any more, they are likely to be reclaimed as they are inactive.
> > > > > we do have some special fastpath for code section, executable file
> > > > > pages are kept on active list
> > > > > as long as they are accessed.
> > > >
> > > > Yes.
> > > >
> > > > > so all of the above concerns are actually not that correct?
> > > >
> > > > They are valid concerns but I don't know any popular workloads that
> > > > care about them.
> > >
> > > Hi Yu,
> > > here we can get a workload in Kim's patchset while he added workingset
> > > protection
> > > for anon pages:
> > > https://patchwork.kernel.org/project/linux-mm/cover/1581401993-20041-1-git-send-email-iamjoonsoo.kim@lge.com/
> >
> > Thanks. I wouldn't call that a workload because it's not a real
> > application. By popular workloads, I mean applications that the
> > majority of people actually run on phones, in cloud, etc.
> >
> > > anon pages used to go to active rather than inactive, but kim's patchset
> > > moved to use inactive first. then only after the anon page is accessed
> > > second time, it can move to active.
> >
> > Yes. To clarify, the A-bit doesn't really mean the first or second
> > access. It can be many accesses each time it's set.
> >
> > > "In current implementation, newly created or swap-in anonymous page is
> > >
> > > started on the active list. Growing the active list results in rebalancing
> > > active/inactive list so old pages on the active list are demoted to the
> > > inactive list. Hence, hot page on the active list isn't protected at all.
> > >
> > > Following is an example of this situation.
> > >
> > > Assume that 50 hot pages on active list and system can contain total
> > > 100 pages. Numbers denote the number of pages on active/inactive
> > > list (active | inactive). (h) stands for hot pages and (uo) stands for
> > > used-once pages.
> > >
> > > 1. 50 hot pages on active list
> > > 50(h) | 0
> > >
> > > 2. workload: 50 newly created (used-once) pages
> > > 50(uo) | 50(h)
> > >
> > > 3. workload: another 50 newly created (used-once) pages
> > > 50(uo) | 50(uo), swap-out 50(h)
> > >
> > > As we can see, hot pages are swapped-out and it would cause swap-in later."
> > >
> > > Is MGLRU able to avoid the swap-out of the 50 hot pages?
> >
> > I think the real question is why the 50 hot pages can be moved to the
> > inactive list. If they are really hot, the A-bit should protect them.
>
> This is a good question.
>
> I guess it  is probably because the current lru is trying to maintain a balance
> between the sizes of active and inactive lists. Thus, it can shrink active list
> even though pages might be still "hot" but not the recently accessed ones.
>
> 1. 50 hot pages on active list
> 50(h) | 0
>
> 2. workload: 50 newly created (used-once) pages
> 50(uo) | 50(h)
>
> 3. workload: another 50 newly created (used-once) pages
> 50(uo) | 50(uo), swap-out 50(h)
>
> the old kernel without anon workingset protection put workload 2 on active, so
> pushed 50 hot pages from active to inactive. workload 3 would further contribute
> to evict the 50 hot pages.
>
> it seems mglru doesn't demote pages from the youngest generation to older
> generation only in order to balance the list size? so mglru is probably safe
> in these cases.
>
> I will run some tests mentioned in Kim's patchset and report the result to you
> afterwards.
>

Hi Yu,
I did find that putting faulted pages into the youngest generation leads to
some regression in the ebizzy case that Kim's patchset mentioned when he
added workingset protection for anon pages.
I made a small modification to rand_chunk(), which is probably similar to the
modification Kim mentioned in his patchset. The modification can be found
here:
https://github.com/21cnbao/ltp/commit/7134413d747bfa9ef

The test env is an x86 machine on which I have limited memory to 2.5GB, set
up a 2GB zRAM swap device and disabled external disk swap.

with the vanilla kernel:
\time -v ./a.out -vv -t 4 -s 209715200 -S 200000

so we have 10 chunks and 4 threads, and each chunk is 209715200 bytes (200MB)

typical result:
        Command being timed: "./a.out -vv -t 4 -s 209715200 -S 200000"
        User time (seconds): 36.19
        System time (seconds): 229.72
        Percent of CPU this job got: 371%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:11.59
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 2166196
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 9990128
        Minor (reclaiming a frame) page faults: 33315945
        Voluntary context switches: 59144
        Involuntary context switches: 167754
        Swaps: 0
        File system inputs: 2760
        File system outputs: 8
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

with MGLRU and lru_gen/enabled=0x3:
typical result:
Command being timed: "./a.out -vv -t 4 -s 209715200 -S 200000"
User time (seconds): 36.34
System time (seconds): 276.07
Percent of CPU this job got: 378%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:22.46
           **** ~15% more elapsed time than vanilla
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2168120
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 13362810
             ***** ~30% more major page faults than vanilla
Minor (reclaiming a frame) page faults: 33394617
Voluntary context switches: 55216
Involuntary context switches: 137220
Swaps: 0
File system inputs: 4088
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

with MGLRU and lru_gen/enabled=0x7:
typical result:
Command being timed: "./a.out -vv -t 4 -s 209715200 -S 200000"
User time (seconds): 36.13
System time (seconds): 251.71
Percent of CPU this job got: 378%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:16.00
         ***** better than enabled=0x3, worse than vanilla
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2120988
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 12706512
Minor (reclaiming a frame) page faults: 33422243
Voluntary context switches: 49485
Involuntary context switches: 126765
Swaps: 0
File system inputs: 2976
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

I can also reproduce the problem on arm64.

I am not saying this should block MGLRU from being mainlined, but I am still
curious whether this is an issue worth addressing somehow in MGLRU.

> >
> > > since MGLRU
> > > is putting faulted pages to the youngest generation directly, do we have the
> > > risk mentioned in Kim's patchset?
> >
> > There are always risks :) I could imagine a thousand ways to make VM
> > suffer, but all of them could be irrelevant to how it actually does in
> > production. So a concrete use case of yours would be much appreciated
> > for this discussion.
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-03-14 11:11                 ` Barry Song
@ 2022-03-14 16:45                   ` Yu Zhao
  2022-03-14 23:38                     ` Barry Song
  0 siblings, 1 reply; 74+ messages in thread
From: Yu Zhao @ 2022-03-14 16:45 UTC (permalink / raw)
  To: Barry Song
  Cc: Johannes Weiner, Andrew Morton, Mel Gorman, Michal Hocko,
	Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen,
	Hillf Danton, Jens Axboe, Jesse Barnes, Jonathan Corbet,
	Linus Torvalds, Matthew Wilcox, Michael Larabel, Mike Rapoport,
	Rik van Riel, Vlastimil Babka, Will Deacon, Ying Huang, LAK,
	Linux Doc Mailing List, LKML, Linux-MM, Kernel Page Reclaim v2,
	x86, Brian Geffon, Jan Alexander Steffens, Oleksandr Natalenko,
	Steven Barrett, Suleiman Souhlal, Daniel Byrne, Donald Carr,
	Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
	Sofia Trinh

On Mon, Mar 14, 2022 at 5:12 AM Barry Song <21cnbao@gmail.com> wrote:
>
> > > > >
> > > > > > We used to put a faulted file page in inactive, if we access it a
> > > > > > second time, it can be promoted
> > > > > > to active. then in recent years, we have also applied this to anon
> > > > > > pages while kernel adds
> > > > > > workingset protection for anon pages. so basically both anon and file
> > > > > > pages go into the inactive
> > > > > > list for the 1st time, if we access it for the second time, they go to
> > > > > > the active list. if we don't access
> > > > > > it any more, they are likely to be reclaimed as they are inactive.
> > > > > > we do have some special fastpath for code section, executable file
> > > > > > pages are kept on active list
> > > > > > as long as they are accessed.
> > > > >
> > > > > Yes.
> > > > >
> > > > > > so all of the above concerns are actually not that correct?
> > > > >
> > > > > They are valid concerns but I don't know any popular workloads that
> > > > > care about them.
> > > >
> > > > Hi Yu,
> > > > here we can get a workload in Kim's patchset while he added workingset
> > > > protection
> > > > for anon pages:
> > > > https://patchwork.kernel.org/project/linux-mm/cover/1581401993-20041-1-git-send-email-iamjoonsoo.kim@lge.com/
> > >
> > > Thanks. I wouldn't call that a workload because it's not a real
> > > application. By popular workloads, I mean applications that the
> > > majority of people actually run on phones, in cloud, etc.
> > >
> > > > anon pages used to go to active rather than inactive, but kim's patchset
> > > > moved to use inactive first. then only after the anon page is accessed
> > > > second time, it can move to active.
> > >
> > > Yes. To clarify, the A-bit doesn't really mean the first or second
> > > access. It can be many accesses each time it's set.
> > >
> > > > "In current implementation, newly created or swap-in anonymous page is
> > > >
> > > > started on the active list. Growing the active list results in rebalancing
> > > > active/inactive list so old pages on the active list are demoted to the
> > > > inactive list. Hence, hot page on the active list isn't protected at all.
> > > >
> > > > Following is an example of this situation.
> > > >
> > > > Assume that 50 hot pages on active list and system can contain total
> > > > 100 pages. Numbers denote the number of pages on active/inactive
> > > > list (active | inactive). (h) stands for hot pages and (uo) stands for
> > > > used-once pages.
> > > >
> > > > 1. 50 hot pages on active list
> > > > 50(h) | 0
> > > >
> > > > 2. workload: 50 newly created (used-once) pages
> > > > 50(uo) | 50(h)
> > > >
> > > > 3. workload: another 50 newly created (used-once) pages
> > > > 50(uo) | 50(uo), swap-out 50(h)
> > > >
> > > > As we can see, hot pages are swapped-out and it would cause swap-in later."
> > > >
> > > > Is MGLRU able to avoid the swap-out of the 50 hot pages?
> > >
> > > I think the real question is why the 50 hot pages can be moved to the
> > > inactive list. If they are really hot, the A-bit should protect them.
> >
> > This is a good question.
> >
> > I guess it  is probably because the current lru is trying to maintain a balance
> > between the sizes of active and inactive lists. Thus, it can shrink active list
> > even though pages might be still "hot" but not the recently accessed ones.
> >
> > 1. 50 hot pages on active list
> > 50(h) | 0
> >
> > 2. workload: 50 newly created (used-once) pages
> > 50(uo) | 50(h)
> >
> > 3. workload: another 50 newly created (used-once) pages
> > 50(uo) | 50(uo), swap-out 50(h)
> >
> > the old kernel without anon workingset protection put workload 2 on active, so
> > pushed 50 hot pages from active to inactive. workload 3 would further contribute
> > to evict the 50 hot pages.
> >
> > it seems mglru doesn't demote pages from the youngest generation to older
> > generation only in order to balance the list size? so mglru is probably safe
> > in these cases.
> >
> > I will run some tests mentioned in Kim's patchset and report the result to you
> > afterwards.
> >
>
> Hi Yu,
> I did find putting faulted pages to the youngest generation lead to some
> regression in the case ebizzy Kim's patchset mentioned while he tried
> to support workingset protection for anon pages.
> i did a little bit modification for rand_chunk() which is probably similar
> with the modifcation() Kim mentioned in his patchset. The modification
> can be found here:
> https://github.com/21cnbao/ltp/commit/7134413d747bfa9ef
>
> The test env is a x86 machine in which I have set memory size to 2.5GB and
> set zRAM to 2GB and disabled external disk swap.
>
> with the vanilla kernel:
> \time -v ./a.out -vv -t 4 -s 209715200 -S 200000
>
> so we have 10 chunks and 4 threads, each trunk is 209715200(200MB)
>
> typical result:
>         Command being timed: "./a.out -vv -t 4 -s 209715200 -S 200000"
>         User time (seconds): 36.19
>         System time (seconds): 229.72
>         Percent of CPU this job got: 371%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 1:11.59
>         Average shared text size (kbytes): 0
>         Average unshared data size (kbytes): 0
>         Average stack size (kbytes): 0
>         Average total size (kbytes): 0
>         Maximum resident set size (kbytes): 2166196
>         Average resident set size (kbytes): 0
>         Major (requiring I/O) page faults: 9990128
>         Minor (reclaiming a frame) page faults: 33315945
>         Voluntary context switches: 59144
>         Involuntary context switches: 167754
>         Swaps: 0
>         File system inputs: 2760
>         File system outputs: 8
>         Socket messages sent: 0
>         Socket messages received: 0
>         Signals delivered: 0
>         Page size (bytes): 4096
>         Exit status: 0
>
> with gen_lru and lru_gen/enabled=0x3:
> typical result:
> Command being timed: "./a.out -vv -t 4 -s 209715200 -S 200000"
> User time (seconds): 36.34
> System time (seconds): 276.07
> Percent of CPU this job got: 378%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 1:22.46
>            **** 15% time +
> Average shared text size (kbytes): 0
> Average unshared data size (kbytes): 0
> Average stack size (kbytes): 0
> Average total size (kbytes): 0
> Maximum resident set size (kbytes): 2168120
> Average resident set size (kbytes): 0
> Major (requiring I/O) page faults: 13362810
>              ***** 30% page fault +
> Minor (reclaiming a frame) page faults: 33394617
> Voluntary context switches: 55216
> Involuntary context switches: 137220
> Swaps: 0
> File system inputs: 4088
> File system outputs: 8
> Socket messages sent: 0
> Socket messages received: 0
> Signals delivered: 0
> Page size (bytes): 4096
> Exit status: 0
>
> with gen_lru and lru_gen/enabled=0x7:
> typical result:
> Command being timed: "./a.out -vv -t 4 -s 209715200 -S 200000"
> User time (seconds): 36.13
> System time (seconds): 251.71
> Percent of CPU this job got: 378%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 1:16.00
>          *****better than enabled=0x3, worse than vanilla
> Average shared text size (kbytes): 0
> Average unshared data size (kbytes): 0
> Average stack size (kbytes): 0
> Average total size (kbytes): 0
> Maximum resident set size (kbytes): 2120988
> Average resident set size (kbytes): 0
> Major (requiring I/O) page faults: 12706512
> Minor (reclaiming a frame) page faults: 33422243
> Voluntary context switches: 49485
> Involuntary context switches: 126765
> Swaps: 0
> File system inputs: 2976
> File system outputs: 8
> Socket messages sent: 0
> Socket messages received: 0
> Signals delivered: 0
> Page size (bytes): 4096
> Exit status: 0
>
> I can also reproduce the problem on arm64.
>
> I am not saying this is going to block mglru from being mainlined. But  I am
> still curious if this is an issue worth being addressed somehow in mglru.

You've missed something very important: *throughput* :)

Dollars to doughnuts there was a large increase in throughput -- I
haven't tried this benchmark but I've seen many reports similar to
this one.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-03-14 16:45                   ` Yu Zhao
@ 2022-03-14 23:38                     ` Barry Song
       [not found]                       ` <CAOUHufa9eY44QadfGTzsxa2=hEvqwahXd7Canck5Gt-N6c4UKA@mail.gmail.com>
  0 siblings, 1 reply; 74+ messages in thread
From: Barry Song @ 2022-03-14 23:38 UTC (permalink / raw)
  To: yuzhao
  Cc: 21cnbao, Hi-Angel, Michael, ak, akpm, aneesh.kumar, axboe,
	bgeffon, catalin.marinas, corbet, d, dave.hansen, djbyrne,
	hannes, hdanton, heftig, holger, jsbarnes, linux-arm-kernel,
	linux-doc, linux-kernel, linux-mm, mgorman, mhocko, oleksandr,
	page-reclaim, riel, rppt, sofia.trinh, steven, suleiman, szhai2,
	torvalds, vbabka, will, willy, x86, ying.huang

On Tue, Mar 15, 2022 at 5:45 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Mon, Mar 14, 2022 at 5:12 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > > > > >
> > > > > > > We used to put a faulted file page in inactive, if we access it a
> > > > > > > second time, it can be promoted
> > > > > > > to active. then in recent years, we have also applied this to anon
> > > > > > > pages while kernel adds
> > > > > > > workingset protection for anon pages. so basically both anon and file
> > > > > > > pages go into the inactive
> > > > > > > list for the 1st time, if we access it for the second time, they go to
> > > > > > > the active list. if we don't access
> > > > > > > it any more, they are likely to be reclaimed as they are inactive.
> > > > > > > we do have some special fastpath for code section, executable file
> > > > > > > pages are kept on active list
> > > > > > > as long as they are accessed.
> > > > > >
> > > > > > Yes.
> > > > > >
> > > > > > > so all of the above concerns are actually not that correct?
> > > > > >
> > > > > > They are valid concerns but I don't know any popular workloads that
> > > > > > care about them.
> > > > >
> > > > > Hi Yu,
> > > > > here we can get a workload in Kim's patchset while he added workingset
> > > > > protection
> > > > > for anon pages:
> > > > > https://patchwork.kernel.org/project/linux-mm/cover/1581401993-20041-1-git-send-email-iamjoonsoo.kim@lge.com/
> > > >
> > > > Thanks. I wouldn't call that a workload because it's not a real
> > > > application. By popular workloads, I mean applications that the
> > > > majority of people actually run on phones, in cloud, etc.
> > > >
> > > > > anon pages used to go to active rather than inactive, but kim's patchset
> > > > > moved to use inactive first. then only after the anon page is accessed
> > > > > second time, it can move to active.
> > > >
> > > > Yes. To clarify, the A-bit doesn't really mean the first or second
> > > > access. It can be many accesses each time it's set.
> > > >
> > > > > "In current implementation, newly created or swap-in anonymous page is
> > > > >
> > > > > started on the active list. Growing the active list results in rebalancing
> > > > > active/inactive list so old pages on the active list are demoted to the
> > > > > inactive list. Hence, hot page on the active list isn't protected at all.
> > > > >
> > > > > Following is an example of this situation.
> > > > >
> > > > > Assume that 50 hot pages on active list and system can contain total
> > > > > 100 pages. Numbers denote the number of pages on active/inactive
> > > > > list (active | inactive). (h) stands for hot pages and (uo) stands for
> > > > > used-once pages.
> > > > >
> > > > > 1. 50 hot pages on active list
> > > > > 50(h) | 0
> > > > >
> > > > > 2. workload: 50 newly created (used-once) pages
> > > > > 50(uo) | 50(h)
> > > > >
> > > > > 3. workload: another 50 newly created (used-once) pages
> > > > > 50(uo) | 50(uo), swap-out 50(h)
> > > > >
> > > > > As we can see, hot pages are swapped-out and it would cause swap-in later."
> > > > >
> > > > > Is MGLRU able to avoid the swap-out of the 50 hot pages?
> > > >
> > > > I think the real question is why the 50 hot pages can be moved to the
> > > > inactive list. If they are really hot, the A-bit should protect them.
> > >
> > > This is a good question.
> > >
> > > I guess it  is probably because the current lru is trying to maintain a balance
> > > between the sizes of active and inactive lists. Thus, it can shrink active list
> > > even though pages might be still "hot" but not the recently accessed ones.
> > >
> > > 1. 50 hot pages on active list
> > > 50(h) | 0
> > >
> > > 2. workload: 50 newly created (used-once) pages
> > > 50(uo) | 50(h)
> > >
> > > 3. workload: another 50 newly created (used-once) pages
> > > 50(uo) | 50(uo), swap-out 50(h)
> > >
> > > the old kernel without anon workingset protection put workload 2 on active, so
> > > pushed 50 hot pages from active to inactive. workload 3 would further contribute
> > > to evict the 50 hot pages.
> > >
> > > it seems mglru doesn't demote pages from the youngest generation to older
> > > generation only in order to balance the list size? so mglru is probably safe
> > > in these cases.
> > >
> > > I will run some tests mentioned in Kim's patchset and report the result to you
> > > afterwards.
> > >
> >
> > Hi Yu,
> > I did find putting faulted pages to the youngest generation lead to some
> > regression in the case ebizzy Kim's patchset mentioned while he tried
> > to support workingset protection for anon pages.
> > i did a little bit modification for rand_chunk() which is probably similar
> > with the modifcation() Kim mentioned in his patchset. The modification
> > can be found here:
> > https://github.com/21cnbao/ltp/commit/7134413d747bfa9ef
> >
> > The test env is a x86 machine in which I have set memory size to 2.5GB and
> > set zRAM to 2GB and disabled external disk swap.
> >
> > with the vanilla kernel:
> > \time -v ./a.out -vv -t 4 -s 209715200 -S 200000
> >
> > so we have 10 chunks and 4 threads, each trunk is 209715200(200MB)
> >
> > typical result:
> >         Command being timed: "./a.out -vv -t 4 -s 209715200 -S 200000"
> >         User time (seconds): 36.19
> >         System time (seconds): 229.72
> >         Percent of CPU this job got: 371%
> >         Elapsed (wall clock) time (h:mm:ss or m:ss): 1:11.59
> >         Average shared text size (kbytes): 0
> >         Average unshared data size (kbytes): 0
> >         Average stack size (kbytes): 0
> >         Average total size (kbytes): 0
> >         Maximum resident set size (kbytes): 2166196
> >         Average resident set size (kbytes): 0
> >         Major (requiring I/O) page faults: 9990128
> >         Minor (reclaiming a frame) page faults: 33315945
> >         Voluntary context switches: 59144
> >         Involuntary context switches: 167754
> >         Swaps: 0
> >         File system inputs: 2760
> >         File system outputs: 8
> >         Socket messages sent: 0
> >         Socket messages received: 0
> >         Signals delivered: 0
> >         Page size (bytes): 4096
> >         Exit status: 0
> >
> > with gen_lru and lru_gen/enabled=0x3:
> > typical result:
> > Command being timed: "./a.out -vv -t 4 -s 209715200 -S 200000"
> > User time (seconds): 36.34
> > System time (seconds): 276.07
> > Percent of CPU this job got: 378%
> > Elapsed (wall clock) time (h:mm:ss or m:ss): 1:22.46
> >            **** 15% time +
> > Average shared text size (kbytes): 0
> > Average unshared data size (kbytes): 0
> > Average stack size (kbytes): 0
> > Average total size (kbytes): 0
> > Maximum resident set size (kbytes): 2168120
> > Average resident set size (kbytes): 0
> > Major (requiring I/O) page faults: 13362810
> >              ***** 30% page fault +
> > Minor (reclaiming a frame) page faults: 33394617
> > Voluntary context switches: 55216
> > Involuntary context switches: 137220
> > Swaps: 0
> > File system inputs: 4088
> > File system outputs: 8
> > Socket messages sent: 0
> > Socket messages received: 0
> > Signals delivered: 0
> > Page size (bytes): 4096
> > Exit status: 0
> >
> > with gen_lru and lru_gen/enabled=0x7:
> > typical result:
> > Command being timed: "./a.out -vv -t 4 -s 209715200 -S 200000"
> > User time (seconds): 36.13
> > System time (seconds): 251.71
> > Percent of CPU this job got: 378%
> > Elapsed (wall clock) time (h:mm:ss or m:ss): 1:16.00
> >          *****better than enabled=0x3, worse than vanilla
> > Average shared text size (kbytes): 0
> > Average unshared data size (kbytes): 0
> > Average stack size (kbytes): 0
> > Average total size (kbytes): 0
> > Maximum resident set size (kbytes): 2120988
> > Average resident set size (kbytes): 0
> > Major (requiring I/O) page faults: 12706512
> > Minor (reclaiming a frame) page faults: 33422243
> > Voluntary context switches: 49485
> > Involuntary context switches: 126765
> > Swaps: 0
> > File system inputs: 2976
> > File system outputs: 8
> > Socket messages sent: 0
> > Socket messages received: 0
> > Signals delivered: 0
> > Page size (bytes): 4096
> > Exit status: 0
> >
> > I can also reproduce the problem on arm64.
> >
> > I am not saying this is going to block mglru from being mainlined. But  I am
> > still curious if this is an issue worth being addressed somehow in mglru.
>
> You've missed something very important: *thoughput* :)
>

Nope :-)
In the test case, there are 4 threads searching for a key in 10 chunks of
memory, and each chunk is 200MB.
A "random" chunk index is returned for those threads to search, but chunk 2
is the hottest, and chunks 3, 7 and 4 are relatively hotter than the others.
static inline unsigned int rand_chunk(void)
{
	/*
	 * Simulate hot and cold chunks: chunk 2 is returned 6 times out of
	 * every 16 calls, chunks 3, 4 and 7 twice each, and the rest once.
	 */
	unsigned int rand[16] = {2, 2, 3, 4, 5, 2, 6, 7, 9, 2, 8, 3, 7, 2, 2, 4};
	static int nr = 0;

	return rand[nr++ % 16];
}

each thread does search_mem():
static unsigned int search_mem(void)
{
	record_t key, *found;
	record_t *src, *copy;
	unsigned int chunk;
	size_t copy_size = chunk_size;
	unsigned int i;
	unsigned int state = 0;

	/* run 160 loops or till timeout */
	for (i = 0; threads_go == 1 && i < 160; i++) {
		chunk = rand_chunk();
		src = mem[chunk];
		...
		copy = alloc_mem(copy_size);
		...
		memcpy(copy, src, copy_size);

		key = rand_num(copy_size / record_size, &state);

		found = bsearch(&key, copy, copy_size / record_size,
				record_size, compare);

		/* Below check is mainly for memory corruption or other bug */
		if (found == NULL) {
			fprintf(stderr, "Couldn't find key %zd\n", key);
			exit(1);
		}
		/* ... (the "if (!touch_pages)" wrapper in the original is elided) ... */

		free_mem(copy, copy_size);
	}

	return (i);
}

Each thread picks a chunk, allocates new memory, copies the chunk into the
newly allocated memory, and then searches for a key in it.

Since I have set the time limit very high via -S, each thread actually exits
once it completes its 160 loops:
$ \time -v ./ebizzy -t 4 -s $((200*1024*1024)) -S 6000000

So the amount of work is fixed, and whichever kernel finishes the whole job
earlier also wins on throughput; the ~15% longer elapsed time with
enabled=0x3 directly translates into lower throughput.

> Dollars to doughnuts there was a large increase in throughput -- I
> haven't tried this benchmark but I've seen many reports similar to
> this one.

I have no doubt about this. I am just trying to figure out whether there is
some further potential we can unlock in MGLRU.

Thanks,
Barry


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
       [not found]                         ` <CAGsJ_4zvj5rmz7DkW-kJx+jmUT9G8muLJ9De--NZma9ey0Oavw@mail.gmail.com>
@ 2022-03-15 10:29                           ` Barry Song
  2022-03-16  2:46                             ` Yu Zhao
  0 siblings, 1 reply; 74+ messages in thread
From: Barry Song @ 2022-03-15 10:29 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Konstantin Kharlamov, Michael Larabel, Andi Kleen, Andrew Morton,
	Aneesh Kumar K . V, Jens Axboe, Brian Geffon, Catalin Marinas,
	Jonathan Corbet, Donald Carr, Dave Hansen, Daniel Byrne,
	Johannes Weiner, Hillf Danton, Jan Alexander Steffens,
	Holger Hoffstätte, Jesse Barnes, Linux ARM,
	open list:DOCUMENTATION, linux-kernel, Linux-MM, Mel Gorman,
	Michal Hocko, Oleksandr Natalenko, Kernel Page Reclaim v2,
	Rik van Riel, Mike Rapoport, Sofia Trinh, Steven Barrett,
	Suleiman Souhlal, Shuang Zhai, Linus Torvalds, Vlastimil Babka,
	Will Deacon, Matthew Wilcox, the arch/x86 maintainers,
	Huang Ying

On Tue, Mar 15, 2022 at 10:27 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Mar 15, 2022 at 6:18 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Mon, Mar 14, 2022 at 5:38 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Tue, Mar 15, 2022 at 5:45 AM Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > On Mon, Mar 14, 2022 at 5:12 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > > > > >
> > > > > > > > > > We used to put a faulted file page in inactive, if we access it a
> > > > > > > > > > second time, it can be promoted
> > > > > > > > > > to active. then in recent years, we have also applied this to anon
> > > > > > > > > > pages while kernel adds
> > > > > > > > > > workingset protection for anon pages. so basically both anon and file
> > > > > > > > > > pages go into the inactive
> > > > > > > > > > list for the 1st time, if we access it for the second time, they go to
> > > > > > > > > > the active list. if we don't access
> > > > > > > > > > it any more, they are likely to be reclaimed as they are inactive.
> > > > > > > > > > we do have some special fastpath for code section, executable file
> > > > > > > > > > pages are kept on active list
> > > > > > > > > > as long as they are accessed.
> > > > > > > > >
> > > > > > > > > Yes.
> > > > > > > > >
> > > > > > > > > > so all of the above concerns are actually not that correct?
> > > > > > > > >
> > > > > > > > > They are valid concerns but I don't know any popular workloads that
> > > > > > > > > care about them.
> > > > > > > >
> > > > > > > > Hi Yu,
> > > > > > > > here we can get a workload in Kim's patchset while he added workingset
> > > > > > > > protection
> > > > > > > > for anon pages:
> > > > > > > > https://patchwork.kernel.org/project/linux-mm/cover/1581401993-20041-1-git-send-email-iamjoonsoo.kim@lge.com/
> > > > > > >
> > > > > > > Thanks. I wouldn't call that a workload because it's not a real
> > > > > > > application. By popular workloads, I mean applications that the
> > > > > > > majority of people actually run on phones, in cloud, etc.
> > > > > > >
> > > > > > > > anon pages used to go to active rather than inactive, but kim's patchset
> > > > > > > > moved to use inactive first. then only after the anon page is accessed
> > > > > > > > second time, it can move to active.
> > > > > > >
> > > > > > > Yes. To clarify, the A-bit doesn't really mean the first or second
> > > > > > > access. It can be many accesses each time it's set.
> > > > > > >
> > > > > > > > "In current implementation, newly created or swap-in anonymous page is
> > > > > > > >
> > > > > > > > started on the active list. Growing the active list results in rebalancing
> > > > > > > > active/inactive list so old pages on the active list are demoted to the
> > > > > > > > inactive list. Hence, hot page on the active list isn't protected at all.
> > > > > > > >
> > > > > > > > Following is an example of this situation.
> > > > > > > >
> > > > > > > > Assume that 50 hot pages on active list and system can contain total
> > > > > > > > 100 pages. Numbers denote the number of pages on active/inactive
> > > > > > > > list (active | inactive). (h) stands for hot pages and (uo) stands for
> > > > > > > > used-once pages.
> > > > > > > >
> > > > > > > > 1. 50 hot pages on active list
> > > > > > > > 50(h) | 0
> > > > > > > >
> > > > > > > > 2. workload: 50 newly created (used-once) pages
> > > > > > > > 50(uo) | 50(h)
> > > > > > > >
> > > > > > > > 3. workload: another 50 newly created (used-once) pages
> > > > > > > > 50(uo) | 50(uo), swap-out 50(h)
> > > > > > > >
> > > > > > > > As we can see, hot pages are swapped-out and it would cause swap-in later."
> > > > > > > >
> > > > > > > > Is MGLRU able to avoid the swap-out of the 50 hot pages?
> > > > > > >
> > > > > > > I think the real question is why the 50 hot pages can be moved to the
> > > > > > > inactive list. If they are really hot, the A-bit should protect them.
> > > > > >
> > > > > > This is a good question.
> > > > > >
> > > > > > I guess it  is probably because the current lru is trying to maintain a balance
> > > > > > between the sizes of active and inactive lists. Thus, it can shrink active list
> > > > > > even though pages might be still "hot" but not the recently accessed ones.
> > > > > >
> > > > > > 1. 50 hot pages on active list
> > > > > > 50(h) | 0
> > > > > >
> > > > > > 2. workload: 50 newly created (used-once) pages
> > > > > > 50(uo) | 50(h)
> > > > > >
> > > > > > 3. workload: another 50 newly created (used-once) pages
> > > > > > 50(uo) | 50(uo), swap-out 50(h)
> > > > > >
> > > > > > the old kernel without anon workingset protection put workload 2 on the active
> > > > > > list, which pushed the 50 hot pages from active to inactive. Workload 3 would
> > > > > > then further contribute to evicting the 50 hot pages.
> > > > > >
> > > > > > It seems MGLRU doesn't demote pages from the youngest generation to an older
> > > > > > generation merely to balance list sizes, so MGLRU is probably safe
> > > > > > in these cases.
> > > > > >
> > > > > > I will run some tests mentioned in Kim's patchset and report the result to you
> > > > > > afterwards.
> > > > > >
> > > > >
> > > > > Hi Yu,
> > > > > I did find that putting faulted pages into the youngest generation led to some
> > > > > regression in the ebizzy case Kim's patchset mentioned while he tried
> > > > > to support workingset protection for anon pages.
> > > > > I made a small modification to rand_chunk(), probably similar to the
> > > > > modification Kim mentioned in his patchset. The modification
> > > > > can be found here:
> > > > > https://github.com/21cnbao/ltp/commit/7134413d747bfa9ef
> > > > >
> > > > > The test env is an x86 machine on which I have set the memory size to 2.5GB,
> > > > > set zRAM to 2GB and disabled external disk swap.
> > > > >
> > > > > with the vanilla kernel:
> > > > > \time -v ./a.out -vv -t 4 -s 209715200 -S 200000
> > > > >
> > > > > so we have 10 chunks and 4 threads; each chunk is 209715200 bytes (200MB)
> > > > >
> > > > > typical result:
> > > > >         Command being timed: "./a.out -vv -t 4 -s 209715200 -S 200000"
> > > > >         User time (seconds): 36.19
> > > > >         System time (seconds): 229.72
> > > > >         Percent of CPU this job got: 371%
> > > > >         Elapsed (wall clock) time (h:mm:ss or m:ss): 1:11.59
> > > > >         Average shared text size (kbytes): 0
> > > > >         Average unshared data size (kbytes): 0
> > > > >         Average stack size (kbytes): 0
> > > > >         Average total size (kbytes): 0
> > > > >         Maximum resident set size (kbytes): 2166196
> > > > >         Average resident set size (kbytes): 0
> > > > >         Major (requiring I/O) page faults: 9990128
> > > > >         Minor (reclaiming a frame) page faults: 33315945
> > > > >         Voluntary context switches: 59144
> > > > >         Involuntary context switches: 167754
> > > > >         Swaps: 0
> > > > >         File system inputs: 2760
> > > > >         File system outputs: 8
> > > > >         Socket messages sent: 0
> > > > >         Socket messages received: 0
> > > > >         Signals delivered: 0
> > > > >         Page size (bytes): 4096
> > > > >         Exit status: 0
> > > > >
> > > > > with gen_lru and lru_gen/enabled=0x3:
> > > > > typical result:
> > > > > Command being timed: "./a.out -vv -t 4 -s 209715200 -S 200000"
> > > > > User time (seconds): 36.34
> > > > > System time (seconds): 276.07
> > > > > Percent of CPU this job got: 378%
> > > > > Elapsed (wall clock) time (h:mm:ss or m:ss): 1:22.46
> > > > >            **** ~15% more elapsed time than vanilla
> > > > > Average shared text size (kbytes): 0
> > > > > Average unshared data size (kbytes): 0
> > > > > Average stack size (kbytes): 0
> > > > > Average total size (kbytes): 0
> > > > > Maximum resident set size (kbytes): 2168120
> > > > > Average resident set size (kbytes): 0
> > > > > Major (requiring I/O) page faults: 13362810
> > > > >              ***** ~30% more major page faults than vanilla
> > > > > Minor (reclaiming a frame) page faults: 33394617
> > > > > Voluntary context switches: 55216
> > > > > Involuntary context switches: 137220
> > > > > Swaps: 0
> > > > > File system inputs: 4088
> > > > > File system outputs: 8
> > > > > Socket messages sent: 0
> > > > > Socket messages received: 0
> > > > > Signals delivered: 0
> > > > > Page size (bytes): 4096
> > > > > Exit status: 0
> > > > >
> > > > > with gen_lru and lru_gen/enabled=0x7:
> > > > > typical result:
> > > > > Command being timed: "./a.out -vv -t 4 -s 209715200 -S 200000"
> > > > > User time (seconds): 36.13
> > > > > System time (seconds): 251.71
> > > > > Percent of CPU this job got: 378%
> > > > > Elapsed (wall clock) time (h:mm:ss or m:ss): 1:16.00
> > > > >          ***** better than enabled=0x3, worse than vanilla
> > > > > Average shared text size (kbytes): 0
> > > > > Average unshared data size (kbytes): 0
> > > > > Average stack size (kbytes): 0
> > > > > Average total size (kbytes): 0
> > > > > Maximum resident set size (kbytes): 2120988
> > > > > Average resident set size (kbytes): 0
> > > > > Major (requiring I/O) page faults: 12706512
> > > > > Minor (reclaiming a frame) page faults: 33422243
> > > > > Voluntary context switches: 49485
> > > > > Involuntary context switches: 126765
> > > > > Swaps: 0
> > > > > File system inputs: 2976
> > > > > File system outputs: 8
> > > > > Socket messages sent: 0
> > > > > Socket messages received: 0
> > > > > Signals delivered: 0
> > > > > Page size (bytes): 4096
> > > > > Exit status: 0
> > > > >
> > > > > I can also reproduce the problem on arm64.
> > > > >
> > > > > I am not saying this is going to block MGLRU from being mainlined. But I am
> > > > > still curious whether this is an issue worth addressing somehow in MGLRU.
> > > >
> > > > You've missed something very important: *throughput* :)
> > > >
> > >
> > > noop :-)
> > > In the test case, there are 4 threads searching for a key in 10 chunks
> > > of memory; each chunk is 200MB.
> > > A "random" chunk index is returned for those threads to search, but chunk 2
> > > is the hottest, and chunks 3, 7 and 4 are relatively hotter than the others.
> > > static inline unsigned int rand_chunk(void)
> > > {
> > >         /* simulate hot and cold chunk */
> > >         unsigned int rand[16] = {2, 2, 3, 4, 5, 2, 6, 7, 9, 2, 8, 3, 7, 2, 2, 4};
> >
> > This is sequential access, not what you claim above, because you have
> > a repeating sequence.
> >
> > In this case MGLRU is expected to be slower because it doesn't try to
> > optimize it, as discussed before [1]. The reason is, with a manageable
> > complexity, we can only optimize so many things. And MGLRU chose to
> > optimize (arguably) popular workloads, since, AFAIK, no real-world
> > application streams anon memory.
> >
> > To verify this is indeed sequential access, you could make rand[]
> > larger, e.g., 160 entries, with the same proportions of 2s, 3s, 4s, etc.,
> > but with their positions randomized. The following change shows MGLRU is
> > ~20% faster on my Snapdragon 7c + 2.5GB DRAM + 2GB zram.
> >
> >  static inline unsigned int rand_chunk(void)
> >  {
> >         /* simulate hot and cold chunk */
> > -       unsigned int rand[16] = {2, 2, 3, 4, 5, 2, 6, 7, 9, 2, 8, 3, 7, 2, 2, 4};
> > +       unsigned int rand[160] = {
> > +               2, 4, 7, 3, 4, 2, 7, 2, 7, 8, 6, 9, 7, 6, 5, 4,
> > +               6, 2, 6, 4, 2, 9, 2, 5, 5, 4, 7, 2, 7, 7, 5, 2,
> > +               4, 4, 3, 3, 2, 4, 2, 2, 5, 2, 4, 2, 8, 2, 2, 3,
> > +               2, 2, 2, 2, 2, 8, 4, 2, 2, 4, 2, 2, 2, 2, 3, 2,
> > +               8, 5, 2, 2, 3, 2, 8, 2, 6, 2, 4, 8, 5, 2, 9, 2,
> > +               8, 7, 9, 2, 4, 4, 3, 3, 2, 8, 2, 2, 3, 3, 2, 7,
> > +               7, 5, 2, 2, 8, 2, 2, 2, 5, 2, 4, 3, 2, 3, 6, 3,
> > +               3, 3, 9, 4, 2, 3, 9, 7, 7, 6, 2, 2, 4, 2, 6, 2,
> > +               9, 7, 7, 7, 9, 3, 4, 2, 3, 2, 7, 3, 2, 2, 2, 6,
> > +               8, 3, 7, 6, 2, 2, 2, 4, 7, 2, 5, 7, 4, 7, 9, 9,
> > +       };
> >         static int nr = 0;
> > -       return rand[nr++%16];
> > +       return rand[nr++%160];
> >  }
> >
> > Better yet, you could use some standard benchmark suites, written by
> > reputable organizations, e.g., memtier or YCSB, to generate more
> > realistic distributions, as I've suggested before [2].
> >
> > >         static int nr = 0;
> > >         return rand[nr++%16];
> > > }
> > >
> > > each thread does search_mem():
> > > static unsigned int search_mem(void)
> > > {
> > >         record_t key, *found;
> > >         record_t *src, *copy;
> > >         unsigned int chunk;
> > >         size_t copy_size = chunk_size;
> > >         unsigned int i;
> > >         unsigned int state = 0;
> > >
> > >         /* run 160 loops or till timeout */
> > >         for (i = 0; threads_go == 1 && i < 160; i++) {
> >
> > I see you've modified the original benchmark. But with "-S 200000",
> > should this test finish within an hour instead of the following?
> >     Elapsed (wall clock) time (h:mm:ss or m:ss): 1:11.59
> >
> > >                 chunk = rand_chunk();
> > >                 src = mem[chunk];
> > >                 ...
> > >                 copy = alloc_mem(copy_size);
> > >                 ...
> > >                 memcpy(copy, src, copy_size);
> > >
> > >                 key = rand_num(copy_size / record_size, &state);
> > >
> > >                 found = bsearch(&key, copy, copy_size / record_size,
> > >                         record_size, compare);
> > >
> > >                         /* Below check is mainly for memory corruption or other bug */
> > >                         if (found == NULL) {
> > >                                 fprintf(stderr, "Couldn't find key %zd\n", key);
> > >                                 exit(1);
> > >                         }
> > >                 }               /* end if ! touch_pages */
> > >
> > >                 free_mem(copy, copy_size);
> > >         }
> > >
> > >         return (i);
> > > }
> > >
> > > each thread picks up a chunk, then allocates a new memory and copies the chunk to the
> > > new allocated memory, and searches a key in the allocated memory.
> > >
> > > as I have set the time to a rather large value via -S, each thread actually
> > > exits once it completes 160 loops.
> > > $ \time -v ./ebizzy -t 4 -s $((200*1024*1024)) -S 6000000
> >
> > Ok, you actually used "-S 6000000".
>
> There are two exit conditions: either the 160 loops have been completed or
> the -S timeout is hit. Since -S is very big, the process exits on completion
> of the 160 loops.
>
> I am seeing MGLRU getting very similar speed to the vanilla LRU when
> using your rand_chunk() with 160 entries. The command is like:
> \time -v ./a.out -t 4 -s $((200*1024*1024)) -S 600000 -m
>
> The time to complete the jobs becomes more variable, but on average,
> MGLRU seems to be 5% faster. Actually, I am seeing MGLRU can be faster
> than vanilla even with more page faults. For example,
>
> MGLRU:
>         Command being timed: "./mt.out -t 4 -s 209715200 -S 600000 -m"
>         User time (seconds): 32.68
>         System time (seconds): 227.19
>         Percent of CPU this job got: 370%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 1:10.23
>         Average shared text size (kbytes): 0
>         Average unshared data size (kbytes): 0
>         Average stack size (kbytes): 0
>         Average total size (kbytes): 0
>         Maximum resident set size (kbytes): 2175292
>         Average resident set size (kbytes): 0
>         Major (requiring I/O) page faults: 10977244
>         Minor (reclaiming a frame) page faults: 33447638
>         Voluntary context switches: 44466
>         Involuntary context switches: 108413
>         Swaps: 0
>         File system inputs: 7704
>         File system outputs: 8
>         Socket messages sent: 0
>         Socket messages received: 0
>         Signals delivered: 0
>         Page size (bytes): 4096
>         Exit status: 0
>
>
> VANILLA:
>         Command being timed: "./mt.out -t 4 -s 209715200 -S 600000 -m"
>         User time (seconds): 32.20
>         System time (seconds): 248.18
>         Percent of CPU this job got: 371%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 1:15.55
>         Average shared text size (kbytes): 0
>         Average unshared data size (kbytes): 0
>         Average stack size (kbytes): 0
>         Average total size (kbytes): 0
>         Maximum resident set size (kbytes): 2174384
>         Average resident set size (kbytes): 0
>         Major (requiring I/O) page faults: 10002206
>         Minor (reclaiming a frame) page faults: 33392151
>         Voluntary context switches: 76966
>         Involuntary context switches: 184841
>         Swaps: 0
>         File system inputs: 2032
>         File system outputs: 8
>         Socket messages sent: 0
>         Socket messages received: 0
>         Signals delivered: 0
>         Page size (bytes): 4096
>         Exit status: 0
>

basically a perf comparison:
vanilla:
    23.81%  [lz4_compress]  [k] LZ4_compress_fast_extState
    14.15%  [kernel]        [k] LZ4_decompress_safe
    10.48%  libc-2.33.so    [.] __memmove_avx_unaligned_erms
     2.49%  [kernel]        [k] native_queued_spin_lock_slowpath
     2.05%  [kernel]        [k] clear_page_erms
     1.69%  [kernel]        [k] native_irq_return_iret
     1.49%  [kernel]        [k] mem_cgroup_css_rstat_flush
     1.05%  [kernel]        [k] _raw_spin_lock
     1.05%  [kernel]        [k] sync_regs
     1.00%  [kernel]        [k] smp_call_function_many_cond
     0.97%  [kernel]        [k] memset_erms
     0.95%  [zram]          [k] zram_bvec_rw.constprop.0
     0.91%  [kernel]        [k] down_read_trylock
     0.90%  [kernel]        [k] memcpy_erms
     0.89%  [zram]          [k] __zram_bvec_read.constprop.0
     0.88%  [kernel]        [k] psi_group_change
     0.84%  [kernel]        [k] isolate_lru_pages
     0.78%  [kernel]        [k] zs_map_object
     0.76%  [kernel]        [k] __handle_mm_fault
     0.72%  [kernel]        [k] page_vma_mapped_walk

mglru:
    23.43%  [lz4_compress]  [k] LZ4_compress_fast_extState
    16.90%  [kernel]        [k] LZ4_decompress_safe
    12.60%  libc-2.33.so    [.] __memmove_avx_unaligned_erms
     2.26%  [kernel]        [k] clear_page_erms
     2.06%  [kernel]        [k] native_queued_spin_lock_slowpath
     1.77%  [kernel]        [k] native_irq_return_iret
     1.18%  [kernel]        [k] sync_regs
     1.12%  [zram]          [k] __zram_bvec_read.constprop.0
     0.98%  [kernel]        [k] psi_group_change
     0.97%  [zram]          [k] zram_bvec_rw.constprop.0
     0.96%  [kernel]        [k] memset_erms
     0.95%  [kernel]        [k] isolate_folios
     0.92%  [kernel]        [k] zs_map_object
     0.92%  [kernel]        [k] _raw_spin_lock
     0.87%  [kernel]        [k] memcpy_erms
     0.83%  [kernel]        [k] smp_call_function_many_cond
     0.83%  [kernel]        [k] __handle_mm_fault
     0.78%  [kernel]        [k] unmap_page_range
     0.71%  [kernel]        [k] rmqueue_bulk
     0.70%  [kernel]        [k] page_counter_uncharge

It seems the vanilla kernel spends more time in native_queued_spin_lock_slowpath(),
down_read_trylock(), mem_cgroup_css_rstat_flush(), isolate_lru_pages() and
page_vma_mapped_walk(), while MGLRU spends more time on decompression, memmove
and isolate_folios().

That is probably why MGLRU can be a bit faster even with more major page
faults.
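
For reference, profiles like the above can be collected with something along
these lines -- just a rough sketch, the exact perf options are not critical:

  perf record -a -g -- \time -v ./mt.out -t 4 -s 209715200 -S 600000 -m
  perf report --sort symbol --stdio > profile.txt

and then the top symbols of the two kernels can be compared side by side.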

>
> I guess the main cause of the regression for the previous sequence
> with 16 entries is that the ebizzy has a new allocated copy in
> search_mem(), which is mapped and used only once in each loop.
> and the temp copy can push out those hot chunks.
>
> Anyway, I understand it is a trade-off between warmly embracing new
> pages and holding old pages tightly. Real user cases from phone, server,
> desktop will be judging this better.
>
> >
> > [1] https://lore.kernel.org/linux-mm/YhNJ4LVWpmZgLh4I@google.com/
> > [2] https://lore.kernel.org/linux-mm/YgggI+vvtNvh3jBY@google.com/
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-03-15 10:29                           ` Barry Song
@ 2022-03-16  2:46                             ` Yu Zhao
  2022-03-16  4:37                               ` Barry Song
  0 siblings, 1 reply; 74+ messages in thread
From: Yu Zhao @ 2022-03-16  2:46 UTC (permalink / raw)
  To: Barry Song
  Cc: Konstantin Kharlamov, Michael Larabel, Andi Kleen, Andrew Morton,
	Aneesh Kumar K . V, Jens Axboe, Brian Geffon, Catalin Marinas,
	Jonathan Corbet, Donald Carr, Dave Hansen, Daniel Byrne,
	Johannes Weiner, Hillf Danton, Jan Alexander Steffens,
	Holger Hoffstätte, Jesse Barnes, Linux ARM,
	open list:DOCUMENTATION, linux-kernel, Linux-MM, Mel Gorman,
	Michal Hocko, Oleksandr Natalenko, Kernel Page Reclaim v2,
	Rik van Riel, Mike Rapoport, Sofia Trinh, Steven Barrett,
	Suleiman Souhlal, Shuang Zhai, Linus Torvalds, Vlastimil Babka,
	Will Deacon, Matthew Wilcox, the arch/x86 maintainers,
	Huang Ying

On Tue, Mar 15, 2022 at 4:29 AM Barry Song <21cnbao@gmail.com> wrote:

<snipped>

> > I guess the main cause of the regression for the previous sequence
> > with 16 entries is that the ebizzy has a new allocated copy in
> > search_mem(), which is mapped and used only once in each loop.
> > and the temp copy can push out those hot chunks.
> >
> > Anyway, I understand it is a trade-off between warmly embracing new
> > pages and holding old pages tightly. Real user cases from phone, server,
> > desktop will be judging this better.

Thanks for all the details. I looked into them today and found no
regressions when running with your original program.

After I explain why, I hope you'd be convinced that using programs
like this one is not a good way to measure things :)

Problems:
1) Given the 2.5GB configuration and a sequence of cold/hot chunks, I
assume your program tries to simulate a handful of apps running on a
phone.  A short repeating sequence is closer to sequential access than
to real user behaviors, as I suggested last time. You could check out
how something similar is done here [1].
2) Under the same assumption (phone), C programs are very different
from Android apps in terms of runtime memory behaviors, e.g., JVM GC
[2].
3) Assuming you are interested in the runtime memory behavior of C/C++
programs, your program is still not very representative. All C/C++
programs I'm familiar with choose to link against TCmalloc, jemalloc
or implement their own allocators. GNU libc, IMO, has a small market
share nowadays.
4) TCmalloc/jemalloc are not only optimized for multithreading, they
are also THP aware. THP is very important when benchmarking page
reclaim, e.g., two similarly warm THPs can comprise 511+1 or 1+511 of
warm+cold 4K pages. The LRU algorithm that chooses more of the former
is at the disadvantage. Unless it's recommended by the applications
you are trying to benchmark, THP should be disabled. (Android
generally doesn't use THP.)
5) Swap devices are also important. Zram should NOT be used unless you
know your benchmark doesn't generate incompressible data. The LRU
algorithm that chooses more incompressible pages is at disadvantage.

Here is my result: on the same Snapdragon 7c + 2.5GB RAM + 1.5GB
ramdisk swap, with your original program compiled against libc malloc
and TCMalloc, to 32-bit and 64-bit binaries:

# cat /sys/kernel/mm/lru_gen/enabled
0x0003
# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

# modprobe brd rd_nr=1 rd_size=1572864
# dd if=/dev/zero of=/dev/ram0 bs=1M
# mkswap /dev/ram0
# swapoff -a
# swapon /dev/ram0

# ldd test_absl_32
        linux-vdso.so.1 (0xf6e7f000)
        libabsl_malloc.so.2103.0.1 =>
/usr/lib/libabsl_malloc.so.2103.0.1 (0xf6e23000)
        libpthread.so.0 => /lib/libpthread.so.0 (0xf6dff000)
        libc.so.6 => /lib/libc.so.6 (0xf6d07000)
        /lib/ld-linux-armhf.so.3 (0x09df0000)
        libabsl_base.so.2103.0.1 => /usr/lib/libabsl_base.so.2103.0.1
(0xf6ce5000)
        libabsl_raw_logging.so.2103.0.1 =>
/usr/lib/libabsl_raw_logging.so.2103.0.1 (0xf6cc4000)
        libabsl_spinlock_wait.so.2103.0.1 =>
/usr/lib/libabsl_spinlock_wait.so.2103.0.1 (0xf6ca3000)
        libc++.so.1 => /usr/lib/libc++.so.1 (0xf6c04000)
        libc++abi.so.1 => /usr/lib/libc++abi.so.1 (0xf6bcd000)
# file test_absl_64
test_absl_64: ELF 64-bit LSB executable, ARM aarch64, version 1
(SYSV), statically linked
# ldd test_gnu_32
        linux-vdso.so.1 (0xeabef000)
        libpthread.so.0 => /lib/libpthread.so.0 (0xeab92000)
        libc.so.6 => /lib/libc.so.6 (0xeaa9a000)
        /lib/ld-linux-armhf.so.3 (0x05690000)
# file test_gnu_64
test_gnu_64: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV),
statically linked

### baseline 5.17-rc8

# perf record ./test_gnu_64 -t 4 -s $((200*1024*1024)) -S 6000000
10 records/s
real 59.00 s
user 39.83 s
sys  174.18 s

    18.51%  [.] memcpy
    15.98%  [k] __pi_clear_page
     5.59%  [k] rmqueue_pcplist
     5.19%  [k] do_raw_spin_lock
     5.09%  [k] memmove
     4.60%  [k] _raw_spin_unlock_irq
     3.62%  [k] _raw_spin_unlock_irqrestore
     3.61%  [k] free_unref_page_list
     3.29%  [k] zap_pte_range
     2.53%  [k] local_daif_restore
     2.50%  [k] down_read_trylock
     1.41%  [k] handle_mm_fault
     1.32%  [k] do_anonymous_page
     1.31%  [k] up_read
     1.03%  [k] free_swap_cache

### MGLRU v9

# perf record ./test_gnu_64 -t 4 -s $((200*1024*1024)) -S 6000000
11 records/s
real 57.00 s
user 39.39 s

    19.36%  [.] memcpy
    16.50%  [k] __pi_clear_page
     6.21%  [k] memmove
     5.57%  [k] rmqueue_pcplist
     5.07%  [k] do_raw_spin_lock
     4.96%  [k] _raw_spin_unlock_irqrestore
     4.25%  [k] free_unref_page_list
     3.80%  [k] zap_pte_range
     3.69%  [k] _raw_spin_unlock_irq
     2.71%  [k] local_daif_restore
     2.10%  [k] down_read_trylock
     1.50%  [k] handle_mm_fault
     1.29%  [k] do_anonymous_page
     1.17%  [k] free_swap_cache
     1.08%  [k] up_read

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/refs/heads/main/src/chromiumos/tast/local/memory/mempressure/mempressure.go
[2] https://developer.android.com/topic/performance/memory-overview


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-03-16  2:46                             ` Yu Zhao
@ 2022-03-16  4:37                               ` Barry Song
  2022-03-16  5:44                                 ` Yu Zhao
  0 siblings, 1 reply; 74+ messages in thread
From: Barry Song @ 2022-03-16  4:37 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Konstantin Kharlamov, Michael Larabel, Andi Kleen, Andrew Morton,
	Aneesh Kumar K . V, Jens Axboe, Brian Geffon, Catalin Marinas,
	Jonathan Corbet, Donald Carr, Dave Hansen, Daniel Byrne,
	Johannes Weiner, Hillf Danton, Jan Alexander Steffens,
	Holger Hoffstätte, Jesse Barnes, Linux ARM,
	open list:DOCUMENTATION, linux-kernel, Linux-MM, Mel Gorman,
	Michal Hocko, Oleksandr Natalenko, Kernel Page Reclaim v2,
	Rik van Riel, Mike Rapoport, Sofia Trinh, Steven Barrett,
	Suleiman Souhlal, Shuang Zhai, Linus Torvalds, Vlastimil Babka,
	Will Deacon, Matthew Wilcox, the arch/x86 maintainers,
	Huang Ying

On Wed, Mar 16, 2022 at 3:47 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Tue, Mar 15, 2022 at 4:29 AM Barry Song <21cnbao@gmail.com> wrote:
>
> <snipped>
>
> > > I guess the main cause of the regression for the previous sequence
> > > with 16 entries is that the ebizzy has a new allocated copy in
> > > search_mem(), which is mapped and used only once in each loop.
> > > and the temp copy can push out those hot chunks.
> > >
> > > Anyway, I understand it is a trade-off between warmly embracing new
> > > pages and holding old pages tightly. Real user cases from phone, server,
> > > desktop will be judging this better.
>
> Thanks for all the details. I looked into them today and found no
> regressions when running with your original program.
>
> After I explain why, I hope you'd be convinced that using programs
> like this one is not a good way to measure things :)
>

Yep. I agree ebizzy might not be a good way to measure things.
I chose it only because Kim's patchset, which moved anon pages
to the inactive list on the first detected access, was using it. Before Kim's
patchset, anon pages were placed on the active list from the very
beginning:
https://patchwork.kernel.org/project/linux-mm/cover/1581401993-20041-1-git-send-email-iamjoonsoo.kim@lge.com/

In ebizzy, there is a used-once memory allocation in each
search_mem(). I guess that is why Kim's patchset chose
it.

> Problems:
> 1) Given the 2.5GB configuration and a sequence of cold/hot chunks, I
> assume your program tries to simulate a handful of apps running on a
> phone.  A short repeating sequence is closer to sequential access than
> to real user behaviors, as I suggested last time. You could check out
> how something similar is done here [1].
> 2) Under the same assumption (phone), C programs are very different
> from Android apps in terms of runtime memory behaviors, e.g., JVM GC
> [2].
> 3) Assuming you are interested in the runtime memory behavior of C/C++
> programs, your program is still not very representative. All C/C++
> programs I'm familiar with choose to link against TCmalloc, jemalloc
> or implement their own allocators. GNU libc, IMO, has a small market
> share nowadays.
> 4) TCmalloc/jemalloc are not only optimized for multithreading, they
> are also THP aware. THP is very important when benchmarking page
> reclaim, e.g., two similarly warm THPs can comprise 511+1 or 1+511 of
> warm+cold 4K pages. The LRU algorithm that chooses more of the former
> is at the disadvantage. Unless it's recommended by the applications
> you are trying to benchmark, THP should be disabled. (Android
> generally doesn't use THP.)
> 5) Swap devices are also important. Zram should NOT be used unless you
> know your benchmark doesn't generate incompressible data. The LRU
> algorithm that chooses more incompressible pages is at disadvantage.
>

Thanks for all the information above. very useful.

> Here is my result: on the same Snapdragon 7c + 2.5GB RAM + 1.5GB
> ramdisk swap, with your original program compiled against libc malloc
> and TCMalloc, to 32-bit and 64-bit binaries:

I noticed an important difference: you are using a ramdisk, so there
is no "I/O" cost. I assume compression/decompression is the I/O cost for
zRAM.
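
FWIW, a rough way to check how compressible the benchmark's anon pages
actually are -- just a sketch, assuming a single zram device at /dev/zram0 --
is to read zram's stats after a run; the first two fields of mm_stat are the
original and the compressed data sizes:

  cat /sys/block/zram0/comp_algorithm
  cat /sys/block/zram0/mm_stat

A ratio close to 1:1 would mean the data is mostly incompressible.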

>
> # cat /sys/kernel/mm/lru_gen/enabled
> 0x0003
> # cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
>
> # modprobe brd rd_nr=1 rd_size=1572864
> # if=/dev/zero of=/dev/ram0 bs=1M
> # mkswap /dev/ram0
> # swapoff -a
> # swapon /dev/ram0
>
> # ldd test_absl_32
>         linux-vdso.so.1 (0xf6e7f000)
>         libabsl_malloc.so.2103.0.1 =>
> /usr/lib/libabsl_malloc.so.2103.0.1 (0xf6e23000)
>         libpthread.so.0 => /lib/libpthread.so.0 (0xf6dff000)
>         libc.so.6 => /lib/libc.so.6 (0xf6d07000)
>         /lib/ld-linux-armhf.so.3 (0x09df0000)
>         libabsl_base.so.2103.0.1 => /usr/lib/libabsl_base.so.2103.0.1
> (0xf6ce5000)
>         libabsl_raw_logging.so.2103.0.1 =>
> /usr/lib/libabsl_raw_logging.so.2103.0.1 (0xf6cc4000)
>         libabsl_spinlock_wait.so.2103.0.1 =>
> /usr/lib/libabsl_spinlock_wait.so.2103.0.1 (0xf6ca3000)
>         libc++.so.1 => /usr/lib/libc++.so.1 (0xf6c04000)
>         libc++abi.so.1 => /usr/lib/libc++abi.so.1 (0xf6bcd000)
> # file test_absl_64
> test_absl_64: ELF 64-bit LSB executable, ARM aarch64, version 1
> (SYSV), statically linked
> # ldd test_gnu_32
>         linux-vdso.so.1 (0xeabef000)
>         libpthread.so.0 => /lib/libpthread.so.0 (0xeab92000)
>         libc.so.6 => /lib/libc.so.6 (0xeaa9a000)
>         /lib/ld-linux-armhf.so.3 (0x05690000)
> # file test_gnu_64
> test_gnu_64: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV),
> statically linked
>
> ### baseline 5.17-rc8
>
> # perf record ./test_gnu_64 -t 4 -s $((200*1024*1024)) -S 6000000
> 10 records/s
> real 59.00 s
> user 39.83 s
> sys  174.18 s
>
>     18.51%  [.] memcpy
>     15.98%  [k] __pi_clear_page
>      5.59%  [k] rmqueue_pcplist
>      5.19%  [k] do_raw_spin_lock
>      5.09%  [k] memmove
>      4.60%  [k] _raw_spin_unlock_irq
>      3.62%  [k] _raw_spin_unlock_irqrestore
>      3.61%  [k] free_unref_page_list
>      3.29%  [k] zap_pte_range
>      2.53%  [k] local_daif_restore
>      2.50%  [k] down_read_trylock
>      1.41%  [k] handle_mm_fault
>      1.32%  [k] do_anonymous_page
>      1.31%  [k] up_read
>      1.03%  [k] free_swap_cache
>
> ### MGLRU v9
>
> # perf record ./test_gnu_64 -t 4 -s $((200*1024*1024)) -S 6000000
> 11 records/s
> real 57.00 s
> user 39.39 s
>
>     19.36%  [.] memcpy
>     16.50%  [k] __pi_clear_page
>      6.21%  [k] memmove
>      5.57%  [k] rmqueue_pcplist
>      5.07%  [k] do_raw_spin_lock
>      4.96%  [k] _raw_spin_unlock_irqrestore
>      4.25%  [k] free_unref_page_list
>      3.80%  [k] zap_pte_range
>      3.69%  [k] _raw_spin_unlock_irq
>      2.71%  [k] local_daif_restore
>      2.10%  [k] down_read_trylock
>      1.50%  [k] handle_mm_fault
>      1.29%  [k] do_anonymous_page
>      1.17%  [k] free_swap_cache
>      1.08%  [k] up_read
>

I think your result is right. But if you take a look at the number of
major faults, will you find MGLRU has more page faults?
I ask this question because I can see MGLRU winning even with a lower
hit ratio in the previous report I sent.

> [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/refs/heads/main/src/chromiumos/tast/local/memory/mempressure/mempressure.go
> [2] https://developer.android.com/topic/performance/memory-overview

Thanks
Barry


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-03-16  4:37                               ` Barry Song
@ 2022-03-16  5:44                                 ` Yu Zhao
  2022-03-16  6:06                                   ` Barry Song
  0 siblings, 1 reply; 74+ messages in thread
From: Yu Zhao @ 2022-03-16  5:44 UTC (permalink / raw)
  To: Barry Song
  Cc: Konstantin Kharlamov, Michael Larabel, Andi Kleen, Andrew Morton,
	Aneesh Kumar K . V, Jens Axboe, Brian Geffon, Catalin Marinas,
	Jonathan Corbet, Donald Carr, Dave Hansen, Daniel Byrne,
	Johannes Weiner, Hillf Danton, Jan Alexander Steffens,
	Holger Hoffstätte, Jesse Barnes, Linux ARM,
	open list:DOCUMENTATION, linux-kernel, Linux-MM, Mel Gorman,
	Michal Hocko, Oleksandr Natalenko, Kernel Page Reclaim v2,
	Rik van Riel, Mike Rapoport, Sofia Trinh, Steven Barrett,
	Suleiman Souhlal, Shuang Zhai, Linus Torvalds, Vlastimil Babka,
	Will Deacon, Matthew Wilcox, the arch/x86 maintainers,
	Huang Ying

On Tue, Mar 15, 2022 at 10:37 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Mar 16, 2022 at 3:47 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Tue, Mar 15, 2022 at 4:29 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > <snipped>
> >
> > > > I guess the main cause of the regression for the previous sequence
> > > > with 16 entries is that the ebizzy has a new allocated copy in
> > > > search_mem(), which is mapped and used only once in each loop.
> > > > and the temp copy can push out those hot chunks.
> > > >
> > > > Anyway, I understand it is a trade-off between warmly embracing new
> > > > pages and holding old pages tightly. Real user cases from phone, server,
> > > > desktop will be judging this better.
> >
> > Thanks for all the details. I looked into them today and found no
> > regressions when running with your original program.
> >
> > After I explain why, I hope you'd be convinced that using programs
> > like this one is not a good way to measure things :)
> >
>
> Yep. I agree ebizzy might not be a good one to measure things.
> I chose it only because Kim's patchset which moved anon pages
> to inactive at the first detected access  was using it. Before kim's
> patchset, anon pages were placed in the active list from the first
> beginning:
> https://patchwork.kernel.org/project/linux-mm/cover/1581401993-20041-1-git-send-email-iamjoonsoo.kim@lge.com/
>
> in ebizzy, there is a used-once allocated memory in each
> search_mem(). I guess that is why Kim's patchset chose
> it.
>
> > Problems:
> > 1) Given the 2.5GB configuration and a sequence of cold/hot chunks, I
> > assume your program tries to simulate a handful of apps running on a
> > phone.  A short repeating sequence is closer to sequential access than
> > to real user behaviors, as I suggested last time. You could check out
> > how something similar is done here [1].
> > 2) Under the same assumption (phone), C programs are very different
> > from Android apps in terms of runtime memory behaviors, e.g., JVM GC
> > [2].
> > 3) Assuming you are interested in the runtime memory behavior of C/C++
> > programs, your program is still not very representative. All C/C++
> > programs I'm familiar with choose to link against TCmalloc, jemalloc
> > or implement their own allocators. GNU libc, IMO, has a small market
> > share nowadays.
> > 4) TCmalloc/jemalloc are not only optimized for multithreading, they
> > are also THP aware. THP is very important when benchmarking page
> > reclaim, e.g., two similarly warm THPs can comprise 511+1 or 1+511 of
> > warm+cold 4K pages. The LRU algorithm that chooses more of the former
> > is at the disadvantage. Unless it's recommended by the applications
> > you are trying to benchmark, THP should be disabled. (Android
> > generally doesn't use THP.)
> > 5) Swap devices are also important. Zram should NOT be used unless you
> > know your benchmark doesn't generate incompressible data. The LRU
> > algorithm that chooses more incompressible pages is at disadvantage.
> >
>
> Thanks for all the information above. very useful.
>
> > Here is my result: on the same Snapdragon 7c + 2.5GB RAM + 1.5GB
> > ramdisk swap, with your original program compiled against libc malloc
> > and TCMalloc, to 32-bit and 64-bit binaries:
>
> I noticed an important difference is that you are using ramdisk, so there
> is no cost on "i/o". I assume compression/decompression is the i/o cost to
> zRAM.

The cost is not the point; the fairness is:

1) Ramdisk is fair to both LRU algorithms.
2) Zram punishes the LRU algorithm that chooses incompressible pages.
IOW, this algorithm needs to compress more pages in order to save the
same amount of memory.

> > # cat /sys/kernel/mm/lru_gen/enabled
> > 0x0003
> > # cat /sys/kernel/mm/transparent_hugepage/enabled
> > always madvise [never]
> >
> > # modprobe brd rd_nr=1 rd_size=1572864
> > # if=/dev/zero of=/dev/ram0 bs=1M
> > # mkswap /dev/ram0
> > # swapoff -a
> > # swapon /dev/ram0
> >
> > # ldd test_absl_32
> >         linux-vdso.so.1 (0xf6e7f000)
> >         libabsl_malloc.so.2103.0.1 =>
> > /usr/lib/libabsl_malloc.so.2103.0.1 (0xf6e23000)
> >         libpthread.so.0 => /lib/libpthread.so.0 (0xf6dff000)
> >         libc.so.6 => /lib/libc.so.6 (0xf6d07000)
> >         /lib/ld-linux-armhf.so.3 (0x09df0000)
> >         libabsl_base.so.2103.0.1 => /usr/lib/libabsl_base.so.2103.0.1
> > (0xf6ce5000)
> >         libabsl_raw_logging.so.2103.0.1 =>
> > /usr/lib/libabsl_raw_logging.so.2103.0.1 (0xf6cc4000)
> >         libabsl_spinlock_wait.so.2103.0.1 =>
> > /usr/lib/libabsl_spinlock_wait.so.2103.0.1 (0xf6ca3000)
> >         libc++.so.1 => /usr/lib/libc++.so.1 (0xf6c04000)
> >         libc++abi.so.1 => /usr/lib/libc++abi.so.1 (0xf6bcd000)
> > # file test_absl_64
> > test_absl_64: ELF 64-bit LSB executable, ARM aarch64, version 1
> > (SYSV), statically linked
> > # ldd test_gnu_32
> >         linux-vdso.so.1 (0xeabef000)
> >         libpthread.so.0 => /lib/libpthread.so.0 (0xeab92000)
> >         libc.so.6 => /lib/libc.so.6 (0xeaa9a000)
> >         /lib/ld-linux-armhf.so.3 (0x05690000)
> > # file test_gnu_64
> > test_gnu_64: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV),
> > statically linked
> >
> > ### baseline 5.17-rc8
> >
> > # perf record ./test_gnu_64 -t 4 -s $((200*1024*1024)) -S 6000000
> > 10 records/s
> > real 59.00 s
> > user 39.83 s
> > sys  174.18 s
> >
> >     18.51%  [.] memcpy
> >     15.98%  [k] __pi_clear_page
> >      5.59%  [k] rmqueue_pcplist
> >      5.19%  [k] do_raw_spin_lock
> >      5.09%  [k] memmove
> >      4.60%  [k] _raw_spin_unlock_irq
> >      3.62%  [k] _raw_spin_unlock_irqrestore
> >      3.61%  [k] free_unref_page_list
> >      3.29%  [k] zap_pte_range
> >      2.53%  [k] local_daif_restore
> >      2.50%  [k] down_read_trylock
> >      1.41%  [k] handle_mm_fault
> >      1.32%  [k] do_anonymous_page
> >      1.31%  [k] up_read
> >      1.03%  [k] free_swap_cache
> >
> > ### MGLRU v9
> >
> > # perf record ./test_gnu_64 -t 4 -s $((200*1024*1024)) -S 6000000
> > 11 records/s
> > real 57.00 s
> > user 39.39 s
> >
> >     19.36%  [.] memcpy
> >     16.50%  [k] __pi_clear_page
> >      6.21%  [k] memmove
> >      5.57%  [k] rmqueue_pcplist
> >      5.07%  [k] do_raw_spin_lock
> >      4.96%  [k] _raw_spin_unlock_irqrestore
> >      4.25%  [k] free_unref_page_list
> >      3.80%  [k] zap_pte_range
> >      3.69%  [k] _raw_spin_unlock_irq
> >      2.71%  [k] local_daif_restore
> >      2.10%  [k] down_read_trylock
> >      1.50%  [k] handle_mm_fault
> >      1.29%  [k] do_anonymous_page
> >      1.17%  [k] free_swap_cache
> >      1.08%  [k] up_read
> >
>
> I think your result is right. but if you take a look at the number of
> major faults, will you find mglru have more page faults?
> i ask this question because i can see mglru even wins with lower
> hit ratio in the previous report I sent.

Yes, I did see the elevated major faults:

# baseline total 11503878
pgmajfault     4745116
pgsteal_kswapd 3056793
pgsteal_direct 3701969

# MGLRU total 11928659
pgmajfault     5762213
pgsteal_kswapd 2098253
pgsteal_direct 4068193
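
Counters like these come from /proc/vmstat; a minimal sketch of collecting
per-run deltas (not necessarily the exact method used here):

  grep -E 'pgmajfault|pgsteal_kswapd|pgsteal_direct' /proc/vmstat > before
  ./test_gnu_64 -t 4 -s $((200*1024*1024)) -S 6000000
  grep -E 'pgmajfault|pgsteal_kswapd|pgsteal_direct' /proc/vmstat > after
  paste before after | awk '{print $1, $4 - $2}'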


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-03-16  5:44                                 ` Yu Zhao
@ 2022-03-16  6:06                                   ` Barry Song
  2022-03-16 21:37                                     ` Yu Zhao
  0 siblings, 1 reply; 74+ messages in thread
From: Barry Song @ 2022-03-16  6:06 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Konstantin Kharlamov, Michael Larabel, Andi Kleen, Andrew Morton,
	Aneesh Kumar K . V, Jens Axboe, Brian Geffon, Catalin Marinas,
	Jonathan Corbet, Donald Carr, Dave Hansen, Daniel Byrne,
	Johannes Weiner, Hillf Danton, Jan Alexander Steffens,
	Holger Hoffstätte, Jesse Barnes, Linux ARM,
	open list:DOCUMENTATION, linux-kernel, Linux-MM, Mel Gorman,
	Michal Hocko, Oleksandr Natalenko, Kernel Page Reclaim v2,
	Rik van Riel, Mike Rapoport, Sofia Trinh, Steven Barrett,
	Suleiman Souhlal, Shuang Zhai, Linus Torvalds, Vlastimil Babka,
	Will Deacon, Matthew Wilcox, the arch/x86 maintainers,
	Huang Ying

On Wed, Mar 16, 2022 at 6:44 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Tue, Mar 15, 2022 at 10:37 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Wed, Mar 16, 2022 at 3:47 PM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > On Tue, Mar 15, 2022 at 4:29 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > <snipped>
> > >
> > > > > I guess the main cause of the regression for the previous sequence
> > > > > with 16 entries is that the ebizzy has a new allocated copy in
> > > > > search_mem(), which is mapped and used only once in each loop.
> > > > > and the temp copy can push out those hot chunks.
> > > > >
> > > > > Anyway, I understand it is a trade-off between warmly embracing new
> > > > > pages and holding old pages tightly. Real user cases from phone, server,
> > > > > desktop will be judging this better.
> > >
> > > Thanks for all the details. I looked into them today and found no
> > > regressions when running with your original program.
> > >
> > > After I explain why, I hope you'd be convinced that using programs
> > > like this one is not a good way to measure things :)
> > >
> >
> > Yep. I agree ebizzy might not be a good one to measure things.
> > I chose it only because Kim's patchset which moved anon pages
> > to inactive at the first detected access  was using it. Before kim's
> > patchset, anon pages were placed in the active list from the first
> > beginning:
> > https://patchwork.kernel.org/project/linux-mm/cover/1581401993-20041-1-git-send-email-iamjoonsoo.kim@lge.com/
> >
> > in ebizzy, there is a used-once allocated memory in each
> > search_mem(). I guess that is why Kim's patchset chose
> > it.
> >
> > > Problems:
> > > 1) Given the 2.5GB configuration and a sequence of cold/hot chunks, I
> > > assume your program tries to simulate a handful of apps running on a
> > > phone.  A short repeating sequence is closer to sequential access than
> > > to real user behaviors, as I suggested last time. You could check out
> > > how something similar is done here [1].
> > > 2) Under the same assumption (phone), C programs are very different
> > > from Android apps in terms of runtime memory behaviors, e.g., JVM GC
> > > [2].
> > > 3) Assuming you are interested in the runtime memory behavior of C/C++
> > > programs, your program is still not very representative. All C/C++
> > > programs I'm familiar with choose to link against TCmalloc, jemalloc
> > > or implement their own allocators. GNU libc, IMO, has a small market
> > > share nowadays.
> > > 4) TCmalloc/jemalloc are not only optimized for multithreading, they
> > > are also THP aware. THP is very important when benchmarking page
> > > reclaim, e.g., two similarly warm THPs can comprise 511+1 or 1+511 of
> > > warm+cold 4K pages. The LRU algorithm that chooses more of the former
> > > is at the disadvantage. Unless it's recommended by the applications
> > > you are trying to benchmark, THP should be disabled. (Android
> > > generally doesn't use THP.)
> > > 5) Swap devices are also important. Zram should NOT be used unless you
> > > know your benchmark doesn't generate incompressible data. The LRU
> > > algorithm that chooses more incompressible pages is at disadvantage.
> > >
> >
> > Thanks for all the information above. very useful.
> >
> > > Here is my result: on the same Snapdragon 7c + 2.5GB RAM + 1.5GB
> > > ramdisk swap, with your original program compiled against libc malloc
> > > and TCMalloc, to 32-bit and 64-bit binaries:
> >
> > I noticed an important difference is that you are using ramdisk, so there
> > is no cost on "i/o". I assume compression/decompression is the i/o cost to
> > zRAM.
>
> The cost is not the point; the fairness is:
>
> 1) Ramdisk is fair to both LRU algorithms.
> 2) Zram punishes the LRU algorithm that chooses incompressible pages.
> IOW, this algorithm needs to compress more pages in order to save the
> same amount of memory.

I see your point. But my point is that with a higher I/O cost to swap
pages in and out, more major faults (a lower hit ratio) will
contribute to a loss of final performance.

So for this particular case, if we move to a real disk as the swap
device, we might see the same result as with the zRAM I was using,
since you also reported more page faults.

>
> > > # cat /sys/kernel/mm/lru_gen/enabled
> > > 0x0003
> > > # cat /sys/kernel/mm/transparent_hugepage/enabled
> > > always madvise [never]
> > >
> > > # modprobe brd rd_nr=1 rd_size=1572864
> > > # if=/dev/zero of=/dev/ram0 bs=1M
> > > # mkswap /dev/ram0
> > > # swapoff -a
> > > # swapon /dev/ram0
> > >
> > > # ldd test_absl_32
> > >         linux-vdso.so.1 (0xf6e7f000)
> > >         libabsl_malloc.so.2103.0.1 =>
> > > /usr/lib/libabsl_malloc.so.2103.0.1 (0xf6e23000)
> > >         libpthread.so.0 => /lib/libpthread.so.0 (0xf6dff000)
> > >         libc.so.6 => /lib/libc.so.6 (0xf6d07000)
> > >         /lib/ld-linux-armhf.so.3 (0x09df0000)
> > >         libabsl_base.so.2103.0.1 => /usr/lib/libabsl_base.so.2103.0.1
> > > (0xf6ce5000)
> > >         libabsl_raw_logging.so.2103.0.1 =>
> > > /usr/lib/libabsl_raw_logging.so.2103.0.1 (0xf6cc4000)
> > >         libabsl_spinlock_wait.so.2103.0.1 =>
> > > /usr/lib/libabsl_spinlock_wait.so.2103.0.1 (0xf6ca3000)
> > >         libc++.so.1 => /usr/lib/libc++.so.1 (0xf6c04000)
> > >         libc++abi.so.1 => /usr/lib/libc++abi.so.1 (0xf6bcd000)
> > > # file test_absl_64
> > > test_absl_64: ELF 64-bit LSB executable, ARM aarch64, version 1
> > > (SYSV), statically linked
> > > # ldd test_gnu_32
> > >         linux-vdso.so.1 (0xeabef000)
> > >         libpthread.so.0 => /lib/libpthread.so.0 (0xeab92000)
> > >         libc.so.6 => /lib/libc.so.6 (0xeaa9a000)
> > >         /lib/ld-linux-armhf.so.3 (0x05690000)
> > > # file test_gnu_64
> > > test_gnu_64: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV),
> > > statically linked
> > >
> > > ### baseline 5.17-rc8
> > >
> > > # perf record ./test_gnu_64 -t 4 -s $((200*1024*1024)) -S 6000000
> > > 10 records/s
> > > real 59.00 s
> > > user 39.83 s
> > > sys  174.18 s
> > >
> > >     18.51%  [.] memcpy
> > >     15.98%  [k] __pi_clear_page
> > >      5.59%  [k] rmqueue_pcplist
> > >      5.19%  [k] do_raw_spin_lock
> > >      5.09%  [k] memmove
> > >      4.60%  [k] _raw_spin_unlock_irq
> > >      3.62%  [k] _raw_spin_unlock_irqrestore
> > >      3.61%  [k] free_unref_page_list
> > >      3.29%  [k] zap_pte_range
> > >      2.53%  [k] local_daif_restore
> > >      2.50%  [k] down_read_trylock
> > >      1.41%  [k] handle_mm_fault
> > >      1.32%  [k] do_anonymous_page
> > >      1.31%  [k] up_read
> > >      1.03%  [k] free_swap_cache
> > >
> > > ### MGLRU v9
> > >
> > > # perf record ./test_gnu_64 -t 4 -s $((200*1024*1024)) -S 6000000
> > > 11 records/s
> > > real 57.00 s
> > > user 39.39 s
> > >
> > >     19.36%  [.] memcpy
> > >     16.50%  [k] __pi_clear_page
> > >      6.21%  [k] memmove
> > >      5.57%  [k] rmqueue_pcplist
> > >      5.07%  [k] do_raw_spin_lock
> > >      4.96%  [k] _raw_spin_unlock_irqrestore

Enabling CONFIG_ARM64_PSEUDO_NMI and booting with irqchip.gicv3_pseudo_nmi=1
might help figure out the real code that is taking CPU time
inside spin_lock_irqsave regions.
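
Roughly, and assuming a GICv3 system (a sketch, not a tested recipe): build
with CONFIG_ARM64_PSEUDO_NMI=y, boot with irqchip.gicv3_pseudo_nmi=1 on the
kernel command line, and then perf samples can land inside IRQ-disabled
regions, e.g.:

  perf record -a -g -- ./test_gnu_64 -t 4 -s $((200*1024*1024)) -S 6000000
  perf report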

> > >      4.25%  [k] free_unref_page_list
> > >      3.80%  [k] zap_pte_range
> > >      3.69%  [k] _raw_spin_unlock_irq
> > >      2.71%  [k] local_daif_restore
> > >      2.10%  [k] down_read_trylock
> > >      1.50%  [k] handle_mm_fault
> > >      1.29%  [k] do_anonymous_page
> > >      1.17%  [k] free_swap_cache
> > >      1.08%  [k] up_read
> > >
> >
> > I think your result is right. but if you take a look at the number of
> > major faults, will you find mglru have more page faults?
> > i ask this question because i can see mglru even wins with lower
> > hit ratio in the previous report I sent.
>
> Yes, I did see the elevated major faults:
>
> # baseline total 11503878
> majfault       4745116
> pgsteal_kswapd 3056793
> pgsteal_direct 3701969
>
> # MGLRU total 11928659
> pgmajfault     5762213
> pgsteal_kswapd 2098253
> pgsteal_direct 4068193

This is a really good sign. Thanks to MGLRU's good implementation,
it seems the kernel is spending more time on useful jobs, regardless
of the hit ratio.

Thanks
Barry


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork
  2022-03-16  6:06                                   ` Barry Song
@ 2022-03-16 21:37                                     ` Yu Zhao
  0 siblings, 0 replies; 74+ messages in thread
From: Yu Zhao @ 2022-03-16 21:37 UTC (permalink / raw)
  To: Barry Song
  Cc: Konstantin Kharlamov, Michael Larabel, Andi Kleen, Andrew Morton,
	Aneesh Kumar K . V, Jens Axboe, Brian Geffon, Catalin Marinas,
	Jonathan Corbet, Donald Carr, Dave Hansen, Daniel Byrne,
	Johannes Weiner, Hillf Danton, Jan Alexander Steffens,
	Holger Hoffstätte, Jesse Barnes, Linux ARM,
	open list:DOCUMENTATION, linux-kernel, Linux-MM, Mel Gorman,
	Michal Hocko, Oleksandr Natalenko, Kernel Page Reclaim v2,
	Rik van Riel, Mike Rapoport, Sofia Trinh, Steven Barrett,
	Suleiman Souhlal, Shuang Zhai, Linus Torvalds, Vlastimil Babka,
	Will Deacon, Matthew Wilcox, the arch/x86 maintainers,
	Huang Ying

On Wed, Mar 16, 2022 at 12:06 AM Barry Song <21cnbao@gmail.com> wrote:

< snipped>
> > The cost is not the point; the fairness is:
> >
> > 1) Ramdisk is fair to both LRU algorithms.
> > 2) Zram punishes the LRU algorithm that chooses incompressible pages.
> > IOW, this algorithm needs to compress more pages in order to save the
> > same amount of memory.
>
> I see your point. but my point is that with higher I/O cost to swap
> in and swap out pages,  more major faults(lower hit ratio) will
> contribute to the loss of final performance.
>
> So for the particular case, if we move to a real disk as a swap
> device, we might see the same result as zRAM I was using
> since you also reported more page faults.

If we wanted to talk about I/O cost, we would need to consider the
number of writes and writeback patterns as well. The LRU algorithm
that *unconsciously* picks more clean pages has an advantage because
writes are usually slower than reads. Similarly, the LRU algorithm
that *unconsciously* picks a cluster of cold pages that later would be
faulted in together also has the advantage because sequential reads
are faster than random reads. Do we want to go into this rabbit hole?
I think not. That's exactly why I suggested we focus on the fairness.
But, just outta curiosity, MGLRU was faster when swapping to a slow
MMC disk.

# mmc cid read /sys/class/mmc_host/mmc1/mmc1:0001
type: 'MMC'
manufacturer: 'SanDisk-Toshiba Corporation' ''
product: 'DA4064' 1.24400152
serial: 0x00000000
manfacturing date: 2006 aug

# baseline + THP=never
0 records/s
real 872.00 s
user 51.69 s
sys  483.09 s

    13.07%  __memcpy_neon
    11.37%  __pi_clear_page
     9.35%  _raw_spin_unlock_irq
     5.52%  mod_delayed_work_on
     5.17%  _raw_spin_unlock_irqrestore
     3.95%  do_raw_spin_lock
     3.87%  rmqueue_pcplist
     3.60%  local_daif_restore
     3.17%  free_unref_page_list
     2.74%  zap_pte_range
     2.00%  handle_mm_fault
     1.19%  do_anonymous_page

# MGLRU + THP=never
0 records/s
real 821.00 s
user 44.45 s
sys  428.21 s

    13.28%  __memcpy_neon
    12.78%  __pi_clear_page
     9.14%  _raw_spin_unlock_irq
     5.95%  _raw_spin_unlock_irqrestore
     5.08%  mod_delayed_work_on
     4.45%  do_raw_spin_lock
     3.86%  local_daif_restore
     3.81%  rmqueue_pcplist
     3.32%  free_unref_page_list
     2.89%  zap_pte_range
     1.89%  handle_mm_fault
     1.10%  do_anonymous_page

# baseline + THP=madvise
0 records/s
real 1341.00 s
user 68.15 s
sys  681.42 s

    12.33%  __memcpy_neon
    11.78%  _raw_spin_unlock_irq
     8.79%  __pi_clear_page
     7.63%  mod_delayed_work_on
     5.49%  _raw_spin_unlock_irqrestore
     3.23%  local_daif_restore
     3.00%  do_raw_spin_lock
     2.83%  rmqueue_pcplist
     2.21%  handle_mm_fault
     2.00%  zap_pte_range
     1.51%  free_unref_page_list
     1.33%  do_swap_page
     1.17%  do_anonymous_page

# MGLRU + THP=madvise
0 records/s
real 1315.00 s
user 60.59 s
sys  620.56 s

    12.34%  __memcpy_neon
    12.17%  _raw_spin_unlock_irq
     9.33%  __pi_clear_page
     7.33%  mod_delayed_work_on
     6.01%  _raw_spin_unlock_irqrestore
     3.27%  local_daif_restore
     3.23%  do_raw_spin_lock
     2.98%  rmqueue_pcplist
     2.12%  handle_mm_fault
     2.04%  zap_pte_range
     1.65%  free_unref_page_list
     1.27%  do_swap_page
     1.11%  do_anonymous_page


^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2022-03-16 21:37 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-08  8:18 [PATCH v7 00/12] Multigenerational LRU Framework Yu Zhao
2022-02-08  8:18 ` [PATCH v7 01/12] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao
2022-02-08  8:24   ` Yu Zhao
2022-02-08 10:33   ` Will Deacon
2022-02-08  8:18 ` [PATCH v7 02/12] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Yu Zhao
2022-02-08  8:27   ` Yu Zhao
2022-02-08  8:18 ` [PATCH v7 03/12] mm/vmscan.c: refactor shrink_node() Yu Zhao
2022-02-08  8:18 ` [PATCH v7 04/12] mm: multigenerational LRU: groundwork Yu Zhao
2022-02-08  8:28   ` Yu Zhao
2022-02-10 20:41   ` Johannes Weiner
2022-02-15  9:43     ` Yu Zhao
2022-02-15 21:53       ` Johannes Weiner
2022-02-21  8:14         ` Yu Zhao
2022-02-23 21:18           ` Yu Zhao
2022-02-25 16:34             ` Minchan Kim
2022-03-03 15:29           ` Johannes Weiner
2022-03-03 19:26             ` Yu Zhao
2022-03-03 21:43               ` Johannes Weiner
2022-03-11 10:16       ` Barry Song
2022-03-11 23:45         ` Yu Zhao
2022-03-12 10:37           ` Barry Song
2022-03-12 21:11             ` Yu Zhao
2022-03-13  4:57               ` Barry Song
2022-03-14 11:11                 ` Barry Song
2022-03-14 16:45                   ` Yu Zhao
2022-03-14 23:38                     ` Barry Song
     [not found]                       ` <CAOUHufa9eY44QadfGTzsxa2=hEvqwahXd7Canck5Gt-N6c4UKA@mail.gmail.com>
     [not found]                         ` <CAGsJ_4zvj5rmz7DkW-kJx+jmUT9G8muLJ9De--NZma9ey0Oavw@mail.gmail.com>
2022-03-15 10:29                           ` Barry Song
2022-03-16  2:46                             ` Yu Zhao
2022-03-16  4:37                               ` Barry Song
2022-03-16  5:44                                 ` Yu Zhao
2022-03-16  6:06                                   ` Barry Song
2022-03-16 21:37                                     ` Yu Zhao
2022-02-10 21:37   ` Matthew Wilcox
2022-02-13 21:16     ` Yu Zhao
2022-02-08  8:18 ` [PATCH v7 05/12] mm: multigenerational LRU: minimal implementation Yu Zhao
2022-02-08  8:33   ` Yu Zhao
2022-02-08 16:50   ` Johannes Weiner
2022-02-10  2:53     ` Yu Zhao
2022-02-13 10:04   ` Hillf Danton
2022-02-17  0:13     ` Yu Zhao
2022-02-23  8:27   ` Huang, Ying
2022-02-23  9:36     ` Yu Zhao
2022-02-24  0:59       ` Huang, Ying
2022-02-24  1:34         ` Yu Zhao
2022-02-24  3:31           ` Huang, Ying
2022-02-24  4:09             ` Yu Zhao
2022-02-24  5:27               ` Huang, Ying
2022-02-24  5:35                 ` Yu Zhao
2022-02-08  8:18 ` [PATCH v7 06/12] mm: multigenerational LRU: exploit locality in rmap Yu Zhao
2022-02-08  8:40   ` Yu Zhao
2022-02-08  8:18 ` [PATCH v7 07/12] mm: multigenerational LRU: support page table walks Yu Zhao
2022-02-08  8:39   ` Yu Zhao
2022-02-08  8:18 ` [PATCH v7 08/12] mm: multigenerational LRU: optimize multiple memcgs Yu Zhao
2022-02-08  8:18 ` [PATCH v7 09/12] mm: multigenerational LRU: runtime switch Yu Zhao
2022-02-08  8:42   ` Yu Zhao
2022-02-08  8:19 ` [PATCH v7 10/12] mm: multigenerational LRU: thrashing prevention Yu Zhao
2022-02-08  8:43   ` Yu Zhao
2022-02-08  8:19 ` [PATCH v7 11/12] mm: multigenerational LRU: debugfs interface Yu Zhao
2022-02-18 18:56   ` [page-reclaim] " David Rientjes
2022-02-08  8:19 ` [PATCH v7 12/12] mm: multigenerational LRU: documentation Yu Zhao
2022-02-08  8:44   ` Yu Zhao
2022-02-14 10:28   ` Mike Rapoport
2022-02-16  3:22     ` Yu Zhao
2022-02-21  9:01       ` Mike Rapoport
2022-02-22  1:47         ` Yu Zhao
2022-02-23 10:58           ` Mike Rapoport
2022-02-23 21:20             ` Yu Zhao
2022-02-08 10:11 ` [PATCH v7 00/12] Multigenerational LRU Framework Oleksandr Natalenko
2022-02-08 11:14   ` Michal Hocko
2022-02-08 11:23     ` Oleksandr Natalenko
2022-02-11 20:12 ` Alexey Avramov
2022-02-12 21:01   ` Yu Zhao
2022-03-03  6:06 ` Vaibhav Jain
2022-03-03  6:47   ` Yu Zhao
