* [PATCH v4 00/11] Multigenerational LRU Framework
@ 2021-08-18  6:30 Yu Zhao
  2021-08-18  6:30 ` [PATCH v4 01/11] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao
                   ` (14 more replies)
  0 siblings, 15 replies; 19+ messages in thread
From: Yu Zhao @ 2021-08-18  6:30 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, Hillf Danton, page-reclaim, Yu Zhao

TLDR
====
The current page reclaim is too expensive in terms of CPU usage and it
often makes poor choices about what to evict. This patchset offers an
alternative solution that is performant, versatile and
straightforward.

Repo
====
git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/81/1281/1

Problems
========
Active/inactive
---------------
Data centers need to predict whether a job can successfully land on a
machine without actually impacting the existing jobs. The granularity
of the active/inactive lists is too coarse to be useful for job
schedulers to make such decisions. In addition, data centers need to
monitor their memory utilization for horizontal scaling. Active and
inactive are relative terms and therefore cannot give any insight into
a pool of machines; e.g., aggregating the active/inactive sizes across
multiple machines without a common frame of reference yields no
meaningful results.

Phones and laptops need to make good choices about what to evict
because they are more sensitive to major faults and power consumption.
Major faults can cause janks, i.e., slow UI renderings, and negatively
impact user experience. The selection between anon and file types has
been suboptimal because it is difficult to compare the access patterns
of the two types. On phones and laptops, executable pages are
frequently evicted even though there are many anon pages that are used
less frequently. Conversely, on workstations building large projects,
anon pages are sometimes swapped out while there are many file pages
that have been used less recently.

Fundamentally, the notion of active/inactive has very limited ability
to measure temporal locality.

Rmap walk
---------
Traversing a list of pages and searching the rmap for PTEs mapping
each page can be very expensive because those pages are likely to be
unrelated. For workloads using a high percentage of anon memory, the
rmap becomes a bottleneck in page reclaim. For example, kswapd can
easily spend more CPU time in the rmap than in anything else on
laptops running Chrome. Likewise, the kernel can spend more CPU time
in the rmap than in any other function on servers that heavily
overcommit anon memory.

Simply put, using the rmap to test the accessed bit over a large
number of pages does not take advantage of spatial locality.

Solutions
=========
Generations
-----------
This solution introduces a temporal dimension. Each generation is a
dot on the timeline and its population includes all mapped pages that
have been accessed since the birth of this generation.

All eviction choices are made based on generation numbers, which are
simple and yet effective. A large number of pages can be spread out
across many generations. Since each generation is timestamped at
birth, its population is aggregatable across different machines. This
is especially useful for data centers that require working set
estimation and proactive reclaim.
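
As a minimal sketch of the bookkeeping involved (drawn from the struct
lrugen added in patch 04; simplified here, not the exact layout):

  /* a generation is an index derived from a monotonic sequence number */
  static inline int lru_gen_from_seq(unsigned long seq)
  {
          return seq % MAX_NR_GENS;
  }

  /* each generation records its birth time, giving its population a
     common frame of reference across machines */
  unsigned long timestamps[MAX_NR_GENS];  /* birth time in jiffies */
  unsigned long sizes[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];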

Page table walk
---------------
Each walk traverses an mm_struct list to scan PTEs for accessed pages
only. Processes that have been sleeping since the last walk are
skipped. The cost of this solution is roughly proportional to the
number of accessed pages. Since page tables usually have good spatial
locality for workloads using a high percentage of anon memory, the end
result is generally a significant reduction in kswapd CPU usage.

Note that page table walks are conditional and therefore do not
replace the rmap. For workloads that have sparse mappings, this
solution falls back to the rmap.
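
A rough, hypothetical outline of one round of the page table walk (the
list linkage and the skip test below are illustrative stand-ins, not
the interfaces added later in this series):

  struct mm_struct *mm;

  /* scan only the processes that have run since the last walk */
  list_for_each_entry(mm, &mm_list->head, lru_gen_link) {
          if (mm_has_been_sleeping(mm))   /* hypothetical helper */
                  continue;
          walk_page_range(mm, start, end, &mm_walk_ops, priv);
  }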

Use cases
=========
Page cache overcommit
---------------------
Tiers within each generation are specifically designed to improve the
performance of the page cache under memory pressure. The fio/io_uring
benchmark shows a 14% increase in IOPS for buffered I/O.

Without this patchset, the profile of fio/io_uring looks like:
  12.03%  __page_cache_alloc
   6.53%  shrink_active_list
   2.53%  mark_page_accessed

With this patchset, it looks like:
   9.45%  __page_cache_alloc
   0.52%  mark_page_accessed

Essentially, the idea of tiers is a feedback loop based on trial and
error. Instead of unconditionally moving file pages to the active list
upon the second access, this solution monitors refaults and
conditionally protects file pages with outlying refaults.
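
For a concrete sense of the tiers: per patch 04, pages accessed N
times via file descriptors belong to tier order_base_2(N), e.g., two
accesses map to tier 1, four to tier 2 and eight to tier 3. The base
tier holds pages accessed only via page tables, and refault rates are
compared across tiers to decide which ones to protect.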

Anon memory overcommit
----------------------
Our real-world benchmark, which browses popular websites in multiple
Chrome tabs, demonstrates 51% less CPU usage from kswapd and 52% less
PSI (full).

Without this patchset, the profile of kswapd looks like:
  31.03%  page_vma_mapped_walk
  25.59%  lzo1x_1_do_compress
   4.63%  do_raw_spin_lock
   3.89%  vma_interval_tree_iter_next
   3.33%  vma_interval_tree_subtree_search

With this patchset, it looks like:
  49.36%  lzo1x_1_do_compress
   4.54%  page_vma_mapped_walk
   4.45%  memset_erms
   3.47%  walk_pte_range
   2.88%  zram_bvec_rw

In addition, direct reclaim latency is reduced by 22% at the 99th
percentile and the number of refaults is reduced by 7%. Both metrics
are important to phones and laptops as they are highly correlated with
user experience.

Working set estimation
----------------------
Userspace can invoke the aging by writing "+ memcg_id node_id max_gen
[swappiness]" to /sys/kernel/debug/lru_gen. Reading this debugfs
interface returns the birth time and the population of each
generation.

Given a pool of machines, a job scheduler can periodically invoke the
aging, rank these machines by the sizes of their working sets and in
turn select the best candidates for landing new jobs.
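
A hypothetical invocation (the memcg_id, node_id and max_gen values
below are purely illustrative):

  # trigger the aging for memcg 1 on node 0, given max_gen 5
  echo '+ 1 0 5' > /sys/kernel/debug/lru_gen
  # read back the birth time and population of each generation
  cat /sys/kernel/debug/lru_gen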

Proactive reclaim
-----------------
Userspace can invoke the eviction by writing "- memcg_id node_id
min_gen [swappiness] [nr_to_reclaim]" to /sys/kernel/debug/lru_gen.
Multiple command lines are supported, as is concatenation with
delimiters "," and ";".
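
A hypothetical invocation (IDs, generation numbers and limits below
are purely illustrative):

  # evict for memcg 1 on nodes 0 and 1: min_gen 4, swappiness 40,
  # at most 1000 pages each
  echo '- 1 0 4 40 1000; - 1 1 4 40 1000' > /sys/kernel/debug/lru_gen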

A typical use case is that a job scheduler invokes the eviction in
anticipation of new jobs. The savings from proactive reclaim can help
provide a certain SLA for landing these new jobs.

Yu Zhao (11):
  mm: x86, arm64: add arch_has_hw_pte_young()
  mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
  mm/vmscan.c: refactor shrink_node()
  mm: multigenerational lru: groundwork
  mm: multigenerational lru: protection
  mm: multigenerational lru: mm_struct list
  mm: multigenerational lru: aging
  mm: multigenerational lru: eviction
  mm: multigenerational lru: user interface
  mm: multigenerational lru: Kconfig
  mm: multigenerational lru: documentation

 Documentation/vm/index.rst          |    1 +
 Documentation/vm/multigen_lru.rst   |  134 ++
 arch/Kconfig                        |    9 +
 arch/arm64/include/asm/cpufeature.h |   19 +-
 arch/arm64/include/asm/pgtable.h    |   10 +-
 arch/arm64/kernel/cpufeature.c      |   19 +
 arch/arm64/mm/proc.S                |   12 -
 arch/arm64/tools/cpucaps            |    1 +
 arch/x86/Kconfig                    |    1 +
 arch/x86/include/asm/pgtable.h      |    9 +-
 arch/x86/mm/pgtable.c               |    5 +-
 fs/exec.c                           |    2 +
 fs/fuse/dev.c                       |    3 +-
 include/linux/cgroup.h              |   15 +-
 include/linux/memcontrol.h          |    9 +
 include/linux/mm.h                  |   34 +
 include/linux/mm_inline.h           |  201 ++
 include/linux/mm_types.h            |  107 ++
 include/linux/mmzone.h              |  103 ++
 include/linux/nodemask.h            |    1 +
 include/linux/oom.h                 |   16 +
 include/linux/page-flags-layout.h   |   19 +-
 include/linux/page-flags.h          |    4 +-
 include/linux/pgtable.h             |   16 +-
 include/linux/sched.h               |    3 +
 include/linux/swap.h                |    1 +
 kernel/bounds.c                     |    3 +
 kernel/cgroup/cgroup-internal.h     |    1 -
 kernel/exit.c                       |    1 +
 kernel/fork.c                       |   10 +
 kernel/kthread.c                    |    1 +
 kernel/sched/core.c                 |    2 +
 mm/Kconfig                          |   59 +
 mm/huge_memory.c                    |    3 +-
 mm/memcontrol.c                     |   28 +
 mm/memory.c                         |   21 +-
 mm/mm_init.c                        |    6 +-
 mm/mmzone.c                         |    2 +
 mm/oom_kill.c                       |    4 +-
 mm/rmap.c                           |    7 +
 mm/swap.c                           |   55 +-
 mm/swapfile.c                       |    2 +
 mm/vmscan.c                         | 2674 ++++++++++++++++++++++++++-
 mm/workingset.c                     |  119 +-
 44 files changed, 3591 insertions(+), 161 deletions(-)
 create mode 100644 Documentation/vm/multigen_lru.rst

-- 
2.33.0.rc1.237.g0d66db33f3-goog




* [PATCH v4 01/11] mm: x86, arm64: add arch_has_hw_pte_young()
  2021-08-18  6:30 [PATCH v4 00/11] Multigenerational LRU Framework Yu Zhao
@ 2021-08-18  6:30 ` Yu Zhao
  2021-08-19  9:19   ` Will Deacon
  2021-08-18  6:30 ` [PATCH v4 02/11] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Yu Zhao
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 19+ messages in thread
From: Yu Zhao @ 2021-08-18  6:30 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, Hillf Danton, page-reclaim, Yu Zhao

Some architectures set the accessed bit in PTEs automatically, e.g.,
x86, and arm64 v8.2 and later. On architectures that do not have this
capability, clearing the accessed bit in a PTE triggers a page fault
on the next access, following the TLB miss of this PTE.

Being aware of this capability can help make better decisions, e.g.,
whether to limit the size of each batch of PTEs and the burst of
batches when clearing the accessed bit.
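
A minimal, hypothetical sketch of such a decision (the constants are
illustrative and not part of this series):

  /* scan larger batches only when hardware sets the accessed bit */
  int max_batch = arch_has_hw_pte_young() ? 4096 : 64;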

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 arch/arm64/include/asm/cpufeature.h | 19 ++++++-------------
 arch/arm64/include/asm/pgtable.h    | 10 ++++------
 arch/arm64/kernel/cpufeature.c      | 19 +++++++++++++++++++
 arch/arm64/mm/proc.S                | 12 ------------
 arch/arm64/tools/cpucaps            |  1 +
 arch/x86/include/asm/pgtable.h      |  6 +++---
 include/linux/pgtable.h             | 12 ++++++++++++
 mm/memory.c                         | 14 +-------------
 8 files changed, 46 insertions(+), 47 deletions(-)

diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 9bb9d11750d7..2020b9e818c8 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -776,6 +776,12 @@ static inline bool system_supports_tlb_range(void)
 		cpus_have_const_cap(ARM64_HAS_TLB_RANGE);
 }
 
+/* Check whether hardware update of the Access flag is supported. */
+static inline bool system_has_hw_af(void)
+{
+	return IS_ENABLED(CONFIG_ARM64_HW_AFDBM) && cpus_have_const_cap(ARM64_HW_AF);
+}
+
 extern int do_emulate_mrs(struct pt_regs *regs, u32 sys_reg, u32 rt);
 
 static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange)
@@ -799,19 +805,6 @@ static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange)
 	}
 }
 
-/* Check whether hardware update of the Access flag is supported */
-static inline bool cpu_has_hw_af(void)
-{
-	u64 mmfr1;
-
-	if (!IS_ENABLED(CONFIG_ARM64_HW_AFDBM))
-		return false;
-
-	mmfr1 = read_cpuid(ID_AA64MMFR1_EL1);
-	return cpuid_feature_extract_unsigned_field(mmfr1,
-						ID_AA64MMFR1_HADBS_SHIFT);
-}
-
 static inline bool cpu_has_pan(void)
 {
 	u64 mmfr1 = read_cpuid(ID_AA64MMFR1_EL1);
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index f09bf5c02891..b63a6a7b62ee 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -993,13 +993,11 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
  * page after fork() + CoW for pfn mappings. We don't always have a
  * hardware-managed access flag on arm64.
  */
-static inline bool arch_faults_on_old_pte(void)
+static inline bool arch_has_hw_pte_young(void)
 {
-	WARN_ON(preemptible());
-
-	return !cpu_has_hw_af();
+	return system_has_hw_af();
 }
-#define arch_faults_on_old_pte		arch_faults_on_old_pte
+#define arch_has_hw_pte_young		arch_has_hw_pte_young
 
 /*
  * Experimentally, it's cheap to set the access flag in hardware and we
@@ -1007,7 +1005,7 @@ static inline bool arch_faults_on_old_pte(void)
  */
 static inline bool arch_wants_old_prefaulted_pte(void)
 {
-	return !arch_faults_on_old_pte();
+	return arch_has_hw_pte_young();
 }
 #define arch_wants_old_prefaulted_pte	arch_wants_old_prefaulted_pte
 
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 0ead8bfedf20..d05de77626f5 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1650,6 +1650,14 @@ static bool has_hw_dbm(const struct arm64_cpu_capabilities *cap,
 	return true;
 }
 
+static void cpu_enable_hw_af(struct arm64_cpu_capabilities const *cap)
+{
+	u64 val = read_sysreg(tcr_el1);
+
+	write_sysreg(val | TCR_HA, tcr_el1);
+	isb();
+	local_flush_tlb_all();
+}
 #endif
 
 #ifdef CONFIG_ARM64_AMU_EXTN
@@ -2126,6 +2134,17 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
 		.matches = has_hw_dbm,
 		.cpu_enable = cpu_enable_hw_dbm,
 	},
+	{
+		.desc = "Hardware update of the Access flag",
+		.type = ARM64_CPUCAP_SYSTEM_FEATURE,
+		.capability = ARM64_HW_AF,
+		.sys_reg = SYS_ID_AA64MMFR1_EL1,
+		.sign = FTR_UNSIGNED,
+		.field_pos = ID_AA64MMFR1_HADBS_SHIFT,
+		.min_field_value = 1,
+		.matches = has_cpuid_feature,
+		.cpu_enable = cpu_enable_hw_af,
+	},
 #endif
 	{
 		.desc = "CRC32 instructions",
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 35936c5ae1ce..b066d5712e3d 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -478,18 +478,6 @@ SYM_FUNC_START(__cpu_setup)
 	 * Set the IPS bits in TCR_EL1.
 	 */
 	tcr_compute_pa_size tcr, #TCR_IPS_SHIFT, x5, x6
-#ifdef CONFIG_ARM64_HW_AFDBM
-	/*
-	 * Enable hardware update of the Access Flags bit.
-	 * Hardware dirty bit management is enabled later,
-	 * via capabilities.
-	 */
-	mrs	x9, ID_AA64MMFR1_EL1
-	and	x9, x9, #0xf
-	cbz	x9, 1f
-	orr	tcr, tcr, #TCR_HA		// hardware Access flag update
-1:
-#endif	/* CONFIG_ARM64_HW_AFDBM */
 	msr	mair_el1, mair
 	msr	tcr_el1, tcr
 	/*
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index 49305c2e6dfd..d52f50671e60 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -35,6 +35,7 @@ HAS_STAGE2_FWB
 HAS_SYSREG_GIC_CPUIF
 HAS_TLB_RANGE
 HAS_VIRT_HOST_EXTN
+HW_AF
 HW_DBM
 KVM_PROTECTED_MODE
 MISMATCHED_CACHE_TYPE
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 448cd01eb3ec..3908780fc408 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1397,10 +1397,10 @@ static inline bool arch_has_pfn_modify_check(void)
 	return boot_cpu_has_bug(X86_BUG_L1TF);
 }
 
-#define arch_faults_on_old_pte arch_faults_on_old_pte
-static inline bool arch_faults_on_old_pte(void)
+#define arch_has_hw_pte_young arch_has_hw_pte_young
+static inline bool arch_has_hw_pte_young(void)
 {
-	return false;
+	return true;
 }
 
 #endif	/* __ASSEMBLY__ */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e24d2c992b11..3a8221fa2c76 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -258,6 +258,18 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
+#ifndef arch_has_hw_pte_young
+static inline bool arch_has_hw_pte_young(void)
+{
+	/*
+	 * Those arches which have hw access flag feature need to implement
+	 * their own helper. By default, "false" means pagefault will be hit
+	 * on old pte.
+	 */
+	return false;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index 25fc46e87214..2f96179db219 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -121,18 +121,6 @@ int randomize_va_space __read_mostly =
 					2;
 #endif
 
-#ifndef arch_faults_on_old_pte
-static inline bool arch_faults_on_old_pte(void)
-{
-	/*
-	 * Those arches which don't have hw access flag feature need to
-	 * implement their own helper. By default, "true" means pagefault
-	 * will be hit on old pte.
-	 */
-	return true;
-}
-#endif
-
 #ifndef arch_wants_old_prefaulted_pte
 static inline bool arch_wants_old_prefaulted_pte(void)
 {
@@ -2769,7 +2757,7 @@ static inline bool cow_user_page(struct page *dst, struct page *src,
 	 * On architectures with software "accessed" bits, we would
 	 * take a double page fault, so mark it accessed here.
 	 */
-	if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
+	if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) {
 		pte_t entry;
 
 		vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
-- 
2.33.0.rc1.237.g0d66db33f3-goog




* [PATCH v4 02/11] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
  2021-08-18  6:30 [PATCH v4 00/11] Multigenerational LRU Framework Yu Zhao
  2021-08-18  6:30 ` [PATCH v4 01/11] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao
@ 2021-08-18  6:30 ` Yu Zhao
  2021-08-18  6:30 ` [PATCH v4 03/11] mm/vmscan.c: refactor shrink_node() Yu Zhao
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yu Zhao @ 2021-08-18  6:30 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Hillf Danton, page-reclaim, Yu Zhao, Konstantin Kharlamov

Some architectures support the accessed bit on non-leaf PMD entries,
e.g., x86_64 sets the accessed bit on a non-leaf PMD entry when using
it as part of linear address translation [1]. As an optimization, page
table walkers that are interested in the accessed bit can skip the
PTEs under a non-leaf PMD entry if the accessed bit is cleared on this
non-leaf PMD entry.

Although an inline function may be preferable, this capability is
added as a configuration option to look consistent when used with the
existing macros.

[1]: Intel 64 and IA-32 Architectures Software Developer's Manual
     Volume 3 (October 2019), section 4.8
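
For illustration, a hypothetical walker fragment (not code from this
series) that uses this optimization:

  /*
   * If the accessed bit is clear on this non-leaf PMD entry, the CPU
   * has not walked through it since the bit was cleared, so none of
   * the PTEs under it can be young.
   */
  if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) && !pmd_young(*pmd))
          return;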

Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
---
 arch/Kconfig                   | 9 +++++++++
 arch/x86/Kconfig               | 1 +
 arch/x86/include/asm/pgtable.h | 3 ++-
 arch/x86/mm/pgtable.c          | 5 ++++-
 include/linux/pgtable.h        | 4 ++--
 5 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 129df498a8e1..5b6b4f95372f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1282,6 +1282,15 @@ config ARCH_SPLIT_ARG64
 config ARCH_HAS_ELFCORE_COMPAT
 	bool
 
+config ARCH_HAS_NONLEAF_PMD_YOUNG
+	bool
+	depends on PGTABLE_LEVELS > 2
+	help
+	  Architectures that select this are able to set the accessed bit on
+	  non-leaf PMD entries in addition to leaf PTE entries where pages are
+	  mapped. For them, page table walkers that clear the accessed bit may
+	  stop at non-leaf PMD entries if they do not see the accessed bit.
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 88fb922c23a0..36a81d31f711 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -84,6 +84,7 @@ config X86
 	select ARCH_HAS_PMEM_API		if X86_64
 	select ARCH_HAS_PTE_DEVMAP		if X86_64
 	select ARCH_HAS_PTE_SPECIAL
+	select ARCH_HAS_NONLEAF_PMD_YOUNG	if X86_64
 	select ARCH_HAS_UACCESS_FLUSHCACHE	if X86_64
 	select ARCH_HAS_COPY_MC			if X86_64
 	select ARCH_HAS_SET_MEMORY
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 3908780fc408..01a1763123ff 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -817,7 +817,8 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
 
 static inline int pmd_bad(pmd_t pmd)
 {
-	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
+	return (pmd_flags(pmd) & ~(_PAGE_USER | _PAGE_ACCESSED)) !=
+	       (_KERNPG_TABLE & ~_PAGE_ACCESSED);
 }
 
 static inline unsigned long pages_to_mb(unsigned long npg)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 3481b35cb4ec..a224193d84bf 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -550,7 +550,7 @@ int ptep_test_and_clear_young(struct vm_area_struct *vma,
 	return ret;
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
 int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 			      unsigned long addr, pmd_t *pmdp)
 {
@@ -562,6 +562,9 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 
 	return ret;
 }
+#endif
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int pudp_test_and_clear_young(struct vm_area_struct *vma,
 			      unsigned long addr, pud_t *pudp)
 {
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 3a8221fa2c76..483d5ff7a33e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -211,7 +211,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 #endif
 
 #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
 static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 					    unsigned long address,
 					    pmd_t *pmdp)
@@ -232,7 +232,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 	BUILD_BUG();
 	return 0;
 }
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
-- 
2.33.0.rc1.237.g0d66db33f3-goog




* [PATCH v4 03/11] mm/vmscan.c: refactor shrink_node()
  2021-08-18  6:30 [PATCH v4 00/11] Multigenerational LRU Framework Yu Zhao
  2021-08-18  6:30 ` [PATCH v4 01/11] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao
  2021-08-18  6:30 ` [PATCH v4 02/11] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Yu Zhao
@ 2021-08-18  6:30 ` Yu Zhao
  2021-08-18  6:31 ` [PATCH v4 04/11] mm: multigenerational lru: groundwork Yu Zhao
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yu Zhao @ 2021-08-18  6:30 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Hillf Danton, page-reclaim, Yu Zhao, Konstantin Kharlamov

This patch refactors shrink_node() to make the upcoming changes to
mm/vmscan.c more readable.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
---
 mm/vmscan.c | 186 +++++++++++++++++++++++++++-------------------------
 1 file changed, 98 insertions(+), 88 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4620df62f0ff..b6d14880bd76 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2437,6 +2437,103 @@ enum scan_balance {
 	SCAN_FILE,
 };
 
+static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
+{
+	unsigned long file;
+	struct lruvec *target_lruvec;
+
+	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+
+	/*
+	 * Determine the scan balance between anon and file LRUs.
+	 */
+	spin_lock_irq(&target_lruvec->lru_lock);
+	sc->anon_cost = target_lruvec->anon_cost;
+	sc->file_cost = target_lruvec->file_cost;
+	spin_unlock_irq(&target_lruvec->lru_lock);
+
+	/*
+	 * Target desirable inactive:active list ratios for the anon
+	 * and file LRU lists.
+	 */
+	if (!sc->force_deactivate) {
+		unsigned long refaults;
+
+		refaults = lruvec_page_state(target_lruvec,
+				WORKINGSET_ACTIVATE_ANON);
+		if (refaults != target_lruvec->refaults[0] ||
+			inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
+			sc->may_deactivate |= DEACTIVATE_ANON;
+		else
+			sc->may_deactivate &= ~DEACTIVATE_ANON;
+
+		/*
+		 * When refaults are being observed, it means a new
+		 * workingset is being established. Deactivate to get
+		 * rid of any stale active pages quickly.
+		 */
+		refaults = lruvec_page_state(target_lruvec,
+				WORKINGSET_ACTIVATE_FILE);
+		if (refaults != target_lruvec->refaults[1] ||
+		    inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
+			sc->may_deactivate |= DEACTIVATE_FILE;
+		else
+			sc->may_deactivate &= ~DEACTIVATE_FILE;
+	} else
+		sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
+
+	/*
+	 * If we have plenty of inactive file pages that aren't
+	 * thrashing, try to reclaim those first before touching
+	 * anonymous pages.
+	 */
+	file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
+	if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
+		sc->cache_trim_mode = 1;
+	else
+		sc->cache_trim_mode = 0;
+
+	/*
+	 * Prevent the reclaimer from falling into the cache trap: as
+	 * cache pages start out inactive, every cache fault will tip
+	 * the scan balance towards the file LRU.  And as the file LRU
+	 * shrinks, so does the window for rotation from references.
+	 * This means we have a runaway feedback loop where a tiny
+	 * thrashing file LRU becomes infinitely more attractive than
+	 * anon pages.  Try to detect this based on file LRU size.
+	 */
+	if (!cgroup_reclaim(sc)) {
+		unsigned long total_high_wmark = 0;
+		unsigned long free, anon;
+		int z;
+
+		free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
+		file = node_page_state(pgdat, NR_ACTIVE_FILE) +
+			   node_page_state(pgdat, NR_INACTIVE_FILE);
+
+		for (z = 0; z < MAX_NR_ZONES; z++) {
+			struct zone *zone = &pgdat->node_zones[z];
+
+			if (!managed_zone(zone))
+				continue;
+
+			total_high_wmark += high_wmark_pages(zone);
+		}
+
+		/*
+		 * Consider anon: if that's low too, this isn't a
+		 * runaway file reclaim problem, but rather just
+		 * extreme pressure. Reclaim as per usual then.
+		 */
+		anon = node_page_state(pgdat, NR_INACTIVE_ANON);
+
+		sc->file_is_tiny =
+			file + free <= total_high_wmark &&
+			!(sc->may_deactivate & DEACTIVATE_ANON) &&
+			anon >> sc->priority;
+	}
+}
+
 /*
  * Determine how aggressively the anon and file LRU lists should be
  * scanned.  The relative value of each set of LRU lists is determined
@@ -2882,7 +2979,6 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	unsigned long nr_reclaimed, nr_scanned;
 	struct lruvec *target_lruvec;
 	bool reclaimable = false;
-	unsigned long file;
 
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
 
@@ -2892,93 +2988,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	nr_reclaimed = sc->nr_reclaimed;
 	nr_scanned = sc->nr_scanned;
 
-	/*
-	 * Determine the scan balance between anon and file LRUs.
-	 */
-	spin_lock_irq(&target_lruvec->lru_lock);
-	sc->anon_cost = target_lruvec->anon_cost;
-	sc->file_cost = target_lruvec->file_cost;
-	spin_unlock_irq(&target_lruvec->lru_lock);
-
-	/*
-	 * Target desirable inactive:active list ratios for the anon
-	 * and file LRU lists.
-	 */
-	if (!sc->force_deactivate) {
-		unsigned long refaults;
-
-		refaults = lruvec_page_state(target_lruvec,
-				WORKINGSET_ACTIVATE_ANON);
-		if (refaults != target_lruvec->refaults[0] ||
-			inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
-			sc->may_deactivate |= DEACTIVATE_ANON;
-		else
-			sc->may_deactivate &= ~DEACTIVATE_ANON;
-
-		/*
-		 * When refaults are being observed, it means a new
-		 * workingset is being established. Deactivate to get
-		 * rid of any stale active pages quickly.
-		 */
-		refaults = lruvec_page_state(target_lruvec,
-				WORKINGSET_ACTIVATE_FILE);
-		if (refaults != target_lruvec->refaults[1] ||
-		    inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
-			sc->may_deactivate |= DEACTIVATE_FILE;
-		else
-			sc->may_deactivate &= ~DEACTIVATE_FILE;
-	} else
-		sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
-
-	/*
-	 * If we have plenty of inactive file pages that aren't
-	 * thrashing, try to reclaim those first before touching
-	 * anonymous pages.
-	 */
-	file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
-	if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
-		sc->cache_trim_mode = 1;
-	else
-		sc->cache_trim_mode = 0;
-
-	/*
-	 * Prevent the reclaimer from falling into the cache trap: as
-	 * cache pages start out inactive, every cache fault will tip
-	 * the scan balance towards the file LRU.  And as the file LRU
-	 * shrinks, so does the window for rotation from references.
-	 * This means we have a runaway feedback loop where a tiny
-	 * thrashing file LRU becomes infinitely more attractive than
-	 * anon pages.  Try to detect this based on file LRU size.
-	 */
-	if (!cgroup_reclaim(sc)) {
-		unsigned long total_high_wmark = 0;
-		unsigned long free, anon;
-		int z;
-
-		free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
-		file = node_page_state(pgdat, NR_ACTIVE_FILE) +
-			   node_page_state(pgdat, NR_INACTIVE_FILE);
-
-		for (z = 0; z < MAX_NR_ZONES; z++) {
-			struct zone *zone = &pgdat->node_zones[z];
-			if (!managed_zone(zone))
-				continue;
-
-			total_high_wmark += high_wmark_pages(zone);
-		}
-
-		/*
-		 * Consider anon: if that's low too, this isn't a
-		 * runaway file reclaim problem, but rather just
-		 * extreme pressure. Reclaim as per usual then.
-		 */
-		anon = node_page_state(pgdat, NR_INACTIVE_ANON);
-
-		sc->file_is_tiny =
-			file + free <= total_high_wmark &&
-			!(sc->may_deactivate & DEACTIVATE_ANON) &&
-			anon >> sc->priority;
-	}
+	prepare_scan_count(pgdat, sc);
 
 	shrink_node_memcgs(pgdat, sc);
 
-- 
2.33.0.rc1.237.g0d66db33f3-goog




* [PATCH v4 04/11] mm: multigenerational lru: groundwork
  2021-08-18  6:30 [PATCH v4 00/11] Multigenerational LRU Framework Yu Zhao
                   ` (2 preceding siblings ...)
  2021-08-18  6:30 ` [PATCH v4 03/11] mm/vmscan.c: refactor shrink_node() Yu Zhao
@ 2021-08-18  6:31 ` Yu Zhao
  2021-08-18  6:31 ` [PATCH v4 05/11] mm: multigenerational lru: protection Yu Zhao
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yu Zhao @ 2021-08-18  6:31 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Hillf Danton, page-reclaim, Yu Zhao, Konstantin Kharlamov

For each lruvec, evictable pages are divided into multiple
generations. The youngest generation number is stored in
lrugen->max_seq for both anon and file types as they are aged on an
equal footing. The oldest generation numbers are stored in
lrugen->min_seq[2] separately for anon and file types as clean file
pages can be evicted regardless of swap and writeback constraints.
These three variables are monotonically increasing. Generation numbers
are truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit
into page->flags. The sliding window technique is used to prevent
truncated generation numbers from overlapping. Each truncated
generation number is an index to
lrugen->lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES].
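
For example (with a hypothetical, not necessarily default, value):
CONFIG_NR_LRU_GENS=4 means the counter needs order_base_2(4+1)=3 bits;
it stores gen+1 while a page is on one of the multigenerational lru
lists and 0 otherwise.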

Each generation is then divided into multiple tiers. Tiers represent
levels of usage from file descriptors only. Pages accessed N times via
file descriptors belong to tier order_base_2(N). Each generation
contains at most MAX_NR_TIERS tiers, and they require MAX_NR_TIERS-2
additional bits in page->flags. In contrast to moving across
generations, which requires list operations, moving across tiers only
involves operations on page->flags and therefore has a negligible
cost. A feedback loop modeled after the PID controller monitors the
refault rates of all tiers and decides when to protect pages from
which tiers.

The framework comprises two conceptually independent components: the
aging and the eviction, which can be invoked separately from user
space for the purpose of working set estimation and proactive reclaim.

The aging produces young generations. Given an lruvec, the aging
traverses lruvec_memcg()->mm_list and calls walk_page_range() to scan
PTEs for accessed pages (a mm_struct list is maintained for each
memcg). Upon finding one, the aging updates its generation number to
max_seq (modulo MAX_NR_GENS). After each round of traversal, the aging
increments max_seq. The aging is due when both min_seq[2] have caught
up with max_seq-1.

The eviction consumes old generations. Given an lruvec, the eviction
scans pages on lrugen->lists indexed by anon and file min_seq[2]
(modulo MAX_NR_GENS). It first tries to select a type based on the
values of min_seq[2]. If they are equal, it selects the type that has
a lower refault rate. The eviction sorts a page according to its
updated generation number if the aging has found this page accessed.
It also moves a page to the next generation if this page is from an
upper tier that has a higher refault rate than the base tier. The
eviction increments min_seq[2] of a selected type when it finds
lrugen->lists indexed by min_seq[2] of this selected type are empty.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
---
 fs/fuse/dev.c                     |   3 +-
 include/linux/cgroup.h            |  15 +-
 include/linux/mm.h                |   2 +
 include/linux/mm_inline.h         | 201 ++++++++++++++++++
 include/linux/mmzone.h            |  92 +++++++++
 include/linux/page-flags-layout.h |  19 +-
 include/linux/page-flags.h        |   4 +-
 kernel/bounds.c                   |   3 +
 kernel/cgroup/cgroup-internal.h   |   1 -
 mm/huge_memory.c                  |   3 +-
 mm/mm_init.c                      |   6 +-
 mm/mmzone.c                       |   2 +
 mm/swapfile.c                     |   2 +
 mm/vmscan.c                       | 329 ++++++++++++++++++++++++++++++
 14 files changed, 669 insertions(+), 13 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 1c8f79b3dd06..673d987652ee 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -785,7 +785,8 @@ static int fuse_check_page(struct page *page)
 	       1 << PG_active |
 	       1 << PG_workingset |
 	       1 << PG_reclaim |
-	       1 << PG_waiters))) {
+	       1 << PG_waiters |
+	       LRU_GEN_MASK | LRU_USAGE_MASK))) {
 		dump_page(page, "fuse: trying to steal weird page");
 		return 1;
 	}
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 7bf60454a313..1ebc27c8fee7 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -432,6 +432,18 @@ static inline void cgroup_put(struct cgroup *cgrp)
 	css_put(&cgrp->self);
 }
 
+extern struct mutex cgroup_mutex;
+
+static inline void cgroup_lock(void)
+{
+	mutex_lock(&cgroup_mutex);
+}
+
+static inline void cgroup_unlock(void)
+{
+	mutex_unlock(&cgroup_mutex);
+}
+
 /**
  * task_css_set_check - obtain a task's css_set with extra access conditions
  * @task: the task to obtain css_set for
@@ -446,7 +458,6 @@ static inline void cgroup_put(struct cgroup *cgrp)
  * as locks used during the cgroup_subsys::attach() methods.
  */
 #ifdef CONFIG_PROVE_RCU
-extern struct mutex cgroup_mutex;
 extern spinlock_t css_set_lock;
 #define task_css_set_check(task, __c)					\
 	rcu_dereference_check((task)->cgroups,				\
@@ -707,6 +718,8 @@ struct cgroup;
 static inline u64 cgroup_id(const struct cgroup *cgrp) { return 1; }
 static inline void css_get(struct cgroup_subsys_state *css) {}
 static inline void css_put(struct cgroup_subsys_state *css) {}
+static inline void cgroup_lock(void) {}
+static inline void cgroup_unlock(void) {}
 static inline int cgroup_attach_task_all(struct task_struct *from,
 					 struct task_struct *t) { return 0; }
 static inline int cgroupstats_build(struct cgroupstats *stats,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7ca22e6e694a..159b7c94e067 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1092,6 +1092,8 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
 #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
 #define LAST_CPUPID_PGOFF	(ZONES_PGOFF - LAST_CPUPID_WIDTH)
 #define KASAN_TAG_PGOFF		(LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH)
+#define LRU_GEN_PGOFF		(KASAN_TAG_PGOFF - LRU_GEN_WIDTH)
+#define LRU_USAGE_PGOFF		(LRU_GEN_PGOFF - LRU_USAGE_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 355ea1ee32bd..19e722ec7cf3 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -79,11 +79,206 @@ static __always_inline enum lru_list page_lru(struct page *page)
 	return lru;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+#ifdef CONFIG_LRU_GEN_ENABLED
+DECLARE_STATIC_KEY_TRUE(lru_gen_static_key);
+
+static inline bool lru_gen_enabled(void)
+{
+	return static_branch_likely(&lru_gen_static_key);
+}
+#else
+DECLARE_STATIC_KEY_FALSE(lru_gen_static_key);
+
+static inline bool lru_gen_enabled(void)
+{
+	return static_branch_unlikely(&lru_gen_static_key);
+}
+#endif
+
+/* Return an index within the sliding window that tracks MAX_NR_GENS generations. */
+static inline int lru_gen_from_seq(unsigned long seq)
+{
+	return seq % MAX_NR_GENS;
+}
+
+/* Return a proper index regardless whether we keep a full history of stats. */
+static inline int lru_hist_from_seq(int seq)
+{
+	return seq % NR_STAT_GENS;
+}
+
+/* Convert the level of usage to a tier. See the comment on MAX_NR_TIERS. */
+static inline int lru_tier_from_usage(int usage)
+{
+	VM_BUG_ON(usage > BIT(LRU_USAGE_WIDTH));
+
+	return order_base_2(usage + 1);
+}
+
+/* The youngest and the second youngest generations are counted as active. */
+static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
+{
+	unsigned long max_seq = READ_ONCE(lruvec->evictable.max_seq);
+
+	VM_BUG_ON(!max_seq);
+	VM_BUG_ON(gen >= MAX_NR_GENS);
+
+	return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
+}
+
+/* Update the sizes of the multigenerational lru lists. */
+static inline void lru_gen_update_size(struct page *page, struct lruvec *lruvec,
+				       int old_gen, int new_gen)
+{
+	int type = page_is_file_lru(page);
+	int zone = page_zonenum(page);
+	int delta = thp_nr_pages(page);
+	enum lru_list lru = type * LRU_FILE;
+	struct lrugen *lrugen = &lruvec->evictable;
+
+	lockdep_assert_held(&lruvec->lru_lock);
+	VM_BUG_ON(old_gen != -1 && old_gen >= MAX_NR_GENS);
+	VM_BUG_ON(new_gen != -1 && new_gen >= MAX_NR_GENS);
+	VM_BUG_ON(old_gen == -1 && new_gen == -1);
+
+	if (old_gen >= 0)
+		WRITE_ONCE(lrugen->sizes[old_gen][type][zone],
+			   lrugen->sizes[old_gen][type][zone] - delta);
+	if (new_gen >= 0)
+		WRITE_ONCE(lrugen->sizes[new_gen][type][zone],
+			   lrugen->sizes[new_gen][type][zone] + delta);
+
+	if (old_gen < 0) {
+		if (lru_gen_is_active(lruvec, new_gen))
+			lru += LRU_ACTIVE;
+		update_lru_size(lruvec, lru, zone, delta);
+		return;
+	}
+
+	if (new_gen < 0) {
+		if (lru_gen_is_active(lruvec, old_gen))
+			lru += LRU_ACTIVE;
+		update_lru_size(lruvec, lru, zone, -delta);
+		return;
+	}
+
+	if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) {
+		update_lru_size(lruvec, lru, zone, -delta);
+		update_lru_size(lruvec, lru + LRU_ACTIVE, zone, delta);
+	}
+
+	VM_BUG_ON(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen));
+}
+
+/* Add a page to one of the multigenerational lru lists. Return true on success. */
+static inline bool lru_gen_add_page(struct page *page, struct lruvec *lruvec, bool reclaiming)
+{
+	int gen;
+	unsigned long old_flags, new_flags;
+	int type = page_is_file_lru(page);
+	int zone = page_zonenum(page);
+	struct lrugen *lrugen = &lruvec->evictable;
+
+	if (PageUnevictable(page) || !lrugen->enabled[type])
+		return false;
+	/*
+	 * If a page shouldn't be considered for eviction, i.e., a page mapped
+	 * upon fault during which the accessed bit is set, add it to the
+	 * youngest generation.
+	 *
+	 * If a page can't be evicted immediately, i.e., an anon page not in
+	 * swap cache or a dirty page pending writeback, add it to the second
+	 * oldest generation.
+	 *
+	 * If a page could be evicted immediately, e.g., a clean page, add it to
+	 * the oldest generation.
+	 */
+	if (PageActive(page))
+		gen = lru_gen_from_seq(lrugen->max_seq);
+	else if ((!type && !PageSwapCache(page)) ||
+		 (PageReclaim(page) && (PageDirty(page) || PageWriteback(page))))
+		gen = lru_gen_from_seq(lrugen->min_seq[type] + 1);
+	else
+		gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+	do {
+		new_flags = old_flags = READ_ONCE(page->flags);
+		VM_BUG_ON_PAGE(new_flags & LRU_GEN_MASK, page);
+
+		new_flags &= ~(LRU_GEN_MASK | BIT(PG_active));
+		new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
+	} while (cmpxchg(&page->flags, old_flags, new_flags) != old_flags);
+
+	lru_gen_update_size(page, lruvec, -1, gen);
+	if (reclaiming)
+		list_add_tail(&page->lru, &lrugen->lists[gen][type][zone]);
+	else
+		list_add(&page->lru, &lrugen->lists[gen][type][zone]);
+
+	return true;
+}
+
+/* Delete a page from one of the multigenerational lru lists. Return true on success. */
+static inline bool lru_gen_del_page(struct page *page, struct lruvec *lruvec, bool reclaiming)
+{
+	int gen;
+	unsigned long old_flags, new_flags;
+
+	do {
+		new_flags = old_flags = READ_ONCE(page->flags);
+		if (!(new_flags & LRU_GEN_MASK))
+			return false;
+
+		VM_BUG_ON_PAGE(PageActive(page), page);
+		VM_BUG_ON_PAGE(PageUnevictable(page), page);
+
+		gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+
+		new_flags &= ~LRU_GEN_MASK;
+		if ((new_flags & LRU_TIER_FLAGS) != LRU_TIER_FLAGS)
+			new_flags &= ~(LRU_USAGE_MASK | LRU_TIER_FLAGS);
+		/* see the comment on PageReferenced()/PageReclaim() in shrink_page_list() */
+		if (reclaiming)
+			new_flags &= ~(BIT(PG_referenced) | BIT(PG_reclaim));
+		else if (lru_gen_is_active(lruvec, gen))
+			new_flags |= BIT(PG_active);
+	} while (cmpxchg(&page->flags, old_flags, new_flags) != old_flags);
+
+	lru_gen_update_size(page, lruvec, gen, -1);
+	list_del(&page->lru);
+
+	return true;
+}
+
+#else /* CONFIG_LRU_GEN */
+
+static inline bool lru_gen_enabled(void)
+{
+	return false;
+}
+
+static inline bool lru_gen_add_page(struct page *page, struct lruvec *lruvec, bool reclaiming)
+{
+	return false;
+}
+
+static inline bool lru_gen_del_page(struct page *page, struct lruvec *lruvec, bool reclaiming)
+{
+	return false;
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 static __always_inline void add_page_to_lru_list(struct page *page,
 				struct lruvec *lruvec)
 {
 	enum lru_list lru = page_lru(page);
 
+	if (lru_gen_add_page(page, lruvec, false))
+		return;
+
 	update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page));
 	list_add(&page->lru, &lruvec->lists[lru]);
 }
@@ -93,6 +288,9 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
 {
 	enum lru_list lru = page_lru(page);
 
+	if (lru_gen_add_page(page, lruvec, true))
+		return;
+
 	update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page));
 	list_add_tail(&page->lru, &lruvec->lists[lru]);
 }
@@ -100,6 +298,9 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
 static __always_inline void del_page_from_lru_list(struct page *page,
 				struct lruvec *lruvec)
 {
+	if (lru_gen_del_page(page, lruvec, false))
+		return;
+
 	list_del(&page->lru);
 	update_lru_size(lruvec, page_lru(page), page_zonenum(page),
 			-thp_nr_pages(page));
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fcb535560028..d6c2c3a4ba43 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -294,6 +294,94 @@ enum lruvec_flags {
 					 */
 };
 
+struct lruvec;
+
+#define LRU_GEN_MASK		((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
+#define LRU_USAGE_MASK		((BIT(LRU_USAGE_WIDTH) - 1) << LRU_USAGE_PGOFF)
+
+#ifdef CONFIG_LRU_GEN
+
+/*
+ * For each lruvec, evictable pages are divided into multiple generations. The
+ * youngest and the oldest generation numbers, AKA max_seq and min_seq, are
+ * monotonically increasing. The sliding window technique is used to track at
+ * most MAX_NR_GENS and at least MIN_NR_GENS generations. An offset within the
+ * window, AKA gen, indexes an array of per-type and per-zone lists for the
+ * corresponding generation. The counter in page->flags stores gen+1 while a
+ * page is on one of the multigenerational lru lists. Otherwise, it stores 0.
+ */
+#define MAX_NR_GENS		((unsigned int)CONFIG_NR_LRU_GENS)
+
+/*
+ * Each generation is then divided into multiple tiers. Tiers represent levels
+ * of usage from file descriptors, i.e., mark_page_accessed(). In contrast to
+ * moving across generations which requires the lru lock, moving across tiers
+ * only involves an atomic operation on page->flags and therefore has a
+ * negligible cost.
+ *
+ * The purposes of tiers are to:
+ *   1) estimate whether pages accessed multiple times via file descriptors are
+ *   more active than pages accessed only via page tables by separating the two
+ *   access types into upper tiers and the base tier and comparing refault rates
+ *   across tiers.
+ *   2) improve buffered io performance by deferring the protection of pages
+ *   accessed multiple times until the eviction. That is the protection happens
+ *   in the reclaim path, not the access path.
+ *
+ * Pages accessed N times via file descriptors belong to tier order_base_2(N).
+ * The base tier may be marked by PageReferenced(). All upper tiers are marked
+ * by PageReferenced() && PageWorkingset(). Additional bits from page->flags are
+ * used to support more than one upper tier.
+ */
+#define MAX_NR_TIERS		((unsigned int)CONFIG_TIERS_PER_GEN)
+#define LRU_TIER_FLAGS		(BIT(PG_referenced) | BIT(PG_workingset))
+
+/* Whether to keep historical stats for each generation. */
+#ifdef CONFIG_LRU_GEN_STATS
+#define NR_STAT_GENS		((unsigned int)CONFIG_NR_LRU_GENS)
+#else
+#define NR_STAT_GENS		1U
+#endif
+
+struct lrugen {
+	/* the aging increments the max generation number */
+	unsigned long max_seq;
+	/* the eviction increments the min generation numbers */
+	unsigned long min_seq[ANON_AND_FILE];
+	/* the birth time of each generation in jiffies */
+	unsigned long timestamps[MAX_NR_GENS];
+	/* the multigenerational lru lists */
+	struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	/* the sizes of the multigenerational lru lists in pages */
+	unsigned long sizes[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	/* to determine which type and its tiers to evict */
+	atomic_long_t refaulted[NR_STAT_GENS][ANON_AND_FILE][MAX_NR_TIERS];
+	atomic_long_t evicted[NR_STAT_GENS][ANON_AND_FILE][MAX_NR_TIERS];
+	/* the base tier isn't protected, hence the minus one */
+	unsigned long protected[NR_STAT_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1];
+	/* the exponential moving average of refaulted */
+	unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
+	/* the exponential moving average of evicted+protected */
+	unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS];
+	/* whether the multigenerational lru is enabled */
+	bool enabled[ANON_AND_FILE];
+};
+
+void lru_gen_init_lrugen(struct lruvec *lruvec);
+void lru_gen_set_state(bool enable, bool main, bool swap);
+
+#else /* CONFIG_LRU_GEN */
+
+static inline void lru_gen_init_lrugen(struct lruvec *lruvec)
+{
+}
+
+static inline void lru_gen_set_state(bool enable, bool main, bool swap)
+{
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
 	/* per lruvec lru_lock for memcg */
@@ -311,6 +399,10 @@ struct lruvec {
 	unsigned long			refaults[ANON_AND_FILE];
 	/* Various lruvec state flags (enum lruvec_flags) */
 	unsigned long			flags;
+#ifdef CONFIG_LRU_GEN
+	/* unevictable pages are on LRU_UNEVICTABLE */
+	struct lrugen			evictable;
+#endif
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
 #endif
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index ef1e3e736e14..ce8d5732a3aa 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -26,6 +26,14 @@
 
 #define ZONES_WIDTH		ZONES_SHIFT
 
+#ifdef CONFIG_LRU_GEN
+/* LRU_GEN_WIDTH is generated from order_base_2(CONFIG_NR_LRU_GENS + 1). */
+#define LRU_USAGE_WIDTH		(CONFIG_TIERS_PER_GEN - 2)
+#else
+#define LRU_GEN_WIDTH		0
+#define LRU_USAGE_WIDTH		0
+#endif
+
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
 #define SECTIONS_SHIFT	(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
@@ -55,7 +63,8 @@
 #define SECTIONS_WIDTH		0
 #endif
 
-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_USAGE_WIDTH + SECTIONS_WIDTH + NODES_SHIFT \
+	<= BITS_PER_LONG - NR_PAGEFLAGS
 #define NODES_WIDTH		NODES_SHIFT
 #elif defined(CONFIG_SPARSEMEM_VMEMMAP)
 #error "Vmemmap: No space for nodes field in page flags"
@@ -89,8 +98,8 @@
 #define LAST_CPUPID_SHIFT 0
 #endif
 
-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT \
-	<= BITS_PER_LONG - NR_PAGEFLAGS
+#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_USAGE_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
+	KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
 #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
 #else
 #define LAST_CPUPID_WIDTH 0
@@ -100,8 +109,8 @@
 #define LAST_CPUPID_NOT_IN_PAGE_FLAGS
 #endif
 
-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH \
-	> BITS_PER_LONG - NR_PAGEFLAGS
+#if ZONES_WIDTH + LRU_GEN_WIDTH + LRU_USAGE_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
+	KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
 #error "Not enough bits in page flags"
 #endif
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5922031ffab6..0156ac5f08f0 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -848,7 +848,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
 	 1UL << PG_private	| 1UL << PG_private_2	|	\
 	 1UL << PG_writeback	| 1UL << PG_reserved	|	\
 	 1UL << PG_slab		| 1UL << PG_active 	|	\
-	 1UL << PG_unevictable	| __PG_MLOCKED)
+	 1UL << PG_unevictable	| __PG_MLOCKED | LRU_GEN_MASK)
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.
@@ -859,7 +859,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
  * alloc-free cycle to prevent from reusing the page.
  */
 #define PAGE_FLAGS_CHECK_AT_PREP	\
-	(((1UL << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON)
+	((((1UL << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_USAGE_MASK)
 
 #define PAGE_FLAGS_PRIVATE				\
 	(1UL << PG_private | 1UL << PG_private_2)
diff --git a/kernel/bounds.c b/kernel/bounds.c
index 9795d75b09b2..aba13aa7336c 100644
--- a/kernel/bounds.c
+++ b/kernel/bounds.c
@@ -22,6 +22,9 @@ int main(void)
 	DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
 #endif
 	DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t));
+#ifdef CONFIG_LRU_GEN
+	DEFINE(LRU_GEN_WIDTH, order_base_2(CONFIG_NR_LRU_GENS + 1));
+#endif
 	/* End of constants */
 
 	return 0;
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index bfbeabc17a9d..bec59189e206 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -146,7 +146,6 @@ struct cgroup_mgctx {
 #define DEFINE_CGROUP_MGCTX(name)						\
 	struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name)
 
-extern struct mutex cgroup_mutex;
 extern spinlock_t css_set_lock;
 extern struct cgroup_subsys *cgroup_subsys[];
 extern struct list_head cgroup_roots;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index afff3ac87067..d5ccbfb50352 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2390,7 +2390,8 @@ static void __split_huge_page_tail(struct page *head, int tail,
 #ifdef CONFIG_64BIT
 			 (1L << PG_arch_2) |
 #endif
-			 (1L << PG_dirty)));
+			 (1L << PG_dirty) |
+			 LRU_GEN_MASK | LRU_USAGE_MASK));
 
 	/* ->mapping in first tail page is compound_mapcount */
 	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 9ddaf0e1b0ab..ef0deadb90a7 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -65,14 +65,16 @@ void __init mminit_verify_pageflags_layout(void)
 
 	shift = 8 * sizeof(unsigned long);
 	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH
-		- LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH;
+		- LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH - LRU_GEN_WIDTH - LRU_USAGE_WIDTH;
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
-		"Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Flags %d\n",
+		"Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Gen %d Tier %d Flags %d\n",
 		SECTIONS_WIDTH,
 		NODES_WIDTH,
 		ZONES_WIDTH,
 		LAST_CPUPID_WIDTH,
 		KASAN_TAG_WIDTH,
+		LRU_GEN_WIDTH,
+		LRU_USAGE_WIDTH,
 		NR_PAGEFLAGS);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
 		"Section %d Node %d Zone %d Lastcpupid %d Kasantag %d\n",
diff --git a/mm/mmzone.c b/mm/mmzone.c
index eb89d6e018e2..2055d66a7f22 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -81,6 +81,8 @@ void lruvec_init(struct lruvec *lruvec)
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
+
+	lru_gen_init_lrugen(lruvec);
 }
 
 #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1e07d1c776f2..19dacc4ae35e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2688,6 +2688,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	err = 0;
 	atomic_inc(&proc_poll_event);
 	wake_up_interruptible(&proc_poll_wait);
+	lru_gen_set_state(false, false, true);
 
 out_dput:
 	filp_close(victim, NULL);
@@ -3343,6 +3344,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	mutex_unlock(&swapon_mutex);
 	atomic_inc(&proc_poll_event);
 	wake_up_interruptible(&proc_poll_wait);
+	lru_gen_set_state(true, false, true);
 
 	error = 0;
 	goto out;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b6d14880bd76..a02b5ff37e31 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -49,6 +49,7 @@
 #include <linux/printk.h>
 #include <linux/dax.h>
 #include <linux/psi.h>
+#include <linux/memory.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -2731,6 +2732,334 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	}
 }
 
+#ifdef CONFIG_LRU_GEN
+
+/*
+ * After a page is faulted in, the aging must scan it twice before the eviction
+ * can consider it. The first scan clears the accessed bit set during the
+ * initial fault. And the second scan makes sure it hasn't been used since the
+ * first scan.
+ */
+#define MIN_NR_GENS	2
+
+#define MAX_BATCH_SIZE	8192
+
+/******************************************************************************
+ *                          shorthand helpers
+ ******************************************************************************/
+
+#define DEFINE_MAX_SEQ(lruvec)						\
+	unsigned long max_seq = READ_ONCE((lruvec)->evictable.max_seq)
+
+#define DEFINE_MIN_SEQ(lruvec)						\
+	unsigned long min_seq[ANON_AND_FILE] = {			\
+		READ_ONCE((lruvec)->evictable.min_seq[0]),		\
+		READ_ONCE((lruvec)->evictable.min_seq[1]),		\
+	}
+
+#define for_each_type_zone(type, zone)					\
+	for ((type) = 0; (type) < ANON_AND_FILE; (type)++)		\
+		for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
+
+#define for_each_gen_type_zone(gen, type, zone)				\
+	for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)			\
+		for ((type) = 0; (type) < ANON_AND_FILE; (type)++)	\
+			for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
+
+static int page_lru_gen(struct page *page)
+{
+	unsigned long flags = READ_ONCE(page->flags);
+
+	return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+}
+
+static int page_lru_tier(struct page *page)
+{
+	int usage;
+	unsigned long flags = READ_ONCE(page->flags);
+
+	usage = (flags & LRU_TIER_FLAGS) == LRU_TIER_FLAGS ?
+		((flags & LRU_USAGE_MASK) >> LRU_USAGE_PGOFF) + 1 : 0;
+
+	return lru_tier_from_usage(usage);
+}
+
+static int get_lo_wmark(unsigned long max_seq, unsigned long *min_seq, int swappiness)
+{
+	return max_seq - max(min_seq[!swappiness], min_seq[1]) + 1;
+}
+
+static int get_hi_wmark(unsigned long max_seq, unsigned long *min_seq, int swappiness)
+{
+	return max_seq - min(min_seq[!swappiness], min_seq[1]) + 1;
+}
+
+static int get_nr_gens(struct lruvec *lruvec, int type)
+{
+	return lruvec->evictable.max_seq - lruvec->evictable.min_seq[type] + 1;
+}
+
+static int get_swappiness(struct mem_cgroup *memcg)
+{
+	return mem_cgroup_get_nr_swap_pages(memcg) >= (long)SWAP_CLUSTER_MAX ?
+	       mem_cgroup_swappiness(memcg) : 0;
+}
+
+static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
+{
+	return get_nr_gens(lruvec, 0) >= MIN_NR_GENS &&
+	       get_nr_gens(lruvec, 0) <= MAX_NR_GENS &&
+	       get_nr_gens(lruvec, 1) >= MIN_NR_GENS &&
+	       get_nr_gens(lruvec, 1) <= MAX_NR_GENS;
+}
+
+/******************************************************************************
+ *                          state change
+ ******************************************************************************/
+
+#ifdef CONFIG_LRU_GEN_ENABLED
+DEFINE_STATIC_KEY_TRUE(lru_gen_static_key);
+#else
+DEFINE_STATIC_KEY_FALSE(lru_gen_static_key);
+#endif
+
+static DEFINE_MUTEX(lru_gen_state_mutex);
+static int lru_gen_nr_swapfiles;
+
+static bool __maybe_unused state_is_valid(struct lruvec *lruvec)
+{
+	int gen, type, zone;
+	enum lru_list lru;
+	struct lrugen *lrugen = &lruvec->evictable;
+
+	for_each_evictable_lru(lru) {
+		type = is_file_lru(lru);
+
+		if (lrugen->enabled[type] && !list_empty(&lruvec->lists[lru]))
+			return false;
+	}
+
+	for_each_gen_type_zone(gen, type, zone) {
+		if (!lrugen->enabled[type] && !list_empty(&lrugen->lists[gen][type][zone]))
+			return false;
+
+		VM_WARN_ON_ONCE(!lrugen->enabled[type] && lrugen->sizes[gen][type][zone]);
+	}
+
+	return true;
+}
+
+static bool fill_lists(struct lruvec *lruvec)
+{
+	enum lru_list lru;
+	int remaining = MAX_BATCH_SIZE;
+
+	for_each_evictable_lru(lru) {
+		int type = is_file_lru(lru);
+		bool active = is_active_lru(lru);
+		struct list_head *head = &lruvec->lists[lru];
+
+		if (!lruvec->evictable.enabled[type])
+			continue;
+
+		while (!list_empty(head)) {
+			bool success;
+			struct page *page = lru_to_page(head);
+
+			VM_BUG_ON_PAGE(PageTail(page), page);
+			VM_BUG_ON_PAGE(PageUnevictable(page), page);
+			VM_BUG_ON_PAGE(PageActive(page) != active, page);
+			VM_BUG_ON_PAGE(page_is_file_lru(page) != type, page);
+			VM_BUG_ON_PAGE(page_lru_gen(page) >= 0, page);
+
+			prefetchw_prev_lru_page(page, head, flags);
+
+			del_page_from_lru_list(page, lruvec);
+			success = lru_gen_add_page(page, lruvec, false);
+			VM_BUG_ON(!success);
+
+			if (!--remaining)
+				return false;
+		}
+	}
+
+	return true;
+}
+
+static bool drain_lists(struct lruvec *lruvec)
+{
+	int gen, type, zone;
+	int remaining = MAX_BATCH_SIZE;
+
+	for_each_gen_type_zone(gen, type, zone) {
+		struct list_head *head = &lruvec->evictable.lists[gen][type][zone];
+
+		if (lruvec->evictable.enabled[type])
+			continue;
+
+		while (!list_empty(head)) {
+			bool success;
+			struct page *page = lru_to_page(head);
+
+			VM_BUG_ON_PAGE(PageTail(page), page);
+			VM_BUG_ON_PAGE(PageUnevictable(page), page);
+			VM_BUG_ON_PAGE(PageActive(page), page);
+			VM_BUG_ON_PAGE(page_is_file_lru(page) != type, page);
+			VM_BUG_ON_PAGE(page_zonenum(page) != zone, page);
+
+			prefetchw_prev_lru_page(page, head, flags);
+
+			success = lru_gen_del_page(page, lruvec, false);
+			VM_BUG_ON(!success);
+			add_page_to_lru_list(page, lruvec);
+
+			if (!--remaining)
+				return false;
+		}
+	}
+
+	return true;
+}
+
+/*
+ * For file page tracking, we enable/disable it according to the main switch.
+ * For anon page tracking, we only enable it when the main switch is on and
+ * there is at least one swapfile; we disable it when there are no swapfiles
+ * regardless of the value of the main switch. Otherwise, we will eventually
+ * reach the max size of the sliding window and have to call inc_min_seq().
+ */
+void lru_gen_set_state(bool enable, bool main, bool swap)
+{
+	struct mem_cgroup *memcg;
+
+	mem_hotplug_begin();
+	mutex_lock(&lru_gen_state_mutex);
+	cgroup_lock();
+
+	if (swap) {
+		if (enable)
+			swap = !lru_gen_nr_swapfiles++;
+		else
+			swap = !--lru_gen_nr_swapfiles;
+	}
+
+	if (main && enable != lru_gen_enabled()) {
+		if (enable)
+			static_branch_enable(&lru_gen_static_key);
+		else
+			static_branch_disable(&lru_gen_static_key);
+	} else if (!swap || !lru_gen_enabled())
+		goto unlock;
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		int nid;
+
+		for_each_node_state(nid, N_MEMORY) {
+			struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+			struct lrugen *lrugen = &lruvec->evictable;
+
+			spin_lock_irq(&lruvec->lru_lock);
+
+			VM_BUG_ON(!seq_is_valid(lruvec));
+			VM_BUG_ON(!state_is_valid(lruvec));
+
+			lrugen->enabled[0] = lru_gen_enabled() && lru_gen_nr_swapfiles;
+			lrugen->enabled[1] = lru_gen_enabled();
+
+			while (!(enable ? fill_lists(lruvec) : drain_lists(lruvec))) {
+				spin_unlock_irq(&lruvec->lru_lock);
+				cond_resched();
+				spin_lock_irq(&lruvec->lru_lock);
+			}
+
+			spin_unlock_irq(&lruvec->lru_lock);
+		}
+
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+unlock:
+	cgroup_unlock();
+	mutex_unlock(&lru_gen_state_mutex);
+	mem_hotplug_done();
+}
+
+static int __meminit __maybe_unused mem_notifier(struct notifier_block *self,
+						 unsigned long action, void *arg)
+{
+	struct mem_cgroup *memcg;
+	struct pglist_data *pgdat;
+	struct memory_notify *mn = arg;
+	int nid = mn->status_change_nid;
+
+	if (nid == NUMA_NO_NODE)
+		return NOTIFY_DONE;
+
+	pgdat = NODE_DATA(nid);
+
+	if (action != MEM_GOING_ONLINE)
+		return NOTIFY_DONE;
+
+	mutex_lock(&lru_gen_state_mutex);
+	cgroup_lock();
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+		struct lrugen *lrugen = &lruvec->evictable;
+
+		VM_BUG_ON(!seq_is_valid(lruvec));
+		VM_BUG_ON(!state_is_valid(lruvec));
+
+		lrugen->enabled[0] = lru_gen_enabled() && lru_gen_nr_swapfiles;
+		lrugen->enabled[1] = lru_gen_enabled();
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+	cgroup_unlock();
+	mutex_unlock(&lru_gen_state_mutex);
+
+	return NOTIFY_DONE;
+}
+
+/******************************************************************************
+ *                          initialization
+ ******************************************************************************/
+
+void lru_gen_init_lrugen(struct lruvec *lruvec)
+{
+	int i;
+	int gen, type, zone;
+	struct lrugen *lrugen = &lruvec->evictable;
+
+	lrugen->max_seq = MIN_NR_GENS + 1;
+	lrugen->enabled[0] = lru_gen_enabled() && lru_gen_nr_swapfiles;
+	lrugen->enabled[1] = lru_gen_enabled();
+
+	for (i = 0; i <= MIN_NR_GENS + 1; i++)
+		lrugen->timestamps[i] = jiffies;
+
+	for_each_gen_type_zone(gen, type, zone)
+		INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
+}
+
+static int __init init_lru_gen(void)
+{
+	BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS);
+	BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
+
+	if (hotplug_memory_notifier(mem_notifier, 0))
+		pr_err("lru_gen: failed to subscribe hotplug notifications\n");
+
+	return 0;
+}
+/*
+ * We want to run as early as possible because debug code may call mm_alloc()
+ * and mmput(). Our only dependency, mm_kobj, is initialized one stage earlier.
+ */
+arch_initcall(init_lru_gen);
+
+#endif /* CONFIG_LRU_GEN */
+
 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
 	unsigned long nr[NR_LRU_LISTS];
-- 
2.33.0.rc1.237.g0d66db33f3-goog



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v4 05/11] mm: multigenerational lru: protection
  2021-08-18  6:30 [PATCH v4 00/11] Multigenerational LRU Framework Yu Zhao
                   ` (3 preceding siblings ...)
  2021-08-18  6:31 ` [PATCH v4 04/11] mm: multigenerational lru: groundwork Yu Zhao
@ 2021-08-18  6:31 ` Yu Zhao
  2021-08-18  6:31 ` [PATCH v4 06/11] mm: multigenerational lru: mm_struct list Yu Zhao
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yu Zhao @ 2021-08-18  6:31 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Hillf Danton, page-reclaim, Yu Zhao, Konstantin Kharlamov

The protection is based on page access types and patterns. There are
two access types: one via page tables and the other via file
descriptors. The protection of the former type is by design stronger
because:
  1) The uncertainty in determining the access patterns of the former
  type is higher due to the coalesced nature of the accessed bit.
  2) The cost of evicting the former type is higher due to the TLB
  flushes required and the likelihood of involving I/O.
  3) The penalty of under-protecting the former type is higher because
  applications usually do not prepare themselves for major faults like
  they do for blocked I/O. For example, client applications commonly
  dedicate blocked I/O to separate threads to avoid UI janks that
  negatively affect user experience.

There are also two access patterns: one with temporal locality and the
other without. The latter pattern, e.g., random and sequential, needs
to be explicitly excluded to avoid weakening the protection of the
former pattern. Generally, the former type follows the former pattern
unless MADV_SEQUENTIAL is specified, and the latter type follows the
latter pattern unless outlying refaults have been observed.

Upon faulting, a page is added to the youngest generation, which
provides the strongest protection as the eviction will not consider
this page before the aging has scanned it at least twice. The first
scan clears the accessed bit set during the initial fault. And the
second scan makes sure this page has not been used since the first
scan. A page from any other generation is brought back to the
youngest generation whenever the aging finds the accessed bit set on
any of the PTEs mapping this page.

Unmapped pages are initially added to the oldest generation and then
conditionally protected by tiers. Pages accessed N times via file
descriptors belong to tier order_base_2(N). Each tier keeps track of
how many pages from it have refaulted. Tier 0 is the base tier and
pages from it are evicted unconditionally because there are no better
candidates. Pages from an upper tier are either evicted or moved to
the next generation, depending on whether this upper tier has a higher
refault rate than the base tier. This model has the following
advantages:
  1) It removes the cost in the buffered access path and reduces the
  overall cost of protection because pages are conditionally protected
  in the reclaim path.
  2) It takes mapped pages into account and avoids overprotecting
  pages accessed multiple times via file descriptors.
  3) Additional tiers improve the protection of pages accessed more
  than twice.
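
For illustration only, the following standalone userspace sketch shows the
tier model described above; it is not part of the patch. The names
tier_from_accesses() and evict_from_upper_tier() are made up for this
sketch: the former mirrors order_base_2(N), and the latter mirrors the
spirit of the refault-rate comparison that the patch performs in
positive_ctrl_err().

#include <stdio.h>

/* tier order_base_2(n): the smallest order such that 2^order >= n */
static int tier_from_accesses(unsigned int n)
{
	int order = 0;

	while ((1u << order) < n)
		order++;

	return order;
}

/*
 * Evict from an upper tier only if its refault rate (refaulted/total) does
 * not exceed the base tier's; cross-multiply instead of dividing.
 */
static int evict_from_upper_tier(unsigned long upper_refaulted, unsigned long upper_total,
				 unsigned long base_refaulted, unsigned long base_total)
{
	return upper_refaulted * (base_total ? base_total : 1) <=
	       base_refaulted * (upper_total ? upper_total : 1);
}

int main(void)
{
	unsigned int n;

	for (n = 1; n <= 8; n++)
		printf("accessed %u time(s) -> tier %d\n", n, tier_from_accesses(n));

	/* the upper tier refaults less often than the base tier -> evict from it */
	printf("evict from upper tier? %d\n", evict_from_upper_tier(5, 100, 20, 100));

	return 0;
}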

Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
---
 include/linux/mm.h    |  32 ++++++++++++
 include/linux/sched.h |   3 ++
 mm/memory.c           |   7 +++
 mm/swap.c             |  51 +++++++++++++++++-
 mm/vmscan.c           |  91 +++++++++++++++++++++++++++++++-
 mm/workingset.c       | 119 +++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 298 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 159b7c94e067..7a91518792ba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1778,6 +1778,25 @@ void unmap_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t nr, bool even_cows);
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows);
+
+static inline void task_enter_nonseq_fault(void)
+{
+	WARN_ON(current->in_nonseq_fault);
+
+	current->in_nonseq_fault = 1;
+}
+
+static inline void task_exit_nonseq_fault(void)
+{
+	WARN_ON(!current->in_nonseq_fault);
+
+	current->in_nonseq_fault = 0;
+}
+
+static inline bool task_in_nonseq_fault(void)
+{
+	return current->in_nonseq_fault;
+}
 #else
 static inline vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
 					 unsigned long address, unsigned int flags,
@@ -1799,6 +1818,19 @@ static inline void unmap_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t nr, bool even_cows) { }
 static inline void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows) { }
+
+static inline void task_enter_nonseq_fault(void)
+{
+}
+
+static inline void task_exit_nonseq_fault(void)
+{
+}
+
+static inline bool task_in_nonseq_fault(void)
+{
+	return false;
+}
 #endif
 
 static inline void unmap_shared_mapping_range(struct address_space *mapping,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ec8d07d88641..fd41c9c86cd1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -843,6 +843,9 @@ struct task_struct {
 #ifdef CONFIG_MEMCG
 	unsigned			in_user_fault:1;
 #endif
+#ifdef CONFIG_MMU
+	unsigned			in_nonseq_fault:1;
+#endif
 #ifdef CONFIG_COMPAT_BRK
 	unsigned			brk_randomized:1;
 #endif
diff --git a/mm/memory.c b/mm/memory.c
index 2f96179db219..fa40a5b7a7a7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4752,6 +4752,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 			   unsigned int flags, struct pt_regs *regs)
 {
 	vm_fault_t ret;
+	bool nonseq_fault = !(vma->vm_flags & VM_SEQ_READ);
 
 	__set_current_state(TASK_RUNNING);
 
@@ -4773,11 +4774,17 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	if (flags & FAULT_FLAG_USER)
 		mem_cgroup_enter_user_fault();
 
+	if (nonseq_fault)
+		task_enter_nonseq_fault();
+
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
 	else
 		ret = __handle_mm_fault(vma, address, flags);
 
+	if (nonseq_fault)
+		task_exit_nonseq_fault();
+
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_exit_user_fault();
 		/*
diff --git a/mm/swap.c b/mm/swap.c
index 19600430e536..0d3fb2ee3fd6 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -411,6 +411,43 @@ static void __lru_cache_activate_page(struct page *page)
 	local_unlock(&lru_pvecs.lock);
 }
 
+#ifdef CONFIG_LRU_GEN
+static void page_inc_usage(struct page *page)
+{
+	unsigned long usage;
+	unsigned long old_flags, new_flags;
+
+	if (PageUnevictable(page))
+		return;
+
+	/* see the comment on MAX_NR_TIERS */
+	do {
+		new_flags = old_flags = READ_ONCE(page->flags);
+
+		if (!(new_flags & BIT(PG_referenced))) {
+			new_flags |= BIT(PG_referenced);
+			continue;
+		}
+
+		if (!(new_flags & BIT(PG_workingset))) {
+			new_flags |= BIT(PG_workingset);
+			continue;
+		}
+
+		usage = new_flags & LRU_USAGE_MASK;
+		usage = min(usage + BIT(LRU_USAGE_PGOFF), LRU_USAGE_MASK);
+
+		new_flags &= ~LRU_USAGE_MASK;
+		new_flags |= usage;
+	} while (new_flags != old_flags &&
+		 cmpxchg(&page->flags, old_flags, new_flags) != old_flags);
+}
+#else
+static void page_inc_usage(struct page *page)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
 /*
  * Mark a page as having seen activity.
  *
@@ -425,6 +462,11 @@ void mark_page_accessed(struct page *page)
 {
 	page = compound_head(page);
 
+	if (lru_gen_enabled()) {
+		page_inc_usage(page);
+		return;
+	}
+
 	if (!PageReferenced(page)) {
 		SetPageReferenced(page);
 	} else if (PageUnevictable(page)) {
@@ -468,6 +510,11 @@ void lru_cache_add(struct page *page)
 	VM_BUG_ON_PAGE(PageActive(page) && PageUnevictable(page), page);
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
+	/* see the comment in lru_gen_add_page() */
+	if (lru_gen_enabled() && !PageUnevictable(page) &&
+	    task_in_nonseq_fault() && !(current->flags & PF_MEMALLOC))
+		SetPageActive(page);
+
 	get_page(page);
 	local_lock(&lru_pvecs.lock);
 	pvec = this_cpu_ptr(&lru_pvecs.lru_add);
@@ -569,7 +616,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 
 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageActive(page) && !PageUnevictable(page)) {
+	if (!PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
 		int nr_pages = thp_nr_pages(page);
 
 		del_page_from_lru_list(page, lruvec);
@@ -684,7 +731,7 @@ void deactivate_file_page(struct page *page)
  */
 void deactivate_page(struct page *page)
 {
-	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+	if (PageLRU(page) && !PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
 		struct pagevec *pvec;
 
 		local_lock(&lru_pvecs.lock);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a02b5ff37e31..788b4d1ce149 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1094,9 +1094,11 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
-		mem_cgroup_swapout(page, swap);
+
+		/* get a shadow entry before page_memcg() is cleared */
 		if (reclaimed && !mapping_exiting(mapping))
 			shadow = workingset_eviction(page, target_memcg);
+		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page, swap, shadow);
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
 		put_swap_page(page, swap);
@@ -2813,6 +2815,93 @@ static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
 	       get_nr_gens(lruvec, 1) <= MAX_NR_GENS;
 }
 
+/******************************************************************************
+ *                          refault feedback loop
+ ******************************************************************************/
+
+/*
+ * A feedback loop modeled after the PID controller. Currently supports the
+ * proportional (P) and the integral (I) terms; the derivative (D) term can be
+ * added if necessary. The setpoint (SP) is the desired position; the process
+ * variable (PV) is the measured position. The error is the difference between
+ * the SP and the PV. A positive error results in a positive control output
+ * correction, which, in our case, is to allow eviction.
+ *
+ * The P term is the refault rate of the current generation being evicted. The I
+ * term is the exponential moving average of the refault rates of the previous
+ * generations, using the smoothing factor 1/2.
+ *
+ * Our goal is to make sure that upper tiers have refault rates similar to the
+ * base tier's. That is, we try to be fair to all tiers by maintaining similar
+ * refault rates across them.
+ */
+struct controller_pos {
+	unsigned long refaulted;
+	unsigned long total;
+	int gain;
+};
+
+static void read_controller_pos(struct controller_pos *pos, struct lruvec *lruvec,
+				int type, int tier, int gain)
+{
+	struct lrugen *lrugen = &lruvec->evictable;
+	int hist = lru_hist_from_seq(lrugen->min_seq[type]);
+
+	pos->refaulted = lrugen->avg_refaulted[type][tier] +
+			 atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+	pos->total = lrugen->avg_total[type][tier] +
+		     atomic_long_read(&lrugen->evicted[hist][type][tier]);
+	if (tier)
+		pos->total += lrugen->protected[hist][type][tier - 1];
+	pos->gain = gain;
+}
+
+static void reset_controller_pos(struct lruvec *lruvec, int gen, int type)
+{
+	int tier;
+	int hist = lru_hist_from_seq(gen);
+	struct lrugen *lrugen = &lruvec->evictable;
+	bool carryover = gen == lru_gen_from_seq(lrugen->min_seq[type]);
+
+	if (!carryover && NR_STAT_GENS == 1)
+		return;
+
+	for (tier = 0; tier < MAX_NR_TIERS; tier++) {
+		if (carryover) {
+			unsigned long sum;
+
+			sum = lrugen->avg_refaulted[type][tier] +
+			      atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+			WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2);
+
+			sum = lrugen->avg_total[type][tier] +
+			      atomic_long_read(&lrugen->evicted[hist][type][tier]);
+			if (tier)
+				sum += lrugen->protected[hist][type][tier - 1];
+			WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2);
+
+			if (NR_STAT_GENS > 1)
+				continue;
+		}
+
+		atomic_long_set(&lrugen->refaulted[hist][type][tier], 0);
+		atomic_long_set(&lrugen->evicted[hist][type][tier], 0);
+		if (tier)
+			WRITE_ONCE(lrugen->protected[hist][type][tier - 1], 0);
+	}
+}
+
+static bool positive_ctrl_err(struct controller_pos *sp, struct controller_pos *pv)
+{
+	/*
+	 * Allow eviction if the PV has a limited number of refaulted pages or a
+	 * lower refault rate than the SP.
+	 */
+	return pv->refaulted < SWAP_CLUSTER_MAX ||
+	       pv->refaulted * max(sp->total, 1UL) * sp->gain <=
+	       sp->refaulted * max(pv->total, 1UL) * pv->gain;
+}
+
 /******************************************************************************
  *                          state change
  ******************************************************************************/
diff --git a/mm/workingset.c b/mm/workingset.c
index 5ba3e42446fa..75dbfba773a6 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -187,7 +187,6 @@ static unsigned int bucket_order __read_mostly;
 static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
 			 bool workingset)
 {
-	eviction >>= bucket_order;
 	eviction &= EVICTION_MASK;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
@@ -212,10 +211,116 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 
 	*memcgidp = memcgid;
 	*pgdat = NODE_DATA(nid);
-	*evictionp = entry << bucket_order;
+	*evictionp = entry;
 	*workingsetp = workingset;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+static int page_get_usage(struct page *page)
+{
+	unsigned long flags = READ_ONCE(page->flags);
+
+	BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_USAGE_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
+
+	/* see the comment on MAX_NR_TIERS */
+	return flags & BIT(PG_workingset) ?
+	       (flags & LRU_USAGE_MASK) >> LRU_USAGE_PGOFF : 0;
+}
+
+/* Return a token to be stored in the shadow entry of a page being evicted. */
+static void *lru_gen_eviction(struct page *page)
+{
+	int hist, tier;
+	unsigned long token;
+	unsigned long min_seq;
+	struct lruvec *lruvec;
+	struct lrugen *lrugen;
+	int type = page_is_file_lru(page);
+	int usage = page_get_usage(page);
+	bool workingset = PageWorkingset(page);
+	struct mem_cgroup *memcg = page_memcg(page);
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	lrugen = &lruvec->evictable;
+	min_seq = READ_ONCE(lrugen->min_seq[type]);
+	token = (min_seq << LRU_USAGE_WIDTH) | usage;
+
+	hist = lru_hist_from_seq(min_seq);
+	tier = lru_tier_from_usage(usage + workingset);
+	atomic_long_add(thp_nr_pages(page), &lrugen->evicted[hist][type][tier]);
+
+	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset);
+}
+
+/* Count a refaulted page based on the token stored in its shadow entry. */
+static void lru_gen_refault(struct page *page, void *shadow)
+{
+	int hist, tier, usage;
+	int memcg_id;
+	bool workingset;
+	unsigned long token;
+	unsigned long min_seq;
+	struct lruvec *lruvec;
+	struct lrugen *lrugen;
+	struct mem_cgroup *memcg;
+	struct pglist_data *pgdat;
+	int type = page_is_file_lru(page);
+
+	unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset);
+	if (page_pgdat(page) != pgdat)
+		return;
+
+	rcu_read_lock();
+	memcg = page_memcg_rcu(page);
+	if (mem_cgroup_id(memcg) != memcg_id)
+		goto unlock;
+
+	usage = token & (BIT(LRU_USAGE_WIDTH) - 1);
+	if (usage && !workingset)
+		goto unlock;
+
+	token >>= LRU_USAGE_WIDTH;
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	lrugen = &lruvec->evictable;
+	min_seq = READ_ONCE(lrugen->min_seq[type]);
+	if (token != (min_seq & (EVICTION_MASK >> LRU_USAGE_WIDTH)))
+		goto unlock;
+
+	hist = lru_hist_from_seq(min_seq);
+	tier = lru_tier_from_usage(usage + workingset);
+	atomic_long_add(thp_nr_pages(page), &lrugen->refaulted[hist][type][tier]);
+	inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type);
+
+	/*
+	 * Tiers don't offer any protection to pages accessed via page tables.
+	 * That's what generations do. Tiers can't fully protect pages after
+	 * their usage has exceeded the max value. Conservatively count these
+	 * two conditions as stalls even though they might not indicate any real
+	 * memory pressure.
+	 */
+	if (task_in_nonseq_fault() || usage + workingset == BIT(LRU_USAGE_WIDTH)) {
+		SetPageWorkingset(page);
+		inc_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type);
+	}
+unlock:
+	rcu_read_unlock();
+}
+
+#else /* CONFIG_LRU_GEN */
+
+static void *lru_gen_eviction(struct page *page)
+{
+	return NULL;
+}
+
+static void lru_gen_refault(struct page *page, void *shadow)
+{
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 /**
  * workingset_age_nonresident - age non-resident entries as LRU ages
  * @lruvec: the lruvec that was aged
@@ -264,10 +369,14 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
+	if (lru_gen_enabled())
+		return lru_gen_eviction(page);
+
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
 	eviction = atomic_long_read(&lruvec->nonresident_age);
+	eviction >>= bucket_order;
 	workingset_age_nonresident(lruvec, thp_nr_pages(page));
 	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
@@ -296,7 +405,13 @@ void workingset_refault(struct page *page, void *shadow)
 	bool workingset;
 	int memcgid;
 
+	if (lru_gen_enabled()) {
+		lru_gen_refault(page, shadow);
+		return;
+	}
+
 	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
+	eviction <<= bucket_order;
 
 	rcu_read_lock();
 	/*
-- 
2.33.0.rc1.237.g0d66db33f3-goog



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v4 06/11] mm: multigenerational lru: mm_struct list
  2021-08-18  6:30 [PATCH v4 00/11] Multigenerational LRU Framework Yu Zhao
                   ` (4 preceding siblings ...)
  2021-08-18  6:31 ` [PATCH v4 05/11] mm: multigenerational lru: protection Yu Zhao
@ 2021-08-18  6:31 ` Yu Zhao
  2021-08-18  6:31 ` [PATCH v4 07/11] mm: multigenerational lru: aging Yu Zhao
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yu Zhao @ 2021-08-18  6:31 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Hillf Danton, page-reclaim, Yu Zhao, Konstantin Kharlamov

To scan PTEs for accessed pages, a mm_struct list is maintained for
each memcg. When multiple threads traverse the same memcg->mm_list,
each of them gets a unique mm_struct, and therefore they can run
walk_page_range() concurrently to reach the page tables of all
processes of this memcg.

To skip the page tables of processes that have been sleeping since the
last walk, the usage of each mm_struct is also tracked between context
switches.
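
The following standalone userspace sketch, not taken from the patch,
illustrates the sharing scheme: walker threads advance one shared iterator
under a lock, so each item (standing in for an mm_struct) is handed to
exactly one walker per pass. The real get_next_mm() additionally tracks
sequence numbers, per-node state and the number of walkers.

#include <pthread.h>
#include <stdio.h>

#define NR_ITEMS	8	/* stand-ins for mm_structs on the list */
#define NR_WALKERS	3

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int next_item;	/* shared iterator, like mm_list->nodes[nid].iter */

static int get_next_item(void)
{
	int item = -1;

	pthread_mutex_lock(&lock);
	if (next_item < NR_ITEMS)
		item = next_item++;
	pthread_mutex_unlock(&lock);

	return item;
}

static void *walker(void *arg)
{
	long id = (long)arg;
	int item;

	/* each walker only scans the items it was handed */
	while ((item = get_next_item()) >= 0)
		printf("walker %ld scans mm %d\n", id, item);

	return NULL;
}

int main(void)
{
	pthread_t threads[NR_WALKERS];
	long i;

	for (i = 0; i < NR_WALKERS; i++)
		pthread_create(&threads[i], NULL, walker, (void *)i);
	for (i = 0; i < NR_WALKERS; i++)
		pthread_join(threads[i], NULL);

	return 0;
}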

Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
---
 fs/exec.c                  |   2 +
 include/linux/memcontrol.h |   6 +
 include/linux/mm_types.h   | 107 +++++++++++++
 kernel/exit.c              |   1 +
 kernel/fork.c              |  10 ++
 kernel/kthread.c           |   1 +
 kernel/sched/core.c        |   2 +
 mm/memcontrol.c            |  28 ++++
 mm/vmscan.c                | 313 +++++++++++++++++++++++++++++++++++++
 9 files changed, 470 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 38f63451b928..7ead083bcb39 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1005,6 +1005,7 @@ static int exec_mmap(struct mm_struct *mm)
 	active_mm = tsk->active_mm;
 	tsk->active_mm = mm;
 	tsk->mm = mm;
+	lru_gen_add_mm(mm);
 	/*
 	 * This prevents preemption while active_mm is being loaded and
 	 * it and mm are being updated, which could cause problems for
@@ -1015,6 +1016,7 @@ static int exec_mmap(struct mm_struct *mm)
 	if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
 		local_irq_enable();
 	activate_mm(active_mm, mm);
+	lru_gen_switch_mm(active_mm, mm);
 	if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
 		local_irq_enable();
 	tsk->mm->vmacache_seqnum = 0;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bfe5c486f4ad..5e223cecb5c2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -230,6 +230,8 @@ struct obj_cgroup {
 	};
 };
 
+struct lru_gen_mm_list;
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -349,6 +351,10 @@ struct mem_cgroup {
 	struct deferred_split deferred_split_queue;
 #endif
 
+#ifdef CONFIG_LRU_GEN
+	struct lru_gen_mm_list *mm_list;
+#endif
+
 	struct mem_cgroup_per_node *nodeinfo[];
 };
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 52bbd2b7cb46..d9a2ba150ce8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -15,6 +15,8 @@
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
+#include <linux/nodemask.h>
+#include <linux/mmdebug.h>
 
 #include <asm/mmu.h>
 
@@ -571,6 +573,22 @@ struct mm_struct {
 
 #ifdef CONFIG_IOMMU_SUPPORT
 		u32 pasid;
+#endif
+#ifdef CONFIG_LRU_GEN
+		struct {
+			/* the node of a global or per-memcg mm_struct list */
+			struct list_head list;
+#ifdef CONFIG_MEMCG
+			/* points to the memcg of the owner task above */
+			struct mem_cgroup *memcg;
+#endif
+			/* whether this mm_struct has been used since the last walk */
+			nodemask_t nodes;
+#ifndef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+			/* the number of CPUs using this mm_struct */
+			atomic_t nr_cpus;
+#endif
+		} lrugen;
 #endif
 	} __randomize_layout;
 
@@ -598,6 +616,95 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 	return (struct cpumask *)&mm->cpu_bitmap;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+void lru_gen_init_mm(struct mm_struct *mm);
+void lru_gen_add_mm(struct mm_struct *mm);
+void lru_gen_del_mm(struct mm_struct *mm);
+#ifdef CONFIG_MEMCG
+int lru_gen_alloc_mm_list(struct mem_cgroup *memcg);
+void lru_gen_free_mm_list(struct mem_cgroup *memcg);
+void lru_gen_migrate_mm(struct mm_struct *mm);
+#endif
+
+/* Track the usage of each mm_struct so that we can skip inactive ones. */
+static inline void lru_gen_switch_mm(struct mm_struct *old, struct mm_struct *new)
+{
+	/* exclude init_mm, efi_mm, etc. */
+	if (!core_kernel_data((unsigned long)old)) {
+		VM_BUG_ON(old == &init_mm);
+
+		nodes_setall(old->lrugen.nodes);
+#ifndef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+		atomic_dec(&old->lrugen.nr_cpus);
+		VM_BUG_ON_MM(atomic_read(&old->lrugen.nr_cpus) < 0, old);
+#endif
+	} else
+		VM_BUG_ON_MM(READ_ONCE(old->lrugen.list.prev) ||
+			     READ_ONCE(old->lrugen.list.next), old);
+
+	if (!core_kernel_data((unsigned long)new)) {
+		VM_BUG_ON(new == &init_mm);
+
+#ifndef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+		atomic_inc(&new->lrugen.nr_cpus);
+		VM_BUG_ON_MM(atomic_read(&new->lrugen.nr_cpus) < 0, new);
+#endif
+	} else
+		VM_BUG_ON_MM(READ_ONCE(new->lrugen.list.prev) ||
+			     READ_ONCE(new->lrugen.list.next), new);
+}
+
+/* Return whether this mm_struct is being used on any CPUs. */
+static inline bool lru_gen_mm_is_active(struct mm_struct *mm)
+{
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+	return !cpumask_empty(mm_cpumask(mm));
+#else
+	return atomic_read(&mm->lrugen.nr_cpus);
+#endif
+}
+
+#else /* CONFIG_LRU_GEN */
+
+static inline void lru_gen_init_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_add_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_del_mm(struct mm_struct *mm)
+{
+}
+
+#ifdef CONFIG_MEMCG
+static inline int lru_gen_alloc_mm_list(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
+static inline void lru_gen_free_mm_list(struct mem_cgroup *memcg)
+{
+}
+
+static inline void lru_gen_migrate_mm(struct mm_struct *mm)
+{
+}
+#endif
+
+static inline void lru_gen_switch_mm(struct mm_struct *old, struct mm_struct *new)
+{
+}
+
+static inline bool lru_gen_mm_is_active(struct mm_struct *mm)
+{
+	return false;
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/kernel/exit.c b/kernel/exit.c
index 9a89e7f36acb..c24d5ffae792 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -422,6 +422,7 @@ void mm_update_next_owner(struct mm_struct *mm)
 		goto retry;
 	}
 	WRITE_ONCE(mm->owner, c);
+	lru_gen_migrate_mm(mm);
 	task_unlock(c);
 	put_task_struct(c);
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index bc94b2cc5995..e5f5dd5ac584 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -669,6 +669,7 @@ static void check_mm(struct mm_struct *mm)
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	VM_BUG_ON_MM(mm->pmd_huge_pte, mm);
 #endif
+	VM_BUG_ON_MM(lru_gen_mm_is_active(mm), mm);
 }
 
 #define allocate_mm()	(kmem_cache_alloc(mm_cachep, GFP_KERNEL))
@@ -1066,6 +1067,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 		goto fail_nocontext;
 
 	mm->user_ns = get_user_ns(user_ns);
+	lru_gen_init_mm(mm);
 	return mm;
 
 fail_nocontext:
@@ -1108,6 +1110,7 @@ static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+	lru_gen_del_mm(mm);
 	mmdrop(mm);
 }
 
@@ -2530,6 +2533,13 @@ pid_t kernel_clone(struct kernel_clone_args *args)
 		get_task_struct(p);
 	}
 
+	if (IS_ENABLED(CONFIG_LRU_GEN) && !(clone_flags & CLONE_VM)) {
+		/* lock the task to synchronize with memcg migration */
+		task_lock(p);
+		lru_gen_add_mm(p->mm);
+		task_unlock(p);
+	}
+
 	wake_up_new_task(p);
 
 	/* forking complete and child started to run, tell ptracer */
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 5b37a8567168..fd827fdad26b 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1361,6 +1361,7 @@ void kthread_use_mm(struct mm_struct *mm)
 	tsk->mm = mm;
 	membarrier_update_current_mm(mm);
 	switch_mm_irqs_off(active_mm, mm, tsk);
+	lru_gen_switch_mm(active_mm, mm);
 	local_irq_enable();
 	task_unlock(tsk);
 #ifdef finish_arch_post_lock_switch
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 20ffcc044134..eea1457704ed 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4665,6 +4665,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 		 * finish_task_switch()'s mmdrop().
 		 */
 		switch_mm_irqs_off(prev->active_mm, next->mm, next);
+		lru_gen_switch_mm(prev->active_mm, next->mm);
 
 		if (!prev->mm) {                        // from kernel
 			/* will mmdrop() in finish_task_switch(). */
@@ -8391,6 +8392,7 @@ void idle_task_exit(void)
 
 	if (mm != &init_mm) {
 		switch_mm(mm, &init_mm, current);
+		lru_gen_switch_mm(mm, &init_mm);
 		finish_arch_post_lock_switch();
 	}
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 702a81dfe72d..8597992797d0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5172,6 +5172,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 	for_each_node(node)
 		free_mem_cgroup_per_node_info(memcg, node);
 	free_percpu(memcg->vmstats_percpu);
+	lru_gen_free_mm_list(memcg);
 	kfree(memcg);
 }
 
@@ -5221,6 +5222,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 		if (alloc_mem_cgroup_per_node_info(memcg, node))
 			goto fail;
 
+	if (lru_gen_alloc_mm_list(memcg))
+		goto fail;
+
 	if (memcg_wb_domain_init(memcg, GFP_KERNEL))
 		goto fail;
 
@@ -6182,6 +6186,29 @@ static void mem_cgroup_move_task(void)
 }
 #endif
 
+#ifdef CONFIG_LRU_GEN
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+	struct cgroup_subsys_state *css;
+	struct task_struct *task = NULL;
+
+	cgroup_taskset_for_each_leader(task, css, tset)
+		break;
+
+	if (!task)
+		return;
+
+	task_lock(task);
+	if (task->mm && task->mm->owner == task)
+		lru_gen_migrate_mm(task->mm);
+	task_unlock(task);
+}
+#else
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+}
+#endif
+
 static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value)
 {
 	if (value == PAGE_COUNTER_MAX)
@@ -6523,6 +6550,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.css_reset = mem_cgroup_css_reset,
 	.css_rstat_flush = mem_cgroup_css_rstat_flush,
 	.can_attach = mem_cgroup_can_attach,
+	.attach = mem_cgroup_attach,
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.post_attach = mem_cgroup_move_task,
 	.dfl_cftypes = memory_files,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 788b4d1ce149..15eadf2a135e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2902,6 +2902,312 @@ static bool positive_ctrl_err(struct controller_pos *sp, struct controller_pos *
 	       sp->refaulted * max(pv->total, 1UL) * pv->gain;
 }
 
+/******************************************************************************
+ *                          mm_struct list
+ ******************************************************************************/
+
+enum {
+	MM_SCHED_ACTIVE,	/* running processes */
+	MM_SCHED_INACTIVE,	/* sleeping processes */
+	MM_LOCK_CONTENTION,	/* lock contentions */
+	MM_VMA_INTERVAL,	/* VMAs within the range of each PUD/PMD/PTE */
+	MM_LEAF_OTHER_NODE,	/* entries not from the node under reclaim */
+	MM_LEAF_OTHER_MEMCG,	/* entries not from the memcg under reclaim */
+	MM_LEAF_OLD,		/* old entries */
+	MM_LEAF_YOUNG,		/* young entries */
+	MM_LEAF_DIRTY,		/* dirty entries */
+	MM_LEAF_HOLE,		/* non-present entries */
+	MM_NONLEAF_OLD,		/* old non-leaf PMD entries */
+	MM_NONLEAF_YOUNG,	/* young non-leaf PMD entries */
+	NR_MM_STATS
+};
+
+/* mnemonic codes for the stats above */
+#define MM_STAT_CODES		"aicvnmoydhlu"
+
+struct lru_gen_mm_list {
+	/* the head of a global or per-memcg mm_struct list */
+	struct list_head head;
+	/* protects the list */
+	spinlock_t lock;
+	struct {
+		/* set to max_seq after each round of walk */
+		unsigned long cur_seq;
+		/* the next mm_struct on the list to walk */
+		struct list_head *iter;
+		/* to wait for the last walker to finish */
+		struct wait_queue_head wait;
+		/* the number of concurrent walkers */
+		int nr_walkers;
+		/* stats for debugging */
+		unsigned long stats[NR_STAT_GENS][NR_MM_STATS];
+	} nodes[0];
+};
+
+static struct lru_gen_mm_list *global_mm_list;
+
+static struct lru_gen_mm_list *alloc_mm_list(void)
+{
+	int nid;
+	struct lru_gen_mm_list *mm_list;
+
+	mm_list = kvzalloc(struct_size(mm_list, nodes, nr_node_ids), GFP_KERNEL);
+	if (!mm_list)
+		return NULL;
+
+	INIT_LIST_HEAD(&mm_list->head);
+	spin_lock_init(&mm_list->lock);
+
+	for_each_node(nid) {
+		mm_list->nodes[nid].cur_seq = MIN_NR_GENS;
+		mm_list->nodes[nid].iter = &mm_list->head;
+		init_waitqueue_head(&mm_list->nodes[nid].wait);
+	}
+
+	return mm_list;
+}
+
+static struct lru_gen_mm_list *get_mm_list(struct mem_cgroup *memcg)
+{
+#ifdef CONFIG_MEMCG
+	if (!mem_cgroup_disabled())
+		return memcg ? memcg->mm_list : root_mem_cgroup->mm_list;
+#endif
+	VM_BUG_ON(memcg);
+
+	return global_mm_list;
+}
+
+void lru_gen_init_mm(struct mm_struct *mm)
+{
+	INIT_LIST_HEAD(&mm->lrugen.list);
+#ifdef CONFIG_MEMCG
+	mm->lrugen.memcg = NULL;
+#endif
+#ifndef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+	atomic_set(&mm->lrugen.nr_cpus, 0);
+#endif
+	nodes_clear(mm->lrugen.nodes);
+}
+
+void lru_gen_add_mm(struct mm_struct *mm)
+{
+	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+
+	VM_BUG_ON_MM(!list_empty(&mm->lrugen.list), mm);
+#ifdef CONFIG_MEMCG
+	VM_BUG_ON_MM(mm->lrugen.memcg, mm);
+	WRITE_ONCE(mm->lrugen.memcg, memcg);
+#endif
+	spin_lock(&mm_list->lock);
+	list_add_tail(&mm->lrugen.list, &mm_list->head);
+	spin_unlock(&mm_list->lock);
+}
+
+void lru_gen_del_mm(struct mm_struct *mm)
+{
+	int nid;
+#ifdef CONFIG_MEMCG
+	struct lru_gen_mm_list *mm_list = get_mm_list(mm->lrugen.memcg);
+#else
+	struct lru_gen_mm_list *mm_list = get_mm_list(NULL);
+#endif
+
+	spin_lock(&mm_list->lock);
+
+	for_each_node(nid) {
+		if (mm_list->nodes[nid].iter != &mm->lrugen.list)
+			continue;
+
+		mm_list->nodes[nid].iter = mm_list->nodes[nid].iter->next;
+		if (mm_list->nodes[nid].iter == &mm_list->head)
+			WRITE_ONCE(mm_list->nodes[nid].cur_seq,
+				   mm_list->nodes[nid].cur_seq + 1);
+	}
+
+	list_del_init(&mm->lrugen.list);
+
+	spin_unlock(&mm_list->lock);
+
+#ifdef CONFIG_MEMCG
+	mem_cgroup_put(mm->lrugen.memcg);
+	WRITE_ONCE(mm->lrugen.memcg, NULL);
+#endif
+}
+
+#ifdef CONFIG_MEMCG
+int lru_gen_alloc_mm_list(struct mem_cgroup *memcg)
+{
+	if (mem_cgroup_disabled())
+		return 0;
+
+	memcg->mm_list = alloc_mm_list();
+
+	return memcg->mm_list ? 0 : -ENOMEM;
+}
+
+void lru_gen_free_mm_list(struct mem_cgroup *memcg)
+{
+	kvfree(memcg->mm_list);
+	memcg->mm_list = NULL;
+}
+
+void lru_gen_migrate_mm(struct mm_struct *mm)
+{
+	struct mem_cgroup *memcg;
+
+	lockdep_assert_held(&mm->owner->alloc_lock);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(mm->owner);
+	rcu_read_unlock();
+	if (memcg == mm->lrugen.memcg)
+		return;
+
+	VM_BUG_ON_MM(!mm->lrugen.memcg, mm);
+	VM_BUG_ON_MM(list_empty(&mm->lrugen.list), mm);
+
+	lru_gen_del_mm(mm);
+	lru_gen_add_mm(mm);
+}
+
+static bool mm_is_migrated(struct mm_struct *mm, struct mem_cgroup *memcg)
+{
+	return READ_ONCE(mm->lrugen.memcg) != memcg;
+}
+#else
+static bool mm_is_migrated(struct mm_struct *mm, struct mem_cgroup *memcg)
+{
+	return false;
+}
+#endif
+
+struct mm_walk_args {
+	struct mem_cgroup *memcg;
+	unsigned long max_seq;
+	unsigned long start_pfn;
+	unsigned long end_pfn;
+	unsigned long next_addr;
+	int node_id;
+	int swappiness;
+	int batch_size;
+	int nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	int mm_stats[NR_MM_STATS];
+	unsigned long bitmap[0];
+};
+
+static void reset_mm_stats(struct lru_gen_mm_list *mm_list, bool last,
+			   struct mm_walk_args *args)
+{
+	int i;
+	int nid = args->node_id;
+	int hist = lru_hist_from_seq(args->max_seq);
+
+	lockdep_assert_held(&mm_list->lock);
+
+	for (i = 0; i < NR_MM_STATS; i++) {
+		WRITE_ONCE(mm_list->nodes[nid].stats[hist][i],
+			   mm_list->nodes[nid].stats[hist][i] + args->mm_stats[i]);
+		args->mm_stats[i] = 0;
+	}
+
+	if (!last || NR_STAT_GENS == 1)
+		return;
+
+	hist = lru_hist_from_seq(args->max_seq + 1);
+	for (i = 0; i < NR_MM_STATS; i++)
+		WRITE_ONCE(mm_list->nodes[nid].stats[hist][i], 0);
+}
+
+static bool should_skip_mm(struct mm_struct *mm, struct mm_walk_args *args)
+{
+	int type;
+	unsigned long size = 0;
+
+	if (!lru_gen_mm_is_active(mm) && !node_isset(args->node_id, mm->lrugen.nodes))
+		return true;
+
+	if (mm_is_oom_victim(mm))
+		return true;
+
+	for (type = !args->swappiness; type < ANON_AND_FILE; type++) {
+		size += type ? get_mm_counter(mm, MM_FILEPAGES) :
+			       get_mm_counter(mm, MM_ANONPAGES) +
+			       get_mm_counter(mm, MM_SHMEMPAGES);
+	}
+
+	/* leave the legwork to the rmap if the mappings are too sparse */
+	if (size < max(SWAP_CLUSTER_MAX, mm_pgtables_bytes(mm) / PAGE_SIZE))
+		return true;
+
+	return !mmget_not_zero(mm);
+}
+
+/* To support multiple walkers that concurrently walk an mm_struct list. */
+static bool get_next_mm(struct mm_walk_args *args, struct mm_struct **iter)
+{
+	bool last = true;
+	struct mm_struct *mm = NULL;
+	int nid = args->node_id;
+	struct lru_gen_mm_list *mm_list = get_mm_list(args->memcg);
+
+	if (*iter)
+		mmput_async(*iter);
+	else if (args->max_seq <= READ_ONCE(mm_list->nodes[nid].cur_seq))
+		return false;
+
+	spin_lock(&mm_list->lock);
+
+	VM_BUG_ON(args->max_seq > mm_list->nodes[nid].cur_seq + 1);
+	VM_BUG_ON(*iter && args->max_seq < mm_list->nodes[nid].cur_seq);
+	VM_BUG_ON(*iter && !mm_list->nodes[nid].nr_walkers);
+
+	if (args->max_seq <= mm_list->nodes[nid].cur_seq) {
+		last = *iter;
+		goto done;
+	}
+
+	if (mm_list->nodes[nid].iter == &mm_list->head) {
+		VM_BUG_ON(*iter || mm_list->nodes[nid].nr_walkers);
+		mm_list->nodes[nid].iter = mm_list->nodes[nid].iter->next;
+	}
+
+	while (!mm && mm_list->nodes[nid].iter != &mm_list->head) {
+		mm = list_entry(mm_list->nodes[nid].iter, struct mm_struct, lrugen.list);
+		mm_list->nodes[nid].iter = mm_list->nodes[nid].iter->next;
+		if (should_skip_mm(mm, args))
+			mm = NULL;
+
+		args->mm_stats[mm ? MM_SCHED_ACTIVE : MM_SCHED_INACTIVE]++;
+	}
+
+	if (mm_list->nodes[nid].iter == &mm_list->head)
+		WRITE_ONCE(mm_list->nodes[nid].cur_seq,
+			   mm_list->nodes[nid].cur_seq + 1);
+done:
+	if (*iter && !mm)
+		mm_list->nodes[nid].nr_walkers--;
+	if (!*iter && mm)
+		mm_list->nodes[nid].nr_walkers++;
+
+	last = last && !mm_list->nodes[nid].nr_walkers &&
+	       mm_list->nodes[nid].iter == &mm_list->head;
+
+	reset_mm_stats(mm_list, last, args);
+
+	spin_unlock(&mm_list->lock);
+
+	*iter = mm;
+	if (mm)
+		node_clear(nid, mm->lrugen.nodes);
+
+	return last;
+}
+
 /******************************************************************************
  *                          state change
  ******************************************************************************/
@@ -3135,6 +3441,13 @@ static int __init init_lru_gen(void)
 {
 	BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS);
 	BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
+	BUILD_BUG_ON(sizeof(MM_STAT_CODES) != NR_MM_STATS + 1);
+
+	if (mem_cgroup_disabled()) {
+		global_mm_list = alloc_mm_list();
+		if (!global_mm_list)
+			return -ENOMEM;
+	}
 
 	if (hotplug_memory_notifier(mem_notifier, 0))
 		pr_err("lru_gen: failed to subscribe hotplug notifications\n");
-- 
2.33.0.rc1.237.g0d66db33f3-goog



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v4 07/11] mm: multigenerational lru: aging
  2021-08-18  6:30 [PATCH v4 00/11] Multigenerational LRU Framework Yu Zhao
                   ` (5 preceding siblings ...)
  2021-08-18  6:31 ` [PATCH v4 06/11] mm: multigenerational lru: mm_struct list Yu Zhao
@ 2021-08-18  6:31 ` Yu Zhao
  2021-08-18  6:31 ` [PATCH v4 08/11] mm: multigenerational lru: eviction Yu Zhao
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yu Zhao @ 2021-08-18  6:31 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Hillf Danton, page-reclaim, Yu Zhao, Konstantin Kharlamov

The aging produces young generations. Given an lruvec, the aging
traverses lruvec_memcg()->mm_list and calls walk_page_range() to scan
PTEs for accessed pages. Upon finding one, the aging updates its
generation number to max_seq (modulo MAX_NR_GENS). After each round of
traversal, the aging increments max_seq. The aging is due when both
elements of min_seq[] have caught up with max_seq-1.

The aging uses the following optimizations when walking page tables:
  1) It skips page tables of processes that have been sleeping since
  the last walk.
  2) It skips non-leaf PMD entries that have the accessed bit cleared
  when CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
  3) It does not zigzag between a PGD table and the same PMD or PTE
  table spanning multiple VMAs. In other words, it finishes all the
  VMAs within the range of the same PMD or PTE table before it returns
  to this PGD table.
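
To make the bookkeeping above concrete, here is a minimal sketch, not taken
from the patch, of the two rules it implies: mapping a sequence number to a
generation number (modulo MAX_NR_GENS) and testing whether the aging is
due. MAX_NR_GENS == 4 is assumed here for illustration; MIN_NR_GENS == 2
matches the patch.

#include <stdio.h>

#define MAX_NR_GENS	4	/* assumed for illustration */
#define MIN_NR_GENS	2	/* matches the patch */

/* the generation number a page born at seq is tagged with */
static int gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}

/* the aging is due when both min_seq have caught up with max_seq-1 */
static int aging_is_due(unsigned long max_seq, const unsigned long min_seq[2])
{
	return min_seq[0] + MIN_NR_GENS > max_seq &&
	       min_seq[1] + MIN_NR_GENS > max_seq;
}

int main(void)
{
	unsigned long min_seq[2] = { 3, 3 };

	printf("gen of seq 5: %d\n", gen_from_seq(5));
	printf("aging due at max_seq 4? %d\n", aging_is_due(4, min_seq));

	return 0;
}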

Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
---
 include/linux/memcontrol.h |   3 +
 include/linux/mmzone.h     |  11 +
 include/linux/oom.h        |  16 +
 include/linux/swap.h       |   1 +
 mm/oom_kill.c              |   4 +-
 mm/rmap.c                  |   7 +
 mm/swap.c                  |   4 +-
 mm/vmscan.c                | 903 +++++++++++++++++++++++++++++++++++++
 8 files changed, 945 insertions(+), 4 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5e223cecb5c2..657d94344dfc 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1346,10 +1346,13 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
 
 static inline void lock_page_memcg(struct page *page)
 {
+	/* to match page_memcg_rcu() */
+	rcu_read_lock();
 }
 
 static inline void unlock_page_memcg(struct page *page)
 {
+	rcu_read_unlock();
 }
 
 static inline void mem_cgroup_handle_over_high(void)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d6c2c3a4ba43..b6005e881862 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -295,6 +295,7 @@ enum lruvec_flags {
 };
 
 struct lruvec;
+struct page_vma_mapped_walk;
 
 #define LRU_GEN_MASK		((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
 #define LRU_USAGE_MASK		((BIT(LRU_USAGE_WIDTH) - 1) << LRU_USAGE_PGOFF)
@@ -369,6 +370,7 @@ struct lrugen {
 
 void lru_gen_init_lrugen(struct lruvec *lruvec);
 void lru_gen_set_state(bool enable, bool main, bool swap);
+void lru_gen_scan_around(struct page_vma_mapped_walk *pvmw);
 
 #else /* CONFIG_LRU_GEN */
 
@@ -380,6 +382,10 @@ static inline void lru_gen_set_state(bool enable, bool main, bool swap)
 {
 }
 
+static inline void lru_gen_scan_around(struct page_vma_mapped_walk *pvmw)
+{
+}
+
 #endif /* CONFIG_LRU_GEN */
 
 struct lruvec {
@@ -874,6 +880,8 @@ struct deferred_split {
 };
 #endif
 
+struct mm_walk_args;
+
 /*
  * On NUMA machines, each NUMA node would have a pg_data_t to describe
  * it's memory layout. On UMA machines there is a single pglist_data which
@@ -979,6 +987,9 @@ typedef struct pglist_data {
 
 	unsigned long		flags;
 
+#ifdef CONFIG_LRU_GEN
+	struct mm_walk_args	*mm_walk_args;
+#endif
 	ZONE_PADDING(_pad2_)
 
 	/* Per-node vmstats */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 2db9a1432511..c4c8c7e71099 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -57,6 +57,22 @@ struct oom_control {
 extern struct mutex oom_lock;
 extern struct mutex oom_adj_mutex;
 
+#ifdef CONFIG_MMU
+extern struct task_struct *oom_reaper_list;
+extern struct wait_queue_head oom_reaper_wait;
+
+static inline bool oom_reaping_in_progress(void)
+{
+	/* racy check to see if oom reaping could be in progress */
+	return READ_ONCE(oom_reaper_list) || !waitqueue_active(&oom_reaper_wait);
+}
+#else
+static inline bool oom_reaping_in_progress(void)
+{
+	return false;
+}
+#endif
+
 static inline void set_current_oom_origin(void)
 {
 	current->signal->oom_flag_origin = true;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6f5a43251593..c838e67dfa3a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -368,6 +368,7 @@ extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
 extern void deactivate_page(struct page *page);
+extern void activate_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c729a4c4a1ac..eca484ee3a3d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -507,8 +507,8 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
  * victim (if that is possible) to help the OOM killer to move on.
  */
 static struct task_struct *oom_reaper_th;
-static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
+DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
+struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
 bool __oom_reap_task_mm(struct mm_struct *mm)
diff --git a/mm/rmap.c b/mm/rmap.c
index b9eb5c12f3fe..f4963d60ff68 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -72,6 +72,7 @@
 #include <linux/page_idle.h>
 #include <linux/memremap.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/mm_inline.h>
 
 #include <asm/tlbflush.h>
 
@@ -789,6 +790,12 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		}
 
 		if (pvmw.pte) {
+			/* the multigenerational lru exploits the spatial locality */
+			if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
+			    !(vma->vm_flags & VM_SEQ_READ)) {
+				lru_gen_scan_around(&pvmw);
+				referenced++;
+			}
 			if (ptep_clear_flush_young_notify(vma, address,
 						pvmw.pte)) {
 				/*
diff --git a/mm/swap.c b/mm/swap.c
index 0d3fb2ee3fd6..0315cfa9fa41 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -347,7 +347,7 @@ static bool need_activate_page_drain(int cpu)
 	return pagevec_count(&per_cpu(lru_pvecs.activate_page, cpu)) != 0;
 }
 
-static void activate_page(struct page *page)
+void activate_page(struct page *page)
 {
 	page = compound_head(page);
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
@@ -367,7 +367,7 @@ static inline void activate_page_drain(int cpu)
 {
 }
 
-static void activate_page(struct page *page)
+void activate_page(struct page *page)
 {
 	struct lruvec *lruvec;
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 15eadf2a135e..757ba4f415cc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -50,6 +50,8 @@
 #include <linux/dax.h>
 #include <linux/psi.h>
 #include <linux/memory.h>
+#include <linux/pagewalk.h>
+#include <linux/shmem_fs.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -3208,6 +3210,883 @@ static bool get_next_mm(struct mm_walk_args *args, struct mm_struct **iter)
 	return last;
 }
 
+/******************************************************************************
+ *                          the aging
+ ******************************************************************************/
+
+static int page_update_gen(struct page *page, int gen)
+{
+	unsigned long old_flags, new_flags;
+
+	VM_BUG_ON(gen >= MAX_NR_GENS);
+
+	do {
+		new_flags = old_flags = READ_ONCE(page->flags);
+
+		if (!(new_flags & LRU_GEN_MASK)) {
+			new_flags |= BIT(PG_referenced);
+			continue;
+		}
+
+		new_flags &= ~(LRU_GEN_MASK | LRU_USAGE_MASK | LRU_TIER_FLAGS);
+		new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
+	} while (new_flags != old_flags &&
+		 cmpxchg(&page->flags, old_flags, new_flags) != old_flags);
+
+	return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+}
+
+static void page_inc_gen(struct page *page, struct lruvec *lruvec, bool reclaiming)
+{
+	int old_gen, new_gen;
+	unsigned long old_flags, new_flags;
+	int type = page_is_file_lru(page);
+	int zone = page_zonenum(page);
+	struct lrugen *lrugen = &lruvec->evictable;
+
+	old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+	do {
+		new_flags = old_flags = READ_ONCE(page->flags);
+		VM_BUG_ON_PAGE(!(new_flags & LRU_GEN_MASK), page);
+
+		new_gen = ((new_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+		/* page_update_gen() has updated this page? */
+		if (new_gen >= 0 && new_gen != old_gen) {
+			list_move(&page->lru, &lrugen->lists[new_gen][type][zone]);
+			return;
+		}
+
+		new_gen = (old_gen + 1) % MAX_NR_GENS;
+
+		new_flags &= ~(LRU_GEN_MASK | LRU_USAGE_MASK | LRU_TIER_FLAGS);
+		new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
+		/* for rotate_reclaimable_page() */
+		if (reclaiming)
+			new_flags |= BIT(PG_reclaim);
+	} while (cmpxchg(&page->flags, old_flags, new_flags) != old_flags);
+
+	lru_gen_update_size(page, lruvec, old_gen, new_gen);
+	if (reclaiming)
+		list_move(&page->lru, &lrugen->lists[new_gen][type][zone]);
+	else
+		list_move_tail(&page->lru, &lrugen->lists[new_gen][type][zone]);
+}
+
+static void update_batch_size(struct page *page, int old_gen, int new_gen,
+			      struct mm_walk_args *args)
+{
+	int type = page_is_file_lru(page);
+	int zone = page_zonenum(page);
+	int delta = thp_nr_pages(page);
+
+	VM_BUG_ON(old_gen >= MAX_NR_GENS);
+	VM_BUG_ON(new_gen >= MAX_NR_GENS);
+
+	args->batch_size++;
+
+	args->nr_pages[old_gen][type][zone] -= delta;
+	args->nr_pages[new_gen][type][zone] += delta;
+}
+
+static void reset_batch_size(struct lruvec *lruvec, struct mm_walk_args *args)
+{
+	int gen, type, zone;
+	struct lrugen *lrugen = &lruvec->evictable;
+
+	if (!args->batch_size)
+		return;
+
+	args->batch_size = 0;
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	for_each_gen_type_zone(gen, type, zone) {
+		enum lru_list lru = type * LRU_FILE;
+		int total = args->nr_pages[gen][type][zone];
+
+		if (!total)
+			continue;
+
+		args->nr_pages[gen][type][zone] = 0;
+		WRITE_ONCE(lrugen->sizes[gen][type][zone],
+			   lrugen->sizes[gen][type][zone] + total);
+
+		if (lru_gen_is_active(lruvec, gen))
+			lru += LRU_ACTIVE;
+		update_lru_size(lruvec, lru, zone, total);
+	}
+
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static int should_skip_vma(unsigned long start, unsigned long end, struct mm_walk *walk)
+{
+	struct address_space *mapping;
+	struct vm_area_struct *vma = walk->vma;
+	struct mm_walk_args *args = walk->private;
+
+	if (!vma_is_accessible(vma) || is_vm_hugetlb_page(vma) ||
+	    (vma->vm_flags & (VM_LOCKED | VM_SPECIAL | VM_SEQ_READ)))
+		return true;
+
+	if (vma_is_anonymous(vma))
+		return !args->swappiness;
+
+	if (WARN_ON_ONCE(!vma->vm_file || !vma->vm_file->f_mapping))
+		return true;
+
+	mapping = vma->vm_file->f_mapping;
+	if (!mapping->a_ops->writepage)
+		return true;
+
+	return (shmem_mapping(mapping) && !args->swappiness) || mapping_unevictable(mapping);
+}
+
+/*
+ * Some userspace memory allocators create many single-page VMAs. So instead of
+ * returning to the PGD table for each such VMA, we finish at least an
+ * entire PMD table and therefore avoid many zigzags.
+ */
+static bool get_next_vma(struct mm_walk *walk, unsigned long mask, unsigned long size,
+			 unsigned long *start, unsigned long *end)
+{
+	unsigned long next = round_up(*end, size);
+	struct mm_walk_args *args = walk->private;
+
+	VM_BUG_ON(mask & size);
+	VM_BUG_ON(*start >= *end);
+	VM_BUG_ON((next & mask) != (*start & mask));
+
+	while (walk->vma) {
+		if (next >= walk->vma->vm_end) {
+			walk->vma = walk->vma->vm_next;
+			continue;
+		}
+
+		if ((next & mask) != (walk->vma->vm_start & mask))
+			return false;
+
+		if (should_skip_vma(walk->vma->vm_start, walk->vma->vm_end, walk)) {
+			walk->vma = walk->vma->vm_next;
+			continue;
+		}
+
+		*start = max(next, walk->vma->vm_start);
+		next = (next | ~mask) + 1;
+		/* rounded-up boundaries can wrap to 0 */
+		*end = next && next < walk->vma->vm_end ? next : walk->vma->vm_end;
+
+		args->mm_stats[MM_VMA_INTERVAL]++;
+
+		return true;
+	}
+
+	return false;
+}
+
+static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
+			   struct mm_walk *walk)
+{
+	int i;
+	pte_t *pte;
+	spinlock_t *ptl;
+	unsigned long addr;
+	int remote = 0;
+	struct mm_walk_args *args = walk->private;
+	int old_gen, new_gen = lru_gen_from_seq(args->max_seq);
+
+	VM_BUG_ON(pmd_leaf(*pmd));
+
+	pte = pte_offset_map_lock(walk->mm, pmd, start & PMD_MASK, &ptl);
+	arch_enter_lazy_mmu_mode();
+restart:
+	for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
+		struct page *page;
+		unsigned long pfn = pte_pfn(pte[i]);
+
+		if (!pte_present(pte[i]) || is_zero_pfn(pfn)) {
+			args->mm_stats[MM_LEAF_HOLE]++;
+			continue;
+		}
+
+		if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i])))
+			continue;
+
+		if (!pte_young(pte[i])) {
+			args->mm_stats[MM_LEAF_OLD]++;
+			continue;
+		}
+
+		VM_BUG_ON(!pfn_valid(pfn));
+		if (pfn < args->start_pfn || pfn >= args->end_pfn) {
+			args->mm_stats[MM_LEAF_OTHER_NODE]++;
+			remote++;
+			continue;
+		}
+
+		page = compound_head(pfn_to_page(pfn));
+		if (page_to_nid(page) != args->node_id) {
+			args->mm_stats[MM_LEAF_OTHER_NODE]++;
+			remote++;
+			continue;
+		}
+
+		if (page_memcg_rcu(page) != args->memcg) {
+			args->mm_stats[MM_LEAF_OTHER_MEMCG]++;
+			continue;
+		}
+
+		VM_BUG_ON(addr < walk->vma->vm_start || addr >= walk->vma->vm_end);
+		if (!ptep_test_and_clear_young(walk->vma, addr, pte + i))
+			continue;
+
+		if (pte_dirty(pte[i]) && !PageDirty(page) &&
+		    !(PageAnon(page) && PageSwapBacked(page) && !PageSwapCache(page))) {
+			set_page_dirty(page);
+			args->mm_stats[MM_LEAF_DIRTY]++;
+		}
+
+		old_gen = page_update_gen(page, new_gen);
+		if (old_gen >= 0 && old_gen != new_gen)
+			update_batch_size(page, old_gen, new_gen, args);
+		args->mm_stats[MM_LEAF_YOUNG]++;
+	}
+
+	if (i < PTRS_PER_PTE && get_next_vma(walk, PMD_MASK, PAGE_SIZE, &start, &end))
+		goto restart;
+
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(pte, ptl);
+
+	return IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) && !remote;
+}
+
+/*
+ * We scan PMD entries in two passes. The first pass reaches into PTE tables and
+ * doesn't take the PMD lock. The second pass clears the accessed bit on PMD
+ * entries and needs to take the PMD lock.
+ */
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
+static void walk_pmd_range_locked(pud_t *pud, unsigned long start,
+				  struct vm_area_struct *vma, struct mm_walk *walk)
+{
+	int i;
+	pmd_t *pmd;
+	spinlock_t *ptl;
+	struct mm_walk_args *args = walk->private;
+	int old_gen, new_gen = lru_gen_from_seq(args->max_seq);
+
+	VM_BUG_ON(pud_leaf(*pud));
+
+	start &= PUD_MASK;
+	pmd = pmd_offset(pud, start);
+	ptl = pmd_lock(walk->mm, pmd);
+	arch_enter_lazy_mmu_mode();
+
+	for_each_set_bit(i, args->bitmap, PTRS_PER_PMD) {
+		struct page *page;
+		unsigned long pfn = pmd_pfn(pmd[i]);
+		unsigned long addr = start + i * PMD_SIZE;
+
+		if (!pmd_present(pmd[i]) || is_huge_zero_pmd(pmd[i])) {
+			args->mm_stats[MM_LEAF_HOLE]++;
+			continue;
+		}
+
+		if (WARN_ON_ONCE(pmd_devmap(pmd[i])))
+			continue;
+
+		if (!pmd_young(pmd[i])) {
+			args->mm_stats[MM_LEAF_OLD]++;
+			continue;
+		}
+
+		if (!pmd_trans_huge(pmd[i])) {
+			if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) &&
+			    pmdp_test_and_clear_young(vma, addr, pmd + i))
+				args->mm_stats[MM_NONLEAF_YOUNG]++;
+			continue;
+		}
+
+		VM_BUG_ON(!pfn_valid(pfn));
+		if (pfn < args->start_pfn || pfn >= args->end_pfn) {
+			args->mm_stats[MM_LEAF_OTHER_NODE]++;
+			continue;
+		}
+
+		page = pfn_to_page(pfn);
+		VM_BUG_ON_PAGE(PageTail(page), page);
+		if (page_to_nid(page) != args->node_id) {
+			args->mm_stats[MM_LEAF_OTHER_NODE]++;
+			continue;
+		}
+
+		if (page_memcg_rcu(page) != args->memcg) {
+			args->mm_stats[MM_LEAF_OTHER_MEMCG]++;
+			continue;
+		}
+
+		VM_BUG_ON(addr < vma->vm_start || addr >= vma->vm_end);
+		if (!pmdp_test_and_clear_young(vma, addr, pmd + i))
+			continue;
+
+		if (pmd_dirty(pmd[i]) && !PageDirty(page) &&
+		    !(PageAnon(page) && PageSwapBacked(page) && !PageSwapCache(page))) {
+			set_page_dirty(page);
+			args->mm_stats[MM_LEAF_DIRTY]++;
+		}
+
+		old_gen = page_update_gen(page, new_gen);
+		if (old_gen >= 0 && old_gen != new_gen)
+			update_batch_size(page, old_gen, new_gen, args);
+		args->mm_stats[MM_LEAF_YOUNG]++;
+	}
+
+	arch_leave_lazy_mmu_mode();
+	spin_unlock(ptl);
+
+	bitmap_zero(args->bitmap, PTRS_PER_PMD);
+}
+#else
+static void walk_pmd_range_locked(pud_t *pud, unsigned long start,
+				  struct vm_area_struct *vma, struct mm_walk *walk)
+{
+}
+#endif
+
+static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
+			   struct mm_walk *walk)
+{
+	int i;
+	pmd_t *pmd;
+	unsigned long next;
+	unsigned long addr;
+	struct vm_area_struct *vma;
+	int leaf = 0;
+	int nonleaf = 0;
+	struct mm_walk_args *args = walk->private;
+
+	VM_BUG_ON(pud_leaf(*pud));
+
+	pmd = pmd_offset(pud, start & PUD_MASK);
+restart:
+	vma = walk->vma;
+	for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
+		pmd_t val = pmd_read_atomic(pmd + i);
+
+		/* for pmd_read_atomic() */
+		barrier();
+
+		next = pmd_addr_end(addr, end);
+
+		if (!pmd_present(val)) {
+			args->mm_stats[MM_LEAF_HOLE]++;
+			continue;
+		}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		if (pmd_trans_huge(val)) {
+			unsigned long pfn = pmd_pfn(val);
+
+			if (is_huge_zero_pmd(val)) {
+				args->mm_stats[MM_LEAF_HOLE]++;
+				continue;
+			}
+
+			if (!pmd_young(val)) {
+				args->mm_stats[MM_LEAF_OLD]++;
+				continue;
+			}
+
+			if (pfn < args->start_pfn || pfn >= args->end_pfn) {
+				args->mm_stats[MM_LEAF_OTHER_NODE]++;
+				continue;
+			}
+
+			__set_bit(i, args->bitmap);
+			leaf++;
+			continue;
+		}
+#endif
+
+#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
+		if (!pmd_young(val)) {
+			args->mm_stats[MM_NONLEAF_OLD]++;
+			continue;
+		}
+#endif
+		if (walk_pte_range(&val, addr, next, walk)) {
+			__set_bit(i, args->bitmap);
+			nonleaf++;
+		}
+	}
+
+	if (leaf) {
+		walk_pmd_range_locked(pud, start, vma, walk);
+		leaf = nonleaf = 0;
+	}
+
+	if (i < PTRS_PER_PMD && get_next_vma(walk, PUD_MASK, PMD_SIZE, &start, &end))
+		goto restart;
+
+	if (nonleaf)
+		walk_pmd_range_locked(pud, start, vma, walk);
+}
+
+static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
+			  struct mm_walk *walk)
+{
+	int i;
+	pud_t *pud;
+	unsigned long addr;
+	unsigned long next;
+	struct mm_walk_args *args = walk->private;
+
+	VM_BUG_ON(p4d_leaf(*p4d));
+
+	pud = pud_offset(p4d, start & P4D_MASK);
+restart:
+	for (i = pud_index(start), addr = start; addr != end; i++, addr = next) {
+		pud_t val = READ_ONCE(pud[i]);
+
+		next = pud_addr_end(addr, end);
+
+		if (!pud_present(val) || WARN_ON_ONCE(pud_leaf(val)))
+			continue;
+
+		walk_pmd_range(&val, addr, next, walk);
+
+		if (args->batch_size >= MAX_BATCH_SIZE) {
+			end = (addr | ~PUD_MASK) + 1;
+			goto done;
+		}
+	}
+
+	if (i < PTRS_PER_PUD && get_next_vma(walk, P4D_MASK, PUD_SIZE, &start, &end))
+		goto restart;
+
+	end = round_up(end, P4D_SIZE);
+done:
+	/* rounded-up boundaries can wrap to 0 */
+	args->next_addr = end && walk->vma ? max(end, walk->vma->vm_start) : 0;
+
+	return -EAGAIN;
+}
+
+static void walk_mm(struct mm_walk_args *args, struct mm_struct *mm)
+{
+	static const struct mm_walk_ops mm_walk_ops = {
+		.test_walk = should_skip_vma,
+		.p4d_entry = walk_pud_range,
+	};
+
+	int err;
+	struct mem_cgroup *memcg = args->memcg;
+	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(args->node_id));
+
+	args->next_addr = FIRST_USER_ADDRESS;
+
+	do {
+		unsigned long start = args->next_addr;
+		unsigned long end = mm->highest_vm_end;
+
+		err = -EBUSY;
+
+		rcu_read_lock();
+#ifdef CONFIG_MEMCG
+		if (memcg && atomic_read(&memcg->moving_account)) {
+			args->mm_stats[MM_LOCK_CONTENTION]++;
+			goto contended;
+		}
+#endif
+		if (!mmap_read_trylock(mm)) {
+			args->mm_stats[MM_LOCK_CONTENTION]++;
+			goto contended;
+		}
+
+		err = walk_page_range(mm, start, end, &mm_walk_ops, args);
+
+		mmap_read_unlock(mm);
+
+		reset_batch_size(lruvec, args);
+contended:
+		rcu_read_unlock();
+
+		cond_resched();
+	} while (err == -EAGAIN && args->next_addr &&
+		 !mm_is_oom_victim(mm) && !mm_is_migrated(mm, memcg));
+}
+
+static struct mm_walk_args *alloc_mm_walk_args(int nid)
+{
+	struct pglist_data *pgdat;
+	int size = sizeof(struct mm_walk_args);
+
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) ||
+	    IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG))
+		size += sizeof(unsigned long) * BITS_TO_LONGS(PTRS_PER_PMD);
+
+	if (!current_is_kswapd())
+		return kvzalloc_node(size, GFP_KERNEL, nid);
+
+	VM_BUG_ON(nid == NUMA_NO_NODE);
+
+	pgdat = NODE_DATA(nid);
+	if (!pgdat->mm_walk_args)
+		pgdat->mm_walk_args = kvzalloc_node(size, GFP_KERNEL, nid);
+
+	return pgdat->mm_walk_args;
+}
+
+static void free_mm_walk_args(struct mm_walk_args *args)
+{
+	if (!current_is_kswapd())
+		kvfree(args);
+}
+
+static bool inc_min_seq(struct lruvec *lruvec, int type)
+{
+	int gen, zone;
+	int remaining = MAX_BATCH_SIZE;
+	struct lrugen *lrugen = &lruvec->evictable;
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
+		return true;
+
+	gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+		struct list_head *head = &lrugen->lists[gen][type][zone];
+
+		while (!list_empty(head)) {
+			struct page *page = lru_to_page(head);
+
+			VM_BUG_ON_PAGE(PageTail(page), page);
+			VM_BUG_ON_PAGE(PageUnevictable(page), page);
+			VM_BUG_ON_PAGE(PageActive(page), page);
+			VM_BUG_ON_PAGE(page_is_file_lru(page) != type, page);
+			VM_BUG_ON_PAGE(page_zonenum(page) != zone, page);
+
+			prefetchw_prev_lru_page(page, head, flags);
+
+			page_inc_gen(page, lruvec, false);
+
+			if (!--remaining)
+				return false;
+		}
+
+		VM_BUG_ON(lrugen->sizes[gen][type][zone]);
+	}
+
+	reset_controller_pos(lruvec, gen, type);
+	WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1);
+
+	return true;
+}
+
+static bool try_to_inc_min_seq(struct lruvec *lruvec, int type)
+{
+	int gen, zone;
+	bool success = false;
+	struct lrugen *lrugen = &lruvec->evictable;
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	while (get_nr_gens(lruvec, type) > MIN_NR_GENS) {
+		gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+			if (!list_empty(&lrugen->lists[gen][type][zone]))
+				return success;
+		}
+
+		reset_controller_pos(lruvec, gen, type);
+		WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1);
+
+		success = true;
+	}
+
+	return success;
+}
+
+static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq)
+{
+	int gen, type, zone;
+	struct lrugen *lrugen = &lruvec->evictable;
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	if (max_seq != lrugen->max_seq)
+		goto unlock;
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		if (try_to_inc_min_seq(lruvec, type))
+			continue;
+
+		while (!inc_min_seq(lruvec, type)) {
+			spin_unlock_irq(&lruvec->lru_lock);
+			cond_resched();
+			spin_lock_irq(&lruvec->lru_lock);
+		}
+	}
+
+	gen = lru_gen_from_seq(lrugen->max_seq - 1);
+	for_each_type_zone(type, zone) {
+		enum lru_list lru = type * LRU_FILE;
+		long total = lrugen->sizes[gen][type][zone];
+
+		if (!total)
+			continue;
+
+		WARN_ON_ONCE(total != (int)total);
+
+		update_lru_size(lruvec, lru, zone, total);
+		update_lru_size(lruvec, lru + LRU_ACTIVE, zone, -total);
+	}
+
+	gen = lru_gen_from_seq(lrugen->max_seq + 1);
+	for_each_type_zone(type, zone) {
+		VM_BUG_ON(lrugen->sizes[gen][type][zone]);
+		VM_BUG_ON(!list_empty(&lrugen->lists[gen][type][zone]));
+	}
+
+	for (type = 0; type < ANON_AND_FILE; type++)
+		reset_controller_pos(lruvec, gen, type);
+
+	WRITE_ONCE(lrugen->timestamps[gen], jiffies);
+	/* make sure all preceding modifications appear first */
+	smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
+unlock:
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+/* Main function used by the foreground, the background and the user-triggered aging. */
+static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
+			       struct scan_control *sc, int swappiness)
+{
+	bool last;
+	struct mm_walk_args *args;
+	struct mm_struct *mm = NULL;
+	struct lrugen *lrugen = &lruvec->evictable;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+	int nid = pgdat->node_id;
+
+	VM_BUG_ON(max_seq > READ_ONCE(lrugen->max_seq));
+
+	/*
+	 * If we are not called from run_aging() and clearing the accessed
+	 * bit may trigger page faults, then don't proceed with clearing all
+	 * accessed PTEs. Instead, fall back to lru_gen_scan_around(), which
+	 * only clears a handful of accessed PTEs. This is less efficient but
+	 * causes fewer page faults on CPUs that cannot set the accessed bit
+	 * in hardware.
+	 */
+	if ((current->flags & PF_MEMALLOC) && !arch_has_hw_pte_young()) {
+		inc_max_seq(lruvec, max_seq);
+		return true;
+	}
+
+	args = alloc_mm_walk_args(nid);
+	if (!args)
+		return false;
+
+	args->memcg = memcg;
+	args->max_seq = max_seq;
+	args->start_pfn = pgdat->node_start_pfn;
+	args->end_pfn = pgdat_end_pfn(pgdat);
+	args->node_id = nid;
+	args->swappiness = swappiness;
+
+	do {
+		last = get_next_mm(args, &mm);
+		if (mm)
+			walk_mm(args, mm);
+
+		cond_resched();
+	} while (mm);
+
+	free_mm_walk_args(args);
+
+	if (!last) {
+		/* don't wait unless we may have trouble reclaiming */
+		if (!current_is_kswapd() && sc->priority < DEF_PRIORITY - 2)
+			wait_event_killable(mm_list->nodes[nid].wait,
+					    max_seq < READ_ONCE(lrugen->max_seq));
+
+		return max_seq < READ_ONCE(lrugen->max_seq);
+	}
+
+	VM_BUG_ON(max_seq != READ_ONCE(lrugen->max_seq));
+
+	inc_max_seq(lruvec, max_seq);
+	/* either we see any waiters or they will see updated max_seq */
+	if (wq_has_sleeper(&mm_list->nodes[nid].wait))
+		wake_up_all(&mm_list->nodes[nid].wait);
+
+	wakeup_flusher_threads(WB_REASON_VMSCAN);
+
+	return true;
+}
+
+/* Protect the working set accessed within the last N milliseconds. */
+static unsigned long lru_gen_min_ttl;
+
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
+{
+	struct mem_cgroup *memcg;
+
+	VM_BUG_ON(!current_is_kswapd());
+
+	if (sc->file_is_tiny && mutex_trylock(&oom_lock)) {
+		struct oom_control oc = {
+			.gfp_mask = sc->gfp_mask,
+			.order = sc->order,
+		};
+
+		/* to avoid overkilling */
+		if (!oom_reaping_in_progress())
+			out_of_memory(&oc);
+
+		mutex_unlock(&oom_lock);
+	}
+
+	if (READ_ONCE(lru_gen_min_ttl))
+		sc->file_is_tiny = 1;
+
+	if (!mem_cgroup_disabled() && !sc->force_deactivate) {
+		sc->force_deactivate = 1;
+		return;
+	}
+
+	sc->force_deactivate = 0;
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+		int swappiness = get_swappiness(memcg);
+		DEFINE_MAX_SEQ(lruvec);
+		DEFINE_MIN_SEQ(lruvec);
+
+		if (get_lo_wmark(max_seq, min_seq, swappiness) == MIN_NR_GENS)
+			try_to_inc_max_seq(lruvec, max_seq, sc, swappiness);
+
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+}
+
+#define NR_TO_SCAN	(SWAP_CLUSTER_MAX * 2)
+#define SIZE_TO_SCAN	(NR_TO_SCAN * PAGE_SIZE)
+
+/* Scan the vicinity of an accessed PTE when shrink_page_list() uses the rmap. */
+void lru_gen_scan_around(struct page_vma_mapped_walk *pvmw)
+{
+	int i;
+	pte_t *pte;
+	struct page *page;
+	int old_gen, new_gen;
+	unsigned long start;
+	unsigned long end;
+	unsigned long addr;
+	struct mem_cgroup *memcg = page_memcg(pvmw->page);
+	struct pglist_data *pgdat = page_pgdat(pvmw->page);
+	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	unsigned long bitmap[BITS_TO_LONGS(NR_TO_SCAN)] = {};
+
+	lockdep_assert_held(pvmw->ptl);
+	VM_BUG_ON_PAGE(PageLRU(pvmw->page), pvmw->page);
+
+	start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start);
+	end = pmd_addr_end(pvmw->address, pvmw->vma->vm_end);
+
+	if (end - start > SIZE_TO_SCAN) {
+		if (pvmw->address - start < SIZE_TO_SCAN / 2)
+			end = start + SIZE_TO_SCAN;
+		else if (end - pvmw->address < SIZE_TO_SCAN / 2)
+			start = end - SIZE_TO_SCAN;
+		else {
+			start = pvmw->address - SIZE_TO_SCAN / 2;
+			end = pvmw->address + SIZE_TO_SCAN / 2;
+		}
+	}
+
+	pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE;
+	new_gen = lru_gen_from_seq(READ_ONCE(lruvec->evictable.max_seq));
+
+	rcu_read_lock();
+	arch_enter_lazy_mmu_mode();
+
+	for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
+		unsigned long pfn = pte_pfn(pte[i]);
+
+		if (!pte_present(pte[i]) || is_zero_pfn(pfn))
+			continue;
+
+		if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i])))
+			continue;
+
+		if (!pte_young(pte[i]))
+			continue;
+
+		VM_BUG_ON(!pfn_valid(pfn));
+		if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+			continue;
+
+		page = compound_head(pfn_to_page(pfn));
+		if (page_to_nid(page) != pgdat->node_id)
+			continue;
+
+		if (page_memcg_rcu(page) != memcg)
+			continue;
+
+		VM_BUG_ON(addr < pvmw->vma->vm_start || addr >= pvmw->vma->vm_end);
+		if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
+			continue;
+
+		old_gen = page_lru_gen(page);
+		if (old_gen < 0)
+			SetPageReferenced(page);
+		else if (old_gen != new_gen)
+			__set_bit(i, bitmap);
+
+		if (pte_dirty(pte[i]) && !PageDirty(page) &&
+		    !(PageAnon(page) && PageSwapBacked(page) && !PageSwapCache(page)))
+			set_page_dirty(page);
+	}
+
+	arch_leave_lazy_mmu_mode();
+	rcu_read_unlock();
+
+	if (bitmap_weight(bitmap, NR_TO_SCAN) < PAGEVEC_SIZE) {
+		for_each_set_bit(i, bitmap, NR_TO_SCAN)
+			activate_page(pte_page(pte[i]));
+		return;
+	}
+
+	lock_page_memcg(pvmw->page);
+	spin_lock_irq(&lruvec->lru_lock);
+
+	new_gen = lru_gen_from_seq(lruvec->evictable.max_seq);
+
+	for_each_set_bit(i, bitmap, NR_TO_SCAN) {
+		page = compound_head(pte_page(pte[i]));
+		if (page_memcg_rcu(page) != memcg)
+			continue;
+
+		old_gen = page_update_gen(page, new_gen);
+		if (old_gen >= 0 && old_gen != new_gen)
+			lru_gen_update_size(page, lruvec, old_gen, new_gen);
+	}
+
+	spin_unlock_irq(&lruvec->lru_lock);
+	unlock_page_memcg(pvmw->page);
+}
+
 /******************************************************************************
  *                          state change
  ******************************************************************************/
@@ -3392,9 +4271,18 @@ static int __meminit __maybe_unused mem_notifier(struct notifier_block *self,
 
 	pgdat = NODE_DATA(nid);
 
+	if (action == MEM_CANCEL_ONLINE || action == MEM_OFFLINE) {
+		free_mm_walk_args(pgdat->mm_walk_args);
+		pgdat->mm_walk_args = NULL;
+		return NOTIFY_DONE;
+	}
+
 	if (action != MEM_GOING_ONLINE)
 		return NOTIFY_DONE;
 
+	if (!WARN_ON_ONCE(pgdat->mm_walk_args))
+		pgdat->mm_walk_args = alloc_mm_walk_args(NUMA_NO_NODE);
+
 	mutex_lock(&lru_gen_state_mutex);
 	cgroup_lock();
 
@@ -3443,6 +4331,10 @@ static int __init init_lru_gen(void)
 	BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
 	BUILD_BUG_ON(sizeof(MM_STAT_CODES) != NR_MM_STATS + 1);
 
+	VM_BUG_ON(PMD_SIZE / PAGE_SIZE != PTRS_PER_PTE);
+	VM_BUG_ON(PUD_SIZE / PMD_SIZE != PTRS_PER_PMD);
+	VM_BUG_ON(P4D_SIZE / PUD_SIZE != PTRS_PER_PUD);
+
 	if (mem_cgroup_disabled()) {
 		global_mm_list = alloc_mm_list();
 		if (!global_mm_list)
@@ -3460,6 +4352,12 @@ static int __init init_lru_gen(void)
  */
 arch_initcall(init_lru_gen);
 
+#else /* CONFIG_LRU_GEN */
+
+static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
+{
+}
+
 #endif /* CONFIG_LRU_GEN */
 
 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
@@ -4313,6 +5211,11 @@ static void age_active_anon(struct pglist_data *pgdat,
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
+	if (lru_gen_enabled()) {
+		lru_gen_age_node(pgdat, sc);
+		return;
+	}
+
 	if (!total_swap_pages)
 		return;
 
-- 
2.33.0.rc1.237.g0d66db33f3-goog



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v4 08/11] mm: multigenerational lru: eviction
  2021-08-18  6:30 [PATCH v4 00/11] Multigenerational LRU Framework Yu Zhao
                   ` (6 preceding siblings ...)
  2021-08-18  6:31 ` [PATCH v4 07/11] mm: multigenerational lru: aging Yu Zhao
@ 2021-08-18  6:31 ` Yu Zhao
  2021-08-18  6:31 ` [PATCH v4 09/11] mm: multigenerational lru: user interface Yu Zhao
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yu Zhao @ 2021-08-18  6:31 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Hillf Danton, page-reclaim, Yu Zhao, Konstantin Kharlamov

The eviction consumes old generations. Given an lruvec, the eviction
scans pages on lrugen->lists indexed by the anon and file min_seq[2]
(modulo MAX_NR_GENS). It first tries to select a type based on the
values of min_seq[2]: the type whose min_seq is further behind, i.e.,
whose oldest generation is older, is preferred; if the two are equal,
it selects the type that has the lower refault rate. The eviction
sorts a page according to its updated generation number if the aging
has found this page accessed. It also moves a page to the next
generation if this page is from an upper tier that has a higher
refault rate than the base tier. The eviction increments min_seq[2]
of the selected type when it finds that lrugen->lists indexed by
min_seq[2] of this type are empty.
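
A minimal sketch of this type selection (illustrative only, not the
patch's code; the actual logic lives in get_type_to_scan() and
isolate_pages() below, with the refault comparison reduced to two
made-up counters):

  /* returns 0 for anon or 1 for file, matching page_is_file_lru() */
  static int select_type_sketch(unsigned long min_seq_anon,
                                unsigned long min_seq_file, int swappiness,
                                unsigned long refaults_anon,
                                unsigned long refaults_file)
  {
          if (!swappiness)
                  return 1;                  /* no swapping: file only */
          if (min_seq_anon != min_seq_file)
                  return min_seq_anon > min_seq_file;  /* older type first */
          if (swappiness == 200)
                  return 0;                  /* anon first */
          if (swappiness == 1)
                  return 1;                  /* file first */
          /* equal min_seq: evict the type with the lower refault rate */
          return refaults_file < refaults_anon;
  }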

With the aging and the eviction in place, implementing page reclaim
becomes quite straightforward (a sketch follows the list below):
  1) To reduce the latency, direct reclaim skips the aging unless both
  min_seq[2] are equal to max_seq-1. Then it invokes the eviction.
  2) To avoid the aging in the direct reclaim path, kswapd invokes the
  aging if either of min_seq[2] is equal to max_seq-1. Then it invokes
  the eviction.
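
A sketch of 1) and 2) above (illustrative only; the real checks live
in get_nr_to_scan() and lru_gen_age_node() below):

  /* returns nonzero if the aging should run before the eviction */
  static int aging_is_due_sketch(unsigned long max_seq,
                                 unsigned long min_seq_anon,
                                 unsigned long min_seq_file, int is_kswapd)
  {
          int anon_due = min_seq_anon == max_seq - 1;
          int file_due = min_seq_file == max_seq - 1;

          /* kswapd ages early so direct reclaim rarely has to */
          return is_kswapd ? anon_due || file_due : anon_due && file_due;
  }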

Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
---
 mm/vmscan.c | 440 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 440 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 757ba4f415cc..2f1fffbd2d61 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1311,6 +1311,11 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 		if (!sc->may_unmap && page_mapped(page))
 			goto keep_locked;
 
+		/* lru_gen_scan_around() has updated this page? */
+		if (lru_gen_enabled() && !ignore_references &&
+		    page_mapped(page) && PageReferenced(page))
+			goto keep_locked;
+
 		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
 			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
 
@@ -2447,6 +2452,9 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
 	unsigned long file;
 	struct lruvec *target_lruvec;
 
+	if (lru_gen_enabled())
+		return;
+
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
 
 	/*
@@ -4087,6 +4095,426 @@ void lru_gen_scan_around(struct page_vma_mapped_walk *pvmw)
 	unlock_page_memcg(pvmw->page);
 }
 
+/******************************************************************************
+ *                          the eviction
+ ******************************************************************************/
+
+static bool sort_page(struct page *page, struct lruvec *lruvec, int tier_to_isolate)
+{
+	bool success;
+	int gen = page_lru_gen(page);
+	int type = page_is_file_lru(page);
+	int zone = page_zonenum(page);
+	int tier = page_lru_tier(page);
+	struct lrugen *lrugen = &lruvec->evictable;
+
+	VM_BUG_ON_PAGE(gen < 0, page);
+	VM_BUG_ON_PAGE(tier_to_isolate < 0, page);
+
+	/* a lazy-free page that has been written into? */
+	if (type && PageDirty(page) && PageAnon(page)) {
+		success = lru_gen_del_page(page, lruvec, false);
+		VM_BUG_ON_PAGE(!success, page);
+		SetPageSwapBacked(page);
+		add_page_to_lru_list_tail(page, lruvec);
+		return true;
+	}
+
+	/* page_update_gen() has updated this page? */
+	if (gen != lru_gen_from_seq(lrugen->min_seq[type])) {
+		list_move(&page->lru, &lrugen->lists[gen][type][zone]);
+		return true;
+	}
+
+	/* protect this page if its tier has a higher refault rate */
+	if (tier_to_isolate < tier) {
+		int hist = lru_hist_from_seq(gen);
+
+		page_inc_gen(page, lruvec, false);
+		WRITE_ONCE(lrugen->protected[hist][type][tier - 1],
+			   lrugen->protected[hist][type][tier - 1] + thp_nr_pages(page));
+		inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type);
+		return true;
+	}
+
+	/* mark this page for reclaim if it's pending writeback */
+	if (PageWriteback(page) || (type && PageDirty(page))) {
+		page_inc_gen(page, lruvec, true);
+		return true;
+	}
+
+	return false;
+}
+
+static bool isolate_page(struct page *page, struct lruvec *lruvec, struct scan_control *sc)
+{
+	bool success;
+
+	if (!sc->may_unmap && page_mapped(page))
+		return false;
+
+	if (!(sc->may_writepage && (sc->gfp_mask & __GFP_IO)) &&
+	    (PageDirty(page) || (PageAnon(page) && !PageSwapCache(page))))
+		return false;
+
+	if (!get_page_unless_zero(page))
+		return false;
+
+	if (!TestClearPageLRU(page)) {
+		put_page(page);
+		return false;
+	}
+
+	success = lru_gen_del_page(page, lruvec, true);
+	VM_BUG_ON_PAGE(!success, page);
+
+	return true;
+}
+
+static int scan_pages(struct lruvec *lruvec, struct scan_control *sc, long *nr_to_scan,
+		      int type, int tier, struct list_head *list)
+{
+	bool success;
+	int gen, zone;
+	enum vm_event_item item;
+	int sorted = 0;
+	int scanned = 0;
+	int isolated = 0;
+	int remaining = MAX_BATCH_SIZE;
+	struct lrugen *lrugen = &lruvec->evictable;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	VM_BUG_ON(!list_empty(list));
+
+	if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
+		return -ENOENT;
+
+	gen = lru_gen_from_seq(lrugen->min_seq[type]);
+
+	for (zone = sc->reclaim_idx; zone >= 0; zone--) {
+		LIST_HEAD(moved);
+		int skipped = 0;
+		struct list_head *head = &lrugen->lists[gen][type][zone];
+
+		while (!list_empty(head)) {
+			struct page *page = lru_to_page(head);
+			int delta = thp_nr_pages(page);
+
+			VM_BUG_ON_PAGE(PageTail(page), page);
+			VM_BUG_ON_PAGE(PageUnevictable(page), page);
+			VM_BUG_ON_PAGE(PageActive(page), page);
+			VM_BUG_ON_PAGE(page_is_file_lru(page) != type, page);
+			VM_BUG_ON_PAGE(page_zonenum(page) != zone, page);
+
+			prefetchw_prev_lru_page(page, head, flags);
+
+			scanned += delta;
+
+			if (sort_page(page, lruvec, tier))
+				sorted += delta;
+			else if (isolate_page(page, lruvec, sc)) {
+				list_add(&page->lru, list);
+				isolated += delta;
+			} else {
+				list_move(&page->lru, &moved);
+				skipped += delta;
+			}
+
+			if (!--remaining)
+				break;
+
+			if (max(isolated, skipped) >= SWAP_CLUSTER_MAX)
+				break;
+		}
+
+		list_splice(&moved, head);
+		__count_zid_vm_events(PGSCAN_SKIP, zone, skipped);
+
+		if (!remaining || isolated >= SWAP_CLUSTER_MAX)
+			break;
+	}
+
+	success = try_to_inc_min_seq(lruvec, type);
+
+	*nr_to_scan -= scanned;
+
+	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
+	if (!cgroup_reclaim(sc)) {
+		__count_vm_events(item, isolated);
+		__count_vm_events(PGREFILL, sorted);
+	}
+	__count_memcg_events(memcg, item, isolated);
+	__count_memcg_events(memcg, PGREFILL, sorted);
+	__count_vm_events(PGSCAN_ANON + type, isolated);
+
+	if (isolated)
+		return isolated;
+	/*
+	 * We may have trouble finding eligible pages due to reclaim_idx,
+	 * may_unmap and may_writepage. The following check makes sure we won't
+	 * be stuck if we aren't making enough progress.
+	 */
+	return !remaining || success || *nr_to_scan <= 0 ? 0 : -ENOENT;
+}
+
+static int get_tier_to_isolate(struct lruvec *lruvec, int type)
+{
+	int tier;
+	struct controller_pos sp, pv;
+
+	/*
+	 * Ideally we don't want to evict upper tiers that have higher refault
+	 * rates. However, we need to leave a margin for the fluctuations in
+	 * refault rates. So we use a larger gain factor to make sure upper
+	 * tiers are indeed more active. We choose 2 because the lowest upper
+	 * tier would have twice the refault rate of the base tier, according
+	 * to their numbers of accesses.
+	 */
+	read_controller_pos(&sp, lruvec, type, 0, 1);
+	for (tier = 1; tier < MAX_NR_TIERS; tier++) {
+		read_controller_pos(&pv, lruvec, type, tier, 2);
+		if (!positive_ctrl_err(&sp, &pv))
+			break;
+	}
+
+	return tier - 1;
+}
+
+static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_to_isolate)
+{
+	int type, tier;
+	struct controller_pos sp, pv;
+	int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness };
+
+	/*
+	 * Compare the refault rates between the base tiers of anon and file to
+	 * determine which type to evict. Also need to compare the refault rates
+	 * of the upper tiers of the selected type with that of the base tier of
+	 * the other type to determine which tier of the selected type to evict.
+	 */
+	read_controller_pos(&sp, lruvec, 0, 0, gain[0]);
+	read_controller_pos(&pv, lruvec, 1, 0, gain[1]);
+	type = positive_ctrl_err(&sp, &pv);
+
+	read_controller_pos(&sp, lruvec, !type, 0, gain[!type]);
+	for (tier = 1; tier < MAX_NR_TIERS; tier++) {
+		read_controller_pos(&pv, lruvec, type, tier, gain[type]);
+		if (!positive_ctrl_err(&sp, &pv))
+			break;
+	}
+
+	*tier_to_isolate = tier - 1;
+
+	return type;
+}
+
+static int isolate_pages(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
+			 long *nr_to_scan, int *type_to_scan, struct list_head *list)
+{
+	int i;
+	int type;
+	int isolated;
+	int tier = -1;
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	if (get_hi_wmark(max_seq, min_seq, swappiness) == MIN_NR_GENS)
+		return 0;
+	/*
+	 * Try to select a type based on generations and swappiness, and if that
+	 * fails, fall back to get_type_to_scan(). When anon and file are both
+	 * available from the same generation, swappiness 200 is interpreted as
+	 * anon first and swappiness 1 is interpreted as file first.
+	 */
+	if (!swappiness)
+		type = 1;
+	else if (min_seq[0] > min_seq[1])
+		type = 1;
+	else if (min_seq[0] < min_seq[1])
+		type = 0;
+	else if (swappiness == 1)
+		type = 1;
+	else if (swappiness == 200)
+		type = 0;
+	else
+		type = get_type_to_scan(lruvec, swappiness, &tier);
+
+	if (tier == -1)
+		tier = get_tier_to_isolate(lruvec, type);
+
+	for (i = !swappiness; i < ANON_AND_FILE; i++) {
+		isolated = scan_pages(lruvec, sc, nr_to_scan, type, tier, list);
+		if (isolated >= 0)
+			break;
+
+		type = !type;
+		tier = get_tier_to_isolate(lruvec, type);
+	}
+
+	if (isolated < 0)
+		isolated = *nr_to_scan = 0;
+
+	*type_to_scan = type;
+
+	return isolated;
+}
+
+/* Main function used by the foreground, the background and the user-triggered eviction. */
+static bool evict_pages(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
+			long *nr_to_scan)
+{
+	int type;
+	int isolated;
+	int reclaimed;
+	LIST_HEAD(list);
+	struct page *page;
+	enum vm_event_item item;
+	struct reclaim_stat stat;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	isolated = isolate_pages(lruvec, sc, swappiness, nr_to_scan, &type, &list);
+	VM_BUG_ON(list_empty(&list) == !!isolated);
+
+	if (isolated)
+		__mod_node_page_state(pgdat, NR_ISOLATED_ANON + type, isolated);
+
+	spin_unlock_irq(&lruvec->lru_lock);
+
+	if (!isolated)
+		goto done;
+
+	reclaimed = shrink_page_list(&list, pgdat, sc, &stat, false);
+	/*
+	 * We need to prevent rejected pages from being added back to the same
+	 * lists they were isolated from. Otherwise we may risk looping on them
+	 * forever.
+	 */
+	list_for_each_entry(page, &list, lru) {
+		if (!page_evictable(page))
+			continue;
+
+		if (!PageReclaim(page) || !(PageDirty(page) || PageWriteback(page)))
+			SetPageActive(page);
+
+		ClearPageReferenced(page);
+		ClearPageWorkingset(page);
+	}
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	move_pages_to_lru(lruvec, &list);
+
+	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + type, -isolated);
+
+	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
+	if (!cgroup_reclaim(sc))
+		__count_vm_events(item, reclaimed);
+	__count_memcg_events(memcg, item, reclaimed);
+	__count_vm_events(PGSTEAL_ANON + type, reclaimed);
+
+	spin_unlock_irq(&lruvec->lru_lock);
+
+	mem_cgroup_uncharge_list(&list);
+	free_unref_page_list(&list);
+
+	sc->nr_reclaimed += reclaimed;
+done:
+	return *nr_to_scan > 0 && sc->nr_reclaimed < sc->nr_to_reclaim;
+}
+
+static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
+{
+	int gen, type, zone;
+	int nr_gens;
+	long nr_to_scan = 0;
+	struct lrugen *lrugen = &lruvec->evictable;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	for (type = !swappiness; type < ANON_AND_FILE; type++) {
+		unsigned long seq;
+
+		for (seq = min_seq[type]; seq <= max_seq; seq++) {
+			gen = lru_gen_from_seq(seq);
+
+			for (zone = 0; zone <= sc->reclaim_idx; zone++)
+				nr_to_scan += READ_ONCE(lrugen->sizes[gen][type][zone]);
+		}
+	}
+
+	if (nr_to_scan <= 0)
+		return 0;
+
+	nr_gens = get_hi_wmark(max_seq, min_seq, swappiness);
+
+	if (current_is_kswapd()) {
+		gen = lru_gen_from_seq(max_seq - nr_gens + 1);
+		if (time_is_before_eq_jiffies(READ_ONCE(lrugen->timestamps[gen]) +
+					      READ_ONCE(lru_gen_min_ttl)))
+			sc->file_is_tiny = 0;
+
+		/* leave the work to lru_gen_age_node() */
+		if (nr_gens == MIN_NR_GENS)
+			return 0;
+
+		if (nr_to_scan >= sc->nr_to_reclaim)
+			sc->force_deactivate = 0;
+	}
+
+	nr_to_scan = max(nr_to_scan >> sc->priority, (long)!mem_cgroup_online(memcg));
+	if (!nr_to_scan || nr_gens > MIN_NR_GENS)
+		return nr_to_scan;
+
+	/* move onto other memcgs if we haven't tried them all yet */
+	if (!mem_cgroup_disabled() && !sc->force_deactivate) {
+		sc->skipped_deactivate = 1;
+		return 0;
+	}
+
+	return try_to_inc_max_seq(lruvec, max_seq, sc, swappiness) ? nr_to_scan : 0;
+}
+
+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+	struct blk_plug plug;
+	long scanned = 0;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	lru_add_drain();
+
+	blk_start_plug(&plug);
+
+	while (true) {
+		long nr_to_scan;
+		int swappiness = sc->may_swap ? get_swappiness(memcg) : 0;
+
+		nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness) - scanned;
+		if (nr_to_scan <= 0)
+			break;
+
+		scanned += nr_to_scan;
+
+		if (!evict_pages(lruvec, sc, swappiness, &nr_to_scan))
+			break;
+
+		scanned -= nr_to_scan;
+
+		if (mem_cgroup_below_min(memcg) ||
+		    (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim))
+			break;
+
+		cond_resched();
+	}
+
+	blk_finish_plug(&plug);
+}
+
 /******************************************************************************
  *                          state change
  ******************************************************************************/
@@ -4358,6 +4786,10 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 {
 }
 
+static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+}
+
 #endif /* CONFIG_LRU_GEN */
 
 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
@@ -4371,6 +4803,11 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	struct blk_plug plug;
 	bool scan_adjusted;
 
+	if (lru_gen_enabled()) {
+		lru_gen_shrink_lruvec(lruvec, sc);
+		return;
+	}
+
 	get_scan_count(lruvec, sc, nr);
 
 	/* Record the original scan target for proportional adjustments later */
@@ -4837,6 +5274,9 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
 	struct lruvec *target_lruvec;
 	unsigned long refaults;
 
+	if (lru_gen_enabled())
+		return;
+
 	target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON);
 	target_lruvec->refaults[0] = refaults;
-- 
2.33.0.rc1.237.g0d66db33f3-goog



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v4 09/11] mm: multigenerational lru: user interface
  2021-08-18  6:30 [PATCH v4 00/11] Multigenerational LRU Framework Yu Zhao
                   ` (7 preceding siblings ...)
  2021-08-18  6:31 ` [PATCH v4 08/11] mm: multigenerational lru: eviction Yu Zhao
@ 2021-08-18  6:31 ` Yu Zhao
  2021-08-18  6:31 ` [PATCH v4 10/11] mm: multigenerational lru: Kconfig Yu Zhao
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yu Zhao @ 2021-08-18  6:31 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Hillf Danton, page-reclaim, Yu Zhao, Konstantin Kharlamov

Add /sys/kernel/mm/lru_gen/enabled to enable and disable the
multigenerational lru at runtime.

Add /sys/kernel/mm/lru_gen/min_ttl_ms to protect the working set
accessed within a given number of milliseconds. The OOM killer is
invoked if this working set cannot be kept in memory.
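
As a hedged illustration of these two knobs (not part of the patch;
error handling is kept minimal and the 2000ms value is arbitrary):

  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  static int write_str(const char *path, const char *val)
  {
          int fd = open(path, O_WRONLY);

          if (fd < 0)
                  return -1;
          if (write(fd, val, strlen(val)) < 0) {
                  close(fd);
                  return -1;
          }
          return close(fd);
  }

  int main(void)
  {
          /* turn the feature on and protect the last 2 seconds of use */
          if (write_str("/sys/kernel/mm/lru_gen/enabled", "1"))
                  return 1;
          if (write_str("/sys/kernel/mm/lru_gen/min_ttl_ms", "2000"))
                  return 1;
          return 0;
  }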

Add /sys/kernel/debug/lru_gen to monitor the multigenerational lru and
invoke the aging and the eviction. This file has the following output:
  memcg  memcg_id  memcg_path
    node  node_id
      min_gen  birth_time  anon_size  file_size
      ...
      max_gen  birth_time  anon_size  file_size

min_gen is the oldest generation number and max_gen is the youngest
generation number. birth_time is in milliseconds. anon_size and
file_size are in pages.

This file takes the following input:
  + memcg_id node_id max_gen [swappiness]
  - memcg_id node_id min_gen [swappiness] [nr_to_reclaim]

The first command line invokes the aging, which scans PTEs for
accessed pages and then creates the next generation max_gen+1. A swap
file and a non-zero swappiness, which overrides vm.swappiness, are
required to scan PTEs mapping anon pages. The second command line
invokes the eviction, which evicts generations less than or equal to
min_gen. min_gen should be less than max_gen-1 as max_gen and
max_gen-1 are not fully aged and therefore cannot be evicted.
nr_to_reclaim can be used to limit the number of pages to evict.
Multiple command lines are supported, as is concatenation with
delimiters "," and ";".
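
For example (a hedged sketch, not part of the patch; the memcg ID,
node ID and generation numbers are made up and would normally be read
from the file first):

  #include <stdio.h>

  int main(void)
  {
          FILE *f = fopen("/sys/kernel/debug/lru_gen", "w");

          if (!f)
                  return 1;
          /* aging: with max_gen currently 7, create generation 8 */
          fprintf(f, "+ 3 0 7\n");
          /* eviction: generations <= 5, swappiness 100, up to 4096 pages */
          fprintf(f, "- 3 0 5 100 4096\n");
          return fclose(f) ? 1 : 0;
  }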

Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
---
 include/linux/nodemask.h |   1 +
 mm/vmscan.c              | 412 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 413 insertions(+)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 567c3ddba2c4..90840c459abc 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -486,6 +486,7 @@ static inline int num_node_state(enum node_states state)
 #define first_online_node	0
 #define first_memory_node	0
 #define next_online_node(nid)	(MAX_NUMNODES)
+#define next_memory_node(nid)	(MAX_NUMNODES)
 #define nr_node_ids		1U
 #define nr_online_nodes		1U
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2f1fffbd2d61..c6d539a73d00 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -52,6 +52,8 @@
 #include <linux/memory.h>
 #include <linux/pagewalk.h>
 #include <linux/shmem_fs.h>
+#include <linux/ctype.h>
+#include <linux/debugfs.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -4732,6 +4734,410 @@ static int __meminit __maybe_unused mem_notifier(struct notifier_block *self,
 	return NOTIFY_DONE;
 }
 
+/******************************************************************************
+ *                          sysfs interface
+ ******************************************************************************/
+
+static ssize_t show_min_ttl(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u\n", jiffies_to_msecs(READ_ONCE(lru_gen_min_ttl)));
+}
+
+static ssize_t store_min_ttl(struct kobject *kobj, struct kobj_attribute *attr,
+			     const char *buf, size_t len)
+{
+	unsigned int msecs;
+
+	if (kstrtouint(buf, 10, &msecs))
+		return -EINVAL;
+
+	WRITE_ONCE(lru_gen_min_ttl, msecs_to_jiffies(msecs));
+
+	return len;
+}
+
+static struct kobj_attribute lru_gen_min_ttl_attr = __ATTR(
+	min_ttl_ms, 0644, show_min_ttl, store_min_ttl
+);
+
+static ssize_t show_enable(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	return snprintf(buf, PAGE_SIZE, "%d\n", lru_gen_enabled());
+}
+
+static ssize_t store_enable(struct kobject *kobj, struct kobj_attribute *attr,
+			    const char *buf, size_t len)
+{
+	int enable;
+
+	if (kstrtoint(buf, 10, &enable))
+		return -EINVAL;
+
+	lru_gen_set_state(enable, true, false);
+
+	return len;
+}
+
+static struct kobj_attribute lru_gen_enabled_attr = __ATTR(
+	enabled, 0644, show_enable, store_enable
+);
+
+static struct attribute *lru_gen_attrs[] = {
+	&lru_gen_min_ttl_attr.attr,
+	&lru_gen_enabled_attr.attr,
+	NULL
+};
+
+static struct attribute_group lru_gen_attr_group = {
+	.name = "lru_gen",
+	.attrs = lru_gen_attrs,
+};
+
+/******************************************************************************
+ *                          debugfs interface
+ ******************************************************************************/
+
+static void *lru_gen_seq_start(struct seq_file *m, loff_t *pos)
+{
+	struct mem_cgroup *memcg;
+	loff_t nr_to_skip = *pos;
+
+	m->private = kvmalloc(PATH_MAX, GFP_KERNEL);
+	if (!m->private)
+		return ERR_PTR(-ENOMEM);
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		int nid;
+
+		for_each_node_state(nid, N_MEMORY) {
+			if (!nr_to_skip--)
+				return mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+		}
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+	return NULL;
+}
+
+static void lru_gen_seq_stop(struct seq_file *m, void *v)
+{
+	if (!IS_ERR_OR_NULL(v))
+		mem_cgroup_iter_break(NULL, lruvec_memcg(v));
+
+	kvfree(m->private);
+	m->private = NULL;
+}
+
+static void *lru_gen_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	int nid = lruvec_pgdat(v)->node_id;
+	struct mem_cgroup *memcg = lruvec_memcg(v);
+
+	++*pos;
+
+	nid = next_memory_node(nid);
+	if (nid == MAX_NUMNODES) {
+		memcg = mem_cgroup_iter(NULL, memcg, NULL);
+		if (!memcg)
+			return NULL;
+
+		nid = first_memory_node;
+	}
+
+	return mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+}
+
+static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec,
+				  unsigned long max_seq, unsigned long *min_seq,
+				  unsigned long seq)
+{
+	int i;
+	int type, tier;
+	int hist = lru_hist_from_seq(seq);
+	struct lrugen *lrugen = &lruvec->evictable;
+	int nid = lruvec_pgdat(lruvec)->node_id;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+
+	for (tier = 0; tier < MAX_NR_TIERS; tier++) {
+		seq_printf(m, "            %10d", tier);
+		for (type = 0; type < ANON_AND_FILE; type++) {
+			unsigned long n[3] = {};
+
+			if (seq == max_seq) {
+				n[0] = READ_ONCE(lrugen->avg_refaulted[type][tier]);
+				n[1] = READ_ONCE(lrugen->avg_total[type][tier]);
+
+				seq_printf(m, " %10luR %10luT %10lu ", n[0], n[1], n[2]);
+			} else if (seq == min_seq[type] || NR_STAT_GENS > 1) {
+				n[0] = atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+				n[1] = atomic_long_read(&lrugen->evicted[hist][type][tier]);
+				if (tier)
+					n[2] = READ_ONCE(lrugen->protected[hist][type][tier - 1]);
+
+				seq_printf(m, " %10lur %10lue %10lup", n[0], n[1], n[2]);
+			} else
+				seq_puts(m, "          0           0           0 ");
+		}
+		seq_putc(m, '\n');
+	}
+
+	seq_puts(m, "                      ");
+	for (i = 0; i < NR_MM_STATS; i++) {
+		if (i == 6)
+			seq_puts(m, "\n                      ");
+
+		if (seq == max_seq && NR_STAT_GENS == 1)
+			seq_printf(m, " %10lu%c", READ_ONCE(mm_list->nodes[nid].stats[hist][i]),
+				   toupper(MM_STAT_CODES[i]));
+		else if (seq != max_seq && NR_STAT_GENS > 1)
+			seq_printf(m, " %10lu%c", READ_ONCE(mm_list->nodes[nid].stats[hist][i]),
+				   MM_STAT_CODES[i]);
+		else
+			seq_puts(m, "          0 ");
+	}
+	seq_putc(m, '\n');
+}
+
+static int lru_gen_seq_show(struct seq_file *m, void *v)
+{
+	unsigned long seq;
+	bool full = !debugfs_real_fops(m->file)->write;
+	struct lruvec *lruvec = v;
+	struct lrugen *lrugen = &lruvec->evictable;
+	int nid = lruvec_pgdat(lruvec)->node_id;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	if (nid == first_memory_node) {
+		const char *path = memcg ? m->private : "";
+
+#ifdef CONFIG_MEMCG
+		if (memcg)
+			cgroup_path(memcg->css.cgroup, m->private, PATH_MAX);
+#endif
+		seq_printf(m, "memcg %5hu %s\n", mem_cgroup_id(memcg), path);
+	}
+
+	seq_printf(m, " node %5d\n", nid);
+
+	if (!full)
+		seq = min(min_seq[0], min_seq[1]);
+	else if (max_seq >= MAX_NR_GENS)
+		seq = max_seq - MAX_NR_GENS + 1;
+	else
+		seq = 0;
+
+	for (; seq <= max_seq; seq++) {
+		int gen, type, zone;
+		unsigned int msecs;
+
+		gen = lru_gen_from_seq(seq);
+		msecs = jiffies_to_msecs(jiffies - READ_ONCE(lrugen->timestamps[gen]));
+
+		seq_printf(m, " %10lu %10u", seq, msecs);
+
+		for (type = 0; type < ANON_AND_FILE; type++) {
+			long size = 0;
+
+			if (seq < min_seq[type]) {
+				seq_puts(m, "         -0 ");
+				continue;
+			}
+
+			for (zone = 0; zone < MAX_NR_ZONES; zone++)
+				size += READ_ONCE(lrugen->sizes[gen][type][zone]);
+
+			seq_printf(m, " %10lu ", max(size, 0L));
+		}
+
+		seq_putc(m, '\n');
+
+		if (full)
+			lru_gen_seq_show_full(m, lruvec, max_seq, min_seq, seq);
+	}
+
+	return 0;
+}
+
+static const struct seq_operations lru_gen_seq_ops = {
+	.start = lru_gen_seq_start,
+	.stop = lru_gen_seq_stop,
+	.next = lru_gen_seq_next,
+	.show = lru_gen_seq_show,
+};
+
+static int run_aging(struct lruvec *lruvec, unsigned long seq, int swappiness)
+{
+	struct scan_control sc = {};
+	DEFINE_MAX_SEQ(lruvec);
+
+	if (seq == max_seq)
+		try_to_inc_max_seq(lruvec, max_seq, &sc, swappiness);
+
+	return seq > max_seq ? -EINVAL : 0;
+}
+
+static int run_eviction(struct lruvec *lruvec, unsigned long seq, int swappiness,
+			unsigned long nr_to_reclaim)
+{
+	unsigned int flags;
+	struct blk_plug plug;
+	int err = -EINTR;
+	long nr_to_scan = LONG_MAX;
+	struct scan_control sc = {
+		.nr_to_reclaim = nr_to_reclaim,
+		.may_writepage = 1,
+		.may_unmap = 1,
+		.may_swap = 1,
+		.reclaim_idx = MAX_NR_ZONES - 1,
+		.gfp_mask = GFP_KERNEL,
+	};
+	DEFINE_MAX_SEQ(lruvec);
+
+	if (seq >= max_seq - 1)
+		return -EINVAL;
+
+	flags = memalloc_noreclaim_save();
+
+	blk_start_plug(&plug);
+
+	while (!signal_pending(current)) {
+		DEFINE_MIN_SEQ(lruvec);
+
+		if (seq < min(min_seq[!swappiness], min_seq[swappiness < 200]) ||
+		    !evict_pages(lruvec, &sc, swappiness, &nr_to_scan)) {
+			err = 0;
+			break;
+		}
+
+		cond_resched();
+	}
+
+	blk_finish_plug(&plug);
+
+	memalloc_noreclaim_restore(flags);
+
+	return err;
+}
+
+static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq,
+		   int swappiness, unsigned long nr_to_reclaim)
+{
+	struct lruvec *lruvec;
+	int err = -EINVAL;
+	struct mem_cgroup *memcg = NULL;
+
+	if (!mem_cgroup_disabled()) {
+		rcu_read_lock();
+		memcg = mem_cgroup_from_id(memcg_id);
+#ifdef CONFIG_MEMCG
+		if (memcg && !css_tryget(&memcg->css))
+			memcg = NULL;
+#endif
+		rcu_read_unlock();
+
+		if (!memcg)
+			goto done;
+	}
+	if (memcg_id != mem_cgroup_id(memcg))
+		goto done;
+
+	if (nid < 0 || nid >= MAX_NUMNODES || !node_state(nid, N_MEMORY))
+		goto done;
+
+	lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+
+	if (swappiness == -1)
+		swappiness = get_swappiness(memcg);
+	else if (swappiness > 200U)
+		goto done;
+
+	switch (cmd) {
+	case '+':
+		err = run_aging(lruvec, seq, swappiness);
+		break;
+	case '-':
+		err = run_eviction(lruvec, seq, swappiness, nr_to_reclaim);
+		break;
+	}
+done:
+	mem_cgroup_put(memcg);
+
+	return err;
+}
+
+static ssize_t lru_gen_seq_write(struct file *file, const char __user *src,
+				 size_t len, loff_t *pos)
+{
+	void *buf;
+	char *cur, *next;
+	int err = 0;
+
+	buf = kvmalloc(len + 1, GFP_USER);
+	if (!buf)
+		return -ENOMEM;
+
+	if (copy_from_user(buf, src, len)) {
+		kvfree(buf);
+		return -EFAULT;
+	}
+
+	next = buf;
+	next[len] = '\0';
+
+	while ((cur = strsep(&next, ",;\n"))) {
+		int n;
+		int end;
+		char cmd;
+		unsigned int memcg_id;
+		unsigned int nid;
+		unsigned long seq;
+		unsigned int swappiness = -1;
+		unsigned long nr_to_reclaim = -1;
+
+		cur = skip_spaces(cur);
+		if (!*cur)
+			continue;
+
+		n = sscanf(cur, "%c %u %u %lu %n %u %n %lu %n", &cmd, &memcg_id, &nid,
+			   &seq, &end, &swappiness, &end, &nr_to_reclaim, &end);
+		if (n < 4 || cur[end]) {
+			err = -EINVAL;
+			break;
+		}
+
+		err = run_cmd(cmd, memcg_id, nid, seq, swappiness, nr_to_reclaim);
+		if (err)
+			break;
+	}
+
+	kvfree(buf);
+
+	return err ? : len;
+}
+
+static int lru_gen_seq_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &lru_gen_seq_ops);
+}
+
+static const struct file_operations lru_gen_rw_fops = {
+	.open = lru_gen_seq_open,
+	.read = seq_read,
+	.write = lru_gen_seq_write,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
+static const struct file_operations lru_gen_ro_fops = {
+	.open = lru_gen_seq_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
 /******************************************************************************
  *                          initialization
  ******************************************************************************/
@@ -4772,6 +5178,12 @@ static int __init init_lru_gen(void)
 	if (hotplug_memory_notifier(mem_notifier, 0))
 		pr_err("lru_gen: failed to subscribe hotplug notifications\n");
 
+	if (sysfs_create_group(mm_kobj, &lru_gen_attr_group))
+		pr_err("lru_gen: failed to create sysfs group\n");
+
+	debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_rw_fops);
+	debugfs_create_file("lru_gen_full", 0444, NULL, NULL, &lru_gen_ro_fops);
+
 	return 0;
 };
 /*
-- 
2.33.0.rc1.237.g0d66db33f3-goog



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v4 10/11] mm: multigenerational lru: Kconfig
  2021-08-18  6:30 [PATCH v4 00/11] Multigenerational LRU Framework Yu Zhao
                   ` (8 preceding siblings ...)
  2021-08-18  6:31 ` [PATCH v4 09/11] mm: multigenerational lru: user interface Yu Zhao
@ 2021-08-18  6:31 ` Yu Zhao
  2021-08-18  6:31 ` [PATCH v4 11/11] mm: multigenerational lru: documentation Yu Zhao
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yu Zhao @ 2021-08-18  6:31 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Hillf Danton, page-reclaim, Yu Zhao, Konstantin Kharlamov

Add configuration options for the multigenerational lru.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
---
 mm/Kconfig | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 40a9bfcd5062..4cd257cfdf84 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -889,4 +889,63 @@ config IO_MAPPING
 config SECRETMEM
 	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
 
+# the multigenerational lru {
+config LRU_GEN
+	bool "Multigenerational LRU"
+	depends on MMU
+	# the following options may leave not enough spare bits in page->flags
+	depends on !MAXSMP && (64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP)
+	help
+	  A high performance LRU implementation for systems that heavily
+	  overcommit workloads that are not IO bound. See
+	  Documentation/vm/multigen_lru.rst for details.
+
+	  Warning: do not enable this option unless you plan to use it because
+	  it introduces a small per-process and per-memcg and per-node memory
+	  overhead.
+
+config LRU_GEN_ENABLED
+	bool "Turn on by default"
+	depends on LRU_GEN
+	help
+	  The default value of /sys/kernel/mm/lru_gen/enabled is 0. This option
+	  changes it to 1.
+
+	  Warning: the default value is the fast path. See
+	  Documentation/static-keys.txt for details.
+
+config LRU_GEN_STATS
+	bool "Full stats for debugging"
+	depends on LRU_GEN
+	help
+	  This option keeps full stats for each generation, which can be read
+	  from /sys/kernel/debug/lru_gen_full.
+
+	  Warning: do not enable this option unless you plan to use it because
+	  it introduces an additional small per-process and per-memcg and
+	  per-node memory overhead.
+
+config NR_LRU_GENS
+	int "Max number of generations"
+	depends on LRU_GEN
+	range 4 31
+	default 7
+	help
+	  This will use order_base_2(N+1) spare bits from page flags.
+
+	  Warning: do not use numbers larger than necessary because each
+	  generation introduces a small per-node and per-memcg memory overhead.
+
+config TIERS_PER_GEN
+	int "Number of tiers per generation"
+	depends on LRU_GEN
+	range 2 5
+	default 4
+	help
+	  This will use N-2 spare bits from page flags.
+
+	  Larger values generally offer better protection to active pages under
+	  heavy buffered I/O workloads.
+# }
+
 endmenu
-- 
2.33.0.rc1.237.g0d66db33f3-goog



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v4 11/11] mm: multigenerational lru: documentation
  2021-08-18  6:30 [PATCH v4 00/11] Multigenerational LRU Framework Yu Zhao
                   ` (9 preceding siblings ...)
  2021-08-18  6:31 ` [PATCH v4 10/11] mm: multigenerational lru: Kconfig Yu Zhao
@ 2021-08-18  6:31 ` Yu Zhao
  2021-10-09  5:43 ` [PATCH v4 00/11] Multigenerational LRU Framework bot
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: Yu Zhao @ 2021-08-18  6:31 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Hillf Danton, page-reclaim, Yu Zhao, Konstantin Kharlamov

Add Documentation/vm/multigen_lru.rst.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
---
 Documentation/vm/index.rst        |   1 +
 Documentation/vm/multigen_lru.rst | 134 ++++++++++++++++++++++++++++++
 2 files changed, 135 insertions(+)
 create mode 100644 Documentation/vm/multigen_lru.rst

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index eff5fbd492d0..c353b3f55924 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -17,6 +17,7 @@ various features of the Linux memory management
 
    swap_numa
    zswap
+   multigen_lru
 
 Kernel developers MM documentation
 ==================================
diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
new file mode 100644
index 000000000000..adedff5319d9
--- /dev/null
+++ b/Documentation/vm/multigen_lru.rst
@@ -0,0 +1,134 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multigenerational LRU
+=====================
+
+Quick Start
+===========
+Build Configurations
+--------------------
+:Required: Set ``CONFIG_LRU_GEN=y``.
+
+:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to turn the feature on by
+ default.
+
+Runtime Configurations
+----------------------
+:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enabled`` if the
+ feature was not turned on by default.
+
+:Optional: Write ``N`` to ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to
+ protect the working set accessed in the last ``N`` milliseconds. The
+ OOM killer is invoked if this working set cannot be kept in memory.
+
+:Optional: Read ``/sys/kernel/debug/lru_gen`` to confirm the feature
+ is turned on. This file has the following output:
+
+::
+
+  memcg  memcg_id  memcg_path
+    node  node_id
+      min_gen  birth_time  anon_size  file_size
+      ...
+      max_gen  birth_time  anon_size  file_size
+
+``min_gen`` is the oldest generation number and ``max_gen`` is the
+youngest generation number. ``birth_time`` is in milliseconds.
+``anon_size`` and ``file_size`` are in pages.
+
+Phones/Laptops/Workstations
+---------------------------
+No additional configurations required.
+
+Servers/Data Centers
+--------------------
+:To support more generations: Change ``CONFIG_NR_LRU_GENS`` to a
+ larger number.
+
+:To support more tiers: Change ``CONFIG_TIERS_PER_GEN`` to a larger
+ number.
+
+:To support full stats: Set ``CONFIG_LRU_GEN_STATS=y``.
+
+:Working set estimation: Write ``+ memcg_id node_id max_gen
+ [swappiness]`` to ``/sys/kernel/debug/lru_gen`` to invoke the aging,
+ which scans PTEs for accessed pages and then creates the next
+ generation ``max_gen+1``. A swap file and a non-zero ``swappiness``,
+ which overrides ``vm.swappiness``, are required to scan PTEs mapping
+ anon pages.
+
+:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness]
+ [nr_to_reclaim]`` to ``/sys/kernel/debug/lru_gen`` to invoke the
+ eviction, which evicts generations less than or equal to ``min_gen``.
+ ``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and
+ ``max_gen-1`` are not fully aged and therefore cannot be evicted.
+ ``nr_to_reclaim`` can be used to limit the number of pages to evict.
+ Multiple command lines are supported, as is concatenation with
+ delimiters ``,`` and ``;``.
+
+Framework
+=========
+For each ``lruvec``, evictable pages are divided into multiple
+generations. The youngest generation number is stored in
+``lrugen->max_seq`` for both anon and file types as they are aged on
+an equal footing. The oldest generation numbers are stored in
+``lrugen->min_seq[2]`` separately for anon and file types as clean
+file pages can be evicted regardless of swap and writeback
+constraints. These three variables are monotonically increasing.
+Generation numbers are truncated into
+``order_base_2(CONFIG_NR_LRU_GENS+1)`` bits in order to fit into
+``page->flags``. The sliding window technique is used to prevent
+truncated generation numbers from overlapping. Each truncated
+generation number is an index to an array of per-type and per-zone
+lists ``lrugen->lists``.
+
+Each generation is then divided into multiple tiers. Tiers represent
+levels of usage from file descriptors only. Pages accessed ``N`` times
+via file descriptors belong to tier ``order_base_2(N)``. Each
+generation contains at most ``CONFIG_TIERS_PER_GEN`` tiers, and they
+require additional ``CONFIG_TIERS_PER_GEN-2`` bits in ``page->flags``.
+In contrast to moving across generations which requires list
+operations, moving across tiers only involves operations on
+``page->flags`` and therefore has a negligible cost. A feedback loop
+modeled after the PID controller monitors refault rates of all tiers
+and decides when to protect pages from which tiers.
+
+The framework comprises two conceptually independent components: the
+aging and the eviction, which can be invoked separately from user
+space for the purpose of working set estimation and proactive reclaim.
+
+Aging
+-----
+The aging produces young generations. Given an ``lruvec``, the aging
+traverses ``lruvec_memcg()->mm_list`` and calls ``walk_page_range()``
+to scan PTEs for accessed pages (a ``mm_struct`` list is maintained
+for each ``memcg``). Upon finding one, the aging updates its
+generation number to ``max_seq`` (modulo ``CONFIG_NR_LRU_GENS``).
+After each round of traversal, the aging increments ``max_seq``. The
+aging is due when both ``min_seq[2]`` have caught up with
+``max_seq-1``.
+
+Eviction
+--------
+The eviction consumes old generations. Given an ``lruvec``, the
+eviction scans pages on the per-zone lists indexed by anon and file
+``min_seq[2]`` (modulo ``CONFIG_NR_LRU_GENS``). It first tries to
+select a type based on the values of ``min_seq[2]``. If they are
+equal, it selects the type that has a lower refault rate. The eviction
+sorts a page according to its updated generation number if the aging
+has found this page accessed. It also moves a page to the next
+generation if this page is from an upper tier that has a higher
+refault rate than the base tier. The eviction increments
+``min_seq[2]`` of a selected type when it finds all the per-zone lists
+indexed by ``min_seq[2]`` of this selected type are empty.
+
+To-do List
+==========
+KVM Optimization
+----------------
+Support shadow page table walk.
+
+NUMA Optimization
+-----------------
+Optimize page table walk for NUMA.
-- 
2.33.0.rc1.237.g0d66db33f3-goog
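
To illustrate the debugfs interface documented above, here is a
minimal user-space sketch. It is not part of the patch, and the memcg
ID, node ID and generation numbers it writes are placeholder values:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	ssize_t n;
	int fd;

	fd = open("/sys/kernel/debug/lru_gen", O_RDWR);
	if (fd < 0) {
		perror("lru_gen");
		return 1;
	}

	/* aging: scan PTEs and create the next generation (max_gen+1) */
	if (write(fd, "+ 1 0 5", 7) != 7)
		perror("aging");

	/* eviction: evict generations less than or equal to min_gen (3 here) */
	if (write(fd, "- 1 0 3", 7) != 7)
		perror("eviction");

	/* dump the per-memcg, per-node generation listing */
	lseek(fd, 0, SEEK_SET);
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, n, stdout);

	close(fd);
	return 0;
}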



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 01/11] mm: x86, arm64: add arch_has_hw_pte_young()
  2021-08-18  6:30 ` [PATCH v4 01/11] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao
@ 2021-08-19  9:19   ` Will Deacon
  2021-08-19 21:23     ` Yu Zhao
  0 siblings, 1 reply; 19+ messages in thread
From: Will Deacon @ 2021-08-19  9:19 UTC (permalink / raw)
  To: Yu Zhao; +Cc: linux-mm, linux-kernel, Hillf Danton, page-reclaim

On Wed, Aug 18, 2021 at 12:30:57AM -0600, Yu Zhao wrote:
> Some architectures set the accessed bit in PTEs automatically, e.g.,
> x86, and arm64 v8.2 and later. On architectures that do not have this
> capability, clearing the accessed bit in a PTE triggers a page fault
> following the TLB miss.
> 
> Being aware of this capability can help make better decisions, i.e.,
> whether to limit the size of each batch of PTEs and the burst of
> batches when clearing the accessed bit.
> 
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
>  arch/arm64/include/asm/cpufeature.h | 19 ++++++-------------
>  arch/arm64/include/asm/pgtable.h    | 10 ++++------
>  arch/arm64/kernel/cpufeature.c      | 19 +++++++++++++++++++
>  arch/arm64/mm/proc.S                | 12 ------------
>  arch/arm64/tools/cpucaps            |  1 +
>  arch/x86/include/asm/pgtable.h      |  6 +++---
>  include/linux/pgtable.h             | 12 ++++++++++++
>  mm/memory.c                         | 14 +-------------
>  8 files changed, 46 insertions(+), 47 deletions(-)

Please cc linux-arm-kernel and the maintainers on arm64 patches.

> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> index 9bb9d11750d7..2020b9e818c8 100644
> --- a/arch/arm64/include/asm/cpufeature.h
> +++ b/arch/arm64/include/asm/cpufeature.h
> @@ -776,6 +776,12 @@ static inline bool system_supports_tlb_range(void)
>  		cpus_have_const_cap(ARM64_HAS_TLB_RANGE);
>  }
>  
> +/* Check whether hardware update of the Access flag is supported. */
> +static inline bool system_has_hw_af(void)
> +{
> +	return IS_ENABLED(CONFIG_ARM64_HW_AFDBM) && cpus_have_const_cap(ARM64_HW_AF);
> +}

How accurate does this need to be? Heterogeneous (big/little) systems are
very common on arm64, so the existing code enables hardware access flag
unconditionally on CPUs that support it, meaning we could end up running
on a system where some CPUs have hardware update and others do not.

With your change, we only enable hardware access flag if _all_ CPUs support
it (and furthermore, we prevent late onlining of CPUs without the feature
if it was detected at boot). This sacrifices a lot of flexibility, particularly
if we end up tackling CPU errata in this area in future, and it's not clear
that it's really required for what you're trying to do.

Will


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 01/11] mm: x86, arm64: add arch_has_hw_pte_young()
  2021-08-19  9:19   ` Will Deacon
@ 2021-08-19 21:23     ` Yu Zhao
  2021-10-10  8:59       ` Hillf Danton
  0 siblings, 1 reply; 19+ messages in thread
From: Yu Zhao @ 2021-08-19 21:23 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linux-MM, linux-kernel, Hillf Danton, Kernel Page Reclaim v2,
	Catalin Marinas, Mark Rutland, Marc Zyngier, Suzuki Poulose,
	linux-arm-kernel

On Thu, Aug 19, 2021 at 3:19 AM Will Deacon <will@kernel.org> wrote:
>
> On Wed, Aug 18, 2021 at 12:30:57AM -0600, Yu Zhao wrote:
> > Some architectures set the accessed bit in PTEs automatically, e.g.,
> > x86, and arm64 v8.2 and later. On architectures that do not have this
> > capability, clearing the accessed bit in a PTE triggers a page fault
> > following the TLB miss.
> >
> > Being aware of this capability can help make better decisions, i.e.,
> > whether to limit the size of each batch of PTEs and the burst of
> > batches when clearing the accessed bit.
> >
> > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > ---
> >  arch/arm64/include/asm/cpufeature.h | 19 ++++++-------------
> >  arch/arm64/include/asm/pgtable.h    | 10 ++++------
> >  arch/arm64/kernel/cpufeature.c      | 19 +++++++++++++++++++
> >  arch/arm64/mm/proc.S                | 12 ------------
> >  arch/arm64/tools/cpucaps            |  1 +
> >  arch/x86/include/asm/pgtable.h      |  6 +++---
> >  include/linux/pgtable.h             | 12 ++++++++++++
> >  mm/memory.c                         | 14 +-------------
> >  8 files changed, 46 insertions(+), 47 deletions(-)
>
> Please cc linux-arm-kernel and the maintainers on arm64 patches.

Done. Also adding a link to the original post:
https://lore.kernel.org/patchwork/patch/1478354/

> > diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> > index 9bb9d11750d7..2020b9e818c8 100644
> > --- a/arch/arm64/include/asm/cpufeature.h
> > +++ b/arch/arm64/include/asm/cpufeature.h
> > @@ -776,6 +776,12 @@ static inline bool system_supports_tlb_range(void)
> >               cpus_have_const_cap(ARM64_HAS_TLB_RANGE);
> >  }
> >
> > +/* Check whether hardware update of the Access flag is supported. */
> > +static inline bool system_has_hw_af(void)
> > +{
> > +     return IS_ENABLED(CONFIG_ARM64_HW_AFDBM) && cpus_have_const_cap(ARM64_HW_AF);
> > +}
>
> How accurate does this need to be? Heterogeneous (big/little) systems are
> very common on arm64, so the existing code enables hardware access flag
> unconditionally on CPUs that support it, meaning we could end up running
> on a system where some CPUs have hardware update and others do not.
>
> With your change, we only enable hardware access flag if _all_ CPUs support
> it (and furthermore, we prevent late onlining of CPUs without the feature
> if it was detected at boot). This sacrifices a lot of flexibility, particularly
> if we end up tackling CPU errata in this area in future, and it's not clear
> that it's really required for what you're trying to do.

It doesn't need to be accurate but then my question is how helpful it
is if it's not accurate. Conversely, shouldn't all CPUs have it if
it's really helpful? So it seems to me whether such flexibility is
needed in the future is questionable -- AFAIK, there are no CPUs (ARM
or not) that behave this way at present. I agree we want to
try to be future proof, but usually this comes at a cost. For this
specific case, we would need two functions to detect the capability at
global and local levels to fully explore this theoretical flexibility.

The bottom line is I don't have a problem with having an additional
function to detect the capability at a global level. Note that the
specific concern in this patchset is that if a CPU thinks all other
CPUs have the capability and clears the accessed bit on many PTEs,
then those who don't have the capability may suffer the faults for
that action. (This is different from the cow_user_page() case.)
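
As a rough sketch of what those two levels could look like -- reusing
the ARM64_HW_AF capability added by this patch, with the per-CPU
helper below being purely illustrative and not something the patchset
currently adds:

	/* global level: all CPUs detected at boot have hardware AF */
	static inline bool system_has_hw_af(void)
	{
		return IS_ENABLED(CONFIG_ARM64_HW_AFDBM) &&
		       cpus_have_const_cap(ARM64_HW_AF);
	}

	/* local level: only the CPU we are currently running on */
	static inline bool this_cpu_has_hw_af(void)
	{
		return IS_ENABLED(CONFIG_ARM64_HW_AFDBM) &&
		       this_cpu_has_cap(ARM64_HW_AF);
	}

With only the local check, a CPU that has the capability could still
clear the accessed bit in large batches and leave CPUs without it to
take the resulting faults, which is the concern described above.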


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 00/11] Multigenerational LRU Framework
  2021-08-18  6:30 [PATCH v4 00/11] Multigenerational LRU Framework Yu Zhao
                   ` (10 preceding siblings ...)
  2021-08-18  6:31 ` [PATCH v4 11/11] mm: multigenerational lru: documentation Yu Zhao
@ 2021-10-09  5:43 ` bot
  2021-10-21 19:41 ` bot
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 19+ messages in thread
From: bot @ 2021-10-09  5:43 UTC (permalink / raw)
  To: yuzhao
  Cc: hdanton, linux-kernel, linux-mm, page-reclaim, corbet, michael,
	sofia.trinh

Kernel / MariaDB benchmark with MGLRU

TLDR
====
With the MGLRU, MariaDB achieved 95% CIs [5.24, 10.71]% and [20.22,
25.97]% more transactions per minute (TPM), respectively, under the
medium- and high-concurrency conditions when slightly overcommitting
memory. There were no statistically significant changes in TPM under
other conditions.

Rationale
=========
Memory overcommit can improve utilization and, if not overdone, can
also increase throughput. The challenges are estimating working sets
and optimizing page reclaim. The risks are performance degradations
and OOM kills. Short of overcoming the challenges, the only way to
reduce the risks is to overprovision memory.

MariaDB is one of the most popular open-source RDBMSs. HammerDB is
the leading open-source benchmarking software derived from the TPC
specifications. OLTP is the most important use case for RDBMSs.

Matrix
======
Kernels: version [+ patchset]
* Baseline: 5.14
* Patched: 5.14 + MGLRU

Memory conditions: % of memory size
* Underutilizing: ~10% on inactive file list
* Overcommitting: ~10% swapped out

Concurrency conditions: average # of users per CPU
* Low: ~3
* Medium: ~13
* High: ~19

Total configurations: 12
Data points per configuration: 10
Total run duration (minutes) per data point: ~45

Procedure
=========
The latest MGLRU patchset for the 5.14 kernel is available at
git fetch https://linux-mm.googlesource.com/page-reclaim \
  refs/changes/30/1430/1

Baseline and patched 5.14 kernel images are available at
https://drive.google.com/drive/folders/1eMkQleAFGkP2vzM_JyRA21oKE0ESHBqp

<install and configure OS>
hammerdbcli auto prep_tpcc.tcl
systemctl stop mariadb
e2image <backup /mnt/data>

<for each kernel>
    grub2-set-default <baseline / patched>
    <for each memory condition>
        <update /etc/my.cnf>
        <for each concurrency condition>
            <update run_tpcc.tcl>
            <for each data point>
                systemctl stop mariadb
                e2image <restore /mnt/data>
                reboot
                hammerdbcli auto run_tpcc.tcl
                <collect TPM>

Hardware
========
Memory (GB): 64
CPU (total #): 32
NVMe SSD (GB): 1024

OS
==
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)

$ cat /proc/swaps
Filename          Type          Size          Used     Priority
/dev/nvme0n1p3    partition     32970748      0          -2

$ mount | grep data
/dev/nvme0n1p4 on /mnt/data type ext4 (rw,relatime,seclabel)

$ cat /proc/cmdline
<existing parameters> systemd.unified_cgroup_hierarchy=1

$ cat /sys/fs/cgroup/user.slice/memory.min
4294967296

$ cat /proc/sys/vm/overcommit_memory
1

MariaDB
=======
$ mysql --version
mysql  Ver 15.1 Distrib 10.3.28-MariaDB, for Linux (x86_64) using
readline 5.1

$ cat /etc/my.cnf
<existing parameters>

[mysqld]
innodb_buffer_pool_size=<50G, 60G>
innodb_doublewrite=0
innodb_flush_log_at_trx_commit=0
innodb_flush_method=O_DIRECT_NO_FSYNC
innodb_flush_neighbors=0
innodb_io_capacity=4000
innodb_io_capacity_max=20000
innodb_log_buffer_size=1G
innodb_log_file_size=20G
innodb_max_dirty_pages_pct=90
innodb_max_dirty_pages_pct_lwm=10
max_connections=1000
datadir=/mnt/data

HammerDB
========
$ hammerdbcli -h
HammerDB CLI v4.2
Copyright (C) 2003-2021 Steve Shaw
Type "help" for a list of commands
Usage: hammerdbcli [ auto [ script_to_autoload.tcl  ] ]

$ cat prep_tpcc.tcl
dbset db maria
diset connection maria_socket /var/lib/mysql/mysql.sock
diset tpcc maria_count_ware 1200
diset tpcc maria_num_vu 32
diset tpcc maria_partition true
buildschema
waittocomplete
quit

$ cat run_tpcc.tcl
dbset db maria
diset connection maria_socket /var/lib/mysql/mysql.sock
diset tpcc maria_total_iterations 20000000
diset tpcc maria_driver timed
diset tpcc maria_rampup 10
diset tpcc maria_duration 30
diset tpcc maria_allwarehouse true
vuset logtotemp 1
vuset unique 1
loadscript
vuset vu <100, 400, 600>
vucreate
vurun
runtimer 3000
vudestroy

Results
=======
Comparing the patched with the baseline kernel, MariaDB achieved 95%
CIs [5.24, 10.71]% and [20.22, 25.97]% more TPM, respectively, under
the medium- and high-concurrency conditions when slightly
overcommitting memory. There were no statistically significant
changes in TPM under other conditions.

+--------------------+-----------------------+-----------------------+
| Mean TPM [95% CI]  | Underutilizing memory | Overcommitting memory |
+--------------------+-----------------------+-----------------------+
| Low concurrency    | 270811.6 / 271522.7   | 447933.4 / 447283.3   |
|                    | [-40.97, 1463.17]     | [-1330.61, 30.41]     |
+--------------------+-----------------------+-----------------------+
| Medium concurrency | 240212.9 / 242846.7   | 327276.6 / 353372.7   |
|                    | [-2611.38, 7878.98]   | [17149.01, 35043.19]  |
+--------------------+-----------------------+-----------------------+
| High concurrency   | 283897.8 / 283668.1   | 274069.7 / 337366.8   |
|                    | [-11538.08, 11078.68] | [55417.42, 71176.78]  |
+--------------------+-----------------------+-----------------------+
Table 1. Comparison between the baseline and patched kernels

Comparing overcommitting with underutilizing memory, MariaDB achieved
95% CIs [65.12, 65.68]% and [32.45, 40.04]% more TPM, respectively,
under the low- and medium-concurrency conditions when using the
baseline kernel; 95% CIs [64.48, 64.98]%, [43.53, 47.50]% and [16.48,
21.38]% more TPM, respectively, under the low-, medium- and
high-concurrency conditions when using the patched kernel. There were
no statistically significant changes in TPM under other conditions.

+--------------------+------------------------+----------------------+
| Mean TPM [95% CI]  | Baseline kernel        | Patched kernel       |
+--------------------+------------------------+----------------------+
| Low concurrency    | 270811.6 / 447933.4    | 271522.7 / 447283.3  |
|                    | [176362.0, 177881.6]   | [175089.3, 176431.9] |
+--------------------+------------------------+----------------------+
| Medium concurrency | 240212.9 / 327276.6    | 242846.7 / 353372.7  |
|                    | [77946.4, 96181.0]     | [105707.7, 115344.3] |
+--------------------+------------------------+----------------------+
| High concurrency   | 283897.8 / 274069.7    | 283668.1 / 337366.8  |
|                    | [-21605.703, 1949.503] | [46758.85, 60638.55] |
+--------------------+------------------------+----------------------+
Table 2. Comparison between underutilizing and overcommitting memory

Metrics collected during each run are available at
https://github.com/ediworks/KernelPerf/tree/master/mglru/mariadb/5.14

References
==========
HammerDB v4.2 New Features:
https://www.hammerdb.com/blog/uncategorized/hammerdb-v4-2-new-features
-pt1-mariadb-build-and-test-example-with-the-cli/

Appendix
========
$ cat raw_data.r
v <- c(
# baseline 50g 100vu
269531,270113,270256,270367,270393,270630,270707,271373,272291,272455,
# baseline 50g 400vu
231856,234985,235144,235552,238551,239994,244413,245255,247997,248382,
# baseline 50g 600vu
256365,271733,275966,280623,281014,283764,293327,296750,298728,300708,
# baseline 60g 100vu
446973,447383,447412,447489,447874,448046,448123,448531,448739,448764,
# baseline 60g 400vu
312427,312936,313780,321503,329554,330551,332377,333584,337105,348949,
# baseline 60g 600vu
262338,262971,266242,266489,268036,272494,279045,281472,289942,291668,
# patched 50g 100vu
270621,270913,271026,271137,271517,271616,271699,272117,272218,272363,
# patched 50g 400vu
233314,238265,238722,240540,241676,245204,245688,247440,248417,249201,
# patched 50g 600vu
271114,271928,277562,279455,282074,285515,287836,288508,289451,303238,
# patched 60g 100vu
445923,446178,446837,446889,447331,447480,447823,447999,448145,448228,
# patched 60g 400vu
345705,349373,350832,351229,351758,352520,355130,355247,357762,364171,
# patched 60g 600vu
330860,334705,336001,337291,338326,338361,338970,339163,339784,340207
)

a <- array(v, dim = c(10, 3, 2, 2))

# baseline vs patched
for (m in 1:2) {
    for (c in 1:3) {
        r <- t.test(a[, c, m, 1], a[, c, m, 2])
        print(r)

        # CI of mean(x) - mean(y), expressed as a percentage of mean(x)
        p <- r$conf.int * 100 / r$estimate[1]
        # a CI straddling zero means no significant difference; otherwise
        # negate and swap the bounds to report the change of the second
        # group relative to the first
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("m%d c%d: no significance", m, c)
        } else {
            s <- sprintf("m%d c%d: [%.2f, %.2f]%%", m, c, -p[2], -p[1])
        }
        print(s)
    }
}

# 50g vs 60g
for (k in 1:2) {
    for (c in 1:3) {
        r <- t.test(a[, c, 1, k], a[, c, 2, k])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("k%d c%d: no significance", k, c)
        } else {
            s <- sprintf("k%d c%d: [%.2f, %.2f]%%", k, c, -p[2], -p[1])
        }
        print(s)
    }
}

$ R -q -s -f raw_data.r

        Welch Two Sample t-test

data:  a[, c, m, 1] and a[, c, m, 2]
t = -2.0139, df = 15.122, p-value = 0.06217
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1463.17673    40.97673
sample estimates:
mean of x mean of y
 270811.6  271522.7

[1] "50g 100vu: no significance"

        Welch Two Sample t-test

data:  a[, c, m, 1] and a[, c, m, 2]
t = -1.0564, df = 17.673, p-value = 0.305
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7878.98  2611.38
sample estimates:
mean of x mean of y
 240212.9  242846.7

[1] "50g 400vu: no significance"

        Welch Two Sample t-test

data:  a[, c, m, 1] and a[, c, m, 2]
t = 0.043083, df = 15.895, p-value = 0.9662
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11078.68  11538.08
sample estimates:
mean of x mean of y
 283897.8  283668.1

[1] "50g 600vu: no significance"

        Welch Two Sample t-test

data:  a[, c, m, 1] and a[, c, m, 2]
t = 2.0171, df = 16.831, p-value = 0.05993
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  -30.41577 1330.61577
sample estimates:
mean of x mean of y
 447933.4  447283.3

[1] "60g 100vu: no significance"

        Welch Two Sample t-test

data:  a[, c, m, 1] and a[, c, m, 2]
t = -6.3473, df = 12.132, p-value = 3.499e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -35043.19 -17149.01
sample estimates:
mean of x mean of y
 327276.6  353372.7

[1] "60g 400vu: [5.24, 10.71]%"

        Welch Two Sample t-test

data:  a[, c, m, 1] and a[, c, m, 2]
t = -17.844, df = 10.233, p-value = 4.822e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -71176.78 -55417.42
sample estimates:
mean of x mean of y
 274069.7  337366.8

[1] "60g 600vu: [20.22, 25.97]%"

        Welch Two Sample t-test

data:  a[, c, 1, k] and a[, c, 2, k]
t = -495.48, df = 15.503, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -177881.6 -176362.0
sample estimates:
mean of x mean of y
 270811.6  447933.4

[1] "baseline 100vu: [65.12, 65.68]%"

        Welch Two Sample t-test

data:  a[, c, 1, k] and a[, c, 2, k]
t = -20.601, df = 13.182, p-value = 2.062e-11
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -96181.0 -77946.4
sample estimates:
mean of x mean of y
 240212.9  327276.6

[1] "baseline 400vu: [32.45, 40.04]%"

        Welch Two Sample t-test

data:  a[, c, 1, k] and a[, c, 2, k]
t = 1.7607, df = 16.986, p-value = 0.09628
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1949.503 21605.703
sample estimates:
mean of x mean of y
 283897.8  274069.7

[1] "baseline 600vu: no significance"

        Welch Two Sample t-test

data:  a[, c, 1, k] and a[, c, 2, k]
t = -553.68, df = 16.491, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -176431.9 -175089.3
sample estimates:
mean of x mean of y
 271522.7  447283.3

[1] "patched 100vu: [64.48, 64.98]%"

        Welch Two Sample t-test

data:  a[, c, 1, k] and a[, c, 2, k]
t = -48.194, df = 17.992, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -115344.3 -105707.7
sample estimates:
mean of x mean of y
 242846.7  353372.7

[1] "patched 400vu: [43.53, 47.50]%"

        Welch Two Sample t-test

data:  a[, c, 1, k] and a[, c, 2, k]
t = -17.109, df = 10.6, p-value = 4.629e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -60638.55 -46758.85
sample estimates:
mean of x mean of y
 283668.1  337366.8

[1] "patched 600vu: [16.48, 21.38]%"


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 01/11] mm: x86, arm64: add arch_has_hw_pte_young()
  2021-08-19 21:23     ` Yu Zhao
@ 2021-10-10  8:59       ` Hillf Danton
  0 siblings, 0 replies; 19+ messages in thread
From: Hillf Danton @ 2021-10-10  8:59 UTC (permalink / raw)
  To: Yu Zhao; +Cc: Will Deacon, Linux-MM, linux-kernel

On Thu, 19 Aug 2021 15:23:02 -0600 Yu Zhao wrote:
>On Thu, Aug 19, 2021 at 3:19 AM Will Deacon <will@kernel.org> wrote:
>>
>> How accurate does this need to be? Heterogeneous (big/little) systems are
>> very common on arm64, so the existing code enables hardware access flag
>> unconditionally on CPUs that support it, meaning we could end up running
>> on a system where some CPUs have hardware update and others do not.
>>
>> With your change, we only enable hardware access flag if _all_ CPUs support
>> it (and furthermore, we prevent late onlining of CPUs without the feature
>> if it was detected at boot). This sacrifices a lot of flexibility, particularly
>> if we end up tackling CPU errata in this area in future, and it's not clear
>> that it's really required for what you're trying to do.
>
>It doesn't need to be accurate but then my question is how helpful it
>is if it's not accurate.

Alternatively, to make the issue simpler, spin a version without arm64
included, given that it will be revisited once MGLRU lands in the
mainline tree.

Hillf


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 00/11] Multigenerational LRU Framework
  2021-08-18  6:30 [PATCH v4 00/11] Multigenerational LRU Framework Yu Zhao
                   ` (11 preceding siblings ...)
  2021-10-09  5:43 ` [PATCH v4 00/11] Multigenerational LRU Framework bot
@ 2021-10-21 19:41 ` bot
  2021-11-02  0:20 ` bot
  2021-11-09  2:13 ` bot
  14 siblings, 0 replies; 19+ messages in thread
From: bot @ 2021-10-21 19:41 UTC (permalink / raw)
  To: yuzhao
  Cc: hdanton, linux-kernel, linux-mm, page-reclaim, corbet, michael,
	sofia.trinh

Kernel / Memcached benchmark with MGLRU

TLDR
====
With the MGLRU, Memcached achieved 95% CIs [23.54, 32.25]%, [20.76,
41.61]%, [13.85, 15.97]%, [21.59, 30.02]% and [23.94, 29.92]% more
operations per second (OPS), respectively, for sequential access w/
THP=always, random access w/ THP=always, random access w/ THP=never,
Gaussian access w/ THP=always and Gaussian access w/ THP=never. There
were no statistically significant changes in OPS for sequential
access w/ THP=never.

Background
==========
Memory overcommit can increase utilization and, if carried out
properly, can also increase throughput. The challenges are to improve
working set estimation and to optimize page reclaim. The risks are
performance degradations and OOM kills. Short of overcoming the
challenges, the only way to reduce the risks is to underutilize
memory.

Memcached is one of the most popular open-source in-memory KV stores.
memtier_benchmark is the leading open-source KV store benchmarking
software that supports multiple access patterns. THP can have a
negative effect under memory pressure, due to internal and/or
external fragmentation.

Matrix
======
Kernels: version [+ patchset]
* Baseline: 5.14
* Patched: 5.14 + MGLRU

Memory conditions: % of memory size
* Underutilizing: N/A
* Overcommitting: ~10% swapped out (zram)

THP (2MB Transparent Huge Pages):
* Always
* Never

Read patterns (2kB objects):
* Parallel sequential
* Uniform random
* Gaussian (SD = 1/6 of key range)

Total configurations: 12
Data points per configuration: 10
Total run duration (minutes) per data point: ~20

Note that the goal of this benchmark is to compare the performance
for the same key range, object size, and hit ratio. Since Memcached
does not support backing storage, underutilizing memory would require
fewer in-memory objects, which would lower the hit ratio; that
condition is therefore not applicable in this case.

Procedure
=========
The latest MGLRU patchset for the 5.14 kernel is available at
git fetch https://linux-mm.googlesource.com/page-reclaim \
    refs/changes/30/1430/1

Baseline and patched 5.14 kernel images are available at
https://drive.google.com/drive/folders/1eMkQleAFGkP2vzM_JyRA21oKE0ESHBqp

<install and configure OS>

<for each kernel>
    grub2-set-default <baseline, patched>
    <for each THP setting>
        echo <always, never> > \
            /sys/kernel/mm/transparent_hugepage/enabled
        <update /etc/sysconfig/memcached>
        <for each access pattern>
            <update run_memtier.sh>
            <for each data point>
                reboot
                run_memtier.sh
                <collect OPS>

Hardware
========
Memory (GB): 64
CPU (total #): 32
NVMe SSD (GB): 1024

OS
==
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)

$ cat /proc/swaps
Filename          Type          Size          Used     Priority
/dev/zram0        partition     8388604       0        -2

$ cat /proc/cmdline
<existing parameters> systemd.unified_cgroup_hierarchy=1

$ cat /sys/fs/cgroup/user.slice/memory.min
4294967296

$ cat /proc/sys/vm/overcommit_memory
1

Memcached
=========
$ memcached -V
memcached 1.5.22

$ cat /etc/sysconfig/memcached
USER="memcached"
MAXCONN="10000"
CACHESIZE="65536"
OPTIONS="-s /tmp/memcached.sock -a 0766 -t 16 -b 10000 -B binary <-L>"

memtier_benchmark
=================
$ memtier_benchmark -v
memtier_benchmark 1.3.0
Copyright (C) 2011-2020 Redis Labs Ltd.
This is free software.  You may redistribute copies of it under the
terms of
the GNU General Public License <http://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.

$ cat run_memtier.sh
# load objects
memtier_benchmark -S /tmp/memcached.sock -P memcache_binary \
    -n allkeys -c 1 -t 16 --ratio 1:0 --pipeline 1 -d 2000 \
    --key-minimum=1 --key-maximum=30000000 --key-pattern=P:P

# run benchmark
memtier_benchmark -S /tmp/memcached.sock -P memcache_binary \
    -n 30000000 -c 1 -t 16 --ratio 0:1 --pipeline 1 --randomize \
    --distinct-client-seed --key-minimum=1 --key-maximum=30000000 \
    --key-pattern=<P:P, R:R, G:G>

Results
=======
Comparing the patched with the baseline kernel, Memcached achieved
95% CIs [23.54, 32.25]%, [20.76, 41.61]%, [13.85, 15.97]%, [21.59,
30.02]% and [23.94, 29.92]% more OPS, respectively, for sequential
access w/ THP=always, random access w/ THP=always, random access w/
THP=never, Gaussian access w/ THP=always and Gaussian access w/
THP=never. There were no statistically significant changes in OPS for
sequential access w/ THP=never.

+-------------------+-----------------------+------------------------+
| Mean OPS [95% CI] | THP=always            | THP=never              |
+-------------------+-----------------------+------------------------+
| Sequential access | 519599.7 / 664543.2   | 525394.8 / 527170.6    |
|                   | [122297.9, 167589.0]  | [-15138.63, 18690.31]  |
+-------------------+-----------------------+------------------------+
| Random access     | 450033.2 / 590360.7   | 509237.3 / 585142.4    |
|                   | [93415.59, 187239.37] | [70504.51, 81305.60]   |
+-------------------+-----------------------+------------------------+
| Gaussian access   | 481182.4 / 605358.7   | 531270.8 / 674341.4    |
|                   | [103892.6, 144460.0]  | [127199.8, 158941.2]   |
+-------------------+-----------------------+------------------------+
Table 1. Comparison between the baseline and patched kernels

Comparing THP=never with THP=always, Memcached achieved 95% CIs
[2.73, 23.58]% and [5.45, 15.37]% more OPS, respectively, for random
access and Gaussian access when using the baseline kernel; 95% CIs
[-22.65, -18.69]% and [10.67, 12.12]% more OPS, respectively, for
sequential access and Gaussian access when using the patched kernel.
There were no statistically significant changes in OPS under other
conditions.

+-------------------+-----------------------+------------------------+
| Mean OPS [95% CI] | Baseline kernel       |  Patched kernel        |
+-------------------+-----------------------+------------------------+
| Sequential access | 519599.7 / 525394.8   | 664543.2 / 527170.6    |
|                   | [-18739.71, 30329.80] | [-150551.0, -124194.1] |
+-------------------+-----------------------+------------------------+
| Random access     | 450033.2 / 509237.3   | 590360.7 / 585142.4    |
|                   | [12303.49, 106104.69] | [-10816.1516, 379.475] |
+-------------------+-----------------------+------------------------+
| Gaussian access   | 481182.4 / 531270.8   | 605358.7 / 674341.4    |
|                   | [26229.02, 73947.84]  | [64570.58, 73394.70]   |
+-------------------+-----------------------+------------------------+
Table 2. Comparison between THP=always and THP=never

Metrics collected during each run are available at
https://github.com/ediworks/KernelPerf/tree/master/mglru/memcached/5.14

References
==========
memtier_benchmark: A High-Throughput Benchmarking Tool for Redis &
Memcached
https://redis.com/blog/memtier_benchmark-a-high-throughput-benchmarking-tool-for-redis-memcached/

Appendix
========
$ cat raw_data.r
v <- c(
    # baseline THP=always sequential
    460266.29, 466497.70, 516145.38, 523474.39, 528507.72, 529481.86, 533867.92, 537028.56, 546027.45, 554699.89,
    # baseline THP=always random
    371470.66, 378967.63, 381137.01, 385205.60, 449100.72, 474670.76, 490470.46, 513341.53, 525159.49, 530808.55,
    # baseline THP=always Gaussian
    455674.14, 457089.50, 460001.46, 463269.94, 468283.00, 474169.61, 477684.67, 506331.96, 507875.30, 541444.54,
    # baseline THP=never sequential
    501887.04, 507303.10, 509573.54, 515222.79, 517429.04, 530805.74, 536490.44, 538088.45, 540459.92, 556687.57,
    # baseline THP=never random
    496489.97, 506444.42, 508002.80, 508707.39, 509746.28, 511157.58, 511897.57, 511926.06, 512652.28, 515348.95,
    # baseline THP=never Gaussian
    493199.15, 504207.48, 518781.40, 520536.21, 528619.45, 540677.91, 544365.57, 551698.32, 554046.80, 556576.14,
    # patched THP=always sequential
    660711.43, 660936.88, 661275.57, 662540.65, 663417.25, 665546.99, 665680.49, 667564.03, 668555.96, 669202.36,
    # patched THP=always random
    582574.69, 583714.04, 587102.54, 587375.85, 588997.85, 589052.96, 593922.17, 594722.98, 596178.28, 599965.83,
    # patched THP=always Gaussian
    601707.98, 602055.03, 603020.28, 603335.93, 604519.55, 605086.48, 607405.59, 607570.79, 609009.54, 609875.98,
    # patched THP=never sequential
    507753.56, 509462.65, 509964.30, 510369.66, 515001.36, 531685.00, 543709.22, 545142.98, 548392.56, 550224.74,
    # patched THP=never random
    571017.21, 579705.57, 582801.51, 584475.82, 586247.73, 587209.97, 587354.87, 588661.14, 591237.23, 592712.76,
    # patched THP=never Gaussian
    666403.77, 669691.68, 670248.43, 672190.97, 672466.43, 674320.42, 674897.72, 677282.76, 678886.51, 687024.85
)

a <- array(v, dim = c(10, 3, 2, 2))

# baseline vs patched
for (thp in 1:2) {
    for (pattern in 1:3) {
        r <- t.test(a[, pattern, thp, 1], a[, pattern, thp, 2])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("thp%d pattern%d: no significance", thp, pattern)
        } else {
            s <- sprintf("thp%d pattern%d: [%.2f, %.2f]%%", thp, pattern, -p[2], -p[1])
        }
        print(s)
    }
}

# THP=always vs THP=never
for (kernel in 1:2) {
    for (pattern in 1:3) {
        r <- t.test(a[, pattern, 1, kernel], a[, pattern, 2, kernel])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("kernel%d pattern%d: no significance", kernel, pattern)
        } else {
            s <- sprintf("kernel%d pattern%d: [%.2f, %.2f]%%", kernel, pattern, -p[2], -p[1])
        }
        print(s)
    }
}

$ R -q -s -f raw_data.r

        Welch Two Sample t-test

data:  a[, pattern, thp, 1] and a[, pattern, thp, 2]
t = -14.434, df = 9.1861, p-value = 1.269e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -167589.0 -122297.9
sample estimates:
mean of x mean of y
 519599.7  664543.2

[1] "thp1 pattern1: [23.54, 32.25]%"

        Welch Two Sample t-test

data:  a[, pattern, thp, 1] and a[, pattern, thp, 2]
t = -6.7518, df = 9.1333, p-value = 7.785e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -187239.37  -93415.59
sample estimates:
mean of x mean of y
 450033.2  590360.7

[1] "thp1 pattern2: [20.76, 41.61]%"

        Welch Two Sample t-test

data:  a[, pattern, thp, 1] and a[, pattern, thp, 2]
t = -13.805, df = 9.1933, p-value = 1.866e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -144460.0 -103892.6
sample estimates:
mean of x mean of y
 481182.4  605358.7

[1] "thp1 pattern3: [21.59, 30.02]%"

        Welch Two Sample t-test

data:  a[, pattern, thp, 1] and a[, pattern, thp, 2]
t = -0.22059, df = 17.979, p-value = 0.8279
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -18690.31  15138.63
sample estimates:
mean of x mean of y
 525394.8  527170.6

[1] "thp2 pattern1: no significance"

        Welch Two Sample t-test

data:  a[, pattern, thp, 1] and a[, pattern, thp, 2]
t = -29.606, df = 17.368, p-value = 2.611e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -81305.60 -70504.51
sample estimates:
mean of x mean of y
 509237.3  585142.4

[1] "thp2 pattern2: [13.85, 15.97]%"

        Welch Two Sample t-test

data:  a[, pattern, thp, 1] and a[, pattern, thp, 2]
t = -20.02, df = 10.251, p-value = 1.492e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -158941.2 -127199.8
sample estimates:
mean of x mean of y
 531270.8  674341.4

[1] "thp2 pattern3: [23.94, 29.92]%"

        Welch Two Sample t-test

data:  a[, pattern, 1, kernel] and a[, pattern, 2, kernel]
t = -0.50612, df = 14.14, p-value = 0.6206
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -30329.80  18739.71
sample estimates:
mean of x mean of y
 519599.7  525394.8

[1] "kernel1 pattern1: no significance"

        Welch Two Sample t-test

data:  a[, pattern, 1, kernel] and a[, pattern, 2, kernel]
t = -2.8503, df = 9.1116, p-value = 0.01885
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -106104.69  -12303.49
sample estimates:
mean of x mean of y
 450033.2  509237.3

[1] "kernel1 pattern2: [2.73, 23.58]%"

        Welch Two Sample t-test

data:  a[, pattern, 1, kernel] and a[, pattern, 2, kernel]
t = -4.4308, df = 16.918, p-value = 0.0003701
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -73947.84 -26229.02
sample estimates:
mean of x mean of y
 481182.4  531270.8

[1] "kernel1 pattern3: [5.45, 15.37]%"

        Welch Two Sample t-test

data:  a[, pattern, 1, kernel] and a[, pattern, 2, kernel]
t = 23.374, df = 9.5538, p-value = 9.402e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 124194.1 150551.0
sample estimates:
mean of x mean of y
 664543.2  527170.6

[1] "kernel2 pattern1: [-22.65, -18.69]%"

        Welch Two Sample t-test

data:  a[, pattern, 1, kernel] and a[, pattern, 2, kernel]
t = 1.96, df = 17.806, p-value = 0.06583
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  -379.4756 10816.1516
sample estimates:
mean of x mean of y
 590360.7  585142.4

[1] "kernel2 pattern2: no significance"

        Welch Two Sample t-test

data:  a[, pattern, 1, kernel] and a[, pattern, 2, kernel]
t = -33.687, df = 13.354, p-value = 2.614e-14
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -73394.70 -64570.58
sample estimates:
mean of x mean of y
 605358.7  674341.4

[1] "kernel2 pattern3: [10.67, 12.12]%"


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 00/11] Multigenerational LRU Framework
  2021-08-18  6:30 [PATCH v4 00/11] Multigenerational LRU Framework Yu Zhao
                   ` (12 preceding siblings ...)
  2021-10-21 19:41 ` bot
@ 2021-11-02  0:20 ` bot
  2021-11-09  2:13 ` bot
  14 siblings, 0 replies; 19+ messages in thread
From: bot @ 2021-11-02  0:20 UTC (permalink / raw)
  To: yuzhao
  Cc: hdanton, linux-kernel, linux-mm, page-reclaim, corbet, michael,
	sofia.trinh

Kernel / Apache Spark benchmark with MGLRU

TLDR
====
With the MGLRU, Apache Spark took 95% CIs [9.28, 11.19]% and [12.20,
14.93]% less wall time to sort 3 billion random integers,
respectively, under the medium- and high-concurrency conditions when
slightly overcommitting memory. There were no statistically
significant changes in wall time when sorting the same dataset under
other conditions.

Background
==========
Memory overcommit can increase utilization and, if carried out
properly, can also increase throughput. The challenges are to improve
working set estimation and to optimize page reclaim. The risks are
performance degradations and OOM kills. Short of overcoming the
challenges, the only way to reduce the risks is to underutilize
memory.

Apache Spark is one of the most popular open-source big-data
frameworks. Dataset sorting is the most widely used benchmark for
such frameworks.

Matrix
======
Kernels: version [+ patchset]
* Baseline: 5.14
* Patched: 5.14 + MGLRU

Memory conditions: % of memory size
* Underutilizing: ~10% on inactive file list
* Overcommitting: ~10% swapped out

Concurrency conditions: average # of workers per CPU
* Low: 1
* Medium: 2
* High: 3

Cluster mode: local
Dataset size: 3 billion random integers (57GB text)

Total configurations: 12
Data points per configuration: 10
Total run duration (minutes) per data point: ~20

Procedure
=========
The latest MGLRU patchset for the 5.14 kernel is available at
git fetch https://linux-mm.googlesource.com/page-reclaim \
    refs/changes/30/1430/1

Baseline and patched 5.14 kernel images are available at
https://drive.google.com/drive/folders/1eMkQleAFGkP2vzM_JyRA21oKE0ESHBqp

<install and configure OS>
spark-shell < gen.scala

<for each kernel>
    grub2-set-default <baseline, patched>
    <for each memory condition>
        <update run_spark.sh>
        <for each concurrency condition>
            <update run_spark.sh>
            <for each data point>
                reboot
                run_spark.sh
                <collect wall time>

Hardware
========
Memory (GB): 64
CPU (total #): 32
NVMe SSD (GB): 1024

OS
==
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)

$ cat /proc/swaps
Filename          Type          Size          Used     Priority
/dev/nvme0n1p3    partition     32970748      0        -2

$ cat /proc/cmdline
<existing parameters> systemd.unified_cgroup_hierarchy=1

$ cat /sys/fs/cgroup/user.slice/memory.min
4294967296

$ cat /proc/sys/vm/overcommit_memory
1

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

Apache Spark
============
$ spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/

Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 11.0.12
Branch HEAD
Compiled by user centos on 2021-05-24T04:27:48Z
Revision de351e30a90dd988b133b3d00fa6218bfcaba8b8
Url https://github.com/apache/spark
Type --help for more information.

$ cat gen.scala
import java.io._
import scala.collection.mutable.ArrayBuffer

object GenData {
    def main(args: Array[String]): Unit = {
        val file = new File("dataset.txt")
        val writer = new BufferedWriter(new FileWriter(file))
        val buf = ArrayBuffer(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)
        for(_ <- 0 until 300000000) {
            for (i <- 0 until 10) {
                buf.update(i, scala.util.Random.nextLong())
            }
            writer.write(s"${buf.mkString(",")}\n")
        }
        writer.close()
    }
}
GenData.main(Array())

$ cat sort.scala
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.SparkSession

object SparkSort {
    def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().getOrCreate()
        val file = sc.textFile("dataset.txt", 32)
        val start = java.time.Instant.now()
        val results = file.flatMap(_.split(",")).map(x => (x, 1)).sortByKey().takeOrdered(10)
        val finish = java.time.Instant.now()
        println(s"wall time: ${ChronoUnit.SECONDS.between(start, finish)}")
        results.foreach(println)
        spark.stop()
    }
}
SparkSort.main(Array())

$ cat run_spark.sh
spark-shell --master local\[<32, 64, 96>\] --driver-memory <52G, 62G> < sort.scala

Results
=======
Comparing the patched with the baseline kernel, Apache Spark took 95%
CIs [9.28, 11.19]% and [12.20, 14.93]% less wall time to sort the
dataset, respectively, under the medium- and high-concurrency
conditions when slightly overcommitting memory. There were no
statistically significant changes in wall time under other conditions.

+--------------------+-----------------------+-----------------------+
| Mean wall time (s) | Underutilizing memory | Overcommitting memory |
| [95% CI]           |                       |                       |
+--------------------+-----------------------+-----------------------+
| Low concurrency    | 1037.1 / 1037.0       | 1038.2 / 1036.6       |
|                    | [-1.41, 1.21]         | [-3.67, 0.47]         |
+--------------------+-----------------------+-----------------------+
| Medium concurrency | 1141.8 / 1142.6       | 1297.9 / 1165.1       |
|                    | [-1.35, 2.95]         | [-145.21, -120.38]    |
+--------------------+-----------------------+-----------------------+
| High concurrency   | 1239.3 / 1236.4       | 1456.8 / 1259.2       |
|                    | [-7.81, 2.01]         | [-217.53, -177.66]    |
+--------------------+-----------------------+-----------------------+
Table 1. Comparison between the baseline and patched kernels

Comparing overcommitting with underutilizing memory, Apache Spark
took 95% CIs [12.58, 14.76]% and [15.95, 19.15]% more wall time to
sort the dataset, respectively, under the low- and medium-concurrency
conditions when using the baseline kernel; 95% CIs [1.78, 2.16]% and
[1.42, 2.27]% more wall time, respectively, under the medium- and
high-concurrency conditions when using the patched kernel. There were
no statistically significant changes in wall time under other
conditions.

+--------------------+------------------------+----------------------+
| Mean wall time (s) | Baseline kernel        | Patched kernel       |
| [95% CI]           |                        |                      |
+--------------------+------------------------+----------------------+
| Low concurrency    | 1037.1 / 1038.2        | 1037.0 / 1036.6      |
|                    | [-0.31, 2.51]          | [-2.43, 1.63]        |
+--------------------+------------------------+----------------------+
| Medium concurrency | 1141.8 / 1297.9        | 1142.6 / 1165.1      |
|                    | [143.68, 168.51]       | [20.33, 24.66]       |
+--------------------+------------------------+----------------------+
| High concurrency   | 1239.3 / 1456.8        | 1236.4 / 1259.2      |
|                    | [197.62, 237.37]       | [17.55, 28.04]       |
+--------------------+------------------------+----------------------+
Table 2. Comparison between underutilizing and overcommitting memory

Metrics collected during each run are available at
https://github.com/ediworks/KernelPerf/tree/master/mglru/spark/5.14

Appendix
========
$ cat raw_data_spark.r
v <- c(
    # baseline 52g 32t
    1034, 1036, 1036, 1037, 1037, 1037, 1038, 1038, 1038, 1040,
    # baseline 52g 64t
    1139, 1139, 1140, 1140, 1142, 1143, 1143, 1144, 1144, 1144,
    # baseline 52g 96t
    1236, 1237, 1238, 1238, 1238, 1239, 1240, 1241, 1243, 1243,
    # baseline 62g 32t
    1036, 1036, 1038, 1038, 1038, 1038, 1039, 1039, 1040, 1040,
    # baseline 62g 64t
    1266, 1277, 1284, 1296, 1299, 1302, 1311, 1313, 1314, 1317,
    # baseline 62g 96t
    1403, 1431, 1440, 1447, 1460, 1461, 1467, 1475, 1487, 1497,
    # patched 52g 32t
    1035, 1036, 1036, 1037, 1037, 1037, 1037, 1038, 1038, 1039,
    # patched 52g 64t
    1138, 1140, 1140, 1143, 1143, 1143, 1144, 1145, 1145, 1145,
    # patched 52g 96t
    1228, 1228, 1233, 1234, 1235, 1236, 1236, 1240, 1246, 1248,
    # patched 62g 32t
    1032, 1035, 1035, 1035, 1036, 1036, 1037, 1039, 1040, 1041,
    # patched 62g 64t
    1162, 1164, 1164, 1164, 1164, 1164, 1166, 1166, 1168, 1169,
    # patched 62g 96t
    1252, 1256, 1256, 1258, 1260, 1260, 1260, 1260, 1265, 1265
)

a <- array(v, dim = c(10, 3, 2, 2))

# baseline vs patched
for (mem in 1:2) {
    for (con in 1:3) {
        r <- t.test(a[, con, mem, 1], a[, con, mem, 2])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("mem%d con%d: no significance", mem, con)
        } else {
            s <- sprintf("mem%d con%d: [%.2f, %.2f]%%", mem, con, -p[2], -p[1])
        }
        print(s)
    }
}

# 52g vs 62g
for (ker in 1:2) {
    for (con in 1:3) {
        r <- t.test(a[, con, 1, ker], a[, con, 2, ker])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("ker%d con%d: no significance", ker, con)
        } else {
            s <- sprintf("ker%d con%d: [%.2f, %.2f]%%", ker, con, -p[2], -p[1])
        }
        print(s)
    }
}

$ R -q -s -f raw_data_spark.r

        Welch Two Sample t-test

data:  a[, con, mem, 1] and a[, con, mem, 2]
t = 0.16059, df = 16.4, p-value = 0.8744
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.21749  1.41749
sample estimates:
mean of x mean of y
   1037.1    1037.0

[1] "mem1 con1: no significance"

        Welch Two Sample t-test

data:  a[, con, mem, 1] and a[, con, mem, 2]
t = -0.78279, df = 17.565, p-value = 0.4442
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.950923  1.350923
sample estimates:
mean of x mean of y
   1141.8    1142.6

[1] "mem1 con2: no significance"

        Welch Two Sample t-test

data:  a[, con, mem, 1] and a[, con, mem, 2]
t = 1.2933, df = 11.303, p-value = 0.2217
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.019103  7.819103
sample estimates:
mean of x mean of y
   1239.3    1236.4

[1] "mem1 con3: no significance"

        Welch Two Sample t-test

data:  a[, con, mem, 1] and a[, con, mem, 2]
t = 1.6562, df = 13.458, p-value = 0.1208
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.4799188  3.6799188
sample estimates:
mean of x mean of y
   1038.2    1036.6

[1] "mem2 con1: no significance"

        Welch Two Sample t-test

data:  a[, con, mem, 1] and a[, con, mem, 2]
t = 24.096, df = 9.2733, p-value = 1.115e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 120.3881 145.2119
sample estimates:
mean of x mean of y
   1297.9    1165.1

[1] "mem2 con2: [-11.19, -9.28]%"

        Welch Two Sample t-test

data:  a[, con, mem, 1] and a[, con, mem, 2]
t = 22.289, df = 9.3728, p-value = 1.944e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 177.6666 217.5334
sample estimates:
mean of x mean of y
   1456.8    1259.2

[1] "mem2 con3: [-14.93, -12.20]%"

        Welch Two Sample t-test

data:  a[, con, 1, ker] and a[, con, 2, ker]
t = -1.6398, df = 17.697, p-value = 0.1187
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.5110734  0.3110734
sample estimates:
mean of x mean of y
   1037.1    1038.2

[1] "ker1 con1: no significance"

        Welch Two Sample t-test

data:  a[, con, 1, ker] and a[, con, 2, ker]
t = -28.33, df = 9.2646, p-value = 2.57e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -168.5106 -143.6894
sample estimates:
mean of x mean of y
   1141.8    1297.9

[1] "ker1 con2: [12.58, 14.76]%"

        Welch Two Sample t-test

data:  a[, con, 1, ker] and a[, con, 2, ker]
t = -24.694, df = 9.1353, p-value = 1.12e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -237.3794 -197.6206
sample estimates:
mean of x mean of y
   1239.3    1456.8

[1] "ker1 con3: [15.95, 19.15]%"

        Welch Two Sample t-test

data:  a[, con, 1, ker] and a[, con, 2, ker]
t = 0.42857, df = 12.15, p-value = 0.6757
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.630775  2.430775
sample estimates:
mean of x mean of y
   1037.0    1036.6

[1] "ker2 con1: no significance"

        Welch Two Sample t-test

data:  a[, con, 1, ker] and a[, con, 2, ker]
t = -21.865, df = 17.646, p-value = 3.151e-14
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -24.66501 -20.33499
sample estimates:
mean of x mean of y
   1142.6    1165.1

[1] "ker2 con2: [1.78, 2.16]%"

        Welch Two Sample t-test

data:  a[, con, 1, ker] and a[, con, 2, ker]
t = -9.2738, df = 14.72, p-value = 1.561e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -28.04897 -17.55103
sample estimates:
mean of x mean of y
   1236.4    1259.2

[1] "ker2 con3: [1.42, 2.27]%"


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 00/11] Multigenerational LRU Framework
  2021-08-18  6:30 [PATCH v4 00/11] Multigenerational LRU Framework Yu Zhao
                   ` (13 preceding siblings ...)
  2021-11-02  0:20 ` bot
@ 2021-11-09  2:13 ` bot
  14 siblings, 0 replies; 19+ messages in thread
From: bot @ 2021-11-09  2:13 UTC (permalink / raw)
  To: yuzhao
  Cc: hdanton, linux-kernel, linux-mm, page-reclaim, corbet, michael,
	sofia.trinh

Kernel / MongoDB benchmark with MGLRU

TLDR
====
With the MGLRU, MongoDB achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]%
and [2.16, 3.55]% more operations per second (OPS) respectively for
exponential (distribution) access, random access and Zipfian access,
when underutilizing memory; 95% CIs [8.83, 10.03]%, [21.12, 23.14]%
and [5.53, 6.46]% more OPS respectively for exponential access,
random access and Zipfian access, when slightly overcommitting memory.

Background
==========
Memory overcommit can increase utilization and, if carried out
properly, can also increase throughput. The challenges are to improve
working set estimation and to optimize page reclaim. The risks are
performance degradation and OOM kills. Short of overcoming the
challenges, the only way to reduce the risks is to underutilize
memory.

MongoDB is one of the most popular open-source NoSQL databases. YCSB
is the leading open-source NoSQL database benchmarking software that
supports multiple access distributions.

Matrix
======
Kernels: version [+ patchset]
* Baseline: 5.14
* Patched: 5.14 + MGLRU

Memory utilization: % of memory size
* Underutilizing: ~15% on inactive file list
* Overcommitting: ~5% swapped out

Concurrency: average # of users per CPU
* Medium: 2

Access distributions (1kB objects, 20% update):
* Exponential
* Uniform random
* Zipfian

Total configurations: 12
Data points per configuration: 10
Total run duration (minutes) per data point: ~20

Note that MongoDB reached its peak performance at this concurrency
level, i.e., its performance degraded with either fewer or more users
in this benchmark.

Procedure
=========
The latest MGLRU patchset for the 5.14 kernel is available at
git fetch https://linux-mm.googlesource.com/page-reclaim \
    refs/changes/30/1430/1

Baseline and patched 5.14 kernel images are available at
https://drive.google.com/drive/folders/1eMkQleAFGkP2vzM_JyRA21oKE0ESHBqp

<install and configure OS>
ycsb_load.sh
systemctl stop mongod
e2image <backup /mnt/data>

<for each kernel>
    grub2-set-default <baseline, patched>
    <for each memory utilization>
        <update /etc/mongod.conf>
        <for each access distribution>
            <update ycsb_run.sh>
            <for each data point>
                systemctl stop mongod
                e2image <restore /mnt/data>
                reboot
                ycsb_run.sh
                <collect OPS>

Hardware
========
Memory (GB): 64
CPU (total #): 32
NVMe SSD (GB): 1024

OS
==
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)

$ cat /proc/swaps
Filename          Type          Size          Used     Priority
/dev/nvme0n1p3    partition     32970748      0        -2

$ cat /proc/cmdline
<existing parameters> systemd.unified_cgroup_hierarchy=1

$ cat /sys/fs/cgroup/user.slice/memory.min
4294967296

$ cat /proc/sys/vm/overcommit_memory
1

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

MongoDB
=======
$ mongod --version
db version v5.0.3
Build Info: {
    "version": "5.0.3",
    "gitVersion": "657fea5a61a74d7a79df7aff8e4bcf0bc742b748",
    "openSSLVersion": "OpenSSL 1.1.1g FIPS  21 Apr 2020",
    "modules": [],
    "allocator": "tcmalloc",
    "environment": {
        "distmod": "rhel80",
        "distarch": "x86_64",
        "target_arch": "x86_64"
    }
}

$ cat /etc/mongod.conf
# mongod.conf
<existing parameters>

# Where and how to store data.
storage:
  dbPath: /mnt/data
  journal:
    enabled: true
  wiredTiger:
    engineConfig:
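      # 50 for the underutilizing runs, 60 for the overcommitting runs
      # (per the 50g/60g labels in the appendix)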
      cacheSizeGB: <50, 60>

<existing parameters>

YCSB
====
$ git log
commit ce3eb9ce51c84ee9e236998cdd2cefaeb96798a8 (HEAD -> master,
origin/master, origin/HEAD)
Author: Ivan <john.koepi@gmail.com>
Date:   Tue Feb 16 17:38:00 2021 +0200

    [scylla] enable token aware LB by default, improve the docs (#1507)

$ cat ycsb_load.sh
# load objects
ycsb load mongodb -s -threads 16 \
    -p mongodb.url=mongodb://%2Ftmp%2Fmongodb-27017.sock \
    -p workload=site.ycsb.workloads.CoreWorkload \
    -p recordcount=80000000

$ cat ycsb_run.sh
# run benchmark
ycsb run mongodb -s -threads 64 \
    -p mongodb.url=mongodb://%2Ftmp%2Fmongodb-27017.sock \
    -p workload=site.ycsb.workloads.CoreWorkload \
    -p recordcount=80000000 -p operationcount=80000000 \
    -p readproportion=0.8 -p updateproportion=0.2 \
    -p requestdistribution=<exponential, uniform, zipfian>

Results
=======
Comparing the patched with the baseline kernel, MongoDB achieved 95%
CIs [2.23, 3.44]%, [6.97, 9.73]% and [2.16, 3.55]% more OPS,
respectively, for exponential access, random access and Zipfian
access when underutilizing memory; and 95% CIs [8.83, 10.03]%,
[21.12, 23.14]% and [5.53, 6.46]% more OPS, respectively, for the
same three access distributions when slightly overcommitting memory.

+--------------------+-----------------------+-----------------------+
| Mean OPS [95% CI]  | Underutilizing memory | Overcommitting memory |
+--------------------+-----------------------+-----------------------+
| Exponential access | 76615.56 / 78788.76   | 73984.90 / 80961.66   |
|                    | [1708.76, 2637.62]    | [6533.94, 7419.58]    |
+--------------------+-----------------------+-----------------------+
| Random access      | 62093.40 / 67276.01   | 55990.56 / 68379.91   |
|                    | [4324.96, 6040.25]    | [11824.09, 12954.62]  |
+--------------------+-----------------------+-----------------------+
| Zipfian access     | 92532.25 / 95174.43   | 93545.62 / 99151.12   |
|                    | [1997.20, 3287.17]    | [5171.27, 6039.72]    |
+--------------------+-----------------------+-----------------------+
Table 1. Comparison between the baseline and patched kernels
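
For reference, the percentage CIs quoted in the text are the absolute
CI bounds in the table divided by the corresponding baseline mean,
e.g. for exponential access when underutilizing memory:

$ awk 'BEGIN {m = 76615.56; printf "[%.2f, %.2f]%%\n", 1708.76*100/m, 2637.62*100/m}'
[2.23, 3.44]%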

Comparing overcommitting with underutilizing memory, MongoDB saw 95%
CIs [-4.10, -2.77]%, [-11.20, -8.46]% and [0.36, 1.83]% changes in
OPS, respectively, for exponential access, random access and Zipfian
access when using the baseline kernel; and 95% CIs [2.27, 3.25]%,
[0.78, 2.50]% and [3.81, 4.54]% more OPS, respectively, for the same
three access distributions when using the patched kernel.

+--------------------+-----------------------+-----------------------+
| Mean OPS [95% CI]  | Baseline kernel       |  Patched kernel       |
+--------------------+-----------------------+-----------------------+
| Exponential access | 76615.56 / 73984.90   | 78788.76 / 80961.66   |
|                    | [-3139.12, -2122.20]  | [1786.70, 2559.09]    |
+--------------------+-----------------------+-----------------------+
| Random access      | 62093.40 / 55990.56   | 67276.01 / 68379.91   |
|                    | [-6953.44, -5252.23]  | [525.42, 1682.38]     |
+--------------------+-----------------------+-----------------------+
| Zipfian access     | 92532.25 / 93545.62   | 95174.43 / 99151.12   |
|                    | [330.99, 1695.75]     | [3628.31, 4325.06]    |
+--------------------+-----------------------+-----------------------+
Table 2. Comparison between underutilizing and overcommitting memory

Metrics collected during each run are available at
https://github.com/ediworks/KernelPerf/tree/master/mglru/mongodb/5.14

Appendix
========
$ cat raw_data_mongodb.r
v <- c(
    # baseline 50g exp
    75814.86, 75884.91, 76052.71, 76621.01, 76641.19, 76661.24, 76870.15, 77017.79, 77289.08, 77302.67,
    # baseline 50g uni
    60638.17, 60968.91, 61128.61, 61548.40, 61779.30, 61917.58, 62152.28, 63440.15, 63625.47, 63735.11,
    # baseline 50g zip
    91271.16, 91482.41, 91524.17, 92467.16, 92585.62, 92843.29, 92885.65, 93229.98, 93408.94, 93624.08,
    # baseline 60g exp
    73183.67, 73191.30, 73527.58, 73831.79, 74047.95, 74056.24, 74401.23, 74418.53, 74547.58, 74643.08,
    # baseline 60g uni
    55175.76, 55477.42, 55605.52, 55680.21, 55903.39, 56171.05, 56375.06, 56380.43, 56509.94, 56626.78,
    # baseline 60g zip
    92653.82, 92775.02, 93100.44, 93290.21, 93593.74, 93775.64, 93868.72, 93915.12, 94194.77, 94288.69,
    # patched 50g exp
    78349.95, 78385.64, 78392.33, 78419.91, 78726.59, 78738.68, 78930.72, 78948.25, 79404.38, 79591.14,
    # patched 50g uni
    66622.91, 66667.33, 66951.43, 67104.80, 67117.30, 67196.90, 67389.75, 67406.62, 68131.43, 68171.61,
    # patched 50g zip
    94261.14, 94822.34, 94914.70, 95114.89, 95156.75, 95205.90, 95383.78, 95612.00, 95624.00, 95648.81,
    # patched 60g exp
    80272.04, 80612.33, 80679.23, 80717.74, 81011.18, 81029.64, 81146.68, 81371.84, 81379.13, 81396.76,
    # patched 60g uni
    67559.52, 67600.11, 67718.90, 68062.57, 68278.78, 68446.56, 68452.82, 68853.86, 69278.34, 69547.67,
    # patched 60g zip
    98706.81, 98864.41, 98903.77, 99044.10, 99155.68, 99162.94, 99165.64, 99482.31, 99484.91, 99540.62
)

a <- array(v, dim = c(10, 3, 2, 2))
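# index order: data point (10), distribution (3: exp, uni, zip),
# memory config (2: 50g, 60g), kernel (2: baseline, patched)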

# baseline vs patched
for (mem in 1:2) {
    for (dist in 1:3) {
        r <- t.test(a[, dist, mem, 1], a[, dist, mem, 2])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("mem%d dist%d: no significance", mem, dist)
        } else {
            s <- sprintf("mem%d dist%d: [%.2f, %.2f]%%", mem, dist, -p[2], -p[1])
        }
        print(s)
    }
}

# 50g vs 60g
for (kern in 1:2) {
    for (dist in 1:3) {
        r <- t.test(a[, dist, 1, kern], a[, dist, 2, kern])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("kern%d dist%d: no significance", kern, dist)
        } else {
            s <- sprintf("kern%d dist%d: [%.2f, %.2f]%%", kern, dist, -p[2], -p[1])
        }
        print(s)
    }
}

$ R -q -s -f raw_data_mongodb.r

        Welch Two Sample t-test

data:  a[, dist, mem, 1] and a[, dist, mem, 2]
t = -9.8624, df = 17.23, p-value = 1.671e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2637.627 -1708.769
sample estimates:
mean of x mean of y
 76615.56  78788.76

[1] "mem1 dist1: [2.23, 3.44]%"

        Welch Two Sample t-test

data:  a[, dist, mem, 1] and a[, dist, mem, 2]
t = -13.081, df = 12.744, p-value = 9.287e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6040.256 -4324.964
sample estimates:
mean of x mean of y
 62093.40  67276.01

[1] "mem1 dist2: [6.97, 9.73]%"

        Welch Two Sample t-test

data:  a[, dist, mem, 1] and a[, dist, mem, 2]
t = -8.8194, df = 13.459, p-value = 5.833e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3287.17 -1997.20
sample estimates:
mean of x mean of y
 92532.25  95174.43

[1] "mem1 dist3: [2.16, 3.55]%"

        Welch Two Sample t-test

data:  a[, dist, mem, 1] and a[, dist, mem, 2]
t = -33.368, df = 16.192, p-value = 2.329e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7419.582 -6533.942
sample estimates:
mean of x mean of y
 73984.90  80961.66

[1] "mem2 dist1: [8.83, 10.03]%"

        Welch Two Sample t-test

data:  a[, dist, mem, 1] and a[, dist, mem, 2]
t = -46.386, df = 16.338, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -12954.62 -11824.09
sample estimates:
mean of x mean of y
 55990.56  68379.91

[1] "mem2 dist2: [21.12, 23.14]%"

        Welch Two Sample t-test

data:  a[, dist, mem, 1] and a[, dist, mem, 2]
t = -27.844, df = 13.209, p-value = 4.049e-13
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6039.729 -5171.275
sample estimates:
mean of x mean of y
 93545.62  99151.12

[1] "mem2 dist3: [5.53, 6.46]%"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = 10.87, df = 18, p-value = 2.439e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 2122.207 3139.125
sample estimates:
mean of x mean of y
 76615.56  73984.90

[1] "kern1 dist1: [-4.10, -2.77]%"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = 15.593, df = 12.276, p-value = 1.847e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 5252.237 6953.447
sample estimates:
mean of x mean of y
 62093.40  55990.56

[1] "kern1 dist2: [-11.20, -8.46]%"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = -3.1512, df = 15.811, p-value = 0.006252
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1695.7509  -330.9911
sample estimates:
mean of x mean of y
 92532.25  93545.62

[1] "kern1 dist3: [0.36, 1.83]%"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = -11.836, df = 17.672, p-value = 7.84e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2559.092 -1786.704
sample estimates:
mean of x mean of y
 78788.76  80961.66

[1] "kern2 dist1: [2.27, 3.25]%"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = -4.0276, df = 16.921, p-value = 0.0008807
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1682.3864  -525.4236
sample estimates:
mean of x mean of y
 67276.01  68379.91

[1] "kern2 dist2: [0.78, 2.50]%"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = -24.26, df = 15.517, p-value = 9.257e-14
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4325.062 -3628.314
sample estimates:
mean of x mean of y
 95174.43  99151.12

[1] "kern2 dist3: [3.81, 4.54]%"


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-11-09  2:13 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-18  6:30 [PATCH v4 00/11] Multigenerational LRU Framework Yu Zhao
2021-08-18  6:30 ` [PATCH v4 01/11] mm: x86, arm64: add arch_has_hw_pte_young() Yu Zhao
2021-08-19  9:19   ` Will Deacon
2021-08-19 21:23     ` Yu Zhao
2021-10-10  8:59       ` Hillf Danton
2021-08-18  6:30 ` [PATCH v4 02/11] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Yu Zhao
2021-08-18  6:30 ` [PATCH v4 03/11] mm/vmscan.c: refactor shrink_node() Yu Zhao
2021-08-18  6:31 ` [PATCH v4 04/11] mm: multigenerational lru: groundwork Yu Zhao
2021-08-18  6:31 ` [PATCH v4 05/11] mm: multigenerational lru: protection Yu Zhao
2021-08-18  6:31 ` [PATCH v4 06/11] mm: multigenerational lru: mm_struct list Yu Zhao
2021-08-18  6:31 ` [PATCH v4 07/11] mm: multigenerational lru: aging Yu Zhao
2021-08-18  6:31 ` [PATCH v4 08/11] mm: multigenerational lru: eviction Yu Zhao
2021-08-18  6:31 ` [PATCH v4 09/11] mm: multigenerational lru: user interface Yu Zhao
2021-08-18  6:31 ` [PATCH v4 10/11] mm: multigenerational lru: Kconfig Yu Zhao
2021-08-18  6:31 ` [PATCH v4 11/11] mm: multigenerational lru: documentation Yu Zhao
2021-10-09  5:43 ` [PATCH v4 00/11] Multigenerational LRU Framework bot
2021-10-21 19:41 ` bot
2021-11-02  0:20 ` bot
2021-11-09  2:13 ` bot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).