linux-mm.kvack.org archive mirror
* [PATCH v1 00/14] Multigenerational LRU
@ 2021-03-13  7:57 Yu Zhao
  2021-03-13  7:57 ` [PATCH v1 01/14] include/linux/memcontrol.h: do not warn in page_memcg_rcu() if !CONFIG_MEMCG Yu Zhao
                   ` (17 more replies)
  0 siblings, 18 replies; 65+ messages in thread
From: Yu Zhao @ 2021-03-13  7:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim, Yu Zhao

TLDR
====
The current page reclaim is too expensive in terms of CPU usage and
often makes poor choices about what to evict. We would like to offer a
performant, versatile and straightforward augmentation.

Repo
====
git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/01/1101/1

Gerrit https://linux-mm-review.googlesource.com/c/page-reclaim/+/1101

Background
==========
DRAM is a major factor in total cost of ownership, and improving
memory overcommit brings a high return on investment. Over the past
decade of research and experimentation in memory overcommit, we
observed a distinct trend across millions of servers and clients: the
size of the page cache has been decreasing because of the growing
popularity of cloud storage. Nowadays anon pages account for more than
90% of our memory consumption and the page cache contains mostly
executable pages.

Problems
========
Notion of the active/inactive
-----------------------------
For servers equipped with hundreds of gigabytes of memory, the
granularity of the active/inactive lists is too coarse to be useful
for job scheduling, and the false active/inactive rates are relatively
high. In addition, scans of largely varying numbers of pages are
unpredictable because inactive_is_low() is based on magic numbers.

For phones and laptops, the eviction is biased toward file pages
because the selection has to resort to heuristics as direct
comparisons between anon and file types are infeasible. On Android and
Chrome OS, executable pages are frequently evicted despite the fact
that there are many less recently used anon pages. This causes "janks"
(slow UI rendering) and negatively impacts user experience.

For systems with multiple nodes and/or memcgs, it is impossible to
compare lruvecs based on the notion of the active/inactive.

Incremental scans via the rmap
------------------------------
Each incremental scan picks up where the last scan left off and stops
after it has found a handful of unreferenced pages. For most systems
running cloud workloads, incremental scans lose this advantage under
sustained memory pressure due to the high ratio of the number of
scanned pages to the number of reclaimed pages. In our case, the
average ratio of pgscan to pgsteal is about 7.

On top of that, the rmap has poor memory locality due to its complex
data structures. The combined effects typically result in high CPU
usage in the reclaim path. For example, with zram, a
typical kswapd profile on v5.11 looks like:
  31.03%  page_vma_mapped_walk
  25.59%  lzo1x_1_do_compress
   4.63%  do_raw_spin_lock
   3.89%  vma_interval_tree_iter_next
   3.33%  vma_interval_tree_subtree_search

And with real swap, it looks like:
  45.16%  page_vma_mapped_walk
   7.61%  do_raw_spin_lock
   5.69%  vma_interval_tree_iter_next
   4.91%  vma_interval_tree_subtree_search
   3.71%  page_referenced_one

Solutions
=========
Notion of generation numbers
----------------------------
The notion of generation numbers introduces a quantitative approach
to memory overcommit. A larger number of pages can be spread out
across configurable generations, and thus the false active/inactive
rates are relatively low. Each generation includes all pages that have
been referenced since the last generation.

Given an lruvec, both scans and the selection between the anon and
file types are based on generation numbers, which are simple yet
effective. For different lruvecs, comparisons are still possible based
on the birth times of their generations.

Differential scans via page tables
----------------------------------
Each differential scan discovers all pages that have been referenced
since the last scan. Specifically, it walks the mm_struct list
associated with an lruvec to scan page tables of processes that have
been scheduled since the last scan. The cost of each differential scan
is roughly proportional to the number of referenced pages it
discovers. Unless address spaces are extremely sparse, page tables
usually have better memory locality than the rmap. The end result is
generally a significant reduction in CPU usage for most systems
running cloud workloads.

On Chrome OS, our real-world benchmark that browses popular websites
in multiple tabs demonstrates 51% less CPU usage from kswapd and 52%
less PSI (full) on v5.11. And the kswapd profile looks like:
  49.36%  lzo1x_1_do_compress
   4.54%  page_vma_mapped_walk
   4.45%  memset_erms
   3.47%  walk_pte_range
   2.88%  zram_bvec_rw

In addition, direct reclaim latency is reduced by 22% at the 99th
percentile and the number of refaults is reduced by 7%. These metrics
are important for phones and laptops as they are correlated with user
experience.

Workflow
========
Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lruvec->evictable.max_seq
for both anon and file types as they are aged on an equal footing. The
oldest generation numbers are stored in lruvec->evictable.min_seq[2]
separately for anon and file types as clean file pages can be evicted
regardless of may_swap or may_writepage. Generation numbers are
truncated into ilog2(MAX_NR_GENS)+1 bits in order to fit into
page->flags. The sliding window technique is used to prevent truncated
generation numbers from overlapping. Each truncated generation number
is an index to
lruvec->evictable.lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES].
Evictable pages are added to the per-zone lists indexed by max_seq or
min_seq[2] (modulo MAX_NR_GENS), depending on whether they are being
faulted in or read ahead. The workflow comprises two conceptually
independent functions: the aging and the eviction.
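
As a simplified sketch of the data structure described above (the
struct and helper names below are made up for illustration; the exact
layout, including the sliding window bookkeeping, lives in the core
patch):

  /* conceptually embedded in struct lruvec as the "evictable" member */
  struct evictable_lists {
          /* the youngest generation number, shared by the anon and file types */
          unsigned long max_seq;
          /* the oldest generation numbers, one for anon and one for file */
          unsigned long min_seq[ANON_AND_FILE];
          /* per-generation, per-type, per-zone lru lists */
          struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
  };

  /* a truncated generation number doubles as an index into lists[] */
  static int gen_from_seq(unsigned long seq)
  {
          return seq % MAX_NR_GENS;
  }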

Aging
-----
The aging produces young generations. Given an lruvec, the aging scans
page tables for referenced pages of this lruvec. Upon finding one, the
aging updates its generation number to max_seq. After each round of
scanning, the aging increments max_seq. The aging maintains either a
system-wide mm_struct list or per-memcg mm_struct lists and tracks
whether an mm_struct is being used on any CPUs or has been used since
the last scan. Multiple threads can concurrently work on the same
mm_struct list, and each of them will be given a different mm_struct
belonging to a process that has been scheduled since the last scan.
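
As a rough sketch of one aging pass (age_lruvec(), walk_page_tables()
and for_each_scheduled_mm() are illustrative names, not the functions
added by this series):

  static void age_lruvec(struct lruvec *lruvec)
  {
          unsigned long max_seq = READ_ONCE(lruvec->evictable.max_seq);
          struct mm_struct *mm;

          /* visit only the mm_structs scheduled since the last scan */
          for_each_scheduled_mm(lruvec, mm)
                  /* referenced pages found in page tables are tagged with max_seq */
                  walk_page_tables(lruvec, mm, max_seq);

          /* this round of scanning is done; open a new youngest generation */
          WRITE_ONCE(lruvec->evictable.max_seq, max_seq + 1);
  }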

Eviction
--------
The eviction consumes old generations. Given an lruvec, the eviction
scans the pages on the per-zone lists indexed by either of min_seq[2].
It selects a type according to the values of min_seq[2] and
swappiness. During a scan, the eviction either sorts or isolates a
page, depending on whether the aging has updated its generation
number. When it finds that all the per-zone lists of the oldest
generation of the selected type are empty, the eviction increments the
element of min_seq[2] indexed by that type. The eviction triggers the
aging when both elements of min_seq[2] reach max_seq-1, assuming both
the anon and file types are reclaimable.
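
As a rough sketch of one eviction pass (get_type_to_scan(),
page_gen(), sort_page(), isolate_page() and generation_is_empty() are
placeholders, not functions from this series):

  static void evict_lruvec(struct lruvec *lruvec, int swappiness)
  {
          /* select the anon or file type based on min_seq[2] and swappiness */
          int file = get_type_to_scan(lruvec, swappiness);
          unsigned long seq = READ_ONCE(lruvec->evictable.min_seq[file]);
          int gen = seq % MAX_NR_GENS;
          int zone;
          struct page *page, *next;

          for (zone = 0; zone < MAX_NR_ZONES; zone++) {
                  struct list_head *head = &lruvec->evictable.lists[gen][file][zone];

                  list_for_each_entry_safe(page, next, head, lru) {
                          if (page_gen(page) != gen)
                                  sort_page(lruvec, page);    /* aged: move to its new generation */
                          else
                                  isolate_page(lruvec, page); /* candidate for reclaim */
                  }
          }

          /* all the per-zone lists of this generation are empty: retire it */
          if (generation_is_empty(lruvec, gen, file))
                  WRITE_ONCE(lruvec->evictable.min_seq[file], seq + 1);
  }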

Use cases
=========
On Android, our most advanced simulation that generates memory
pressure from realistic user behavior shows 18% fewer low-memory
kills, which in turn reduces cold starts by 16%.

On Borg, a similar approach enables us to identify jobs that
underutilize their memory and downsize them considerably without
compromising any of our service level indicators.

On Chrome OS, our field telemetry reports 96% fewer low-memory tab
discards and 59% fewer OOM kills from fully-utilized devices and no UX
regressions from underutilized devices.

For other use cases, including working set estimation, proactive
reclaim, far memory tiering and NUMA-aware job scheduling, please
refer to the documentation included in this series and the following
references.

References
==========
1. Long-term SLOs for reclaimed cloud computing resources
   https://research.google/pubs/pub43017/
2. Profiling a warehouse-scale computer
   https://research.google/pubs/pub44271/
3. Evaluation of NUMA-Aware Scheduling in Warehouse-Scale Clusters
   https://research.google/pubs/pub48329/
4. Software-defined far memory in warehouse-scale computers
   https://research.google/pubs/pub48551/
5. Borg: the Next Generation
   https://research.google/pubs/pub49065/

Yu Zhao (14):
  include/linux/memcontrol.h: do not warn in page_memcg_rcu() if
    !CONFIG_MEMCG
  include/linux/nodemask.h: define next_memory_node() if !CONFIG_NUMA
  include/linux/huge_mm.h: define is_huge_zero_pmd() if
    !CONFIG_TRANSPARENT_HUGEPAGE
  include/linux/cgroup.h: export cgroup_mutex
  mm/swap.c: export activate_page()
  mm, x86: support the access bit on non-leaf PMD entries
  mm/pagewalk.c: add pud_entry_post() for post-order traversals
  mm/vmscan.c: refactor shrink_node()
  mm: multigenerational lru: mm_struct list
  mm: multigenerational lru: core
  mm: multigenerational lru: page activation
  mm: multigenerational lru: user space interface
  mm: multigenerational lru: Kconfig
  mm: multigenerational lru: documentation

 Documentation/vm/index.rst        |    1 +
 Documentation/vm/multigen_lru.rst |  210 +++
 arch/Kconfig                      |    8 +
 arch/x86/Kconfig                  |    1 +
 arch/x86/include/asm/pgtable.h    |    2 +-
 arch/x86/mm/pgtable.c             |    5 +-
 fs/exec.c                         |    2 +
 fs/proc/task_mmu.c                |    3 +-
 include/linux/cgroup.h            |   15 +-
 include/linux/huge_mm.h           |    5 +
 include/linux/memcontrol.h        |    5 +-
 include/linux/mm.h                |    1 +
 include/linux/mm_inline.h         |  246 ++++
 include/linux/mm_types.h          |  135 ++
 include/linux/mmzone.h            |   62 +-
 include/linux/nodemask.h          |    1 +
 include/linux/page-flags-layout.h |   20 +-
 include/linux/pagewalk.h          |    4 +
 include/linux/pgtable.h           |    4 +-
 include/linux/swap.h              |    5 +-
 kernel/events/uprobes.c           |    2 +-
 kernel/exit.c                     |    1 +
 kernel/fork.c                     |   10 +
 kernel/kthread.c                  |    1 +
 kernel/sched/core.c               |    2 +
 mm/Kconfig                        |   29 +
 mm/huge_memory.c                  |    5 +-
 mm/khugepaged.c                   |    2 +-
 mm/memcontrol.c                   |   28 +
 mm/memory.c                       |   14 +-
 mm/migrate.c                      |    2 +-
 mm/mm_init.c                      |   13 +-
 mm/mmzone.c                       |    2 +
 mm/pagewalk.c                     |    5 +
 mm/rmap.c                         |    6 +
 mm/swap.c                         |   58 +-
 mm/swapfile.c                     |    6 +-
 mm/userfaultfd.c                  |    2 +-
 mm/vmscan.c                       | 2091 +++++++++++++++++++++++++++--
 39 files changed, 2870 insertions(+), 144 deletions(-)
 create mode 100644 Documentation/vm/multigen_lru.rst

-- 
2.31.0.rc2.261.g7f71774620-goog




* [PATCH v1 01/14] include/linux/memcontrol.h: do not warn in page_memcg_rcu() if !CONFIG_MEMCG
  2021-03-13  7:57 [PATCH v1 00/14] Multigenerational LRU Yu Zhao
@ 2021-03-13  7:57 ` Yu Zhao
  2021-03-13 15:09   ` Matthew Wilcox
  2021-03-13  7:57 ` [PATCH v1 02/14] include/linux/nodemask.h: define next_memory_node() if !CONFIG_NUMA Yu Zhao
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 65+ messages in thread
From: Yu Zhao @ 2021-03-13  7:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim, Yu Zhao

We want to make sure the RCU lock is held while using
page_memcg_rcu(). But having a WARN_ON_ONCE() in page_memcg_rcu() when
!CONFIG_MEMCG is superfluous because of the following legitimate use
case:

  memcg = lock_page_memcg(page1)
    (rcu_read_lock() if CONFIG_MEMCG=y)

  do something to page1

  if (page_memcg_rcu(page2) == memcg)
    do something to page2 too as it cannot be migrated away from the
    memcg either.

  unlock_page_memcg(page1)
    (rcu_read_unlock() if CONFIG_MEMCG=y)

This patch removes the WARN_ON_ONCE() from page_memcg_rcu() for the
!CONFIG_MEMCG case.

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 include/linux/memcontrol.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e6dc793d587d..f325aeb4b4e8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1079,7 +1079,6 @@ static inline struct mem_cgroup *page_memcg(struct page *page)
 
 static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
 {
-	WARN_ON_ONCE(!rcu_read_lock_held());
 	return NULL;
 }
 
-- 
2.31.0.rc2.261.g7f71774620-goog




* [PATCH v1 02/14] include/linux/nodemask.h: define next_memory_node() if !CONFIG_NUMA
  2021-03-13  7:57 [PATCH v1 00/14] Multigenerational LRU Yu Zhao
  2021-03-13  7:57 ` [PATCH v1 01/14] include/linux/memcontrol.h: do not warn in page_memcg_rcu() if !CONFIG_MEMCG Yu Zhao
@ 2021-03-13  7:57 ` Yu Zhao
  2021-03-13  7:57 ` [PATCH v1 03/14] include/linux/huge_mm.h: define is_huge_zero_pmd() if !CONFIG_TRANSPARENT_HUGEPAGE Yu Zhao
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 65+ messages in thread
From: Yu Zhao @ 2021-03-13  7:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim, Yu Zhao

Currently next_memory_node() only exists when CONFIG_NUMA=y. This
patch defines the macro for the !CONFIG_NUMA case.
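
As a minimal, hypothetical illustration (not code from this series),
the kind of node walk that needs the macro to build with
CONFIG_NUMA=n:

  int nid = first_memory_node;

  do {
          /* per-node work here */
          nid = next_memory_node(nid);
  } while (nid < MAX_NUMNODES);

With CONFIG_NUMA=n, next_memory_node() evaluates to MAX_NUMNODES, so
the loop body runs exactly once for node 0.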

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 include/linux/nodemask.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index ac398e143c9a..89fe4e3592f9 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -486,6 +486,7 @@ static inline int num_node_state(enum node_states state)
 #define first_online_node	0
 #define first_memory_node	0
 #define next_online_node(nid)	(MAX_NUMNODES)
+#define next_memory_node(nid)	(MAX_NUMNODES)
 #define nr_node_ids		1U
 #define nr_online_nodes		1U
 
-- 
2.31.0.rc2.261.g7f71774620-goog




* [PATCH v1 03/14] include/linux/huge_mm.h: define is_huge_zero_pmd() if !CONFIG_TRANSPARENT_HUGEPAGE
  2021-03-13  7:57 [PATCH v1 00/14] Multigenerational LRU Yu Zhao
  2021-03-13  7:57 ` [PATCH v1 01/14] include/linux/memcontrol.h: do not warn in page_memcg_rcu() if !CONFIG_MEMCG Yu Zhao
  2021-03-13  7:57 ` [PATCH v1 02/14] include/linux/nodemask.h: define next_memory_node() if !CONFIG_NUMA Yu Zhao
@ 2021-03-13  7:57 ` Yu Zhao
  2021-03-13  7:57 ` [PATCH v1 04/14] include/linux/cgroup.h: export cgroup_mutex Yu Zhao
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 65+ messages in thread
From: Yu Zhao @ 2021-03-13  7:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim, Yu Zhao

Currently is_huge_zero_pmd() only exists when
CONFIG_TRANSPARENT_HUGEPAGE=y. This patch defines the function for the
!CONFIG_TRANSPARENT_HUGEPAGE case.

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 include/linux/huge_mm.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ba973efcd369..0ba7b3f9029c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -443,6 +443,11 @@ static inline bool is_huge_zero_page(struct page *page)
 	return false;
 }
 
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+	return false;
+}
+
 static inline bool is_huge_zero_pud(pud_t pud)
 {
 	return false;
-- 
2.31.0.rc2.261.g7f71774620-goog




* [PATCH v1 04/14] include/linux/cgroup.h: export cgroup_mutex
  2021-03-13  7:57 [PATCH v1 00/14] Multigenerational LRU Yu Zhao
                   ` (2 preceding siblings ...)
  2021-03-13  7:57 ` [PATCH v1 03/14] include/linux/huge_mm.h: define is_huge_zero_pmd() if !CONFIG_TRANSPARENT_HUGEPAGE Yu Zhao
@ 2021-03-13  7:57 ` Yu Zhao
  2021-03-13  7:57 ` [PATCH v1 05/14] mm/swap.c: export activate_page() Yu Zhao
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 65+ messages in thread
From: Yu Zhao @ 2021-03-13  7:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim, Yu Zhao

Export cgroup_mutex so it can be used to synchronize with memcg
allocations.
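
A hypothetical usage sketch, not taken verbatim from a later patch in
this series:

  struct mem_cgroup *memcg;

  /*
   * New memcgs cannot be created while cgroup_mutex is held, so
   * per-memcg state can be set up or torn down without racing against
   * memcg allocations.
   */
  cgroup_lock();
  for (memcg = mem_cgroup_iter(NULL, NULL, NULL); memcg;
       memcg = mem_cgroup_iter(NULL, memcg, NULL)) {
          /* allocate or free the per-memcg state here */
  }
  cgroup_unlock();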

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 include/linux/cgroup.h | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4f2f79de083e..bd5744360cfa 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -432,6 +432,18 @@ static inline void cgroup_put(struct cgroup *cgrp)
 	css_put(&cgrp->self);
 }
 
+extern struct mutex cgroup_mutex;
+
+static inline void cgroup_lock(void)
+{
+	mutex_lock(&cgroup_mutex);
+}
+
+static inline void cgroup_unlock(void)
+{
+	mutex_unlock(&cgroup_mutex);
+}
+
 /**
  * task_css_set_check - obtain a task's css_set with extra access conditions
  * @task: the task to obtain css_set for
@@ -446,7 +458,6 @@ static inline void cgroup_put(struct cgroup *cgrp)
  * as locks used during the cgroup_subsys::attach() methods.
  */
 #ifdef CONFIG_PROVE_RCU
-extern struct mutex cgroup_mutex;
 extern spinlock_t css_set_lock;
 #define task_css_set_check(task, __c)					\
 	rcu_dereference_check((task)->cgroups,				\
@@ -704,6 +715,8 @@ struct cgroup;
 static inline u64 cgroup_id(const struct cgroup *cgrp) { return 1; }
 static inline void css_get(struct cgroup_subsys_state *css) {}
 static inline void css_put(struct cgroup_subsys_state *css) {}
+static inline void cgroup_lock(void) {}
+static inline void cgroup_unlock(void) {}
 static inline int cgroup_attach_task_all(struct task_struct *from,
 					 struct task_struct *t) { return 0; }
 static inline int cgroupstats_build(struct cgroupstats *stats,
-- 
2.31.0.rc2.261.g7f71774620-goog




* [PATCH v1 05/14] mm/swap.c: export activate_page()
  2021-03-13  7:57 [PATCH v1 00/14] Multigenerational LRU Yu Zhao
                   ` (3 preceding siblings ...)
  2021-03-13  7:57 ` [PATCH v1 04/14] include/linux/cgroup.h: export cgroup_mutex Yu Zhao
@ 2021-03-13  7:57 ` Yu Zhao
  2021-03-13  7:57 ` [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries Yu Zhao
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 65+ messages in thread
From: Yu Zhao @ 2021-03-13  7:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim, Yu Zhao

Export activate_page(), which is a merge of the existing
activate_page() and __lru_cache_activate_page(), so it can be used to
activate pages that are already on the LRU or queued in
lru_pvecs.lru_add.

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 include/linux/swap.h |  1 +
 mm/swap.c            | 28 +++++++++++++++-------------
 2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4cc6ec3bf0ab..de2bbbf181ba 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -344,6 +344,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_cpu_zone(struct zone *zone);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
+extern void activate_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
 extern void deactivate_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index 31b844d4ed94..f20ed56ebbbf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -334,7 +334,7 @@ static bool need_activate_page_drain(int cpu)
 	return pagevec_count(&per_cpu(lru_pvecs.activate_page, cpu)) != 0;
 }
 
-static void activate_page(struct page *page)
+static void activate_page_on_lru(struct page *page)
 {
 	page = compound_head(page);
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
@@ -354,7 +354,7 @@ static inline void activate_page_drain(int cpu)
 {
 }
 
-static void activate_page(struct page *page)
+static void activate_page_on_lru(struct page *page)
 {
 	struct lruvec *lruvec;
 
@@ -368,11 +368,22 @@ static void activate_page(struct page *page)
 }
 #endif
 
-static void __lru_cache_activate_page(struct page *page)
+/*
+ * If the page is on the LRU, queue it for activation via
+ * lru_pvecs.activate_page. Otherwise, assume the page is on a
+ * pagevec, mark it active and it'll be moved to the active
+ * LRU on the next drain.
+ */
+void activate_page(struct page *page)
 {
 	struct pagevec *pvec;
 	int i;
 
+	if (PageLRU(page)) {
+		activate_page_on_lru(page);
+		return;
+	}
+
 	local_lock(&lru_pvecs.lock);
 	pvec = this_cpu_ptr(&lru_pvecs.lru_add);
 
@@ -421,16 +432,7 @@ void mark_page_accessed(struct page *page)
 		 * evictable page accessed has no effect.
 		 */
 	} else if (!PageActive(page)) {
-		/*
-		 * If the page is on the LRU, queue it for activation via
-		 * lru_pvecs.activate_page. Otherwise, assume the page is on a
-		 * pagevec, mark it active and it'll be moved to the active
-		 * LRU on the next drain.
-		 */
-		if (PageLRU(page))
-			activate_page(page);
-		else
-			__lru_cache_activate_page(page);
+		activate_page(page);
 		ClearPageReferenced(page);
 		workingset_activation(page);
 	}
-- 
2.31.0.rc2.261.g7f71774620-goog




* [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries
  2021-03-13  7:57 [PATCH v1 00/14] Multigenerational LRU Yu Zhao
                   ` (4 preceding siblings ...)
  2021-03-13  7:57 ` [PATCH v1 05/14] mm/swap.c: export activate_page() Yu Zhao
@ 2021-03-13  7:57 ` Yu Zhao
  2021-03-14 22:12   ` Zi Yan
  2021-03-14 23:22   ` Dave Hansen
  2021-03-13  7:57 ` [PATCH v1 07/14] mm/pagewalk.c: add pud_entry_post() for post-order traversals Yu Zhao
                   ` (11 subsequent siblings)
  17 siblings, 2 replies; 65+ messages in thread
From: Yu Zhao @ 2021-03-13  7:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim, Yu Zhao

Some architectures support the accessed bit on non-leaf PMD entries
(parents) in addition to leaf PTE entries (children) where pages are
mapped, e.g., x86_64 sets the accessed bit on a parent when using it
as part of linear-address translation [1]. Page table walkers who are
interested in the accessed bit on children can take advantage of this:
they do not need to search the children when the accessed bit is not
set on a parent, given that they have previously cleared the accessed
bit on this parent in addition to its children.

[1]: Intel 64 and IA-32 Architectures Software Developer's Manual
     Volume 3 (October 2019), section 4.8
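
As a hedged sketch of the pattern this enables on x86_64
(scan_pmd_entry() and scan_and_clear_pte_range() are placeholder
names; the actual walker arrives later in this series):

  static void scan_pmd_entry(struct vm_area_struct *vma, pmd_t *pmd,
                             unsigned long addr, unsigned long end)
  {
          /*
           * The accessed bit was cleared on this non-leaf PMD entry and on
           * its PTEs during the previous scan. If it is still clear, none
           * of the PTEs underneath can have been referenced since then.
           */
          if (!pmd_young(*pmd))
                  return;

          /* otherwise search the children and clear their accessed bits */
          scan_and_clear_pte_range(vma, pmd, addr, end);

          /* clear the parent too so the shortcut works on the next scan */
          pmdp_test_and_clear_young(vma, addr, pmd);
  }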

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 arch/Kconfig                   | 8 ++++++++
 arch/x86/Kconfig               | 1 +
 arch/x86/include/asm/pgtable.h | 2 +-
 arch/x86/mm/pgtable.c          | 5 ++++-
 include/linux/pgtable.h        | 4 ++--
 5 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 2bb30673d8e6..137446d17732 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -783,6 +783,14 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE
 config HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 	bool
 
+config HAVE_ARCH_PARENT_PMD_YOUNG
+	bool
+	help
+	  Architectures that select this are able to set the accessed bit on
+	  non-leaf PMD entries in addition to leaf PTE entries where pages are
+	  mapped. For them, page table walkers that clear the accessed bit may
+	  stop at non-leaf PMD entries when they do not see the accessed bit.
+
 config HAVE_ARCH_HUGE_VMAP
 	bool
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879d398e..b5972eb82337 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -163,6 +163,7 @@ config X86
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64
+	select HAVE_ARCH_PARENT_PMD_YOUNG	if X86_64
 	select HAVE_ARCH_USERFAULTFD_WP         if X86_64 && USERFAULTFD
 	select HAVE_ARCH_VMAP_STACK		if X86_64
 	select HAVE_ARCH_WITHIN_STACK_FRAMES
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..a6b5cfe1fc5a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -846,7 +846,7 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
 
 static inline int pmd_bad(pmd_t pmd)
 {
-	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
+	return ((pmd_flags(pmd) | _PAGE_ACCESSED) & ~_PAGE_USER) != _KERNPG_TABLE;
 }
 
 static inline unsigned long pages_to_mb(unsigned long npg)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index f6a9e2e36642..1c27e6f43f80 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -550,7 +550,7 @@ int ptep_test_and_clear_young(struct vm_area_struct *vma,
 	return ret;
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG)
 int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 			      unsigned long addr, pmd_t *pmdp)
 {
@@ -562,6 +562,9 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 
 	return ret;
 }
+#endif
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int pudp_test_and_clear_young(struct vm_area_struct *vma,
 			      unsigned long addr, pud_t *pudp)
 {
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5e772392a379..08dd9b8c055a 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -193,7 +193,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 #endif
 
 #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG)
 static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 					    unsigned long address,
 					    pmd_t *pmdp)
@@ -214,7 +214,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 	BUILD_BUG();
 	return 0;
 }
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG */
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
-- 
2.31.0.rc2.261.g7f71774620-goog




* [PATCH v1 07/14] mm/pagewalk.c: add pud_entry_post() for post-order traversals
  2021-03-13  7:57 [PATCH v1 00/14] Multigenerational LRU Yu Zhao
                   ` (5 preceding siblings ...)
  2021-03-13  7:57 ` [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries Yu Zhao
@ 2021-03-13  7:57 ` Yu Zhao
  2021-03-13  7:57 ` [PATCH v1 08/14] mm/vmscan.c: refactor shrink_node() Yu Zhao
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 65+ messages in thread
From: Yu Zhao @ 2021-03-13  7:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim, Yu Zhao

Add a new callback pud_entry_post() to struct mm_walk_ops so that page
table walkers can visit the non-leaf PMD entries of a PUD entry after
they have visited its leaf PTE entries. This allows page table walkers
that clear the accessed bit to take advantage of the previous patch,
in a similar way to how walk_pte_range() works for the PTE entries of
a PMD entry: they only need to take the PTL once to search all the
child entries of a parent entry.
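
A hedged sketch of a walker wiring up the new callback; the two
callbacks below are stubs standing in for the real ones added later in
this series:

  static int young_pmd_entry(pmd_t *pmd, unsigned long addr,
                             unsigned long next, struct mm_walk *walk)
  {
          /* pre-order: scan the PTEs of this PMD entry for the accessed bit */
          return 0;
  }

  static int young_pud_entry_post(pud_t *pud, unsigned long addr,
                                  unsigned long next, struct mm_walk *walk)
  {
          /* post-order: clear the accessed bit on the non-leaf PMD entries
             of this PUD entry, taking the PTL only once */
          return 0;
  }

  static const struct mm_walk_ops young_walk_ops = {
          .pmd_entry      = young_pmd_entry,
          .pud_entry_post = young_pud_entry_post,
  };

  static void scan_one_mm(struct mm_struct *mm, void *priv)
  {
          mmap_read_lock(mm);
          walk_page_range(mm, 0, TASK_SIZE, &young_walk_ops, priv);
          mmap_read_unlock(mm);
  }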

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 include/linux/pagewalk.h | 4 ++++
 mm/pagewalk.c            | 5 +++++
 2 files changed, 9 insertions(+)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index b1cb6b753abb..2b68ae9d27d3 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -11,6 +11,8 @@ struct mm_walk;
  * @pgd_entry:		if set, called for each non-empty PGD (top-level) entry
  * @p4d_entry:		if set, called for each non-empty P4D entry
  * @pud_entry:		if set, called for each non-empty PUD entry
+ * @pud_entry_post:	if set, called for each non-empty PUD entry after
+ *			pmd_entry is called, for post-order traversal.
  * @pmd_entry:		if set, called for each non-empty PMD entry
  *			this handler is required to be able to handle
  *			pmd_trans_huge() pmds.  They may simply choose to
@@ -41,6 +43,8 @@ struct mm_walk_ops {
 			 unsigned long next, struct mm_walk *walk);
 	int (*pud_entry)(pud_t *pud, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
+	int (*pud_entry_post)(pud_t *pud, unsigned long addr,
+			      unsigned long next, struct mm_walk *walk);
 	int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
 	int (*pte_entry)(pte_t *pte, unsigned long addr,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..8ed1533f7eda 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -160,6 +160,11 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 		err = walk_pmd_range(pud, addr, next, walk);
 		if (err)
 			break;
+
+		if (ops->pud_entry_post)
+			err = ops->pud_entry_post(pud, addr, next, walk);
+		if (err)
+			break;
 	} while (pud++, addr = next, addr != end);
 
 	return err;
-- 
2.31.0.rc2.261.g7f71774620-goog




* [PATCH v1 08/14] mm/vmscan.c: refactor shrink_node()
  2021-03-13  7:57 [PATCH v1 00/14] Multigenerational LRU Yu Zhao
                   ` (6 preceding siblings ...)
  2021-03-13  7:57 ` [PATCH v1 07/14] mm/pagewalk.c: add pud_entry_post() for post-order traversals Yu Zhao
@ 2021-03-13  7:57 ` Yu Zhao
  2021-03-13  7:57 ` [PATCH v1 09/14] mm: multigenerational lru: mm_struct list Yu Zhao
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 65+ messages in thread
From: Yu Zhao @ 2021-03-13  7:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim, Yu Zhao

Heuristics in shrink_node() are rather independent and can be
refactored into a separate function to improve readability.

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 mm/vmscan.c | 186 +++++++++++++++++++++++++++-------------------------
 1 file changed, 98 insertions(+), 88 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 562e87cbd7a1..1a24d2e0a4cb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2224,6 +2224,103 @@ enum scan_balance {
 	SCAN_FILE,
 };
 
+static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
+{
+	unsigned long file;
+	struct lruvec *target_lruvec;
+
+	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+
+	/*
+	 * Determine the scan balance between anon and file LRUs.
+	 */
+	spin_lock_irq(&target_lruvec->lru_lock);
+	sc->anon_cost = target_lruvec->anon_cost;
+	sc->file_cost = target_lruvec->file_cost;
+	spin_unlock_irq(&target_lruvec->lru_lock);
+
+	/*
+	 * Target desirable inactive:active list ratios for the anon
+	 * and file LRU lists.
+	 */
+	if (!sc->force_deactivate) {
+		unsigned long refaults;
+
+		refaults = lruvec_page_state(target_lruvec,
+				WORKINGSET_ACTIVATE_ANON);
+		if (refaults != target_lruvec->refaults[0] ||
+			inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
+			sc->may_deactivate |= DEACTIVATE_ANON;
+		else
+			sc->may_deactivate &= ~DEACTIVATE_ANON;
+
+		/*
+		 * When refaults are being observed, it means a new
+		 * workingset is being established. Deactivate to get
+		 * rid of any stale active pages quickly.
+		 */
+		refaults = lruvec_page_state(target_lruvec,
+				WORKINGSET_ACTIVATE_FILE);
+		if (refaults != target_lruvec->refaults[1] ||
+		    inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
+			sc->may_deactivate |= DEACTIVATE_FILE;
+		else
+			sc->may_deactivate &= ~DEACTIVATE_FILE;
+	} else
+		sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
+
+	/*
+	 * If we have plenty of inactive file pages that aren't
+	 * thrashing, try to reclaim those first before touching
+	 * anonymous pages.
+	 */
+	file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
+	if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
+		sc->cache_trim_mode = 1;
+	else
+		sc->cache_trim_mode = 0;
+
+	/*
+	 * Prevent the reclaimer from falling into the cache trap: as
+	 * cache pages start out inactive, every cache fault will tip
+	 * the scan balance towards the file LRU.  And as the file LRU
+	 * shrinks, so does the window for rotation from references.
+	 * This means we have a runaway feedback loop where a tiny
+	 * thrashing file LRU becomes infinitely more attractive than
+	 * anon pages.  Try to detect this based on file LRU size.
+	 */
+	if (!cgroup_reclaim(sc)) {
+		unsigned long total_high_wmark = 0;
+		unsigned long free, anon;
+		int z;
+
+		free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
+		file = node_page_state(pgdat, NR_ACTIVE_FILE) +
+			   node_page_state(pgdat, NR_INACTIVE_FILE);
+
+		for (z = 0; z < MAX_NR_ZONES; z++) {
+			struct zone *zone = &pgdat->node_zones[z];
+
+			if (!managed_zone(zone))
+				continue;
+
+			total_high_wmark += high_wmark_pages(zone);
+		}
+
+		/*
+		 * Consider anon: if that's low too, this isn't a
+		 * runaway file reclaim problem, but rather just
+		 * extreme pressure. Reclaim as per usual then.
+		 */
+		anon = node_page_state(pgdat, NR_INACTIVE_ANON);
+
+		sc->file_is_tiny =
+			file + free <= total_high_wmark &&
+			!(sc->may_deactivate & DEACTIVATE_ANON) &&
+			anon >> sc->priority;
+	}
+}
+
 /*
  * Determine how aggressively the anon and file LRU lists should be
  * scanned.  The relative value of each set of LRU lists is determined
@@ -2669,7 +2766,6 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	unsigned long nr_reclaimed, nr_scanned;
 	struct lruvec *target_lruvec;
 	bool reclaimable = false;
-	unsigned long file;
 
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
 
@@ -2679,93 +2775,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	nr_reclaimed = sc->nr_reclaimed;
 	nr_scanned = sc->nr_scanned;
 
-	/*
-	 * Determine the scan balance between anon and file LRUs.
-	 */
-	spin_lock_irq(&target_lruvec->lru_lock);
-	sc->anon_cost = target_lruvec->anon_cost;
-	sc->file_cost = target_lruvec->file_cost;
-	spin_unlock_irq(&target_lruvec->lru_lock);
-
-	/*
-	 * Target desirable inactive:active list ratios for the anon
-	 * and file LRU lists.
-	 */
-	if (!sc->force_deactivate) {
-		unsigned long refaults;
-
-		refaults = lruvec_page_state(target_lruvec,
-				WORKINGSET_ACTIVATE_ANON);
-		if (refaults != target_lruvec->refaults[0] ||
-			inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
-			sc->may_deactivate |= DEACTIVATE_ANON;
-		else
-			sc->may_deactivate &= ~DEACTIVATE_ANON;
-
-		/*
-		 * When refaults are being observed, it means a new
-		 * workingset is being established. Deactivate to get
-		 * rid of any stale active pages quickly.
-		 */
-		refaults = lruvec_page_state(target_lruvec,
-				WORKINGSET_ACTIVATE_FILE);
-		if (refaults != target_lruvec->refaults[1] ||
-		    inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
-			sc->may_deactivate |= DEACTIVATE_FILE;
-		else
-			sc->may_deactivate &= ~DEACTIVATE_FILE;
-	} else
-		sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
-
-	/*
-	 * If we have plenty of inactive file pages that aren't
-	 * thrashing, try to reclaim those first before touching
-	 * anonymous pages.
-	 */
-	file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
-	if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
-		sc->cache_trim_mode = 1;
-	else
-		sc->cache_trim_mode = 0;
-
-	/*
-	 * Prevent the reclaimer from falling into the cache trap: as
-	 * cache pages start out inactive, every cache fault will tip
-	 * the scan balance towards the file LRU.  And as the file LRU
-	 * shrinks, so does the window for rotation from references.
-	 * This means we have a runaway feedback loop where a tiny
-	 * thrashing file LRU becomes infinitely more attractive than
-	 * anon pages.  Try to detect this based on file LRU size.
-	 */
-	if (!cgroup_reclaim(sc)) {
-		unsigned long total_high_wmark = 0;
-		unsigned long free, anon;
-		int z;
-
-		free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
-		file = node_page_state(pgdat, NR_ACTIVE_FILE) +
-			   node_page_state(pgdat, NR_INACTIVE_FILE);
-
-		for (z = 0; z < MAX_NR_ZONES; z++) {
-			struct zone *zone = &pgdat->node_zones[z];
-			if (!managed_zone(zone))
-				continue;
-
-			total_high_wmark += high_wmark_pages(zone);
-		}
-
-		/*
-		 * Consider anon: if that's low too, this isn't a
-		 * runaway file reclaim problem, but rather just
-		 * extreme pressure. Reclaim as per usual then.
-		 */
-		anon = node_page_state(pgdat, NR_INACTIVE_ANON);
-
-		sc->file_is_tiny =
-			file + free <= total_high_wmark &&
-			!(sc->may_deactivate & DEACTIVATE_ANON) &&
-			anon >> sc->priority;
-	}
+	prepare_scan_count(pgdat, sc);
 
 	shrink_node_memcgs(pgdat, sc);
 
-- 
2.31.0.rc2.261.g7f71774620-goog




* [PATCH v1 09/14] mm: multigenerational lru: mm_struct list
  2021-03-13  7:57 [PATCH v1 00/14] Multigenerational LRU Yu Zhao
                   ` (7 preceding siblings ...)
  2021-03-13  7:57 ` [PATCH v1 08/14] mm/vmscan.c: refactor shrink_node() Yu Zhao
@ 2021-03-13  7:57 ` Yu Zhao
  2021-03-15 19:40   ` Rik van Riel
  2021-03-13  7:57 ` [PATCH v1 10/14] mm: multigenerational lru: core Yu Zhao
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 65+ messages in thread
From: Yu Zhao @ 2021-03-13  7:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim, Yu Zhao

Add infrastructure that maintains either a system-wide mm_struct
list or per-memcg mm_struct lists. Multiple threads can concurrently
work on the same mm_struct list, and each of them will be given a
different mm_struct. Those who finish early can optionally wait on the
rest after the iterator has reached the end of the list.

This infrastructure also tracks whether an mm_struct is being used on
any CPUs or has been used since the last time a worker looked at it.
In other words, workers will not be given an mm_struct that belongs to
a process that has been sleeping.
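
A sketch of how a worker might consume the list; walk_one_mm() is a
placeholder for the page table walk a later patch adds:

  /* placeholder for the page table walk added later in the series */
  static void walk_one_mm(struct lruvec *lruvec, struct mm_struct *mm)
  {
  }

  /* returns true if this worker was the last one to finish the round */
  static bool walk_mm_list_sketch(struct lruvec *lruvec, unsigned long next_seq,
                                  int swappiness)
  {
          bool last;
          struct mm_struct *mm = NULL;

          do {
                  /* each call hands back a different mm and puts the previous one */
                  last = get_next_mm(lruvec, next_seq, swappiness, &mm);
                  if (mm)
                          walk_one_mm(lruvec, mm);
                  cond_resched();
          } while (mm);

          /*
           * The last worker advances the generation; those that finished
           * early may wait on mm_list->nodes[nid].wait until it does.
           */
          return last;
  }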

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 fs/exec.c                  |   2 +
 include/linux/memcontrol.h |   4 +
 include/linux/mm_types.h   | 135 +++++++++++++++++++
 include/linux/mmzone.h     |   2 -
 kernel/exit.c              |   1 +
 kernel/fork.c              |  10 ++
 kernel/kthread.c           |   1 +
 kernel/sched/core.c        |   2 +
 mm/memcontrol.c            |  28 ++++
 mm/vmscan.c                | 263 +++++++++++++++++++++++++++++++++++++
 10 files changed, 446 insertions(+), 2 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..c691d4d7720c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1008,6 +1008,7 @@ static int exec_mmap(struct mm_struct *mm)
 	active_mm = tsk->active_mm;
 	tsk->active_mm = mm;
 	tsk->mm = mm;
+	lru_gen_add_mm(mm);
 	/*
 	 * This prevents preemption while active_mm is being loaded and
 	 * it and mm are being updated, which could cause problems for
@@ -1018,6 +1019,7 @@ static int exec_mmap(struct mm_struct *mm)
 	if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
 		local_irq_enable();
 	activate_mm(active_mm, mm);
+	lru_gen_switch_mm(active_mm, mm);
 	if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
 		local_irq_enable();
 	tsk->mm->vmacache_seqnum = 0;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f325aeb4b4e8..591557c5b7e2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -335,6 +335,10 @@ struct mem_cgroup {
 	struct deferred_split deferred_split_queue;
 #endif
 
+#ifdef CONFIG_LRU_GEN
+	struct lru_gen_mm_list *mm_list;
+#endif
+
 	struct mem_cgroup_per_node *nodeinfo[0];
 	/* WARNING: nodeinfo must be the last member here */
 };
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0974ad501a47..b8a038a016f2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -15,6 +15,8 @@
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
+#include <linux/nodemask.h>
+#include <linux/mmdebug.h>
 
 #include <asm/mmu.h>
 
@@ -382,6 +384,8 @@ struct core_state {
 	struct completion startup;
 };
 
+#define ANON_AND_FILE 2
+
 struct kioctx_table;
 struct mm_struct {
 	struct {
@@ -560,6 +564,22 @@ struct mm_struct {
 
 #ifdef CONFIG_IOMMU_SUPPORT
 		u32 pasid;
+#endif
+#ifdef CONFIG_LRU_GEN
+		struct {
+			/* node of a global or per-memcg mm list */
+			struct list_head list;
+#ifdef CONFIG_MEMCG
+			/* points to memcg of the owner task above */
+			struct mem_cgroup *memcg;
+#endif
+			/* indicates this mm has been used since last walk */
+			nodemask_t nodes[ANON_AND_FILE];
+#ifndef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+			/* number of cpus that are using this mm */
+			atomic_t nr_cpus;
+#endif
+		} lru_gen;
 #endif
 	} __randomize_layout;
 
@@ -587,6 +607,121 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 	return (struct cpumask *)&mm->cpu_bitmap;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+struct lru_gen_mm_list {
+	/* head of a global or per-memcg mm list */
+	struct list_head head;
+	/* protects the list */
+	spinlock_t lock;
+	struct {
+		/* set to max_seq after each round of walk */
+		unsigned long cur_seq;
+		/* next mm on the list to walk */
+		struct list_head *iter;
+		/* to wait for last worker to finish */
+		struct wait_queue_head wait;
+		/* number of concurrent workers */
+		int nr_workers;
+	} nodes[0];
+};
+
+void lru_gen_init_mm(struct mm_struct *mm);
+void lru_gen_add_mm(struct mm_struct *mm);
+void lru_gen_del_mm(struct mm_struct *mm);
+#ifdef CONFIG_MEMCG
+int lru_gen_alloc_mm_list(struct mem_cgroup *memcg);
+void lru_gen_free_mm_list(struct mem_cgroup *memcg);
+void lru_gen_migrate_mm(struct mm_struct *mm);
+#endif
+
+/*
+ * Track usage so mms that haven't been used since last walk can be skipped.
+ *
+ * This function introduces a theoretical overhead for each mm switch, but it
+ * hasn't been measurable.
+ */
+static inline void lru_gen_switch_mm(struct mm_struct *old, struct mm_struct *new)
+{
+	int file;
+
+	/* exclude init_mm, efi_mm, etc. */
+	if (!core_kernel_data((unsigned long)old)) {
+		VM_BUG_ON(old == &init_mm);
+
+		for (file = 0; file < ANON_AND_FILE; file++)
+			nodes_setall(old->lru_gen.nodes[file]);
+
+#ifndef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+		atomic_dec(&old->lru_gen.nr_cpus);
+		VM_BUG_ON_MM(atomic_read(&old->lru_gen.nr_cpus) < 0, old);
+#endif
+	} else
+		VM_BUG_ON_MM(READ_ONCE(old->lru_gen.list.prev) ||
+			     READ_ONCE(old->lru_gen.list.next), old);
+
+	if (!core_kernel_data((unsigned long)new)) {
+		VM_BUG_ON(new == &init_mm);
+
+#ifndef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+		atomic_inc(&new->lru_gen.nr_cpus);
+		VM_BUG_ON_MM(atomic_read(&new->lru_gen.nr_cpus) < 0, new);
+#endif
+	} else
+		VM_BUG_ON_MM(READ_ONCE(new->lru_gen.list.prev) ||
+			     READ_ONCE(new->lru_gen.list.next), new);
+}
+
+/* Returns whether the mm is being used on any cpus. */
+static inline bool lru_gen_mm_is_active(struct mm_struct *mm)
+{
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+	return !cpumask_empty(mm_cpumask(mm));
+#else
+	return atomic_read(&mm->lru_gen.nr_cpus);
+#endif
+}
+
+#else /* CONFIG_LRU_GEN */
+
+static inline void lru_gen_init_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_add_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_del_mm(struct mm_struct *mm)
+{
+}
+
+#ifdef CONFIG_MEMCG
+static inline int lru_gen_alloc_mm_list(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
+static inline void lru_gen_free_mm_list(struct mem_cgroup *memcg)
+{
+}
+
+static inline void lru_gen_migrate_mm(struct mm_struct *mm)
+{
+}
+#endif
+
+static inline void lru_gen_switch_mm(struct mm_struct *old, struct mm_struct *new)
+{
+}
+
+static inline bool lru_gen_mm_is_active(struct mm_struct *mm)
+{
+	return false;
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 47946cec7584..a99a1050565a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -285,8 +285,6 @@ static inline bool is_active_lru(enum lru_list lru)
 	return (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE);
 }
 
-#define ANON_AND_FILE 2
-
 enum lruvec_flags {
 	LRUVEC_CONGESTED,		/* lruvec has many dirty pages
 					 * backed by a congested BDI
diff --git a/kernel/exit.c b/kernel/exit.c
index 04029e35e69a..e4292717ce37 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -422,6 +422,7 @@ void mm_update_next_owner(struct mm_struct *mm)
 		goto retry;
 	}
 	WRITE_ONCE(mm->owner, c);
+	lru_gen_migrate_mm(mm);
 	task_unlock(c);
 	put_task_struct(c);
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index d3171e8e88e5..e261b797955d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -665,6 +665,7 @@ static void check_mm(struct mm_struct *mm)
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	VM_BUG_ON_MM(mm->pmd_huge_pte, mm);
 #endif
+	VM_BUG_ON_MM(lru_gen_mm_is_active(mm), mm);
 }
 
 #define allocate_mm()	(kmem_cache_alloc(mm_cachep, GFP_KERNEL))
@@ -1047,6 +1048,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 		goto fail_nocontext;
 
 	mm->user_ns = get_user_ns(user_ns);
+	lru_gen_init_mm(mm);
 	return mm;
 
 fail_nocontext:
@@ -1089,6 +1091,7 @@ static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+	lru_gen_del_mm(mm);
 	mmdrop(mm);
 }
 
@@ -2513,6 +2516,13 @@ pid_t kernel_clone(struct kernel_clone_args *args)
 		get_task_struct(p);
 	}
 
+	if (IS_ENABLED(CONFIG_LRU_GEN) && !(clone_flags & CLONE_VM)) {
+		/* lock p to synchronize with memcg migration */
+		task_lock(p);
+		lru_gen_add_mm(p->mm);
+		task_unlock(p);
+	}
+
 	wake_up_new_task(p);
 
 	/* forking complete and child started to run, tell ptracer */
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 1578973c5740..8da7767bb06a 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1303,6 +1303,7 @@ void kthread_use_mm(struct mm_struct *mm)
 	tsk->mm = mm;
 	membarrier_update_current_mm(mm);
 	switch_mm_irqs_off(active_mm, mm, tsk);
+	lru_gen_switch_mm(active_mm, mm);
 	local_irq_enable();
 	task_unlock(tsk);
 #ifdef finish_arch_post_lock_switch
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca2bb629595f..56274a14ce09 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4308,6 +4308,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 		 * finish_task_switch()'s mmdrop().
 		 */
 		switch_mm_irqs_off(prev->active_mm, next->mm, next);
+		lru_gen_switch_mm(prev->active_mm, next->mm);
 
 		if (!prev->mm) {                        // from kernel
 			/* will mmdrop() in finish_task_switch(). */
@@ -7599,6 +7600,7 @@ void idle_task_exit(void)
 
 	if (mm != &init_mm) {
 		switch_mm(mm, &init_mm, current);
+		lru_gen_switch_mm(mm, &init_mm);
 		finish_arch_post_lock_switch();
 	}
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 845eec01ef9d..5836780fe138 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5209,6 +5209,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 		free_mem_cgroup_per_node_info(memcg, node);
 	free_percpu(memcg->vmstats_percpu);
 	free_percpu(memcg->vmstats_local);
+	lru_gen_free_mm_list(memcg);
 	kfree(memcg);
 }
 
@@ -5261,6 +5262,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 		if (alloc_mem_cgroup_per_node_info(memcg, node))
 			goto fail;
 
+	if (lru_gen_alloc_mm_list(memcg))
+		goto fail;
+
 	if (memcg_wb_domain_init(memcg, GFP_KERNEL))
 		goto fail;
 
@@ -6165,6 +6169,29 @@ static void mem_cgroup_move_task(void)
 }
 #endif
 
+#ifdef CONFIG_LRU_GEN
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+	struct cgroup_subsys_state *css;
+	struct task_struct *task = NULL;
+
+	cgroup_taskset_for_each_leader(task, css, tset)
+		;
+
+	if (!task)
+		return;
+
+	task_lock(task);
+	if (task->mm && task->mm->owner == task)
+		lru_gen_migrate_mm(task->mm);
+	task_unlock(task);
+}
+#else
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+}
+#endif
+
 static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value)
 {
 	if (value == PAGE_COUNTER_MAX)
@@ -6505,6 +6532,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.css_free = mem_cgroup_css_free,
 	.css_reset = mem_cgroup_css_reset,
 	.can_attach = mem_cgroup_can_attach,
+	.attach = mem_cgroup_attach,
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.post_attach = mem_cgroup_move_task,
 	.dfl_cftypes = memory_files,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1a24d2e0a4cb..f7657ab0d4b7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4314,3 +4314,266 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 	}
 }
 EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
+
+#ifdef CONFIG_LRU_GEN
+
+/******************************************************************************
+ *                           global and per-memcg mm list
+ ******************************************************************************/
+
+/*
+ * After pages are faulted in, they become the youngest generation. They must
+ * go through the aging process twice before they can be evicted. After the
+ * first scan, the accessed bit set during the initial faults is cleared and
+ * they become the second youngest generation. The second scan makes sure they
+ * haven't been used since the first.
+ */
+#define MIN_NR_GENS 2
+
+static struct lru_gen_mm_list *global_mm_list;
+
+static struct lru_gen_mm_list *alloc_mm_list(void)
+{
+	int nid;
+	struct lru_gen_mm_list *mm_list;
+
+	mm_list = kzalloc(struct_size(mm_list, nodes, nr_node_ids), GFP_KERNEL);
+	if (!mm_list)
+		return NULL;
+
+	INIT_LIST_HEAD(&mm_list->head);
+	spin_lock_init(&mm_list->lock);
+
+	for_each_node(nid) {
+		mm_list->nodes[nid].cur_seq = MIN_NR_GENS - 1;
+		mm_list->nodes[nid].iter = &mm_list->head;
+		init_waitqueue_head(&mm_list->nodes[nid].wait);
+	}
+
+	return mm_list;
+}
+
+static struct lru_gen_mm_list *get_mm_list(struct mem_cgroup *memcg)
+{
+#ifdef CONFIG_MEMCG
+	if (!mem_cgroup_disabled())
+		return memcg ? memcg->mm_list : root_mem_cgroup->mm_list;
+#endif
+	VM_BUG_ON(memcg);
+
+	return global_mm_list;
+}
+
+void lru_gen_init_mm(struct mm_struct *mm)
+{
+	int file;
+
+	INIT_LIST_HEAD(&mm->lru_gen.list);
+#ifdef CONFIG_MEMCG
+	mm->lru_gen.memcg = NULL;
+#endif
+#ifndef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+	atomic_set(&mm->lru_gen.nr_cpus, 0);
+#endif
+	for (file = 0; file < ANON_AND_FILE; file++)
+		nodes_clear(mm->lru_gen.nodes[file]);
+}
+
+void lru_gen_add_mm(struct mm_struct *mm)
+{
+	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+
+	VM_BUG_ON_MM(!list_empty(&mm->lru_gen.list), mm);
+#ifdef CONFIG_MEMCG
+	VM_BUG_ON_MM(mm->lru_gen.memcg, mm);
+	WRITE_ONCE(mm->lru_gen.memcg, memcg);
+#endif
+	spin_lock(&mm_list->lock);
+	list_add_tail(&mm->lru_gen.list, &mm_list->head);
+	spin_unlock(&mm_list->lock);
+}
+
+void lru_gen_del_mm(struct mm_struct *mm)
+{
+	int nid;
+#ifdef CONFIG_MEMCG
+	struct lru_gen_mm_list *mm_list = get_mm_list(mm->lru_gen.memcg);
+#else
+	struct lru_gen_mm_list *mm_list = get_mm_list(NULL);
+#endif
+
+	spin_lock(&mm_list->lock);
+
+	for_each_node(nid) {
+		if (mm_list->nodes[nid].iter != &mm->lru_gen.list)
+			continue;
+
+		mm_list->nodes[nid].iter = mm_list->nodes[nid].iter->next;
+		if (mm_list->nodes[nid].iter == &mm_list->head)
+			WRITE_ONCE(mm_list->nodes[nid].cur_seq,
+				   mm_list->nodes[nid].cur_seq + 1);
+	}
+
+	list_del_init(&mm->lru_gen.list);
+
+	spin_unlock(&mm_list->lock);
+
+#ifdef CONFIG_MEMCG
+	mem_cgroup_put(mm->lru_gen.memcg);
+	WRITE_ONCE(mm->lru_gen.memcg, NULL);
+#endif
+}
+
+#ifdef CONFIG_MEMCG
+int lru_gen_alloc_mm_list(struct mem_cgroup *memcg)
+{
+	if (mem_cgroup_disabled())
+		return 0;
+
+	memcg->mm_list = alloc_mm_list();
+
+	return memcg->mm_list ? 0 : -ENOMEM;
+}
+
+void lru_gen_free_mm_list(struct mem_cgroup *memcg)
+{
+	kfree(memcg->mm_list);
+	memcg->mm_list = NULL;
+}
+
+void lru_gen_migrate_mm(struct mm_struct *mm)
+{
+	struct mem_cgroup *memcg;
+
+	lockdep_assert_held(&mm->owner->alloc_lock);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(mm->owner);
+	rcu_read_unlock();
+	if (memcg == mm->lru_gen.memcg)
+		return;
+
+	VM_BUG_ON_MM(!mm->lru_gen.memcg, mm);
+	VM_BUG_ON_MM(list_empty(&mm->lru_gen.list), mm);
+
+	lru_gen_del_mm(mm);
+	lru_gen_add_mm(mm);
+}
+
+static bool mm_has_migrated(struct mm_struct *mm, struct mem_cgroup *memcg)
+{
+	return READ_ONCE(mm->lru_gen.memcg) != memcg;
+}
+#else
+static bool mm_has_migrated(struct mm_struct *mm, struct mem_cgroup *memcg)
+{
+	return false;
+}
+#endif
+
+static bool should_skip_mm(struct mm_struct *mm, int nid, int swappiness)
+{
+	int file;
+	unsigned long size = 0;
+
+	if (mm_is_oom_victim(mm))
+		return true;
+
+	for (file = !swappiness; file < ANON_AND_FILE; file++) {
+		if (lru_gen_mm_is_active(mm) || node_isset(nid, mm->lru_gen.nodes[file]))
+			size += file ? get_mm_counter(mm, MM_FILEPAGES) :
+				       get_mm_counter(mm, MM_ANONPAGES) +
+				       get_mm_counter(mm, MM_SHMEMPAGES);
+	}
+
+	if (size < SWAP_CLUSTER_MAX)
+		return true;
+
+	return !mmget_not_zero(mm);
+}
+
+/* To support multiple workers that concurrently walk mm list. */
+static bool get_next_mm(struct lruvec *lruvec, unsigned long next_seq,
+			int swappiness, struct mm_struct **iter)
+{
+	bool last = true;
+	struct mm_struct *mm = NULL;
+	int nid = lruvec_pgdat(lruvec)->node_id;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+
+	if (*iter)
+		mmput_async(*iter);
+	else if (next_seq <= READ_ONCE(mm_list->nodes[nid].cur_seq))
+		return false;
+
+	spin_lock(&mm_list->lock);
+
+	VM_BUG_ON(next_seq > mm_list->nodes[nid].cur_seq + 1);
+	VM_BUG_ON(*iter && next_seq < mm_list->nodes[nid].cur_seq);
+	VM_BUG_ON(*iter && !mm_list->nodes[nid].nr_workers);
+
+	if (next_seq <= mm_list->nodes[nid].cur_seq) {
+		last = *iter;
+		goto done;
+	}
+
+	if (mm_list->nodes[nid].iter == &mm_list->head) {
+		VM_BUG_ON(*iter || mm_list->nodes[nid].nr_workers);
+		mm_list->nodes[nid].iter = mm_list->nodes[nid].iter->next;
+	}
+
+	while (!mm && mm_list->nodes[nid].iter != &mm_list->head) {
+		mm = list_entry(mm_list->nodes[nid].iter, struct mm_struct, lru_gen.list);
+		mm_list->nodes[nid].iter = mm_list->nodes[nid].iter->next;
+		if (should_skip_mm(mm, nid, swappiness))
+			mm = NULL;
+	}
+
+	if (mm_list->nodes[nid].iter == &mm_list->head)
+		WRITE_ONCE(mm_list->nodes[nid].cur_seq,
+			   mm_list->nodes[nid].cur_seq + 1);
+done:
+	if (*iter && !mm)
+		mm_list->nodes[nid].nr_workers--;
+	if (!*iter && mm)
+		mm_list->nodes[nid].nr_workers++;
+
+	last = last && !mm_list->nodes[nid].nr_workers &&
+	       mm_list->nodes[nid].iter == &mm_list->head;
+
+	spin_unlock(&mm_list->lock);
+
+	*iter = mm;
+
+	return last;
+}
+
+/******************************************************************************
+ *                          initialization
+ ******************************************************************************/
+
+static int __init init_lru_gen(void)
+{
+	if (mem_cgroup_disabled()) {
+		global_mm_list = alloc_mm_list();
+		if (!global_mm_list) {
+			pr_err("lru_gen: failed to allocate global mm list\n");
+			return -ENOMEM;
+		}
+	}
+
+	return 0;
+};
+/*
+ * We want to run as early as possible because some debug code, e.g.,
+ * dma_resv_lockdep(), calls mm_alloc() and mmput(). We only depend on mm_kobj,
+ * which is initialized one stage earlier by postcore_initcall().
+ */
+arch_initcall(init_lru_gen);
+
+#endif /* CONFIG_LRU_GEN */
-- 
2.31.0.rc2.261.g7f71774620-goog



^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v1 10/14] mm: multigenerational lru: core
  2021-03-13  7:57 [PATCH v1 00/14] Multigenerational LRU Yu Zhao
                   ` (8 preceding siblings ...)
  2021-03-13  7:57 ` [PATCH v1 09/14] mm: multigenerational lru: mm_struct list Yu Zhao
@ 2021-03-13  7:57 ` Yu Zhao
  2021-03-15  2:02   ` Andi Kleen
  2021-03-13  7:57 ` [PATCH v1 11/14] mm: multigenerational lru: page activation Yu Zhao
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 65+ messages in thread
From: Yu Zhao @ 2021-03-13  7:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim, Yu Zhao

Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in max_seq for both anon and
file types as they are aged on an equal footing. The oldest generation
numbers are stored in min_seq[2] separately for anon and file types as
clean file pages can be evicted regardless of may_swap or
may_writepage. Generation numbers are truncated into
ilog2(MAX_NR_GENS)+1 bits in order to fit into page->flags. The
sliding window technique is used to prevent truncated generation
numbers from overlapping. Each truncated generation number is an index
into lruvec->evictable.lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES].
Evictable pages are added to the per-zone lists indexed by max_seq or
min_seq[2] (modulo MAX_NR_GENS), depending on whether they are being
faulted in or read ahead.
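
As an aside (not part of the patch), the mapping from raw sequence
numbers to flag bits can be illustrated with a small userspace sketch.
The names and values below (NR_GENS, GEN_PGOFF, GEN_WIDTH) are invented
for the example; they only mimic what MAX_NR_GENS, LRU_GEN_PGOFF and
LRU_GEN_WIDTH do to page->flags:

  #include <assert.h>
  #include <stdio.h>

  /* Toy parameters; the real ones come from CONFIG_NR_LRU_GENS and
   * page-flags-layout.h. Everything below is illustrative only. */
  #define NR_GENS   4UL   /* stands in for MAX_NR_GENS */
  #define GEN_PGOFF 8     /* bit offset of the field inside flags */
  #define GEN_WIDTH 3     /* enough bits for NR_GENS plus "not on lru" */
  #define GEN_MASK  (((1UL << GEN_WIDTH) - 1) << GEN_PGOFF)

  /* Truncate a monotonic sequence number to a cyclic list index. */
  static unsigned long gen_from_seq(unsigned long seq)
  {
          return seq % NR_GENS;
  }

  /* Store gen+1 in the flag bits; 0 means the page is not on the lru. */
  static unsigned long set_gen(unsigned long flags, unsigned long gen)
  {
          return (flags & ~GEN_MASK) | ((gen + 1) << GEN_PGOFF);
  }

  static long get_gen(unsigned long flags)
  {
          return (long)((flags & GEN_MASK) >> GEN_PGOFF) - 1;
  }

  int main(void)
  {
          unsigned long flags = 0;
          unsigned long max_seq = 7;      /* youngest generation */

          assert(get_gen(flags) == -1);   /* not on any list yet */
          flags = set_gen(flags, gen_from_seq(max_seq));
          printf("seq %lu -> list index %ld\n", max_seq, get_gen(flags));
          return 0;
  }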

The workflow comprises two conceptually independent functions: the
aging and the eviction. The aging produces young generations. Given an
lruvec, the aging walks the mm_struct list associated with this
lruvec, i.e., memcg->mm_list or global_mm_list, to scan page tables
for referenced pages. Upon finding one, the aging updates its
generation number to max_seq. After each round of scanning, the aging
increments max_seq. Since scans are differential with respect to
referenced pages, the cost is roughly proportional to their number.

The eviction consumes old generations. Given an lruvec, the eviction
scans the pages on the per-zone lists indexed by either of min_seq[2].
It selects a type based on the values of min_seq[2] and swappiness.
During a scan, the eviction either sorts or isolates a page, depending
on whether the aging has updated its generation number or not. When it
finds that all the per-zone lists of the selected type are empty, the
eviction increments the entry of min_seq[2] indexed by that type. The
eviction triggers the aging when both entries of min_seq[2] reach
max_seq-1, assuming both anon and file types are reclaimable.
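
To make the interplay concrete, here is a toy userspace model (not part
of the patch) that only tracks the max_seq/min_seq counters and
per-generation page counts for a single type; MIN_NR_GENS_DEMO and the
numbers are invented for the illustration:

  #include <stdio.h>

  #define NR_GENS          4   /* stands in for MAX_NR_GENS */
  #define MIN_NR_GENS_DEMO 2   /* invented floor, see MIN_NR_GENS */

  static unsigned long max_seq = NR_GENS - 1;
  static unsigned long min_seq;
  static unsigned long sizes[NR_GENS];  /* pages per truncated generation */

  /* The aging produces a young generation. */
  static void aging(void)
  {
          sizes[++max_seq % NR_GENS] = 0;  /* fresh, empty youngest bucket */
          printf("aging:    max_seq=%lu\n", max_seq);
  }

  /* The eviction consumes the oldest generation. */
  static void eviction(void)
  {
          unsigned long gen;

          if (max_seq - min_seq + 1 <= MIN_NR_GENS_DEMO)
                  aging();  /* too few generations left to evict from */

          gen = min_seq % NR_GENS;
          printf("eviction: min_seq=%lu reclaims %lu pages\n", min_seq, sizes[gen]);
          sizes[gen] = 0;   /* pretend everything was reclaimed */
          min_seq++;
  }

  int main(void)
  {
          int i;

          for (i = 0; i < NR_GENS; i++)
                  sizes[i] = 100;  /* arbitrary starting population */

          for (i = 0; i < 6; i++)
                  eviction();

          return 0;
  }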

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 include/linux/mm.h                |    1 +
 include/linux/mm_inline.h         |  194 +++++
 include/linux/mmzone.h            |   54 ++
 include/linux/page-flags-layout.h |   20 +-
 mm/huge_memory.c                  |    3 +-
 mm/mm_init.c                      |   13 +-
 mm/mmzone.c                       |    2 +
 mm/swap.c                         |    4 +
 mm/swapfile.c                     |    4 +
 mm/vmscan.c                       | 1255 +++++++++++++++++++++++++++++
 10 files changed, 1541 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 77e64e3eac80..ac57ea124fb8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1070,6 +1070,7 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
 #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
 #define LAST_CPUPID_PGOFF	(ZONES_PGOFF - LAST_CPUPID_WIDTH)
 #define KASAN_TAG_PGOFF		(LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH)
+#define LRU_GEN_PGOFF		(KASAN_TAG_PGOFF - LRU_GEN_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 355ea1ee32bd..2d306cab36bc 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -79,11 +79,199 @@ static __always_inline enum lru_list page_lru(struct page *page)
 	return lru;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+#ifdef CONFIG_LRU_GEN_ENABLED
+DECLARE_STATIC_KEY_TRUE(lru_gen_static_key);
+#define lru_gen_enabled() static_branch_likely(&lru_gen_static_key)
+#else
+DECLARE_STATIC_KEY_FALSE(lru_gen_static_key);
+#define lru_gen_enabled() static_branch_unlikely(&lru_gen_static_key)
+#endif
+
+/*
+ * Raw generation numbers (seq) from struct lru_gen are in unsigned long and
+ * therefore (virtually) monotonic; truncated generation numbers (gen) occupy
+ * at most ilog2(MAX_NR_GENS)+1 bits in page flags and therefore are cyclic.
+ */
+static inline int lru_gen_from_seq(unsigned long seq)
+{
+	return seq % MAX_NR_GENS;
+}
+
+/* The youngest and the second youngest generations are considered active. */
+static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
+{
+	unsigned long max_seq = READ_ONCE(lruvec->evictable.max_seq);
+
+	VM_BUG_ON(!max_seq);
+	VM_BUG_ON(gen >= MAX_NR_GENS);
+
+	return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
+}
+
+/* Returns -1 when multigenerational lru is disabled or page is isolated. */
+static inline int page_lru_gen(struct page *page)
+{
+	return ((READ_ONCE(page->flags) & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+}
+
+/* Update multigenerational lru sizes in addition to active/inactive lru sizes. */
+static inline void lru_gen_update_size(struct page *page, struct lruvec *lruvec,
+				       int old_gen, int new_gen)
+{
+	int file = page_is_file_lru(page);
+	int zone = page_zonenum(page);
+	int delta = thp_nr_pages(page);
+	enum lru_list lru = LRU_FILE * file;
+
+	lockdep_assert_held(&lruvec->lru_lock);
+	VM_BUG_ON(old_gen != -1 && old_gen >= MAX_NR_GENS);
+	VM_BUG_ON(new_gen != -1 && new_gen >= MAX_NR_GENS);
+	VM_BUG_ON(old_gen == -1 && new_gen == -1);
+
+	if (old_gen >= 0)
+		WRITE_ONCE(lruvec->evictable.sizes[old_gen][file][zone],
+			   lruvec->evictable.sizes[old_gen][file][zone] - delta);
+	if (new_gen >= 0)
+		WRITE_ONCE(lruvec->evictable.sizes[new_gen][file][zone],
+			   lruvec->evictable.sizes[new_gen][file][zone] + delta);
+
+	if (old_gen < 0) {
+		if (lru_gen_is_active(lruvec, new_gen))
+			lru += LRU_ACTIVE;
+		update_lru_size(lruvec, lru, zone, delta);
+		return;
+	}
+
+	if (new_gen < 0) {
+		if (lru_gen_is_active(lruvec, old_gen))
+			lru += LRU_ACTIVE;
+		update_lru_size(lruvec, lru, zone, -delta);
+		return;
+	}
+
+	if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) {
+		update_lru_size(lruvec, lru, zone, -delta);
+		update_lru_size(lruvec, lru + LRU_ACTIVE, zone, delta);
+	}
+
+	/* can't deactivate a page without deleting it first */
+	VM_BUG_ON(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen));
+}
+
+/* Add a page to a multigenerational lru list. Returns true on success. */
+static inline bool page_set_lru_gen(struct page *page, struct lruvec *lruvec, bool front)
+{
+	int gen;
+	unsigned long old_flags, new_flags;
+	int file = page_is_file_lru(page);
+	int zone = page_zonenum(page);
+
+	if (PageUnevictable(page) || !lruvec->evictable.enabled[file])
+		return false;
+	/*
+	 * If a page is being faulted in, mark it as the youngest generation.
+	 * try_walk_mm_list() may look at the size of the youngest generation
+	 * to determine if a page table walk is needed.
+	 *
+	 * If an unmapped page is being activated, e.g., mark_page_accessed(),
+	 * mark it as the second youngest generation so it won't affect
+	 * try_walk_mm_list().
+	 *
+	 * If a page is being evicted, i.e., waiting for writeback, mark it
+	 * as the second oldest generation so it won't be scanned again
+	 * immediately. And if there are more than three generations, it won't
+	 * be counted as active either.
+	 *
+	 * If a page is being deactivated, rotated by writeback or allocated
+	 * by readahead, mark it as the oldest generation so it will be evicted
+	 * first.
+	 */
+	if (PageActive(page) && page_mapped(page))
+		gen = lru_gen_from_seq(lruvec->evictable.max_seq);
+	else if (PageActive(page))
+		gen = lru_gen_from_seq(lruvec->evictable.max_seq - 1);
+	else if (PageReclaim(page))
+		gen = lru_gen_from_seq(lruvec->evictable.min_seq[file] + 1);
+	else
+		gen = lru_gen_from_seq(lruvec->evictable.min_seq[file]);
+
+	do {
+		old_flags = READ_ONCE(page->flags);
+		VM_BUG_ON_PAGE(old_flags & LRU_GEN_MASK, page);
+
+		new_flags = (old_flags & ~(LRU_GEN_MASK | BIT(PG_active) | BIT(PG_workingset))) |
+			    ((gen + 1UL) << LRU_GEN_PGOFF);
+		/* mark page as workingset if active */
+		if (PageActive(page))
+			new_flags |= BIT(PG_workingset);
+	} while (cmpxchg(&page->flags, old_flags, new_flags) != old_flags);
+
+	lru_gen_update_size(page, lruvec, -1, gen);
+	if (front)
+		list_add(&page->lru, &lruvec->evictable.lists[gen][file][zone]);
+	else
+		list_add_tail(&page->lru, &lruvec->evictable.lists[gen][file][zone]);
+
+	return true;
+}
+
+/* Delete a page from a multigenerational lru list. Returns true on success. */
+static inline bool page_clear_lru_gen(struct page *page, struct lruvec *lruvec)
+{
+	int gen;
+	unsigned long old_flags, new_flags;
+
+	do {
+		old_flags = READ_ONCE(page->flags);
+		if (!(old_flags & LRU_GEN_MASK))
+			return false;
+
+		VM_BUG_ON_PAGE(PageActive(page), page);
+		VM_BUG_ON_PAGE(PageUnevictable(page), page);
+
+		gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+
+		new_flags = old_flags & ~LRU_GEN_MASK;
+		/* mark page active accordingly */
+		if (lru_gen_is_active(lruvec, gen))
+			new_flags |= BIT(PG_active);
+	} while (cmpxchg(&page->flags, old_flags, new_flags) != old_flags);
+
+	lru_gen_update_size(page, lruvec, gen, -1);
+	list_del(&page->lru);
+
+	return true;
+}
+
+#else /* CONFIG_LRU_GEN */
+
+static inline bool lru_gen_enabled(void)
+{
+	return false;
+}
+
+static inline bool page_set_lru_gen(struct page *page, struct lruvec *lruvec, bool front)
+{
+	return false;
+}
+
+static inline bool page_clear_lru_gen(struct page *page, struct lruvec *lruvec)
+{
+	return false;
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 static __always_inline void add_page_to_lru_list(struct page *page,
 				struct lruvec *lruvec)
 {
 	enum lru_list lru = page_lru(page);
 
+	if (page_set_lru_gen(page, lruvec, true))
+		return;
+
 	update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page));
 	list_add(&page->lru, &lruvec->lists[lru]);
 }
@@ -93,6 +281,9 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
 {
 	enum lru_list lru = page_lru(page);
 
+	if (page_set_lru_gen(page, lruvec, false))
+		return;
+
 	update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page));
 	list_add_tail(&page->lru, &lruvec->lists[lru]);
 }
@@ -100,6 +291,9 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
 static __always_inline void del_page_from_lru_list(struct page *page,
 				struct lruvec *lruvec)
 {
+	if (page_clear_lru_gen(page, lruvec))
+		return;
+
 	list_del(&page->lru);
 	update_lru_size(lruvec, page_lru(page), page_zonenum(page),
 			-thp_nr_pages(page));
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a99a1050565a..173083bb846e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -291,6 +291,56 @@ enum lruvec_flags {
 					 */
 };
 
+struct lruvec;
+
+#define LRU_GEN_MASK	((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
+
+#ifdef CONFIG_LRU_GEN
+
+#define MAX_NR_GENS	CONFIG_NR_LRU_GENS
+
+/*
+ * For a common x86_64 configuration with 3 zones and 7 generations,
+ * the size of this struct is 1112 bytes; with 4 zones and 15 generations,
+ * it is 3048 bytes. Though it can be configured to have 6 zones and 63
+ * generations, there is unlikely to be a need for that.
+ */
+struct lru_gen {
+	/* aging increments max generation number */
+	unsigned long max_seq;
+	/* eviction increments min generation numbers */
+	unsigned long min_seq[ANON_AND_FILE];
+	/* birth time of each generation in jiffies */
+	unsigned long timestamps[MAX_NR_GENS];
+	/* multigenerational lru lists */
+	struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	/* sizes of multigenerational lru lists in pages */
+	unsigned long sizes[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	/* used with swappiness to determine which to reclaim */
+	unsigned long isolated[ANON_AND_FILE];
+#ifdef CONFIG_MEMCG
+	/* reclaim priority to compare with other memcgs */
+	atomic_t priority;
+#endif
+	/* whether multigenerational lru is enabled */
+	bool enabled[ANON_AND_FILE];
+};
+
+void lru_gen_init_lruvec(struct lruvec *lruvec);
+void lru_gen_set_state(bool enable, bool main, bool swap);
+
+#else /* CONFIG_LRU_GEN */
+
+static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
+{
+}
+
+static inline void lru_gen_set_state(bool enable, bool main, bool swap)
+{
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
 	/* per lruvec lru_lock for memcg */
@@ -308,6 +358,10 @@ struct lruvec {
 	unsigned long			refaults[ANON_AND_FILE];
 	/* Various lruvec state flags (enum lruvec_flags) */
 	unsigned long			flags;
+#ifdef CONFIG_LRU_GEN
+	/* unevictable pages are on LRU_UNEVICTABLE */
+	struct lru_gen			evictable;
+#endif
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
 #endif
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 7d4ec26d8a3e..0c24ace9da3c 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -24,6 +24,20 @@
 #error ZONES_SHIFT -- too many zones configured adjust calculation
 #endif
 
+#ifndef CONFIG_LRU_GEN
+#define LRU_GEN_WIDTH 0
+#else
+#if CONFIG_NR_LRU_GENS < 8
+#define LRU_GEN_WIDTH 3
+#elif CONFIG_NR_LRU_GENS < 16
+#define LRU_GEN_WIDTH 4
+#elif CONFIG_NR_LRU_GENS < 32
+#define LRU_GEN_WIDTH 5
+#else
+#define LRU_GEN_WIDTH 6
+#endif
+#endif /* CONFIG_LRU_GEN */
+
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
 
@@ -56,7 +70,7 @@
 
 #define ZONES_WIDTH		ZONES_SHIFT
 
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#if SECTIONS_WIDTH+ZONES_WIDTH+LRU_GEN_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
 #define NODES_WIDTH		NODES_SHIFT
 #else
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
@@ -83,14 +97,14 @@
 #define KASAN_TAG_WIDTH 0
 #endif
 
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT+KASAN_TAG_WIDTH \
+#if SECTIONS_WIDTH+ZONES_WIDTH+LRU_GEN_WIDTH+NODES_WIDTH+KASAN_TAG_WIDTH+LAST_CPUPID_SHIFT \
 	<= BITS_PER_LONG - NR_PAGEFLAGS
 #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
 #else
 #define LAST_CPUPID_WIDTH 0
 #endif
 
-#if SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH+LAST_CPUPID_WIDTH+KASAN_TAG_WIDTH \
+#if SECTIONS_WIDTH+ZONES_WIDTH+LRU_GEN_WIDTH+NODES_WIDTH+KASAN_TAG_WIDTH+LAST_CPUPID_WIDTH \
 	> BITS_PER_LONG - NR_PAGEFLAGS
 #error "Not enough bits in page flags"
 #endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 395c75111d33..be9bf681313c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2422,7 +2422,8 @@ static void __split_huge_page_tail(struct page *head, int tail,
 #ifdef CONFIG_64BIT
 			 (1L << PG_arch_2) |
 #endif
-			 (1L << PG_dirty)));
+			 (1L << PG_dirty) |
+			 LRU_GEN_MASK));
 
 	/* ->mapping in first tail page is compound_mapcount */
 	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 8e02e865cc65..0b91a25fbdee 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -71,27 +71,30 @@ void __init mminit_verify_pageflags_layout(void)
 	width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH
 		- LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH;
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
-		"Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Flags %d\n",
+		"Section %d Node %d Zone %d Lastcpupid %d Kasantag %d lru_gen %d Flags %d\n",
 		SECTIONS_WIDTH,
 		NODES_WIDTH,
 		ZONES_WIDTH,
 		LAST_CPUPID_WIDTH,
 		KASAN_TAG_WIDTH,
+		LRU_GEN_WIDTH,
 		NR_PAGEFLAGS);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
-		"Section %d Node %d Zone %d Lastcpupid %d Kasantag %d\n",
+		"Section %d Node %d Zone %d Lastcpupid %d Kasantag %d lru_gen %d\n",
 		SECTIONS_SHIFT,
 		NODES_SHIFT,
 		ZONES_SHIFT,
 		LAST_CPUPID_SHIFT,
-		KASAN_TAG_WIDTH);
+		KASAN_TAG_WIDTH,
+		LRU_GEN_WIDTH);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_pgshifts",
-		"Section %lu Node %lu Zone %lu Lastcpupid %lu Kasantag %lu\n",
+		"Section %lu Node %lu Zone %lu Lastcpupid %lu Kasantag %lu lru_gen %lu\n",
 		(unsigned long)SECTIONS_PGSHIFT,
 		(unsigned long)NODES_PGSHIFT,
 		(unsigned long)ZONES_PGSHIFT,
 		(unsigned long)LAST_CPUPID_PGSHIFT,
-		(unsigned long)KASAN_TAG_PGSHIFT);
+		(unsigned long)KASAN_TAG_PGSHIFT,
+		(unsigned long)LRU_GEN_PGOFF);
 	mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodezoneid",
 		"Node/Zone ID: %lu -> %lu\n",
 		(unsigned long)(ZONEID_PGOFF + ZONEID_SHIFT),
diff --git a/mm/mmzone.c b/mm/mmzone.c
index eb89d6e018e2..2ec0d7793424 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -81,6 +81,8 @@ void lruvec_init(struct lruvec *lruvec)
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
+
+	lru_gen_init_lruvec(lruvec);
 }
 
 #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)
diff --git a/mm/swap.c b/mm/swap.c
index f20ed56ebbbf..bd10efe00684 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -300,6 +300,10 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 
 void lru_note_cost_page(struct page *page)
 {
+	/* multigenerational lru doesn't use any heuristics */
+	if (lru_gen_enabled())
+		return;
+
 	lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)),
 		      page_is_file_lru(page), thp_nr_pages(page));
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 084a5b9a18e5..fe03cfeaa08f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2702,6 +2702,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	err = 0;
 	atomic_inc(&proc_poll_event);
 	wake_up_interruptible(&proc_poll_wait);
+	/* stop anon multigenerational lru if it's enabled */
+	lru_gen_set_state(false, false, true);
 
 out_dput:
 	filp_close(victim, NULL);
@@ -3348,6 +3350,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	mutex_unlock(&swapon_mutex);
 	atomic_inc(&proc_poll_event);
 	wake_up_interruptible(&proc_poll_wait);
+	/* start anon multigenerational lru if it's enabled */
+	lru_gen_set_state(true, false, true);
 
 	error = 0;
 	goto out;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f7657ab0d4b7..fd49a9a5d7f5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -49,6 +49,8 @@
 #include <linux/printk.h>
 #include <linux/dax.h>
 #include <linux/psi.h>
+#include <linux/pagewalk.h>
+#include <linux/memory.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -1110,6 +1112,10 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 		if (!sc->may_unmap && page_mapped(page))
 			goto keep_locked;
 
+		/* in case this page was found accessed after it was isolated */
+		if (lru_gen_enabled() && !ignore_references && PageReferenced(page))
+			goto activate_locked;
+
 		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
 			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
 
@@ -2229,6 +2235,10 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
 	unsigned long file;
 	struct lruvec *target_lruvec;
 
+	/* multigenerational lru doesn't use any heuristics */
+	if (lru_gen_enabled())
+		return;
+
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
 
 	/*
@@ -2518,6 +2528,19 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	}
 }
 
+#ifdef CONFIG_LRU_GEN
+static void age_lru_gens(struct pglist_data *pgdat, struct scan_control *sc);
+static void shrink_lru_gens(struct lruvec *lruvec, struct scan_control *sc);
+#else
+static void age_lru_gens(struct pglist_data *pgdat, struct scan_control *sc)
+{
+}
+
+static void shrink_lru_gens(struct lruvec *lruvec, struct scan_control *sc)
+{
+}
+#endif
+
 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
 	unsigned long nr[NR_LRU_LISTS];
@@ -2529,6 +2552,11 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	struct blk_plug plug;
 	bool scan_adjusted;
 
+	if (lru_gen_enabled()) {
+		shrink_lru_gens(lruvec, sc);
+		return;
+	}
+
 	get_scan_count(lruvec, sc, nr);
 
 	/* Record the original scan target for proportional adjustments later */
@@ -2995,6 +3023,10 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
 	struct lruvec *target_lruvec;
 	unsigned long refaults;
 
+	/* multigenerational lru doesn't use any heuristics */
+	if (lru_gen_enabled())
+		return;
+
 	target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON);
 	target_lruvec->refaults[0] = refaults;
@@ -3369,6 +3401,11 @@ static void age_active_anon(struct pglist_data *pgdat,
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
+	if (lru_gen_enabled()) {
+		age_lru_gens(pgdat, sc);
+		return;
+	}
+
 	if (!total_swap_pages)
 		return;
 
@@ -4553,12 +4590,1227 @@ static bool get_next_mm(struct lruvec *lruvec, unsigned long next_seq,
 	return last;
 }
 
+/******************************************************************************
+ *                           aging (page table walk)
+ ******************************************************************************/
+
+#define DEFINE_MAX_SEQ(lruvec)						\
+	unsigned long max_seq = READ_ONCE((lruvec)->evictable.max_seq)
+
+#define DEFINE_MIN_SEQ(lruvec)						\
+	unsigned long min_seq[ANON_AND_FILE] = {			\
+		READ_ONCE((lruvec)->evictable.min_seq[0]),		\
+		READ_ONCE((lruvec)->evictable.min_seq[1]),		\
+	}
+
+#define for_each_gen_type_zone(gen, file, zone)				\
+	for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)			\
+		for ((file) = 0; (file) < ANON_AND_FILE; (file)++)	\
+			for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
+
+#define for_each_type_zone(file, zone)					\
+	for ((file) = 0; (file) < ANON_AND_FILE; (file)++)		\
+		for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
+
+#define MAX_BATCH_SIZE 8192
+
+static DEFINE_PER_CPU(int [MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES], lru_batch_size);
+
+static void update_batch_size(struct page *page, int old_gen, int new_gen)
+{
+	int file = page_is_file_lru(page);
+	int zone = page_zonenum(page);
+	int delta = thp_nr_pages(page);
+
+	VM_BUG_ON(preemptible());
+	VM_BUG_ON(in_interrupt());
+	VM_BUG_ON(old_gen >= MAX_NR_GENS);
+	VM_BUG_ON(new_gen >= MAX_NR_GENS);
+
+	__this_cpu_sub(lru_batch_size[old_gen][file][zone], delta);
+	__this_cpu_add(lru_batch_size[new_gen][file][zone], delta);
+}
+
+static void reset_batch_size(struct lruvec *lruvec)
+{
+	int gen, file, zone;
+
+	VM_BUG_ON(preemptible());
+	VM_BUG_ON(in_interrupt());
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	for_each_gen_type_zone(gen, file, zone) {
+		enum lru_list lru = LRU_FILE * file;
+		int total = __this_cpu_read(lru_batch_size[gen][file][zone]);
+
+		if (!total)
+			continue;
+
+		__this_cpu_write(lru_batch_size[gen][file][zone], 0);
+
+		WRITE_ONCE(lruvec->evictable.sizes[gen][file][zone],
+			   lruvec->evictable.sizes[gen][file][zone] + total);
+
+		if (lru_gen_is_active(lruvec, gen))
+			lru += LRU_ACTIVE;
+		update_lru_size(lruvec, lru, zone, total);
+	}
+
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static int page_update_lru_gen(struct page *page, int new_gen)
+{
+	int old_gen;
+	unsigned long old_flags, new_flags;
+
+	VM_BUG_ON(new_gen >= MAX_NR_GENS);
+
+	do {
+		old_flags = READ_ONCE(page->flags);
+
+		old_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+		if (old_gen < 0) {
+			/* make sure shrink_page_list() rejects this page */
+			if (!PageReferenced(page))
+				SetPageReferenced(page);
+			break;
+		}
+
+		new_flags = (old_flags & ~LRU_GEN_MASK) | ((new_gen + 1UL) << LRU_GEN_PGOFF);
+		if (old_flags == new_flags)
+			break;
+	} while (cmpxchg(&page->flags, old_flags, new_flags) != old_flags);
+
+	/* sort_page_by_gen() will sort this page during eviction */
+
+	return old_gen;
+}
+
+struct mm_walk_args {
+	struct mem_cgroup *memcg;
+	unsigned long max_seq;
+	unsigned long next_addr;
+	unsigned long start_pfn;
+	unsigned long end_pfn;
+	unsigned long addr_bitmap;
+	int node_id;
+	int batch_size;
+	bool should_walk[ANON_AND_FILE];
+};
+
+static inline unsigned long get_addr_mask(unsigned long addr)
+{
+	return BIT((addr & ~PUD_MASK) >> ilog2(PUD_SIZE / BITS_PER_LONG));
+}
+
+static int walk_pte_range(pmd_t *pmdp, unsigned long start, unsigned long end,
+			  struct mm_walk *walk)
+{
+	pmd_t pmd;
+	pte_t *pte;
+	spinlock_t *ptl;
+	struct mm_walk_args *args = walk->private;
+	int old_gen, new_gen = lru_gen_from_seq(args->max_seq);
+
+	pmd = pmd_read_atomic(pmdp);
+	barrier();
+	if (!pmd_present(pmd) || pmd_trans_huge(pmd))
+		return 0;
+
+	VM_BUG_ON(pmd_huge(pmd) || pmd_devmap(pmd) || is_hugepd(__hugepd(pmd_val(pmd))));
+
+	if (IS_ENABLED(CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG) && !pmd_young(pmd))
+		return 0;
+
+	pte = pte_offset_map_lock(walk->mm, &pmd, start, &ptl);
+	arch_enter_lazy_mmu_mode();
+
+	for (; start != end; pte++, start += PAGE_SIZE) {
+		struct page *page;
+		unsigned long pfn = pte_pfn(*pte);
+
+		if (!pte_present(*pte) || !pte_young(*pte) || is_zero_pfn(pfn))
+			continue;
+
+		/*
+		 * If this pte maps a page from a different node, set the
+		 * bitmap to prevent the accessed bit on its parent pmd from
+		 * being cleared.
+		 */
+		if (pfn < args->start_pfn || pfn >= args->end_pfn) {
+			args->addr_bitmap |= get_addr_mask(start);
+			continue;
+		}
+
+		page = compound_head(pte_page(*pte));
+		if (page_to_nid(page) != args->node_id) {
+			args->addr_bitmap |= get_addr_mask(start);
+			continue;
+		}
+		if (page_memcg_rcu(page) != args->memcg)
+			continue;
+
+		if (ptep_test_and_clear_young(walk->vma, start, pte)) {
+			old_gen = page_update_lru_gen(page, new_gen);
+			if (old_gen >= 0 && old_gen != new_gen) {
+				update_batch_size(page, old_gen, new_gen);
+				args->batch_size++;
+			}
+		}
+
+		if (pte_dirty(*pte) && !PageDirty(page) &&
+		    !(PageAnon(page) && PageSwapBacked(page) && !PageSwapCache(page)))
+			set_page_dirty(page);
+	}
+
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(pte, ptl);
+
+	return 0;
+}
+
+static int walk_pmd_range(pud_t *pudp, unsigned long start, unsigned long end,
+			  struct mm_walk *walk)
+{
+	pud_t pud;
+	pmd_t *pmd;
+	spinlock_t *ptl;
+	struct mm_walk_args *args = walk->private;
+	int old_gen, new_gen = lru_gen_from_seq(args->max_seq);
+
+	pud = READ_ONCE(*pudp);
+	if (!pud_present(pud) || WARN_ON_ONCE(pud_trans_huge(pud)))
+		return 0;
+
+	VM_BUG_ON(pud_huge(pud) || pud_devmap(pud) || is_hugepd(__hugepd(pud_val(pud))));
+
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+	    !IS_ENABLED(CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG))
+		goto done;
+
+	pmd = pmd_offset(&pud, start);
+	ptl = pmd_lock(walk->mm, pmd);
+	arch_enter_lazy_mmu_mode();
+
+	for (; start != end; pmd++, start = pmd_addr_end(start, end)) {
+		struct page *page;
+		unsigned long pfn = pmd_pfn(*pmd);
+
+		if (!pmd_present(*pmd) || !pmd_young(*pmd) || is_huge_zero_pmd(*pmd))
+			continue;
+
+		if (!pmd_trans_huge(*pmd)) {
+			if (!(args->addr_bitmap & get_addr_mask(start)) &&
+			    (!(pmd_addr_end(start, end) & ~PMD_MASK) ||
+			     !walk->vma->vm_next ||
+			     (walk->vma->vm_next->vm_start & PMD_MASK) > end))
+				pmdp_test_and_clear_young(walk->vma, start, pmd);
+			continue;
+		}
+
+		if (pfn < args->start_pfn || pfn >= args->end_pfn)
+			continue;
+
+		page = pmd_page(*pmd);
+		if (page_to_nid(page) != args->node_id)
+			continue;
+		if (page_memcg_rcu(page) != args->memcg)
+			continue;
+
+		if (pmdp_test_and_clear_young(walk->vma, start, pmd)) {
+			old_gen = page_update_lru_gen(page, new_gen);
+			if (old_gen >= 0 && old_gen != new_gen) {
+				update_batch_size(page, old_gen, new_gen);
+				args->batch_size++;
+			}
+		}
+
+		if (pmd_dirty(*pmd) && !PageDirty(page) &&
+		    !(PageAnon(page) && PageSwapBacked(page) && !PageSwapCache(page)))
+			set_page_dirty(page);
+	}
+
+	arch_leave_lazy_mmu_mode();
+	spin_unlock(ptl);
+done:
+	args->addr_bitmap = 0;
+
+	if (args->batch_size < MAX_BATCH_SIZE)
+		return 0;
+
+	args->next_addr = end;
+
+	return -EAGAIN;
+}
+
+static int should_skip_vma(unsigned long start, unsigned long end, struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+	struct mm_walk_args *args = walk->private;
+
+	if (vma->vm_flags & (VM_LOCKED | VM_SPECIAL | VM_HUGETLB))
+		return true;
+
+	if (!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
+		return true;
+
+	if (vma_is_anonymous(vma))
+		return !args->should_walk[0];
+
+	if (vma_is_shmem(vma))
+		return !args->should_walk[0] ||
+		       mapping_unevictable(vma->vm_file->f_mapping);
+
+	return !args->should_walk[1] || vma_is_dax(vma) ||
+	       vma == get_gate_vma(vma->vm_mm) ||
+	       mapping_unevictable(vma->vm_file->f_mapping);
+}
+
+static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, int swappiness)
+{
+	int err;
+	int file;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+	struct mm_walk_args args = {};
+	struct mm_walk_ops ops = {
+		.test_walk = should_skip_vma,
+		.pmd_entry = walk_pte_range,
+		.pud_entry_post = walk_pmd_range,
+	};
+
+	args.memcg = memcg;
+	args.max_seq = READ_ONCE(lruvec->evictable.max_seq);
+	args.next_addr = FIRST_USER_ADDRESS;
+	args.start_pfn = pgdat->node_start_pfn;
+	args.end_pfn = pgdat_end_pfn(pgdat);
+	args.node_id = pgdat->node_id;
+
+	for (file = !swappiness; file < ANON_AND_FILE; file++)
+		args.should_walk[file] = lru_gen_mm_is_active(mm) ||
+			node_isset(pgdat->node_id, mm->lru_gen.nodes[file]);
+
+	do {
+		unsigned long start = args.next_addr;
+		unsigned long end = mm->highest_vm_end;
+
+		err = -EBUSY;
+
+		preempt_disable();
+		rcu_read_lock();
+
+#ifdef CONFIG_MEMCG
+		if (memcg && atomic_read(&memcg->moving_account))
+			goto contended;
+#endif
+		if (!mmap_read_trylock(mm))
+			goto contended;
+
+		args.batch_size = 0;
+
+		err = walk_page_range(mm, start, end, &ops, &args);
+
+		mmap_read_unlock(mm);
+
+		if (args.batch_size)
+			reset_batch_size(lruvec);
+contended:
+		rcu_read_unlock();
+		preempt_enable();
+
+		cond_resched();
+	} while (err == -EAGAIN && !mm_is_oom_victim(mm) && !mm_has_migrated(mm, memcg));
+
+	if (err)
+		return;
+
+	for (file = !swappiness; file < ANON_AND_FILE; file++) {
+		if (args.should_walk[file])
+			node_clear(pgdat->node_id, mm->lru_gen.nodes[file]);
+	}
+}
+
+static void page_inc_lru_gen(struct page *page, struct lruvec *lruvec, bool front)
+{
+	int old_gen, new_gen;
+	unsigned long old_flags, new_flags;
+	int file = page_is_file_lru(page);
+	int zone = page_zonenum(page);
+
+	old_gen = lru_gen_from_seq(lruvec->evictable.min_seq[file]);
+
+	do {
+		old_flags = READ_ONCE(page->flags);
+		new_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+		VM_BUG_ON_PAGE(new_gen < 0, page);
+		if (new_gen >= 0 && new_gen != old_gen)
+			goto sort;
+
+		new_gen = (old_gen + 1) % MAX_NR_GENS;
+		new_flags = (old_flags & ~LRU_GEN_MASK) | ((new_gen + 1UL) << LRU_GEN_PGOFF);
+		/* mark page for reclaim if pending writeback */
+		if (front)
+			new_flags |= BIT(PG_reclaim);
+	} while (cmpxchg(&page->flags, old_flags, new_flags) != old_flags);
+
+	lru_gen_update_size(page, lruvec, old_gen, new_gen);
+sort:
+	if (front)
+		list_move(&page->lru, &lruvec->evictable.lists[new_gen][file][zone]);
+	else
+		list_move_tail(&page->lru, &lruvec->evictable.lists[new_gen][file][zone]);
+}
+
+static int get_nr_gens(struct lruvec *lruvec, int file)
+{
+	return lruvec->evictable.max_seq - lruvec->evictable.min_seq[file] + 1;
+}
+
+static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
+{
+	lockdep_assert_held(&lruvec->lru_lock);
+
+	return get_nr_gens(lruvec, 0) >= MIN_NR_GENS &&
+	       get_nr_gens(lruvec, 0) <= MAX_NR_GENS &&
+	       get_nr_gens(lruvec, 1) >= MIN_NR_GENS &&
+	       get_nr_gens(lruvec, 1) <= MAX_NR_GENS;
+}
+
+static bool try_inc_min_seq(struct lruvec *lruvec, int file)
+{
+	int gen, zone;
+	bool success = false;
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	while (get_nr_gens(lruvec, file) > MIN_NR_GENS) {
+		gen = lru_gen_from_seq(lruvec->evictable.min_seq[file]);
+
+		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+			if (!list_empty(&lruvec->evictable.lists[gen][file][zone]))
+				return success;
+		}
+
+		lruvec->evictable.isolated[file] = 0;
+		WRITE_ONCE(lruvec->evictable.min_seq[file],
+			   lruvec->evictable.min_seq[file] + 1);
+
+		success = true;
+	}
+
+	return success;
+}
+
+static bool inc_min_seq(struct lruvec *lruvec, int file)
+{
+	int gen, zone;
+	int batch_size = 0;
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	if (get_nr_gens(lruvec, file) != MAX_NR_GENS)
+		return true;
+
+	gen = lru_gen_from_seq(lruvec->evictable.min_seq[file]);
+
+	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+		struct list_head *head = &lruvec->evictable.lists[gen][file][zone];
+
+		while (!list_empty(head)) {
+			struct page *page = lru_to_page(head);
+
+			VM_BUG_ON_PAGE(PageTail(page), page);
+			VM_BUG_ON_PAGE(PageUnevictable(page), page);
+			VM_BUG_ON_PAGE(PageActive(page), page);
+			VM_BUG_ON_PAGE(page_is_file_lru(page) != file, page);
+			VM_BUG_ON_PAGE(page_zonenum(page) != zone, page);
+
+			prefetchw_prev_lru_page(page, head, flags);
+
+			page_inc_lru_gen(page, lruvec, false);
+
+			if (++batch_size == MAX_BATCH_SIZE)
+				return false;
+		}
+
+		VM_BUG_ON(lruvec->evictable.sizes[gen][file][zone]);
+	}
+
+	lruvec->evictable.isolated[file] = 0;
+	WRITE_ONCE(lruvec->evictable.min_seq[file],
+		   lruvec->evictable.min_seq[file] + 1);
+
+	return true;
+}
+
+static void inc_max_seq(struct lruvec *lruvec)
+{
+	int gen, file, zone;
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	for (file = 0; file < ANON_AND_FILE; file++) {
+		if (try_inc_min_seq(lruvec, file))
+			continue;
+
+		while (!inc_min_seq(lruvec, file)) {
+			spin_unlock_irq(&lruvec->lru_lock);
+			cond_resched();
+			spin_lock_irq(&lruvec->lru_lock);
+		}
+	}
+
+	gen = lru_gen_from_seq(lruvec->evictable.max_seq - 1);
+	for_each_type_zone(file, zone) {
+		enum lru_list lru = LRU_FILE * file;
+		long total = lruvec->evictable.sizes[gen][file][zone];
+
+		WARN_ON_ONCE(total != (int)total);
+
+		update_lru_size(lruvec, lru, zone, total);
+		update_lru_size(lruvec, lru + LRU_ACTIVE, zone, -total);
+	}
+
+	gen = lru_gen_from_seq(lruvec->evictable.max_seq + 1);
+	for_each_type_zone(file, zone) {
+		VM_BUG_ON(lruvec->evictable.sizes[gen][file][zone]);
+		VM_BUG_ON(!list_empty(&lruvec->evictable.lists[gen][file][zone]));
+	}
+
+	WRITE_ONCE(lruvec->evictable.timestamps[gen], jiffies);
+	/* make sure the birth time is valid when read locklessly */
+	smp_store_release(&lruvec->evictable.max_seq, lruvec->evictable.max_seq + 1);
+
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+/* Main function used by foreground, background and user-triggered aging. */
+static bool walk_mm_list(struct lruvec *lruvec, unsigned long next_seq,
+			 struct scan_control *sc, int swappiness)
+{
+	bool last;
+	struct mm_struct *mm = NULL;
+	int nid = lruvec_pgdat(lruvec)->node_id;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+
+	VM_BUG_ON(next_seq > READ_ONCE(lruvec->evictable.max_seq));
+
+	/*
+	 * For each walk of the mm list of a memcg, we decrement the priority
+	 * of its lruvec. For each walk of memcgs in kswapd, we increment the
+	 * priorities of all lruvecs.
+	 *
+	 * So if this lruvec has a higher priority (smaller value), it means
+	 * other concurrent reclaimers (global or memcg reclaim) have walked
+	 * its mm list. Skip it for this priority to balance the pressure on
+	 * all memcgs.
+	 */
+#ifdef CONFIG_MEMCG
+	if (!mem_cgroup_disabled() && !cgroup_reclaim(sc) &&
+	    sc->priority > atomic_read(&lruvec->evictable.priority))
+		return false;
+#endif
+
+	do {
+		last = get_next_mm(lruvec, next_seq, swappiness, &mm);
+		if (mm)
+			walk_mm(lruvec, mm, swappiness);
+
+		cond_resched();
+	} while (mm);
+
+	if (!last) {
+		/* foreground aging prefers not to wait unless "necessary" */
+		if (!current_is_kswapd() && sc->priority < DEF_PRIORITY - 2)
+			wait_event_killable(mm_list->nodes[nid].wait,
+				next_seq < READ_ONCE(lruvec->evictable.max_seq));
+
+		return next_seq < READ_ONCE(lruvec->evictable.max_seq);
+	}
+
+	VM_BUG_ON(next_seq != READ_ONCE(lruvec->evictable.max_seq));
+
+	inc_max_seq(lruvec);
+
+#ifdef CONFIG_MEMCG
+	if (!mem_cgroup_disabled())
+		atomic_add_unless(&lruvec->evictable.priority, -1, 0);
+#endif
+
+	/* order against inc_max_seq() */
+	smp_mb();
+	/* either we see any waiters or they will see updated max_seq */
+	if (waitqueue_active(&mm_list->nodes[nid].wait))
+		wake_up_all(&mm_list->nodes[nid].wait);
+
+	wakeup_flusher_threads(WB_REASON_VMSCAN);
+
+	return true;
+}
+
+/******************************************************************************
+ *                           eviction (lru list scan)
+ ******************************************************************************/
+
+static int max_nr_gens(unsigned long max_seq, unsigned long *min_seq, int swappiness)
+{
+	return max_seq - min(min_seq[!swappiness], min_seq[1]) + 1;
+}
+
+static bool sort_page_by_gen(struct page *page, struct lruvec *lruvec)
+{
+	bool success;
+	int gen = page_lru_gen(page);
+	int file = page_is_file_lru(page);
+	int zone = page_zonenum(page);
+
+	VM_BUG_ON_PAGE(gen == -1, page);
+
+	/* a lazy free page that has been written into */
+	if (file && PageDirty(page) && PageAnon(page)) {
+		success = page_clear_lru_gen(page, lruvec);
+		VM_BUG_ON_PAGE(!success, page);
+		SetPageSwapBacked(page);
+		add_page_to_lru_list_tail(page, lruvec);
+		return true;
+	}
+
+	/* page_update_lru_gen() has updated the page */
+	if (gen != lru_gen_from_seq(lruvec->evictable.min_seq[file])) {
+		list_move(&page->lru, &lruvec->evictable.lists[gen][file][zone]);
+		return true;
+	}
+
+	/*
+	 * A page can't be immediately evicted, and page_inc_lru_gen() will
+	 * mark it for reclaim and hopefully writeback will write it soon.
+	 *
+	 * During page table walk, we call set_page_dirty() on pages that have
+	 * dirty PTEs, which helps account dirty pages so writeback should do
+	 * its job.
+	 */
+	if (PageLocked(page) || PageWriteback(page) || (file && PageDirty(page))) {
+		page_inc_lru_gen(page, lruvec, true);
+		return true;
+	}
+
+	return false;
+}
+
+static bool should_skip_page(struct page *page, struct scan_control *sc)
+{
+	if (!sc->may_unmap && page_mapped(page))
+		return true;
+
+	if (!(sc->may_writepage && (sc->gfp_mask & __GFP_IO)) &&
+	    (PageDirty(page) || (PageAnon(page) && !PageSwapCache(page))))
+		return true;
+
+	if (!get_page_unless_zero(page))
+		return true;
+
+	if (!TestClearPageLRU(page)) {
+		put_page(page);
+		return true;
+	}
+
+	return false;
+}
+
+static void isolate_page_by_gen(struct page *page, struct lruvec *lruvec)
+{
+	bool success;
+
+	success = page_clear_lru_gen(page, lruvec);
+	VM_BUG_ON_PAGE(!success, page);
+
+	if (PageActive(page)) {
+		ClearPageActive(page);
+		/* make sure shrink_page_list() rejects this page */
+		if (!PageReferenced(page))
+			SetPageReferenced(page);
+		return;
+	}
+
+	/* make sure shrink_page_list() doesn't write back this page */
+	if (PageReclaim(page))
+		ClearPageReclaim(page);
+	/* make sure shrink_page_list() doesn't reject this page */
+	if (PageReferenced(page))
+		ClearPageReferenced(page);
+}
+
+static int scan_lru_gen_pages(struct lruvec *lruvec, struct scan_control *sc,
+			      long *nr_to_scan, int file, struct list_head *list)
+{
+	bool success;
+	int gen, zone;
+	enum vm_event_item item;
+	int sorted = 0;
+	int scanned = 0;
+	int isolated = 0;
+	int batch_size = 0;
+
+	VM_BUG_ON(!list_empty(list));
+
+	if (get_nr_gens(lruvec, file) == MIN_NR_GENS)
+		return -ENOENT;
+
+	gen = lru_gen_from_seq(lruvec->evictable.min_seq[file]);
+
+	for (zone = sc->reclaim_idx; zone >= 0; zone--) {
+		LIST_HEAD(moved);
+		int skipped = 0;
+		struct list_head *head = &lruvec->evictable.lists[gen][file][zone];
+
+		while (!list_empty(head)) {
+			struct page *page = lru_to_page(head);
+			int delta = thp_nr_pages(page);
+
+			VM_BUG_ON_PAGE(PageTail(page), page);
+			VM_BUG_ON_PAGE(PageUnevictable(page), page);
+			VM_BUG_ON_PAGE(PageActive(page), page);
+			VM_BUG_ON_PAGE(page_is_file_lru(page) != file, page);
+			VM_BUG_ON_PAGE(page_zonenum(page) != zone, page);
+
+			prefetchw_prev_lru_page(page, head, flags);
+
+			scanned += delta;
+
+			if (sort_page_by_gen(page, lruvec))
+				sorted += delta;
+			else if (should_skip_page(page, sc)) {
+				list_move(&page->lru, &moved);
+				skipped += delta;
+			} else {
+				isolate_page_by_gen(page, lruvec);
+				list_add(&page->lru, list);
+				isolated += delta;
+			}
+
+			if (scanned >= *nr_to_scan || isolated >= SWAP_CLUSTER_MAX ||
+			    ++batch_size == MAX_BATCH_SIZE)
+				break;
+		}
+
+		list_splice(&moved, head);
+		__count_zid_vm_events(PGSCAN_SKIP, zone, skipped);
+
+		if (scanned >= *nr_to_scan || isolated >= SWAP_CLUSTER_MAX ||
+		    batch_size == MAX_BATCH_SIZE)
+			break;
+	}
+
+	success = try_inc_min_seq(lruvec, file);
+
+	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
+	if (!cgroup_reclaim(sc))
+		__count_vm_events(item, scanned);
+	__count_memcg_events(lruvec_memcg(lruvec), item, scanned);
+	__count_vm_events(PGSCAN_ANON + file, scanned);
+
+	*nr_to_scan -= scanned;
+
+	if (*nr_to_scan <= 0 || success || isolated)
+		return isolated;
+	/*
+	 * We may have trouble finding eligible pages due to restrictions from
+	 * reclaim_idx, may_unmap and may_writepage. The following check makes
+	 * sure we won't be stuck if we aren't making enough progress.
+	 */
+	return batch_size == MAX_BATCH_SIZE && sorted >= SWAP_CLUSTER_MAX ? 0 : -ENOENT;
+}
+
+static int isolate_lru_gen_pages(struct lruvec *lruvec, struct scan_control *sc,
+				 int swappiness, long *nr_to_scan, int *file,
+				 struct list_head *list)
+{
+	int i;
+	int isolated;
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	VM_BUG_ON(!seq_is_valid(lruvec));
+
+	if (max_nr_gens(max_seq, min_seq, swappiness) == MIN_NR_GENS)
+		return 0;
+
+	/* simply choose a type based on generations and swappiness */
+	*file = !swappiness || min_seq[0] > min_seq[1] ||
+		(min_seq[0] == min_seq[1] &&
+		 max(lruvec->evictable.isolated[0], 1UL) * (200 - swappiness) >
+		 max(lruvec->evictable.isolated[1], 1UL) * (swappiness - 1));
+
+	for (i = !swappiness; i < ANON_AND_FILE; i++) {
+		isolated = scan_lru_gen_pages(lruvec, sc, nr_to_scan, *file, list);
+		if (isolated >= 0)
+			break;
+
+		*file = !*file;
+	}
+
+	if (isolated < 0)
+		isolated = *nr_to_scan = 0;
+
+	lruvec->evictable.isolated[*file] += isolated;
+
+	return isolated;
+}
+
+/* Main function used by foreground, background and user-triggered eviction. */
+static bool evict_lru_gen_pages(struct lruvec *lruvec, struct scan_control *sc,
+				int swappiness, long *nr_to_scan)
+{
+	int file;
+	int isolated;
+	int reclaimed;
+	LIST_HEAD(list);
+	struct page *page;
+	enum vm_event_item item;
+	struct reclaim_stat stat;
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	isolated = isolate_lru_gen_pages(lruvec, sc, swappiness, nr_to_scan, &file, &list);
+	VM_BUG_ON(list_empty(&list) == !!isolated);
+
+	if (isolated)
+		__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, isolated);
+
+	spin_unlock_irq(&lruvec->lru_lock);
+
+	if (!isolated)
+		goto done;
+
+	reclaimed = shrink_page_list(&list, pgdat, sc, &stat, false);
+	/*
+	 * We have to prevent any pages from being added back to the same list
+	 * it was isolated from. Otherwise we may risk looping on them forever.
+	 */
+	list_for_each_entry(page, &list, lru) {
+		if (!PageReclaim(page) && !PageMlocked(page) && !PageActive(page))
+			SetPageActive(page);
+	}
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	move_pages_to_lru(lruvec, &list);
+
+	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -isolated);
+
+	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
+	if (!cgroup_reclaim(sc))
+		__count_vm_events(item, reclaimed);
+	__count_memcg_events(lruvec_memcg(lruvec), item, reclaimed);
+	__count_vm_events(PGSTEAL_ANON + file, reclaimed);
+
+	spin_unlock_irq(&lruvec->lru_lock);
+
+	mem_cgroup_uncharge_list(&list);
+	free_unref_page_list(&list);
+
+	sc->nr_reclaimed += reclaimed;
+done:
+	return *nr_to_scan > 0 && sc->nr_reclaimed < sc->nr_to_reclaim;
+}
+
+/******************************************************************************
+ *                           reclaim (aging + eviction)
+ ******************************************************************************/
+
+static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
+				    int swappiness)
+{
+	int gen, file, zone;
+	long nr_to_scan = 0;
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	lru_add_drain();
+
+	for (file = !swappiness; file < ANON_AND_FILE; file++) {
+		unsigned long seq;
+
+		for (seq = min_seq[file]; seq <= max_seq; seq++) {
+			gen = lru_gen_from_seq(seq);
+
+			for (zone = 0; zone <= sc->reclaim_idx; zone++)
+				nr_to_scan += READ_ONCE(
+					lruvec->evictable.sizes[gen][file][zone]);
+		}
+	}
+
+	nr_to_scan = max(nr_to_scan, 0L);
+	nr_to_scan = round_up(nr_to_scan >> sc->priority, SWAP_CLUSTER_MAX);
+
+	if (max_nr_gens(max_seq, min_seq, swappiness) > MIN_NR_GENS)
+		return nr_to_scan;
+
+	/* kswapd does background aging, i.e., age_lru_gens() */
+	if (current_is_kswapd())
+		return 0;
+
+	return walk_mm_list(lruvec, max_seq, sc, swappiness) ? nr_to_scan : 0;
+}
+
+static int get_swappiness(struct lruvec *lruvec)
+{
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	int swappiness = mem_cgroup_get_nr_swap_pages(memcg) >= (long)SWAP_CLUSTER_MAX ?
+			 mem_cgroup_swappiness(memcg) : 0;
+
+	VM_BUG_ON(swappiness > 200U);
+
+	return swappiness;
+}
+
+static void shrink_lru_gens(struct lruvec *lruvec, struct scan_control *sc)
+{
+	struct blk_plug plug;
+	unsigned long scanned = 0;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	blk_start_plug(&plug);
+
+	while (true) {
+		long nr_to_scan;
+		int swappiness = sc->may_swap ? get_swappiness(lruvec) : 0;
+
+		nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness) - scanned;
+		if (nr_to_scan < (long)SWAP_CLUSTER_MAX)
+			break;
+
+		scanned += nr_to_scan;
+
+		if (!evict_lru_gen_pages(lruvec, sc, swappiness, &nr_to_scan))
+			break;
+
+		scanned -= nr_to_scan;
+
+		if (mem_cgroup_below_min(memcg) ||
+		    (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim))
+			break;
+
+		cond_resched();
+	}
+
+	blk_finish_plug(&plug);
+}
+
+/******************************************************************************
+ *                           background aging
+ ******************************************************************************/
+
+static int lru_gen_spread = MIN_NR_GENS;
+
+static int min_nr_gens(unsigned long max_seq, unsigned long *min_seq, int swappiness)
+{
+	return max_seq - max(min_seq[!swappiness], min_seq[1]) + 1;
+}
+
+static void try_walk_mm_list(struct lruvec *lruvec, struct scan_control *sc)
+{
+	int gen, file, zone;
+	long old_and_young[2] = {};
+	int spread = READ_ONCE(lru_gen_spread);
+	int swappiness = get_swappiness(lruvec);
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	lru_add_drain();
+
+	for (file = !swappiness; file < ANON_AND_FILE; file++) {
+		unsigned long seq;
+
+		for (seq = min_seq[file]; seq <= max_seq; seq++) {
+			gen = lru_gen_from_seq(seq);
+
+			for (zone = 0; zone < MAX_NR_ZONES; zone++)
+				old_and_young[seq == max_seq] += READ_ONCE(
+					lruvec->evictable.sizes[gen][file][zone]);
+		}
+	}
+
+	old_and_young[0] = max(old_and_young[0], 0L);
+	old_and_young[1] = max(old_and_young[1], 0L);
+
+	if (old_and_young[0] + old_and_young[1] < SWAP_CLUSTER_MAX)
+		return;
+
+	/* try to spread pages out across spread+1 generations */
+	if (old_and_young[0] >= old_and_young[1] * spread &&
+	    min_nr_gens(max_seq, min_seq, swappiness) > max(spread, MIN_NR_GENS))
+		return;
+
+	walk_mm_list(lruvec, max_seq, sc, swappiness);
+}
+
+static void age_lru_gens(struct pglist_data *pgdat, struct scan_control *sc)
+{
+	struct mem_cgroup *memcg;
+
+	VM_BUG_ON(!current_is_kswapd());
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+		if (!mem_cgroup_below_min(memcg) &&
+		    (!mem_cgroup_below_low(memcg) || sc->memcg_low_reclaim))
+			try_walk_mm_list(lruvec, sc);
+
+#ifdef CONFIG_MEMCG
+		if (!mem_cgroup_disabled())
+			atomic_add_unless(&lruvec->evictable.priority, 1, DEF_PRIORITY);
+#endif
+
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+}
+
+/******************************************************************************
+ *                           state change
+ ******************************************************************************/
+
+#ifdef CONFIG_LRU_GEN_ENABLED
+DEFINE_STATIC_KEY_TRUE(lru_gen_static_key);
+#else
+DEFINE_STATIC_KEY_FALSE(lru_gen_static_key);
+#endif
+
+static DEFINE_MUTEX(lru_gen_state_mutex);
+static int lru_gen_nr_swapfiles __read_mostly;
+
+static bool fill_lru_gen_lists(struct lruvec *lruvec)
+{
+	enum lru_list lru;
+	int batch_size = 0;
+
+	for_each_evictable_lru(lru) {
+		int file = is_file_lru(lru);
+		bool active = is_active_lru(lru);
+		struct list_head *head = &lruvec->lists[lru];
+
+		if (!lruvec->evictable.enabled[file])
+			continue;
+
+		while (!list_empty(head)) {
+			bool success;
+			struct page *page = lru_to_page(head);
+
+			VM_BUG_ON_PAGE(PageTail(page), page);
+			VM_BUG_ON_PAGE(PageUnevictable(page), page);
+			VM_BUG_ON_PAGE(PageActive(page) != active, page);
+			VM_BUG_ON_PAGE(page_lru_gen(page) != -1, page);
+			VM_BUG_ON_PAGE(page_is_file_lru(page) != file, page);
+
+			prefetchw_prev_lru_page(page, head, flags);
+
+			del_page_from_lru_list(page, lruvec);
+			success = page_set_lru_gen(page, lruvec, true);
+			VM_BUG_ON(!success);
+
+			if (++batch_size == MAX_BATCH_SIZE)
+				return false;
+		}
+	}
+
+	return true;
+}
+
+static bool drain_lru_gen_lists(struct lruvec *lruvec)
+{
+	int gen, file, zone;
+	int batch_size = 0;
+
+	for_each_gen_type_zone(gen, file, zone) {
+		struct list_head *head = &lruvec->evictable.lists[gen][file][zone];
+
+		if (lruvec->evictable.enabled[file])
+			continue;
+
+		while (!list_empty(head)) {
+			bool success;
+			struct page *page = lru_to_page(head);
+
+			VM_BUG_ON_PAGE(PageTail(page), page);
+			VM_BUG_ON_PAGE(PageUnevictable(page), page);
+			VM_BUG_ON_PAGE(PageActive(page), page);
+			VM_BUG_ON_PAGE(page_is_file_lru(page) != file, page);
+			VM_BUG_ON_PAGE(page_zonenum(page) != zone, page);
+
+			prefetchw_prev_lru_page(page, head, flags);
+
+			success = page_clear_lru_gen(page, lruvec);
+			VM_BUG_ON(!success);
+			add_page_to_lru_list(page, lruvec);
+
+			if (++batch_size == MAX_BATCH_SIZE)
+				return false;
+		}
+	}
+
+	return true;
+}
+
+static bool __maybe_unused state_is_valid(struct lruvec *lruvec)
+{
+	int gen, file, zone;
+	enum lru_list lru;
+
+	for_each_evictable_lru(lru) {
+		file = is_file_lru(lru);
+
+		if (lruvec->evictable.enabled[file] &&
+		    !list_empty(&lruvec->lists[lru]))
+			return false;
+	}
+
+	for_each_gen_type_zone(gen, file, zone) {
+		if (!lruvec->evictable.enabled[file] &&
+		    !list_empty(&lruvec->evictable.lists[gen][file][zone]))
+			return false;
+
+		VM_WARN_ONCE(!lruvec->evictable.enabled[file] &&
+			     lruvec->evictable.sizes[gen][file][zone],
+			     "lru_gen: possible unbalanced number of pages");
+	}
+
+	return true;
+}
+
+/*
+ * We enable/disable file multigenerational lru according to the main switch.
+ *
+ * For anon multigenerational lru, we only enable it when the main switch is on
+ * and there is at least one swapfile; we disable it when there is no swapfile
+ * regardless of the value of the main switch. Otherwise, we may eventually
+ * run out of generation numbers and have to call inc_min_seq(), which brings
+ * an unnecessary cost.
+ */
+void lru_gen_set_state(bool enable, bool main, bool swap)
+{
+	struct mem_cgroup *memcg;
+
+	mem_hotplug_begin();
+	mutex_lock(&lru_gen_state_mutex);
+	cgroup_lock();
+
+	main = main && enable != lru_gen_enabled();
+	swap = swap && !(enable ? lru_gen_nr_swapfiles++ : --lru_gen_nr_swapfiles);
+	swap = swap && lru_gen_enabled();
+	if (!main && !swap)
+		goto unlock;
+
+	if (main) {
+		if (enable)
+			static_branch_enable(&lru_gen_static_key);
+		else
+			static_branch_disable(&lru_gen_static_key);
+	}
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		int nid;
+
+		for_each_node_state(nid, N_MEMORY) {
+			struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+
+			spin_lock_irq(&lruvec->lru_lock);
+
+			VM_BUG_ON(!seq_is_valid(lruvec));
+			VM_BUG_ON(!state_is_valid(lruvec));
+
+			WRITE_ONCE(lruvec->evictable.enabled[0],
+				   lru_gen_enabled() && lru_gen_nr_swapfiles);
+			WRITE_ONCE(lruvec->evictable.enabled[1],
+				   lru_gen_enabled());
+
+			while (!(enable ? fill_lru_gen_lists(lruvec) :
+					  drain_lru_gen_lists(lruvec))) {
+				spin_unlock_irq(&lruvec->lru_lock);
+				cond_resched();
+				spin_lock_irq(&lruvec->lru_lock);
+			}
+
+			spin_unlock_irq(&lruvec->lru_lock);
+		}
+
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+unlock:
+	cgroup_unlock();
+	mutex_unlock(&lru_gen_state_mutex);
+	mem_hotplug_done();
+}
+
+static int __meminit __maybe_unused
+lru_gen_online_mem(struct notifier_block *self, unsigned long action, void *arg)
+{
+	struct mem_cgroup *memcg;
+	struct memory_notify *mnb = arg;
+	int nid = mnb->status_change_nid;
+
+	if (action != MEM_GOING_ONLINE || nid == NUMA_NO_NODE)
+		return NOTIFY_DONE;
+
+	mutex_lock(&lru_gen_state_mutex);
+	cgroup_lock();
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+
+		VM_BUG_ON(!seq_is_valid(lruvec));
+		VM_BUG_ON(!state_is_valid(lruvec));
+
+		WRITE_ONCE(lruvec->evictable.enabled[0],
+			   lru_gen_enabled() && lru_gen_nr_swapfiles);
+		WRITE_ONCE(lruvec->evictable.enabled[1],
+			   lru_gen_enabled());
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+	cgroup_unlock();
+	mutex_unlock(&lru_gen_state_mutex);
+
+	return NOTIFY_DONE;
+}
+
 /******************************************************************************
  *                          initialization
  ******************************************************************************/
 
+void lru_gen_init_lruvec(struct lruvec *lruvec)
+{
+	int i;
+	int gen, file, zone;
+
+#ifdef CONFIG_MEMCG
+	atomic_set(&lruvec->evictable.priority, DEF_PRIORITY);
+#endif
+
+	lruvec->evictable.max_seq = MIN_NR_GENS;
+	lruvec->evictable.enabled[0] = lru_gen_enabled() && lru_gen_nr_swapfiles;
+	lruvec->evictable.enabled[1] = lru_gen_enabled();
+
+	for (i = 0; i <= MIN_NR_GENS; i++)
+		lruvec->evictable.timestamps[i] = jiffies;
+
+	for_each_gen_type_zone(gen, file, zone)
+		INIT_LIST_HEAD(&lruvec->evictable.lists[gen][file][zone]);
+}
+
 static int __init init_lru_gen(void)
 {
+	BUILD_BUG_ON(MAX_NR_GENS <= MIN_NR_GENS);
+	BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
+
 	if (mem_cgroup_disabled()) {
 		global_mm_list = alloc_mm_list();
 		if (!global_mm_list) {
@@ -4567,6 +5819,9 @@ static int __init init_lru_gen(void)
 		}
 	}
 
+	if (hotplug_memory_notifier(lru_gen_online_mem, 0))
+		pr_err("lru_gen: failed to subscribe hotplug notifications\n");
+
 	return 0;
 };
 /*
-- 
2.31.0.rc2.261.g7f71774620-goog



^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v1 11/14] mm: multigenerational lru: page activation
  2021-03-13  7:57 [PATCH v1 00/14] Multigenerational LRU Yu Zhao
                   ` (9 preceding siblings ...)
  2021-03-13  7:57 ` [PATCH v1 10/14] mm: multigenerational lru: core Yu Zhao
@ 2021-03-13  7:57 ` Yu Zhao
  2021-03-16 16:34   ` Matthew Wilcox
  2021-03-13  7:57 ` [PATCH v1 12/14] mm: multigenerational lru: user space interface Yu Zhao
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 65+ messages in thread
From: Yu Zhao @ 2021-03-13  7:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim, Yu Zhao

In the page fault path, we want to add pages to the per-zone lists
indexed by max_seq, as they cannot be evicted without going through
the aging first. For anon pages, we rename
lru_cache_add_inactive_or_unevictable() to lru_cache_add_page_vma()
and add a new parameter, which is set to true in the page fault path,
to indicate whether they should be added to the per-zone lists indexed
by max_seq. For page/swap cache, since we cannot differentiate the
page fault path from the readahead path at the time we call
lru_cache_add() in add_to_page_cache_lru() and
__read_swap_cache_async(), we have to add a new function
lru_gen_activate_page(), which is essentially activate_page(), to move
pages to the per-zone lists indexed by max_seq at a later time.
Ideally, the pages we want to activate are still in lru_pvecs.lru_add,
so we can simply set PageActive() on them without having to actually
move them.

In the reclaim path, pages mapped around a referenced PTE may also
have been referenced due to spatial locality. We add a new function
lru_gen_scan_around() to scan the vicinity of such a PTE.

In addition, we add a new function page_is_active() to tell whether a
page is active. We cannot use PageActive() because it is only set on
active pages that are not on the multigenerational lru; it is cleared
while pages are on the multigenerational lru, in order to spare the
aging the trouble of clearing it when an active generation becomes
inactive. Internally, page_is_active() compares the generation number
of a page with max_seq and max_seq-1, which are the active generations
and therefore protected from the eviction. Other generations, which
may or may not exist, are inactive.
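
As a rough summary, and omitting the NULL-lruvec path that takes
rcu_read_lock(), page_is_active() boils down to the following sketch
(a condensed illustration, not the exact helper added below):

  static inline bool page_is_active_sketch(struct page *page,
					   struct lruvec *lruvec)
  {
	int gen = page_lru_gen(page);

	/* not on the multigenerational lru: fall back to the page flag */
	if (gen < 0)
		return PageActive(page);

	/* on it: active means the generation is max_seq or max_seq-1 */
	return lru_gen_is_active(lruvec, gen);
  }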

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 fs/proc/task_mmu.c        |  3 ++-
 include/linux/mm_inline.h | 52 ++++++++++++++++++++++++++++++++++++++
 include/linux/mmzone.h    |  6 +++++
 include/linux/swap.h      |  4 +--
 kernel/events/uprobes.c   |  2 +-
 mm/huge_memory.c          |  2 +-
 mm/khugepaged.c           |  2 +-
 mm/memory.c               | 14 +++++++----
 mm/migrate.c              |  2 +-
 mm/rmap.c                 |  6 +++++
 mm/swap.c                 | 26 +++++++++++--------
 mm/swapfile.c             |  2 +-
 mm/userfaultfd.c          |  2 +-
 mm/vmscan.c               | 53 ++++++++++++++++++++++++++++++++++++++-
 14 files changed, 150 insertions(+), 26 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3cec6fbef725..7cd173710e76 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -19,6 +19,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/uaccess.h>
 #include <linux/pkeys.h>
+#include <linux/mm_inline.h>
 
 #include <asm/elf.h>
 #include <asm/tlb.h>
@@ -1720,7 +1721,7 @@ static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty,
 	if (PageSwapCache(page))
 		md->swapcache += nr_pages;
 
-	if (PageActive(page) || PageUnevictable(page))
+	if (PageUnevictable(page) || page_is_active(compound_head(page), NULL))
 		md->active += nr_pages;
 
 	if (PageWriteback(page))
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 2d306cab36bc..a1a382418fc4 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -116,6 +116,49 @@ static inline int page_lru_gen(struct page *page)
 	return ((READ_ONCE(page->flags) & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
 }
 
+/* This function works regardless whether multigenerational lru is enabled. */
+static inline bool page_is_active(struct page *page, struct lruvec *lruvec)
+{
+	struct mem_cgroup *memcg;
+	int gen = page_lru_gen(page);
+	bool active = false;
+
+	VM_BUG_ON_PAGE(PageTail(page), page);
+
+	if (gen < 0)
+		return PageActive(page);
+
+	if (lruvec) {
+		VM_BUG_ON_PAGE(PageUnevictable(page), page);
+		VM_BUG_ON_PAGE(PageActive(page), page);
+		lockdep_assert_held(&lruvec->lru_lock);
+
+		return lru_gen_is_active(lruvec, gen);
+	}
+
+	rcu_read_lock();
+
+	memcg = page_memcg_rcu(page);
+	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
+	active = lru_gen_is_active(lruvec, gen);
+
+	rcu_read_unlock();
+
+	return active;
+}
+
+/* Activate a page from page cache or swap cache after it's mapped. */
+static inline void lru_gen_activate_page(struct page *page, struct vm_area_struct *vma)
+{
+	if (!lru_gen_enabled() || PageActive(page))
+		return;
+
+	if (vma->vm_flags & (VM_LOCKED | VM_SPECIAL | VM_HUGETLB))
+		return;
+
+	activate_page(page);
+}
+
 /* Update multigenerational lru sizes in addition to active/inactive lru sizes. */
 static inline void lru_gen_update_size(struct page *page, struct lruvec *lruvec,
 				       int old_gen, int new_gen)
@@ -252,6 +295,15 @@ static inline bool lru_gen_enabled(void)
 	return false;
 }
 
+static inline bool page_is_active(struct page *page, struct lruvec *lruvec)
+{
+	return PageActive(page);
+}
+
+static inline void lru_gen_activate_page(struct page *page, struct vm_area_struct *vma)
+{
+}
+
 static inline bool page_set_lru_gen(struct page *page, struct lruvec *lruvec, bool front)
 {
 	return false;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 173083bb846e..99156602cd06 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -292,6 +292,7 @@ enum lruvec_flags {
 };
 
 struct lruvec;
+struct page_vma_mapped_walk;
 
 #define LRU_GEN_MASK	((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
 
@@ -328,6 +329,7 @@ struct lru_gen {
 
 void lru_gen_init_lruvec(struct lruvec *lruvec);
 void lru_gen_set_state(bool enable, bool main, bool swap);
+void lru_gen_scan_around(struct page_vma_mapped_walk *pvmw);
 
 #else /* CONFIG_LRU_GEN */
 
@@ -339,6 +341,10 @@ static inline void lru_gen_set_state(bool enable, bool main, bool swap)
 {
 }
 
+static inline void lru_gen_scan_around(struct page_vma_mapped_walk *pvmw)
+{
+}
+
 #endif /* CONFIG_LRU_GEN */
 
 struct lruvec {
diff --git a/include/linux/swap.h b/include/linux/swap.h
index de2bbbf181ba..0e7532c7db22 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -350,8 +350,8 @@ extern void deactivate_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
-extern void lru_cache_add_inactive_or_unevictable(struct page *page,
-						struct vm_area_struct *vma);
+extern void lru_cache_add_page_vma(struct page *page, struct vm_area_struct *vma,
+				   bool faulting);
 
 /* linux/mm/vmscan.c */
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6addc9780319..4e93e5602723 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -184,7 +184,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	if (new_page) {
 		get_page(new_page);
 		page_add_new_anon_rmap(new_page, vma, addr, false);
-		lru_cache_add_inactive_or_unevictable(new_page, vma);
+		lru_cache_add_page_vma(new_page, vma, false);
 	} else
 		/* no new page, just dec_mm_counter for old_page */
 		dec_mm_counter(mm, MM_ANONPAGES);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index be9bf681313c..62e14da5264e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -637,7 +637,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		page_add_new_anon_rmap(page, vma, haddr, true);
-		lru_cache_add_inactive_or_unevictable(page, vma);
+		lru_cache_add_page_vma(page, vma, true);
 		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
 		update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a7d6cb912b05..08a43910f232 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1199,7 +1199,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
 	page_add_new_anon_rmap(new_page, vma, address, true);
-	lru_cache_add_inactive_or_unevictable(new_page, vma);
+	lru_cache_add_page_vma(new_page, vma, true);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
diff --git a/mm/memory.c b/mm/memory.c
index c8e357627318..7188607bddb9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -73,6 +73,7 @@
 #include <linux/perf_event.h>
 #include <linux/ptrace.h>
 #include <linux/vmalloc.h>
+#include <linux/mm_inline.h>
 
 #include <trace/events/kmem.h>
 
@@ -845,7 +846,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 	copy_user_highpage(new_page, page, addr, src_vma);
 	__SetPageUptodate(new_page);
 	page_add_new_anon_rmap(new_page, dst_vma, addr, false);
-	lru_cache_add_inactive_or_unevictable(new_page, dst_vma);
+	lru_cache_add_page_vma(new_page, dst_vma, false);
 	rss[mm_counter(new_page)]++;
 
 	/* All done, just insert the new page copy in the child */
@@ -2913,7 +2914,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		 */
 		ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
 		page_add_new_anon_rmap(new_page, vma, vmf->address, false);
-		lru_cache_add_inactive_or_unevictable(new_page, vma);
+		lru_cache_add_page_vma(new_page, vma, true);
 		/*
 		 * We call the notify macro here because, when using secondary
 		 * mmu page tables (such as kvm shadow page tables), we want the
@@ -3436,9 +3437,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	/* ksm created a completely new copy */
 	if (unlikely(page != swapcache && swapcache)) {
 		page_add_new_anon_rmap(page, vma, vmf->address, false);
-		lru_cache_add_inactive_or_unevictable(page, vma);
+		lru_cache_add_page_vma(page, vma, true);
 	} else {
 		do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
+		lru_gen_activate_page(page, vma);
 	}
 
 	swap_free(entry);
@@ -3582,7 +3584,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, vmf->address, false);
-	lru_cache_add_inactive_or_unevictable(page, vma);
+	lru_cache_add_page_vma(page, vma, true);
 setpte:
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
 
@@ -3707,6 +3709,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 
 	add_mm_counter(vma->vm_mm, mm_counter_file(page), HPAGE_PMD_NR);
 	page_add_file_rmap(page, true);
+	lru_gen_activate_page(page, vma);
 	/*
 	 * deposit and withdraw with pmd lock held
 	 */
@@ -3750,10 +3753,11 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 	if (write && !(vma->vm_flags & VM_SHARED)) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 		page_add_new_anon_rmap(page, vma, addr, false);
-		lru_cache_add_inactive_or_unevictable(page, vma);
+		lru_cache_add_page_vma(page, vma, true);
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
 		page_add_file_rmap(page, false);
+		lru_gen_activate_page(page, vma);
 	}
 	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index 62b81d5257aa..1064b03cac33 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -3004,7 +3004,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	inc_mm_counter(mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, addr, false);
 	if (!is_zone_device_page(page))
-		lru_cache_add_inactive_or_unevictable(page, vma);
+		lru_cache_add_page_vma(page, vma, false);
 	get_page(page);
 
 	if (flush) {
diff --git a/mm/rmap.c b/mm/rmap.c
index b0fc27e77d6d..a44f9ee74ee1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -72,6 +72,7 @@
 #include <linux/page_idle.h>
 #include <linux/memremap.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/mm_inline.h>
 
 #include <asm/tlbflush.h>
 
@@ -792,6 +793,11 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		}
 
 		if (pvmw.pte) {
+			/* multigenerational lru exploits spatial locality */
+			if (lru_gen_enabled() && pte_young(*pvmw.pte)) {
+				lru_gen_scan_around(&pvmw);
+				referenced++;
+			}
 			if (ptep_clear_flush_young_notify(vma, address,
 						pvmw.pte)) {
 				/*
diff --git a/mm/swap.c b/mm/swap.c
index bd10efe00684..7aa85004b490 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -310,7 +310,7 @@ void lru_note_cost_page(struct page *page)
 
 static void __activate_page(struct page *page, struct lruvec *lruvec)
 {
-	if (!PageActive(page) && !PageUnevictable(page)) {
+	if (!PageUnevictable(page) && !page_is_active(page, lruvec)) {
 		int nr_pages = thp_nr_pages(page);
 
 		del_page_from_lru_list(page, lruvec);
@@ -341,7 +341,7 @@ static bool need_activate_page_drain(int cpu)
 static void activate_page_on_lru(struct page *page)
 {
 	page = compound_head(page);
-	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
+	if (PageLRU(page) && !PageUnevictable(page) && !page_is_active(page, NULL)) {
 		struct pagevec *pvec;
 
 		local_lock(&lru_pvecs.lock);
@@ -435,7 +435,7 @@ void mark_page_accessed(struct page *page)
 		 * this list is never rotated or maintained, so marking an
 		 * evictable page accessed has no effect.
 		 */
-	} else if (!PageActive(page)) {
+	} else if (!page_is_active(page, NULL)) {
 		activate_page(page);
 		ClearPageReferenced(page);
 		workingset_activation(page);
@@ -471,15 +471,14 @@ void lru_cache_add(struct page *page)
 EXPORT_SYMBOL(lru_cache_add);
 
 /**
- * lru_cache_add_inactive_or_unevictable
+ * lru_cache_add_page_vma
  * @page:  the page to be added to LRU
  * @vma:   vma in which page is mapped for determining reclaimability
  *
- * Place @page on the inactive or unevictable LRU list, depending on its
- * evictability.
+ * Place @page on an LRU list, depending on its evictability.
  */
-void lru_cache_add_inactive_or_unevictable(struct page *page,
-					 struct vm_area_struct *vma)
+void lru_cache_add_page_vma(struct page *page, struct vm_area_struct *vma,
+			    bool faulting)
 {
 	bool unevictable;
 
@@ -496,6 +495,11 @@ void lru_cache_add_inactive_or_unevictable(struct page *page,
 		__mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
 		count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
 	}
+
+	/* multigenerational lru uses PageActive() to track page faults */
+	if (lru_gen_enabled() && !unevictable && faulting)
+		SetPageActive(page);
+
 	lru_cache_add(page);
 }
 
@@ -522,7 +526,7 @@ void lru_cache_add_inactive_or_unevictable(struct page *page,
  */
 static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 {
-	bool active = PageActive(page);
+	bool active = page_is_active(page, lruvec);
 	int nr_pages = thp_nr_pages(page);
 
 	if (PageUnevictable(page))
@@ -562,7 +566,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 
 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageActive(page) && !PageUnevictable(page)) {
+	if (!PageUnevictable(page) && page_is_active(page, lruvec)) {
 		int nr_pages = thp_nr_pages(page);
 
 		del_page_from_lru_list(page, lruvec);
@@ -676,7 +680,7 @@ void deactivate_file_page(struct page *page)
  */
 void deactivate_page(struct page *page)
 {
-	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+	if (PageLRU(page) && !PageUnevictable(page) && page_is_active(page, NULL)) {
 		struct pagevec *pvec;
 
 		local_lock(&lru_pvecs.lock);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index fe03cfeaa08f..c0956b3bde03 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1936,7 +1936,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		page_add_anon_rmap(page, vma, addr, false);
 	} else { /* ksm created a completely new copy */
 		page_add_new_anon_rmap(page, vma, addr, false);
-		lru_cache_add_inactive_or_unevictable(page, vma);
+		lru_cache_add_page_vma(page, vma, false);
 	}
 	swap_free(entry);
 out:
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9a3d451402d7..e1d4cd3103b8 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -123,7 +123,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 
 	inc_mm_counter(dst_mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
-	lru_cache_add_inactive_or_unevictable(page, dst_vma);
+	lru_cache_add_page_vma(page, dst_vma, true);
 
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fd49a9a5d7f5..ce868d89dc53 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1876,7 +1876,7 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		add_page_to_lru_list(page, lruvec);
 		nr_pages = thp_nr_pages(page);
 		nr_moved += nr_pages;
-		if (PageActive(page))
+		if (page_is_active(page, lruvec))
 			workingset_age_nonresident(lruvec, nr_pages);
 	}
 
@@ -4688,6 +4688,57 @@ static int page_update_lru_gen(struct page *page, int new_gen)
 	return old_gen;
 }
 
+void lru_gen_scan_around(struct page_vma_mapped_walk *pvmw)
+{
+	pte_t *pte;
+	unsigned long start, end;
+	int old_gen, new_gen;
+	unsigned long flags;
+	struct lruvec *lruvec;
+	struct mem_cgroup *memcg;
+	struct pglist_data *pgdat = page_pgdat(pvmw->page);
+
+	lockdep_assert_held(pvmw->ptl);
+	VM_BUG_ON_VMA(pvmw->address < pvmw->vma->vm_start, pvmw->vma);
+
+	start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start);
+	end = pmd_addr_end(pvmw->address, pvmw->vma->vm_end);
+	pte = pvmw->pte - ((pvmw->address - start) >> PAGE_SHIFT);
+
+	memcg = lock_page_memcg(pvmw->page);
+	lruvec = lock_page_lruvec_irqsave(pvmw->page, &flags);
+
+	new_gen = lru_gen_from_seq(lruvec->evictable.max_seq);
+
+	for (; start != end; pte++, start += PAGE_SIZE) {
+		struct page *page;
+		unsigned long pfn = pte_pfn(*pte);
+
+		if (!pte_present(*pte) || !pte_young(*pte) || is_zero_pfn(pfn))
+			continue;
+
+		if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+			continue;
+
+		page = compound_head(pte_page(*pte));
+		if (page_to_nid(page) != pgdat->node_id)
+			continue;
+		if (page_memcg_rcu(page) != memcg)
+			continue;
+		/*
+		 * We may be holding many locks. So try to finish as fast as
+		 * possible and leave the accessed and the dirty bits to the
+		 * page table walk.
+		 */
+		old_gen = page_update_lru_gen(page, new_gen);
+		if (old_gen >= 0 && old_gen != new_gen)
+			lru_gen_update_size(page, lruvec, old_gen, new_gen);
+	}
+
+	unlock_page_lruvec_irqrestore(lruvec, flags);
+	unlock_page_memcg(pvmw->page);
+}
+
 struct mm_walk_args {
 	struct mem_cgroup *memcg;
 	unsigned long max_seq;
-- 
2.31.0.rc2.261.g7f71774620-goog



^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v1 12/14] mm: multigenerational lru: user space interface
  2021-03-13  7:57 [PATCH v1 00/14] Multigenerational LRU Yu Zhao
                   ` (10 preceding siblings ...)
  2021-03-13  7:57 ` [PATCH v1 11/14] mm: multigenerational lru: page activation Yu Zhao
@ 2021-03-13  7:57 ` Yu Zhao
  2021-03-13 12:23   ` kernel test robot
  2021-03-13  7:57 ` [PATCH v1 13/14] mm: multigenerational lru: Kconfig Yu Zhao
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 65+ messages in thread
From: Yu Zhao @ 2021-03-13  7:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim, Yu Zhao

Add a sysfs file /sys/kernel/mm/lru_gen/enabled so user space can
enable and disable multigenerational lru at runtime.

Add a sysfs file /sys/kernel/mm/lru_gen/spread so user space can
spread pages out across multiple generations. More generations make
the background aging more aggressive.

Add a debugfs file /sys/kernel/debug/lru_gen so user space can monitor
multigenerational lru and trigger the aging and the eviction. This
file has the following output:
  memcg  memcg_id  memcg_path
    node  node_id
      min_gen  birth_time  anon_size  file_size
      ...
      max_gen  birth_time  anon_size  file_size

Given a memcg and a node, "min_gen" is the oldest generation (number)
and "max_gen" is the youngest. Birth time is in milliseconds. Anon and
file sizes are in pages.

Write "+ memcg_id node_id gen [swappiness]" to this file to account
referenced pages to generation "max_gen" and create next generation
"max_gen"+1. "gen" must be equal to "max_gen" in order to avoid races.
A swap file and a non-zero swappiness value are required to scan anon
pages. If swapping is not desired, set vm.swappiness to 0 and
overwrite it with a non-zero "swappiness".

Write "- memcg_id node_id gen [swappiness] [nr_to_reclaim]" to this
file to evict generations less than or equal to "gen". "gen" must be
less than "max_gen"-1 as "max_gen" and "max_gen"-1 are active
generations and therefore protected from the eviction. "nr_to_reclaim"
can be used to limit the number of pages to be evicted.

Multiple command lines are supported, as is concatenation with the
delimiters "," and ";".

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 mm/vmscan.c | 334 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 334 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ce868d89dc53..b59b556e9587 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -51,6 +51,7 @@
 #include <linux/psi.h>
 #include <linux/pagewalk.h>
 #include <linux/memory.h>
+#include <linux/debugfs.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -5833,6 +5834,334 @@ lru_gen_online_mem(struct notifier_block *self, unsigned long action, void *arg)
 	return NOTIFY_DONE;
 }
 
+/******************************************************************************
+ *                          sysfs interface
+ ******************************************************************************/
+
+static ssize_t show_lru_gen_spread(struct kobject *kobj, struct kobj_attribute *attr,
+				   char *buf)
+{
+	return sprintf(buf, "%d\n", READ_ONCE(lru_gen_spread));
+}
+
+static ssize_t store_lru_gen_spread(struct kobject *kobj, struct kobj_attribute *attr,
+				    const char *buf, size_t len)
+{
+	int spread;
+
+	if (kstrtoint(buf, 10, &spread) || spread >= MAX_NR_GENS)
+		return -EINVAL;
+
+	WRITE_ONCE(lru_gen_spread, spread);
+
+	return len;
+}
+
+static struct kobj_attribute lru_gen_spread_attr = __ATTR(
+	spread, 0644,
+	show_lru_gen_spread, store_lru_gen_spread
+);
+
+static ssize_t show_lru_gen_enabled(struct kobject *kobj, struct kobj_attribute *attr,
+				    char *buf)
+{
+	return snprintf(buf, PAGE_SIZE, "%ld\n", lru_gen_enabled());
+}
+
+static ssize_t store_lru_gen_enabled(struct kobject *kobj, struct kobj_attribute *attr,
+				     const char *buf, size_t len)
+{
+	int enable;
+
+	if (kstrtoint(buf, 10, &enable))
+		return -EINVAL;
+
+	lru_gen_set_state(enable, true, false);
+
+	return len;
+}
+
+static struct kobj_attribute lru_gen_enabled_attr = __ATTR(
+	enabled, 0644, show_lru_gen_enabled, store_lru_gen_enabled
+);
+
+static struct attribute *lru_gen_attrs[] = {
+	&lru_gen_spread_attr.attr,
+	&lru_gen_enabled_attr.attr,
+	NULL
+};
+
+static struct attribute_group lru_gen_attr_group = {
+	.name = "lru_gen",
+	.attrs = lru_gen_attrs,
+};
+
+/******************************************************************************
+ *                          debugfs interface
+ ******************************************************************************/
+
+static void *lru_gen_seq_start(struct seq_file *m, loff_t *pos)
+{
+	struct mem_cgroup *memcg;
+	loff_t nr_to_skip = *pos;
+
+	m->private = kzalloc(PATH_MAX, GFP_KERNEL);
+	if (!m->private)
+		return ERR_PTR(-ENOMEM);
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		int nid;
+
+		for_each_node_state(nid, N_MEMORY) {
+			if (!nr_to_skip--)
+				return mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+		}
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+	return NULL;
+}
+
+static void lru_gen_seq_stop(struct seq_file *m, void *v)
+{
+	if (!IS_ERR_OR_NULL(v))
+		mem_cgroup_iter_break(NULL, lruvec_memcg(v));
+
+	kfree(m->private);
+	m->private = NULL;
+}
+
+static void *lru_gen_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	int nid = lruvec_pgdat(v)->node_id;
+	struct mem_cgroup *memcg = lruvec_memcg(v);
+
+	++*pos;
+
+	nid = next_memory_node(nid);
+	if (nid == MAX_NUMNODES) {
+		memcg = mem_cgroup_iter(NULL, memcg, NULL);
+		if (!memcg)
+			return NULL;
+
+		nid = first_memory_node;
+	}
+
+	return mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+}
+
+static int lru_gen_seq_show(struct seq_file *m, void *v)
+{
+	unsigned long seq;
+	struct lruvec *lruvec = v;
+	int nid = lruvec_pgdat(lruvec)->node_id;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	if (nid == first_memory_node) {
+#ifdef CONFIG_MEMCG
+		if (memcg)
+			cgroup_path(memcg->css.cgroup, m->private, PATH_MAX);
+#endif
+		seq_printf(m, "memcg %5hu %s\n",
+			   mem_cgroup_id(memcg), (char *)m->private);
+	}
+
+	seq_printf(m, "  node %4d\n", nid);
+
+	for (seq = min(min_seq[0], min_seq[1]); seq <= max_seq; seq++) {
+		int gen, file, zone;
+		unsigned int msecs;
+		long sizes[ANON_AND_FILE] = {};
+
+		gen = lru_gen_from_seq(seq);
+
+		msecs = jiffies_to_msecs(jiffies - READ_ONCE(
+				lruvec->evictable.timestamps[gen]));
+
+		for_each_type_zone(file, zone)
+			sizes[file] += READ_ONCE(
+				lruvec->evictable.sizes[gen][file][zone]);
+
+		sizes[0] = max(sizes[0], 0L);
+		sizes[1] = max(sizes[1], 0L);
+
+		seq_printf(m, "%11lu %9u %9lu %9lu\n",
+			   seq, msecs, sizes[0], sizes[1]);
+	}
+
+	return 0;
+}
+
+static const struct seq_operations lru_gen_seq_ops = {
+	.start = lru_gen_seq_start,
+	.stop = lru_gen_seq_stop,
+	.next = lru_gen_seq_next,
+	.show = lru_gen_seq_show,
+};
+
+static int lru_gen_debugfs_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &lru_gen_seq_ops);
+}
+
+static int advance_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness)
+{
+	struct scan_control sc = {
+		.target_mem_cgroup = lruvec_memcg(lruvec),
+	};
+	DEFINE_MAX_SEQ(lruvec);
+
+	if (seq == max_seq)
+		walk_mm_list(lruvec, max_seq, &sc, swappiness);
+
+	return seq > max_seq ? -EINVAL : 0;
+}
+
+static int advance_min_seq(struct lruvec *lruvec, unsigned long seq, int swappiness,
+			   unsigned long nr_to_reclaim)
+{
+	struct blk_plug plug;
+	int err = -EINTR;
+	long nr_to_scan = LONG_MAX;
+	struct scan_control sc = {
+		.nr_to_reclaim = nr_to_reclaim,
+		.target_mem_cgroup = lruvec_memcg(lruvec),
+		.may_writepage = 1,
+		.may_unmap = 1,
+		.may_swap = 1,
+		.reclaim_idx = MAX_NR_ZONES - 1,
+		.gfp_mask = GFP_KERNEL,
+	};
+	DEFINE_MAX_SEQ(lruvec);
+
+	if (seq >= max_seq - 1)
+		return -EINVAL;
+
+	blk_start_plug(&plug);
+
+	while (!signal_pending(current)) {
+		DEFINE_MIN_SEQ(lruvec);
+
+		if (seq < min(min_seq[!swappiness], min_seq[swappiness < 200]) ||
+		    !evict_lru_gen_pages(lruvec, &sc, swappiness, &nr_to_scan)) {
+			err = 0;
+			break;
+		}
+
+		cond_resched();
+	}
+
+	blk_finish_plug(&plug);
+
+	return err;
+}
+
+static int advance_seq(char cmd, int memcg_id, int nid, unsigned long seq,
+		       int swappiness, unsigned long nr_to_reclaim)
+{
+	struct lruvec *lruvec;
+	int err = -EINVAL;
+	struct mem_cgroup *memcg = NULL;
+
+	if (!mem_cgroup_disabled()) {
+		rcu_read_lock();
+		memcg = mem_cgroup_from_id(memcg_id);
+#ifdef CONFIG_MEMCG
+		if (memcg && !css_tryget(&memcg->css))
+			memcg = NULL;
+#endif
+		rcu_read_unlock();
+
+		if (!memcg)
+			goto done;
+	}
+	if (memcg_id != mem_cgroup_id(memcg))
+		goto done;
+
+	if (nid < 0 || nid >= MAX_NUMNODES || !node_state(nid, N_MEMORY))
+		goto done;
+
+	lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+
+	if (swappiness == -1)
+		swappiness = get_swappiness(lruvec);
+	else if (swappiness > 200U)
+		goto done;
+
+	switch (cmd) {
+	case '+':
+		err = advance_max_seq(lruvec, seq, swappiness);
+		break;
+	case '-':
+		err = advance_min_seq(lruvec, seq, swappiness, nr_to_reclaim);
+		break;
+	}
+done:
+	mem_cgroup_put(memcg);
+
+	return err;
+}
+
+static ssize_t lru_gen_debugfs_write(struct file *file, const char __user *src,
+				     size_t len, loff_t *pos)
+{
+	void *buf;
+	char *cur, *next;
+	int err = 0;
+
+	buf = kvmalloc(len + 1, GFP_USER);
+	if (!buf)
+		return -ENOMEM;
+
+	if (copy_from_user(buf, src, len)) {
+		kvfree(buf);
+		return -EFAULT;
+	}
+
+	next = buf;
+	next[len] = '\0';
+
+	while ((cur = strsep(&next, ",;\n"))) {
+		int n;
+		int end;
+		char cmd;
+		int memcg_id;
+		int nid;
+		unsigned long seq;
+		int swappiness = -1;
+		unsigned long nr_to_reclaim = -1;
+
+		cur = skip_spaces(cur);
+		if (!*cur)
+			continue;
+
+		n = sscanf(cur, "%c %u %u %lu %n %u %n %lu %n", &cmd, &memcg_id, &nid,
+			   &seq, &end, &swappiness, &end, &nr_to_reclaim, &end);
+		if (n < 4 || cur[end]) {
+			err = -EINVAL;
+			break;
+		}
+
+		err = advance_seq(cmd, memcg_id, nid, seq, swappiness, nr_to_reclaim);
+		if (err)
+			break;
+	}
+
+	kvfree(buf);
+
+	return err ? : len;
+}
+
+static const struct file_operations lru_gen_debugfs_ops = {
+	.open = lru_gen_debugfs_open,
+	.read = seq_read,
+	.write = lru_gen_debugfs_write,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
 /******************************************************************************
  *                          initialization
  ******************************************************************************/
@@ -5873,6 +6202,11 @@ static int __init init_lru_gen(void)
 	if (hotplug_memory_notifier(lru_gen_online_mem, 0))
 		pr_err("lru_gen: failed to subscribe hotplug notifications\n");
 
+	if (sysfs_create_group(mm_kobj, &lru_gen_attr_group))
+		pr_err("lru_gen: failed to create sysfs group\n");
+
+	debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_debugfs_ops);
+
 	return 0;
 };
 /*
-- 
2.31.0.rc2.261.g7f71774620-goog



^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v1 13/14] mm: multigenerational lru: Kconfig
  2021-03-13  7:57 [PATCH v1 00/14] Multigenerational LRU Yu Zhao
                   ` (11 preceding siblings ...)
  2021-03-13  7:57 ` [PATCH v1 12/14] mm: multigenerational lru: user space interface Yu Zhao
@ 2021-03-13  7:57 ` Yu Zhao
  2021-03-13 12:53   ` kernel test robot
  2021-03-13 13:36   ` kernel test robot
  2021-03-13  7:57 ` [PATCH v1 14/14] mm: multigenerational lru: documentation Yu Zhao
                   ` (4 subsequent siblings)
  17 siblings, 2 replies; 65+ messages in thread
From: Yu Zhao @ 2021-03-13  7:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim, Yu Zhao

Add the configuration options for multigenerational lru: CONFIG_LRU_GEN
to build the feature, CONFIG_NR_LRU_GENS to set the maximum number of
generations and CONFIG_LRU_GEN_ENABLED to turn it on by default.

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 mm/Kconfig | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 24c045b24b95..3a5bcc2d7a45 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -872,4 +872,33 @@ config MAPPING_DIRTY_HELPERS
 config KMAP_LOCAL
 	bool
 
+config LRU_GEN
+	bool "Multigenerational LRU"
+	depends on MMU
+	help
+	  A high performance multigenerational LRU for heavily overcommitting
+	  memory with workloads that are not IO bound. See
+	  Documentation/vm/multigen_lru.rst for details.
+
+	  Warning: do not enable this option unless you plan to use it because
+	  it introduces a small per-process memory overhead.
+
+config NR_LRU_GENS
+	int "Max number of generations"
+	depends on LRU_GEN
+	range 4 63
+	default 7
+	help
+	  This will use ilog2(N)+1 spare bits from page flags.
+
+	  Warning: do not use numbers larger than necessary because each
+	  generation introduces a small per-node and per-memcg memory overhead.
+
+config LRU_GEN_ENABLED
+	bool "Turn on by default"
+	depends on LRU_GEN
+	help
+	  The default value of /sys/kernel/mm/lru_gen/enabled is 0. This option
+	  changes it to 1.
+
 endmenu
-- 
2.31.0.rc2.261.g7f71774620-goog



^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v1 14/14] mm: multigenerational lru: documentation
  2021-03-13  7:57 [PATCH v1 00/14] Multigenerational LRU Yu Zhao
                   ` (12 preceding siblings ...)
  2021-03-13  7:57 ` [PATCH v1 13/14] mm: multigenerational lru: Kconfig Yu Zhao
@ 2021-03-13  7:57 ` Yu Zhao
  2021-03-19  9:31   ` Alex Shi
  2021-03-14 22:48 ` [PATCH v1 00/14] Multigenerational LRU Zi Yan
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 65+ messages in thread
From: Yu Zhao @ 2021-03-13  7:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim, Yu Zhao

Add Documentation/vm/multigen_lru.rst.

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 Documentation/vm/index.rst        |   1 +
 Documentation/vm/multigen_lru.rst | 210 ++++++++++++++++++++++++++++++
 2 files changed, 211 insertions(+)
 create mode 100644 Documentation/vm/multigen_lru.rst

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index eff5fbd492d0..c353b3f55924 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -17,6 +17,7 @@ various features of the Linux memory management
 
    swap_numa
    zswap
+   multigen_lru
 
 Kernel developers MM documentation
 ==================================
diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
new file mode 100644
index 000000000000..fea927da2572
--- /dev/null
+++ b/Documentation/vm/multigen_lru.rst
@@ -0,0 +1,210 @@
+=====================
+Multigenerational LRU
+=====================
+
+Quick Start
+===========
+Build Options
+-------------
+:Required: Set ``CONFIG_LRU_GEN=y``.
+
+:Optional: Change ``CONFIG_NR_LRU_GENS`` to a number ``X`` to support
+ a maximum of ``X`` generations.
+
+:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to turn the feature on by
+ default.
+
+Runtime Options
+---------------
+:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enabled`` if the
+ feature was not turned on by default.
+
+:Optional: Change ``/sys/kernel/mm/lru_gen/spread`` to a number ``N``
+ to spread pages out across ``N+1`` generations. ``N`` must be less
+ than ``X``. Larger values make the background aging more aggressive.
+
+:Optional: Read ``/sys/kernel/debug/lru_gen`` to verify the feature.
+ This file has the following output:
+
+::
+
+  memcg  memcg_id  memcg_path
+    node  node_id
+      min_gen  birth_time  anon_size  file_size
+      ...
+      max_gen  birth_time  anon_size  file_size
+
+Given a memcg and a node, ``min_gen`` is the oldest generation
+(number) and ``max_gen`` is the youngest. Birth time is in
+milliseconds. Anon and file sizes are in pages.
+
+Recipes
+-------
+:Android on ARMv8.1+: ``X=4``, ``N=0``
+
+:Android on pre-ARMv8.1 CPUs: Not recommended due to the lack of
+ ``ARM64_HW_AFDBM``
+
+:Laptops running Chrome on x86_64: ``X=7``, ``N=2``
+
+:Working set estimation: Write ``+ memcg_id node_id gen [swappiness]``
+ to ``/sys/kernel/debug/lru_gen`` to account referenced pages to
+ generation ``max_gen`` and create the next generation ``max_gen+1``.
+ ``gen`` must be equal to ``max_gen`` in order to avoid races. A swap
+ file and a non-zero swappiness value are required to scan anon pages.
+ If swapping is not desired, set ``vm.swappiness`` to ``0`` and
+ overwrite it with a non-zero ``swappiness``.
+
+:Proactive reclaim: Write ``- memcg_id node_id gen [swappiness]
+ [nr_to_reclaim]`` to ``/sys/kernel/debug/lru_gen`` to evict
+ generations less than or equal to ``gen``. ``gen`` must be less than
+ ``max_gen-1`` as ``max_gen`` and ``max_gen-1`` are active generations
+ and therefore protected from the eviction. ``nr_to_reclaim`` can be
+ used to limit the number of pages to be evicted. Multiple command
+ lines are supported, as is concatenation with the delimiters ``,``
+ and ``;``.
+
+Workflow
+========
+Evictable pages are divided into multiple generations for each
+``lruvec``. The youngest generation number is stored in ``max_seq``
+for both anon and file types as they are aged on an equal footing. The
+oldest generation numbers are stored in ``min_seq[2]`` separately for
+anon and file types as clean file pages can be evicted regardless of
+swap and write-back constraints. Generation numbers are truncated into
+``ilog2(CONFIG_NR_LRU_GENS)+1`` bits in order to fit into
+``page->flags``. The sliding window technique is used to prevent
+truncated generation numbers from overlapping. Each truncated
+generation number is an index to an array of per-type and per-zone
+lists. Evictable pages are added to the per-zone lists indexed by
+``max_seq`` or ``min_seq[2]`` (modulo ``CONFIG_NR_LRU_GENS``),
+depending on whether they are being faulted in or read ahead. The
+workflow comprises two conceptually independent functions: the aging
+and the eviction.
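+With the default ``CONFIG_NR_LRU_GENS=7``, for example, truncated
+generation numbers occupy ``ilog2(7)+1 = 3`` bits of ``page->flags``.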
+
+Aging
+-----
+The aging produces young generations. Given an ``lruvec``, the aging
+scans page tables for referenced pages of this ``lruvec``. Upon
+finding one, the aging updates its generation number to ``max_seq``.
+After each round of scan, the aging increments ``max_seq``. The aging
+maintains either a system-wide ``mm_struct`` list or per-memcg
+``mm_struct`` lists, and it only scans page tables of processes that
+have been scheduled since the last scan. Since scans are differential
+with respect to referenced pages, the cost is roughly proportional to
+their number.
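+For example, within a round of scan that starts with ``max_seq`` equal
+to ``N``, every referenced page found is moved to generation ``N``,
+and the round ends with ``max_seq`` incremented to ``N+1``.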
+
+Eviction
+--------
+The eviction consumes old generations. Given an ``lruvec``, the
+eviction scans the pages on the per-zone lists indexed by either of
+``min_seq[2]``. It selects a type according to the values of
+``min_seq[2]`` and swappiness. During a scan, the eviction either
+sorts or isolates a page, depending on whether the aging has updated
+its generation number. When it finds all the per-zone lists are empty,
+the eviction increments ``min_seq[2]`` indexed by this selected type.
+The eviction triggers the aging when both of ``min_seq[2]`` reach
+``max_seq-1``, assuming both anon and file types are reclaimable.
+
+Rationale
+=========
+Characteristics of cloud workloads
+----------------------------------
+With cloud storage gone mainstream, the role of local storage has
+diminished. For most of the systems running cloud workloads, anon
+pages account for the majority of memory consumption and page cache
+contains mostly executable pages. Notably, the unmapped portion of the
+page cache is negligible.
+
+As a result, swapping is necessary to achieve substantial memory
+overcommit. And the ``rmap`` is the hottest in the reclaim path
+because its usage is proportional to the number of scanned pages,
+which on average is many times the number of reclaimed pages.
+
+With ``zram``, a typical ``kswapd`` profile on v5.11 looks like:
+
+::
+
+  31.03%  page_vma_mapped_walk
+  25.59%  lzo1x_1_do_compress
+   4.63%  do_raw_spin_lock
+   3.89%  vma_interval_tree_iter_next
+   3.33%  vma_interval_tree_subtree_search
+
+And with real swap, it looks like:
+
+::
+
+  45.16%  page_vma_mapped_walk
+   7.61%  do_raw_spin_lock
+   5.69%  vma_interval_tree_iter_next
+   4.91%  vma_interval_tree_subtree_search
+   3.71%  page_referenced_one
+
+Limitations of the Current Implementation
+-----------------------------------------
+Notion of the Active/Inactive
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+For servers equipped with hundreds of gigabytes of memory, the
+granularity of the active/inactive is too coarse to be useful for job
+scheduling. And false active/inactive rates are relatively high.
+
+For phones and laptops, the eviction is biased toward file pages
+because the selection has to resort to heuristics as direct
+comparisons between anon and file types are infeasible.
+
+For systems with multiple nodes and/or memcgs, it is impossible to
+compare ``lruvec``\s based on the notion of the active/inactive.
+
+Incremental Scans via the ``rmap``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Each incremental scan picks up at where the last scan left off and
+stops after it has found a handful of unreferenced pages. For most of
+the systems running cloud workloads, incremental scans lose the
+advantage under sustained memory pressure due to high ratios of the
+number of scanned pages to the number of reclaimed pages. On top of
+that, the ``rmap`` has poor memory locality due to its complex data
+structures. The combined effects typically result in a high amount of
+CPU usage in the reclaim path.
+
+Benefits of the Multigenerational LRU
+-------------------------------------
+Notion of Generation Numbers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The notion of generation numbers introduces a quantitative approach to
+memory overcommit. A larger number of pages can be spread out across
+configurable generations, and thus they have relatively low false
+active/inactive rates. Each generation includes all pages that have
+been referenced since the last generation.
+
+Given an ``lruvec``, scans and the selections between anon and file
+types are all based on generation numbers, which are simple and yet
+effective. For different ``lruvec``\s, comparisons are still possible
+based on birth times of generations.
+
+Differential Scans via Page Tables
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Each differential scan discovers all pages that have been referenced
+since the last scan. Specifically, it walks the ``mm_struct`` list
+associated with an ``lruvec`` to scan page tables of processes that
+have been scheduled since the last scan. The cost of each differential
+scan is roughly proportional to the number of referenced pages it
+discovers. Unless address spaces are extremely sparse, page tables
+usually have better memory locality than the ``rmap``. The end result
+is generally a significant reduction in CPU usage for most of the
+systems running cloud workloads.
+
+To-do List
+==========
+KVM Optimization
+----------------
+Support shadow page table walk.
+
+NUMA Optimization
+-----------------
+Add per-node RSS for ``should_skip_mm()``.
+
+Refault Tracking Optimization
+-----------------------------
+Use generation numbers rather than LRU positions in
+``workingset_eviction()`` and ``workingset_refault()``.
-- 
2.31.0.rc2.261.g7f71774620-goog



^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 12/14] mm: multigenerational lru: user space interface
  2021-03-13  7:57 ` [PATCH v1 12/14] mm: multigenerational lru: user space interface Yu Zhao
@ 2021-03-13 12:23   ` kernel test robot
  0 siblings, 0 replies; 65+ messages in thread
From: kernel test robot @ 2021-03-13 12:23 UTC (permalink / raw)
  To: Yu Zhao, linux-mm
  Cc: kbuild-all, Alex Shi, Andrew Morton,
	Linux Memory Management List, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman

Hi Yu,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/x86/core]
[also build test WARNING on tip/x86/mm tip/sched/core linus/master v5.12-rc2]
[cannot apply to cgroup/for-next tip/perf/core hnaz-linux-mm/master next-20210312]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Yu-Zhao/Multigenerational-LRU/20210313-160036
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git d0962f2b24c99889a386f0658c71535f56358f77
compiler: s390-linux-gcc (GCC) 9.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


cppcheck possible warnings: (new ones prefixed by >>, may not be real problems)

   mm/vmscan.c:4107:22: warning: Local variable kswapd shadows outer function [shadowFunction]
    struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
                        ^
   mm/vmscan.c:3909:12: note: Shadowed declaration
   static int kswapd(void *p)
              ^
   mm/vmscan.c:4107:22: note: Shadow variable
    struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
                        ^

vim +6140 mm/vmscan.c

  6106	
  6107	static ssize_t lru_gen_debugfs_write(struct file *file, const char __user *src,
  6108					     size_t len, loff_t *pos)
  6109	{
  6110		void *buf;
  6111		char *cur, *next;
  6112		int err = 0;
  6113	
  6114		buf = kvmalloc(len + 1, GFP_USER);
  6115		if (!buf)
  6116			return -ENOMEM;
  6117	
  6118		if (copy_from_user(buf, src, len)) {
  6119			kvfree(buf);
  6120			return -EFAULT;
  6121		}
  6122	
  6123		next = buf;
  6124		next[len] = '\0';
  6125	
  6126		while ((cur = strsep(&next, ",;\n"))) {
  6127			int n;
  6128			int end;
  6129			char cmd;
  6130			int memcg_id;
  6131			int nid;
  6132			unsigned long seq;
  6133			int swappiness = -1;
  6134			unsigned long nr_to_reclaim = -1;
  6135	
  6136			cur = skip_spaces(cur);
  6137			if (!*cur)
  6138				continue;
  6139	
> 6140			n = sscanf(cur, "%c %u %u %lu %n %u %n %lu %n", &cmd, &memcg_id, &nid,
  6141				   &seq, &end, &swappiness, &end, &nr_to_reclaim, &end);
  6142			if (n < 4 || cur[end]) {
  6143				err = -EINVAL;
  6144				break;
  6145			}
  6146	
  6147			err = advance_seq(cmd, memcg_id, nid, seq, swappiness, nr_to_reclaim);
  6148			if (err)
  6149				break;
  6150		}
  6151	
  6152		kvfree(buf);
  6153	
  6154		return err ? : len;
  6155	}
  6156	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 13/14] mm: multigenerational lru: Kconfig
  2021-03-13  7:57 ` [PATCH v1 13/14] mm: multigenerational lru: Kconfig Yu Zhao
@ 2021-03-13 12:53   ` kernel test robot
  2021-03-13 13:36   ` kernel test robot
  1 sibling, 0 replies; 65+ messages in thread
From: kernel test robot @ 2021-03-13 12:53 UTC (permalink / raw)
  To: Yu Zhao, linux-mm
  Cc: kbuild-all, Alex Shi, Andrew Morton,
	Linux Memory Management List, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 12321 bytes --]

Hi Yu,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/x86/core]
[also build test ERROR on tip/x86/mm tip/sched/core linus/master v5.12-rc2]
[cannot apply to cgroup/for-next tip/perf/core hnaz-linux-mm/master next-20210312]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Yu-Zhao/Multigenerational-LRU/20210313-160036
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git d0962f2b24c99889a386f0658c71535f56358f77
config: mips-randconfig-r022-20210313 (attached as .config)
compiler: mipsel-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/7a8b80d7f0d02852d49395fc6e035743816f6b1d
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Yu-Zhao/Multigenerational-LRU/20210313-160036
        git checkout 7a8b80d7f0d02852d49395fc6e035743816f6b1d
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=mips 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

   mm/vmscan.c: In function 'walk_pte_range':
>> mm/vmscan.c:4776:56: error: implicit declaration of function 'pmd_young'; did you mean 'pte_young'? [-Werror=implicit-function-declaration]
    4776 |  if (IS_ENABLED(CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG) && !pmd_young(pmd))
         |                                                        ^~~~~~~~~
         |                                                        pte_young
   mm/vmscan.c: In function 'walk_pmd_range':
>> mm/vmscan.c:4851:23: error: implicit declaration of function 'pmd_pfn'; did you mean 'pmd_off'? [-Werror=implicit-function-declaration]
    4851 |   unsigned long pfn = pmd_pfn(*pmd);
         |                       ^~~~~~~
         |                       pmd_off
>> mm/vmscan.c:4882:7: error: implicit declaration of function 'pmd_dirty'; did you mean 'pte_dirty'? [-Werror=implicit-function-declaration]
    4882 |   if (pmd_dirty(*pmd) && !PageDirty(page) &&
         |       ^~~~~~~~~
         |       pte_dirty
   cc1: some warnings being treated as errors
--
   mm/memcontrol.c: In function 'mem_cgroup_attach':
>> mm/memcontrol.c:6179:3: warning: suggest braces around empty body in an 'else' statement [-Wempty-body]
    6179 |   ;
         |   ^


vim +4776 mm/vmscan.c

4c59e20072808a Yu Zhao 2021-03-13  4759  
4c59e20072808a Yu Zhao 2021-03-13  4760  static int walk_pte_range(pmd_t *pmdp, unsigned long start, unsigned long end,
4c59e20072808a Yu Zhao 2021-03-13  4761  			  struct mm_walk *walk)
4c59e20072808a Yu Zhao 2021-03-13  4762  {
4c59e20072808a Yu Zhao 2021-03-13  4763  	pmd_t pmd;
4c59e20072808a Yu Zhao 2021-03-13  4764  	pte_t *pte;
4c59e20072808a Yu Zhao 2021-03-13  4765  	spinlock_t *ptl;
4c59e20072808a Yu Zhao 2021-03-13  4766  	struct mm_walk_args *args = walk->private;
4c59e20072808a Yu Zhao 2021-03-13  4767  	int old_gen, new_gen = lru_gen_from_seq(args->max_seq);
4c59e20072808a Yu Zhao 2021-03-13  4768  
4c59e20072808a Yu Zhao 2021-03-13  4769  	pmd = pmd_read_atomic(pmdp);
4c59e20072808a Yu Zhao 2021-03-13  4770  	barrier();
4c59e20072808a Yu Zhao 2021-03-13  4771  	if (!pmd_present(pmd) || pmd_trans_huge(pmd))
4c59e20072808a Yu Zhao 2021-03-13  4772  		return 0;
4c59e20072808a Yu Zhao 2021-03-13  4773  
4c59e20072808a Yu Zhao 2021-03-13  4774  	VM_BUG_ON(pmd_huge(pmd) || pmd_devmap(pmd) || is_hugepd(__hugepd(pmd_val(pmd))));
4c59e20072808a Yu Zhao 2021-03-13  4775  
4c59e20072808a Yu Zhao 2021-03-13 @4776  	if (IS_ENABLED(CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG) && !pmd_young(pmd))
4c59e20072808a Yu Zhao 2021-03-13  4777  		return 0;
4c59e20072808a Yu Zhao 2021-03-13  4778  
4c59e20072808a Yu Zhao 2021-03-13  4779  	pte = pte_offset_map_lock(walk->mm, &pmd, start, &ptl);
4c59e20072808a Yu Zhao 2021-03-13  4780  	arch_enter_lazy_mmu_mode();
4c59e20072808a Yu Zhao 2021-03-13  4781  
4c59e20072808a Yu Zhao 2021-03-13  4782  	for (; start != end; pte++, start += PAGE_SIZE) {
4c59e20072808a Yu Zhao 2021-03-13  4783  		struct page *page;
4c59e20072808a Yu Zhao 2021-03-13  4784  		unsigned long pfn = pte_pfn(*pte);
4c59e20072808a Yu Zhao 2021-03-13  4785  
4c59e20072808a Yu Zhao 2021-03-13  4786  		if (!pte_present(*pte) || !pte_young(*pte) || is_zero_pfn(pfn))
4c59e20072808a Yu Zhao 2021-03-13  4787  			continue;
4c59e20072808a Yu Zhao 2021-03-13  4788  
4c59e20072808a Yu Zhao 2021-03-13  4789  		/*
4c59e20072808a Yu Zhao 2021-03-13  4790  		 * If this pte maps a page from a different node, set the
4c59e20072808a Yu Zhao 2021-03-13  4791  		 * bitmap to prevent the accessed bit on its parent pmd from
4c59e20072808a Yu Zhao 2021-03-13  4792  		 * being cleared.
4c59e20072808a Yu Zhao 2021-03-13  4793  		 */
4c59e20072808a Yu Zhao 2021-03-13  4794  		if (pfn < args->start_pfn || pfn >= args->end_pfn) {
4c59e20072808a Yu Zhao 2021-03-13  4795  			args->addr_bitmap |= get_addr_mask(start);
4c59e20072808a Yu Zhao 2021-03-13  4796  			continue;
4c59e20072808a Yu Zhao 2021-03-13  4797  		}
4c59e20072808a Yu Zhao 2021-03-13  4798  
4c59e20072808a Yu Zhao 2021-03-13  4799  		page = compound_head(pte_page(*pte));
4c59e20072808a Yu Zhao 2021-03-13  4800  		if (page_to_nid(page) != args->node_id) {
4c59e20072808a Yu Zhao 2021-03-13  4801  			args->addr_bitmap |= get_addr_mask(start);
4c59e20072808a Yu Zhao 2021-03-13  4802  			continue;
4c59e20072808a Yu Zhao 2021-03-13  4803  		}
4c59e20072808a Yu Zhao 2021-03-13  4804  		if (page_memcg_rcu(page) != args->memcg)
4c59e20072808a Yu Zhao 2021-03-13  4805  			continue;
4c59e20072808a Yu Zhao 2021-03-13  4806  
4c59e20072808a Yu Zhao 2021-03-13  4807  		if (ptep_test_and_clear_young(walk->vma, start, pte)) {
4c59e20072808a Yu Zhao 2021-03-13  4808  			old_gen = page_update_lru_gen(page, new_gen);
4c59e20072808a Yu Zhao 2021-03-13  4809  			if (old_gen >= 0 && old_gen != new_gen) {
4c59e20072808a Yu Zhao 2021-03-13  4810  				update_batch_size(page, old_gen, new_gen);
4c59e20072808a Yu Zhao 2021-03-13  4811  				args->batch_size++;
4c59e20072808a Yu Zhao 2021-03-13  4812  			}
4c59e20072808a Yu Zhao 2021-03-13  4813  		}
4c59e20072808a Yu Zhao 2021-03-13  4814  
4c59e20072808a Yu Zhao 2021-03-13  4815  		if (pte_dirty(*pte) && !PageDirty(page) &&
4c59e20072808a Yu Zhao 2021-03-13  4816  		    !(PageAnon(page) && PageSwapBacked(page) && !PageSwapCache(page)))
4c59e20072808a Yu Zhao 2021-03-13  4817  			set_page_dirty(page);
4c59e20072808a Yu Zhao 2021-03-13  4818  	}
4c59e20072808a Yu Zhao 2021-03-13  4819  
4c59e20072808a Yu Zhao 2021-03-13  4820  	arch_leave_lazy_mmu_mode();
4c59e20072808a Yu Zhao 2021-03-13  4821  	pte_unmap_unlock(pte, ptl);
4c59e20072808a Yu Zhao 2021-03-13  4822  
4c59e20072808a Yu Zhao 2021-03-13  4823  	return 0;
4c59e20072808a Yu Zhao 2021-03-13  4824  }
4c59e20072808a Yu Zhao 2021-03-13  4825  
4c59e20072808a Yu Zhao 2021-03-13  4826  static int walk_pmd_range(pud_t *pudp, unsigned long start, unsigned long end,
4c59e20072808a Yu Zhao 2021-03-13  4827  			  struct mm_walk *walk)
4c59e20072808a Yu Zhao 2021-03-13  4828  {
4c59e20072808a Yu Zhao 2021-03-13  4829  	pud_t pud;
4c59e20072808a Yu Zhao 2021-03-13  4830  	pmd_t *pmd;
4c59e20072808a Yu Zhao 2021-03-13  4831  	spinlock_t *ptl;
4c59e20072808a Yu Zhao 2021-03-13  4832  	struct mm_walk_args *args = walk->private;
4c59e20072808a Yu Zhao 2021-03-13  4833  	int old_gen, new_gen = lru_gen_from_seq(args->max_seq);
4c59e20072808a Yu Zhao 2021-03-13  4834  
4c59e20072808a Yu Zhao 2021-03-13  4835  	pud = READ_ONCE(*pudp);
4c59e20072808a Yu Zhao 2021-03-13  4836  	if (!pud_present(pud) || WARN_ON_ONCE(pud_trans_huge(pud)))
4c59e20072808a Yu Zhao 2021-03-13  4837  		return 0;
4c59e20072808a Yu Zhao 2021-03-13  4838  
4c59e20072808a Yu Zhao 2021-03-13  4839  	VM_BUG_ON(pud_huge(pud) || pud_devmap(pud) || is_hugepd(__hugepd(pud_val(pud))));
4c59e20072808a Yu Zhao 2021-03-13  4840  
4c59e20072808a Yu Zhao 2021-03-13  4841  	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
4c59e20072808a Yu Zhao 2021-03-13  4842  	    !IS_ENABLED(CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG))
4c59e20072808a Yu Zhao 2021-03-13  4843  		goto done;
4c59e20072808a Yu Zhao 2021-03-13  4844  
4c59e20072808a Yu Zhao 2021-03-13  4845  	pmd = pmd_offset(&pud, start);
4c59e20072808a Yu Zhao 2021-03-13  4846  	ptl = pmd_lock(walk->mm, pmd);
4c59e20072808a Yu Zhao 2021-03-13  4847  	arch_enter_lazy_mmu_mode();
4c59e20072808a Yu Zhao 2021-03-13  4848  
4c59e20072808a Yu Zhao 2021-03-13  4849  	for (; start != end; pmd++, start = pmd_addr_end(start, end)) {
4c59e20072808a Yu Zhao 2021-03-13  4850  		struct page *page;
4c59e20072808a Yu Zhao 2021-03-13 @4851  		unsigned long pfn = pmd_pfn(*pmd);
4c59e20072808a Yu Zhao 2021-03-13  4852  
4c59e20072808a Yu Zhao 2021-03-13  4853  		if (!pmd_present(*pmd) || !pmd_young(*pmd) || is_huge_zero_pmd(*pmd))
4c59e20072808a Yu Zhao 2021-03-13  4854  			continue;
4c59e20072808a Yu Zhao 2021-03-13  4855  
4c59e20072808a Yu Zhao 2021-03-13  4856  		if (!pmd_trans_huge(*pmd)) {
4c59e20072808a Yu Zhao 2021-03-13  4857  			if (!(args->addr_bitmap & get_addr_mask(start)) &&
4c59e20072808a Yu Zhao 2021-03-13  4858  			    (!(pmd_addr_end(start, end) & ~PMD_MASK) ||
4c59e20072808a Yu Zhao 2021-03-13  4859  			     !walk->vma->vm_next ||
4c59e20072808a Yu Zhao 2021-03-13  4860  			     (walk->vma->vm_next->vm_start & PMD_MASK) > end))
4c59e20072808a Yu Zhao 2021-03-13  4861  				pmdp_test_and_clear_young(walk->vma, start, pmd);
4c59e20072808a Yu Zhao 2021-03-13  4862  			continue;
4c59e20072808a Yu Zhao 2021-03-13  4863  		}
4c59e20072808a Yu Zhao 2021-03-13  4864  
4c59e20072808a Yu Zhao 2021-03-13  4865  		if (pfn < args->start_pfn || pfn >= args->end_pfn)
4c59e20072808a Yu Zhao 2021-03-13  4866  			continue;
4c59e20072808a Yu Zhao 2021-03-13  4867  
4c59e20072808a Yu Zhao 2021-03-13  4868  		page = pmd_page(*pmd);
4c59e20072808a Yu Zhao 2021-03-13  4869  		if (page_to_nid(page) != args->node_id)
4c59e20072808a Yu Zhao 2021-03-13  4870  			continue;
4c59e20072808a Yu Zhao 2021-03-13  4871  		if (page_memcg_rcu(page) != args->memcg)
4c59e20072808a Yu Zhao 2021-03-13  4872  			continue;
4c59e20072808a Yu Zhao 2021-03-13  4873  
4c59e20072808a Yu Zhao 2021-03-13  4874  		if (pmdp_test_and_clear_young(walk->vma, start, pmd)) {
4c59e20072808a Yu Zhao 2021-03-13  4875  			old_gen = page_update_lru_gen(page, new_gen);
4c59e20072808a Yu Zhao 2021-03-13  4876  			if (old_gen >= 0 && old_gen != new_gen) {
4c59e20072808a Yu Zhao 2021-03-13  4877  				update_batch_size(page, old_gen, new_gen);
4c59e20072808a Yu Zhao 2021-03-13  4878  				args->batch_size++;
4c59e20072808a Yu Zhao 2021-03-13  4879  			}
4c59e20072808a Yu Zhao 2021-03-13  4880  		}
4c59e20072808a Yu Zhao 2021-03-13  4881  
4c59e20072808a Yu Zhao 2021-03-13 @4882  		if (pmd_dirty(*pmd) && !PageDirty(page) &&
4c59e20072808a Yu Zhao 2021-03-13  4883  		    !(PageAnon(page) && PageSwapBacked(page) && !PageSwapCache(page)))
4c59e20072808a Yu Zhao 2021-03-13  4884  			set_page_dirty(page);
4c59e20072808a Yu Zhao 2021-03-13  4885  	}
4c59e20072808a Yu Zhao 2021-03-13  4886  
4c59e20072808a Yu Zhao 2021-03-13  4887  	arch_leave_lazy_mmu_mode();
4c59e20072808a Yu Zhao 2021-03-13  4888  	spin_unlock(ptl);
4c59e20072808a Yu Zhao 2021-03-13  4889  done:
4c59e20072808a Yu Zhao 2021-03-13  4890  	args->addr_bitmap = 0;
4c59e20072808a Yu Zhao 2021-03-13  4891  
4c59e20072808a Yu Zhao 2021-03-13  4892  	if (args->batch_size < MAX_BATCH_SIZE)
4c59e20072808a Yu Zhao 2021-03-13  4893  		return 0;
4c59e20072808a Yu Zhao 2021-03-13  4894  
4c59e20072808a Yu Zhao 2021-03-13  4895  	args->next_addr = end;
4c59e20072808a Yu Zhao 2021-03-13  4896  
4c59e20072808a Yu Zhao 2021-03-13  4897  	return -EAGAIN;
4c59e20072808a Yu Zhao 2021-03-13  4898  }
4c59e20072808a Yu Zhao 2021-03-13  4899  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 23788 bytes --]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 13/14] mm: multigenerational lru: Kconfig
  2021-03-13  7:57 ` [PATCH v1 13/14] mm: multigenerational lru: Kconfig Yu Zhao
  2021-03-13 12:53   ` kernel test robot
@ 2021-03-13 13:36   ` kernel test robot
  1 sibling, 0 replies; 65+ messages in thread
From: kernel test robot @ 2021-03-13 13:36 UTC (permalink / raw)
  To: Yu Zhao, linux-mm
  Cc: kbuild-all, clang-built-linux, Alex Shi, Andrew Morton,
	Linux Memory Management List, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 12927 bytes --]

Hi Yu,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/x86/core]
[also build test ERROR on tip/x86/mm tip/sched/core linus/master v5.12-rc2]
[cannot apply to cgroup/for-next tip/perf/core hnaz-linux-mm/master next-20210312]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Yu-Zhao/Multigenerational-LRU/20210313-160036
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git d0962f2b24c99889a386f0658c71535f56358f77
config: mips-randconfig-r036-20210313 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project dfd27ebbd0eb137c9a439b7c537bb87ba903efd3)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install mips cross compiling tool for clang build
        # apt-get install binutils-mips-linux-gnu
        # https://github.com/0day-ci/linux/commit/7a8b80d7f0d02852d49395fc6e035743816f6b1d
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Yu-Zhao/Multigenerational-LRU/20210313-160036
        git checkout 7a8b80d7f0d02852d49395fc6e035743816f6b1d
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=mips 

If you fix the issue, kindly add the following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> mm/vmscan.c:4776:56: error: implicit declaration of function 'pmd_young' [-Werror,-Wimplicit-function-declaration]
           if (IS_ENABLED(CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG) && !pmd_young(pmd))
                                                                 ^
   mm/vmscan.c:4776:56: note: did you mean 'pte_young'?
   arch/mips/include/asm/pgtable.h:365:19: note: 'pte_young' declared here
   static inline int pte_young(pte_t pte)  { return pte_val(pte) & _PAGE_ACCESSED; }
                     ^
>> mm/vmscan.c:4851:23: error: implicit declaration of function 'pmd_pfn' [-Werror,-Wimplicit-function-declaration]
                   unsigned long pfn = pmd_pfn(*pmd);
                                       ^
   mm/vmscan.c:4851:23: note: did you mean 'pmd_off'?
   include/linux/pgtable.h:131:22: note: 'pmd_off' declared here
   static inline pmd_t *pmd_off(struct mm_struct *mm, unsigned long va)
                        ^
   mm/vmscan.c:4853:30: error: implicit declaration of function 'pmd_young' [-Werror,-Wimplicit-function-declaration]
                   if (!pmd_present(*pmd) || !pmd_young(*pmd) || is_huge_zero_pmd(*pmd))
                                              ^
>> mm/vmscan.c:4882:7: error: implicit declaration of function 'pmd_dirty' [-Werror,-Wimplicit-function-declaration]
                   if (pmd_dirty(*pmd) && !PageDirty(page) &&
                       ^
   mm/vmscan.c:4882:7: note: did you mean 'pte_dirty'?
   arch/mips/include/asm/pgtable.h:364:19: note: 'pte_dirty' declared here
   static inline int pte_dirty(pte_t pte)  { return pte_val(pte) & _PAGE_MODIFIED; }
                     ^
   4 errors generated.
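
For context: these helpers are simply never declared on this mips
randconfig, and IS_ENABLED() only discards dead code after parsing, so
the calls still trip -Wimplicit-function-declaration. One hypothetical
way to sidestep that (an illustration only, with a made-up helper name,
not the fix applied in this series) is to hide the calls behind the
preprocessor:

    static bool pmd_was_accessed(pmd_t pmd)
    {
    #ifdef CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG
    	/* the arch declares pmd_young(); use the non-leaf accessed bit */
    	return pmd_young(pmd);
    #else
    	/* no non-leaf accessed bit: callers have to scan the PTEs anyway */
    	return true;
    #endif
    }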


vim +/pmd_young +4776 mm/vmscan.c

4c59e20072808a Yu Zhao 2021-03-13  4759  
4c59e20072808a Yu Zhao 2021-03-13  4760  static int walk_pte_range(pmd_t *pmdp, unsigned long start, unsigned long end,
4c59e20072808a Yu Zhao 2021-03-13  4761  			  struct mm_walk *walk)
4c59e20072808a Yu Zhao 2021-03-13  4762  {
4c59e20072808a Yu Zhao 2021-03-13  4763  	pmd_t pmd;
4c59e20072808a Yu Zhao 2021-03-13  4764  	pte_t *pte;
4c59e20072808a Yu Zhao 2021-03-13  4765  	spinlock_t *ptl;
4c59e20072808a Yu Zhao 2021-03-13  4766  	struct mm_walk_args *args = walk->private;
4c59e20072808a Yu Zhao 2021-03-13  4767  	int old_gen, new_gen = lru_gen_from_seq(args->max_seq);
4c59e20072808a Yu Zhao 2021-03-13  4768  
4c59e20072808a Yu Zhao 2021-03-13  4769  	pmd = pmd_read_atomic(pmdp);
4c59e20072808a Yu Zhao 2021-03-13  4770  	barrier();
4c59e20072808a Yu Zhao 2021-03-13  4771  	if (!pmd_present(pmd) || pmd_trans_huge(pmd))
4c59e20072808a Yu Zhao 2021-03-13  4772  		return 0;
4c59e20072808a Yu Zhao 2021-03-13  4773  
4c59e20072808a Yu Zhao 2021-03-13  4774  	VM_BUG_ON(pmd_huge(pmd) || pmd_devmap(pmd) || is_hugepd(__hugepd(pmd_val(pmd))));
4c59e20072808a Yu Zhao 2021-03-13  4775  
4c59e20072808a Yu Zhao 2021-03-13 @4776  	if (IS_ENABLED(CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG) && !pmd_young(pmd))
4c59e20072808a Yu Zhao 2021-03-13  4777  		return 0;
4c59e20072808a Yu Zhao 2021-03-13  4778  
4c59e20072808a Yu Zhao 2021-03-13  4779  	pte = pte_offset_map_lock(walk->mm, &pmd, start, &ptl);
4c59e20072808a Yu Zhao 2021-03-13  4780  	arch_enter_lazy_mmu_mode();
4c59e20072808a Yu Zhao 2021-03-13  4781  
4c59e20072808a Yu Zhao 2021-03-13  4782  	for (; start != end; pte++, start += PAGE_SIZE) {
4c59e20072808a Yu Zhao 2021-03-13  4783  		struct page *page;
4c59e20072808a Yu Zhao 2021-03-13  4784  		unsigned long pfn = pte_pfn(*pte);
4c59e20072808a Yu Zhao 2021-03-13  4785  
4c59e20072808a Yu Zhao 2021-03-13  4786  		if (!pte_present(*pte) || !pte_young(*pte) || is_zero_pfn(pfn))
4c59e20072808a Yu Zhao 2021-03-13  4787  			continue;
4c59e20072808a Yu Zhao 2021-03-13  4788  
4c59e20072808a Yu Zhao 2021-03-13  4789  		/*
4c59e20072808a Yu Zhao 2021-03-13  4790  		 * If this pte maps a page from a different node, set the
4c59e20072808a Yu Zhao 2021-03-13  4791  		 * bitmap to prevent the accessed bit on its parent pmd from
4c59e20072808a Yu Zhao 2021-03-13  4792  		 * being cleared.
4c59e20072808a Yu Zhao 2021-03-13  4793  		 */
4c59e20072808a Yu Zhao 2021-03-13  4794  		if (pfn < args->start_pfn || pfn >= args->end_pfn) {
4c59e20072808a Yu Zhao 2021-03-13  4795  			args->addr_bitmap |= get_addr_mask(start);
4c59e20072808a Yu Zhao 2021-03-13  4796  			continue;
4c59e20072808a Yu Zhao 2021-03-13  4797  		}
4c59e20072808a Yu Zhao 2021-03-13  4798  
4c59e20072808a Yu Zhao 2021-03-13  4799  		page = compound_head(pte_page(*pte));
4c59e20072808a Yu Zhao 2021-03-13  4800  		if (page_to_nid(page) != args->node_id) {
4c59e20072808a Yu Zhao 2021-03-13  4801  			args->addr_bitmap |= get_addr_mask(start);
4c59e20072808a Yu Zhao 2021-03-13  4802  			continue;
4c59e20072808a Yu Zhao 2021-03-13  4803  		}
4c59e20072808a Yu Zhao 2021-03-13  4804  		if (page_memcg_rcu(page) != args->memcg)
4c59e20072808a Yu Zhao 2021-03-13  4805  			continue;
4c59e20072808a Yu Zhao 2021-03-13  4806  
4c59e20072808a Yu Zhao 2021-03-13  4807  		if (ptep_test_and_clear_young(walk->vma, start, pte)) {
4c59e20072808a Yu Zhao 2021-03-13  4808  			old_gen = page_update_lru_gen(page, new_gen);
4c59e20072808a Yu Zhao 2021-03-13  4809  			if (old_gen >= 0 && old_gen != new_gen) {
4c59e20072808a Yu Zhao 2021-03-13  4810  				update_batch_size(page, old_gen, new_gen);
4c59e20072808a Yu Zhao 2021-03-13  4811  				args->batch_size++;
4c59e20072808a Yu Zhao 2021-03-13  4812  			}
4c59e20072808a Yu Zhao 2021-03-13  4813  		}
4c59e20072808a Yu Zhao 2021-03-13  4814  
4c59e20072808a Yu Zhao 2021-03-13  4815  		if (pte_dirty(*pte) && !PageDirty(page) &&
4c59e20072808a Yu Zhao 2021-03-13  4816  		    !(PageAnon(page) && PageSwapBacked(page) && !PageSwapCache(page)))
4c59e20072808a Yu Zhao 2021-03-13  4817  			set_page_dirty(page);
4c59e20072808a Yu Zhao 2021-03-13  4818  	}
4c59e20072808a Yu Zhao 2021-03-13  4819  
4c59e20072808a Yu Zhao 2021-03-13  4820  	arch_leave_lazy_mmu_mode();
4c59e20072808a Yu Zhao 2021-03-13  4821  	pte_unmap_unlock(pte, ptl);
4c59e20072808a Yu Zhao 2021-03-13  4822  
4c59e20072808a Yu Zhao 2021-03-13  4823  	return 0;
4c59e20072808a Yu Zhao 2021-03-13  4824  }
4c59e20072808a Yu Zhao 2021-03-13  4825  
4c59e20072808a Yu Zhao 2021-03-13  4826  static int walk_pmd_range(pud_t *pudp, unsigned long start, unsigned long end,
4c59e20072808a Yu Zhao 2021-03-13  4827  			  struct mm_walk *walk)
4c59e20072808a Yu Zhao 2021-03-13  4828  {
4c59e20072808a Yu Zhao 2021-03-13  4829  	pud_t pud;
4c59e20072808a Yu Zhao 2021-03-13  4830  	pmd_t *pmd;
4c59e20072808a Yu Zhao 2021-03-13  4831  	spinlock_t *ptl;
4c59e20072808a Yu Zhao 2021-03-13  4832  	struct mm_walk_args *args = walk->private;
4c59e20072808a Yu Zhao 2021-03-13  4833  	int old_gen, new_gen = lru_gen_from_seq(args->max_seq);
4c59e20072808a Yu Zhao 2021-03-13  4834  
4c59e20072808a Yu Zhao 2021-03-13  4835  	pud = READ_ONCE(*pudp);
4c59e20072808a Yu Zhao 2021-03-13  4836  	if (!pud_present(pud) || WARN_ON_ONCE(pud_trans_huge(pud)))
4c59e20072808a Yu Zhao 2021-03-13  4837  		return 0;
4c59e20072808a Yu Zhao 2021-03-13  4838  
4c59e20072808a Yu Zhao 2021-03-13  4839  	VM_BUG_ON(pud_huge(pud) || pud_devmap(pud) || is_hugepd(__hugepd(pud_val(pud))));
4c59e20072808a Yu Zhao 2021-03-13  4840  
4c59e20072808a Yu Zhao 2021-03-13  4841  	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
4c59e20072808a Yu Zhao 2021-03-13  4842  	    !IS_ENABLED(CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG))
4c59e20072808a Yu Zhao 2021-03-13  4843  		goto done;
4c59e20072808a Yu Zhao 2021-03-13  4844  
4c59e20072808a Yu Zhao 2021-03-13  4845  	pmd = pmd_offset(&pud, start);
4c59e20072808a Yu Zhao 2021-03-13  4846  	ptl = pmd_lock(walk->mm, pmd);
4c59e20072808a Yu Zhao 2021-03-13  4847  	arch_enter_lazy_mmu_mode();
4c59e20072808a Yu Zhao 2021-03-13  4848  
4c59e20072808a Yu Zhao 2021-03-13  4849  	for (; start != end; pmd++, start = pmd_addr_end(start, end)) {
4c59e20072808a Yu Zhao 2021-03-13  4850  		struct page *page;
4c59e20072808a Yu Zhao 2021-03-13 @4851  		unsigned long pfn = pmd_pfn(*pmd);
4c59e20072808a Yu Zhao 2021-03-13  4852  
4c59e20072808a Yu Zhao 2021-03-13  4853  		if (!pmd_present(*pmd) || !pmd_young(*pmd) || is_huge_zero_pmd(*pmd))
4c59e20072808a Yu Zhao 2021-03-13  4854  			continue;
4c59e20072808a Yu Zhao 2021-03-13  4855  
4c59e20072808a Yu Zhao 2021-03-13  4856  		if (!pmd_trans_huge(*pmd)) {
4c59e20072808a Yu Zhao 2021-03-13  4857  			if (!(args->addr_bitmap & get_addr_mask(start)) &&
4c59e20072808a Yu Zhao 2021-03-13  4858  			    (!(pmd_addr_end(start, end) & ~PMD_MASK) ||
4c59e20072808a Yu Zhao 2021-03-13  4859  			     !walk->vma->vm_next ||
4c59e20072808a Yu Zhao 2021-03-13  4860  			     (walk->vma->vm_next->vm_start & PMD_MASK) > end))
4c59e20072808a Yu Zhao 2021-03-13  4861  				pmdp_test_and_clear_young(walk->vma, start, pmd);
4c59e20072808a Yu Zhao 2021-03-13  4862  			continue;
4c59e20072808a Yu Zhao 2021-03-13  4863  		}
4c59e20072808a Yu Zhao 2021-03-13  4864  
4c59e20072808a Yu Zhao 2021-03-13  4865  		if (pfn < args->start_pfn || pfn >= args->end_pfn)
4c59e20072808a Yu Zhao 2021-03-13  4866  			continue;
4c59e20072808a Yu Zhao 2021-03-13  4867  
4c59e20072808a Yu Zhao 2021-03-13  4868  		page = pmd_page(*pmd);
4c59e20072808a Yu Zhao 2021-03-13  4869  		if (page_to_nid(page) != args->node_id)
4c59e20072808a Yu Zhao 2021-03-13  4870  			continue;
4c59e20072808a Yu Zhao 2021-03-13  4871  		if (page_memcg_rcu(page) != args->memcg)
4c59e20072808a Yu Zhao 2021-03-13  4872  			continue;
4c59e20072808a Yu Zhao 2021-03-13  4873  
4c59e20072808a Yu Zhao 2021-03-13  4874  		if (pmdp_test_and_clear_young(walk->vma, start, pmd)) {
4c59e20072808a Yu Zhao 2021-03-13  4875  			old_gen = page_update_lru_gen(page, new_gen);
4c59e20072808a Yu Zhao 2021-03-13  4876  			if (old_gen >= 0 && old_gen != new_gen) {
4c59e20072808a Yu Zhao 2021-03-13  4877  				update_batch_size(page, old_gen, new_gen);
4c59e20072808a Yu Zhao 2021-03-13  4878  				args->batch_size++;
4c59e20072808a Yu Zhao 2021-03-13  4879  			}
4c59e20072808a Yu Zhao 2021-03-13  4880  		}
4c59e20072808a Yu Zhao 2021-03-13  4881  
4c59e20072808a Yu Zhao 2021-03-13 @4882  		if (pmd_dirty(*pmd) && !PageDirty(page) &&
4c59e20072808a Yu Zhao 2021-03-13  4883  		    !(PageAnon(page) && PageSwapBacked(page) && !PageSwapCache(page)))
4c59e20072808a Yu Zhao 2021-03-13  4884  			set_page_dirty(page);
4c59e20072808a Yu Zhao 2021-03-13  4885  	}
4c59e20072808a Yu Zhao 2021-03-13  4886  
4c59e20072808a Yu Zhao 2021-03-13  4887  	arch_leave_lazy_mmu_mode();
4c59e20072808a Yu Zhao 2021-03-13  4888  	spin_unlock(ptl);
4c59e20072808a Yu Zhao 2021-03-13  4889  done:
4c59e20072808a Yu Zhao 2021-03-13  4890  	args->addr_bitmap = 0;
4c59e20072808a Yu Zhao 2021-03-13  4891  
4c59e20072808a Yu Zhao 2021-03-13  4892  	if (args->batch_size < MAX_BATCH_SIZE)
4c59e20072808a Yu Zhao 2021-03-13  4893  		return 0;
4c59e20072808a Yu Zhao 2021-03-13  4894  
4c59e20072808a Yu Zhao 2021-03-13  4895  	args->next_addr = end;
4c59e20072808a Yu Zhao 2021-03-13  4896  
4c59e20072808a Yu Zhao 2021-03-13  4897  	return -EAGAIN;
4c59e20072808a Yu Zhao 2021-03-13  4898  }
4c59e20072808a Yu Zhao 2021-03-13  4899  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31615 bytes --]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 01/14] include/linux/memcontrol.h: do not warn in page_memcg_rcu() if !CONFIG_MEMCG
  2021-03-13  7:57 ` [PATCH v1 01/14] include/linux/memcontrol.h: do not warn in page_memcg_rcu() if !CONFIG_MEMCG Yu Zhao
@ 2021-03-13 15:09   ` Matthew Wilcox
  2021-03-14  7:45     ` Yu Zhao
  0 siblings, 1 reply; 65+ messages in thread
From: Matthew Wilcox @ 2021-03-13 15:09 UTC (permalink / raw)
  To: Yu Zhao
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Mel Gorman, Michal Hocko,
	Roman Gushchin, Vlastimil Babka, Wei Yang, Yang Shi, Ying Huang,
	linux-kernel, page-reclaim

On Sat, Mar 13, 2021 at 12:57:34AM -0700, Yu Zhao wrote:
> We want to make sure the rcu lock is held while using
> page_memcg_rcu(). But having a WARN_ON_ONCE() in page_memcg_rcu() when
> !CONFIG_MEMCG is superfluous because of the following legit use case:
> 
>   memcg = lock_page_memcg(page1)
>     (rcu_read_lock() if CONFIG_MEMCG=y)
> 
>   do something to page1
> 
>   if (page_memcg_rcu(page2) == memcg)
>     do something to page2 too as it cannot be migrated away from the
>     memcg either.
> 
>   unlock_page_memcg(page1)
>     (rcu_read_unlock() if CONFIG_MEMCG=y)
> 
> This patch removes the WARN_ON_ONCE() from page_memcg_rcu() for the
> !CONFIG_MEMCG case.

I think this is wrong.  Usually we try to have the same locking
environment no matter what the CONFIG options are, like with
kmap_atomic().  I think lock_page_memcg() should disable RCU even if
CONFIG_MEMCG=n.
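
For illustration, a minimal sketch of what that suggestion could look
like, assuming the !CONFIG_MEMCG stubs in include/linux/memcontrol.h
(the stub signatures are assumed from the surrounding discussion, not
taken from a patch in this thread):

    #ifndef CONFIG_MEMCG
    /* mirror the CONFIG_MEMCG=y locking environment */
    static inline struct mem_cgroup *lock_page_memcg(struct page *page)
    {
    	rcu_read_lock();
    	return NULL;
    }

    static inline void unlock_page_memcg(struct page *page)
    {
    	rcu_read_unlock();
    }
    #endif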


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 01/14] include/linux/memcontrol.h: do not warn in page_memcg_rcu() if !CONFIG_MEMCG
  2021-03-13 15:09   ` Matthew Wilcox
@ 2021-03-14  7:45     ` Yu Zhao
  0 siblings, 0 replies; 65+ messages in thread
From: Yu Zhao @ 2021-03-14  7:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Mel Gorman, Michal Hocko,
	Roman Gushchin, Vlastimil Babka, Wei Yang, Yang Shi, Ying Huang,
	linux-kernel, page-reclaim

On Sat, Mar 13, 2021 at 03:09:18PM +0000, Matthew Wilcox wrote:
> On Sat, Mar 13, 2021 at 12:57:34AM -0700, Yu Zhao wrote:
> > We want to make sure the rcu lock is held while using
> > page_memcg_rcu(). But having a WARN_ON_ONCE() in page_memcg_rcu() when
> > !CONFIG_MEMCG is superfluous because of the following legit use case:
> > 
> >   memcg = lock_page_memcg(page1)
> >     (rcu_read_lock() if CONFIG_MEMCG=y)
> > 
> >   do something to page1
> > 
> >   if (page_memcg_rcu(page2) == memcg)
> >     do something to page2 too as it cannot be migrated away from the
> >     memcg either.
> > 
> >   unlock_page_memcg(page1)
> >     (rcu_read_unlock() if CONFIG_MEMCG=y)
> > 
> > This patch removes the WARN_ON_ONCE() from page_memcg_rcu() for the
> > !CONFIG_MEMCG case.
> 
> I think this is wrong.  Usually we try to have the same locking
> environment no matter what the CONFIG options are, like with
> kmap_atomic().  I think lock_page_memcg() should disable RCU even if
> CONFIG_MEMCG=n.

I agree in principle. On this topic I often debate with myself about
where to draw the line between being rigorous and being paranoid. But
in this particular case, I thought it was a no-brainer because, imo,
most of the systems that don't use memcgs are small and preemptible,
e.g., openwrt. They wouldn't appreciate a larger code size or rcu
stalls due to preemption of functions that take rcu locks just to be
rigorous.

This shouldn't be a problem if we only do so when CONFIG_DEBUG_VM=y,
but then its test coverage is another question. I'd be happy to work
out something in this direction, hopefully worth the trouble, if you
think this compromise is acceptable.
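
A compact sketch of that compromise (again an assumption, building on
the lock_page_memcg() stub sketched above, not a posted patch): gate
the RCU critical section behind CONFIG_DEBUG_VM so small !CONFIG_MEMCG
systems pay nothing:

    static inline struct mem_cgroup *lock_page_memcg(struct page *page)
    {
    	/* be rigorous only on debug builds */
    	if (IS_ENABLED(CONFIG_DEBUG_VM))
    		rcu_read_lock();
    	return NULL;
    }

with unlock_page_memcg() mirroring the same IS_ENABLED() check.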


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries
  2021-03-13  7:57 ` [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries Yu Zhao
@ 2021-03-14 22:12   ` Zi Yan
  2021-03-14 22:51     ` Matthew Wilcox
  2021-03-14 23:22   ` Dave Hansen
  1 sibling, 1 reply; 65+ messages in thread
From: Zi Yan @ 2021-03-14 22:12 UTC (permalink / raw)
  To: Yu Zhao
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim

[-- Attachment #1: Type: text/plain, Size: 897 bytes --]

On 13 Mar 2021, at 2:57, Yu Zhao wrote:

> Some architectures support the accessed bit on non-leaf PMD entries
> (parents) in addition to leaf PTE entries (children) where pages are
> mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> as part of linear-address translation [1]. Page table walkers who are
> interested in the accessed bit on children can take advantage of this:
> they do not need to search the children when the accessed bit is not
> set on a parent, given that they have previously cleared the accessed
> bit on this parent in addition to its children.
>
> [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
>      Volume 3 (October 2019), section 4.8

Just curious. Does this also apply to non-leaf PUD entries? Do you
mind sharing which sentence from the manual gives the information?

Thanks.

—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 00/14] Multigenerational LRU
  2021-03-13  7:57 [PATCH v1 00/14] Multigenerational LRU Yu Zhao
                   ` (13 preceding siblings ...)
  2021-03-13  7:57 ` [PATCH v1 14/14] mm: multigenerational lru: documentation Yu Zhao
@ 2021-03-14 22:48 ` Zi Yan
  2021-03-15  0:52   ` Yu Zhao
  2021-03-15  1:13 ` Hillf Danton
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 65+ messages in thread
From: Zi Yan @ 2021-03-14 22:48 UTC (permalink / raw)
  To: Yu Zhao
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim

[-- Attachment #1: Type: text/plain, Size: 9109 bytes --]

On 13 Mar 2021, at 2:57, Yu Zhao wrote:

> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and
> often making poor choices about what to evict. We would like to offer
> a performant, versatile and straightforward augment.
>
> Repo
> ====
> git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/01/1101/1
>
> Gerrit https://linux-mm-review.googlesource.com/c/page-reclaim/+/1101
>
> Background
> ==========
> DRAM is a major factor in total cost of ownership, and improving
> memory overcommit brings a high return on investment. Over the past
> decade of research and experimentation in memory overcommit, we
> observed a distinct trend across millions of servers and clients: the
> size of page cache has been decreasing because of the growing
> popularity of cloud storage. Nowadays anon pages account for more than
> 90% of our memory consumption and page cache contains mostly
> executable pages.
>
> Problems
> ========
> Notion of the active/inactive
> -----------------------------
> For servers equipped with hundreds of gigabytes of memory, the
> granularity of the active/inactive is too coarse to be useful for job
> scheduling. And false active/inactive rates are relatively high. In
> addition, scans of largely varying numbers of pages are unpredictable
> because inactive_is_low() is based on magic numbers.
>
> For phones and laptops, the eviction is biased toward file pages
> because the selection has to resort to heuristics as direct
> comparisons between anon and file types are infeasible. On Android and
> Chrome OS, executable pages are frequently evicted despite the fact
> that there are many less recently used anon pages. This causes "janks"
> (slow UI rendering) and negatively impacts user experience.
>
> For systems with multiple nodes and/or memcgs, it is impossible to
> compare lruvecs based on the notion of the active/inactive.
>
> Incremental scans via the rmap
> ------------------------------
> Each incremental scan picks up at where the last scan left off and
> stops after it has found a handful of unreferenced pages. For most of
> the systems running cloud workloads, incremental scans lose the
> advantage under sustained memory pressure due to high ratios of the
> number of scanned pages to the number of reclaimed pages. In our case,
> the average ratio of pgscan to pgsteal is about 7.
>
> On top of that, the rmap has poor memory locality due to its complex
> data structures. The combined effects typically result in a high
> amount of CPU usage in the reclaim path. For example, with zram, a
> typical kswapd profile on v5.11 looks like:
>   31.03%  page_vma_mapped_walk
>   25.59%  lzo1x_1_do_compress
>    4.63%  do_raw_spin_lock
>    3.89%  vma_interval_tree_iter_next
>    3.33%  vma_interval_tree_subtree_search
>
> And with real swap, it looks like:
>   45.16%  page_vma_mapped_walk
>    7.61%  do_raw_spin_lock
>    5.69%  vma_interval_tree_iter_next
>    4.91%  vma_interval_tree_subtree_search
>    3.71%  page_referenced_one
>
> Solutions
> =========
> Notion of generation numbers
> ----------------------------
> The notion of generation numbers introduces a quantitative approach to
> memory overcommit. A larger number of pages can be spread out across
> configurable generations, and thus they have relatively low false
> active/inactive rates. Each generation includes all pages that have
> been referenced since the last generation.
>
> Given an lruvec, scans and the selections between anon and file types
> are all based on generation numbers, which are simple and yet
> effective. For different lruvecs, comparisons are still possible based
> on birth times of generations.
>
> Differential scans via page tables
> ----------------------------------
> Each differential scan discovers all pages that have been referenced
> since the last scan. Specifically, it walks the mm_struct list
> associated with an lruvec to scan page tables of processes that have
> been scheduled since the last scan. The cost of each differential scan
> is roughly proportional to the number of referenced pages it
> discovers. Unless address spaces are extremely sparse, page tables
> usually have better memory locality than the rmap. The end result is
> generally a significant reduction in CPU usage, for most of the
> systems running cloud workloads.
>
> On Chrome OS, our real-world benchmark that browses popular websites
> in multiple tabs demonstrates 51% less CPU usage from kswapd and 52%
> (full) less PSI on v5.11. And kswapd profile looks like:
>   49.36%  lzo1x_1_do_compress
>    4.54%  page_vma_mapped_walk
>    4.45%  memset_erms
>    3.47%  walk_pte_range
>    2.88%  zram_bvec_rw

Is this profile from a system with this patchset applied or not?
Do you mind sharing some profiling data with before and after applying
the patchset? So it would be easier to see the improvement brought by
this patchset.

>
> In addition, direct reclaim latency is reduced by 22% at 99th
> percentile and the number of refaults is reduced 7%. These metrics are
> important to phones and laptops as they are correlated to user
> experience.
>
> Workflow
> ========
> Evictable pages are divided into multiple generations for each lruvec.
> The youngest generation number is stored in lruvec->evictable.max_seq
> for both anon and file types as they are aged on an equal footing. The
> oldest generation numbers are stored in lruvec->evictable.min_seq[2]
> separately for anon and file types as clean file pages can be evicted
> regardless of may_swap or may_writepage. Generation numbers are
> truncated into ilog2(MAX_NR_GENS)+1 bits in order to fit into
> page->flags. The sliding window technique is used to prevent truncated
> generation numbers from overlapping. Each truncated generation number
> is an index to
> lruvec->evictable.lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES].
> Evictable pages are added to the per-zone lists indexed by max_seq or
> min_seq[2] (modulo MAX_NR_GENS), depending on whether they are being
> faulted in or read ahead. The workflow comprises two conceptually
> independent functions: the aging and the eviction.
>
> Aging
> -----
> The aging produces young generations. Given an lruvec, the aging scans
> page tables for referenced pages of this lruvec. Upon finding one, the
> aging updates its generation number to max_seq. After each round of
> scan, the aging increments max_seq. The aging maintains either a
> system-wide mm_struct list or per-memcg mm_struct lists and tracks
> whether an mm_struct is being used on any CPUs or has been used since
> the last scan. Multiple threads can concurrently work on the same
> mm_struct list, and each of them will be given a different mm_struct
> belonging to a process that has been scheduled since the last scan.
>
> Eviction
> --------
> The eviction consumes old generations. Given an lruvec, the eviction
> scans the pages on the per-zone lists indexed by either of min_seq[2].
> It selects a type according to the values of min_seq[2] and
> swappiness. During a scan, the eviction either sorts or isolates a
> page, depending on whether the aging has updated its generation
> number. When it finds all the per-zone lists are empty, the eviction
> increments min_seq[2] indexed by this selected type. The eviction
> triggers the aging when both of min_seq[2] reaches max_seq-1, assuming
> both anon and file types are reclaimable.
>
> Use cases
> =========
> On Android, our most advanced simulation that generates memory
> pressure from realistic user behavior shows 18% fewer low-memory
> kills, which in turn reduces cold starts by 16%.
>
> On Borg, a similar approach enables us to identify jobs that
> underutilize their memory and downsize them considerably without
> compromising any of our service level indicators.
>
> On Chrome OS, our field telemetry reports 96% fewer low-memory tab
> discards and 59% fewer OOM kills from fully-utilized devices and no UX
> regressions from underutilized devices.
>
> For other use cases include working set estimation, proactive reclaim,
> far memory tiering and NUMA-aware job scheduling, please refer to the
> documentation included in this series and the following references.

Are there any performance numbers for specific application (before and
after applying the patches) you can show to demonstrate the improvement?

Thanks.

>
> References
> ==========
> 1. Long-term SLOs for reclaimed cloud computing resources
>   https://research.google/pubs/pub43017/
> 2. Profiling a warehouse-scale computer
>   https://research.google/pubs/pub44271/
> 3. Evaluation of NUMA-Aware Scheduling in Warehouse-Scale Clusters
>   https://research.google/pubs/pub48329/
> 4. Software-defined far memory in warehouse-scale computers
>   https://research.google/pubs/pub48551/
> 5. Borg: the Next Generation
>   https://research.google/pubs/pub49065/
>


—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries
  2021-03-14 22:12   ` Zi Yan
@ 2021-03-14 22:51     ` Matthew Wilcox
  2021-03-15  0:03       ` Yu Zhao
  0 siblings, 1 reply; 65+ messages in thread
From: Matthew Wilcox @ 2021-03-14 22:51 UTC (permalink / raw)
  To: Zi Yan
  Cc: Yu Zhao, linux-mm, Alex Shi, Andrew Morton, Dave Hansen,
	Hillf Danton, Johannes Weiner, Joonsoo Kim, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim

On Sun, Mar 14, 2021 at 06:12:42PM -0400, Zi Yan wrote:
> On 13 Mar 2021, at 2:57, Yu Zhao wrote:
> 
> > Some architectures support the accessed bit on non-leaf PMD entries
> > (parents) in addition to leaf PTE entries (children) where pages are
> > mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> > as part of linear-address translation [1]. Page table walkers who are
> > interested in the accessed bit on children can take advantage of this:
> > they do not need to search the children when the accessed bit is not
> > set on a parent, given that they have previously cleared the accessed
> > bit on this parent in addition to its children.
> >
> > [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
> >      Volume 3 (October 2019), section 4.8
> 
> Just curious. Does this also apply to non-leaf PUD entries? Do you
> mind sharing which sentence from the manual gives the information?

The first few sentences from 4.8:

: For any paging-structure entry that is used during linear-address
: translation, bit 5 is the accessed flag. For paging-structure
: entries that map a page (as opposed to referencing another paging
: structure), bit 6 is the dirty flag. These flags are provided for
: use by memory-management software to manage the transfer of pages and
: paging structures into and out of physical memory.

: Whenever the processor uses a paging-structure entry as part of
: linear-address translation, it sets the accessed flag in that entry
: (if it is not already set).

The way they differentiate between the A and D bits makes it clear to
me that the A bit is set at each level of the tree, but the D bit is
only set on leaf entries.
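
This is exactly the property the series leans on: walk_pte_range(),
quoted in the kernel test robot report earlier in this thread, bails
out of a PMD's worth of PTEs when the non-leaf accessed bit is clear.
Condensed:

    /* condensed from walk_pte_range(); not a new change */
    if (IS_ENABLED(CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG) && !pmd_young(pmd))
    	return 0;	/* no PTE under this PMD was used since the last clearing */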


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries
  2021-03-13  7:57 ` [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries Yu Zhao
  2021-03-14 22:12   ` Zi Yan
@ 2021-03-14 23:22   ` Dave Hansen
  2021-03-15  3:16     ` Yu Zhao
  1 sibling, 1 reply; 65+ messages in thread
From: Dave Hansen @ 2021-03-14 23:22 UTC (permalink / raw)
  To: Yu Zhao, linux-mm
  Cc: Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim

On 3/12/21 11:57 PM, Yu Zhao wrote:
> Some architectures support the accessed bit on non-leaf PMD entries
> (parents) in addition to leaf PTE entries (children) where pages are
> mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> as part of linear-address translation [1]. Page table walkers who are
> interested in the accessed bit on children can take advantage of this:
> they do not need to search the children when the accessed bit is not
> set on a parent, given that they have previously cleared the accessed
> bit on this parent in addition to its children.

I'd like to hear a *LOT* more about how this is going to be used.

The one part of this which is entirely missing is the interaction with
the TLB and mid-level paging structure caches.  The CPU is pretty
aggressive about setting non-leaf accessed bits when TLB entries are
created.  This *looks* to depend on that behavior, but it would be
nice to spell it out explicitly.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries
  2021-03-14 22:51     ` Matthew Wilcox
@ 2021-03-15  0:03       ` Yu Zhao
  2021-03-15  0:27         ` Zi Yan
  0 siblings, 1 reply; 65+ messages in thread
From: Yu Zhao @ 2021-03-15  0:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Zi Yan, linux-mm, Alex Shi, Andrew Morton, Dave Hansen,
	Hillf Danton, Johannes Weiner, Joonsoo Kim, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim

On Sun, Mar 14, 2021 at 10:51:03PM +0000, Matthew Wilcox wrote:
> On Sun, Mar 14, 2021 at 06:12:42PM -0400, Zi Yan wrote:
> > On 13 Mar 2021, at 2:57, Yu Zhao wrote:
> > 
> > > Some architectures support the accessed bit on non-leaf PMD entries
> > > (parents) in addition to leaf PTE entries (children) where pages are
> > > mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> > > as part of linear-address translation [1]. Page table walkers who are
> > > interested in the accessed bit on children can take advantage of this:
> > > they do not need to search the children when the accessed bit is not
> > > set on a parent, given that they have previously cleared the accessed
> > > bit on this parent in addition to its children.
> > >
> > > [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
> > >      Volume 3 (October 2019), section 4.8
> > 
> > Just curious. Does this also apply to non-leaf PUD entries? Do you
> > mind sharing which sentence from the manual gives the information?
> 
> The first few sentences from 4.8:
> 
> : For any paging-structure entry that is used during linear-address
> : translation, bit 5 is the accessed flag. For paging-structure
> : entries that map a page (as opposed to referencing another paging
> : structure), bit 6 is the dirty flag. These flags are provided for
> : use by memory-management software to manage the transfer of pages and
> : paging structures into and out of physical memory.
> 
> : Whenever the processor uses a paging-structure entry as part of
> : linear-address translation, it sets the accessed flag in that entry
> : (if it is not already set).

As far as I know x86 is the one that supports this.

> The way they differentiate between the A and D bits makes it clear to
> me that the A bit is set at each level of the tree, but the D bit is
> only set on leaf entries.

And the difference makes perfect sense (to me). Kudos to Intel.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries
  2021-03-15  0:03       ` Yu Zhao
@ 2021-03-15  0:27         ` Zi Yan
  2021-03-15  1:04           ` Yu Zhao
  0 siblings, 1 reply; 65+ messages in thread
From: Zi Yan @ 2021-03-15  0:27 UTC (permalink / raw)
  To: Yu Zhao, Matthew Wilcox
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Mel Gorman, Michal Hocko,
	Roman Gushchin, Vlastimil Babka, Wei Yang, Yang Shi, Ying Huang,
	linux-kernel, page-reclaim

[-- Attachment #1: Type: text/plain, Size: 2288 bytes --]

On 14 Mar 2021, at 20:03, Yu Zhao wrote:

> On Sun, Mar 14, 2021 at 10:51:03PM +0000, Matthew Wilcox wrote:
>> On Sun, Mar 14, 2021 at 06:12:42PM -0400, Zi Yan wrote:
>>> On 13 Mar 2021, at 2:57, Yu Zhao wrote:
>>>
>>>> Some architectures support the accessed bit on non-leaf PMD entries
>>>> (parents) in addition to leaf PTE entries (children) where pages are
>>>> mapped, e.g., x86_64 sets the accessed bit on a parent when using it
>>>> as part of linear-address translation [1]. Page table walkers who are
>>>> interested in the accessed bit on children can take advantage of this:
>>>> they do not need to search the children when the accessed bit is not
>>>> set on a parent, given that they have previously cleared the accessed
>>>> bit on this parent in addition to its children.
>>>>
>>>> [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
>>>>      Volume 3 (October 2019), section 4.8
>>>
>>> Just curious. Does this also apply to non-leaf PUD entries? Do you
>>> mind sharing which sentence from the manual gives the information?
>>
>> The first few sentences from 4.8:
>>
>> : For any paging-structure entry that is used during linear-address
>> : translation, bit 5 is the accessed flag. For paging-structure
>> : entries that map a page (as opposed to referencing another paging
>> : structure), bit 6 is the dirty flag. These flags are provided for
>> : use by memory-management software to manage the transfer of pages and
>> : paging structures into and out of physical memory.
>>
>> : Whenever the processor uses a paging-structure entry as part of
>> : linear-address translation, it sets the accessed flag in that entry
>> : (if it is not already set).

Matthew, thanks for the pointer.

>
> As far as I know x86 is the one that supports this.
>
>> The way they differentiate between the A and D bits makes it clear to
>> me that the A bit is set at each level of the tree, but the D bit is
>> only set on leaf entries.
>
> And the difference makes perfect sense (to me). Kudos to Intel.

Hi Yu,

You only introduced HAVE_ARCH_PARENT_PMD_YOUNG but no HAVE_ARCH_PARENT_PUD_YOUNG.
Is PUD granularity too large to be useful for the multigenerational LRU algorithm?

Thanks.

—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 00/14] Multigenerational LRU
  2021-03-14 22:48 ` [PATCH v1 00/14] Multigenerational LRU Zi Yan
@ 2021-03-15  0:52   ` Yu Zhao
  0 siblings, 0 replies; 65+ messages in thread
From: Yu Zhao @ 2021-03-15  0:52 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim

On Sun, Mar 14, 2021 at 06:48:17PM -0400, Zi Yan wrote:
> On 13 Mar 2021, at 2:57, Yu Zhao wrote:

> > Problems
> > ========

> >   31.03%  page_vma_mapped_walk
> >   25.59%  lzo1x_1_do_compress
> >    4.63%  do_raw_spin_lock
> >    3.89%  vma_interval_tree_iter_next
> >    3.33%  vma_interval_tree_subtree_search

> > Solutions
> > =========

> >   49.36%  lzo1x_1_do_compress
> >    4.54%  page_vma_mapped_walk
> >    4.45%  memset_erms
> >    3.47%  walk_pte_range
> >    2.88%  zram_bvec_rw
> 
> Is this profile from a system with this patchset applied or not?
> Do you mind sharing some profiling data with before and after applying
> the patchset? So it would be easier to see the improvement brought by
> this patchset.

I've snipped everything else to make the context more clear.

These two kswapd profiles were collected under roughly the same memory
pressure. In other words, kswapd reclaimed (compressed) about the same
number of pages and therefore spent about the same amount of CPU time
in lzo1x_1_do_compress() in each profile.

The percentages of lzo1x_1_do_compress() are different because the
total CPU usage is different. Dividing the second percentage by the
first, we know we have roughly cut kswapd CPU usage by half.
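
Spelling that arithmetic out with the numbers quoted above (the
absolute lzo1x_1_do_compress() time is roughly constant across the two
profiles, so total CPU time is inversely proportional to its share):

    total_after / total_before = share_before / share_after
                               = 25.59% / 49.36%
                               ~= 0.52

i.e., kswapd used roughly half the CPU time with the patchset applied.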

> Are there any performance numbers for specific application (before and
> after applying the patches) you can show to demonstrate the improvement?

The kswapd profiles are from Chrome OS, i.e., laptops running the
v5.11 kernel and the Chrome browser. We've also collected benchmarks
from various workloads on servers and phones running older kernel
versions. Do you have a platform in mind? I'd be happy to
share the data with you. Or if you have some workloads/benchmarks, I
could collect some numbers from them too.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries
  2021-03-15  0:27         ` Zi Yan
@ 2021-03-15  1:04           ` Yu Zhao
  0 siblings, 0 replies; 65+ messages in thread
From: Yu Zhao @ 2021-03-15  1:04 UTC (permalink / raw)
  To: Zi Yan
  Cc: Matthew Wilcox, linux-mm, Alex Shi, Andrew Morton, Dave Hansen,
	Hillf Danton, Johannes Weiner, Joonsoo Kim, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim

On Sun, Mar 14, 2021 at 08:27:29PM -0400, Zi Yan wrote:
> On 14 Mar 2021, at 20:03, Yu Zhao wrote:
> 
> > On Sun, Mar 14, 2021 at 10:51:03PM +0000, Matthew Wilcox wrote:
> >> On Sun, Mar 14, 2021 at 06:12:42PM -0400, Zi Yan wrote:
> >>> On 13 Mar 2021, at 2:57, Yu Zhao wrote:
> >>>
> >>>> Some architectures support the accessed bit on non-leaf PMD entries
> >>>> (parents) in addition to leaf PTE entries (children) where pages are
> >>>> mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> >>>> as part of linear-address translation [1]. Page table walkers who are
> >>>> interested in the accessed bit on children can take advantage of this:
> >>>> they do not need to search the children when the accessed bit is not
> >>>> set on a parent, given that they have previously cleared the accessed
> >>>> bit on this parent in addition to its children.
> >>>>
> >>>> [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
> >>>>      Volume 3 (October 2019), section 4.8
> >>>
> >>> Just curious. Does this also apply to non-leaf PUD entries? Do you
> >>> mind sharing which sentence from the manual gives the information?
> >>
> >> The first few sentences from 4.8:
> >>
> >> : For any paging-structure entry that is used during linear-address
> >> : translation, bit 5 is the accessed flag. For paging-structure
> >> : entries that map a page (as opposed to referencing another paging
> >> : structure), bit 6 is the dirty flag. These flags are provided for
> >> : use by memory-management software to manage the transfer of pages and
> >> : paging structures into and out of physical memory.
> >>
> >> : Whenever the processor uses a paging-structure entry as part of
> >> : linear-address translation, it sets the accessed flag in that entry
> >> : (if it is not already set).
> 
> Matthew, thanks for the pointer.
> 
> >
> > As far as I know x86 is the one that supports this.
> >
> >> The way they differentiate between the A and D bits makes it clear to
> >> me that the A bit is set at each level of the tree, but the D bit is
> >> only set on leaf entries.
> >
> > And the difference makes perfect sense (to me). Kudos to Intel.
> 
> Hi Yu,
> 
> You only introduced HAVE_ARCH_PARENT_PMD_YOUNG but no HAVE_ARCH_PARENT_PUD_YOUNG.
> Is it PUD granularity too large to be useful for multigenerational LRU algorithm?

Oh, sorry. I overlooked this part of the question.

Yes, you are right. We found no measurable performance difference
between using and not using the accessed bit on non-leaf PUD entries.

For the PMD case, the difference is tiny but still measurable on small
systems, e.g., laptops with 4GB memory. It's clear (a few percent in
kswapd) on servers with tens of GBs of 4KB pages.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 00/14] Multigenerational LRU
  2021-03-13  7:57 [PATCH v1 00/14] Multigenerational LRU Yu Zhao
                   ` (14 preceding siblings ...)
  2021-03-14 22:48 ` [PATCH v1 00/14] Multigenerational LRU Zi Yan
@ 2021-03-15  1:13 ` Hillf Danton
  2021-03-15  6:49   ` Yu Zhao
  2021-03-15 18:00 ` Dave Hansen
  2021-03-15 18:38 ` Yang Shi
  17 siblings, 1 reply; 65+ messages in thread
From: Hillf Danton @ 2021-03-15  1:13 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, page-reclaim,
	Hillf Danton, linux-kernel, linux-mm

On Sat, 13 Mar 2021 00:57:33 -0700 Yu Zhao wrote:
> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and
> often making poor choices about what to evict. We would like to offer
> a performant, versatile and straightforward augment.

It makes my day, a Monday of thick smog in one of the big far-east
cities, to read this fresh work from a coming Mr. Kswapd, something
like 0b0695f2b34a, which removed heuristics as much as possible.

> 
> Repo
> ====
> git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/01/1101/1
> 
> Gerrit https://linux-mm-review.googlesource.com/c/page-reclaim/+/1101
> 
> Background
> ==========
> DRAM is a major factor in total cost of ownership, and improving
> memory overcommit brings a high return on investment. Over the past
> decade of research and experimentation in memory overcommit, we
> observed a distinct trend across millions of servers and clients: the
> size of page cache has been decreasing because of the growing
> popularity of cloud storage. Nowadays anon pages account for more than
> 90% of our memory consumption and page cache contains mostly
> executable pages.
> 
> Problems
> ========
> Notion of the active/inactive
> -----------------------------
> For servers equipped with hundreds of gigabytes of memory, the
> granularity of the active/inactive is too coarse to be useful for job
> scheduling. And false active/inactive rates are relatively high. In
> addition, scans of largely varying numbers of pages are unpredictable
> because inactive_is_low() is based on magic numbers.
> 
> For phones and laptops, the eviction is biased toward file pages
> because the selection has to resort to heuristics as direct
> comparisons between anon and file types are infeasible. On Android and
> Chrome OS, executable pages are frequently evicted despite the fact
> that there are many less recently used anon pages. This causes "janks"
> (slow UI rendering) and negatively impacts user experience.
> 
> For systems with multiple nodes and/or memcgs, it is impossible to
> compare lruvecs based on the notion of the active/inactive.
> 
> Incremental scans via the rmap
> ------------------------------
> Each incremental scan picks up at where the last scan left off and
> stops after it has found a handful of unreferenced pages. For most of
> the systems running cloud workloads, incremental scans lose the
> advantage under sustained memory pressure due to high ratios of the
> number of scanned pages to the number of reclaimed pages. In our case,
> the average ratio of pgscan to pgsteal is about 7.
> 
> On top of that, the rmap has poor memory locality due to its complex
> data structures. The combined effects typically result in a high
> amount of CPU usage in the reclaim path. For example, with zram, a
> typical kswapd profile on v5.11 looks like:
>   31.03%  page_vma_mapped_walk
>   25.59%  lzo1x_1_do_compress
>    4.63%  do_raw_spin_lock
>    3.89%  vma_interval_tree_iter_next
>    3.33%  vma_interval_tree_subtree_search
> 
> And with real swap, it looks like:
>   45.16%  page_vma_mapped_walk
>    7.61%  do_raw_spin_lock
>    5.69%  vma_interval_tree_iter_next
>    4.91%  vma_interval_tree_subtree_search
>    3.71%  page_referenced_one
> 
> Solutions
> =========
> Notion of generation numbers
> ----------------------------
> The notion of generation numbers introduces a quantitative approach to
> memory overcommit. A larger number of pages can be spread out across
> configurable generations, and thus they have relatively low false
> active/inactive rates. Each generation includes all pages that have
> been referenced since the last generation.
> 
> Given an lruvec, scans and the selections between anon and file types
> are all based on generation numbers, which are simple and yet
> effective. For different lruvecs, comparisons are still possible based
> on birth times of generations.
> 
> Differential scans via page tables
> ----------------------------------
> Each differential scan discovers all pages that have been referenced
> since the last scan. Specifically, it walks the mm_struct list
> associated with an lruvec to scan page tables of processes that have
> been scheduled since the last scan. The cost of each differential scan
> is roughly proportional to the number of referenced pages it
> discovers. Unless address spaces are extremely sparse, page tables
> usually have better memory locality than the rmap. The end result is
> generally a significant reduction in CPU usage, for most of the
> systems running cloud workloads.
> 
> On Chrome OS, our real-world benchmark that browses popular websites
> in multiple tabs demonstrates 51% less CPU usage from kswapd and 52%
> (full) less PSI on v5.11. And kswapd profile looks like:
>   49.36%  lzo1x_1_do_compress
>    4.54%  page_vma_mapped_walk
>    4.45%  memset_erms
>    3.47%  walk_pte_range
>    2.88%  zram_bvec_rw
> 
> In addition, direct reclaim latency is reduced by 22% at 99th
> percentile and the number of refaults is reduced 7%. These metrics are
> important to phones and laptops as they are correlated to user
> experience.
> 
> Workflow
> ========
> Evictable pages are divided into multiple generations for each lruvec.
> The youngest generation number is stored in lruvec->evictable.max_seq
> for both anon and file types as they are aged on an equal footing. The
> oldest generation numbers are stored in lruvec->evictable.min_seq[2]
> separately for anon and file types as clean file pages can be evicted
> regardless of may_swap or may_writepage. Generation numbers are
> truncated into ilog2(MAX_NR_GENS)+1 bits in order to fit into
> page->flags. The sliding window technique is used to prevent truncated
> generation numbers from overlapping. Each truncated generation number
> is an index to
> lruvec->evictable.lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES].
> Evictable pages are added to the per-zone lists indexed by max_seq or
> min_seq[2] (modulo MAX_NR_GENS), depending on whether they are being
> faulted in or read ahead. The workflow comprises two conceptually
> independent functions: the aging and the eviction.
> 
> Aging
> -----
> The aging produces young generations. Given an lruvec, the aging scans
> page tables for referenced pages of this lruvec. Upon finding one, the
> aging updates its generation number to max_seq. After each round of
> scan, the aging increments max_seq. The aging maintains either a
> system-wide mm_struct list or per-memcg mm_struct lists and tracks
> whether an mm_struct is being used on any CPUs or has been used since
> the last scan. Multiple threads can concurrently work on the same
> mm_struct list, and each of them will be given a different mm_struct
> belonging to a process that has been scheduled since the last scan.
> 
> Eviction
> --------
> The eviction consumes old generations. Given an lruvec, the eviction
> scans the pages on the per-zone lists indexed by either of min_seq[2].
> It selects a type according to the values of min_seq[2] and
> swappiness. During a scan, the eviction either sorts or isolates a
> page, depending on whether the aging has updated its generation
> number. When it finds all the per-zone lists are empty, the eviction
> increments min_seq[2] indexed by this selected type. The eviction
> triggers the aging when both of min_seq[2] reaches max_seq-1, assuming
> both anon and file types are reclaimable.
> 
> Use cases
> =========
> On Android, our most advanced simulation that generates memory
> pressure from realistic user behavior shows 18% fewer low-memory
> kills, which in turn reduces cold starts by 16%.
> 
> On Borg, a similar approach enables us to identify jobs that
> underutilize their memory and downsize them considerably without
> compromising any of our service level indicators.
> 
> On Chrome OS, our field telemetry reports 96% fewer low-memory tab
> discards and 59% fewer OOM kills from fully-utilized devices and no UX
> regressions from underutilized devices.
> 
> For other use cases include working set estimation, proactive reclaim,
> far memory tiering and NUMA-aware job scheduling, please refer to the
> documentation included in this series and the following references.
> 
> References
> ==========
> 1. Long-term SLOs for reclaimed cloud computing resources
>    https://research.google/pubs/pub43017/
> 2. Profiling a warehouse-scale computer
>    https://research.google/pubs/pub44271/
> 3. Evaluation of NUMA-Aware Scheduling in Warehouse-Scale Clusters
>    https://research.google/pubs/pub48329/
> 4. Software-defined far memory in warehouse-scale computers
>    https://research.google/pubs/pub48551/
> 5. Borg: the Next Generation
>    https://research.google/pubs/pub49065/
> 
> Yu Zhao (14):
>   include/linux/memcontrol.h: do not warn in page_memcg_rcu() if
>     !CONFIG_MEMCG
>   include/linux/nodemask.h: define next_memory_node() if !CONFIG_NUMA
>   include/linux/huge_mm.h: define is_huge_zero_pmd() if
>     !CONFIG_TRANSPARENT_HUGEPAGE
>   include/linux/cgroup.h: export cgroup_mutex
>   mm/swap.c: export activate_page()
>   mm, x86: support the access bit on non-leaf PMD entries
>   mm/pagewalk.c: add pud_entry_post() for post-order traversals
>   mm/vmscan.c: refactor shrink_node()
>   mm: multigenerational lru: mm_struct list
>   mm: multigenerational lru: core
>   mm: multigenerational lru: page activation
>   mm: multigenerational lru: user space interface
>   mm: multigenerational lru: Kconfig
>   mm: multigenerational lru: documentation
> 
>  Documentation/vm/index.rst        |    1 +
>  Documentation/vm/multigen_lru.rst |  210 +++
>  arch/Kconfig                      |    8 +
>  arch/x86/Kconfig                  |    1 +
>  arch/x86/include/asm/pgtable.h    |    2 +-
>  arch/x86/mm/pgtable.c             |    5 +-
>  fs/exec.c                         |    2 +
>  fs/proc/task_mmu.c                |    3 +-
>  include/linux/cgroup.h            |   15 +-
>  include/linux/huge_mm.h           |    5 +
>  include/linux/memcontrol.h        |    5 +-
>  include/linux/mm.h                |    1 +
>  include/linux/mm_inline.h         |  246 ++++
>  include/linux/mm_types.h          |  135 ++
>  include/linux/mmzone.h            |   62 +-
>  include/linux/nodemask.h          |    1 +
>  include/linux/page-flags-layout.h |   20 +-
>  include/linux/pagewalk.h          |    4 +
>  include/linux/pgtable.h           |    4 +-
>  include/linux/swap.h              |    5 +-
>  kernel/events/uprobes.c           |    2 +-
>  kernel/exit.c                     |    1 +
>  kernel/fork.c                     |   10 +
>  kernel/kthread.c                  |    1 +
>  kernel/sched/core.c               |    2 +
>  mm/Kconfig                        |   29 +
>  mm/huge_memory.c                  |    5 +-
>  mm/khugepaged.c                   |    2 +-
>  mm/memcontrol.c                   |   28 +
>  mm/memory.c                       |   14 +-
>  mm/migrate.c                      |    2 +-
>  mm/mm_init.c                      |   13 +-
>  mm/mmzone.c                       |    2 +
>  mm/pagewalk.c                     |    5 +
>  mm/rmap.c                         |    6 +
>  mm/swap.c                         |   58 +-
>  mm/swapfile.c                     |    6 +-
>  mm/userfaultfd.c                  |    2 +-
>  mm/vmscan.c                       | 2091 +++++++++++++++++++++++++++--
>  39 files changed, 2870 insertions(+), 144 deletions(-)
>  create mode 100644 Documentation/vm/multigen_lru.rst
> 
> -- 
> 2.31.0.rc2.261.g7f71774620-goog
> 
> 


* Re: [PATCH v1 10/14] mm: multigenerational lru: core
  2021-03-13  7:57 ` [PATCH v1 10/14] mm: multigenerational lru: core Yu Zhao
@ 2021-03-15  2:02   ` Andi Kleen
  2021-03-15  3:37     ` Yu Zhao
  0 siblings, 1 reply; 65+ messages in thread
From: Andi Kleen @ 2021-03-15  2:02 UTC (permalink / raw)
  To: Yu Zhao
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim

Yu Zhao <yuzhao@google.com> writes:
> +
> +#ifdef CONFIG_MEMCG
> +		if (memcg && atomic_read(&memcg->moving_account))
> +			goto contended;
> +#endif
> +		if (!mmap_read_trylock(mm))
> +			goto contended;

These are essentially spinloops. Surely you need a cpu_relax() somewhere?

In general, for all spinloop-like constructs, it would be useful to
consider how to teach lockdep about them.

> +	do {
> +		old_flags = READ_ONCE(page->flags);
> +		new_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> +		VM_BUG_ON_PAGE(new_gen < 0, page);
> +		if (new_gen >= 0 && new_gen != old_gen)
> +			goto sort;
> +
> +		new_gen = (old_gen + 1) % MAX_NR_GENS;
> +		new_flags = (old_flags & ~LRU_GEN_MASK) | ((new_gen + 1UL) << LRU_GEN_PGOFF);
> +		/* mark page for reclaim if pending writeback */
> +		if (front)
> +			new_flags |= BIT(PG_reclaim);
> +	} while (cmpxchg(&page->flags, old_flags, new_flags) !=
> old_flags);

I see this cmpxchg flags pattern a lot. Could there be some common code
factoring?

-Andi


* Re: [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries
  2021-03-14 23:22   ` Dave Hansen
@ 2021-03-15  3:16     ` Yu Zhao
  0 siblings, 0 replies; 65+ messages in thread
From: Yu Zhao @ 2021-03-15  3:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim

On Sun, Mar 14, 2021 at 04:22:03PM -0700, Dave Hansen wrote:
> On 3/12/21 11:57 PM, Yu Zhao wrote:
> > Some architectures support the accessed bit on non-leaf PMD entries
> > (parents) in addition to leaf PTE entries (children) where pages are
> > mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> > as part of linear-address translation [1]. Page table walkers who are
> > interested in the accessed bit on children can take advantage of this:
> > they do not need to search the children when the accessed bit is not
> > set on a parent, given that they have previously cleared the accessed
> > bit on this parent in addition to its children.
> 
> I'd like to hear a *LOT* more about how this is going to be used.
> 
> The one part of this which is entirely missing is the interaction with
> the TLB and mid-level paging structure caches.  The CPU is pretty
> aggressive about setting no-leaf accessed bits when TLB entries are
> created.  This *looks* to be depending on that behavior, but it would be
> nice to spell it out explicitly.

Good point. Let me start with a couple of observations we've made:
  1) some applications create very sparse address spaces, for various
  reasons. A notable example is those using the Scudo memory allocator:
  they usually have double-digit numbers of PTE entries for each PMD
  entry (and thousands of VMAs for just a few hundred MBs of memory
  usage, sigh...).
  2) scans of an address space (from the reclaim path) are much less
  frequent than context switches into it. Even under our heaviest memory
  pressure (30%+ overcommitted; guess how much we've profited from
  it :) ), the two still differ by orders of magnitude.
  Specifically, on our smallest system (2GB, with PCID), we observed
  no difference between flushing and not flushing TLB in terms of page
  selections. We actually observed more TLB misses under heavier
  memory pressure, and our theory is that this is due to increased
  memory footprint that causes the pressure.

There are two use cases for the accessed bit on non-leaf PMD entries:
the hot tracking and the cold tracking. I'll focus on the cold
tracking, which is what this series is about.

Since non-leaf entries are more likely to be cached, in theory, the
false negative rate is higher compared with leaf entries as the CPU
won't set the accessed bit again until the next TLB miss. (Here a
false negative means the accessed bit isn't set on an entry that has
been used, after we cleared the accessed bit. And IIRC, there are also
false positives, i.e., the accessed bit is set on entries used by
speculative execution only.) But this is not a problem because of the
second observation above.

Now let's consider the worst case scenario: what happens when we hit
a false negative on a non-leaf PMD entry? We think the pages mapped
by the PTE entries of this PMD entry are inactive and try to reclaim
them, until we see the accessed bit set on one of the PTE entries.
This will cost us one futile attempt for all the 512 PTE entries. A
glance at lru_gen_scan_around() in the 11th patch would explain
exactly why. If you are guessing that function embodies the same idea
of "fault around", you are right.

And there are two places that could benefit from this patch (and the
next) immediately, independently of this series. One is
clear_refs_test_walk() in fs/proc/task_mmu.c. The other is
madvise_pageout_page_range() and madvise_cold_page_range() in
mm/madvise.c. Both are page table walkers that clear the accessed bit.

I think I've covered a lot of ground but I'm sure there is a lot more.
So please feel free to add and I'll include everything we discuss here
in the next version.


* Re: [PATCH v1 10/14] mm: multigenerational lru: core
  2021-03-15  2:02   ` Andi Kleen
@ 2021-03-15  3:37     ` Yu Zhao
  0 siblings, 0 replies; 65+ messages in thread
From: Yu Zhao @ 2021-03-15  3:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim

On Sun, Mar 14, 2021 at 07:02:01PM -0700, Andi Kleen wrote:
> Yu Zhao <yuzhao@google.com> writes:

Hi Andi!

Recovering the context a bit:

		err = -EBUSY;

> > +
> > +#ifdef CONFIG_MEMCG
> > +		if (memcg && atomic_read(&memcg->moving_account))
> > +			goto contended;
> > +#endif
> > +		if (!mmap_read_trylock(mm))
> > +			goto contended;
> 
> These are essentially spinloops. Surely you need a cpu_relax() somewhere?

contended:
		...
		cond_resched();
	} while (err == -EAGAIN && !mm_is_oom_victim(mm) && !mm_has_migrated(mm, memcg));

So if it's contended, err stays -EBUSY and we break out of the loop
rather than spin.

> In general for all of spinloop like constructs it would be useful to
> consider how to teach lockdep about them.
> 
> > +	do {
> > +		old_flags = READ_ONCE(page->flags);
> > +		new_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> > +		VM_BUG_ON_PAGE(new_gen < 0, page);
> > +		if (new_gen >= 0 && new_gen != old_gen)
> > +			goto sort;
> > +
> > +		new_gen = (old_gen + 1) % MAX_NR_GENS;
> > +		new_flags = (old_flags & ~LRU_GEN_MASK) | ((new_gen + 1UL) << LRU_GEN_PGOFF);
> > +		/* mark page for reclaim if pending writeback */
> > +		if (front)
> > +			new_flags |= BIT(PG_reclaim);
> > +	} while (cmpxchg(&page->flags, old_flags, new_flags) !=
> > old_flags);
> 
> I see this cmpxchg flags pattern a lot. Could there be some common code
> factoring?

Thanks for noticing this. A shorthand macro would be nice. Hmm... let
me investigate; I don't know off the top of my head how to write a
macro that can be invoked with a trailing block, like

  cmpxchg_macro() {
     func();
  }

without backslashes at the call site, and that expands to

  do {
     func();
   } while ();
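
For what it's worth, here is a hypothetical sketch (not something in
this series) of a for-based macro that takes a trailing block, so the
call site needs no backslashes; it assumes a C99-style declaration in
the for-init is acceptable:

#define cmpxchg_page_flags(page, old, new)				\
	for (bool __done = false;					\
	     !__done && ((old) = READ_ONCE((page)->flags), true);	\
	     __done = cmpxchg(&(page)->flags, (old), (new)) == (old))

/* usage (variables as in the snippet above): the body recomputes new
   from the freshly read old on every retry */
cmpxchg_page_flags(page, old_flags, new_flags) {
	new_gen = (old_gen + 1) % MAX_NR_GENS;
	new_flags = (old_flags & ~LRU_GEN_MASK) | ((new_gen + 1UL) << LRU_GEN_PGOFF);
	if (front)
		new_flags |= BIT(PG_reclaim);
}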


* Re: [PATCH v1 00/14] Multigenerational LRU
  2021-03-15  1:13 ` Hillf Danton
@ 2021-03-15  6:49   ` Yu Zhao
  0 siblings, 0 replies; 65+ messages in thread
From: Yu Zhao @ 2021-03-15  6:49 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, page-reclaim,
	linux-kernel, linux-mm

On Mon, Mar 15, 2021 at 09:13:50AM +0800, Hillf Danton wrote:
> On Sat, 13 Mar 2021 00:57:33 -0700 Yu Zhao wrote:
> > TLDR
> > ====
> > The current page reclaim is too expensive in terms of CPU usage and
> > often making poor choices about what to evict. We would like to offer
> > a performant, versatile and straightforward augment.
> 
> It makes my day, Monday of thick smog in one of the far east big
> cities, to read the fresh work, something like 0b0695f2b34a that removed
> heuristics as much as possible, of a coming Mr. Kswapd. 

Hi Hillf!

Sorry to hear about the smog; we don't have smog here, only a few feet
of snow...

I shared the latest version of the cover letter here, if you are a fan
of Google Docs:
https://docs.google.com/document/d/1UxcpPAFNk1KpTJDKDXWekj_n6ebpQ-cwbXZlYoebTVM

And speaking of heuristics, yeah, I totally understand. We've had more
than our fair share of problems with get_scan_count() and
inactive_is_low(). And we are still carrying a workaround (admittedly
a terrible one) we posted more than a decade ago, on some of our old
kernel versions:
https://lore.kernel.org/linux-mm/20101028191523.GA14972@google.com/

And people who run the Chrome browser but don't have this patch (non-
Chrome OS) had problems!
https://lore.kernel.org/linux-mm/54C77086.7090505@suse.cz/

With generation numbers, the equivalent to inactive_is_low() is

  max_seq - min(min_seq[!swappiness], min_seq[1]) + 1 > MIN_NR_GENS

in get_nr_to_scan(); and the equivalent to get_scan_count() is

  *file = !swappiness || min_seq[0] > min_seq[1] ||
	  (min_seq[0] == min_seq[1] &&
	   max(lruvec->evictable.isolated[0], 1UL) * (200 - swappiness) >
	   max(lruvec->evictable.isolated[1], 1UL) * (swappiness - 1));

in isolate_lru_gen_pages(), both in the 10th patch.
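
Spelled out with intermediates (just a restatement of the first check
above; the field names follow the cover letter):

  unsigned long max_seq = READ_ONCE(lruvec->evictable.max_seq);
  /* swappiness == 0: only the file type's min_seq counts */
  unsigned long min_seq = min(READ_ONCE(lruvec->evictable.min_seq[!swappiness]),
                              READ_ONCE(lruvec->evictable.min_seq[1]));
  bool inactive_is_low_equiv = max_seq - min_seq + 1 > MIN_NR_GENS;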

They work amazingly well for us, and hopefully for you too :)


* Re: [PATCH v1 00/14] Multigenerational LRU
  2021-03-13  7:57 [PATCH v1 00/14] Multigenerational LRU Yu Zhao
                   ` (15 preceding siblings ...)
  2021-03-15  1:13 ` Hillf Danton
@ 2021-03-15 18:00 ` Dave Hansen
  2021-03-16  2:24   ` Yu Zhao
  2021-03-15 18:38 ` Yang Shi
  17 siblings, 1 reply; 65+ messages in thread
From: Dave Hansen @ 2021-03-15 18:00 UTC (permalink / raw)
  To: Yu Zhao, linux-mm
  Cc: Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim

On 3/12/21 11:57 PM, Yu Zhao wrote:
> Background
> ==========
> DRAM is a major factor in total cost of ownership, and improving
> memory overcommit brings a high return on investment. Over the past
> decade of research and experimentation in memory overcommit, we
> observed a distinct trend across millions of servers and clients: the
> size of page cache has been decreasing because of the growing
> popularity of cloud storage. Nowadays anon pages account for more than
> 90% of our memory consumption and page cache contains mostly
> executable pages.

This makes a compelling argument that current reclaim is not well
optimized for anonymous memory with low rates of sharing.  Basically,
anonymous rmap is very powerful, but we're not getting enough bang for
our buck out of it.

I also understand that the workloads you reference are anonymous-heavy
and that page cache isn't a *major* component.

But what happens to page-cache-heavy workloads?  Does this just
effectively force databases that want to use shmem over to hugetlbfs?
How bad does this scanning get in the worst case if there's a lot of
sharing?

I'm kinda surprised by this, but my 16GB laptop has a lot more page
cache than I would have guessed:

> Active(anon):    4065088 kB
> Inactive(anon):  3981928 kB
> Active(file):    2260580 kB
> Inactive(file):  3738096 kB
> AnonPages:       6624776 kB
> Mapped:           692036 kB
> Shmem:            776276 kB

Most of it isn't mapped, but it's far from all being used for text.


* Re: [PATCH v1 00/14] Multigenerational LRU
  2021-03-13  7:57 [PATCH v1 00/14] Multigenerational LRU Yu Zhao
                   ` (16 preceding siblings ...)
  2021-03-15 18:00 ` Dave Hansen
@ 2021-03-15 18:38 ` Yang Shi
  2021-03-16  3:38   ` Yu Zhao
  17 siblings, 1 reply; 65+ messages in thread
From: Yang Shi @ 2021-03-15 18:38 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Linux MM, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Ying Huang, Linux Kernel Mailing List, page-reclaim

On Fri, Mar 12, 2021 at 11:57 PM Yu Zhao <yuzhao@google.com> wrote:
>
> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and
> often making poor choices about what to evict. We would like to offer
> a performant, versatile and straightforward augment.
>
> Repo
> ====
> git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/01/1101/1
>
> Gerrit https://linux-mm-review.googlesource.com/c/page-reclaim/+/1101
>
> Background
> ==========
> DRAM is a major factor in total cost of ownership, and improving
> memory overcommit brings a high return on investment. Over the past
> decade of research and experimentation in memory overcommit, we
> observed a distinct trend across millions of servers and clients: the
> size of page cache has been decreasing because of the growing
> popularity of cloud storage. Nowadays anon pages account for more than
> 90% of our memory consumption and page cache contains mostly
> executable pages.
>
> Problems
> ========
> Notion of the active/inactive
> -----------------------------
> For servers equipped with hundreds of gigabytes of memory, the
> granularity of the active/inactive is too coarse to be useful for job
> scheduling. And false active/inactive rates are relatively high. In
> addition, scans of largely varying numbers of pages are unpredictable
> because inactive_is_low() is based on magic numbers.
>
> For phones and laptops, the eviction is biased toward file pages
> because the selection has to resort to heuristics as direct
> comparisons between anon and file types are infeasible. On Android and
> Chrome OS, executable pages are frequently evicted despite the fact
> that there are many less recently used anon pages. This causes "janks"
> (slow UI rendering) and negatively impacts user experience.
>
> For systems with multiple nodes and/or memcgs, it is impossible to
> compare lruvecs based on the notion of the active/inactive.
>
> Incremental scans via the rmap
> ------------------------------
> Each incremental scan picks up at where the last scan left off and
> stops after it has found a handful of unreferenced pages. For most of
> the systems running cloud workloads, incremental scans lose the
> advantage under sustained memory pressure due to high ratios of the
> number of scanned pages to the number of reclaimed pages. In our case,
> the average ratio of pgscan to pgsteal is about 7.

So, you mean the reclaim efficiency is just 1/7? That seems quite low.
Just out of curiosity, do you have more insight into why it is that
low? I think it heavily depends on the workload. We have page-cache-heavy
workloads, and their reclaim efficiency is quite high.

>
> On top of that, the rmap has poor memory locality due to its complex
> data structures. The combined effects typically result in a high
> amount of CPU usage in the reclaim path. For example, with zram, a
> typical kswapd profile on v5.11 looks like:
>   31.03%  page_vma_mapped_walk
>   25.59%  lzo1x_1_do_compress
>    4.63%  do_raw_spin_lock
>    3.89%  vma_interval_tree_iter_next
>    3.33%  vma_interval_tree_subtree_search
>
> And with real swap, it looks like:
>   45.16%  page_vma_mapped_walk
>    7.61%  do_raw_spin_lock
>    5.69%  vma_interval_tree_iter_next
>    4.91%  vma_interval_tree_subtree_search
>    3.71%  page_referenced_one

I guess it is because your workloads have a lot of shared anon pages?

>
> Solutions
> =========
> Notion of generation numbers
> ----------------------------
> The notion of generation numbers introduces a quantitative approach to
> memory overcommit. A larger number of pages can be spread out across
> configurable generations, and thus they have relatively low false
> active/inactive rates. Each generation includes all pages that have
> been referenced since the last generation.
>
> Given an lruvec, scans and the selections between anon and file types
> are all based on generation numbers, which are simple and yet
> effective. For different lruvecs, comparisons are still possible based
> on birth times of generations.

It means you replace the active/inactive lists with multiple lists,
from most active to least active?

>
> Differential scans via page tables
> ----------------------------------
> Each differential scan discovers all pages that have been referenced
> since the last scan. Specifically, it walks the mm_struct list
> associated with an lruvec to scan page tables of processes that have
> been scheduled since the last scan. The cost of each differential scan
> is roughly proportional to the number of referenced pages it
> discovers. Unless address spaces are extremely sparse, page tables
> usually have better memory locality than the rmap. The end result is
> generally a significant reduction in CPU usage, for most of the
> systems running cloud workloads.

How about unmapped page cache? I think it is still quite common
for a lot of workloads.

>
> On Chrome OS, our real-world benchmark that browses popular websites
> in multiple tabs demonstrates 51% less CPU usage from kswapd and 52%
> (full) less PSI on v5.11. And kswapd profile looks like:
>   49.36%  lzo1x_1_do_compress
>    4.54%  page_vma_mapped_walk
>    4.45%  memset_erms
>    3.47%  walk_pte_range
>    2.88%  zram_bvec_rw
>
> In addition, direct reclaim latency is reduced by 22% at 99th
> percentile and the number of refaults is reduced 7%. These metrics are
> important to phones and laptops as they are correlated to user
> experience.
>
> Workflow
> ========
> Evictable pages are divided into multiple generations for each lruvec.
> The youngest generation number is stored in lruvec->evictable.max_seq
> for both anon and file types as they are aged on an equal footing. The
> oldest generation numbers are stored in lruvec->evictable.min_seq[2]
> separately for anon and file types as clean file pages can be evicted
> regardless of may_swap or may_writepage. Generation numbers are
> truncated into ilog2(MAX_NR_GENS)+1 bits in order to fit into
> page->flags. The sliding window technique is used to prevent truncated
> generation numbers from overlapping. Each truncated generation number
> is an index to
> lruvec->evictable.lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES].
> Evictable pages are added to the per-zone lists indexed by max_seq or
> min_seq[2] (modulo MAX_NR_GENS), depending on whether they are being
> faulted in or read ahead. The workflow comprises two conceptually
> independent functions: the aging and the eviction.

Could you please illustrate the data structures? I think this would be
very helpful for understanding the code. I haven't looked into the code
closely yet, but per my shallow understanding of the above paragraphs,
the new lruvec looks like:

----------------
| max_seq  |
----------------
| .....    |
----------------
| min_seq  | -----> -------------        -------------
----------------    |  Anon     | -----> | MAX_ZONE  | ---------> list of pages
                    -------------        -------------
                    |  File     | -----> | .......   | --------->
                    -------------        -------------
                                         | ZONE_DMA  | --------->
                                         -------------

And the max_seq/min_seq is per memcg, is my understanding correct?

>
> Aging
> -----
> The aging produces young generations. Given an lruvec, the aging scans
> page tables for referenced pages of this lruvec. Upon finding one, the
> aging updates its generation number to max_seq. After each round of
> scan, the aging increments max_seq. The aging maintains either a
> system-wide mm_struct list or per-memcg mm_struct lists and tracks
> whether an mm_struct is being used on any CPUs or has been used since
> the last scan. Multiple threads can concurrently work on the same
> mm_struct list, and each of them will be given a different mm_struct
> belonging to a process that has been scheduled since the last scan.

I don't quite get how the "aging" works. IIUC, do you have a dedicated
kernel thread or threads that scan the page tables periodically to
update the generations and promote or demote pages among the lists, or
does the "aging" just happen in the reclaimer?

Thanks,
Yang

>
> Eviction
> --------
> The eviction consumes old generations. Given an lruvec, the eviction
> scans the pages on the per-zone lists indexed by either of min_seq[2].
> It selects a type according to the values of min_seq[2] and
> swappiness. During a scan, the eviction either sorts or isolates a
> page, depending on whether the aging has updated its generation
> number. When it finds all the per-zone lists are empty, the eviction
> increments min_seq[2] indexed by this selected type. The eviction
> > triggers the aging when both of min_seq[2] reach max_seq-1, assuming
> both anon and file types are reclaimable.
>
> Use cases
> =========
> On Android, our most advanced simulation that generates memory
> pressure from realistic user behavior shows 18% fewer low-memory
> kills, which in turn reduces cold starts by 16%.
>
> On Borg, a similar approach enables us to identify jobs that
> underutilize their memory and downsize them considerably without
> compromising any of our service level indicators.
>
> On Chrome OS, our field telemetry reports 96% fewer low-memory tab
> discards and 59% fewer OOM kills from fully-utilized devices and no UX
> regressions from underutilized devices.
>
> > For other use cases, including working set estimation, proactive reclaim,
> far memory tiering and NUMA-aware job scheduling, please refer to the
> documentation included in this series and the following references.
>
> References
> ==========
> 1. Long-term SLOs for reclaimed cloud computing resources
>    https://research.google/pubs/pub43017/
> 2. Profiling a warehouse-scale computer
>    https://research.google/pubs/pub44271/
> 3. Evaluation of NUMA-Aware Scheduling in Warehouse-Scale Clusters
>    https://research.google/pubs/pub48329/
> 4. Software-defined far memory in warehouse-scale computers
>    https://research.google/pubs/pub48551/
> 5. Borg: the Next Generation
>    https://research.google/pubs/pub49065/
>
> Yu Zhao (14):
>   include/linux/memcontrol.h: do not warn in page_memcg_rcu() if
>     !CONFIG_MEMCG
>   include/linux/nodemask.h: define next_memory_node() if !CONFIG_NUMA
>   include/linux/huge_mm.h: define is_huge_zero_pmd() if
>     !CONFIG_TRANSPARENT_HUGEPAGE
>   include/linux/cgroup.h: export cgroup_mutex
>   mm/swap.c: export activate_page()
>   mm, x86: support the access bit on non-leaf PMD entries
>   mm/pagewalk.c: add pud_entry_post() for post-order traversals
>   mm/vmscan.c: refactor shrink_node()
>   mm: multigenerational lru: mm_struct list
>   mm: multigenerational lru: core
>   mm: multigenerational lru: page activation
>   mm: multigenerational lru: user space interface
>   mm: multigenerational lru: Kconfig
>   mm: multigenerational lru: documentation
>
>  Documentation/vm/index.rst        |    1 +
>  Documentation/vm/multigen_lru.rst |  210 +++
>  arch/Kconfig                      |    8 +
>  arch/x86/Kconfig                  |    1 +
>  arch/x86/include/asm/pgtable.h    |    2 +-
>  arch/x86/mm/pgtable.c             |    5 +-
>  fs/exec.c                         |    2 +
>  fs/proc/task_mmu.c                |    3 +-
>  include/linux/cgroup.h            |   15 +-
>  include/linux/huge_mm.h           |    5 +
>  include/linux/memcontrol.h        |    5 +-
>  include/linux/mm.h                |    1 +
>  include/linux/mm_inline.h         |  246 ++++
>  include/linux/mm_types.h          |  135 ++
>  include/linux/mmzone.h            |   62 +-
>  include/linux/nodemask.h          |    1 +
>  include/linux/page-flags-layout.h |   20 +-
>  include/linux/pagewalk.h          |    4 +
>  include/linux/pgtable.h           |    4 +-
>  include/linux/swap.h              |    5 +-
>  kernel/events/uprobes.c           |    2 +-
>  kernel/exit.c                     |    1 +
>  kernel/fork.c                     |   10 +
>  kernel/kthread.c                  |    1 +
>  kernel/sched/core.c               |    2 +
>  mm/Kconfig                        |   29 +
>  mm/huge_memory.c                  |    5 +-
>  mm/khugepaged.c                   |    2 +-
>  mm/memcontrol.c                   |   28 +
>  mm/memory.c                       |   14 +-
>  mm/migrate.c                      |    2 +-
>  mm/mm_init.c                      |   13 +-
>  mm/mmzone.c                       |    2 +
>  mm/pagewalk.c                     |    5 +
>  mm/rmap.c                         |    6 +
>  mm/swap.c                         |   58 +-
>  mm/swapfile.c                     |    6 +-
>  mm/userfaultfd.c                  |    2 +-
>  mm/vmscan.c                       | 2091 +++++++++++++++++++++++++++--
>  39 files changed, 2870 insertions(+), 144 deletions(-)
>  create mode 100644 Documentation/vm/multigen_lru.rst
>
> --
> 2.31.0.rc2.261.g7f71774620-goog
>


* Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list
  2021-03-13  7:57 ` [PATCH v1 09/14] mm: multigenerational lru: mm_struct list Yu Zhao
@ 2021-03-15 19:40   ` Rik van Riel
  2021-03-16  2:07     ` Huang, Ying
  0 siblings, 1 reply; 65+ messages in thread
From: Rik van Riel @ 2021-03-15 19:40 UTC (permalink / raw)
  To: Yu Zhao, linux-mm
  Cc: Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim

On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:

> +/*
> + * After pages are faulted in, they become the youngest generation. They must
> + * go through aging process twice before they can be evicted. After first scan,
> + * their accessed bit set during initial faults are cleared and they become the
> + * second youngest generation. And second scan makes sure they haven't been used
> + * since the first.
> + */

I have to wonder if the reductions in OOM kills and
low-memory tab discards are due to this aging policy
change, rather than to the switch to virtual scanning.

-- 
All Rights Reversed.


* Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list
  2021-03-15 19:40   ` Rik van Riel
@ 2021-03-16  2:07     ` Huang, Ying
  2021-03-16  3:57       ` Yu Zhao
  0 siblings, 1 reply; 65+ messages in thread
From: Huang, Ying @ 2021-03-16  2:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Yu Zhao, linux-mm, Alex Shi, Andrew Morton, Dave Hansen,
	Hillf Danton, Johannes Weiner, Joonsoo Kim, Matthew Wilcox,
	Mel Gorman, Michal Hocko, Roman Gushchin, Vlastimil Babka,
	Wei Yang, Yang Shi, linux-kernel, page-reclaim

Rik van Riel <riel@surriel.com> writes:

> On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:
>
>> +/*
>> + * After pages are faulted in, they become the youngest generation.
>> They must
>> + * go through aging process twice before they can be evicted. After
>> first scan,
>> + * their accessed bit set during initial faults are cleared and they
>> become the
>> + * second youngest generation. And second scan makes sure they
>> haven't been used
>> + * since the first.
>> + */
>
> I have to wonder if the reductions in OOM kills and 
> low-memory tab discards is due to this aging policy
> change, rather than from the switch to virtual scanning.

If my understanding is correct, the temperature of the processes is
considered in addition to that of the individual pages.  That is, the
pages of processes that haven't been scheduled since the previous
scan will not be scanned.  I guess that this helps with OOM kills?

If so, how about just taking advantage of that information for OOM
killing and page reclaiming?  For example, if a process hasn't been
scheduled for a long time, just reclaim its private pages.

Best Regards,
Huang, Ying


* Re: [PATCH v1 00/14] Multigenerational LRU
  2021-03-15 18:00 ` Dave Hansen
@ 2021-03-16  2:24   ` Yu Zhao
  2021-03-16 14:50     ` Dave Hansen
  0 siblings, 1 reply; 65+ messages in thread
From: Yu Zhao @ 2021-03-16  2:24 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim

On Mon, Mar 15, 2021 at 11:00:06AM -0700, Dave Hansen wrote:
> On 3/12/21 11:57 PM, Yu Zhao wrote:
> > Background
> > ==========
> > DRAM is a major factor in total cost of ownership, and improving
> > memory overcommit brings a high return on investment. Over the past
> > decade of research and experimentation in memory overcommit, we
> > observed a distinct trend across millions of servers and clients: the
> > size of page cache has been decreasing because of the growing
> > popularity of cloud storage. Nowadays anon pages account for more than
> > 90% of our memory consumption and page cache contains mostly
> > executable pages.
> 
> This makes a compelling argument that current reclaim is not well
> optimized for anonymous memory with low rates of sharing.  Basically,
> anonymous rmap is very powerful, but we're not getting enough bang for
> our buck out of it.
> 
> I also understand that the workloads you reference are anonymous-heavy
> and that page cache isn't a *major* component.
> 
> But, what does happens to page-cache-heavy workloads?  Does this just
> effectively force databases that want to use shmem over to hugetlbfs?

No, they should benefit too. In terms of page reclaim, shmem pages are
basically considered anon: they are on anon lru and dirty shmem pages
can only be swapped (we can safely assume clean shmem pages are
virtually nonexistent) in contrast to file pages that have backing
storage and need to be written back.

I should have phrased it better: our accounting is based on what the
kernel provides, i.e., anon/file (lru) sizes you listed below.

> How bad does this scanning get in the worst case if there's a lot of
> sharing?

Actually the improvement is larger when there is more sharing, i.e.,
the higher the map_count, the larger the improvement. Let's assume we
have a shmem
page mapped by two processes. To reclaim this page, we need to make
sure neither PTE from the two sets of page tables has the accessed
bit. The current page reclaim uses the rmap, i.e., rmap_walk_file().
It first looks up the two VMAs (from the two processes mapping this
shmem file) in the interval tree of this shmem file, then from each
VMA, it goes through PGD/PUD/PMD to reach the PTE. The page can't be
reclaimed if either of the PTEs has the accessed bit set; therefore the
cost of the scanning is more than proportional to the number of
accesses when there is a lot of sharing.

Why does this series make it better? We track the usage of page tables.
Specifically, we work alongside switch_mm(): if one of the processes
above hasn't been scheduled since the last scan, we don't need to scan
its page tables. So the cost is roughly proportional to the number of
accesses, regardless of how many processes there are. And instead of scanning
pages one by one, we do it in large batches. However, page tables can
be very sparse -- this is not a problem for the rmap because it knows
exactly where the PTEs are (by vma_address()). We only know ranges (by
vma->vm_start/vm_end). This is where the accessed bit on non-leaf
PMDs can be of help.
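
To make the switch_mm() part concrete, here is an illustrative sketch
only (the list link, the flag and the helper are made-up names, not the
series' API): a hook in switch_mm() sets a per-mm "used" flag, and the
aging skips any mm whose flag is still clear since the last scan.

static struct mm_struct *get_next_mm_to_scan(struct list_head *mm_list)
{
	struct mm_struct *mm;

	list_for_each_entry(mm, mm_list, lru_gen_link) {	/* hypothetical link */
		/* hypothetical flag set by a hook in switch_mm() */
		if (!test_and_clear_bit(MM_LRU_GEN_USED, &mm->lru_gen_flags))
			continue;	/* not scheduled since the last scan */
		if (mmget_not_zero(mm))
			return mm;	/* the caller walks its page tables */
	}

	return NULL;
}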

But I guess you are wondering what the downsides are. Well, we haven't
seen any (yet). We do have page cache (non-shmem) heavy workloads,
but not at a scale large enough to make any statistically meaningful
observations. We are very interested in working with anybody who has
page cache (non-shmem) heavy workloads and is willing to try out this
series.

> I'm kinda surprised by this, but my 16GB laptop has a lot more page
> cache than I would have guessed:
> 
> > Active(anon):    4065088 kB
> > Inactive(anon):  3981928 kB
> > Active(file):    2260580 kB
> > Inactive(file):  3738096 kB
> > AnonPages:       6624776 kB
> > Mapped:           692036 kB
> > Shmem:            776276 kB
> 
> Most of it isn't mapped, but it's far from all being used for text.

We have categorized users into two groups:
  1) average users that haven't experienced memory pressure since
  their systems have booted. The booting process fills up page cache
  with one-off file pages, and they remain until users experience
  memory pressure. This can be confirmed by looking at those counters
  of a freshly rebooted and idle system. My guess is that this is the case for
  your laptop.
  2) engineering users who store git repos and compile locally. They
  complained about their browsers being janky because anon memory got
  swapped even though their systems had a lot of stale file pages in
  page cache, with the current page reclaim. They are what we consider
  part of the page cache (non-shmem) heavy group.


* Re: [PATCH v1 00/14] Multigenerational LRU
  2021-03-15 18:38 ` Yang Shi
@ 2021-03-16  3:38   ` Yu Zhao
  0 siblings, 0 replies; 65+ messages in thread
From: Yu Zhao @ 2021-03-16  3:38 UTC (permalink / raw)
  To: Yang Shi
  Cc: Linux MM, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Ying Huang, Linux Kernel Mailing List, page-reclaim

On Mon, Mar 15, 2021 at 11:38:20AM -0700, Yang Shi wrote:
> On Fri, Mar 12, 2021 at 11:57 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > TLDR
> > ====
> > The current page reclaim is too expensive in terms of CPU usage and
> > often making poor choices about what to evict. We would like to offer
> > a performant, versatile and straightforward augment.
> >
> > Repo
> > ====
> > git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/01/1101/1
> >
> > Gerrit https://linux-mm-review.googlesource.com/c/page-reclaim/+/1101
> >
> > Background
> > ==========
> > DRAM is a major factor in total cost of ownership, and improving
> > memory overcommit brings a high return on investment. Over the past
> > decade of research and experimentation in memory overcommit, we
> > observed a distinct trend across millions of servers and clients: the
> > size of page cache has been decreasing because of the growing
> > popularity of cloud storage. Nowadays anon pages account for more than
> > 90% of our memory consumption and page cache contains mostly
> > executable pages.
> >
> > Problems
> > ========
> > Notion of the active/inactive
> > -----------------------------
> > For servers equipped with hundreds of gigabytes of memory, the
> > granularity of the active/inactive is too coarse to be useful for job
> > scheduling. And false active/inactive rates are relatively high. In
> > addition, scans of largely varying numbers of pages are unpredictable
> > because inactive_is_low() is based on magic numbers.
> >
> > For phones and laptops, the eviction is biased toward file pages
> > because the selection has to resort to heuristics as direct
> > comparisons between anon and file types are infeasible. On Android and
> > Chrome OS, executable pages are frequently evicted despite the fact
> > that there are many less recently used anon pages. This causes "janks"
> > (slow UI rendering) and negatively impacts user experience.
> >
> > For systems with multiple nodes and/or memcgs, it is impossible to
> > compare lruvecs based on the notion of the active/inactive.
> >
> > Incremental scans via the rmap
> > ------------------------------
> > Each incremental scan picks up at where the last scan left off and
> > stops after it has found a handful of unreferenced pages. For most of
> > the systems running cloud workloads, incremental scans lose the
> > advantage under sustained memory pressure due to high ratios of the
> > number of scanned pages to the number of reclaimed pages. In our case,
> > the average ratio of pgscan to pgsteal is about 7.
> 
> So, you mean the reclaim efficiency is just 1/7? It seems quite low.

Well, from the perspective of memory utilization, 6/7 is non-idle. And
in our dictionary, high "reclaim efficiency" is a synonym for
underutilization :)

> Just out of curiosity, did you have more insights about why it is that
> low? I think it heavily depends on workload. We have page cache heavy
> workloads, the efficiency rate is quite high.

Yes, our observation on (a small group of) page cache heavy workloads
is the same. They access files via file descriptors, and sometimes
stream large files, i.e., only reading each file page once. Those
pages they leave in page cache are highly reclaimable because they
are clean, not mapped into page tables and therefore can be dropped
quickly.

> > On top of that, the rmap has poor memory locality due to its complex
> > data structures. The combined effects typically result in a high
> > amount of CPU usage in the reclaim path. For example, with zram, a
> > typical kswapd profile on v5.11 looks like:
> >   31.03%  page_vma_mapped_walk
> >   25.59%  lzo1x_1_do_compress
> >    4.63%  do_raw_spin_lock
> >    3.89%  vma_interval_tree_iter_next
> >    3.33%  vma_interval_tree_subtree_search
> >
> > And with real swap, it looks like:
> >   45.16%  page_vma_mapped_walk
> >    7.61%  do_raw_spin_lock
> >    5.69%  vma_interval_tree_iter_next
> >    4.91%  vma_interval_tree_subtree_search
> >    3.71%  page_referenced_one
> 
> I guess it is because your workloads have a lot of shared anon pages?

Sharing (map_count > 1) does make the kswapd profile look worse. But
the majority of our anon memory, including shmem, is not shared but
mapped only once (map_count = 1).

> > Solutions
> > =========
> > Notion of generation numbers
> > ----------------------------
> > The notion of generation numbers introduces a quantitative approach to
> > memory overcommit. A larger number of pages can be spread out across
> > configurable generations, and thus they have relatively low false
> > active/inactive rates. Each generation includes all pages that have
> > been referenced since the last generation.
> >
> > Given an lruvec, scans and the selections between anon and file types
> > are all based on generation numbers, which are simple and yet
> > effective. For different lruvecs, comparisons are still possible based
> > on birth times of generations.
> 
> It means you replace the active/inactive lists to multiple lists, from
> most active to least active?

Precisely.

> > Differential scans via page tables
> > ----------------------------------
> > Each differential scan discovers all pages that have been referenced
> > since the last scan. Specifically, it walks the mm_struct list
> > associated with an lruvec to scan page tables of processes that have
> > been scheduled since the last scan. The cost of each differential scan
> > is roughly proportional to the number of referenced pages it
> > discovers. Unless address spaces are extremely sparse, page tables
> > usually have better memory locality than the rmap. The end result is
> > generally a significant reduction in CPU usage, for most of the
> > systems running cloud workloads.
> 
> How's about unmapped page caches? I think they are still quite common
> for a lot of workloads.

Yes, they are covered too, by mark_page_accessed(), when they are
read/written via file descriptors.

> > On Chrome OS, our real-world benchmark that browses popular websites
> > in multiple tabs demonstrates 51% less CPU usage from kswapd and 52%
> > (full) less PSI on v5.11. And kswapd profile looks like:
> >   49.36%  lzo1x_1_do_compress
> >    4.54%  page_vma_mapped_walk
> >    4.45%  memset_erms
> >    3.47%  walk_pte_range
> >    2.88%  zram_bvec_rw
> >
> > In addition, direct reclaim latency is reduced by 22% at 99th
> > percentile and the number of refaults is reduced 7%. These metrics are
> > important to phones and laptops as they are correlated to user
> > experience.
> >
> > Workflow
> > ========
> > Evictable pages are divided into multiple generations for each lruvec.
> > The youngest generation number is stored in lruvec->evictable.max_seq
> > for both anon and file types as they are aged on an equal footing. The
> > oldest generation numbers are stored in lruvec->evictable.min_seq[2]
> > separately for anon and file types as clean file pages can be evicted
> > regardless of may_swap or may_writepage. Generation numbers are
> > truncated into ilog2(MAX_NR_GENS)+1 bits in order to fit into
> > page->flags. The sliding window technique is used to prevent truncated
> > generation numbers from overlapping. Each truncated generation number
> > is an index to
> > lruvec->evictable.lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES].
> > Evictable pages are added to the per-zone lists indexed by max_seq or
> > min_seq[2] (modulo MAX_NR_GENS), depending on whether they are being
> > faulted in or read ahead. The workflow comprises two conceptually
> > independent functions: the aging and the eviction.
> 
> Could you please illustrate the data structures? I think this would be
> very helpful to understand the code. I haven't looked into the code
> closely yet, per my shallow understanding to the above paragraphs, the
> new lruvec looks like:
> 
> ----------------
> | max_seq  |
> ----------------
> | .....    |
> ----------------
> | min_seq  | -----> -------------        -------------
> ----------------    |  Anon     | -----> | MAX_ZONE  | ---------> list of pages
>                     -------------        -------------
>                     |  File     | -----> | .......   | --------->
>                     -------------        -------------
>                                          | ZONE_DMA  | --------->
>                                          -------------
> 
> And the max_seq/min_seq is per memcg, is my understanding correct?

Yes, on single-node systems. To be precise, they are per lruvec. Each
memcg has N lruvecs on an N-node system.

A crude analogy would be a ring buffer: in terms of generations, the
aging corresponds to the writer advancing max_seq, and the eviction to
the reader advancing min_seq. (The aging only tags pages -- it doesn't add
pages to the lists; page allocations do.)
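
In seq terms, for a given type (anon or file), an illustrative fragment
of the analogy:

  /* the two cursors behave like ring-buffer indices */
  unsigned long nr_gens = max_seq - min(min_seq[0], min_seq[1]) + 1;	/* <= MAX_NR_GENS */
  int write_idx = max_seq % MAX_NR_GENS;	/* bucket the aging tags pages with */
  int read_idx = min_seq[type] % MAX_NR_GENS;	/* bucket the eviction drains */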

> > Aging
> > -----
> > The aging produces young generations. Given an lruvec, the aging scans
> > page tables for referenced pages of this lruvec. Upon finding one, the
> > aging updates its generation number to max_seq. After each round of
> > scan, the aging increments max_seq. The aging maintains either a
> > system-wide mm_struct list or per-memcg mm_struct lists and tracks
> > whether an mm_struct is being used on any CPUs or has been used since
> > the last scan. Multiple threads can concurrently work on the same
> > mm_struct list, and each of them will be given a different mm_struct
> > belonging to a process that has been scheduled since the last scan.
> 
> I don't quite get how the "aging" works. IIUC, you have a dedicated
> kernel thread or threads to scan the page tables periodically to
> update the generations and promote or demote pages among the lists or
> the "aging" just happens in reclaimer?

The aging can happen in any reclaiming thread, when, let's say, "the
inactive" is low. There are no dedicated kernel threads, unless you
count kswapd as one.

For example, for memcg reclaim, we have:
  page charge failure
    memcg reclaim
      select a node
        get lruvec from the node and the memcg
retry:
          if max_seq - min_seq < 2, i.e., no inactive pages
            the aging: scan the mm_struct lists
              increment max_seq
          the eviction: scan the page lists
            if the per-zone lists are empty
              increment min_seq
              goto retry


* Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list
  2021-03-16  2:07     ` Huang, Ying
@ 2021-03-16  3:57       ` Yu Zhao
  2021-03-16  6:44         ` Huang, Ying
  0 siblings, 1 reply; 65+ messages in thread
From: Yu Zhao @ 2021-03-16  3:57 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Rik van Riel, linux-mm, Alex Shi, Andrew Morton, Dave Hansen,
	Hillf Danton, Johannes Weiner, Joonsoo Kim, Matthew Wilcox,
	Mel Gorman, Michal Hocko, Roman Gushchin, Vlastimil Babka,
	Wei Yang, Yang Shi, linux-kernel, page-reclaim

On Tue, Mar 16, 2021 at 10:07:36AM +0800, Huang, Ying wrote:
> Rik van Riel <riel@surriel.com> writes:
> 
> > On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:
> >
> >> +/*
> >> + * After pages are faulted in, they become the youngest generation.
> >> They must
> >> + * go through aging process twice before they can be evicted. After
> >> first scan,
> >> + * their accessed bit set during initial faults are cleared and they
> >> become the
> >> + * second youngest generation. And second scan makes sure they
> >> haven't been used
> >> + * since the first.
> >> + */
> >
> > I have to wonder if the reductions in OOM kills and 
> > low-memory tab discards is due to this aging policy
> > change, rather than from the switch to virtual scanning.

There are no policy changes per se. The current page reclaim also
scans a faulted-in page at least twice before it can reclaim it.
That said, the new aging yields a better overall result because it
discovers every page that has been referenced since the last scan,
in addition to what Ying has mentioned. The current page scan stops
once it finds enough candidates, which may seem more efficient, but
actually pays the price for not finding the best.

> If my understanding were correct, the temperature of the processes is
> considered in addition to that of the individual pages.  That is, the
> pages of the processes that haven't been scheduled after the previous
> scanning will not be scanned.  I guess that this helps OOM kills?

Yes, that's correct.

> If so, how about just take advantage of that information for OOM killing
> and page reclaiming?  For example, if a process hasn't been scheduled
> for long time, just reclaim its private pages.

This is how it works. Pages that haven't been scanned grow older
automatically because those that have been scanned will be tagged with
younger generation numbers. Eviction does a bucket sort based on
generation numbers and attacks the oldest.


* Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list
  2021-03-16  3:57       ` Yu Zhao
@ 2021-03-16  6:44         ` Huang, Ying
  2021-03-16  7:56           ` Yu Zhao
  0 siblings, 1 reply; 65+ messages in thread
From: Huang, Ying @ 2021-03-16  6:44 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Rik van Riel, linux-mm, Alex Shi, Andrew Morton, Dave Hansen,
	Hillf Danton, Johannes Weiner, Joonsoo Kim, Matthew Wilcox,
	Mel Gorman, Michal Hocko, Roman Gushchin, Vlastimil Babka,
	Wei Yang, Yang Shi, linux-kernel, page-reclaim

Yu Zhao <yuzhao@google.com> writes:

> On Tue, Mar 16, 2021 at 10:07:36AM +0800, Huang, Ying wrote:
>> Rik van Riel <riel@surriel.com> writes:
>> 
>> > On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:
>> >
>> >> +/*
>> >> + * After pages are faulted in, they become the youngest generation.
>> >> They must
>> >> + * go through aging process twice before they can be evicted. After
>> >> first scan,
>> >> + * their accessed bit set during initial faults are cleared and they
>> >> become the
>> >> + * second youngest generation. And second scan makes sure they
>> >> haven't been used
>> >> + * since the first.
>> >> + */
>> >
>> > I have to wonder if the reductions in OOM kills and 
>> > low-memory tab discards is due to this aging policy
>> > change, rather than from the switch to virtual scanning.
>
> There are no policy changes per se. The current page reclaim also
> scans a faulted-in page at least twice before it can reclaim it.
> That said, the new aging yields a better overall result because it
> discovers every page that has been referenced since the last scan,
> in addition to what Ying has mentioned. The current page scan stops
> stops once it finds enough candidates, which may seem more
> efficiently, but actually pays the price for not finding the best.
>
>> If my understanding were correct, the temperature of the processes is
>> considered in addition to that of the individual pages.  That is, the
>> pages of the processes that haven't been scheduled after the previous
>> scanning will not be scanned.  I guess that this helps OOM kills?
>
> Yes, that's correct.
>
>> If so, how about just take advantage of that information for OOM killing
>> and page reclaiming?  For example, if a process hasn't been scheduled
>> for long time, just reclaim its private pages.
>
> This is how it works. Pages that haven't been scanned grow older
> automatically because those that have been scanned will be tagged with
> younger generation numbers. Eviction does bucket sort based on
> generation numbers and attacks the oldest.

Sorry, my original words were misleading.  What I wanted to say was:
is it good enough to do the following?

- Do not change the core algorithm of current page reclaiming.

- Add some new logic to reclaim a process's private pages regardless of
  the Accessed bits if the process has not been scheduled for a long
  enough time.  This can be done before the normal page reclaiming.

So this is one small-step improvement to the current page reclaiming
algorithm that takes advantage of the scheduler information.  It's
clearly not as sophisticated as your new algorithm; for example, the
cold pages in the hot processes will not be reclaimed in this stage.
But it can reduce the overhead of scanning too.

All in all, some of your ideas may help the original LRU algorithm too,
or could be experimented with without replacing the original algorithm.

But from another point of view, your solution can be seen as a kind of
improvement on top of the original LRU algorithm too.  It moves the
recently accessed pages to what are effectively multiple active lists,
based on scanning page tables directly (instead of via the reverse
mapping).

Best Regards,
Huang, Ying



* Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list
  2021-03-16  6:44         ` Huang, Ying
@ 2021-03-16  7:56           ` Yu Zhao
  2021-03-17  3:37             ` Huang, Ying
  0 siblings, 1 reply; 65+ messages in thread
From: Yu Zhao @ 2021-03-16  7:56 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Rik van Riel, linux-mm, Alex Shi, Andrew Morton, Dave Hansen,
	Hillf Danton, Johannes Weiner, Joonsoo Kim, Matthew Wilcox,
	Mel Gorman, Michal Hocko, Roman Gushchin, Vlastimil Babka,
	Wei Yang, Yang Shi, linux-kernel, page-reclaim

On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
> Yu Zhao <yuzhao@google.com> writes:
> 
> > On Tue, Mar 16, 2021 at 10:07:36AM +0800, Huang, Ying wrote:
> >> Rik van Riel <riel@surriel.com> writes:
> >> 
> >> > On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:
> >> >
> >> >> +/*
> >> >> + * After pages are faulted in, they become the youngest generation.
> >> >> They must
> >> >> + * go through aging process twice before they can be evicted. After
> >> >> first scan,
> >> >> + * their accessed bit set during initial faults are cleared and they
> >> >> become the
> >> >> + * second youngest generation. And second scan makes sure they
> >> >> haven't been used
> >> >> + * since the first.
> >> >> + */
> >> >
> >> > I have to wonder if the reductions in OOM kills and 
> >> > low-memory tab discards is due to this aging policy
> >> > change, rather than from the switch to virtual scanning.
> >
> > There are no policy changes per se. The current page reclaim also
> > scans a faulted-in page at least twice before it can reclaim it.
> > That said, the new aging yields a better overall result because it
> > discovers every page that has been referenced since the last scan,
> > in addition to what Ying has mentioned. The current page scan stops
> > stops once it finds enough candidates, which may seem more
> > efficiently, but actually pays the price for not finding the best.
> >
> >> If my understanding were correct, the temperature of the processes is
> >> considered in addition to that of the individual pages.  That is, the
> >> pages of the processes that haven't been scheduled after the previous
> >> scanning will not be scanned.  I guess that this helps OOM kills?
> >
> > Yes, that's correct.
> >
> >> If so, how about just take advantage of that information for OOM killing
> >> and page reclaiming?  For example, if a process hasn't been scheduled
> >> for long time, just reclaim its private pages.
> >
> > This is how it works. Pages that haven't been scanned grow older
> > automatically because those that have been scanned will be tagged with
> > younger generation numbers. Eviction does bucket sort based on
> > generation numbers and attacks the oldest.
> 
> Sorry, my original words are misleading.  What I wanted to say was that
> is it good enough that
> 
> - Do not change the core algorithm of current page reclaiming.
> 
> - Add some new logic to reclaim the process private pages regardless of
>   the Accessed bits if the processes are not scheduled for some long
>   enough time.  This can be done before the normal page reclaiming.

This is a good idea, which is being used on Android and Chrome OS. We
call it per-process reclaim, and I've mentioned it here:
https://lore.kernel.org/linux-mm/YBkT6175GmMWBvw3@google.com/
  On Android, our most advanced simulation that generates memory
  pressure from realistic user behavior shows 18% fewer low-memory
  kills, which in turn reduces cold starts by 16%. This is on top of
  per-process reclaim, a predecessor of ``MADV_COLD`` and
  ``MADV_PAGEOUT``, against background apps.

The patches landed not long ago :) See mm/madvise.c

> So this is an one small step improvement to the current page reclaiming
> algorithm via taking advantage of the scheduler information.  It's
> clearly not sophisticated as your new algorithm, for example, the cold
> pages in the hot processes will not be reclaimed in this stage.  But it
> can reduce the overhead of scanning too.

The general problems with the direction of per-process reclaim:
  1) we can't find the coldest pages, as you have mentioned.
  2) we can't reach file pages accessed via file descriptors only,
  especially those caching config files that were read only once.
  3) we can't reclaim lru pages and slab objects proportionally and
  therefore we leave many stale slab objects behind.
  4) we have to be proactive, as you suggested (once again, you were
  right), and this has a serious problem: client's battery life can
  be affected.

The scanning overhead is only one of the two major problems of the
current page reclaim. The other problem is the granularity of the
active/inactive (sizes). We stopped using them in making job
scheduling decisions a long time ago. I know another large internet
company has adopted a similar approach to ours, and I'm wondering how
everybody else is coping with the discrepancy from those counters.

> All in all, some of your ideas may help the original LRU algorithm too.
> Or some can be experimented without replacing the original algorithm.
> 
> But from another point of view, your solution can be seen as a kind of
> improvement on top of the original LRU algorithm too.  It moves the
> recently accessed pages to kind of multiple active lists based on
> scanning page tables directly (instead of reversely).

We hope this series can be a framework or an infrastructure flexible
enough that people can build their complex use cases upon, e.g.,
proactive reclaim (machine-wide, not per process), cold memory
estimation (for job scheduling) and AEP demotion. Specifically, we want
people to use it with what you and Dave are working on here:
https://patchwork.kernel.org/project/linux-mm/cover/20210304235949.7922C1C3@viggo.jf.intel.com/


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 00/14] Multigenerational LRU
  2021-03-16  2:24   ` Yu Zhao
@ 2021-03-16 14:50     ` Dave Hansen
  2021-03-16 20:30       ` Yu Zhao
  0 siblings, 1 reply; 65+ messages in thread
From: Dave Hansen @ 2021-03-16 14:50 UTC (permalink / raw)
  To: Yu Zhao
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim

On 3/15/21 7:24 PM, Yu Zhao wrote:
> On Mon, Mar 15, 2021 at 11:00:06AM -0700, Dave Hansen wrote:
>> How bad does this scanning get in the worst case if there's a lot of
>> sharing?
> 
> Actually the improvement is larger when there is more sharing, i.e.,
> higher map_count larger improvement. Let's assume we have a shmem
> page mapped by two processes. To reclaim this page, we need to make
> sure neither PTE from the two sets of page tables has the accessed
> bit. The current page reclaim uses the rmap, i.e., rmap_walk_file().
> It first looks up the two VMAs (from the two processes mapping this
> shmem file) in the interval tree of this shmem file, then from each
> VMA, it goes through PGD/PUD/PMD to reach the PTE. The page can't be
> reclaimed if either of the PTEs has the accessed bit, therefore cost
> of the scanning is more than proportional to the number of accesses,
> when there is a lot sharing.
> 
> Why this series makes it better? We track the usage of page tables.
> Specifically, we work alongside switch_mm(): if one of the processes
> above hasn't be scheduled since the last scan, we don't need to scan
> its page tables. So the cost is roughly proportional to the number of
> accesses, regardless of how many processes. And instead of scanning
> pages one by one, we do it in large batches. However, page tables can
> be very sparse -- this is not a problem for the rmap because it knows
> exactly where the PTEs are (by vma_address()). We only know ranges (by
> vma->vm_start/vm_end). This is where the accessed bit on non-leaf
> PMDs can be of help.

That's an interesting argument.  *But*, this pivoted into describing an
optimization.  My takeaway from this is that large amounts of sharing
are probably only handled well if the processes doing the sharing are
not running constantly.

> But I guess you are wondering what downsides are. Well, we haven't
> seen any (yet). We do have page cache (non-shmem) heavy workloads,
> but not at a scale large enough to make any statistically meaningful
> observations. We are very interested in working with anybody who has
> page cache (non-shmem) heavy workloads and is willing to try out this
> series.

I would also be very interested to see some synthetic, worst-case
micros.  Maybe take a few thousand processes with very sparse page
tables that all map some shared memory.  They wake up long enough to
touch a few pages, then go back to sleep.

What happens if we do that?  I'm not saying this is a good workload or
that things must behave well, but I do find it interesting to watch the
worst case.
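
Something along these lines would probably do as a very rough sketch (all
of the constants below are arbitrary placeholders, not a tuned benchmark):

  /*
   * Very rough sketch of the worst case described above: many processes,
   * each mapping the same shared memory but touching it so sparsely that
   * every touched page sits in its own PTE table, waking up only briefly.
   * All constants are arbitrary placeholders.
   */
  #define _GNU_SOURCE
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/mman.h>
  #include <sys/wait.h>

  #define NPROC            2000
  #define MAP_SIZE         (1UL << 30)   /* 1GB shared mapping */
  #define STRIDE           (2UL << 20)   /* one page per 2MB, i.e. per PTE table */
  #define PAGES_PER_WAKEUP 8

  int main(void)
  {
      int fd = memfd_create("sparse-shared", 0);

      if (fd < 0 || ftruncate(fd, MAP_SIZE))
          return 1;

      for (int i = 0; i < NPROC; i++) {
          if (fork() == 0) {
              char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);

              if (p == MAP_FAILED)
                  _exit(1);
              srand(getpid());
              for (;;) {
                  /* touch a few scattered pages, then go back to sleep */
                  for (int j = 0; j < PAGES_PER_WAKEUP; j++)
                      p[(rand() % (MAP_SIZE / STRIDE)) * STRIDE]++;
                  sleep(10);
              }
          }
      }
      while (wait(NULL) > 0)
          ;
      return 0;
  }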

I think it would also be very worthwhile to include some research in
this series about why the kernel moved away from page table scanning.
What has changed?  Are the workloads we were concerned about way back
then not around any more?  Has faster I/O or larger memory sizes with a
stagnating page size changed something?

>> I'm kinda surprised by this, but my 16GB laptop has a lot more page
>> cache than I would have guessed:
>>
>>> Active(anon):    4065088 kB
>>> Inactive(anon):  3981928 kB
>>> Active(file):    2260580 kB
>>> Inactive(file):  3738096 kB
>>> AnonPages:       6624776 kB
>>> Mapped:           692036 kB
>>> Shmem:            776276 kB
>>
>> Most of it isn't mapped, but it's far from all being used for text.
> 
> We have categorized two groups:
>   1) average users that haven't experienced memory pressure since
>   their systems have booted. The booting process fills up page cache
>   with one-off file pages, and they remain until users experience
>   memory pressure. This can be confirmed by looking at those counters
>   of a freshly rebooted and idle system. My guess this is the case for
>   your laptop.

It's been up ~12 days.  There is ~10GB of data in swap, and there's been
a lot of scanning activity which I would associate with memory pressure:

> SwapCached:      1187596 kB
> SwapTotal:      51199996 kB
> SwapFree:       40419428 kB
...
> nr_vmscan_write 24900719
> nr_vmscan_immediate_reclaim 115535
> pgscan_kswapd 320831544
> pgscan_direct 23396383
> pgscan_direct_throttle 0
> pgscan_anon 127491077
> pgscan_file 216736850
> slabs_scanned 400469680
> compact_migrate_scanned 1092813949
> compact_free_scanned 4919523035
> compact_daemon_migrate_scanned 2372223
> compact_daemon_free_scanned 20989310
> unevictable_pgs_scanned 307388545


>   2) engineering users who store git repos and compile locally. They
>   complained about their browsers being janky because anon memory got
>   swapped even though their systems had a lot of stale file pages in
>   page cache, with the current page reclaim. They are what we consider
>   part of the page cache (non-shmem) heavy group.

Interesting.  You shouldn't have a shortage of folks like that among
kernel developers.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 11/14] mm: multigenerational lru: page activation
  2021-03-13  7:57 ` [PATCH v1 11/14] mm: multigenerational lru: page activation Yu Zhao
@ 2021-03-16 16:34   ` Matthew Wilcox
  2021-03-16 21:29     ` Yu Zhao
  0 siblings, 1 reply; 65+ messages in thread
From: Matthew Wilcox @ 2021-03-16 16:34 UTC (permalink / raw)
  To: Yu Zhao
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Mel Gorman, Michal Hocko,
	Roman Gushchin, Vlastimil Babka, Wei Yang, Yang Shi, Ying Huang,
	linux-kernel, page-reclaim

On Sat, Mar 13, 2021 at 12:57:44AM -0700, Yu Zhao wrote:
> In the page fault path, we want to add pages to the per-zone lists
> index by max_seq as they cannot be evicted without going through
> the aging first. For anon pages, we rename
> lru_cache_add_inactive_or_unevictable() to lru_cache_add_page_vma()
> and add a new parameter, which is set to true in the page fault path,
> to indicate whether they should be added to the per-zone lists index
> by max_seq. For page/swap cache, since we cannot differentiate the
> page fault path from the read ahead path at the time we call
> lru_cache_add() in add_to_page_cache_lru() and
> __read_swap_cache_async(), we have to add a new function
> lru_gen_activate_page(), which is essentially activate_page(), to move
> pages to the per-zone lists indexed by max_seq at a later time.
> Hopefully we would find pages we want to activate in lru_pvecs.lru_add
> and simply set PageActive() on them without having to actually move
> them.
> 
> In the reclaim path, pages mapped around a referenced PTE may also
> have been referenced due to spatial locality. We add a new function
> lru_gen_scan_around() to scan the vicinity of such a PTE.
> 
> In addition, we add a new function page_is_active() to tell whether a
> page is active. We cannot use PageActive() because it is only set on
> active pages while they are not on multigenerational lru. It is
> cleared while pages are on multigenerational lru, in order to spare
> the aging the trouble of clearing it when an active generation becomes
> inactive. Internally, page_is_active() compares the generation number
> of a page with max_seq and max_seq-1, which are active generations and
> protected from the eviction. Other generations, which may or may not
> exist, are inactive.

If we go with this multi-LRU approach, it feels like PageActive and
PageInactive should go away as tests.  We should have a LRU field in
the page flags with some special values:

 - Not managed through LRU list
 - Not currently on any LRU list
 - Unevictable
 - Active list 1
 - Active list 2
 - ...
 - Active list 5

Now you don't need any extra bits in the page flags.  Or if you want to
have 13 lists instead of 5, you can use just one extra bit.  I'm not
quite sure whether it makes sense to have that many lists, so I need
to try to understand that better.
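
For illustration, such a field might look something like this (the names
are made up here, not taken from this series):

  /* Illustrative only; these names are made up, not from this series. */
  enum lru_field {
      LRU_FIELD_UNMANAGED,    /* not managed through an LRU list */
      LRU_FIELD_NOT_ON_LRU,   /* not currently on any LRU list */
      LRU_FIELD_UNEVICTABLE,
      LRU_FIELD_ACTIVE_1,
      LRU_FIELD_ACTIVE_2,
      LRU_FIELD_ACTIVE_3,
      LRU_FIELD_ACTIVE_4,
      LRU_FIELD_ACTIVE_5,
  };
  /*
   * 3 special values + 5 lists = 8 states, i.e. 3 bits in page->flags;
   * 3 + 13 lists = 16 states would need one more bit, as noted above.
   */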

I'd like to echo the comments from others that it'd be nice to split apart
the multigenerational part of this and the physical scanning part of this.
It's possible they don't make performance sense without each other,
but from a review point of view, they seem entirely separate things.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 00/14] Multigenerational LRU
  2021-03-16 14:50     ` Dave Hansen
@ 2021-03-16 20:30       ` Yu Zhao
  2021-03-16 21:14         ` Dave Hansen
  0 siblings, 1 reply; 65+ messages in thread
From: Yu Zhao @ 2021-03-16 20:30 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim

On Tue, Mar 16, 2021 at 07:50:23AM -0700, Dave Hansen wrote:
> On 3/15/21 7:24 PM, Yu Zhao wrote:
> > On Mon, Mar 15, 2021 at 11:00:06AM -0700, Dave Hansen wrote:
> >> How bad does this scanning get in the worst case if there's a lot of
> >> sharing?
> > 
> > Actually the improvement is larger when there is more sharing, i.e.,
> > higher map_count larger improvement. Let's assume we have a shmem
> > page mapped by two processes. To reclaim this page, we need to make
> > sure neither PTE from the two sets of page tables has the accessed
> > bit. The current page reclaim uses the rmap, i.e., rmap_walk_file().
> > It first looks up the two VMAs (from the two processes mapping this
> > shmem file) in the interval tree of this shmem file, then from each
> > VMA, it goes through PGD/PUD/PMD to reach the PTE. The page can't be
> > reclaimed if either of the PTEs has the accessed bit, therefore cost
> > of the scanning is more than proportional to the number of accesses,
> > when there is a lot sharing.
> > 
> > Why this series makes it better? We track the usage of page tables.
> > Specifically, we work alongside switch_mm(): if one of the processes
> > above hasn't be scheduled since the last scan, we don't need to scan
> > its page tables. So the cost is roughly proportional to the number of
> > accesses, regardless of how many processes. And instead of scanning
> > pages one by one, we do it in large batches. However, page tables can
> > be very sparse -- this is not a problem for the rmap because it knows
> > exactly where the PTEs are (by vma_address()). We only know ranges (by
> > vma->vm_start/vm_end). This is where the accessed bit on non-leaf
> > PMDs can be of help.
> 
> That's an interesting argument.  *But*, this pivoted into describing an
> optimization.  My takeaway from this is that large amounts of sharing
> are probably only handled well if the processes doing the sharing are
> not running constantly.
> 
> > But I guess you are wondering what downsides are. Well, we haven't
> > seen any (yet). We do have page cache (non-shmem) heavy workloads,
> > but not at a scale large enough to make any statistically meaningful
> > observations. We are very interested in working with anybody who has
> > page cache (non-shmem) heavy workloads and is willing to try out this
> > series.
> 
> I would also be very interested to see some synthetic, worst-case
> micros.  Maybe take a few thousand processes with very sparse page
> tables that all map some shared memory.  They wake up long enough to
> touch a few pages, then go back to sleep.
> 
> What happens if we do that?  I'm not saying this is a good workload or
> that things must behave well, but I do find it interesting to watch the
> worst case.

It is a reasonable request, thank you. I've just opened a bug to cover
this case (a large sparse shared shmem) and we'll have something soon.

> I think it would also be very worthwhile to include some research in
> this series about why the kernel moved away from page table scanning.
> What has changed?  Are the workloads we were concerned about way back
> then not around any more?  Has faster I/O or larger memory sizes with a
> stagnating page size changed something?

Sure. Hugh also suggested this, but I personally found that ancient
pre-2.4 history too irrelevant (and uninteresting) to the modern age
and decided to spare the audience the boredom.

> >> I'm kinda surprised by this, but my 16GB laptop has a lot more page
> >> cache than I would have guessed:
> >>
> >>> Active(anon):    4065088 kB
> >>> Inactive(anon):  3981928 kB
> >>> Active(file):    2260580 kB
> >>> Inactive(file):  3738096 kB
> >>> AnonPages:       6624776 kB
> >>> Mapped:           692036 kB
> >>> Shmem:            776276 kB
> >>
> >> Most of it isn't mapped, but it's far from all being used for text.
> > 
> > We have categorized two groups:
> >   1) average users that haven't experienced memory pressure since
> >   their systems have booted. The booting process fills up page cache
> >   with one-off file pages, and they remain until users experience
> >   memory pressure. This can be confirmed by looking at those counters
> >   of a freshly rebooted and idle system. My guess this is the case for
> >   your laptop.
> 
> It's been up ~12 days.  There is ~10GB of data in swap, and there's been
> a lot of scanning activity which I would associate with memory pressure:
> 
> > SwapCached:      1187596 kB
> > SwapTotal:      51199996 kB
> > SwapFree:       40419428 kB
> ...
> > nr_vmscan_write 24900719
> > nr_vmscan_immediate_reclaim 115535
> > pgscan_kswapd 320831544
> > pgscan_direct 23396383
> > pgscan_direct_throttle 0
> > pgscan_anon 127491077
> > pgscan_file 216736850
> > slabs_scanned 400469680
> > compact_migrate_scanned 1092813949
> > compact_free_scanned 4919523035
> > compact_daemon_migrate_scanned 2372223
> > compact_daemon_free_scanned 20989310
> > unevictable_pgs_scanned 307388545

10G swap + 8G anon rss + 6G file rss, hmm... an interesting workload.
The file rss does seem a bit high to me; my wild speculation is that
there have been git/make activities in addition to a VM?
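
(Those figures are presumably just the counters above added up: SwapTotal -
SwapFree = 51199996 - 40419428 kB, roughly 10G; Active(anon) +
Inactive(anon), roughly 8G; and Active(file) + Inactive(file), roughly 6G.)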


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 00/14] Multigenerational LRU
  2021-03-16 20:30       ` Yu Zhao
@ 2021-03-16 21:14         ` Dave Hansen
  2021-04-10  9:21           ` Yu Zhao
  0 siblings, 1 reply; 65+ messages in thread
From: Dave Hansen @ 2021-03-16 21:14 UTC (permalink / raw)
  To: Yu Zhao
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim

On 3/16/21 1:30 PM, Yu Zhao wrote:
> On Tue, Mar 16, 2021 at 07:50:23AM -0700, Dave Hansen wrote:
>> I think it would also be very worthwhile to include some research in
>> this series about why the kernel moved away from page table scanning.
>> What has changed?  Are the workloads we were concerned about way back
>> then not around any more?  Has faster I/O or larger memory sizes with a
>> stagnating page size changed something?
> 
> Sure. Hugh also suggested this too but I personally found that ancient
> pre-2.4 history too irrelevant (and uninteresting) to the modern age
> and decided to spare audience of the boredom.

IIRC, rmap chains showed up in the 2.5 era and the VM was quite bumpy
until anon_vmas came around, which was early-ish in the 2.6 era.

But, either way, I think there is a sufficient population of nostalgic
crusty old folks around to warrant a bit of a history lesson.  We'll
enjoy the trip down memory lane, fondly remembering the old days in
Ottawa...

>>> nr_vmscan_write 24900719
>>> nr_vmscan_immediate_reclaim 115535
>>> pgscan_kswapd 320831544
>>> pgscan_direct 23396383
>>> pgscan_direct_throttle 0
>>> pgscan_anon 127491077
>>> pgscan_file 216736850
>>> slabs_scanned 400469680
>>> compact_migrate_scanned 1092813949
>>> compact_free_scanned 4919523035
>>> compact_daemon_migrate_scanned 2372223
>>> compact_daemon_free_scanned 20989310
>>> unevictable_pgs_scanned 307388545
> 
> 10G swap + 8G anon rss + 6G file rss, hmm... an interesting workload.
> The file rss does seem a bit high to me, my wild speculation is there
> have been git/make activities in addition to a VM?

I wish I was doing more git/make activities.  It's been an annoying
amount of email and web browsers for 12 days.  If anything, I'd suspect
that Thunderbird is at fault for keeping a bunch of mail in the page
cache.  There are a couple of VM's running though.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 11/14] mm: multigenerational lru: page activation
  2021-03-16 16:34   ` Matthew Wilcox
@ 2021-03-16 21:29     ` Yu Zhao
  0 siblings, 0 replies; 65+ messages in thread
From: Yu Zhao @ 2021-03-16 21:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Mel Gorman, Michal Hocko,
	Roman Gushchin, Vlastimil Babka, Wei Yang, Yang Shi, Ying Huang,
	linux-kernel, page-reclaim

On Tue, Mar 16, 2021 at 04:34:37PM +0000, Matthew Wilcox wrote:
> On Sat, Mar 13, 2021 at 12:57:44AM -0700, Yu Zhao wrote:
> > In the page fault path, we want to add pages to the per-zone lists
> > index by max_seq as they cannot be evicted without going through
> > the aging first. For anon pages, we rename
> > lru_cache_add_inactive_or_unevictable() to lru_cache_add_page_vma()
> > and add a new parameter, which is set to true in the page fault path,
> > to indicate whether they should be added to the per-zone lists index
> > by max_seq. For page/swap cache, since we cannot differentiate the
> > page fault path from the read ahead path at the time we call
> > lru_cache_add() in add_to_page_cache_lru() and
> > __read_swap_cache_async(), we have to add a new function
> > lru_gen_activate_page(), which is essentially activate_page(), to move
> > pages to the per-zone lists indexed by max_seq at a later time.
> > Hopefully we would find pages we want to activate in lru_pvecs.lru_add
> > and simply set PageActive() on them without having to actually move
> > them.
> > 
> > In the reclaim path, pages mapped around a referenced PTE may also
> > have been referenced due to spatial locality. We add a new function
> > lru_gen_scan_around() to scan the vicinity of such a PTE.
> > 
> > In addition, we add a new function page_is_active() to tell whether a
> > page is active. We cannot use PageActive() because it is only set on
> > active pages while they are not on multigenerational lru. It is
> > cleared while pages are on multigenerational lru, in order to spare
> > the aging the trouble of clearing it when an active generation becomes
> > inactive. Internally, page_is_active() compares the generation number
> > of a page with max_seq and max_seq-1, which are active generations and
> > protected from the eviction. Other generations, which may or may not
> > exist, are inactive.
> 
> If we go with this multi-LRU approach, it feels like PageActive and
> PageInactive should go away as tests.  We should have a LRU field in
> the page flags with some special values:
> 
>  - Not managed through LRU list
>  - Not currently on any LRU list
>  - Unevictable
>  - Active list 1
>  - Active list 2
>  - ...
>  - Active list 5
> 
> Now you don't need any extra bits in the page flags.  Or if you want to
> have 13 lists instead of 5, you can use just one extra bit.  I'm not
> quite sure whether it makes sense to have that many lists, so I need
> to try to understand that better.

Yes, and this would be a lot cleaner. PG_{lru,unevictable,active,
referenced,reclaim,workingset,young,idle} could all go away. Look how
many bits we've added just for page reclaim. Sigh...

> I'd like to echo the comments from others that it'd be nice to split apart
> the multigenerational part of this and the physical scanning part of this.
> It's possible they don't make performance sense without each other,
> but from a review point of view, they seem entirely separate things.

Thanks for noticing. I do plan to see if the page table scanning part
could be better refactored. (I cut some corners by squashing it while
rebasing to latest kernel.)


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list
  2021-03-16  7:56           ` Yu Zhao
@ 2021-03-17  3:37             ` Huang, Ying
  2021-03-17 10:46               ` Yu Zhao
  0 siblings, 1 reply; 65+ messages in thread
From: Huang, Ying @ 2021-03-17  3:37 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Rik van Riel, linux-mm, Alex Shi, Andrew Morton, Dave Hansen,
	Hillf Danton, Johannes Weiner, Joonsoo Kim, Matthew Wilcox,
	Mel Gorman, Michal Hocko, Roman Gushchin, Vlastimil Babka,
	Wei Yang, Yang Shi, linux-kernel, page-reclaim

Yu Zhao <yuzhao@google.com> writes:

> On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
>> Yu Zhao <yuzhao@google.com> writes:
>> 
>> > On Tue, Mar 16, 2021 at 10:07:36AM +0800, Huang, Ying wrote:
>> >> Rik van Riel <riel@surriel.com> writes:
>> >> 
>> >> > On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:
>> >> >
>> >> >> +/*
>> >> >> + * After pages are faulted in, they become the youngest generation.
>> >> >> They must
>> >> >> + * go through aging process twice before they can be evicted. After
>> >> >> first scan,
>> >> >> + * their accessed bit set during initial faults are cleared and they
>> >> >> become the
>> >> >> + * second youngest generation. And second scan makes sure they
>> >> >> haven't been used
>> >> >> + * since the first.
>> >> >> + */
>> >> >
>> >> > I have to wonder if the reductions in OOM kills and 
>> >> > low-memory tab discards is due to this aging policy
>> >> > change, rather than from the switch to virtual scanning.
>> >
>> > There are no policy changes per se. The current page reclaim also
>> > scans a faulted-in page at least twice before it can reclaim it.
>> > That said, the new aging yields a better overall result because it
>> > discovers every page that has been referenced since the last scan,
>> > in addition to what Ying has mentioned. The current page scan stops
>> > stops once it finds enough candidates, which may seem more
>> > efficiently, but actually pays the price for not finding the best.
>> >
>> >> If my understanding were correct, the temperature of the processes is
>> >> considered in addition to that of the individual pages.  That is, the
>> >> pages of the processes that haven't been scheduled after the previous
>> >> scanning will not be scanned.  I guess that this helps OOM kills?
>> >
>> > Yes, that's correct.
>> >
>> >> If so, how about just take advantage of that information for OOM killing
>> >> and page reclaiming?  For example, if a process hasn't been scheduled
>> >> for long time, just reclaim its private pages.
>> >
>> > This is how it works. Pages that haven't been scanned grow older
>> > automatically because those that have been scanned will be tagged with
>> > younger generation numbers. Eviction does bucket sort based on
>> > generation numbers and attacks the oldest.
>> 
>> Sorry, my original words are misleading.  What I wanted to say was that
>> is it good enough that
>> 
>> - Do not change the core algorithm of current page reclaiming.
>> 
>> - Add some new logic to reclaim the process private pages regardless of
>>   the Accessed bits if the processes are not scheduled for some long
>>   enough time.  This can be done before the normal page reclaiming.
>
> This is a good idea, which being used on Android and Chrome OS. We
> call it per-process reclaim, and I've mentioned here:
> https://lore.kernel.org/linux-mm/YBkT6175GmMWBvw3@google.com/
>   On Android, our most advanced simulation that generates memory
>   pressure from realistic user behavior shows 18% fewer low-memory
>   kills, which in turn reduces cold starts by 16%. This is on top of
>   per-process reclaim, a predecessor of ``MADV_COLD`` and
>   ``MADV_PAGEOUT``, against background apps.

Thanks, now I see your improvement compared with the per-process
reclaim.  How about the per-process reclaim compared with the normal
page reclaiming for similar test cases?

My intention behind this is that your solution includes several
improvements,

a) take advantage of scheduler information
b) more fine-grained active/inactive dividing
c) page table scanning instead of rmap

Is it possible to evaluate the benefit of each step?  And is there
still some potential to optimize the current LRU-based algorithm before
adopting a totally new algorithm?

> The patches landed not long a ago :) See mm/madvise.c

:) I'm too out-dated.

>> So this is an one small step improvement to the current page reclaiming
>> algorithm via taking advantage of the scheduler information.  It's
>> clearly not sophisticated as your new algorithm, for example, the cold
>> pages in the hot processes will not be reclaimed in this stage.  But it
>> can reduce the overhead of scanning too.
>
> The general problems with the direction of per-process reclaim:
>   1) we can't find the coldest pages, as you have mentioned.
>   2) we can't reach file pages accessed via file descriptors only,
>   especially those caching config files that were read only once.

In theory this is possible: we can build an inode list based on the
access time too, although this may not be necessary.  We can reclaim
the read-once file cache before the per-process reclaim in theory.

>   3) we can't reclaim lru pages and slab objects proportionally and
>   therefore we leave many stale slab objects behind.
>   4) we have to be proactive, as you suggested (once again, you were
>   right), and this has a serious problem: client's battery life can
>   be affected.

Why can this not be done reactively?  We can start per-process reclaim
under memory pressure.  This has been used on phones and laptops already,
so presumably there's a solution for this issue?

> The scanning overhead is only one of the two major problems of the
> current page reclaim. The other problem is the granularity of the
> active/inactive (sizes). We stopped using them in making job
> scheduling decision a long time ago. I know another large internet
> company adopted a similar approach as ours, and I'm wondering how
> everybody else is coping with the discrepancy from those counters.

Intuitively, the scanning overhead of full page table scanning
appears higher than that of rmap scanning for a small portion of
system memory.  But from your words, you think the reality is the
reverse?  If others are concerned about the overhead too, I think you
will eventually need to show, with more data and theory, that the
overhead of page table scanning isn't higher, or is even lower.

>> All in all, some of your ideas may help the original LRU algorithm too.
>> Or some can be experimented without replacing the original algorithm.
>> 
>> But from another point of view, your solution can be seen as a kind of
>> improvement on top of the original LRU algorithm too.  It moves the
>> recently accessed pages to kind of multiple active lists based on
>> scanning page tables directly (instead of reversely).
>
> We hope this series can be a framework or an infrastructure flexible
> enough that people can build their complex use cases upon, e.g.,
> proactive reclaim (machine-wide, not per process), cold memory
> estimation (for job scheduling), AEP demotion, specifically, we want
> people to use it with what you and Dave are working on here:
> https://patchwork.kernel.org/project/linux-mm/cover/20210304235949.7922C1C3@viggo.jf.intel.com/

Yes.  A better page reclaiming algorithm will help DRAM->PMEM demotion
a lot!

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list
  2021-03-17  3:37             ` Huang, Ying
@ 2021-03-17 10:46               ` Yu Zhao
  2021-03-22  3:13                 ` Huang, Ying
  0 siblings, 1 reply; 65+ messages in thread
From: Yu Zhao @ 2021-03-17 10:46 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Rik van Riel, linux-mm, Alex Shi, Andrew Morton, Dave Hansen,
	Hillf Danton, Johannes Weiner, Joonsoo Kim, Matthew Wilcox,
	Mel Gorman, Michal Hocko, Roman Gushchin, Vlastimil Babka,
	Wei Yang, Yang Shi, linux-kernel, page-reclaim

On Wed, Mar 17, 2021 at 11:37:38AM +0800, Huang, Ying wrote:
> Yu Zhao <yuzhao@google.com> writes:
> 
> > On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
> >> Yu Zhao <yuzhao@google.com> writes:
> >> 
> >> > On Tue, Mar 16, 2021 at 10:07:36AM +0800, Huang, Ying wrote:
> >> >> Rik van Riel <riel@surriel.com> writes:
> >> >> 
> >> >> > On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:
> >> >> >
> >> >> >> +/*
> >> >> >> + * After pages are faulted in, they become the youngest generation.
> >> >> >> They must
> >> >> >> + * go through aging process twice before they can be evicted. After
> >> >> >> first scan,
> >> >> >> + * their accessed bit set during initial faults are cleared and they
> >> >> >> become the
> >> >> >> + * second youngest generation. And second scan makes sure they
> >> >> >> haven't been used
> >> >> >> + * since the first.
> >> >> >> + */
> >> >> >
> >> >> > I have to wonder if the reductions in OOM kills and 
> >> >> > low-memory tab discards is due to this aging policy
> >> >> > change, rather than from the switch to virtual scanning.
> >> >
> >> > There are no policy changes per se. The current page reclaim also
> >> > scans a faulted-in page at least twice before it can reclaim it.
> >> > That said, the new aging yields a better overall result because it
> >> > discovers every page that has been referenced since the last scan,
> >> > in addition to what Ying has mentioned. The current page scan stops
> >> > stops once it finds enough candidates, which may seem more
> >> > efficiently, but actually pays the price for not finding the best.
> >> >
> >> >> If my understanding were correct, the temperature of the processes is
> >> >> considered in addition to that of the individual pages.  That is, the
> >> >> pages of the processes that haven't been scheduled after the previous
> >> >> scanning will not be scanned.  I guess that this helps OOM kills?
> >> >
> >> > Yes, that's correct.
> >> >
> >> >> If so, how about just take advantage of that information for OOM killing
> >> >> and page reclaiming?  For example, if a process hasn't been scheduled
> >> >> for long time, just reclaim its private pages.
> >> >
> >> > This is how it works. Pages that haven't been scanned grow older
> >> > automatically because those that have been scanned will be tagged with
> >> > younger generation numbers. Eviction does bucket sort based on
> >> > generation numbers and attacks the oldest.
> >> 
> >> Sorry, my original words are misleading.  What I wanted to say was that
> >> is it good enough that
> >> 
> >> - Do not change the core algorithm of current page reclaiming.
> >> 
> >> - Add some new logic to reclaim the process private pages regardless of
> >>   the Accessed bits if the processes are not scheduled for some long
> >>   enough time.  This can be done before the normal page reclaiming.
> >
> > This is a good idea, which being used on Android and Chrome OS. We
> > call it per-process reclaim, and I've mentioned here:
> > https://lore.kernel.org/linux-mm/YBkT6175GmMWBvw3@google.com/
> >   On Android, our most advanced simulation that generates memory
> >   pressure from realistic user behavior shows 18% fewer low-memory
> >   kills, which in turn reduces cold starts by 16%. This is on top of
> >   per-process reclaim, a predecessor of ``MADV_COLD`` and
> >   ``MADV_PAGEOUT``, against background apps.
> 
> Thanks, now I see your improvement compared with the per-process
> reclaim.  How about the per-process reclaim compared with the normal
> page reclaiming for the similar test cases?
> 
> My intention behind this is that your solution includes several
> improvements,
> 
> a) take advantage of scheduler information
> b) more fine-grained active/inactive dividing
> c) page table scanning instead of rmap
> 
> Is it possible to evaluate the benefit of the each step?  And is there
> still some potential to optimize the current LRU based algorithm before
> adopting a totally new algorithm?

Well, there isn't really any new algorithm -- it's still the LRU
(algorithm). But I do see your point. In another survey we posted
here:
https://lore.kernel.org/linux-mm/YBkT6175GmMWBvw3@google.com/
we stated:
  Why not try to improve the existing code?
  -----------------------------------------
  We have tried but concluded the two limiting factors [note]_ in the
  existing code are fundamental, and therefore changes made atop them
  will not result in substantial gains on any of the aspects above.

We learned this the hard way.

> > The patches landed not long a ago :) See mm/madvise.c
> 
> :) I'm too out-dated.
> 
> >> So this is an one small step improvement to the current page reclaiming
> >> algorithm via taking advantage of the scheduler information.  It's
> >> clearly not sophisticated as your new algorithm, for example, the cold
> >> pages in the hot processes will not be reclaimed in this stage.  But it
> >> can reduce the overhead of scanning too.
> >
> > The general problems with the direction of per-process reclaim:
> >   1) we can't find the coldest pages, as you have mentioned.
> >   2) we can't reach file pages accessed via file descriptors only,
> >   especially those caching config files that were read only once.
> 
> In theory, this is possible, we can build a inode list based on the
> accessing time too.  Although this may not be necessary.  We can reclaim
> the read-once file cache before the per-process reclaim in theory.

You have to search for unmapped clean pages in the page cache.
Generally, searching the page cache is a lot more expensive than
walking the lru lists because the page cache can be sparse but lru
lists can't be.

> >   3) we can't reclaim lru pages and slab objects proportionally and
> >   therefore we leave many stale slab objects behind.
> >   4) we have to be proactive, as you suggested (once again, you were
> >   right), and this has a serious problem: client's battery life can
> >   be affected.
> 
> Why can this not be done reactively? We can start per-process reclaim
> under memory pressure.

Under memory pressure, we could scan a lot of idle processes and find
nothing to reclaim, e.g., processes that use mlockall(). In addition,
address spaces can be sparse too.

You are now looking in the direction of cold memory tracking using
page tables, which is not practical. Apparently this series has given
you a bad idea... Page tables are only good at discovering hot memory.
Take my word for it :)

> This has been used in phone and laptop now, so
> there's a solution for this issue?

madvise() is called based on user behavior, information that we don't
have in kernel space.

> > The scanning overhead is only one of the two major problems of the
> > current page reclaim. The other problem is the granularity of the
> > active/inactive (sizes). We stopped using them in making job
> > scheduling decision a long time ago. I know another large internet
> > company adopted a similar approach as ours, and I'm wondering how
> > everybody else is coping with the discrepancy from those counters.
> 
> From intuition, the scanning overhead of the full page table scanning
> appears higher than that of the rmap scanning for a small portion of
> system memory.  But form your words, you think the reality is the
> reverse?  If others concern about the overhead too, finally, I think you
> need to prove the overhead of the page table scanning isn't too higher,
> or even lower with more data and theory.

There is a misunderstanding here. I never said anything about full
page table scanning. And this is not how it's done in this series
either. I guess the misunderstanding has something to do with the cold
memory tracking you are thinking about?

This series uses page tables to discover page accesses when a system
has run out of inactive pages. Under such a situation, the system is
very likely to have a lot of page accesses, and using the rmap is
likely to cost a lot more because of its poor memory locality compared
with page tables.

But, page tables can be sparse too, in terms of hot memory tracking.
Dave has asked me to test the worst case scenario, which I'll do.
And I'd be happy to share more data. Any specific workload you are
interested in?


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 14/14] mm: multigenerational lru: documentation
  2021-03-13  7:57 ` [PATCH v1 14/14] mm: multigenerational lru: documentation Yu Zhao
@ 2021-03-19  9:31   ` Alex Shi
  2021-03-22  6:09     ` Yu Zhao
  0 siblings, 1 reply; 65+ messages in thread
From: Alex Shi @ 2021-03-19  9:31 UTC (permalink / raw)
  To: Yu Zhao, linux-mm
  Cc: Andrew Morton, Dave Hansen, Hillf Danton, Johannes Weiner,
	Joonsoo Kim, Matthew Wilcox, Mel Gorman, Michal Hocko,
	Roman Gushchin, Vlastimil Babka, Wei Yang, Yang Shi, Ying Huang,
	linux-kernel, page-reclaim



On 2021/3/13 at 3:57 PM, Yu Zhao wrote:
> +Recipes
> +-------
> +:Android on ARMv8.1+: ``X=4``, ``N=0``
> +
> +:Android on pre-ARMv8.1 CPUs: Not recommended due to the lack of
> + ``ARM64_HW_AFDBM``
> +
> +:Laptops running Chrome on x86_64: ``X=7``, ``N=2``
> +
> +:Working set estimation: Write ``+ memcg_id node_id gen [swappiness]``
> + to ``/sys/kernel/debug/lru_gen`` to account referenced pages to
> + generation ``max_gen`` and create the next generation ``max_gen+1``.
> + ``gen`` must be equal to ``max_gen`` in order to avoid races. A swap
> + file and a non-zero swappiness value are required to scan anon pages.
> + If swapping is not desired, set ``vm.swappiness`` to ``0`` and
> + overwrite it with a non-zero ``swappiness``.
> +
> +:Proactive reclaim: Write ``- memcg_id node_id gen [swappiness]
> + [nr_to_reclaim]`` to ``/sys/kernel/debug/lru_gen`` to evict
> + generations less than or equal to ``gen``. ``gen`` must be less than
> + ``max_gen-1`` as ``max_gen`` and ``max_gen-1`` are active generations
> + and therefore protected from the eviction. ``nr_to_reclaim`` can be
> + used to limit the number of pages to be evicted. Multiple command
> + lines are supported, so does concatenation with delimiters ``,`` and
> + ``;``.
> +


These are difficult options for users, especially with the 'races'
involved.  Is it possible to simplify them for end users?

Thanks
Alex


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list
  2021-03-17 10:46               ` Yu Zhao
@ 2021-03-22  3:13                 ` Huang, Ying
  2021-03-22  8:08                   ` Yu Zhao
  0 siblings, 1 reply; 65+ messages in thread
From: Huang, Ying @ 2021-03-22  3:13 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Rik van Riel, linux-mm, Alex Shi, Andrew Morton, Dave Hansen,
	Hillf Danton, Johannes Weiner, Joonsoo Kim, Matthew Wilcox,
	Mel Gorman, Michal Hocko, Roman Gushchin, Vlastimil Babka,
	Wei Yang, Yang Shi, linux-kernel, page-reclaim

Yu Zhao <yuzhao@google.com> writes:

> On Wed, Mar 17, 2021 at 11:37:38AM +0800, Huang, Ying wrote:
>> Yu Zhao <yuzhao@google.com> writes:
>> 
>> > On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
>> > The scanning overhead is only one of the two major problems of the
>> > current page reclaim. The other problem is the granularity of the
>> > active/inactive (sizes). We stopped using them in making job
>> > scheduling decision a long time ago. I know another large internet
>> > company adopted a similar approach as ours, and I'm wondering how
>> > everybody else is coping with the discrepancy from those counters.
>> 
>> From intuition, the scanning overhead of the full page table scanning
>> appears higher than that of the rmap scanning for a small portion of
>> system memory.  But form your words, you think the reality is the
>> reverse?  If others concern about the overhead too, finally, I think you
>> need to prove the overhead of the page table scanning isn't too higher,
>> or even lower with more data and theory.
>
> There is a misunderstanding here. I never said anything about full
> page table scanning. And this is not how it's done in this series
> either. I guess the misunderstanding has something to do with the cold
> memory tracking you are thinking about?

If my understanding is correct, from the following code path in your
patch 10/14,

age_active_anon
  age_lru_gens
    try_walk_mm_list
      walk_mm_list
        walk_mm

So, in kswapd(), the page tables of many processes may be scanned
fully.  If the number of processes that are active is high, the
overhead may be high too.

> This series uses page tables to discover page accesses when a system
> has run out of inactive pages. Under such a situation, the system is
> very likely to have a lot of page accesses, and using the rmap is
> likely to cost a lot more because its poor memory locality compared
> with page tables.

This is the theory.  Can you verify this with more data?  Including the
CPU cycles or time spent scanning page tables?

> But, page tables can be sparse too, in terms of hot memory tracking.
> Dave has asked me to test the worst case scenario, which I'll do.
> And I'd be happy to share more data. Any specific workload you are
> interested in?

We can start with some simple workloads that are easier to reason
about.  For example,

1. Run the workload with hot and cold pages; when the free memory
becomes lower than the low watermark, kswapd will be woken up to scan
and reclaim some cold pages.  How long will it take to do that?  It's
expected that almost all pages need to be scanned, so page table
scanning is expected to have less overhead.  We can measure how well it
does (a rough sketch of such a workload follows below).

2. Run the workload with hot and cold pages where the whole working-set
cannot fit in DRAM, that is, the cold pages will be reclaimed and
swapped in regularly (for example, tens of MB/s).  It's expected that
fewer pages may be scanned with the rmap, but the speed of page table
scanning is faster.

3. Run the workload with hot and cold pages on a system that is
overcommitted, that is, some cold pages will be placed in swap.  But the
cold pages are cold enough, so there's almost no thrashing.  Then the
hot working-set of the workload changes, that is, some hot pages become
cold, while some cold pages become hot, so page reclaiming and swapin
will be triggered.

For each case, we can use different parameters.  And we can
measure things like the number of pages scanned, the time taken to
scan them, the number of pages reclaimed and swapped in, etc.
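
For case 1, something as simple as the sketch below could be a starting
point (the sizes are placeholders and would need to be chosen relative to
the memory size of the test machine):

  /*
   * Sketch of case 1: an anonymous working set with a hot part that is
   * touched continuously and a cold part that is touched only once.
   * HOT_SIZE and COLD_SIZE are placeholders; pick them so that the total
   * exceeds free memory and kswapd has to reclaim the cold part.  The
   * scan/reclaim counters can be read from /proc/vmstat before and after.
   */
  #define _DEFAULT_SOURCE
  #include <stddef.h>
  #include <sys/mman.h>

  #define GB        (1UL << 30)
  #define HOT_SIZE  (2 * GB)      /* placeholder */
  #define COLD_SIZE (6 * GB)      /* placeholder */
  #define PAGE_SZ   4096UL

  int main(void)
  {
      char *hot  = mmap(NULL, HOT_SIZE,  PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      char *cold = mmap(NULL, COLD_SIZE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

      if (hot == MAP_FAILED || cold == MAP_FAILED)
          return 1;

      /* fault in the cold part once; it should become the reclaim target */
      for (size_t i = 0; i < COLD_SIZE; i += PAGE_SZ)
          cold[i] = 1;

      /* keep the hot part hot so its accessed bits keep getting set */
      for (;;)
          for (size_t i = 0; i < HOT_SIZE; i += PAGE_SZ)
              hot[i]++;
  }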

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 14/14] mm: multigenerational lru: documentation
  2021-03-19  9:31   ` Alex Shi
@ 2021-03-22  6:09     ` Yu Zhao
  0 siblings, 0 replies; 65+ messages in thread
From: Yu Zhao @ 2021-03-22  6:09 UTC (permalink / raw)
  To: Alex Shi
  Cc: linux-mm, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim

On Fri, Mar 19, 2021 at 05:31:20PM +0800, Alex Shi wrote:
> 
> 
> On 2021/3/13 at 3:57 PM, Yu Zhao wrote:
> > +Recipes
> > +-------
> > +:Android on ARMv8.1+: ``X=4``, ``N=0``
> > +
> > +:Android on pre-ARMv8.1 CPUs: Not recommended due to the lack of
> > + ``ARM64_HW_AFDBM``
> > +
> > +:Laptops running Chrome on x86_64: ``X=7``, ``N=2``
> > +
> > +:Working set estimation: Write ``+ memcg_id node_id gen [swappiness]``
> > + to ``/sys/kernel/debug/lru_gen`` to account referenced pages to
> > + generation ``max_gen`` and create the next generation ``max_gen+1``.
> > + ``gen`` must be equal to ``max_gen`` in order to avoid races. A swap
> > + file and a non-zero swappiness value are required to scan anon pages.
> > + If swapping is not desired, set ``vm.swappiness`` to ``0`` and
> > + overwrite it with a non-zero ``swappiness``.
> > +
> > +:Proactive reclaim: Write ``- memcg_id node_id gen [swappiness]
> > + [nr_to_reclaim]`` to ``/sys/kernel/debug/lru_gen`` to evict
> > + generations less than or equal to ``gen``. ``gen`` must be less than
> > + ``max_gen-1`` as ``max_gen`` and ``max_gen-1`` are active generations
> > + and therefore protected from the eviction. ``nr_to_reclaim`` can be
> > + used to limit the number of pages to be evicted. Multiple command
> > + lines are supported, so does concatenation with delimiters ``,`` and
> > + ``;``.
> > +
> 
> These are difficult options for users, especially for 'races' involving.
> Is it possible to simplify them for end users?

They look simple for a few lruvecs, but do become human-unfriendly on
servers that have thousands of lruvecs.

It's certainly possible to simplify them, but we'd have to sacrifice
some flexibility. Any particular idea in mind?
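
For reference, a minimal userspace sketch that drives the interface as
documented above (the memcg id, node id, generation numbers and page count
are placeholders; a real tool would discover them instead of hard-coding
them):

  /*
   * Minimal sketch of driving the debugfs interface documented above.
   * The memcg id, node id, generation numbers and page count are
   * placeholders; a real tool would discover them rather than hard-code
   * them (and "+" requires gen == max_gen, "-" requires gen < max_gen-1).
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  static int lru_gen_cmd(const char *cmd)
  {
      int fd = open("/sys/kernel/debug/lru_gen", O_WRONLY);
      ssize_t ret;

      if (fd < 0)
          return -1;
      ret = write(fd, cmd, strlen(cmd));
      close(fd);
      return ret < 0 ? -1 : 0;
  }

  int main(void)
  {
      /* working set estimation: age memcg 1, node 0, assuming max_gen is 4 */
      if (lru_gen_cmd("+ 1 0 4"))
          perror("aging");

      /* proactive reclaim: evict generations <= 2 from the same lruvec,
         swappiness 1, at most 1024 pages */
      if (lru_gen_cmd("- 1 0 2 1 1024"))
          perror("eviction");

      return 0;
  }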


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list
  2021-03-22  3:13                 ` Huang, Ying
@ 2021-03-22  8:08                   ` Yu Zhao
  2021-03-24  6:58                     ` Huang, Ying
  0 siblings, 1 reply; 65+ messages in thread
From: Yu Zhao @ 2021-03-22  8:08 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Rik van Riel, linux-mm, Alex Shi, Andrew Morton, Dave Hansen,
	Hillf Danton, Johannes Weiner, Joonsoo Kim, Matthew Wilcox,
	Mel Gorman, Michal Hocko, Roman Gushchin, Vlastimil Babka,
	Wei Yang, Yang Shi, linux-kernel, page-reclaim

On Mon, Mar 22, 2021 at 11:13:19AM +0800, Huang, Ying wrote:
> Yu Zhao <yuzhao@google.com> writes:
> 
> > On Wed, Mar 17, 2021 at 11:37:38AM +0800, Huang, Ying wrote:
> >> Yu Zhao <yuzhao@google.com> writes:
> >> 
> >> > On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
> >> > The scanning overhead is only one of the two major problems of the
> >> > current page reclaim. The other problem is the granularity of the
> >> > active/inactive (sizes). We stopped using them in making job
> >> > scheduling decision a long time ago. I know another large internet
> >> > company adopted a similar approach as ours, and I'm wondering how
> >> > everybody else is coping with the discrepancy from those counters.
> >> 
> >> From intuition, the scanning overhead of the full page table scanning
> >> appears higher than that of the rmap scanning for a small portion of
> >> system memory.  But form your words, you think the reality is the
> >> reverse?  If others concern about the overhead too, finally, I think you
> >> need to prove the overhead of the page table scanning isn't too higher,
> >> or even lower with more data and theory.
> >
> > There is a misunderstanding here. I never said anything about full
> > page table scanning. And this is not how it's done in this series
> > either. I guess the misunderstanding has something to do with the cold
> > memory tracking you are thinking about?
> 
> If my understanding were correct, from the following code path in your
> patch 10/14,
> 
> age_active_anon
>   age_lru_gens
>     try_walk_mm_list
>       walk_mm_list
>         walk_mm
> 
> So, in kswapd(), the page tables of many processes may be scanned
> fully.  If the number of processes that are active are high, the
> overhead may be high too.

That's correct. Just in case we have different definitions of what we
call "full":

  I understand it as the full range of the address space of a process
  that was loaded by switch_mm() at least once since the last scan.
  This is not the case because we don't scan the full range -- we skip
  holes and VMAs that are unevictable, as well as PTE tables that have
  no accessed entries on x86_64, by should_skip_vma() and
  CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG.

  If you are referring to the full range of PTE tables that have at
  least one accessed entry, i.e., the other 511 are not none but have
  not been accessed either since the last scan on x86_64, then yes, you
  are right again :) This is the worst-case scenario.
  
> > This series uses page tables to discover page accesses when a system
> > has run out of inactive pages. Under such a situation, the system is
> > very likely to have a lot of page accesses, and using the rmap is
> > likely to cost a lot more because its poor memory locality compared
> > with page tables.
> 
> This is the theory.  Can you verify this with more data?  Including the
> CPU cycles or time spent scanning page tables?

Yes, I'll be happy to do so as I should, because page table scanning
is counterintuitive. Let me add more theory in case it's still unclear
to others.

From my understanding, the two fundamental questions we need to
consider in terms of page reclaim are:

  What sizes of hot clusters (spatial locality) should we
  expect under memory pressure?

  On smaller systems with 4GB memory, our observations are that the
  average size of hot clusters found during each scan is 32KB. On
  larger systems with hundreds of gigabytes of memory, it's well
  above this value -- 512KB or larger. These values vary under
  different workloads and with different memory allocators. Unless
  done deliberately by memory allocators, e.g., Scudo as I've
  mentioned earlier, it's safe to say if a PTE entry has been
  accessed, its neighbors are likely to have been accessed too.

  What hot memory footprint (total size of hot clusters) should we
  expect when we have run out of inactive pages?

  Some numbers first: on large and heavily overcommitted systems, we
  have observed close to 90% during a scan. Those systems have
  millions of pages, and using the rmap to find out which pages to
  reclaim will just overwhelm kswapd. On smaller systems with less
  memory pressure (due to their weaker CPUs), this number is more
  reasonable, ~50%. Here are some kswapd profiles from a smaller
  system running 5.11:

   the rmap                                 page table scan
   ---------------------------------------------------------------------
   31.03%  page_vma_mapped_walk             49.36%  lzo1x_1_do_compress
   25.59%  lzo1x_1_do_compress               4.54%  page_vma_mapped_walk
    4.63%  do_raw_spin_lock                  4.45%  memset_erms
    3.89%  vma_interval_tree_iter_next       3.47%  walk_pte_range
    3.33%  vma_interval_tree_subtree_search  2.88%  zram_bvec_rw

  The page table scan is only twice as fast. On larger systems, it's
  usually more than 4 times faster, without THP. With THP, both are
  negligible (<1% CPU usage). I can grab profiles from our servers
  too if you are interested in seeing them on a 4.15 kernel.

> > But, page tables can be sparse too, in terms of hot memory tracking.
> > Dave has asked me to test the worst case scenario, which I'll do.
> > And I'd be happy to share more data. Any specific workload you are
> > interested in?
> 
> We can start with some simple workloads that are easier to be reasoned.
> For example,
> 
> 1. Run the workload with hot and cold pages, when the free memory
> becomes lower than the low watermark, kswapd will be waken up to scan
> and reclaim some cold pages.  How long will it take to do that?  It's
> expected that almost all pages need to be scanned, so that page table

A typical scenario. Otherwise why would we have run out of cold pages
and still be under memory pressure? Because what's in memory is hot and
therefore most of the pages need to be scanned :)

> scanning is expected to have less overhead.  We can measure how well it
> is.

Sounds good to me.

> 2. Run the workload with hot and cold pages, if the whole working-set
> cannot fit in DRAM, that is, the cold pages will be reclaimed and
> swapped in regularly (for example tens MB/s).  It's expected that less
> pages may be scanned with rmap, but the speed of page table scanning is
> faster.

So IIUC, this is sustained memory pressure, i.e., servers constantly
running under memory pressure?

> 3. Run the workload with hot and cold pages, the system is
> overcommitted, that is, some cold pages will be placed in swap.  But the
> cold pages are cold enough, so there's almost no thrashing.  Then the
> hot working-set of the workload changes, that is, some hot pages become
> cold, while some cold pages becomes hot, so page reclaiming and swapin
> will be triggered.

This is usually what we see on clients, i.e., bursty workloads when
switching from an active app to an inactive one.

> For each cases, we can use some different parameters.  And we can
> measure something like the number of pages scanned, the time taken to
> scan them, the number of page reclaimed and swapped in, etc.

Thanks, I appreciate these -- very well-thought-out test cases. I'll look
into them and probably write some synthetic test cases. If you have
some already, I'd love to get my hands on them.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list
  2021-03-22  8:08                   ` Yu Zhao
@ 2021-03-24  6:58                     ` Huang, Ying
  2021-04-10 18:48                       ` Yu Zhao
  0 siblings, 1 reply; 65+ messages in thread
From: Huang, Ying @ 2021-03-24  6:58 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Rik van Riel, linux-mm, Alex Shi, Andrew Morton, Dave Hansen,
	Hillf Danton, Johannes Weiner, Joonsoo Kim, Matthew Wilcox,
	Mel Gorman, Michal Hocko, Roman Gushchin, Vlastimil Babka,
	Wei Yang, Yang Shi, linux-kernel, page-reclaim

Yu Zhao <yuzhao@google.com> writes:

> On Mon, Mar 22, 2021 at 11:13:19AM +0800, Huang, Ying wrote:
>> Yu Zhao <yuzhao@google.com> writes:
>> 
>> > On Wed, Mar 17, 2021 at 11:37:38AM +0800, Huang, Ying wrote:
>> >> Yu Zhao <yuzhao@google.com> writes:
>> >> 
>> >> > On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
>> >> > The scanning overhead is only one of the two major problems of the
>> >> > current page reclaim. The other problem is the granularity of the
>> >> > active/inactive (sizes). We stopped using them in making job
>> >> > scheduling decision a long time ago. I know another large internet
>> >> > company adopted a similar approach as ours, and I'm wondering how
>> >> > everybody else is coping with the discrepancy from those counters.
>> >> 
>> >> From intuition, the scanning overhead of the full page table scanning
>> >> appears higher than that of the rmap scanning for a small portion of
>> >> system memory.  But form your words, you think the reality is the
>> >> reverse?  If others concern about the overhead too, finally, I think you
>> >> need to prove the overhead of the page table scanning isn't too higher,
>> >> or even lower with more data and theory.
>> >
>> > There is a misunderstanding here. I never said anything about full
>> > page table scanning. And this is not how it's done in this series
>> > either. I guess the misunderstanding has something to do with the cold
>> > memory tracking you are thinking about?
>> 
>> If my understanding were correct, from the following code path in your
>> patch 10/14,
>> 
>> age_active_anon
>>   age_lru_gens
>>     try_walk_mm_list
>>       walk_mm_list
>>         walk_mm
>> 
>> So, in kswapd(), the page tables of many processes may be scanned
>> fully.  If the number of processes that are active are high, the
>> overhead may be high too.
>
> That's correct. Just in case we have different definitions of what we
> call "full":
>
>   I understand it as the full range of the address space of a process
>   that was loaded by switch_mm() at least once since the last scan.
>   This is not the case because we don't scan the full range -- we skip
>   holes and VMAs that are unevictable, as well as PTE tables that have
>   no accessed entries on x86_64, by should_skip_vma() and
>   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG.
>
>   If you are referring to the full range of PTE tables that have at
>   least one accessed entry, i.e., other 511 are not none  but have not
>   been accessed either since the last scan on x86_64, then yes, you
>   are right again :) This is the worse case scenario.

OK.  So there's no fundamental difference between us on this.

>> > This series uses page tables to discover page accesses when a system
>> > has run out of inactive pages. Under such a situation, the system is
>> > very likely to have a lot of page accesses, and using the rmap is
>> > likely to cost a lot more because its poor memory locality compared
>> > with page tables.
>> 
>> This is the theory.  Can you verify this with more data?  Including the
>> CPU cycles or time spent scanning page tables?
>
> Yes, I'll be happy to do so as I should, because page table scanning
> is counterintuitive. Let me add more theory in case it's still unclear
> to others.
>
> From my understanding, the two fundamental questions we need to
> consider in terms of page reclaim are:
>
>   What are the sizes of hot clusters (spatial locality) should we
>   expect under memory pressure?
>
>   On smaller systems with 4GB memory, our observations are that the
>   average size of hot clusters found during each scan is 32KB. On
>   larger systems with hundreds of gigabytes of memory, it's well
>   above this value -- 512KB or larger. These values vary under
>   different workloads and with different memory allocators. Unless
>   done deliberately by memory allocators, e.g., Scudo as I've
>   mentioned earlier, it's safe to say if a PTE entry has been
>   accessed, its neighbors are likely to have been accessed too.
>
>   What's hot memory footprint (total size of hot clusters) should we
>   expect when we have run out of inactive pages?
>
>   Some numbers first: on large and heavily overcommitted systems, we
>   have observed close to 90% during a scan. Those systems have
>   millions of pages and using the rmap to find out which pages to
>   reclaim will just blow kswapd. On smaller systems with less memory
>   pressure (due to their weaker CPUs), this number is more reasonable,
>   ~50%. Here is some kswapd profiles from a smaller systems running
>   5.11:
>
>    the rmap                                 page table scan
>    ---------------------------------------------------------------------
>    31.03%  page_vma_mapped_walk             49.36%  lzo1x_1_do_compress
>    25.59%  lzo1x_1_do_compress               4.54%  page_vma_mapped_walk
>     4.63%  do_raw_spin_lock                  4.45%  memset_erms
>     3.89%  vma_interval_tree_iter_next       3.47%  walk_pte_range
>     3.33%  vma_interval_tree_subtree_search  2.88%  zram_bvec_rw
>
>   The page table scan is only twice as fast. Only larger systems,
>   it's usually more than 4 times, without THP. With THP, both are
>   negligible (<1% CPU usage). I can grab profiles from our servers
>   too if you are interested in seeing them on 4.15 kernel.

Yes.  On a heavily overcommitted system with a high percentage of hot
pages, page table scanning works much better, because almost all pages
(and their mappings) will be scanned eventually.

But on a not-so-heavily overcommitted system with a low percentage of
hot pages, it's possible that rmap scanning works better, since only a
small fraction of the pages needs to be scanned.  I know that page
table scanning may still work better in many cases.

And there is another possibility: on a system with cool rather than
completely cold pages, that is, where some pages are accessed at a very
low but non-zero frequency, there will always be some low-bandwidth
memory reclaiming.  That is, it's impossible to find a perfect solution
with one or two full scans; some pages need to be reclaimed
periodically.  And I guess there are no perfect (or very good) page
reclaiming solutions for some other situations either.  What we can do
there is,

- Avoid OOM, that is, reclaim some pages if possible.

- Control the overhead of the page reclaiming.

But this is theory only.  If anyone can point out that these scenarios
are not realistic at all, that's good too :-)

>> > But, page tables can be sparse too, in terms of hot memory tracking.
>> > Dave has asked me to test the worst case scenario, which I'll do.
>> > And I'd be happy to share more data. Any specific workload you are
>> > interested in?
>> 
>> We can start with some simple workloads that are easier to be reasoned.
>> For example,
>> 
>> 1. Run the workload with hot and cold pages, when the free memory
>> becomes lower than the low watermark, kswapd will be waken up to scan
>> and reclaim some cold pages.  How long will it take to do that?  It's
>> expected that almost all pages need to be scanned, so that page table
>
> A typical scenario. Otherwise why would we have run out of cold pages
> and still be under memory? Because what's in memory is hot and
> therefore most of the them need to be scanned :)
>
>> scanning is expected to have less overhead.  We can measure how well it
>> is.
>
> Sounds good to me.
>
>> 2. Run the workload with hot and cold pages, if the whole working-set
>> cannot fit in DRAM, that is, the cold pages will be reclaimed and
>> swapped in regularly (for example tens MB/s).  It's expected that less
>> pages may be scanned with rmap, but the speed of page table scanning is
>> faster.
>
> So IIUC, this is a sustained memory pressure, i.e., servers constantly
> running under memory pressure?

Yes.  The system can accommodate more workloads at the cost of
performance, as long as the end-user latency isn't unacceptable.  Or we
may need some time to schedule more computing resources, so we have to
run in this condition for a while.

But again, this is theory only.  I am glad if people can tell me that
this is unrealistic.

>> 3. Run the workload with hot and cold pages, the system is
>> overcommitted, that is, some cold pages will be placed in swap.  But the
>> cold pages are cold enough, so there's almost no thrashing.  Then the
>> hot working-set of the workload changes, that is, some hot pages become
>> cold, while some cold pages becomes hot, so page reclaiming and swapin
>> will be triggered.
>
> This is usually what we see on clients, i.e., bursty workloads when
> switching from an active app to an inactive one.

Thanks for your information.  Now I know a typical realistic use case :-)

>> For each cases, we can use some different parameters.  And we can
>> measure something like the number of pages scanned, the time taken to
>> scan them, the number of page reclaimed and swapped in, etc.
>
> Thanks, I appreciate these -- very well thought test cases. I'll look
> into them and probably write some synthetic test cases. If you have
> some already, I'd love to get my hands one them.

Sorry.  I have no test cases in hand.  Maybe we can add some to
Fengguang's vm-scalability test suite as follows.

https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 00/14] Multigenerational LRU
  2021-03-16 21:14         ` Dave Hansen
@ 2021-04-10  9:21           ` Yu Zhao
  2021-04-13  3:02             ` Huang, Ying
  0 siblings, 1 reply; 65+ messages in thread
From: Yu Zhao @ 2021-04-10  9:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, Ying Huang, linux-kernel, page-reclaim

[-- Attachment #1: Type: text/plain, Size: 6026 bytes --]

On Tue, Mar 16, 2021 at 02:14:43PM -0700, Dave Hansen wrote:
> On 3/16/21 1:30 PM, Yu Zhao wrote:
> > On Tue, Mar 16, 2021 at 07:50:23AM -0700, Dave Hansen wrote:
> >> I think it would also be very worthwhile to include some research in
> >> this series about why the kernel moved away from page table scanning.
> >> What has changed?  Are the workloads we were concerned about way back
> >> then not around any more?  Has faster I/O or larger memory sizes with a
> >> stagnating page size changed something?
> > 
> > Sure. Hugh also suggested this too but I personally found that ancient
> > pre-2.4 history too irrelevant (and uninteresting) to the modern age
> > and decided to spare audience of the boredom.
> 
> IIRC, rmap chains showed up in the 2.5 era and the VM was quite bumpy
> until anon_vmas came around, which was early-ish in the 2.6 era.
> 
> But, either way, I think there is a sufficient population of nostalgic
> crusty old folks around to warrant a bit of a history lesson.  We'll
> enjoy the trip down memory lane, fondly remembering the old days in
> Ottawa...
> 
> >>> nr_vmscan_write 24900719
> >>> nr_vmscan_immediate_reclaim 115535
> >>> pgscan_kswapd 320831544
> >>> pgscan_direct 23396383
> >>> pgscan_direct_throttle 0
> >>> pgscan_anon 127491077
> >>> pgscan_file 216736850
> >>> slabs_scanned 400469680
> >>> compact_migrate_scanned 1092813949
> >>> compact_free_scanned 4919523035
> >>> compact_daemon_migrate_scanned 2372223
> >>> compact_daemon_free_scanned 20989310
> >>> unevictable_pgs_scanned 307388545
> > 
> > 10G swap + 8G anon rss + 6G file rss, hmm... an interesting workload.
> > The file rss does seem a bit high to me, my wild speculation is there
> > have been git/make activities in addition to a VM?
> 
> I wish I was doing more git/make activities.  It's been an annoying
> amount of email and web browsers for 12 days.  If anything, I'd suspect
> that Thunderbird is at fault for keeping a bunch of mail in the page
> cache.  There are a couple of VM's running though.

Hi Dave,

Sorry for the late reply. Here is the benchmark result from the worst
case scenario.

As you suggested, we create a lot of processes sharing one large
sparse shmem, and they access the shmem at random 2MB-aligned offsets.
So there will be at most one valid PTE entry per PTE table, hence the
worst case scenario for the multigenerational LRU, since it is based
on page table scanning.

TL;DR: the multigenerational LRU did not perform worse than the rmap.

My test configurations:

  The size of the shmem: 256GB
  The number of processes: 450
  Total memory size: 200GB
  The number of CPUs: 64
  The number of nodes: 2

There is no clear winner in the background reclaim path (kswapd).

  kswapd (5.12.0-rc6):
    43.99%  kswapd1  page_vma_mapped_walk
    34.86%  kswapd0  page_vma_mapped_walk
     2.43%  kswapd0  count_shadow_nodes
     1.17%  kswapd1  page_referenced_one
     1.15%  kswapd0  _find_next_bit.constprop.0
     0.95%  kswapd0  page_referenced_one
     0.87%  kswapd1  try_to_unmap_one
     0.75%  kswapd0  cpumask_next
     0.67%  kswapd0  shrink_slab
     0.66%  kswapd0  down_read_trylock

  kswapd (the multigenerational LRU):
    33.39%  kswapd0  walk_pud_range
    10.93%  kswapd1  walk_pud_range
     9.36%  kswapd0  page_vma_mapped_walk
     7.15%  kswapd1  page_vma_mapped_walk
     3.83%  kswapd0  count_shadow_nodes
     2.60%  kswapd1  shrink_slab
     2.47%  kswapd1  down_read_trylock
     2.03%  kswapd0  _raw_spin_lock
     1.87%  kswapd0  shrink_slab
     1.67%  kswapd1  count_shadow_nodes

The multigenerational LRU is somewhat winning in the direct reclaim
path (sparse is the test binary name):

  The test process context (5.12.0-rc6):
    65.02%  sparse   page_vma_mapped_walk
     5.49%  sparse   page_counter_try_charge
     3.60%  sparse   propagate_protected_usage
     2.31%  sparse   page_counter_uncharge
     2.06%  sparse   count_shadow_nodes
     1.81%  sparse   native_queued_spin_lock_slowpath
     1.79%  sparse   down_read_trylock
     1.67%  sparse   page_referenced_one
     1.42%  sparse   shrink_slab
     0.87%  sparse   try_to_unmap_one

  CPU % (direct reclaim vs the rest): 71% vs 29%
  # grep oom_kill /proc/vmstat
  oom_kill 81

  The test process context (the multigenerational LRU):
    33.12%  sparse   page_vma_mapped_walk
    10.70%  sparse   walk_pud_range
     9.64%  sparse   page_counter_try_charge
     6.63%  sparse   propagate_protected_usage
     4.43%  sparse   native_queued_spin_lock_slowpath
     3.85%  sparse   page_counter_uncharge
     3.71%  sparse   irqentry_exit_to_user_mode
     2.16%  sparse   _raw_spin_lock
     1.83%  sparse   unmap_page_range
     1.82%  sparse   shrink_slab

  CPU % (direct reclaim vs the rest): 47% vs 53%
  # grep oom_kill /proc/vmstat
  oom_kill 80

I also compared other numbers from /proc/vmstat. They do not provide
any additional insight beyond the profiles, so I will just omit them
here.

The following optimizations and the stats measuring their efficacies
explain why the multigenerational LRU did not perform worse:

  Optimization 1: take advantage of the scheduling information.
    # of active processes           270
    # of inactive processes         105

  Optimization 2: take advantage of the accessed bit on non-leaf
  PMD entries.
    # of old non-leaf PMD entries   30523335
    # of young non-leaf PMD entries 1358400

These stats are not currently included, but I will add them to the
debugfs interface in the next version coming soon. I will also add
another optimization for Android: it reduces zigzags when there are
many single-page VMAs, i.e., it avoids returning to the PGD table for
each such VMA. Just a heads-up.
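
To make the idea behind Optimization 2 concrete, here is a toy
userspace model (a sketch of the concept only, not code from the
series, which uses the hardware accessed bit on non-leaf PMD entries
via CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG): a scanner checks a parent-level
young flag and skips all 512 leaf entries of any PTE table whose parent
entry is old.

/*
 * Toy model of skipping PTE tables whose parent (non-leaf) PMD entry
 * has not been accessed. The page table hierarchy is simulated with
 * plain arrays, so the numbers are illustrative only.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define PTRS_PER_PMD	512
#define PTRS_PER_PTE	512

struct pte_table {
	bool young[PTRS_PER_PTE];	/* accessed bits of leaf entries */
};

struct pmd_table {
	bool young[PTRS_PER_PMD];	/* accessed bits of non-leaf entries */
	struct pte_table *pte[PTRS_PER_PMD];
};

/* Count young leaf entries; skip PTE tables with an old parent entry. */
static long scan_pmd(struct pmd_table *pmd, long *visited)
{
	long young = 0;
	int i, j;

	for (i = 0; i < PTRS_PER_PMD; i++) {
		if (!pmd->pte[i])
			continue;	/* hole: nothing mapped here */
		if (!pmd->young[i])
			continue;	/* parent is old: skip 512 leaf entries */
		for (j = 0; j < PTRS_PER_PTE; j++) {
			(*visited)++;
			if (pmd->pte[i]->young[j])
				young++;
		}
	}
	return young;
}

int main(void)
{
	struct pmd_table pmd = { 0 };
	long visited = 0, young;
	int i;

	/* Populate all 512 PTE tables but mark only one parent entry young. */
	for (i = 0; i < PTRS_PER_PMD; i++)
		pmd.pte[i] = calloc(1, sizeof(struct pte_table));
	pmd.young[7] = true;
	pmd.pte[7]->young[42] = true;

	young = scan_pmd(&pmd, &visited);
	printf("young leaves: %ld, leaf entries visited: %ld of %d\n",
	       young, visited, PTRS_PER_PMD * PTRS_PER_PTE);
	return 0;
}

With the counts above (roughly 30.5 million old vs 1.4 million young
non-leaf PMD entries), this check lets the walk skip the overwhelming
majority of PTE tables.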

The rmap, on the other hand, had to
  1) lock each (shmem) page it scans
  2) go through five levels of page tables for each page, even though
  some of them have the same LCAs
during the test. The second part is worse given that I have 5 levels
of page tables configured.

Any additional benchmarks you would suggest? Thanks.

[-- Attachment #2: sparse.c --]
[-- Type: text/x-csrc, Size: 961 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>

/*
 * Worst-case workload for page table scanning: NR_TASKS processes share
 * one large, sparse shmem mapping, and each touches a single word per
 * 2MB region, so every populated PTE table holds at most one valid entry.
 */
#define NR_TASKS	450UL
#define MMAP_SIZE	(256UL << 30)

#define PMD_SIZE	(1UL << 21)
#define NR_PMDS		(MMAP_SIZE / PMD_SIZE)
#define NR_LOOPS	(NR_PMDS * 200)

int main(void)
{
	unsigned long i;
	char *start;
	pid_t pid;

	/* Shared anonymous memory is backed by shmem. */
	start = mmap(NULL, MMAP_SIZE, PROT_READ | PROT_WRITE,
		     MAP_ANONYMOUS | MAP_SHARED | MAP_NORESERVE, -1, 0);
	if (start == MAP_FAILED) {
		perror("mmap");
		return -1;
	}

	/* Stick to 4KB pages so the PTE tables stay sparse. */
	if (madvise(start, MMAP_SIZE, MADV_NOHUGEPAGE)) {
		perror("madvise");
		return -1;
	}

	/* Children break out of the loop; the parent keeps forking. */
	for (i = 0; i < NR_TASKS; i++) {
		pid = fork();
		if (pid < 0) {
			perror("fork");
			return -1;
		}

		if (!pid)
			break;
	}

	/* Seed each process differently so their access patterns diverge. */
	pid = getpid();
	srand48(pid);

	/* Write one word at a random 2MB-aligned offset per iteration. */
	for (i = 0; i < NR_LOOPS; i++) {
		unsigned long offset = (lrand48() % NR_PMDS) * PMD_SIZE;
		unsigned long *addr = (unsigned long *)(start + offset);

		*addr = i;
	}

	return 0;
}
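
A rough footprint estimate derived from these parameters (a
back-of-the-envelope derivation, not numbers reported in this thread)
shows where the pressure comes from: the shared shmem data itself stays
small, while the per-process PTE tables do not.

/* Back-of-the-envelope footprint for the sparse.c parameters above. */
#include <stdio.h>

int main(void)
{
	const unsigned long nr_tasks = 450;		/* NR_TASKS */
	const unsigned long mmap_size = 256UL << 30;	/* MMAP_SIZE */
	const unsigned long pmd_size = 1UL << 21;	/* PMD_SIZE */
	const unsigned long page_size = 4096;		/* x86_64 base page */
	const unsigned long pte_table = 4096;		/* 512 entries * 8 bytes */

	unsigned long slots = mmap_size / pmd_size;	/* 2MB-aligned offsets */

	/* At most one 4KB page per slot, shared by every process. */
	unsigned long data = slots * page_size;

	/* But each process needs its own, nearly empty PTE table per slot. */
	unsigned long tables = slots * pte_table * nr_tasks;

	printf("slots:      %lu\n", slots);
	printf("shmem data: ~%lu MB (shared)\n", data >> 20);
	printf("PTE tables: ~%lu GB (per process, not shared)\n", tables >> 30);
	return 0;
}

In other words, at most about 512MB of shmem data ends up mapped by up
to 450 processes, while the page tables needed to map it approach 225GB,
more than the 200GB of RAM in the test machine, which helps explain the
oom_kill counts above and why the rmap has to visit so many sparse
mappings for every page it scans.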

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list
  2021-03-24  6:58                     ` Huang, Ying
@ 2021-04-10 18:48                       ` Yu Zhao
  2021-04-13  3:06                         ` Huang, Ying
  0 siblings, 1 reply; 65+ messages in thread
From: Yu Zhao @ 2021-04-10 18:48 UTC (permalink / raw)
  To: Huang, Ying, Rong Chen
  Cc: Rik van Riel, Linux-MM, Alex Shi, Andrew Morton, Dave Hansen,
	Hillf Danton, Johannes Weiner, Joonsoo Kim, Matthew Wilcox,
	Mel Gorman, Michal Hocko, Roman Gushchin, Vlastimil Babka,
	Wei Yang, Yang Shi, linux-kernel, Kernel Page Reclaim v2

On Wed, Mar 24, 2021 at 12:58 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > On Mon, Mar 22, 2021 at 11:13:19AM +0800, Huang, Ying wrote:
> >> Yu Zhao <yuzhao@google.com> writes:
> >>
> >> > On Wed, Mar 17, 2021 at 11:37:38AM +0800, Huang, Ying wrote:
> >> >> Yu Zhao <yuzhao@google.com> writes:
> >> >>
> >> >> > On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
> >> >> > The scanning overhead is only one of the two major problems of the
> >> >> > current page reclaim. The other problem is the granularity of the
> >> >> > active/inactive (sizes). We stopped using them in making job
> >> >> > scheduling decision a long time ago. I know another large internet
> >> >> > company adopted a similar approach as ours, and I'm wondering how
> >> >> > everybody else is coping with the discrepancy from those counters.
> >> >>
> >> >> From intuition, the scanning overhead of the full page table scanning
> >> >> appears higher than that of the rmap scanning for a small portion of
> >> >> system memory.  But form your words, you think the reality is the
> >> >> reverse?  If others concern about the overhead too, finally, I think you
> >> >> need to prove the overhead of the page table scanning isn't too higher,
> >> >> or even lower with more data and theory.
> >> >
> >> > There is a misunderstanding here. I never said anything about full
> >> > page table scanning. And this is not how it's done in this series
> >> > either. I guess the misunderstanding has something to do with the cold
> >> > memory tracking you are thinking about?
> >>
> >> If my understanding were correct, from the following code path in your
> >> patch 10/14,
> >>
> >> age_active_anon
> >>   age_lru_gens
> >>     try_walk_mm_list
> >>       walk_mm_list
> >>         walk_mm
> >>
> >> So, in kswapd(), the page tables of many processes may be scanned
> >> fully.  If the number of processes that are active are high, the
> >> overhead may be high too.
> >
> > That's correct. Just in case we have different definitions of what we
> > call "full":
> >
> >   I understand it as the full range of the address space of a process
> >   that was loaded by switch_mm() at least once since the last scan.
> >   This is not the case because we don't scan the full range -- we skip
> >   holes and VMAs that are unevictable, as well as PTE tables that have
> >   no accessed entries on x86_64, by should_skip_vma() and
> >   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG.
> >
> >   If you are referring to the full range of PTE tables that have at
> >   least one accessed entry, i.e., other 511 are not none  but have not
> >   been accessed either since the last scan on x86_64, then yes, you
> >   are right again :) This is the worse case scenario.
>
> OK.  So there's no fundamental difference between us on this.
>
> >> > This series uses page tables to discover page accesses when a system
> >> > has run out of inactive pages. Under such a situation, the system is
> >> > very likely to have a lot of page accesses, and using the rmap is
> >> > likely to cost a lot more because its poor memory locality compared
> >> > with page tables.
> >>
> >> This is the theory.  Can you verify this with more data?  Including the
> >> CPU cycles or time spent scanning page tables?
> >
> > Yes, I'll be happy to do so as I should, because page table scanning
> > is counterintuitive. Let me add more theory in case it's still unclear
> > to others.
> >
> > From my understanding, the two fundamental questions we need to
> > consider in terms of page reclaim are:
> >
> >   What are the sizes of hot clusters (spatial locality) should we
> >   expect under memory pressure?
> >
> >   On smaller systems with 4GB memory, our observations are that the
> >   average size of hot clusters found during each scan is 32KB. On
> >   larger systems with hundreds of gigabytes of memory, it's well
> >   above this value -- 512KB or larger. These values vary under
> >   different workloads and with different memory allocators. Unless
> >   done deliberately by memory allocators, e.g., Scudo as I've
> >   mentioned earlier, it's safe to say if a PTE entry has been
> >   accessed, its neighbors are likely to have been accessed too.
> >
> >   What's hot memory footprint (total size of hot clusters) should we
> >   expect when we have run out of inactive pages?
> >
> >   Some numbers first: on large and heavily overcommitted systems, we
> >   have observed close to 90% during a scan. Those systems have
> >   millions of pages and using the rmap to find out which pages to
> >   reclaim will just blow kswapd. On smaller systems with less memory
> >   pressure (due to their weaker CPUs), this number is more reasonable,
> >   ~50%. Here is some kswapd profiles from a smaller systems running
> >   5.11:
> >
> >    the rmap                                 page table scan
> >    ---------------------------------------------------------------------
> >    31.03%  page_vma_mapped_walk             49.36%  lzo1x_1_do_compress
> >    25.59%  lzo1x_1_do_compress               4.54%  page_vma_mapped_walk
> >     4.63%  do_raw_spin_lock                  4.45%  memset_erms
> >     3.89%  vma_interval_tree_iter_next       3.47%  walk_pte_range
> >     3.33%  vma_interval_tree_subtree_search  2.88%  zram_bvec_rw
> >
> >   The page table scan is only twice as fast. Only larger systems,
> >   it's usually more than 4 times, without THP. With THP, both are
> >   negligible (<1% CPU usage). I can grab profiles from our servers
> >   too if you are interested in seeing them on 4.15 kernel.
>
> Yes.  On a heavily overcommitted systems with high-percent hot pages,
> the page table scanning works much better.  Because almost all pages
> (and their mappings) will be scanned finally.
>
> But on a not-so-heavily overcommitted system with low-percent hot pages,
> it's possible that rmap scanning works better.  That is, only a small
> fraction of the pages need to be scanned.  I know that the page table
> scanning may still work better in many cases.
>
> And another possibility, on a system with cool instead of completely
> cold pages, that is, some pages are accessed at quite low frequency, but
> not 0, there will be always some low-bandwidth memory reclaiming.  That
> is, it's impossible to find a perfect solution with one or two full
> scanning.  But we need to reclaim some pages periodically.  And I guess
> there are no perfect (or very good) page reclaiming solutions for some
> other situations too. Where what we can do are,
>
> - Avoid OOM, that is, reclaim some pages if possible.
>
> - Control the overhead of the page reclaiming.
>
> But this is theory only.  If anyone can point out that they are not
> realistic at all, it's good too :-)
>
> >> > But, page tables can be sparse too, in terms of hot memory tracking.
> >> > Dave has asked me to test the worst case scenario, which I'll do.
> >> > And I'd be happy to share more data. Any specific workload you are
> >> > interested in?
> >>
> >> We can start with some simple workloads that are easier to be reasoned.
> >> For example,
> >>
> >> 1. Run the workload with hot and cold pages, when the free memory
> >> becomes lower than the low watermark, kswapd will be waken up to scan
> >> and reclaim some cold pages.  How long will it take to do that?  It's
> >> expected that almost all pages need to be scanned, so that page table
> >
> > A typical scenario. Otherwise why would we have run out of cold pages
> > and still be under memory? Because what's in memory is hot and
> > therefore most of the them need to be scanned :)
> >
> >> scanning is expected to have less overhead.  We can measure how well it
> >> is.
> >
> > Sounds good to me.
> >
> >> 2. Run the workload with hot and cold pages, if the whole working-set
> >> cannot fit in DRAM, that is, the cold pages will be reclaimed and
> >> swapped in regularly (for example tens MB/s).  It's expected that less
> >> pages may be scanned with rmap, but the speed of page table scanning is
> >> faster.
> >
> > So IIUC, this is a sustained memory pressure, i.e., servers constantly
> > running under memory pressure?
>
> Yes.  The system can accommodate more workloads at the cost of
> performance, as long as the end-user latency isn't unacceptable.  Or we
> need some time to schedule more computing resources, so we need to run
> in this condition for some while.
>
> But again, this is theory only.  I am glad if people can tell me that
> this is unrealistic.
>
> >> 3. Run the workload with hot and cold pages, the system is
> >> overcommitted, that is, some cold pages will be placed in swap.  But the
> >> cold pages are cold enough, so there's almost no thrashing.  Then the
> >> hot working-set of the workload changes, that is, some hot pages become
> >> cold, while some cold pages becomes hot, so page reclaiming and swapin
> >> will be triggered.
> >
> > This is usually what we see on clients, i.e., bursty workloads when
> > switching from an active app to an inactive one.
>
> Thanks for your information.  Now I know a typical realistic use case :-)
>
> >> For each cases, we can use some different parameters.  And we can
> >> measure something like the number of pages scanned, the time taken to
> >> scan them, the number of page reclaimed and swapped in, etc.
> >
> > Thanks, I appreciate these -- very well thought test cases. I'll look
> > into them and probably write some synthetic test cases. If you have
> > some already, I'd love to get my hands one them.
>
> Sorry.  I have no test cases in hand.  Maybe we can add some into
> Fengguang's vm-scalability test suite as follows.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/

Hi Ying,

I'm still investigating the test cases you suggested. I'm also
wondering if it's possible to test the next version, which I'll post
soon, with Intel's 0-Day infra.

Thanks.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 00/14] Multigenerational LRU
  2021-04-10  9:21           ` Yu Zhao
@ 2021-04-13  3:02             ` Huang, Ying
  2021-04-13 23:00               ` Yu Zhao
  0 siblings, 1 reply; 65+ messages in thread
From: Huang, Ying @ 2021-04-13  3:02 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Dave Hansen, linux-mm, Alex Shi, Andrew Morton, Dave Hansen,
	Hillf Danton, Johannes Weiner, Joonsoo Kim, Matthew Wilcox,
	Mel Gorman, Michal Hocko, Roman Gushchin, Vlastimil Babka,
	Wei Yang, Yang Shi, linux-kernel, page-reclaim

Yu Zhao <yuzhao@google.com> writes:

> On Tue, Mar 16, 2021 at 02:14:43PM -0700, Dave Hansen wrote:
>> On 3/16/21 1:30 PM, Yu Zhao wrote:
>> > On Tue, Mar 16, 2021 at 07:50:23AM -0700, Dave Hansen wrote:
>> >> I think it would also be very worthwhile to include some research in
>> >> this series about why the kernel moved away from page table scanning.
>> >> What has changed?  Are the workloads we were concerned about way back
>> >> then not around any more?  Has faster I/O or larger memory sizes with a
>> >> stagnating page size changed something?
>> > 
>> > Sure. Hugh also suggested this too but I personally found that ancient
>> > pre-2.4 history too irrelevant (and uninteresting) to the modern age
>> > and decided to spare audience of the boredom.
>> 
>> IIRC, rmap chains showed up in the 2.5 era and the VM was quite bumpy
>> until anon_vmas came around, which was early-ish in the 2.6 era.
>> 
>> But, either way, I think there is a sufficient population of nostalgic
>> crusty old folks around to warrant a bit of a history lesson.  We'll
>> enjoy the trip down memory lane, fondly remembering the old days in
>> Ottawa...
>> 
>> >>> nr_vmscan_write 24900719
>> >>> nr_vmscan_immediate_reclaim 115535
>> >>> pgscan_kswapd 320831544
>> >>> pgscan_direct 23396383
>> >>> pgscan_direct_throttle 0
>> >>> pgscan_anon 127491077
>> >>> pgscan_file 216736850
>> >>> slabs_scanned 400469680
>> >>> compact_migrate_scanned 1092813949
>> >>> compact_free_scanned 4919523035
>> >>> compact_daemon_migrate_scanned 2372223
>> >>> compact_daemon_free_scanned 20989310
>> >>> unevictable_pgs_scanned 307388545
>> > 
>> > 10G swap + 8G anon rss + 6G file rss, hmm... an interesting workload.
>> > The file rss does seem a bit high to me, my wild speculation is there
>> > have been git/make activities in addition to a VM?
>> 
>> I wish I was doing more git/make activities.  It's been an annoying
>> amount of email and web browsers for 12 days.  If anything, I'd suspect
>> that Thunderbird is at fault for keeping a bunch of mail in the page
>> cache.  There are a couple of VM's running though.
>
> Hi Dave,
>
> Sorry for the late reply. Here is the benchmark result from the worst
> case scenario.
>
> As you suggested, we create a lot of processes sharing one large
> sparse shmem, and they access the shmem at random 2MB-aligned offsets.
> So there will be at most one valid PTE entry per PTE table, hence the
> worst case scenario for the multigenerational LRU, since it is based
> on page table scanning.
>
> TL;DR: the multigenerational LRU did not perform worse than the rmap.
>
> My test configurations:
>
>   The size of the shmem: 256GB
>   The number of processes: 450
>   Total memory size: 200GB
>   The number of CPUs: 64
>   The number of nodes: 2
>
> There is no clear winner in the background reclaim path (kswapd).
>
>   kswapd (5.12.0-rc6):
>     43.99%  kswapd1  page_vma_mapped_walk
>     34.86%  kswapd0  page_vma_mapped_walk
>      2.43%  kswapd0  count_shadow_nodes
>      1.17%  kswapd1  page_referenced_one
>      1.15%  kswapd0  _find_next_bit.constprop.0
>      0.95%  kswapd0  page_referenced_one
>      0.87%  kswapd1  try_to_unmap_one
>      0.75%  kswapd0  cpumask_next
>      0.67%  kswapd0  shrink_slab
>      0.66%  kswapd0  down_read_trylock
>
>   kswapd (the multigenerational LRU):
>     33.39%  kswapd0  walk_pud_range
>     10.93%  kswapd1  walk_pud_range
>      9.36%  kswapd0  page_vma_mapped_walk
>      7.15%  kswapd1  page_vma_mapped_walk
>      3.83%  kswapd0  count_shadow_nodes
>      2.60%  kswapd1  shrink_slab
>      2.47%  kswapd1  down_read_trylock
>      2.03%  kswapd0  _raw_spin_lock
>      1.87%  kswapd0  shrink_slab
>      1.67%  kswapd1  count_shadow_nodes
>
> The multigenerational LRU is somewhat winning in the direct reclaim
> path (sparse is the test binary name):
>
>   The test process context (5.12.0-rc6):
>     65.02%  sparse   page_vma_mapped_walk
>      5.49%  sparse   page_counter_try_charge
>      3.60%  sparse   propagate_protected_usage
>      2.31%  sparse   page_counter_uncharge
>      2.06%  sparse   count_shadow_nodes
>      1.81%  sparse   native_queued_spin_lock_slowpath
>      1.79%  sparse   down_read_trylock
>      1.67%  sparse   page_referenced_one
>      1.42%  sparse   shrink_slab
>      0.87%  sparse   try_to_unmap_one
>
>   CPU % (direct reclaim vs the rest): 71% vs 29%
>   # grep oom_kill /proc/vmstat
>   oom_kill 81
>
>   The test process context (the multigenerational LRU):
>     33.12%  sparse   page_vma_mapped_walk
>     10.70%  sparse   walk_pud_range
>      9.64%  sparse   page_counter_try_charge
>      6.63%  sparse   propagate_protected_usage
>      4.43%  sparse   native_queued_spin_lock_slowpath
>      3.85%  sparse   page_counter_uncharge
>      3.71%  sparse   irqentry_exit_to_user_mode
>      2.16%  sparse   _raw_spin_lock
>      1.83%  sparse   unmap_page_range
>      1.82%  sparse   shrink_slab
>
>   CPU % (direct reclaim vs the rest): 47% vs 53%
>   # grep oom_kill /proc/vmstat
>   oom_kill 80
>
> I also compared other numbers from /proc/vmstat. They do not provide
> any additional insight than the profiles, so I will just omit them
> here.
>
> The following optimizations and the stats measuring their efficacies
> explain why the multigenerational LRU did not perform worse:
>
>   Optimization 1: take advantage of the scheduling information.
>     # of active processes           270
>     # of inactive processes         105
>
>   Optimization 2: take the advantage of the accessed bit on non-leaf
>   PMD entries.
>     # of old non-leaf PMD entries   30523335
>     # of young non-leaf PMD entries 1358400
>
> These stats are not currently included. But I will add them to the
> debugfs interface in the next version coming soon. And I will also add
> another optimization for Android. It reduces zigzags when there are
> many single-page VMAs, i.e., not returning to the PGD table for each
> of such VMAs. Just a heads-up.
>
> The rmap, on the other hand, had to
>   1) lock each (shmem) page it scans
>   2) go through five levels of page tables for each page, even though
>   some of them have the same LCAs
> during the test. The second part is worse given that I have 5 levels
> of page tables configured.
>
> Any additional benchmarks you would suggest? Thanks.

Hi, Yu,

Thanks for your data.

In addition to the data you measured above, is it possible for you to
measure some raw data?  For example, how many CPU cycles does it take to
scan all pages in the system?  For the page table scanning, the page
tables of all processes will be scanned.  For the rmap scanning, all
pages on the LRU lists will be scanned.  And we can do that with
different parameters, for example, shared vs. non-shared, sparse vs.
dense.  Then we can get an idea about how fast the page table scanning
can be.
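
For instance, a crude userspace harness along these lines (a sketch
based on generic /proc interfaces and my own assumptions, not a tool
used in this thread) could sample pgscan_kswapd from /proc/vmstat
together with kswapd's CPU time from /proc/<pid>/stat, and report CPU
time per scanned page; per-cycle numbers could be collected similarly
with perf instead of CPU time.

#include <stdio.h>
#include <string.h>
#include <unistd.h>

static unsigned long read_vmstat(const char *key)
{
	char name[64];
	unsigned long val, ret = 0;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 0;
	while (fscanf(f, "%63s %lu", name, &val) == 2) {
		if (!strcmp(name, key)) {
			ret = val;
			break;
		}
	}
	fclose(f);
	return ret;
}

/* utime + stime in clock ticks; kswapdN has no spaces in its comm. */
static unsigned long read_cpu_ticks(const char *pid)
{
	char path[64], buf[1024], *p;
	unsigned long utime = 0, stime = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/stat", pid);
	f = fopen(path, "r");
	if (!f)
		return 0;
	if (fgets(buf, sizeof(buf), f)) {
		p = strrchr(buf, ')');	/* skip "pid (comm)" */
		if (p)
			sscanf(p + 2, "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
			       &utime, &stime);
	}
	fclose(f);
	return utime + stime;
}

int main(int argc, char **argv)
{
	unsigned long scan0, cpu0, scans, ticks;
	double secs;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <kswapd pid>\n", argv[0]);
		return 1;
	}

	scan0 = read_vmstat("pgscan_kswapd");
	cpu0 = read_cpu_ticks(argv[1]);
	sleep(10);	/* sampling window: an arbitrary choice */
	scans = read_vmstat("pgscan_kswapd") - scan0;
	ticks = read_cpu_ticks(argv[1]) - cpu0;

	secs = (double)ticks / sysconf(_SC_CLK_TCK);
	printf("pages scanned: %lu, kswapd CPU: %.2fs\n", scans, secs);
	if (scans)
		printf("%.2f us per scanned page\n", secs * 1e6 / scans);
	return 0;
}

On a multi-node system like the one profiled above there is one kswapdN
per node, so the CPU time of each would need to be summed.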

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list
  2021-04-10 18:48                       ` Yu Zhao
@ 2021-04-13  3:06                         ` Huang, Ying
  0 siblings, 0 replies; 65+ messages in thread
From: Huang, Ying @ 2021-04-13  3:06 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Rong Chen, Rik van Riel, Linux-MM, Alex Shi, Andrew Morton,
	Dave Hansen, Hillf Danton, Johannes Weiner, Joonsoo Kim,
	Matthew Wilcox, Mel Gorman, Michal Hocko, Roman Gushchin,
	Vlastimil Babka, Wei Yang, Yang Shi, linux-kernel,
	Kernel Page Reclaim v2

Yu Zhao <yuzhao@google.com> writes:

> On Wed, Mar 24, 2021 at 12:58 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yu Zhao <yuzhao@google.com> writes:
>>
>> > On Mon, Mar 22, 2021 at 11:13:19AM +0800, Huang, Ying wrote:
>> >> Yu Zhao <yuzhao@google.com> writes:
>> >>
>> >> > On Wed, Mar 17, 2021 at 11:37:38AM +0800, Huang, Ying wrote:
>> >> >> Yu Zhao <yuzhao@google.com> writes:
>> >> >>
>> >> >> > On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
>> >> >> > The scanning overhead is only one of the two major problems of the
>> >> >> > current page reclaim. The other problem is the granularity of the
>> >> >> > active/inactive (sizes). We stopped using them in making job
>> >> >> > scheduling decision a long time ago. I know another large internet
>> >> >> > company adopted a similar approach as ours, and I'm wondering how
>> >> >> > everybody else is coping with the discrepancy from those counters.
>> >> >>
>> >> >> From intuition, the scanning overhead of the full page table scanning
>> >> >> appears higher than that of the rmap scanning for a small portion of
>> >> >> system memory.  But form your words, you think the reality is the
>> >> >> reverse?  If others concern about the overhead too, finally, I think you
>> >> >> need to prove the overhead of the page table scanning isn't too higher,
>> >> >> or even lower with more data and theory.
>> >> >
>> >> > There is a misunderstanding here. I never said anything about full
>> >> > page table scanning. And this is not how it's done in this series
>> >> > either. I guess the misunderstanding has something to do with the cold
>> >> > memory tracking you are thinking about?
>> >>
>> >> If my understanding were correct, from the following code path in your
>> >> patch 10/14,
>> >>
>> >> age_active_anon
>> >>   age_lru_gens
>> >>     try_walk_mm_list
>> >>       walk_mm_list
>> >>         walk_mm
>> >>
>> >> So, in kswapd(), the page tables of many processes may be scanned
>> >> fully.  If the number of processes that are active are high, the
>> >> overhead may be high too.
>> >
>> > That's correct. Just in case we have different definitions of what we
>> > call "full":
>> >
>> >   I understand it as the full range of the address space of a process
>> >   that was loaded by switch_mm() at least once since the last scan.
>> >   This is not the case because we don't scan the full range -- we skip
>> >   holes and VMAs that are unevictable, as well as PTE tables that have
>> >   no accessed entries on x86_64, by should_skip_vma() and
>> >   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG.
>> >
>> >   If you are referring to the full range of PTE tables that have at
>> >   least one accessed entry, i.e., other 511 are not none  but have not
>> >   been accessed either since the last scan on x86_64, then yes, you
>> >   are right again :) This is the worse case scenario.
>>
>> OK.  So there's no fundamental difference between us on this.
>>
>> >> > This series uses page tables to discover page accesses when a system
>> >> > has run out of inactive pages. Under such a situation, the system is
>> >> > very likely to have a lot of page accesses, and using the rmap is
>> >> > likely to cost a lot more because its poor memory locality compared
>> >> > with page tables.
>> >>
>> >> This is the theory.  Can you verify this with more data?  Including the
>> >> CPU cycles or time spent scanning page tables?
>> >
>> > Yes, I'll be happy to do so as I should, because page table scanning
>> > is counterintuitive. Let me add more theory in case it's still unclear
>> > to others.
>> >
>> > From my understanding, the two fundamental questions we need to
>> > consider in terms of page reclaim are:
>> >
>> >   What are the sizes of hot clusters (spatial locality) should we
>> >   expect under memory pressure?
>> >
>> >   On smaller systems with 4GB memory, our observations are that the
>> >   average size of hot clusters found during each scan is 32KB. On
>> >   larger systems with hundreds of gigabytes of memory, it's well
>> >   above this value -- 512KB or larger. These values vary under
>> >   different workloads and with different memory allocators. Unless
>> >   done deliberately by memory allocators, e.g., Scudo as I've
>> >   mentioned earlier, it's safe to say if a PTE entry has been
>> >   accessed, its neighbors are likely to have been accessed too.
>> >
>> >   What's hot memory footprint (total size of hot clusters) should we
>> >   expect when we have run out of inactive pages?
>> >
>> >   Some numbers first: on large and heavily overcommitted systems, we
>> >   have observed close to 90% during a scan. Those systems have
>> >   millions of pages and using the rmap to find out which pages to
>> >   reclaim will just blow kswapd. On smaller systems with less memory
>> >   pressure (due to their weaker CPUs), this number is more reasonable,
>> >   ~50%. Here is some kswapd profiles from a smaller systems running
>> >   5.11:
>> >
>> >    the rmap                                 page table scan
>> >    ---------------------------------------------------------------------
>> >    31.03%  page_vma_mapped_walk             49.36%  lzo1x_1_do_compress
>> >    25.59%  lzo1x_1_do_compress               4.54%  page_vma_mapped_walk
>> >     4.63%  do_raw_spin_lock                  4.45%  memset_erms
>> >     3.89%  vma_interval_tree_iter_next       3.47%  walk_pte_range
>> >     3.33%  vma_interval_tree_subtree_search  2.88%  zram_bvec_rw
>> >
>> >   The page table scan is only twice as fast. Only larger systems,
>> >   it's usually more than 4 times, without THP. With THP, both are
>> >   negligible (<1% CPU usage). I can grab profiles from our servers
>> >   too if you are interested in seeing them on 4.15 kernel.
>>
>> Yes.  On a heavily overcommitted systems with high-percent hot pages,
>> the page table scanning works much better.  Because almost all pages
>> (and their mappings) will be scanned finally.
>>
>> But on a not-so-heavily overcommitted system with low-percent hot pages,
>> it's possible that rmap scanning works better.  That is, only a small
>> fraction of the pages need to be scanned.  I know that the page table
>> scanning may still work better in many cases.
>>
>> And another possibility, on a system with cool instead of completely
>> cold pages, that is, some pages are accessed at quite low frequency, but
>> not 0, there will be always some low-bandwidth memory reclaiming.  That
>> is, it's impossible to find a perfect solution with one or two full
>> scanning.  But we need to reclaim some pages periodically.  And I guess
>> there are no perfect (or very good) page reclaiming solutions for some
>> other situations too. Where what we can do are,
>>
>> - Avoid OOM, that is, reclaim some pages if possible.
>>
>> - Control the overhead of the page reclaiming.
>>
>> But this is theory only.  If anyone can point out that they are not
>> realistic at all, it's good too :-)
>>
>> >> > But, page tables can be sparse too, in terms of hot memory tracking.
>> >> > Dave has asked me to test the worst case scenario, which I'll do.
>> >> > And I'd be happy to share more data. Any specific workload you are
>> >> > interested in?
>> >>
>> >> We can start with some simple workloads that are easier to be reasoned.
>> >> For example,
>> >>
>> >> 1. Run the workload with hot and cold pages, when the free memory
>> >> becomes lower than the low watermark, kswapd will be waken up to scan
>> >> and reclaim some cold pages.  How long will it take to do that?  It's
>> >> expected that almost all pages need to be scanned, so that page table
>> >
>> > A typical scenario. Otherwise why would we have run out of cold pages
>> > and still be under memory? Because what's in memory is hot and
>> > therefore most of the them need to be scanned :)
>> >
>> >> scanning is expected to have less overhead.  We can measure how well it
>> >> is.
>> >
>> > Sounds good to me.
>> >
>> >> 2. Run the workload with hot and cold pages, if the whole working-set
>> >> cannot fit in DRAM, that is, the cold pages will be reclaimed and
>> >> swapped in regularly (for example tens MB/s).  It's expected that less
>> >> pages may be scanned with rmap, but the speed of page table scanning is
>> >> faster.
>> >
>> > So IIUC, this is a sustained memory pressure, i.e., servers constantly
>> > running under memory pressure?
>>
>> Yes.  The system can accommodate more workloads at the cost of
>> performance, as long as the end-user latency isn't unacceptable.  Or we
>> need some time to schedule more computing resources, so we need to run
>> in this condition for some while.
>>
>> But again, this is theory only.  I am glad if people can tell me that
>> this is unrealistic.
>>
>> >> 3. Run the workload with hot and cold pages, the system is
>> >> overcommitted, that is, some cold pages will be placed in swap.  But the
>> >> cold pages are cold enough, so there's almost no thrashing.  Then the
>> >> hot working-set of the workload changes, that is, some hot pages become
>> >> cold, while some cold pages becomes hot, so page reclaiming and swapin
>> >> will be triggered.
>> >
>> > This is usually what we see on clients, i.e., bursty workloads when
>> > switching from an active app to an inactive one.
>>
>> Thanks for your information.  Now I know a typical realistic use case :-)
>>
>> >> For each cases, we can use some different parameters.  And we can
>> >> measure something like the number of pages scanned, the time taken to
>> >> scan them, the number of page reclaimed and swapped in, etc.
>> >
>> > Thanks, I appreciate these -- very well thought test cases. I'll look
>> > into them and probably write some synthetic test cases. If you have
>> > some already, I'd love to get my hands one them.
>>
>> Sorry.  I have no test cases in hand.  Maybe we can add some into
>> Fengguang's vm-scalability test suite as follows.
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/
>
> Hi Ying,
>
> I'm still investigating the test cases you suggested. I'm also
> wondering if it's possible to test the next version, which I'll post
> soon, with Intel's 0-Day infra.

Sure.  But right now 0-Day has only quite limited coverage for swap
testing: the swap tests in vm-scalability.git and several test cases
with pmbench.  I think it would be good to improve the coverage of
0-Day for swap, but that needs some time.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 00/14] Multigenerational LRU
  2021-04-13  3:02             ` Huang, Ying
@ 2021-04-13 23:00               ` Yu Zhao
  0 siblings, 0 replies; 65+ messages in thread
From: Yu Zhao @ 2021-04-13 23:00 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Dave Hansen, Linux-MM, Alex Shi, Andrew Morton, Dave Hansen,
	Hillf Danton, Johannes Weiner, Joonsoo Kim, Matthew Wilcox,
	Mel Gorman, Michal Hocko, Roman Gushchin, Vlastimil Babka,
	Wei Yang, Yang Shi, linux-kernel, Kernel Page Reclaim v2

On Mon, Apr 12, 2021 at 9:02 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > On Tue, Mar 16, 2021 at 02:14:43PM -0700, Dave Hansen wrote:
> >> On 3/16/21 1:30 PM, Yu Zhao wrote:
> >> > On Tue, Mar 16, 2021 at 07:50:23AM -0700, Dave Hansen wrote:
> >> >> I think it would also be very worthwhile to include some research in
> >> >> this series about why the kernel moved away from page table scanning.
> >> >> What has changed?  Are the workloads we were concerned about way back
> >> >> then not around any more?  Has faster I/O or larger memory sizes with a
> >> >> stagnating page size changed something?
> >> >
> >> > Sure. Hugh also suggested this too but I personally found that ancient
> >> > pre-2.4 history too irrelevant (and uninteresting) to the modern age
> >> > and decided to spare audience of the boredom.
> >>
> >> IIRC, rmap chains showed up in the 2.5 era and the VM was quite bumpy
> >> until anon_vmas came around, which was early-ish in the 2.6 era.
> >>
> >> But, either way, I think there is a sufficient population of nostalgic
> >> crusty old folks around to warrant a bit of a history lesson.  We'll
> >> enjoy the trip down memory lane, fondly remembering the old days in
> >> Ottawa...
> >>
> >> >>> nr_vmscan_write 24900719
> >> >>> nr_vmscan_immediate_reclaim 115535
> >> >>> pgscan_kswapd 320831544
> >> >>> pgscan_direct 23396383
> >> >>> pgscan_direct_throttle 0
> >> >>> pgscan_anon 127491077
> >> >>> pgscan_file 216736850
> >> >>> slabs_scanned 400469680
> >> >>> compact_migrate_scanned 1092813949
> >> >>> compact_free_scanned 4919523035
> >> >>> compact_daemon_migrate_scanned 2372223
> >> >>> compact_daemon_free_scanned 20989310
> >> >>> unevictable_pgs_scanned 307388545
> >> >
> >> > 10G swap + 8G anon rss + 6G file rss, hmm... an interesting workload.
> >> > The file rss does seem a bit high to me, my wild speculation is there
> >> > have been git/make activities in addition to a VM?
> >>
> >> I wish I was doing more git/make activities.  It's been an annoying
> >> amount of email and web browsers for 12 days.  If anything, I'd suspect
> >> that Thunderbird is at fault for keeping a bunch of mail in the page
> >> cache.  There are a couple of VM's running though.
> >
> > Hi Dave,
> >
> > Sorry for the late reply. Here is the benchmark result from the worst
> > case scenario.
> >
> > As you suggested, we create a lot of processes sharing one large
> > sparse shmem, and they access the shmem at random 2MB-aligned offsets.
> > So there will be at most one valid PTE entry per PTE table, hence the
> > worst case scenario for the multigenerational LRU, since it is based
> > on page table scanning.
> >
> > TL;DR: the multigenerational LRU did not perform worse than the rmap.
> >
> > My test configurations:
> >
> >   The size of the shmem: 256GB
> >   The number of processes: 450
> >   Total memory size: 200GB
> >   The number of CPUs: 64
> >   The number of nodes: 2
> >
> > There is no clear winner in the background reclaim path (kswapd).
> >
> >   kswapd (5.12.0-rc6):
> >     43.99%  kswapd1  page_vma_mapped_walk
> >     34.86%  kswapd0  page_vma_mapped_walk
> >      2.43%  kswapd0  count_shadow_nodes
> >      1.17%  kswapd1  page_referenced_one
> >      1.15%  kswapd0  _find_next_bit.constprop.0
> >      0.95%  kswapd0  page_referenced_one
> >      0.87%  kswapd1  try_to_unmap_one
> >      0.75%  kswapd0  cpumask_next
> >      0.67%  kswapd0  shrink_slab
> >      0.66%  kswapd0  down_read_trylock
> >
> >   kswapd (the multigenerational LRU):
> >     33.39%  kswapd0  walk_pud_range
> >     10.93%  kswapd1  walk_pud_range
> >      9.36%  kswapd0  page_vma_mapped_walk
> >      7.15%  kswapd1  page_vma_mapped_walk
> >      3.83%  kswapd0  count_shadow_nodes
> >      2.60%  kswapd1  shrink_slab
> >      2.47%  kswapd1  down_read_trylock
> >      2.03%  kswapd0  _raw_spin_lock
> >      1.87%  kswapd0  shrink_slab
> >      1.67%  kswapd1  count_shadow_nodes
> >
> > The multigenerational LRU is somewhat winning in the direct reclaim
> > path (sparse is the test binary name):
> >
> >   The test process context (5.12.0-rc6):
> >     65.02%  sparse   page_vma_mapped_walk
> >      5.49%  sparse   page_counter_try_charge
> >      3.60%  sparse   propagate_protected_usage
> >      2.31%  sparse   page_counter_uncharge
> >      2.06%  sparse   count_shadow_nodes
> >      1.81%  sparse   native_queued_spin_lock_slowpath
> >      1.79%  sparse   down_read_trylock
> >      1.67%  sparse   page_referenced_one
> >      1.42%  sparse   shrink_slab
> >      0.87%  sparse   try_to_unmap_one
> >
> >   CPU % (direct reclaim vs the rest): 71% vs 29%
> >   # grep oom_kill /proc/vmstat
> >   oom_kill 81
> >
> >   The test process context (the multigenerational LRU):
> >     33.12%  sparse   page_vma_mapped_walk
> >     10.70%  sparse   walk_pud_range
> >      9.64%  sparse   page_counter_try_charge
> >      6.63%  sparse   propagate_protected_usage
> >      4.43%  sparse   native_queued_spin_lock_slowpath
> >      3.85%  sparse   page_counter_uncharge
> >      3.71%  sparse   irqentry_exit_to_user_mode
> >      2.16%  sparse   _raw_spin_lock
> >      1.83%  sparse   unmap_page_range
> >      1.82%  sparse   shrink_slab
> >
> >   CPU % (direct reclaim vs the rest): 47% vs 53%
> >   # grep oom_kill /proc/vmstat
> >   oom_kill 80
> >
> > I also compared other numbers from /proc/vmstat. They do not provide
> > any additional insight than the profiles, so I will just omit them
> > here.
> >
> > The following optimizations and the stats measuring their efficacies
> > explain why the multigenerational LRU did not perform worse:
> >
> >   Optimization 1: take advantage of the scheduling information.
> >     # of active processes           270
> >     # of inactive processes         105
> >
> >   Optimization 2: take the advantage of the accessed bit on non-leaf
> >   PMD entries.
> >     # of old non-leaf PMD entries   30523335
> >     # of young non-leaf PMD entries 1358400
> >
> > These stats are not currently included. But I will add them to the
> > debugfs interface in the next version coming soon. And I will also add
> > another optimization for Android. It reduces zigzags when there are
> > many single-page VMAs, i.e., not returning to the PGD table for each
> > of such VMAs. Just a heads-up.
> >
> > The rmap, on the other hand, had to
> >   1) lock each (shmem) page it scans
> >   2) go through five levels of page tables for each page, even though
> >   some of them have the same LCAs
> > during the test. The second part is worse given that I have 5 levels
> > of page tables configured.
> >
> > Any additional benchmarks you would suggest? Thanks.
>
> Hi, Yu,
>
> Thanks for your data.
>
> In addition to the data your measured above, is it possible for you to
> measure some raw data?  For example, how many CPU cycles does it take to
> scan all pages in the system?  For the page table scanning, the page
> tables of all processes will be scanned.  For the rmap scanning, all
> pages in LRU will be scanned.  And we can do that with difference
> parameters, for example, shared vs. non-shared, sparse vs. dense.  Then
> we can get an idea about how fast the page table scanning can be.

SGTM. I'll get back to you later.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 10/14] mm: multigenerational lru: core
  2021-03-16  8:53       ` Huang, Ying
@ 2021-03-16 18:40         ` Yu Zhao
  0 siblings, 0 replies; 65+ messages in thread
From: Yu Zhao @ 2021-03-16 18:40 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, linux-kernel, page-reclaim

On Tue, Mar 16, 2021 at 04:53:53PM +0800, Huang, Ying wrote:
> Yu Zhao <yuzhao@google.com> writes:
> 
> > On Tue, Mar 16, 2021 at 02:52:52PM +0800, Huang, Ying wrote:
> >> Yu Zhao <yuzhao@google.com> writes:
> >> 
> >> > On Tue, Mar 16, 2021 at 10:08:51AM +0800, Huang, Ying wrote:
> >> >> Yu Zhao <yuzhao@google.com> writes:
> >> >> [snip]
> >> >> 
> >> >> > +/* Main function used by foreground, background and user-triggered aging. */
> >> >> > +static bool walk_mm_list(struct lruvec *lruvec, unsigned long next_seq,
> >> >> > +			 struct scan_control *sc, int swappiness)
> >> >> > +{
> >> >> > +	bool last;
> >> >> > +	struct mm_struct *mm = NULL;
> >> >> > +	int nid = lruvec_pgdat(lruvec)->node_id;
> >> >> > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> >> >> > +	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
> >> >> > +
> >> >> > +	VM_BUG_ON(next_seq > READ_ONCE(lruvec->evictable.max_seq));
> >> >> > +
> >> >> > +	/*
> >> >> > +	 * For each walk of the mm list of a memcg, we decrement the priority
> >> >> > +	 * of its lruvec. For each walk of memcgs in kswapd, we increment the
> >> >> > +	 * priorities of all lruvecs.
> >> >> > +	 *
> >> >> > +	 * So if this lruvec has a higher priority (smaller value), it means
> >> >> > +	 * other concurrent reclaimers (global or memcg reclaim) have walked
> >> >> > +	 * its mm list. Skip it for this priority to balance the pressure on
> >> >> > +	 * all memcgs.
> >> >> > +	 */
> >> >> > +#ifdef CONFIG_MEMCG
> >> >> > +	if (!mem_cgroup_disabled() && !cgroup_reclaim(sc) &&
> >> >> > +	    sc->priority > atomic_read(&lruvec->evictable.priority))
> >> >> > +		return false;
> >> >> > +#endif
> >> >> > +
> >> >> > +	do {
> >> >> > +		last = get_next_mm(lruvec, next_seq, swappiness, &mm);
> >> >> > +		if (mm)
> >> >> > +			walk_mm(lruvec, mm, swappiness);
> >> >> > +
> >> >> > +		cond_resched();
> >> >> > +	} while (mm);
> >> >> 
> >> >> It appears that we need to scan the whole address space of multiple
> >> >> processes in this loop?
> >> >> 
> >> >> If so, I have some concerns about the duration of the function.  Do you
> >> >> have some number of the distribution of the duration of the function?
> >> >> And may be the number of mm_struct and the number of pages scanned.
> >> >> 
> >> >> In comparison, in the traditional LRU algorithm, for each round, only a
> >> >> small subset of the whole physical memory is scanned.
> >> >
> >> > Reasonable concerns, and insightful too. We are sensitive to direct
> >> > reclaim latency, and we tuned another path carefully so that direct
> >> > reclaims virtually don't hit this path :)
> >> >
> >> > Some numbers from the cover letter first:
> >> >   In addition, direct reclaim latency is reduced by 22% at 99th
> >> >   percentile and the number of refaults is reduced 7%. These metrics are
> >> >   important to phones and laptops as they are correlated to user
> >> >   experience.
> >> >
> >> > And "another path" is the background aging in kswapd:
> >> >   age_active_anon()
> >> >     age_lru_gens()
> >> >       try_walk_mm_list()
> >> >         /* try to spread pages out across spread+1 generations */
> >> >         if (old_and_young[0] >= old_and_young[1] * spread &&
> >> >             min_nr_gens(max_seq, min_seq, swappiness) > max(spread, MIN_NR_GENS))
> >> >                 return;
> >> >
> >> >         walk_mm_list(lruvec, max_seq, sc, swappiness);
> >> >
> >> > By default, spread = 2, which makes kswapd slight more aggressive
> >> > than direct reclaim for our use cases. This can be entirely disabled
> >> > by setting spread to 0, for worloads that don't care about direct
> >> > reclaim latency, or larger values, they are more sensitive than
> >> > ours.
> >> 
> >> OK, I see.  That can avoid the long latency in direct reclaim path.
> >> 
> >> > It's worth noting that walk_mm_list() is multithreaded -- reclaiming
> >> > threads can work on different mm_structs on the same list
> >> > concurrently. We do occasionally see this function in direct reclaims,
> >> > on over-overcommitted systems, i.e., kswapd CPU usage is 100%. Under
> >> > the same condition, we saw the current page reclaim live locked and
> >> > triggered hardware watchdog timeouts (our hardware watchdog is set to
> >> > 2 hours) many times.
> >> 
> >> Just to confirm, in the current page reclaim, kswapd will keep running
> >> until watchdog?  This is avoided in your algorithm mainly via
> >> multi-threading?  Or via direct vs. reversing page table scanning?
> >
> > Well, don't tell me you've seen the problem :) Let me explain one
> > subtle difference in how the aging works between the current page
> > reclaim and this series, and point you to the code.
> >
> > In the current page reclaim, we can't scan a page via the rmap without
> > isolating the page first. So the aging basically isolates a batch of
> > pages from a lru list, walks the rmap for each of the pages, and puts
> > active ones back to the list.
> >
> > In this series, aging walks page tables to update the generation
> > numbers of active pages without isolating them. The isolation is the
> > subtle difference: it's not a problem when there are few threads, but
> > it causes live locks when hundreds of threads running the aging and
> > hit the following in shrink_inactive_list():
> >
> > 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> > 		if (stalled)
> > 			return 0;
> >
> > 		/* wait a bit for the reclaimer. */
> > 		msleep(100);
> > 		stalled = true;
> >
> > 		/* We are about to die and free our memory. Return now. */
> > 		if (fatal_signal_pending(current))
> > 			return SWAP_CLUSTER_MAX;
> > 	}
> >
> > Thanks to Michal who has improved it considerably by commit
> > db73ee0d4637 ("mm, vmscan: do not loop on too_many_isolated for
> > ever"). But we still occasionally see live locks on over-overcommitted
> > machines. Reclaiming threads step on each other while interleaving
> > between the msleep() and the aging, on 100+ CPUs.
> 
> Got it!  Thanks a lot for detailed explanation!

You are always welcome. Just a side note as you and Dave are working
on migrating pages from DRAM to AEP: we also observed that migrations
can interfere with (block) reclaims due to the same piece of code
above. It happened when there was a lot of compaction activity going
on. My guess is that it could happen in your use case too. Migrations
can isolate a large number of pages; see migrate_pages().
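
For context, the check that reclaim and migration end up contending on
looks roughly like the sketch below. This is an illustrative
restatement, not the exact code in mm/vmscan.c -- the upstream
too_many_isolated() has a few extra conditions (kswapd is exempt, and
the threshold depends on the allocation context) -- but it shows why
anything that isolates pages in bulk counts against the same
NR_ISOLATED_* node counters:

	static bool too_many_isolated_sketch(struct pglist_data *pgdat, bool file)
	{
		unsigned long inactive, isolated;

		if (file) {
			inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
			isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
		} else {
			inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
			isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
		}

		/*
		 * Reclaim and migrate_pages() both account the pages they
		 * isolate in NR_ISOLATED_*, so heavy compaction/migration
		 * can push this over the threshold and stall reclaimers.
		 */
		return isolated > inactive;
	}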


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 10/14] mm: multigenerational lru: core
  2021-03-16  8:24     ` Yu Zhao
@ 2021-03-16  8:53       ` Huang, Ying
  2021-03-16 18:40         ` Yu Zhao
  0 siblings, 1 reply; 65+ messages in thread
From: Huang, Ying @ 2021-03-16  8:53 UTC (permalink / raw)
  To: Yu Zhao
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, linux-kernel, page-reclaim

Yu Zhao <yuzhao@google.com> writes:

> On Tue, Mar 16, 2021 at 02:52:52PM +0800, Huang, Ying wrote:
>> Yu Zhao <yuzhao@google.com> writes:
>> 
>> > On Tue, Mar 16, 2021 at 10:08:51AM +0800, Huang, Ying wrote:
>> >> Yu Zhao <yuzhao@google.com> writes:
>> >> [snip]
>> >> 
>> >> > +/* Main function used by foreground, background and user-triggered aging. */
>> >> > +static bool walk_mm_list(struct lruvec *lruvec, unsigned long next_seq,
>> >> > +			 struct scan_control *sc, int swappiness)
>> >> > +{
>> >> > +	bool last;
>> >> > +	struct mm_struct *mm = NULL;
>> >> > +	int nid = lruvec_pgdat(lruvec)->node_id;
>> >> > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>> >> > +	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
>> >> > +
>> >> > +	VM_BUG_ON(next_seq > READ_ONCE(lruvec->evictable.max_seq));
>> >> > +
>> >> > +	/*
>> >> > +	 * For each walk of the mm list of a memcg, we decrement the priority
>> >> > +	 * of its lruvec. For each walk of memcgs in kswapd, we increment the
>> >> > +	 * priorities of all lruvecs.
>> >> > +	 *
>> >> > +	 * So if this lruvec has a higher priority (smaller value), it means
>> >> > +	 * other concurrent reclaimers (global or memcg reclaim) have walked
>> >> > +	 * its mm list. Skip it for this priority to balance the pressure on
>> >> > +	 * all memcgs.
>> >> > +	 */
>> >> > +#ifdef CONFIG_MEMCG
>> >> > +	if (!mem_cgroup_disabled() && !cgroup_reclaim(sc) &&
>> >> > +	    sc->priority > atomic_read(&lruvec->evictable.priority))
>> >> > +		return false;
>> >> > +#endif
>> >> > +
>> >> > +	do {
>> >> > +		last = get_next_mm(lruvec, next_seq, swappiness, &mm);
>> >> > +		if (mm)
>> >> > +			walk_mm(lruvec, mm, swappiness);
>> >> > +
>> >> > +		cond_resched();
>> >> > +	} while (mm);
>> >> 
>> >> It appears that we need to scan the whole address space of multiple
>> >> processes in this loop?
>> >> 
>> >> If so, I have some concerns about the duration of the function.  Do you
>> >> have some number of the distribution of the duration of the function?
>> >> And may be the number of mm_struct and the number of pages scanned.
>> >> 
>> >> In comparison, in the traditional LRU algorithm, for each round, only a
>> >> small subset of the whole physical memory is scanned.
>> >
>> > Reasonable concerns, and insightful too. We are sensitive to direct
>> > reclaim latency, and we tuned another path carefully so that direct
>> > reclaims virtually don't hit this path :)
>> >
>> > Some numbers from the cover letter first:
>> >   In addition, direct reclaim latency is reduced by 22% at 99th
>> >   percentile and the number of refaults is reduced 7%. These metrics are
>> >   important to phones and laptops as they are correlated to user
>> >   experience.
>> >
>> > And "another path" is the background aging in kswapd:
>> >   age_active_anon()
>> >     age_lru_gens()
>> >       try_walk_mm_list()
>> >         /* try to spread pages out across spread+1 generations */
>> >         if (old_and_young[0] >= old_and_young[1] * spread &&
>> >             min_nr_gens(max_seq, min_seq, swappiness) > max(spread, MIN_NR_GENS))
>> >                 return;
>> >
>> >         walk_mm_list(lruvec, max_seq, sc, swappiness);
>> >
>> > By default, spread = 2, which makes kswapd slight more aggressive
>> > than direct reclaim for our use cases. This can be entirely disabled
>> > by setting spread to 0, for worloads that don't care about direct
>> > reclaim latency, or larger values, they are more sensitive than
>> > ours.
>> 
>> OK, I see.  That can avoid the long latency in direct reclaim path.
>> 
>> > It's worth noting that walk_mm_list() is multithreaded -- reclaiming
>> > threads can work on different mm_structs on the same list
>> > concurrently. We do occasionally see this function in direct reclaims,
>> > on over-overcommitted systems, i.e., kswapd CPU usage is 100%. Under
>> > the same condition, we saw the current page reclaim live locked and
>> > triggered hardware watchdog timeouts (our hardware watchdog is set to
>> > 2 hours) many times.
>> 
>> Just to confirm, in the current page reclaim, kswapd will keep running
>> until watchdog?  This is avoided in your algorithm mainly via
>> multi-threading?  Or via direct vs. reversing page table scanning?
>
> Well, don't tell me you've seen the problem :) Let me explain one
> subtle difference in how the aging works between the current page
> reclaim and this series, and point you to the code.
>
> In the current page reclaim, we can't scan a page via the rmap without
> isolating the page first. So the aging basically isolates a batch of
> pages from a lru list, walks the rmap for each of the pages, and puts
> active ones back to the list.
>
> In this series, aging walks page tables to update the generation
> numbers of active pages without isolating them. The isolation is the
> subtle difference: it's not a problem when there are few threads, but
> it causes live locks when hundreds of threads running the aging and
> hit the following in shrink_inactive_list():
>
> 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> 		if (stalled)
> 			return 0;
>
> 		/* wait a bit for the reclaimer. */
> 		msleep(100);
> 		stalled = true;
>
> 		/* We are about to die and free our memory. Return now. */
> 		if (fatal_signal_pending(current))
> 			return SWAP_CLUSTER_MAX;
> 	}
>
> Thanks to Michal who has improved it considerably by commit
> db73ee0d4637 ("mm, vmscan: do not loop on too_many_isolated for
> ever"). But we still occasionally see live locks on over-overcommitted
> machines. Reclaiming threads step on each other while interleaving
> between the msleep() and the aging, on 100+ CPUs.

Got it!  Thanks a lot for the detailed explanation!

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 10/14] mm: multigenerational lru: core
  2021-03-16  6:52   ` Huang, Ying
@ 2021-03-16  8:24     ` Yu Zhao
  2021-03-16  8:53       ` Huang, Ying
  0 siblings, 1 reply; 65+ messages in thread
From: Yu Zhao @ 2021-03-16  8:24 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, linux-kernel, page-reclaim

On Tue, Mar 16, 2021 at 02:52:52PM +0800, Huang, Ying wrote:
> Yu Zhao <yuzhao@google.com> writes:
> 
> > On Tue, Mar 16, 2021 at 10:08:51AM +0800, Huang, Ying wrote:
> >> Yu Zhao <yuzhao@google.com> writes:
> >> [snip]
> >> 
> >> > +/* Main function used by foreground, background and user-triggered aging. */
> >> > +static bool walk_mm_list(struct lruvec *lruvec, unsigned long next_seq,
> >> > +			 struct scan_control *sc, int swappiness)
> >> > +{
> >> > +	bool last;
> >> > +	struct mm_struct *mm = NULL;
> >> > +	int nid = lruvec_pgdat(lruvec)->node_id;
> >> > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> >> > +	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
> >> > +
> >> > +	VM_BUG_ON(next_seq > READ_ONCE(lruvec->evictable.max_seq));
> >> > +
> >> > +	/*
> >> > +	 * For each walk of the mm list of a memcg, we decrement the priority
> >> > +	 * of its lruvec. For each walk of memcgs in kswapd, we increment the
> >> > +	 * priorities of all lruvecs.
> >> > +	 *
> >> > +	 * So if this lruvec has a higher priority (smaller value), it means
> >> > +	 * other concurrent reclaimers (global or memcg reclaim) have walked
> >> > +	 * its mm list. Skip it for this priority to balance the pressure on
> >> > +	 * all memcgs.
> >> > +	 */
> >> > +#ifdef CONFIG_MEMCG
> >> > +	if (!mem_cgroup_disabled() && !cgroup_reclaim(sc) &&
> >> > +	    sc->priority > atomic_read(&lruvec->evictable.priority))
> >> > +		return false;
> >> > +#endif
> >> > +
> >> > +	do {
> >> > +		last = get_next_mm(lruvec, next_seq, swappiness, &mm);
> >> > +		if (mm)
> >> > +			walk_mm(lruvec, mm, swappiness);
> >> > +
> >> > +		cond_resched();
> >> > +	} while (mm);
> >> 
> >> It appears that we need to scan the whole address space of multiple
> >> processes in this loop?
> >> 
> >> If so, I have some concerns about the duration of the function.  Do you
> >> have some number of the distribution of the duration of the function?
> >> And may be the number of mm_struct and the number of pages scanned.
> >> 
> >> In comparison, in the traditional LRU algorithm, for each round, only a
> >> small subset of the whole physical memory is scanned.
> >
> > Reasonable concerns, and insightful too. We are sensitive to direct
> > reclaim latency, and we tuned another path carefully so that direct
> > reclaims virtually don't hit this path :)
> >
> > Some numbers from the cover letter first:
> >   In addition, direct reclaim latency is reduced by 22% at 99th
> >   percentile and the number of refaults is reduced 7%. These metrics are
> >   important to phones and laptops as they are correlated to user
> >   experience.
> >
> > And "another path" is the background aging in kswapd:
> >   age_active_anon()
> >     age_lru_gens()
> >       try_walk_mm_list()
> >         /* try to spread pages out across spread+1 generations */
> >         if (old_and_young[0] >= old_and_young[1] * spread &&
> >             min_nr_gens(max_seq, min_seq, swappiness) > max(spread, MIN_NR_GENS))
> >                 return;
> >
> >         walk_mm_list(lruvec, max_seq, sc, swappiness);
> >
> > By default, spread = 2, which makes kswapd slight more aggressive
> > than direct reclaim for our use cases. This can be entirely disabled
> > by setting spread to 0, for worloads that don't care about direct
> > reclaim latency, or larger values, they are more sensitive than
> > ours.
> 
> OK, I see.  That can avoid the long latency in direct reclaim path.
> 
> > It's worth noting that walk_mm_list() is multithreaded -- reclaiming
> > threads can work on different mm_structs on the same list
> > concurrently. We do occasionally see this function in direct reclaims,
> > on over-overcommitted systems, i.e., kswapd CPU usage is 100%. Under
> > the same condition, we saw the current page reclaim live locked and
> > triggered hardware watchdog timeouts (our hardware watchdog is set to
> > 2 hours) many times.
> 
> Just to confirm, in the current page reclaim, kswapd will keep running
> until watchdog?  This is avoided in your algorithm mainly via
> multi-threading?  Or via direct vs. reversing page table scanning?

Well, don't tell me you've seen the problem :) Let me explain one
subtle difference in how the aging works between the current page
reclaim and this series, and point you to the code.

In the current page reclaim, we can't scan a page via the rmap without
isolating the page first. So the aging basically isolates a batch of
pages from a lru list, walks the rmap for each of the pages, and puts
active ones back to the list.

In this series, aging walks page tables to update the generation
numbers of active pages without isolating them. The isolation is the
subtle difference: it's not a problem when there are few threads, but
it causes live locks when hundreds of threads run the aging and hit
the following in shrink_inactive_list():

	while (unlikely(too_many_isolated(pgdat, file, sc))) {
		if (stalled)
			return 0;

		/* wait a bit for the reclaimer. */
		msleep(100);
		stalled = true;

		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}

Thanks to Michal, who improved it considerably with commit
db73ee0d4637 ("mm, vmscan: do not loop on too_many_isolated for
ever"). But we still occasionally see live locks on over-overcommitted
machines: reclaiming threads step on each other while interleaving
between the msleep() and the aging, on 100+ CPUs.
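
To make the contrast above concrete, the aging in this series does
something along the lines of the sketch below: harvest the accessed
bit and retag the page with the current generation, with no call into
isolate_lru_page(). This is a heavily simplified illustration --
page_update_gen() is a hypothetical stand-in for the series'
generation helpers, and the PTE locking, batching and huge page
handling of the real walker are omitted:

	static void age_pte_range_sketch(struct vm_area_struct *vma, pmd_t *pmd,
					 unsigned long start, unsigned long end,
					 unsigned long max_seq)
	{
		int i;
		unsigned long addr;
		pte_t *pte = pte_offset_map(pmd, start);

		for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
			struct page *page;

			if (!pte_present(pte[i]) || !pte_young(pte[i]))
				continue;

			page = vm_normal_page(vma, addr, pte[i]);
			if (!page)
				continue;

			/* Harvest the accessed bit for the next round of aging. */
			ptep_test_and_clear_young(vma, addr, &pte[i]);

			/* Tag the page with the youngest generation; no isolation. */
			page_update_gen(page, max_seq);
		}

		pte_unmap(pte);
	}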


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 10/14] mm: multigenerational lru: core
  2021-03-16  4:45 ` Yu Zhao
@ 2021-03-16  6:52   ` Huang, Ying
  2021-03-16  8:24     ` Yu Zhao
  0 siblings, 1 reply; 65+ messages in thread
From: Huang, Ying @ 2021-03-16  6:52 UTC (permalink / raw)
  To: Yu Zhao
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, linux-kernel, page-reclaim

Yu Zhao <yuzhao@google.com> writes:

> On Tue, Mar 16, 2021 at 10:08:51AM +0800, Huang, Ying wrote:
>> Yu Zhao <yuzhao@google.com> writes:
>> [snip]
>> 
>> > +/* Main function used by foreground, background and user-triggered aging. */
>> > +static bool walk_mm_list(struct lruvec *lruvec, unsigned long next_seq,
>> > +			 struct scan_control *sc, int swappiness)
>> > +{
>> > +	bool last;
>> > +	struct mm_struct *mm = NULL;
>> > +	int nid = lruvec_pgdat(lruvec)->node_id;
>> > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>> > +	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
>> > +
>> > +	VM_BUG_ON(next_seq > READ_ONCE(lruvec->evictable.max_seq));
>> > +
>> > +	/*
>> > +	 * For each walk of the mm list of a memcg, we decrement the priority
>> > +	 * of its lruvec. For each walk of memcgs in kswapd, we increment the
>> > +	 * priorities of all lruvecs.
>> > +	 *
>> > +	 * So if this lruvec has a higher priority (smaller value), it means
>> > +	 * other concurrent reclaimers (global or memcg reclaim) have walked
>> > +	 * its mm list. Skip it for this priority to balance the pressure on
>> > +	 * all memcgs.
>> > +	 */
>> > +#ifdef CONFIG_MEMCG
>> > +	if (!mem_cgroup_disabled() && !cgroup_reclaim(sc) &&
>> > +	    sc->priority > atomic_read(&lruvec->evictable.priority))
>> > +		return false;
>> > +#endif
>> > +
>> > +	do {
>> > +		last = get_next_mm(lruvec, next_seq, swappiness, &mm);
>> > +		if (mm)
>> > +			walk_mm(lruvec, mm, swappiness);
>> > +
>> > +		cond_resched();
>> > +	} while (mm);
>> 
>> It appears that we need to scan the whole address space of multiple
>> processes in this loop?
>> 
>> If so, I have some concerns about the duration of the function.  Do you
>> have some number of the distribution of the duration of the function?
>> And may be the number of mm_struct and the number of pages scanned.
>> 
>> In comparison, in the traditional LRU algorithm, for each round, only a
>> small subset of the whole physical memory is scanned.
>
> Reasonable concerns, and insightful too. We are sensitive to direct
> reclaim latency, and we tuned another path carefully so that direct
> reclaims virtually don't hit this path :)
>
> Some numbers from the cover letter first:
>   In addition, direct reclaim latency is reduced by 22% at 99th
>   percentile and the number of refaults is reduced 7%. These metrics are
>   important to phones and laptops as they are correlated to user
>   experience.
>
> And "another path" is the background aging in kswapd:
>   age_active_anon()
>     age_lru_gens()
>       try_walk_mm_list()
>         /* try to spread pages out across spread+1 generations */
>         if (old_and_young[0] >= old_and_young[1] * spread &&
>             min_nr_gens(max_seq, min_seq, swappiness) > max(spread, MIN_NR_GENS))
>                 return;
>
>         walk_mm_list(lruvec, max_seq, sc, swappiness);
>
> By default, spread = 2, which makes kswapd slight more aggressive
> than direct reclaim for our use cases. This can be entirely disabled
> by setting spread to 0, for worloads that don't care about direct
> reclaim latency, or larger values, they are more sensitive than
> ours.

OK, I see.  That can avoid the long latency in the direct reclaim path.

> It's worth noting that walk_mm_list() is multithreaded -- reclaiming
> threads can work on different mm_structs on the same list
> concurrently. We do occasionally see this function in direct reclaims,
> on over-overcommitted systems, i.e., kswapd CPU usage is 100%. Under
> the same condition, we saw the current page reclaim live locked and
> triggered hardware watchdog timeouts (our hardware watchdog is set to
> 2 hours) many times.

Just to confirm, in the current page reclaim, kswapd will keep running
until the watchdog fires?  And this is avoided in your algorithm mainly
via multi-threading?  Or via direct page table scanning vs. reverse
(rmap) scanning?

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 10/14] mm: multigenerational lru: core
  2021-03-16  2:08 [PATCH v1 10/14] mm: multigenerational lru: core Huang, Ying
@ 2021-03-16  4:45 ` Yu Zhao
  2021-03-16  6:52   ` Huang, Ying
  0 siblings, 1 reply; 65+ messages in thread
From: Yu Zhao @ 2021-03-16  4:45 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, linux-kernel, page-reclaim

On Tue, Mar 16, 2021 at 10:08:51AM +0800, Huang, Ying wrote:
> Yu Zhao <yuzhao@google.com> writes:
> [snip]
> 
> > +/* Main function used by foreground, background and user-triggered aging. */
> > +static bool walk_mm_list(struct lruvec *lruvec, unsigned long next_seq,
> > +			 struct scan_control *sc, int swappiness)
> > +{
> > +	bool last;
> > +	struct mm_struct *mm = NULL;
> > +	int nid = lruvec_pgdat(lruvec)->node_id;
> > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > +	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
> > +
> > +	VM_BUG_ON(next_seq > READ_ONCE(lruvec->evictable.max_seq));
> > +
> > +	/*
> > +	 * For each walk of the mm list of a memcg, we decrement the priority
> > +	 * of its lruvec. For each walk of memcgs in kswapd, we increment the
> > +	 * priorities of all lruvecs.
> > +	 *
> > +	 * So if this lruvec has a higher priority (smaller value), it means
> > +	 * other concurrent reclaimers (global or memcg reclaim) have walked
> > +	 * its mm list. Skip it for this priority to balance the pressure on
> > +	 * all memcgs.
> > +	 */
> > +#ifdef CONFIG_MEMCG
> > +	if (!mem_cgroup_disabled() && !cgroup_reclaim(sc) &&
> > +	    sc->priority > atomic_read(&lruvec->evictable.priority))
> > +		return false;
> > +#endif
> > +
> > +	do {
> > +		last = get_next_mm(lruvec, next_seq, swappiness, &mm);
> > +		if (mm)
> > +			walk_mm(lruvec, mm, swappiness);
> > +
> > +		cond_resched();
> > +	} while (mm);
> 
> It appears that we need to scan the whole address space of multiple
> processes in this loop?
> 
> If so, I have some concerns about the duration of the function.  Do you
> have some number of the distribution of the duration of the function?
> And may be the number of mm_struct and the number of pages scanned.
> 
> In comparison, in the traditional LRU algorithm, for each round, only a
> small subset of the whole physical memory is scanned.

Reasonable concerns, and insightful too. We are sensitive to direct
reclaim latency, and we tuned another path carefully so that direct
reclaims virtually never hit this path :)

Some numbers from the cover letter first:
  In addition, direct reclaim latency is reduced by 22% at 99th
  percentile and the number of refaults is reduced 7%. These metrics are
  important to phones and laptops as they are correlated to user
  experience.

And "another path" is the background aging in kswapd:
  age_active_anon()
    age_lru_gens()
      try_walk_mm_list()
        /* try to spread pages out across spread+1 generations */
        if (old_and_young[0] >= old_and_young[1] * spread &&
            min_nr_gens(max_seq, min_seq, swappiness) > max(spread, MIN_NR_GENS))
                return;

        walk_mm_list(lruvec, max_seq, sc, swappiness);

By default, spread = 2, which makes kswapd slightly more aggressive
than direct reclaim for our use cases. This can be entirely disabled
by setting spread to 0 for workloads that don't care about direct
reclaim latency, or set to larger values for workloads that are more
sensitive than ours.
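
Spelled out as a predicate, the early return above amounts to
something like this (an illustrative restatement of the quoted
snippet, with made-up parameter names, not the actual patch code):

	/*
	 * kswapd skips background aging only while the older generations
	 * already hold at least "spread" times as many pages as the
	 * youngest one and more than max(spread, MIN_NR_GENS) generations
	 * are available to evict from.
	 */
	static bool aging_can_wait(unsigned long old_pages, unsigned long young_pages,
				   int spread, int nr_gens)
	{
		return old_pages >= young_pages * spread &&
		       nr_gens > max(spread, MIN_NR_GENS);
	}

On this reading, spread == 0 makes the first clause trivially true, so
kswapd ages only to maintain the minimum number of generations; larger
values make it start aging earlier, which suits workloads that are more
sensitive to direct reclaim latency.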

It's worth noting that walk_mm_list() is multithreaded -- reclaiming
threads can work on different mm_structs on the same list
concurrently. We do occasionally see this function in direct reclaims
on over-overcommitted systems, i.e., when kswapd CPU usage is already
100%. Under the same condition, we have seen the current page reclaim
live lock and trigger hardware watchdog timeouts (our hardware
watchdog is set to 2 hours) many times.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v1 10/14] mm: multigenerational lru: core
@ 2021-03-16  2:08 Huang, Ying
  2021-03-16  4:45 ` Yu Zhao
  0 siblings, 1 reply; 65+ messages in thread
From: Huang, Ying @ 2021-03-16  2:08 UTC (permalink / raw)
  To: Yu Zhao
  Cc: linux-mm, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman,
	Michal Hocko, Roman Gushchin, Vlastimil Babka, Wei Yang,
	Yang Shi, linux-kernel, page-reclaim

Yu Zhao <yuzhao@google.com> writes:
[snip]

> +/* Main function used by foreground, background and user-triggered aging. */
> +static bool walk_mm_list(struct lruvec *lruvec, unsigned long next_seq,
> +			 struct scan_control *sc, int swappiness)
> +{
> +	bool last;
> +	struct mm_struct *mm = NULL;
> +	int nid = lruvec_pgdat(lruvec)->node_id;
> +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
> +
> +	VM_BUG_ON(next_seq > READ_ONCE(lruvec->evictable.max_seq));
> +
> +	/*
> +	 * For each walk of the mm list of a memcg, we decrement the priority
> +	 * of its lruvec. For each walk of memcgs in kswapd, we increment the
> +	 * priorities of all lruvecs.
> +	 *
> +	 * So if this lruvec has a higher priority (smaller value), it means
> +	 * other concurrent reclaimers (global or memcg reclaim) have walked
> +	 * its mm list. Skip it for this priority to balance the pressure on
> +	 * all memcgs.
> +	 */
> +#ifdef CONFIG_MEMCG
> +	if (!mem_cgroup_disabled() && !cgroup_reclaim(sc) &&
> +	    sc->priority > atomic_read(&lruvec->evictable.priority))
> +		return false;
> +#endif
> +
> +	do {
> +		last = get_next_mm(lruvec, next_seq, swappiness, &mm);
> +		if (mm)
> +			walk_mm(lruvec, mm, swappiness);
> +
> +		cond_resched();
> +	} while (mm);

It appears that we need to scan the whole address space of multiple
processes in this loop?

If so, I have some concerns about the duration of the function.  Do you
have some numbers on the distribution of its duration?  And maybe the
number of mm_structs and the number of pages scanned.

In comparison, in the traditional LRU algorithm, for each round, only a
small subset of the whole physical memory is scanned.

Best Regards,
Huang, Ying

> +
> +	if (!last) {
> +		/* foreground aging prefers not to wait unless "necessary" */
> +		if (!current_is_kswapd() && sc->priority < DEF_PRIORITY - 2)
> +			wait_event_killable(mm_list->nodes[nid].wait,
> +				next_seq < READ_ONCE(lruvec->evictable.max_seq));
> +
> +		return next_seq < READ_ONCE(lruvec->evictable.max_seq);
> +	}
> +
> +	VM_BUG_ON(next_seq != READ_ONCE(lruvec->evictable.max_seq));
> +
> +	inc_max_seq(lruvec);
> +
> +#ifdef CONFIG_MEMCG
> +	if (!mem_cgroup_disabled())
> +		atomic_add_unless(&lruvec->evictable.priority, -1, 0);
> +#endif
> +
> +	/* order against inc_max_seq() */
> +	smp_mb();
> +	/* either we see any waiters or they will see updated max_seq */
> +	if (waitqueue_active(&mm_list->nodes[nid].wait))
> +		wake_up_all(&mm_list->nodes[nid].wait);
> +
> +	wakeup_flusher_threads(WB_REASON_VMSCAN);
> +
> +	return true;
> +}
> +

[snip]

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2021-04-13 23:00 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-13  7:57 [PATCH v1 00/14] Multigenerational LRU Yu Zhao
2021-03-13  7:57 ` [PATCH v1 01/14] include/linux/memcontrol.h: do not warn in page_memcg_rcu() if !CONFIG_MEMCG Yu Zhao
2021-03-13 15:09   ` Matthew Wilcox
2021-03-14  7:45     ` Yu Zhao
2021-03-13  7:57 ` [PATCH v1 02/14] include/linux/nodemask.h: define next_memory_node() if !CONFIG_NUMA Yu Zhao
2021-03-13  7:57 ` [PATCH v1 03/14] include/linux/huge_mm.h: define is_huge_zero_pmd() if !CONFIG_TRANSPARENT_HUGEPAGE Yu Zhao
2021-03-13  7:57 ` [PATCH v1 04/14] include/linux/cgroup.h: export cgroup_mutex Yu Zhao
2021-03-13  7:57 ` [PATCH v1 05/14] mm/swap.c: export activate_page() Yu Zhao
2021-03-13  7:57 ` [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries Yu Zhao
2021-03-14 22:12   ` Zi Yan
2021-03-14 22:51     ` Matthew Wilcox
2021-03-15  0:03       ` Yu Zhao
2021-03-15  0:27         ` Zi Yan
2021-03-15  1:04           ` Yu Zhao
2021-03-14 23:22   ` Dave Hansen
2021-03-15  3:16     ` Yu Zhao
2021-03-13  7:57 ` [PATCH v1 07/14] mm/pagewalk.c: add pud_entry_post() for post-order traversals Yu Zhao
2021-03-13  7:57 ` [PATCH v1 08/14] mm/vmscan.c: refactor shrink_node() Yu Zhao
2021-03-13  7:57 ` [PATCH v1 09/14] mm: multigenerational lru: mm_struct list Yu Zhao
2021-03-15 19:40   ` Rik van Riel
2021-03-16  2:07     ` Huang, Ying
2021-03-16  3:57       ` Yu Zhao
2021-03-16  6:44         ` Huang, Ying
2021-03-16  7:56           ` Yu Zhao
2021-03-17  3:37             ` Huang, Ying
2021-03-17 10:46               ` Yu Zhao
2021-03-22  3:13                 ` Huang, Ying
2021-03-22  8:08                   ` Yu Zhao
2021-03-24  6:58                     ` Huang, Ying
2021-04-10 18:48                       ` Yu Zhao
2021-04-13  3:06                         ` Huang, Ying
2021-03-13  7:57 ` [PATCH v1 10/14] mm: multigenerational lru: core Yu Zhao
2021-03-15  2:02   ` Andi Kleen
2021-03-15  3:37     ` Yu Zhao
2021-03-13  7:57 ` [PATCH v1 11/14] mm: multigenerational lru: page activation Yu Zhao
2021-03-16 16:34   ` Matthew Wilcox
2021-03-16 21:29     ` Yu Zhao
2021-03-13  7:57 ` [PATCH v1 12/14] mm: multigenerational lru: user space interface Yu Zhao
2021-03-13 12:23   ` kernel test robot
2021-03-13  7:57 ` [PATCH v1 13/14] mm: multigenerational lru: Kconfig Yu Zhao
2021-03-13 12:53   ` kernel test robot
2021-03-13 13:36   ` kernel test robot
2021-03-13  7:57 ` [PATCH v1 14/14] mm: multigenerational lru: documentation Yu Zhao
2021-03-19  9:31   ` Alex Shi
2021-03-22  6:09     ` Yu Zhao
2021-03-14 22:48 ` [PATCH v1 00/14] Multigenerational LRU Zi Yan
2021-03-15  0:52   ` Yu Zhao
2021-03-15  1:13 ` Hillf Danton
2021-03-15  6:49   ` Yu Zhao
2021-03-15 18:00 ` Dave Hansen
2021-03-16  2:24   ` Yu Zhao
2021-03-16 14:50     ` Dave Hansen
2021-03-16 20:30       ` Yu Zhao
2021-03-16 21:14         ` Dave Hansen
2021-04-10  9:21           ` Yu Zhao
2021-04-13  3:02             ` Huang, Ying
2021-04-13 23:00               ` Yu Zhao
2021-03-15 18:38 ` Yang Shi
2021-03-16  3:38   ` Yu Zhao
2021-03-16  2:08 [PATCH v1 10/14] mm: multigenerational lru: core Huang, Ying
2021-03-16  4:45 ` Yu Zhao
2021-03-16  6:52   ` Huang, Ying
2021-03-16  8:24     ` Yu Zhao
2021-03-16  8:53       ` Huang, Ying
2021-03-16 18:40         ` Yu Zhao
