Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 0/3] mm: fix page aging across multiple cgroups
@ 2019-11-07 20:53 Johannes Weiner
  2019-11-07 20:53 ` [PATCH 1/3] mm: vmscan: move file exhaustion detection to the node level Johannes Weiner
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Johannes Weiner @ 2019-11-07 20:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrey Ryabinin, Suren Baghdasaryan, Shakeel Butt, Rik van Riel,
	Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team

When applications are put into unconfigured cgroups for memory
accounting purposes, the cgrouping itself should not change the
behavior of the page reclaim code. We expect the VM to reclaim the
coldest pages in the system. But right now the VM can reclaim hot
pages in one cgroup while there is eligible cold cache in others.

This is because one part of the reclaim algorithm isn't truly cgroup
hierarchy aware: the inactive/active list balancing. That is the part
that is supposed to protect hot cache data from one-off streaming IO.

The recursive cgroup reclaim scheme will scan and rotate the physical
LRU lists of each eligible cgroup at the same rate in a round-robin
fashion, thereby establishing a relative order among the pages of all
those cgroups. However, the inactive/active balancing decisions are
made locally within each cgroup, so when a cgroup is running low on
cold pages, its hot pages will get reclaimed - even when sibling
cgroups have plenty of cold cache eligible in the same reclaim run.

For example:

   [root@ham ~]# head -n1 /proc/meminfo 
   MemTotal:        1016336 kB

   [root@ham ~]# ./reclaimtest2.sh 
   Establishing 50M active files in cgroup A...
   Hot pages cached: 12800/12800 workingset-a
   Linearly scanning through 18G of file data in cgroup B:
   real    0m4.269s
   user    0m0.051s
   sys     0m4.182s
   Hot pages cached: 134/12800 workingset-a

The streaming IO in B, which doesn't benefit from caching at all,
pushes out most of the workingset in A.

Solution

This series fixes the problem by elevating inactive/active balancing
decisions to the toplevel of the reclaim run. This is either a cgroup
that hit its limit, or straight-up global reclaim if there is physical
memory pressure. From there, it takes a recursive view of the cgroup
subtree to decide whether page deactivation is necessary.

In the test above, the VM will then recognize that cgroup B has plenty
of eligible cold cache, and that the hot pages in A can be spared:

   [root@ham ~]# ./reclaimtest2.sh 
   Establishing 50M active files in cgroup A...
   Hot pages cached: 12800/12800 workingset-a
   Linearly scanning through 18G of file data in cgroup B:
   real    0m4.244s
   user    0m0.064s
   sys     0m4.177s
   Hot pages cached: 12800/12800 workingset-a

Implementation

Whether active pages can be deactivated or not is influenced by two
factors: the inactive list dropping below a minimum size relative to
the active list, and the occurence of refaults.

This patch series first moves refault detection to the reclaim root,
then enforces the minimum inactive size based on a recursive view of
the cgroup tree's LRUs.

History

Note that this actually never worked correctly in Linux cgroups. In
the past it worked for global reclaim and leaf limit reclaim only (we
used to have two physical LRU linkages per page), but it never worked
for intermediate limit reclaim over multiple leaf cgroups.

We're noticing this now because 1) we're putting everything into
cgroups for accounting, not just the things we want to control and 2)
we're moving away from leaf limits that invoke reclaim on individual
cgroups, toward large tree reclaim, triggered by high-level limits, or
physical memory pressure that is influenced by local protections such
as memory.low and memory.min instead.

Requirements

These changes are based on v5.4-rc6-mmotm-2019-11-05-20-44.

 include/linux/memcontrol.h |   5 +
 include/linux/mmzone.h     |   4 +-
 include/linux/swap.h       |   2 +-
 mm/vmscan.c                | 269 +++++++++++++++++++++++++------------------
 mm/workingset.c            |  72 +++++++++---
 5 files changed, 223 insertions(+), 129 deletions(-)




^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/3] mm: vmscan: move file exhaustion detection to the node level
  2019-11-07 20:53 [PATCH 0/3] mm: fix page aging across multiple cgroups Johannes Weiner
@ 2019-11-07 20:53 ` Johannes Weiner
  2019-11-10 22:02   ` Suren Baghdasaryan
  2019-11-10 22:09   ` Khadarnimcaan Khadarnimcaan
  2019-11-07 20:53 ` [PATCH 2/3] mm: vmscan: detect file thrashing at the reclaim root Johannes Weiner
  2019-11-07 20:53 ` [PATCH 3/3] mm: vmscan: enforce inactive:active ratio " Johannes Weiner
  2 siblings, 2 replies; 20+ messages in thread
From: Johannes Weiner @ 2019-11-07 20:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrey Ryabinin, Suren Baghdasaryan, Shakeel Butt, Rik van Riel,
	Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team

When file pages are lower than the watermark on a node, we try to
force scan anonymous pages to counter-act the balancing algorithms
preference for new file pages when they are likely thrashing. This is
a node-level decision, but it's currently made each time we look at an
lruvec. This is unnecessarily expensive and also a layering violation
that makes the code harder to understand.

Clean this up by making the check once per node and setting a flag in
the scan_control.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
---
 mm/vmscan.c | 80 ++++++++++++++++++++++++++++-------------------------
 1 file changed, 42 insertions(+), 38 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d97985262dda..e8dd601e1fad 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -101,6 +101,9 @@ struct scan_control {
 	/* One of the zones is ready for compaction */
 	unsigned int compaction_ready:1;
 
+	/* The file pages on the current node are dangerously low */
+	unsigned int file_is_tiny:1;
+
 	/* Allocation order */
 	s8 order;
 
@@ -2289,45 +2292,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	}
 
 	/*
-	 * Prevent the reclaimer from falling into the cache trap: as
-	 * cache pages start out inactive, every cache fault will tip
-	 * the scan balance towards the file LRU.  And as the file LRU
-	 * shrinks, so does the window for rotation from references.
-	 * This means we have a runaway feedback loop where a tiny
-	 * thrashing file LRU becomes infinitely more attractive than
-	 * anon pages.  Try to detect this based on file LRU size.
+	 * If the system is almost out of file pages, force-scan anon.
+	 * But only if there are enough inactive anonymous pages on
+	 * the LRU. Otherwise, the small LRU gets thrashed.
 	 */
-	if (!cgroup_reclaim(sc)) {
-		unsigned long pgdatfile;
-		unsigned long pgdatfree;
-		int z;
-		unsigned long total_high_wmark = 0;
-
-		pgdatfree = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
-		pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +
-			   node_page_state(pgdat, NR_INACTIVE_FILE);
-
-		for (z = 0; z < MAX_NR_ZONES; z++) {
-			struct zone *zone = &pgdat->node_zones[z];
-			if (!managed_zone(zone))
-				continue;
-
-			total_high_wmark += high_wmark_pages(zone);
-		}
-
-		if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
-			/*
-			 * Force SCAN_ANON if there are enough inactive
-			 * anonymous pages on the LRU in eligible zones.
-			 * Otherwise, the small LRU gets thrashed.
-			 */
-			if (!inactive_list_is_low(lruvec, false, sc, false) &&
-			    lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, sc->reclaim_idx)
-					>> sc->priority) {
-				scan_balance = SCAN_ANON;
-				goto out;
-			}
-		}
+	if (sc->file_is_tiny &&
+	    !inactive_list_is_low(lruvec, false, sc, false) &&
+	    lruvec_lru_size(lruvec, LRU_INACTIVE_ANON,
+			    sc->reclaim_idx) >> sc->priority) {
+		scan_balance = SCAN_ANON;
+		goto out;
 	}
 
 	/*
@@ -2754,6 +2728,36 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	nr_reclaimed = sc->nr_reclaimed;
 	nr_scanned = sc->nr_scanned;
 
+	/*
+	 * Prevent the reclaimer from falling into the cache trap: as
+	 * cache pages start out inactive, every cache fault will tip
+	 * the scan balance towards the file LRU.  And as the file LRU
+	 * shrinks, so does the window for rotation from references.
+	 * This means we have a runaway feedback loop where a tiny
+	 * thrashing file LRU becomes infinitely more attractive than
+	 * anon pages.  Try to detect this based on file LRU size.
+	 */
+	if (!cgroup_reclaim(sc)) {
+		unsigned long file;
+		unsigned long free;
+		int z;
+		unsigned long total_high_wmark = 0;
+
+		free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
+		file = node_page_state(pgdat, NR_ACTIVE_FILE) +
+			   node_page_state(pgdat, NR_INACTIVE_FILE);
+
+		for (z = 0; z < MAX_NR_ZONES; z++) {
+			struct zone *zone = &pgdat->node_zones[z];
+			if (!managed_zone(zone))
+				continue;
+
+			total_high_wmark += high_wmark_pages(zone);
+		}
+
+		sc->file_is_tiny = file + free <= total_high_wmark;
+	}
+
 	shrink_node_memcgs(pgdat, sc);
 
 	if (reclaim_state) {
-- 
2.24.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 2/3] mm: vmscan: detect file thrashing at the reclaim root
  2019-11-07 20:53 [PATCH 0/3] mm: fix page aging across multiple cgroups Johannes Weiner
  2019-11-07 20:53 ` [PATCH 1/3] mm: vmscan: move file exhaustion detection to the node level Johannes Weiner
@ 2019-11-07 20:53 ` Johannes Weiner
  2019-11-11  2:01   ` Suren Baghdasaryan
  2019-11-14 23:47   ` Shakeel Butt
  2019-11-07 20:53 ` [PATCH 3/3] mm: vmscan: enforce inactive:active ratio " Johannes Weiner
  2 siblings, 2 replies; 20+ messages in thread
From: Johannes Weiner @ 2019-11-07 20:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrey Ryabinin, Suren Baghdasaryan, Shakeel Butt, Rik van Riel,
	Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team

We use refault information to determine whether the cache workingset
is stable or transitioning, and dynamically adjust the inactive:active
file LRU ratio so as to maximize protection from one-off cache during
stable periods, and minimize IO during transitions.

With cgroups and their nested LRU lists, we currently don't do this
correctly. While recursive cgroup reclaim establishes a relative LRU
order among the pages of all involved cgroups, refaults only affect
the local LRU order in the cgroup in which they are occuring. As a
result, cache transitions can take longer in a cgrouped system as the
active pages of sibling cgroups aren't challenged when they should be.

[ Right now, this is somewhat theoretical, because the siblings, under
  continued regular reclaim pressure, should eventually run out of
  inactive pages - and since inactive:active *size* balancing is also
  done on a cgroup-local level, we will challenge the active pages
  eventually in most cases. But the next patch will move that relative
  size enforcement to the reclaim root as well, and then this patch
  here will be necessary to propagate refault pressure to siblings. ]

This patch moves refault detection to the root of reclaim. Instead of
remembering the cgroup owner of an evicted page, remember the cgroup
that caused the reclaim to happen. When refaults later occur, they'll
correctly influence the cross-cgroup LRU order that reclaim follows.

I.e. if global reclaim kicked out pages in some subgroup A/B/C, the
refault of those pages will challenge the global LRU order, and not
just the local order down inside C.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h |  5 +++
 include/linux/swap.h       |  2 +-
 mm/vmscan.c                | 32 ++++++++---------
 mm/workingset.c            | 72 +++++++++++++++++++++++++++++---------
 4 files changed, 77 insertions(+), 34 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5b86287fa069..a7a0a1a5c8d5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -901,6 +901,11 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &pgdat->__lruvec;
 }
 
+static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
+{
+	return NULL;
+}
+
 static inline bool mm_match_cgroup(struct mm_struct *mm,
 		struct mem_cgroup *memcg)
 {
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 063c0c1e112b..1e99f7ac1d7e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,7 +307,7 @@ struct vma_swap_readahead {
 };
 
 /* linux/mm/workingset.c */
-void *workingset_eviction(struct page *page);
+void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg);
 void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e8dd601e1fad..527617ee9b73 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -853,7 +853,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping)
  * gets returned with a refcount of 0.
  */
 static int __remove_mapping(struct address_space *mapping, struct page *page,
-			    bool reclaimed)
+			    bool reclaimed, struct mem_cgroup *target_memcg)
 {
 	unsigned long flags;
 	int refcount;
@@ -925,7 +925,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		 */
 		if (reclaimed && page_is_file_cache(page) &&
 		    !mapping_exiting(mapping) && !dax_mapping(mapping))
-			shadow = workingset_eviction(page);
+			shadow = workingset_eviction(page, target_memcg);
 		__delete_from_page_cache(page, shadow);
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
 
@@ -948,7 +948,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {
-	if (__remove_mapping(mapping, page, false)) {
+	if (__remove_mapping(mapping, page, false, NULL)) {
 		/*
 		 * Unfreezing the refcount with 1 rather than 2 effectively
 		 * drops the pagecache ref for us without requiring another
@@ -1426,7 +1426,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 
 			count_vm_event(PGLAZYFREED);
 			count_memcg_page_event(page, PGLAZYFREED);
-		} else if (!mapping || !__remove_mapping(mapping, page, true))
+		} else if (!mapping || !__remove_mapping(mapping, page, true,
+							 sc->target_mem_cgroup))
 			goto keep_locked;
 
 		unlock_page(page);
@@ -2189,6 +2190,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 	enum lru_list inactive_lru = file * LRU_FILE;
 	unsigned long inactive, active;
 	unsigned long inactive_ratio;
+	struct lruvec *target_lruvec;
 	unsigned long refaults;
 	unsigned long gb;
 
@@ -2200,8 +2202,9 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 	 * is being established. Disable active list protection to get
 	 * rid of the stale workingset quickly.
 	 */
-	refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
-	if (file && lruvec->refaults != refaults) {
+	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
+	if (file && target_lruvec->refaults != refaults) {
 		inactive_ratio = 0;
 	} else {
 		gb = (inactive + active) >> (30 - PAGE_SHIFT);
@@ -2973,19 +2976,14 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	sc->gfp_mask = orig_mask;
 }
 
-static void snapshot_refaults(struct mem_cgroup *root_memcg, pg_data_t *pgdat)
+static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
 {
-	struct mem_cgroup *memcg;
-
-	memcg = mem_cgroup_iter(root_memcg, NULL, NULL);
-	do {
-		unsigned long refaults;
-		struct lruvec *lruvec;
+	struct lruvec *target_lruvec;
+	unsigned long refaults;
 
-		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-		refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
-		lruvec->refaults = refaults;
-	} while ((memcg = mem_cgroup_iter(root_memcg, memcg, NULL)));
+	target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
+	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
+	target_lruvec->refaults = refaults;
 }
 
 /*
diff --git a/mm/workingset.c b/mm/workingset.c
index e8212123c1c3..f0885d9f41cd 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -213,28 +213,53 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*workingsetp = workingset;
 }
 
+static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
+{
+	/*
+	 * Reclaiming a cgroup means reclaiming all its children in a
+	 * round-robin fashion. That means that each cgroup has an LRU
+	 * order that is composed of the LRU orders of its child
+	 * cgroups; and every page has an LRU position not just in the
+	 * cgroup that owns it, but in all of that group's ancestors.
+	 *
+	 * So when the physical inactive list of a leaf cgroup ages,
+	 * the virtual inactive lists of all its parents, including
+	 * the root cgroup's, age as well.
+	 */
+	do {
+		struct lruvec *lruvec;
+
+		lruvec = mem_cgroup_lruvec(memcg, pgdat);
+		atomic_long_inc(&lruvec->inactive_age);
+	} while (memcg && (memcg = parent_mem_cgroup(memcg)));
+}
+
 /**
  * workingset_eviction - note the eviction of a page from memory
+ * @target_memcg: the cgroup that is causing the reclaim
  * @page: the page being evicted
  *
  * Returns a shadow entry to be stored in @page->mapping->i_pages in place
  * of the evicted @page so that a later refault can be detected.
  */
-void *workingset_eviction(struct page *page)
+void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 {
 	struct pglist_data *pgdat = page_pgdat(page);
-	struct mem_cgroup *memcg = page_memcg(page);
-	int memcgid = mem_cgroup_id(memcg);
 	unsigned long eviction;
 	struct lruvec *lruvec;
+	int memcgid;
 
 	/* Page is fully exclusive and pins page->mem_cgroup */
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	eviction = atomic_long_inc_return(&lruvec->inactive_age);
+	advance_inactive_age(page_memcg(page), pgdat);
+
+	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
+	/* XXX: target_memcg can be NULL, go through lruvec */
+	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
+	eviction = atomic_long_read(&lruvec->inactive_age);
 	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
 
@@ -244,10 +269,13 @@ void *workingset_eviction(struct page *page)
  * @shadow: shadow entry of the evicted page
  *
  * Calculates and evaluates the refault distance of the previously
- * evicted page in the context of the node it was allocated in.
+ * evicted page in the context of the node and the memcg whose memory
+ * pressure caused the eviction.
  */
 void workingset_refault(struct page *page, void *shadow)
 {
+	struct mem_cgroup *eviction_memcg;
+	struct lruvec *eviction_lruvec;
 	unsigned long refault_distance;
 	struct pglist_data *pgdat;
 	unsigned long active_file;
@@ -277,12 +305,12 @@ void workingset_refault(struct page *page, void *shadow)
 	 * would be better if the root_mem_cgroup existed in all
 	 * configurations instead.
 	 */
-	memcg = mem_cgroup_from_id(memcgid);
-	if (!mem_cgroup_disabled() && !memcg)
+	eviction_memcg = mem_cgroup_from_id(memcgid);
+	if (!mem_cgroup_disabled() && !eviction_memcg)
 		goto out;
-	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	refault = atomic_long_read(&lruvec->inactive_age);
-	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
+	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
+	refault = atomic_long_read(&eviction_lruvec->inactive_age);
+	active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
 
 	/*
 	 * Calculate the refault distance
@@ -302,6 +330,17 @@ void workingset_refault(struct page *page, void *shadow)
 	 */
 	refault_distance = (refault - eviction) & EVICTION_MASK;
 
+	/*
+	 * The activation decision for this page is made at the level
+	 * where the eviction occurred, as that is where the LRU order
+	 * during page reclaim is being determined.
+	 *
+	 * However, the cgroup that will own the page is the one that
+	 * is actually experiencing the refault event.
+	 */
+	memcg = get_mem_cgroup_from_mm(current->mm);
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
 	inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
 
 	/*
@@ -310,10 +349,10 @@ void workingset_refault(struct page *page, void *shadow)
 	 * the memory was available to the page cache.
 	 */
 	if (refault_distance > active_file)
-		goto out;
+		goto out_memcg;
 
 	SetPageActive(page);
-	atomic_long_inc(&lruvec->inactive_age);
+	advance_inactive_age(memcg, pgdat);
 	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
 
 	/* Page was active prior to eviction */
@@ -321,6 +360,9 @@ void workingset_refault(struct page *page, void *shadow)
 		SetPageWorkingset(page);
 		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
 	}
+
+out_memcg:
+	mem_cgroup_put(memcg);
 out:
 	rcu_read_unlock();
 }
@@ -332,7 +374,6 @@ void workingset_refault(struct page *page, void *shadow)
 void workingset_activation(struct page *page)
 {
 	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
 
 	rcu_read_lock();
 	/*
@@ -345,8 +386,7 @@ void workingset_activation(struct page *page)
 	memcg = page_memcg_rcu(page);
 	if (!mem_cgroup_disabled() && !memcg)
 		goto out;
-	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
-	atomic_long_inc(&lruvec->inactive_age);
+	advance_inactive_age(memcg, page_pgdat(page));
 out:
 	rcu_read_unlock();
 }
-- 
2.24.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 3/3] mm: vmscan: enforce inactive:active ratio at the reclaim root
  2019-11-07 20:53 [PATCH 0/3] mm: fix page aging across multiple cgroups Johannes Weiner
  2019-11-07 20:53 ` [PATCH 1/3] mm: vmscan: move file exhaustion detection to the node level Johannes Weiner
  2019-11-07 20:53 ` [PATCH 2/3] mm: vmscan: detect file thrashing at the reclaim root Johannes Weiner
@ 2019-11-07 20:53 ` " Johannes Weiner
  2019-11-11  2:15   ` Suren Baghdasaryan
  2019-11-15  0:29   ` Shakeel Butt
  2 siblings, 2 replies; 20+ messages in thread
From: Johannes Weiner @ 2019-11-07 20:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrey Ryabinin, Suren Baghdasaryan, Shakeel Butt, Rik van Riel,
	Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team

We split the LRU lists into inactive and an active parts to maximize
workingset protection while allowing just enough inactive cache space
to faciltate readahead and writeback for one-off file accesses (e.g. a
linear scan through a file, or logging); or just enough inactive anon
to maintain recent reference information when reclaim needs to swap.

With cgroups and their nested LRU lists, we currently don't do this
correctly. While recursive cgroup reclaim establishes a relative LRU
order among the pages of all involved cgroups, inactive:active size
decisions are done on a per-cgroup level. As a result, we'll reclaim a
cgroup's workingset when it doesn't have cold pages, even when one of
its siblings has plenty of it that should be reclaimed first.

For example: workload A has 50M worth of hot cache but doesn't do any
one-off file accesses; meanwhile, parallel workload B scans files and
rarely accesses the same page twice.

If these workloads were to run in an uncgrouped system, A would be
protected from the high rate of cache faults from B. But if they were
put in parallel cgroups for memory accounting purposes, B's fast cache
fault rate would push out the hot cache pages of A. This is unexpected
and undesirable - the "scan resistance" of the page cache is broken.

This patch moves inactive:active size balancing decisions to the root
of reclaim - the same level where the LRU order is established.

It does this by looking at the recursize size of the inactive and the
active file sets of the cgroup subtree at the beginning of the reclaim
cycle, and then making a decision - scan or skip active pages - that
applies throughout the entire run and to every cgroup involved.

With that in place, in the test above, the VM will recognize that
there are plenty of inactive pages in the combined cache set of
workloads A and B and prefer the one-off cache in B over the hot pages
in A. The scan resistance of the cache is restored.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |   4 +-
 mm/vmscan.c            | 185 ++++++++++++++++++++++++++---------------
 2 files changed, 118 insertions(+), 71 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7a09087e8c77..454a230ad417 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -229,12 +229,12 @@ enum lru_list {
 
 #define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
 
-static inline int is_file_lru(enum lru_list lru)
+static inline bool is_file_lru(enum lru_list lru)
 {
 	return (lru == LRU_INACTIVE_FILE || lru == LRU_ACTIVE_FILE);
 }
 
-static inline int is_active_lru(enum lru_list lru)
+static inline bool is_active_lru(enum lru_list lru)
 {
 	return (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 527617ee9b73..df859b1d583c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -79,6 +79,13 @@ struct scan_control {
 	 */
 	struct mem_cgroup *target_mem_cgroup;
 
+	/* Can active pages be deactivated as part of reclaim? */
+#define DEACTIVATE_ANON 1
+#define DEACTIVATE_FILE 2
+	unsigned int may_deactivate:2;
+	unsigned int force_deactivate:1;
+	unsigned int skipped_deactivate:1;
+
 	/* Writepage batching in laptop mode; RECLAIM_WRITE */
 	unsigned int may_writepage:1;
 
@@ -101,6 +108,9 @@ struct scan_control {
 	/* One of the zones is ready for compaction */
 	unsigned int compaction_ready:1;
 
+	/* There is easily reclaimable cold cache in the current node */
+	unsigned int cache_trim_mode:1;
+
 	/* The file pages on the current node are dangerously low */
 	unsigned int file_is_tiny:1;
 
@@ -2154,6 +2164,20 @@ unsigned long reclaim_pages(struct list_head *page_list)
 	return nr_reclaimed;
 }
 
+static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
+				 struct lruvec *lruvec, struct scan_control *sc)
+{
+	if (is_active_lru(lru)) {
+		if (sc->may_deactivate & (1 << is_file_lru(lru)))
+			shrink_active_list(nr_to_scan, lruvec, sc, lru);
+		else
+			sc->skipped_deactivate = 1;
+		return 0;
+	}
+
+	return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);
+}
+
 /*
  * The inactive anon list should be small enough that the VM never has
  * to do too much work.
@@ -2182,59 +2206,25 @@ unsigned long reclaim_pages(struct list_head *page_list)
  *    1TB     101        10GB
  *   10TB     320        32GB
  */
-static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
-				 struct scan_control *sc, bool trace)
+static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
 {
-	enum lru_list active_lru = file * LRU_FILE + LRU_ACTIVE;
-	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
-	enum lru_list inactive_lru = file * LRU_FILE;
+	enum lru_list active_lru = inactive_lru + LRU_ACTIVE;
 	unsigned long inactive, active;
 	unsigned long inactive_ratio;
-	struct lruvec *target_lruvec;
-	unsigned long refaults;
 	unsigned long gb;
 
-	inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
-	active = lruvec_lru_size(lruvec, active_lru, sc->reclaim_idx);
+	inactive = lruvec_page_state(lruvec, inactive_lru);
+	active = lruvec_page_state(lruvec, active_lru);
 
-	/*
-	 * When refaults are being observed, it means a new workingset
-	 * is being established. Disable active list protection to get
-	 * rid of the stale workingset quickly.
-	 */
-	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
-	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
-	if (file && target_lruvec->refaults != refaults) {
-		inactive_ratio = 0;
-	} else {
-		gb = (inactive + active) >> (30 - PAGE_SHIFT);
-		if (gb)
-			inactive_ratio = int_sqrt(10 * gb);
-		else
-			inactive_ratio = 1;
-	}
-
-	if (trace)
-		trace_mm_vmscan_inactive_list_is_low(pgdat->node_id, sc->reclaim_idx,
-			lruvec_lru_size(lruvec, inactive_lru, MAX_NR_ZONES), inactive,
-			lruvec_lru_size(lruvec, active_lru, MAX_NR_ZONES), active,
-			inactive_ratio, file);
+	gb = (inactive + active) >> (30 - PAGE_SHIFT);
+	if (gb)
+		inactive_ratio = int_sqrt(10 * gb);
+	else
+		inactive_ratio = 1;
 
 	return inactive * inactive_ratio < active;
 }
 
-static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
-				 struct lruvec *lruvec, struct scan_control *sc)
-{
-	if (is_active_lru(lru)) {
-		if (inactive_list_is_low(lruvec, is_file_lru(lru), sc, true))
-			shrink_active_list(nr_to_scan, lruvec, sc, lru);
-		return 0;
-	}
-
-	return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);
-}
-
 enum scan_balance {
 	SCAN_EQUAL,
 	SCAN_FRACT,
@@ -2296,28 +2286,17 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 
 	/*
 	 * If the system is almost out of file pages, force-scan anon.
-	 * But only if there are enough inactive anonymous pages on
-	 * the LRU. Otherwise, the small LRU gets thrashed.
 	 */
-	if (sc->file_is_tiny &&
-	    !inactive_list_is_low(lruvec, false, sc, false) &&
-	    lruvec_lru_size(lruvec, LRU_INACTIVE_ANON,
-			    sc->reclaim_idx) >> sc->priority) {
+	if (sc->file_is_tiny) {
 		scan_balance = SCAN_ANON;
 		goto out;
 	}
 
 	/*
-	 * If there is enough inactive page cache, i.e. if the size of the
-	 * inactive list is greater than that of the active list *and* the
-	 * inactive list actually has some pages to scan on this priority, we
-	 * do not reclaim anything from the anonymous working set right now.
-	 * Without the second condition we could end up never scanning an
-	 * lruvec even if it has plenty of old anonymous pages unless the
-	 * system is under heavy pressure.
+	 * If there is enough inactive page cache, we do not reclaim
+	 * anything from the anonymous working right now.
 	 */
-	if (!inactive_list_is_low(lruvec, true, sc, false) &&
-	    lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, sc->reclaim_idx) >> sc->priority) {
+	if (sc->cache_trim_mode) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
@@ -2582,7 +2561,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	 * Even if we did not try to evict anon pages at all, we want to
 	 * rebalance the anon lru active/inactive ratio.
 	 */
-	if (total_swap_pages && inactive_list_is_low(lruvec, false, sc, true))
+	if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON))
 		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
 				   sc, LRU_ACTIVE_ANON);
 }
@@ -2722,6 +2701,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	unsigned long nr_reclaimed, nr_scanned;
 	struct lruvec *target_lruvec;
 	bool reclaimable = false;
+	unsigned long file;
 
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
 
@@ -2731,6 +2711,44 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	nr_reclaimed = sc->nr_reclaimed;
 	nr_scanned = sc->nr_scanned;
 
+	/*
+	 * Target desirable inactive:active list ratios for the anon
+	 * and file LRU lists.
+	 */
+	if (!sc->force_deactivate) {
+		unsigned long refaults;
+
+		if (inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
+			sc->may_deactivate |= DEACTIVATE_ANON;
+		else
+			sc->may_deactivate &= ~DEACTIVATE_ANON;
+
+		/*
+		 * When refaults are being observed, it means a new
+		 * workingset is being established. Deactivate to get
+		 * rid of any stale active pages quickly.
+		 */
+		refaults = lruvec_page_state(target_lruvec,
+					     WORKINGSET_ACTIVATE);
+		if (refaults != target_lruvec->refaults ||
+		    inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
+			sc->may_deactivate |= DEACTIVATE_FILE;
+		else
+			sc->may_deactivate &= ~DEACTIVATE_FILE;
+	} else
+		sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
+
+	/*
+	 * If we have plenty of inactive file pages that aren't
+	 * thrashing, try to reclaim those first before touching
+	 * anonymous pages.
+	 */
+	file = lruvec_page_state(target_lruvec, LRU_INACTIVE_FILE);
+	if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
+		sc->cache_trim_mode = 1;
+	else
+		sc->cache_trim_mode = 0;
+
 	/*
 	 * Prevent the reclaimer from falling into the cache trap: as
 	 * cache pages start out inactive, every cache fault will tip
@@ -2741,10 +2759,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 * anon pages.  Try to detect this based on file LRU size.
 	 */
 	if (!cgroup_reclaim(sc)) {
-		unsigned long file;
-		unsigned long free;
-		int z;
 		unsigned long total_high_wmark = 0;
+		unsigned long free, anon;
+		int z;
 
 		free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
 		file = node_page_state(pgdat, NR_ACTIVE_FILE) +
@@ -2758,7 +2775,17 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			total_high_wmark += high_wmark_pages(zone);
 		}
 
-		sc->file_is_tiny = file + free <= total_high_wmark;
+		/*
+		 * Consider anon: if that's low too, this isn't a
+		 * runaway file reclaim problem, but rather just
+		 * extreme pressure. Reclaim as per usual then.
+		 */
+		anon = node_page_state(pgdat, NR_INACTIVE_ANON);
+
+		sc->file_is_tiny =
+			file + free <= total_high_wmark &&
+			!(sc->may_deactivate & DEACTIVATE_ANON) &&
+			anon >> sc->priority;
 	}
 
 	shrink_node_memcgs(pgdat, sc);
@@ -3062,9 +3089,27 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	if (sc->compaction_ready)
 		return 1;
 
+	/*
+	 * We make inactive:active ratio decisions based on the node's
+	 * composition of memory, but a restrictive reclaim_idx or a
+	 * memory.low cgroup setting can exempt large amounts of
+	 * memory from reclaim. Neither of which are very common, so
+	 * instead of doing costly eligibility calculations of the
+	 * entire cgroup subtree up front, we assume the estimates are
+	 * good, and retry with forcible deactivation if that fails.
+	 */
+	if (sc->skipped_deactivate) {
+		sc->priority = initial_priority;
+		sc->force_deactivate = 1;
+		sc->skipped_deactivate = 0;
+		goto retry;
+	}
+
 	/* Untapped cgroup reserves?  Don't OOM, retry. */
 	if (sc->memcg_low_skipped) {
 		sc->priority = initial_priority;
+		sc->force_deactivate = 0;
+		sc->skipped_deactivate = 0;
 		sc->memcg_low_reclaim = 1;
 		sc->memcg_low_skipped = 0;
 		goto retry;
@@ -3347,18 +3392,20 @@ static void age_active_anon(struct pglist_data *pgdat,
 				struct scan_control *sc)
 {
 	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
 
 	if (!total_swap_pages)
 		return;
 
+	lruvec = mem_cgroup_lruvec(NULL, pgdat);
+	if (!inactive_is_low(lruvec, LRU_INACTIVE_ANON))
+		return;
+
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
-		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
-
-		if (inactive_list_is_low(lruvec, false, sc, true))
-			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
-					   sc, LRU_ACTIVE_ANON);
-
+		lruvec = mem_cgroup_lruvec(memcg, pgdat);
+		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
+				   sc, LRU_ACTIVE_ANON);
 		memcg = mem_cgroup_iter(NULL, memcg, NULL);
 	} while (memcg);
 }
-- 
2.24.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] mm: vmscan: move file exhaustion detection to the node level
  2019-11-07 20:53 ` [PATCH 1/3] mm: vmscan: move file exhaustion detection to the node level Johannes Weiner
@ 2019-11-10 22:02   ` Suren Baghdasaryan
  2019-11-10 22:09   ` Khadarnimcaan Khadarnimcaan
  1 sibling, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2019-11-10 22:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andrey Ryabinin, Shakeel Butt, Rik van Riel,
	Michal Hocko, linux-mm, cgroups mailinglist, LKML, kernel-team

On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> When file pages are lower than the watermark on a node, we try to
> force scan anonymous pages to counter-act the balancing algorithms
> preference for new file pages when they are likely thrashing. This is
> a node-level decision, but it's currently made each time we look at an
> lruvec. This is unnecessarily expensive and also a layering violation
> that makes the code harder to understand.
>
> Clean this up by making the check once per node and setting a flag in
> the scan_control.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Shakeel Butt <shakeelb@google.com>
> ---
>  mm/vmscan.c | 80 ++++++++++++++++++++++++++++-------------------------
>  1 file changed, 42 insertions(+), 38 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d97985262dda..e8dd601e1fad 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -101,6 +101,9 @@ struct scan_control {
>         /* One of the zones is ready for compaction */
>         unsigned int compaction_ready:1;
>
> +       /* The file pages on the current node are dangerously low */
> +       unsigned int file_is_tiny:1;
> +
>         /* Allocation order */
>         s8 order;
>
> @@ -2289,45 +2292,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>         }
>
>         /*
> -        * Prevent the reclaimer from falling into the cache trap: as
> -        * cache pages start out inactive, every cache fault will tip
> -        * the scan balance towards the file LRU.  And as the file LRU
> -        * shrinks, so does the window for rotation from references.
> -        * This means we have a runaway feedback loop where a tiny
> -        * thrashing file LRU becomes infinitely more attractive than
> -        * anon pages.  Try to detect this based on file LRU size.
> +        * If the system is almost out of file pages, force-scan anon.
> +        * But only if there are enough inactive anonymous pages on
> +        * the LRU. Otherwise, the small LRU gets thrashed.
>          */
> -       if (!cgroup_reclaim(sc)) {
> -               unsigned long pgdatfile;
> -               unsigned long pgdatfree;
> -               int z;
> -               unsigned long total_high_wmark = 0;
> -
> -               pgdatfree = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
> -               pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +
> -                          node_page_state(pgdat, NR_INACTIVE_FILE);
> -
> -               for (z = 0; z < MAX_NR_ZONES; z++) {
> -                       struct zone *zone = &pgdat->node_zones[z];
> -                       if (!managed_zone(zone))
> -                               continue;
> -
> -                       total_high_wmark += high_wmark_pages(zone);
> -               }
> -
> -               if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
> -                       /*
> -                        * Force SCAN_ANON if there are enough inactive
> -                        * anonymous pages on the LRU in eligible zones.
> -                        * Otherwise, the small LRU gets thrashed.
> -                        */
> -                       if (!inactive_list_is_low(lruvec, false, sc, false) &&
> -                           lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, sc->reclaim_idx)
> -                                       >> sc->priority) {
> -                               scan_balance = SCAN_ANON;
> -                               goto out;
> -                       }
> -               }
> +       if (sc->file_is_tiny &&
> +           !inactive_list_is_low(lruvec, false, sc, false) &&
> +           lruvec_lru_size(lruvec, LRU_INACTIVE_ANON,
> +                           sc->reclaim_idx) >> sc->priority) {
> +               scan_balance = SCAN_ANON;
> +               goto out;
>         }
>
>         /*
> @@ -2754,6 +2728,36 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>         nr_reclaimed = sc->nr_reclaimed;
>         nr_scanned = sc->nr_scanned;
>
> +       /*
> +        * Prevent the reclaimer from falling into the cache trap: as
> +        * cache pages start out inactive, every cache fault will tip
> +        * the scan balance towards the file LRU.  And as the file LRU
> +        * shrinks, so does the window for rotation from references.
> +        * This means we have a runaway feedback loop where a tiny
> +        * thrashing file LRU becomes infinitely more attractive than
> +        * anon pages.  Try to detect this based on file LRU size.
> +        */
> +       if (!cgroup_reclaim(sc)) {
> +               unsigned long file;
> +               unsigned long free;
> +               int z;
> +               unsigned long total_high_wmark = 0;
> +
> +               free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
> +               file = node_page_state(pgdat, NR_ACTIVE_FILE) +
> +                          node_page_state(pgdat, NR_INACTIVE_FILE);
> +
> +               for (z = 0; z < MAX_NR_ZONES; z++) {
> +                       struct zone *zone = &pgdat->node_zones[z];
> +                       if (!managed_zone(zone))
> +                               continue;
> +
> +                       total_high_wmark += high_wmark_pages(zone);
> +               }
> +
> +               sc->file_is_tiny = file + free <= total_high_wmark;
> +       }
> +
>         shrink_node_memcgs(pgdat, sc);
>
>         if (reclaim_state) {
> --
> 2.24.0
>

Hi Johannes,
Thanks for working on this! On Android reclaim regression caused by
memcgs is a known issue and I'll try to test your patcheset next week.

Reviewed-by: Suren Baghdasaryan <surenb@google.com>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] mm: vmscan: move file exhaustion detection to the node level
  2019-11-07 20:53 ` [PATCH 1/3] mm: vmscan: move file exhaustion detection to the node level Johannes Weiner
  2019-11-10 22:02   ` Suren Baghdasaryan
@ 2019-11-10 22:09   ` Khadarnimcaan Khadarnimcaan
  1 sibling, 0 replies; 20+ messages in thread
From: Khadarnimcaan Khadarnimcaan @ 2019-11-10 22:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: cgroups, linux-mm, Michal Hocko, Andrey Ryabinin, Rik van Riel,
	kernel-team, linux-kernel, Andrew Morton, Suren Baghdasaryan,
	Shakeel Butt

[-- Attachment #1: Type: text/plain, Size: 5530 bytes --]

On Nov 7, 2019 11:54 PM, "Johannes Weiner" <hannes@cmpxchg.org> wrote:

> When file pages are lower than the watermark on a node, we try to
> force scan anonymous pages to counter-act the balancing algorithms
> preference for new file pages when they are likely thrashing. This is
> a node-level decision, but it's currently made each time we look at an
> lruvec. This is unnecessarily expensive and also a layering violation
> that makes the code harder to understand.
>
> Clean this up by making the check once per node and setting a flag in
> the scan_control.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Shakeel Butt <shakeelb@google.com>
> ---
>  mm/vmscan.c | 80 ++++++++++++++++++++++++++++-------------------------
>  1 file changed, 42 insertions(+), 38 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d97985262dda..e8dd601e1fad 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -101,6 +101,9 @@ struct scan_control {
>         /* One of the zones is ready for compaction */
>         unsigned int compaction_ready:1;
>
> +       /* The file pages on the current node are dangerously low */
> +       unsigned int file_is_tiny:1;
> +
>         /* Allocation order */
>         s8 order;
>
> @@ -2289,45 +2292,16 @@ static void get_scan_count(struct lruvec *lruvec,
> struct scan_control *sc,
>         }
>
>         /*
> -        * Prevent the reclaimer from falling into the cache trap: as
> -        * cache pages start out inactive, every cache fault will tip
> -        * the scan balance towards the file LRU.  And as the file LRU
> -        * shrinks, so does the window for rotation from references.
> -        * This means we have a runaway feedback loop where a tiny
> -        * thrashing file LRU becomes infinitely more attractive than
> -        * anon pages.  Try to detect this based on file LRU size.
> +        * If the system is almost out of file pages, force-scan anon.
> +        * But only if there are enough inactive anonymous pages on
> +        * the LRU. Otherwise, the small LRU gets thrashed.
>          */
> -       if (!cgroup_reclaim(sc)) {
> -               unsigned long pgdatfile;
> -               unsigned long pgdatfree;
> -               int z;
> -               unsigned long total_high_wmark = 0;
> -
> -               pgdatfree = sum_zone_node_page_state(pgdat->node_id,
> NR_FREE_PAGES);
> -               pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +
> -                          node_page_state(pgdat, NR_INACTIVE_FILE);
> -
> -               for (z = 0; z < MAX_NR_ZONES; z++) {
> -                       struct zone *zone = &pgdat->node_zones[z];
> -                       if (!managed_zone(zone))
> -                               continue;
> -
> -                       total_high_wmark += high_wmark_pages(zone);
> -               }
> -
> -               if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
> -                       /*
> -                        * Force SCAN_ANON if there are enough inactive
> -                        * anonymous pages on the LRU in eligible zones.
> -                        * Otherwise, the small LRU gets thrashed.
> -                        */
> -                       if (!inactive_list_is_low(lruvec, false, sc,
> false) &&
> -                           lruvec_lru_size(lruvec, LRU_INACTIVE_ANON,
> sc->reclaim_idx)
> -                                       >> sc->priority) {
> -                               scan_balance = SCAN_ANON;
> -                               goto out;
> -                       }
> -               }
> +       if (sc->file_is_tiny &&
> +           !inactive_list_is_low(lruvec, false, sc, false) &&
> +           lruvec_lru_size(lruvec, LRU_INACTIVE_ANON,
> +                           sc->reclaim_idx) >> sc->priority) {
> +               scan_balance = SCAN_ANON;
> +               goto out;
>         }
>
>         /*
> @@ -2754,6 +2728,36 @@ static bool shrink_node(pg_data_t *pgdat, struct
> scan_control *sc)
>         nr_reclaimed = sc->nr_reclaimed;
>         nr_scanned = sc->nr_scanned;
>
> +       /*
> +        * Prevent the reclaimer from falling into the cache trap: as
> +        * cache pages start out inactive, every cache fault will tip
> +        * the scan balance towards the file LRU.  And as the file LRU
> +        * shrinks, so does the window for rotation from references.
> +        * This means we have a runaway feedback loop where a tiny
> +        * thrashing file LRU becomes infinitely more attractive than
> +        * anon pages.  Try to detect this based on file LRU size.
> +        */
> +       if (!cgroup_reclaim(sc)) {
> +               unsigned long file;
> +               unsigned long free;
> +               int z;
> +               unsigned long total_high_wmark = 0;
> +
> +               free = sum_zone_node_page_state(pgdat->node_id,
> NR_FREE_PAGES);
> +               file = node_page_state(pgdat, NR_ACTIVE_FILE) +
> +                          node_page_state(pgdat, NR_INACTIVE_FILE);
> +
> +               for (z = 0; z < MAX_NR_ZONES; z++) {
> +                       struct zone *zone = &pgdat->node_zones[z];
> +                       if (!managed_zone(zone))
> +                               continue;
> +
> +                       total_high_wmark += high_wmark_pages(zone);
> +               }
> +
> +               sc->file_is_tiny = file + free <= total_high_wmark;
> +       }
> +
>         shrink_node_memcgs(pgdat, sc);
>
>         if (reclaim_state) {
> --
> 2.24.0
>
>

[-- Attachment #2: Type: text/html, Size: 6964 bytes --]

<div class="gmail_quote">On Nov 7, 2019 11:54 PM, &quot;Johannes Weiner&quot; &lt;<a href="mailto:hannes@cmpxchg.org">hannes@cmpxchg.org</a>&gt; wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">When file pages are lower than the watermark on a node, we try to<br>
force scan anonymous pages to counter-act the balancing algorithms<br>
preference for new file pages when they are likely thrashing. This is<br>
a node-level decision, but it&#39;s currently made each time we look at an<br>
lruvec. This is unnecessarily expensive and also a layering violation<br>
that makes the code harder to understand.<br>
<br>
Clean this up by making the check once per node and setting a flag in<br>
the scan_control.<br>
<br>
Signed-off-by: Johannes Weiner &lt;<a href="mailto:hannes@cmpxchg.org">hannes@cmpxchg.org</a>&gt;<br>
Reviewed-by: Shakeel Butt &lt;<a href="mailto:shakeelb@google.com">shakeelb@google.com</a>&gt;<br>
---<br>
 mm/vmscan.c | 80 ++++++++++++++++++++++++++++--<wbr>-----------------------<br>
 1 file changed, 42 insertions(+), 38 deletions(-)<br>
<br>
diff --git a/mm/vmscan.c b/mm/vmscan.c<br>
index d97985262dda..e8dd601e1fad 100644<br>
--- a/mm/vmscan.c<br>
+++ b/mm/vmscan.c<br>
@@ -101,6 +101,9 @@ struct scan_control {<br>
        /* One of the zones is ready for compaction */<br>
        unsigned int compaction_ready:1;<br>
<br>
+       /* The file pages on the current node are dangerously low */<br>
+       unsigned int file_is_tiny:1;<br>
+<br>
        /* Allocation order */<br>
        s8 order;<br>
<br>
@@ -2289,45 +2292,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,<br>
        }<br>
<br>
        /*<br>
-        * Prevent the reclaimer from falling into the cache trap: as<br>
-        * cache pages start out inactive, every cache fault will tip<br>
-        * the scan balance towards the file LRU.  And as the file LRU<br>
-        * shrinks, so does the window for rotation from references.<br>
-        * This means we have a runaway feedback loop where a tiny<br>
-        * thrashing file LRU becomes infinitely more attractive than<br>
-        * anon pages.  Try to detect this based on file LRU size.<br>
+        * If the system is almost out of file pages, force-scan anon.<br>
+        * But only if there are enough inactive anonymous pages on<br>
+        * the LRU. Otherwise, the small LRU gets thrashed.<br>
         */<br>
-       if (!cgroup_reclaim(sc)) {<br>
-               unsigned long pgdatfile;<br>
-               unsigned long pgdatfree;<br>
-               int z;<br>
-               unsigned long total_high_wmark = 0;<br>
-<br>
-               pgdatfree = sum_zone_node_page_state(<wbr>pgdat-&gt;node_id, NR_FREE_PAGES);<br>
-               pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +<br>
-                          node_page_state(pgdat, NR_INACTIVE_FILE);<br>
-<br>
-               for (z = 0; z &lt; MAX_NR_ZONES; z++) {<br>
-                       struct zone *zone = &amp;pgdat-&gt;node_zones[z];<br>
-                       if (!managed_zone(zone))<br>
-                               continue;<br>
-<br>
-                       total_high_wmark += high_wmark_pages(zone);<br>
-               }<br>
-<br>
-               if (unlikely(pgdatfile + pgdatfree &lt;= total_high_wmark)) {<br>
-                       /*<br>
-                        * Force SCAN_ANON if there are enough inactive<br>
-                        * anonymous pages on the LRU in eligible zones.<br>
-                        * Otherwise, the small LRU gets thrashed.<br>
-                        */<br>
-                       if (!inactive_list_is_low(lruvec, false, sc, false) &amp;&amp;<br>
-                           lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, sc-&gt;reclaim_idx)<br>
-                                       &gt;&gt; sc-&gt;priority) {<br>
-                               scan_balance = SCAN_ANON;<br>
-                               goto out;<br>
-                       }<br>
-               }<br>
+       if (sc-&gt;file_is_tiny &amp;&amp;<br>
+           !inactive_list_is_low(lruvec, false, sc, false) &amp;&amp;<br>
+           lruvec_lru_size(lruvec, LRU_INACTIVE_ANON,<br>
+                           sc-&gt;reclaim_idx) &gt;&gt; sc-&gt;priority) {<br>
+               scan_balance = SCAN_ANON;<br>
+               goto out;<br>
        }<br>
<br>
        /*<br>
@@ -2754,6 +2728,36 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)<br>
        nr_reclaimed = sc-&gt;nr_reclaimed;<br>
        nr_scanned = sc-&gt;nr_scanned;<br>
<br>
+       /*<br>
+        * Prevent the reclaimer from falling into the cache trap: as<br>
+        * cache pages start out inactive, every cache fault will tip<br>
+        * the scan balance towards the file LRU.  And as the file LRU<br>
+        * shrinks, so does the window for rotation from references.<br>
+        * This means we have a runaway feedback loop where a tiny<br>
+        * thrashing file LRU becomes infinitely more attractive than<br>
+        * anon pages.  Try to detect this based on file LRU size.<br>
+        */<br>
+       if (!cgroup_reclaim(sc)) {<br>
+               unsigned long file;<br>
+               unsigned long free;<br>
+               int z;<br>
+               unsigned long total_high_wmark = 0;<br>
+<br>
+               free = sum_zone_node_page_state(<wbr>pgdat-&gt;node_id, NR_FREE_PAGES);<br>
+               file = node_page_state(pgdat, NR_ACTIVE_FILE) +<br>
+                          node_page_state(pgdat, NR_INACTIVE_FILE);<br>
+<br>
+               for (z = 0; z &lt; MAX_NR_ZONES; z++) {<br>
+                       struct zone *zone = &amp;pgdat-&gt;node_zones[z];<br>
+                       if (!managed_zone(zone))<br>
+                               continue;<br>
+<br>
+                       total_high_wmark += high_wmark_pages(zone);<br>
+               }<br>
+<br>
+               sc-&gt;file_is_tiny = file + free &lt;= total_high_wmark;<br>
+       }<br>
+<br>
        shrink_node_memcgs(pgdat, sc);<br>
<br>
        if (reclaim_state) {<br>
-- <br>
2.24.0<br>
<br>
</blockquote></div>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] mm: vmscan: detect file thrashing at the reclaim root
  2019-11-07 20:53 ` [PATCH 2/3] mm: vmscan: detect file thrashing at the reclaim root Johannes Weiner
@ 2019-11-11  2:01   ` Suren Baghdasaryan
  2019-11-12 17:45     ` Johannes Weiner
  2019-11-14 23:47   ` Shakeel Butt
  1 sibling, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2019-11-11  2:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andrey Ryabinin, Shakeel Butt, Rik van Riel,
	Michal Hocko, linux-mm, cgroups mailinglist, LKML, kernel-team

On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> We use refault information to determine whether the cache workingset
> is stable or transitioning, and dynamically adjust the inactive:active
> file LRU ratio so as to maximize protection from one-off cache during
> stable periods, and minimize IO during transitions.
>
> With cgroups and their nested LRU lists, we currently don't do this
> correctly. While recursive cgroup reclaim establishes a relative LRU
> order among the pages of all involved cgroups, refaults only affect
> the local LRU order in the cgroup in which they are occuring. As a
> result, cache transitions can take longer in a cgrouped system as the
> active pages of sibling cgroups aren't challenged when they should be.
>
> [ Right now, this is somewhat theoretical, because the siblings, under
>   continued regular reclaim pressure, should eventually run out of
>   inactive pages - and since inactive:active *size* balancing is also
>   done on a cgroup-local level, we will challenge the active pages
>   eventually in most cases. But the next patch will move that relative
>   size enforcement to the reclaim root as well, and then this patch
>   here will be necessary to propagate refault pressure to siblings. ]
>
> This patch moves refault detection to the root of reclaim. Instead of
> remembering the cgroup owner of an evicted page, remember the cgroup
> that caused the reclaim to happen. When refaults later occur, they'll
> correctly influence the cross-cgroup LRU order that reclaim follows.

I spent some time thinking about the idea of calculating refault
distance using target_memcg's inactive_age and then activating
refaulted page in (possibly) another memcg and I am still having
trouble convincing myself that this should work correctly. However I
also was unable to convince myself otherwise... We use refault
distance to calculate the deficit in inactive LRU space and then
activate the refaulted page if that distance is less that
active+inactive LRU size. However making that decision based on LRU
sizes of one memcg and then activating the page in another one seems
very counterintuitive to me. Maybe that's just me though...

>
> I.e. if global reclaim kicked out pages in some subgroup A/B/C, the
> refault of those pages will challenge the global LRU order, and not
> just the local order down inside C.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/memcontrol.h |  5 +++
>  include/linux/swap.h       |  2 +-
>  mm/vmscan.c                | 32 ++++++++---------
>  mm/workingset.c            | 72 +++++++++++++++++++++++++++++---------
>  4 files changed, 77 insertions(+), 34 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5b86287fa069..a7a0a1a5c8d5 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -901,6 +901,11 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
>         return &pgdat->__lruvec;
>  }
>
> +static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
> +{
> +       return NULL;
> +}
> +
>  static inline bool mm_match_cgroup(struct mm_struct *mm,
>                 struct mem_cgroup *memcg)
>  {
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 063c0c1e112b..1e99f7ac1d7e 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -307,7 +307,7 @@ struct vma_swap_readahead {
>  };
>
>  /* linux/mm/workingset.c */
> -void *workingset_eviction(struct page *page);
> +void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg);
>  void workingset_refault(struct page *page, void *shadow);
>  void workingset_activation(struct page *page);
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e8dd601e1fad..527617ee9b73 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -853,7 +853,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping)
>   * gets returned with a refcount of 0.
>   */
>  static int __remove_mapping(struct address_space *mapping, struct page *page,
> -                           bool reclaimed)
> +                           bool reclaimed, struct mem_cgroup *target_memcg)
>  {
>         unsigned long flags;
>         int refcount;
> @@ -925,7 +925,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>                  */
>                 if (reclaimed && page_is_file_cache(page) &&
>                     !mapping_exiting(mapping) && !dax_mapping(mapping))
> -                       shadow = workingset_eviction(page);
> +                       shadow = workingset_eviction(page, target_memcg);
>                 __delete_from_page_cache(page, shadow);
>                 xa_unlock_irqrestore(&mapping->i_pages, flags);
>
> @@ -948,7 +948,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>   */
>  int remove_mapping(struct address_space *mapping, struct page *page)
>  {
> -       if (__remove_mapping(mapping, page, false)) {
> +       if (__remove_mapping(mapping, page, false, NULL)) {
>                 /*
>                  * Unfreezing the refcount with 1 rather than 2 effectively
>                  * drops the pagecache ref for us without requiring another
> @@ -1426,7 +1426,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>
>                         count_vm_event(PGLAZYFREED);
>                         count_memcg_page_event(page, PGLAZYFREED);
> -               } else if (!mapping || !__remove_mapping(mapping, page, true))
> +               } else if (!mapping || !__remove_mapping(mapping, page, true,
> +                                                        sc->target_mem_cgroup))
>                         goto keep_locked;
>
>                 unlock_page(page);
> @@ -2189,6 +2190,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
>         enum lru_list inactive_lru = file * LRU_FILE;
>         unsigned long inactive, active;
>         unsigned long inactive_ratio;
> +       struct lruvec *target_lruvec;
>         unsigned long refaults;
>         unsigned long gb;
>
> @@ -2200,8 +2202,9 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
>          * is being established. Disable active list protection to get
>          * rid of the stale workingset quickly.
>          */
> -       refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
> -       if (file && lruvec->refaults != refaults) {
> +       target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
> +       refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
> +       if (file && target_lruvec->refaults != refaults) {
>                 inactive_ratio = 0;
>         } else {
>                 gb = (inactive + active) >> (30 - PAGE_SHIFT);
> @@ -2973,19 +2976,14 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>         sc->gfp_mask = orig_mask;
>  }
>
> -static void snapshot_refaults(struct mem_cgroup *root_memcg, pg_data_t *pgdat)
> +static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
>  {
> -       struct mem_cgroup *memcg;
> -
> -       memcg = mem_cgroup_iter(root_memcg, NULL, NULL);
> -       do {
> -               unsigned long refaults;
> -               struct lruvec *lruvec;
> +       struct lruvec *target_lruvec;
> +       unsigned long refaults;
>
> -               lruvec = mem_cgroup_lruvec(memcg, pgdat);
> -               refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
> -               lruvec->refaults = refaults;
> -       } while ((memcg = mem_cgroup_iter(root_memcg, memcg, NULL)));
> +       target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
> +       refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
> +       target_lruvec->refaults = refaults;
>  }
>
>  /*
> diff --git a/mm/workingset.c b/mm/workingset.c
> index e8212123c1c3..f0885d9f41cd 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -213,28 +213,53 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
>         *workingsetp = workingset;
>  }
>
> +static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
> +{
> +       /*
> +        * Reclaiming a cgroup means reclaiming all its children in a
> +        * round-robin fashion. That means that each cgroup has an LRU
> +        * order that is composed of the LRU orders of its child
> +        * cgroups; and every page has an LRU position not just in the
> +        * cgroup that owns it, but in all of that group's ancestors.
> +        *
> +        * So when the physical inactive list of a leaf cgroup ages,
> +        * the virtual inactive lists of all its parents, including
> +        * the root cgroup's, age as well.
> +        */
> +       do {
> +               struct lruvec *lruvec;
> +
> +               lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +               atomic_long_inc(&lruvec->inactive_age);
> +       } while (memcg && (memcg = parent_mem_cgroup(memcg)));
> +}
> +
>  /**
>   * workingset_eviction - note the eviction of a page from memory
> + * @target_memcg: the cgroup that is causing the reclaim
>   * @page: the page being evicted
>   *
>   * Returns a shadow entry to be stored in @page->mapping->i_pages in place
>   * of the evicted @page so that a later refault can be detected.
>   */
> -void *workingset_eviction(struct page *page)
> +void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
>  {
>         struct pglist_data *pgdat = page_pgdat(page);
> -       struct mem_cgroup *memcg = page_memcg(page);
> -       int memcgid = mem_cgroup_id(memcg);
>         unsigned long eviction;
>         struct lruvec *lruvec;
> +       int memcgid;
>
>         /* Page is fully exclusive and pins page->mem_cgroup */
>         VM_BUG_ON_PAGE(PageLRU(page), page);
>         VM_BUG_ON_PAGE(page_count(page), page);
>         VM_BUG_ON_PAGE(!PageLocked(page), page);
>
> -       lruvec = mem_cgroup_lruvec(memcg, pgdat);
> -       eviction = atomic_long_inc_return(&lruvec->inactive_age);
> +       advance_inactive_age(page_memcg(page), pgdat);
> +
> +       lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
> +       /* XXX: target_memcg can be NULL, go through lruvec */
> +       memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
> +       eviction = atomic_long_read(&lruvec->inactive_age);
>         return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
>  }
>
> @@ -244,10 +269,13 @@ void *workingset_eviction(struct page *page)
>   * @shadow: shadow entry of the evicted page
>   *
>   * Calculates and evaluates the refault distance of the previously
> - * evicted page in the context of the node it was allocated in.
> + * evicted page in the context of the node and the memcg whose memory
> + * pressure caused the eviction.
>   */
>  void workingset_refault(struct page *page, void *shadow)
>  {
> +       struct mem_cgroup *eviction_memcg;
> +       struct lruvec *eviction_lruvec;
>         unsigned long refault_distance;
>         struct pglist_data *pgdat;
>         unsigned long active_file;
> @@ -277,12 +305,12 @@ void workingset_refault(struct page *page, void *shadow)
>          * would be better if the root_mem_cgroup existed in all
>          * configurations instead.
>          */
> -       memcg = mem_cgroup_from_id(memcgid);
> -       if (!mem_cgroup_disabled() && !memcg)
> +       eviction_memcg = mem_cgroup_from_id(memcgid);
> +       if (!mem_cgroup_disabled() && !eviction_memcg)
>                 goto out;
> -       lruvec = mem_cgroup_lruvec(memcg, pgdat);
> -       refault = atomic_long_read(&lruvec->inactive_age);
> -       active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
> +       eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
> +       refault = atomic_long_read(&eviction_lruvec->inactive_age);
> +       active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
>
>         /*
>          * Calculate the refault distance
> @@ -302,6 +330,17 @@ void workingset_refault(struct page *page, void *shadow)
>          */
>         refault_distance = (refault - eviction) & EVICTION_MASK;
>
> +       /*
> +        * The activation decision for this page is made at the level
> +        * where the eviction occurred, as that is where the LRU order
> +        * during page reclaim is being determined.
> +        *
> +        * However, the cgroup that will own the page is the one that
> +        * is actually experiencing the refault event.
> +        */
> +       memcg = get_mem_cgroup_from_mm(current->mm);
> +       lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +
>         inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
>
>         /*
> @@ -310,10 +349,10 @@ void workingset_refault(struct page *page, void *shadow)
>          * the memory was available to the page cache.
>          */
>         if (refault_distance > active_file)
> -               goto out;
> +               goto out_memcg;
>
>         SetPageActive(page);
> -       atomic_long_inc(&lruvec->inactive_age);
> +       advance_inactive_age(memcg, pgdat);
>         inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
>
>         /* Page was active prior to eviction */
> @@ -321,6 +360,9 @@ void workingset_refault(struct page *page, void *shadow)
>                 SetPageWorkingset(page);
>                 inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
>         }
> +
> +out_memcg:
> +       mem_cgroup_put(memcg);
>  out:
>         rcu_read_unlock();
>  }
> @@ -332,7 +374,6 @@ void workingset_refault(struct page *page, void *shadow)
>  void workingset_activation(struct page *page)
>  {
>         struct mem_cgroup *memcg;
> -       struct lruvec *lruvec;
>
>         rcu_read_lock();
>         /*
> @@ -345,8 +386,7 @@ void workingset_activation(struct page *page)
>         memcg = page_memcg_rcu(page);
>         if (!mem_cgroup_disabled() && !memcg)
>                 goto out;
> -       lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
> -       atomic_long_inc(&lruvec->inactive_age);
> +       advance_inactive_age(memcg, page_pgdat(page));
>  out:
>         rcu_read_unlock();
>  }
> --
> 2.24.0
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] mm: vmscan: enforce inactive:active ratio at the reclaim root
  2019-11-07 20:53 ` [PATCH 3/3] mm: vmscan: enforce inactive:active ratio " Johannes Weiner
@ 2019-11-11  2:15   ` Suren Baghdasaryan
  2019-11-12 18:00     ` Johannes Weiner
  2019-11-15  0:29   ` Shakeel Butt
  1 sibling, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2019-11-11  2:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andrey Ryabinin, Shakeel Butt, Rik van Riel,
	Michal Hocko, linux-mm, cgroups mailinglist, LKML, kernel-team

On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> We split the LRU lists into inactive and an active parts to maximize
> workingset protection while allowing just enough inactive cache space
> to faciltate readahead and writeback for one-off file accesses (e.g. a
> linear scan through a file, or logging); or just enough inactive anon
> to maintain recent reference information when reclaim needs to swap.
>
> With cgroups and their nested LRU lists, we currently don't do this
> correctly. While recursive cgroup reclaim establishes a relative LRU
> order among the pages of all involved cgroups, inactive:active size
> decisions are done on a per-cgroup level. As a result, we'll reclaim a
> cgroup's workingset when it doesn't have cold pages, even when one of
> its siblings has plenty of it that should be reclaimed first.
>
> For example: workload A has 50M worth of hot cache but doesn't do any
> one-off file accesses; meanwhile, parallel workload B scans files and
> rarely accesses the same page twice.
>
> If these workloads were to run in an uncgrouped system, A would be
> protected from the high rate of cache faults from B. But if they were
> put in parallel cgroups for memory accounting purposes, B's fast cache
> fault rate would push out the hot cache pages of A. This is unexpected
> and undesirable - the "scan resistance" of the page cache is broken.
>
> This patch moves inactive:active size balancing decisions to the root
> of reclaim - the same level where the LRU order is established.
>
> It does this by looking at the recursize

recursive?

> size of the inactive and the
> active file sets of the cgroup subtree at the beginning of the reclaim
> cycle, and then making a decision - scan or skip active pages - that
> applies throughout the entire run and to every cgroup involved.
>
> With that in place, in the test above, the VM will recognize that
> there are plenty of inactive pages in the combined cache set of
> workloads A and B and prefer the one-off cache in B over the hot pages
> in A. The scan resistance of the cache is restored.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/mmzone.h |   4 +-
>  mm/vmscan.c            | 185 ++++++++++++++++++++++++++---------------
>  2 files changed, 118 insertions(+), 71 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7a09087e8c77..454a230ad417 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -229,12 +229,12 @@ enum lru_list {
>
>  #define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
>
> -static inline int is_file_lru(enum lru_list lru)
> +static inline bool is_file_lru(enum lru_list lru)
>  {
>         return (lru == LRU_INACTIVE_FILE || lru == LRU_ACTIVE_FILE);
>  }
>
> -static inline int is_active_lru(enum lru_list lru)
> +static inline bool is_active_lru(enum lru_list lru)
>  {
>         return (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE);
>  }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 527617ee9b73..df859b1d583c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -79,6 +79,13 @@ struct scan_control {
>          */
>         struct mem_cgroup *target_mem_cgroup;
>
> +       /* Can active pages be deactivated as part of reclaim? */
> +#define DEACTIVATE_ANON 1
> +#define DEACTIVATE_FILE 2
> +       unsigned int may_deactivate:2;
> +       unsigned int force_deactivate:1;
> +       unsigned int skipped_deactivate:1;
> +
>         /* Writepage batching in laptop mode; RECLAIM_WRITE */
>         unsigned int may_writepage:1;
>
> @@ -101,6 +108,9 @@ struct scan_control {
>         /* One of the zones is ready for compaction */
>         unsigned int compaction_ready:1;
>
> +       /* There is easily reclaimable cold cache in the current node */
> +       unsigned int cache_trim_mode:1;
> +
>         /* The file pages on the current node are dangerously low */
>         unsigned int file_is_tiny:1;
>
> @@ -2154,6 +2164,20 @@ unsigned long reclaim_pages(struct list_head *page_list)
>         return nr_reclaimed;
>  }
>
> +static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
> +                                struct lruvec *lruvec, struct scan_control *sc)
> +{
> +       if (is_active_lru(lru)) {
> +               if (sc->may_deactivate & (1 << is_file_lru(lru)))
> +                       shrink_active_list(nr_to_scan, lruvec, sc, lru);
> +               else
> +                       sc->skipped_deactivate = 1;
> +               return 0;
> +       }
> +
> +       return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);
> +}
> +
>  /*
>   * The inactive anon list should be small enough that the VM never has
>   * to do too much work.
> @@ -2182,59 +2206,25 @@ unsigned long reclaim_pages(struct list_head *page_list)
>   *    1TB     101        10GB
>   *   10TB     320        32GB
>   */
> -static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
> -                                struct scan_control *sc, bool trace)
> +static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
>  {
> -       enum lru_list active_lru = file * LRU_FILE + LRU_ACTIVE;
> -       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> -       enum lru_list inactive_lru = file * LRU_FILE;
> +       enum lru_list active_lru = inactive_lru + LRU_ACTIVE;
>         unsigned long inactive, active;
>         unsigned long inactive_ratio;
> -       struct lruvec *target_lruvec;
> -       unsigned long refaults;
>         unsigned long gb;
>
> -       inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
> -       active = lruvec_lru_size(lruvec, active_lru, sc->reclaim_idx);
> +       inactive = lruvec_page_state(lruvec, inactive_lru);
> +       active = lruvec_page_state(lruvec, active_lru);
>
> -       /*
> -        * When refaults are being observed, it means a new workingset
> -        * is being established. Disable active list protection to get
> -        * rid of the stale workingset quickly.
> -        */
> -       target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
> -       refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
> -       if (file && target_lruvec->refaults != refaults) {
> -               inactive_ratio = 0;
> -       } else {
> -               gb = (inactive + active) >> (30 - PAGE_SHIFT);
> -               if (gb)
> -                       inactive_ratio = int_sqrt(10 * gb);
> -               else
> -                       inactive_ratio = 1;
> -       }
> -
> -       if (trace)
> -               trace_mm_vmscan_inactive_list_is_low(pgdat->node_id, sc->reclaim_idx,
> -                       lruvec_lru_size(lruvec, inactive_lru, MAX_NR_ZONES), inactive,
> -                       lruvec_lru_size(lruvec, active_lru, MAX_NR_ZONES), active,
> -                       inactive_ratio, file);
> +       gb = (inactive + active) >> (30 - PAGE_SHIFT);
> +       if (gb)
> +               inactive_ratio = int_sqrt(10 * gb);
> +       else
> +               inactive_ratio = 1;
>
>         return inactive * inactive_ratio < active;
>  }
>
> -static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
> -                                struct lruvec *lruvec, struct scan_control *sc)
> -{
> -       if (is_active_lru(lru)) {
> -               if (inactive_list_is_low(lruvec, is_file_lru(lru), sc, true))
> -                       shrink_active_list(nr_to_scan, lruvec, sc, lru);
> -               return 0;
> -       }
> -
> -       return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);
> -}
> -
>  enum scan_balance {
>         SCAN_EQUAL,
>         SCAN_FRACT,
> @@ -2296,28 +2286,17 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>
>         /*
>          * If the system is almost out of file pages, force-scan anon.
> -        * But only if there are enough inactive anonymous pages on
> -        * the LRU. Otherwise, the small LRU gets thrashed.
>          */
> -       if (sc->file_is_tiny &&
> -           !inactive_list_is_low(lruvec, false, sc, false) &&
> -           lruvec_lru_size(lruvec, LRU_INACTIVE_ANON,
> -                           sc->reclaim_idx) >> sc->priority) {
> +       if (sc->file_is_tiny) {
>                 scan_balance = SCAN_ANON;
>                 goto out;
>         }
>
>         /*
> -        * If there is enough inactive page cache, i.e. if the size of the
> -        * inactive list is greater than that of the active list *and* the
> -        * inactive list actually has some pages to scan on this priority, we
> -        * do not reclaim anything from the anonymous working set right now.
> -        * Without the second condition we could end up never scanning an
> -        * lruvec even if it has plenty of old anonymous pages unless the
> -        * system is under heavy pressure.
> +        * If there is enough inactive page cache, we do not reclaim
> +        * anything from the anonymous working right now.
>          */
> -       if (!inactive_list_is_low(lruvec, true, sc, false) &&
> -           lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, sc->reclaim_idx) >> sc->priority) {
> +       if (sc->cache_trim_mode) {
>                 scan_balance = SCAN_FILE;
>                 goto out;
>         }
> @@ -2582,7 +2561,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>          * Even if we did not try to evict anon pages at all, we want to
>          * rebalance the anon lru active/inactive ratio.
>          */
> -       if (total_swap_pages && inactive_list_is_low(lruvec, false, sc, true))
> +       if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON))
>                 shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>                                    sc, LRU_ACTIVE_ANON);
>  }
> @@ -2722,6 +2701,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>         unsigned long nr_reclaimed, nr_scanned;
>         struct lruvec *target_lruvec;
>         bool reclaimable = false;
> +       unsigned long file;
>
>         target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
>
> @@ -2731,6 +2711,44 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>         nr_reclaimed = sc->nr_reclaimed;
>         nr_scanned = sc->nr_scanned;
>
> +       /*
> +        * Target desirable inactive:active list ratios for the anon
> +        * and file LRU lists.
> +        */
> +       if (!sc->force_deactivate) {
> +               unsigned long refaults;
> +
> +               if (inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
> +                       sc->may_deactivate |= DEACTIVATE_ANON;
> +               else
> +                       sc->may_deactivate &= ~DEACTIVATE_ANON;
> +
> +               /*
> +                * When refaults are being observed, it means a new
> +                * workingset is being established. Deactivate to get
> +                * rid of any stale active pages quickly.
> +                */
> +               refaults = lruvec_page_state(target_lruvec,
> +                                            WORKINGSET_ACTIVATE);
> +               if (refaults != target_lruvec->refaults ||
> +                   inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
> +                       sc->may_deactivate |= DEACTIVATE_FILE;
> +               else
> +                       sc->may_deactivate &= ~DEACTIVATE_FILE;
> +       } else
> +               sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
> +
> +       /*
> +        * If we have plenty of inactive file pages that aren't
> +        * thrashing, try to reclaim those first before touching
> +        * anonymous pages.
> +        */
> +       file = lruvec_page_state(target_lruvec, LRU_INACTIVE_FILE);
> +       if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
> +               sc->cache_trim_mode = 1;
> +       else
> +               sc->cache_trim_mode = 0;
> +
>         /*
>          * Prevent the reclaimer from falling into the cache trap: as
>          * cache pages start out inactive, every cache fault will tip
> @@ -2741,10 +2759,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>          * anon pages.  Try to detect this based on file LRU size.
>          */
>         if (!cgroup_reclaim(sc)) {
> -               unsigned long file;
> -               unsigned long free;
> -               int z;
>                 unsigned long total_high_wmark = 0;
> +               unsigned long free, anon;
> +               int z;
>
>                 free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
>                 file = node_page_state(pgdat, NR_ACTIVE_FILE) +
> @@ -2758,7 +2775,17 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>                         total_high_wmark += high_wmark_pages(zone);
>                 }
>
> -               sc->file_is_tiny = file + free <= total_high_wmark;
> +               /*
> +                * Consider anon: if that's low too, this isn't a
> +                * runaway file reclaim problem, but rather just
> +                * extreme pressure. Reclaim as per usual then.
> +                */
> +               anon = node_page_state(pgdat, NR_INACTIVE_ANON);
> +
> +               sc->file_is_tiny =
> +                       file + free <= total_high_wmark &&
> +                       !(sc->may_deactivate & DEACTIVATE_ANON) &&
> +                       anon >> sc->priority;

The name of file_is_tiny flag seems to not correspond with its actual
semantics anymore. Maybe rename it into "skip_file"?

I'm confused about why !(sc->may_deactivate & DEACTIVATE_ANON) should
be a prerequisite for skipping file LRU reclaim. IIUC this means we
will skip reclaiming from file LRU only when anonymous page
deactivation is not allowed. Could you please add a comment explaining
this?

>         }
>
>         shrink_node_memcgs(pgdat, sc);
> @@ -3062,9 +3089,27 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>         if (sc->compaction_ready)
>                 return 1;
>
> +       /*
> +        * We make inactive:active ratio decisions based on the node's
> +        * composition of memory, but a restrictive reclaim_idx or a
> +        * memory.low cgroup setting can exempt large amounts of
> +        * memory from reclaim. Neither of which are very common, so
> +        * instead of doing costly eligibility calculations of the
> +        * entire cgroup subtree up front, we assume the estimates are
> +        * good, and retry with forcible deactivation if that fails.
> +        */
> +       if (sc->skipped_deactivate) {
> +               sc->priority = initial_priority;
> +               sc->force_deactivate = 1;
> +               sc->skipped_deactivate = 0;
> +               goto retry;
> +       }
> +
>         /* Untapped cgroup reserves?  Don't OOM, retry. */
>         if (sc->memcg_low_skipped) {
>                 sc->priority = initial_priority;
> +               sc->force_deactivate = 0;
> +               sc->skipped_deactivate = 0;
>                 sc->memcg_low_reclaim = 1;
>                 sc->memcg_low_skipped = 0;
>                 goto retry;
> @@ -3347,18 +3392,20 @@ static void age_active_anon(struct pglist_data *pgdat,
>                                 struct scan_control *sc)
>  {
>         struct mem_cgroup *memcg;
> +       struct lruvec *lruvec;
>
>         if (!total_swap_pages)
>                 return;
>
> +       lruvec = mem_cgroup_lruvec(NULL, pgdat);
> +       if (!inactive_is_low(lruvec, LRU_INACTIVE_ANON))
> +               return;
> +
>         memcg = mem_cgroup_iter(NULL, NULL, NULL);
>         do {
> -               struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> -
> -               if (inactive_list_is_low(lruvec, false, sc, true))
> -                       shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> -                                          sc, LRU_ACTIVE_ANON);
> -
> +               lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +               shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> +                                  sc, LRU_ACTIVE_ANON);
>                 memcg = mem_cgroup_iter(NULL, memcg, NULL);
>         } while (memcg);
>  }
> --
> 2.24.0
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] mm: vmscan: detect file thrashing at the reclaim root
  2019-11-11  2:01   ` Suren Baghdasaryan
@ 2019-11-12 17:45     ` Johannes Weiner
  2019-11-12 18:45       ` Suren Baghdasaryan
  0 siblings, 1 reply; 20+ messages in thread
From: Johannes Weiner @ 2019-11-12 17:45 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Andrey Ryabinin, Shakeel Butt, Rik van Riel,
	Michal Hocko, linux-mm, cgroups mailinglist, LKML, kernel-team

On Sun, Nov 10, 2019 at 06:01:18PM -0800, Suren Baghdasaryan wrote:
> On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > We use refault information to determine whether the cache workingset
> > is stable or transitioning, and dynamically adjust the inactive:active
> > file LRU ratio so as to maximize protection from one-off cache during
> > stable periods, and minimize IO during transitions.
> >
> > With cgroups and their nested LRU lists, we currently don't do this
> > correctly. While recursive cgroup reclaim establishes a relative LRU
> > order among the pages of all involved cgroups, refaults only affect
> > the local LRU order in the cgroup in which they are occuring. As a
> > result, cache transitions can take longer in a cgrouped system as the
> > active pages of sibling cgroups aren't challenged when they should be.
> >
> > [ Right now, this is somewhat theoretical, because the siblings, under
> >   continued regular reclaim pressure, should eventually run out of
> >   inactive pages - and since inactive:active *size* balancing is also
> >   done on a cgroup-local level, we will challenge the active pages
> >   eventually in most cases. But the next patch will move that relative
> >   size enforcement to the reclaim root as well, and then this patch
> >   here will be necessary to propagate refault pressure to siblings. ]
> >
> > This patch moves refault detection to the root of reclaim. Instead of
> > remembering the cgroup owner of an evicted page, remember the cgroup
> > that caused the reclaim to happen. When refaults later occur, they'll
> > correctly influence the cross-cgroup LRU order that reclaim follows.
> 
> I spent some time thinking about the idea of calculating refault
> distance using target_memcg's inactive_age and then activating
> refaulted page in (possibly) another memcg and I am still having
> trouble convincing myself that this should work correctly. However I
> also was unable to convince myself otherwise... We use refault
> distance to calculate the deficit in inactive LRU space and then
> activate the refaulted page if that distance is less that
> active+inactive LRU size. However making that decision based on LRU
> sizes of one memcg and then activating the page in another one seems
> very counterintuitive to me. Maybe that's just me though...

It's not activating in a random, unrelated memcg - it's the parental
relationship that makes it work.

If you have a cgroup tree

	root
         |
         A
        / \
       B1 B2

and reclaim is driven by a limit in A, we are reclaiming the pages in
B1 and B2 as if they were on a single LRU list A (it's approximated by
the round-robin reclaim and has some caveats, but that's the idea).

So when a page that belongs to B2 gets evicted, it gets evicted from
virtual LRU list A. When it refaults later, we make the (in)active
size and distance comparisons against virtual LRU list A as well.

The pages on the physical LRU list B2 are not just ordered relative to
its B2 peers, they are also ordered relative to the pages in B1. And
that of course is necessary if we want fair competition between them
under shared reclaim pressure from A.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] mm: vmscan: enforce inactive:active ratio at the reclaim root
  2019-11-11  2:15   ` Suren Baghdasaryan
@ 2019-11-12 18:00     ` Johannes Weiner
  2019-11-12 19:13       ` Suren Baghdasaryan
  0 siblings, 1 reply; 20+ messages in thread
From: Johannes Weiner @ 2019-11-12 18:00 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Andrey Ryabinin, Shakeel Butt, Rik van Riel,
	Michal Hocko, linux-mm, cgroups mailinglist, LKML, kernel-team

On Sun, Nov 10, 2019 at 06:15:50PM -0800, Suren Baghdasaryan wrote:
> On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > @@ -2758,7 +2775,17 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >                         total_high_wmark += high_wmark_pages(zone);
> >                 }
> >
> > -               sc->file_is_tiny = file + free <= total_high_wmark;
> > +               /*
> > +                * Consider anon: if that's low too, this isn't a
> > +                * runaway file reclaim problem, but rather just
> > +                * extreme pressure. Reclaim as per usual then.
> > +                */
> > +               anon = node_page_state(pgdat, NR_INACTIVE_ANON);
> > +
> > +               sc->file_is_tiny =
> > +                       file + free <= total_high_wmark &&
> > +                       !(sc->may_deactivate & DEACTIVATE_ANON) &&
> > +                       anon >> sc->priority;
> 
> The name of file_is_tiny flag seems to not correspond with its actual
> semantics anymore. Maybe rename it into "skip_file"?

I'm not a fan of file_is_tiny, but I also don't like skip_file. IMO
it's better to have it describe a situation instead of an action, in
case we later want to take additional action for that situation.

Any other ideas? ;)

> I'm confused about why !(sc->may_deactivate & DEACTIVATE_ANON) should
> be a prerequisite for skipping file LRU reclaim. IIUC this means we
> will skip reclaiming from file LRU only when anonymous page
> deactivation is not allowed. Could you please add a comment explaining
> this?

The comment above this check tries to explain it: the definition of
file being "tiny" is dependent on the availability of anon. It's a
relative comparison.

If file only has a few pages, and anon is easily reclaimable (does not
require deactivation to reclaim pages), then file is "tiny" and we
should go after the more plentiful anon pages.

If anon is under duress, too, this preference doesn't make sense and
we should just reclaim both lists equally, as per usual.

Note that I'm not introducing this constraint, I'm just changing how
it's implemented. From the patch:

> >         /*
> >          * If the system is almost out of file pages, force-scan anon.
> > -        * But only if there are enough inactive anonymous pages on
> > -        * the LRU. Otherwise, the small LRU gets thrashed.
> >          */
> > -       if (sc->file_is_tiny &&
> > -           !inactive_list_is_low(lruvec, false, sc, false) &&
> > -           lruvec_lru_size(lruvec, LRU_INACTIVE_ANON,
> > -                           sc->reclaim_idx) >> sc->priority) {
> > +       if (sc->file_is_tiny) {
> >                 scan_balance = SCAN_ANON;
> >                 goto out;
> >         }

So it's always been checking whether reclaim would deactivate anon,
and whether inactive_anon has sufficient pages for this priority.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] mm: vmscan: detect file thrashing at the reclaim root
  2019-11-12 17:45     ` Johannes Weiner
@ 2019-11-12 18:45       ` Suren Baghdasaryan
  2019-11-12 18:59         ` Johannes Weiner
  0 siblings, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2019-11-12 18:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andrey Ryabinin, Shakeel Butt, Rik van Riel,
	Michal Hocko, linux-mm, cgroups mailinglist, LKML, kernel-team

On Tue, Nov 12, 2019 at 9:45 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Sun, Nov 10, 2019 at 06:01:18PM -0800, Suren Baghdasaryan wrote:
> > On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > We use refault information to determine whether the cache workingset
> > > is stable or transitioning, and dynamically adjust the inactive:active
> > > file LRU ratio so as to maximize protection from one-off cache during
> > > stable periods, and minimize IO during transitions.
> > >
> > > With cgroups and their nested LRU lists, we currently don't do this
> > > correctly. While recursive cgroup reclaim establishes a relative LRU
> > > order among the pages of all involved cgroups, refaults only affect
> > > the local LRU order in the cgroup in which they are occuring. As a
> > > result, cache transitions can take longer in a cgrouped system as the
> > > active pages of sibling cgroups aren't challenged when they should be.
> > >
> > > [ Right now, this is somewhat theoretical, because the siblings, under
> > >   continued regular reclaim pressure, should eventually run out of
> > >   inactive pages - and since inactive:active *size* balancing is also
> > >   done on a cgroup-local level, we will challenge the active pages
> > >   eventually in most cases. But the next patch will move that relative
> > >   size enforcement to the reclaim root as well, and then this patch
> > >   here will be necessary to propagate refault pressure to siblings. ]
> > >
> > > This patch moves refault detection to the root of reclaim. Instead of
> > > remembering the cgroup owner of an evicted page, remember the cgroup
> > > that caused the reclaim to happen. When refaults later occur, they'll
> > > correctly influence the cross-cgroup LRU order that reclaim follows.
> >
> > I spent some time thinking about the idea of calculating refault
> > distance using target_memcg's inactive_age and then activating
> > refaulted page in (possibly) another memcg and I am still having
> > trouble convincing myself that this should work correctly. However I
> > also was unable to convince myself otherwise... We use refault
> > distance to calculate the deficit in inactive LRU space and then
> > activate the refaulted page if that distance is less that
> > active+inactive LRU size. However making that decision based on LRU
> > sizes of one memcg and then activating the page in another one seems
> > very counterintuitive to me. Maybe that's just me though...
>
> It's not activating in a random, unrelated memcg - it's the parental
> relationship that makes it work.
>
> If you have a cgroup tree
>
>         root
>          |
>          A
>         / \
>        B1 B2
>
> and reclaim is driven by a limit in A, we are reclaiming the pages in
> B1 and B2 as if they were on a single LRU list A (it's approximated by
> the round-robin reclaim and has some caveats, but that's the idea).
>
> So when a page that belongs to B2 gets evicted, it gets evicted from
> virtual LRU list A. When it refaults later, we make the (in)active
> size and distance comparisons against virtual LRU list A as well.
>
> The pages on the physical LRU list B2 are not just ordered relative to
> its B2 peers, they are also ordered relative to the pages in B1. And
> that of course is necessary if we want fair competition between them
> under shared reclaim pressure from A.

Thanks for clarification. The testcase in your description when group
B has a large inactive cache which does not get reclaimed while its
sibling group A has to drop its active cache got me under the
impression that sibling cgroups (in your reply above B1 and B2) can
cause memory pressure in each other. Maybe that's not a legit case and
B1 would not cause pressure in B2 without causing pressure in their
shared parent A? It now makes more sense to me and I want to confirm
that is the case.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] mm: vmscan: detect file thrashing at the reclaim root
  2019-11-12 18:45       ` Suren Baghdasaryan
@ 2019-11-12 18:59         ` Johannes Weiner
  2019-11-12 20:35           ` Suren Baghdasaryan
  0 siblings, 1 reply; 20+ messages in thread
From: Johannes Weiner @ 2019-11-12 18:59 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Andrey Ryabinin, Shakeel Butt, Rik van Riel,
	Michal Hocko, linux-mm, cgroups mailinglist, LKML, kernel-team

On Tue, Nov 12, 2019 at 10:45:44AM -0800, Suren Baghdasaryan wrote:
> On Tue, Nov 12, 2019 at 9:45 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Sun, Nov 10, 2019 at 06:01:18PM -0800, Suren Baghdasaryan wrote:
> > > On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > We use refault information to determine whether the cache workingset
> > > > is stable or transitioning, and dynamically adjust the inactive:active
> > > > file LRU ratio so as to maximize protection from one-off cache during
> > > > stable periods, and minimize IO during transitions.
> > > >
> > > > With cgroups and their nested LRU lists, we currently don't do this
> > > > correctly. While recursive cgroup reclaim establishes a relative LRU
> > > > order among the pages of all involved cgroups, refaults only affect
> > > > the local LRU order in the cgroup in which they are occuring. As a
> > > > result, cache transitions can take longer in a cgrouped system as the
> > > > active pages of sibling cgroups aren't challenged when they should be.
> > > >
> > > > [ Right now, this is somewhat theoretical, because the siblings, under
> > > >   continued regular reclaim pressure, should eventually run out of
> > > >   inactive pages - and since inactive:active *size* balancing is also
> > > >   done on a cgroup-local level, we will challenge the active pages
> > > >   eventually in most cases. But the next patch will move that relative
> > > >   size enforcement to the reclaim root as well, and then this patch
> > > >   here will be necessary to propagate refault pressure to siblings. ]
> > > >
> > > > This patch moves refault detection to the root of reclaim. Instead of
> > > > remembering the cgroup owner of an evicted page, remember the cgroup
> > > > that caused the reclaim to happen. When refaults later occur, they'll
> > > > correctly influence the cross-cgroup LRU order that reclaim follows.
> > >
> > > I spent some time thinking about the idea of calculating refault
> > > distance using target_memcg's inactive_age and then activating
> > > refaulted page in (possibly) another memcg and I am still having
> > > trouble convincing myself that this should work correctly. However I
> > > also was unable to convince myself otherwise... We use refault
> > > distance to calculate the deficit in inactive LRU space and then
> > > activate the refaulted page if that distance is less that
> > > active+inactive LRU size. However making that decision based on LRU
> > > sizes of one memcg and then activating the page in another one seems
> > > very counterintuitive to me. Maybe that's just me though...
> >
> > It's not activating in a random, unrelated memcg - it's the parental
> > relationship that makes it work.
> >
> > If you have a cgroup tree
> >
> >         root
> >          |
> >          A
> >         / \
> >        B1 B2
> >
> > and reclaim is driven by a limit in A, we are reclaiming the pages in
> > B1 and B2 as if they were on a single LRU list A (it's approximated by
> > the round-robin reclaim and has some caveats, but that's the idea).
> >
> > So when a page that belongs to B2 gets evicted, it gets evicted from
> > virtual LRU list A. When it refaults later, we make the (in)active
> > size and distance comparisons against virtual LRU list A as well.
> >
> > The pages on the physical LRU list B2 are not just ordered relative to
> > its B2 peers, they are also ordered relative to the pages in B1. And
> > that of course is necessary if we want fair competition between them
> > under shared reclaim pressure from A.
> 
> Thanks for clarification. The testcase in your description when group
> B has a large inactive cache which does not get reclaimed while its
> sibling group A has to drop its active cache got me under the
> impression that sibling cgroups (in your reply above B1 and B2) can
> cause memory pressure in each other. Maybe that's not a legit case and
> B1 would not cause pressure in B2 without causing pressure in their
> shared parent A? It now makes more sense to me and I want to confirm
> that is the case.

Yes. I'm sorry if this was misleading. They should only cause pressure
onto each other by causing pressure on A; and then reclaim in A treats
them as one combined pool of pages.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] mm: vmscan: enforce inactive:active ratio at the reclaim root
  2019-11-12 18:00     ` Johannes Weiner
@ 2019-11-12 19:13       ` Suren Baghdasaryan
  2019-11-12 20:34         ` Suren Baghdasaryan
  0 siblings, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2019-11-12 19:13 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andrey Ryabinin, Shakeel Butt, Rik van Riel,
	Michal Hocko, linux-mm, cgroups mailinglist, LKML, kernel-team

On Tue, Nov 12, 2019 at 10:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Sun, Nov 10, 2019 at 06:15:50PM -0800, Suren Baghdasaryan wrote:
> > On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > @@ -2758,7 +2775,17 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> > >                         total_high_wmark += high_wmark_pages(zone);
> > >                 }
> > >
> > > -               sc->file_is_tiny = file + free <= total_high_wmark;
> > > +               /*
> > > +                * Consider anon: if that's low too, this isn't a
> > > +                * runaway file reclaim problem, but rather just
> > > +                * extreme pressure. Reclaim as per usual then.
> > > +                */
> > > +               anon = node_page_state(pgdat, NR_INACTIVE_ANON);
> > > +
> > > +               sc->file_is_tiny =
> > > +                       file + free <= total_high_wmark &&
> > > +                       !(sc->may_deactivate & DEACTIVATE_ANON) &&
> > > +                       anon >> sc->priority;
> >
> > The name of file_is_tiny flag seems to not correspond with its actual
> > semantics anymore. Maybe rename it into "skip_file"?
>
> I'm not a fan of file_is_tiny, but I also don't like skip_file. IMO
> it's better to have it describe a situation instead of an action, in
> case we later want to take additional action for that situation.
>
> Any other ideas? ;)

All other ideas still yield verbs (like sc->prefer_anon). Maybe then
add some comment at the file_is_tiny declaration that it represents
not only the fact that the file LRU is too small to reclaim but also
that there are easily reclaimable anon pages?

>
> > I'm confused about why !(sc->may_deactivate & DEACTIVATE_ANON) should
> > be a prerequisite for skipping file LRU reclaim. IIUC this means we
> > will skip reclaiming from file LRU only when anonymous page
> > deactivation is not allowed. Could you please add a comment explaining
> > this?
>
> The comment above this check tries to explain it: the definition of
> file being "tiny" is dependent on the availability of anon. It's a
> relative comparison.
>
> If file only has a few pages, and anon is easily reclaimable (does not
> require deactivation to reclaim pages), then file is "tiny" and we
> should go after the more plentiful anon pages.

Your above explanation is much clearer to me than the one in the comment :)

>
> If anon is under duress, too, this preference doesn't make sense and
> we should just reclaim both lists equally, as per usual.
>
> Note that I'm not introducing this constraint, I'm just changing how
> it's implemented. From the patch:
>
> > >         /*
> > >          * If the system is almost out of file pages, force-scan anon.
> > > -        * But only if there are enough inactive anonymous pages on
> > > -        * the LRU. Otherwise, the small LRU gets thrashed.
> > >          */
> > > -       if (sc->file_is_tiny &&
> > > -           !inactive_list_is_low(lruvec, false, sc, false) &&
> > > -           lruvec_lru_size(lruvec, LRU_INACTIVE_ANON,
> > > -                           sc->reclaim_idx) >> sc->priority) {
> > > +       if (sc->file_is_tiny) {
> > >                 scan_balance = SCAN_ANON;
> > >                 goto out;
> > >         }
>
> So it's always been checking whether reclaim would deactivate anon,
> and whether inactive_anon has sufficient pages for this priority.

I didn't realize !inactive_list_is_low(lruvec, false, sc, false) is
effectively the same as !(sc->may_deactivate & DEACTIVATE_ANON) but
after re-reading the patch that makes sense... Except when
force_deactivate==true, in which case shouldn't you consider
NR_ACTIVE_ANON as easily reclaimable too? IOW should it be smth like
this:

anon = node_page_state(pgdat, NR_INACTIVE_ANON) +
(sc->force_deactivate ? node_page_state(pgdat, NR_ACTIVE_ANON) : 0);

?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] mm: vmscan: enforce inactive:active ratio at the reclaim root
  2019-11-12 19:13       ` Suren Baghdasaryan
@ 2019-11-12 20:34         ` Suren Baghdasaryan
  0 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2019-11-12 20:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andrey Ryabinin, Shakeel Butt, Rik van Riel,
	Michal Hocko, linux-mm, cgroups mailinglist, LKML, kernel-team

On Tue, Nov 12, 2019 at 11:13 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Nov 12, 2019 at 10:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Sun, Nov 10, 2019 at 06:15:50PM -0800, Suren Baghdasaryan wrote:
> > > On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > @@ -2758,7 +2775,17 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> > > >                         total_high_wmark += high_wmark_pages(zone);
> > > >                 }
> > > >
> > > > -               sc->file_is_tiny = file + free <= total_high_wmark;
> > > > +               /*
> > > > +                * Consider anon: if that's low too, this isn't a
> > > > +                * runaway file reclaim problem, but rather just
> > > > +                * extreme pressure. Reclaim as per usual then.
> > > > +                */
> > > > +               anon = node_page_state(pgdat, NR_INACTIVE_ANON);
> > > > +
> > > > +               sc->file_is_tiny =
> > > > +                       file + free <= total_high_wmark &&
> > > > +                       !(sc->may_deactivate & DEACTIVATE_ANON) &&
> > > > +                       anon >> sc->priority;
> > >
> > > The name of file_is_tiny flag seems to not correspond with its actual
> > > semantics anymore. Maybe rename it into "skip_file"?
> >
> > I'm not a fan of file_is_tiny, but I also don't like skip_file. IMO
> > it's better to have it describe a situation instead of an action, in
> > case we later want to take additional action for that situation.
> >
> > Any other ideas? ;)
>
> All other ideas still yield verbs (like sc->prefer_anon). Maybe then
> add some comment at the file_is_tiny declaration that it represents
> not only the fact that the file LRU is too small to reclaim but also
> that there are easily reclaimable anon pages?
>
> >
> > > I'm confused about why !(sc->may_deactivate & DEACTIVATE_ANON) should
> > > be a prerequisite for skipping file LRU reclaim. IIUC this means we
> > > will skip reclaiming from file LRU only when anonymous page
> > > deactivation is not allowed. Could you please add a comment explaining
> > > this?
> >
> > The comment above this check tries to explain it: the definition of
> > file being "tiny" is dependent on the availability of anon. It's a
> > relative comparison.
> >
> > If file only has a few pages, and anon is easily reclaimable (does not
> > require deactivation to reclaim pages), then file is "tiny" and we
> > should go after the more plentiful anon pages.
>
> Your above explanation is much clearer to me than the one in the comment :)
>
> >
> > If anon is under duress, too, this preference doesn't make sense and
> > we should just reclaim both lists equally, as per usual.
> >
> > Note that I'm not introducing this constraint, I'm just changing how
> > it's implemented. From the patch:
> >
> > > >         /*
> > > >          * If the system is almost out of file pages, force-scan anon.
> > > > -        * But only if there are enough inactive anonymous pages on
> > > > -        * the LRU. Otherwise, the small LRU gets thrashed.
> > > >          */
> > > > -       if (sc->file_is_tiny &&
> > > > -           !inactive_list_is_low(lruvec, false, sc, false) &&
> > > > -           lruvec_lru_size(lruvec, LRU_INACTIVE_ANON,
> > > > -                           sc->reclaim_idx) >> sc->priority) {
> > > > +       if (sc->file_is_tiny) {
> > > >                 scan_balance = SCAN_ANON;
> > > >                 goto out;
> > > >         }
> >
> > So it's always been checking whether reclaim would deactivate anon,
> > and whether inactive_anon has sufficient pages for this priority.
>
> I didn't realize !inactive_list_is_low(lruvec, false, sc, false) is
> effectively the same as !(sc->may_deactivate & DEACTIVATE_ANON) but
> after re-reading the patch that makes sense... Except when
> force_deactivate==true, in which case shouldn't you consider
> NR_ACTIVE_ANON as easily reclaimable too? IOW should it be smth like
> this:
>
> anon = node_page_state(pgdat, NR_INACTIVE_ANON) +
> (sc->force_deactivate ? node_page_state(pgdat, NR_ACTIVE_ANON) : 0);
>
> ?

On second thought that proposal would not be correct since
deactivation is not the same as reclaim... So the way it is now looks
correct.

Reviewed-by: Suren Baghdasaryan <surenb@google.com>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] mm: vmscan: detect file thrashing at the reclaim root
  2019-11-12 18:59         ` Johannes Weiner
@ 2019-11-12 20:35           ` Suren Baghdasaryan
  0 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2019-11-12 20:35 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andrey Ryabinin, Shakeel Butt, Rik van Riel,
	Michal Hocko, linux-mm, cgroups mailinglist, LKML, kernel-team

On Tue, Nov 12, 2019 at 10:59 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Nov 12, 2019 at 10:45:44AM -0800, Suren Baghdasaryan wrote:
> > On Tue, Nov 12, 2019 at 9:45 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Sun, Nov 10, 2019 at 06:01:18PM -0800, Suren Baghdasaryan wrote:
> > > > On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > >
> > > > > We use refault information to determine whether the cache workingset
> > > > > is stable or transitioning, and dynamically adjust the inactive:active
> > > > > file LRU ratio so as to maximize protection from one-off cache during
> > > > > stable periods, and minimize IO during transitions.
> > > > >
> > > > > With cgroups and their nested LRU lists, we currently don't do this
> > > > > correctly. While recursive cgroup reclaim establishes a relative LRU
> > > > > order among the pages of all involved cgroups, refaults only affect
> > > > > the local LRU order in the cgroup in which they are occuring. As a
> > > > > result, cache transitions can take longer in a cgrouped system as the
> > > > > active pages of sibling cgroups aren't challenged when they should be.
> > > > >
> > > > > [ Right now, this is somewhat theoretical, because the siblings, under
> > > > >   continued regular reclaim pressure, should eventually run out of
> > > > >   inactive pages - and since inactive:active *size* balancing is also
> > > > >   done on a cgroup-local level, we will challenge the active pages
> > > > >   eventually in most cases. But the next patch will move that relative
> > > > >   size enforcement to the reclaim root as well, and then this patch
> > > > >   here will be necessary to propagate refault pressure to siblings. ]
> > > > >
> > > > > This patch moves refault detection to the root of reclaim. Instead of
> > > > > remembering the cgroup owner of an evicted page, remember the cgroup
> > > > > that caused the reclaim to happen. When refaults later occur, they'll
> > > > > correctly influence the cross-cgroup LRU order that reclaim follows.
> > > >
> > > > I spent some time thinking about the idea of calculating refault
> > > > distance using target_memcg's inactive_age and then activating
> > > > refaulted page in (possibly) another memcg and I am still having
> > > > trouble convincing myself that this should work correctly. However I
> > > > also was unable to convince myself otherwise... We use refault
> > > > distance to calculate the deficit in inactive LRU space and then
> > > > activate the refaulted page if that distance is less that
> > > > active+inactive LRU size. However making that decision based on LRU
> > > > sizes of one memcg and then activating the page in another one seems
> > > > very counterintuitive to me. Maybe that's just me though...
> > >
> > > It's not activating in a random, unrelated memcg - it's the parental
> > > relationship that makes it work.
> > >
> > > If you have a cgroup tree
> > >
> > >         root
> > >          |
> > >          A
> > >         / \
> > >        B1 B2
> > >
> > > and reclaim is driven by a limit in A, we are reclaiming the pages in
> > > B1 and B2 as if they were on a single LRU list A (it's approximated by
> > > the round-robin reclaim and has some caveats, but that's the idea).
> > >
> > > So when a page that belongs to B2 gets evicted, it gets evicted from
> > > virtual LRU list A. When it refaults later, we make the (in)active
> > > size and distance comparisons against virtual LRU list A as well.
> > >
> > > The pages on the physical LRU list B2 are not just ordered relative to
> > > its B2 peers, they are also ordered relative to the pages in B1. And
> > > that of course is necessary if we want fair competition between them
> > > under shared reclaim pressure from A.
> >
> > Thanks for clarification. The testcase in your description when group
> > B has a large inactive cache which does not get reclaimed while its
> > sibling group A has to drop its active cache got me under the
> > impression that sibling cgroups (in your reply above B1 and B2) can
> > cause memory pressure in each other. Maybe that's not a legit case and
> > B1 would not cause pressure in B2 without causing pressure in their
> > shared parent A? It now makes more sense to me and I want to confirm
> > that is the case.
>
> Yes. I'm sorry if this was misleading. They should only cause pressure
> onto each other by causing pressure on A; and then reclaim in A treats
> them as one combined pool of pages.


Reviewed-by: Suren Baghdasaryan <surenb@google.com>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] mm: vmscan: detect file thrashing at the reclaim root
  2019-11-07 20:53 ` [PATCH 2/3] mm: vmscan: detect file thrashing at the reclaim root Johannes Weiner
  2019-11-11  2:01   ` Suren Baghdasaryan
@ 2019-11-14 23:47   ` Shakeel Butt
  2019-11-15 16:07     ` Johannes Weiner
  1 sibling, 1 reply; 20+ messages in thread
From: Shakeel Butt @ 2019-11-14 23:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andrey Ryabinin, Suren Baghdasaryan, Rik van Riel,
	Michal Hocko, Linux MM, Cgroups, LKML, Kernel Team

On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> We use refault information to determine whether the cache workingset
> is stable or transitioning, and dynamically adjust the inactive:active
> file LRU ratio so as to maximize protection from one-off cache during
> stable periods, and minimize IO during transitions.
>
> With cgroups and their nested LRU lists, we currently don't do this
> correctly. While recursive cgroup reclaim establishes a relative LRU
> order among the pages of all involved cgroups, refaults only affect
> the local LRU order in the cgroup in which they are occuring. As a
> result, cache transitions can take longer in a cgrouped system as the
> active pages of sibling cgroups aren't challenged when they should be.
>
> [ Right now, this is somewhat theoretical, because the siblings, under
>   continued regular reclaim pressure, should eventually run out of
>   inactive pages - and since inactive:active *size* balancing is also
>   done on a cgroup-local level, we will challenge the active pages
>   eventually in most cases. But the next patch will move that relative
>   size enforcement to the reclaim root as well, and then this patch
>   here will be necessary to propagate refault pressure to siblings. ]
>
> This patch moves refault detection to the root of reclaim. Instead of
> remembering the cgroup owner of an evicted page, remember the cgroup
> that caused the reclaim to happen. When refaults later occur, they'll
> correctly influence the cross-cgroup LRU order that reclaim follows.

Can you please explain how "they'll correctly influence"? I see that
if the refaulted page was evicted due to pressure in some ancestor,
then that's ancestor's refault distance and active file size will be
used to decide to activate the refaulted page but how that is
influencing cross-cgroup LRUs?

>
> I.e. if global reclaim kicked out pages in some subgroup A/B/C, the
> refault of those pages will challenge the global LRU order, and not
> just the local order down inside C.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/memcontrol.h |  5 +++
>  include/linux/swap.h       |  2 +-
>  mm/vmscan.c                | 32 ++++++++---------
>  mm/workingset.c            | 72 +++++++++++++++++++++++++++++---------
>  4 files changed, 77 insertions(+), 34 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5b86287fa069..a7a0a1a5c8d5 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -901,6 +901,11 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
>         return &pgdat->__lruvec;
>  }
>
> +static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
> +{
> +       return NULL;
> +}
> +
>  static inline bool mm_match_cgroup(struct mm_struct *mm,
>                 struct mem_cgroup *memcg)
>  {
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 063c0c1e112b..1e99f7ac1d7e 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -307,7 +307,7 @@ struct vma_swap_readahead {
>  };
>
>  /* linux/mm/workingset.c */
> -void *workingset_eviction(struct page *page);
> +void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg);
>  void workingset_refault(struct page *page, void *shadow);
>  void workingset_activation(struct page *page);
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e8dd601e1fad..527617ee9b73 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -853,7 +853,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping)
>   * gets returned with a refcount of 0.
>   */
>  static int __remove_mapping(struct address_space *mapping, struct page *page,
> -                           bool reclaimed)
> +                           bool reclaimed, struct mem_cgroup *target_memcg)
>  {
>         unsigned long flags;
>         int refcount;
> @@ -925,7 +925,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>                  */
>                 if (reclaimed && page_is_file_cache(page) &&
>                     !mapping_exiting(mapping) && !dax_mapping(mapping))
> -                       shadow = workingset_eviction(page);
> +                       shadow = workingset_eviction(page, target_memcg);
>                 __delete_from_page_cache(page, shadow);
>                 xa_unlock_irqrestore(&mapping->i_pages, flags);
>
> @@ -948,7 +948,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>   */
>  int remove_mapping(struct address_space *mapping, struct page *page)
>  {
> -       if (__remove_mapping(mapping, page, false)) {
> +       if (__remove_mapping(mapping, page, false, NULL)) {
>                 /*
>                  * Unfreezing the refcount with 1 rather than 2 effectively
>                  * drops the pagecache ref for us without requiring another
> @@ -1426,7 +1426,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>
>                         count_vm_event(PGLAZYFREED);
>                         count_memcg_page_event(page, PGLAZYFREED);
> -               } else if (!mapping || !__remove_mapping(mapping, page, true))
> +               } else if (!mapping || !__remove_mapping(mapping, page, true,
> +                                                        sc->target_mem_cgroup))
>                         goto keep_locked;
>
>                 unlock_page(page);
> @@ -2189,6 +2190,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
>         enum lru_list inactive_lru = file * LRU_FILE;
>         unsigned long inactive, active;
>         unsigned long inactive_ratio;
> +       struct lruvec *target_lruvec;
>         unsigned long refaults;
>         unsigned long gb;
>
> @@ -2200,8 +2202,9 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
>          * is being established. Disable active list protection to get
>          * rid of the stale workingset quickly.
>          */
> -       refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
> -       if (file && lruvec->refaults != refaults) {
> +       target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
> +       refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
> +       if (file && target_lruvec->refaults != refaults) {
>                 inactive_ratio = 0;
>         } else {
>                 gb = (inactive + active) >> (30 - PAGE_SHIFT);
> @@ -2973,19 +2976,14 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>         sc->gfp_mask = orig_mask;
>  }
>
> -static void snapshot_refaults(struct mem_cgroup *root_memcg, pg_data_t *pgdat)
> +static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
>  {
> -       struct mem_cgroup *memcg;
> -
> -       memcg = mem_cgroup_iter(root_memcg, NULL, NULL);
> -       do {
> -               unsigned long refaults;
> -               struct lruvec *lruvec;
> +       struct lruvec *target_lruvec;
> +       unsigned long refaults;
>
> -               lruvec = mem_cgroup_lruvec(memcg, pgdat);
> -               refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
> -               lruvec->refaults = refaults;
> -       } while ((memcg = mem_cgroup_iter(root_memcg, memcg, NULL)));
> +       target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
> +       refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
> +       target_lruvec->refaults = refaults;
>  }
>
>  /*
> diff --git a/mm/workingset.c b/mm/workingset.c
> index e8212123c1c3..f0885d9f41cd 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -213,28 +213,53 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
>         *workingsetp = workingset;
>  }
>
> +static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
> +{
> +       /*
> +        * Reclaiming a cgroup means reclaiming all its children in a
> +        * round-robin fashion. That means that each cgroup has an LRU
> +        * order that is composed of the LRU orders of its child
> +        * cgroups; and every page has an LRU position not just in the
> +        * cgroup that owns it, but in all of that group's ancestors.
> +        *
> +        * So when the physical inactive list of a leaf cgroup ages,
> +        * the virtual inactive lists of all its parents, including
> +        * the root cgroup's, age as well.
> +        */
> +       do {
> +               struct lruvec *lruvec;
> +
> +               lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +               atomic_long_inc(&lruvec->inactive_age);
> +       } while (memcg && (memcg = parent_mem_cgroup(memcg)));
> +}
> +
>  /**
>   * workingset_eviction - note the eviction of a page from memory
> + * @target_memcg: the cgroup that is causing the reclaim
>   * @page: the page being evicted
>   *
>   * Returns a shadow entry to be stored in @page->mapping->i_pages in place
>   * of the evicted @page so that a later refault can be detected.
>   */
> -void *workingset_eviction(struct page *page)
> +void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
>  {
>         struct pglist_data *pgdat = page_pgdat(page);
> -       struct mem_cgroup *memcg = page_memcg(page);
> -       int memcgid = mem_cgroup_id(memcg);
>         unsigned long eviction;
>         struct lruvec *lruvec;
> +       int memcgid;
>
>         /* Page is fully exclusive and pins page->mem_cgroup */
>         VM_BUG_ON_PAGE(PageLRU(page), page);
>         VM_BUG_ON_PAGE(page_count(page), page);
>         VM_BUG_ON_PAGE(!PageLocked(page), page);
>
> -       lruvec = mem_cgroup_lruvec(memcg, pgdat);
> -       eviction = atomic_long_inc_return(&lruvec->inactive_age);
> +       advance_inactive_age(page_memcg(page), pgdat);
> +
> +       lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
> +       /* XXX: target_memcg can be NULL, go through lruvec */
> +       memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
> +       eviction = atomic_long_read(&lruvec->inactive_age);
>         return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
>  }
>
> @@ -244,10 +269,13 @@ void *workingset_eviction(struct page *page)
>   * @shadow: shadow entry of the evicted page
>   *
>   * Calculates and evaluates the refault distance of the previously
> - * evicted page in the context of the node it was allocated in.
> + * evicted page in the context of the node and the memcg whose memory
> + * pressure caused the eviction.
>   */
>  void workingset_refault(struct page *page, void *shadow)
>  {
> +       struct mem_cgroup *eviction_memcg;
> +       struct lruvec *eviction_lruvec;
>         unsigned long refault_distance;
>         struct pglist_data *pgdat;
>         unsigned long active_file;
> @@ -277,12 +305,12 @@ void workingset_refault(struct page *page, void *shadow)
>          * would be better if the root_mem_cgroup existed in all
>          * configurations instead.
>          */
> -       memcg = mem_cgroup_from_id(memcgid);
> -       if (!mem_cgroup_disabled() && !memcg)
> +       eviction_memcg = mem_cgroup_from_id(memcgid);
> +       if (!mem_cgroup_disabled() && !eviction_memcg)
>                 goto out;
> -       lruvec = mem_cgroup_lruvec(memcg, pgdat);
> -       refault = atomic_long_read(&lruvec->inactive_age);
> -       active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
> +       eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
> +       refault = atomic_long_read(&eviction_lruvec->inactive_age);
> +       active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
>
>         /*
>          * Calculate the refault distance
> @@ -302,6 +330,17 @@ void workingset_refault(struct page *page, void *shadow)
>          */
>         refault_distance = (refault - eviction) & EVICTION_MASK;
>
> +       /*
> +        * The activation decision for this page is made at the level
> +        * where the eviction occurred, as that is where the LRU order
> +        * during page reclaim is being determined.
> +        *
> +        * However, the cgroup that will own the page is the one that
> +        * is actually experiencing the refault event.
> +        */
> +       memcg = get_mem_cgroup_from_mm(current->mm);

Why not page_memcg(page)? page is locked.


> +       lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +
>         inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
>
>         /*
> @@ -310,10 +349,10 @@ void workingset_refault(struct page *page, void *shadow)
>          * the memory was available to the page cache.
>          */
>         if (refault_distance > active_file)
> -               goto out;
> +               goto out_memcg;
>
>         SetPageActive(page);
> -       atomic_long_inc(&lruvec->inactive_age);
> +       advance_inactive_age(memcg, pgdat);
>         inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
>
>         /* Page was active prior to eviction */
> @@ -321,6 +360,9 @@ void workingset_refault(struct page *page, void *shadow)
>                 SetPageWorkingset(page);
>                 inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
>         }
> +
> +out_memcg:
> +       mem_cgroup_put(memcg);
>  out:
>         rcu_read_unlock();
>  }
> @@ -332,7 +374,6 @@ void workingset_refault(struct page *page, void *shadow)
>  void workingset_activation(struct page *page)
>  {
>         struct mem_cgroup *memcg;
> -       struct lruvec *lruvec;
>
>         rcu_read_lock();
>         /*
> @@ -345,8 +386,7 @@ void workingset_activation(struct page *page)
>         memcg = page_memcg_rcu(page);
>         if (!mem_cgroup_disabled() && !memcg)
>                 goto out;
> -       lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
> -       atomic_long_inc(&lruvec->inactive_age);
> +       advance_inactive_age(memcg, page_pgdat(page));
>  out:
>         rcu_read_unlock();
>  }
> --
> 2.24.0
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] mm: vmscan: enforce inactive:active ratio at the reclaim root
  2019-11-07 20:53 ` [PATCH 3/3] mm: vmscan: enforce inactive:active ratio " Johannes Weiner
  2019-11-11  2:15   ` Suren Baghdasaryan
@ 2019-11-15  0:29   ` Shakeel Butt
  2019-11-27 22:16     ` Shakeel Butt
  1 sibling, 1 reply; 20+ messages in thread
From: Shakeel Butt @ 2019-11-15  0:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andrey Ryabinin, Suren Baghdasaryan, Rik van Riel,
	Michal Hocko, Linux MM, Cgroups, LKML, Kernel Team

On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> We split the LRU lists into inactive and an active parts to maximize
> workingset protection while allowing just enough inactive cache space
> to faciltate readahead and writeback for one-off file accesses (e.g. a
> linear scan through a file, or logging); or just enough inactive anon
> to maintain recent reference information when reclaim needs to swap.
>
> With cgroups and their nested LRU lists, we currently don't do this
> correctly. While recursive cgroup reclaim establishes a relative LRU
> order among the pages of all involved cgroups, inactive:active size
> decisions are done on a per-cgroup level. As a result, we'll reclaim a
> cgroup's workingset when it doesn't have cold pages, even when one of
> its siblings has plenty of it that should be reclaimed first.
>
> For example: workload A has 50M worth of hot cache but doesn't do any
> one-off file accesses; meanwhile, parallel workload B scans files and
> rarely accesses the same page twice.
>
> If these workloads were to run in an uncgrouped system, A would be
> protected from the high rate of cache faults from B. But if they were
> put in parallel cgroups for memory accounting purposes, B's fast cache
> fault rate would push out the hot cache pages of A. This is unexpected
> and undesirable - the "scan resistance" of the page cache is broken.
>
> This patch moves inactive:active size balancing decisions to the root
> of reclaim - the same level where the LRU order is established.
>
> It does this by looking at the recursize size of the inactive and the
> active file sets of the cgroup subtree at the beginning of the reclaim
> cycle, and then making a decision - scan or skip active pages - that
> applies throughout the entire run and to every cgroup involved.

Oh ok, this answer my question on previous patch. The reclaim root
looks at the full tree inactive and active count to make decisions and
thus active list of some descendant cgroup will be protected from the
inactive list of its sibling.

>
> With that in place, in the test above, the VM will recognize that
> there are plenty of inactive pages in the combined cache set of
> workloads A and B and prefer the one-off cache in B over the hot pages
> in A. The scan resistance of the cache is restored.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

BTW no love for the anon memory? The whole "workingset" mechanism only
works for file pages. Are there any plans to extend it for anon as
well?

> ---
>  include/linux/mmzone.h |   4 +-
>  mm/vmscan.c            | 185 ++++++++++++++++++++++++++---------------
>  2 files changed, 118 insertions(+), 71 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7a09087e8c77..454a230ad417 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -229,12 +229,12 @@ enum lru_list {
>
>  #define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
>
> -static inline int is_file_lru(enum lru_list lru)
> +static inline bool is_file_lru(enum lru_list lru)
>  {
>         return (lru == LRU_INACTIVE_FILE || lru == LRU_ACTIVE_FILE);
>  }
>
> -static inline int is_active_lru(enum lru_list lru)
> +static inline bool is_active_lru(enum lru_list lru)
>  {
>         return (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE);
>  }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 527617ee9b73..df859b1d583c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -79,6 +79,13 @@ struct scan_control {
>          */
>         struct mem_cgroup *target_mem_cgroup;
>
> +       /* Can active pages be deactivated as part of reclaim? */
> +#define DEACTIVATE_ANON 1
> +#define DEACTIVATE_FILE 2
> +       unsigned int may_deactivate:2;
> +       unsigned int force_deactivate:1;
> +       unsigned int skipped_deactivate:1;
> +
>         /* Writepage batching in laptop mode; RECLAIM_WRITE */
>         unsigned int may_writepage:1;
>
> @@ -101,6 +108,9 @@ struct scan_control {
>         /* One of the zones is ready for compaction */
>         unsigned int compaction_ready:1;
>
> +       /* There is easily reclaimable cold cache in the current node */
> +       unsigned int cache_trim_mode:1;
> +
>         /* The file pages on the current node are dangerously low */
>         unsigned int file_is_tiny:1;
>
> @@ -2154,6 +2164,20 @@ unsigned long reclaim_pages(struct list_head *page_list)
>         return nr_reclaimed;
>  }
>
> +static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
> +                                struct lruvec *lruvec, struct scan_control *sc)
> +{
> +       if (is_active_lru(lru)) {
> +               if (sc->may_deactivate & (1 << is_file_lru(lru)))
> +                       shrink_active_list(nr_to_scan, lruvec, sc, lru);
> +               else
> +                       sc->skipped_deactivate = 1;
> +               return 0;
> +       }
> +
> +       return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);
> +}
> +
>  /*
>   * The inactive anon list should be small enough that the VM never has
>   * to do too much work.
> @@ -2182,59 +2206,25 @@ unsigned long reclaim_pages(struct list_head *page_list)
>   *    1TB     101        10GB
>   *   10TB     320        32GB
>   */
> -static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
> -                                struct scan_control *sc, bool trace)
> +static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
>  {
> -       enum lru_list active_lru = file * LRU_FILE + LRU_ACTIVE;
> -       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> -       enum lru_list inactive_lru = file * LRU_FILE;
> +       enum lru_list active_lru = inactive_lru + LRU_ACTIVE;
>         unsigned long inactive, active;
>         unsigned long inactive_ratio;
> -       struct lruvec *target_lruvec;
> -       unsigned long refaults;
>         unsigned long gb;
>
> -       inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
> -       active = lruvec_lru_size(lruvec, active_lru, sc->reclaim_idx);
> +       inactive = lruvec_page_state(lruvec, inactive_lru);
> +       active = lruvec_page_state(lruvec, active_lru);
>
> -       /*
> -        * When refaults are being observed, it means a new workingset
> -        * is being established. Disable active list protection to get
> -        * rid of the stale workingset quickly.
> -        */
> -       target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
> -       refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
> -       if (file && target_lruvec->refaults != refaults) {
> -               inactive_ratio = 0;
> -       } else {
> -               gb = (inactive + active) >> (30 - PAGE_SHIFT);
> -               if (gb)
> -                       inactive_ratio = int_sqrt(10 * gb);
> -               else
> -                       inactive_ratio = 1;
> -       }
> -
> -       if (trace)
> -               trace_mm_vmscan_inactive_list_is_low(pgdat->node_id, sc->reclaim_idx,
> -                       lruvec_lru_size(lruvec, inactive_lru, MAX_NR_ZONES), inactive,
> -                       lruvec_lru_size(lruvec, active_lru, MAX_NR_ZONES), active,
> -                       inactive_ratio, file);
> +       gb = (inactive + active) >> (30 - PAGE_SHIFT);
> +       if (gb)
> +               inactive_ratio = int_sqrt(10 * gb);
> +       else
> +               inactive_ratio = 1;
>
>         return inactive * inactive_ratio < active;
>  }
>
> -static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
> -                                struct lruvec *lruvec, struct scan_control *sc)
> -{
> -       if (is_active_lru(lru)) {
> -               if (inactive_list_is_low(lruvec, is_file_lru(lru), sc, true))
> -                       shrink_active_list(nr_to_scan, lruvec, sc, lru);
> -               return 0;
> -       }
> -
> -       return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);
> -}
> -
>  enum scan_balance {
>         SCAN_EQUAL,
>         SCAN_FRACT,
> @@ -2296,28 +2286,17 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>
>         /*
>          * If the system is almost out of file pages, force-scan anon.
> -        * But only if there are enough inactive anonymous pages on
> -        * the LRU. Otherwise, the small LRU gets thrashed.
>          */
> -       if (sc->file_is_tiny &&
> -           !inactive_list_is_low(lruvec, false, sc, false) &&
> -           lruvec_lru_size(lruvec, LRU_INACTIVE_ANON,
> -                           sc->reclaim_idx) >> sc->priority) {
> +       if (sc->file_is_tiny) {
>                 scan_balance = SCAN_ANON;
>                 goto out;
>         }
>
>         /*
> -        * If there is enough inactive page cache, i.e. if the size of the
> -        * inactive list is greater than that of the active list *and* the
> -        * inactive list actually has some pages to scan on this priority, we
> -        * do not reclaim anything from the anonymous working set right now.
> -        * Without the second condition we could end up never scanning an
> -        * lruvec even if it has plenty of old anonymous pages unless the
> -        * system is under heavy pressure.
> +        * If there is enough inactive page cache, we do not reclaim
> +        * anything from the anonymous working right now.
>          */
> -       if (!inactive_list_is_low(lruvec, true, sc, false) &&
> -           lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, sc->reclaim_idx) >> sc->priority) {
> +       if (sc->cache_trim_mode) {
>                 scan_balance = SCAN_FILE;
>                 goto out;
>         }
> @@ -2582,7 +2561,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>          * Even if we did not try to evict anon pages at all, we want to
>          * rebalance the anon lru active/inactive ratio.
>          */
> -       if (total_swap_pages && inactive_list_is_low(lruvec, false, sc, true))
> +       if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON))
>                 shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>                                    sc, LRU_ACTIVE_ANON);
>  }
> @@ -2722,6 +2701,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>         unsigned long nr_reclaimed, nr_scanned;
>         struct lruvec *target_lruvec;
>         bool reclaimable = false;
> +       unsigned long file;
>
>         target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
>
> @@ -2731,6 +2711,44 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>         nr_reclaimed = sc->nr_reclaimed;
>         nr_scanned = sc->nr_scanned;
>
> +       /*
> +        * Target desirable inactive:active list ratios for the anon
> +        * and file LRU lists.
> +        */
> +       if (!sc->force_deactivate) {
> +               unsigned long refaults;
> +
> +               if (inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
> +                       sc->may_deactivate |= DEACTIVATE_ANON;
> +               else
> +                       sc->may_deactivate &= ~DEACTIVATE_ANON;
> +
> +               /*
> +                * When refaults are being observed, it means a new
> +                * workingset is being established. Deactivate to get
> +                * rid of any stale active pages quickly.
> +                */
> +               refaults = lruvec_page_state(target_lruvec,
> +                                            WORKINGSET_ACTIVATE);
> +               if (refaults != target_lruvec->refaults ||
> +                   inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
> +                       sc->may_deactivate |= DEACTIVATE_FILE;
> +               else
> +                       sc->may_deactivate &= ~DEACTIVATE_FILE;
> +       } else
> +               sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
> +
> +       /*
> +        * If we have plenty of inactive file pages that aren't
> +        * thrashing, try to reclaim those first before touching
> +        * anonymous pages.
> +        */
> +       file = lruvec_page_state(target_lruvec, LRU_INACTIVE_FILE);
> +       if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
> +               sc->cache_trim_mode = 1;
> +       else
> +               sc->cache_trim_mode = 0;
> +
>         /*
>          * Prevent the reclaimer from falling into the cache trap: as
>          * cache pages start out inactive, every cache fault will tip
> @@ -2741,10 +2759,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>          * anon pages.  Try to detect this based on file LRU size.
>          */
>         if (!cgroup_reclaim(sc)) {
> -               unsigned long file;
> -               unsigned long free;
> -               int z;
>                 unsigned long total_high_wmark = 0;
> +               unsigned long free, anon;
> +               int z;
>
>                 free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
>                 file = node_page_state(pgdat, NR_ACTIVE_FILE) +
> @@ -2758,7 +2775,17 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>                         total_high_wmark += high_wmark_pages(zone);
>                 }
>
> -               sc->file_is_tiny = file + free <= total_high_wmark;
> +               /*
> +                * Consider anon: if that's low too, this isn't a
> +                * runaway file reclaim problem, but rather just
> +                * extreme pressure. Reclaim as per usual then.
> +                */
> +               anon = node_page_state(pgdat, NR_INACTIVE_ANON);
> +
> +               sc->file_is_tiny =
> +                       file + free <= total_high_wmark &&
> +                       !(sc->may_deactivate & DEACTIVATE_ANON) &&
> +                       anon >> sc->priority;
>         }
>
>         shrink_node_memcgs(pgdat, sc);
> @@ -3062,9 +3089,27 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>         if (sc->compaction_ready)
>                 return 1;
>
> +       /*
> +        * We make inactive:active ratio decisions based on the node's
> +        * composition of memory, but a restrictive reclaim_idx or a
> +        * memory.low cgroup setting can exempt large amounts of
> +        * memory from reclaim. Neither of which are very common, so
> +        * instead of doing costly eligibility calculations of the
> +        * entire cgroup subtree up front, we assume the estimates are
> +        * good, and retry with forcible deactivation if that fails.
> +        */
> +       if (sc->skipped_deactivate) {
> +               sc->priority = initial_priority;
> +               sc->force_deactivate = 1;
> +               sc->skipped_deactivate = 0;
> +               goto retry;
> +       }
> +

Not really an objection but in the worst case this will double the
cost of direct reclaim.

>         /* Untapped cgroup reserves?  Don't OOM, retry. */
>         if (sc->memcg_low_skipped) {
>                 sc->priority = initial_priority;
> +               sc->force_deactivate = 0;
> +               sc->skipped_deactivate = 0;
>                 sc->memcg_low_reclaim = 1;
>                 sc->memcg_low_skipped = 0;
>                 goto retry;
> @@ -3347,18 +3392,20 @@ static void age_active_anon(struct pglist_data *pgdat,
>                                 struct scan_control *sc)
>  {
>         struct mem_cgroup *memcg;
> +       struct lruvec *lruvec;
>
>         if (!total_swap_pages)
>                 return;
>
> +       lruvec = mem_cgroup_lruvec(NULL, pgdat);
> +       if (!inactive_is_low(lruvec, LRU_INACTIVE_ANON))
> +               return;
> +
>         memcg = mem_cgroup_iter(NULL, NULL, NULL);
>         do {
> -               struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> -
> -               if (inactive_list_is_low(lruvec, false, sc, true))
> -                       shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> -                                          sc, LRU_ACTIVE_ANON);
> -
> +               lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +               shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> +                                  sc, LRU_ACTIVE_ANON);
>                 memcg = mem_cgroup_iter(NULL, memcg, NULL);
>         } while (memcg);
>  }
> --
> 2.24.0
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] mm: vmscan: detect file thrashing at the reclaim root
  2019-11-14 23:47   ` Shakeel Butt
@ 2019-11-15 16:07     ` Johannes Weiner
  2019-11-15 16:52       ` Shakeel Butt
  0 siblings, 1 reply; 20+ messages in thread
From: Johannes Weiner @ 2019-11-15 16:07 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Andrey Ryabinin, Suren Baghdasaryan, Rik van Riel,
	Michal Hocko, Linux MM, Cgroups, LKML, Kernel Team

On Thu, Nov 14, 2019 at 03:47:59PM -0800, Shakeel Butt wrote:
> On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > We use refault information to determine whether the cache workingset
> > is stable or transitioning, and dynamically adjust the inactive:active
> > file LRU ratio so as to maximize protection from one-off cache during
> > stable periods, and minimize IO during transitions.
> >
> > With cgroups and their nested LRU lists, we currently don't do this
> > correctly. While recursive cgroup reclaim establishes a relative LRU
> > order among the pages of all involved cgroups, refaults only affect
> > the local LRU order in the cgroup in which they are occuring. As a
> > result, cache transitions can take longer in a cgrouped system as the
> > active pages of sibling cgroups aren't challenged when they should be.
> >
> > [ Right now, this is somewhat theoretical, because the siblings, under
> >   continued regular reclaim pressure, should eventually run out of
> >   inactive pages - and since inactive:active *size* balancing is also
> >   done on a cgroup-local level, we will challenge the active pages
> >   eventually in most cases. But the next patch will move that relative
> >   size enforcement to the reclaim root as well, and then this patch
> >   here will be necessary to propagate refault pressure to siblings. ]
> >
> > This patch moves refault detection to the root of reclaim. Instead of
> > remembering the cgroup owner of an evicted page, remember the cgroup
> > that caused the reclaim to happen. When refaults later occur, they'll
> > correctly influence the cross-cgroup LRU order that reclaim follows.
> 
> Can you please explain how "they'll correctly influence"? I see that
> if the refaulted page was evicted due to pressure in some ancestor,
> then that's ancestor's refault distance and active file size will be
> used to decide to activate the refaulted page but how that is
> influencing cross-cgroup LRUs?

I take it the next patch answered your question: Activating a page
inside a cgroup has an effect on how it's reclaimed relative to pages
in sibling cgroups. So the influence part isn't new with this change -
it's about recognizing that an activation has an effect on a wider
scope than just the local cgroup, and considering that scope when
making the decision whether to activate or not.

> > @@ -302,6 +330,17 @@ void workingset_refault(struct page *page, void *shadow)
> >          */
> >         refault_distance = (refault - eviction) & EVICTION_MASK;
> >
> > +       /*
> > +        * The activation decision for this page is made at the level
> > +        * where the eviction occurred, as that is where the LRU order
> > +        * during page reclaim is being determined.
> > +        *
> > +        * However, the cgroup that will own the page is the one that
> > +        * is actually experiencing the refault event.
> > +        */
> > +       memcg = get_mem_cgroup_from_mm(current->mm);
> 
> Why not page_memcg(page)? page is locked.

Nice catch! Indeed, the page is charged and locked at this point, so
we don't have to do another lookup and refcounting dance here.

Delta patch:

---

From 8984f37f3e88b1b39c7d6470b313730093b24474 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 15 Nov 2019 09:14:04 -0500
Subject: [PATCH] mm: vmscan: detect file thrashing at the reclaim root fix

Shakeel points out that the page is locked and already charged by the
time we call workingset_refault(). Instead of doing another cgroup
lookup and reference from current->mm we can simply use page_memcg().

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/workingset.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index f0885d9f41cd..474186b76ced 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -338,7 +338,7 @@ void workingset_refault(struct page *page, void *shadow)
 	 * However, the cgroup that will own the page is the one that
 	 * is actually experiencing the refault event.
 	 */
-	memcg = get_mem_cgroup_from_mm(current->mm);
+	memcg = page_memcg(page);
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
 	inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
@@ -349,7 +349,7 @@ void workingset_refault(struct page *page, void *shadow)
 	 * the memory was available to the page cache.
 	 */
 	if (refault_distance > active_file)
-		goto out_memcg;
+		goto out;
 
 	SetPageActive(page);
 	advance_inactive_age(memcg, pgdat);
@@ -360,9 +360,6 @@ void workingset_refault(struct page *page, void *shadow)
 		SetPageWorkingset(page);
 		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
 	}
-
-out_memcg:
-	mem_cgroup_put(memcg);
 out:
 	rcu_read_unlock();
 }
-- 
2.24.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] mm: vmscan: detect file thrashing at the reclaim root
  2019-11-15 16:07     ` Johannes Weiner
@ 2019-11-15 16:52       ` Shakeel Butt
  0 siblings, 0 replies; 20+ messages in thread
From: Shakeel Butt @ 2019-11-15 16:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andrey Ryabinin, Suren Baghdasaryan, Rik van Riel,
	Michal Hocko, Linux MM, Cgroups, LKML, Kernel Team

On Fri, Nov 15, 2019 at 8:07 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Nov 14, 2019 at 03:47:59PM -0800, Shakeel Butt wrote:
> > On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > We use refault information to determine whether the cache workingset
> > > is stable or transitioning, and dynamically adjust the inactive:active
> > > file LRU ratio so as to maximize protection from one-off cache during
> > > stable periods, and minimize IO during transitions.
> > >
> > > With cgroups and their nested LRU lists, we currently don't do this
> > > correctly. While recursive cgroup reclaim establishes a relative LRU
> > > order among the pages of all involved cgroups, refaults only affect
> > > the local LRU order in the cgroup in which they are occuring. As a
> > > result, cache transitions can take longer in a cgrouped system as the
> > > active pages of sibling cgroups aren't challenged when they should be.
> > >
> > > [ Right now, this is somewhat theoretical, because the siblings, under
> > >   continued regular reclaim pressure, should eventually run out of
> > >   inactive pages - and since inactive:active *size* balancing is also
> > >   done on a cgroup-local level, we will challenge the active pages
> > >   eventually in most cases. But the next patch will move that relative
> > >   size enforcement to the reclaim root as well, and then this patch
> > >   here will be necessary to propagate refault pressure to siblings. ]
> > >
> > > This patch moves refault detection to the root of reclaim. Instead of
> > > remembering the cgroup owner of an evicted page, remember the cgroup
> > > that caused the reclaim to happen. When refaults later occur, they'll
> > > correctly influence the cross-cgroup LRU order that reclaim follows.
> >
> > Can you please explain how "they'll correctly influence"? I see that
> > if the refaulted page was evicted due to pressure in some ancestor,
> > then that's ancestor's refault distance and active file size will be
> > used to decide to activate the refaulted page but how that is
> > influencing cross-cgroup LRUs?
>
> I take it the next patch answered your question: Activating a page
> inside a cgroup has an effect on how it's reclaimed relative to pages
> in sibling cgroups. So the influence part isn't new with this change -
> it's about recognizing that an activation has an effect on a wider
> scope than just the local cgroup, and considering that scope when
> making the decision whether to activate or not.
>

Thanks for the clarification.

> > > @@ -302,6 +330,17 @@ void workingset_refault(struct page *page, void *shadow)
> > >          */
> > >         refault_distance = (refault - eviction) & EVICTION_MASK;
> > >
> > > +       /*
> > > +        * The activation decision for this page is made at the level
> > > +        * where the eviction occurred, as that is where the LRU order
> > > +        * during page reclaim is being determined.
> > > +        *
> > > +        * However, the cgroup that will own the page is the one that
> > > +        * is actually experiencing the refault event.
> > > +        */
> > > +       memcg = get_mem_cgroup_from_mm(current->mm);
> >
> > Why not page_memcg(page)? page is locked.
>
> Nice catch! Indeed, the page is charged and locked at this point, so
> we don't have to do another lookup and refcounting dance here.
>
> Delta patch:
>
> ---
>
> From 8984f37f3e88b1b39c7d6470b313730093b24474 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Fri, 15 Nov 2019 09:14:04 -0500
> Subject: [PATCH] mm: vmscan: detect file thrashing at the reclaim root fix
>
> Shakeel points out that the page is locked and already charged by the
> time we call workingset_refault(). Instead of doing another cgroup
> lookup and reference from current->mm we can simply use page_memcg().
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

For the complete patch:

Reviewed-by: Shakeel Butt <shakeelb@google.com>

> ---
>  mm/workingset.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/mm/workingset.c b/mm/workingset.c
> index f0885d9f41cd..474186b76ced 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -338,7 +338,7 @@ void workingset_refault(struct page *page, void *shadow)
>          * However, the cgroup that will own the page is the one that
>          * is actually experiencing the refault event.
>          */
> -       memcg = get_mem_cgroup_from_mm(current->mm);
> +       memcg = page_memcg(page);
>         lruvec = mem_cgroup_lruvec(memcg, pgdat);
>
>         inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
> @@ -349,7 +349,7 @@ void workingset_refault(struct page *page, void *shadow)
>          * the memory was available to the page cache.
>          */
>         if (refault_distance > active_file)
> -               goto out_memcg;
> +               goto out;
>
>         SetPageActive(page);
>         advance_inactive_age(memcg, pgdat);
> @@ -360,9 +360,6 @@ void workingset_refault(struct page *page, void *shadow)
>                 SetPageWorkingset(page);
>                 inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
>         }
> -
> -out_memcg:
> -       mem_cgroup_put(memcg);
>  out:
>         rcu_read_unlock();
>  }
> --
> 2.24.0
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] mm: vmscan: enforce inactive:active ratio at the reclaim root
  2019-11-15  0:29   ` Shakeel Butt
@ 2019-11-27 22:16     ` Shakeel Butt
  0 siblings, 0 replies; 20+ messages in thread
From: Shakeel Butt @ 2019-11-27 22:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andrey Ryabinin, Suren Baghdasaryan, Rik van Riel,
	Michal Hocko, Linux MM, Cgroups, LKML, Kernel Team

On Thu, Nov 14, 2019 at 4:29 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > We split the LRU lists into inactive and an active parts to maximize
> > workingset protection while allowing just enough inactive cache space
> > to faciltate readahead and writeback for one-off file accesses (e.g. a
> > linear scan through a file, or logging); or just enough inactive anon
> > to maintain recent reference information when reclaim needs to swap.
> >
> > With cgroups and their nested LRU lists, we currently don't do this
> > correctly. While recursive cgroup reclaim establishes a relative LRU
> > order among the pages of all involved cgroups, inactive:active size
> > decisions are done on a per-cgroup level. As a result, we'll reclaim a
> > cgroup's workingset when it doesn't have cold pages, even when one of
> > its siblings has plenty of it that should be reclaimed first.
> >
> > For example: workload A has 50M worth of hot cache but doesn't do any
> > one-off file accesses; meanwhile, parallel workload B scans files and
> > rarely accesses the same page twice.
> >
> > If these workloads were to run in an uncgrouped system, A would be
> > protected from the high rate of cache faults from B. But if they were
> > put in parallel cgroups for memory accounting purposes, B's fast cache
> > fault rate would push out the hot cache pages of A. This is unexpected
> > and undesirable - the "scan resistance" of the page cache is broken.
> >
> > This patch moves inactive:active size balancing decisions to the root
> > of reclaim - the same level where the LRU order is established.
> >
> > It does this by looking at the recursize size of the inactive and the
> > active file sets of the cgroup subtree at the beginning of the reclaim
> > cycle, and then making a decision - scan or skip active pages - that
> > applies throughout the entire run and to every cgroup involved.
>
> Oh ok, this answer my question on previous patch. The reclaim root
> looks at the full tree inactive and active count to make decisions and
> thus active list of some descendant cgroup will be protected from the
> inactive list of its sibling.
>
> >
> > With that in place, in the test above, the VM will recognize that
> > there are plenty of inactive pages in the combined cache set of
> > workloads A and B and prefer the one-off cache in B over the hot pages
> > in A. The scan resistance of the cache is restored.
> >
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Forgot to add:

Reviewed-by: Shakeel Butt <shakeelb@google.com>

>
> BTW no love for the anon memory? The whole "workingset" mechanism only
> works for file pages. Are there any plans to extend it for anon as
> well?
>
> > ---
> >  include/linux/mmzone.h |   4 +-
> >  mm/vmscan.c            | 185 ++++++++++++++++++++++++++---------------
> >  2 files changed, 118 insertions(+), 71 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 7a09087e8c77..454a230ad417 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -229,12 +229,12 @@ enum lru_list {
> >
> >  #define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
> >
> > -static inline int is_file_lru(enum lru_list lru)
> > +static inline bool is_file_lru(enum lru_list lru)
> >  {
> >         return (lru == LRU_INACTIVE_FILE || lru == LRU_ACTIVE_FILE);
> >  }
> >
> > -static inline int is_active_lru(enum lru_list lru)
> > +static inline bool is_active_lru(enum lru_list lru)
> >  {
> >         return (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE);
> >  }
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 527617ee9b73..df859b1d583c 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -79,6 +79,13 @@ struct scan_control {
> >          */
> >         struct mem_cgroup *target_mem_cgroup;
> >
> > +       /* Can active pages be deactivated as part of reclaim? */
> > +#define DEACTIVATE_ANON 1
> > +#define DEACTIVATE_FILE 2
> > +       unsigned int may_deactivate:2;
> > +       unsigned int force_deactivate:1;
> > +       unsigned int skipped_deactivate:1;
> > +
> >         /* Writepage batching in laptop mode; RECLAIM_WRITE */
> >         unsigned int may_writepage:1;
> >
> > @@ -101,6 +108,9 @@ struct scan_control {
> >         /* One of the zones is ready for compaction */
> >         unsigned int compaction_ready:1;
> >
> > +       /* There is easily reclaimable cold cache in the current node */
> > +       unsigned int cache_trim_mode:1;
> > +
> >         /* The file pages on the current node are dangerously low */
> >         unsigned int file_is_tiny:1;
> >
> > @@ -2154,6 +2164,20 @@ unsigned long reclaim_pages(struct list_head *page_list)
> >         return nr_reclaimed;
> >  }
> >
> > +static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
> > +                                struct lruvec *lruvec, struct scan_control *sc)
> > +{
> > +       if (is_active_lru(lru)) {
> > +               if (sc->may_deactivate & (1 << is_file_lru(lru)))
> > +                       shrink_active_list(nr_to_scan, lruvec, sc, lru);
> > +               else
> > +                       sc->skipped_deactivate = 1;
> > +               return 0;
> > +       }
> > +
> > +       return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);
> > +}
> > +
> >  /*
> >   * The inactive anon list should be small enough that the VM never has
> >   * to do too much work.
> > @@ -2182,59 +2206,25 @@ unsigned long reclaim_pages(struct list_head *page_list)
> >   *    1TB     101        10GB
> >   *   10TB     320        32GB
> >   */
> > -static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
> > -                                struct scan_control *sc, bool trace)
> > +static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
> >  {
> > -       enum lru_list active_lru = file * LRU_FILE + LRU_ACTIVE;
> > -       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > -       enum lru_list inactive_lru = file * LRU_FILE;
> > +       enum lru_list active_lru = inactive_lru + LRU_ACTIVE;
> >         unsigned long inactive, active;
> >         unsigned long inactive_ratio;
> > -       struct lruvec *target_lruvec;
> > -       unsigned long refaults;
> >         unsigned long gb;
> >
> > -       inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
> > -       active = lruvec_lru_size(lruvec, active_lru, sc->reclaim_idx);
> > +       inactive = lruvec_page_state(lruvec, inactive_lru);
> > +       active = lruvec_page_state(lruvec, active_lru);
> >
> > -       /*
> > -        * When refaults are being observed, it means a new workingset
> > -        * is being established. Disable active list protection to get
> > -        * rid of the stale workingset quickly.
> > -        */
> > -       target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
> > -       refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
> > -       if (file && target_lruvec->refaults != refaults) {
> > -               inactive_ratio = 0;
> > -       } else {
> > -               gb = (inactive + active) >> (30 - PAGE_SHIFT);
> > -               if (gb)
> > -                       inactive_ratio = int_sqrt(10 * gb);
> > -               else
> > -                       inactive_ratio = 1;
> > -       }
> > -
> > -       if (trace)
> > -               trace_mm_vmscan_inactive_list_is_low(pgdat->node_id, sc->reclaim_idx,
> > -                       lruvec_lru_size(lruvec, inactive_lru, MAX_NR_ZONES), inactive,
> > -                       lruvec_lru_size(lruvec, active_lru, MAX_NR_ZONES), active,
> > -                       inactive_ratio, file);
> > +       gb = (inactive + active) >> (30 - PAGE_SHIFT);
> > +       if (gb)
> > +               inactive_ratio = int_sqrt(10 * gb);
> > +       else
> > +               inactive_ratio = 1;
> >
> >         return inactive * inactive_ratio < active;
> >  }
> >
> > -static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
> > -                                struct lruvec *lruvec, struct scan_control *sc)
> > -{
> > -       if (is_active_lru(lru)) {
> > -               if (inactive_list_is_low(lruvec, is_file_lru(lru), sc, true))
> > -                       shrink_active_list(nr_to_scan, lruvec, sc, lru);
> > -               return 0;
> > -       }
> > -
> > -       return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);
> > -}
> > -
> >  enum scan_balance {
> >         SCAN_EQUAL,
> >         SCAN_FRACT,
> > @@ -2296,28 +2286,17 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> >
> >         /*
> >          * If the system is almost out of file pages, force-scan anon.
> > -        * But only if there are enough inactive anonymous pages on
> > -        * the LRU. Otherwise, the small LRU gets thrashed.
> >          */
> > -       if (sc->file_is_tiny &&
> > -           !inactive_list_is_low(lruvec, false, sc, false) &&
> > -           lruvec_lru_size(lruvec, LRU_INACTIVE_ANON,
> > -                           sc->reclaim_idx) >> sc->priority) {
> > +       if (sc->file_is_tiny) {
> >                 scan_balance = SCAN_ANON;
> >                 goto out;
> >         }
> >
> >         /*
> > -        * If there is enough inactive page cache, i.e. if the size of the
> > -        * inactive list is greater than that of the active list *and* the
> > -        * inactive list actually has some pages to scan on this priority, we
> > -        * do not reclaim anything from the anonymous working set right now.
> > -        * Without the second condition we could end up never scanning an
> > -        * lruvec even if it has plenty of old anonymous pages unless the
> > -        * system is under heavy pressure.
> > +        * If there is enough inactive page cache, we do not reclaim
> > +        * anything from the anonymous working right now.
> >          */
> > -       if (!inactive_list_is_low(lruvec, true, sc, false) &&
> > -           lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, sc->reclaim_idx) >> sc->priority) {
> > +       if (sc->cache_trim_mode) {
> >                 scan_balance = SCAN_FILE;
> >                 goto out;
> >         }
> > @@ -2582,7 +2561,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> >          * Even if we did not try to evict anon pages at all, we want to
> >          * rebalance the anon lru active/inactive ratio.
> >          */
> > -       if (total_swap_pages && inactive_list_is_low(lruvec, false, sc, true))
> > +       if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON))
> >                 shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> >                                    sc, LRU_ACTIVE_ANON);
> >  }
> > @@ -2722,6 +2701,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >         unsigned long nr_reclaimed, nr_scanned;
> >         struct lruvec *target_lruvec;
> >         bool reclaimable = false;
> > +       unsigned long file;
> >
> >         target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
> >
> > @@ -2731,6 +2711,44 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >         nr_reclaimed = sc->nr_reclaimed;
> >         nr_scanned = sc->nr_scanned;
> >
> > +       /*
> > +        * Target desirable inactive:active list ratios for the anon
> > +        * and file LRU lists.
> > +        */
> > +       if (!sc->force_deactivate) {
> > +               unsigned long refaults;
> > +
> > +               if (inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
> > +                       sc->may_deactivate |= DEACTIVATE_ANON;
> > +               else
> > +                       sc->may_deactivate &= ~DEACTIVATE_ANON;
> > +
> > +               /*
> > +                * When refaults are being observed, it means a new
> > +                * workingset is being established. Deactivate to get
> > +                * rid of any stale active pages quickly.
> > +                */
> > +               refaults = lruvec_page_state(target_lruvec,
> > +                                            WORKINGSET_ACTIVATE);
> > +               if (refaults != target_lruvec->refaults ||
> > +                   inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
> > +                       sc->may_deactivate |= DEACTIVATE_FILE;
> > +               else
> > +                       sc->may_deactivate &= ~DEACTIVATE_FILE;
> > +       } else
> > +               sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
> > +
> > +       /*
> > +        * If we have plenty of inactive file pages that aren't
> > +        * thrashing, try to reclaim those first before touching
> > +        * anonymous pages.
> > +        */
> > +       file = lruvec_page_state(target_lruvec, LRU_INACTIVE_FILE);
> > +       if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
> > +               sc->cache_trim_mode = 1;
> > +       else
> > +               sc->cache_trim_mode = 0;
> > +
> >         /*
> >          * Prevent the reclaimer from falling into the cache trap: as
> >          * cache pages start out inactive, every cache fault will tip
> > @@ -2741,10 +2759,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >          * anon pages.  Try to detect this based on file LRU size.
> >          */
> >         if (!cgroup_reclaim(sc)) {
> > -               unsigned long file;
> > -               unsigned long free;
> > -               int z;
> >                 unsigned long total_high_wmark = 0;
> > +               unsigned long free, anon;
> > +               int z;
> >
> >                 free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
> >                 file = node_page_state(pgdat, NR_ACTIVE_FILE) +
> > @@ -2758,7 +2775,17 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >                         total_high_wmark += high_wmark_pages(zone);
> >                 }
> >
> > -               sc->file_is_tiny = file + free <= total_high_wmark;
> > +               /*
> > +                * Consider anon: if that's low too, this isn't a
> > +                * runaway file reclaim problem, but rather just
> > +                * extreme pressure. Reclaim as per usual then.
> > +                */
> > +               anon = node_page_state(pgdat, NR_INACTIVE_ANON);
> > +
> > +               sc->file_is_tiny =
> > +                       file + free <= total_high_wmark &&
> > +                       !(sc->may_deactivate & DEACTIVATE_ANON) &&
> > +                       anon >> sc->priority;
> >         }
> >
> >         shrink_node_memcgs(pgdat, sc);
> > @@ -3062,9 +3089,27 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> >         if (sc->compaction_ready)
> >                 return 1;
> >
> > +       /*
> > +        * We make inactive:active ratio decisions based on the node's
> > +        * composition of memory, but a restrictive reclaim_idx or a
> > +        * memory.low cgroup setting can exempt large amounts of
> > +        * memory from reclaim. Neither of which are very common, so
> > +        * instead of doing costly eligibility calculations of the
> > +        * entire cgroup subtree up front, we assume the estimates are
> > +        * good, and retry with forcible deactivation if that fails.
> > +        */
> > +       if (sc->skipped_deactivate) {
> > +               sc->priority = initial_priority;
> > +               sc->force_deactivate = 1;
> > +               sc->skipped_deactivate = 0;
> > +               goto retry;
> > +       }
> > +
>
> Not really an objection but in the worst case this will double the
> cost of direct reclaim.
>
> >         /* Untapped cgroup reserves?  Don't OOM, retry. */
> >         if (sc->memcg_low_skipped) {
> >                 sc->priority = initial_priority;
> > +               sc->force_deactivate = 0;
> > +               sc->skipped_deactivate = 0;
> >                 sc->memcg_low_reclaim = 1;
> >                 sc->memcg_low_skipped = 0;
> >                 goto retry;
> > @@ -3347,18 +3392,20 @@ static void age_active_anon(struct pglist_data *pgdat,
> >                                 struct scan_control *sc)
> >  {
> >         struct mem_cgroup *memcg;
> > +       struct lruvec *lruvec;
> >
> >         if (!total_swap_pages)
> >                 return;
> >
> > +       lruvec = mem_cgroup_lruvec(NULL, pgdat);
> > +       if (!inactive_is_low(lruvec, LRU_INACTIVE_ANON))
> > +               return;
> > +
> >         memcg = mem_cgroup_iter(NULL, NULL, NULL);
> >         do {
> > -               struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > -
> > -               if (inactive_list_is_low(lruvec, false, sc, true))
> > -                       shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> > -                                          sc, LRU_ACTIVE_ANON);
> > -
> > +               lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > +               shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> > +                                  sc, LRU_ACTIVE_ANON);
> >                 memcg = mem_cgroup_iter(NULL, memcg, NULL);
> >         } while (memcg);
> >  }
> > --
> > 2.24.0
> >


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, back to index

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-07 20:53 [PATCH 0/3] mm: fix page aging across multiple cgroups Johannes Weiner
2019-11-07 20:53 ` [PATCH 1/3] mm: vmscan: move file exhaustion detection to the node level Johannes Weiner
2019-11-10 22:02   ` Suren Baghdasaryan
2019-11-10 22:09   ` Khadarnimcaan Khadarnimcaan
2019-11-07 20:53 ` [PATCH 2/3] mm: vmscan: detect file thrashing at the reclaim root Johannes Weiner
2019-11-11  2:01   ` Suren Baghdasaryan
2019-11-12 17:45     ` Johannes Weiner
2019-11-12 18:45       ` Suren Baghdasaryan
2019-11-12 18:59         ` Johannes Weiner
2019-11-12 20:35           ` Suren Baghdasaryan
2019-11-14 23:47   ` Shakeel Butt
2019-11-15 16:07     ` Johannes Weiner
2019-11-15 16:52       ` Shakeel Butt
2019-11-07 20:53 ` [PATCH 3/3] mm: vmscan: enforce inactive:active ratio " Johannes Weiner
2019-11-11  2:15   ` Suren Baghdasaryan
2019-11-12 18:00     ` Johannes Weiner
2019-11-12 19:13       ` Suren Baghdasaryan
2019-11-12 20:34         ` Suren Baghdasaryan
2019-11-15  0:29   ` Shakeel Butt
2019-11-27 22:16     ` Shakeel Butt

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git