linux-mm.kvack.org archive mirror
* [PATCH 00/14] mm: balance LRU lists based on relative thrashing v2
@ 2020-05-20 23:25 Johannes Weiner
  2020-05-20 23:25 ` [PATCH 01/14] mm: fix LRU balancing effect of new transparent huge pages Johannes Weiner
                   ` (13 more replies)
  0 siblings, 14 replies; 35+ messages in thread
From: Johannes Weiner @ 2020-05-20 23:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Minchan Kim, Michal Hocko, Andrew Morton,
	Joonsoo Kim, linux-kernel, kernel-team

The reclaim code that balances between swapping and cache reclaim
tries to predict likely reuse based on in-memory reference patterns
alone. This works in many cases, but when it fails it cannot detect
when the cache is thrashing pathologically, or when we're in the
middle of a swap storm.

The high seek cost of rotational drives under which the algorithm
evolved also meant that mistakes could quickly result in lockups from
too aggressive swapping (which is predominantly random IO). As a
result, the balancing code has been tuned over time to a point where
it mostly goes for page cache and defers swapping until the VM is
under significant memory pressure.

The resulting strategy doesn't make optimal caching decisions - where
optimal is the least amount of IO required to execute the workload.

The proliferation of fast random IO devices such as SSDs, in-memory
compression such as zswap, and persistent memory technologies on the
horizon, has made this undesirable behavior very noticeable: Even in
the presence of large amounts of cold anonymous memory and a capable
swap device, the VM refuses to even seriously scan these pages, and
can leave the page cache thrashing needlessly.

This series sets out to address this. Since commit a528910e12ec
("mm: thrash detection-based file cache sizing") we have exact tracking of
refault IO - the ultimate cost of reclaiming the wrong pages. This
allows us to use an IO cost based balancing model that is more
aggressive about scanning anonymous memory when the cache is
thrashing, while being able to avoid unnecessary swap storms.

These patches base the LRU balance on the rate of refaults on each
list, times the relative IO cost between swap device and filesystem
(swappiness), in order to optimize reclaim for least IO cost incurred.
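
To sketch the direction (a simplified version of the get_scan_count()
changes later in the series, not the literal code; scan_target stands
in for the per-list scan goal, anon_cost/file_cost are the per-lruvec
cost counters introduced further down):

        /*
         * swappiness encodes the relative per-IO cost of swap vs
         * filesystem paging on a 0..200 scale; anon_cost/file_cost
         * accumulate the observed IO cost of reclaiming each list.
         */
        anon_prio = swappiness;
        file_prio = 200 - swappiness;
        total_cost = lruvec->anon_cost + lruvec->file_cost;

        ap = anon_prio * (total_cost + 1) / (lruvec->anon_cost + 1);
        fp = file_prio * (total_cost + 1) / (lruvec->file_cost + 1);

        scan_anon = scan_target * ap / (ap + fp);
        scan_file = scan_target * fp / (ap + fp);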

	History

I floated these changes in 2016. At the time they were incomplete and
full of workarounds due to a lack of infrastructure in the reclaim
code: We didn't have PageWorkingset, we didn't have hierarchical
cgroup statistics, and there were problems with the cgroup swap
controller. As swapping wasn't too high a priority then, the patches
stalled out.
With all dependencies in place now, here we are again with much
cleaner, feature-complete patches.

I kept the acks for patches that stayed materially the same :-)

Below is a series of test results that demonstrate certain problematic
behavior of the current code, as well as showcase the new code's more
predictable and appropriate balancing decisions.

	Test #1: No convergence

This test shows an edge case where the VM currently doesn't converge
at all on a new file workingset with a stale anon/tmpfs set.

The test sets up a cold anon set the size of 3/4 RAM, then tries to
establish a new file set half the size of RAM (flat access pattern).

The vanilla kernel refuses to even scan anon pages and never
converges. The file set is perpetually served from the filesystem.

The first test kernel is with the series up to the workingset patch
applied. This allows thrashing page cache to challenge the anonymous
workingset. The VM then scans the lists based on the current
scanned/rotated balancing algorithm. It converges on a stable state
where all cold anon pages are pushed out and the fileset is served
entirely from cache:

			    noconverge/5.7-rc5-mm	noconverge/5.7-rc5-mm-workingset
Scanned			417719308.00 (    +0.00%)		64091155.00 (   -84.66%)
Reclaimed		417711094.00 (    +0.00%)		61640308.00 (   -85.24%)
Reclaim efficiency %	      100.00 (    +0.00%)		      96.18 (    -3.78%)
Scanned file		417719308.00 (    +0.00%)		59211118.00 (   -85.83%)
Scanned anon			0.00 (    +0.00%)	         4880037.00 (          )
Swapouts			0.00 (    +0.00%)	         2439957.00 (          )
Swapins				0.00 (    +0.00%)		     257.00 (          )
Refaults		415246605.00 (    +0.00%)		59183722.00 (   -85.75%)
Restore refaults		0.00 (    +0.00%)	        54988252.00 (          )

The second test kernel is with the full patch series applied, which
replaces the scanned/rotated ratios with refault/swapin rate-based
balancing. It evicts the cold anon pages more aggressively in the
presence of a thrashing cache and the absence of swapins, and so
converges with about 60% of the IO and reclaim activity:

			noconverge/5.7-rc5-mm-workingset	noconverge/5.7-rc5-mm-lrubalance
Scanned				64091155.00 (    +0.00%)		37579741.00 (   -41.37%)
Reclaimed			61640308.00 (    +0.00%)		35129293.00 (   -43.01%)
Reclaim efficiency %		      96.18 (    +0.00%)		      93.48 (    -2.78%)
Scanned file			59211118.00 (    +0.00%)		32708385.00 (   -44.76%)
Scanned anon			 4880037.00 (    +0.00%)		 4871356.00 (    -0.18%)
Swapouts			 2439957.00 (    +0.00%)		 2435565.00 (    -0.18%)
Swapins				     257.00 (    +0.00%)		     262.00 (    +1.94%)
Refaults			59183722.00 (    +0.00%)		32675667.00 (   -44.79%)
Restore refaults		54988252.00 (    +0.00%)		28480430.00 (   -48.21%)

We're triggering this case in host sideloading scenarios: When a
host's primary workload is not saturating the machine (primary load is
usually driven by user activity), we can optimistically sideload a
batch job; if user activity picks up and the primary workload needs
the whole host during this time, we freeze the sideload and rely on it
getting pushed to swap. Frequently that swapping doesn't happen and
the completely inactive sideload simply stays resident while the
expanding primary workload is struggling to gain ground.

	Test #2: Kernel build

This test is a kernel build that is slightly memory-restricted (make
-j4 inside a 400M cgroup).

Despite the very aggressive swapping of cold anon pages in test #1,
this test shows that the new kernel carefully balances swap against
cache refaults when both the anon and the file set are pressured.

It shows the patched kernel to be slightly better at finding the
coldest memory from the combined anon and file set to evict under
pressure. The result is lower aggregate reclaim and paging activity:

				    5.7-rc5-mm	5.7-rc5-mm-lrubalance
Real time		   210.60 (    +0.00%)	   210.97 (    +0.18%)
User time		   745.42 (    +0.00%)	   746.48 (    +0.14%)
System time		    69.78 (    +0.00%)	    69.79 (    +0.02%)
Scanned file		354682.00 (    +0.00%)	293661.00 (   -17.20%)
Scanned anon		465381.00 (    +0.00%)	378144.00 (   -18.75%)
Swapouts		185920.00 (    +0.00%)	147801.00 (   -20.50%)
Swapins			 34583.00 (    +0.00%)	 32491.00 (    -6.05%)
Refaults		212664.00 (    +0.00%)	172409.00 (   -18.93%)
Restore refaults	 48861.00 (    +0.00%)	 80091.00 (   +63.91%)
Total paging IO		433167.00 (    +0.00%)	352701.00 (   -18.58%)

	Test #3: Overload

This next test is not about performance, but rather about the
predictability of the algorithm. The current balancing behavior
doesn't always lead to comprehensible results, which makes performance
analysis and parameter tuning (e.g. swappiness) very difficult.

The test shows the balancing behavior under equivalent anon and file
input. Anon and file sets are created of equal size (3/4 RAM), have
the same access patterns (a hot-cold gradient), and synchronized
access rates. Swappiness is raised from the default of 60 to 100 to
indicate equal IO cost between swap and cache.

With the vanilla balancing code, anon scans make up around 9% of the
total pages scanned, or a ~1:10 ratio. This is a surprisingly skewed
ratio, and it's an outcome that is hard to explain given the input
parameters to the VM.

The new balancing model targets a 1:2 balance: All else being equal,
reclaiming a file page costs one page IO - the refault; reclaiming an
anon page costs two IOs - the swapout and the swapin. In the test we
observe a ~1:3 balance.
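
As a sanity check of that target, assuming swappiness 100 (equal
per-IO cost):

        cost of reclaiming a file page:  1 IO  (the eventual refault)
        cost of reclaiming an anon page: 2 IOs (swapout + swapin)

        => for the same IO budget, anon should be scanned at half the
           rate of file, i.e. a 1:2 anon:file scan ratio.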

The scanned and paging IO numbers indicate that the anon LRU algorithm
we have in place right now does a slightly worse job at picking the
coldest pages compared to the file algorithm. There is ongoing work to
improve this, like Joonsoo's anon workingset patches; however, it's
difficult to compare the two aging strategies when the balancing
between them is behaving unintuitively.

The slightly less efficient anon reclaim results in a deviation from
the optimal 1:2 scan ratio we would like to see here - however, 1:3 is
much closer to what we'd want to see in this test than the vanilla
kernel's aging of 10+ cache pages for every anonymous one:

			overload-100/5.7-rc5-mm-workingset	overload-100/5.7-rc5-mm-lrubalance-realfile
Scanned				 533633725.00 (    +0.00%)			  595687785.00 (   +11.63%)
Reclaimed			 494325440.00 (    +0.00%)			  518154380.00 (    +4.82%)
Reclaim efficiency %			92.63 (    +0.00%)				 86.98 (    -6.03%)
Scanned file			 484532894.00 (    +0.00%)			  456937722.00 (    -5.70%)
Scanned anon			  49100831.00 (    +0.00%)			  138750063.00 (  +182.58%)
Swapouts			   8096423.00 (    +0.00%)			   48982142.00 (  +504.98%)
Swapins				  10027384.00 (    +0.00%)			   62325044.00 (  +521.55%)
Refaults			 479819973.00 (    +0.00%)			  451309483.00 (    -5.94%)
Restore refaults		 426422087.00 (    +0.00%)			  399914067.00 (    -6.22%)
Total paging IO			 497943780.00 (    +0.00%)			  562616669.00 (   +12.99%)

	Test #4: Parallel IO

It's important to note that these patches only affect the situation
where the kernel has to reclaim workingset memory, which is usually a
transitional period. The vast majority of page reclaim occurring in a
system is from trimming the ever-expanding page cache.

These patches don't affect cache trimming behavior. We never swap as
long as we only have use-once cache moving through the file LRU; we
only consider swapping when the cache is actively thrashing.

The following test demonstrates this. It has an anon workingset that
takes up half of RAM and then writes a file that is twice the size of
RAM out to disk.

As the cache is funneled through the inactive file list, no anon pages
are scanned (aside from what appears to be background noise of 10 pages):

					  5.7-rc5-mm		          5.7-rc5-mm-lrubalance
Scanned			    10714722.00 (    +0.00%)		       10723445.00 (    +0.08%)
Reclaimed		    10703596.00 (    +0.00%)		       10712166.00 (    +0.08%)
Reclaim efficiency %		  99.90 (    +0.00%)			     99.89 (    -0.00%)
Scanned file		    10714722.00 (    +0.00%)		       10723435.00 (    +0.08%)
Scanned anon			   0.00 (    +0.00%)			     10.00 (          )
Swapouts			   0.00 (    +0.00%)			      7.00 (          )
Swapins				   0.00 (    +0.00%)			      0.00 (    +0.00%)
Refaults			  92.00 (    +0.00%)			     41.00 (   -54.84%)
Restore refaults		   0.00 (    +0.00%)			      0.00 (    +0.00%)
Total paging IO			  92.00 (    +0.00%)			     48.00 (   -47.31%)

These patches are based on v5.7-rc5-mm (minus linux-next and up).

 Documentation/admin-guide/sysctl/vm.rst |  23 ++++--
 fs/cifs/file.c                          |  10 +--
 fs/fuse/dev.c                           |   2 +-
 include/linux/memcontrol.h              |  13 ++++
 include/linux/mmzone.h                  |  21 ++----
 include/linux/swap.h                    |   5 +-
 include/linux/vm_event_item.h           |   4 ++
 include/linux/vmstat.h                  |   1 +
 kernel/sysctl.c                         |   3 +-
 mm/khugepaged.c                         |   8 +--
 mm/memcontrol.c                         |  18 ++---
 mm/memory.c                             |   2 +-
 mm/shmem.c                              |   6 +-
 mm/swap.c                               |  90 +++++++++++------------
 mm/swap_state.c                         |   7 +-
 mm/vmscan.c                             | 114 ++++++++++++------------------
 mm/vmstat.c                             |   4 ++
 mm/workingset.c                         |  21 ++++--
 18 files changed, 180 insertions(+), 172 deletions(-)





* [PATCH 01/14] mm: fix LRU balancing effect of new transparent huge pages
  2020-05-20 23:25 [PATCH 00/14] mm: balance LRU lists based on relative thrashing v2 Johannes Weiner
@ 2020-05-20 23:25 ` Johannes Weiner
  2020-05-27 19:48   ` Shakeel Butt
  2020-05-20 23:25 ` [PATCH 02/14] mm: keep separate anon and file statistics on page reclaim activity Johannes Weiner
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2020-05-20 23:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Minchan Kim, Michal Hocko, Andrew Morton,
	Joonsoo Kim, linux-kernel, kernel-team

Currently, THP are counted as single pages until they are split right
before being swapped out. However, at that point the VM is already in
the middle of reclaim, and adjusting the LRU balance then is useless.

Always account THP by the number of basepages, and remove the fixup
from the splitting path.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Minchan Kim <minchan@kernel.org>
---
 mm/swap.c | 25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index bf9a79fed62d..68eae1e2787a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -263,13 +263,14 @@ void rotate_reclaimable_page(struct page *page)
 }
 
 static void update_page_reclaim_stat(struct lruvec *lruvec,
-				     int file, int rotated)
+				     int file, int rotated,
+				     unsigned int nr_pages)
 {
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
-	reclaim_stat->recent_scanned[file]++;
+	reclaim_stat->recent_scanned[file] += nr_pages;
 	if (rotated)
-		reclaim_stat->recent_rotated[file]++;
+		reclaim_stat->recent_rotated[file] += nr_pages;
 }
 
 static void __activate_page(struct page *page, struct lruvec *lruvec,
@@ -286,7 +287,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
 		trace_mm_lru_activate(page);
 
 		__count_vm_event(PGACTIVATE);
-		update_page_reclaim_stat(lruvec, file, 1);
+		update_page_reclaim_stat(lruvec, file, 1, hpage_nr_pages(page));
 	}
 }
 
@@ -541,7 +542,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 
 	if (active)
 		__count_vm_event(PGDEACTIVATE);
-	update_page_reclaim_stat(lruvec, file, 0);
+	update_page_reclaim_stat(lruvec, file, 0, hpage_nr_pages(page));
 }
 
 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
@@ -557,7 +558,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
 		add_page_to_lru_list(page, lruvec, lru);
 
 		__count_vm_events(PGDEACTIVATE, hpage_nr_pages(page));
-		update_page_reclaim_stat(lruvec, file, 0);
+		update_page_reclaim_stat(lruvec, file, 0, hpage_nr_pages(page));
 	}
 }
 
@@ -582,7 +583,7 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
 
 		__count_vm_events(PGLAZYFREE, hpage_nr_pages(page));
 		count_memcg_page_event(page, PGLAZYFREE);
-		update_page_reclaim_stat(lruvec, 1, 0);
+		update_page_reclaim_stat(lruvec, 1, 0, hpage_nr_pages(page));
 	}
 }
 
@@ -890,8 +891,6 @@ EXPORT_SYMBOL(__pagevec_release);
 void lru_add_page_tail(struct page *page, struct page *page_tail,
 		       struct lruvec *lruvec, struct list_head *list)
 {
-	const int file = 0;
-
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
 	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
@@ -917,9 +916,6 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 		add_page_to_lru_list_tail(page_tail, lruvec,
 					  page_lru(page_tail));
 	}
-
-	if (!PageUnevictable(page))
-		update_page_reclaim_stat(lruvec, file, PageActive(page_tail));
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -962,8 +958,9 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 
 	if (page_evictable(page)) {
 		lru = page_lru(page);
-		update_page_reclaim_stat(lruvec, page_is_file_lru(page),
-					 PageActive(page));
+		update_page_reclaim_stat(lruvec, is_file_lru(lru),
+					 PageActive(page),
+					 hpage_nr_pages(page));
 		if (was_unevictable)
 			count_vm_event(UNEVICTABLE_PGRESCUED);
 	} else {
-- 
2.26.2




* [PATCH 02/14] mm: keep separate anon and file statistics on page reclaim activity
  2020-05-20 23:25 [PATCH 00/14] mm: balance LRU lists based on relative thrashing v2 Johannes Weiner
  2020-05-20 23:25 ` [PATCH 01/14] mm: fix LRU balancing effect of new transparent huge pages Johannes Weiner
@ 2020-05-20 23:25 ` Johannes Weiner
  2020-05-20 23:25 ` [PATCH 03/14] mm: allow swappiness that prefers reclaiming anon over the file workingset Johannes Weiner
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2020-05-20 23:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Minchan Kim, Michal Hocko, Andrew Morton,
	Joonsoo Kim, linux-kernel, kernel-team

Having statistics on pages scanned and pages reclaimed for both anon
and file pages makes it easier to evaluate changes to LRU balancing.

While at it, clean up the stat-keeping mess for isolation, putback,
reclaim stats etc. a bit: first the physical LRU operation (isolation
and putback), followed by vmstats, reclaim_stats, and then vm events.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/vm_event_item.h |  4 ++++
 mm/vmscan.c                   | 17 +++++++++--------
 mm/vmstat.c                   |  4 ++++
 3 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index ffef0f279747..24fc7c3ae7d6 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -35,6 +35,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGSCAN_KSWAPD,
 		PGSCAN_DIRECT,
 		PGSCAN_DIRECT_THROTTLE,
+		PGSCAN_ANON,
+		PGSCAN_FILE,
+		PGSTEAL_ANON,
+		PGSTEAL_FILE,
 #ifdef CONFIG_NUMA
 		PGSCAN_ZONE_RECLAIM_FAILED,
 #endif
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cc555903a332..70b0e2c6a4b9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1913,7 +1913,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	unsigned int nr_reclaimed = 0;
 	unsigned long nr_taken;
 	struct reclaim_stat stat;
-	int file = is_file_lru(lru);
+	bool file = is_file_lru(lru);
 	enum vm_event_item item;
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
@@ -1941,11 +1941,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
 	reclaim_stat->recent_scanned[file] += nr_taken;
-
 	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
+	__count_vm_events(PGSCAN_ANON + file, nr_scanned);
+
 	spin_unlock_irq(&pgdat->lru_lock);
 
 	if (nr_taken == 0)
@@ -1956,16 +1957,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	spin_lock_irq(&pgdat->lru_lock);
 
+	move_pages_to_lru(lruvec, &page_list);
+
+	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
+	reclaim_stat->recent_rotated[0] += stat.nr_activate[0];
+	reclaim_stat->recent_rotated[1] += stat.nr_activate[1];
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
-	reclaim_stat->recent_rotated[0] += stat.nr_activate[0];
-	reclaim_stat->recent_rotated[1] += stat.nr_activate[1];
-
-	move_pages_to_lru(lruvec, &page_list);
-
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
+	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
 
 	spin_unlock_irq(&pgdat->lru_lock);
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4b0c90e4de71..3d06bd89d9ec 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1201,6 +1201,10 @@ const char * const vmstat_text[] = {
 	"pgscan_kswapd",
 	"pgscan_direct",
 	"pgscan_direct_throttle",
+	"pgscan_anon",
+	"pgscan_file",
+	"pgsteal_anon",
+	"pgsteal_file",
 
 #ifdef CONFIG_NUMA
 	"zone_reclaim_failed",
-- 
2.26.2




* [PATCH 03/14] mm: allow swappiness that prefers reclaiming anon over the file workingset
  2020-05-20 23:25 [PATCH 00/14] mm: balance LRU lists based on relative thrashing v2 Johannes Weiner
  2020-05-20 23:25 ` [PATCH 01/14] mm: fix LRU balancing effect of new transparent huge pages Johannes Weiner
  2020-05-20 23:25 ` [PATCH 02/14] mm: keep separate anon and file statistics on page reclaim activity Johannes Weiner
@ 2020-05-20 23:25 ` Johannes Weiner
  2020-05-20 23:25 ` [PATCH 04/14] mm: fold and remove lru_cache_add_anon() and lru_cache_add_file() Johannes Weiner
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2020-05-20 23:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Minchan Kim, Michal Hocko, Andrew Morton,
	Joonsoo Kim, linux-kernel, kernel-team

With the advent of fast random IO devices (SSDs, PMEM) and in-memory
swap devices such as zswap, it's possible for swap to be much faster
than filesystems, and for swapping to be preferable over thrashing
filesystem caches.

Allow setting swappiness - which defines the rough relative IO cost of
cache misses between page cache and swap-backed pages - to reflect
such situations by making the swap-preferred range configurable.
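
As a worked example of the intended semantics (this mirrors the
documentation hunk below): the ratio swappiness : (200 - swappiness)
expresses how much cheaper swap IO is per page than filesystem IO. If
swap IO is on average twice as fast as filesystem IO:

        swappiness / (200 - swappiness) = 2
        3 * swappiness = 400
        swappiness = 133.33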

v2: clarify how to calculate swappiness (Minchan Kim)

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/admin-guide/sysctl/vm.rst | 23 ++++++++++++++++++-----
 kernel/sysctl.c                         |  3 ++-
 mm/vmscan.c                             |  2 +-
 3 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 0329a4d3fa9e..d46d5b7013c6 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -831,14 +831,27 @@ When page allocation performance is not a bottleneck and you want all
 swappiness
 ==========
 
-This control is used to define how aggressive the kernel will swap
-memory pages.  Higher values will increase aggressiveness, lower values
-decrease the amount of swap.  A value of 0 instructs the kernel not to
-initiate swap until the amount of free and file-backed pages is less
-than the high water mark in a zone.
+This control is used to define the rough relative IO cost of swapping
+and filesystem paging, as a value between 0 and 200. At 100, the VM
+assumes equal IO cost and will thus apply memory pressure to the page
+cache and swap-backed pages equally; lower values signify more
+expensive swap IO, higher values indicates cheaper.
+
+Keep in mind that filesystem IO patterns under memory pressure tend to
+be more efficient than swap's random IO. An optimal value will require
+experimentation and will also be workload-dependent.
 
 The default value is 60.
 
+For in-memory swap, like zram or zswap, as well as hybrid setups that
+have swap on faster devices than the filesystem, values beyond 100 can
+be considered. For example, if the random IO against the swap device
+is on average 2x faster than IO from the filesystem, swappiness should
+be 133 (x + 2x = 200, 2x = 133.33).
+
+At 0, the kernel will not initiate swap until the amount of free and
+file-backed pages is less than the high watermark in a zone.
+
 
 unprivileged_userfaultfd
 ========================
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a176d8727a3..7f15d292e44c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -131,6 +131,7 @@ static unsigned long zero_ul;
 static unsigned long one_ul = 1;
 static unsigned long long_max = LONG_MAX;
 static int one_hundred = 100;
+static int two_hundred = 200;
 static int one_thousand = 1000;
 #ifdef CONFIG_PRINTK
 static int ten_thousand = 10000;
@@ -1391,7 +1392,7 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
-		.extra2		= &one_hundred,
+		.extra2		= &two_hundred,
 	},
 #ifdef CONFIG_HUGETLB_PAGE
 	{
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 70b0e2c6a4b9..43f88b1a4f14 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -161,7 +161,7 @@ struct scan_control {
 #endif
 
 /*
- * From 0 .. 100.  Higher means more swappy.
+ * From 0 .. 200.  Higher means more swappy.
  */
 int vm_swappiness = 60;
 /*
-- 
2.26.2




* [PATCH 04/14] mm: fold and remove lru_cache_add_anon() and lru_cache_add_file()
  2020-05-20 23:25 [PATCH 00/14] mm: balance LRU lists based on relative thrashing v2 Johannes Weiner
                   ` (2 preceding siblings ...)
  2020-05-20 23:25 ` [PATCH 03/14] mm: allow swappiness that prefers reclaiming anon over the file workingset Johannes Weiner
@ 2020-05-20 23:25 ` Johannes Weiner
  2020-05-20 23:25 ` [PATCH 05/14] mm: workingset: let cache workingset challenge anon Johannes Weiner
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2020-05-20 23:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Minchan Kim, Michal Hocko, Andrew Morton,
	Joonsoo Kim, linux-kernel, kernel-team

They're the same function, and for the purpose of all callers they are
equivalent to lru_cache_add().

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Minchan Kim <minchan@kernel.org>
---
 fs/cifs/file.c       | 10 +++++-----
 fs/fuse/dev.c        |  2 +-
 include/linux/swap.h |  2 --
 mm/khugepaged.c      |  8 ++------
 mm/memory.c          |  2 +-
 mm/shmem.c           |  6 +++---
 mm/swap.c            | 38 ++++++++------------------------------
 mm/swap_state.c      |  2 +-
 8 files changed, 21 insertions(+), 49 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0b1528edebcf..169ab37bb3aa 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -4162,7 +4162,7 @@ cifs_readv_complete(struct work_struct *work)
 	for (i = 0; i < rdata->nr_pages; i++) {
 		struct page *page = rdata->pages[i];
 
-		lru_cache_add_file(page);
+		lru_cache_add(page);
 
 		if (rdata->result == 0 ||
 		    (rdata->result == -EAGAIN && got_bytes)) {
@@ -4232,7 +4232,7 @@ readpages_fill_pages(struct TCP_Server_Info *server,
 			 * fill them until the writes are flushed.
 			 */
 			zero_user(page, 0, PAGE_SIZE);
-			lru_cache_add_file(page);
+			lru_cache_add(page);
 			flush_dcache_page(page);
 			SetPageUptodate(page);
 			unlock_page(page);
@@ -4242,7 +4242,7 @@ readpages_fill_pages(struct TCP_Server_Info *server,
 			continue;
 		} else {
 			/* no need to hold page hostage */
-			lru_cache_add_file(page);
+			lru_cache_add(page);
 			unlock_page(page);
 			put_page(page);
 			rdata->pages[i] = NULL;
@@ -4437,7 +4437,7 @@ static int cifs_readpages(struct file *file, struct address_space *mapping,
 			/* best to give up if we're out of mem */
 			list_for_each_entry_safe(page, tpage, &tmplist, lru) {
 				list_del(&page->lru);
-				lru_cache_add_file(page);
+				lru_cache_add(page);
 				unlock_page(page);
 				put_page(page);
 			}
@@ -4475,7 +4475,7 @@ static int cifs_readpages(struct file *file, struct address_space *mapping,
 			add_credits_and_wake_if(server, &rdata->credits, 0);
 			for (i = 0; i < rdata->nr_pages; i++) {
 				page = rdata->pages[i];
-				lru_cache_add_file(page);
+				lru_cache_add(page);
 				unlock_page(page);
 				put_page(page);
 			}
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 97eec7522bf2..8c0cc79d8071 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -840,7 +840,7 @@ static int fuse_try_move_page(struct fuse_copy_state *cs, struct page **pagep)
 	get_page(newpage);
 
 	if (!(buf->flags & PIPE_BUF_FLAG_LRU))
-		lru_cache_add_file(newpage);
+		lru_cache_add(newpage);
 
 	err = 0;
 	spin_lock(&cs->req->waitq.lock);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b42fb47d8cbe..30fd4641890e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -334,8 +334,6 @@ extern unsigned long nr_free_pagecache_pages(void);
 
 /* linux/mm/swap.c */
 extern void lru_cache_add(struct page *);
-extern void lru_cache_add_anon(struct page *page);
-extern void lru_cache_add_file(struct page *page);
 extern void lru_add_page_tail(struct page *page, struct page *page_tail,
 			 struct lruvec *lruvec, struct list_head *head);
 extern void activate_page(struct page *);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f2e0a5e5cfbb..d458c61d6195 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1869,13 +1869,9 @@ static void collapse_file(struct mm_struct *mm,
 
 		SetPageUptodate(new_page);
 		page_ref_add(new_page, HPAGE_PMD_NR - 1);
-
-		if (is_shmem) {
+		if (is_shmem)
 			set_page_dirty(new_page);
-			lru_cache_add_anon(new_page);
-		} else {
-			lru_cache_add_file(new_page);
-		}
+		lru_cache_add(new_page);
 
 		/*
 		 * Remove pte page tables, so we can re-fault the page as huge.
diff --git a/mm/memory.c b/mm/memory.c
index b4efc125ae09..a1813ca388a8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3139,7 +3139,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 				if (err)
 					goto out_page;
 
-				lru_cache_add_anon(page);
+				lru_cache_add(page);
 				swap_readpage(page, true);
 			}
 		} else {
diff --git a/mm/shmem.c b/mm/shmem.c
index e83de27ce8f4..ea95a3e46fbb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1609,7 +1609,7 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
 		 */
 		oldpage = newpage;
 	} else {
-		lru_cache_add_anon(newpage);
+		lru_cache_add(newpage);
 		*pagep = newpage;
 	}
 
@@ -1860,7 +1860,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 					charge_mm);
 	if (error)
 		goto unacct;
-	lru_cache_add_anon(page);
+	lru_cache_add(page);
 
 	spin_lock_irq(&info->lock);
 	info->alloced += compound_nr(page);
@@ -2376,7 +2376,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 	if (!pte_none(*dst_pte))
 		goto out_release_unlock;
 
-	lru_cache_add_anon(page);
+	lru_cache_add(page);
 
 	spin_lock_irq(&info->lock);
 	info->alloced++;
diff --git a/mm/swap.c b/mm/swap.c
index 68eae1e2787a..c93a6c84464c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -403,35 +403,6 @@ void mark_page_accessed(struct page *page)
 }
 EXPORT_SYMBOL(mark_page_accessed);
 
-static void __lru_cache_add(struct page *page)
-{
-	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
-
-	get_page(page);
-	if (!pagevec_add(pvec, page) || PageCompound(page))
-		__pagevec_lru_add(pvec);
-	put_cpu_var(lru_add_pvec);
-}
-
-/**
- * lru_cache_add_anon - add a page to the page lists
- * @page: the page to add
- */
-void lru_cache_add_anon(struct page *page)
-{
-	if (PageActive(page))
-		ClearPageActive(page);
-	__lru_cache_add(page);
-}
-
-void lru_cache_add_file(struct page *page)
-{
-	if (PageActive(page))
-		ClearPageActive(page);
-	__lru_cache_add(page);
-}
-EXPORT_SYMBOL(lru_cache_add_file);
-
 /**
  * lru_cache_add - add a page to a page list
  * @page: the page to be added to the LRU.
@@ -443,10 +414,17 @@ EXPORT_SYMBOL(lru_cache_add_file);
  */
 void lru_cache_add(struct page *page)
 {
+	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
+
 	VM_BUG_ON_PAGE(PageActive(page) && PageUnevictable(page), page);
 	VM_BUG_ON_PAGE(PageLRU(page), page);
-	__lru_cache_add(page);
+
+	get_page(page);
+	if (!pagevec_add(pvec, page) || PageCompound(page))
+		__pagevec_lru_add(pvec);
+	put_cpu_var(lru_add_pvec);
 }
+EXPORT_SYMBOL(lru_cache_add);
 
 /**
  * lru_cache_add_active_or_unevictable
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 17123caec6dd..b5e08ff00e1e 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -432,7 +432,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 
 	/* Initiate read into locked page */
 	SetPageWorkingset(page);
-	lru_cache_add_anon(page);
+	lru_cache_add(page);
 	*new_page_allocated = true;
 	return page;
 
-- 
2.26.2




* [PATCH 05/14] mm: workingset: let cache workingset challenge anon
  2020-05-20 23:25 [PATCH 00/14] mm: balance LRU lists based on relative thrashing v2 Johannes Weiner
                   ` (3 preceding siblings ...)
  2020-05-20 23:25 ` [PATCH 04/14] mm: fold and remove lru_cache_add_anon() and lru_cache_add_file() Johannes Weiner
@ 2020-05-20 23:25 ` Johannes Weiner
  2020-05-27  2:06   ` Joonsoo Kim
  2020-05-20 23:25 ` [PATCH 06/14] mm: remove use-once cache bias from LRU balancing Johannes Weiner
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2020-05-20 23:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Minchan Kim, Michal Hocko, Andrew Morton,
	Joonsoo Kim, linux-kernel, kernel-team

We activate cache refaults with reuse distances in pages smaller than
the size of the total cache. This allows new pages with competitive
access frequencies to establish themselves, as well as challenge and
potentially displace pages on the active list that have gone cold.

However, that assumes that active cache can only replace other active
cache in a competition for the hottest memory. This is not a great
default assumption. The page cache might be thrashing while there are
enough completely cold and unused anonymous pages sitting around that
we'd only have to write to swap once to stop all IO from the cache.

Activate cache refaults when their reuse distance in pages is smaller
than the total userspace workingset, including anonymous pages.
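
In short, the activation test in workingset_refault() goes roughly
from the first to the second of these (a sketch only; the NR_* names
stand in for the lruvec_page_state() lookups, and the hunk below has
the exact code):

        /* before */
        activate = refault_distance <= NR_ACTIVE_FILE;

        /* after: anon only counts when there is swap to displace it to */
        workingset = NR_ACTIVE_FILE;
        if (mem_cgroup_get_nr_swap_pages(memcg) > 0)
                workingset += NR_INACTIVE_ANON + NR_ACTIVE_ANON;
        activate = refault_distance <= workingset;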

Reclaim can still decide how to balance pressure between the two
LRUs depending on the IO situation. Rotational drives will prefer avoiding
random IO from swap and go harder after cache. But fundamentally, hot
cache should be able to compete with anon pages for a place in RAM.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/workingset.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index 474186b76ced..e69865739539 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -277,8 +277,8 @@ void workingset_refault(struct page *page, void *shadow)
 	struct mem_cgroup *eviction_memcg;
 	struct lruvec *eviction_lruvec;
 	unsigned long refault_distance;
+	unsigned long workingset_size;
 	struct pglist_data *pgdat;
-	unsigned long active_file;
 	struct mem_cgroup *memcg;
 	unsigned long eviction;
 	struct lruvec *lruvec;
@@ -310,7 +310,6 @@ void workingset_refault(struct page *page, void *shadow)
 		goto out;
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
 	refault = atomic_long_read(&eviction_lruvec->inactive_age);
-	active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
 
 	/*
 	 * Calculate the refault distance
@@ -345,10 +344,18 @@ void workingset_refault(struct page *page, void *shadow)
 
 	/*
 	 * Compare the distance to the existing workingset size. We
-	 * don't act on pages that couldn't stay resident even if all
-	 * the memory was available to the page cache.
+	 * don't activate pages that couldn't stay resident even if
+	 * all the memory was available to the page cache. Whether
+	 * cache can compete with anon or not depends on having swap.
 	 */
-	if (refault_distance > active_file)
+	workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
+	if (mem_cgroup_get_nr_swap_pages(memcg) > 0) {
+		workingset_size += lruvec_page_state(eviction_lruvec,
+						     NR_INACTIVE_ANON);
+		workingset_size += lruvec_page_state(eviction_lruvec,
+						     NR_ACTIVE_ANON);
+	}
+	if (refault_distance > workingset_size)
 		goto out;
 
 	SetPageActive(page);
-- 
2.26.2




* [PATCH 06/14] mm: remove use-once cache bias from LRU balancing
  2020-05-20 23:25 [PATCH 00/14] mm: balance LRU lists based on relative thrashing v2 Johannes Weiner
                   ` (4 preceding siblings ...)
  2020-05-20 23:25 ` [PATCH 05/14] mm: workingset: let cache workingset challenge anon Johannes Weiner
@ 2020-05-20 23:25 ` Johannes Weiner
  2020-05-20 23:25 ` [PATCH 07/14] mm: vmscan: drop unnecessary div0 avoidance rounding in get_scan_count() Johannes Weiner
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2020-05-20 23:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Minchan Kim, Michal Hocko, Andrew Morton,
	Joonsoo Kim, linux-kernel, kernel-team

When the splitlru patches divided page cache and swap-backed pages
into separate LRU lists, the pressure balance between the lists was
biased to account for the fact that streaming IO can cause memory
pressure with a flood of pages that are used only once. New page cache
additions would tip the balance toward the file LRU, and repeat access
would neutralize that bias again. This ensured that page reclaim would
always go for used-once cache first.

Since e9868505987a ("mm,vmscan: only evict file pages when we have
plenty"), page reclaim generally skips over swap-backed memory
entirely as long as there is used-once cache present, and will apply
the LRU balancing when only repeatedly accessed cache pages are left -
at which point the previous use-once bias will have been
neutralized. This makes the use-once cache balancing bias unnecessary.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Minchan Kim <minchan@kernel.org>
---
 mm/swap.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index c93a6c84464c..3b8c81bc93cd 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -277,7 +277,6 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
 			    void *arg)
 {
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
-		int file = page_is_file_lru(page);
 		int lru = page_lru_base_type(page);
 
 		del_page_from_lru_list(page, lruvec, lru);
@@ -287,7 +286,6 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
 		trace_mm_lru_activate(page);
 
 		__count_vm_event(PGACTIVATE);
-		update_page_reclaim_stat(lruvec, file, 1, hpage_nr_pages(page));
 	}
 }
 
@@ -936,9 +934,6 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 
 	if (page_evictable(page)) {
 		lru = page_lru(page);
-		update_page_reclaim_stat(lruvec, is_file_lru(lru),
-					 PageActive(page),
-					 hpage_nr_pages(page));
 		if (was_unevictable)
 			count_vm_event(UNEVICTABLE_PGRESCUED);
 	} else {
-- 
2.26.2




* [PATCH 07/14] mm: vmscan: drop unnecessary div0 avoidance rounding in get_scan_count()
  2020-05-20 23:25 [PATCH 00/14] mm: balance LRU lists based on relative thrashing v2 Johannes Weiner
                   ` (5 preceding siblings ...)
  2020-05-20 23:25 ` [PATCH 06/14] mm: remove use-once cache bias from LRU balancing Johannes Weiner
@ 2020-05-20 23:25 ` Johannes Weiner
  2020-05-20 23:25 ` [PATCH 08/14] mm: base LRU balancing on an explicit cost model Johannes Weiner
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2020-05-20 23:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Minchan Kim, Michal Hocko, Andrew Morton,
	Joonsoo Kim, linux-kernel, kernel-team

When we calculate the relative scan pressure between the anon and file
LRU lists, we have to assume that reclaim_stat can contain zeroes. To
avoid div0 crashes, we add 1 to all denominators like so:

        anon_prio = swappiness;
        file_prio = 200 - anon_prio;

	[...]

        /*
         * The amount of pressure on anon vs file pages is inversely
         * proportional to the fraction of recently scanned pages on
         * each list that were recently referenced and in active use.
         */
        ap = anon_prio * (reclaim_stat->recent_scanned[0] + 1);
        ap /= reclaim_stat->recent_rotated[0] + 1;

        fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
        fp /= reclaim_stat->recent_rotated[1] + 1;
        spin_unlock_irq(&pgdat->lru_lock);

        fraction[0] = ap;
        fraction[1] = fp;
        denominator = ap + fp + 1;

While reclaim_stat can contain 0, it's not actually possible for ap +
fp to be 0. One of anon_prio or file_prio could be zero, but they must
still add up to 200. And the reclaim_stat fraction, due to the +1 in
there, is always at least 1. So if one of the two numerators is 0, the
other one can't be. ap + fp is always at least 1. Drop the + 1.
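
For example, plugging the extreme case of swappiness == 0 with
all-zero reclaim_stat counters into the code quoted above:

        anon_prio = 0, file_prio = 200
        ap = 0   * (0 + 1) / (0 + 1) = 0
        fp = 200 * (0 + 1) / (0 + 1) = 200

        ap + fp = 200, so the denominator cannot hit zero even
        without the extra + 1.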

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 43f88b1a4f14..6cd1029ea9d4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2348,7 +2348,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 
 	fraction[0] = ap;
 	fraction[1] = fp;
-	denominator = ap + fp + 1;
+	denominator = ap + fp;
 out:
 	for_each_evictable_lru(lru) {
 		int file = is_file_lru(lru);
-- 
2.26.2




* [PATCH 08/14] mm: base LRU balancing on an explicit cost model
  2020-05-20 23:25 [PATCH 00/14] mm: balance LRU lists based on relative thrashing v2 Johannes Weiner
                   ` (6 preceding siblings ...)
  2020-05-20 23:25 ` [PATCH 07/14] mm: vmscan: drop unnecessary div0 avoidance rounding in get_scan_count() Johannes Weiner
@ 2020-05-20 23:25 ` Johannes Weiner
  2020-05-20 23:25 ` [PATCH 09/14] mm: deactivations shouldn't bias the LRU balance Johannes Weiner
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2020-05-20 23:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Minchan Kim, Michal Hocko, Andrew Morton,
	Joonsoo Kim, linux-kernel, kernel-team

Currently, scan pressure between the anon and file LRU lists is
balanced based on a mixture of reclaim efficiency and a somewhat vague
notion of "value" of having certain pages in memory over others. That
concept of value is problematic, because it has caused us to count any
event that remotely makes one LRU list more or less preferable for
reclaim, even when these events are not directly comparable and impose
very different costs on the system. One example is referenced file
pages that we still deactivate and referenced anonymous pages that we
actually rotate back to the head of the list.

There is also conceptual overlap with the LRU algorithm itself. By
rotating recently used pages instead of reclaiming them, the algorithm
already biases the applied scan pressure based on page value. Thus,
when rebalancing scan pressure due to rotations, we should think of
reclaim cost, and leave assessing the page value to the LRU algorithm.

Lastly, considering both value-increasing as well as value-decreasing
events can sometimes cause the same type of event to be counted twice,
i.e. how rotating a page increases the LRU value, while reclaiming it
successfully decreases the value. In itself this will balance out fine,
but it quietly skews the impact of events that are only recorded once.

The abstract metric of "value", the murky relationship with the LRU
algorithm, and accounting both negative and positive events make the
current pressure balancing model hard to reason about and modify.

This patch switches to a balancing model of accounting the concrete,
actually observed cost of reclaiming one LRU over another. For now,
that cost includes pages that are scanned but rotated back to the list
head. Subsequent patches will add consideration for IO caused by
refaulting of recently evicted pages.

Replace struct zone_reclaim_stat with two cost counters in the lruvec,
and make everything that affects cost go through a new lru_note_cost()
function.

v2: remove superfluous cost denominator (Minchan Kim)
    improve cost variable naming (Michal Hocko)

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mmzone.h | 21 +++++++-------------
 include/linux/swap.h   |  2 ++
 mm/memcontrol.c        | 18 ++++++-----------
 mm/swap.c              | 19 ++++++++----------
 mm/vmscan.c            | 44 +++++++++++++++++++++---------------------
 5 files changed, 45 insertions(+), 59 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c1fbda9ddd1f..e959602140b4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -240,19 +240,6 @@ static inline bool is_active_lru(enum lru_list lru)
 	return (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE);
 }
 
-struct zone_reclaim_stat {
-	/*
-	 * The pageout code in vmscan.c keeps track of how many of the
-	 * mem/swap backed and file backed pages are referenced.
-	 * The higher the rotated/scanned ratio, the more valuable
-	 * that cache is.
-	 *
-	 * The anon LRU stats live in [0], file LRU stats in [1]
-	 */
-	unsigned long		recent_rotated[2];
-	unsigned long		recent_scanned[2];
-};
-
 enum lruvec_flags {
 	LRUVEC_CONGESTED,		/* lruvec has many dirty pages
 					 * backed by a congested BDI
@@ -261,7 +248,13 @@ enum lruvec_flags {
 
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
-	struct zone_reclaim_stat	reclaim_stat;
+	/*
+	 * These track the cost of reclaiming one LRU - file or anon -
+	 * over the other. As the observed cost of reclaiming one LRU
+	 * increases, the reclaim scan balance tips toward the other.
+	 */
+	unsigned long			anon_cost;
+	unsigned long			file_cost;
 	/* Evictions & activations on the inactive file list */
 	atomic_long_t			inactive_age;
 	/* Refaults at the time of last reclaim cycle */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 30fd4641890e..5ace6d8a33bd 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -333,6 +333,8 @@ extern unsigned long nr_free_pagecache_pages(void);
 
 
 /* linux/mm/swap.c */
+extern void lru_note_cost(struct lruvec *lruvec, bool file,
+			  unsigned int nr_pages);
 extern void lru_cache_add(struct page *);
 extern void lru_add_page_tail(struct page *page, struct page *page_tail,
 			 struct lruvec *lruvec, struct list_head *head);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fe4f4d96ae3e..3e000a316b59 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3790,23 +3790,17 @@ static int memcg_stat_show(struct seq_file *m, void *v)
 	{
 		pg_data_t *pgdat;
 		struct mem_cgroup_per_node *mz;
-		struct zone_reclaim_stat *rstat;
-		unsigned long recent_rotated[2] = {0, 0};
-		unsigned long recent_scanned[2] = {0, 0};
+		unsigned long anon_cost = 0;
+		unsigned long file_cost = 0;
 
 		for_each_online_pgdat(pgdat) {
 			mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
-			rstat = &mz->lruvec.reclaim_stat;
 
-			recent_rotated[0] += rstat->recent_rotated[0];
-			recent_rotated[1] += rstat->recent_rotated[1];
-			recent_scanned[0] += rstat->recent_scanned[0];
-			recent_scanned[1] += rstat->recent_scanned[1];
+			anon_cost += mz->lruvec.anon_cost;
+			file_cost += mz->lruvec.file_cost;
 		}
-		seq_printf(m, "recent_rotated_anon %lu\n", recent_rotated[0]);
-		seq_printf(m, "recent_rotated_file %lu\n", recent_rotated[1]);
-		seq_printf(m, "recent_scanned_anon %lu\n", recent_scanned[0]);
-		seq_printf(m, "recent_scanned_file %lu\n", recent_scanned[1]);
+		seq_printf(m, "anon_cost %lu\n", anon_cost);
+		seq_printf(m, "file_cost %lu\n", file_cost);
 	}
 #endif
 
diff --git a/mm/swap.c b/mm/swap.c
index 3b8c81bc93cd..5d62c5a0c651 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -262,15 +262,12 @@ void rotate_reclaimable_page(struct page *page)
 	}
 }
 
-static void update_page_reclaim_stat(struct lruvec *lruvec,
-				     int file, int rotated,
-				     unsigned int nr_pages)
+void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 {
-	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
-
-	reclaim_stat->recent_scanned[file] += nr_pages;
-	if (rotated)
-		reclaim_stat->recent_rotated[file] += nr_pages;
+	if (file)
+		lruvec->file_cost += nr_pages;
+	else
+		lruvec->anon_cost += nr_pages;
 }
 
 static void __activate_page(struct page *page, struct lruvec *lruvec,
@@ -518,7 +515,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 
 	if (active)
 		__count_vm_event(PGDEACTIVATE);
-	update_page_reclaim_stat(lruvec, file, 0, hpage_nr_pages(page));
+	lru_note_cost(lruvec, !file, hpage_nr_pages(page));
 }
 
 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
@@ -534,7 +531,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
 		add_page_to_lru_list(page, lruvec, lru);
 
 		__count_vm_events(PGDEACTIVATE, hpage_nr_pages(page));
-		update_page_reclaim_stat(lruvec, file, 0, hpage_nr_pages(page));
+		lru_note_cost(lruvec, !file, hpage_nr_pages(page));
 	}
 }
 
@@ -559,7 +556,7 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
 
 		__count_vm_events(PGLAZYFREE, hpage_nr_pages(page));
 		count_memcg_page_event(page, PGLAZYFREE);
-		update_page_reclaim_stat(lruvec, 1, 0, hpage_nr_pages(page));
+		lru_note_cost(lruvec, 0, hpage_nr_pages(page));
 	}
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6cd1029ea9d4..6ff63906a288 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1916,7 +1916,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	bool file = is_file_lru(lru);
 	enum vm_event_item item;
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
-	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 	bool stalled = false;
 
 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
@@ -1940,7 +1939,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 				     &nr_scanned, sc, lru);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
-	reclaim_stat->recent_scanned[file] += nr_taken;
 	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_scanned);
@@ -1960,8 +1958,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	move_pages_to_lru(lruvec, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	reclaim_stat->recent_rotated[0] += stat.nr_activate[0];
-	reclaim_stat->recent_rotated[1] += stat.nr_activate[1];
+	/*
+	 * Rotating pages costs CPU without actually
+	 * progressing toward the reclaim goal.
+	 */
+	lru_note_cost(lruvec, 0, stat.nr_activate[0]);
+	lru_note_cost(lruvec, 1, stat.nr_activate[1]);
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_reclaimed);
@@ -2013,7 +2015,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	LIST_HEAD(l_active);
 	LIST_HEAD(l_inactive);
 	struct page *page;
-	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 	unsigned nr_deactivate, nr_activate;
 	unsigned nr_rotated = 0;
 	int file = is_file_lru(lru);
@@ -2027,7 +2028,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 				     &nr_scanned, sc, lru);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
-	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	__count_vm_events(PGREFILL, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
@@ -2085,7 +2085,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	 * helps balance scan pressure between file and anonymous pages in
 	 * get_scan_count.
 	 */
-	reclaim_stat->recent_rotated[file] += nr_rotated;
+	lru_note_cost(lruvec, file, nr_rotated);
 
 	nr_activate = move_pages_to_lru(lruvec, &l_active);
 	nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
@@ -2242,13 +2242,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 {
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	int swappiness = mem_cgroup_swappiness(memcg);
-	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 	u64 fraction[2];
 	u64 denominator = 0;	/* gcc */
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	unsigned long anon_prio, file_prio;
 	enum scan_balance scan_balance;
 	unsigned long anon, file;
+	unsigned long totalcost;
 	unsigned long ap, fp;
 	enum lru_list lru;
 
@@ -2324,26 +2324,26 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);
 
 	spin_lock_irq(&pgdat->lru_lock);
-	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
-		reclaim_stat->recent_scanned[0] /= 2;
-		reclaim_stat->recent_rotated[0] /= 2;
-	}
-
-	if (unlikely(reclaim_stat->recent_scanned[1] > file / 4)) {
-		reclaim_stat->recent_scanned[1] /= 2;
-		reclaim_stat->recent_rotated[1] /= 2;
+	totalcost = lruvec->anon_cost + lruvec->file_cost;
+	if (unlikely(totalcost > (anon + file) / 4)) {
+		lruvec->anon_cost /= 2;
+		lruvec->file_cost /= 2;
+		totalcost /= 2;
 	}
 
 	/*
 	 * The amount of pressure on anon vs file pages is inversely
-	 * proportional to the fraction of recently scanned pages on
-	 * each list that were recently referenced and in active use.
+	 * proportional to the assumed cost of reclaiming each list,
+	 * as determined by the share of pages that are likely going
+	 * to refault or rotate on each list (recently referenced),
+	 * times the relative IO cost of bringing back a swapped out
+	 * anonymous page vs reloading a filesystem page (swappiness).
 	 */
-	ap = anon_prio * (reclaim_stat->recent_scanned[0] + 1);
-	ap /= reclaim_stat->recent_rotated[0] + 1;
+	ap = anon_prio * (totalcost + 1);
+	ap /= lruvec->anon_cost + 1;
 
-	fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
-	fp /= reclaim_stat->recent_rotated[1] + 1;
+	fp = file_prio * (totalcost + 1);
+	fp /= lruvec->file_cost + 1;
 	spin_unlock_irq(&pgdat->lru_lock);
 
 	fraction[0] = ap;
-- 
2.26.2




* [PATCH 09/14] mm: deactivations shouldn't bias the LRU balance
  2020-05-20 23:25 [PATCH 00/14] mm: balance LRU lists based on relative thrashing v2 Johannes Weiner
                   ` (7 preceding siblings ...)
  2020-05-20 23:25 ` [PATCH 08/14] mm: base LRU balancing on an explicit cost model Johannes Weiner
@ 2020-05-20 23:25 ` Johannes Weiner
  2020-05-22 13:33   ` Qian Cai
  2020-05-20 23:25 ` [PATCH 10/14] mm: only count actual rotations as LRU reclaim cost Johannes Weiner
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2020-05-20 23:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Minchan Kim, Michal Hocko, Andrew Morton,
	Joonsoo Kim, linux-kernel, kernel-team

Operations like MADV_FREE, FADV_DONTNEED etc. currently move any
affected active pages to the inactive list to accelerate their reclaim
(good) but also steer page reclaim toward that LRU type, or away from
the other (bad).

The reason why this is undesirable is that such operations are not
part of the regular page aging cycle, but rather a fluke that doesn't
say much about the remaining pages on that list; they might all be in
heavy use, and once the chunk of easy victims has been purged, the VM
continues to apply elevated pressure on those remaining hot pages. The
other LRU, meanwhile, might have easily reclaimable pages, and there
was never a need to steer away from it in the first place.

As the previous patch outlined, we should focus on recording actually
observed cost to steer the balance rather than speculating about the
potential value of one LRU list over the other. In that spirit, leave
explicitly deactivated pages to the LRU algorithm to pick up, and let
rotations decide which list is the easiest to reclaim.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/swap.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 5d62c5a0c651..d7912bfb597f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -515,14 +515,12 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 
 	if (active)
 		__count_vm_event(PGDEACTIVATE);
-	lru_note_cost(lruvec, !file, hpage_nr_pages(page));
 }
 
 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
 			    void *arg)
 {
 	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
-		int file = page_is_file_lru(page);
 		int lru = page_lru_base_type(page);
 
 		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
@@ -531,7 +529,6 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
 		add_page_to_lru_list(page, lruvec, lru);
 
 		__count_vm_events(PGDEACTIVATE, hpage_nr_pages(page));
-		lru_note_cost(lruvec, !file, hpage_nr_pages(page));
 	}
 }
 
@@ -556,7 +553,6 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
 
 		__count_vm_events(PGLAZYFREE, hpage_nr_pages(page));
 		count_memcg_page_event(page, PGLAZYFREE);
-		lru_note_cost(lruvec, 0, hpage_nr_pages(page));
 	}
 }
 
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 10/14] mm: only count actual rotations as LRU reclaim cost
  2020-05-20 23:25 [PATCH 00/14] mm: balance LRU lists based on relative thrashing v2 Johannes Weiner
                   ` (8 preceding siblings ...)
  2020-05-20 23:25 ` [PATCH 09/14] mm: deactivations shouldn't bias the LRU balance Johannes Weiner
@ 2020-05-20 23:25 ` Johannes Weiner
  2020-05-20 23:25 ` [PATCH 11/14] mm: balance LRU lists based on relative thrashing Johannes Weiner
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2020-05-20 23:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Minchan Kim, Michal Hocko, Andrew Morton,
	Joonsoo Kim, linux-kernel, kernel-team

When shrinking the active file list we rotate referenced pages only
when they're in an executable mapping. The others get deactivated.
When it comes to balancing scan pressure, though, we count all
referenced pages as rotated, even the deactivated ones. Yet they do
not carry the same cost to the system: the deactivated page *might*
refault later on, but the deactivation is tangible progress toward
freeing pages; rotations on the other hand cost time and effort
without getting any closer to freeing memory.

Don't treat both events as equal. The following patch will hook up LRU
balancing to cache and anon refaults, which are a much more concrete
cost signal for reclaiming one list over the other. Thus, remove the
maybe-IO cost bias from page references, and only note the CPU cost
for actual rotations that prevent the pages from getting reclaimed.

v2: readable changelog (Michal Hocko)

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmscan.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6ff63906a288..2c3fb8dd1159 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2054,7 +2054,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 		if (page_referenced(page, 0, sc->target_mem_cgroup,
 				    &vm_flags)) {
-			nr_rotated += hpage_nr_pages(page);
 			/*
 			 * Identify referenced, file-backed active pages and
 			 * give them one more trip around the active list. So
@@ -2065,6 +2064,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 			 * so we ignore them here.
 			 */
 			if ((vm_flags & VM_EXEC) && page_is_file_lru(page)) {
+				nr_rotated += hpage_nr_pages(page);
 				list_add(&page->lru, &l_active);
 				continue;
 			}
@@ -2080,10 +2080,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	 */
 	spin_lock_irq(&pgdat->lru_lock);
 	/*
-	 * Count referenced pages from currently used mappings as rotated,
-	 * even though only some of them are actually re-activated.  This
-	 * helps balance scan pressure between file and anonymous pages in
-	 * get_scan_count.
+	 * Rotating pages costs CPU without actually
+	 * progressing toward the reclaim goal.
 	 */
 	lru_note_cost(lruvec, file, nr_rotated);
 
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 11/14] mm: balance LRU lists based on relative thrashing
  2020-05-20 23:25 [PATCH 00/14] mm: balance LRU lists based on relative thrashing v2 Johannes Weiner
                   ` (9 preceding siblings ...)
  2020-05-20 23:25 ` [PATCH 10/14] mm: only count actual rotations as LRU reclaim cost Johannes Weiner
@ 2020-05-20 23:25 ` Johannes Weiner
  2020-05-20 23:25 ` [PATCH 12/14] mm: vmscan: determine anon/file pressure balance at the reclaim root Johannes Weiner
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2020-05-20 23:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Minchan Kim, Michal Hocko, Andrew Morton,
	Joonsoo Kim, linux-kernel, kernel-team

Since the LRUs were split into anon and file lists, the VM has been
balancing between page cache and anonymous pages based on per-list
ratios of scanned vs. rotated pages. In most cases that tips page
reclaim towards the list that is easier to reclaim and has the fewest
actively used pages, but there are a few problems with it:

1. Refaults and LRU rotations are weighted the same way, even though
   one costs IO and the other costs a bit of CPU.

2. The less we scan an LRU list based on already observed rotations,
   the more we increase the sampling interval for new references, and
   rotations become even more likely on that list. This can enter a
   death spiral in which we stop looking at one list completely until
   the other one is all but annihilated by page reclaim.

Since commit a528910e12ec ("mm: thrash detection-based file cache
sizing") we have refault detection for the page cache. Along with
swapin events, they are good indicators of when the file or anon list,
respectively, is too small for its workingset and needs to grow.

For example, if the page cache is thrashing, the cache pages need more
time in memory, while there may be colder pages on the anonymous list.
Likewise, if swapped pages are faulting back in, it indicates that we
reclaim anonymous pages too aggressively and should back off.

Replace LRU rotations with refaults and swapins as the basis for
relative reclaim cost of the two LRUs. This will have the VM target a
list balance that incurs the least amount of IO in aggregate.
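
To illustrate the resulting balance, here is a small userspace model of
the ap/fp calculation (the cost numbers and the standalone program are
made up purely for illustration; the kernel version operates on the
lruvec counters shown in the diff below):

#include <stdio.h>

int main(void)
{
	unsigned long anon_prio = 60;		/* swappiness */
	unsigned long file_prio = 200 - anon_prio;
	unsigned long anon_cost = 100;		/* swapins noted */
	unsigned long file_cost = 300;		/* cache refaults noted */
	unsigned long totalcost = anon_cost + file_cost;
	unsigned long ap, fp;

	ap = anon_prio * (totalcost + 1) / (anon_cost + 1);
	fp = file_prio * (totalcost + 1) / (file_cost + 1);

	printf("anon scan share: %lu%%\n", 100 * ap / (ap + fp));
	printf("file scan share: %lu%%\n", 100 * fp / (ap + fp));
	return 0;
}

With swappiness at 60 and the cache refaulting three times as often as
anon swaps in, a bit over half of the scan pressure ends up on the anon
list.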

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/swap.h |  3 +--
 mm/swap.c            | 11 +++++++----
 mm/swap_state.c      |  5 +++++
 mm/vmscan.c          | 39 ++++++++++-----------------------------
 mm/workingset.c      |  4 ++++
 5 files changed, 27 insertions(+), 35 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5ace6d8a33bd..818a94b41d82 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -333,8 +333,7 @@ extern unsigned long nr_free_pagecache_pages(void);
 
 
 /* linux/mm/swap.c */
-extern void lru_note_cost(struct lruvec *lruvec, bool file,
-			  unsigned int nr_pages);
+extern void lru_note_cost(struct page *);
 extern void lru_cache_add(struct page *);
 extern void lru_add_page_tail(struct page *page, struct page *page_tail,
 			 struct lruvec *lruvec, struct list_head *head);
diff --git a/mm/swap.c b/mm/swap.c
index d7912bfb597f..2ff91656dea2 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -262,12 +262,15 @@ void rotate_reclaimable_page(struct page *page)
 	}
 }
 
-void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
+void lru_note_cost(struct page *page)
 {
-	if (file)
-		lruvec->file_cost += nr_pages;
+	struct lruvec *lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+
+	/* Record new data point */
+	if (page_is_file_lru(page))
+		lruvec->file_cost++;
 	else
-		lruvec->anon_cost += nr_pages;
+		lruvec->anon_cost++;
 }
 
 static void __activate_page(struct page *page, struct lruvec *lruvec,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b5e08ff00e1e..8b902897a867 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -430,6 +430,11 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	if (mem_cgroup_charge(page, NULL, gfp_mask & GFP_KERNEL))
 		goto fail_delete;
 
+	/* XXX: Move to lru_cache_add() when it supports new vs putback */
+	spin_lock_irq(&page_pgdat(page)->lru_lock);
+	lru_note_cost(page);
+	spin_unlock_irq(&page_pgdat(page)->lru_lock);
+
 	/* Initiate read into locked page */
 	SetPageWorkingset(page);
 	lru_cache_add(page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2c3fb8dd1159..e7e6868bcbf7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1958,12 +1958,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	move_pages_to_lru(lruvec, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	/*
-	 * Rotating pages costs CPU without actually
-	 * progressing toward the reclaim goal.
-	 */
-	lru_note_cost(lruvec, 0, stat.nr_activate[0]);
-	lru_note_cost(lruvec, 1, stat.nr_activate[1]);
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_reclaimed);
@@ -2079,11 +2073,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	 * Move pages back to the lru list.
 	 */
 	spin_lock_irq(&pgdat->lru_lock);
-	/*
-	 * Rotating pages costs CPU without actually
-	 * progressing toward the reclaim goal.
-	 */
-	lru_note_cost(lruvec, file, nr_rotated);
 
 	nr_activate = move_pages_to_lru(lruvec, &l_active);
 	nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
@@ -2298,22 +2287,23 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	scan_balance = SCAN_FRACT;
 
 	/*
-	 * With swappiness at 100, anonymous and file have the same priority.
-	 * This scanning priority is essentially the inverse of IO cost.
+	 * Calculate the pressure balance between anon and file pages.
+	 *
+	 * The amount of pressure we put on each LRU is inversely
+	 * proportional to the cost of reclaiming each list, as
+	 * determined by the share of pages that are refaulting, times
+	 * the relative IO cost of bringing back a swapped out
+	 * anonymous page vs reloading a filesystem page (swappiness).
+	 *
+	 * With swappiness at 100, anon and file have equal IO cost.
 	 */
 	anon_prio = swappiness;
 	file_prio = 200 - anon_prio;
 
 	/*
-	 * OK, so we have swap space and a fair amount of page cache
-	 * pages.  We use the recently rotated / recently scanned
-	 * ratios to determine how valuable each cache is.
-	 *
 	 * Because workloads change over time (and to avoid overflow)
 	 * we keep these statistics as a floating average, which ends
-	 * up weighing recent references more than old ones.
-	 *
-	 * anon in [0], file in [1]
+	 * up weighing recent refaults more than old ones.
 	 */
 
 	anon  = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON, MAX_NR_ZONES) +
@@ -2328,15 +2318,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		lruvec->file_cost /= 2;
 		totalcost /= 2;
 	}
-
-	/*
-	 * The amount of pressure on anon vs file pages is inversely
-	 * proportional to the assumed cost of reclaiming each list,
-	 * as determined by the share of pages that are likely going
-	 * to refault or rotate on each list (recently referenced),
-	 * times the relative IO cost of bringing back a swapped out
-	 * anonymous page vs reloading a filesystem page (swappiness).
-	 */
 	ap = anon_prio * (totalcost + 1);
 	ap /= lruvec->anon_cost + 1;
 
diff --git a/mm/workingset.c b/mm/workingset.c
index e69865739539..a6a2a740ed0b 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -365,6 +365,10 @@ void workingset_refault(struct page *page, void *shadow)
 	/* Page was active prior to eviction */
 	if (workingset) {
 		SetPageWorkingset(page);
+		/* XXX: Move to lru_cache_add() when it supports new vs putback */
+		spin_lock_irq(&page_pgdat(page)->lru_lock);
+		lru_note_cost(page);
+		spin_unlock_irq(&page_pgdat(page)->lru_lock);
 		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
 	}
 out:
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 12/14] mm: vmscan: determine anon/file pressure balance at the reclaim root
  2020-05-20 23:25 [PATCH 00/14] mm: balance LRU lists based on relative thrashing v2 Johannes Weiner
                   ` (10 preceding siblings ...)
  2020-05-20 23:25 ` [PATCH 11/14] mm: balance LRU lists based on relative thrashing Johannes Weiner
@ 2020-05-20 23:25 ` Johannes Weiner
  2020-05-20 23:25 ` [PATCH 13/14] mm: vmscan: reclaim writepage is IO cost Johannes Weiner
  2020-05-20 23:25 ` [PATCH 14/14] mm: vmscan: limit the range of LRU type balancing Johannes Weiner
  13 siblings, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2020-05-20 23:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Minchan Kim, Michal Hocko, Andrew Morton,
	Joonsoo Kim, linux-kernel, kernel-team

We split the LRU lists into anon and file, and we rebalance the scan
pressure between them when one of them begins thrashing: if the file
cache experiences workingset refaults, we increase the pressure on
anonymous pages; if the workload is stalled on swapins, we increase
the pressure on the file cache instead.

With cgroups and their nested LRU lists, we currently don't do this
correctly. While recursive cgroup reclaim establishes a relative LRU
order among the pages of all involved cgroups, LRU pressure balancing
is done on an individual cgroup LRU level. As a result, when one
cgroup is thrashing on the filesystem cache while a sibling may have
cold anonymous pages, pressure doesn't get equalized between them.

This patch moves LRU balancing decision to the root of reclaim - the
same level where the LRU order is established.

It does this by tracking LRU cost recursively, so that every level of
the cgroup tree knows the aggregate LRU cost of all memory within its
domain. When the page scanner calculates the scan balance for any
given individual cgroup's LRU list, it uses the values from the
ancestor cgroup that initiated the reclaim cycle.

If one sibling is then thrashing on the cache, it will tip the
pressure balance inside its ancestors, and the next hierarchical
reclaim iteration will go more after the anon pages in the tree.
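
As a rough illustration of the recursive bookkeeping, here is a small
userspace model (the tree layout, names and numbers are made up; the
kernel version is the lru_note_cost() loop in the diff below):

#include <stdio.h>

struct lruvec_model {
	const char *name;
	struct lruvec_model *parent;
	unsigned long anon_cost;
	unsigned long file_cost;
};

/* Walk from the page's cgroup up to the root, like lru_note_cost() */
static void note_cost(struct lruvec_model *lruvec, int file, unsigned long nr)
{
	do {
		if (file)
			lruvec->file_cost += nr;
		else
			lruvec->anon_cost += nr;
	} while ((lruvec = lruvec->parent));
}

int main(void)
{
	struct lruvec_model root = { "root", NULL, 0, 0 };
	struct lruvec_model p = { "parent", &root, 0, 0 };	/* reclaim root */
	struct lruvec_model a = { "A", &p, 0, 0 };
	struct lruvec_model b = { "B", &p, 0, 0 };

	/* Sibling A thrashes on the page cache */
	note_cost(&a, 1, 512);

	printf("parent: file_cost=%lu\n", p.file_cost);
	printf("B:      file_cost=%lu\n", b.file_cost);
	return 0;
}

When get_scan_count() later runs for B's LRU lists, it balances against
the sc->file_cost snapshot taken at the reclaim root rather than B's
local counter, which is what lets A's refaults put pressure on B's cold
anon pages.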

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 13 ++++++++++++
 mm/swap.c                  | 32 ++++++++++++++++++++++++-----
 mm/vmscan.c                | 41 ++++++++++++++++----------------------
 3 files changed, 57 insertions(+), 29 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 32a0b4d47540..d982c80da157 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1303,6 +1303,19 @@ static inline void dec_lruvec_page_state(struct page *page,
 	mod_lruvec_page_state(page, idx, -1);
 }
 
+static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = lruvec_memcg(lruvec);
+	if (!memcg)
+		return NULL;
+	memcg = parent_mem_cgroup(memcg);
+	if (!memcg)
+		return NULL;
+	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/mm/swap.c b/mm/swap.c
index 2ff91656dea2..3d8aa46c47ff 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -266,11 +266,33 @@ void lru_note_cost(struct page *page)
 {
 	struct lruvec *lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 
-	/* Record new data point */
-	if (page_is_file_lru(page))
-		lruvec->file_cost++;
-	else
-		lruvec->anon_cost++;
+	do {
+		unsigned long lrusize;
+
+		/* Record cost event */
+		if (page_is_file_lru(page))
+			lruvec->file_cost++;
+		else
+			lruvec->anon_cost++;
+
+		/*
+		 * Decay previous events
+		 *
+		 * Because workloads change over time (and to avoid
+		 * overflow) we keep these statistics as a floating
+		 * average, which ends up weighing recent refaults
+		 * more than old ones.
+		 */
+		lrusize = lruvec_page_state(lruvec, NR_INACTIVE_ANON) +
+			  lruvec_page_state(lruvec, NR_ACTIVE_ANON) +
+			  lruvec_page_state(lruvec, NR_INACTIVE_FILE) +
+			  lruvec_page_state(lruvec, NR_ACTIVE_FILE);
+
+		if (lruvec->file_cost + lruvec->anon_cost > lrusize / 4) {
+			lruvec->file_cost /= 2;
+			lruvec->anon_cost /= 2;
+		}
+	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
 static void __activate_page(struct page *page, struct lruvec *lruvec,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e7e6868bcbf7..1487ff5d4698 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -79,6 +79,12 @@ struct scan_control {
 	 */
 	struct mem_cgroup *target_mem_cgroup;
 
+	/*
+	 * Scan pressure balancing between anon and file LRUs
+	 */
+	unsigned long	anon_cost;
+	unsigned long	file_cost;
+
 	/* Can active pages be deactivated as part of reclaim? */
 #define DEACTIVATE_ANON 1
 #define DEACTIVATE_FILE 2
@@ -2231,10 +2237,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	int swappiness = mem_cgroup_swappiness(memcg);
 	u64 fraction[2];
 	u64 denominator = 0;	/* gcc */
-	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	unsigned long anon_prio, file_prio;
 	enum scan_balance scan_balance;
-	unsigned long anon, file;
 	unsigned long totalcost;
 	unsigned long ap, fp;
 	enum lru_list lru;
@@ -2285,7 +2289,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	}
 
 	scan_balance = SCAN_FRACT;
-
 	/*
 	 * Calculate the pressure balance between anon and file pages.
 	 *
@@ -2300,30 +2303,12 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	anon_prio = swappiness;
 	file_prio = 200 - anon_prio;
 
-	/*
-	 * Because workloads change over time (and to avoid overflow)
-	 * we keep these statistics as a floating average, which ends
-	 * up weighing recent refaults more than old ones.
-	 */
-
-	anon  = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON, MAX_NR_ZONES) +
-		lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, MAX_NR_ZONES);
-	file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES) +
-		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);
-
-	spin_lock_irq(&pgdat->lru_lock);
-	totalcost = lruvec->anon_cost + lruvec->file_cost;
-	if (unlikely(totalcost > (anon + file) / 4)) {
-		lruvec->anon_cost /= 2;
-		lruvec->file_cost /= 2;
-		totalcost /= 2;
-	}
+	totalcost = sc->anon_cost + sc->file_cost;
 	ap = anon_prio * (totalcost + 1);
-	ap /= lruvec->anon_cost + 1;
+	ap /= sc->anon_cost + 1;
 
 	fp = file_prio * (totalcost + 1);
-	fp /= lruvec->file_cost + 1;
-	spin_unlock_irq(&pgdat->lru_lock);
+	fp /= sc->file_cost + 1;
 
 	fraction[0] = ap;
 	fraction[1] = fp;
@@ -2679,6 +2664,14 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	nr_reclaimed = sc->nr_reclaimed;
 	nr_scanned = sc->nr_scanned;
 
+	/*
+	 * Determine the scan balance between anon and file LRUs.
+	 */
+	spin_lock_irq(&pgdat->lru_lock);
+	sc->anon_cost = target_lruvec->anon_cost;
+	sc->file_cost = target_lruvec->file_cost;
+	spin_unlock_irq(&pgdat->lru_lock);
+
 	/*
 	 * Target desirable inactive:active list ratios for the anon
 	 * and file LRU lists.
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 13/14] mm: vmscan: reclaim writepage is IO cost
  2020-05-20 23:25 [PATCH 00/14] mm: balance LRU lists based on relative thrashing v2 Johannes Weiner
                   ` (11 preceding siblings ...)
  2020-05-20 23:25 ` [PATCH 12/14] mm: vmscan: determine anon/file pressure balance at the reclaim root Johannes Weiner
@ 2020-05-20 23:25 ` Johannes Weiner
  2020-05-20 23:25 ` [PATCH 14/14] mm: vmscan: limit the range of LRU type balancing Johannes Weiner
  13 siblings, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2020-05-20 23:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Minchan Kim, Michal Hocko, Andrew Morton,
	Joonsoo Kim, linux-kernel, kernel-team

The VM tries to balance reclaim pressure between anon and file so as
to reduce the amount of IO incurred due to the memory shortage. It
already counts refaults and swapins, but in addition it should also
count writepage calls during reclaim.

For swap, this is obvious: it's IO that wouldn't have occurred if the
anonymous memory hadn't been under memory pressure. From a relative
balancing point of view this makes sense as well: even if anon is cold
and reclaimable, a cache that isn't thrashing may have equally cold
pages that don't require IO to reclaim.

For file writeback, it's trickier: some of the reclaim writepage IO
would have likely occurred anyway due to dirty expiration. But not all
of it - premature writeback reduces batching and generates additional
writes. Since the flushers are already woken up by the time the VM
starts writing cache pages one by one, let's assume that we're likely
causing writes that wouldn't have happened without memory pressure. In
addition, the per-page cost of IO would have probably been much
cheaper if written in larger batches from the flusher thread rather
than the single-page-writes from kswapd.

For our purposes - getting the trend right to accelerate convergence
on a stable state that doesn't require paging at all - this is
sufficiently accurate. If we later wanted to optimize for sustained
thrashing, we can still refine the measurements.

Count all writepage calls from kswapd as IO cost toward the LRU that
the page belongs to.

Why do this dynamically? Don't we know in advance that anon pages
require IO to reclaim, and so could build in a static bias?

First, scanning is not the same as reclaiming. If all the anon pages
are referenced, we may not swap for a while just because we're
scanning the anon list. During this time, however, it's important that
we age anonymous memory and the page cache at the same rate so that
their hot-cold gradients are comparable. Everything else being equal,
we still want to reclaim the coldest memory overall.

Second, we keep copies in swap unless the page changes. If there is
swap-backed data that's mostly read (tmpfs file) and has been swapped
out before, we can reclaim it without incurring additional IO.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/swap.h   |  4 +++-
 include/linux/vmstat.h |  1 +
 mm/swap.c              | 16 ++++++++++------
 mm/swap_state.c        |  2 +-
 mm/vmscan.c            |  3 +++
 mm/workingset.c        |  2 +-
 6 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 818a94b41d82..157e5081bf98 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -333,7 +333,9 @@ extern unsigned long nr_free_pagecache_pages(void);
 
 
 /* linux/mm/swap.c */
-extern void lru_note_cost(struct page *);
+extern void lru_note_cost(struct lruvec *lruvec, bool file,
+			  unsigned int nr_pages);
+extern void lru_note_cost_page(struct page *);
 extern void lru_cache_add(struct page *);
 extern void lru_add_page_tail(struct page *page, struct page *page_tail,
 			 struct lruvec *lruvec, struct list_head *head);
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 10cc932e209a..3d12c34cd42a 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -26,6 +26,7 @@ struct reclaim_stat {
 	unsigned nr_congested;
 	unsigned nr_writeback;
 	unsigned nr_immediate;
+	unsigned nr_pageout;
 	unsigned nr_activate[2];
 	unsigned nr_ref_keep;
 	unsigned nr_unmap_fail;
diff --git a/mm/swap.c b/mm/swap.c
index 3d8aa46c47ff..ffc457911be2 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -262,18 +262,16 @@ void rotate_reclaimable_page(struct page *page)
 	}
 }
 
-void lru_note_cost(struct page *page)
+void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 {
-	struct lruvec *lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-
 	do {
 		unsigned long lrusize;
 
 		/* Record cost event */
-		if (page_is_file_lru(page))
-			lruvec->file_cost++;
+		if (file)
+			lruvec->file_cost += nr_pages;
 		else
-			lruvec->anon_cost++;
+			lruvec->anon_cost += nr_pages;
 
 		/*
 		 * Decay previous events
@@ -295,6 +293,12 @@ void lru_note_cost(struct page *page)
 	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
+void lru_note_cost_page(struct page *page)
+{
+	lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)),
+		      page_is_file_lru(page), hpage_nr_pages(page));
+}
+
 static void __activate_page(struct page *page, struct lruvec *lruvec,
 			    void *arg)
 {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8b902897a867..1e744e6c9c20 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -432,7 +432,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 
 	/* XXX: Move to lru_cache_add() when it supports new vs putback */
 	spin_lock_irq(&page_pgdat(page)->lru_lock);
-	lru_note_cost(page);
+	lru_note_cost_page(page);
 	spin_unlock_irq(&page_pgdat(page)->lru_lock);
 
 	/* Initiate read into locked page */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1487ff5d4698..5453b4ef2ea1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1359,6 +1359,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 			case PAGE_ACTIVATE:
 				goto activate_locked;
 			case PAGE_SUCCESS:
+				stat->nr_pageout += hpage_nr_pages(page);
+
 				if (PageWriteback(page))
 					goto keep;
 				if (PageDirty(page))
@@ -1964,6 +1966,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	move_pages_to_lru(lruvec, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
+	lru_note_cost(lruvec, file, stat.nr_pageout);
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_reclaimed);
diff --git a/mm/workingset.c b/mm/workingset.c
index a6a2a740ed0b..d481ea452eeb 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -367,7 +367,7 @@ void workingset_refault(struct page *page, void *shadow)
 		SetPageWorkingset(page);
 		/* XXX: Move to lru_cache_add() when it supports new vs putback */
 		spin_lock_irq(&page_pgdat(page)->lru_lock);
-		lru_note_cost(page);
+		lru_note_cost_page(page);
 		spin_unlock_irq(&page_pgdat(page)->lru_lock);
 		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
 	}
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 14/14] mm: vmscan: limit the range of LRU type balancing
  2020-05-20 23:25 [PATCH 00/14] mm: balance LRU lists based on relative thrashing v2 Johannes Weiner
                   ` (12 preceding siblings ...)
  2020-05-20 23:25 ` [PATCH 13/14] mm: vmscan: reclaim writepage is IO cost Johannes Weiner
@ 2020-05-20 23:25 ` Johannes Weiner
  13 siblings, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2020-05-20 23:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Minchan Kim, Michal Hocko, Andrew Morton,
	Joonsoo Kim, linux-kernel, kernel-team

When LRU cost only shows up on one list, we abruptly stop scanning
that list altogether. That's an extreme reaction: by the time the
other list starts thrashing and the pendulum swings back, we may have
no recent age information on the first list anymore, and we could have
significant latencies until the scanner has caught up.

Soften this change in the feedback system by ensuring that no list
receives less than a third of overall pressure, and only distribute
the other 66% according to LRU cost. This ensures that we maintain a
minimum rate of aging on the entire workingset while it's being
pressured, while still allowing a generous rate of convergence when
the relative sizes of the lists need to adjust.
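
Spelled out with extreme made-up numbers, the transformation guarantees
that floor (a standalone model for illustration only):

#include <stdio.h>

int main(void)
{
	unsigned long sc_anon_cost = 0, sc_file_cost = 1000;
	unsigned long swappiness = 100;
	unsigned long total_cost, anon_cost, file_cost, ap, fp;

	total_cost = sc_anon_cost + sc_file_cost;	/* 1000 */
	anon_cost = total_cost + sc_anon_cost;		/* 1000 */
	file_cost = total_cost + sc_file_cost;		/* 2000 */
	total_cost = anon_cost + file_cost;		/* 3000 */

	ap = swappiness * (total_cost + 1) / (anon_cost + 1);
	fp = (200 - swappiness) * (total_cost + 1) / (file_cost + 1);

	/* roughly two thirds anon, one third file */
	printf("anon %lu%%, file %lu%%\n",
	       100 * ap / (ap + fp), 100 * fp / (ap + fp));
	return 0;
}

Even though all of the recorded cost is on the file side, the file list
still receives about a third of the pressure; with the previous formula
the same inputs would leave it with roughly 0.1%.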

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5453b4ef2ea1..c628f9ab886b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2237,12 +2237,11 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			   unsigned long *nr)
 {
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	unsigned long anon_cost, file_cost, total_cost;
 	int swappiness = mem_cgroup_swappiness(memcg);
 	u64 fraction[2];
 	u64 denominator = 0;	/* gcc */
-	unsigned long anon_prio, file_prio;
 	enum scan_balance scan_balance;
-	unsigned long totalcost;
 	unsigned long ap, fp;
 	enum lru_list lru;
 
@@ -2301,17 +2300,22 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * the relative IO cost of bringing back a swapped out
 	 * anonymous page vs reloading a filesystem page (swappiness).
 	 *
+	 * Although we limit that influence to ensure no list gets
+	 * left behind completely: at least a third of the pressure is
+	 * applied, before swappiness.
+	 *
 	 * With swappiness at 100, anon and file have equal IO cost.
 	 */
-	anon_prio = swappiness;
-	file_prio = 200 - anon_prio;
+	total_cost = sc->anon_cost + sc->file_cost;
+	anon_cost = total_cost + sc->anon_cost;
+	file_cost = total_cost + sc->file_cost;
+	total_cost = anon_cost + file_cost;
 
-	totalcost = sc->anon_cost + sc->file_cost;
-	ap = anon_prio * (totalcost + 1);
-	ap /= sc->anon_cost + 1;
+	ap = swappiness * (total_cost + 1);
+	ap /= anon_cost + 1;
 
-	fp = file_prio * (totalcost + 1);
-	fp /= sc->file_cost + 1;
+	fp = (200 - swappiness) * (total_cost + 1);
+	fp /= file_cost + 1;
 
 	fraction[0] = ap;
 	fraction[1] = fp;
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 09/14] mm: deactivations shouldn't bias the LRU balance
  2020-05-20 23:25 ` [PATCH 09/14] mm: deactivations shouldn't bias the LRU balance Johannes Weiner
@ 2020-05-22 13:33   ` Qian Cai
  2020-05-26 15:55     ` Johannes Weiner
  0 siblings, 1 reply; 35+ messages in thread
From: Qian Cai @ 2020-05-22 13:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Rik van Riel, Minchan Kim, Michal Hocko, Andrew Morton,
	Joonsoo Kim, linux-kernel, kernel-team

On Wed, May 20, 2020 at 07:25:20PM -0400, Johannes Weiner wrote:
> Operations like MADV_FREE, FADV_DONTNEED etc. currently move any
> affected active pages to the inactive list to accelerate their reclaim
> (good) but also steer page reclaim toward that LRU type, or away from
> the other (bad).
> 
> The reason why this is undesirable is that such operations are not
> part of the regular page aging cycle, and rather a fluke that doesn't
> say much about the remaining pages on that list; they might all be in
> heavy use, and once the chunk of easy victims has been purged, the VM
> continues to apply elevated pressure on those remaining hot pages. The
> other LRU, meanwhile, might have easily reclaimable pages, and there
> was never a need to steer away from it in the first place.
> 
> As the previous patch outlined, we should focus on recording actually
> observed cost to steer the balance rather than speculating about the
> potential value of one LRU list over the other. In that spirit, leave
> explicitly deactivated pages to the LRU algorithm to pick up, and let
> rotations decide which list is the easiest to reclaim.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Minchan Kim <minchan@kernel.org>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/swap.c | 4 ----
>  1 file changed, 4 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 5d62c5a0c651..d7912bfb597f 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -515,14 +515,12 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
>  
>  	if (active)
>  		__count_vm_event(PGDEACTIVATE);
> -	lru_note_cost(lruvec, !file, hpage_nr_pages(page));
>  }
>
[]

mm/swap.c: In function 'lru_deactivate_file_fn':
mm/swap.c:504:11: warning: variable 'file' set but not used
[-Wunused-but-set-variable]
  int lru, file;
           ^~~~  

This?

diff --git a/mm/swap.c b/mm/swap.c
index fedf5847dfdb..9c38c1b545af 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -501,7 +501,7 @@ void lru_cache_add_active_or_unevictable(struct page *page,
 static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 			      void *arg)
 {
-	int lru, file;
+	int lru;
 	bool active;
 
 	if (!PageLRU(page))
@@ -515,7 +515,6 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 		return;
 
 	active = PageActive(page);
-	file = page_is_file_lru(page);
 	lru = page_lru_base_type(page);
 
 	del_page_from_lru_list(page, lruvec, lru + active);



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 09/14] mm: deactivations shouldn't bias the LRU balance
  2020-05-22 13:33   ` Qian Cai
@ 2020-05-26 15:55     ` Johannes Weiner
  0 siblings, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2020-05-26 15:55 UTC (permalink / raw)
  To: Qian Cai
  Cc: linux-mm, Rik van Riel, Minchan Kim, Michal Hocko, Andrew Morton,
	Joonsoo Kim, linux-kernel, kernel-team

On Fri, May 22, 2020 at 09:33:35AM -0400, Qian Cai wrote:
> On Wed, May 20, 2020 at 07:25:20PM -0400, Johannes Weiner wrote:
> > Operations like MADV_FREE, FADV_DONTNEED etc. currently move any
> > affected active pages to the inactive list to accelerate their reclaim
> > (good) but also steer page reclaim toward that LRU type, or away from
> > the other (bad).
> > 
> > The reason why this is undesirable is that such operations are not
> > part of the regular page aging cycle, and rather a fluke that doesn't
> > say much about the remaining pages on that list; they might all be in
> > heavy use, and once the chunk of easy victims has been purged, the VM
> > continues to apply elevated pressure on those remaining hot pages. The
> > other LRU, meanwhile, might have easily reclaimable pages, and there
> > was never a need to steer away from it in the first place.
> > 
> > As the previous patch outlined, we should focus on recording actually
> > observed cost to steer the balance rather than speculating about the
> > potential value of one LRU list over the other. In that spirit, leave
> > explicitly deactivated pages to the LRU algorithm to pick up, and let
> > rotations decide which list is the easiest to reclaim.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > Acked-by: Minchan Kim <minchan@kernel.org>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/swap.c | 4 ----
> >  1 file changed, 4 deletions(-)
> > 
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 5d62c5a0c651..d7912bfb597f 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -515,14 +515,12 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
> >  
> >  	if (active)
> >  		__count_vm_event(PGDEACTIVATE);
> > -	lru_note_cost(lruvec, !file, hpage_nr_pages(page));
> >  }
> >
> []
> 
> mm/swap.c: In function 'lru_deactivate_file_fn':
> mm/swap.c:504:11: warning: variable 'file' set but not used
> [-Wunused-but-set-variable]
>   int lru, file;
>            ^~~~ 

Oops, my gcc doesn't warn about that, but yes, it's clearly dead code.

$ make mm/swap.o
  GEN     Makefile
  CALL    /home/hannes/src/linux/linux/scripts/checksyscalls.sh
  CALL    /home/hannes/src/linux/linux/scripts/atomic/check-atomics.sh
  DESCEND  objtool
  CC      mm/swap.o
$
 
> This?
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index fedf5847dfdb..9c38c1b545af 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -501,7 +501,7 @@ void lru_cache_add_active_or_unevictable(struct page *page,
>  static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
>  			      void *arg)
>  {
> -	int lru, file;
> +	int lru;
>  	bool active;
>  
>  	if (!PageLRU(page))
> @@ -515,7 +515,6 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
>  		return;
>  
>  	active = PageActive(page);
> -	file = page_is_file_lru(page);
>  	lru = page_lru_base_type(page);
>  
>  	del_page_from_lru_list(page, lruvec, lru + active);

Looks good, and it appears Andrew has already queued it. Would you
mind providing the Signed-off-by: for it?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon
  2020-05-20 23:25 ` [PATCH 05/14] mm: workingset: let cache workingset challenge anon Johannes Weiner
@ 2020-05-27  2:06   ` Joonsoo Kim
  2020-05-27 13:43     ` Johannes Weiner
  0 siblings, 1 reply; 35+ messages in thread
From: Joonsoo Kim @ 2020-05-27  2:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linux Memory Management List, Rik van Riel, Minchan Kim,
	Michal Hocko, Andrew Morton, Joonsoo Kim, LKML, kernel-team

On Thu, May 21, 2020 at 8:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> We activate cache refaults with reuse distances in pages smaller than
> the size of the total cache. This allows new pages with competitive
> access frequencies to establish themselves, as well as challenge and
> potentially displace pages on the active list that have gone cold.
>
> However, that assumes that active cache can only replace other active
> cache in a competition for the hottest memory. This is not a great
> default assumption. The page cache might be thrashing while there are
> enough completely cold and unused anonymous pages sitting around that
> we'd only have to write to swap once to stop all IO from the cache.
>
> Activate cache refaults when their reuse distance in pages is smaller
> than the total userspace workingset, including anonymous pages.

Hmm... I'm not sure the correctness of this change.

IIUC, this patch leads to more activations in the file list and more activations
here will challenge the anon list since rotation ratio for the file
list will be increased.

However, this change breaks active/inactive concept of the file list.
active/inactive
separation is implemented by in-list refault distance. anon list size has
no direct connection with refault distance of the file list so using
anon list size
to detect workingset for file page breaks the concept.

My suspicion is started by this counter example.

Environment:
anon: 500 MB (so hot) / 500 MB (so hot)
file: 50 MB (hot) / 50 MB (cold)

Think about the situation that there is periodical access to other file (100 MB)
with low frequency (refault distance is 500 MB)

Without your change, this periodical access doesn't make thrashing for cached
active file page since refault distance of periodical access is larger
than the size of
the active file list. However, with your change, it causes thrashing
on the file list.
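
To spell out the numbers of this example against the activation check
described above: before the patch, the refault distance of the
periodically accessed file (500 MB) is larger than the file set (100 MB
at most), so those pages never get activated; after the patch, 500 MB
is smaller than file plus anon (roughly 1.1 GB), so they do get
activated and can deactivate the 50 MB of hot file pages.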

In fact, I'm not sure that I'm thinking correctly. It's very
complicated for me. :)
Please let me know if there is a missing piece.

Thanks.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon
  2020-05-27  2:06   ` Joonsoo Kim
@ 2020-05-27 13:43     ` Johannes Weiner
  2020-05-28  7:16       ` Joonsoo Kim
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2020-05-27 13:43 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Linux Memory Management List, Rik van Riel, Minchan Kim,
	Michal Hocko, Andrew Morton, Joonsoo Kim, LKML, kernel-team

On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> On Thu, May 21, 2020 at 8:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > We activate cache refaults with reuse distances in pages smaller than
> > the size of the total cache. This allows new pages with competitive
> > access frequencies to establish themselves, as well as challenge and
> > potentially displace pages on the active list that have gone cold.
> >
> > However, that assumes that active cache can only replace other active
> > cache in a competition for the hottest memory. This is not a great
> > default assumption. The page cache might be thrashing while there are
> > enough completely cold and unused anonymous pages sitting around that
> > we'd only have to write to swap once to stop all IO from the cache.
> >
> > Activate cache refaults when their reuse distance in pages is smaller
> > than the total userspace workingset, including anonymous pages.
> 
> Hmm... I'm not sure the correctness of this change.
> 
> IIUC, this patch leads to more activations in the file list and more activations
> here will challenge the anon list since rotation ratio for the file
> list will be increased.

Yes.

> However, this change breaks active/inactive concept of the file list.
> active/inactive
> separation is implemented by in-list refault distance. anon list size has
> no direct connection with refault distance of the file list so using
> anon list size
> to detect workingset for file page breaks the concept.

This is intentional, because there IS a connection: they both take up
space in RAM, and they both cost IO to bring back once reclaimed.

When file is refaulting, it means we need to make more space for
cache. That space can come from stale active file pages. But what if
active cache is all hot, and meanwhile there are cold anon pages that
we could swap out once and then serve everything from RAM?

When file is refaulting, we should find the coldest data that is
taking up RAM and kick it out. It doesn't matter whether it's file or
anon: the goal is to free up RAM with the least amount of IO risk.

Remember that the file/anon split, and the inactive/active split, are
there to optimize reclaim. It doesn't mean that these memory pools are
independent from each other.

The file list is split in two because of use-once cache. The anon and
file lists are split because of different IO patterns, because we may
not have swap etc. But once we are out of use-once cache, have swap
space available, and have corrected for the different cost of IO,
there needs to be a relative order between all pages in the system to
find the optimal candidates to reclaim.

> My suspicion is started by this counter example.
> 
> Environment:
> anon: 500 MB (so hot) / 500 MB (so hot)
> file: 50 MB (hot) / 50 MB (cold)
> 
> Think about the situation that there is periodical access to other file (100 MB)
> with low frequency (refault distance is 500 MB)
> 
> Without your change, this periodical access doesn't make thrashing for cached
> active file page since refault distance of periodical access is larger
> than the size of
> the active file list. However, with your change, it causes thrashing
> on the file list.

It doesn't cause thrashing. It causes scanning because that 100M file
IS thrashing: with or without my patch, that refault IO is occurring.

What this patch acknowledges is that the 100M file COULD fit fully
into memory, and not require any IO to serve, IFF 100M of the active
file or anon pages were cold and could be reclaimed or swapped out.

In your example, the anon set is hot. We'll scan it slowly (at the
rate of IO from the other file) and rotate the pages that are in use -
which would be all of them. Likewise for the file - there will be some
deactivations, but mark_page_accessed() or the second chance algorithm
in page_check_references() for mapped will keep the hottest pages active.

In a slightly modified example, 400M of the anon set is hot and 100M
cold. Without my patch, we would never look for them and the second
file would be IO-bound forever. After my patch, we would scan anon,
eventually find the cold pages, swap them out, and then serve the
entire workingset from memory.

Does it cause more scanning than before in your scenario? Yes, but we
don't even know it's your scenario instead of mine until we actually
sample references of all memory. Not scanning is a false stable state.

And your scenario could easily change over time. Even if the amount of
repeatedly accessed pages stays larger than memory, and will always
require IO to serve, the relative access frequencies could change.
Some pages could become hotter, others colder. Without scanning, we
wouldn't adapt the LRU ordering and cause more IO than necessary.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 01/14] mm: fix LRU balancing effect of new transparent huge pages
  2020-05-20 23:25 ` [PATCH 01/14] mm: fix LRU balancing effect of new transparent huge pages Johannes Weiner
@ 2020-05-27 19:48   ` Shakeel Butt
  0 siblings, 0 replies; 35+ messages in thread
From: Shakeel Butt @ 2020-05-27 19:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linux MM, Rik van Riel, Minchan Kim, Michal Hocko, Andrew Morton,
	Joonsoo Kim, LKML, Kernel Team

On Wed, May 20, 2020 at 4:28 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Currently, THP are counted as single pages until they are split right
> before being swapped out. However, at that point the VM is already in
> the middle of reclaim, and adjusting the LRU balance then is useless.
>
> Always account THP by the number of basepages, and remove the fixup
> from the splitting path.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Acked-by: Minchan Kim <minchan@kernel.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon
  2020-05-27 13:43     ` Johannes Weiner
@ 2020-05-28  7:16       ` Joonsoo Kim
  2020-05-28 17:01         ` Johannes Weiner
  0 siblings, 1 reply; 35+ messages in thread
From: Joonsoo Kim @ 2020-05-28  7:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linux Memory Management List, Rik van Riel, Minchan Kim,
	Michal Hocko, Andrew Morton, Joonsoo Kim, LKML, kernel-team

On Wed, May 27, 2020 at 10:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > On Thu, May 21, 2020 at 8:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > We activate cache refaults with reuse distances in pages smaller than
> > > the size of the total cache. This allows new pages with competitive
> > > access frequencies to establish themselves, as well as challenge and
> > > potentially displace pages on the active list that have gone cold.
> > >
> > > However, that assumes that active cache can only replace other active
> > > cache in a competition for the hottest memory. This is not a great
> > > default assumption. The page cache might be thrashing while there are
> > > enough completely cold and unused anonymous pages sitting around that
> > > we'd only have to write to swap once to stop all IO from the cache.
> > >
> > > Activate cache refaults when their reuse distance in pages is smaller
> > > than the total userspace workingset, including anonymous pages.
> >
> > Hmm... I'm not sure the correctness of this change.
> >
> > IIUC, this patch leads to more activations in the file list and more activations
> > here will challenge the anon list since rotation ratio for the file
> > list will be increased.
>
> Yes.
>
> > However, this change breaks active/inactive concept of the file list.
> > active/inactive
> > separation is implemented by in-list refault distance. anon list size has
> > no direct connection with refault distance of the file list so using
> > anon list size
> > to detect workingset for file page breaks the concept.
>
> This is intentional, because there IS a connection: they both take up
> space in RAM, and they both cost IO to bring back once reclaimed.

I know that. This is the reason that I said 'no direct connection'. The anon
list size is directly related to the *possible* file list size. But,
active/inactive
separation in one list is firstly based on *current* list size rather than
the possible list size. Adding anon list size to detect workingset means
to use the possible list size and I think that it's wrong.

> When file is refaulting, it means we need to make more space for
> cache. That space can come from stale active file pages. But what if
> active cache is all hot, and meanwhile there are cold anon pages that
> we could swap out once and then serve everything from RAM?
>
> When file is refaulting, we should find the coldest data that is
> taking up RAM and kick it out. It doesn't matter whether it's file or
> anon: the goal is to free up RAM with the least amount of IO risk.

I understand your purpose and agree with it. We need to find a solution.
To achieve your goal, my suggestion is:

- refault distance < active file, then do activation and add up IO cost
- refault distance < active file + anon list, then add up IO cost

This doesn't break workingset detection on the file list, and it challenges
the anon list to the same degree as your patch does.
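
Roughly, in terms of the names used elsewhere in this series, the two
checks would look something like this (an untested sketch, not a full
implementation):

	if (refault_distance <= active_file) {
		/* fits the current file workingset: activate, note IO cost */
		SetPageActive(page);
		inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
		lru_note_cost_page(page);
	} else if (refault_distance <= active_file + anon) {
		/* could fit if anon gave up space: note the IO cost only */
		lru_note_cost_page(page);
	}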

> Remember that the file/anon split, and the inactive/active split, are
> there to optimize reclaim. It doesn't mean that these memory pools are
> independent from each other.
>
> The file list is split in two because of use-once cache. The anon and
> file lists are split because of different IO patterns, because we may
> not have swap etc. But once we are out of use-once cache, have swap
> space available, and have corrected for the different cost of IO,
> there needs to be a relative order between all pages in the system to
> find the optimal candidates to reclaim.
>
> > My suspicion is started by this counter example.
> >
> > Environment:
> > anon: 500 MB (so hot) / 500 MB (so hot)
> > file: 50 MB (hot) / 50 MB (cold)
> >
> > Think about the situation that there is periodical access to other file (100 MB)
> > with low frequency (refault distance is 500 MB)
> >
> > Without your change, this periodical access doesn't make thrashing for cached
> > active file page since refault distance of periodical access is larger
> > than the size of
> > the active file list. However, with your change, it causes thrashing
> > on the file list.
>
> It doesn't cause thrashing. It causes scanning because that 100M file
> IS thrashing: with or without my patch, that refault IO is occurring.

It could cause thrashing with your patch. Without the patch, the current
logic tries to find the hottest file pages that fit into the current file
list size and protects them successfully. Assume that the access distance
of the 50 MB of hot file pages is 60 MB, which is less than the whole file
list size but larger than the inactive list size. Without your patch, the
50 MB of (hot) pages are not evicted at all. All these hot pages will be
protected from the 100 MB of low-access-frequency pages. The 100 MB of
low-access-frequency pages will be refaulted repeatedly, but that's correct
behaviour.

However, with your patch, the 50 MB of (hot) file pages are deactivated due
to the newly added file pages with low access frequency. And then, since the
access distance of the 50 MB of (hot) pages is larger than the inactive list
size, they could not get a second chance and could finally be evicted. I
think that this is thrashing, since low-access-frequency pages that do not
fit the current file list size push out the high-access-frequency pages that
do fit the current file list size, and it would happen again and again.

Maybe the logic could be corrected if the patch considered the inactive age
of the anon list, but I think that my above suggestion would be enough.

> What this patch acknowledges is that the 100M file COULD fit fully
> into memory, and not require any IO to serve, IFF 100M of the active
> file or anon pages were cold and could be reclaimed or swapped out.
>
> In your example, the anon set is hot. We'll scan it slowly (at the
> rate of IO from the other file) and rotate the pages that are in use -
> which would be all of them. Likewise for the file - there will be some
> deactivations, but mark_page_accessed() or the second chance algorithm
> in page_check_references() for mapped will keep the hottest pages active.
> In a slightly modified example, 400M of the anon set is hot and 100M
> cold. Without my patch, we would never look for them and the second
> file would be IO-bound forever. After my patch, we would scan anon,
> eventually find the cold pages, swap them out, and then serve the
> entire workingset from memory.

Again, I agree with your goal. What I don't agree is the implementation
to achieve the goal.

Thanks.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon
  2020-05-28  7:16       ` Joonsoo Kim
@ 2020-05-28 17:01         ` Johannes Weiner
  2020-05-29  6:48           ` Joonsoo Kim
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2020-05-28 17:01 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Linux Memory Management List, Rik van Riel, Minchan Kim,
	Michal Hocko, Andrew Morton, Joonsoo Kim, LKML, kernel-team

On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote:
> On Wed, May 27, 2020 at 10:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > > On Thu, May 21, 2020 at 8:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > We activate cache refaults with reuse distances in pages smaller than
> > > > the size of the total cache. This allows new pages with competitive
> > > > access frequencies to establish themselves, as well as challenge and
> > > > potentially displace pages on the active list that have gone cold.
> > > >
> > > > However, that assumes that active cache can only replace other active
> > > > cache in a competition for the hottest memory. This is not a great
> > > > default assumption. The page cache might be thrashing while there are
> > > > enough completely cold and unused anonymous pages sitting around that
> > > > we'd only have to write to swap once to stop all IO from the cache.
> > > >
> > > > Activate cache refaults when their reuse distance in pages is smaller
> > > > than the total userspace workingset, including anonymous pages.
> > >
> > > Hmm... I'm not sure the correctness of this change.
> > >
> > > IIUC, this patch leads to more activations in the file list and more activations
> > > here will challenge the anon list since rotation ratio for the file
> > > list will be increased.
> >
> > Yes.
> >
> > > However, this change breaks active/inactive concept of the file list.
> > > active/inactive
> > > separation is implemented by in-list refault distance. anon list size has
> > > no direct connection with refault distance of the file list so using
> > > anon list size
> > > to detect workingset for file page breaks the concept.
> >
> > This is intentional, because there IS a connection: they both take up
> > space in RAM, and they both cost IO to bring back once reclaimed.
> 
> I know that. This is the reason that I said 'no direct connection'. The anon
> list size is directly related to the *possible* file list size. But,
> active/inactive
> separation in one list is firstly based on *current* list size rather than
> the possible list size. Adding anon list size to detect workingset means
> to use the possible list size and I think that it's wrong.

Hm so you're saying we should try to grow the page cache, but maintain
an inactive/active balance within the cache based on its current size
in case growing is not possible.

I think it would be implementable*, but I'm not quite convinced that
it's worth distinguishing. We're talking about an overcommitted
workingset and thereby an inherently unstable state that may thrash
purely based on timing differences anyway. See below.

*It would require another page flag to tell whether a refaulting cache
page has challenged the anon set once (transitioning) or repeatedly
(thrashing), as we currently use the active state for that. If we
would repurpose PG_workingset to tell the first from the second
refault, we'd need a new flag to mark a page for memstall accounting.

> > When file is refaulting, it means we need to make more space for
> > cache. That space can come from stale active file pages. But what if
> > active cache is all hot, and meanwhile there are cold anon pages that
> > we could swap out once and then serve everything from RAM?
> >
> > When file is refaulting, we should find the coldest data that is
> > taking up RAM and kick it out. It doesn't matter whether it's file or
> > anon: the goal is to free up RAM with the least amount of IO risk.
> 
> I understand your purpose and agree with it. We need to find a solution.
> To achieve your goal, my suggestion is:
> 
> - refault distance < active file, then do activation and add up IO cost
> - refault distance < active file + anon list, then add up IO cost
> 
> This doesn't break workingset detection on the file list, and it challenges
> the anon list to the same degree as your patch does.

Unfortunately, it doesn't challenge the anon set. We'll stay in cache
trimming mode where the IO cost balance doesn't have any effect.

We would need to

  1. activate on distance < active file
  2. force reclaim to scan anon when distance < active file + anon
  3. record thrashing and IO cost when distance < active file + anon
     and it's the second refault of the page.

Something like this, where vmscan would scan anon when
WORKINGSET_RESTORE or WORKINGSET_EXPAND events are occurring:

diff --git a/mm/workingset.c b/mm/workingset.c
index d481ea452eeb..41dde6274fff 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -275,9 +275,9 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 void workingset_refault(struct page *page, void *shadow)
 {
 	struct mem_cgroup *eviction_memcg;
+	unsigned long active_file, anon;
 	struct lruvec *eviction_lruvec;
 	unsigned long refault_distance;
-	unsigned long workingset_size;
 	struct pglist_data *pgdat;
 	struct mem_cgroup *memcg;
 	unsigned long eviction;
@@ -330,7 +330,7 @@ void workingset_refault(struct page *page, void *shadow)
 	refault_distance = (refault - eviction) & EVICTION_MASK;
 
 	/*
-	 * The activation decision for this page is made at the level
+	 * The distance decisions for this page is made at the level
 	 * where the eviction occurred, as that is where the LRU order
 	 * during page reclaim is being determined.
 	 *
@@ -344,32 +344,52 @@ void workingset_refault(struct page *page, void *shadow)
 
 	/*
 	 * Compare the distance to the existing workingset size. We
-	 * don't activate pages that couldn't stay resident even if
-	 * all the memory was available to the page cache. Whether
-	 * cache can compete with anon or not depends on having swap.
+	 * ignore pages that couldn't stay resident even if all the
+	 * memory was available to the page cache. Whether cache can
+	 * compete with anon or not depends on having swap.
 	 */
-	workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
+	active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
+	anon = 0;
 	if (mem_cgroup_get_nr_swap_pages(memcg) > 0) {
-		workingset_size += lruvec_page_state(eviction_lruvec,
-						     NR_INACTIVE_ANON);
-		workingset_size += lruvec_page_state(eviction_lruvec,
-						     NR_ACTIVE_ANON);
+		anon += lruvec_page_state(eviction_lruvec, NR_INACTIVE_ANON);
+		anon += lruvec_page_state(eviction_lruvec, NR_ACTIVE_ANON);
 	}
-	if (refault_distance > workingset_size)
-		goto out;
 
-	SetPageActive(page);
-	advance_inactive_age(memcg, pgdat);
-	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
+	if (refault_distance > active_file + anon)
+		goto out;
 
-	/* Page was active prior to eviction */
+	/*
+	 * Page has already challenged the workingset before, and it's
+	 * refaulting again: we're not transitioning out old cache,
+	 * we're thrashing and need to grow. Record the IO to tip the
+	 * reclaim balance and mark the page for memstall detection.
+	 */
 	if (workingset) {
-		SetPageWorkingset(page);
+		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
+
 		/* XXX: Move to lru_cache_add() when it supports new vs putback */
 		spin_lock_irq(&page_pgdat(page)->lru_lock);
 		lru_note_cost_page(page);
 		spin_unlock_irq(&page_pgdat(page)->lru_lock);
-		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
+
+		SetPageThrashing(page);
+	}
+
+	SetPageWorkingset(page);
+
+	/*
+	 * Activate right away if page can compete with the existing
+	 * active set given the *current* size of the page cache.
+	 *
+	 * Otherwise, the page cache needs to grow to house this
+	 * page. Tell reclaim to rebalance against anonymous memory.
+	 */
+	if (refault_distance <= active_file) {
+		SetPageActive(page);
+		advance_inactive_age(memcg, pgdat);
+		inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
+	} else {
+		inc_lruvec_state(lruvec, WORKINGSET_EXPAND);
 	}
 out:
 	rcu_read_unlock();

> > Remember that the file/anon split, and the inactive/active split, are
> > there to optimize reclaim. It doesn't mean that these memory pools are
> > independent from each other.
> >
> > The file list is split in two because of use-once cache. The anon and
> > file lists are split because of different IO patterns, because we may
> > not have swap etc. But once we are out of use-once cache, have swap
> > space available, and have corrected for the different cost of IO,
> > there needs to be a relative order between all pages in the system to
> > find the optimal candidates to reclaim.
> >
> > > My suspicion started with this counterexample.
> > >
> > > Environment:
> > > anon: 500 MB (so hot) / 500 MB (so hot)
> > > file: 50 MB (hot) / 50 MB (cold)
> > >
> > > Think about a situation where there is periodical access to another
> > > file (100 MB) with low frequency (refault distance is 500 MB).
> > >
> > > Without your change, this periodical access doesn't cause thrashing
> > > for the cached active file pages, since the refault distance of the
> > > periodical access is larger than the size of the active file list.
> > > However, with your change, it causes thrashing on the file list.
> >
> > It doesn't cause thrashing. It causes scanning because that 100M file
> > IS thrashing: with or without my patch, that refault IO is occurring.
> 
> It could cause thrashing with your patch. Without the patch, the current
> logic tries to find the hottest file pages that fit into the current file
> list size and protects them successfully. Assume that the access distance
> of the 50 MB of hot file pages is 60 MB, which is less than the whole file
> list size but larger than the inactive list size. Without your patch, the
> 50 MB of (hot) pages are not evicted at all. All these hot pages will be
> protected from the 100 MB low-access-frequency pages. The 100 MB of
> low-access-frequency pages will be refaulted repeatedly, but that's
> correct behaviour.

Hm, something doesn't quite add up. Why is the 50M hot set evicted
with my patch?

With a 60M access distance and a 100M page cache, they might get
deactivated, but then activated back to the head of the active list on
their next access. They should get scanned but not actually reclaimed.

The only way they could get reclaimed is if their access distance ends
up bigger than the file cache. But if that's the case, then the
workingset is overcommitted, and none of the pages qualify for reclaim
protection. Picking a subset to protect against the rest is arbitrary.

For example, if you started out with 50M of free space and an empty
cache and started accessing the 50M and 100M files simultaneously,
with the access distances you specified, none of them would get
activated. You would have the same behavior with or without my patch.

> However, with your patch, the 50 MB of (hot) file pages are deactivated
> due to newly added file pages with low access frequency. And then, since
> the access distance of the 50 MB (hot) pages is larger than the inactive
> list size, they could not get a second chance and could finally be
> evicted.

Ah okay, that's a different concern. We DO get two references, but we
fail to detect them properly.

First, we don't enforce the small inactive list size when there is
refault turnover like this. Deactivation is permanently allowed, which
means the lists are scanned in proportion to their size and should
converge on a 50/50 balance. The 50M pages should get referenced again
in the second half of their in-memory time and get re-activated.
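
(Roughly speaking, the per-list scan target in get_scan_count() is derived
from list size and reclaim priority, as in the sketch below, so two lists
of similar size see similar pressure and drift toward balance; this is an
illustration under that assumption, not a quote of the actual code:)

	/*
	 * Sketch: each LRU list is scanned in proportion to its size,
	 * so equal-sized active and inactive lists get equal pressure.
	 */
	size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
	scan = size >> sc->priority;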

It's somewhat debatable whether that's the right thing to do when
access distances are bunched up instead of uniform. Should a page
that's accessed twice with a 10M distance, and then not again for
another 80M, stay resident in a 60M cache?

It's not entirely unreasonable to say a page needs to get accessed in
each half of in-memory time to be considered still active, to ensure a
reasonable maximum distance between reference clusters.

But if this is an actual concern, shouldn't we instead improve the
accuracy of the LRU ordering? For example, if we're refaulting and
have dissolved the inactive/active separation, shouldn't
shrink_active_list() and shrink_page_list() actually rotate pages with
PG_referenced or pte references?

We ignore those references right now because we rely on second chance
to catch them; and respecting them would incur too much CPU overhead
when all we try to do is funnel one-off cache through the list.

But if we don't trust second-chance with this, and we're already
having refault IO and CPU is less of a concern, that seems like an
obvious change to make.
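
(To make that concrete, a minimal sketch of what honoring references in
shrink_active_list() could look like; lruvec_is_refaulting() is a made-up
helper standing in for whatever refault signal we would actually use, and
this is not part of the series:)

		if (page_referenced(page, 0, sc->target_mem_cgroup,
				    &vm_flags)) {
			/*
			 * Sketch: when the cache is refaulting, trust
			 * the reference and rotate the page back to the
			 * active list instead of relying on second
			 * chance to re-activate it later.
			 */
			if (lruvec_is_refaulting(lruvec)) {
				nr_rotated += hpage_nr_pages(page);
				list_add(&page->lru, &l_active);
				continue;
			}
		}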


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon
  2020-05-28 17:01         ` Johannes Weiner
@ 2020-05-29  6:48           ` Joonsoo Kim
  2020-05-29 15:12             ` Johannes Weiner
  0 siblings, 1 reply; 35+ messages in thread
From: Joonsoo Kim @ 2020-05-29  6:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linux Memory Management List, Rik van Riel, Minchan Kim,
	Michal Hocko, Andrew Morton, Joonsoo Kim, LKML, kernel-team

On Fri, May 29, 2020 at 2:02 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote:
> > On Wed, May 27, 2020 at 10:43 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > > > On Thu, May 21, 2020 at 8:26 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > >
> > > > > We activate cache refaults with reuse distances in pages smaller than
> > > > > the size of the total cache. This allows new pages with competitive
> > > > > access frequencies to establish themselves, as well as challenge and
> > > > > potentially displace pages on the active list that have gone cold.
> > > > >
> > > > > However, that assumes that active cache can only replace other active
> > > > > cache in a competition for the hottest memory. This is not a great
> > > > > default assumption. The page cache might be thrashing while there are
> > > > > enough completely cold and unused anonymous pages sitting around that
> > > > > we'd only have to write to swap once to stop all IO from the cache.
> > > > >
> > > > > Activate cache refaults when their reuse distance in pages is smaller
> > > > > than the total userspace workingset, including anonymous pages.
> > > >
> > > > Hmm... I'm not sure the correctness of this change.
> > > >
> > > > IIUC, this patch leads to more activations in the file list and more activations
> > > > here will challenge the anon list since rotation ratio for the file
> > > > list will be increased.
> > >
> > > Yes.
> > >
> > > > However, this change breaks active/inactive concept of the file list.
> > > > active/inactive
> > > > separation is implemented by in-list refault distance. anon list size has
> > > > no direct connection with refault distance of the file list so using
> > > > anon list size
> > > > to detect workingset for file page breaks the concept.
> > >
> > > This is intentional, because there IS a connection: they both take up
> > > space in RAM, and they both cost IO to bring back once reclaimed.
> >
> > I know that. This is the reason that I said 'no direct connection'. The anon
> > list size is directly related the *possible* file list size. But,
> > active/inactive
> > separation in one list is firstly based on *current* list size rather than
> > the possible list size. Adding anon list size to detect workingset means
> > to use the possible list size and I think that it's wrong.
>
> Hm so you're saying we should try to grow the page cache, but maintain
> an inactive/active balance within the cache based on its current size
> in case growing is not possible.
>
> I think it would be implementable*, but I'm not quite convinced that
> it's worth distinguishing. We're talking about an overcommitted
> workingset and thereby an inherently unstable state that may thrash
> purely based on timing differences anyway. See below.

I think that this patch wrongly judges the overcommitted workingset.
See the new example below in this reply.

> *It would require another page flag to tell whether a refaulting cache
> page has challenged the anon set once (transitioning) or repeatedly
> (thrashing), as we currently use the active state for that. If we
> would repurpose PG_workingset to tell the first from the second
> refault, we'd need a new flag to mark a page for memstall accounting.

I don't understand why a new flag is needed. Whenever we find that a
challenge is needed (dist < active + anon), we need to add up the IO cost.

> > > When file is refaulting, it means we need to make more space for
> > > cache. That space can come from stale active file pages. But what if
> > > active cache is all hot, and meanwhile there are cold anon pages that
> > > we could swap out once and then serve everything from RAM?
> > >
> > > When file is refaulting, we should find the coldest data that is
> > > taking up RAM and kick it out. It doesn't matter whether it's file or
> > > anon: the goal is to free up RAM with the least amount of IO risk.
> >
> > I understand your purpose and agree with it. We need to find a solution.
> > To achieve your goal, my suggestion is:
> >
> > - refault distance < active file, then do activation and add up IO cost
> > - refault distance < active file + anon list, then add up IO cost
> >
> > This doesn't break workingset detection on file list and challenge
> > the anon list as the same degree as you did.
>
> Unfortunately, it doesn't challenge the anon set. We'll stay in cache
> trimming mode where the IO cost balance doesn't have any effect.

Ah... I missed that. I think that we can avoid cache trimming mode easily.
Like WORKINGSET_REFAULT, we would need to snapshot WORKINGSET_EXPAND, and,
if we find a change to it, we can skip cache trimming mode. A rough sketch
is below.
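
(A minimal sketch of that idea, assuming the WORKINGSET_EXPAND counter from
your diff and the cache trimming flag from this series; the ->expands
snapshot field is made up for illustration:)

	/* when reclaim starts: remember the expansion counter */
	target_lruvec->expands = lruvec_page_state(target_lruvec,
						   WORKINGSET_EXPAND);

	/* when deciding whether to only trim cache on this cycle */
	expands = lruvec_page_state(target_lruvec, WORKINGSET_EXPAND);
	if (expands != target_lruvec->expands)
		sc->cache_trim_mode = 0;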

> We would need to
>
>   1. activate on distance < active file
>   2. force reclaim to scan anon when distance < active file + anon
>   3. record thrashing and IO cost when distance < active file + anon
>      and it's the second refault of the page.
>
> Something like this, where vmscan would scan anon when
> WORKINGSET_RESTORE or WORKINGSET_EXPAND events are occuring:
>
> diff --git a/mm/workingset.c b/mm/workingset.c
> index d481ea452eeb..41dde6274fff 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -275,9 +275,9 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
>  void workingset_refault(struct page *page, void *shadow)
>  {
>         struct mem_cgroup *eviction_memcg;
> +       unsigned long active_file, anon;
>         struct lruvec *eviction_lruvec;
>         unsigned long refault_distance;
> -       unsigned long workingset_size;
>         struct pglist_data *pgdat;
>         struct mem_cgroup *memcg;
>         unsigned long eviction;
> @@ -330,7 +330,7 @@ void workingset_refault(struct page *page, void *shadow)
>         refault_distance = (refault - eviction) & EVICTION_MASK;
>
>         /*
> -        * The activation decision for this page is made at the level
> +        * The distance decisions for this page is made at the level
>          * where the eviction occurred, as that is where the LRU order
>          * during page reclaim is being determined.
>          *
> @@ -344,32 +344,52 @@ void workingset_refault(struct page *page, void *shadow)
>
>         /*
>          * Compare the distance to the existing workingset size. We
> -        * don't activate pages that couldn't stay resident even if
> -        * all the memory was available to the page cache. Whether
> -        * cache can compete with anon or not depends on having swap.
> +        * ignore pages that couldn't stay resident even if all the
> +        * memory was available to the page cache. Whether cache can
> +        * compete with anon or not depends on having swap.
>          */
> -       workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
> +       active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
> +       anon = 0;
>         if (mem_cgroup_get_nr_swap_pages(memcg) > 0) {
> -               workingset_size += lruvec_page_state(eviction_lruvec,
> -                                                    NR_INACTIVE_ANON);
> -               workingset_size += lruvec_page_state(eviction_lruvec,
> -                                                    NR_ACTIVE_ANON);
> +               anon += lruvec_page_state(eviction_lruvec, NR_INACTIVE_ANON);
> +               anon += lruvec_page_state(eviction_lruvec, NR_ACTIVE_ANON);
>         }
> -       if (refault_distance > workingset_size)
> -               goto out;
>
> -       SetPageActive(page);
> -       advance_inactive_age(memcg, pgdat);
> -       inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
> +       if (refault_distance > active_file + anon)
> +               goto out;
>
> -       /* Page was active prior to eviction */
> +       /*
> +        * Page has already challenged the workingset before, and it's
> +        * refaulting again: we're not transitioning out old cache,
> +        * we're thrashing and need to grow. Record the IO to tip the
> +        * reclaim balance and mark the page for memstall detection.
> +        */
>         if (workingset) {
> -               SetPageWorkingset(page);
> +               inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
> +
>                 /* XXX: Move to lru_cache_add() when it supports new vs putback */
>                 spin_lock_irq(&page_pgdat(page)->lru_lock);
>                 lru_note_cost_page(page);
>                 spin_unlock_irq(&page_pgdat(page)->lru_lock);
> -               inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
> +
> +               SetPageThrashing(page);
> +       }
> +
> +       SetPageWorkingset(page);
> +
> +       /*
> +        * Activate right away if page can compete with the existing
> +        * active set given the *current* size of the page cache.
> +        *
> +        * Otherwise, the page cache needs to grow to house this
> +        * page. Tell reclaim to rebalance against anonymous memory.
> +        */
> +       if (refault_distance <= active_file) {
> +               SetPageActive(page);
> +               advance_inactive_age(memcg, pgdat);
> +               inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
> +       } else {
> +               inc_lruvec_state(lruvec, WORKINGSET_EXPAND);
>         }
>  out:
>         rcu_read_unlock();
>
> > > Remember that the file/anon split, and the inactive/active split, are
> > > there to optimize reclaim. It doesn't mean that these memory pools are
> > > independent from each other.
> > >
> > > The file list is split in two because of use-once cache. The anon and
> > > file lists are split because of different IO patterns, because we may
> > > not have swap etc. But once we are out of use-once cache, have swap
> > > space available, and have corrected for the different cost of IO,
> > > there needs to be a relative order between all pages in the system to
> > > find the optimal candidates to reclaim.
> > >
> > > > My suspicion is started by this counter example.
> > > >
> > > > Environment:
> > > > anon: 500 MB (so hot) / 500 MB (so hot)
> > > > file: 50 MB (hot) / 50 MB (cold)
> > > >
> > > > Think about the situation that there is periodical access to other file (100 MB)
> > > > with low frequency (refault distance is 500 MB)
> > > >
> > > > Without your change, this periodical access doesn't make thrashing for cached
> > > > active file page since refault distance of periodical access is larger
> > > > than the size of
> > > > the active file list. However, with your change, it causes thrashing
> > > > on the file list.
> > >
> > > It doesn't cause thrashing. It causes scanning because that 100M file
> > > IS thrashing: with or without my patch, that refault IO is occuring.
> >
> > It could cause thrashing for your patch.
> > Without the patch, current logic try to
> > find most hottest file pages that are fit into the current file list
> > size and protect them
> > successfully. Assume that access distance of 50 MB hot file pages is 60 MB
> > which is less than whole file list size but larger than inactive list
> > size. Without
> > your patch, 50 MB (hot) pages are not evicted at all. All these hot
> > pages will be
> > protected from the 100MB low access frequency pages. 100 MB low access
> > frequency pages will be refaulted repeatedely but it's correct behaviour.
>
> Hm, something doesn't quite add up. Why is the 50M hot set evicted
> with my patch?

Thanks for the kind explanation. I read it all and found that I was
confused before. Please let me correct the example.

Environment:
anon: 500 MB (hot) / 500 MB (hot)
file: 50 MB (so hot) / 50 MB (dummy)

I will call the 50 MB of hot file pages F(H).
Let's assume that periodical access to another file (500 MB) is started.
That file consists of 5 parts, each one 100 MB. I will call them P1, P2,
..., P5.

The problem occurs with the following access pattern.

F(H) -> P1 -> F(H) -> P2 -> ... -> F(H) -> P5 -> F(H) -> P1 -> ...

With this access pattern, the access distances of F(H) and Pn are:

F(H) = 150 MB
Pn = 750 MB

Without your patch, F(H) is kept in memory since deactivation would not
happen. However, with your patch, deactivation happens since Pn's refault
distance is less than 'active file + anon'. In the end, F(H) would finally
be evicted.

> With a 60M access distance and a 100M page cache, they might get
> deactivated, but then activated back to the head of the active list on
> their next access. They should get scanned but not actually reclaimed.

Sorry about the wrong example. I have corrected it in the paragraph above.

> The only way they could get reclaimed is if their access distance ends
> up bigger than the file cache. But if that's the case, then the
> workingset is overcommitted, and none of the pages qualify for reclaim
> protection. Picking a subset to protect against the rest is arbitrary.

In the fixed example, although the other file (500 MB) is repeatedly
accessed, it's not the workingset. If we had a unified list (file + anon),
the access distance of Pn would be larger than the whole memory size.
Therefore, it's not an overcommitted workingset, and this patch wrongly
tries to activate it. As I said before, without considering inactive_age
for the anon list, this calculation cannot be correct.

> For example, if you started out with 50M of free space and an empty
> cache and started accessing the 50M and 100M files simultaneously,
> with the access distances you specified, none of them would get
> activated. You would have the same behavior with or without my patch.

It's not a good example for my case. The purpose of the active/inactive
separation is to protect previously hot pages until new evidence is found.
If the 50M and 100M files are accessed simultaneously on 100M of free
space, none of them are activated. However, if the 50M is already in the
active list and then access to the 50M and 100M happens simultaneously, we
need to protect the previous 50M. My example starts with 50M in the active
list, so it should be protected until new evidence is found. Pn is not new
evidence.

> > However, with your patch, 50 MB (hot) file pages are deactivated due to newly
> > added file pages with low access frequency. And, then, since access distance of
> > 50 MB (hot) pages is larger than inactive list size, they could not
> > get a second chance
> > and finally could be evicted.
>
> Ah okay, that's a different concern. We DO get two references, but we
> fail to detect them properly.

As I said before, I was confused here. Your subsequent explanation is
correct and I understand it. Please focus on my fixed example.

Thanks.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon
  2020-05-29  6:48           ` Joonsoo Kim
@ 2020-05-29 15:12             ` Johannes Weiner
  2020-06-01  6:14               ` Joonsoo Kim
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2020-05-29 15:12 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Linux Memory Management List, Rik van Riel, Minchan Kim,
	Michal Hocko, Andrew Morton, Joonsoo Kim, LKML, kernel-team

On Fri, May 29, 2020 at 03:48:00PM +0900, Joonsoo Kim wrote:
> On Fri, May 29, 2020 at 2:02 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote:
> > > On Wed, May 27, 2020 at 10:43 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > *It would require another page flag to tell whether a refaulting cache
> > page has challenged the anon set once (transitioning) or repeatedly
> > (thrashing), as we currently use the active state for that. If we
> > would repurpose PG_workingset to tell the first from the second
> > refault, we'd need a new flag to mark a page for memstall accounting.
> 
> I don't understand why a new flag is needed. Whenever we found that
> challenge is needed (dist < active + anon), we need to add up IO cost.

It sounds like this was cleared up later on in the email.

> > > It could cause thrashing for your patch.
> > > Without the patch, current logic try to
> > > find most hottest file pages that are fit into the current file list
> > > size and protect them
> > > successfully. Assume that access distance of 50 MB hot file pages is 60 MB
> > > which is less than whole file list size but larger than inactive list
> > > size. Without
> > > your patch, 50 MB (hot) pages are not evicted at all. All these hot
> > > pages will be
> > > protected from the 100MB low access frequency pages. 100 MB low access
> > > frequency pages will be refaulted repeatedely but it's correct behaviour.
> >
> > Hm, something doesn't quite add up. Why is the 50M hot set evicted
> > with my patch?
> 
> Thanks for kind explanation. I read all and I found that I was confused before.
> Please let me correct the example.
> 
> Environment:
> anon: 500 MB (hot) / 500 MB (hot)
> file: 50 MB (so hot) / 50 MB (dummy)
> 
> I will call 50 MB file hot pages as F(H).
> Let's assume that periodical access to other file (500 MB) is started. That
> file consists of 5 parts and each one is 100 MB. I will call it P1, P2, ..., P5.
> 
> Problem will occur on following access pattern.
> 
> F(H) -> P1 -> F(H) -> P2 -> ... -> F(H) -> P5 -> F(H) -> P1 -> ...
> 
> With this access pattern, access distance of F(H) and Pn is:
> 
> F(H) = 150 MB
> Pn = 750 MB
> 
> Without your patch, F(H) is kept on the memory since deactivation would not
> happen. However, with your patch, deactivation happens since Pn's refault
> distance is less than 'active file + anon'. In the end, F(H) would be finally
> evicted.

Okay, this example makes sense to me.

I do think P needs to challenge the workingset - at first. P could
easily fit into memory by itself if anon and active_file were cold, so
we need to challenge them to find out that they're hot. As you can
see, if anon and F(H) were completely unused, the current behavior
would be incorrect.

The current behavior would do the same in a cache-only example:

	anon = 1G (unreclaimable)
	file = 500M (hot) / 300M (dummy)

	P = 400M

	F(H) -> P1 -> F(H) -> P2 ...

If F(H) is already active, the first P refaults would have a distance
of 100M, thereby challenging F(H). As F(H) reactivates, its size will
be reflected in the refault distances, pushing them beyond the size of
memory that is available to the cache: 600M refault distance > 500M
active cache, or 900M access distance > 800M cache space.

However, I agree with your observation about the anon age below. When
we start aging the anon set, we have to reflect that in the refault
distances. Once we know that the 1G anon pages are indeed hotter than
the pages in P, there is no reason to keep churning the workingset.

> > The only way they could get reclaimed is if their access distance ends
> > up bigger than the file cache. But if that's the case, then the
> > workingset is overcommitted, and none of the pages qualify for reclaim
> > protection. Picking a subset to protect against the rest is arbitrary.
> 
> In the fixed example, although other file (500 MB) is repeatedly accessed,
> it's not workingset. If we have unified list (file + anon), access distance of
> Pn will be larger than whole memory size. Therefore, it's not overcommitted
> workingset and this patch wrongly try to activate it. As I said before,
> without considering inactive_age for anon list, this calculation can not be
> correct.

You're right. If we don't take anon age into account, the activations
could be over-eager; however, so would counting IO cost and exerting
pressure on anon be, which means my previous patch to split these two
wouldn't fix the fundamental problem you're pointing out. We simply
have to take anon age into account for the refaults to be comparable.

Once we do that, in your example, we would see activations in the
beginning in order to challenge the combined workingset (active_file +
anon) - which is legitimate as long as we don't know it's hot. And as
the anon pages are scanned and rotated (and the challenged F(H)
reactivated), the refault distances increase accordingly to reflect
the size of the hot pages sampled, which will correctly put P's
distances beyond the size of available memory.

However, your example cannot have a completely silent stable state. As
we stop workingset aging, the refault distances will slowly increase
again. We will always have a bit of churn, and rightfully so, because
the workingset *could* go stale.

That's the same situation in my cache-only example above. Anytime you
have a subset of pages that by itself could fit into memory, but can't
because of an established workingset, ongoing sampling is necessary.

But the rate definitely needs to reduce as we detect that in-memory
pages are indeed hot. Otherwise we cause more churn than is required
for an appropriate rate of workingset sampling.

How about the patch below? It looks correct, but I will have to re-run
my tests to make sure I / we are not missing anything.

---

From b105c85bf1cebfc4d1057f7228d7484c0ee77127 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 29 May 2020 09:40:00 -0400
Subject: [PATCH] mm: workingset: age nonresident information alongside
 anonymous pages

After ("mm: workingset: let cache workingset challenge anon fix"), we
compare refault distances to active_file + anon. But age of the
non-resident information is only driven by the file LRU. As a result,
we may overestimate the recency of any incoming refaults and activate
them too eagerly, causing unnecessary LRU churn in certain situations.

Make anon aging drive nonresident age as well to address that.

Reported-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |  4 ++--
 include/linux/swap.h   |  1 +
 mm/vmscan.c            |  3 +++
 mm/workingset.c        | 46 +++++++++++++++++++++++++-----------------
 4 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e959602140b4..39459b8a7bb8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -255,8 +255,8 @@ struct lruvec {
 	 */
 	unsigned long			anon_cost;
 	unsigned long			file_cost;
-	/* Evictions & activations on the inactive file list */
-	atomic_long_t			inactive_age;
+	/* Non-resident age, driven by LRU movement */
+	atomic_long_t			nonresident_age;
 	/* Refaults at the time of last reclaim cycle */
 	unsigned long			refaults;
 	/* Various lruvec state flags (enum lruvec_flags) */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 157e5081bf98..216df7bdaf62 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -312,6 +312,7 @@ struct vma_swap_readahead {
 };
 
 /* linux/mm/workingset.c */
+void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
 void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg);
 void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c628f9ab886b..2fee237063e7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -904,6 +904,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		__delete_from_swap_cache(page, swap);
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
 		put_swap_page(page, swap);
+		workingset_eviction(page, target_memcg);
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
@@ -1884,6 +1885,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 				list_add(&page->lru, &pages_to_free);
 		} else {
 			nr_moved += nr_pages;
+			if (PageActive(page))
+				workingset_age_nonresident(lruvec, nr_pages);
 		}
 	}
 
diff --git a/mm/workingset.c b/mm/workingset.c
index d481ea452eeb..50b7937bab32 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -156,8 +156,8 @@
  *
  *		Implementation
  *
- * For each node's file LRU lists, a counter for inactive evictions
- * and activations is maintained (node->inactive_age).
+ * For each node's LRU lists, a counter for inactive evictions and
+ * activations is maintained (node->nonresident_age).
  *
  * On eviction, a snapshot of this counter (along with some bits to
  * identify the node) is stored in the now empty page cache
@@ -213,7 +213,17 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*workingsetp = workingset;
 }
 
-static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
+/**
+ * workingset_age_nonresident - age non-resident entries as LRU ages
+ * @lruvec: the lruvec that was aged
+ * @nr_pages: the number of pages to count
+ *
+ * As in-memory pages are aged, non-resident pages need to be aged as
+ * well, in order for the refault distances later on to be comparable
+ * to the in-memory dimensions. This function allows reclaim and LRU
+ * operations to drive the non-resident aging along in parallel.
+ */
+void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
 {
 	/*
 	 * Reclaiming a cgroup means reclaiming all its children in a
@@ -227,11 +237,8 @@ static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
 	 * the root cgroup's, age as well.
 	 */
 	do {
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-		atomic_long_inc(&lruvec->inactive_age);
-	} while (memcg && (memcg = parent_mem_cgroup(memcg)));
+		atomic_long_add(nr_pages, &lruvec->nonresident_age);
+	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
 /**
@@ -254,12 +261,11 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	advance_inactive_age(page_memcg(page), pgdat);
-
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
+	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
-	eviction = atomic_long_read(&lruvec->inactive_age);
+	eviction = atomic_long_read(&lruvec->nonresident_age);
 	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
 
@@ -309,20 +315,20 @@ void workingset_refault(struct page *page, void *shadow)
 	if (!mem_cgroup_disabled() && !eviction_memcg)
 		goto out;
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
-	refault = atomic_long_read(&eviction_lruvec->inactive_age);
+	refault = atomic_long_read(&eviction_lruvec->nonresident_age);
 
 	/*
 	 * Calculate the refault distance
 	 *
 	 * The unsigned subtraction here gives an accurate distance
-	 * across inactive_age overflows in most cases. There is a
+	 * across nonresident_age overflows in most cases. There is a
 	 * special case: usually, shadow entries have a short lifetime
 	 * and are either refaulted or reclaimed along with the inode
 	 * before they get too old.  But it is not impossible for the
-	 * inactive_age to lap a shadow entry in the field, which can
-	 * then result in a false small refault distance, leading to a
-	 * false activation should this old entry actually refault
-	 * again.  However, earlier kernels used to deactivate
+	 * nonresident_age to lap a shadow entry in the field, which
+	 * can then result in a false small refault distance, leading
+	 * to a false activation should this old entry actually
+	 * refault again.  However, earlier kernels used to deactivate
 	 * unconditionally with *every* reclaim invocation for the
 	 * longest time, so the occasional inappropriate activation
 	 * leading to pressure on the active list is not a problem.
@@ -359,7 +365,7 @@ void workingset_refault(struct page *page, void *shadow)
 		goto out;
 
 	SetPageActive(page);
-	advance_inactive_age(memcg, pgdat);
+	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
 	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
 
 	/* Page was active prior to eviction */
@@ -382,6 +388,7 @@ void workingset_refault(struct page *page, void *shadow)
 void workingset_activation(struct page *page)
 {
 	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
 
 	rcu_read_lock();
 	/*
@@ -394,7 +401,8 @@ void workingset_activation(struct page *page)
 	memcg = page_memcg_rcu(page);
 	if (!mem_cgroup_disabled() && !memcg)
 		goto out;
-	advance_inactive_age(memcg, page_pgdat(page));
+	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
 out:
 	rcu_read_unlock();
 }
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon
  2020-05-29 15:12             ` Johannes Weiner
@ 2020-06-01  6:14               ` Joonsoo Kim
  2020-06-01 15:56                 ` Johannes Weiner
  0 siblings, 1 reply; 35+ messages in thread
From: Joonsoo Kim @ 2020-06-01  6:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linux Memory Management List, Rik van Riel, Minchan Kim,
	Michal Hocko, Andrew Morton, Joonsoo Kim, LKML, kernel-team

On Sat, May 30, 2020 at 12:12 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, May 29, 2020 at 03:48:00PM +0900, Joonsoo Kim wrote:
> > On Fri, May 29, 2020 at 2:02 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote:
> > > > On Wed, May 27, 2020 at 10:43 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > > *It would require another page flag to tell whether a refaulting cache
> > > page has challenged the anon set once (transitioning) or repeatedly
> > > (thrashing), as we currently use the active state for that. If we
> > > would repurpose PG_workingset to tell the first from the second
> > > refault, we'd need a new flag to mark a page for memstall accounting.
> >
> > I don't understand why a new flag is needed. Whenever we found that
> > challenge is needed (dist < active + anon), we need to add up IO cost.
>
> It sounds like this was cleared up later on in the email.
>
> > > > It could cause thrashing for your patch.
> > > > Without the patch, current logic try to
> > > > find most hottest file pages that are fit into the current file list
> > > > size and protect them
> > > > successfully. Assume that access distance of 50 MB hot file pages is 60 MB
> > > > which is less than whole file list size but larger than inactive list
> > > > size. Without
> > > > your patch, 50 MB (hot) pages are not evicted at all. All these hot
> > > > pages will be
> > > > protected from the 100MB low access frequency pages. 100 MB low access
> > > > frequency pages will be refaulted repeatedely but it's correct behaviour.
> > >
> > > Hm, something doesn't quite add up. Why is the 50M hot set evicted
> > > with my patch?
> >
> > Thanks for kind explanation. I read all and I found that I was confused before.
> > Please let me correct the example.
> >
> > Environment:
> > anon: 500 MB (hot) / 500 MB (hot)
> > file: 50 MB (so hot) / 50 MB (dummy)
> >
> > I will call 50 MB file hot pages as F(H).
> > Let's assume that periodical access to other file (500 MB) is started. That
> > file consists of 5 parts and each one is 100 MB. I will call it P1, P2, ..., P5.
> >
> > Problem will occur on following access pattern.
> >
> > F(H) -> P1 -> F(H) -> P2 -> ... -> F(H) -> P5 -> F(H) -> P1 -> ...
> >
> > With this access pattern, access distance of F(H) and Pn is:
> >
> > F(H) = 150 MB
> > Pn = 750 MB
> >
> > Without your patch, F(H) is kept on the memory since deactivation would not
> > happen. However, with your patch, deactivation happens since Pn's refault
> > distance is less than 'active file + anon'. In the end, F(H) would be finally
> > evicted.
>
> Okay, this example makes sense to me.
>
> I do think P needs to challenge the workingset - at first. P could
> easily fit into memory by itself if anon and active_file were cold, so
> we need to challenge them to find out that they're hot. As you can
> see, if anon and F(H) were completely unused, the current behavior
> would be incorrect.
>
> The current behavior would do the same in a cache-only example:
>
>         anon = 1G (unreclaimable)
>         file = 500M (hot) / 300M (dummy)
>
>         P = 400M
>
>         F(H) -> P1 -> F(H) -> P2 ...
>
> If F(H) is already active, the first P refaults would have a distance
> of 100M, thereby challenging F(H). As F(H) reactivates, its size will
> be reflected in the refault distances, pushing them beyond the size of
> memory that is available to the cache: 600M refault distance > 500M
> active cache, or 900M access distance > 800M cache space.

Hmm... It seems that the current behavior (before your patch) for this
example has no problem. It causes deactivation but doesn't cause eviction,
so there is no workingset thrashing.

> However, I agree with your observation about the anon age below. When
> we start aging the anon set, we have to reflect that in the refault
> distances. Once we know that the 1G anon pages are indeed hotter than
> the pages in P, there is no reason to keep churning the workingset.

Okay.

> > > The only way they could get reclaimed is if their access distance ends
> > > up bigger than the file cache. But if that's the case, then the
> > > workingset is overcommitted, and none of the pages qualify for reclaim
> > > protection. Picking a subset to protect against the rest is arbitrary.
> >
> > In the fixed example, although other file (500 MB) is repeatedly accessed,
> > it's not workingset. If we have unified list (file + anon), access distance of
> > Pn will be larger than whole memory size. Therefore, it's not overcommitted
> > workingset and this patch wrongly try to activate it. As I said before,
> > without considering inactive_age for anon list, this calculation can not be
> > correct.
>
> You're right. If we don't take anon age into account, the activations
> could be over-eager; however, so would counting IO cost and exerting
> pressure on anon be, which means my previous patch to split these two
> wouldn't fix fundamental the problem you're pointing out. We simply

Splitting would not fix the fundamental problem (being over-eager), but it
would greatly weaken it. Just counting the IO cost doesn't break the
active/inactive separation in the file list. It does cause more scanning of
the anon list, but I think that's tolerable.

> have to take anon age into account for the refaults to be comparable.

Yes, taking anon age into account is also a good candidate to fix the problem.

> Once we do that, in your example, we would see activations in the
> beginning in order to challenge the combined workingset (active_file +
> anon) - which is legitimate as long as we don't know it's hot. And as
> the anon pages are scanned and rotated (and the challenged F(h)
> reactivated), the refault distances increase accordingly to reflect
> the size of the hot pages sampled, which will correctly put P's
> distances beyond the size of available memory.

Okay.

> However, your example cannot have a completely silent stable state. As
> we stop workingset aging, the refault distances will slowly increase
> again. We will always have a bit of churn, and rightfully so, because
> the workingset *could* go stale.
>
> That's the same situation in my cache-only example above. Anytime you
> have a subset of pages that by itself could fit into memory, but can't
> because of an established workingset, ongoing sampling is necessary.
>
> But the rate definitely needs to reduce as we detect that in-memory
> pages are indeed hot. Otherwise we cause more churn than is required
> for an appropriate rate of workingset sampling.
>
> How about the patch below? It looks correct, but I will have to re-run
> my tests to make sure I / we are not missing anything.

Much better! It may mostly resolve my concern.

But I still think that the modified refault activation equation isn't safe.
The next problem I found is related to the scan ratio limit patch ("limit
the range of LRU type balancing") in this series. See the example below.

anon: Hot (X M)
file: Hot (200 M) / dummy (200 M)
P: 1200 M (3 parts, each one 400 M, P1, P2, P3)
Access Pattern: A -> F(H) -> P1 -> A -> F(H) -> P2 -> ... ->

Without this patch, A and F(H) are kept in memory, which looks correct.

With this patch and the fix below, the refault equation for Pn would be:

Refault dist of Pn = 1200 (from file non-resident)
                     + 1200 * anon scan ratio (from anon non-resident)
anon + active file = X + 200

Activation condition: 1200 + 1200 * anon scan ratio (0.5 ~ 2) < X + 200

Depending on the size of X, Pn's refault result would be different (for
example, with an anon scan ratio of 1, the condition holds for any X larger
than 2200 M). Pn could be activated with a large enough X, and then F(H)
could be evicted. In the ideal case (a unified list), for this example, Pn
should not be activated for any X.

This is a fundamental problem since we have two list types (file/anon) and
a scan ratio limit is required. Anyway, we need to take care of this
reality, and the safest way is to count the IO cost instead of doing
activation in this 'non-resident dist < (active + anon list)' case, as in
the sketch below.
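
(In workingset_refault() terms, reusing the helpers from the diffs earlier
in this thread, that suggestion would look roughly like this sketch, not a
tested patch:)

	if (refault_distance <= active_file) {
		/* fits the current file set: activate as before */
		SetPageActive(page);
		workingset_age_nonresident(lruvec, hpage_nr_pages(page));
		inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
	} else if (refault_distance <= active_file + anon) {
		/* challenges anon: charge the refault IO, don't activate */
		spin_lock_irq(&page_pgdat(page)->lru_lock);
		lru_note_cost_page(page);
		spin_unlock_irq(&page_pgdat(page)->lru_lock);
	}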

Again, for this patch, I'm not confident myself so please let me know if I'm
wrong.

Thanks.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon
  2020-06-01  6:14               ` Joonsoo Kim
@ 2020-06-01 15:56                 ` Johannes Weiner
  2020-06-01 20:44                   ` Johannes Weiner
  2020-06-02  2:34                   ` Joonsoo Kim
  0 siblings, 2 replies; 35+ messages in thread
From: Johannes Weiner @ 2020-06-01 15:56 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Linux Memory Management List, Rik van Riel, Minchan Kim,
	Michal Hocko, Andrew Morton, Joonsoo Kim, LKML, kernel-team

On Mon, Jun 01, 2020 at 03:14:24PM +0900, Joonsoo Kim wrote:
> On Sat, May 30, 2020 at 12:12 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Fri, May 29, 2020 at 03:48:00PM +0900, Joonsoo Kim wrote:
> > > On Fri, May 29, 2020 at 2:02 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote:
> > > > > On Wed, May 27, 2020 at 10:43 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > > > The only way they could get reclaimed is if their access distance ends
> > > > up bigger than the file cache. But if that's the case, then the
> > > > workingset is overcommitted, and none of the pages qualify for reclaim
> > > > protection. Picking a subset to protect against the rest is arbitrary.
> > >
> > > In the fixed example, although other file (500 MB) is repeatedly accessed,
> > > it's not workingset. If we have unified list (file + anon), access distance of
> > > Pn will be larger than whole memory size. Therefore, it's not overcommitted
> > > workingset and this patch wrongly try to activate it. As I said before,
> > > without considering inactive_age for anon list, this calculation can not be
> > > correct.
> >
> > You're right. If we don't take anon age into account, the activations
> > could be over-eager; however, so would counting IO cost and exerting
> > pressure on anon be, which means my previous patch to split these two
> > wouldn't fix fundamental the problem you're pointing out. We simply
> 
> Splitting would not fix the fundamental problem (over-eager) but it would
> greatly weaken the problem. Just counting IO cost doesn't break the
> active/inactive separation in file list. It does cause more scan on anon list
> but I think that it's endurable.

I think the split is a good idea.

The only thing I'm not sure about yet is whether we can get away without
an additional page flag if the active flag cannot be reused to denote
thrashing. I'll keep at it; maybe I can figure something out.

But I think it would be follow-up work.

> > have to take anon age into account for the refaults to be comparable.
> 
> Yes, taking anon age into account is also a good candidate to fix the problem.

Okay, good.

> > However, your example cannot have a completely silent stable state. As
> > we stop workingset aging, the refault distances will slowly increase
> > again. We will always have a bit of churn, and rightfully so, because
> > the workingset *could* go stale.
> >
> > That's the same situation in my cache-only example above. Anytime you
> > have a subset of pages that by itself could fit into memory, but can't
> > because of an established workingset, ongoing sampling is necessary.
> >
> > But the rate definitely needs to reduce as we detect that in-memory
> > pages are indeed hot. Otherwise we cause more churn than is required
> > for an appropriate rate of workingset sampling.
> >
> > How about the patch below? It looks correct, but I will have to re-run
> > my tests to make sure I / we are not missing anything.
> 
> Much better! It may solve my concern mostly.

Okay thanks for confirming. I'll send a proper version to Andrew.

> But, I still think that modified refault activation equation isn't
> safe. The next
> problem I found is related to the scan ratio limit patch ("limit the range of
> LRU type balancing") on this series. See the below example.
> 
> anon: Hot (X M)
> file: Hot (200 M) / dummy (200 M)
> P: 1200 M (3 parts, each one 400 M, P1, P2, P3)
> Access Pattern: A -> F(H) -> P1 -> A -> F(H) -> P2 -> ... ->
> 
> Without this patch, A and F(H) are kept on the memory and look like
> it's correct.
> 
> With this patch and below fix, refault equation for Pn would be:
> 
> Refault dist of Pn = 1200 (from file non-resident) + 1200 * anon scan
> ratio (from anon non-resident)
> anon + active file = X + 200
> 1200 + 1200 * anon scan ratio (0.5 ~ 2) < X + 200

That doesn't look quite right to me. The anon part of the refault
distance is driven by X, so the left-hand side of this formula contains X
as well.

1000 file (1200M reuse distance, 200M in-core size) + F(H) reactivations + X * scan ratio < X + 1000

Activations persist as long as anon isn't fully scanned and it isn't
established yet that it's fully hot. Meaning, we optimistically assume
the refaulting pages can be workingset until we're proven wrong.

> According to the size of X, Pn's refault result would be different. Pn could
> be activated with large enough X and then F(H) could be evicted. In ideal
> case (unified list), for this example, Pn should not be activated in any X.

Yes. The active/iocost split would allow us to be smarter about it.

> This is a fundamental problem since we have two list type (file/anon) and
> scan ratio limit is required. Anyway, we need to take care of this reality and
> the way most safe is to count IO cost instead of doing activation in this
> 'non-resident dist < (active + anon list)' case.

Agreed here again.

> Again, for this patch, I'm not confident myself so please let me know if I'm
> wrong.

As far as this patch goes, I think it's important to look at the
bigger picture.

We need to have convergence first before being able to worry about
optimizing. Stable states are optimizations, but false stable states
are correctness problems.

For the longest time, we scanned active pages unconditionally during
page reclaim. This was always safe in the sense that it wouldn't get
stuck on a stale workingset, but it incurs unnecessary workingset
churn when reclaim is driven by use-once patterns.

We optimized the latter too aggressively, and as a result caused
situations where we indefinitely fail to cache the hottest
data. That's not really a workable trade-off.

With the active/iocost split you're suggesting, we can reasonably
optimize your example scenario. But we can't do it if the flipside
means complete failure to transition between in-memory sets.

So I think we should go ahead with this patch (with the anon age
recognition fixed, because that's a correctness issue), and follow it
up with the stable state optimization to shrink anon first.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon
  2020-06-01 15:56                 ` Johannes Weiner
@ 2020-06-01 20:44                   ` Johannes Weiner
  2020-06-04 13:35                     ` Vlastimil Babka
  2020-06-02  2:34                   ` Joonsoo Kim
  1 sibling, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2020-06-01 20:44 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Linux Memory Management List, Rik van Riel, Minchan Kim,
	Michal Hocko, Andrew Morton, Joonsoo Kim, LKML, kernel-team

On Mon, Jun 01, 2020 at 11:56:17AM -0400, Johannes Weiner wrote:
> On Mon, Jun 01, 2020 at 03:14:24PM +0900, Joonsoo Kim wrote:
> > On Sat, May 30, 2020 at 12:12 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > However, your example cannot have a completely silent stable state. As
> > > we stop workingset aging, the refault distances will slowly increase
> > > again. We will always have a bit of churn, and rightfully so, because
> > > the workingset *could* go stale.
> > >
> > > That's the same situation in my cache-only example above. Anytime you
> > > have a subset of pages that by itself could fit into memory, but can't
> > > because of an established workingset, ongoing sampling is necessary.
> > >
> > > But the rate definitely needs to reduce as we detect that in-memory
> > > pages are indeed hot. Otherwise we cause more churn than is required
> > > for an appropriate rate of workingset sampling.
> > >
> > > How about the patch below? It looks correct, but I will have to re-run
> > > my tests to make sure I / we are not missing anything.
> > 
> > Much better! It may solve my concern mostly.
> 
> Okay thanks for confirming. I'll send a proper version to Andrew.

I sent the below patch through my battery of tests to make sure it
doesn't break anything fundamental, and it doesn't. For the
convergence test, the overload test, and the kernel build, the
difference, if any, is in the noise, which is expected.

Andrew, absent any objections, can you please fold the below into the
original patch?

It's a bit noisy due to the rename of inactive_age now that it's used
for anon as well (frankly, it was a misnomer from the start), and
switching the interface to take an lruvec instead of a memcg and a
pgdat, which works better for the callers (I stole parent_lruvec()
from "mm: vmscan: determine anon/file pressure balance at the reclaim
root" later in the series, so that isn't new code).

The main change is the two bits in __remove_mapping and
move_pages_to_lru, so that shadow entries are aged when anon is
reclaimed or rotated.

---

From a8faceabc1454dfd878caee2a8422493d937a394 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 1 Jun 2020 14:04:09 -0400
Subject: [PATCH] mm: workingset: let cache workingset challenge anon fix

We compare refault distances to active_file + anon. But age of the
non-resident information is only driven by the file LRU. As a result,
we may overestimate the recency of any incoming refaults and activate
them too eagerly, causing unnecessary LRU churn in certain situations.

Make anon aging drive nonresident age as well to address that.

Reported-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 13 +++++++++++
 include/linux/mmzone.h     |  4 ++--
 include/linux/swap.h       |  1 +
 mm/vmscan.c                |  3 +++
 mm/workingset.c            | 46 ++++++++++++++++++++++----------------
 5 files changed, 46 insertions(+), 21 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 32a0b4d47540..d982c80da157 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1303,6 +1303,19 @@ static inline void dec_lruvec_page_state(struct page *page,
 	mod_lruvec_page_state(page, idx, -1);
 }
 
+static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = lruvec_memcg(lruvec);
+	if (!memcg)
+		return NULL;
+	memcg = parent_mem_cgroup(memcg);
+	if (!memcg)
+		return NULL;
+	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c1fbda9ddd1f..7cccbb8bc2d7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -262,8 +262,8 @@ enum lruvec_flags {
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
 	struct zone_reclaim_stat	reclaim_stat;
-	/* Evictions & activations on the inactive file list */
-	atomic_long_t			inactive_age;
+	/* Non-resident age, driven by LRU movement */
+	atomic_long_t			nonresident_age;
 	/* Refaults at the time of last reclaim cycle */
 	unsigned long			refaults;
 	/* Various lruvec state flags (enum lruvec_flags) */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 30fd4641890e..f0d2dca28c99 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -312,6 +312,7 @@ struct vma_swap_readahead {
 };
 
 /* linux/mm/workingset.c */
+void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
 void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg);
 void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 43f88b1a4f14..3033f1c951cd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -898,6 +898,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		__delete_from_swap_cache(page, swap);
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
 		put_swap_page(page, swap);
+		workingset_eviction(page, target_memcg);
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
@@ -1876,6 +1877,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 				list_add(&page->lru, &pages_to_free);
 		} else {
 			nr_moved += nr_pages;
+			if (PageActive(page))
+				workingset_age_nonresident(lruvec, nr_pages);
 		}
 	}
 
diff --git a/mm/workingset.c b/mm/workingset.c
index e69865739539..98b91e623b85 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -156,8 +156,8 @@
  *
  *		Implementation
  *
- * For each node's file LRU lists, a counter for inactive evictions
- * and activations is maintained (node->inactive_age).
+ * For each node's LRU lists, a counter for inactive evictions and
+ * activations is maintained (node->nonresident_age).
  *
  * On eviction, a snapshot of this counter (along with some bits to
  * identify the node) is stored in the now empty page cache
@@ -213,7 +213,17 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*workingsetp = workingset;
 }
 
-static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
+/**
+ * workingset_age_nonresident - age non-resident entries as LRU ages
+ * @lruvec: the lruvec that was aged
+ * @nr_pages: the number of pages to count
+ *
+ * As in-memory pages are aged, non-resident pages need to be aged as
+ * well, in order for the refault distances later on to be comparable
+ * to the in-memory dimensions. This function allows reclaim and LRU
+ * operations to drive the non-resident aging along in parallel.
+ */
+void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
 {
 	/*
 	 * Reclaiming a cgroup means reclaiming all its children in a
@@ -227,11 +237,8 @@ static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
 	 * the root cgroup's, age as well.
 	 */
 	do {
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-		atomic_long_inc(&lruvec->inactive_age);
-	} while (memcg && (memcg = parent_mem_cgroup(memcg)));
+		atomic_long_add(nr_pages, &lruvec->nonresident_age);
+	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
 /**
@@ -254,12 +261,11 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	advance_inactive_age(page_memcg(page), pgdat);
-
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
+	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
-	eviction = atomic_long_read(&lruvec->inactive_age);
+	eviction = atomic_long_read(&lruvec->nonresident_age);
 	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
 
@@ -309,20 +315,20 @@ void workingset_refault(struct page *page, void *shadow)
 	if (!mem_cgroup_disabled() && !eviction_memcg)
 		goto out;
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
-	refault = atomic_long_read(&eviction_lruvec->inactive_age);
+	refault = atomic_long_read(&eviction_lruvec->nonresident_age);
 
 	/*
 	 * Calculate the refault distance
 	 *
 	 * The unsigned subtraction here gives an accurate distance
-	 * across inactive_age overflows in most cases. There is a
+	 * across nonresident_age overflows in most cases. There is a
 	 * special case: usually, shadow entries have a short lifetime
 	 * and are either refaulted or reclaimed along with the inode
 	 * before they get too old.  But it is not impossible for the
-	 * inactive_age to lap a shadow entry in the field, which can
-	 * then result in a false small refault distance, leading to a
-	 * false activation should this old entry actually refault
-	 * again.  However, earlier kernels used to deactivate
+	 * nonresident_age to lap a shadow entry in the field, which
+	 * can then result in a false small refault distance, leading
+	 * to a false activation should this old entry actually
+	 * refault again.  However, earlier kernels used to deactivate
 	 * unconditionally with *every* reclaim invocation for the
 	 * longest time, so the occasional inappropriate activation
 	 * leading to pressure on the active list is not a problem.
@@ -359,7 +365,7 @@ void workingset_refault(struct page *page, void *shadow)
 		goto out;
 
 	SetPageActive(page);
-	advance_inactive_age(memcg, pgdat);
+	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
 	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
 
 	/* Page was active prior to eviction */
@@ -378,6 +384,7 @@ void workingset_refault(struct page *page, void *shadow)
 void workingset_activation(struct page *page)
 {
 	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
 
 	rcu_read_lock();
 	/*
@@ -390,7 +397,8 @@ void workingset_activation(struct page *page)
 	memcg = page_memcg_rcu(page);
 	if (!mem_cgroup_disabled() && !memcg)
 		goto out;
-	advance_inactive_age(memcg, page_pgdat(page));
+	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
 out:
 	rcu_read_unlock();
 }
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon
  2020-06-01 15:56                 ` Johannes Weiner
  2020-06-01 20:44                   ` Johannes Weiner
@ 2020-06-02  2:34                   ` Joonsoo Kim
  2020-06-02 16:47                     ` Johannes Weiner
  1 sibling, 1 reply; 35+ messages in thread
From: Joonsoo Kim @ 2020-06-02  2:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linux Memory Management List, Rik van Riel, Minchan Kim,
	Michal Hocko, Andrew Morton, Joonsoo Kim, LKML, kernel-team

On Tue, Jun 2, 2020 at 12:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Jun 01, 2020 at 03:14:24PM +0900, Joonsoo Kim wrote:
> > On Sat, May 30, 2020 at 12:12 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Fri, May 29, 2020 at 03:48:00PM +0900, Joonsoo Kim wrote:
> > > > On Fri, May 29, 2020 at 2:02 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote:
> > > > > > On Wed, May 27, 2020 at 10:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > > > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > > > > The only way they could get reclaimed is if their access distance ends
> > > > > up bigger than the file cache. But if that's the case, then the
> > > > > workingset is overcommitted, and none of the pages qualify for reclaim
> > > > > protection. Picking a subset to protect against the rest is arbitrary.
> > > >
> > > > In the fixed example, although other file (500 MB) is repeatedly accessed,
> > > > it's not workingset. If we have unified list (file + anon), access distance of
> > > > Pn will be larger than whole memory size. Therefore, it's not overcommitted
> > > > workingset and this patch wrongly tries to activate it. As I said before,
> > > > without considering inactive_age for the anon list, this calculation cannot be
> > > > correct.
> > >
> > > You're right. If we don't take anon age into account, the activations
> > > could be over-eager; however, so would counting IO cost and exerting
> > > pressure on anon be, which means my previous patch to split these two
> > > wouldn't fix the fundamental problem you're pointing out. We simply
> >
> > Splitting would not fix the fundamental problem (over-eagerness), but it
> > would greatly reduce it. Just counting IO cost doesn't break the
> > active/inactive separation in the file list. It does cause more scanning
> > of the anon list, but I think that's tolerable.
>
> I think the split is a good idea.
>
> The only thing I'm not sure yet is if we can get away without an
> additional page flag if the active flag cannot be reused to denote
> thrashing. I'll keep at it, maybe I can figure something out.
>
> But I think it would be follow-up work.
>
> > > have to take anon age into account for the refaults to be comparable.
> >
> > Yes, taking anon age into account is also a good candidate to fix the problem.
>
> Okay, good.
>
> > > However, your example cannot have a completely silent stable state. As
> > > we stop workingset aging, the refault distances will slowly increase
> > > again. We will always have a bit of churn, and rightfully so, because
> > > the workingset *could* go stale.
> > >
> > > That's the same situation in my cache-only example above. Anytime you
> > > have a subset of pages that by itself could fit into memory, but can't
> > > because of an established workingset, ongoing sampling is necessary.
> > >
> > > But the rate definitely needs to reduce as we detect that in-memory
> > > pages are indeed hot. Otherwise we cause more churn than is required
> > > for an appropriate rate of workingset sampling.
> > >
> > > How about the patch below? It looks correct, but I will have to re-run
> > > my tests to make sure I / we are not missing anything.
> >
> > Much better! It may solve my concern mostly.
>
> Okay thanks for confirming. I'll send a proper version to Andrew.

Okay.

> > But, I still think that modified refault activation equation isn't
> > safe. The next
> > problem I found is related to the scan ratio limit patch ("limit the range of
> > LRU type balancing") on this series. See the below example.
> >
> > anon: Hot (X M)
> > file: Hot (200 M) / dummy (200 M)
> > P: 1200 M (3 parts, each one 400 M, P1, P2, P3)
> > Access Pattern: A -> F(H) -> P1 -> A -> F(H) -> P2 -> ... ->
> >
> > Without this patch, A and F(H) are kept on the memory and look like
> > it's correct.
> >
> > With this patch and below fix, refault equation for Pn would be:
> >
> > Refault dist of Pn = 1200 (from file non-resident) + 1200 * anon scan
> > ratio (from anon non-resident)
> > anon + active file = X + 200
> > 1200 + 1200 * anon scan ratio (0.5 ~ 2) < X + 200
>
> That doesn't look quite right to me. The anon part of the refault
> distance is driven by X, so the left-hand of this formula contains X
> as well.
>
> 1000 file (1200M reuse distance, 200M in-core size) + F(H) reactivations + X * scan ratio < X + 1000

As I said before, there is no X on the left-hand side of this formula.
To access all Pn and re-access P1, we need to scan and reclaim 1200M of
the file list; no more scanning is needed. With your patch "limit the
range of LRU type balancing", the scan ratio between the file and anon
lists is limited to 0.5 ~ 2.0, so the maximum anon scan would be
1200 M * 2.0, that is, 2400 M, which is not bounded by X. That means
the file list cannot be stable for some X.

> Activations persist as long as anon isn't fully scanned and it isn't
> established yet that it's fully hot. Meaning, we optimistically assume
> the refaulting pages can be workingset until we're proven wrong.
>
> > According to the size of X, Pn's refault result would be different. Pn could
> > be activated with large enough X and then F(H) could be evicted. In ideal
> > case (unified list), for this example, Pn should not be activated in any X.
>
> Yes. The active/iocost split would allow us to be smarter about it.
>
> > This is a fundamental problem since we have two list type (file/anon) and
> > scan ratio limit is required. Anyway, we need to take care of this reality and
> > the way most safe is to count IO cost instead of doing activation in this
> > 'non-resident dist < (active + anon list)' case.
>
> Agreed here again.
>
> > Again, for this patch, I'm not confident myself so please let me know if I'm
> > wrong.
>
> As far as this patch goes, I think it's important to look at the
> bigger picture.
>
> We need to have convergence first before being able to worry about
> optimizing. Stable states are optimizations, but false stable states
> are correctness problems.
>
> For the longest time, we scanned active pages unconditionally during
> page reclaim. This was always safe in the sense that it wouldn't get
> stuck on a stale workingset, but it incurs unnecessary workingset
> churn when reclaim is driven by use-once patterns.
>
> We optimized the latter too aggressively, and as a result caused
> situations where we indefinitely fail to cache the hottest
> data. That's not really a workable trade-off.
>
> With the active/iocost split you're suggesting, we can reasonably
> optimize your example scenario. But we can't do it if the flipside
> means complete failure to transition between in-memory sets.
>
> So I think we should go ahead with this patch (with the anon age
> recognition fixed, because that's a correctness issue), and follow it
> up with the stable state optimization to shrink anon first.

If the example I found last is a correct one (your confirmation is
required), it is also related to the correctness issue, since cold
pages repeatedly cause eviction of the hot pages.

In this case, both variants (without the patch, with the patch) have
some correctness issue, so we need to judge which one is better in
terms of overall impact. I don't have a strong opinion about it, so
it's up to you to decide the way to go.

Thanks.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon
  2020-06-02  2:34                   ` Joonsoo Kim
@ 2020-06-02 16:47                     ` Johannes Weiner
  2020-06-03  5:40                       ` Joonsoo Kim
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2020-06-02 16:47 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Linux Memory Management List, Rik van Riel, Minchan Kim,
	Michal Hocko, Andrew Morton, Joonsoo Kim, LKML, kernel-team

On Tue, Jun 02, 2020 at 11:34:17AM +0900, Joonsoo Kim wrote:
> On Tue, Jun 2, 2020 at 12:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Mon, Jun 01, 2020 at 03:14:24PM +0900, Joonsoo Kim wrote:
> > > But, I still think that modified refault activation equation isn't
> > > safe. The next
> > > problem I found is related to the scan ratio limit patch ("limit the range of
> > > LRU type balancing") on this series. See the below example.
> > >
> > > anon: Hot (X M)
> > > file: Hot (200 M) / dummy (200 M)
> > > P: 1200 M (3 parts, each one 400 M, P1, P2, P3)
> > > Access Pattern: A -> F(H) -> P1 -> A -> F(H) -> P2 -> ... ->
> > >
> > > Without this patch, A and F(H) are kept on the memory and look like
> > > it's correct.
> > >
> > > With this patch and below fix, refault equation for Pn would be:
> > >
> > > Refault dist of Pn = 1200 (from file non-resident) + 1200 * anon scan
> > > ratio (from anon non-resident)
> > > anon + active file = X + 200
> > > 1200 + 1200 * anon scan ratio (0.5 ~ 2) < X + 200
> >
> > That doesn't look quite right to me. The anon part of the refault
> > distance is driven by X, so the left-hand of this formula contains X
> > as well.
> >
> > 1000 file (1200M reuse distance, 200M in-core size) + F(H) reactivations + X * scan ratio < X + 1000
> 
> As I said before, there is no X on left-hand of this formula. To
> access all Pn and
> re-access P1, we need 1200M file list scan and reclaim. More scan isn't needed.
> With your patch "limit the range of LRU type balancing", scan ratio
> between file/anon
> list is limited to 0.5 ~ 2.0, so, maximum anon scan would be 1200 M *
> 2.0, that is,
> 2400 M and not bounded by X. That means that file list cannot be
> stable with some X.

Oh, no X on the left because you're talking about the number of pages
scanned until the first refaults, which is fixed - so why are we still
interpreting the refault distance against a variable anon size X?

Well, that's misleading. We compare against anon because part of the
cache is already encoded in the refault distance. What we're really
checking is access distance against total amount of available RAM.

Consider this. We want to activate pages where

	access_distance <= RAM

and our measure of access distance is:

	access_distance = refault_distance + inactive_file

So the comparison becomes:

	refault_distance + inactive_file < RAM

which, since RAM is roughly inactive_file + active_file + anon, simplifies to:

	refault_distance < active_file + anon

There is a certain threshold for X simply because there is a certain
threshold for RAM beyond which we need to start activating. X cannot
be arbitrary, it must be X + cache filling up memory - after all we
have page reclaim evicting pages.

Again, this isn't new. In the current code, we activate when:

	refault_distance < active_file

which is

	access_distance <= RAM - anon

You can see, whether things are stable or not always depends on the
existing workingset size. It's just a proxy for how much total RAM we
have potentially available to the refaulting page.
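
To make that comparison concrete, here is a minimal, self-contained
sketch; it is not the mm/workingset.c code, and the helper name plus
the numbers in main() are made up, loosely following the 1200M/200M
example in this thread:

/* Units cancel out, so read the numbers as MB. */
#include <stdbool.h>
#include <stdio.h>

static bool should_activate(unsigned long refault_distance,
			    unsigned long active_file,
			    unsigned long anon)
{
	/*
	 * refault_distance + inactive_file < RAM
	 * is the same check as
	 * refault_distance < active_file + anon
	 * when RAM is roughly inactive_file + active_file + anon.
	 */
	return refault_distance < active_file + anon;
}

int main(void)
{
	/* hypothetical sizes for the anon set X */
	printf("%d\n", should_activate(1200, 200, 800));	/* 0: distance exceeds RAM */
	printf("%d\n", should_activate(1200, 200, 1100));	/* 1: could fit, activate */
	return 0;
}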

> If my lastly found example is a correct example (your confirm is required),
> it is also related to the correctness issue since cold pages causes
> eviction of the hot pages repeatedly.

I think your example is correct, but it benefits from the VM
arbitrarily making an assumption that has a 50/50 shot of being true.

You and I know which pages are hot and which are cold because you
designed the example.

All the VM sees is this:

- We have an established workingset that has previously shown an
  access distance <= RAM and therefore was activated.

- We now have another set that also appears to have an access distance
  <= RAM. The only way to know for sure, however, is to sample the
  established workingset and compare the relative access frequencies.

Currently, we just assume the incoming pages are colder. Clearly
that's beneficial when it's true. Clearly that can be totally wrong.

We must allow a fair comparison between these two sets.

For cache, that's already the case - that's why I brought up the
cache-only example: if refault distances are 50M and you have 60M of
active cache, we activate all refaults and force an even competition
between the established workingset and the new pages.

Whether we can protect active file when anon needs to shrink first and
can't (the activate/iocost split), that's a different question. But I'm
no longer so sure after looking into it further.

First, we would need two different refault distances: either we
consider anon age and need to compare to active_file + anon, or we
don't and compare to active_file only. We cannot mix willy nilly,
because the metrics wouldn't be comparable. We don't have the space to
store two different eviction timestamps, nor could we afford to cut
the precision in half.

Second, the additional page flag required to implement it.

Third, it's somewhat moot because we still have the same behavior when
active_file would need to shrink and can't. There can't be a stable
state as long as refault distances <= active_file.

> In this case, they (without patch, with patch) all have some correctness
> issue so we need to judge which one is better in terms of overall impact.
> I don't have strong opinion about it so it's up to you to decide the way to go.

If my patch was simply changing the default assumption on which pages
are hot and which are cold, I would agree with you - the pros would be
equal to the cons, one way wouldn't be more correct than the other.

But that isn't what my patch is doing. What it does is get rid of the
assumption, to actually sample and compare the access frequencies when
there isn't enough data to make an informed decision.

That's a net improvement.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon
  2020-06-02 16:47                     ` Johannes Weiner
@ 2020-06-03  5:40                       ` Joonsoo Kim
  0 siblings, 0 replies; 35+ messages in thread
From: Joonsoo Kim @ 2020-06-03  5:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linux Memory Management List, Rik van Riel, Minchan Kim,
	Michal Hocko, Andrew Morton, Joonsoo Kim, LKML, kernel-team

On Wed, Jun 3, 2020 at 1:48 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Jun 02, 2020 at 11:34:17AM +0900, Joonsoo Kim wrote:
> > On Tue, Jun 2, 2020 at 12:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > On Mon, Jun 01, 2020 at 03:14:24PM +0900, Joonsoo Kim wrote:
> > > > But, I still think that modified refault activation equation isn't
> > > > safe. The next
> > > > problem I found is related to the scan ratio limit patch ("limit the range of
> > > > LRU type balancing") on this series. See the below example.
> > > >
> > > > anon: Hot (X M)
> > > > file: Hot (200 M) / dummy (200 M)
> > > > P: 1200 M (3 parts, each one 400 M, P1, P2, P3)
> > > > Access Pattern: A -> F(H) -> P1 -> A -> F(H) -> P2 -> ... ->
> > > >
> > > > Without this patch, A and F(H) are kept on the memory and look like
> > > > it's correct.
> > > >
> > > > With this patch and below fix, refault equation for Pn would be:
> > > >
> > > > Refault dist of Pn = 1200 (from file non-resident) + 1200 * anon scan
> > > > ratio (from anon non-resident)
> > > > anon + active file = X + 200
> > > > 1200 + 1200 * anon scan ratio (0.5 ~ 2) < X + 200
> > >
> > > That doesn't look quite right to me. The anon part of the refault
> > > distance is driven by X, so the left-hand of this formula contains X
> > > as well.
> > >
> > > 1000 file (1200M reuse distance, 200M in-core size) + F(H) reactivations + X * scan ratio < X + 1000
> >
> > As I said before, there is no X on left-hand of this formula. To
> > access all Pn and
> > re-access P1, we need 1200M file list scan and reclaim. More scan isn't needed.
> > With your patch "limit the range of LRU type balancing", scan ratio
> > between file/anon
> > list is limited to 0.5 ~ 2.0, so, maximum anon scan would be 1200 M *
> > 2.0, that is,
> > 2400 M and not bounded by X. That means that file list cannot be
> > stable with some X.
>
> Oh, no X on the left because you're talking about the number of pages
> scanned until the first refaults, which is fixed - so why are we still
> interpreting the refault distance against a variable anon size X?

Looks like I was confused again. Your formula is correct and mine is
wrong. My mistake was thinking that your patch "limit the range of LRU
type balancing", which caps the scan *ratio* at 2:1, also limits the
actual scan *count* ratio between anon/file to 2:1. But now I realize
that 2:1 is just the scan ratio, and the actual scan *count* ratio
could be far larger with a certain list size. It would be X * scan
ratio in the above example, so my explanation is wrong and you are
right.

Sorry for the trouble.

> Well, that's misleading. We compare against anon because part of the
> cache is already encoded in the refault distance. What we're really
> checking is access distance against total amount of available RAM.
>
> Consider this. We want to activate pages where
>
>         access_distance <= RAM
>
> and our measure of access distance is:
>
>         access_distance = refault_distance + inactive_file
>
> So the comparison becomes:
>
>         refault_distance + inactive_file < RAM
>
> which we simplify to:
>
>         refault_distance < active_file + anon
>
> There is a certain threshold for X simply because there is a certain
> threshold for RAM beyond which we need to start activating. X cannot
> be arbitrary, it must be X + cache filling up memory - after all we
> have page reclaim evicting pages.
>
> Again, this isn't new. In the current code, we activate when:
>
>         refault_distance < active_file
>
> which is
>
>         access_distance <= RAM - anon
>
> You can see, whether things are stable or not always depends on the
> existing workingset size. It's just a proxy for how much total RAM we
> have potentially available to the refaulting page.
>
> > If my lastly found example is a correct example (your confirm is required),
> > it is also related to the correctness issue since cold pages causes
> > eviction of the hot pages repeatedly.
>
> I think your example is correct, but it benefits from the VM
> arbitrarily making an assumption that has a 50/50 shot of being true.
>
> You and I know which pages are hot and which are cold because you
> designed the example.
>
> All the VM sees is this:
>
> - We have an established workingset that has previously shown an
>   access distance <= RAM and therefore was activated.
>
> - We now have another set that also appears to have an access distance
>   <= RAM. The only way to know for sure, however, is to sample the
>   established workingset and compare the relative access frequencies.
>
> Currently, we just assume the incoming pages are colder. Clearly
> that's beneficial when it's true. Clearly that can be totally wrong.
>
> We must allow a fair comparison between these two sets.
>
> For cache, that's already the case - that's why I brought up the
> cache-only example: if refault distances are 50M and you have 60M of
> active cache, we activate all refaults and force an even competition
> between the established workingset and the new pages.
>
> Whether we can protect active file when anon needs to shrink first and
> can't (the activate/iocost split) that's a different question. But I'm
> no longer so sure after looking into it further.
>
> First, we would need two different refault distances: either we
> consider anon age and need to compare to active_file + anon, or we
> don't and compare to active_file only. We cannot mix willy nilly,
> because the metrics wouldn't be comparable. We don't have the space to
> store two different eviction timestamps, nor could we afford to cut
> the precision in half.
>
> Second, the additional page flag required to implement it.
>
> Third, it's somewhat moot because we still have the same behavior when
> active_file would need to shrink and can't. There can't be a stable
> state as long as refault distances <= active_file.
>
> > In this case, they (without patch, with patch) all have some correctness
> > issue so we need to judge which one is better in terms of overall impact.
> > I don't have strong opinion about it so it's up to you to decide the way to go.
>
> If my patch was simply changing the default assumption on which pages
> are hot and which are cold, I would agree with you - the pros would be
> equal to the cons, one way wouldn't be more correct than the other.
>
> But that isn't what my patch is doing. What it does is get rid of the
> assumption, to actually sample and compare the access frequencies when
> there isn't enough data to make an informed decision.
>
> That's a net improvement.

Okay. Now I think that this patch has a net improvement.

Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Thanks.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon
  2020-06-01 20:44                   ` Johannes Weiner
@ 2020-06-04 13:35                     ` Vlastimil Babka
  2020-06-04 15:05                       ` Johannes Weiner
  0 siblings, 1 reply; 35+ messages in thread
From: Vlastimil Babka @ 2020-06-04 13:35 UTC (permalink / raw)
  To: Johannes Weiner, Joonsoo Kim
  Cc: Linux Memory Management List, Rik van Riel, Minchan Kim,
	Michal Hocko, Andrew Morton, Joonsoo Kim, LKML, kernel-team

On 6/1/20 10:44 PM, Johannes Weiner wrote:
> From a8faceabc1454dfd878caee2a8422493d937a394 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 1 Jun 2020 14:04:09 -0400
> Subject: [PATCH] mm: workingset: let cache workingset challenge anon fix

Looks like the whole series is now in mainline (that was quick!) without this
followup? As such it won't be squashed so perhaps the subject should be more
"standalone" now, description referencing commit 34e58cac6d8f ("mm: workingset:
let cache workingset challenge anon"), possibly with Fixes: tag etc...

> We compare refault distances to active_file + anon. But age of the
> non-resident information is only driven by the file LRU. As a result,
> we may overestimate the recency of any incoming refaults and activate
> them too eagerly, causing unnecessary LRU churn in certain situations.
> 
> Make anon aging drive nonresident age as well to address that.
> 
> Reported-by: Joonsoo Kim <js1304@gmail.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/memcontrol.h | 13 +++++++++++
>  include/linux/mmzone.h     |  4 ++--
>  include/linux/swap.h       |  1 +
>  mm/vmscan.c                |  3 +++
>  mm/workingset.c            | 46 ++++++++++++++++++++++----------------
>  5 files changed, 46 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 32a0b4d47540..d982c80da157 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1303,6 +1303,19 @@ static inline void dec_lruvec_page_state(struct page *page,
>  	mod_lruvec_page_state(page, idx, -1);
>  }
>  
> +static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
> +{
> +	struct mem_cgroup *memcg;
> +
> +	memcg = lruvec_memcg(lruvec);
> +	if (!memcg)
> +		return NULL;
> +	memcg = parent_mem_cgroup(memcg);
> +	if (!memcg)
> +		return NULL;
> +	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
> +}
> +
>  #ifdef CONFIG_CGROUP_WRITEBACK
>  
>  struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index c1fbda9ddd1f..7cccbb8bc2d7 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -262,8 +262,8 @@ enum lruvec_flags {
>  struct lruvec {
>  	struct list_head		lists[NR_LRU_LISTS];
>  	struct zone_reclaim_stat	reclaim_stat;
> -	/* Evictions & activations on the inactive file list */
> -	atomic_long_t			inactive_age;
> +	/* Non-resident age, driven by LRU movement */
> +	atomic_long_t			nonresident_age;
>  	/* Refaults at the time of last reclaim cycle */
>  	unsigned long			refaults;
>  	/* Various lruvec state flags (enum lruvec_flags) */
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 30fd4641890e..f0d2dca28c99 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -312,6 +312,7 @@ struct vma_swap_readahead {
>  };
>  
>  /* linux/mm/workingset.c */
> +void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
>  void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg);
>  void workingset_refault(struct page *page, void *shadow);
>  void workingset_activation(struct page *page);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 43f88b1a4f14..3033f1c951cd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -898,6 +898,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>  		__delete_from_swap_cache(page, swap);
>  		xa_unlock_irqrestore(&mapping->i_pages, flags);
>  		put_swap_page(page, swap);
> +		workingset_eviction(page, target_memcg);
>  	} else {
>  		void (*freepage)(struct page *);
>  		void *shadow = NULL;
> @@ -1876,6 +1877,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>  				list_add(&page->lru, &pages_to_free);
>  		} else {
>  			nr_moved += nr_pages;
> +			if (PageActive(page))
> +				workingset_age_nonresident(lruvec, nr_pages);
>  		}
>  	}
>  
> diff --git a/mm/workingset.c b/mm/workingset.c
> index e69865739539..98b91e623b85 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -156,8 +156,8 @@
>   *
>   *		Implementation
>   *
> - * For each node's file LRU lists, a counter for inactive evictions
> - * and activations is maintained (node->inactive_age).
> + * For each node's LRU lists, a counter for inactive evictions and
> + * activations is maintained (node->nonresident_age).
>   *
>   * On eviction, a snapshot of this counter (along with some bits to
>   * identify the node) is stored in the now empty page cache
> @@ -213,7 +213,17 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
>  	*workingsetp = workingset;
>  }
>  
> -static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
> +/**
> + * workingset_age_nonresident - age non-resident entries as LRU ages
> + * @lruvec: the lruvec that was aged
> + * @nr_pages: the number of pages to count
> + *
> + * As in-memory pages are aged, non-resident pages need to be aged as
> + * well, in order for the refault distances later on to be comparable
> + * to the in-memory dimensions. This function allows reclaim and LRU
> + * operations to drive the non-resident aging along in parallel.
> + */
> +void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
>  {
>  	/*
>  	 * Reclaiming a cgroup means reclaiming all its children in a
> @@ -227,11 +237,8 @@ static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
>  	 * the root cgroup's, age as well.
>  	 */
>  	do {
> -		struct lruvec *lruvec;
> -
> -		lruvec = mem_cgroup_lruvec(memcg, pgdat);
> -		atomic_long_inc(&lruvec->inactive_age);
> -	} while (memcg && (memcg = parent_mem_cgroup(memcg)));
> +		atomic_long_add(nr_pages, &lruvec->nonresident_age);
> +	} while ((lruvec = parent_lruvec(lruvec)));
>  }
>  
>  /**
> @@ -254,12 +261,11 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
>  	VM_BUG_ON_PAGE(page_count(page), page);
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  
> -	advance_inactive_age(page_memcg(page), pgdat);
> -
>  	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
> +	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
>  	/* XXX: target_memcg can be NULL, go through lruvec */
>  	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
> -	eviction = atomic_long_read(&lruvec->inactive_age);
> +	eviction = atomic_long_read(&lruvec->nonresident_age);
>  	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
>  }
>  
> @@ -309,20 +315,20 @@ void workingset_refault(struct page *page, void *shadow)
>  	if (!mem_cgroup_disabled() && !eviction_memcg)
>  		goto out;
>  	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
> -	refault = atomic_long_read(&eviction_lruvec->inactive_age);
> +	refault = atomic_long_read(&eviction_lruvec->nonresident_age);
>  
>  	/*
>  	 * Calculate the refault distance
>  	 *
>  	 * The unsigned subtraction here gives an accurate distance
> -	 * across inactive_age overflows in most cases. There is a
> +	 * across nonresident_age overflows in most cases. There is a
>  	 * special case: usually, shadow entries have a short lifetime
>  	 * and are either refaulted or reclaimed along with the inode
>  	 * before they get too old.  But it is not impossible for the
> -	 * inactive_age to lap a shadow entry in the field, which can
> -	 * then result in a false small refault distance, leading to a
> -	 * false activation should this old entry actually refault
> -	 * again.  However, earlier kernels used to deactivate
> +	 * nonresident_age to lap a shadow entry in the field, which
> +	 * can then result in a false small refault distance, leading
> +	 * to a false activation should this old entry actually
> +	 * refault again.  However, earlier kernels used to deactivate
>  	 * unconditionally with *every* reclaim invocation for the
>  	 * longest time, so the occasional inappropriate activation
>  	 * leading to pressure on the active list is not a problem.
> @@ -359,7 +365,7 @@ void workingset_refault(struct page *page, void *shadow)
>  		goto out;
>  
>  	SetPageActive(page);
> -	advance_inactive_age(memcg, pgdat);
> +	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
>  	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
>  
>  	/* Page was active prior to eviction */
> @@ -378,6 +384,7 @@ void workingset_refault(struct page *page, void *shadow)
>  void workingset_activation(struct page *page)
>  {
>  	struct mem_cgroup *memcg;
> +	struct lruvec *lruvec;
>  
>  	rcu_read_lock();
>  	/*
> @@ -390,7 +397,8 @@ void workingset_activation(struct page *page)
>  	memcg = page_memcg_rcu(page);
>  	if (!mem_cgroup_disabled() && !memcg)
>  		goto out;
> -	advance_inactive_age(memcg, page_pgdat(page));
> +	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
>  out:
>  	rcu_read_unlock();
>  }
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon
  2020-06-04 13:35                     ` Vlastimil Babka
@ 2020-06-04 15:05                       ` Johannes Weiner
  2020-06-12  3:19                         ` Joonsoo Kim
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2020-06-04 15:05 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Joonsoo Kim, Linux Memory Management List, Rik van Riel,
	Minchan Kim, Michal Hocko, Andrew Morton, Joonsoo Kim, LKML,
	kernel-team

On Thu, Jun 04, 2020 at 03:35:27PM +0200, Vlastimil Babka wrote:
> On 6/1/20 10:44 PM, Johannes Weiner wrote:
> > From a8faceabc1454dfd878caee2a8422493d937a394 Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Date: Mon, 1 Jun 2020 14:04:09 -0400
> > Subject: [PATCH] mm: workingset: let cache workingset challenge anon fix
> 
> Looks like the whole series is now in mainline (that was quick!) without this
> followup? As such it won't be squashed so perhaps the subject should be more
> "standalone" now, description referencing commit 34e58cac6d8f ("mm: workingset:
> let cache workingset challenge anon"), possibly with Fixes: tag etc...

Yep, I'll send a stand-alone version of this. It was a bit fat to get
squashed last minute anyway, and I don't mind having a bit of extra
time to quantify the actual impact of this delta.

Thanks Vlastimil!


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon
  2020-06-04 15:05                       ` Johannes Weiner
@ 2020-06-12  3:19                         ` Joonsoo Kim
  2020-06-15 13:41                           ` Johannes Weiner
  0 siblings, 1 reply; 35+ messages in thread
From: Joonsoo Kim @ 2020-06-12  3:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Vlastimil Babka, Linux Memory Management List, Rik van Riel,
	Minchan Kim, Michal Hocko, Andrew Morton, Joonsoo Kim, LKML,
	kernel-team

On Fri, Jun 5, 2020 at 12:06 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Jun 04, 2020 at 03:35:27PM +0200, Vlastimil Babka wrote:
> > On 6/1/20 10:44 PM, Johannes Weiner wrote:
> > > From a8faceabc1454dfd878caee2a8422493d937a394 Mon Sep 17 00:00:00 2001
> > > From: Johannes Weiner <hannes@cmpxchg.org>
> > > Date: Mon, 1 Jun 2020 14:04:09 -0400
> > > Subject: [PATCH] mm: workingset: let cache workingset challenge anon fix
> >
> > Looks like the whole series is now in mainline (that was quick!) without this
> > followup? As such it won't be squashed so perhaps the subject should be more
> > "standalone" now, description referencing commit 34e58cac6d8f ("mm: workingset:
> > let cache workingset challenge anon"), possibly with Fixes: tag etc...
>
> Yep, I'll send a stand-alone version of this. It was a bit fat to get
> squashed last minute anyway, and I don't mind having a bit of extra
> time to quantify the actual impact of this delta.

Hello, Johannes.

Now, a week has passed. I hope that this patch gets merged as soon as
possible, since my WIP patchset depends on it. How about trying to merge
it now? If you don't mind, I could send it on your behalf.

Thanks.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon
  2020-06-12  3:19                         ` Joonsoo Kim
@ 2020-06-15 13:41                           ` Johannes Weiner
  2020-06-16  6:09                             ` Joonsoo Kim
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2020-06-15 13:41 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Vlastimil Babka, Linux Memory Management List, Rik van Riel,
	Minchan Kim, Michal Hocko, Andrew Morton, Joonsoo Kim, LKML,
	kernel-team

On Fri, Jun 12, 2020 at 12:19:58PM +0900, Joonsoo Kim wrote:
> On Fri, Jun 5, 2020 at 12:06 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Jun 04, 2020 at 03:35:27PM +0200, Vlastimil Babka wrote:
> > > On 6/1/20 10:44 PM, Johannes Weiner wrote:
> > > > From a8faceabc1454dfd878caee2a8422493d937a394 Mon Sep 17 00:00:00 2001
> > > > From: Johannes Weiner <hannes@cmpxchg.org>
> > > > Date: Mon, 1 Jun 2020 14:04:09 -0400
> > > > Subject: [PATCH] mm: workingset: let cache workingset challenge anon fix
> > >
> > > Looks like the whole series is now in mainline (that was quick!) without this
> > > followup? As such it won't be squashed so perhaps the subject should be more
> > > "standalone" now, description referencing commit 34e58cac6d8f ("mm: workingset:
> > > let cache workingset challenge anon"), possibly with Fixes: tag etc...
> >
> > Yep, I'll send a stand-alone version of this. It was a bit fat to get
> > squashed last minute anyway, and I don't mind having a bit of extra
> > time to quantify the actual impact of this delta.
> 
> Hello, Johannes.
> 
> Now, a week has passed. I hope that this patch is merged as soon as possible,
> since my WIP patchset depends on this patch. How about trying to merge
> this patch now? If you don't mind, I could send it on your behalf.

I didn't realize you directly depended on it, my apologies.

Do you depend on it code-wise or do you have a test case that shows
the difference?

Here is the latest version again; you can include it in your patch
series until we get it merged. But I think we should still be able to
merge it as a follow-up fix into 5.8.

---

From d604062a66bc88ad1a1529c9e16952dc0d7c01c6 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 29 May 2020 09:40:00 -0400
Subject: [PATCH] mm: workingset: age nonresident information alongside
 anonymous pages

After ("mm: workingset: let cache workingset challenge anon fix"), we
compare refault distances to active_file + anon. But age of the
non-resident information is only driven by the file LRU. As a result,
we may overestimate the recency of any incoming refaults and activate
them too eagerly, causing unnecessary LRU churn in certain situations.

Make anon aging drive nonresident age as well to address that.

Reported-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |  4 ++--
 include/linux/swap.h   |  1 +
 mm/vmscan.c            |  3 +++
 mm/workingset.c        | 46 +++++++++++++++++++++++++-----------------
 4 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c4c37fd12104..f6f884970511 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -257,8 +257,8 @@ struct lruvec {
 	 */
 	unsigned long			anon_cost;
 	unsigned long			file_cost;
-	/* Evictions & activations on the inactive file list */
-	atomic_long_t			inactive_age;
+	/* Non-resident age, driven by LRU movement */
+	atomic_long_t			nonresident_age;
 	/* Refaults at the time of last reclaim cycle */
 	unsigned long			refaults;
 	/* Various lruvec state flags (enum lruvec_flags) */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4c5974bb9ba9..5b3216ba39a9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -313,6 +313,7 @@ struct vma_swap_readahead {
 };
 
 /* linux/mm/workingset.c */
+void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
 void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg);
 void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b6d84326bdf2..749d239c62b2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -904,6 +904,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		__delete_from_swap_cache(page, swap);
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
 		put_swap_page(page, swap);
+		workingset_eviction(page, target_memcg);
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
@@ -1884,6 +1885,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 				list_add(&page->lru, &pages_to_free);
 		} else {
 			nr_moved += nr_pages;
+			if (PageActive(page))
+				workingset_age_nonresident(lruvec, nr_pages);
 		}
 	}
 
diff --git a/mm/workingset.c b/mm/workingset.c
index d481ea452eeb..50b7937bab32 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -156,8 +156,8 @@
  *
  *		Implementation
  *
- * For each node's file LRU lists, a counter for inactive evictions
- * and activations is maintained (node->inactive_age).
+ * For each node's LRU lists, a counter for inactive evictions and
+ * activations is maintained (node->nonresident_age).
  *
  * On eviction, a snapshot of this counter (along with some bits to
  * identify the node) is stored in the now empty page cache
@@ -213,7 +213,17 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*workingsetp = workingset;
 }
 
-static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
+/**
+ * workingset_age_nonresident - age non-resident entries as LRU ages
+ * @lruvec: the lruvec that was aged
+ * @nr_pages: the number of pages to count
+ *
+ * As in-memory pages are aged, non-resident pages need to be aged as
+ * well, in order for the refault distances later on to be comparable
+ * to the in-memory dimensions. This function allows reclaim and LRU
+ * operations to drive the non-resident aging along in parallel.
+ */
+void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
 {
 	/*
 	 * Reclaiming a cgroup means reclaiming all its children in a
@@ -227,11 +237,8 @@ static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
 	 * the root cgroup's, age as well.
 	 */
 	do {
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-		atomic_long_inc(&lruvec->inactive_age);
-	} while (memcg && (memcg = parent_mem_cgroup(memcg)));
+		atomic_long_add(nr_pages, &lruvec->nonresident_age);
+	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
 /**
@@ -254,12 +261,11 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	advance_inactive_age(page_memcg(page), pgdat);
-
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
+	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
-	eviction = atomic_long_read(&lruvec->inactive_age);
+	eviction = atomic_long_read(&lruvec->nonresident_age);
 	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
 
@@ -309,20 +315,20 @@ void workingset_refault(struct page *page, void *shadow)
 	if (!mem_cgroup_disabled() && !eviction_memcg)
 		goto out;
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
-	refault = atomic_long_read(&eviction_lruvec->inactive_age);
+	refault = atomic_long_read(&eviction_lruvec->nonresident_age);
 
 	/*
 	 * Calculate the refault distance
 	 *
 	 * The unsigned subtraction here gives an accurate distance
-	 * across inactive_age overflows in most cases. There is a
+	 * across nonresident_age overflows in most cases. There is a
 	 * special case: usually, shadow entries have a short lifetime
 	 * and are either refaulted or reclaimed along with the inode
 	 * before they get too old.  But it is not impossible for the
-	 * inactive_age to lap a shadow entry in the field, which can
-	 * then result in a false small refault distance, leading to a
-	 * false activation should this old entry actually refault
-	 * again.  However, earlier kernels used to deactivate
+	 * nonresident_age to lap a shadow entry in the field, which
+	 * can then result in a false small refault distance, leading
+	 * to a false activation should this old entry actually
+	 * refault again.  However, earlier kernels used to deactivate
 	 * unconditionally with *every* reclaim invocation for the
 	 * longest time, so the occasional inappropriate activation
 	 * leading to pressure on the active list is not a problem.
@@ -359,7 +365,7 @@ void workingset_refault(struct page *page, void *shadow)
 		goto out;
 
 	SetPageActive(page);
-	advance_inactive_age(memcg, pgdat);
+	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
 	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
 
 	/* Page was active prior to eviction */
@@ -382,6 +388,7 @@ void workingset_refault(struct page *page, void *shadow)
 void workingset_activation(struct page *page)
 {
 	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
 
 	rcu_read_lock();
 	/*
@@ -394,7 +401,8 @@ void workingset_activation(struct page *page)
 	memcg = page_memcg_rcu(page);
 	if (!mem_cgroup_disabled() && !memcg)
 		goto out;
-	advance_inactive_age(memcg, page_pgdat(page));
+	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
 out:
 	rcu_read_unlock();
 }
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon
  2020-06-15 13:41                           ` Johannes Weiner
@ 2020-06-16  6:09                             ` Joonsoo Kim
  0 siblings, 0 replies; 35+ messages in thread
From: Joonsoo Kim @ 2020-06-16  6:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Vlastimil Babka, Linux Memory Management List, Rik van Riel,
	Minchan Kim, Michal Hocko, Andrew Morton, Joonsoo Kim, LKML,
	kernel-team

On Mon, Jun 15, 2020 at 10:41 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Jun 12, 2020 at 12:19:58PM +0900, Joonsoo Kim wrote:
> > On Fri, Jun 5, 2020 at 12:06 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Thu, Jun 04, 2020 at 03:35:27PM +0200, Vlastimil Babka wrote:
> > > > On 6/1/20 10:44 PM, Johannes Weiner wrote:
> > > > > From a8faceabc1454dfd878caee2a8422493d937a394 Mon Sep 17 00:00:00 2001
> > > > > From: Johannes Weiner <hannes@cmpxchg.org>
> > > > > Date: Mon, 1 Jun 2020 14:04:09 -0400
> > > > > Subject: [PATCH] mm: workingset: let cache workingset challenge anon fix
> > > >
> > > > Looks like the whole series is now in mainline (that was quick!) without this
> > > > followup? As such it won't be squashed so perhaps the subject should be more
> > > > "standalone" now, description referencing commit 34e58cac6d8f ("mm: workingset:
> > > > let cache workingset challenge anon"), possibly with Fixes: tag etc...
> > >
> > > Yep, I'll send a stand-alone version of this. It was a bit fat to get
> > > squashed last minute anyway, and I don't mind having a bit of extra
> > > time to quantify the actual impact of this delta.
> >
> > Hello, Johannes.
> >
> > Now, a week has passed. I hope that this patch is merged as soon as possible,
> > since my WIP patchset depends on this patch. How about trying to merge
> > this patch now? If you don't mind, I could send it on your behalf.
>
> I didn't realize you directly depended on it, my apologies.

No problem.

> Do you depend on it code-wise or do you have a test case that shows
> the difference?

Code-wise. My patchset also requires activation counting for the anon
pages, and a conflict could occur in this case.

> Here is the latest version again, you can include it in your patch
> series until we get it merged. But I think we should be able to merge
> it as a follow-up fix into 5.8 still.

Yes. I will send it separately for 5.8. And I will send some minor fixes, too.
It would be appreciated if you could review them.

Thanks.


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2020-06-16  6:10 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-20 23:25 [PATCH 00/14] mm: balance LRU lists based on relative thrashing v2 Johannes Weiner
2020-05-20 23:25 ` [PATCH 01/14] mm: fix LRU balancing effect of new transparent huge pages Johannes Weiner
2020-05-27 19:48   ` Shakeel Butt
2020-05-20 23:25 ` [PATCH 02/14] mm: keep separate anon and file statistics on page reclaim activity Johannes Weiner
2020-05-20 23:25 ` [PATCH 03/14] mm: allow swappiness that prefers reclaiming anon over the file workingset Johannes Weiner
2020-05-20 23:25 ` [PATCH 04/14] mm: fold and remove lru_cache_add_anon() and lru_cache_add_file() Johannes Weiner
2020-05-20 23:25 ` [PATCH 05/14] mm: workingset: let cache workingset challenge anon Johannes Weiner
2020-05-27  2:06   ` Joonsoo Kim
2020-05-27 13:43     ` Johannes Weiner
2020-05-28  7:16       ` Joonsoo Kim
2020-05-28 17:01         ` Johannes Weiner
2020-05-29  6:48           ` Joonsoo Kim
2020-05-29 15:12             ` Johannes Weiner
2020-06-01  6:14               ` Joonsoo Kim
2020-06-01 15:56                 ` Johannes Weiner
2020-06-01 20:44                   ` Johannes Weiner
2020-06-04 13:35                     ` Vlastimil Babka
2020-06-04 15:05                       ` Johannes Weiner
2020-06-12  3:19                         ` Joonsoo Kim
2020-06-15 13:41                           ` Johannes Weiner
2020-06-16  6:09                             ` Joonsoo Kim
2020-06-02  2:34                   ` Joonsoo Kim
2020-06-02 16:47                     ` Johannes Weiner
2020-06-03  5:40                       ` Joonsoo Kim
2020-05-20 23:25 ` [PATCH 06/14] mm: remove use-once cache bias from LRU balancing Johannes Weiner
2020-05-20 23:25 ` [PATCH 07/14] mm: vmscan: drop unnecessary div0 avoidance rounding in get_scan_count() Johannes Weiner
2020-05-20 23:25 ` [PATCH 08/14] mm: base LRU balancing on an explicit cost model Johannes Weiner
2020-05-20 23:25 ` [PATCH 09/14] mm: deactivations shouldn't bias the LRU balance Johannes Weiner
2020-05-22 13:33   ` Qian Cai
2020-05-26 15:55     ` Johannes Weiner
2020-05-20 23:25 ` [PATCH 10/14] mm: only count actual rotations as LRU reclaim cost Johannes Weiner
2020-05-20 23:25 ` [PATCH 11/14] mm: balance LRU lists based on relative thrashing Johannes Weiner
2020-05-20 23:25 ` [PATCH 12/14] mm: vmscan: determine anon/file pressure balance at the reclaim root Johannes Weiner
2020-05-20 23:25 ` [PATCH 13/14] mm: vmscan: reclaim writepage is IO cost Johannes Weiner
2020-05-20 23:25 ` [PATCH 14/14] mm: vmscan: limit the range of LRU type balancing Johannes Weiner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).