* [PATCH v4 0/8] workingset protection/detection on the anonymous LRU list
@ 2020-03-23  5:52 js1304
  2020-03-23  5:52 ` [PATCH v4 1/8] mm/vmscan: make active/inactive ratio as 1:1 for anon lru js1304
                   ` (7 more replies)
  0 siblings, 8 replies; 18+ messages in thread
From: js1304 @ 2020-03-23  5:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Hello,

This patchset implements workingset protection and detection on
the anonymous LRU list.

* Changes on v4
- In the patch "mm/swapcache: support to handle the exceptional
entries in swapcache":
-- replace the word "value" with "exceptional entries"
-- add handling of the shadow entry in add_to_swap_cache()
-- support huge pages
-- remove the registration code for the shadow shrinker

- remove the patch "mm/workingset: use the node counter
if memcg is the root memcg" since workingset detection for
anonymous pages no longer uses the shadow shrinker
- minor style fixes

* Changes on v3
- rework the patch "mm/vmscan: protect the workingset on anonymous LRU"
(use almost the same reference tracking algorithm as the one for
file-mapped pages)

* Changes on v2
- fix a critical bug that used an out-of-bounds LRU list index in
workingset_refault()
- fix a bug that reused the rotate value of the previous page

* SUBJECT
workingset protection

* PROBLEM
In the current implementation, a newly created or swapped-in anonymous
page starts out on the active list. Growing the active list triggers
rebalancing of the active/inactive lists, so old pages on the active
list are demoted to the inactive list. Hence, hot pages on the active
list aren't protected at all.

The following is an example of this situation.

Assume that there are 50 hot pages on the active list and that the
system can hold 100 pages in total. Numbers denote the number of pages
on the active/inactive lists (active | inactive). (h) stands for hot
pages and (uo) stands for used-once pages.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)

3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)

As we can see, the hot pages are swapped out, which will cause swap-ins
later.

* SOLUTION
Since this is what we want to avoid, this patchset implements workingset
protection. As with the file LRU list, a newly created or swapped-in
anonymous page starts out on the inactive list. Also as with the file
LRU list, the page is promoted if it receives enough references. This
simple modification changes the above example as follows.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)

3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)

The hot pages remain on the active list. :)
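
In code terms, the core of this change (patch 2 below) boils down to the
following condensed sketch of the insertion helper; see the actual patch
for the full version:

/* condensed sketch of the new lru_cache_add_inactive_or_unevictable() */
void lru_cache_add_inactive_or_unevictable(struct page *page,
					   struct vm_area_struct *vma)
{
	bool unevictable;

	unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED;
	if (unevictable && !TestSetPageMlocked(page)) {
		/* mlocked page: account it, it goes to the unevictable list */
		__mod_zone_page_state(page_zone(page), NR_MLOCK,
				      hpage_nr_pages(page));
		count_vm_event(UNEVICTABLE_PGMLOCKED);
	}
	/* no SetPageActive() any more: the page starts on the inactive list */
	lru_cache_add(page);
}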

* EXPERIMENT
I tested this scenario on my test bed and confirmed that this problem
occurs with the current implementation. I also checked that it is fixed
by this patchset.

I did another test to show the performance effect of this patchset.

- ebizzy (with modified random function)
ebizzy is a test program in which the main thread allocates lots of
memory and child threads access it randomly for a given duration.
Swap-in/out happens if the allocated memory is larger than the system
memory.

A random function that follows the Zipf distribution is used to create
hot/cold memory. The hot/cold ratio is controlled by a parameter: if the
parameter is high, hot memory is accessed much more often than cold
memory; if it is low, the number of accesses to each memory region is
similar. I used various parameters in order to show the effect of the
patchset on workloads with various hot/cold ratios.
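
The exact modified random function is not included here; the following
is a hypothetical, naive sketch of a Zipf-style index generator with an
exponent parameter 's', just to illustrate how the hot/cold skew is
controlled (higher 's' concentrates accesses on a few hot chunks, 's'
near 0 approaches a uniform distribution):

#include <math.h>
#include <stdlib.h>

/* hypothetical sketch, not the actual ebizzy modification */
static unsigned long zipf_index(unsigned long nr_chunks, double s)
{
	double sum = 0.0, target, acc = 0.0;
	unsigned long i;

	/* normalization constant of the Zipf distribution (O(n), naive) */
	for (i = 1; i <= nr_chunks; i++)
		sum += 1.0 / pow((double)i, s);

	/* inverse-CDF sampling: walk until the cumulative mass is reached */
	target = ((double)rand() / RAND_MAX) * sum;
	for (i = 1; i <= nr_chunks; i++) {
		acc += 1.0 / pow((double)i, s);
		if (acc >= target)
			return i - 1;	/* index of the chunk to access */
	}
	return nr_chunks - 1;
}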

My test setup is a virtual machine with 8 CPUs and 1024MB of memory.

The result format is as follows.

Parameter 0.1 ... 1.3
Allocated memory size
Throughput for base (larger is better)
Throughput for patchset (larger is better)
Improvement (larger is better)


* single thread

     0.1      0.3      0.5      0.7      0.9      1.1      1.3
<512M>
  7009.0   7372.0   7774.0   8523.0   9569.0  10724.0  11936.0
  6973.0   7342.0   7745.0   8576.0   9441.0  10730.0  12033.0
   -0.01     -0.0     -0.0     0.01    -0.01      0.0     0.01
<768M>
   915.0   1039.0   1275.0   1687.0   2328.0   3486.0   5445.0
   920.0   1037.0   1238.0   1689.0   2384.0   3638.0   5381.0
    0.01     -0.0    -0.03      0.0     0.02     0.04    -0.01
<1024M>
   425.0    471.0    539.0    753.0   1183.0   2130.0   3839.0
   414.0    468.0    553.0    770.0   1242.0   2187.0   3932.0
   -0.03    -0.01     0.03     0.02     0.05     0.03     0.02
<1280M>
   320.0    346.0    410.0    556.0    871.0   1654.0   3298.0
   316.0    346.0    411.0    550.0    892.0   1652.0   3293.0
   -0.01      0.0      0.0    -0.01     0.02     -0.0     -0.0
<1536M>
   273.0    290.0    341.0    458.0    733.0   1381.0   2925.0
   271.0    293.0    344.0    462.0    740.0   1398.0   2969.0
   -0.01     0.01     0.01     0.01     0.01     0.01     0.02
<2048M>
    77.0     79.0     95.0    147.0    276.0    690.0   1816.0
    91.0     94.0    115.0    170.0    321.0    770.0   2018.0
    0.18     0.19     0.21     0.16     0.16     0.12     0.11


* multi thread (8)

     0.1      0.3      0.5      0.7      0.9      1.1      1.3
<512M>
 29083.0  29648.0  30145.0  31668.0  33964.0  38414.0  43707.0
 29238.0  29701.0  30301.0  31328.0  33809.0  37991.0  43667.0
    0.01      0.0     0.01    -0.01     -0.0    -0.01     -0.0
<768M>
  3332.0   3699.0   4673.0   5830.0   8307.0  12969.0  17665.0
  3579.0   3992.0   4432.0   6111.0   8699.0  12604.0  18061.0
    0.07     0.08    -0.05     0.05     0.05    -0.03     0.02
<1024M>
  1921.0   2141.0   2484.0   3296.0   5391.0   8227.0  14574.0
  1989.0   2155.0   2609.0   3565.0   5463.0   8170.0  15642.0
    0.04     0.01     0.05     0.08     0.01    -0.01     0.07
<1280M>
  1524.0   1625.0   1931.0   2581.0   4155.0   6959.0  12443.0
  1560.0   1707.0   2016.0   2714.0   4262.0   7518.0  13910.0
    0.02     0.05     0.04     0.05     0.03     0.08     0.12
<1536M>
  1303.0   1399.0   1550.0   2137.0   3469.0   6712.0  12944.0
  1356.0   1465.0   1701.0   2237.0   3583.0   6830.0  13580.0
    0.04     0.05      0.1     0.05     0.03     0.02     0.05
<2048M>
   172.0    184.0    215.0    289.0    514.0   1318.0   4153.0
   175.0    190.0    225.0    329.0    606.0   1585.0   5170.0
    0.02     0.03     0.05     0.14     0.18      0.2     0.24

As we can see, as the allocated memory grows, the patched kernel gets
better results. The maximum improvement is 21% for the single-thread
test and 24% for the multi-thread test.


* SUBJECT
workingset detection

* PROBLEM
The later part of the patchset implements workingset detection for the
anonymous LRU list. There is a corner case where workingset protection
can cause thrashing. If we can avoid that thrashing through workingset
detection, we can get better performance.

The following is an example of thrashing due to workingset protection.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (will be hot) pages
50(h) | 50(wh)

3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)

4. workload: 50 (will be hot) pages
50(h) | 50(wh), swap-in 50(wh)

5. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)

6. repeat 4, 5

Without workingset detection, this kind of workload can never be
promoted and thrashing happens forever.

* SOLUTION
Therefore, this patchset implements workingset detection.
All the infrastructure for workingset detection is already implemented,
so there is not much work to do. First, extend the workingset detection
code to deal with the anonymous LRU list. Then, make the swap cache
handle the exceptional value used for the shadow entry. Lastly,
install/retrieve the shadow value into/from the swap cache and check the
refault distance.
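
Conceptually, the refault decision for an anonymous page then looks
roughly like the sketch below. It is condensed from the
workingset_refault() changes in patches 3 and 6 (the function name is
made up for illustration, and memcg lookup and stat accounting are
omitted); 'eviction' is the inactive_age snapshot packed into the shadow
entry when the page was reclaimed:

/* condensed, illustrative sketch of the anon refault check */
static void anon_refault_sketch(struct page *page, unsigned long eviction,
				struct lruvec *lruvec)
{
	bool file = false;	/* anonymous page */
	unsigned long refault, active, refault_distance;

	refault = atomic_long_read(&lruvec->inactive_age[file]);
	active = lruvec_page_state(lruvec, NR_ACTIVE_ANON);

	/* pages (de)activated on this LRU since this page was evicted */
	refault_distance = (refault - eviction) & EVICTION_MASK;

	/*
	 * The page could have stayed resident if the active list had
	 * been available to it: it is part of the workingset, activate.
	 */
	if (refault_distance <= active)
		SetPageActive(page);
}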

* EXPERIMENT
I made a test program that imitates the above scenario and confirmed
that the problem exists. Then, I checked that this patchset fixes it.

My test setup is a virtual machine with 8 CPUs and 6100MB of memory.
However, the amount of memory that the test program can use is only
about 280MB, because the system uses a large RAM-backed swap and a large
ramdisk to capture the trace.

The test scenario is as below.

1. allocate cold memory (512MB)
2. allocate hot-1 memory (96MB)
3. activate hot-1 memory (96MB)
4. allocate another hot-2 memory (96MB)
5. access cold memory (128MB)
6. access hot-2 memory (96MB)
7. repeat 5, 6

Since the hot-1 memory (96MB) is on the active list, the inactive list
can hold roughly 190MB of pages. The re-access distance of the hot-2
memory (96+128 = 224MB) is larger than 190MB, so it cannot be promoted
without workingset detection, and swap-in/out happens repeatedly. With
this patchset, workingset detection works and promotion happens, so
swap-in/out occurs less often.

Here is the result. (average of 5 runs)

type    swap-in    swap-out
base     863240      989945
patch    681565      809273

As we can see, the patched kernel does less swap-in/out.

The patchset is based on v5.5.
It is also available at

https://github.com/JoonsooKim/linux/tree/improve-anonymous-lru-management-v4.00-v5.5

Enjoy it.

Thanks.

Joonsoo Kim (8):
  mm/vmscan: make active/inactive ratio as 1:1 for anon lru
  mm/vmscan: protect the workingset on anonymous LRU
  mm/workingset: extend the workingset detection for anon LRU
  mm/swapcache: support to handle the exceptional entries in swapcache
  mm/workingset: handle the page without memcg
  mm/swap: implement workingset detection for anonymous LRU
  mm/vmscan: restore active/inactive ratio for anonymous LRU
  mm/swap: count a new anonymous page as a reclaim_state's rotate

 include/linux/mmzone.h  | 14 +++++++++-----
 include/linux/swap.h    | 18 +++++++++++++-----
 kernel/events/uprobes.c |  2 +-
 mm/huge_memory.c        |  6 +++---
 mm/khugepaged.c         |  2 +-
 mm/memcontrol.c         | 12 ++++++++----
 mm/memory.c             | 16 ++++++++++------
 mm/migrate.c            |  2 +-
 mm/shmem.c              |  3 ++-
 mm/swap.c               | 42 +++++++++++++++++++++++++++++++++++-------
 mm/swap_state.c         | 44 +++++++++++++++++++++++++++++++++++++-------
 mm/swapfile.c           |  2 +-
 mm/userfaultfd.c        |  2 +-
 mm/vmscan.c             | 26 ++++++++++++++++----------
 mm/vmstat.c             |  6 ++++--
 mm/workingset.c         | 46 +++++++++++++++++++++++++++++++++-------------
 16 files changed, 175 insertions(+), 68 deletions(-)

-- 
2.7.4



* [PATCH v4 1/8] mm/vmscan: make active/inactive ratio as 1:1 for anon lru
  2020-03-23  5:52 [PATCH v4 0/8] workingset protection/detection on the anonymous LRU list js1304
@ 2020-03-23  5:52 ` js1304
  2020-03-23  5:52 ` [PATCH v4 2/8] mm/vmscan: protect the workingset on anonymous LRU js1304
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: js1304 @ 2020-03-23  5:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

The current implementation of LRU management for anonymous pages has
some problems. The most important one is that it doesn't protect the
workingset, that is, the pages on the active LRU list. Although this
problem will be fixed by the following patches, some preparation is
required and this patch does it.

What the following patches do is restore workingset protection. In that
scheme, newly created or swapped-in pages start their lifetime on the
inactive list. If the inactive list is too small, there is not enough
chance for them to be referenced and the pages cannot become part of
the workingset.

In order to give the new anonymous pages enough chance, this patch makes
the active/inactive LRU ratio 1:1.
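
For reference, a quick worked example of the formula touched below: with
8GB of anonymous pages, gb = 8 and the old code used inactive_ratio =
int_sqrt(10 * 8) = int_sqrt(80) = 8, i.e. the inactive anon list was
kept at roughly 1/9 of the anonymous pages. With this patch, the anon
LRU always uses inactive_ratio = 1, so active and inactive each get
about half.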

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 572fb17..e772f3f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2217,7 +2217,7 @@ static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
 	active = lruvec_page_state(lruvec, NR_LRU_BASE + active_lru);
 
 	gb = (inactive + active) >> (30 - PAGE_SHIFT);
-	if (gb)
+	if (gb && is_file_lru(inactive_lru))
 		inactive_ratio = int_sqrt(10 * gb);
 	else
 		inactive_ratio = 1;
-- 
2.7.4



* [PATCH v4 2/8] mm/vmscan: protect the workingset on anonymous LRU
  2020-03-23  5:52 [PATCH v4 0/8] workingset protection/detection on the anonymous LRU list js1304
  2020-03-23  5:52 ` [PATCH v4 1/8] mm/vmscan: make active/inactive ratio as 1:1 for anon lru js1304
@ 2020-03-23  5:52 ` js1304
  2020-03-23  5:52 ` [PATCH v4 3/8] mm/workingset: extend the workingset detection for anon LRU js1304
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: js1304 @ 2020-03-23  5:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

In the current implementation, a newly created or swapped-in anonymous
page starts out on the active list. Growing the active list triggers
rebalancing of the active/inactive lists, so old pages on the active
list are demoted to the inactive list. Hence, pages on the active list
aren't protected at all.

The following is an example of this situation.

Assume that there are 50 hot pages on the active list. Numbers denote
the number of pages on the active/inactive lists (active | inactive).

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)

3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)

This patch tries to fix this issue.
As with the file LRU, newly created or swapped-in anonymous pages are
inserted onto the inactive list. They are promoted to the active list if
enough references happen. This simple modification changes the above
example as follows.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)

3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)

As you can see, the hot pages on the active list are now protected.

Note that this implementation has a drawback: a page cannot be promoted
and will be swapped out if its re-access distance is greater than the
size of the inactive list but less than the size of the total
(active + inactive) list. To solve this potential issue, a following
patch will apply the workingset detection scheme that has long been
applied to the file LRU.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/swap.h    |  2 +-
 kernel/events/uprobes.c |  2 +-
 mm/huge_memory.c        |  6 +++---
 mm/khugepaged.c         |  2 +-
 mm/memory.c             |  9 ++++-----
 mm/migrate.c            |  2 +-
 mm/swap.c               | 13 +++++++------
 mm/swapfile.c           |  2 +-
 mm/userfaultfd.c        |  2 +-
 mm/vmscan.c             |  4 +---
 10 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1e99f7a..954e13e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -344,7 +344,7 @@ extern void deactivate_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
-extern void lru_cache_add_active_or_unevictable(struct page *page,
+extern void lru_cache_add_inactive_or_unevictable(struct page *page,
 						struct vm_area_struct *vma);
 
 /* linux/mm/vmscan.c */
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ece7e13..14156fc 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -190,7 +190,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 		get_page(new_page);
 		page_add_new_anon_rmap(new_page, vma, addr, false);
 		mem_cgroup_commit_charge(new_page, memcg, false, false);
-		lru_cache_add_active_or_unevictable(new_page, vma);
+		lru_cache_add_inactive_or_unevictable(new_page, vma);
 	} else
 		/* no new page, just dec_mm_counter for old_page */
 		dec_mm_counter(mm, MM_ANONPAGES);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a880932..6356dfd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -638,7 +638,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		page_add_new_anon_rmap(page, vma, haddr, true);
 		mem_cgroup_commit_charge(page, memcg, false, true);
-		lru_cache_add_active_or_unevictable(page, vma);
+		lru_cache_add_inactive_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
 		add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
@@ -1282,7 +1282,7 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
 		set_page_private(pages[i], 0);
 		page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
 		mem_cgroup_commit_charge(pages[i], memcg, false, false);
-		lru_cache_add_active_or_unevictable(pages[i], vma);
+		lru_cache_add_inactive_or_unevictable(pages[i], vma);
 		vmf->pte = pte_offset_map(&_pmd, haddr);
 		VM_BUG_ON(!pte_none(*vmf->pte));
 		set_pte_at(vma->vm_mm, haddr, vmf->pte, entry);
@@ -1435,7 +1435,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 		pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
 		page_add_new_anon_rmap(new_page, vma, haddr, true);
 		mem_cgroup_commit_charge(new_page, memcg, false, true);
-		lru_cache_add_active_or_unevictable(new_page, vma);
+		lru_cache_add_inactive_or_unevictable(new_page, vma);
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
 		update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 		if (!page) {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b679908..246c155 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1092,7 +1092,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	page_add_new_anon_rmap(new_page, vma, address, true);
 	mem_cgroup_commit_charge(new_page, memcg, false, true);
 	count_memcg_events(memcg, THP_COLLAPSE_ALLOC, 1);
-	lru_cache_add_active_or_unevictable(new_page, vma);
+	lru_cache_add_inactive_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
diff --git a/mm/memory.c b/mm/memory.c
index 45442d9..5f7813a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2513,7 +2513,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
 		page_add_new_anon_rmap(new_page, vma, vmf->address, false);
 		mem_cgroup_commit_charge(new_page, memcg, false, false);
-		lru_cache_add_active_or_unevictable(new_page, vma);
+		lru_cache_add_inactive_or_unevictable(new_page, vma);
 		/*
 		 * We call the notify macro here because, when using secondary
 		 * mmu page tables (such as kvm shadow page tables), we want the
@@ -3038,11 +3038,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (unlikely(page != swapcache && swapcache)) {
 		page_add_new_anon_rmap(page, vma, vmf->address, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
-		lru_cache_add_active_or_unevictable(page, vma);
+		lru_cache_add_inactive_or_unevictable(page, vma);
 	} else {
 		do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
 		mem_cgroup_commit_charge(page, memcg, true, false);
-		activate_page(page);
 	}
 
 	swap_free(entry);
@@ -3186,7 +3185,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, vmf->address, false);
 	mem_cgroup_commit_charge(page, memcg, false, false);
-	lru_cache_add_active_or_unevictable(page, vma);
+	lru_cache_add_inactive_or_unevictable(page, vma);
 setpte:
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
 
@@ -3449,7 +3448,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 		page_add_new_anon_rmap(page, vma, vmf->address, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
-		lru_cache_add_active_or_unevictable(page, vma);
+		lru_cache_add_inactive_or_unevictable(page, vma);
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
 		page_add_file_rmap(page, false);
diff --git a/mm/migrate.c b/mm/migrate.c
index 86873b6..ef034c0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2784,7 +2784,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	page_add_new_anon_rmap(page, vma, addr, false);
 	mem_cgroup_commit_charge(page, memcg, false, false);
 	if (!is_zone_device_page(page))
-		lru_cache_add_active_or_unevictable(page, vma);
+		lru_cache_add_inactive_or_unevictable(page, vma);
 	get_page(page);
 
 	if (flush) {
diff --git a/mm/swap.c b/mm/swap.c
index 5341ae9..442d27e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -448,23 +448,24 @@ void lru_cache_add(struct page *page)
 }
 
 /**
- * lru_cache_add_active_or_unevictable
+ * lru_cache_add_inactive_or_unevictable
  * @page:  the page to be added to LRU
  * @vma:   vma in which page is mapped for determining reclaimability
  *
- * Place @page on the active or unevictable LRU list, depending on its
+ * Place @page on the inactive or unevictable LRU list, depending on its
  * evictability.  Note that if the page is not evictable, it goes
  * directly back onto it's zone's unevictable list, it does NOT use a
  * per cpu pagevec.
  */
-void lru_cache_add_active_or_unevictable(struct page *page,
+void lru_cache_add_inactive_or_unevictable(struct page *page,
 					 struct vm_area_struct *vma)
 {
+	bool unevictable;
+
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
-	if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
-		SetPageActive(page);
-	else if (!TestSetPageMlocked(page)) {
+	unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED;
+	if (unevictable && !TestSetPageMlocked(page)) {
 		/*
 		 * We use the irq-unsafe __mod_zone_page_stat because this
 		 * counter is not modified from interrupt context, and the pte
diff --git a/mm/swapfile.c b/mm/swapfile.c
index bb3261d..6bdcbf9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1888,7 +1888,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	} else { /* ksm created a completely new copy */
 		page_add_new_anon_rmap(page, vma, addr, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
-		lru_cache_add_active_or_unevictable(page, vma);
+		lru_cache_add_inactive_or_unevictable(page, vma);
 	}
 	swap_free(entry);
 	/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 1b0d7ab..875e329 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -120,7 +120,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 	inc_mm_counter(dst_mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
 	mem_cgroup_commit_charge(page, memcg, false, false);
-	lru_cache_add_active_or_unevictable(page, dst_vma);
+	lru_cache_add_inactive_or_unevictable(page, dst_vma);
 
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e772f3f..c932141 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1010,8 +1010,6 @@ static enum page_references page_check_references(struct page *page,
 		return PAGEREF_RECLAIM;
 
 	if (referenced_ptes) {
-		if (PageSwapBacked(page))
-			return PAGEREF_ACTIVATE;
 		/*
 		 * All mapped pages start out with page table
 		 * references from the instantiating fault, so we need
@@ -1034,7 +1032,7 @@ static enum page_references page_check_references(struct page *page,
 		/*
 		 * Activate file-backed executable pages after first usage.
 		 */
-		if (vm_flags & VM_EXEC)
+		if ((vm_flags & VM_EXEC) && !PageSwapBacked(page))
 			return PAGEREF_ACTIVATE;
 
 		return PAGEREF_KEEP;
-- 
2.7.4



* [PATCH v4 3/8] mm/workingset: extend the workingset detection for anon LRU
  2020-03-23  5:52 [PATCH v4 0/8] workingset protection/detection on the anonymous LRU list js1304
  2020-03-23  5:52 ` [PATCH v4 1/8] mm/vmscan: make active/inactive ratio as 1:1 for anon lru js1304
  2020-03-23  5:52 ` [PATCH v4 2/8] mm/vmscan: protect the workingset on anonymous LRU js1304
@ 2020-03-23  5:52 ` js1304
  2020-03-23  5:52 ` [PATCH v4 4/8] mm/swapcache: support to handle the exceptional entries in swapcache js1304
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: js1304 @ 2020-03-23  5:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

In the following patch, workingset detection will be applied to the
anonymous LRU. To prepare for that, this patch adds some code to
distinguish and handle the two LRU lists.
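
As a quick illustration of the scheme used below, the anon counter of
each new pair sits at *_BASE and the file counter right after it, so a
'bool file' can select the right one arithmetically; this fragment just
restates what the patch does:

	bool file = page_is_file_cache(page);

	/* file == 0 selects the anon counter, file == 1 the file counter */
	inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file);
	atomic_long_inc(&lruvec->inactive_age[file]);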

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/mmzone.h | 14 +++++++++-----
 mm/memcontrol.c        | 12 ++++++++----
 mm/vmscan.c            | 15 ++++++++++-----
 mm/vmstat.c            |  6 ++++--
 mm/workingset.c        | 33 ++++++++++++++++++++-------------
 5 files changed, 51 insertions(+), 29 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5334ad8..ad0639f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -220,8 +220,12 @@ enum node_stat_item {
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	WORKINGSET_NODES,
-	WORKINGSET_REFAULT,
-	WORKINGSET_ACTIVATE,
+	WORKINGSET_REFAULT_BASE,
+	WORKINGSET_REFAULT_ANON = WORKINGSET_REFAULT_BASE,
+	WORKINGSET_REFAULT_FILE,
+	WORKINGSET_ACTIVATE_BASE,
+	WORKINGSET_ACTIVATE_ANON = WORKINGSET_ACTIVATE_BASE,
+	WORKINGSET_ACTIVATE_FILE,
 	WORKINGSET_RESTORE,
 	WORKINGSET_NODERECLAIM,
 	NR_ANON_MAPPED,	/* Mapped anonymous pages */
@@ -304,10 +308,10 @@ enum lruvec_flags {
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
 	struct zone_reclaim_stat	reclaim_stat;
-	/* Evictions & activations on the inactive file list */
-	atomic_long_t			inactive_age;
+	/* Evictions & activations on the inactive list, anon=0, file=1 */
+	atomic_long_t			inactive_age[2];
 	/* Refaults at the time of last reclaim cycle */
-	unsigned long			refaults;
+	unsigned long			refaults[2];
 	/* Various lruvec state flags (enum lruvec_flags) */
 	unsigned long			flags;
 #ifdef CONFIG_MEMCG
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6c83cf4..8f4473d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1431,10 +1431,14 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 	seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGMAJFAULT),
 		       memcg_events(memcg, PGMAJFAULT));
 
-	seq_buf_printf(&s, "workingset_refault %lu\n",
-		       memcg_page_state(memcg, WORKINGSET_REFAULT));
-	seq_buf_printf(&s, "workingset_activate %lu\n",
-		       memcg_page_state(memcg, WORKINGSET_ACTIVATE));
+	seq_buf_printf(&s, "workingset_refault_anon %lu\n",
+		       memcg_page_state(memcg, WORKINGSET_REFAULT_ANON));
+	seq_buf_printf(&s, "workingset_refault_file %lu\n",
+		       memcg_page_state(memcg, WORKINGSET_REFAULT_FILE));
+	seq_buf_printf(&s, "workingset_activate_anon %lu\n",
+		       memcg_page_state(memcg, WORKINGSET_ACTIVATE_ANON));
+	seq_buf_printf(&s, "workingset_activate_file %lu\n",
+		       memcg_page_state(memcg, WORKINGSET_ACTIVATE_FILE));
 	seq_buf_printf(&s, "workingset_nodereclaim %lu\n",
 		       memcg_page_state(memcg, WORKINGSET_NODERECLAIM));
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c932141..0493c25 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2716,7 +2716,10 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	if (!sc->force_deactivate) {
 		unsigned long refaults;
 
-		if (inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
+		refaults = lruvec_page_state(target_lruvec,
+				WORKINGSET_ACTIVATE_ANON);
+		if (refaults != target_lruvec->refaults[0] ||
+			inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
 			sc->may_deactivate |= DEACTIVATE_ANON;
 		else
 			sc->may_deactivate &= ~DEACTIVATE_ANON;
@@ -2727,8 +2730,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		 * rid of any stale active pages quickly.
 		 */
 		refaults = lruvec_page_state(target_lruvec,
-					     WORKINGSET_ACTIVATE);
-		if (refaults != target_lruvec->refaults ||
+				WORKINGSET_ACTIVATE_FILE);
+		if (refaults != target_lruvec->refaults[1] ||
 		    inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
 			sc->may_deactivate |= DEACTIVATE_FILE;
 		else
@@ -3007,8 +3010,10 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
 	unsigned long refaults;
 
 	target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
-	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
-	target_lruvec->refaults = refaults;
+	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON);
+	target_lruvec->refaults[0] = refaults;
+	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_FILE);
+	target_lruvec->refaults[1] = refaults;
 }
 
 /*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 78d5337..3cdf8e9 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1146,8 +1146,10 @@ const char * const vmstat_text[] = {
 	"nr_isolated_anon",
 	"nr_isolated_file",
 	"workingset_nodes",
-	"workingset_refault",
-	"workingset_activate",
+	"workingset_refault_anon",
+	"workingset_refault_file",
+	"workingset_activate_anon",
+	"workingset_activate_file",
 	"workingset_restore",
 	"workingset_nodereclaim",
 	"nr_anon_pages",
diff --git a/mm/workingset.c b/mm/workingset.c
index 474186b..59415e0 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -6,6 +6,7 @@
  */
 
 #include <linux/memcontrol.h>
+#include <linux/mm_inline.h>
 #include <linux/writeback.h>
 #include <linux/shmem_fs.h>
 #include <linux/pagemap.h>
@@ -156,7 +157,7 @@
  *
  *		Implementation
  *
- * For each node's file LRU lists, a counter for inactive evictions
+ * For each node's anon/file LRU lists, a counter for inactive evictions
  * and activations is maintained (node->inactive_age).
  *
  * On eviction, a snapshot of this counter (along with some bits to
@@ -213,7 +214,8 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*workingsetp = workingset;
 }
 
-static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
+static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat,
+				bool file)
 {
 	/*
 	 * Reclaiming a cgroup means reclaiming all its children in a
@@ -230,7 +232,7 @@ static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
 		struct lruvec *lruvec;
 
 		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-		atomic_long_inc(&lruvec->inactive_age);
+		atomic_long_inc(&lruvec->inactive_age[file]);
 	} while (memcg && (memcg = parent_mem_cgroup(memcg)));
 }
 
@@ -245,6 +247,7 @@ static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
 void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 {
 	struct pglist_data *pgdat = page_pgdat(page);
+	bool file = page_is_file_cache(page);
 	unsigned long eviction;
 	struct lruvec *lruvec;
 	int memcgid;
@@ -254,12 +257,12 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	advance_inactive_age(page_memcg(page), pgdat);
+	advance_inactive_age(page_memcg(page), pgdat, file);
 
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
-	eviction = atomic_long_read(&lruvec->inactive_age);
+	eviction = atomic_long_read(&lruvec->inactive_age[file]);
 	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
 
@@ -274,15 +277,16 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
  */
 void workingset_refault(struct page *page, void *shadow)
 {
+	bool file = page_is_file_cache(page);
 	struct mem_cgroup *eviction_memcg;
 	struct lruvec *eviction_lruvec;
 	unsigned long refault_distance;
 	struct pglist_data *pgdat;
-	unsigned long active_file;
 	struct mem_cgroup *memcg;
 	unsigned long eviction;
 	struct lruvec *lruvec;
 	unsigned long refault;
+	unsigned long active;
 	bool workingset;
 	int memcgid;
 
@@ -308,9 +312,11 @@ void workingset_refault(struct page *page, void *shadow)
 	eviction_memcg = mem_cgroup_from_id(memcgid);
 	if (!mem_cgroup_disabled() && !eviction_memcg)
 		goto out;
+
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
-	refault = atomic_long_read(&eviction_lruvec->inactive_age);
-	active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
+	refault = atomic_long_read(&eviction_lruvec->inactive_age[file]);
+	active = lruvec_page_state(eviction_lruvec,
+				page_lru_base_type(page) + LRU_ACTIVE);
 
 	/*
 	 * Calculate the refault distance
@@ -341,19 +347,19 @@ void workingset_refault(struct page *page, void *shadow)
 	memcg = page_memcg(page);
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
-	inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
+	inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file);
 
 	/*
 	 * Compare the distance to the existing workingset size. We
 	 * don't act on pages that couldn't stay resident even if all
 	 * the memory was available to the page cache.
 	 */
-	if (refault_distance > active_file)
+	if (refault_distance > active)
 		goto out;
 
 	SetPageActive(page);
-	advance_inactive_age(memcg, pgdat);
-	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
+	advance_inactive_age(memcg, pgdat, file);
+	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file);
 
 	/* Page was active prior to eviction */
 	if (workingset) {
@@ -370,6 +376,7 @@ void workingset_refault(struct page *page, void *shadow)
  */
 void workingset_activation(struct page *page)
 {
+	bool file = page_is_file_cache(page);
 	struct mem_cgroup *memcg;
 
 	rcu_read_lock();
@@ -383,7 +390,7 @@ void workingset_activation(struct page *page)
 	memcg = page_memcg_rcu(page);
 	if (!mem_cgroup_disabled() && !memcg)
 		goto out;
-	advance_inactive_age(memcg, page_pgdat(page));
+	advance_inactive_age(memcg, page_pgdat(page), file);
 out:
 	rcu_read_unlock();
 }
-- 
2.7.4



* [PATCH v4 4/8] mm/swapcache: support to handle the exceptional entries in swapcache
  2020-03-23  5:52 [PATCH v4 0/8] workingset protection/detection on the anonymous LRU list js1304
                   ` (2 preceding siblings ...)
  2020-03-23  5:52 ` [PATCH v4 3/8] mm/workingset: extend the workingset detection for anon LRU js1304
@ 2020-03-23  5:52 ` js1304
  2020-03-23 16:50   ` Johannes Weiner
  2020-03-23  5:52 ` [PATCH v4 5/8] mm/workingset: handle the page without memcg js1304
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: js1304 @ 2020-03-23  5:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

The swap cache doesn't handle exceptional entries since there has been
no user of them. In a following patch, workingset detection for
anonymous pages will be implemented, and it stores shadow entries as
exceptional entries in the swap cache. So we need to handle exceptional
entries, and this patch implements that.
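
For context, the "exceptional" entries here are XArray value entries
(shadow entries), which can be told apart from real page pointers with
xa_is_value(). The core of the handling added to add_to_swap_cache() is
roughly this fragment:

	void *old = xas_load(&xas);

	if (xa_is_value(old)) {
		/* shadow entry left behind by a previous eviction */
		nrexceptional++;
		if (shadowp)
			*shadowp = old;	/* hand the shadow back to the caller */
	}
	xas_store(&xas, page);		/* the page replaces the shadow entry */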

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/swap.h | 10 ++++++----
 mm/shmem.c           |  3 ++-
 mm/swap_state.c      | 26 ++++++++++++++++++++------
 mm/vmscan.c          |  2 +-
 4 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 954e13e..273de48 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -408,9 +408,11 @@ extern struct address_space *swapper_spaces[];
 extern unsigned long total_swapcache_pages(void);
 extern void show_swap_cache_info(void);
 extern int add_to_swap(struct page *page);
-extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
+extern int add_to_swap_cache(struct page *page, swp_entry_t entry,
+			gfp_t gfp, void **shadowp);
 extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
-extern void __delete_from_swap_cache(struct page *, swp_entry_t entry);
+extern void __delete_from_swap_cache(struct page *page,
+			swp_entry_t entry, void *shadow);
 extern void delete_from_swap_cache(struct page *);
 extern void free_page_and_swap_cache(struct page *);
 extern void free_pages_and_swap_cache(struct page **, int);
@@ -565,13 +567,13 @@ static inline int add_to_swap(struct page *page)
 }
 
 static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
-							gfp_t gfp_mask)
+					gfp_t gfp_mask, void **shadowp)
 {
 	return -1;
 }
 
 static inline void __delete_from_swap_cache(struct page *page,
-							swp_entry_t entry)
+					swp_entry_t entry, void *shadow)
 {
 }
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 8793e8c..c6663ad 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1370,7 +1370,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 		list_add(&info->swaplist, &shmem_swaplist);
 
 	if (add_to_swap_cache(page, swap,
-			__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN) == 0) {
+			__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
+			NULL) == 0) {
 		spin_lock_irq(&info->lock);
 		shmem_recalc_inode(inode);
 		info->swapped++;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8e7ce9a..f06af84 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -111,12 +111,15 @@ void show_swap_cache_info(void)
  * add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
  * but sets SwapCache flag and private instead of mapping and index.
  */
-int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
+int add_to_swap_cache(struct page *page, swp_entry_t entry,
+			gfp_t gfp, void **shadowp)
 {
 	struct address_space *address_space = swap_address_space(entry);
 	pgoff_t idx = swp_offset(entry);
 	XA_STATE_ORDER(xas, &address_space->i_pages, idx, compound_order(page));
 	unsigned long i, nr = compound_nr(page);
+	unsigned long nrexceptional = 0;
+	void *old;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapCache(page), page);
@@ -132,10 +135,17 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
 			goto unlock;
 		for (i = 0; i < nr; i++) {
 			VM_BUG_ON_PAGE(xas.xa_index != idx + i, page);
+			old = xas_load(&xas);
+			if (xa_is_value(old)) {
+				nrexceptional++;
+				if (shadowp)
+					*shadowp = old;
+			}
 			set_page_private(page + i, entry.val + i);
 			xas_store(&xas, page);
 			xas_next(&xas);
 		}
+		address_space->nrexceptional -= nrexceptional;
 		address_space->nrpages += nr;
 		__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
 		ADD_CACHE_INFO(add_total, nr);
@@ -155,7 +165,8 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
  * This must be called only on pages that have
  * been verified to be in the swap cache.
  */
-void __delete_from_swap_cache(struct page *page, swp_entry_t entry)
+void __delete_from_swap_cache(struct page *page,
+			swp_entry_t entry, void *shadow)
 {
 	struct address_space *address_space = swap_address_space(entry);
 	int i, nr = hpage_nr_pages(page);
@@ -167,12 +178,14 @@ void __delete_from_swap_cache(struct page *page, swp_entry_t entry)
 	VM_BUG_ON_PAGE(PageWriteback(page), page);
 
 	for (i = 0; i < nr; i++) {
-		void *entry = xas_store(&xas, NULL);
+		void *entry = xas_store(&xas, shadow);
 		VM_BUG_ON_PAGE(entry != page, entry);
 		set_page_private(page + i, 0);
 		xas_next(&xas);
 	}
 	ClearPageSwapCache(page);
+	if (shadow)
+		address_space->nrexceptional += nr;
 	address_space->nrpages -= nr;
 	__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
 	ADD_CACHE_INFO(del_total, nr);
@@ -209,7 +222,7 @@ int add_to_swap(struct page *page)
 	 * Add it to the swap cache.
 	 */
 	err = add_to_swap_cache(page, entry,
-			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
+			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN, NULL);
 	if (err)
 		/*
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
@@ -247,7 +260,7 @@ void delete_from_swap_cache(struct page *page)
 	struct address_space *address_space = swap_address_space(entry);
 
 	xa_lock_irq(&address_space->i_pages);
-	__delete_from_swap_cache(page, entry);
+	__delete_from_swap_cache(page, entry, NULL);
 	xa_unlock_irq(&address_space->i_pages);
 
 	put_swap_page(page, entry);
@@ -418,7 +431,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		/* May fail (-ENOMEM) if XArray node allocation failed. */
 		__SetPageLocked(new_page);
 		__SetPageSwapBacked(new_page);
-		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
+		err = add_to_swap_cache(new_page, entry,
+				gfp_mask & GFP_KERNEL, NULL);
 		if (likely(!err)) {
 			/* Initiate read into locked page */
 			SetPageWorkingset(new_page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0493c25..9871861 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -909,7 +909,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
 		mem_cgroup_swapout(page, swap);
-		__delete_from_swap_cache(page, swap);
+		__delete_from_swap_cache(page, swap, NULL);
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
 		put_swap_page(page, swap);
 	} else {
-- 
2.7.4



* [PATCH v4 5/8] mm/workingset: handle the page without memcg
  2020-03-23  5:52 [PATCH v4 0/8] workingset protection/detection on the anonymous LRU list js1304
                   ` (3 preceding siblings ...)
  2020-03-23  5:52 ` [PATCH v4 4/8] mm/swapcache: support to handle the exceptional entries in swapcache js1304
@ 2020-03-23  5:52 ` js1304
  2020-03-23 16:51   ` Johannes Weiner
  2020-03-23  5:52 ` [PATCH v4 6/8] mm/swap: implement workingset detection for anonymous LRU js1304
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: js1304 @ 2020-03-23  5:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

When implementing workingset detection for anonymous pages, I found
some swapcache pages with a NULL memcg. They are brought in by swap
readahead and nobody has touched them since.

The idea behind the workingset code is to tell on page fault time
whether pages have been previously used or not. Since this page
hasn't been used, don't store a shadow entry for it; when it later
faults back in, we treat it as the new page that it is.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/workingset.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/workingset.c b/mm/workingset.c
index 59415e0..8b192e8 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -257,6 +257,19 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
+	/*
+	 * A page can be without a cgroup here when it was brought in by
+	 * swap readahead and nobody has touched it since.
+	 *
+	 * The idea behind the workingset code is to tell on page fault
+	 * time whether pages have been previously used or not. Since
+	 * this page hasn't been used, don't store a shadow entry for it;
+	 * when it later faults back in, we treat it as the new page
+	 * that it is.
+	 */
+	if (!page_memcg(page))
+		return NULL;
+
 	advance_inactive_age(page_memcg(page), pgdat, file);
 
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
-- 
2.7.4



* [PATCH v4 6/8] mm/swap: implement workingset detection for anonymous LRU
  2020-03-23  5:52 [PATCH v4 0/8] workingset protection/detection on the anonymous LRU list js1304
                   ` (4 preceding siblings ...)
  2020-03-23  5:52 ` [PATCH v4 5/8] mm/workingset: handle the page without memcg js1304
@ 2020-03-23  5:52 ` js1304
  2020-03-23 17:17   ` Johannes Weiner
  2020-03-23  5:52 ` [PATCH v4 7/8] mm/vmscan: restore active/inactive ratio " js1304
  2020-03-23  5:52 ` [PATCH v4 8/8] mm/swap: count a new anonymous page as a reclaim_state's rotate js1304
  7 siblings, 1 reply; 18+ messages in thread
From: js1304 @ 2020-03-23  5:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

This patch implements workingset detection for the anonymous LRU.
All the infrastructure is implemented by the previous patches, so this
patch just activates workingset detection by installing and retrieving
the shadow entries.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/swap.h |  6 ++++++
 mm/memory.c          |  7 ++++++-
 mm/swap_state.c      | 20 ++++++++++++++++++--
 mm/vmscan.c          |  7 +++++--
 4 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 273de48..fb4772e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -408,6 +408,7 @@ extern struct address_space *swapper_spaces[];
 extern unsigned long total_swapcache_pages(void);
 extern void show_swap_cache_info(void);
 extern int add_to_swap(struct page *page);
+extern void *get_shadow_from_swap_cache(swp_entry_t entry);
 extern int add_to_swap_cache(struct page *page, swp_entry_t entry,
 			gfp_t gfp, void **shadowp);
 extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
@@ -566,6 +567,11 @@ static inline int add_to_swap(struct page *page)
 	return 0;
 }
 
+static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
+{
+	return NULL;
+}
+
 static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
 					gfp_t gfp_mask, void **shadowp)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 5f7813a..91a2097 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2925,10 +2925,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
 							vmf->address);
 			if (page) {
+				void *shadow;
+
 				__SetPageLocked(page);
 				__SetPageSwapBacked(page);
 				set_page_private(page, entry.val);
-				lru_cache_add_anon(page);
+				shadow = get_shadow_from_swap_cache(entry);
+				if (shadow)
+					workingset_refault(page, shadow);
+				lru_cache_add(page);
 				swap_readpage(page, true);
 			}
 		} else {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index f06af84..f996455 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -107,6 +107,18 @@ void show_swap_cache_info(void)
 	printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10));
 }
 
+void *get_shadow_from_swap_cache(swp_entry_t entry)
+{
+	struct address_space *address_space = swap_address_space(entry);
+	pgoff_t idx = swp_offset(entry);
+	struct page *page;
+
+	page = find_get_entry(address_space, idx);
+	if (xa_is_value(page))
+		return page;
+	return NULL;
+}
+
 /*
  * add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
  * but sets SwapCache flag and private instead of mapping and index.
@@ -376,6 +388,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	struct page *found_page = NULL, *new_page = NULL;
 	struct swap_info_struct *si;
 	int err;
+	void *shadow;
 	*new_page_allocated = false;
 
 	do {
@@ -431,12 +444,15 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		/* May fail (-ENOMEM) if XArray node allocation failed. */
 		__SetPageLocked(new_page);
 		__SetPageSwapBacked(new_page);
+		shadow = NULL;
 		err = add_to_swap_cache(new_page, entry,
-				gfp_mask & GFP_KERNEL, NULL);
+				gfp_mask & GFP_KERNEL, &shadow);
 		if (likely(!err)) {
 			/* Initiate read into locked page */
 			SetPageWorkingset(new_page);
-			lru_cache_add_anon(new_page);
+			if (shadow)
+				workingset_refault(new_page, shadow);
+			lru_cache_add(new_page);
 			*new_page_allocated = true;
 			return new_page;
 		}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9871861..b37cc26 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -867,6 +867,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 {
 	unsigned long flags;
 	int refcount;
+	void *shadow = NULL;
 
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
@@ -909,12 +910,13 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
 		mem_cgroup_swapout(page, swap);
-		__delete_from_swap_cache(page, swap, NULL);
+		if (reclaimed && !mapping_exiting(mapping))
+			shadow = workingset_eviction(page, target_memcg);
+		__delete_from_swap_cache(page, swap, shadow);
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
 		put_swap_page(page, swap);
 	} else {
 		void (*freepage)(struct page *);
-		void *shadow = NULL;
 
 		freepage = mapping->a_ops->freepage;
 		/*
@@ -1476,6 +1478,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			SetPageActive(page);
 			stat->nr_activate[type] += nr_pages;
 			count_memcg_page_event(page, PGACTIVATE);
+			workingset_activation(page);
 		}
 keep_locked:
 		unlock_page(page);
-- 
2.7.4



* [PATCH v4 7/8] mm/vmscan: restore active/inactive ratio for anonymous LRU
  2020-03-23  5:52 [PATCH v4 0/8] workingset protection/detection on the anonymous LRU list js1304
                   ` (5 preceding siblings ...)
  2020-03-23  5:52 ` [PATCH v4 6/8] mm/swap: implement workingset detection for anonymous LRU js1304
@ 2020-03-23  5:52 ` js1304
  2020-03-23 17:18   ` Johannes Weiner
  2020-03-23  5:52 ` [PATCH v4 8/8] mm/swap: count a new anonymous page as a reclaim_state's rotate js1304
  7 siblings, 1 reply; 18+ messages in thread
From: js1304 @ 2020-03-23  5:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Now that workingset detection is implemented for the anonymous LRU,
we no longer have to worry about workingset misdetection caused by the
active/inactive ratio. Let's restore the ratio.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b37cc26..3d44e32 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2218,7 +2218,7 @@ static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
 	active = lruvec_page_state(lruvec, NR_LRU_BASE + active_lru);
 
 	gb = (inactive + active) >> (30 - PAGE_SHIFT);
-	if (gb && is_file_lru(inactive_lru))
+	if (gb)
 		inactive_ratio = int_sqrt(10 * gb);
 	else
 		inactive_ratio = 1;
-- 
2.7.4



* [PATCH v4 8/8] mm/swap: count a new anonymous page as a reclaim_state's rotate
  2020-03-23  5:52 [PATCH v4 0/8] workingset protection/detection on the anonymous LRU list js1304
                   ` (6 preceding siblings ...)
  2020-03-23  5:52 ` [PATCH v4 7/8] mm/vmscan: restore active/inactive ratio " js1304
@ 2020-03-23  5:52 ` js1304
  2020-03-23 17:42   ` Johannes Weiner
  7 siblings, 1 reply; 18+ messages in thread
From: js1304 @ 2020-03-23  5:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

reclaim_stat's rotate is used for controlling the scan ratio between the
file and anonymous LRUs. Before this series, all new anonymous pages
were counted as rotated, since they were added to the active LRU, and
that protected anonymous pages by making reclaim on the anonymous LRU
happen less often than on the file LRU.

Now the situation has changed: new anonymous pages are no longer added
to the active LRU, so rotate is far lower than before. This causes
reclaim on the anonymous LRU to happen more often, which can have a bad
effect on systems that are tuned for the previous behaviour.

Therefore, this patch counts a new anonymous page toward reclaim_stat's
rotate. Although adding this count to rotate is not strictly logical in
the current algorithm, reducing the regression is more important.
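
The mechanism is a bit subtle, so here is a condensed sketch of the flag
passing the patch below implements: the page is not on any LRU list yet,
so the low bits of page->lru.next are unused and bit 1 (value 2) can
carry a "new anonymous page" marker from the fault path to
__pagevec_lru_add_fn(), where it is counted as a rotation:

	/* producer: lru_cache_add_inactive_or_unevictable() */
	struct list_head *entry = &page->lru;
	unsigned long next = (unsigned long)entry->next;

	if (PageSwapBacked(page) && !unevictable)
		entry->next = (struct list_head *)(next | 2);

	/* consumer: pagevec_lru_move_fn(), before __pagevec_lru_add_fn() */
	rotate = (unsigned long)entry->next & 2;
	entry->next = (struct list_head *)((unsigned long)entry->next & ~2UL);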

I found this regression in a kernel-build test; it is roughly a 2~5%
performance degradation. With this workaround, the performance is
completely restored.

v2: fix a bug that reused the rotate value of the previous page

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/swap.c | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/mm/swap.c b/mm/swap.c
index 442d27e..1f19301 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -187,6 +187,9 @@ int get_kernel_page(unsigned long start, int write, struct page **pages)
 }
 EXPORT_SYMBOL_GPL(get_kernel_page);
 
+static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
+				 void *arg);
+
 static void pagevec_lru_move_fn(struct pagevec *pvec,
 	void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg),
 	void *arg)
@@ -199,6 +202,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
 		struct pglist_data *pagepgdat = page_pgdat(page);
+		void *arg_orig = arg;
 
 		if (pagepgdat != pgdat) {
 			if (pgdat)
@@ -207,8 +211,22 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 			spin_lock_irqsave(&pgdat->lru_lock, flags);
 		}
 
+		if (move_fn == __pagevec_lru_add_fn) {
+			struct list_head *entry = &page->lru;
+			unsigned long next = (unsigned long)entry->next;
+			unsigned long rotate = next & 2;
+
+			if (rotate) {
+				VM_BUG_ON(arg);
+
+				next = next & ~2;
+				entry->next = (struct list_head *)next;
+				arg = (void *)rotate;
+			}
+		}
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		(*move_fn)(page, lruvec, arg);
+		arg = arg_orig;
 	}
 	if (pgdat)
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
@@ -475,6 +493,14 @@ void lru_cache_add_inactive_or_unevictable(struct page *page,
 				    hpage_nr_pages(page));
 		count_vm_event(UNEVICTABLE_PGMLOCKED);
 	}
+
+	if (PageSwapBacked(page) && !unevictable) {
+		struct list_head *entry = &page->lru;
+		unsigned long next = (unsigned long)entry->next;
+
+		next = next | 2;
+		entry->next = (struct list_head *)next;
+	}
 	lru_cache_add(page);
 }
 
@@ -927,6 +953,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 {
 	enum lru_list lru;
 	int was_unevictable = TestClearPageUnevictable(page);
+	unsigned long rotate = (unsigned long)arg;
 
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
@@ -962,7 +989,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 	if (page_evictable(page)) {
 		lru = page_lru(page);
 		update_page_reclaim_stat(lruvec, page_is_file_cache(page),
-					 PageActive(page));
+					 PageActive(page) | rotate);
 		if (was_unevictable)
 			count_vm_event(UNEVICTABLE_PGRESCUED);
 	} else {
-- 
2.7.4



* Re: [PATCH v4 4/8] mm/swapcache: support to handle the exceptional entries in swapcache
  2020-03-23  5:52 ` [PATCH v4 4/8] mm/swapcache: support to handle the exceptional entries in swapcache js1304
@ 2020-03-23 16:50   ` Johannes Weiner
  0 siblings, 0 replies; 18+ messages in thread
From: Johannes Weiner @ 2020-03-23 16:50 UTC (permalink / raw)
  To: js1304
  Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

On Mon, Mar 23, 2020 at 02:52:08PM +0900, js1304@gmail.com wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> The swapcache doesn't handle exceptional entries since nothing uses them
> there yet. In the following patch, workingset detection for anonymous
> pages will be implemented, and it stores the shadow entries as exceptional
> entries in the swapcache. So we need to handle exceptional entries, and
> this patch implements that.
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 5/8] mm/workingset: handle the page without memcg
  2020-03-23  5:52 ` [PATCH v4 5/8] mm/workingset: handle the page without memcg js1304
@ 2020-03-23 16:51   ` Johannes Weiner
  0 siblings, 0 replies; 18+ messages in thread
From: Johannes Weiner @ 2020-03-23 16:51 UTC (permalink / raw)
  To: js1304
  Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

On Mon, Mar 23, 2020 at 02:52:09PM +0900, js1304@gmail.com wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> While implementing workingset detection for anonymous pages, I found
> some swapcache pages with a NULL memcg. They are brought in by swap
> readahead and nobody has touched them.
> 
> The idea behind the workingset code is to tell on page fault time
> whether pages have been previously used or not. Since this page
> hasn't been used, don't store a shadow entry for it; when it later
> faults back in, we treat it as the new page that it is.
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 6/8] mm/swap: implement workingset detection for anonymous LRU
  2020-03-23  5:52 ` [PATCH v4 6/8] mm/swap: implement workingset detection for anonymous LRU js1304
@ 2020-03-23 17:17   ` Johannes Weiner
  2020-03-24  6:25     ` Joonsoo Kim
  0 siblings, 1 reply; 18+ messages in thread
From: Johannes Weiner @ 2020-03-23 17:17 UTC (permalink / raw)
  To: js1304
  Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

On Mon, Mar 23, 2020 at 02:52:10PM +0900, js1304@gmail.com wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> This patch implements workingset detection for anonymous LRU.
> All the infrastructure is implemented by the previous patches so this patch
> just activates the workingset detection by installing/retrieving
> the shadow entry.
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>  include/linux/swap.h |  6 ++++++
>  mm/memory.c          |  7 ++++++-
>  mm/swap_state.c      | 20 ++++++++++++++++++--
>  mm/vmscan.c          |  7 +++++--
>  4 files changed, 35 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 273de48..fb4772e 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -408,6 +408,7 @@ extern struct address_space *swapper_spaces[];
>  extern unsigned long total_swapcache_pages(void);
>  extern void show_swap_cache_info(void);
>  extern int add_to_swap(struct page *page);
> +extern void *get_shadow_from_swap_cache(swp_entry_t entry);
>  extern int add_to_swap_cache(struct page *page, swp_entry_t entry,
>  			gfp_t gfp, void **shadowp);
>  extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
> @@ -566,6 +567,11 @@ static inline int add_to_swap(struct page *page)
>  	return 0;
>  }
>  
> +static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
> +{
> +	return NULL;
> +}
> +
>  static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
>  					gfp_t gfp_mask, void **shadowp)
>  {
> diff --git a/mm/memory.c b/mm/memory.c
> index 5f7813a..91a2097 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2925,10 +2925,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
>  							vmf->address);
>  			if (page) {
> +				void *shadow;
> +
>  				__SetPageLocked(page);
>  				__SetPageSwapBacked(page);
>  				set_page_private(page, entry.val);
> -				lru_cache_add_anon(page);
> +				shadow = get_shadow_from_swap_cache(entry);
> +				if (shadow)
> +					workingset_refault(page, shadow);

Hm, this is calling workingset_refault() on a page that isn't charged
to a cgroup yet. That means the refault stats and inactive age counter
will be bumped incorrectly in the root cgroup instead of the real one.

> +				lru_cache_add(page);
>  				swap_readpage(page, true);
>  			}
>  		} else {

You need to look up and remember the shadow entry at the top and call
workingset_refault() after mem_cgroup_commit_charge() has run.

It'd be nice if we could do the shadow lookup for everybody in
lookup_swap_cache(), but that's subject to race conditions if multiple
faults on the same swap page happen in multiple vmas concurrently. The
swapcache bypass scenario is only safe because it checks that there is
a single pte under the mmap sem to prevent forking. So it looks like
you have to bubble up the shadow entry through swapin_readahead().
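
A minimal sketch of that ordering for the swapcache-bypass path
(illustrative only, not the actual follow-up patch: flag arguments are
simplified, error handling is elided, and get_shadow_from_swap_cache()
is the helper introduced earlier in this series):

	void *shadow = get_shadow_from_swap_cache(entry);	/* look up early */

	__SetPageLocked(page);
	__SetPageSwapBacked(page);
	set_page_private(page, entry.val);

	/* charge first, so the memcg is known ... */
	if (mem_cgroup_try_charge_delay(page, vma->vm_mm, GFP_KERNEL,
					&memcg, false))
		goto out_page;
	mem_cgroup_commit_charge(page, memcg, false, false);

	/* ... and only then account the refault against that memcg */
	if (shadow)
		workingset_refault(page, shadow);

	lru_cache_add(page);
	swap_readpage(page, true);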

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 7/8] mm/vmscan: restore active/inactive ratio for anonymous LRU
  2020-03-23  5:52 ` [PATCH v4 7/8] mm/vmscan: restore active/inactive ratio " js1304
@ 2020-03-23 17:18   ` Johannes Weiner
  0 siblings, 0 replies; 18+ messages in thread
From: Johannes Weiner @ 2020-03-23 17:18 UTC (permalink / raw)
  To: js1304
  Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

On Mon, Mar 23, 2020 at 02:52:11PM +0900, js1304@gmail.com wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> Now that workingset detection is implemented for the anonymous LRU,
> we no longer have to worry about the workingset being missed because of
> the active/inactive ratio. Let's restore the ratio.
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 8/8] mm/swap: count a new anonymous page as a reclaim_state's rotate
  2020-03-23  5:52 ` [PATCH v4 8/8] mm/swap: count a new anonymous page as a reclaim_state's rotate js1304
@ 2020-03-23 17:42   ` Johannes Weiner
  2020-03-24  6:28     ` Joonsoo Kim
  0 siblings, 1 reply; 18+ messages in thread
From: Johannes Weiner @ 2020-03-23 17:42 UTC (permalink / raw)
  To: js1304
  Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

On Mon, Mar 23, 2020 at 02:52:12PM +0900, js1304@gmail.com wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> reclaim_stat's rotate is used to control the ratio of scanning between
> the file and anonymous LRUs. Before this change, all new anonymous pages
> were counted for rotate, protecting the anonymous pages on the active
> LRU, and this made reclaim happen less often on the anonymous LRU than
> on the file LRU.
>
> Now the situation has changed: new anonymous pages are no longer added
> to the active LRU, so rotate is far lower than before. This causes more
> reclaim on the anonymous LRU, which can hurt systems tuned for the
> previous behavior.
>
> Therefore, this patch counts a new anonymous page toward the
> reclaim_stat's rotate. Although adding this count to rotate is not
> strictly logical in the current algorithm, reducing the regression is
> more important.
>
> I found this regression in a kernel-build test, where it causes roughly
> 2~5% performance degradation. With this workaround, performance is
> completely restored.
> 
> v2: fix a bug that reuses the rotate value for previous page

I agree with the rationale, but the magic bit in the page->lru list
pointers seems pretty ugly.

I wrote a patch a few years ago that split lru_add_pvecs into an add
and a putback component. This was to avoid unintentional balancing
effects of LRU isolations, but I think you can benefit from that
cleanup here as well. Would you mind taking a look at it and maybe
take it up into your series?

https://lore.kernel.org/patchwork/patch/685708/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 6/8] mm/swap: implement workingset detection for anonymous LRU
  2020-03-23 17:17   ` Johannes Weiner
@ 2020-03-24  6:25     ` Joonsoo Kim
  2020-04-02  5:50       ` Joonsoo Kim
  0 siblings, 1 reply; 18+ messages in thread
From: Joonsoo Kim @ 2020-03-24  6:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linux Memory Management List, LKML, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

On Tue, Mar 24, 2020 at 2:17 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Mar 23, 2020 at 02:52:10PM +0900, js1304@gmail.com wrote:
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >
> > This patch implements workingset detection for anonymous LRU.
> > All the infrastructure is implemented by the previous patches so this patch
> > just activates the workingset detection by installing/retrieving
> > the shadow entry.
> >
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > ---
> >  include/linux/swap.h |  6 ++++++
> >  mm/memory.c          |  7 ++++++-
> >  mm/swap_state.c      | 20 ++++++++++++++++++--
> >  mm/vmscan.c          |  7 +++++--
> >  4 files changed, 35 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 273de48..fb4772e 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -408,6 +408,7 @@ extern struct address_space *swapper_spaces[];
> >  extern unsigned long total_swapcache_pages(void);
> >  extern void show_swap_cache_info(void);
> >  extern int add_to_swap(struct page *page);
> > +extern void *get_shadow_from_swap_cache(swp_entry_t entry);
> >  extern int add_to_swap_cache(struct page *page, swp_entry_t entry,
> >                       gfp_t gfp, void **shadowp);
> >  extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
> > @@ -566,6 +567,11 @@ static inline int add_to_swap(struct page *page)
> >       return 0;
> >  }
> >
> > +static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
> > +{
> > +     return NULL;
> > +}
> > +
> >  static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
> >                                       gfp_t gfp_mask, void **shadowp)
> >  {
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 5f7813a..91a2097 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2925,10 +2925,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >                       page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
> >                                                       vmf->address);
> >                       if (page) {
> > +                             void *shadow;
> > +
> >                               __SetPageLocked(page);
> >                               __SetPageSwapBacked(page);
> >                               set_page_private(page, entry.val);
> > -                             lru_cache_add_anon(page);
> > +                             shadow = get_shadow_from_swap_cache(entry);
> > +                             if (shadow)
> > +                                     workingset_refault(page, shadow);
>
> Hm, this is calling workingset_refault() on a page that isn't charged
> to a cgroup yet. That means the refault stats and inactive age counter
> will be bumped incorrectly in the root cgroup instead of the real one.

Okay.

> > +                             lru_cache_add(page);
> >                               swap_readpage(page, true);
> >                       }
> >               } else {
>
> You need to look up and remember the shadow entry at the top and call
> workingset_refault() after mem_cgroup_commit_charge() has run.

Okay. I will call workingset_refault() after charging.

I completely missed that workingset_refault() should be called after charging.
workingset_refault() in __read_swap_cache_async() also has the same problem.

> It'd be nice if we could do the shadow lookup for everybody in
> lookup_swap_cache(), but that's subject to race conditions if multiple
> faults on the same swap page happen in multiple vmas concurrently. The
> swapcache bypass scenario is only safe because it checks that there is
> a single pte under the mmap sem to prevent forking. So it looks like
> you have to bubble up the shadow entry through swapin_readahead().

The problem doesn't look that easy. Hmm...

In the current code, there is a large time gap between when the shadow
entries are popped and when the page is charged to the memcg, especially
for readahead pages. We cannot keep the shadow entries of the readahead
pages around until the pages are charged.

My plan to solve this is to propagate the charging mm to
__read_swap_cache_async(), as the file cache does, charge the page when
it is added to the swap cache, and call workingset_refault() there.
Charging will only occur for:

1. the faulted page
2. a readahead page whose shadow entry belongs to the same memcg

Also, readahead only happens when the shadow entry's memcg matches the
charging memcg. If it doesn't match, the page is most likely not ours,
so readahead isn't needed.
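
A rough sketch of that flow inside __read_swap_cache_async() (idea only,
not posted code: the charge_mm argument, the readahead flag, the error
labels and the shadow_memcg_matches() helper are hypothetical names, and
error handling is omitted; the 4-argument add_to_swap_cache() is the one
from patch 4/8):

	void *shadow = NULL;
	struct mem_cgroup *memcg;

	/* a readahead page is only brought in and charged for "our" memcg */
	if (readahead && !shadow_memcg_matches(entry, charge_mm))
		return NULL;

	if (add_to_swap_cache(page, entry, gfp_mask, &shadow))
		goto fail_free_page;

	/* charge where the page enters the swap cache, as the page cache does */
	if (mem_cgroup_try_charge(page, charge_mm, gfp_mask, &memcg, false))
		goto fail_delete;
	mem_cgroup_commit_charge(page, memcg, false, false);

	/* the refault is now accounted after the charge, in the right memcg */
	if (shadow)
		workingset_refault(page, shadow);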

Please let me know what you think of the feasibility of this idea.

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 8/8] mm/swap: count a new anonymous page as a reclaim_state's rotate
  2020-03-23 17:42   ` Johannes Weiner
@ 2020-03-24  6:28     ` Joonsoo Kim
  0 siblings, 0 replies; 18+ messages in thread
From: Joonsoo Kim @ 2020-03-24  6:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linux Memory Management List, LKML, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

On Tue, Mar 24, 2020 at 2:43 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Mar 23, 2020 at 02:52:12PM +0900, js1304@gmail.com wrote:
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >
> > reclaim_stat's rotate is used to control the ratio of scanning between
> > the file and anonymous LRUs. Before this change, all new anonymous pages
> > were counted for rotate, protecting the anonymous pages on the active
> > LRU, and this made reclaim happen less often on the anonymous LRU than
> > on the file LRU.
> >
> > Now the situation has changed: new anonymous pages are no longer added
> > to the active LRU, so rotate is far lower than before. This causes more
> > reclaim on the anonymous LRU, which can hurt systems tuned for the
> > previous behavior.
> >
> > Therefore, this patch counts a new anonymous page toward the
> > reclaim_stat's rotate. Although adding this count to rotate is not
> > strictly logical in the current algorithm, reducing the regression is
> > more important.
> >
> > I found this regression in a kernel-build test, where it causes roughly
> > 2~5% performance degradation. With this workaround, performance is
> > completely restored.
> >
> > v2: fix a bug that reuses the rotate value for previous page
>
> I agree with the rationale, but the magic bit in the page->lru list
> pointers seems pretty ugly.

Yes, pretty ugly.

> I wrote a patch a few years ago that split lru_add_pvecs into an add
> and a putback component. This was to avoid unintentional balancing
> effects of LRU isolations, but I think you can benefit from that
> cleanup here as well. Would you mind taking a look at it and maybe
> take it up into your series?
>
> https://lore.kernel.org/patchwork/patch/685708/

Thanks for pointing that out.
I will remove the magic bit and follow your patch's approach to solve
the problem this patch works around.

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 6/8] mm/swap: implement workingset detection for anonymous LRU
  2020-03-24  6:25     ` Joonsoo Kim
@ 2020-04-02  5:50       ` Joonsoo Kim
  2020-04-02 15:14         ` Johannes Weiner
  0 siblings, 1 reply; 18+ messages in thread
From: Joonsoo Kim @ 2020-04-02  5:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linux Memory Management List, LKML, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

On Tue, Mar 24, 2020 at 3:25 PM Joonsoo Kim <js1304@gmail.com> wrote:
>
> On Tue, Mar 24, 2020 at 2:17 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Mon, Mar 23, 2020 at 02:52:10PM +0900, js1304@gmail.com wrote:
> > > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > >
> > > This patch implements workingset detection for anonymous LRU.
> > > All the infrastructure is implemented by the previous patches so this patch
> > > just activates the workingset detection by installing/retrieving
> > > the shadow entry.
> > >
> > > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > ---
> > >  include/linux/swap.h |  6 ++++++
> > >  mm/memory.c          |  7 ++++++-
> > >  mm/swap_state.c      | 20 ++++++++++++++++++--
> > >  mm/vmscan.c          |  7 +++++--
> > >  4 files changed, 35 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index 273de48..fb4772e 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -408,6 +408,7 @@ extern struct address_space *swapper_spaces[];
> > >  extern unsigned long total_swapcache_pages(void);
> > >  extern void show_swap_cache_info(void);
> > >  extern int add_to_swap(struct page *page);
> > > +extern void *get_shadow_from_swap_cache(swp_entry_t entry);
> > >  extern int add_to_swap_cache(struct page *page, swp_entry_t entry,
> > >                       gfp_t gfp, void **shadowp);
> > >  extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
> > > @@ -566,6 +567,11 @@ static inline int add_to_swap(struct page *page)
> > >       return 0;
> > >  }
> > >
> > > +static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
> > > +{
> > > +     return NULL;
> > > +}
> > > +
> > >  static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
> > >                                       gfp_t gfp_mask, void **shadowp)
> > >  {
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 5f7813a..91a2097 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -2925,10 +2925,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >                       page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
> > >                                                       vmf->address);
> > >                       if (page) {
> > > +                             void *shadow;
> > > +
> > >                               __SetPageLocked(page);
> > >                               __SetPageSwapBacked(page);
> > >                               set_page_private(page, entry.val);
> > > -                             lru_cache_add_anon(page);
> > > +                             shadow = get_shadow_from_swap_cache(entry);
> > > +                             if (shadow)
> > > +                                     workingset_refault(page, shadow);
> >
> > Hm, this is calling workingset_refault() on a page that isn't charged
> > to a cgroup yet. That means the refault stats and inactive age counter
> > will be bumped incorrectly in the root cgroup instead of the real one.
>
> Okay.
>
> > > +                             lru_cache_add(page);
> > >                               swap_readpage(page, true);
> > >                       }
> > >               } else {
> >
> > You need to look up and remember the shadow entry at the top and call
> > workingset_refault() after mem_cgroup_commit_charge() has run.
>
> Okay. I will call workingset_refault() after charging.
>
> I completely missed that workingset_refault() should be called after charging.
> workingset_refault() in __read_swap_cache_async() also has the same problem.
>
> > It'd be nice if we could do the shadow lookup for everybody in
> > lookup_swap_cache(), but that's subject to race conditions if multiple
> > faults on the same swap page happen in multiple vmas concurrently. The
> > swapcache bypass scenario is only safe because it checks that there is
> > a single pte under the mmap sem to prevent forking. So it looks like
> > you have to bubble up the shadow entry through swapin_readahead().
>
> The problem doesn't look that easy. Hmm...
>
> In the current code, there is a large time gap between when the shadow
> entries are popped and when the page is charged to the memcg, especially
> for readahead pages. We cannot keep the shadow entries of the readahead
> pages around until the pages are charged.
>
> My plan to solve this is to propagate the charging mm to
> __read_swap_cache_async(), as the file cache does, charge the page when
> it is added to the swap cache, and call workingset_refault() there.
> Charging will only occur for:
>
> 1. the faulted page
> 2. a readahead page whose shadow entry belongs to the same memcg
>
> Also, readahead only happens when the shadow entry's memcg matches the
> charging memcg. If it doesn't match, the page is most likely not ours,
> so readahead isn't needed.
>
> Please let me know what you think of the feasibility of this idea.

Hello, Johannes.

Could you let me know your opinion on the idea above?
In fact, while waiting for your reply, I have completed an implementation
of it. If you want, I can submit it first; then we could discuss the
solution more easily.

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 6/8] mm/swap: implement workingset detection for anonymous LRU
  2020-04-02  5:50       ` Joonsoo Kim
@ 2020-04-02 15:14         ` Johannes Weiner
  0 siblings, 0 replies; 18+ messages in thread
From: Johannes Weiner @ 2020-04-02 15:14 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Linux Memory Management List, LKML, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

On Thu, Apr 02, 2020 at 02:50:28PM +0900, Joonsoo Kim wrote:
> On Tue, Mar 24, 2020 at 3:25 PM Joonsoo Kim <js1304@gmail.com> wrote:
> > The problem doesn't look that easy. Hmm...
> >
> > In the current code, there is a large time gap between when the shadow
> > entries are popped and when the page is charged to the memcg, especially
> > for readahead pages. We cannot keep the shadow entries of the readahead
> > pages around until the pages are charged.
> >
> > My plan to solve this is to propagate the charging mm to
> > __read_swap_cache_async(), as the file cache does, charge the page when
> > it is added to the swap cache, and call workingset_refault() there.
> > Charging will only occur for:
> >
> > 1. the faulted page
> > 2. a readahead page whose shadow entry belongs to the same memcg
> >
> > Also, readahead only happens when the shadow entry's memcg matches the
> > charging memcg. If it doesn't match, the page is most likely not ours,
> > so readahead isn't needed.
> >
> > Please let me know what you think of the feasibility of this idea.
> 
> Hello, Johannes.
> 
> Could you let me know your opinion on the idea above?
> In fact, while waiting for your reply, I have completed an implementation
> of it. If you want, I can submit it first; then we could discuss the
> solution more easily.

It's probably easiest if you send out your implementation and we
discuss it over the code.

Thanks!

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-04-02 15:14 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-23  5:52 [PATCH v4 0/8] workingset protection/detection on the anonymous LRU list js1304
2020-03-23  5:52 ` [PATCH v4 1/8] mm/vmscan: make active/inactive ratio as 1:1 for anon lru js1304
2020-03-23  5:52 ` [PATCH v4 2/8] mm/vmscan: protect the workingset on anonymous LRU js1304
2020-03-23  5:52 ` [PATCH v4 3/8] mm/workingset: extend the workingset detection for anon LRU js1304
2020-03-23  5:52 ` [PATCH v4 4/8] mm/swapcache: support to handle the exceptional entries in swapcache js1304
2020-03-23 16:50   ` Johannes Weiner
2020-03-23  5:52 ` [PATCH v4 5/8] mm/workingset: handle the page without memcg js1304
2020-03-23 16:51   ` Johannes Weiner
2020-03-23  5:52 ` [PATCH v4 6/8] mm/swap: implement workingset detection for anonymous LRU js1304
2020-03-23 17:17   ` Johannes Weiner
2020-03-24  6:25     ` Joonsoo Kim
2020-04-02  5:50       ` Joonsoo Kim
2020-04-02 15:14         ` Johannes Weiner
2020-03-23  5:52 ` [PATCH v4 7/8] mm/vmscan: restore active/inactive ratio " js1304
2020-03-23 17:18   ` Johannes Weiner
2020-03-23  5:52 ` [PATCH v4 8/8] mm/swap: count a new anonymous page as a reclaim_state's rotate js1304
2020-03-23 17:42   ` Johannes Weiner
2020-03-24  6:28     ` Joonsoo Kim
